vignettes/Wuidart_mouse_single-cell_preprocess.Rmd
Wuidart_mouse_single-cell_preprocess.Rmd
This vignette exemplifies the analysis of single-cell RNA-seq data with DTUrtle. The data used in this vignette is publicly available as Bioproject PRJNA433520 and the used FASTQ-files can be downloaded from here. The corresponding publication from Wuidart et al. can be found here.
The FASTQ-files can be directly obtained from ENA, alternatively they are also available as SRA-files from GEO, which can be converted to FASTQ-format.
As this vignette deals with single-cell RNA-seq data generated via the Smart-seq2 protocol, a (standard) sample of this Project represents reads from a single cell. We initially download all 384 available samples/cells - this can be done directly from ENA.
Suppose you have downloaded the Bioproject report table like this to your YOUR_PATH
directory - selecting the columns fastq_ftp
, fastq_md5
, sample_accession
, sample_title
, scientific_name
and study_accession
.
This table contains all needed information to download and MD5-check the files - as well as to create a basic sample meta information table.
Downloading can be done like this:
#YOUR_PATH contains the downloaded Bioproject report table `filereport_read_run_PRJNA433520_tsv.txt`
cd 'YOUR_PATH'
mkdir samples
#create a file of only the ftp-links for download:
sed 1d filereport_read_run_PRJNA433520_tsv.txt | cut -f 5 | sed -e 's/^/ftp:\/\//' > samples/wget_links.txt
#download with wget
cd samples
wget -i wget_links.txt
After downloading a MD5-check is strongly encouraged.
#create a reference table with filenames and expected MD5-sum
sed 1d ../filereport_read_run_PRJNA433520_tsv.txt | cut -f 4,5 | awk 'sub(".*/",$1" ")' > md5_check.txt
#perform the check with md5sum
md5sum -c md5_check.txt
You should only proceed, if all MD5-sums match!
As described in Wuidart et al., the sequences likely contain left-over adapter sequences and an adapter trimming is recommended.
We will trim the reads to remove adapters and perform a very relaxed quality filtering. We will use the tool trim-galore
(Version 0.6.4_dev) for this task.
#create needed folders
mkdir -p ../trim/fastqc/multiqc
trim_galore -fastqc_args "--outdir 'YOUR_PATH'/trim/fastqc/ -t 4" --stringency 3 --trim-n --cores 4 -o ../trim/ *.fastq.gz
The trimmed files are the newly created folder /trim
. By looking at the generated FastQC
reports, we can check how many sequences have been removed by the trimming, and if other QC paramters indicate problems. Optionally we can summarize the hundreds of reports to a single report with MultiQC
:
The files are now ready for transcript level quantification. We will use the tool Salmon
(Version 1.1.0) for this. Although we have single-cell RNA-seq data, samples/cells from the Smart-seq2 protocol can be treated as standard bulk samples (you have quite a lot of them, though). Therefore we can simply use the Salmon
for quantification and do not have to use specialised single-cell variants like Alevin
.
Salmon
does not perform a standard genomic alignment, but performs a quasi-mapping directly to the transcriptome. Reads, that could be originating from multiple transcript isoforms, are assigned to equivalence classes, where actual counts are derived by expectation maximization algorithms.
To run the Salmon
quantification, we first need a transcriptomic index. For this example, we will use the transcript sequences and comprehensive gene annotation from Gencode Version 34.
Additionally, in this experiment ERCC Spike-In sequences have been used in the sequencing. To also be able to quantify those not naturally occuring sequences, we should add those manually to the transcriptome and the annotation.
We can get the ERCC FASTA
-sequences and a corresponing GTF
-file directly from ThermoFisher and combine those files with the Gencode files:
#create folder and change directory
mkdir ../salmon
cd ../salmon
#download gencode
curl -LO ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.transcripts.fa.gz
gunzip gencode.v34.transcripts.fa.gz
curl -LO ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.annotation.gtf.gz
gunzip gencode.v34.annotation.gtf.gz
#download ERCC
curl -LO https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip
unzip ERCC92.zip
#combine references
cat ../tabula_muris/gencode.vM24.transcripts.fa ERCC92.fa > gencode.vM24.transcripts_spike.fa
cat ../tabula_muris/gencode.vM24.annotation.gtf ERCC92.gtf > gencode.vM24.annotation_spike.gtf
#optionally remove original gencode files
rm gencode.v34.transcripts.fa.gz
rm gencode.v34.annotation.gtf.gz
This reference FASTA
including the ERCC-sequences can then be used to build the Salmon
index:
#installation in conda evironment
conda install salmon
#build index - `p` specifies the number of threads to use
salmon index -t ../gencode.vM24.transcripts_spike.fa --gencode -p 4 -i index
The actual quantification can be performed in bash with:
cd ../trim
#loop over files
for i in *.fq.gz; do salmon quant -l A -i ../salmon/index/ -r $i --seqBias --gcBias --validateMappings --rangeFactorizationBins 4 --threads 4 -o ../salmon/${i/_trimmed.fq.gz/}; done
This concludes the preprocessing of the data. Please see the corresponding analysis vignette for an usage example of DTUrtle.