vignettes/Hoffman_human_bulk_preprocess.Rmd
This vignette exemplifies the preprocessing of bulk RNA-seq data for analysis with DTUrtle. The data used in this vignette is publicly available as BioProject PRJNA594939, and the used FASTQ files can be downloaded from ENA. The corresponding publication from Hoffman et al. is based on this data set.
For this vignette we focus on two groups with 3 biological replicates each:

- EtOH samples: SRR10669441, SRR10669442, SRR10669443
- D2hr samples: SRR10669447, SRR10669448, SRR10669449
The FASTQ files can be directly obtained from ENA; alternatively, they are also available as SRA files from GEO, which can be converted to FASTQ format.
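If you start from the SRA files, the conversion could look like the following sketch, assuming the sra-tools package is installed (`fasterq-dump` writes uncompressed FASTQ files, which we compress afterwards):

```bash
#run inside the download directory - convert the SRA files of all six samples to FASTQ
for acc in SRR10669441 SRR10669442 SRR10669443 SRR10669447 SRR10669448 SRR10669449; do
    fasterq-dump --split-files --threads 4 "$acc"
done

#fasterq-dump does not compress its output
gzip SRR*.fastq
```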
After downloading, an MD5 check is strongly encouraged.
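A minimal sketch of such a check, assuming the expected checksums have been saved to a file called `md5sums.txt` (a hypothetical name; the checksums can e.g. be copied from the ENA file report):

```bash
#compare the downloaded files against the expected checksums
md5sum -c md5sums.txt
```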
For this vignette, it is assumed that the FASTQ files mentioned above have been downloaded to a directory called `samples`.
After downloading the FASTQ files, we first run a basic quality control step to assess if there might be some quality issues or leftover adapter sequences. We will use the tools FastQC and MultiQC for this task with the following bash commands:
```bash
cd 'YOUR_PATH'/samples

#rename files
mv SRR10669441_1.fastq.gz bulk_EtOH_rep1_1.fastq.gz
mv SRR10669441_2.fastq.gz bulk_EtOH_rep1_2.fastq.gz
mv SRR10669442_1.fastq.gz bulk_EtOH_rep2_1.fastq.gz
mv SRR10669442_2.fastq.gz bulk_EtOH_rep2_2.fastq.gz
mv SRR10669443_1.fastq.gz bulk_EtOH_rep3_1.fastq.gz
mv SRR10669443_2.fastq.gz bulk_EtOH_rep3_2.fastq.gz
mv SRR10669447_1.fastq.gz bulk_D2hr_rep1_1.fastq.gz
mv SRR10669447_2.fastq.gz bulk_D2hr_rep1_2.fastq.gz
mv SRR10669448_1.fastq.gz bulk_D2hr_rep2_1.fastq.gz
mv SRR10669448_2.fastq.gz bulk_D2hr_rep2_2.fastq.gz
mv SRR10669449_1.fastq.gz bulk_D2hr_rep3_1.fastq.gz
mv SRR10669449_2.fastq.gz bulk_D2hr_rep3_2.fastq.gz

#quality control - FastQC requires the output directory to exist
mkdir -p fastqc
fastqc -t 4 -o fastqc/ *.fastq.gz

#summarize reports
multiqc -o multiqc ./fastqc/
```
We calculated a FastQC report for every file and summarized the 12 reports with MultiQC. The reports show minor leftovers of an Illumina sequencing adapter in the samples. This makes sense, as the samples were sequenced on an Illumina NovaSeq 6000. Although the adapter content is not dramatic, we will trim the reads to remove these adapters and perform a very relaxed quality filtering. We will use the tool trim-galore (Version 0.6.4_dev) for this task.
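The trimming call could look like the following sketch, assuming trim-galore's default paired-end behavior (all parameters besides `--paired` are assumptions):

```bash
#trim all paired-end samples with trim-galore (a sketch, run from the samples directory)
mkdir -p ../trim
for i in bulk_*_1.fastq.gz; do
    trim_galore --paired --cores 4 -o ../trim/ "$i" "${i/_1.fastq.gz/_2.fastq.gz}"
done
```

With `--paired`, trim-galore writes the validated reads as `*_1_val_1.fq.gz` and `*_2_val_2.fq.gz`; these names are picked up by the quantification loop below.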
The trimmed files are located in the newly created folder `trim/`. Less than 1% of the total base pairs have been removed by the trimming process. Optionally, one can again perform a FastQC and MultiQC analysis to confirm that the adapter sequences have been removed.
The files are now ready for transcript-level quantification. We will use the tool Salmon (Version 1.1.0) for this. Salmon does not perform a standard genomic alignment, but quasi-maps the reads directly to the transcriptome. Reads that could originate from multiple transcript isoforms are assigned to equivalence classes, from which the actual counts are derived by an expectation-maximization algorithm.
To run the Salmon quantification, we first need a transcriptomic index. For this example, we will use the transcript sequences and comprehensive gene annotation from Gencode version 34.
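Both files can be obtained from the Gencode FTP server; the following sketch assumes the standard URL layout of release 34 (please verify the URLs for your setup):

```bash
#download and unpack the Gencode v34 reference files in the analysis root directory
cd 'YOUR_PATH'
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.transcripts.fa.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.annotation.gtf.gz
gunzip gencode.v34.transcripts.fa.gz gencode.v34.annotation.gtf.gz

#change back into a subdirectory, as the following commands use relative paths
cd samples
```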
Assuming you have downloaded and unpacked the above-mentioned files in the analysis root directory, we can build the actual index with this bash code:
```bash
#installation in a conda environment - salmon is distributed via the bioconda channel
conda install -c bioconda salmon

#create folder and change directory
mkdir ../salmon
cd ../salmon

#build index - `-p` specifies the number of threads to use
salmon index -t ../gencode.v34.transcripts.fa --gencode -p 4 -i index
```
The actual quantification can be performed in bash with:
```bash
cd ../trim

#loop over all trimmed read-1 files and quantify each sample pair
for i in bulk_*_1_val_1.fq.gz; do
    salmon quant -l A -i ../salmon/index/ \
        -1 "$i" -2 "${i/_1_val_1/_2_val_2}" \
        --seqBias --gcBias --validateMappings \
        --rangeFactorizationBins 4 --threads 4 \
        -o ../salmon/"${i/_1_val_1.fq.gz/}"
done
```
This concludes the preprocessing of the data. Please see the corresponding analysis vignette for a usage example of DTUrtle.