vignettes/Hoffman_human_bulk_preprocess.Rmd
This vignette demonstrates the pre-processing of bulk RNA-seq data for analysis with DTUrtle. The data used in this vignette is publicly available as BioProject PRJNA594939, and the used FASTQ-files can be downloaded from here. The corresponding publication from Hoffman et al. based on this data set can be found here.
For this vignette we focus on two groups with 3 biological replicates each: samples treated with EtOH (the vehicle control group) and samples treated with dexamethasone for 2 hours (D2hr).
The FASTQ-files can be directly obtained from ENA; alternatively, they are also available as SRA-files from GEO, which can be converted to FASTQ-format.
After downloading, an MD5 check is strongly encouraged.
For this vignette, it is assumed that the FASTQ-files mentioned above have been downloaded to a directory called samples.
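As a reference, here is a minimal sketch of obtaining and verifying one of the runs with sra-tools; the run accessions are those of the BioProject, while checksums.md5 is a hypothetical file you would assemble from the MD5 sums listed by ENA:
#download one SRA archive and convert it to paired FASTQ-files (sra-tools assumed to be installed)
prefetch SRR10669441
fasterq-dump SRR10669441 --split-files -O samples/
#fasterq-dump writes uncompressed FASTQ-files, so compress them afterwards
gzip samples/SRR10669441_*.fastq
#verify against the published MD5 sums (checksums.md5 is a hypothetical, user-assembled file)
md5sum -c checksums.md5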
After downloading the FASTQ-files, we first run a basic quality control step to assess whether there are quality issues or leftover adapter sequences. We will use the tools FastQC and MultiQC for this task with the following bash commands:
cd 'YOUR_PATH'/samples
#rename files
mv SRR10669441_1.fastq.gz bulk_EtOH_rep1_1.fastq.gz
mv SRR10669441_2.fastq.gz bulk_EtOH_rep1_2.fastq.gz
mv SRR10669442_1.fastq.gz bulk_EtOH_rep2_1.fastq.gz
mv SRR10669442_2.fastq.gz bulk_EtOH_rep2_2.fastq.gz
mv SRR10669443_1.fastq.gz bulk_EtOH_rep3_1.fastq.gz
mv SRR10669443_2.fastq.gz bulk_EtOH_rep3_2.fastq.gz
mv SRR10669447_1.fastq.gz bulk_D2hr_rep1_1.fastq.gz
mv SRR10669447_2.fastq.gz bulk_D2hr_rep1_2.fastq.gz
mv SRR10669448_1.fastq.gz bulk_D2hr_rep2_1.fastq.gz
mv SRR10669448_2.fastq.gz bulk_D2hr_rep2_2.fastq.gz
mv SRR10669449_1.fastq.gz bulk_D2hr_rep3_1.fastq.gz
mv SRR10669449_2.fastq.gz bulk_D2hr_rep3_2.fastq.gz
#Quality control - FastQC requires the output directory to exist
mkdir fastqc
fastqc -t 4 -o fastqc/ *.fastq.gz
#summarize reports
multiqc -o multiqc ./fastqc/
We calculated a FastQC report for every file and summarized the 12 reports with MultiQC. The reports show minor leftovers of an Illumina sequencing adapter in the samples. This is expected, as the samples were sequenced on an Illumina NovaSeq 6000. Although the adapter content is not dramatic, we will trim the reads to remove these adapters and perform a very relaxed quality filtering. We will use the tool trim_galore (Version 0.6.4_dev) for this task.
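The exact trimming parameters are not fixed by this vignette; a minimal sketch of the paired-end trimming step, assuming the renamed files from above and trim_galore's default Phred quality cutoff of 20, could look like this:
#create the output folder for the trimmed files
mkdir ../trim
#trim each read pair; --paired keeps mates in sync and writes *_val_1.fq.gz / *_val_2.fq.gz files
for i in bulk_*_1.fastq.gz; do trim_galore --paired --cores 4 -o ../trim $i ${i/_1.fastq.gz/_2.fastq.gz}; done
The quantification loop below relies on the *_val_1.fq.gz / *_val_2.fq.gz naming scheme of this output.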
The trimmed files are placed in the newly created folder /trim. Less than 1% of the total base pairs were removed by the trimming process. Optionally, one can rerun FastQC and MultiQC to confirm that the adapter sequences have been removed.
The files are now ready for transcript level quantification. We will use the tool Salmon (Version 1.1.0) for this.
Salmon does not perform a standard genomic alignment, but quasi-maps the reads directly to the transcriptome. Reads that could originate from multiple transcript isoforms are assigned to equivalence classes, from which the actual counts are derived by an expectation-maximization algorithm.
To run the Salmon quantification, we first need a transcriptomic index. For this example, we will use the transcript sequences and comprehensive gene annotation from Gencode Version 34.
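At the time of writing, both files can be obtained from the GENCODE FTP server; a sketch of the download step (the URLs below are assumptions and may change between releases):
#download transcript sequences and comprehensive gene annotation (Gencode release 34)
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.transcripts.fa.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.annotation.gtf.gz
#unpack, as the index command below expects the uncompressed FASTA
gunzip gencode.v34.transcripts.fa.gz gencode.v34.annotation.gtf.gz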
Assume you have downloaded and unpacked the above-mentioned files in the analysis root directory. To build the actual index, we run this bash code:
#installation in a conda environment
conda install salmon
#create folder and change directory
mkdir ../salmon
cd ../salmon
#build index - `p` specifies the number of threads to use
salmon index -t ../gencode.v34.transcripts.fa --gencode -p 4 -i index
The actual quantification can be performed in bash with:
cd ../trim
#loop over files
#loop over the trimmed file pairs
for i in bulk_*_1_val_1.fq.gz; do salmon quant -l A -i ../salmon/index/ -1 $i -2 ${i/_1_val_1/_2_val_2} --seqBias --gcBias --validateMappings --rangeFactorizationBins 4 --threads 4 -o ../salmon/${i/_1_val_1.fq.gz/}; done
This concludes the preprocessing of the data. Please see the corresponding analysis vignette for a usage example of DTUrtle.
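Before moving on, it is worth checking the mapping rates Salmon achieved; assuming Salmon's default output layout, each sample folder contains a logs/salmon_quant.log that states the mapping rate:
#print the mapping rate of every sample (log path assumes Salmon's default output layout)
grep "Mapping rate" ../salmon/*/logs/salmon_quant.log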