Hoffman_human_single-cell_preprocess.Rmd
This vignette exemplifies the pre-processing of single-cell RNA-seq data for analysis with DTUrtle. The data used in this vignette is publicly available as BioProject PRJNA594939 and the FASTQ files used can be downloaded from here. The corresponding publication from Hoffman et al. based on this data set can be found here.
For this vignette we focus on two of the available single-cell RNA-seq samples:
- SRR10669459: the EtOH (vehicle control) sample, called sc_EtOH below
- SRR10669461: the sample treated with dexamethasone for 2 hours, called sc_Dex2hr below
The FASTQ files can be obtained directly from ENA; alternatively, they are also available as SRA files from GEO, which can be converted to FASTQ format.
After downloading, an MD5 check is strongly encouraged.
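For example, the hashes can be computed with md5sum and compared against the values listed in the ENA file report (a minimal sketch, assuming the files still carry their original run accession names):
#compute MD5 hashes of the downloaded files
md5sum SRR10669459.fastq.gz SRR10669461.fastq.gz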
For this vignette, it is assumed that the FASTQ files mentioned above have been downloaded to a directory called samples.
After downloading the FASTQ files, the first step is to rename the files and split each of them into two files. This data set was produced using an Illumina SureCell 3’ kit, which produces paired-end reads (as do all current single-cell protocols). For unknown reasons, the original two FASTQ files per sample have been concatenated for or during upload. More information about the SureCell format can be found here.
The read files can be split with the following bash commands:
cd $PATH_TO_ANALYSIS_ROOT_DIRECTORY$/samples
#rename
mv SRR10669459.fastq.gz sc_EtOH.fastq.gz
mv SRR10669461.fastq.gz sc_Dex2hr.fastq.gz
#unpack the files
gunzip sc_*.fastq.gz
#split
# the split line numbers can be computed by grepping for the first linker sequence
# or by counting the reads whose names end in '1' - see the sketch below
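# sketch: derive the split line number, assuming the names of the appended
# R1 reads end in '/1' (and those of the R2 reads in '/2') - count the R1
# reads and multiply by 4, as one FASTQ record spans four lines
echo $(( $(awk 'NR%4==1 && /\/1$/' sc_EtOH.fastq | wc -l) * 4 ))
# approximate alternative: count the reads containing the first linker
# sequence without errors (only R1 reads should contain it)
grep -c 'TAGCCATCGCATTGC' sc_EtOH.fastq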
split -l 1215291200 --numeric-suffixes=1 -a 1 --additional-suffix=.fastq sc_EtOH.fastq sc_EtOH_
split -l 1394785580 --numeric-suffixes=1 -a 1 --additional-suffix=.fastq sc_Dex2hr.fastq sc_Dex2hr_
Unfortunately, the read names of the two resulting files do not match directly, and the downstream tools would refuse to analyze paired files with mismatching read names. We can overcome this with the following bash commands:
#create files with matching names
cat sc_EtOH_2.fastq | sed -e 's/^\(@SRR10669459.\)\(.*\) /echo \1$((\2-303822800)) /e' > sc_EtOH_2_matched.fastq
cat sc_Dex2hr_2.fastq | sed -e 's/^\(@SRR10669461.\)\(.*\) /echo \1$((\2-348696395)) /e' > sc_Dex2hr_2_matched.fastq
#pack again
gzip sc_*_1.fastq
gzip sc_*_matched.fastq
#optionally: remove the temporary and the concatenated files
rm *.fastq
There should now be four gzipped FASTQ files present in the folder - two per sample.
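Whether the renumbering worked can be verified by comparing the first read names of both files of a sample, for example with this small sketch (both columns should show identical names):
paste <(zcat sc_EtOH_1.fastq.gz | awk 'NR%4==1 {print $1}' | head -n 3) <(zcat sc_EtOH_2_matched.fastq.gz | awk 'NR%4==1 {print $1}' | head -n 3)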
As with most single-cell data, the first analysis step is the determination of valid cellular barcodes. As the Illumina SureCell 3’ protocol was used, the R1 reads only contain barcode and UMI information.
To error-correct the cell barcodes we use umi_tools (Version 1.0.1). The read data was created using the Illumina SureCell protocol, which uses a quite complex barcode / UMI pattern. Luckily, umi_tools can process arbitrarily complex barcode / UMI structures via regular expressions. We allow correction of barcodes with up to one error, as well as an edit distance of 1 in each linker segment, with the following bash code:
#installation in a conda environment
conda install -c bioconda umi_tools
#create folder and change path
mkdir ../umi_tools
cd ../umi_tools
#collect and error-correct barcodes / UMIs - `subset-reads` and `set-cell-number` have been chosen high so that no relevant information is excluded
umi_tools whitelist --ed-above-threshold='correct' --subset-reads=1000000000 --error-correct-threshold=1 --set-cell-number=10000 --extract-method='regex' -p '(?P<discard_1>.{0,20})(?P<cell_1>.{6})(?P<discard_2>TAGCCATCGCATTGC){e<=1}(?P<cell_2>.{6})(?P<discard_3>TACCTCTGAGCTGAA){e<=1}(?P<cell_3>.{6})(?P<discard_4>ACG)(?P<umi_1>.{8})(?P<discard_5>GAC).*' -I ../samples/sc_EtOH_1.fastq.gz -S etoh_whitelist.tsv
umi_tools whitelist --ed-above-threshold='correct' --subset-reads=1000000000 --error-correct-threshold=1 --set-cell-number=10000 --extract-method='regex' -p '(?P<discard_1>.{0,20})(?P<cell_1>.{6})(?P<discard_2>TAGCCATCGCATTGC){e<=1}(?P<cell_2>.{6})(?P<discard_3>TACCTCTGAGCTGAA){e<=1}(?P<cell_3>.{6})(?P<discard_4>ACG)(?P<umi_1>.{8})(?P<discard_5>GAC).*' -I ../samples/sc_Dex2hr_1.fastq.gz -S dex2hr_whitelist.tsv
The TSV files contain all barcode sequences found in the samples, together with possible erroneously sequenced barcodes (within an edit distance of 1) which will be corrected.
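A quick look at the first entries shows the four-column layout umi_tools uses for its whitelists:
head -n 3 etoh_whitelist.tsv
#column 1: accepted cell barcode (the three 6 bp barcode parts concatenated)
#column 2: comma-separated barcodes that will be corrected to column 1
#columns 3 and 4: read counts of the accepted and the correctable barcodes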
In the publication, only 400 cells of each sample were used in the analysis. For simplicity, we stick to this selection. We can get the 400 valid cell barcodes for each sample from the supplementary count file on GEO: GSE141834_scRNAseq_rawCounts.txt.gz
Assume you have downloaded and unpacked the mentioned file. Lists of the valid cell barcodes can be created with this R code:
setwd("$PATH_TO_ANAYLSIS_ROOT_DIRECTORY$/umi_tools") #get valid barcodes from raw count table raw_expr <- read.table("GSE141834_scRNAseq_rawCounts.txt", header=TRUE, sep="\t") etoh_selected_cells <- gsub(".*_","",grep("Dex.00.2_", colnames(raw_expr), value = TRUE)) dex2hr_selected_cells <- gsub(".*_","",grep("Dex.02.2_", colnames(raw_expr), value = TRUE)) #read-in created umi_tools whitelists etoh_white <- read.table("etoh_whitelist.tsv", stringsAsFactors = FALSE) dex2hr_white <- read.table("dex2hr_whitelist.tsv", stringsAsFactors = FALSE) #write subset to file etoh_white_sub <- etoh_white[etoh_white$V1 %in% etoh_selected_cells,] write.table(etoh_white_sub, "etoh_whitelist_sub.tsv", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE) dex2hr_white_sub <- dex2hr_white[dex2hr_white$V1 %in% dex2hr_selected_cells,] write.table(dex2hr_white_sub, "dex2hr_whitelist_sub.tsv", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
The subsetted TSV files contain only the 400 valid barcodes per sample, together with the barcodes that shall be corrected.
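As a quick sanity check, each subsetted whitelist should now contain exactly 400 lines:
wc -l etoh_whitelist_sub.tsv dex2hr_whitelist_sub.tsv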
The actual barcode extraction and correction is again performed with umi_tools. In this process, only those R2 reads are kept whose corresponding R1 read contains a matching barcode. The barcode and UMI sequence are added to the sequence identifier line of each read in the new files. The parts of the R1 reads that are neither extracted nor discarded by the regular expression are kept in the new R1 file. The umi_tools bash command looks like this:
umi_tools extract --filter-cell-barcode --error-correct-cell --whitelist=etoh_whitelist_sub.tsv --extract-method=regex -p '(?P<discard_1>.{0,20})(?P<cell_1>.{6})(?P<discard_2>TAGCCATCGCATTGC){e<=1}(?P<cell_2>.{6})(?P<discard_3>TACCTCTGAGCTGAA){e<=1}(?P<cell_3>.{6})(?P<discard_4>ACG)(?P<umi_1>.{8})(?P<discard_5>GAC).*' -I ../samples/sc_EtOH_1.fastq.gz -S sc_EtOH_1_extracted.fastq.gz --read2-in ../samples/sc_EtOH_2_matched.fastq.gz --read2-out sc_EtOH_2_extracted.fastq.gz
umi_tools extract --filter-cell-barcode --error-correct-cell --whitelist=dex2hr_whitelist_sub.tsv --extract-method=regex -p '(?P<discard_1>.{0,20})(?P<cell_1>.{6})(?P<discard_2>TAGCCATCGCATTGC){e<=1}(?P<cell_2>.{6})(?P<discard_3>TACCTCTGAGCTGAA){e<=1}(?P<cell_3>.{6})(?P<discard_4>ACG)(?P<umi_1>.{8})(?P<discard_5>GAC).*' -I ../samples/sc_Dex2hr_1.fastq.gz -S sc_Dex2hr_1_extracted.fastq.gz --read2-in ../samples/sc_Dex2hr_2_matched.fastq.gz --read2-out sc_Dex2hr_2_extracted.fastq.gz
The umi_tools folder should now contain four gzipped FASTQ files, two for each sample.
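Since umi_tools appends the barcode and UMI to the read identifier, separated by '_', the number of distinct cell barcodes in an extracted file can be checked with a small sketch like this (it should be at most 400 per sample):
zcat sc_EtOH_2_extracted.fastq.gz | awk 'NR%4==1 {split($1, a, "_"); print a[2]}' | sort -u | wc -l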
We plan to use Alevin for transcript-level quantification, which does not support the Illumina SureCell protocol by default. We can overcome this limitation by slightly editing the new R1 files with some bash code:
zgrep "^@" sc_EtOH_1_extracted.fastq.gz | awk -F '_' -v OFS='\n' '{split($3,a," ");print $0,$2a[1],"+","KKKKKKKKKKKKKKKKKKKKKKKKKK"}' | gzip > sc_EtOH_1_altered.fastq.gz
zgrep "^@" sc_Dex2hr_1_extracted.fastq.gz | awk -F '_' -v OFS='\n' '{split($3,a," ");print $0,$2a[1],"+","KKKKKKKKKKKKKKKKKKKKKKKKKK"}' | gzip > sc_Dex2hr_1_altered.fastq.gz
#optional: remove `extracted` R1 files
rm sc_*_1_extracted.fastq.gz
With these commands, we copied the barcode and UMI sequence from each read's sequence identifier to the actual raw sequence line. We also added some quality values for the bases (the actual value is not relevant).
There should now be an altered R1 FASTQ file per sample, together with an extracted R2 FASTQ file.
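To see the effect of the alteration, the first record of an altered R1 file can be inspected (a quick sketch):
zcat sc_EtOH_1_altered.fastq.gz | head -n 4
#record layout: the sequence identifier (still containing barcode and UMI),
#the 18 bp barcode followed by the 8 bp UMI as sequence (26 bp in total),
#a '+' separator line and 26 placeholder quality values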
Alevin is the single-cell counterpart of Salmon, one of the standard tools for bulk RNA-seq quantification, and is integrated into the Salmon (Version 1.1.0) package. Like most of the available transcript quantifiers, Alevin/Salmon does not perform a standard genomic alignment, but quasi-maps the reads directly to the transcriptome. Reads that could originate from multiple transcript isoforms are assigned to equivalence classes, from which actual counts are later derived by an expectation-maximization algorithm.
To run the Alevin quantification, we first need a transcriptomic index. For this example, we will use the transcript sequences and the comprehensive gene annotation from Gencode Version 34.
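The two files can, for example, be obtained from the Gencode FTP server with the following bash code (a sketch; the URLs follow the standard layout of Gencode release 34):
cd $PATH_TO_ANALYSIS_ROOT_DIRECTORY$
wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.transcripts.fa.gz
wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.annotation.gtf.gz
gunzip gencode.v34.*.gz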
Once the files have been downloaded and unpacked in the analysis root directory, we can build the actual index with this bash code:
#installation in a conda environment
conda install -c bioconda salmon
#create folder and change directory
mkdir ../alevin
cd ../alevin
#build index - `p` specifies the number of threads to use
salmon index -t ../gencode.v34.transcripts.fa --gencode -p 4 -i index
To perform the actual quantification, we need one last file, specifying the transcript-to-gene mapping. Alevin, unlike Salmon, aggregates the quantification counts to gene level by default. For our DTU analysis we require transcript-level counts, so we will only provide a transcript_id to transcript_name mapping. This could also be a transcript_id to transcript_id mapping file (i.e. each ID mapping to itself).
We can create such a mapping file with the help of the DTUrtle package in R:
library(DTUrtle)
setwd("$PATH_TO_ANALYSIS_ROOT_DIRECTORY$/alevin")

#import information from the GTF file
tx2gene <- import_gtf(gtf_file = "../gencode.v34.annotation.gtf")

##optional for later name consistency:
tx2gene$gene_name <- one_to_one_mapping(name = tx2gene$gene_name, id = tx2gene$gene_id)
tx2gene$transcript_name <- one_to_one_mapping(name = tx2gene$transcript_name, id = tx2gene$transcript_id)

#write the transcript_id to transcript_name mapping to file
write.table(x = tx2gene[,c("transcript_id", "transcript_name")], file = "txmap.tsv", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
The actual quantification can be performed in bash with:
salmon alevin -l A -i index/ -1 ../umi_tools/sc_EtOH_1_altered.fastq.gz -2 ../umi_tools/sc_EtOH_2_extracted.fastq.gz --keepCBFraction 1 --end 5 --umiLength 8 --barcodeLength 18 --dumpFeatures --tgMap txmap.tsv -o sc_EtOH
salmon alevin -l A -i index/ -1 ../umi_tools/sc_Dex2hr_1_altered.fastq.gz -2 ../umi_tools/sc_Dex2hr_2_extracted.fastq.gz --keepCBFraction 1 --end 5 --umiLength 8 --barcodeLength 18 --dumpFeatures --tgMap txmap.tsv -o sc_Dex2hr
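After a successful run, the transcript-level quantification results should reside in the alevin subfolder of each output directory (the names below follow the standard Alevin output layout):
ls sc_EtOH/alevin/
#expected files include quants_mat.gz (counts), quants_mat_cols.txt (transcript names) and quants_mat_rows.txt (cell barcodes)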
This concludes the preprocessing of the data. Please see the corresponding analysis vignette for a usage example of DTUrtle.