This vignette demonstrates the pre-processing of single-cell RNA-seq data for analysis with DTUrtle. The data used in this vignette is from the Tabula Muris consortium and is publicly available as BioProject PRJNA432002. The used files can be downloaded from here. The corresponding publication for the data set can be found here.

For this vignette we focus on two of the available single-cell RNA-seq samples:

  • SAMN08712875 which is a 10x Chromium v2 sample from the mammary gland (Mammary_Gland-10X_P7_13)
  • SAMN08712876 which is a 10x Chromium v2 sample from the mammary gland (Mammary_Gland-10X_P7_12)

There are BAM-files directly available from ENA, which will be converted back to FASTQ-files in this vignette.

After downloading, an MD5 check is strongly encouraged.
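As a sketch of such a check in bash (the file names here are hypothetical stand-ins; in practice, put the checksums published by ENA into the list file):

```bash
# hypothetical stand-in for a downloaded file and the MD5 list from ENA;
# in practice, fill checksums.md5 with the published "<md5>  <file>" lines
printf 'demo' > sample.bam
md5sum sample.bam > checksums.md5

# verify the download; prints "sample.bam: OK" when the checksum matches
md5sum -c checksums.md5
```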

Preparing BAM-files

For this vignette, it is assumed that the BAM-files mentioned above have been downloaded to a directory called samples.

After downloading, the BAM-files, which originate from an analysis with the 10x CellRanger pipeline, can be converted back to the FASTQ-format without data loss. For this task we will use the 10x tool bamtofastq (Version 1.3.1), which can be obtained from GitHub.

Download the bamtofastq_linux or bamtofastq_macos binary from the releases page into the samples folder and continue with the following bash commands:
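The conversion could look like the following sketch (the BAM file names 10X_P7_12.bam and 10X_P7_13.bam are assumptions; adjust them to the actual downloaded file names):

```bash
cd samples
chmod +x bamtofastq_linux

# convert each CellRanger BAM back to the original FASTQ-files;
# bamtofastq writes one output folder per run
./bamtofastq_linux 10X_P7_12.bam ./10X_P7_12_fastq
./bamtofastq_linux 10X_P7_13.bam ./10X_P7_13_fastq
```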

There are now two folders with the FASTQ-files present in the samples folder. Apparently the reads originate from two sequencing lanes per sample - this was presumably done to increase the total sequencing depth and coverage.

Quantification with Alevin

Alevin is the single-cell counterpart of Salmon, one of the standard tools for bulk RNA-seq quantification, and is integrated into the Salmon (Version 1.1.0) package. Like most available transcript quantifiers, Alevin/Salmon does not perform a standard genomic alignment, but maps reads directly to the transcriptome (quasi-mapping). Reads that could originate from multiple transcript isoforms are assigned to equivalence classes, from which the actual counts are later derived with an expectation-maximization algorithm.

To run the Alevin quantification, we first need a transcriptomic index. For this example, we will use the transcript sequences and comprehensive gene annotation from Gencode Version M24.

Assume you have downloaded and unpacked the above-mentioned files into the analysis root directory. To build the actual index, we run this bash code:
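A minimal index build might look like this sketch (it assumes salmon is on the PATH, the transcript FASTA was unpacked as gencode.vM24.transcripts.fa, and the index directory name is arbitrary):

```bash
# build a transcriptomic index from the Gencode transcript sequences;
# --gencode handles the Gencode-style FASTA sequence headers
salmon index \
    -t gencode.vM24.transcripts.fa \
    -i gencode_vM24_index \
    --gencode \
    -p 8
```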

To perform the actual quantification, we need one last file, specifying the transcript-to-gene mapping. Unlike Salmon, Alevin aggregates the quantification counts to gene level by default. For our DTU analysis we require transcript-level counts, so we will only provide a transcript_id to transcript_name mapping. This could also be a transcript_id to transcript_id mapping file (i.e. each id mapping to itself).

We can create such a mapping file with the help of the DTUrtle package in R:


#import information from the GTF file
tx2gene <- import_gtf(gtf_file = "../gencode.vM24.annotation.gtf")

  ##optional for later name consistency:
  tx2gene$gene_name <- one_to_one_mapping(name = tx2gene$gene_name, 
                                          id = tx2gene$gene_id)
  tx2gene$transcript_name <- one_to_one_mapping(name = tx2gene$transcript_name, 
                                                id = tx2gene$transcript_id)

#export the two-column mapping for Alevin (tab-separated, no header)
write.table(x = tx2gene[,c("transcript_id", "transcript_name")], 
            file = "txmap.tsv", sep = "\t", quote = FALSE, 
            row.names = FALSE, col.names = FALSE)

The actual quantification can be performed in bash with:
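A quantification call for one of the samples could look like the following sketch (the FASTQ glob patterns depend on the bamtofastq output and are placeholders; --chromium selects the 10x Chromium v2 chemistry and --tgMap points to the mapping file created above):

```bash
# quantify one sample on transcript level with Alevin;
# R1 holds barcode+UMI, R2 the cDNA reads for 10x v2 chemistry
salmon alevin \
    -l ISR \
    -i gencode_vM24_index \
    --chromium \
    -1 10X_P7_12_fastq/*_R1_*.fastq.gz \
    -2 10X_P7_12_fastq/*_R2_*.fastq.gz \
    --tgMap txmap.tsv \
    -p 8 \
    -o alevin_10X_P7_12 \
    --dumpFeatures
```

The same call is then repeated for the second sample (10X_P7_13) with adjusted input and output paths.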

For this analysis we used the Alevin algorithms for cell detection and barcode sequence correction, resulting in transcript-level quantification data for 3992 cells (10X_P7_12) and 4326 cells (10X_P7_13).

This concludes the pre-processing of the data. Please see the corresponding analysis vignette for a usage example of DTUrtle.