This vignette exemplifies the analysis of single-cell RNA-seq data with DTUrtle. The data used in this vignette is publicly available as Bioproject PRJNA433520 and the used FASTQ-files can be downloaded from here. The corresponding publication from Wuidart et al. can be found here.

The FASTQ-files can be directly obtained from ENA, alternatively they are also available as SRA-files from GEO, which can be converted to FASTQ-format.

As this vignette deals with single-cell RNA-seq data generated via the Smart-seq2 protocol, a (standard) sample of this Project represents reads from a single cell. We initially download all 384 available samples/cells - this can be done directly from ENA.

Suppose you have downloaded the Bioproject report table like this to your YOUR_PATH directory - selecting the columns fastq_ftp, fastq_md5, sample_accession, sample_title, scientific_name and study_accession.

This table contains all needed information to download and MD5-check the files - as well as to create a basic sample meta information table.

Downloading can be done like this:

After downloading a MD5-check is strongly encouraged.

You should only proceed, if all MD5-sums match!

Preparing FASTQ-files

As described in Wuidart et al., the sequences likely contain left-over adapter sequences and an adapter trimming is recommended.

We will trim the reads to remove adapters and perform a very relaxed quality filtering. We will use the tool trim-galore (Version 0.6.4_dev) for this task.

The trimmed files are the newly created folder /trim. By looking at the generated FastQC reports, we can check how many sequences have been removed by the trimming, and if other QC paramters indicate problems. Optionally we can summarize the hundreds of reports to a single report with MultiQC:

Transcript level quantification

The files are now ready for transcript level quantification. We will use the tool Salmon (Version 1.1.0) for this. Although we have single-cell RNA-seq data, samples/cells from the Smart-seq2 protocol can be treated as standard bulk samples (you have quite a lot of them, though). Therefore we can simply use the Salmon for quantification and do not have to use specialised single-cell variants like Alevin.

Salmon does not perform a standard genomic alignment, but performs a quasi-mapping directly to the transcriptome. Reads, that could be originating from multiple transcript isoforms, are assigned to equivalence classes, where actual counts are derived by expectation maximization algorithms.

To run the Salmon quantification, we first need a transcriptomic index. For this example, we will use the transcript sequences and comprehensive gene annotation from Gencode Version 34.

Additionally, in this experiment ERCC Spike-In sequences have been used in the sequencing. To also be able to quantify those not naturally occuring sequences, we should add those manually to the transcriptome and the annotation.

We can get the ERCC FASTA-sequences and a corresponing GTF-file directly from ThermoFisher and combine those files with the Gencode files:

This reference FASTA including the ERCC-sequences can then be used to build the Salmon index:

The actual quantification can be performed in bash with:

This concludes the preprocessing of the data. Please see the corresponding analysis vignette for an usage example of DTUrtle.