This vignette exemplifies the analysis of single-cell RNA-seq data with DTUrtle. The data used in this vignette is publicly available as Bioproject PRJNA432002 and the used FASTQ-files can be downloaded from here. The corresponding publication can be found here.


The following code shows an example of an DTUrtle workflow. Assume we have performed the pre-processing as described here and the R working directory is a newly created folder called dtu_results.


Setup

Load the DTUrtle package and set the BiocParallel parameter. It is recommended to perform computations in parallel, if possible.

library(DTUrtle)
#> Loading required package: sparseDRIMSeq
#> To cite DTUrtle in publications use:
#> 
#>  Tobias Tekath, Martin Dugas.
#>  Differential transcript usage analysis
#>  of bulk and single-cell RNA-seq data with DTUrtle.
#>  Bioinformatics,
#>  2021.
#>  https://doi.org/10.1093/bioinformatics/btab629.
#> 
#> Use citation(DTUrtle) for BibTeX information.
#use up to 10 cores for computation
biocpar <- BiocParallel::MulticoreParam(10)

Import and format data

We want to start by reading in our quantification counts, as well as a file specifying which transcript ID or name belongs to which gene ID or name.

Importing and processing GTF annotation (tx2gene)

To get this transcript to gene (tx2gene) mapping, we will utilize the already present Gencode annotation file gencode.vM24.annotation.gtf. The import_gtf() function utilizes the a rtracklayer package and returns a transcript-level filtered version of the available data.

tx2gene <- import_gtf(gtf_file = "../gencode.vM24.annotation.gtf")
head(tx2gene, n=3)
#>   seqnames   start     end width strand  source       type score phase
#> 1     chr1 3073253 3074322  1070      +  HAVANA transcript    NA    NA
#> 2     chr1 3102016 3102125   110      + ENSEMBL transcript    NA    NA
#> 3     chr1 3205901 3216344 10444      -  HAVANA transcript    NA    NA
#>                gene_id        transcript_id      gene_type     gene_name
#> 1 ENSMUSG00000102693.1 ENSMUST00000193812.1            TEC 4933401J01Rik
#> 2 ENSMUSG00000064842.1 ENSMUST00000082908.1          snRNA       Gm26206
#> 3 ENSMUSG00000051951.5 ENSMUST00000162897.1 protein_coding          Xkr4
#>        transcript_type   transcript_name level transcript_support_level
#> 1                  TEC 4933401J01Rik-201     2                       NA
#> 2                snRNA       Gm26206-201     3                       NA
#> 3 processed_transcript          Xkr4-203     2                        1
#>        mgi_id   tag          havana_gene    havana_transcript protein_id ccdsid
#> 1 MGI:1918292 basic OTTMUSG00000049935.1 OTTMUST00000127109.1       <NA>   <NA>
#> 2 MGI:5455983 basic                 <NA>                 <NA>       <NA>   <NA>
#> 3 MGI:3528744  <NA> OTTMUSG00000026353.2 OTTMUST00000086625.1       <NA>   <NA>
#>    ont
#> 1 <NA>
#> 2 <NA>
#> 3 <NA>

There are a lot of columns present in the data frame, but at the moment we are mainly interested in the columns gene_id, gene_name, transcript_id and transcript_name.


As we want to use gene and transcript names as the main identifiers in our analysis (so we can directly say: Gene x is differential), we should ensure that each gene / transcript name maps only to a single gene / transcript id.

For this we can use the DTUrtle function one_to_one_mapping(), which checks if there are identifiers, which relate to the same name. If this is the case, the names (not the identifiers) are slightly altered by appending a number. If id_x and id_y both have the name ABC, the id_y name is altered to ABC_2 by default.

tx2gene$gene_name <- one_to_one_mapping(name = tx2gene$gene_name, id = tx2gene$gene_id)
#> Changed 109 names.
tx2gene$transcript_name <- one_to_one_mapping(name = tx2gene$transcript_name, id = tx2gene$transcript_id)
#> No changes needed -> already one to one mapping.

We see that it was a good idea to ensure the one to one mapping, as many doublets have been corrected.


For the run_drimseq() tx2gene parameter, we need a data frame, where the first column specifies the transcript identifiers and the second column specifying the corresponding gene names. Rather than subsetting the data frame, a column reordering is proposed, so that additional data can still be used in further steps. DTUrtle makes sure to carry over additional data columns in the analysis steps. To reorder the columns of our tx2gene data frame, we can utilize the move_columns_to_front() functionality.

tx2gene <- move_columns_to_front(df = tx2gene, columns = c("transcript_name", "gene_name"))
head(tx2gene, n=5)
#>     transcript_name     gene_name seqnames   start     end  width strand
#> 1 4933401J01Rik-201 4933401J01Rik     chr1 3073253 3074322   1070      +
#> 2       Gm26206-201       Gm26206     chr1 3102016 3102125    110      +
#> 3          Xkr4-203          Xkr4     chr1 3205901 3216344  10444      -
#> 4          Xkr4-202          Xkr4     chr1 3206523 3215632   9110      -
#> 5          Xkr4-201          Xkr4     chr1 3214482 3671498 457017      -
#>    source       type score phase              gene_id        transcript_id
#> 1  HAVANA transcript    NA    NA ENSMUSG00000102693.1 ENSMUST00000193812.1
#> 2 ENSEMBL transcript    NA    NA ENSMUSG00000064842.1 ENSMUST00000082908.1
#> 3  HAVANA transcript    NA    NA ENSMUSG00000051951.5 ENSMUST00000162897.1
#> 4  HAVANA transcript    NA    NA ENSMUSG00000051951.5 ENSMUST00000159265.1
#> 5  HAVANA transcript    NA    NA ENSMUSG00000051951.5 ENSMUST00000070533.4
#>        gene_type      transcript_type level transcript_support_level
#> 1            TEC                  TEC     2                       NA
#> 2          snRNA                snRNA     3                       NA
#> 3 protein_coding processed_transcript     2                        1
#> 4 protein_coding processed_transcript     2                        1
#> 5 protein_coding       protein_coding     2                        1
#>        mgi_id   tag          havana_gene    havana_transcript
#> 1 MGI:1918292 basic OTTMUSG00000049935.1 OTTMUST00000127109.1
#> 2 MGI:5455983 basic                 <NA>                 <NA>
#> 3 MGI:3528744  <NA> OTTMUSG00000026353.2 OTTMUST00000086625.1
#> 4 MGI:3528744  <NA> OTTMUSG00000026353.2 OTTMUST00000086624.1
#> 5 MGI:3528744  CCDS OTTMUSG00000026353.2 OTTMUST00000065166.1
#>             protein_id      ccdsid  ont
#> 1                 <NA>        <NA> <NA>
#> 2                 <NA>        <NA> <NA>
#> 3                 <NA>        <NA> <NA>
#> 4                 <NA>        <NA> <NA>
#> 5 ENSMUSP00000070648.4 CCDS14803.1 <NA>

This concludes the tx2gene formatting.


Reading in quantification data

The read-in of the quantification counts can be achieved with import_counts(), which uses the tximport package in the background. This function is able to parse the output of many different quantification tools. Advanced users might be able to tune parameters to parse arbitrary output files from currently not supported tools.

In the pre-processing vignette we quantified the counts with Alevin. The folder structure of the quantification results folder looks like this:

list.files("../alevin/")
#> [1] "10X_P7_12" "10X_P7_13" "index"     "txmap.tsv"

We will create a named files vector, pointing to the quants_mat.gz file for each sample. The names help to differentiate the samples later on.

files <- Sys.glob("../alevin/10X_P7_*/alevin/quants_mat.gz")
names(files) <- basename(gsub("/alevin/quants_mat.gz", "", files))

The files object looks like this:

#>                                  10X_P7_12 
#> "../alevin/10X_P7_12/alevin/quants_mat.gz" 
#>                                  10X_P7_13 
#> "../alevin/10X_P7_13/alevin/quants_mat.gz"

The actual import will be performed with import_counts().

cts_list <- import_counts(files = files, type = "alevin")
#> Reading in 2 alevin runs.
#> importing alevin data is much faster after installing `fishpond` (>= 1.2.0)
#> reading in alevin gene-level counts across cells
#> importing alevin data is much faster after installing `fishpond` (>= 1.2.0)
#> reading in alevin gene-level counts across cells

The cts_list object is a named list, with a sparse Matrix per sample. In single-cell data, each sample normally consists of many different cells with an unique cell barcode. These cell barcodes might overlap between samples though. For this reason, many single-cell workflow use a cell barcode extension, uniquely assigning each cell to a sample. This can also be done in DTUrtle with combine_to_matrix(), which is only applicable if you are analyzing single-cell data.

This function will make sure that there are no duplicated barcodes between you samples, before merging the matrices together. If there are duplicated barcodes, a cell extension is added. Additionally, all not expressed features are removed to reduce the size of the data.

To have matching cell names with the sample annotation we add in the next step, we set the cell extensions explicitly and prepend them.

cts <- combine_to_matrix(tx_list = cts_list, cell_extensions = c("10X_P7_12", "10X_P7_13"), cell_extension_side = "prepend")
#> Found overall duplicated cellnames. Trying cellname extension per sample.
#> Map extensions:
#>  10X_P7_12 --> '10X_P7_12_'
#>  10X_P7_13 --> '10X_P7_13_'
#>  
#> Merging matrices
#> Excluding 49494 overall not expressed features.
#> 91454 features left.

Apparently there were some duplicated cell barcodes between our samples, so a cell barcode extensions are useful either way.

dim(cts)
#> [1] 91454  8318

There are ~91k features left for 8318 cells.


Sample metadata

Finally, we need a sample metadata data frame, specifying which sample belongs to which comparison group. This table is also convenient to store and carry over additional metadata.

For single-cell data, this is not a sample metadata data frame, but a cell metadata data frame. We have to specify the information on cell level, with the barcodes as identifiers.

As we want to use the cell type assignments from the Tabula Muris project, we download the Robj file containing the annotation. The Tabula Muris Data has been deposited at Figshare

We can perform the download with the following bash code:

The meta.data data frame of this object contains the wanted annotation.

library(Seurat)
load("../droplet_Mammary_Gland_seurat_tiss.Robj", verbose=TRUE)
#> Loading objects:
#>   tiss

tabula_muris_metadata <- tiss@meta.data
dim(tabula_muris_metadata)
#> [1] 4481   15
head(tabula_muris_metadata, n=5)
#>                            nGene nUMI orig.ident   channel        tissue
#> 10X_P7_12_AAACCTGAGTTGAGAT  2125 7442        10X 10X_P7_12 Mammary_Gland
#> 10X_P7_12_AAACCTGTCGTCACGG  2750 9544        10X 10X_P7_12 Mammary_Gland
#> 10X_P7_12_AAACCTGTCTTGTACT  1837 6479        10X 10X_P7_12 Mammary_Gland
#> 10X_P7_12_AAACCTGTCTTTAGTC  2051 7284        10X 10X_P7_12 Mammary_Gland
#> 10X_P7_12_AAACGGGCAGTGGGAT  1129 3797        10X 10X_P7_12 Mammary_Gland
#>                            subtissue mouse.sex mouse.id percent.ercc
#> 10X_P7_12_AAACCTGAGTTGAGAT                   F   3-F-56            0
#> 10X_P7_12_AAACCTGTCGTCACGG                   F   3-F-56            0
#> 10X_P7_12_AAACCTGTCTTGTACT                   F   3-F-56            0
#> 10X_P7_12_AAACCTGTCTTTAGTC                   F   3-F-56            0
#> 10X_P7_12_AAACGGGCAGTGGGAT                   F   3-F-56            0
#>                            percent.ribo         free_annotation
#> 10X_P7_12_AAACCTGAGTTGAGAT    0.2758667 luminal progenitor cell
#> 10X_P7_12_AAACCTGTCGTCACGG    0.2768231                    <NA>
#> 10X_P7_12_AAACCTGTCTTGTACT    0.1967896                    <NA>
#> 10X_P7_12_AAACCTGTCTTTAGTC    0.2397035                    <NA>
#> 10X_P7_12_AAACGGGCAGTGGGAT    0.4234922                    <NA>
#>                                                 cell_ontology_class res.1
#> 10X_P7_12_AAACCTGAGTTGAGAT luminal epithelial cell of mammary gland     6
#> 10X_P7_12_AAACCTGTCGTCACGG                                   B cell     1
#> 10X_P7_12_AAACCTGTCTTGTACT                             stromal cell     3
#> 10X_P7_12_AAACCTGTCTTTAGTC                             stromal cell     3
#> 10X_P7_12_AAACGGGCAGTGGGAT                                   B cell     1
#>                            cluster.ids cell_ontology_id
#> 10X_P7_12_AAACCTGAGTTGAGAT           6       CL:0002326
#> 10X_P7_12_AAACCTGTCGTCACGG           1       CL:0000236
#> 10X_P7_12_AAACCTGTCTTGTACT           3       CL:0000499
#> 10X_P7_12_AAACCTGTCTTTAGTC           3       CL:0000499
#> 10X_P7_12_AAACGGGCAGTGGGAT           1       CL:0000236

As we have only cell type information available for cells analyzed in Tabula Muris, we subset our data set to the matching cells.

cts <- cts[,rownames(tabula_muris_metadata)]
dim(cts)
#> [1] 91454  4481

All 4481 cells from Tabula Muris have been recovered. The tabula_muris_metadata data frame can be used as our meta data table.

For an example workflow utilizing the Seurat object directly, please see Workflow with Seurat object


(optional) Estimate influence of priming-bias

As the 10X Chromium 3’ protocol preferentially generates reads of the 3’-end of an mRNA, DTU effects of some specific transcripts might not be detectable. DTUrtle offers a novel scoring scheme, called “detection probability”, to assess which transcripts might be prone to be impaired by such a bias. This score can be computed with the DTUrtle function priming_bias_detection_probability().

The score calculation can be summarized like this: For each gene, a reference transcript is chosen (based on the expression data, selecting the major proportionally expressed transcript as reference). Each other transcript of a gene is now compared to this reference transcript, calculating where the first detectable exon-level difference occurs between the transcript entities - and at what relative coordinate this difference is located (looking from the priming enriched end). Based on this information, a detection probability is calculated, with 1 indicating no influence of the prime-biased reads on the detection ability and 0 indicating a very heavy influence. Thus, DTU effects for transcripts with a low score would be expected less likely to be detectable with the given data.

We can add this score information to an already existing table, like the tx2gene table (add_to_table = tx2gene). As we need exon-level information for the calculation, we should provide an unfiltered GTF GRanges or data frame object - or alternatively a file path.

The newly added columns are detection_probability and used_as_ref:

head(tx2gene[,c("transcript_name", "gene_name", "detection_probability", "used_as_ref")])
#>     transcript_name     gene_name detection_probability used_as_ref
#> 1 4933401J01Rik-201 4933401J01Rik             1.0000000       FALSE
#> 2       Gm26206-201       Gm26206             1.0000000       FALSE
#> 3          Xkr4-203          Xkr4             1.0000000        TRUE
#> 4          Xkr4-202          Xkr4             0.7340248       FALSE
#> 5          Xkr4-201          Xkr4             0.4870666       FALSE
#> 6       Gm18956-201       Gm18956             1.0000000       FALSE

We can calculate, that a potential 3’-bias would not influence the majority of annotated transcripts, relevant for the DTU analysis:

#only genes with at least two transcript isoforms are relevant for DTU analysis
dtu_relevant_genes <- unique(tx2gene$gene_name[duplicated(tx2gene$gene_name)])
summary(tx2gene$detection_probability[tx2gene$gene_name %in% dtu_relevant_genes])
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> 0.000138 0.394984 1.000000 0.719740 1.000000 1.000000

DTU analysis

We have prepared all necessary data to perform the differentially transcript usage (DTU) analysis. DTUrtle only needs two simple commands to perform it. Please be aware that these steps are the most compute intensive and, depending on your data, might take some time to complete. It is recommended to parallelize the computations with the BBPARAM parameter, if applicable.

First, we want to set-up and perform the statistical analysis with DRIMSeq, a DTU specialized statistical framework utilizing a Dirichlet-multinomial model. This can be done with the run_drimseq() command. We use the previously imported data as parameters, specifying which column in the cell metadata data frame contains ids and which the group information we want. We should also specify which of the groups should be compared (if there are more than two) and in which order. The order given in the cond_levels parameter also specifies the comparison formula.

In this example we choose two specific cell types (from the column cell_ontology_class) and do not specify the id column, as rownames are chosen by default.

Note: By default run_drimseq() converts sparse count matrix to a dense format for statistical computations (force_dense=TRUE). While this increases memory usage, it currently also reduces the run time. The computations can be performed keeping the sparse counts by setting force_dense=FALSE.

dturtle <- run_drimseq(counts = cts, tx2gene = tx2gene, pd=tabula_muris_metadata,
                    cond_col = "cell_ontology_class", 
                    cond_levels = c("T cell", "luminal epithelial cell of mammary gland"),
                    filtering_strategy = "sc", BPPARAM = biocpar)
#> Using tx2gene columns:
#>  transcript_name ---> 'feature_id'
#>  gene_name ---> 'gene_id'
#> 
#> Comparing in 'cell_ontology_class': 'T cell' vs 'luminal epithelial cell of mammary gland'
#> Excluding 2272 cells/samples for this comparison!
#> 
#> Proceed with cells/samples: 
#>                                   T cell 
#>                                     1750 
#> luminal epithelial cell of mammary gland 
#>                                      459
#> 
#> Filtering...
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |======================================================================| 100%
#> Retain 32578 of 78489 features.
#> Removed 45911 features.
#> 
#> Performing statistical tests...
#> * Calculating mean gene expression..
#> * Estimating common precision..
#> ! Using common_precision = 21.2862 as prec_init !
#> * Estimating genewise precision..
#> ! Using loess fit as a shrinkage factor !
#> * Fitting the DM model..
#>    Using the one way approach.
#> * Fitting the BB model..
#>    Using the one way approach.
#> * Fitting the DM model..
#>    Using the one way approach.
#> * Calculating likelihood ratio statistics..
#> * Fitting the BB model..
#>    Using the one way approach.
#> * Calculating likelihood ratio statistics..

As in all statistical procedures, it is of favor to perform as few tests as possible but as much tests as necessary, to maintain a high statistical power. This is achieved by filtering the data to remove inherently uninteresting items, for example very lowly expressed genes or features. DTUrtle includes a powerful and customizable filtering functionality for this task, which is an optimized version of the dmFilter() function of the DRIMSeq package.

Above we used a predefined filtering strategy for single-cell data, requiring that features contribute at least 5% of the total expression in at least 5% of the cells of the smallest group. Additionally, all genes are filtered, which only have a single transcript left, as they can not be analyzed in DTU analysis. The filtering options can be extended or altered by the user.

dturtle$used_filtering_options
#> $DRIM
#> $DRIM$min_samps_gene_expr
#> [1] 0
#> 
#> $DRIM$min_samps_feature_expr
#> [1] 0
#> 
#> $DRIM$min_samps_feature_prop
#> [1] 22.95
#> 
#> $DRIM$min_gene_expr
#> [1] 0
#> 
#> $DRIM$min_feature_expr
#> [1] 0
#> 
#> $DRIM$min_feature_prop
#> [1] 0.05
#> 
#> $DRIM$run_gene_twice
#> [1] TRUE

The resulting dturtle object will be used as our main results object, storing all necessary and interesting data of the analysis. It is a simple and easy-accessible list, which can be easily extended / altered by the user. By default three different meta data tables are generated:

  • meta_table_gene: Contains gene level meta data.
  • meta_table_tx: Contains transcript level meta data.
  • meta_table_sample: Contains sample level meta data (as the created pd data frame)

These meta data tables are used in for visualization purposes and can be extended by the user.

head(dturtle$meta_table_gene, n=5)
#>                        gene     exp_in exp_in_T cell
#> 0610009B22Rik 0610009B22Rik 0.07424174   0.024571429
#> 0610010F05Rik 0610010F05Rik 0.02580353   0.007428571
#> 0610010K14Rik 0610010K14Rik 0.28700770   0.222285714
#> 0610030E20Rik 0610030E20Rik 0.16115890   0.120571429
#> 1110004F10Rik 1110004F10Rik 0.43549117   0.368571429
#>               exp_in_luminal epithelial cell of mammary gland seqnames strand
#> 0610009B22Rik                                      0.26361656    chr11      -
#> 0610010F05Rik                                      0.09586057    chr11      -
#> 0610010K14Rik                                      0.53376906    chr11      -
#> 0610030E20Rik                                      0.31590414     chr6      +
#> 1110004F10Rik                                      0.69063181     chr7      +
#>                     type             gene_id.1      gene_type
#> 0610009B22Rik transcript  ENSMUSG00000007777.9 protein_coding
#> 0610010F05Rik transcript ENSMUSG00000042208.15 protein_coding
#> 0610010K14Rik transcript ENSMUSG00000020831.18 protein_coding
#> 0610030E20Rik transcript  ENSMUSG00000058706.5 protein_coding
#> 1110004F10Rik transcript ENSMUSG00000030663.12 protein_coding

This table shows the single genes, together with their total expressed-in ratio (“exp_in”), the specific expressed-in ratios in the specific comparison groups, as well with other information from the provided GTF-file.


As proposed in Love et al. (2018), we will use a two-stage statistical testing procedure together with a post-hoc filtering on the standard deviations in proportions (posthoc_and_stager()). We will use stageR to determine genes, that show a overall significant change in transcript proportions. For these significant genes, we will try to pinpoint specific transcripts, which significantly drive this overall change. As a result, we will have a list of significant genes (genes showing the overall change) and a list of significant transcripts (one or more transcripts of the significant genes). Please note, that not every significant gene does have one or more significant transcripts. It is not always possible to attribute the overall change in proportions to single transcripts. These two list of significant items are computed and corrected against a overall false discovery rate (OFDR).

Additionally, we will apply a post-hoc filtering scheme to improve the targeted OFDR control level. The filtering strategy will discard transcripts, which standard deviation of the proportion per cell/sample is below the specified threshold. For example by setting posthoc=0.1, we will exclude all transcripts, which proportional expression (in regard to the total gene expression) deviates by less than 0.1 between cells. This filtering step should mostly exclude ‘uninteresting’ transcripts, which would not have been called as significant either way.

dturtle <- posthoc_and_stager(dturtle = dturtle, ofdr = 0.05, posthoc = 0.1)
#> Posthoc filtered 25829 features.
#> The returned adjusted p-values are based on a stage-wise testing approach and are only valid for the provided target OFDR level of 5%. If a different target OFDR level is of interest,the entire adjustment should be re-run.
#> Found 2100 significant genes with 3130 significant transcripts (OFDR: 0.05)

The dturtle object now contains additional elements, including the lists of significant genes and significant transcripts.

head(dturtle$sig_gene)
#> [1] "1110004F10Rik" "1110008P14Rik" "1110032A03Rik" "1110038B12Rik"
#> [5] "1700037C18Rik" "1810058I24Rik"
head(dturtle$sig_tx)
#>       1110008P14Rik       1110008P14Rik       1110032A03Rik       1110032A03Rik 
#> "1110008P14Rik-202" "1110008P14Rik-201" "1110032A03Rik-201" "1110032A03Rik-206" 
#>       1110038B12Rik       2310009A05Rik 
#> "1110038B12Rik-201" "2310009A05Rik-201"

(optional) DGE analysis

Alongside a DTU analysis, a DGE analysis might be of interest for your research question. DTUrtle offers a basic DGE calling workflow for bulk and single-cell RNA-seq data via DESeq2.

To utilize this workflow, we should re-scale our imported counts. The transcript-level count matrix has been scaled for the DTU analysis, but for a DGE analysis un-normalized counts should be used (as DESeq2 normalizes internally). We can simply re-import the count data, by providing the already defined files object to the DGE analysis specific function import_dge_counts(). This function will make sure that the counts are imported but not scaled and also summarizes to gene-level.

Our files object looks like this:

head(files)
#>                                  10X_P7_12 
#> "../alevin/10X_P7_12/alevin/quants_mat.gz" 
#>                                  10X_P7_13 
#> "../alevin/10X_P7_13/alevin/quants_mat.gz"

We re-import the files and use combine_to_matrix() to create one big matrix (as we did for the DTU analysis):

cts_dge <- import_dge_counts(files, type="alevin", tx2gene=tx2gene)
#> Reading in 2 alevin runs.
#> importing alevin data is much faster after installing `fishpond` (>= 1.2.0)
#> reading in alevin gene-level counts across cells
#> importing alevin data is much faster after installing `fishpond` (>= 1.2.0)
#> reading in alevin gene-level counts across cells
cts_dge <- combine_to_matrix(tx_list = cts_dge, cell_extensions = c("10X_P7_12", "10X_P7_13"), cell_extension_side = "prepend")
#> Found overall duplicated cellnames. Trying cellname extension per sample.
#> Map extensions:
#>  10X_P7_12 --> '10X_P7_12_'
#>  10X_P7_13 --> '10X_P7_13_'
#>  
#> Merging matrices
#> Excluding 25835 overall not expressed features.
#> 28496 features left.

With this data, we can perform the DGE analysis with DESeq2 with the DTUrtle function run_deseq2(). DESeq2 is one of the gold-standard tools for DGE calling in bulk RNA-seq data and showed very good performance for single-cell data in multiple benchmarks. The DESeq2 vignette recommends adjusting some parameters for DGE analysis of single-cell data - DTUrtle incorporates these recommendations and adjusts the specific parameters based on the provided dge_calling_strategy.

After the DESeq2 DGE analysis, the estimated log2 fold changes are shrunken (preferably with apeglm) - to provide a secondary ranking variable beside the statistical significance. If a shrinkage with apeglm is performed, run_deseq2() defaults to also compute s-values rather than adjusted p-values. The underlying hypothesis for s-values and (standard) p-values differs slightly, with s-values hypothesis being presumably preferential in a real biological context. Far more information about all these topics can be found in the excellent DESeq2 vignette and the associated publications.

run_deseq2() will tell the user, if one or more advised packages are missing. It is strongly advised to follow these recommendations.

dturtle$dge_analysis <- run_deseq2(counts = cts_dge, pd = tabula_muris_metadata, cond_col = "cell_ontology_class", 
                                   cond_levels = c("T cell", "luminal epithelial cell of mammary gland"), lfc_threshold = 1,
                                   sig_threshold = 0.01, dge_calling_strategy = "sc", BPPARAM = biocpar)
#> 
#> Comparing in 'cell_ontology_class': 'T cell' vs 'luminal epithelial cell of mammary gland'
#> Excluding 2272 cells/samples for this comparison!
#> 
#> Proceed with cells/samples: 
#>                                   T cell 
#>                                     1750 
#> luminal epithelial cell of mammary gland 
#>                                      459
#> converting counts to integer mode
#>   Note: levels of factors in the design contain characters other than
#>   letters, numbers, '_' and '.'. It is recommended (but not required) to use
#>   only letters, numbers, and delimiters '_' or '.', as these are safe characters
#>   for column names in R. [This is a message, not a warning or an error]
#> Warning in (function (object, test = c("Wald", "LRT"), fitType =
#> c("parametric", : parallelization of DESeq() is not implemented for
#> fitType='glmGamPoi'
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates: 10 workers
#> mean-dispersion relationship
#> final dispersion estimates, fitting model and testing: 10 workers
#> using 'apeglm' for LFC shrinkage. If used in published research, please cite:
#>     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
#>     sequence count data: removing the noise and preserving large differences.
#>     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
#> computing FSOS 'false sign or small' s-values (T=1)
#> Found 11071 significant DEGs.
#>      Over-expressed: 552
#>      Under-expressed: 10519

We provided a log2 fold change threshold of 1 (on log2 scale), thus preferring an effect size of 2x or more. The output of run_deseq2() is a list, containing various elements (see run_deseq2()’s description). This result list can easily added to an existing DTUrtle object, as shown above.

We can now identify genes which show both a DGE and DTU signal:

dtu_dge_genes <- intersect(dturtle$sig_gene, dturtle$dge_analysis$results_sig$gene)
length(dtu_dge_genes)
#> [1] 1717

Result aggregation and visualization

The DTUrtle package contains multiple visualization options, enabling an in-depth inspection.

DTU table creation

We will start by aggregating the analysis results to a data frame with create_dtu_table(). This function is highly flexible and allows aggregation of gene or transcript level metadata in various ways. By default some useful information are included in the dtu table, in this example we further specify to include the seqnames column of the gene level metadata (which contains chromosome information) as well as the maximal expressed in ratio per cell type from the transcript level metadata (“expressed in” specifies the ratio of cells, were the expression of the gene / transcript is > 0). Furthermore we add the a column with the absolute difference between the expressed in ratios.

dturtle <- create_dtu_table(dturtle = dturtle, add_gene_metadata = list("chromosome"="seqnames"), 
                               add_tx_metadata = list("max_tx_expr_in_T_cell" = c("exp_in_T cell", max),
                                                      "max_tx_expr_in_luminal_epithelial_cell_of_mammary_gland" = 
                                                        c("exp_in_luminal epithelial cell of mammary gland", max)))

# add absolute difference between "expressed in" columns.
dturtle$dtu_table$abs_diff_expr_in <- abs(dturtle$dtu_table$max_tx_expr_in_T_cell-dturtle$dtu_table$max_tx_expr_in_luminal_epithelial_cell_of_mammary_gland)
head(dturtle$dtu_table, n=5)
#>           gene_ID   gene_qvalue minimal_tx_qvalue number_tx
#> Pde4d       Pde4d 2.507274e-209     1.631713e-169         5
#> Atp6v0e2 Atp6v0e2  2.576117e-21      3.358998e-23         3
#> Lamb3       Lamb3  2.737907e-65      3.706752e-66         4
#> Cgref1     Cgref1  1.990326e-36      0.000000e+00         2
#> Nr6a1       Nr6a1  2.403643e-37      2.382682e-40         4
#>          number_significant_tx
#> Pde4d                        3
#> Atp6v0e2                     1
#> Lamb3                        1
#> Cgref1                       2
#> Nr6a1                        1
#>          max(T cell-luminal epithelial cell of mammary gland) chromosome
#> Pde4d                                              -0.8201809      chr13
#> Atp6v0e2                                            0.8084999       chr6
#> Lamb3                                               0.7969231       chr1
#> Cgref1                                             -0.7603835       chr5
#> Nr6a1                                               0.7230380       chr2
#>          max_tx_expr_in_T_cell
#> Pde4d               0.16514286
#> Atp6v0e2            0.03028571
#> Lamb3               0.16400000
#> Cgref1              0.05257143
#> Nr6a1               0.08742857
#>          max_tx_expr_in_luminal_epithelial_cell_of_mammary_gland
#> Pde4d                                                  0.5098039
#> Atp6v0e2                                               0.1938998
#> Lamb3                                                  0.1677560
#> Cgref1                                                 0.3246187
#> Nr6a1                                                  0.1633987
#>          abs_diff_expr_in
#> Pde4d         0.344661064
#> Atp6v0e2      0.163614068
#> Lamb3         0.003755991
#> Cgref1        0.272047308
#> Nr6a1         0.075970121

This table is our basis for creating an interactive HTML-table of the results.

The column definitions are as follows:

  • “gene_ID”: Gene name or identifier used for the analysis.
  • “gene_qvalue”: Multiple testing corrected p-value (a.k.a. q-value) comparing all transcripts together between the two groups (“gene level”).
  • “minimal_tx_qvalue”: The minimal multiple testing corrected p-value from comparing all transcripts individually between the two groups (“transcript level”). I.e. the q-value of the most significant transcript.
  • “number_tx”: The number of analyzed transcripts for the specific gene.
  • “number_significant_tx”: The number of significant transcripts from the ‘transcript level’ analysis.
  • “max(T cell-luminal epithelial cell of mammary gland)”: Maximal proportional difference between the two groups (T cell vs luminal epithelial cell of mammary gland). E.g. one transcript of ‘Pde4d’ is ~82% more expressed in ‘luminal epithelial cell of mammary gland’ cells compared to ‘T cell’ cells.
  • “chromosome”: the chromosome the gene resides on.

Please note, that the columns and their meaning can easily be altered and redefined by the user.


Proportion barplot

As a first visualization option we will create a barplot of the proportions of each transcript per sample. We can use the function plot_proportion_barplot() for this task, which also adds the mean proportion fit per subgroup to the plot (by default as a red line).

As an example, we will create the plot for the gene Lcn2, which is one of the significant genes found in the analysis. We will optionally provide the gene_id for Lcn2, which is stored in the column gene_id.1 of dturtle$meta_table_gene.

temp <- plot_proportion_barplot(dturtle = dturtle, genes = "Lcn2", meta_gene_id = "gene_id.1")
#> Creating 1 plots:
temp$Lcn2

We see, that most of the proportional differences for Lcn2 are driven by 2 of the 3 transcripts (Lcn2-201 and Lcn2-203). These transcripts are also the significant transcripts found in the analysis (as they are marked in red). Notably 3 other transcripts of Lcn2 have been removed by the filtering thresholds in run_drimseq(). Lcn2-201 is proportionally overexpressed in luminal epithelial cell of mammary gland cells, in contrast to Lcn2-203 which is proportionally overexpressed in T cell cells. We additionally see, that the overall expression level of Lcn2 is way higher in luminal epithelial cell of mammary gland than in T cell.

For the interactive HTML-table we would need to save the images to disk (in the to-be-created sub folder “images” of the current working directory). There is also a convenience option, to directly add the file paths to the dtu_table. As multiple plots are created, we can provide a BiocParallel object to speed up the creation. If no specific genes are provided, all significant genes will be plotted.

head(dturtle$dtu_table$barplot)
#> [1] "images/Pde4d_barplot.png"    "images/Atp6v0e2_barplot.png"
#> [3] "images/Lamb3_barplot.png"    "images/Cgref1_barplot.png"  
#> [5] "images/Nr6a1_barplot.png"    "images/Rtn2_barplot.png"
head(list.files("./images/"))
#> [1] "Abca3_barplot.png"   "Abcb9_barplot.png"   "Abce1_barplot.png"  
#> [4] "Abhd17a_barplot.png" "Abhd2_barplot.png"   "Abhd8_barplot.png"

Proportion heatmap

A different visualization option is a heatmap, where additional meta data can be displayed alongside the transcript proportions (plot_proportion_pheatmap()). This visualization uses the pheatmap package, the user can specify any of the available parameters to customize the results.

temp <- plot_proportion_pheatmap(dturtle = dturtle, genes = "Lcn2", 
                                 sample_meta_table_columns = c("sample_id","condition"),
                                 include_expression = TRUE, treeheight_col=20)
#> Creating 1 plots:
temp$Lcn2

By default, row and column annotations are added. This plot helps to examine the transcript composition of groups of cells. We see, the vast majority of luminal epithelial cell of mammary gland cells are exclusively expressing Lcn2-201. The Lcn2 expressing cells of the T cell cells are mostly split-expressing Lcn2-201 and Lcn2-203. The significant transcripts are indicated by the row annotation on the left side of the heatmap.

Again, we can save the plots to disk and add them to the dtu_table:

head(dturtle$dtu_table$pheatmap)
#> [1] "images/Pde4d_pheatmap.png"    "images/Atp6v0e2_pheatmap.png"
#> [3] "images/Lamb3_pheatmap.png"    "images/Cgref1_pheatmap.png"  
#> [5] "images/Nr6a1_pheatmap.png"    "images/Rtn2_pheatmap.png"
head(list.files("./images/"))
#> [1] "Abca3_barplot.png"  "Abca3_pheatmap.png" "Abcb9_barplot.png" 
#> [4] "Abcb9_pheatmap.png" "Abce1_barplot.png"  "Abce1_pheatmap.png"

Transcript overview

Until now, we looked at the different transcripts as abstract entities. Alongside proportional differences, the actual difference in the exon-intron structure of transcripts is of great importance for many research questions. This structure can be visualized with the plot_transcripts_view() functionality of DTUrtle.

This visualization is based on the Gviz package and needs a path to a GTF file (or a read-in object). In Import and format data we already imported a GTF file. This was subset to transcript-level (via the import_gtf() function), thus this is not sufficient for the visualization. We can reuse the actual GTF file though, which should in general match with the one used for the tx2gene data frame.

As we have ensured the one_to_one mapping in Import and format data and potentially renamed some genes, we should specify the one_to_one parameter in this call.

plot_transcripts_view(dturtle = dturtle, 
                              genes = "Lcn2", 
                              gtf = "../gencode.vM24.annotation.gtf", 
                              genome = 'mm10',
                              one_to_one = TRUE)
#> 
#> Importing gtf file from disk.
#> 
#> Performing one to one mapping in gtf
#> 
#> Found gtf GRanges for 1 of 1 provided genes.
#> 
#> Fetching ideogram tracks ...
#> Creating 1 plots:

#> $Lcn2
#> $Lcn2$chr2
#> Ideogram track 'chr2' for chromosome 2 of the mm10 genome 
#> 
#> $Lcn2$OverlayTrack
#> OverlayTrack 'OverlayTrack' containing 2 subtracks
#> 
#> $Lcn2$OverlayTrack
#> OverlayTrack 'OverlayTrack' containing 2 subtracks
#> 
#> $Lcn2$OverlayTrack
#> OverlayTrack 'OverlayTrack' containing 2 subtracks
#> 
#> $Lcn2$titles
#> An object of class "ImageMap"
#> Slot "coords":
#>              x1       y1   x2        y2
#> chr2          6 120.0000 58.2  186.8141
#> OverlayTrack  6 186.8141 58.2  506.5427
#> OverlayTrack  6 506.5427 58.2  826.2714
#> OverlayTrack  6 826.2714 58.2 1146.0000
#> 
#> Slot "tags":
#> $title
#>           chr2   OverlayTrack   OverlayTrack   OverlayTrack 
#>         "chr2" "OverlayTrack" "OverlayTrack" "OverlayTrack"

This visualization shows the structure of the transcripts of Lcn2. Our two significant transcripts (Lcn2-201 and Lcn2-203) are quite different, with alternative end points as well as some retained intron sequences in Lcn2-203. The arrows on the right side indicate the mean fitted proportional change in the comparison groups, thus showing a under-expression of Lcn2-201 in T cell compared to luminal epithelial cell of mammary gland.

The areas between exons indicate intron sequences, which have been compressed in this representation to highlight the exon structure. Only consensus introns are compressed to a defined minimal size. This can be turned off with reduce_introns=FALSE or alternatively reduced introns can be highlighted by setting a colorful reduce_introns_fill.

Analogous as before, we can save plots to disk and add them to the dtu_table:

head(dturtle$dtu_table$transcript_view)
#> [1] "images/Pde4d_transcripts.png"    "images/Atp6v0e2_transcripts.png"
#> [3] "images/Lamb3_transcripts.png"    "images/Cgref1_transcripts.png"  
#> [5] "images/Nr6a1_transcripts.png"    "images/Rtn2_transcripts.png"
head(list.files("./images/"))
#> [1] "Abca3_barplot.png"     "Abca3_pheatmap.png"    "Abca3_transcripts.png"
#> [4] "Abcb9_barplot.png"     "Abcb9_pheatmap.png"    "Abcb9_transcripts.png"

Dimensional reduction

Most RNA-seq analysis perform some kind of dimensional reduction method to visualize the samples in two (or three) dimensions. This is most useful for single-cell data sets or bulk experiments with a very high number of samples.

For these kind of experiments, DTUrtle provides a functionality to visualize the proportional differences in dimensional reduction coordinates. For the visualization with plot_dimensional_reduction(), we need a data frame of two coordinates per cell / samples - presumably from a dimensional reduction method like PCA, TSNE or UMAP.

In this example data set, the tiss object (from Sample metadata) contains usable TSNE coordinates:

head(tiss@dr$tsne@cell.embeddings, n=5)
#>                               tSNE_1     tSNE_2
#> 10X_P7_12_AAACCTGAGTTGAGAT  15.82803 -25.601272
#> 10X_P7_12_AAACCTGTCGTCACGG  39.56226 -14.578341
#> 10X_P7_12_AAACCTGTCTTGTACT -26.24403  -5.222502
#> 10X_P7_12_AAACCTGTCTTTAGTC -29.54653   2.016900
#> 10X_P7_12_AAACGGGCAGTGGGAT  22.76953  -5.078945

We can provide this data frame directly to the reduction_df parameter of plot_dimensional_reduction() function:

temp <- plot_dimensional_reduction(dturtle = dturtle, 
                           reduction_df = tiss@dr$tsne@cell.embeddings,
                           plot = "proportions", genes = "Lcn2", plot_scale = "free_y")
#> Retrieved coordinates for 2209 out of 2209 cells / samples.
#> Creating 1 plots:
grid::grid.draw(temp$Lcn2)

By default significant transcripts are highlighted by a red background name. Because of this modification, a grob object is returned. If indicate_significant_tx is set to FALSE, a more common ggplot2 object is returned.

Besides the actual transcript proportions, it is also possible to display the (log-transformed) expression values of the gene and the single transcripts in the plot.

head(dturtle$dtu_table$transcript_view)
#> [1] "images/Pde4d_transcripts.png"    "images/Atp6v0e2_transcripts.png"
#> [3] "images/Lamb3_transcripts.png"    "images/Cgref1_transcripts.png"  
#> [5] "images/Nr6a1_transcripts.png"    "images/Rtn2_transcripts.png"
head(list.files("./images/"))
#> [1] "Abca3_barplot.png"     "Abca3_dim_reduce.png"  "Abca3_pheatmap.png"   
#> [4] "Abca3_transcripts.png" "Abcb9_barplot.png"     "Abcb9_dim_reduce.png"

Visualize DTU table

The dturtle$dtu_table is now ready to be visualized as an interactive HTML-table. Please note, that it is optional to add any plots or additional columns to the table. Thus the visualization will work directly after calling create_dtu_table().

The dtu_table object looks like this:

head(dturtle$dtu_table)
#>           gene_ID   gene_qvalue minimal_tx_qvalue number_tx
#> Pde4d       Pde4d 2.507274e-209     1.631713e-169         5
#> Atp6v0e2 Atp6v0e2  2.576117e-21      3.358998e-23         3
#> Lamb3       Lamb3  2.737907e-65      3.706752e-66         4
#> Cgref1     Cgref1  1.990326e-36      0.000000e+00         2
#> Nr6a1       Nr6a1  2.403643e-37      2.382682e-40         4
#> Rtn2         Rtn2  2.575898e-04      1.000000e+00         3
#>          number_significant_tx
#> Pde4d                        3
#> Atp6v0e2                     1
#> Lamb3                        1
#> Cgref1                       2
#> Nr6a1                        1
#> Rtn2                         0
#>          max(T cell-luminal epithelial cell of mammary gland) chromosome
#> Pde4d                                              -0.8201809      chr13
#> Atp6v0e2                                            0.8084999       chr6
#> Lamb3                                               0.7969231       chr1
#> Cgref1                                             -0.7603835       chr5
#> Nr6a1                                               0.7230380       chr2
#> Rtn2                                                0.7157688       chr7
#>          max_tx_expr_in_T_cell
#> Pde4d               0.16514286
#> Atp6v0e2            0.03028571
#> Lamb3               0.16400000
#> Cgref1              0.05257143
#> Nr6a1               0.08742857
#> Rtn2                0.01371429
#>          max_tx_expr_in_luminal_epithelial_cell_of_mammary_gland
#> Pde4d                                                 0.50980392
#> Atp6v0e2                                              0.19389978
#> Lamb3                                                 0.16775599
#> Cgref1                                                0.32461874
#> Nr6a1                                                 0.16339869
#> Rtn2                                                  0.06971678
#>          abs_diff_expr_in                     barplot
#> Pde4d         0.344661064    images/Pde4d_barplot.png
#> Atp6v0e2      0.163614068 images/Atp6v0e2_barplot.png
#> Lamb3         0.003755991    images/Lamb3_barplot.png
#> Cgref1        0.272047308   images/Cgref1_barplot.png
#> Nr6a1         0.075970121    images/Nr6a1_barplot.png
#> Rtn2          0.056002490     images/Rtn2_barplot.png
#>                              pheatmap                 transcript_view
#> Pde4d       images/Pde4d_pheatmap.png    images/Pde4d_transcripts.png
#> Atp6v0e2 images/Atp6v0e2_pheatmap.png images/Atp6v0e2_transcripts.png
#> Lamb3       images/Lamb3_pheatmap.png    images/Lamb3_transcripts.png
#> Cgref1     images/Cgref1_pheatmap.png   images/Cgref1_transcripts.png
#> Nr6a1       images/Nr6a1_pheatmap.png    images/Nr6a1_transcripts.png
#> Rtn2         images/Rtn2_pheatmap.png     images/Rtn2_transcripts.png
#>                   dimensional_reduction
#> Pde4d       images/Pde4d_dim_reduce.png
#> Atp6v0e2 images/Atp6v0e2_dim_reduce.png
#> Lamb3       images/Lamb3_dim_reduce.png
#> Cgref1     images/Cgref1_dim_reduce.png
#> Nr6a1       images/Nr6a1_dim_reduce.png
#> Rtn2         images/Rtn2_dim_reduce.png

Before creating the actual table, we can optionally define column formatter functions, which colour the specified columns. The colouring might help with to quickly dissect the results.

DTUrtle come with some pre-defined column formatter functions (for p-values and percentages), other formatter functions from the formattable package can also be used. Advanced users might also define their own functions.

We create a named list, linking column names to formatter functions:

column_formatter_list <- list(
      "gene_qvalue" = table_pval_tile("white", "orange", digits = 3),
      "minimal_tx_qvalue" = table_pval_tile("white", "orange", digits = 3),
      "number_tx" = formattable::color_tile('white', "lightblue"),
      "number_significant_tx" = formattable::color_tile('white', "lightblue"),
      "max(T cell-luminal epithelial cell of mammary gland)" = table_percentage_bar('lightgreen', "#FF9999", digits=2),
      "max_tx_expr_in_T_cell" = table_percentage_bar('white', "lightblue", color_break = 0, digits=2),
      "max_tx_expr_in_luminal_epithelial_cell_of_mammary_gland" = table_percentage_bar('white', "lightblue", color_break = 0, digits=2),
      "abs_diff_expr_in" = table_percentage_bar('white', "lightblue", color_break = 0, digits=2))

This column_formatter_list is subsequently provided to plot_dtu_table():

plot_dtu_table(dturtle = dturtle, savepath = "my_results.html", 
               column_formatters = column_formatter_list)

Note: ️As seen above, the paths to the plots are relative. Please make sure that the saving directory in plot_dtu_table() is correctly set and the plots are reachable from that directory with the given path.

The links in the following example are just for demonstration purposes and do not work!