Analysis of the Hoffman et al. human single-cell RNA-seq data

This vignette exemplifies the analysis of single-cell RNA-seq data with DTUrtle. The data used in this vignette is publicly available as Bioproject PRJNA594939 and the used FASTQ-files can be downloaded from here. The corresponding publication from Hoffman et al. can be found here.

The following code shows an example of an DTUrtle workflow. Assume we have performed the pre-processing as described here and the R working directory is a newly created folder called dtu_results.

Setup

Load the DTUrtle package and set the BiocParallel parameter. It is recommended to perform computations in parallel, if possible.

library(DTUrtle)
#> Loading required package: sparseDRIMSeq
#> To cite DTUrtle in publications use:
#> 
#>  Tobias Tekath, Martin Dugas.
#>  Differential transcript usage analysis
#>  of bulk and single-cell RNA-seq data with DTUrtle.
#>  Bioinformatics,
#>  2021.
#>  https://doi.org/10.1093/bioinformatics/btab629.
#> 
#> Use citation(DTUrtle) for BibTeX information.
#use up to 10 cores for computation
biocpar <- BiocParallel::MulticoreParam(10)

Import and format data

We want to start by reading in our quantification counts, as well as a file specifying which transcript ID or name belongs to which gene ID or name.

Importing and processing GTF annotation (tx2gene)

To get this transcript to gene (tx2gene) mapping, we will utilize the already present Gencode annotation file gencode.v34.annotation.gtf. The import_gtf() function utilizes the a rtracklayer package and returns a transcript-level filtered version of the available data.

tx2gene <- import_gtf(gtf_file = "../gencode.v34.annotation.gtf")

head(tx2gene, n=3)
#>   seqnames start   end width strand source       type score phase
#> 1     chr1 11869 14409  2541      + HAVANA transcript    NA    NA
#> 2     chr1 12010 13670  1661      + HAVANA transcript    NA    NA
#> 3     chr1 14404 29570 15167      - HAVANA transcript    NA    NA
#>             gene_id     transcript_id                          gene_type
#> 1 ENSG00000223972.5 ENST00000456328.2 transcribed_unprocessed_pseudogene
#> 2 ENSG00000223972.5 ENST00000450305.2 transcribed_unprocessed_pseudogene
#> 3 ENSG00000227232.5 ENST00000488147.1             unprocessed_pseudogene
#>   gene_name                    transcript_type transcript_name level
#> 1   DDX11L1               processed_transcript     DDX11L1-202     2
#> 2   DDX11L1 transcribed_unprocessed_pseudogene     DDX11L1-201     2
#> 3    WASH7P             unprocessed_pseudogene      WASH7P-201     2
#>   transcript_support_level    hgnc_id   tag          havana_gene
#> 1                        1 HGNC:37102 basic OTTHUMG00000000961.1
#> 2                       NA HGNC:37102 basic OTTHUMG00000000961.1
#> 3                       NA HGNC:38034 basic OTTHUMG00000000958.1
#>      havana_transcript         ont protein_id ccdsid
#> 1 OTTHUMT00000362751.1        <NA>       <NA>   <NA>
#> 2 OTTHUMT00000002844.1 PGO:0000019       <NA>   <NA>
#> 3 OTTHUMT00000002839.1 PGO:0000005       <NA>   <NA>

There are a lot of columns present in the data frame, but at the moment we are mainly interested in the columns gene_id, gene_name, transcript_id and transcript_name.

As we want to use gene and transcript names as the main identifiers in our analysis (so we can directly say: Gene x is differential), we should ensure that each gene / transcript name maps only to a single gene / transcript id.

For this we can use the DTUrtle function one_to_one_mapping(), which checks if there are identifiers, which relate to the same name. If this is the case, the names (not the identifiers) are slightly altered by appending a number. If id_x and id_y both have the name ABC, the id_y name is altered to ABC_2 by default.

tx2gene$gene_name <- one_to_one_mapping(name = tx2gene$gene_name, id = tx2gene$gene_id)
#> Changed 78 names.
tx2gene$transcript_name <- one_to_one_mapping(name = tx2gene$transcript_name, id = tx2gene$transcript_id)
#> Changed 161 names.

We see that it was a good idea to ensure the one to one mapping, as many doublets have been corrected.

For the run_drimseq() tx2gene parameter, we need a data frame, where the first column specifies the transcript identifiers and the second column specifying the corresponding gene names. Rather than subsetting the data frame, a column reordering is proposed, so that additional data can still be used in further steps. DTUrtle makes sure to carry over additional data columns in the analysis steps. To reorder the columns of our tx2gene data frame, we can utilize the move_columns_to_front() functionality.

tx2gene <- move_columns_to_front(df = tx2gene, columns = c("transcript_name", "gene_name"))

head(tx2gene, n=5)
#>   transcript_name   gene_name seqnames start   end width strand  source
#> 1     DDX11L1-202     DDX11L1     chr1 11869 14409  2541      +  HAVANA
#> 2     DDX11L1-201     DDX11L1     chr1 12010 13670  1661      +  HAVANA
#> 3      WASH7P-201      WASH7P     chr1 14404 29570 15167      -  HAVANA
#> 4   MIR6859-1-201   MIR6859-1     chr1 17369 17436    68      - ENSEMBL
#> 5 MIR1302-2HG-202 MIR1302-2HG     chr1 29554 31097  1544      +  HAVANA
#>         type score phase           gene_id     transcript_id
#> 1 transcript    NA    NA ENSG00000223972.5 ENST00000456328.2
#> 2 transcript    NA    NA ENSG00000223972.5 ENST00000450305.2
#> 3 transcript    NA    NA ENSG00000227232.5 ENST00000488147.1
#> 4 transcript    NA    NA ENSG00000278267.1 ENST00000619216.1
#> 5 transcript    NA    NA ENSG00000243485.5 ENST00000473358.1
#>                            gene_type                    transcript_type level
#> 1 transcribed_unprocessed_pseudogene               processed_transcript     2
#> 2 transcribed_unprocessed_pseudogene transcribed_unprocessed_pseudogene     2
#> 3             unprocessed_pseudogene             unprocessed_pseudogene     2
#> 4                              miRNA                              miRNA     3
#> 5                             lncRNA                             lncRNA     2
#>   transcript_support_level    hgnc_id   tag          havana_gene
#> 1                        1 HGNC:37102 basic OTTHUMG00000000961.1
#> 2                       NA HGNC:37102 basic OTTHUMG00000000961.1
#> 3                       NA HGNC:38034 basic OTTHUMG00000000958.1
#> 4                       NA HGNC:50039 basic                 <NA>
#> 5                        5 HGNC:52482 basic OTTHUMG00000000959.1
#>      havana_transcript         ont protein_id ccdsid
#> 1 OTTHUMT00000362751.1        <NA>       <NA>   <NA>
#> 2 OTTHUMT00000002844.1 PGO:0000019       <NA>   <NA>
#> 3 OTTHUMT00000002839.1 PGO:0000005       <NA>   <NA>
#> 4                 <NA>        <NA>       <NA>   <NA>
#> 5 OTTHUMT00000002840.1        <NA>       <NA>   <NA>

This concludes the tx2gene formatting.

Reading in quantification data

The read-in of the quantification counts can be achieved with import_counts(), which uses the tximport package in the background. This function is able to parse the output of many different quantification tools. Advanced users might be able to tune parameters to parse arbitrary output files from currently not supported tools.

In the pre-processing vignette we quantified the counts with Alevin. The folder structure of the quantification results folder looks like this:

list.files("../alevin/")
#> [1] "index"     "sc_Dex2hr" "sc_EtOH"   "txmap.tsv"

We will create a named files vector, pointing to the quants_mat.gz file for each sample. The names help to differentiate the samples later on.

files <- Sys.glob("../alevin/sc_*/alevin/quants_mat.gz")
names(files) <- c("sc_Dex2hr", "sc_EtOH")

The files object looks like this:

#>                                  sc_Dex2hr 
#> "../alevin/sc_Dex2hr/alevin/quants_mat.gz" 
#>                                    sc_EtOH 
#>   "../alevin/sc_EtOH/alevin/quants_mat.gz"

The actual import will be performed with import_counts().

cts_list <- import_counts(files = files, type = "alevin")
#> Reading in 2 alevin runs.
#> importing alevin data is much faster after installing `fishpond` (>= 1.2.0)
#> reading in alevin gene-level counts across cells
#> importing alevin data is much faster after installing `fishpond` (>= 1.2.0)
#> reading in alevin gene-level counts across cells

The cts_list object is a named list, with a sparse Matrix per sample. In single-cell data, each sample normally consists of many different cells with an unique cell barcode. These cell barcodes might overlap between samples though. For this reason, many single-cell workflow use a cell barcode extension, uniquely assigning each cell to a sample. This can also be done in DTUrtle with combine_to_matrix(), which is only applicable if you are analyzing single-cell data.

This function will make sure that there are no duplicated barcodes between you samples, before merging the matrices together. If there are duplicated barcodes, a cell extension is added. Additionally, all not expressed features are removed to reduce the size of the data.

cts <- combine_to_matrix(tx_list = cts_list)
#> Merging matrices
#> Excluding 149854 overall not expressed features.
#> 77347 features left.

Apparently there were no duplicated cell barcodes between our samples, so a cell barcode extension is not necessary. Optionally you could still force the addition by specifying a vector of cell_extensions.

dim(cts)
#> [1] 77347   796

There are ~77k features left for 796 cells.

Sample metadata

Finally, we need a sample metadata data frame, specifying which sample belongs to which comparison group. This table is also convenient to store and carry over additional metadata.

For single-cell data, this is not a sample metadata data frame, but a cell metadata data frame. We have to specify the information on cell level, with the barcodes as identifiers.

If such a table is not already present, it can be easily prepared:

pd <- data.frame("id"=colnames(cts), 
                 "group"=ifelse(colnames(cts) %in% colnames(cts_list$sc_Dex2hr),"Dex2hr","EtOH"),  
                 stringsAsFactors = FALSE)

head(pd, n=5)
#>                   id  group
#> 1 GCCAGAGCGAATGTTAAC Dex2hr
#> 2 TGAATTCAGACTAGTCTG Dex2hr
#> 3 CGGCGTCTTACGTTCTTG Dex2hr
#> 4 TCGGGAGGACGATACCGA Dex2hr
#> 5 TCGGGACCTCTAGAGTGA Dex2hr

(optional) Estimate influence of priming-bias

As the Illumina SureCell protocol preferentially generates reads of the 3’-end of an mRNA, DTU effects of some specific transcripts might not be detectable. DTUrtle offers a novel scoring scheme, called “detection probability”, to assess which transcripts might be prone to be impaired by such a bias. This score can be computed with the DTUrtle function priming_bias_detection_probability().

The score calculation can be summarized like this: For each gene, a reference transcript is chosen (based on the expression data, selecting the major proportionally expressed transcript as reference). Each other transcript of a gene is now compared to this reference transcript, calculating where the first detectable exon-level difference occurs between the transcript entities - and at what relative coordinate this difference is located (looking from the priming enriched end). Based on this information, a detection probability is calculated, with 1 indicating no influence of the prime-biased reads on the detection ability and 0 indicating a very heavy influence. Thus, DTU effects for transcripts with a low score would be expected less likely to be detectable with the given data.

We can add this score information to an already existing table, like the tx2gene table (add_to_table = tx2gene). As we need exon-level information for the calculation, we should provide an unfiltered GTF GRanges or data frame object - or alternatively a file path.

unfilt_gtf <- import_gtf("../gencode.v34.annotation.gtf", feature_type = NULL, out_df = FALSE)
#set priming_enrichment to '3', as we expect reads enriched towards the 3' end of the mRNA.
tx2gene <- priming_bias_detection_probability(counts = cts, tx2gene = tx2gene, gtf = unfilt_gtf, one_to_one = TRUE,
                                              priming_enrichment = "3", add_to_table = tx2gene, BPPARAM = biocpar)
#> 
#> Performing one to one mapping in gtf
#> 
#> Found gtf GRanges for 60669 of 60669 provided genes.
#> Scoring transcripts of 60669 genes.
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |======================================================================| 100%

The newly added columns are detection_probability and used_as_ref:

head(tx2gene[,c("transcript_name", "gene_name", "detection_probability", "used_as_ref")])
#>   transcript_name   gene_name detection_probability used_as_ref
#> 1     DDX11L1-202     DDX11L1             1.0000000        TRUE
#> 2     DDX11L1-201     DDX11L1             0.6550633       FALSE
#> 3      WASH7P-201      WASH7P             1.0000000       FALSE
#> 4   MIR6859-1-201   MIR6859-1             1.0000000       FALSE
#> 5 MIR1302-2HG-202 MIR1302-2HG             1.0000000        TRUE
#> 6 MIR1302-2HG-201 MIR1302-2HG             1.0000000       FALSE

We can calculate, that a potential 3’-bias would not influence the majority of annotated transcripts, relevant for the DTU analysis:

#only genes with at least two transcript isoforms are relevant for DTU analysis
dtu_relevant_genes <- unique(tx2gene$gene_name[duplicated(tx2gene$gene_name)])
summary(tx2gene$detection_probability[tx2gene$gene_name %in% dtu_relevant_genes])
#>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#> 0.0001018 0.4102195 0.9696605 0.7164060 1.0000000 1.0000000

DTU analysis

We have prepared all necessary data to perform the differentially transcript usage (DTU) analysis. DTUrtle only needs two simple commands to perform it. Please be aware that these steps are the most compute intensive and, depending on your data, might take some time to complete. It is recommended to parallelize the computations with the BBPARAM parameter, if applicable.

First, we want to set-up and perform the statistical analysis with DRIMSeq, a DTU specialized statistical framework utilizing a Dirichlet-multinomial model. This can be done with the run_drimseq() command. We use the previously imported data as parameters, specifying which column in the cell metadata data frame contains ids and which the group information we want. We should also specify which of the groups should be compared (if there are more than two) and in which order. The order given in the cond_levels parameter also specifies the comparison formula.

Note: By default run_drimseq() converts sparse count matrix to a dense format for statistical computations (force_dense=TRUE). While this increases memory usage, it currently also reduces the run time. The computations can be performed keeping the sparse counts by setting force_dense=FALSE.

dturtle <- run_drimseq(counts = cts, tx2gene = tx2gene, pd=pd, id_col = "id",
                    cond_col = "group", cond_levels = c("Dex2hr", "EtOH"), filtering_strategy = "sc", 
                    BPPARAM = biocpar)
#> Using tx2gene columns:
#>  transcript_name ---> 'feature_id'
#>  gene_name ---> 'gene_id'
#> 
#> Comparing in 'group': 'Dex2hr' vs 'EtOH'
#> 
#> Proceed with cells/samples: 
#> Dex2hr   EtOH 
#>    399    397
#> 
#> Filtering...
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |======================================================================| 100%
#> Retain 21452 of 77347 features.
#> Removed 55895 features.
#> 
#> Performing statistical tests...
#> * Calculating mean gene expression..
#> * Estimating common precision..
#> ! Using common_precision = 993.3516 as prec_init !
#> * Estimating genewise precision..
#> ! Using loess fit as a shrinkage factor !
#> * Fitting the DM model..
#>    Using the one way approach.
#> * Fitting the BB model..
#>    Using the one way approach.
#> * Fitting the DM model..
#>    Using the one way approach.
#> * Calculating likelihood ratio statistics..
#> * Fitting the BB model..
#>    Using the one way approach.
#> * Calculating likelihood ratio statistics..

As in all statistical procedures, it is of favor to perform as few tests as possible but as much tests as necessary, to maintain a high statistical power. This is achieved by filtering the data to remove inherently uninteresting items, for example very lowly expressed genes or features. DTUrtle includes a powerful and customizable filtering functionality for this task, which is an optimized version of the dmFilter() function of the DRIMSeq package.

Above we used a predefined filtering strategy for single-cell data, requiring that features contribute at least 5% of the total expression in at least 5% of the cells of the smallest group. Additionally, all genes are filtered, which only have a single transcript left, as they can not be analyzed in DTU analysis. The filtering options can be extended or altered by the user.

dturtle$used_filtering_options
#> $DRIM
#> $DRIM$min_samps_gene_expr
#> [1] 0
#> 
#> $DRIM$min_samps_feature_expr
#> [1] 0
#> 
#> $DRIM$min_samps_feature_prop
#> [1] 19.85
#> 
#> $DRIM$min_gene_expr
#> [1] 0
#> 
#> $DRIM$min_feature_expr
#> [1] 0
#> 
#> $DRIM$min_feature_prop
#> [1] 0.05
#> 
#> $DRIM$run_gene_twice
#> [1] TRUE

The resulting dturtle object will be used as our main results object, storing all necessary and interesting data of the analysis. It is a simple and easy-accessible list, which can be easily extended / altered by the user. By default three different meta data tables are generated:

meta_table_gene: Contains gene level meta data.
meta_table_tx: Contains transcript level meta data.
meta_table_sample: Contains sample level meta data (as the created pd data frame)

These meta data tables are used in for visualization purposes and can be extended by the user.

dturtle$meta_table_gene[1:5,1:5]
#>        gene     exp_in exp_in_Dex2hr exp_in_EtOH seqnames
#> A1BG   A1BG 0.03768844    0.04511278  0.03022670    chr19
#> AAAS   AAAS 0.04020101    0.04511278  0.03526448    chr12
#> AAGAB AAGAB 0.07286432    0.09273183  0.05289673    chr15
#> AAMDC AAMDC 0.03894472    0.03508772  0.04282116    chr11
#> AAMP   AAMP 0.09422111    0.09774436  0.09068010     chr2

As proposed in Love et al. (2018), we will use a two-stage statistical testing procedure together with a post-hoc filtering on the standard deviations in proportions (posthoc_and_stager()). We will use stageR to determine genes, that show a overall significant change in transcript proportions. For these significant genes, we will try to pinpoint specific transcripts, which significantly drive this overall change. As a result, we will have a list of significant genes (genes showing the overall change) and a list of significant transcripts (one or more transcripts of the significant genes). Please note, that not every significant gene does have one or more significant transcripts. It is not always possible to attribute the overall change in proportions to single transcripts. These two list of significant items are computed and corrected against a overall false discovery rate (OFDR).

Additionally, we will apply a post-hoc filtering scheme to improve the targeted OFDR control level. The filtering strategy will discard transcripts, which standard deviation of the proportion per cell/sample is below the specified threshold. For example by setting posthoc=0.1, we will exclude all transcripts, which proportional expression (in regard to the total gene expression) deviates by less than 0.1 between cells. This filtering step should mostly exclude ‘uninteresting’ transcripts, which would not have been called as significant either way.

dturtle <- posthoc_and_stager(dturtle = dturtle, ofdr = 0.05, posthoc = 0.1)
#> Posthoc filtered 20608 features.
#> The returned adjusted p-values are based on a stage-wise testing approach and are only valid for the provided target OFDR level of 5%. If a different target OFDR level is of interest,the entire adjustment should be re-run.
#> Found 33 significant genes with 34 significant transcripts (OFDR: 0.05)

The dturtle object now contains additional elements, including the lists of significant genes and significant transcripts.

head(dturtle$sig_gene)
#> [1] "AGR2"    "ATP5F1B" "COX6A1"  "EEF1A1"  "HSPB1"   "NPM1"
head(dturtle$sig_tx)
#>          AGR2          AGR2       ATP5F1B       ATP5F1B        COX6A1 
#>    "AGR2-203"    "AGR2-204" "ATP5F1B-204" "ATP5F1B-201"  "COX6A1-201" 
#>        COX6A1 
#>  "COX6A1-202"

(optional) DGE analysis

Alongside a DTU analysis, a DGE analysis might be of interest for your research question. DTUrtle offers a basic DGE calling workflow for bulk and single-cell RNA-seq data via DESeq2.

To utilize this workflow, we should re-scale our imported counts. The transcript-level count matrix has been scaled for the DTU analysis, but for a DGE analysis un-normalized counts should be used (as DESeq2 normalizes internally). We can simply re-import the count data, by providing the already defined files object to the DGE analysis specific function import_dge_counts(). This function will make sure that the counts are imported but not scaled and also summarizes to gene-level.

Our files object looks like this:

head(files)
#>                                  sc_Dex2hr 
#> "../alevin/sc_Dex2hr/alevin/quants_mat.gz" 
#>                                    sc_EtOH 
#>   "../alevin/sc_EtOH/alevin/quants_mat.gz"

We re-import the files and use combine_to_matrix() to create one big matrix (as we did for the DTU analysis):

cts_dge <- import_dge_counts(files, type="alevin", tx2gene=tx2gene)
#> Reading in 2 alevin runs.
#> importing alevin data is much faster after installing `fishpond` (>= 1.2.0)
#> reading in alevin gene-level counts across cells
#> importing alevin data is much faster after installing `fishpond` (>= 1.2.0)
#> reading in alevin gene-level counts across cells
cts_dge <- combine_to_matrix(tx_list = cts_dge)
#> Merging matrices
#> Excluding 41828 overall not expressed features.
#> 18412 features left.

With this data, we can perform the DGE analysis with DESeq2 with the DTUrtle function run_deseq2(). DESeq2 is one of the gold-standard tools for DGE calling in bulk RNA-seq data and showed very good performance for single-cell data in multiple benchmarks. The DESeq2 vignette recommends adjusting some parameters for DGE analysis of single-cell data - DTUrtle incorporates these recommendations and adjusts the specific parameters based on the provided dge_calling_strategy.

After the DESeq2 DGE analysis, the estimated log2 fold changes are shrunken (preferably with apeglm) - to provide a secondary ranking variable beside the statistical significance. If a shrinkage with apeglm is performed, run_deseq2() defaults to also compute s-values rather than adjusted p-values. The underlying hypothesis for s-values and (standard) p-values differs slightly, with s-values hypothesis being presumably preferential in a real biological context. Far more information about all these topics can be found in the excellent DESeq2 vignette and the associated publications.

run_deseq2() will tell the user, if one or more advised packages are missing. It is strongly advised to follow these recommendations.

dturtle$dge_analysis <- run_deseq2(counts = cts_dge, pd = pd, id_col = "id",
                                   cond_col = "group", cond_levels = c("Dex2hr", "EtOH"), 
                                   lfc_threshold = 1, sig_threshold = 0.01, dge_calling_strategy = "sc",
                                   BPPARAM = biocpar)
#> 
#> Comparing in 'group': 'Dex2hr' vs 'EtOH'
#> 
#> Proceed with cells/samples: 
#> Dex2hr   EtOH 
#>    399    397
#> converting counts to integer mode
#> Warning in (function (object, test = c("Wald", "LRT"), fitType =
#> c("parametric", : parallelization of DESeq() is not implemented for
#> fitType='glmGamPoi'
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates: 10 workers
#> mean-dispersion relationship
#> final dispersion estimates, fitting model and testing: 10 workers
#> using 'apeglm' for LFC shrinkage. If used in published research, please cite:
#>     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
#>     sequence count data: removing the noise and preserving large differences.
#>     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
#> computing FSOS 'false sign or small' s-values (T=1)
#> Found 7 significant DEGs.
#>      Over-expressed: 7
#>      Under-expressed: 0

We provided a log2 fold change threshold of 1 (on log2 scale), thus preferring an effect size of 2x or more. The output of run_deseq2() is a list, containing various elements (see run_deseq2()’s description). This result list can easily added to an existing DTUrtle object, as shown above.

We can now identify genes which show both a DGE and DTU signal:

dtu_dge_genes <- intersect(dturtle$sig_gene, dturtle$dge_analysis$results_sig$gene)
length(dtu_dge_genes)
#> [1] 0

Result aggregation and visualization

The DTUrtle package contains multiple visualization options, enabling a in-depth inspection.

DTU table creation

We will start by aggregating the analysis results to a data frame with create_dtu_table(). This function is highly flexible and allows aggregation of gene or transcript level metadata in various ways. By default some useful information are included in the dtu table, in this example we further specify to include the seqnames column of the gene level metadata (which contains chromosome information) as well as the maximal expressed in ratio from the transcript level metadata.

dturtle <- create_dtu_table(dturtle = dturtle, add_gene_metadata = list("chromosome"="seqnames"), 
                               add_tx_metadata = list("tx_expr_in_max" = c("exp_in", max)))

head(dturtle$dtu_table, n=5)
#>        gene_ID  gene_qvalue minimal_tx_qvalue number_tx number_significant_tx
#> PSMB4    PSMB4 2.384975e-03      7.503188e-04         7                     1
#> RPL13A  RPL13A 8.107779e-11      9.569946e-12        12                     1
#> RPLP1    RPLP1 1.877647e-15      2.179297e-17         4                     3
#> COX6A1  COX6A1 1.793097e-02      3.124200e-03         3                     2
#> RPL7A    RPL7A 8.693153e-32      1.946863e-19         6                     3
#>        max(Dex2hr-EtOH) chromosome tx_expr_in_max
#> PSMB4         0.1549467       chr1      0.6306533
#> RPL13A        0.1484132      chr19      0.9183417
#> RPLP1         0.1381594      chr15      0.9623116
#> COX6A1       -0.1380718      chr12      0.5778894
#> RPL7A         0.1369719       chr9      0.9786432

The column definitions are as follows:

“gene_ID”: Gene name or identifier used for the analysis.
“gene_qvalue”: Multiple testing corrected p-value (a.k.a. q-value) comparing all transcripts together between the two groups (“gene level”).
“minimal_tx_qvalue”: The minimal multiple testing corrected p-value from comparing all transcripts individually between the two groups (“transcript level”). I.e. the q-value of the most significant transcript.
“number_tx”: The number of analyzed transcripts for the specific gene.
“number_significant_tx”: The number of significant transcripts from the ‘transcript level’ analysis.
“max(Dex2hr-EtOH)”: Maximal proportional difference between the two groups (Dex2hr vs EtOH). E.g. one transcript of ‘PSMB4’ is ~15% more expressed in ‘Dex2hr’ cells compared to ‘EtOH’ cells.
“chromosome”: the chromosome the gene resides on.
“tx_expr_in_max”: The fraction of cells, the most expressed transcript is expressed in. “Expressed in” is defined as expression>0.

This table is our basis for creating an interactive HTML-table of the results.

Proportion barplot

As a first visualization option we will create a barplot of the proportions of each transcript per sample. We can use the function plot_proportion_barplot() for this task, which also adds the mean proportion fit per subgroup to the plot (by default as a red line).

As an example, we will create the plot for the gene PPIA, which is one of the significant genes found in the analysis. We will optionally provide the gene_id for PPIA, which is stored in the column gene_id.1 of dturtle$meta_table_gene.

temp <- plot_proportion_barplot(dturtle = dturtle, genes = "PPIA", meta_gene_id = "gene_id.1")
#> Creating 1 plots:
temp$PPIA

We see, that most of the proportional differences for PPIA are driven by 2 of the 7 transcripts (PPIA-204 and PPIA-209). These transcripts are also the significant transcripts found in the analysis (as they are marked in red). We additionally see, that a DGE analysis would most likely not report PPIA as differential, because the mean gene expression is quite alike between the groups (together with a common high coefficient of variation).

For the interactive HTML-table we would need to save the images to disk (in the to-be-created sub folder “images” of the current working directory). There is also a convenience option, to directly add the file paths to the dtu_table. As multiple plots are created, we can provide a BiocParallel object to speed up the creation. If no specific genes are provided, all significant genes will be plotted.

dturtle <- plot_proportion_barplot(dturtle = dturtle, 
                                   meta_gene_id = "gene_id.1", 
                                   savepath = "images", 
                                   add_to_table = "barplot",
                                   BPPARAM = biocpar)
#> Creating 33 plots:
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |======================================================================| 100%

head(dturtle$dtu_table$barplot)
#> [1] "images/PSMB4_barplot.png"   "images/RPL13A_barplot.png" 
#> [3] "images/RPLP1_barplot.png"   "images/COX6A1_barplot.png" 
#> [5] "images/RPL7A_barplot.png"   "images/ATP5F1B_barplot.png"
head(list.files("./images/"))
#> [1] "AGR2_barplot.png"    "ATP5F1B_barplot.png" "COX6A1_barplot.png" 
#> [4] "EEF1A1_barplot.png"  "HSPB1_barplot.png"   "NPM1_barplot.png"

Proportion heatmap

A different visualization option is a heatmap, where additional meta data can be displayed alongside the transcript proportions (plot_proportion_pheatmap()). This visualization uses the pheatmap package, the user can specify any of the available parameters to customize the results.

temp <- plot_proportion_pheatmap(dturtle = dturtle, genes = "PPIA", 
                                 include_expression = TRUE, treeheight_col=20)
#> Creating 1 plots:
temp$PPIA

By default, row and column annotations are added. This plot helps to examine the transcript composition of groups of cells. We see, there is a subgroup of cells, that seem to almost only express both PPIA-204 and PPIA-209 at equal proportions. The same goes for a even smaller subgroup of cells, which express PPIA-204 alongside with PPIA-202.

Again, we can save the plots to disk and add them to the dtu_table:

dturtle <- plot_proportion_pheatmap(dturtle = dturtle,
                                    include_expression = TRUE,
                                    treeheight_col=20,
                                    savepath = "images", 
                                    add_to_table = "pheatmap",
                                    BPPARAM = biocpar)
#> Creating 33 plots:
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |======================================================================| 100%

head(dturtle$dtu_table$pheatmap)
#> [1] "images/PSMB4_pheatmap.png"   "images/RPL13A_pheatmap.png" 
#> [3] "images/RPLP1_pheatmap.png"   "images/COX6A1_pheatmap.png" 
#> [5] "images/RPL7A_pheatmap.png"   "images/ATP5F1B_pheatmap.png"
head(list.files("./images/"))
#> [1] "AGR2_barplot.png"     "AGR2_pheatmap.png"    "ATP5F1B_barplot.png" 
#> [4] "ATP5F1B_pheatmap.png" "COX6A1_barplot.png"   "COX6A1_pheatmap.png"

Transcript overview

Until now, we looked at the different transcripts as abstract entities. Alongside proportional differences, the actual difference in the exon-intron structure of transcripts is of great importance for many research questions. This structure can be visualized with the plot_transcripts_view() functionality of DTUrtle.

This visualization is based on the Gviz package and needs a path to a GTF file (or a read-in object). In Import and format data we already imported a GTF file. This was subset to transcript-level (via the import_gtf() function), thus this is not sufficient for the visualization. We can reuse the actual GTF file though, which should in general match with the one used for the tx2gene data frame.

As we have ensured the one_to_one mapping in Import and format data and potentially renamed some genes, we should specify the one_to_one parameter in this call.

plot_transcripts_view(dturtle = dturtle, 
                              genes = "PPIA", 
                              gtf = "../gencode.v34.annotation.gtf", 
                              genome = 'hg38',
                              one_to_one = TRUE)
#> 
#> Importing gtf file from disk.
#> 
#> Performing one to one mapping in gtf
#> 
#> Found gtf GRanges for 1 of 1 provided genes.
#> 
#> Fetching ideogram tracks ...
#> Creating 1 plots:

#> $PPIA
#> $PPIA$chr7
#> Ideogram track 'chr7' for chromosome 7 of the hg38 genome 
#> 
#> $PPIA$OverlayTrack
#> OverlayTrack 'OverlayTrack' containing 2 subtracks
#> 
#> $PPIA$OverlayTrack
#> OverlayTrack 'OverlayTrack' containing 2 subtracks
#> 
#> $PPIA$OverlayTrack
#> OverlayTrack 'OverlayTrack' containing 2 subtracks
#> 
#> $PPIA$OverlayTrack
#> OverlayTrack 'OverlayTrack' containing 2 subtracks
#> 
#> $PPIA$OverlayTrack
#> OverlayTrack 'OverlayTrack' containing 2 subtracks
#> 
#> $PPIA$OverlayTrack
#> OverlayTrack 'OverlayTrack' containing 2 subtracks
#> 
#> $PPIA$OverlayTrack
#> OverlayTrack 'OverlayTrack' containing 2 subtracks
#> 
#> $PPIA$titles
#> An object of class "ImageMap"
#> Slot "coords":
#>              x1        y1   x2        y2
#> chr7          6  120.0000 58.2  186.8141
#> OverlayTrack  6  186.8141 58.2  323.8406
#> OverlayTrack  6  323.8406 58.2  460.8672
#> OverlayTrack  6  460.8672 58.2  597.8938
#> OverlayTrack  6  597.8938 58.2  734.9203
#> OverlayTrack  6  734.9203 58.2  871.9469
#> OverlayTrack  6  871.9469 58.2 1008.9734
#> OverlayTrack  6 1008.9734 58.2 1146.0000
#> 
#> Slot "tags":
#> $title
#>           chr7   OverlayTrack   OverlayTrack   OverlayTrack   OverlayTrack 
#>         "chr7" "OverlayTrack" "OverlayTrack" "OverlayTrack" "OverlayTrack" 
#>   OverlayTrack   OverlayTrack   OverlayTrack 
#> "OverlayTrack" "OverlayTrack" "OverlayTrack"

This visualization shows the structure of the transcripts of PPIA. Our two significant transcripts (PPIA-204 and PPIA-209) are quite different, with alternative start and end points as well as some retained intron sequences. The arrows on the right side indicate the mean fitted proportional change in the comparison groups, thus showing a over-expression of PPIA-204 in Dex2hr compared to EtOH.

The areas between exons indicate intron sequences, which have been compressed in this representation to highlight the exon structure. Only consensus introns are compressed to a defined minimal size. This can be turned off with reduce_introns=FALSE or alternatively reduced introns can be highlighted by setting a colorful reduce_introns_fill.

Analogous as before, we can save plots to disk and add them to the dtu_table:

dturtle <- plot_transcripts_view(dturtle = dturtle, 
                                 gtf = "../gencode.v34.annotation.gtf", 
                                 genome = 'hg38', 
                                 one_to_one = TRUE,
                                 savepath = "images", 
                                 add_to_table = "transcript_view",
                                 BPPARAM = biocpar)
#> 
#> Importing gtf file from disk.
#> 
#> Performing one to one mapping in gtf
#> 
#> Found gtf GRanges for 33 of 33 provided genes.
#> 
#> Fetching ideogram tracks ...
#> Creating 33 plots:
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |======================================================================| 100%

head(dturtle$dtu_table$transcript_view)
#> [1] "images/PSMB4_transcripts.png"   "images/RPL13A_transcripts.png" 
#> [3] "images/RPLP1_transcripts.png"   "images/COX6A1_transcripts.png" 
#> [5] "images/RPL7A_transcripts.png"   "images/ATP5F1B_transcripts.png"
head(list.files("./images/"))
#> [1] "AGR2_barplot.png"        "AGR2_pheatmap.png"      
#> [3] "AGR2_transcripts.png"    "ATP5F1B_barplot.png"    
#> [5] "ATP5F1B_pheatmap.png"    "ATP5F1B_transcripts.png"

Visualize DTU table

The dturtle$dtu_table is now ready to be visualized as an interactive HTML-table. Please note, that it is optional to add any plots or additional columns to the table. Thus the visualization will work directly after calling create_dtu_table().

The dtu_table object looks like this:

head(dturtle$dtu_table)
#>         gene_ID  gene_qvalue minimal_tx_qvalue number_tx number_significant_tx
#> PSMB4     PSMB4 2.384975e-03      7.503188e-04         7                     1
#> RPL13A   RPL13A 8.107779e-11      9.569946e-12        12                     1
#> RPLP1     RPLP1 1.877647e-15      2.179297e-17         4                     3
#> COX6A1   COX6A1 1.793097e-02      3.124200e-03         3                     2
#> RPL7A     RPL7A 8.693153e-32      1.946863e-19         6                     3
#> ATP5F1B ATP5F1B 4.810916e-05      1.274058e-06         5                     2
#>         max(Dex2hr-EtOH) chromosome tx_expr_in_max                    barplot
#> PSMB4          0.1549467       chr1      0.6306533   images/PSMB4_barplot.png
#> RPL13A         0.1484132      chr19      0.9183417  images/RPL13A_barplot.png
#> RPLP1          0.1381594      chr15      0.9623116   images/RPLP1_barplot.png
#> COX6A1        -0.1380718      chr12      0.5778894  images/COX6A1_barplot.png
#> RPL7A          0.1369719       chr9      0.9786432   images/RPL7A_barplot.png
#> ATP5F1B        0.1359806      chr12      0.8128141 images/ATP5F1B_barplot.png
#>                            pheatmap                transcript_view
#> PSMB4     images/PSMB4_pheatmap.png   images/PSMB4_transcripts.png
#> RPL13A   images/RPL13A_pheatmap.png  images/RPL13A_transcripts.png
#> RPLP1     images/RPLP1_pheatmap.png   images/RPLP1_transcripts.png
#> COX6A1   images/COX6A1_pheatmap.png  images/COX6A1_transcripts.png
#> RPL7A     images/RPL7A_pheatmap.png   images/RPL7A_transcripts.png
#> ATP5F1B images/ATP5F1B_pheatmap.png images/ATP5F1B_transcripts.png

Before creating the actual table, we can optionally define column formatter functions, which colour the specified columns. The colouring might help with to quickly dissect the results.

DTUrtle come with some pre-defined column formatter functions (for p-values and percentages), other formatter functions from the formattable package can also be used. Advanced users might also define their own functions.

We create a named list, linking column names to formatter functions:

column_formatter_list <- list(
      "gene_qvalue" = table_pval_tile("white", "orange", digits = 3),
      "min_tx_qval" = table_pval_tile("white", "orange", digits = 3),
      "n_tx" = formattable::color_tile('white', "lightblue"),
      "n_sig_tx" = formattable::color_tile('white', "lightblue"),
      "max(Dex2hr-EtOH)" = table_percentage_bar('lightgreen', "#FF9999", digits=2),
      "tx_expr_in_max" = table_percentage_bar('white', "lightblue", color_break = 0, digits=2))

This column_formatter_list is subsequently provided to plot_dtu_table():

plot_dtu_table(dturtle = dturtle, savepath = "my_results.html", 
               column_formatters = column_formatter_list)

Note: ️As seen above, the paths to the plots are relative. Please make sure that the saving directory in plot_dtu_table() is correctly set and the plots are reachable from that directory with the given path.

The links in the following example are just for demonstration purposes and do not work!

For later use we can save the final DTUrtle object to disk:

saveRDS(dturtle, "./dturtle_res.RDS")

Workflow with Seurat object

Some users might already have analyzed their (gene-level) single-cell data and performed clustering, dimensional reduction etc. For these users, DTUrtle offers direct support for Seurat objects (Seurat V3 or newer).

Seurat object creation

For this tutorial, we will create a Seurat object on the fly - users with an already existing object might want to directly jump to the second step.

We create a Seurat object on the gene-level counts, using the DTUrtle functionality summarize_to_gene():

library(Seurat)

#summarize counts to gene-level
cts_gene <- summarize_to_gene(cts, tx2gene = tx2gene, fun = "sum")

##compare dimensions
dim(cts)
#> [1] 77347   796
dim(cts_gene)
#> [1] 18412   796

We use these gene level counts for a standard Seurat analysis (Normalization + Scaling, PCA, Clustering, UMAP).

seur_obj <- CreateSeuratObject(counts = cts_gene, meta.data = data.frame(row.names=pd$id, "group"=pd$group, stringsAsFactors = FALSE))
#> Warning: Feature names cannot have underscores ('_'), replacing with dashes
#> ('-')

#check if names are kept
all(Cells(seur_obj) %in% pd$id)
#> [1] TRUE

#run basic pipeline
seur_obj <- NormalizeData(seur_obj)
seur_obj <- FindVariableFeatures(seur_obj, selection.method = "vst", nfeatures = 5000)
seur_obj <- ScaleData(seur_obj, model.use = "negbinom")
#> Centering and scaling data matrix
seur_obj <- RunPCA(seur_obj, seed.use = 42, features = VariableFeatures(object = seur_obj))
#> PC_ 1 
#> Positive:  RPL7A, RPS3A, PPIA, RPSA, RPL13, RPS10, RPL6, RPS17, KRT19, TUBA1B 
#>     KRT18, EEF1G, AGR2, C1QBP, UBB, GAGE12F, NME1-NME2, PRDX1, SNRPD3, PPIB 
#>     SERPINA6, ARL6IP1, GSTM3, NQO1, TP53I3, PIP, C15orf65, GNG11, LRRC59, UBE2C 
#> Negative:  RPL6P27, AC091153.3, PNMT, BBS9, DHX58, PCLO, ZNF629, STIM1, AL031058.1, SAMD4A 
#>     RAB28, SVIL, LINC02163, FOXP1, PRELID2, RBM38, ARHGEF16, AF235103.3, XYLB, KIFC3 
#>     ZNF248, TMEM229B, FUT1, MAPK15, AC131009.3, KDM3A, ARNTL, TMEM17, CHPF2, CIART 
#> PC_ 2 
#> Positive:  RBAK, AL049840.6, HMGCS2, JUNB, RAD51B, INO80D, MAN1B1-DT, CAMKK1, ELF2, CSRNP2 
#>     F3, ELMOD3, AC011445.2, AC090809.1, AC012073.1, DNAJC27-AS1, PPP2CB, PARD3, LRFN1, ASXL2 
#>     ZNF345, LINC01205, AL845472.2, BX470102.2, RALY-AS1, CARD10, LINC00664, AC022784.1, HEATR5B, ID1 
#> Negative:  DNMT3B, SBF2, SRP14-AS1, TNFAIP2, AC104667.2, AC022784.5, TFCP2, RPL13P12, C17orf75, KPNA3 
#>     UBFD1, PGR, CC2D1A, ZNF131, RBM47, FOXRED2, ZBTB5, RPRD2, ZNF239, FDXACB1 
#>     LTV1, ALDH1A3, MYO1F, CEP55, ADAT1, GLUL, UBE3C, CCDC43, IGSF5, VWA5A 
#> PC_ 3 
#> Positive:  NME1-NME2, PSMD7, PIP, AC116036.2, EEF1G, ZNF197, C19orf53, PRDX1, KRT18, KRT19 
#>     ZCRB1, RPS10-NUDT3, LENG1, NADSYN1, PLOD3, GAGE12J, RPSA, LINC01089, PRDM11, AL358472.2 
#>     SETD4, TAC3, CCNB1, PPM1J, BMP1, MCM7, FASTKD3, RPL13, MFHAS1, DAGLB 
#> Negative:  H1-0, DNAJB4, SLC16A9, LINC01446, ZDHHC9, ZFYVE26, FAM210B, DDX21, MIS18A, ZNF131 
#>     KLHL18, ADAT1, UBE3C, GORAB, RNF103, SLC26A11, DDI2, BTN2A2, TMEM87A, TAF1C 
#>     ARHGAP11A, MRPS35, EYA3, LINC01311, ZNF492, MT2A, LPAR1, DHFR2, UBFD1, FOXRED2 
#> PC_ 4 
#> Positive:  C15orf65, SRP14-AS1, TNFAIP2, BX890604.2, DNMT3B, EPHX1, AC022784.5, C17orf75, ALKBH2, DCAF12 
#>     EML4, AC104667.2, TSPAN12, BICD2, MTATP6P1, C9orf85, CHST15, ASMTL, TBILA, ARVCF 
#>     MREG, RGS12, PIP5K1C, ABCC1, ZNF702P, PLAAT2, SRGN, LRRC59, BMP1, RIC8B 
#> Negative:  PANK2-AS1, RAB20, MT1X, AKR1A1, ZNF37BP, MX1, NOL6, LIPT2, TJP1, CLIC3 
#>     EXOC1, SLC15A3, COMMD2, ELMO3, PNPLA3, FAM122A, RIPK2, TRMT2A, AC005332.4, CYP2U1 
#>     OSBPL3, KIF16B, AC132872.1, KIAA1211L, ARSD, TBC1D10A, AC245033.1, PRDX1, SPATA13, AL356277.3 
#> PC_ 5 
#> Positive:  DCAF7, PPP4R4, IGFBP5, GOPC, RAB7A, PTPN14, ANLN, CLPTM1L, CYP2B7P, YAP1 
#>     FAM120A, ACIN1, ZIC2, GRHL2, ALOXE3, SOAT1, TPH1, SLC2A6, PDCD4, HLCS 
#>     XXYLT1, NIPSNAP2, ARHGAP11A, OTUB2, TXNIP, RACGAP1, ATR, ERO1B, KIF14, SLC5A3 
#> Negative:  KRT18, AC027682.6, AC129926.1, KRT19, PPIA, RPL17-C18orf32, METTL4, NUDT14, CCDC97, TYMP 
#>     CDIP1, FBXO2, RAB19, CCNJ, HSD17B8, GRAMD2B, NME1-NME2, PMS2P4, C1GALT1C1, RASSF8-AS1 
#>     AC010913.1, GLIS2, SPR, IFNGR1, FASTKD2, ENPP4, ZNF696, PANK2-AS1, NME4, EFCAB2

ElbowPlot(seur_obj)


#7 PCs seem like a reasonable cut-off
dims <- 1:7

seur_obj <- FindNeighbors(seur_obj, dims = dims, k.param=10)
#> Computing nearest neighbor graph
#> Computing SNN
seur_obj <- FindClusters(seur_obj, resolution = 0.7, n.start=100, n.iter=100, random.seed = 42)
#> Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
#> 
#> Number of nodes: 796
#> Number of edges: 11143
#> 
#> Running Louvain algorithm...
#> Maximum modularity in 100 random starts: 0.7644
#> Number of communities: 6
#> Elapsed time: 0 seconds
seur_obj <- RunUMAP(seur_obj, dims = dims, n.epochs=1000, seed.use=42, umap.method = "umap-learn")

The computed UMAP coordinates look like this:

DimPlot(seur_obj, reduction = "umap", group.by = c("seurat_clusters", "group"))

Seurat usage with DTUrtle

Seurat objects can be utilized in two parts of the DTUrtle workflow: combine_to_matrix() and run_umap().

To better highlight the functionality, lets assume we only want to perform a DTU analysis in one of the identified cell clusters - Cluster 4.

We subset the Seurat object to this cluster only:

seur_obj_sub <- subset(seur_obj, subset = seurat_clusters==4)
dim(seur_obj_sub)
#> [1] 18412   103

We can provide this subsetted Seurat object as a parameter in combine_to_matrix(). DTUrtle will try to auto-detect cell name extensions and will subset the count matrix to the cells from the Seurat object. Additionally we can provide a tx2gene data frame, which will be added as feature-level metadata to the Seurat assay.

seur_obj_sub <- combine_to_matrix(tx_list = cts_list, seurat_obj = seur_obj_sub, tx2gene = tx2gene)
#> Trying to infer cell extensions from Seurat object
#> Could not find cell name extension in Seurat object.
#> Merging matrices
#> Of 796 cells, 103 (13%) could be found in the provided seurat object.
#> 693 (87%) are unique to the transcriptomic files.
#> The seurat object contains 0 additional cells.
#> Subsetting!
#> Excluding 172918 overall not expressed features.
#> 54283 features left.
#> Adding assay 'dtutx'

The transcript level expression of the 103 cells have been carried over to the Seurat object in an own assay.

seur_obj_sub@assays$dtutx@data[1:4, 1:4]
#> 4 x 4 sparse Matrix of class "dgCMatrix"
#>                TGAATTCAGACTAGTCTG CGGCGTCTTACGTTCTTG TCGGGACCTCTAGAGTGA
#> AL627309.1-201                  .                  .                  .
#> CICP27-201                      .                  .                  .
#> AL627309.6-201                  .                  .                  .
#> AL627309.5-206                  .                  .                  .
#>                AACGTGTCCAAGTTGCTC
#> AL627309.1-201                  .
#> CICP27-201                      .
#> AL627309.6-201                  .
#> AL627309.5-206                  .
seur_obj_sub@assays$dtutx@meta.features[1:4,]
#>                transcript_name  gene_name seqnames  start    end width strand
#> AL627309.1-201  AL627309.1-201 AL627309.1     chr1  89295 120932 31638      -
#> CICP27-201          CICP27-201     CICP27     chr1 131025 134836  3812      +
#> AL627309.6-201  AL627309.6-201 AL627309.6     chr1 135141 135895   755      -
#> AL627309.5-206  AL627309.5-206 AL627309.5     chr1 167129 169240  2112      -
#>                source       type score phase           gene_id
#> AL627309.1-201 HAVANA transcript    NA    NA ENSG00000238009.6
#> CICP27-201     HAVANA transcript    NA    NA ENSG00000233750.3
#> AL627309.6-201 HAVANA transcript    NA    NA ENSG00000268903.1
#> AL627309.5-206 HAVANA transcript    NA    NA ENSG00000241860.7
#>                    transcript_id            gene_type      transcript_type
#> AL627309.1-201 ENST00000466430.5               lncRNA               lncRNA
#> CICP27-201     ENST00000442987.3 processed_pseudogene processed_pseudogene
#> AL627309.6-201 ENST00000494149.2 processed_pseudogene processed_pseudogene
#> AL627309.5-206 ENST00000655252.1               lncRNA               lncRNA
#>                level transcript_support_level    hgnc_id    tag
#> AL627309.1-201     2                        5       <NA>  basic
#> CICP27-201         1                       NA HGNC:48835  basic
#> AL627309.6-201     2                       NA       <NA>  basic
#> AL627309.5-206     2                     <NA>       <NA> TAGENE
#>                         havana_gene    havana_transcript         ont protein_id
#> AL627309.1-201 OTTHUMG00000001096.1 OTTHUMT00000003225.1        <NA>       <NA>
#> CICP27-201     OTTHUMG00000001257.1 OTTHUMT00000003691.1 PGO:0000004       <NA>
#> AL627309.6-201 OTTHUMG00000182518.1 OTTHUMT00000461982.1 PGO:0000004       <NA>
#> AL627309.5-206 OTTHUMG00000002480.1 OTTHUMT00000528579.1        <NA>       <NA>
#>                ccdsid detection_probability used_as_ref
#> AL627309.1-201   <NA>                     1        TRUE
#> CICP27-201       <NA>                     1       FALSE
#> AL627309.6-201   <NA>                     1       FALSE
#> AL627309.5-206   <NA>                     1        TRUE

Seurat functions can be used for visualizing expression of single transcripts:

FeaturePlot(seur_obj_sub, features = c("PPIA-204", "PPIA-209"))

The enhanced Seurat object can be used in the actual DTU analysis, if we provide it to the DTUrtle function run_drimseq(). As the tx2gene information was already added in the combine_to_matrix() step, we only need to provide the column names of the transcript and gene identifier columns of the feature level metadata. Everything we need for the analysis is present in the Seurat object:

dturtle_sub <- run_drimseq(counts = seur_obj_sub, pd = seur_obj_sub@meta.data,
                           tx2gene = c("transcript_name", "gene_name"),
                           cond_col="group", cond_levels = c("Dex2hr", "EtOH"), 
                           filtering_strategy = "sc", BPPARAM = biocpar)
#> Using tx2gene columns:
#>  transcript_name ---> 'feature_id'
#>  gene_name ---> 'gene_id'
#> 
#> Comparing in 'group': 'Dex2hr' vs 'EtOH'
#> 
#> Proceed with cells/samples: 
#> Dex2hr   EtOH 
#>     68     35
#> 
#> Filtering...
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |======================================================================| 100%
#> Retain 35301 of 54283 features.
#> Removed 18982 features.
#> 
#> Performing statistical tests...
#> * Calculating mean gene expression..
#> * Estimating common precision..
#> ! Using common_precision = 778.5341 as prec_init !
#> * Estimating genewise precision..
#> ! Using loess fit as a shrinkage factor !
#> * Fitting the DM model..
#>    Using the one way approach.
#> * Fitting the BB model..
#>    Using the one way approach.
#> * Fitting the DM model..
#>    Using the one way approach.
#> * Calculating likelihood ratio statistics..
#> * Fitting the BB model..
#>    Using the one way approach.
#> * Calculating likelihood ratio statistics..

With this DTUrtle object we can proceed in the standard way, i.e. with posthoc_and_stager() and following visualization.

dturtle_sub <- posthoc_and_stager(dturtle = dturtle_sub, ofdr = 0.05, posthoc = 0.1)
#> Posthoc filtered 33200 features.
#> The returned adjusted p-values are based on a stage-wise testing approach and are only valid for the provided target OFDR level of 5%. If a different target OFDR level is of interest,the entire adjustment should be re-run.
#> Found 2 significant genes with 1 significant transcripts (OFDR: 0.05)