Estimate transcript detection probability for 3'- or 5'-biased data

priming_bias_detection_probability(
  counts,
  gtf,
  tx2gene,
  one_to_one = NULL,
  priming_enrichment = "3",
  genes = NULL,
  add_to_table = NULL,
  BPPARAM = BiocParallel::SerialParam()
)

Arguments

counts

A (sparse) count matrix in which each column represents a sample or cell and each row represents a single transcript isoform. The counts are used to infer each gene's reference transcript.

gtf

A GTF file with gene- and exon-level information. Can be a file path or a previously imported GTF file (as a GRanges object or data frame). It is advisable to read in the file like this: gtf <- import_gtf("YOUR_PATH", feature_type = NULL, out_df = FALSE).

tx2gene

A data frame in which the first column contains feature identifiers and the second column contains the corresponding gene identifiers. Feature identifiers must match the row names of the counts object.

one_to_one

Set to TRUE if a one-to-one mapping of gene/transcript identifiers to their respective names was enforced beforehand (with one_to_one_mapping()). If a non-default extension character (ext) was used, specify that extension character.

priming_enrichment

Specify which end of the mRNA is enriched by your (single-cell) RNA-seq protocol. Can be either '3' or '5', for the 3' end or the 5' end, respectively.

genes

(Optional) A subset of genes to analyse. If NULL, all genes in the provided tx2gene data frame are used.

add_to_table

(Optional) A data frame to which the detection_probability and used_as_ref columns are added directly. The first column of the data frame must contain transcript identifiers.

BPPARAM

If multicore processing should be used, specify a BiocParallelParam object here. Options include SerialParam() (default) for single-core processing and MulticoreParam('number_cores') for multicore processing. See the BiocParallel documentation for more information.
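The requirement that feature identifiers in tx2gene match the row names of counts can be checked before calling the function. The following is a minimal base R sketch on made-up data; the column names transcript_id and gene_id and the object names missing_tx and all_matched are illustrative assumptions, not part of the package.

```r
# Toy count matrix (made-up values): rows are transcripts, columns are cells.
counts <- matrix(
  c(5, 0, 2,
    1, 3, 0,
    0, 0, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("tx1", "tx2", "tx3"), c("cell1", "cell2", "cell3"))
)

# First column: feature (transcript) identifiers; second column: gene identifiers.
# The column names are hypothetical; only the column order matters here.
tx2gene <- data.frame(
  transcript_id = c("tx1", "tx2", "tx3"),
  gene_id = c("geneA", "geneA", "geneB"),
  stringsAsFactors = FALSE
)

# Transcripts present in the count matrix but absent from the mapping.
missing_tx <- setdiff(rownames(counts), tx2gene[[1]])

# TRUE when every row of the count matrix has a gene assignment.
all_matched <- length(missing_tx) == 0
```

If all_matched is FALSE, missing_tx lists the identifiers that would need to be added to the mapping or removed from the count matrix.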

Value

A data frame with the columns:

  • gene: A gene identifier.

  • tx: A transcript identifier.

  • detection_probability: The calculated detection probability score.

  • used_as_ref: Logical vector indicating which transcript was used as the reference transcript for its gene.

If a valid data frame is provided in add_to_table, that data frame is returned with the detection_probability and used_as_ref columns added.
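The effect of add_to_table can be pictured as a join on the transcript identifier. The sketch below reproduces that idea in base R with invented values; it is an illustration of the described result, not the package's implementation, and the table my_table with its columns transcript and p_value is hypothetical.

```r
# Made-up scores in the documented output layout.
scores <- data.frame(
  gene = c("geneA", "geneA", "geneB"),
  tx = c("tx1", "tx2", "tx3"),
  detection_probability = c(1.0, 0.3, 1.0),
  used_as_ref = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical user table; its first column holds transcript identifiers.
my_table <- data.frame(
  transcript = c("tx3", "tx1", "tx2"),
  p_value = c(0.5, 0.01, 0.2),
  stringsAsFactors = FALSE
)

# Look up each transcript of the user table in the score table
# and append the two columns in the user table's row order.
idx <- match(my_table[[1]], scores$tx)
my_table$detection_probability <- scores$detection_probability[idx]
my_table$used_as_ref <- scores$used_as_ref[idx]
```

Using match() keeps the row order of the user table and yields NA for transcripts without a score.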

Details

Many (single-cell) RNA-seq protocols do not produce reads from the full length of the mRNA, but instead favour fragments from its 3' or 5' end. Such protocols limit the ability to detect DTU events for specific transcripts, for example transcripts of the same gene whose first exon-level difference lies close to the non-favoured priming end. This function estimates which transcripts might not show up in a DTU analysis because of this effect.

First, the function sets the transcript with the largest share of a gene's expression as the reference transcript for that gene. If no count information is available, the first transcript is chosen as the reference.
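The reference selection described above can be sketched in base R as follows. This is one possible reading, assuming that the expression share is computed from counts summed over all samples; the package may compute proportions differently. The helper pick_reference() and all data are hypothetical.

```r
# Toy data (made-up values): geneA has counts, geneB has none.
counts <- matrix(
  c(10, 20,
     1,  2,
     0,  0,
     0,  0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("tx1", "tx2", "tx3", "tx4"), c("cell1", "cell2"))
)
tx2gene <- data.frame(
  tx = c("tx1", "tx2", "tx3", "tx4"),
  gene = c("geneA", "geneA", "geneB", "geneB"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns one reference transcript per gene.
pick_reference <- function(counts, tx2gene) {
  # Group transcript identifiers (first column) by gene (second column).
  tx_by_gene <- split(tx2gene[[1]], tx2gene[[2]])
  vapply(tx_by_gene, function(txs) {
    # Total counts per transcript, summed over all samples (an assumption).
    totals <- rowSums(counts[txs, , drop = FALSE])
    if (sum(totals) == 0) {
      # No count information: fall back to the first transcript.
      txs[1]
    } else {
      # Transcript with the largest share of the gene's total counts.
      proportions <- totals / sum(totals)
      names(which.max(proportions))
    }
  }, character(1))
}

reference <- pick_reference(counts, tx2gene)
```

In this toy example, geneA gets the transcript with the most counts and geneB, which has no counts, falls back to its first listed transcript.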

Then, for each other transcript of that gene, the first exon-level difference compared to the reference transcript is detected and a probability score is calculated based on the exonic distance between that difference and the favoured priming end.

The probability score ranges from 0 to 1, where 1 indicates no influence of the protocol's priming bias and 0 indicates an extremely heavy influence. DTU effects for transcripts with a low score are therefore less likely to be detectable with the given data.
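One way to use the score is to flag non-reference transcripts whose DTU effects are unlikely to be detectable. The sketch below works on an invented data frame in the documented output layout; the score values and the cut-off min_score = 0.2 are arbitrary assumptions, not recommendations from the package.

```r
# Made-up result table with the documented columns.
result <- data.frame(
  gene = c("geneA", "geneA", "geneB", "geneB"),
  tx = c("tx1", "tx2", "tx3", "tx4"),
  detection_probability = c(1.00, 0.15, 1.00, 0.80),
  used_as_ref = c(TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Hypothetical threshold below which a transcript is treated as hard to detect.
min_score <- 0.2

# Non-reference transcripts with a score below the threshold.
hard_to_detect <- result[
  !result$used_as_ref & result$detection_probability < min_score,
]
```

A suitable threshold depends on the protocol and the data, so it is worth inspecting the score distribution before filtering.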

See also