Import quantification results for DGE analysis — import_dge_counts

Import quantification results from many RNA-seq quantifiers for DGE analysis, including alevin and bustools for single-cell data. This is most likely the first step in your DTUrtle DGE analysis.

import_dge_counts(files, type, ...)

Arguments

files

Vector of files to be imported. Optionally, the vector can be named to keep the sample names.
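As a minimal sketch, a named vector of files could be built in base R like this. The directory `salmon_out`, the sample names and the file name `quant.sf` are hypothetical placeholders, not paths prescribed by this documentation.

```r
# Hypothetical sample names and output directory (assumptions for illustration).
samples <- c("sample_1", "sample_2", "sample_3")

# One quantification file per sample; adjust the path to your own layout.
files <- file.path("salmon_out", samples, "quant.sf")

# Naming the vector keeps the sample names in the imported data.
names(files) <- samples

print(files)
```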

type

Type of the quantification data. All tools supported by tximport can be selected, in addition to the newly implemented bustools support for single-cell data. For single-cell data, the use of alevin or bustools is recommended.

'salmon'
'alevin'
'kallisto'
'bustools'
'rsem'
'stringtie'
'sailfish'
'none'

...

Arguments passed on to tximport::tximport

txIn

logical, whether the incoming files are transcript level (default TRUE)

txOut

logical, whether the function should output only transcript-level estimates (default FALSE)

countsFromAbundance

character, either "no" (default), "scaledTPM", "lengthScaledTPM", or "dtuScaledTPM". Whether to generate estimated counts using abundance estimates:

scaled up to library size (scaledTPM),
scaled using the average transcript length over samples and then the library size (lengthScaledTPM), or
scaled using the median transcript length among isoforms of a gene, and then the library size (dtuScaledTPM).

dtuScaledTPM is designed for DTU analysis in combination with txOut=TRUE, and it requires specifying a tx2gene data.frame. dtuScaledTPM works such that within a gene, values from all samples and all transcripts get scaled by the same fixed median transcript length. If using scaledTPM, lengthScaledTPM, or geneLengthScaledTPM, the counts are no longer correlated across samples with transcript length, and so the length offset matrix should not be used.
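The three scaling variants described above can be illustrated with a small base R sketch on toy data. All values, the helper `scale_to_library_size` and the gene assignment `gene` are assumptions made for illustration; this is not the tximport implementation. For the dtuScaledTPM case the sketch assumes that the per-transcript length is the average over samples before the median is taken per gene.

```r
# Toy data (assumed values): 3 transcripts x 2 samples, filled column-wise.
dn <- list(c("tx1", "tx2", "tx3"), c("s1", "s2"))
counts    <- matrix(c(100, 200, 700,  300, 300, 400), nrow = 3, dimnames = dn)
abundance <- matrix(c(10, 30, 60,  40, 20, 40), nrow = 3, dimnames = dn)
tx_length <- matrix(c(1000, 2000, 500,  1200, 1800, 500), nrow = 3, dimnames = dn)

# Hypothetical gene assignment: tx1 and tx2 are isoforms of the same gene.
gene <- c("gA", "gA", "gB")

# Rescale each column so that it sums to the original library size.
scale_to_library_size <- function(new_counts, counts) {
  t(t(new_counts) * (colSums(counts) / colSums(new_counts)))
}

# scaledTPM: abundances scaled up to library size.
scaled_tpm <- scale_to_library_size(abundance, counts)

# lengthScaledTPM: abundances times the average transcript length over
# samples, then scaled to library size.
avg_length <- rowMeans(tx_length)
length_scaled_tpm <- scale_to_library_size(abundance * avg_length, counts)

# dtuScaledTPM: abundances times the median length among the isoforms of a
# gene, then scaled to library size.
median_length <- ave(avg_length, gene, FUN = median)
dtu_scaled_tpm <- scale_to_library_size(abundance * median_length, counts)

print(colSums(scaled_tpm))
```

In every variant the column sums match the original library sizes; only the distribution of counts across transcripts changes.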

tx2gene

a two-column data.frame linking transcript id (column 1) to gene id (column 2). The column names are not relevant, but this column order must be used. This argument is required for gene-level summarization, and the tximport vignette describes how to construct this data.frame (see Details below). If one has quantified with Salmon or alevin using human or mouse transcriptomes, an automated alternative to creating tx2gene by hand is the tximeta function from the tximeta Bioconductor package.
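A minimal sketch of the required column order, using a toy annotation table. The identifiers and the column names `transcript_id` and `gene_name` are hypothetical.

```r
# Hypothetical annotation with the gene column first (assumed for illustration).
annotation <- data.frame(
  gene_name = c("GeneA", "GeneA", "GeneB"),
  transcript_id = c("ENST0001.1", "ENST0002.3", "ENST0003.2"),
  stringsAsFactors = FALSE
)

# tx2gene needs the transcript id in column 1 and the gene id in column 2;
# the column names themselves do not matter.
tx2gene <- annotation[, c("transcript_id", "gene_name")]

print(tx2gene)
```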

varReduce

whether to reduce per-sample inferential replicates information into a matrix of sample variances, variance (default FALSE). alevin computes inferential variance by default for bootstrap inferential replicates, so this argument is ignored/not necessary.

dropInfReps

whether to skip reading in inferential replicates (default FALSE). For alevin, tximport will still read in the inferential variance matrix if it exists

infRepStat

a function to re-compute counts and abundances from the inferential replicates, e.g. matrixStats::rowMedians to re-compute counts as the median of the inferential replicates. The order of operations is: first counts are re-computed, then abundances are re-computed. Following this, if countsFromAbundance is not "no", tximport will again re-compute counts from the re-computed abundances. infRepStat should operate on rows of a matrix. (default is NULL)

ignoreTxVersion

logical, whether to split the tx id on the '.' character to remove version information to facilitate matching with the tx id in tx2gene (default FALSE)

ignoreAfterBar

logical, whether to split the tx id on the '|' character to facilitate matching with the tx id in tx2gene (default FALSE)
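The effect of ignoreAfterBar and ignoreTxVersion on transcript ids can be sketched with base R string substitution. The ids below are made up, and the regular expressions only mirror the behaviour described here; they are not claimed to be tximport's exact code.

```r
# Hypothetical transcript ids in a GENCODE-like "id|more|fields" layout.
tx_ids <- c("ENST0001.1|ENSG0001.5|GeneA", "ENST0002.3|ENSG0001.5|GeneA")

# ignoreAfterBar: drop everything from the first '|' onwards.
after_bar <- sub("\\|.*", "", tx_ids)

# ignoreTxVersion: drop everything from the first '.' onwards.
no_version <- sub("\\..*", "", after_bar)

print(after_bar)
print(no_version)
```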

geneIdCol

name of column with gene id. If missing, the tx2gene argument can be used

txIdCol

name of column with tx id

abundanceCol

name of column with abundances (e.g. TPM or FPKM)

countsCol

name of column with estimated counts

lengthCol

name of column with feature length information

importer

a function used to read in the files

existenceOptional

logical, should tximport not check if files exist before attempting import (default FALSE, meaning files must exist according to file.exists)

sparse

logical, whether to try to import data sparsely (default is FALSE). Initial implementation for txOut=TRUE, countsFromAbundance="no" or "scaledTPM", no inferential replicates. Only counts matrix is returned (and abundance matrix if using "scaledTPM")

sparseThreshold

the minimum threshold for including a count as a non-zero count during sparse import (default is 1)

readLength

numeric, the read length used to calculate counts from StringTie's output of coverage. Default value (from StringTie) is 75. The formula used to calculate counts is: cov * transcript length / read length
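As a worked example of the formula, with assumed toy values for coverage and transcript length:

```r
# Assumed toy values for one transcript.
coverage <- 10            # StringTie coverage (cov)
transcript_length <- 1500 # transcript length in bases
read_length <- 75         # default readLength

# counts = cov * transcript length / read length
estimated_counts <- coverage * transcript_length / read_length

print(estimated_counts)
```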

alevinArgs

named list, with logical elements filterBarcodes, tierImport, forceSlow. See Details for definitions.

Value

For bulk data: A list containing a count matrix, a matrix of average effective transcript lengths and a flag indicating how counts were inferred from abundance estimates.
For single-cell data: A list of count matrices per sample. Should be combined and optionally added to a Seurat object with combine_to_matrix().

Details

It is necessary to specify a tx2gene data frame as a parameter. This data frame must be a two-column data frame linking transcript id (column 1) to gene id/name (column 2). Please see import_gtf(), move_columns_to_front() and one_to_one_mapping() for help with tx2gene creation. See also combine_to_matrix() when the output is a list of single-cell runs.
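A hedged usage sketch for bulk Salmon data follows. The file paths and the toy tx2gene table are hypothetical, and the import is only attempted when DTUrtle is installed and the files actually exist, so the snippet can be run as is without failing.

```r
# Hypothetical file paths and a toy tx2gene table (assumptions for illustration).
samples <- c("sample_1", "sample_2")
files <- file.path("salmon_out", samples, "quant.sf")
names(files) <- samples
tx2gene <- data.frame(
  transcript_id = c("tx1", "tx2"),
  gene_name = c("GeneA", "GeneA"),
  stringsAsFactors = FALSE
)

# Only attempt the import when DTUrtle is installed and the files exist.
can_import <- requireNamespace("DTUrtle", quietly = TRUE) && all(file.exists(files))

if (can_import) {
  # tx2gene is passed with transcript id in column 1 and gene name in column 2.
  cts_dge <- DTUrtle::import_dge_counts(files, type = "salmon", tx2gene = tx2gene)
  str(cts_dge, max.level = 1)
}
```

In a real analysis, tx2gene would typically be derived from the annotation, for example with the helper functions named above.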