Import the quantification results of many RNA-seq quantifiers for DGE analysis, including alevin and bustools for single-cell data.
Most likely the first step in your DTUrtle DGE analysis.
import_dge_counts(files, type, ...)
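As a minimal sketch of how the `files` argument could be prepared: the sample names, the `quant` directory and the Salmon `quant.sf` layout below are assumptions made for illustration, not part of this documentation. Naming the vector is what lets the sample names carry over into the imported data.

```r
# Hypothetical sample names and directory layout; adjust to your own data.
samples <- c("ctrl_1", "ctrl_2", "treat_1", "treat_2")

# One Salmon quantification file per sample (quant.sf is Salmon's output file).
files <- file.path("quant", samples, "quant.sf")

# Name the vector so the sample names are kept during import.
names(files) <- samples

# The import itself needs DTUrtle, the files on disk and a tx2gene data frame,
# so it is only shown as a comment here:
# cts_list <- import_dge_counts(files, type = "salmon", tx2gene = tx2gene)
```

Without names, the samples can only be told apart by their file paths or positions, so naming the vector up front is usually the more convenient choice.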
Arguments
files |
Vector of files to be imported. Can optionally be named to keep the sample names. |
type |
Type of the quantification data. All tools supported by tximport can be selected, in addition to the newly implemented bustools support for single-cell data. If you have single-cell data, the use of alevin or bustools is recommended.
'salmon'
'alevin'
'kallisto'
'bustools'
'rsem'
'stringtie'
'sailfish'
'none'
|
... |
Arguments passed on to tximport::tximport
txIn logical, whether the incoming files are transcript level (default TRUE)
txOut logical, whether the function should just output
transcript-level (default FALSE)
countsFromAbundance character, either "no" (default), "scaledTPM",
"lengthScaledTPM", or "dtuScaledTPM".
Whether to generate estimated counts using abundance estimates:
scaled up to library size (scaledTPM),
scaled using the average transcript length over samples
and then the library size (lengthScaledTPM), or
scaled using the median transcript length among isoforms of a gene,
and then the library size (dtuScaledTPM).
dtuScaledTPM is designed for DTU analysis in combination with txOut=TRUE,
and it requires specifying a tx2gene data.frame.
dtuScaledTPM works such that within a gene, values from all samples and
all transcripts get scaled by the same fixed median transcript length.
If using scaledTPM, lengthScaledTPM, or dtuScaledTPM,
the counts are no longer correlated across samples with transcript length,
and so the length offset matrix should not be used.
tx2gene a two-column data.frame linking transcript id (column 1)
to gene id (column 2).
The column names are not relevant, but this column order must be used.
This argument is required for gene-level summarization, and the tximport
vignette describes how to construct this data.frame (see Details below).
An automated solution to avoid having to create tx2gene if
one has quantified with Salmon or alevin with human or mouse transcriptomes
is to use the tximeta function from the tximeta Bioconductor package.
varReduce whether to reduce per-sample inferential replicates
information into a matrix of sample variances (default FALSE).
alevin computes inferential variance by default for bootstrap
inferential replicates, so this argument is ignored/not necessary.
dropInfReps whether to skip reading in inferential replicates
(default FALSE). For alevin, tximport will still read in the
inferential variance matrix if it exists.
infRepStat a function to re-compute counts and abundances from the
inferential replicates, e.g. matrixStats::rowMedians to re-compute counts
as the median of the inferential replicates. The order of operations is:
first counts are re-computed, then abundances are re-computed.
Following this, if countsFromAbundance is not "no",
tximport will again re-compute counts from the re-computed abundances.
infRepStat should operate on rows of a matrix. (default is NULL)
ignoreTxVersion logical, whether to split the tx id on the '.' character
to remove version information to facilitate matching with the tx id in tx2gene
(default FALSE)
ignoreAfterBar logical, whether to split the tx id on the '|' character
to facilitate matching with the tx id in tx2gene (default FALSE)
geneIdCol name of column with gene id. If missing, the tx2gene
argument can be used.
txIdCol name of column with tx id
abundanceCol name of column with abundances (e.g. TPM or FPKM)
countsCol name of column with estimated counts
lengthCol name of column with feature length information
importer a function used to read in the files
existenceOptional logical, should tximport not check if files exist before attempting
import (default FALSE, meaning files must exist according to file.exists)
sparse logical, whether to try to import data sparsely (default is FALSE).
Initial implementation for txOut=TRUE, countsFromAbundance="no"
or "scaledTPM", no inferential replicates. Only counts matrix
is returned (and abundance matrix if using "scaledTPM").
sparseThreshold the minimum threshold for including a count as a
non-zero count during sparse import (default is 1)
readLength numeric, the read length used to calculate counts from
StringTie's output of coverage. Default value (from StringTie) is 75.
The formula used to calculate counts is:
cov * transcript length / read length
alevinArgs named list, with logical elements filterBarcodes,
tierImport, forceSlow. See the Details of tximport::tximport for definitions.
|
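The three `countsFromAbundance` scalings described above can be illustrated on toy data. This is a base R sketch of the arithmetic as worded in the argument description, not tximport's actual code; all matrices, IDs and the helper `scale_to_library_size()` are made up for illustration, and the dtuScaledTPM part assumes the median is taken over the per-transcript average lengths.

```r
# Toy data: 3 transcripts x 2 samples (all values are invented).
dn <- list(c("tx1", "tx2", "tx3"), c("s1", "s2"))
abundance  <- matrix(c(10, 30, 60,  20, 20, 60), ncol = 2, dimnames = dn)
counts     <- matrix(c(100, 300, 600,  400, 400, 1200), ncol = 2, dimnames = dn)
eff_length <- matrix(c(1000, 2000, 1500,  1200, 1800, 1500), ncol = 2, dimnames = dn)
tx2gene    <- data.frame(tx = c("tx1", "tx2", "tx3"), gene = c("gA", "gA", "gB"))

# Hypothetical helper: rescale each column of x so that it sums to the
# library size (column sum) of the original estimated counts.
scale_to_library_size <- function(x, counts) {
  sweep(x, 2, colSums(counts) / colSums(x), `*`)
}

# scaledTPM: abundances scaled up to library size.
scaled_tpm <- scale_to_library_size(abundance, counts)

# lengthScaledTPM: first scale by the average transcript length over samples,
# then scale to library size.
mean_len <- rowMeans(eff_length)
length_scaled_tpm <- scale_to_library_size(abundance * mean_len, counts)

# dtuScaledTPM: one fixed length per gene, here the median over its isoforms
# of the average lengths (assumption), then scale to library size.
gene_median_len <- ave(mean_len, tx2gene$gene, FUN = median)
dtu_scaled_tpm <- scale_to_library_size(abundance * gene_median_len, counts)
```

In all three variants the column sums equal the original library sizes; only the relative weighting of the transcripts differs. Because dtuScaledTPM applies the same length to every isoform of a gene, the isoform proportions within a gene are the same as in the abundances.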
Value
For bulk data: A list containing a count matrix, a matrix of average effective transcript lengths and a flag indicating how counts were inferred from abundance estimates.
For single-cell data: A list of count matrices per sample. These should be combined, and optionally added to a Seurat object, with combine_to_matrix().
Details
It is necessary to specify a tx2gene data frame as a parameter.
This data frame must be a two-column data frame linking transcript id (column 1) to gene id/name (column 2).
Please see import_gtf(), move_columns_to_front() and one_to_one_mapping() to help with tx2gene creation.
See also combine_to_matrix(), when output is a list of single-cell runs.
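To make the required tx2gene layout concrete, here is a minimal base R sketch. The annotation table, its column names and all IDs are hypothetical; in practice the table would come from your annotation (for example via import_gtf()), and the DTUrtle helpers named above may be more convenient than the plain subsetting used here.

```r
# Hypothetical annotation table; column names and IDs are invented.
annotation <- data.frame(
  gene_name     = c("GeneA", "GeneA", "GeneB"),
  transcript_id = c("ENST0001.2", "ENST0002.1", "ENST0003.5"),
  biotype       = c("protein_coding", "protein_coding", "lncRNA"),
  stringsAsFactors = FALSE
)

# tx2gene: transcript id must be column 1, gene id/name column 2.
# The column names themselves are not relevant, only the order.
tx2gene <- annotation[, c("transcript_id", "gene_name")]

# Basic sanity checks before importing.
stopifnot(ncol(tx2gene) == 2)
stopifnot(!anyDuplicated(tx2gene[[1]]))

# Transcript ids in the quantification files may lack version suffixes.
# This base R check shows which tx2gene ids match once the part after '.'
# is removed (similar in spirit to what ignoreTxVersion is described to do).
quant_ids <- c("ENST0001", "ENST0002", "ENST0003")  # hypothetical ids
matched <- sub("\\..*$", "", tx2gene$transcript_id) %in% quant_ids
```

If `matched` contains `FALSE` entries, the identifiers in the quantification files and in tx2gene differ by more than the version suffix, and the mapping should be checked before running the import.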
See also