Classifier structure
- class celltypist.classifier.Classifier(filename: AnnData | str = '', model: Model | str = '', transpose: bool = False, gene_file: str | None = None, cell_file: str | None = None)
Bases: object
Class that wraps the celltyping and majority voting processes.
- Parameters:
filename – Path to the input count matrix (supported types are csv, txt, tsv, tab and mtx) or AnnData file (h5ad). For a count matrix, a cell-by-gene format is desirable (see transpose for more information). An AnnData object already loaded in memory is also accepted. Genes should be gene symbols. Providing non-expressed genes as well is preferred.
model – A Model object that wraps the logistic Classifier and the StandardScaler, the path to the desired model file, or the model name.
transpose – Whether to transpose the input matrix. Set to True if filename is provided in a gene-by-cell format. (Default: False)
gene_file – Path to the file which stores each gene per line corresponding to the genes used in the provided mtx file. Ignored if filename is not provided in the mtx format.
cell_file – Path to the file which stores each cell per line corresponding to the cells used in the provided mtx file. Ignored if filename is not provided in the mtx format.
- filename
Path to the input dataset. This attribute exists only when the input is a file path.
- indata
The expression matrix used for predictions stored in the log1p normalized format.
- indata_genes
All the genes included in the input data.
- indata_names
All the cells included in the input data.
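As an illustration of how the constructor and the methods documented on this page fit together, here is a minimal sketch. The helper name `annotate_file`, the default model name, and the example path are assumptions made for illustration; only the `Classifier` signatures listed on this page are relied on, and the import is deferred so that defining the helper does not itself require celltypist.

```python
def annotate_file(path, model="Immune_All_Low.pkl", min_prop=0.0):
    """Predict cell types for one input file, then refine them by majority voting.

    `path` and the default `model` name are illustrative assumptions; pass
    whatever input file and model are actually available to you.
    """
    # Deferred import: celltypist is only needed when the helper is called.
    from celltypist.classifier import Classifier

    # Wrap the input data and the model (cell-by-gene input, so no transpose).
    clf = Classifier(filename=path, model=model)

    # Per-cell predictions using the default 'best match' mode.
    predictions = clf.celltype(mode="best match")

    # Over-cluster the input; the resolution is left at its size-based default.
    clusters = clf.over_cluster()

    # majority_vote is a static method, so it is called on the class.
    return Classifier.majority_vote(predictions, clusters, min_prop=min_prop)
```

A call might look like `result = annotate_file("my_cells.h5ad")`, where the file name is hypothetical.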
- celltype(mode: str = 'best match', p_thres: float = 0.5) → AnnotationResult
Run celltyping jobs to predict cell types of input data.
- Parameters:
mode – The way cell prediction is performed. For each query cell, the default (‘best match’) is to choose the cell type with the largest score/probability as the final prediction. Setting to ‘prob match’ will enable a multi-label classification, which assigns 0 (i.e., unassigned), 1, or >=2 cell type labels to each query cell. (Default: ‘best match’)
p_thres – Probability threshold for the multi-label classification. Ignored if mode is ‘best match’. (Default: 0.5)
- Returns:
An AnnotationResult object. Four important attributes within this class are:
1) predicted_labels, predicted labels from celltypist.
2) decision_matrix, decision matrix from celltypist.
3) probability_matrix, probability matrix from celltypist.
4) adata, AnnData object representation of the input data.
- Return type:
AnnotationResult
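The difference between the two modes can be sketched in plain Python. This is not celltypist's implementation: the function name `assign_labels`, the dictionary input, the list return value, and the strict `>` comparison against `p_thres` are all assumptions chosen to illustrate the behaviour described above.

```python
def assign_labels(probabilities, mode="best match", p_thres=0.5):
    """Return the label(s) chosen for one cell.

    `probabilities` maps each cell type to its probability for this cell.
    The return value is a list so that 0, 1 or several labels can be expressed.
    """
    if mode == "best match":
        # Single label: the cell type with the largest probability.
        return [max(probabilities, key=probabilities.get)]
    if mode == "prob match":
        # Multi-label: every cell type above the threshold (assumed strict).
        # An empty list stands for an unassigned cell.
        return [ct for ct, p in probabilities.items() if p > p_thres]
    raise ValueError(f"unknown mode: {mode!r}")


cell = {"T cell": 0.7, "NK cell": 0.6, "B cell": 0.1}
best = assign_labels(cell)                        # one label
multi = assign_labels(cell, mode="prob match")    # two labels pass 0.5
none = assign_labels({"T cell": 0.2, "B cell": 0.3}, mode="prob match")
```

Raising `p_thres` makes the multi-label mode more conservative, so more cells end up with no label at all.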
- static majority_vote(predictions: AnnotationResult, over_clustering: list | tuple | ndarray | Series | Index, min_prop: float = 0) → AnnotationResult
Majority vote the celltypist predictions using the result from the over-clustering.
- Parameters:
predictions – An AnnotationResult object containing the predicted_labels.
over_clustering – A list, tuple, numpy array, pandas series or index containing the over-clustering information.
min_prop – For the dominant cell type within a subcluster, the minimum proportion of cells required to support naming of the subcluster by this cell type. (Default: 0)
- Returns:
An AnnotationResult object. Four important attributes within this class are:
1) predicted_labels, predicted labels from celltypist.
2) decision_matrix, decision matrix from celltypist.
3) probability_matrix, probability matrix from celltypist.
4) adata, AnnData object representation of the input data.
- Return type:
AnnotationResult
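The idea behind majority voting with min_prop can be sketched without any dependencies. This is an illustrative approximation, not the library's code: the function name `vote_by_cluster`, the `fallback` parameter, and its default value are hypothetical, as is the choice of `>=` when comparing against `min_prop`.

```python
from collections import Counter, defaultdict


def vote_by_cluster(predicted_labels, over_clustering, min_prop=0.0,
                    fallback="Heterogeneous"):
    """Relabel each cell with the dominant predicted label of its subcluster.

    A subcluster keeps its dominant label only if that label covers at least
    `min_prop` of the subcluster's cells; otherwise `fallback` is used.
    """
    # Group the per-cell predictions by subcluster.
    by_cluster = defaultdict(list)
    for label, cluster in zip(predicted_labels, over_clustering):
        by_cluster[cluster].append(label)

    # Decide one label per subcluster.
    cluster_label = {}
    for cluster, labels in by_cluster.items():
        dominant, count = Counter(labels).most_common(1)[0]
        supported = count / len(labels) >= min_prop
        cluster_label[cluster] = dominant if supported else fallback

    # Map the subcluster decision back onto every cell, in input order.
    return [cluster_label[cluster] for cluster in over_clustering]


labels = ["T", "T", "B", "B", "NK", "B"]
clusters = [0, 0, 0, 1, 1, 1]
voted = vote_by_cluster(labels, clusters)
strict = vote_by_cluster(labels, clusters, min_prop=0.8)
```

With the default `min_prop` of 0 every subcluster takes its dominant label; in the `strict` call neither subcluster reaches 80% support, so both fall back.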
- over_cluster(resolution: float | None = None, use_GPU: bool = False) → Series
Over-cluster the input data with a canonical Scanpy pipeline. A neighborhood graph will be used (or constructed if not found) for the over-clustering.
- Parameters:
resolution – Resolution parameter for Leiden clustering, which controls the coarseness of the clustering. Defaults to 5, 10, 15, 20, 25 and 30 for datasets with fewer than 5k, 20k, 40k, 100k and 200k cells, and with 200k cells or more, respectively.
use_GPU – Whether to use the GPU for over-clustering on the basis of rapids-singlecell. (Default: False)
- Returns:
A Series object showing the over-clustering result.
- Return type:
Series
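The size-dependent default for resolution described above can be written as a small lookup. The function name `default_resolution` is hypothetical, and treating each boundary as exclusive (a dataset of exactly 5,000 cells falls into the next tier) is an assumption based on the "less than" wording.

```python
def default_resolution(n_cells):
    """Return the default over-clustering resolution for a dataset size."""
    # (exclusive upper bound on cell number, resolution) pairs, in order.
    tiers = [
        (5_000, 5),
        (20_000, 10),
        (40_000, 15),
        (100_000, 20),
        (200_000, 25),
    ]
    for upper_bound, resolution in tiers:
        if n_cells < upper_bound:
            return resolution
    # 200k cells and above.
    return 30


small = default_resolution(3_000)
boundary = default_resolution(5_000)
large = default_resolution(150_000)
huge = default_resolution(200_000)
```

Passing an explicit resolution to over_cluster bypasses this default entirely.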