Classifier structure

class celltypist.classifier.Classifier(filename: AnnData | str = '', model: Model | str = '', transpose: bool = False, gene_file: str | None = None, cell_file: str | None = None)[source]

Bases: object

Class that wraps the celltyping and majority voting processes.

Parameters:
  • filename – Path to the input count matrix (supported types are csv, txt, tsv, tab and mtx) or to an AnnData file (h5ad). For a count matrix, a cell-by-gene layout is expected (see transpose for more information). An AnnData object already loaded in memory is also accepted. Genes should be given as gene symbols, and non-expressed genes should preferably be included as well.

  • model – A Model object that wraps the logistic Classifier and the StandardScaler, the path to the desired model file, or the model name.

  • transpose – Whether to transpose the input matrix. Set to True if filename is provided in a gene-by-cell format. (Default: False)

  • gene_file – Path to the file that lists, one gene per line, the genes in the provided mtx file. Ignored if filename is not in the mtx format.

  • cell_file – Path to the file that lists, one cell per line, the cells in the provided mtx file. Ignored if filename is not in the mtx format.

filename

Path to the input dataset. This attribute exists only when the input is a file path.

adata

An AnnData object that stores the log1p-normalized expression data in .X or .raw.X.

indata

The expression matrix used for predictions, stored in log1p-normalized format.

indata_genes

All the genes included in the input data.

indata_names

All the cells included in the input data.

model

A Model object that wraps the logistic Classifier and the StandardScaler.
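A minimal sketch of how the constructor and the methods of this class fit together, assuming celltypist is installed and the named model is already available locally. The helper name annotate and the default model name Immune_All_Low.pkl are illustrative assumptions, not part of this class; the calls use only the signatures documented on this page.

```python
def annotate(adata, model="Immune_All_Low.pkl", min_prop=0.0):
    """Predict cell types for an in-memory AnnData, then refine by majority voting.

    `annotate` is a hypothetical helper. `model` may be a Model object, a path
    to a model file, or a model name (the default used here is an assumption).
    """
    # Imported lazily so that defining the helper does not require celltypist.
    from celltypist.classifier import Classifier

    # `filename` also accepts an AnnData object that is already loaded in memory.
    clf = Classifier(filename=adata, model=model)

    # Per-cell predictions; 'best match' keeps the highest-scoring cell type.
    predictions = clf.celltype(mode="best match")

    # Over-cluster the input, then let each subcluster vote on its label.
    over_clustering = clf.over_cluster()
    return Classifier.majority_vote(predictions, over_clustering, min_prop=min_prop)
```

The returned object is an AnnotationResult, so its predicted_labels and adata attributes can be inspected afterwards.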

celltype(mode: str = 'best match', p_thres: float = 0.5) AnnotationResult[source]

Run celltyping jobs to predict cell types of input data.

Parameters:
  • mode – How the prediction is made. For each query cell, the default (‘best match’) chooses the cell type with the largest score/probability as the final prediction. Setting it to ‘prob match’ enables multi-label classification, which assigns zero (i.e., unassigned), one, or two or more cell type labels to each query cell. (Default: ‘best match’)

  • p_thres – Probability threshold for the multi-label classification. Ignored if mode is ‘best match’. (Default: 0.5)

Returns:

An AnnotationResult object. Four important attributes within this class are: 1) predicted_labels, predicted labels from celltypist. 2) decision_matrix, decision matrix from celltypist. 3) probability_matrix, probability matrix from celltypist. 4) adata, AnnData object representation of the input data.

Return type:

AnnotationResult
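The difference between the two modes can be illustrated with a small, library-free sketch. This is not celltypist's implementation: the function assign_labels, the example probabilities, and the strict greater-than comparison against p_thres are all assumptions made for illustration.

```python
def assign_labels(probabilities, mode="best match", p_thres=0.5):
    """Return the list of labels chosen for one query cell.

    `probabilities` maps a cell type name to its probability for that cell.
    """
    if mode == "best match":
        # Exactly one label: the cell type with the largest probability.
        return [max(probabilities, key=probabilities.get)]
    if mode == "prob match":
        # Zero, one, or several labels: every cell type above the threshold.
        # (Strict '>' is an assumption of this sketch.)
        return [name for name, p in probabilities.items() if p > p_thres]
    raise ValueError(f"unrecognized mode: {mode!r}")


# Hypothetical per-class probabilities for a single cell.
cell = {"B cells": 0.62, "Plasma cells": 0.55, "T cells": 0.03}

best = assign_labels(cell)                             # one label
multi = assign_labels(cell, "prob match")              # two labels pass 0.5
none = assign_labels(cell, "prob match", p_thres=0.9)  # no label passes 0.9
```

With ‘prob match’, raising p_thres makes the assignment stricter, so more cells end up without any label.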

static majority_vote(predictions: AnnotationResult, over_clustering: list | tuple | ndarray | Series | Index, min_prop: float = 0) AnnotationResult[source]

Refine the celltypist predictions by majority voting within each subcluster of the over-clustering result.

Parameters:
  • predictions – An AnnotationResult object containing the predicted_labels.

  • over_clustering – A list, tuple, numpy array, pandas series or index containing the over-clustering information.

  • min_prop – For the dominant cell type within a subcluster, the minimum proportion of cells required to support naming of the subcluster by this cell type. (Default: 0)

Returns:

An AnnotationResult object. Four important attributes within this class are: 1) predicted_labels, predicted labels from celltypist. 2) decision_matrix, decision matrix from celltypist. 3) probability_matrix, probability matrix from celltypist. 4) adata, AnnData object representation of the input data.

Return type:

AnnotationResult
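The role of min_prop can be sketched in plain Python. This is an illustration of the voting idea rather than the library's code: the function name majority_vote_sketch, the fallback label "Heterogeneous", and the use of a greater-or-equal comparison are assumptions of this example.

```python
from collections import Counter, defaultdict


def majority_vote_sketch(predicted_labels, over_clustering, min_prop=0.0,
                         fallback="Heterogeneous"):
    """Relabel each cell with the dominant prediction of its subcluster."""
    # Group the per-cell predictions by subcluster.
    by_cluster = defaultdict(list)
    for label, cluster in zip(predicted_labels, over_clustering):
        by_cluster[cluster].append(label)

    # Choose the dominant label per subcluster, unless its share of the
    # subcluster falls below min_prop (then use the fallback label).
    cluster_label = {}
    for cluster, labels in by_cluster.items():
        dominant, count = Counter(labels).most_common(1)[0]
        supported = count / len(labels) >= min_prop
        cluster_label[cluster] = dominant if supported else fallback

    # Every cell inherits the label of its subcluster.
    return [cluster_label[cluster] for cluster in over_clustering]


# Hypothetical predictions for seven cells in two subclusters.
labels = ["T", "T", "B", "B", "B", "NK", "T"]
clusters = ["0", "0", "0", "1", "1", "1", "1"]

voted = majority_vote_sketch(labels, clusters)                 # min_prop = 0
strict = majority_vote_sketch(labels, clusters, min_prop=0.6)  # needs 60% support
```

In subcluster "1" the dominant label covers only half of the cells, so it is kept with the default min_prop of 0 but replaced by the fallback label once min_prop is raised to 0.6.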

over_cluster(resolution: float | None = None, use_GPU: bool = False) Series[source]

Over-cluster the input data with a canonical Scanpy pipeline. An existing neighborhood graph is used for the over-clustering, or one is constructed if none is found.

Parameters:
  • resolution – Resolution parameter for Leiden clustering, which controls the coarseness of the clustering. Defaults to 5, 10, 15, 20, and 25 for datasets with fewer than 5k, 20k, 40k, 100k, and 200k cells, respectively, and to 30 for datasets with 200k cells or more.

  • use_GPU – Whether to use the GPU for over-clustering via rapids-singlecell. (Default: False)

Returns:

A Series object showing the over-clustering result.

Return type:

Series
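The default resolution described for the resolution parameter amounts to a simple lookup on the number of cells. The sketch below restates that rule; the function name default_resolution is hypothetical and is not part of the library.

```python
def default_resolution(n_cells):
    """Return the default over-clustering resolution for a dataset size."""
    # (exclusive upper bound on cell count, resolution), as listed for `resolution`.
    thresholds = [
        (5_000, 5),
        (20_000, 10),
        (40_000, 15),
        (100_000, 20),
        (200_000, 25),
    ]
    for upper_bound, resolution in thresholds:
        if n_cells < upper_bound:
            return resolution
    # 200k cells or more.
    return 30
```

Passing an explicit resolution to over_cluster bypasses this size-based default.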