{ "cells": [ { "cell_type": "markdown", "id": "fancy-fifty", "metadata": {}, "source": [ "# Best practice in large-scale cross-dataset label transfer using CellTypist\n", "This notebook demonstrates how to perform cell type label transfer between large scRNA-seq datasets using CellTypist." ] }, { "cell_type": "markdown", "id": "unlike-sussex", "metadata": {}, "source": [ "## About model training and cell type prediction\n", "Cell type prediction using existing models (e.g. the CellTypist built-in models) is usually fast. In other cases, a bespoke model needs to be trained based on the reference dataset of interest. This notebook deals with the latter, with a particular focus on large datasets. Make sure you have at least 30~40GB RAM before running this notebook." ] }, { "cell_type": "markdown", "id": "greater-sapphire", "metadata": {}, "source": [ "## Install CellTypist" ] }, { "cell_type": "code", "execution_count": 1, "id": "assisted-earthquake", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting celltypist\n", " Using cached celltypist-0.2.1-py3-none-any.whl (5.3 MB)\n", "Requirement already satisfied: requests>=2.23.0 in /opt/conda/lib/python3.8/site-packages (from celltypist) (2.25.1)\n", "Requirement already satisfied: openpyxl>=3.0.4 in /opt/conda/lib/python3.8/site-packages (from celltypist) (3.0.7)\n", "Requirement already satisfied: leidenalg>=0.8.3 in /opt/conda/lib/python3.8/site-packages (from celltypist) (0.8.3)\n", "Requirement already satisfied: scikit-learn>=0.24.1 in /opt/conda/lib/python3.8/site-packages (from celltypist) (0.24.1)\n", "Requirement already satisfied: scanpy>=1.7.0 in /opt/conda/lib/python3.8/site-packages (from celltypist) (1.7.1)\n", "Requirement already satisfied: numpy>=1.19.0 in /opt/conda/lib/python3.8/site-packages (from celltypist) (1.20.1)\n", "Requirement already satisfied: click>=7.1.2 in /opt/conda/lib/python3.8/site-packages (from celltypist) (7.1.2)\n", "Requirement already 
satisfied: pandas>=1.0.5 in /opt/conda/lib/python3.8/site-packages (from celltypist) (1.2.3)\n", "Requirement already satisfied: et-xmlfile in /opt/conda/lib/python3.8/site-packages (from openpyxl>=3.0.4->celltypist) (1.0.1)\n", "Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.8/site-packages (from pandas>=1.0.5->celltypist) (2.8.1)\n", "Requirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.8/site-packages (from pandas>=1.0.5->celltypist) (2021.1)\n", "Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas>=1.0.5->celltypist) (1.15.0)\n", "Requirement already satisfied: chardet<5,>=3.0.2 in /opt/conda/lib/python3.8/site-packages (from requests>=2.23.0->celltypist) (4.0.0)\n", "Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests>=2.23.0->celltypist) (2020.12.5)\n", "Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests>=2.23.0->celltypist) (2.10)\n", "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.8/site-packages (from requests>=2.23.0->celltypist) (1.26.3)\n", "Requirement already satisfied: patsy in /opt/conda/lib/python3.8/site-packages (from scanpy>=1.7.0->celltypist) (0.5.1)\n", "Requirement already satisfied: statsmodels>=0.10.0rc2 in /opt/conda/lib/python3.8/site-packages (from scanpy>=1.7.0->celltypist) (0.12.2)\n", "Requirement already satisfied: umap-learn>=0.3.10 in /opt/conda/lib/python3.8/site-packages (from scanpy>=1.7.0->celltypist) (0.4.6)\n", "Requirement already satisfied: h5py>=2.10.0 in /opt/conda/lib/python3.8/site-packages (from scanpy>=1.7.0->celltypist) (3.1.0)\n", "Requirement already satisfied: packaging in /opt/conda/lib/python3.8/site-packages (from scanpy>=1.7.0->celltypist) (20.9)\n", "Requirement already satisfied: tqdm in /opt/conda/lib/python3.8/site-packages (from scanpy>=1.7.0->celltypist) 
(4.58.0)\n", "Requirement already satisfied: matplotlib>=3.1.2 in /opt/conda/lib/python3.8/site-packages (from scanpy>=1.7.0->celltypist) (3.3.4)\n", "Requirement already satisfied: legacy-api-wrap in /opt/conda/lib/python3.8/site-packages (from scanpy>=1.7.0->celltypist) (0.0.0)\n", "Requirement already satisfied: sinfo in /opt/conda/lib/python3.8/site-packages (from scanpy>=1.7.0->celltypist) (0.3.1)\n", "Requirement already satisfied: scipy>=1.4 in /opt/conda/lib/python3.8/site-packages (from scanpy>=1.7.0->celltypist) (1.6.1)\n", "Requirement already satisfied: seaborn in /opt/conda/lib/python3.8/site-packages (from scanpy>=1.7.0->celltypist) (0.11.1)\n", "Requirement already satisfied: networkx>=2.3 in /opt/conda/lib/python3.8/site-packages (from scanpy>=1.7.0->celltypist) (2.5)\n", "Requirement already satisfied: natsort in /opt/conda/lib/python3.8/site-packages (from scanpy>=1.7.0->celltypist) (7.1.1)\n", "Requirement already satisfied: numba>=0.41.0 in /opt/conda/lib/python3.8/site-packages (from scanpy>=1.7.0->celltypist) (0.51.2)\n", "Requirement already satisfied: anndata>=0.7.4 in /opt/conda/lib/python3.8/site-packages (from scanpy>=1.7.0->celltypist) (0.7.5)\n", "Requirement already satisfied: tables in /opt/conda/lib/python3.8/site-packages (from scanpy>=1.7.0->celltypist) (3.6.1)\n", "Requirement already satisfied: joblib in /opt/conda/lib/python3.8/site-packages (from scanpy>=1.7.0->celltypist) (1.0.1)\n", "Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /opt/conda/lib/python3.8/site-packages (from matplotlib>=3.1.2->scanpy>=1.7.0->celltypist) (2.4.7)\n", "Requirement already satisfied: pillow>=6.2.0 in /opt/conda/lib/python3.8/site-packages (from matplotlib>=3.1.2->scanpy>=1.7.0->celltypist) (8.1.2)\n", "Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.8/site-packages (from matplotlib>=3.1.2->scanpy>=1.7.0->celltypist) (1.3.1)\n", "Requirement already satisfied: cycler>=0.10 in 
/opt/conda/lib/python3.8/site-packages (from matplotlib>=3.1.2->scanpy>=1.7.0->celltypist) (0.10.0)\n", "Requirement already satisfied: decorator>=4.3.0 in /opt/conda/lib/python3.8/site-packages (from networkx>=2.3->scanpy>=1.7.0->celltypist) (4.4.2)\n", "Requirement already satisfied: llvmlite<0.35,>=0.34.0.dev0 in /opt/conda/lib/python3.8/site-packages (from numba>=0.41.0->scanpy>=1.7.0->celltypist) (0.34.0)\n", "Requirement already satisfied: setuptools in /opt/conda/lib/python3.8/site-packages (from numba>=0.41.0->scanpy>=1.7.0->celltypist) (49.6.0.post20210108)\n", "Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.8/site-packages (from scikit-learn>=0.24.1->celltypist) (2.1.0)\n", "Requirement already satisfied: get-version>=2.0.4 in /opt/conda/lib/python3.8/site-packages (from legacy-api-wrap->scanpy>=1.7.0->celltypist) (2.1)\n", "Requirement already satisfied: stdlib-list in /opt/conda/lib/python3.8/site-packages (from sinfo->scanpy>=1.7.0->celltypist) (0.7.0)\n", "Requirement already satisfied: numexpr>=2.6.2 in /opt/conda/lib/python3.8/site-packages (from tables->scanpy>=1.7.0->celltypist) (2.7.3)\n", "Installing collected packages: celltypist\n", "Successfully installed celltypist-0.2.1\n" ] } ], "source": [ "!pip install celltypist" ] }, { "cell_type": "code", "execution_count": 2, "id": "living-preference", "metadata": {}, "outputs": [], "source": [ "import scanpy as sc\n", "import celltypist\n", "import time\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "lucky-diversity", "metadata": {}, "source": [ "## Download two datasets for label transfer" ] }, { "cell_type": "markdown", "id": "frozen-trainer", "metadata": {}, "source": [ "Both datasets used in this notebook can be easily downloaded from the human [gut cell atlas](https://www.gutcellatlas.org/)." 
] }, { "cell_type": "markdown", "id": "coupled-committee", "metadata": {}, "source": [ "Download the dataset of 428k intestinal cells from fetal, pediatric and adult donors, spanning up to 11 intestinal regions ([Elmentaite et al. 2021](https://doi.org/10.1038/s41586-021-03852-1))." ] }, { "cell_type": "code", "execution_count": 3, "id": "honey-antarctica", "metadata": { "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "03b3895a93c94fc9a0ff0a1624d5c763", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0.00/5.72G [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "adata_Elmentaite = sc.read('celltypist_demo_folder/gut_cell_atlas_Elmentaite.h5ad', backup_url = 'https://cellgeni.cog.sanger.ac.uk/gutcellatlas/Full_obj_raw_counts_nosoupx.h5ad')" ] }, { "cell_type": "markdown", "id": "primary-africa", "metadata": {}, "source": [ "Since this object stores a raw count expression matrix, we first normalise each cell to a library size of 10,000 counts and then log-transform the result." ] }, { "cell_type": "code", "execution_count": 4, "id": "varied-adolescent", "metadata": {}, "outputs": [], "source": [ "sc.pp.normalize_total(adata_Elmentaite, target_sum = 1e4)\n", "sc.pp.log1p(adata_Elmentaite)" ] }, { "cell_type": "markdown", "id": "coastal-judgment", "metadata": {}, "source": [ "Download the other dataset of 42k immune cells from the mesenteric lymph nodes (MLNs) and lamina propria of the cecum, transverse colon and sigmoid colon ([James et al. 2020](https://doi.org/10.1038/s41590-020-0602-z))." 
] }, { "cell_type": "code", "execution_count": 5, "id": "referenced-scanning", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "60e49adfcd9e470992fea363faf7af2f", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0.00/503M [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "adata_James = sc.read('celltypist_demo_folder/gut_cell_atlas_James.h5ad', backup_url = 'https://cellgeni.cog.sanger.ac.uk/gutcellatlas/Colon_cell_atlas.h5ad')" ] }, { "cell_type": "markdown", "id": "fleet-strap", "metadata": {}, "source": [ "This object is already log-normalised to 10,000 counts, so no processing is needed here." ] }, { "cell_type": "markdown", "id": "alike-republic", "metadata": {}, "source": [ "Cell type annotation information is stashed in `Integrated_05` and `cell_type`, respectively." ] }, { "cell_type": "code", "execution_count": 6, "id": "neutral-argument", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['DZ GC cell', 'Cycling B cell', 'gdT', 'Memory B', 'Treg', ..., 'EC cells (NPW+)', 'β cells (INS+)', 'Branch A2 (IPAN/IN)', 'Branch A3 (IPAN/IN)', 'Germ']\n", "Length: 134\n", "Categories (134, object): ['DZ GC cell', 'Cycling B cell', 'gdT', 'Memory B', ..., 'β cells (INS+)', 'Branch A2 (IPAN/IN)', 'Branch A3 (IPAN/IN)', 'Germ']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 134 cell types in the first data.\n", "adata_Elmentaite.obs.Integrated_05.unique()" ] }, { "cell_type": "code", "execution_count": 7, "id": "acknowledged-inflation", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "['B cell IgA Plasma', 'B cell memory', 'CD8 T', 'gd T', 'Mast', ..., 'Activated CD4 T', 'pDC', 'Tfh', 'Lymphoid DC', 'cycling DCs']\n", "Length: 25\n", "Categories (25, object): ['B cell IgA Plasma', 'B cell memory', 'CD8 T', 'gd T', ..., 'pDC', 'Tfh', 'Lymphoid DC', 'cycling DCs']" ] }, "execution_count": 7, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "# 25 cell types in the second data.\n", "adata_James.obs.cell_type.unique()" ] }, { "cell_type": "markdown", "id": "intense-johnson", "metadata": {}, "source": [ "## Transfer cell type labels from the first dataset to the second dataset\n", "This section shows how to transfer cell type labels from `adata_Elmentaite` to `adata_James`, and to assess and visualise the prediction result." ] }, { "cell_type": "markdown", "id": "distant-monster", "metadata": {}, "source": [ "### (Optional) Downsample cells for the first dataset\n", "First we downsample the 428k cells in `adata_Elmentaite`." ] }, { "cell_type": "code", "execution_count": 8, "id": "ambient-colon", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(428469, 33538)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata_Elmentaite.shape" ] }, { "cell_type": "markdown", "id": "proved-conditions", "metadata": {}, "source": [ "There are several ways to downsample cells, including: 1) downsampling cells to a given number; 2) downsampling cells to a given number, with cell types fairly represented (i.e. rare cell types are sampled with a higher probability); 3) downsampling cells from each cell type to a given number; 4) other downsampling strategies. For this dataset, we used the third approach with the aid of [celltypist.samples.downsample_adata](https://celltypist.readthedocs.io/en/latest/celltypist.samples.downsample_adata.html). You can also try the first or second options using this function, or any other custom downsampling strategies suited to your data." ] }, { "cell_type": "markdown", "id": "described-turkey", "metadata": {}, "source": [ "Downsampling will be beneficial when cells are well annotated, say, cells from a given cell type are transcriptionally homogeneous. 
*Skip the whole sub-section if you think the available cell type information is coarse and downsampling may skew the cell type representations in the original dataset.*" ] }, { "cell_type": "code", "execution_count": 9, "id": "conceptual-cherry", "metadata": {}, "outputs": [], "source": [ "# Sample 500 cells from each cell type for `adata_Elmentaite`.\n", "# All cells from a given cell type will be selected if the cell type size is < 500.\n", "sampled_cell_index = celltypist.samples.downsample_adata(adata_Elmentaite, mode = 'each', n_cells = 500, by = 'Integrated_05', return_index = True)" ] }, { "cell_type": "markdown", "id": "manual-snowboard", "metadata": {}, "source": [ "By default (`return_index = True`), only the indices of the sampled cells are returned, which leaves the original `adata_Elmentaite` intact. Note that these sampled cells are used only for model training." ] }, { "cell_type": "code", "execution_count": 10, "id": "individual-schedule", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of downsampled cells for training: 54853\n" ] } ], "source": [ "print(f\"Number of downsampled cells for training: {len(sampled_cell_index)}\")" ] }, { "cell_type": "markdown", "id": "lesser-resident", "metadata": {}, "source": [ "### (Suggested) Feature selection for the first dataset" ] }, { "cell_type": "markdown", "id": "negative-quantity", "metadata": {}, "source": [ "A feature selection step restricts the number of genes used during training, and can improve both training efficiency and prediction accuracy. It is recommended in most cases (though CellTypist models have proved robust when all genes are used). \n", " \n", "One example of feature selection is [scanpy.pp.highly_variable_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html). Depending on the dataset, you may need to take into account batch effects (e.g. by specifying `batch_key`), add/remove a specific list of genes (e.g. 
VDJ genes for immune cells), combine high-confidence features across zoomed-in compartments, or use any other approach tailored to your data." ] }, { "cell_type": "markdown", "id": "hispanic-inclusion", "metadata": {}, "source": [ "In this notebook, we perform feature selection using CellTypist itself. As noted above, this may not be the best approach for your data, but it has performed well on the several datasets tested." ] }, { "cell_type": "markdown", "id": "technical-entity", "metadata": {}, "source": [ "First, use [celltypist.train](https://celltypist.readthedocs.io/en/latest/celltypist.train.html) to quickly train a rough CellTypist model by stochastic gradient descent (SGD) learning on 10 CPUs, with only a small number of iterations (5)." ] }, { "cell_type": "code", "execution_count": 11, "id": "bearing-product", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "🍳 Preparing data before training\n", "✂️ 4047 non-expressed genes are filtered out\n", "⚖️ Scaling input data\n", "🏋️ Training data using SGD logistic regression\n", "✅ Model training done!\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Time elapsed: 299.6024606227875 seconds\n" ] } ], "source": [ "# Use `celltypist.train` to quickly train a rough CellTypist model.\n", "# You can also set `mini_batch = True` to enable mini-batch training.\n", "t_start = time.time()\n", "model_fs = celltypist.train(adata_Elmentaite[sampled_cell_index], 'Integrated_05', n_jobs = 10, max_iter = 5, use_SGD = True)\n", "t_end = time.time()\n", "print(f\"Time elapsed: {t_end - t_start} seconds\")" ] }, { "cell_type": "markdown", "id": "heated-mobile", "metadata": {}, "source": [ "Training takes about 5 minutes for this dataset of 55k cells, 33k genes and 134 cell types. Note that you can increase `max_iter` to get a more accurate model at the expense of a longer runtime." 
] }, { "cell_type": "markdown", "id": "romantic-guinea", "metadata": {}, "source": [ "This model is trained on all genes for only five epochs, and is therefore not accurate enough for cell type prediction. The information it holds about genes can still be used, however. Here, we take the top 100 most important genes for each cell type, ranked by the absolute value of their regression coefficients for that cell type. For datasets with only a few cell types, you may want to increase the number of top genes from 100 to, for example, 300 in order to get a sufficient number of genes for final use." ] }, { "cell_type": "code", "execution_count": 12, "id": "painful-tracker", "metadata": {}, "outputs": [], "source": [ "# For each cell type (one row of `coef_`), take the column indices of the 100 largest absolute coefficients.\n", "gene_index = np.argpartition(np.abs(model_fs.classifier.coef_), -100, axis = 1)[:, -100:]" ] }, { "cell_type": "markdown", "id": "specific-process", "metadata": {}, "source": [ "We next take the union of these genes across cell types." ] }, { "cell_type": "code", "execution_count": 13, "id": "saved-movie", "metadata": {}, "outputs": [], "source": [ "gene_index = np.unique(gene_index)" ] }, { "cell_type": "markdown", "id": "manufactured-surface", "metadata": {}, "source": [ "These genes will be used for downstream model training." ] }, { "cell_type": "code", "execution_count": 14, "id": "solar-ballot", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of genes selected: 5103\n" ] } ], "source": [ "print(f\"Number of genes selected: {len(gene_index)}\")" ] }, { "cell_type": "markdown", "id": "normal-bouquet", "metadata": {}, "source": [ "### Model training and label transfer" ] }, { "cell_type": "markdown", "id": "infinite-intervention", "metadata": {}, "source": [ "With the downsampled cells (55k) and selected features (5k), we next train a model on `adata_Elmentaite` using [celltypist.train](https://celltypist.readthedocs.io/en/latest/celltypist.train.html)." 
] }, { "cell_type": "markdown", "id": "facial-thermal", "metadata": {}, "source": [ "To allow for unbiased probability estimates, here we use the non-SGD version of CellTypist training (i.e. a traditional logistic regression framework)." ] }, { "cell_type": "code", "execution_count": 15, "id": "innocent-lafayette", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "🍳 Preparing data before training\n", "✂️ 596 non-expressed genes are filtered out\n", "⚖️ Scaling input data\n", "🏋️ Training data using logistic regression\n", "✅ Model training done!\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Time elapsed: 42.38707377115885 minutes\n" ] } ], "source": [ "# Add `check_expression = False` to bypass expression check with only a subset of genes.\n", "# `gene_index` indexes the columns of `model_fs.classifier.coef_`, i.e. the genes kept after the first model\n", "# filtered out non-expressed genes, so map it to gene names via `model_fs.classifier.features` before subsetting.\n", "t_start = time.time()\n", "model = celltypist.train(adata_Elmentaite[sampled_cell_index, model_fs.classifier.features[gene_index]], 'Integrated_05', check_expression = False, n_jobs = 10, max_iter = 100)\n", "t_end = time.time()\n", "print(f\"Time elapsed: {(t_end - t_start)/60} minutes\")" ] }, { "cell_type": "markdown", "id": "packed-jesus", "metadata": {}, "source": [ "Training takes about 42 minutes for this dataset of 55k cells, 5k genes and 134 cell types. Note that increasing `max_iter` may give a more accurate model at the expense of a longer runtime." ] }, { "cell_type": "markdown", "id": "international-absence", "metadata": {}, "source": [ "First, save this model locally for future use." ] }, { "cell_type": "code", "execution_count": 16, "id": "affected-apache", "metadata": {}, "outputs": [], "source": [ "# Save the model.\n", "model.write('celltypist_demo_folder/model_from_Elmentaite_2021.pkl')" ] }, { "cell_type": "markdown", "id": "graphic-marble", "metadata": {}, "source": [ "Next, use [celltypist.annotate](https://celltypist.readthedocs.io/en/latest/celltypist.annotate.html) to predict the cell types in `adata_James` using this model." 
] }, { "cell_type": "code", "execution_count": 17, "id": "numeric-musical", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "🔬 Input data has 41650 cells and 18927 genes\n", "🔗 Matching reference genes in the model\n", "🧬 2866 features used for prediction\n", "⚖️ Scaling input data\n", "🖋️ Predicting labels\n", "✅ Prediction done!\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Time elapsed: 6.5466063022613525 seconds\n" ] } ], "source": [ "# CellTypist prediction without over-clustering and majority-voting.\n", "t_start = time.time()\n", "predictions = celltypist.annotate(adata_James, model = 'celltypist_demo_folder/model_from_Elmentaite_2021.pkl')\n", "t_end = time.time()\n", "print(f\"Time elapsed: {t_end - t_start} seconds\")" ] }, { "cell_type": "markdown", "id": "compact-mouth", "metadata": {}, "source": [ "It takes 6 seconds to predict a dataset of 42k cells and 19k genes." ] }, { "cell_type": "markdown", "id": "collect-saturn", "metadata": {}, "source": [ "By default (`majority_voting = False`), CellTypist will infer the identity of each query cell independently. This leads to raw predicted cell type labels, and usually finishes within seconds. You can also turn on the majority-voting classifier (`majority_voting = True`), which refines cell identities within local subclusters after an over-clustering approach at the cost of increased runtime." 
] }, { "cell_type": "code", "execution_count": 18, "id": "brown-clinic", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "🔬 Input data has 41650 cells and 18927 genes\n", "🔗 Matching reference genes in the model\n", "🧬 2866 features used for prediction\n", "⚖️ Scaling input data\n", "🖋️ Predicting labels\n", "✅ Prediction done!\n", "👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it\n", "⛓️ Over-clustering input data with resolution set to 20\n", "🗳️ Majority voting the predictions\n", "✅ Majority voting done!\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Time elapsed: 22.271843671798706 seconds\n" ] } ], "source": [ "# CellTypist prediction with over-clustering and majority-voting.\n", "t_start = time.time()\n", "predictions = celltypist.annotate(adata_James, model = 'celltypist_demo_folder/model_from_Elmentaite_2021.pkl', majority_voting = True)\n", "t_end = time.time()\n", "print(f\"Time elapsed: {t_end - t_start} seconds\")" ] }, { "cell_type": "markdown", "id": "second-worry", "metadata": {}, "source": [ "It takes 22 seconds to both predict and majority-vote a dataset of 42k cells and 19k genes." ] }, { "cell_type": "markdown", "id": "concerned-cache", "metadata": {}, "source": [ "The results include the predicted cell type labels (`predicted_labels`), the over-clustering result (`over_clustering`), and the predicted labels after majority voting in local subclusters (`majority_voting`). Note that in `predicted_labels`, each query cell is assigned the most probable cell type among all cell types in the given model." ] }, { "cell_type": "code", "execution_count": 19, "id": "divided-hawaii", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| index | predicted_labels | over_clustering | majority_voting |
|---|---|---|---|
| AAACGGGAGGTGCAAC-1-Pan_T7935487 | IgA plasma cell | 154 | IgA plasma cell |
| AAAGATGTCCTCAACC-1-Pan_T7935487 | IgA plasma cell | 23 | IgA plasma cell |
| AAAGTAGTCTTACCGC-1-Pan_T7935487 | IgA plasma cell | 154 | IgA plasma cell |
| AACCATGCAGATTGCT-1-Pan_T7935487 | IgG plasma cell | 154 | IgA plasma cell |
| AACCATGTCCTGCCAT-1-Pan_T7935487 | IgG plasma cell | 79 | IgA plasma cell |
| ... | ... | ... | ... |
| TTTGTCACAAGTTAAG-1-Human_colon_16S8000484 | Memory B | 2 | Memory B |
| TTTGTCAGTACCGAGA-1-Human_colon_16S8000484 | CD8 Tmem | 12 | CD8 Tmem |
| TTTGTCATCAACACCA-1-Human_colon_16S8000484 | Activated CD4 T | 205 | SELL+ CD4 T |
| TTTGTCATCCCAACGG-1-Human_colon_16S8000484 | ILC3 | 196 | Activated CD8 T |
| TTTGTCATCGGTGTCG-1-Human_colon_16S8000484 | Treg | 77 | Treg |

41650 rows × 3 columns

| index | predicted_labels | over_clustering | majority_voting | conf_score |
|---|---|---|---|---|
| AAACGGGAGGTGCAAC-1-Pan_T7935487 | IgA plasma cell | 154 | IgA plasma cell | 0.879683 |
| AAAGATGTCCTCAACC-1-Pan_T7935487 | IgA plasma cell | 23 | IgA plasma cell | 0.995480 |
| AAAGTAGTCTTACCGC-1-Pan_T7935487 | IgA plasma cell | 154 | IgA plasma cell | 0.974575 |
| AACCATGCAGATTGCT-1-Pan_T7935487 | IgG plasma cell | 154 | IgA plasma cell | 0.835985 |
| AACCATGTCCTGCCAT-1-Pan_T7935487 | IgG plasma cell | 79 | IgA plasma cell | 0.856283 |
| ... | ... | ... | ... | ... |
| TTTGTCACAAGTTAAG-1-Human_colon_16S8000484 | Memory B | 2 | Memory B | 0.995788 |
| TTTGTCAGTACCGAGA-1-Human_colon_16S8000484 | CD8 Tmem | 12 | CD8 Tmem | 0.975453 |
| TTTGTCATCAACACCA-1-Human_colon_16S8000484 | Activated CD4 T | 205 | SELL+ CD4 T | 0.942779 |
| TTTGTCATCCCAACGG-1-Human_colon_16S8000484 | ILC3 | 196 | Activated CD8 T | 0.427970 |
| TTTGTCATCGGTGTCG-1-Human_colon_16S8000484 | Treg | 77 | Treg | 0.700854 |

41650 rows × 4 columns
\n", "