LIVE NEWS
  • Beijing’s new missile threat to Australia
  • Frailty sets in far earlier than you’d expect, but you can reverse it
  • Middle East crisis live: Iranian missiles intercepted over Turkey, Qatar and UAE as Israel resumes strikes across Tehran and Beirut | US-Israel war on Iran
  • Institutional Investors Pour $619,000,000 Into Bitcoin and Crypto Assets in One Week: CoinShares
  • How AI Assistants are Moving the Security Goalposts – Krebs on Security
  • Oil Price Spike: What Countries Are Telling People to Do
  • The D Brief: US toll rises to 8; Iran’s new leader; Building military-specific AI; Counting the rationales for war; And a bit more.
  • Sugar Prices Jump as Crude Oil Surges
Prime Reports
  • Home
  • Popular Now
  • Crypto
  • Cybersecurity
  • Economy
  • Geopolitics
  • Global Markets
  • Politics
  • See More
    • Artificial Intelligence
    • Climate Risks
    • Defense
    • Healthcare Innovation
    • Science
    • Technology
    • World
Prime Reports
  • Home
  • Popular Now
  • Crypto
  • Cybersecurity
  • Economy
  • Geopolitics
  • Global Markets
  • Politics
  • Artificial Intelligence
  • Climate Risks
  • Defense
  • Healthcare Innovation
  • Science
  • Technology
  • World
Home»Artificial Intelligence»A Coding Guide to Build a Complete Single Cell RNA Sequencing Analysis Pipeline Using Scanpy for Clustering Visualization and Cell Type Annotation
Artificial Intelligence

A Coding Guide to Build a Complete Single Cell RNA Sequencing Analysis Pipeline Using Scanpy for Clustering Visualization and Cell Type Annotation

primereportsBy primereportsMarch 9, 2026No Comments5 Mins Read
Share Facebook Twitter Pinterest LinkedIn Tumblr Reddit Telegram Email
A Coding Guide to Build a Complete Single Cell RNA Sequencing Analysis Pipeline Using Scanpy for Clustering Visualization and Cell Type Annotation
Share
Facebook Twitter LinkedIn Pinterest Email


In this tutorial, we build a complete pipeline for single-cell RNA sequencing analysis using Scanpy. We start by installing the required libraries and loading the PBMC 3k dataset, then perform quality control, filtering, and normalization to prepare the data for downstream analysis. We then identify highly variable genes, perform PCA for dimensionality reduction, and construct a neighborhood graph to generate UMAP embeddings and Leiden clusters. Through marker gene discovery and visualization, we explore how clusters correspond to biological cell populations and implement a simple rule-based annotation strategy to infer cell types.

Copy CodeCopiedUse a different Browser
import sys
import subprocess
import importlib


def pip_install(*packages):
   subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *packages])


required = [
   "scanpy",
   "anndata",
   "leidenalg",
   "igraph",
   "harmonypy",
   "seaborn"
]
pip_install(*required)


import os
import warnings
warnings.filterwarnings("ignore")


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scanpy as sc
import anndata as ad


sc.settings.verbosity = 2
sc.settings.set_figure_params(dpi=110, facecolor="white", frameon=False)
np.random.seed(42)


print("Scanpy version:", sc.__version__)


adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()


print("\nInitial AnnData:")
print(adata)


adata.layers["counts"] = adata.X.copy()


adata.var["mt"] = adata.var_names.str.upper().str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True)


print("\nQC summary:")
display(
   adata.obs[["n_genes_by_counts", "total_counts", "pct_counts_mt"]].describe().T
)

We install all required dependencies and import the core scientific computing libraries needed for the analysis. We configure Scanpy settings, initialize the environment, and load the PBMC 3k single-cell RNA-seq dataset. We then compute quality-control metrics, including mitochondrial gene percentage, total counts, and the number of detected genes, for each cell.

Copy CodeCopiedUse a different Browser
fig, axs = plt.subplots(1, 3, figsize=(15, 4))
sc.pl.violin(adata, ["n_genes_by_counts"], jitter=0.4, ax=axs[0], show=False)
sc.pl.violin(adata, ["total_counts"], jitter=0.4, ax=axs[1], show=False)
sc.pl.violin(adata, ["pct_counts_mt"], jitter=0.4, ax=axs[2], show=False)
plt.tight_layout()
plt.show()


sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts", color="pct_counts_mt")


adata = adata[adata.obs["n_genes_by_counts"] >= 200].copy()
adata = adata[adata.obs["n_genes_by_counts"] <= 5000].copy()
adata = adata[adata.obs["pct_counts_mt"] < 10].copy()
sc.pp.filter_genes(adata, min_cells=3)


print("\nAfter filtering:")
print(adata)


sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata.copy()


sc.pp.highly_variable_genes(
   adata,
   flavor="seurat",
   min_mean=0.0125,
   max_mean=3,
   min_disp=0.5
)


print("\nHighly variable genes selected:", int(adata.var["highly_variable"].sum()))


sc.pl.highly_variable_genes(adata)


adata = adata[:, adata.var["highly_variable"]].copy()

We visualize quality control metrics using plots to check the distribution of gene counts and mitochondrial content. We apply filtering steps to remove low-quality cells and genes that do not meet basic expression thresholds. We then normalize the data, apply a log transformation, and identify highly variable genes for downstream analysis.

Copy CodeCopiedUse a different Browser
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata, max_value=10)


sc.tl.pca(adata, svd_solver="arpack")
sc.pl.pca_variance_ratio(adata, log=True)
sc.pl.pca(adata, color=None)


sc.pp.neighbors(adata, n_neighbors=12, n_pcs=30, metric="euclidean")
sc.tl.umap(adata, min_dist=0.35, spread=1.0)
sc.tl.leiden(adata, resolution=0.6, key_added="leiden")


print("\nCluster counts:")
display(adata.obs["leiden"].value_counts().sort_index().rename("cells_per_cluster").to_frame())


sc.pl.umap(adata, color=["leiden"], legend_loc="on data", title="PBMC 3k - Leiden clusters")


sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)


marker_table = sc.get.rank_genes_groups_df(adata, group=None)
print("\nTop marker rows:")
display(marker_table.head(20))

We regress out technical confounders and scale the dataset to prepare it for dimensionality reduction. We perform principal component analysis to capture the dataset’s major variance structure. We then construct the neighborhood graph, compute UMAP embeddings, perform Leiden clustering, and identify marker genes for each cluster.

Copy CodeCopiedUse a different Browser
top_markers_per_cluster = (
   marker_table.groupby("group")
   .head(10)
   .loc[:, ["group", "names", "logfoldchanges", "pvals_adj"]]
   .reset_index(drop=True)
)
print("\nTop 10 markers per cluster:")
display(top_markers_per_cluster)


candidate_markers = [
   "IL7R", "LTB", "MALAT1", "CCR7",
   "NKG7", "GNLY", "PRF1",
   "MS4A1", "CD79A", "CD79B",
   "LYZ", "S100A8", "FCER1A", "CST3",
   "PPBP", "FCGR3A", "LGALS3", "CTSS",
   "CD3D", "TRBC1", "TRAC"
]
candidate_markers = [g for g in candidate_markers if g in adata.var_names]


if candidate_markers:
   sc.pl.dotplot(
       adata,
       var_names=candidate_markers,
       groupby="leiden",
       standard_scale="var",
       dendrogram=True
   )
   sc.pl.matrixplot(
       adata,
       var_names=candidate_markers,
       groupby="leiden",
       standard_scale="var",
       dendrogram=True
   )


cluster_marker_reference = {
   "T_cells": ["IL7R", "LTB", "CCR7", "CD3D", "TRBC1", "TRAC"],
   "NK_cells": ["NKG7", "GNLY", "PRF1"],
   "B_cells": ["MS4A1", "CD79A", "CD79B"],
   "Monocytes": ["LYZ", "FCGR3A", "LGALS3", "CTSS", "S100A8", "CST3"],
   "Dendritic_cells": ["FCER1A", "CST3"],
   "Platelets": ["PPBP"]
}

We examine the most significant marker genes detected for each cluster and summarize the top markers. We visualize gene expression patterns across clusters using dot plots and matrix plots for known immune cell markers. We also define a reference mapping of marker genes associated with major immune cell types for later annotation.

Copy CodeCopiedUse a different Browser
available_reference = {
   celltype: [g for g in genes if g in adata.var_names]
   for celltype, genes in cluster_marker_reference.items()
}
available_reference = {k: v for k, v in available_reference.items() if len(v) > 0}


for celltype, genes in available_reference.items():
   sc.tl.score_genes(adata, gene_list=genes, score_name=f"{celltype}_score", use_raw=False)


score_cols = [f"{ct}_score" for ct in available_reference.keys()]
cluster_scores = adata.obs.groupby("leiden")[score_cols].mean()
display(cluster_scores)


cluster_to_celltype = {}
for cluster in cluster_scores.index:
   best = cluster_scores.loc[cluster].idxmax().replace("_score", "")
   cluster_to_celltype[cluster] = best


adata.obs["cell_type"] = adata.obs["leiden"].map(cluster_to_celltype).astype("category")


print("\nCluster to cell-type mapping:")
display(pd.DataFrame.from_dict(cluster_to_celltype, orient="index", columns=["assigned_cell_type"]))


sc.pl.umap(
   adata,
   color=["leiden", "cell_type"],
   legend_loc="on data",
   wspace=0.45
)


sc.tl.rank_genes_groups(adata, groupby="cell_type", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=15, sharey=False)


celltype_markers = sc.get.rank_genes_groups_df(adata, group=None)
print("\nTop markers by annotated cell type:")
display(
   celltype_markers.groupby("group").head(8)[["group", "names", "logfoldchanges", "pvals_adj"]]
)


cluster_prop = (
   adata.obs["cell_type"]
   .value_counts(normalize=True)
   .mul(100)
   .round(2)
   .rename("percent")
   .to_frame()
)
print("\nCell-type proportions (%):")
display(cluster_prop)


plt.figure(figsize=(7, 4))
cluster_prop["percent"].sort_values().plot(kind="barh")
plt.xlabel("Percent of cells")
plt.ylabel("Cell type")
plt.title("Estimated cell-type composition")
plt.tight_layout()
plt.show()


output_dir = "scanpy_pbmc3k_outputs"
os.makedirs(output_dir, exist_ok=True)


adata.write(os.path.join(output_dir, "pbmc3k_scanpy_advanced.h5ad"))
marker_table.to_csv(os.path.join(output_dir, "cluster_markers.csv"), index=False)
celltype_markers.to_csv(os.path.join(output_dir, "celltype_markers.csv"), index=False)
cluster_scores.to_csv(os.path.join(output_dir, "cluster_score_matrix.csv"))


print(f"\nSaved outputs to: {output_dir}")
print("Files:")
for f in sorted(os.listdir(output_dir)):
   print(" -", f)


summary = {
   "n_cells_final": int(adata.n_obs),
   "n_genes_final": int(adata.n_vars),
   "n_clusters": int(adata.obs["leiden"].nunique()),
   "clusters": sorted(adata.obs["leiden"].unique().tolist()),
   "cell_types": sorted(adata.obs["cell_type"].unique().tolist()),
}
print("\nAnalysis summary:")
for k, v in summary.items():
   print(f"{k}: {v}")

We score each cell using known marker gene sets and assign probable cell types to clusters based on expression patterns. We visualize the annotated cell types on the UMAP embedding and perform differential gene expression analysis across the predicted cell populations. Also, we compute cell-type proportions, generate summary visualizations, and save the processed dataset and analysis outputs for further research.

In conclusion, we developed a full end-to-end workflow for analyzing single-cell transcriptomic data using Scanpy. We performed preprocessing, clustering, marker-gene analysis, and cell-type annotation, and visualized the data structure using UMAP and gene expression plots. By saving the processed AnnData object and analysis outputs, we created a reusable dataset for further biological interpretation and advanced modeling. This workflow demonstrates how Scanpy enables scalable, reproducible single-cell analysis through a structured, modular Python pipeline.


Check out the Full Codes here. Also, feel free to follow us on Twitter and don’t forget to join our 120k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

The post A Coding Guide to Build a Complete Single Cell RNA Sequencing Analysis Pipeline Using Scanpy for Clustering Visualization and Cell Type Annotation appeared first on MarkTechPost.

Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
Previous ArticleNegative emissions technologies and practices could challenge global resource supply and environmental limits
Next Article Iran war: Israel targets ‘regime infrastructure’
primereports
  • Website

Related Posts

Artificial Intelligence

City Union Bank launches AI centre to support banking operations

March 9, 2026
Artificial Intelligence

Oil Above $100 Raises Inflation Concerns as Gulf Supply Routes Tighten

March 9, 2026
Artificial Intelligence

An ode to craftsmanship in software development

March 4, 2026
Add A Comment
Leave A Reply Cancel Reply

Top Posts

Global Resources Outlook 2024 | UNEP

December 6, 20255 Views

The D Brief: DHS shutdown likely; US troops leave al-Tanf; CNO’s plea to industry; Crowded robot-boat market; And a bit more.

February 14, 20264 Views

German Chancellor Merz faces difficult mission to Israel – DW – 12/06/2025

December 6, 20254 Views
Stay In Touch
  • Facebook
  • YouTube
  • TikTok
  • WhatsApp
  • Twitter
  • Instagram
Latest Reviews

Subscribe to Updates

Get the latest tech news from FooBar about tech, design and biz.

PrimeReports.org
Independent global news, analysis & insights.

PrimeReports.org brings you in-depth coverage of geopolitics, markets, technology and risk – with context that helps you understand what really matters.

Editorially independent · Opinions are those of the authors and not investment advice.
Facebook X (Twitter) LinkedIn YouTube
Key Sections
  • World
  • Geopolitics
  • Popular Now
  • Artificial Intelligence
  • Cybersecurity
  • Crypto
All Categories
  • Artificial Intelligence
  • Climate Risks
  • Crypto
  • Cybersecurity
  • Defense
  • Economy
  • Geopolitics
  • Global Markets
  • Healthcare Innovation
  • Politics
  • Popular Now
  • Science
  • Technology
  • World
  • About Us
  • Contact Us
  • Privacy Policy
  • Terms & Conditions
  • Disclaimer
  • Cookie Policy
  • DMCA / Copyright Notice
  • Editorial Policy

Sign up for Prime Reports Briefing – essential stories and analysis in your inbox.

By subscribing you agree to our Privacy Policy. You can opt out anytime.
Latest Stories
  • Beijing’s new missile threat to Australia
  • Frailty sets in far earlier than you’d expect, but you can reverse it
  • Middle East crisis live: Iranian missiles intercepted over Turkey, Qatar and UAE as Israel resumes strikes across Tehran and Beirut | US-Israel war on Iran
© 2026 PrimeReports.org. All rights reserved.
Privacy Terms Contact

Type above and press Enter to search. Press Esc to cancel.