scRNA-seq#
You’ll learn how to manage a growing number of scRNA-seq data shards as a single queryable dataset.
Along the way, you’ll see how to create reports, leverage data lineage, and query individual data shards stored as files.
If you’re only interested in using a large curated scRNA-seq dataset, see the CELLxGENE Census guide.
Here, you will:
create an Artifact from an AnnData object and seed a growing Dataset with it (current page)
append a new data batch (a new .h5ad file) and create a new version of this dataset
load the joint dataset into memory and save analytical results
iterate over the dataset, train a model, store a derived representation
discuss converting a number of artifacts to a single TileDB SOMA store of the same data
Setup#
!lamin init --storage ./test-scrna --schema bionty
✅ saved: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-22 11:25:11 UTC)
✅ saved: Storage(uid='kFrUCXTg', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-12-22 11:25:11 UTC, created_by_id=1)
💡 loaded instance: testuser1/test-scrna
💡 did not register local instance on hub
import lamindb as ln
import lnschema_bionty as lb
ln.settings.verbosity = "hint"
lb.settings.organism = "human"
ln.track()
💡 lamindb instance: testuser1/test-scrna
💡 notebook imports: lamindb==0.64.2 lnschema_bionty==0.36.1
💡 saved: Transform(uid='Nv48yAceNSh8z8', name='scRNA-seq', short_name='scrna', version='0', type=notebook, updated_at=2023-12-22 11:25:15 UTC, created_by_id=1)
💡 saved: Run(uid='78KMypviRCyumUuJ2p3M', run_at=2023-12-22 11:25:15 UTC, transform_id=1, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_78KMypviRCyumUuJ2p3M
Ingest an artifact#
Let us look at the standardized data of Conde et al., Science (2022), available from CZ CELLxGENE.
By calling anndata_human_immune_cells(), we load a subsampled version of the dataset from CZ CELLxGENE and pre-populate the corresponding LaminDB registries: Feature, ULabel, Gene, CellType, CellLine, ExperimentalFactor.
adata = ln.dev.datasets.anndata_human_immune_cells(populate_registries=True)
adata
AnnData object with n_obs × n_vars = 1648 × 36503
obs: 'donor', 'tissue', 'cell_type', 'assay'
var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
uns: 'default_embedding'
obsm: 'X_umap'
This AnnData object is standardized using the CZI single-cell-curation validator with the same public ontologies that underlie lnschema_bionty. Because registries are pre-populated, validation passes.
Note
In the next guide, we’ll curate a non-standardized dataset.
The gene registry provides metadata for each of the 36k genes measured in the AnnData:
lb.Gene.filter().df()
id | uid | symbol | stable_id | ensembl_gene_id | ncbi_gene_ids | biotype | description | synonyms | organism_id | bionty_source_id | updated_at | created_by_id
---|---|---|---|---|---|---|---|---|---|---|---|---
1 | nG5QMZBh5VxD | MIR1302-2HG | None | ENSG00000243485 | | lncRNA | MIR1302-2 host gene [Source:HGNC Symbol;Acc:HG... | | 1 | 9 | 2023-12-22 11:25:22.135866+00:00 | 1
2 | A7BOVXA3f7IS | FAM138A | None | ENSG00000237613 | 645520|124906933 | lncRNA | family with sequence similarity 138 member A [... | F379 | 1 | 9 | 2023-12-22 11:25:22.135913+00:00 | 1
3 | fXkFfHvbJRDx | OR4F5 | None | ENSG00000186092 | 79501 | protein_coding | olfactory receptor family 4 subfamily F member... | | 1 | 9 | 2023-12-22 11:25:22.135952+00:00 | 1
4 | On07aEStsWXn | None | None | ENSG00000238009 | | lncRNA | novel transcript | | 1 | 9 | 2023-12-22 11:25:22.135988+00:00 | 1
5 | JNhwMkpEuypB | None | None | ENSG00000239945 | | lncRNA | novel transcript | | 1 | 9 | 2023-12-22 11:25:22.136024+00:00 | 1
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
36386 | RSPZ2lN4Rg0e | None | None | ENSG00000277836 | | protein_coding | None | | 1 | 9 | 2023-12-22 11:25:24.953283+00:00 | 1
36387 | 2WtI0o0bTNpX | None | None | ENSG00000278633 | | protein_coding | None | | 1 | 9 | 2023-12-22 11:25:24.953316+00:00 | 1
36388 | 1s0o6Iw6On1x | None | None | ENSG00000276017 | | protein_coding | None | | 1 | 9 | 2023-12-22 11:25:24.953349+00:00 | 1
36389 | eKgp8a8BbdBt | None | None | ENSG00000278817 | | protein_coding | None | | 1 | 9 | 2023-12-22 11:25:24.953381+00:00 | 1
36390 | Q8faDAE0IkSq | None | None | ENSG00000277196 | | protein_coding | proline dehydrogenase 1 | | 1 | 9 | 2023-12-22 11:25:24.953414+00:00 | 1
36390 rows × 12 columns
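To get a feel for what the registry holds, it can help to summarize a column such as biotype. The sketch below is only an illustration: it hard-codes the five leading rows printed above as plain tuples so that it runs without a database. On a live instance, the same kind of count could be computed on the DataFrame itself with the usual pandas tools.

```python
from collections import Counter

# First five rows of the gene registry as printed above:
# (ensembl_gene_id, symbol, biotype). Hard-coded for illustration only.
rows = [
    ("ENSG00000243485", "MIR1302-2HG", "lncRNA"),
    ("ENSG00000237613", "FAM138A", "lncRNA"),
    ("ENSG00000186092", "OR4F5", "protein_coding"),
    ("ENSG00000238009", None, "lncRNA"),
    ("ENSG00000239945", None, "lncRNA"),
]

# Count how often each biotype occurs among these rows.
biotype_counts = Counter(biotype for _, _, biotype in rows)

# Collect the Ensembl IDs of genes that have no symbol.
unnamed = [gene_id for gene_id, symbol, _ in rows if symbol is None]

print(biotype_counts)
print(unnamed)
```

Genes without a symbol remain identifiable through their Ensembl gene ID, which is one reason to pick that field for validation.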
When we create an Artifact object from an AnnData, we automatically link its features:
artifact = ln.Artifact.from_anndata(
    adata,
    field=lb.Gene.ensembl_gene_id,  # field to validate and link features
    key="scrna/conde22.h5ad",  # optional: a relative path in your default storage
    description="Human immune cells from Conde22",  # optional: a description
)
artifact
💡 path content will be copied to default storage upon `save()` with key 'scrna/conde22.h5ad'
💡 parsing feature names of X stored in slot 'var'
✅ 36390 terms (99.70%) are validated for ensembl_gene_id
❗ 113 terms (0.30%) are not validated for ensembl_gene_id: ENSG00000269933, ENSG00000261737, ENSG00000259834, ENSG00000256374, ENSG00000263464, ENSG00000203812, ENSG00000272196, ENSG00000272880, ENSG00000270188, ENSG00000287116, ENSG00000237133, ENSG00000224739, ENSG00000227902, ENSG00000239467, ENSG00000272551, ENSG00000280374, ENSG00000236886, ENSG00000229352, ENSG00000286601, ENSG00000227021, ...
✅ linked: FeatureSet(uid='IJdCwYowHCLhFRDjL6Cm', n=36390, type='number', registry='bionty.Gene', hash='rMZltwoBCMdVPVR8x6nJ', created_by_id=1)
💡 parsing feature names of slot 'obs'
✅ 4 terms (100.00%) are validated for name
✅ linked: FeatureSet(uid='c1iGKBV0EhgauNo9AXwU', n=4, registry='core.Feature', hash='g1m0hshChNIbAUcLGIWa', created_by_id=1)
Artifact(uid='kJLMc6NiYPhmG31eM77C', key='scrna/conde22.h5ad', suffix='.h5ad', accessor='AnnData', description='Human immune cells from Conde22', size=57612943, hash='9sXda5E7BYiVoDOQkTC0KB', hash_type='sha1-fl', visibility=1, key_is_virtual=True, storage_id=1, transform_id=1, run_id=1, created_by_id=1)
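Conceptually, validating feature names against a registry field amounts to a set-membership check. The following is a minimal sketch of that idea in plain Python, not LaminDB's implementation. `validate_ids` and the toy registry are hypothetical names introduced for illustration. The registry entries are Ensembl IDs from the gene table shown earlier, and the last feature name is one the log reports as not validated.

```python
def validate_ids(ids, registry):
    """Split ids into those found in the registry and those missing from it."""
    known = set(registry)
    validated = [i for i in ids if i in known]
    missing = [i for i in ids if i not in known]
    return validated, missing


# Toy registry: three Ensembl gene IDs that appear in the gene registry.
registry = ["ENSG00000243485", "ENSG00000237613", "ENSG00000186092"]

# Toy feature names standing in for the 'var' slot; the last one is
# among the terms the log lists as not validated.
var_names = ["ENSG00000243485", "ENSG00000186092", "ENSG00000269933"]

validated, missing = validate_ids(var_names, registry)
fraction = len(validated) / len(var_names)
print(f"{len(validated)} validated ({fraction:.2%}), {len(missing)} missing: {missing}")
```

In this picture, only the validated identifiers end up in the linked feature set, which matches the log above: the feature set has n=36390, the number of validated terms.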
artifact.save()
✅ saved 2 feature sets for slots: 'var','obs'
✅ storing artifact 'kJLMc6NiYPhmG31eM77C' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/kJLMc6NiYPhmG31eM77C.h5ad'
The artifact has 2 linked feature sets, one for measured genes and one for measured metadata:
artifact.features
Features:
var: FeatureSet(uid='IJdCwYowHCLhFRDjL6Cm', n=36390, type='number', registry='bionty.Gene', hash='rMZltwoBCMdVPVR8x6nJ', updated_at=2023-12-22 11:25:30 UTC, created_by_id=1)
'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'OR4F29', 'None', 'OR4F16', 'None', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'None', ...
obs: FeatureSet(uid='c1iGKBV0EhgauNo9AXwU', n=4, registry='core.Feature', hash='g1m0hshChNIbAUcLGIWa', updated_at=2023-12-22 11:25:31 UTC, created_by_id=1)
🔗 cell_type (0, bionty.CellType):
🔗 assay (0, bionty.ExperimentalFactor):
🔗 tissue (0, bionty.Tissue):
🔗 donor (0, core.ULabel):
Let’s now annotate the artifact with labels:
experimental_factors = lb.ExperimentalFactor.lookup()
organism = lb.Organism.lookup()
features = ln.Feature.lookup()
artifact.labels.add(organism.human, feature=features.organism)
artifact.labels.add(
    experimental_factors.single_cell_rna_sequencing, feature=features.assay
)
artifact.labels.add(adata.obs.cell_type, feature=features.cell_type)
artifact.labels.add(adata.obs.assay, feature=features.assay)
artifact.labels.add(adata.obs.tissue, feature=features.tissue)
artifact.labels.add(adata.obs.donor, feature=features.donor)
✅ linked new feature 'organism' together with new feature set FeatureSet(uid='5wmaXji2xoIIjTW0c8Li', n=1, registry='core.Feature', hash='SMcPr9uZdZXIUbrmVZgh', updated_at=2023-12-22 11:25:32 UTC, created_by_id=1)
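Passing a whole obs column, as in the last four calls, appears to annotate the artifact with the distinct values of that column rather than with one label per cell. The label counts reported by describe() are far smaller than the 1648 cells. A rough mental model in plain Python follows, with the caveat that `add_labels` and `labels_by_feature` are hypothetical names and the per-cell assignments are toy data.

```python
from collections import defaultdict

# Toy per-cell metadata standing in for adata.obs. The label values occur
# in the outputs on this page; their assignment to cells is made up.
obs = {
    "tissue": ["blood", "spleen", "blood", "lung"],
    "donor": ["D496", "621B", "D496", "A29"],
}

# feature name -> set of distinct labels linked to the artifact
labels_by_feature = defaultdict(set)


def add_labels(values, feature):
    """Record the distinct values of a column under the given feature name."""
    labels_by_feature[feature].update(values)


for column, values in obs.items():
    add_labels(values, feature=column)

print({feature: sorted(labels) for feature, labels in labels_by_feature.items()})
```

Because labels are stored per feature, the same artifact can later be found by any of them, for example by tissue or by donor.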
The artifact is now validated & queryable by everything we linked:
artifact.describe()
Artifact(uid='kJLMc6NiYPhmG31eM77C', key='scrna/conde22.h5ad', suffix='.h5ad', accessor='AnnData', description='Human immune cells from Conde22', size=57612943, hash='9sXda5E7BYiVoDOQkTC0KB', hash_type='sha1-fl', visibility=1, key_is_virtual=True, updated_at=2023-12-22 11:25:31 UTC)
Provenance:
🗃️ storage: Storage(uid='kFrUCXTg', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-12-22 11:25:11 UTC, created_by_id=1)
💫 transform: Transform(uid='Nv48yAceNSh8z8', name='scRNA-seq', short_name='scrna', version='0', type=notebook, updated_at=2023-12-22 11:25:15 UTC, created_by_id=1)
👣 run: Run(uid='78KMypviRCyumUuJ2p3M', run_at=2023-12-22 11:25:15 UTC, transform_id=1, created_by_id=1)
👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-22 11:25:11 UTC)
Features:
var: FeatureSet(uid='IJdCwYowHCLhFRDjL6Cm', n=36390, type='number', registry='bionty.Gene', hash='rMZltwoBCMdVPVR8x6nJ', updated_at=2023-12-22 11:25:30 UTC, created_by_id=1)
'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'OR4F29', 'None', 'OR4F16', 'None', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'None', ...
obs: FeatureSet(uid='c1iGKBV0EhgauNo9AXwU', n=4, registry='core.Feature', hash='g1m0hshChNIbAUcLGIWa', updated_at=2023-12-22 11:25:31 UTC, created_by_id=1)
🔗 cell_type (32, bionty.CellType): 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', ...
🔗 assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 3' v3', '10x 5' v2', '10x 5' v1'
🔗 tissue (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
🔗 donor (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
external: FeatureSet(uid='5wmaXji2xoIIjTW0c8Li', n=1, registry='core.Feature', hash='SMcPr9uZdZXIUbrmVZgh', updated_at=2023-12-22 11:25:32 UTC, created_by_id=1)
🔗 organism (1, bionty.Organism): 'human'
Labels:
🏷️ organism (1, bionty.Organism): 'human'
🏷️ tissues (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
🏷️ cell_types (32, bionty.CellType): 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', ...
🏷️ experimental_factors (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 3' v3', '10x 5' v2', '10x 5' v1'
🏷️ ulabels (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
Seed a dataset#
Let’s create a first version of a dataset that will encompass many h5ad files when more data is ingested.
Note
To see the result of the incremental growth, take a look at the CELLxGENE Census guide for an instance with ~1k h5ads and ~50 million cells.
dataset = ln.Dataset(artifact, name="My versioned scRNA-seq dataset", version="1")
dataset.save()
dataset.labels.add_from(artifact) # seed the initial labels of the dataset
💡 initializing versioning for this dataset! create future versions of it using ln.Dataset(..., is_new_version_of=old_dataset)
💡 transferring cell_type
💡 transferring assay
💡 transferring tissue
💡 transferring donor
💡 transferring organism
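The hint above points to is_new_version_of for creating later versions. As a mental model only (this is not LaminDB's data model), a versioned dataset can be pictured as an immutable record that lists its artifacts and points back to its predecessor. `DatasetVersion` and `new_version` are hypothetical names, and the key of the second file is made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DatasetVersion:
    name: str
    version: str
    artifact_keys: Tuple[str, ...]
    previous: Optional["DatasetVersion"] = None


def new_version(old: DatasetVersion, added_key: str) -> DatasetVersion:
    """Return a new version holding all old artifacts plus one more."""
    return DatasetVersion(
        name=old.name,
        version=str(int(old.version) + 1),
        artifact_keys=old.artifact_keys + (added_key,),
        previous=old,
    )


v1 = DatasetVersion("My versioned scRNA-seq dataset", "1", ("scrna/conde22.h5ad",))
v2 = new_version(v1, "scrna/new_batch.h5ad")  # hypothetical key for a second shard

print(v2.version, v2.artifact_keys)
print(v1.artifact_keys)  # the earlier version is left unchanged
```

Keeping earlier versions immutable means that analyses run against version 1 stay reproducible after new shards are appended.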
For this version 1 of the dataset, dataset and artifact match each other. But they’re independently tracked and queryable through their registries:
dataset.describe()
Dataset(uid='kJLMc6NiYPhmG31eM77C', name='My versioned scRNA-seq dataset', version='1', hash='9sXda5E7BYiVoDOQkTC0KB', visibility=1, updated_at=2023-12-22 11:25:33 UTC)
Provenance:
💫 transform: Transform(uid='Nv48yAceNSh8z8', name='scRNA-seq', short_name='scrna', version='0', type=notebook, updated_at=2023-12-22 11:25:15 UTC, created_by_id=1)
👣 run: Run(uid='78KMypviRCyumUuJ2p3M', run_at=2023-12-22 11:25:15 UTC, transform_id=1, created_by_id=1)
📎 artifact: Artifact(uid='kJLMc6NiYPhmG31eM77C', key='scrna/conde22.h5ad', suffix='.h5ad', accessor='AnnData', description='Human immune cells from Conde22', size=57612943, hash='9sXda5E7BYiVoDOQkTC0KB', hash_type='sha1-fl', visibility=1, key_is_virtual=True, updated_at=2023-12-22 11:25:33 UTC, storage_id=1, transform_id=1, run_id=1, created_by_id=1)
👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-22 11:25:11 UTC)
Features:
var: FeatureSet(uid='IJdCwYowHCLhFRDjL6Cm', n=36390, type='number', registry='bionty.Gene', hash='rMZltwoBCMdVPVR8x6nJ', updated_at=2023-12-22 11:25:30 UTC, created_by_id=1)
'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'OR4F29', 'None', 'OR4F16', 'None', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'None', ...
obs: FeatureSet(uid='c1iGKBV0EhgauNo9AXwU', n=4, registry='core.Feature', hash='g1m0hshChNIbAUcLGIWa', updated_at=2023-12-22 11:25:31 UTC, created_by_id=1)
🔗 cell_type (32, bionty.CellType): 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', ...
🔗 assay (4, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1', 'single-cell RNA sequencing'
🔗 tissue (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
🔗 donor (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
external: FeatureSet(uid='5wmaXji2xoIIjTW0c8Li', n=1, registry='core.Feature', hash='SMcPr9uZdZXIUbrmVZgh', updated_at=2023-12-22 11:25:32 UTC, created_by_id=1)
🔗 organism (1, bionty.Organism): 'human'
Labels:
🏷️ organism (1, bionty.Organism): 'human'
🏷️ tissues (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
🏷️ cell_types (32, bionty.CellType): 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', ...
🏷️ experimental_factors (4, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1', 'single-cell RNA sequencing'
🏷️ ulabels (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
Access the underlying artifact like so:
dataset.artifact
Artifact(uid='kJLMc6NiYPhmG31eM77C', key='scrna/conde22.h5ad', suffix='.h5ad', accessor='AnnData', description='Human immune cells from Conde22', size=57612943, hash='9sXda5E7BYiVoDOQkTC0KB', hash_type='sha1-fl', visibility=1, key_is_virtual=True, updated_at=2023-12-22 11:25:33 UTC, storage_id=1, transform_id=1, run_id=1, created_by_id=1)
See data flow:
dataset.view_flow()