scRNA-seq#

You’ll learn how to manage a growing number of scRNA-seq data shards as a single queryable dataset.

Along the way, you’ll see how to create reports, leverage data lineage, and query individual data shards stored as files.

If you’re only interested in using a large curated scRNA-seq dataset, see the CELLxGENE Census guide.

Here, you will:

create an Artifact from an AnnData object and seed a growing Dataset with it (current page)
append a new data batch (a new .h5ad file) and create a new version of this dataset
query & inspect artifacts individually by metadata
load the joint dataset into memory and save analytical results
iterate over the dataset, train a model, and store a derived representation
discuss converting a number of artifacts into a single TileDB SOMA store of the same data

Setup#

!lamin init --storage ./test-scrna --schema bionty

import lamindb as ln
import lnschema_bionty as lb

ln.settings.verbosity = "hint"
lb.settings.organism = "human"
ln.track()

Ingest an artifact#

Let us look at the standardized data of Conde et al., Science (2022), available from CZ CELLxGENE.

By calling anndata_human_immune_cells(), we load a subsampled version of the dataset from CZ CELLxGENE and pre-populate the corresponding LaminDB registries: Feature, ULabel, Gene, CellType, CellLine, ExperimentalFactor.

adata = ln.dev.datasets.anndata_human_immune_cells(populate_registries=True)
adata

This AnnData object is standardized using the CZI single-cell-curation validator with the same public ontologies that underlie lnschema_bionty. Because registries are pre-populated, validation passes.

Note

In the next guide, we’ll curate a non-standardized dataset.

The gene registry provides metadata for each of the 36k genes measured in the AnnData:

lb.Gene.filter().df()

id	uid	symbol	stable_id	ensembl_gene_id	ncbi_gene_ids	biotype	description	synonyms	organism_id	bionty_source_id	updated_at	created_by_id
1	nG5QMZBh5VxD	MIR1302-2HG	None	ENSG00000243485		lncRNA	MIR1302-2 host gene [Source:HGNC Symbol;Acc:HG...		1	9	2023-12-22 11:25:22.135866+00:00	1
2	A7BOVXA3f7IS	FAM138A	None	ENSG00000237613	645520|124906933	lncRNA	family with sequence similarity 138 member A [...	F379	1	9	2023-12-22 11:25:22.135913+00:00	1
3	fXkFfHvbJRDx	OR4F5	None	ENSG00000186092	79501	protein_coding	olfactory receptor family 4 subfamily F member...		1	9	2023-12-22 11:25:22.135952+00:00	1
4	On07aEStsWXn	None	None	ENSG00000238009		lncRNA	novel transcript		1	9	2023-12-22 11:25:22.135988+00:00	1
5	JNhwMkpEuypB	None	None	ENSG00000239945		lncRNA	novel transcript		1	9	2023-12-22 11:25:22.136024+00:00	1
...	...	...	...	...	...	...	...	...	...	...	...	...
36386	RSPZ2lN4Rg0e	None	None	ENSG00000277836		protein_coding	None		1	9	2023-12-22 11:25:24.953283+00:00	1
36387	2WtI0o0bTNpX	None	None	ENSG00000278633		protein_coding	None		1	9	2023-12-22 11:25:24.953316+00:00	1
36388	1s0o6Iw6On1x	None	None	ENSG00000276017		protein_coding	None		1	9	2023-12-22 11:25:24.953349+00:00	1
36389	eKgp8a8BbdBt	None	None	ENSG00000278817		protein_coding	None		1	9	2023-12-22 11:25:24.953381+00:00	1
36390	Q8faDAE0IkSq	None	None	ENSG00000277196		protein_coding	proline dehydrogenase 1		1	9	2023-12-22 11:25:24.953414+00:00	1

36390 rows × 12 columns

When we create an Artifact object from an AnnData object, its features are linked automatically:

artifact = ln.Artifact.from_anndata(
    adata,
    field=lb.Gene.ensembl_gene_id,  # field to validate and link features
    key="scrna/conde22.h5ad",  # optional: a relative path in your default storage
    description="Human immune cells from Conde22",  # optional: a description
)
artifact

artifact.save()
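
To make the role of the field argument more concrete, here is a minimal sketch in plain Python of what linking features through an identifier field amounts to: each identifier in the variable index is matched against a registry keyed by that field. This is a conceptual illustration, not LaminDB's implementation. The `link_features` helper is hypothetical, the three registry entries are copied from the gene table above, and the last identifier in `var_index` is made up to show a miss.

```python
# Toy gene registry keyed by Ensembl gene ID; the three entries are taken
# from the registry table shown above.
gene_registry = {
    "ENSG00000243485": "MIR1302-2HG",
    "ENSG00000237613": "FAM138A",
    "ENSG00000186092": "OR4F5",
}


def link_features(var_index, registry):
    """Return (linked, unmatched) for the identifiers in var_index.

    Hypothetical helper, not part of lamindb. `linked` maps each matched
    identifier to its gene symbol; `unmatched` lists identifiers that the
    registry does not know.
    """
    linked = {gid: registry[gid] for gid in var_index if gid in registry}
    unmatched = [gid for gid in var_index if gid not in registry]
    return linked, unmatched


# Stand-in for adata.var.index; the last identifier is invented.
var_index = ["ENSG00000243485", "ENSG00000186092", "ENSG_NOT_IN_REGISTRY"]
linked, unmatched = link_features(var_index, gene_registry)
```

In this sketch, an empty `unmatched` list would correspond to a fully validated set of features.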

The artifact has 2 linked feature sets, one for measured genes and one for measured metadata:

artifact.features

Let’s now annotate the artifact with labels:

experimental_factors = lb.ExperimentalFactor.lookup()
organism = lb.Organism.lookup()
features = ln.Feature.lookup()

artifact.labels.add(organism.human, feature=features.organism)
artifact.labels.add(
    experimental_factors.single_cell_rna_sequencing, feature=features.assay
)
artifact.labels.add(adata.obs.cell_type, feature=features.cell_type)
artifact.labels.add(adata.obs.assay, feature=features.assay)
artifact.labels.add(adata.obs.tissue, feature=features.tissue)
artifact.labels.add(adata.obs.donor, feature=features.donor)

The artifact is now validated & queryable by everything we linked:

artifact.describe()

Artifact(uid='kJLMc6NiYPhmG31eM77C', key='scrna/conde22.h5ad', suffix='.h5ad', accessor='AnnData', description='Human immune cells from Conde22', size=57612943, hash='9sXda5E7BYiVoDOQkTC0KB', hash_type='sha1-fl', visibility=1, key_is_virtual=True, updated_at=2023-12-22 11:25:31 UTC)

Provenance:
  🗃️ storage: Storage(uid='kFrUCXTg', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-12-22 11:25:11 UTC, created_by_id=1)
  💫 transform: Transform(uid='Nv48yAceNSh8z8', name='scRNA-seq', short_name='scrna', version='0', type=notebook, updated_at=2023-12-22 11:25:15 UTC, created_by_id=1)
  👣 run: Run(uid='78KMypviRCyumUuJ2p3M', run_at=2023-12-22 11:25:15 UTC, transform_id=1, created_by_id=1)
  👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-22 11:25:11 UTC)
Features:
  var: FeatureSet(uid='IJdCwYowHCLhFRDjL6Cm', n=36390, type='number', registry='bionty.Gene', hash='rMZltwoBCMdVPVR8x6nJ', updated_at=2023-12-22 11:25:30 UTC, created_by_id=1)
    'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'OR4F29', 'None', 'OR4F16', 'None', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'None', ...
  obs: FeatureSet(uid='c1iGKBV0EhgauNo9AXwU', n=4, registry='core.Feature', hash='g1m0hshChNIbAUcLGIWa', updated_at=2023-12-22 11:25:31 UTC, created_by_id=1)
    🔗 cell_type (32, bionty.CellType): 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', ...
    🔗 assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 3' v3', '10x 5' v2', '10x 5' v1'
    🔗 tissue (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
    🔗 donor (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
  external: FeatureSet(uid='5wmaXji2xoIIjTW0c8Li', n=1, registry='core.Feature', hash='SMcPr9uZdZXIUbrmVZgh', updated_at=2023-12-22 11:25:32 UTC, created_by_id=1)
    🔗 organism (1, bionty.Organism): 'human'
Labels:
  🏷️ organism (1, bionty.Organism): 'human'
  🏷️ tissues (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
  🏷️ cell_types (32, bionty.CellType): 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 3' v3', '10x 5' v2', '10x 5' v1'
  🏷️ ulabels (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
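
Being queryable by everything we linked means artifacts can be looked up by their labels. The following is a conceptual sketch in plain Python of such a lookup, not LaminDB's query API: the `LabelIndex` class is hypothetical, the labels for `scrna/conde22.h5ad` are taken from the output above, and the second artifact key is invented for illustration.

```python
from collections import defaultdict


class LabelIndex:
    """Hypothetical in-memory index from (feature, label) pairs to artifact keys."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, artifact_key, feature, labels):
        # Register every label of this feature for the given artifact.
        for label in labels:
            self._index[(feature, label)].add(artifact_key)

    def query(self, **criteria):
        # Return the sorted artifact keys that satisfy all feature=label criteria.
        matches = [
            self._index.get((feature, label), set())
            for feature, label in criteria.items()
        ]
        return sorted(set.intersection(*matches)) if matches else []


index = LabelIndex()
# Labels taken from the describe() output above.
index.add("scrna/conde22.h5ad", "tissue", ["blood", "spleen", "lung"])
index.add("scrna/conde22.h5ad", "organism", ["human"])
# A second, made-up artifact to show that queries discriminate.
index.add("scrna/hypothetical.h5ad", "tissue", ["liver"])
index.add("scrna/hypothetical.h5ad", "organism", ["human"])

human_spleen = index.query(organism="human", tissue="spleen")
human_any = index.query(organism="human")
```

Here `human_spleen` contains only the Conde22 artifact, while `human_any` contains both keys.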

Seed a dataset#

Let’s create a first version of a dataset that will encompass many h5ad files when more data is ingested.

Note

To see the result of the incremental growth, take a look at the CELLxGENE Census guide for an instance with ~1k h5ads and ~50 million cells.

dataset = ln.Dataset(artifact, name="My versioned scRNA-seq dataset", version="1")
dataset.save()
dataset.labels.add_from(artifact)  # seed the initial labels of the dataset
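
The idea of a dataset that starts with one file and grows by versions can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions, not how LaminDB stores datasets: the `DatasetVersion` class is hypothetical, versions are assumed to be integer strings, and the key of the second shard is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    """Hypothetical immutable snapshot of a dataset: name, version, and shard keys."""

    name: str
    version: str
    artifact_keys: tuple

    def append(self, artifact_key):
        # Appending never mutates the current snapshot; it returns the next version.
        return DatasetVersion(
            name=self.name,
            version=str(int(self.version) + 1),
            artifact_keys=self.artifact_keys + (artifact_key,),
        )


# Version 1 is seeded with a single artifact, as in the cell above.
v1 = DatasetVersion("My versioned scRNA-seq dataset", "1", ("scrna/conde22.h5ad",))
# The key of the second shard is invented for illustration.
v2 = v1.append("scrna/new_batch.h5ad")
```

Keeping each version immutable means that version 1 stays available unchanged after new shards are appended.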

In version 1, the dataset and the artifact hold the same data. They are nonetheless tracked independently and can each be queried through their own registry:

dataset.describe()

Dataset(uid='kJLMc6NiYPhmG31eM77C', name='My versioned scRNA-seq dataset', version='1', hash='9sXda5E7BYiVoDOQkTC0KB', visibility=1, updated_at=2023-12-22 11:25:33 UTC)

Provenance:
  💫 transform: Transform(uid='Nv48yAceNSh8z8', name='scRNA-seq', short_name='scrna', version='0', type=notebook, updated_at=2023-12-22 11:25:15 UTC, created_by_id=1)
  👣 run: Run(uid='78KMypviRCyumUuJ2p3M', run_at=2023-12-22 11:25:15 UTC, transform_id=1, created_by_id=1)
  📎 artifact: Artifact(uid='kJLMc6NiYPhmG31eM77C', key='scrna/conde22.h5ad', suffix='.h5ad', accessor='AnnData', description='Human immune cells from Conde22', size=57612943, hash='9sXda5E7BYiVoDOQkTC0KB', hash_type='sha1-fl', visibility=1, key_is_virtual=True, updated_at=2023-12-22 11:25:33 UTC, storage_id=1, transform_id=1, run_id=1, created_by_id=1)
  👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-22 11:25:11 UTC)
Features:
  var: FeatureSet(uid='IJdCwYowHCLhFRDjL6Cm', n=36390, type='number', registry='bionty.Gene', hash='rMZltwoBCMdVPVR8x6nJ', updated_at=2023-12-22 11:25:30 UTC, created_by_id=1)
    'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'OR4F29', 'None', 'OR4F16', 'None', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'None', ...
  obs: FeatureSet(uid='c1iGKBV0EhgauNo9AXwU', n=4, registry='core.Feature', hash='g1m0hshChNIbAUcLGIWa', updated_at=2023-12-22 11:25:31 UTC, created_by_id=1)
    🔗 cell_type (32, bionty.CellType): 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', ...
    🔗 assay (4, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1', 'single-cell RNA sequencing'
    🔗 tissue (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
    🔗 donor (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
  external: FeatureSet(uid='5wmaXji2xoIIjTW0c8Li', n=1, registry='core.Feature', hash='SMcPr9uZdZXIUbrmVZgh', updated_at=2023-12-22 11:25:32 UTC, created_by_id=1)
    🔗 organism (1, bionty.Organism): 'human'
Labels:
  🏷️ organism (1, bionty.Organism): 'human'
  🏷️ tissues (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
  🏷️ cell_types (32, bionty.CellType): 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1', 'single-cell RNA sequencing'
  🏷️ ulabels (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...

Access the underlying artifact like so:

dataset.artifact

Artifact(uid='kJLMc6NiYPhmG31eM77C', key='scrna/conde22.h5ad', suffix='.h5ad', accessor='AnnData', description='Human immune cells from Conde22', size=57612943, hash='9sXda5E7BYiVoDOQkTC0KB', hash_type='sha1-fl', visibility=1, key_is_virtual=True, updated_at=2023-12-22 11:25:33 UTC, storage_id=1, transform_id=1, run_id=1, created_by_id=1)

View the data flow:

dataset.view_flow()

[Figure: data flow graph rendered by dataset.view_flow()]