scrna4/6

Analyze a dataset in memory

Here, we’ll analyze the growing dataset by loading it into memory.

This is only possible if it’s not too large.

If your data is large, you’ll likely want to iterate over the dataset to train a model, the topic of the next page (scrna5/6).

import lamindb as ln
import lnschema_bionty as lb
import anndata as ad
💡 lamindb instance: testuser1/test-scrna
ln.track()
💡 notebook imports: anndata==0.9.2 lamindb==0.64.2 lnschema_bionty==0.36.1 scanpy==1.9.6
💡 saved: Transform(uid='mfWKm8OtAzp8z8', name='Analyze a dataset in memory', short_name='scrna4', version='0', type=notebook, updated_at=2023-12-22 11:26:14 UTC, created_by_id=1)
💡 saved: Run(uid='jPRhhOjNmvRGdvrR5UD0', run_at=2023-12-22 11:26:14 UTC, transform_id=4, created_by_id=1)
ln.Dataset.filter().df()
| id | uid | name | description | version | hash | reference | reference_type | transform_id | run_id | artifact_id | initial_version_id | visibility | updated_at | created_by_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | kJLMc6NiYPhmG31eM77C | My versioned scRNA-seq dataset | None | 1 | 9sXda5E7BYiVoDOQkTC0KB | None | None | 1 | 1 | 1.0 | NaN | 1 | 2023-12-22 11:25:33.226808+00:00 | 1 |
| 2 | yIwUH2JUgdJ8sandeDLl | My versioned scRNA-seq dataset | None | 2 | BOAf0T5UbN_iOe3fQDyq | None | None | 2 | 2 | NaN | 1.0 | 1 | 2023-12-22 11:26:01.138779+00:00 | 1 |
dataset = ln.Dataset.filter(name="My versioned scRNA-seq dataset", version="2").one()
dataset.artifacts.df()
| id | uid | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | n_objects | n_observations | transform_id | run_id | initial_version_id | visibility | key_is_virtual | updated_at | created_by_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | kJLMc6NiYPhmG31eM77C | 1 | scrna/conde22.h5ad | .h5ad | AnnData | Human immune cells from Conde22 | None | 57612943 | 9sXda5E7BYiVoDOQkTC0KB | sha1-fl | None | None | 1 | 1 | None | 1 | True | 2023-12-22 11:25:33.223153+00:00 | 1 |
| 2 | VbND5kUT2jBNcK1iICSO | 1 | None | .h5ad | AnnData | 10x reference adata | None | 853388 | eKH1ljAEh7Kd81-o2H4A7w | md5 | None | None | 2 | 2 | None | 1 | True | 2023-12-22 11:26:00.008539+00:00 | 1 |

If the dataset isn’t too large, we can now load it into memory.
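
To gauge whether that's feasible, one can sum the size fields (in bytes) of the underlying artifacts. A minimal sketch, not part of the original tutorial:

# Estimate the on-disk footprint before attempting an in-memory load;
# guard against artifacts with no recorded size.
total_bytes = sum(artifact.size or 0 for artifact in dataset.artifacts.all())
print(f"total size: {total_bytes / 1e6:.1f} MB")  # ~58 MB here, fine to load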

Under the hood, the AnnData objects are concatenated during loading.

How long this takes depends mainly on the number and size of the underlying artifacts and on where they are stored.

If you load the dataset frequently, consider storing a concatenated version of it rather than the individual pieces.
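
A minimal sketch of that caching pattern; the file name is illustrative, and the write-then-register step is an assumption that mirrors the Artifact(path, description=...) call used for the figure below:

# Materialize the concatenation once and register it, so future
# sessions can load a single file instead of re-concatenating.
adata_concat = dataset.load()
adata_concat.write_h5ad("scrna_concat.h5ad")  # anndata's standard writer
ln.Artifact("scrna_concat.h5ad", description="Concatenated scRNA-seq dataset").save()

Here, we simply load the dataset directly: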

adata = dataset.load()

During concatenation, the default is an outer join, as in pandas:

adata
AnnData object with n_obs × n_vars = 1718 × 36503
    obs: 'donor', 'tissue', 'cell_type', 'assay', 'n_genes', 'percent_mito', 'louvain', 'artifact_uid'
    obsm: 'X_umap', 'X_pca'
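
Roughly, this corresponds to the following manual concatenation with anndata; a sketch of what load() appears to do, inferred from the artifact_uid column shown above:

# Concatenate the artifacts by hand: an outer join keeps the union of
# genes, and keys/label record per-cell provenance in .obs.
artifacts = list(dataset.artifacts.all())
adata_manual = ad.concat(
    [artifact.load() for artifact in artifacts],  # load each .h5ad
    join="outer",
    label="artifact_uid",
    keys=[artifact.uid for artifact in artifacts],
)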

The AnnData references the individual artifacts in its .obs annotations:

adata.obs.artifact_uid.cat.categories
Index(['kJLMc6NiYPhmG31eM77C', 'VbND5kUT2jBNcK1iICSO'], dtype='object')
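
This makes it easy to subset the concatenated object back to the cells of a single artifact; for instance (a sketch):

# Recover the cells contributed by the Conde22 artifact.
conde22 = adata[adata.obs.artifact_uid == "kJLMc6NiYPhmG31eM77C"]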

We can easily obtain Ensembl IDs for gene symbols using the lookup object:

genes = lb.Gene.lookup(field="symbol")
genes.itm2b.ensembl_gene_id
'ENSG00000136156'
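
The reverse direction works as a registry query, using the same filter pattern as above (a sketch):

# Look up the gene record for a given Ensembl ID.
lb.Gene.filter(ensembl_gene_id="ENSG00000136156").df()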

Let us create a plot:

import scanpy as sc

sc.pp.pca(adata, n_comps=2)
sc.pl.pca(
    adata,
    color=genes.itm2b.ensembl_gene_id,
    title=(
        f"{genes.itm2b.symbol} / {genes.itm2b.ensembl_gene_id} /"
        f" {genes.itm2b.description}"
    ),
    save="_itm2b",
)
WARNING: saving figure to file figures/pca_itm2b.pdf
[PCA plot colored by ITM2B expression, saved to figures/pca_itm2b.pdf]

Since scanpy already saved the plot as a PDF, we can register it as an artifact and see it in the data flow diagram:

artifact = ln.Artifact("./figures/pca_itm2b.pdf", description="My result on ITM2B")
artifact.save()
artifact.view_flow()
[Data flow diagram tracing the ITM2B figure artifact back to the notebook run]

But given the image is part of the notebook, we can also rely on the report that's created when saving the notebook from the command line:

lamin save <notebook_path>

To see the current notebook, visit: lamin.ai/laminlabs/lamindata/record/core/Transform?uid=mfWKm8OtAzp8z8