Concatenate datasets to a single array store¶
In the previous notebooks, we’ve seen how to incrementally create a collection of scRNA-seq datasets and train models on it.
Sometimes we want to concatenate all datasets into one big array to speed up ad-hoc queries for slices for arbitrary metadata.
This is also what CELLxGENE does to create Census: a number of .h5ad
files are concatenated to give rise to a single tiledbsoma
array store (CELLxGENE: scRNA-seq).
import lamindb as ln
import pandas as pd
import scanpy as sc
import tiledbsoma.io
from functools import reduce
ln.context.uid = "oJN8WmVrxI8m0000"
ln.context.track()
Show code cell output
→ connected lamindb: testuser1/test-scrna
→ notebook imports: lamindb==0.76.3 pandas==2.2.2 scanpy==1.10.2 tiledbsoma==1.13.1
→ created Transform('oJN8WmVrxI8m0000') & created Run('2024-09-04 11:40:11.261519+00:00')
Query the collection of h5ad
files that we’d like to concatenate into a single array.
collection = ln.Collection.get(
name="My versioned scRNA-seq collection", version="2"
)
collection.describe()
Show code cell output
Collection(uid='cxgXRq8O2Zt1obHL0001', version='2', is_latest=True, name='My versioned scRNA-seq collection', hash='dBJLoG6NFZ8WwlWqnfyFdQ', visibility=1, updated_at='2024-09-04 11:39:36 UTC')
Provenance
.created_by = 'testuser1'
.transform = 'Standardize and append a batch of data'
.run = '2024-09-04 11:39:15 UTC'
.input_of_runs = ["'2024-09-04 11:39:46 UTC'", "'2024-09-04 11:40:04 UTC'"]
Feature sets
'var' = 'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'OR4F29', 'OR4F16', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C'
'obs' = 'donor', 'tissue', 'cell_type', 'assay'
Prepare the AnnData objects¶
To concatenate the AnnData
objects into a single tiledbsoma.Experiment
, they need to have the same .var
and .obs
columns.
# load a number of AnnData objects that's small enough to fit into memory
adatas = [artifact.load() for artifact in collection.ordered_artifacts]
# compute the intersection of columns for these objects
var_columns = reduce(pd.Index.intersection, [adata.var.columns for adata in adatas]) # this only affects metadata columns of features (say, gene annotations)
var_raw_columns = reduce(pd.Index.intersection, [adata.raw.var.columns for adata in adatas])
obs_columns = reduce(pd.Index.intersection, [adata.obs.columns for adata in adatas]) # this actually subsets features (dataset dimensions)
Prepare the AnnData
objects for concatenation. Prepare id fields, sanitize index
names, intersect columns, drop .obsp
, .uns
and columns that aren’t part of the intersection.
for i, adata in enumerate(adatas):
del adata.obsp # not supported by tiledbsoma
del adata.uns # not supported by tiledbsoma
adata.obs = adata.obs.filter(obs_columns) # filter columns to intersection
adata.obs["obs_id"] = adata.obs.index # prepare a column for tiledbsoma to use as an index
adata.obs["dataset"] = i
adata.obs.index.name = None
adata.var = adata.var.filter(var_columns) # filter columns to intersection
adata.var["var_id"] = adata.var.index
adata.var.index.name = None
drop_raw_var_columns = adata.raw.var.columns.difference(var_raw_columns)
adata.raw.var.drop(columns=drop_raw_var_columns, inplace=True)
adata.raw.var["var_id"] = adata.raw.var.index
adata.raw.var.index.name = None
Create the array store¶
Save the AnnData
objects in one array store referenced by an Artifact
.
soma_artifact = ln.integrations.save_tiledbsoma_experiment(
adatas,
description="tiledbsoma experiment",
measurement_name="RNA",
obs_id_name="obs_id",
var_id_name="var_id",
append_obsm_varm=True
)
Show code cell output
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/abc.py:119: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.
For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.
For creation, use `anndata.experimental.sparse_dataset(X)` instead.
return _abc_instancecheck(cls, instance)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/abc.py:119: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.
For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.
For creation, use `anndata.experimental.sparse_dataset(X)` instead.
return _abc_instancecheck(cls, instance)
Note
Provenance is tracked by writing the current run.uid
to tiledbsoma.Experiment.obs
as lamin_run_uid
.
If you know tiledbsoma
API, then note that save_tiledbsoma_experiment()
abstracts over both tiledbsoma.io.register_anndatas
and tiledbsoma.io.from_anndata
.
Query the array store¶
Here we query the obs
from the array store.
with soma_artifact.open() as soma_store:
obs = soma_store["obs"]
var = soma_store["ms"]["RNA"]["var"]
obs_columns_store = obs.schema.names
var_columns_store = var.schema.names
obs_store_df = obs.read().concat().to_pandas()
display(obs_store_df)
Show code cell output
soma_joinid | cell_type | obs_id | dataset | lamin_run_uid | |
---|---|---|---|---|---|
0 | 0 | dendritic cell | GCAGGGCTGGATTC-1 | 0 | bSfdfLdLKM4mjtY4PES7 |
1 | 1 | B cell, CD19-positive | CTTTAGTGGTTACG-6 | 0 | bSfdfLdLKM4mjtY4PES7 |
2 | 2 | dendritic cell | TGACTGGAACCATG-7 | 0 | bSfdfLdLKM4mjtY4PES7 |
3 | 3 | B cell, CD19-positive | TCAATCACCCTTCG-8 | 0 | bSfdfLdLKM4mjtY4PES7 |
4 | 4 | effector memory CD4-positive, alpha-beta T cel... | CGTTATACAGTACC-8 | 0 | bSfdfLdLKM4mjtY4PES7 |
... | ... | ... | ... | ... | ... |
1713 | 1713 | naive thymus-derived CD4-positive, alpha-beta ... | Pan_T7991594_CTCACACTCCAGGGCT | 1 | bSfdfLdLKM4mjtY4PES7 |
1714 | 1714 | naive thymus-derived CD4-positive, alpha-beta ... | Pan_T7980358_CGAGCACAGAAGATTC | 1 | bSfdfLdLKM4mjtY4PES7 |
1715 | 1715 | naive thymus-derived CD4-positive, alpha-beta ... | CZINY-0064_AGACCATCACGCTGCA | 1 | bSfdfLdLKM4mjtY4PES7 |
1716 | 1716 | CD8-positive, alpha-beta memory T cell | CZINY-0050_TCGATTTAGATGTTGA | 1 | bSfdfLdLKM4mjtY4PES7 |
1717 | 1717 | naive thymus-derived CD4-positive, alpha-beta ... | CZINY-0064_AGTGTTGTCCGAGCTG | 1 | bSfdfLdLKM4mjtY4PES7 |
1718 rows × 5 columns
Append to the array store¶
Prepare a new AnnData
object to be appended to the store.
adata = ln.core.datasets.anndata_with_obs()
adata.obs_names_make_unique()
adata.var_names_make_unique()
adata.obs["obs_id"] = adata.obs.index
adata.var["var_id"] = adata.var.index
adata.obs["dataset"] = obs_store_df["dataset"].max()
obs_columns_same = [obs_col for obs_col in adata.obs.columns if obs_col in obs_columns_store]
adata.obs = adata.obs[obs_columns_same]
var_columns_same = [var_col for var_col in adata.var.columns if var_col in var_columns_store]
adata.var = adata.var[var_columns_same]
adata.write_h5ad("adata_to_append.h5ad")
Append the AnnData
object from disk by revising soma_artifact
.
soma_artifact = ln.integrations.save_tiledbsoma_experiment(
["adata_to_append.h5ad"],
revises=soma_artifact,
measurement_name="RNA",
obs_id_name="obs_id",
var_id_name="var_id"
)
Show code cell output
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/abc.py:119: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.
For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.
For creation, use `anndata.experimental.sparse_dataset(X)` instead.
return _abc_instancecheck(cls, instance)
Update the array store¶
Add a new embedding to the existing array store.
# read the data matrix
with soma_artifact.open() as soma_store:
ms_rna = soma_store["ms"]["RNA"]
n_obs = len(soma_store["obs"])
n_var = len(ms_rna["var"])
X = ms_rna["X"]["data"].read().coos((n_obs, n_var)).concat().to_scipy()
# calculate PCA embedding from the queried `X`
pca_array = sc.pp.pca(X, n_comps=2)
Open the array store in write mode and add PCA.
with soma_artifact.open(mode="w") as soma_store:
tiledbsoma.io.add_matrix_to_collection(
exp=soma_store,
measurement_name="RNA",
collection_name="obsm",
matrix_name="pca",
matrix_data=pca_array
)
Show code cell output
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/abc.py:119: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.
For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.
For creation, use `anndata.experimental.sparse_dataset(X)` instead.
return _abc_instancecheck(cls, instance)
See array store mutations¶
During the append-to and update operations, the data in the array store was changed. LaminDB automatically tracks these revisions recording the number of objects, hashes, and provenance.
soma_artifact.versions.df()
uid | version | is_latest | description | key | suffix | type | size | hash | n_objects | n_observations | _hash_type | _accessor | visibility | _key_is_virtual | storage_id | transform_id | run_id | created_by_id | updated_at | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||||||
4 | YG2GUpMfx8j10lFI0000 | None | False | tiledbsoma experiment | None | .tiledbsoma | None | 15056696 | jn2YJSyJ5__a6wK3FvPaYg | 149 | None | md5-d | None | 1 | True | 1 | 6 | 6 | 1 | 2024-09-04 11:40:19.030226+00:00 |
5 | YG2GUpMfx8j10lFI0001 | None | False | tiledbsoma experiment | None | .tiledbsoma | None | 15068506 | cKnPYcE89Fuz2A3UPcuMsA | 173 | None | md5-d | None | 1 | True | 1 | 6 | 6 | 1 | 2024-09-04 11:40:19.891351+00:00 |
6 | YG2GUpMfx8j10lFI0002 | None | True | tiledbsoma experiment | None | .tiledbsoma | None | 15089310 | 6PUc942DT5tPQdF4WNJKjw | 182 | None | md5-d | None | 1 | True | 1 | 6 | 6 | 1 | 2024-09-04 11:40:19.892336+00:00 |
Note
For the underlying API, see the tiledbsoma documentation.