scgpt.scbank package
Submodules
scgpt.scbank.data module
- class scgpt.scbank.data.DataTable(name: str, data: Optional[Dataset] = None)[source]
Bases: object
The data structure for a single-cell data table.
- data: Optional[Dataset] = None
- property is_loaded: bool
- name: str
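The relationship between data and is_loaded can be pictured with a small stand-in class. This is a minimal sketch, not the scGPT implementation: DataTableSketch is a hypothetical name, and it assumes is_loaded simply reports whether a data object is attached.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class DataTableSketch:
    """Hypothetical stand-in for DataTable; not the scGPT implementation."""

    name: str
    data: Optional[Any] = None  # the real class expects a Dataset here

    @property
    def is_loaded(self) -> bool:
        # Assumption: "loaded" means a data object is attached.
        return self.data is not None


table = DataTableSketch(name="X")
print(table.is_loaded)  # False: no data attached yet
table.data = [[1, 0, 3]]
print(table.is_loaded)  # True
```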
- class scgpt.scbank.data.MetaInfo(on_disk_path: Optional[Union[Path, str]] = None, on_disk_format: typing_extensions.Literal[json, parquet] = 'json', main_table_key: Optional[str] = None, gene_vocab_md5: Optional[str] = None, study_ids: Optional[List[int]] = None, cell_ids: Optional[List[int]] = None)[source]
Bases: object
The data structure for meta info of a scBank data directory.
- cell_ids: Optional[List[int]] = None
- gene_vocab_md5: Optional[str] = None
- load(path: Optional[Union[Path, str]] = None) None[source]
Load meta info from path. If path is None, load from on_disk_path.
- main_table_key: Optional[str] = None
- on_disk_format: typing_extensions.Literal[json, parquet] = 'json'
- on_disk_path: Optional[Union[Path, str]] = None
- save(path: Optional[Union[Path, str]] = None) None[source]
Save meta info to path. If path is None, save to on_disk_path.
- study_ids: Optional[List[int]] = None
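The fallback from path to on_disk_path in save and load can be sketched as follows. This is an illustration under assumptions, not the library's code: MetaInfoSketch and its _resolve helper are hypothetical, only a subset of the documented fields is modeled, only the JSON format is handled, and the manifest.json file name is assumed.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List, Optional, Union


@dataclass
class MetaInfoSketch:
    """Hypothetical stand-in for MetaInfo; not the scGPT implementation."""

    on_disk_path: Optional[Union[Path, str]] = None
    main_table_key: Optional[str] = None
    study_ids: Optional[List[int]] = None

    def _resolve(self, path: Optional[Union[Path, str]]) -> Path:
        # Fall back to on_disk_path when no explicit path is given.
        target = path if path is not None else self.on_disk_path
        if target is None:
            raise ValueError("no path given and on_disk_path is not set")
        return Path(target)

    def save(self, path: Optional[Union[Path, str]] = None) -> None:
        target = self._resolve(path)
        payload = {k: v for k, v in asdict(self).items() if k != "on_disk_path"}
        # Assumed file name; used here only for illustration.
        (target / "manifest.json").write_text(json.dumps(payload))

    def load(self, path: Optional[Union[Path, str]] = None) -> None:
        target = self._resolve(path)
        payload = json.loads((target / "manifest.json").read_text())
        for key, value in payload.items():
            setattr(self, key, value)


with tempfile.TemporaryDirectory() as tmp:
    MetaInfoSketch(on_disk_path=tmp, main_table_key="X", study_ids=[0, 1]).save()
    restored = MetaInfoSketch(on_disk_path=tmp)
    restored.load()  # path is None, so on_disk_path is used

print(restored.main_table_key, restored.study_ids)  # X [0, 1]
```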
scgpt.scbank.databank module
- class scgpt.scbank.databank.DataBank(meta_info: ~typing.Optional[~scgpt.scbank.data.MetaInfo] = None, data_tables: ~typing.Dict[str, ~scgpt.scbank.data.DataTable] = <factory>, gene_vocab: ~dataclasses.InitVar = <property object>, settings: ~scgpt.scbank.setting.Setting = <factory>)[source]
Bases: object
The data structure for large-scale single cell data containing multiple studies. See https://github.com/subercui/scGPT-release#the-data-structure-for-large-scale-computing.
- append_study(study_id: int, study_data: Union[AnnData, DataBank]) None[source]
Append a study to the current DataBank.
- Parameters:
study_id (int) – Study ID.
study_data (AnnData or DataBank) – Study data.
- custom_filter(field: str, filter_func: callable, inplace: bool = True) Self[source]
Filter the current DataBank by applying a custom filter function to a field.
- Parameters:
field (str) – Field to filter.
filter_func (callable) – Filter function.
inplace (bool) – Whether to also filter inplace.
- Returns:
Filtered DataBank.
- Return type:
DataBank
- filter(study_ids: Optional[List[int]] = None, cell_ids: Optional[List[int]] = None, inplace: bool = True) Self[source]
Filter the current DataBank by study ID and cell ID.
- Parameters:
study_ids (list) – Study IDs to filter.
cell_ids (list) – Cell IDs to filter.
inplace (bool) – Whether to also filter inplace.
- Returns:
Filtered DataBank.
- Return type:
DataBank
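The inplace flag on filter can be read as "modify this object and return it" versus "return a filtered copy". The sketch below illustrates that reading on a plain list of rows. BankSketch is a hypothetical stand-in, and treating None as "no restriction on this field" is an assumption.

```python
import copy
from typing import Dict, List, Optional


class BankSketch:
    """Hypothetical stand-in for DataBank holding (study_id, cell_id) rows."""

    def __init__(self, rows: List[Dict[str, int]]) -> None:
        self.rows = rows

    def filter(
        self,
        study_ids: Optional[List[int]] = None,
        cell_ids: Optional[List[int]] = None,
        inplace: bool = True,
    ) -> "BankSketch":
        # Assumption: None means "do not filter on this field".
        kept = [
            row
            for row in self.rows
            if (study_ids is None or row["study_id"] in study_ids)
            and (cell_ids is None or row["cell_id"] in cell_ids)
        ]
        # Either mutate this object or a shallow copy of it.
        target = self if inplace else copy.copy(self)
        target.rows = kept
        return target


bank = BankSketch(
    [
        {"study_id": 0, "cell_id": 10},
        {"study_id": 0, "cell_id": 11},
        {"study_id": 1, "cell_id": 12},
    ]
)
subset = bank.filter(study_ids=[0], inplace=False)
print(len(subset.rows), len(bank.rows))  # 2 3

same = bank.filter(cell_ids=[12])
print(same is bank, len(bank.rows))  # True 1
```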
- classmethod from_anndata(adata: Union[AnnData, Path, str], vocab: Union[GeneVocab, Mapping[str, int]], to: Union[Path, str], main_table_key: str = 'X', token_col: str = 'gene name', immediate_save: bool = True) Self[source]
Create a DataBank from an AnnData object.
- Parameters:
adata (AnnData) – Annotated data or path to anndata file.
vocab (GeneVocab or Mapping[str, int]) – Gene vocabulary maps gene token to index.
to (Path or str) – Data directory.
main_table_key (str) – This layer/obsm in anndata will be used as the main data table.
token_col (str) – Column name of the gene token.
immediate_save (bool) – Whether to save the data immediately after creation.
- Returns:
DataBank instance.
- Return type:
DataBank
- classmethod from_path(path: Union[Path, str]) Self[source]
Create a DataBank from a directory containing scBank data. NOTE: this method automatically checks whether the md5sum record in manifest.json matches the md5sum of the loaded gene vocabulary.
- Parameters:
path (Path or str) – Directory path.
- Returns:
DataBank instance.
- Return type:
DataBank
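The md5sum check described for from_path amounts to comparing a recorded digest with one computed from the loaded vocabulary. A minimal sketch of that idea follows. The helpers vocab_md5 and check_vocab are hypothetical, and hashing the sorted JSON encoding is an assumption; the library may hash the vocabulary file differently.

```python
import hashlib
import json
from typing import Mapping, Optional


def vocab_md5(vocab: Mapping[str, int]) -> str:
    """Hash a vocabulary deterministically (hypothetical helper)."""
    # Sorting keys makes the digest independent of insertion order.
    encoded = json.dumps(dict(vocab), sort_keys=True).encode("utf-8")
    return hashlib.md5(encoded).hexdigest()


def check_vocab(manifest: Mapping[str, Optional[str]], vocab: Mapping[str, int]) -> None:
    """Raise if the recorded digest does not match the loaded vocabulary."""
    recorded = manifest.get("gene_vocab_md5")
    if recorded is not None and recorded != vocab_md5(vocab):
        raise ValueError("gene vocabulary does not match the manifest record")


vocab = {"GAPDH": 0, "ACTB": 1}
manifest = {"gene_vocab_md5": vocab_md5(vocab)}
check_vocab(manifest, vocab)  # matching digest: passes silently

try:
    check_vocab(manifest, {"GAPDH": 0, "ACTB": 2})
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)  # True
```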
- link(data_path: Union[Path, str]) None[source]
Link to a scBank data directory. This only loads the meta info and performs a validation check; it does not load the data tables. Typically, use the .load_table method to load a data table later.
- load(path: Union[Path, str]) Dataset[source]
Load scBank data from a data directory. Since DataBank is designed to work with large-scale data, this only loads the main data table to memory by default. It also loads the meta info and performs a validation check.
- load_all(path: Union[Path, str]) Dict[str, Dataset][source]
Load scBank data from a data directory. This will load all the data tables to memory.
- load_anndata(adata: AnnData, data_keys: Optional[List[str]] = None, token_col: str = 'gene name') List[DataTable][source]
Load anndata into datatables.
- Parameters:
adata (AnnData) – Annotated data object to load.
data_keys (list of str) – List of data keys to load. If None, all data keys in adata.X, adata.layers and adata.obsm will be loaded.
token_col (str) – Column name of the gene token. Tokens will be converted to indices by self.gene_vocab.
- Returns:
List of data tables loaded.
- Return type:
list of DataTable
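The conversion of gene tokens to indices mentioned for token_col can be illustrated without AnnData. The sketch below is not the library's code: tokenize_rows is a hypothetical helper, the "genes" and "expressions" field names are assumptions, and keeping only non-zero entries per cell is also an assumption.

```python
from typing import Dict, List, Mapping, Sequence


def tokenize_rows(
    matrix: Sequence[Sequence[float]],
    gene_names: Sequence[str],
    vocab: Mapping[str, int],
) -> List[Dict[str, list]]:
    """Convert dense expression rows to gene-index and value lists."""
    # Map each column's gene token to its vocabulary index once.
    # A token missing from the vocabulary raises KeyError here.
    gene_ids = [vocab[name] for name in gene_names]
    rows = []
    for row in matrix:
        # Assumption: only non-zero entries are stored for each cell.
        pairs = [(gid, val) for gid, val in zip(gene_ids, row) if val != 0]
        rows.append(
            {
                "genes": [gid for gid, _ in pairs],
                "expressions": [val for _, val in pairs],
            }
        )
    return rows


vocab = {"<pad>": 0, "GAPDH": 1, "ACTB": 2, "CD3E": 3}
table = tokenize_rows(
    [[5.0, 0.0, 2.0], [0.0, 1.0, 0.0]],
    ["GAPDH", "ACTB", "CD3E"],
    vocab,
)
print(table[0])  # {'genes': [1, 3], 'expressions': [5.0, 2.0]}
print(table[1])  # {'genes': [2], 'expressions': [1.0]}
```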
- property main_table_key: Optional[str]
The main data table key.
- save(path: Optional[Union[Path, str]], replace: bool = False) None[source]
Save scBank data to a data directory.
- Parameters:
path (Path) – Path to save scBank data. If None, will save to the directory at self.meta_info.on_disk_path.
replace (bool) – Whether to replace existing data in the directory.
- sync(attr_keys: Optional[Union[List[str], str]] = None) None[source]
Sync the current DataBank to a data directory. This saves the updated data and vocabulary to files, updates the meta info, and saves the meta info to files. NOTE: This will overwrite the existing data directory.
- Parameters:
attr_keys (list of str) – List of attribute keys to sync. If None, will sync all the attributes with tracked changes.
- track(attr_keys: Optional[Union[List[str], str]] = None) List[source]
Track all changes made to the current DataBank that have not yet been synced to disk. Returns a list of changes.
- Parameters:
attr_keys (list of str) – List of attribute keys to look for changes. If None, all attributes will be checked.
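One way to picture how track and sync relate is a comparison between in-memory attributes and a snapshot of what was last written. The sketch below is purely illustrative: TrackedSketch is hypothetical, the on-disk state is simulated with a dictionary, and returning changed attribute names is an assumption about what "a list of changes" contains.

```python
from typing import Any, Dict, List, Optional, Union

AttrKeys = Optional[Union[List[str], str]]


class TrackedSketch:
    """Hypothetical stand-in showing track/sync semantics on named attributes."""

    def __init__(self, **attrs: Any) -> None:
        self.attrs: Dict[str, Any] = dict(attrs)
        self.on_disk: Dict[str, Any] = dict(attrs)  # simulated disk snapshot

    def _normalize(self, attr_keys: AttrKeys) -> List[str]:
        if attr_keys is None:
            return list(self.attrs)  # None means "check everything"
        if isinstance(attr_keys, str):
            return [attr_keys]
        return list(attr_keys)

    def track(self, attr_keys: AttrKeys = None) -> List[str]:
        # Report attributes whose in-memory value differs from the snapshot.
        return [
            key
            for key in self._normalize(attr_keys)
            if key in self.attrs and self.attrs[key] != self.on_disk.get(key)
        ]

    def sync(self, attr_keys: AttrKeys = None) -> None:
        # Write only the attributes that track() reports as changed.
        for key in self.track(attr_keys):
            self.on_disk[key] = self.attrs[key]


bank = TrackedSketch(gene_vocab={"GAPDH": 0}, main_table_key="X")
bank.attrs["main_table_key"] = "counts"
print(bank.track())  # ['main_table_key']
bank.sync("main_table_key")
print(bank.track())  # []
```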
- update_datatables(new_tables: List[DataTable], use_names: Optional[List[str]] = None, overwrite: bool = False, immediate_save: Optional[bool] = None) None[source]
Update the data tables in the DataBank with new data tables.
- Parameters:
new_tables (list of DataTable) – New data tables to update.
use_names (list of str) – Names of the new data tables to use. If not provided, will use the names of the new data tables.
overwrite (bool) – Whether to overwrite the existing data tables.
immediate_save (bool) – Whether to save the data immediately after updating. Will save to self.meta_info.on_disk_path. If not provided, will follow self.settings.immediate_save instead. Default to None.
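The interplay of use_names and overwrite can be sketched on a plain dictionary of tables. This is an illustration under assumptions, not the library's behavior: update_tables is a hypothetical helper, tables are modeled as (name, data) pairs, and raising ValueError on a name clash when overwrite is False is assumed.

```python
from typing import Any, Dict, List, Optional, Tuple


def update_tables(
    existing: Dict[str, Any],
    new_tables: List[Tuple[str, Any]],
    use_names: Optional[List[str]] = None,
    overwrite: bool = False,
) -> None:
    """Merge (name, data) pairs into `existing` in place (hypothetical helper)."""
    # Without use_names, each table keeps its own name.
    names = use_names if use_names is not None else [name for name, _ in new_tables]
    if len(names) != len(new_tables):
        raise ValueError("use_names must have one entry per new table")
    # Validate first so a name clash leaves `existing` untouched.
    clashes = [name for name in names if name in existing]
    if clashes and not overwrite:
        raise ValueError(f"tables already exist: {clashes}")
    for name, (_, data) in zip(names, new_tables):
        existing[name] = data


tables = {"X": [[1, 2]]}
update_tables(tables, [("counts", [[3, 4]])], use_names=["raw"])
print(sorted(tables))  # ['X', 'raw']

try:
    update_tables(tables, [("X", [[0, 0]])])
    clash_raised = False
except ValueError:
    clash_raised = True
print(clash_raised)  # True

update_tables(tables, [("X", [[0, 0]])], overwrite=True)
print(tables["X"])  # [[0, 0]]
```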
scgpt.scbank.monitor module
scgpt.scbank.setting module
- class scgpt.scbank.setting.Setting(remove_zero_rows: bool = True, max_tokenize_batch_size: int = 1000000.0, immediate_save: bool = False)[source]
Bases: object
The configuration for scBank DataBank.
- immediate_save: bool = False
- max_tokenize_batch_size: int = 1000000.0
- remove_zero_rows: bool = True