scgpt package
Modules
- scgpt.model package
- Submodules
- scgpt.model.dsbn module
- scgpt.model.generation_model module
- scgpt.model.grad_reverse module
- scgpt.model.model module
- Module contents
- scgpt.scbank package
- Submodules
- scgpt.scbank.data module
- scgpt.scbank.databank module
DataBank
DataBank.append_study()
DataBank.batch_from_anndata()
DataBank.custom_filter()
DataBank.data_tables
DataBank.delete_study()
DataBank.filter()
DataBank.from_anndata()
DataBank.from_path()
DataBank.gene_vocab
DataBank.link()
DataBank.load()
DataBank.load_all()
DataBank.load_anndata()
DataBank.load_table()
DataBank.main_data
DataBank.main_table_key
DataBank.meta_info
DataBank.save()
DataBank.settings
DataBank.sync()
DataBank.track()
DataBank.update_datatables()
- scgpt.scbank.monitor module
- scgpt.scbank.setting module
- Module contents
- scgpt.tasks package
- Submodules
- scgpt.tasks.cell_emb module
- scgpt.tasks.grn module
GeneEmbedding
GeneEmbedding.average_vector_results()
GeneEmbedding.cluster_definitions_as_df()
GeneEmbedding.compute_similarities()
GeneEmbedding.generate_network()
GeneEmbedding.generate_vector()
GeneEmbedding.generate_weighted_vector()
GeneEmbedding.get_adata()
GeneEmbedding.get_metagenes()
GeneEmbedding.get_similar_genes()
GeneEmbedding.plot_metagene()
GeneEmbedding.plot_metagenes_scores()
GeneEmbedding.plot_similarities()
GeneEmbedding.read_embedding()
GeneEmbedding.read_vector()
GeneEmbedding.score_metagenes()
- Module contents
- scgpt.tokenizer package
- scgpt.utils package
scgpt.data_collator
- class scgpt.data_collator.DataCollator(do_padding: bool = True, pad_token_id: int | None = None, pad_value: int = 0, do_mlm: bool = True, do_binning: bool = True, mlm_probability: float = 0.15, mask_value: int = -1, max_length: int | None = None, sampling: bool = True, keep_first_n_tokens: int = 1)[source]
Bases: object
Data collator for the masked-value learning task. It pads the sequences to the maximum length in the batch and masks the gene expression values.
- Parameters:
do_padding (bool) – whether to pad the sequences to the max length.
pad_token_id (int, optional) – the token id to use for padding. This is required if do_padding is True.
pad_value (int) – the value to use for padding the expression values to the max length.
do_mlm (bool) – whether to do masking with MLM.
do_binning (bool) – whether to bin the expression values.
mlm_probability (float) – the probability of masking with MLM.
mask_value (int) – the value to fill in at the expression positions that are masked.
max_length (int, optional) – the maximum length of the sequences. This is required if do_padding is True.
sampling (bool) – whether to sample instead of truncating if length > max_length.
keep_first_n_tokens (int) – the number of tokens at the beginning of the sequence to keep unchanged from sampling. This is useful when special tokens have been added to the beginning of the sequence. Defaults to 1.
- do_binning: bool = True
- do_mlm: bool = True
- do_padding: bool = True
- keep_first_n_tokens: int = 1
- mask_value: int = -1
- max_length: int | None = None
- mlm_probability: float = 0.15
- pad_token_id: int | None = None
- pad_value: int = 0
- sampling: bool = True
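A minimal usage sketch follows. The per-example keys ("genes", "expressions") and the pad token id are assumptions about the expected input format, not guaranteed by this page.

```python
import torch
from torch.utils.data import DataLoader
from scgpt.data_collator import DataCollator

# Hypothetical toy examples: each pairs gene token ids with raw expression
# values of the same length (the dict key names here are assumptions).
examples = [
    {"genes": torch.tensor([0, 5, 12, 7]), "expressions": torch.tensor([0.0, 1.2, 3.4, 0.7])},
    {"genes": torch.tensor([0, 3, 9]), "expressions": torch.tensor([0.0, 2.1, 0.5])},
]

collator = DataCollator(
    do_padding=True,
    pad_token_id=1,         # assumed id of the <pad> token in the gene vocabulary
    pad_value=0,
    do_mlm=True,
    mlm_probability=0.15,   # mask ~15% of expression positions
    mask_value=-1,
    max_length=8,
    keep_first_n_tokens=1,  # e.g. keep a leading <cls> token unmasked
)

loader = DataLoader(examples, batch_size=2, collate_fn=collator)
batch = next(iter(loader))  # dict of padded gene ids and masked expression values
```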
scgpt.data_sampler
- class scgpt.data_sampler.SubsetSequentialSampler(indices: Sequence[int])[source]
Bases: Sampler
Samples elements sequentially from a given list of indices, without replacement.
- Parameters:
indices (sequence) – a sequence of indices
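For illustration, a small sketch of how this sampler plugs into a standard DataLoader (the toy dataset is hypothetical):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from scgpt.data_sampler import SubsetSequentialSampler

# Toy dataset of 100 rows; visit only rows 10..19, in their given order.
dataset = TensorDataset(torch.arange(100))
sampler = SubsetSequentialSampler(list(range(10, 20)))
loader = DataLoader(dataset, batch_size=4, sampler=sampler)
for (batch,) in loader:
    print(batch)  # tensor([10, 11, 12, 13]), tensor([14, 15, 16, 17]), tensor([18, 19])
```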
- class scgpt.data_sampler.SubsetsBatchSampler(subsets: List[Sequence[int]], batch_size: int, intra_subset_shuffle: bool = True, inter_subset_shuffle: bool = True, drop_last: bool = False)[source]
Bases: Sampler[List[int]]
Samples batches of indices from a list of subsets of indices. Each subset of indices represents a data subset and is sampled without replacement, either randomly or sequentially. In particular, each batch only contains indices from a single subset. This sampler is for the scenario where samples need to be drawn from multiple subsets separately.
- Parameters:
subsets (List[Sequence[int]]) – A list of subsets of indices.
batch_size (int) – Size of mini-batch.
intra_subset_shuffle (bool) – If True, the sampler will shuffle the indices within each subset.
inter_subset_shuffle (bool) – If True, the sampler will shuffle the order of subsets.
drop_last (bool) – If True, the sampler will drop the last batch if its size would be less than batch_size.
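A sketch of using this as a batch sampler, e.g. to keep every mini-batch within a single sequencing batch (toy data; the subset contents are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from scgpt.data_sampler import SubsetsBatchSampler

# Two disjoint index subsets (e.g. two sequencing batches). Each emitted
# mini-batch draws indices from exactly one subset.
dataset = TensorDataset(torch.arange(10))
batch_sampler = SubsetsBatchSampler(
    subsets=[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]],
    batch_size=2,
    intra_subset_shuffle=True,   # shuffle indices inside each subset
    inter_subset_shuffle=True,   # shuffle the order of subsets
    drop_last=False,
)
loader = DataLoader(dataset, batch_sampler=batch_sampler)
for (batch,) in loader:
    print(batch)  # every batch contains indices from a single subset only
```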
scgpt.loss
- scgpt.loss.criterion_neg_log_bernoulli(input: Tensor, target: Tensor, mask: Tensor) → Tensor [source]
Compute the negative log-likelihood of the Bernoulli distribution at the positions selected by mask.
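A minimal sketch of the quantity being computed, assuming input holds Bernoulli probabilities in (0, 1), target holds binary outcomes, and mask selects the positions to score (the packaged function may differ in reduction or numerical-stability details):

```python
import torch

def neg_log_bernoulli_sketch(
    input: torch.Tensor, target: torch.Tensor, mask: torch.Tensor
) -> torch.Tensor:
    # Bernoulli NLL at each position: -[t * log(p) + (1 - t) * log(1 - p)],
    # averaged over the positions selected by `mask`.
    p = input[mask]
    t = target[mask].float()
    return -(t * torch.log(p) + (1 - t) * torch.log(1 - p)).mean()
```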
scgpt.preprocess
- class scgpt.preprocess.Preprocessor(use_key: str | None = None, filter_gene_by_counts: int | bool = False, filter_cell_by_counts: int | bool = False, normalize_total: float | bool = 10000.0, result_normed_key: str | None = 'X_normed', log1p: bool = False, result_log1p_key: str = 'X_log1p', subset_hvg: int | bool = False, hvg_use_key: str | None = None, hvg_flavor: str = 'seurat_v3', binning: int | None = None, result_binned_key: str = 'X_binned')[source]
Bases: object
Prepares data into training, validation, and test splits. Normalizes raw expression values and applies binning or other transforms to produce the preset model input format.
- check_logged(adata: AnnData, obs_key: str | None = None) → bool [source]
Check if the data is already log1p transformed.
Args:
- adata (AnnData): The AnnData object to preprocess.
- obs_key (str, optional): The key of AnnData.obs to use for batch information. This arg is used in the highly variable gene selection step.
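Putting the class together, a usage sketch (parameter values are illustrative, the file path is hypothetical, and the call assumes raw counts in adata.X plus a "batch" column in adata.obs):

```python
import scanpy as sc
from scgpt.preprocess import Preprocessor

adata = sc.read_h5ad("pbmc.h5ad")  # hypothetical input with raw counts in .X

preprocessor = Preprocessor(
    use_key="X",                  # read raw counts from adata.X
    filter_gene_by_counts=3,      # drop genes detected in fewer than 3 cells
    normalize_total=1e4,          # library-size normalize each cell to 10,000
    result_normed_key="X_normed",
    log1p=True,
    result_log1p_key="X_log1p",
    subset_hvg=1200,              # keep the 1,200 most highly variable genes
    hvg_flavor="seurat_v3",
    binning=51,                   # discretize expression into 51 bins
    result_binned_key="X_binned",
)
preprocessor(adata, batch_key="batch")  # results are stored in adata.layers
```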
scgpt.trainer
- scgpt.trainer.eval_testdata(model: Module, adata_t: AnnData, gene_ids, vocab, config, logger, include_types: List[str] = ['cls']) → Dict | None [source]
Evaluate the model on the test dataset stored in adata_t.
- scgpt.trainer.evaluate(model: Module, loader: DataLoader, vocab, criterion_gep_gepc, criterion_dab, criterion_cls, device, config, epoch) → float [source]
Evaluate the model on the evaluation data.
- scgpt.trainer.predict(model: Module, loader: DataLoader, vocab, config, device) → float [source]
Evaluate the model on the evaluation data.
- scgpt.trainer.prepare_data(tokenized_train, tokenized_valid, train_batch_labels, valid_batch_labels, config, epoch, train_celltype_labels=None, valid_celltype_labels=None, sort_seq_batch=False) → Tuple[Dict[str, Tensor]] [source]
- scgpt.trainer.prepare_dataloader(data_pt: Dict[str, Tensor], batch_size: int, shuffle: bool = False, intra_domain_shuffle: bool = False, drop_last: bool = False, num_workers: int = 0, per_seq_batch_sample: bool = False) → DataLoader [source]
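A sketch of prepare_dataloader, assuming data_pt is a dict of aligned, equal-length tensors (the tensor keys below are illustrative, not necessarily the exact keys prepare_data produces):

```python
import torch
from scgpt import trainer

# Illustrative dict of aligned tensors (32 cells, 16 tokens each).
data_pt = {
    "gene_ids": torch.randint(0, 100, (32, 16)),
    "values": torch.rand(32, 16),
}
loader = trainer.prepare_dataloader(
    data_pt,
    batch_size=8,
    shuffle=True,
    drop_last=False,
)
for batch in loader:
    pass  # each batch is a dict of tensors with leading dimension 8
```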