scgpt package

Modules

scgpt.data_collator

class scgpt.data_collator.DataCollator(do_padding: bool = True, pad_token_id: int | None = None, pad_value: int = 0, do_mlm: bool = True, do_binning: bool = True, mlm_probability: float = 0.15, mask_value: int = -1, max_length: int | None = None, sampling: bool = True, keep_first_n_tokens: int = 1)[source]

Bases: object

Data collator for the mask value learning task. It pads the sequences to the maximum length in the batch and masks the gene expression values.

Parameters:
  • do_padding (bool) – whether to pad the sequences to the max length.

  • pad_token_id (int, optional) – the token id to use for padding. This is required if do_padding is True.

  • pad_value (int) – the value to use for padding the expression values to the max length.

  • do_mlm (bool) – whether to do masking with MLM.

  • do_binning (bool) – whether to bin the expression values.

  • mlm_probability (float) – the probability of masking with MLM.

  • mask_value (int) – the value to fill at the expression positions that are masked.

  • max_length (int, optional) – the maximum length of the sequences. This is required if do_padding is True.

  • sampling (bool) – whether to do sampling instead of truncation if length > max_length.

  • keep_first_n_tokens (int) – the number of tokens at the beginning of the sequence to keep unchanged during sampling. This is useful when special tokens have been added to the beginning of the sequence. Defaults to 1.

do_binning: bool = True
do_mlm: bool = True
do_padding: bool = True
keep_first_n_tokens: int = 1
mask_value: int = -1
max_length: int | None = None
mlm_probability: float = 0.15
pad_token_id: int | None = None
pad_value: int = 0
sampling: bool = True
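
For illustration, a minimal usage sketch (not part of the official docs). The collator is typically passed to a DataLoader as its collate function; the "genes"/"expressions" field names below are assumptions about the tokenized input format and may differ in your version.

    import torch
    from scgpt.data_collator import DataCollator

    collator = DataCollator(
        do_padding=True,
        pad_token_id=0,         # id of the <pad> token in the vocabulary
        pad_value=0,            # filler for padded expression values
        do_mlm=True,
        do_binning=True,
        mlm_probability=0.15,
        mask_value=-1,          # fill value at masked expression positions
        max_length=1200,
        sampling=True,
        keep_first_n_tokens=1,  # e.g. keep a leading <cls> token untouched
    )

    # Each example is a dict of gene-token ids and expression values
    # (field names assumed here).
    examples = [
        {"genes": torch.tensor([1, 5, 9]), "expressions": torch.tensor([0.0, 2.3, 7.1])},
        {"genes": torch.tensor([1, 2]), "expressions": torch.tensor([0.0, 4.2])},
    ]
    batch = collator(examples)  # padded, binned, and MLM-masked tensors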

scgpt.data_sampler

class scgpt.data_sampler.SubsetSequentialSampler(indices: Sequence[int])[source]

Bases: Sampler

Samples elements sequentially from a given list of indices, without replacement.

Parameters:
  • indices (sequence) – a sequence of indices
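
A short usage sketch, assuming only standard PyTorch pieces: the sampler fixes the iteration order to exactly the indices given.

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from scgpt.data_sampler import SubsetSequentialSampler

    dataset = TensorDataset(torch.arange(10))
    sampler = SubsetSequentialSampler(indices=[4, 0, 7, 2])
    loader = DataLoader(dataset, batch_size=2, sampler=sampler)
    for (batch,) in loader:
        print(batch)  # rows 4,0 then 7,2 -- the given order, no shuffling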

class scgpt.data_sampler.SubsetsBatchSampler(subsets: List[Sequence[int]], batch_size: int, intra_subset_shuffle: bool = True, inter_subset_shuffle: bool = True, drop_last: bool = False)[source]

Bases: Sampler[List[int]]

Samples batches of indices from a list of index subsets. Each subset represents a data subset and is sampled without replacement, either randomly or sequentially. Notably, each batch contains indices from a single subset only. This sampler is for scenarios where samples need to be drawn from multiple subsets separately.

Parameters:
  • subsets (List[Sequence[int]]) – A list of subsets of indices.

  • batch_size (int) – Size of mini-batch.

  • intra_subset_shuffle (bool) – If True, the sampler will shuffle the indices within each subset.

  • inter_subset_shuffle (bool) – If True, the sampler will shuffle the order of subsets.

  • drop_last (bool) – If True, the sampler will drop the last batch if its size would be less than batch_size.
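
A hedged sketch of batching with this sampler; the two subsets below are illustrative (e.g. two sequencing batches).

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from scgpt.data_sampler import SubsetsBatchSampler

    dataset = TensorDataset(torch.arange(10))
    subsets = [list(range(0, 6)), list(range(6, 10))]  # two data subsets
    batch_sampler = SubsetsBatchSampler(
        subsets,
        batch_size=4,
        intra_subset_shuffle=True,  # shuffle indices within each subset
        inter_subset_shuffle=True,  # shuffle the order of subset batches
        drop_last=False,
    )
    loader = DataLoader(dataset, batch_sampler=batch_sampler)
    for (batch,) in loader:
        pass  # every batch draws its indices from a single subset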

scgpt.loss

scgpt.loss.criterion_neg_log_bernoulli(input: Tensor, target: Tensor, mask: Tensor) Tensor[source]

Compute the negative log-likelihood under a Bernoulli distribution.

scgpt.loss.masked_mse_loss(input: Tensor, target: Tensor, mask: Tensor) Tensor[source]

Compute the masked MSE loss between input and target.

scgpt.loss.masked_relative_error(input: Tensor, target: Tensor, mask: LongTensor) Tensor[source]

Compute the masked relative error between input and target.
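
All three losses restrict the computation to masked positions. A minimal sketch of the masked-MSE semantics, assuming the convention that nonzero mask entries mark the positions that count (this mirrors, but is not, the library implementation):

    import torch

    def masked_mse_sketch(input: torch.Tensor, target: torch.Tensor,
                          mask: torch.Tensor) -> torch.Tensor:
        # Squared error is averaged over masked positions only; unmasked
        # positions contribute neither loss nor gradient.
        mask = mask.float()
        return (((input - target) ** 2) * mask).sum() / mask.sum()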

scgpt.preprocess

class scgpt.preprocess.Preprocessor(use_key: str | None = None, filter_gene_by_counts: int | bool = False, filter_cell_by_counts: int | bool = False, normalize_total: float | bool = 10000.0, result_normed_key: str | None = 'X_normed', log1p: bool = False, result_log1p_key: str = 'X_log1p', subset_hvg: int | bool = False, hvg_use_key: str | None = None, hvg_flavor: str = 'seurat_v3', binning: int | None = None, result_binned_key: str = 'X_binned')[source]

Bases: object

Prepare data into training, validation, and test splits. Normalize raw expression values and apply binning or other transforms to match the preset model input format.

check_logged(adata: AnnData, obs_key: str | None = None) bool[source]

Check if the data is already log1p transformed.

Parameters:
  • adata (AnnData) – The AnnData object to preprocess.

  • obs_key (str, optional) – The key of AnnData.obs to use for batch information. This arg is used in the highly variable gene selection step.
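
An illustrative end-to-end sketch, assuming the preprocessor is invoked as a callable on an AnnData object (as in the scGPT tutorials); the file path and parameter values below are placeholders.

    import scanpy as sc
    from scgpt.preprocess import Preprocessor

    adata = sc.read_h5ad("my_data.h5ad")  # placeholder path

    preprocessor = Preprocessor(
        use_key="X",              # layer holding raw counts
        filter_gene_by_counts=3,  # drop genes with fewer than 3 counts
        normalize_total=1e4,
        result_normed_key="X_normed",
        log1p=True,
        result_log1p_key="X_log1p",
        subset_hvg=1200,          # keep 1200 highly variable genes
        hvg_flavor="seurat_v3",
        binning=51,               # number of expression bins
        result_binned_key="X_binned",
    )
    preprocessor(adata)  # results written into adata.layers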

scgpt.preprocess.binning(row: ndarray | Tensor, n_bins: int) ndarray | Tensor[source]

Bin the values of row into n_bins bins.
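
A hedged sketch of the quantile-binning idea (not the exact library code): nonzero values are digitized against their own quantiles, and zeros stay in bin 0.

    import numpy as np

    def binning_sketch(row: np.ndarray, n_bins: int) -> np.ndarray:
        # Bin edges come from the quantiles of the row's nonzero values,
        # so each cell is binned relative to its own expression range.
        binned = np.zeros_like(row, dtype=np.int64)
        nonzero = row > 0
        if nonzero.any():
            edges = np.quantile(row[nonzero], np.linspace(0, 1, n_bins - 1))
            binned[nonzero] = np.digitize(row[nonzero], edges)
        return binned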

scgpt.trainer

class scgpt.trainer.SeqDataset(data: Dict[str, Tensor])[source]

Bases: Dataset

scgpt.trainer.define_wandb_metrcis()[source]
scgpt.trainer.eval_testdata(model: Module, adata_t: AnnData, gene_ids, vocab, config, logger, include_types: List[str] = ['cls']) Dict | None[source]

Evaluate the model on the test dataset adata_t.

scgpt.trainer.evaluate(model: Module, loader: DataLoader, vocab, criterion_gep_gepc, criterion_dab, criterion_cls, device, config, epoch) float[source]

Evaluate the model on the evaluation data.

scgpt.trainer.predict(model: Module, loader: DataLoader, vocab, config, device) float[source]

Make predictions on the evaluation data.

scgpt.trainer.prepare_data(tokenized_train, tokenized_valid, train_batch_labels, valid_batch_labels, config, epoch, train_celltype_labels=None, valid_celltype_labels=None, sort_seq_batch=False) Tuple[Dict[str, Tensor]][source]
scgpt.trainer.prepare_dataloader(data_pt: Dict[str, Tensor], batch_size: int, shuffle: bool = False, intra_domain_shuffle: bool = False, drop_last: bool = False, num_workers: int = 0, per_seq_batch_sample: bool = False) DataLoader[source]
scgpt.trainer.test(model: Module, adata: DataLoader, gene_ids, vocab, config, device, logger) float[source]
scgpt.trainer.train(model: Module, loader: DataLoader, vocab, criterion_gep_gepc, criterion_dab, criterion_cls, scaler, optimizer, scheduler, device, config, logger, epoch) None[source]

Train the model for one epoch.
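
A small sketch of how the data helpers fit together; the key names inside the data dict (e.g. "gene_ids") are assumptions based on the signatures above, and real dicts would come from prepare_data().

    import torch
    from scgpt.trainer import SeqDataset, prepare_dataloader

    # A toy tokenized data dict (key names assumed).
    data_pt = {
        "gene_ids": torch.randint(1, 100, (8, 16)),
        "values": torch.rand(8, 16),
    }
    dataset = SeqDataset(data_pt)  # indexes the dict row-wise
    loader = prepare_dataloader(data_pt, batch_size=4, shuffle=True)
    for batch in loader:
        pass  # each batch would be fed to train(...) / evaluate(...)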