scgpt.tokenizer package
Submodules
scgpt.tokenizer.gene_tokenizer module
- class scgpt.tokenizer.gene_tokenizer.GeneVocab(gene_list_or_vocab: Union[List[str], Vocab], specials: Optional[List[str]] = None, special_first: bool = True, default_token: Optional[str] = '<pad>')
Bases: Vocab
Vocabulary for genes.
- classmethod from_dict(token2idx: Dict[str, int], default_token: Optional[str] = '<pad>') → Self
Load the vocabulary from a dictionary.
- Parameters:
token2idx (Dict[str, int]) – Dictionary mapping tokens to indices.
- classmethod from_file(file_path: Union[Path, str]) → Self
Load the vocabulary from a file. The file should be either a pickle or a JSON file containing a token-to-index mapping.
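As a minimal sketch of the JSON layout described above, the snippet below writes a token-to-index mapping to disk and reads it back using only the standard library. It does not call scGPT itself. The gene symbols and the file name `vocab.json` are hypothetical, chosen only for illustration; a file of this shape is what the description of `from_file` appears to expect.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical token-to-index mapping; the gene symbols are illustrative only.
token2idx = {"<pad>": 0, "<cls>": 1, "GENE_A": 2, "GENE_B": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = Path(tmp_dir) / "vocab.json"  # hypothetical file name
    vocab_path.write_text(json.dumps(token2idx))

    # Reading the file back yields a Dict[str, int], the same shape
    # that from_dict documents for its token2idx argument.
    loaded = json.loads(vocab_path.read_text())

print(loaded["<pad>"])
```

The same dictionary could be handed to `from_dict` directly when no file is involved.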
- property pad_token: Optional[str]
Get the pad token.
- set_default_token(default_token: str) → None
Set the default token.
- Parameters:
default_token (str) – Default token.
- training: bool
- scgpt.tokenizer.gene_tokenizer.get_default_gene_vocab() → GeneVocab
Get the default gene vocabulary, consisting of gene symbols and ids.
- scgpt.tokenizer.gene_tokenizer.pad_batch(batch: List[Tuple], max_len: int, vocab: Vocab, pad_token: str = '<pad>', pad_value: int = 0, cls_appended: bool = True) → Dict[str, Tensor]
Pad a batch of data. Returns a dictionary of padded gene ids and counts.
- Parameters:
batch (list) – A list of tuples (gene_id, count).
max_len (int) – The maximum length of the batch.
vocab (Vocab) – The vocabulary containing the pad token.
pad_token (str) – The token to pad with.
- Returns:
A dictionary of gene_id and count.
- Return type:
Dict[str, torch.Tensor]
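The padding step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: the helper name `pad_pairs_sketch`, the output keys `"genes"` and `"values"`, and the choice to truncate sequences longer than `max_len` are all assumptions made for the example, and plain lists stand in for tensors.

```python
from typing import Dict, List, Sequence, Tuple


def pad_pairs_sketch(
    batch: List[Tuple[Sequence[int], Sequence[float]]],
    max_len: int,
    pad_id: int,
    pad_value: float = 0,
) -> Dict[str, List[list]]:
    """Pad (gene_ids, counts) pairs to a common length (hypothetical helper)."""
    genes, values = [], []
    for gene_ids, counts in batch:
        # Assumption: over-long samples are simply truncated here.
        gene_ids = list(gene_ids)[:max_len]
        counts = list(counts)[:max_len]
        n_pad = max_len - len(gene_ids)
        # Gene ids are padded with the pad token's index, counts with pad_value.
        genes.append(gene_ids + [pad_id] * n_pad)
        values.append(counts + [pad_value] * n_pad)
    # The key names are an assumption for this sketch.
    return {"genes": genes, "values": values}


batch = [([5, 7, 9], [1.0, 2.0, 3.0]), ([4], [6.0])]
padded = pad_pairs_sketch(batch, max_len=2, pad_id=0)
print(padded)
```

In the documented function, the pad index would come from looking up `pad_token` in `vocab` rather than being passed in directly.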
- scgpt.tokenizer.gene_tokenizer.random_mask_value(values: Union[Tensor, ndarray], mask_ratio: float = 0.15, mask_value: int = -1, pad_value: int = 0) → Tensor
Randomly mask a batch of data.
- Parameters:
values (array-like) – A batch of tokenized data, with shape (batch_size, n_features).
mask_ratio (float) – The ratio of genes to mask, defaults to 0.15.
mask_value (int) – The value to mask with, defaults to -1.
pad_value (int) – The padding value in values; padded positions are kept unchanged.
- Returns:
A tensor of masked data.
- Return type:
torch.Tensor
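The parameters above can be illustrated with a small pure-Python sketch. It is not the library's code: the helper name `random_mask_sketch`, the `seed` argument, and the rule of masking `int(n_non_pad * mask_ratio)` positions per row are assumptions made for the example, and nested lists stand in for a tensor.

```python
import random
from typing import List, Optional


def random_mask_sketch(
    values: List[List[int]],
    mask_ratio: float = 0.15,
    mask_value: int = -1,
    pad_value: int = 0,
    seed: Optional[int] = None,  # hypothetical argument, for reproducibility
) -> List[List[int]]:
    """Mask a fraction of the non-padding entries in each row."""
    rng = random.Random(seed)
    masked = [list(row) for row in values]  # copy so the input is untouched
    for row in masked:
        # Only positions that are not padding are eligible for masking.
        candidates = [i for i, v in enumerate(row) if v != pad_value]
        n_mask = int(len(candidates) * mask_ratio)  # assumed rounding rule
        for i in rng.sample(candidates, n_mask):
            row[i] = mask_value
    return masked


batch = [[3, 1, 4, 1, 0, 0], [2, 7, 1, 8, 2, 8]]
masked = random_mask_sketch(batch, mask_ratio=0.5, seed=0)
print(masked)
```

Which positions are masked depends on the random draw; only the number of masked entries per row and the untouched padding are fixed.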
- scgpt.tokenizer.gene_tokenizer.tokenize_and_pad_batch(data: ndarray, gene_ids: ndarray, max_len: int, vocab: Vocab, pad_token: str, pad_value: int, append_cls: bool = True, include_zero_gene: bool = False, cls_token: str = '<cls>', return_pt: bool = True) → Dict[str, Tensor]
Tokenize and pad a batch of data. Returns a dictionary of padded gene ids and counts.
- scgpt.tokenizer.gene_tokenizer.tokenize_batch(data: ndarray, gene_ids: ndarray, return_pt: bool = True, append_cls: bool = True, include_zero_gene: bool = False, cls_id: int = '<cls>') → List[Tuple[Union[Tensor, ndarray]]]
Tokenize a batch of data. Returns a list of tuples (gene_id, count).
- Parameters:
data (array-like) – A batch of data, with shape (batch_size, n_features). n_features equals the number of all genes.
gene_ids (array-like) – A batch of gene ids, with shape (n_features,).
return_pt (bool) – Whether to return torch tensors of gene_ids and counts, defaults to True.
- Returns:
A list of tuples (gene_id, count) for genes with nonzero expression.
- Return type:
list
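A rough sketch of this tokenization, written with plain lists instead of arrays, might look like the following. The helper name `tokenize_rows_sketch` is hypothetical, and placing the cls entry first with a count of 0 is an assumption of this example rather than something the description above states.

```python
from typing import List, Sequence, Tuple


def tokenize_rows_sketch(
    data: Sequence[Sequence[float]],
    gene_ids: Sequence[int],
    append_cls: bool = True,
    include_zero_gene: bool = False,
    cls_id: int = -1,  # placeholder id; a real one would come from the vocabulary
) -> List[Tuple[List[int], List[float]]]:
    """Turn each row of a (batch_size, n_features) matrix into (gene_ids, counts)."""
    tokenized = []
    for row in data:
        if include_zero_gene:
            keep = list(range(len(row)))
        else:
            # Keep only the genes with nonzero expression in this row.
            keep = [i for i, v in enumerate(row) if v != 0]
        genes = [gene_ids[i] for i in keep]
        counts = [row[i] for i in keep]
        if append_cls:
            # Assumption: the cls entry goes first and carries a count of 0.
            genes = [cls_id] + genes
            counts = [0] + counts
        tokenized.append((genes, counts))
    return tokenized


data = [[0, 2, 0, 5], [1, 0, 0, 0]]
gene_ids = [10, 11, 12, 13]
tokens = tokenize_rows_sketch(data, gene_ids, cls_id=1)
print(tokens)
```

Because rows differ in how many genes are expressed, the resulting sequences have different lengths, which is why a padding step usually follows.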