Enable extraction of gene embeddings from geneformer (averaging of gene embeddings across all cells) #452
More context: this is about getting 'gene embeddings' from Geneformer. Right now we can pull the hidden states from the last layer, but we will need to be able to pull them from an arbitrary embedding layer. (The second-to-last layer is handled by @jstjohn's description above.) Reference: the Geneformer Hugging Face EmbExtractor code, https://geneformer.readthedocs.io/en/latest/_modules/geneformer/emb_extractor.html#EmbExtractor
Context from Birkan Gökbağ: Geneformer's embedding extractions rely on the input datasets.

- Gene embedding extraction: for every cell, we obtain the generated embeddings of each cell's expressed genes, i.e., average gene embeddings across all cells.
- Cell embedding extraction: since the input is a cell (i.e., a sorted series of tokens), the output is already a representation of the cell. These token embeddings are averaged to represent the cell embedding (not including the CLS token embedding), i.e., average the embeddings of the input cell directly.
- Optional aggregation by cell annotation: the previous analyses are applied per cell-type annotation. Since the embedding process is limited to the selected annotation subsets, the embeddings are already representative of that state only; those embeddings are then aggregated using the mean or median to represent the state. This is the scenario where you basically take the mean, or median, of the means.
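A minimal numpy sketch of the first two operations, assuming dense arrays of per-cell token embeddings and token IDs (all names, shapes, and the pad_id convention are illustrative, not from the BioNeMo or Geneformer APIs):

```python
import numpy as np


def gene_embeddings(emb, tokens, num_tokens, pad_id=0):
    """Average each gene token's embedding across every cell it appears in.

    emb: (cells, seq_len, emb_dim) per-token hidden states.
    tokens: (cells, seq_len) gene token IDs, with pad_id marking padding.
    """
    emb_dim = emb.shape[-1]
    sums = np.zeros((num_tokens, emb_dim))
    counts = np.zeros(num_tokens)
    mask = tokens != pad_id
    np.add.at(sums, tokens[mask], emb[mask])  # scatter-add embeddings per token
    np.add.at(counts, tokens[mask], 1)  # count observations per token
    return sums / np.maximum(counts, 1)[:, None]


def cell_embeddings(emb, tokens, pad_id=0):
    """Average each cell's non-padding token embeddings (CLS excluded upstream)."""
    mask = (tokens != pad_id)[..., None]
    return (emb * mask).sum(axis=1) / np.maximum(mask.sum(axis=1), 1)
```

The optional aggregation step is then just a mean (or median) over the per-state outputs of these functions.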
Ideally a test would be added as well.
I would suggest adding an `--include-gene-embeddings` option and renaming the current `--include-embeddings` to `--include-cell-embeddings` to avoid confusion: https://github.com/NVIDIA/bionemo-framework/blob/main/sub-packages/bionemo-geneformer/src/bionemo/geneformer/scripts/infer_geneformer.py#L235. The original Geneformer implementation linked above (EmbExtractor) is a useful reference.
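A sketch of how those flags might be declared in infer_geneformer.py's argparse setup (the flag names follow the suggestion above; the surrounding parser is illustrative):

```python
import argparse

parser = argparse.ArgumentParser(description="Geneformer inference (illustrative excerpt)")
# Proposed rename of the existing --include-embeddings flag.
parser.add_argument(
    "--include-cell-embeddings",
    action="store_true",
    help="Save per-cell embeddings (mean over non-padding token hidden states).",
)
# Proposed new flag.
parser.add_argument(
    "--include-gene-embeddings",
    action="store_true",
    help="Save per-token (gene) hidden states so gene embeddings can be "
    "averaged across cells downstream.",
)
```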
My preference for this would be to do it as a post-processing step to avoid OOM issues. Basically, my recommendation would be to dump the cell x gene embeddings, and any necessary metadata, to disk, then do whatever averaging/grouping/etc. you need downstream. I am guessing you would need: the token IDs for each cell (so each position can be mapped back to a gene token), the per-token hidden states, and any cell-level metadata you want to group by.
Then from there you could load those entities and place them in a cell x gene_token shaped tensor ordered by gene_token (this needs testing):

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix


def construct_sparse_matrices(token_matrix, embedding_matrix, num_tokens):
    """Constructs sparse matrices for embeddings and observation tracking.

    Args:
        token_matrix (np.ndarray): A (samples, seq_len) matrix of token indices
            (int), e.g. (samples, 2048). Padding positions should be masked out
            beforehand, otherwise they are counted as token observations.
        embedding_matrix (np.ndarray): A (samples, seq_len, emb_dim) matrix of
            per-token embeddings.
        num_tokens (int): The total number of unique tokens.

    Returns:
        tuple:
            sparse_embeddings (scipy.sparse.coo_matrix): Sparse embedding matrix
                of shape (samples * num_tokens, emb_dim); row s * num_tokens + t
                holds the embedding of token t in sample s. COO is used because
                scipy has no 3D sparse type, so the (samples, num_tokens) axes
                are flattened into the row index.
            sparse_boolean (scipy.sparse.csr_matrix): Sparse boolean matrix of
                token observations with shape (samples, num_tokens).
    """
    samples, seq_len, emb_dim = embedding_matrix.shape

    # Flatten tokens and embeddings.
    flat_tokens = token_matrix.flatten()
    flat_embeddings = embedding_matrix.reshape(-1, emb_dim)

    # Sample index for each flattened (sample, position) pair.
    row_indices = np.repeat(np.arange(samples), seq_len)
    col_indices = flat_tokens

    # Construct the sparse boolean observation matrix.
    data_boolean = np.ones_like(flat_tokens, dtype=bool)
    sparse_boolean = csr_matrix(
        (data_boolean, (row_indices, col_indices)),
        shape=(samples, num_tokens),
    )

    # Construct the sparse embedding matrix. The row index must encode both the
    # sample and the token: row = sample * num_tokens + token.
    flat_rows = row_indices * num_tokens + flat_tokens
    emb_row_indices = np.repeat(flat_rows, emb_dim)
    emb_col_indices = np.tile(np.arange(emb_dim), flat_rows.size)
    emb_data = flat_embeddings.ravel()
    sparse_embeddings = coo_matrix(
        (emb_data, (emb_row_indices, emb_col_indices)),
        shape=(samples * num_tokens, emb_dim),
    )
    return sparse_embeddings, sparse_boolean
```

where `compute_grouped_means` then computes per-token mean embeddings over a selected group of samples:

```python
def compute_grouped_means(sparse_embeddings, sparse_boolean, group_indices):
    """Compute per-token mean embeddings for a selected group of samples.

    Args:
        sparse_embeddings (scipy.sparse.coo_matrix): Sparse embedding matrix of
            shape (samples * num_tokens, emb_dim) from construct_sparse_matrices.
        sparse_boolean (scipy.sparse.csr_matrix): Sparse boolean observation
            matrix of shape (samples, num_tokens).
        group_indices (np.ndarray): Indices of samples to include in the group.

    Returns:
        np.ndarray: Mean embedding per token for the group, (num_tokens, emb_dim).
    """
    num_tokens = sparse_boolean.shape[1]
    emb_dim = sparse_embeddings.shape[1]
    csr_embeddings = sparse_embeddings.tocsr()  # COO does not support row slicing

    # Sum each group sample's (num_tokens, emb_dim) block of rows.
    token_sums = np.zeros((num_tokens, emb_dim))
    for s in group_indices:
        token_sums += csr_embeddings[s * num_tokens : (s + 1) * num_tokens].toarray()

    # Count how many samples in the group observed each token.
    token_counts = sparse_boolean[group_indices, :].sum(axis=0).A1  # to 1D array

    # Compute means, avoiding division by zero for unobserved tokens.
    token_counts = np.maximum(token_counts, 1)
    return token_sums / token_counts[:, None]
```

The benefit of this approach as well is that you can use the …
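A quick usage sketch tying the two functions together on synthetic data (the shapes and the unique-token generation are illustrative; in practice token_matrix and embedding_matrix would come from the dumped inference outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
samples, seq_len, emb_dim, num_tokens = 8, 16, 4, 100

# Geneformer inputs are rank-ordered, unique gene tokens per cell, so draw
# unique non-zero token IDs for each synthetic sample.
token_matrix = np.stack(
    [rng.choice(num_tokens - 1, size=seq_len, replace=False) + 1 for _ in range(samples)]
)
embedding_matrix = rng.normal(size=(samples, seq_len, emb_dim))

sparse_embeddings, sparse_boolean = construct_sparse_matrices(
    token_matrix, embedding_matrix, num_tokens
)

# Mean gene embeddings across all samples, or across a cell-type subset.
all_means = compute_grouped_means(sparse_embeddings, sparse_boolean, np.arange(samples))
subset_means = compute_grouped_means(sparse_embeddings, sparse_boolean, np.array([0, 2, 5]))
print(all_means.shape)  # (100, 4) == (num_tokens, emb_dim)
```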
Thanks @jstjohn, I think this is getting close to the results. The workflow seems good to me: first extract (samples, genes, emb_dim), then average different samples into mean embeddings (genes, emb_dim). Could you please further explain what the inputs are for construct_sparse_matrices(token_matrix, embedding_matrix, num_tokens)? How can I apply this function to the infer_geneformer.py output?
You/we would need to add options to both the `def forward` of the biobert model (`bionemo/llm/biobert/model.py`), to support passing the input token_ids through to the output, and to `infer_geneformer.py`, to expose that option through the argparse arguments. This would generally be expensive, so I would want the default to be `False`. Then you would call the `infer_geneformer` command with that option enabled, and the result dictionary would contain the different outputs (the geneformer cell-type classification tutorial notebook shows how to load the .pt file and look at the dictionary of things saved into it). Let me know if this makes sense and is something you're willing to do! Farhad recently did something similar for esm2, I think, so this option may already be available in the `def forward`. Is this something you would be interested in contributing?
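A minimal sketch of what that option-plumbing could look like, using a toy stand-in for the model; the class, the flag name, and the dict keys are illustrative assumptions, not the actual BioNeMo code:

```python
import argparse

import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Toy stand-in for the biobert model; only the option-plumbing is the point."""

    def __init__(self, vocab_size: int = 100, emb_dim: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)

    def forward(self, input_ids: torch.Tensor, include_input_ids: bool = False) -> dict:
        hidden_states = self.embed(input_ids)  # placeholder for the transformer stack
        output = {"hidden_states": hidden_states}
        if include_input_ids:
            # Off by default: copying token IDs into every saved batch costs I/O.
            output["input_ids"] = input_ids
        return output


# Corresponding argparse exposure (assumed flag name, mirroring the other
# --include-* options in the inference scripts).
parser = argparse.ArgumentParser()
parser.add_argument(
    "--include-input-ids",
    action="store_true",
    help="Copy the input token IDs into the saved results so per-token "
    "embeddings can be mapped back to gene tokens.",
)
```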
Yes, see the inference tutorial for esm2. It looks like the option is already in `bionemo/llm/biobert/model.py` and just needs to be added to the argparse for `infer_geneformer.py`, just like it is in `infer_esm2.py`. See https://nvidia.github.io/bionemo-framework/user-guide/examples/bionemo-esm2/inference/ for how they use it. Again, let me know if you have trouble seeing how that option is added in infer_esm2.py and how you would do something similar with infer_geneformer.py.
After playing with BioNeMo Geneformer as suggested by John, I got the following results by running inference on a single-cell sequencing dataset of 462 samples by 25,429 genes using the modified infer_geneformer.py. It has an impressive interface for controlling the parallel computing processes, which the original Hugging Face implementation does not offer. In the saved model output, using a sequence length of 512 (--seq-len 512), I got token_logits of torch.Size([512, 462, 25472]).

"input_ids" doesn't contain any gene information: it outputs a data matrix of (num_cells, sequence_length). The same holds for "hidden_states" and "embeddings"; all of them are cell-related information. "token_logits" (e.g. by using --include-logits) is the closest result to gene embeddings: it saves a data matrix of (sequence_length, num_cells, num_genes) containing predictions for genes from the last layer of the model, which is still different from the gene embeddings used by Geneformer. Gene embeddings should be extracted from the second-to-last layer. @jstjohn @skothenhill-nv @isabel-wilkinson Do you have any suggestions on how to move forward from this?
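For reference, a sketch of how one might inspect what an inference run saved (the file path is illustrative; the actual name and location depend on the run configuration):

```python
import torch

# Illustrative path: the tutorial notebooks show where the .pt predictions land.
results = torch.load("results/predictions__rank_0.pt")
for key, value in results.items():
    # Expect entries such as "token_logits", "input_ids", "hidden_states",
    # and "embeddings", depending on which --include-* flags were set.
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value)
    print(key, shape)
```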
A potential design:

1. Add a `--num-layers-override` option in infer.py (https://github.com/NVIDIA/bionemo-framework/blob/main/sub-packages/bionemo-geneformer/src/bionemo/geneformer/scripts/infer_geneformer.py#L235) with a default of None.
2. Pass `override_parent_fields=['num_layers'] + OVERRIDE_BIOBERT_CONFIG_DEFAULTS` to the config_class (around here: https://github.com/NVIDIA/bionemo-framework/blob/main/sub-packages/bionemo-geneformer/src/bionemo/geneformer/scripts/infer_geneformer.py#L116), but only if the user set `num_layers_override != None`. This tells the checkpoint loader not to pull this field out of the trained model config in the checkpoint, and to use the user-supplied option for this field instead.
3. Pass `num_layers=num_layers_override` to the config around that same point, again only if the user set this to something other than None.

What will happen then is that the model will be initialized with the user-requested number of layers rather than the num_layers it was originally trained with. So if you want to remove the last layer and get the inference results from the second-to-last layer, and you know the model was trained with 6 layers, you could set --num-layers-override 5 and you would get back a 5-layer model with the last layer left off. A sketch of this wiring follows the side note below.

Side note: these steps are generally how you would override any setting in the loaded model. This pattern can be used for fine-tuning as well as inference if you want to change things about the model when you load it. Note that in the fine-tuning case (not here), if you add a new layer, you also need to tell the checkpoint loader not to look for that new layer in the checkpoint; otherwise you get a confusing-looking error about that layer not being found at checkpoint load time.
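A hedged sketch of that wiring, assuming OVERRIDE_BIOBERT_CONFIG_DEFAULTS is importable from the biobert model module and that the config class accepts these keyword arguments (a stand-in is defined here so the snippet runs on its own):

```python
import argparse

# Assumed import; a stand-in is used so this sketch is self-contained:
# from bionemo.llm.model.biobert.model import OVERRIDE_BIOBERT_CONFIG_DEFAULTS
OVERRIDE_BIOBERT_CONFIG_DEFAULTS: list = []


def build_config_kwargs(num_layers_override):
    """Return the extra config kwargs implied by steps 2 and 3 above."""
    if num_layers_override is None:
        return {}
    return {
        # Step 2: stop the checkpoint loader from restoring num_layers
        # out of the checkpoint's stored model config.
        "override_parent_fields": ["num_layers"] + OVERRIDE_BIOBERT_CONFIG_DEFAULTS,
        # Step 3: use the user-supplied value instead.
        "num_layers": num_layers_override,
    }


parser = argparse.ArgumentParser()
parser.add_argument("--num-layers-override", type=int, default=None)
args = parser.parse_args([])  # empty argv for the sketch

print(build_config_kwargs(args.num_layers_override))  # -> {} (default: no override)
print(build_config_kwargs(5))  # -> kwargs for a 5-layer model from a 6-layer checkpoint
```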