forked from Graylab/IgLM

IgLM

Official repository for IgLM: Generative Language Modeling for Antibody Design.

The code and pre-trained models from this work are made available for non-commercial use under the terms of the JHU Academic Software License Agreement.

Setup

To use IgLM, install via pip:

pip install iglm

Alternatively, you can clone this repository and install the package locally:

$ git clone git@github.com:Graylab/IgLM.git
$ pip install IgLM

Command line usage

IgLM supports sequence infilling, sequence generation (with prompting), and sequence evaluation from the command line.

Re-design spans of an antibody sequence

To use IgLM to re-design spans of an antibody sequence, supply the FASTA file, the ID of the FASTA record containing the sequence to design, the start index of the span (0-indexed), and the end index of the span (0-indexed, exclusive).

To generate 100 unique sequences of the anti-tissue factor antibody (1JPT) heavy chain with an IgLM-designed CDR3:

iglm_infill data/antibodies/1jpt/1jpt.fasta :H 98 106 --chain_token [HEAVY] --species_token [HUMAN] --num_seqs 100 
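
The span convention (0-indexed start, exclusive end) matches Python slicing, so it can be checked before running a design. The sketch below assumes the `:H` record holds the same heavy-chain sequence used in the package examples of this README; the variable names are illustrative only.

```python
# Heavy-chain sequence of 1JPT as given in the package-usage examples.
# Assumption: the ":H" record of 1jpt.fasta contains this same sequence.
parent_sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFNIKEYYMHWVRQAPGKGLEWVGLIDPEQGNTIYDPKFQDRATISADNSKNTAYLQMNSLRAEDTAVYYCARDTAAYFDYWGQGTLVTVS"

# Same values as the positional arguments "98 106" in the command above.
start, end = 98, 106

span = parent_sequence[start:end]      # residues that will be re-designed
left_flank = parent_sequence[:start]   # context before the span
right_flank = parent_sequence[end:]    # context after the span

print(span)       # DTAAYFDY
print(len(span))  # 8
```

Because the end index is exclusive, the span length is simply `end - start`, and the residue at index 106 is kept as context.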

Full antibody sequence generation

IgLM can be used to generate full antibody sequences while conditioning on the chain type and species-of-origin.

To generate 100 unique human heavy chain sequences starting with EVQ:

iglm_generate --prompt_sequence EVQ --chain_token [HEAVY] --species_token [HUMAN] --num_seqs 100 

To generate 100 unique nanobody sequences starting with QVQ:

iglm_generate --prompt_sequence QVQ --chain_token [HEAVY] --species_token [CAMEL] --num_seqs 100 
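
When several generation runs are needed, the commands above can be assembled programmatically. The following is a minimal sketch that uses only the flags shown in this README; the helper name `build_generate_command` is hypothetical and not part of IgLM.

```python
import shlex


def build_generate_command(prompt_sequence, chain_token, species_token, num_seqs):
    """Return an argument list for iglm_generate (hypothetical helper)."""
    return [
        "iglm_generate",
        "--prompt_sequence", prompt_sequence,
        "--chain_token", chain_token,
        "--species_token", species_token,
        "--num_seqs", str(num_seqs),
    ]


commands = [
    build_generate_command("EVQ", "[HEAVY]", "[HUMAN]", 100),  # human heavy chains
    build_generate_command("QVQ", "[HEAVY]", "[CAMEL]", 100),  # nanobodies
]

for cmd in commands:
    # Print a shell-safe rendering; the bracketed tokens come out quoted.
    print(shlex.join(cmd))
    # To actually run it (requires IgLM to be installed):
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell interpretation of the square brackets in the tokens, which some shells may treat as glob patterns when left unquoted.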

Sequence evaluation

IgLM can be used to calculate the log likelihood of a sequence given a chain type and species-of-origin.

Full sequence log likelihood calculation:

iglm_evaluate data/antibodies/1jpt/1jpt.fasta :H --chain_token [HEAVY] --species_token [HUMAN]

Infilled sequence log likelihood calculation:

iglm_evaluate data/antibodies/1jpt/1jpt.fasta :H --start 98 --end 106 --chain_token [HEAVY] --species_token [HUMAN]

Package usage

IgLM can also be used as a Python package, which supports the same use cases as the command line and allows more flexible workflows.

Re-design spans of an antibody sequence

To use IgLM to re-design spans of an antibody sequence, supply the sequence to design, the start index of the span (0-indexed), and the end index of the span (0-indexed, exclusive).

To generate 100 unique sequences of the anti-tissue factor antibody (1JPT) heavy chain with an IgLM-designed CDR3:

from iglm import IgLM

iglm = IgLM()

parent_sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFNIKEYYMHWVRQAPGKGLEWVGLIDPEQGNTIYDPKFQDRATISADNSKNTAYLQMNSLRAEDTAVYYCARDTAAYFDYWGQGTLVTVS"
chain_token = "[HEAVY]"
species_token = "[HUMAN]"
infill_range = (98, 106)
num_seqs = 100

generated_seqs = iglm.infill(
    parent_sequence,
    chain_token,
    species_token,
    infill_range=infill_range,
    num_to_generate=num_seqs,
)
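
After infilling, it is often useful to look only at the re-designed segment. The sketch below assumes that each generated sequence is a plain string containing the full chain with both flanks unchanged, and that the designed span may differ in length from the original; neither assumption is stated in this README. The helper `extract_infilled_span` is hypothetical.

```python
def extract_infilled_span(generated, parent, infill_range):
    """Return the designed segment of `generated` (hypothetical helper).

    Assumes the residues outside `infill_range` are identical to `parent`.
    """
    start, end = infill_range
    prefix, suffix = parent[:start], parent[end:]
    if len(generated) < len(prefix) + len(suffix):
        raise ValueError("generated sequence is shorter than the parent flanks")
    if not (generated.startswith(prefix) and generated.endswith(suffix)):
        raise ValueError("flanking regions differ from the parent sequence")
    # Strip the unchanged flanks; what remains is the designed span.
    return generated[start:len(generated) - len(suffix)]


# Toy data standing in for real IgLM output.
toy_parent = "AAACCCGGG"
toy_generated = ["AAATTTTGGG", "AAAWGGG"]
designed = [extract_infilled_span(g, toy_parent, (3, 6)) for g in toy_generated]
print(designed)  # ['TTTT', 'W']
```

With real output, the same call would take `generated_seqs`, `parent_sequence`, and `infill_range` from the example above.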

Full antibody sequence generation

IgLM can be used to generate full antibody sequences while conditioning on the chain type and species-of-origin.

To generate 100 unique human heavy chain sequences starting with EVQ:

from iglm import IgLM

iglm = IgLM()

prompt_sequence = "EVQ"
chain_token = "[HEAVY]"
species_token = "[HUMAN]"
num_seqs = 100

generated_seqs = iglm.generate(
    chain_token,
    species_token,
    prompt_sequence=prompt_sequence,
    num_to_generate=num_seqs,
)

To generate 100 unique nanobody sequences starting with QVQ:

from iglm import IgLM

iglm = IgLM()

prompt_sequence = "QVQ"
chain_token = "[HEAVY]"
species_token = "[CAMEL]"
num_seqs = 100

generated_seqs = iglm.generate(
    chain_token,
    species_token,
    prompt_sequence=prompt_sequence,
    num_to_generate=num_seqs,
)
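
Generated sequences usually need to be saved for downstream analysis. The sketch below writes a list of sequences in FASTA format, assuming `generated_seqs` is a list of plain strings (not confirmed by this README). The function `to_fasta` and the record prefix are hypothetical.

```python
def to_fasta(sequences, prefix="iglm_gen"):
    """Format sequences as FASTA text, dropping exact duplicates (hypothetical helper)."""
    unique = list(dict.fromkeys(sequences))  # de-duplicate while preserving order
    lines = []
    for i, seq in enumerate(unique):
        lines.append(f">{prefix}_{i}")  # record header
        lines.append(seq)               # sequence on a single line
    return "\n".join(lines) + "\n"


# Toy data standing in for real IgLM output.
toy_seqs = ["EVQA", "EVQB", "EVQA"]
fasta_text = to_fasta(toy_seqs)
print(fasta_text)

# To save to disk:
# with open("generated.fasta", "w") as handle:
#     handle.write(fasta_text)
```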

Sequence evaluation

IgLM can be used to calculate the log likelihood of a sequence given a chain type and species-of-origin.

Full sequence log likelihood calculation:

import math
from iglm import IgLM

iglm = IgLM()

sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFNIKEYYMHWVRQAPGKGLEWVGLIDPEQGNTIYDPKFQDRATISADNSKNTAYLQMNSLRAEDTAVYYCARDTAAYFDYWGQGTLVTVS"
chain_token = "[HEAVY]"
species_token = "[HUMAN]"

log_likelihood = iglm.log_likelihood(
    sequence,
    chain_token,
    species_token,
)
perplexity = math.exp(-log_likelihood)

Infilled sequence log likelihood calculation:

import math
from iglm import IgLM

iglm = IgLM()

sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFNIKEYYMHWVRQAPGKGLEWVGLIDPEQGNTIYDPKFQDRATISADNSKNTAYLQMNSLRAEDTAVYYCARDTAAYFDYWGQGTLVTVS"
chain_token = "[HEAVY]"
species_token = "[HUMAN]"
infill_range = (98, 106)

log_likelihood = iglm.log_likelihood(
    sequence,
    chain_token,
    species_token,
    infill_range=infill_range,
)
perplexity = math.exp(-log_likelihood)
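
The conversion above can be reused to compare several candidate sequences. The sketch below uses made-up log likelihood values in place of real model output. Reading the result as a per-residue perplexity assumes the log likelihood is an average over tokens, which the formula above suggests but this README does not state.

```python
import math


def perplexity(log_likelihood):
    """Convert a log likelihood into a perplexity, as in the examples above."""
    return math.exp(-log_likelihood)


# Hypothetical scores; real values would come from iglm.log_likelihood(...).
scores = {"variant_a": -0.5, "variant_b": -1.2}

# Lower perplexity means the model considers the sequence more likely.
ranked = sorted(scores, key=lambda name: perplexity(scores[name]))
for name in ranked:
    print(name, round(perplexity(scores[name]), 3))
```

Since the exponential is monotonic, ranking by perplexity gives the same order as ranking by log likelihood from highest to lowest.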
