Added support for masked language modeling (bidirectional models) #211

Open
wants to merge 8 commits into main
Conversation

@shehadak (Collaborator) commented Nov 8, 2023

This PR builds on the Huggingface subject, which assumes that models are autoregressive (following the ModelForCausalLM interface). This PR adds support for bidirectional models trained with masked language modeling (following the ModelForMaskedLM interface). Since bidirectional models rely on future context, I use a sliding-window approach (see google-research/bert#66): for each text part, up to w/2 tokens are included for the current part plus previous context, and the remaining w/2 tokens are masked, as sketched below.
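A minimal sketch of this windowing, assuming a window size w of 512 and using the Huggingface tokenizer directly (the function name build_window and the exact truncation are illustrative, not the PR's implementation):

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')

def build_window(token_ids, window_size=512):
    # First half of the window: real tokens (previous context + current part),
    # truncated from the left so the current part is always kept.
    half = window_size // 2
    real = token_ids[-half:]
    # Remaining half: [MASK] tokens, so the bidirectional model never
    # attends to real future context.
    masked = [tokenizer.mask_token_id] * half
    return torch.tensor([real + masked])

token_ids = tokenizer('the quick brown fox', add_special_tokens=False)['input_ids']
with torch.no_grad():
    logits = model(input_ids=build_window(token_ids)).logits  # (1, length, vocab)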

The region_layer_mapping for the language system was determined by scoring every transformer layer in BERT's encoder against the Pereira2018.243sentences-linear, Pereira2018.384sentences-linear, and Blank2014-linear benchmarks, and choosing the layer with the highest average score.
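For reference, a hypothetical sketch of that layer search, assuming brainscore_language's load_benchmark and a scalar float() conversion of the resulting scores (the loop itself is my reconstruction, not code from this PR):

from brainscore_language import load_benchmark
from brainscore_language.artificial_subject import ArtificialSubject
from brainscore_language.model_helpers.huggingface import HuggingfaceSubject

benchmarks = [load_benchmark(name) for name in (
    'Pereira2018.243sentences-linear',
    'Pereira2018.384sentences-linear',
    'Blank2014-linear')]

layer_scores = {}
for layer_num in range(12):  # bert-base-uncased has 12 encoder layers
    layer = f'bert.encoder.layer.{layer_num}'
    subject = HuggingfaceSubject(
        model_id='bert-base-uncased',
        region_layer_mapping={ArtificialSubject.RecordingTarget.language_system: layer},
        bidirectional=True)
    # average the per-benchmark scores for this layer
    layer_scores[layer] = sum(float(b(subject)) for b in benchmarks) / len(benchmarks)

best_layer = max(layer_scores, key=layer_scores.get)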

This PR also provides unit tests for reading-time estimation, next-word prediction, and neural recording, using the bert-base-uncased model. Future models can follow the same format, as long as they implement the ModelForMaskedLM interface. For example, to add the base DistilBERT model:

model_registry['distilbert-base-uncased'] = lambda: HuggingfaceSubject(
    model_id='distilbert-base-uncased',
    region_layer_mapping={
        ArtificialSubject.RecordingTarget.language_system: 'distilbert.transformer.layer.5'},
    bidirectional=True)
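Once registered, the model can be exercised like any other subject. A usage sketch, assuming the package's standard ArtificialSubject workflow (method names taken from the brainscore_language README):

from brainscore_language import load_model
from brainscore_language.artificial_subject import ArtificialSubject

model = load_model('distilbert-base-uncased')
model.start_behavioral_task(task=ArtificialSubject.Task.next_word)
next_word = model.digest_text('the quick brown fox')['behavior']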
