Custom entity extraction project on a self-created IOB-annotated dataset (built with Doccano), using a Hugging Face BERT transformer.
This project implements a custom Named Entity Recognition (NER) model using BERT (Bidirectional Encoder Representations from Transformers) to identify and classify named entities in scientific texts. The model is designed to recognize affiliations in academic papers, with the potential to extend to other relevant entity types.
- Base Model: BERT (bert-base-uncased)
- Task-specific layer: Token Classification Head
- Number of labels: at least 2, including 'O' for non-entity tokens and 'I-AFFILIATIONS' for affiliations (see the configuration sketch below)
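As a minimal sketch (not the project's exact code), the token classification head can be configured with the label mapping at load time; the two labels here are the ones mentioned above, and you would extend the mapping to match your own IOB scheme:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Label mapping assumed from the description above; extend it (e.g., with
# "B-AFFILIATIONS") to match your own IOB annotation scheme.
id2label = {0: "O", 1: "I-AFFILIATIONS"}
label2id = {label: i for i, label in id2label.items()}

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```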
Key features:
- Custom NER model trained on scientific text data
- Identification of affiliations in academic papers
- Potential for expanding to other relevant entities
- Fine-tuned BERT model for improved performance on domain-specific text
To train the model on your own dataset:
- Prepare your data in JSONL format (see `dataset.jsonl` for an example, and the illustrative record after this list).
- Run the training script: `python train_ner_model.py`
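The exact record layout depends on how you export from Doccano (key names such as "label" vs. "labels" vary across versions); a sequence-labeling entry with character-offset spans looks roughly like this hypothetical example:

```python
import json

# Hypothetical dataset.jsonl record: character-offset entity spans over the
# raw text, in the style of Doccano's sequence-labeling export.
record = {
    "text": "Jane Doe, Department of Physics, MIT, Cambridge, MA",
    "label": [[10, 51, "AFFILIATIONS"]],
}
print(json.dumps(record))  # one JSON object per line in the .jsonl file
```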
To use the trained model for inference:
- Load the model:
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("path/to/saved/model")
tokenizer = AutoTokenizer.from_pretrained("path/to/saved/model")
```
- Perform inference on new text:
text = "Your scientific text here"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
logits = model(**inputs).logits
predictions = torch.argmax(logits, dim=2)
predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]]
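To inspect the result, you can pair each subword token with its predicted label, for example:

```python
# Pair each (sub)word token with its predicted label; tokens labeled
# anything other than 'O' are the ones the model tagged as entities.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, label in zip(tokens, predicted_token_class):
    if label != "O":
        print(token, label)
```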
Training results:
- Training time: 211.5674 seconds
- Training samples per second: 7.156
- Training steps per second: 0.454
- Final training loss: 0.1278
- Validation loss: 0.051182 (epoch 1), 0.036526 (epoch 2)
Future work:
- Implement cross-validation for more robust model evaluation
- Experiment with other pre-trained models (e.g., DistilBERT) and with domain-specific variants
- Increase dataset size and diversity
- Develop post-processing for entity span consolidation
- Address class imbalance and introduce more granular entity types
Acknowledgments:
- Hugging Face Transformers library
- BERT model developers
- Contributors to the scientific text dataset used for training