This project extracts and tokenizes clinical notes from XML files, trains Word2Vec embeddings on the clinical text, and fits a Logistic Regression model that classifies notes based on user-specified terms. It reports model performance through classification reports.

anirudhvee/NotNimbleMiner

NotNimbleMiner

NotNimbleMiner by Christy Mednick, Zachary Graeber, Malav Ramanan, Anirudh Venkatachalam and Manasvini Narayanan for ECS170: Artificial Intelligence with Gabriel Simmons.

NotNimbleMiner is a project aimed at extracting and classifying sensitive information from clinical notes using natural language processing techniques. This repository contains code for a comprehensive pipeline that tokenizes clinical notes from XML files, creates word embeddings, explores vocabulary, assigns labels, and trains a multi-label classification model.
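
As a rough illustration of the first stage, the sketch below pulls note text out of an XML string and tokenizes it. The `<notes>`/`<note>` layout and the helper names `tokenize` and `extract_notes` are assumptions made for this example, not the repository's actual schema or API, and a simple regular expression stands in for the spaCy tokenizer so the snippet runs with the standard library alone.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical XML layout; real element names depend on your data set.
SAMPLE_XML = """<notes>
  <note id="1">Patient denies chest pain. No shortness of breath.</note>
  <note id="2">Reports mild headache; prescribed ibuprofen 200mg.</note>
</notes>"""

# Stand-in tokenizer: runs of lowercase letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def extract_notes(xml_string, tag="note"):
    """Return one token list per <tag> element found in the XML string."""
    root = ET.fromstring(xml_string)
    return [tokenize(elem.text or "") for elem in root.iter(tag)]


notes = extract_notes(SAMPLE_XML)
for tokens in notes:
    print(tokens)
```

For files on disk, `ET.parse(path).getroot()` could replace `ET.fromstring` under the same assumptions about the element names.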

Key Features:

  1. Tokenization of Clinical Notes: Parses XML files to extract and tokenize clinical notes.
  2. Word Embedding Model: Creates word embeddings using Word2Vec to capture semantic relationships within the clinical text.
  3. Vocabulary Exploration: Allows users to interactively explore and select relevant terms for classification.
  4. Multi-Label Classification: Trains a Logistic Regression model to automatically classify clinical notes based on selected terms.
  5. Evaluation Metrics: Provides classification reports for model performance assessment.

Usage:

  1. Tokenize Clinical Notes: Input your XML files containing clinical notes to start the tokenization process.
  2. Explore Vocabulary: Interactively select terms for classification based on similarity scores.
  3. Train Multi-Label Classifier: Train a Logistic Regression model on the selected terms and tokenized data.
  4. Evaluate Model: Assess model performance using classification reports.
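
Steps 3 and 4 can be sketched with scikit-learn as below. This is a minimal example under stated assumptions: labels are derived from whether each selected term appears in a note, and bag-of-words counts stand in for embedding-based features, since this README does not say how the repository builds its feature vectors. The toy data, the helper `assign_labels`, and the one-vs-rest wrapper are illustrative choices, not the project's confirmed implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier

# Toy tokenized notes and selected terms (both hypothetical).
notes = [
    ["patient", "reports", "chest", "pain"],
    ["mild", "headache", "no", "fever"],
    ["chest", "pain", "and", "headache"],
    ["no", "complaints", "today"],
    ["severe", "headache", "since", "morning"],
    ["chest", "tightness", "and", "pain"],
]
selected_terms = ["pain", "headache"]


def assign_labels(tokens, terms):
    """One binary label per term: 1 if the term occurs in the note."""
    return [int(term in tokens) for term in terms]


y = np.array([assign_labels(tokens, selected_terms) for tokens in notes])

# The notes are already tokenized, so the analyzer passes tokens through.
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
X = vectorizer.fit_transform(notes)

# One Logistic Regression per label gives a multi-label classifier.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y)

predictions = clf.predict(X)
print(classification_report(y, predictions, target_names=selected_terms, zero_division=0))
```

The report here is computed on the training data only to keep the example short; a held-out split (for instance via `sklearn.model_selection.train_test_split`) gives a more honest estimate of performance.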

Requirements:

  1. Python 3.x
  2. Libraries: spaCy, NumPy, gensim, scikit-learn (xml.etree.ElementTree ships with the Python standard library and needs no separate install)
