Skip to content

TalhaMemisoglu/sentiment_analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 

Repository files navigation

Sentiment Analysis with NLP Models

This project applies Natural Language Processing (NLP) techniques to classify IMDB movie reviews as positive or negative.

It compares three different modeling approaches:

  • Naive Bayes with Bag-of-Words / TF-IDF
  • Deep Learning with LSTM + GloVe embeddings
  • Transformer-based models (ALBERT, DistilBERT)

Dataset

Source: IMDB 50K Movie Reviews Dataset

Size: 50,000 labeled reviews (balanced: 25k positive / 25k negative)

Target: Sentiment (binary classification)

Methodology

1. Exploratory Data Analysis (EDA)

  • Review length distribution (words & characters)
  • Stopword analysis
  • Word clouds for positive/negative reviews
  • Patterns: HTML tags, emojis, excessive punctuation, slang

2. Preprocessing

  • HTML/URL removal
  • Lowercasing & contraction expansion
  • Stopword removal & lemmatization
  • Tokenization (NLTK & HuggingFace)

3. Feature Engineering

  • CountVectorizer & TF-IDF for ML baseline
  • Word embeddings (GloVe 100D) for LSTM
  • Tokenizer + Padding for sequence models

4. Models Trained

  • Naive Bayes: Baseline with DTM & TF-IDF
  • LSTM (Bi-LSTM + GlobalMaxPool): trained with GloVe embeddings
  • Transformers: Fine-tuned ALBERT & DistilBERT (HuggingFace Trainer)

5. Evaluation

  • Metrics: Accuracy, Precision, Recall, F1
  • Confusion matrices plotted for each model

Results

  • Naive Bayes: Good baseline, limited accuracy with %86 macro average F1-score
  • LSTM + GloVe: Improved performance, better contextual capture with %87 macro average F1-score
  • DistilBERT / ALBERT: Best overall performance with %94 macro average F1-score

About

Sentiment analysis with NLP models

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published