- Description
- Features
- Technologies
- Key Findings
- Technical Methodology
- Installation
- Usage
- Data
- Development
- Configuration
- License
This project predicts the sentiment of customer reviews and groups them by topic, surfacing sentiment patterns and the subjects customers discuss most. The analysis is based on a Trustpilot dataset of more than 123,000 customer reviews across 22 business categories.
- Sentiment analysis using VADER
- Topic modeling with Latent Dirichlet Allocation (LDA)
- Text preprocessing including tokenization and stopword removal
- Visualization of topic-sentiment relationships
- Distributed text processing with PySpark
This project leverages PySpark for efficient data processing in a local environment:
- Local Spark Context: Runs on a single machine for development and analysis
- DataFrame API: Used for structured data manipulation of customer reviews
- ML Pipeline: Implements machine learning workflows with Spark ML
- Text Processing: Utilizes Spark's text processing capabilities for tokenization and feature extraction
- RDD Operations: For custom transformations and exploratory data analysis
- UDFs (User-Defined Functions): For custom sentiment analysis with VADER integration
The PySpark implementation enables efficient processing of the large Trustpilot dataset even on a local machine, with the flexibility to scale to a cluster if needed in the future.
The analysis identified seven distinct topics in customer reviews:
- Product Quality - Discussion about quality, durability, and condition of products
- Billing - Comments related to payments, charges, refunds, and pricing
- General Feedback - Overall satisfaction and recommendations
- User Experience - Comments about ease of use, interface, and customer journey
- Delivery - Feedback on shipping, packaging, and timing
- Customer Service - Mentions of support quality and issue resolution
- Price - Specific discussion of cost and value
Sentiment analysis revealed that "Billing" topics have the highest average sentiment score, while "Product Quality" has the lowest, suggesting customers are most satisfied with payment processes but most critical about product quality issues.
Figure 1: Boxplot showing sentiment distribution across different topics
Figure 2: Histogram of sentiment scores across all customer reviews
This project combines an unsupervised learning technique (LDA, for topic modeling) with a rule-based technique (VADER, for sentiment analysis) to analyze customer reviews.
The Latent Dirichlet Allocation (LDA) algorithm was selected as the primary topic modeling approach for several reasons:
- Interpretable results: LDA produces probability distributions over words for each topic, making it easier to interpret what each topic represents
- Appropriate for text data: Specifically designed for document collections and works well with sparse term-document matrices
- Scalability: Implementation through PySpark allows the model to handle large document collections efficiently
- Soft clustering: Documents can belong to multiple topics with different probabilities, reflecting the reality that customer reviews often cover multiple aspects
The LDA implementation uses a TF-IDF weighted document-term matrix to emphasize distinctive terms and reduce the importance of common words across all reviews.
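To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a handful of tokenized reviews. It assumes raw term counts for TF and the smoothed IDF `log((N + 1) / (df + 1))`, which is the formula documented for Spark ML's `IDF` estimator. The function name `tf_idf` and the toy reviews are made up for illustration and are not taken from the project.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    TF is the raw term count; IDF is log((N + 1) / (df + 1)),
    where N is the number of documents and df the document frequency.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in docs
    ]


# Toy, made-up reviews (already tokenized, stopwords removed).
reviews = [
    ["delivery", "late", "refund"],
    ["delivery", "fast", "great"],
    ["delivery", "refund", "slow"],
]
weights = tf_idf(reviews)
```

In this toy corpus "delivery" appears in every review, so its weight drops to zero, while "late" (one review) outweighs "refund" (two reviews). That is the effect described above: terms common to all reviews contribute little to the topic model.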
For sentiment analysis, we implemented the VADER (Valence Aware Dictionary and sEntiment Reasoner) approach because:
- Domain suitability: VADER is specifically tuned for social media and short-form text like reviews
- Rule-based efficiency: As a lexicon and rule-based model, it doesn't require training data, allowing immediate application
- Nuance capture: Beyond simple positive/negative classification, VADER captures sentiment intensity on a continuous scale (-1 to 1)
- Contextual understanding: Handles negations, intensifiers, and other linguistic constructs that affect sentiment
This approach produces a compound sentiment score that correlates well with the star ratings in the dataset, validating its effectiveness.
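As a rough illustration of where the continuous score comes from: VADER sums the adjusted valence of the words in a text and squashes that sum into the range -1 to 1 with `x / sqrt(x^2 + alpha)`, using `alpha = 15`. The sketch below reproduces only that normalization step, plus the ±0.05 cut-offs that VADER's authors suggest for labeling. Whether this project applies those cut-offs is an assumption, and the function names are hypothetical.

```python
import math


def normalize(valence_sum, alpha=15.0):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


def label(compound, threshold=0.05):
    """Map a compound score to a coarse label (assumed cut-offs of +/-0.05)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# A made-up valence sum of 4.0 maps to a strongly positive compound score.
compound = normalize(4.0)
sentiment = label(compound)
```

Because the normalization saturates, very long or very emphatic reviews approach but never reach -1 or 1.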
The analysis pipeline is implemented using PySpark's ML library with these key components:
- Text preprocessing: Tokenization, stopword removal, and normalization
- Feature extraction: TF-IDF vectorization to create numerical representations of text
- Dimensionality reduction: LDA for topic discovery (k=7 topics)
- Sentiment scoring: VADER for sentiment analysis, with a fallback to a lexicon-based approach
- Topic assignment: Dominant topic extraction for each document
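The components above could be wired together roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the column names (`review_text`, `tokens`, `filtered`, `term_counts`, `features`, `topic_distribution`) and the helpers `build_topic_pipeline` and `dominant_topic` are hypothetical, and the VADER scoring step is left out.

```python
def build_topic_pipeline(text_col="review_text", num_topics=7):
    """Assemble tokenizer -> stopword filter -> TF-IDF -> LDA as a Spark ML Pipeline."""
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import LDA
    from pyspark.ml.feature import CountVectorizer, IDF, StopWordsRemover, Tokenizer

    stages = [
        # Tokenizer lowercases the text and splits it on whitespace.
        Tokenizer(inputCol=text_col, outputCol="tokens"),
        StopWordsRemover(inputCol="tokens", outputCol="filtered"),
        # Term counts, then IDF re-weighting, give TF-IDF feature vectors.
        CountVectorizer(inputCol="filtered", outputCol="term_counts"),
        IDF(inputCol="term_counts", outputCol="features"),
        LDA(
            k=num_topics,
            featuresCol="features",
            topicDistributionCol="topic_distribution",
        ),
    ]
    return Pipeline(stages=stages)


def dominant_topic(distribution):
    """Index of the highest-probability topic in one document's distribution."""
    return max(range(len(distribution)), key=lambda i: distribution[i])


# Made-up topic distribution for a single review (k=7, sums to 1).
example_distribution = [0.05, 0.10, 0.60, 0.05, 0.10, 0.05, 0.05]
example_topic = dominant_topic(example_distribution)

# With Spark available and a DataFrame `reviews_df` (hypothetical) that has a
# `review_text` column, fitting would look like:
#   model = build_topic_pipeline().fit(reviews_df)
#   scored = model.transform(reviews_df)
```

Taking the arg-max turns LDA's soft assignment into a single label per review, which is convenient for grouping sentiment by topic but discards the information that a review may cover several topics.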
Several alternative approaches were evaluated before selecting the final methodology:
- Supervised classification: Training classifiers (SVM, Random Forest) for sentiment prediction was considered, but would have required labeled training data and might introduce biases from the labeling process.
- Word embeddings with clustering: Word2Vec or GloVe embeddings with K-means clustering could provide more semantically nuanced topic representations but might be less interpretable than LDA's explicit word distributions.
- BERT-based approaches: While fine-tuning BERT for sentiment analysis would likely improve accuracy, the computational requirements would be significantly higher and potentially unnecessary given VADER's strong performance for this application.
- Non-negative Matrix Factorization (NMF): This alternative topic modeling approach was considered, but LDA was preferred for its probabilistic foundation and natural handling of documents as mixtures of topics.
- Dynamic Topic Models: These track how topics evolve over time, but were deemed unnecessary for this static analysis of reviews.
The chosen approach balances accuracy, interpretability, and computational efficiency, making it well-suited for extracting actionable insights from customer reviews at scale.
Clone the repository:
git clone https://github.com/aaronginder/text-analyser.git
Install dependencies using Poetry:
poetry install
The project includes a Jupyter notebook in the exploration directory that demonstrates the full analysis pipeline.
The project uses Trustpilot reviews data for training and evaluation, located in the data directory.
Install Poetry:
pip install poetry
Install dependencies:
poetry install
To build the project:
poetry build
This project uses semantic-release to automate the release process. Follow the conventional commits specification for commit messages.
Example:
feat: add new sentiment analysis model
fix: correct topic classification bug
docs: update installation instructions
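To show how commit messages like these drive versioning, the sketch below maps a conventional commit message to a release type. It assumes the common convention (`feat` gives a minor release, `fix` a patch release, a `!` marker or `BREAKING CHANGE:` footer a major release, anything else no release). The rules semantic-release actually applies depend on how it is configured in this repository, and `release_type` is a hypothetical helper, not part of any tool.

```python
import re

# Header shape: type, optional (scope), optional "!", then ": description".
HEADER = re.compile(r"^(?P<type>\w+)(\([^)]*\))?(?P<breaking>!)?: .+")


def release_type(message):
    """Return 'major', 'minor', 'patch', or None for a commit message."""
    header = message.splitlines()[0] if message else ""
    match = HEADER.match(header)
    if match is None:
        return None  # Not a conventional commit: no release.
    if match.group("breaking") or "BREAKING CHANGE:" in message:
        return "major"
    # Simplified mapping; other types (docs, chore, ...) trigger no release.
    return {"feat": "minor", "fix": "patch"}.get(match.group("type"))


bumps = [
    release_type("feat: add new sentiment analysis model"),
    release_type("fix: correct topic classification bug"),
    release_type("docs: update installation instructions"),
]
```

Under these assumed rules, the three example messages above would produce a minor release, a patch release, and no release, respectively.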
Releases are automated using semantic-release based on commit messages. Merging to the main branch will trigger a release. The project supports different release branches:
- main: production releases
- develop: development releases
- alpha, beta, rc: prerelease versions
Configuration is managed via config.yml.
This project is licensed under the MIT License - see the LICENSE file for details.