A real-time American Sign Language (ASL) recognition system that uses computer vision and machine learning to identify hand gestures. The system employs MediaPipe for hand landmark detection and a Support Vector Machine (SVM) classifier for gesture classification.
This project recognizes ASL signs through a two-stage pipeline:
- Hand Landmark Detection: MediaPipe extracts 21 key hand landmarks from images
- Gesture Classification: SVM classifier identifies the sign based on landmark positions
✅ Real-time Recognition: Webcam-based live ASL sign detection
✅ MediaPipe Integration: Accurate 21-point hand landmark extraction
✅ SVM Classification: Robust polynomial kernel SVM for gesture recognition
✅ Prediction Smoothing: Majority voting system for stable predictions
✅ 3D Visualization: PCA-based 3D clustering visualization of hand gestures
✅ Multiple Kernels: Support for both linear and polynomial SVM kernels
✅ Pre-trained Models: Ready-to-use trained SVM models included
The system uses MediaPipe's Hand Landmarker to detect and track hand landmarks:
- Detects 21 key points on the hand (fingertips, knuckles, palm, wrist)
- Extracts normalized (x, y) coordinates for each landmark
- Flattens to a 42-dimensional feature vector (21 landmarks × 2 coordinates)
Process:
- Converts image to MediaPipe format
- Runs hand detection model
- Extracts landmark coordinates
- Returns flattened feature array
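The flattening step can be sketched without MediaPipe itself. The example below is a minimal illustration, not the code in `utils.py`: the `Landmark` tuple is a stand-in for MediaPipe's landmark objects (which expose `x`, `y`, and `z` attributes), and the function name `flatten_landmarks` and the interleaved `[x0, y0, x1, y1, ...]` ordering are assumptions.

```python
from collections import namedtuple

# Stand-in for a MediaPipe landmark; only x and y are used (assumption for illustration).
Landmark = namedtuple("Landmark", ["x", "y", "z"])

NUM_LANDMARKS = 21


def flatten_landmarks(landmarks):
    """Return [x0, y0, x1, y1, ...] for one detected hand (42 values)."""
    if len(landmarks) != NUM_LANDMARKS:
        raise ValueError(f"expected {NUM_LANDMARKS} landmarks, got {len(landmarks)}")
    features = []
    for lm in landmarks:
        features.extend((lm.x, lm.y))  # the z coordinate is ignored
    return features


# Synthetic landmarks in the normalized [0, 1] range.
hand = [Landmark(x=i / 20, y=1 - i / 20, z=0.0) for i in range(NUM_LANDMARKS)]
vector = flatten_landmarks(hand)
print(len(vector))  # 42
```

Whatever ordering is chosen, it has to be identical at training and prediction time, because the SVM treats each position in the vector as a distinct feature.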
Trains an SVM classifier on hand gesture datasets:
Training Pipeline:
- Data Loading: Reads images organized by gesture label in the `data/` directory
- Preprocessing:
  - Converts BGR to RGB
  - Resizes to 256×256 pixels
- Feature Extraction: Extracts 21 hand landmarks per image
- Model Training:
  - Uses polynomial kernel SVM (degree=3)
  - Enables probability estimates
  - Configured with `gamma='scale'` and `coef0=1`
- Model Serialization: Saves trained model as a `.pkl` file
- Visualization (Optional):
  - Reduces features to 3D using PCA
  - Plots gesture clusters in 3D space
  - Shows decision boundaries
SVM Configuration:
svm.SVC(kernel="poly", degree=3, gamma="scale", coef0=1, probability=True)
Performs live gesture recognition via webcam:
Prediction Pipeline:
- Webcam Capture: Captures frames from default camera
- Preprocessing: Converts to RGB and resizes to 256×256
- Landmark Extraction: Detects hand and extracts landmarks
- Classification: Predicts gesture using trained SVM
- Smoothing: Applies majority voting over last 10 predictions
- Display: Overlays prediction text on video feed
Prediction Smoothing:
- Maintains a buffer of the last 10 predictions
- Uses majority voting to reduce jitter
- Persists last valid prediction when no hand is detected
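The smoothing behavior described above can be sketched with a fixed-size buffer and a vote counter. This is a minimal illustration under stated assumptions, not the code in `predict.py`: the class name `PredictionSmoother` is hypothetical, and the choice to leave the buffer untouched when no hand is detected is a guess at the intended behavior.

```python
from collections import Counter, deque


class PredictionSmoother:
    """Majority vote over the most recent predictions (hypothetical helper)."""

    def __init__(self, buffer_size=10):
        self.buffer = deque(maxlen=buffer_size)  # oldest entries drop off automatically
        self.last_valid = None

    def update(self, prediction):
        """Take a label, or None when no hand was detected; return the label to display."""
        if prediction is None:
            return self.last_valid  # keep showing the last valid prediction
        self.buffer.append(prediction)
        self.last_valid = Counter(self.buffer).most_common(1)[0][0]
        return self.last_valid


smoother = PredictionSmoother(buffer_size=10)
for raw in ["A", "A", "B", "A", None]:
    shown = smoother.update(raw)
print(shown)  # A
```

A single stray "B" is outvoted by the surrounding "A" frames, which is what removes the jitter from the on-screen label.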
Feature Representation:
- Each hand gesture is represented by 21 landmarks
- Each landmark has (x, y) coordinates
- Total feature vector: 42 dimensions
SVM Classification:
- Kernel: Polynomial (degree 3)
- Decision Function: One-vs-One multi-class strategy
- Output: Gesture label + probability scores
Why Polynomial Kernel?
- Better captures non-linear relationships between landmarks
- More effective than linear kernel for complex hand shapes
- Provides better separation between similar gestures
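To make the kernel concrete: scikit-learn's polynomial kernel computes `(gamma * <x, y> + coef0) ** degree`, and `gamma="scale"` resolves to `1 / (n_features * X.var())`. The pure-Python sketch below reproduces both formulas for illustration; the function names are hypothetical, and the toy vectors are not real landmark data.

```python
def poly_kernel(x, y, gamma, coef0=1.0, degree=3):
    """Polynomial kernel value: (gamma * <x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


def scale_gamma(X):
    """Value used for gamma='scale': 1 / (n_features * variance of all entries in X)."""
    values = [v for row in X for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return 1.0 / (len(X[0]) * variance)


print(poly_kernel([1, 2], [3, 4], gamma=0.5))  # (0.5 * 11 + 1) ** 3 = 274.625
print(scale_gamma([[0, 1], [1, 0]]))           # 1 / (2 * 0.25) = 2.0
```

Expanding the cube produces products of up to three coordinates, so the classifier can weigh combinations of landmark positions rather than each coordinate on its own, which a linear kernel cannot do.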
- Python 3.7+
- Webcam (for real-time recognition)
- pip package manager
# Clone the repository
git clone https://github.com/winterwidow/ASL-Recognition.git
cd ASL-Recognition
# Install dependencies
pip install -r requirements.txt
- `opencv-python`: Image capture and processing
- `mediapipe`: Hand landmark detection
- `scikit-learn`: SVM classifier and PCA
- `numpy`: Numerical operations
- `joblib`: Model serialization
The system requires the MediaPipe hand landmarker model:
- File: `hand_landmarker.task` (~7.8 MB)
- This file should already be included in the repository
- If missing, download from MediaPipe Models
Organize your dataset with one folder per gesture:
data/
├── data_num/ # or your dataset directory
│ ├── A/
│ │ ├── img1.jpg
│ │ ├── img2.jpg
│ │ └── ...
│ ├── B/
│ │ ├── img1.jpg
│ │ └── ...
│ ├── 0/
│ ├── 1/
│ └── ...
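With this layout, each folder name doubles as the class label. The sketch below shows one way such a tree could be walked to produce (image path, label) pairs using only the standard library. The function name `list_labeled_images` and the accepted file extensions are assumptions for illustration and are not taken from `train.py`.

```python
from pathlib import Path

# Extensions treated as images (assumption; adjust to match your dataset).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_labeled_images(data_dir):
    """Return a sorted list of (image_path, label) pairs, one label per subfolder."""
    samples = []
    for label_dir in sorted(Path(data_dir).iterdir()):
        if not label_dir.is_dir():
            continue  # skip stray files at the top level
        for image_path in sorted(label_dir.iterdir()):
            if image_path.suffix.lower() in IMAGE_SUFFIXES:
                samples.append((image_path, label_dir.name))
    return samples
```

For example, `list_labeled_images("data/data_num")` would pair every image under `A/` with the label `"A"`. Images in which no hand is detected would still need to be skipped during feature extraction.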
Run the training script:
python train.py
Configuration (in `train.py`):
DATA_DIR = "data/data_num" # Path to training dataset
MODEL_PATH = "svm_model2.pkl" # Output model file
Training Options:
- Linear Kernel: Fast, works for simple gestures
clf = svm.SVC(kernel="linear", probability=True)
- Polynomial Kernel: Better accuracy for complex gestures (default)
clf = svm.SVC(kernel="poly", degree=3, gamma="scale", coef0=1, probability=True)
Output:
- Saves trained model as a `.pkl` file
- Prints number of samples collected
- (Optional) Displays 3D PCA visualization
Start the webcam-based recognition:
python predict.py
Configuration (in `predict.py`):
MODEL_PATH = "svm_model.pkl" # Path to trained model
Controls:
- Position your hand in front of the webcam
- The predicted sign appears on screen
- Press `q` to quit
Features:
- Real-time prediction with webcam feed
- Prediction smoothing for stability
- Persistent display of last valid prediction
- Green text overlay showing recognized sign
ASL-Recognition/
├── train.py # Model training script
├── predict.py # Real-time prediction script
├── utils.py # Hand landmark extraction utilities
├── requirements.txt # Python dependencies
├── hand_landmarker.task # MediaPipe hand detection model
├── svm_model.pkl # Trained SVM model (alphabet)
├── svm_model2.pkl # Trained SVM model (numbers)
├── data/ # Training datasets
│ ├── dataset/ # Alphabet gestures
│ └── data_num/ # Number gestures
└── __pycache__/ # Python cache files
| Aspect | Linear Kernel | Polynomial Kernel (Degree 3) |
|---|---|---|
| Training Speed | Fast | Moderate |
| Accuracy (Simple) | Good (85-90%) | Excellent (92-97%) |
| Accuracy (Complex) | Moderate (75-80%) | Excellent (88-95%) |
| Overfitting Risk | Low | Moderate (requires tuning) |
| Best For | Simple gestures, quick prototyping | Complex hand shapes, production |
Recommendation: Use polynomial kernel for better accuracy with ASL gestures.
The training script includes optional 3D visualization:
Features:
- Reduces 42D feature space to 3D using PCA
- Plots gesture clusters in 3D space
- Shows approximate SVM decision boundaries
- Helps visualize gesture separability
Enable/Disable:
- Comment/uncomment the plotting section in `train.py` (lines 87-135)
Improves Accuracy:
- ✅ Good lighting conditions
- ✅ Plain backgrounds
- ✅ Consistent hand positioning
- ✅ Larger training datasets (50+ images per gesture)
- ✅ Diverse training data (angles, distances)
Reduces Accuracy:
- ❌ Low light or shadows
- ❌ Complex backgrounds
- ❌ Partial hand occlusions
- ❌ Small training datasets
- ❌ Similar-looking gestures
- Training Data Quality
  - Use high-resolution images (256×256 minimum)
  - Ensure hands are clearly visible
  - Include variations in lighting and background
- Prediction Smoothing
  - Adjust buffer size in `predict.py` (default: 10 frames)
  - Larger buffer = more stable but slower response
  - Smaller buffer = faster but potentially jittery
- Model Selection
  - Start with polynomial kernel
  - Tune `degree` and `gamma` parameters if needed
  - Consider RBF kernel for highly non-linear data
"No hand detected in the image"
- Ensure hand is clearly visible in frame
- Check lighting conditions
- Verify webcam is working
- Hand should be primary object in frame
Poor prediction accuracy
- Increase training dataset size
- Ensure diverse training data
- Check if hand landmarks are correctly extracted
- Try different SVM kernels or parameters
Model file not found
- Ensure `hand_landmarker.task` is in project root
- Check that trained model `.pkl` file exists
- Re-run `train.py` if model is missing
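A quick pre-flight check can report which files are missing before the webcam starts. The sketch below is a hypothetical helper, not part of the repository. Its default file names come from the project structure listed in this README and should be adjusted to match the `MODEL_PATH` you actually use.

```python
from pathlib import Path


def missing_files(project_root, required=("hand_landmarker.task", "svm_model.pkl")):
    """Return the names of required files that are not present in project_root."""
    root = Path(project_root)
    return [name for name in required if not (root / name).is_file()]


missing = missing_files(".")
if missing:
    print("Missing files:", ", ".join(missing))
```

An empty list means both the MediaPipe model and the trained SVM were found in the given directory.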
Webcam not working
- Verify camera permissions
- Check camera index (change `VideoCapture(0)` to `VideoCapture(1)` etc.)
- Ensure no other application is using the webcam
- ASL Alphabet Dataset (Kaggle)
- ASL Numbers Dataset (Kaggle)
- Create your own custom dataset for specific gestures (done in this project to test the model on unstructured data)
- Capture 50-100 images per gesture
- Use consistent lighting
- Vary hand position, angle, and distance
- Include different backgrounds
- Organize in labeled folders
- Support for dynamic gestures (motion-based signs)
- Integration with text-to-speech for output
- Mobile app version (Android/iOS)
- Sentence construction from multiple signs
- Deep learning models (CNN/LSTM) for improved accuracy
- Multi-hand support for two-handed signs
- Real-time performance metrics display
This project is open source and available under the MIT License.
- Built with MediaPipe by Google
- Uses scikit-learn for SVM implementation
- OpenCV for image processing and display