Machine Learning - Handwritten number recognition

Handwritten number recognition on the MNIST dataset using a Histogram classifier, a Bayesian classifier, and principal component analysis (PCA).

Dataset

The dataset is obtained from Dr. Yann LeCun's website. It is ready to use and has not been modified in any way.
Each data record is a 28x28 pixel grid (a 784-dimensional vector) that captures a handwritten digit from 0 to 9. In this work, we are only interested in the numbers 1 and 9. More numbers can be considered by changing the configuration if desired.
Figure: selected data records of 1 and 9 in the dataset.
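The selection step can be sketched as follows. This is a minimal illustration, not the repository's code: the function name `select_digits` and the synthetic arrays standing in for MNIST are assumptions made for the example.

```python
import numpy as np

def select_digits(images, labels, digits=(1, 9)):
    """Keep only the records whose label is in `digits` (hypothetical helper)."""
    images = np.asarray(images)
    labels = np.asarray(labels)
    mask = np.isin(labels, digits)
    kept = images[mask]
    # Flatten each 28x28 grid into one 784-dimensional row vector.
    flat = kept.reshape(kept.shape[0], int(np.prod(kept.shape[1:])))
    return flat, labels[mask]

# Synthetic stand-in for MNIST: six blank 28x28 images with made-up labels.
demo_images = np.zeros((6, 28, 28))
demo_labels = np.array([1, 9, 3, 1, 7, 9])
X, y = select_digits(demo_images, demo_labels)
print(X.shape, y)
```

Passing a different `digits` tuple is one way the "more numbers" configuration could look.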

Data dimension reduction

Principal Component Analysis (PCA) is applied to reduce the dimension of the data records.
Only the two most significant principal components are kept to represent the data.
Figure: scatter plot of the dataset after dimension reduction.
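A minimal sketch of the reduction to two principal components, assuming a plain NumPy implementation based on the singular value decomposition. The function name `pca_reduce` and the random demo data are assumptions for illustration; the repository's `pca.py` may do this differently.

```python
import numpy as np

def pca_reduce(X, n_components=2):
    """Project rows of X onto the top principal components (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, mean, components

# Random data standing in for the 784-dimensional records.
rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 784))
Z, mu, V = pca_reduce(demo)
print(Z.shape)  # one 2D point per record
```

New records are projected with the same `mean` and `components`, i.e. `(x - mu) @ V.T`.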

Training and prediction

Histogram and Bayesian learning algorithms are built on the dimension-reduced (2D) dataset.
A random pair of 1 and 9 is chosen from the dataset and passed to both algorithms for prediction. In this particular case, the results are quite accurate.
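The two classifiers could look roughly like the sketch below. The README does not spell out their details, so this assumes that the Histogram classifier counts training points per class in a 2D grid of bins and predicts the class with the larger count, and that the Bayesian classifier fits one 2D Gaussian per class with class priors. All class names, the bin count, and the synthetic demo data are assumptions.

```python
import numpy as np

class HistogramClassifier:
    """Predicts the class with the most training points in the query's 2D bin."""
    def __init__(self, bins=25):
        self.bins = bins

    def fit(self, Z, y):
        Z, y = np.asarray(Z, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # Bin edges are shared by all classes so that counts are comparable.
        self.edges_ = [np.linspace(Z[:, d].min(), Z[:, d].max(), self.bins + 1)
                       for d in range(2)]
        self.counts_ = np.stack([
            np.histogram2d(Z[y == c, 0], Z[y == c, 1], bins=self.edges_)[0]
            for c in self.classes_])
        return self

    def predict(self, Z):
        Z = np.asarray(Z, dtype=float)
        idx = []
        for d in range(2):
            i = np.searchsorted(self.edges_[d], Z[:, d], side="right") - 1
            idx.append(np.clip(i, 0, self.bins - 1))  # clamp out-of-range points
        votes = self.counts_[:, idx[0], idx[1]]  # shape (n_classes, n_points)
        # Ties, including empty bins, fall back to the first class.
        return self.classes_[np.argmax(votes, axis=0)]

class GaussianBayesClassifier:
    """Fits one 2D Gaussian per class and picks the highest log posterior."""
    def fit(self, Z, y):
        Z, y = np.asarray(Z, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.params_ = []
        for c in self.classes_:
            Zc = Z[y == c]
            self.params_.append(
                (Zc.mean(axis=0), np.cov(Zc, rowvar=False), len(Zc) / len(Z)))
        return self

    def predict(self, Z):
        Z = np.asarray(Z, dtype=float)
        scores = []
        for mean, cov, prior in self.params_:
            diff = Z - mean
            maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
            # Log posterior up to a constant shared by all classes.
            scores.append(np.log(prior) - 0.5 * (maha + np.log(np.linalg.det(cov))))
        return self.classes_[np.argmax(scores, axis=0)]

# Two well-separated synthetic blobs standing in for the reduced "1" and "9" points.
rng = np.random.default_rng(0)
Z_demo = np.vstack([rng.normal([-3, 0], 0.5, size=(200, 2)),
                    rng.normal([3, 0], 0.5, size=(200, 2))])
y_demo = np.array([1] * 200 + [9] * 200)
hist = HistogramClassifier().fit(Z_demo, y_demo)
bayes = GaussianBayesClassifier().fit(Z_demo, y_demo)
pair = np.array([[-3.0, 0.0], [3.0, 0.0]])
print(bayes.predict(pair))
```

The histogram variant needs no distributional assumption but depends on the bin count; the Gaussian variant is smoother but assumes each class is roughly elliptical in the reduced space.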

Training accuracy

Both algorithms are run on the whole dataset to calculate the training accuracy: 98.6% for Histogram and 98% for Bayesian.
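Training accuracy here means the fraction of training records whose predicted label matches the true label. A minimal sketch, with a hypothetical `accuracy` helper and made-up labels rather than the project's actual predictions:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Made-up labels and predictions: 4 of 5 agree.
labels = np.array([1, 9, 9, 1, 9])
preds = np.array([1, 9, 1, 1, 9])
print(f"training accuracy: {accuracy(labels, preds):.1%}")  # training accuracy: 80.0%
```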

Observations

Histogram and Bayesian algorithms both work very well in this particular case.
Even with a significant amount of dimension reduction (784 -> 2), PCA still captures much of the information in the dataset, which yields high accuracy when predicting randomly picked numbers from the set. Although these numbers come from the training set itself, so the figures are training accuracy rather than accuracy on unseen data, 98.6% for Histogram and 98% for Bayesian is impressive given that the dimension is reduced from 784 to 2.
Even though the prediction results are good, we should not conclude that dimension reduction is a magic tool. Looking back at the scatter plot of the dimension-reduced dataset, we can see that, by the nature of the images, the numbers "1" and "9" are quite well separated: the blue dots and red dots overlap only slightly. This is the situation where Histogram counting and the 2D Bayesian approach work best. For more complicated problems, e.g. when the majority of red and blue dots overlap, we cannot reduce the dataset to as few as 2 dimensions.
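One way to check how much information two components retain is the explained variance ratio: the share of total variance carried by the kept components. The sketch below is an assumption about how this could be measured, not something the project reports; the helper name and the synthetic data are made up for the example.

```python
import numpy as np

def explained_variance_ratio(X, n_components=2):
    """Share of total variance captured by the top principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Squared singular values are proportional to per-component variance.
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    return float(var[:n_components].sum() / var.sum())

# Synthetic data: two strong directions plus small noise in 20 dimensions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2)) * [10.0, 5.0]
demo = rng.normal(scale=0.1, size=(300, 20))
demo[:, :2] += latent
ratio = explained_variance_ratio(demo)
print(f"variance kept by 2 components: {ratio:.3f}")
```

A low ratio, or heavy class overlap in the 2D scatter plot, would suggest keeping more components.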
