How to run

This package extracts the name, number and email from the text of business cards.

The code is implemented for python 2.7 and requires no special dependencies.

parse.py: invokes the package to extract info from sample text filess
train_classifier.py: estimates the person name classifier from training data.
BusinessCardParser/: python package implements the solution.
data/: training data for name classifier and test data for business cards.

How to run

To parse txt files containing business card output,

python parse.py -m data/name.model data/tests/*

To train the name model on different data,

python train_classifier.py -f data/name.model data/train/names.txt

Details

Each line of the business card is scored against three extractors (name, number, email). The extractors are either regexes or simple classifiers.

Email Extraction

A permissive regex is used. It is not RFC-5322 compliant since that standard is extremely flexibile and this application is very simple.

I assume the @ sign is unlikely to often appear in business cards, so ensuring full compliance with the standard is unnecessary at this point.

Phone Number Extraction

A regex first identifies a phone number.

Only matches U.S. phone numbers (1-111-111-1111)
Tries to extract extensions if present

If multiple numbers are present, a second pass looks at the whole line for context of a primary phone number. If multiple primary numbers are found, the top-most one is selected.

Positive Indicators

tel
telephone
phone
office
work
p
number
direct

Negative Indicators

fax
cell
home
f
mobile

Further extensiosn to the code could add:

non-U.S. phone numbers
slot filling to populate work/cell/home

Name Extraction

A naive bayes classifier built on character 4-grams is used.

Common first and last names came from http://names.mongabay.com/

A training corpus was then randomly generated from these word lists. From here, 4-gram frequencies were estimated with a small smoothing penalty.

The most likely line unde this model is selected as the name.

Extensions could add:

a discriminative classifier based on name vs. company
character sequence models built on a language model or recurrent neural networks
better training data based on real first and last name pairs
move code to scipy/numpy. For dependency simplicity, I did not use these libraries.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

How to run

Details

Email Extraction

Phone Number Extraction

Name Extraction

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
BusinessCardParser		BusinessCardParser
data		data
.gitignore		.gitignore
README.md		README.md
parse.py		parse.py
train_classifier.py		train_classifier.py

ZeeshanLinein/business_card

Folders and files

Latest commit

History

Repository files navigation

How to run

Details

Email Extraction

Phone Number Extraction

Name Extraction

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages