Skip to content

ZeeshanLinein/business_card

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This package extracts the name, number and email from the text of business cards.

The code is implemented for python 2.7 and requires no special dependencies.

  • parse.py: invokes the package to extract info from sample text filess
  • train_classifier.py: estimates the person name classifier from training data.
  • BusinessCardParser/: python package implements the solution.
  • data/: training data for name classifier and test data for business cards.

How to run

To parse txt files containing business card output,

python parse.py -m data/name.model data/tests/*

To train the name model on different data,

python train_classifier.py -f data/name.model data/train/names.txt

Details

Each line of the business card is scored against three extractors (name, number, email). The extractors are either regexes or simple classifiers.

Email Extraction

A permissive regex is used. It is not RFC-5322 compliant since that standard is extremely flexibile and this application is very simple.

I assume the @ sign is unlikely to often appear in business cards, so ensuring full compliance with the standard is unnecessary at this point.

Phone Number Extraction

A regex first identifies a phone number.

  • Only matches U.S. phone numbers (1-111-111-1111)
  • Tries to extract extensions if present

If multiple numbers are present, a second pass looks at the whole line for context of a primary phone number. If multiple primary numbers are found, the top-most one is selected.

Positive Indicators

  • tel
  • telephone
  • phone
  • office
  • work
  • p
  • number
  • direct

Negative Indicators

  • fax
  • cell
  • home
  • f
  • mobile

Further extensiosn to the code could add:

  • non-U.S. phone numbers
  • slot filling to populate work/cell/home

Name Extraction

A naive bayes classifier built on character 4-grams is used.

Common first and last names came from http://names.mongabay.com/

A training corpus was then randomly generated from these word lists. From here, 4-gram frequencies were estimated with a small smoothing penalty.

The most likely line unde this model is selected as the name.

Extensions could add:

  • a discriminative classifier based on name vs. company
  • character sequence models built on a language model or recurrent neural networks
  • better training data based on real first and last name pairs
  • move code to scipy/numpy. For dependency simplicity, I did not use these libraries.

About

Business Card Parser coding interview

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 97.6%
  • Perl 1.6%
  • Shell 0.8%