php-text-analysis

PHP Text Analysis is a library for performing Information Retrieval (IR) and Natural Language Processing (NLP) tasks using the PHP language. There are tools in this library that can perform:

document classification
sentiment analysis
document comparison
frequency analysis
tokenization
stemming
collocations with Pointwise Mutual Information
lexical diversity
corpus analysis
text summarization

All the documentation for this project can be found in the book and wiki.

PHP Text Analysis Book & Wiki

A book is in the works and your contributions are needed. You can find the book at https://github.com/yooper/php-text-analysis-book

Documentation for the library is also available in the wiki: https://github.com/yooper/php-text-analysis/wiki

Installation Instructions

Add PHP Text Analysis to your project

composer require yooper/php-text-analysis

Tokenization

$tokens = tokenize($text);

You can choose a different tokenizer by passing the tokenizer's class name as the second argument:

$tokens = tokenize($text, \TextAnalysis\Tokenizers\PennTreeBankTokenizer::class);

The default tokenizer is \TextAnalysis\Tokenizers\GeneralTokenizer::class . Some tokenizers require parameters to be set upon instantiation.
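To make the idea of tokenization concrete, here is a minimal plain-PHP sketch. It is not the library's GeneralTokenizer; the function name `sketch_tokenize` and the choice of delimiters are assumptions made for illustration only.

```php
<?php
// Hypothetical helper, not part of php-text-analysis: splits text on
// whitespace and a few common punctuation marks, dropping empty strings.
function sketch_tokenize(string $text): array
{
    $tokens = preg_split('/[\s.,;:!?]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false on failure (e.g. invalid UTF-8 input).
    return $tokens === false ? [] : $tokens;
}

$tokens = sketch_tokenize('Tacos are great. Burritos, too!');
print_r($tokens); // Tacos, are, great, Burritos, too
```

A real tokenizer has to make many more decisions (contractions, hyphens, URLs), which is why the library lets you swap the tokenizer class.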

Normalization

By default, normalize_tokens lowercases every token with strtolower. To customize the normalization, pass either a function or the name of a function as a string as the second argument; it is applied to each token with array_map.

// default: strtolower
$normalizedTokens = normalize_tokens($tokens);

// function name passed as a string
$normalizedTokens = normalize_tokens($tokens, 'mb_strtolower');

// custom closure
$normalizedTokens = normalize_tokens($tokens, function($token){ return mb_strtoupper($token); });
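The description above (a callable applied with array_map, defaulting to strtolower) can be sketched in plain PHP. The function `sketch_normalize` is a hypothetical stand-in, not the library's code, and the example assumes the mbstring extension is available and the source file is UTF-8.

```php
<?php
// Hypothetical stand-in for the behaviour described above: apply a
// callable to every token with array_map, defaulting to strtolower.
function sketch_normalize(array $tokens, ?callable $normalizer = null): array
{
    return array_map($normalizer ?? 'strtolower', $tokens);
}

$tokens = ['ÉCOLE', 'Taco'];

// Default: byte-oriented strtolower, reliable for ASCII tokens only.
$lower = sketch_normalize($tokens);

// Multibyte-aware lowercasing via a function name string.
$mbLower = sketch_normalize($tokens, 'mb_strtolower');

// Any closure works as well.
$upper = sketch_normalize($tokens, function ($token) {
    return mb_strtoupper($token);
});

print_r($mbLower);
```

For text that may contain non-ASCII characters, a multibyte-aware function such as mb_strtolower is usually the safer choice.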

Frequency Distributions

The call to freq_dist returns a FreqDist instance.

$freqDist = freq_dist(tokenize($text));
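As an illustration of what a frequency distribution holds, the following plain-PHP sketch counts tokens and computes their relative frequencies. It does not use or reproduce the library's FreqDist class; the variable names are illustrative.

```php
<?php
// Plain-PHP illustration of a frequency distribution; the library's
// FreqDist class is not used here.
$tokens = ['the', 'cat', 'saw', 'the', 'other', 'cat', 'the'];

// Count how often each token occurs.
$counts = array_count_values($tokens);

// Sort by count, highest first, keeping the token keys.
arsort($counts);

// Relative frequency of each token (count divided by total tokens).
$total = count($tokens);
$weights = array_map(function ($count) use ($total) {
    return $count / $total;
}, $counts);

print_r($counts);
print_r($weights);
```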

Ngram Generation

By default, bigrams are generated.

$bigrams = ngrams($tokens);

To customize the ngrams, pass the ngram size and a delimiter:

// create trigrams with a pipe delimiter in between each word
$trigrams = ngrams($tokens, 3, '|');
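For readers unfamiliar with ngrams, here is a minimal sliding-window sketch in plain PHP. `sketch_ngrams` is a hypothetical helper, not the library's ngrams() function, and the single-space default separator is an assumption.

```php
<?php
// Hypothetical helper, not the library's ngrams(): slide a window of
// $n tokens over the list and join each window with $separator.
function sketch_ngrams(array $tokens, int $n = 2, string $separator = ' '): array
{
    if ($n < 1) {
        throw new InvalidArgumentException('n must be at least 1');
    }

    $tokens = array_values($tokens); // ensure 0-based sequential keys
    $ngrams = [];
    for ($i = 0; $i + $n <= count($tokens); $i++) {
        $ngrams[] = implode($separator, array_slice($tokens, $i, $n));
    }

    return $ngrams;
}

$tokens = ['one', 'two', 'three', 'four'];
$bigrams = sketch_ngrams($tokens);          // ['one two', 'two three', 'three four']
$trigrams = sketch_ngrams($tokens, 3, '|'); // ['one|two|three', 'two|three|four']
```

A list shorter than the requested size yields an empty array, since no full window fits.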

Stemming

By default, the stem function uses the Porter Stemmer.

$stemmedTokens = stem($tokens);

You can choose a different stemmer by passing the stemmer's class name as the second argument:

$stemmedTokens = stem($tokens, \TextAnalysis\Stemmers\MorphStemmer::class);

Keyword Extraction with Rake

There is a shortcut function for using the Rake algorithm. You will need to clean your data before using it. The second parameter is the ngram size of the keywords to extract.

$rake = rake($tokens, 3);
$results = $rake->getKeywordScores();
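To give a feel for how Rake-style scoring works, the sketch below splits a token stream into candidate phrases at stop words, scores each word by its degree divided by its frequency, and sums word scores per phrase. This is a simplified illustration under stated assumptions, not the library's rake() implementation: `sketch_rake` and the tiny stop-word list are hypothetical, and the sketch does not model the ngram size parameter.

```php
<?php
// Hypothetical sketch of Rake-style scoring, not the library's rake().
// $stopWords here is a tiny illustrative list, not a real stop-word corpus.
function sketch_rake(array $tokens, array $stopWords): array
{
    // 1. Split the token stream into candidate phrases at stop words.
    $phrases = [];
    $current = [];
    foreach ($tokens as $token) {
        if (in_array($token, $stopWords, true)) {
            if ($current !== []) {
                $phrases[] = $current;
            }
            $current = [];
        } else {
            $current[] = $token;
        }
    }
    if ($current !== []) {
        $phrases[] = $current;
    }

    // 2. Per word: frequency, and degree (summed length of its phrases).
    $freq = [];
    $degree = [];
    foreach ($phrases as $phrase) {
        foreach ($phrase as $word) {
            $freq[$word] = ($freq[$word] ?? 0) + 1;
            $degree[$word] = ($degree[$word] ?? 0) + count($phrase);
        }
    }

    // 3. Phrase score = sum of degree / frequency over its words.
    $scores = [];
    foreach ($phrases as $phrase) {
        $score = 0.0;
        foreach ($phrase as $word) {
            $score += $degree[$word] / $freq[$word];
        }
        $scores[implode(' ', $phrase)] = $score;
    }

    arsort($scores); // highest-scoring phrase first
    return $scores;
}

$tokens = ['compatibility', 'of', 'systems', 'of', 'linear', 'constraints'];
$scores = sketch_rake($tokens, ['of', 'the', 'a']);
print_r($scores);
```

Longer phrases made of words that rarely appear alone score highest, which is why cleaning the input (lowercasing, removing noise) matters before extraction.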

Sentiment Analysis with Vader

Need sentiment analysis with PHP? Use Vader: https://github.com/cjhutto/vaderSentiment . The PHP implementation can be invoked easily. Just normalize your data beforehand.

$sentimentScores = vader($tokens);

Document Classification with Naive Bayes

Need to do some document classification with PHP? Try using the Naive Bayes implementation. An example of classifying movie reviews can be found in the unit tests.

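$nb = naive_bayes();
// train with a label and the tokens of an example document
$nb->train('mexican', tokenize('taco nacho enchilada burrito'));
$nb->train('american', tokenize('hamburger burger fries pop'));
// classify an unseen document
$prediction = $nb->predict(tokenize('my favorite food is a burrito'));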
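For readers curious about the technique itself, here is a minimal multinomial Naive Bayes sketch with add-one (Laplace) smoothing in plain PHP. The class `SketchNaiveBayes`, its return format, and the whitespace splitting are assumptions for illustration; the library's classifier may differ in smoothing, scoring, and output.

```php
<?php
// Hypothetical minimal multinomial Naive Bayes with add-one smoothing.
// It illustrates the technique only; it is not the library's classifier.
final class SketchNaiveBayes
{
    private array $wordCounts = []; // label => [word => count]
    private array $docCounts = [];  // label => number of training documents
    private array $vocabulary = []; // word => true

    public function train(string $label, array $tokens): void
    {
        $this->docCounts[$label] = ($this->docCounts[$label] ?? 0) + 1;
        foreach ($tokens as $token) {
            $this->wordCounts[$label][$token] = ($this->wordCounts[$label][$token] ?? 0) + 1;
            $this->vocabulary[$token] = true;
        }
    }

    // Returns label => log score, best label first.
    public function predict(array $tokens): array
    {
        $totalDocs = array_sum($this->docCounts);
        $vocabSize = count($this->vocabulary);
        if ($vocabSize === 0) {
            return []; // nothing has been learned yet
        }

        $scores = [];
        foreach ($this->docCounts as $label => $docCount) {
            $labelWords = $this->wordCounts[$label] ?? [];
            $labelTotal = array_sum($labelWords);

            $score = log($docCount / $totalDocs); // log prior
            foreach ($tokens as $token) {
                // Add-one smoothing keeps unseen words from zeroing the score.
                $count = $labelWords[$token] ?? 0;
                $score += log(($count + 1) / ($labelTotal + $vocabSize));
            }
            $scores[$label] = $score;
        }

        arsort($scores);
        return $scores;
    }
}

$classifier = new SketchNaiveBayes();
$classifier->train('mexican', explode(' ', 'taco nacho enchilada burrito'));
$classifier->train('american', explode(' ', 'hamburger burger fries pop'));

$scores = $classifier->predict(explode(' ', 'my favorite food is a burrito'));
echo array_key_first($scores), PHP_EOL; // mexican
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.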

