Simple-NLP

This is a very simple micromaterial created for the Oxford Summer of Hacks Language Hack Day.

The aim is to give learners practice in a very simple NLP task: turning text into its corresponding Part of Speech (POS) tags. These are things like NNS (plural noun) and JJS (superlative adjective).

Learning objectives

understand what a POS tag is and how to get POS tags from text
count the number of distinct POS tags in a text
count the frequency of each POS tag in a text
calculate the most common n-grams of POS tags in a text
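The two counting objectives above can be sketched with the standard library alone. This is a minimal illustration, not the repository's solution: the `tagged` list is hand-written sample data standing in for real tagger output, and the variable names are made up for this example.

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for real tagger output (an assumption).
tagged = [
    ("the", "DT"), ("cats", "NNS"), ("sat", "VBD"),
    ("on", "IN"), ("the", "DT"), ("mat", "NN"),
]

# Keep only the tags, then count how often each one occurs.
tags = [tag for _, tag in tagged]
tag_counts = Counter(tags)

# The number of distinct tags is the number of keys in the counter.
distinct_tags = len(tag_counts)

print(distinct_tags)      # 5
print(tag_counts["DT"])   # 2
```

`Counter` is used here because it gives both the distinct-tag count and the per-tag frequencies from a single pass over the tags.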

The activity

A skeleton function has already been written, along with a test for it. To complete the activity, fill in the skeleton code and run the test. If the test passes, you did it! If not, revise your code until the test passes.

To run the test: python -m unittest

Possible steps:

1) Find out about POS tags

Here is a list of POS tags used in both NLTK and spaCy, two popular NLP libraries in Python. And this is a good explanation of POS tags and what they're used for (though Sketch Engine doesn't use the Penn Treebank set of POS tags).

The Wikipedia page has a much more in-depth discussion of these, if you're interested.

2) Find out about turning text into a list of POS tags

For our purposes, it doesn't matter whether you use the NLTK POS tagger (example) or the spaCy POS tagger (example).

Basically, read in your data source, load the models, and pass the text to the appropriate function. It is a good idea to write the output to a file for later analysis, so you don't have to repeat the tagging step every time.
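The tag-once-then-reuse idea above could look something like the following sketch. The helper name `tag_and_cache`, the JSON cache format, and the file path in the usage comment are all assumptions made for this example, not part of the repository. The tagger is passed in as a function so either library can be plugged in; `nltk_tagger` shows the NLTK route and assumes NLTK is installed with its tokenizer and tagger resources already downloaded via `nltk.download`.

```python
import json
from pathlib import Path


def tag_and_cache(text, tagger, cache_path):
    """Return (word, tag) pairs for text, reusing cache_path if it exists.

    tagger is any callable that maps a string to an iterable of
    (word, tag) pairs. This helper is hypothetical, for illustration only.
    """
    path = Path(cache_path)
    if path.exists():
        # JSON stores pairs as lists, so convert them back to tuples.
        return [tuple(pair) for pair in json.loads(path.read_text(encoding="utf-8"))]
    tagged = [tuple(pair) for pair in tagger(text)]
    path.write_text(json.dumps(tagged), encoding="utf-8")
    return tagged


def nltk_tagger(text):
    # Imported lazily so the caching helper also works without NLTK installed.
    import nltk
    return nltk.pos_tag(nltk.word_tokenize(text))


# Example usage (needs NLTK and its downloaded resources):
# tagged = tag_and_cache("The cats sat on the mat.", nltk_tagger, "tagged.json")
```

Note that the cache is keyed only on the file path, so delete the file if you change the input text.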

3) Find out about counting n-grams

An n-gram is a sequence of n things that show up next to each other, in order. For example, if we're looking at Penn Treebank POS tags, a 2-gram that shows up a lot might be "IN DT" ("in a"), and a 3-gram that shows up a lot might be "DT NN IN" ("the NOUN of"). For a more detailed discussion, see the Wikipedia page.

Tag your text as in step 2, then see which 2-grams and 3-grams are the most frequent. If you want to use your own data source (perhaps a web page or a document from the file system), you could also look for 4-grams or 5-grams.
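Counting n-grams of tags, as described above, can be sketched with a sliding window and a counter. The function names `ngrams` and `most_common_ngrams` and the sample tag list are invented for this illustration; the repository's skeleton may be organised differently.

```python
from collections import Counter


def ngrams(items, n):
    """Return every run of n consecutive items as a tuple, in order."""
    # Zipping the sequence against itself shifted by 0..n-1 yields the windows.
    return list(zip(*(items[i:] for i in range(n))))


def most_common_ngrams(tags, n, k=3):
    """Return the k most frequent n-grams with their counts."""
    return Counter(ngrams(tags, n)).most_common(k)


# Hand-written sample tags (an assumption), e.g. "the cat in the hat sat on the mat".
tags = ["DT", "NN", "IN", "DT", "NN", "VBD", "IN", "DT", "NN"]

print(most_common_ngrams(tags, 2, k=2))
# [(('DT', 'NN'), 3), (('IN', 'DT'), 2)]
```

Changing `n` to 3, 4, or 5 gives the longer n-grams; if `n` is larger than the number of tags, the result is simply an empty list.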
