Last Reviewed: | 2017-10-12 |
---|
Contents
- Brief introduction
- Download and installation procedure
- Running PyRATA
- Language expressivity
- Matching regular expression methods
- Search the first match of a given pattern
- Find all non-overlapping matches of a pattern
- Get an iterator yielding match objects over all non-overlapping matches of a pattern
- Match and MatchesList objects
- Matching mode: global, greedy, reluctant
- Debugging the pattern compilation or the pattern matching
- Compiled regular expression
- Data feature structure edit methods
- Extracting Deterministic Finite Automata
- Generating the PyRATA data structure
- Drawing NFA in pdf file
- Time performance
- License
- To go further
Regular expressions (RE) are [traditionally known as] a sequence of characters that define a search pattern. Usually this pattern is then used by string searching algorithms for "finding" or "finding and replacing" operations on strings.
But a character string is somehow a poor data structure.
PyRATA takes lists of dict tokens as data input. The dict python type consists in a set of name-value attributes (also named features).
Consequently PyRATA is not restricted to some domain knowledge and attached use cases. It is free from the encapsulated information present in the features. Indeed, the data structure can represent a sentence as sequence of words, each word token coming with a set of features. But it is not limited to the representation of sentences. It can also be used to represent a text, with the sentence as token unit. Each sentence with its own set of features. Etc.
This is the first PyRATA innovation.
Right now, PyRATA handles only primitive types as allowed values. .. The objective is to offer a language and an engine to define patterns aiming at matching (parts of) lists of features set.
The API is developed to be familiar for whom who develops with the python re module API.
The module defines several known functions such as search, findall, or finditer. The functions are also available for compiled regular expressions. The former take at least two arguments including the pattern to recognize and the data to explore (e.g. re.search(pattern, data)) while the latter take at least one, the data to explore (e.g. compiledPattern.search(data)). In addition to exploration methods, the module offers methods to edit the structure of the data either by substitution (sub), update (update) or extension (extend) of the data feature structures.
A pattern is made of one or several ordered elements. We also called them steps in reference to the XPath language. A pattern element is, in its simplest form, the specification of a single constraint (NAME OPERATOR"VALUE") that a data token should satisfy. For a given attribute name, you can specify its required exact value (with = operator), a regex definition of its value (~ operator), a list of possible values (@ operator) or if it is part of a IOB tag (- operator).
These constraint operators are probably the second major innovation offered by PyRATA in the regex world.
A more complex element can be a quantified element, an element class, a group, alternatives or a combination of these various types.
A quantified element allows to set optional element (?), element which should occurs at least one (+), or zero or more (*). An element class aims at specifying more than one constraints and conditions on them with parenthesis (()) and logical connectors such as and (&), or (|) and not (!). A group of elements, surrounded by parenthesis (()), is used to refer to and retrieve subparts of the pattern. An alternative defines a set of pattern subparts at a specific point of the pattern.
- python re module python 3, PSF (open source) License
- python nltk chunk module python 3, Apache v2
- clips pattern python 2.6, BSD-3
- spaCy python 3, MIT
- GATE JAPE Java 8, GNU
- Apache UIMA RUTA JAVA 8, Apache v2
- Nooj C++/Java 1.7, LGPL
- unitex GPL restricted license: Academic Only, Non Commercial Use Only.
- tokensregex GPL restricted license: Academic Only, Non Commercial Use Only.
- The value type is String. May be extended to other primitive types or object.
- Cannot handle overlapping annotations. Inherent to the approach.
Right now PyRATA is published on PyPI, so the simplest procedure to install is to type in a console:
sudo pip3 install pyrata
Download the latest PyRATA release
wget https://github.com/nicolashernandez/PyRATA/archive/master.zip unzip master.zip -d . cd PyRATA-master/
or clone it
git clone https://github.com/nicolashernandez/PyRATA.git cd pyrata/
Then install PyRATA
sudo pip3 install .
Of course, as any python module you can barely copy the PyRATA sub dir in your project to make it available. This solution can be an alternative if you do not have root privileges or do not want to use a virtualenv.
In addition to python3
, PyRATA uses
- the PLY implementation of lex and yacc parsing tools for Python (version 3.10).
- the graph_tool library for drawing out PDF (optional)
If you encounter the ImportError: No module named 'graph_tool'
issue, then check the fix for the graph_tool module import here
Since graph_tool is more a wrapper for C++ code than a python module, it requires a dedicated installation. Roughly speaking, under Ubuntu 16:04, you have to
echo deb http://downloads.skewed.de/apt/xenial xenial universe > /etc/apt/sources.list.d/my_xenial.list echo deb-src http://downloads.skewed.de/apt/xenial xenial universe >> /etc/apt/sources.list.d/my_xenial.list apt-get update \ && apt-get install -y --allow-unauthenticated python3-graph-tool
If you do not properly install PyRATA, you will have to manually install ply (or download it manually to copy it in your local working dir).
sudo pip3 install ply
as of v0.5.1 the sympy library was replaced and is no longer required
python3 do_tests.py
Uses the unittest
module. You may also edit the file to set logger.disabled
to False
. By default, the logging file is do_tests.py.log
.
First run python in console:
python3
Then import the main PyRATA regular expression module:
>>> import pyrata.re as pyrata_re
PyRATA comes with a script, pyrata_re.py
, which allow to test the API and plots pretty graphs of NFAs.
In v0.4 it is an alpha code. It is provided "as is"...
Set your PATH
environment variable consequently or run it from its install directory.
Takes at least two parameters: the pattern to search and the data to process.
By default, it performs English natural language processing (nlp) with NLTK on the input data and search the first occurrence of the specified pattern with a greedy pattern matching policy. No pdf draw. No log export.
More information on parameters, API usage and language syntax with:
python3 pyrata_re.py -h
Which briefly outputs:
usage: pyrata_re.py [-h] [--path] [--draw] [--pdf_file_name PDF_FILE_NAME] [--draw_steps] [--pyrata_data] [--method METHOD] [--annotation ANNOTATION] [--group GROUP] [--iob] [--mode MODE] [--pos POS] [--endpos ENDPOS] [--lexicons LEXICONS] [--verbose_output] [--log] pattern data positional arguments: pattern a pattern data data string or path to a data file. Use --path to mean a path. By default the data is assumed to be English text and so nlp processed with NLTK. Use --pyrata_data to consider it as a list of dicts. optional arguments: -h, --help show this help message and exit --path force the interpretation of the data argument as a file path --draw draw the internal NFA to a pdf file. Default is 'NFA.pdf'. Requires graph_tool. --pdf_file_name PDF_FILE_NAME output pdf filename for the draw (--draw must be set) --draw_steps draw draw the internal NFA at every steps to a pdf file. Default is 'NFA.pdf'. Requires graph_tool. It is best to run this option and observe the result with a PDF viewer that can detect file change and reload the changed file. --pyrata_data interpret the string data as a list of dict --method METHOD set the search/edit method to perform among 'search', 'findall', 'match', 'fullmatch', 'finditer', 'sub', 'extend' (default is 'search') --annotation ANNOTATION 'extend' method requires to specify the annotation extension --group GROUP 'extend' method allows to specify the group you want to extend --iob 'extend' method allows to specify if the annotation to extend will be iob --mode MODE define the pattern matching policy (greedy or reluctant). Default is greedy, --pos POS index in the data where the search is to start; it defaults to 0. --endpos ENDPOS endpos limits how far the data will be searched --lexicons LEXICONS lexicons expressed as a dict of list, each key being a lexicon name --verbose_output verbose output --log log and export into the pyrata_re_py.log file
For example to search the first match of given pattern by using some basic nlp processing (tokenization, pos tagging...):
python3 pyrata_re.py 'pos="JJ"' "It is fast easy and funny to write regular expressions with PyRATA"
To operate with the raw PyRATA data structure
python3 pyrata_re.py 'pos="JJ"' "[{'raw': 'It', 'pos': 'PRP'}, {'raw': 'is', 'pos': 'VBZ'}, {'raw': 'fast', 'pos': 'JJ'}, {'raw': 'easy', 'pos': 'JJ'}, {'raw': 'and', 'pos': 'CC'}, {'raw': 'funny', 'pos': 'JJ'}, {'raw': 'to', 'pos': 'TO'}, {'raw': 'write', 'pos': 'VB'}, {'raw': 'regular', 'pos': 'JJ'}, {'raw': 'expressions', 'pos': 'NNS'}, {'raw': 'with', 'pos': 'IN'}, {'raw': 'PyRATA', 'pos': 'NNP'}]" --pyrata_data
To find all occurrences in reluctant mode
python3 pyrata_re.py 'pos="JJ"' "It is fast easy and funny to write regular expressions with PyRATA" --method findall --mode reluctant
To draw the corresponding NFA in a filename my_nfa.pdf. Trick: No need to specify some data to draw a NFA.
python3 pyrata_re.py 'pos="DT"? pos~"JJ|NN"* pos~"NN.?"+' "" --draw --pdf_file_name my_nfa.pdf && evince my_nfa.pdf
To log the process in a pyrata_re_py.log
file.
python3 pyrata_re.py 'pos="JJ"' "It is fast easy and funny to write regular expressions with PyRATA" --log
- PyRATA data structure
- PyRATA is intented to process data made of sequence of elements, each element being a features set i.e. a set of name-value attributes. In other words the PyRATA data structure is litteraly a
list
ofdict
. The expected type of values is the typeString
.
In python, list
are marked by squared brackets, dict
by curly brackets. Elements of list
or dict
are then separated by commas. Feature names are quoted. And so values when they are Strings. Names and values are separated by a colon.
>>> data = [{'pos': 'PRP', 'raw': 'It'}, {'pos': 'VBZ', 'raw': 'is'}, {'pos': 'JJ', 'raw': 'fast'}, {'pos': 'JJ', 'raw': 'easy'}, {'pos': 'CC', 'raw': 'and'}, {'pos': 'JJ', 'raw': 'funny'}, {'pos': 'TO', 'raw': 'to'}, {'pos': 'VB', 'raw': 'write'}, {'pos': 'JJ', 'raw': 'regular'}, {'pos': 'NNS', 'raw': 'expressions'}, {'pos': 'IN', 'raw': 'with'},{'pos': 'NNP', 'raw': 'PyRATA'}]
There is no requirement on the names of the features.
In the previous code, you see that the names raw
and pos
have been arbitrary chosen to respectively mean the surface form of a word and its part-of-speech.
- PyRATA pattern
- PyRATA allows to define regular expressions on the PyRATA data structure. It is made of an ordered list of pattern elements.
- PyRATA pattern element
- The elementary component of a PyRATA pattern defines the combination of constraints (at least one) a data token should match. A pattern element is also named a step in reference to the XPath Language.
Let's say you want to search all the adjectives in the sentence. By chance there is a property which specifies the part of speech of tokens, pos, the value of pos which stands for adjectives is JJ. Your pattern will be made of only one element which will define only one constraint:
>>> pattern = 'pos="JJ"'
Pattern elements are made of constraints. At the atomic level, a simple constraint is defined with one of the following operators.
Classically, the value of the referenced feature name should be equal to the specified value. The syntax is name="value"
where name should match [a-zA-Z_][a-zA-Z0-9_]*
and value \"([^\\\n]|(\\.))*?\"
.
The following operators use the same definition for the related name and value, only the operator changes.
In addition to the equal operator, you can set a regular expression as a value.
In that case, the operator will be ~
metacharacter
>>> pyrata_re.findall('pos~"NN."', data)
[[{'raw': 'expressions', 'pos': 'NNS'}], [{'raw': 'PyRATA', 'pos': 'NNP'}]]
You can also set a list of possible values (lexicon). In that case, the operator will be the @
metacharacter in your constraint definition and the value will be the name of the lexicon. The lexicon is specified as a parameter of the pyrata_re methods (lexicons
parameter). Indeed, multiple lexicons can be specified. The data structure for storing lexicons is a dict/map of lists. Each key of the dict is the name of a lexicon, and each corresponding value a list of elements making of the lexicon.
>>> pyrata_re.findall('raw@"positiveLexicon"', data, lexicons = {'positiveLexicon':['easy', 'funny']})
[[ {'pos': 'JJ', 'raw': 'easy'}], [{'pos': 'JJ', 'raw': 'funny'}]]
The most widespread representation of chunks uses IOB tags. In this scheme, each token is tagged with one of three special chunk tags, I (inside), O (outside), or B (begin). A token is tagged as B if it marks the beginning of a chunk. Subsequent tokens within the chunk are tagged I. All other tokens are tagged O. The B and I tags are suffixed with the chunk type, e.g. B-NP, I-NP. Of course, it is not necessary to specify a chunk type for tokens that appear outside a chunk, so these are just labeled O.
An example of PyRATA data structure with chunks annotated in IOB tagged format is shown below. See the values of the chunk
feature.
>>> data = [{'pos': 'NNP', 'chunk': 'B-PERSON', 'raw': 'Mark'}, {'pos': 'NNP', 'chunk': 'I-PERSON', 'raw': 'Zuckerberg'}, {'pos': 'VBZ', 'chunk': 'O', 'raw': 'is'}, {'pos': 'VBG', 'chunk': 'O', 'raw': 'working'}, {'pos': 'IN', 'chunk': 'O', 'raw': 'at'}, {'pos': 'NNP', 'chunk': 'B-ORGANIZATION', 'raw': 'Facebook'}, {'pos': 'NNP', 'chunk': 'I-ORGANIZATION', 'raw': 'Corp'}, {'pos': '.', 'chunk': 'O', 'raw': '.'}]
>>> pattern = 'chunk-"PERSON"'
>>> pyrata_re.search(pattern, data)
<pyrata.re Match object; groups=[[[{'pos': 'NNP', 'raw': 'Mark', 'chunk': 'B-PERSON'}, {'pos': 'NNP', 'raw': 'Zuckerberg', 'chunk': 'I-PERSON'}], 0, 2], [[{'pos': 'NNP', 'raw': 'Mark', 'chunk': 'B-PERSON'}, {'pos': 'NNP', 'raw': 'Zuckerberg', 'chunk': 'I-PERSON'}], 0, 2]]>
The metacharacter which means a chunk is -
(dash).
chunk-"PERSON"
can be substitute literally with (chunk="B-PERSON" chunk="I-PERSON"*)
. That's why the Match object contains two groups.
The actual chunk implementation uses the chunk operator - as a rewriting rule to turn the constraint into two with equality operator (e.g. chunk-"PERSON"
would be rewritten in (chunk="B-PERSON" chunk="I-PERSON"*)
).
This is done before starting the syntax analysis (compilation stage) or when building the compilation representation.
This trick has some consequences:
- Implicit groups are introduced around each chunk which be considered when referencing the groups
- It prevents us from including chunk constraints in classes (e.g.
[chunk-"PERSON" & raw="Mark"]
).
An element class offers a way to combine several simple constraints in the definition of a pattern element. The definition is marked by squared brackets ([...]
). Logical operators (and &
, or |
and not !
) and parenthesis are available to combine the constraints.
>>> data = [{'pos': 'PRP', 'raw': 'It'}, {'pos': 'VBZ', 'raw': 'is'}, {'pos': 'JJ', 'raw': 'fast'}, {'pos': 'JJ', 'raw': 'easy'}, {'pos': 'CC', 'raw': 'and'}, {'pos': 'JJ', 'raw': 'funny'}, {'pos': 'TO', 'raw': 'to'}, {'pos': 'VB', 'raw': 'write'}, {'pos': 'JJ', 'raw': 'regular'}, {'pos': 'NNS', 'raw': 'expressions'}, {'pos': 'IN', 'raw': 'with'},{'pos': 'NNP', 'raw': 'PyRATA'}]
>>> pyrata_re.findall('[(pos="NNS" | pos="NNP") & !raw="expressions"]', data)
[[{'pos': 'NNP', 'raw': 'PyRATA'}]]
Consequently [pos="NNS" | pos="NNP"]
, pos~"NN[SP]"
and pos~"(NNS|NNP)"
are equivalent (give the same result). They may not have the same processing time.
__Warning__ Since version 0.3.3, the grammar has a bit changed. It does not accept any longer raw negative element. '!pos="NNS"+'
must be rewritten into '[!pos="NNS"]+'
.
The wildcard element can match any single data token. It is represented by the .
(dot) metacharacter.
>>> data = [{'pos': 'PRP', 'raw': 'It'}, {'pos': 'VBZ', 'raw': 'is'}, {'pos': 'JJ', 'raw': 'fast'}, {'pos': 'JJ', 'raw': 'easy'}, {'pos': 'CC', 'raw': 'and'}, {'pos': 'JJ', 'raw': 'funny'}, {'pos': 'TO', 'raw': 'to'}, {'pos': 'VB', 'raw': 'write'}, {'pos': 'JJ', 'raw': 'regular'}, {'pos': 'NNS', 'raw': 'expressions'}, {'pos': 'IN', 'raw': 'with'},{'pos': 'NNP', 'raw': 'PyRATA'}]
>>> pyrata_re.search('. raw="PyRATA"', data)
<pyrata.re Match object; groups=[[[{'raw': 'with', 'pos': 'IN'}, {'raw': 'PyRATA', 'pos': 'NNP'}], 10, 12]]>
It can be used with any quantifiers
>>> pyrata_re.search('.+ raw="PyRATA"', data)
<pyrata.re Match object; groups=[[[{'raw': 'It', 'pos': 'PRP'}, {'raw': 'is', 'pos': 'VBZ'}, {'raw': 'fast', 'pos': 'JJ'}, {'raw': 'easy', 'pos': 'JJ'}, {'raw': 'and', 'pos': 'CC'}, {'raw': 'funny', 'pos': 'JJ'}, {'raw': 'to', 'pos': 'TO'}, {'raw': 'write', 'pos': 'VB'}, {'raw': 'regular', 'pos': 'JJ'}, {'raw': 'expressions', 'pos': 'NNS'}, {'raw': 'with', 'pos': 'IN'}, {'raw': 'PyRATA', 'pos': 'NNP'}], 0, 12]]>
but cannot be considered as a simple constraint.
It can also easily be simulated by using a not wanted value or not-existing attribute. Below [!raw="to"]
and [!foo="bar"]
correspond to a not wanted data token. All give the same results as the dot wildcard.
>>> pyrata_re.findall('pos~"VB." [!raw="to"]* raw="to"', data)
[[{'raw': 'is', 'pos': 'VBZ'}, {'raw': 'fast', 'pos': 'JJ'}, {'raw': 'easy', 'pos': 'JJ'}, {'raw': 'and', 'pos': 'CC'}, {'raw': 'funny', 'pos': 'JJ'}, {'raw': 'to', 'pos': 'TO'}]]
>>> pyrata_re.findall('pos~"VB." [!foo="bar"]* raw="to"', data)
[[{'raw': 'is', 'pos': 'VBZ'}, {'raw': 'fast', 'pos': 'JJ'}, {'raw': 'easy', 'pos': 'JJ'}, {'raw': 'and', 'pos': 'CC'}, {'raw': 'funny', 'pos': 'JJ'}, {'raw': 'to', 'pos': 'TO'}]]
>>> pyrata_re.findall('pos~"VB." .* raw="to"', data)
[[{'raw': 'is', 'pos': 'VBZ'}, {'raw': 'fast', 'pos': 'JJ'}, {'raw': 'easy', 'pos': 'JJ'}, {'raw': 'and', 'pos': 'CC'}, {'raw': 'funny', 'pos': 'JJ'}, {'raw': 'to', 'pos': 'TO'}]]
You can search a sequence of elements, for example an adjective (tagged JJ) followed by a noun in plural form (tagged NNS). The natural separator between the ordered elements is the whitespace character.
>>> pattern = 'pos="JJ" pos="NNS"'
>>> pyrata_re.search(pattern, data).group()
[{'pos': 'JJ', 'raw': 'regular'}, {'pos': 'NNS', 'raw': 'expressions'}]
To specify that a pattern should match from the begining and/or to the end of a data structure, you can use the anchors ^
and $
metacharacters in the pattern, respectively to mean the start and the end of the data.
>>> pattern = '^raw="It" [!foo="bar"]+'
>>> pyrata_re.search(pattern, data)
<pyrata.re Match object; groups=[[[{'raw': 'It', 'pos': 'PRP'}, {'raw': 'is', 'pos': 'VBZ'}, {'raw': 'fast', 'pos': 'JJ'}, {'raw': 'easy', 'pos': 'JJ'}, {'raw': 'and', 'pos': 'CC'}, {'raw': 'funny', 'pos': 'JJ'}, {'raw': 'to', 'pos': 'TO'}, {'raw': 'write', 'pos': 'VB'}, {'raw': 'regular', 'pos': 'JJ'}, {'raw': 'expressions', 'pos': 'NNS'}, {'raw': 'with', 'pos': 'IN'}, {'raw': 'PyRATA', 'pos': 'NNP'}], 0, 12]]>
You can quantify the repetition of a pattern element.
You can specify a quantifier to match one or more times consecutively the same form of an element. The element definition should be followed by the +
symbol:
>>> pyrata_re.findall('pos="JJ"+', data)
[[{'raw': 'fast', 'pos': 'JJ'}, {'raw': 'easy', 'pos': 'JJ'}], [{'raw': 'funny', 'pos': 'JJ'}], [{'raw': 'regular', 'pos': 'JJ'}]
You can specify a quantifier to match zero or more times consecutively a certain form of an element. The element definition should be followed by the *
symbol:
>>> pyrata_re.findall('pos="JJ"* [(pos="NNS" | pos="NNP")]', data)
[[[{'raw': 'regular', 'pos': 'JJ'}, {'raw': 'expressions', 'pos': 'NNS'}], [{'raw': 'PyRATA', 'pos': 'NNP'}]]
You can specify a quantifier to match once or not at all the given form of an element. The element definition should be followed by the ?
symbol:
>>> pyrata_re.findall('pos="JJ"? [(pos="NNS" | pos="NNP")]', data)
[[{'pos': 'JJ', 'raw': 'regular'}, {'pos': 'NNS', 'raw': 'expressions'}], [{'pos': 'NNP', 'raw': 'PyRATA'}]]
In order to retrieve the contents a specific part of a match, groups can be defined with parenthesis which indicate the start and end of a group.
The search
method, like finditer
, returns match objects. Only one for the search method, the first one, if it exists at least one. A match object contains by default one group, the zero group, which can be referenced by .group(0)
. If groups are defined in the pattern by mean of parenthesis, then they are also indexed. A group is described is described by a value, the covered data, and a pair of offsets.
>>> import pyrata.re as pyrata_re
>>> pyrata_re.search('raw="is" ([!raw="to"]+) raw="to"', [{'pos': 'PRP', 'raw': 'It'}, {'pos': 'VBZ', 'raw': 'is'}, {'pos': 'JJ', 'raw': 'fast'}, {'pos': 'JJ', 'raw': 'easy'}, {'pos': 'CC', 'raw': 'and'}, {'pos': 'JJ', 'raw': 'funny'}, {'pos': 'TO', 'raw': 'to'}, {'pos': 'VB', 'raw': 'write'}, {'pos': 'JJ', 'raw': 'regular'}, {'pos': 'NNS', 'raw': 'expressions'}, {'pos': 'IN', 'raw': 'with'},{'pos': 'NNP', 'raw': 'PyRATA'}]).group(1)
[{'raw': 'fast', 'pos': 'JJ'}, {'raw': 'easy', 'pos': 'JJ'}, {'raw': 'and', 'pos': 'CC'}, {'raw': 'funny', 'pos': 'JJ'}]
Or a more complex example with many more groups and embedded groups:
>>> pattern = 'raw="It" (raw="is") (( (pos="JJ"* pos="JJ") raw="and" (pos="JJ") )) (raw="to")'
>>> data = [{'pos': 'PRP', 'raw': 'It'}, {'pos': 'VBZ', 'raw': 'is'}, {'pos': 'JJ', 'raw': 'fast'}, {'pos': 'JJ', 'raw': 'easy'}, {'pos': 'CC', 'raw': 'and'}, {'pos': 'JJ', 'raw': 'funny'}, {'pos': 'TO', 'raw': 'to'}, {'pos': 'VB', 'raw': 'write'}, {'pos': 'JJ', 'raw': 'regular'}, {'pos': 'NNS', 'raw': 'expressions'}, {'pos': 'IN', 'raw': 'with'},{'pos': 'NNP', 'raw': 'PyRATA'}]
>>> pyrata_re.search(pattern, data)
<pyrata.re Match object; groups=[[[{'raw': 'It', 'pos': 'PRP'}, {'raw': 'is', 'pos': 'VBZ'}, {'raw': 'fast', 'pos': 'JJ'}, {'raw': 'easy', 'pos': 'JJ'}, {'raw': 'and', 'pos': 'CC'}, {'raw': 'funny', 'pos': 'JJ'}, {'raw': 'to', 'pos': 'TO'}], 0, 7], [[{'raw': 'is', 'pos': 'VBZ'}], 1, 2], [[{'raw': 'fast', 'pos': 'JJ'}, {'raw': 'easy', 'pos': 'JJ'}, {'raw': 'and', 'pos': 'CC'}, {'raw': 'funny', 'pos': 'JJ'}], 2, 6], [[{'raw': 'fast', 'pos': 'JJ'}, {'raw': 'easy', 'pos': 'JJ'}, {'raw': 'and', 'pos': 'CC'}, {'raw': 'funny', 'pos': 'JJ'}], 2, 6], [[{'raw': 'fast', 'pos': 'JJ'}, {'raw': 'easy', 'pos': 'JJ'}], 2, 4], [[{'raw': 'funny', 'pos': 'JJ'}], 5, 6], [[{'raw': 'to', 'pos': 'TO'}], 6, 7]]>
Groups can be quantified like in the following example:
>>> pattern = '(pos="VB" pos="DT"? pos="JJ"* pos="NN" pos=".")+'
>>> data = [ {'raw':'Choose', 'pos':'VB'},
{'raw':'Life', 'pos':'NN' },
{'raw':'.', 'pos':'.' },
{'raw':'Choose', 'pos':'VB'},
{'raw':'a', 'pos':'DT'},
{'raw':'job', 'pos':'NN'},
{'raw':'.', 'pos':'.'},
{'raw':'Choose', 'pos':'VB'},
{'raw':'a', 'pos':'DT'},
{'raw':'career', 'pos':'NN'},
{'raw':'.', 'pos':'.'},
{'raw':'Choose', 'pos':'VB'},
{'raw':'a', 'pos':'DT'},
{'raw':'family', 'pos':'NN'},
{'raw':'.', 'pos':'.'},
{'raw':'Choose', 'pos':'VB'},
{'raw':'a', 'pos':'DT'},
{'raw':'fucking', 'pos':'JJ'},
{'raw':'big', 'pos':'JJ'},
{'raw':'television', 'pos':'NN'},
{'raw':'.', 'pos':'.'}
]
>>> quantified_group = pyrata_re.search(pattern, data)
>>> quantified_group
>>> <pyrata.re Match object; groups=[[[{'pos': 'VB', 'raw': 'Choose'}, {'pos': 'NN', 'raw': 'Life'}, {'pos': '.', 'raw': '.'}, {'pos': 'VB', 'raw': 'Choose'}, {'pos': 'DT', 'raw': 'a'}, {'pos': 'NN', 'raw': 'job'}, {'pos': '.', 'raw': '.'}, {'pos': 'VB', 'raw': 'Choose'}, {'pos': 'DT', 'raw': 'a'}, {'pos': 'NN', 'raw': 'career'}, {'pos': '.', 'raw': '.'}, {'pos': 'VB', 'raw': 'Choose'}, {'pos': 'DT', 'raw': 'a'}, {'pos': 'NN', 'raw': 'family'}, {'pos': '.', 'raw': '.'}, {'pos': 'VB', 'raw': 'Choose'}, {'pos': 'DT', 'raw': 'a'}, {'pos': 'JJ', 'raw': 'fucking'}, {'pos': 'JJ', 'raw': 'big'}, {'pos': 'NN', 'raw': 'television'}, {'pos': '.', 'raw': '.'}], 0, 21], [[{'pos': 'VB', 'raw': 'Choose'}, {'pos': 'NN', 'raw': 'Life'}, {'pos': '.', 'raw': '.'}], 0, 3], [[{'pos': 'VB', 'raw': 'Choose'}, {'pos': 'DT', 'raw': 'a'}, {'pos': 'NN', 'raw': 'job'}, {'pos': '.', 'raw': '.'}], 3, 7], [[{'pos': 'VB', 'raw': 'Choose'}, {'pos': 'DT', 'raw': 'a'}, {'pos': 'NN', 'raw': 'career'}, {'pos': '.', 'raw': '.'}], 7, 11], [[{'pos': 'VB', 'raw': 'Choose'}, {'pos': 'DT', 'raw': 'a'}, {'pos': 'NN', 'raw': 'family'}, {'pos': '.', 'raw': '.'}], 11, 15], [[{'pos': 'VB', 'raw': 'Choose'}, {'pos': 'DT', 'raw': 'a'}, {'pos': 'JJ', 'raw': 'fucking'}, {'pos': 'JJ', 'raw': 'big'}, {'pos': 'NN', 'raw': 'television'}, {'pos': '.', 'raw': '.'}], 15, 21]]>
Choose Life. Choose a job. Choose a career. Choose a family. Choose a fucking big television.
Alternatives are a list of possible sub-patterns which can occur at a given position. As a group the list is delimited by parenthesis while the options are delimited by a pipe |
symbol. The options should not need to be ordered. The match is dependent of the matching mode greedy or reluctant.
>>> pattern = '(pos="IN") (raw="a" raw="tea" | raw="a" raw="cup" raw="of" raw="coffee" | raw="an" raw="orange" raw="juice" ) ([!pos=";"])'
>>> data = [ {'raw':'Over', 'pos':'IN'},
{'raw':'a', 'pos':'DT' },
{'raw':'cup', 'pos':'NN' },
{'raw':'of', 'pos':'IN'},
{'raw':'coffee', 'pos':'NN'},
{'raw':',', 'pos':','},
{'raw':'Mr.', 'pos':'NNP'},
{'raw':'Stone', 'pos':'NNP'},
{'raw':'told', 'pos':'VBD'},
{'raw':'his', 'pos':'PRP$'},
{'raw':'story', 'pos':'NN'} ]
>>>pyrata_re.search(pattern, data).group(2)
[{'pos': 'DT', 'raw': 'a'}, {'pos': 'NN', 'raw': 'cup'}, {'pos': 'IN', 'raw': 'of'}, {'pos': 'NN', 'raw': 'coffee'}]
Groups can be embedded in alternatives:
>>> pattern = '(pos="IN") (raw="a" (raw="tea") | raw="a" (raw="cup" raw="of" raw="coffee") | raw="an" (raw="orange" raw="juice") ) ([!pos=";"])'
>>> pyrata_re.search(pattern, data).group(3)
[{'pos': 'NN', 'raw': 'cup'}, {'pos': 'IN', 'raw': 'of'}, {'pos': 'NN', 'raw': 'coffee'}]
And alternatives can embed groups. In the example below, the matching mode plays its role on the matched data.
>>> data = [{'raw': 'It', 'pos': 'PRP'}, {'raw': 'is', 'pos': 'VBZ'}, {'raw': 'fast', 'pos': 'JJ'}, {'raw': 'easy', 'pos': 'JJ'}, {'raw': 'and', 'pos': 'CC'}, {'raw': 'funny', 'pos': 'JJ'}, {'raw': 'to', 'pos': 'TO'}, {'raw': 'write', 'pos': 'VB'}, {'raw': 'regular', 'pos': 'JJ'}, {'raw': 'expressions', 'pos': 'NNS'}, {'raw': 'with', 'pos': 'IN'}, {'raw': 'PyRATA', 'pos': 'NNP'}]
>>> pyrata_re.findall('(pos="JJ" | (pos="JJ" pos="NNS") )', data)
[[{'raw': 'fast', 'pos': 'JJ'}], [{'raw': 'easy', 'pos': 'JJ'}], [{'raw': 'funny', 'pos': 'JJ'}], [{'raw': 'regular', 'pos': 'JJ'}, {'raw': 'expressions', 'pos': 'NNS'}]]
>>> pyrata_re.findall('(pos="JJ" | (pos="JJ" pos="NNS") )', data, mode='reluctant')
[[{'raw': 'fast', 'pos': 'JJ'}], [{'raw': 'easy', 'pos': 'JJ'}], [{'raw': 'funny', 'pos': 'JJ'}], [{'raw': 'regular', 'pos': 'JJ'}]]
Alternatives can be quantified.
>>> pattern = '(pos="VB" [!pos="NN"]* raw="Life" pos="."| pos="VB" [!pos="NN"]* raw="job" pos="."|pos="VB" [!pos="NN"]* raw="career" pos="."|pos="VB" [!pos="NN"]* raw="family" pos="."|pos="VB" [!pos="NN"]* raw="television" pos=".")+'
>>> data = [ {'raw':'Choose', 'pos':'VB'},
{'raw':'Life', 'pos':'NN' },
{'raw':'.', 'pos':'.' },
{'raw':'Choose', 'pos':'VB'},
{'raw':'a', 'pos':'DT'},
{'raw':'job', 'pos':'NN'},
{'raw':'.', 'pos':'.'},
{'raw':'Choose', 'pos':'VB'},
{'raw':'a', 'pos':'DT'},
{'raw':'career', 'pos':'NN'},
{'raw':'.', 'pos':'.'},
{'raw':'Choose', 'pos':'VB'},
{'raw':'a', 'pos':'DT'},
{'raw':'family', 'pos':'NN'},
{'raw':'.', 'pos':'.'},
{'raw':'Choose', 'pos':'VB'},
{'raw':'a', 'pos':'DT'},
{'raw':'fucking', 'pos':'JJ'},
{'raw':'big', 'pos':'JJ'},
{'raw':'television', 'pos':'NN'},
{'raw':'.', 'pos':'.'}
]
>>> quantified_alternatives = pyrata_re.search(pattern, data)
>>> quantified_alternatives
>>> <pyrata.re Match object; groups=[[[{'pos': 'VB', 'raw': 'Choose'}, {'pos': 'NN', 'raw': 'Life'}, {'pos': '.', 'raw': '.'}, {'pos': 'VB', 'raw': 'Choose'}, {'pos': 'DT', 'raw': 'a'}, {'pos': 'NN', 'raw': 'job'}, {'pos': '.', 'raw': '.'}, {'pos': 'VB', 'raw': 'Choose'}, {'pos': 'DT', 'raw': 'a'}, {'pos': 'NN', 'raw': 'career'}, {'pos': '.', 'raw': '.'}, {'pos': 'VB', 'raw': 'Choose'}, {'pos': 'DT', 'raw': 'a'}, {'pos': 'NN', 'raw': 'family'}, {'pos': '.', 'raw': '.'}, {'pos': 'VB', 'raw': 'Choose'}, {'pos': 'DT', 'raw': 'a'}, {'pos': 'JJ', 'raw': 'fucking'}, {'pos': 'JJ', 'raw': 'big'}, {'pos': 'NN', 'raw': 'television'}, {'pos': '.', 'raw': '.'}], 0, 21], [[{'pos': 'VB', 'raw': 'Choose'}, {'pos': 'NN', 'raw': 'Life'}, {'pos': '.', 'raw': '.'}], 0, 3], [[{'pos': 'VB', 'raw': 'Choose'}, {'pos': 'DT', 'raw': 'a'}, {'pos': 'NN', 'raw': 'job'}, {'pos': '.', 'raw': '.'}], 3, 7], [[{'pos': 'VB', 'raw': 'Choose'}, {'pos': 'DT', 'raw': 'a'}, {'pos': 'NN', 'raw': 'career'}, {'pos': '.', 'raw': '.'}], 7, 11], [[{'pos': 'VB', 'raw': 'Choose'}, {'pos': 'DT', 'raw': 'a'}, {'pos': 'NN', 'raw': 'family'}, {'pos': '.', 'raw': '.'}], 11, 15], [[{'pos': 'VB', 'raw': 'Choose'}, {'pos': 'DT', 'raw': 'a'}, {'pos': 'JJ', 'raw': 'fucking'}, {'pos': 'JJ', 'raw': 'big'}, {'pos': 'NN', 'raw': 'television'}, {'pos': '.', 'raw': '.'}], 15, 21]]>
Again Choose Life. Choose a job. Choose a career. Choose a family. Choose a fucking big television.
The matching methods available offer multiple ways of exploring the data.
Assuming the following data:
>>> data = [{'pos': 'PRP', 'raw': 'It'},
{'pos': 'VBZ', 'raw': 'is'},
{'pos': 'JJ', 'raw': 'fast'},
{'pos': 'JJ', 'raw': 'easy'},
{'pos': 'CC', 'raw': 'and'},
{'pos': 'JJ', 'raw': 'funny'},
{'pos': 'TO', 'raw': 'to'},
{'pos': 'VB', 'raw': 'write'},
{'pos': 'JJ', 'raw': 'regular'},
{'pos': 'NNS', 'raw': 'expressions'},
{'pos': 'IN', 'raw': 'with'},
{'pos': 'NNP', 'raw': 'PyRATA'}]
Let's say you want to search the adjectives. By chance there is a property which specifies the part of speech of tokens, pos, the value of pos which stands for adjectives is JJ.
To search the first location where a given pattern (here pos="JJ"
) produces a match:
>>> pyrata_re.search('pos="JJ"', data)
>>> <pyrata_re Match object; span=(2, 3), match="[{'pos': 'JJ', 'raw': 'fast'}]">
To get the value of the match:
>>> pyrata_re.search('pos="JJ"', data).group()
>>> [{'raw': 'fast', 'pos': 'JJ'}]
This default match is known as the zero group:
>>> pyrata_re.search('pos="JJ"', data).group(0)
>>> [{'raw': 'fast', 'pos': 'JJ'}]
To get the value of the start and the end:
>>> pyrata_re.search('pos="JJ"', data).start()
>>> 2
>>> pyrata_re.search('pos="JJ"', data).end()
>>> 3
To find all non-overlapping matches of pattern in data, as a list of datas:
>>> pyrata_re.findall('pos="JJ"', data)
>>> [[{'pos': 'JJ', 'raw': 'fast'}], [{'pos': 'JJ', 'raw': 'easy'}], [{'pos': 'JJ', 'raw': 'funny'}], [{'pos': 'JJ', 'raw': 'regular'}]]]
To get an iterator yielding match objects over all non-overlapping matches for the RE pattern in data:
>>> for m in pyrata_re.finditer('pos="JJ"', data): print (m)
...
<pyrata_re Match object; span=(2, 3), match="[{'pos': 'JJ', 'raw': 'fast'}]">
<pyrata_re Match object; span=(3, 4), match="[{'pos': 'JJ', 'raw': 'easy'}]">
<pyrata_re Match object; span=(5, 6), match="[{'pos': 'JJ', 'raw': 'funny'}]">
<pyrata_re Match object; span=(8, 9), match="[{'pos': 'JJ', 'raw': 'regular'}]">
A Match is an object which is created when a pattern matching occurs. With the search
method, only the first one is considered. With the finditer
method, all the occurrences of the pattern will lead to the creation of a Match. For, finditer
the Matches are appended to an object which lists all the Matches, namely a MatchesList.
Comparison operators and the len
method on Match objects are available:
>>> m1 = pyrata_re.search('pos="JJ"', data)
<pyrata.re Match object; groups=[[[{'raw': 'fast', 'pos': 'JJ'}], 2, 3]]>
The Match object contains the value of instanciated pattern and its offsets in data.
>>> m2 = pyrata_re.search('pos="JJ"', data)
>>> m3 = pyrata_re.search('pos="NN"', data)
>>> if m1 == m2: print ('True')
...
True
If none group is specified then the result of the comparison between the zero groups is returned with eq
and ne
operators.
>>> if m1 != m3: print ('True')
...
True
>>> len(m1)
>>> 1
>>> m4 = pyrata_re.search('(pos="JJ")+', data)
>>> m4
<pyrata.re Match object; groups=[[[{'raw': 'fast', 'pos': 'JJ'}, {'raw': 'easy', 'pos': 'JJ'}], 2, 4], [[{'raw': 'fast', 'pos': 'JJ'}], 2, 3], [[{'raw': 'easy', 'pos': 'JJ'}], 3, 4]]>
In addition to the default zero group, the pattern defined a group which has two instances because of the quantifier.
>>> len(m4)
>>> 3 #
Comparison operators and the len
method on MatchesList objects are available:
>>> ml1 = pyrata_re.finditer('pos="JJ"', data)
>>> ml2 = pyrata_re.finditer('pos="JJ"', data)
>>> ml3 = pyrata_re.finditer('pos="NN"', data)
>>> if ml1 == ml2: print ('True')
...
True
>>> if ml1 != ml3: print ('True')
...
True
>>> len(ml1)
>>> 4
The previous tests can be performed with the two Matches objects created above from the Trainspotting data i.e. quantified_group
and quantified_alternatives
.
The PyRATA matching engine operates with a global matching mode.
- If the match succeeds, the matching engine moves jumps just after the position of the last matched data token and starts a new search from this new position. Quantifiers in an expression benefit from this mode.
- If the match fails, the matching engine moves to the next position in the data (from the current to the current+1) and starts a new search from this new position.
In addition, it allows to perform greedy or reluctant matching. By default, a quantified subpattern is greedy, that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match.
Let's work with the following pattern and data:
>>> pattern = 'pos="JJ"* pos="JJ"'
>>> data = [ {'pos': 'VBZ', 'raw': 'is'}, {'pos': 'JJ', 'raw': 'fast'}, {'pos': 'JJ', 'raw': 'easy'}, {'pos': 'JJ', 'raw': 'funny'}, {'pos': 'JJ', 'raw': 'neat'}, {'pos': 'TO', 'raw': 'to'}]
In the example below greedy
is explicitely specified (actually there is no need since it is the default mode).
>>> pyrata_re.search(pattern, data, mode = 'greedy')
<pyrata.re Match object; groups=[[[{'raw': 'fast', 'pos': 'JJ'}, {'raw': 'easy', 'pos': 'JJ'}, {'raw': 'funny', 'pos': 'JJ'}, {'raw': 'neat', 'pos': 'JJ'}], 1, 5]]>
Reluctant matching process means to match the minimum number of times possible. In the example below, the engine stops at the first match.
>>> pyrata_re.search(pattern, data, mode = 'reluctant')
<pyrata.re Match object; groups=[[[{'raw': 'fast', 'pos': 'JJ'}], 1, 2]]>
Same data, same pattern, same search method but distinct matching mode. We get two distinct object. The former being longer than the latter.
The logging facility was partially interrupted in v0.4. The following may not work as expected.
PyRATA uses the python logging facility.
To understand the process of a pyrata_re method either at the compilation or matching stage, first import the logging module:
>>> import pyrata.re as pyrata_re
>>> import logging
Set the loggging filename, optionally the logging format of messages, and the logging level:
logging.DEBUG
For very detailed output for diagnostic purposes (10)logging.INFO
Report events that occur during normal operation of a program (e.g. for status monitoring or fault investigation) (20)logging.WARNING
Issue a warning regarding a particular runtime event (30)
DEBUG is more verbose than WARNING. WARNING will only report syntactic parsing problems.
>>> logging.basicConfig(format='%(levelname)s:\t%(message)s', filename='mypyrata.log', level=logging.INFO)
Now you can just run a compilation process
>>> pyrata_re.compile ('pos~"JJ"* pos~"NN."')
or any matching process (which encompasses a compilation process):
>>> data = [{'pos': 'PRP', 'raw': 'It'},
{'pos': 'VBZ', 'raw': 'is'},
{'pos': 'JJ', 'raw': 'fast'},
{'pos': 'JJ', 'raw': 'easy'},
{'pos': 'CC', 'raw': 'and'},
{'pos': 'JJ', 'raw': 'funny'},
{'pos': 'TO', 'raw': 'to'},
{'pos': 'VB', 'raw': 'write'},
{'pos': 'JJ', 'raw': 'regular'},
{'pos': 'NNS', 'raw': 'expressions'},
{'pos': 'IN', 'raw': 'with'},
{'pos': 'NNP', 'raw': 'PyRATA'}]
>>> pyrata_re.findall ('pos="JJ" [(pos="NNS" | pos="NNP")]', data)
And observe the logging file in the current directory.
To dynamically change the log level without restarting the application, just type:
>>> logging.getLogger().setLevel(logging.DEBUG)
Log messages are incrementally appended at the end of the previous ones.
Compiled regular expression objects support the following methods search
, findall
and finditer
. It follows the same API as Python re but uses a sequence of features set instead of a string.
Below an example of use with the findall
method
>>> data = [{'pos': 'PRP', 'raw': 'It'}, {'pos': 'VBZ', 'raw': 'is'}, {'pos': 'JJ', 'raw': 'fast'}, {'pos': 'JJ', 'raw': 'easy'}, {'pos': 'CC', 'raw': 'and'}, {'pos': 'JJ', 'raw': 'funny'}, {'pos': 'TO', 'raw': 'to'}, {'pos': 'VB', 'raw': 'write'}, {'pos': 'JJ', 'raw': 'regular'}, {'pos': 'NNS', 'raw': 'expressions'}, {'pos': 'IN', 'raw': 'with'},{'pos': 'NNP', 'raw': 'PyRATA'}]
>>> compiled_re = pyrata_re.compile('pos~"JJ"* pos~"NN."')
>>> compiled_re.findall(data)
[[{'raw': 'regular', 'pos': 'JJ'}, {'raw': 'expressions', 'pos': 'NNS'}], [{'raw': 'PyRATA', 'pos': 'NNP'}]]
A compiled regular expression object is made of a Non-deterministic Finite Automata (NFA), the specification of having to start/end with the data and the lexicons which are used in its pattern elements.
Warning v0.4 may have some display bugs and some states may not be present.
The following expression IN[pos~"JJ"]->CHAR(#S)->OUT[pos~"NN.",pos~"JJ"]
defines the character state #S
which can be get by the input state pos~"JJ"``and lead to two output states ``pos~"NN."
and pos~"JJ"
. Characters #S
, #S
and #S
mean respectively Start, Matching and Empty.
>>> compiled_re
<pyrata.syntactic_pattern_parser CompiledPattern object;
starts_wi_data="False"
ends_wi_data="False"
lexicon="dict_keys([])"
nfa="
<pyrata.nfa NFA object;
states="{'IN[pos~"JJ"]->CHAR(#S)->OUT[pos~"NN.",pos~"JJ"]', 'IN[#S]->CHAR(pos~"NN.")->OUT[#M]', 'IN[pos~"NN."]->CHAR(#M)->OUT[]', 'IN[pos~"JJ"]->CHAR(#S)->OUT[pos~"NN.",pos~"JJ"]'}">
">
Here the representation of a compiled pattern with chunks:
>>> pyrata_re.compile ('chunk-"NP"')
<pyrata.syntactic_pattern_parser CompiledPattern object;
starts_wi_data="False"
ends_wi_data="False"
lexicon="dict_keys([])"
nfa="
<pyrata.nfa NFA object;
states="{'IN[chunk="I-NP",chunk="B-NP"]->CHAR(#M)->OUT[chunk="I-NP"]', 'IN[]->CHAR(#S)->OUT[chunk="B-NP"]'}">
">
Here the representation of a compiled pattern with quantified groups and alternatives :
pyrata_re.compile('raw="a"? (pos~"JJ" pos~"JJ")* (pos="NNS"|pos="NNP")+')
<pyrata.syntactic_pattern_parser CompiledPattern object;
starts_wi_data="False"
ends_wi_data="False"
lexicon="dict_keys([])"
nfa="
<pyrata.nfa NFA object;
states="{'IN[pos="NNS",pos="NNP"]->CHAR(#M)->OUT[#E]', 'IN[#S,raw="a",pos~"JJ"]->CHAR(#E)->OUT[pos~"JJ",#E]', 'IN[]->CHAR(#S)->OUT[#E,raw="a"]'}">
">
..[['?', 'raw="a"'], ['+', [[[None, 'pos="NNS"']], [[None, 'pos="NNP"']]]]]">
By edit methods we mean substitution, updating, extension of the data feature structure. The process of updating or extending a feature structure is also called annotation.
The sub(pattern, annotation, replacement, group = [0])
method substitutes the leftmost non-overlapping occurrences of pattern matches or a given group of matches by a dict or a sequence of dicts. Returns a copy of the data obtained and by default the data unchanged.
>>> import pyrata.re as pyrata_re
>>> pattern = 'pos~"NN.?"'
>>> annotation = {'raw':'smurf', 'pos':'NN' }
>>> data = [ {'raw':'Over', 'pos':'IN'},
{'raw':'a', 'pos':'DT' }, {'raw':'cup', 'pos':'NN' },
{'raw':'of', 'pos':'IN'},
{'raw':'coffee', 'pos':'NN'},
{'raw':',', 'pos':','},
{'raw':'Mr.', 'pos':'NNP'}, {'raw':'Stone', 'pos':'NNP'},
{'raw':'told', 'pos':'VBD'},
{'raw':'his', 'pos':'PRP$'}, {'raw':'story', 'pos':'NN'} ]
>>> pyrata_re.sub(pattern, annotation, data)
[{'raw': 'Over', 'pos': 'IN'},
{'raw': 'a', 'pos': 'DT'}, {'raw': 'smurf', 'pos': 'NN'},
{'raw': 'of', 'pos': 'IN'},
{'raw': 'smurf', 'pos': 'NN'},
{'raw': ',', 'pos': ','},
{'raw': 'smurf', 'pos': 'NN'}, {'raw': 'smurf', 'pos': 'NN'},
{'raw': 'told', 'pos': 'VBD'},
{'raw': 'his', 'pos': 'PRP$'}, {'raw': 'smurf', 'pos': 'NN'}]
Here an example by modifying a group of a Match:
>>> pyrata_re.sub('pos~"(DT|PRP\$)" (pos~"NN.?")', {'raw':'smurf', 'pos':'NN' }, [{'raw':'Over', 'pos':'IN'}, {'raw':'a', 'pos':'DT' }, {'raw':'cup', 'pos':'NN' }, {'raw':'of', 'pos':'IN'}, {'raw':'coffee', 'pos':'NN'}, {'raw':',', 'pos':','}, {'raw':'Mr.', 'pos':'NNP'}, {'raw':'Stone', 'pos':'NNP'}, {'raw':'told', 'pos':'VBD'}, {'raw':'his', 'pos':'PRP$'}, {'raw':'story', 'pos':'NN'}], group = [1])
[{'raw': 'Over', 'pos': 'IN'}, {'raw': 'a', 'pos': 'DT'}, {'raw': 'smurf', 'pos': 'NN'}, {'raw': 'of', 'pos': 'IN'}, {'raw': 'coffee', 'pos': 'NN'}, {'raw': ',', 'pos': ','}, {'raw': 'Mr.', 'pos': 'NNP'}, {'raw': 'Stone', 'pos': 'NNP'}, {'raw': 'told', 'pos': 'VBD'}, {'raw': 'his', 'pos': 'PRP$'}, {'raw': 'smurf', 'pos': 'NN'}]
To completely remove some parts of the data, the anotation should be an empty list []
.
The update(pattern, annotation, replacement, group = [0], iob = False)
method updates (and extends) the features of a match or a group of a match with the features of a dict or a sequence of dicts (of the same size as the group/match).
>>> pyrata_re.update('(raw="Mr.")', {'raw':'Mr.', 'pos':'TITLE' }, [{'raw':'Over', 'pos':'IN'}, {'raw':'a', 'pos':'DT' }, {'raw':'cup', 'pos':'NN' }, {'raw':'of', 'pos':'IN'}, {'raw':'coffee', 'pos':'NN'}, {'raw':',', 'pos':','}, {'raw':'Mr.', 'pos':'NNP'}, {'raw':'Stone', 'pos':'NNP'}, {'raw':'told', 'pos':'VBD'}, {'raw':'his', 'pos':'PRP$'}, {'raw':'story', 'pos':'NN'}])
[{'raw': 'Over', 'pos': 'IN'}, {'raw': 'a', 'pos': 'DT'}, {'raw': 'cup', 'pos': 'NN'}, {'raw': 'of', 'pos': 'IN'}, {'raw': 'coffee', 'pos': 'NN'}, {'raw': ',', 'pos': ','}, {'raw': 'Mr.', 'pos': 'TITLE'}, {'raw': 'Stone', 'pos': 'NNP'}, {'raw': 'told', 'pos': 'VBD'}, {'raw': 'his', 'pos': 'PRP$'}, {'raw': 'story', 'pos': 'NN'}]
The extend(pattern, annotation, replacement, group = [0], iob = False)
method extends (i.e. if a feature exists then do not update) the features of a match or a group of a match with the features of a dict or a sequence of dicts (of the same size as the group/match:
>>> pattern = 'pos~"(DT|PRP\$|NNP)"? pos~"NN.?"'
>>> annotation = {'chunk':'NP'}
>>> data = [ {'raw':'Over', 'pos':'IN'},
{'raw':'a', 'pos':'DT' }, {'raw':'cup', 'pos':'NN' },
{'raw':'of', 'pos':'IN'},
{'raw':'coffee', 'pos':'NN'},
{'raw':',', 'pos':','},
{'raw':'Mr.', 'pos':'NNP'}, {'raw':'Stone', 'pos':'NNP'},
{'raw':'told', 'pos':'VBD'},
{'raw':'his', 'pos':'PRP$'}, {'raw':'story', 'pos':'NN'} ]
>>> pyrata_re.extend(pattern, annotation, data)
[{'pos': 'IN', 'raw': 'Over'},
{'pos': 'DT', 'raw': 'a', 'chunk': 'NP'}, {'pos': 'NN', 'raw': 'cup', 'chunk': 'NP'},
{'pos': 'IN', 'raw': 'of'},
{'pos': 'NN', 'raw': 'coffee', 'chunk': 'NP'},
{'pos': ',', 'raw': ','},
{'pos': 'NNP', 'raw': 'Mr.', 'chunk': 'NP'}, {'pos': 'NNP', 'raw': 'Stone', 'chunk': 'NP'},
{'pos': 'VBD', 'raw': 'told'},
{'pos': 'PRP$', 'raw': 'his', 'chunk': 'NP'}, {'pos': 'NN', 'raw': 'story', 'chunk': 'NP'}]
Both with update or extend, you can specify if the data obtained should be annotated with IOB tag prefix.
>>> pyrata_re.extend(pattern, annotation, data, iob = True)
[{'raw': 'Over', 'pos': 'IN'},
{'raw': 'a', 'chunk': 'B-NP', 'pos': 'DT'}, {'raw': 'cup', 'chunk': 'I-NP', 'pos': 'NN'},
{'raw': 'of', 'pos': 'IN'}, {'raw': 'coffee', 'chunk': 'B-NP', 'pos': 'NN'},
{'raw': ',', 'pos': ','},
{'raw': 'Mr.', 'chunk': 'B-NP', 'pos': 'NNP'}, {'raw': 'Stone', 'chunk': 'I-NP', 'pos': 'NNP'},
{'raw': 'told', 'pos': 'VBD'},
{'raw': 'his', 'chunk': 'B-NP', 'pos': 'PRP$'}, {'raw': 'story', 'chunk': 'I-NP', 'pos': 'NN'}]
Each regular expression is converted into a Non-deterministic Finite Automata (NFA) at the compilation stage. During the execution, a pattern can match several data patterns wrt the expression. Each match corresponds to a possible Deterministic Finite Automata (DFA).
PyRATA offers a way to extract the DFA as a list of actual encountered states. Successively the following example shows the internal representation of the NFA with all the present steps, then it shows the match obtained with a search method, and the corresponding ordered DFA states.
>>> data = [{'pos': 'PRP', 'raw': 'It'}, {'pos': 'VBZ', 'raw': 'is'}, {'pos': 'JJ', 'raw': 'fast'}, {'pos': 'JJ', 'raw': 'easy'}, {'pos': 'CC', 'raw': 'and'}, {'pos': 'JJ', 'raw': 'funny'}, {'pos': 'TO', 'raw': 'to'}, {'pos': 'VB', 'raw': 'write'}, {'pos': 'JJ', 'raw': 'regular'}, {'pos': 'NNS', 'raw': 'expressions'}, {'pos': 'IN', 'raw': 'with'}, {'pos': 'NNP', 'raw': 'Pyrata'}]
>>> pattern = '(pos="JJ"|pos="NN")* pos~"NN.*"+'
>>> compiled_re = pyrata_re.compile(pattern)
<pyrata.nfa CompiledPattern object;
starts_wi_data="False"
ends_wi_data="False"
lexicon="dict_keys([])"
nfa="
<pyrata.nfa NFA object;
states="{'IN[pos="JJ",pos="NN"]->CHAR(#S)->OUT[pos="JJ",pos="NN",pos~"NN.*"]', 'IN[pos="JJ",pos="NN"]->CHAR(#S)->OUT[pos="JJ",pos="NN",pos~"NN.*"]', 'IN[#S,pos~"NN.*"]->CHAR(pos~"NN.*")->OUT[pos~"NN.*",#M]', 'IN[pos~"NN.*"]->CHAR(#M)->OUT[]'}">
">
>>> compile_re.search(data)
<pyrata.re Match object; groups=[[[{'raw': 'regular', 'pos': 'JJ'}, {'raw': 'expressions', 'pos': 'NNS'}], 8, 10], [[{'raw': 'regular', 'pos': 'JJ'}], 8, 9]]>
>>> compiled_re.search(data).DFA()
['IN[#S]->CHAR(pos="JJ")->OUT[#S]', 'IN[pos~"NN.*",#S]->CHAR(pos~"NN.*")->OUT[pos~"NN.*",#M]']
In a near future they could also be searched in new data.
Have a look at the nltk.py
script (run it). It shows how to turn various nltk analysis results into the pyrata data structure.
In practice two approaches are available: either by building the dict list on fly or by using the dedicated PyRATA nltk methods: list2pyrata (**kwargs)
and listList2pyrata (**kwargs)
.
Thanks to python, you can also easily turn a sentence into the PyRATA data structure, for example by doing:
>>> import nltk
>>> sentence = "It is fast easy and funny to write regular expressions with PyRATA"
>>> pyrata_data = [{'raw':word, 'pos':pos} for (word, pos) in nltk.pos_tag(nltk.word_tokenize(sentence))]
pyrata_data = [{'pos': 'PRP', 'raw': 'It'}, {'pos': 'VBZ', 'raw': 'is'}, {'pos': 'JJ', 'raw': 'fast'}, {'pos': 'JJ', 'raw': 'easy'}, {'pos': 'CC', 'raw': 'and'}, {'pos': 'JJ', 'raw': 'funny'}, {'pos': 'TO', 'raw': 'to'}, {'pos': 'VB', 'raw': 'write'}, {'pos': 'JJ', 'raw': 'regular'}, {'pos': 'NNS', 'raw': 'expressions'}, {'pos': 'IN', 'raw': 'with'},{'pos': 'NNP', 'raw': 'PyRATA'}]
Generating a more complex data on fly is similarly easy:
>>> import nltk
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> from nltk.chunk import tree2conlltags
>>> sentence = "Mark is working at Facebook Corp."
>>> pyrata_data = [{'raw':word, 'pos':pos, 'stem':nltk.stem.SnowballStemmer('english').stem(word), 'lem':nltk.WordNetLemmatizer().lemmatize(word.lower()), 'sw':(word in nltk.corpus.stopwords.words('english')), 'chunk':chunk} for (word, pos, chunk) in tree2conlltags(ne_chunk(pos_tag(word_tokenize(sentence))))]
>>> pyrata_data
[{'lem': 'mark', 'raw': 'Mark', 'sw': False, 'stem': 'mark', 'pos': 'NNP', 'chunk': 'B-PERSON'}, {'lem': 'is', 'raw': 'is', 'sw': True, 'stem': 'is', 'pos': 'VBZ', 'chunk': 'O'}, {'lem': 'working', 'raw': 'working', 'sw': False, 'stem': 'work', 'pos': 'VBG', 'chunk': 'O'}, {'lem': 'at', 'raw': 'at', 'sw': True, 'stem': 'at', 'pos': 'IN', 'chunk': 'O'}, {'lem': 'facebook', 'raw': 'Facebook', 'sw': False, 'stem': 'facebook', 'pos': 'NNP', 'chunk': 'B-ORGANIZATION'}, {'lem': 'corp', 'raw': 'Corp', 'sw': False, 'stem': 'corp', 'pos': 'NNP', 'chunk': 'I-ORGANIZATION'}, {'lem': '.', 'raw': '.', 'sw': False, 'stem': '.', 'pos': '.', 'chunk': 'O'}]
The former method, list2pyrata
, turns a list into a list of dict (e.g. a list of words into a list of dict) with a feature to represent the surface form of the word (default is raw
). If parameter name
is given then the dict feature name will be the one set by the first value of the passed list as parameter value of name. If parameter dictList
is given then this list of dict will be extended with the values of the list (named or not).
The latter, listList2pyrata
, turns a list of list listList
into a list of dict with values being the elements of the second list; the value names are arbitrary chosen. If the parameter names
is given then the dict feature names will be the ones set (the order matters) in the list passed as names
parameter value. If parameter dictList
is given then the list of dict will be extented with the values of the list (named or not).
Example of uses of PyRATA dedicated conversion methods: See the nltk.py
scripts
So far (v0.4), the drawing option are only available in the pyrata_re.py
script. See the command line running section.
If you run the git version, you may make the code faster by removing the logging instructions.
Simply run:
bash more/code-optimize.sh
To restore the code (modification will be lost):
bash more/code-restore.sh
A benchmark script is currently in development to compare PyRATA with some python alternatives.
python3 do_benchmark.py
Up to v0.3.* the code was realised under the MIT license. Since the v0.4, PyRATA is released under the Apache License 2.0. Here a short summary of the license .
The documentation is distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
In addition to this current documentation, you may have look at do_tests.py
to see the implemented examples and more.
You can also read
[1] | Regular Expression Matching Can Be Simple And Fast |
[2] | An Efficient and Elegant Regular Expression Matcher in Python: http://morepypy.blogspot.com.au/2010/05/efficient-and-elegant-regular.html |