This is the documentation for a wrapper for FreeLing 4. This wrapper was developed to ease the processing of texts for my PhD. It is written in Python 3 and tested on macOS Sierra.
This project was started to meet my own needs. I don't plan to add features on request, and I don't provide any kind of support, especially for other platforms. But feel free to fork, clone and play with the code.
├── README.md
├── config
├── freeling.py
├── test
└── test_freeling.sh
- `README.md`: this documentation.
- `config/`: folder containing the configuration files for the different analyzers.
- `freeling.py`: general wrapper for `VRT`, `XML` or plain text input formats, outputting `VRT` or `CONLL` formats.
- `test/`: folder containing test data.
- `test_freeling.sh`: shell script to test basic English and Spanish processing.
There are five dependencies:
- Homebrew, a package manager to install software on macOS
- FreeLing, an open source language analysis tool suite
- Python 3
- lxml, a Python library to work with XML
- libxml2 and libxslt, C libraries which are dependencies of lxml
Follow Neil Gee's guide to install and set up Homebrew on macOS Sierra.
Use Homebrew to install FreeLing by running this command:
brew install freeling
Homebrew will take care of any dependencies.
You can install Python 3 with Homebrew by following the instructions in The Hitchhiker's Guide to Python or Lisa Tagliaferri's very thorough guide.
Basically:
brew install python3
macOS Sierra already provides libxml2 and libxslt. They can also be installed through Homebrew, though:
brew install libxml2
brew install libxslt
Now you are ready to install lxml:
pip3 install lxml
Once you have installed FreeLing and all the dependencies, you will always do two things:
- start a FreeLing analyzer in server mode
- run the wrapper script
Because the wrapper is designed to process batches of files, and each file can be split into smaller text units, we want to avoid the overhead of loading the analyzer's parameters for each (chunk of) text to be processed. If we start a server, the parameters are loaded only once.
The server options can be declared in a configuration file or via command line options. Command line options override the directives in the configuration file.
For more details check the FreeLing documentation for the analyzer.
A server to analyze texts in English with default options can be invoked with the following command:
analyze -f en.cfg --server --port 50005 &
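If you prefer to assemble that command from Python, a minimal sketch could look like the one below. The function name `build_server_command` is hypothetical (it is not part of `freeling.py`); it only puts together the `analyze` flags shown in this README and does not start anything.

```python
import shlex


def build_server_command(config, port, **options):
    """Return the argument list to start `analyze` in server mode.

    Extra keyword arguments become `--name value` command line options,
    which take precedence over the directives in the configuration file.
    """
    command = ["analyze", "-f", config]
    for name, value in options.items():
        command += ["--" + name, str(value)]
    command += ["--server", "--port", str(port)]
    return command


# Same command as above, without the trailing `&`.
print(shlex.join(build_server_command("en.cfg", 50005)))
```

Passing the returned list to `subprocess.Popen` would launch the server without blocking the script, which plays the role of the trailing `&` in the shell.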
After a few seconds, some information will appear in the terminal window (such as how to stop the server). Keep that terminal window open to monitor the server.
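Since the server needs a few seconds to load, a script that launches it may want to wait until the port accepts connections before calling the wrapper. The sketch below is only an illustration: `wait_for_server` is a hypothetical helper, and it assumes that the server starts listening once its parameters are loaded and that a short probe connection does not disturb it. I have not checked either assumption against FreeLing itself.

```python
import socket
import time


def wait_for_server(port, host="localhost", timeout=30.0, interval=0.5):
    """Return True as soon as a TCP connection to host:port succeeds.

    Return False if no connection could be made within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Open and immediately close a probe connection.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet: wait a bit and retry.
            time.sleep(interval)
    return False
```

For the server started above, `wait_for_server(50005)` would be called right before running `freeling.py`.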
option | meaning | default | values |
---|---|---|---|
--input | Input format in which to expect text to analyze | text | text, freeling, conll |
--output | Output format to produce with analysis results | freeling | freeling, conll, xml, json, naf, train |
--inplv | Analysis level of input data (already tagged) | text | text, token, splitted, morfo, tagged, shallow, dep, coref |
--outlv | Analysis level of output data (to be tagged) | tagged | token, splitted, morfo, tagged, shallow, parsed, dep, coref, semgraph |
--sense | Kind of sense annotation to perform | no | no, all, mfs, ukb |
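The table above can be turned into a small lookup to catch typos before launching a server. This is just a sketch: the names `ANALYZE_OPTIONS` and `resolve_options` are hypothetical, and the values are copied from the table rather than queried from FreeLing.

```python
# Defaults and accepted values, copied from the table above.
ANALYZE_OPTIONS = {
    "input": ("text", ("text", "freeling", "conll")),
    "output": ("freeling", ("freeling", "conll", "xml", "json", "naf", "train")),
    "inplv": ("text", ("text", "token", "splitted", "morfo", "tagged",
                       "shallow", "dep", "coref")),
    "outlv": ("tagged", ("token", "splitted", "morfo", "tagged", "shallow",
                         "parsed", "dep", "coref", "semgraph")),
    "sense": ("no", ("no", "all", "mfs", "ukb")),
}


def resolve_options(**overrides):
    """Merge `overrides` into the defaults, rejecting unknown names or values."""
    resolved = {name: default for name, (default, _) in ANALYZE_OPTIONS.items()}
    for name, value in overrides.items():
        if name not in ANALYZE_OPTIONS:
            raise ValueError("unknown option: --" + name)
        if value not in ANALYZE_OPTIONS[name][1]:
            raise ValueError("invalid value for --%s: %s" % (name, value))
        resolved[name] = value
    return resolved


print(resolve_options(sense="ukb", input="freeling", inplv="tagged"))
```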
These are a series of customized files to ease the invocation of servers from the command line:
- `en_nomwe.cfg` and `es_nomwe.cfg`: no module carrying out multiword expression detection is activated.
- `en_mwe.cfg` and `es_mwe.cfg`: all modules carrying out multiword expression detection are activated.
- `en_mwe_nec.cfg` and `es_mwe_nec.cfg`: like `*_mwe.cfg` but also performing named entity classification (NEC).
Once a server is up and running, we can use `freeling.py` to process the files. Below is an explanation of the options that can be passed to the wrapper.
usage: freeling.py [-h] -s SOURCE -t TARGET -p PORT [-f FPATTERN] [--sentence] -e ELEMENT
[-o {flg,vrt}]
optional arguments:
-h, --help show this help message and exit
-s SOURCE, --source SOURCE
path to directory where the source files are located.
-t TARGET, --target TARGET
path to the directory where the translations are located.
-p PORT, --port PORT port number of the FreeLing server.
-f FPATTERN, --fpattern FPATTERN
pattern to find the relevant files.
--sentence if provided sentences are already tagged as XML.
-e ELEMENT, --element ELEMENT
element where text to be processed is contained
-o {flg,vrt}, --oformat {flg,vrt}
output format
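For readers who want to script around the wrapper, the help text above corresponds roughly to the `argparse` setup sketched below. This is a reconstruction from the help output, not the actual source of `freeling.py`: converting the port to `int` is my assumption, and the sketch sticks to the options listed in the help text even though some examples further down pass other values (`conll`, `xml`, `--constituency`).

```python
import argparse


def build_parser():
    """Build a parser mirroring the usage message shown above."""
    parser = argparse.ArgumentParser(prog="freeling.py")
    parser.add_argument("-s", "--source", required=True,
                        help="path to directory where the source files are located.")
    parser.add_argument("-t", "--target", required=True,
                        help="path to the directory where the results are written.")
    # Assumption: the port is parsed as an integer.
    parser.add_argument("-p", "--port", required=True, type=int,
                        help="port number of the FreeLing server.")
    parser.add_argument("-f", "--fpattern",
                        help="pattern to find the relevant files.")
    parser.add_argument("--sentence", action="store_true",
                        help="if provided sentences are already tagged as XML.")
    parser.add_argument("-e", "--element", required=True,
                        help="element where text to be processed is contained")
    parser.add_argument("-o", "--oformat", choices=["flg", "vrt"],
                        help="output format")
    return parser


args = build_parser().parse_args(
    ["-s", "./test/en/", "-t", "./test/en/output/", "-p", "50005",
     "-f", "*_w_sentences.xml", "--sentence", "-e", "s", "-o", "vrt"]
)
print(args)
```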
And this would be an example to process a text in English with an analyzer running with the default configuration:
python freeling.py -s ./test/en/ -t ./test/en/output/ -p 50005 -f "*_w_sentences.xml" --sentence -e s -o vrt
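To make the roles of `-f` and `-e` more concrete, here is a rough sketch of how files could be selected with a glob pattern and how the text units could be pulled out of a given element. It is an illustration only: `iter_text_units` is a hypothetical function, the sample files are made up, and it uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra packages. The real wrapper may well work differently.

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET


def iter_text_units(source, fpattern, element):
    """Yield the text of every `element` found in the files of `source`
    whose names match the glob pattern `fpattern`."""
    for path in sorted(glob.glob(os.path.join(source, fpattern))):
        for node in ET.parse(path).iter(element):
            text = "".join(node.itertext()).strip()
            if text:
                yield text


# Demo with two throwaway files (hypothetical names and content).
with tempfile.TemporaryDirectory() as source:
    samples = {
        "doc_w_sentences.xml": "<text><p><s>Hello world.</s><s>Bye.</s></p></text>",
        "doc_wo_sentences.xml": "<text><p>Hello world. Bye.</p></text>",
    }
    for name, content in samples.items():
        with open(os.path.join(source, name), "w", encoding="utf-8") as handle:
            handle.write(content)
    # Sentences already marked up: one unit per <s>.
    sentences = list(iter_text_units(source, "*_w_sentences.xml", "s"))
    # No sentence markup: one unit per <p>.
    paragraphs = list(iter_text_units(source, "*wo_sentences.xml", "p"))

print(sentences)
print(paragraphs)
```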
List of my most frequent server configurations.
port | command | requires | yields |
---|---|---|---|
50101 | analyze -f ./config/en_nomwe.cfg --server --port 50101 & | text | token, lemma, POS |
50102 | analyze -f ./config/en_mwe.cfg --server --port 50102 & | text | token, lemma, POS |
50103 | analyze -f ./config/en_mwe_nec.cfg --server --port 50103 & | text | token, lemma, POS |
50104 | analyze -f ./config/en_mwe_nec.cfg --sense ukb --input freeling --inplv tagged --server --port 50104 & | token, lemma, POS | WSD |
50105 | analyze -f ./config/en_mwe_nec.cfg --input freeling --inplv tagged --outlv shallow --server --port 50105 & | token, lemma, POS | constituency |
50106 | analyze -f ./config/en_mwe_nec.cfg --input freeling --inplv tagged --outlv dep --server --port 50106 & | token, lemma, POS | dependency |
50111 | analyze -f ./config/en_nomwe.cfg --output conll --server --port 50111 & | text | token, lemma, POS |
50112 | analyze -f ./config/en_mwe.cfg --output conll --server --port 50112 & | text | token, lemma, POS |
50113 | analyze -f ./config/en_mwe_nec.cfg --output conll --server --port 50113 & | text | token, lemma, POS |
50114 | analyze -f ./config/en_mwe_nec.cfg --sense ukb --input freeling --inplv tagged --output conll --server --port 50114 & | token, lemma, POS | WSD |
50115 | analyze -f ./config/en_mwe_nec.cfg --input freeling --inplv tagged --outlv shallow --output conll --server --port 50115 & | token, lemma, POS | constituency |
50116 | analyze -f ./config/en_mwe_nec.cfg --input freeling --inplv tagged --outlv dep --output conll --server --port 50116 & | token, lemma, POS | dependency |
50121 | analyze -f ./config/en_nomwe.cfg --output xml --server --port 50121 & | text | token, lemma, POS |
50122 | analyze -f ./config/en_mwe.cfg --output xml --server --port 50122 & | text | token, lemma, POS |
50123 | analyze -f ./config/en_mwe_nec.cfg --output xml --server --port 50123 & | text | token, lemma, POS |
50124 | analyze -f ./config/en_mwe_nec.cfg --sense ukb --input freeling --inplv tagged --output xml --server --port 50124 & | token, lemma, POS | WSD |
50125 | analyze -f ./config/en_mwe_nec.cfg --input freeling --inplv tagged --outlv shallow --output xml --server --port 50125 & | token, lemma, POS | constituency |
50126 | analyze -f ./config/en_mwe_nec.cfg --input freeling --inplv tagged --outlv dep --output xml --server --port 50126 & | token, lemma, POS | dependency |
50201 | analyze -f ./config/es_nomwe.cfg --server --port 50201 & | sentence | token, lemma, POS |
50202 | analyze -f ./config/es_mwe.cfg --server --port 50202 & | sentence | token, lemma, POS |
50203 | analyze -f ./config/es_mwe_nec.cfg --server --port 50203 & | sentence | token, lemma, POS |
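The port numbers in this table seem to follow a scheme: the hundreds digit encodes the language, the tens digit the output format and the units digit the kind of analysis. The sketch below spells out that reading. Note that it is inferred from the numbers used in this README, and `port_for` is a hypothetical helper, not something the wrapper provides.

```python
# Digits inferred from the ports used in this README.
LANGUAGES = {"en": 1, "es": 2}
OUTPUTS = {"freeling": 0, "conll": 1, "xml": 2}
ANALYSES = {"nomwe": 1, "mwe": 2, "mwe_nec": 3, "wsd": 4, "shallow": 5, "dep": 6}


def port_for(lang, analysis, output="freeling"):
    """Return the port following the 50<language><output><analysis> pattern."""
    return 50000 + 100 * LANGUAGES[lang] + 10 * OUTPUTS[output] + ANALYSES[analysis]


print(port_for("en", "nomwe"))
print(port_for("es", "shallow", output="xml"))
```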
# English
analyze -f ./config/en_nomwe.cfg --server --port 50101 &
# Spanish
analyze -f ./config/es_nomwe.cfg --server --port 50201 &
# English
python freeling.py -s ./test/en/ -t ./test/en/tmp_output/nomwe -p 50101 -f "*w_sentences.xml" --sentence -e s -o flg
# Spanish
python freeling.py -s ./test/es/ -t ./test/es/tmp_output/nomwe -p 50201 -f "*w_sentences.xml" --sentence -e s -o flg
# English
python freeling.py -s ./test/en/ -t ./test/en/tmp_output/nomwe -p 50101 -f "*w_sentences.xml" --sentence -e s -o vrt
# Spanish
python freeling.py -s ./test/es/ -t ./test/es/tmp_output/nomwe -p 50201 -f "*w_sentences.xml" --sentence -e s -o vrt
# English
analyze -f ./config/en_nomwe.cfg --output conll --server --port 50111 &
# Spanish
analyze -f ./config/es_nomwe.cfg --output conll --server --port 50211 &
# English
python freeling.py -s ./test/en/ -t ./test/en/tmp_output/nomwe -p 50111 -f "*w_sentences.xml" --sentence -e s -o conll
# Spanish
python freeling.py -s ./test/es/ -t ./test/es/tmp_output/nomwe -p 50211 -f "*w_sentences.xml" --sentence -e s -o conll
# English
analyze -f ./config/en_mwe_nec.cfg --server --port 50103 &
# Spanish
analyze -f ./config/es_mwe_nec.cfg --server --port 50203 &
# English
python freeling.py -s ./test/en/ -t ./test/en/tmp_output/mwe -p 50103 -f "*w_sentences.xml" --sentence -e s -o flg
# Spanish
python freeling.py -s ./test/es/ -t ./test/es/tmp_output/mwe -p 50203 -f "*w_sentences.xml" --sentence -e s -o flg
# English
python freeling.py -s ./test/en/ -t ./test/en/tmp_output/mwe -p 50103 -f "*w_sentences.xml" --sentence -e s -o vrt
# Spanish
python freeling.py -s ./test/es/ -t ./test/es/tmp_output/mwe -p 50203 -f "*w_sentences.xml" --sentence -e s -o vrt
# English
analyze -f ./config/en_mwe_nec.cfg --output conll --server --port 50113 &
# Spanish
analyze -f ./config/es_mwe_nec.cfg --output conll --server --port 50213 &
# English
python freeling.py -s ./test/en/ -t ./test/en/tmp_output/mwe -p 50113 -f "*w_sentences.xml" --sentence -e s -o conll
# Spanish
python freeling.py -s ./test/es/ -t ./test/es/tmp_output/mwe -p 50213 -f "*w_sentences.xml" --sentence -e s -o conll
# English
analyze -f ./config/en_nomwe.cfg --server --port 50101 &
# Spanish
analyze -f ./config/es_nomwe.cfg --server --port 50201 &
# English
python freeling.py -s ./test/en/ -t ./test/en/tmp_output/nomwe -p 50101 -f "*wo_sentences.xml" -e p -o flg
# Spanish
python freeling.py -s ./test/es/ -t ./test/es/tmp_output/nomwe -p 50201 -f "*wo_sentences.xml" -e p -o flg
# English
python freeling.py -s ./test/en/ -t ./test/en/tmp_output/nomwe -p 50101 -f "*wo_sentences.xml" -e p -o vrt
# Spanish
python freeling.py -s ./test/es/ -t ./test/es/tmp_output/nomwe -p 50201 -f "*wo_sentences.xml" -e p -o vrt
# English
analyze -f ./config/en_nomwe.cfg --output conll --server --port 50111 &
# Spanish
analyze -f ./config/es_nomwe.cfg --output conll --server --port 50211 &
# English
python freeling.py -s ./test/en/ -t ./test/en/tmp_output/nomwe -p 50111 -f "*wo_sentences.xml" -e p -o conll
# Spanish
python freeling.py -s ./test/es/ -t ./test/es/tmp_output/nomwe -p 50211 -f "*wo_sentences.xml" -e p -o conll
# English
analyze -f ./config/en_nomwe.cfg --server --port 50103 &
# Spanish
analyze -f ./config/es_nomwe.cfg --server --port 50203 &
# English
python freeling.py -s ./test/en/ -t ./test/en/tmp_output/nomwe -p 50103 -f "*wo_sentences.xml" -e p -o flg
# Spanish
python freeling.py -s ./test/es/ -t ./test/es/tmp_output/nomwe -p 50203 -f "*wo_sentences.xml" -e p -o flg
# English
python freeling.py -s ./test/en/ -t ./test/en/tmp_output/nomwe -p 50103 -f "*wo_sentences.xml" -e p -o vrt
# Spanish
python freeling.py -s ./test/es/ -t ./test/es/tmp_output/nomwe -p 50203 -f "*wo_sentences.xml" -e p -o vrt
# English
analyze -f ./config/en_nomwe.cfg --output conll --server --port 50113 &
# Spanish
analyze -f ./config/es_nomwe.cfg --output conll --server --port 50213 &
# English
python freeling.py -s ./test/en/ -t ./test/en/tmp_output/nomwe -p 50113 -f "*wo_sentences.xml" -e p -o conll
# Spanish
python freeling.py -s ./test/es/ -t ./test/es/tmp_output/nomwe -p 50213 -f "*wo_sentences.xml" -e p -o conll
# English
analyze -f ./config/en_mwe_nec.cfg --sense ukb --input freeling --inplv tagged --server --port 50104 &
# Spanish
analyze -f ./config/es_mwe_nec.cfg --sense ukb --input freeling --inplv tagged --server --port 50204 &
# English
python freeling.py -s ./test/en/tmp_output/mwe -t ./test/en/tmp_output/wsd -p 50104 -f "*.flg" -e p -o flg
# Spanish
python freeling.py -s ./test/es/tmp_output/mwe -t ./test/es/tmp_output/wsd -p 50204 -f "*.flg" -e p -o flg
# English
python freeling.py -s ./test/en/tmp_output/mwe -t ./test/en/tmp_output/wsd -p 50104 -f "*.flg" -e p -o vrt
# Spanish
python freeling.py -s ./test/es/tmp_output/mwe -t ./test/es/tmp_output/wsd -p 50204 -f "*.flg" -e p -o vrt
# English
analyze -f ./config/en_mwe_nec.cfg --sense ukb --input freeling --inplv tagged --output conll --server --port 50114 &
# Spanish
analyze -f ./config/es_mwe_nec.cfg --sense ukb --input freeling --inplv tagged --output conll --server --port 50214 &
# English
python freeling.py -s ./test/en/tmp_output/mwe -t ./test/en/tmp_output/wsd -p 50114 -f "*.flg" -e p -o conll
# Spanish
python freeling.py -s ./test/es/tmp_output/mwe -t ./test/es/tmp_output/wsd -p 50214 -f "*.flg" -e p -o conll
# English
analyze -f ./config/en_mwe_nec.cfg --outlv shallow --output xml --server --port 50125 &
# Spanish
analyze -f ./config/es_mwe_nec.cfg --outlv shallow --output xml --server --port 50225 &
# English
# with sentences preannotated
python freeling.py -s ./test/en/ -t ./test/en/tmp_output/shallow -p 50125 -f "*w_sentences.xml" --sentence -e s -o xml --constituency
# without sentences preannotated
python freeling.py -s ./test/en/ -t ./test/en/tmp_output/shallow -p 50125 -f "*wo_sentences.xml" -e p -o xml --constituency
# Spanish
# with sentences preannotated
python freeling.py -s ./test/es/ -t ./test/es/tmp_output/shallow -p 50225 -f "*w_sentences.xml" --sentence -e s -o xml --constituency
# without sentences preannotated
python freeling.py -s ./test/es/ -t ./test/es/tmp_output/shallow -p 50225 -f "*wo_sentences.xml" -e p -o xml --constituency
Run:
sh test/test_freeling.sh
- readers/writers for different formats VRT, XML, TCF
- implement output formats for processed text:
  - multilayer `vrt` (output each layer of information in a separate VRT file)