PyDKPro provides a Python wrapper for the DKPro Core NLP framework. DKPro Core itself is based on the UIMA framework and is written in Java. Interoperability is achieved via web services deployed as Docker containers. After containerization, services remain active until they are manually stopped or have been idle for a limited period of time.
Analysis results in DKPro Core are represented as CAS objects. Conversion between Java and Python data structures is based on dkpro-cassis. We also provide built-in support for the spaCy and NLTK formats, allowing for seamless integration.
PyDKPro is still under heavy development. Feedback is highly appreciated.
For demonstration purposes, different use cases with working (mocked) examples are provided in Examples/UseCases.ipynb
- Python >=3.6
- Git
- mvn
- Docker (please make sure it is running)
Creating virtual environment
Install virtualenv if not already installed.
$ python -m pip install virtualenv
Create virtual environment. Replace [env_name] with a name of your choice.
$ mkdir [env_name]
$ virtualenv -p python3 [env_name]
or
$ python3 -m venv [env_name]
Activate created virtual environment.
For Windows:
$ [env_name]\Scripts\activate.bat
For Linux and Mac OS:
$ source [env_name]/bin/activate
Creating virtual environment using conda
Create virtual environment with conda. Replace [env_name] with a name of your choice.
$ conda create --name [env_name] python=3.6
To activate the environment (on Windows, MacOS and Unix),
$ conda activate [env_name]
Clone this repository
$ git clone https://github.com/zesch/pydkpro.git
Install dependencies using pip
$ cd pydkpro
$ python -m pip install -r requirements.txt
$ python -m spacy download en_core_web_sm
- Using DKPro Core components directly in Python
- Conversion to spaCy
- Conversion to NLTK
How to open the examples notebook
$ cd Examples
$ jupyter notebook
Defining an NLP pipeline
A pipeline is built by adding DKPro Core components.
from pydkpro import Pipeline, Component
p = Pipeline(version="2.0.0", language='en')
p.add(Component().clearNlpSegmenter())
p.add(Component().stanfordPosTagger(variant='fast-caseless', printTagSet='false'))
p.build() # fire up the container web service
Note: All parameters are optional and default to the best-performing model versions.
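For example, the same components can be added without any parameters; this is a minimal sketch that simply relies on the defaults described in the note above:
p = Pipeline(version="2.0.0", language='en')
p.add(Component().clearNlpSegmenter())
p.add(Component().stanfordPosTagger())  # no parameters: the default model is used
p.build()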
Run the pipeline
Running the pipeline built above generates a CAS object for the provided string. This CAS object can be used to retrieve annotations such as tokens, POS tags, etc.
cas = p.process('Backgammon is one of the oldest known board games.', language='en')
Note: A language detector is used if the language parameter is not provided.
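For instance, the same call can be made without the language parameter (a minimal sketch relying on the automatic detection mentioned in the note):
cas = p.process('Backgammon is one of the oldest known board games.')  # language detected automatically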
To return all the tokens:
from pydkpro import DKProCoreTypeSystem as dts
cas.select(dts().token).as_text()
Output:
['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games', '.']
To return all the POS tags:
cas.select(dts().token).get_pos()
Output:
['NNP', 'VBZ', 'NN', 'IN', 'DT', 'JJS', 'VBN', 'NN', 'NNS', '.']
Provide UIMA CAS functionality
DKProCoreTypeSystem allows other type systems to be integrated, so that dkpro-cassis can be used with them as well. The generated CAS object provides UIMA CAS functionality. For example:
# add annotation
from pydkpro.cas import Cas
Token = dts().typesystem.get_type('de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token') # define dkpro token
cas = Cas(dts().typesystem)()
cas.sofa_string = "I like cheese ."
tokens = [
    Token(begin=0, end=1, id='0', pos='NNP'),
    Token(begin=2, end=6, id='1', pos='VBD'),
    Token(begin=7, end=13, id='2', pos='IN'),
    Token(begin=14, end=15, id='3', pos='.')
]
for token in tokens:
    cas.add_annotation(token)
CAS token attributes can be printed as follows:
print([x.get_covered_text() for x in cas.select_all()])
print([x.pos for x in cas.select_all()])
Output:
['I', 'like', 'cheese', '.']
['NNP', 'VBD', 'IN', '.']
Conversion from CAS to spaCy format and vice-versa
Generated CAS objects can also be converted to the spaCy type system.
from pydkpro import To_spacy, From_spacy
cas = p.process('Backgammon is one of the oldest known board games.', language='en')
for token in To_spacy(cas)():
    print(token.text, token.tag_)
Conversion from spaCy
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
cas = From_spacy(doc)()
print(cas.select(dts().token).get_pos())
Conversion from CAS to NLTK format
NLTK uses a specific format for each type of preprocessing. Here is an example for POS tagging:
from pydkpro.external import To_nltk, From_nltk
print(To_nltk().tagger(cas))
Output:
[('Backgammon', 'NNP'), ('is', 'VBZ'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('oldest', 'JJS'), ('known', 'VBN'), ('board', 'NN'), ('games', 'NNS'), ('.', '.')]
This output can then be used for further integration with other NLTK components:
import nltk
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>}"""
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(To_nltk().tagger(cas))
print(chunked)
Output:
(S
  (Chunk Backgammon/NNP)
  is/VBZ
  one/CD
  of/IN
  the/DT
  oldest/JJS
  known/VBN
  board/NN
  games/NNS
  ./.)
Conversion from NLTK
PyDKPro also provides the reverse functionality, where a CAS object can be created from spaCy or NLTK output. In the following example, tokenization is performed using the NLTK tweet tokenizer, but POS tagging is done with the DKPro wrapper of the Stanford CoreNLP POS tagger using their fast.41 model:
from nltk.tokenize import TweetTokenizer
cas = From_nltk().tokenizer(TweetTokenizer().tokenize('Backgammon is one of the oldest known board games.'))
CAS processing
The PyDKPro pipeline also supports processing CAS objects directly, as demonstrated in the example below:
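A minimal sketch of such direct CAS processing, continuing the NLTK example above and reusing the names imported earlier; the 'fast.41' variant value and whether process() accepts an existing CAS object are assumptions based on the surrounding examples, not a confirmed API:
tagger = Pipeline(version="2.0.0", language='en')
tagger.add(Component().stanfordPosTagger(variant='fast.41'))  # assumed variant value, see note above
tagger.build()
cas = tagger.process(cas)  # assumption: process() also accepts a CAS object created from NLTK output
print(cas.select(dts().token).get_pos())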
Shortcut for running single components
A single component can also be run without the need to build a pipeline first:
tokenizer = Component().clearNlpSegmenter()
cas = tokenizer.process('I like playing cricket.')
print(cas.select(dts().token).as_text())
Output:
['I', 'like', 'playing', 'cricket', '.']
Working with a list of strings
Multiple strings can also be processed as a list, where each element of the list is treated as a separate document.
str_list = ['Backgammon is one of the oldest known board games.', 'I like playing cricket.']
for str in str_list:
cas = p.process(str)
print(cas.select(dts().token).as_text())
Working with text documents
Pipelines can also be directly run on text documents:
from pydkpro.external import File2str
cas = p.process(File2str('test_data/input/test2.txt')())
print(cas.select(dts().token).as_text())
Working with multiple text documents
Multiple documents can also be processed by providing document paths and document name matching patterns (see the glob sketch after the example below):
# documents available at different path can be provided in list
docs = ['test_data/input/1.txt', 'test_data/input/2.txt']
for doc in docs:
    p.process(File2str(doc)())
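If the documents should instead be picked up by a file name pattern, Python's standard glob module can be used; a minimal sketch in which the directory and pattern are placeholders:
import glob

# process every .txt file in the input directory (hypothetical pattern)
for doc in sorted(glob.glob('test_data/input/*.txt')):
    p.process(File2str(doc)())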
End collection process
The following command completes the pipeline's collection process (alternatively, the with scope operator can be used, as sketched below):
p.finish()
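A minimal sketch of the with variant; this assumes that Pipeline supports the context manager protocol and finishes the collection process automatically when the block exits, which is our reading of the note above:
from pydkpro import Pipeline, Component

with Pipeline(version="2.0.0", language='en') as p:
    p.add(Component().clearNlpSegmenter())
    p.add(Component().stanfordPosTagger())
    p.build()
    cas = p.process('Backgammon is one of the oldest known board games.')
# finish() is assumed to be called automatically at the end of the with block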