If you have a directory containing gzipped ukWaC XML files with extracted NP heads, you can use this API to read them. There are two ways to do this: reading the corpus files directly, or reading a .h5 file. To read from a .h5 file you'll need to generate it first, which can take ~11-15h depending on the selected options. Direct reading gives you more output flexibility; .h5 reading is roughly 3x faster.
Say you want to parse ukWaC corpus semantic roles and extract as much as possible from the data, and you don't care about speed. You need the output to contain a governor and its corresponding set of arguments, along with the `source` and `text` data. The heads should all be lemmatized and delexicalized. In addition, you want to apply a word frequency list of the top 50k most frequent tokens.
```python
from ukwac_api import Corpus, CorpusReader

reader = CorpusReader('march_2017_heads_low')
reader.create(None, 0, True)                        # h5name, compression, lemmatize
reader.generate_freqlist('wordfreq.pickle', 50000)  # wordfreq, size
reader.set_wordfilter('wordfreq.txt')
it = reader.read('set', True, ['source', 'text'], True)  # mode, delexicalize, extract, lemmatize
for govset in it:
    print govset[:-1], govset[-1]['source'], govset[-1]['text']
```
```
Corpus:
    .create(h5name='ukwac.h5')  # parse and serialize the corpus into .h5 format
    .clean_up()                 # remove all .h5 and .pickle files generated by the create() method
    .generate_freqlist('wordfreq.pickle', 50000)  # build a top-n frequency list from wordfreq.pickle
    .set_wordfilter('wordfreq.txt')               # set a word filter

CorpusReader(Corpus):
    .connect(h5file)  # set the path to a .h5 corpus file and use it
    .disconnect()     # close the .h5 corpus file and switch to direct corpus parsing
    .read(mode, delexicalize, extract, lemmatize)  # return a corpus iterator
```
```xml
<sents>
  <text/>
  <s>
    <predicate>
      <governor>take/vbd/2</governor>
      <dependencies>
        <dep algorithm="malt" source="i/prp/1" text="i" type="a0">i/prp/1</dep>
        <dep source="take/vbd/2" text="take" type="v">take/vbd/2</dep>
        <dep algorithm="failed" source="very/rb/3 little/jj/4" text="very little" type="a2">very/rb/3 little/jj/4</dep>
        <dep algorithm="malt" source="part/nn/5" text="part" type="a1">part/nn/5</dep>
        <dep algorithm="malt" source="in/in/6 the/dt/7 conversation/nn/8" text="in the conversation" type="am-loc">conversation/nn/8</dep>
      </dependencies>
    </predicate>
  </s>
  ...
</sents>
```
- `<sents>` – the root element of the XML file; contains `<text>` and `<s>` elements.
- `<text>` – carries an `id` attribute holding the URL address of the sentences that follow, up to the next `<text>` element.
- `<s>` – an element containing a sentence processed by the head searching algorithms.
- `<predicate>` – a sentence predicate block.
- `<governor>` – the governor of the respective predicate block, e.g. `examine/vbz/12`, where `examine` is the word, `vbz` is the POS tag and `12` is the word index within the sentence `<s>` element.
- `<dependencies>` – a block of dependencies for the current sentence governor; contains `<dep>` elements.
- `<dep>` – a single dependency of the current governor; contains a head word extracted by one of the head searching algorithms, e.g. `state/nn/16`, where `state` is the head word, `nn` is the POS tag and `16` is the word index. The element also carries the following attributes:
  - `algorithm` – the name of the head searching algorithm used for head finding: `malt`, `malt_span`, `linear`, or `failed` if no algorithm was able to find a head.
  - `source` – the tokenized, lemmatized and POS-tagged clause governed by the `<governor>`; this clause is used by the head searching algorithms.
  - `text` – the tokenized, lemmatized clause governed by the `<governor>`; similar to `source` but without POS tags.
  - `type` – the dependency type, following PropBank annotation.
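To make the structure concrete, here is a minimal sketch of walking these elements with Python's standard `gzip` and `xml.etree.ElementTree` modules. The file name is a placeholder; in practice the read() method does all of this for you.

```python
import gzip
import xml.etree.ElementTree as ET

# Placeholder file name; any gzipped ukWaC heads file with the
# <sents>/<text>/<s>/<predicate> structure shown above would do.
with gzip.open('ukwac_heads_01.xml.gz', 'rb') as f:
    root = ET.parse(f).getroot()  # the <sents> root element

for predicate in root.iter('predicate'):
    governor = predicate.find('governor').text  # e.g. 'take/vbd/2'
    for dep in predicate.iter('dep'):
        # The attributes documented above: algorithm, source, text, type.
        print('%s %s %s %s' % (governor, dep.get('type'),
                               dep.get('algorithm'), dep.text))
```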
Iteratively read through all .xml/.gz files found in the specified directory. You can set the reading mode to "set", "random_set" or "single". In "set" mode the output will contain the governor and all its respective dependants within one sentence:
```
('sent id', 'governor', 'gov postag',
 (('arg', 'arg postag', 'role', 'alg'), (...), (...))
)
```
"random_set" mode (.h5 only) returns a randomly (without replacement) chosen govset from the .h5 file. The search algorithm looks for a valid chunk of a governor and its args and, once found, returns a tuple:
```
('sent id', 'governor', 'gov postag',
 (('arg', 'arg postag', 'role', 'alg'), (...), (...))
)
```
In "single" mode the output will contain a governor -> dependant pair and their semantic attributes:
```
('sent id', 'governor', 'gov postag', 'arg', 'arg postag', 'role', 'algorithm')
```
In other words, these are two different representations of the same data.
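As a rough sketch of how the two layouts relate (made-up data, tuple shapes as shown above):

```python
# A made-up govset in "set" mode layout:
# ('sent id', 'governor', 'gov postag', (('arg', 'arg postag', 'role', 'alg'), ...))
govset = (42, 'take', 'vbd',
          (('i', 'prp', 'a0', 'malt'),
           ('part', 'nn', 'a1', 'malt')))

# Flattening it yields the equivalent "single" mode records:
# ('sent id', 'governor', 'gov postag', 'arg', 'arg postag', 'role', 'algorithm')
singles = [govset[:3] + arg for arg in govset[3]]
# -> [(42, 'take', 'vbd', 'i', 'prp', 'a0', 'malt'),
#     (42, 'take', 'vbd', 'part', 'nn', 'a1', 'malt')]
```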
USAGE:
```python
reader = CorpusReader('ukwac_heads')
reader.read('set')
```
During direct parsing you can also choose whether to delexicalize the output. Delexicalization is the process of replacing string data with integers. For example:
```
('establish', 'vbg', 'constitution', 'nnp', 1, 3) --> (234, 21, 120, 10, 1, 3)
```
However, for delexicalization to occur you need the .pickle files that are generated by the .create() method together with the .h5 files. If no required .pickle is found, the parameter is ignored. Delexicalization does not apply to the extracted `source` and `text` parameters.
USAGE:
```python
reader = CorpusReader('ukwac_heads')
reader.read('set', True)
```
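Conceptually, delexicalization is a dictionary lookup against the word and POS-tag mappings stored in the .pickle files. A minimal sketch of the idea, with hypothetical mapping file names (the actual pickle names and format are internal to create()):

```python
import pickle

# Hypothetical file names; the real mappings are produced by .create().
with open('word2id.pickle', 'rb') as f:
    word2id = pickle.load(f)      # assumed {word: int} mapping
with open('postag2id.pickle', 'rb') as f:
    postag2id = pickle.load(f)    # assumed {postag: int} mapping

def delexicalize(record):
    """Map ('establish', 'vbg', 'constitution', 'nnp', 1, 3) style
    records to their integer form, e.g. (234, 21, 120, 10, 1, 3)."""
    gov, gov_pos, arg, arg_pos, role, alg = record
    return (word2id[gov], postag2id[gov_pos],
            word2id[arg], postag2id[arg_pos], role, alg)
```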
As you can see above, the XML head files also contain `source` and `text` attributes. You can extract their values by supplying a list of attribute names as the `extract` parameter of the read() method. The values are added to a dict that is appended to the output, so you can easily access them as, for example, output[-1]['source'].
USAGE:
```python
reader = CorpusReader('ukwac_heads')
reader.read('set', True, ['source', 'text'])
```
You can add any attribute, e.g. `type`; its value will also be added to the dictionary, even though it is already extracted as a main component of the output.
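For instance, assuming read() was called with `['source', 'text', 'type']` as the extract list, the trailing dict can be inspected like this (a sketch):

```python
it = reader.read('set', True, ['source', 'text', 'type'], True)
for govset in it:
    attrs = govset[-1]       # the dict of extracted attributes appended to the output
    print(attrs['source'])   # POS-tagged clause
    print(attrs['text'])     # the same clause without POS tags
    print(attrs['type'])     # also added, even though type is part of the main output
```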
Applies nltk lemmatization to governors and dependants. `source` and `text` attribute values are already lemmatized, since they come from the ukWaC corpus (though not 100%: due to ukWaC preprocessing inconsistencies some words may retain their original form).
USAGE:
```python
reader = CorpusReader('ukwac_heads')
reader.read('set', True, ['source', 'text'], True)
```
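Under the hood this corresponds to something like NLTK's WordNetLemmatizer; a rough sketch of the idea (the exact POS handling inside the API is an assumption):

```python
from nltk.stem import WordNetLemmatizer  # requires the WordNet data: nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# WordNet expects coarse POS letters, so map Penn Treebank tags to them.
def wn_pos(postag):
    return {'v': 'v', 'n': 'n', 'j': 'a', 'r': 'r'}.get(postag[0].lower(), 'n')

print(lemmatizer.lemmatize('took', wn_pos('vbd')))    # -> take
print(lemmatizer.lemmatize('states', wn_pos('nns')))  # -> state
```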
You can create or generate a list of words that must be included in the output. A typical use case is when you need only the top 50k most frequent words in your output. The .generate_freqlist() method picks up a previously saved wordfreq.pickle and generates an n-most-frequent-words list for you.
USAGE:
```python
corpus = Corpus('ukwac_heads')
corpus.generate_freqlist('wordfreq.pickle', 50000)
reader = CorpusReader('ukwac_heads', 'wordfreq.txt')
reader.read()
```
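Conceptually, generating the frequency list amounts to sorting the saved counts and keeping the top n words. A sketch, assuming wordfreq.pickle holds a {word: count} mapping (the real internal format may differ):

```python
import pickle

with open('wordfreq.pickle', 'rb') as f:
    freqs = pickle.load(f)  # assumed {word: count} mapping

# Keep the 50k most frequent words, one per line, as a word filter file.
top_words = sorted(freqs, key=freqs.get, reverse=True)[:50000]
with open('wordfreq.txt', 'w') as f:
    f.write('\n'.join(top_words))
```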
If you are familiar with the ukWaC head searching algorithms (malt, malt_span, linear, failed), you can choose which algorithms' results to include. A typical use case is when you don't want to include failed NP heads.
USAGE:
```python
reader = CorpusReader('ukwac_heads', 'top50k.txt', ['malt', 'malt_span', 'linear'])
reader.read()
```
Parses the ukWaC corpus and stores the output in three .h5 files corresponding to train/valid/test sets with a 70/20/10 ratio. You can apply lemmatization, which will reduce your vocabulary size. All parsed data is stored as integers, e.g. the governor "go" is encoded as 3, role type "A0" as 1, the "NN" part-of-speech as 5, etc.
The data in the .h5 file is stored in the following format:
```
-1, 0, 0, 1, 1, 5, 2, 2, 1, 1, 1, 3, 2, 3, 3, -1, 1, 2, 1, 5, 2, 2, -1, ...
```
where -1 is the delimiter between governor sets.
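To illustrate the layout, here is a sketch of splitting such a flat integer stream back into governor sets on the -1 delimiter (made-up data):

```python
data = [-1, 0, 0, 1, 1, 5, 2, 2, 1, 1, 1, 3, 2, 3, 3,
        -1, 1, 2, 1, 5, 2, 2, -1]

govsets, current = [], []
for value in data:
    if value == -1:      # delimiter: close the current governor set
        if current:
            govsets.append(current)
        current = []
    else:
        current.append(value)
# govsets -> [[0, 0, 1, 1, 5, 2, 2, 1, 1, 1, 3, 2, 3, 3], [1, 2, 1, 5, 2, 2]]
```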
USAGE:
```python
reader = CorpusReader('ukwac_heads')
reader.create('ukwac.h5')  # requires ~10-12h
```
After the operation is finished you can read the parsed corpus data from the saved .h5 file. When reading from a .h5 file, the mode and delexicalize params are ignored: the data stored in the .h5 file is in "single" mode format and is already delexicalized. If you do not supply an .h5 name, the create() method only generates the word and POS-tag mappings (.pickles). Since dumping data into .h5 requires more processing time, generating only the mappings is faster.
If h5name is specified, the create() method dumps the extracted data into the .h5 file and applies the given compression ratio [0-9]; if not, only the .pickle dicts for delexicalization and frequency lists are generated.
Applies nltk lemmatization to governors and dependants. `source` and `text` attribute values are already lemmatized, since they come from the ukWaC corpus (though not 100%: due to ukWaC preprocessing inconsistencies some words may retain their original form).
USAGE:
```python
corpus = Corpus('ukwac_heads')
corpus.create('ukwac.h5', 0, True)  # h5name, compression, lemmatize
reader = CorpusReader('ukwac_heads')
reader.connect('ukwac.h5')
reader.read()  # mode and delexicalize params will be ignored
```
In order to read the "train", "valid" and "test" .h5 files you'll need to instantiate a new Corpus or CorpusReader class, connect the required file and call its read() method.
You can set the reading mode to either "set" or "single". In "set" mode the output will contain the governor and all its respective dependants within one sentence:
```
('sent id', 'governor', 'gov postag',
 'arg0', 'arg0 postag', 'role0', 'alg0',
 'arg1', 'arg1 postag', 'role1', 'alg1',
 ...
)
```
In "single" mode the output will contain a governor -> dependant pair and their semantic attributes:
```
('sent id', 'governor', 'gov postag', 'arg', 'arg postag', 'role', 'algorithm')
```
In other words, these are two different representations of the same data.
An .h5 file is created using "single" mode by default, and attempting to use the "set" read mode on such a file will fail. In other words, if you plan to read the .h5 file in "single" mode, do create('ukwac_single.h5'); if you plan to read it in "set" or "random_set" mode, do create('ukwac_set.h5', mode='set').
USAGE:
```python
corpus = Corpus('ukwac_heads')
corpus.create('ukwac.h5', compress=0, lemmatize=False, mode='set')
```
You can control the number of included governors and their dependants by applying a word filter before .h5 file creation, for example if you only want to include the top 10k or 50k most frequent words in the .h5 file.
USAGE:
```python
reader = CorpusReader('ukwac_heads', 'wordfilter.txt')
reader.create('ukwac.h5')  # only the words from wordfilter.txt are included
```
Similarly to direct parsing, you can control which NP head algorithms' results to include.
USAGE:
```python
reader = CorpusReader('ukwac_heads', 'wordfilter.txt', ['malt', 'malt_span'])
reader.create('ukwac.h5')  # only words from wordfilter.txt found by malt and malt_span are included
```
Direct reading lets you choose delexicalization options at the cost of slower reading times; .h5 reading achieves the fastest reading times, but no delexicalization options are available. Keep in mind that converting the corpus into .h5 format reduces its size from 18GB to 8-12GB. On top of that, you can use the compression parameter to reduce the corpus size even further, down to 4.3GB. Direct corpus parsing is 50% slower per file; on the other hand, conversion to .h5 requires ~10-12h, ~2.5GB RAM and ~12GB of free space (~4.3GB with compression=5). The fastest option is to parse the corpus and create an .h5 file containing delexicalized elements (int32). Reading from this file is very fast since the operation does not involve any intermediate preprocessing. Note that you can't apply a word filter during reading, only before .h5 creation.
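Reading .h5 is fast because the stored data is a flat int32 array that can be sliced in large chunks with no parsing step. A minimal h5py sketch (the dataset name 'data' is an assumption; inspect the file with f.keys() to find the real layout):

```python
import h5py

with h5py.File('ukwac.h5', 'r') as f:
    dataset = f['data']        # hypothetical dataset name
    chunk = dataset[:1000000]  # slice straight into memory, no preprocessing
    print(chunk.dtype)         # int32
```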
| type | time | cpu |
|---|---|---|
| direct reading | 4.5s | Intel Core i5-4210U 2700 MHz |
| direct reading (with lemmatization) | 10.5s | Intel Core i5-4210U 2700 MHz |
| creating .pickles | 3.8s | Intel Core i5-4210U 2700 MHz |
| creating .pickles (with lemmatization) | 7.3s | Intel Core i5-4210U 2700 MHz |
| conversion to .h5 | 8.3s | Intel Core i5-4210U 2700 MHz |
| conversion to .h5 (with lemmatization) | 12.5s | Intel Core i5-4210U 2700 MHz |
| reading .h5 (~4MB) | 2s | Intel Core i5-4210U 2700 MHz |
| reading .h5 (~1.4MB, compression=9) | 2s | Intel Core i5-4210U 2700 MHz |
| type | time | cpu |
|---|---|---|
| direct reading | 6-7h | AMD Opteron(tm) Processor 6380, 2500 MHz |
| direct reading | 3h | Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz |
| direct reading (with lemmatization) | 3.4h | Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz |
| direct reading (with lemmatization, extracting) | 4.45h | Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz |
| conversion to .h5 (with lemmatization) | 14-15h | AMD Opteron(tm) Processor 6380, 2500 MHz |
| generating .pickles | 6.4h | AMD Opteron(tm) Processor 6380, 2500 MHz |
| generating .pickles (with lemmatization) | 9.2h | AMD Opteron(tm) Processor 6380, 2500 MHz |
| conversion to .h5, generating .pickles | 11-12h | AMD Opteron(tm) Processor 6380, 2500 MHz |
| reading .h5 | 2h | AMD Opteron(tm) Processor 6380, 2500 MHz |
Relevant resources:
- "A large corpus automatically annotated with semantic role information"
- "An exploration of semantic features in an unsupervised thematic fit evaluation framework"