cristinae/BabelWE

BabelWE

Collection of utilities to work with BabelNet synsets

Currently implemented languages:

A complete pipeline is available for Arabic, German, English, French, Spanish and Turkish (no lemmatisation is done for Turkish). For Italian, Romanian and Dutch, the input is expected to be already lemmatised with TreeTagger.


Set-up and installation

  1. Download and install the BabelNet API and its dependencies
    API download: http://babelnet.org/data/4.0/BabelNet-API-4.0.1.zip
    unzip BabelNet-API-4.0.1.zip
    mvn install:install-file -Dfile=lib/lcl-jlt-2.4.jar -DgroupId=it.uniroma1.lcl.jlt -DartifactId=lcl-jlt -Dversion=2.4 -Dpackaging=jar
    mvn install:install-file -Dfile=lib/babelscape-data-commons-1.0.jar -DgroupId=com.babelscape -DartifactId=babelscape-data-commons -Dversion=1.0 -Dpackaging=jar
    unzip -p babelnet-api-4.0.1.jar META-INF/maven/it.uniroma1.lcl.babelnet/babelnet-api/pom.xml | grep -vP '<(scope|systemPath)>' >babelnet-api-4.0.1.pom
    (grep -P requires GNU grep; on macOS, consider using Homebrew's ggrep)
    mvn install:install-file -Dfile=babelnet-api-4.0.1.jar -DpomFile=babelnet-api-4.0.1.pom

  2. Download BabelNet indices and make the API aware of them
    Indices download: http://babelnet.org/login
    tar xjvf babelnet-4.0.1-index.tar.bz2

  • In ./BabelNet-API-4.0.1/config/babelnet.var.properties include the path to the index:
    babelnet.dir=/home/usr/BabelNet-4.0.1
  • In ./BabelNet-API-4.0.1/config/jlt.var.properties include the path to WordNet:
    jlt.wordnetPrefix=/usr/local/share/wordnet
  • Move the ./BabelNet-API-4.0.1/config folder to your ${basedir}
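
The two property edits above can also be scripted. The following is a minimal sketch, not part of BabelWE or the BabelNet API: `set_property` is a hypothetical helper, and it assumes the properties files use plain `key=value` lines with no spaces around the `=`.

```shell
#!/bin/sh
# set_property FILE KEY VALUE
# Hypothetical helper: replaces an existing "KEY=..." line in FILE,
# or appends one if the key is absent. Assumes "key=value" lines
# without spaces around "=".
set_property() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    if [ -f "$file" ]; then
        # Keep every line that does not start with the literal "KEY=".
        # index() does a plain string match, so dots in the key are not
        # treated as regex wildcards.
        awk -v k="$key=" 'index($0, k) != 1' "$file" > "$tmp"
    fi
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    # Write back through the original file to keep its permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example calls (paths taken from the steps above; adjust to your system):
# set_property ./BabelNet-API-4.0.1/config/babelnet.var.properties babelnet.dir /home/usr/BabelNet-4.0.1
# set_property ./BabelNet-API-4.0.1/config/jlt.var.properties jlt.wordnetPrefix /usr/local/share/wordnet
```

Editing the files by hand works just as well; the helper is only convenient when the set-up is repeated on several machines.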

If you need to annotate Arabic corpora:

  1. Download and install the MADAMIRA jar (replace {$PATH} below with the directory where the release was unpacked)
    License for downloading
    mvn install:install-file -Dfile={$PATH}/MADAMIRA-release-20160516-2.1/MADAMIRA-release-20160516-2.1.jar -DgroupId=edu.columbia.ccls.madamira -DartifactId=MADAMIRA-release -Dversion=20160516-2.1 -Dpackaging=jar

Finally:

  1. Download and install this repository
    git clone https://github.com/cristinae/BabelWE.git
    cd BabelWE
    mvn clean dependency:copy-dependencies package
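
The installation steps above call several command-line tools (git, mvn, unzip, tar, grep). As a minimal sketch, one could check that they are on the PATH before starting; `require_tools` is a hypothetical helper and is not shipped with this repository.

```shell
#!/bin/sh
# require_tools TOOL...
# Hypothetical helper: prints each tool that cannot be found on the PATH
# and returns non-zero if at least one is missing.
require_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf 'not found on PATH: %s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example call with the tools used in this README:
# require_tools git mvn unzip tar grep
```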

External resources

  1. Download the IXA pipes for tokenisation and lemmatisation of English, Spanish, French and German. They are run as external executables, so no installation is needed.
    Download page
    Include their paths in the configuration file babelWE.ini

  2. Use the Moses tokeniser included in the ./scripts folder
    Include its path in the configuration file babelWE.ini

### External software and models
# IXA pipe
ixaTok=/fullPath/ixa/ixa-pipe-tok-1.8.5-exec.jar
ixaLem=/fullPath/ixa/ixa-pipe-pos-1.5.1-exec.jar
posEs=/fullPath/ixa/morph-models-1.5.0/es/es-pos-perceptron-autodict01-ancora-2.0.bin
lemEs=/fullPath/ixa/morph-models-1.5.0/es/es-lemma-perceptron-ancora-2.0.bin
posEn=/fullPath/ixa/morph-models-1.5.0/en/en-pos-perceptron-autodict01-conll09.bin
lemEn=/fullPath/ixa/morph-models-1.5.0/en/en-lemma-perceptron-conll09.bin

# Moses
mosesTok=/fullPath/moses/tokenizerNO2html.perl

  3. For Italian, Romanian and Dutch, the input is expected to be already lemmatised with TreeTagger; a TreeTagger lemmatisation pipeline is not included yet.
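
A wrong path in babelWE.ini is easy to overlook, so it can help to verify the entries before running the pipeline. The sketch below is a hypothetical helper, not part of BabelWE. It assumes that every `key=value` entry in the file holds a filesystem path, as in the excerpt above, and that comment lines start with `#` in the first column; entries that are not paths would be reported as missing.

```shell
#!/bin/sh
# check_ini_paths FILE
# Hypothetical helper: prints every "key=path" entry in FILE whose path
# does not exist and returns non-zero if any path is missing.
# Blank lines, "#" comments and "[section]" headers are skipped.
check_ini_paths() {
    ini=$1
    missing=0
    # "|| [ -n "$key" ]" also handles a last line without a trailing newline.
    while IFS='=' read -r key path || [ -n "$key" ]; do
        case $key in
            ''|'#'*|'['*) continue ;;
        esac
        if [ ! -e "$path" ]; then
            printf 'missing: %s -> %s\n' "$key" "$path"
            missing=1
        fi
    done < "$ini"
    return "$missing"
}

# Example call:
# check_ini_paths babelWE.ini
```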
