DBpedia Abstract Extraction step by step guide

Markus Freudenberg edited this page Jul 12, 2017 · 2 revisions

DEPRECATED!

TODO: REPLACE

Software Requirements

  • MySQL
  • PHP with the xml and apc extensions
  • Scala
  • Maven
  • MediaWiki
  • Web server (nginx seems to perform considerably better than Apache)

Steps

Download the DBpedia Extraction Framework

Please download or pull the extraction framework using git:

git clone git://github.com/dbpedia/extraction-framework.git

Download DBpedia dumps (if needed)

If you want to download the DBpedia dump files, proceed as follows:

cd dump
../clean-install-run download config=download.minimal.properties

There are already some configuration files in the extraction framework (e.g. download.minimal.properties). Customize the file according to your needs and run the above command to download the dumps you need. The download configuration file (i.e. download.minimal.properties) contains a property named base-dir which specifies the directory where the dump files will be stored. The DBpedia extraction framework uses the following structure when storing dump files:

/path_to_download_folder/yyyymmdd/[language_code]wiki-yyyymmdd-pages-articles.xml.bz2

NOTE: If you have already downloaded the above pages-articles dump manually (without using the DBpedia extraction framework), please skip this step. However, make sure that the above naming convention for the directory structure has been followed. If not, create the directory structure manually.
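For a manually downloaded dump, the expected layout can be recreated like this. The base directory, date and language code below are placeholders; adjust them to your setup:

```shell
# Sketch only: recreate the expected layout for a manually downloaded dump.
BASE_DIR=/tmp/dbpedia-dumps   # must match base-dir in download.minimal.properties
DUMP_DATE=20170701            # placeholder date
LANG_CODE=en                  # placeholder language code

mkdir -p "$BASE_DIR/$DUMP_DATE"

# Move your manually downloaded dump into place, e.g.:
#   mv ~/Downloads/${LANG_CODE}wiki-${DUMP_DATE}-pages-articles.xml.bz2 "$BASE_DIR/$DUMP_DATE/"
# (an empty placeholder file is created here for illustration)
touch "$BASE_DIR/$DUMP_DATE/${LANG_CODE}wiki-${DUMP_DATE}-pages-articles.xml.bz2"
```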

Install required software

You need to install MySQL, PHP, Apache and other software.

MySQL configuration

  • Step 1: Install MySQL.
  • Step 2: Open the my.cnf file (in the MySQL root directory if installed by hand, or in /etc/mysql/ if installed from the Ubuntu packages).
  • Step 3: Add these parameters to the [mysqld] section to get UTF-8 encoding by default:

character-set-server=utf8
skip-character-set-client-handshake

  • Step 4: Change max_allowed_packet=16M to max_allowed_packet=1G.
  • Step 5: Change key_buffer=16M to key_buffer=1G.
  • Step 6: Change query_cache_size=16M to query_cache_size=1G.

The next steps apply only if you installed MySQL by hand. If you installed MySQL from your Linux distribution's repositories, you can skip them.

  • Step 7: Set the socket parameter to $MYDIR/mysqld.sock.
  • Step 8: Set the datadir parameter to $MYDIR/data.
  • Step 9: Open your ~/.bashrc file and add: export MYDIR=/path/where/you/installed/mysql
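Putting steps 3 to 6 together (plus 7 and 8 for a hand-installed MySQL), the [mysqld] section of my.cnf would look roughly like this; the installation path is a placeholder:

```
[mysqld]
# UTF-8 by default (step 3)
character-set-server=utf8
skip-character-set-client-handshake

# Larger buffers for the import (steps 4-6)
max_allowed_packet=1G
key_buffer=1G
query_cache_size=1G

# Only for a hand-installed MySQL (steps 7-8)
socket=/path/where/you/installed/mysql/mysqld.sock
datadir=/path/where/you/installed/mysql/data
```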

Now you need to install PHP and Apache. The installation of these tools is out of scope for this guide; please refer to their documentation. (Note: php5-mysql is missing from the lamp-server^ metapackage on Ubuntu.)

You should also install php-xml and php-apc to avoid some errors and performance issues which are described later in this document.

NOTE: For some Linux/Unix distros php-apc might be named php-pecl-apc.

There is also a script that may be used for this setup. It has not been tested, but should work:

https://github.com/saxenap/install-php-apc-mysql-amazon-linux-centos/blob/master/php-apc-mysql-script.sh

Finally, download MediaWiki from http://www.mediawiki.org/wiki/Download. It is recommended to use the latest stable release. (Note: I have tried most of the 1.2x versions; I would recommend the MediaWiki 1.19.11 legacy LTS release since it seems to work best. Early 1.2x releases seem to work too, but they may require some changes.)

You can also download the latest release from Git (Note: the current Git version does not work, do not use it):

git clone https://gerrit.wikimedia.org/r/p/mediawiki/core.git

Trigger import to MySQL

In order to generate clean abstracts from Wikipedia articles one needs to render wiki templates as they would be rendered in the original Wikipedia instance. So in order for the DBpedia Abstract Extractor to work, a running MediaWiki instance with Wikipedia data in a MySQL database is necessary.

To import the data, you need to run the Scala import launcher.

Before importing, you have to adapt the settings for the import launcher in dump/pom.xml as below:

(Note: dump/pom.xml is located at extraction-framework/dump/pom.xml.)

<launcher>
    <id>import</id>
    <mainClass>org.dbpedia.extraction.dump.sql.Import</mainClass>
    <jvmArgs>
        <jvmArg>-server</jvmArg>
    </jvmArgs>
    <args>
        <arg>path_to_download_folder</arg>
        <arg>/path_to_wikimedia_parent_dir/mediawiki/maintenance/tables.sql</arg>
        <arg>jdbc:mysql://machine_name:mysql_port/?characterEncoding=UTF-8&amp;user=myuser&amp;password=mypass</arg>
        <arg>false</arg><!-- require-download-complete -->
        <arg>language-code</arg><!-- languages and article count ranges, comma-separated -->
    </args>
</launcher>

If you have downloaded the DBpedia dump file manually, set require-download-complete to false, as no marker file exists to indicate a successful download.

Now, to import the data into MySQL, run:

../clean-install-run import

NOTE:

If the import fails with ERROR 1283: Column 'si_title' cannot be part of FULLTEXT index, a collation must be specified for the searchindex table: change its line from ENGINE=MyISAM to ENGINE=MyISAM COLLATE='utf8_general_ci';. This change should be made in /path_to_wikimedia_parent_dir/mediawiki/maintenance/tables.sql.
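The collation change can be applied with sed (GNU sed assumed); this is demonstrated on a sample file here, since in the real tables.sql only the searchindex table's line should be changed:

```shell
# Demonstration on a sample file; in the real
# /path_to_wikimedia_parent_dir/mediawiki/maintenance/tables.sql,
# change only the line belonging to the searchindex table.
SAMPLE=/tmp/tables_sample.sql
echo ") ENGINE=MyISAM;" > "$SAMPLE"

# Append the collation to the engine clause (GNU sed syntax):
sed -i "s/ENGINE=MyISAM;/ENGINE=MyISAM COLLATE='utf8_general_ci';/" "$SAMPLE"

cat "$SAMPLE"   # -> ) ENGINE=MyISAM COLLATE='utf8_general_ci';
```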

Prepare MediaWiki - Configuration and Settings

Download Mediawiki & extensions

git clone https://gerrit.wikimedia.org/r/p/mediawiki/core.git mediawiki
cd mediawiki/extensions

git clone  https://gerrit.wikimedia.org/r/p/mediawiki/extensions/timeline.git
git clone  https://gerrit.wikimedia.org/r/p/mediawiki/extensions/CharInsert.git
git clone  https://gerrit.wikimedia.org/r/p/mediawiki/extensions/MobileFrontend.git
git clone  https://gerrit.wikimedia.org/r/p/mediawiki/extensions/CategoryTree.git
git clone  https://gerrit.wikimedia.org/r/mediawiki/extensions/Cite.git
git clone  https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Interwiki.git
git clone  https://gerrit.wikimedia.org/r/p/mediawiki/extensions/SyntaxHighlight_GeSHi.git
git clone  https://gerrit.wikimedia.org/r/p/mediawiki/php/luasandbox.git
git clone  https://gerrit.wikimedia.org/r/p/mediawiki/extensions/InputBox.git
git clone  https://gerrit.wikimedia.org/r/p/mediawiki/extensions/GeoData.git
git clone  https://gerrit.wikimedia.org/r/p/mediawiki/extensions/ExpandTemplates.git
git clone  https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Babel.git
git clone  https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Scribunto.git
git clone  https://gerrit.wikimedia.org/r/mediawiki/extensions/ParserFunctions.git
git clone  https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Poem.git
git clone  https://gerrit.wikimedia.org/r/p/mediawiki/extensions/TextExtracts.git
git clone  https://gerrit.wikimedia.org/r/p/mediawiki/extensions/ImageMap.git
git clone  https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Math.git
git clone  https://gerrit.wikimedia.org/r/p/mediawiki/extensions/wikihiero.git
git clone  https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Mantle.git

Set up MediaWiki

You need to adjust your LocalSettings.php according to https://github.com/dbpedia/extraction-framework/blob/master/dump/src/main/mediawiki/LocalSettings.php.

To make Lua faster, please read the Scribunto instructions: http://www.mediawiki.org/wiki/Extension:Scribunto

Configure your MediaWiki directory as a web directory by adding the following to the Apache httpd.conf (note that the <Directory> directive takes the filesystem path):

Alias /mediawiki /path_to_mediawiki_parent_dir/mediawiki
<Directory /path_to_mediawiki_parent_dir/mediawiki>
   Allow from all
</Directory>

Verify MediaWiki and PHP configurations

Now visit the following URL in your browser:

http://localhost/mediawiki/api.php?uselang=en

If you get some usage instructions in your browser, the MediaWiki configuration is correct and you can move on to the next step.

If you are not getting the usage information, resolve each error and re-check with the aforementioned URL until you get a valid web page.

Also check the Apache error log for further details on how to troubleshoot any errors that appear.

Below is a list of possible errors together with some solutions:

  • Class 'DOMDocument' not found in LocalisationCache.php

To solve this you need to install the php-xml module as specified in Install required software.

  • Set $wgShowExceptionDetails = true; in LocalSettings.php

Change LocalSettings.php as suggested; this makes MediaWiki output full debugging information.

  • CACHE_ACCEL requested but no suitable object cache is present. You may want to install APC.

Example backtrace:

Backtrace:
#0 [internal function]: ObjectCache::newAccelerator(Array)
#1 /mnt/ebs/framework/media_wiki/wikimedia/includes/objectcache/ObjectCache.php(85): call_user_func('ObjectCache::ne...', Array)
#2 /mnt/ebs/framework/media_wiki/wikimedia/includes/objectcache/ObjectCache.php(72): ObjectCache::newFromParams(Array)
#3 /mnt/ebs/framework/media_wiki/wikimedia/includes/objectcache/ObjectCache.php(44): ObjectCache::newFromId(3)
#4 /mnt/ebs/framework/media_wiki/wikimedia/includes/GlobalFunctions.php(3780): ObjectCache::getInstance(3)
#5 /mnt/ebs/framework/media_wiki/wikimedia/includes/Setup.php(464): wfGetMainCache()
#6 /mnt/ebs/framework/media_wiki/wikimedia/includes/WebStart.php(157): require_once('/mnt/ebs/framew...')
#7 /mnt/ebs/framework/media_wiki/wikimedia/api.php(47): require('/mnt/ebs/framew...')
#8 {main}

This means you have not installed php-apc. APC is an accelerator (opcode and object cache) that speeds up the process by around 4-5 times.

If you really do not want to use php-apc, set $wgMainCacheType = CACHE_ANYTHING (not recommended).

Trigger Abstract export

Execute the following command after making the appropriate changes to the extraction.abstracts.properties configuration file:

../clean-install-run extraction extraction.abstracts.properties
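For reference, a minimal extraction.abstracts.properties might contain entries like the following; the key names and values here are illustrative assumptions, so verify them against the sample file shipped with the framework:

```
# Illustrative sketch - verify the keys against the
# extraction.abstracts.properties shipped with the framework.
base-dir=/path_to_download_folder
# comma-separated list of languages to extract
languages=en
```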

NGINX Configuration

Install nginx server

# For debian based systems run
sudo apt-get install nginx nginx-extras lua-nginx-memcached php5-fpm

If you use the LuaSandbox option for the Scribunto MediaWiki extension (recommended), keep in mind that the nginx/php-fpm php.ini file is located at /etc/php5/fpm/php.ini.

Add the following configuration in /etc/nginx/sites-enabled/. Change the port to e.g. 81 if another server is already running on 80, and place the MediaWiki installation in a subfolder.

server {
        listen 81 default_server;
        listen [::]:81 default_server ipv6only=on;

        root /var/www/abstracts;

        index index.html index.htm index.php;

        # Make site accessible from http://localhost/
        server_name localhost;
        client_max_body_size 5m;
        client_body_timeout 60;
 
        location / {
                try_files $uri $uri/ @rewrite;
        }
 
        location @rewrite {
                rewrite ^/(.*)$ /index.php?title=$1&$args;
        }
  
        location ~ \.php$ {
                include fastcgi_params;
                fastcgi_index index.php;
                try_files $uri =404;
                fastcgi_split_path_info ^(.+\.php)(/.+)$;
                fastcgi_pass unix:/var/run/php5-fpm.sock;
                fastcgi_buffers 32 16k;
        }
}

Notes by Christopher

Here are a few more notes I took when I ran the abstract extraction in summer 2012.

If possible, use the MySQL, PHP and MediaWiki versions shown at Special:Version. This is probably most important for MediaWiki, not so much for MySQL and PHP.

The default MySQL installation of Ubuntu didn't work for me. MySQL bug 34981 caused problems. I don't remember exactly what else went wrong. In the end, I just downloaded and unzipped the appropriate MySQL version and removed the Ubuntu version because it somehow interfered with my installation.

I also wrote a little script that uses this MySQL installation to create the necessary data directories and start/stop the server. This script is not well documented and not really finished. :-(

Here's the rest of my notes from summer 2012. Version numbers and a few other things may have changed by now.

Clone production version of MediaWiki in your projects folder:

  • mkdir mediawiki
  • cd mediawiki/
  • git clone https://gerrit.wikimedia.org/r/p/mediawiki/core.git
  • cd core/
  • git branch -r # list branches / tags
  • git checkout origin/wmf/1.20wmf4 # current tag as of 2012-06-13
  • git submodule update --init # gets all the extensions

Install MySQL, create tables:

  • DO NOT install the Ubuntu MySQL package. If it is installed, remove it. (Note: The bug has been fixed in newer versions of Ubuntu and Debian; you no longer need to follow these steps and can just use the version from the packages.)
  • Install MySQL 5.1.63 or a similar version: Download and unzip the tarball.
  • ./mysql.sh install
  • ./mysql.sh run /home/release/mysql test <../mediawiki/core/maintenance/tables.sql

  • Install the Ubuntu packages php, php-cli, php-apc, php-intl.
  • Add a symlink in /var/www: ln -s …./mediawiki/core mediawiki
  • Install the Ubuntu package php-mysql. This also installs mysql-common, but that doesn’t seem to interfere with our local MySQL (see above).

Install Ubuntu package ploticus (for Timeline extension)

Add APC setting to php.ini: apc.stat = 0

To make extraction faster, we should try HipHop, the PHP compiler / virtual machine created by Facebook. I tried to follow the instructions at https://github.com/facebook/hiphop-php/wiki/Building-and-installing-on-ubuntu-10.10 but didn't succeed.
