Documentation for v0.4.0
nreimers committed Oct 30, 2015
1 parent 17aaa44 commit c6ec4fe
Showing 3 changed files with 101 additions and 26 deletions.
2 changes: 1 addition & 1 deletion code/pom.xml
@@ -208,7 +208,7 @@
<artifactId>maven-shade-plugin</artifactId>
<configuration>
<!-- Prevent huge shaded artifacts from being deployed to Artifactory -->
<outputFile>${project.build.directory}/${artifactId}-${version}-standalone.jar</outputFile>
<outputFile>${project.build.directory}/wrapper-${version}.jar</outputFile>
</configuration>
</plugin>
</plugins>
55 changes: 47 additions & 8 deletions doc/user-guide.adoc
@@ -12,7 +12,7 @@
// See the License for the specific language governing permissions and
// limitations under the License.
:version: 0.3.6
:version: 0.4.0

= DARIAH-DKPro-Wrapper v{version}
:Author: DARIAH2 - Cluster 5, Use Case 1 Team
@@ -37,13 +37,13 @@ The pipeline requires *Java 1.8* or higher. You can download Java from

After downloading and unzipping the files, execute in your command line the following code:
****
+java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-{version}-standalone.jar -input file.txt -output folder+
+java -Xmx4g -jar wrapper-{version}.jar -input file.txt -output folder+
****

You can change the language by specifying the language parameter for the pipeline. The current version of the DARIAH-DKPro-Wrapper supports the following languages: German (de), English (en), Spanish (es), and French (fr). To run the pipeline for English, execute the following command:

****
+java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-{version}-standalone.jar -language en -input file.txt -output folder+
+java -Xmx4g -jar wrapper-{version}.jar -language en -input file.txt -output folder+
****
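
If you script the call, it can help to check the language code before starting the JVM. The following is a minimal sketch and not part of the DARIAH-DKPro-Wrapper: `is_supported_language` is a hypothetical helper that only knows the four codes listed above, and the command is printed rather than executed.

```shell
# Hypothetical helper (not shipped with the wrapper): succeeds only for the
# four language codes named in this guide.
is_supported_language() {
  case "$1" in
    de|en|es|fr) return 0 ;;
    *) return 1 ;;
  esac
}

language="en"
if is_supported_language "$language"; then
  # Print the command instead of running it, so the sketch works without the jar.
  echo "java -Xmx4g -jar wrapper-0.4.0.jar -language $language -input file.txt -output folder"
else
  echo "Unsupported language code: $language" >&2
fi
```

Whether the wrapper itself rejects unknown codes is not covered by this guide, so the check above is only a convenience for your own scripts.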

== Run the full pipeline
@@ -52,32 +52,71 @@ By default, the pipeline runs in a light mode, the memory and time intensive com
If you would like to use them, enable them in `default.properties`, or create a new `.properties` file and pass its path via the `-config` parameter.


== File Reader

You can process either a single file or all files inside a directory. Patterns can be used to select the specific files that should be processed.

=== XML Reader

The DARIAH-DKPro-Wrapper implements two base readers, one text reader and one XML-file reader. You can specify the reader that should be used with the `-reader` parameter. By default, the text reader is used. To use the XML reader, run the pipeline in the following way:

****
+java -Xmx4g -jar wrapper-{version}.jar -language en -reader xml -input file.xml -output folder+
****

The XML reader skips XML tags and processes only the text inside them. The XPath to each tag is preserved and stored in the *SectionId* column of the output format.

=== Reading Directories

You can also pass a directory instead of a file to the *-input* argument. If you run the pipeline in the following way:
****
+java -Xmx4g -jar wrapper-{version}.jar -language en -input folder/With/Files/ -output folder+
****

the pipeline will process all files with a _.txt_ extension when the text reader is used. With the XML reader, it will process all files with a _.xml_ extension.

You can also specify patterns to read only certain files or files with a certain extension. For example, to read only _.xmi_ files with the XML reader, start the pipeline in the following way:
****
+java -Xmx4g -jar wrapper-{version}.jar -language en -reader xml -input "folder/With/Files/*.xmi" -output folder+
****

*Note:* If you use patterns (i.e. paths containing an *), you must enclose them in quotes to prevent shell globbing.

To read all files in all subfolders, you can use a pattern like this:
****
+java -Xmx4g -jar wrapper-{version}.jar -language en -input "folder/With/Subfolders/**/*.txt" -output folder+
****

This will read in all _.txt_ files in all subfolders. Note that the subfolder path will not be maintained in the output folder.
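
The effect of the quotes can be observed without the pipeline. The sketch below assumes a POSIX shell; the temporary directory and the `count_args` helper are made up for illustration. Unquoted, the shell replaces the pattern with the matching file names before the program starts. Quoted, the pattern arrives as one unchanged argument.

```shell
# Throwaway directory with two matching files (names are illustrative).
demo_dir=$(mktemp -d)
touch "$demo_dir/a.txt" "$demo_dir/b.txt"

# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

# Unquoted: the shell expands the pattern into one argument per matching file.
unquoted_count=$(cd "$demo_dir" && count_args *.txt)
# Quoted: the pattern is passed through unchanged as a single argument.
quoted_count=$(cd "$demo_dir" && count_args "*.txt")

echo "unquoted: $unquoted_count argument(s), quoted: $quoted_count argument(s)"
rm -rf "$demo_dir"
```

With the unquoted form, the `-input` argument would receive only the first expanded file name and the remaining names would follow as stray arguments, which is why the quoted form is required.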



== Write your own config files

The pipeline can be configured via properties files that are stored in the `configs` folder. In this folder you find `default.properties`, the most basic configuration file. For the different supported languages, you can find further properties files, for example `default_de.properties` for German, `default_en.properties` for English and so on.


If you would like to write your own config file, just create your own `.properties` file. You can run the pipeline with your `.properties` file by setting the `-config` argument:
****
+java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-{version}-standalone.jar -config /path/to/my/config/myconfigfile.properties -language en -input file.txt -output folder+
+java -Xmx4g -jar wrapper-{version}.jar -config /path/to/my/config/myconfigfile.properties -language en -input file.txt -output folder+
****

In case you store your `myconfigfile.properties` in the `configs` folder, you can run the pipeline via:
****
+java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-{version}-standalone.jar -config myconfigfile.properties -language en -input file.txt -output folder+
+java -Xmx4g -jar wrapper-{version}.jar -config myconfigfile.properties -language en -input file.txt -output folder+
****

You can split your configuration into several files and pass them all to the pipeline by separating the paths with commas or semicolons. The pipeline examines all passed config files and derives the final configuration from them. The config file passed as the last argument has the highest priority, i.e. it can overwrite the values of all previous config files:
****
+java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-{version}-standalone.jar -config myfile1.properties,myconfig2.properties,myfile3.properties -language en -input file.txt -output folder+
+java -Xmx4g -jar wrapper-{version}.jar -config myfile1.properties,myconfig2.properties,myfile3.properties -language en -input file.txt -output folder+
****

*Note:* The system always uses the default.properties and default_[langcode].properties as basic configuration files. All further config files are added on top of these files.
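
The priority rule can be illustrated with a small shell sketch. It is not the wrapper's implementation, only a model of "the last file wins": `merge_properties` is a hypothetical helper that handles simple `key = value` lines (no line continuations or escapes), and `exampleKey` is a made-up property.

```shell
# Hypothetical helper: merges simple "key = value" files; later files override
# earlier ones. Comment lines, blank lines, and lines without "=" are skipped.
merge_properties() {
  awk '
    /^[ \t]*#/ { next }
    /^[ \t]*$/ { next }
    index($0, "=") == 0 { next }
    {
      pos = index($0, "=")
      key = substr($0, 1, pos - 1)
      value = substr($0, pos + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", key)
      gsub(/^[ \t]+|[ \t]+$/, "", value)
      if (!(key in values)) order[++n] = key
      values[key] = value
    }
    END { for (i = 1; i <= n; i++) print order[i] " = " values[order[i]] }
  ' "$@"
}

config_dir=$(mktemp -d)
printf 'useLemmatizer = true\nexampleKey = base\n' > "$config_dir/myfile1.properties"
printf 'useLemmatizer = false\n' > "$config_dir/myconfig2.properties"

# myconfig2.properties comes last, so its value for useLemmatizer wins.
merged=$(merge_properties "$config_dir/myfile1.properties" "$config_dir/myconfig2.properties")
printf '%s\n' "$merged"
rm -rf "$config_dir"
```

Keys that appear only in an earlier file, such as `exampleKey` here, survive the merge unchanged. This mirrors the description above that further config files are added on top of the previous ones.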


If you want to use the _full_ version and also change the POS tagger, you can run the pipeline in the following way:
****
+java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-{version}-standalone.jar -config myFullVersion.properties,myPOSTagger.properties -language en -input file.txt -output folder+
+java -Xmx4g -jar wrapper-{version}.jar -config myFullVersion.properties,myPOSTagger.properties -language en -input file.txt -output folder+
****

In `myPOSTagger.properties`, you only add the configuration for the different POS tagger.
@@ -135,7 +174,7 @@ useLemmatizer = false

Change the paths for the parameters _executablePath_ and _modelLocation_ to the correct paths on your machine. You can then use Treetagger in your pipeline using the `-config` argument:
****
+java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-{version}-standalone.jar -config treetagger-example.properties -language de -input file.txt -output folder+
+java -Xmx4g -jar wrapper-{version}.jar -config treetagger-example.properties -language de -input file.txt -output folder+
****

Check the output of the pipeline to verify that Treetagger is used. The output of your pipeline should look something like this:
70 changes: 53 additions & 17 deletions doc/user-guide.html
@@ -4,7 +4,7 @@
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
<meta name="generator" content="AsciiDoc 8.6.6" />
<title>DARIAH-DKPro-Wrapper v0.3.6</title>
<title>DARIAH-DKPro-Wrapper v0.4.0</title>
<style type="text/css">
/* Shared CSS for AsciiDoc xhtml11 and html5 backends */

@@ -735,7 +735,7 @@
</head>
<body class="article">
<div id="header">
<h1>DARIAH-DKPro-Wrapper v0.3.6</h1>
<h1>DARIAH-DKPro-Wrapper v0.4.0</h1>
<span id="author">DARIAH2 - Cluster 5, Use Case 1 Team</span><br />
<div id="toc">
<div id="toctitle">User Guide</div>
@@ -745,7 +745,7 @@ <h1>DARIAH-DKPro-Wrapper v0.3.6</h1>
<div id="content">
<div id="preamble">
<div class="sectionbody">
<div class="paragraph"><p>This is a short user guide for the current version v0.3.6 of the DARIAH-DKPro-Wrapper.</p></div>
<div class="paragraph"><p>This is a short user guide for the current version v0.4.0 of the DARIAH-DKPro-Wrapper.</p></div>
</div>
</div>
<div class="sect1">
@@ -779,12 +779,12 @@ <h2 id="_running_the_pipeline">2. Running the pipeline</h2>
<div class="paragraph"><p>After downloading and unzipping the files, execute in your command line the following code:</p></div>
<div class="sidebarblock">
<div class="content">
<div class="paragraph"><p><tt>java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-0.3.6-standalone.jar -input file.txt -output folder</tt></p></div>
<div class="paragraph"><p><tt>java -Xmx4g -jar wrapper-0.4.0.jar -input file.txt -output folder</tt></p></div>
</div></div>
<div class="paragraph"><p>You can change the language by specifying the language parameter for the pipeline. The current version of the DARIAH-DKPro-Wrapper supports the following languages: German (de), English (en), Spanish (es), and French (fr). To run the pipeline for English, execute the following command:</p></div>
<div class="sidebarblock">
<div class="content">
<div class="paragraph"><p><tt>java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-0.3.6-standalone.jar -language en -input file.txt -output folder</tt></p></div>
<div class="paragraph"><p><tt>java -Xmx4g -jar wrapper-0.4.0.jar -language en -input file.txt -output folder</tt></p></div>
</div></div>
</div>
</div>
@@ -796,34 +796,70 @@ <h2 id="_run_the_full_pipeline">3. Run the full pipeline</h2>
</div>
</div>
<div class="sect1">
<h2 id="_write_your_own_config_files">4. Write your own config files</h2>
<h2 id="_file_reader">4. File Reader</h2>
<div class="sectionbody">
<div class="paragraph"><p>You can process either a single file or all files inside a directory. Patterns can be used to select the specific files that should be processed.</p></div>
<div class="sect2">
<h3 id="_xml_reader">4.1. XML Reader</h3>
<div class="paragraph"><p>The DARIAH-DKPro-Wrapper implements two base readers, one text reader and one XML-file reader. You can specify the reader that should be used with the <tt>-reader</tt> parameter. By default, the text reader is used. To use the XML reader, run the pipeline in the following way:</p></div>
<div class="sidebarblock">
<div class="content">
<div class="paragraph"><p><tt>java -Xmx4g -jar wrapper-0.4.0.jar -language en -reader xml -input file.xml -output folder</tt></p></div>
</div></div>
<div class="paragraph"><p>The XML reader skips XML tags and processes only the text inside them. The XPath to each tag is preserved and stored in the <strong>SectionId</strong> column of the output format.</p></div>
</div>
<div class="sect2">
<h3 id="_reading_directories">4.2. Reading Directories</h3>
<div class="paragraph"><p>You can also pass a directory instead of a file to the <strong>-input</strong> argument. If you run the pipeline in the following way:</p></div>
<div class="sidebarblock">
<div class="content">
<div class="paragraph"><p><tt>java -Xmx4g -jar wrapper-0.4.0.jar -language en -input folder/With/Files/ -output folder</tt></p></div>
</div></div>
<div class="paragraph"><p>the pipeline will process all files with a <em>.txt</em> extension when the text reader is used. With the XML reader, it will process all files with a <em>.xml</em> extension.</p></div>
<div class="paragraph"><p>You can also specify patterns to read only certain files or files with a certain extension. For example, to read only <em>.xmi</em> files with the XML reader, start the pipeline in the following way:</p></div>
<div class="sidebarblock">
<div class="content">
<div class="paragraph"><p><tt>java -Xmx4g -jar wrapper-0.4.0.jar -language en -reader xml -input "folder/With/Files/*.xmi" -output folder</tt></p></div>
</div></div>
<div class="paragraph"><p><strong>Note:</strong> If you use patterns (i.e. paths containing an *), you must enclose them in quotes to prevent shell globbing.</p></div>
<div class="paragraph"><p>To read all files in all subfolders, you can use a pattern like this:</p></div>
<div class="sidebarblock">
<div class="content">
<div class="paragraph"><p><tt>java -Xmx4g -jar wrapper-0.4.0.jar -language en -input "folder/With/Subfolders/**/*.txt" -output folder</tt></p></div>
</div></div>
<div class="paragraph"><p>This will read in all <em>.txt</em> files in all subfolders. Note that the subfolder path will not be maintained in the output folder.</p></div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_write_your_own_config_files">5. Write your own config files</h2>
<div class="sectionbody">
<div class="paragraph"><p>The pipeline can be configured via properties files that are stored in the <tt>configs</tt> folder. In this folder you find <tt>default.properties</tt>, the most basic configuration file. For the different supported languages, you can find further properties files, for example <tt>default_de.properties</tt> for German, <tt>default_en.properties</tt> for English and so on.</p></div>
<div class="paragraph"><p>If you would like to write your own config file, just create your own <tt>.properties</tt> file. You can run the pipeline with your <tt>.properties</tt> file by setting the <tt>-config</tt> argument:</p></div>
<div class="sidebarblock">
<div class="content">
<div class="paragraph"><p><tt>java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-0.3.6-standalone.jar -config /path/to/my/config/myconfigfile.properties -language en -input file.txt -output folder</tt></p></div>
<div class="paragraph"><p><tt>java -Xmx4g -jar wrapper-0.4.0.jar -config /path/to/my/config/myconfigfile.properties -language en -input file.txt -output folder</tt></p></div>
</div></div>
<div class="paragraph"><p>In case you store your <tt>myconfigfile.properties</tt> in the <tt>configs</tt> folder, you can run the pipeline via:</p></div>
<div class="sidebarblock">
<div class="content">
<div class="paragraph"><p><tt>java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-0.3.6-standalone.jar -config myconfigfile.properties -language en -input file.txt -output folder</tt></p></div>
<div class="paragraph"><p><tt>java -Xmx4g -jar wrapper-0.4.0.jar -config myconfigfile.properties -language en -input file.txt -output folder</tt></p></div>
</div></div>
<div class="paragraph"><p>You can split your configuration into several files and pass them all to the pipeline by separating the paths with commas or semicolons. The pipeline examines all passed config files and derives the final configuration from them. The config file passed as the last argument has the highest priority, i.e. it can overwrite the values of all previous config files:</p></div>
<div class="sidebarblock">
<div class="content">
<div class="paragraph"><p><tt>java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-0.3.6-standalone.jar -config myfile1.properties,myconfig2.properties,myfile3.properties -language en -input file.txt -output folder</tt></p></div>
<div class="paragraph"><p><tt>java -Xmx4g -jar wrapper-0.4.0.jar -config myfile1.properties,myconfig2.properties,myfile3.properties -language en -input file.txt -output folder</tt></p></div>
</div></div>
<div class="paragraph"><p><strong>Note:</strong> The system always uses the default.properties and default_[langcode].properties as basic configuration files. All further config files are added on top of these files.</p></div>
<div class="paragraph"><p>If you want to use the <em>full</em> version and also change the POS tagger, you can run the pipeline in the following way:</p></div>
<div class="sidebarblock">
<div class="content">
<div class="paragraph"><p><tt>java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-0.3.6-standalone.jar -config myFullVersion.properties,myPOSTagger.properties -language en -input file.txt -output folder</tt></p></div>
<div class="paragraph"><p><tt>java -Xmx4g -jar wrapper-0.4.0.jar -config myFullVersion.properties,myPOSTagger.properties -language en -input file.txt -output folder</tt></p></div>
</div></div>
<div class="paragraph"><p>In <tt>myPOSTagger.properties</tt>, you only add the configuration for the different POS tagger.</p></div>
<div class="paragraph"><p><strong>Note:</strong> The properties-files must use the ISO-8859-1 encoding. If you like to include UTF-8 characters, you must encode them using \u[HEXCode].</p></div>
<div class="sect2">
<h3 id="_understanding_the_argument_parameter">4.1. Understanding the Argument Parameter</h3>
<h3 id="_understanding_the_argument_parameter">5.1. Understanding the Argument Parameter</h3>
<div class="paragraph"><p>Most components accept arguments that specify, for example, the model that should be used. Arguments are passed to the pipeline in a 3-tuple format. In <tt>default.properties</tt> you can find the following line:</p></div>
<div class="listingblock">
<div class="content">
@@ -834,11 +870,11 @@ <h3 id="_understanding_the_argument_parameter">4.1. Understanding the Argument P
</div>
</div>
<div class="sect1">
<h2 id="_using_treetagger">5. Using Treetagger</h2>
<h2 id="_using_treetagger">6. Using Treetagger</h2>
<div class="sectionbody">
<div class="paragraph"><p>Due to copyright issues, TreeTagger cannot be accessed directly from the DKPro repository. Instead, you first have to download and install TreeTagger to be able to use it with DKPro.</p></div>
<div class="sect2">
<h3 id="_treetagger_installation_for_linux">5.1. TreeTagger Installation for Linux</h3>
<h3 id="_treetagger_installation_for_linux">6.1. TreeTagger Installation for Linux</h3>
<div class="ulist"><ul>
<li>
<p>
@@ -882,7 +918,7 @@ <h3 id="_treetagger_installation_for_linux">5.1. TreeTagger Installation for Lin
</ul></div>
</div>
<div class="sect2">
<h3 id="_treetagger_installation_for_windows_7">5.2. TreeTagger Installation for Windows 7</h3>
<h3 id="_treetagger_installation_for_windows_7">6.2. TreeTagger Installation for Windows 7</h3>
<div class="ulist"><ul>
<li>
<p>
@@ -936,7 +972,7 @@ <h3 id="_treetagger_installation_for_windows_7">5.2. TreeTagger Installation for
</ul></div>
</div>
<div class="sect2">
<h3 id="_configuration_of_the_pipeline">5.3. Configuration of the pipeline</h3>
<h3 id="_configuration_of_the_pipeline">6.3. Configuration of the pipeline</h3>
<div class="paragraph"><p>After downloading the correct executable and correct model, we must configure our pipeline in order to be able to use Treetagger. You can find an example configuration in the <em>configs</em> folder <em>treetagger-example.properties</em>:</p></div>
<div class="listingblock">
<div class="content">
@@ -951,7 +987,7 @@ <h3 id="_configuration_of_the_pipeline">5.3. Configuration of the pipeline</h3>
<div class="paragraph"><p>Change the paths for the parameters <em>executablePath</em> and <em>modelLocation</em> to the correct paths on your machine. You can then use Treetagger in your pipeline using the <tt>-config</tt> argument:</p></div>
<div class="sidebarblock">
<div class="content">
<div class="paragraph"><p><tt>java -Xmx4g -jar de.tudarmstadt.ukp.dariah.pipeline-0.3.6-standalone.jar -config treetagger-example.properties -language de -input file.txt -output folder</tt></p></div>
<div class="paragraph"><p><tt>java -Xmx4g -jar wrapper-0.4.0.jar -config treetagger-example.properties -language de -input file.txt -output folder</tt></p></div>
</div></div>
<div class="paragraph"><p>Check the output of the pipeline to verify that Treetagger is used. The output of your pipeline should look something like this:</p></div>
<div class="listingblock">
@@ -967,7 +1003,7 @@ <h3 id="_configuration_of_the_pipeline">5.3. Configuration of the pipeline</h3>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2015-10-15 11:49:13 CEST
Last updated 2015-10-30 12:38:15 CET
</div>
</div>
</body>
