<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><head><title>Sieve</title>
<link rel="StyleSheet" href="stylesheets/style.css" type="text/css" media="screen"/>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div id="wbsg_navbar">
<div id="wbsg_navbar_projects">
<a href="http://dbpedia.org" title="DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web" >DBpedia</a>
<a href="http://spotlight.dbpedia.org/" title="DBpedia Spotlight is a tool for annotating DBpedia entities in text.">DBpedia Spotlight</a>
<a href="http://d2rq.org/d2r-server" title="D2R Server is a tool for publishing the content of relational databases on the Semantic Web" >D2R Server</a>
<a href="http://wifo5-03.informatik.uni-mannheim.de/bizer/r2r/" title="R2R Framework – Translating RDF data from the Web to a target vocabulary" >R2R</a>
<a href="http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/" title="The Silk framework is a tool for discovering relationships between data items within different Linked Data sources" >Silk</a>
<a href="http://sieve.wbsg.de/" title="Sieve is a tool for assessing data quality and performing data fusion." class="wbsg_navbar_active_project">Sieve</a>
<a href="http://ldif.wbsg.de/" title="LDIF – Linked Data Integration Framework translates heterogeneous Linked Data from the Web into a clean, local target representation while keeping track of data provenance" >LDIF</a>
<a href="http://wifo5-03.informatik.uni-mannheim.de/bizer/ng4j/" title="The Named Graphs API for Jena (NG4J) is an extension to the Jena Semantic Web framework for parsing, manipulating and serializing sets of Named Graphs" >NG4J</a>
<a href="http://mes.github.com/marbles/" title="Marbles is a server-side application that formats Semantic Web content for XHTML clients using Fresnel lenses and formats" >Marbles</a>
<a href="http://wifo5-03.informatik.uni-mannheim.de/bizer/wiqa/" title="The WIQA - Information Quality Assessment Framework is a set of software components that empowers information consumers to employ a wide range of different information quality assessment policies to filter information from the Web" >WIQA</a>
<a href="http://wifo5-03.informatik.uni-mannheim.de/pubby/" title="Pubby – A Linked Data Frontend for SPARQL Endpoints can be used to add Linked Data interfaces to SPARQL endpoints" >Pubby</a>
<a href="http://wifo5-03.informatik.uni-mannheim.de/bizer/rdfapi/" title="RAP – RDF API for PHP is a software package for parsing, querying, manipulating, serializing and serving RDF models" >RAP</a>
</div>
<!--div id="wbsg_navbar_intro">Open Source projects by the <a href="http://wbsg.de">Web-based Systems Group</a>: </div-->
<div id="wbsg_navbar_intro">Open Source projects by the <a href="http://dws.informatik.uni-mannheim.de/">Data and Web Science Group</a>: </div>
</div>
<!-- End WBSG navbar -->
<!--div id="logo" align="right">
<a href="http://wbsg.de"><img src="http://ldif.wbsg.de/images/fu-logo.gif" alt="Freie Universität Berlin Logo" border="0"></a>
</div-->
<div id="logo"><a href="http://dws.informatik.uni-mannheim.de/"><img src="images/logo_uni_en.gif" alt="Universität Mannheim Logo"></a></div>
<div id="header">
<h1 style="font-size: 200%;">Sieve - Linked Data Quality Assessment and Fusion</h1>
</div>
<div id="tagline">A quality evaluation and conflict resolution module for LDIF</div>
<div id="authors">
<a href="http://pablomendes.com/">Pablo N. Mendes</a><br>
<a href="http://hannes.muehleisen.org/">Hannes Mühleisen</a><br>
<a href="http://dws.informatik.uni-mannheim.de/en/people/researchers/dr-volha-bryl/">Volha Bryl</a><br>
<a href="http://dws.informatik.uni-mannheim.de/en/people/professors/prof-dr-christian-bizer/">Christian Bizer</a><br>
</div>
<div id="content">
<div id="top-container">
<div id="purpose">
<p><i>"A sieve, or sifter, separates wanted elements from unwanted material using a woven screen such as a mesh or net." <br/>Source: <a href="http://en.wikipedia.org/wiki/Sieve">Wikipedia</a></i></p>
<p>
Sieve allows Web data to be filtered according to different data quality assessment policies and provides for fusing Web data according to different conflict resolution methods.</p>
</div>
</div>
<div id="body-container">
<h2 id="news">News</h2>
<div>
<ul>
<li><b>13/02/2014</b>: Sieve <b>new version</b> released with <a href="http://ldif.wbsg.de/">LDIF 0.5.2</a>, including <a href="FPL.html">Fusion Policy Learner</a>, new quality assessment scores and fusion functions, various bugfixes. </li>
<li><b>19/12/2013</b>: <a href="FPL.html">Fusion Policy Learner</a> module has been added to Sieve. </li>
<li><b>14/11/2012</b>: Sieve <b>new version</b> released with <a href="http://ldif.wbsg.de/">LDIF 0.5.1</a>, including various bugfixes.</li>
<li><b>12/08/2012</b>: Bugfixes; Additional Functions; Quality Assessment and Data Fusion modules integrated with the LOD2 Stack.</li>
<li><b>03/04/2012</b>: <b>First implementation</b> included in <a href="http://ldif.wbsg.de/">LDIF 0.5</a>.</li> <!-- http://lists.w3.org/Archives/Public/public-lod/2012Apr/0041.html -->
<li><b>30/03/2012</b>: Paper about Sieve presented at the LWDM workshop at EDBT'12. [ <a href="http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/research/publications/Mendes-Muehleisen-Bizer-Sieve-LWDM2012.pdf">pdf</a> ] [ <a href="http://www.slideshare.net/pablomendes/lwdm2012-sieveslideshare">slides</a> ]</li>
<li><b>29/02/2012</b>: Conceptual design and implementation of metrics released.</li>
</ul>
</div>
<h2 id="contents">Contents</h2>
<div>
<ol class="toc">
<li><a href="#about">About Sieve</a></li>
<li><a href="#components">Sieve within LDIF Architecture</a></li>
<li><a href="#qualityassessment">Quality Assessment</a></li>
<li><a href="#datafusion">Data Fusion</a></li>
<li><a href="#examples">Examples</a></li>
<!--ul>
<li><a href="#example1">Using LDIF to integrate Data about Cities from Multiple DBpedias</a></li>
</ul-->
<li><a href="#development">Source code and development</a></li>
<li><a href="#feedback">Support and Feedback</a></li>
<li><a href="#references">References</a></li>
<li><a href="#acknowledgments">Acknowledgments</a></li>
</ol>
</div>
<h2 id="about">1. About Sieve</h2>
<p>
The <a href="http://linkeddatabook.com/editions/1.0/index.html#htoc23">Web of Linked Data</a> grows rapidly and contains data originating from <a href="http://www4.wiwiss.fu-berlin.de/lodcloud/state/">hundreds of data sources</a>.
The quality of data from those sources is very diverse, as values may be out of date, incomplete or incorrect.
Moreover, data sources may provide conflicting values for the same properties.
</p>
<p>
In order for <a href="http://linkeddatabook.com/editions/1.0/index.html#htoc75">Linked Data applications</a> to consume data from this global data space in an integrated fashion, a number of challenges have to be overcome.
The <a href="http://ldif.wbsg.de/">Linked Data Integration Framework (LDIF)</a> provides a modular architecture that homogenizes Web data into a target representation as configured by the user. LDIF includes Data Access, <a href="http://www4.wiwiss.fu-berlin.de/bizer/r2r/">Schema Mapping</a>, <a href="http://www4.wiwiss.fu-berlin.de/bizer/silk/">Identity Resolution</a> and Data Output modules. This document describes Sieve, which adds quality assessment and data fusion capabilities to the LDIF architecture.
</p>
<p>
Sieve uses metadata about named graphs (e.g. provenance) in order to assess data quality as defined by users.
Sieve is agnostic to provenance vocabulary and quality models.
Through its configuration files, Sieve is able to access provenance expressed in different ways, and uses customizable scoring functions
to output data quality descriptors according to a user-specified data-quality vocabulary.
</p>
<p>
Based on these quality descriptors (and/or optionally other descriptors computed elsewhere), Sieve can use configurable FusionFunctions to clean the data according to task-specific requirements.
</p>
<!--p>The LDIF integration pipeline consists of the following steps:
</p><ol>
<li><b>Collect Data</b>: Access modules locally replicate data sets via file download, crawling or SPARQL.</li>
<li><b>Map to Schema</b>: An expressive mapping language allows for translating data from
the various vocabularies that are used on the Web into a consistent,
local target vocabulary.</li>
<li><b>Resolve Identities</b>: An identity resolution component discovers URI aliases in the input data and replaces them with a
single target URI based on user-provided matching heuristics.</li>
<li><b>Output</b>: LDIF outputs the integrated data in a single file. For provenance tracking, LDIF employs the Named Graphs data model.</li>
</ol-->
<h2 id="components">2. Sieve within LDIF Architecture</h2>
<p>The figure below shows the schematic <a href="http://linkeddatabook.com/editions/1.0/index.html#htoc84">architecture of Linked Data applications</a>
that implement the crawling/data warehousing pattern. The figure highlights the steps of the data integration process that are currently supported by LDIF.</p>
<img alt="Example-architecture of an integration aware Linked Data application" src="http://ldif.wbsg.de/images/linkeddataapp-sieve.png">
<!--p>The LDIF Framework consists of the Runtime Environment and a set of pluggable modules.
The pluggable modules include Data Access (dumps, crawler and SPARQL), Transformation (R2R data translation, Silk identity resolution, Sieve data quality and fusion) and Data Output (n-quads or n-triples).
</p-->
<p>
Sieve is employed as the quality evaluation module in LDIF (see the figure above). The data cleaning procedure in Sieve works in two steps. The first step, Quality Assessment, associates quality scores with the named
graphs that are used in LDIF for provenance tracking and that group subsets of triples together. The second step uses this metadata to decide how to fuse conflicting property values according to the user configuration.
</p>
<h2 id="qualityassessment">3. Quality Assessment</h2>
<p>Data is considered to be of high quality "if they are fit for their intended uses in operations, decision making and planning" (J. M. Juran). According to this task-dependent view on data quality, we realize the quality assessment task through a configurable module relying on multifaceted quality descriptions based on quality indicators.</p>
<dl>
<dt>Data Quality Indicator</dt><dd>An aspect of a data item or data set that may give an indication to the user of the suitability of the data for some intended use. The types of information which may be used as quality indicators are very diverse. Besides the information to be assessed, scoring functions may rely on meta-information about the circumstances in which information was created or will be used, on background information about the information provider, or on ratings provided by the information consumers themselves, other information consumers, or domain experts.</dd>
<dt>Scoring Function</dt><dd>An implementation that, based on indicators, generates a score to be evaluated by the user in the process of deciding on the suitability of the data for some intended use. There may be a choice of several alternative or combined scoring functions for producing a score for a given indicator. Depending on the quality dimension to be assessed and the chosen quality indicators, scoring functions range from simple comparisons, like "assign true if the quality indicator has a value greater than X", over set functions, like "assign true if the indicator is in the set Y", aggregation functions, like "count or sum up all indicator values", to more complex statistical functions, text-analysis, or network-analysis methods.</dd>
<dt>Assessment Metric</dt><dd>Is a procedure for measuring an information quality dimension. Assessment metrics rely on a set of quality indicators and calculate an assessment score from these indicators using a scoring function. Information quality assessment metrics can be classified into three categories according to the type of information that is used as quality indicator: content-based metrics, context-based metrics and rating-based metrics.</dd>
<dt>Aggregate Metric</dt><dd>Users can specify aggregate assessment metrics built out of individual assessment metrics. These aggregations produce new assessment values by applying average, sum, max, min or threshold functions to a set of assessment metrics. Aggregate assessment metrics are best visualized as trees, where an aggregation function is applied to the leaves and the results are combined up the tree until a single value is obtained. The functions to be applied at each branch are specified by the user.</dd>
</dl>
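<p>To make the tree-shaped aggregation concrete, here is a small sketch (in Python rather than Sieve's Scala; the tuple-based tree encoding is invented for this illustration) that evaluates an aggregate metric bottom-up:</p>

```python
# Sketch of aggregate-metric evaluation: an aggregation function is applied
# at each branch and results are combined up the tree to a single value.
# The tuple-based tree encoding is invented for this illustration.

def aggregate(node):
    """node is either a leaf score (a number) or (function_name, [children])."""
    if isinstance(node, (int, float)):
        return float(node)
    fn, children = node
    scores = [aggregate(child) for child in children]
    if fn == "average":
        return sum(scores) / len(scores)
    if fn == "sum":
        return sum(scores)
    if fn == "max":
        return max(scores)
    if fn == "min":
        return min(scores)
    raise ValueError(f"unknown aggregation function: {fn}")

# e.g. the maximum of (the average of recency and reputation) and a rating score
tree = ("max", [("average", [0.8, 0.25]), 0.3])
print(round(aggregate(tree), 3))
```

The threshold aggregation mentioned above would follow the same pattern and is omitted here for brevity.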
<h3 id="qualityassessmentconfig">Quality Assessment Configuration</h3>
<div>
<p>A Quality Assessment Job computes quality metrics and uses them to update the metadata about external sources in the local cache. It is configured with an XML document, whose structure
is described by this <a href="https://github.com/wbsg/ldif/blob/master/ldif/ldif-modules/ldif-sieve/ldif-sieve-common/src/main/resources/Sieve.xsd">XML Schema</a>.</p>
<p>A typical configuration document looks like this:</p>
<pre>
<QualityAssessment name="Recent and Reputable is Best"
description="The idea that more recent articles from Wikipedia
could capture better values that change over time (recency),
while if there is a conflict between two Wikipedias, trust the one
which is more likely to have the right answer (reputation).">
<AssessmentMetric id="sieve:reputation">
<ScoringFunction class="ScoredList">
<Param name="list"
value="http://pt.wikipedia.org http://en.wikipedia.org
http://es.wikipedia.org http://fr.wikipedia.org http://de.wikipedia.org"/>
</ScoringFunction>
</AssessmentMetric>
<AssessmentMetric id="sieve:recency">
<ScoringFunction class="TimeCloseness">
<Param name="timeSpan" value="50000"/>
<Input path="?GRAPH/ldif:lastUpdate"/>
</ScoringFunction>
</AssessmentMetric>
</QualityAssessment>
</pre>
<p>It has the following elements:</p>
<ul>
<!--li><tt>Prefixes</tt> - specifies nicknames for namespaces in order to allow the use of <a href="http://en.wikipedia.org/wiki/QName">shorter references to URIs</a>. </li-->
<li><tt>QualityAssessment</tt> -
groups a number of assessment metrics into a quality assessment "policy". Users can give it a <tt>name</tt> which should be unique, and a <tt>description</tt> to help to explain the intentions of this policy.</li>
<li><tt>AssessmentMetric</tt> -
describes a simple assessment metric. Each simple <tt>AssessmentMetric</tt> element specifies a unique identifier for the aspect of quality that is being assessed (through the attribute <tt>id</tt>) and a <tt>ScoringFunction</tt> element that will provide a real-valued assessment (between 0 and 1) of this indicator.</li>
<li><tt>AggregateMetric </tt> -
describes an aggregation function over a number of simple assessment metrics.</li>
<li><tt>ScoringFunction</tt> -
configures the input of a java/scala class implementing a scoring function. Each scoring function takes in a number of input parameters (<tt>Param</tt>, <tt>Input</tt> or <tt>EnvironmentVariable</tt>) and generates a real value between 0 and 1 based on this input.
The <tt>Param</tt> element is used for static values, <tt>Input</tt> is used for accessing values in the data being consumed by LDIF, and <tt>EnvironmentVariable</tt> accesses values stored in the system (e.g. username, date, etc.).</li>
<li><tt>Input</tt> -
specifies either a path or a SPARQL query to access metadata (e.g. provenance) about named graphs. This metadata will be provided as input to a scoring function, which will then produce additional metadata in the form of indicators.
A <tt>path</tt> attribute contains a path expression (series of variables and RDF properties separated by "/") that indicate how to select applicable metadata from the input.
Conventionally, for <tt>path</tt> attributes within ScoringFunction elements, the paths start with a variable named <tt>?GRAPH</tt>.</li>
</ul>
</div>
<h3 id=qualityinput>Input</h3>
<div>
<p>
In its most basic form, the input for the quality assessment task is the <a href="http://ldif.wbsg.de/#provenance">LDIF provenance metadata</a>.
When LDIF imports a new data source, it generates <code>ldif:ImportedGraph</code> descriptions for all graphs in the input (see <a href="http://ldif.wbsg.de/#importjob">Data Import</a> for module-specific details).
Sieve will (by default) merge that with any other graph metadata provided as input, and feed the result through the scoring functions defined in the configuration.
This allows the configuration to use metadata automatically computed by LDIF, as well as metadata from the original source or from third parties.
Examples of such third-party metadata include ratings, prominence statistics, etc.
Optionally, if all the metadata you need is contained within the LDIF provenance graph, you can set the configuration property <code>qualityFromProvenanceOnly=true</code> in <a href="http://ldif.wbsg.de/#configuration" target="ldif">integration.properties</a>.
This will instruct Sieve to limit the quality module's input to the provenance graph alone, reducing computation effort, time and output size.
</p>
<!--p>Example metadata:</p>
<pre>
</pre-->
</div>
<h3 id="scoringfunctions">Available Scoring Functions</h3>
<div>
<p>There is a vast number of possible scoring functions for quality assessment. We do not attempt to implement all of them. The currently implemented scoring functions are:</p>
<ul>
<li><i>TimeCloseness</i>: measures the distance from the input date (obtained from the input metadata through a path expression) to the current (system) date. Dates outside the range (expressed in number of days) receive value 0, and dates that are more recent receive values closer to 1. Example input:
<pre>
<ScoringFunction class="TimeCloseness">
<Param name="range" value="7"/>
<Input path="?GRAPH/ldif:lastUpdate"/>
</ScoringFunction>
</pre>
Example output: 0.85 (assuming the last update was yesterday).
<br/>TimeCloseness can be used for indicators such as freshness (last updated) and recency (creation date).
</li>
<li><i>ScoredList</i>: assigns decreasing, uniformly distributed real values to each graph URI provided in a space-separated list.
Example input:
<pre>
<ScoringFunction class="ScoredList">
<Param name="list"
value="http://en.wikipedia.org http://pt.wikipedia.org http://de.wikipedia.org http://es.wikipedia.org"/>
</ScoringFunction>
</pre>
Example output: {http://en.wikipedia.org=1, http://pt.wikipedia.org=0.75, http://de.wikipedia.org=0.5, http://es.wikipedia.org=0.25, http://fr.wikipedia.org=0}</li>
<li><i>ScoredPrefixList</i>: assigns decreasing, uniformly distributed real values to each graph URI that matches one of the prefixes provided in a space-separated list.
Example input:
<pre>
<ScoringFunction class="ScoredPrefixList">
<Param name="list"
value="http://en.wikipedia.org http://pt.wikipedia.org http://de.wikipedia.org http://es.wikipedia.org"/>
</ScoringFunction>
</pre>
Example output: {http://en.wikipedia.org/wiki/Berlin=1, http://pt.wikipedia.org/wiki/Berlin=0.75, http://de.wikipedia.org/wiki/Berlin=0.5, http://es.wikipedia.org/wiki/Berlin=0.25, http://fr.wikipedia.org/wiki/Berlin=0}</li>
<!--li><i>SetMembership</i>: assigns 1 if the value of the indicator provided as input belongs to the set specified as parameter, 0 otherwise.</li-->
<li><i>Threshold</i>: assigns 1 if the value of the indicator provided as input is higher than a threshold specified as parameter, 0 otherwise.
<pre>
<ScoringFunction class="Threshold">
<Param name="threshold" value="5"/>
<Input path="?GRAPH/example:numberOfEdits"/>
</ScoringFunction>
</pre>
Example output: {http://en.wikipedia.org/wiki/DBpedia=1, http://pt.wikipedia.org/wiki/DBpedia=0}
</li>
<li><i>Interval</i>: assigns 1 if the value of the indicator provided as input is within the interval specified as parameter, 0 otherwise.
<pre>
<ScoringFunction class="Interval">
<Param name="from" value="6"/>
<Param name="to" value="42"/>
<Input path="?GRAPH/provenance:whatever"/>
</ScoringFunction>
</pre>
</li>
<li><i>NormalizedCount</i>: normalizes indicator values by the threshold provided as a parameter; if the value of an indicator is greater than the threshold, it outputs 1.0.
<pre>
<ScoringFunction class="NormalizedCount">
<Param name="maxCount" value="100"/>
<Input path="?GRAPH/example:numberOfEdits"/>
</ScoringFunction>
</pre>
</li>
</ul>
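<p>The arithmetic these scoring functions perform can be sketched as follows (a best-effort Python reading of the descriptions above, not Sieve's implementation: <i>ScoredList</i> is assumed to decrease linearly, and <i>TimeCloseness</i> to interpolate linearly within the range):</p>

```python
def scored_list(uris):
    """Decreasing, uniformly distributed scores for a ranked list of graph URIs;
    URIs that do not appear in the list default to 0."""
    n = len(uris)
    return {uri: 1 - i / n for i, uri in enumerate(uris)}

def time_closeness(days_ago, range_days):
    """1.0 for data updated right now, falling linearly to 0 at the edge of the
    range; anything older than the range scores 0."""
    return max(0.0, 1 - days_ago / range_days)

ranking = scored_list(["http://en.wikipedia.org", "http://pt.wikipedia.org",
                       "http://de.wikipedia.org", "http://es.wikipedia.org"])
# en -> 1.0, pt -> 0.75, de -> 0.5, es -> 0.25; any other graph scores 0
print(ranking["http://pt.wikipedia.org"])  # 0.75
```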
<!--aggregation functions-->
<!--ul>
<li><i>Max,Min,Count,Sum,Average</i>: computes mathematical functions over the value of the indicator informed as parameter. These functions can be applied, for example, over ratings, number of edits, number of views, etc. according to the application needs.</li>
<li><i>TieBreaker</i>: takes as input a set of real valued indicators. If there is a tie for the first indicator, it uses the second, and if there is still a tie, it keeps going through the list.</li>
<li><i>Mixture</i>: performs a weighted sum of scores like: 0.3 * sieve:recency + 0.7 * sieve:reputation</li>
</ul-->
<!--table>
<tr>
<th>Name</th>
</tr>
<tr><td>
</tr>
</table-->
<p>The framework is extensible, and users can define their own set of <code>ScoringFunction</code> classes. The class should implement <code>ldif.modules.sieve.quality.ScoringFunction</code>, which defines the method <code>fromXML</code> (a method to create a new object of the class given the configuration parameters) and the method <code>score</code> (which will be called at runtime to perform quality assessment based on a number of metadata values). Method signatures (in Scala):</p>
<pre>
trait ScoringFunction {
def score(graphId: NodeTrait, metadataValues: Traversable[IndexedSeq[NodeTrait]]): Double
}
object ScoringFunction {
def fromXML(node: Node) : ScoringFunction
}
</pre>
</div>
<h3 id=qualityoutput>Output</h3>
<div>
<p>The quality-describing metadata computed by this step can be output and stored as an extension to the <a href="http://ldif.wbsg.de/#provenance">LDIF Provenance Graph</a>.
In this case, all non-zero scores are written out as quads, for each graph evaluated in LDIF. Metrics with score=0.0 are omitted, as in Sieve a missing quality assessment is equivalent to a zero score.
If you wish to omit all quality scores from the final output (e.g. to reduce output size), you can set the configuration property <code>outputQualityScores=false</code> in <a href="http://ldif.wbsg.de/#configuration" target="ldif">integration.properties</a>.
Regardless, the quality scores will always be passed on to the fusion module, if a Fusion configuration exists.</p>
<pre>
enwiki:Juiz_de_Fora sieve:recency "0.4" ldif:provenance .
ptwiki:Juiz_de_Fora sieve:recency "0.8" ldif:provenance .
enwiki:Juiz_de_Fora sieve:reputation "0.75" ldif:provenance .
ptwiki:Juiz_de_Fora sieve:reputation "0.25" ldif:provenance .
</pre>
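<p>The configuration properties mentioned in the Input and Output sections both live in <code>integration.properties</code>; an excerpt with the non-default values might look like this (property names as given above; standard Java properties syntax is assumed):</p>

```properties
# integration.properties (excerpt)
# feed only the LDIF provenance graph to the quality module
qualityFromProvenanceOnly=true
# keep quality scores out of the final output (they are still passed to fusion)
outputQualityScores=false
```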
</div>
</div>
<h2 id="datafusion">4. Data Fusion</h2>
<div>
<p>Data Fusion combines data from multiple sources and aims to remove or transform conflicting values towards a clean representation.</p>
<p>
In Sieve, the data fusion step relies on quality metadata generated by the Quality Assessment Module, and a configuration file that instructs which fusion functions to apply for each property.
A data fusion function takes in all values for each property from all data sources together with the quality scores that have been previously calculated by the Data Quality Assessment Module.
In order to output a clean value for each property, it applies fusion functions implementing decisions such as: keep the value with the highest score for a given quality assessment metric, take the average of the values, etc.
</p>
<!--h3 id="nextSteps">Next steps for Sieve</h3>
<div>
<p>Over the next months, we plan to extend Sieve along the following lines:<br>
</p>
<ol>
<li>Implement more quality assessment metrics and data fusion strategies.</li>
<li>Hadoop version</li>
<li>Wildcards</li>
</li>
</ol>
</div-->
<h3 id="fusionconfig">Fusion Configuration</h3>
<div>
<p>A Data Fusion Job is configured with an XML document, whose structure is described by this <a href="https://github.com/wbsg/ldif/blob/master/ldif/ldif-modules/ldif-sieve/ldif-sieve-common/src/main/resources/Sieve.xsd">XML Schema</a>.</p>
<p>A typical configuration document looks like this:</p>
<pre>
<Fusion name="Fusion strategy for DBpedia City Entities"
description="The idea is to use values from multiple DBpedia languages to improve the quality of data about cities.">
<Class name="dbpedia:City">
<Property name="dbpedia:areaTotal">
<FusionFunction class="KeepValueWithHighestScore" metric="sieve:lastUpdated" />
</Property>
<Property name="dbpedia:population">
<FusionFunction class="Average" />
</Property>
<Property name="dbpedia:name">
<FusionFunction class="KeepValueWithHighestScore" metric="sieve:reputation" />
</Property>
</Class>
</Fusion>
</pre>
<p>It has the following elements:</p>
<ul>
<li><tt>Fusion</tt> - describes a Data Fusion policy. Users can give it a <tt>name</tt> which should be unique, and a <tt>description</tt> to help to explain the intentions of this policy. Each <tt>Fusion</tt> defines which data to fuse through <tt>Class</tt> and <tt>Property</tt> sub-elements, and which fusion functions to apply through <tt>FusionFunction</tt> elements.</li>
<li><tt>Class</tt> - defines a subset of the input by selecting all instances of the class provided in the attribute <tt>name</tt>.</li>
<li><tt>Property</tt> - defines which <tt>FusionFunction</tt> should be applied to the values of a given RDF property. The property qName should be given in the attribute <tt>name</tt>, and the <tt>FusionFunction</tt> element must be specified as a child of <tt>Property</tt>.</li>
<li><tt>FusionFunction</tt> - specifies in the <tt>class</tt> attribute which java/scala class should be used to fuse values for a given property. It can take in a number of <tt>Param</tt> elements, which specify the <tt>name</tt> of the parameter and the <tt>value</tt> for that parameter. The framework is extensible, and users can define their own set of <tt>FusionFunction</tt> classes. The class should implement <tt>ldif.modules.sieve.fusion.FusionFunction</tt>, which defines the method <tt>fromXML</tt> (a method to create a new object of the class given the configuration parameters) and the method <tt>fuse</tt> (which will be called at runtime to perform fusion based on a number of values and quality assessment metadata). Method signatures (in Scala):
<pre>
class FusionFunction(val metricId: String="") {
def fuse(values: Traversable[IndexedSeq[NodeTrait]], quality: QualityAssessment) : Traversable[IndexedSeq[NodeTrait]]
}
object FusionFunction {
def fromXML(node: Node) : FusionFunction
}
</pre>
</li>
</ul>
</div>
<h3 id="fusionfunctions">Available Fusion Functions</h3>
<div>
<p>The currently implemented fusion functions are:</p>
<ul>
<!--li><i>Filter</i>: removes all values for which the input quality assessment metric is below a given threshold.
<pre>
<FusionFunction class="Filter" metric="sieve:reputation">
<Param name="threshold" value="0.85"/>
</FusionFunction>
</pre>
</li-->
<li><i>PassItOn</i>: does nothing, passes values on to the next component in the pipeline.</li>
<li><i>KeepFirst</i>: keeps the value with the highest score for a given quality assessment metric. In case of ties, the function keeps the first value in the order of input.
<pre>
<FusionFunction class="KeepFirst" metric="sieve:reputation"/>
</pre>
</li>
<li><i>KeepAllValuesByQualityScore</i>: similar to <i>KeepFirst</i>, but in case of ties, it keeps all values with the highest score.</li>
<li><i>KeepLast</i>: keeps the value with the lowest score for a given quality assessment metric. In case of ties, the function keeps the first in order of input. Useful for inverting scores, e.g. inverting TimeCloseness to prefer the most experienced author (active since the earliest date).</li>
<li><i>Average</i>: takes the average of all input data for a given property.
<pre>
<FusionFunction class="Average"/>
</pre>
</li>
<li><i>Maximum</i>: takes the maximum of all input data for a given property.</li>
<li><i>Voting</i>: picks the value that appears most frequently across sources. Each named graph has one vote; the value with the most votes is chosen.</li>
<li><i>WeightedVoting</i>: picks the value preferred by highly rated sources. Each named graph's vote is weighted by its score for a given quality metric; the value with the highest aggregated score is chosen.</li>
</ul>
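<p>To make the tie-breaking and vote-weighting rules concrete, here is a sketch of <i>KeepFirst</i> and <i>WeightedVoting</i> (in Python, as a reading of the descriptions above rather than Sieve's Scala implementation; property values are modelled as (value, score) pairs in input order):</p>

```python
def keep_first(values):
    """values: (value, quality_score) pairs in input order.
    Keeps the highest-scored value; on ties, Python's max returns the
    first maximal element, matching the 'first in input order' rule."""
    return max(values, key=lambda pair: pair[1])[0]

def weighted_voting(values):
    """Each occurrence of a value votes with its graph's quality score;
    the value with the highest aggregated score is chosen."""
    totals = {}
    for value, score in values:
        totals[value] = totals.get(value, 0.0) + score
    return max(totals, key=totals.get)

# two graphs agree on one population figure, one graph disagrees
votes = [("3349000", 0.4), ("3348000", 0.5), ("3349000", 0.3)]
print(weighted_voting(votes))  # "3349000" wins: 0.4 + 0.3 beats 0.5
```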
</div>
</div>
<h2 id="examples">5. Examples</h2>
<div>
<p>This section provides a Sieve usage example. We will explore the use case of integrating information about cities coming from different DBpedia editions.
We have included the configuration files for this example in our distribution under <a href="https://github.com/wbsg/ldif/blob/master/ldif/examples/lwdm2012">ldif/examples/lwdm2012</a>.</p>
<p>
Here is how you can run this example:
<ol>
<li><a href="http://dl.mes-semantics.com/ldif/ldif-0.5.1.zip">download</a> the latest release</li>
<li>unpack the archive and change into the extracted directory <code>ldif-0.5.1</code></li>
<li>to run this example:
<ul>
<li>under Linux / Mac OS type:<pre>bin/ldif examples/lwdm2012/schedulerConfig.xml</pre></li>
<li>under Windows type:<pre>bin\ldif.bat examples\lwdm2012\schedulerConfig.xml</pre></li>
</ul>
</li>
</ol>
</p>
<p>
The directory <code>ImportJobs</code> contains configuration files that instruct LDIF to download the data onto your machine. One example configuration is shown below:
<pre><importJob >
<internalId>pt.dbpedia.org</internalId>
<dataSource>DBpedia</dataSource>
<refreshSchedule>onStartup</refreshSchedule>
<tripleImportJob>
<dumpLocation>http://sieve.wbsg.de/download/lwdm2012/pt.nq.bz2</dumpLocation>
</tripleImportJob>
</importJob></pre>
</p>
<p>
The directory <code>sieve</code> contains configuration for quality assessment and fusion of these data. For this use case, we consider reputation and recency to be important quality aspects.
We will assign higher reputation to data coming from the English Wikipedia (because it is the largest and most viewed), followed by the Portuguese (because it's the most relevant for the use case), Spanish, French and German (in subjective order of language similarity to Portuguese).
Recency will be measured by looking at the <code>lastUpdated</code> property, which was extracted from Wikipedia Dumps.
<pre>
<Sieve xmlns="http://www4.wiwiss.fu-berlin.de/ldif/">
<Prefixes>
<Prefix id="dbpedia-owl" namespace="http://dbpedia.org/ontology/"/>
<Prefix id="dbpedia" namespace="http://dbpedia.org/resource/"/>
<Prefix id="sieve" namespace="http://sieve.wbsg.de/vocab/"/>
</Prefixes>
<QualityAssessment name="Recent and Reputable is Best"
description="The idea that more recent articles from Wikipedia
could capture better values that change over time (recency),
while if there is a conflict between two Wikipedias, trust the one
which is more likely to have the right answer (reputation).">
<AssessmentMetric id="sieve:recency">
<ScoringFunction class="TimeCloseness">
<Param name="timeSpan" value="500"/>
<Input path="?GRAPH/ldif:lastUpdate"/>
</ScoringFunction>
</AssessmentMetric>
<AssessmentMetric id="sieve:reputation">
<ScoringFunction class="ScoredList">
<Param name="list"
value="http://en.wikipedia.org http://pt.wikipedia.org http://es.wikipedia.org http://fr.wikipedia.org"/>
</ScoringFunction>
</AssessmentMetric>
</QualityAssessment>
<Fusion name="Fusion strategy for DBpedia City Entities"
description="The idea is to use values from multiple DBpedia languages to improve the quality of data about cities.">
<Class name="dbpedia-owl:Settlement">
<Property name="dbpedia-owl:areaTotal">
<FusionFunction class="KeepFirst" metric="sieve:recency"/>
</Property>
<Property name="dbpedia-owl:populationTotal">
<FusionFunction class="KeepFirst" metric="sieve:recency"/>
</Property>
<Property name="dbpedia-owl:foundingDate">
<FusionFunction class="KeepFirst" metric="sieve:reputation"/>
</Property>
</Class>
</Fusion>
</Sieve>
</pre>
</p>
<p>
After running, you should find the results in a file called "integrated_cities.nq". This file name is configurable in <code>integrationJob.xml</code>.
The output contains all fused properties, plus every input property for which no fusion was specified, passed through unchanged.
<br/>
Some example data for fused properties:
<pre>
<http://dbpedia.org/resource/Cachoeiras_de_Macacu> <http://dbpedia.org/ontology/areaTotal> "9.55806E11"^^<http://www.w3.org/2001/XMLSchema#double> <http://pt.wikipedia.org/wiki/Cachoeiras_de_Macacu> .
<http://dbpedia.org/resource/Cachoeiras_de_Macacu> <http://dbpedia.org/ontology/populationTotal> "54370"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> <http://pt.wikipedia.org/wiki/Cachoeiras_de_Macacu> .
<http://dbpedia.org/resource/%C3%81lvares_Florence> <http://dbpedia.org/ontology/areaTotal> "3.62E8"^^<http://www.w3.org/2001/XMLSchema#double> <http://pt.wikipedia.org/wiki/%C3%81lvares_Florence> .
<http://dbpedia.org/resource/%C3%81lvares_Florence> <http://dbpedia.org/ontology/populationTotal> "3897"^^<http://www.w3.org/2001/XMLSchema#integer> <http://en.wikipedia.org/wiki/%C3%81lvares_Florence> .
<http://dbpedia.org/resource/Ant%C3%B4nio_Prado> <http://dbpedia.org/ontology/areaTotal> "3.47616E8"^^<http://www.w3.org/2001/XMLSchema#double> <http://en.wikipedia.org/wiki/Ant%C3%B4nio_Prado> .
<http://dbpedia.org/resource/Ant%C3%B4nio_Prado> <http://dbpedia.org/ontology/populationTotal> "14159"^^<http://www.w3.org/2001/XMLSchema#integer> <http://en.wikipedia.org/wiki/Ant%C3%B4nio_Prado> .
<http://dbpedia.org/resource/Ant%C3%B4nio_Prado> <http://dbpedia.org/ontology/foundingDate> "1899-02-11"^^<http://www.w3.org/2001/XMLSchema#date> <http://en.wikipedia.org/wiki/Ant%C3%B4nio_Prado> .
</pre>
</p>
<p>
If you choose to output quality scores, then you should also notice many triples attached to your graph URIs, like these:
<pre>
<http://es.wikipedia.org/wiki/Rio_de_Janeiro> <http://sieve.wbsg.de/vocab/reputation> "0.5"^^<http://www.w3.org/2001/XMLSchema#double> <http://www4.wiwiss.fu-berlin.de/ldif/provenance> .
<http://pt.wikipedia.org/wiki/Rio_de_Janeiro> <http://sieve.wbsg.de/vocab/reputation> "0.75"^^<http://www.w3.org/2001/XMLSchema#double> <http://www4.wiwiss.fu-berlin.de/ldif/provenance> .
<http://fr.wikipedia.org/wiki/Rio_de_Janeiro> <http://sieve.wbsg.de/vocab/reputation> "0.25"^^<http://www.w3.org/2001/XMLSchema#double> <http://www4.wiwiss.fu-berlin.de/ldif/provenance> .
<http://en.wikipedia.org/wiki/Rio_de_Janeiro> <http://sieve.wbsg.de/vocab/reputation> "1.0"^^<http://www.w3.org/2001/XMLSchema#double> <http://www4.wiwiss.fu-berlin.de/ldif/provenance> .
<http://de.wikipedia.org/wiki/Rio_de_Janeiro> <http://sieve.wbsg.de/vocab/recency> "0.9956"^^<http://www.w3.org/2001/XMLSchema#double> <http://www4.wiwiss.fu-berlin.de/ldif/provenance> .
<http://en.wikipedia.org/wiki/Rio_de_Janeiro> <http://sieve.wbsg.de/vocab/recency> "0.9955"^^<http://www.w3.org/2001/XMLSchema#double> <http://www4.wiwiss.fu-berlin.de/ldif/provenance> .
<http://es.wikipedia.org/wiki/Rio_de_Janeiro> <http://sieve.wbsg.de/vocab/recency> "0.9648"^^<http://www.w3.org/2001/XMLSchema#double> <http://www4.wiwiss.fu-berlin.de/ldif/provenance> .
<http://fr.wikipedia.org/wiki/Rio_de_Janeiro> <http://sieve.wbsg.de/vocab/recency> "0.9956"^^<http://www.w3.org/2001/XMLSchema#double> <http://www4.wiwiss.fu-berlin.de/ldif/provenance> .
<http://pt.wikipedia.org/wiki/Rio_de_Janeiro> <http://sieve.wbsg.de/vocab/recency> "0.9953"^^<http://www.w3.org/2001/XMLSchema#double> <http://www4.wiwiss.fu-berlin.de/ldif/provenance> .
</pre>
</p>
<p>
Now you can play with the assessment metrics and the fusion functions and check the differences in the output. You may also add your own data via an ImportJob, and see how that changes the results.
Have fun sifting data for a better Web!
</p>
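<p>As a starting point, here is a small variation (illustrative; not included in the distribution) that prefers the most reputable source over the most recent one for population figures:</p>
<pre>
<Property name="dbpedia-owl:populationTotal">
  <!-- keep the value from the graph with the highest sieve:reputation
       score instead of the most recently updated one -->
  <FusionFunction class="KeepFirst" metric="sieve:reputation"/>
</Property>
</pre>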
</div>
<h2 id="development">6. Source Code and Development</h2>
<div>
<p>The latest source code is available from the <a href="http://github.com/wbsg/ldif/">LDIF development page</a> on GitHub.</p>
<p>The framework can be used under the terms of the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache Software License</a>.
</p>
</div>
<h2 id="feedback">7. Support and Feedback </h2>
<div>
<p>For questions and feedback please use the <a href="http://groups.google.com/group/ldif?hl=en">LDIF Google Group</a>.</p>
</div>
<h2 id="references">8. References</h2>
<div>
<ul>
<li>Pablo N. Mendes, Hannes Mühleisen, Christian Bizer. <b>Sieve: Linked Data Quality Assessment and Fusion</b>. 2nd International Workshop on Linked Web Data Management (LWDM 2012) at the 15th International Conference on Extending Database Technology, EDBT 2012. [<a href="http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/research/publications/Mendes-Muehleisen-Bizer-Sieve-LWDM2012.pdf">pdf</a>]
<pre>
@inproceedings{lwdm12mendes,
booktitle = {2nd International Workshop on Linked Web Data Management (LWDM 2012)
at the 15th International Conference on Extending Database Technology, EDBT 2012},
month = {March},
title = {{Sieve: Linked Data Quality Assessment and Fusion}},
  author = {Mendes, Pablo N. and M\"{u}hleisen, Hannes and Bizer, Christian},
year = {2012},
pages = {to appear},
url = {http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/research/publications/Mendes-Muehleisen-Bizer-Sieve-LWDM2012.pdf},
howpublished = {invited paper}
}</pre>
<iframe src="http://www.slideshare.net/slideshow/embed_code/12271401" width="427" height="356" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" allowfullscreen webkitallowfullscreen mozallowfullscreen> </iframe>
</li>
</ul>
</div>
<h2 id="acknowledgments">9. Acknowledgments </h2>
<div>
<p>This work was supported in part by the EU FP7
grants <a href="http://lod2.eu/">LOD2 - Creating Knowledge out of Interlinked Data</a> (Grant No. 257943) and <a href="http://www.planet-data.eu">PlanetData - A European Network of Excellence on Large-Scale Data Management</a> (Grant No. 257641) as well as by Vulcan Inc. as part of its <a href="http://www.projecthalo.com">Project Halo</a>.</p>
<p>WooFunction icon set licensed under <a href="http://www.gnu.org/licenses/gpl.html">GNU General Public License</a>.</p>
</div>
</div>
</div>
</body>
</html>