Improving CSS selectors for NIF extraction
The newest addition to the dataset family of DBpedia records the whole wiki text, its basic structure (sections, titles, paragraphs, etc.) and the included text links. As a by-product, the abstracts are extracted as well, along with other useful data (i.e. all included tables in their raw HTML format and equations in MathML).
For the content of the different NIF datasets please refer to this page: http://wiki.dbpedia.org/downloads-2016-10#p10608-2.
While we are happy with the results of this extraction, it could always be improved. Since I tested mainly on the English and German Wikipedia, extractions for other languages might include HTML artifacts or fail to capture the correct CSS paths to find a certain element. I added the necessary paths to find the end of the page for most of the mapping languages, but these need to be reviewed by the language chapters or by individuals fluent in the language in question.
For those interested in improving this extraction for a given language, you can contribute by adding and updating, in this JSON file, the CSS selectors that pinpoint certain areas of a wiki page.
This CSS mapping file specifies certain categories of CSS selectors for each language. The 'default' entry is equivalent to the CSS selectors needed for the extraction of the English Wikipedia. Entries for other languages are combined with the entries in the default mappings before they are applied to the NIF extraction, with the language-specific entries tested first. All entries in one of those categories are treated as alternatives: if any of the listed CSS selectors (those of the language in question or the default entries) matches an element of the given wiki page, that element is assumed to be the target element and the extraction moves on.
Note: we use CSS 3 selectors to query for target elements.
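The merge-and-fallback behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the extraction framework's actual code: the shape of `CSS_MAPPINGS`, the German selector, and the helper names `selectors_for` and `first_match` are assumptions made for this example, and `query` stands in for whatever CSS engine evaluates a selector against a page.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical excerpt of the mapping file; the real file may be shaped differently.
CSS_MAPPINGS: Dict[str, Dict[str, List[str]]] = {
    "default": {
        "nif-find-pageend": ["span[id*='eference']"],
        "nif-find-toc": [".toc", "span[id='toc']"],
    },
    "de": {
        # Illustrative language-specific entry, not taken from the real file.
        "nif-find-pageend": ["span[id='Einzelnachweise']"],
    },
}


def selectors_for(
    mappings: Dict[str, Dict[str, List[str]]], lang: str, category: str
) -> List[str]:
    """Language-specific selectors first, then the defaults, without duplicates."""
    specific = mappings.get(lang, {}).get(category, [])
    default = mappings.get("default", {}).get(category, [])
    merged: List[str] = []
    for selector in specific + default:
        if selector not in merged:
            merged.append(selector)
    return merged


def first_match(
    selectors: List[str], query: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Try each alternative in order and return the first element found."""
    for selector in selectors:
        element = query(selector)
        if element is not None:
            return element
    return None
```

A language without its own entry simply falls back to the default list, which is why a new language only needs to list the selectors that differ from the English ones.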
The different categories are defined as follows:
- "nif-find-pageend": CSS selectors to detect the end of a wiki page, which is usually right before the References section (references are recorded in the citations datasets). Example: "span[id*='eference']" - any span element with an id attribute containing the 'eference' string.
- "nif-find-next-title": Detect a title (and thereby the next section). Example: ["h1", "h2", "h3", "h4"] - any title tag (this probably does not need improvement).
- "nif-find-toc": Detect the start of the table of contents (and thereby the end of the abstract). Example: [".toc", "span[id='toc']"] - find an element with the class 'toc' or any span tag with the id 'toc'.
- "nif-remove-elements": Specify elements which should be removed from the extraction. Use this list to point out any element which produces unwanted artifacts in the extracted text. Example: ".noprint" removes every element with the 'noprint' class, and "div[role='img']" removes any div whose 'role' attribute is 'img'.
- "nif-replace-elements": Similar to remove-elements, this map replaces a given target with something else (separated by '->'). The generic constant $c is used to denote the (textual) content of the original target element. Example: "ul > li:not(:last-child) -> \n* $c" - capture the content of any li element (if it is not the last in a given list) and insert a newline character and an asterisk before it (markdown style).
- "nif-note-elements": Similar to replace-elements, this map replaces any element which is tagged as a note (often indented and in a smaller font) with something else. Example: "div.hatnote -> ($c)" - places the note inside brackets.
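To make the "selector -> replacement" notation of the replace and note categories concrete, here is a minimal Python sketch of how such a rule could be split and its $c placeholder filled in. The function names `parse_rule` and `apply_template` are hypothetical, and the whitespace handling (only spaces around the arrow are trimmed, so a leading newline in the replacement survives) is an assumption rather than the framework's documented behaviour.

```python
from typing import Tuple


def parse_rule(rule: str) -> Tuple[str, str]:
    """Split 'selector -> replacement' at the first '->'."""
    selector, arrow, template = rule.partition("->")
    if not arrow:
        raise ValueError(f"rule has no '->' separator: {rule!r}")
    # Assumption: trim only spaces, so a leading '\n' in the template is kept.
    return selector.strip(), template.strip(" ")


def apply_template(template: str, content: str) -> str:
    """Substitute the textual content of the matched element for $c."""
    return template.replace("$c", content)


# The two examples from the list above.
note_selector, note_template = parse_rule("div.hatnote -> ($c)")
list_selector, list_template = parse_rule("ul > li:not(:last-child) -> \n* $c")
```

With these rules, `apply_template(note_template, "See also X")` wraps the note text in brackets, and `apply_template(list_template, "item")` puts a newline and an asterisk in front of the list item's text.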
Please update or add languages (using the same object structure) in this document. Push directly to master, provide a pull request, or point out possible changes on the issue tracker. We will review all updates before starting the extraction.
Thank you for contributing to this effort.