The DBpedia Data Set
The DBpedia data set uses a large multi-domain ontology which has been derived from Wikipedia. The English version of the DBpedia 2014 data set currently describes 4.58 million “things” with 583 million “facts”.
In addition, we provide localized versions of DBpedia in 125 languages. All these versions together describe 38.3 million things, out of which 23.8 million overlap (are interlinked) with concepts from the English DBpedia. The full DBpedia data set features 38 million labels and abstracts in 125 different languages, 25.2 million links to images, 29.8 million links to external web pages, 80.9 million links to Wikipedia categories, and 41.2 million links to YAGO categories. Altogether, the DBpedia 2014 release consists of 3 billion pieces of information (RDF triples), out of which 583 million were extracted from the English edition of Wikipedia, 2.46 billion were extracted from other language editions, and about 50 million are links to external data sets.
- 1. Background
- 2. Content of the DBpedia Data Set
- 3. Denoting or Naming “things”
- 4. Describing “things”
- 4.1. Basic Information
- 4.2. Classifications
- 4.3. Infobox Data
- 4.4. External Links
- 4.5. Geo-Coordinates
- 5. Provenance Meta-Data
- 6. Localized Datasets
- 6.1. Directory structure and file names
- 6.2. Data Set Statistics
- 7. iPopulator
- 8. Datasets for Natural Language Processing (NLP)
- 9. DBpedia as Tables
- 10. License
Wikipedia has grown into one of the central knowledge sources of mankind and is maintained by thousands of contributors. Wikipedia articles consist mostly of free text, but also contain different types of structured information, such as infobox templates, categorisation information, images, geo-coordinates, and links to external Web pages. For instance, the figure below shows the source code and the visualisation of an infobox template containing structured information about the town of Innsbruck.
The DBpedia project extracts various kinds of structured information from Wikipedia editions in 125 languages and combines this information into a huge, cross-domain knowledge base.
DBpedia uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. We use the SPARQL query language to query this data. Please refer to the Developers Guide to Semantic Web Toolkits to find a development toolkit in your preferred programming language to process DBpedia data.
The English version of the DBpedia knowledge base describes 4.58 million things, out of which 4.22 million are classified in a consistent ontology, including 1,445,000 persons, 735,000 places (including 478,000 populated places), 411,000 creative works (including 123,000 music albums, 87,000 films and 19,000 video games), 241,000 organizations (including 58,000 companies and 49,000 educational institutions), 251,000 species and 6,000 diseases.
In addition, we provide localized versions of DBpedia in 125 languages. All these versions together describe 38.3 million things, out of which 23.8 million are localized descriptions of things that also exist in the English version of DBpedia. The full DBpedia data set features 38 million labels and abstracts in 125 different languages, 25.2 million links to images, 29.8 million links to external web pages, 80.9 million links to Wikipedia categories, and 41.2 million links to YAGO categories. DBpedia is connected with other Linked Datasets by around 50 million RDF links. Altogether, the DBpedia 2014 release consists of 3 billion pieces of information (RDF triples), out of which 580 million were extracted from the English edition of Wikipedia and 2.46 billion were extracted from other language editions. Detailed statistics about the DBpedia datasets in 28 popular languages are provided at Dataset Statistics.
The table below contains links to some example “things” from the data set:
Class | Examples |
---|---|
City | Cambridge, Berlin, Manchester |
Country | Spain, Iceland, South Korea |
Politician | George W. Bush, Nicolas Sarkozy, Angela Merkel |
Musician | AC/DC, Diana Ross, Röyksopp |
Music album | Led Zeppelin III, Like a Virgin, Thriller |
Director | Woody Allen, Oliver Stone, Takashi Miike |
Film | The Great Beauty, Hysterical Blindness, Breakfast at Tiffany's |
Book | The Lord of the Rings, The Adventures of Tom Sawyer, the Bible |
Computer Game | Tetris, World of Warcraft, Sam & Max hit the Road |
Technical Standard | HTML, RDF, URI |
You can also use Richard Cyganiak's PHP script to view random things from the DBpedia data set.
Find the properties used in the different DBpedia data sets here.
Each thing in the DBpedia data set is denoted by a de-referenceable IRI- or URI-based reference of the form http://dbpedia.org/resource/Name, where Name is derived from the URL of the source Wikipedia article, which has the form http://en.wikipedia.org/wiki/Name. Thus, each DBpedia entity is tied directly to a Wikipedia article. Every DBpedia entity name resolves to a description-oriented Web document (or Web resource).
Until DBpedia release 3.6, we only used article names from the English Wikipedia, but since DBpedia release 3.7, we also provide localized datasets that contain IRIs like http://xx.dbpedia.org/resource/Name, where xx is a Wikipedia language code and Name is taken from the source URL, http://xx.wikipedia.org/wiki/Name.
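The naming rule above can be sketched in a few lines of Python. This is a minimal illustration, not the extraction framework's own code: the function name `wikipedia_to_dbpedia` is hypothetical, and the sketch assumes the article URL has exactly the `http://xx.wikipedia.org/wiki/Name` form described here.

```python
from urllib.parse import urlsplit


def wikipedia_to_dbpedia(article_url: str) -> str:
    """Map a Wikipedia article URL to the matching DBpedia resource identifier."""
    parts = urlsplit(article_url)
    # The first host label is the Wikipedia language code, e.g. "en" or "ja".
    lang, _, rest = parts.netloc.partition(".")
    if rest != "wikipedia.org" or not parts.path.startswith("/wiki/"):
        raise ValueError(f"not a Wikipedia article URL: {article_url}")
    name = parts.path[len("/wiki/"):]
    # English resources use the generic namespace; all other languages
    # use a language-specific subdomain.
    domain = "dbpedia.org" if lang == "en" else f"{lang}.dbpedia.org"
    return f"http://{domain}/resource/{name}"


print(wikipedia_to_dbpedia("http://en.wikipedia.org/wiki/Berlin"))
print(wikipedia_to_dbpedia("http://ja.wikipedia.org/wiki/ベルリン"))
```

The sketch copies Name through unchanged, so any percent-encoding present in the input URL is preserved as-is.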
Starting with DBpedia release 3.8, we use IRIs for most DBpedia entity names. IRIs are more readable and generally preferable to URIs, but for backwards compatibility, we still use URIs for DBpedia resources extracted from the English Wikipedia and IRIs for all other languages. Triples in Turtle files use IRIs for all languages, even for English.
There are several details on the encoding of URIs that should always be taken into account.
Each DBpedia entity is described by various properties. Below, we give an overview of the most important types of properties.
Every DBpedia resource is described by a label, a short and long English abstract, a link to the corresponding Wikipedia page, and a link to an image depicting the thing (if available).
If a thing exists in multiple language versions of Wikipedia, then short and long abstracts within these languages and links to the different language Wikipedia pages are added to the description. The DBpedia data set contains the following numbers of abstracts for the top 12 languages (September 2014):
Language | Number of Abstracts |
---|---|
English | 4,636,000 |
Dutch | 1,809,000 |
French | 1,799,000 |
German | 1,781,000 |
Swedish | 1,779,000 |
Russian | 1,737,000 |
Japanese | 1,661,000 |
Chinese | 1,402,000 |
Polish | 1,307,000 |
Spanish | 1,254,000 |
Italian | 1,156,000 |
Vietnamese | 1,061,000 |
DBpedia provides three different classification schemata for things.
- Wikipedia Categories are represented using the SKOS vocabulary and DCMI terms.
- The YAGO Classification is derived from the Wikipedia category system using WordNet. Please refer to Yago: A Core of Semantic Knowledge – Unifying WordNet and Wikipedia (PDF) for more details.
- WordNet Synset Links were generated by manually relating Wikipedia infobox templates and WordNet synsets, and adding a corresponding link to each thing that uses a specific template. In theory, this classification should be more precise than the Wikipedia category system.
Using these classifications within SPARQL queries allows you to select things of a certain type.
- NBA Teams (Does not work with Internet Explorer)
- Car manufacturers
Wikipedia infoboxes contain very specific information about things and are thus a very valuable source of structured information that can be used to ask expressive queries against Wikipedia. The DBpedia project currently extracts three different datasets from the Wikipedia infoboxes.
- The Infobox Dataset is created using our initial, now three-year-old infobox parsing approach. This extractor extracts all properties from all infoboxes and templates within all Wikipedia articles. Extracted information is represented using properties in the http://dbpedia.org/property/ namespace. The names of these properties directly reflect the names of the Wikipedia infobox properties. Property names are not cleaned or merged. Property types are not part of a subsumption hierarchy and there is no consistent ontology for the infobox dataset. Currently, there are approximately 8000 different property types. The infobox extractor performs only a minimal amount of property value clean-up, e.g., by converting a value like “June 2009” to the XML Schema format “2009-06”. You should therefore use the infobox dataset only if your application requires complete coverage of all Wikipedia properties and you are prepared to accept relatively noisy data.
- The Mapping-based Datasets. With the DBpedia 3.2 release, we introduced a new infobox extraction method which is based on hand-generated mappings of Wikipedia infoboxes/templates to a newly created DBpedia ontology. The mappings adjust weaknesses in the Wikipedia infobox system, like using different infoboxes for the same type of thing (class) or using different property names for the same property. Therefore, the instance data within the Mapping-based Dataset is much cleaner and better structured than the Raw Infobox Dataset, but currently doesn't cover all infobox types and infobox properties within Wikipedia. Starting with DBpedia release 3.5, we provide three different Mapping-based Datasets:
  - The Mapping-based Types dataset contains the rdf:types of the instances which have been extracted from the infoboxes.
  - The Mapping-based Properties dataset contains the actual data values that have been extracted from infoboxes. The data values are represented using ontology properties (e.g., 'volume') that may be applied to different things (e.g., the volume of a lake and the volume of a planet). This restricts the number of different properties to a minimum, but has the drawback that it is not possible to automatically infer the class of an entity based on a property. For instance, an application that discovers an entity described using the volume property cannot infer that the entity is a lake and then, for example, use a map to visualize the entity. These properties follow the http://dbpedia.org/ontology/{propertyname} naming schema. All values are normalized to their respective SI unit.
  - The Mapping-based Properties (Specific) dataset contains properties which have been specialized for a specific class using a specific unit, e.g., the property height is specialized on the class Person using the unit centimetres instead of metres. Specialized properties follow the http://dbpedia.org/ontology/{Class}/{property} naming schema (e.g. http://dbpedia.org/ontology/Person/height). The properties have a single class as rdfs:domain and rdfs:range and can therefore be used for classification reasoning. This makes it easier to express queries against the data, e.g., finding all lakes whose volume is in a certain range. Typically, the range of these properties does not use SI units, but a unit which is more appropriate in the specific domain.
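The kind of minimal value clean-up mentioned for the Infobox Dataset (turning a month-and-year string into an XML Schema year-month value) might look like the following sketch. It is illustrative only: the function name `normalize_month_year` and the decision to return unrecognised values unchanged are assumptions, not the behaviour of the actual infobox extractor.

```python
import re

MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}


def normalize_month_year(value: str) -> str:
    """Convert e.g. 'June 2009' to '2009-06'; return other values unchanged."""
    match = re.fullmatch(r"\s*([A-Za-z]+)\s+(\d{4})\s*", value)
    if not match:
        return value
    month = MONTHS.get(match.group(1).capitalize())
    if month is None:
        # The first word is not an English month name: leave the value as-is.
        return value
    return f"{match.group(2)}-{month:02d}"


print(normalize_month_year("June 2009"))
```

An explicit month table is used instead of `datetime.strptime` so the result does not depend on the process locale.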
All three data sets are available for download as well as being available for queries via the DBpedia SPARQL endpoint.
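To make the two ontology naming schemas concrete, here is a small Python sketch that builds a generic and a class-specific property identifier and converts a height from the SI unit (metres) to the unit used by the specialized Person property (centimetres). The helper names are hypothetical and the height value is made up for illustration.

```python
ONTOLOGY_NS = "http://dbpedia.org/ontology/"


def generic_property(name: str) -> str:
    """Identifier of a generic ontology property, e.g. .../ontology/height."""
    return ONTOLOGY_NS + name


def specific_property(cls: str, name: str) -> str:
    """Identifier of a property specialized for one class, e.g. .../ontology/Person/height."""
    return f"{ONTOLOGY_NS}{cls}/{name}"


def metres_to_centimetres(value_m: float) -> float:
    """Convert an SI-normalized length to centimetres."""
    return value_m * 100


height_m = 1.75  # made-up example value, in metres
print(generic_property("height"), height_m)
print(specific_property("Person", "height"), metres_to_centimetres(height_m))
```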
The mapping-based data enables sophisticated, fine-grained queries over the data set. Some example queries are shown below:
- Abstracts of movies starring Tom Cruise, released before 1999
- The official websites of companies with more than 50000 employees
- Cities with more than 2 million inhabitants
List all episodes of the HBO television series The Sopranos ordered by their air-date:
SELECT *
WHERE
{
?e <http://dbpedia.org/ontology/series> <http://dbpedia.org/resource/The_Sopranos> .
?e <http://dbpedia.org/ontology/releaseDate> ?date .
?e <http://dbpedia.org/ontology/episodeNumber> ?number .
?e <http://dbpedia.org/ontology/seasonNumber> ?season
}
ORDER BY DESC(?date)
Software developed by an organisation founded in California:
SELECT *
WHERE
{
?company a <http://dbpedia.org/ontology/Organisation> .
?company <http://dbpedia.org/ontology/foundationPlace> <http://dbpedia.org/resource/California> .
?product <http://dbpedia.org/ontology/developer> ?company .
?product a <http://dbpedia.org/ontology/Software>
}
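Queries like the two above can also be sent to a SPARQL endpoint programmatically. The sketch below only builds the HTTP request with the Python standard library and does not send it. The endpoint URL `https://dbpedia.org/sparql` and the helper name `build_sparql_request` are assumptions; the `query` parameter and the `application/sparql-results+json` media type come from the SPARQL protocol and results format.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://dbpedia.org/sparql"  # assumed endpoint URL

QUERY = """
SELECT ?product WHERE {
  ?company a <http://dbpedia.org/ontology/Organisation> .
  ?company <http://dbpedia.org/ontology/foundationPlace> <http://dbpedia.org/resource/California> .
  ?product <http://dbpedia.org/ontology/developer> ?company .
  ?product a <http://dbpedia.org/ontology/Software>
}
"""


def build_sparql_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Build a GET request that asks the endpoint for JSON query results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(QUERY)
print(request.full_url[:60])
```

To actually run the query, pass the request to `urllib.request.urlopen` and decode the response body with `json.load`.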
The DBpedia data set contains HTML links to external web pages as well as RDF links into external data sources.
There are two types of links to HTML pages: dbpedia:reference links point to several web pages about a thing. In addition, some things also have foaf:homepage links that point to web pages that can be considered the “official homepage” of a thing.
RDF links are represented using the owl:sameAs property. Please refer to Interlinking for more information about RDF links and the interlinked data sets.
- Geographical (to GeoNames.org, Eurostat data, and the RDF version of the CIA Factbook):
- Authors / Books (to quotationsbook.com and Project Gutenberg RDF. Links to the RDF Book Mashup will follow soon):
- Computer Scientist publications DBLP:
- U.S. Census Statistical Data rdfabout.com, RDF version by Joshua Tauber:
Besides coordinates extracted from infoboxes, the DBpedia data set contains additional geo-coordinates for 1,094,000 geographic locations. Geo-coordinates are expressed using the W3C Basic Geo Vocabulary.
Besides simple listings of geo-coordinates (e.g., German soccer stadiums), the new geo-coordinates allow sophisticated queries, like “show me all things next to the”:
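A proximity query of this kind comes down to comparing distances between latitude/longitude pairs. The Python sketch below uses the haversine great-circle formula as one possible way to do that on the client side; it is not how the DBpedia endpoint evaluates such queries. The helper names and the sample points are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def things_within(things: dict, lat: float, lon: float, radius_km: float) -> list:
    """Names of all things whose (lat, long) lies within radius_km of the given point."""
    return [
        name
        for name, (t_lat, t_lon) in things.items()
        if haversine_km(lat, lon, t_lat, t_lon) <= radius_km
    ]


# Hypothetical points, not taken from the data set.
sample = {"A": (0.0, 0.0), "B": (0.0, 1.0), "C": (0.0, 5.0)}
print(things_within(sample, 0.0, 0.0, 150.0))
```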
In addition to the triples provided by the N-Triples datasets, the N-Quads datasets include a provenance URI for each statement. The provenance URI denotes the origin of the extracted triple in Wikipedia.
The provenance URI is composed of the URI of the Wikipedia article from which the statement was extracted and a number of parameters denoting the exact source line. The following parameters are set:
- absolute-line: The (absolute) line in the Wikipedia article source. The first line of a source has the line number 1.
- relative-line: The line in the Wikipedia article source relative to the current section.
- section: The section inside the article.
Example:
http://en.wikipedia.org/wiki/BMW_7_Series#section=E23&relative-line=1&absolute-line=23
The source of the given statement can be found in the 23rd line of the article source, which is the first line of the section “E23”.
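Because the parameters are carried in the fragment of the provenance URI, they can be recovered with standard URL parsing. The following Python sketch assumes all three parameters are present, as in the example above; the function name `parse_provenance` is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def parse_provenance(uri: str) -> dict:
    """Split a provenance URI into the article URI and its line parameters."""
    parts = urlsplit(uri)
    # The parameters are encoded like a query string inside the fragment.
    params = parse_qs(parts.fragment)
    return {
        "article": parts._replace(fragment="").geturl(),
        "section": params["section"][0],
        "relative_line": int(params["relative-line"][0]),
        "absolute_line": int(params["absolute-line"][0]),
    }


info = parse_provenance(
    "http://en.wikipedia.org/wiki/BMW_7_Series"
    "#section=E23&relative-line=1&absolute-line=23"
)
print(info)
```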
The localized datasets contain the complete DBpedia data from non-English Wikipedias. Until DBpedia release 3.6, we extracted data from non-English Wikipedia pages only if an equivalent English page existed, as we wanted to have a single URI to identify a resource across all 97 languages. However, many pages in the non-English Wikipedia editions do not have an equivalent English page (especially small towns in different countries, e.g. the Austrian village Endach, or legal and administrative terms that are relevant only for a single country). Relying on English URIs only therefore meant that DBpedia did not contain data for these entities, a shortcoming that many DBpedia users complained about.
Since the DBpedia 3.7 release, we provide localized DBpedia editions for download that contain data from all Wikipedia pages in a specific language. In DBpedia 3.7, these localized editions covered the following 15 languages: ca, de, el, es, fr, ga, hr, hu, it, nl, pl, pt, ru, sl, tr. Starting with DBpedia 3.8, we provide localized DBpedia editions for all languages.
The IRIs identifying entities in these internationalized datasets are constructed directly from the non-English title and a language-specific URI namespace (e.g. http://ja.dbpedia.org/resource/ベルリン), so there are now many different URIs in DBpedia that refer to Berlin.
We also extract the inter-language links from the different Wikipedia editions. Thus, whenever an inter-language link between a non-English Wikipedia page and its English equivalent exists, the resulting owl:sameAs link can be used to relate the localized DBpedia URI to the equivalent in the main (English) DBpedia edition. The localized DBpedia editions are provided for download on the DBpedia download page.
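As a rough illustration of how these owl:sameAs links can be used, the Python sketch below scans N-Triples lines and builds a lookup from localized identifiers to their counterparts in the main `http://dbpedia.org/resource/` namespace. The regular expression handles only simple IRI-to-IRI triples, and the sample lines are made up for the example rather than copied from a dump.

```python
import re

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
MAIN_NS = "http://dbpedia.org/resource/"
# Matches only triples whose subject, predicate and object are all IRIs.
TRIPLE = re.compile(r"<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.")


def same_as_index(ntriples_lines) -> dict:
    """Map each localized identifier to its equivalent in the main DBpedia namespace."""
    index = {}
    for line in ntriples_lines:
        match = TRIPLE.match(line.strip())
        if not match:
            continue
        subject, predicate, obj = match.groups()
        if predicate == OWL_SAME_AS and obj.startswith(MAIN_NS):
            index[subject] = obj
    return index


# Illustrative lines only.
lines = [
    "<http://ja.dbpedia.org/resource/ベルリン> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Berlin> .",
    "<http://ja.dbpedia.org/resource/ベルリン> <http://www.w3.org/2002/07/owl#sameAs> <http://el.dbpedia.org/resource/Βερολίνο> .",
]
print(same_as_index(lines))
```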
Note that not all localized editions provide public SPARQL endpoints, nor do all localized URIs dereference. This might change in the future, as more local DBpedia chapters are set up in different countries as part of the DBpedia internationalization effort.
All DBpedia IRIs/URIs in the canonicalized datasets use the generic namespace http://dbpedia.org/. For backwards compatibility, the N-Triples files (.nt, .nq) use URIs, e.g. http://dbpedia.org/resource/Bo%C3%B6tes. The Turtle (.ttl) files use IRIs, e.g. http://dbpedia.org/resource/Boötes.
The localized datasets use DBpedia IRIs (not URIs) and language-specific namespaces, e.g. http://el.dbpedia.org/resource/Βερολίνο.
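The difference between the URI and IRI forms is percent-encoding of the UTF-8 bytes of non-ASCII characters. The Python sketch below converts between the two forms for the Boötes example. It is a simplification: the helper names are hypothetical, and `urllib.parse.quote` also escapes some ASCII punctuation, which may not match DBpedia's own encoding rules.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Percent-encode the path of an IRI (UTF-8), leaving '/' untouched."""
    parts = urlsplit(iri)
    return urlunsplit(parts._replace(path=quote(parts.path, safe="/")))


def uri_to_iri(uri: str) -> str:
    """Decode percent-encoded UTF-8 sequences in the path of a URI."""
    parts = urlsplit(uri)
    return urlunsplit(parts._replace(path=unquote(parts.path)))


print(iri_to_uri("http://dbpedia.org/resource/Boötes"))
print(uri_to_iri("http://dbpedia.org/resource/Bo%C3%B6tes"))
```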
For the DBpedia 3.7 release, we created two separate folders on the download server: /3.7/ for the datasets using 'English' URIs, /3.7-i18n/ for the datasets using 'local' URIs. Since the 3.8 release, we abandoned this high-level distinction and instead offer different dataset files in the download folder for each language, where the files with 'local' IRIs are named after their dataset, while the files with 'English' URIs/IRIs append _en_uris to the dataset name. Examples:
- /3.7/de/labels_de.nt.bz2 -> /3.8/de/labels_en_uris_de.nt.bz2 (Canonicalized dataset, 'English' URIs)
- /3.7-i18n/de/labels_de.nt.bz2 -> /3.8/de/labels_de.nt.bz2 (Internationalized dataset, 'local' IRIs)
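The file naming convention used since release 3.8 can be summarised in a short Python helper. The function name `dataset_path` and its parameters are hypothetical; the pattern is inferred from the two examples above and may not cover every file on the download server.

```python
def dataset_path(release: str, lang: str, dataset: str,
                 english_uris: bool = False, ext: str = "nt.bz2") -> str:
    """Build the download path of a dataset file (3.8-style naming)."""
    # Canonicalized files with 'English' URIs carry an extra _en_uris marker.
    marker = "_en_uris" if english_uris else ""
    return f"/{release}/{lang}/{dataset}{marker}_{lang}.{ext}"


print(dataset_path("3.8", "de", "labels", english_uris=True))
print(dataset_path("3.8", "de", "labels"))
```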
- Dataset Statistics provides detailed statistics about the DBpedia datasets in 28 languages.
- Cross-Language Statistics provides statistics about the cross-language overlap of instances and property values between these languages.
iPopulator is a system that automatically populates infoboxes of Wikipedia articles by extracting attribute values from the article's text. The dataset contains new extracted information and complements DBpedia's attribute values.
Each and every dataset from DBpedia is potentially useful for several tasks related to Natural Language Processing (NLP) and Computational Linguistics. We have described in Datasets/NLP a few examples of how to use these datasets. Moreover, we describe a number of extended datasets that were generated during the creation of DBpedia Spotlight and other NLP-related projects.
As some of the potential users of DBpedia might not be familiar with the RDF data model and the SPARQL query language, we also provide some of the core DBpedia data in the form of Comma-Separated Values (CSV) files that can easily be processed using standard tools such as spreadsheet applications, relational databases or data mining tools. More information about the tabular version of DBpedia can be found at DBpediaAsTables.
DBpedia is derived from Wikipedia and is distributed under the same licensing terms as Wikipedia itself. As Wikipedia has moved to dual-licensing, we also dual-license DBpedia starting with release 3.4.
DBpedia data from version 3.4 on is licensed under the terms of the Creative Commons Attribution-ShareAlike 3.0 license and the GNU Free Documentation License. All DBpedia releases up to and including release 3.3 are licensed under the terms of the GNU Free Documentation License only.
Attribution in this case means keeping DBpedia URIs visible and active through at least one (preferably all) of @href, <link />, or “Link:”. If live links are impossible (e.g., when printed on paper), a textual blurb-based attribution is acceptable.
This material is Open Knowledge.