GSoC_2016_Progress_Simone
This project aims to build a module for the Extraction Framework capable of extracting useful RDF-formatted data from the tables of a wiki page. These structures are quite particular: they convey data in a semi-structured way. The first approach is to take a domain with interesting data and build a Python script to retrieve it. The script would get a JSON representation of the wiki page using JSONpedia.
- Marco Fossati
- Claudia Diamantini
- Domenico Potena
- Emanuele Storti
I encountered a lot of obstacles with my first approach to the problem. In the GSoC proposal I stated that I would use the JSONpedia web service to retrieve a JSON representation of selected wiki pages. The problems started there. JSONpedia is a wonderful community project (and I want to thank Michele Mostarda for his efforts in making the web service efficient and available), but it has some difficulties dealing with the tables of a wiki page. I have to say that most of the problems do not strictly depend on JSONpedia itself, but on the users who wrote the tables in question. I love the Wikipedia spirit of community effort and freedom in writing articles, but it creates problems when you have to extract data from complex structures like wiki tables. Users do not care about wikitext validation as long as they think the HTML representation of the table is correct. So even if there are errors in the wikitext, they are not a problem in the rendered pages, since the MediaWiki parser and browsers generally do a good job of interpreting and handling small errors. The problem shows up in JSONpedia, which wraps the wikitext and so propagates the errors. In the first part of my work (until the midterm evaluation) I tried to manage and resolve the problems caused by users' errors, but it cost me too much time and did not seem to come to an end. So, in agreement with my mentors, I changed approach and started retrieving HTML representations of wiki pages (using lxml) instead of JSON ones (see the sketch after the list below). I did lose a lot of time dealing with this, as I had to build two parsers, one for HTML and one for JSON. The main issues with the JSON approach were:
- Wikitext errors are spread over the JSON representation of a resource.
- There is no tag for row delimitation, so it is difficult to know what a cell is (header or data).
- There is no tag for headers, so, as tables can have multiple header rows, it is hard to know which rows are headers and which are data.
- If a cell is empty in the wikitext, it is not written in the JSON representation, so we lose the positional information needed to reconstruct the table structure.
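As a rough illustration of the HTML approach mentioned above, here is a minimal sketch (not the project's actual code) that fetches a rendered page and locates its table nodes with lxml and XPath. The helper name, the URL scheme and the 'wikitable' class are assumptions based on how Wikipedia renders pages.

```python
# Minimal sketch of the HTML approach: fetch the rendered wiki page and pick out
# its tables with an XPath query. Helper name and URL layout are illustrative only.
import requests
from lxml import html

def find_wiki_tables(chapter, resource):
    url = "https://{}.wikipedia.org/wiki/{}".format(chapter, resource)
    page = html.fromstring(requests.get(url).content)
    # Most data tables carry the 'wikitable' class in the rendered HTML.
    return page.xpath("//table[contains(@class, 'wikitable')]")

tables = find_wiki_tables("it", "Elezioni_presidenziali_negli_Stati_Uniti_d'America_del_1912")
for table in tables:
    print(len(table.xpath(".//tr")), "rows")
```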
- Make stable software capable of analyzing a topic for one wiki language. +
- Extend the topics and languages involved. []
- Try to translate the Table Extractor into Scala in order to make integration with the Extraction Framework possible. -
Reification of the row concept:
<http://it.dbpedia.org/resource/Elezioni_presidenziali_negli_Stati_Uniti_d'America_del_1912> <http://dbpedia.org/ontology/Election> <http://it.dbpedia.org/resource/Elezioni_presidenziali_negli_Stati_Uniti_d'America_del_1912__1> .
Data regarding that row:
<http://it.dbpedia.org/resource/Elezioni_presidenziali_negli_Stati_Uniti_d'America_del_1912__1> <http://dbpedia.org/ontology/PoliticalParty> <http://it.dbpedia.org/resource/Partito_Democratico_(Stati_Uniti_d'America)> ; <http://dbpedia.org/ontology/President> <http://it.dbpedia.org/resource/Thomas_Woodrow_Wilson> ; <http://dbpedia.org/ontology/VicePresident> "Thomas Riley Marshal"^^xsd:string ; <http://dbpedia.org/ontology/popularVote> "6296184"^^xsd:positiveInteger ; <http://dbpedia.org/property/electoralVote> "435"^^xsd:positiveInteger .
Reification of the row concept, same as for the USA elections:
<http://it.dbpedia.org/resource/Elezioni_politiche_italiane_del_2006> <http://dbpedia.org/ontology/Election> <http://it.dbpedia.org/resource/Elezioni_politiche_italiane_del_2006__1>,
Data regarding that row:
<http://it.dbpedia.org/resource/Elezioni_politiche_italiane_del_2006__1> <http://dbpedia.org/ontology/PoliticalParty> <http://it.dbpedia.org/resource/L'Unione> ; <http://dbpedia.org/ontology/President> <http://it.dbpedia.org/resource/Carlo_Perrin> ; <http://dbpedia.org/ontology/popularVote> "198"^^xsd:positiveInteger ; <http://dbpedia.org/property/pvPct> "34.54"^^xsd:float .
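To make the reification pattern above concrete, here is a hedged sketch of how such triples could be produced with rdflib. The URIs mirror the examples; the code layout and variable names are illustrative, not the Mapper's actual implementation.

```python
# Hedged sketch of the row reification pattern shown above, built with rdflib.
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import XSD

DBO = Namespace("http://dbpedia.org/ontology/")
DBP = Namespace("http://dbpedia.org/property/")
RES = Namespace("http://it.dbpedia.org/resource/")

g = Graph()
page = RES["Elezioni_presidenziali_negli_Stati_Uniti_d'America_del_1912"]
row = URIRef(str(page) + "__1")            # one 'row' resource per table row

g.add((page, DBO.Election, row))           # reification of the row concept
g.add((row, DBO.PoliticalParty, RES["Partito_Democratico_(Stati_Uniti_d'America)"]))
g.add((row, DBO.President, RES["Thomas_Woodrow_Wilson"]))
g.add((row, DBP.electoralVote, Literal(435, datatype=XSD.positiveInteger)))

print(g.serialize(format="turtle"))
```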
Note: General election pages are subject to a condition that is widespread across many topic pages. In some election pages, such as 2013_Administrative_Italian_Elections, there are several tables covering different electoral results but sharing the same headers. Other topics are affected by the same condition (see 2015_Edition_of_TT), which causes a 'same row' problem in the extracted dataset. For every page, every row is extracted, and the data cells of each row are associated with row number X, where X is the position of that row in its own table. Since the extractor is not able to analyze more than one table at a time, it does not know what happened with the previous table, and some names used for the row reifications are repeated. This means that more than one cell of the same type ends up associated with the same row. E.g. 2013_Italian_Presidential_Election, which has many different tables with the same headers:
<http://it.dbpedia.org/resource/Elezione_del_Presidente_della_Repubblica_Italiana_del_2013__1> <http://dbpedia.org/ontology/popularVote> "10"^^xsd:positiveInteger, "210"^^xsd:positiveInteger, "230"^^xsd:positiveInteger, "250"^^xsd:positiveInteger, "395"^^xsd:positiveInteger, "521"^^xsd:positiveInteger, "738"^^xsd:positiveInteger ; <http://dbpedia.org/property/pvPct> "1.0"^^xsd:float .
I contacted my mentors about this, but so far none of them has discussed it with me.
- Possibility to customize the topics and the corresponding stored SPARQL where_clauses, in order to easily extend the domains involved. You can launch the extraction over a topic, a single resource, or a set of resources determined by a SPARQL query.
- The lists of resources involved in your extractions are stored, to help the user understand whether the targeted scope has been correctly hit.
- A log file containing info and a complete report on the analysis, resource by resource and table by table, plus a final extraction report.
- WYSIWYG approach: as the mapping rules are based on the header text, to create a new mapping rule you simply add the text you see as a header, and you are sure to select the right cells.
- Useful reports that increase the effectiveness of the software: since the mapping rules are based on the text of the header cells, and users change that text even in pages related to the same domain, it is good to know which headers of a selected topic have not been mapped because they do not match any of the mapping rules already set. E.g. take a look at USA_presidential_election_2000 and USA_presidential_election_2004: the tables under the 'Risultati' section have the same structure and the same kinds of data cells, but slightly different header text (2000 page: 'Presidente', 2004 page: 'Presidente (Stato)'). A mapping based on the header text therefore has to differentiate these cases, and if you do not know all the possible cases it is impossible to map a domain effectively. The log file helps you resolve this problem: just before the final report there is a section listing the headers the Extractor could not map due to the lack of an adequate mapping rule, along with value examples and the resource where the problem occurs (a sketch of such a header-based rule table follows the log excerpt below). E.g.
- INFO 08/23 03:27:44 AM - Header: -Stato- , Value example ['Virginia', 'Virginia'], Resource: Elezioni_presidenziali_negli_Stati_Uniti_d'America_del_1789
- INFO 08/23 03:27:44 AM - Header: -Stato di origine- , Value example ['Illinois', 'Illinois'], Resource: Elezioni_presidenziali_negli_Stati_Uniti_d'America_del_2008
- INFO 08/23 03:27:44 AM - Header: -Stati vinti- , Value example ['Washington_DC', '28 + Washington DC'], Resource: Elezioni_presidenziali_negli_Stati_Uniti_d'America_del_2008
- INFO 08/23 03:27:44 AM - Header: -Metodo di selezione dei grandi elettori- , Value example ["Gli elettori furono investiti dall'assemblea legislativa dello Stato."], Resource: Elezioni_presidenziali_negli_Stati_Uniti_d'America_del_1789
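As mentioned in the feature list above, the mapping rules are keyed on the header text exactly as it appears in the page. Here is a minimal sketch of what such a rule table could look like; the dictionary name, the specific entries and the lookup helper are assumptions for illustration, not the extractor's actual code.

```python
# Assumed shape of a header-text based mapping rule table (illustrative only):
# the key is the header as it appears in the page, the value is the
# ontology/property IRI used for the cells found under that header.
MAPPING_RULES_IT_ELECTIONS = {
    "Candidati": "http://dbpedia.org/ontology/President",
    "Voti": "http://dbpedia.org/ontology/popularVote",
    "Grandi elettori": "http://dbpedia.org/property/electoralVote",
    # Headers that differ only slightly need their own entries, which is exactly
    # what the 'no mapping rule found' report helps to discover.
    "Presidente": "http://dbpedia.org/ontology/President",
    "Presidente (Stato)": "http://dbpedia.org/ontology/President",
}

def property_for_header(header_text):
    # Returns None when no rule matches, so the caller can log the header
    # in the 'no mapping rule found' section of the report.
    return MAPPING_RULES_IT_ELECTIONS.get(header_text.strip())
```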
The main problem in dealing with tables is the users who write them. Since everyone can add tables to a wiki page, tables are as heterogeneous as the imagination of those who create them. Even among pages on the same topic, tables are structured differently or contain data with different meanings. This is not entirely the users' fault: they are writing wikitext in order to add a table to a wiki page, so even if there are logical or structural errors (headers written as data cells, cells used as vertical divider lines), as long as they see a fancy, graphically pleasing table in their browser they will remain happy and unconcerned about these small (but, for us, fatal) 'errors'.
As soon as the final evaluation of my project is completed, I would like to:
- Improve the mapping class and methods so that a user of the Table Extractor can write their own mapping rules without touching the code. I think a structure containing the actual mapping rules would greatly help the Table Extractor's extensibility. You can find a sketch of a possible solution here.
- Expand the scopes analyzed, as, to be honest, I had problems finding the time to extend the domains of interest.
- Test the solution over many different table types, as it has problems with very complex tables like this one (hard to notice, but there is a header cell with rowspan=17 used to create a fancy divider line, which causes problems for the data extraction).
- Make it possible to add concepts to the DBpedia ontology | properties set, as a lot of data inside tables refers to concepts not yet described there.
08/23/2016 Last day of GSoC. Final small software and mapping fixes; added more log and dataset examples. Improved the docstrings in every file. Readme updated. Added a requirements file.
08/21/2016 New mapping feature: once data are extracted, and before automatically mapping them as DBpedia resources, the Table Extractor asks DBpedia directly whether the value is a resource or not. Only actually existing resources are added to the graph as "http://dbpedia.org/resource/ResourceName"; the others are added as simple string literals.
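A hedged sketch of the existence check described in this entry, using an ASK query against the public DBpedia SPARQL endpoint; the endpoint, the query shape and the helper name are assumptions rather than the project's exact code.

```python
# Ask DBpedia whether a candidate value is an existing resource; if not, the
# value is kept as a plain literal. Illustrative sketch, not the real code.
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_resource_exists(resource_name, endpoint="http://dbpedia.org/sparql"):
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery("ASK { <http://dbpedia.org/resource/%s> ?p ?o }" % resource_name)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert().get("boolean", False)

value = "Thomas_Woodrow_Wilson"
if dbpedia_resource_exists(value):
    obj = "http://dbpedia.org/resource/" + value   # add as a resource IRI
else:
    obj = value                                    # add as a simple string literal
```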
08/19/2016 New feature: once the extraction is finished, headers in the 'no mapping rule found' condition (refer to the software features) are reported in the log. This means you can now easily find headers that were not mapped due to the lack of a corresponding mapping rule, so it is simpler to increase effectiveness on a domain.
08/17/2016 Log improvement: after the last resource has been analyzed, a small final REPORT is now printed. This report contains a number of statistics about the overall extraction. Here is an example of the final report for 'elections'-'it':
- Total # of resources collected for this topic: 52
- Total # of resources analyzed: 52
- Total # tables found : 53
- Total # tables analyzed : 53
- Total # of rows extracted: 295
- Total # of data cells extracted : 1723
- Total # of exceptions extracting data : 1
- Total # of 'header not resolved' errors : 0
- Total # of 'no headers' errors : 1
- Total # of mapping errors : 0
- Total # of 'no mapping rule' errors : 33
- Total # cells mapped : 1443
- Total # of triples serialized : 1730
08/11/2016 - The HtmlTableParser can now find the section under which the table being analyzed resides. E.g. in 2004_USA_presidential_election the table resides under the section "Risultati". You can find some extraction examples here.
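A sketch of how this section lookup can be done on the rendered HTML: take the nearest h2/h3 heading that precedes the table in document order. The XPath expression, the 'mw-headline' class and the local file name are assumptions about Wikipedia's rendered markup, not necessarily the parser's exact query.

```python
# Find the section heading that precedes a given table node (illustrative sketch).
from lxml import html

def section_of(table_node):
    headings = table_node.xpath(
        "preceding::*[self::h2 or self::h3][1]//span[@class='mw-headline']/text()"
    )
    return headings[0] if headings else None

# e.g. for the 2004 USA presidential election page this is expected to print 'Risultati'
page = html.fromstring(open("Elezioni_presidenziali_2004.html").read())  # hypothetical local copy
for table in page.xpath("//table[contains(@class, 'wikitable')]"):
    print(section_of(table))
```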
08/09/2016 - Readme updated with script usage, options, the folder structure and an explanation of the classes.
08/07/2016 - Tables whose headers have a 'rowspan' > 1 are now correctly parsed.
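Handling rowspan essentially means copying a spanning cell down into the rows it covers, so that the reconstructed grid is rectangular. Below is a minimal, self-contained sketch of that idea; it is illustrative only and not the HtmlTableParser's actual implementation.

```python
# Expand rowspans so every reconstructed row has a value in every column.
# Input: rows as lists of (text, rowspan) pairs read from <th>/<td>; output: a grid.
def expand_rowspans(rows, n_cols):
    grid = []
    carry = {}  # column index -> (text, rows still to fill)
    for row in rows:
        out = [None] * n_cols
        # place cells carried over from previous rows first
        for col, (text, remaining) in list(carry.items()):
            out[col] = text
            if remaining > 1:
                carry[col] = (text, remaining - 1)
            else:
                del carry[col]
        # fill the remaining free columns with this row's own cells, left to right
        cells = iter(row)
        for col in range(n_cols):
            if out[col] is None:
                try:
                    text, rowspan = next(cells)
                except StopIteration:
                    break
                out[col] = text
                if rowspan > 1:
                    carry[col] = (text, rowspan - 1)
        grid.append(out)
    return grid

rows = [[("Candidato", 1), ("Voti", 1)], [("Wilson", 2), ("6296184", 1)], [("435", 1)]]
print(expand_rowspans(rows, 2))
# [['Candidato', 'Voti'], ['Wilson', '6296184'], ['Wilson', '435']]
```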
08/05/2016 - Added support for 'sortable' tables, a kind of fancy wiki table that users can reorder by clicking on the headers. The algorithm now also finds the section under which the analyzed table stands. This can be very useful to distinguish one table from another and to apply different mapping rules, as I found many tables with the same headers but different data meanings depending on the section. E.g. 2016 Tourist Trophy standings.
08/03/2016 - I've reorganized the project folder structure. It now contains three folders, making it simple to understand where to find source files, resource lists, datasets and log files.
08/02/2016 - First working version of the HTML parser. It uses lxml to obtain an 'etree' HTML object and finds tables and other HTML tag nodes using the XPath query language. I find this very interesting because it works directly with the HTML structure of a wiki page, which is what users check once they have written a new table, instead of a JSON representation of the wikitext they used, which is really different from the result those users verified in their browsers.
07/20/2016 - As requested by my mentors, I changed the module that interprets the arguments; it now uses argparse, a standard Python 2.7 library.
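A hedged sketch of what an argparse-based command line for the script could look like; the flag names and defaults are illustrative, not necessarily the ones the Table Extractor actually uses.

```python
# Illustrative argparse setup for the extractor's command line (flag names assumed).
import argparse

parser = argparse.ArgumentParser(description="Table Extractor")
parser.add_argument("-c", "--chapter", default="en",
                    help="wiki chapter / language code, e.g. en, it, fr")
parser.add_argument("-t", "--topic", default="all",
                    help="topic keyword or SPARQL where_clause selecting the resources")
parser.add_argument("-s", "--single",
                    help="analyze a single wiki resource instead of a whole topic")
args = parser.parse_args()
print(args.chapter, args.topic, args.single)
```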
07/16/2016 - I realized that the 'json' approach is too difficult to implement correctly, so I started working on an HTML parser that would be more usable and flexible than the JSON one. If I want to work with other domains, I need a more sustainable solution. Please refer to the tables on this page to get an idea of the structural complexity I have to face.
07/15/2016 - Logging: I started using a log to store info, problems and statistics related to the data extraction and the algorithm as a whole.
07/13/2016 - Mapper.py: a new class used to map the data extracted with the tableParser into RDF statements. I used it successfully, making it possible to create the project's first RDF dataset.
07/12/2016 - Meeting - As I arrived at the meeting without an RDF dataset, Diamantini urged me to go on with the data extraction and to try my approach on other wiki page topics. She helped me find quick solutions capable of speeding my project up. I don't fully agree with her, as I found those solutions of little use in hitting the target of the project. In the next few days I will try to complete the data extraction and the mapping of this data, in order to have a complete RDF dataset.
07/08/2016 - Despite the problems with the JSON table representation, I finally managed to reconstruct the headers. See this page to get an idea of the kinds of structures I am analyzing. I thought it would be useful to associate the different header rows (when there is more than one), as they can be used afterwards to tag the data cells and to ease the data mapping. I also started working on the data extraction. The algorithm works on the assumption that every cell and its data are wrapped in dictionaries with anonymous tags. E.g.
{u'content': {u'width': [u'"35%" colspan=3'], u'@an0': u'----'}, u'@type': u'head_cell'}, {u'content': {u'width': [u'"10%"'], u'@an0': u'Candidati'}, u'@type': u'body_cell'}, {u'content': {u'width': [u'"20%" colspan=2'], u'@an0': u'Grandi elettori'}, u'@type': u'body_cell'}, {u'content': {u'@an1': u'----', u'@an0': u'Voti'}, u'@type': u'body_cell'},
This part represents only the first header row.
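Given the anonymous-keyed cell dictionaries shown above, reading them back could look like the following sketch. The helper names are illustrative; the layout simply follows the example printed in this entry, and (as noted elsewhere) the '@type' field is not always a reliable header marker.

```python
# Read the text out of a JSONpedia-style cell dict and split cells by the '@type'
# field (illustrative helpers, based on the example shown above).
def cell_text(cell):
    content = cell.get(u'content', {})
    # the cell text sits under an anonymous key such as '@an0' or '@an1'
    for key in sorted(k for k in content if k.startswith(u'@an')):
        if content[key] != u'----':      # skip wiki-markup separators
            return content[key]
    return None

def split_cells(cells):
    headers = [cell_text(c) for c in cells if c.get(u'@type') == u'head_cell']
    data = [cell_text(c) for c in cells if c.get(u'@type') == u'body_cell']
    return headers, data
```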
07/04/2016 - Meeting - I laid out the problems encountered with tables wrapped by JSONpedia. In particular, it is clear that a lot of wiki tables are full of small errors (due to the users who wrote them) which do not directly affect the tables in the wiki pages, but which JSONpedia transfers to the JSON object representing the page. Moreover, JSONpedia does not support the table structures, so headers are not tagged. Potena pointed out the possibility of changing approach and using an HTML parser rather than a JSON one. He spurred me to speed up my work, as these errors could become a big problem for my project. Diamantini presented some other possibilities: she said it would be better to go straight to the data extraction, skipping the structure reconstruction part, using a brute-force extractor and hard-coded mapping rules.
06/27/2016 - Midterm Evaluation - My work was evaluated positively and DBpedia let me continue GSoC, but they raised a warning about my code style. I hope to improve my coding skills over the coming weeks.
06/20/2016 - Working on a possible solution to keep track of the table structures, based on the assumption that every table has to contain a number of data cells equal to the number of header cells. Moreover, the first rows (how many depends on the table) always represent headers, so I am trying to apply some rules to get over the problem.
06/17/2016 - tableParser.py: I started writing this new class, which has to parse and reconstruct the structure of the tables wrapped by JSONpedia. I am having some problems interpreting the structure of the tables: it really depends on how users wrote them, so I am trying a general approach to find out which parts of a table represent header cells and which represent data. I found a help page which is useful for getting an idea of the different solutions adopted by users across Wikipedia chapters: the Help:Table page on Wikipedia.
06/15/2016 - Meeting - Discussed with my mentors the progress obtained and the difficulties encountered. Fossati explained to me that it is more useful to concentrate efforts on dataset creation than on code style, and I agreed. I started working on the structure of tables in wiki pages about political and administrative elections (Italian chapter, for now). Take a look at this page: 1984 USA Presidential Elections, Italian chapter.
06/12/2016 - This week I worked on the Analyzer module, on the statistics script and on the Selector. The statistics.py script has a major update: it keeps calling the JSONpedia service until it gets back a useful response. Even though this can considerably extend the execution time, it is important to have clear results.
06/05/2016 - Selector module completed. It takes 2 parameters:
- wiki_chapter, e.g. en/it/fr and so on. Default: "en".
- tag/where_clause to identify a collection of resources. Default: "all", which stands for all wiki pages. Once the parameters have been tested, the Selector collects a list of resources and writes them to a file (.txt). I think it is useful to keep a trace of the resources found by the Selector and to test the modules that normally come after this one (e.g. the Analyzer). A minimal sketch of this collection step is shown below.
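The following is a hedged sketch of what the Selector does under these parameters, assuming the chapter's public SPARQL endpoint and a where_clause like the ones mentioned in this page; the endpoint URL pattern, the function name and the output file name are assumptions, not the module's real code.

```python
# Collect resource names for a chapter/where_clause and write them to a .txt file.
# Illustrative sketch of the Selector's behavior, not its actual implementation.
from SPARQLWrapper import SPARQLWrapper, JSON

def collect_resources(chapter, where_clause, out_file="resources.txt"):
    endpoint = ("http://dbpedia.org/sparql" if chapter == "en"
                else "http://{}.dbpedia.org/sparql".format(chapter))
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery("SELECT DISTINCT ?res WHERE { %s }" % where_clause)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    names = [b["res"]["value"].rsplit("/", 1)[-1] for b in bindings]
    with open(out_file, "w") as f:
        f.write("\n".join(names))
    return names

# e.g. every resource typed as an election in the Italian chapter
collect_resources("it", "?res a <http://dbpedia.org/ontology/Election>")
```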
05/28/2016 - I found that it would be better to test the parameters passed to the Python package, in order to be sure of their quality. NEW MODULE: param_test. It tests and sets 2 parameters: the wiki chapter considered and the query used to target a scope. The query can be either a tag (such as "soccer" for soccer players, "dir" for directors, "act" for actors and so on) or a real SPARQL where statement. In the second case, the user has to make sure the query is correct and useful.
05/23/2016 - Start of coding - I started by designing the algorithm and found that there would be 3 main modules: a selector of resources, an analyzer module and a utilities module (used by the other modules). The possibility of using JSONpedia as a module is deferred, due to some incompatibilities. If possible, I will try it later during GSoC.
05/22/2016 - Short report on the Community Bonding period - The CB period is already over, but it was really useful: I was able to contact my mentors successfully. They helped a lot in clarifying targets and strategies, and they have been available since the very beginning of the CB period. I also emailed some other community members in order to get acquainted with tools they contributed to develop, and I found them very helpful. I really hope this kind of collaboration will last throughout GSoC. My first impressions are very positive. Let the real work begin!
05/19/2016 - Second Meeting - I showed the statistics results to my mentors. As I was running out of domain ideas, they helped me discuss which kinds of wiki pages could have the largest number of tables (or the tables with the most interesting information). So my first step is to evaluate these domains on either it.wiki or en.wiki:
- Soccer Players
- Political Elections
- Music Artist’s Discography
- Writers
- Motorsports
- Drivers of motorsports
- Basketball seasons and Players
- Statistics on Awards (eg. Oscar, Nobel, Grammy and so on)
- Actors and filmographies
05/15/2016 - I started a collaboration with Feddie to extend the capabilities of statistics.py: it can now also count lists. We are working together on setting up some extraction tools (the Extraction Framework itself, JSONpedia) in order to run them from our own machines. This can be useful for editing parts of the tools' docs, or maybe for writing our own guide to them.
05/08/2016 - Later this week I started a little Python script (statistics.py) to interact with the wiki and DBpedia chapters in order to compute some statistics. This script, which you can find in the "Table Extractor" repo on GitHub, is able to show how many tables there are in a domain of wiki pages.
05/03/2016 - First Meeting - We started by pointing out some simple rules I have to follow during GSoC (e.g. how and when I have to report on my work). My DBpedia mentor (marfox) showed me my repo page on GitHub and the possibility of maintaining a "progress page" here in the Extraction Framework's wiki. My mentors then gave me some project goals. We discussed a strategy and all agreed, as a first approach, to start by analyzing some particular domains (and wiki chapters) of interest, trying to extract relevant data immediately.
04/27/2016 - Contact with mentor and co-mentors. I'm really excited to know that my project is starting to attract interest from community members.
04/22/2016 - My GSoC 2016 proposal to DBpedia has been chosen! I can't wait to make first contact with the community.
Scope | Chapter | # of resources | Tables | Resources lost |
---|---|---|---|---|
Elections | EN | 1000 | 5191 | 0 |
Elections | IT | 137 | 347 | 0 |
USA Elections | IT | 52 | 53 | 0 |
Actors | EN | 6621 | 3328 | 6 |
Actors | IT | 1704 | 85 | 96 |
Writers | EN | 29581 | 1690 | 100 |
Writers | IT | 1177 | 11 | 2 |
Soccer Players | EN | 108202 | 29220 | 189 |
Motorcycle Riders | IT | 1028 | 115 | 2 |
Motorcycle Riders | EN | 1780 | 2131 | 1 |