GSoC_2016_Progress_Simone
This project aims to build a module for the Extraction Framework capable of extracting useful RDF-formatted data from the tables of a wiki page. These structures are quite particular: they convey data in a semi-structured way. The first approach is to take a domain with interesting data and build a Python script to retrieve it. The script would get a JSON representation of the wiki page using JSONpedia.
- Marco Fossati
- Claudia Diamantini
- Domenico Potena
- Emanuele Storti
I encountered a lot of obstacles with my first approach to the problem. In the GSoC proposal I stated that I would use the JSONpedia web service to retrieve a JSON representation of selected wiki pages. The problems started there. JSONpedia is a wonderful community project (and I want to thank Michele Mostarda for his efforts in making the web service efficient and available), but it has some difficulties dealing with the tables of a wiki page. I have to say that most of the problems do not strictly depend on JSONpedia itself, but on the users who wrote the tables in question. I love the Wikipedia spirit of community effort and freedom in writing articles, but it creates problems when you have to extract data from complex structures like wiki tables. Users do not care about wikitext validation as long as they think the HTML representation of the table is correct. So even if there are errors in the wikitext, they are not a problem in the rendered pages, since the MediaWiki parser and browsers generally do a good job of interpreting and handling small errors. The problem shows up in JSONpedia, which wraps the wikitext and so propagates the errors. In the first part of my work (until the midterm evaluation) I tried to manage and resolve the problems caused by users' errors, but it cost me too much time and did not seem to come to an end. So, in agreement with my mentors, I changed approach and started retrieving HTML representations of wiki pages (using lxml) instead of JSON ones (see the sketch after the list below). I did lose a lot of time dealing with this, as I had to build two parsers, one for HTML and one for JSON. The main issues with the JSON approach were:
- Wikitext errors are spread over the JSON representation of a resource.
- There is no tag for row delimitation, so it is difficult to know what a cell is (header or data).
- There is no tag for headers, so, as tables can have multiple header rows, it is hard to know which rows are headers and which are data.
- If a cell is empty in the wikitext, it is not written in the JSON representation, so we lose the positional information needed to reconstruct the table structure.
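As a rough illustration of the HTML approach mentioned above, here is a minimal sketch (not the project's actual code) that fetches a rendered page and locates its table nodes with lxml and XPath. The helper name, the URL scheme and the 'wikitable' class are assumptions based on how Wikipedia renders pages.

```python
# Minimal sketch of the HTML approach: fetch the rendered wiki page and pick out
# its tables with an XPath query. Helper name and URL layout are illustrative only.
import requests
from lxml import html

def find_wiki_tables(chapter, resource):
    url = "https://{}.wikipedia.org/wiki/{}".format(chapter, resource)
    page = html.fromstring(requests.get(url).content)
    # Most data tables carry the 'wikitable' class in the rendered HTML.
    return page.xpath("//table[contains(@class, 'wikitable')]")

tables = find_wiki_tables("it", "Elezioni_presidenziali_negli_Stati_Uniti_d'America_del_1912")
for table in tables:
    print(len(table.xpath(".//tr")), "rows")
```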
- Make stable software capable of analyzing a topic for one wiki language. +
- Extend the topics and languages involved. []
- Try to translate the Table Extractor into Scala in order to make integration with the Extraction Framework possible. -
Reification of the row concept:
<http://it.dbpedia.org/resource/Elezioni_presidenziali_negli_Stati_Uniti_d'America_del_1912> <http://dbpedia.org/ontology/Election> <http://it.dbpedia.org/resource/Elezioni_presidenziali_negli_Stati_Uniti_d'America_del_1912__1> .
Data regarding that row:
<http://it.dbpedia.org/resource/Elezioni_presidenziali_negli_Stati_Uniti_d'America_del_1912__1> <http://dbpedia.org/ontology/PoliticalParty> <http://it.dbpedia.org/resource/Partito_Democratico_(Stati_Uniti_d'America)> ; <http://dbpedia.org/ontology/President> <http://it.dbpedia.org/resource/Thomas_Woodrow_Wilson> ; <http://dbpedia.org/ontology/VicePresident> "Thomas Riley Marshal"^^xsd:string ; <http://dbpedia.org/ontology/popularVote> "6296184"^^xsd:positiveInteger ; <http://dbpedia.org/property/electoralVote> "435"^^xsd:positiveInteger .
Reification of the row concept, same as for the USA elections:
<http://it.dbpedia.org/resource/Elezioni_politiche_italiane_del_2006> <http://dbpedia.org/ontology/Election> <http://it.dbpedia.org/resource/Elezioni_politiche_italiane_del_2006__1>,
Data regarding that row:
<http://it.dbpedia.org/resource/Elezioni_politiche_italiane_del_2006__1> <http://dbpedia.org/ontology/PoliticalParty> <http://it.dbpedia.org/resource/L'Unione> ; <http://dbpedia.org/ontology/President> <http://it.dbpedia.org/resource/Carlo_Perrin> ; <http://dbpedia.org/ontology/popularVote> "198"^^xsd:positiveInteger ; <http://dbpedia.org/property/pvPct> "34.54"^^xsd:float .
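To make the reification pattern above concrete, here is a hedged sketch of how such triples could be produced with rdflib. The URIs mirror the examples; the code layout and variable names are illustrative, not the Mapper's actual implementation.

```python
# Hedged sketch of the row reification pattern shown above, built with rdflib.
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import XSD

DBO = Namespace("http://dbpedia.org/ontology/")
DBP = Namespace("http://dbpedia.org/property/")
RES = Namespace("http://it.dbpedia.org/resource/")

g = Graph()
page = RES["Elezioni_presidenziali_negli_Stati_Uniti_d'America_del_1912"]
row = URIRef(str(page) + "__1")            # one 'row' resource per table row

g.add((page, DBO.Election, row))           # reification of the row concept
g.add((row, DBO.PoliticalParty, RES["Partito_Democratico_(Stati_Uniti_d'America)"]))
g.add((row, DBO.President, RES["Thomas_Woodrow_Wilson"]))
g.add((row, DBP.electoralVote, Literal(435, datatype=XSD.positiveInteger)))

print(g.serialize(format="turtle"))
```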
Note: General election pages are subject to a condition that is widespread across many topic pages. In some election pages, such as 2013_Administrative_Italian_Elections, there are several tables covering different electoral results but sharing the same headers. Other topics are affected by the same condition (see 2015_Edition_of_TT), which causes a 'same row' problem in the extracted dataset. For every page, every row is extracted, and the data cells of each row are associated with row number X, where X is the position of that row in its own table. Since the extractor is not able to analyze more than one table at a time, it does not know what happened with the previous table, and some names used for the row reifications are repeated. This means that more than one cell of the same type ends up associated with the same row. E.g. 2013_Italian_Presidential_Election, which has many different tables with the same headers:
<http://it.dbpedia.org/resource/Elezione_del_Presidente_della_Repubblica_Italiana_del_2013__1> <http://dbpedia.org/ontology/popularVote> "10"^^xsd:positiveInteger, "210"^^xsd:positiveInteger, "230"^^xsd:positiveInteger, "250"^^xsd:positiveInteger, "395"^^xsd:positiveInteger, "521"^^xsd:positiveInteger, "738"^^xsd:positiveInteger ; <http://dbpedia.org/property/pvPct> "1.0"^^xsd:float .
I contacted my mentors about this, but so far none of them has discussed it with me.
- Possibility to customize the topics and the corresponding stored SPARQL where_clauses, in order to easily extend the domains involved. You can launch the extraction over a topic, a single resource, or a set of resources determined by a SPARQL query.
- The lists of resources involved in your extractions are stored, to help the user understand whether the targeted scope has been correctly hit.
- A log file containing info and a complete report on the analysis, resource by resource and table by table, plus a final extraction report.
- WYSIWYG approach: as the mapping rules are based on the header text, to create a new mapping rule you simply add the text you see as a header, and you are sure to select the right cells.
- Useful reports that increase the effectiveness of the software: since the mapping rules are based on the text of the header cells, and users change that text even in pages related to the same domain, it is good to know which headers of a selected topic have not been mapped because they do not match any of the mapping rules already set. E.g. take a look at USA_presidential_election_2000 and USA_presidential_election_2004: the tables under the 'Risultati' section have the same structure and the same kinds of data cells, but slightly different header text (2000 page: 'Presidente', 2004 page: 'Presidente (Stato)'). A mapping based on the header text therefore has to differentiate these cases, and if you do not know all the possible cases it is impossible to map a domain effectively. The log file helps you resolve this problem: just before the final report there is a section listing the headers the Extractor could not map due to the lack of an adequate mapping rule, along with value examples and the resource where the problem occurs (a sketch of such a header-based rule table follows the log excerpt below). E.g.
- INFO 08/23 03:27:44 AM - Header: -Stato- , Value example ['Virginia', 'Virginia'], Resource: Elezioni_presidenziali_negli_Stati_Uniti_d'America_del_1789
- INFO 08/23 03:27:44 AM - Header: -Stato di origine- , Value example ['Illinois', 'Illinois'], Resource: Elezioni_presidenziali_negli_Stati_Uniti_d'America_del_2008
- INFO 08/23 03:27:44 AM - Header: -Stati vinti- , Value example ['Washington_DC', '28 + Washington DC'], Resource: Elezioni_presidenziali_negli_Stati_Uniti_d'America_del_2008
- INFO 08/23 03:27:44 AM - Header: -Metodo di selezione dei grandi elettori- , Value example ["Gli elettori furono investiti dall'assemblea legislativa dello Stato."], Resource: Elezioni_presidenziali_negli_Stati_Uniti_d'America_del_1789
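As mentioned in the feature list above, the mapping rules are keyed on the header text exactly as it appears in the page. Here is a minimal sketch of what such a rule table could look like; the dictionary name, the specific entries and the lookup helper are assumptions for illustration, not the extractor's actual code.

```python
# Assumed shape of a header-text based mapping rule table (illustrative only):
# the key is the header as it appears in the page, the value is the
# ontology/property IRI used for the cells found under that header.
MAPPING_RULES_IT_ELECTIONS = {
    "Candidati": "http://dbpedia.org/ontology/President",
    "Voti": "http://dbpedia.org/ontology/popularVote",
    "Grandi elettori": "http://dbpedia.org/property/electoralVote",
    # Headers that differ only slightly need their own entries, which is exactly
    # what the 'no mapping rule found' report helps to discover.
    "Presidente": "http://dbpedia.org/ontology/President",
    "Presidente (Stato)": "http://dbpedia.org/ontology/President",
}

def property_for_header(header_text):
    # Returns None when no rule matches, so the caller can log the header
    # in the 'no mapping rule found' section of the report.
    return MAPPING_RULES_IT_ELECTIONS.get(header_text.strip())
```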
The main problem in dealing with tables is the users who write them. Since everyone can add tables to a wiki page, tables are as heterogeneous as the imagination of those who create them. Even among pages on the same topic, tables are structured differently or contain data with different meanings. This is not entirely the users' fault: they are writing wikitext in order to add a table to a wiki page, so even if there are logical or structural errors (headers written as data cells, cells used as vertical divider lines), as long as they see a fancy, graphically pleasing table in their browser they will remain happy and unconcerned about these small (but, for us, fatal) 'errors'.
As soon as the final evaluation of my project is completed, I would like to:
- Improve the mapping class and methods so that a user of the Table Extractor can write their own mapping rules without touching the code. I think a structure containing the actual mapping rules would greatly help the Table Extractor's extensibility. You can find a sketch of a possible solution here.
- Expand the scopes analyzed, as, to be honest, I had problems finding the time to extend the domains of interest.
- Test the solution over many different table types, as it has problems with very complex tables like this one (hard to notice, but there is a header cell with rowspan=17 used to create a fancy divider line, which causes problems for the data extraction).
- Make it possible to add concepts to the DBpedia ontology | properties set, as a lot of data inside tables refers to concepts not yet described there.
08/23/2016 Last day of GSoC. Final small software and mapping fixes; added more log and dataset examples. Improved the docstrings in every file. Readme updated. Added a requirements file.
08/21/2016 New mapping feature: once data are extracted, and before automatically mapping them as DBpedia resources, the Table Extractor asks DBpedia directly whether the value is a resource or not. Only actually existing resources are added to the graph as "http://dbpedia.org/resource/ResourceName"; the others are added as simple string literals.
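A hedged sketch of the existence check described in this entry, using an ASK query against the public DBpedia SPARQL endpoint; the endpoint, the query shape and the helper name are assumptions rather than the project's exact code.

```python
# Ask DBpedia whether a candidate value is an existing resource; if not, the
# value is kept as a plain literal. Illustrative sketch, not the real code.
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_resource_exists(resource_name, endpoint="http://dbpedia.org/sparql"):
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery("ASK { <http://dbpedia.org/resource/%s> ?p ?o }" % resource_name)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert().get("boolean", False)

value = "Thomas_Woodrow_Wilson"
if dbpedia_resource_exists(value):
    obj = "http://dbpedia.org/resource/" + value   # add as a resource IRI
else:
    obj = value                                    # add as a simple string literal
```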
08/19/2016 New feature: once the extraction is finished, headers in the 'no mapping rule found' condition (refer to the software features) are reported in the log. This means you can now easily find headers that were not mapped due to the lack of a corresponding mapping rule, so it is simpler to increase effectiveness on a domain.
08/17/2016 Log improvement: after the last resource has been analyzed, a small final REPORT is now printed. This report contains a number of statistics about the overall extraction. Here is an example of the final report for 'elections'-'it':
- Total # of resources collected for this topic: 52
- Total # of resources analyzed: 52
- Total # tables found : 53
- Total # tables analyzed : 53
- Total # of rows extracted: 295
- Total # of data cells extracted : 1723
- Total # of exceptions extracting data : 1
- Total # of 'header not resolved' errors : 0
- Total # of 'no headers' errors : 1
- Total # of mapping errors : 0
- Total # of 'no mapping rule' errors : 33
- Total # cells mapped : 1443
- Total # of triples serialized : 1730
08/11/2016 - The HtmlTableParser can now find the section under which the table being analyzed resides. E.g. in 2004_USA_presidential_election the table resides under the section "Risultati". You can find some extraction examples here.
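A sketch of how this section lookup can be done on the rendered HTML: take the nearest h2/h3 heading that precedes the table in document order. The XPath expression, the 'mw-headline' class and the local file name are assumptions about Wikipedia's rendered markup, not necessarily the parser's exact query.

```python
# Find the section heading that precedes a given table node (illustrative sketch).
from lxml import html

def section_of(table_node):
    headings = table_node.xpath(
        "preceding::*[self::h2 or self::h3][1]//span[@class='mw-headline']/text()"
    )
    return headings[0] if headings else None

# e.g. for the 2004 USA presidential election page this is expected to print 'Risultati'
page = html.fromstring(open("Elezioni_presidenziali_2004.html").read())  # hypothetical local copy
for table in page.xpath("//table[contains(@class, 'wikitable')]"):
    print(section_of(table))
```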
08/09/2016 - Readme updated with script usage, options, the folder structure and an explanation of the classes.
08/07/2016 - Tables whose headers have a 'rowspan' > 1 are now correctly parsed.
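Handling rowspan essentially means copying a spanning cell down into the rows it covers, so that the reconstructed grid is rectangular. Below is a minimal, self-contained sketch of that idea; it is illustrative only and not the HtmlTableParser's actual implementation.

```python
# Expand rowspans so every reconstructed row has a value in every column.
# Input: rows as lists of (text, rowspan) pairs read from <th>/<td>; output: a grid.
def expand_rowspans(rows, n_cols):
    grid = []
    carry = {}  # column index -> (text, rows still to fill)
    for row in rows:
        out = [None] * n_cols
        # place cells carried over from previous rows first
        for col, (text, remaining) in list(carry.items()):
            out[col] = text
            if remaining > 1:
                carry[col] = (text, remaining - 1)
            else:
                del carry[col]
        # fill the remaining free columns with this row's own cells, left to right
        cells = iter(row)
        for col in range(n_cols):
            if out[col] is None:
                try:
                    text, rowspan = next(cells)
                except StopIteration:
                    break
                out[col] = text
                if rowspan > 1:
                    carry[col] = (text, rowspan - 1)
        grid.append(out)
    return grid

rows = [[("Candidato", 1), ("Voti", 1)], [("Wilson", 2), ("6296184", 1)], [("435", 1)]]
print(expand_rowspans(rows, 2))
# [['Candidato', 'Voti'], ['Wilson', '6296184'], ['Wilson', '435']]
```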
08/05/2016 - Added support for 'sortable' tables, a kind of fancy wiki table that users can reorder by clicking on the headers. The algorithm now also finds the section under which the analyzed table stands. This can be very useful to distinguish one table from another and to apply different mapping rules, as I found many tables with the same headers but different data meanings depending on the section. E.g. 2016 Tourist Trophy standings.
08/03/2016 - I've reorganized the project folder structure. It now contains three folders, making it simple to understand where to find source files, resource lists, datasets and log files.
08/02/2016 - First working version of the HTML parser. It uses lxml to obtain an 'etree' HTML object and finds tables and other HTML tag nodes using the XPath query language. I find this very interesting because it works directly with the HTML structure of a wiki page, which is what users check once they have written a new table, instead of a JSON representation of the wikitext they used, which is really different from the result those users verified in their browsers.
07/20/2016 - As requested by my mentors, I changed the module that interprets the arguments; it now uses argparse, a standard Python 2.7 library.
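A hedged sketch of what an argparse-based command line for the script could look like; the flag names and defaults are illustrative, not necessarily the ones the Table Extractor actually uses.

```python
# Illustrative argparse setup for the extractor's command line (flag names assumed).
import argparse

parser = argparse.ArgumentParser(description="Table Extractor")
parser.add_argument("-c", "--chapter", default="en",
                    help="wiki chapter / language code, e.g. en, it, fr")
parser.add_argument("-t", "--topic", default="all",
                    help="topic keyword or SPARQL where_clause selecting the resources")
parser.add_argument("-s", "--single",
                    help="analyze a single wiki resource instead of a whole topic")
args = parser.parse_args()
print(args.chapter, args.topic, args.single)
```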
07/16/2016 - I realized that the 'json' approach is too difficult to implement correctly, so I started working on an HTML parser that would be more usable and flexible than the JSON one. If I want to work with other domains, I need a more sustainable solution. Please refer to the tables on this page to get an idea of the structural complexity I have to face.
07/15/2016 - Logging: I started using a log to store info, problems and statistics related to the data extraction and the algorithm as a whole.
07/13/2016 - Mapper.py: a new class used to map the data extracted with the tableParser into RDF statements. I used it successfully, making it possible to create the project's first RDF dataset.
07/12/2016 - Meeting - As I arrived at the meeting without an RDF dataset, Diamantini urged me to go on with the data extraction and to try my approach on other wiki page topics. She helped me find quick solutions capable of speeding my project up. I don't fully agree with her, as I found those solutions of little use in hitting the target of the project. In the next few days I will try to complete the data extraction and the mapping of this data, in order to have a complete RDF dataset.
07/08/2016 - Despite the problems with the JSON table representation, I finally managed to reconstruct the headers. See this page to get an idea of the kinds of structures I am analyzing. I thought it would be useful to associate the different header rows (when there is more than one), as they can be used afterwards to tag the data cells and to ease the data mapping. I also started working on the data extraction. The algorithm works on the assumption that every cell and its data are wrapped in dictionaries with anonymous tags. E.g.
{u'content': {u'width': [u'"35%" colspan=3'], u'@an0': u'----'}, u'@type': u'head_cell'}, {u'content': {u'width': [u'"10%"'], u'@an0': u'Candidati'}, u'@type': u'body_cell'}, {u'content': {u'width': [u'"20%" colspan=2'], u'@an0': u'Grandi elettori'}, u'@type': u'body_cell'}, {u'content': {u'@an1': u'----', u'@an0': u'Voti'}, u'@type': u'body_cell'},
This part represents only the first header row.
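Given the anonymous-keyed cell dictionaries shown above, reading them back could look like the following sketch. The helper names are illustrative; the layout simply follows the example printed in this entry, and (as noted elsewhere) the '@type' field is not always a reliable header marker.

```python
# Read the text out of a JSONpedia-style cell dict and split cells by the '@type'
# field (illustrative helpers, based on the example shown above).
def cell_text(cell):
    content = cell.get(u'content', {})
    # the cell text sits under an anonymous key such as '@an0' or '@an1'
    for key in sorted(k for k in content if k.startswith(u'@an')):
        if content[key] != u'----':      # skip wiki-markup separators
            return content[key]
    return None

def split_cells(cells):
    headers = [cell_text(c) for c in cells if c.get(u'@type') == u'head_cell']
    data = [cell_text(c) for c in cells if c.get(u'@type') == u'body_cell']
    return headers, data
```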
07/04/2016 - Meeting - I laid out the problems encountered with tables wrapped by JSONpedia. In particular, it is clear that a lot of wiki tables are full of small errors (due to the users who wrote them) which do not directly affect the tables in the wiki pages, but which JSONpedia transfers to the JSON object representing the page. Moreover, JSONpedia does not support the table structures, so headers are not tagged. Potena pointed out the possibility of changing approach and using an HTML parser rather than a JSON one. He spurred me to speed up my work, as these errors could become a big problem for my project. Diamantini presented some other possibilities: she said it would be better to go straight to the data extraction, skipping the structure reconstruction part, using a brute-force extractor and hard-coded mapping rules.
06/27/2016 - Midterm Evaluation - My work was evaluated positively and DBpedia let me continue GSoC, but they raised a warning about my code style. I hope to improve my coding skills over the coming weeks.
06/20/2016 - Working on a possible solution to keep track of the table structures, based on the assumption that every table has to contain a number of data cells equal to the number of header cells. Moreover, the first rows (how many depends on the table) always represent headers, so I am trying to apply some rules to get over the problem.
06/17/2016 - tableParser.py: I started writing this new class, which has to parse and reconstruct the structure of the tables wrapped by JSONpedia. I am having some problems interpreting the structure of the tables: it really depends on how users wrote them, so I am trying a general approach to find out which parts of a table represent header cells and which represent data. I found a help page which is useful for getting an idea of the different solutions adopted by users across Wikipedia chapters: the Help:Table page on Wikipedia.
06/15/2016 - Meeting - Discussed with my mentors the progress obtained and the difficulties encountered. Fossati explained to me that it is more useful to concentrate efforts on dataset creation than on code style, and I agreed. I started working on the structure of tables in wiki pages about political and administrative elections (Italian chapter, for now). Take a look at this page: 1984 USA Presidential Elections, Italian chapter.
06/12/2016 - This week I worked on the Analyzer module, on the statistics script and on the Selector. The statistics.py script has a major update: it keeps calling the JSONpedia service until it gets back a useful response. Even though this can considerably extend the execution time, it is important to have clear results.
06/05/2016 - Selector module completed. It takes 2 parameters:
- wiki_chapter, e.g. en/it/fr and so on. Default: "en".
- tag/where_clause to identify a collection of resources. Default: "all", which stands for all wiki pages. Once the parameters have been tested, the Selector collects a list of resources and writes them to a file (.txt). I think it is useful to keep a trace of the resources found by the Selector and to test the modules that normally come after this one (e.g. the Analyzer). A minimal sketch of this collection step is shown below.
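The following is a hedged sketch of what the Selector does under these parameters, assuming the chapter's public SPARQL endpoint and a where_clause like the ones mentioned in this page; the endpoint URL pattern, the function name and the output file name are assumptions, not the module's real code.

```python
# Collect resource names for a chapter/where_clause and write them to a .txt file.
# Illustrative sketch of the Selector's behavior, not its actual implementation.
from SPARQLWrapper import SPARQLWrapper, JSON

def collect_resources(chapter, where_clause, out_file="resources.txt"):
    endpoint = ("http://dbpedia.org/sparql" if chapter == "en"
                else "http://{}.dbpedia.org/sparql".format(chapter))
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery("SELECT DISTINCT ?res WHERE { %s }" % where_clause)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    names = [b["res"]["value"].rsplit("/", 1)[-1] for b in bindings]
    with open(out_file, "w") as f:
        f.write("\n".join(names))
    return names

# e.g. every resource typed as an election in the Italian chapter
collect_resources("it", "?res a <http://dbpedia.org/ontology/Election>")
```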
05/28/2016 - I found that it would be better to test the parameters passed to the Python package, in order to be sure of their quality. NEW MODULE: param_test. It tests and sets 2 parameters: the wiki chapter considered and the query used to target a scope. The query can be either a tag (such as "soccer" for soccer players, "dir" for directors, "act" for actors and so on) or a real SPARQL where statement. In the second case, the user has to make sure the query is correct and useful.
05/23/2016 - Start of coding - I started by designing the algorithm and found that there would be 3 main modules: a selector of resources, an analyzer module and a utilities module (used by the other modules). The possibility of using JSONpedia as a module is deferred, due to some incompatibilities. If possible, I will try it later during GSoC.
05/22/2016 - Short report on the Community Bonding period - The CB period is already over, but it was really useful: I was able to contact my mentors successfully. They helped a lot in clarifying targets and strategies, and they have been available since the very beginning of the CB period. I also emailed some other community members in order to get acquainted with tools they contributed to develop, and I found them very helpful. I really hope this kind of collaboration will last throughout GSoC. My first impressions are very positive. Let the real work begin!
05/19/2016 - Second Meeting - I showed the statistics results to my mentors. As I was running out of domain ideas, they helped me discuss which kinds of wiki pages could have the largest number of tables (or the tables with the most interesting information). So my first step is to evaluate these domains on either it.wiki or en.wiki:
- Soccer Players
- Political Elections
- Music Artist’s Discography
- Writers
- Motorsports
- Drivers of motorsports
- Basketball seasons and Players
- Statistics on Awards (eg. Oscar, Nobel, Grammy and so on)
- Actors and filmographies
05/15/2016 - I started a collaboration with Feddie to extend the capabilities of statistics.py: it can now also count lists. We are working together on setting up some extraction tools (the Extraction Framework itself, JSONpedia) in order to run them from our own machines. This can be useful for editing parts of the tools' docs, or maybe for writing our own guide to them.
05/08/2016 - Later this week I started a little Python script (statistics.py) to interact with the wiki and DBpedia chapters in order to compute some statistics. This script, which you can find in the "Table Extractor" repo on GitHub, is able to show how many tables there are in a domain of wiki pages.
05/03/2016 - First Meeting - We started by pointing out some simple rules I have to follow during GSoC (e.g. how and when I have to report on my work). My DBpedia mentor (marfox) showed me my repo page on GitHub and the possibility of maintaining a "progress page" here in the Extraction Framework's wiki. My mentors then gave me some project goals. We discussed a strategy and all agreed, as a first approach, to start by analyzing some particular domains (and wiki chapters) of interest, trying to extract relevant data immediately.
04/27/2016 - Contact with mentor and co-mentors. I'm really excited to know that my project is starting to attract interest from community members.
04/22/2016 - My GSoC 2016 proposal to DBpedia has been chosen! I can't wait to make first contact with the community.
Scope | Chapter | # of resources | Tables | Resources lost |
---|---|---|---|---|
Elections | EN | 1000 | 5191 | 0 |
Elections | IT | 137 | 347 | 0 |
USA Elections | IT | 52 | 53 | 0 |
Actors | EN | 6621 | 3328 | 6 |
Actors | IT | 1704 | 85 | 96 |
Writers | EN | 29581 | 1690 | 100 |
Writers | IT | 1177 | 11 | 2 |
Soccer Players | EN | 108202 | 29220 | 189 |
Motorcycle Riders | IT | 1028 | 115 | 2 |
Motorcycle Riders | EN | 1780 | 2131 | 1 |