Parsed files in final too small #8

Open
kurzum opened this issue Aug 31, 2019 · 0 comments

kurzum commented Aug 31, 2019

I managed to derive mappings and wikidata.
However, since roughly the last 5 commits, the files for generic are only partially generated.

Setup on generic server:

# pom is modified to only derive geo-coordinates/2018.08.01
cd /data/derive/databus-maven-plugin/dbpedia/generic
mvn databus-derive:clone -e

result:

root@dbpedia-generic:/data/derive/databus-maven-plugin/dbpedia/generic/target/databus/derive# du -sh *
90M     downloads
3.2M    final
208K    reports

I removed the lbzip2file, but that is not the cause; final is still smaller than downloads:

du -sh *
90M     downloads
55M     final
508K    reports
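
Since du only compares on-disk sizes, which also depend on compression, counting lines per file may be a more direct check of how much is lost. A minimal sketch, assuming the files are N-Triples (one triple per line) and possibly bzip2- or gzip-compressed; the function name count_lines and the demo files are hypothetical and not part of databus-derive:

```shell
#!/bin/sh
# Hypothetical helper: count lines in a plain or compressed file.
count_lines() {
    case "$1" in
        *.bz2) bzip2 -dc "$1" | wc -l ;;
        *.gz)  gzip -dc "$1" | wc -l ;;
        *)     wc -l < "$1" ;;
    esac
}

# Demo data standing in for one file in downloads and its counterpart in final.
demo=$(mktemp -d)
printf '<s> <p> <o1> .\n<s> <p> <o2> .\n<s> <p> <o3> .\n' > "$demo/download.nt"
printf '<s> <p> <o1> .\n' > "$demo/final.nt"

in_count=$(count_lines "$demo/download.nt")
out_count=$(count_lines "$demo/final.nt")
echo "input: $in_count lines, output: $out_count lines, dropped: $((in_count - out_count))"
```

On the real data, the two paths would be a file under downloads and the matching file under final.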

Potential solutions:

  1. I tried to revert to 4cb41bc, but the error is already present there.

  2. I noticed this line: https://github.com/dbpedia/databus-derive/blob/master/src/main/scala/org/dbpedia/databus/derive/mojo/CloneGoal.scala#L191
    > instead of >>, but this only affects the reports. I tested it, and it somehow produces bigger reports:

    du -sh *
    90M     downloads
    3.2M    final
    508K    reports

  3. The files in final parse well with rapper, so they seem to be parsed correctly.

  4. A previous commit changed the following, but I am not sure what it even does:

    val jenaTriples = new LangNTriplesSkipBad(tokenizer, parserProfile, null).filter(

    -           wrappedTriple => { ! removeWarnings || ! reports.getCorruptRows.contains(wrappedTriple.getRow) }
    +          wrappedTriple => { ! removeWarnings || ! errorHandler.getViolatedRowsBuffer.contains(wrappedTriple.getRow) }

  5. It might be the output stream of the final file, but this seems straightforward.
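
For reference, the general difference between > and >> can be reproduced in isolation. This is only a sketch of standard shell redirection, assuming the report is written once per input file inside a loop; whether CloneGoal.scala actually does that at the linked line is an assumption, and the file names are made up:

```shell
#!/bin/sh
# Illustrative only: the loop stands in for a hypothetical per-file report step.
workdir=$(mktemp -d)
truncated="$workdir/report_truncate.txt"
appended="$workdir/report_append.txt"

for f in a.ttl b.ttl c.ttl; do
    # '>' truncates the target on every write, so only the last entry survives
    echo "report for $f" > "$truncated"
    # '>>' appends, so every entry is kept
    echo "report for $f" >> "$appended"
done

wc -l < "$truncated"   # one line: only the entry for c.ttl
wc -l < "$appended"    # three lines: one entry per file
```

This would be consistent with the reports directory growing after the change, while leaving the contents of final untouched.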

Overall, I didn't find anything useful really.
