[nifconverter] add tika #75

Closed
jnehring opened this issue Aug 2, 2016 · 11 comments
@jnehring
Member

jnehring commented Aug 2, 2016

Apache Tika will be integrated into the nifconverter to support more input formats. With Tika, the nifconverter can convert many formats to NIF, e.g. PDF and all MS Office formats.

@jmschnei
Contributor

jmschnei commented Aug 8, 2016

The Apache Tika functionality has been added to the nif-converter endpoint. There are also two tests that validate it with MS Word and PDF files.

@fsasaki

fsasaki commented Aug 8, 2016

Great work! Do you have a sample CURL request?

Best,

Felix


@jmschnei
Contributor

jmschnei commented Aug 8, 2016

The curl request is:

curl -X POST -d @../src/data.pdf "http://localhost:8080/toolbox/nif-converter?informat=TIKAFile" -H "Accept: text/turtle"

Replace localhost:8080 with the URL of your server.

@jnehring
Member Author

How does Tika determine the data format of the file? Does it depend on the file extension?

@jnehring
Member Author

I installed the code on freme-dev, but it does not work. This curl request

curl -X POST -d @source_pdf.pdf "http://api-dev.freme-project.eu/current/toolbox/nif-converter?informat=TIKAFile" -H "Accept: text/turtle"

produces

{
  "exception": "eu.freme.common.exception.BadRequestException",
  "path": "/toolbox/nif-converter",
  "message": "There is no file because it is not a multipart request",
  "error": "Bad Request",
  "status": 400,
  "timestamp": 1471333818984
}

What is wrong?

Another question: Is this documented anywhere in the DKT documentation, so that we can reuse it in the FREME documentation?

@jmschnei
Contributor

The problem is that the endpoint expects a multipart request. To send one with curl, use the -F/--form option instead of -d. This is a working request:

curl -X POST -F "inputFile=@source_pdf.pdf" "http://api-dev.freme-project.eu/current/toolbox/nif-converter?informat=TIKAFile" -H "Accept: text/turtle"
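For readers wondering why the -d variant was rejected: by default curl sends -d data as application/x-www-form-urlencoded, whereas -F builds a multipart/form-data body in which each form field is a separate part. The following is only a minimal Python sketch of what such a body looks like for the inputFile field used above. The helper name build_multipart and the placeholder file bytes are illustrative assumptions and not part of FREME, and nothing is sent over the network.

```python
import uuid


def build_multipart(field_name, filename, payload,
                    content_type="application/octet-stream"):
    """Return (headers, body) for a single-file multipart/form-data request.

    Hypothetical helper for illustration; it mirrors what `curl -F` does
    for one file field, but it is not FREME or curl code.
    """
    # The boundary separates the parts; it must not occur in the payload.
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"
    ).encode("ascii")
    # The closing delimiter ends with two extra dashes.
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    body = head + payload + tail
    headers = {
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Content-Length": str(len(body)),
        "Accept": "text/turtle",
    }
    return headers, body


# Placeholder bytes stand in for the real contents of source_pdf.pdf.
headers, body = build_multipart("inputFile", "source_pdf.pdf",
                                b"%PDF-1.4 placeholder bytes")
print(headers["Content-Type"])
```

The part name inputFile is what the endpoint looks for, which is why a raw -d body leads to the "not a multipart request" error shown earlier.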

Tika determines the file format automatically. The supported formats are listed at:

https://tika.apache.org/1.13/formats.html
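On the earlier question about file endings: Tika's detection generally inspects the leading bytes of the content (so-called magic numbers) and can additionally use hints such as the file name. Whether this endpoint passes the file name on to Tika is not stated in this thread. The following Python sketch only illustrates the idea of content-based detection with three well-known signatures. The function name sniff and the labels are assumptions made for illustration, and this is not Tika's actual implementation.

```python
# Well-known leading byte signatures; a tiny illustrative subset only.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    # OLE2 compound file, used by legacy .doc/.xls/.ppt
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 container (legacy MS Office)"),
    # ZIP local file header; .docx/.xlsx/.pptx are ZIP containers
    (b"PK\x03\x04", "ZIP container (e.g. OOXML)"),
]


def sniff(data: bytes) -> str:
    """Guess a coarse format label from the first bytes of a file.

    Hypothetical helper: real detectors check many more signatures and
    look inside containers to tell, for example, .docx from a plain ZIP.
    """
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


print(sniff(b"%PDF-1.4\n"))          # PDF
print(sniff(b"PK\x03\x04rest"))      # ZIP container (e.g. OOXML)
print(sniff(b"plain text"))          # unknown
```

Because the decision is made from the content, a PDF uploaded under a misleading file name would still be recognised by a check like this.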

jnehring added a commit that referenced this issue Aug 16, 2016
@jnehring
Member Author

Thank you. I added support for the prefix parameter to the nif converter.

@fsasaki

fsasaki commented Aug 18, 2016

@jnehring, will this functionality be added to the documentation of the NIF converter
http://api-dev.freme-project.eu/doc/api-doc/full.html#!/Toolbox/NIF-Converter
?

@jnehring jnehring reopened this Aug 18, 2016
@jnehring jnehring assigned jnehring and unassigned jmschnei Aug 18, 2016
@jnehring
Member Author

Thanks for the reminder. The documentation issue got lost somehow. I created a new issue: freme-project/freme-project.github.io#284

@fsasaki

fsasaki commented Sep 25, 2016

I tried

curl -X POST -H "Content-Type: text/html" -d '' "http://api-dev.freme-project.eu/current/e-terminology/tilde?informat=text/html&source-lang=de&target-lang=en&filter=count-domains&intype=url&input=http://www.zeit.de/wirtschaft/2016-09/arbeitskraeftepotenzial-deutschland-ungenutzt-arbeit-wille"

and it works. If I try the same with the informat TIKAFile, I get

parameter informat has invalid value "TIKAFile"

It would be great to be able to use Tika without having to call the NIF conversion endpoint explicitly. Is that feasible?

@jnehring
Member Author

I created a new issue for this: #107

I hope we can find the time to implement this.
