muc7-reader #17

Open
fmatthies opened this issue Jun 7, 2016 · 2 comments
fmatthies commented Jun 7, 2016

Added a script that converts the SGML files to the XML format expected by the reader. The script is still rudimentary, but it works for all files with the following structure:

<DOC>
<DOCID> nyt960214.0765 </DOCID>
<STORYID cat=a pri=r> A4505 </STORYID>
<SLUG fv=ttx-z> BC-<COREF ID="1">PANTEX</COREF>-<COREF ID="3">FLIGHTS</COREF>-TEX </SLUG>
<DATE> <COREF ID="104">02-14</COREF> </DATE>
<NWORDS> 0535 </NWORDS>
<PREAMBLE>
[...] 

The script takes the name of the file to convert as its argument:
python muc7_SGML2XML.py training.tr.keys.980410
It produces a file with the same name plus an additional ".xml" extension.
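
For readers without access to the script, here is a minimal sketch of one step such a conversion plausibly needs: the sample above uses unquoted attribute values (`cat=a`, `fv=ttx-z`), which XML does not allow. This is not the code of muc7_SGML2XML.py. The names `quote_attributes` and `convert` are hypothetical, the latin-1 encoding is an assumption, and other SGML/XML differences (for example bare `&` characters) are not handled.

```python
import re

# An unquoted attribute such as cat=a or fv=ttx-z. Values that are already
# quoted (ID="1") do not match, because the value may not start with a quote.
_UNQUOTED_ATTR = re.compile(r'(\s)([A-Za-z_][\w.-]*)=([^\s"\'<>]+)')
# A single tag, so that text outside of tags is never rewritten.
_TAG = re.compile(r'<[^<>]+>')


def quote_attributes(text):
    """Wrap unquoted attribute values inside tags in double quotes."""
    def fix_tag(match):
        return _UNQUOTED_ATTR.sub(r'\1\2="\3"', match.group(0))
    return _TAG.sub(fix_tag, text)


def convert(path):
    """Write a converted copy of `path` to `path + '.xml'` and return that name."""
    # Assumption: latin-1 is used only because it decodes any byte sequence.
    with open(path, encoding="latin-1") as src:
        text = src.read()
    out_path = path + ".xml"
    with open(out_path, "w", encoding="latin-1") as dst:
        dst.write(quote_attributes(text))
    return out_path
```

Under these assumptions, `convert("training.tr.keys.980410")` would write `training.tr.keys.980410.xml`. Quoted values that themselves contain something like ` x=y` would be rewritten incorrectly, so the pattern is only suitable for simple input like the sample above.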

But:
The reader does not seem to annotate coreferences in the CAS. This needs investigating!

@fmatthies

Solved the issue with corefs not being annotated; however, the reader still annotates only corefs and nothing else.
--> see capabilities

@fmatthies

Feedback by e-mail:

I noticed one more thing: the MUC7Reader currently annotates only the "Text" part of the document. If you want to annotate the whole document (which should be the normal case), you have to uncomment the commented-out methods in the getNext(CAS) method. Among the static variables, you also have to uncomment the following declaration:

/**
 * XML elements comprised in an object list
 */
public static final String[] ELEMENT_TEXT_TO_BE_PROCESSED = { ELEMENT_SLUG, ELEMENT_DATE,
        ELEMENT_NWORDS, ELEMENT_PREAMBLE, ELEMENT_TEXT, ELEMENT_TRAILER };
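
To illustrate the difference between processing only the "Text" part and processing every element in the list, here is a small Python sketch run against a converted XML snippet. It does not reproduce the reader's Java code. The tag names are assumed to match the constant names (the values behind ELEMENT_SLUG and the others are not shown in this thread), `collect_text` is a hypothetical helper, and the `<TEXT>` content in the sample is made up.

```python
import xml.etree.ElementTree as ET

# Assumed tag names, derived from the constant names in the Java snippet.
ALL_ELEMENTS = ["SLUG", "DATE", "NWORDS", "PREAMBLE", "TEXT", "TRAILER"]

# Shortened sample modeled on the structure quoted in this issue; the TEXT
# body is invented for illustration.
SAMPLE = """<DOC>
<DOCID> nyt960214.0765 </DOCID>
<SLUG fv="ttx-z"> BC-<COREF ID="1">PANTEX</COREF>-<COREF ID="3">FLIGHTS</COREF>-TEX </SLUG>
<DATE> <COREF ID="104">02-14</COREF> </DATE>
<NWORDS> 0535 </NWORDS>
<TEXT> Hypothetical body text. </TEXT>
</DOC>"""


def collect_text(doc, element_names):
    """Return {tag: text} for each listed child element that is present."""
    result = {}
    for name in element_names:
        node = doc.find(name)
        if node is not None:
            # itertext() also yields the text nested inside COREF elements.
            result[name] = "".join(node.itertext()).strip()
    return result


doc = ET.fromstring(SAMPLE)
text_only = collect_text(doc, ["TEXT"])      # current behaviour as described
whole_doc = collect_text(doc, ALL_ELEMENTS)  # with all elements enabled
```

In this sample, `text_only` misses the coreference mentions in SLUG and DATE, while `whole_doc` includes them.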
