I found that to_text() reads sections:

    def to_text(self):
        """
        Returns text of a document by iterating through all the sections '\n'
        """
        text = ""
        for section in self.sections():
            text = text + section.to_text(include_children=True, recurse=True) + "\n"
        return text

and self.sections() reads child nodes of root_node with tag header:

    def sections(self):
        """
        Returns all the sections in the block. This is useful for getting all the sections in a document.
        """
        sections = []
        def chunk_collector(node):
            if node.tag in ['header']:
                sections.append(node)
        self.iter_children(self, 0, chunk_collector)
        return sections

When the response from the nlm-ingestor server doesn't contain any sections, to_text() returns an empty string. Should it instead get text from all children of root_node?
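To make the empty-string case concrete, here is a minimal sketch that reproduces it on a toy tree. The `Node` class and the helpers `collect_headers` and `sections_to_text` are hypothetical stand-ins written for illustration, not the library's real classes. They only assume what the snippets above show: sections are the nodes tagged `header`, and the document text is built from those sections alone.

```python
class Node:
    """Hypothetical stand-in for a parsed block; not the library's real class."""

    def __init__(self, tag, text, children=None):
        self.tag = tag
        self.text = text
        self.children = children or []

    def to_text(self):
        # Own text followed by the text of all descendants.
        parts = [self.text] + [c.to_text() for c in self.children]
        return "\n".join(p for p in parts if p)


def collect_headers(node):
    """Collect every descendant tagged 'header', at any depth."""
    found = []
    for child in node.children:
        if child.tag == "header":
            found.append(child)
        found.extend(collect_headers(child))
    return found


def sections_to_text(root):
    """Mirrors the section-based to_text() quoted above."""
    text = ""
    for section in collect_headers(root):
        text = text + section.to_text() + "\n"
    return text


# A parse result that has paragraphs but no headers.
root = Node("root", "", [
    Node("para", "First paragraph."),
    Node("para", "Second paragraph."),
])

result = sections_to_text(root)
print(repr(result))  # ''
```

With no `header` nodes there is nothing to iterate over, so both paragraphs are silently dropped.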
I'd also like to add that calling Document.to_text() outputs duplicated text as it's being called on each section and sections can be children of other sections.
In this example, self.sections()[0] (block_idx=1) is the parent of self.sections()[1] (block_idx=2), so obviously calling to_text() on both will result in duplicated text.
I also think it would make more sense to have to_text() implemented on the Document.root_node, which was the behaviour I was expecting before looking through the documentation.
Wouldn't it be more logical to simply do this? It seems to work on a simple PDF:

    def to_text(self):
        text = ""
        for n in self.root_node.children:
            text = text + n.to_text(include_children=True, recurse=True) + "\n"
        return text

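The difference between the two approaches can be checked on a small nested tree. This is only a sketch under stated assumptions: `Node`, `all_headers`, `text_from_sections` and `text_from_root_children` are hypothetical names made up for this example, and the toy `to_text()` always recurses into children, which is how I understand `include_children=True, recurse=True` to behave.

```python
class Node:
    """Hypothetical stand-in for a parsed block; not the library's real class."""

    def __init__(self, tag, text, children=None):
        self.tag = tag
        self.text = text
        self.children = children or []

    def to_text(self):
        # Assumed to behave like include_children=True, recurse=True.
        parts = [self.text] + [c.to_text() for c in self.children]
        return "\n".join(p for p in parts if p)


def all_headers(node):
    """Collect every descendant tagged 'header', including nested ones."""
    found = []
    for child in node.children:
        if child.tag == "header":
            found.append(child)
        found.extend(all_headers(child))
    return found


def text_from_sections(root):
    """Section-based approach: one to_text() call per header node."""
    return "".join(h.to_text() + "\n" for h in all_headers(root))


def text_from_root_children(root):
    """Proposed approach: one to_text() call per direct child of the root."""
    return "".join(c.to_text() + "\n" for c in root.children)


# A header that contains a nested sub-header.
child = Node("header", "1.1 Background", [Node("para", "Some details.")])
parent = Node("header", "1 Introduction", [Node("para", "Opening text."), child])
root = Node("root", "", [parent])

by_sections = text_from_sections(root)
by_children = text_from_root_children(root)

print(by_sections.count("Some details."))  # 2
print(by_children.count("Some details."))  # 1
```

In this toy tree the nested section is emitted once as part of its parent and once on its own when iterating over sections, while iterating over the root's direct children visits each node exactly once.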