[Plone5.2-rc2/Python3.6/c.solr8.0.0a1] Parsing error xmlSAX2Characters: huge text node #239

Open
NicolasGoeddel opened this issue Sep 17, 2019 · 4 comments


@NicolasGoeddel

There is a problem with parsing huge XML outputs of Solr's extraction handler.
When I index a PDF file with nearly 3000 pages of text, Solr extracts that text and returns an XML response that is handled by collective.solr.indexer.BinaryAdder. The problem is etree.parse(response), which fails on very large text nodes. I guess it needs to be changed to etree.iterparse(), but that is a bigger change.
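
A possibly smaller change than switching to etree.iterparse() is to keep etree.parse() and hand it a parser created with huge_tree=True, an lxml option that lifts libxml2's limits on text length and tree depth. The sketch below is untested against collective.solr; the function name and the sample response are made up for illustration, and the real extraction response has more elements than this.

```python
from io import BytesIO

from lxml import etree


def parse_extract_response(stream):
    """Parse an XML response without libxml2's size restrictions.

    huge_tree=True disables the parser's safety limits, so it should
    only be used on responses from a trusted Solr server.
    """
    parser = etree.XMLParser(huge_tree=True)
    return etree.parse(stream, parser)


# Hypothetical, heavily shortened stand-in for a Solr extraction response.
sample = BytesIO(b"<response><str>extracted text</str></response>")
tree = parse_extract_response(sample)
first_text = tree.getroot()[0].text
```

This still builds the whole tree in memory, so for very large documents a streaming approach may remain the better choice.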

It would be nicer if collective.solr extracted and indexed a binary object in a single step. I don't know if this is possible with Solr's API. At the moment collective.solr extracts all the text of a binary blob using Solr, saves that text into a dictionary and sends it back to Solr to index it. That does not look very efficient to me. Maybe you know of a simple change that does both things together without the step in between.
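
For reference, Solr's extracting request handler (Solr Cell) can index the extracted text directly when extractOnly is false, with additional field values passed as literal.<field> parameters and Tika's content field renamed via fmap.content. The sketch below only builds such a request URL. The helper name, the core name "plone", the use of the path as the id, and the assumption that the handler is mounted at /update/extract are all illustrative, and whether collective.solr could actually send all of its other fields this way is untested.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, commit=False):
    """Build a Solr Cell URL that extracts and indexes in one request."""
    params = {
        # Value for the unique key field (assumed to be "id" here).
        "literal.id": doc_id,
        # false means: index the extracted text instead of returning it.
        "extractOnly": "false",
        # Store Tika's "content" field in SearchableText.
        "fmap.content": "SearchableText",
    }
    if commit:
        params["commit"] = "true"
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


# Hypothetical core URL and document id.
url = build_extract_url(
    "http://localhost:8983/solr/plone", "/bfd-db/content/mypdf.pdf"
)
```

The binary itself would then be sent as the body of a POST to this URL, so the large text would never have to travel back to Plone as XML.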

For your information this is the whole warning:

2019-09-17 17:04:24,067 WARNING [collective.solr.indexer:178][waitress] Parsing error xmlSAX2Characters: huge text node, line 160970, column 47 (<string>, line 160970) @ /bfd-db/content/mypdf.pdf.
@NicolasGoeddel
Author

I was able to solve the problem using the etree.iterparse() method. I modified collective.solr.indexer.BinaryAdder.__call__() within the try block, directly after the call to conn.doPost, like so:

        try:
            response = conn.doPost(
                url, encodedPost.to_string(), headers
            )

            # Stream the response; huge_tree=True lifts libxml2's
            # limit on the size of a single text node.
            context = etree.iterparse(response, huge_tree=True)

            # Collect the text of every direct child of <response>.
            parts = []
            for event, elem in context:
                parent = elem.getparent()
                if parent is not None and parent.tag == 'response':
                    if elem.text is not None:
                        parts.append(elem.text.strip())
            data["SearchableText"] = u"".join(parts)

        except SolrConnectionException as e:
        ....
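
The loop above can be tried outside Plone with a small stand-alone sketch. The function name and the sample response are assumptions for illustration (the real response shape may differ), and the elem.clear() call is an addition that is not part of the change described above; it only releases each element's text once it has been copied.

```python
from io import BytesIO

from lxml import etree


def extract_text(stream):
    """Join the text of all direct children of the <response> element."""
    parts = []
    # Default events are ("end",), so each element arrives fully parsed.
    for _event, elem in etree.iterparse(stream, huge_tree=True):
        parent = elem.getparent()
        if parent is not None and parent.tag == "response":
            if elem.text is not None:
                parts.append(elem.text.strip())
            # Free the element's content now that the text is copied.
            elem.clear()
    return "".join(parts)


# Hypothetical, shortened stand-in for a Solr extraction response.
sample = (
    b'<response><lst name="responseHeader"><int name="status">0</int></lst>'
    b'<str name="mypdf.pdf">  Hello world  </str></response>'
)
text = extract_text(BytesIO(sample))
```

Collecting the pieces in a list and joining once avoids repeatedly copying a string that may grow to many megabytes.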

@tisto
Member

tisto commented Sep 20, 2019

@NicolasGoeddel thanks for reporting this and providing a fix. This is highly appreciated. I'd be more than happy to review and merge a PR if you would care to open one. :)

@NicolasGoeddel
Author

I will take a look at how PRs work. I have never done one. It seems I have to fork first, make a branch, and so on.

@tisto
Member

tisto commented Sep 21, 2019

@NicolasGoeddel awesome! Yes, you can fork the repo and then do a pull request, or check out the repository from the collective. For the latter option, I would have to add you to the Plone collective. I'd be more than happy to do so if you are ok with it.
