Enable the `huge_tree` option for the `lxml` parser #3365

seberm · 2024-11-18T10:20:03Z

This PR should resolve the problem with the LXML limit on the size of text nodes it can handle:

Polarion XML fails to generate due to xmlSAX2Characters: huge text node #3363

For more info about the huge_tree option, please see:

https://lxml.de/api/lxml.etree.XMLParser-class.html

huge_tree - disable security restrictions and support very deep trees and very long text content (only affects libxml2 2.7+)

Especially this LXML FAQ section about security concerns:

https://lxml.de/FAQ.html#is-lxml-vulnerable-to-xml-bombs

Is lxml vulnerable to XML bombs?

This has nothing to do with lxml itself, only with the parser of libxml2. Since libxml2 version 2.7, the parser imposes hard security limits on input documents to prevent DoS attacks with forged input data. Since lxml 2.2.1, you can disable these limits with the huge_tree parser option if you need to parse really large, trusted documents. All lxml versions will leave these restrictions enabled by default.

Note that libxml2 versions of the 2.6 series do not restrict their parser and are therefore vulnerable to DoS attacks.
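To make the option concrete, here is a minimal sketch of what enabling it looks like on the lxml side. It is not the tmt code: `parse_trusted` and the sample payload are made up for illustration, and the import is guarded only so the snippet also loads where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # lxml is treated as optional in this sketch
    etree = None


def parse_trusted(xml_bytes):
    """Parse trusted XML with libxml2's size/depth limits lifted.

    Returns the root element, or None when lxml is not available.
    """
    if etree is None:
        return None
    # huge_tree=True disables the parser's security limits, so this
    # should only ever see documents we generated ourselves.
    parser = etree.XMLParser(huge_tree=True)
    return etree.fromstring(xml_bytes, parser)


# Hypothetical payload: a single text node of 10 MiB + 1 characters.
payload = b"<system-out>" + b"a" * (10 * 1024 * 1024 + 1) + b"</system-out>"
root = parse_trusted(payload)
if root is not None:
    print(root.tag, len(root.text))
```

The parser object is created once and passed explicitly, so the relaxed limits apply only to the calls that opt in.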

TODOs:

Run the huge-output test without lxml (with just Jinja) to check whether the bottleneck is really lxml (schema validation or pretty print)
Enable the "huge output" test in the CI? -> yes
Add a test for XML deep trees (using a custom junit flavor)

seberm · 2024-11-18T10:32:35Z

Hello @thrix, @happz, @psss,
Do you have any objections or concerns about enabling the huge_tree option, especially from the TestingFarm perspective?

I think we should be perfectly fine with enabling it.

Thanks!

seberm · 2024-11-18T11:53:01Z

@KwisatzHaderach Are you able to somehow test this change and let me know if it resolves your problem?

Processing huge files with LXML can take a lot of resources (especially CPU). We can always add an option/condition to bypass the LXML processing completely and use just Jinja2. LXML is used only to check the XML schema for non-custom JUnit flavors and to prettify the XML output.
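A rough sketch of what such a bypass could look like, assuming the rendered template output is available as a string. The function name `prettify` and the `use_lxml` switch are hypothetical, not existing tmt options, and schema validation is left out to keep the example short.

```python
try:
    from lxml import etree
except ImportError:  # lxml may not be installed at all
    etree = None


def prettify(xml_text, use_lxml=True):
    """Return pretty-printed XML, or the input unchanged when bypassed."""
    if not use_lxml or etree is None:
        # Bypass: hand back the template output exactly as rendered.
        return xml_text
    parser = etree.XMLParser(huge_tree=True, remove_blank_text=True)
    root = etree.fromstring(xml_text.encode("utf-8"), parser)
    return etree.tostring(root, pretty_print=True, encoding="unicode")


rendered = "<testsuites><testsuite name='x'/></testsuites>"
print(prettify(rendered, use_lxml=False))  # untouched template output
print(prettify(rendered))                  # prettified when lxml is present
```

With the bypass in place, a missing or disabled lxml only costs the formatting step; the report itself is still produced.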

tests/report/junit/data/main.fmf

thrix · 2024-11-19T13:20:54Z

> Hello @thrix, @happz, @psss, Do you have any objections or concerns about enabling the huge_tree option, especially from the TestingFarm perspective?
>
> I think we should be perfectly fine with enabling it.
>
> Thanks!

@seberm Fine with me. I assume the XML is not something a user can easily inject, and anyone who wants to DoS us has plenty of other options directly from the tests anyway.

KwisatzHaderach

I didn't get a chance to test it yet, but I'm approving since it looks good and we need this sooner rather than later.

seberm · 2024-11-19T22:33:50Z

So I gave it another round of tests. Let's take the following test as an example:

require:
  - python
result: fail
test: python -c "print((('a' * 1023) + '\n') * 1024 * 10)"

As you can see from the results below, enabling the huge_tree=True option makes sense, and I think it will help solve the #3363 issue.

1) LXML completely bypassed (uninstalled), using just Jinja to generate the JUnit

Takes around 43s on my machine to generate, but it works.

$ python -m tmt -r ~/tmt-learning/huge-out run -a report --how junit --file x    43.01s user 2.08s system 71% cpu 1:03.41 total

2) LXML enabled (schema validation + pretty print), `huge_tree=False`

Takes around 44s on my machine and crashes with the huge text node error.

$ python -m tmt -r ~/tmt-learning/huge-out run -a report --how junit --file x    44.30s user 2.47s system 69% cpu 1:06.87 total

3) LXML enabled (schema validation + pretty print), `huge_tree=True`

Also takes around 44s on my machine and works without error.

$ python -m tmt -r ~/tmt-learning/huge-out run -a report --how junit --file x    44.22s user 2.20s system 71% cpu 1:04.93 total
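For a quick comparison, the "user" CPU times reported above can be put side by side. This is only arithmetic on the three single-run figures quoted in this comment, so it should be read as a rough indication rather than a benchmark.

```python
# "user" CPU seconds copied from the three runs above.
user_seconds = {
    "jinja only": 43.01,
    "lxml, huge_tree=False": 44.30,
    "lxml, huge_tree=True": 44.22,
}

baseline = user_seconds["jinja only"]
for label, seconds in user_seconds.items():
    overhead = seconds - baseline
    print(f"{label}: +{overhead:.2f}s ({overhead / baseline:.1%})")
```

On these numbers the LXML step adds a little over one second, roughly 3% of the total, with or without huge_tree.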

(another?) problem I was facing

If you change the test to the following command (a similar amount of data as in the previous test, but without line breaks):

test: python -c "print('a' * (10 * 1024 * 1024 + 1))"

tmt is still "working" on generating the report and it seems like it will never finish. This behavior is the same with or without LXML installed.
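To see how the two payloads differ, the strings can be rebuilt outside tmt with the same expressions as in the two test commands. The snippet only measures them; it does not establish why the single-line variant appears to hang.

```python
# The two payloads, built the same way as in the test commands above.
many_lines = (("a" * 1023) + "\n") * 1024 * 10
one_line = "a" * (10 * 1024 * 1024 + 1)

for name, data in (("many lines", many_lines), ("one line", one_line)):
    longest = max(len(line) for line in data.split("\n"))
    print(f"{name}: {len(data)} chars, longest line {longest}")
```

The total sizes differ by a single character, while the longest line grows from 1023 characters to the whole payload, so line length is the obvious variable to profile first.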

I would appreciate it if someone could help me to profile this test in the tmt, so we could decide if this is actually a problem that needs to be solved. Perhaps not? Any ideas?

seberm · 2024-11-20T15:59:16Z

@psss @happz Do you think we should enable the test mentioned above (test: python -c "print((('a' * 1023) + '\n') * 1024 * 10)") in the CI? I am not sure it is worth slowing down the CI this way. Thanks.

martinhoyer

If I understand it correctly, we want large depth (above 256) to require huge_tree.

We played with it locally with:

DEPTH=257
OUTPUT_FILE="deep.xml"

# Start the XML file with a root tag
echo "<root>" > "$OUTPUT_FILE"

# Generate nested tags
for i in $(seq 1 $DEPTH); do
    printf "%0.s " >> "$OUTPUT_FILE" # Indent for readability (optional)
    echo "<tag$i>" >> "$OUTPUT_FILE"
done

# Close the nested tags in reverse order
for i in $(seq $DEPTH -1 1); do
    printf "%0.s " >> "$OUTPUT_FILE" # Indent for readability (optional)
    echo "</tag$i>" >> "$OUTPUT_FILE"
done

# Close the root tag
echo "</root>" >> "$OUTPUT_FILE"
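The same document can also be produced from Python, which may be handier inside a test. This is a sketch using only the standard library; `build_deep_xml` and `nesting_depth` are hypothetical helper names, and the depth check uses `xml.etree.ElementTree` rather than lxml so it does not depend on libxml2 limits.

```python
import xml.etree.ElementTree as ET


def build_deep_xml(depth):
    """Return XML text with `depth` nested tags inside a <root> element."""
    opening = "".join(f"<tag{i}>" for i in range(1, depth + 1))
    closing = "".join(f"</tag{i}>" for i in range(depth, 0, -1))
    return f"<root>{opening}{closing}</root>"


def nesting_depth(element):
    """Count levels below `element`, following the first-child chain."""
    depth = 0
    while len(element):
        element = element[0]
        depth += 1
    return depth


xml_text = build_deep_xml(257)
print(nesting_depth(ET.fromstring(xml_text)))  # prints 257
```

Writing `xml_text` to `deep.xml` gives a file equivalent to the one from the shell loop, minus the indentation.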

And then I thought we could catch the exception and only use huge_tree when it's necessary?

from lxml import etree

def parse_with_huge_tree(file_path):
    try:
        tree = etree.parse(file_path)
        print("XML parsed successfully (default settings).")
        return tree
    except etree.LxmlError as e:
        print(f"LxmlError encountered: {e}")
        print("Retrying with huge_tree=True...")
        try:
            parser = etree.XMLParser(huge_tree=True)
            tree = etree.parse(file_path, parser)
            print("XML parsed successfully with huge_tree=True.")
            return tree
        except etree.LxmlError as retry_error:
            print(f"Retry failed with huge_tree=True: {retry_error}")
        except Exception as retry_exception:
            print(f"Unexpected error during retry: {retry_exception}")
    except Exception as e:
        print(f"Unexpected error: {e}")

# Usage
parse_with_huge_tree('./deep.xml')

Make sense?

(and maybe print a warning that it's quite large and we are being forced to use huge_tree, security vulnerability, et cetera)
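A compact variant of that idea with the warning included might look like the sketch below. The logger name and `parse_with_fallback` are hypothetical, the import is guarded so the snippet loads without lxml, and a document that is genuinely malformed will simply fail again on the retry.

```python
import logging

try:
    from lxml import etree
except ImportError:  # lxml is treated as optional in this sketch
    etree = None

log = logging.getLogger("junit-sketch")  # hypothetical logger name


def parse_with_fallback(xml_bytes):
    """Parse with default limits first; retry with huge_tree on failure."""
    if etree is None:
        return None
    try:
        return etree.fromstring(xml_bytes)
    except etree.XMLSyntaxError as error:
        log.warning(
            "Default parser limits hit (%s); retrying with huge_tree=True. "
            "Only do this for trusted input.", error)
        return etree.fromstring(xml_bytes, etree.XMLParser(huge_tree=True))


# 300 nested elements, deeper than the 256 levels discussed above.
deep = b"<root>" + b"<t>" * 300 + b"</t>" * 300 + b"</root>"
tree = parse_with_fallback(deep)
```

The trade-off is that every oversized document gets parsed twice, which matters when a single pass already takes a noticeable amount of CPU time.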

seberm · 2024-11-22T10:53:47Z

Hi @mhoyer,
it's either "large tag depth" and/or "big text content in tag data".

Thanks for testing the deep trees. The deep XML trees in the tmt junit plugin can happen only with custom flavors. I will add a test for it. In the default flavor or when using the polarion jinja template for generating XUnit, this should never happen.

Regarding trying with huge_tree=False first and retrying with huge_tree=True: since the huge_tree option would still be effectively enabled, I don't see a reason to complicate the code, other than to inform the user about potential security/DoS concerns or that it may consume more system resources and take longer. Or is there another reason?

martinhoyer · 2024-11-22T13:19:19Z

> still be effectively enabled, I don't see a reason to complicate the code other than informing the user about potential security/DoS concerns or that it may consume more system resources and take longer. Or is there any more?

Yep, just informing users. I haven't looked into what it actually does, so I'm not sure how beneficial it is to try it without the flag first.

seberm added plugin | junit The junit report plugin plugin | reportportal The reportportal report plugin labels Nov 18, 2024

seberm added this to the 1.40 milestone Nov 18, 2024

seberm requested review from psss, lukaszachy, happz, thrix and janhavlin as code owners November 18, 2024 10:20

seberm added the ci | full test Pull request is ready for the full test execution label Nov 18, 2024

seberm linked an issue Nov 18, 2024 that may be closed by this pull request

Polarion XML fails to generate due to xmlSAX2Characters: huge text node #3363

Open

thrix reviewed Nov 19, 2024

tests/report/junit/data/main.fmf (outdated)

KwisatzHaderach approved these changes Nov 19, 2024

seberm force-pushed the feature/enable-huge-tree-for-lxml branch from 2a7c422 to 5317bc7 Compare November 21, 2024 14:42

seberm requested a review from thrix November 21, 2024 14:42

psss changed the title ~~Enable the huge_tree option for lxml parser~~ Enable the huge_tree option for the lxml parser Nov 21, 2024

martinhoyer reviewed Nov 21, 2024

seberm added 6 commits November 22, 2024 11:55

Enable the huge_tree option for lxml parser

8f07a2d

Add disabled test which generates huge output into JUnit file

4df1be7

Enable and change the test command to a working one

5d58e86

Improve the test the way it executes the execute step only once

8d09156

Use rlRun_LOG instead of teeing the output

31f724a

Add test for XML deep trees

517971a

seberm force-pushed the feature/enable-huge-tree-for-lxml branch from 41204e8 to 517971a Compare November 22, 2024 11:25

Do not remove output files one by one

825cf3d
