DomDocument::loadXML cannot parse files larger than 1,3 GiB #14684

Open
pjpawel opened this issue Jun 27, 2024 · 5 comments

@pjpawel

pjpawel commented Jun 27, 2024

Description

The following code:

<?php
function loadXMLFile($filename) {

    $content = file_get_contents($filename);

    $doc = new DOMDocument();
    
    $success = $doc->loadXML($content);

    if (!$success) {
        echo "Failed to parse $filename\n";
        return null;
    }

    return $doc;
}

function main($filename) {
    $doc = loadXMLFile($filename);

    if ($doc !== null) {
        $root = $doc->documentElement;
        echo "Root element: " . $root->nodeName . "\n";
    } else {
        echo "Empty document or failed to load.\n";
    }
}

if ($argc != 2) {
    echo "Usage: php " . $argv[0] . " <xmlfile>\n";
    exit(1);
}

main($argv[1]);

Resulted in this output:

PHP Warning:  DOMDocument::loadXML(): Memory allocation failed : growing input buffer in domLoadXml.php on line 8

Warning: DOMDocument::loadXML(): Memory allocation failed : growing input buffer in domLoadXml.php on line 8
Root element: tns:JPK

But I expected this output instead:

Root element: tns:JPK

I also tried the LIBXML_PARSEHUGE option, but the result was the same.
The most curious thing is that DOMDocument::load() loads large files correctly.
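
For comparison, here is a minimal sketch of the path-based variant that reportedly works: it passes the filename to DOMDocument::load() instead of first reading the whole file into a PHP string. The helper name loadXmlFromPath and the tiny temporary file are illustrative assumptions, not part of the original report.

```php
<?php
// Hypothetical helper: parse by path so PHP never holds the file in a string.
function loadXmlFromPath(string $filename): ?DOMDocument
{
    $doc = new DOMDocument();
    // LIBXML_PARSEHUGE relaxes libxml's built-in parser limits.
    if (!$doc->load($filename, LIBXML_PARSEHUGE)) {
        return null;
    }
    return $doc;
}

// Demo with a tiny stand-in file; a real run would pass the large XML path.
$tmp = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($tmp, '<root><item/></root>');
$doc = loadXmlFromPath($tmp);
echo 'Root element: ' . $doc->documentElement->nodeName . "\n"; // Root element: root
unlink($tmp);
```

This only changes how the input reaches libxml; the resulting DOM tree still has to fit in memory.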

PHP Version

8.2.5

Operating System

Windows 11, Ubuntu 22.04

@nielsdos
Member

My system runs out of memory before the parse of the file can complete.
The error you're seeing actually comes from libxml. I think it happens because buffer sizes are stored as 32-bit integers inside libxml, and 1.3 GiB * 2 (times 2 because the buffer is doubled when it grows) is above the integer limit.
I'd suggest reporting the issue over at libxml.
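
The arithmetic behind that guess can be checked quickly. This is only a sketch of the hypothesis above: it assumes the limit in question is the signed 32-bit maximum and that the buffer is doubled in a single step, neither of which is confirmed in this thread.

```php
<?php
$gib      = 1024 ** 3;              // bytes in one GiB
$fileSize = (int) (1.3 * $gib);     // roughly the size of the failing input
$doubled  = 2 * $fileSize;          // size requested if the buffer doubles
$int32Max = 2 ** 31 - 1;            // assumed limit: signed 32-bit maximum

var_dump($fileSize <= $int32Max);   // bool(true): the input itself still fits
var_dump($doubled > $int32Max);     // bool(true): the doubled size does not
```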

@pjpawel
Author

pjpawel commented Jul 11, 2024

OK, with libxml2 2.13.1 the function xmlReadMemory returns NULL when loading a huge document.

https://gitlab.gnome.org/GNOME/libxml2/-/issues/451

@nielsdos
Member

Technically, I think this can be worked around by using a custom IO reader.
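
This most likely refers to libxml's C-level IO callbacks inside php-src, which userland code cannot register. A loose userland analogue, sketched under that assumption, is a stream wrapper that serves an in-memory string so that DOMDocument::load() reads it in chunks. The strbuf:// scheme and the StringStream class are hypothetical names, and whether this sidesteps the failure at 1.3 GiB has not been verified here.

```php
<?php
// Hypothetical read-only stream wrapper that exposes PHP strings as strbuf://<key>.
final class StringStream
{
    /** @var array<string, string> registered buffers, keyed by name */
    private static array $buffers = [];

    /** @var resource|null set by PHP when a stream context is passed */
    public $context;

    private string $key = '';
    private int $pos = 0;

    // Register a string and return the URL under which it can be opened.
    public static function put(string $key, string $data): string
    {
        self::$buffers[$key] = $data;
        return 'strbuf://' . $key;
    }

    public function stream_open(string $path, string $mode, int $options, ?string &$opened_path): bool
    {
        $this->key = (string) parse_url($path, PHP_URL_HOST);
        $this->pos = 0;
        return isset(self::$buffers[$this->key]);
    }

    // Hand out the next chunk; only the chunk is copied, not the whole buffer.
    public function stream_read(int $count): string
    {
        $chunk = substr(self::$buffers[$this->key], $this->pos, $count);
        $this->pos += strlen($chunk);
        return $chunk;
    }

    public function stream_eof(): bool
    {
        return $this->pos >= strlen(self::$buffers[$this->key]);
    }

    public function stream_stat(): array
    {
        return ['size' => strlen(self::$buffers[$this->key])];
    }

    // Lets PHP check that the URL exists before opening it.
    public function url_stat(string $path, int $flags): array|false
    {
        $key = (string) parse_url($path, PHP_URL_HOST);
        return isset(self::$buffers[$key]) ? ['size' => strlen(self::$buffers[$key])] : false;
    }
}

stream_wrapper_register('strbuf', StringStream::class);

$doc = new DOMDocument();
$doc->load(StringStream::put('doc1', '<root><child/></root>'), LIBXML_PARSEHUGE);
echo 'Root element: ' . $doc->documentElement->nodeName . "\n";
```

If the data already lives in a file, passing the path to DOMDocument::load() directly is simpler; the wrapper is only of interest when the XML exists solely as a string.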

@nielsdos
Member

When you say that the load method works fine for large files, is that also with a file of roughly 1.3 GiB?

@pjpawel
Author

pjpawel commented Jul 11, 2024

DOMDocument::load() can load files of at least 3 GiB without a problem. Bigger files are out of my range, because loading that file took 15 GB of RAM on my computer.
