
HTML generated from a Docx is way too big #2203

Open
@Mouke

Describe the Bug

I use PHPWord (and PHPSpreadsheet) to convert Word/Excel files (and their OpenOffice equivalents) to PDF, by first converting them to HTML and then running the HTML through DomPDF. When a file contains pictures, the size of the rendered HTML explodes: for instance, a 1MB .docx file becomes a 36MB HTML string. (The PDF conversion then brings it back down to 26MB, which is still far too much.) After dumping the HTML, my guess is that the base64 encoding of the pictures is what makes the size blow up.
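A quick way to test that guess is to measure how much of the HTML string is made up of base64 data URIs. The sketch below is only an illustration: `dataUriStats()` is a hypothetical helper, not part of PHPWord, and it assumes the pictures are embedded as `data:<mime>;base64,...` values. For reference, base64 on its own adds roughly a third to the size of the bytes it encodes, so comparing the encoded and decoded totals against the original file size should show where the growth comes from.

```php
<?php
// Hypothetical helper (not part of PHPWord): reports how many bytes of an
// HTML string are occupied by base64 data URIs.
function dataUriStats(string $html): array
{
    // Capture the payload of every data:<mime>;base64,<payload> occurrence.
    // The possessive quantifier (++) avoids backtracking on large payloads.
    preg_match_all(
        '~data:[\w.+-]+/[\w.+-]+;base64,([A-Za-z0-9+/=]++)~',
        $html,
        $matches
    );

    $encodedBytes = 0; // size of the base64 text inside the HTML
    $decodedBytes = 0; // size of the underlying binary image data
    foreach ($matches[1] as $payload) {
        $encodedBytes += strlen($payload);
        $decodedBytes += strlen(base64_decode($payload));
    }

    return [
        'images'  => count($matches[1]),
        'total'   => strlen($html),
        'encoded' => $encodedBytes,
        'decoded' => $decodedBytes,
    ];
}
```

Calling `print_r(dataUriStats($html));` on the string returned by `getContent()` shows how many pictures were embedded and how their combined size compares with the total. If `decoded` is already far larger than the source .docx, the growth happens before the base64 step.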

Steps to Reproduce

Using this file:
test-long.docx

<?php
require __DIR__ . '/vendor/autoload.php';

$path = 'PATH_TO_FILE';

// IOFactory::load() expects a file path, not the file contents.
$phpWord = \PhpOffice\PhpWord\IOFactory::load($path, 'Word2007');
$htmlWriter = new \PhpOffice\PhpWord\Writer\HTML($phpWord);
$html = $htmlWriter->getContent();
echo strlen($html);

Expected Behavior

I would expect the output to be more compact. I understand that the conversion may produce a bigger file, but in this case it is more than 10x bigger.

Context

Please fill in your environment information:

  • PHP 7.4.28 (cli) (built: Mar 3 2022 09:59:56) ( NTS )
    Copyright (c) The PHP Group
    Zend Engine v3.4.0, Copyright (c) Zend Technologies
  • PHPWord Version 0.18.2
  • Server is a dockerized Ubuntu based on the php:7.4-fpm image.

Best regards,
