.Doc file reading problem with slovak characters #2383

Closed
SamoUhliar opened this issue Mar 1, 2023 · 4 comments · Fixed by #2664

Comments

@SamoUhliar

I'm trying to read a .doc file in PHP (I'm using Laravel v6). To read the .doc I'm using the PHPWord library. It works fine with an English .doc file. But I'm from Slovakia, and our alphabet has characters like á, é, í, č, ň, ô. Those characters are where I have a problem.

My code:

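// Requires at the top of the file:
//   use PhpOffice\PhpWord\IOFactory;
//   use PhpOffice\PhpWord\Element\Text;
protected static function doc_to_text($filename)
{
    // 'MsDoc' selects the reader for the binary Word 97-2003 (.doc) format.
    $objReader = IOFactory::createReader('MsDoc');
    $phpWord = $objReader->load($filename); // instance of \PhpOffice\PhpWord\PhpWord

    $text = '';

    // Concatenate the text of every top-level Text element in each section.
    foreach ($phpWord->getSections() as $section) {
        foreach ($section->getElements() as $element) {
            if ($element instanceof Text) {
                $text .= $element->getText();
            }
        }
    }

    return $text;
}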

This is the output of the function for a Slovak .doc:

"}\x01IVOTOPIS Titul, menoKontaktné údaje:Ulica, stoTelefón: 0xx/ xxx xxx Mobil: 09xx xxx xxx e-mail: \x13 HYPERLINK "mailto:[email protected]" \[email protected]\x15 Dosiahnuté vzdelanie: Vysokoa\x01kolské/stredoa\x01kolskéVzdelanie: 2000-2006 Fakulta/ univerzita1995-2000 stredná aDoplH\x01ujúce informácie o vzdelaní: 1998-2000 kurzy 1996-1997 a\x01tudijné pobytyPracovné skúsenosti: 2000-2004 zamestnávate>\x01, pozícia 2004-2006 zamestnávate> pozícia Jazykové znalosti: Anglický jazyk - aktívne 8:|<U+0094><U+009E>¶ÔÖþ\x16\vB\vZ\v<U+0086>\v
<U+008A>\v®\vä\v\x10B\x10´\x10Ö\x10\x1E\x11F\x11H\x11J\x11üøòøìøüøäøäÞäøìüøìøüøüøìøüøüøìøüøüøìøÜìøìøìøØ\x06\x16h\e\t_\x03U\x08\x01\x16hÌeh0J\x11\x0F\x03j\x16hÌehU\x08\x01\x16hÌeh0J\x10\x16hO\x1FÝ0J\x10\x06\x16hÌeh\x06\x16hO\x1FÝ-\x08\x16\x08.\x08P\x08<U+0082>\x08°\x08Ú\x08R\t¶\tÎ\t:<U+0080>¢Ö&\vF\x04\x13¤d\x14¤d[$\x01\$\x01gdÌeh\x0F&\vF\x03\x13¤d\x14¤d[$\x01\$\x01gdÌeh\x0F&\vF\x02\x13¤d\x14¤d[$\x01\$\x01gdÌeh\x0F&\vF\x01\x13¤d\x14¤d[$\x01\$\x01gdÌeh\x04\x0FgdÌeh\x04\x03gdÌeh\x13PoPHP, C++ XHTML, CSS Microsoft Excel Microsoft Word Vodi preukaz: sk. C (najazdených cca 600 000km) Vlastnosti a záujmy: "

This is the output of the function for an English .doc:
"Lorem ipsum Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio. Vestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut varius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum condimentum. Vivamus dapibus sodales ex, vitae malesuada ipsum cursus convallis. Maecenas sed egestas nulla, ac condimentum orci. Mauris diam felis, vulputate ac suscipit et, iaculis non est. Curabitur semper arcu ac ligula semper, nec luctus nisl blandit. Integer lacinia ante ac libero lobortis imperdiet. Nullam mollis convallis ipsum, ac accumsan nunc vehicula vitae. Nulla eget justo in felis tristique fringilla. Morbi sit amet tortor quis risus auctor condimentum. Morbi in ullamcorper elit. Nulla iaculis tellus sit amet mauris tempus fringilla.Maecenas mauris lectus, lobortis et purus mattis, blandit dictum tellus.Maecenas non lorem quis tellus placerat varius. Nulla facilisi. Aenean congue fringilla justo ut aliquam. In non mauris justo. Duis vehicula mi vel mi pretium, a viverra erat efficitur. Cras aliquam est ac eros varius, id iaculis dui auctor. Duis pretium neque ligula, et pulvinar mi placerat et. "

I have searched Google and asked my friends. I also tried https://github.com/neitanod/forceutf8. I need some ideas about what the problem could be or how to solve it.
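
A note on the garbled Slovak output: the two-byte sequences that stand in for the missing letters look like UTF-16LE code units printed one byte at a time, which would explain why a UTF-8 repair library such as forceutf8 does not help. The sketch below checks that reading against four sequences copied from the output. `decodeUtf16Pair` is a hypothetical helper, not a PHPWord API, and the snippet assumes the `mbstring` extension is loaded. It only illustrates the symptom; it is not a fix for the reader.

```php
<?php
// Hypothetical helper (not part of PHPWord): reinterpret a raw two-byte
// sequence from the garbled output as one UTF-16LE code unit.
function decodeUtf16Pair(string $bytes): string
{
    // Assumes the mbstring extension is available.
    return mb_convert_encoding($bytes, 'UTF-8', 'UTF-16LE');
}

// Byte pairs copied from the Slovak output, with the letter each one
// should have been.
$pairs = [
    "}\x01" => 'Ž', // "}\x01IVOTOPIS"        should read "ŽIVOTOPIS"
    "a\x01" => 'š', // "Vysokoa\x01kolské"    should read "Vysokoškolské"
    "H\x01" => 'ň', // "DoplH\x01ujúce"       should read "Doplňujúce"
    ">\x01" => 'ľ', // "zamestnávate>\x01"    should read "zamestnávateľ"
];

foreach ($pairs as $bytes => $expected) {
    // Print the raw bytes as hex next to the decoded character.
    printf("%s => %s (expected %s)\n", bin2hex($bytes), decodeUtf16Pair($bytes), $expected);
}
```

All four pairs decode to the expected Slovak letters, which are exactly the ones that fall outside the single-byte range where á, é and í survive. That suggests the text is being mangled while the .doc is parsed, before any string reaches user code.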

@huelsgp27

Same problem.

@Progi1984
Member

@SamoUhliar @huelsgp27 Could you provide a sample file, please?

@Progi1984 added the "Status: Waiting for feedback" label (question has been asked, waiting for response from PR author) on Aug 8, 2024
@huelsgp27

@Progi1984 added the "MS-DOC (Word 97)" label and removed the "Status: Waiting for feedback" label on Aug 9, 2024
@Progi1984 self-assigned this on Aug 9, 2024
@Progi1984 added this to the 2.0.0 milestone on Aug 9, 2024
@Progi1984
Member

@SamoUhliar @huelsgp27

This issue has been fixed by a maintainer in PR #2664. You can support him by sponsoring him through GitHub Sponsors.
