ocr-transform hocr text: &#x1b is an invalid XML character #185

Closed
jbarth-ubhd opened this issue Jul 1, 2024 · 2 comments · May be fixed by filak/hOCR-to-ALTO#30

Comments

@jbarth-ubhd

ocr-transform hocr text 010/01.tif/00100156_digit.hocr 
Error on line 19 column 27 of hocr__text.xsl:
  SXXP0003   Error reported by XML parser: Character reference "&#x1b" is an invalid XML
  character.: Character reference "&#x1b" is an invalid XML character.
org.xml.sax.SAXParseException; systemId: file:/usr/local/share/ocr-fileformat/xslt/hocr__text.xsl; lineNumber: 19; columnNumber: 27; Character reference "&#x1b" is an invalid XML character.
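
For context on the error above: XML 1.0 only allows the code points in its `Char` production (tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF), and a character reference such as `&#x1b` (ESC) must also resolve to one of them. The following is a minimal Python sketch of that rule, not code from ocr-fileformat or hOCR-to-ALTO; the helper names `is_xml10_char` and `strip_invalid_xml_chars` are hypothetical.

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point `cp` matches the XML 1.0 `Char` production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_invalid_xml_chars(text: str) -> str:
    """Drop every character that XML 1.0 does not allow (hypothetical helper)."""
    return "".join(ch for ch in text if is_xml10_char(ord(ch)))


if __name__ == "__main__":
    print(is_xml10_char(0x1B))                    # ESC, the character from the error
    print(repr(strip_invalid_xml_chars("a\x1bb")))
```

Under this rule U+001B is rejected, which matches what the parser reports for line 19 of `hocr__text.xsl`.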

input file (gzip, base64):

H4sIAAAAAAACA71Y226jVhR9z1fsUqlMpTGcC5dDghm1yYxSadKOKkfp9MXC5iRBdcADJHb6Lf2U
vs2PdR8YGzh2piGumijyAZu172vtOHyzvlvAgyzKNM/GBrWIATKb50ma3YyNy8m7kTDeREfhN2e/
nE4+fngLtxV+/sPlj+9/OgVjZNtX/NS2zyZn8Nv55OI9IABMijgr0woB44Vtv/3ZOAL8MW6ranls
26vVylpxKy9u7Mmv9lrhUQXw5TiqOk9bSZUYaL02io5m5XgPDA2CoHnaUB86XsTKd5kZsD1FRxDe
yjjBVwirtFrIKLSbV3XnTlYxKOCR/HSfPoyN0zyrZFaNJo9LacC8uRoblVxXtjJ0Mr+Ni1JW4/vq
GjNktyhZfCfHZj4vRuVjWck7c/u0WcmylEU8r8C1HIuasPexebyMZ+kCMyDLzsP41nQZ30hQh3lc
yBiaW0X9ukiz+q31dJUXiTotpyt8+NpURkL7S/DhLE8ea6tJ+gDzRVyWLbQJaTI21WmKztXZGZvp
nTJqWDahBP+sKr22CZ4JdT11YZzAbJavgeAvpcQDz6cnsFQoWQ7kBMp5nE0LWQLjRP2ZyvyO/Tqi
xoHZIp//MaWtD0ZtgAVAHaDEZUA5M2oUCJf9IIpNDEUDUDeAmcj7Z2BBWC7jrIunktoAqtOuS0KQ
BodzcDDUWVzKuhBEBb6elumfErhjueoikeVcZglOGojmTry94VvuxokdL5qKNm6oU9cNs3UDI8Ka
YGQKua48MNeM7kJb4Q0AZ31wSqjoBblBp4EZXfbQ+xf/mky2UxPmNzUJeDeXFoY1YrxNKNfTqSVz
SCp5P1p0wnExXAK+aEPFnpVF8fh6cC4dLZeeAEcAcxwQtIUPEP8iLh4Hw7taH3BfpVC4AQivhXeF
GYVlVeTZTfT5LzTSHAdb8zRrAQUeQBAQEJ22cD0z8rnDBsP7et8R7DenbYitAezqY+v3QzqP62OM
HU58CLxA0cH/N8dCn2PNj+2ocTOilGpJ7V6E9rKhVRt5dS/BlhJJMa7yok+y2hBSoMxTPI5DwLkR
PY23h7D1tAL1g0YTmPsswmZPEvZerGdU2nkahnp9kiEOjGiHZZhWbO8Algm0QjduYKYZcdsyC2pG
5zkuCHLw8FCiTQ/SDEVZ4MoG5T0qu4rLXH73LQ9OyuF2NOnhSkEFBV9JD2U9O6d5gbK/zDFf2Vyu
0s9/DzeniZFPlTmO5IxiQZ1O6l5KoVQTAEG92oLwcVkhok/ScZYMN+DotKlC8GrerHuwa+CHm0y+
oCjujmSrINim0emOkh1Cnq7GGExZAIY5w8nUdNvFkWoZ1KHaSPn9kRKDdiBNjnQ/tp2B6veKWcsi
zSqLBk7w/X9PohrLcLfOOniu6lIxkEW1/GIwuJV4TS0d4j6HR/nTi+9+tGeU3dsP5CHBcLV09qiU
w0g8va/5BzAp1dYE5QdH6mbY7px1Oh0ZMeLHw2dJ02ROSR2n47lNnJ1JuojTZS6Hb4VUkwMn6Oay
07u+sxGE14dMrK8v2ipVoukBrv5b29ZupIrHcGhpWz3xdR2kfEj5GNH37caXOr1+n6henb+rEzy8
iIzuJphzB1wkde6RnpWPcTpcbpmmS26A7nMXPGXI64QhGBJuYuXXw01wXfp4bcIPtBg47sNX1gv1
nDm78qesNL3hdeTJZ7qiH9STQqNMFZ2HY4Ybv6N2os4KPvLaZgy+3oyDFIS5+qhrPnT3MkotyqwB
2rE5hHbzlUtYf28UHf0D+dc3bXETAAA=
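
To reproduce the problem, the attachment above has to be base64-decoded and then gunzipped. A minimal Python sketch of that step follows, assuming the base64 text has been saved to a file; the function name `decode_attachment` and the file names in the usage comment are hypothetical.

```python
import base64
import gzip


def decode_attachment(b64_text: str) -> bytes:
    """Decode base64 text (line breaks are ignored) and gunzip the result."""
    compressed = base64.b64decode(b64_text)
    return gzip.decompress(compressed)


# Usage sketch (file names are placeholders, not taken from the issue):
# with open("attachment.b64") as src, open("input.hocr", "wb") as dst:
#     dst.write(decode_attachment(src.read()))
```
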
stweil referenced this issue in filak/hOCR-to-ALTO Jul 1, 2024
@stweil
Member

stweil commented Jul 1, 2024

Reverting filak/hOCR-to-ALTO@ec3c27f helps.

@stweil
Member

stweil commented Jul 1, 2024

The conversion from alto to text has the same problem.
