From 7be7b1ecd891d6ed3801963b8cf0bf2c13e0387e Mon Sep 17 00:00:00 2001
From: Luca Foppiano
Date: Wed, 25 Dec 2024 23:19:51 +0100
Subject: [PATCH] move changelog outside readme

---
 CHANGELOG.md | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 Readme.md    | 57 +-------------------------------------------
 2 files changed, 68 insertions(+), 56 deletions(-)
 create mode 100644 CHANGELOG.md

diff --git a/CHANGELOG.md b/CHANGELOG.md
new file mode 100644
index 0000000..e148b11
--- /dev/null
+++ b/CHANGELOG.md
@@ -0,0 +1,67 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
+
+## [0.5] -TBD
+
+
+## [0.4]
+
+New in version 0.4 (apart various bug fixes):
+
+- support for xpdf language support package for language-specific fonts like Arabic, Chinese-simplified, Japanese, etc. they are pre-installed locally and portable
+
+- refined line number detection and fixing a bug which could result in random missing numbers in the ALTO output
+
+- update to xpdf-4.03
+
+- fix issue with character spacing due to invalid rotation condition
+
+- update dependencies and dependency install script
+
+## [0.3]
+
+
+- line number detection: line numbers (typically added for review in manuscripts/preprints) are specifically identified and not anymore mixed with the rest of text content, they will be grouped in a separate block or, optionally, not outputted in the ALTO file (`noLineNumbers` option)
+
+- removal of `-blocks` option, the block information are always returned for ensuring ALTO validation (`` element)
+
+- bug fixing on reading order
+
+- fix possible incorrect XMax and YMax values at 0 on block coordinates having only one line
+
+## [0.2]
+
+
+- support Unicode composition of characters
+
+- generalize reading order to all blocks (it was limited to the blocks of the first page)
+
+- detect subscript/superscript text font style attribute
+
+- use SVG as a format for vectorial images
+
+- propagate unsolved character Unicode value (free Unicode range for embedded fonts) as encoded special character in ALTO (so-called "placeholder" approach)
+
+- generate metadata information in a separate XML file (as ALTO schema does not support that)
+
+- use the latest version of xpdf, version 4.00
+
+- add cmake
+
+- [ALTO](https://github.com/altoxml/documentation/wiki) output is replacing custom Xerox XML format
+
+- Note: this released version was used for Grobid release 0.5.6
+
+## [0.1]
+
+- encode URI (using `xmlURIEscape` from libxml2) for the @href attribute content to avoid blocking XML wellformedness issues. From our experiments, this problem happens in average for 2-3 scholar PDF out of one thousand.
+- output coordinates attributes for the BLOCK elements when the `-block` option is selected,
+- add a parameter `-readingOrder` which re-order the blocks following the reading order when the -block option is selected. By default in pdf2xml, the elements followed the PDF content stream (the so-called _raw order_). In xpdf, several text flow orders are available including the raw order and the reading order. Note that, with this modification and this new option, only the blocks are re-ordered.
+ From our experiments, the raw order can diverge quite significantly from the order of elements according to the visual/reading layout in 2-4% of scholar PDF (e.g. title element is introduced at the end of the page element, while visually present at the top of the page), and minor changes can be present in up to 100% of PDF for some scientific publishers (e.g. headnote introduced at the end of the page content). This additional mode can be thus quite useful for information/structure extraction applications exploiting pdfalto output.
+
+- use the latest version of xpdf, version 3.04.
+
+
\ No newline at end of file
diff --git a/Readme.md b/Readme.md
index 8f12dcd..49e95f1 100644
--- a/Readme.md
+++ b/Readme.md
@@ -115,62 +115,7 @@ languages pdfalto xpdfrc
 # Changes
-New in version 0.4 (apart various bug fixes):
-
-- support for xpdf language support package for language-specific fonts like Arabic, Chinese-simplified, Japanese, etc. they are pre-installed locally and portable
-
-- refined line number detection and fixing a bug which could result in random missing numbers in the ALTO output
-
-- update to xpdf-4.03
-
-- fix issue with character spacing due to invalid rotation condition
-
-- update dependencies and dependency install script
-
-New in version 0.3 (apart various bug fixes):
-
-- line number detection: line numbers (typically added for review in manuscripts/preprints) are specifically identified and not anymore mixed with the rest of text content, they will be grouped in a separate block or, optionally, not outputted in the ALTO file (`noLineNumbers` option)
-
-- removal of `-blocks` option, the block information are always returned for ensuring ALTO validation (`` element)
-
-- bug fixing on reading order
-
-- fix possible incorrect XMax and YMax values at 0 on block coordinates having only one line
-
-New in version 0.2 (apart various bug fixes):
-
-- support Unicode composition of characters
-
-- generalize reading order to all blocks (it was limited to the blocks of the first page)
-
-- detect subscript/superscript text font style attribute
-
-- use SVG as a format for vectorial images
-
-- propagate unsolved character Unicode value (free Unicode range for embedded fonts) as encoded special character in ALTO (so-called "placeholder" approach)
-
-- generate metadata information in a separate XML file (as ALTO schema does not support that)
-
-- use the latest version of xpdf, version 4.00
-
-- add cmake
-
-- [ALTO](https://github.com/altoxml/documentation/wiki) output is replacing custom Xerox XML format
-
-- Note: this released version was used for Grobid release 0.5.6
-
-New in version 0.1 (apart various bug fixes):
-
-- encode URI (using `xmlURIEscape` from libxml2) for the @href attribute content to avoid blocking XML wellformedness issues. From our experiments, this problem happens in average for 2-3 scholar PDF out of one thousand.
-
-- output coordinates attributes for the BLOCK elements when the `-block` option is selected,
-
-- add a parameter `-readingOrder` which re-order the blocks following the reading order when the -block option is selected. By default in pdf2xml, the elements followed the PDF content stream (the so-called _raw order_). In xpdf, several text flow orders are available including the raw order and the reading order. Note that, with this modification and this new option, only the blocks are re-ordered.
-
- From our experiments, the raw order can diverge quite significantly from the order of elements according to the visual/reading layout in 2-4% of scholar PDF (e.g. title element is introduced at the end of the page element, while visually present at the top of the page), and minor changes can be present in up to 100% of PDF for some scientific publishers (e.g. headnote introduced at the end of the page content). This additional mode can be thus quite useful for information/structure extraction applications exploiting pdfalto output.
-
-- use the latest version of xpdf, version 3.04.
-
+All changes are in the [CHANGELOG.md](CHANGELOG.md)
 # Contributors