Merge pull request #671 from axa-group/debug-fonts
Add a small utility to fix broken pdf font
Showing 190 changed files with 89,313 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
# Parsr - Fix PDF Font | ||
|
||
**Parsr-fix-pdf-font** is a utility that repairs broken Unicode maps in PDF fonts. Unicode maps can break for several reasons, including incomplete or corrupt font embedding and errors during PDF creation. Such problems can make the text in a PDF file unreadable. | ||
|
||
This tool leverages Tesseract.js, an optical character recognition engine, to recognize the broken glyphs present in the PDF. Once these glyphs are identified, **Parsr-fix-pdf-font** rebuilds the unicode map, ensuring that the PDF becomes readable and retains its original design and layout. | ||
|
||
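To make the idea of a rebuilt Unicode map more concrete: in many PDFs the map is a ToUnicode CMap attached to the font, which lists glyph codes next to the UTF-16BE text they stand for. The sketch below is only an illustration of that format, assuming 2-byte glyph codes; the function name `buildToUnicodeCMap` and its input shape are hypothetical, and nothing shown in this commit confirms how the tool itself writes its map.

```javascript
// Illustrative sketch, not taken from the Parsr source.
// Assumption: glyph codes are 2 bytes wide (as in Identity-H composite fonts).

// Format a number as an upper-case, zero-padded hex string.
function hex(value, width) {
  return value.toString(16).toUpperCase().padStart(width, '0');
}

// Encode a JavaScript string as UTF-16BE hex, one code unit at a time.
function utf16Hex(text) {
  let out = '';
  for (let i = 0; i < text.length; i++) {
    out += hex(text.charCodeAt(i), 4);
  }
  return out;
}

// glyphToText: Map from numeric glyph code to the string recognized for it
// (for example, by an OCR pass over the rendered glyph).
function buildToUnicodeCMap(glyphToText) {
  const entries = [...glyphToText.entries()].sort((a, b) => a[0] - b[0]);
  const lines = [
    '/CIDInit /ProcSet findresource begin',
    '12 dict begin',
    'begincmap',
    '/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def',
    '/CMapName /Adobe-Identity-UCS def',
    '/CMapType 2 def',
    '1 begincodespacerange',
    '<0000> <FFFF>',
    'endcodespacerange',
  ];
  // A single bfchar section may list at most 100 mappings, so chunk the entries.
  for (let i = 0; i < entries.length; i += 100) {
    const chunk = entries.slice(i, i + 100);
    lines.push(`${chunk.length} beginbfchar`);
    for (const [code, text] of chunk) {
      lines.push(`<${hex(code, 4)}> <${utf16Hex(text)}>`);
    }
    lines.push('endbfchar');
  }
  lines.push('endcmap', 'CMapName currentdict /CMap defineresource pop', 'end', 'end');
  return lines.join('\n');
}
```

Calling `buildToUnicodeCMap(new Map([[1, 'A'], [2, 'B']]))` returns the CMap text with one `bfchar` section holding both mappings. A glyph that OCR reads as several characters (a ligature, say) can map to a multi-character string in the same way.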
## Features | ||
|
||
- OCR-Powered Correction: Uses Tesseract.js to perform optical character recognition on the broken glyphs and recover the characters they represent. | ||
|
||
- Rebuilding Unicode Maps: After identifying the incorrect mappings, the tool regenerates the correct unicode map, preserving the original design of the PDF. | ||
|
||
- Easy-to-Use Command Line Interface: Simplified command line usage for quick fixes. | ||
|
||
## Requirements | ||
|
||
Node.js > 18 | ||
|
||
ImageMagick (`convert` command) | ||
|
||
|
||
## Usage | ||
Use the command line interface to run the Parsr tool: | ||
|
||
``` | ||
parsr-fix-pdf-font --input <path-to-pdf> --output <path-to-out-pdf> --lang eng | ||
``` | ||
Parameters: | ||
|
||
- --input <path-to-pdf>: Specifies the path to the source PDF file that needs to be fixed. | ||
- --output <path-to-out-pdf>: Designates the path where the fixed PDF will be saved. If the specified file already exists, it will be overwritten. | ||
- --lang eng: Sets the language for the OCR process. By default, it's set to English (eng). Tesseract supports multiple languages, so ensure you choose the appropriate one for your document. | ||
|
||
## Troubleshooting | ||
If you encounter any issues: | ||
|
||
Inspect PDF: Ensure that the PDF isn't password protected or encrypted. If it is, decrypt it before running the tool. | ||
|
||
Language Mismatch: If the OCR isn't accurate, ensure you've chosen the correct language setting for the document. | ||
|
||
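As a quick pre-check for the first point, the sketch below looks for an `/Encrypt` entry, which an encrypted PDF carries in its trailer or cross-reference stream dictionary. It is a heuristic under stated assumptions, not part of this tool: the helper name `looksEncrypted` is hypothetical, the whole file is read into memory, and a stray `/Encrypt` string elsewhere in the file would produce a false positive.

```javascript
const fs = require('node:fs');

// Heuristic check (hypothetical helper, not part of parsr-fix-pdf-font):
// returns true when the raw bytes of the file contain the name "/Encrypt".
function looksEncrypted(pdfPath) {
  const bytes = fs.readFileSync(pdfPath);
  // Trailer and cross-reference stream dictionaries are stored uncompressed,
  // so a plain byte search is enough for a first check.
  return bytes.includes('/Encrypt', 0, 'latin1');
}
```

If `looksEncrypted('document.pdf')` returns `true`, decrypt the file with a PDF tool of your choice before running the fixer.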
## Limits | ||
|
||
Tesseract OCR is not very accurate on single glyphs, but the resulting text is at least readable and understandable for an LLM. | ||
|
||
We do not reconstruct the XREF table yet. If needed, a tool such as `mutool clean` can repair it. | ||
|
||
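The XREF repair step can be scripted from Node by calling MuPDF's `mutool clean <input.pdf> <output.pdf>` after the fixer has run. The sketch below assumes `mutool` is installed and on the `PATH`; the helper names `mutoolCleanArgs` and `rebuildXref` are hypothetical and not part of this package.

```javascript
const { execFile } = require('node:child_process');

// Build the argument list for `mutool clean input.pdf output.pdf`.
function mutoolCleanArgs(inputPdf, outputPdf) {
  return ['clean', inputPdf, outputPdf];
}

// Run mutool and resolve with the output path once it exits successfully.
// Assumption: the `mutool` binary from MuPDF is available on the PATH.
function rebuildXref(inputPdf, outputPdf) {
  return new Promise((resolve, reject) => {
    execFile('mutool', mutoolCleanArgs(inputPdf, outputPdf), (error) => {
      if (error) {
        reject(error);
      } else {
        resolve(outputPdf);
      }
    });
  });
}
```

For example, `rebuildXref('fixed.pdf', 'fixed-clean.pdf')` writes a cleaned copy and leaves the fixer's output untouched.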
## Contribution | ||
Parsr is an open-source tool. Contributions in the form of bug reports, feature requests, or code are always welcome. Check our GitHub repository for more details. | ||
|
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
{ | ||
"name": "fixfontinpdf", | ||
"version": "1.0.0", | ||
"description": "CLI to fix PDF fonts", | ||
"main": "fixPdfFonts.js", | ||
"directories": { | ||
"test": "test" | ||
}, | ||
"bin": { | ||
"parsr-fix-pdf-font": "fix-pdf-font.js" | ||
}, | ||
"scripts": { | ||
"test": "echo \"Error: no test specified\" && exit 1" | ||
}, | ||
"author": "", | ||
"license": "ISC", | ||
"dependencies": { | ||
"commander": "^11.0.0", | ||
"opentype.js": "^1.3.4", | ||
"tesseract.js": "^5.0.0" | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
//const dotenv = require('dotenv'); | ||
//dotenv.config({ path: require('find-config')('.env') }); | ||
|
||
const path = require('path'); | ||
const fs = require('fs'); | ||
const outDirPath = `${__dirname}/tmp`; | ||
|
||
|
||
const extractAndCorrectFontsFromPDF = require('./src/extractAndCorrectFontsFromPDF.js'); | ||
|
||
let filePath = (process.argv.length > 2) ? process.argv[2] : `${__dirname}/testPDF/test.pdf`; | ||
|
||
const { Command } = require('commander'); | ||
const program = new Command(); | ||
|
||
program | ||
.name('parsr-fix-pdf-font') | ||
.description('CLI to fix PDF fonts') | ||
.version('0.0.1') | ||
.option('--input <pdf-input-file-path>') | ||
.option('--output <pdf-output-file-path>') | ||
.option('--lang <language-code>') | ||
.parse(); | ||
|
||
const options = program.opts(); | ||
|
||
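if (!options.input) { | ||
console.error('--input is required'); | ||
process.exit(1); // exit with a failure status so callers can detect the error | ||
} | ||
|
||
if (!options.output) { | ||
console.error('--output is required'); | ||
process.exit(1); // exit with a failure status so callers can detect the error | ||
} | ||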
|
||
|
||
async function main(input, output, lang='eng') { | ||
if (!fs.existsSync(outDirPath)) { | ||
fs.mkdirSync(outDirPath); | ||
} | ||
|
||
await extractAndCorrectFontsFromPDF(input, output, lang, outDirPath); | ||
|
||
return; | ||
|
||
} | ||
|
||
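// Report failures from the async pipeline instead of leaving the rejection unhandled. | ||
main(options.input, options.output, options.lang).catch((err) => { | ||
console.error(err); | ||
process.exit(1); | ||
}); |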
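The same required-option checks can also be sketched without commander, using Node's built-in `util.parseArgs` (available from Node 18.3). This is an alternative for illustration, not the code in this commit; the function name `parseCliOptions` is hypothetical, and the `eng` default mirrors the default of `main` above.

```javascript
const { parseArgs } = require('node:util');

// Parse --input, --output and --lang from an argument array.
// Throws when a required option is missing or an unknown option is passed.
function parseCliOptions(argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      input: { type: 'string' },
      output: { type: 'string' },
      lang: { type: 'string' },
    },
  });

  for (const name of ['input', 'output']) {
    if (!values[name]) {
      throw new Error(`--${name} is required`);
    }
  }

  return {
    input: values.input,
    output: values.output,
    lang: values.lang ?? 'eng', // same default language as main()
  };
}
```

A caller would wrap `parseCliOptions(process.argv.slice(2))` in a `try`/`catch`, print the error message, and set `process.exitCode = 1` so that shell scripts can detect the failure.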