Merge pull request #671 from axa-group/debug-fonts
Add a small utility to fix broken pdf font
BinaryBrain authored Oct 2, 2023
2 parents b1fc36f + 243ecf7 commit 0e5fe8f
Showing 190 changed files with 89,313 additions and 0 deletions.
49 changes: 49 additions & 0 deletions parsr-fix-pdf-font/README.md
@@ -0,0 +1,49 @@
# Parsr - Fix PDF Font

**Parsr-fix-pdf-font** is a utility that repairs broken Unicode maps in PDF fonts. A broken Unicode map can result from incomplete or corrupt font embedding, or from problems during PDF creation. Such problems can make the text in a PDF file unreadable or undecipherable to text-extraction tools.

This tool uses Tesseract.js, an optical character recognition (OCR) engine, to recognize the glyphs of the broken fonts in the PDF. Once these glyphs are identified, **Parsr-fix-pdf-font** rebuilds the Unicode map so that the PDF text becomes readable while the original design and layout are retained.
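
As an illustration of what "rebuilding the Unicode map" can involve, the sketch below generates the text of a PDF ToUnicode CMap from a table of glyph codes and the strings recognized for them. It is not taken from this tool's source: `buildToUnicodeCMap`, its `Map` input, and the fixed two-byte code space are assumptions made for the example.

```javascript
// Sketch only: builds the text of a ToUnicode CMap from a glyph-code -> string map.
// `buildToUnicodeCMap` and its input shape are hypothetical, not this tool's API.

// Format a number as a fixed-width, upper-case hex string.
const hex = (n, width) => n.toString(16).toUpperCase().padStart(width, '0');

// Encode a string as UTF-16BE hex (JavaScript strings are UTF-16 code units already).
function utf16beHex(text) {
  let out = '';
  for (let i = 0; i < text.length; i++) {
    out += hex(text.charCodeAt(i), 4);
  }
  return out;
}

function buildToUnicodeCMap(glyphToText) {
  // Sort by glyph code so the output is deterministic.
  const entries = [...glyphToText.entries()].sort((a, b) => a[0] - b[0]);
  const lines = [
    '/CIDInit /ProcSet findresource begin',
    '12 dict begin',
    'begincmap',
    '/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def',
    '/CMapName /Adobe-Identity-UCS def',
    '/CMapType 2 def',
    '1 begincodespacerange',
    '<0000> <FFFF>', // assumption: two-byte character codes
    'endcodespacerange',
  ];
  // A single bfchar section may hold at most 100 mappings, so emit chunks.
  for (let i = 0; i < entries.length; i += 100) {
    const chunk = entries.slice(i, i + 100);
    lines.push(`${chunk.length} beginbfchar`);
    for (const [code, text] of chunk) {
      lines.push(`<${hex(code, 4)}> <${utf16beHex(text)}>`);
    }
    lines.push('endbfchar');
  }
  lines.push('endcmap', 'CMapName currentdict /CMap defineresource pop', 'end', 'end');
  return lines.join('\n');
}

// Example: OCR recognized glyph 1 as "A" and glyph 2 as the ligature "fi".
const cmap = buildToUnicodeCMap(new Map([[1, 'A'], [2, 'fi']]));
console.log(cmap);
```

A glyph may map to more than one character (as with the ligature above), which is why the right-hand side of each entry is a string rather than a single code point.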

## Features

- OCR Powered Correction: Uses Tesseract.js to perform Optical Character Recognition on the broken glyphs, ensuring accurate text representation.

- Rebuilding Unicode Maps: After identifying the incorrect mappings, the tool regenerates the Unicode map while preserving the original design of the PDF.

- Easy-to-Use Command Line Interface: Simplified command line usage for quick fixes.

## Requirements

- Node.js >18
- ImageMagick (`convert` command)


## Usage
Use the command line interface to run the Parsr tool:

```
parsr-fix-pdf-font --input <path-to-pdf> --output <path-to-out-pdf> --lang eng
```
Parameters:

- `--input <path-to-pdf>`: Path to the source PDF file that needs to be fixed. Required.
- `--output <path-to-out-pdf>`: Path where the fixed PDF will be saved. Required. If the specified file already exists, it will be overwritten.
- `--lang eng`: Language used for the OCR process. Defaults to English (`eng`). Tesseract supports multiple languages, so choose the one that matches your document.

## Troubleshooting
If you encounter any issues:

- Inspect PDF: Ensure that the PDF isn't password protected or encrypted. If it is, decrypt it before running the tool.

- Language Mismatch: If the OCR isn't accurate, ensure you've chosen the correct language setting for the document.
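
A quick way to check for encryption before running the tool is to look for an `/Encrypt` entry in the file. The sketch below is a rough heuristic and not part of this tool: `looksEncrypted` and `fileLooksEncrypted` are hypothetical helper names, and a plain byte search can give false positives, so a real PDF parser is more reliable.

```javascript
const fs = require('fs');

// Heuristic sketch: an encrypted PDF references an /Encrypt dictionary from its
// trailer (or cross-reference stream dictionary). This only searches the raw bytes.
function looksEncrypted(pdfBytes) {
  return pdfBytes.includes('/Encrypt');
}

// Convenience wrapper that reads the file from disk first.
function fileLooksEncrypted(pdfPath) {
  return looksEncrypted(fs.readFileSync(pdfPath));
}
```

For example, `fileLooksEncrypted('input.pdf')` returning `true` suggests the file should be decrypted first.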

## Limits

Tesseract OCR is not very accurate on single glyphs, so some characters may be misrecognized. Even so, the resulting text is readable and understandable, at least for an LLM.

We do not reconstruct the XREF table yet. If needed, a tool such as `mutool clean` can repair it.
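
As a sketch of that post-processing step, the snippet below calls MuPDF's `mutool clean` from Node.js to rewrite the fixed PDF. It assumes `mutool` is installed and on the `PATH`; the function names `mutoolCleanArgs` and `rebuildXref` are hypothetical and not part of this tool.

```javascript
const { execFileSync } = require('child_process');

// Build the argument list for `mutool clean <input> <output>`.
function mutoolCleanArgs(inputPdf, outputPdf) {
  return ['clean', inputPdf, outputPdf];
}

// Run mutool; throws if the binary is missing or exits with a non-zero status.
function rebuildXref(inputPdf, outputPdf) {
  execFileSync('mutool', mutoolCleanArgs(inputPdf, outputPdf), { stdio: 'inherit' });
}
```

A call such as `rebuildXref('fixed.pdf', 'fixed-clean.pdf')` would then run after the font fix has written `fixed.pdf`.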

## Contribution
Parsr is an open-source tool. Contributions in the form of bug reports, feature requests, or code are always welcome. Check our GitHub repository for more details.

Binary file added parsr-fix-pdf-font/eng.traineddata
Binary file not shown.
158 changes: 158 additions & 0 deletions parsr-fix-pdf-font/package-lock.json


22 changes: 22 additions & 0 deletions parsr-fix-pdf-font/package.json
@@ -0,0 +1,22 @@
{
  "name": "fixfontinpdf",
  "version": "1.0.0",
  "description": "# Usage",
  "main": "fixPdfFonts.js",
  "directories": {
    "test": "test"
  },
  "bin": {
    "parsr-fix-pdf-font": "fix-pdf-font.js"
  },
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "commander": "^11.0.0",
    "opentype.js": "^1.3.4",
    "tesseract.js": "^5.0.0"
  }
}
49 changes: 49 additions & 0 deletions parsr-fix-pdf-font/parsr-fix-pdf-font.js
@@ -0,0 +1,49 @@
//const dotenv = require('dotenv');
//dotenv.config({ path: require('find-config')('.env') });

const fs = require('fs');
const { Command } = require('commander');

const extractAndCorrectFontsFromPDF = require('./src/extractAndCorrectFontsFromPDF.js');

// Temporary directory handed to the extraction step.
const outDirPath = `${__dirname}/tmp`;

const program = new Command();

program
  .name('parsr-fix-pdf-font')
  .description('CLI to fix PDF fonts')
  .version('0.0.1')
  .option('--input <pdf-input-file-path>', 'path to the PDF to fix')
  .option('--output <pdf-output-file-path>', 'path where the fixed PDF is written')
  .option('--lang <language-code>', 'language code for the OCR step (defaults to eng)')
  .parse();

const options = program.opts();

// Both paths are mandatory: report the problem and exit with a non-zero status.
if (!options.input) {
  console.error('--input is required');
  process.exit(1);
}

if (!options.output) {
  console.error('--output is required');
  process.exit(1);
}


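async function main(input, output, lang = 'eng') {
  // Create the temporary directory on first run.
  if (!fs.existsSync(outDirPath)) {
    fs.mkdirSync(outDirPath);
  }

  await extractAndCorrectFontsFromPDF(input, output, lang, outDirPath);
}

main(options.input, options.output, options.lang).catch((err) => {
  // Surface failures instead of leaving an unhandled promise rejection.
  console.error(err);
  process.exit(1);
});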