Merge pull request #671 from axa-group/debug-fonts

Add a small utility to fix broken pdf font
axa-group · Oct 2, 2023 · 0e5fe8f
2 parents b1fc36f + 243ecf7

Showing 190 changed files with 89,313 additions and 0 deletions.
diff --git a/parsr-fix-pdf-font/README.md b/parsr-fix-pdf-font/README.md
@@ -0,0 +1,49 @@
+# Parsr - Fix PDF Font
+
+**Parsr-fix-pdf-font** is a utility that repairs broken Unicode maps in PDF fonts. A Unicode map can be broken for various reasons, including incomplete or corrupt font embedding or problems during PDF creation. When it is, the text extracted from the PDF is unreadable, even though the page may still display correctly.
+
+This tool uses Tesseract.js, an optical character recognition (OCR) engine, to recognize the glyphs of the affected fonts. Once the glyphs are identified, **Parsr-fix-pdf-font** rebuilds the Unicode map so that the PDF text becomes readable while the original design and layout are retained.
+
+## Features
+
+- OCR-Powered Correction: Uses Tesseract.js to run optical character recognition on the broken glyphs and recover the characters they represent.
+
+- Rebuilding Unicode Maps: After identifying the incorrect mappings, the tool regenerates the Unicode map while preserving the original design of the PDF.
+
+- Easy-to-Use Command Line Interface: Simplified command line usage for quick fixes.
+
+## Requirements
+
+- Node.js >18
+
+- ImageMagick (`convert` command)
+
+
+## Usage
+Use the command line interface to run the Parsr tool:
+
+```
+  parsr-fix-pdf-font --input <path-to-pdf> --output <path-to-out-pdf> --lang eng
+```
+Parameters:
+
+- --input <path-to-pdf>: Path to the source PDF file that needs to be fixed. Required.
+- --output <path-to-out-pdf>: Path where the fixed PDF will be saved. Required. If the specified file already exists, it will be overwritten.
+- --lang <language-code>: Language used for the OCR process. Defaults to English (eng). Tesseract supports multiple languages, so choose the one that matches your document.
+
+## Troubleshooting
+If you encounter any issues:
+
+Inspect PDF: Ensure that the PDF isn't password protected or encrypted. If it is, decrypt it before running the tool.
+
+Language Mismatch: If the OCR isn't accurate, ensure you've chosen the correct language setting for the document.
+
+## Limits
+
+Tesseract OCR is not very accurate on single glyphs, so some characters may be mapped incorrectly. The resulting text is still readable and understandable, at least for an LLM.
+
+The XREF table is not rebuilt yet. If needed, a tool such as `mutool clean` can repair it.
+
+## Contribution
+Parsr is an open-source tool. Contributions in the form of bug reports, feature requests, or code are always welcome. Check our GitHub repository for more details.
+
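The README describes rebuilding the Unicode map once OCR has identified each glyph. A minimal sketch of that final step, assuming the recognized glyphs are available as a `Map` from two-byte glyph code to text. `buildToUnicodeCMap` and its helpers are hypothetical names that are not part of this commit, and the actual implementation may build the map differently:

```javascript
// Hypothetical helpers for illustration only; not part of this commit.

// Formats a number as upper-case hex, left-padded with zeros to `width` digits.
function toHex(value, width) {
  return value.toString(16).toUpperCase().padStart(width, '0');
}

// Encodes a string as UTF-16BE hex. charCodeAt yields UTF-16 code units,
// so characters outside the BMP come out as a surrogate pair.
function utf16beHex(text) {
  let hex = '';
  for (let i = 0; i < text.length; i++) {
    hex += toHex(text.charCodeAt(i), 4);
  }
  return hex;
}

// Builds the text of a ToUnicode CMap from a Map of glyph code -> recognized text.
// Assumption: glyph codes are two bytes wide (0x0000 to 0xFFFF).
function buildToUnicodeCMap(glyphToText) {
  const entries = [...glyphToText.entries()].sort((a, b) => a[0] - b[0]);
  const lines = [
    '/CIDInit /ProcSet findresource begin',
    '12 dict begin',
    'begincmap',
    '/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def',
    '/CMapName /Adobe-Identity-UCS def',
    '/CMapType 2 def',
    '1 begincodespacerange',
    '<0000> <FFFF>',
    'endcodespacerange',
  ];
  // Emit the mappings in bfchar sections of at most 100 entries each.
  for (let i = 0; i < entries.length; i += 100) {
    const chunk = entries.slice(i, i + 100);
    lines.push(`${chunk.length} beginbfchar`);
    for (const [code, text] of chunk) {
      lines.push(`<${toHex(code, 4)}> <${utf16beHex(text)}>`);
    }
    lines.push('endbfchar');
  }
  lines.push('endcmap', 'CMapName currentdict /CMap defineresource pop', 'end', 'end');
  return lines.join('\n');
}
```

For example, `buildToUnicodeCMap(new Map([[0x24, 'A']]))` produces a CMap with the single entry `<0024> <0041>`. The resulting text would be stored as the stream referenced by the font's `/ToUnicode` entry.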
diff --git a/parsr-fix-pdf-font/eng.traineddata b/parsr-fix-pdf-font/eng.traineddata
diff --git a/parsr-fix-pdf-font/package-lock.json b/parsr-fix-pdf-font/package-lock.json
diff --git a/parsr-fix-pdf-font/package.json b/parsr-fix-pdf-font/package.json
@@ -0,0 +1,22 @@
+{
+  "name": "fixfontinpdf",
+  "version": "1.0.0",
+  "description": "Utility to fix broken Unicode maps of PDF fonts",
+  "main": "parsr-fix-pdf-font.js",
+  "directories": {
+    "test": "test"
+  },
+  "bin": {
+    "parsr-fix-pdf-font": "parsr-fix-pdf-font.js"
+  },
+  "scripts": {
+    "test": "echo \"Error: no test specified\" && exit 1"
+  },
+  "author": "",
+  "license": "ISC",
+  "dependencies": {
+    "commander": "^11.0.0",
+    "opentype.js": "^1.3.4",
+    "tesseract.js": "^5.0.0"
+  }
+}
diff --git a/parsr-fix-pdf-font/parsr-fix-pdf-font.js b/parsr-fix-pdf-font/parsr-fix-pdf-font.js
@@ -0,0 +1,49 @@
+#!/usr/bin/env node
+// CLI entry point: fixes broken Unicode maps of the fonts in a PDF file.
+
+const fs = require('fs');
+const path = require('path');
+
+const { Command } = require('commander');
+
+const extractAndCorrectFontsFromPDF = require('./src/extractAndCorrectFontsFromPDF.js');
+
+// Temporary working directory handed to extractAndCorrectFontsFromPDF; created in main().
+const outDirPath = path.join(__dirname, 'tmp');
+
+const program = new Command();
+
+program
+  .name('parsr-fix-pdf-font')
+  .description('CLI to fix PDF fonts')
+  .version('0.0.1')
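+  .option('--input <pdf-input-file-path>', 'path to the PDF file to fix (required)')
+  .option('--output <pdf-output-file-path>', 'path where the fixed PDF is written (required)')
+  .option('--lang <language-code>', 'Tesseract language code used for OCR', 'eng')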
+  .parse();
+
+const options = program.opts();
+
+if (!options.input) {
+  console.error('--input is required');
+  process.exit(1);
+}
+
+if (!options.output) {
+  console.error('--output is required');
+  process.exit(1);
+}
+
+
+// Repairs the fonts of the PDF at `input` and writes the result to `output`.
+async function main(input, output, lang = 'eng') {
+  if (!fs.existsSync(outDirPath)) {
+    fs.mkdirSync(outDirPath);
+  }
+  await extractAndCorrectFontsFromPDF(input, output, lang, outDirPath);
+}
+
+main(options.input, options.output, options.lang).catch((err) => {
+  console.error(err);
+  process.exit(1);
+});
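
The two required-option checks and the language default could be gathered in one function that is easy to test in isolation. A minimal sketch, where `resolveOptions` is a hypothetical helper that is not part of this commit and takes the plain object returned by `program.opts()`:

```javascript
// Hypothetical helper for illustration only; not part of this commit.
// Validates the parsed CLI options and applies the default OCR language.
function resolveOptions(opts) {
  // Collect every missing required option so the user sees them all at once.
  const missing = ['input', 'output'].filter((name) => !opts[name]);
  if (missing.length > 0) {
    const flags = missing.map((name) => `--${name}`).join(', ');
    throw new Error(`Missing required option(s): ${flags}`);
  }
  return {
    input: opts.input,
    output: opts.output,
    // Same default as main(): English.
    lang: opts.lang ?? 'eng',
  };
}
```

Called as `resolveOptions({ input: 'in.pdf', output: 'out.pdf' })`, it returns the same options with `lang` set to `'eng'`. The caller would catch the error, print its message and exit with a non-zero code.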