Read and extract text and other content from PDFs in C# (port of PDFBox)
Updated Jul 1, 2024 - C#
A Gtk/Qt front-end to tesseract-ocr.
OCR engine for all the languages
Document Layout Analysis resources for development with PdfPig.
Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
Convert between Tesseract hOCR and ALTO XML using XSL stylesheets
Text Overlay plugin for Mirador 3
Ergonomic line-by-line transcription of scanned text.
Text-to-tibble
Some basic data and text extraction from the New York City Directories
Python parser for hOCR files using lxml