A Web App and API made for the Codefiesta 2022 Hackathon
A demo is hosted on AWS at http://ec2-54-174-113-31.compute-1.amazonaws.com:3000
OCR of the PDF takes some time after uploading. Have patience 😄
For now, this works great only for PDFs in this format
Crores of FIR copies from states all over India are stored as PDFs, many of them in regional languages. These need to be parsed and their information collected.
- A Web App for uploading a PDF (or a PDF URL) and obtaining a JSON string with all the necessary info about the FIR, nicely formatted inside the HTML page, available for copy-pasting.
- An API for batch-processing multiple files and getting a JSON object as a response, which can be stored directly in a NoSQL DB like MongoDB.
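As a rough illustration of how a client might use the batch API, here is a minimal Node.js sketch (Node 18+ for the built-in `fetch`, `FormData` and `Blob`). The route `/api/extract`, the form field name `pdf`, and the helper names `extractOne` and `processBatch` are assumptions made for this example; check the project's routes for the real ones.

```javascript
// Sketch only: "/api/extract" and the form field "pdf" are hypothetical
// placeholders, not taken from this project's source.
const fs = require("node:fs/promises");
const path = require("node:path");

// Upload one PDF and return the parsed JSON response.
async function extractOne(filePath, baseUrl = "http://localhost:3000") {
  const bytes = await fs.readFile(filePath);
  const form = new FormData();
  form.append(
    "pdf",
    new Blob([bytes], { type: "application/pdf" }),
    path.basename(filePath)
  );
  const res = await fetch(`${baseUrl}/api/extract`, { method: "POST", body: form });
  if (!res.ok) throw new Error(`${filePath}: HTTP ${res.status}`);
  return res.json();
}

// Process files one at a time (OCR is slow) and keep failures separate,
// so one bad PDF does not abort the whole batch.
async function processBatch(filePaths, extract = extractOne) {
  const docs = [];
  const failures = [];
  for (const filePath of filePaths) {
    const source = path.basename(filePath);
    try {
      docs.push({ source, fir: await extract(filePath) });
    } catch (err) {
      failures.push({ source, error: err.message });
    }
  }
  return { docs, failures };
}
```

The resulting `docs` array is plain JSON, so it could be passed to something like `collection.insertMany(docs)` from the official MongoDB Node.js driver, assuming that driver is installed and a collection is set up.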
This project is meant to be run on an Ubuntu server (or certain other Debian-based distros). It is possible to run it on other platforms, although the setup is not as straightforward as the steps below.
- Clone the repo and enter its directory
git clone https://github.com/shell-raiser/codefiesta-web-app.git
cd codefiesta-web-app
- Install all dependencies and packages
sudo apt-get install tesseract-ocr -y
npm install
sudo wget -P /usr/share/tesseract-ocr/4.00/tessdata/ \
  https://github.com/tesseract-ocr/tessdata/raw/4.00/hin.traineddata \
  https://github.com/tesseract-ocr/tessdata/raw/4.00/tam.traineddata \
  https://github.com/tesseract-ocr/tessdata/raw/4.00/pan.traineddata \
  https://github.com/tesseract-ocr/tessdata/raw/4.00/ori.traineddata \
  https://github.com/tesseract-ocr/tessdata/raw/4.00/mar.traineddata \
  https://github.com/tesseract-ocr/tessdata/raw/4.00/mal.traineddata \
  https://github.com/tesseract-ocr/tessdata/raw/4.00/kan.traineddata \
  https://github.com/tesseract-ocr/tessdata/raw/4.00/guj.traineddata \
  https://github.com/tesseract-ocr/tessdata/raw/4.00/tel.traineddata
- Run the command below to start the server
npm start
and open localhost:3000 in a browser to view the website
https://docs.google.com/presentation/d/1S7rnvWj5elVapgD0HpdAr4orYBSNJtxE78vGL2NQ1WE/edit