Parsify

A Web App and API made for the Codefiesta 2022 Hackathon

AWS Hosting

A demo is hosted on AWS at http://ec2-54-174-113-31.compute-1.amazonaws.com:3000
OCR takes a while to run after a PDF is uploaded, so please be patient 😄
For now, this works well only for PDFs in this format

Problem statement: PDF parser for FIR copy

There are crores of FIR copies stored as PDFs across the states of India, many of them in regional languages. These need to be parsed and the information in them collected.

Our Solution:

A Web App for uploading a PDF (or a PDF URL) and obtaining a JSON string with all the necessary info about the FIR, nicely formatted inside the HTML page and ready for copy-pasting.
An API for batch-processing multiple files that returns a JSON object as the response, which can be stored directly in a NoSQL DB like MongoDB.
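
As a rough illustration of the batch workflow described above, the sketch below sends a list of PDF URLs to the API one at a time and collects the JSON responses into an array of documents. The route `/api/parse`, the request body shape `{ url }`, and the wrapper fields `source`, `ok`, `fir` and `error` are all assumptions made for this example; they are not taken from the Parsify source, so check `index.js` for the real route and payload.

```javascript
// Hypothetical sketch: the route and the request/response shapes below are
// assumptions for illustration, not the documented Parsify API.
const PARSIFY_URL = "http://localhost:3000/api/parse"; // assumed route

// Send one PDF URL to the (assumed) endpoint and return the parsed JSON object.
// fetchImpl is injectable so the function can be exercised without a server.
async function parseOne(pdfUrl, fetchImpl = fetch) {
  const res = await fetchImpl(PARSIFY_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url: pdfUrl }),
  });
  if (!res.ok) {
    throw new Error(`Parsify returned HTTP ${res.status} for ${pdfUrl}`);
  }
  return res.json();
}

// Process the batch sequentially (OCR is slow, so avoid flooding the server)
// and wrap each result with its source so it can be stored as one document.
async function parseBatch(pdfUrls, fetchImpl = fetch) {
  const documents = [];
  for (const pdfUrl of pdfUrls) {
    try {
      const fir = await parseOne(pdfUrl, fetchImpl);
      documents.push({ source: pdfUrl, ok: true, fir });
    } catch (err) {
      // Keep failures in the output so one bad PDF does not abort the batch.
      documents.push({ source: pdfUrl, ok: false, error: err.message });
    }
  }
  return documents;
}
```

The returned array could then be filtered on `ok` and passed to the MongoDB Node.js driver's `collection.insertMany(...)`. The global `fetch` used as the default requires Node.js 18 or newer.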

Development

This project is meant to be run on an Ubuntu (or certain Debian-based) server. It is possible to run it on other platforms, although the setup is not as straightforward.

Clone the repo and enter its directory

git clone https://github.com/shell-raiser/codefiesta-web-app.git
cd codefiesta-web-app

Install all dependencies and packages

sudo apt-get install tesseract-ocr -y
npm install
sudo wget -P /usr/share/tesseract-ocr/4.00/tessdata/ \
  https://github.com/tesseract-ocr/tessdata/raw/4.00/hin.traineddata \
  https://github.com/tesseract-ocr/tessdata/raw/4.00/tam.traineddata \
  https://github.com/tesseract-ocr/tessdata/raw/4.00/pan.traineddata \
  https://github.com/tesseract-ocr/tessdata/raw/4.00/ori.traineddata \
  https://github.com/tesseract-ocr/tessdata/raw/4.00/mar.traineddata \
  https://github.com/tesseract-ocr/tessdata/raw/4.00/mal.traineddata \
  https://github.com/tesseract-ocr/tessdata/raw/4.00/kan.traineddata \
  https://github.com/tesseract-ocr/tessdata/raw/4.00/guj.traineddata \
  https://github.com/tesseract-ocr/tessdata/raw/4.00/tel.traineddata
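
The nine language models downloaded in the command above all follow one URL pattern, so the list can be generated from the language codes. The sketch below does that and also builds the `+`-joined string that Tesseract's `-l` option accepts for multiple languages. The names `TESSDATA_BASE`, `LANGS`, `trainedDataUrls` and `tesseractLangArg` are made up for this example and do not come from the project.

```javascript
// Illustrative helper; names are hypothetical and not part of Parsify.
const TESSDATA_BASE = "https://github.com/tesseract-ocr/tessdata/raw/4.00";

// Language codes taken from the install command above.
const LANGS = ["hin", "tam", "pan", "ori", "mar", "mal", "kan", "guj", "tel"];

// Build the download URL for each language's .traineddata file.
function trainedDataUrls(langs = LANGS) {
  return langs.map((lang) => `${TESSDATA_BASE}/${lang}.traineddata`);
}

// Tesseract accepts several languages joined with "+", e.g. "hin+tam".
function tesseractLangArg(langs = LANGS) {
  return langs.join("+");
}

console.log(trainedDataUrls().join("\n"));
console.log(tesseractLangArg());
```

Adding support for another language would then mean appending its code to `LANGS`, provided a matching file exists in the tessdata repository.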

Run the command below to start the server

npm start

Then open localhost:3000 in a browser to view the website

Presentation Link

https://docs.google.com/presentation/d/1S7rnvWj5elVapgD0HpdAr4orYBSNJtxE78vGL2NQ1WE/edit



Made By Team - Tense Turtles
