Pdf-extraction-to-vector-storage

This project is a Python-based solution for extracting text from PDF files, preprocessing the text, vectorizing it using Cohere embeddings, and storing the vectors in Pinecone for further use.

Workflow Breakdown

Initialization: The script begins by initializing the necessary libraries. These include:
- PyMuPDF for PDF processing
- pytesseract for Optical Character Recognition (OCR)
- spaCy for Natural Language Processing (NLP)
- Cohere for generating text embeddings
- Pinecone for vector storage
Text Extraction: The extract_text_from_page function is responsible for extracting text from each page of the PDF. It uses PyMuPDF for text extraction and Tesseract for OCR in case the page contains scanned images.
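
The fallback described above could look roughly like the sketch below. It is not the repository's actual code: the `min_chars` threshold and the 300 dpi render are assumptions chosen for illustration, and the OCR branch assumes that pytesseract, Pillow, and a Tesseract binary are installed.

```python
import io


def extract_text_from_page(page, min_chars=20):
    """Return the page's text layer, falling back to OCR when it is nearly empty.

    `page` is expected to be a PyMuPDF page object. `min_chars` is an
    illustrative threshold, not a value taken from the project.
    """
    text = page.get_text().strip()
    if len(text) >= min_chars:
        return text

    # Imported lazily so that text-only PDFs do not require the OCR stack.
    import pytesseract
    from PIL import Image

    # Render the page to a PNG in memory and let Tesseract read the image.
    pixmap = page.get_pixmap(dpi=300)
    image = Image.open(io.BytesIO(pixmap.tobytes("png")))
    return pytesseract.image_to_string(image).strip()
```

Checking the text layer first keeps OCR, which is by far the slower step, limited to pages that actually need it.
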
Text Preprocessing: The preprocess_text function uses spaCy to normalize and clean the extracted text. The chunk_text function then divides the cleaned text into smaller pieces for efficient processing.
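
A minimal sketch of these two helpers follows. The exact normalization rules and chunk sizes used by the project are not documented here, so the choices below (lowercasing, dropping punctuation and whitespace tokens, word-based chunks of 200 words with a 20-word overlap) are assumptions. `nlp` is assumed to be an already loaded spaCy pipeline.

```python
def preprocess_text(text, nlp):
    """Lowercase the text and drop punctuation and whitespace tokens.

    `nlp` is a loaded spaCy pipeline, for example spacy.load("en_core_web_sm").
    """
    doc = nlp(text)
    tokens = [t.text.lower() for t in doc if not (t.is_punct or t.is_space)]
    return " ".join(tokens)


def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into word-based chunks that share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary at least partly visible in both neighbouring chunks, which tends to help retrieval later on.
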
Vectorization: The vectorize_text function takes the preprocessed text chunks and generates vector embeddings using the Cohere model.
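
As an illustration, the vectorization step might be written as below. The model name, the `input_type` value, and the batch size of 96 are assumptions made for this sketch, not settings confirmed by the project. `client` is assumed to be a Cohere client, for example one created with `cohere.Client(api_key=...)`.

```python
def vectorize_text(chunks, client, model="embed-english-v3.0", batch_size=96):
    """Embed text chunks in batches and return one vector per chunk.

    `client` is a Cohere client object. The model name and batch size are
    illustrative defaults; adjust them to the account and model in use.
    """
    embeddings = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        response = client.embed(
            texts=batch,
            model=model,
            input_type="search_document",  # chunks are stored, not queried
        )
        embeddings.extend(response.embeddings)
    return embeddings
```

Passing the client in as an argument, rather than creating it inside the function, makes the step easy to exercise with a stub during testing.
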
Upload to Pinecone: The upload_vectors function takes the generated vectors and uploads them to a Pinecone index for storage and retrieval.
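
The upload step could be sketched as follows. The record layout (an id, the vector values, and the chunk text stored as metadata) and the batch size of 100 are assumptions for illustration. `index` is assumed to be a Pinecone index handle obtained from the Pinecone client.

```python
def upload_vectors(index, ids, vectors, texts, batch_size=100):
    """Upsert vectors into a Pinecone index in batches.

    Each record keeps the original chunk text as metadata so that a query
    result can be mapped back to readable content. Returns the record count.
    """
    records = [
        {"id": record_id, "values": list(vector), "metadata": {"text": text}}
        for record_id, vector, text in zip(ids, vectors, texts)
    ]
    for start in range(0, len(records), batch_size):
        index.upsert(vectors=records[start:start + batch_size])
    return len(records)
```

Because an upsert overwrites records that share an id, re-running the script on the same PDF with stable ids replaces the earlier vectors instead of duplicating them.
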
Process PDF: The process_pdf function orchestrates the entire workflow for each PDF file. It extracts, preprocesses, and vectorizes the text from each page, and then uploads the vectors to Pinecone.
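
One way to picture the orchestration is the sketch below, in which each step is passed in as a callable. This signature is hypothetical and chosen only to keep the example self-contained; the project's own process_pdf presumably takes a file path and calls the functions described above directly. The id scheme (`doc_id`, page number, chunk number) is likewise an assumption.

```python
def process_pdf(pages, extract, preprocess, chunk, vectorize, upload, doc_id="doc"):
    """Run extract -> preprocess -> chunk -> vectorize -> upload for each page.

    `pages` is any iterable of page objects. Returns the number of chunks
    that were uploaded.
    """
    total = 0
    for page_number, page in enumerate(pages, start=1):
        text = preprocess(extract(page))
        chunks = chunk(text)
        if not chunks:
            continue  # nothing usable on this page (for example, a blank page)
        vectors = vectorize(chunks)
        ids = [f"{doc_id}-p{page_number}-c{i}" for i in range(len(chunks))]
        upload(ids, vectors, chunks)
        total += len(chunks)
    return total
```

With PyMuPDF, a document opened via `fitz.open(path)` can be iterated page by page, so it could be passed as `pages`.
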
Main Function: The main function serves as the entry point of the script. It iterates through a specified directory, identifies all PDF files, and processes each one using the process_pdf function.
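
The directory scan can be sketched with the standard library alone. The helper name `find_pdfs` and the choice to pass `process_pdf` in as an argument are assumptions made for this example, not names taken from the project.

```python
from pathlib import Path


def find_pdfs(directory):
    """Return the PDF files directly inside `directory`, sorted by path."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


def main(directory, process_pdf):
    """Process every PDF found in `directory` with the given callable."""
    for pdf_path in find_pdfs(directory):
        process_pdf(pdf_path)
```

Matching on the lowercased suffix also picks up files named `.PDF`, and sorting makes the processing order reproducible between runs.
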

Usage

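To use this script, set the directory containing your PDF files in the main function of pdf_to_vectorstore_main.py, then run the script. Before running it, make sure all required environment variables are set in your .env file; the repository includes a .env.example file that can serve as a template.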
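
A small startup check can catch a missing variable before any PDF is processed. The variable names below are hypothetical placeholders; the names the script actually expects are the ones listed in .env.example.

```python
import os

# Hypothetical names for illustration; use the keys from .env.example.
REQUIRED_VARS = ("COHERE_API_KEY", "PINECONE_API_KEY")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

If the project loads its .env file with python-dotenv (an assumption), calling `load_dotenv()` from the `dotenv` package before this check would copy the file's entries into `os.environ`.
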

