added final readme
yachty66 committed Nov 11, 2023
1 parent 2e76d86 commit f0c8fb9
Showing 5 changed files with 61 additions and 69 deletions.
36 changes: 17 additions & 19 deletions README.md
@@ -1,38 +1,38 @@
# gpt_pdf_md

gpt_pdf_md is a Python package which uses GPT-4V and other tools to convert pdf into Markdown files. current limitation of raw gpt-4v is that that it does not support pdf documents in the api and if prompted to convert text which contains figures to markdown, figures are not getting not converted correctly because the image url in the markdown is missing. IT TURNS OUT gpt_pdf_md IS EVEN COMING CLOSE TO OCR QUALITY OF MATHPIX!
`gpt_pdf_md` is a Python package that leverages GPT-4V and other tools to convert PDF files into Markdown. The current limitation of raw GPT-4V is that it does not support PDF documents in the API. Additionally, when prompted to convert text containing figures to Markdown, the figures are not converted correctly due to missing image URLs in the Markdown. However, `gpt_pdf_md` is coming close to the OCR quality of Mathpix!

## Features

- Extracts figures from PDF files using the `pdffigures2` Scala library.
- Converts PDF pages to images and uploads them to Google Cloud Bucket.
- Utilizes GPT-4V Vision to generate Markdown content from pdf an than inserts image urls into markdown.
- Converts PDF pages to images and uploads them to a Google Cloud Bucket.
- Utilizes GPT-4V Vision to generate Markdown content from a PDF and then inserts image URLs into the Markdown.
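
The last step in the list above, inserting image URLs into the Markdown, could look roughly like the following. This is a minimal sketch, not the package's actual implementation: the helper names `public_gcs_url` and `image_markdown` are hypothetical, and it assumes the figures were uploaded to a publicly readable bucket so they are reachable under `https://storage.googleapis.com/<bucket>/<object>`.

```python
from urllib.parse import quote


def public_gcs_url(bucket, object_name):
    """Build the public URL of an object in a publicly readable bucket (assumption)."""
    # quote() percent-encodes spaces and other unsafe characters but keeps "/".
    return f"https://storage.googleapis.com/{bucket}/{quote(object_name)}"


def image_markdown(alt_text, bucket, object_name):
    """Return a Markdown image tag that points at the uploaded figure."""
    return f"![{alt_text}]({public_gcs_url(bucket, object_name)})"


if __name__ == "__main__":
    # Hypothetical bucket and object names, for illustration only.
    print(image_markdown("Figure 1", "my-bucket", "figures/fig 1.png"))
```

A tag built this way only renders if the bucket is public, which is why the README asks for a public Google bucket.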

## Additional Dependencies

This package requires the `pdffigures2` Scala library to extract figures from PDF files. You need to have all necessary dependencies installed for the library https://github.com/allenai/pdffigures2. (this can be quite a hassle because parts of the library are written in scala so you need to have the right version of java and scala installed - we are looking for an alternative, more easy going way to extract images from a pdf, if youn have any ideas, feel free open an [issue](https://github.com/yachty66/gpt_vision_plus/issues) on that)
This package requires the `pdffigures2` Scala library to extract figures from PDF files. You need to have all necessary dependencies installed for the library. You can find more information [here](https://github.com/allenai/pdffigures2). Please note that this can be quite a hassle because parts of the library are written in Scala, so you need to have the correct versions of Java and Scala installed. We are looking for an alternative, more straightforward way to extract images from a PDF. If you have any ideas, feel free to open an [issue](https://github.com/yachty66/gpt_pdf_md/issues).

## Installation

Once you have `pdffigures2` setup you can install gpt_pdf_md via pip:
Once you have `pdffigures2` set up, you can install `gpt_pdf_md` via pip:

```bash
pip install gpt-pdf-md
```

Configure the required environment variables in your .env file without spaces or unnecessary quotes:
Configure the required environment variables in your `.env` file without spaces or unnecessary quotes:

```env
OPENAI_API_KEY=open_ai_key
GOOGLE_ID=google_project_id
GOOGLE_BUCKET=google_bucket_name
```

NOTE: the project requires a public google bucket where the images which later are getting rendered in the markdown are getting uploaded to.
NOTE: This project requires a public Google bucket where the images, which are later rendered in the Markdown, are uploaded.
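
Because a missing or blank variable would otherwise only surface later as an API error, it can help to check the three settings before calling the package. The sketch below is an assumption-laden illustration: `require_env` is a hypothetical helper that is not part of `gpt_pdf_md`, and it only reads values that are already in the process environment (for example after `load_dotenv()` has run).

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = ("OPENAI_API_KEY", "GOOGLE_ID", "GOOGLE_BUCKET")


def require_env(names=REQUIRED_VARS, environ=None):
    """Return the required settings as a dict, or raise if any is missing or blank."""
    environ = os.environ if environ is None else environ
    # Strip whitespace so that a value consisting only of spaces counts as missing.
    values = {name: environ.get(name, "").strip() for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return values
```

Calling `require_env()` at the top of a script fails fast with the names of the unset variables instead of a less specific error further down.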

## Usage

To process a PDF and generate Markdown content its important that the python file is in the same directory than the `pdffigures2` folder. You can use the gpt_pdf_md as following:
To process a PDF and generate Markdown content, it's important that the Python file is in the same directory as the `pdffigures2` folder. You can use `gpt_pdf_md` as follows:

```python
from gpt_pdf_md.reader import process_pdf
@@ -46,28 +46,26 @@ GOOGLE_ID = os.getenv('GOOGLE_ID')
GOOGLE_BUCKET = os.getenv('GOOGLE_BUCKET')

absolute_path = os.path.dirname(os.path.abspath(__file__))
#absolute path to pdf file
# Absolute path to the PDF file
PDF = absolute_path + "/example.pdf"
#absolute padth to pdffigures2
# Absolute path to pdffigures2
PDFFIGURES2_PATH = absolute_path + "/pdffigures2/"
process_pdf(PDF, PDFFIGURES2_PATH, OPENAI_API_KEY, GOOGLE_ID, GOOGLE_BUCKET)
```

This will process the specified PDF and output a Markdown file with the extracted information in the same directory. An example is the `output.md` file which is the converted result of `example.pdf`
This will process the specified PDF and output a Markdown file with the extracted information in the same directory. An example is the `output.md` file, which is the converted result of `example.pdf` created by running the `example.py` script.

## Next steps
## Next Steps

- [ ] try rust [vortex](https://github.com/omkar-mohanty/vortex) for pdf image extraction
- [ ] use gpt-4 128k for final formatting of markdown
- [ ] clearer readme to make it easier for everyone to use the python package
- [ ] error handling
- [ ] Try Rust [vortex](https://github.com/omkar-mohanty/vortex) for PDF image extraction
- [ ] Use GPT-4 128k for final formatting of Markdown
- [ ] Create a clearer README to make it easier for everyone to use the Python package
- [ ] Improve error handling

## Contributing & Support

We welcome contributions! Please open an issue or submit a pull request on our GitHub repository.

## License

This project is licensed under the terms of the [MIT License](gpt_pdf_md/LICENSE).


This project is licensed under the terms of the [MIT License](gpt_pdf_md/LICENSE).
2 changes: 1 addition & 1 deletion experiments.py → example.py
@@ -1,4 +1,4 @@
from gptpdfreader.reader import process_pdf
from gpt_pdf_md.reader import process_pdf
import os
from dotenv import load_dotenv

36 changes: 17 additions & 19 deletions gpt_pdf_md/README.md
@@ -1,38 +1,38 @@
# gpt_pdf_md

gpt_pdf_md is a Python package which uses GPT-4V and other tools to convert pdf into Markdown files. current limitation of raw gpt-4v is that that it does not support pdf documents in the api and if prompted to convert text which contains figures to markdown, figures are not getting not converted correctly because the image url in the markdown is missing. IT TURNS OUT gpt_pdf_md IS EVEN COMING CLOSE TO OCR QUALITY OF MATHPIX!
`gpt_pdf_md` is a Python package that leverages GPT-4V and other tools to convert PDF files into Markdown. The current limitation of raw GPT-4V is that it does not support PDF documents in the API. Additionally, when prompted to convert text containing figures to Markdown, the figures are not converted correctly due to missing image URLs in the Markdown. However, `gpt_pdf_md` is coming close to the OCR quality of Mathpix!

## Features

- Extracts figures from PDF files using the `pdffigures2` Scala library.
- Converts PDF pages to images and uploads them to Google Cloud Bucket.
- Utilizes GPT-4V Vision to generate Markdown content from pdf an than inserts image urls into markdown.
- Converts PDF pages to images and uploads them to a Google Cloud Bucket.
- Utilizes GPT-4V Vision to generate Markdown content from a PDF and then inserts image URLs into the Markdown.

## Additional Dependencies

This package requires the `pdffigures2` Scala library to extract figures from PDF files. You need to have all necessary dependencies installed for the library https://github.com/allenai/pdffigures2. (this can be quite a hassle because parts of the library are written in scala so you need to have the right version of java and scala installed - we are looking for an alternative, more easy going way to extract images from a pdf, if youn have any ideas, feel free open an [issue](https://github.com/yachty66/gpt_vision_plus/issues) on that)
This package requires the `pdffigures2` Scala library to extract figures from PDF files. You need to have all necessary dependencies installed for the library. You can find more information [here](https://github.com/allenai/pdffigures2). Please note that this can be quite a hassle because parts of the library are written in Scala, so you need to have the correct versions of Java and Scala installed. We are looking for an alternative, more straightforward way to extract images from a PDF. If you have any ideas, feel free to open an [issue](https://github.com/yachty66/gpt_pdf_md/issues).

## Installation

Once you have `pdffigures2` setup you can install gpt_pdf_md via pip:
Once you have `pdffigures2` set up, you can install `gpt_pdf_md` via pip:

```bash
pip install gpt-pdf-md
```

Configure the required environment variables in your .env file without spaces or unnecessary quotes:
Configure the required environment variables in your `.env` file without spaces or unnecessary quotes:

```env
OPENAI_API_KEY=open_ai_key
GOOGLE_ID=google_project_id
GOOGLE_BUCKET=google_bucket_name
```

NOTE: the project requires a public google bucket where the images which later are getting rendered in the markdown are getting uploaded to.
NOTE: This project requires a public Google bucket where the images, which are later rendered in the Markdown, are uploaded.

## Usage

To process a PDF and generate Markdown content its important that the python file is in the same directory than the `pdffigures2` folder. You can use the gpt_pdf_md as following:
To process a PDF and generate Markdown content, it's important that the Python file is in the same directory as the `pdffigures2` folder. You can use `gpt_pdf_md` as follows:

```python
from gpt_pdf_md.reader import process_pdf
@@ -46,28 +46,26 @@ GOOGLE_ID = os.getenv('GOOGLE_ID')
GOOGLE_BUCKET = os.getenv('GOOGLE_BUCKET')

absolute_path = os.path.dirname(os.path.abspath(__file__))
#absolute path to pdf file
# Absolute path to the PDF file
PDF = absolute_path + "/example.pdf"
#absolute padth to pdffigures2
# Absolute path to pdffigures2
PDFFIGURES2_PATH = absolute_path + "/pdffigures2/"
process_pdf(PDF, PDFFIGURES2_PATH, OPENAI_API_KEY, GOOGLE_ID, GOOGLE_BUCKET)
```

This will process the specified PDF and output a Markdown file with the extracted information in the same directory. An example is the `output.md` file which is the converted result of `example.pdf`
This will process the specified PDF and output a Markdown file with the extracted information in the same directory. An example is the `output.md` file, which is the converted result of `example.pdf` created by running the `example.py` script.

## Next steps
## Next Steps

- [ ] try rust [vortex](https://github.com/omkar-mohanty/vortex) for pdf image extraction
- [ ] use gpt-4 128k for final formatting of markdown
- [ ] clearer readme to make it easier for everyone to use the python package
- [ ] error handling
- [ ] Try Rust [vortex](https://github.com/omkar-mohanty/vortex) for PDF image extraction
- [ ] Use GPT-4 128k for final formatting of Markdown
- [ ] Create a clearer README to make it easier for everyone to use the Python package
- [ ] Improve error handling

## Contributing & Support

We welcome contributions! Please open an issue or submit a pull request on our GitHub repository.

## License

This project is licensed under the terms of the [MIT License](gpt_pdf_md/LICENSE).


This project is licensed under the terms of the [MIT License](gpt_pdf_md/LICENSE).
2 changes: 1 addition & 1 deletion gpt_pdf_md/setup.py
@@ -5,7 +5,7 @@

setup(
name='gpt_pdf_md',
version='0.1',
version='0.2',
packages=find_packages(),
description='A Python package that utilizes GPT-4V and other tools to convert PDFs into Markdown files.',
long_description=open('README.md').read(),