added final readme

yachty66 · Nov 11, 2023 · f0c8fb9
1 parent 2e76d86
commit f0c8fb9
Showing 5 changed files with 61 additions and 69 deletions.
diff --git a/README.md b/README.md
@@ -1,38 +1,38 @@
 # gpt_pdf_md
 
-gpt_pdf_md is a Python package which uses GPT-4V and other tools to convert pdf into Markdown files. current limitation of raw gpt-4v is that that it does not support pdf documents in the api and if prompted to convert text which contains figures to markdown, figures are not getting not converted correctly because the image url in the markdown is missing. IT TURNS OUT gpt_pdf_md IS EVEN COMING CLOSE TO OCR QUALITY OF MATHPIX!
+`gpt_pdf_md` is a Python package that leverages GPT-4V and other tools to convert PDF files into Markdown. The current limitation of raw GPT-4V is that it does not support PDF documents in the API. Additionally, when prompted to convert text containing figures to Markdown, the figures are not converted correctly due to missing image URLs in the Markdown. However, `gpt_pdf_md` is coming close to the OCR quality of Mathpix!
 
 ## Features
 
 - Extracts figures from PDF files using the `pdffigures2` Scala library.
-- Converts PDF pages to images and uploads them to Google Cloud Bucket.
-- Utilizes GPT-4V Vision to generate Markdown content from pdf an than inserts image urls into markdown.
+- Converts PDF pages to images and uploads them to a Google Cloud Bucket.
+- Utilizes GPT-4V Vision to generate Markdown content from a PDF and then inserts image URLs into the Markdown.
 
 ## Additional Dependencies
 
-This package requires the `pdffigures2` Scala library to extract figures from PDF files. You need to have all necessary dependencies installed for the library https://github.com/allenai/pdffigures2. (this can be quite a hassle because parts of the library are written in scala so you need to have the right version of java and scala installed - we are looking for an alternative, more easy going way to extract images from a pdf, if youn have any ideas, feel free open an [issue](https://github.com/yachty66/gpt_vision_plus/issues) on that)
+This package requires the `pdffigures2` Scala library to extract figures from PDF files. You need to have all necessary dependencies installed for the library. You can find more information [here](https://github.com/allenai/pdffigures2). Please note that this can be quite a hassle because parts of the library are written in Scala, so you need to have the correct versions of Java and Scala installed. We are looking for an alternative, more straightforward way to extract images from a PDF. If you have any ideas, feel free to open an [issue](https://github.com/yachty66/gpt_pdf_md/issues).
 
 ## Installation
 
-Once you have `pdffigures2` setup you can install gpt_pdf_md via pip:
+Once you have `pdffigures2` set up, you can install `gpt_pdf_md` via pip:
 
 ```bash
 pip install gpt-pdf-md
 ```
 
-Configure the required environment variables in your .env file without spaces or unnecessary quotes:
+Configure the required environment variables in your `.env` file without spaces or unnecessary quotes:
 
 ```env
 OPENAI_API_KEY=open_ai_key
 GOOGLE_ID=google_project_id
 GOOGLE_BUCKET=google_bucket_name
 ```
 
-NOTE: the project requires a public google bucket where the images which later are getting rendered in the markdown are getting uploaded to.
+NOTE: This project requires a public Google bucket where the images, which are later rendered in the Markdown, are uploaded.
 
 ## Usage
 
-To process a PDF and generate Markdown content its important that the python file is in the same directory than the `pdffigures2` folder. You can use the gpt_pdf_md as following:
+To process a PDF and generate Markdown content, it's important that the Python file is in the same directory as the `pdffigures2` folder. You can use `gpt_pdf_md` as follows:
 
 ```python
 from gpt_pdf_md.reader import process_pdf
@@ -46,28 +46,26 @@ GOOGLE_ID = os.getenv('GOOGLE_ID')
 GOOGLE_BUCKET = os.getenv('GOOGLE_BUCKET')
 
 absolute_path = os.path.dirname(os.path.abspath(__file__))
-#absolute path to pdf file
+# Absolute path to the PDF file
 PDF = absolute_path + "/example.pdf"
-#absolute padth to pdffigures2
+# Absolute path to pdffigures2
 PDFFIGURES2_PATH = absolute_path + "/pdffigures2/"
 process_pdf(PDF, PDFFIGURES2_PATH, OPENAI_API_KEY, GOOGLE_ID, GOOGLE_BUCKET)
 ```
 
-This will process the specified PDF and output a Markdown file with the extracted information in the same directory. An example is the `output.md` file which is the converted result of `example.pdf`
+This will process the specified PDF and output a Markdown file with the extracted information in the same directory. An example is the `output.md` file, which is the converted result of `example.pdf` created by running the `example.py` script.
 
-## Next steps
+## Next Steps
 
-- [ ] try rust [vortex](https://github.com/omkar-mohanty/vortex) for pdf image extraction
-- [ ] use gpt-4 128k for final formatting of markdown
-- [ ] clearer readme to make it easier for everyone to use the python package
-- [ ] error handling  
+- [ ] Try Rust [vortex](https://github.com/omkar-mohanty/vortex) for PDF image extraction
+- [ ] Use GPT-4 128k for final formatting of Markdown
+- [ ] Create a clearer README to make it easier for everyone to use the Python package
+- [ ] Improve error handling
 
 ## Contributing & Support
 
 We welcome contributions! Please open an issue or submit a pull request on our GitHub repository.
 
 ## License
 
-This project is licensed under the terms of the [MIT License](gpt_pdf_md/LICENSE).
-
-
+This project is licensed under the terms of the [MIT License](gpt_pdf_md/LICENSE).
diff --git a/experiments.py b/example.py
@@ -1,4 +1,4 @@
-from gptpdfreader.reader import process_pdf
+from gpt_pdf_md.reader import process_pdf
 import os
 from dotenv import load_dotenv
 

diff --git a/gpt_pdf_md/README.md b/gpt_pdf_md/README.md
@@ -1,38 +1,38 @@
 # gpt_pdf_md
 
-gpt_pdf_md is a Python package which uses GPT-4V and other tools to convert pdf into Markdown files. current limitation of raw gpt-4v is that that it does not support pdf documents in the api and if prompted to convert text which contains figures to markdown, figures are not getting not converted correctly because the image url in the markdown is missing. IT TURNS OUT gpt_pdf_md IS EVEN COMING CLOSE TO OCR QUALITY OF MATHPIX!
+`gpt_pdf_md` is a Python package that leverages GPT-4V and other tools to convert PDF files into Markdown. The current limitation of raw GPT-4V is that it does not support PDF documents in the API. Additionally, when prompted to convert text containing figures to Markdown, the figures are not converted correctly due to missing image URLs in the Markdown. However, `gpt_pdf_md` is coming close to the OCR quality of Mathpix!
 
 ## Features
 
 - Extracts figures from PDF files using the `pdffigures2` Scala library.
-- Converts PDF pages to images and uploads them to Google Cloud Bucket.
-- Utilizes GPT-4V Vision to generate Markdown content from pdf an than inserts image urls into markdown.
+- Converts PDF pages to images and uploads them to a Google Cloud Bucket.
+- Utilizes GPT-4V Vision to generate Markdown content from a PDF and then inserts image URLs into the Markdown.
 
 ## Additional Dependencies
 
-This package requires the `pdffigures2` Scala library to extract figures from PDF files. You need to have all necessary dependencies installed for the library https://github.com/allenai/pdffigures2. (this can be quite a hassle because parts of the library are written in scala so you need to have the right version of java and scala installed - we are looking for an alternative, more easy going way to extract images from a pdf, if youn have any ideas, feel free open an [issue](https://github.com/yachty66/gpt_vision_plus/issues) on that)
+This package requires the `pdffigures2` Scala library to extract figures from PDF files. You need to have all necessary dependencies installed for the library. You can find more information [here](https://github.com/allenai/pdffigures2). Please note that this can be quite a hassle because parts of the library are written in Scala, so you need to have the correct versions of Java and Scala installed. We are looking for an alternative, more straightforward way to extract images from a PDF. If you have any ideas, feel free to open an [issue](https://github.com/yachty66/gpt_pdf_md/issues).
 
 ## Installation
 
-Once you have `pdffigures2` setup you can install gpt_pdf_md via pip:
+Once you have `pdffigures2` set up, you can install `gpt_pdf_md` via pip:
 
 ```bash
 pip install gpt-pdf-md
 ```
 
-Configure the required environment variables in your .env file without spaces or unnecessary quotes:
+Configure the required environment variables in your `.env` file without spaces or unnecessary quotes:
 
 ```env
 OPENAI_API_KEY=open_ai_key
 GOOGLE_ID=google_project_id
 GOOGLE_BUCKET=google_bucket_name
 ```
 
-NOTE: the project requires a public google bucket where the images which later are getting rendered in the markdown are getting uploaded to.
+NOTE: This project requires a public Google bucket where the images, which are later rendered in the Markdown, are uploaded.
 
 ## Usage
 
-To process a PDF and generate Markdown content its important that the python file is in the same directory than the `pdffigures2` folder. You can use the gpt_pdf_md as following:
+To process a PDF and generate Markdown content, it's important that the Python file is in the same directory as the `pdffigures2` folder. You can use `gpt_pdf_md` as follows:
 
 ```python
 from gpt_pdf_md.reader import process_pdf
@@ -46,28 +46,26 @@ GOOGLE_ID = os.getenv('GOOGLE_ID')
 GOOGLE_BUCKET = os.getenv('GOOGLE_BUCKET')
 
 absolute_path = os.path.dirname(os.path.abspath(__file__))
-#absolute path to pdf file
+# Absolute path to the PDF file
 PDF = absolute_path + "/example.pdf"
-#absolute padth to pdffigures2
+# Absolute path to pdffigures2
 PDFFIGURES2_PATH = absolute_path + "/pdffigures2/"
 process_pdf(PDF, PDFFIGURES2_PATH, OPENAI_API_KEY, GOOGLE_ID, GOOGLE_BUCKET)
 ```
 
-This will process the specified PDF and output a Markdown file with the extracted information in the same directory. An example is the `output.md` file which is the converted result of `example.pdf`
+This will process the specified PDF and output a Markdown file with the extracted information in the same directory. An example is the `output.md` file, which is the converted result of `example.pdf` created by running the `example.py` script.
 
-## Next steps
+## Next Steps
 
-- [ ] try rust [vortex](https://github.com/omkar-mohanty/vortex) for pdf image extraction
-- [ ] use gpt-4 128k for final formatting of markdown
-- [ ] clearer readme to make it easier for everyone to use the python package
-- [ ] error handling  
+- [ ] Try Rust [vortex](https://github.com/omkar-mohanty/vortex) for PDF image extraction
+- [ ] Use GPT-4 128k for final formatting of Markdown
+- [ ] Create a clearer README to make it easier for everyone to use the Python package
+- [ ] Improve error handling
 
 ## Contributing & Support
 
 We welcome contributions! Please open an issue or submit a pull request on our GitHub repository.
 
 ## License
 
-This project is licensed under the terms of the [MIT License](gpt_pdf_md/LICENSE).
-
-
+This project is licensed under the terms of the [MIT License](gpt_pdf_md/LICENSE).
diff --git a/gpt_pdf_md/setup.py b/gpt_pdf_md/setup.py
@@ -5,7 +5,7 @@
 
 setup(
     name='gpt_pdf_md',
-    version='0.1',
+    version='0.2',
     packages=find_packages(),
     description='A Python package that utilizes GPT-4V and other tools to convert PDFs into Markdown files.',
     long_description=open('README.md').read(),
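
A note on the usage snippet in the README above: it reads `OPENAI_API_KEY`, `GOOGLE_ID`, and `GOOGLE_BUCKET` with `os.getenv`, which returns `None` when a variable is unset, and it builds paths by string concatenation. As a minimal sketch that is not part of this commit, a caller could validate the variables and build both paths before invoking `process_pdf`. The helper names `require_env` and `build_paths` are hypothetical and do not exist in `gpt_pdf_md`; the trailing separator on the `pdffigures2` path is kept only because the README example passes one.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for each name, raising if any is unset or empty."""
    # Hypothetical helper, not provided by gpt_pdf_md.
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}


def build_paths(base_dir, pdf_name="example.pdf", figures_dir="pdffigures2"):
    """Build the PDF path and the pdffigures2 path (with a trailing separator)."""
    # Hypothetical helper; the default names mirror the README example.
    pdf_path = os.path.join(base_dir, pdf_name)
    # Joining with an empty last component appends a trailing separator.
    figures_path = os.path.join(base_dir, figures_dir, "")
    return pdf_path, figures_path
```

In a script like `example.py`, this would run after `load_dotenv()`, and the resulting values would be passed to `process_pdf` in the same order the README shows: PDF path, `pdffigures2` path, OpenAI key, Google project ID, bucket name.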