Tables in pdf not getting saved into csv file #824
Comments
Assign me this issue.
@vijayproxima you can fix this in 3 ways:
1. Using tabula:
import tabula
# Path to your PDF file
pdf_file = 'your_pdf_file.pdf'
# Extract tables from the PDF (all pages)
tables = tabula.read_pdf(pdf_file, pages='all', multiple_tables=True)
# Check how many tables were extracted
print(f'Total tables extracted: {len(tables)}')
# Export the tables to CSV files
for i, table in enumerate(tables):
    ...
2. Using camelot:
import camelot
# Path to your PDF file
pdf_file = 'your_pdf_file.pdf'
# Extract tables from all pages of the PDF
tables = camelot.read_pdf(pdf_file, pages='all')
# Check how many tables were extracted
print(f'Total tables extracted: {len(tables)}')
# Export each extracted table to a separate CSV file
for i, table in enumerate(tables):
    ...
3. Using pdfplumber:
import pdfplumber
# Path to your PDF file
pdf_file = 'your_pdf_file.pdf'
# Open the PDF with pdfplumber
with pdfplumber.open(pdf_file) as pdf:
    ...
Please let me know if any of this is helpful for your repositories.
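A minimal sketch of how the third option could be completed, assuming pdfplumber is installed. The function names, the `tables` output folder, and the `page{n}_table{m}.csv` naming scheme are hypothetical choices for illustration; only `pdfplumber.open`, `pdf.pages` and `page.extract_tables()` come from the library itself.

```python
import csv
from pathlib import Path


def save_rows_as_csv(rows, csv_path):
    """Write one extracted table (a list of rows) to csv_path."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in rows:
            # pdfplumber returns None for empty cells; write them as "".
            writer.writerow(["" if cell is None else cell for cell in row])


def export_pdf_tables(pdf_file, out_dir="tables"):
    """Save every table pdfplumber finds in pdf_file as its own CSV file.

    out_dir and the file naming scheme are assumptions for this sketch.
    """
    import pdfplumber  # imported here so save_rows_as_csv works without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with pdfplumber.open(pdf_file) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            # extract_tables() returns a list of tables, each a list of rows
            for table_no, rows in enumerate(page.extract_tables(), start=1):
                csv_path = out / f"page{page_no}_table{table_no}.csv"
                save_rows_as_csv(rows, csv_path)
                written.append(csv_path)
    return written
```

Calling `export_pdf_tables('your_pdf_file.pdf')` would return the list of CSV paths it wrote. An empty list means pdfplumber did not detect any tables on any page.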
Hi,
In my PDF file, I have 4 tables (one for each of 4 regions) listing the holidays for a year. Each table has the columns Sr.No, Date, Day and Festival. The title on each table is "Region Name Holiday List 2024". However, when I execute the code below, no CSV file is created, nor is the pdfdocs.jsonl file. Only the data.jsonl file is created.
import time
from llmware.configs import LLMWareConfig

def parsing_the_pdfs():
    t0 = time.time()
    # Create a Library
    LLMWareConfig().set_active_db("sqlite")

p = parsing_the_pdfs()
This is the output when I execute the code:
Update: parsing time : 0.0057866573333740234
Update: parsing output : {'docs_added': 0, 'blocks_added': 0, 'images_added': 0, 'pages_added': 0, 'tables_added': 0, 'rejected_files': []}
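The output above reports `docs_added: 0` and `pages_added: 0` after only a few milliseconds, which would be consistent with the parser receiving no files at all rather than failing on the tables. That is an inference from the numbers, not something confirmed in this thread. A minimal sketch for checking the input folder before parsing, using only the standard library (the `list_pdfs` helper and the folder path in the usage note are hypothetical):

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files directly inside folder, or [] if it is not a directory."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    # Match the extension case-insensitively so .PDF files are found too.
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

For example, `print(list_pdfs("/path/to/input_folder"))` with the folder that is passed to the parser. An empty list would suggest the path is wrong or the folder holds no PDF files.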