sqlite3 error when running Fast Start RAG, example 1 on Windows 10 #1109

Open
arautio89 opened this issue Dec 1, 2024 · 3 comments

@arautio89

Hi,
I'm running the first Fast Start RAG example example-1-create_first_library.py on Windows 10 and I get this error.

Example - Parsing Files into Library

Step 1 - creating library example1_library
INFO: Setup - sample_files path already exists - C:\Users\Käyttäjä\llmware_data\sample_files
Step 2 - loading the llmware sample files and saving at: C:\Users\Käyttäjä\llmware_data\sample_files
Step 3 - parsing and indexing files from C:\Users\Käyttäjä\llmware_data\sample_files\Agreements
INFO: update:  Duplicate files (skipped): 0
INFO: update:  Total uploaded: 15
INFO: Parser - parse_pdf - start parsing of PDF Documents...
WARNING: pdf_parser - update_library_inc_totals_sqlite - can not open database: unable to open database file
WARNING: pdf_parser - register_status_update_sqlite - can not open database: unable to open database file
INFO: pdf_parser - total pdf files processed - 0
INFO: pdf_parser - total input files received - 0
INFO: pdf_parser - total blocks created - 0
INFO: pdf_parser - total images created - 0
INFO: pdf_parser - total tables created - 0
INFO: pdf_parser - total pages added - 0
INFO: pdf_parser - PDF Processing - Finished - time elapsed - 0.010000 
INFO: pdf_parser - Completed Parsing - processing time - 0.010000
INFO: Parser - parse_pdf - completed parsing of pdf documents - time taken: 0.030765771865844727
Step 4 - completed parsing - {'docs_added': 0, 'blocks_added': 0, 'images_added': 0, 'pages_added': 0, 'tables_added': 0, 'rejected_files': ['Rhea EXECUTIVE EMPLOYMENT AGREEMENT.pdf', 'Artemis Poseidon EXECUTIVE EMPLOYMENT AGREEMENT.pdf', 'Aphrodite EXECUTIVE EMPLOYMENT AGREEMENT.pdf', 'Leto EXECUTIVE EMPLOYMENT AGREEMENT.pdf', 'Eileithyia EXECUTIVE EMPLOYMENT AGREEMENT.pdf', 'Nyx EXECUTIVE EMPLOYMENT AGREEMENT.pdf', 'Gaia EXECUTIVE EMPLOYMENT AGREEMENT.pdf', 'Demeter EXECUTIVE EMPLOYMENT AGREEMENT.pdf', 'Amphitrite EXECUTIVE EMPLOYMENT AGREEMENT.pdf', 'Persephone EXECUTIVE EMPLOYMENT AGREEMENT.pdf', 'Apollo EXECUTIVE EMPLOYMENT AGREEMENT.pdf', 'Nike EXECUTIVE EMPLOYMENT AGREEMENT.pdf', 'Athena EXECUTIVE EMPLOYMENT AGREEMENT.pdf', 'Bia EXECUTIVE EMPLOYMENT AGREEMENT.pdf', 'Metis EXECUTIVE EMPLOYMENT AGREEMENT.pdf']}
Step 5 - updated library card - documents - 0 - blocks - 0 - {'_id': 1, 'library_name': 'example1_library', 'embedding': [{'embedding_status': 'no', 'embedding_model': 'none', 'embedding_db': 'none', 'embedded_blocks': 0, 'embedding_dims': 0, 'time_stamp': 'NA'}], 'knowledge_graph': 'no', 'unique_doc_id': 0, 'documents': 0, 'blocks': 0, 'images': 0, 'pages': 0, 'tables': 0, 'account_name': 'llmware'}
Step 6 - library artifacts - including extracted images - saved at folder path - C:\Users\Käyttäjä\llmware_data\accounts\llmware\example1_library

Step 7 - running a test query - base salary

The first time I ran the file, it ended like this:

Step 7 - running a test query - base salary

Traceback (most recent call last):
  File "C:\Users\Käyttäjä\AppData\Local\Programs\Python\Python312\Lib\weakref.py", line 666, in _exitfunc
    f()
  File "C:\Users\Käyttäjä\AppData\Local\Programs\Python\Python312\Lib\weakref.py", line 590, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Käyttäjä\Desktop\Coding\Data Science\llmware-rag\.venv\Lib\site-packages\urllib3\connectionpool.py", line 1180, in _close_pool_connections
    conn.close()
  File "C:\Users\Käyttäjä\Desktop\Coding\Data Science\llmware-rag\.venv\Lib\site-packages\botocore\awsrequest.py", line 80, in close
    super().close()
  File "C:\Users\Käyttäjä\Desktop\Coding\Data Science\llmware-rag\.venv\Lib\site-packages\urllib3\connection.py", line 318, in close
    super().close()
  File "C:\Users\Käyttäjä\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 1003, in close
    sock.close()   # close it manually... there may be other refs
    ^^^^^^^^^^^^
  File "C:\Users\Käyttäjä\AppData\Local\Programs\Python\Python312\Lib\socket.py", line 504, in close
    self._real_close()
  File "C:\Users\Käyttäjä\AppData\Local\Programs\Python\Python312\Lib\ssl.py", line 1308, in _real_close
    super()._real_close()
    _ss.close(self)

The code did create a `sqlite_llmware.db` file, and I can work with it in the terminal:

>>> import os
>>> import sqlite3
>>> from llmware.configs import LLMWareConfig
>>> conn = sqlite3.connect(os.path.join(LLMWareConfig().get_library_path(), 'sqlite_llmware.db'))
>>> cur = conn.cursor()
>>> cur.execute("""SELECT name FROM sqlite_master WHERE type='table';""")
<sqlite3.Cursor object at 0x000001EF59891D40>
>>> print(cur.fetchall())
[('library',), ('example1_library',), ('example1_library_data',), ('example1_library_idx',), ('example1_library_content',), ('example1_library_docsize',), ('example1_library_config',), ('parser_events',), ('parser_events_data',), ('parser_events_idx',), ('parser_events_content',), ('parser_events_docsize',), ('parser_events_config',), ('status',), ('movie',)]

(Note that I added ('movie',) in testing.)
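
The interactive session above can be wrapped into a small script that also runs SQLite's built-in integrity check. This is only a sketch: `check_sqlite_db` is a hypothetical helper name, not part of llmware, and it only shows whether Python's own `sqlite3` module can open and read the file. It says nothing about how the native parser opens it.

```python
import os
import sqlite3


def check_sqlite_db(db_path):
    """Return (integrity_result, table_names) for the SQLite file at db_path.

    Hypothetical helper for illustration; not part of llmware.
    """
    # sqlite3.connect() would silently create a missing file, so check first.
    if not os.path.isfile(db_path):
        raise FileNotFoundError(db_path)

    conn = sqlite3.connect(db_path)
    try:
        # PRAGMA integrity_check returns a single row containing 'ok'
        # when no corruption is found.
        integrity = conn.execute("PRAGMA integrity_check;").fetchone()[0]
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;"
            )
        ]
    finally:
        conn.close()
    return integrity, tables
```

To point it at the llmware database, pass the same path used in the session above, i.e. `os.path.join(LLMWareConfig().get_library_path(), 'sqlite_llmware.db')`.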

@doberst
Contributor

doberst commented Dec 2, 2024

@arautio89 - sorry you ran into this issue - thanks for sharing it - it does seem unusual. It looks like you have been able to parse other documents and create other libraries with llmware in that specific database? I would recommend checking the obvious stuff so we can rule those out (e.g., that the files or DB were not corrupted, and that you have the pip dependencies installed) - and then please do try to run the example again with a clean setup - if the issue persists, then we can look at the next level of debugging ... If there are any other notable items about the environment, please share - and I will try to recreate the environment and see if we can reproduce the issue.
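
For the environment details requested above, a short script can collect the basics in one place. This is a minimal sketch: `describe_environment` is a hypothetical helper, not part of llmware. The `path_is_ascii` field is included only because the paths in the logs contain non-ASCII characters (`Käyttäjä`) and the warnings come from the native parser rather than from Python. Whether that matters here is an assumption the thread does not confirm.

```python
import locale
import os
import platform
import sys


def describe_environment(path):
    """Collect basic environment facts that are useful in a bug report.

    Hypothetical helper for illustration; not part of llmware.
    """
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "preferred_encoding": locale.getpreferredencoding(False),
        "path": path,
        "path_exists": os.path.exists(path),
        # True only if every character in the path is plain ASCII.
        "path_is_ascii": path.isascii(),
    }
```

Calling it with the library path printed in the logs (for example the `llmware_data` folder) and pasting the resulting dictionary into the issue gives maintainers a starting point for reproducing the setup.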

@arautio89
Author

None of the sample folders were successfully parsed, although SmallLibrary stopped with a different error:

Example - Parsing Files into Library

Step 1 - creating library example1_library
INFO: Setup - sample_files path already exists - C:\Users\Käyttäjä\llmware_data\sample_files
Step 2 - loading the llmware sample files and saving at: C:\Users\Käyttäjä\llmware_data\sample_files
Step 3 - parsing and indexing files from C:\Users\Käyttäjä\llmware_data\sample_files\SmallLibrary
INFO: update:  Duplicate files (skipped): 2
INFO: update:  Total uploaded: 6
INFO: Parser - parse_office - start parsing of office documents...

I guess none of the PDF files could be parsed, so they were all rejected, and then office document parsing exited with an error?

I created a virtual environment with Python 3.12 and installed with pip3 install 'llmware[full]'.
However, to make that work I had to download visual-cpp-build-tools and select two individual components, as in this post:
https://stackoverflow.com/a/76245995
I chose the Windows 10 SDK.

I could try again from the beginning with a different Python version and see what happens?

@arautio89
Author

arautio89 commented Dec 3, 2024

Looking at the code, I suspect the problem is somewhere in libpdf_llmware.dll; however, I don't know how to troubleshoot further than that.
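
One possible first step, sketched under assumptions: check whether the DLL itself loads from Python with `ctypes`. `try_load_library` is a hypothetical helper, and the thread does not say where `libpdf_llmware.dll` sits inside the installed package, so the path has to be located in site-packages and passed in. A successful load only shows that the DLL and its dependencies resolve. It does not exercise the parser or its SQLite calls.

```python
import ctypes
import os


def try_load_library(dll_path):
    """Try to load a shared library and return (loaded, message).

    Hypothetical helper for illustration; not part of llmware.
    """
    if not os.path.isfile(dll_path):
        return False, f"file not found: {dll_path}"
    try:
        # ctypes.CDLL raises OSError if the file is not a loadable
        # library or if one of its dependencies cannot be resolved.
        ctypes.CDLL(dll_path)
    except OSError as exc:
        return False, f"load failed: {exc}"
    return True, "loaded"
```

If the load fails, the message from the `OSError` is worth including in the issue, since it usually names the missing module or dependency.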
