The following code portion has been converted to work with Ollama:
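For reference, these are the imports the snippets below rely on. They are inferred from the function bodies rather than copied from the original post, so verify the module names against your own environment:

# Imports assumed by the code below (inferred from usage; not part of the original snippet)
import os
import re
import logging
import traceback
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Tuple

import cv2
import numpy as np
import pytesseract
from pdf2image import convert_from_path
from PIL import Image
from transformers import AutoTokenizer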
# Some prerequisites
from langchain_ollama import ChatOllama

LOCAL_LLM_CONTEXT_SIZE_IN_TOKENS = 8192
TOKEN_CUSHION = 300
TOKEN_BUFFER = 500
TESSERACT_PATH = r"C:\Program Files\Tesseract-OCR"
os.environ['PATH'] += ";" + TESSERACT_PATH
OLLAMA_HOST = "http://127.0.0.1:11434"
OLLAMA_OCR_MODEL = "llama3.3:latest"
OLLAMA_OCR_FUNCTION = ChatOllama(
    base_url=OLLAMA_HOST,
    model=OLLAMA_OCR_MODEL,
    temperature=0.7,
    seed=1234567890,
)
def convert_nanoseconds(nano):
    seconds = nano / 1e9
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    if days != 0:
        formatted = f"{int(days)} days, {int(hours)} hours, {int(minutes)} minutes, and {seconds:.2f} seconds"
    elif hours != 0:
        formatted = f"{int(hours)} hours, {int(minutes)} minutes, and {seconds:.2f} seconds"
    elif minutes != 0:
        formatted = f"{int(minutes)} minutes, and {seconds:.2f} seconds"
    else:
        formatted = f"{seconds:.2f} seconds"
    return formatted

## Code modified to work with Ollama
def preprocess_image(image):
    logging.info("Preprocess Image")
    gray = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2GRAY)
    gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    kernel = np.ones((1, 1), np.uint8)
    gray = cv2.dilate(gray, kernel, iterations=1)
    logging.info("Preprocessing done.")
    return Image.fromarray(gray)
def convert_pdf_to_images(input_pdf_file_path: str, max_pages: int = 0, skip_first_n_pages: int = 0) -> List[Image.Image]:
    logging.info(f"Processing PDF file {input_pdf_file_path}")
    if max_pages == 0:
        last_page = None
        logging.info("Converting all pages to images...")
    else:
        last_page = skip_first_n_pages + max_pages
        logging.info(f"Converting pages {skip_first_n_pages + 1} to {last_page}")
    first_page = skip_first_n_pages + 1  # pdf2image uses 1-based indexing
    images = convert_from_path(input_pdf_file_path, first_page=first_page, last_page=last_page, poppler_path=r"\poppler-24.08.0\Library\bin")
    logging.info(f"Converted {len(images)} pages from PDF file to images.")
    return images

def ocr_image(image):
    preprocessed_image = preprocess_image(image)
    return pytesseract.image_to_string(preprocessed_image)
async def process_chunk(chunk: str, prev_context: str, chunk_index: int, total_chunks: int, reformat_as_markdown: bool, suppress_headers_and_page_numbers: bool) -> Tuple[str, str]:
    logging.info(f"Processing chunk {chunk_index + 1}/{total_chunks} (length: {len(chunk):,} characters)")

    # Step 1: OCR Correction
    ocr_correction_prompt = f"""Correct OCR-induced errors in the text, ensuring it flows coherently with the previous context. Follow these guidelines:

1. Fix OCR-induced typos and errors:
   - Correct words split across line breaks
   - Fix common OCR errors (e.g., 'rn' misread as 'm')
   - Use context and common sense to correct errors
   - Only fix clear errors, don't alter the content unnecessarily
   - Do not add extra periods or any unnecessary punctuation

2. Maintain original structure:
   - Keep all headings and subheadings intact

3. Preserve original content:
   - Keep all important information from the original text
   - Do not add any new information not present in the original text
   - Remove unnecessary line breaks within sentences or paragraphs
   - Maintain paragraph breaks

4. Maintain coherence:
   - Ensure the content connects smoothly with the previous context
   - Handle text that starts or ends mid-sentence appropriately

IMPORTANT: Respond ONLY with the corrected text. Preserve all original formatting, including line breaks. Do not include any introduction, explanation, or metadata.

Previous context:
{prev_context[-500:]}

Current chunk to process:
{chunk}

Corrected text:
"""
    ocr_corrected_chunk = await generate_completion(ocr_correction_prompt, max_tokens=len(chunk) + 500)
    processed_chunk = ocr_corrected_chunk

    # Step 2: Markdown Formatting (if requested)
    if reformat_as_markdown:
        markdown_prompt = f"""Reformat the following text as markdown, improving readability while preserving the original structure. Follow these guidelines:

1. Preserve all original headings, converting them to appropriate markdown heading levels (# for main titles, ## for subtitles, etc.)
   - Ensure each heading is on its own line
   - Add a blank line before and after each heading
2. Maintain the original paragraph structure. Remove all breaks within a word that should be a single word (for example, "cor- rect" should be "correct")
3. Format lists properly (unordered or ordered) if they exist in the original text
4. Use emphasis (*italic*) and strong emphasis (**bold**) where appropriate, based on the original formatting
5. Preserve all original content and meaning
6. Do not add any extra punctuation or modify the existing punctuation
7. Remove any spuriously inserted introductory text such as "Here is the corrected text:" that may have been added by the LLM and which is obviously not part of the original text.
8. Remove any obviously duplicated content that appears to have been accidentally included twice. Follow these strict guidelines:
   - Remove only exact or near-exact repeated paragraphs or sections within the main chunk.
   - Consider the context (before and after the main chunk) to identify duplicates that span chunk boundaries.
   - Do not remove content that is simply similar but conveys different information.
   - Preserve all unique content, even if it seems redundant.
   - Ensure the text flows smoothly after removal.
   - Do not add any new content or explanations.
   - If no obvious duplicates are found, return the main chunk unchanged.
9. {"Identify but do not remove headers, footers, or page numbers. Instead, format them distinctly, e.g., as blockquotes." if not suppress_headers_and_page_numbers else "Carefully remove headers, footers, and page numbers while preserving all other content."}

Text to reformat:
{ocr_corrected_chunk}

Reformatted markdown:
"""
        processed_chunk = await generate_completion(markdown_prompt, max_tokens=len(ocr_corrected_chunk) + 500)

    new_context = processed_chunk['generated_text'][-1000:]  # Use the last 1000 characters as context for the next chunk
    logging.info(f"Chunk {chunk_index + 1}/{total_chunks} processed. Output length: {len(processed_chunk):,} characters")
    return processed_chunk, new_context

async def process_chunks(chunks: List[str], reformat_as_markdown: bool, suppress_headers_and_page_numbers: bool) -> List[str]:
    total_chunks = len(chunks)
    logging.info("Using local LLM. Processing chunks sequentially...")
    context = ""
    processed_chunks = []
    for i, chunk in enumerate(chunks):
        processed_chunk, context = await process_chunk(chunk, context, i, total_chunks, reformat_as_markdown, suppress_headers_and_page_numbers)
        processed_chunks.append(processed_chunk)
    logging.info(f"All {total_chunks} chunks processed successfully")
    return processed_chunks

async def process_document(list_of_extracted_text_strings: List[str], reformat_as_markdown: bool = True, suppress_headers_and_page_numbers: bool = True) -> str:
    logging.info(f"Starting document processing. Total pages: {len(list_of_extracted_text_strings):,}")
    full_text = "\n\n".join(list_of_extracted_text_strings)
    logging.info(f"Size of full text before processing: {len(full_text):,} characters")
    chunk_size, overlap = 8000, 10

    # Improved chunking logic
    paragraphs = re.split(r'\n\s*\n', full_text)
    chunks = []
    current_chunk = []
    current_chunk_length = 0
    for paragraph in paragraphs:
        paragraph_length = len(paragraph)
        if current_chunk_length + paragraph_length <= chunk_size:
            current_chunk.append(paragraph)
            current_chunk_length += paragraph_length
        else:
            # If adding the whole paragraph exceeds the chunk size,
            # we need to split the paragraph into sentences
            if current_chunk:
                chunks.append("\n\n".join(current_chunk))
            sentences = re.split(r'(?<=[.!?])\s+', paragraph)
            current_chunk = []
            current_chunk_length = 0
            for sentence in sentences:
                sentence_length = len(sentence)
                if current_chunk_length + sentence_length <= chunk_size:
                    current_chunk.append(sentence)
                    current_chunk_length += sentence_length
                else:
                    if current_chunk:
                        chunks.append(" ".join(current_chunk))
                    current_chunk = [sentence]
                    current_chunk_length = sentence_length

    # Add any remaining content as the last chunk
    if current_chunk:
        chunks.append("\n\n".join(current_chunk) if len(current_chunk) > 1 else current_chunk[0])

    # Add overlap between chunks
    for i in range(1, len(chunks)):
        overlap_text = chunks[i - 1].split()[-overlap:]
        chunks[i] = " ".join(overlap_text) + " " + chunks[i]

    logging.info(f"Document split into {len(chunks):,} chunks. Chunk size: {chunk_size:,}, Overlap: {overlap:,}")
    processed_chunks = await process_chunks(chunks, reformat_as_markdown, suppress_headers_and_page_numbers)
    logging.debug(processed_chunks)
    final_text = "".join(processed_chunks[0]["generated_text"])
    logging.info(f"Size of text after combining chunks: {len(final_text):,} characters")
    logging.info(f"Document processing complete. Final text length: {len(final_text):,} characters")
    return final_text

def remove_corrected_text_header(text):
    return text.replace("# Corrected text\n", "").replace("# Corrected text:", "").replace("\nCorrected text", "").replace("Corrected text:", "")
async def assess_output_quality(original_text, processed_text):
    max_chars = 15000  # Limit to avoid exceeding token limits
    available_chars_per_text = max_chars // 2  # Split equally between original and processed
    original_sample = original_text[:available_chars_per_text]
    processed_sample = processed_text[:available_chars_per_text]
    prompt = f'''Compare the following samples of original OCR text with the processed output and assess the quality of the processing. Consider the following factors:

1. Accuracy of error correction
2. Improvement in readability
3. Preservation of original content and meaning
4. Appropriate use of markdown formatting (if applicable)
5. Removal of hallucinations or irrelevant content

Original text sample:
```
{original_sample}
```

Processed text sample:
```
{processed_sample}
```

Provide a quality score between 0 and 100, where 100 is perfect processing. Also provide a brief explanation of your assessment.

Your response should be in the following format:
SCORE: [Your score]
EXPLANATION: [Your explanation]
'''
    response = await generate_completion(prompt, max_tokens=1000)
    response = response["generated_text"]
    logging.debug(response)
    try:
        lines = response.strip().split('\n')
        score_line = next(line for line in lines if line.startswith('SCORE:'))
        score = int(score_line.split(':')[1].strip())
        explanation = '\n'.join(line for line in lines if line.startswith('EXPLANATION:')).replace('EXPLANATION:', '').strip()
        if explanation == "":
            explanation = "\n".join([line for line in lines[2:] if line.strip()])
        logging.info(f"== Quality assessment: Score {score}/100")
        logging.info(f"== Explanation: {explanation}")
        return score, explanation
    except Exception as e:
        logging.error(f"Error parsing quality assessment response: {e}")
        logging.error(f"Raw response: {response}")
        return None, None

# API Interaction Functions
async def generate_completion(prompt: str, max_tokens: int = 5000) -> Optional[str]:
    return await generate_completion_from_ollama(OLLAMA_OCR_MODEL, prompt, max_tokens)
def get_tokenizer(model_name: str):
    logging.info(f"Model Name : {model_name}")
    if model_name.lower().startswith("llama"):
        return AutoTokenizer.from_pretrained("huggyllama/llama-7b", clean_up_tokenization_spaces=False, legacy=False)
    else:
        raise ValueError(f"Unsupported model: {model_name}")

def approximate_tokens(text: str) -> int:
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text.strip())
    # Split on whitespace and punctuation, keeping punctuation
    tokens = re.findall(r'\b\w+\b|\S', text)
    count = 0
    for token in tokens:
        if token.isdigit():
            count += max(1, len(token) // 2)  # Numbers often tokenize to multiple tokens
        elif re.match(r'^[A-Z]{2,}$', token):  # Acronyms
            count += len(token)
        elif re.search(r'[^\w\s]', token):  # Punctuation and special characters
            count += 1
        elif len(token) > 10:  # Long words often split into multiple tokens
            count += len(token) // 4 + 1
        else:
            count += 1
    # Add a 10% buffer for potential underestimation
    return int(count * 1.1)

def estimate_tokens(text: str, model_name: str) -> int:
    try:
        tokenizer = get_tokenizer(model_name)
        return len(tokenizer.encode(text))
    except Exception as e:
        logging.warning(f"Error using tokenizer for {model_name}: {e}. Falling back to approximation.")
        return approximate_tokens(text)
def adjust_overlaps(chunks: List[str], tokenizer, max_chunk_tokens: int, overlap_size: int = 50) -> List[str]:
    adjusted_chunks = []
    for i in range(len(chunks)):
        if i == 0:
            adjusted_chunks.append(chunks[i])
        else:
            overlap_tokens = len(tokenizer.encode(' '.join(chunks[i - 1].split()[-overlap_size:])))
            current_tokens = len(tokenizer.encode(chunks[i]))
            if overlap_tokens + current_tokens > max_chunk_tokens:
                overlap_adjusted = chunks[i].split()[:-overlap_size]
                adjusted_chunks.append(' '.join(overlap_adjusted))
            else:
                adjusted_chunks.append(' '.join(chunks[i - 1].split()[-overlap_size:] + chunks[i].split()))
    return adjusted_chunks

def chunk_text(text: str, max_chunk_tokens: int, model_name: str) -> List[str]:
    chunks = []
    tokenizer = get_tokenizer(model_name)
    sentences = re.split(r'(?<=[.!?])\s+', text)
    current_chunk = []
    current_chunk_tokens = 0
    for sentence in sentences:
        sentence_tokens = len(tokenizer.encode(sentence))
        if current_chunk_tokens + sentence_tokens > max_chunk_tokens:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentence]
            current_chunk_tokens = sentence_tokens
        else:
            current_chunk.append(sentence)
            current_chunk_tokens += sentence_tokens
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    adjusted_chunks = adjust_overlaps(chunks, tokenizer, max_chunk_tokens)
    return adjusted_chunks

async def generate_completion_from_ollama(llm_model_name: str, input_prompt: str, number_of_tokens_to_generate: int = 100, temperature: float = 0.7, grammar_file_string: str = None):
    logging.info(f"Starting text completion using Ollama Model: '{llm_model_name}'")
    logging.debug(f"for input prompt: '{input_prompt}'")
    llm = OLLAMA_OCR_FUNCTION
    prompt_tokens = estimate_tokens(input_prompt, OLLAMA_OCR_MODEL)
    logging.info(f"Prompt Tokens : {prompt_tokens}")
    adjusted_max_tokens = min(number_of_tokens_to_generate, LOCAL_LLM_CONTEXT_SIZE_IN_TOKENS - prompt_tokens - TOKEN_BUFFER)
    if adjusted_max_tokens <= 0:
        logging.warning("Prompt is too long for LLM. Chunking the input.")
        chunks = chunk_text(input_prompt, LOCAL_LLM_CONTEXT_SIZE_IN_TOKENS - TOKEN_CUSHION, llm_model_name)
        results = []
        for chunk in chunks:
            try:
                output = llm.invoke(
                    input=chunk,
                    # max_tokens=LOCAL_LLM_CONTEXT_SIZE_IN_TOKENS - TOKEN_CUSHION,
                )
                # ChatOllama returns an AIMessage, so use attribute access instead of dict indexing;
                # Ollama reports the generated-token count in response_metadata['eval_count']
                logging.info(output.content)
                results.append(output.content)
                logging.info(f"Chunk processed. Output tokens: {output.response_metadata.get('eval_count', 0):,}")
            except Exception as e:
                logging.error(f"An error occurred while processing a chunk: {e}")
        return " ".join(results)
    else:
        logging.info("Prompt is OK for LLM. Processing...")
        output = llm.invoke(
            input=input_prompt,
        )
        generated_text = output.content
        if grammar_file_string == 'json':
            generated_text = generated_text.encode('unicode_escape').decode()
        response_metadata = output.response_metadata
        finish_reason = response_metadata["done_reason"]
        total_duration = convert_nanoseconds(response_metadata["total_duration"])
        logging.info(f"Completed text completion in {total_duration}. Beginning of generated text: \n'{generated_text[:150]}'...")
        return {
            "generated_text": generated_text,
            "finish_reason": finish_reason,
        }
async def do_OCR(filepath):
    try:
        # Suppress HTTP request logs
        input_pdf_file_path = filepath
        max_test_pages = 0
        skip_first_n_pages = 0
        reformat_as_markdown = True
        suppress_headers_and_page_numbers = True
        base_name = os.path.splitext(input_pdf_file_path)[0]
        output_extension = '.md' if reformat_as_markdown else '.txt'
        raw_ocr_output_file_path = f"{base_name}__raw_ocr_output.txt"
        llm_corrected_output_file_path = base_name + '_llm_corrected' + output_extension
        list_of_scanned_images = convert_pdf_to_images(input_pdf_file_path, max_test_pages, skip_first_n_pages)
        logging.info(f"Tesseract version: {pytesseract.get_tesseract_version()}")
        logging.info("Extracting text from converted pages...")
        with ThreadPoolExecutor() as executor:
            list_of_extracted_text_strings = list(executor.map(ocr_image, list_of_scanned_images))
        logging.info("Done extracting text from converted pages.")
        raw_ocr_output = "\n".join(list_of_extracted_text_strings)
        with open(raw_ocr_output_file_path, "w") as f:
            f.write(raw_ocr_output)
        logging.info(f"Raw OCR output written to: {raw_ocr_output_file_path}")
        logging.info("Processing document...")
        final_text = await process_document(list_of_extracted_text_strings, reformat_as_markdown, suppress_headers_and_page_numbers)
        cleaned_text = remove_corrected_text_header(final_text)
        # Save the LLM corrected output
        with open(llm_corrected_output_file_path, 'w') as f:
            f.write(cleaned_text)
        logging.info(f"LLM Corrected text written to: {llm_corrected_output_file_path}")
        if final_text:
            logging.debug(f"First 500 characters of LLM corrected processed text:\n{final_text[:500]}...")
        else:
            logging.warning("final_text is empty or not defined.")
        logging.info(f"Done processing {input_pdf_file_path}.")
        logging.info(" == Output files below ==")
        logging.info(f"[FILE] Raw OCR : {raw_ocr_output_file_path}")
        logging.info(f"[FILE] LLM Corrected : {llm_corrected_output_file_path}")
        # Perform a final quality check
        quality_score, explanation = await assess_output_quality(raw_ocr_output, final_text)
        if quality_score is not None:
            logging.info(f"Final quality score : {quality_score}/100")
            logging.info(f"Explanation : {explanation}")
        else:
            logging.warning("Unable to determine final quality score.")
        return explanation, final_text
    except Exception as e:
        logging.error(f"An error occurred in the main function:\n{e}")
        logging.error(traceback.format_exc())
Inside my existing asyncio.run() I call explanation, final_answer = await do_OCR(file_path) to get the corresponding strings.
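For context, here is a minimal sketch of how that call site might look. The wrapper coroutine name and the sample path are illustrative, not part of the original code, and note that do_OCR returns nothing if it hits its exception handler:

import asyncio

async def main():
    file_path = r"C:\scans\sample.pdf"  # hypothetical path; use your own PDF
    explanation, final_answer = await do_OCR(file_path)  # (quality explanation, corrected text)
    print(explanation)
    print(final_answer[:500])

asyncio.run(main())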