Improper results on scanned pdfs #193

Shravan-Ganji · 2023-08-21T11:02:01Z

I have been analyzing different types of documents with Layout Parser. I get the expected results on true PDFs but not on scanned PDFs: for scanned pages, the model labels the page content as a Figure or otherwise returns unexpected results.

This issue occurs only with scanned PDFs.

Checklist

I have searched related issues but cannot get the expected help.
The bug has not been fixed in the latest version, see the Layout Parser Releases

To Reproduce

import layoutparser as lp
import cv2

# Load the page image; OpenCV reads BGR, so reverse the channels to get RGB.
image = cv2.imread("test.png")
image = image[..., ::-1]

# PubLayNet Faster R-CNN; keep only detections with a score of at least 0.8.
model = lp.models.Detectron2LayoutModel(
    'lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config',
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

color_map = {
    'Text': 'red',
    'Title': 'blue',
    'List': 'green',
    'Table': 'purple',
    'Figure': 'pink',
}

layout = model.detect(image)

# Draw the detected blocks on top of the page image.
lp.draw_box(image, layout, box_width=3, color_map=color_map)

Environment

I am using Windows
Latest Layout Parser version

Attached images (2):

1: Scanned PDF image result
2: Proper PDF image result
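
A text summary of what the model returned can make the difference between the two results easier to compare than screenshots alone. The sketch below is only an illustration: `summarize_layout` is a hypothetical helper, not part of Layout Parser, and it assumes each detected block exposes `type` and `score` attributes, as the blocks returned by `model.detect` normally do.

```python
from collections import Counter


def summarize_layout(layout):
    """Return {label: (count, best_score)} for an iterable of detected blocks.

    Hypothetical helper: assumes each block has `type` and `score` attributes.
    """
    counts = Counter()
    best = {}
    for block in layout:
        label = getattr(block, "type", None)
        score = getattr(block, "score", None)
        counts[label] += 1
        if score is not None:
            # Track the highest confidence seen for this label.
            best[label] = max(score, best.get(label, 0.0))
    return {label: (n, best.get(label)) for label, n in counts.items()}
```

Calling `summarize_layout(layout)` on the scanned page and on the true PDF page should show whether the scanned page collapses into one or two large `Figure` blocks.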

Permafacture · 2024-04-18T03:31:50Z

Have you tried correcting the scanned images to make the background plain white? Here's a robust-looking example using OpenCV:

https://www.freedomvc.com/index.php/2022/01/17/basic-background-remover-with-opencv/

Shravan-Ganji added the bug label on Aug 21, 2023
