ContentExtractor.nodes_to_check doesn't recognize the "right" <p> elements in html article #952

tomer2406 · 2022-10-02T13:23:16Z

Hello,
I'm using the newspaper3k package to parse the following article: https://spectrum.ieee.org/3d-printed-meat
I debugged it until I reached the ContentExtractor.nodes_to_check method, and I saw that when it executes the following:
items = self.parser.getElementsByTag(doc, tag=tag)
with tag = 'p',
I get 75 elements which do not include the article text. By comparison, when I use BeautifulSoup with soup.find_all('p'), I get 76 elements with the right text.

Can you please help me understand the problem?
Thank you.
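
To narrow down which <p> element goes missing, it may help to diff the paragraph texts returned by the two parsers rather than only comparing the counts (75 vs. 76). The following is a minimal sketch, not part of newspaper3k: `normalize` and `missing_paragraphs` are hypothetical helper names, and the inputs are assumed to be plain lists of paragraph strings collected from each parser.

```python
from collections import Counter


def normalize(text):
    """Collapse whitespace so formatting differences do not count as mismatches."""
    return " ".join(text.split())


def missing_paragraphs(reference, candidate):
    """Return paragraphs present in `reference` but absent from `candidate`.

    Duplicates are respected: a paragraph that occurs twice in `reference`
    and once in `candidate` is reported once.
    """
    remaining = Counter(normalize(t) for t in candidate)
    missing = []
    for text in reference:
        key = normalize(text)
        if remaining[key] > 0:
            remaining[key] -= 1  # matched one occurrence in the candidate
        else:
            missing.append(key)  # no unmatched occurrence left in the candidate
    return missing
```

As a usage idea, `reference` could be built as `[p.get_text() for p in soup.find_all('p')]` and `candidate` as `[el.text_content() for el in items]`, assuming `items` holds lxml elements (this is an assumption about what `getElementsByTag` returns). Whatever the function reports is the text that the second parser did not see as a <p> element.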


AndyTheFactory mentioned this issue Oct 24, 2023

ContentExtractor.nodes_to_check doesn't recognize the "right" <p> elements in html article AndyTheFactory/newspaper4k#565

Closed
