ContentExtractor.nodes_to_check doesn't recognize the "right" <p> elements in html article #565

AndyTheFactory · 2023-10-24T20:00:11Z

Issue by tomer2406
Sun Oct 2 13:23:16 2022
Originally opened as codelucas/newspaper#952

Hello,
I'm using the newspaper3k package to parse the following article: https://spectrum.ieee.org/3d-printed-meat
I debugged it until I reached the ContentExtractor.nodes_to_check method, and I saw that when it executes the following:
items = self.parser.getElementsByTag(doc, tag=tag)
with tag = 'p',
I get 75 elements, and they do not include the article text. By comparison, when I use BeautifulSoup with soup.find_all('p'), I get 76 elements with the right text.

Can you please help me understand the problem?
Thank you.
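
One way to narrow down a mismatch like the 75 vs. 76 count above is to compare the paragraph texts each parser returns and list the ones that are missing. The following is a minimal sketch under stated assumptions, not newspaper3k's code: `ParagraphCollector`, `paragraph_texts`, and `missing_from` are hypothetical helper names, and the collector uses only the standard library's `html.parser` as a neutral third reference.

```python
from collections import Counter
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element (hypothetical helper, stdlib only)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph texts, in document order
        self._open = False    # True while inside a <p>
        self._buf = []        # text fragments of the current <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()     # an unclosed <p> ends where the next one starts
            self._open = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._open:
            self._buf.append(data)

    def _flush(self):
        if self._open:
            # Join fragments and collapse runs of whitespace.
            self.paragraphs.append(" ".join("".join(self._buf).split()))
        self._open = False
        self._buf = []

    def close(self):
        super().close()
        self._flush()         # keep a trailing <p> that was never closed


def paragraph_texts(html):
    """Return the normalized text of each <p> in the given HTML string."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return collector.paragraphs


def missing_from(candidate, reference):
    """Return texts in `reference` that `candidate` lacks (multiset difference)."""
    remaining = Counter(candidate)
    missing = []
    for text in reference:
        if remaining[text] > 0:
            remaining[text] -= 1
        else:
            missing.append(text)
    return missing
```

In the scenario described here, `candidate` would be the texts of the elements returned by `self.parser.getElementsByTag(doc, tag='p')` and `reference` the texts from `soup.find_all('p')`. Assuming the former are lxml elements, `el.text_content()` and `tag.get_text()` respectively should yield those strings, normalized the same way before comparing.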

AndyTheFactory · 2024-03-18T08:49:18Z

Code in that area changed a lot in 0.9.3

AndyTheFactory added the "help wanted" label (Extra attention is needed) on Oct 25, 2023

AndyTheFactory closed this as completed Mar 18, 2024
