How to extract the news detail page? 新闻详情页怎么提取？ #19

annian101 · 2024-03-28T09:25:09Z

How do I extract a news detail page?

platonai · 2024-04-05T11:38:49Z

    val url = "https://www.eeo.com.cn/2024/0330/648712.shtml"
    val session = ScentContexts.createSession()
    val document = session.harvestArticle(url, session.options())

    println(document.contentTitle)
    println(document.textContent)

eeo.com.cn crawler

platonai · 2024-04-05T12:36:22Z

If you need an open-source solution, use the code below:

    fun harvestArticle(page: WebPage): TextDocument {
        return SAXInput().parse(page.baseUrl, page.contentAsSaxInputSource).also { ChineseNewsExtractor().process(it) }
    }

ChineseNewsExtractor is implemented in PulsarRPA.

annian101 · 2024-04-07T01:03:11Z

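>     val url = "https://www.eeo.com.cn/2024/0330/648712.shtml"
>     val session = ScentContexts.createSession()
>     val document = session.harvestArticle(url, session.options())
>
>     println(document.contentTitle)
>     println(document.textContent)
>
> eeo.com.cn crawler

Is this generic across news sites? I see that your code directory has separate entries for the Baidu news site, the eeo news site, and so on. If I apply it to sites other than these to fetch detail pages, will it still work?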

annian101 · 2024-04-07T01:04:54Z

> If you need an open-source solution, use the code below:
>
>     fun harvestArticle(page: WebPage): TextDocument {
>         return SAXInput().parse(page.baseUrl, page.contentAsSaxInputSource).also { ChineseNewsExtractor().process(it) }
>     }
>
> ChineseNewsExtractor is implemented in PulsarRPA.

One more question: can Exotic extract detail pages?

ZhujingJava · 2024-04-18T15:59:19Z

> Is this generic across news sites? I see that your code directory has separate entries for the Baidu news site, the eeo news site, and so on. If I apply it to sites other than these to fetch detail pages, will it still work?

Each site has a different element structure, so every company's site needs its own extraction logic, for example amazon, zhihu, jd, and so on.
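
To make the per-site approach concrete, here is a minimal sketch of dispatching on the host name, with a generic fallback for sites that have no dedicated logic. Everything in it is an assumption for illustration: `Article`, `Extractor`, `genericExtractor`, `exampleNewsExtractor`, `siteExtractors` and the host `news.example.com` are hypothetical names, not part of PulsarRPA or Scent, and the regex-based parsing only stands in for a real HTML parser.

```kotlin
import java.net.URI

// Hypothetical result type for this sketch; not a PulsarRPA class.
data class Article(val title: String, val text: String)

// An extractor turns raw HTML into an Article.
typealias Extractor = (html: String) -> Article

private val dotAll = setOf(RegexOption.IGNORE_CASE, RegexOption.DOT_MATCHES_ALL)

// Generic fallback: takes <title> as the title and strips all tags for the text.
val genericExtractor: Extractor = { html ->
    val title = Regex("<title>(.*?)</title>", dotAll)
        .find(html)?.groupValues?.get(1)?.trim().orEmpty()
    val text = html.replace(Regex("<[^>]+>"), " ").replace(Regex("\\s+"), " ").trim()
    Article(title, text)
}

// Site-specific logic for a hypothetical site whose headline lives in <h1>.
val exampleNewsExtractor: Extractor = { html ->
    val title = Regex("<h1[^>]*>(.*?)</h1>", dotAll)
        .find(html)?.groupValues?.get(1)?.trim().orEmpty()
    genericExtractor(html).copy(title = title)
}

// Registry of per-site extractors, keyed by lower-case host name.
val siteExtractors: Map<String, Extractor> = mapOf(
    "news.example.com" to exampleNewsExtractor
)

// Picks the site-specific extractor if one is registered, else the generic one.
fun extractorFor(url: String): Extractor {
    val host = URI(url).host?.lowercase() ?: return genericExtractor
    return siteExtractors[host] ?: genericExtractor
}
```

A caller would fetch the page HTML however it likes and then run `extractorFor(url)(html)`. Adding support for a new site means adding one entry to the registry.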

platonai · 2024-04-20T13:38:23Z

> One more question: can Exotic extract detail pages?

The README on the project home page covers this.

More information:

https://www.bilibili.com/video/BV1qV411R7Xq/
This video shows how our AI technology accurately understands every field on a web page and turns the page into structured data or an Excel spreadsheet. By combining unsupervised and supervised learning for web data extraction, we have raised per-person extraction productivity by more than 1000x, improved extraction accuracy, lowered the skill requirements for staff, and removed the need to frequently maintain extraction rules.

http://platonic.fun/i/ai?url=aHR0cHM6Ly93d3cuaHVhLmNvbS9tZWlndWkv
This is a live demo of the AI technology accurately understanding and extracting web page fields.

https://www.bilibili.com/video/BV1Zi4y1h7aq/
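
A side note on the demo link: its `url` query parameter appears to be the target page address encoded as Base64. The sketch below decodes it using only the JDK. `decodeDemoTarget` is a hypothetical helper name, and the code assumes the parameter value is plain, unpadded-or-padded standard Base64 with no percent-encoding.

```kotlin
import java.net.URI
import java.util.Base64

// Hypothetical helper: returns the decoded value of the "url" query parameter,
// or null if the link has no such parameter.
fun decodeDemoTarget(demoUrl: String): String? {
    val query = URI(demoUrl).rawQuery ?: return null
    val encoded = query.split("&")
        .map { it.split("=", limit = 2) }
        .firstOrNull { it.size == 2 && it[0] == "url" }
        ?.get(1) ?: return null
    // Assumes standard Base64; throws IllegalArgumentException on other input.
    return String(Base64.getDecoder().decode(encoded), Charsets.UTF_8)
}

fun main() {
    val demo = "http://platonic.fun/i/ai?url=aHR0cHM6Ly93d3cuaHVhLmNvbS9tZWlndWkv"
    println(decodeDemoTarget(demo)) // prints https://www.hua.com/meigui/
}
```

To try the demo on another page, the same scheme could presumably be reversed with `Base64.getEncoder().encodeToString(...)`, though the thread does not confirm that the service accepts arbitrary targets.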

platonai · 2024-04-20T13:38:42Z

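> Each site has a different element structure, so every company's site needs its own extraction logic, for example amazon, zhihu, jd, and so on.

The README on the project home page covers this.

More information:

https://www.bilibili.com/video/BV1qV411R7Xq/
This video shows how our AI technology accurately understands every field on a web page and turns the page into structured data or an Excel spreadsheet. By combining unsupervised and supervised learning for web data extraction, we have raised per-person extraction productivity by more than 1000x, improved extraction accuracy, lowered the skill requirements for staff, and removed the need to frequently maintain extraction rules.

http://platonic.fun/i/ai?url=aHR0cHM6Ly93d3cuaHVhLmNvbS9tZWlndWkv
This is a live demo of the AI technology accurately understanding and extracting web page fields.

https://www.bilibili.com/video/BV1Zi4y1h7aq/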

xieliaing · 2024-08-10T06:21:53Z

Hello, I emailed your company but have not received a reply. How can I get in touch?

galaxyeye · 2024-08-18T03:54:54Z

Thank you for your interest. You can add me on WeChat directly: galaxyeye. Many thanks.

WeChat: galaxyeye
Weibo: galaxyeye
Email: ***@***.***, ***@***.***
Twitter: galaxyeye8
Website: platon.ai

-- 开放商品云 张南 Email: ***@***.*** WeChat: galaxyeye QQ: 263206207 Mobile: 18621538660

xieliaing · 2024-08-18T08:35:04Z

OK, I will add you on WeChat. My colleague will also contact you through our company email.

Thank you,
Liang

platonai added the good first issue and wontfix labels on Apr 5, 2024

platonai changed the title ~~请问一下新闻详情页怎么提取？~~ How to extract the news detail page? 新闻详情页怎么提取？ Apr 5, 2024
