Merge pull request #330 from extractus/7.2.9
v7.2.9
ndaidong authored Feb 20, 2023
2 parents 4e3debb + 7ead9a2 commit ba62bad
Showing 10 changed files with 149 additions and 59 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -425,7 +425,7 @@ In this scenario, `@extractus/article-extractor` will execute both transformatio

### `sanitize-html`'s options

-`@extractus/article-extractor` uses [sanitize-html](https://www.npmjs.com/package/sanitize-html) to make a clean sweep of HTML content.
+`@extractus/article-extractor` uses [sanitize-html](https://github.com/apostrophecms/sanitize-html) to make a clean sweep of HTML content.

Here is the [default options](src/config.js#L5)

@@ -436,7 +436,7 @@ There are 2 methods to access and modify these options in `@extractus/article-ex
- `getSanitizeHtmlOptions()`
- `setSanitizeHtmlOptions(Object sanitizeHtmlOptions)`

-Read [sanitize-html](https://www.npmjs.com/package/sanitize-html#what-are-the-default-options) docs for more info.
+Read [sanitize-html](https://github.com/apostrophecms/sanitize-html#default-options) docs for more info.

---
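
The two accessors named in this README hunk can be combined to extend the current options instead of replacing them. The following is a minimal sketch, assuming the options are a plain object as in `src/config.js`; `withCodeClasses` is a hypothetical helper, not part of the library, and the `current` object is a stand-in for whatever `getSanitizeHtmlOptions()` returns.

```javascript
// Hypothetical helper (not part of @extractus/article-extractor):
// returns a copy of the given sanitize-html options that also keeps
// the `class` attribute on <code> and <div>, without mutating the input.
const withCodeClasses = (options) => ({
  ...options,
  allowedAttributes: {
    ...options.allowedAttributes,
    code: ['class'],
    div: ['class'],
  },
  allowedClasses: {
    ...options.allowedClasses,
    code: ['language-*', 'lang-*'],
  },
})

// Stand-in for the object returned by getSanitizeHtmlOptions();
// the real defaults live in src/config.js.
const current = {
  allowedTags: ['p', 'code'],
  allowedAttributes: { a: ['href'] },
}

const next = withCodeClasses(current)
console.log(next.allowedAttributes)
console.log(next.allowedClasses)
```

With the package installed, the same helper could presumably be applied as `setSanitizeHtmlOptions(withCodeClasses(getSanitizeHtmlOptions()))`, which mirrors what the new test in `src/main.test.js` does inline.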

34 changes: 17 additions & 17 deletions dist/article-extractor.esm.js

Large diffs are not rendered by default.

64 changes: 32 additions & 32 deletions dist/cjs/article-extractor.js

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion dist/cjs/package.json
@@ -1,5 +1,5 @@
{
"name": "@extractus/article-extractor",
-"version": "7.2.8",
+"version": "7.2.9",
"main": "./article-extractor.js"
}
12 changes: 6 additions & 6 deletions package.json
@@ -1,5 +1,5 @@
{
-"version": "7.2.8",
+"version": "7.2.9",
"name": "@extractus/article-extractor",
"description": "To extract main article from given URL",
"homepage": "https://github.com/extractus/article-extractor",
@@ -37,15 +37,15 @@
"@mozilla/readability": "^0.4.2",
"bellajs": "^11.1.1",
"cross-fetch": "^3.1.5",
-"linkedom": "^0.14.21",
-"sanitize-html": "^2.8.1",
+"linkedom": "^0.14.22",
+"sanitize-html": "2.10.0",
"string-similarity": "^4.0.4"
},
"devDependencies": {
"@types/sanitize-html": "^2.8.0",
-"esbuild": "^0.16.16",
-"eslint": "^8.31.0",
-"jest": "^29.3.1",
+"esbuild": "^0.17.9",
+"eslint": "^8.34.0",
+"jest": "^29.4.3",
"nock": "^13.3.0"
},
"keywords": [
1 change: 1 addition & 0 deletions src/config.js
@@ -46,6 +46,7 @@ const sanitizeHtmlOptions = {
'github.com', 'codepen.com',
'twitter.com', 'facebook.com', 'instagram.com',
],
+allowVulnerableTags: false,
}

/**
22 changes: 22 additions & 0 deletions src/main.test.js
@@ -120,3 +120,25 @@ describe('test extract(regular article url)', () => {
expect(result.description).toEqual(expDesc)
})
})
+
+describe('test extract with modified sanitize-html options', () => {
+const currentSanitizeOptions = getSanitizeHtmlOptions()
+
+setSanitizeHtmlOptions({
+...currentSanitizeOptions,
+allowedAttributes: {
+...currentSanitizeOptions.allowedAttributes,
+code: ['class'],
+div: ['class'],
+},
+allowedClasses: {
+code: ['language-*', 'lang-*'],
+},
+})
+
+test('check if output contain class attribute', async () => {
+const html = readFileSync('./test-data/article-with-classes-attributes.html', 'utf8')
+const result = await extract(html)
+expect(result.content).toEqual(expect.stringContaining('code class="lang-js"'))
+})
+})
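
The test above relies on wildcard entries such as `lang-*` in `allowedClasses`. As a rough illustration of how prefix globs of that kind can be matched against a `class` attribute, here is a self-contained sketch. It is not sanitize-html's actual implementation; `globToRegExp` and `keepAllowedClasses` are hypothetical names introduced only for this example.

```javascript
// Illustrative only: a tiny glob matcher, NOT sanitize-html's own code.

// Escape characters that have a special meaning in regular expressions.
const escapeRegExp = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')

// Turn a pattern like 'lang-*' into an anchored RegExp (/^lang-.*$/).
const globToRegExp = (pattern) => {
  const body = pattern.split('*').map(escapeRegExp).join('.*')
  return new RegExp(`^${body}$`)
}

// Keep only the class names that match at least one allowed pattern.
const keepAllowedClasses = (classAttr, patterns) => {
  const matchers = patterns.map(globToRegExp)
  return classAttr
    .split(/\s+/)
    .filter(Boolean)
    .filter((name) => matchers.some((re) => re.test(name)))
    .join(' ')
}

console.log(keepAllowedClasses('lang-js highlight', ['language-*', 'lang-*']))
// prints "lang-js"
```

Under this reading, the fixture's `<code class="lang-js">` survives sanitizing because `lang-js` matches `lang-*`, which is what the `stringContaining('code class="lang-js"')` assertion checks.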
4 changes: 3 additions & 1 deletion src/utils/extractWithReadability.js
@@ -12,7 +12,9 @@ export default (html, inputUrl = '') => {
const base = doc.createElement('base')
base.setAttribute('href', inputUrl)
doc.head.appendChild(base)
-const reader = new Readability(doc)
+const reader = new Readability(doc, {
+keepClasses: true,
+})
const result = reader.parse() ?? {}
return result.textContent ? result.content : null
}
1 change: 1 addition & 0 deletions src/utils/html.js
@@ -10,6 +10,7 @@ export const purify = (html) => {
return sanitize(html, {
allowedTags: false,
allowedAttributes: false,
+allowVulnerableTags: true,
})
}

64 changes: 64 additions & 0 deletions test-data/article-with-classes-attributes.html
@@ -0,0 +1,64 @@
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Article title here - ArticleParser</title>
<meta name="author" content="Alice">
<meta name="description" content="Few words about this article">
<meta name="keywords" content="alpha, beta, gamma">
<meta name="twitter:site" content="@ArticleParser">
<meta name="twitter:url" content="https://somewhere.com/path/to/article-title-here">
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:image" content="https://somewhere.com/path/to/image.jpg">
<meta name="twitter:creator" content="@alice">
<meta property="og:title" content="Article title here">
<meta property="og:type" content="article">
<meta property="og:url" content="https://somewhere.com/path/to/article-title-here">
<meta property="og:description" content="Navigation here Few can name a rational peach that isn't a conscientious goldfish! One cannot separate snakes from plucky pomegranates? Draped neatly on a hanger, the melons could be said to resemble knowledgeable pigs.">
<meta property="og:image" content="https://somewhere.com/path/to/image.jpg">
<meta property="article:published_time" content="2021-12-15T10:00:00.000+07:00">
<meta property="article:modified_time" content="2021-12-16T09:00:00.000+07:00">

<link rel="stylesheet" href="/path/to/cssfile.css">
<link rel="canonical" href="https://somewhere.com/another/path/to/article-title-here">
<link rel="amphtml" href="https://m.somewhere.com/another/path/to/article-title-here.amp">
<link rel="shortlink" href="https://sw.re/419283">

<link rel="alternate" title="ArticleParser" type="application/atom+xml" href="https://somewhere.com/atom.xml">

<link rel="manifest" href="/manifest.json">
</head>
<body>
<header>Page header here</header>
<main>
<section>
<nav>Navigation here</nav>
</section>
<section>
<h1>Article title here</h1>
<article>
<div class="contentdetail">Few can name a <a href="https://otherwhere.com/descriptions/rational-peach">rational peach</a> that isn't a conscientious goldfish! One cannot separate snakes from plucky pomegranates? Draped neatly on a hanger, the melons could be said to resemble knowledgeable pigs. Some posit the enchanting tiger to be less than confident. The literature would have us believe that an impartial turtle is not but a hippopotamus. Unfortunately, that is wrong; on the contrary, those cows are nothing more than pandas! The chicken is a shark; A turtle can hardly be considered a kind horse without also being a pomegranate. Zebras are witty persimmons.</div>
<p class="contentdetail">
Those cheetahs are nothing more than dogs. A <a href="/dict/watermelon">watermelon</a> is an exuberant kangaroo. An octopus is the tangerine of a grapes? The cherry is a shark. Recent controversy aside, they were lost without the cheerful plum that composed their fox. As far as we can estimate, one cannot separate camels from dynamic hamsters. Those tigers are nothing more than cows! A cow is a squirrel from the right perspective. Their banana was, in this moment, a helpful bear.</p>
<p>The first fair dog is, in its own way, a lemon.</p>
<address>4746 Kelly Drive, West Virginia</address>
<img src="./orange.png" style="border: solid 1px #000">
<pre>
<code class="lang-js">
const add = (a, b) => {
return a + b
}
</code>
</pre>
<p class="demo">OK, that is good</p>
</article>
</section>
<section class="sidebar-widget">
<widget>Some widget here</widget>
<widget>Some widget here</widget>
</section>
</main>
<footer>Page footer here</footer>
</body>
</html>
