Merge pull request #276 from ndaidong/6.0.6
v6.0.6
ndaidong authored Jul 5, 2022
2 parents 0a70987 + dd3434e commit e9e5492
Showing 19 changed files with 328 additions and 226 deletions.
55 changes: 51 additions & 4 deletions README.md
@@ -105,9 +105,10 @@ If the extraction works well, you should get an `article` object with the struct
#### addQueryRules(Array queryRules)

Add custom rules to get the main article from specific domains.
New rules will be appended to the current list of rules.

This can be useful when the default extraction algorithm fails, or when you want to remove some parts of main article content.
New rules will be added at the end of the current list of rules.

This can be useful when the default extraction algorithm fails, or when you want to adjust the content of the extracted article.

Example:

@@ -132,6 +133,7 @@ addQueryRules([
])

// extractor will try to find article at `#noop_article_locates_here`
// the elements with class .advertise-area or .stupid-banner will be removed

// call it again, hopefully it works for you now :)
extract('https://bad-website.domain/page/article')
@@ -141,11 +143,13 @@

A query rule is an object with the following properties:

- `patterns`: required, list of [URLPattern](https://developer.mozilla.org/en-US/docs/Web/API/URLPattern) objects. See [the syntax for patterns](https://developer.mozilla.org/en-US/docs/Web/API/URL_Pattern_API).
- `patterns`: required, list of [URLPattern](https://developer.mozilla.org/en-US/docs/Web/API/URLPattern) objects
- `selector`: optional, where to find the HTMLElement which contains main article content
- `unwanted`: optional, list of selectors to filter unwanted HTML elements from the last result
- `transform`: optional, function to fine-tune article content more thoroughly

Rules without `patterns` will be ignored. For the syntax of patterns, please see the [URL Pattern API](https://developer.mozilla.org/en-US/docs/Web/API/URL_Pattern_API).
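
For illustration, a minimal rule might define only `patterns` plus a `selector` and an optional `unwanted` list (the domain and selectors below are hypothetical):

```js
addQueryRules([
  {
    patterns: [
      '*://example.com/articles/*' // hypothetical domain, for illustration only
    ],
    selector: 'main .post-body', // where the main article content is expected
    unwanted: [
      '.related-posts' // optional: strip these elements from the extracted content
    ]
  }
])
```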

Here is an example using rule with transformation:

```js
@@ -174,9 +178,52 @@ addQueryRules([

To write better `transform()` logic, please refer to the [Document Object](https://developer.mozilla.org/en-US/docs/Web/API/Document) documentation.
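
For instance, a `transform` function might clean up headings. The sketch below assumes, as the reference above suggests, that `transform` receives and returns a Document holding the article content; the markup it touches is hypothetical:

```js
const transform = (document) => {
  // replace every <h3> in the article with a bold paragraph
  document.querySelectorAll('h3').forEach((node) => {
    const p = document.createElement('p')
    const strong = document.createElement('strong')
    strong.textContent = node.textContent
    p.appendChild(strong)
    node.parentNode.replaceChild(p, node)
  })
  return document
}
```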

#### Priority order

While processing an article, more than one rule may match. Suppose we have the following rules:

```js
[
  {
    patterns: [
      '*://google.com/*',
      '*://goo.gl/*'
    ],
    selector: '#selector1',
    unwanted: [
      '.class-1',
      '.class-2',
      '.class-3'
    ],
    transform: transformOne
  },
  {
    patterns: [
      '*://goo.gl/*',
      '*://google.inc/*'
    ],
    selector: '#selector2',
    unwanted: [
      '.class-3',
      '.class-4',
      '.class-5'
    ],
    transform: transformTwo
  }
]
```

As you can see, an article from `goo.gl` matches both of them.

In this scenario, `article-parser` handles it as follows (see the sketch after this list):

* only the selector from the first matching rule (`#selector1`) is used
* the two `unwanted` lists are merged, with duplicates removed
* both transform functions run: `transformOne()` first, then `transformTwo()`
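
Roughly, the effective rule applied to that `goo.gl` article can be sketched as below. This is only an illustration of the merge, not the library's actual internal representation:

```js
const effectiveRule = {
  selector: '#selector1', // taken from the first matching rule only
  unwanted: [
    '.class-1', '.class-2', '.class-3', // from the first rule
    '.class-4', '.class-5' // from the second rule ('.class-3' deduplicated)
  ],
  // both transform functions run, in rule order
  transform: (document) => transformTwo(transformOne(document))
}
```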

#### setQueryRules(Array queryRules)

Similar to `addQueryRules()` but new rules will replace the current query rules.
Similar to `addQueryRules()`, but the new rules completely replace the current query rules.
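
For example, the call below wipes the existing rules and installs a single custom rule; the return value is the new total number of rules (the pattern here is hypothetical):

```js
const total = setQueryRules([
  {
    patterns: ['*://example.com/*'],
    selector: 'article'
  }
])
// total === 1: only this rule remains, all previous rules are gone
```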

#### getQueryRules()

104 changes: 52 additions & 52 deletions dist/article-parser.browser.js

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions dist/article-parser.browser.js.map

Large diffs are not rendered by default.

138 changes: 69 additions & 69 deletions dist/cjs/article-parser.js

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions dist/cjs/article-parser.js.map

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion dist/cjs/package.json
@@ -1,5 +1,5 @@
{
"name": "article-parser-cjs",
"version": "6.0.5",
"version": "6.0.6",
"main": "./article-parser.js"
}
4 changes: 2 additions & 2 deletions index.d.ts
@@ -40,11 +40,11 @@ export function setSanitizeHtmlOptions(options: SanitizeOptions): void;

export function setHtmlCrushOptions(options: HtmlCrushOptions): void;

export function addQueryRules(...rules: Array<QueryRule>): Number;
export function addQueryRules(rules: Array<QueryRule>): Number;

export function getQueryRules(): Array<QueryRule>;

export function setQueryRules(rules: Array<QueryRule>): void;
export function setQueryRules(rules: Array<QueryRule>): Number;

export function getParserOptions(): ParserOptions;

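In other words, per the updated typings both functions now take a single array of rules (no more spread arguments) and return the new total number of rules. A quick sketch with a hypothetical rule:

```js
import { addQueryRules, setQueryRules } from 'article-parser'

const countAfterAdd = addQueryRules([
  { patterns: ['*://example.com/*'], selector: 'article' }
])
const countAfterReset = setQueryRules([]) // 0: the rule list is now empty
```
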
4 changes: 2 additions & 2 deletions package.json
@@ -1,5 +1,5 @@
{
"version": "6.0.5",
"version": "6.0.6",
"name": "article-parser",
"description": "To extract main article from given URL",
"homepage": "https://ndaidong.github.io/article-parser-demo/",
@@ -33,7 +33,7 @@
"axios": "^0.27.2",
"bellajs": "^11.0.3",
"debug": "^4.3.4",
"html-crush": "^5.0.18",
"html-crush": "^5.0.19",
"linkedom": "^0.14.12",
"sanitize-html": "^2.7.0",
"string-comparison": "^1.1.0",
81 changes: 49 additions & 32 deletions src/config.js
@@ -4,8 +4,6 @@ import { clone, copies, isArray } from 'bellajs'

import { rules as defaultRules } from './rules.js'

let rules = clone(defaultRules)

const requestOptions = {
headers: {
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0',
@@ -31,7 +29,15 @@ const sanitizeHtmlOptions = {
iframe: ['src', 'frameborder', 'height', 'width', 'scrolling'],
svg: ['width', 'height'] // sanitize-html does not support svg fully yet
},
allowedIframeDomains: ['youtube.com', 'vimeo.com']
allowedIframeDomains: ['youtube.com', 'twitter.com', 'facebook.com', 'vimeo.com']
}

/**
* @type {HtmlCrushOptions}
*/
const htmlCrushOptions = {
removeHTMLComments: 2,
removeLineBreaks: true
}

const parserOptions = {
@@ -42,76 +48,87 @@ const parserOptions = {
contentLengthThreshold: 200 // content must have at least 200 chars
}

/**
* @type {HtmlCrushOptions}
*/
const htmlCrushOptions = {
removeHTMLComments: 2,
removeLineBreaks: true
const state = {
  requestOptions,
  sanitizeHtmlOptions,
  htmlCrushOptions,
  parserOptions,
  rules: clone(defaultRules)
}

/**
* @returns {ParserOptions}
* @returns {RequestOptions}
*/
export const getParserOptions = () => {
return clone(parserOptions)
export const getRequestOptions = () => {
return clone(state.requestOptions)
}

/**
* @returns {RequestOptions}
* @returns {SanitizeOptions}
*/
export const getRequestOptions = () => {
return clone(requestOptions)
export const getSanitizeHtmlOptions = () => {
return clone(state.sanitizeHtmlOptions)
}

/**
* @returns {HtmlCrushOptions}
*/
export const getHtmlCrushOptions = () => {
return clone(htmlCrushOptions)
return clone(state.htmlCrushOptions)
}

/**
* @returns {SanitizeOptions}
* @returns {ParserOptions}
*/
export const getSanitizeHtmlOptions = () => {
return clone(sanitizeHtmlOptions)
export const getParserOptions = () => {
return clone(state.parserOptions)
}

export const setParserOptions = (opts) => {
Object.keys(parserOptions).forEach((key) => {
export const setParserOptions = (opts = {}) => {
Object.keys(state.parserOptions).forEach((key) => {
if (key in opts) {
parserOptions[key] = opts[key]
state.parserOptions[key] = opts[key]
}
})
}

export const setRequestOptions = (opts) => {
copies(opts, requestOptions)
export const setRequestOptions = (opts = {}) => {
copies(opts, state.requestOptions)
}

export const setHtmlCrushOptions = (opts) => {
copies(opts, htmlCrushOptions)
export const setHtmlCrushOptions = (opts = {}) => {
copies(opts, state.htmlCrushOptions)
}

export const setSanitizeHtmlOptions = (opts) => {
export const setSanitizeHtmlOptions = (opts = {}) => {
Object.keys(opts).forEach((key) => {
sanitizeHtmlOptions[key] = clone(opts[key])
state.sanitizeHtmlOptions[key] = clone(opts[key])
})
}

/**
* @returns {QueryRule[]}
*/
export const getQueryRules = () => clone(rules)
export const getQueryRules = () => clone(state.rules)

/**
* @param value {QueryRule[]}
* @param entries {QueryRule[]}
* @returns {number}
*/
export const setQueryRules = (value) => { rules = value }
export const setQueryRules = (entries = []) => {
  state.rules = []
  const newRules = [...entries].filter((item) => isArray(item?.patterns))
  state.rules = [...newRules]
  return state.rules.length
}

/**
 * @param entries {QueryRule[]}
* @returns {number}
*/
export const addQueryRules = (...entries) => rules.unshift(...entries.filter((item) => isArray(item?.patterns)))
export const addQueryRules = (entries = []) => {
  const { rules } = state
  const newRules = [...entries].filter((item) => isArray(item?.patterns))
  state.rules = [...rules, ...newRules]
  return state.rules.length
}
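
One consequence, illustrated below, is that custom rules are now appended to the end of the rule list instead of being prepended to the front (the rule object here is a hypothetical placeholder):

```js
addQueryRules([{ patterns: ['*://example.com/*'], selector: 'article' }])

const rules = getQueryRules()
// rules[0] is still the first default rule; the custom rule sits at the end.
// getQueryRules() returns a clone, so compare by value rather than by reference.
```
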
10 changes: 5 additions & 5 deletions src/config.test.js
@@ -98,7 +98,7 @@ test('Testing addQueryRules/setQueryRules/getQueryRules methods', () => {
expect(currentRules).toEqual(defaultRules)

addQueryRules()
addQueryRules(...[])
addQueryRules([])
expect(getQueryRules()).toHaveLength(defaultRules.length)

const newRules = [
@@ -122,15 +122,15 @@
]
}
]
addQueryRules(...newRules)
addQueryRules(newRules)

const updatedRules = getQueryRules()
expect(updatedRules).toHaveLength(defaultRules.length + newRules.length)
expect(updatedRules[0]).toEqual(newRules[0])
expect(updatedRules[updatedRules.length - 1]).toEqual(defaultRules[defaultRules.length - 1])
expect(updatedRules[0]).toEqual(defaultRules[0])
expect(updatedRules[updatedRules.length - 1]).toEqual(newRules[newRules.length - 1])

setQueryRules(newRules)
const latestUpdatedRules = getQueryRules()
expect(latestUpdatedRules).toHaveLength(2)
expect(updatedRules[1]).toEqual(newRules[1])
expect(latestUpdatedRules[1]).toEqual(newRules[1])
})
18 changes: 8 additions & 10 deletions src/utils/extractWithSelector.js
@@ -18,24 +18,22 @@ const countWord = (text) => {
* @returns {null|string}
*/
export default (html, selector = null) => {
if (!selector) return null

if (!selector) return html
try {
const document = new DOMParser().parseFromString(html, 'text/html')
const parts = []
document.querySelectorAll(selector).forEach(node => {
document.querySelectorAll(selector).forEach((node) => {
const text = node.innerHTML.trim()
if (countWord(text) >= MIN_SECTION_LENGTH) { parts.push(text) }
if (countWord(text) >= MIN_SECTION_LENGTH) {
parts.push(text)
}
})

if (parts.length) {
return parts
.reduce((prev, curr) => prev.concat([curr]), [])
return parts.length > 0
? parts.reduce((prev, curr) => prev.concat([curr]), [])
.filter((sect) => stripTags(sect).length > MIN_TEXT_LENGTH)
.join('')
}

return document.documentElement.innerHTML
: document.documentElement.innerHTML
} catch (err) {
logger.error(err)
}
5 changes: 3 additions & 2 deletions src/utils/findRulesByUrl.js
@@ -9,13 +9,14 @@ import 'urlpattern-polyfill'
* @returns {QueryRule|{}}
*/
export default (urls = []) => {
  const appliedRules = []
  const rules = getQueryRules()
  for (const rule of rules) {
    const { patterns } = rule
    const matched = urls.some((url) => patterns.some((pattern) => new URLPattern(pattern).test(url)))
    if (matched) {
      return rule
      appliedRules.push(rule)
    }
  }
  return {}
  return appliedRules
}
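
The effect of this change, sketched below, is that the helper now collects every rule whose patterns match instead of returning only the first hit (previously a single rule object, or `{}` when nothing matched):

```js
// rough illustration; findRulesByUrl is an internal helper in src/utils
import findRulesByUrl from './findRulesByUrl.js'

const matched = findRulesByUrl(['https://goo.gl/some-article'])
// given the two goo.gl rules from the README's "Priority order" example,
// `matched` now contains both of them, in stored order ([] when none match)
```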