Merge pull request #327 from extractus/7.2.8
v7.2.8
ndaidong authored Jan 11, 2023
2 parents aabc0e7 + 27afe61 commit 4e3debb
Showing 28 changed files with 642 additions and 491 deletions.
2 changes: 2 additions & 0 deletions .eslintignore
@@ -0,0 +1,2 @@
node_modules
dist
121 changes: 121 additions & 0 deletions .eslintrc.json
@@ -0,0 +1,121 @@
{
"parserOptions": {
"ecmaVersion": "latest",
"sourceType": "module"
},
"env": {
"es6": true,
"node": true,
"browser": true,
"jest": true
},
"globals": {
"globalThis": true
},
"plugins": [],
"overrides": [],
"extends": ["eslint:recommended"],
"rules": {
"arrow-spacing": ["error", { "before": true, "after": true }],
"block-spacing": ["error", "always"],
"brace-style": ["error", "1tbs", { "allowSingleLine": true }],
"camelcase": ["error", {
"allow": ["^UNSAFE_"],
"properties": "never",
"ignoreGlobals": true
}],
"comma-dangle": ["error", {
"arrays": "always-multiline",
"objects": "always-multiline",
"imports": "never",
"exports": "never",
"functions": "never"
}],
"comma-spacing": ["error", { "before": false, "after": true }],
"eol-last": "error",
"eqeqeq": ["error", "always", { "null": "ignore" }],
"func-call-spacing": ["error", "never"],
"indent": [
"error",
2,
{
"MemberExpression": 1,
"FunctionDeclaration": {
"body": 1,
"parameters": 2
},
"SwitchCase": 1
}
],
"key-spacing": ["error", { "beforeColon": false, "afterColon": true }],
"keyword-spacing": ["error", { "before": true, "after": true }],
"lines-between-class-members": ["error", "always", { "exceptAfterSingleLine": true }],
"max-len": [
"error",
{
"code": 120,
"ignoreTrailingComments": true,
"ignoreComments": true,
"ignoreUrls": true
}
],
"max-lines": [
"error",
{
"max": 360,
"skipBlankLines": true,
"skipComments": false
}
],
"max-lines-per-function": [
"error",
{
"max": 150,
"skipBlankLines": true
}
],
"max-params": ["error", 3],
"no-array-constructor": "error",
"no-mixed-spaces-and-tabs": "error",
"no-multi-spaces": "error",
"no-multi-str": "error",
"no-multiple-empty-lines": [
"error",
{
"max": 1,
"maxEOF": 0
}
],
"no-restricted-syntax": [
"error",
"WithStatement",
"BinaryExpression[operator='in']"
],
"no-trailing-spaces": "error",
"no-use-before-define": [
"error",
{
"functions": true,
"classes": true,
"variables": false
}
],
"no-var": "warn",
"object-curly-spacing": ["error", "always"],
"padded-blocks": [
"error",
{
"blocks": "never",
"switches": "never",
"classes": "never"
}
],
"quotes": ["error", "single"],
"space-before-blocks": ["error", "always"],
"space-before-function-paren": ["error", "always"],
"space-infix-ops": "error",
"space-unary-ops": ["error", { "words": true, "nonwords": false }],
"space-in-parens": ["error", "never"],
"semi": ["error", "never"]
}
}
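
For orientation, here is a minimal sketch of code written to satisfy several of the rules above: `semi: never`, single `quotes`, `space-before-function-paren`, multiline `comma-dangle`, `max-params` of 3, and the `no-restricted-syntax` ban on the `in` operator. The helper name and object shape are hypothetical and are not taken from this repository.

```js
// Hypothetical helper, written only to illustrate the lint rules above.
// max-params allows at most 3 parameters; space-before-function-paren
// requires a space before "(" in function expressions.
const describeArticle = function (article, fallbackTitle = 'Untitled') {
  // The `in` operator is banned by no-restricted-syntax, so test for an
  // own property explicitly (unlike `in`, this ignores inherited properties).
  const hasTitle = Object.prototype.hasOwnProperty.call(article, 'title')

  // comma-dangle: multiline object literals end with a trailing comma.
  return {
    title: hasTitle ? article.title : fallbackTitle,
    hasTitle,
  }
}
```

Calling `describeArticle({})` would fall back to the default title, since the empty object has no own `title` property.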
6 changes: 6 additions & 0 deletions .npmignore
@@ -0,0 +1,6 @@
node_modules
coverage
.github
pnpm-lock.yaml
examples
test-data
17 changes: 0 additions & 17 deletions CONTRIBUTING.md
@@ -6,21 +6,6 @@ Collaborations and pull requests are always welcomed, though larger proposals sh

As an OSS, it's better to follow the Unix philosophy: "do one thing and do it well".

## What you can contribute?

We are planing to re-write this tool in TypeScript and make it Deno-first library.
If you are interested, please join our team.

Besides that, you can also:

- Test and report bugs
- Fix unresolved issues
- Improve performance
- Write better documentation
- Create examples or build demos
- Feedback on software design and APIs


## Third-party libraries

Please avoid using libaries other than those available in the standard library, unless necessary.
@@ -30,8 +15,6 @@ This library needs to be simple and flexible to run on multiple platforms such a

## Coding convention

Please follow [standardjs](https://standardjs.com/) style guide.

Make sure your code lints before opening a pull request.


92 changes: 64 additions & 28 deletions README.md
@@ -6,19 +6,9 @@ Extract main article, main image and meta data from URL.
![CodeQL](https://github.com/extractus/article-extractor/workflows/CodeQL/badge.svg)
![CI test](https://github.com/extractus/article-extractor/workflows/ci-test/badge.svg)
[![Coverage Status](https://coveralls.io/repos/github/extractus/article-extractor/badge.svg?branch=main)](https://coveralls.io/github/extractus/article-extractor?branch=main)
[![JavaScript Style Guide](https://img.shields.io/badge/code_style-standard-brightgreen.svg)](https://standardjs.com)
[![CodeFactor](https://www.codefactor.io/repository/github/extractus/article-extractor/badge)](https://www.codefactor.io/repository/github/extractus/article-extractor)


## Intro

*article-extractor* is a part of tool sets for content builder:

- [feed-extractor](https://github.com/extractus/feed-extractor): extract & normalize RSS/ATOM/JSON feed
- [article-extractor](https://github.com/extractus/article-extractor): extract main article from given URL
- [oembed-extractor](https://github.com/extractus/oembed-extractor): extract oEmbed data from supported providers

You can use one or combination of these tools to build news sites, create automated content systems for marketing campaign or gather dataset for NLP projects...

### Attention

`article-parser` has been renamed to `@extractus/article-extractor` since v7.2.5
@@ -73,16 +63,10 @@ import { read } from 'https://unpkg.com/@extractus/article-extractor@latest/dist
Please check [the examples](examples) for reference.


### Deta cloud

For [Deta](https://www.deta.sh/) devs please refer [the source code and guideline here](https://github.com/ndaidong/article-parser-deta) or simply click the button below.

[![Deploy](https://button.deta.dev/1/svg)](https://go.deta.dev/deploy?repo=https://github.com/ndaidong/article-parser-deta)


## APIs

- [extract()](#extract)
- [extractFromHtml()](#extractfromhtml)
- [Transformations](#transformations)
- [`transformation` object](#transformation-object)
- [.addTransformations](#addtransformationsobject-transformation--array-transformations)
@@ -104,21 +88,20 @@ extract(String input, Object parserOptions)
extract(String input, Object parserOptions, Object fetchOptions)
```

#### Parameters

##### `input` *required*

URL string links to the article or HTML content of that web page.

For example:
Example:

```js
import { extract } from '@extractus/article-extractor'

const input = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'
extract(input)
.then(article => console.log(article))
.catch(err => console.error(err))

// here we use top-level await, assume current platform supports it
try {
const article = await extract(input)
console.log(article)
} catch (err) {
console.error(err)
}
```

The result - `article` - can be `null` or an object with the following structure:
@@ -138,6 +121,13 @@ The result - `article` - can be `null` or an object with the following structure
}
```
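
Because the result can be `null`, callers may want to guard before reading any fields. A minimal sketch, assuming the result object carries `title` and `url` properties (the full structure is elided in this diff); `summarize` is a hypothetical helper, not part of the library.

```js
// Hypothetical helper: turn an extract() result into a one-line summary.
// Assumes `title` and `url` fields; returns a fallback when nothing was extracted.
const summarize = (article) => {
  if (article === null || article === undefined) {
    return 'No article could be extracted'
  }
  const title = article.title || '(untitled)'
  const url = article.url || '(unknown source)'
  return `${title} <${url}>`
}
```

It could be used as `console.log(summarize(await extract(input)))` on platforms that support top-level await.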


#### Parameters

##### `input` *required*

URL string links to the article or HTML content of that web page.

##### `parserOptions` *optional*

Object with all or several of the following properties:
@@ -207,6 +197,52 @@ For more info about proxy authentication, please refer [HTTP authentication](htt

For a deeper customization, you can consider using [Proxy](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Proxy) to replace `fetch` behaviors with your own handlers.


### `extractFromHtml()`

Extract article data from HTML string. Return a Promise object as same as `extract()` method above.

#### Syntax

```ts
extractFromHtml(String html)
extractFromHtml(String html, String url)
extractFromHtml(String html, String url, Object parserOptions)
```

Example:

```js
import { extractFromHtml } from '@extractus/article-extractor'

const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'

const res = await fetch(url)
const html = await res.text()

// you can do whatever with this raw html here: clean up, remove ads banner, etc
// just ensure a html string returned

const article = await extractFromHtml(html, url)
console.log(article)
```

#### Parameters

##### `html` *required*

HTML string which contains the article you want to extract.

##### `url` *optional*

URL string that indicates the source of that HTML content.
`article-extractor` may use this info to handle internal/relative links.
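
To see why a source URL is useful here, consider a minimal sketch of resolving relative links against a base URL with the standard WHATWG `URL` constructor, available in Node.js and browsers. This is not the library's internal implementation; `absolutify` and the sample URLs are hypothetical.

```js
// Hypothetical helper: resolve an href found in the HTML against the
// URL the page was fetched from. Not part of article-extractor's API.
const absolutify = (href, baseUrl) => {
  try {
    return new URL(href, baseUrl).href
  } catch {
    // Invalid href or missing base URL: return the link unchanged
    return href
  }
}

const base = 'https://example.com/news/2023/story.html'
console.log(absolutify('/images/cover.jpg', base)) // https://example.com/images/cover.jpg
console.log(absolutify('related.html', base)) // https://example.com/news/2023/related.html
console.log(absolutify('https://cdn.example.org/a.png', base)) // https://cdn.example.org/a.png
```

Without a base URL, a relative `href` cannot be resolved, so the sketch returns it untouched.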

##### `parserOptions` *optional*

See [parserOptions](#parseroptions-optional) above.


---

### Transformations
16 changes: 8 additions & 8 deletions build.js
@@ -11,15 +11,15 @@ const pkg = JSON.parse(readFileSync('./package.json', { encoding: 'utf-8' }))

rmSync('dist', {
force: true,
recursive: true
recursive: true,
})
mkdirSync('dist')

const buildTime = (new Date()).toISOString()
const comment = [
`// ${pkg.name}@${pkg.version}, by ${pkg.author}`,
`built with esbuild at ${buildTime}`,
`published under ${pkg.license} license`
`published under ${pkg.license} license`,
].join(' - ')

const baseOpt = {
@@ -31,7 +31,7 @@ const baseOpt = {
legalComments: 'none',
minify: true,
sourcemap: false,
write: true
write: true,
}

const esmVersion = {
@@ -40,8 +40,8 @@ const esmVersion = {
format: 'esm',
outfile: 'dist/article-extractor.esm.js',
banner: {
js: comment
}
js: comment,
},
}
buildSync(esmVersion)

@@ -52,15 +52,15 @@ const cjsVersion = {
mainFields: ['main'],
outfile: 'dist/cjs/article-extractor.js',
banner: {
js: comment
}
js: comment,
},
}
buildSync(cjsVersion)

const cjspkg = {
name: pkg.name,
version: pkg.version,
main: './article-extractor.js'
main: './article-extractor.js',
}

writeFileSync(
54 changes: 26 additions & 28 deletions dist/article-extractor.esm.js

Large diffs are not rendered by default.

84 changes: 41 additions & 43 deletions dist/cjs/article-extractor.js

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions dist/cjs/index.d.ts
@@ -73,3 +73,5 @@ export interface ArticleData {
}

export function extract(input: string, parserOptions?: ParserOptions, fetchOptions?: FetchOptions): Promise<ArticleData>;

export function extractFromHtml(html: string, url?: string, parserOptions?: ParserOptions): Promise<ArticleData>;
2 changes: 1 addition & 1 deletion dist/cjs/package.json
@@ -1,5 +1,5 @@
{
"name": "@extractus/article-extractor",
"version": "7.2.7",
"version": "7.2.8",
"main": "./article-extractor.js"
}
2 changes: 1 addition & 1 deletion examples/browser-article-parser/server.js
@@ -8,7 +8,7 @@ const app = express()
const loadRemotePage = async (url) => {
try {
const headers = {
'Accept-Charset': 'utf-8'
'Accept-Charset': 'utf-8',
}
const data = await got(url, { headers }).text()
return data