Commit

Merge pull request #23 from fmacpro/development
Development
fmacpro authored Sep 19, 2019
2 parents 9c5ef77 + a8cd932 commit df5ff76
Showing 7 changed files with 994 additions and 992 deletions.
40 changes: 35 additions & 5 deletions README.md
@@ -68,8 +68,15 @@ The options below are set by default
 var options = {
   // puppeteer options (https://github.com/GoogleChrome/puppeteer)
   puppeteer: {
-    headless: true,
-    defaultViewport: null,
+    // puppeteer launch options (https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#puppeteerlaunchoptions)
+    launch: {
+      headless: true,
+      defaultViewport: null
+    },
+    // puppeteer goto options (https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pagegotourl-options)
+    goto: {
+      waitUntil: 'domcontentloaded'
+    }
   },
   // clean-html options (https://ghub.io/clean-html)
   cleanhtml: {
@@ -90,8 +97,6 @@ var options = {
 }
 ```
 
-For more Puppeteer launch options see https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#puppeteerlaunchoptions
-
 At a minimum you should pass a url
 
 ```
@@ -109,6 +114,32 @@ var options = {
 }
 ```
 
+If you want to pass cookies to puppeteer use the following
+
+```
+var options = {
+  puppeteer: {
+    cookies: [{ name: 'cookie1', value: 'val1', domain: '.domain1' },{ name: 'cookie2', value: 'val2', domain: '.domain2' }]
+  }
+}
+```
+
+To strip tags before processing use the following
+
+```
+var options = {
+  striptags: ['.something', '#somethingelse']
+}
+```
+
+If you need to dismiss any popups e.g. a privacy popup use the following
+
+```
+var options = {
+  clickelements: ['#button1', '#button2']
+}
+```
+
 there are some additional "complex" options available
 
 ```
@@ -144,7 +175,6 @@ npm run test
 ## Dependencies
 
 - [Puppeteer](https://github.com/GoogleChrome/puppeteer/): High-level API to control Chrome or Chromium over the DevTools Protocol
-- [lighthouse](https://ghub.io/lighthouse): Automated auditing, performance metrics, and best practices for the web.
 - [compromise](https://ghub.io/compromise): natural language processing in the browser
 - [retext](https://ghub.io/retext): Natural language processor powered by plugins
 - [retext-pos](https://github.com/retextjs/retext-pos): Plugin to add part-of-speech (POS) tags
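
Taken together, the README options touched by this diff could be combined into a single object along the following lines. This is only a sketch: the URL, cookie domain, and selectors are placeholder values, and the object is printed rather than passed to the parser so that the snippet runs without the package installed.

```javascript
// Hypothetical combined configuration. The URL, cookie, and selectors
// below are placeholders chosen for illustration, not values from the project.
const options = {
  url: 'https://example.com/article',
  puppeteer: {
    // forwarded to puppeteer.launch() according to the index.js diff
    launch: { headless: true, defaultViewport: null },
    // forwarded to page.goto() as its second argument
    goto: { waitUntil: 'domcontentloaded' },
    // spread into page.setCookie()
    cookies: [{ name: 'cookie1', value: 'val1', domain: '.example.com' }]
  },
  // selectors to strip before processing
  striptags: ['.something', '#somethingelse'],
  // selectors to click if present, e.g. to dismiss a privacy popup
  clickelements: ['#button1', '#button2']
}

console.log(JSON.stringify(options, null, 2))
```

In test.js an object of this kind is handed to `parser.parseArticle(options)`.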
34 changes: 30 additions & 4 deletions index.js
@@ -32,12 +32,22 @@ module.exports = {
     }
 
     if (typeof options.puppeteer === 'undefined') {
-      options.puppeteer = {
+      options.puppeteer = {}
+    }
+
+    if (typeof options.puppeteer.launch === 'undefined') {
+      options.puppeteer.launch = {
         headless: true,
         defaultViewport: null
       }
     }
 
+    if (typeof options.puppeteer.goto === 'undefined') {
+      options.puppeteer.goto = {
+        waitUntil: 'domcontentloaded'
+      }
+    }
+
     const actions = [articleParser(options, socket)]
 
     if (options.enabled.includes('lighthouse')) {
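
The defaulting logic in the hunk above can be sketched as a standalone helper to make one consequence visible: each sub-object is filled in only when it is entirely absent, so a caller who supplies a partial `launch` object does not get the remaining defaults merged in. The function name `applyPuppeteerDefaults` is hypothetical and does not exist in the repository; the checks themselves mirror the diff.

```javascript
// Hypothetical helper that mirrors the three `typeof ... === 'undefined'`
// checks from the diff. The name is an assumption made for illustration.
function applyPuppeteerDefaults (options) {
  if (typeof options.puppeteer === 'undefined') {
    options.puppeteer = {}
  }

  if (typeof options.puppeteer.launch === 'undefined') {
    options.puppeteer.launch = { headless: true, defaultViewport: null }
  }

  if (typeof options.puppeteer.goto === 'undefined') {
    options.puppeteer.goto = { waitUntil: 'domcontentloaded' }
  }

  return options
}

// No puppeteer key at all: both defaults are applied.
const untouched = applyPuppeteerDefaults({ url: 'https://example.com' })
console.log(untouched.puppeteer.launch) // { headless: true, defaultViewport: null }

// Partial launch object: kept as supplied, defaultViewport is NOT added.
const partial = applyPuppeteerDefaults({ puppeteer: { launch: { headless: false } } })
console.log(partial.puppeteer.launch) // { headless: false }
```

A caller who wants to override a single launch setting while keeping the rest would therefore need to repeat the other defaults explicitly.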
@@ -71,14 +81,30 @@ const articleParser = async function (options, socket) {
   socket.emit('parse:status', 'Starting Horseman')
 
   // Init puppeteer
-  const browser = await puppeteer.launch(options.puppeteer)
+  const browser = await puppeteer.launch(options.puppeteer.launch)
 
   const page = await browser.newPage()
 
   // Inject jQuery - https://stackoverflow.com/a/50598512
   const jquery = await page.evaluate(() => window.fetch('https://cdnjs.cloudflare.com/ajax/libs/jquery/3.4.1/jquery.min.js').then((res) => res.text()))
 
-  const response = await page.goto(options.url)
+  const response = await page.goto(options.url, options.puppeteer.goto)
+
+  // Inject cookies if set
+  if (typeof options.puppeteer.cookies !== 'undefined') {
+    await page.setCookie(...options.puppeteer.cookies)
+  }
+
+  // Click buttons if defined (for dismissing privacy popups etc)
+  if (typeof options.clickelements !== 'undefined') {
+    let clickelement = ''
+
+    for (clickelement of options.clickelements) {
+      if (await page.$(clickelement) !== null) {
+        await page.click(clickelement)
+      }
+    }
+  }
 
   await page.evaluate(jquery)
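
The click-if-present loop added in this hunk can be exercised without launching a browser. The sketch below assumes a stand-in object, `fakePage`, that implements only the two methods the loop relies on (`$` and `click`); both `fakePage` and the helper name `clickIfPresent` are hypothetical. It shows that selectors with no matching element are skipped rather than raising an error.

```javascript
// Hypothetical stand-in for a Puppeteer page: only '#button1' "exists".
const fakePage = {
  present: new Set(['#button1']),
  clicked: [],
  // Like page.$, resolves to a handle-like object or null
  async $ (selector) {
    return this.present.has(selector) ? {} : null
  },
  // Records the click instead of driving a browser
  async click (selector) {
    this.clicked.push(selector)
  }
}

// Same control flow as the diff: click each selector only if it matches.
async function clickIfPresent (page, selectors) {
  for (const selector of selectors) {
    if (await page.$(selector) !== null) {
      await page.click(selector)
    }
  }
}

clickIfPresent(fakePage, ['#button1', '#button2']).then(() => {
  console.log(fakePage.clicked) // [ '#button1' ]
})
```

Because the selectors are awaited one at a time, clicks happen in the order given in `clickelements`.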

@@ -491,7 +517,7 @@ const lighthouseAnalysis = async function (options, socket) {
   socket.emit('parse:status', 'Starting Lighthouse')
 
   // Init puppeteer
-  const browser = await puppeteer.launch(options.puppeteer)
+  const browser = await puppeteer.launch(options.puppeteer.launch)
 
   const results = await lighthouse(options.url, {
     port: (new URL(browser.wsEndpoint())).port,
2 changes: 1 addition & 1 deletion package-lock.json

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
 {
   "name": "horseman-article-parser",
-  "version": "0.6.1",
+  "version": "0.7.1",
   "description": "Web Page Inspection Tool. Sentiment Analysis, Keyword Extraction, Named Entity Recognition & Spell Check",
   "main": "index.js",
   "scripts": {
5 changes: 4 additions & 1 deletion stripTags.js
@@ -14,5 +14,8 @@ module.exports = [
   '.inline-audio_container',
   '.mw-editsection',
   '.mw-cite-backlink',
-  'ul.gallery'
+  'ul.gallery',
+  '.contributions__epic',
+  '.site-message',
+  '.site-message-overlay'
 ]
2 changes: 1 addition & 1 deletion test.js
@@ -4,7 +4,7 @@ const fs = require('fs')
 const options = {
   url: 'https://en.wikipedia.org/wiki/Act_of_Parliament_(UK)',
   enabled: ['lighthouse', 'screenshot', 'links', 'sentiment', 'entities', 'spelling', 'keywords']
-  //enabled: ['links', 'sentiment', 'entities', 'spelling', 'keywords']
+  // enabled: ['links', 'sentiment', 'entities', 'spelling', 'keywords']
 }
 
 parser.parseArticle(options)