Commit 4207730 (initial commit): 19 changed files with 4,375 additions and 0 deletions.

File: .gitignore

data/*.jsonl
data/*.json
data/*.txt
data/*/*.jsonl
data/*/*.json
data/*/*.txt
!data/websites_to_exclude.txt
!data/test_data.jsonl
!data/*/test_data.jsonl

debugging_screenshots/*
!debugging_screenshots/.gitkeep

node_modules/
package-lock.json

*.code-workspace
.vscode/

*.img

proxy.conf

File: LICENSE

MIT License

Copyright (c) 2020 TU/e and EPFL

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

File: README.md

# IRIS websites scraper
Scraper for the IRIS project, meant to run on a [Raspberry Pi (RPi)](https://www.raspberrypi.org/).

## Reproducibility
To reproduce the results, please follow these steps [1]

1. Install [Raspberry Pi OS](https://www.raspberrypi.org/downloads/)
2. Install [Git](https://git-scm.com/) and jq by running ``sudo apt install -y git jq``
3. Clone this repository
    * ``cd ~/Documents``
    * ``git clone https://gitlab.tue.nl/iris/iris-scraper.git``
    * ``cd iris-scraper``
4. Install [Node.js](https://nodejs.org/)
    * If you are using an RPi Zero or an RPi 1 (i.e., a device with an ARMv6 architecture), use the following script (be aware that support for this architecture is experimental)
        * ``sudo ./install-nodejs-12-rpi_zero.sh``
    * Otherwise, run the following commands
        * ``curl -sL https://deb.nodesource.com/setup_12.x | sudo bash -``
        * ``sudo apt-get install -y nodejs``
5. Install the needed Node.js packages with ``npm install``
6. Add a ``data.jsonl`` file with the information to scrape to the ``data/data_from_database/`` folder
7. Create a new daemon that manages the scraping script
    * ``sudo cp iris-scraper@.service /etc/systemd/system/``

[1] Note that, for this part of the IRIS project, the results cannot be perfectly reproduced, since they depend on many factors, some of which are random and/or evolve over time.

### Possible types of scraper
* Websites scrapers (first phase) [2]
    * choose ``website`` to use only Bloomberg and the Google front page;
    * choose ``website_sbir`` to also use SBIR as a scraping source;
* VPM pages scrapers (second phase)
    * choose ``vpm`` to use Google to search for the patent numbers within the detected websites

For each recipient/assignee name listed in ``data/data_from_database/data.jsonl``, the websites scrapers search for its website on
* The SBIR website (not mandatory)
* Bloomberg
* Google (first page; about 10 results)

The results are listed in the ``data/data_from_websites_scraper/results.jsonl`` file.

The VPM pages scraper, instead, uses the previously detected websites and asks Google to search for the related patent numbers within them. The number of results scraped is at most the number of websites times the number of patents (e.g., 3 websites and 2 patents yield at most 6 results).

[2] The idea is that looking up recipients' websites on the SBIR website only makes sense if the names you are searching for actually belong to recipients of the SBIR program.

## Scrapers configuration
The scrapers can be fine-tuned through some parameters that you can modify in the ``scraper.conf`` file. Specifically:
* ``SCRAPING_RATE`` controls the target pace (in seconds per item) at which the scraper should go (default ``120``). If the scraper takes less than the target, it will wait; if it takes more, it will try to compensate in the following rounds (since this is an average target, not a per-item target).
* ``USE_HEADLESS`` controls whether the scraper hides the browser window (default ``true``)
* ``USE_MOBILE`` controls whether the scraper simulates a mobile environment (default ``true``)
* ``CHROME_PATH`` contains the path to the Chromium/Chrome browser on your system (default ``null``). If it is ``null``, the script will try to guess the path in the following order of preference:
    * if your OS is detected as MS Windows, the script will use the executable found at ``C:\Program Files (x86)\Google\Chrome\Application\chrome.exe``;
    * otherwise, the executable at ``/usr/bin/google-chrome-stable`` will be used, if present;
    * or ``/usr/bin/chromium-browser`` will be used (if this file is not present either, the scraper will raise an error).
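
For reference, a minimal ``scraper.conf`` with the default values might look like the sketch below (the exact key/value syntax is an assumption based on the parameter names; follow the format of the ``scraper.conf`` shipped with the repository):

```sh
# scraper.conf -- illustrative values only
SCRAPING_RATE=120   # target average seconds per scraped item
USE_HEADLESS=true   # hide the browser window
USE_MOBILE=true     # simulate a mobile browser environment
CHROME_PATH=null    # let the scraper guess the browser path
```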

## Scraping phases
1. The first thing to do is to search for the websites of the US Federal funds recipients and/or USPTO patent assignees, with the following commands (where ``<scraper-type>`` is the type of scraper you want to use; see the possible types above)
    * ``sudo systemctl enable iris-scraper@<scraper-type>.service``
    * ``sudo systemctl start iris-scraper@<scraper-type>.service``
2. Once it has finished, you can stop and disable the scraper used so far
    * ``sudo systemctl disable iris-scraper@<scraper-type>.service``
    * ``sudo systemctl stop iris-scraper@<scraper-type>.service``
3. Then, you must clean the detected websites to remove the overly common ones that are likely false positives [3]
    * ``python clean-scraped-websites.py -I data/data_from_websites_scraper/results.jsonl data/websites_to_exclude.txt -o data/data_from_websites_scraper/results_clean.jsonl``
4. Lastly, you must use the VPM pages scraper with the following commands
    * ``sudo systemctl enable iris-scraper@vpm.service``
    * ``sudo systemctl start iris-scraper@vpm.service``
5. Again, stop and disable the scraper once it has finished
    * ``sudo systemctl disable iris-scraper@vpm.service``
    * ``sudo systemctl stop iris-scraper@vpm.service``

If you work in a GNU/Linux environment, you can get some basic statistics about the ongoing scraping process by running ``./stats.sh website`` (or ``./stats.sh vpm``, according to the step you are currently running)

[3] For now, this script is in Python. I advise you to execute it within a Conda environment. The ``json`` and ``datetime`` Python packages must be available (both are part of the standard library). Moreover, you must install the ``iris-utils`` package from the [iris-utils](https://gitlab.tue.nl/iris/iris-utils) repository. The advice is to reuse the environment of the [iris-database](https://gitlab.tue.nl/iris/iris-database) repository.
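
A possible environment setup is sketched below (the exact installation command for ``iris-utils`` is an assumption; check that repository for the supported installation method):

```sh
# Sketch of a Conda environment for the cleaning script (commands are assumptions)
conda create -n iris python=3.8
conda activate iris
# json and datetime ship with the standard library; only iris-utils needs installing
pip install git+https://gitlab.tue.nl/iris/iris-utils.git
```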

## Working of the Systemd daemons
When you start one of the daemons, the script will start (provided that you have a working Internet connection) and will restart every time you switch on your RPi and get connected to the Internet.<br>
At the end of the scraping process, the rows on which an error has been reported are deleted and the scraper tries again.<br>
If errors occur, the systemd daemon that controls this process will restart it automatically up to 5 times. After that, manual intervention is required to continue (consider that at least one restart is usually needed to deal with the errors that are likely to occur during the scraping process; by experience, you can expect at least 0.15% failures even in a "successful" run).

To stop the script, run<br>
``sudo systemctl stop iris-scraper.service``<br>
Be patient: it can take even more than 2 minutes to stop, because of the way in which the JavaScript code is written.<br>
Moreover, consider that this is a hard stop that will end up as an ERROR in ``results.jsonl``

To see what the script is doing, you can run the following command (use ``website``, ``website_sbir``, or ``vpm`` according to the daemon currently running)<br>
``journalctl -u iris-scraper@<scraper-type>.service -f``<br>
Press CTRL+C to go back to the shell

Note: by default on the RPi, the logs (i.e., what the ``journalctl`` command reads) are erased when you shut down the machine. To preserve the logs of past sessions, you need either to run
* ``sudo mkdir -p /var/log/journal``
* ``sudo systemd-tmpfiles --create --prefix /var/log/journal``
* ``sudo systemctl restart systemd-journald``

or to set ``Storage=persistent`` in ``/etc/systemd/journald.conf``.
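
If you prefer the second option, a minimal sketch of the change (edit the file and restart the journal service; back it up first if you like) is:

```sh
# Make journald storage persistent via /etc/systemd/journald.conf
sudo sed -i 's/^#\?Storage=.*/Storage=persistent/' /etc/systemd/journald.conf
sudo systemctl restart systemd-journald
```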

## Data format
The data files are formatted according to the JSONL format (i.e., one JSON object per line).

Each line must have the following structure<br>
``{"award_recipient": "corporation name with legal type", "patent_assignee": "corporation name with legal type", "patent_id": [193765482, 917253468]}``<br>
The award recipient's name is not mandatory.
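
For example, the test data included in this commit contains lines such as:

```
{"award_recipient":"21CT INC","patent_assignee":"21CT INC","patent_id":[9020816]}
{"patent_assignee": "ZYNON TECHNOLOGIES LLC", "patent_id": [9733229]}
```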

To split the full database into random chunks, one for each machine (RPi) you have, you can use the following commands
* ``shuf f_in.jsonl | split -a1 -d -l $(( $(wc -l <f_in.jsonl) * 1 / N )) - f_out``
* ``find . -type f ! -name "*.*" -exec mv {} {}.jsonl \;``
where f_in.jsonl is the full database; f_out is the name you want to give to the chunks (it will be followed by a progressive number); N is the number of chunks you want to create; and `.` is the local folder (if the files are in another folder, substitute it with the correct path).<br>
Remember that the standard input file name for the scraping process is always ``data/data_from_database/data.jsonl``. The easiest way is simply to copy one of the numbered files to each device you have and then, on each device, create a local copy simply called ``data/data_from_database/data.jsonl`` (preserving the original file helps to remember the progressive number, should it be useful for some reason).<br>
Note: if the number of lines of the original file (f_in.jsonl) is not divisible by the number of chunks desired, an additional (N+1-th) file will be created with the few extra lines still unassigned.
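
As an illustration, splitting a hypothetical ``f_in.jsonl`` into N=4 chunks named ``chunk*`` would look like this:

```sh
# Illustrative split into 4 chunks (file and chunk names are placeholders)
shuf f_in.jsonl | split -a1 -d -l $(( $(wc -l <f_in.jsonl) * 1 / 4 )) - chunk
find . -type f ! -name "*.*" -exec mv {} {}.jsonl \;
# -> chunk0.jsonl ... chunk3.jsonl (plus chunk4.jsonl if the line count is not divisible by 4)
cp chunk0.jsonl data/data_from_database/data.jsonl   # on the first device
```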

After the scraping, you can collect the results from each device in a common folder. Rename each ``data/data_from_websites_scraper/results.jsonl`` file with a progressive number (as done above for the split files). Then, use a command like this to concatenate the output files into a single one<br>
``cat dod_sbir_citations_to_scrape_with_potential_websites_<?>.jsonl > dod_sbir_citations_to_scrape_with_potential_websites.jsonl``<br>
where ``<?>`` is to be replaced by the shell wildcard ``?``, which matches any of the progressive numbers.<br>
Note: this works only if you have fewer than 10 files (the ``?`` wildcard matches a single character).
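
Concretely, with the file names used above, the command to run is:

```sh
# '?' matches the single-digit progressive number of each per-device file
cat dod_sbir_citations_to_scrape_with_potential_websites_?.jsonl > dod_sbir_citations_to_scrape_with_potential_websites.jsonl
```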

## Control the process remotely
From the RPi configuration tool (``sudo raspi-config`` or the desktop menu), enable the CLI interface (not mandatory) and enable the SSH interface.<br>
You can now control the RPi remotely through SSH, either from another computer or through a smartphone app (there are even some explicitly dedicated to the RPi).

## Use a Proxy server
It is possible to use a proxy server. To do so, you must modify the ``proxy.conf.example`` file and rename it to ``proxy.conf``.

Parameters:<br>
* ``PROXY_ADDRESS`` is the address of the proxy server
* ``PROXY_PORT`` is the port of the proxy server
* ``PROXY_USER`` is the proxy server's username
* ``PROXY_PASSWORD`` is the proxy server's password
* ``PROXY_ROTATE`` is the API address called to rotate your proxy server
* ``PROXY_STATUS`` is a function (passed as a string) that must return two values, ``[proxy_ok, proxy_msg]``: a boolean that says whether the proxy rotated correctly and a string that will be printed (e.g., with the IP address assigned by the proxy server)
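
A hypothetical ``proxy.conf`` could look like the sketch below (all values are placeholders and the ``PROXY_STATUS`` function is only an illustration; the actual syntax is defined by ``proxy.conf.example``):

```sh
# proxy.conf -- placeholder values; follow proxy.conf.example for the real syntax
PROXY_ADDRESS=proxy.example.com
PROXY_PORT=8080
PROXY_USER=myuser
PROXY_PASSWORD=mypassword
PROXY_ROTATE=https://api.example.com/rotate
PROXY_STATUS="(res) => [res.ok, 'New IP: ' + res.ip]"
```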

## Run the scraper without Systemd
You can also run the scraping process without the use of Systemd. In this case, you must run one of<br>
* ``node scrape-for-websites.js -i <INPUT_FILE.jsonl> -o <OUTPUT_FILE.jsonl> --sbir <true/false> --proxy <true/false> --timestamp <true/false>``
* ``node scrape-for-vpm-pages.js -i <INPUT_FILE.jsonl> -o <OUTPUT_FILE.jsonl> --sbir <true/false> --proxy <true/false> --timestamp <true/false>``

Parameters<br>
* ``i`` is the input file
* ``o`` is the output file
* ``sbir`` also uses the SBIR website as a source of information (default ``false``). Note: in any case, only the lines with an ``award_recipient`` are searched on the SBIR website
* ``proxy`` uses the ``proxy.conf`` parameters in the scraper
* ``timestamp`` also prints the date alongside the messages
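
For instance, a run of the websites scraper on the standard input and output paths mentioned above, without SBIR and without a proxy but with timestamps, might look like this sketch:

```sh
# Illustrative standalone run of the websites scraper
node scrape-for-websites.js \
  -i data/data_from_database/data.jsonl \
  -o data/data_from_websites_scraper/results.jsonl \
  --sbir false --proxy false --timestamp true
```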

## Acknowledgements
The authors thank the EuroTech Universities Alliance for sponsoring this work. Carlo Bottai was supported by the European Union's Marie Skłodowska-Curie programme for the project Insights on the "Real Impact" of Science (H2020 MSCA-COFUND-2016 Action, Grant Agreement No 754462).

File: clean-scraped-websites.py

#!/usr/bin/env python

"""
Clean the most common websites from the list of detected websites
You are supposed to run this script after the scraping of the websites
and before the scraping of the VPM pages
First, it adds the domains that appear more than 20 times to a file
that collects these common domains from previous scrapings
Then, it removes the domains listed in the file from the results
Author: Carlo Bottai
Copyright (c) 2020 - TU/e and EPFL
License: See the LICENSE file.
Date: 2020-10-15
"""


# TODO
# This version of the code has a potentially significant limitation. It
# excludes websites like google.com or apple.com. This is desirable in most
# cases, but not when the company of interest actually is, e.g., Google or
# Apple. On the other hand, it is also unlikely to find any VPM page under
# these domains, since many of these big companies will have a separate
# domain for their VPM, if any exists


import json
from datetime import datetime
from iris_utils.parse_args import parse_io


def main():
    args = parse_io()

    with open(args.input_list[0], 'r') as f_in:
        data = [json.loads(line) for line in f_in.read().splitlines()]

    # collect every scraped URL, reduce it to its last two labels
    # (e.g., sub.foo.com -> foo.com), and drop domains whose name
    # (before the TLD) is at most 2 characters long
    all_websites = [url for line in data \
        for url in line['scraped_websites']]
    all_websites = ['.'.join(url.split('.')[-2:]) \
        for url in all_websites if url is not None]
    all_websites = [domain \
        for domain in all_websites if len(domain.split('.')[0])>2]
    all_websites_freq = {i:all_websites.count(i) for i in set(all_websites)}
    # TODO: generalize next line using an argument for the threshold
    exclude_websites = [k for k,v in all_websites_freq.items() if v>=10]

    # back up the current exclusion list, then interactively append
    # the newly found frequent domains to it
    f_out_name = args.input_list[1].split('.')[-2:-1][0]
    now = datetime.now().strftime('%H%M%y%m%d')
    with open(args.input_list[1], 'a') as f_out, \
            open(args.input_list[1], 'r') as f_in, \
            open(f'{f_out_name}_{now}.txt', 'w') as f_bak:
        websites_already_in_file = f_in.read()
        f_bak.write(websites_already_in_file)
        websites_already_in_file = websites_already_in_file.splitlines()
        exclude_websites_to_add = [exclude_website \
            for exclude_website in exclude_websites if \
            exclude_website is not None and \
            exclude_website not in websites_already_in_file and \
            not exclude_website.replace('.','').isnumeric()]
        for website in exclude_websites_to_add:
            add = input(f'Add {website} to the list of excluded websites? [y]/n ')
            if add=='' or add=='y':
                f_out.write(f'{website}\n')

    with open(args.input_list[1], 'r') as f_in:
        exclude_websites = f_in.read().splitlines()

    # for line in data:
    #     line['scraped_websites'] = [website \
    #         for website in line['scraped_websites'] if \
    #         website is not None and \
    #         not any([website.endswith(end) \
    #             for end in ['.gov','.edu','.mil','.int']]) and \
    #         '.'.join(website.split('.')[-2:]) not in exclude_websites and \
    #         not any([website.find(exclude_website)>=0 \
    #             for exclude_website in exclude_websites]) and \
    #         not website.replace('.','').isnumeric()]
    #     if len(line['scraped_websites'])==0:
    #         line['scraped_websites'] = [None]

    # remove government/educational/military/international, purely numeric,
    # and excluded domains from each line's scraped websites
    for line in data:
        line['scraped_websites'] = [website \
            for website in line['scraped_websites'] if \
            website is not None and \
            not any([website.endswith(end) \
                for end in ['.gov','.edu','.mil','.int']]) and \
            '.'.join(website.split('.')[-2:]) not in exclude_websites and \
            not website.replace('.','').isnumeric()]
        for website in line['scraped_websites']:
            to_exclude = False
            for exclude_website in exclude_websites:
                exclude_website_len = len(exclude_website.split('.'))
                # FIXME Why has lens.org not been removed? I think this is not
                # working properly because you can have cases like
                # http://www.foo.com/web_page that are not excluded by foo.com
                # (even though, in theory, you should have only foo.com in the
                # extracted webpages, without the pages)
                if '.'.join(website.split('.')[-exclude_website_len:])==exclude_website:
                    to_exclude = True
            if to_exclude==True:
                line['scraped_websites'] = [ws for ws in line['scraped_websites'] if ws!=website]
        if len(line['scraped_websites'])==0:
            line['scraped_websites'] = [None]

    with open(args.output, 'w') as f_out:
        for line_data in data:
            json.dump(line_data, f_out, separators=(',',':'))
            f_out.write('\n')


if __name__ == '__main__':
    main()

File: test_data.jsonl (sample scraper input)

{"award_recipient":"21CT INC","patent_assignee":"21CT INC","patent_id":[9020816]} | ||
{"award_recipient":"21ST CENTURY TECHNOLOGIES INC","patent_assignee":"21CT INC","patent_id":[8069127,8467628,8929645,8611591,9020816]} | ||
{"patent_assignee": "ZYNON TECHNOLOGIES LLC", "patent_id": [9733229]} | ||
{"patent_assignee": "__PERSON__", "patent_id": [8628955, 8670837, 8790455, 8970706]} |

(The commit also adds an empty placeholder file, presumably ``debugging_screenshots/.gitkeep``.)

File: test_data.jsonl (sample websites-scraper output)

{"index":[119530,119531,119532],"patent_assignee":"BRE COMMUNICATION LLC","forward_citation_id":[10212077],"scraped_websites":["silo.tips","corpora.tika.apache.org"]} | ||
{"index":[2450,13121,13122,74198,150442,180702,203456],"patent_assignee":"PEREGRINE TURBINE TECHNOLOGIES LLC","forward_citation_id":[10254048,10101092,9540999,10385735,10072574,9657599],"scraped_websites":["peregrineconsulting.com","peregrineturbine.com","maineaerospace.com"]} | ||
{"index":[32976,32977,32978,32979],"patent_assignee":"ICONTROL INC","forward_citation_id":[8525642],"scraped_websites":["icontrol-inc.com"]} | ||
{"index":[212781,212782],"patent_assignee":"NEOPHOTONICS CORP","forward_citation_id":[9482862],"scraped_websites":["en.rusnano.com"]} | ||
{"index":[15308],"patent_assignee":"TECHNOSKIN LLC","forward_citation_id":[6360615],"scraped_websites":[null]} |