Funnel is a lightweight YARA-based feed scraper. Give it a list of feeds and it will check them; put it in a crontab and it will regularly update the database. Any article that matches one of your YARA rules is stored in an SQLite database along with the rule that flagged it.
Install the required dependencies and you're good to go:
pip3 install -r requirements.txt
Funnel.py [-h] [-v] [-u] rule_path target_path
positional arguments:
rule_path path to directory of rules used on list of feeds
target_path path to sources list or url
optional arguments:
-h, --help show this help message and exit
-v, --verbose increase output verbosity
-u, --url scan one url instead of using sources list
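The help text above corresponds to an argparse setup along these lines (a sketch of the interface, not necessarily Funnel's exact code):

```python
import argparse

# Mirrors the usage/help text above; argument names match the CLI.
parser = argparse.ArgumentParser(prog="Funnel.py")
parser.add_argument("rule_path", help="path to directory of rules used on list of feeds")
parser.add_argument("target_path", help="path to sources list or url")
parser.add_argument("-v", "--verbose", action="store_true", help="increase output verbosity")
parser.add_argument("-u", "--url", action="store_true",
                    help="scan one url instead of using sources list")
```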
Example:
Say you want to catch every new post on the internet that contains your name or personal info. You would use as many sources as possible, and fill out the personal_info.yar rule.
Schedule this command to run regularly using crontab:
python3 Funnel.py rules/personal/ sources/sources-large.json
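A personal_info.yar rule of this kind might look like the following; the names and strings here are placeholders to replace with your own details:

```yara
rule personal_info
{
    strings:
        // placeholders: swap in your own name, email, usernames, etc.
        $name  = "Jane Doe" nocase
        $email = "jane.doe@example.com" nocase
    condition:
        any of them
}
```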
Want to scan just one url to see if it matches any of your rules?
python3 Funnel.py -u rules/ https://www.bbc.com/news/world-asia-47844000
A bar that wants all the newest margarita recipes? You could do that. Every single post about a politician, for a data visualization project on how much each person is talked about? That works too! Just add rules and sources.
The sources should be in a JSON file, with a url and a title for each source in the list. Here is a barebones example:
{
"sources-rss":[
{
"url": "https://www.reddit.com/r/netsec/.rss",
"title": "netsec subreddit"
},
{
"url": "https://www.reddit.com/r/malware/.rss",
"title": "malware subreddit"
}
]
}
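Parsing this format takes only a few lines. The sketch below (the function name is illustrative, not Funnel's actual code) reads a sources file in the structure shown above and returns (title, url) pairs:

```python
import json

def load_sources(path):
    """Read a sources JSON file and return a list of (title, url) pairs."""
    with open(path) as f:
        data = json.load(f)
    # Each top-level key (e.g. "sources-rss") holds a list of feed entries.
    return [(entry["title"], entry["url"])
            for group in data.values()
            for entry in group]
```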
Tip: extract sources from Feedly by using the opml_to_json.py script in the util folder to turn your exported Feedly OPML file into a valid sources file.
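The idea behind that conversion can be sketched as follows; this is illustrative only, and the bundled opml_to_json.py may differ in detail. OPML feed subscriptions are `<outline>` elements carrying an `xmlUrl` attribute:

```python
import xml.etree.ElementTree as ET

def opml_to_sources(opml_text):
    """Convert Feedly-style OPML text into the sources structure shown above."""
    root = ET.fromstring(opml_text)
    entries = []
    # Only outlines with an xmlUrl are actual feeds; others are folders.
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl")
        if url:
            entries.append({"url": url, "title": outline.get("title", "")})
    return {"sources-rss": entries}
```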
Some sample rules are provided in the rules folder. Any standard YARA rule will work; matching is currently done on text content only, with no file analysis yet. You can pass in a directory of rules, a nested directory of rules, or a single rule.
The database is SQLite and uses two tables. The first is a table of links to matched articles, each with a unique id. The second pairs each matched rule with the matched article's id. This keeps duplicates out of the links table and makes for easy reference.
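That two-table layout can be sketched as the following schema; the table and column names here are assumptions, not necessarily the ones Funnel actually uses:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE links (
    id    INTEGER PRIMARY KEY,
    url   TEXT UNIQUE,          -- uniqueness keeps duplicates out of this table
    title TEXT
);
CREATE TABLE matches (
    rule    TEXT,               -- name of the rule that flagged the article
    link_id INTEGER REFERENCES links(id)
);
""")
```

A single article matched by several rules then appears once in `links` and once per rule in `matches`, and a simple join recovers which rule flagged which url.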
Feel free to suggest additions to this project; even better, open a pull request!
Inspired by ThreatIngestor from InQuest