Skip to content

Commit

Permalink
Merge pull request #440 from aliparlakci/development
Browse files Browse the repository at this point in the history
v2.2
  • Loading branch information
aliparlakci authored Jun 6, 2021
2 parents 80bb4a8 + 19e9717 commit 349abbf
Show file tree
Hide file tree
Showing 28 changed files with 1,313 additions and 980 deletions.
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -139,3 +139,6 @@ cython_debug/

# Test configuration file
test_config.cfg

.vscode/
.idea/
51 changes: 43 additions & 8 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,16 +27,24 @@ If you want to use the source code or make contributions, refer to [CONTRIBUTING

The BDFR works by taking submissions from a variety of "sources" from Reddit and then parsing them to download. These sources might be a subreddit, multireddit, a user list, or individual links. These sources are combined and downloaded to disk, according to a naming and organisational scheme defined by the user.

There are two modes to the BDFR: download, and archive. Each one has a command that performs similar but distinct functions. The `download` command will download the resource linked in the Reddit submission, such as the images, video, etc. The `archive` command will download the submission data itself and store it, such as the submission details, upvotes, text, statistics, as and all the comments on that submission. These can then be saved in a data markup language form, such as JSON, XML, or YAML.
There are three modes to the BDFR: download, archive, and clone. Each one has a command that performs similar but distinct functions. The `download` command will download the resource linked in the Reddit submission, such as the images, video, etc. The `archive` command will download the submission data itself and store it, such as the submission details, upvotes, text, statistics, as and all the comments on that submission. These can then be saved in a data markup language form, such as JSON, XML, or YAML. Lastly, the `clone` command will perform both functions of the previous commands at once and is more efficient than running those commands sequentially.

Note that the `clone` command is not a true, failthful clone of Reddit. It simply retrieves much of the raw data that Reddit provides. To get a true clone of Reddit, another tool such as HTTrack should be used.

After installation, run the program from any directory as shown below:

```bash
python3 -m bdfr download
```

```bash
python3 -m bdfr archive
```

```bash
python3 -m bdfr clone
```

However, these commands are not enough. You should chain parameters in [Options](#options) according to your use case. Don't forget that some parameters can be provided multiple times. Some quick reference commands are:

```bash
Expand Down Expand Up @@ -64,6 +72,10 @@ The following options are common between both the `archive` and `download` comma
- `--config`
- If the path to a configuration file is supplied with this option, the BDFR will use the specified config
- See [Configuration Files](#configuration) for more details
- `--disable-module`
- Can be specified multiple times
- Disables certain modules from being used
- See [Disabling Modules](#disabling-modules) for more information and a list of module names
- `--log`
- This allows one to specify the location of the logfile
- This must be done when running multiple instances of the BDFR, see [Multiple Instances](#multiple-instances) below
Expand Down Expand Up @@ -124,6 +136,8 @@ The following options are common between both the `archive` and `download` comma
- `-u, --user`
- This specifies the user to scrape in concert with other options
- When using `--authenticate`, `--user me` can be used to refer to the authenticated user
- Can be specified multiple times for multiple users
- If downloading a multireddit, only one user can be specified
- `-v, --verbose`
- Increases the verbosity of the program
- Can be specified multiple times
Expand All @@ -132,13 +146,6 @@ The following options are common between both the `archive` and `download` comma

The following options apply only to the `download` command. This command downloads the files and resources linked to in the submission, or a text submission itself, to the disk in the specified directory.

- `--exclude-id`
- This will skip the download of any submission with the ID provided
- Can be specified multiple times
- `--exclude-id-file`
- This will skip the download of any submission with any of the IDs in the files provided
- Can be specified multiple times
- Format is one ID per line
- `--make-hard-links`
- This flag will create hard links to an existing file when a duplicate is downloaded
- This will make the file appear in multiple directories while only taking the space of a single instance
Expand All @@ -159,6 +166,13 @@ The following options apply only to the `download` command. This command downloa
- Sets the scheme for folders
- Default is `{SUBREDDIT}`
- See [Folder and File Name Schemes](#folder-and-file-name-schemes) for more details
- `--exclude-id`
- This will skip the download of any submission with the ID provided
- Can be specified multiple times
- `--exclude-id-file`
- This will skip the download of any submission with any of the IDs in the files provided
- Can be specified multiple times
- Format is one ID per line
- `--skip-domain`
- This adds domains to the download filter i.e. submissions coming from these domains will not be downloaded
- Can be specified multiple times
Expand All @@ -183,6 +197,10 @@ The following options are for the `archive` command specifically.
- `xml`
- `yaml`

### Cloner Options

The `clone` command can take all the options listed above for both the `archive` and `download` commands since it performs the functions of both.

## Authentication and Security

The BDFR uses OAuth2 authentication to connect to Reddit if authentication is required. This means that it is a secure, token-based system for making requests. This also means that the BDFR only has access to specific parts of the account authenticated, by default only saved posts, upvoted posts, and the identity of the authenticated account. Note that authentication is not required unless accessing private things like upvoted posts, saved posts, and private multireddits.
Expand Down Expand Up @@ -253,6 +271,7 @@ The following keys are optional, and defaults will be used if they cannot be fou
- `backup_log_count`
- `max_wait_time`
- `time_format`
- `disabled_modules`

All of these should not be modified unless you know what you're doing, as the default values will enable the BDFR to function just fine. A configuration is included in the BDFR when it is installed, and this will be placed in the configuration directory as the default.

Expand All @@ -264,6 +283,22 @@ The option `time_format` will specify the format of the timestamp that replaces

The format can be specified through the [format codes](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior) that are standard in the Python `datetime` library.

#### Disabling Modules

The individual modules of the BDFR, used to download submissions from websites, can be disabled. This is helpful especially in the case of the fallback downloaders, since the `--skip-domain` option cannot be effectively used in these cases. For example, the Youtube-DL downloader can retrieve data from hundreds of websites and domains; thus the only way to fully disable it is via the `--disable-module` option.

Modules can be disabled through the command line interface for the BDFR or more permanently in the configuration file via the `disabled_modules` option. The list of downloaders that can be disabled are the following. Note that they are case-insensitive.

- `Direct`
- `Erome`
- `Gallery` (Reddit Image Galleries)
- `Gfycat`
- `Imgur`
- `Redgifs`
- `SelfPost` (Reddit Text Post)
- `Youtube`
- `YoutubeDlFallback`

### Rate Limiting

The option `max_wait_time` has to do with retrying downloads. There are certain HTTP errors that mean that no amount of requests will return the wanted data, but some errors are from rate-limiting. This is when a single client is making so many requests that the remote website cuts the client off to preserve the function of the site. This is a common situation when downloading many resources from the same site. It is polite and best practice to obey the website's wishes in these cases.
Expand Down
89 changes: 60 additions & 29 deletions bdfr/__main__.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,35 +8,58 @@
from bdfr.archiver import Archiver
from bdfr.configuration import Configuration
from bdfr.downloader import RedditDownloader
from bdfr.cloner import RedditCloner

logger = logging.getLogger()

_common_options = [
click.argument('directory', type=str),
click.option('--config', type=str, default=None),
click.option('-v', '--verbose', default=None, count=True),
click.option('-l', '--link', multiple=True, default=None, type=str),
click.option('-s', '--subreddit', multiple=True, default=None, type=str),
click.option('-m', '--multireddit', multiple=True, default=None, type=str),
click.option('-L', '--limit', default=None, type=int),
click.option('--authenticate', is_flag=True, default=None),
click.option('--config', type=str, default=None),
click.option('--disable-module', multiple=True, default=None, type=str),
click.option('--log', type=str, default=None),
click.option('--submitted', is_flag=True, default=None),
click.option('--upvoted', is_flag=True, default=None),
click.option('--saved', is_flag=True, default=None),
click.option('--search', default=None, type=str),
click.option('--submitted', is_flag=True, default=None),
click.option('--time-format', type=str, default=None),
click.option('-u', '--user', type=str, default=None),
click.option('--upvoted', is_flag=True, default=None),
click.option('-L', '--limit', default=None, type=int),
click.option('-l', '--link', multiple=True, default=None, type=str),
click.option('-m', '--multireddit', multiple=True, default=None, type=str),
click.option('-s', '--subreddit', multiple=True, default=None, type=str),
click.option('-v', '--verbose', default=None, count=True),
click.option('-u', '--user', type=str, multiple=True, default=None),
click.option('-t', '--time', type=click.Choice(('all', 'hour', 'day', 'week', 'month', 'year')), default=None),
click.option('-S', '--sort', type=click.Choice(('hot', 'top', 'new',
'controversial', 'rising', 'relevance')), default=None),
]

_downloader_options = [
click.option('--file-scheme', default=None, type=str),
click.option('--folder-scheme', default=None, type=str),
click.option('--make-hard-links', is_flag=True, default=None),
click.option('--max-wait-time', type=int, default=None),
click.option('--no-dupes', is_flag=True, default=None),
click.option('--search-existing', is_flag=True, default=None),
click.option('--exclude-id', default=None, multiple=True),
click.option('--exclude-id-file', default=None, multiple=True),
click.option('--skip', default=None, multiple=True),
click.option('--skip-domain', default=None, multiple=True),
click.option('--skip-subreddit', default=None, multiple=True),
]

_archiver_options = [
click.option('--all-comments', is_flag=True, default=None),
click.option('-f', '--format', type=click.Choice(('xml', 'json', 'yaml')), default=None),
]


def _add_common_options(func):
for opt in _common_options:
func = opt(func)
return func
def _add_options(opts: list):
def wrap(func):
for opt in opts:
func = opt(func)
return func
return wrap


@click.group()
Expand All @@ -45,18 +68,8 @@ def cli():


@cli.command('download')
@click.option('--exclude-id', default=None, multiple=True)
@click.option('--exclude-id-file', default=None, multiple=True)
@click.option('--file-scheme', default=None, type=str)
@click.option('--folder-scheme', default=None, type=str)
@click.option('--make-hard-links', is_flag=True, default=None)
@click.option('--max-wait-time', type=int, default=None)
@click.option('--no-dupes', is_flag=True, default=None)
@click.option('--search-existing', is_flag=True, default=None)
@click.option('--skip', default=None, multiple=True)
@click.option('--skip-domain', default=None, multiple=True)
@click.option('--skip-subreddit', default=None, multiple=True)
@_add_common_options
@_add_options(_common_options)
@_add_options(_downloader_options)
@click.pass_context
def cli_download(context: click.Context, **_):
config = Configuration()
Expand All @@ -73,9 +86,8 @@ def cli_download(context: click.Context, **_):


@cli.command('archive')
@_add_common_options
@click.option('--all-comments', is_flag=True, default=None)
@click.option('-f', '--format', type=click.Choice(('xml', 'json', 'yaml')), default=None)
@_add_options(_common_options)
@_add_options(_archiver_options)
@click.pass_context
def cli_archive(context: click.Context, **_):
config = Configuration()
Expand All @@ -85,7 +97,26 @@ def cli_archive(context: click.Context, **_):
reddit_archiver = Archiver(config)
reddit_archiver.download()
except Exception:
logger.exception('Downloader exited unexpectedly')
logger.exception('Archiver exited unexpectedly')
raise
else:
logger.info('Program complete')


@cli.command('clone')
@_add_options(_common_options)
@_add_options(_archiver_options)
@_add_options(_downloader_options)
@click.pass_context
def cli_clone(context: click.Context, **_):
config = Configuration()
config.process_click_arguments(context)
setup_logging(config.verbose)
try:
reddit_scraper = RedditCloner(config)
reddit_scraper.download()
except Exception:
logger.exception('Scraper exited unexpectedly')
raise
else:
logger.info('Program complete')
Expand Down
1 change: 1 addition & 0 deletions bdfr/archive_entry/base_archive_entry.py
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ def _convert_comment_to_dict(in_comment: Comment) -> dict:
'stickied': in_comment.stickied,
'body': in_comment.body,
'is_submitter': in_comment.is_submitter,
'distinguished': in_comment.distinguished,
'created_utc': in_comment.created_utc,
'parent_id': in_comment.parent_id,
'replies': [],
Expand Down
4 changes: 4 additions & 0 deletions bdfr/archive_entry/submission_archive_entry.py
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,10 @@ def _get_post_details(self):
'link_flair_text': self.source.link_flair_text,
'num_comments': self.source.num_comments,
'over_18': self.source.over_18,
'spoiler': self.source.spoiler,
'pinned': self.source.pinned,
'locked': self.source.locked,
'distinguished': self.source.distinguished,
'created_utc': self.source.created_utc,
}

Expand Down
21 changes: 11 additions & 10 deletions bdfr/archiver.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,24 +14,24 @@
from bdfr.archive_entry.comment_archive_entry import CommentArchiveEntry
from bdfr.archive_entry.submission_archive_entry import SubmissionArchiveEntry
from bdfr.configuration import Configuration
from bdfr.downloader import RedditDownloader
from bdfr.connector import RedditConnector
from bdfr.exceptions import ArchiverError
from bdfr.resource import Resource

logger = logging.getLogger(__name__)


class Archiver(RedditDownloader):
class Archiver(RedditConnector):
def __init__(self, args: Configuration):
super(Archiver, self).__init__(args)

def download(self):
for generator in self.reddit_lists:
for submission in generator:
logger.debug(f'Attempting to archive submission {submission.id}')
self._write_entry(submission)
self.write_entry(submission)

def _get_submissions_from_link(self) -> list[list[praw.models.Submission]]:
def get_submissions_from_link(self) -> list[list[praw.models.Submission]]:
supplied_submissions = []
for sub_id in self.args.link:
if len(sub_id) == 6:
Expand All @@ -42,12 +42,13 @@ def _get_submissions_from_link(self) -> list[list[praw.models.Submission]]:
supplied_submissions.append(self.reddit_instance.submission(url=sub_id))
return [supplied_submissions]

def _get_user_data(self) -> list[Iterator]:
results = super(Archiver, self)._get_user_data()
def get_user_data(self) -> list[Iterator]:
results = super(Archiver, self).get_user_data()
if self.args.user and self.args.all_comments:
sort = self._determine_sort_function()
logger.debug(f'Retrieving comments of user {self.args.user}')
results.append(sort(self.reddit_instance.redditor(self.args.user).comments, limit=self.args.limit))
sort = self.determine_sort_function()
for user in self.args.user:
logger.debug(f'Retrieving comments of user {user}')
results.append(sort(self.reddit_instance.redditor(user).comments, limit=self.args.limit))
return results

@staticmethod
Expand All @@ -59,7 +60,7 @@ def _pull_lever_entry_factory(praw_item: (praw.models.Submission, praw.models.Co
else:
raise ArchiverError(f'Factory failed to classify item of type {type(praw_item).__name__}')

def _write_entry(self, praw_item: (praw.models.Submission, praw.models.Comment)):
def write_entry(self, praw_item: (praw.models.Submission, praw.models.Comment)):
archive_entry = self._pull_lever_entry_factory(praw_item)
if self.args.format == 'json':
self._write_entry_json(archive_entry)
Expand Down
21 changes: 21 additions & 0 deletions bdfr/cloner.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
#!/usr/bin/env python3
# coding=utf-8

import logging

from bdfr.archiver import Archiver
from bdfr.configuration import Configuration
from bdfr.downloader import RedditDownloader

logger = logging.getLogger(__name__)


class RedditCloner(RedditDownloader, Archiver):
def __init__(self, args: Configuration):
super(RedditCloner, self).__init__(args)

def download(self):
for generator in self.reddit_lists:
for submission in generator:
self._download_submission(submission)
self.write_entry(submission)
9 changes: 5 additions & 4 deletions bdfr/configuration.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,19 +13,21 @@ def __init__(self):
self.authenticate = False
self.config = None
self.directory: str = '.'
self.disable_module: list[str] = []
self.exclude_id = []
self.exclude_id_file = []
self.file_scheme: str = '{REDDITOR}_{TITLE}_{POSTID}'
self.folder_scheme: str = '{SUBREDDIT}'
self.limit: Optional[int] = None
self.link: list[str] = []
self.log: Optional[str] = None
self.make_hard_links = False
self.max_wait_time = None
self.multireddit: list[str] = []
self.no_dupes: bool = False
self.saved: bool = False
self.search: Optional[str] = None
self.search_existing: bool = False
self.file_scheme: str = '{REDDITOR}_{TITLE}_{POSTID}'
self.folder_scheme: str = '{SUBREDDIT}'
self.skip: list[str] = []
self.skip_domain: list[str] = []
self.skip_subreddit: list[str] = []
Expand All @@ -35,9 +37,8 @@ def __init__(self):
self.time: str = 'all'
self.time_format = None
self.upvoted: bool = False
self.user: Optional[str] = None
self.user: list[str] = []
self.verbose: int = 0
self.make_hard_links = False

# Archiver-specific options
self.format = 'json'
Expand Down
Loading

0 comments on commit 349abbf

Please sign in to comment.