Merge pull request #440 from aliparlakci/development

v2.2
Serene-Arc · Jun 6, 2021 · 349abbf
2 parents 80bb4a8 + 19e9717

Showing 28 changed files with 1,313 additions and 980 deletions.
diff --git a/.gitignore b/.gitignore
@@ -139,3 +139,6 @@ cython_debug/
 
 # Test configuration file
 test_config.cfg
+
+.vscode/
+.idea/
diff --git a/README.md b/README.md
@@ -27,16 +27,24 @@ If you want to use the source code or make contributions, refer to [CONTRIBUTING
 
 The BDFR works by taking submissions from a variety of "sources" from Reddit and then parsing them to download. These sources might be a subreddit, multireddit, a user list, or individual links. These sources are combined and downloaded to disk, according to a naming and organisational scheme defined by the user.
 
-There are two modes to the BDFR: download, and archive. Each one has a command that performs similar but distinct functions. The `download` command will download the resource linked in the Reddit submission, such as the images, video, etc. The `archive` command will download the submission data itself and store it, such as the submission details, upvotes, text, statistics, as and all the comments on that submission. These can then be saved in a data markup language form, such as JSON, XML, or YAML.
+There are three modes to the BDFR: download, archive, and clone. Each one has a command that performs similar but distinct functions. The `download` command will download the resource linked in the Reddit submission, such as the images, video, etc. The `archive` command will download the submission data itself and store it, such as the submission details, upvotes, text, statistics, as well as all the comments on that submission. These can then be saved in a data markup language form, such as JSON, XML, or YAML. Lastly, the `clone` command will perform both functions of the previous commands at once and is more efficient than running those commands sequentially.
+
+Note that the `clone` command is not a true, faithful clone of Reddit. It simply retrieves much of the raw data that Reddit provides. To get a true clone of Reddit, another tool such as HTTrack should be used.
 
 After installation, run the program from any directory as shown below:
+
 ```bash
 python3 -m bdfr download
 ```
+
 ```bash
 python3 -m bdfr archive
 ```
 
+```bash
+python3 -m bdfr clone
+```
+
 However, these commands are not enough. You should chain parameters in [Options](#options) according to your use case. Don't forget that some parameters can be provided multiple times. Some quick reference commands are:
 
 ```bash
@@ -64,6 +72,10 @@ The following options are common between both the `archive` and `download` comma
 - `--config`
   - If the path to a configuration file is supplied with this option, the BDFR will use the specified config
   - See [Configuration Files](#configuration) for more details
+- `--disable-module`
+  - Can be specified multiple times
+  - Disables certain modules from being used
+  - See [Disabling Modules](#disabling-modules) for more information and a list of module names
 - `--log`
   - This allows one to specify the location of the logfile
   - This must be done when running multiple instances of the BDFR, see [Multiple Instances](#multiple-instances) below
@@ -124,6 +136,8 @@ The following options are common between both the `archive` and `download` comma
 - `-u, --user`
   - This specifies the user to scrape in concert with other options
   - When using `--authenticate`, `--user me` can be used to refer to the authenticated user
+  - Can be specified multiple times for multiple users
+    - If downloading a multireddit, only one user can be specified
 - `-v, --verbose`
   - Increases the verbosity of the program
   - Can be specified multiple times
@@ -132,13 +146,6 @@ The following options are common between both the `archive` and `download` comma
 
 The following options apply only to the `download` command. This command downloads the files and resources linked to in the submission, or a text submission itself, to the disk in the specified directory.
 
-- `--exclude-id`
-  - This will skip the download of any submission with the ID provided
-  - Can be specified multiple times
-- `--exclude-id-file`
-  - This will skip the download of any submission with any of the IDs in the files provided
-  - Can be specified multiple times
-  - Format is one ID per line
 - `--make-hard-links`
   - This flag will create hard links to an existing file when a duplicate is downloaded
   - This will make the file appear in multiple directories while only taking the space of a single instance
@@ -159,6 +166,13 @@ The following options apply only to the `download` command. This command downloa
   - Sets the scheme for folders
   - Default is `{SUBREDDIT}`
   - See [Folder and File Name Schemes](#folder-and-file-name-schemes) for more details
+- `--exclude-id`
+  - This will skip the download of any submission with the ID provided
+  - Can be specified multiple times
+- `--exclude-id-file`
+  - This will skip the download of any submission with any of the IDs in the files provided
+  - Can be specified multiple times
+  - Format is one ID per line
 - `--skip-domain`
   - This adds domains to the download filter i.e. submissions coming from these domains will not be downloaded
   - Can be specified multiple times
@@ -183,6 +197,10 @@ The following options are for the `archive` command specifically.
     - `xml`
     - `yaml`
 
+### Cloner Options
+
+The `clone` command can take all the options listed above for both the `archive` and `download` commands since it performs the functions of both.
+
 ## Authentication and Security
 
 The BDFR uses OAuth2 authentication to connect to Reddit if authentication is required. This means that it is a secure, token-based system for making requests. This also means that the BDFR only has access to specific parts of the account authenticated, by default only saved posts, upvoted posts, and the identity of the authenticated account. Note that authentication is not required unless accessing private things like upvoted posts, saved posts, and private multireddits.
@@ -253,6 +271,7 @@ The following keys are optional, and defaults will be used if they cannot be fou
   - `backup_log_count`
   - `max_wait_time`
   - `time_format`
+  - `disabled_modules`
 
 All of these should not be modified unless you know what you're doing, as the default values will enable the BDFR to function just fine. A configuration is included in the BDFR when it is installed, and this will be placed in the configuration directory as the default.
 
@@ -264,6 +283,22 @@ The option `time_format` will specify the format of the timestamp that replaces
 
 The format can be specified through the [format codes](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior) that are standard in the Python `datetime` library.
 
+#### Disabling Modules
+
+The individual modules of the BDFR, used to download submissions from websites, can be disabled. This is helpful especially in the case of the fallback downloaders, since the `--skip-domain` option cannot be effectively used in these cases. For example, the Youtube-DL downloader can retrieve data from hundreds of websites and domains; thus the only way to fully disable it is via the `--disable-module` option.
+
+Modules can be disabled through the command line interface for the BDFR or more permanently in the configuration file via the `disabled_modules` option. The downloaders that can be disabled are listed below. Note that the names are case-insensitive.
+
+- `Direct`
+- `Erome`
+- `Gallery` (Reddit Image Galleries)
+- `Gfycat`
+- `Imgur`
+- `Redgifs`
+- `SelfPost` (Reddit Text Post)
+- `Youtube`
+- `YoutubeDlFallback`
+
 ### Rate Limiting
 
 The option `max_wait_time` has to do with retrying downloads. There are certain HTTP errors that mean that no amount of requests will return the wanted data, but some errors are from rate-limiting. This is when a single client is making so many requests that the remote website cuts the client off to preserve the function of the site. This is a common situation when downloading many resources from the same site. It is polite and best practice to obey the website's wishes in these cases.
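
The README change above says module names passed to `--disable-module` are case-insensitive. A minimal sketch of that kind of matching follows. `MODULE_NAMES` is copied from the README list, while `enabled_modules` is a hypothetical helper written for illustration and is not taken from the BDFR source.

```python
# Module names as listed in the README's "Disabling Modules" section.
MODULE_NAMES = [
    'Direct', 'Erome', 'Gallery', 'Gfycat', 'Imgur',
    'Redgifs', 'SelfPost', 'Youtube', 'YoutubeDlFallback',
]


def enabled_modules(disabled):
    """Return the module names that remain after a case-insensitive filter.

    Hypothetical helper: the real BDFR may implement this differently.
    """
    blocked = {name.lower() for name in disabled}
    return [name for name in MODULE_NAMES if name.lower() not in blocked]


remaining = enabled_modules(['youtube', 'IMGUR'])
```

In this sketch the names are compared whole, so disabling `Youtube` leaves `YoutubeDlFallback` enabled unless it is listed separately.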

diff --git a/bdfr/__main__.py b/bdfr/__main__.py
@@ -8,35 +8,58 @@
 from bdfr.archiver import Archiver
 from bdfr.configuration import Configuration
 from bdfr.downloader import RedditDownloader
+from bdfr.cloner import RedditCloner
 
 logger = logging.getLogger()
 
 _common_options = [
     click.argument('directory', type=str),
-    click.option('--config', type=str, default=None),
-    click.option('-v', '--verbose', default=None, count=True),
-    click.option('-l', '--link', multiple=True, default=None, type=str),
-    click.option('-s', '--subreddit', multiple=True, default=None, type=str),
-    click.option('-m', '--multireddit', multiple=True, default=None, type=str),
-    click.option('-L', '--limit', default=None, type=int),
     click.option('--authenticate', is_flag=True, default=None),
+    click.option('--config', type=str, default=None),
+    click.option('--disable-module', multiple=True, default=None, type=str),
     click.option('--log', type=str, default=None),
-    click.option('--submitted', is_flag=True, default=None),
-    click.option('--upvoted', is_flag=True, default=None),
     click.option('--saved', is_flag=True, default=None),
     click.option('--search', default=None, type=str),
+    click.option('--submitted', is_flag=True, default=None),
     click.option('--time-format', type=str, default=None),
-    click.option('-u', '--user', type=str, default=None),
+    click.option('--upvoted', is_flag=True, default=None),
+    click.option('-L', '--limit', default=None, type=int),
+    click.option('-l', '--link', multiple=True, default=None, type=str),
+    click.option('-m', '--multireddit', multiple=True, default=None, type=str),
+    click.option('-s', '--subreddit', multiple=True, default=None, type=str),
+    click.option('-v', '--verbose', default=None, count=True),
+    click.option('-u', '--user', type=str, multiple=True, default=None),
     click.option('-t', '--time', type=click.Choice(('all', 'hour', 'day', 'week', 'month', 'year')), default=None),
     click.option('-S', '--sort', type=click.Choice(('hot', 'top', 'new',
                                                     'controversial', 'rising', 'relevance')), default=None),
 ]
 
+_downloader_options = [
+    click.option('--file-scheme', default=None, type=str),
+    click.option('--folder-scheme', default=None, type=str),
+    click.option('--make-hard-links', is_flag=True, default=None),
+    click.option('--max-wait-time', type=int, default=None),
+    click.option('--no-dupes', is_flag=True, default=None),
+    click.option('--search-existing', is_flag=True, default=None),
+    click.option('--exclude-id', default=None, multiple=True),
+    click.option('--exclude-id-file', default=None, multiple=True),
+    click.option('--skip', default=None, multiple=True),
+    click.option('--skip-domain', default=None, multiple=True),
+    click.option('--skip-subreddit', default=None, multiple=True),
+]
+
+_archiver_options = [
+    click.option('--all-comments', is_flag=True, default=None),
+    click.option('-f', '--format', type=click.Choice(('xml', 'json', 'yaml')), default=None),
+]
+
 
-def _add_common_options(func):
-    for opt in _common_options:
-        func = opt(func)
-    return func
+def _add_options(opts: list):
+    def wrap(func):
+        for opt in opts:
+            func = opt(func)
+        return func
+    return wrap
 
 
 @click.group()
@@ -45,18 +68,8 @@ def cli():
 
 
 @cli.command('download')
-@click.option('--exclude-id', default=None, multiple=True)
-@click.option('--exclude-id-file', default=None, multiple=True)
-@click.option('--file-scheme', default=None, type=str)
-@click.option('--folder-scheme', default=None, type=str)
-@click.option('--make-hard-links', is_flag=True, default=None)
-@click.option('--max-wait-time', type=int, default=None)
-@click.option('--no-dupes', is_flag=True, default=None)
-@click.option('--search-existing', is_flag=True, default=None)
-@click.option('--skip', default=None, multiple=True)
-@click.option('--skip-domain', default=None, multiple=True)
-@click.option('--skip-subreddit', default=None, multiple=True)
-@_add_common_options
+@_add_options(_common_options)
+@_add_options(_downloader_options)
 @click.pass_context
 def cli_download(context: click.Context, **_):
     config = Configuration()
@@ -73,9 +86,8 @@ def cli_download(context: click.Context, **_):
 
 
 @cli.command('archive')
-@_add_common_options
-@click.option('--all-comments', is_flag=True, default=None)
-@click.option('-f', '--format', type=click.Choice(('xml', 'json', 'yaml')), default=None)
+@_add_options(_common_options)
+@_add_options(_archiver_options)
 @click.pass_context
 def cli_archive(context: click.Context, **_):
     config = Configuration()
@@ -85,7 +97,26 @@ def cli_archive(context: click.Context, **_):
         reddit_archiver = Archiver(config)
         reddit_archiver.download()
     except Exception:
-        logger.exception('Downloader exited unexpectedly')
+        logger.exception('Archiver exited unexpectedly')
+        raise
+    else:
+        logger.info('Program complete')
+
+
+@cli.command('clone')
+@_add_options(_common_options)
+@_add_options(_archiver_options)
+@_add_options(_downloader_options)
+@click.pass_context
+def cli_clone(context: click.Context, **_):
+    config = Configuration()
+    config.process_click_arguments(context)
+    setup_logging(config.verbose)
+    try:
+        reddit_scraper = RedditCloner(config)
+        reddit_scraper.download()
+    except Exception:
+        logger.exception('Scraper exited unexpectedly')
         raise
     else:
         logger.info('Program complete')
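
The `_add_options` change in this file turns a fixed helper into a decorator factory, so that several option lists can be stacked on one command. The sketch below shows the pattern without `click`. `tag` is a hypothetical stand-in for `click.option` that only records a name on the function, and the option lists are shortened for illustration.

```python
def tag(name):
    """Hypothetical stand-in for click.option: records `name` on the function."""
    def decorator(func):
        func.tags = getattr(func, 'tags', []) + [name]
        return func
    return decorator


def _add_options(opts: list):
    # Same shape as the factory in the diff: build a decorator from a list.
    def wrap(func):
        for opt in opts:
            func = opt(func)
        return func
    return wrap


common = [tag('--config'), tag('--log')]
archiver = [tag('--format')]


@_add_options(common)
@_add_options(archiver)
def command():
    return 'ran'
```

Decorators apply bottom-up, so in this sketch the archiver list is attached first and the common list second.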

diff --git a/bdfr/archive_entry/base_archive_entry.py b/bdfr/archive_entry/base_archive_entry.py
@@ -26,6 +26,7 @@ def _convert_comment_to_dict(in_comment: Comment) -> dict:
             'stickied': in_comment.stickied,
             'body': in_comment.body,
             'is_submitter': in_comment.is_submitter,
+            'distinguished': in_comment.distinguished,
             'created_utc': in_comment.created_utc,
             'parent_id': in_comment.parent_id,
             'replies': [],

diff --git a/bdfr/archive_entry/submission_archive_entry.py b/bdfr/archive_entry/submission_archive_entry.py
@@ -35,6 +35,10 @@ def _get_post_details(self):
             'link_flair_text': self.source.link_flair_text,
             'num_comments': self.source.num_comments,
             'over_18': self.source.over_18,
+            'spoiler': self.source.spoiler,
+            'pinned': self.source.pinned,
+            'locked': self.source.locked,
+            'distinguished': self.source.distinguished,
             'created_utc': self.source.created_utc,
         }
 

diff --git a/bdfr/archiver.py b/bdfr/archiver.py
@@ -14,24 +14,24 @@
 from bdfr.archive_entry.comment_archive_entry import CommentArchiveEntry
 from bdfr.archive_entry.submission_archive_entry import SubmissionArchiveEntry
 from bdfr.configuration import Configuration
-from bdfr.downloader import RedditDownloader
+from bdfr.connector import RedditConnector
 from bdfr.exceptions import ArchiverError
 from bdfr.resource import Resource
 
 logger = logging.getLogger(__name__)
 
 
-class Archiver(RedditDownloader):
+class Archiver(RedditConnector):
     def __init__(self, args: Configuration):
         super(Archiver, self).__init__(args)
 
     def download(self):
         for generator in self.reddit_lists:
             for submission in generator:
                 logger.debug(f'Attempting to archive submission {submission.id}')
-                self._write_entry(submission)
+                self.write_entry(submission)
 
-    def _get_submissions_from_link(self) -> list[list[praw.models.Submission]]:
+    def get_submissions_from_link(self) -> list[list[praw.models.Submission]]:
         supplied_submissions = []
         for sub_id in self.args.link:
             if len(sub_id) == 6:
@@ -42,12 +42,13 @@ def _get_submissions_from_link(self) -> list[list[praw.models.Submission]]:
                 supplied_submissions.append(self.reddit_instance.submission(url=sub_id))
         return [supplied_submissions]
 
-    def _get_user_data(self) -> list[Iterator]:
-        results = super(Archiver, self)._get_user_data()
+    def get_user_data(self) -> list[Iterator]:
+        results = super(Archiver, self).get_user_data()
         if self.args.user and self.args.all_comments:
-            sort = self._determine_sort_function()
-            logger.debug(f'Retrieving comments of user {self.args.user}')
-            results.append(sort(self.reddit_instance.redditor(self.args.user).comments, limit=self.args.limit))
+            sort = self.determine_sort_function()
+            for user in self.args.user:
+                logger.debug(f'Retrieving comments of user {user}')
+                results.append(sort(self.reddit_instance.redditor(user).comments, limit=self.args.limit))
         return results
 
     @staticmethod
@@ -59,7 +60,7 @@ def _pull_lever_entry_factory(praw_item: (praw.models.Submission, praw.models.Co
         else:
             raise ArchiverError(f'Factory failed to classify item of type {type(praw_item).__name__}')
 
-    def _write_entry(self, praw_item: (praw.models.Submission, praw.models.Comment)):
+    def write_entry(self, praw_item: (praw.models.Submission, praw.models.Comment)):
         archive_entry = self._pull_lever_entry_factory(praw_item)
         if self.args.format == 'json':
             self._write_entry_json(archive_entry)
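
The `get_user_data` change above iterates over `self.args.user`, which only works as intended if that value is a list or tuple of names. A Python string is itself iterable, so a bare string would be walked character by character. The sketch below guards against that. `comment_queries` is a hypothetical helper that is not part of the BDFR, and it returns plain tuples rather than PRAW listings.

```python
def comment_queries(users, limit=None):
    """Build one (user, limit) pair per user name.

    Hypothetical helper for illustration only.
    """
    if isinstance(users, str):
        # A bare string would be iterated one character at a time, so wrap it.
        users = [users]
    return [(user, limit) for user in users]


queries = comment_queries(['alice', 'bob'], limit=5)
```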

diff --git a/bdfr/cloner.py b/bdfr/cloner.py
@@ -0,0 +1,21 @@
+#!/usr/bin/env python3
+# coding=utf-8
+
+import logging
+
+from bdfr.archiver import Archiver
+from bdfr.configuration import Configuration
+from bdfr.downloader import RedditDownloader
+
+logger = logging.getLogger(__name__)
+
+
+class RedditCloner(RedditDownloader, Archiver):
+    def __init__(self, args: Configuration):
+        super(RedditCloner, self).__init__(args)
+
+    def download(self):
+        for generator in self.reddit_lists:
+            for submission in generator:
+                self._download_submission(submission)
+                self.write_entry(submission)
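
`RedditCloner` inherits from both `RedditDownloader` and `Archiver`. The sketch below shows how Python resolves such a diamond, assuming both parents derive from a shared connector base. The diff confirms this for `Archiver`, but the `RedditDownloader` change is not shown here. All class and method bodies are simplified stand-ins, not the real BDFR classes.

```python
class Connector:
    def __init__(self, args):
        # Count calls to show the shared base is initialised only once.
        self.init_calls = getattr(self, 'init_calls', 0) + 1
        self.args = args

    def get_user_data(self):
        return ['submissions']


class Downloader(Connector):
    def _download_submission(self, submission):
        return f'downloaded {submission}'


class Archiver(Connector):
    def get_user_data(self):
        # Extends the base result, as the archiver in the diff does.
        return super().get_user_data() + ['comments']

    def write_entry(self, submission):
        return f'archived {submission}'


class Cloner(Downloader, Archiver):
    def download(self, submissions):
        return [(self._download_submission(s), self.write_entry(s)) for s in submissions]


cloner = Cloner(args=None)
mro_names = [cls.__name__ for cls in Cloner.__mro__]
```

In this sketch the resolution order is `Cloner`, `Downloader`, `Archiver`, `Connector`, so the `Archiver` override of `get_user_data` is used as long as `Downloader` does not define its own.
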
diff --git a/bdfr/configuration.py b/bdfr/configuration.py
@@ -13,19 +13,21 @@ def __init__(self):
         self.authenticate = False
         self.config = None
         self.directory: str = '.'
+        self.disable_module: list[str] = []
         self.exclude_id = []
         self.exclude_id_file = []
+        self.file_scheme: str = '{REDDITOR}_{TITLE}_{POSTID}'
+        self.folder_scheme: str = '{SUBREDDIT}'
         self.limit: Optional[int] = None
         self.link: list[str] = []
         self.log: Optional[str] = None
+        self.make_hard_links = False
         self.max_wait_time = None
         self.multireddit: list[str] = []
         self.no_dupes: bool = False
         self.saved: bool = False
         self.search: Optional[str] = None
         self.search_existing: bool = False
-        self.file_scheme: str = '{REDDITOR}_{TITLE}_{POSTID}'
-        self.folder_scheme: str = '{SUBREDDIT}'
         self.skip: list[str] = []
         self.skip_domain: list[str] = []
         self.skip_subreddit: list[str] = []
@@ -35,9 +37,8 @@ def __init__(self):
         self.time: str = 'all'
         self.time_format = None
         self.upvoted: bool = False
-        self.user: Optional[str] = None
+        self.user: list[str] = []
         self.verbose: int = 0
-        self.make_hard_links = False
 
         # Archiver-specific options
         self.format = 'json'