Merge pull request #361 from aliparlakci/development
Release v2.1.0
aliparlakci authored May 11, 2021
2 parents 2017994 + ff36aeb commit 6c086e7
Showing 43 changed files with 634 additions and 275 deletions.
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/bug_report.md
@@ -9,7 +9,7 @@ assignees: ''

- [ ] I am reporting a bug.
- [ ] I am running the latest version of BDfR
- [ ] I have read the [Opening an issue](README.md#configuration)
- [ ] I have read the [Opening an issue](../../README.md#configuration)

## Description
A clear and concise description of what the bug is.
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/feature_request.md
@@ -9,7 +9,7 @@ assignees: ''

- [ ] I am requesting a feature.
- [ ] I am running the latest version of BDfR
- [ ] I have read the [Opening an issue](README.md#configuration)
- [ ] I have read the [Opening an issue](../../README.md#configuration)

## Description
Clearly state the current situation and issues you experience. Then, explain how this feature would solve these issues and make life easier. Also, explain the feature in as much detail as possible.
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/site-support-request.md
@@ -9,7 +9,7 @@ assignees: ''

- [ ] I am requesting a site support.
- [ ] I am running the latest version of BDfR
- [ ] I have read the [Opening an issue](README.md#configuration)
- [ ] I have read the [Opening an issue](../../README.md#configuration)

## Site
Provide a URL to domain of the site.
26 changes: 14 additions & 12 deletions .github/workflows/test.yml
@@ -8,15 +8,17 @@ on:

jobs:
test:

runs-on: ubuntu-latest

runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [ubuntu-latest, macos-latest]
python-version: [3.9]

ext: [.sh]
include:
- os: windows-latest
python-version: 3.9
ext: .ps1
steps:

- uses: actions/checkout@v2
- name: Setup Python
uses: actions/setup-python@v2
@@ -26,19 +28,19 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip flake8 pytest pytest-cov
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
pip install -r requirements.txt
- name: Setup test configuration
- name: Make configuration for tests
env:
REDDIT_TOKEN: ${{ secrets.REDDIT_TEST_TOKEN }}
run: |
cp bdfr/default_config.cfg ./test_config.cfg
echo -e "\nuser_token = ${{ secrets.REDDIT_TEST_TOKEN }}" >> ./test_config.cfg
./devscripts/configure${{ matrix.ext }}
- name: Lint w/ flake8
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
- name: Test w/ PyTest
- name: Test with pytest
run: |
pytest -m 'not slow' --verbose --cov=./bdfr/ --cov-report term:skip-covered --cov-report html
45 changes: 42 additions & 3 deletions README.md
@@ -12,6 +12,11 @@ If you wish to open an issue, please read [the guide on opening issues](docs/CON
python3 -m pip install bdfr
```

If on Arch Linux or derivative operating systems such as Manjaro, the BDFR can be installed through the AUR.

- Latest Release: https://aur.archlinux.org/packages/python-bdfr/
- Latest Development Build: https://aur.archlinux.org/packages/python-bdfr-git/

If you want to use the source code or make contributions, refer to [CONTRIBUTING](docs/CONTRIBUTING.md#preparing-the-environment-for-development)

## Usage
@@ -55,6 +60,9 @@ The following options are common between both the `archive` and `download` comma
- `--config`
- If the path to a configuration file is supplied with this option, the BDFR will use the specified config
- See [Configuration Files](#configuration) for more details
- `--log`
- This allows one to specify the location of the logfile
- This must be done when running multiple instances of the BDFR, see [Multiple Instances](#multiple-instances) below
- `--saved`
- This option will make the BDFR use the supplied user's saved posts list as a download source
- This requires an authenticated Reddit instance, using the `--authenticate` flag, as well as `--user` set to `me`
@@ -106,6 +114,9 @@ The following options are common between both the `archive` and `download` comma
- `week`
- `month`
- `year`
- `--time-format`
- This specifies the format of the datetime string that replaces `{DATE}` in file and folder naming schemes
- See [Time Formatting Customisation](#time-formatting-customisation) for more details, and the formatting scheme
- `-u, --user`
- This specifies the user to scrape in concert with other options
- When using `--authenticate`, `--user me` can be used to refer to the authenticated user
@@ -208,30 +219,47 @@ It is highly recommended that the file name scheme contain the parameter `{POSTI
## Configuration

The configuration files are, by default, stored in the configuration directory for the user. This differs depending on the OS that the BDFR is being run on. For Windows, this will be:

- `C:\Users\<User>\AppData\Local\BDFR\bdfr`

If Python has been installed through the Windows Store, the folder will appear in a different place. Note that the hash included in the file path may change from installation to installation.

- `C:\Users\<User>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\Local\BDFR\bdfr`

On Mac OSX, this will be:

- `~/Library/Application Support/bdfr`.

Lastly, on a Linux system, this will be:
- `~/.local/share/bdfr`

- `~/.config/bdfr/`

The logging output for each run of the BDFR will be saved to this directory in the file `log_output.txt`. If you need to submit a bug, it is this file that you will need to submit with the report.

### Configuration File

The `config.cfg` is the file that supplies the BDFR with the configuration to use. At the moment, the following keys **must** be included in the configuration file supplied.

- `backup_log_count`
- `max_wait_time`
- `client_id`
- `client_secret`
- `scopes`

The following keys are optional, and defaults will be used if they cannot be found.

- `backup_log_count`
- `max_wait_time`
- `time_format`

All of these should not be modified unless you know what you're doing, as the default values will enable the BDFR to function just fine. A configuration is included in the BDFR when it is installed, and this will be placed in the configuration directory as the default.

Most of these values have to do with OAuth2 configuration and authorisation. The key `backup_log_count` however has to do with the log rollover. The logs in the configuration directory can be verbose and for long runs of the BDFR, can grow quite large. To combat this, the BDFR will overwrite previous logs. This value determines how many previous run logs will be kept. The default is 3, which means that the BDFR will keep at most three past logs plus the current one. Any runs past this will overwrite the oldest log file, called "rolling over". If you want more records of past runs, increase this number.
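The rollover described above can be sketched with the standard library's `RotatingFileHandler`. This is a minimal illustration, not necessarily how the BDFR implements it: the helper names `open_run_log` and `simulate_run` are hypothetical, and rotating once at the start of each run is an assumption drawn from the description of "previous run logs".

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def open_run_log(log_path: Path, backup_count: int) -> RotatingFileHandler:
    """Open the log for a new run, rotating the previous run's log first."""
    had_previous_run = log_path.exists()
    # maxBytes stays at its default of 0, so the handler never rotates on size.
    handler = RotatingFileHandler(log_path, mode='a', backupCount=backup_count)
    if had_previous_run:
        # log_output.txt -> log_output.txt.1, .1 -> .2, and so on.
        # The oldest backup beyond backup_count is deleted.
        handler.doRollover()
    return handler


def simulate_run(log_path: Path, backup_count: int, message: str) -> None:
    """Hypothetical stand-in for one run of the program: write one log line."""
    handler = open_run_log(log_path, backup_count)
    logger = logging.getLogger('rollover_sketch')
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    try:
        logger.info(message)
    finally:
        logger.removeHandler(handler)
        handler.close()
```

With `backup_count` set to 3, a fifth run leaves four files on disk: the current log plus three numbered backups, and the log from the first run is gone.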

#### Time Formatting Customisation

The option `time_format` specifies the format of the timestamp that replaces `{DATE}` in filename and folder name schemes. By default, this is the [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) format, which is highly recommended due to its standardised nature. If you don't **need** to change it, it is recommended that you do not. However, you can set it to any format required with this option. The `--time-format` option supersedes any specification in the configuration file.

The format can be specified through the [format codes](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior) that are standard in the Python `datetime` library.
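As a rough illustration of what these format codes produce, the sketch below formats one timestamp both ways. The function `format_date` is hypothetical, and treating the literal value `ISO` (the default in `default_config.cfg`) as a switch to `datetime.isoformat()` is an assumption about how the setting is interpreted.

```python
from datetime import datetime


def format_date(created: datetime, time_format: str = 'ISO') -> str:
    """Render a timestamp for use in place of {DATE} (illustrative only)."""
    if time_format == 'ISO':
        # Assumed meaning of the default value: ISO 8601 output.
        return created.isoformat()
    # Anything else is treated as a strftime format string.
    return created.strftime(time_format)


if __name__ == '__main__':
    created = datetime(2021, 5, 11, 14, 30, 0)
    print(format_date(created))              # 2021-05-11T14:30:00
    print(format_date(created, '%Y-%m-%d'))  # 2021-05-11
```

Formats that contain characters such as `:` or `/` may not be valid in file names on every operating system, which is one practical reason to keep custom formats simple.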

### Rate Limiting

The option `max_wait_time` has to do with retrying downloads. There are certain HTTP errors that mean that no amount of requests will return the wanted data, but some errors are from rate-limiting. This is when a single client is making so many requests that the remote website cuts the client off to preserve the function of the site. This is a common situation when downloading many resources from the same site. It is polite and best practice to obey the website's wishes in these cases.
@@ -240,6 +268,16 @@ To this end, the BDFR will sleep for a time before retrying the download, giving

The option `--max-wait-time` and the configuration option `max_wait_time` both specify the maximum time the BDFR will wait. If both are present, the command-line option takes precedence. For instance, the default is 120, so the BDFR will wait for 60 seconds, then 120 seconds, and then move on. **Note that this results in a total time of 180 seconds trying the same download**. If you wish to try to bypass the rate-limiting system on the remote site, increasing the maximum wait time may help. However, note that the actual wait times increase exponentially if the resource is not downloaded, i.e. specifying a max value of 300 (5 minutes) can make the BDFR pause for 15 minutes on one submission, not 5, in the worst case.
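The worst-case totals can be checked with a few lines of arithmetic. Both figures quoted above (180 seconds for a maximum of 120, and 15 minutes for a maximum of 300) are reproduced if each wait is 60 seconds longer than the previous one. That 60-second step is inferred from those figures rather than stated in the text, and `wait_schedule` is a hypothetical helper, not part of the BDFR.

```python
from typing import List


def wait_schedule(max_wait_time: int, step: int = 60) -> List[int]:
    """List the successive waits, in seconds, up to and including the maximum.

    Assumes each wait is `step` seconds longer than the last (an inference).
    """
    waits = []
    current = step
    while current <= max_wait_time:
        waits.append(current)
        current += step
    return waits


if __name__ == '__main__':
    for maximum in (120, 300):
        waits = wait_schedule(maximum)
        print(maximum, waits, sum(waits))
    # 120 [60, 120] 180
    # 300 [60, 120, 180, 240, 300] 900
```

Under this assumption the total time grows roughly with the square of the maximum, so doubling `max_wait_time` can more than triple the worst-case pause on a single submission.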

## Multiple Instances

The BDFR can be run in multiple instances with multiple configurations, either concurrently or consecutively. Scripting files make this easiest, either Powershell on Windows operating systems or Bash elsewhere. This allows multiple scenarios to be run with data being scraped from different sources, as any two sets of scenarios might be mutually exclusive, i.e. it is not possible to download any combination of data from a single run of the BDFR. To download from multiple users, for example, multiple runs of the BDFR are required.

Running these scenarios consecutively is done easily, like any single run. Configuration files that differ may be specified with the `--config` option to switch between tokens, for example. Otherwise, almost all configuration for data sources can be specified per-run through the command line.

Running scenarios concurrently (at the same time), however, is more complicated. The BDFR will look to a single, static place to put the detailed log files, in a directory with the configuration file specified above. If there are multiple instances, or processes, of the BDFR running at the same time, they will all be trying to write to a single file. On Linux and other UNIX-based operating systems, this will succeed, though there is a substantial risk that the logfile will be useless due to garbled and jumbled data. On Windows, however, attempting this will raise an error that crashes the program, as Windows forbids multiple processes from accessing the same file.

The way to fix this is to use the `--log` option to manually specify where the logfile is to be stored. If the given location is unique to each instance of the BDFR, then it will run fine.
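A concurrent launch with one logfile per instance might look like the sketch below, written in Python rather than Powershell or Bash so that it runs on any platform. The helpers `build_command` and `run_concurrently` are hypothetical. The `download` command and the `--user`, `--submitted` and `--log` options appear in this commit; passing the output directory as a positional argument is an assumption about the command line.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_command(user: str, output_dir: str, log_dir: str) -> List[str]:
    """Build the argument list for one instance, with its own logfile."""
    log_file = Path(log_dir) / f'bdfr_{user}.log'
    return [
        sys.executable, '-m', 'bdfr', 'download',
        str(output_dir),           # assumed positional output directory
        '--user', user,
        '--submitted',
        '--log', str(log_file),    # unique per instance, so no file clash
    ]


def run_concurrently(users: List[str], output_dir: str, log_dir: str) -> List[int]:
    """Start one process per user and return their exit codes."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    processes = [subprocess.Popen(build_command(user, output_dir, log_dir))
                 for user in users]
    return [process.wait() for process in processes]
```

Deriving the logfile name from whatever distinguishes the instances (here, the user name) is a simple way to guarantee that no two processes share a path.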

## List of currently supported sources

- Direct links (links leading to a file)
@@ -252,6 +290,7 @@ The option `--max-wait-time` and the configuration option `max_wait_time` both s
- Reddit Videos
- Redgifs
- YouTube
- Streamable

## Contributing

4 changes: 3 additions & 1 deletion bdfr/__main__.py
@@ -20,10 +20,12 @@
click.option('-m', '--multireddit', multiple=True, default=None, type=str),
click.option('-L', '--limit', default=None, type=int),
click.option('--authenticate', is_flag=True, default=None),
click.option('--log', type=str, default=None),
click.option('--submitted', is_flag=True, default=None),
click.option('--upvoted', is_flag=True, default=None),
click.option('--saved', is_flag=True, default=None),
click.option('--search', default=None, type=str),
click.option('--time-format', type=str, default=None),
click.option('-u', '--user', type=str, default=None),
click.option('-t', '--time', type=click.Choice(('all', 'hour', 'day', 'week', 'month', 'year')), default=None),
click.option('-S', '--sort', type=click.Choice(('hot', 'top', 'new',
@@ -73,7 +75,7 @@ def cli_download(context: click.Context, **_):
@cli.command('archive')
@_add_common_options
@click.option('--all-comments', is_flag=True, default=None)
@click.option('-f,', '--format', type=click.Choice(('xml', 'json', 'yaml')), default=None)
@click.option('-f', '--format', type=click.Choice(('xml', 'json', 'yaml')), default=None)
@click.pass_context
def cli_archive(context: click.Context, **_):
config = Configuration()
2 changes: 1 addition & 1 deletion bdfr/archiver.py
@@ -89,7 +89,7 @@ def _write_entry_yaml(self, entry: BaseArchiveEntry):
def _write_content_to_disk(self, resource: Resource, content: str):
file_path = self.file_name_formatter.format_path(resource, self.download_directory)
file_path.parent.mkdir(exist_ok=True, parents=True)
with open(file_path, 'w') as file:
with open(file_path, 'w', encoding="utf-8") as file:
logger.debug(
f'Writing entry {resource.source_submission.id} to file in {resource.extension[1:].upper()}'
f' format at {file_path}')
2 changes: 2 additions & 0 deletions bdfr/configuration.py
@@ -17,6 +17,7 @@ def __init__(self):
self.exclude_id_file = []
self.limit: Optional[int] = None
self.link: list[str] = []
self.log: Optional[str] = None
self.max_wait_time = None
self.multireddit: list[str] = []
self.no_dupes: bool = False
@@ -32,6 +33,7 @@ def __init__(self):
self.submitted: bool = False
self.subreddit: list[str] = []
self.time: str = 'all'
self.time_format = None
self.upvoted: bool = False
self.user: Optional[str] = None
self.verbose: int = 0
3 changes: 2 additions & 1 deletion bdfr/default_config.cfg
@@ -3,4 +3,5 @@ client_id = U-6gk4ZCh3IeNQ
client_secret = 7CZHY6AmKweZME5s50SfDGylaPg
scopes = identity, history, read, save
backup_log_count = 3
max_wait_time = 120
max_wait_time = 120
time_format = ISO
15 changes: 12 additions & 3 deletions bdfr/download_filter.py
@@ -4,6 +4,8 @@
import logging
import re

from bdfr.resource import Resource

logger = logging.getLogger(__name__)


@@ -21,13 +23,20 @@ def check_url(self, url: str) -> bool:
else:
return True

def _check_extension(self, url: str) -> bool:
def check_resource(self, res: Resource) -> bool:
if not self._check_extension(res.extension):
return False
elif not self._check_domain(res.url):
return False
return True

def _check_extension(self, resource_extension: str) -> bool:
if not self.excluded_extensions:
return True
combined_extensions = '|'.join(self.excluded_extensions)
pattern = re.compile(r'.*({})$'.format(combined_extensions))
if re.match(pattern, url):
logger.log(9, f'Url "{url}" matched with "{str(pattern)}"')
if re.match(pattern, resource_extension):
logger.log(9, f'Url "{resource_extension}" matched with "{str(pattern)}"')
return False
else:
return True