feat: enhance parser domain-agnostic support #117

Open: wants to merge 6 commits into base: main


filipchristiansen
Collaborator

This PR introduces improvements and refactoring across multiple modules. The key changes include making the URL parser domain-agnostic, refactoring HTTP response handling, renaming functions for clarity, converting core functions to asynchronous operations, and standardizing terminology and documentation.

Highlights:

  1. Domain-Agnostic Parsing:

    • Updated query_parser.py to support multiple Git hosts by maintaining a list of known domains.
    • Resolved a bug from #115 concerning case sensitivity of URL components.
    • Implemented try_domains_for_user_and_repo to iteratively guess the correct domain for a given user/repo.
    • Added helper functions (_get_user_and_repo_from_path, _validate_host, _validate_scheme) to facilitate robust parsing.
    • Extended _parse_repo_source to leverage the new domain-agnostic logic.
    • Added comprehensive tests in test_query_parser.py and a new test file test_git_host_agnostic.py to verify these changes.
  2. Enhanced Repository Existence Check:

    • Introduced _get_status_code in repository_clone.py to extract HTTP response codes cleanly.
    • Adjusted _check_repo_exists to utilize _get_status_code, refining its logic:
      • Returns True for status codes 200 and 301.
      • Returns False for status codes 302 and 404.
    • Updated and added tests in test_repository_clone.py to cover redirect scenarios and ensure correctness.
  3. Function Renaming and Documentation:

    • Renamed _parse_url to _parse_repo_source in query_parser.py for clarity.
    • Standardized docstrings across modules to adhere to PEP 257, using the imperative mood for consistency.
  4. Asynchronous Conversions:

    • Converted key functions (parse_query in query_processor.py, main in cli.py, and ingest in repository_ingest.py) to asynchronous functions to support domain-agnostic parsing.
    • Updated associated tests in test_query_parser.py to support async execution.
  5. Terminology and Documentation Standardization:

    • Standardized capitalization for terms like 'Git', 'GitHub', and 'URL'.
    • Cleaned up the README.md, fixed trailing slashes in links, and ensured punctuation consistency.
    • Renamed templates and variables to replace GitHub-specific references with generic Git terminology, enhancing broad usability.
    • Renamed template files (github.jinja → git.jinja, github_form.jinja → git_form.jinja) and variables (github_url → repo_url) accordingly.
  6. Test Organization:

    • Moved test_query_parser.py to a more structured location under tests/query_parser/ for better organization.

These changes collectively improve the flexibility of the parser for multiple Git hosts and enhance code clarity and consistency.
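
To illustrate the iterative domain guessing described in highlight 1, here is a minimal sketch. It is not the code from this PR: the function name `guess_host_for_repo`, the host list, and the injected `repo_exists` callback are all assumptions made for the example, and the real `try_domains_for_user_and_repo` may have a different signature and error behavior.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical host list; the PR's actual list of known domains is not shown here.
KNOWN_GIT_HOSTS = ["github.com", "gitlab.com", "bitbucket.org"]


async def guess_host_for_repo(
    user: str,
    repo: str,
    repo_exists: Callable[[str], Awaitable[bool]],
) -> str:
    """Return the first known host on which user/repo exists."""
    for host in KNOWN_GIT_HOSTS:
        url = f"https://{host}/{user}/{repo}"
        # The existence check is awaited, which is why callers must be async too.
        if await repo_exists(url):
            return host
    # Assumption: raise once every supported host has been tried without success.
    raise ValueError(f"Could not find {user}/{repo} on any known Git host")


async def fake_repo_exists(url: str) -> bool:
    """Stand-in for a real network check; pretends only one URL exists."""
    return url == "https://gitlab.com/alice/project"


found_host = asyncio.run(guess_host_for_repo("alice", "project", fake_repo_exists))
```

Passing the existence check in as a callback keeps the sketch runnable without network access. It also shows why the callers listed in highlight 4 had to become asynchronous: each guess awaits an I/O-bound check.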

- Standardized capitalization of 'Git', 'GitHub', and 'URL'
- Removed trailing slashes in links and added missing sentence periods in `README.md`
- Adjusted docstrings to adhere to PEP 257 by using the imperative mood
- Standardized docstrings in `exceptions.py`
- Replaced 'GitHub' with 'Git' when referring to broader context
- Renamed templates: `github.jinja` → `git.jinja`, `github_form.jinja` → `git_form.jinja`
- Renamed variables: `github_url` → `repo_url`
- Made `parse_query` in query_processor.py asynchronous
- Made `main` in cli.py asynchronous
- Made `ingest` in repository_ingest.py asynchronous
- Updated test functions in test_query_parser.py to support async
- Renamed `_parse_url` to `_parse_repo_source` in query_parser.py
- Adjusted docstrings to adhere to PEP 257 by using the imperative mood
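
As a rough illustration of the async conversion and the matching test update, the sketch below uses a hypothetical coroutine, `parse_query_sketch`, driven from a plain test function with `asyncio.run`. The function body and return shape are assumptions for the example and do not reflect the real `parse_query`. The project's tests may use a pytest plugin instead.

```python
import asyncio


async def parse_query_sketch(source: str) -> dict:
    """Hypothetical async stand-in for a query parser."""
    # Placeholder for awaited I/O, such as checking which host serves a repo.
    await asyncio.sleep(0)
    return {"source": source.strip()}


def test_parse_query_sketch() -> None:
    # asyncio.run drives the coroutine to completion from synchronous test code.
    result = asyncio.run(parse_query_sketch(" https://github.com/user/repo "))
    assert result == {"source": "https://github.com/user/repo"}


test_parse_query_sketch()
```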
…update tests

- Implemented function `_get_status_code` in repository_clone.py to extract the status code from an HTTP response
- Adjusted `_check_repo_exists` in repository_clone.py to utilize the new `_get_status_code` function
- Modified `_check_repo_exists` to return True for status codes 200 and 301, and False for 404 and 302
- Updated `test_check_repo_exists_with_redirect` in test_repository_clone.py to verify that `_check_repo_exists` returns False for status code 302
- Implemented test `test_check_repo_exists_with_permanent_redirect` in test_repository_clone.py to verify that `_check_repo_exists` returns True for status code 301
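
The status code rules above can be sketched as follows. This is a hedged illustration, not the code in repository_clone.py: the names `get_status_code` and `repo_exists_from_status` are hypothetical, the input is assumed to be raw HTTP response text beginning with a status line, and raising on any other code is an assumption, since the PR only specifies 200, 301, 302, and 404.

```python
def get_status_code(response: str) -> int:
    """Extract the status code from a raw HTTP response such as 'HTTP/1.1 200 OK'."""
    parts = response.split(maxsplit=2)  # ['HTTP/1.1', '200', 'OK ...']
    if len(parts) < 2 or not parts[1].isdigit():
        raise ValueError("Response does not start with an HTTP status line")
    return int(parts[1])


def repo_exists_from_status(status_code: int) -> bool:
    """Map a status code to an existence verdict using the rules in this PR."""
    if status_code in (200, 301):
        return True
    if status_code in (302, 404):
        return False
    # Assumption: any other code is treated as an error rather than guessed at.
    raise RuntimeError(f"Unexpected status code: {status_code}")


permanent = "HTTP/1.1 301 Moved Permanently\r\nLocation: https://example.com/\r\n"
exists = repo_exists_from_status(get_status_code(permanent))
```

Splitting extraction from interpretation means each redirect case can be tested without building a full response.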
- Added list of known domains/Git hosts in `query_parser.py`
- Fixed bug from [#115](#115): corrected case handling for URL components. Scheme, domain, username, and repository are case-insensitive, but path components beyond them (e.g., file names, branches) are case-sensitive
- Implemented `try_domains_for_user_and_repo` in `query_parser.py` to iteratively guess the correct domain until success or supported hosts are exhausted
- Added helper functions `_get_user_and_repo_from_path`, `_validate_host`, and `_validate_scheme` in `query_parser.py`
- Extended `_parse_repo_source` in `query_parser.py` to be Git host agnostic by using `try_domains_for_user_and_repo`
- Added tests `test_parse_url_unsupported_host` and `test_parse_query_with_branch` in `test_query_parser.py`
- Created new file `test_git_host_agnostic.py` to verify domain/Git host agnostic behavior
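
The case-handling rule from the #115 fix can be sketched with the standard library. The helper name `normalize_repo_url` is hypothetical and the sketch assumes a full URL with at least a user and a repository segment; it shows the stated rule only, not the actual logic in `query_parser.py`.

```python
from urllib.parse import urlparse


def normalize_repo_url(url: str) -> str:
    """Lowercase scheme, host, user, and repo; keep the remaining path as-is."""
    parsed = urlparse(url)
    segments = [segment for segment in parsed.path.split("/") if segment]
    if len(segments) < 2:
        raise ValueError("Expected a path of the form /<user>/<repo>[/...]")
    user, repo, *rest = segments
    # Branch names and file paths are case-sensitive, so 'rest' is left untouched.
    path = "/".join([user.lower(), repo.lower(), *rest])
    return f"{parsed.scheme.lower()}://{parsed.netloc.lower()}/{path}"


normalized = normalize_repo_url("HTTPS://GitHub.com/Alice/Repo/blob/Main/README.md")
# → "https://github.com/alice/repo/blob/Main/README.md"
```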