feat: enhance parser domain-agnostic support #117

filipchristiansen · 2025-01-10T13:05:05Z

This PR introduces improvements and refactoring across multiple modules. The key changes include making the URL parser domain-agnostic, refactoring HTTP response handling, renaming functions for clarity, converting core functions to asynchronous operations, and standardizing terminology and documentation.

Highlights:

Domain-Agnostic Parsing:
- Updated query_parser.py to support multiple Git hosts by maintaining a list of known domains.
- Resolved a bug from #115 concerning case sensitivity of URL components.
- Implemented try_domains_for_user_and_repo to iteratively guess the correct domain for a given user/repo.
- Added helper functions (_get_user_and_repo_from_path, _validate_host, _validate_scheme) to facilitate robust parsing.
- Extended _parse_repo_source to leverage the new domain-agnostic logic.
- Added comprehensive tests in test_query_parser.py and a new test file test_git_host_agnostic.py to verify these changes.
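
The domain-guessing step above can be illustrated with a minimal sketch. It is not the code from this PR: the host list is hypothetical, and the `repo_exists` callback is an assumption added so the loop can be exercised without network access.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# Hypothetical host list; the PR does not enumerate the known domains here.
KNOWN_GIT_HOSTS: Sequence[str] = ("github.com", "gitlab.com", "bitbucket.org")


async def try_domains_for_user_and_repo(
    user: str,
    repo: str,
    repo_exists: Callable[[str], Awaitable[bool]],  # assumed injectable checker
    hosts: Sequence[str] = KNOWN_GIT_HOSTS,
) -> str:
    """Return the first host on which `user/repo` exists."""
    for host in hosts:
        candidate = f"https://{host}/{user}/{repo}"
        if await repo_exists(candidate):
            return host
    # All supported hosts were tried without success.
    raise ValueError(f"No known Git host serves '{user}/{repo}'.")
```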

Enhanced Repository Existence Check:
- Introduced _get_status_code in repository_clone.py to extract HTTP response codes cleanly.
- Adjusted _check_repo_exists to utilize _get_status_code, refining its logic:
  - Returns True for status codes 200 and 301.
  - Returns False for status codes 302 and 404.
- Updated and added tests in test_repository_clone.py to cover redirect scenarios and ensure correctness.

Function Renaming and Documentation:
- Renamed _parse_url to _parse_repo_source in query_parser.py for clarity.
- Standardized docstrings across modules to adhere to PEP 257, using the imperative mood for consistency.

Asynchronous Conversions:
- Made key functions asynchronous (parse_query in query_processor.py, main in cli.py, and ingest in repository_ingest.py) to support domain-agnostic parsing.
- Updated associated tests in test_query_parser.py to support async execution.
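
As a rough illustration of what the conversion means for callers, the sketch below chains stand-in coroutines and drives them from synchronous code with `asyncio.run`. The function bodies are placeholders, not the PR's implementation, and `run_cli` is a hypothetical name.

```python
import asyncio


async def parse_query(source: str) -> dict[str, str]:
    """Stand-in parser; the await marks where a host lookup could happen."""
    await asyncio.sleep(0)  # placeholder for awaited network I/O
    user, repo = source.split("/", 1)
    return {"user_name": user, "repo_name": repo}


async def ingest(source: str) -> dict[str, str]:
    """Stand-in for ingestion: callers of a coroutine must await it."""
    return await parse_query(source)


def run_cli(source: str) -> dict[str, str]:
    """Hypothetical synchronous entry point that drives the coroutine chain."""
    return asyncio.run(ingest(source))
```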

Terminology and Documentation Standardization:
- Standardized capitalization for terms like 'Git', 'GitHub', and 'URL'.
- Cleaned up the README.md, fixed trailing slashes in links, and ensured punctuation consistency.
- Replaced GitHub-specific references with generic Git terminology for broader usability: renamed template files (github.jinja → git.jinja, github_form.jinja → git_form.jinja) and variables (github_url → repo_url).

Test Organization:
- Moved test_query_parser.py to a more structured location under tests/query_parser/ for better organization.

These changes collectively improve the flexibility of the parser for multiple Git hosts and enhance code clarity and consistency.

- Standardized capitalization of 'Git', 'GitHub', and 'URL'
- Removed trailing slashes in links and added missing sentence periods in `README.md`
- Adjusted docstrings to adhere to PEP 257 by using imperative tense
- Standardized docstrings in `exceptions.py`
- Replaced 'GitHub' with 'Git' when referring to broader context
- Renamed templates: `github.jinja` → `git.jinja`, `github_form.jinja` → `git_form.jinja`
- Renamed variables: `github_url` → `repo_url`

- Made `parse_query` in query_processor.py asynchronous
- Made `main` in cli.py asynchronous
- Made `ingest` in repository_ingest.py asynchronous
- Updated test functions in test_query_parser.py to support async

- Renamed `_parse_url` to `_parse_repo_source` in query_parser.py
- Adjusted docstrings to adhere to PEP 257 by using imperative tense

…update tests

- Implemented function `_get_status_code` in repository_clone.py to extract the status code from an HTTP response
- Adjusted `_check_repo_exists` in repository_clone.py to utilize the new `_get_status_code` function
- Modified `_check_repo_exists` to return True for status codes 200 and 301, and False for 404 and 302
- Updated `test_check_repo_exists_with_redirect` in test_repository_clone.py to verify that `_check_repo_exists` returns False for status code 302
- Implemented test `test_check_repo_exists_with_permanent_redirect` in test_repository_clone.py to verify that `_check_repo_exists` returns True for status code 301

- Added list of known domains/Git hosts in `query_parser.py`
- Fixed bug from #115: corrected case handling for URL components. Scheme, domain, username, and repository are case-insensitive, but paths beyond (e.g., file names, branches) are case-sensitive
- Implemented `try_domains_for_user_and_repo` in `query_parser.py` to iteratively guess the correct domain until success or supported hosts are exhausted
- Added helper functions `_get_user_and_repo_from_path`, `_validate_host`, and `_validate_scheme` in `query_parser.py`
- Extended `_parse_repo_source` in `query_parser.py` to be Git host agnostic by using `try_domains_for_user_and_repo`
- Added tests `test_parse_url_unsupported_host` and `test_parse_query_with_branch` in `test_query_parser.py`
- Created new file `test_git_host_agnostic.py` to verify domain/Git host agnostic behavior
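
The case-handling rule from the #115 fix can be sketched as follows. This is an illustration only, not the code in `query_parser.py`: the function name `normalize_repo_url` is hypothetical, and lowercasing is just one way to treat the case-insensitive components.

```python
from urllib.parse import urlparse


def normalize_repo_url(url: str) -> str:
    """Lowercase scheme, host, user, and repo; keep the remaining path as-is."""
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    if len(parts) < 2 or not all(parts[:2]):
        raise ValueError(f"Expected a '/user/repo' path, got: {parsed.path!r}")

    user, repo = parts[0].lower(), parts[1].lower()
    rest = parts[2:]  # branches, file names, etc. stay case-sensitive
    base = f"{parsed.scheme.lower()}://{parsed.netloc.lower()}"
    return "/".join([base, user, repo, *rest])
```

For example, `normalize_repo_url("HTTPS://GitHub.com/User/Repo/blob/Main/README.md")` returns `https://github.com/user/repo/blob/Main/README.md`, leaving the branch and file name untouched.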

filipchristiansen requested a review from cyclotruc January 10, 2025 13:05

filipchristiansen added 6 commits January 10, 2025 14:08

refactor: convert key functions and tests to asynchronous

95b5e27

- Made `parse_query` in query_processor.py asynchronous
- Made `main` in cli.py asynchronous
- Made `ingest` in repository_ingest.py asynchronous
- Updated test functions in test_query_parser.py to support async

refactor: rename _parse_url and standardize docstrings

9a19c92

- Renamed `_parse_url` to `_parse_repo_source` in query_parser.py
- Adjusted docstrings to adhere to PEP 257 by using imperative tense

chore: move test_query_parser.py from tests/ to tests/query_parser/

cd1b14e

filipchristiansen force-pushed the feature/extend-to-multiple-git-hosts branch from 48c1695 to cd1b14e Compare January 10, 2025 13:11

This was referenced Jan 10, 2025

Feature request: Allow gitlab and codeberg, and gitea and so on. #11

Closed

Should We Expand Gitingest Support for Additional Git Hosts? #118

Open
