Line-watch is a versatile command-line interface (CLI) utility, developed in Python to emulate and expand upon the core functionalities of tools like grep
. What makes Line-watch unique is its entirely custom-built regular expression engine. This project was an engaging exploration into fundamental computer science concepts, including compiler design principles, lexical analysis, parsing techniques, and finite automata theory, aiming to build a self-contained and efficient solution for text pattern matching.
It provides a practical tool for searching text while also serving as a clear demonstration of how a regex engine can be constructed from the ground up. The engine's core relies on a recursive backtracking algorithm, which effectively simulates the behavior of Non-deterministic Finite Automata (NFAs) to provide flexible and accurate pattern matching.
Line-watch offers a solid set of features, covering many common regular expression syntaxes:
- Extensive Regex Syntax Coverage:
- Literal Character Matching: Exact identification of character sequences.
- Wildcard (
.
): Matches any single character (excluding newline). - Anchors (
^
,$
):^
: Asserts position at the beginning of a line.$
: Asserts position at the end of a line.
- Character Classes:
\d
: Matches any decimal digit (0-9).\w
: Matches any "word" character (alphanumeric[a-z, A-Z, 0-9]
and underscore_
).
- Character Sets (
[]
):[abc]
: Matches any single character from a specified explicit set.[^abc]
: Matches any single character not in the specified set (negated character class).[a-z]
: Supports character ranges for concise definition (e.g.,[a-z]
,[0-9]
).
- Quantifiers: Control over the repetition of the preceding element:
*
: Matches zero or more occurrences.+
: Matches one or more occurrences.?
: Matches zero or one occurrence (optional).{n}
: Matches exactlyn
occurrences.{n,}
: Matchesn
or more occurrences.{n,m}
: Matches betweenn
andm
(inclusive) occurrences.
- Escaped Special Characters: Handles matching of special regex characters as literals (e.g.,
\.
to match a literal dot,\*
to match a literal asterisk). - Robust Command-Line Interface (CLI): Provides a user-friendly interface for performing pattern searches against direct input strings or by processing content from specified files.
- Comprehensive Unit Testing: Supported by a rigorous test suite using
pytest
andunittest
, ensuring reliability and validating the engine's behavior across diverse scenarios, including complex patterns and edge cases.
The project maintains a well-defined and modular directory structure, promoting code readability and maintainability:
line-watch/
βββ .github/
β βββ workflows/
β βββ security.yml # GitHub Actions workflow for automated security analysis and CI/CD
βββ line_watch/
β βββ __init__.py # Python package initialization file
β βββ cli.py # Command-Line Interface (CLI) implementation
β βββ engine.py # Core Regular Expression Engine: contains parsing and matching logic
β βββ utils.py # Utility functions, including character check helpers and regex component parsing
βββ tests/
β βββ __init__.py
β βββ test_engine.py # Comprehensive unit tests for the regex engine and utility functions
βββ .gitignore # Specifies files and directories to be excluded from version control
βββ LICENSE # Project licensing information (MIT License)
βββ pyproject.toml # Project metadata, build configuration, and command-line entry point (PEP 517/621 standard)
βββ README.md # This project documentation file
To set up and run Line-watch on your local machine, follow these instructions:
-
Clone the Repository: Begin by cloning the project repository:
git clone https://github.com/SREEHARI-M-S/line-watch.git cd line-watch
-
Create and Activate a Virtual Environment (Recommended): It's good practice to use a virtual environment for dependency management, ensuring project isolation.
python3 -m venv venv source venv/bin/activate # On Windows: `venv\Scripts\activate`
-
Install Project Package: Install Line-watch as an editable package. This allows you to run the
lw
command directly from your terminal and ensures any changes you make to the source code are immediately reflected.pip install -e .
This command will also install
pytest
, which is used for testing.
The lw
CLI utility provides a streamlined interface for pattern matching.
The general command structure is:
lw <regex_pattern> [-f <filename> | -s <string_to_match>]
<regex_pattern>
: The regular expression pattern you wish to match.-f <filename>
/--file <filename>
: Specify a file path. Line-watch will search for the pattern within each line of this file.-s <string_to_match>
/--string <string_to_match>
: Provide a direct string to match against. This is suitable for single-line pattern checks.- No arguments provided for
-f
or-s
: If neither-f
nor-s
is specified,lw
will read input line by line from standard input (stdin).
-
Matching a Pattern Against a Direct String Input (
-s
):lw "hello" -s "This is a hello world string." # Output: β Matched: 'This is a hello world string.' with pattern 'hello' lw "^start" -s "The line starts here." # Output: β Not Matched: 'The line starts here.' with pattern '^start'
-
Utilizing Character Classes:
lw "id\\d+" -s "User ID12345" # Output: β Matched: 'User ID12345' with pattern 'id\d+' lw "func_\\w+" -s "Calling func_initialize." # Output: β Matched: 'Calling func_initialize.' with pattern 'func_\w+'
-
Employing Character Sets and Ranges:
lw "[aeiou]pple" -s "apple" # Output: β Matched: 'apple' with pattern '[aeiou]pple' lw "[A-Z0-9]{3}" -s "ABCDEF" # Output: β Matched: 'ABCDEF' with pattern '[A-Z0-9]{3}'
-
Demonstrating Quantifiers (
*
,+
,?
,{n,m}
):lw "colou?r" -s "colour" # Output: β Matched: 'colour' with pattern 'colou?r' lw "a{2,4}b" -s "aaab" # Output: β Matched: 'aaab' with pattern 'a{2,4}b' lw "X.*Y" -s "This is XcontentY within text." # Output: β Matched: 'This is XcontentY within text.' with pattern 'X.*Y'
-
Processing Patterns Against a File (
-f
): First, create a sampledata.log
file:echo "INFO: Application started." > data.log echo "DEBUG: Variable x = 10." >> data.log echo "ERROR: File not found in /path/to/data." >> data.log echo "WARNING: Disk space low (20GB left)." >> data.log
Then, execute Line-watch:
lw "^ERROR:" -f data.log # Output: # β Line 3: ERROR: File not found in /path/to/data. lw "\\d+GB" -f data.log # Output: # β Line 4: WARNING: Disk space low (20GB left).
-
Reading Input from Standard Input (stdin): If neither
-f
nor-s
is provided,lw
will read lines from standard input until an End-of-File (EOF) signal is received (Ctrl+D
on Unix/macOS,Ctrl+Z
thenEnter
on Windows).lw "keyword" # Output: # Enter text to match (Ctrl+D or Ctrl+Z to finish input): # This line contains a keyword. # β Line 1: This line contains a keyword. # Another line without the word. # Final keyword test. # β Line 3: Final keyword test. # (Press Ctrl+D or Ctrl+Z then Enter to finish input) # β No matches found. (If no matches were found overall)
A comprehensive suite of unit tests is included to ensure the integrity, accuracy, and reliability of the regex engine's behavior across various scenarios.
To execute the test suite:
- Navigate to the project's root directory in your terminal.
- Ensure
pytest
is installed (it should be if you installed withpip install -e .
). - Run the tests using the
pytest
command:A successful execution will indicate that all tests have passed, confirming the engine's functionality across a wide range of regex constructs and edge cases.pytest
While Line-watch currently offers robust functionality for fundamental regex operations, the domain of regular expressions is expansive. Future enhancements I'm considering include:
- Performance Optimizations:
- NFA to DFA Conversion: Investigating and implementing algorithms to convert the NFA-based matching to a Deterministic Finite Automaton (DFA) for potentially superior performance and more predictable matching times, particularly with complex or pathological regexes.
- Optimized NFA Simulation: Exploring and integrating more advanced NFA simulation techniques that can significantly reduce redundant backtracking and improve overall efficiency.
- Advanced Regex Constructs:
- Capturing Groups (
()
): Extending functionality to capture and extract specific matched substrings for further processing. - Alternation (
|
): Implementing "OR" logic within patterns (e.g.,(cat|dog)
). - Non-greedy Quantifiers (
*?
,+?
,??
,{n,m}?
): Introducing support for matching the shortest possible string segment that satisfies the pattern. - Lookaheads and Lookbehinds: Implementing zero-width assertions that assert conditions about what follows or precedes the current matching position without consuming characters.
- Word Boundaries (
\b
,\B
): For precise matching at word edges. - Whitespace Classes (
\s
,\S
): For matching any whitespace character or non-whitespace character, respectively.
- Capturing Groups (
- Enhanced Error Handling and Reporting: Developing more granular, specific, and user-friendly error messages for syntactically incorrect or malformed regex patterns.
- Formal Parser Implementation: Transitioning to a more structured regex parser, potentially incorporating a dedicated lexer (tokenizer) and an Abstract Syntax Tree (AST) builder, to enhance extensibility, maintainability, and improve error recovery mechanisms.
- Full Unicode Support: Expanding character matching and character class definitions to fully support the vast Unicode character set, enabling global application.
I welcome and value contributions to the Line-watch project! Your insights and expertise can significantly help improve it. To contribute effectively, please adhere to the following guidelines:
- Fork the repository on GitHub.
- Create a new branch for your feature or bug fix. Use descriptive names like
feature/add-alternation-operator
orbugfix/resolve-anchor-issue
.git checkout -b feature/your-feature-name
- Implement your changes, ensuring they align with the project's architectural principles and coding style.
- Write comprehensive unit tests for any new functionality or bug fixes. This is crucial for maintaining the project's stability and reliability.
- Commit your changes with a clear, concise, and descriptive commit message. Follow conventional commit guidelines where applicable (e.g.,
feat: Add new feature X
,fix: Resolve issue with nested quantifiers
). - Push your branch to your forked repository:
git push origin feature/your-feature-name
- Open a Pull Request (PR) against the
main
branch of the original repository. Provide a detailed description of your changes, their rationale, and any relevant test results.
Before submitting a Pull Request, please ensure your code adheres to Python's best practices and successfully passes all existing tests.
This project is open-source and distributed under the terms of the MIT License. A copy of the license is available in the LICENSE file within the repository.