This repo contains a Python script (match_files.py) that recurses a given directory and returns a list of files whose names match a regex.
Use either

- `python match_files.py -p <dir_loc> -r <regex> -s <max_size>` from the console, or
- call `match_files.find_regex(dir_loc, regex, max_size)` from your code

to find the files in `dir_loc` whose names match the `regex` pattern or whose size is greater than `max_size`. `match_files` uses Python regular expressions from the `re` package; more information on `re` is available in the Python documentation.
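Based on the documented interface, the core of `find_regex` might look like the following sketch (hypothetical; the real match_files.py may differ in details such as error handling):

```python
import os
import re

def find_regex(dir_loc, regex, max_size):
    """Recurse dir_loc and return paths whose names match regex
    or whose size in bytes exceeds max_size."""
    pattern = re.compile(regex)
    matches = []
    for root, _dirs, files in os.walk(dir_loc):
        for name in files:
            path = os.path.join(root, name)
            try:
                too_large = os.path.getsize(path) > max_size
            except OSError:
                too_large = False  # file vanished or is unreadable
            if pattern.search(name) or too_large:
                matches.append(path)
    return matches
```

For example, `find_regex("/var/log", r"\.log$", 10**6)` would collect every `.log` file plus any file larger than 1 MB.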
Test cases are in the test folder. To run them, use `python run_test.py`. The following regular expressions are tested:
- sample string: to test that the function can query a non-special string
- empty character: to check that the function works with an empty string; the output should be empty
- `_` character: to test that the function works with special characters
- space character: to test that the function works with spaces
- a `max_size` test case is also included for the size threshold
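These edge cases hinge on how `re` treats the patterns involved; the behavior can be confirmed directly (the comments on the empty-string case are an inference about why the output is expected to be empty):

```python
import re

# A plain, non-special string matches as a literal substring.
assert re.search("report", "report_2020.txt")

# An empty pattern matches at every position of every string,
# so an empty query would otherwise match every filename; the
# documented expectation of an empty result suggests the
# function likely special-cases the empty string.
assert re.search("", "anything")

# "_" has no special meaning in a regex and matches literally,
# and a space in a pattern matches a literal space.
assert re.search("_", "my_file.txt")
assert re.search(" ", "my file.txt")
```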
Profiling gives us some insight into the runtime of each line of the code. Profiling was done using line_profiler for two cases:
- a large number of files on a machine, using `/` as the root path and `.` as the `regex`, and
- a large number of files on a machine, using `/` as the root path and a small number (100 bytes) as the `max_size`.
The outputs of profiling are in the profiler folder: regex_prfile.txt and max_size_profile.txt. Looking at these files, we can see that the main loop on line 21, which calls os.walk, accounts for the highest share of time per hit for both queries. The reason is that os.walk is known to be significantly slow. Therefore, the main improvement would be to replace os.walk with a faster approach such as scandir. Another option is to migrate to C++, which has lower overhead and handles I/O and loops faster.
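A scandir-based walk might look like the sketch below. It avoids a separate `os.path.getsize` stat call by reusing the metadata cached on each `DirEntry`. Note that since Python 3.5, `os.walk` is itself implemented on top of `os.scandir` (PEP 471), so the actual speedup depends on the interpreter version; the function name here is hypothetical.

```python
import os
import re

def find_regex_scandir(dir_loc, regex, max_size):
    """Sketch of the suggested optimization: iterative walk with
    os.scandir, using DirEntry.stat() for the size check."""
    pattern = re.compile(regex)
    matches = []
    stack = [dir_loc]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as it:
                for entry in it:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif (pattern.search(entry.name)
                          or entry.stat().st_size > max_size):
                        matches.append(entry.path)
        except PermissionError:
            continue  # skip directories we cannot read
    return matches
```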
Another possible improvement is to remove the function is_large_file and inline its check directly in the main loop of find_regex. This reduces the per-file function-call overhead but decreases the readability of the code.
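The trade-off can be measured in isolation with `timeit` (a sketch; `is_large` here is a stand-in for the helper, since the real body of is_large_file isn't shown in this README):

```python
import timeit

def is_large(size, max_size):
    # Stand-in for the helper's assumed size comparison.
    return size > max_size

# Comparison routed through a helper call vs. written inline.
with_call = timeit.timeit(lambda: is_large(1024, 100), number=100_000)
inlined = timeit.timeit(lambda: 1024 > 100, number=100_000)
print(f"helper call: {with_call:.4f}s, inlined: {inlined:.4f}s")
```

The difference per call is tiny, but it is paid once per file visited, which is why it shows up at all when scanning from `/`.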