
Commit ecc115d

Merge pull request #45 from nishabalamurugan/file-scanner
Integrated filescanner
2 parents 5852e04 + c351098 commit ecc115d

9 files changed

+1602
-103
lines changed

README.md

+122-1
@@ -22,6 +22,7 @@ Designed and Developed by Comcast Cybersecurity Research and Development Team</p
2222
- [Usage](#usage)
2323
- [Enterprise Github Secrets Detection](#enterprise-github-secrets-detection)
2424
- [Public Github Secrets Detection](#public-github-secrets-detection)
25+
- [FileScan](#filescan)
2526
- [ML Model Training](#ml-model-training)
2627
- [Custom Keyword Scan](#custom-keyword-scan)
2728
- [License](#license)
@@ -120,7 +121,7 @@ Designed and Developed by Comcast Cybersecurity Research and Development Team</p
120121
- url_validator: `https://github.<<`**`Enterprise_Name`**`>>.com/api/v3/search/code`
121122
- enterprise_commits_url: `https://github.<<`**`Enterprise_Name`**`>>.com/api/v3/repos/{user_name}/{repo_name}/commits?path={file_path}`
122123

123-
#### Running Enterprise Secret Detection
124+
### Running Enterprise Secret Detection
124125

125126
- Traverse into the `github-enterprise` script folder
126127

@@ -515,6 +516,126 @@ Pass the Console Logging as Yes or No. Default is Yes
515516

516517
> **Note:** By default, detected secrets are masked to hide sensitive data. To write the raw secret instead, pass the command-line argument `-u Yes` or `--unmask_secret Yes`. Refer to the command-line options for more details.
517518
519+
### FileScan
520+
521+
**Detecting Exposed Secrets on File System at Scale**
522+
523+
- xGitGuard Filescanner detects secrets, such as keys and credentials, exposed on the filesystem.
524+
- Traverse into the `file-scanner` folder
525+
526+
```
527+
cd file-scanner
528+
```
529+
530+
#### Running Extension Filter
531+
532+
By default, the extension search script runs against the directories and files configured in `config/xgg_search_paths.csv`, using the extensions listed in `config/extensions.csv`:
533+
534+
```
535+
# Run with Default configs
536+
python xgg_extension_search.py
537+
```
538+
539+
To run against specific directories or file paths:
540+
541+
```
542+
# Run with targeted directories/filepaths for all extensions
543+
python xgg_extension_search.py -p "file-path"
544+
```
545+
546+
To run against specific directories or file paths for specific extensions only:
547+
548+
```
549+
# Run with targeted filepaths/directories for specific extensions
550+
python xgg_extension_search.py -p "file-path" -e "py,txt"
551+
```
552+
553+
> **Note:** By default, extensions are picked from the `extensions.csv` config file. Users can also search for targeted extensions, either by providing them through the CLI option or by updating `extensions.csv`.
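
The extension filter described above can be pictured with a minimal sketch. This is not the code in `xgg_extension_search.py`; the function name `find_files_by_extension` and the rule that an empty extension list matches every file are assumptions made for illustration.

```python
import os
import tempfile


def find_files_by_extension(search_path, extensions):
    """Return files under search_path whose extension is in extensions.

    Hypothetical helper: an empty extensions list matches every file.
    """
    # Normalise "py", ".py" and "PY" to the same form
    wanted = {ext.strip().lower().lstrip(".") for ext in extensions if ext.strip()}

    if os.path.isfile(search_path):
        # A single file path is checked directly
        candidates = [search_path]
    else:
        # A directory is walked recursively
        candidates = [
            os.path.join(root, name)
            for root, _dirs, files in os.walk(search_path)
            for name in files
        ]

    matches = []
    for path in candidates:
        suffix = os.path.splitext(path)[1].lower().lstrip(".")
        if not wanted or suffix in wanted:
            matches.append(path)
    return sorted(matches)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("app.py", "notes.txt", "image.png"):
            open(os.path.join(tmp, name), "w").close()
        # Mirrors the documented CLI form: -e "py,txt"
        found = find_files_by_extension(tmp, "py,txt".split(","))
        print([os.path.basename(p) for p in found])
```

Normalising case and the leading dot up front keeps the matching loop simple and makes `py`, `.py`, and `PY` equivalent.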
554+
555+
##### Command-Line Arguments
556+
557+
```
558+
Run usage:
559+
xgg_extension_search.py [-h] [-e Extensions] [-p Directory/Path] [-l Logger Level] [-c Console Logging]
560+
561+
optional arguments:
562+
-h, --help show this help message and exit
563+
-e Extensions, --extensions Extensions
564+
Pass the Extensions list as a comma-separated string
565+
-p Search Path, --search_path Search Path/File
566+
Pass the Directory or file to be searched
567+
-l Logger Level, --log_level Logger Level
568+
Pass the Logging level as for CRITICAL - 50, ERROR - 40 WARNING - 30 INFO - 20 DEBUG - 10. Default is 20
569+
-c Console Logging, --console_logging Console Logging
570+
Pass the Console Logging as Yes or No. Default is Yes
571+
```
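
For readers who want to see how flags like these are typically wired up, here is a hedged `argparse` sketch that mirrors the documented options and defaults. It is an illustration only, not the parser shipped in `xgg_extension_search.py`; `build_parser` and `split_extensions` are hypothetical names.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="xgg_extension_search.py")
    parser.add_argument(
        "-e", "--extensions", default="",
        help="Pass the Extensions list as a comma-separated string",
    )
    parser.add_argument(
        "-p", "--search_path", default="",
        help="Pass the Directory or file to be searched",
    )
    parser.add_argument(
        "-l", "--log_level", type=int, default=20,
        help="Logging level (CRITICAL 50 ... DEBUG 10). Default is 20",
    )
    parser.add_argument(
        "-c", "--console_logging", default="Yes",
        help="Pass the Console Logging as Yes or No. Default is Yes",
    )
    return parser


def split_extensions(raw):
    """Turn "py,txt" into ["py", "txt"], ignoring blanks and spaces."""
    return [ext.strip() for ext in raw.split(",") if ext.strip()]


if __name__ == "__main__":
    args = build_parser().parse_args(["-p", "file-path", "-e", "py,txt"])
    print(args.search_path, split_extensions(args.extensions), args.log_level)
```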
572+
573+
#### Search Output Format:
574+
575+
##### Output Files
576+
577+
```
578+
1. Paths Detected: xgitguard\output\xgg_search_files.csv
579+
```
580+
581+
#### Secrets Detection
582+
583+
By default, the Secrets Detection script runs against the processed search paths (`output/xgg_search_files.csv`) and detects both keys and credentials. xGitGuard applies an additional ML filter to reduce false positives in the detections.
584+
585+
```
586+
# Run with Default configs
587+
python secret_detection.py
588+
```
589+
590+
##### Command to Run Scanner without ML Filter
591+
592+
```
593+
# Run for the searched paths without the ML model
594+
python secret_detection.py -m No
595+
```
596+
597+
##### Command-Line Arguments for Secret Scanner
598+
599+
```
600+
Run usage:
601+
secret_detection.py [-h help] [-keys Secondary Keywords] [-creds Secondary Credentials] [-m Ml Prediction ] [-f File Path] [-model_pref model_preference] [-l Logger Level] [-c Console Logging]
602+
603+
optional arguments:
604+
-h, --help show this help message and exit
605+
-keys Secondary Keywords, --secondary_keywords Secondary Keywords
606+
Pass the Secondary Keyword as string
607+
-creds Secondary Credentials, --secondary_Credentials
608+
Pass the Secondary Credentials as string
609+
-m ML Prediction, --ml_prediction ML Prediction
610+
Pass the ML Filter as Yes or No. Default is Yes
611+
-F File Path, --File_path Scan path of the File
612+
Pass the file to be scanned
613+
-model_preference, --model_preference
614+
Specify whether to use the public model or the enterprise model. Default is public
615+
-l Logger Level, --log_level Logger Level
616+
Pass the Logging level as for CRITICAL - 50, ERROR - 40 WARNING - 30 INFO - 20 DEBUG - 10. Default is 20
617+
-c Console Logging, --console_logging Console Logging
618+
Pass the Console Logging as Yes or No. Default is Yes
619+
```
620+
621+
- Inputs used for search and scan
622+
623+
> **Note:** Keywords passed as command-line arguments take precedence over the config files (the default source). If no keywords are passed on the CLI, the data from the config files is used for the search.
624+
625+
> **Note:** If the ML Prediction flag is set to `No`, the `-model_preference` flag is not required.
626+
627+
- The `xgg_search_files.csv` file holds the default list of file paths to scan, produced by the extension scan. Users can update it to suit their requirements.
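
The precedence rule in the notes above (CLI keywords win, config files are the fallback) can be sketched as follows. The helper name `resolve_keywords` is hypothetical, and treating the CLI string as comma-separated is an assumption; the README only says the value is passed as a string.

```python
def resolve_keywords(cli_value, config_values):
    """Return CLI keywords when given, otherwise the config file values.

    Assumption: the CLI string may hold several comma-separated keywords.
    """
    if cli_value:
        parsed = [item.strip() for item in cli_value.split(",") if item.strip()]
        if parsed:
            return parsed
    # Nothing usable on the CLI: fall back to the config file contents
    return list(config_values)


if __name__ == "__main__":
    from_config = ["password", "token"]
    print(resolve_keywords("api_key", from_config))
    print(resolve_keywords(None, from_config))
```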
628+
629+
#### Output Format:
630+
631+
##### Output Files
632+
633+
```
634+
1. Secrets Detected: xgitguard\output\xgg_file_scan_*_secrets_detected.csv
635+
2. Log File: xgitguard\logs\xgg_file_scan_*_secret_detection*yyyymmdd_hhmmss*.log
636+
3. Hash File: xgitguard\output\xgg_file_scan_*_hashed_file.csv
637+
```
638+
518639
#### ML Model Training
519640

520641
#### Enterprise ML Model Training Procedure

xgitguard/common/configs_read.py

+129-29
@@ -50,9 +50,10 @@ def __init__(self):
5050

5151
def read_xgg_configs(self, file_name):
5252
"""
53-
Read the given xgg_configs yaml file in config path
54-
Set the Class Variable for further use
55-
params: file_name - string
53+
Read the given xgg_configs YAML file in the config path and set the class variable for further use.
54+
55+
Args:
56+
file_name (str): The name of the configuration file.
5657
"""
5758
logger.debug("<<<< 'Current Executing Function' >>>>")
5859
# Loading xgg_configs from xgg_configs_file
@@ -70,11 +71,13 @@ def read_xgg_configs(self, file_name):
7071

7172
def read_primary_keywords(self, file_name):
7273
"""
73-
Read the given primary keywords csv file in config path
74-
Set the Class Variable for further use
75-
params: file_name - string
74+
Read the given primary keywords CSV file in the config path and set the class variable for further use.
75+
76+
Args:
77+
file_name (str): The name of the CSV file.
7678
"""
7779
logger.debug("<<<< 'Current Executing Function' >>>>")
80+
7881
# Loading primary keywords from primary keywords file
7982
self.primary_keywords_file = os.path.join(self.config_dir, file_name)
8083
self.primary_keywords = read_csv_file(
@@ -87,11 +90,13 @@ def read_primary_keywords(self, file_name):
8790

8891
def read_secondary_keywords(self, file_name):
8992
"""
90-
Read the given secondary keywords csv file in config path
91-
Set the Class Variable for further use
92-
params: file_name - string
93+
Read the given secondary keywords CSV file in the config directory and set the class variable for further use.
94+
95+
Args:
96+
file_name (str): The name of the CSV file.
9397
"""
9498
logger.debug("<<<< 'Current Executing Function' >>>>")
99+
95100
# Loading secondary keywords from secondary keywords file
96101
self.secondary_keywords_file = os.path.join(self.config_dir, file_name)
97102
self.secondary_keywords = read_csv_file(
@@ -102,37 +107,63 @@ def read_secondary_keywords(self, file_name):
102107
]
103108
# logger.debug(f"secondary_keywords: {self.secondary_keywords}")
104109

110+
def read_secondary_credentials(self, file_name):
111+
"""
112+
Read the given secondary credentials CSV file in the config directory and set the class variable for further use.
113+
114+
Args:
115+
file_name (str): The name of the CSV file.
116+
"""
117+
logger.debug("<<<< 'Current Executing Function' >>>>")
118+
119+
# Loading secondary Credentials from secondary credentials file
120+
self.secondary_credentials_file = os.path.join(self.config_dir, file_name)
121+
self.secondary_credentials = read_csv_file(
122+
self.secondary_credentials_file, output="list", header=0
123+
)
124+
self.secondary_credentials = [
125+
item for sublist in self.secondary_credentials for item in sublist
126+
]
127+
# logger.debug(f"secondary_credentials: {self.secondary_credentials}")
128+
105129
def read_extensions(self, file_name="extensions.csv"):
106130
"""
107-
Read the given extensions csv file in config path
108-
Set the Class Variable for further use
109-
params: file_name - string
131+
Read the given extensions CSV file in the config path and set the class variable for further use.
132+
133+
Args:
134+
file_name (str): The name of the CSV file.
110135
"""
111136
logger.debug("<<<< 'Current Executing Function' >>>>")
137+
112138
# Get the extensions from extensions file
113139
self.extensions_file = os.path.join(self.config_dir, file_name)
114140
self.extensions = read_csv_file(self.extensions_file, output="list", header=0)
115141
self.extensions = [item for sublist in self.extensions for item in sublist]
142+
116143
# logger.debug(f"Extensions: {self.extensions}")
117144

118145
def read_hashed_url(self, file_name):
119146
"""
120-
Read the given hashed url csv file in output path
121-
Set the Class Variable for further use
122-
params: file_name - string
147+
Read the given hashed URL CSV file in the output path and set the class variable for further use.
148+
149+
Args:
150+
file_name (str): The name of the CSV file.
123151
"""
124152
logger.debug("<<<< 'Current Executing Function' >>>>")
153+
125154
# Loading Existing url hash detections
126155
self.hashed_url_file = os.path.join(self.output_dir, file_name)
127156
hashed_key_urls = read_csv_file(self.hashed_url_file, output="list", header=0)
128157
self.hashed_urls = [row[0] for row in hashed_key_urls]
158+
129159
# logger.debug(f"hashed_urls: {self.hashed_urls}")
130160

131161
def read_training_data(self, file_name):
132162
"""
133-
Read the given training data csv file in output path
134-
Set the Class Variable for further use
135-
params: file_name - string
163+
Read the given training data CSV file in the output path and set the class variable for further use.
164+
165+
Args:
166+
file_name (str): The name of the CSV file.
136167
"""
137168
logger.debug("<<<< 'Current Executing Function' >>>>")
138169
self.training_data_file = os.path.join(self.output_dir, file_name)
@@ -151,9 +182,12 @@ def read_training_data(self, file_name):
151182

152183
def read_confidence_values(self, file_name="confidence_values.csv"):
153184
"""
154-
Read the given confidence values csv file in config path
155-
Set the key as index and the Class Variable for further use
156-
params: file_name - string
185+
Read the given confidence values CSV file in the config path and set the key as index.
186+
187+
This function sets the class variable for further use.
188+
189+
Args:
190+
file_name (str): The name of the CSV file.
157191
"""
158192
logger.debug("<<<< 'Current Executing Function' >>>>")
159193
# Loading confidence levels from file
@@ -178,10 +212,12 @@ def read_confidence_values(self, file_name="confidence_values.csv"):
178212

179213
def read_dictionary_words(self, file_name="dictionary_words.csv"):
180214
"""
181-
Read the given dictionary words csv file in config path
182-
Create dictionary similarity values
183-
Set the Class Variables for further use
184-
params: file_name - string
215+
Read the given dictionary words CSV file in the config path.
216+
217+
This function creates dictionary similarity values and sets the class variables for further use.
218+
219+
Args:
220+
file_name (str): The name of the CSV file.
185221
"""
186222
logger.debug("<<<< 'Current Executing Function' >>>>")
187223
# Creating dictionary similarity values
@@ -216,9 +252,10 @@ def read_dictionary_words(self, file_name="dictionary_words.csv"):
216252

217253
def read_stop_words(self, file_name="stop_words.csv"):
218254
"""
219-
Read the given stop words csv file in config path
220-
Set the Class Variable for further use
221-
params: file_name - string
255+
Read the given stop words CSV file in the config path and set the class variable for further use.
256+
257+
Args:
258+
file_name (str): The name of the CSV file.
222259
"""
223260
logger.debug("<<<< 'Current Executing Function' >>>>")
224261
# Get the programming language stop words
@@ -227,6 +264,69 @@ def read_stop_words(self, file_name="stop_words.csv"):
227264
self.stop_words = [item for sublist in self.stop_words for item in sublist]
228265
# logger.debug(f"Total Stop Words: {len(self.stop_words)}")
229266

267+
def read_search_paths(self, file_name):
268+
"""
269+
Read the given search paths CSV file in the config directory and set the class variable for further use.
270+
271+
Args:
272+
file_name (str): The name of the CSV file.
273+
"""
274+
logger.debug("<<<< 'Current Executing Function' >>>>")
275+
276+
# Loading the search paths file to retrieve the paths that need the extension filter applied
277+
self.search_paths_file = os.path.join(self.config_dir, file_name)
278+
self.search_paths = read_csv_file(
279+
self.search_paths_file, output="list", header=0
280+
)
281+
self.search_paths = [item for sublist in self.search_paths for item in sublist]
282+
# logger.debug(f"search_paths: {self.search_paths}")
283+
284+
def read_search_files(self, file_name):
285+
"""
286+
Read the given search files CSV file in the output directory and set the class variable for further use.
287+
288+
Args:
289+
file_name (str): The name of the CSV file.
290+
"""
291+
logger.debug("<<<< 'Current Executing Function' >>>>")
292+
293+
# Reading the paths of files to be searched after applying the extension filter
294+
self.target_paths_file = os.path.join(self.output_dir, file_name)
295+
self.search_files = read_csv_file(
296+
self.target_paths_file, output="list", header=0
297+
)
298+
self.search_files = [item for sublist in self.search_files for item in sublist]
299+
# logger.debug(f"search_files: {self.search_files}")
300+
301+
def read_hashed_file(self, file_name):
302+
"""
303+
Read the given hashed file CSV file in the output path and set the class variable for further use.
304+
305+
Args:
306+
file_name (str): The name of the CSV file.
307+
"""
308+
logger.debug("<<<< 'Current Executing Function' >>>>")
309+
# Loading existing file hash detections
310+
self.hashed_file = os.path.join(self.output_dir, file_name)
311+
hashed_key_files = read_csv_file(self.hashed_file, output="", header=0)
312+
try:
313+
self.hashed_files = (
314+
hashed_key_files.get("hashed_files").drop_duplicates().tolist()
315+
)
316+
self.hashed_file_modified_time = (
317+
hashed_key_files.get("file_modification_hash")
318+
.drop_duplicates()
319+
.tolist()
320+
)
321+
self.hash_file_path = (
322+
hashed_key_files.get("files").drop_duplicates().tolist()
323+
)
324+
except Exception:
325+
self.hashed_files = []
326+
self.hashed_file_modified_time = []
327+
self.hash_file_path = []
328+
# logger.debug(f"hashed_files: {self.hashed_files}")
329+
230330

231331
if __name__ == "__main__":
232332

@@ -239,4 +339,4 @@ def read_stop_words(self, file_name="stop_words.csv"):
239339
logger = create_logger(
240340
log_level=10, console_logging=True, log_dir=log_dir, log_file_name=log_file_name
241341
)
242-
configs = ConfigsData()
342+
configs = ConfigsData()
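
Several readers in this diff (`read_secondary_credentials`, `read_search_paths`, `read_search_files`) share one pattern: load a CSV as a list of rows, then flatten it into a single list. The `read_csv_file` helper is not shown in this diff, so the sketch below substitutes the standard-library `csv` module. It assumes `header=0` means the first row is a header; `read_single_column_csv` is a hypothetical name.

```python
import csv
import os
import tempfile


def read_single_column_csv(path):
    """Read a CSV with one header row and return all cell values as a flat list."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row (assumed meaning of header=0)
        rows = [row for row in reader if row]  # drop blank lines
    # Same flattening idiom as the readers in the diff
    return [item for sublist in rows for item in sublist]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = os.path.join(tmp, "xgg_search_paths.csv")
        with open(csv_path, "w", newline="", encoding="utf-8") as handle:
            handle.write("search_paths\n/srv/app\n/home/user/project\n")
        print(read_single_column_csv(csv_path))
```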
