Enhancement to Full Disclosure Crawler and Parsers #92

Open
jgwl opened this issue Dec 20, 2017 · 1 comment

Comments

@jgwl
Collaborator

jgwl commented Dec 20, 2017

Taken from #74

1. seclists_crawler_raw.py

1.1 Still doesn't provide an optional flag for the save path.

Output parameter -o

Rather than defaulting to saving in the folder where the script is run, an optional -o parameter would be useful for both the Crawler and the Parser. For those of us who will be versioning the code, this would avoid having to move the files manually every time we download a new month, and it would make the scripts easier to drive from the command line.

Note also that the current behavior (although intuitively I see where you are going) is inconsistent between the 2 scripts, which may leave a student confused: the Crawler script saves into the folder where the script is run, while the Parser script saves into the directory of the provided input instead.
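To make the request concrete, here is a minimal sketch of how a shared -o option could look with argparse. The long name `--output`, the helper names `build_parser` and `resolve_output_dir`, and the choice of the current working directory as the default are all assumptions for illustration, not what the scripts currently do.

```python
import argparse
import os


def build_parser():
    """Build a parser with only the proposed -o option (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Sketch of a shared output option for the Crawler and Parser"
    )
    parser.add_argument(
        "-o", "--output",
        default=os.getcwd(),  # assumed default: the folder the script is run from
        help="directory to save results in (default: current working directory)",
    )
    return parser


def resolve_output_dir(argv=None):
    """Parse argv, create the output directory if needed, and return its absolute path."""
    args = build_parser().parse_args(argv)
    out_dir = os.path.abspath(args.output)
    os.makedirs(out_dir, exist_ok=True)
    return out_dir
```

If both scripts resolved their save path through the same helper, the default and the override would behave identically in the Crawler and the Parser, which would also address the inconsistency noted above.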

1.2 README.md

Should mention what the user should expect to be downloaded. Currently, that is each individual e-mail html page plus an index.html page whose name format is _.raw.html. The main difference is the absence of a relative id in the index file name.

2. seclists_index_parse.py

2.1 Script help message example is incorrect (?)

-f , parse single raw file, e.g. -f ./2011_Jan_0.raw.html

From your README.md (very nicely done by the way), I assume this would be without the 0 in it? i.e. 2011_Jan.raw.html.

2.2 Lacks save path

Currently, output is added to the input path directory.

Should mention on README.md the possible-follow-ups case

The README should mention that "possible follow-ups" are added by the parser the same way as follow-ups, without any "possible" marker.

3. Add some python tests to ensure consistency across the scripts

Given that it is hard to tell from the results whether files are missing, now or in the future, it would be useful to have tests that:

  • Unit test that the number of entries in the .csv generated by seclists_index_parse.py equals the number of raw.html files minus 1 in a given month folder (the -1 accounts for the index .html file).

    • Assert the equality on a month with a possible-follow-up case.
    • Assert the equality on a month without any possible-follow-up case.
    • Assert the equality on a month with and without any possible-follow-up case.
  • Unit test that the number of files generated by seclists_reply_parse.py equals the number of raw html files.

This should suffice to minimally check that all scripts are working consistently. Additional tests could include, for example, checking that the number of authors is correct and that the number of e-mail parents matches the expected count.
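The first check could be sketched roughly as follows. The helper names are hypothetical, and the sketch assumes that every downloaded page in a month folder (including the index) ends in `.raw.html` and that the generated .csv has a single header row; both assumptions would need to be confirmed against the actual scripts.

```python
import csv
import glob
import os


def count_raw_html(month_dir):
    """Count every *.raw.html file in a month folder, index page included."""
    return len(glob.glob(os.path.join(month_dir, "*.raw.html")))


def count_csv_entries(csv_path, has_header=True):
    """Count non-empty data rows in the .csv (header excluded when has_header is True)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = [row for row in csv.reader(handle) if row]
    if has_header and rows:
        return len(rows) - 1
    return len(rows)


def index_csv_is_consistent(month_dir, csv_path, has_header=True):
    """True when the .csv entries equal the raw.html files minus the one index page."""
    return count_csv_entries(csv_path, has_header) == count_raw_html(month_dir) - 1
```

A unittest case could call `index_csv_is_consistent` once per fixture month (with, without, and mixed possible-follow-up cases). The same `count_raw_html` helper might be reused for the seclists_reply_parse.py check, once it is clear how that script names its output files.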

General Notes

5. Missing requirements.txt with python libraries.

6. Parent README.md

Should probably add a parent folder for both the Crawler and the Parser, with a README mentioning the existence of the 3 scripts, a one-line statement of what each does, and the agreed taxonomy of the file names.

@carlosparadis
Member

@jgwl thank you for putting this together in a new issue and wanting to wrap this up! :-)

@carlosparadis carlosparadis changed the title Enhancements to crawler/parsers Enhancement to Full Disclosure Crawler and Parsers Feb 15, 2018