Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve URL boundary with quotations and parentheses #254

Merged
merged 8 commits into from
Aug 20, 2022

Conversation

fhightower
Copy link
Owner

@fhightower fhightower commented Jul 9, 2022

Fixes #130

Key changes:

  1. If the URL has a ) and not a (, we remove the ) (which we previously did) and everything after it
  2. We remove ) and everything after it before we remove trailing quotation marks so http://example.com/foo') is properly parsed as http://example.com/foo

@codecov-commenter
Copy link

codecov-commenter commented Jul 9, 2022

Codecov Report

Merging #254 (725a113) into main (f46e970) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main      #254   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            4         4           
  Lines          408       408           
=========================================
  Hits           408       408           
Impacted Files Coverage Δ
ioc_finder/ioc_finder.py 100.00% <100.00%> (ø)
ioc_finder/ioc_grammars.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

# some urls are written with the ":" unencoded so we include it below
url_path_word = Word(alphanums + "-._~!$&'()*+,;:=%")
url_path_word = Word(alphanums + "-._~!$&'()*+,;=:%")
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switch order of characters to align w/ RFC.

Comment on lines 21 to 36
param(
"DownloadString('https://example[.]com/rdp.ps1');g $I DownloadString(\"https://example[.]com/rdp.ps2\");g $I",
{
"urls": ["https://example.com/rdp.ps1", "https://example.com/rdp.ps2"],
},
{"parse_domain_from_url": False},
id="URL boundary w/ single or double quotes handled properly",
),
param(
"url,https://example.com/g/foo,Malicious Google Groups discussion",
{
"urls": ["https://example.com/g/foo"],
},
{"parse_domain_from_url": False},
id="URL boundary w/ comma handled properly",
),
Copy link
Owner Author

@fhightower fhightower Jul 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To resolve these two failing cases, my approach will be to:

  1. Identify the character before the URL
  2. If the preceding character is not whitespace, either:
    • Stop parsing once that character is hit again
    • Remove that character and anything after it from the URL

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not handle cases like #261 where the url is not preceded by a comma (it is only postceded by one).

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am comfortable removing punctuation from the end of a URL or it the URL is preceded by the same character, but other than that, we would have to break significantly with the RFC (which is something I need to think about more).

+ Combine(one_of("pub- PUB-") + Word(nums, exact=16)).set_parse_action(
pyparsing_common.downcase_tokens
)
+ Combine(one_of("pub- PUB-") + Word(nums, exact=16)).set_parse_action(pyparsing_common.downcase_tokens)
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lint update

mac_address_16_bit_section = Combine(
(Word(hexnums, exact=2) + one_of("- :")) * 5 + Word(hexnums, exact=2)
)
mac_address_16_bit_section = Combine((Word(hexnums, exact=2) + one_of("- :")) * 5 + Word(hexnums, exact=2))
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lint update

Comment on lines +195 to +201
assert sorted(
[
r"HKEY_LOCAL_MACHINE\Software\Microsoft\Windows",
r"HKLM\Software\Microsoft\Windows",
r"HKCC\Software\Microsoft\Windows",
]
) == sorted(iocs["registry_key_paths"])
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lint update

Comment on lines +91 to +104
def _clean_url(url: str) -> str:
"""Clean the given URL, removing common, unwanted characters which are usually not part of the URL."""
# if there is a ")" in the URL and not a "(", remove everything including and after the ")"
if ")" in url and "(" not in url:
url = url.split(")")[0]

# remove `"` and `'` characters from the end of a URL
url = url.rstrip('"').rstrip("'")

# remove `'/>` and `"/>` from the end of a URL (this character string occurs at the end of an HMTL tag with )
url = string_remove_from_end(url, "'/>")
url = string_remove_from_end(url, '"/>')

return url
Copy link
Owner Author

@fhightower fhightower Aug 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three changes to note:

  1. Moving this into its own function for easier extensibility in the future
  2. Removing ) before rstriping quotation marks so cases like http://example.com/foo') are properly parsed as http://example.com/foo
  3. Previously, we were just doing url.rstrip(")") to remove a closing paren. at the end of a URL; now we do url = url.split(")")[0] to remove the paren and everything after it.

@fhightower fhightower changed the title Improve URL boundary Improve URL boundary with quotations and parentheses Aug 20, 2022
@fhightower fhightower marked this pull request as ready for review August 20, 2022 09:28
@fhightower fhightower merged commit ee715ec into main Aug 20, 2022
@fhightower fhightower deleted the 130-url-boundary-fix branch August 20, 2022 09:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve URL boundary regarding quotation marks and parentheses
2 participants