
Speed parsing by running concurrent processes #267

Open
fhightower opened this issue Sep 28, 2022 · 11 comments
Labels
enhancement (New feature or request), priority: 1 (high), time est: 10 hours (we estimate this issue will take ≈10 hours to complete)

Comments

@fhightower
Owner

No description provided.

@fhightower added the enhancement (New feature or request) and time est: 10 hours labels on Sep 28, 2022
@FANGOD

FANGOD commented Oct 5, 2022

I think speed is more important than memory. Is it possible to try to slice the text for multi-process parsing?

@fhightower
Owner Author

fhightower commented Oct 14, 2022

True, I've tried using concurrent processes in the past and ran into an issue, but I never identified the root cause, so I can revisit this. I've renamed this issue to reflect the new objective. Thanks for the input, @FANGOD.

@fhightower changed the title from "Make library more memory efficient by returning data as generators rather than lists" to "Speed parsing by running concurrent processes" on Oct 14, 2022
@FANGOD

FANGOD commented Nov 30, 2022

import pandas as pd
pd.options.display.max_colwidth  # note: this only reads the option; it does not change it
from pandarallel import pandarallel
from ioc_finder import find_iocs  # assumed import: find_iocs is ioc-finder's parsing function

pandarallel.initialize(nb_workers=4, progress_bar=True)


def find_keys(text: str, filters=False, **kwargs):
    key, keys = find_iocs(text, filters, **kwargs)
    return keys

def split_text(text):
    # Group every 50 lines of the input into one chunk for parallel parsing.
    lines = text.split("\n")
    num = 50
    new_text = ""
    new_text_list = []
    for line in lines:
        num -= 1
        new_text += line + "\n"
        if num == 0:
            new_text_list.append(new_text.strip("\n"))
            num = 50
            new_text = ""
    if new_text:  # keep any trailing lines that did not fill a full chunk
        new_text_list.append(new_text.strip("\n"))
    return new_text_list

# text contains 585724 lines: https://raw.githubusercontent.com/bitsight-research/threat_research/main/pseudomanuscrypt/toa.mygametoa.com_domains
new_lines = split_text(text)
df = pd.DataFrame(new_lines)
df["res"] = df.loc[:, 0].parallel_apply(find_keys)

# -- results -- copied from IPython
In [86]: time df["res"] = df.loc[:, 0].parallel_apply(find_keys)
 100.00% :::::::::::::::::::::::::::::::::::::::: |     3661 /     3661 |
 100.00% :::::::::::::::::::::::::::::::::::::::: |     3661 /     3661 |
 100.00% :::::::::::::::::::::::::::::::::::::::: |     3661 /     3661 |
 100.00% :::::::::::::::::::::::::::::::::::::::: |     3660 /     3660 |
CPU times: user 3.1 s, sys: 1.36 s, total: 4.46 s
Wall time: 11min

In [88]: df
Out[88]: 
                                                       0                                                res
0      pmcdgjpohi.com\nbjchcfobdq.com\njarigbbooq.com...  [bjchcfobdq.com, ajpqcrbnmp.com, pbfhiirqrn.co...
1      dcabebqfpq.com\nmmoipjnbdm.com\nepobjhrfja.com...  [fdqrbmcmbd.com, fmcegcnjmn.com, fgaedchifp.co...
2      idpenpehnb.com\nqhpbhcqjnp.com\nbqbijaraob.com...  [pmaipcmahi.com, fjpdobeino.com, iefrfbjfnc.co...
3      egbcedjcqc.com\ndegjmbreaa.com\nimpihgeabe.com...  [nqrgpbpggm.com, niaoeenndq.com, egbcedjcqc.co...
4      jhjnjeefca.com\nccjnhbqman.com\nfambqjchfe.com...  [ojabjjpdej.com, gfrdmafdpa.com, ohnnigbmbq.co...
...                                                  ...                                                ...
14638  dhffjndmgp.com\neqbfjcbqae.com\nogdmoqecig.com...  [edjnrorfad.com, mgchfafqcp.com, raeajqmdfq.co...
14639  hjamphrhqq.com\ngibcfmfqqq.com\nbaqjbrjrgo.com...  [baqjbrjrgo.com, deedibcnri.com, bqmjgdqcgf.co...
14640  beohneidor.com\nnbqihhnqgo.com\ncrbbnjpacd.com...  [fgarbpnoiq.com, jndojbednr.com, epeanjcrgh.co...
14641  mgacmpafjd.com\nhrodjrmeeg.com\neiqcnciaea.com...  [oddmfgepaq.com, mahdehbnrd.com, jjgpmqjjhb.co...
14642  iaaroboimg.com\ndcjepnmjim.com\nfmihjanroo.com...  [ebnaqidifd.com, cfjhccefcc.com, epnmhepjoj.co...

[14643 rows x 2 columns]

The old single-process version has been running for more than three hours. Maybe the longer the string, the slower the parsing.

This is a temporary version I wrote; I hope it's useful.

Thanks ~

@fhightower
Owner Author

fhightower commented Nov 30, 2022

Thanks for sharing! I'll prioritize this ticket.

Just recording this thought for myself when I revisit this ticket (hopefully soon):

The challenge with implementing this will be to find a good place to split the text. Some of the grammars may include spaces, so it is not sufficient to split on a space. We could split on a "\n" as FANGOD did, but not all large bodies of text will necessarily have newlines in them. The solution may be to just split on "\n" for now and document that large texts without newlines won't benefit from the parallel processing.
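
A rough sketch of just the splitting step (not library code; names are illustrative): split on "\n" as described above and fall back to a single chunk when the text contains no newlines, so such texts simply won't benefit from the parallelism.

def chunk_text(text: str, lines_per_chunk: int = 50) -> list:
    lines = text.split("\n")
    if len(lines) == 1:
        # No newlines: nothing safe to split on, so return the whole text as one chunk.
        return [text]
    return [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]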

@FANGOD

FANGOD commented Dec 6, 2022

def split_text(text):
    lines = text.split("\n")
    new_text_list = []
    if len(lines) < 300:
        # Few newlines (e.g. a JSON file): split by character count instead, extending
        # each chunk by a 256-character window so matches that straddle a boundary
        # still appear whole in one chunk.
        window = 256
        text_length = 100 * 100
        split_source = list(range(0, len(text), text_length))
        split_delta = [i + text_length + window for i in split_source]
        for i1, i2 in zip(split_source, split_delta):
            new_text_list.append(text[i1:i2])
    else:
        # Plenty of newlines: group every 50 lines into one chunk.
        num = 50
        new_text = ""
        for line in lines:
            num -= 1
            new_text += line + "\n"
            if num == 0:
                new_text_list.append(new_text.strip("\n"))
                num = 50
                new_text = ""
        if new_text:  # keep any trailing lines that did not fill a full chunk
            new_text_list.append(new_text.strip("\n"))
    return new_text_list

Splitting by newlines was wrong for a JSON file, so I added splitting by character length, with an overlapping window of 256 characters.
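
A tiny, self-contained illustration (made-up example data) of why the overlapping window matters: a domain that straddles a chunk boundary is lost without it, but is still seen whole in one chunk once each chunk is extended by the window.

needle = "important-indicator.example.com"
text = "x" * 490 + needle + "y" * 490  # the domain straddles the 500-character boundary

chunk_size = 500
window = 256

plain_chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
overlap_chunks = [text[i:i + chunk_size + window] for i in range(0, len(text), chunk_size)]

print(any(needle in c for c in plain_chunks))    # False: the domain is split across chunks
print(any(needle in c for c in overlap_chunks))  # True: the window is longer than the domain

Note that the overlap can also produce duplicate matches for IOCs that fall inside the window, so merged results may need de-duplication.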

@fhightower
Owner Author

Thanks! I've started working on this in #276.

@fhightower
Owner Author

Update on this issue:

After giving this some thought, I've reduced the priority because I don't think this issue is particularly urgent. Chunking text for concurrent processing would be really useful and we may still implement it in this library, but someone using this library can reasonably chunk text outside of it and pass the chunks into the library.
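
For anyone who wants to do the chunking outside the library today, a minimal sketch of that pattern (assuming find_iocs returns a dict mapping IOC type to a list of matches, and ignoring any nested, non-list result types):

from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

from ioc_finder import find_iocs


def parse_in_chunks(text: str, lines_per_chunk: int = 50, max_workers: int = 4) -> dict:
    lines = text.split("\n")
    chunks = [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]
    merged = defaultdict(list)
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        for result in executor.map(find_iocs, chunks):
            for ioc_type, matches in result.items():
                if isinstance(matches, list):  # skip nested result types for simplicity
                    merged[ioc_type].extend(matches)
    return dict(merged)

On platforms that spawn worker processes (Windows, recent macOS), this should be called from under an if __name__ == "__main__": guard.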

@frikky
Contributor

frikky commented May 10, 2023

Heyo! So I want to bump this issue, as we've been running the library for some time now and find it gets far too slow around 0.05-0.1 MB of data with all parsers enabled. We've partially worked around this by giving our users options to choose which parsers they want, but that's not really enough either.

Sample:

  • Running with 0.2 MB of data and the following filters takes ~36 seconds on my local machine, and even longer in our cloud functions/limited Docker containers (Python 3.11): ioc_types = ["ipv4s", "urls", "domains", "email_addresses", "ipv4_cidrs", "ipv6s"]

Using more memory and CPU to do concurrent processing would be beautiful, as I found that URL parsing alone takes about 40% of the total time. We have tons of use cases where the data exceeds 0.2 MB as well :)

My thoughts:

  • Split text and run them concurrently
  • Run all available parsers simultaneously (reducing the total to essentially just URL parsing speed); a rough sketch of this is below
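
A very rough sketch of the second idea (parallelism across parsers rather than across chunks). It assumes find_iocs accepts an included_ioc_types keyword, which recent ioc-finder releases appear to expose; treat it as hypothetical if your version lacks it. The wall time then collapses to roughly the slowest single parser (URLs, per the numbers above).

from concurrent.futures import ProcessPoolExecutor
from functools import partial

from ioc_finder import find_iocs

ioc_types = ["ipv4s", "urls", "domains", "email_addresses", "ipv4_cidrs", "ipv6s"]


def parse_one_type(ioc_type: str, text: str) -> dict:
    # Each worker process parses the full text for a single IOC type.
    return find_iocs(text, included_ioc_types=[ioc_type])


def parse_types_concurrently(text: str) -> dict:
    merged = {}
    with ProcessPoolExecutor(max_workers=len(ioc_types)) as executor:
        for result in executor.map(partial(parse_one_type, text=text), ioc_types):
            for ioc_type, matches in result.items():
                if matches:  # skip empty lists so one worker's output does not clobber another's
                    merged[ioc_type] = matches
    return merged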

@fhightower
Owner Author

fhightower commented Jun 5, 2023

Thanks for the input, I'll bump the priority up a bit on this ticket. I won't be able to spend a lot of time on this for the time being, unfortunately, so it may be some time before I get to work on this - but I'll try my best.

@frikky
Contributor

frikky commented Jun 6, 2023

> Thanks for the input, I'll bump the priority up a bit on this ticket. I won't be able to spend a lot of time on this for the time being, unfortunately, so it may be some time before I get to work on this - but I'll try my best.

Would you be able to give me a quick look into it over a 15 min call to show how the parsing works, so we can try to fork the threading for it ourselves?

@fhightower
Owner Author

Sorry for the delayed response. Unfortunately, life circumstances have significantly reduced the amount of time I can spend on this project for the foreseeable future. I've reached out to you via the email you use to make commits to GitHub.
