
Speed parsing by running concurrent processes #267

Open
fhightower opened this issue Sep 28, 2022 · 11 comments
Labels
enhancement (New feature or request), priority: 1 (high), time est: 10 hours (we estimate this issue will take ≈10 hours to complete)

Comments

@fhightower
Owner

No description provided.

@fhightower added the enhancement (New feature or request) and time est: 10 hours labels on Sep 28, 2022
@FANGOD

FANGOD commented Oct 5, 2022

I think speed is more important than memory. Is it possible to try to slice the text for multi-process parsing?

@fhightower
Owner Author

fhightower commented Oct 14, 2022

True, I've tried using concurrent processes in the past and ran into an issue, but I never identified the root cause, so I can revisit this. I've renamed this issue to reflect the new objective. Thanks for the input, @FANGOD.

@fhightower changed the title from "Make library more memory efficient by returning data as generators rather than lists" to "Speed parsing by running concurrent processes" on Oct 14, 2022
@FANGOD

FANGOD commented Nov 30, 2022

import pandas as pd
pd.options.display.max_colwidth  # note: this only reads the option; it does not change it
from pandarallel import pandarallel
from ioc_finder import find_iocs  # assumed import: find_iocs is ioc-finder's parsing function

pandarallel.initialize(nb_workers=4, progress_bar=True)


def find_keys(text: str, filters=False, **kwargs):
    key, keys = find_iocs(text, filters, **kwargs)
    return keys

def split_text(text):
    # Group every 50 lines of the input into one chunk for parallel parsing.
    lines = text.split("\n")
    num = 50
    new_text = ""
    new_text_list = []
    for line in lines:
        num -= 1
        new_text += line + "\n"
        if num == 0:
            new_text_list.append(new_text.strip("\n"))
            num = 50
            new_text = ""
    if new_text:  # keep any trailing lines that did not fill a full chunk
        new_text_list.append(new_text.strip("\n"))
    return new_text_list

# text contains 585724 lines: https://raw.githubusercontent.com/bitsight-research/threat_research/main/pseudomanuscrypt/toa.mygametoa.com_domains
new_lines = split_text(text)
df = pd.DataFrame(new_lines)
df["res"] = df.loc[:, 0].parallel_apply(find_keys)

# -- results -- copied from IPython
In [86]: time df["res"] = df.loc[:, 0].parallel_apply(find_keys)
 100.00% :::::::::::::::::::::::::::::::::::::::: |     3661 /     3661 |
 100.00% :::::::::::::::::::::::::::::::::::::::: |     3661 /     3661 |
 100.00% :::::::::::::::::::::::::::::::::::::::: |     3661 /     3661 |
 100.00% :::::::::::::::::::::::::::::::::::::::: |     3660 /     3660 |
CPU times: user 3.1 s, sys: 1.36 s, total: 4.46 s
Wall time: 11min

In [88]: df
Out[88]: 
                                                       0                                                res
0      pmcdgjpohi.com\nbjchcfobdq.com\njarigbbooq.com...  [bjchcfobdq.com, ajpqcrbnmp.com, pbfhiirqrn.co...
1      dcabebqfpq.com\nmmoipjnbdm.com\nepobjhrfja.com...  [fdqrbmcmbd.com, fmcegcnjmn.com, fgaedchifp.co...
2      idpenpehnb.com\nqhpbhcqjnp.com\nbqbijaraob.com...  [pmaipcmahi.com, fjpdobeino.com, iefrfbjfnc.co...
3      egbcedjcqc.com\ndegjmbreaa.com\nimpihgeabe.com...  [nqrgpbpggm.com, niaoeenndq.com, egbcedjcqc.co...
4      jhjnjeefca.com\nccjnhbqman.com\nfambqjchfe.com...  [ojabjjpdej.com, gfrdmafdpa.com, ohnnigbmbq.co...
...                                                  ...                                                ...
14638  dhffjndmgp.com\neqbfjcbqae.com\nogdmoqecig.com...  [edjnrorfad.com, mgchfafqcp.com, raeajqmdfq.co...
14639  hjamphrhqq.com\ngibcfmfqqq.com\nbaqjbrjrgo.com...  [baqjbrjrgo.com, deedibcnri.com, bqmjgdqcgf.co...
14640  beohneidor.com\nnbqihhnqgo.com\ncrbbnjpacd.com...  [fgarbpnoiq.com, jndojbednr.com, epeanjcrgh.co...
14641  mgacmpafjd.com\nhrodjrmeeg.com\neiqcnciaea.com...  [oddmfgepaq.com, mahdehbnrd.com, jjgpmqjjhb.co...
14642  iaaroboimg.com\ndcjepnmjim.com\nfmihjanroo.com...  [ebnaqidifd.com, cfjhccefcc.com, epnmhepjoj.co...

[14643 rows x 2 columns]

The old single-process version has been running for more than three hours. Maybe the longer the string, the slower the parsing.

This is a temporary version I wrote; I hope it's useful.

Thanks ~

@fhightower
Owner Author

fhightower commented Nov 30, 2022

Thanks for sharing! I'll prioritize this ticket.

Just recording this thought for myself when I revisit this ticket (hopefully soon):

The challenge with implementing this will be to find a good place to split the text. Some of the grammars may include spaces, so it is not sufficient to split on a space. We could split on a "\n" as FANGOD did, but not all large bodies of text will necessarily have newlines in them. The solution may be to just split on "\n" for now and document that large texts without newlines won't benefit from the parallel processing.
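
A rough sketch of just the splitting step (not library code; names are illustrative): split on "\n" as described above and fall back to a single chunk when the text contains no newlines, so such texts simply won't benefit from the parallelism.

def chunk_text(text: str, lines_per_chunk: int = 50) -> list:
    lines = text.split("\n")
    if len(lines) == 1:
        # No newlines: nothing safe to split on, so return the whole text as one chunk.
        return [text]
    return [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]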

@FANGOD

FANGOD commented Dec 6, 2022

def split_text(text):
    lines = text.split("\n")
    new_text_list = []
    if len(lines) < 300:
        # Few newlines (e.g. a JSON file): split by character count instead, extending
        # each chunk by a 256-character window so matches that straddle a boundary
        # still appear whole in one chunk.
        window = 256
        text_length = 100 * 100
        split_source = list(range(0, len(text), text_length))
        split_delta = [i + text_length + window for i in split_source]
        for i1, i2 in zip(split_source, split_delta):
            new_text_list.append(text[i1:i2])
    else:
        # Plenty of newlines: group every 50 lines into one chunk.
        num = 50
        new_text = ""
        for line in lines:
            num -= 1
            new_text += line + "\n"
            if num == 0:
                new_text_list.append(new_text.strip("\n"))
                num = 50
                new_text = ""
        if new_text:  # keep any trailing lines that did not fill a full chunk
            new_text_list.append(new_text.strip("\n"))
    return new_text_list

Splitting by newlines was wrong for a JSON file, so I added splitting by character length, with an overlapping window of 256 characters.
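
A tiny, self-contained illustration (made-up example data) of why the overlapping window matters: a domain that straddles a chunk boundary is lost without it, but is still seen whole in one chunk once each chunk is extended by the window.

needle = "important-indicator.example.com"
text = "x" * 490 + needle + "y" * 490  # the domain straddles the 500-character boundary

chunk_size = 500
window = 256

plain_chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
overlap_chunks = [text[i:i + chunk_size + window] for i in range(0, len(text), chunk_size)]

print(any(needle in c for c in plain_chunks))    # False: the domain is split across chunks
print(any(needle in c for c in overlap_chunks))  # True: the window is longer than the domain

Note that the overlap can also produce duplicate matches for IOCs that fall inside the window, so merged results may need de-duplication.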

@fhightower
Owner Author

Thanks! I've started working on this in #276.

@fhightower
Owner Author

Update on this issue:

After giving this some thought, I've reduced the priority because I don't think this issue is particularly urgent. Chunking text for concurrent processing would be really useful and we may still implement it in this library, but someone using this library can reasonably chunk text outside of it and pass the chunks into the library.
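
For anyone who wants to do the chunking outside the library today, a minimal sketch of that pattern (assuming find_iocs returns a dict mapping IOC type to a list of matches, and ignoring any nested, non-list result types):

from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

from ioc_finder import find_iocs


def parse_in_chunks(text: str, lines_per_chunk: int = 50, max_workers: int = 4) -> dict:
    lines = text.split("\n")
    chunks = [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]
    merged = defaultdict(list)
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        for result in executor.map(find_iocs, chunks):
            for ioc_type, matches in result.items():
                if isinstance(matches, list):  # skip nested result types for simplicity
                    merged[ioc_type].extend(matches)
    return dict(merged)

On platforms that spawn worker processes (Windows, recent macOS), this should be called from under an if __name__ == "__main__": guard.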

@frikky
Contributor

frikky commented May 10, 2023

Heyo! So I want to bump this issue, as we've been running the library for some time now and find it gets far too slow around 0.05-0.1 MB of data with all parsers enabled. We've partially worked around this by giving our users options to choose which parsers they want, but that's not really enough either.

Sample:

  • Running with 0.2 MB of data and the following filters takes ~36 seconds on my local machine, and even longer in our cloud functions/limited Docker containers (Python 3.11): ioc_types = ["ipv4s", "urls", "domains", "email_addresses", "ipv4_cidrs", "ipv6s"]

Using more memory and CPU to do concurrent processing would be beautiful, as I found that URL parsing alone takes about 40% of the total time. We have tons of use cases where the data exceeds 0.2 MB as well :)

My thoughts:

  • Split text and run them concurrently
  • Run all available parsers simultaneously (reducing the total to essentially just URL parsing speed); a rough sketch of this is below
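
A very rough sketch of the second idea (parallelism across parsers rather than across chunks). It assumes find_iocs accepts an included_ioc_types keyword, which recent ioc-finder releases appear to expose; treat it as hypothetical if your version lacks it. The wall time then collapses to roughly the slowest single parser (URLs, per the numbers above).

from concurrent.futures import ProcessPoolExecutor
from functools import partial

from ioc_finder import find_iocs

ioc_types = ["ipv4s", "urls", "domains", "email_addresses", "ipv4_cidrs", "ipv6s"]


def parse_one_type(ioc_type: str, text: str) -> dict:
    # Each worker process parses the full text for a single IOC type.
    return find_iocs(text, included_ioc_types=[ioc_type])


def parse_types_concurrently(text: str) -> dict:
    merged = {}
    with ProcessPoolExecutor(max_workers=len(ioc_types)) as executor:
        for result in executor.map(partial(parse_one_type, text=text), ioc_types):
            for ioc_type, matches in result.items():
                if matches:  # skip empty lists so one worker's output does not clobber another's
                    merged[ioc_type] = matches
    return merged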

@fhightower
Owner Author

fhightower commented Jun 5, 2023

Thanks for the input, I'll bump the priority up a bit on this ticket. I won't be able to spend a lot of time on this for the time being, unfortunately, so it may be some time before I get to work on this - but I'll try my best.

@frikky
Contributor

frikky commented Jun 6, 2023

> Thanks for the input, I'll bump the priority up a bit on this ticket. I won't be able to spend a lot of time on this for the time being, unfortunately, so it may be some time before I get to work on this - but I'll try my best.

Would you be able to give me a quick look into it over a 15 min call to show how the parsing works, so we can try to fork the threading for it ourselves?

@fhightower
Owner Author

Sorry for the delayed response. Unfortunately, life circumstances have significantly reduced the amount of time I can spend on this project for the foreseeable future. I've reached out to you via the email you use to make commits to GitHub.
