Make tartufo multi-threaded #325

Open
gforsythe-godaddy opened this issue Feb 8, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

gforsythe-godaddy commented Feb 8, 2022

Feature Request

Add threading or concurrency to tartufo

Is your feature request related to a problem? Please describe.

Not a problem per se, but on a monorepo with a long history, running tartufo can take a very long time.
I happened to open htop and noticed only a single CPU core was being used (essentially leaving the other 11 cores idle). I would guess that making tartufo multi-threaded would decrease not only the time for a scan, but also the amount of money spent on CI/CD instances on AWS.

Describe the solution you'd like

Preferably, the worker count would either be discovered automatically (e.g. via /proc) or settable with a command-line option for the number of threads.
I have not done any Python coding for a long time, so I am not sure what is available or whether this is possible.
I would assume there would be a parent thread that farms out chunks of commits to child threads, or something like that.
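
Something like this rough sketch is what I have in mind. All of the names here are made up for illustration (none of this is real tartufo code), and I am using multiprocessing.Pool simply to show the parent/worker split:

import os
from multiprocessing import Pool
from typing import List, Optional


def scan_chunk(commits: List[str]) -> List[str]:
    """Placeholder for scanning one chunk of commits and returning any findings."""
    return [c for c in commits if "secret" in c]


def scan_repo(commits: List[str], processes: Optional[int] = None) -> List[str]:
    """Parent splits the commit list into chunks and farms them out to workers."""
    processes = processes or os.cpu_count() or 1
    chunk_size = max(1, len(commits) // processes)
    chunks = [commits[i : i + chunk_size] for i in range(0, len(commits), chunk_size)]
    with Pool(processes=processes) as pool:
        per_chunk = pool.map(scan_chunk, chunks)
    # Flatten the per-chunk results into a single list of findings.
    return [finding for findings in per_chunk for finding in findings]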

Describe alternatives you've considered

I suppose it might be possible to run separate instances of tartufo on different branches, but in my specific case I'm running on a bare mirror of a repo so branches are not available.

Teachability, Documentation, Adoption, Migration Strategy

Should be backwards compatible. It may be good to make multi-threading the default option so everyone benefits without changing anything.

@gforsythe-godaddy gforsythe-godaddy added the enhancement New feature or request label Feb 8, 2022
@jhall1-godaddy
Contributor

I think this would be a great feature. I ran tartufo on a local repo with 4k commits and it took about 112 seconds.

I don't necessarily think the threading module is the way to go, as threads are still bound by the Python GIL, so we would not see any significant increase in the number of cores used. The multiprocessing module, however, seems to fit the bill quite well. I would like to see a CLI option to specify the number of cores to use, with the default being all of them, as @gforsythe-godaddy mentioned.

A global interpreter lock (GIL) is a mechanism used in computer-language interpreters to synchronize the execution of threads so that only one native thread (per process) can execute at a time.[1] An interpreter that uses GIL always allows exactly one thread to execute at a time, even if run on a multi-core processor.

I am interested in the development of this feature; a multiprocessing.Queue could be used to handle IPC between the parent and the worker processes. It would probably be a pretty significant rewrite, but it could also be a significant performance increase.
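
As a rough illustration of that idea, a queue-based fan-out could look something like the sketch below. The scan_commit helper, the result shape, and the default of all available cores are assumptions made for this example, not existing tartufo interfaces:

import os
from multiprocessing import Process, Queue
from typing import List, Optional

_DONE = None  # sentinel meaning "no more work" / "worker finished"


def scan_commit(commit: str) -> List[str]:
    """Placeholder for the real per-commit scan (diffing plus regex/entropy checks)."""
    return [commit] if "secret" in commit else []


def worker(work_queue: Queue, result_queue: Queue) -> None:
    """Consume commits until the sentinel is seen, pushing findings back over IPC."""
    while True:
        commit = work_queue.get()
        if commit is _DONE:
            break
        for finding in scan_commit(commit):
            result_queue.put(finding)
    result_queue.put(_DONE)  # tell the parent this worker is finished


def scan(commits: List[str], processes: Optional[int] = None) -> List[str]:
    """Fan commits out to worker processes; default to every available core."""
    processes = processes or os.cpu_count() or 1
    work_queue: Queue = Queue()
    result_queue: Queue = Queue()

    for commit in commits:
        work_queue.put(commit)
    for _ in range(processes):
        work_queue.put(_DONE)  # one sentinel per worker

    workers = [
        Process(target=worker, args=(work_queue, result_queue))
        for _ in range(processes)
    ]
    for p in workers:
        p.start()

    # Drain results while workers run so they never block on a full result queue.
    findings: List[str] = []
    finished = 0
    while finished < processes:
        item = result_queue.get()
        if item is _DONE:
            finished += 1
        else:
            findings.append(item)

    for p in workers:
        p.join()
    return findings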

Testing

I wrote a small script to determine the legitimacy of these claims.

The script:
import time
import tracemalloc
from multiprocessing import Process
from threading import Thread
from typing import Type


def do_work() -> None:
    """Simulates CPU intensive work."""
    x = [0]

    for i in range(10_000_000):
        j = x.pop()
        k = i + j - 100
        x.append(k)


def run_worker(worker: Type[Thread | Process]) -> None:
    """Creates 10 workers and runs the `do_work` function in each."""
    tracemalloc.start()
    start = time.perf_counter()
    workers: list[Thread | Process] = []

    for _ in range(10):
        w = worker(target=do_work)
        workers.append(w)
        w.start()

    for w in workers:
        w.join()

    end = time.perf_counter()
    duration = end - start
    current_mem, peak_mem = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    print(f"Method:            {worker.__name__}")
    print(f"Completion time:   {duration:.2f} seconds")
    print(f"Current mem usage: {current_mem} bytes")
    print(f"Peak mem usage:    {peak_mem} bytes")
    print()


if __name__ == "__main__":
    run_worker(Thread)
    run_worker(Process)
The script output:
Method:            Thread
Completion time:   118.48 seconds
Current mem usage: 50535 bytes
Peak mem usage:    54471 bytes

Method:            Process
Completion time:   13.29 seconds
Current mem usage: 241608 bytes
Peak mem usage:    248048 bytes
Method:            Thread
Completion time:   115.34 seconds
Current mem usage: 50103 bytes
Peak mem usage:    54307 bytes

Method:            Process
Completion time:   12.63 seconds
Current mem usage: 242043 bytes
Peak mem usage:    248048 bytes
Method:            Thread
Completion time:   118.38 seconds
Current mem usage: 49847 bytes
Peak mem usage:    54307 bytes

Method:            Process
Completion time:   12.94 seconds
Current mem usage: 241555 bytes
Peak mem usage:    247560 bytes
Method:            Thread
Completion time:   118.65 seconds
Current mem usage: 49847 bytes
Peak mem usage:    54307 bytes

Method:            Process
Completion time:   12.35 seconds
Current mem usage: 242003 bytes
Peak mem usage:    248008 bytes
Method:            Thread
Completion time:   121.74 seconds
Current mem usage: 50535 bytes
Peak mem usage:    54471 bytes

Method:            Process
Completion time:   12.55 seconds
Current mem usage: 241555 bytes
Peak mem usage:    248048 bytes

As the output shows, the multiprocessing implementation is much faster.
Multiprocessing ran approximately 9-10x faster (roughly 12-13 seconds vs. 115-122 seconds).
Threading used approximately 5x less memory (roughly 50 KB vs. 242 KB of traced allocations).

CPU usage with threading: (screenshot)

CPU usage with multiprocessing: (screenshot)

I noticed the current implementation of ScannerBase uses a threading.Lock, but after a search of the repo it doesn't look like any multithreading is actually in use, so the lock is not really doing anything in the program. The lock only stops other threads from entering the block wrapped by its context manager, and there are no other threads to compete for it.

$ find ./tartufo -type f -not -path "*.pyc" -exec grep -iHn "threading" {} \;
./tartufo/scanner.py:11:import threading
./tartufo/scanner.py:146:    _scan_lock: threading.Lock = threading.Lock()
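
For comparison, a lock like this only earns its keep when multiple threads actually contend for the same critical section (toy example, not tartufo code):

import threading

counter = 0
lock = threading.Lock()


def increment_many(times: int) -> None:
    """Increment the shared counter; the lock keeps each update atomic across threads."""
    global counter
    for _ in range(times):
        with lock:  # with only one thread, this would never block anyone
            counter += 1


threads = [threading.Thread(target=increment_many, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000: the lock prevents lost updates between the 4 threads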
