
Add Multi-thread Support #2594

Merged: 19 commits merged into allinurl:master from the multithread-part1 branch on Dec 20, 2023

Conversation

@Sea-n (Contributor) commented Dec 14, 2023

This PR achieves a x2.88 speedup with 4 threads, and a x3.65 speedup compared to the current version. Benchmark results are as follows:

Threads     Time       Speedup vs. 1 thread   Speedup vs. current
current     59.6 sec   -                      baseline
1 thread    47.3 sec   baseline               x1.26
2 threads   35.9 sec   x1.32                  x1.66
3 threads   21.2 sec   x2.23                  x2.80
4 threads   16.4 sec   x2.88                  x3.65

(1 GB log, 5.6 M lines, M2 Air, average of 5 runs)

Method

The simple control flow is as follows:

for each file:
  read file
  for each line:
    for every specifier in format ('%h %^[%d:%t %^] "%r" %s ...'):
      parse token into LogItem    <- parallelize here
    process LogItem to Metric     <- uplift this function

Since parse_specifier() takes a lot of time here and has no data races with the other steps, we parallelize it first.

for each file:
  read file
  while not eof:
    for each line:
      save lines to Job1
    create threads:
      for every line in Job1:
        for every specifier in format ('%h %^[%d:%t %^] "%r" %s ...'):
          parse token into LogItem    <- parallelize here
          save LogItem to Job2
    for every LogItem in Job2:
      process LogItem to Metric     <- uplift this function
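
Below is a minimal sketch of this chunked fan-out in C with POSIX threads. It is not the PR's actual code: the thread count, the Job type, and the parse_one_line()/process_item() stand-ins are illustrative placeholders for the PR's parse_line()/process_log().

/* Chunked parallel parsing: worker threads parse disjoint slices of raw
 * lines into items; a single consumer then processes the items, so the
 * shared metric tables are only ever touched by one thread. */
#include <pthread.h>
#include <stdlib.h>

#define NTHREADS 4

typedef struct {
  char **lines;    /* slice of raw lines for this worker        */
  void **items;    /* parsed results, written at the same index */
  size_t count;
} Job;

/* stand-in: a real parser would build a GLogItem from the line */
static void *
parse_one_line (const char *line) {
  (void) line;
  return malloc (1);
}

/* stand-in: a real consumer would update the metrics */
static void
process_item (void *item) {
  free (item);
}

static void *
parse_job (void *arg) {
  Job *job = arg;
  for (size_t i = 0; i < job->count; i++)
    job->items[i] = parse_one_line (job->lines[i]);   /* no shared writes */
  return NULL;
}

static void
parse_chunk (char **lines, void **items, size_t nlines) {
  pthread_t tid[NTHREADS];
  Job jobs[NTHREADS];
  size_t per = (nlines + NTHREADS - 1) / NTHREADS;

  for (int t = 0; t < NTHREADS; t++) {
    size_t start = (size_t) t * per;
    if (start > nlines)
      start = nlines;
    jobs[t].lines = lines + start;
    jobs[t].items = items + start;
    jobs[t].count = nlines - start < per ? nlines - start : per;
    pthread_create (&tid[t], NULL, parse_job, &jobs[t]);
  }
  for (int t = 0; t < NTHREADS; t++)
    pthread_join (tid[t], NULL);

  /* single-threaded processing keeps the metric updates race-free */
  for (size_t i = 0; i < nlines; i++)
    process_item (items[i]);
}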

Major sources of speedup

Parallelize parse_log()

Parsing logs takes about 70% of the execution time, so we parallelize this part.

We also let process_log() run alongside parse_log(), which achieves a x3.0 speedup on a 5-thread machine.

Use GLog->start_time instead of localtime_r()

When solving the data race in init_log_item(), we introduced start_time in GLog.
We then found that parse_specifier() (which is executed roughly 10 times per line, depending on the log format) also called this function every time.

Copying a struct tm instead of converting a Unix timestamp to {year, month, day, hour, ...} saved a lot of time.

This contributed a x1.15 speedup to the serial version.
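
A minimal sketch of the idea, with illustrative names (in the PR the cached value lives in GLog->start_time):

/* Convert the timestamp once per log and copy the cached struct tm for
 * every line, instead of calling localtime_r() once per GLogItem. */
#include <time.h>

static struct tm start_time;              /* cached once, e.g. in GLog */

static void
cache_start_time (time_t ts) {
  localtime_r (&ts, &start_time);         /* expensive conversion, done once */
}

static void
init_item_time (struct tm *item_tm) {
  *item_tm = start_time;                  /* cheap struct copy per log line */
}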

Change GLogItem->status from char* to int

When we measured the execution time of each module, we found that the "Status Code" module accounts for about 30% of it.

The reason is that looking up the {"404" => "404 - Not Found"} mapping with foreach() and string comparison consumes too much time. So we changed status_code to an integer and use status_codes[404] as a direct array lookup.

This speeds up process_log() by x1.38, and gives a x1.28 overall speedup when parse_log() is not the bottleneck.
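
A minimal sketch of the array-lookup idea; the table entries and bounds below are illustrative, not GoAccess's actual table.

/* Index an array by the numeric status code instead of scanning
 * "code => title" string pairs for every hit. */
static const char *status_codes[600];

static void
init_status_codes (void) {
  status_codes[200] = "200 - OK";
  status_codes[404] = "404 - Not Found";
  /* ... remaining codes ... */
}

static const char *
status_code_title (int status) {
  if (status < 100 || status > 599 || !status_codes[status])
    return "Unknown";
  return status_codes[status];
}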

Details

Instead of the --multi-thread-parsing option we talked about before, using -j, --jobs (the same as make) might be more reasonable for users. With the default, conf.jobs=1, no new threads are created for parsing/processing the log.
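
For example, a typical invocation might look like the following (the log format and chunk size are just illustrative values; --jobs and --chunk-size are the options added by this PR):

goaccess access.log --log-format=COMBINED --jobs=4 --chunk-size=1024 -o report.html

Omitting --jobs (or passing --jobs=1) keeps the old single-threaded behavior.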

We split it into the following commits:

  • Prepare
    • Merge read_lines() for with/without GETLINE
    • Rename pre_process_log() to parse_line()
    • Uplift process_log() from parse_line()
    • Uplift process_log() from read_line()
    • Add --jobs and --chunk-size options
  • Parse in batch
    • Replace logitems with GJob in read_lines()
    • Replace lines with GJob in read_lines()
    • Parse lines to GLogItem in batch
  • Solve data races
    • Eliminate data races with __sync_bool_compare_and_swap()
    • Use __sync_add_and_fetch() (see the sketch after this commit list)
    • Add GLog->start_time to prevent using localtime_r() per GLogItem
    • Use logitem instead of glog->item for init_log_item()
  • Parallelize
    • Implement the parallel version for log parsing
    • Process logs in batch
    • Run proc_logs while read_lines_thread in progress
    • Change GLogItem->status from char * to int
  • Misc
    • Change char * to const char * for parser
    • Add multi-thread to README
    • Update .gitignore
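
Below is a minimal sketch of the lock-free updates named in the "Solve data races" commits; the counter and slot variables are illustrative, but the GCC __sync builtins are the ones the commit titles refer to.

/* Atomic counter update and one-time initialization without a mutex. */
#include <stddef.h>

static int processed;         /* shared hit counter across parser threads */
static void *shared_slot;     /* pointer that must be initialized exactly once */

static void
count_one (void) {
  __sync_add_and_fetch (&processed, 1);   /* atomic increment */
}

static int
claim_slot (void *mine) {
  /* only the first caller swaps NULL -> mine; later callers see it already set */
  return __sync_bool_compare_and_swap (&shared_slot, NULL, mine);
}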

With every commit:

  • the output is exactly the same as the master branch
  • there are no compiler warnings
  • it passes the Valgrind test
  • it conforms to the coding style
  • there is no big change compared to the previous commit, so hopefully it is easy to review step by step

Future Work

  • Parallelize process_log() by separating modules into different threads
    • I haven't thought of a simple way to balance the thread count between parse_logs() and process_logs()
    • Nor how to assign modules to threads, since the workload varies per module
  • Decide CHUNK_SIZE more scientifically
    • Low-memory machines and highly efficient servers have not been considered yet
  • Thread pool / job queue
  • Check for rarely occurring crashes
    • We encountered many crashes during development, but this version seems to be fine

Fixes #377

@Sea-n force-pushed the multithread-part1 branch from 7e07e20 to e5012f4 on December 16, 2023 03:05
@Sea-n force-pushed the multithread-part1 branch from e5012f4 to 0f2f8c0 on December 18, 2023 05:28
@Sea-n (Contributor) commented Dec 18, 2023

Hi Gerardo,

I've revised this PR according to what we discussed by email; it's ready for review now! 😍

Wait, we found another bottleneck that can be improved, please give me another day.

@Sea-n force-pushed the multithread-part1 branch 2 times, most recently from 0a13956 to 35f6f49, on December 18, 2023 18:49
@Sea-n (Contributor) commented Dec 18, 2023

Hi Gerardo,

We found that processing the status code is time consuming, and achieved a x1.2 speedup.

Version    Overall Time   proc_log()   "Status Code" Module
Current    20.6 sec       17.4 sec     6.2 sec
Improved   16.2 sec       12.6 sec     1.4 sec

And as you advised, we have extracted CHUNK_SIZE into a --chunk-size option now; it was easier than we thought!

This PR is ready for review now. 😊

@allinurl (Owner) commented:

This is fantastic news! I tested it on my end, and there's a significant improvement when using 2 threads with the default CHUNK_SIZE. We're making great progress. I also ran it with 4+ threads, and it didn't crash. Before any merging, there are a few things I'd like to address:

  1. I think it's wise to set a limit when the user specifies the number of jobs, say between 1-12. I'm unsure of the appropriate maximum limit, but my assumption is that beyond a certain point, it might not be beneficial and could potentially crash the system. Same with chunk-size. If greater or lower, we should exit and display the reason, i.e., FATAL ('Max limit...').

  2. I observed that, since we modified pre_process_log()/parse_line() to return a GLogItem *, the numtests/dryrun part in the code isn't correctly providing the reason for an invalid log/date/time format when it doesn't match the given string.
    I suspect that part of the challenge stems from the adjustments in error handling. Let me take some time to consider the best approach for handling it on my end.

  3. I noticed a couple of memory leaks when encountering an invalid log format, such as W3C on a COMBINED log when generating an HTML report. My initial thought is that it might be related to the error handling, but I'll investigate further to determine how we can address this issue. e.g.,

valgrind -s --tool=memcheck --leak-check=full --show-leak-kinds=all --track-origins=yes ./goaccess access.log -o a.html --log-format=W3C

Thanks again

Sea-n added 17 commits on December 19, 2023 20:09. Commit notes:

  • And sort getopt short_options list
  • And move count_process() out of parse_line() to avoid a data race
  • set conf.* flags in parse_specifier()
  • glog->lp.ts = logitem->dt in parse_line()
  • In get_ii32, get_si32, inc_ii32, inc_si32, count_process
  • Prevent data race for init_log_item()
  • And reduce the workload of unnecessary localtime_r() function calls
  • This improvement gives a x1.15 overall speedup
  • In process_log(), use array lookup instead of string compare
  • This will speed up process_log() by x1.38, and give a x1.28 overall speedup
@Sea-n force-pushed the multithread-part1 branch from 35f6f49 to f4ca02d on December 19, 2023 18:26
@Sea-n (Contributor) commented Dec 19, 2023

The code is updated! Here is the summary/reason.

1-a. jobs count

It now has --jobs=<1-6>, with 4 threads recommended in the help text.
The hard limit is set at --jobs=12.
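
A minimal sketch of that check, assuming the value has already been parsed into a conf-style struct; the limits and the error handling below are illustrative, not the PR's exact code (GoAccess would use its FATAL() macro).

/* Reject out-of-range --jobs values early and explain why. */
#include <stdio.h>
#include <stdlib.h>

#define MIN_JOBS 1
#define MAX_JOBS 12   /* hard limit; 1-6 is the recommended range */

static struct { int jobs; } conf = { 1 };

static void
validate_jobs (void) {
  if (conf.jobs < MIN_JOBS || conf.jobs > MAX_JOBS) {
    fprintf (stderr, "--jobs must be between %d and %d\n", MIN_JOBS, MAX_JOBS);
    exit (EXIT_FAILURE);
  }
}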

1-b. chunk-size

It's a long topic, so I'll leave it to the end.

2-a. error message

Oh, I forgot to check whether logitem->errstr already existed and simply overwrote it.
Now it will display the original error message.

2-b. dry run

I've been thinking about whether we should run strictly 10 items or one full round (4 threads * chunk 1024 = 4096 items).

Compared to the added complexity in the codebase (and a possible slight slowdown because of the extra if-else), I think wasting 10 ms at startup is acceptable.

In the extreme case, 6 threads * chunk 8,192 ≈ 50,000 lines, which should still be done in a fraction of a second.

3-a. memory leak

Changed cnt == num_tests to cnt >= num_tests, and free the logitem before returning NULL.
How could I miss this 😅

3-b. memory leak

Related to (2-a) error message.
It was overwritten, so no one freed the original errstr.

Chunk Size

The benchmark:

[Chart: run time for chunk sizes 512 to 262144]

(Note that the log file only has 5,600,000 lines)

My test command:

for i in {1..5}; do for c in 1024 4096 16384 65536 262144 512 2048 8192 32768 131072; do for j in {1..4}; do time ./goaccess access.log.1gb -o /dev/null --jobs=$j --chunk-size=$c; sleep 1; done; done; done 2>&1 | tee -a benchmark-result

And for the extreme values:

[Chart: run time for chunk sizes 1 to 1024]

The overhead for a low chunk size might come from:

  • pthread_create() and pthread_join() without a thread pool
  • a function call for every chunk

[Chart: run time for chunk sizes 16384 to 1048576]

The overhead for an extremely high chunk size:

  • The idle time during the first round (there are no logitems to process yet)
  • The final processing time (when no lines remain to be parsed)

So I added a hard limit along with a recommended range, and temporarily kept the default value untouched.
It's hard to decide the min/max hard/soft limits and the default value; feel free to change them to any numbers you like.

I've created 3 EC2 servers to run the benchmark, but the results are disappointing:

[Charts: c7g (arm64, 2.6 GHz, 16 GB), c6a (x86_64, 3.6 GHz, 16 GB), t4g (arm64, 2.5 GHz, 32 GB)]

For 1 thread, chunk-size makes no significant difference (though a higher chunk-size gives a slight speedup).

We have some options here:

  • only check for --jobs=4 --chunk-size=8 (in that order), allowing --chunk-size=8 --jobs=4 to work
  • also check conf.chunk_size in the --jobs option handling
  • check conf.chunk_size in the main program (after all options are parsed)
  • still apply a meaningless but harmless limit for --jobs=1

I chose the last one as my first thought; it's also up to you to decide.

@allinurl merged commit 1e116d0 into allinurl:master on Dec 20, 2023
5 checks passed
@allinurl (Owner) commented:

This is fantastic! Thank you for the effort put into this; it's a significant enhancement to the parsing process. Merged.

@allinurl (Owner) commented Dec 20, 2023

whoops, I rushed the merge. Looks like something's broken when piping in data. e.g.,

tail -f access.log | goaccess - --log-format=VCOMBINED -o report.html --real-time-html

@0bi-w6n-K3nobi (Contributor) commented Mar 4, 2024

Hello @allinurl and @Sea-n

Wow! Great, great and great news!

I've been waiting for this for a long time.
Unfortunately, laziness kept me from helping to add this feature.
Sorry about that, @allinurl.

Well... I didn't come here to criticize anything, friends, but to collaborate, even if a bit late.
I will only address the case of offline processing here, i.e. non-realtime.
In my experience, even very old 2012 server hardware can process 3 to 4 thousand requests per second.
That covers a site of approximately 3 million requests per day; I think it's more than enough.

So @Sea-n, are these EC2 instances single- or multi-processor?
(It seems to me that your instances have, at most, 2 threads.)
Of course, thread performance also depends on this.
Note 1: I believe the limit and default value of the jobs parameter could be calculated relative to the hardware it is running on.

So 64K lines of logs correspond to 1 to 2 MB (200 to 400 bytes per line).
For values like 32K lines and 8 threads we would have 4 to 8 MB.
Even on (what really is) very fast EC2 storage (Note 2), we may run out of SSD cache / filesystem cache.
But a more impartial analysis would require excluding this loading time, since we are talking about thread scalability and not I/O (bottleneck) problems.
This can be done (solved) in several ways:

  • Load the data more than once, checking the time for each load;
  • Use the setvbuf function with the highest possible value (see the sketch after this list);
  • Use dd as an input cache (it is necessary to check for a pipe bottleneck);
  • Copy the log data to /dev/shm (an in-memory filesystem) and run from there.
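
For the setvbuf idea, a minimal sketch (the 8 MB buffer size and the helper name are arbitrary illustrations):

/* Enlarge stdio's read buffer before parsing so fewer reads hit the
 * filesystem; whether this helps depends on where the I/O bottleneck is. */
#include <stdio.h>

static char big_buf[8 << 20];   /* 8 MB buffer */

FILE *
open_log_buffered (const char *path) {
  FILE *fp = fopen (path, "r");
  if (fp != NULL)
    setvbuf (fp, big_buf, _IOFBF, sizeof big_buf);   /* must run before any I/O on fp */
  return fp;
}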

Well... but all this may be in vain.
If we are talking about just 1 processor, the bottleneck in these charts is certainly processing power and not I/O.

So... I always thought about implementing this via a thread pool, with one thread per log file.
So if you had 4 files, for example, you would have 4 threads...
And when there were many files, it would keep N threads (the jobs parameter) active at a time.
This would keep things simple, but it loses its meaning with just 1 log file or a pipe.

Anyway... let's get to where I can really help you, @allinurl and @Sea-n.
I have a lot of test data... as @allinurl already knows.
I'm thinking about running a battery of tests, similar to the ones mentioned above.
However, I will include more variations. It would be a matrix with parameters like:

  • 1 up to N processors;
  • 1 up to T threads;
  • 0 up to P panels;
  • 1 up to 2^S for chunk-size;

This would certainly give us the real impact of threads on the process, and show which panels are a bottleneck.

Sirs... Are you interested in this?
It should take a while to gather the analysis data.

@allinurl (Owner) commented:

hey @0bi-w6n-K3nobi good to hear from you. Thanks so much for laying all that out! The analysis you provided on potential bottlenecks, caching, and threading is super insightful.

I'm definitely interested in seeing the results if you're able to run that comprehensive test matrix you proposed. Having real data on how different configs of processors, threads, panels, and chunk sizes actually perform would be really valuable for further optimizations.

If you can put together that testing and share the findings, that would be amazing. No rush at all!
