
Add Multi-thread Support #2594

Merged: 19 commits merged into allinurl:master from the multithread-part1 branch on Dec 20, 2023

Conversation

@Sea-n (Contributor) commented Dec 14, 2023

This PR achieves a x2.88 speedup with 4 threads, and a x3.65 speedup compared to the current version. Benchmark results are as follows:

Threads     Time       Speedup vs. 1 thread   Speedup vs. current
current     59.6 sec   -                      baseline
1 thread    47.3 sec   baseline               x1.26
2 threads   35.9 sec   x1.32                  x1.66
3 threads   21.2 sec   x2.23                  x2.80
4 threads   16.4 sec   x2.88                  x3.65

(1 GB log, 5.6 M lines, M2 Air, average of 5 runs)

Method

The simple control flow is as follows:

for each file:
  read file
  for each line:
    for every specifier in format ('%h %^[%d:%t %^] "%r" %s ...'):
      parse token into LogItem    <- parallelize here
    process LogItem to Metric     <- uplift this function

Since parse_specifier() takes a lot of time here and has no data races with the other steps, we parallelize it first.

for each file:
  read file
  while not eof:
    for each line:
      save lines to Job1
    create threads:
      for every line in Job1:
        for every specifier in format ('%h %^[%d:%t %^] "%r" %s ...'):
          parse token into LogItem    <- parallelize here
          save LogItem to Job2
    for every LogItem in Job2:
      process LogItem to Metric     <- uplift this function
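
Below is a minimal sketch of this chunked fan-out in C with POSIX threads. It is not the PR's actual code: the thread count, the Job type, and the parse_one_line()/process_item() stand-ins are illustrative placeholders for the PR's parse_line()/process_log().

/* Chunked parallel parsing: worker threads parse disjoint slices of raw
 * lines into items; a single consumer then processes the items, so the
 * shared metric tables are only ever touched by one thread. */
#include <pthread.h>
#include <stdlib.h>

#define NTHREADS 4

typedef struct {
  char **lines;    /* slice of raw lines for this worker        */
  void **items;    /* parsed results, written at the same index */
  size_t count;
} Job;

/* stand-in: a real parser would build a GLogItem from the line */
static void *
parse_one_line (const char *line) {
  (void) line;
  return malloc (1);
}

/* stand-in: a real consumer would update the metrics */
static void
process_item (void *item) {
  free (item);
}

static void *
parse_job (void *arg) {
  Job *job = arg;
  for (size_t i = 0; i < job->count; i++)
    job->items[i] = parse_one_line (job->lines[i]);   /* no shared writes */
  return NULL;
}

static void
parse_chunk (char **lines, void **items, size_t nlines) {
  pthread_t tid[NTHREADS];
  Job jobs[NTHREADS];
  size_t per = (nlines + NTHREADS - 1) / NTHREADS;

  for (int t = 0; t < NTHREADS; t++) {
    size_t start = (size_t) t * per;
    if (start > nlines)
      start = nlines;
    jobs[t].lines = lines + start;
    jobs[t].items = items + start;
    jobs[t].count = nlines - start < per ? nlines - start : per;
    pthread_create (&tid[t], NULL, parse_job, &jobs[t]);
  }
  for (int t = 0; t < NTHREADS; t++)
    pthread_join (tid[t], NULL);

  /* single-threaded processing keeps the metric updates race-free */
  for (size_t i = 0; i < nlines; i++)
    process_item (items[i]);
}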

Major sources of speedup

Parallelize parse_log()

Parsing logs takes about 70% of the execution time, so we parallelize this part.

We also let process_log() run alongside parse_log(), which achieves a x3.0 speedup on a 5-thread machine.

Use GLog->start_time instead of localtime_r()

When solving the data race in init_log_item(), we introduced start_time in GLog.
We then found that parse_specifier() (which is executed roughly 10 times per line, depending on the log format) also called this function every time.

Copying a struct tm instead of converting a Unix timestamp to {year, month, day, hour, ...} saved a lot of time.

This contributed a x1.15 speedup to the serial version.
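
A minimal sketch of the idea, with illustrative names (in the PR the cached value lives in GLog->start_time):

/* Convert the timestamp once per log and copy the cached struct tm for
 * every line, instead of calling localtime_r() once per GLogItem. */
#include <time.h>

static struct tm start_time;              /* cached once, e.g. in GLog */

static void
cache_start_time (time_t ts) {
  localtime_r (&ts, &start_time);         /* expensive conversion, done once */
}

static void
init_item_time (struct tm *item_tm) {
  *item_tm = start_time;                  /* cheap struct copy per log line */
}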

Change GLogItem->status from char* to int

When we measured the execution time of each module, we found that the "Status Code" module accounts for about 30% of it.

The reason is that looking up the {"404" => "404 - Not Found"} mapping with foreach() and string comparison consumes too much time. So we changed status_code to an integer and use status_codes[404] as a direct array lookup.

This speeds up process_log() by x1.38, and gives a x1.28 overall speedup when parse_log() is not the bottleneck.
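
A minimal sketch of the array-lookup idea; the table entries and bounds below are illustrative, not GoAccess's actual table.

/* Index an array by the numeric status code instead of scanning
 * "code => title" string pairs for every hit. */
static const char *status_codes[600];

static void
init_status_codes (void) {
  status_codes[200] = "200 - OK";
  status_codes[404] = "404 - Not Found";
  /* ... remaining codes ... */
}

static const char *
status_code_title (int status) {
  if (status < 100 || status > 599 || !status_codes[status])
    return "Unknown";
  return status_codes[status];
}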

Details

Instead of the --multi-thread-parsing option we talked about before, using -j, --jobs (the same as make) might be more reasonable for users. With the default, conf.jobs=1, no new threads are created for parsing/processing the log.
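
For example, a typical invocation might look like the following (the log format and chunk size are just illustrative values; --jobs and --chunk-size are the options added by this PR):

goaccess access.log --log-format=COMBINED --jobs=4 --chunk-size=1024 -o report.html

Omitting --jobs (or passing --jobs=1) keeps the old single-threaded behavior.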

We split it into the following commits:

  • Prepare
    • Merge read_lines() for with/without GETLINE
    • Rename pre_process_log() to parse_line()
    • Uplift process_log() from parse_line()
    • Uplift process_log() from read_line()
    • Add --jobs and --chunk-size options
  • Parse in batch
    • Replace logitems with GJob in read_lines()
    • Replace lines with GJob in read_lines()
    • Parse lines to GLogItem in batch
  • Solve data races
    • Eliminate data races with __sync_bool_compare_and_swap()
    • Use __sync_add_and_fetch() (see the sketch after this commit list)
    • Add GLog->start_time to prevent using localtime_r() per GLogItem
    • Use logitem instead of glog->item for init_log_item()
  • Parallelize
    • Implement the parallel version for log parsing
    • Process logs in batch
    • Run proc_logs while read_lines_thread in progress
    • Change GLogItem->status from char * to int
  • Misc
    • Change char * to const char * for parser
    • Add multi-thread to README
    • Update .gitignore
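
Below is a minimal sketch of the lock-free updates named in the "Solve data races" commits; the counter and slot variables are illustrative, but the GCC __sync builtins are the ones the commit titles refer to.

/* Atomic counter update and one-time initialization without a mutex. */
#include <stddef.h>

static int processed;         /* shared hit counter across parser threads */
static void *shared_slot;     /* pointer that must be initialized exactly once */

static void
count_one (void) {
  __sync_add_and_fetch (&processed, 1);   /* atomic increment */
}

static int
claim_slot (void *mine) {
  /* only the first caller swaps NULL -> mine; later callers see it already set */
  return __sync_bool_compare_and_swap (&shared_slot, NULL, mine);
}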

With every commit:

  • the output is exactly the same as the master branch
  • there are no compiler warnings
  • it passes the Valgrind test
  • it conforms to the coding style
  • there is no big change compared to the previous commit, so hopefully it is easy to review step by step

Future Work

  • Parallelize process_log() by separating modules into different threads
    • I haven't thought of a simple way to balance the thread count between parse_logs() and process_logs()
    • Nor how to assign modules to threads, since the workload varies per module
  • Decide CHUNK_SIZE more scientifically
    • Low-memory machines and highly efficient servers have not been considered yet
  • Thread pool / job queue
  • Check for rarely occurring crashes
    • We encountered many crashes during development, but this version seems to be fine

Fixes #377

@Sea-n force-pushed the multithread-part1 branch from 7e07e20 to e5012f4 on December 16, 2023 03:05
@Sea-n force-pushed the multithread-part1 branch from e5012f4 to 0f2f8c0 on December 18, 2023 05:28
@Sea-n (Contributor) commented Dec 18, 2023

Hi Gerardo,

I've revised this PR according to what we discussed by email; it's ready for review now! 😍

Wait, we found another bottleneck that can be improved, please give me another day.

@Sea-n force-pushed the multithread-part1 branch 2 times, most recently from 0a13956 to 35f6f49, on December 18, 2023 18:49
@Sea-n (Contributor) commented Dec 18, 2023

Hi Gerardo,

We found that processing the status code is time consuming, and achieved a x1.2 speedup.

Version    Overall Time   proc_log()   "Status Code" Module
Current    20.6 sec       17.4 sec     6.2 sec
Improved   16.2 sec       12.6 sec     1.4 sec

And as you advised, we have extracted CHUNK_SIZE into a --chunk-size option now; it was easier than we thought!

This PR is ready for review now. 😊

@allinurl (Owner) commented:

This is fantastic news! I tested it on my end, and there's a significant improvement when using 2 threads with the default CHUNK_SIZE. We're making great progress. I also ran it with 4+ threads, and it didn't crash. Before any merging, there are a few things I'd like to address:

  1. I think it's wise to set a limit when the user specifies the number of jobs, say between 1-12. I'm unsure of the appropriate maximum limit, but my assumption is that beyond a certain point, it might not be beneficial and could potentially crash the system. Same with chunk-size. If greater or lower, we should exit and display the reason, i.e., FATAL ('Max limit...').

  2. I observed that, since we modified pre_process_log()/parse_line() to return a GLogItem *, the numtests/dryrun part in the code isn't correctly providing the reason for an invalid log/date/time format when it doesn't match the given string.
    I suspect that part of the challenge stems from the adjustments in error handling. Let me take some time to consider the best approach for handling it on my end.

  3. I noticed a couple of memory leaks when encountering an invalid log format, such as W3C on a COMBINED log when generating an HTML report. My initial thought is that it might be related to the error handling, but I'll investigate further to determine how we can address this issue. e.g.,

valgrind -s --tool=memcheck --leak-check=full --show-leak-kinds=all --track-origins=yes ./goaccess access.log -o a.html --log-format=W3C

Thanks again

Sea-n added 17 commits on December 19, 2023 20:09. Commit notes:

  • And sort getopt short_options list
  • And move count_process() out of parse_line() to avoid a data race
  • set conf.* flags in parse_specifier()
  • glog->lp.ts = logitem->dt in parse_line()
  • In get_ii32, get_si32, inc_ii32, inc_si32, count_process
  • Prevent data race for init_log_item()
  • And reduce the workload of unnecessary localtime_r() function calls
  • This improvement gives a x1.15 overall speedup
  • In process_log(), use array lookup instead of string compare
  • This will speed up process_log() by x1.38, and give a x1.28 overall speedup
@Sea-n force-pushed the multithread-part1 branch from 35f6f49 to f4ca02d on December 19, 2023 18:26
@Sea-n (Contributor) commented Dec 19, 2023

The code is updated! Here is the summary/reason.

1-a. jobs count

It now has --jobs=<1-6>, with 4 threads recommended in the help text.
The hard limit is set at --jobs=12.
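
A minimal sketch of that check, assuming the value has already been parsed into a conf-style struct; the limits and the error handling below are illustrative, not the PR's exact code (GoAccess would use its FATAL() macro).

/* Reject out-of-range --jobs values early and explain why. */
#include <stdio.h>
#include <stdlib.h>

#define MIN_JOBS 1
#define MAX_JOBS 12   /* hard limit; 1-6 is the recommended range */

static struct { int jobs; } conf = { 1 };

static void
validate_jobs (void) {
  if (conf.jobs < MIN_JOBS || conf.jobs > MAX_JOBS) {
    fprintf (stderr, "--jobs must be between %d and %d\n", MIN_JOBS, MAX_JOBS);
    exit (EXIT_FAILURE);
  }
}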

1-b. chunk-size

It's a long topic, so I'll leave it to the end.

2-a. error message

Oh, I forgot to check whether logitem->errstr already existed and simply overwrote it.
Now it will display the original error message.

2-b. dry run

I've been thinking about whether we should run strictly 10 items or one full round (4 threads * chunk 1024 = 4096 items).

Compared to the added complexity in the codebase (and a possible slight slowdown because of the extra if-else), I think wasting 10 ms at startup is acceptable.

In the extreme case, 6 threads * chunk 8,192 ≈ 50,000 lines, which should still be done in a fraction of a second.

3-a. memory leak

Changed cnt == num_tests to cnt >= num_tests, and free the logitem before returning NULL.
How could I miss this 😅

3-b. memory leak

Related to (2-a) error message.
It was overwritten, so no one freed the original errstr.

Chunk Size

The benchmark:

[Chart: run time for chunk sizes 512 to 262144]

(Note that the log file only has 5,600,000 lines)

My test command:

for i in {1..5}; do for c in 1024 4096 16384 65536 262144 512 2048 8192 32768 131072; do for j in {1..4}; do time ./goaccess access.log.1gb -o /dev/null --jobs=$j --chunk-size=$c; sleep 1; done; done; done 2>&1 | tee -a benchmark-result

And for the extreme values:

[Chart: run time for chunk sizes 1 to 1024]

The overhead for a low chunk size might come from:

  • pthread_create() and pthread_join() without a thread pool
  • a function call for every chunk

[Chart: run time for chunk sizes 16384 to 1048576]

The overhead for an extremely high chunk size:

  • The idle time during the first round (there are no logitems to process yet)
  • The final processing time (when no lines remain to be parsed)

So I added a hard limit along with a recommended range, and temporarily kept the default value untouched.
It's hard to decide the min/max hard/soft limits and the default value; feel free to change them to any numbers you like.

I've created 3 EC2 servers to run the benchmark, but the results are disappointing:

[Charts: c7g (arm64, 2.6 GHz, 16 GB), c6a (x86_64, 3.6 GHz, 16 GB), t4g (arm64, 2.5 GHz, 32 GB)]

For 1 thread, chunk-size makes no significant difference (though a higher chunk-size gives a slight speedup).

We have some options here:

  • only check for --jobs=4 --chunk-size=8 (in that order), allowing --chunk-size=8 --jobs=4 to work
  • also check conf.chunk_size in the --jobs option handling
  • check conf.chunk_size in the main program (after all options are parsed)
  • still apply a meaningless but harmless limit for --jobs=1

I chose the last one as my first thought; it's also up to you to decide.

@allinurl merged commit 1e116d0 into allinurl:master on Dec 20, 2023
5 checks passed
@allinurl (Owner) commented:

This is fantastic! Thank you for the effort put into this; it's a significant enhancement to the parsing process. Merged.

@allinurl (Owner) commented Dec 20, 2023

whoops, I rushed the merge. Looks like something's broken when piping in data. e.g.,

tail -f access.log | goaccess - --log-format=VCOMBINED -o report.html --real-time-html

@0bi-w6n-K3nobi (Contributor) commented Mar 4, 2024

Hello @allinurl and @Sea-n

Wow! Great, great and great news!

I've been waiting for this for a long time.
Unfortunately, laziness kept me from helping to add this feature.
Sorry about that, @allinurl.

Well... I didn't come here to criticize anything, friends, but to collaborate, even if a bit late.
I will only address the case of offline processing here, i.e. non-realtime.
In my experience, even very old 2012 server hardware can process 3 to 4 thousand requests per second.
That covers a site of approximately 3 million requests per day; I think it's more than enough.

So @Sea-n, are these EC2 instances single- or multi-processor?
(It seems to me that your instances have, at most, 2 threads.)
Of course, thread performance also depends on this.
Note 1: I believe the limit and default value of the jobs parameter could be calculated relative to the hardware it is running on.

So 64K lines of logs correspond to 1 to 2 MB (200 to 400 bytes per line).
For values like 32K lines and 8 threads we would have 4 to 8 MB.
Even on (what really is) very fast EC2 storage (Note 2), we may run out of SSD cache / filesystem cache.
But a more impartial analysis would require excluding this loading time, since we are talking about thread scalability and not I/O (bottleneck) problems.
This can be done (solved) in several ways:

  • Load the data more than once, checking the time for each load;
  • Use the setvbuf function with the highest possible value (see the sketch after this list);
  • Use dd as an input cache (it is necessary to check for a pipe bottleneck);
  • Copy the log data to /dev/shm (an in-memory filesystem) and run from there.
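
For the setvbuf idea, a minimal sketch (the 8 MB buffer size and the helper name are arbitrary illustrations):

/* Enlarge stdio's read buffer before parsing so fewer reads hit the
 * filesystem; whether this helps depends on where the I/O bottleneck is. */
#include <stdio.h>

static char big_buf[8 << 20];   /* 8 MB buffer */

FILE *
open_log_buffered (const char *path) {
  FILE *fp = fopen (path, "r");
  if (fp != NULL)
    setvbuf (fp, big_buf, _IOFBF, sizeof big_buf);   /* must run before any I/O on fp */
  return fp;
}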

Well... but all this may be in vain.
If we are talking about just 1 processor, the bottleneck in these charts is certainly processing power and not I/O.

So... I always thought about implementing this via a thread pool, with one thread per log file.
So if you had 4 files, for example, you would have 4 threads...
And when there were many files, it would keep N threads (the jobs parameter) active at a time.
This would keep things simple, but it loses its meaning with just 1 log file or a pipe.

Anyway... let's get to where I can really help you, @allinurl and @Sea-n.
I have a lot of test data... as @allinurl already knows.
I'm thinking about running a battery of tests, similar to the ones mentioned above.
However, I will include more variations. It would be a matrix with parameters like:

  • 1 up to N processors;
  • 1 up to T threads;
  • 0 up to P panels;
  • 1 up to 2^S for chunk-size;

This would certainly give us the real impact of threads on the process, and show which panels are a bottleneck.

Sirs... Are you interested in this?
It should take a while to gather the analysis data.

@allinurl (Owner) commented:

hey @0bi-w6n-K3nobi good to hear from you. Thanks so much for laying all that out! The analysis you provided on potential bottlenecks, caching, and threading is super insightful.

I'm definitely interested in seeing the results if you're able to run that comprehensive test matrix you proposed. Having real data on how different configs of processors, threads, panels, and chunk sizes actually perform would be really valuable for further optimizations.

If you can put together that testing and share the findings, that would be amazing. No rush at all!
