Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC-1767: [C++] Improve writing performance of encoded string column and support EncodedStringVectorBatch for StringColumnWriter #2010

Closed
wants to merge 13 commits into from

Conversation

taiyang-li
Copy link
Contributor

@taiyang-li taiyang-li commented Aug 22, 2024

What changes were proposed in this pull request?

Improve writing performance of encoded string column and support EncodedStringVectorBatch for StringColumnWriter.
Performance was measured in ClickHouse#15

Why are the changes needed?

How was this patch tested?

original tests.

Was this patch authored or co-authored using generative AI tooling?

@wgtmac
Copy link
Member

wgtmac commented Aug 22, 2024

Thanks for opening a PR! Could you please create a JIRA for it?

@wgtmac
Copy link
Member

wgtmac commented Aug 22, 2024

cc @ffacs @luffy-zh

@taiyang-li taiyang-li changed the title Improve writing performance of encoded string column and support EncodedStringVectorBatch for StringColumnWriter [C++] Improve writing performance of encoded string column and support EncodedStringVectorBatch for StringColumnWriter Aug 22, 2024
@taiyang-li
Copy link
Contributor Author

@wgtmac could you help review this pr, thanks very much

@ffacs
Copy link
Contributor

ffacs commented Aug 22, 2024

@taiyang-li Hi, thank you for opening a PR!
.Do writing cost a lot on sorting? After all, the size of dictionary shouldn't be large when we try to use dictionary encoding.

c++/src/ColumnWriter.cc Outdated Show resolved Hide resolved
c++/src/ColumnWriter.cc Outdated Show resolved Hide resolved
@taiyang-li
Copy link
Contributor Author

taiyang-li commented Aug 22, 2024

@taiyang-li Hi, thank you for opening a PR! It seems that the time complexity not changed. Do writing cost a lot on sorting? After all, the size of dictionary shouldn't be large when we try to use dictionary encoding.

@ffacs I had measured about this. From flamegraph we could see that std::map::insert takes most of time. Notice that std::map::insert is executed even if key is duplicated. Anyway, I'll add a microbench for comparison.

@ffacs
Copy link
Contributor

ffacs commented Aug 22, 2024

@taiyang-li Hi, thank you for opening a PR! It seems that the time complexity not changed. Do writing cost a lot on sorting? After all, the size of dictionary shouldn't be large when we try to use dictionary encoding.

@ffacs I had measured about this. From flamegraph we could see that std::map::insert takes most of time. Notice that std::map::insert is executed even if key is duplicated.

Oh yes, std::map takes $$O(n \ \log(key))$$, but sort takes only $$O(n + key\ \log(Key))$$

c++/include/orc/Vector.hh Outdated Show resolved Hide resolved
@taiyang-li
Copy link
Contributor Author

@ffacs do you know where can I add google benchmark in orc ?

@ffacs
Copy link
Contributor

ffacs commented Aug 26, 2024

@ffacs do you know where can I add google benchmark in orc ?

There are no C++ benchmark currently. FYI. #1658

@taiyang-li
Copy link
Contributor Author

taiyang-li commented Aug 26, 2024

I added a benchmark src/Common/benchmarks/orc_string_dictionary.cpp in another related ClickHouse PR (https://github.com/ClickHouse/ClickHouse/pull/68591/files) to measure performance changes after my optimization. Here is the result. @ffacs @luffy-zh hope for your reviews.

$ ./build_gcc/src/Common/benchmarks/orc_string_dictionary 
2024-08-26T15:00:09+08:00
Running ./build_gcc/src/Common/benchmarks/orc_string_dictionary
Run on (32 X 2100 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 1024 KiB (x16)
  L3 Unified 11264 KiB (x2)
Load Average: 6.35, 7.05, 6.55
------------------------------------------------------------------------------------------------------
Benchmark                                                            Time             CPU   Iterations
------------------------------------------------------------------------------------------------------
BM_writeStringDictionary<OldSortedStringDictionary, 10>       74094421 ns     74091713 ns            9
BM_writeStringDictionary<NewSortedStringDictionary, 10>       73200223 ns     73197770 ns            9
BM_writeStringDictionary<OldSortedStringDictionary, 100>     137097736 ns    137091909 ns            5
BM_writeStringDictionary<NewSortedStringDictionary, 100>      86304767 ns     86301101 ns            8
BM_writeStringDictionary<OldSortedStringDictionary, 1000>    237995798 ns    237986072 ns            3
BM_writeStringDictionary<NewSortedStringDictionary, 1000>    101562538 ns    101559063 ns            7
BM_writeStringDictionary<OldSortedStringDictionary, 10000>   397019465 ns    397002121 ns            2
BM_writeStringDictionary<NewSortedStringDictionary, 10000>   153278355 ns    153274828 ns            4
BM_writeStringDictionary<OldSortedStringDictionary, 100000>  787423737 ns    787397582 ns            1
BM_writeStringDictionary<NewSortedStringDictionary, 100000>  334958714 ns    334938666 ns            2 

OldSortedStringDictionary is the old implementation, NewSortedStringDictionary is the new implementation, and 10/100/1000/10000/100000 represents the cardinality number of strings among 10 million rows

c++/include/orc/Vector.hh Outdated Show resolved Hide resolved
c++/include/orc/Vector.hh Outdated Show resolved Hide resolved
c++/include/orc/Vector.hh Outdated Show resolved Hide resolved
c++/src/ColumnWriter.cc Outdated Show resolved Hide resolved
c++/src/ColumnWriter.cc Outdated Show resolved Hide resolved
c++/include/orc/Vector.hh Outdated Show resolved Hide resolved
@dongjoon-hyun dongjoon-hyun added this to the 2.1.0 milestone Aug 28, 2024
c++/include/orc/Vector.hh Outdated Show resolved Hide resolved
c++/include/orc/Vector.hh Outdated Show resolved Hide resolved
c++/src/ColumnWriter.cc Outdated Show resolved Hide resolved
@taiyang-li
Copy link
Contributor Author

Thanks for opening a PR! Could you please create a JIRA for it?

https://issues.apache.org/jira/browse/ORC-1767

@taiyang-li taiyang-li changed the title [C++] Improve writing performance of encoded string column and support EncodedStringVectorBatch for StringColumnWriter ORC-1767: [C++] Improve writing performance of encoded string column and support EncodedStringVectorBatch for StringColumnWriter Aug 29, 2024
c++/src/Vector.cc Outdated Show resolved Hide resolved
c++/src/Vector.cc Outdated Show resolved Hide resolved
@taiyang-li
Copy link
Contributor Author

@ffacs any new requested changes ?

Copy link
Contributor

@ffacs ffacs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This path almost LGTM other than the typo. Thank you @taiyang-li !

c++/include/orc/Vector.hh Outdated Show resolved Hide resolved
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

@ffacs ffacs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ffacs ffacs closed this in 97a3797 Sep 3, 2024
taiyang-li added a commit to taiyang-li/orc that referenced this pull request Sep 4, 2024
…and support EncodedStringVectorBatch for StringColumnWriter

Improve writing performance of encoded string column and support EncodedStringVectorBatch for StringColumnWriter.
Performance was measured in ClickHouse#15

original tests.

Closes apache#2010 from taiyang-li/apache_improve_dict_write.

Lead-authored-by: taiyang-li <[email protected]>
Co-authored-by: 李扬 <[email protected]>
Signed-off-by: ffacs <[email protected]>
taiyang-li added a commit to bigo-sg/ClickHouse that referenced this pull request Sep 4, 2024
taiyang-li added a commit to taiyang-li/orc that referenced this pull request Sep 4, 2024
…and support EncodedStringVectorBatch for StringColumnWriter

Improve writing performance of encoded string column and support EncodedStringVectorBatch for StringColumnWriter.
Performance was measured in ClickHouse#15

original tests.

Closes apache#2010 from taiyang-li/apache_improve_dict_write.

Lead-authored-by: taiyang-li <[email protected]>
Co-authored-by: 李扬 <[email protected]>
Signed-off-by: ffacs <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants