should not use seek() for skipping very small column chunks. better to read and ignore data. #3076
Comments
See also related issue #3077: AzureBlobFileSystem.open() should return an FSDataInputStream subclass that overrides readVectored() much more efficiently for small reads.
Anyway, I agree with you that read-and-discard is better for cloud stores. What I don't know is what a good value is here. What do you think, at least in your deployments? The Velox paper actually sets the value at 500K.
Maybe in Hadoop we should
Yes, we wanted to run more TPC-DS performance tests with different min-seek values and then decide on a good default, but those are expensive and time-consuming to run, which is why we haven't done it yet. It would be good to know if you have any performance benchmark numbers.
I think we could do some micro-benchmarking here as well. What we want is a skip size such that it is faster to read and discard the data than it is to wait for/acquire an HTTP connection and download the data across multiple threads. Acquisition time is troublesome given the HTTP connection pool size and the wait times that may be imposed by the actions of other threads; the same goes for thread-pool scheduling overhead. Download time is a function of bandwidth alone. We could ignore the HTTPS and thread delays and focus on the time from GET to "first byte", which would be entirely that imposed by the cloud store itself. Then all we care about is that the time to download the skipped data is less than the GET-to-first-byte latency, which it will be when the skip is small enough.
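A rough way to express that break-even point, as a sketch only (the latency and bandwidth figures below are illustrative assumptions, not measured values):

```java
// Back-of-the-envelope break-even skip size: reading and discarding n bytes
// on an already-open connection costs roughly n / bandwidth seconds, while a
// new GET costs at least one time-to-first-byte. Read-and-discard wins while
// n / bandwidth < ttfb, i.e. n < ttfb * bandwidth.
public final class SkipThreshold {

  public static long breakEvenSkipBytes(double ttfbSeconds, double bandwidthBytesPerSec) {
    return (long) (ttfbSeconds * bandwidthBytesPerSec);
  }

  public static void main(String[] args) {
    // e.g. 30 ms time-to-first-byte and ~100 MB/s sustained download
    System.out.println(breakEvenSkipBytes(0.030, 100e6)); // ~3,000,000 bytes
  }
}
```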
We could benchmark this: something to preheat the pool with a few head calls, then a set of single byte read() calls to different parts of a large file, with time to return being considered time to first byte. then work out download bandwidth and so how many bytes match that time-to-first-byte I will take a small PR to cloudstore for this; it'd be interesting to see what the local fs values are as well as the different cloud stores, local and remote. Even: do different VM types matter? @Arnaud-Nauwynck |
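A minimal sketch of such a probe against any Hadoop FileSystem (the path, sample count and chunk size are arbitrary assumptions; a real benchmark would need repetitions, warm-up and percentiles):

```java
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: estimates time-to-first-byte and download bandwidth for a store,
// from which a break-even skip size can be derived.
public class ReadLatencyProbe {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);                 // a large existing file (> 9 MB assumed)
    FileSystem fs = path.getFileSystem(new Configuration());
    long fileLen = fs.getFileStatus(path).getLen();
    Random rnd = new Random(42);

    // Time-to-first-byte: single-byte positioned reads at random offsets on fresh streams.
    long ttfbNanos = 0;
    int samples = 10;
    for (int i = 0; i < samples; i++) {
      long offset = (long) (rnd.nextDouble() * (fileLen - 1));
      long t0 = System.nanoTime();
      try (FSDataInputStream in = fs.open(path)) {
        in.readFully(offset, new byte[1]);         // forces a GET at that offset
      }
      ttfbNanos += System.nanoTime() - t0;
    }
    double ttfbSec = ttfbNanos / 1e9 / samples;

    // Bandwidth: a multi-megabyte read on a stream that has already paid the first-byte cost.
    int chunk = 8 * 1024 * 1024;
    byte[] buf = new byte[chunk];
    double bwBytesPerSec;
    try (FSDataInputStream in = fs.open(path)) {
      in.readFully(0, new byte[1]);                // pay the first-byte latency up front
      long t0 = System.nanoTime();
      in.readFully(1, buf);                        // assumes fileLen > chunk + 1
      bwBytesPerSec = chunk / ((System.nanoTime() - t0) / 1e9);
    }

    System.out.printf("ttfb=%.1f ms, bandwidth=%.1f MB/s, break-even skip=%d bytes%n",
        ttfbSec * 1e3, bwBytesPerSec / 1e6, (long) (ttfbSec * bwBytesPerSec));
  }
}
```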
Describe the enhancement requested
When reading only some of the column chunks, Parquet builds a list of "ConsecutivePartList" entries and then calls the Hadoop vectored-read API, FSDataInputStream#readVectored(List ...).
Unfortunately, many implementations of FSDataInputStream do not override the readVectored() method, which results in many distinct read calls.
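For context, this is roughly how the vectored-read call path looks (a sketch assuming the Hadoop 3.3.5+ vectored IO API; the offsets and lengths are made up). When the underlying stream does not override readVectored(), the default implementation degrades to one positioned read per range:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileRange;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VectoredReadExample {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);
    FileSystem fs = path.getFileSystem(new Configuration());
    List<FileRange> ranges = Arrays.asList(
        FileRange.createFileRange(0, 4096),           // column chunk A
        FileRange.createFileRange(1_000_000, 8192));  // column chunk B, far away
    try (FSDataInputStream in = fs.open(path)) {
      in.readVectored(ranges, ByteBuffer::allocate);  // may fall back to per-range reads
      for (FileRange r : ranges) {
        ByteBuffer data = r.getData().get();          // each range completes independently
        System.out.println("read " + data.remaining() + " bytes at " + r.getOffset());
      }
    }
  }
}
```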
For example, with hadoop-azure, Azure Data Lake Storage is much slower at establishing a new HTTPS connection (using the infamous HttpURLConnection API dating back to JDK 1.0, then doing the TLS handshake) than at fetching a few more megabytes of data over an existing socket!
The case of small holes that we would like to avoid reading is very frequent when a Parquet file has columns that are not read and that are highly compressed thanks to RLE encoding: typically a very sparse column with only a few values, or even one that is always null within a page. Such a column may be encoded by Parquet in only a few hundred bytes, so reading 100 extra bytes is NOT a problem.
Parquet should at least honor the following method from the Hadoop FileSystem class, which says that a seek of less than 4096 bytes is NOT reasonable.
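The referenced FileSystem method is not reproduced here. As a rough sketch of the read-and-discard behaviour it implies (the helper name and threshold constant below are hypothetical, not the actual Hadoop or Parquet code):

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

// Hypothetical helper: skip a small gap by reading and discarding bytes on the
// already-open stream instead of issuing a seek that may break the connection.
final class SmallGapSkipper {
  // Illustrative threshold; the right value is exactly what the benchmarks
  // above are meant to determine (4096 is the figure quoted from FileSystem).
  static final long MIN_SEEK_BYTES = 4096;

  static void skipTo(FSDataInputStream in, long targetPos) throws IOException {
    long gap = targetPos - in.getPos();
    if (gap > 0 && gap <= MIN_SEEK_BYTES) {
      byte[] scratch = new byte[(int) gap];
      in.readFully(scratch);            // read and throw away the gap bytes
    } else if (gap != 0) {
      in.seek(targetPos);               // large gap: a seek is still cheaper
    }
  }
}
```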
The logic that builds this list for a set of column chunks is here:
org.apache.parquet.hadoop.ParquetFileReader#internalReadRowGroup
Maybe a possible implementation could be to add fictive "ConsecutivePartList" entries whose data is read but ignored on receipt; that would avoid having holes in the ranges to read.
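A minimal sketch of that planning-level idea, coalescing ranges separated by a small gap into one read whose extra bytes are simply discarded (class and field names below are hypothetical, not the actual Parquet classes):

```java
import java.util.ArrayList;
import java.util.List;

final class RangeCoalescer {
  record Range(long offset, long length) {}

  // Merge ranges (assumed sorted by offset) whose gap is at most maxGap bytes.
  static List<Range> coalesce(List<Range> sorted, long maxGap) {
    List<Range> merged = new ArrayList<>();
    Range current = null;
    for (Range r : sorted) {
      if (current == null) {
        current = r;
      } else if (r.offset() - (current.offset() + current.length()) <= maxGap) {
        // Small hole: extend the current read to cover it; the bytes between
        // the two ranges will be read and ignored when the data arrives.
        long end = r.offset() + r.length();
        current = new Range(current.offset(), end - current.offset());
      } else {
        merged.add(current);
        current = r;
      }
    }
    if (current != null) {
      merged.add(current);
    }
    return merged;
  }
}
```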
Component(s)
No response