-
Notifications
You must be signed in to change notification settings - Fork 483
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ORC-1361: InvalidProtocolBufferException when reading large stripe statistics #1402
Merged
Merged
Changes from 1 commit
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
115 changes: 115 additions & 0 deletions
115
java/core/src/test/org/apache/orc/TestOrcWithLargeStripeStatistics.java
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,115 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one or more | ||
* contributor license agreements. See the NOTICE file distributed with | ||
* this work for additional information regarding copyright ownership. | ||
* The ASF licenses this file to you under the Apache License, Version 2.0 | ||
* (the "License"); you may not use this file except in compliance with | ||
* the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
package org.apache.orc; | ||
|
||
import org.apache.hadoop.conf.Configuration; | ||
import org.apache.hadoop.fs.Path; | ||
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector; | ||
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch; | ||
|
||
import org.junit.jupiter.api.Disabled; | ||
import org.junit.jupiter.params.ParameterizedTest; | ||
import org.junit.jupiter.params.provider.EnumSource; | ||
|
||
import java.io.IOException; | ||
import java.util.Arrays; | ||
import java.util.stream.Collectors; | ||
import java.util.stream.IntStream; | ||
|
||
import static org.junit.jupiter.api.Assertions.assertTrue; | ||
|
||
/** | ||
* Tests for operations on Orc file with very large stripe statistics. | ||
* <p> | ||
* The test is disabled by default cause it is rather slow (approx 14 minutes) and memory greedy | ||
* (it requires about 4g heap space when creating the files). If you want to run it remove the | ||
* {@code Disabled} annotation and ensure that max heap (Xmx) is at least 4g. | ||
* </p> | ||
*/ | ||
@Disabled("ORC-1361") | ||
public class TestOrcWithLargeStripeStatistics { | ||
|
||
@ParameterizedTest | ||
@EnumSource(value = OrcFile.Version.class, mode = EnumSource.Mode.EXCLUDE, names = "FUTURE") | ||
public void testGetStripeStatisticsNoProtocolBufferExceptions(OrcFile.Version version) | ||
throws Exception { | ||
// Use a size that exceeds the protobuf limit (e.g., 1GB) to trigger protobuf exception | ||
Path p = createOrcFile(1024L << 20, version); | ||
try (Reader reader = OrcFile.createReader(p, OrcFile.readerOptions(new Configuration()))) { | ||
assertTrue(reader.getStripeStatistics().isEmpty()); | ||
} | ||
} | ||
|
||
/** | ||
* Creates an Orc file with a metadata section of the specified size and return its path in the | ||
* filesystem. | ||
* | ||
* The file has a fixed schema (500 string columns) and content (every column contains 200 | ||
* characters, which is roughly 200 bytes). Each row is roughly 100KB uncompressed and each stripe | ||
* holds exactly one row thus stripe metadata (column statistics) per row is 200KB (100KB for min, | ||
* 100KB for max, few bytes for sum). | ||
* | ||
* @param metadataSize the desired size of the resulting metadata section in bytes | ||
* @param version the desired version to create the file | ||
* @return the path to filesystem where the file was created. | ||
* @throws IOException if an IO problem occurs while creating the file | ||
*/ | ||
private static Path createOrcFile(long metadataSize, OrcFile.Version version) throws IOException { | ||
// Calculate the number of rows/stripes to create based on the size of one row (200KB). | ||
final long ROW_STRIPE_NUM = metadataSize / 200_000L; | ||
Path p = new Path(System.getProperty("test.tmp.dir"), | ||
TestOrcWithLargeStripeStatistics.class.getSimpleName() | ||
+ "_" + ROW_STRIPE_NUM + "_" + version + ".orc"); | ||
// Modify defaults to force one row per stripe. | ||
Configuration conf = new Configuration(); | ||
conf.set(OrcConf.ROWS_BETWEEN_CHECKS.getAttribute(), "0"); | ||
TypeDescription schema = createTypeDescription(); | ||
OrcFile.WriterOptions writerOptions = | ||
OrcFile.writerOptions(conf) | ||
.setSchema(schema) | ||
.stripeSize(1) | ||
.encodingStrategy(OrcFile.EncodingStrategy.SPEED) | ||
.version(version); | ||
try (Writer writer = OrcFile.createWriter(p, writerOptions)) { | ||
VectorizedRowBatch batch = createSingleRowBatch(schema); | ||
for (long i = 0; i < ROW_STRIPE_NUM; i++) { | ||
writer.addRowBatch(batch); | ||
} | ||
} | ||
return p; | ||
} | ||
|
||
private static VectorizedRowBatch createSingleRowBatch(TypeDescription schema) { | ||
VectorizedRowBatch batch = schema.createRowBatch(); | ||
batch.size = 1; | ||
byte[] bigString = new byte[200]; | ||
Arrays.fill(bigString, (byte) 'A'); | ||
for (int i = 0; i < batch.numCols; i++) { | ||
BytesColumnVector col = (BytesColumnVector) batch.cols[i]; | ||
col.setVal(0, bigString); | ||
} | ||
return batch; | ||
} | ||
|
||
private static TypeDescription createTypeDescription() { | ||
String strCols = IntStream.range(0, 500) | ||
.mapToObj(i -> "col" + i + ":string") | ||
.collect(Collectors.joining(",")); | ||
return TypeDescription.fromString("struct<" + strCols + ">"); | ||
} | ||
|
||
} |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This message could be wrong because
InvalidProtocolBufferException
can happen in another cases, right?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
From protobuf, there are many other cases here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first part of the message (
Failed to parse stripe statistics
) is accurate. The second part indeed can be misleading in some cases.Instead of pointing to ORC-1361 what do you think of creating a page/section in the ORC website and document there whatever is needed? Then the message can contain the URL to this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the exception message provide detail information? It yes, we may not need to state
check ORC-1361 for more details
explicitly.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we just need to change the title to catch
InvalidProtocolBufferException
and let the program continue to run.Reading large bars of statistical information is one of the causes of the exception. But throw any InvalidProtocolBufferException and this pr will allow the program to continue to run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can keep running, but need to let the user know that we skipped these statistics. This can be achieved by catching exceptions, or requiring the user to configure a parameter such as skipCheckStatistics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The details in the exception are probably enough for the user to understand that the stats are big/corrupted. What may be more difficult to understand is why the stats are big or why they were corrupted and what can they do about it.
Providing some tips like check the strip size, number of columns, size of string stats, total size of the file, etc., should be useful and that's what I was thinking that could go in ORC-1361 or another resource.
I am fine with whatever you decide but I would strongly suggest to LOG at least the exception that was caught.
Let me know what you prefer and I will do the appropriate changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the content of the logging message is somewhat blocking this PR from being merged. I think this is a rather minor issue so how about keeping it dead simple (
Failed to parse stripe statistics
) and move the PR forward? WDYT?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm +1 on this. Thanks @zabetak
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplified message in 55eebc6