Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Xls addition #9

Merged
merged 1 commit into from
Feb 26, 2024
Merged

Xls addition #9

merged 1 commit into from
Feb 26, 2024

Conversation

psainics
Copy link
Collaborator

TODO: Add a description

public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {

CombineFileSplit split = (CombineFileSplit) genericSplit;
Configuration job = context.getConfiguration();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change variable name to jobConf

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done !

*
* @return A hashmap of column names and their manually set schemas.
*/
public Map<String, Schema> getOverride() throws IllegalArgumentException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have this in all file plugins?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes !
image

case ERROR:
return null;
default:
throw new IllegalStateException("Unexpected celltype (" + cellType + ")");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use String.format

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done !

}

@Override
protected void addFormatProperties(Map<String, String> properties) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add this property also to keep split as 1.
properties.put(FileInputFormat.SPLIT_MINSIZE, Long.toString(Long.MAX_VALUE));

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done !

context.getFailureCollector().addFailure("Sheet number must be a number.", null)
.withConfigProperty(XlsInputFormatConfig.NAME_SHEET_VALUE);
}
FailureCollector collector = context.getFailureCollector();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are evaluating the schema from the file, when this will be triggered.

* Utilities around XLS input format.
*/
public class XlsInputFormatUtils {
private static final Pattern NOT_VALID_PATTERN = Pattern.compile("[^A-Za-z0-9_]+");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the latest naming convention for BQ

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This impl is taken from file plugin.
This is the invalid, column for CDAP.
We are not sure of the side effect this may cause, are you sure ?

Copy link
Collaborator

@anup-cloudsufi anup-cloudsufi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some unit tests before raising the external PR for review.

<version>2.12.0-SNAPSHOT</version>
</parent>
<artifactId>format-xls</artifactId>
<name>XLSX format plugins</name>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets be consistent with format XLS.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed !

XLS format plugins

<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi</artifactId>
<version>5.2.4</version>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

define the property for others as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

<poi.version>5.2.4</poi.version>

isRowNull = true;
for (int cellIndex = 0; cellIndex < row.getLastCellNum(); cellIndex++) {
if (cellIndex >= fields.size()) {
throw new IllegalArgumentException(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this check outside the loop.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can i move cellIndex logic outside loop ?
cellIndex is part of this loop 🤔

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean instead of checking it inside the loop you can do it before the loop, something like:

Suggested change
throw new IllegalArgumentException(
if (row.getLastCellNum() > fields.size()) {
throw new IllegalArgumentException(String.format("Schema contains less fields than the number of columns in the excel file. Schema fields: %s, Excel columns: %s", fields.size(), row.getLastCellNum()));
}

}
Cell cell = row.getCell(cellIndex, Row.MissingCellPolicy.RETURN_BLANK_AS_NULL);
if (cell == null) {
continue;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add some comments regarding why the cell is not processed or can't we ignore this continue checks and add null in line#160

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment.

// Blank cells are skipped, builder will set null for the field, no processing needed.

We are also using the continue logic to track isRowNull so we can't move it inside line#160

"Can be either sheet name or sheet no; for example: 'Sheet1' or '0' in case user selects 'Sheet Name' or " +
"'Sheet Number' as 'sheet' input respectively. Sheet number starts with 0.";
public static final String DESC_TERMINATE_ROW = "Specify whether processing needs to be terminated in case an" +
" empty row is encountered while processing excel files. Options to select are true or false.";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of Options to select are true or false., write The default value is false. or we can remove this also.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to Default value is false.

/**
* Formats the cell value of an Excel file.
*
* @param cell the cell to format
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactor the line.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added @param type the schema type of the cell

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, you update the comment as follows:
{@code Cell} the cell to format

case ERROR:
return null;
default:
throw new IllegalStateException(String.format("Unexpected celltype (%s)", cellType));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elaborate the message by specifying the list of Supported types.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an impossible state , only possible if xls has a new cell type added in future, this is internal representation of cell by library we are using. This is an implementation details not relevant for end user.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even in impossible state we should be informative about the error messages. So, lets rewrite the error messages something as:
Failed to format (%s) due to unsupported cell type (%s).

for (int rowIndex = rowStart; rowIndex <= rowEnd; rowIndex++) {
Row row = workSheet.getRow(rowIndex);
if (row == null) {
continue;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

come up with some logic to ignore the usage of continue and break down the header & data processing into separate methods for better readability.

return false;
}
// Get the next row.
Row row = workSheet.getRow(rowIndex);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it guaranteed that workSheet.getRow(rowIndex) never return null?

[s] Review Squashed
@psainics psainics merged commit 55ac600 into develop Feb 26, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants