[refactor](be) Remove io_helper helper indirection from BE cast paths#62179
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
26312d6 to
7b1a486
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR removes the legacy util/io_helper wrapper layer and routes parsing/formatting through the corresponding cast implementations so serialization/conversion code depends on the real owners of that behavior.
Changes:
- Deleted
be/src/util/io_helper.h/.cppand replaced helper call sites withCastTo*entry points. - Centralized numeric parsing helpers into
cast_to_basic_number_common.hand decimal text parsing intoCastToDecimal. - Updated tests and conversion/serde code paths to use
CastToString::from_int128andCastTo*::from_string*_mode(...).
Reviewed changes
Copilot reviewed 50 out of 50 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| be/test/exec/column_type_convert_test.cpp | Updates tests to use cast-owned parsing/formatting APIs. |
| be/test/core/field_test.cpp | Switches test dependency from io_helper to cast_to_string. |
| be/test/common/check_overflow.cpp | Replaces io_helper include with numeric cast parsing header. |
| be/src/util/io_helper.h | Removes legacy helper declarations (deleted file). |
| be/src/util/io_helper.cpp | Removes legacy helper implementations (deleted file). |
| be/src/util/counts.h | Drops io_helper include after refactor. |
| be/src/storage/olap_scan_common.h | Drops io_helper include after refactor. |
| be/src/format/column_type_convert.h | Replaces helper calls with CastTo* parsing/formatting implementations. |
| be/src/exprs/function/function_json.cpp | Drops io_helper include after refactor. |
| be/src/exprs/function/cast/cast_to_timestamptz.h | Drops io_helper include after refactor. |
| be/src/exprs/function/cast/cast_to_string.h | Adds from_int128 entry points and updates Int128 string conversion. |
| be/src/exprs/function/cast/cast_to_jsonb.h | Drops io_helper include after refactor. |
| be/src/exprs/function/cast/cast_to_decimal.h | Moves decimal text parsing helper into CastToDecimal. |
| be/src/exprs/function/cast/cast_to_boolean.h | Inlines boolean parsing via StringParser instead of io_helper. |
| be/src/exprs/function/cast/cast_to_basic_number_common.h | Moves numeric text parsing helpers from io_helper into cast common header. |
| be/src/exprs/function/cast/cast_base.h | Drops io_helper include after refactor. |
| be/src/exprs/aggregate/moments.h | Drops io_helper include after refactor. |
| be/src/exprs/aggregate/aggregate_function_uniq.h | Drops io_helper include and adjusts includes. |
| be/src/exprs/aggregate/aggregate_function_topn.h | Drops io_helper include and adjusts includes. |
| be/src/exprs/aggregate/aggregate_function_sequence_match.h | Replaces io_helper include with numeric cast parsing header. |
| be/src/exprs/aggregate/aggregate_function_rpc.h | Drops io_helper include after refactor. |
| be/src/exprs/aggregate/aggregate_function_product.h | Fixes logging for int128_t product by using cast-owned formatting. |
| be/src/exprs/aggregate/aggregate_function_min_max.h | Drops io_helper include and adjusts includes. |
| be/src/exprs/aggregate/aggregate_function_java_udaf.h | Drops io_helper include after refactor. |
| be/src/exprs/aggregate/aggregate_function_histogram.h | Drops io_helper include after refactor. |
| be/src/exprs/aggregate/aggregate_function_corr.h | Drops io_helper include and adds required standard/core includes. |
| be/src/exprs/aggregate/aggregate_function_collect.h | Drops io_helper include after refactor. |
| be/src/exprs/aggregate/aggregate_function_bit.h | Drops io_helper include after refactor. |
| be/src/exprs/aggregate/aggregate_function_avg_weighted.h | Drops io_helper include after refactor. |
| be/src/exprs/aggregate/aggregate_function_array_agg.h | Drops io_helper include after refactor. |
| be/src/exec/common/histogram_helpers.hpp | Drops io_helper include after refactor. |
| be/src/exec/common/hash_table/hash_table.h | Drops io_helper include after refactor. |
| be/src/core/field.cpp | Drops io_helper include after refactor. |
| be/src/core/data_type_serde/data_type_number_serde.cpp | Uses cast-owned parsing/formatting instead of helper functions. |
| be/src/core/data_type_serde/data_type_ipv6_serde.cpp | Uses CastToIPv6 parsing instead of io_helper helpers. |
| be/src/core/data_type_serde/data_type_ipv4_serde.cpp | Uses CastToIPv4 parsing instead of io_helper helpers. |
| be/src/core/data_type_serde/data_type_decimal_serde.cpp | Uses CastToDecimal::read_text instead of helper parsing. |
| be/src/core/data_type_serde/data_type_datev2_serde.cpp | Uses CastToDateV2 parsing instead of helper parsing. |
| be/src/core/data_type_serde/data_type_datetimev2_serde.cpp | Uses CastToDatetimeV2 parsing instead of helper parsing. |
| be/src/core/data_type_serde/data_type_date_or_datetime_serde.cpp | Uses CastToDateOrDatetime parsing directly and normalizes types post-parse. |
| be/src/core/data_type/data_type_quantilestate.cpp | Drops io_helper include after refactor. |
| be/src/core/data_type/data_type_number_base.cpp | Uses CastToString::from_int128 for 128-bit formatting. |
| be/src/core/data_type/data_type_ipv6.cpp | Drops io_helper include after refactor. |
| be/src/core/data_type/data_type_ipv4.cpp | Drops io_helper include after refactor. |
| be/src/core/data_type/data_type_hll.cpp | Drops io_helper include after refactor. |
| be/src/core/data_type/data_type_decimal.cpp | Drops io_helper include after refactor. |
| be/src/core/data_type/data_type_date_time.cpp | Drops io_helper include after refactor. |
| be/src/core/data_type/data_type_date_or_datetime_v2.cpp | Drops io_helper include after refactor. |
| be/src/core/data_type/data_type_date.cpp | Drops io_helper include after refactor. |
| be/src/core/data_type/data_type_bitmap.cpp | Drops io_helper include after refactor. |
Comments suppressed due to low confidence (3)
be/src/format/column_type_convert.h:1
CastParameters params;is default-initialized here. IfCastParameterscontains fundamental fields (e.g.bool is_strict) without default member initializers, this can leave them uninitialized and cause nondeterministic parsing behavior. Prefer explicit initialization (e.g.CastParameters params{};plus setting required fields, or use the same designated initialization pattern used elsewhere such as.status = Status::OK(), .is_strict = false).
be/src/exprs/function/cast/cast_to_basic_number_common.h:1- The
static_assertmessage refers toreadFloatTextImpl, which doesn’t match the new function name (try_read_float_text). Updating the message to the current API name will make future failures clearer and reduce confusion during refactors.
be/src/exprs/function/cast/cast_to_string.h:1 - This switches 128-bit formatting from
FMT_COMPILE(\"{}\")(previously used byint128_to_string) to a runtime format string. For hot formatting paths, consider usingFMT_COMPILEwithfmt::format_to(or an equivalent compiled-format approach) to avoid repeated runtime parsing of the format string.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
No issues found in this refactor after reviewing the full BE cast/serde call chain. Critical checkpoints:
Residual risk: compile-only fallout from transitive include changes is still possible, but I did not find orphaned references to the deleted helper in the repository. |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
Found 1 issue.
- Goal of the task: remove the legacy
io_helperwrapper and route parsing/formatting through the real cast owners. The refactor mostly does that, but it does not fully preserve behavior becauseTYPE_TIMEinput parsing is now broken in the generic number serde path. - Modification size/focus: mostly small and focused, but the new explicit
DataTypeNumberSerDe<TYPE_TIME>instantiation unintentionally expands the affected surface beyond a pure wrapper removal. - Concurrency: no concurrency-sensitive changes were involved.
- Lifecycle/static initialization: no special lifecycle or static initialization issues found.
- Configuration: no config changes.
- Compatibility/incompatible changes: no storage-format change was introduced, but there is a behavioral regression for deprecated
TYPE_TIMEtext input paths. - Parallel code paths: the wrapper removal was applied across most cast/serde/convert paths, but the parallel
TYPE_TIMEparse path intry_parse_impl()was left incomplete. - Special conditional checks: no problematic new conditional checks beyond the missing
TYPE_TIMEbranch. - Test coverage: the added serde/converter tests are useful, but there is still no coverage for
TYPE_TIMEgeneric-serde text parsing (from_string/from_fe_string/batch variants), which is why this regression slipped through. - Observability: no additional observability is needed for this refactor.
- Transaction/persistence: not applicable.
- Data write/atomicity: not applicable.
- FE-BE variable passing: not applicable.
- Performance: neutral overall; no clear performance regression found in the reviewed paths.
- Other issues: none beyond the regression below.
Critical checkpoints conclusion: the PR does not fully accomplish its goal yet because one reachable input path regresses.
|
run buildall |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 53 out of 53 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (6)
be/test/exec/column_type_convert_test.cpp:1
reinterpret_cast<ColumnPtr&>(src_col)is undefined-behavior prone and can easily break if the smart-pointer types/layouts differ. Prefer creating a realColumnPtrvariable (e.g.,ColumnPtr src = src_col;) and passing that toconvert, or adjustconvertto accept the concrete pointer type used by the tests.
be/test/exec/column_type_convert_test.cpp:1reinterpret_cast<ColumnPtr&>(src_col)is undefined-behavior prone and can easily break if the smart-pointer types/layouts differ. Prefer creating a realColumnPtrvariable (e.g.,ColumnPtr src = src_col;) and passing that toconvert, or adjustconvertto accept the concrete pointer type used by the tests.
be/test/exec/column_type_convert_test.cpp:1reinterpret_cast<ColumnPtr&>(src_col)is undefined-behavior prone and can easily break if the smart-pointer types/layouts differ. Prefer creating a realColumnPtrvariable (e.g.,ColumnPtr src = src_col;) and passing that toconvert, or adjustconvertto accept the concrete pointer type used by the tests.
be/test/exec/column_type_convert_test.cpp:1reinterpret_cast<ColumnPtr&>(src_col)is undefined-behavior prone and can easily break if the smart-pointer types/layouts differ. Prefer creating a realColumnPtrvariable (e.g.,ColumnPtr src = src_col;) and passing that toconvert, or adjustconvertto accept the concrete pointer type used by the tests.
be/src/format/column_type_convert.h:1- This header now directly includes several
*_impl.hppfiles. Ifcolumn_type_convert.his widely included, this can significantly increase compile times and may also increase the risk of ODR/link issues if any of those impl headers contain non-inline definitions. If feasible, keepcolumn_type_convert.hdepending on stable interface headers only and move impl-heavy dependencies behind a.cppboundary (or introduce a smaller adapter header with forward declarations/inline-only glue).
be/src/exprs/function/cast/cast_to_string.h:1 - For a hot formatting path like int128→string, using a compile-time format string (e.g.,
FMT_COMPILE(\"{}\")) can reduce formatting overhead infmt. This is optional, but worth considering iffrom_int128is used frequently in tight loops.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
I found 1 issue in the current head.
- Critical checkpoints
- Goal and correctness: Partially. The refactor does remove the
io_helperindirection, butTYPE_TIMEtext parsing is now regressed on the generic number serde path, so the current code is not behavior-preserving for all touched owners. - Scope/minimality: Yes. The change set stays focused on helper removal and owner API consolidation.
- Concurrency: Not applicable. I did not find new shared-state or lock interactions in the touched code.
- Lifecycle/static init: No issue found. No new static/global lifetime hazards were introduced.
- Config changes: Not applicable.
- Incompatible changes: No storage/protocol format change found, but there is a behavioral incompatibility on
TYPE_TIMEstring parsing. - Parallel paths: Mostly checked. The main remaining gap is that the generic serde/text path was updated without preserving the old
TYPE_TIMEhandling. - Special conditional checks: No problematic new conditionals beyond the bug below.
- Test coverage: Improved, but still missing coverage for
DataTypeNumberSerDe<TYPE_TIME>::from_string()/from_olap_string(); that is why this regression survived. - Observability: Not needed for this refactor.
- Transaction/persistence/data-write concerns: Not applicable.
- FE/BE variable passing: Not applicable.
- Performance: Acceptable overall; no blocking performance issue found.
- Other issues: None beyond the finding below.
- Finding
be/src/core/data_type_serde/data_type_number_serde.cpp:728:try_parse_impl()still only supports float/double, bool, and integer primitive types, but this file now explicitly instantiatesDataTypeNumberSerDe<TYPE_TIME>at the bottom. That makesfrom_string()andfrom_olap_string()for deprecatedTIMEcolumns go throughtry_parse_impl<TYPE_TIME, ...>(), which now throwsNOT_IMPLEMENTED_ERRORinstead of parsing the value. This is a real behavior regression for text-input paths that rely on the generic serde owner, including default values / schema-change style flows that call these APIs. Please either keepTYPE_TIMEout of the genericDataTypeNumberSerDetext-parsing expansion, or add aPT == TYPE_TIMEbranch that parses with the same float-like path used elsewhere in this file.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 58 out of 58 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| StringParser::ParseResult result; | ||
| duration = | ||
| StringParser::string_to_int<NativeType, false>(pos, end - pos, &result); | ||
| while (pos < end && *pos >= '0' && *pos <= '9') { | ||
| ++pos; | ||
| } |
There was a problem hiding this comment.
StringParser::string_to_int is called with end - pos (the rest of the pattern), but then pos is advanced only over the leading digits. This can cause parsing to fail when the number is immediately followed by non-digit characters (e.g. ")"), leaving duration as an unspecified/incorrect value while pos != prev_pos so no error is reported. Parse using the digit-only slice ([prev_pos, digit_end)) and check the parse result (or at least silence the unused result variable).
| StringParser::ParseResult result; | ||
| event_number = | ||
| StringParser::string_to_int<UInt64, false>(pos, end - pos, &result); | ||
| while (pos < end && *pos >= '0' && *pos <= '9') { | ||
| ++pos; | ||
| } |
There was a problem hiding this comment.
Same issue as the duration parsing above: string_to_int uses end - pos (including trailing pattern chars), but pos is later advanced only over digits. This can make event_number parse fail while still consuming digits, producing an incorrect event_number without triggering the pos == prev_pos check. Parse using the digit-only substring and check (or explicitly ignore) the parse result.
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Restore the original cast owner API surface after the io_helper refactor, keep the required decimal overflow and datetime conversion paths, and fix sequence_match pattern parsing so serialization/deserialization and focused cast paths behave correctly again.
### Release note
None
### Check List (For Author)
- Test: BE Unit Test
- ./run-be-ut.sh --run --filter='*FunctionCastToFloatTest.test_from_decimal*:*FunctionCastToFloatTest.test_decimalv2_keeps_scale_when_cast_to_float_and_double*:*FunctionCastToIntTest.test_from_decimal*:*FunctionCastToDecimalTest*:*ColumnTypeConvertTest*:*CastToStringTest*:*VSequenceMatchTest*' -j60
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Add focused sequence_match unit coverage for time-condition parsing and malformed numeric tokens, and make parser failures clear parser state so malformed patterns fail closed instead of leaving partial matches behind.
### Release note
None
### Check List (For Author)
- Test: BE Unit Test
- ./run-be-ut.sh --run --filter='*FunctionCastToFloatTest.test_from_decimal*:*FunctionCastToFloatTest.test_decimalv2_keeps_scale_when_cast_to_float_and_double*:*FunctionCastToIntTest.test_from_decimal*:*FunctionCastToDecimalTest*:*ColumnTypeConvertTest*:*CastToStringTest*:*VSequenceMatchTest*' -j60
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Restore CastToDecimal as the owner-facing decimal-to-decimal API so external callers do not encode decimal storage families, and route decimal-to-int/float cast loops back through the stable public wrappers instead of underscored helpers.
### Release note
None
### Check List (For Author)
- Test: BE Unit Test
- ./run-be-ut.sh --run --filter='*FunctionCastToFloatTest.test_from_decimal*:*FunctionCastToFloatTest.test_decimalv2_keeps_scale_when_cast_to_float_and_double*:*FunctionCastToIntTest.test_from_decimal*:*FunctionCastToDecimalTest*:*ColumnTypeConvertTest*:*CastToStringTest*:*VSequenceMatchTest*' -j60
- Behavior changed: No
- Does this need documentation: No
73d11be to
79669b5
Compare
|
/review |
|
run buildall |
There was a problem hiding this comment.
Overall: Request changes. The refactor itself looks mostly behavior-preserving in the reviewed live paths, but the newly added Arrow string coverage is not checking the actual internal representation it claims to validate.
Critical checkpoints:
- Goal of task: Remove
io_helperindirection and route parsing/formatting to cast owners. Mostly achieved. The new coverage intended to prove Arrow STRING -> date-like parsing is currently incorrect, so the goal is not fully demonstrated by tests. - Small/clear/focused: Yes, mostly mechanical and focused.
- Concurrency: No concurrency-sensitive changes in the reviewed paths.
- Lifecycle/static init: No new lifecycle/SIOF issue found.
- Configuration: None.
- Compatibility/storage/protocol: No incompatible API/storage/protocol change found.
- Parallel code paths: Reviewed the replaced
io_helperwrappers; current live paths remain aligned. - Special conditionals/comments: No blocking issue found.
- Test coverage: Affected paths have some BE test coverage, but the new Arrow date-like test is not validating the real packed
DateV2/DateTimeV2values. - Modified test results: One new test is incorrect and should be fixed before merge.
- Observability: N/A for this refactor.
- Transaction/persistence/data writes: N/A.
- FE/BE variable passing: N/A.
- Performance: No blocking performance issue found.
- Other issues: None beyond the incorrect test.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 59 out of 59 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
PR approved by at least one committer and no changes requested. |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
wait #62865 to pick to 4.1 |
…#62179) Problem Summary: Remove the legacy io_helper wrapper layer and move its parsing and formatting entry points onto the existing cast implementations so serde and conversion code depends on the real owners of that behavior.
…#62179) Problem Summary: Remove the legacy io_helper wrapper layer and move its parsing and formatting entry points onto the existing cast implementations so serde and conversion code depends on the real owners of that behavior.
Problem Summary: Remove the legacy io_helper wrapper layer and move its parsing and formatting entry points onto the existing cast implementations so serde and conversion code depends on the real owners of that behavior.