---
layout: page
title: Supported Operators and Functions
nav_order: 5
---

The Operators and Functions Support Progress

Gluten is still in active development. Here is a list of supported operators and functions.

Velox backend

The same function can have different semantics in Presto and Spark. Velox implements functions in its Presto category by default; when a function's semantics differ from Spark's, a separate version is implemented in the Spark category. Gluten therefore looks a function up in Velox's Spark category first and refers to the Presto category only if the function is not implemented there.
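
The lookup order described above can be sketched as follows. This is a minimal illustration, not Gluten's actual resolution code: the `FUNCTION_MAP` dictionary, the `VeloxNames` type, and the `resolve` helper are hypothetical, and the entries are a small excerpt from the function table in this document. Placing `strpos` in the Presto column is an assumption, since the flattened table does not show which column a single name belongs to.

```python
from typing import Dict, NamedTuple, Optional


class VeloxNames(NamedTuple):
    """Velox names for one Spark function, one per category (None if absent)."""
    presto: Optional[str]
    spark: Optional[str]


# Hypothetical excerpt of the function table; not a complete mapping.
FUNCTION_MAP: Dict[str, VeloxNames] = {
    "%": VeloxNames(presto="mod", spark="remainder"),
    "+": VeloxNames(presto="plus", spark="add"),
    "locate": VeloxNames(presto="strpos", spark=None),  # column placement assumed
    "initcap": VeloxNames(presto=None, spark=None),     # no Velox entry listed
}


def resolve(spark_function: str) -> Optional[str]:
    """Prefer the Spark category, then the Presto category, else None."""
    names = FUNCTION_MAP.get(spark_function)
    if names is None:
        return None
    if names.spark is not None:
        return f"velox/spark:{names.spark}"
    if names.presto is not None:
        return f"velox/presto:{names.presto}"
    return None
```

With this sketch, `resolve("%")` picks the Spark-category `remainder` even though a Presto-category `mod` exists, while `resolve("locate")` falls through to the Presto category.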

Spark 3.3 has 387 functions in total and Velox has 204. Gluten supports 94 frequently used functions, all offloaded to Velox: 62 are implemented in velox/spark and 32 in velox/presto, as shown in the picture below.

[Image: support]

| Value | Description |
| --- | --- |
| S | (Supported) Gluten or Velox fully supports this type. |
|  | (Not Applicable) Neither Gluten nor Velox supports this type in this situation. |
| PS | (Partial Support) Apache Spark supports this type, but Velox only partially supports it. |
| NS | (Not Supported) Apache Spark supports this type, but the Velox backend does not. |

| Value | Description |
| --- | --- |
| Mismatched | Some functions implemented by Velox return results that do not match Apache Spark's. These are marked "Mismatched". |
| Ansi Off | Velox does not fully support ANSI mode. Once ANSI mode is enabled, Gluten falls back to vanilla Spark. |
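
As a rough illustration of the "Ansi Off" mark, the sketch below lists which functions in a query would be affected when ANSI mode is turned on (in Spark SQL, ANSI mode is controlled by the `spark.sql.ansi.enabled` setting). The `RESTRICTIONS` dictionary and the helper name are hypothetical; the entries are copied from a few rows of the function table. Checking per function is an assumption made for the example, since Gluten may decide on the fallback at a coarser granularity.

```python
from typing import Dict, List

# Restriction marks for a few functions, copied from the function table.
# An empty string means the row carries no restriction mark.
RESTRICTIONS: Dict[str, str] = {
    "+": "Ansi Off",
    "abs": "Ansi Off",
    "sum": "Ansi Off",
    "lower": "",
}


def functions_blocked_by_ansi(functions: List[str], ansi_enabled: bool) -> List[str]:
    """Return the functions that are marked "Ansi Off" when ANSI mode is on."""
    if not ansi_enabled:
        return []
    return [f for f in functions if RESTRICTIONS.get(f) == "Ansi Off"]
```

A non-empty result for a query means at least one of its functions is only supported with ANSI mode off.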

Operator Map

Gluten supports 14 operators. (Scroll right to see all data types.)

| Executor | Description | Gluten Name | Velox Name | BOOLEAN | BYTE | SHORT | INT | LONG | FLOAT | DOUBLE | STRING | NULL | BINARY | ARRAY | MAP | STRUCT(ROW) | DATE | TIMESTAMP | DECIMAL | CALENDAR | UDT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FileSourceScanExec | Reading data from files, often from Hive tables | FileSourceScanExecTransformer | TableScanNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| BatchScanExec | The backend for most file input | BatchScanExecTransformer | TableScanNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| FilterExec | The backend for most filter statements | FilterExecTransformer | FilterNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ProjectExec | The backend for most select, withColumn and dropColumn statements | ProjectExecTransformer | ProjectNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| HashAggregateExec | The backend for hash based aggregations | HashAggregateBaseTransformer | AggregationNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| BroadcastHashJoinExec | Implementation of join using broadcast data | BroadcastHashJoinExecTransformer | HashJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ShuffledHashJoinExec | Implementation of join using hashed shuffled data | ShuffleHashJoinExecTransformer | HashJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| SortExec | The backend for the sort operator | SortExecTransformer | OrderByNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| SortMergeJoinExec | Sort merge join, replacing with shuffled hash join | SortMergeJoinExecTransformer | MergeJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| WindowExec | Window-operator backend | WindowExecTransformer | WindowNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| GlobalLimitExec | Limiting of results across partitions | LimitTransformer | LimitNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| LocalLimitExec | Per-partition limiting of results | LimitTransformer | LimitNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ExpandExec | The backend for the expand operator | ExpandExecTransformer | GroupIdNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| UnionExec | The backend for the union operator | UnionExecTransformer | N | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| DataWritingCommandExec | Writing data | Y | TableWriteNode | S | S | S | S | S | S | S | S | S | S | S | NS | S | S | NS | S | NS | NS |
| CartesianProductExec | Implementation of join using brute force | N | CrossJoinNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| ShuffleExchangeExec | The backend for most data being exchanged between processes | N | ExchangeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
|  | The unnest operation expands arrays and maps into separate columns. Arrays are expanded into a single column, and maps are expanded into two columns (key, value). Can be used to expand multiple columns. In this case produces as many rows as the highest cardinality array or map (the other columns are padded with nulls). Optionally can produce an ordinality column that specifies the row number starting with 1. | N | UnnestNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
|  | The top-n operation reorders a dataset based on one or more identified sort fields as well as a sorting order. Rather than sort the entire dataset, the top-n will only maintain the total number of records required to ensure a limited output. A top-n is a combination of a logical sort and logical limit operations. | N | TopNNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
|  | The partitioned output operation redistributes data based on zero or more distribution fields. | N | PartitionedOutputNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
|  | The values operation returns specified data. | N | ValuesNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
|  | A receiving operation that merges multiple ordered streams to maintain orderedness. Input streams are coming from remote exchange or shuffle. | N | MergeExchangeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
|  | An operation that merges multiple ordered streams to maintain orderedness. Input streams are coming from local exchange. | N | LocalMergeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
|  | A local exchange operation that partitions input data into multiple streams or combines data from multiple streams into a single stream. | N | LocalPartitionNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
|  | The enforce single row operation checks that input contains at most one row and returns that row unmodified. If input is empty, returns a single row with all values set to null. If input contains more than one row raises an exception. | N | EnforceSingleRowNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
|  | The assign unique id operation adds one column at the end of the input columns with unique value per row. This unique value marks each output row to be unique among all output rows of this operator. | N | AssignUniqueIdNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | S | S | S | S | S |
| CollectLimitExec | Reduce to single partition and apply limit | N | N | | | | | | | | | | | | | | | | | | |
| BroadcastExchangeExec | The backend for broadcast exchange of data | N | N | | | | | | | | | | | | | | | | | | |
| ObjectHashAggregateExec | The backend for hash based aggregations supporting TypedImperativeAggregate functions | N | N | | | | | | | | | | | | | | | | | | |
| SortAggregateExec | The backend for sort based aggregations | N | N | | | | | | | | | | | | | | | | | | |
| CoalesceExec | Reduce the partition numbers | N | N | | | | | | | | | | | | | | | | | | |
| GenerateExec | The backend for operations that generate more output rows than input rows like explode | N | N | | | | | | | | | | | | | | | | | | |
| RangeExec | The backend for range operator | N | N | | | | | | | | | | | | | | | | | | |
| SampleExec | The backend for the sample operator | N | N | | | | | | | | | | | | | | | | | | |
| SubqueryBroadcastExec | Plan to collect and transform the broadcast key values | N | N | | | | | | | | | | | | | | | | | | |
| TakeOrderedAndProjectExec | Take the first limit elements as defined by the sortOrder, and do projection if needed | N | N | | | | | | | | | | | | | | | | | | |
| CustomShuffleReaderExec | A wrapper of shuffle query stage | N | N | | | | | | | | | | | | | | | | | | |
| InMemoryTableScanExec | Implementation of InMemoryTableScanExec to use GPU accelerated caching | N | N | | | | | | | | | | | | | | | | | | |
| BroadcastNestedLoopJoinExec | Implementation of join using brute force. Full outer joins and joins where the broadcast side matches the join side (e.g.: LeftOuter with left broadcast) are not supported | N | N | | | | | | | | | | | | | | | | | | |
| AggregateInPandasExec | The backend for an Aggregation Pandas UDF, this accelerates the data transfer between the Java process and the Python process. It also supports scheduling GPU resources for the Python process when enabled. | N | N | | | | | | | | | | | | | | | | | | |
| ArrowEvalPythonExec | The backend of the Scalar Pandas UDFs. Accelerates the data transfer between the Java process and the Python process. It also supports scheduling GPU resources for the Python process when enabled | N | N | | | | | | | | | | | | | | | | | | |
| FlatMapGroupsInPandasExec | The backend for Flat Map Groups Pandas UDF, Accelerates the data transfer between the Java process and the Python process. It also supports scheduling GPU resources for the Python process when enabled. | N | N | | | | | | | | | | | | | | | | | | |
| MapInPandasExec | The backend for Map Pandas Iterator UDF. Accelerates the data transfer between the Java process and the Python process. It also supports scheduling GPU resources for the Python process when enabled. | N | N | | | | | | | | | | | | | | | | | | |
| WindowInPandasExec | The backend for Window Aggregation Pandas UDF, Accelerates the data transfer between the Java process and the Python process. It also supports scheduling GPU resources for the Python process when enabled. For now it only supports row based window frame. | N | N | | | | | | | | | | | | | | | | | | |
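
The type columns of the operator table can be read as a lookup matrix: an operator is fully covered for a given schema only if every column type is marked S. The sketch below encodes two rows of the table and reports the types that are not marked S. The `OPERATOR_MAP`, `parse_row`, and `unsupported_types` names are hypothetical and exist only for this example; Gluten's real validation logic is not shown in this document.

```python
from typing import Dict, List

# Type columns in the order used by the operator table.
TYPES: List[str] = [
    "BOOLEAN", "BYTE", "SHORT", "INT", "LONG", "FLOAT", "DOUBLE", "STRING",
    "NULL", "BINARY", "ARRAY", "MAP", "STRUCT(ROW)", "DATE", "TIMESTAMP",
    "DECIMAL", "CALENDAR", "UDT",
]


def parse_row(cells: str) -> Dict[str, str]:
    """Turn a whitespace-separated row of S/NS marks into a type -> mark dict."""
    values = cells.split()
    if len(values) != len(TYPES):
        raise ValueError(f"expected {len(TYPES)} cells, got {len(values)}")
    return dict(zip(TYPES, values))


# Two rows copied from the operator table above.
OPERATOR_MAP: Dict[str, Dict[str, str]] = {
    "FilterExec": parse_row("S S S S S S S S S S NS NS NS S NS NS NS NS"),
    "DataWritingCommandExec": parse_row("S S S S S S S S S S S NS S S NS S NS NS"),
}


def unsupported_types(operator: str, column_types: List[str]) -> List[str]:
    """Return the column types that are not marked S for the given operator."""
    row = OPERATOR_MAP[operator]
    return [t for t in column_types if row.get(t) != "S"]
```

For example, `unsupported_types("FilterExec", ["INT", "STRING", "ARRAY"])` returns `["ARRAY"]`, because ARRAY is marked NS in the FilterExec row.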

Function support

Gluten supports 94 functions. (Scroll right to see all data types.)

Spark Functions Velox/Presto Functions Velox/Spark Functions Gluten Restrictions BOOLEAN BYTE SHORT INT LONG FLOAT DOUBLE DATE TIMESTAMP STRING DECIMAL NULL BINARY CALENDAR ARRAY MAP STRUCT UDT
! not S S S S S S S S S
&
AND S S S S S S S S S
| bitwise_or S S S S
OR
^ bitwise_xor S S S S
~ bitwise_not S S S S
< lt lt S S S S S S S S S
<= lte lte S S S S S S S S S
> gt gt S S S S S S S S S
>= gte gte S S S S S S S S S
=
== eq eq S S S S S S S S S
<==>
% mod remainder S Ansi Off S S S S S
/ divide divide S Ansi Off S S S S S
* multiply multiply S Ansi Off S S S S S
+ plus add S Ansi Off S S S S S
- minus subtract S Ansi Off S S S S S
isnull is_null isnull S S S S S S S S S
isnotnull isnotnull S S S S S S S S S
in in S S S S S S S S S
between between between S S S S S S S S S S
if
<> neq neq S S S S S S S S S
asc_nulls_first
asc_nulls_last
asc
ascii ascii S S
base64, unbase64
bin
bit_length
char_length, character_length length length S
chr, char chr chr S S
concat concat concat S S
concat_ws S
contains contains S S
endswith endsWith S S
first, first_value
initcap
instr instr S
lcase, lower lower lower S S
left
length length length S S
lower lower lower S S
locate strpos Mismatched S
lpad lpad S S
ltrim ltrim S S
position
printf
repeat
reverse reverse S S
replace replace replace S S
right
rpad rpad S S
rtrim rtrim S S
second
soundex
space
split split split Mismatched
split_part split_part Mismatched
substr, substring substr substring S S
substring_index
startswith startsWith S S
translate
trim trim S S
ucase, upper upper upper S S
like like like S S
rlike rlike rlike S Not support lookaround
regexp rlike rlike S Not support lookaround S
regexp_extract regexp_extract regexp_extract S Not support lookaround S
regexp_extract_all regexp_extract_all S Not support lookaround S
regexp_like rlike rlike S Not support lookaround S
regexp_replace regexp_replace Mismatched S
abs abs abs S Ansi Off S S S S S
acos acos S S S S S S
asin asin S S S S S S
atan, atan2 atan, atan2 S S S S S S
bitwiseNOT S
bround
cbrt cbrt
ceil ceil ceil S S S S S S
ceiling ceiling S S S S S S
cos cos S S S S S S
cosh cosh S S S S S S
degrees degrees S S S S S S
factorial
floor floor floor S S S S S S
e e S S S S S S
exp exp exp S S S S S S
greatest greatest greatest S S S S S S
hex, unhex
hypot
least least least S S S S S S
log ln S
log10 log10 S S S S S S
log1p
log2 log2 S S S S S S
ln ln S S S S S S
pi pi S S S S S S
pow pow S S S S S
power power power S S S S S S
pmod pmod S Ansi Off S S S S S
quarter
radians radians S S S S S S
rand rand S S S S S
random random S S S S S
rint
round round round S S S S S S
width_bucket width_bucket S S S S S
sequence S S S S S
sign sign S S S S S S
signum
sin, sinh sin S S S S S S
sqrt sqrt S S S S S S
tan, tanh tan, tanh S S S S S S
array_contains contains array_contains S
array_distinct array_distinct S
array_except array_except S
array_intersect array_intersect array_intersect S
array_join array_join S
array_max array_max S
array_min array_min S
array_position array_position S
array_remove
array_repeat
array_sort sort_array sort_array S
array_union
array
arrays_overlap
arrays_zip
element_at element_at element_at S
flatten
filter filter
map map map S
map_concat map_concat S
map_entries map_entries S
map_filter map_filter map_filter S
map_from_arrays map_from_arrays map_from_arrays S
map_from_entries
map_keys
map_values
negate
size size S
slice slice S
sort_array sort_array sort_array S
explode
add_months
current_date
current_timestamp
date
date_add date_add date_add S S S S S
date_format date_format S S
date_sub
date_trunc date_trunc S S
datediff date_diff S
day day S S S
dayofmonth day_of_month S S S
dayofweek day_of_week S S S
dayofyear day_of_year S S S
from_unixtime from_unixtime S
from_utc_timestamp
hour hour S S
last_day
month month S S S
minute minute S S
months_between
next_day
now
quarter quarter S S S
second second S S
timestamp
to_date
to_timestamp
to_unix_timestamp to_unixtime S
to_utc_timestamp
trunc
unix_timestamp
weekofyear
window
year year S S S
year_of_week year_of_week S
approx_count_distinct approx_distinct S S S S S S S S S S
avg avg avg S Ansi Off S S S S S
collect_list
collect_set
corr corr
count count count S S S S S S
countDistinct
covar_pop covar_pop S S S S S
covar_samp covar_samp S S S S S
first
grouping_id
kurtosis
last
max max max S S S S S S
mean
min min min S S S S S S
skewness
stddev_pop stddev_pop stddev_pop S S S S S S
stddev_samp stddev_samp stddev_samp S S S S S S
stddev stddev stddev S S S S S S
sum sum sum S Ansi Off S S S S S
sumDistinct
var_pop var_pop S S S S S
var_samp var_samp S S S S S
variance variance S S S S S
cume_dist cume_dist S
dense_rank dense_rank S
lag
lead
ntile
percent_rank percent_rank
rank rank S
row_number row_number S S S S
from_json
get_json_object json_extract_scalar get_json_object S S
input_file_name
json_array_length json_array_length S S
json_tuple
schema_of_json
sha1 sha1 S S
sha2 sha2 S S
to_json
callUDF
crc32 crc32 S S
decode
encode
hash hash hash S S S S S S S S
md5 md5 S S
monotonically_increasing_id
shiftLeft bitwise_left_shift S S S S
shiftRight bitwise_right_shift S S S S
shiftRightUnsigned
shuffle
spark_partition_id
udf
when
xxhash64 xxhash64
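
Several regular-expression rows in the table above (rlike, regexp, regexp_extract, regexp_extract_all, regexp_like) carry the restriction "Not support lookaround". A simple pre-check for such patterns might look like the sketch below. The `uses_lookaround` helper is hypothetical and deliberately naive: it scans for the four lookaround openers and does not account for escaped parentheses or character classes, so it can report false positives.

```python
import re

# Matches the openers of lookahead "(?=", "(?!" and lookbehind "(?<=", "(?<!".
# Named groups such as "(?<name>" and non-capturing groups "(?:" do not match.
_LOOKAROUND_OPENER = re.compile(r"\(\?(?:=|!|<=|<!)")


def uses_lookaround(pattern: str) -> bool:
    """Return True if the pattern text contains a lookaround opener."""
    return _LOOKAROUND_OPENER.search(pattern) is not None
```

A pattern such as `foo(?=bar)` is flagged, while `(?:ab)+c` is not.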

ClickHouse backend

To be added