Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

关于group by实现方法的一点疑问 #140

Closed
echow97 opened this issue Aug 14, 2023 · 9 comments
Closed

关于group by实现方法的一点疑问 #140

echow97 opened this issue Aug 14, 2023 · 9 comments

Comments

@echow97
Copy link

echow97 commented Aug 14, 2023

我测试了两个sql语句,一个有group by,另一个没有,有group by的执行图似乎复杂很多。

执行sql语句“select avg(user_credit.income) from user_credit join user_stats on user_credit.id = user_stats.id”时,SCDB打印的execution plan为:

2023-08-14 15:43:32.8143 INFO app.go:171 [Translator] execution plan: 
digraph G {
0 [label="runsql:{in:[],out:[Out:{t_0,t_1,},],attr:[sql:select user_credit.id,user_credit.income from alice.user_credit,table_refs:[alice.user_credit],],url:[172.23.13.251:8087,]}"]
1 [label="runsql:{in:[],out:[Out:{t_2,},],attr:[sql:select user_stats.id from bob.user_stats,table_refs:[bob.user_stats],],url:[172.23.13.251:8088,]}"]
2 [label="join:{in:[Left:{t_0,},Right:{t_2,},],out:[LeftJoinIndex:{t_3,},RightJoinIndex:{t_4,},],attr:[input_party_codes:[alice bob],join_type:0,],url:[172.23.13.251:8087,172.23.13.251:8088,]}"]
3 [label="filter_by_index:{in:[Data:{t_0,t_1,},RowsIndexFilter:{t_3,},],out:[Out:{t_5,t_6,},],attr:[],url:[172.23.13.251:8087,]}"]
4 [label="filter_by_index:{in:[Data:{t_2,},RowsIndexFilter:{t_4,},],out:[Out:{t_7,},],attr:[],url:[172.23.13.251:8088,]}"]
5 [label="reduce_avg:{in:[In:{t_6,},],out:[Out:{t_8,},],attr:[],url:[172.23.13.251:8087,]}"]
6 [label="copy:{in:[In:{t_8,},],out:[Out:{t_9,},],attr:[input_party_codes:alice,output_party_codes:bob,],url:[172.23.13.251:8087,172.23.13.251:8088,]}"]
7 [label="publish:{in:[In:{t_9,},],out:[Out:{t_10,},],attr:[],url:[172.23.13.251:8088,]}"]

执行sql语句“select avg(user_credit.income) from user_credit join user_stats on user_credit.id = user_stats.id group by user_credit.credit_rank”时,SCDB打印的execution plan为:

2023-08-14 15:42:52.8143 INFO app.go:171 [Translator] execution plan: 
digraph G {
0 [label="runsql:{in:[],out:[Out:{t_0,t_1,t_2,},],attr:[sql:select user_credit.id,user_credit.credit_rank,user_credit.income from alice.user_credit,table_refs:[alice.user_credit],],url:[172.23.13.251:8087,]}"]
1 [label="runsql:{in:[],out:[Out:{t_3,},],attr:[sql:select user_stats.id from bob.user_stats,table_refs:[bob.user_stats],],url:[172.23.13.251:8088,]}"]
2 [label="join:{in:[Left:{t_0,},Right:{t_3,},],out:[LeftJoinIndex:{t_4,},RightJoinIndex:{t_5,},],attr:[input_party_codes:[alice bob],join_type:0,],url:[172.23.13.251:8087,172.23.13.251:8088,]}"]
3 [label="filter_by_index:{in:[Data:{t_0,t_1,t_2,},RowsIndexFilter:{t_4,},],out:[Out:{t_6,t_7,t_8,},],attr:[],url:[172.23.13.251:8087,]}"]
4 [label="filter_by_index:{in:[Data:{t_3,},RowsIndexFilter:{t_5,},],out:[Out:{t_9,},],attr:[],url:[172.23.13.251:8088,]}"]
5 [label="group:{in:[Key:{t_7,},],out:[GroupId:{t_10,},GroupNum:{t_11,},],attr:[],url:[172.23.13.251:8087,]}"]
6 [label="avg:{in:[GroupId:{t_10,},GroupNum:{t_11,},In:{t_8,},],out:[Out:{t_12,},],attr:[],url:[172.23.13.251:8087,]}"]
7 [label="count:{in:[GroupId:{t_10,},GroupNum:{t_11,},In:{t_10,},],out:[Out:{t_13,},],attr:[],url:[172.23.13.251:8087,]}"]
8 [label="make_constant:{in:[],out:[Out:{t_14,},],attr:[scalar:4,to_status:1,],url:[172.23.13.251:8087,172.23.13.251:8088,]}"]
9 [label="broadcast:{in:[In:{t_14,},ShapeRefTensor:{t_13,},],out:[Out:{t_15,},],attr:[],url:[172.23.13.251:8087,]}"]
10 [label="GreaterEqual:{in:[Left:{t_13,},Right:{t_15,},],out:[Out:{t_16,},],attr:[],url:[172.23.13.251:8087,]}"]
11 [label="apply_filter:{in:[Filter:{t_16,},In:{t_12,t_13,},],out:[Out:{t_17,t_18,},],attr:[],url:[172.23.13.251:8087,]}"]
12 [label="copy:{in:[In:{t_17,},],out:[Out:{t_19,},],attr:[input_party_codes:alice,output_party_codes:bob,],url:[172.23.13.251:8087,172.23.13.251:8088,]}"]
13 [label="publish:{in:[In:{t_19,},],out:[Out:{t_20,},],attr:[],url:[172.23.13.251:8088,]}"]

group by操作原理是什么?为什么语句中多了group by之后,增加了操作count、make_constant、broadcast、GreaterEqual、apply_filter?此外,make_constant貌似还有交互,其功能是什么,我阅读了Constant#,不太明白其含义。
ps:额外的一个问题,我注意到CCL中好像没提到+-*/的操作,此类操作应该归为哪一类CCL限制呢?

@jingshi-ant
Copy link
Contributor

jingshi-ant commented Aug 14, 2023

1.groupby 需要按照group独立计算聚合结果,因此多了group(分组),avg/count是和没有groupby的对应。 额外多出来的 make_constant、broadcast、GreaterEqual、apply_filter是为了过滤掉数量较小的分组(出于安全考虑,如果一个group里元素少于4,则会从结果滤除,这里的make constant就是产生常量4的)。
2.二元运算的结果的ccl是input的ccl的更严格的ccl并集。 e.g: a + b,如果a对alice可见,b对alice不可见,则a+b的结果对alice不可见。

@echow97 echow97 closed this as completed Aug 14, 2023
@echow97
Copy link
Author

echow97 commented Aug 14, 2023

明白了,感谢解答~

@echow97
Copy link
Author

echow97 commented Aug 30, 2023

有一个新的疑问。
我执行了如下语句:select base_info.company,sum(ta.ticket_count) from base_info join ta on base_info.id = ta.id group by company
SCDB日志:

digraph G {
0 [label="runsql:{in:[],out:[Out:{t_0,t_1,},],attr:[sql:select base_info.id,base_info.company from carol.base_info,table_refs:[carol.base_info],],url:[172.23.13.251:8089,]}"]
1 [label="runsql:{in:[],out:[Out:{t_2,t_3,},],attr:[sql:select ticket.id,ticket.ticket_count from alice.ticket,table_refs:[alice.ticket],],url:[172.23.13.251:8087,]}"]
2 [label="join:{in:[Left:{t_0,},Right:{t_2,},],out:[LeftJoinIndex:{t_4,},RightJoinIndex:{t_5,},],attr:[input_party_codes:[carol alice],join_type:0,],url:[172.23.13.251:8089,172.23.13.251:8087,]}"]
3 [label="filter_by_index:{in:[Data:{t_0,t_1,},RowsIndexFilter:{t_4,},],out:[Out:{t_6,t_7,},],attr:[],url:[172.23.13.251:8089,]}"]
4 [label="filter_by_index:{in:[Data:{t_2,t_3,},RowsIndexFilter:{t_5,},],out:[Out:{t_8,t_9,},],attr:[],url:[172.23.13.251:8087,]}"]
5 [label="make_share:{in:[In:{t_7,},],out:[Out:{t_10,},],attr:[],url:[172.23.13.251:8087,172.23.13.251:8089,]}"]
6 [label="make_share:{in:[In:{t_6,},],out:[Out:{t_11,},],attr:[],url:[172.23.13.251:8087,172.23.13.251:8089,]}"]
7 [label="make_share:{in:[In:{t_8,},],out:[Out:{t_12,},],attr:[],url:[172.23.13.251:8087,172.23.13.251:8089,]}"]
8 [label="make_share:{in:[In:{t_9,},],out:[Out:{t_13,},],attr:[],url:[172.23.13.251:8087,172.23.13.251:8089,]}"]
9 [label="sort:{in:[In:{t_10,t_11,t_10,t_12,t_13,},Key:{t_10,},],out:[Out:{t_14,t_15,t_16,t_17,t_18,},],attr:[reverse:false,],url:[172.23.13.251:8087,172.23.13.251:8089,]}"]
10 [label="group_mark:{in:[Key:{t_14,},],out:[Group:{t_19,},],attr:[],url:[172.23.13.251:8087,172.23.13.251:8089,]}"]
11 [label="sum:{in:[Group:{t_19,},In:{t_18,},],out:[Out:{t_20,},],attr:[],url:[172.23.13.251:8087,172.23.13.251:8089,]}"]
12 [label="count:{in:[Group:{t_19,},In:{t_19,},],out:[Out:{t_21,},],attr:[],url:[172.23.13.251:8087,172.23.13.251:8089,]}"]
13 [label="shuffle:{in:[In:{t_20,t_16,t_21,t_19,},],out:[Out:{t_22,t_23,t_24,t_25,},],attr:[],url:[172.23.13.251:8087,172.23.13.251:8089,]}"]
14 [label="make_public:{in:[In:{t_25,},],out:[Out:{t_26,},],attr:[],url:[172.23.13.251:8087,172.23.13.251:8089,]}"]
15 [label="filter:{in:[Filter:{t_26,},In:{t_22,t_23,t_24,},],out:[Out:{t_27,t_28,t_29,},],attr:[],url:[172.23.13.251:8087,172.23.13.251:8089,]}"]
16 [label="make_constant:{in:[],out:[Out:{t_30,},],attr:[scalar:4,to_status:1,],url:[172.23.13.251:8087,172.23.13.251:8089,]}"]
17 [label="broadcast:{in:[In:{t_30,},ShapeRefTensor:{t_29,},],out:[Out:{t_31,},],attr:[],url:[172.23.13.251:8087,172.23.13.251:8089,]}"]
18 [label="make_private:{in:[In:{t_29,},],out:[Out:{t_32,},],attr:[reveal_to:carol,],url:[172.23.13.251:8087,172.23.13.251:8089,]}"]
19 [label="GreaterEqual:{in:[Left:{t_32,},Right:{t_31,},],out:[Out:{t_33,},],attr:[],url:[172.23.13.251:8089,]}"]
20 [label="make_private:{in:[In:{t_28,},],out:[Out:{t_34,},],attr:[reveal_to:carol,],url:[172.23.13.251:8087,172.23.13.251:8089,]}"]
21 [label="make_private:{in:[In:{t_27,},],out:[Out:{t_35,},],attr:[reveal_to:carol,],url:[172.23.13.251:8087,172.23.13.251:8089,]}"]
22 [label="apply_filter:{in:[Filter:{t_33,},In:{t_34,t_35,t_32,},],out:[Out:{t_36,t_37,t_38,},],attr:[],url:[172.23.13.251:8089,]}"]
23 [label="publish:{in:[In:{t_36,t_37,},],out:[Out:{t_39,t_40,},],attr:[],url:[172.23.13.251:8089,]}"]
0 -> 2 [label = "t_0:{id:PRIVATE:STRING}"]
0 -> 3 [label = "t_0:{id:PRIVATE:STRING}"]
0 -> 3 [label = "t_1:{company:PRIVATE:STRING}"]
1 -> 2 [label = "t_2:{id:PRIVATE:STRING}"]
1 -> 4 [label = "t_2:{id:PRIVATE:STRING}"]
1 -> 4 [label = "t_3:{ticket_count:PRIVATE:INT64}"]
10 -> 11 [label = "t_19:{group_mark:SECRET:BOOL}"]
10 -> 12 [label = "t_19:{group_mark:SECRET:BOOL}"]
10 -> 12 [label = "t_19:{group_mark:SECRET:BOOL}"]
10 -> 13 [label = "t_19:{group_mark:SECRET:BOOL}"]
11 -> 13 [label = "t_20:{ticket_count:SECRET:INT64}"]
12 -> 13 [label = "t_21:{group_mark:SECRET:INT64}"]
13 -> 14 [label = "t_25:{group_mark:SECRET:BOOL}"]
13 -> 15 [label = "t_22:{ticket_count:SECRET:INT64}"]
13 -> 15 [label = "t_23:{company:SECRET:STRING}"]
13 -> 15 [label = "t_24:{group_mark:SECRET:INT64}"]
14 -> 15 [label = "t_26:{group_mark:PUBLIC:BOOL}"]
15 -> 17 [label = "t_29:{group_mark:SECRET:INT64}"]
15 -> 18 [label = "t_29:{group_mark:SECRET:INT64}"]
15 -> 20 [label = "t_28:{company:SECRET:STRING}"]
15 -> 21 [label = "t_27:{ticket_count:SECRET:INT64}"]
16 -> 17 [label = "t_30:{constant_data:PUBLIC:INT64}"]
17 -> 19 [label = "t_31:{constant_data:PUBLIC:INT64}"]
18 -> 19 [label = "t_32:{group_mark:PRIVATE:INT64}"]
18 -> 22 [label = "t_32:{group_mark:PRIVATE:INT64}"]
19 -> 22 [label = "t_33:{GreaterEqual_out:PRIVATE:BOOL}"]
2 -> 3 [label = "t_4:{id:PRIVATE:INT64}"]
2 -> 4 [label = "t_5:{id:PRIVATE:INT64}"]
20 -> 22 [label = "t_34:{company:PRIVATE:STRING}"]
21 -> 22 [label = "t_35:{ticket_count:PRIVATE:INT64}"]
22 -> 23 [label = "t_36:{company:PRIVATE:STRING}"]
22 -> 23 [label = "t_37:{ticket_count:PRIVATE:INT64}"]
3 -> 5 [label = "t_7:{company:PRIVATE:STRING}"]
3 -> 6 [label = "t_6:{id:PRIVATE:STRING}"]
4 -> 7 [label = "t_8:{id:PRIVATE:STRING}"]
4 -> 8 [label = "t_9:{ticket_count:PRIVATE:INT64}"]
5 -> 9 [label = "t_10:{company:SECRET:STRING}"]
5 -> 9 [label = "t_10:{company:SECRET:STRING}"]
5 -> 9 [label = "t_10:{company:SECRET:STRING}"]
6 -> 9 [label = "t_11:{id:SECRET:STRING}"]
7 -> 9 [label = "t_12:{id:SECRET:STRING}"]
8 -> 9 [label = "t_13:{ticket_count:SECRET:INT64}"]
9 -> 10 [label = "t_14:{company:SECRET:STRING}"]
9 -> 11 [label = "t_18:{ticket_count:SECRET:INT64}"]
9 -> 13 [label = "t_16:{company:SECRET:STRING}"]
}

问题如下:
(1)其中的sort操作其实将key(也就是groupby的key)排序之后其他列跟着调换顺序,然后用group_mark标记下一个分组的节点对吧?
(2)然后其中的shuffle操作是干什么的?Shuffle#给出的例子好像是逆转,但是在这个语句中起了什么作用呢?
期待回复~

@echow97 echow97 reopened this Aug 30, 2023
@echow97
Copy link
Author

echow97 commented Aug 30, 2023

(3) "make_private:{in:[In:{t_28,},],out:[Out:{t_34,},],attr:[reveal_to:carol,],url:[172.23.13.251:8087,172.23.13.251:8089,]}"]是指将各方将秘密共享份额发送给carol对吧?Alice是否能知道恢复的结果?
(4) publish算子披露 DAG 结果。[label="publish:{in:[In:{t_19,},],out:[Out:{t_20,},],attr:[],url:[172.23.13.251:8088,]}"]是指8088这台机器把结果返回给SCDB,然后SCDB再返回客户端吗?这个最终的计算结果其他参与方能否获取?SCDB能否看到最终结果?

@echow97
Copy link
Author

echow97 commented Sep 1, 2023

@jingshi-ant 你好,可以帮忙解答一下吗?0.0

@jingshi-ant
Copy link
Contributor

(1)sort、group_mark的理解没有问题
(2)shuffle是为了打乱group_mark,避免每个分组有多少行的信息泄露,shuffle后就可以把group_mark 从密文转为明文了。
(3)make_private的时候只有reveal_to的参与方可以获得结果(大家把分片发给carol,由carol去恢复)
(4)publish的理解没问题,结果从Engine到SCDB再到客户端。SCDB可以看到最终结果,其它参与方不知道(SCDB布置在可信第三方)。

@echow97
Copy link
Author

echow97 commented Sep 4, 2023

感谢解答,(1)(3)(4)明白了
(2)根据ObliviousGroupMark#。Group由0和1组成,1代表进入下一个分组,进行shuffle操作,则g roup=0的行将失去分组特征,意味着后续步骤中,group非1行的行数据将不会使用到了,是这样吗?

@jingshi-ant
Copy link
Contributor

感谢解答,(1)(3)(4)明白了 (2)根据ObliviousGroupMark#。Group由0和1组成,1代表进入下一个分组,进行shuffle操作,则g roup=0的行将失去分组特征,意味着后续步骤中,group非1行的行数据将不会使用到了,是这样吗?

对的,只有值为1的行会留下来(包含最终结果),其它的会filter掉。

@echow97
Copy link
Author

echow97 commented Sep 4, 2023

感谢解答,(1)(3)(4)明白了 (2)根据ObliviousGroupMark#。Group由0和1组成,1代表进入下一个分组,进行shuffle操作,则g roup=0的行将失去分组特征,意味着后续步骤中,group非1行的行数据将不会使用到了,是这样吗?

对的,只有值为1的行会留下来(包含最终结果),其它的会filter掉。

明白了,感谢解答,收获很大!
祝社区越来越好。

@echow97 echow97 closed this as completed Sep 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants