group by

第十道面试题

需求、数据、建表等

需求

编写Hive的HQL语句：

输出每个产品，在2018年期间，每个月的净利润，日均成本。
输出每个产品，在2018年3月中每一天与上一天相比，成本的变化。
输出2018年4月，有多少个产品总收入大于22000元，必须用一句SQL语句实现，且不允许使用关联表查询、子查询。
输出2018年4月，总收入最高的那个产品，每日的收入，成本，过程使用over()函数。

数据： ( 日期dt，产品id，当日收入income，当日成本cost )

2018-03-01,a,3000,2500
2018-03-01,b,4000,3200
2018-03-01,c,3200,2400
2018-03-01,d,3000,2500
2018-03-02,a,3000,2500
2018-03-02,b,1500,800
2018-03-02,c,2600,1800
2018-03-02,d,2400,1000
2018-03-03,a,3100,2400
2018-03-03,b,2500,2100
2018-03-03,c,4000,1200
2018-03-03,d,2500,1900
2018-03-04,a,2800,2400
2018-03-04,b,3200,2700
2018-03-04,c,2900,2200
2018-03-04,d,2700,2500
2018-03-05,a,2700,1000
2018-03-05,b,1800,200
2018-03-05,c,5600,2200
2018-03-05,d,1200,1000
2018-03-06,a,2900,2500
2018-03-06,b,4500,2500
2018-03-06,c,6700,2300
2018-03-06,d,7500,5000
2018-04-01,a,3000,2500
2018-04-01,b,4000,3200
2018-04-01,c,3200,2400
2018-04-01,d,3000,2500
2018-04-02,a,3000,2500
2018-04-02,b,1500,800
2018-04-02,c,4600,1800
2018-04-02,d,2400,1000
2018-04-03,a,6100,2400
2018-04-03,b,4500,2100
2018-04-03,c,6000,1200
2018-04-03,d,3500,1900
2018-04-04,a,2800,2400
2018-04-04,b,3200,2700
2018-04-04,c,2900,2200
2018-04-04,d,2700,2500
2018-04-05,a,4700,1000
2018-04-05,b,3800,200
2018-04-05,c,5600,2200
2018-04-05,d,5200,1000
2018-04-06,a,2900,2500
2018-04-06,b,4500,2500
2018-04-06,c,6700,2300
2018-04-06,d,7500,5000

建表、导入数据

-- 创建表并指定字段分隔符为逗号（，）
create table if not exists goods(dt string, id string, income int, cost int) row format delimited fields terminated by ','; 

-- 准备数据，放置在服务器文件系统或HDFS。此处放在服务器文件系统上（/root/yber/data/goods_data.txt）

-- 加载数据到表
load data local inpath '/root/yber/data/goods_data.txt' into table goods;

思路与实现步骤

思路分析

聚合函数+group by
实现步骤

输出2018年期间，每个产品的每个月净利润、日均成本。

语句：

-- 首先字符串函数substring得到年、月、日并计算每日利润。
select 
substring(dt,1,4) as year,
substring(dt,6,2) as month,
dt as day,
id,
income,
cost,
(income-cost) as profit 
from 
goods;

-- 然后通过group by得到3月总利润和平均日成本。
-- where month = ‘03’既可以限制在内查询，也可以限制在内查询，一般来说，限制在内部会减少外部查询工作量。
select id,month,sum(profit) as month_sum_profit ,avg(cost) as daily_cost
from (
  select 
    substring(dt,1,4) as year,
    substring(dt,6,2) as month,
    dt as day,
    id,
    income,
    cost,
    (income-cost) as profit
  from 
    goods
) a
where month = '03'
group by id,month

结果：

id	month	month_sum_profit	daily_cost
a	3	4200	2216.666667
b	3	6000	1916.666667
c	3	12900	2016.666667
d	3	5400	2316.666667

输出每个产品，在2018年3月中每一天与上一天相比，成本的变化。

方法一：lag() over(partition by order by)

select id,day,cost,last_cost,(cost-last_cost) as cost_change
from (
  select 
    substring(dt,1,4) as year,
    substring(dt,6,2) as month,
    dt as day,
    id,
    cost,
    -- 获取上一条成本，第一条数据没有时赋默认值0；分区范围时id和月份；排序按照id和日期（day）
    lag(cost,1,0) over(partition by substring(dt,6,2),id order by id,dt) as last_cost
  from 
    goods
  where 
    -- 在内部过滤3月份，此时不能使用别名month。（原因与sql执行顺序有关！）
    substring(dt,6,2) ='03'
)a
;

方法二：自连接

语句：

select aa.aid as id,aa.bday as day,(aa.bcost-aa.acost) as difference
from
  (
  select a.id as aid,a.cost as acost,a.day as aday,b.cost as bcost,b.day as bday from 
    (
    select id,income,cost,substring(p_date,1,4) as year,substring(p_date,6,2) as month,substring(p_date,9,2) as day
    from product
    where substring(p_date,6,2)='03' and substring(p_date,1,4)='2018'
    order by id,month,day
    ) a
  left join
    (
    select id,income,cost,substring(p_date,1,4) as year,substring(p_date,6,2) as month,substring(p_date,9,2) as day
    from product
    where substring(p_date,6,2)='03' and substring(p_date,1,4)='2018'
    order by id,month,day
    ) b
  on a.id=b.id and a.month=b.month and a.day=b.day-1
  ) aa
where aa.bcost is not null;

结果（lag）：

id	day	cost	last_cost	cost_change
a	2018/3/1	2500	0	2500
a	2018/3/2	2500	2500	0
a	2018/3/3	2400	2500	-100
a	2018/3/4	2400	2400	0
a	2018/3/5	1000	2400	-1400
a	2018/3/6	2500	1000	1500
b	2018/3/1	3200	0	3200
b	2018/3/2	800	3200	-2400
b	2018/3/3	2100	800	1300
b	2018/3/4	2700	2100	600
b	2018/3/5	200	2700	-2500
b	2018/3/6	2500	200	2300
c	2018/3/1	2400	0	2400
c	2018/3/2	1800	2400	-600
c	2018/3/3	1200	1800	-600
c	2018/3/4	2200	1200	1000
c	2018/3/5	2200	2200	0
c	2018/3/6	2300	2200	100
d	2018/3/1	2500	0	2500
d	2018/3/2	1000	2500	-1500
d	2018/3/3	1900	1000	900
d	2018/3/4	2500	1900	600
d	2018/3/5	1000	2500	-1500
d	2018/3/6	5000	1000	4000

输出2018年4月，有多少个产品总收入大于22000元，必须用一句SQL语句实现，且不允许使用关联表查询、子查询。

语句：

select 
  substring(dt,1,4) as year,
  substring(dt,6,2) as month,
  id,
  sum(income) as month_income
from 
  goods
where 
  -- 不能用别名
  substring(dt,1,4) = '2018' and
  substring(dt,6,2) = '04'
group by
  -- 不能用别名
  substring(dt,1,4),
  substring(dt,6,2),
  id
having 
  -- having 可以使用别名
  month_income >= 22000;

结果：

year	month	id	month_income
2018	4	a	22500
2018	4	c	29000
2018	4	d	24300

输出2018年4月，总收入最高的那个产品，每日的收入，成本，过程使用over()函数。

语句：

-- 此查询结果如最下方图片所示！！！
select 
  *,
  -- 对内层当月总收入rank排序（月份为分区，总收入降序），rn=1则表示总收入最高的产品。
  rank() over(partition by month order by sum_income desc) as rn
from (
    select 
      substring(dt,1,4) as year,
      substring(dt,6,2) as month,
      dt as day,
      id,
      income,
      cost,
      -- 计算每个产品（id）当月总收入
      sum(income) over(partition by id order by id) as sum_income
    from 
      goods
    where substring(dt,6,2) = '04'  
) a




-- 最终结果！：对上述查询进行rn=1的过滤即可。
select * from (
select *,rank() over(partition by month order by sum_income desc) as rn
from (
select 
substring(dt,1,4) as year,
substring(dt,6,2) as month,
dt as day,
id,
income,
cost,
sum(income) over(partition by id order by id) as sum_income
from 
goods
where substring(dt,6,2) = '04'
) a
) b
where rn = 1

结果（最终结果！）：

a.year	a.month	a.day	a.id	a.income	a.cost	a.sum_income	rn
2018	4	2018/4/2	c	4600	1800	29000	1
2018	4	2018/4/3	c	6000	1200	29000	1
2018	4	2018/4/5	c	5600	2200	29000	1
2018	4	2018/4/1	c	3200	2400	29000	1
2018	4	2018/4/6	c	6700	2300	29000	1
2018	4	2018/4/4	c	2900	2200	29000	1

为什么第四问使用rank而不是row_numnber?

-- 详细请：回顾“求TopN” 章节他们的不同。
如上图，我们将产品每个月的总收入查询到了最后。
如果使用row_number，则上述111111将会表示为123456
就无法用rn=1过滤需要的结果。

而rank可以向相同的数排名同号。这样可以一次性过滤需要的数据。

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!