
Hive and Related SQL Notes

This post records problems I have run into in day-to-day work and study, as a summary of SQL-related topics

For more articles, follow the WeChat official account: stackoverflow

Below is a simplified, must-remember version of the order in which a SQL engine evaluates a query:

from some table → group by some column → window functions → aggregate functions → having → distinct → order by → limit. Pay special attention when group by and a window function appear together: group by is always evaluated first
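As a minimal sketch of that ordering (Hive syntax, with a hypothetical table t(dept, sal)): because group by runs before the window function, the window only sees the already-grouped rows:

```sql
-- Hypothetical table t(dept, sal). group by is evaluated first,
-- so the window function runs over one aggregated row per dept.
select dept
, sum(sal) as dept_sal                -- per-group aggregate
, sum(sum(sal)) over () as total_sal  -- window over the grouped result
from t
group by dept;
```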

Hive's architecture

The figure below shows Hive's architecture: parser → compiler → optimizer → executor, in contrast to MySQL's connector → analyzer → optimizer → executor.

[Image: Hive architecture vs MySQL architecture]

The Metastore is the database that stores Hive's metadata; it defaults to Derby and can be switched to MySQL. The metadata maps structured data onto a table: the table name, the database it belongs to (default is default), the owner, the columns, the partition columns, the table type (e.g. whether it is an external table), the table's directory, and so on. Hive resembles an RDBMS only in its SQL syntax

Part 1

collect_x

When using these functions, set hive.map.aggr = false; first, otherwise you may hit an IllegalArgumentException Size requested for unknown type: java.util.Collection exception1

select collect_set(col_a) as set_a
, collect_list(col_a) as list_a
, sort_array(collect_list(col_a)) as sort_list_a -- sort_array sorts the collected array
from (
select 'a' col_a
union all
select 'b' col_a
union all
select 'a' col_a
union all
select 'a' col_a
) t

[Image: usage of the collect_list and collect_set functions]

1. Set map-side aggregation to false. Aggregating on the map side can also cause out-of-memory problems; for details see: http://dev.bizo.com/2013/02/map-side-aggregations-in-apache-hive.html

Dates

The following is Hive syntax

select date_format('2019-02-10','yyyy-MM');  
2019-02

select date_add('2019-02-10',-1),date_add('2019-02-10',1);
2019-02-09 2019-02-11
-- (1) next Monday after the given date
select next_day('2019-02-12','MO')
2019-02-18
-- note: the English day names are Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday

-- (2) Monday of the current week
select date_add(next_day('2019-02-12','MO'),-7);
2019-02-11

-- (3) Sunday of the current week
select date_add(next_day('2019-06-09','mo'),-1);
2019-06-09

-- (4) last day of the current month
select last_day('2019-02-10');
2019-02-28


-- previous month
select substr(add_months(current_date(),-1),1,7) method_one
, substr(date_sub(from_unixtime(unix_timestamp()), dayofmonth(from_unixtime(unix_timestamp()))), 1, 7) as method_two

Time zones

Greenwich Mean Time (GMT) is one of the most widely used time standards. It takes the Royal Observatory in Greenwich, London as its reference point for setting time worldwide. GMT is commonly used as the basis of Coordinated Universal Time (UTC), so the two terms are often used interchangeably.

A UTC timestamp has the same value everywhere on earth: a 13-digit number (milliseconds). If Xiaoming is in New York, Xiaohong is in Shanghai, and Xiaolan is at the Greenwich observatory, all three get the same timestamp at the same instant; only the local representation differs by time zone, as in this example:

Timestamp:    1694682379271
New York:     2023-09-14 05:06:19
GMT/UTC:      2023-09-14 09:06:19
Beijing:      2023-09-14 17:06:19

The following is Hive/Impala syntax

-- Converts a time in a given time zone to UTC. Two arguments: the time to convert and its source time zone.

select to_utc_timestamp('2023-09-14 17:06:19', 'Asia/Shanghai') as `source zone to GMT/UTC`;
2023-09-14 09:06:19

-- Converts a UTC time to a target time zone. Two arguments: the UTC time and the target time zone.
SELECT from_utc_timestamp('2023-09-14 09:06:19', 'Asia/Shanghai') as `UTC to target zone`;
2023-09-14 17:06:19
SELECT from_unixtime(unix_timestamp()) as `UTC/GMT time`
, current_timestamp() as `UTC+8 time`
, unix_timestamp() as `unix timestamp`

As the figure below shows, Hive 3's from_unixtime does not convert to the local time zone (it uses UTC directly), whereas Impala does apply the time-zone conversion

[Image: Impala vs Hive 3 time-zone handling]

-- convert Beijing time to Brazil time
select from_utc_timestamp(to_utc_timestamp("2021-05-09 22:14:30",'GMT+8'),"GMT-3")
2021-05-09 11:14:30.0

select date_format(from_utc_timestamp(to_utc_timestamp("2021-05-09 22:14:30",'GMT+8'),"GMT-3"),'yyyy-MM-dd HH:mm:ss')
2021-05-09 11:14:30

String handling

substr, replace, etc. need no elaboration here

-- 1)
select regexp_extract('http://a.m.taobao.com/i41915173660.html', 'i([0-9]+)', 0)
, regexp_extract('http://a.m.taobao.com/i41915173660.html', 'i([0-9]+)', 1)
-- i41915173660 , 41915173660
-- index 0 returns the whole match; index 1 returns the first capture group

-- 2)
select regexp_replace('a1b2c3d4', '[0-9]', '-');
-- a-b-c-d-

-- 3)
-- test whether one string is a substring of another
instr(string string, string substring)
-- returns the 1-based position of substring within string; returns 0 when not found, and null if either argument is null

Handling division by zero

Here we compare how different engines handle division by zero, as shown below:

[Image: division-by-zero results across engines]

For division by zero, prefer the nullif(a,0) function, available in Hive since 2.2. The semantics of nullif(a,b): if argument a equals argument b, the function returns null
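A minimal sketch of the nullif pattern (hypothetical values):

```sql
-- nullif(b, 0) returns null when b = 0, so the division yields null
-- instead of an error (nullif needs Hive 2.2+).
select a / nullif(b, 0) as safe_ratio
from (select 10 as a, 0 as b) t;
-- On engines without nullif, if(b = 0, null, b) is an equivalent rewrite.
```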

Part 2

sum() + over()

[Image: sum() combined with over() variants]

  • over(): sum over all rows
  • over(order by ...): running sum over all rows
  • over(partition by ...): sum within each partition
  • over(partition by ... order by ...): running sum within each partition

On an internal network, run the script below in Impala, Hive, Hologres, PostgreSQL, and MySQL and compare the results

with tmp1 as (
select 'a' as a,1 as b
union all
select 'a' as a , 2 as b
union all
select 'b' as a , 10 as b
)
select a ,
avg(b) over(partition by a order by b) x ,
sum(b) over(partition by a order by b) y
from tmp1
;

On aggregate functions combined with window functions, note that the blog post on GeeksforGeeks is wrong

Lateral view

🎈 explode(split()) alone can only explode a single column of one row into multiple rows; a lateral view handles the general case of exploding a row alongside its other columns

-- Hive syntax
with tmp1 as(
select 'A;B;C' as name , 20 as age
)
select explode(split(name,';'))
, age -- Only a single expression in the SELECT clause is supported with UDTF's
-- for multiple columns, use the lateral view syntax
from tmp1
;
-- as follows (Hive syntax)
with tmp1 as(
select 'A;B;C' as name , 20 as age
)
select name
, age
, col_x
from tmp1 lateral view explode(split(name,';')) x as col_x
;
name age col_x
A;B;C 20 A
A;B;C 20 B
A;B;C 20 C

Using lag + over

The most common pattern: order by time ascending and use lag to fetch the previous period's value

[Image: using lag + over]

  • lag (behind) fetches the previous row
  • lead (ahead) fetches the next row

[Image: in the SQL (Structured Query Language) standard's framing, the previous row lags behind the current row and the next row leads ahead of it]
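A minimal lag/lead sketch (Hive syntax, made-up data); the third argument is the default used when no previous/next row exists:

```sql
with tmp1 as (
select '2020-01-01' as dt, 10 as v union all
select '2020-01-02' as dt, 20 as v union all
select '2020-01-03' as dt, 30 as v
)
select dt
, v
, lag(v, 1, 0) over (order by dt) as prev_v   -- previous period's value
, lead(v, 1, 0) over (order by dt) as next_v  -- next period's value
from tmp1;
```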

Active 3 consecutive days within a week

[Image: the best solution for "active X consecutive days within a week"]

with tmp1 as(
select 0 as mid_id,'2020-02-12' as dt union all
select 0 as mid_id,'2020-02-16' as dt union all
select 0 as mid_id,'2020-02-17' as dt union all
select 0 as mid_id,'2020-02-18' as dt union all
select 1 as mid_id,'2020-02-11' as dt union all
select 1 as mid_id,'2020-02-13' as dt union all
select 1 as mid_id,'2020-02-14' as dt union all
select 1 as mid_id,'2020-02-17' as dt
) ,
tmp2 as (
select mid_id
, dt
, row_number() over (partition by mid_id order by dt) rk
, date_sub(dt, row_number() over (partition by mid_id order by dt)) diff
from tmp1
where dt between date_sub('2020-02-18', 7) and '2020-02-18'
)
select mid_id
from tmp2
group by mid_id,diff
having count(*) >=3 ;

For this family of "X consecutive days" problems, the best solution is a window function. An alternative is a self-join, joining on t1.mid_id = t2.mid_id and t1.dt = date_sub(t2.dt,x)
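A sketch of the self-join variant for the 3-day case, against the same tmp1 as above (assuming one row per user per day):

```sql
-- t2 is the day after t1, t3 is two days after t1: three consecutive days.
select distinct t1.mid_id
from tmp1 t1
join tmp1 t2
on t1.mid_id = t2.mid_id and t1.dt = date_sub(t2.dt, 1)
join tmp1 t3
on t1.mid_id = t3.mid_id and t1.dt = date_sub(t3.dt, 2);
```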

left semi join

Two things to note about left semi join:

🅰 left semi join is strictly different from left outer join (left join)

🅱 in t1 left semi join t2, columns of t2 may not appear in the select list

select t1.id, t1.fieldA
from `table_A` t1
where t1.id in (
select id
from `table_B`
); -- intersection of A and B

-- can be rewritten with exists
select t1.*
from `table_A` t1
where exists (
select t2.id
from `table_B` t2
where t1.id = t2.id
)

-- can also be rewritten as
select t1.*
from `table_A` t1
left join `table_B` t2
on t1.id = t2.id
where t2.id is not null -- intersection of A and B
;

-- or, more efficiently, as
select t1.* -- columns of t2 are not allowed here
from `table_A` t1
left semi join `table_B` t2
on t1.id = t2.id;

Similarly, for not exists:

select t1.*
from `table_A` t1
left join `table_B` t2
on t1.id = t2.id
where t2.id is null -- in A but not in B
;
-- rewritten as
select t1.*
from `table_A` t1
where not exists (
select t2.id
from `table_B` t2
where t1.id = t2.id
) -- in A but not in B

-- or rewritten as
select t1.*
from `table_A` t1
where t1.id not in (
select t2.id
from `table_B` t2
) -- in A but not in B (beware: not in behaves differently when table_B.id contains null)

distinct

🅰 combining distinct and order by

distinct runs first, then order by, and finally limit

[Image: distinct runs first, then order by]

🅱 distinct over multiple columns

-- distinct over multiple columns applies to the combination of all listed columns, not just the first one; e.g. select distinct field_a,field_b from table;
a1,b1;
a1,b2;
a2,b2;
-- any row that differs in any column is kept

limit offset

limit x offset y: stepping y through multiples of x pages through the data exactly; limit x offset y is equivalent to limit y,x

[Image: usage of limit]

select *
from (
select 1 a
union all
select 2 a
union all
select 3 a
union all
select 4 a
union all
select 5 a
) t
order by a desc
limit 3 offset 3;

Running the last screenshot's SQL in Hive 2.1.1 gives:

[Image: 24]

which suggests that in Hive the offset count starts at 1 (offset 0 behaves the same as offset 1)

ntile + over

ntile(x) divides the rows evenly into x buckets and returns the bucket number; when the rows do not divide evenly, the earlier buckets get the extra rows first

[Image: using ntile to split rows evenly into x buckets]

with tmp1 as (
select 'a' as name , 'one' as claz, 1 as score union all
select 'b' as name , 'two' as claz, 2 as score union all
select 'c' as name , 'two' as claz, 3 as score union all
select 'd' as name , 'one' as claz, 4 as score union all
select 'e' as name , 'one' as claz, 5 as score union all
select 'f' as name , 'two' as claz, 6 as score union all
select 'g' as name , 'one' as claz, 7 as score union all
select 'h' as name , 'one' as claz, 8 as score union all
select 'i' as name , 'two' as claz, 9 as score union all
select 'j' as name , 'two' as claz, 0 as score
)
select *
, ntile(2) over(partition by claz order by score) rn
from tmp1;

Fuzzy-matching multiple values

In Hive or Impala, use rlike (or regexp) to fuzzy-match against multiple values

-- hive & impala
with tmp1 as(
select 'abc大' as a union all
select '中abc' as a union all
select 'abc小' as a
)
select *
from tmp1
-- where a rlike '大|中' -- form 1
where a regexp '大|中' -- form 2
-- 中abc
-- abc大

-- holo & pg
with tmp1 as(
select 'abc大' as a union all
select '中abc' as a union all
select 'abc小' as a
)
select *
from tmp1
where a ~ '大|中'

Choosing a window frame

Note the difference between range and rows: rows counts physical rows, range operates on values; preceding looks backward, following looks forward

agg_func over(order by col_name rows between 1 preceding and 1 following) -- the rows immediately before and after the current row
agg_func over(order by col_name range between 1 preceding and 1 following) -- rows whose col_name value is within +/- 1 of the current value

agg_func over(order by col_name rows between unbounded preceding and unbounded following) -- all rows
agg_func over(order by col_name rows between unbounded preceding and current row) -- from the start to the current row

[Image: window frame ranges]

Part 3

Selecting columns not in group by

MySQL allows it; Hive, Impala, and PostgreSQL do not

Consider the following SQL

select dept
, emp
, max(sal) as max_sal
from (
select 'A' as dept, 'a1' as emp, 10 as sal union all
select 'A' as dept, 'a2' as emp, 20 as sal union all
select 'B' as dept, 'b2' as emp, 100 as sal union all
select 'B' as dept, 'b1' as emp, 200 as sal
) t
group by dept

1⃣ MySQL: passes

[Image: MySQL selecting a non-group-by column]

MySQL picks one record per group (from the experiment, the first row of the group)

2⃣ PostgreSQL:

[42803] ERROR: column "t.emp" must appear in the GROUP BY clause or be used in an aggregate function

3⃣ Hive:

Error while compiling statement: FAILED: SemanticException [Error 10025]: line 2:7 Expression not in GROUP BY key 'emp'

4⃣ Impala:

AnalysisException: select list expression not produced by aggregation output (missing from GROUP BY clause?): emp

Does having support aliases?

MySQL and Hive do; Impala and PostgreSQL do not. 🎈 Recommendation: never filter after group by using an alias

select a, count(*) as cnt
from (
select 5 as a
union all
select 4 as a
union all
select 4 as a
union all
select 3 as a
union all
select 3 as a
) t
group by a
having cnt > 1;

The SQL above runs fine in both MySQL and Hive; in Impala and PostgreSQL it fails with column "cnt" does not exist and needs to be written as:

select a, count(*) as cnt
from (
select 5 as a
union all
select 4 as a
union all
select 4 as a
union all
select 3 as a
union all
select 3 as a
) t
group by a
having count(*) > 1;

order by on strings

select a
from
(
select 'a' as a union all -- 97
select '' as a union all -- 66
select ' ' as a union all -- 32
select null as a -- 0
) t
order by a desc ;

For the query above, Hive and MySQL treat NULL as the smallest value; Impala and PostgreSQL treat NULL as the largest. If you inspect the SQL with the explain command, you can clearly see the compiler adding a nulls first / nulls last marker; which one depends on the engine, so interested readers can test it themselves: Hive puts null first (smallest), Impala puts null last (largest)

[Image: ordering of null, empty string, and space]

The result of 24/5

DB/Program Language value
Java / PostgreSQL 4
Hive / Impala / MySQL 4.8

Do window functions support distinct?

select  A, B , count( distinct A) over()
from (
select 1 as A ,'a' as B union all
select 2 as A ,'b' as B union all
select 1 as A ,'c' as B union all
select 3 as A ,'d' as B
) t

For the query above: Hive supports it; Impala, MySQL, and PostgreSQL have not implemented it yet

Nested window functions

Nesting window functions is supported only in Hive (tested on 2.1.1); PostgreSQL rejects it (window functions are not allowed in window definitions), and MySQL and Impala allow at most one extra level of nesting

[Image: window-nesting support across engines]

Writing strings into numeric columns

create table if not exists business (
name string,
order_date string,
cost float
);
insert into business values('xiaoming','2021-08-22','');

Hive converts the string to null and writes it; Impala, MySQL, and PostgreSQL fail the type check with an exception

Field extraction

-- MySQL
select substring_index(substring_index('A/B/C/D/E', '/', 4), '/', -1) as dept_name0 -- D

-- Hive, index starts at 0
select substring_index(substring_index('A/B/C/D/E', '/', 4), '/', -1) as dept_name0 -- D
, split('A/B/C/D/E', '/')[3] as dept_name1 -- D


-- pg/holo, index starts at 1
select split_part('A/B/C/D/E', '/',4) as dept_name1 -- D
-- impala, index starts at 1
select split_part('A/B/C/D/E', '/',4) as dept_name1 -- D


-- in addition, holo has this function
select string_to_array('xx~^~yy~^~zz', '~^~') -- {xx,yy,zz}

column_name not in (a) also filters out null records

column_name not in (a): besides filtering out rows whose value is a, it also filters out rows where column_name is null

with tmp1 as (--
select 'a' as a
union all
select 'b' as a
union all
select 'c' as a
union all
select null as a
)
select *
from tmp1
where a not in ('a')
-- returns b, c (the null row is filtered out too)

Fuzzy (non-equi) joins in Impala

-- returns true when t2.name is a substring (regex match) of t1.name; note the argument order
with tmp1 as ( --
select '康恩贝/CONBA' as name, '2023-01-01' as live_date
)
, tmp2 as ( --
select 'CONBA' as name, '2023-01-01' as live_date
)
select *
from tmp1 t1
left join tmp2 t2
on t1.live_date = t2.live_date
and t1.name rlike t2.name
name live_date name live_date
康恩贝/CONBA 2023-01-01 CONBA 2023-01-01

Part 4

Joining null with x

[Image: joining null with other values]

In every SQL engine (MySQL, PostgreSQL, Hive, Impala), null does not join with any value, including null itself

PostgreSQL does type inference; running the join above there raises: Failed to find conversion function from unknown to text
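If you genuinely need null keys to match, a common (hypothetical) workaround is to coalesce both sides to a sentinel that cannot collide with real values:

```sql
-- null = null is never true, so map nulls to a sentinel on both sides.
select *
from t1
join t2
on coalesce(t1.k, '__null__') = coalesce(t2.k, '__null__');
```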

Returning 1 row vs returning 0 rows

With an aggregate function (and no group by), a query always returns at least one row

[Image: returning 0 rows vs returning 1 row]
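A minimal illustration (made-up data): a bare aggregate still returns one row even when the filter removes everything:

```sql
select count(*) as cnt, max(a) as max_a
from (select 1 as a) t
where a > 100;   -- no rows survive the filter
-- result: one row with cnt = 0 and max_a = null;
-- with "group by a" added, the query would return zero rows instead
```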

Column types in union all

In every engine, the column types across union all branches must match
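When the branches differ in type, an explicit cast keeps the engines happy (a sketch; the exact coercion behavior without the cast varies by engine):

```sql
select cast(1 as string) as v
union all
select 'a' as v;
```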

Composite primary keys and null

For a table test01 with columns a and b as a composite primary key, when a is null or b is null:

1⃣ Kudu silently skips the record without throwing, so the written row count and the queried row count disagree

2⃣ MySQL throws on insert, something like (primary key not null)

3⃣ PostgreSQL throws on insert, something like (primary key not null)

Timestamps

A time is human-readable; a timestamp is essentially machine-readable. For example 2022-01-01 00:00:00 (1640966400): the former is a time, the latter its timestamp. Almost every engine implements unix_timestamp; it accepts a time argument, and uses the current moment when none is passed

1⃣ Getting a timestamp

--mysql
select unix_timestamp('2022-01-01 00:00:00');

-- hive
select unix_timestamp('2022-01-01 00:00:00');

2⃣ Getting a time

-- mysql
select now();

-- hive
select from_unixtime( unix_timestamp());

-- impala
select now(), utc_timestamp(),current_timestamp(),from_unixtime( unix_timestamp());

-- pg
select now() , current_timestamp;

Stripping newlines, carriage returns, tabs, and spaces from text with regex replacement

-- \\s matches one whitespace character (space, tab, newline); '+' repeats the preceding pattern ('\\s') one or more times
select regexp_replace(input_content,'\\s+','') as after_content
-- hive
select regexp_replace(video_desc,'\t|\n|\001|\r','') as video_desc

-- MySQL 5.7 does not support regexp_replace (8.0 does), so in 5.7 use
select replace(replace(video_desc, char(13), ''), char(10), '') as video_desc
select '1\r2\t\3\n4\0015'
,regexp_replace('1\r2\t\3\n4','\\s+','')
,regexp_replace('1 \r2\t\3\n4\0015','\\s+','')
,regexp_replace('1 \r2\t\3\n4\0015','\t|\n|\001|\r','')

[Image: how special characters cause row/column misalignment in a table]

Impala upsert + Kudu

Essentially insert + update combined:

  • when the primary key exists, all fields are updated
  • when it does not, the row is inserted
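A sketch of the statement shape, assuming a Kudu table kudu_table with primary key id (hypothetical names):

```sql
-- id 1 already exists: all its fields are overwritten; id 2 is new: inserted.
upsert into kudu_table (id, name, updated_at)
values (1, 'alice', '2023-09-14'),
       (2, 'bob',   '2023-09-14');
```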

Finding the employee with the second-highest salary without order by

select
e.emp_no,
salary,
last_name,
first_name
from employees e
inner join salaries s
on e.emp_no = s.emp_no
where s.to_date = '9999-01-01'
and s.salary =
(
select s1.salary
from salaries s1
inner join salaries s2
on s1.salary <= s2.salary
where s1.to_date = '9999-01-01' and s2.to_date = '9999-01-01'
group by s1.salary
having count(distinct s2.salary) = 2
)

The maximum is less than or equal to only the maximum itself (1 distinct value); the second-highest is less than or equal to both the maximum and itself (2 distinct values)

[Image: finding the employee with the second-highest salary]

What from tmp1, tmp2 really is

from tmp1, tmp2 is essentially an inner join

with tmp1 as (select 'a' as name, 10 as age
union all
select 'b' as name, 11 as age
union all
select 'c' as name, 12 as age
)
, tmp2 as (select 'c' as name, 'female' as sex
union all
select 'd' as name, 'male' as sex
union all
select 'e' as name, 'female' as sex
)
select t1.name
, t1.age
, t2.sex
from tmp1 t1
, tmp2 t2
where t1.name = t2.name

The SQL above is equivalent to:

with tmp1 as (select 'a' as name, 10 as age
union all
select 'b' as name, 11 as age
union all
select 'c' as name, 12 as age
)
, tmp2 as (select 'c' as name, 'female' as sex
union all
select 'd' as name, 'male' as sex
union all
select 'e' as name, 'female' as sex
)
select t1.name, t1.age, t2.sex
from tmp1 t1
inner join tmp2 t2
on t1.name = t2.name

Forcing a map join in Hive

select /*+ mapjoin(t2) */ t1.* -- force the join strategy
from t1
left join t2
on t1.key= t2.key

Part 5

An "Unknown Reason" problem with the ORC and RCFile storage formats

When combining union all + distinct, you need to set set hive.vectorized.execution.enabled=false;

spark.yarn.executor.memoryOverhead (see the official docs for this parameter): raising this threshold is one fix; another is to increase the number of executors, e.g. set spark.dynamicAllocation.maxExecutors=14;

return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Spark job failed because of out of memory.

ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.3 GB of 15.3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead

spark.yarn.executor.memoryOverhead is the off-heap memory YARN allocates to each executor

Finished Stage-2_0: 1098(+1,-35) /1099
Finished Stage-2_0: 1099(-35)/1099

-- success/total
-- a(x,y)/b
In Hive/Spark run logs, if a finally equals b, the tasks all succeeded

Getting the first and last day of the month

select trunc(date_add(current_date,-1),'MM') as month_first_day    -- hive: first day of month
select date_format('2020-01-02 10:09:08','yyyy-MM-01') -- hive
select trunc(current_date,'MM') -- spark-sql, hive

select last_day(current_timestamp()) -- last day of month

Handling dynamic-partition writes

  1. enable non-strict mode: set hive.exec.dynamic.partition.mode=nonstrict
  2. the field in partition(date_id) must have the same name as the corresponding field in the select
  3. the dynamic partition field goes last in the select list
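The three points above, as one minimal sketch (hypothetical table names):

```sql
set hive.exec.dynamic.partition.mode=nonstrict;   -- point 1
insert overwrite table target_table partition (date_id)
select id
, name
, date_id   -- points 2 and 3: same name as the partition column, and last
from source_table;
```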

Memory limit exceeded: Could not allocate memory while trying to increase reservation

  1. the most common fix: retry
  2. lower query concurrency: fewer queries running at once means more memory per query
  3. configure admission control: Impala's admission control can cap concurrent queries and avoid resource contention; see the Impala Admission Control docs
  4. optimize the query itself to lower its memory needs (the SQL, limit, etc.)

Parsing a JSON array

How do you parse a JSON array? It looks roughly like [{},{}]. A parsing demo:

select explode(
split(
regexp_replace(
regexp_replace( '[{"skuNo":"KU413455571546619913","propertyValue":"雾霾蓝"},{"skuNo":"KU413455571546619912","propertyValue":"北极绿"},{"skuNo":"KU413455571546619911","propertyValue":"亚麻金"}]',
'\\[|\\]', ''), -- strip the square brackets around the array
'\\}\\,\\{', '\\}\\;\\{'), -- replace the commas between elements with semicolons
'\\;') -- split on the semicolons
)

Aggregating over the current date and the previous 7/30 days

with tmp1 as ( -- for the same account and the same product, how many times was it streamed in the 7 days including the live date?
select '1424128656' as account_id, '259183220' as goods_id, '2022-06-28' live_date
union all
select '1424128656' as account_id, '259183220' as goods_id, '2022-06-22' live_date
union all
select '1424128656' as account_id, '259183220' as goods_id, '2022-06-17' live_date
union all
select '1424128656' as account_id, '259183220' as goods_id, '2022-06-15' live_date
union all
select '1424128656' as account_id, '262220152' as goods_id, '2021-12-09' live_date
)
select *
, count(*)
over (partition by account_id,goods_id
order by unix_timestamp(live_date, 'yyyy-MM-dd') desc
range between 1 following and 604800 following) + 1 as day_7 -- occurrences in the 7 days up to and including today

from tmp1
account_id goods_id live_date day_7
1424128656 259183220 2022-06-28 2
1424128656 259183220 2022-06-22 3
1424128656 259183220 2022-06-17 2
1424128656 259183220 2022-06-15 1
1424128656 262220152 2021-12-09 1

Hive: cast(... as int) into a bigint column writes all zeros

In Hive:

-- column A of table_A has type bigint

insert into table table_A
select cast(a as int) as A

-- querying A afterwards, every value is 0

-- solution: keep the two types consistent

In which hour after time B does time A fall

Determine which hour, counted from time B (start_time), time A (on_top_time) falls in; this is a relative-position problem (what position is 5 relative to 1)

with tmp1 as ( -- 
select '2023-01-01 12:00:00' as on_top_time, '2023-01-01 11:00:00' as start_time
union all
select '2023-01-01 12:02:00' as on_top_time, '2023-01-01 11:00:00' as start_time
union all
select '2023-01-01 12:59:00' as on_top_time, '2023-01-01 11:00:00' as start_time
union all
select '2023-01-01 16:00:00' as on_top_time, '2023-01-01 11:00:00' as start_time

)
select on_top_time as on_top_time
, start_time as start_time
, cast((unix_timestamp(on_top_time) -
unix_timestamp(start_time)) / 3600 + 1 as int) as hours_id
from tmp1

is not in the vectorization context column map

The post Hive vectorization union all problem proposes a workaround, but does not explain the cause

-- disable vectorized execution
set hive.vectorized.execution.enabled=false

Handling data skew

When you hit skew like this (say table A has a huge amount of data at id = 1), there are two approaches:
Approach 🅰

-- the original logic:
select A.id
from A
join B
on A.id = B.id

-- improved, part 1: this query has no skew at all
select A.id
from A
join B
on A.id = B.id
where A.id <> 1

-- improved, part 2: the B.id=1 side is small, so we join it in memory (broadcast join in Spark, map join in Hive)
select /*+ mapjoin(B)*/ A.id
from A
join B
on A.id = B.id
where A.id = 1
and B.id = 1

Approach 🅱

Add a random-number column to extend the join key. For example, table A:

id
a
a
a
a
-- becomes
id rand_column(0~2)
a 0
a 1
a 2
a 2

For table B:
id
a
-- becomes
id rand_num
a 0
a 1
a 2

If the original join condition was A.id = B.id, to handle the skew we change it to:
`A.id = B.id and A.rand_num = B.rand_num`
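A sketch of the whole rewrite (hypothetical tables A and B, 3 buckets): the big table gets a random bucket per row, while the small table is replicated into every bucket:

```sql
select a.id
from (select id, cast(rand() * 3 as int) as rand_num from A) a
join (select id, pos as rand_num
      from B lateral view posexplode(split('0,1,2', ',')) x as pos, val) b
on a.id = b.id
and a.rand_num = b.rand_num;
```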

Adding a column to a Hive table

`alter table dw.live_thin_room_order_goods_df  add columns (third_party_shop_id string comment '三方店铺id')`

Generating a date dimension column

-- in Hive
select date_add("2023-08-01", a.pos) as range_date
from (
select posexplode(split(repeat("@", datediff("2023-08-31", "2023-08-01")), "@")) -- first date is the end, second is the start
) a

Dimension-table generation: use the date_id = '2023-08-08' partition to generate all the partitions before date_id = '2023-08-08'

set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table target_table partition (date_id)
select id as id
, name as name
, t1.date_id
from (select date_add("2023-02-01", a.pos) as date_id
from (select posexplode(split(repeat("@", datediff("2023-08-08", "2023-02-01")), "@")) -- first date is the end, second is the start
) a
) t1
cross join (select *
from target_table
where date_id = '2023-08-08'
) t2

Bucketing process data into hours

Scenario: a crawler reads a web page once a minute and writes each snapshot to a table; the requirement is the incremental value for each hour

[Image: bucketing per-minute snapshots into hours]

The code for the above is as follows

with tmp1 as ( -- l_c: launch_consume, up_at: updated_at, s_at: start_time, s_diff: seconds from start time to update time
select 'o_01' as o_id, 10 as l_c, '2023-08-01 01:00:00' as up_at, '2023-08-01 00:30:00' as s_at
union all
select 'o_01' as o_id, 20 as l_c, '2023-08-01 01:30:00' as up_at, '2023-08-01 00:30:00' as s_at
union all
select 'o_01' as o_id, 30 as l_c, '2023-08-01 01:50:00' as up_at, '2023-08-01 00:30:00' as s_at
union all
select 'o_01' as o_id, 25 as l_c, '2023-08-01 02:00:00' as up_at, '2023-08-01 00:30:00' as s_at
union all
select 'o_01' as o_id, 60 as l_c, '2023-08-01 02:40:00' as up_at, '2023-08-01 00:30:00' as s_at
union all
select 'o_01' as o_id, 80 as l_c, '2023-08-01 03:00:00' as up_at, '2023-08-01 00:30:00' as s_at
union all
select 'o_01' as o_id, 180 as l_c, '2023-08-01 04:00:00' as up_at, '2023-08-01 00:30:00' as s_at
union all
select 'o_01' as o_id, 200 as l_c, '2023-08-01 04:05:00' as up_at, '2023-08-01 00:30:00' as s_at
union all
select 'o_01' as o_id, 400 as l_c, '2023-08-01 04:20:00' as up_at, '2023-08-01 00:30:00' as s_at
)
, tmp2 as ( --
select o_id as o_id
, l_c as l_c
, up_at as up_at
, s_at as s_at
, lag(l_c, 1, 0) over (partition by o_id order by up_at) as l_l_c

, l_c - lag(l_c, 1, 0) over (partition by o_id order by up_at) as l_c_diff
, (unix_timestamp(up_at) - unix_timestamp(s_at)) as s_diff
, (unix_timestamp(up_at) - unix_timestamp(s_at)) / 3600 as hour_diff
, cast((unix_timestamp(up_at) -
unix_timestamp(s_at)) / 3600 + 1 as int) as real_hours_id
from tmp1
)
select *
from tmp2

Dynamic parameters

Desired effect: when no value is passed for a column, drop that column's where filter

where case when ${param} = 'your placeholder default' then true
else platform_name = ${param} end

Multi-column full join

with tmp1 as (
select 'o_01' as o_id, '2023-08-01' as data_id, 50 as money
union all
select 'o_03' as o_id, '2023-08-02' as data_id, 50 as money
union all
select 'o_04' as o_id, '2023-08-03' as data_id, 50 as money

)
, tmp2 as ( --
select 'o_01' as o_id, '2023-08-01' as data_id, 60 as money
union all
select 'o_02' as o_id, '2023-08-02' as data_id, 60 as money
union all
select 'o_05' as o_id, '2023-08-03' as data_id, 60 as money
)
select if(t1.data_id is not null, t1.o_id, t2.o_id) as o_id
, if(t1.data_id is not null, t1.data_id, t2.data_id) as date_id
, coalesce(t1.money, t2.money) as money
from tmp1 t1
full join tmp2 t2
on t1.o_id = t2.o_id
and t1.data_id = t2.data_id

[Image: from select * to explicit select logic]

Part X

More problems and solutions will be collected here (ToDo)