Hive Optimization

If a join involves a small table, you can enable a map-side join.

Dynamic Partition Pruning for Hive Map Joins

You can enable dynamic partition pruning for map joins when you are running Hive on Spark (HoS); it is not available for Hive on MapReduce.
Dynamic partition pruning (DPP) is a database optimization that can significantly decrease the amount of data that a query scans, thereby executing your workloads faster.
DPP achieves this by dynamically determining, at runtime, which partitions a query must read from a partitioned table and eliminating the rest.
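
In CDH, DPP for Hive on Spark is off by default and is controlled by Hive session properties. Below is a minimal sketch of enabling it; the property names are taken from the Cloudera HoS documentation, so verify them against your CDH release:

-- Enable dynamic partition pruning for all joins in Hive on Spark.
SET hive.spark.dynamic.partition.pruning=true;
-- Alternatively, apply DPP only where the pruning payoff is clearest: map joins.
SET hive.spark.dynamic.partition.pruning.map.join.only=true;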

Map joins also optimize how Hive executes queries. They cause a small table to be scanned and loaded in memory as a hash table
so that a fast join can be performed entirely within a mapper without having to use another reduce step.
If you have queries that join small tables, map joins can make them execute much faster.
Map joins are enabled by default in CDH with the Enable MapJoin Optimization setting for HiveServer2 in Cloudera Manager.
Hive automatically uses map joins for join queries that involve a set of tables where:

  • There is one large table; there is no limit on its size.
  • All other tables involved in the join must have an aggregate size under the value set for Hive Auto Convert Join Noconditional Size for HiveServer2, which is set to 20MB by default in Cloudera Manager.

Configuration for map-side joins:

SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask.size=<number_in_megabytes>;
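
To make this concrete, here is a hedged sketch (the orders and dim_region tables are hypothetical): with the settings above, Hive automatically converts the following join to a map join as long as the small side stays under the noconditionaltask size. The legacy /*+ MAPJOIN(t) */ hint still exists but is ignored by default in recent Hive versions (hive.ignore.mapjoin.hint=true), so the automatic conversion is the mechanism to rely on.

-- orders is assumed large; dim_region is assumed to be a small lookup table.
-- Hive loads dim_region into an in-memory hash table and broadcasts it to
-- each mapper, so the join completes map-side with no reduce step.
select o.order_id, r.region_name
from orders o
join dim_region r on o.region_id = r.region_id;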

A Tuning Case Study

Recently, during an ETL run, I found a SQL statement that took an extremely long time to execute even though the data volume involved was small. Why so slow? Out of intense curiosity, and determined to chase the problem down, I set out to solve it. The SQL was:

select count(1) from zhihu_answer
where ym in (select distinct(ym) from zhihu.zhihu_answer_increment);

First, the data volumes of the two tables:
zhihu_answer has roughly 100 million rows, while zhihu_answer_increment has only a few hundred thousand.
I started by checking the execution plan with explain extended:

explain extended
select count(1) from zhihu_answer
where ym in (select distinct(ym) from zhihu.zhihu_answer_increment);
Explain
STAGE DEPENDENCIES:
Stage-3 is a root stage
Stage-5 depends on stages: Stage-3 , consists of Stage-6, Stage-1
Stage-6 has a backup stage: Stage-1
Stage-4 depends on stages: Stage-6
Stage-2 depends on stages: Stage-1, Stage-4
Stage-1
Stage-0 depends on stages: Stage-2
STAGE PLANS:
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: zhihu_answer_increment
filterExpr: ym is not null (type: boolean)
Statistics: Num rows: 347468 Data size: 73315748 Basic stats: COMPLETE Column stats: COMPLETE
GatherStats: false
Select Operator
expressions: ym (type: string)
outputColumnNames: ym
Statistics: Num rows: 347468 Data size: 73315748 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
keys: ym (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: string)
null sort order: a
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE
tag: -1
auto parallelism: false
Execution mode: vectorized
Path -> Alias:
nullscan://null/zhihu.zhihu_answer_increment/part_ym=201902_ [sq_1:zhihu_answer_increment]
Path -> Partition:
nullscan://null/zhihu.zhihu_answer_increment/part_ym=201902_
Partition
input format: org.apache.hadoop.hive.ql.io.OneNullRowInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
partition values:
ym 201902
properties:
COLUMN_STATS_ACCURATE {"BASIC_STATS":"true"}
bucket_count 256
bucket_field_name answer_id
columns admin_closed_comment,answer_content,answer_created,answer_id,answer_updated,author_headline,author_id,author_name,author_type,author_url_token,avatar_url,badge_num,can_comment,comment_count,gender,insert_time,is_advertiser,is_collapsed,is_copyable,is_org,question_created,question_id,question_title,question_type,reward_member_count,reward_total_money,voteup_count
columns.comments
columns.types boolean:string:string:string:string:string:string:string:string:string:string:smallint:boolean:int:string:string:boolean:boolean:boolean:boolean:string:string:string:string:int:int:int
file.inputformat org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
location hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer_increment/ym=201902
name zhihu.zhihu_answer_increment
numFiles 256
numRows 347468
partition_columns ym
partition_columns.types string
rawDataSize 9381636
serialization.ddl struct zhihu_answer_increment { bool admin_closed_comment, string answer_content, string answer_created, string answer_id, string answer_updated, string author_headline, string author_id, string author_name, string author_type, string author_url_token, string avatar_url, i16 badge_num, bool can_comment, i32 comment_count, string gender, string insert_time, bool is_advertiser, bool is_collapsed, bool is_copyable, bool is_org, string question_created, string question_id, string question_title, string question_type, i32 reward_member_count, i32 reward_total_money, i32 voteup_count}
serialization.format 1
serialization.lib org.apache.hadoop.hive.serde2.NullStructSerDe
totalSize 433473813
transient_lastDdlTime 1571983508
serde: org.apache.hadoop.hive.serde2.NullStructSerDe
input format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
output format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
properties:
bucket_count 256
bucket_field_name answer_id
columns admin_closed_comment,answer_content,answer_created,answer_id,answer_updated,author_headline,author_id,author_name,author_type,author_url_token,avatar_url,badge_num,can_comment,comment_count,gender,insert_time,is_advertiser,is_collapsed,is_copyable,is_org,question_created,question_id,question_title,question_type,reward_member_count,reward_total_money,voteup_count
columns.comments
columns.types boolean:string:string:string:string:string:string:string:string:string:string:smallint:boolean:int:string:string:boolean:boolean:boolean:boolean:string:string:string:string:int:int:int
file.inputformat org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
location hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer_increment
name zhihu.zhihu_answer_increment
partition_columns ym
partition_columns.types string
serialization.ddl struct zhihu_answer_increment { bool admin_closed_comment, string answer_content, string answer_created, string answer_id, string answer_updated, string author_headline, string author_id, string author_name, string author_type, string author_url_token, string avatar_url, i16 badge_num, bool can_comment, i32 comment_count, string gender, string insert_time, bool is_advertiser, bool is_collapsed, bool is_copyable, bool is_org, string question_created, string question_id, string question_title, string question_type, i32 reward_member_count, i32 reward_total_money, i32 voteup_count}
serialization.format 1
serialization.lib org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
transient_lastDdlTime 1571983018
serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
name: zhihu.zhihu_answer_increment
name: zhihu.zhihu_answer_increment
Truncated Path -> Alias:
nullscan://null/zhihu.zhihu_answer_increment/part_ym=201902_ [sq_1:zhihu_answer_increment]
Needs Tagging: false
Reduce Operator Tree:
Group By Operator
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
keys: _col0 (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
GlobalTableId: 0
directory: hdfs://device1:8020/tmp/hive/hive/7f0887a3-8c5a-44b6-b5ef-f0c7530a6b15/hive_2019-10-25_14-46-02_198_8962679143430564511-1/-mr-10004
NumFilesPerFileSink: 1
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
properties:
columns _col0
columns.types string
escape.delim \
serialization.lib org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
TotalFiles: 1
GatherStats: false
MultiFileSpray: false
Stage: Stage-5
Conditional Operator
Stage: Stage-6
Map Reduce Local Work
Alias -> Map Local Tables:
$INTNAME
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
$INTNAME
TableScan
GatherStats: false
HashTable Sink Operator
keys:
0 ym (type: string)
1 _col0 (type: string)
Position of Big Table: 0
Stage: Stage-4
Map Reduce
Map Operator Tree:
TableScan
alias: zhihu_answer
filterExpr: ym is not null (type: boolean)
Statistics: Num rows: 102075765 Data size: 21537986328 Basic stats: COMPLETE Column stats: COMPLETE
GatherStats: false
Map Join Operator
condition map:
Left Semi Join 0 to 1
keys:
0 ym (type: string)
1 _col0 (type: string)
Position of Big Table: 0
Statistics: Num rows: 4253156 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE
Group By Operator
aggregations: count(1)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
GlobalTableId: 0
directory: hdfs://device1:8020/tmp/hive/hive/7f0887a3-8c5a-44b6-b5ef-f0c7530a6b15/hive_2019-10-25_14-46-02_198_8962679143430564511-1/-mr-10003
NumFilesPerFileSink: 1
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
properties:
columns _col0
columns.types bigint
escape.delim \
serialization.lib org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
TotalFiles: 1
GatherStats: false
MultiFileSpray: false
Local Work:
Map Reduce Local Work
Path -> Alias:
hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer/ym=201705 [zhihu_answer]
hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer/ym=201706 [zhihu_answer]
hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer/ym=201707 [zhihu_answer]
hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer/ym=201708 [zhihu_answer]
hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer/ym=201709 [zhihu_answer]
hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer/ym=201710 [zhihu_answer]
hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer/ym=201711 [zhihu_answer]
hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer/ym=201712 [zhihu_answer]
hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer/ym=201801 [zhihu_answer]
hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer/ym=201802 [zhihu_answer]
hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer/ym=201803 [zhihu_answer]
hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer/ym=201804 [zhihu_answer]
hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer/ym=201805 [zhihu_answer]
hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer/ym=201806 [zhihu_answer]
hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer/ym=201807 [zhihu_answer]
hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer/ym=201808 [zhihu_answer]
hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer/ym=201809 [zhihu_answer]
hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer/ym=201810 [zhihu_answer]
hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer/ym=201811 [zhihu_answer]

At first I was completely lost and couldn't make sense of it.
Then I tried running the subquery on its own: select distinct(ym) from zhihu_answer_increment; and it returned in under 2 minutes. So why does the combined statement take so much longer??
The subquery's result is simply "201902". I pasted that value in and executed:

select count(1) from zhihu_answer where ym in (201902);

Perhaps because I had run it before, this statement returned almost instantly: results came back within a few seconds.

I ran the slow SQL once more and watched its execution log, and found that the time was being spent in Stage-4!!! Going back to the explain output above, I saw that Hive was doing a full scan of zhihu_answer. Why a full table scan?
Because Hive still has to perform a join: an in (select ...) subquery is rewritten as a left semi join (note the Left Semi Join in the Stage-4 Map Join Operator above), and since the matching partition values are only known at runtime, Hive plans to read every partition of the big table. My explanation is not very rigorous,
but I can now roughly see why Hive does the full scan here. If the value is hard-coded, as in where ym in (201902), Hive does no join at all and prunes partitions at compile time, so no full scan is needed. The solution, then, is to
obtain the value 201902 first and splice it into the Hive SQL as a literal (see the sketch below). As far as I can tell, Hive does not yet support using one SQL query's result as a variable in another SQL statement, so for now this workaround will have to do.
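
A minimal sketch of the workaround in HiveQL. The value 201902 here comes from the run described above; in practice a driver script would capture the output of step 1 and substitute it into step 2:

-- Step 1: fetch the partition value(s) from the small table (fast on its own).
select distinct ym from zhihu.zhihu_answer_increment;
-- Step 2: splice the returned value in as a literal. ym is a string partition
-- column, so quote it; with a constant, Hive prunes partitions at compile
-- time, and there is no semi join and no full scan.
select count(1) from zhihu_answer where ym in ('201902');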

To my great delight, the rewritten SQL ran many times faster because the full table scan was gone: execution time dropped from about 2 hours to just a few minutes!

Summary

  • Use explain (extended) to print the execution plan and see what stages it contains.
  • Use the execution log to find which stage is taking the longest.

References

https://docs.cloudera.com/documentation/enterprise/latest/topics/admin_hos_oview.html#concept_i22_l1h_1v