If one of the tables in a join is small, you can enable a map join to speed it up.
Dynamic Partition Pruning for Hive Map Joins
You can enable dynamic partition pruning for map joins when you are running Hive on Spark (HoS); it is not available for Hive on MapReduce.
Dynamic partition pruning (DPP) is a database optimization that can significantly decrease the amount of data that a query scans, thereby executing your workloads faster.
DPP achieves this by dynamically determining which partitions of a partitioned table a query actually needs to read and eliminating the rest.
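As a minimal sketch, assuming the standard Hive on Spark properties are available in your Hive version, DPP for map joins could be switched on per session like this:

```sql
-- Sketch only: enabling DPP for Hive on Spark in a session.
-- hive.spark.dynamic.partition.pruning turns DPP on for HoS; the
-- .map.join.only variant restricts pruning to joins that can be
-- converted to map joins (verify both properties exist in your build
-- with SET <property>; before relying on them).
SET hive.spark.dynamic.partition.pruning=true;
SET hive.spark.dynamic.partition.pruning.map.join.only=true;
```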
Map joins also optimize how Hive executes queries. They cause a small table to be scanned and loaded into memory as a hash table
so that a fast join can be performed entirely within a mapper, without requiring a separate reduce step.
If you have queries that join small tables, map joins can make them execute much faster.
Map joins are enabled by default in CDH with the Enable MapJoin Optimization setting for HiveServer2 in Cloudera Manager.
Hive automatically uses map joins for join queries that involve a set of tables where:
- There is one large table and there is no limit on the size of that large table.
- All other tables involved in the join must have an aggregate size under the value set for Hive Auto Convert Join Noconditional Size for HiveServer2, which is set to 20MB by default in Cloudera Manager.
Configuration for map-side joins:
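As a sketch, these are the session-level properties involved; the values shown mirror the CDH defaults described above and should be checked against your own cluster:

```sql
-- Sketch only: map-join related properties set at the session level.
SET hive.auto.convert.join=true;                             -- let Hive rewrite eligible joins as map joins
SET hive.auto.convert.join.noconditionaltask=true;           -- convert directly, without a conditional task
SET hive.auto.convert.join.noconditionaltask.size=20971520;  -- 20MB aggregate limit for the small tables
```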
A Tuning Case Study
Recently, during an ETL run, I noticed one SQL statement that took an extremely long time to execute even though the data volume was actually small. Why so long? Out of sheer curiosity, and determined not to let it go, I set out to get to the bottom of it. The SQL looked like this:
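The original statement is not reproduced here; judging from the discussion that follows, it had roughly this shape (the column list and exact predicates are hypothetical):

```sql
-- Hypothetical reconstruction of the slow statement: the large, partitioned
-- zhihu_answer table is filtered by the partition values present in the
-- small increment table via an IN subquery.
SELECT a.*
FROM zhihu_answer a
WHERE a.ym IN (SELECT DISTINCT(ym) FROM zhihu_answer_increment);
```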
First, some background on the two tables:
zhihu_answer holds roughly 100 million rows, while zhihu_answer_increment holds only a few hundred thousand.
First, I used EXPLAIN EXTENDED to inspect the execution plan:
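For reference, the plan is obtained by prefixing the statement with EXPLAIN EXTENDED (the query body below is the hypothetical reconstruction from above):

```sql
-- EXPLAIN EXTENDED prints the full plan, including per-stage operators
-- and the partitions/paths each stage will read.
EXPLAIN EXTENDED
SELECT a.*
FROM zhihu_answer a
WHERE a.ym IN (SELECT DISTINCT(ym) FROM zhihu_answer_increment);
```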
I stared at the output blankly; at first I could not make sense of it at all.
Then I tried running the subquery on its own: select distinct(ym) from zhihu_answer_increment;
It returned in under two minutes. So why did the combined statement take so long??
The result of that subquery was simply "201902". I copied this value into the outer query in place of the subquery and ran it:
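Again as a hypothetical sketch, this is the same query with the subquery replaced by the literal value:

```sql
-- Hypothetical: the subquery result inlined as a literal. With a constant
-- predicate on the partition column, Hive can prune partitions at compile
-- time and no join is needed.
SELECT a.*
FROM zhihu_answer a
WHERE a.ym IN (201902);
```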
Probably helped by the fact that I had run it before, this statement came back almost instantly, within a few seconds.
I then ran the slow SQL once more and watched its execution log, and found that the time was being spent in Stage-4!!! Going back to the EXPLAIN output above, I saw that Hive was doing a full table scan! Why?
Why a full table scan? Because Hive still has to perform a join: an in (select ...) subquery is executed as a semi join, and to carry out that join Hive ends up scanning the whole table. My explanation is not very rigorous,
but it roughly made sense to me why Hive scans the full table here. If the value is hard-coded instead, for example where ym in (201902), there is no join at all; the partition filter is a constant, so the full table scan goes away. So the real fix is to first
obtain the value 201902
and then splice it into the Hive SQL as a literal. From what I could find, Hive does not currently support using the result of one SQL query as a variable in another SQL statement, so for now this workaround will have to do.
To my great delight, the revised SQL ran many times faster because the full table scan was avoided: the job went from about two hours down to a few minutes!
Summary
- Use EXPLAIN to print the execution plan and see what stages it contains;
- Use the execution log to find out which stage is taking the most time.