If one of the tables in a join is small, you can enable a map join to speed it up.
Dynamic Partition Pruning for Hive Map Joins
You can enable dynamic partition pruning for map joins when you are running Hive on Spark (HoS); it is not available for Hive on MapReduce.
Dynamic partition pruning (DPP) is a database optimization that can significantly decrease the amount of data that a query scans, thereby executing your workloads faster.
DPP achieves this by dynamically determining which partitions of a partitioned table a query actually needs to read and eliminating the rest.
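As a minimal sketch, assuming the standard Hive on Spark properties are available in your Hive version, DPP for map joins could be switched on per session like this:

```sql
-- Sketch only: enabling DPP for Hive on Spark in a session.
-- hive.spark.dynamic.partition.pruning turns DPP on for HoS; the
-- .map.join.only variant restricts pruning to joins that can be
-- converted to map joins (verify both properties exist in your build
-- with SET <property>; before relying on them).
SET hive.spark.dynamic.partition.pruning=true;
SET hive.spark.dynamic.partition.pruning.map.join.only=true;
```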
Map joins also optimize how Hive executes queries. They cause a small table to be scanned and loaded into memory as a hash table
so that a fast join can be performed entirely within a mapper, without requiring a separate reduce step.
If you have queries that join small tables, map joins can make them execute much faster.
Map joins are enabled by default in CDH with the Enable MapJoin Optimization setting for HiveServer2 in Cloudera Manager.
Hive automatically uses map joins for join queries that involve a set of tables where:
- There is one large table and there is no limit on the size of that large table.
- All other tables involved in the join must have an aggregate size under the value set for Hive Auto Convert Join Noconditional Size for HiveServer2, which is set to 20MB by default in Cloudera Manager.
Configuration for map-side joins:
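As a sketch, these are the session-level properties involved; the values shown mirror the CDH defaults described above and should be checked against your own cluster:

```sql
-- Sketch only: map-join related properties set at the session level.
SET hive.auto.convert.join=true;                             -- let Hive rewrite eligible joins as map joins
SET hive.auto.convert.join.noconditionaltask=true;           -- convert directly, without a conditional task
SET hive.auto.convert.join.noconditionaltask.size=20971520;  -- 20MB aggregate limit for the small tables
```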
A Tuning Case Study
Recently, during an ETL run, I noticed one SQL statement that took an extremely long time to execute even though the data volume was actually small. Why so long? Out of sheer curiosity, and determined not to let it go, I set out to get to the bottom of it. The SQL looked like this:
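The original statement is not reproduced here; judging from the discussion that follows, it had roughly this shape (the column list and exact predicates are hypothetical):

```sql
-- Hypothetical reconstruction of the slow statement: the large, partitioned
-- zhihu_answer table is filtered by the partition values present in the
-- small increment table via an IN subquery.
SELECT a.*
FROM zhihu_answer a
WHERE a.ym IN (SELECT DISTINCT(ym) FROM zhihu_answer_increment);
```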
First, some background on the two tables:
zhihu_answer holds roughly 100 million rows, while zhihu_answer_increment holds only a few hundred thousand.
First, I used EXPLAIN EXTENDED to inspect the execution plan:
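For reference, the plan is obtained by prefixing the statement with EXPLAIN EXTENDED (the query body below is the hypothetical reconstruction from above):

```sql
-- EXPLAIN EXTENDED prints the full plan, including per-stage operators
-- and the partitions/paths each stage will read.
EXPLAIN EXTENDED
SELECT a.*
FROM zhihu_answer a
WHERE a.ym IN (SELECT DISTINCT(ym) FROM zhihu_answer_increment);
```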
I stared at the output blankly; at first I could not make sense of it at all.
Then I tried running the subquery on its own: select distinct(ym) from zhihu_answer_increment;
It returned in under two minutes. So why did the combined statement take so long??
The result of that subquery was simply "201902". I copied this value into the outer query in place of the subquery and ran it:
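Again as a hypothetical sketch, this is the same query with the subquery replaced by the literal value:

```sql
-- Hypothetical: the subquery result inlined as a literal. With a constant
-- predicate on the partition column, Hive can prune partitions at compile
-- time and no join is needed.
SELECT a.*
FROM zhihu_answer a
WHERE a.ym IN (201902);
```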
Probably helped by the fact that I had run it before, this statement came back almost instantly, within a few seconds.
I then ran the slow SQL once more and watched its execution log, and found that the time was being spent in Stage-4!!! Going back to the EXPLAIN output above, I saw that Hive was doing a full table scan! Why?
Why a full table scan? Because Hive still has to perform a join: an in (select ...) subquery is executed as a semi join, and to carry out that join Hive ends up scanning the whole table. My explanation is not very rigorous,
but it roughly made sense to me why Hive scans the full table here. If the value is hard-coded instead, for example where ym in (201902), there is no join at all; the partition filter is a constant, so the full table scan goes away. So the real fix is to first
obtain the value 201902
and then splice it into the Hive SQL as a literal. From what I could find, Hive does not currently support using the result of one SQL query as a variable in another SQL statement, so for now this workaround will have to do.
To my great delight, the revised SQL ran many times faster because the full table scan was avoided: the job went from about two hours down to a few minutes!
Summary
- Use EXPLAIN to print the execution plan and see what stages it contains;
- Use the execution log to find out which stage is taking the most time.