Hive bucket和partition的区别

Hive partition和bucket的区别

翻译文

为了更好地阐述partition和bucket的区别，我们先看看数据是怎么保存在Hive上面的。比如，你有一个表：

CREATE TABLE mytable ( 
         name string,
         city string,
         employee_id int ) 
PARTITIONED BY (year STRING, month STRING, day STRING) 
CLUSTERED BY (employee_id) INTO 256 BUCKETS

然后，Hive会将数据保存为如下的层级：
/user/hive/warehouse/mytable/y=2015/m=12/d=02
所以，用partition的时候必须小心，因为如果你用employee_id来partition的话，如果有上百万个employee，那么你会看到有上百万个目录被创建在你的系统上。
“cardinality” 这个术语被用来表示不同字段值的数量。比如，你有country这个字段，世界上会有300个国家，所以这里的”cardinality”是300这样。
对于像’timestamp_ms’的字段，它的’cardinality’会有几十亿。
总的来说，当我们用partition的时候，不要用在cardinality很高的字段上。因为它会导致生成太多的目录。

说说bucket了，在指定了bucket数后，会使得文件的数量固定。Hive会做的是计算字段的hash值，然后分发一个记录给那个bucket。
但是比如说你用总共256个bucket在一个较低的cardinality的字段上会发生什么呢？（比如，美国的州，只有50个）
只有50个bucket有数据，其它206个bucket是没有数据的。

有人已经提到partition可以极大地减少查询数据的量。
那在我这个例子中，如果你想在某个日期上进一步查询，对 yearn/month/day的partition会大大地减少IO。
我觉得有人已经提到bucket可以加速和其它恰好在同一个bucket的表的join操作。那么在我的案例中，如果你正在通过employee_id来join两个表，
Hive能够在每个每个bucket地内部进行join（如果它们已经通过employee_id排序好的话，效果会更好！）

总结，
bucket在字段有很高的cardinality，和在数据在不同bucket中均匀地分布的时候，会表现出优越性。
partition在cardinality partition的字段不是很多的时候，会表现出优越性。

另外，你可以按顺序同时partiton多个字段，比如（yearn/month/day），但bucket只能取一个字段。

原文

There are a few details missing from the previous explanations. To better understand how partitioning and bucketing works,
 you should look at how data is stored in hive. Let's say you have a table
CREATE TABLE mytable ( 
         name string,
         city string,
         employee_id int ) 
PARTITIONED BY (year STRING, month STRING, day STRING) 
CLUSTERED BY (employee_id) INTO 256 BUCKETS
then hive will store data in a directory hierarchy like
/user/hive/warehouse/mytable/y=2015/m=12/d=02
So, you have to be careful when partitioning, because if you for instance partition by employee_id and you have millions of employees, 
you'll end up having millions of directories in your file system. The term 'cardinality' refers to the number of possible value a field can have. 
For instance, if you have a 'country' field, the countries in the world are about 300, so cardinality would be ~300. 
For a field like 'timestamp_ms', which changes every millisecond, cardinality can be billions. 
In general, when choosing a field for partitioning, it should not have a high cardinality, because you'll end up with way too many directories in your file system.
Clustering aka bucketing on the other hand, will result with a fixed number of files, since you do specify the number of buckets. 
What hive will do is to take the field, calculate a hash and assign a record to that bucket. 
But what happens if you use let's say 256 buckets and the field you're bucketing on has a low cardinality (for instance, it's a US state, so can be only 50 different values) ? 
You'll have 50 buckets with data, 
and 206 buckets with no data.
Someone already mentioned how partitions can dramatically cut the amount of data you're querying. 
So in my example table, if you want to query only from a certain date forward, the partitioning by year/month/day is going to dramatically cut the amount of IO. 
I think that somebody also mentioned how bucketing can speed up joins with other tables that have exactly the same bucketing, 
so in my example, if you're joining two tables on the same employee_id, 
hive can do the join bucket by bucket (even better if they're already sorted by employee_id since it's going to mergesort parts that are already sorted, 
which works in linear time aka O(n) ).
So, bucketing works well when the field has high cardinality and data is evenly distributed among buckets. 
Partitioning works best when the cardinality of the partitioning field is not too high.
Also, you can partition on multiple fields, with an order (year/month/day is a good example), while you can bucket on only one field.

Reference

https://stackoverflow.com/questions/19128940/what-is-the-difference-between-partitioning-and-bucketing-a-table-in-hive