Generating Unique Numeric IDs in a Distributed Fashion with Hive, MapReduce, and Spark

Author: Anonymous 2017-04-12 09:29:02

Big Data, Distributed, Spark

In Spark, generating non-consecutive unique numeric IDs is very simple: just call zipWithUniqueId().


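In real-world workloads, you often need to generate unique numeric IDs in Hive, MapReduce, or Spark.

Common approaches are:

1. In MapReduce, use a single Reduce task to generate the IDs;

2. In Hive, use the row_number window function, which in fact also runs in a single Reduce task;

3. Use a counter provided by another system such as HBase, Redis, or ZooKeeper.

When the data volume is small, approaches 1 and 2 work fine, but with a huge data volume a single Reduce task becomes very slow.

When the data volume is very large and you only need unique numeric IDs (note: not "consecutive unique numeric IDs"), consider the method described in this article. If you do need consecutive IDs, use approach 3.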

In Spark, generating such non-consecutive unique numeric IDs is very simple: just call zipWithUniqueId().
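To make the numbering concrete, here is a minimal plain-Java simulation of the scheme zipWithUniqueId() follows: the item at position pos in partition k of n partitions gets ID pos * n + k. It does not call Spark; the class name, method names, and partition sizes are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java simulation of the numbering scheme; it does not use the Spark API.
class ZipWithUniqueIdSketch {

    // ID of the item at 0-based position `pos` inside partition `partitionIndex`,
    // when there are `numPartitions` partitions in total.
    static long uniqueId(int partitionIndex, long pos, int numPartitions) {
        return pos * numPartitions + partitionIndex;
    }

    // Assigns an ID to every item, given how many items each partition holds.
    static List<Long> assignIds(int[] partitionSizes) {
        int n = partitionSizes.length;
        List<Long> ids = new ArrayList<>();
        for (int k = 0; k < n; k++) {
            for (long pos = 0; pos < partitionSizes[k]; pos++) {
                ids.add(uniqueId(k, pos, n));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical, deliberately uneven partition sizes.
        List<Long> ids = assignIds(new int[] {3, 1, 2});
        Set<Long> distinct = new HashSet<>(ids);
        System.out.println(ids);                           // [0, 3, 6, 1, 2, 5]
        System.out.println(distinct.size() == ids.size()); // true
    }
}
```

With uneven partitions the IDs stay unique but leave gaps (4 is never issued here), which is exactly the "unique but not consecutive" property.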

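Following the approach of zipWithUniqueId(), the same thing can be implemented in MapReduce and Hive as follows:

In Spark, zipWithUniqueId uses the partition index as the starting ID of each partition, and within a partition the ID grows by a step equal to the number of partitions of the RDD. The same idea carries over to MapReduce and Hive: the number of Spark partitions corresponds to the number of Map tasks, and the Spark partition index corresponds to the Map task ID. The number of Map tasks can be obtained from getNumMapTasks() on the JobConf, and the Map task ID from the mapred.task.id parameter, whose value looks like attempt_1478926768563_0537_m_000004_0. Take the 4 from m_000004_0 and add 1 to get the starting ID for that Map task. Note: both values are only available while the job is running. Also, as the figure shows, the amount of data in each partition/Map task is not exactly the same, so the generated IDs are not consecutive.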
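The extraction step described above can be sketched in isolation. This is a hedged illustration, not the article's implementation: the class TaskIdSketch, its methods, and the regular expression are assumptions made for this example, and the pattern only covers attempt strings shaped like the one quoted in the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: derives the starting ID from a task attempt string.
class TaskIdSketch {

    // Shape assumed from the example in the text:
    // attempt_<cluster id>_<job number>_<m|r>_<task index>_<attempt number>
    private static final Pattern ATTEMPT =
            Pattern.compile("attempt_[^_]+_\\d+_[mr]_(\\d+)_\\d+");

    // Returns the 0-based task index, e.g. 4 for "..._m_000004_0".
    static int taskIndex(String attemptId) {
        Matcher m = ATTEMPT.matcher(attemptId);
        if (!m.matches()) {
            throw new IllegalArgumentException("malformed task attempt id: " + attemptId);
        }
        return Integer.parseInt(m.group(1));
    }

    // Starting ID for the task: task index + 1, so that IDs begin at 1.
    static long startId(String attemptId) {
        return taskIndex(attemptId) + 1L;
    }

    public static void main(String[] args) {
        System.out.println(startId("attempt_1478926768563_0537_m_000004_0")); // 5
    }
}
```

Each task then emits startId, startId + numMapTasks, startId + 2 * numMapTasks, and so on, where numMapTasks is the step obtained from the job configuration.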

The following UDF can be used directly in Hive:

package com.lxw1234.hive.udf; 
  
import org.apache.hadoop.hive.ql.exec.MapredContext; 
import org.apache.hadoop.hive.ql.exec.UDFArgumentException; 
import org.apache.hadoop.hive.ql.metadata.HiveException; 
import org.apache.hadoop.hive.ql.udf.UDFType; 
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF; 
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector; 
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory; 
import org.apache.hadoop.io.LongWritable; 
  
@UDFType(deterministic = false, stateful = true) 
public class RowSeq2 extends GenericUDF { 
     
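    // Reused output holder; an instance field so that UDF instances do not share state.
    private final LongWritable result = new LongWritable();
    private static final char SEPARATOR = '_';
    private static final String ATTEMPT = "attempt";
    private long initID = 0L;   // next ID to hand out; starts at (task index + 1)
    private int increment = 0;  // step between IDs; equals the number of Map tasks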
     
     
    @Override 
    public void configure(MapredContext context) { 
        increment = context.getJobConf().getNumMapTasks(); 
        if(increment == 0) { 
            throw new IllegalArgumentException("mapred.map.tasks is zero"); 
        } 
         
        initID = getInitId(context.getJobConf().get("mapred.task.id"),increment); 
        if(initID == 0l) { 
            throw new IllegalArgumentException("mapred.task.id"); 
        } 
         
        System.out.println("initID : " + initID + "  increment : " + increment); 
    } 
     
    @Override 
    public ObjectInspector initialize(ObjectInspector[] arguments) 
            throws UDFArgumentException { 
        return PrimitiveObjectInspectorFactory.writableLongObjectInspector; 
    } 
  
    @Override 
    public Object evaluate(DeferredObject[] arguments) throws HiveException { 
        result.set(getValue()); 
        increment(increment); 
        return result; 
    } 
     
    @Override 
    public String getDisplayString(String[] children) { 
        return "RowSeq-func()"; 
    } 
     
    private synchronized void increment(int incr) { 
        initID += incr; 
    } 
     
    private synchronized long getValue() { 
        return initID; 
    } 
     
    // e.g. attempt_1478926768563_0537_m_000004_0 -> returns 4 + 1 = 5
    private long getInitId (String taskAttemptIDstr,int numTasks) 
            throws IllegalArgumentException { 
        try { 
            String[] parts = taskAttemptIDstr.split(Character.toString(SEPARATOR)); 
            if(parts.length == 6) { 
                if(parts[0].equals(ATTEMPT)) { 
                    if(!parts[3].equals("m") && !parts[3].equals("r")) { 
                        throw new Exception(); 
                    } 
                    long result = Long.parseLong(parts[4]); 
                    if(result >= numTasks) { //if taskid >= numtasks 
                        throw new Exception("TaskAttemptId string : " + taskAttemptIDstr 
                                + "  parse ID [" + result + "] >= numTasks[" + numTasks + "] .."); 
                    } 
                    return result + 1; 
                } 
            } 
        } catch (Exception e) {
            // Deliberately ignored: every failure is reported by the exception thrown below.
        }
        throw new IllegalArgumentException("TaskAttemptId string : " + taskAttemptIDstr 
                + " is not properly formed"); 
    } 
     
}

Suppose there is a table of deduplicated user IDs (string type), and each user ID needs a unique numeric seq:

ADD jar file:///tmp/udf.jar; 
CREATE temporary function seq2 as 'com.lxw1234.hive.udf.RowSeq2'; 
  
hive> desc lxw_all_ids; 
OK 
id                      string                                       
Time taken: 0.074 seconds, Fetched: 1 row(s) 
hive> select * from lxw_all_ids limit 5; 
OK 
01779E7A06ABF5565A4982_cookie 
031E2D2408C29556420255_cookie 
03371ADA0B6E405806FFCD_cookie 
0517C4B701BC1256BFF6EC_cookie 
05F12ADE0E880455931C1A_cookie 
Time taken: 0.215 seconds, Fetched: 5 row(s) 
hive> select count(1) from lxw_all_ids; 
253402337 
  
hive> create table lxw_all_ids2 as select id,seq2() as seq from lxw_all_ids; 
… 
Hadoop job information for Stage-1: number of mappers: 27; number of reducers: 0 
…

The job used 27 Map tasks and no Reduce tasks, so it produces 27 result files.

Now look at the data in the result table:

hive> select * from lxw_all_ids2 limit 10; 
OK 
766CA2770527B257D332AA_cookie   1 
5A0492DB0000C557A81383_cookie   28 
8C06A5770F176E58301EEF_cookie   55 
6498F47B0BCAFE5842B83A_cookie   82 
6DA33CB709A23758428A44_cookie   109 
B766347B0D27925842AC2D_cookie   136 
5794357B050C99584251AC_cookie   163 
81D67A7B011BEA5842776C_cookie   190 
9D2F8EB40AEA525792347D_cookie   217 
BD21077B09F9E25844D2C1_cookie   244 
  
hive> select count(1),count(distinct seq) from lxw_all_ids2; 
253402337       253402337

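limit 10 only read 10 rows from the first result file, that is, the one written by the Map task with ID 0. In that Map task, start=1 and increment=27, which yields the IDs shown above.

count(1) and count(distinct seq) return the same value, so seq contains no duplicates. You can also try max(seq): it is at least 253402337, and in practice larger, because the Map tasks do not process perfectly balanced numbers of rows. This confirms that seq is a "non-consecutive unique numeric ID".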
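The numbers in the listing, and the reason max(seq) overshoots the row count, can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the class and method names are hypothetical, and the row count of 10,000,000 for the last Map task is an assumed figure, not one taken from the job above.

```java
import java.util.Arrays;

// Hypothetical helper reproducing the ID arithmetic of a single Map task.
class MapperSeqSketch {

    // First `count` IDs emitted by the 0-based Map task `taskIndex`
    // when the job runs `numMapTasks` Map tasks.
    static long[] firstIds(int taskIndex, int numMapTasks, int count) {
        long[] ids = new long[count];
        long next = taskIndex + 1L;       // start value: task index + 1
        for (int i = 0; i < count; i++) {
            ids[i] = next;
            next += numMapTasks;          // step: number of Map tasks
        }
        return ids;
    }

    // Largest ID a task emits after processing `rows` rows (rows >= 1).
    static long lastId(int taskIndex, int numMapTasks, long rows) {
        return (taskIndex + 1L) + (rows - 1) * numMapTasks;
    }

    public static void main(String[] args) {
        // Map task 0 of 27: matches the seq column in the listing.
        System.out.println(Arrays.toString(firstIds(0, 27, 10)));
        // [1, 28, 55, 82, 109, 136, 163, 190, 217, 244]

        // Assumed row count for the last task (index 26): 10,000,000 rows.
        System.out.println(lastId(26, 27, 10_000_000L)); // 270000000
    }
}
```

Under that assumed row count, a single task already issues an ID of 270000000, above the table's 253402337 rows, which is how an unbalanced split pushes max(seq) past count(1).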

Editor: Wu Xiaoyan. Source: lxw的大数据田地
