成人免费xxxxx在线视频软件_久久精品久久久_亚洲国产精品久久久_天天色天天色_亚洲人成一区_欧美一级欧美三级在线观看

<noscript id="8aysm"></noscript>

<noscript id="8aysm"><object id="8aysm"></object></noscript>

<option id="8aysm"><pre id="8aysm"></pre></option>

<noscript id="8aysm"><bdo id="8aysm"></bdo></noscript>

<menu id="8aysm"><th id="8aysm"></th></menu>

鴻蒙開發者社區

公眾號矩陣

移動端

視頻課免費課排行榜短視頻直播課軟考學堂

全部課程軟考信創認證華為認證廠商認證 IT技術 PMP項目管理免費題庫

文章資源問答課堂專欄直播

51CTO

鴻蒙開發者社區

51CTO技術棧

51CTO官微

51CTO學堂

51CTO博客

CTO訓練營

鴻蒙開發者社區訂閱號

51CTO軟考

51CTO學堂APP

51CTO學堂企業版APP

鴻蒙開發者社區視頻號

51CTO軟考題庫

賬號設置退出

pyspark訪問hive數據實戰

作者：aibati2008 2017-03-13 09:48:26

大數據 Spark

直接進行spark開發需要去學習scala，為了降低數據分析師的學習成本，決定前期先試用sparkSQL，能夠讓計算引擎無縫從MR切換到spark，現在主要使用pyspark訪問hive數據。

數據分析都是直接使用hive腳本進行調用，隨著APP用戶行為和日志數據量的逐漸累積，跑每天的腳本運行需要花的時間越來越長，雖然進行了sql優化，但是上spark已經提上日程。

直接進行spark開發需要去學習scala，為了降低數據分析師的學習成本，決定前期先試用sparkSQL，能夠讓計算引擎無縫從MR切換到spark，現在主要使用pyspark訪問hive數據。

以下是安裝配置過程中的詳細步驟：

1.安裝spark

需要先安裝JDK和scala，這不必多說，由于現有hadoop集群版本是采用的2.6.3，所以spark版本是下載的穩定版本spark-1.4.0-bin-hadoop2.6.tgz

我是先在一臺機器上完成了Spark的部署，Master和Slave都在一臺機器上。注意要配置免秘鑰ssh登陸。

1.1 環境變量配置

export JAVA_HOME=/usr/jdk1.8.0_73 
export HADOOP_HOME=/usr/hadoop 
export HADOOP_CONF_DIR=/usr/hadoop/etc/hadoop 
export SCALA_HOME=/usr/local/scala-2.11.7 
export SPARK_HOME=/home/hadoop/spark_folder/spark-1.4.0-bin-hadoop2.6 
export SPARK_MASTER_IP=127.0.0.1 
export SPARK_MASTER_PORT=7077 
export SPARK_MASTER_WEBUI_PORT=8099 
  
export SPARK_WORKER_CORES=3     //每個Worker使用的CPU核數 
export SPARK_WORKER_INSTANCES=1   //每個Slave中啟動幾個Worker實例 
export SPARK_WORKER_MEMORY=10G    //每個Worker使用多大的內存 
export SPARK_WORKER_WEBUI_PORT=8081 //Worker的WebUI端口號 
export SPARK_EXECUTOR_CORES=1       //每個Executor使用使用的核數 
export SPARK_EXECUTOR_MEMORY=1G     //每個Executor使用的內存 
 
export HIVE_HOME=/home/hadoop/hive 
export SPARK_CLASSPATH=$HIVE_HOME/lib/mysql-connector-java-5.1.31-bin.jar:$SPARK_CLASSPATH 
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:$HADOOP_HOME/lib/native

1.2 配置slaves

cp slaves.template slaves 
vi slaves 添加以下內容：localhost

1.3 啟動master和slave

cd $SPARK_HOME/sbin/ 
./start-master.sh 
 
啟動日志位于 $SPARK_HOME/logs/目錄，訪問 http://localhost:8099，即可看到Spark的WebUI界面 
 
執行 ./bin/spark-shell,打開Scala到Spark的連接窗口

2.SparkSQL與Hive的整合

拷貝$HIVE_HOME/conf/hive-site.xml和hive-log4j.properties到 $SPARK_HOME/conf/  
在$SPARK_HOME/conf/目錄中，修改spark-env.sh，添加  
export HIVE_HOME=/home/hadoop/hive 
export SPARK_CLASSPATH=$HIVE_HOME/lib/mysql-connector-java-5.1.31-bin.jar:$SPARK_CLASSPATH 
另外也可以設置一下Spark的log4j配置文件，使得屏幕中不打印額外的INFO信息(如果不想受干擾可設置為更高):  
log4j.rootCategory=WARN, console  
進入$SPARK_HOME/bin，執行 ./spark-sql –master spark://127.0.0.1:7077 進入spark-sql CLI: 
[hadoop@hadoop spark]$ bin/spark-sql --help   
Usage: ./bin/spark-sql [options] [cli option]   
CLI options:   
 -d,--define <keykey=value>          Variable subsitution to apply to hive   
                                  commands. e.g. -d A=B or --define A=B   
    --database <databasename>     Specify the database to use   
 -e <quoted-query-string>         SQL from command line   
 -f <filename>                    SQL from files   
 -h <hostname>                    connecting to Hive Server on remote host   
    --hiveconf <propertyproperty=value>   Use value for given property   
    --hivevar <keykey=value>         Variable subsitution to apply to hive   
                                  commands. e.g. --hivevar A=B   
 -i <filename>                    Initialization SQL file   
 -p <port>                        connecting to Hive Server on port number   
 -S,--silent                      Silent mode in interactive shell   
 -v,--verbose                     Verbose mode (echo executed SQL to the   
                                  console)

需要注意的是CLI不是使用JDBC連接，所以不能連接到ThriftServer;但可以配置conf/hive-site.xml連接到hive的metastore，然后對hive數據進行查詢。下面我們接著說如何在python中連接hive數據表查詢。

3.配置pyspark和示例代碼

3.1 配置pyspark

打開/etc/profile: 
        #PythonPath 將Spark中的pySpark模塊增加的Python環境中 
         export PYTHONPATH=/opt/spark-hadoop/python 
        source /etc/profile

執行./bin/pyspark ,打開Python到Spark的連接窗口，確認沒有報錯。

打開命令行窗口，輸入python,Python版本為2.7.6，如圖所示，注意Spark暫時不支持Python3。輸入import pyspark不報錯，證明開發前工作已經完成。

3.2 啟動ThriftServer

啟動ThriftServer，使之運行在spark集群中：

sbin/start-thriftserver.sh --master spark://localhost:7077 --executor-memory 5g

ThriftServer可以連接多個JDBC/ODBC客戶端，并相互之間可以共享數據。

3.3 請求示例

查看spark官方文檔說明，spark1.4和2.0對于sparksql調用hive數據的API變化并不大。都是用sparkContext 。

from pyspark import SparkConf, SparkContext 
from pyspark.sql import HiveContext 
 
conf = (SparkConf() 
         .setMaster("spark://127.0.0.1:7077") 
         .setAppName("My app") 
         .set("spark.executor.memory", "1g")) 
sc = SparkContext(conf = conf) 
sqlContext = HiveContext(sc) 
my_dataframe = sqlContext.sql("Select count(1) from logs.fmnews_dim_where") 
my_dataframe.show()

返回結果：

運行以后在webUI界面看到job運行詳情。

4.性能比較

截取了接近一個月的用戶行為數據，數據大小為2G，總共接近1600w條記錄。

為了測試不同sql需求情況下的結果，我們選取了日常運行的2類sql：

1.統計數據條數：

select count(1) from fmnews_user_log2;

2.統計用戶行為：

SELECT device_id, min_time FROM 
        (SELECT device_id,min(import_time) min_time FROM fmnews_user_log2 
            GROUP BY device_id)a 
        WHERE from_unixtime(int(substr(min_time,0,10)),'yyyy-MM-dd') = '2017-03-02';

3. 用戶行為分析：

select case when from_unixtime(int(substr(fmnews_time,0,10)),'HH:mm') between '06:00' and '07:59' then 1 
            when from_unixtime(int(substr(fmnews_time,0,10)),'HH:mm') between '08:00' and '09:59' then 2 
            when from_unixtime(int(substr(fmnews_time,0,10)),'HH:mm') between '10:00' and '11:59' then 3 
            when from_unixtime(int(substr(fmnews_time,0,10)),'HH:mm') between '12:00' and '13:59' then 4 
            when from_unixtime(int(substr(fmnews_time,0,10)),'HH:mm') between '14:00' and '15:59' then 5 
            when from_unixtime(int(substr(fmnews_time,0,10)),'HH:mm') between '16:00' and '17:59' then 6 
            when from_unixtime(int(substr(fmnews_time,0,10)),'HH:mm') between '18:00' and '19:59' then 7 
            when from_unixtime(int(substr(fmnews_time,0,10)),'HH:mm') between '20:00' and '21:59' then 8 
            when from_unixtime(int(substr(fmnews_time,0,10)),'HH:mm') between '22:00' and '23:59' then 9 
            else 0 end fmnews_time_type, count(distinct device_id) device_count,count(1) click_count 
       from fmcm.fmnews_user_log2 
     where from_unixtime(int(substr(import_time,0,10)),'yyyy-MM-dd') = '2017-03-02' 
    group by case when from_unixtime(int(substr(fmnews_time,0,10)),'HH:mm') between '06:00' and '07:59' then 1 
            when from_unixtime(int(substr(fmnews_time,0,10)),'HH:mm') between '08:00' and '09:59' then 2 
            when from_unixtime(int(substr(fmnews_time,0,10)),'HH:mm') between '10:00' and '11:59' then 3 
            when from_unixtime(int(substr(fmnews_time,0,10)),'HH:mm') between '12:00' and '13:59' then 4 
            when from_unixtime(int(substr(fmnews_time,0,10)),'HH:mm') between '14:00' and '15:59' then 5 
            when from_unixtime(int(substr(fmnews_time,0,10)),'HH:mm') between '16:00' and '17:59' then 6 
            when from_unixtime(int(substr(fmnews_time,0,10)),'HH:mm') between '18:00' and '19:59' then 7 
            when from_unixtime(int(substr(fmnews_time,0,10)),'HH:mm') between '20:00' and '21:59' then 8 
            when from_unixtime(int(substr(fmnews_time,0,10)),'HH:mm') between '22:00' and '23:59' then 9 
            else 0 end;

第一條sql的執行結果對比：hive 35.013 seconds

第一條sql的執行結果對比：sparksql 1.218 seconds

第二條sql的執行結果對比：hive 78.101 seconds

第二條sql的執行結果對比：sparksql 8.669 seconds

第三條sql的執行結果對比：hive 101.228 seconds

第三條sql的執行結果對比：sparksql 14.221 seconds

可以看到，雖然沒有官網吹破天的100倍性能提升，但是根據sql的復雜度來看10~30倍的效率還是可以達到的。

不過這里要注意到2個影響因子：

1. 我們數據集并沒有采取全量，在數據量達到TB級別兩者的差距應該會有所減小。同時sql也沒有針對hive做優化。

2. spark暫時是單機(內存足夠)并沒有搭建集群，hive使用的hadoop集群有4臺datanode。

責任編輯：武曉燕來源： oschina博客

pyspark hive 數據

51CTO技術棧公眾號

業務
速覽

媒體

51CTO CIOAge HC3i

社區

51CTO博客鴻蒙開發者社區 AI.x社區

教育

51CTO學堂精培企業培訓 CTO訓練營

主站蜘蛛池模板：国产91在线视频 | 成人妇女免费播放久久久 | 日韩国产一区二区三区 | 天堂一区在线 | 女女百合av大片一区二区三区九县 | 久久国产精品一区二区三区 | 亚洲视频免费观看 | 四虎网站在线观看 | 成人精品国产免费网站 | 久久一级免费视频 | 盗摄精品av一区二区三区 | 国产成人99久久亚洲综合精品 | 欧洲av一区 | 久久综合一区二区三区 | 91视频久久| 日本中文字幕在线视频 | 久久国产精品视频观看 | 欧美视频一区 | 欧美精品一区二区免费视频 | 国产精品国产成人国产三级 | 中文字幕黄色大片 | 懂色中文一区二区在线播放 | 四虎影视一区二区 | 99re视频 | 中文成人在线 | 久久精品国产免费 | 欧美天堂 | 国产九九精品视频 | 国产精品久久久久久久久免费相片 | 亚洲精品免费观看 | 视频一区国产精品 | 成人免费在线电影 | 欧美在线 | 欧美日韩中文字幕在线 | 婷婷色成人 | 亚洲一区二区久久 | 久久精品在线免费视频 | 欧美综合精品 | 午夜小视频在线观看 | 国产视频亚洲视频 | 综合亚洲视频 |

<ul id="k4kqo"><noframes id="k4kqo"></noframes></ul>

<option id="k4kqo"><small id="k4kqo"></small></option>

<pre id="k4kqo"></pre>