A Small Demo: Scheduling spark-sql Jobs with Azkaban
1. Prepare the data
The table dm_action_log contains the following data:
| bdp_day  | action            | uv    |
| -------- | ----------------- | ----- |
| 20190101 | click             | 11173 |
| 20190101 | exit              | 11109 |
| 20190101 | install           | 11139 |
| 20190101 | launch            | 11083 |
| 20190101 | login             | 11220 |
| 20190101 | page_enter_h5     | 11016 |
| 20190101 | page_enter_native | 11076 |
| 20190101 | page_exit         | 11120 |
| 20190101 | register          | 11064 |
| 20190102 | click             | 11168 |

(only the first ten rows are shown)
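The aggregation later in this demo reads from a source table dw_action_log that the original post does not show. As a hedged sketch, its schema might look like the DDL below; the column names (userid, action, bdp_day) are inferred from the aggregation query, while the partitioning and storage format are assumptions:

```sql
-- Hypothetical DDL for the source table; inferred from the
-- insert statement in dm_action_log.hql, not given in the demo.
create table if not exists databasename.dw_action_log (
  userid string comment 'user id',
  action string comment 'action type'
)
partitioned by (bdp_day string comment 'day')
stored as parquet;
```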
2. Write the Azkaban job files
dm_action_log.sh:

```bash
#!/bin/bash
/usr/local/spark/bin/spark-sql --master yarn --name action_launch_job -S -f dm_action_log.hql
```
dm_action_log.job:

```properties
type=command
command=sh dm_action_log.sh
```
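Azkaban can also chain jobs into a flow via the `dependencies` property. As a hedged sketch (the downstream job name and script are hypothetical, not part of this demo), a job that runs only after dm_action_log succeeds would look like:

```properties
# dm_action_report.job -- hypothetical downstream job
type=command
dependencies=dm_action_log
command=sh dm_action_report.sh
```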
dm_action_log.hql:

```sql
use databasename;

create table if not exists databasename.dm_action_log_bak (
  bdp_day string comment 'day',
  action  string comment 'action type',
  uv      int    comment 'unique visitors'
)
stored as parquet;

-- Note: count(userid) counts rows; use count(distinct userid)
-- if uv should be de-duplicated per user.
insert overwrite table databasename.dm_action_log_bak
select bdp_day, action, count(userid) as uv
from databasename.dw_action_log
group by bdp_day, action;
```
Note: these files must use Unix (LF) line endings. In Notepad++, for example, use Edit → EOL Conversion → Unix (LF).
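If the files were edited on Windows, the carriage returns can also be stripped from the shell. A minimal sketch using sed (dos2unix works too, if installed); demo.job below is a throwaway file created purely for illustration — in practice run the sed line on your real dm_action_log.sh, dm_action_log.job, and dm_action_log.hql:

```shell
#!/bin/bash
set -e
# Create a throwaway file with Windows CRLF line endings.
printf 'type=command\r\ncommand=sh dm_action_log.sh\r\n' > demo.job
# Strip the trailing CR from every line, converting CRLF to LF in place.
sed -i 's/\r$//' demo.job
# Report the result of the conversion.
if grep -q $'\r' demo.job; then echo "still CRLF"; else echo "converted to LF"; fi
```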
3. Configure the spark-sql runtime environment
3.1 Modify yarn-site.xml and add two properties
```xml
<!-- Disable the virtual-memory check, which otherwise kills Spark-on-YARN containers -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<!-- Raise the virtual-to-physical memory ratio, another fix for the same Spark-on-YARN issue -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
  <description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>
```
3.2 Distribute the configuration files
- Copy the modified yarn-site.xml to every other node in the cluster.
- Copy core-site.xml and hdfs-site.xml from $HADOOP_HOME/etc/hadoop/ into $SPARK_HOME/conf/ (on every node, if you run a Spark cluster).
- Likewise copy hive-site.xml into $SPARK_HOME/conf/ on every Spark node.
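The copy steps above can be scripted. The sketch below is a dry run: the node names and install-path defaults are assumptions to adjust for your cluster, and the commands are collected and printed rather than executed — remove the final `printf` and run each line (e.g. via ssh/scp) to actually copy:

```shell
#!/bin/bash
# Dry-run sketch of distributing the config files from section 3.2.
# NODES and the fallback paths are assumptions; adjust to your cluster.
NODES="node2 node3"
HADOOP_CONF="${HADOOP_HOME:-/usr/local/hadoop}/etc/hadoop"
SPARK_CONF="${SPARK_HOME:-/usr/local/spark}/conf"
HIVE_CONF="${HIVE_HOME:-/usr/local/hive}/conf"
CMDS=""
for node in $NODES; do
  # yarn-site.xml goes to the other Hadoop nodes.
  CMDS+="scp $HADOOP_CONF/yarn-site.xml $node:$HADOOP_CONF/"$'\n'
  # core-site.xml, hdfs-site.xml and hive-site.xml go into Spark's conf dir.
  CMDS+="scp $HADOOP_CONF/core-site.xml $HADOOP_CONF/hdfs-site.xml $HIVE_CONF/hive-site.xml $node:$SPARK_CONF/"$'\n'
done
printf '%s' "$CMDS"
```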
4. Start the Hadoop and Spark clusters
5. Start Azkaban and create the project
6. Upload the zip package and execute the task
Package dm_action_log.sh, dm_action_log.job, and dm_action_log.hql into a zip archive, upload it through the Azkaban web UI, and run the flow.