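We run EMR clusters on Amazon Web Services (AWS), using the default Amazon Linux AMI image without any customization. It looks to me as if dhclient-script is picking up configuration from our company's DHCP (Dynamic Host Configuration Protocol) server, in particular the NTP (Network Time Protocol) settings.
For example, on the master node dhclient-script has appended our company's NTP servers to /etc/ntp.conf: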
[hadoop@ip-10-5-21-157 ~]$ grep ^server /etc/ntp.conf
server 0.amazon.pool.ntp.org iburst
server 1.amazon.pool.ntp.org iburst
server 2.amazon.pool.ntp.org iburst
server 3.amazon.pool.ntp.org iburst
server 10.2.78.21 # added by /sbin/dhclient-script
server 10.2.78.22 # added by /sbin/dhclient-script
server 10.2.78.23 # added by /sbin/dhclient-script
server 10.2.78.24 # added by /sbin/dhclient-script
The IP addresses 10.2.78.21-24 resolve to clockNN.ntp.mycompany.com.
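To see which entries came from DHCP rather than from the AMI defaults, one option is to filter on the trailing comment that dhclient-script appends. This is only a minimal sketch: the helper name `list_dhcp_ntp_servers` is made up for illustration, and it assumes the comment text is exactly as shown in the listing above.

```shell
#!/bin/bash
# Hypothetical helper (not part of any AWS or ntp tooling): print the server
# addresses that dhclient-script appended to an ntp.conf-style file.
list_dhcp_ntp_servers() {
    local conf="${1:-/etc/ntp.conf}"
    # Keep only "server" lines tagged by dhclient-script and print the address.
    awk '/^server / && /added by \/sbin\/dhclient-script/ { print $2 }' "$conf"
}

# Usage on a node: list_dhcp_ntp_servers /etc/ntp.conf
```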
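How can we avoid this, so that only Amazon's default servers are used?
Edit: We ran into problems when running Pig aggregations on the EMR cluster. An example exception stack trace: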
18/01/07 13:50:23 INFO tez.TezJob: DAG Status: status=FAILED, progress=TotalTasks: 4737 Succeeded: 3777 Running: 0 Failed: 1 Killed: 959 FailedTaskAttempts: 428 KilledTaskAttempts: 309, diagnostics=Vertex failed, vertexName=scope-421, vertexId=vertex_1515326570070_0001_1_04, diagnostics=[Task failed, taskId=task_1515326570070_0001_1_04_002846, diagnostics=[TaskAttempt 0 failed, info=[Container launch failed for container_1515326570070_0001_01_000599 : org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1515332813920 found 1515330236564
Note: System times on machines may be out of sync. Check system time and time zones.
at sun.reflect.GeneratedConstructorAccessor51.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
at org.apache.tez.dag.app.launcher.TezContainerLauncherImpl$Container.launch(TezContainerLauncherImpl.java:160)
at org.apache.tez.dag.app.launcher.TezContainerLauncherImpl$EventProcessor.run(TezContainerLauncherImpl.java:353)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
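The root cause of the inaccurate system time on (some of) the EMR machines (VMs, images, nodes?) could be our company's NTP servers, but that is only a guess. One way to rule this out is to remove these NTP servers from /etc/ntp.conf and then reinitialize the system time.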
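To get a feel for the size of the discrepancy, the two millisecond timestamps from the error message can be subtracted. This is a rough sketch only: the gap is between the node's clock reading and the timestamp carried by the token, so it is an indication of a time problem rather than an exact measurement of clock offset between machines.

```shell
#!/bin/bash
# Timestamps (milliseconds since the epoch) copied from the YARN error message.
current_ms=1515332813920   # "current time is ..."
found_ms=1515330236564     # "found ..."

delta_ms=$(( current_ms - found_ms ))
delta_s=$(( delta_ms / 1000 ))
echo "gap: ${delta_ms} ms (~$(( delta_s / 60 )) min $(( delta_s % 60 )) s)"
# prints: gap: 2577356 ms (~42 min 57 s)
```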
Answer 1
After some research, I came up with the following.
Create the file modify_ntp_config.sh (it will be stored on S3):
#!/bin/bash
# Remove the DHCP-supplied 10.x NTP servers and resynchronize the clock.
set -eEu
ntp_config_file="${1:-example_config}"
echo "Removing 'server 10.*' entries from \"$ntp_config_file\""
sudo sed -i -e '/server 10.*/d' "$ntp_config_file"
echo "Reinitialize ntp"
# ntpd must be stopped while ntpdate sets the clock.
sudo service ntpd stop
sudo ntpdate -s time.nist.gov
sudo service ntpd start
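Before running the script against a real /etc/ntp.conf, the deletion pattern can be tried on a throwaway sample. The sketch below compares the pattern used in the script with a stricter variant; the stricter variant and the address 100.64.0.1 are assumptions added here for illustration and are not part of the original script.

```shell
#!/bin/bash
set -eu

# Throwaway sample; 100.64.0.1 is a made-up address used only for illustration.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
server 0.amazon.pool.ntp.org iburst
server 100.64.0.1 iburst
server 10.2.78.21 # added by /sbin/dhclient-script
EOF

# Pattern from the script: deletes any line containing "server 10".
loose="$(sed -e '/server 10.*/d' "$sample")"

# Stricter variant (an assumption, not in the original script):
# anchored at the start of the line, with a literal dot after "10".
strict="$(sed -e '/^server 10\./d' "$sample")"

echo "loose:";  echo "$loose"
echo "strict:"; echo "$strict"
rm -f "$sample"
```

With this sample, the pattern from the script also removes the 100.64.0.1 line, while the anchored variant removes only the 10.x entry. For the configuration shown in the question both give the same result.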
Copy this file to S3:
$ aws s3 cp /var/tmp/modify_ntp_config.sh \
s3://<s3-bucket-name>/data/scripts/modify_ntp_config.sh
Then run it as a step when creating the cluster with the AWS CLI:
aws emr create-cluster --name "..." [...cluster create options ...] \
--steps \
Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,\
Jar=s3://<region>.elasticmapreduce/libs/script-runner/script-runner.jar,\
Args=["s3://<s3-bucket-name>/data/scripts/modify_ntp_config.sh","/etc/ntp.conf"]
This produces the following log output (copied from S3 to local disk):
$ aws s3 cp --recursive s3://<s3-bucket-name>/log/<cluster-id>/steps/<step-id>/ /var/tmp/5HKO7
download: s3://[...]/stdout.gz to ../../var/tmp/5HKO7/stdout.gz
download: s3://[...]/stderr.gz to ../../var/tmp/5HKO7/stderr.gz
download: s3://[...]/controller.gz to ../../var/tmp/5HKO7/controller.gz
$ zcat /var/tmp/5HKO7/stdout.gz
Downloading 's3://<s3-bucket-name>/data/scripts/modify_ntp_config.sh' to '/mnt/var/lib/hadoop/steps/[...]/.'
Removing 'server 10.*' entries from "/etc/ntp.conf"
Reinitialize ntp
Shutting down ntpd: [ OK ]
Starting ntpd: [ OK ]
$ zcat /var/tmp/5HKO7/stderr.gz
Command exiting with ret '0'
Note: Alternatively, the script can be applied to an already running EMR cluster with aws emr add-steps:
$ aws emr add-steps --cluster-id "j-<emr_cluster_id>" \
--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,\
Jar=s3://<region>.elasticmapreduce/libs/script-runner/script-runner.jar,\
Args=["s3://<s3-bucket-name>/data/scripts/modify_ntp_config.sh","/etc/ntp.conf"]
References:
https://docs.aws.amazon.com/emr/latest/DeveloperGuide//emr-hadoop-script.html
https://docs.aws.amazon.com/cli/latest/reference/emr/add-steps.html
https://askubuntu.com/questions/254826/how-to-force-a-clock-update-using-ntp
https://unix.stackexchange.com/questions/158802/how-to-update-ntp-without-shutting-down-the-ntp-daemon