(I hope this question is a good fit for Server Fault; if not, please comment and I'll delete it.)
I'm trying to create a Docker image with Cassandra and Spark installed and configured to work together.
I've never used Spark (and never written a Dockerfile before), only Cassandra, so this is new territory for me.
I've written a Dockerfile that installs Spark, Cassandra, and Kafka. How do I now configure them in that Dockerfile so they actually work together?
There's Datastax's Spark-Cassandra connector... but I don't know what to do with it.
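From what I can tell from the connector's README, it isn't installed into the image at all but pulled in when Spark starts. A minimal sketch of what I think the invocation looks like (the connector version and the connection host are my guesses, untested):

pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 \
        --conf spark.cassandra.connection.host=127.0.0.1

Is that the right approach, or should the connector be baked into the image somehow?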
Here is my Dockerfile:
FROM centos:centos7
RUN yum -y update && yum -y clean all
# Install basic tools
RUN yum install -y wget dialog curl sudo lsof vim axel telnet nano openssh-server openssh-clients bzip2 passwd tar bc git unzip deltarpm
#Install Java
RUN yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel
# Install Python 3.6 - I used Anaconda2 instead to install it
#RUN yum install centos-release-scl -y
#RUN yum install rh-python36 -y
#RUN scl enable rh-python36 bash
#Create guest user. IMPORTANT: Change here UID 1000 to your host UID if you plan to share folders.
RUN useradd guest -u 1000
RUN echo guest | passwd guest --stdin
ENV HOME /home/guest
WORKDIR $HOME
USER guest
#Install Spark (Spark 2.4.0 - Nov 02, 2018, prebuilt for Hadoop 2.7 or higher)
RUN wget http://mirror.csclub.uwaterloo.ca/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
RUN tar xvf spark-2.4.0-bin-hadoop2.7.tgz
RUN mv spark-2.4.0-bin-hadoop2.7 spark
ENV SPARK_HOME $HOME/spark
#Install Kafka
RUN wget http://mirror.csclub.uwaterloo.ca/apache/kafka/2.1.0/kafka_2.12-2.1.0.tgz
RUN tar xvzf kafka_2.12-2.1.0.tgz
RUN mv kafka_2.12-2.1.0 kafka
ENV PATH $HOME/spark/bin:$HOME/spark/sbin:$HOME/kafka/bin:$PATH
#Install Anaconda Python distribution
RUN wget https://repo.continuum.io/archive/Anaconda2-4.4.0-Linux-x86_64.sh
RUN bash Anaconda2-4.4.0-Linux-x86_64.sh -b
ENV PATH $HOME/anaconda2/bin:$PATH
RUN conda install -c anaconda python
# RUN pip install --upgrade pip
#Install Kafka Python module
RUN pip install kafka-python
USER root
#Install Cassandra
ADD cassandra.repo /etc/yum.repos.d/datastax.repo
RUN yum install -y cassandra
#Environment variables for Spark and Java
ADD setenv.sh /home/guest/setenv.sh
RUN chown guest:guest setenv.sh
RUN echo . ./setenv.sh >> .bashrc
#Startup (start SSH, Cassandra, Zookeeper, Kafka producer)
ADD startup_script.sh /usr/bin/startup_script.sh
RUN chmod +x /usr/bin/startup_script.sh
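Instead of passing --packages on every invocation, my current guess is to append the connector settings to spark-defaults.conf at build time and run the startup script by default. Roughly like this (untested; the connector version is an assumption on my part):

#Point Spark at the local Cassandra node and pull in the connector (my guess, unverified)
RUN echo "spark.cassandra.connection.host 127.0.0.1" >> $SPARK_HOME/conf/spark-defaults.conf
RUN echo "spark.jars.packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.0" >> $SPARK_HOME/conf/spark-defaults.conf
#Run the startup script when the container starts
CMD ["/usr/bin/startup_script.sh"]

Is that how it's supposed to be done?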
The GitLab repo with the rest of the files is here: https://gitlab.com/HypeWolf/docker-cassandra-spark-kafka
The end goal is to have everything Cassandra and Spark offer usable inside a single container, while letting users pass in configuration files or environment variables to change certain settings.
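For example, I imagine users overriding settings along these lines (my-image is a placeholder; 9042 is the CQL native port, 4040 the Spark UI, and /etc/cassandra/conf/cassandra.yaml is where the yum package puts the config, if I'm not mistaken):

docker run -it -p 9042:9042 -p 4040:4040 \
    -v $(pwd)/cassandra.yaml:/etc/cassandra/conf/cassandra.yaml \
    my-image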