Airflow and Apache Spark are both open-source tools. Apache Spark, with 22.5K GitHub stars and 19.4K forks, appears to have broader adoption than Airflow, with 12.9K stars and 4.71K forks.
What Is Airflow?
Apache Airflow is one realization of the DevOps philosophy of "Configuration as Code." Airflow allows users to launch multi-step pipelines using a simple Python object, the DAG (Directed Acyclic Graph), and it offers a wide range of integrations for services ranging from Spark and HBase to services on various cloud providers.
Task dependencies are declared in Python code, for example:
# for Airflow
spark_job.set_upstream(src1_s3)
spark_job.set_upstream(src2_hdfs)
# alternatively, using set_downstream
src3_s3.set_downstream(spark_job)
Once you define your DAG and add it to the Airflow scheduler, the easiest way to work with it is through the web server.
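To make this concrete, here is a minimal sketch of a DAG definition. It is not taken from this project: the dag id, task ids, schedule and commands are made up for illustration, and the import paths follow the classic Airflow 1.x layout.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Illustrative DAG with two dummy shell tasks and one dependency
dag = DAG(
    dag_id="hello_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
load = BashOperator(task_id="load", bash_command="echo loading", dag=dag)

# "extract" must finish before "load" starts
extract.set_downstream(load)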
As Data Engineers, we commonly use Apache Spark and Apache Airflow in our daily routine (if you do not use them yet, you should try them) to overcome typical Data Engineering challenges, such as building pipelines that get data from somewhere, apply a series of transformations, and deliver it somewhere else.
In this article, I will share a guide on how to create a Data Engineering development environment containing a Spark Standalone Cluster, an Airflow server and a Jupyter Notebook instance.
In the Data Engineering context, Spark acts as the tool to process data (whatever you can think of as data processing), Airflow as the orchestration tool to build pipelines, and Jupyter Notebook as the environment to interactively develop Spark applications.
Think how amazing it would be if you could develop and test Spark applications integrated with Airflow pipelines on your own machine, without having to wait for someone to give you access to a development environment, share server resources with others using the same environment, or wait for that environment to be created if it does not exist yet.
With this in mind, I started searching for a way to create this environment without those dependencies but, unfortunately, I did not find a decent article explaining how to put these pieces together (or maybe I was not lucky with my googling).
- Airflow configured with LocalExecutor, meaning that all components (scheduler, webserver and executor) run on the same machine.
- Postgres to store Airflow metadata. A test database is also created inside Postgres in case you want to run pipelines in Airflow that write to or read from a Postgres database.
- Spark standalone cluster with 3 workers; you can configure more workers as explained further in this article.
- Jupyter notebook with Spark embedded to provide interactive Spark development.
Architecture components.
Below is a step-by-step process to get your environment running. The complete project can be found on GitHub here.
Prerequisites
Download images
- Spark image
$ docker pull bitnami/spark:latest
- Jupyter image
$ docker pull jupyter/pyspark-notebook:latest
- Postgres image
$ docker pull postgres:9.6
Download Git project
$ git clone https://github.com/cordon-thiago/airflow-spark
Build Airflow Image
$ cd airflow-spark/docker/docker-airflow
$ docker build --rm -t docker-airflow-spark:latest .
Check your images
$ docker images
Start containers
$ cd airflow-spark/docker
$ docker-compose up -d
At this moment you will have an output like below and your stack will be running :).
Access applications
- Spark Master: http://localhost:8181
- Airflow: http://localhost:8282
- Postgres DB Airflow: Server: localhost, port: 5432, User: airflow, Password: airflow
- Postgres DB Test: Server: localhost, port: 5432, User: test, Password: postgres
- Jupyter notebook: you need to run the code below to get the URL + Token generated and paste in your browser to access the notebook UI.
$ docker logs -f docker_jupyter-spark_1
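If you want to sanity-check the Postgres connections listed above from Python, the snippet below is a small sketch (not part of the project): it assumes the psycopg2 package is installed on your machine and that the test database is literally named test.
import psycopg2

# Connect to the test database with the credentials listed above
# (the database name "test" is an assumption)
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="test",
    user="test",
    password="postgres",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()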
Running a Spark Job inside Jupyter notebook
Now it's time to test whether everything is working correctly. Let's first run a Spark application interactively in a Jupyter notebook.
Inside Jupyter, go to “work/notebooks” folder and start a new Python 3 notebook.
Paste the code below in the notebook and rename it to hello-world-notebook. This Spark code will count the lines containing the letter a and the lines containing the letter b inside the airflow.cfg file.
# Set file
logFile = "/home/jovyan/work/data/airflow.cfg"
# Read file
logData = sc.textFile(logFile).cache()
# Get lines with a
numAs = logData.filter(lambda s: 'a' in s).count()
# Get lines with b
numBs = logData.filter(lambda s: 'b' in s).count()
# Print result
print("Lines with a: {}, lines with b: {}".format(numAs, numBs))
After running the code, you will have:
Note that:
- The path spark/resources/data in your project is mapped to /home/jovyan/work/data/ in the Jupyter container.
- The folder notebooks in your project is mapped to /home/jovyan/work/notebooks/ in the Jupyter container.
Triggering a Spark Job from Airflow
In the Airflow UI, you can find a DAG called spark-test, which resides in the dags folder inside your project.
This is a simple DAG that triggers the same Spark application which we ran in Jupyter notebook with two little differences:
- We need to instantiate the Spark Context (in Jupyter it is already instantiated).
- The file to be processed will be an argument passed by Airflow when calling spark-submit.
DAG spark-test.py and Spark app hello-world.py.
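As a rough sketch of what such a DAG can look like, the snippet below uses the SparkSubmitOperator from the Airflow 1.x contrib package; the dag id, task id and file paths are illustrative rather than the project's exact values.
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

dag = DAG(
    dag_id="spark-test",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
)

spark_job = SparkSubmitOperator(
    task_id="spark_job",
    # path of the Spark application inside the Airflow container (assumed)
    application="/usr/local/spark/app/hello-world.py",
    # file to be processed, passed by Airflow as an argument to spark-submit
    application_args=["/usr/local/spark/resources/data/airflow.cfg"],
    # uses the spark_default connection configured below
    conn_id="spark_default",
    dag=dag,
)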
Before running the DAG, change the spark_default connection in the Airflow UI to point to spark://spark (Spark Master), port 7077:
spark_default connection inside Airflow.
Now, you can turn the DAG on and trigger it from Airflow UI.
After running the DAG, you can see the result printed in the spark_job task log in Airflow UI:
And you can see the application in the Spark Master UI:
You can increase the number of Spark workers by adding new services based on the bitnami/spark:latest image to the docker-compose.yml file, like the following:
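The snippet below is a sketch rather than a copy from the repository: the service name and resource values are illustrative, mirroring the pattern of the existing worker services in the compose file.
# added under the services: section of docker-compose.yml
spark-worker-4:
  image: bitnami/spark:latest
  environment:
    - SPARK_MODE=worker
    - SPARK_MASTER_URL=spark://spark:7077
    - SPARK_WORKER_MEMORY=1G
    - SPARK_WORKER_CORES=1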
The meaning of these environment variables can be found here.
When you no longer want to play with this stack, you can stop it to save local resources:
$ cd airflow-spark/docker
$ docker-compose down
Finally, I hope this guide helps you with your work or studies and makes it easier to stand up a complete Data Engineering environment. Enjoy!