Airflow and Apache Spark are both open-source tools. Apache Spark, with 22.5K GitHub stars and 19.4K forks, appears to have broader adoption than Airflow, with 12.9K stars and 4.71K forks.
What Is Airflow?
Apache Airflow is one realization of the DevOps philosophy of "Configuration as Code." Airflow allows users to launch multi-step pipelines using a simple Python object, the DAG (Directed Acyclic Graph), and it offers a wide range of integrations for services ranging from Spark and HBase to services on various cloud providers.
Task dependencies are declared in Python code, for example:
# for Airflow
spark_job.set_upstream(src1_s3)
spark_job.set_upstream(src2_hdfs)
# alternatively, using set_downstream
src3_s3.set_downstream(spark_job)
Once you define your DAG and add it to the Airflow scheduler, the easiest way to work with it is through the web server.
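To make this concrete, here is a minimal sketch of a DAG definition. It is not taken from this project: the dag id, task ids, schedule and commands are made up for illustration, and the import paths follow the classic Airflow 1.x layout.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Illustrative DAG with two dummy shell tasks and one dependency
dag = DAG(
    dag_id="hello_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
load = BashOperator(task_id="load", bash_command="echo loading", dag=dag)

# "extract" must finish before "load" starts
extract.set_downstream(load)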
As Data Engineers, we commonly use Apache Spark and Apache Airflow in our daily routine (if you do not use them yet, you should try them) to overcome typical Data Engineering challenges, such as building pipelines that get data from somewhere, apply a series of transformations, and deliver it somewhere else.
In this article, I will share a guide on how to create a Data Engineering development environment containing a Spark Standalone Cluster, an Airflow server and a Jupyter Notebook instance.
In the Data Engineering context, Spark acts as the tool to process data (whatever you can think of as data processing), Airflow as the orchestration tool to build pipelines, and Jupyter Notebook as the environment to interactively develop Spark applications.
Think how amazing it would be if you could develop and test Spark applications integrated with Airflow pipelines on your own machine, without having to wait for someone to give you access to a development environment, share server resources with others using the same environment, or wait for that environment to be created if it does not exist yet.
With this in mind, I started searching for a way to create this environment without those dependencies but, unfortunately, I did not find a decent article explaining how to put these pieces together (or maybe I was not lucky with my googling).
- Airflow configured with LocalExecutor, meaning that all components (scheduler, webserver and executor) run on the same machine.
- Postgres to store Airflow metadata. A test database is also created inside Postgres in case you want to run pipelines in Airflow that write to or read from a Postgres database.
- Spark standalone cluster with 3 workers; you can configure more workers as explained further in this article.
- Jupyter notebook with Spark embedded to provide interactive Spark development.
Architecture components.
Below is a step-by-step process to get your environment running. The complete project can be found on GitHub here.
Prerequisites
Download images
- Spark image
$ docker pull bitnami/spark:latest
- Jupyter image
$ docker pull jupyter/pyspark-notebook:latest
- Postgres image
$ docker pull postgres:9.6
Download Git project
$ git clone https://github.com/cordon-thiago/airflow-spark
Build Airflow Image
$ cd airflow-spark/docker/docker-airflow
$ docker build --rm -t docker-airflow-spark:latest .
Check your images
$ docker images
Start containers
$ cd airflow-spark/docker
$ docker-compose up -d
At this moment you will have an output like below and your stack will be running :).
Access applications
- Spark Master: http://localhost:8181
- Airflow: http://localhost:8282
- Postgres DB Airflow: Server: localhost, port: 5432, User: airflow, Password: airflow
- Postgres DB Test: Server: localhost, port: 5432, User: test, Password: postgres
- Jupyter notebook: you need to run the code below to get the URL + Token generated and paste in your browser to access the notebook UI.
$ docker logs -f docker_jupyter-spark_1
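If you want to sanity-check the Postgres connections listed above from Python, the snippet below is a small sketch (not part of the project): it assumes the psycopg2 package is installed on your machine and that the test database is literally named test.
import psycopg2

# Connect to the test database with the credentials listed above
# (the database name "test" is an assumption)
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="test",
    user="test",
    password="postgres",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()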
Running a Spark Job inside Jupyter notebook
Now it's time to test whether everything is working correctly. Let's first run a Spark application interactively in a Jupyter notebook.
Inside Jupyter, go to “work/notebooks” folder and start a new Python 3 notebook.
Paste the code below in the notebook and rename it to hello-world-notebook. This Spark code will count the lines containing the letter a and the lines containing the letter b inside the airflow.cfg file.
# Set file
logFile = "/home/jovyan/work/data/airflow.cfg"
# Read file
logData = sc.textFile(logFile).cache()
# Get lines with a
numAs = logData.filter(lambda s: 'a' in s).count()
# Get lines with b
numBs = logData.filter(lambda s: 'b' in s).count()
# Print result
print("Lines with a: {}, lines with b: {}".format(numAs, numBs))
After running the code, you will have:
Note that:
- The path spark/resources/data in your project is mapped to /home/jovyan/work/data/ in the Jupyter container.
- The folder notebooks in your project is mapped to /home/jovyan/work/notebooks/ in the Jupyter container.
Triggering a Spark Job from Airflow
In the Airflow UI, you can find a DAG called spark-test, which resides in the dags folder inside your project.
This is a simple DAG that triggers the same Spark application which we ran in Jupyter notebook with two little differences:
- We need to instantiate the Spark Context (in Jupyter it is already instantiated).
- The file to be processed will be an argument passed by Airflow when calling spark-submit.
DAG spark-test.py and Spark app hello-world.py.
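As a rough sketch of what such a DAG can look like, the snippet below uses the SparkSubmitOperator from the Airflow 1.x contrib package; the dag id, task id and file paths are illustrative rather than the project's exact values.
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

dag = DAG(
    dag_id="spark-test",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
)

spark_job = SparkSubmitOperator(
    task_id="spark_job",
    # path of the Spark application inside the Airflow container (assumed)
    application="/usr/local/spark/app/hello-world.py",
    # file to be processed, passed by Airflow as an argument to spark-submit
    application_args=["/usr/local/spark/resources/data/airflow.cfg"],
    # uses the spark_default connection configured below
    conn_id="spark_default",
    dag=dag,
)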
Before running the DAG, change the spark_default connection in the Airflow UI to point to spark://spark (Spark Master), port 7077:
spark_default connection inside Airflow.
Now, you can turn the DAG on and trigger it from Airflow UI.
After running the DAG, you can see the result printed in the spark_job task log in Airflow UI:
And you can see the application in the Spark Master UI:
You can increase the number of Spark workers by adding new services based on the bitnami/spark:latest image to the docker-compose.yml file, like the following:
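The snippet below is a sketch rather than a copy from the repository: the service name and resource values are illustrative, mirroring the pattern of the existing worker services in the compose file.
# added under the services: section of docker-compose.yml
spark-worker-4:
  image: bitnami/spark:latest
  environment:
    - SPARK_MODE=worker
    - SPARK_MASTER_URL=spark://spark:7077
    - SPARK_WORKER_MEMORY=1G
    - SPARK_WORKER_CORES=1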
The meaning of these environment variables can be found here.
When you no longer want to play with this stack, you can stop it to save local resources:
$ cd airflow-spark/docker
$ docker-compose down
Finally, I hope this guide helps you with your work or studies and makes it easier to stand up a complete Data Engineering environment. Enjoy!