Using Apache Airflow's Docker Operator with Amazon's Container Repository
Brian Campbell
Reading time: about 5 min
Set up permissions and push to ECR
Once we had the image, we needed to move it into ECR. First, we gave the analysts access to ECR in IAM by adding a few policies so that they could push their container. At the very least, someone pushing a container to ECR will need the ecr:GetAuthorizationToken and ecr:PutImage permissions. If you want to manage repositories yourself, that's all you need; if you want analysts to manage the repositories they push to as well, you'll also need to give them the ecr:CreateRepository permission. For more detailed information, AWS provides excellent tutorials: Creating a Repository and Pushing an Image.

Next, we needed to give Airflow permission to pull the job's image from ECR. The permissions Airflow needed were ecr:BatchCheckLayerAvailability, ecr:BatchGetImage, ecr:GetAuthorizationToken, and ecr:GetDownloadUrlForLayer. Our Airflow cluster runs on EC2 instances, so we granted those permissions to the IAM roles associated with those instances. From there, we set up Airflow to communicate with our account's ECR.
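As a rough sketch of that second step, the pull permissions could be attached to the Airflow workers' instance role with an inline policy like the one below. This is an illustration rather than our actual setup; the role and policy names are placeholders.

import json

import boto3

iam = boto3.client('iam')

# Placeholder name for the IAM role attached to the EC2 instances running Airflow.
AIRFLOW_WORKER_ROLE = 'airflow-worker-role'

pull_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Action': [
            'ecr:GetAuthorizationToken',
            'ecr:BatchCheckLayerAvailability',
            'ecr:BatchGetImage',
            'ecr:GetDownloadUrlForLayer',
        ],
        'Resource': '*',
    }],
}

iam.put_role_policy(
    RoleName=AIRFLOW_WORKER_ROLE,
    PolicyName='airflow-ecr-pull',          # placeholder policy name
    PolicyDocument=json.dumps(pull_policy),
)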
Connect Airflow to ECR

Airflow communicates with the Docker repository by looking for connections with the type “docker” in its list of connections. We wrote a small script that retrieves login credentials from ECR, parses them, and stores them in Airflow's connection list as a Docker connection. Here is an example script similar to what we used to retrieve and store the credentials:
#!/usr/bin/env python
import base64
import subprocess

import boto3

ecr = boto3.client('ecr', region_name='us-east-1')
response = ecr.get_authorization_token()
# The authorization token is a base64-encoded "username:password" string
username, password = base64.b64decode(
    response['authorizationData'][0]['authorizationToken']
).decode('utf-8').split(':')
registry_url = response['authorizationData'][0]['proxyEndpoint']
# Delete existing docker connection
airflow_del_cmd = 'airflow connections -d --conn_id docker_default'
process = subprocess.Popen(airflow_del_cmd.split(), stdout=subprocess.PIPE)
output, error = process.communicate()
# Add docker connection with updated credentials
airflow_add_cmd = (
    'airflow connections -a --conn_id docker_default --conn_type docker '
    '--conn_host {} --conn_login {} --conn_password {}'
).format(registry_url, username, password)
process = subprocess.Popen(airflow_add_cmd.split(), stdout=subprocess.PIPE)
output, error = process.communicate()
The issue with the script above, though, is that the ECR credentials it stores are only temporary (AWS issues the authorization token with a 12-hour expiry). To keep the credentials fresh, we set up a cron task on every host in our cluster that reruns this script every half hour.
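For reference, the crontab entry would look something like the line below; the script path and log file are assumptions, not the paths we actually use.

*/30 * * * * /usr/local/bin/refresh_ecr_credentials.py >> /var/log/refresh_ecr_credentials.log 2>&1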
Use your Docker image on Airflow
Then, in order to run the container image as a task, we set up a DAG with an operator like this:
DockerOperator(
    task_id='web_scraper',
    image='XXXXXXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/web_scraper:latest',
    command='python /home/ubuntu/web_scraper.py',
    execution_timeout=timedelta(minutes=30),
    dag=dag)
The DockerOperator uses the Docker connection we set up in the last script to pull the image you pushed, then runs that image with the provided command.
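For context, here is a minimal sketch of what the surrounding DAG file might look like. The DAG id, start date, schedule, and the explicit docker_conn_id are our assumptions, and the import path matches the Airflow 1.10-era CLI used above.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.docker_operator import DockerOperator

# Hypothetical DAG wrapping the scraper task; the id and schedule are placeholders.
dag = DAG(
    'web_scraper',
    start_date=datetime(2019, 1, 1),
    schedule_interval='@daily',
)

DockerOperator(
    task_id='web_scraper',
    image='XXXXXXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/web_scraper:latest',
    command='python /home/ubuntu/web_scraper.py',
    # Some Airflow versions only use a registry connection when docker_conn_id
    # is set explicitly, so we point it at the connection the script maintains.
    docker_conn_id='docker_default',
    execution_timeout=timedelta(minutes=30),
    dag=dag)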
We had one last issue working with Docker on Airflow: the DockerOperator does not clean up the old containers and images it leaves behind, which eventually caused us to run out of disk space on our ECS cluster. To fix that, we added another task to the same DAG to do some cleanup:
BashOperator(
    task_id='clean_up_docker',
    # --force skips the confirmation prompt, which a non-interactive task can't answer
    bash_command='docker container prune --force',
    dag=dag)
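Pruning containers alone does not reclaim the space held by unused images, so a similar task running docker image prune is one way to cover that as well. This variant is our suggestion rather than part of the original setup.

BashOperator(
    task_id='clean_up_docker_images',
    # Remove dangling images left behind by repeated pulls of the :latest tag
    bash_command='docker image prune --force',
    dag=dag)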
With cleanup in place, we had a system for running Docker images stored in ECR as tasks in Airflow. We can now take a task, put it in a portable Docker image, push that image to our private hosted repository in ECR, and then run it on a schedule from our Airflow cluster.
Of course, this isn't your only option for using Docker, or even ECR, with Airflow. Our site reliability team has started running some containerized tasks with the ECSOperator instead of the DockerOperator, so that they run on an Elastic Container Service (ECS) cluster rather than directly on the Airflow workers. We stuck with the DockerOperator because it made sense for our team, and I hope I've helped you get the most out of your Docker and Airflow infrastructure.
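As a rough illustration of that alternative (not our production code), an ECSOperator task might look something like the sketch below. The task definition and cluster names are made up, and the import path again assumes an Airflow 1.10-era deployment.

from airflow.contrib.operators.ecs_operator import ECSOperator

ECSOperator(
    task_id='web_scraper_on_ecs',
    task_definition='web-scraper',     # hypothetical ECS task definition
    cluster='analytics-ecs-cluster',   # hypothetical ECS cluster
    overrides={},                      # per-run container overrides, if any
    region_name='us-east-1',
    dag=dag)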
About Lucid

Lucid Software is a pioneer and leader in visual collaboration dedicated to helping teams build the future. With its products—Lucidchart, Lucidspark, and Lucidscale—teams are supported from ideation to execution and are empowered to align around a shared vision, clarify complexity, and collaborate visually, no matter where they are. Lucid is proud to serve top businesses around the world, including customers such as Google, GE, and NBC Universal, and 99% of the Fortune 500. Lucid partners with industry leaders, including Google, Atlassian, and Microsoft. Since its founding, Lucid has received numerous awards for its products, business, and workplace culture. For more information, visit lucid.co.