Understanding Containerization and Docker for Data Engineers

Development in the Virtual Machine (VM) Era

In the Virtual Machine (VM) era, creating consistent and compatible development environments was a major difficulty for software and data teams. The struggles included operating system (OS) specific problems, software version conflicts, and inconsistencies between machines that led to ineffective testing and wasted effort. Every developer also had to install and configure the necessary services on a local computer by hand.

A VM is a computer emulated in software, which lets you run one or more operating systems inside another operating system. Instead of running multiple applications on the same server, companies would package and run a single application on each VM. For data engineers, personalizing environments with various tools and software versions, on top of deployment complexities, became a recurring challenge.

VMs are pretty heavy beasts on their own, since each one contains a full-blown operating system such as Linux or Windows Server, and all that for just a single application. A good solution to this problem needed to be much more lightweight than a VM.

Virtualization Concept

A Virtual Machine (VM) is a software-based environment created through virtualization. Virtualization is the process of using software to create a virtual resource that operates on a layer separate from the actual hardware. It enables you to run several VMs on a single computer.

Every one of those VMs is managed by a hypervisor. A hypervisor (also known as a virtual machine monitor) is a software layer that pools the computing resources of a single physical machine and shares them among several VMs, each running its own guest operating system. There are two main types of hypervisors:

Type 1 hypervisors (also known as bare-metal hypervisors) run directly on the physical hardware without a host operating system. Examples include Citrix Hypervisor (formerly XenServer), Microsoft Hyper-V, and VMware ESXi.
Type 2 hypervisors (also known as hosted hypervisors) run as an application on top of an existing host operating system. Examples include Parallels Desktop, VMware Workstation, and Oracle VirtualBox.

Comparison of Container and Virtual Machine Architecture | https://www.researchgate.net/publication/343764931/figure/fig1/AS:926595288145920@1597928940339/Comparison-of-Docker-Container-and-Virtual-Machine-Architecture-13.ppm

Development in the Docker Era

When Docker emerged in 2013, containers exploded in popularity. Containers offer a lightweight and consistent environment, streamlining the process for developers to build, test, and deploy applications efficiently. Docker, Inc. is the company formed to develop Docker CE and Docker EE.

In software development, Docker containers encapsulate everything an application needs to run: system libraries and tools, dependencies, and application code (everything except the OS kernel, which is shared with the host). Docker eliminates the "But it works on my machine!" problem by packaging an application into a container. Using Docker, developers can build a container image once and then deploy it to the various stages of the development process (testing, staging, and production) without worrying about compatibility issues.

In data engineering, containers help modern data engineers create, maintain, and deploy data solutions: managing workflows, overseeing data warehouses, transforming and visualizing data, optimizing batch and streaming processing, and establishing a data lake. You can spin up tools like Airflow, Postgres, Spark, and dbt with minimal effort, and a standard setup makes it easier to share data pipelines and collaborate with team members.

Containerization Concept

Docker brought containerization to everyday development, right down to your local machine. Containerization is the process of encapsulating an application, simple or complex, together with its dependencies into a container image. Because the image carries everything the software needs in one place, it behaves consistently wherever it runs.

Containerization works by sharing the host OS kernel among containers as a read-only resource, instead of booting a separate guest OS for each one. The lightweight and scalable nature of containers allows developers to run several instances on a single server or virtual machine. Containerization also supports the microservices architecture, which divides programs into discrete, autonomous services.
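One way to see the shared kernel for yourself is to compare what a process reports on the host and inside a container. The sketch below is only illustrative: the helper name describe_environment is made up for this example, and it assumes a Linux system where /etc/os-release exists (the distro field falls back to "unknown" elsewhere).

```python
import platform
from pathlib import Path


def describe_environment(os_release_path="/etc/os-release"):
    """Return the kernel release and the userland distribution name."""
    distro = "unknown"
    path = Path(os_release_path)
    if path.is_file():
        for line in path.read_text().splitlines():
            # The file holds KEY=value lines, e.g. PRETTY_NAME="Debian GNU/Linux 12"
            if line.startswith("PRETTY_NAME="):
                distro = line.split("=", 1)[1].strip().strip('"')
                break
    # platform.release() reports the kernel the process is actually running on.
    return {"kernel": platform.release(), "distro": distro}


if __name__ == "__main__":
    print(describe_environment())
```

On a native Linux host, running this script on the host and again inside a container should report the same kernel value but, depending on the image, a different distro. On Docker Desktop for macOS or Windows, containers share the kernel of a small Linux VM rather than the kernel of the desktop OS.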

What is Docker?

Docker has quietly spread into almost every aspect of modern software development (including data and AI/ML environments). Docker is an open platform for developing, shipping, and running applications. Similar to how an engine powers a ship, the Docker runtime environment offers the resources and infrastructure required to create, administer, and share containerized apps.

How does Docker work behind the scenes?

To understand Docker's inner workings, it helps to know the main parts of its architecture and what each one does. The Docker architecture consists of several components that work together to enable containerization and application management. Here is an outline:

Docker client

The Docker client is how users interact with Docker. When you run a command such as docker run, the client sends it to the Docker daemon, which carries it out. The client and the daemon exchange information through a REST API, over a UNIX socket or a network interface.
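To make the client and daemon relationship concrete, here is a minimal sketch of talking to the daemon's REST API directly from Python, using only the standard library. It assumes a Linux host where the daemon listens on the default socket /var/run/docker.sock and that your user may read it; the names UnixHTTPConnection and docker_get are invented for this example and are not part of any Docker SDK.

```python
import http.client
import json
import socket

# Default daemon socket on Linux; adjust if your setup differs (assumption).
DOCKER_SOCKET = "/var/run/docker.sock"


class UnixHTTPConnection(http.client.HTTPConnection):
    """An HTTP connection that travels over a UNIX domain socket."""

    def __init__(self, socket_path, timeout=5):
        # The host name is a placeholder; routing is done by the socket path.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def docker_get(path, socket_path=DOCKER_SOCKET):
    """Send a GET request to the Docker Engine API and decode the JSON reply."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, json.loads(response.read())
    finally:
        conn.close()
```

With a running daemon, docker_get("/version") returns the same information the docker version command shows for the server, and docker_get("/containers/json") lists running containers, much like docker ps.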

Docker daemon (engine)

The Docker daemon (dockerd) is the background process that builds, runs, and manages containers. The daemon is installed alongside the CLI as part of Docker Engine. Underneath it, the container runtime is the most important component, as it provides the interface to the Linux kernel features that make containerization possible.

Docker registries

A Docker registry is a central location for storing and sharing Docker images. There are two kinds of registries:
- Public registry: open to everyone; Docker Hub is the default public registry
- Private registry: used for sharing images within a company or organization

Docker retrieves the necessary images from your configured registry when you run docker pull or docker run, and pushes your image to the configured registry when you run docker push.
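Which registry an image comes from is encoded in the image reference itself. The following sketch shows roughly how a short name such as postgres:16 expands to a registry, repository, and tag. It is a simplified illustration (the function name parse_image_ref is invented, and digest references using @sha256 are not handled), not the parser Docker itself uses.

```python
def parse_image_ref(ref):
    """Split an image reference into registry, repository, and tag."""
    # The first path component is a registry host only if it looks like one.
    first, sep, rest = ref.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        registry, remainder = first, rest
    else:
        registry, remainder = "docker.io", ref  # Docker Hub is the default

    # The tag follows the last colon; without one, "latest" is assumed.
    name, colon, tag = remainder.rpartition(":")
    if not colon or "/" in tag:
        name, tag = remainder, "latest"

    # Official Docker Hub images live under the implicit "library" namespace.
    if registry == "docker.io" and "/" not in name:
        name = "library/" + name
    return {"registry": registry, "repository": name, "tag": tag}


if __name__ == "__main__":
    for ref in ("postgres:16", "apache/airflow:2.10.0", "localhost:5000/team/etl"):
        print(ref, "->", parse_image_ref(ref))
```

The last reference uses a hypothetical private registry at localhost:5000 to show that a host with a port is not mistaken for a tag.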

Docker architecture and its components | https://geekflare.com/wp-content/uploads/2019/09/docker-architecture-609x270.png

Docker Core Concept

The core concepts below help you understand the fundamental building blocks of Docker and how they connect to one another. Here’s a breakdown of the main Docker concepts:

Dockerfile

A Dockerfile is a text document that contains the build instructions for a Docker image. These instructions cover specifying the base image, adding files, setting environment variables, and configuring the container. Docker supports many Dockerfile instructions, including:

FROM: refers to an existing image that becomes the base for your build
RUN: executes shell commands during build time
CMD: provides default command to run at container start
ENTRYPOINT: defines a fixed command, often combined with CMD for args
COPY: adds files and folders to your image’s filesystem
ADD: similar to COPY, but additionally supports remote file URLs and automatic extraction of local archives
ENV: sets environment variables that will be available during the build and within your containers
EXPOSE: documents which ports the container will listen on
WORKDIR: sets the working directory for following instructions
VOLUME: declares mount points for persistent or shared data
ARG: defines build-time variables, usable in later instructions such as RUN
USER: sets the user (and optionally the group) used by the following instructions and by the running container
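The interplay between ENTRYPOINT and CMD is easier to follow with a small model. The sketch below mimics how the final command is assembled when both instructions use the exec (JSON array) form; the function name effective_command and the etl.py values are hypothetical and exist only for this illustration.

```python
def effective_command(entrypoint=None, cmd=None, run_args=None):
    """Combine exec-form ENTRYPOINT and CMD the way `docker run` does."""
    entrypoint = list(entrypoint or [])
    # Arguments placed after the image name on `docker run` replace CMD;
    # otherwise CMD supplies the default arguments.
    arguments = list(run_args) if run_args else list(cmd or [])
    return entrypoint + arguments


if __name__ == "__main__":
    # ENTRYPOINT ["python", "etl.py"] with CMD ["--date", "today"]
    print(effective_command(["python", "etl.py"], ["--date", "today"]))
    # docker run <image> --date 2024-01-01 overrides only the CMD part
    print(effective_command(["python", "etl.py"], ["--date", "today"],
                            run_args=["--date", "2024-01-01"]))
```

This only covers the exec form. When ENTRYPOINT is written in shell form, CMD and any arguments passed to docker run are not appended at all.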

Example: Creating and using a Dockerfile

A Dockerfile must begin with a FROM instruction; only comments, parser directives, and ARG instructions may come before it. Here are the required packages in requirements.txt:

pyspark==3.5.2
psycopg2-binary==2.9.9
mysql-connector-python==9.0.0
pandas

Let’s take a look at a Dockerfile that builds a custom Airflow image with Java and PySpark for a data engineering project:

## choose the base image
FROM apache/airflow:2.10.0

## set the user as root, helps with the installation permissions :)
USER root

## set environment variable to avoid ui pop-ups during installations.
ENV DEBIAN_FRONTEND=noninteractive

## This multi-line instruction to install necessary packages in the image
RUN apt-get update \
  && apt-get install -y --no-install-recommends \
    build-essential \
    libssl-dev \
    libffi-dev \
    apt-transport-https \
    gnupg2 \
    lsb-release \
    openjdk-17-jdk \
  && apt-get autoremove -yqq --purge \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*

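## set up java home. Debian 12 bookworm comes with jdk-17 as default.
## the path below assumes an amd64 image; on arm64 it ends in -arm64.
ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
ENV PATH="${JAVA_HOME}/bin:${PATH}"
## no "RUN export JAVA_HOME" is needed: each RUN executes in its own shell,
## so an export there would not persist, while ENV applies to all later layers.

## switch back to the airflow user for the regular apache-airflow installation.
USER airflow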
## Copies the requirements.txt file from your local build context to the root directory (/)
COPY requirements.txt /
## Installs Python packages using pip, including apache-airflow package and apache-airflow-providers-apache-spark package
RUN pip install --no-cache-dir "apache-airflow==${AIRFLOW_VERSION}"  \
  apache-airflow-providers-apache-spark \
  -r /requirements.txt

Images

A Docker image is a standardized package that includes all of the files, binaries, libraries, and configuration needed to run a container. Once you have an image, you can use it to spin up one container or a hundred. Docker images can be obtained in two primary ways:

Pulling pre-existing images from a registry

Most of your images will be created on top of a base image from the Docker Hub registry. This method involves downloading a pre-built Docker image from a public or private container registry. To download a particular image or set of images, use docker pull. This will pull the official PostgreSQL 16 image from Docker Hub:
```
 docker pull postgres:16
```
After the pull completes, verify the image by listing your local Docker images:
```
 docker images
```
Use docker run to start a new Postgres container from the image:
```
 docker run --name test-postgres -e POSTGRES_PASSWORD=mysecretpassword -d postgres:16
```
Building custom images from a Dockerfile

If you need more control over the contents of your images, you can build your own, either on top of an existing image or from a base image of a Linux distribution of your choosing. Images are built from a Dockerfile, which is a set of instructions for creating an image, and can be stored and shared through Docker registries such as Docker Hub or private registries. Let's create a simple custom image that extends the official Postgres image with an extension:
```
 FROM postgres:16

 # Install a sample extension (pg_cron); the PGDG package is named postgresql-16-cron
 RUN apt-get update && \
     apt-get install -y --no-install-recommends \
         postgresql-16-cron && \
     rm -rf /var/lib/apt/lists/*
```
Now you can use docker build to build an image from your Dockerfile.
```
 docker build -t demo-postgresql:16 .
```
After the build finishes, verify the image by listing your local Docker images.
```
 docker images
```
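With your image built, start a container from it. As with the official image, a superuser password must be supplied through POSTGRES_PASSWORD:
```
 docker run -d --name my-postgresql-instance -e POSTGRES_PASSWORD=mysecretpassword -p 5432:5432 demo-postgresql:16
```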
If you want to stop the container, run this command.
```
 docker stop <container_id>
```
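If you start containers from scripts or test suites, it can help to assemble the docker run command programmatically instead of pasting strings together. The helper below is a small sketch under that assumption: build_run_command is an invented name, and it only covers the flags used in this article (-d, --name, -e, -p, and -v).

```python
import shlex


def build_run_command(image, name=None, env=None, ports=None, volumes=None, detach=True):
    """Return the argv list for a `docker run` invocation."""
    argv = ["docker", "run"]
    if detach:
        argv.append("-d")
    if name:
        argv += ["--name", name]
    for key, value in (env or {}).items():
        argv += ["-e", f"{key}={value}"]
    for host_port, container_port in (ports or {}).items():
        argv += ["-p", f"{host_port}:{container_port}"]
    for source, target in (volumes or {}).items():
        argv += ["-v", f"{source}:{target}"]
    argv.append(image)  # the image always comes after the options
    return argv


if __name__ == "__main__":
    command = build_run_command(
        "demo-postgresql:16",
        name="my-postgresql-instance",
        env={"POSTGRES_PASSWORD": "mysecretpassword"},
        ports={5432: 5432},
    )
    # Print the command for review; pass `command` to subprocess.run to execute it.
    print(shlex.join(command))
```

Keeping the command as a list avoids shell quoting problems when values contain spaces, and the list can be handed to subprocess.run(command, check=True) unchanged.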

Container

A Docker container is a running instance of a Docker image. Containers are created from images and add a writable layer on top of the image's read-only layers, which lets them save their state and store runtime data. Characteristics of a Docker container:

Isolated and secure: containers have their own filesystem, network, and process tree, separate from the host and other containers
Ephemeral: by default, changes to a container's filesystem live only in its writable layer and are lost when the container is removed, so a fresh container can be recreated from the image at any time
Based on an image: every container starts from an image
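The relationship between read-only image layers and a container's writable layer can be modeled in a few lines. This is a conceptual sketch only: it uses Python's collections.ChainMap with dictionaries standing in for filesystem layers, the file names and contents are hypothetical, and it is not how Docker's storage drivers are implemented.

```python
from collections import ChainMap

# Read-only image layers, listed from the base layer upward (made-up contents).
IMAGE_LAYERS = [
    {"/etc/os-release": "debian"},
    {"/app/etl.py": "v1"},
]


def start_container(image_layers):
    """Return a filesystem view: a fresh writable layer over the image layers."""
    writable_layer = {}
    # ChainMap looks keys up from left to right, so the topmost layer goes first.
    # Writes always land in the first mapping, i.e. the writable layer.
    return ChainMap(writable_layer, *reversed(image_layers))


if __name__ == "__main__":
    first = start_container(IMAGE_LAYERS)
    second = start_container(IMAGE_LAYERS)
    first["/app/etl.py"] = "patched"  # shadows the image file in this container only
    print(first["/app/etl.py"], second["/app/etl.py"])
```

Both containers read the same image layers, but a change made in one container is invisible to the other and never modifies the image, which is why removing a container discards its changes.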

Docker Container Explain | https://cdn.shortpixel.ai/spai/q_lossy+ret_img+to_auto/linuxiac.com/wp-content/uploads/2021/06/what-is-docker-container-1024x354.png

Storage & Volume

Docker offers versatile storage options for managing data inside containers. Docker containers are designed to be ephemeral, which means they can be easily stopped, destroyed, rebuilt, and replaced with minimal setup. If you want data to persist beyond the life of a container, you need to create a volume.

Volumes make it possible for data to survive container restarts. Even when the containers using them are removed, volumes continue to store data. The volume data lives on the host's filesystem and is managed by Docker, but you need to mount the volume into a container in order to access the data within it.

Besides volumes, Docker provides two other primary mount types:

Bind mounts: map a specific folder or file on the host machine into the container
Tmpfs mounts: store files directly in the host machine's memory, ensuring the data is never written to disk

Example: Creating a Named Volume for a PostgreSQL Container

Create a named volume:

docker volume create pgdata

Check that the volume has been created.

docker volume ls

Inspect the details of an existing volume.

docker volume inspect pgdata

Use docker run to start a new Postgres container with the volume mounted.

docker run --name test-postgres -e POSTGRES_PASSWORD=mysecretpassword -v pgdata:/var/lib/postgresql/data -d postgres:16

The -v pgdata:/var/lib/postgresql/data flag mounts the named volume at the PostgreSQL data directory, so the database files are stored in the volume rather than in the container's writable layer.
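The same short source:target syntax is used for both named volumes and bind mounts, and the difference lies only in what the source looks like. The sketch below classifies such entries; classify_mount is an invented helper, it assumes POSIX-style paths (Windows drive letters would confuse the colon split), and it treats sources starting with ./, ../, or ~ as host paths in the way Compose files usually write them.

```python
def classify_mount(spec):
    """Classify a short-syntax volume entry such as 'pgdata:/var/lib/postgresql/data'."""
    parts = spec.split(":")
    if len(parts) == 1:
        # Only a container path: Docker creates an anonymous volume for it.
        return {"type": "anonymous volume", "source": None, "target": parts[0], "mode": None}
    source, target = parts[0], parts[1]
    mode = parts[2] if len(parts) > 2 else None  # e.g. "ro" for read-only
    # A source that looks like a filesystem path is a bind mount; a bare name is a volume.
    is_host_path = source.startswith(("/", "./", "../", "~"))
    kind = "bind mount" if is_host_path else "named volume"
    return {"type": kind, "source": source, "target": target, "mode": mode}


if __name__ == "__main__":
    for spec in ("pgdata:/var/lib/postgresql/data", "./dags:/opt/airflow/dags:ro"):
        print(spec, "->", classify_mount(spec))
```

In this article, pgdata:/var/lib/postgresql/data is therefore a named volume, while entries such as ./dags:/opt/airflow/dags in a Compose file are bind mounts of a local folder.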

Docker Storage Type | https://k21academy.com/wp-content/uploads/2020/10/Docker-storage-view-2.png

Networks

Docker networking provides a flexible and powerful way to connect containers to each other and to external systems, using whichever network model you choose. Networking is enabled for containers by default and they can make outgoing connections, but their ports are not reachable from outside the host unless you publish them. Docker uses network namespaces to implement networking; every container has its own network namespace, which includes network interfaces, firewall rules, and routing tables. The main Docker network types and how they share resources are described below:

Bridge Network

The default network type for Docker containers on a single host. It creates an isolated network segment for your containers, allowing them to communicate with each other while maintaining separation from the host’s network. Each container connected to a bridge network receives its own internal IP address, so several containers can listen on the same port without conflicting.

Host Network

This network mode allows containers to share the host’s network stack directly. The container uses the same network interfaces and IP addresses as the host. It is useful when performance is critical and you don’t want the overhead of a separate network namespace.

Overlay Network

Typically used with Docker Swarm, overlay networks enable containers on different Docker hosts to communicate as if they were on the same local network. Overlay networks use VXLAN (Virtual Extensible LAN) encapsulation to route traffic between containers running on different hosts.

Macvlan Network

Assigns a MAC address to each container, allowing containers to appear as physical devices on the network. It is useful for scenarios where containers need to be treated as physical devices by the network.

IPvlan Network

Offers granular control and better performance than macvlan for environments with strict networking requirements. It gives you control over the IPv4 and IPv6 addresses assigned to containers, as well as layer 2 and layer 3 VLAN tagging and routing. IPvlan networks are assigned their own interfaces, offering performance advantages over bridge-based networking.

Example: Setting Up a Bridge Network

Create a custom bridge network:

docker network create --driver bridge my_custom_network

See all networks on your docker host.

docker network list

Inspect network details.

docker network inspect my_custom_network

Run a container on this network.

docker run -d --name web_server --network my_custom_network nginx

Get the IP address of a running container.

docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <container_name>
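The Go template above walks the JSON document that docker inspect prints. If you prefer to process that output in Python, the same lookup might look like the sketch below. The sample document is a heavily trimmed, made-up stand-in for real output (the address 172.18.0.2 is only an example), and container_ips is a name invented here.

```python
import json

# Trimmed, illustrative stand-in for the output of `docker inspect web_server`.
SAMPLE_INSPECT_OUTPUT = """
[
  {
    "Name": "/web_server",
    "NetworkSettings": {
      "Networks": {
        "my_custom_network": {"IPAddress": "172.18.0.2"}
      }
    }
  }
]
"""


def container_ips(inspect_output):
    """Map each attached network name to the container's IP address on it."""
    containers = json.loads(inspect_output)  # docker inspect prints a JSON array
    networks = containers[0]["NetworkSettings"]["Networks"]
    return {name: settings["IPAddress"] for name, settings in networks.items()}


if __name__ == "__main__":
    print(container_ips(SAMPLE_INSPECT_OUTPUT))
```

To use it with a live container, feed it the stdout of subprocess.run(["docker", "inspect", "web_server"], capture_output=True, text=True, check=True).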

Docker Host Network Driver | https://www.packetswitch.co.uk/content/images/2025/03/docker-01-12-.png

Docker Compose

Docker Compose is a tool that allows you to define, configure, and run multi-container Docker applications through a single YAML configuration file. A Dockerfile describes a single image, while Docker Compose lets you combine several containers into one application. Docker Compose works in all environments, including production, staging, development, testing, and Continuous Integration (CI) workflows.

Instead of starting each container manually with docker run, as in the examples in the Images section, Docker Compose automates the process. The preferred file name is compose.yaml; Compose also supports docker-compose.yaml and docker-compose.yml for backwards compatibility with earlier versions. Let’s cover the key components and concepts of Docker Compose:

version: specifies the version of the Compose file format; it is obsolete in current Compose releases and can be omitted
services: each service defines a container (or a set of identical containers) that Compose runs
image: the image, a read-only template, used to create the service's containers
ports: maps ports between the host machine and the container, allowing access to the service from outside the Docker environment
depends_on: defines dependencies between services, ensuring that certain services start before others
volumes: persistent storage that allows data to be shared between containers or saved outside the container’s lifecycle
networks: allows you to define how containers interact by specifying networks

Example: Configuring Airflow using Docker Compose

We will build a custom image from our local Dockerfile using the docker compose build command. Here are the required packages in requirements.txt:

Cython
pandas
requests
beautifulsoup4
pyodbc

Here are the required environment variables in .env:

# Meta-data DB
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow

# Airflow Core
AIRFLOW_UID=50000
AIRFLOW_GID=0

# Backend DB
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow

# API
AIRFLOW__API__AUTH_BACKENDS=airflow.api.auth.backend.basic_auth

# Airflow credentials
ADMIN_USERNAME=airflow
ADMIN_PASSWORD=airflow
ADMIN_FIRSTNAME=ricky
ADMIN_LASTNAME=test
ADMIN_MAIL=ricky_test@gmail.com

FERNET_KEY=zN5UbKI9xxeo2UzUcgZJ3tx-bf8v7KJtEy7Q8VT8xt4=
AIRFLOW_WWW_USER_USERNAME=admin-airflow
AIRFLOW_WWW_USER_PASSWORD=password-airflow
AIRFLOW_WWW_USER_FIRSTNAME=airflow
AIRFLOW_WWW_USER_LASTNAME=user
AIRFLOW_WWW_USER_EMAIL=test.ricky@gmail.com
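Compose reads the .env file and substitutes its values into placeholders such as ${POSTGRES_USER} or ${AIRFLOW_GID:-0} in the Compose file. The sketch below imitates just those two forms to show what the substitution does; interpolate is an invented helper, and the real Compose implementation supports more syntax (for example ${VAR-default} and ${VAR:?error}) than this illustration.

```python
import re

# Matches ${NAME} and ${NAME:-default}
_PLACEHOLDER = re.compile(
    r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::-(?P<default>[^}]*))?\}"
)


def interpolate(text, env):
    """Replace ${NAME} and ${NAME:-default} placeholders using values from env."""

    def replace(match):
        value = env.get(match.group("name"), "")
        if value:
            return value
        # ":-" falls back to the default when the variable is unset or empty.
        default = match.group("default")
        return default if default is not None else ""

    return _PLACEHOLDER.sub(replace, text)


if __name__ == "__main__":
    env = {"POSTGRES_USER": "airflow", "AIRFLOW_UID": "50000"}
    print(interpolate("POSTGRES_USER=${POSTGRES_USER}", env))
    print(interpolate("image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:3.1.0}", env))
```

A variable that is missing from .env and has no default is replaced by an empty string, which is a common source of confusing startup errors, so it is worth checking the rendered configuration with docker compose config.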

The Dockerfile below customizes a pre-built Apache Airflow image for a data engineering project:

## choose the base image and the python version
FROM apache/airflow:3.1.0-python3.12

# Install additional dependencies for root
USER root
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    unixodbc-dev \
    gnupg \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

RUN curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add - && \
    curl https://packages.microsoft.com/config/debian/11/prod.list > /etc/apt/sources.list.d/mssql-release.list

RUN apt-get update && \
    ACCEPT_EULA=Y apt-get install -y --no-install-recommends \
    msodbcsql18 \
    mssql-tools \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

# Install the Python packages from requirements.txt as the airflow user
USER airflow
COPY requirements.txt /requirements.txt
RUN pip install --no-cache-dir -r /requirements.txt

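The Airflow services for the data engineering project are defined in docker-compose.yaml. The service commands in this file follow the Airflow 2 CLI; Airflow 3 images start the web UI with api-server instead of webserver: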

#version: '3.8'
x-airflow-common:
  &airflow-common
  build: .
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:3.1.0}
  environment:
    &airflow-common-env
    AIRFLOW_HOME: /opt/airflow
    AIRFLOW__CORE__DAGS_FOLDER: /opt/airflow/dags
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: ${AIRFLOW__DATABASE__SQL_ALCHEMY_CONN}
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ${FERNET_KEY}
    AIRFLOW__CORE__WEBSERVER_PORT: '8080'
    AIRFLOW__WEBSERVER__AUTHENTICATE: 'True'
    AIRFLOW__WEBSERVER__AUTH_BACKEND: 'airflow.www.security.backends.database_auth'
    AIRFLOW__API__AUTH_BACKENDS: ${AIRFLOW__API__AUTH_BACKENDS}
    _AIRFLOW_DB_UPGRADE: 'true'
    _AIRFLOW_WWW_USER_CREATE: 'true'
    _AIRFLOW_WWW_USER_USERNAME: ${AIRFLOW_WWW_USER_USERNAME}
    _AIRFLOW_WWW_USER_PASSWORD: ${AIRFLOW_WWW_USER_PASSWORD}
    _AIRFLOW_WWW_USER_FIRSTNAME: ${AIRFLOW_WWW_USER_FIRSTNAME}
    _AIRFLOW_WWW_USER_LASTNAME: ${AIRFLOW_WWW_USER_LASTNAME}
    _AIRFLOW_WWW_USER_EMAIL: ${AIRFLOW_WWW_USER_EMAIL}
    _AIRFLOW_WEBSERVER_HOST: '0.0.0.0'
    _AIRFLOW_UID: ${AIRFLOW_UID}
    _AIRFLOW_GID: ${AIRFLOW_GID:-0}
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    - ./data:/opt/airflow/data
  depends_on:
    postgres:
      condition: service_healthy
    redis:
      condition: service_healthy
  networks:
    - airflow

services:
  postgres:
    image: postgres:13
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=${POSTGRES_DB}
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      timeout: 5s
      retries: 5
    networks:
      - airflow
    restart: always

  redis:
    image: redis:latest
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 5s
      retries: 5
    networks:
      - airflow
    restart: always

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    restart: always

  airflow-worker:
    <<: *airflow-common
    command: celery worker
    restart: always

  airflow-init:
    <<: *airflow-common
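    # All lines of the command share one indent level so that YAML folds them
    # into a single line. No "exec" is used, because exec would replace the
    # shell and the second command would never run.
    command: >
      bash -c "airflow db migrate &&
      airflow users create
      --username ${ADMIN_USERNAME}
      --password ${ADMIN_PASSWORD}
      --firstname ${ADMIN_FIRSTNAME}
      --lastname ${ADMIN_LASTNAME}
      --role Admin
      --email ${ADMIN_MAIL}"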
    environment:
      <<: *airflow-common-env
      _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
    user: "${AIRFLOW_UID}"
    restart: "no"

networks:
  airflow:
    name: airflow
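The depends_on entries in the file above imply a start order: postgres and redis first, then the Airflow services. That ordering is a topological sort of the dependency graph, which can be sketched with Python's standard graphlib module (Python 3.9 or later). The dictionary below was copied by hand from the Compose file, and start_order is an invented name; Compose's real scheduler additionally waits for conditions such as service_healthy.

```python
from graphlib import TopologicalSorter

# service -> services it depends on (hand-copied from the Compose file above)
DEPENDS_ON = {
    "postgres": [],
    "redis": [],
    "airflow-webserver": ["postgres", "redis"],
    "airflow-scheduler": ["postgres", "redis"],
    "airflow-worker": ["postgres", "redis"],
    "airflow-init": ["postgres", "redis"],
}


def start_order(depends_on):
    """Return one valid start order in which dependencies come first."""
    # TopologicalSorter expects a mapping of node -> predecessors,
    # which is exactly the shape of a depends_on table.
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    print(" -> ".join(start_order(DEPENDS_ON)))
```

If two services depended on each other, static_order would raise graphlib.CycleError, which mirrors the fact that Compose rejects circular dependencies.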

Start the containers in the background with this command.

docker compose up -d

To stop the containers and remove them together with the network, use this command (docker compose stop stops them without removing anything).

docker compose down

Docker Hub

Docker Hub is a public registry service that allows you to store and share Docker images. It provides a central location to discover pre-built images and tools designed to streamline your container workflows. Docker Hub enforces usage limits, including an abuse rate limit, to protect the application and infrastructure.

Docker Hub usage and limits | https://docs.docker.com/docker-hub/usage/

Summary

In the VM era, creating compatible and consistent development environments was challenging due to OS-specific issues, software version conflicts, and deployment complexities. A VM is a software-replicated computer containing a full-blown operating system, making it a heavy and resource-intensive solution for running often just a single application. The underlying concept of virtualization allows multiple VMs to run on a single machine, with a software layer called a Hypervisor managing the resources.

The shift occurred with the emergence of Docker and containerization, which gained popularity for offering a lightweight and consistent environment to efficiently build, test, and deploy applications. The process of containerization achieves its lightweight nature by sharing the host OS kernel among containers as a read-only resource. Key components of this architecture include the Dockerfile, which is a text document containing the image build instructions, and the resulting docker image, which is a standardized, distributable package used to create a running docker container. I hope you enjoyed reading this.
