Mastering the Linux Shell and Cron: Essential Skills for Data Engineers


Data Engineer based in Jakarta, Indonesia. When I first started out in my career, I was all about becoming a Software Engineer or Backend Engineer. But then I realized that I was actually more interested in being a Data Practitioner. Currently focused on data engineering and cloud infrastructure. In my free time, I jog, listen to J-pop music, and try to learn the Japanese language.

After learning the basic Linux commands for file system navigation in the previous article, the next stage is to move from manual execution to automation. This is where shell scripting and Bash are useful. The goal is to simplify our lives by automating boring manual tasks, which also makes results more consistent and reduces human error.

Shell scripting turns these commands into repeatable processes, saving time and ensuring consistency for tasks like data transfers and daily report generation. It is especially valuable for repetitive, complex operations that take too long to run line by line. Several shells can be used for scripting, including bash, sh, csh, and ksh; this article uses bash.

We'll go over the basics of shell scripting with practical examples.

Introduction to shell scripting 📃

Shell scripts can be used to manage deployments, configure servers, and automate repetitive activities, among other things. Writing one is fairly easy: create a new file, conventionally with a .sh extension. The first line of the script, the shebang (#!) line, tells the system which interpreter to use, in this case Bash.

  1. Create a new script named “run_shell.sh”
#!/bin/bash

# printf interprets \n as a newline; plain echo in Bash would print it literally
printf 'Hello, welcome to Ubuntu on WSL2!\nThe current date and time is: %s\n' "$(date)"
  2. Make the script executable by assigning execute rights to your user
$ chmod u+x run_shell.sh

Explanation:

  • chmod modifies the permissions (mode) of a file; it does not change ownership.

  • u+x adds execute (x) permission for the user who owns the file (u), so the owner can now run the script. The octal form chmod 755 would additionally let group and others read and execute it; avoid 777, which lets every user on the system modify the script.

  • run_shell.sh is the target file

  3. Run the script
$ ./run_shell.sh

Variable

Variables are used to store data, such as numbers, text strings, or the output of commands. Within the script, these variables can then be accessed and changed. A value is assigned with name=value (no spaces around =) and read back with $name, as in echo $name. Command substitution, $(command), assigns a command's output to a variable.
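As a minimal sketch of the two forms of assignment described above (the variable names are illustrative and not taken from the article's scripts), a plain assignment stores a literal value, while command substitution stores whatever a command prints:

```shell
#!/bin/bash
# Plain assignment: no spaces are allowed around "=".
greeting="Hello"

# Command substitution: $(...) runs a command and captures its output.
# tr strips the padding that some wc implementations add.
line_count=$(printf 'a\nb\nc\n' | wc -l | tr -d ' ')

# Read a variable with $name, or ${name} when it sits next to other text.
summary="${greeting}, the input has ${line_count} lines"
echo "$summary"
```

Quoting the expansion, as in echo "$summary", keeps values that contain spaces intact.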

In shell scripting, there are three main types of variables:

  • System variable

System variables are predefined variables set by the operating system or shell environment, such as HOME, USER, and PATH. They are read the same way as a variable you define yourself at the prompt.

$ echo $HOME
$ message="Hello there"
$ echo $message
  • Local variable

A local variable is declared inside a function by placing the local keyword before the variable name. It is visible only within that function.

#!/bin/bash

show_info() {
  local FullName="John Doe"
  local age=20
  echo "Name: $FullName, age: $age"
}
show_info
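  • Global variable

A global variable is defined anywhere in the script without the local keyword. It is accessible throughout the entire script, and an assignment inside a function changes it for the rest of the script.

#!/bin/bash
FullName="John Doe"

function greeting() {
   FullName="Doja Cat"   # no local keyword, so this overwrites the global
   age=20
   echo "Your $FullName and $age information"
}

greeting
echo $FullName   # prints "Doja Cat"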
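To make the difference between the two scopes concrete, here is a small sketch. The names city, status, and update_city are hypothetical and chosen only for illustration. It shows a local variable shadowing a global one inside a function, while an assignment without local leaks out of the function:

```shell
#!/bin/bash
city="Jakarta"            # global: visible everywhere in the script

update_city() {
  local city="Bandung"    # local: shadows the global inside this function only
  status="updated"        # no "local": this creates or overwrites a global
  echo "inside: $city"
}

update_city
echo "outside: $city"
echo "status: $status"
```

Declaring function variables with local by default is a common way to avoid accidentally overwriting values used elsewhere in a longer script.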

Conditional Statements

Conditional statements let a shell script execute different code blocks depending on whether a given condition evaluates to true or false. The if statement is the most common conditional statement, and it can be extended with elif and else clauses. This lets the script choose a response based on the result of the conditional expression.

#!/bin/bash

echo "Please enter a number: "
read -r num_1
echo "Please enter a number again:"
read -r num_2
echo "Sum result: $((num_1 + num_2))"

echo "-------- Check the input number1 ---------"
# Quote the variables so that empty input does not break the test syntax
if [ "$num_1" -gt 0 ]; then
  echo "$num_1 is positive"
elif [ "$num_1" -lt 0 ]; then
  echo "$num_1 is negative"
else
  echo "$num_1 is zero"
fi

echo "-------- Check the input number2 ----------"
if [ "$num_2" -gt 0 ]; then
  echo "$num_2 is positive"
elif [ "$num_2" -lt 0 ]; then
  echo "$num_2 is negative"
else
  echo "$num_2 is zero"
fi

Explanation:

  • read saves the user's input into the variable named as its argument

  • if [ "$num_1" -gt 0 ]: This condition checks if the value of num_1 is "greater than" (-gt) zero.

  • elif [ "$num_1" -lt 0 ]: This is an "else if" condition. It checks if the value of num_1 is "less than" (-lt) zero.

  • fi is a keyword used to terminate an if statement block
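The same if/elif/else chain can be wrapped in a function that takes the number as an argument and rejects non-numeric input before comparing. This is only a sketch: classify_number is a hypothetical helper name, and the validation pattern is one possible approach, not the article's method.

```shell
#!/bin/bash
# classify_number is a hypothetical helper, not part of the original script.
classify_number() {
  local num=$1
  # Reject anything that is not an optionally signed integer,
  # otherwise -gt and -lt would fail with a syntax error.
  case "$num" in
    ''|-|*[!0-9-]*|?*-*)
      echo "'$num' is not an integer"
      return 1
      ;;
  esac
  if [ "$num" -gt 0 ]; then
    echo "$num is positive"
  elif [ "$num" -lt 0 ]; then
    echo "$num is negative"
  else
    echo "$num is zero"
  fi
}

classify_number 7
classify_number -3
classify_number 0
```

Validating first means a typo at the prompt produces a readable message instead of an error from the test command.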

Looping

Loops are crucial to shell programming because they allow you to process files, handle user input, and automate repeated activities. When you need to do the same thing again, like processing a list of files or iterating over an array of values, they are especially helpful. Loops come in a variety of forms, including the "for," "while," and "until" loops.

Here's an explanation of how to use each of these loops in a shell script:

  1. While

The while loop repeatedly executes a block of commands as long as a specified condition is true.

#!/bin/bash

i=1
while [[ $i -le 10 ]]; do
  echo "$i"
  (( i += 1 ))
done

Explanation: -le: This is a comparison operator that means "less than or equal to".

  2. For

The for loop iterates over a sequence of values. The example below uses brace expansion, {1..10}, to generate the numbers 1 through 10.

#!/bin/bash

for i in {1..10}
do
    echo $i
done
  3. Until

The until loop continues executing until a specific condition becomes true.

#!/bin/bash

counter=1
until [ $counter -gt 10 ]
do
    echo $counter
    counter=$((counter + 1))
done
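Beyond counting, loops are most often used to process a set of files. The sketch below assumes a directory of CSV files and a hypothetical helper named count_rows; it loops over a glob pattern and reports the number of lines in each file:

```shell
#!/bin/bash
# count_rows is a hypothetical helper; the directory layout is an assumption.
count_rows() {
  local dir=$1 file rows
  for file in "$dir"/*.csv; do
    # If nothing matches, the pattern stays literal, so skip it.
    [ -e "$file" ] || continue
    # Reading from stdin makes wc print only the number; tr strips padding.
    rows=$(wc -l < "$file" | tr -d ' ')
    echo "$(basename "$file"): $rows"
  done
}

# Demo with throwaway files in a temporary directory.
demo_dir=$(mktemp -d)
printf 'id,temp\n1,30\n2,31\n' > "$demo_dir/jakarta.csv"
printf 'id,temp\n1,25\n' > "$demo_dir/bandung.csv"
count_rows "$demo_dir"
rm -r "$demo_dir"
```

Quoting "$dir" and "$file" keeps the loop working for paths that contain spaces.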

Function

Shell script functions let you organize reusable code blocks, which makes scripts more modular and easier to maintain. In shell scripting, a function is comparable to a function or subroutine in other programming languages.

There are two common ways to define functions in a shell script:

  1. Using the function keyword
#!/bin/bash

function demo {
  echo "This is my function"
}
demo

echo "The exit status of the demo function is: $?"
  2. Without the function keyword
#!/bin/bash

echo "------- Do You Know Odd Number? -------"
echo -n "Enter the Number: "
read x
is_odd(){
  if [ $((x % 2)) -eq 0 ]; then
     echo "Invalid Input, ${x} not odd number!"
     exit 1
  else
     echo "Yes! Number ${x} is Odd."
  fi
}
is_odd
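Functions become more reusable when they receive input as arguments and report their result through the exit status instead of reading a global variable. The following is a sketch of that style; the describe helper is a hypothetical name introduced for illustration:

```shell
#!/bin/bash
# This is_odd takes its input as an argument ($1) instead of reading a global.
is_odd() {
  local n=$1
  # The status of the last command becomes the function's exit status:
  # 0 means "true" (odd), non-zero means "false" (even).
  [ $((n % 2)) -ne 0 ]
}

# describe is a hypothetical wrapper that turns the status into a message.
describe() {
  if is_odd "$1"; then
    echo "$1 is odd"
  else
    echo "$1 is even"
  fi
}

describe 7
describe 10
```

Because is_odd only sets a status and never calls exit, the calling script decides what to do with the answer.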

Error Handling

Writing solid and dependable shell scripts requires the ability to handle errors. Without proper error handling, a script may keep running even after an error has occurred, which could have unanticipated consequences or even cause the system to crash. We must consciously choose actions that will enable us to deal with such situations.

Below are several error handling techniques for different scenarios:

  1. if statement and exit status

Every command returns an exit status, which the shell stores in the special variable $?. Success is indicated by an exit status of 0, and failure by any non-zero value. Because $? is overwritten by each new command, check it right away. The exit command terminates the script with the status you give it.

#!/bin/bash

mkdir /tmp/mydir
if [ $? -eq 0 ]; then
  echo "Directory /tmp/mydir created successfully."
else
  echo "Failed to create directory /tmp/mydir."
  exit 1
fi

Explanation:

  • mkdir command attempts to create a directory

  • $? to check if the command was successful

  • -eq 0 is a test that checks if the value of $? is equal to 0

  • exit 1 terminates the script with a non-zero exit status, signaling to the caller that mkdir failed
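An equivalent and often shorter idiom is to test the command directly in the if statement, which avoids reading $? at all. The sketch below uses a hypothetical helper named ensure_dir and a temporary directory so that it can be run safely:

```shell
#!/bin/bash
# ensure_dir is a hypothetical helper name used for illustration.
ensure_dir() {
  local dir=$1
  # "if ! command" tests the exit status directly, without reading $?.
  if ! mkdir -p "$dir" 2>/dev/null; then
    echo "Failed to create directory $dir" >&2
    return 1
  fi
  echo "Directory $dir is ready."
}

# Demo inside a throwaway base directory.
demo_base=$(mktemp -d)
ensure_dir "$demo_base/mydir"
rm -r "$demo_base"
```

A caller can then write ensure_dir "$some_dir" || exit 1 to stop the script when the directory cannot be created.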

  2. Function and trap

The trap command lets you define a command or function to be carried out in response to a certain event or signal. With the ERR keyword, the handler runs whenever a command fails, that is, returns a non-zero exit status. Traps help you create more robust and dependable scripts and are especially helpful for graceful termination, signal handling, error handling, and temporary file management.

#!/bin/bash

handle_error() {
    echo "Error on line $1"
    exit 1
}
trap 'handle_error $LINENO' ERR
mkdir /root/test_dir
echo "Directory created successfully."

Explanation:

  • if mkdir /root/test_dir fails, the handle_error() function is executed, printing an error message and exiting the script.

  • $LINENO is a special bash variable that contains the line number of the script where the error occurred.

  • ERR is not a real signal but a pseudo-signal provided by Bash; the trap fires whenever a command returns a non-zero exit status.
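The temporary file management mentioned above usually relies on the EXIT trap rather than ERR. As a sketch, the hypothetical build_report function below runs in a subshell, so its EXIT trap fires as soon as the function finishes and removes the scratch file whether or not an error occurred:

```shell
#!/bin/bash
# build_report is a hypothetical function. Its body is a subshell "( ... )",
# so the EXIT trap fires when the function finishes, not when the script ends.
build_report() (
  scratch=$(mktemp)
  trap 'rm -f "$scratch"' EXIT
  printf '3\n1\n2\n' > "$scratch"
  sort -n "$scratch" | tail -n 1           # print the largest value
  echo "scratch file: $scratch" >&2        # report the path on stderr
)

build_report
```

After the call returns, the path printed on stderr no longer exists, which can be confirmed with ls.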

  3. Redirecting error messages to a .log file

Redirection operators are used in Linux shell scripts to route error messages to a log file. Errors are written to the terminal by default, but you can redirect them to a file to facilitate debugging. Custom error messages can help users understand what went wrong and how to fix it by providing additional context.

#!/bin/bash

ls -l /home/ricky-suh101/shell-log-dump 2> script_errors.log
# Append the custom message only when mkdir actually fails
if ! mkdir /home/ricky-suh101/shell-log-dump; then
  echo "Error: Directory /home/ricky-suh101/shell-log-dump could not be created. Please check if you have the necessary permissions or if the directory already exists." >> error_dump.log
fi

Explanation:

  • Use the 2> operator to send standard error (stderr) to a specified file.

  • If mkdir fails, the error message is appended (>>) to error_dump.log
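The common redirection forms can be compared side by side. In this sketch the log file is a temporary file, and the path /nonexistent_path_for_demo and the warn helper are assumptions made only for the demonstration:

```shell
#!/bin/bash
# log_file and the messages are illustrative assumptions.
log_file=$(mktemp)

# 2>> appends only stderr to the log; stdout still goes to the terminal.
ls /nonexistent_path_for_demo 2>> "$log_file" || true

# >> file 2>&1 appends both streams to the same file. Order matters:
# stdout is redirected first, then stderr is pointed at stdout.
{ echo "normal output"; echo "error output" >&2; } >> "$log_file" 2>&1

# >&2 writes a custom message to stderr, so callers can capture it too.
warn() { echo "WARN: $*" >&2; }
warn "disk usage is high" 2>> "$log_file"

cat "$log_file"
```

Sending custom messages to stderr keeps them separate from a script's normal output, which matters when that output is piped into another command.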

Cronjob for Task Scheduling ⏰

Cron is a time-based job scheduler that executes commands or scripts without manual intervention. It is driven by the cron daemon (crond, or cron on Debian and Ubuntu), which runs in the background, checks the schedule every minute, and starts the jobs that are due. The cron syntax, which consists of minute, hour, day of the month, month, and day of the week fields, allows users to define the timing and frequency of these operations.

Setup time schedule for cronjob | https://images.archbee.com/jvwmQL6VASLd-norgNd8V/w2SCPNLCbpuIl7EXjo4-E_untitled-1.png?format=webp

Jobs are defined in a crontab (short for "cron table"), a file that lists each command together with its schedule. Any task you schedule this way is called a cron job. Cron jobs give users the ability to plan and automate actions on Unix-like operating systems.

# To create or edit your crontab
$ crontab -e
# To list the jobs in your current crontab
$ crontab -l
# To delete all tasks and start over (add -i to be prompted before deletion)
$ crontab -r
# To view cron job history, search the system log
$ grep CRON /var/log/syslog
# To check whether the cron daemon is running (press q to quit)
$ systemctl status cron

Below are some examples of scheduling cron jobs:

Image created by author
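For reference in text form, the small script below prints a handful of typical schedule expressions in the five-field order described earlier. The script path /home/username/scripts/job.sh is a placeholder, and the list is only a sample, not a reproduction of the image:

```shell
#!/bin/bash
# Print sample crontab entries; the job.sh path is a placeholder.
print_cron_examples() {
  cat <<'EOF'
# minute hour day-of-month month day-of-week command
# Every minute
* * * * * /home/username/scripts/job.sh
# Every 15 minutes
*/15 * * * * /home/username/scripts/job.sh
# At minute 0 of every hour
0 * * * * /home/username/scripts/job.sh
# Every day at 02:30
30 2 * * * /home/username/scripts/job.sh
# Every Monday at 09:00
0 9 * * 1 /home/username/scripts/job.sh
# On the first day of every month at 00:00
0 0 1 * * /home/username/scripts/job.sh
# Monday through Friday at 06:05
5 6 * * 1-5 /home/username/scripts/job.sh
EOF
}

print_cron_examples
```

Any of the uncommented lines can be pasted into the editor opened by crontab -e once the command path is replaced with a real script.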

Project Example: ETL weather data pipeline using cronjob and python

Let’s look at a practical example. We want to analyze weather information. The pipeline extracts hourly weather data from WeatherAPI for a location given by a city name, latitude and longitude, or another identifier, computes daily statistics, and inserts them into a PostgreSQL database so that we can query them with SQL.

Create a Virtual Environment

Navigate to where you want to create your project and then create a directory for it.

#  /home/username/
$ mkdir scripts
$ cd scripts

Check your python version by running the following command.

$ python3 --version

Install the venv module, which is used to create virtual environments.

$ sudo apt-get install python3-venv

Inside your project directory, create the virtual environment.

$ python3 -m venv my_venv_name # Replace my_venv_name with your desired name

Activate the virtual environment.

$ source my_venv_name/bin/activate

Now, any Python packages you install using pip will be installed within this isolated environment.

$ pip install requests psycopg2-binary

When you are finished working in the virtual environment, you can deactivate it.

$ deactivate

Extracting Weather Data

First, we need to define a function that fetches weather data from http://api.weatherapi.com/v1.

import os
import requests

API_BASE = "http://api.weatherapi.com/v1"
API_KEY = os.getenv("WEATHER_API_KEY")  # set in environment

def get_weather_data(city: str) -> dict:
    """
    Extract the one-day hourly forecast for a city from WeatherAPI.
    """
    url = f"{API_BASE}/forecast.json?key={API_KEY}&q={city}&days=1&aqi=no&alerts=no"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()  # raise an exception on HTTP 4xx/5xx responses
    return resp.json()

Transforming and Storing Data

Now that we have the raw data from the API, we need to transform it into a structured format that can be easily stored in a database.

import statistics
from datetime import datetime

def transform_weather_data(raw: dict) -> dict:
    """
    Transform raw data to compute statistics.
    """
    hours = raw["forecast"]["forecastday"][0]["hour"]
    temps = [h["temp_c"] for h in hours]

    stats = {
        "date": raw["forecast"]["forecastday"][0]["date"],
        "city": raw["location"]["name"],
        "avg_temp": round(statistics.mean(temps), 2),
        "max_temp": max(temps),
        "min_temp": min(temps),
        "data_points": len(temps),
        "created_at": datetime.utcnow(),
    }
    return stats

Load to the target database

Load the transformed data into our PostgreSQL database, which allows us to persist the data and query it later.

import psycopg2

# POSTGRES_CONN (the connection settings) is defined in the complete script below

def load_to_postgres(data: dict):
    """
    Load statistics into PostgreSQL table.
    """
    conn = psycopg2.connect(**POSTGRES_CONN)
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS weather_statistics (
            id SERIAL PRIMARY KEY,
            date DATE,
            city TEXT,
            avg_temp FLOAT,
            max_temp FLOAT,
            min_temp FLOAT,
            data_points INT,
            created_at TIMESTAMP
        );
    """)
    conn.commit()

    cur.execute("""
        INSERT INTO weather_statistics
        (date, city, avg_temp, max_temp, min_temp, data_points, created_at)
        VALUES (%s, %s, %s, %s, %s, %s, %s);
    """, (
        data["date"], data["city"], data["avg_temp"],
        data["max_temp"], data["min_temp"], data["data_points"],
        data["created_at"]
    ))
    conn.commit()
    cur.close()
    conn.close()

Here's the complete Python script that connects to the REST API endpoint, fetches the weather data, transforms it, and saves it to the database.

# /home/username/scripts/fetch_weather_api_data.py

import os
import sys
import requests
import psycopg2
from datetime import datetime
import statistics

# API and Database Config
API_KEY = os.getenv("WEATHER_API_KEY")  # set in environment
CITY = "Jakarta"
POSTGRES_CONN = {
    "host": "localhost",
    "database": "weatherdb",
    "user": "weather_user",
    "password": "weather_pass",
    "port": 5432,
}

# Fetch data from the API
def get_weather_data(city: str) -> dict:
    """
    Extract current weather data from WeatherAPI.
    """
    url = f"http://api.weatherapi.com/v1/forecast.json?key={API_KEY}&q={city}&days=1&aqi=no&alerts=no"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json()

# Transform API data
def transform_weather_data(raw: dict) -> dict:
    """
    Transform raw data to compute statistics.
    """
    hours = raw["forecast"]["forecastday"][0]["hour"]
    temps = [h["temp_c"] for h in hours]

    stats = {
        "date": raw["forecast"]["forecastday"][0]["date"],
        "city": raw["location"]["name"],
        "avg_temp": round(statistics.mean(temps), 2),
        "max_temp": max(temps),
        "min_temp": min(temps),
        "data_points": len(temps),
        "created_at": datetime.utcnow(),
    }
    return stats

# Save transform result into database
def load_to_postgres(data: dict):
    """
    Load statistics into PostgreSQL table.
    """
    conn = psycopg2.connect(**POSTGRES_CONN)
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS weather_statistics (
            id SERIAL PRIMARY KEY,
            date DATE,
            city TEXT,
            avg_temp FLOAT,
            max_temp FLOAT,
            min_temp FLOAT,
            data_points INT,
            created_at TIMESTAMP
        );
    """)
    conn.commit()

    cur.execute("""
        INSERT INTO weather_statistics
        (date, city, avg_temp, max_temp, min_temp, data_points, created_at)
        VALUES (%s, %s, %s, %s, %s, %s, %s);
    """, (
        data["date"], data["city"], data["avg_temp"],
        data["max_temp"], data["min_temp"], data["data_points"],
        data["created_at"]
    ))
    conn.commit()
    cur.close()
    conn.close()

def main():
    try:
        raw = get_weather_data(CITY)
        stats = transform_weather_data(raw)
        load_to_postgres(stats)
        print(f"[{datetime.now()}] Success: Data loaded for {CITY}")
    except Exception as e:
        print(f"[{datetime.now()}] ERROR: {e}", file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    main()

A common approach is to create a shell script that contains all your ETL steps, then schedule that script with cron.

#!/bin/bash
# /home/username/scripts/etl_pipeline.sh
set -Eeuo pipefail

LOG_FILE="/home/username/scripts/logs/temperature.log"
SCRIPT_DIR="/home/username/scripts"
PY_SCRIPT="${SCRIPT_DIR}/fetch_weather_api_data.py"
VENV_DIR="/path/to/your/venv"
PYTHON_BIN="${VENV_DIR}/bin/python"
MAX_LOG_SIZE=$((5 * 1024 * 1024)) # 5 MB rotate threshold
ROTATED_LOG="${LOG_FILE}.$(date +%Y%m%d_%H%M%S)"

# --- Ensure log dir exists
mkdir -p "$(dirname "$LOG_FILE")" || true
touch "$LOG_FILE" || {
  echo "Cannot write to $LOG_FILE. Check permissions." >&2
  exit 1
}

# --- Redirect all stdout/stderr to log
exec >>"$LOG_FILE" 2>&1

# --- Defaults config
: "${LOCATIONS:=Jakarta}"  # e.g., "Jakarta,Singapore"
TARGET_DATE="${TARGET_DATE:-$(date -d 'yesterday' +%F)}"

timestamp() { date +"%Y-%m-%d %H:%M:%S%z"; }
log() { echo "[$(timestamp)] $*"; }

rotate_log_if_needed() {
  if [ -f "$LOG_FILE" ]; then
    local size
    size=$(stat -c%s "$LOG_FILE" 2>/dev/null || echo 0)
    if [ "$size" -ge "$MAX_LOG_SIZE" ]; then
      mv "$LOG_FILE" "$ROTATED_LOG" || log "WARN: Failed to rotate log"
      # Reopen the log: stdout/stderr would otherwise keep writing to the rotated file
      exec >>"$LOG_FILE" 2>&1
      log "INFO: Rotated log to $ROTATED_LOG"
    fi
  fi
}

on_error() {
  local line=$1
  local cmd=$2
  local code=$3
  log "ERROR: Exit code $code at line $line while running: $cmd"
  # Propagate the failing command's exit code
  exit "$code"
}
trap 'on_error $LINENO "$BASH_COMMAND" $?' ERR

echo "==== $(date '+%F %T') : Starting ETL ====" >> "$LOG_FILE"
echo "Locations: ${LOCATIONS}"
echo "Target date: ${TARGET_DATE}"
rotate_log_if_needed

# ============ Pre-flight checks ==============
# --- Python available check
command -v "$PYTHON_BIN" >/dev/null 2>&1 || {
  log "ERROR: Python binary not found at $PYTHON_BIN"
  exit 1
}
[ -r "$PY_SCRIPT" ] || { log "ERROR: Script not readable at $PY_SCRIPT"; exit 1; }

if [ -z "${WEATHER_API_KEY:-}" ]; then
  log "ERROR: WEATHER_API_KEY is not set in environment."
  exit 1
fi

# ---- Activate virtualenv ----
if [ -f "${VENV_DIR}/bin/activate" ]; then
  log "INFO: Activating Python virtual environment at $VENV_DIR"
  source "${VENV_DIR}/bin/activate"
else
  log "ERROR: Virtual environment not found at $VENV_DIR"
  exit 1
fi

# ============ Run fetch_weather_api_data.py ============
log "INFO: Running Python ETL..."
# Run inside "if" so that set -e and the ERR trap do not pre-empt this handling
if "$PYTHON_BIN" "$PY_SCRIPT"; then
  log "INFO: ETL finished successfully."
else
  run_code=$?
  log "ERROR: ETL failed with exit code $run_code."
  exit "$run_code"
fi

echo "==== $(date '+%F %T') : ETL Finished ====" >> "$LOG_FILE"
exit 0

Explanation:

  • set -Eeuo pipefail + trap to catch and report any failure with line + command

  • rotate_log_if_needed: log rotation when the file exceeds 5 MB

  • Pre-flight checks: confirms Python is installed, script is readable, API key is set

  • Unified timestamped logging with full stdout/stderr capture

  • exit 0: reports success to the caller (cron) once every step has completed; failures exit earlier with a non-zero code
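The effect of the strict-mode options can be observed in isolation. In the sketch below each check runs in a subshell so the options do not leak into the surrounding script; the variable names are illustrative. In short, -e exits on a failing command, -u treats unset variables as errors, -o pipefail makes a pipeline fail when any stage fails, and -E lets the ERR trap apply inside functions and subshells.

```shell
#!/bin/bash
# Each check runs in a subshell so the options do not leak into this script.

# Without pipefail, a pipeline's status is that of its LAST command.
default_status=$( (false | true; echo $?) )

# With pipefail, any failing stage makes the whole pipeline fail.
pipefail_status=$( (set -o pipefail; false | true; echo $?) )

# With -u, expanding an unset variable is an error instead of an empty string.
unset_status=$( (set -u; echo "$undefined_demo_var") 2>/dev/null; echo $? )

echo "default=$default_status pipefail=$pipefail_status nounset=$unset_status"
```

The first value is 0 and the second is 1, which is why a failing extract step hidden in the middle of a pipeline goes unnoticed unless pipefail is set.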

Use the chmod command to make the shell script executable. The Python script is run through the interpreter, so it does not need the execute bit.

chmod u+x etl_pipeline.sh

Automating with Cron

For this, we’ll use cron, a time-based job scheduler in Unix-like operating systems.

Here’s how you can set up a cron job to run the ETL script at 06:05 on every weekday, Monday through Friday:

  1. Open your terminal and type the following command to edit your cron jobs.
crontab -e
  2. Schedule it with cron to run.
# Timezone note: ensure the system timezone or CRON_TZ is set appropriately.
CRON_TZ=Asia/Jakarta
5 6 * * 1-5 /bin/bash -lc '/home/username/scripts/etl_pipeline.sh'

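Explanation: The ETL runs every weekday at 06:05 AM, logging output to /home/username/scripts/logs/temperature.log. Cron starts jobs with a minimal environment, so bash -lc runs the script in a login shell, which reads your profile (for example ~/.profile) where WEATHER_API_KEY can be exported.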

  3. To see all your currently scheduled cron jobs.
crontab -l
  4. Check your output logs regularly.
tail -f /home/username/scripts/logs/temperature.log
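Because cron runs jobs with very few environment variables, a missing WEATHER_API_KEY is a likely first failure. One option, sketched here under the assumption that secrets are kept in a plain KEY=value file, is to load that file at the top of the wrapper script. The load_env helper and the idea of a dedicated env file are hypothetical additions, not part of the pipeline above:

```shell
#!/bin/bash
# load_env is a hypothetical helper; the env file location is an assumption.
load_env() {
  local env_file=$1
  [ -r "$env_file" ] || { echo "Cannot read $env_file" >&2; return 1; }
  set -a          # auto-export every variable assigned from here on
  . "$env_file"   # the file holds plain KEY=value lines
  set +a          # stop auto-exporting
}

# Demo with a throwaway file standing in for a real env file.
demo_env=$(mktemp)
echo 'WEATHER_API_KEY=demo-key-123' > "$demo_env"
load_env "$demo_env"
rm -f "$demo_env"
echo "Key loaded: ${WEATHER_API_KEY:+yes}"
```

Keeping such a file readable only by its owner (chmod 600) avoids exposing the key to other users on the machine.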

That’s it! The pipeline will append a daily stats row to the PostgreSQL database and log each run to temperature.log.

Summary

Shell scripting and Bash are powerful tools for automating repetitive manual tasks in Linux, allowing you to create repeatable processes and improve the quality of your work. Conditional statements such as if/elif/else blocks let a script execute different code based on a condition, loops repeat work over values or files, and functions group reusable blocks of code for better modularity. Finally, cron is a time-based job scheduler that runs these scripts automatically without manual intervention. I hope you enjoyed reading this.
