Mastering the Linux Shell and Cron: Essential Skills for Data Engineers

Data Engineer based in Jakarta, Indonesia. When I first started my career, I wanted to become a Software Engineer or Backend Engineer, but then I realized that I was more interested in being a Data Practitioner. I am currently focused on data engineering and cloud infrastructure. In my free time, I go jogging, listen to J-pop, and try to learn Japanese.
After learning the basics of Linux commands for file system navigation in the previous article, the next stage is to switch from manual execution to automation. This is where shell scripting and Bash are useful. The goal is to simplify our lives by automating boring manual tasks, which also reduces the human errors that creep into repetitive work.
Shell scripting turns these commands into repeatable processes, saving time and ensuring consistency for tasks like data transfers and daily report generation. It is especially useful for repetitive, multi-step operations that take too long to type line by line. Several shells can run such scripts, including bash, sh, csh, and ksh; this article uses bash.
We'll go over the basics of shell scripting with practical examples.
Introduction to shell scripting 📃
Shell scripts can be used to manage deployments, configure servers, and automate repetitive activities, among other things. Writing one is fairly easy: create a new file, by convention with a .sh extension. The first line of the script is the shebang (#!) line, which tells the system which interpreter to use, here Bash.
- Create new script name “run_shell.sh”
#!/bin/bash
printf 'Hello, welcome to Ubuntu on WSL2!\nThe current date and time is: %s\n' "$(date)"
- Make the script executable by granting execute permission to your user
$ chmod u+x run_shell.sh
Explanation:
- chmod modifies the permissions (mode) of a file, not its ownership.
- u+x adds execute (x) permission for the user (u) who owns the file, so the owner can now run the script. The octal mode 777 would also work, but it grants read, write, and execute to everyone, which is rarely what you want.
- run_shell.sh is the target file.
- Run the script
$ ./run_shell.sh
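As a rough sketch of what chmod does to a file's mode, the snippet below applies u+x to a throwaway file and prints the octal permissions before and after. The file comes from mktemp and the variable names are illustrative; stat -c is the GNU coreutils form found on Linux.

```shell
#!/bin/bash
# Compare the octal mode of a throwaway file before and after chmod u+x.
demo_script=$(mktemp)
chmod 644 "$demo_script"               # start from rw-r--r--
before=$(stat -c %a "$demo_script")    # %a prints the mode in octal
chmod u+x "$demo_script"               # add execute permission for the owner only
after=$(stat -c %a "$demo_script")
echo "before: $before, after: $after"  # prints: before: 644, after: 744
rm -f "$demo_script"
```

Only the first octal digit changes, because u+x touches the owner's permissions and leaves group and others alone.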
Variable
Variables are used to store data, such as numbers, text strings, or the output of commands. Within the script, these variables can then be accessed and changed. To assign a command's output to a variable, use command substitution, written $(command). To read a variable's value, prefix its name with $, as in echo $variable_name.
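A minimal sketch of command substitution, assuming nothing beyond the standard date and wc utilities. The variable names and the sample text are made up for illustration.

```shell
#!/bin/bash
# Command substitution: $(...) runs a command and captures its output.
today=$(date +%F)                               # ISO date, e.g. 2025-01-31
word_count=$(printf 'one two three\n' | wc -w)  # count the words in the sample text
word_count=$((word_count))                      # normalize: some wc builds pad with spaces

echo "Today is $today"
echo "The sample text has $word_count words"
```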
In shell scripting there are three main types of variables:
- System variable
System variables are predefined variables that are set by the operating system or shell environment, such as HOME, USER, and PATH.
$ echo $HOME
$ echo $USER
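- Local variable
A local variable is declared with the local keyword before the variable name. It can only be declared inside a function and is visible only within that function.
#!/bin/bash
show_info() {
  local FullName="John Doe"
  local age=20
  echo "Your $FullName and $age information"
}
show_info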
- Global variable
A global variable is a variable defined anywhere in the script without the local keyword. It is accessible throughout the entire script, including inside functions.
#!/bin/bash
FullName="John Doe"
function greeting() {
  FullName="Doja Cat" # no local keyword, so this overwrites the global variable
  age=20
  echo "Your $FullName and $age information"
}
greeting
echo $FullName # prints "Doja Cat"
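To make the difference in scope concrete, here is a small sketch that puts a global and a local variable of the same name side by side. The variable and function names are illustrative only.

```shell
#!/bin/bash
# Scope demo: 'local' keeps a variable inside the function that declares it.
city="Jakarta"            # global: visible everywhere in the script

change_city() {
  local city="Bandung"    # local copy: shadows the global only inside this function
  echo "Inside function: $city"
}

change_city               # prints: Inside function: Bandung
echo "Outside function: $city"   # prints: Outside function: Jakarta
```

Declaring variables as local inside functions is a cheap way to avoid accidentally overwriting values used elsewhere in a longer script.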
Conditional Statements
With conditional statements, a shell script can execute different code blocks depending on whether a given condition evaluates to true or false. The if statement is the most common conditional statement, and it can be expanded using else and elif clauses. This lets the script choose a response based on the result of the conditional expression.
#!/bin/bash
echo "Please enter a number: "
read num_1
echo "Please enter a number again:"
read num_2
echo "Sum result: $((num_1 + num_2))"
echo "-------- Check the input number1 ---------"
# Quote the variables so that empty input does not break the test
if [ "$num_1" -gt 0 ]; then
  echo "$num_1 is positive"
elif [ "$num_1" -lt 0 ]; then
  echo "$num_1 is negative"
else
  echo "$num_1 is zero"
fi
echo "-------- Check the input number2 ----------"
if [ "$num_2" -gt 0 ]; then
  echo "$num_2 is positive"
elif [ "$num_2" -lt 0 ]; then
  echo "$num_2 is negative"
else
  echo "$num_2 is zero"
fi
Explanation:
- read saves the user input into the variable named by its argument.
- The if condition uses -gt ("greater than") to check whether the value of num_1 is greater than zero.
- The elif ("else if") condition uses -lt ("less than") to check whether the value of num_1 is less than zero.
- fi is the keyword that terminates an if statement block.
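In data work, conditions often test files rather than numbers. The sketch below uses the standard file-test operators -d, -s, and -e in an if/elif/else chain. The function name check_input and the example paths are hypothetical.

```shell
#!/bin/bash
# Classify a path with file-test operators.
check_input() {
  local path=$1
  if [ -d "$path" ]; then
    echo "directory"
  elif [ -s "$path" ]; then     # -s: the file exists and is not empty
    echo "non-empty file"
  elif [ -e "$path" ]; then     # -e: the path exists (here: an empty file)
    echo "empty file"
  else
    echo "missing"
  fi
}

check_input /                        # prints: directory
check_input /path/that/does/not/exist   # prints: missing
```

A check like this can run before an ETL step so that the script stops early when an expected input file is absent or empty.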
Looping
Loops are crucial to shell programming because they allow you to process files, handle user input, and automate repeated activities. When you need to do the same thing again, like processing a list of files or iterating over an array of values, they are especially helpful. Loops come in a variety of forms, including the "for," "while," and "until" loops.
Here's an explanation of how to use each of these loops in a shell script:
- While
The while loop repeatedly executes a block of commands as long as a specified condition is true.
#!/bin/bash
i=1
while [[ $i -le 10 ]]; do
echo "$i"
(( i += 1 ))
done
Explanation: -le: This is a comparison operator that means "less than or equal to".
- For
The for loop is used to iterate over a sequence of values and below is the syntax.
#!/bin/bash
for i in {1..10}
do
echo $i
done
- Until
The until loop continues executing until a specific condition becomes true.
#!/bin/bash
counter=1
until [ $counter -gt 10 ]
do
echo $counter
counter=$((counter + 1))
done
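The loops above count through numbers. As a sketch of the "list of files" case mentioned earlier, the snippet below iterates over a bash array and uses a case pattern to pick out CSV files. The file names are invented and no real files are read.

```shell
#!/bin/bash
# Iterate over an array of (hypothetical) file names and count the CSV files.
files=("sales_2024.csv" "notes.txt" "users.csv")
csv_count=0

for f in "${files[@]}"; do
  case "$f" in
    *.csv)
      echo "Processing $f"
      csv_count=$((csv_count + 1))
      ;;
    *)
      echo "Skipping $f"
      ;;
  esac
done

echo "CSV files found: $csv_count"   # prints: CSV files found: 2
```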
Function
Shell script functions let you organize reusable code blocks, which makes scripts more modular and easier to maintain. A shell function is comparable to a function or subroutine in other programming languages.
There are two common ways to define functions in a shell script:
- Using the function keyword
#!/bin/bash
function demo {
echo "This is my function"
}
demo
echo "The exit status of the demo function is: $?"
- Without the function keyword
#!/bin/bash
echo "------- Do You Know Odd Number? -------"
echo -n "Enter the Number: "
read x
is_odd() {
  # -eq is the numeric comparison operator for the [ ] test
  if [ $((x % 2)) -eq 0 ]; then
    echo "Number ${x} is not odd."
    exit 1
  else
    echo "Yes! Number ${x} is Odd."
  fi
}
is_odd
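Both functions above work without parameters. A function can also take arguments, which arrive as $1, $2, and so on, and it can report a result through its exit status. The sketch below shows one way to do that; is_even is a made-up helper.

```shell
#!/bin/bash
# is_even takes its input as $1 and reports the result via its exit status.
is_even() {
  local n=$1
  (( n % 2 == 0 ))      # arithmetic test: status 0 (success) when n is even
}

for n in 3 8; do
  if is_even "$n"; then
    echo "$n is even"
  else
    echo "$n is odd"
  fi
done
```

Because the function returns a status instead of calling exit, the caller decides what to do with the answer and the script keeps running.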
Error Handling
Writing solid and dependable shell scripts requires the ability to handle errors. Without proper error handling, a script may keep running even after an error has occurred, which could have unanticipated consequences or even cause the system to crash. We must consciously choose actions that will enable us to deal with such situations.
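Below are several error handling techniques for different scenarios:
- if statement and exit status
Every command returns an exit status when it finishes, and the shell stores it in the special variable $?. To find out the result of a command, check this variable right away, before another command overwrites it. Success is indicated by an exit status of 0, and failure is indicated by any value that is not zero. The exit command terminates the script with the status you give it.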
#!/bin/bash
mkdir /tmp/mydir
if [ $? -eq 0 ]; then
echo "Directory /tmp/mydir created successfully."
else
echo "Failed to create directory /tmp/mydir."
exit 1
fi
Explanation:
- mkdir attempts to create a directory.
- $? holds the exit status of the last command, so it shows whether mkdir was successful.
- -eq 0 is a test that checks if the value of $? is equal to 0.
- exit 1 runs once the failure of mkdir is confirmed and terminates the script with a non-zero exit status.
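An equivalent and slightly shorter pattern is to let if test the command directly, since if branches on the command's exit status. The sketch below assumes a throwaway directory created with mktemp so that it is safe to run; the path is not the one used above.

```shell
#!/bin/bash
# Test the command directly instead of inspecting $? afterwards.
target_dir=$(mktemp -d)/mydir     # throwaway location for the demo

if mkdir "$target_dir"; then
  echo "Created $target_dir"
else
  echo "Failed to create $target_dir" >&2
fi

# Running mkdir again fails because the directory now exists.
if ! mkdir "$target_dir" 2>/dev/null; then
  echo "Second mkdir failed as expected"
fi
```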
- Function and trap
The trap command lets you define a command or function to run in response to a certain event or signal. With the ERR condition, trap captures any command that fails and runs the specified error handling code. It helps you create more robust and dependable scripts and is especially helpful for graceful termination, signal handling, error handling, and temporary file management.
#! /bin/bash
handle_error() {
echo "Error on line $1"
exit 1
}
trap 'handle_error $LINENO' ERR
mkdir /root/test_dir
echo "Directory created successfully."
Explanation:
- If mkdir /root/test_dir fails, the handle_error() function is executed, printing an error message and exiting the script.
- $LINENO is a special bash variable that contains the current line number; here it is the line where the error occurred.
- ERR is the condition to be trapped. It is a bash pseudo-signal raised when a command exits with a non-zero status.
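The same mechanism covers the temporary file management case: trapping EXIT runs a cleanup function whenever the shell ends, whether it succeeded or failed. The following is a sketch with illustrative names. The work runs in a subshell only so that the trap fires inside the demo; in a real script the trap would usually be set at the top level.

```shell
#!/bin/bash
# Remove a temporary file automatically when the (sub)shell exits.
tmp_file=$(mktemp)

cleanup() {
  rm -f "$tmp_file"
  echo "Removed temporary file"
}

# Run the work in a subshell so its EXIT trap fires when the subshell ends.
(
  trap cleanup EXIT
  echo "intermediate data" > "$tmp_file"
  echo "Working with $tmp_file"
)

[ -e "$tmp_file" ] || echo "Temporary file is gone"
```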
- Redirecting error messages to .log file
Redirection operators are used in Linux shell scripts to route error messages to a log file. Errors are written to the terminal by default, but you can reroute them to a file to facilitate debugging. Custom error messages can help users understand what went wrong and how to fix it by providing additional context.
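#!/bin/bash
# Send any error from ls (stderr) to script_errors.log
ls -l /home/ricky-suh101/shell-log-dump 2> script_errors.log
# Write the custom message only when mkdir actually fails
if ! mkdir /home/ricky-suh101/shell-log-dump 2>> script_errors.log; then
  echo "Error: Directory /home/ricky-suh101/shell-log-dump could not be created. Please check if you have the necessary permissions or if the directory already exists." >> error_dump.log
fi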
Explanation:
- Use the 2> operator to send standard error (stderr) to a specified file.
- If mkdir fails, the error message is appended (>>) to error_dump.log.
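For comparison, this sketch sends stdout and stderr to separate files and then to a single combined file with 2>&1. The log directory is a temporary one and the file names are illustrative. The || true is there only because ls returns a non-zero status for the missing path.

```shell
#!/bin/bash
# Send stdout and stderr to different places, then to one combined file.
log_dir=$(mktemp -d)

# ls prints the listing of / to stdout and complains about the missing path on stderr.
ls / /no/such/path > "$log_dir/out.log" 2> "$log_dir/err.log" || true

# 2>&1 duplicates stderr onto stdout, so both streams land in one file.
ls / /no/such/path > "$log_dir/all.log" 2>&1 || true

echo "stderr lines captured: $(wc -l < "$log_dir/err.log")"
```

The order matters: 2>&1 has to come after the > redirection, otherwise stderr is duplicated onto the terminal before stdout is pointed at the file.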
Cronjob for Task Scheduling ⏰
Cron is a time-based job scheduler that executes commands or scripts without manual intervention. The work is done by the cron daemon (named cron or crond depending on the distribution), which runs in the background and starts each job at its scheduled time. The cron syntax consists of five time fields (minute, hour, day of the month, month, and day of the week) followed by the command to run, which lets users define the timing and frequency of these operations.

Setup time schedule for cronjob | https://images.archbee.com/jvwmQL6VASLd-norgNd8V/w2SCPNLCbpuIl7EXjo4-E_untitled-1.png?format=webp
Crontab (short for "cron table") is the file that lists the jobs cron should run, and also the name of the command used to edit that file. Any task that you schedule with cron is called a cron job. Cron jobs give users the ability to plan and automate actions on Unix-like operating systems.
# To create or edit a crontab
$ crontab -e
# To view a list of active crontab
$ crontab -l
# To delete all tasks and start all over (add -i to be prompted for confirmation first)
$ crontab -r
# To view cron job history, search the system log
$ grep CRON /var/log/syslog
# To check whether the cron daemon is running
$ systemctl status cron # press q to quit
Below are some examples of scheduling cron jobs:

Image created by the author
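A few common schedules written out as text may also help. The sketch below splits each crontab line into its five time fields and the command. describe_cron is a hypothetical helper written for this article, not part of cron, and the script paths are placeholders.

```shell
#!/bin/bash
# Split a cron expression into its five time fields plus the command.
describe_cron() {
  local minute hour dom month dow cmd
  # read -r splits on whitespace; whatever is left over lands in cmd
  read -r minute hour dom month dow cmd <<< "$1"
  echo "minute=$minute hour=$hour day-of-month=$dom month=$month day-of-week=$dow -> $cmd"
}

describe_cron '0 * * * * /path/to/hourly.sh'      # every hour, on the hour
describe_cron '30 2 * * * /path/to/nightly.sh'    # every day at 02:30
describe_cron '*/15 * * * * /path/to/poll.sh'     # every 15 minutes
describe_cron '0 9 1 * * /path/to/monthly.sh'     # 09:00 on the 1st of each month
describe_cron '5 6 * * 1-5 /path/to/weekday.sh'   # 06:05, Monday through Friday
```

The quoted strings are exactly what would go on a line in crontab -e; an asterisk means "every value" for that field.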
Project Example: ETL weather data pipeline using cronjob and python
Let's look at a practical example. We want to analyze weather information. The pipeline extracts hourly weather data from WeatherAPI based on a city name, latitude and longitude, or another location identifier. It then computes daily statistics and inserts them into a PostgreSQL database so that we can query them with SQL.
Create a Virtual Environment
Navigate to where you want to create your project and then create a directory for it.
# /home/username/
$ mkdir scripts
$ cd scripts
Check your python version by running the following command.
$ python3 --version
Install the venv module, which is used to create virtual environments.
$ sudo apt-get install python3-venv
Inside your project directory, create the virtual environment.
$ python3 -m venv my_venv_name # Replace my_venv_name with your desired name
Activate the virtual environment.
$ source my_venv_name/bin/activate
Now, any Python packages you install using pip will be installed within this isolated environment.
$ pip install requests psycopg2-binary
When you are finished working in the virtual environment, you can deactivate it.
$ deactivate
Extracting Weather Data
First, we need to define a function that fetches weather data from http://api.weatherapi.com/v1.
import os
import requests
API_BASE = "http://api.weatherapi.com/v1"
API_KEY = os.getenv("WEATHER_API_KEY")  # set in environment
def get_weather_data(city: str) -> dict:
    """
    Extract current weather data from WeatherAPI.
    """
    # Request one day of forecast data, which includes the hourly readings
    url = f"{API_BASE}/forecast.json?key={API_KEY}&q={city}&days=1&aqi=no&alerts=no"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()  # raise an exception on HTTP 4xx/5xx responses
    return resp.json()
Transforming and Storing Data
Now that we have the raw data from the API, we need to transform it into a structured format that can be easily stored in a database.
import statistics
from datetime import datetime
def transform_weather_data(raw: dict) -> dict:
    """
    Transform raw data to compute statistics.
    """
    # The first forecast day holds the list of hourly readings
    hours = raw["forecast"]["forecastday"][0]["hour"]
    temps = [h["temp_c"] for h in hours]
    stats = {
        "date": raw["forecast"]["forecastday"][0]["date"],
        "city": raw["location"]["name"],
        "avg_temp": round(statistics.mean(temps), 2),
        "max_temp": max(temps),
        "min_temp": min(temps),
        "data_points": len(temps),
        "created_at": datetime.utcnow(),
    }
    return stats
Load to the target database
Load the transformed data into our PostgreSQL database, which allows us to persist the data and query it later.
import psycopg2
def load_to_postgres(data: dict):
    """
    Load statistics into PostgreSQL table.
    """
    # POSTGRES_CONN holds the connection settings defined in the complete script
    conn = psycopg2.connect(**POSTGRES_CONN)
    cur = conn.cursor()
    # Create the target table on the first run
    cur.execute("""
        CREATE TABLE IF NOT EXISTS weather_statistics (
            id SERIAL PRIMARY KEY,
            date DATE,
            city TEXT,
            avg_temp FLOAT,
            max_temp FLOAT,
            min_temp FLOAT,
            data_points INT,
            created_at TIMESTAMP
        );
    """)
    conn.commit()
    # Parameterized insert: psycopg2 fills in the %s placeholders safely
    cur.execute("""
        INSERT INTO weather_statistics
        (date, city, avg_temp, max_temp, min_temp, data_points, created_at)
        VALUES (%s, %s, %s, %s, %s, %s, %s);
    """, (
        data["date"], data["city"], data["avg_temp"],
        data["max_temp"], data["min_temp"], data["data_points"],
        data["created_at"]
    ))
    conn.commit()
    cur.close()
    conn.close()
Here's the complete Python script that connects to the REST API endpoint, fetches the weather data, and saves the computed statistics to the database.
# /home/username/scripts/fetch_weather_api_data.py
import os
import sys
import requests
import psycopg2
from datetime import datetime
import statistics
# API and Database Config
API_KEY = os.getenv("WEATHER_API_KEY") # set in environment
CITY = "Jakarta"
POSTGRES_CONN = {
"host": "localhost",
"database": "weatherdb",
"user": "weather_user",
"password": "weather_pass",
"port": 5432,
}
# Fetch data from the API
def get_weather_data(city: str) -> dict:
    """
    Extract current weather data from WeatherAPI.
    """
    url = f"http://api.weatherapi.com/v1/forecast.json?key={API_KEY}&q={city}&days=1&aqi=no&alerts=no"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()  # raise an exception on HTTP 4xx/5xx responses
    return resp.json()
# Transform API data
def transform_weather_data(raw: dict) -> dict:
    """
    Transform raw data to compute statistics.
    """
    hours = raw["forecast"]["forecastday"][0]["hour"]
    temps = [h["temp_c"] for h in hours]
    stats = {
        "date": raw["forecast"]["forecastday"][0]["date"],
        "city": raw["location"]["name"],
        "avg_temp": round(statistics.mean(temps), 2),
        "max_temp": max(temps),
        "min_temp": min(temps),
        "data_points": len(temps),
        "created_at": datetime.utcnow(),
    }
    return stats
# Save transform result into database
def load_to_postgres(data: dict):
    """
    Load statistics into PostgreSQL table.
    """
    conn = psycopg2.connect(**POSTGRES_CONN)
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS weather_statistics (
            id SERIAL PRIMARY KEY,
            date DATE,
            city TEXT,
            avg_temp FLOAT,
            max_temp FLOAT,
            min_temp FLOAT,
            data_points INT,
            created_at TIMESTAMP
        );
    """)
    conn.commit()
    cur.execute("""
        INSERT INTO weather_statistics
        (date, city, avg_temp, max_temp, min_temp, data_points, created_at)
        VALUES (%s, %s, %s, %s, %s, %s, %s);
    """, (
        data["date"], data["city"], data["avg_temp"],
        data["max_temp"], data["min_temp"], data["data_points"],
        data["created_at"]
    ))
    conn.commit()
    cur.close()
    conn.close()
def main():
    try:
        raw = get_weather_data(CITY)
        stats = transform_weather_data(raw)
        load_to_postgres(stats)
        print(f"[{datetime.now()}] Success: Data loaded for {CITY}")
    except Exception as e:
        # Report the failure on stderr and exit non-zero so the caller can detect it
        print(f"[{datetime.now()}] ERROR: {e}", file=sys.stderr)
        sys.exit(1)
if __name__ == "__main__":
    main()
A common approach is to create a shell script that contains all your ETL steps, then schedule that script with cron.
#!/bin/bash
# /home/username/scripts/etl_pipeline.sh
set -Eeuo pipefail
LOG_FILE="/home/username/scripts/logs/temperature.log"
SCRIPT_DIR="/home/username/scripts"
PY_SCRIPT="${SCRIPT_DIR}/fetch_weather_api_data.py"
VENV_DIR="/path/to/your/venv"
PYTHON_BIN="${VENV_DIR}/bin/python"
MAX_LOG_SIZE=$((5 * 1024 * 1024)) # 5 MB rotate threshold
ROTATED_LOG="${LOG_FILE}.$(date +%Y%m%d_%H%M%S)"
# --- Ensure log dir exists
mkdir -p "$(dirname "$LOG_FILE")" || true
touch "$LOG_FILE" || {
echo "Cannot write to $LOG_FILE. Check permissions." >&2
exit 1
}
# --- Redirect all stdout/stderr to log
exec >>"$LOG_FILE" 2>&1
# --- Defaults config
: "${LOCATIONS:=Jakarta}" # e.g., "Jakarta,Singapore"
TARGET_DATE="${TARGET_DATE:-$(date -d 'yesterday' +%F)}"
timestamp() { date +"%Y-%m-%d %H:%M:%S%z"; }
log() { echo "[$(timestamp)] $*"; }
rotate_log_if_needed() {
if [ -f "$LOG_FILE" ]; then
local size
size=$(stat -c%s "$LOG_FILE" 2>/dev/null || echo 0)
if [ "$size" -ge "$MAX_LOG_SIZE" ]; then
mv "$LOG_FILE" "$ROTATED_LOG" || log "WARN: Failed to rotate log"
# Reopen the log: the redirect set up by exec still points at the renamed file
exec >>"$LOG_FILE" 2>&1
log "INFO: Rotated log to $ROTATED_LOG"
fi
fi
}
on_error() {
local line=$1
local cmd=$2
local code=$3
log "ERROR: Exit code $code at line $line while running: $cmd"
exit "$code"
}
trap 'on_error $LINENO "$BASH_COMMAND" $?' ERR
echo "==== $(date '+%F %T') : Starting ETL ====" >> "$LOG_FILE"
echo "Locations: ${LOCATIONS}"
echo "Target date: ${TARGET_DATE}"
rotate_log_if_needed
# ============ Pre-flight checks ==============
# --- Python available check
command -v "$PYTHON_BIN" >/dev/null 2>&1 || {
log "ERROR: Python binary not found at $PYTHON_BIN"
exit 1
}
[ -r "$PY_SCRIPT" ] || { log "ERROR: Script not readable at $PY_SCRIPT"; exit 1; }
if [ -z "${WEATHER_API_KEY:-}" ]; then
log "ERROR: WEATHER_API_KEY is not set in environment."
exit 1
fi
# ---- Activate virtualenv ----
if [ -f "${VENV_DIR}/bin/activate" ]; then
log "INFO: Activating Python virtual environment at $VENV_DIR"
source "${VENV_DIR}/bin/activate"
else
log "ERROR: Virtual environment not found at $VENV_DIR"
exit 1
fi
# ============ Run fetch_weather_api_data.py ============
log "INFO: Running Python ETL..."
# Capture the exit code with || so that set -e and the ERR trap do not end the script before the check below
run_code=0
"$PYTHON_BIN" "$PY_SCRIPT" || run_code=$?
if [ "$run_code" -eq 0 ]; then
log "INFO: ETL finished successfully."
else
log "ERROR: ETL failed with exit code $run_code."
exit "$run_code"
fi
echo "==== $(date '+%F %T') : ETL Finished ====" >> "$LOG_FILE"
exit 0
Explanation:
- set -Eeuo pipefail + trap: catch and report any failure with the line number and the command that failed.
- rotate_log_if_needed: rotates the log when the file exceeds 5 MB.
- Pre-flight checks: confirm that Python is installed, the script is readable, and the API key is set.
- Unified timestamped logging with full stdout/stderr capture.
- exit 0: reports success to the caller. Failures exit earlier with a non-zero code.
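The script as listed does not create a lock. If overlapping runs are a concern, for example when one run takes longer than the cron interval, one possible addition is a mkdir-based lock, sketched below. This is not part of the original pipeline. The function names are hypothetical, and the lock path sits in a temporary directory only so the demo is safe to run; a real job would use a fixed, well-known path.

```shell
#!/bin/bash
# A mkdir-based lock: mkdir is atomic, so only one process can create the directory.
LOCK_DIR="$(mktemp -d)/etl_pipeline.lock"   # demo path; real use needs a fixed path

acquire_lock() { mkdir "$1" 2>/dev/null; }
release_lock() { rmdir "$1"; }

if acquire_lock "$LOCK_DIR"; then
  echo "Lock acquired, running job"
  # While the lock is held, a second attempt (e.g. an overlapping cron run) is refused.
  acquire_lock "$LOCK_DIR" || echo "Second attempt refused"
  release_lock "$LOCK_DIR"
else
  echo "Another run is in progress, skipping" >&2
fi
```

In a longer script, release_lock would typically be called from an EXIT trap so that the lock is removed even when the job fails.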
Use the chmod command to make the scripts executable for your user.
chmod u+x fetch_weather_api_data.py etl_pipeline.sh
Automating with Cron
For this, we’ll use cron, a time-based job scheduler in Unix-like operating systems.
Here’s how you can set up a cron job to run the ETL script “At 06:05 on every day-of-week from Monday through Friday”:
- Open your terminal and type the following command to edit your cron jobs.
crontab -e
- Schedule it with cron to run.
# Timezone note: ensure the system timezone or CRON_TZ is set appropriately.
CRON_TZ=Asia/Jakarta
5 6 * * 1-5 /bin/bash -lc '/home/username/scripts/etl_pipeline.sh'
Explanation: The ETL runs every weekday at 06:05 AM, logging output to /home/username/scripts/logs/temperature.log
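Cron starts jobs with a minimal environment, so WEATHER_API_KEY is only visible to the pipeline if something exports it, for example the login profile loaded by bash -l. One alternative, shown here as a sketch under assumptions, is to keep the key in a private env file and load it at the start of the job. The file location, the load_env helper, and the dummy key are all placeholders.

```shell
#!/bin/bash
# Load KEY=value pairs from an env file and export them to child processes.
ENV_FILE=$(mktemp)                       # real use: something like "$HOME/.weather_env"
printf 'WEATHER_API_KEY=dummy-key-123\n' > "$ENV_FILE"
chmod 600 "$ENV_FILE"                    # readable and writable by the owner only

load_env() {
  set -a            # auto-export every variable assigned while sourcing
  . "$1"
  set +a
}

load_env "$ENV_FILE"
# A child process (such as the Python ETL) now sees the variable.
bash -c 'echo "Key visible to child process: ${WEATHER_API_KEY:+yes}"'
rm -f "$ENV_FILE"
```

Keeping the file at mode 600 matters because it holds a secret; the same reasoning applies to the database password, which the Python script currently hard-codes.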
- To see all your currently scheduled cron jobs.
crontab -l
- Check your output logs regularly.
tail -f /home/username/scripts/logs/temperature.log
That’s it! On each run, the pipeline appends a daily stats row to the PostgreSQL database and logs the run to temperature.log.
Summary
Shell scripting and Bash are powerful tools for automating repetitive manual tasks in Linux, allowing you to create repeatable processes and improve the quality of your work. Conditional statements like the if/elif/else blocks allow a script to execute different code based on a condition, loops repeat a block of commands, and functions group reusable blocks of code for better modularity. Finally, cron is a time-based job scheduler that can run these scripts on a schedule without manual intervention. I hope you enjoyed reading this.




