Essential Linux Command Line for Data Engineers

In early 2025, I started to notice how important it is to be familiar with Linux commands, once I realized that my career path lies in the data field and that my core skill is data engineering. As a Windows user, I'll be using Windows Subsystem for Linux (WSL) with Ubuntu and Docker. Ubuntu on WSL integrates seamlessly with Docker, allowing you to run containers with ease. We will discuss containerization and Docker later on.

Why Do Data Engineers Need to Learn Linux?

Managing Cloud Instances

Data engineers frequently work with remote servers and cloud virtual machines. Connecting to these instances, navigating file systems, managing processes, keeping an eye on system performance, and troubleshooting issues are all done using Linux commands.

Data Processing Tools and Frameworks

The majority of Hadoop and Spark clusters run on Linux nodes, so you end up running Linux commands as part of your daily responsibilities. Mastery of Linux commands helps data engineers deploy, configure, and manage these clusters effectively.

Automation and Scripting

Automation is a good investment for repetitive tasks that are time-consuming to execute line by line. Shell scripting is a powerful tool for automating tasks such as database backups, data transfers, report generation, and ETL workflows.

Troubleshooting and Debugging

Data engineers are occasionally in charge of making sure that the cloud's infrastructure and databases function properly. Linux commands are essential for reviewing logs, identifying problems, and carrying out required fixes when disruptions occur in data pipelines or cloud infrastructure.

Basic Linux Commands for Data Engineers

Here's a summary of useful Linux commands with examples:

Basic File and Directory Management

pwd: print the current working directory
ls: list directory contents

# Show the permissions, owner, size, modification date and time, and name of each file or directory
ls -l
# List the contents of the current directory, including the contents of each of its subdirectories
ls *
# List all files and directories, including hidden ones
ls -a

cd: change to a different directory
mkdir: create a new directory
rm: remove files or directories

# Remove a file
rm test3.txt
# Remove a directory and its contents
rm -r test-dir
# Remove an empty directory
rmdir run-dir

cp: copy files or directories

# Copying a file to the same directory with a new name
cp text_file.txt new_text_file.txt
# Copying a file to a different directory
cp test0.txt /home/user/documents/
# Copying a file to a different directory and renaming it
cp test0.txt /home/user/documents/test1.txt

mv: move or rename one or more files

# Moving a file to another directory
mv file1.txt test-dir
# Moving multiple files to another directory
mv file1.txt file2.txt file3.txt test-dir
# Rename a file (-v prints what was done)
mv -v file1.txt text_file.txt

touch: create an empty file if it doesn't already exist (or update the timestamps of an existing file)
cat: view file contents, combine files, and create new files

man: view the reference manual of a command or utility used in the terminal
head: show the first 10 lines of a file (by default)

tail: print the last lines of one or more files to standard output (the last 10 by default)

less: display the contents of a file or a command's output, one page at a time
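
The commands above can be tried together in a throwaway directory. The following is a minimal sketch, assuming a Bash-compatible shell; the directory and file names (raw, archive, orders.csv) are made up for illustration and do not come from any real project.

```shell
#!/usr/bin/env bash
set -eu

# Work inside a temporary directory so nothing outside it is touched.
workdir=$(mktemp -d)
cd "$workdir"

mkdir raw archive                              # create two directories
touch raw/orders.csv                           # create an empty file
printf 'id,amount\n1,10\n2,25\n' > raw/orders.csv

cp raw/orders.csv raw/orders_backup.csv        # copy under a new name
mv raw/orders_backup.csv archive/              # move into another directory
mv archive/orders_backup.csv archive/orders_2025.csv   # rename

head -n 1 archive/orders_2025.csv              # first line (the header)
tail -n 1 archive/orders_2025.csv              # last line

rm raw/orders.csv                              # remove a single file
rmdir raw                                      # remove the now-empty directory
ls -l archive                                  # confirm what is left
```

Because everything happens under the directory returned by mktemp, the whole experiment can be discarded afterwards with rm -r on that directory.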

User Group and File Permissions

Every file and folder on a Linux system has three sets of permissions. Each of the three sets applies to a different class of users:

User: the owner of a file belongs to this class
Group: the members of the file’s group belong to this class
Other: these permissions apply to all other users on the system

Within each class, permissions come in three types: read (r), write (w), and execute (x).

whoami: display the effective username of the current user
sudo: allows users to run programs using the security privileges of another user, generally root (superuser)

# Example
sudo chmod 750 testfile

chown: allows users to change the ownership of a file, directory, or link
chmod: manage permissions for files and directories

Symbolic notation represents the read, write, and execute permissions with letters and operators.

Change permissions with the chmod command using symbols:
- plus (+): add permissions on top of the existing permissions
- minus (-): remove permissions from the existing permissions
- equals (=): overwrite the old permissions and set them to exactly what is specified
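
To see the three symbolic operators in action, here is a small sketch. It assumes GNU coreutils (stat -c is the GNU form, as found on Ubuntu), and the file name job.sh is hypothetical.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)
script="$workdir/job.sh"          # hypothetical file name
touch "$script"

chmod u=rw,g=rw,o=r "$script"     # = sets an exact state:      rw-rw-r--
chmod u+x "$script"               # + adds execute for owner:   rwxrw-r--
chmod g-w "$script"               # - removes write from group: rwxr--r--
chmod o= "$script"                # = with nothing clears all:  rwxr-----

perms=$(stat -c '%A' "$script")   # permissions in rwx form
echo "$perms"
```

Each call changes only the class named before the operator (u, g, or o), which is why the steps can be applied one after another without disturbing the other classes.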

Octal notation represents permissions using a three-digit number, where each digit is the sum of the permissions granted to one class (owner, group, others). The values used with chmod in octal notation are:

number 4 for read
number 2 for write
number 1 for execute
number 0 for no permission

Each digit is the sum of the permission values:

total 7: read, write, and execute (rwx)
total 6: read and write (rw-)
total 5: read and execute (r-x)
total 4: read only (r--)

Using the number 750 is a shorthand way to set permissions. Here is the breakdown:

7 (owner): Read (4) + Write (2) + Execute (1) = 7 (rwx)
5 (group): Read (4) + Execute (1) = 5 (r-x)
0 (others): no permissions (---)
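
The octal arithmetic can be checked directly on a scratch file. This is a sketch that assumes GNU stat (the -c option); the file name testfile mirrors the one used elsewhere in this article, but here it lives in a temporary directory.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)
target="$workdir/testfile"
touch "$target"

# Build each digit from its parts: read = 4, write = 2, execute = 1.
owner=$((4 + 2 + 1))    # rwx
group=$((4 + 1))        # r-x
others=0                # ---

chmod "${owner}${group}${others}" "$target"

octal=$(stat -c '%a' "$target")       # numeric form of the permissions
symbolic=$(stat -c '%A' "$target")    # the same permissions in rwx form
echo "$octal $symbolic"
```

Reading the two forms side by side makes it easier to translate between a number such as 750 and the rwx string shown by ls -l.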

useradd: creates a new user with a group and, depending on the options and system defaults, a home directory for that user

Every user is assigned a name, a unique user ID (UID), a group, and a group ID (GID). When a user is created, a new UID and a matching GID are assigned. The number ranges depend on the type of user (these are the typical defaults on most distributions):
- Administrator (root): UID and GID = 0
- System user (computer-generated): UID and GID assigned from 1 to 999
- Normal users (real people): UID and GID = 1000 or greater, incremented with every new user

userdel: delete a user account
usermod: modify user account information

With usermod, the -a (append) option in -aG adds the user to the listed groups without removing them from any other group.

groupadd: create a new group
groupdel: delete a group
groupmod: modify group account information
chage: manage and view user password expiration and account aging information

whoami 
# Display where users are logged in from and what they are currently doing 
w
# Create new user
sudo adduser datarunner01
# Create password for new user
sudo passwd datarunner01
# Check for password change requirements
sudo chage -l datarunner01
# Searches and displays list of users
sudo getent passwd
# Create a new group
sudo groupadd data
# Searches and displays list of group 
sudo getent group data
# Displays file contents in /etc/group and sorts the output alphabetically
cat /etc/group | sort
# Add a user to a group while creating the account (useradd fails if the user already exists)
sudo useradd -G data datarunner01
# or add an existing user to a group
sudo usermod -aG data datarunner01
# Verify that the user is now a member of the group
groups datarunner01
# Switch to another user
su datarunner01
# Create a new file and change its ownership
touch testfile
ls -l testfile
# Assign a new owner and owning group of a file at the same time
sudo chown datarunner01:datarunner01 testfile
ls -l testfile
# Set permissions: rwx for the owner, r-x for the group, none for others
sudo chmod 750 testfile
ls -l testfile
# Remove user from list of a group
sudo gpasswd -d datarunner01 data
# Remove the user
sudo userdel -r datarunner01
# Switch to an interactive session as a root user
sudo -i

File compression & archiving

gunzip: a tool for decompressing files that were compressed with the gzip command
gzip (“GNU zip”): a tool for compressing and decompressing individual files

# Compress a file (replaces test-1.txt with test-1.txt.gz)
gzip test-1.txt
# Show statistics about a compressed file (compressed size, uncompressed size, ratio)
gzip -l test-2.txt.gz
# Force decompression, overwriting an existing output file
gunzip -f test-1.txt.gz
# Verify that a compressed file is valid without decompressing it
gunzip -t test-2.txt.gz

tar (“Tape Archiver”): create, extract, and list archive files

# Archive and compress a directory with all its files and subdirectories
tar -zcvf archive.tar.gz directory/
# Decompress and unpack the archive into the current directory
tar -zxvf archive.tar.gz
# Create a bzip2-compressed archive, which is usually smaller than the gzip equivalent
tar cvfj test-1.tar.tbz test-1.txt

Key differences between the gzip, gunzip, unzip, and tar commands in Linux | Image created by author
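
A quick round trip shows how gzip and tar fit together. The sketch below assumes gzip and GNU tar are installed; the directory name data and the file sample.txt are invented for the example.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)
cd "$workdir"

mkdir data
printf 'alpha\nbeta\ngamma\n' > data/sample.txt   # hypothetical sample file

# gzip works on a single file; -c writes to stdout so the original is kept.
gzip -c data/sample.txt > data/sample.txt.gz
gzip -l data/sample.txt.gz       # compressed size, uncompressed size, ratio
gunzip -t data/sample.txt.gz     # integrity check, nothing is extracted

# tar bundles a whole directory; -z adds gzip compression on top.
tar -czf data.tar.gz data
tar -tzf data.tar.gz             # list the archive without extracting it

# Extract into a separate directory and read the restored file back.
mkdir restored
tar -xzf data.tar.gz -C restored
restored_text=$(cat restored/data/sample.txt)
echo "$restored_text"
```

The split of responsibilities is the point here: gzip compresses one stream of bytes, while tar collects many files and directories into one stream that gzip can then compress.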

Standard File Streams

In Linux, every running process is automatically provided with three standard input/output (I/O) communication channels, known as standard streams: standard input (stdin), standard output (stdout), and standard error (stderr).

The three standard file streams in Linux | Image created by author

Standard input (stdin)

The standard input stream is where programs read their input by default. What you type on the keyboard becomes the input for a program. Linux assigns file descriptor 0 to standard input.

Standard output (stdout)

Standard output is where a command writes its normal, expected results. Unless redirected, any regular output generated by a program or command is sent to stdout. Linux assigns file descriptor 1 to standard output.

Standard error (stderr)

A program can alert you to errors or warnings using standard error (stderr). Stderr allows errors to be separated from regular output. Linux assigns file descriptor 2 to standard error.

Linux - Standard Stream | https://geek-university.com/wp-content/images/linux/linux_streams.png
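
The three file descriptors can be observed with a tiny function that writes to both output streams. This is only an illustration; the function name report and the log file names are assumptions made for the example.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)

# Writes one line to stdout (fd 1) and one line to stderr (fd 2).
report() {
  echo "rows loaded: 42"                # regular output, goes to fd 1
  echo "warning: 1 row skipped" >&2     # diagnostic, goes to fd 2
}

# Send each stream to its own file: > targets fd 1, 2> targets fd 2.
report > "$workdir/out.log" 2> "$workdir/err.log"

# stdin (fd 0): wc receives the file content on its standard input.
lines=$(wc -l < "$workdir/out.log")

cat "$workdir/out.log"
cat "$workdir/err.log"
```

Keeping warnings on stderr means a downstream command that reads the regular output never has to filter them out.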

Text Processing

grep (“global regular expression print”): search for patterns within files or input streams

# Limit grep output to a fixed number of matching lines
grep -m3 'ransom' test-3.txt
# Search for multiple patterns
grep -e 'one' -e 'file' -e 'malware' test-3.txt
# Use grep with pipes
cat test-3.txt | grep 'category'

sed (“stream editor”): search for patterns in a file or input stream and transform the matching text, for example by substituting, inserting, or deleting lines

# Create a new file
echo -e "Hello, world\nThis is a test\nHello, labex\nWorld of Linux" > sed_test.txt
# Replace text with the s (substitute) command
sed 's/Hello/Hi/g' sed_test.txt
# To insert text before a line, use the i command (here, before line 3)
sed '3i\Hi, Windows Subsystem for Linux 2' sed_test.txt
# To delete a specific line, use the d command
sed '4d' sed_test.txt
# To replace text using a regular expression,
# use [Ww], which matches either an uppercase "W" or a lowercase "w"
sed 's/[Ww]orld/Universe/g' sed_test.txt

In the sed command, delimiters separate the parts of the search pattern:

Forward slashes (/) as delimiters: used to separate the different parts of the substitute command.
Syntax format: s/search/replace/flags

Other delimiters: other non-whitespace characters, such as commas, semicolons, or custom symbols, can be used in place of the slashes.

Backslashes (\) for escaping: used to escape special characters or to indicate literal interpretation. They are also used with commands like i\ (insert) and a\ (append).
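
Choosing a different delimiter is most useful when the pattern itself contains slashes, such as a file path. A minimal sketch, with a made-up path:

```shell
#!/usr/bin/env bash
set -eu

line='input=/data/raw/orders.csv'     # hypothetical path for illustration

# With / as the delimiter, every slash inside the pattern must be escaped.
escaped=$(printf '%s\n' "$line" | sed 's/\/data\/raw/\/data\/clean/')

# With | as the delimiter, the same substitution needs no escaping.
piped=$(printf '%s\n' "$line" | sed 's|/data/raw|/data/clean|')

echo "$escaped"
echo "$piped"
```

Both commands produce the same result; the second form is simply easier to read and harder to get wrong.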

awk: scans a file line by line, splits each line into fields (on whitespace by default), and filters or prints data based on those fields

# Show the first field of each line
awk '{print $1}' sed_test.txt
# Print the first and second fields of records whose third field is greater than 10 and whose fourth field is less than 20
awk '$3 > 10 && $4 < 20 {print $1, $2}' sed_test.txt
# Print the first and third fields of the first three lines
awk '{print $1, $3}' sed_test.txt | head -n 3

You can combine multiple conditions on these field variables using logical operators:

&& (AND): both conditions must be true
|| (OR): at least one condition must be true
! (NOT): negates a condition
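
The logical operators are easier to follow on data that actually has numeric columns. The sketch below uses an invented four-column file (name, region, quantity, price); none of these values come from a real dataset.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)

# Hypothetical whitespace-separated data: name region quantity price
cat > "$workdir/sales.txt" <<'EOF'
apple east 12 15
pear west 8 30
plum east 20 18
kiwi north 15 25
EOF

# AND: quantity above 10 and price below 20
both=$(awk '$3 > 10 && $4 < 20 {print $1}' "$workdir/sales.txt")

# OR: region is west or north
either=$(awk '$2 == "west" || $2 == "north" {print $1}' "$workdir/sales.txt")

# NOT: every row whose region is not east
not_east=$(awk '!($2 == "east") {print $1, $3}' "$workdir/sales.txt")

echo "$both"
echo "$either"
echo "$not_east"
```

Note that awk compares $3 and $4 as numbers because the fields look numeric, while $2 is compared as a string because the other operand is a quoted string.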

cut: extract parts of each line by byte position, character position, or field

# Show only the first character of each line
cut -c1 sed_test.txt
# Use ps to list processes, then filter it with cut to display specific columns, such as the process ID (PID) and the command name
ps -e | cut -c1-5,25-

In awk, you can reference every column through a variable named after its column number. For example, the first column is $1, the second is $2, and the entire line is $0.

sort: sort file contents and print the result to standard output
uniq: filter out adjacent repeated lines in a file or from command input

# Remove duplicate lines from a file
sort test-2.txt -o test-2.txt_sorted.txt
uniq test-2.txt_sorted.txt
# Show only the lines that are not repeated
sort test-2.txt -o test-2.txt_sorted.txt
uniq -u test-2.txt_sorted.txt
# Use a pipeline to sort the file and count duplicates
cat test-2.txt
sort test-2.txt | uniq -c
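
Since uniq only collapses lines that sit next to each other, it is almost always paired with sort. Here is a small frequency-count sketch on made-up status values; the file name status.txt is an assumption.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)
printf 'ok\nerror\nok\ntimeout\nok\nerror\n' > "$workdir/status.txt"

# Sort first so identical lines become adjacent, count them with -c,
# then sort the counts numerically in descending order.
counts=$(sort "$workdir/status.txt" | uniq -c | sort -rn)

# Lines that occur exactly once.
once=$(sort "$workdir/status.txt" | uniq -u)

# Number of distinct values (sort -u drops duplicates by itself).
distinct=$(sort -u "$workdir/status.txt" | wc -l)

echo "$counts"
echo "appears once: $once"
echo "distinct values: $distinct"
```

The sort | uniq -c | sort -rn chain is a common way to get a quick "top values" report from a log or a column extracted with cut or awk.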

wc (“word count”): count the number of lines, words, and bytes in one or more files

# Display the file's line, word, and byte counts
wc test-1.txt
# Display the word count only
wc -w test-1.txt
# Get the size of a file in bytes
wc -c test-1.txt

tee: read standard input (stdin) and write it to both standard output (stdout) and one or more files

# Append a line of text to a file
echo "This text will be added" | tee -a example.txt
# Store the output of the ls command in example2.txt, pass the same output on to grep, and display every line that contains "example2"
ls | tee example2.txt | grep "example2"
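
A typical use of tee in a pipeline is keeping a full copy of the data while a later stage filters it. The sketch below is illustrative only; the log file name and the id= records are invented.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)
log="$workdir/pipeline.log"     # hypothetical log file

# tee saves the complete stream to the log and passes it on unchanged,
# so grep can count the matching records at the same time.
kept=$(printf 'id=1\nid=2\nskip\nid=3\n' | tee "$log" | grep -c '^id=')

# -a appends to the log instead of overwriting it.
echo "done" | tee -a "$log" > /dev/null

echo "records kept: $kept"
cat "$log"
```

The log ends up with every input line plus the final marker, even though grep only saw and counted the lines it cared about.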

I/O Redirection

I/O redirection is expressed with special characters within the command. These operators let you direct the input or output streams of your command.

The input operator

The input operator controls the command's input stream: it makes the command read its standard input from a file instead of the keyboard. The left angle bracket ( < ) represents this operator.

The output operator

The output operator controls the command output stream: it writes standard output to a file, overwriting any existing content. The right angle bracket ( > ) represents this operator.

The output append operator

The output append operator does the same as the output operator, except that it appends data to the end of the file instead of overwriting it.

Two right angle brackets ( >> ) represent this operator.

The non-standard operator

The non-standard operators, which are Bash extensions rather than part of the POSIX shell, redirect both standard output (stdout) and standard error (stderr) at once.

One overwrites ( &> ) and one appends ( &>> ) data to the output file.

Pipe operator

The pipe operator is used to pass the output of a command to the input of another command. The vertical bar ( | ) represents this operator.

Here are some examples of using each operator:

echo: display text or variables in the terminal

echo "Learning linux command echo."
# Display the values of variables or concatenate strings
name="RickySuh"
echo "Hello, $name!"
# Formatting output and enhancing readability
echo -e "Hello, world!\nThis is a test.\nHello from Virtual Machine."

The echo command can also write its output to a file instead of displaying it in the terminal. Use the right angle bracket ( > ) together with the echo command.

echo -e "The sort utility sorts text and binary files by lines.\nA line is a record by a newline (default) or NULL.\nA record can contain any printable or unprintable characters." > test-1.txt
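
The redirection operators described in this section can be exercised side by side in a scratch directory. This is a sketch with hypothetical file names; it uses the portable `> file 2>&1` spelling for capturing both streams, which in Bash is equivalent to the `&> file` shorthand.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)
cd "$workdir"

echo "first run"  > run.log       # > creates the file or overwrites it
echo "second run" >> run.log      # >> appends to the end of the file

# < feeds the file to the command's standard input.
line_count=$(wc -l < run.log)

# Capture stdout and stderr together. ls reports run.log on stdout and
# complains about the missing file on stderr; both land in both.log.
# In Bash, `ls run.log missing.csv &> both.log` does the same thing.
ls run.log missing.csv > both.log 2>&1 || true

# | connects the stdout of one command to the stdin of the next.
upper=$(cat run.log | tr 'a-z' 'A-Z' | head -n 1)

echo "lines: $line_count"
echo "$upper"
cat both.log
```

The `|| true` after ls is there only because ls exits with an error when a file is missing, and the script runs with set -e.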

find: search for files and directories within a specified path based on different criteria

# Use the -name option to find a file by name
find ~ -name "test1.txt"
# Combine find with wildcards (*) to match a given pattern
find ~ -name "test*.txt"
# Find the file test-1-dup.txt.gz anywhere under /home
sudo find /home -name "test-1-dup.txt.gz"
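
To experiment with find without scanning the whole home directory, a small directory tree can be built first. The sketch assumes GNU find (for -mindepth); the directory and file names are made up.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)
mkdir -p "$workdir/landing/2025" "$workdir/tmp"
touch "$workdir/landing/test1.txt" \
      "$workdir/landing/2025/test2.txt" \
      "$workdir/tmp/notes.md"

# Quote the pattern so the shell passes the * to find unexpanded.
# -type f restricts the result to regular files.
matches=$(find "$workdir" -type f -name 'test*.txt' | sort)

# -type d selects directories; -mindepth 1 leaves out the start directory.
dir_count=$(find "$workdir" -mindepth 1 -type d | wc -l)

echo "$matches"
echo "directories: $dir_count"
```

Combining -name with -type is a simple way to avoid matching a directory that happens to share the pattern.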

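EOF (“end-of-file”): signals that no more data can be read from a file or input stream. In shell scripts, EOF is also a conventional delimiter for a here-document (<<EOF), which feeds the lines up to the closing EOF to a command as its standard input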

touch demo.sh
nano demo.sh
----------------
#!/bin/bash
cat <<EOF
Hello world!
This is command run from sh file.
End of text.
EOF
---------------
bash demo.sh
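
One detail worth knowing about here-documents is that quoting the delimiter changes how variables are treated. The following sketch writes two small files to show the difference; the variable table and the SQL text are purely illustrative.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)
table="orders"      # hypothetical table name

# Unquoted delimiter: the shell expands $table inside the here-document.
cat > "$workdir/query.sql" <<EOF
SELECT COUNT(*) FROM $table;
EOF

# Quoted delimiter: the body is taken literally, $table stays as written.
cat > "$workdir/template.sql" <<'EOF'
SELECT COUNT(*) FROM $table;
EOF

cat "$workdir/query.sql"
cat "$workdir/template.sql"
```

The word EOF itself is only a convention; any delimiter works as long as the opening and closing words match.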

Process and Service

ps (“process status”): display a list of currently running processes with their PIDs

ps
# Display a complete picture of all currently running processes
ps aux

top: a tool for displaying a dynamic view of the processes running on the current system

top
# Refresh the summary of system resource usage 7 times, then exit, using the -n option
top -n 7

There is a lot of information in the default top output, which could be confusing. Use a few keystrokes to adjust the contents, locate the data you require, or remove specific parts of the summary.

Press 1: toggle between the combined CPU view and a separate line per CPU
Press t: cycle through the display modes of the task and CPU summary lines, including elementary ASCII graphs
Press m: cycle through the display modes of the memory and swap lines
Press z: toggle color in the display

htop: an interactive, real-time alternative to the top command for monitoring processes on the current system

kill: send signals to processes, typically to terminate them

# Send a signal to a process by ID (without -s, kill sends SIGTERM)
sudo kill -s <SIGNAL> <PID>
# List the available signals
kill -l

killall: kill processes by name

sudo killall chrome
# If the process does not terminate gracefully with the default signal (SIGTERM), you can force it with the -9 option (SIGKILL)
sudo killall -9 chrome
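
Signals can be tried safely on a process you start yourself, with no sudo involved. In this sketch a background sleep stands in for a long-running job; the 300-second duration is arbitrary.

```shell
#!/usr/bin/env bash
set -eu

sleep 300 &            # stand-in for a long-running job
pid=$!                 # PID of the most recent background process

kill -0 "$pid"         # signal 0 sends nothing, it only checks the PID exists
kill -TERM "$pid"      # same as plain `kill`: ask the process to exit

# wait reports 128 + the signal number for a process ended by a signal,
# so SIGTERM (15) shows up as 143.
status=0
wait "$pid" || status=$?
echo "exit status: $status"
```

Checking the exit status this way is useful in scripts that need to tell a job that finished normally apart from one that was terminated.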

Summary

This article covered why Linux commands matter for data engineers: they are used for managing cloud instances, working with data processing frameworks like Hadoop and Spark, and enabling automation through shell scripting. It summarized essential commands for file and directory management, including ls, cp, mv, and rm, along with file compression tools such as gzip, gunzip, and tar. It also covered commands for users and file permissions (chown, chmod), text processing (grep, sed, awk), and monitoring system processes (ps, top). Overall, fundamental Linux skills are crucial for a career in data engineering. I hope you enjoyed reading this.
