Skip to main content

Command Palette

Search for a command to run...

Data Modeling Fundamentals part-1: Introduction to Data Model and Data Modeling Types

Published
9 min read
Data Modeling Fundamentals part-1: Introduction to Data Model and Data Modeling Types
R

Data Engineer based in Jakarta, Indonesia. When I first started out in my career, I was all about becoming a Software Engineer or Backend Engineer. But then, I realized that I was actually more interested in being a Data Practitioner. Currently focused on data engineering and cloud infrastructure. In my free time, I jog and running as a hobby, listening to Jpop music, and trying to learn the Japanese language.

Stakeholders are people, organizations, or other entities that have a stake in a business and either influence or are impacted by its performance. It originated from external stakeholders who are indirectly impacted by the company's decisions and activities, or from internal stakeholders who are actively involved in the business's operations. Others, such as DBAs, software engineers, data engineers, or other IT teams, will be regarded as supporters.

In this article, I’m going to focus on supporters side for implement good data model to optimize pipelines, speed up queries, and make your engineering work smoother and more maintainable. Before moving on to data pipelines and dashboards, we will go over the basics of data modeling. Now let's get started.

What is Data Models?

Data modeling is the process of building a data model which acts as a blueprint for a database design. It involves defining the structure, relationships, and constraints of data elements to guarantee correct and effective data storage and retrieval. Starting with the gathering of business requirement details from stakeholders, then converted into data structures that create a concrete database architecture.

The components of a data model are as follows:

  • Entities

The real items or ideas that an organization wishes to have a deeper understanding of are represented by entities. Entities generally have attributes that describe their current state. Each instance of an entity represents a row in the table.

  • Attributes

Attributes contain information used to characterize and differentiate instances of an entity. It offers certain data that is necessary for efficiently identifying, classifying, and managing entities. They represent the information we want to save about an entity.

  • Relationships

Relationships in a database contribute to the creation of useful information. Relationships are the links that exist between entities in a data model. Relationships specify how tables are connected by shared data in relational databases.

There are three kinds of cardinality relationships:

(a) One-to-One Relationship: this type of relationship is less common but helpful in situations where two tables have a unique mapping of data

(b) One-to-Many Relationship: The most common form. One record in one table is related to several entries in another table.

(c) Many-to-Many Relationship: this type is used when a direct one-to-many relationship is not feasible. When several objects or records are connected to a single corresponding record.

  • Data Types

Data types in data modeling describe interactions or correlations. The data type also determines the kind of operations that may be performed on the column, as well as how much memory is required to hold the values. Selecting the right data types enhances query performance, preserves data integrity, and optimizes storage.

  • Keys

In data modeling, keys are used to identify records, link tables, and give alternate unique identifiers for integrity and relationships. Several can be used with a data model, such as:

(a) Primary key: a column, or set of columns, in a database that serves as a unique identifier for every row

note: Rider_id is the primary key (PK); Rider_id attribue has uniquely identifies each row in this table

(b) Foreign key: a column in one table that establishes a connection with the Primary Key in another table

note: Trip_id is the primary key; Rider_id and driver_id are the foreign key (FK); This two attribute links to the Rider table and the Driver table

(c) Surrogate key: an artificially generated key that serves as a table's primary key

note: Ride_fact, Rider_sk, Drive_sk, and Vehicle_sk are the surrogate key

(d) Composite key: a primary key with two or more columns

note: the primary key is the combination of the first three columns [driver_id (PK, FK); vehicle_id (PK, FK); shift_date (PK)]

Why Data Models Matters?

Data models defines what the business wants to achieve in term of communicate with stakeholders. Data modeling is a single source of truth that makes business reports simpler and more understandable while preventing confusion over numbers and other metrics across departments. Organizations may be able to browse and manage their data assets more effectively with the use of an effective data model.

Data modeling converts the design into a real-world schema that may be used to create databases, tables, indexes, and other objects on a particular database platform. In order to assure efficiency, quality, and scalability for data pipelines, data modeling is also essential. It does this by logically organizing data, enforcing rules, cutting redundancy, and maximizing performance.

Here’s key impacts for good data modeling:

(a) Enhanced Performance & Efficiency

When handling huge volumes of data, well-structured data minimizes query complexity, resulting in faster data retrieval.

(b) Ensures Data Integrity

Standardized definitions, naming conventions, and default values across the organization reduce errors and inconsistencies in your data.

(c) Better Scalability & Flexibility

A good data model makes it easy to accommodate new data sources without disrupting present systems and maintains the current data format.

(d) Lower Storage Costs

A proper data model can avoid big, wide tables, which are costly in storage.

Types of Data Modeling

There are numerous varieties of data models, each with a variety of layout options. Each type of data model serves a distinct purpose and provides a different level of abstraction, meeting the various demands of businesses. Below, let's examine the three main categories of data models:

  • Conceptual data modeling (CDM)

A conceptual data model (CDM) is a high-level abstraction of the core entities in your business and how they relate to one another. The conceptual model gives stakeholders, especially business users, an understanding of what data is crucial and how it fits into the larger business process. It's intended to enable both technical and non-technical teams agree on what data is relevant, rather than delving into structure or storage details.

Example: Online Ride-Hailing Platform(Taxi & Bike Transportation App)

→ Question:

(a) What primary business entities must be tracked to complete a ride-hailing service?

(b) What information about the ride itself is essential for both service execution?

(c) Who needs to access the data, and for what purpose?

→ Key entities:

Here’s a simplified view of what a conceptual model might include in a Online Ride-Hailing Platform(Taxi & Bike Transportation App):

EntityDefinitionRelationship
RiderThe customer requesting and taking the tripRider initiates Trip
DriverThe service provider who executes the tripDriver executes Trip
TripThe core transaction: a journey from pickup to drop-offTrip generates Payment
VehicleThe specific asset used by the driver for the tripDriver uses Vehicle
PaymentThe financial record for the completed tripPayment is associated with Rider and Trip
  • Logical data modeling (LDM)

A logical data model (LDM) is a structured representation of data that outlines entities, attributes, and relationships without referencing any specific database or technology. It builds upon the conceptual model by adding more structure and detail, like primary keys and data types. The logical model explains the rules regulating the data, how the data entities will be divided into characteristics, and how those entities connect to another.

→ Question:

(a) What are the minimum required attributes for the rider and driver entities to uniquely identify and contact them, while also tracking their service status?

(b) What specific location and time attributes must be captured within the trip entity to accurately calculate both the fare and the route taken?

(c) What cardinality constraints must a trip have exactly one payment, and can a driver execute multiple trips over time?

→Tables:

  • Physical data modeling (PDM)

A physical data model (PDM) is a technical blueprint that maps your logical data model to a specific database system. It describes every aspect of data storage, indexing, and access, including restrictions, datatype, and column length. It converts the design into a practical schema that can be utilized to build databases on a particular database platform.

Final Model to Database

→ Database: RDBMS - PostgreSQL

→ Reasoning:

(a) Transactional Integrity (ACID)

The core service was booking a ride and processing a payment, requires strong ACID (Atomicity, Consistency, Isolation, Durability) properties to ensure no payment is lost and no booking is double-booked. PostgreSQL is a robust RDBMS that guarantees this.

(b) Advanced Geo-spatial Support

Ride-hailing is fundamentally location-based. PostgreSQL's PostGIS extension is the industry standard for efficiently storing, indexing, and querying geographical data (like driver location and pickup/drop-off points) which is crucial for dynamic driver-rider matching and routing.

(c) Schema Flexibility

While relational, PostgreSQL's JSONB data type allows for storing semi-structured data (like trip logs or driver device details) within a structured table, providing an excellent blend of relational power and NoSQL flexibility.

→ ERD:

Key Relationship:

  1. RIDER.Rider_id (PK) - (one) to TRIP.Rider_id (FK) - (many)

  2. DRIVER.Driver_id (PK) - (one) to TRIP.Driver_id (FK) - (many)

  3. DRIVER.Vehicle_id (PK) - (one) to VEHICLE.Driver_vehicle_id (FK) - (one)

  4. TRIP.Trip_id (PK) - (one) to PAYMENT.Trip_id (FK) - (one)

How to Avoid Mistakes in Data Modeling?

Not planning for data model evolution

Things are constantly evolving. Once you include new data assets into your system, they are difficult to remove. Data model evolution traces the shift from early hierarchical or network structures to the foundational relational model. You are unlikely to be finished with your data modeling.

Incorrect levels of data with different granularity

Granularity in analytics refers to the level of detail that may be observed. Stakeholders are interested in knowing how much of our service is used each day, hour, or minute. Getting the proper amount of granularity in a data model is critical since too much granularity might result in a lot of useless data. However, if you have too little granularity, you might not have enough information to identify significant trends or details.

Inconsistent or nonexistent naming patterns

You're better off using standard methods with data models rather than creating your own naming convention. For instance, the data model becomes quite challenging to comprehend if tables don't have a consistent logic in their names. Using unusual column names and not documenting it in a data dictionary will probably be a lot of problems later.

Choosing the wrong modeling approach

Numerous modeling approaches exist, including data vault, snowflake schema, and star schema. There is no perfect data modeling approach, everything is a trade-off. However, bear in mind that a lot of data warehouse modeling techniques were created when storage was costly.

Summary

Data modeling serves as a comprehensive blueprint for database design by defining the structures, relationships, and constraints necessary for effective data storage and retrieval. This process establishes a single source of truth that enhances system performance and data integrity while significantly lowering long-term storage costs. Professionals implement these designs through conceptual, logical, and physical stages to translate high-level business requirements into technical schemas for specific platforms. To ensure long-term success, engineers must plan for model evolution and maintain consistent naming conventions to manage data granularity and system scalability effectively. I hope you enjoyed reading this.