1. Introduction
A data warehouse refers to a centralized system used to store large amounts of data from different sources. Most of the time, data warehouses store structured data for analytical purposes.
Data warehouses help businesses make data-driven decisions by ensuring that data is readily available and easily accessible.
2. Components of a Data Warehouse
There are four main components of a data warehouse:
- Source – This is where the data originates, such as transactional databases, APIs, logs, or external data sources.
-
Staging – This is the area where data is processed before being loaded into the warehouse.
- Data is usually moved through either ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) pipelines, depending on the use case.
- Storage – This is where the processed data is stored, typically in a structured format optimized for analytical queries.
- Presentation – This is where the data reaches the end user, such as a data analyst using BI (Business Intelligence) tools to analyze and visualize the data.
3. Data Warehouse vs. Database
The main difference between a data warehouse and a database is the amount and nature of the data they store:
- A database is optimized for transactional processing (OLTP - Online Transaction Processing) and handles real-time operations, such as inserting, updating, and deleting records.
- A data warehouse is optimized for analytical processing (OLAP - Online Analytical Processing) and stores large volumes of historical data for reporting and decision-making.
When should you use a data warehouse?
You should use a data warehouse instead of a database when you need to store historical data and perform complex queries on large datasets that grow exponentially.
4. Data Warehouse Architecture
There are three main architecture models:
- Top-down approach (Inmon) – In this approach, the data warehouse is designed to meet business requirements first, ensuring a well-structured, integrated system.
- Bottom-up approach (Kimball) – This approach prioritizes quick reporting by building data marts first, which can later be integrated into a larger data warehouse. This is the most commonly used approach.
- Data Vault – A more flexible and scalable approach designed for handling changes in data structures over time.
Which model to use depends on your use case.
5. Data Modeling
Data modeling refers to the visual representation of how data is organized within a system. There are three main categories:
- Conceptual Data Modeling – A high-level overview that focuses on business concepts without technical details.
-
Logical Data Modeling – Adds structure, attributes, and relationships to the data.
- Entity-Relationship Diagram (ERD) is used for OLTP systems.
- Dimensional Data Model is used for OLAP systems.
- Physical Data Modeling – Specifies how data is actually stored in the database, defining table structures, indexes, and relationships.
6. Star Schema vs. Snowflake Schema
Both of these are types of Dimensional Data Models used in data warehouses.
-
Star Schema – A data model where a central fact table is directly connected to dimension tables, forming a star-like structure.
- Pros: Simplifies queries and improves performance.
- Cons: Can lead to data redundancy.
-
Snowflake Schema – A data model where dimension tables are normalized, breaking them into smaller related tables.
- Pros: Reduces data redundancy.
- Cons: Increases query complexity.
7. OLAP vs. OLTP
- OLAP (Online Analytical Processing) – Used for historical data analysis, enabling businesses to derive insights from large datasets.
- OLTP (Online Transaction Processing) – Used for real-time transactions, such as banking systems, e-commerce platforms, and booking systems.
Example:
- OLAP: Analyzing customer purchasing patterns over the past 5 years.
- OLTP: Processing an online purchase in a retail store.
8. Types of Data Warehouses
- On-Premise Data Warehouse – A company develops and maintains its own data warehouse infrastructure.
-
Cloud Data Warehouse – A company outsources its data warehouse to cloud providers like:
- Amazon Redshift (AWS)
- Google BigQuery
- Azure Synapse Analytics (Microsoft)
- Hybrid Data Warehouse – A combination of on-premise and cloud storage, leveraging the advantages of both.
9. Conclusion
In this article we briefly went over what a data warehouse and what it entails.