

In past posts, I’ve talked about Data Warehouses, their basic architecture, and some basic principles that can help you build one. Today, I want to show you an implementation of a Data Warehouse on AWS based on a case study performed a couple of months ago.

This implementation uses AWS S3 as the Data Lake (DL), AWS Glue as the Data Catalog, and AWS Redshift and Redshift Spectrum as the Data Warehouse (DW).

Note: This post can be confusing if you are not familiar with some of the terminology and concepts I’m using here. For further information about these terms and concepts, I recommend taking a look at other posts where these topics are addressed.

Architecture

The architecture followed in this implementation is based on ELT processes. First, the data is extracted from the sources, then it is loaded into the Data Lake, and finally it is transformed in the Data Warehouse.

An abstraction of a Data Warehouse Architecture — Illustration made by the author
The implementation addressed in this post is based on a case study performed a couple of months ago; for more information, check this post. The architecture looks like this:

Data Warehouse architecture in AWS — Illustration made by the author

It uses AWS S3 as the DL, AWS Glue as the Data Catalog, and AWS Redshift and Redshift Spectrum as the DW.

It also uses Apache Spark for data extraction, Airflow as the orchestrator, and Metabase as a BI tool. For this post, however, the scope is limited to the implementation of the DL and DW.

Data Lake

The first part of this case study is the Data Lake.


A Data Lake is a repository where data from multiple sources is stored. It allows for working with structured and unstructured data.

In this case study, the Data Lake is used as a staging area that centralizes all the different data sources.

The data coming from these sources is stored in its original format. No transformation processes are involved before loading the data into the Data Lake, so it can be considered an immutable staging area.

Data Lake Architecture

Data Lake architecture may vary according to your needs. In this case study, a simple architecture is used. It comprises two zones: the raw zone and the sandbox zone.

If you are working with data that has a complex format, e.g., some deeply nested JSON that a creative developer decided to write, you might need to process the data before loading it into the Data Warehouse. In that case, you might need to implement another Data Lake zone. This is what Ben Sharma calls a refined zone in his book Architecting Data Lakes.

But, for now, I’ll keep it simple.


  • The raw zone is where the data is stored in its original format. Then, it is loaded into the Data Warehouse by running some transformation processes with DBT

  • The sandbox zone is where Data Analysts, Data Engineers, or Data Scientists can do some crazy experiments


In the raw zone, the data is partitioned by the source it comes from and by the day it is loaded, e.g., files from a source named “source” loaded on September 15th, 2020 are stored in /source/2020/09/15.
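The partitioning scheme above can be sketched in a few lines of Python. This is only an illustration of the /source/yyyy/mm/dd layout described here; the function name `raw_zone_prefix` is hypothetical and not part of the case study.

```python
from datetime import date


def raw_zone_prefix(source: str, load_date: date) -> str:
    """Build the raw-zone prefix for a source and the day it was loaded.

    Follows the /<source>/<yyyy>/<mm>/<dd> layout described in the post.
    """
    return f"/{source}/{load_date:%Y/%m/%d}"


# Files from a source named "source" loaded on September 15th, 2020
print(raw_zone_prefix("source", date(2020, 9, 15)))  # /source/2020/09/15
```

Zero-padding the month and day keeps the prefixes sorted chronologically when they are listed as plain strings.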

It is important to mention that the way you structure your partitions in the Data Lake should follow your particular needs. See this post and this post for information about S3 partitioning.

The next image shows a graphical representation of the proposed architecture for the Data Lake.

Data Lake Architecture — Illustration made by the author

And this is what its implementation looks like in AWS S3.

Data Lake — Level 0 — Illustration made by the author
Data Lake — Level 1 — Illustration made by the author
Data Lake — Level 2 — Illustration made by the author
Data Lake — Level 3 — Illustration made by the author

Finally, on the last level, the data is stored as Parquet files. I used an open-source dataset to illustrate this case study.

Data Lake — Level 4 — Illustration made by the author

Data Catalog

Now that the data is stored in the Data Lake, we need to be able to query it using SQL so that we can perform some data transformations using DBT.

We are going to use AWS Glue Data Catalog databases and crawlers to run SQL queries on top of the DL.

The first step is creating a database in AWS Glue.

Database in AWS Glue Data Catalog — Illustration made by the author
Then, the database can be populated using AWS Glue Crawlers. See this post for more information.
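As a rough sketch, the same two steps (creating the database, then creating and running a crawler over the raw zone) could also be scripted with boto3 instead of the console. All names below (database, crawler, role ARN, bucket path) are hypothetical placeholders, and the sketch assumes boto3 is installed and AWS credentials are configured; the case study itself may have used the console.

```python
def glue_database_request(database_name: str) -> dict:
    """Keyword arguments for the Glue create_database call."""
    return {"DatabaseInput": {"Name": database_name}}


def glue_crawler_request(
    crawler_name: str, role_arn: str, database_name: str, s3_path: str
) -> dict:
    """Keyword arguments for the Glue create_crawler call.

    The crawler scans s3_path and writes the tables it infers
    into the given Glue database.
    """
    return {
        "Name": crawler_name,
        "Role": role_arn,
        "DatabaseName": database_name,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_run_crawler(
    crawler_name: str, role_arn: str, database_name: str, s3_path: str
) -> None:
    """Create the database and crawler, then start the crawler."""
    import boto3  # imported here so the request builders work without it

    glue = boto3.client("glue")
    glue.create_database(**glue_database_request(database_name))
    glue.create_crawler(
        **glue_crawler_request(crawler_name, role_arn, database_name, s3_path)
    )
    glue.start_crawler(Name=crawler_name)


# Hypothetical usage (requires AWS credentials and an existing IAM role):
# create_and_run_crawler(
#     "raw-zone-crawler",
#     "arn:aws:iam::123456789012:role/hypothetical-glue-role",
#     "data_lake",
#     "s3://hypothetical-bucket/source/",
# )
```

Keeping the request builders separate from the AWS calls makes it possible to inspect the configuration without touching an AWS account.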

The final result after creating the crawler and running it may look like this.


A table in AWS Glue Catalog — Part I — Illustration made by the author
A table in AWS Glue Catalog — Part II — Illustration made by the author
Now we are good to go with the DW. With the tables mapped in the Data Catalog, we can access them from the DW using AWS Redshift Spectrum and finally materialize the data in the DW.

Data Warehouse

As mentioned earlier, the DW is built using AWS Redshift, Redshift Spectrum, and DBT.

AWS Redshift is a data warehousing service provided by AWS. Redshift Spectrum is a service that can be used inside a Redshift cluster to query data directly from files on Amazon S3. And DBT is a tool that lets you perform transformations inside a data warehouse using SQL.

One of the key components of the DW is Redshift Spectrum since it allows you to connect the Glue Data Catalog with Redshift. So, you can query the DL data inside the Redshift cluster using DBT. This is important because DBT does not move data. It just transforms the data in the data warehouse.

The way you connect Redshift Spectrum with the data previously mapped in the AWS Glue Catalog is by creating external tables in an external schema. See this for more information about it.
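For illustration, the statement that maps a Glue database to an external schema can be rendered with a small Python helper. The schema name, Glue database name, and IAM role ARN used below are hypothetical placeholders; the statement itself follows Redshift's `CREATE EXTERNAL SCHEMA ... FROM DATA CATALOG` syntax and would be run in the Redshift cluster with any SQL client.

```python
def external_schema_sql(
    schema_name: str, glue_database: str, iam_role_arn: str
) -> str:
    """Render the statement that exposes a Glue database as an external schema."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema_name}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{glue_database}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )


# Hypothetical names, for illustration only
print(
    external_schema_sql(
        "spectrum",
        "data_lake",
        "arn:aws:iam::123456789012:role/hypothetical-spectrum-role",
    )
)
```

Once the schema exists, tables already registered in that Glue database by the crawler should be queryable through it, e.g., as `spectrum.<table_name>`.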

After doing so, the external schema should look like this:



And, if you are using the same dataset I’m using for this case study, your metadata should look like this:


Source table visualized in DBeaver — Illustration made by the author

After mapping the data with Redshift Spectrum, we are good to proceed with the data materialization in AWS Redshift.

This materialization puts the data in Redshift tables. One of the key differences is that when you query the data in Redshift tables, you are not charged per query because you are already paying for the Redshift cluster, whereas if you query the data in the external tables, i.e., through Redshift Spectrum, you are charged for the data processed by the query. See this for more information.
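To make the pricing difference concrete, here is a back-of-the-envelope estimate of what a Spectrum query might cost. The default rate of 5 USD per terabyte scanned is an assumption for illustration, not a figure from this case study; check the current AWS pricing for your region before relying on it.

```python
BYTES_PER_TB = 1024 ** 4


def spectrum_query_cost(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Estimate the cost of a Spectrum query from the bytes it scans.

    usd_per_tb is an assumed rate; pass the current price for your region.
    """
    return bytes_scanned / BYTES_PER_TB * usd_per_tb


# A query scanning 200 GB of external data, at the assumed rate
estimate = spectrum_query_cost(200 * 1024 ** 3)
print(f"Estimated cost: {estimate:.4f} USD")
```

Because the charge depends on the data scanned, a columnar format such as Parquet and a sensible partitioning scheme in the Data Lake can both reduce what each query on the external tables costs.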

Another key difference is that AWS Redshift outperforms Redshift Spectrum when processing a huge amount of data. See this for more information.

So, if we want to make some transformations to the table presented before, we could write a DBT model and materialize the data in Redshift.
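As a rough idea of what such a model contains, the sketch below renders the text of a minimal DBT model from Python. The source name, table name, and column names are hypothetical, since the actual dataset columns are not listed here, and the sketch assumes the external schema is declared as a DBT source in the project.

```python
from string import Template

# string.Template is used so the Jinja braces of DBT are left untouched.
MODEL_TEMPLATE = Template(
    "{{ config(materialized='table') }}\n"
    "\n"
    "select\n"
    "    $columns\n"
    "from {{ source('$source_name', '$table_name') }}\n"
)


def render_model(source_name: str, table_name: str, columns: list) -> str:
    """Render a DBT model that materializes an external table in Redshift."""
    return MODEL_TEMPLATE.substitute(
        source_name=source_name,
        table_name=table_name,
        columns=",\n    ".join(columns),
    )


# Hypothetical source, table, and columns, for illustration only
print(render_model("data_lake", "source_table", ["id", "created_at"]))
```

The `materialized='table'` setting is what tells DBT to build a physical Redshift table from the query, rather than a view over the Spectrum table.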


This is a simple DBT model for this case study:


DBT model — Illustration made by the author

And this is what the model looks like once it is materialized in a Redshift table.


Redshift Tables — Illustration made by the author


In this post, we implemented a simple architecture for a Data Lake and a Data Warehouse in AWS.

Also, we addressed some of the key steps necessary to do so. Some takeaways are:

  • The architecture follows an ELT approach. So, no transformation processes are involved before loading the data into the Data Lake

  • The Data Lake is used as an immutable staging area for loading the data directly into the Data Warehouse

  • The Data Lake comprises two zones: the raw zone and the sandbox zone. If you need to process the data before loading it into the Data Warehouse, you should put in place another zone for the Data Lake: the refined zone

  • Partitions in the Data Lake should be defined according to your use cases

  • AWS Redshift Spectrum is a service that can be used inside a Redshift cluster to query data directly from files on Amazon S3. It is an extra service to AWS Redshift

  • AWS Redshift Spectrum allows you to connect the Glue Data Catalog with Redshift

  • Transformation logic is implemented using DBT models

  • DBT does not move data. It just transforms the data in the data warehouse

I hope you find this information useful.


Thanks for reading until the end.


See you in the next post!


