A Data Warehouse Implementation on AWS

In past posts, I’ve been talking about Data Warehouses, their basic architecture, and some basic principles that can help you to build one. Today, I want to show you an implementation of a Data Warehouse on AWS based on a case study performed a couple of months ago.

This implementation uses AWS S3 as the Data Lake (DL), AWS Glue as the Data Catalog, and AWS Redshift with Redshift Spectrum as the Data Warehouse (DW).

Note: This post can be confusing if you are not familiar with some of the terminology and concepts I’m using here. For further information about these terms and concepts, I recommend taking a look at other posts where these topics are addressed.

Architecture

The architecture followed in this implementation is based on ELT processes. First, the data is extracted from the sources; then, it is loaded into the Data Lake; and finally, it is transformed in the Data Warehouse.

An abstraction of a Data Warehouse Architecture — Illustration made by the author

The implementation addressed in this post is based on a case study performed a couple of months ago — for more information, check this post. The architecture looks like this:

Data Warehouse architecture in AWS — Illustration made by the author

It uses AWS S3 as the DL, AWS Glue as the Data Catalog, and AWS Redshift with Redshift Spectrum as the DW.

Also, it uses Apache Spark for data extraction, Airflow as the orchestrator, and Metabase as a BI tool. But, particularly for this post, the scope is limited to the implementation of the DL and the DW.

Data Lake

The first part of this case study is the Data Lake.

A Data Lake is a repository where data from multiple sources is stored. It allows for working with structured and unstructured data.

In this case study, the Data Lake is used as a staging area that centralizes all the different data sources.

The data coming from these sources is stored in its original format. There are no transformation processes involved before loading the data into the Data Lake. So, it can be considered an immutable staging area.

Data Lake Architecture

Data Lake architecture may vary according to your needs. In this case study, a simple architecture is used. It comprises two zones: the raw zone and the sandbox zone.

If you are working with data that has a complex format, e.g., some complex-nested JSON that a creative developer decided to write, you might need to process the data before loading it into the Data Warehouse. So, you might need to implement another Data Lake zone. This is what Ben Sharma, in his book Architecting Data Lakes, calls a refined zone.

But, for now, I’ll keep it simple.

  • The raw zone is where the data is stored in its original format. From there, it is loaded into the Data Warehouse by running transformation processes with DBT

  • The sandbox zone is where Data Analysts, Data Engineers, or Data Scientists can do some crazy experiments

In the raw zone, the data is partitioned according to the source it comes from and the day it is loaded, e.g., files from a source named “source” loaded on September 15th, 2020 are stored in /source/2020/09/15.

It is important to mention that the way you structure your partitions in the Data Lake should fit your particular needs. See this post and this post for information about S3 partitioning.

In the next image, you can find a graphical representation of the proposed architecture for the Data Lake.

Data Lake Architecture — Illustration made by the author

And this is what its implementation looks like in AWS S3.

Data Lake — Level 0 — Illustration made by the author
Data Lake — Level 1 — Illustration made by the author
Data Lake — Level 2 — Illustration made by the author
Data Lake — Level 3 — Illustration made by the author

Finally, on the last level, the data is stored using Parquet files. I used an open-source dataset to exemplify this case study.

Data Lake — Level 4 — Illustration made by the author

Data Catalog

Now that we have the data stored in the Data Lake, we need to be able to query it using SQL, so that we can perform data transformations using DBT.

We are going to use AWS Glue Data Catalog databases and crawlers to allow us to run SQL queries on top of the DL.

The first step is creating a database in AWS Glue.

Database in AWS Glue Data Catalog — Illustration made by the author
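The screenshot above shows the console workflow. If you prefer to script this step, the same catalog database can also be created with Athena DDL, since Athena writes its DDL to the Glue Data Catalog. A minimal sketch, where "data_lake_raw" is a hypothetical database name for this case study:

```sql
-- Run from the Athena query editor; the database lands in the Glue Data Catalog.
-- "data_lake_raw" is a placeholder name.
CREATE DATABASE IF NOT EXISTS data_lake_raw;
```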

Then, the database can be populated using AWS Glue Crawlers. See this post for more information.
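A crawler infers the table schema and partitions from the files themselves. For reference, what it registers is roughly equivalent to declaring the table by hand with Athena DDL. In this hypothetical sketch, the bucket, table name, and columns are assumptions, not the actual dataset:

```sql
-- Manual alternative to a crawler: register a Parquet table over the raw zone.
-- Bucket, table name, and columns are placeholders.
CREATE EXTERNAL TABLE data_lake_raw.source (
    id    string,
    value double
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/source/2020/09/15/';
```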

The final result after creating the crawler and running it may look like this.

A table in AWS Glue Catalog — Part I — Illustration made by the author
A table in AWS Glue Catalog — Part II — Illustration made by the author

Now, we are good to go with the DW. With the tables mapped in the Data Catalog, we can access them from the DW using AWS Redshift Spectrum. So, we can finally materialize the data in the DW.

Data Warehouse

As mentioned earlier, the DW is built using AWS Redshift, Redshift Spectrum, and DBT.

AWS Redshift is a data warehousing service provided by AWS. Redshift Spectrum is a service that can be used inside a Redshift cluster to query data directly from files on Amazon S3. And DBT is a tool that allows you to perform transformations inside a data warehouse using SQL.

One of the key components of the DW is Redshift Spectrum, since it allows you to connect the Glue Data Catalog with Redshift. So, you can query the DL data inside the Redshift cluster using DBT. This is important because DBT does not move data. It just transforms the data in the data warehouse.

The way you connect Redshift Spectrum with the data previously mapped in the AWS Glue Catalog is by creating external tables in an external schema. See this for more information about it.
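A minimal sketch of that statement, assuming the Glue database created earlier; the schema name and the IAM role ARN are placeholders:

```sql
-- Expose the Glue Data Catalog database to Redshift as an external schema.
-- "spectrum", "data_lake_raw", and the role ARN are hypothetical values.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'data_lake_raw'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role';
```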

After doing so, the external schema and its tables should be visible inside the cluster.
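One way to check is through Redshift's system views for Spectrum:

```sql
-- List the external schemas and external tables mapped into the cluster.
SELECT schemaname, databasename FROM svv_external_schemas;
SELECT schemaname, tablename, location FROM svv_external_tables;
```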

And, if you are using the same dataset I’m using for this case study, your metadata should look like this:

Source table visualized in DBeaver — Illustration made by the author
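With the mapping in place, you can already query the Data Lake data directly from the cluster. The schema and table names below are the hypothetical ones used in the sketches above:

```sql
-- The query is processed by Redshift Spectrum against the Parquet files on S3.
SELECT *
FROM spectrum.source
LIMIT 10;
```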

Data Materialization

After mapping the data with Redshift Spectrum, we are good to proceed with the data materialization in AWS Redshift.

This materialization allows you to put the data in Redshift tables. One of the key differences is that when you query the data in Redshift tables, you are not charged per query because you are already paying for the Redshift cluster. Whereas if you query the data in the external tables, i.e., via Redshift Spectrum, you are charged for the data processed by the query. See this for more information.

Another key difference is that AWS Redshift outperforms Redshift Spectrum when processing huge amounts of data. See this for more information.
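Under the hood, this materialization boils down to creating a native table from an external one; the following hypothetical sketch is roughly what a DBT table materialization compiles to:

```sql
-- Materialize the external (Spectrum) table into a native Redshift table.
-- Assumes an "analytics" schema exists; all names are placeholders.
CREATE TABLE analytics.source AS
SELECT *
FROM spectrum.source;
```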

So, if we want to apply some transformations to the table presented before, we could write a DBT model and materialize the data in Redshift.

This is a simple DBT model for this case study:

DBT model — Illustration made by the author
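Since the model itself is shown as a screenshot, here is a hypothetical sketch of what such a model can look like: a SQL file with a Jinja config block that selects from the external table. All names are placeholders:

```sql
-- models/source_materialized.sql: a hypothetical DBT model.
-- Materialized as a table, so the result lands in native Redshift storage.
{{ config(materialized='table') }}

select *
from spectrum.source
```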

And this is what the model looks like when materialized in a Redshift table.

Redshift Tables — Illustration made by the author

Conclusions

In this post, we implemented a simple architecture for a Data Lake and a Data Warehouse in AWS.

Also, we addressed some of the key steps necessary to do so. Some takeaways are:

  • The architecture follows an ELT approach. So, no transformation processes are involved before loading the data into the Data Lake
  • The Data Lake is used as an immutable staging area for loading the data directly into the Data Warehouse
  • The Data Lake comprises two zones: the raw zone and the sandbox zone. If you need to process the data before loading it into the Data Warehouse, you should put in place another zone for the Data Lake: the refined zone
  • Partitions in the Data Lake should be defined according to your use cases
  • AWS Redshift Spectrum is a service that can be used inside a Redshift cluster to query data directly from files on Amazon S3. It is an extra service to AWS Redshift
  • AWS Redshift Spectrum allows you to connect the Glue Data Catalog with Redshift
  • Transformation logic is implemented using DBT models
  • DBT does not move data. It just transforms the data in the data warehouse

I hope you find this information useful.

Thanks for reading until the end.

See you in the next post!

Source: https://towardsdatascience.com/a-data-warehouse-implementation-on-aws-a96d0e251abd