如何对AWS RDS SQL Server数据库进行分类

This article gives you an overview of cataloging AWS RDS SQL Server database objects like tables and views, using AWS Glue service.

本文概述了如何使用AWS Glue服务对AWS RDS SQL Server数据库对象（例如表和视图）进行分类。

介绍 (Introduction)

An enterprise IT landscape often consists of a large number of databases and even more database objects. Data is often exchanged between multiple sources and target databases. The standard practice is to maintain a central inventory of data repositories and database objects. This repository is also called the metadata catalog, which holds just the schema definition of database objects. ETL tools can refer to this centralized catalog while pointing to the source and target database objects. AWS RDS for SQL Server is one of the databases supported by AWS RDS service and enterprises host large production workloads on Amazon RDS SQL Server database instances. And with it comes the need to catalog the database. AWS offers AWS Glue service that supports crawling data repositories to create a metadata catalog. In this article, we will learn how to catalog Amazon RDS SQL Server database objects using AWS Glue.

企业IT环境通常由大量数据库甚至更多数据库对象组成。数据经常在多个源和目标数据库之间交换。标准做法是维护数据存储库和数据库对象的中央清单。该存储库也称为元数据目录，它仅包含数据库对象的架构定义。 ETL工具可以在指向源和目标数据库对象时引用此集中式目录。适用于SQL Server的AWS RDS是AWS RDS服务支持的数据库之一，企业在Amazon RDS SQL Server数据库实例上托管大量生产工作负载。随之而来的是对数据库进行分类的需求。 AWS提供了AWS Glue服务，该服务支持对数据存储库进行爬网以创建元数据目录。在本文中，我们将学习如何使用AWS Glue对Amazon RDS SQL Server数据库对象进行分类。

AWS RDS SQL Server实例 (AWS RDS SQL Server Instance)

It’s assumed that an operational instance of Amazon RDS SQL Server is already in place. If you are new to Amazon RDS for SQL Server, you can read this article, Getting started with AWS RDS SQL Server, to create a new instance. Once your instance is ready, it would look as shown below. For this exercise, any SQL Server instance using any edition of SQL Server will work. It is assumed that you have the required privileges to administer the AWS RDS SQL Server database instance.

假设已经存在Amazon RDS SQL Server的操作实例。如果您不熟悉 Amazon RDS for SQL Server，则可以阅读本文“ AWS RDS SQL Server入门”以创建新实例。实例准备就绪后，将如下所示。对于本练习，使用任何版本SQL Server的任何SQL Server实例都将起作用。假定您具有管理AWS RDS SQL Server数据库实例的必需特权。

使用AWS Glue爬行AWS RDS SQL Server (Crawling AWS RDS SQL Server with AWS Glue)

Next, you would need an active connection to the SQL Server instance. You can refer to my last article, How to connect AWS RDS SQL Server with AWS Glue, that explains how to configure Amazon RDS SQL Server to create a connection with AWS Glue. This step is a pre-requisite to proceed with the rest of the exercise. You can navigate to connections under the data catalog section in the left pane and create a new connection to the SQL Server instance. Once the connection is in place, click on the Crawlers in the left pane and you would see an interface as shown below.

接下来，您需要与SQL Server实例的活动连接。您可以参考我的上一篇文章“ 如何将AWS RDS SQL Server与AWS Glue连接” ，其中介绍了如何配置Amazon RDS SQL Server以创建与AWS Glue的连接。此步骤是进行其余练习的前提。您可以导航到左窗格中“数据目录”部分下的连接，并创建与SQL Server实例的新连接。连接到位后，单击左侧窗格中的“ 爬网程序 ”，您将看到如下所示的界面。

Click on the Add crawler button to start creating a new crawler. The first step of defining a crawler is shown below. Provide a relevant name for the crawler. You can add tags to the crawler add more information about the crawler like purpose, owner, environment etc. Security configuration allows us to configure encryption. In our case it’s not required, so we can use the default values for the rest of the details. Once the values are filled up, click Next.

单击添加搜寻器按钮以开始创建新的搜寻器。定义搜寻器的第一步如下所示。提供搜寻器的相关名称。您可以在搜寻器中添加标签，以添加有关搜寻器的更多信息，例如目的，所有者，环境等。安全配置允许我们配置加密。在我们的情况下，这不是必需的，因此我们可以将默认值用于其余详细信息。填满值后，单击“ 下一步” 。

In this step, we need to specify the source data repository for the crawlers. We can select either data repositories or already cataloged objects. In our case, as we are going to point the crawler to AWS RDS SQL Server instance, we will choose the Data stores option. After selecting this option, click Next.

在此步骤中，我们需要为搜寻器指定源数据存储库。我们可以选择数据存储库或已经分类的对象。在本例中，由于我们要将爬网程序指向AWS RDS SQL Server实例，因此我们将选择“ 数据存储”选项。选择此选项后，单击“ 下一步” 。

AWS Glue supports three types of data stores or repositories based on the mode of access. Data files hosted on S3 files, databases that support JDBC connectivity, and AWS DynamoDB database are the three types of data repositories that are supported. We would be using JDBC connectivity to connect to the AWS RDS SQL Server database instance, so select the JDBC option.

AWS Glue根据访问模式支持三种类型的数据存储或存储库。 S3文件上托管的数据文件，支持JDBC连接的数据库以及AWS DynamoDB数据库是受支持的三种类型的数据存储库。我们将使用JDBC连接来连接到AWS RDS SQL Server数据库实例，因此请选择JDBC选项。

As mentioned earlier, we need to have a connection to our AWS RDS SQL Server instance to select from the connection list. If you do not have any instance available, click on Add connection and create a new connection to the SQL Server instance. The “Include path” section expects the path to database objects. It supports wild-card patterns as well as exact object paths. The format that is generally followed is mydatabase/myschema/myobjects. It’s interesting to understand the effect of providing an incorrect path. So, for now, let’s just mention the master database, and our connection also points to the master database. Alternatively, if you intend to catalog many objects, and ignore only a few objects like system objects, then you can use exclude patterns and just mention objects that should not be crawled and cataloged. After updating the details, click Next.

如前所述，我们需要与AWS RDS SQL Server实例建立连接以从连接列表中进行选择。如果没有可用的实例，请单击“ 添加连接”，然后创建与SQL Server实例的新连接。 “包含路径”部分需要数据库对象的路径。它支持通配符模式以及精确的对象路径。通常遵循的格式是mydatabase / myschema / myobjects。了解提供错误路径的效果很有趣。因此，就目前而言，我们仅提及master数据库，而我们的连接也指向master数据库。另外，如果您打算对许多对象进行分类，而只忽略一些对象（如系统对象），则可以使用排除模式，仅提及不应进行爬网和分类的对象。更新详细信息后，单击下一步 。

In this step, if we intend to add data from multiple stores using the same crawler, you can select Yes to add another data store details. Else we can select the default value No. In this case, we do not want to add any other store, so we will use the default value and click Next.

在此步骤中，如果我们打算使用同一搜寻器添加来自多个存储的数据，则可以选择“是”以添加另一个数据存储详细信息。否则，我们可以选择默认值No。在这种情况下，我们不想添加任何其他存储，因此我们将使用默认值并单击Next 。

In this step, we need to select the IAM Role that would have access to the AWS RDS SQL Server instance. You can use the built-in AWS Glue Service Role that can access AWS RDS or you can create a custom role that has the required privileges. Once the role is selected, click Next.

在此步骤中，我们需要选择有权访问AWS RDS SQL Server实例的IAM角色 。您可以使用可访问AWS RDS的内置AWS Glue服务角色，也可以创建具有所需特权的自定义角色。选择角色后，单击“ 下一步” 。

In this step, we need to specify how frequently we intend to execute the crawler. In our case, we can select Run on-demand, so that the crawler will get executed only when we explicitly execute it. If you intend to schedule it, you can select other options as required.

在此步骤中，我们需要指定打算执行搜寻器的频率。在本例中，我们可以选择“按需运行”，以便仅在明确执行搜寻器时才执行搜寻器。如果要计划它，则可以根据需要选择其他选项。

In this step, we need to specify that when the database objects are crawled, where the crawler should create the metadata definition of the crawled objects. The database mentioned here is the AWS Glue database which will hold the metadata definition of the crawled objects. You can use the default database or create a new one by clicking on the Add database. In our case, we will use the default value. The rest of the options can be configured to adding prefixes to cataloged objects, configuring grouping options when AWS S3 based data, and cataloging behavior when changed to already cataloged objects are detected.

在此步骤中，我们需要指定在对数据库对象进行爬网时，爬网程序应在其中创建爬网对象的元数据定义。此处提到的数据库是AWS Glue数据库，它将保存已爬网对象的元数据定义。您可以使用默认数据库，也可以通过单击添加数据库来创建一个新数据库 。在本例中，我们将使用默认值。其余选项可以配置为在编目对象中添加前缀，在基于AWS S3的数据时配置分组选项，并在检测到更改为已编目对象时配置编目行为。

The default behavior of the crawler is to update the object definition if a change of definition is detected in the source for already cataloged objects and marked the table as deprecated if an already cataloged object is not found in the source. In our case, we can use the default values and click on Next. This will take us on the Review page, where we can review the details and click on Finish to create the crawler.

爬网程序的默认行为是，如果在源中已为已分类的对象检测到定义更改，则更新对象定义，如果在源中未找到已分类的对象，则将该表标记为已弃用。在我们的例子中，我们可以使用默认值，然后单击Next 。这将带我们进入“审阅”页面，我们可以在其中审阅详细信息，然后单击“完成”以创建搜寻器。

使用AWS Glue执行搜寻器 (Executing crawlers with AWS Glue)

Once the crawler is created, it would appear in the list as shown below. It will ask whether we want to execute the crawler. You can execute the crawler by clicking on Run it now link or selecting the crawler and clicking on the Run crawler button.

创建搜寻器后，它将显示在列表中，如下所示。它将询问我们是否要执行搜寻器。您可以通过单击立即运行链接或选择搜寻器并单击运行搜寻器按钮来执行搜寻器。

Once the crawler is executed, it will initiate a connection to AWS RDS SQL Server instance, read the definition of the objects that we mentioned in the include path, and create metadata objects in the catalog. Once the crawler completes execution you would find the details whether any new tables were created or updated. In our case, you would find that no tables got updated or created, though the crawler completed successful execution. The reason is the incorrect format of the path we mentioned in the include path setting.

执行搜寻器后，它将启动与AWS RDS SQL Server实例的连接，读取我们在包含路径中提到的对象的定义，并在目录中创建元数据对象。搜寻器完成执行后，您将找到创建或更新任何新表的详细信息。在本例中，尽管搜寻器已成功完成执行，但您将发现没有表被更新或创建。原因是我们在包含路径设置中提到的路径格式不正确。

For the purpose of the demonstration, we can create a new database and a table in the AWS RDS SQL Server instance as shown below. Alternatively, if you already have a database and a table, you can use the same too. Here we have the database name as “rahul” and a table named “test” in this database. And let’s say that we intend to crawl and catalog the table test.

为了进行演示，我们可以在AWS RDS SQL Server实例中创建一个新数据库和一个表，如下所示。或者，如果您已经有一个数据库和一个表，则也可以使用它们。这里，我们的数据库名称为“ rahul”，该数据库中的表名为“ test”。假设我们打算对表测试进行爬网和分类。

Select the crawler and click on the edit crawler to edit the crawler. Move to the step where we can configure the include path setting as shown below. Mention rahul/dbo/test as the include path, where rahul is the database name, dbo is the schema, and the test is the table. Ensure that the connection being used by the crawler also points to the same database in the AWS RDS SQL Server instance.

选择搜寻器，然后单击编辑搜寻器以编辑该搜寻器。移至我们可以配置包含路径设置的步骤，如下所示。提及rahul / dbo / test作为包含路径，其中rahul是数据库名称，dbo是架构，而test是表。确保搜寻器正在使用的连接也指向AWS RDS SQL Server实例中的同一数据库。

Save the crawler configuration and execute the crawler again. This time you should be able to see the value of 1 in the Tables added field, which means a new table was created in the metadata catalog.

保存搜寻器配置，然后再次执行搜寻器。这次您应该能够在“添加的表”字段中看到值1，这意味着在元数据目录中创建了一个新表。

AWS Glue元数据目录表 (AWS Glue Metadata Catalog Tables)

Navigate to the Tables option under databases on the left-hand pane, there you would find the table listed with the name rahul_dbo_test. Open the table and you would find the details as shown below. It would mention the name of the crawler that created the crawler, classification of the object as sqlserver, and deprecated status as No. At the bottom, you would find the field definitions as well.

导航到左侧窗格上“数据库”下的“ 表”选项，您将在其中找到名为rahul_dbo_test的表。打开表，您将找到详细信息，如下所示。它会提到创建爬网程序的爬网程序的名称，对象分类为sqlserver以及不赞成使用的状态为No。在底部，您还将找到字段定义。

Once this table is available in the metadata catalog, you can start using it as the source or target of AWS Glue ETL jobs as shown below.

一旦该表在元数据目录中可用，您就可以开始将其用作AWS Glue ETL作业的源或目标，如下所示。

In this way, we can crawl an Amazon RDS SQL Server database instance and catalog objects in the AWS Glue metadata catalog.

通过这种方式，我们可以在AWS Glue元数据目录中对Amazon RDS SQL Server数据库实例和目录对象进行爬网。

结论 (Conclusion)

In this article, we learned how to use AWS Glue crawlers to catalog database objects from AWS RDS SQL Server instances. We learned how to configure the crawler behavior as well as expressions to specify what objects should be crawled and cataloged. Finally, after the objects were cataloged, we explored the metadata definition of the cataloged object from the AWS Glue metadata catalog.

在本文中，我们学习了如何使用AWS Glue搜寻器对来自AWS RDS SQL Server实例的数据库对象进行分类。我们学习了如何配置搜寻器行为以及用于指定应搜寻和分类哪些对象的表达式。最后，在对对象进行分类之后，我们从AWS Glue元数据目录中探索了已分类对象的元数据定义。

翻译自: https://www.sqlshack.com/how-to-catalog-aws-rds-sql-server-database/