How to clean data using Data Quality Services and SQL Server Integration Services

Introduction

A year or so ago, I worked for an online web grocery software house located in the northern United States. At that time I had my ‘baptismal’ exposure to ‘genuinely dirty data’. Granted, most of the data entry was done manually, and many times from offshore. The point is that I could not fathom just how many ways there were to spell the brand name of a major cereal manufacturer. Why is this such an issue? The answer is fairly straightforward. Imagine that you are trying to ascertain the dollar value of breakfast cereals sold in the country, from the local supermarket level all the way up to national sales. Imagine doing this with a SQL Server multidimensional cube. The eagle-eyed reader will recognize that the results will not aggregate correctly should our aggregation attributes be spelt in a plethora of different ways.
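
To make the aggregation problem concrete, here is a minimal Python sketch. The sales rows, spellings and amounts are invented for illustration and do not come from the original data set; the point is only that a group-by on the raw name splits one manufacturer into several groups.

```python
from collections import defaultdict

# Hypothetical sales rows: (manufacturer as typed at data entry, sale amount in dollars).
sales = [
    ("Kellogg's", 120.0),
    ("Kelloggs", 80.0),
    ("Kellog", 45.5),
    ("General Mills", 200.0),
]


def total_by_manufacturer(rows):
    """Sum sale amounts per manufacturer name, exactly as the name is spelt."""
    totals = defaultdict(float)
    for manufacturer, amount in rows:
        totals[manufacturer] += amount
    return dict(totals)


totals = total_by_manufacturer(sales)

# One real-world manufacturer ends up as several separate groups,
# so no single group carries its true total.
kellogg_groups = [name for name in totals if name.lower().startswith("kellog")]
print(len(kellogg_groups), totals)
```

A cube built on such an attribute would show the same fragmentation at every level of the hierarchy.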

Yes, we (like other firms) manually fixed the data anomalies; however, these anomalies tended to rear their ugly heads with each new data load.

Enter Data Quality Services and SQL Server Integration Services, and THIS is what we are going to discuss.

A final note: in preparing this article I felt it necessary to give the reader unfamiliar with the Data Quality Services product a high-level understanding of the processes involved in creating a workable ‘model’. Should you be familiar with Data Quality Services, feel free to skip ahead to the SQL Server Integration Services section below.

Getting Started

For the sake of simplicity, in our little example we shall be working only with the manufacturers of certain products. Naturally, data errors crop up in a plethora of data fields; however, I believe in the adage of keeping things as simple as possible in a paper of this sort.

Our first task will be to get the current data cleaned up AND THEN make it possible for Data Quality Services to ‘use its magic’ to clean up new data (on its own) going forward. This requires the construction of a ‘Knowledge Base’ and a ‘Data Quality Services Project’.

We shall then create a SQL Server Integration Services package which will be run daily to place correct data into our database and send bad data to our Business Analysts and Data Stewards to be fixed/corrected for the next day’s run.

Installing Data Quality Services

Data Quality Services is available in the Business Intelligence and Enterprise editions of SQL Server 2012 and SQL Server 2014. Should you wish to experiment with the product, it is also available in the Developer Edition. The important point is to let the SQL Server installation process know that you wish to install Data Quality Services (DQS) when you install your SQL Server instance.

You thought that you were finished, right? Think again! We must now physically install the server portion on our instance. Simply select Programs, SQL Server 2012, Data Quality Services, and Data Quality Server Installer. The process executes in a command window and once complete you are ready to go. (See the screen dump below).

We now call upon the Data Quality Client and begin our journey.

After having selected the ‘Data Quality Client’, the work screen that we shall be using for this portion of the paper will appear (see below).

Constructing our first Knowledge Base (the first of three steps to create our knowledge base)

In order for Data Quality Services to understand a bit about our data AND to use that knowledge about our data on future loads, we must build/construct a ‘Knowledge Base’. Please note that once complete, a Knowledge Base is similar to a .NET object and therefore can be ‘inherited’ in any subsequent new knowledge base.

N.B. The results of the activities that we shall be performing below will be ‘stored’ in a special SQL Server Database called DQS_MAIN, which is created by the Data Quality Services server installation.

Let us get going!

I first click the ‘New Knowledge Base’ Option from the left hand menu. (See above)

The following screen appears:

I have taken the liberty of naming the Knowledge Base and simply click on ‘Next’.

We first wish to create a Domain. This domain will contain all of our Manufacturer related data.

I now click OK to accept and the following screen is brought into view.

‘Training’ our knowledge base OR let the Fun Begin!!

The astute reader will note that there are 5 main tabs in the screen dump above. The ‘Domain Properties’ tab is shown.

We shall not be discussing the ‘Reference Data’ tab; however, it is used to link to the cloud to obtain reference data such as telephone numbers, street addresses etc. from third-party vendors.

As with any process, we must ensure that we have a clean set of manufacturer data as our ‘Master’ manufacturer list. Often this takes a few days to construct.

Although we shall be looking at manufacturers, in reality one would really want to include ‘products’, financial data etc.

Loading our master Manufacturer data

To load our master list we select the ‘Domain Values’ tab and select ‘Import Values’ (See below).

The plot now thickens!!! The name of the product is ‘SQL Server’; however, guess what!!! The data for our master list must be in a spreadsheet, as YOU CANNOT load our master data from a SQL Server table. Go figure!! As a BTW, this point has been raised with Microsoft.

I point to my master list and import the values. (See the screen dumps below)

I then indicate that the first row of the spreadsheet does contain the header.

...and click OK.

Here is our master list of data (which is not all that clean as we shall now see).

Our data from fifty million meters

Note how many ways the word ‘Kellogg’s’ has been spelt (see above). We need to ‘tell’ Data Quality Services (henceforth referred to as DQS) the correct spelling for ‘Kellogg’s’. This may be different from country to country. In the current example, we shall accept the correct spelling of ‘Kellogg’s’ to be Kellogg USA Inc. We click on the Kellogg’s USA Inc entry (fifth from the top) in the ‘corrected to’ column and, with the control key depressed, highlight that fifth Kellogg row and then the remaining Kellogg entries.

We then right click and, from the context menu, select ‘Set as synonyms’, and our work here is done. The important point is that the myriad ways in which Kellogg has been spelt WILL BE CORRECTED and LEARNT by the system going forward (see the screen dump below).

The cleansing of additional manufacturers would then ensue.
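
Conceptually, ‘Set as synonyms’ records that several spellings all stand for one leading value. The Python sketch below illustrates that idea with a plain dictionary. It is not how DQS stores its knowledge base, and the variant spellings are hypothetical.

```python
# The leading value that every variant should be corrected to (taken from the article).
LEADING_VALUE = "Kellogg USA Inc"

# Hypothetical variant spellings, each mapped to the leading value.
synonyms = {
    "Kelloggs": LEADING_VALUE,
    "Kellog's": LEADING_VALUE,
    "Kellogg Co": LEADING_VALUE,
}


def correct_to(value):
    """Return the leading value for a known variant, else the trimmed value unchanged."""
    cleaned = value.strip()
    return synonyms.get(cleaned, cleaned)


print(correct_to("Kelloggs"))       # a known variant
print(correct_to("General Mills"))  # not a variant, passes through
```

Each additional manufacturer would simply contribute more entries of the same shape.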

Term-Based Rules

The fifth of the tabs on our screen above is ‘Term-Based Rules’, and it is meant to change record fields ‘globally’ from one value to another. This is often based upon special and/or enterprise-based rules.

Personally, I do not like the use of ‘co.’ as the abbreviation for ‘company’. I prefer the British abbreviation ‘coy.’. This is the place to ensure that ALL instances of ‘co.’ are changed to ‘coy.’.

I add a relationship. (See below)

I add my rule: ‘co.’ -> ‘coy.’

In your situation, you will probably have more rules that YOU wish to apply.

I then apply the changes. We shall see how this plays a part in the big scheme of things in a few minutes.
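
As a rough illustration of what a term-based rule does, the Python sketch below replaces a term wherever it occurs as a standalone token. The function name and the token-boundary handling are my own assumptions for the example; DQS applies its rules internally and may treat boundaries differently.

```python
import re

# Term-based rules: term to find -> replacement. The single rule mirrors the article.
term_rules = {"co.": "coy."}


def apply_term_rules(value, rules=term_rules):
    """Replace each term where it stands alone (not embedded in a longer word)."""
    for term, replacement in rules.items():
        # Assumption: a term counts only when no word character touches it on either side.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        # A function replacement avoids any special handling of backslashes.
        value = re.sub(pattern, lambda match, r=replacement: r, value)
    return value


print(apply_term_rules("Acme Trading co."))
```

The boundary check keeps a name such as "Costco." from being rewritten by accident.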

Domain Rules either affirm correctness or indicate data errors

Domain rules (the third tab) are slightly different and more down to the ‘grain of the wood’. Examples are ‘odd ball’ characters, whitespace characters (from copying and pasting), slashes etc. that you wish to have invalidated and then fixed by an Analyst.

Data rows containing any of these characters need to be flagged and sent to the business analyst to be checked and rectified. In the case of the ‘Term-Based Rule’ we KNEW what should replace any instance of a value. In the case of ‘Domain Rules’ we do not know what to expect.

Let us now look at a rule that I created for our project. I add a new rule (see below).

In this case, I wish to check that the name of the manufacturer is greater than three characters. If it is less than three characters, there is surely an error.

I now wish to add one further clause stating that if a slash is found within the manufacturer name, the record is to be marked as ‘Invalid’.

I now click ‘apply all rules’, and then finish.
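
The two clauses of this domain rule can be sketched as a small Python function. This only illustrates the logic. The function name, the `min_length` parameter (assumed here to mean that names shorter than three characters fail) and the decision to treat both forward slashes and backslashes as offending characters are assumptions made for the example.

```python
def manufacturer_status(name, min_length=3):
    """Return 'Invalid' if the name breaks either clause of the rule, else 'Valid'."""
    cleaned = name.strip()
    # Clause 1 (assumed threshold): a name shorter than min_length is surely an error.
    if len(cleaned) < min_length:
        return "Invalid"
    # Clause 2: a slash anywhere in the name invalidates the record.
    if "/" in cleaned or "\\" in cleaned:
        return "Invalid"
    return "Valid"


for candidate in ["AB", "Kraft/Heinz", "Kellogg USA Inc"]:
    print(candidate, "->", manufacturer_status(candidate))
```

Unlike a term-based rule, nothing is replaced here: the rule only flags the row so that an analyst can decide what the value should be.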

I am then asked if I wish to ‘publish the knowledge base’, to which I answer, ‘Publish’.

You will then be returned to the Main work screen. You will note that the SQL Shack Knowledge base may be seen on the left. See below.

Knowledge Discovery (the second of three steps to create our knowledge base)

Having entered our master list and having rectified a few anomalies, we are in a position to look at additional data as it comes in. We now wish to look at the accuracy statistics that our model generates against NEW DATA.

This is where ‘Knowledge Discovery’ comes into play.

I click on the arrow next to our SQL Shack knowledge base and select ‘Knowledge Discovery’.

Glory be!! We are now able to select our new data from a SQL Server Table.

As a BTW, in our wonderful world new Manufacturer data is loaded daily into a special SQL Server table. It is this table that we shall be using.

Note that the manufacturer in the left orange highlighted box comes from our SQL Server Manufacturer table (Source Column) AND THAT the Domain value of Manufacturer comes from the process that we just completed above and is displayed in the box to the right.

Click next.

Performing data discovery analysis on the selected data source

Having clicked next, the following screen appears.

We are now in the position to start analyzing our data.

We click ‘Start’.

Note that the process statistics are very informative. DQS has found 317 records that are NEW. This naturally implies that our ‘master list’ is missing a few records. We shall see how this comes into play within a few minutes.

Click next.

You have the power!!!!!

The following screen is brought up.

Note that DQS has found a few problems in the NEW incoming data (e.g. Chas Freihofer) RELATIVE to our master list. DQS has marked them as ‘incorrect’; HOWEVER, DQS was INTELLIGENT enough to provide an alternate value (originating from our master list). In many ways it is similar to a spell check; HOWEVER, a spell check CANNOT learn, whereas DQS actually learns with time.

As always, you have the option to accept the spelling on the left OR to let DQS fix the value for you as recommended in the ‘Correct to’ column.

Leaving the type as ‘X’ permits DQS to use its own suggestion. Clicking the check mark tells DQS, “No, it is correct as is”. The invalid type (‘!’) lies halfway between the two: it is virtually saying ‘I do not know’, or that DQS has found a slash in the name, or that the name is less than three characters long.

One also has the option of doing a manual correction.
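
To give a feel for the ‘spell check’ comparison, here is a minimal Python sketch that suggests the closest master-list value for an incoming name using the standard library's `difflib`. This is a stand-in technique chosen for illustration and is not the algorithm DQS uses. The master-list entries, the status labels and the 0.6 cutoff are assumptions.

```python
from difflib import get_close_matches

# Hypothetical master list of known-good manufacturer names.
master_list = ["Charles Freihofer Baking", "Kellogg USA Inc", "General Mills"]


def suggest(value, known_values, cutoff=0.6):
    """Classify an incoming value and, where possible, propose a master-list value."""
    if value in known_values:
        return ("Correct", value)
    # Ask difflib for the single best match at or above the similarity cutoff.
    matches = get_close_matches(value, known_values, n=1, cutoff=cutoff)
    if matches:
        return ("Suggested", matches[0])
    # Nothing similar enough: treat the value as new to the knowledge base.
    return ("New", value)


for incoming in ["Kellogg USA Inc", "Kellog USA Inc", "Zzyzx Foods"]:
    print(incoming, "->", suggest(incoming, master_list))
```

A plain similarity function like this never improves on its own. The reviewer's approve or reject decisions, fed back into the knowledge base, are what let DQS get better with each pass.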

At this point the astute reader is asking, ‘Why must I still correct values? I thought that this product was the panacea of data cleansing.’

Truth be told, DQS learns from each interaction, and manual data correction should be required less and less with time. After all, at this point DQS has found 75 correct records, 5 errors and 12 invalid records ALL in one pass (see the statistics in the top right portion of the Manufacturer box).

We now click ‘Finish’.

Once again we publish our results.

Matching Policy (the third of three steps to create our knowledge base)

Matching policy is a critical part of the data cleansing process, as it defines the percentage ‘certainty’ required to declare that a manufacturer is either correctly defined or incorrectly defined. In short, this process looks at the DQS statistics for our data and tells us what it (DQS) considers valid.

We select the ‘Matching Policy’ tab as shown below:

The following screen appears and, using the same table, we define our relationships (see below).

We then click ‘Next’ to create a matching rule. We select the ‘Create a matching rule’ option.

We need to add a ‘Domain Element’ (see below).

This element will be used as a guideline to ascertain ‘manufacturer’ validity and correctness.

Under normal circumstances, we would not be using ‘Manufacturer’ exclusively but rather ‘Manufacturer’, ‘Product’ etc.

It should be noted that one can weight correctness on any of these attributes to any percentage, however the total must amount to 100%.

As we are looking solely at ‘Manufacturer’, the results must be 100% correct for a match, otherwise the record will be deemed questionable.
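
The weighting idea can be sketched in a few lines of Python: each domain contributes its similarity multiplied by its weight, and the weights must total 100. The similarity measure (`difflib.SequenceMatcher`) and the function name are assumptions for illustration only; DQS computes its matching scores with its own algorithms.

```python
from difflib import SequenceMatcher


def matching_score(record, candidate, weights):
    """Return a 0..100 score: the weighted sum of per-domain similarities."""
    # The article's constraint: the domain weights must amount to 100%.
    if sum(weights.values()) != 100:
        raise ValueError("domain weights must add up to 100")
    score = 0.0
    for domain, weight in weights.items():
        left = record[domain].lower()
        right = candidate[domain].lower()
        # Stand-in similarity in the range 0..1 (an assumption, not the DQS measure).
        similarity = SequenceMatcher(None, left, right).ratio()
        score += weight * similarity
    return score


new_row = {"Manufacturer": "Kellogg USA Inc"}
master_row = {"Manufacturer": "Kellogg USA Inc"}
print(matching_score(new_row, master_row, {"Manufacturer": 100}))
```

With a single domain carrying the full 100%, anything short of an exact match on that domain scores below 100, which is why such records end up as questionable.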

We now click next.

Let us now look at the matching results to see how our NEW data compares with our master list. DQS found ‘matches’ for the following manufacturers.

Further, it could not find matches for the following items.

One possible reason for a non-match was mentioned above in that there were numerous items in the SQL Server table that were NOT in the master list.

We now click ‘Finish’ and publish.

Talk is cheap, but money buys the lunch!

Having quickly run through the preparations to evaluate our data and to train our knowledge base, it is now time to utilize our knowledge base against our production data or, for that matter, new incoming data.

To do this we must create a new ‘Data Quality Project’.

We select the ‘New Data Quality Project’ option from the ‘Data Quality Projects’ menu.

Note that the necessary criteria have been completed for the reader, utilizing the knowledge base that we just created.

We select the ‘Cleansing’ option and click ‘Next’.

Our familiar database and domain data entry screen appears. Once again I have taken the liberty of completing the required fields as shown above and we then click next.

Performing cleansing on the selected data

Upon arriving at the next screen, we start processing the data and DQS will effectively BEGIN CLEANSING your live data. My results are shown below:

Note that DQS found that 96% of the records were either correct or were corrected (by DQS). The remaining 4% of the records were invalid.

We NOW have the opportunity to look at how DQS processed our data and we have the opportunity to fix any errors, even at this late stage.

In the screenshot above, we see how DQS corrected our data PLUS the reasons why it made each decision. Looking at the ‘Invalid’ tab, we can see those records that were invalid according to the rules that we set up when creating our knowledge base (see below).

Note the slashes that invalidate the rows.

As always, even at this late stage in the process you have the option to approve or disapprove of any DQS corrections or invalidating actions. NOTE that from here on, the final rules are ‘set in stone’ and future running of this project will respect all the rules that have been set up.

We then click next and arrive at our final screen.

We may now export our results to a SQL Server table, either for production usage OR to be scrutinized by the Business Analyst or Data Steward.

Once again, I have taken the liberty of completing the necessary fields. The important point to note is that we are exporting the data PLUS the ‘cleansing info’. This ‘cleansing info’ will be CRITICAL to our SQL Server Integration Services nightly process.

Click finish.

The proof of the quality of the pudding is in the results

Opening SQL Server Management Studio and going to the DQS_STAGING_DATA database, one will note that the data which was just extracted is present in the table ‘SQLShack’. The correct records are shown in the screen dump below.

The following are those records that were ‘corrected’ by DQS.

...and these are the invalid records.

At the end of the day, the correct records would be added to the production data. Dicey records such as ‘corrected’ and ‘invalid’ records may be sent to the business folks for their feedback.

Automating our daily load processes utilizing SQL Server Integration Services

For those of you rejoining me, welcome back!

Now that we have created our Data Quality Project, it is time to utilize it within our daily load process. We start off by creating a new SQL Server Integration Services Project.

We now create two connection managers, one to the DQS data staging database where we shall store our cleansed data and one to the INCOMING NEW source data within my ‘PASSNordicRally’ database. Further, I add an ‘Execute SQL Task’ and a ‘Data Flow Task’ to the Control Flow. (See below)

The ‘Execute SQL Task’ is used to truncate the existing data from the staging tables.

At this point the astute reader will ask, ‘Why do we want to truncate the “bad data” tables, especially if the Business Analysts and Data Stewards have not yet finished revamping the data?’ This is a valid point; therefore the truncation of these tables will OBVIOUSLY be dependent on your setup and the business policies.

Creating the Data Flow

Having access to both the data source and to the destinations, we create the following data source which points to our incoming data table. This table may have been refreshed in a plethora of ways (which are not relevant to our present discussion). Suffice it to say that the table of new data is ready to be run through our data cleansing model.

The columns within this SOURCE table may be seen below:

Adding the Data Cleansing Component

We now add a ‘Data Cleansing Component’ which was added to your SSIS tool kit when you installed Data Quality Services.

Configuring the Data Cleansing component is fairly fast and easy. We start off by creating a DQS Cleansing Connection Manager.

We now must choose our Knowledge Base.

Notice that our Manufacturer Domain immediately appears after having accepted our Knowledge Base.

The Mapping Tab

Selecting the mapping tab permits us to tell the system which field(s) are to be processed by our cleansing model. In our case (described above) we have been looking at the ‘Manufacturer’ field exclusively; thus I have selected only ‘Manufacturer’. The remaining fields will follow through with the load process.

Note that our source field ‘Manufacturer’ and our Domain ‘Manufacturer’ have a one-to-one mapping.

At this point, we are finished with the Data Cleansing transform and click OK.

Splitting the feed

The reader should note that the feed exiting the DQS Cleansing transformation (as seen above) contains an additional field. In our case this field is called Manufacturer_Status, and it is the value of this field that we shall be utilizing to split our feed, sending the correct records through to production AND all the invalid, new and corrected records through to the Business Analysts and Data Stewards.

This said, we add a ‘Conditional Split’ to our project (See below).
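
The routing that the Conditional Split performs can be pictured with the Python sketch below. It is a conceptual stand-in, not SSIS expression syntax. The four status labels are assumed from the record types named in this article, and sending any other status to the invalid bucket for review is my own choice for the example.

```python
def split_feed(rows, status_field="Manufacturer_Status"):
    """Route each row to an output bucket according to its cleansing status."""
    outputs = {"Correct": [], "Corrected": [], "New": [], "Invalid": []}
    for row in rows:
        status = row[status_field]
        # Assumption: an unrecognized status goes to 'Invalid' so that a person reviews it.
        outputs.get(status, outputs["Invalid"]).append(row)
    return outputs


# Hypothetical rows as they might leave the cleansing step.
feed = [
    {"Manufacturer": "Kellogg USA Inc", "Manufacturer_Status": "Correct"},
    {"Manufacturer": "Kellogg USA Inc", "Manufacturer_Status": "Corrected"},
    {"Manufacturer": "Zzyzx Foods", "Manufacturer_Status": "New"},
    {"Manufacturer": "Kraft/Heinz", "Manufacturer_Status": "Invalid"},
]

buckets = split_feed(feed)
print({status: len(rows) for status, rows in buckets.items()})
```

Each bucket corresponds to one output of the split, and each output is then wired to its own staging destination.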

Finishing our project

We have now reached the point where we need to set our data destination. All data coming out of the process will be staged in our DQS_STAGING_DATA database. Four staging tables exist (see below).

We now proceed to connect the outputs of our conditional split to the OLE DB destinations (See below).

We then complete the mappings (See below).

Having completed the configuration of all four OLE DB destinations, your project should look as follows:

Running our daily process

Test driving our new SQL Server Integration Services project with BRAND NEW data, we now find the following:

Conclusions

It is blatantly obvious that data quality is an ongoing battle. Our nightly processes must be trusted to maintain our data and to ensure that our data is as ‘correct’ as possible. Intelligent algorithms such as those found within Data Quality Services can and will help ensure that we are not ‘reinventing’ the wheel with each new data load.

Our ‘invalid’, ‘new’ and ‘corrected’ data is sent to the Business Analysts / Data Stewards for further validation whilst the ‘correct’ data may be loaded into our production tables.

In our case, our Business Analysts and Data Stewards process the faulty data via SQL Server Master Data Services; however, this is a topic for another day.

Happy programming.

Translated from: https://www.sqlshack.com/clean-data-using-data-quality-services-ssis/