machine learning yearning 第五章

Your development and test sets

开发与测试集

让我们会到喵喵图片的案例:你运营者一个手机app,用户向你的app上上传各种图片,但是你要让机器自动找到喵咪的图片。

Lets return to our earlier cat pictures example: You run a mobile app, and users are uploading pictures of many different things to your app. You want to automatically find the cat pictures. 

你的团队从不同网页中下载了大量的图片作为训练集,包括有喵咪的图片(阳性样本)和没有喵咪的图片(阴性样本),并且其中的70%用于当训练集和30%用来测试。使用这些数据,你的队友建立了一个在测试集上表现良好,算得上经得住”测试“的分类器。但当你将这个分类器部署到你的手机app上,你吓了一跳,因为它在实际工作中的表现实在太差了!

Your team gets a large training set by downloading pictures of cats (positive examples) and non-cats (negative examples) off different websites. They split the dataset 70%/30% into training and test sets. Using this data, they build a cat detector that works well on the training and test sets. 

But when you deploy this classifier into the mobile app, you find that the performance is really poor! 

                                                                     !machine learning yearning 第五章

那么...发生了什么呢?

What happened? 

你发现用户上传的图片跟你在网上下载的并作为训练集的图片有所差别:用户上传的图片通常是用手机拍摄的,所以分辨率较低、相对模糊并且缺少理想照明。因为你的训练集大多是容网上搜来的,所以你的算法并不能很好的解决你实际关心的问题,即识别手机图片。

You figure out that the pictures users are uploading have a different look than the website images that make up your training set: Users are uploading pictures taken with mobile phones, which tend to be lower resolution, blurrier, and have less ideal lighting. Since your training/test sets were made of website images, your algorithm did not generalize well to the actual distribution you care about of smartphone pictures. 

在大数据之前,机器学习随机地按70%和30%的比例分配训练集和测试集是业内的一个通用法则。虽然从网上收集的这些图片能发挥效用,但是在越来越多的应用中,这样的做法,即选取训练集与实际情景有偏颇的做法(例如上例中的网页图片),最终可能导致无法“命中”目标(上例中的手机图片)。因此是不可取的。

Before the modern era of big data, it was a common rule in machine learning to use a random 70%/30% split to form your training and test sets. This practice can work, but is a  bad idea in more and more applications where the training distribution (website images in our example above) is different from the distribution you ultimately care about (mobile phone images).

我们通常定义[1]:

  1. 训练集——用来训练你的学习算法
  2. 开发集——用来对训练集训练出来的模型进行测试,通过测试结果来不断地优化模型。也叫交叉训练集
  3. 测试集——在训练结束后,对训练出的模型进行一次最终的评估所用的数据集。

We usually define:

  • Training set — Which you run your learning algorithm on.
  • Dev (development) set — Which you use to tune parameters, select features, and make other decisions regarding the learning algorithm. Sometimes also called the holdout cross validation set
  • Test set — which you use to evaluate the performance of the algorithm, but not to make any decisions about regarding what learning algorithm or parameters to use. 

假设你确定开发集和测试集,你的团队有许多点子,例如学习算法的参数该怎么设置。为了测试那种设置结果最优,你可以使用你的开发集和测试集来考察,并且能够让你的团队知道你当前的学习算法怎么样。

One you define a dev set (development set) and test set, your team will try a lot of ideas, such as different learning algorithm parameters, to see what works best. The dev and test sets allow your team to quickly see how well your algorithm is doing. 

换句话说,开发集和测试集能够指引你们的团队,在机器学习系统上,做出重大决策。

In other words, the purpose of the dev and test sets are to direct your team toward the most important changes to make to the machine learning system

所以,现在你应该:

                     选择一个适合的开发集和数据集,以回馈你未来所期望得到的数据和你想做得更好的方面的数据。

So, you should do the following: 

Choose dev and test sets to reflect data you expect to get in the future and want to do well on. 

换而言之,你的测试集不应该仅仅是30%的可用数据,开发/测试集应该与训练集来自不同的一份,特别是当你期望的未来的数据(手机app图片)与训练集中的数据(网络图片)具有本质差别的时候。

In order words, your test set should not simply be 30% of the available data, especially if you expect your future data (mobile app images) to be different in nature from your training set (website images). 

 

假如你还没发布你的手机软件,你还没用户呢,因此没办法得到能够真实反映你模型效率的数据。但是你仍旧可以尝试模拟它,比如你可以叫你的朋友用手机拍一些图片然后发送给你,你再用这些数据测试你的app也未尝不可。一旦你的app发布之后,你就可以用你的用户数据更新你的开发集或测试集。

If you have not yet launched your mobile app, you might not have any users yet, and thus might not be able to get data that accurately reflects what you have to do well on in the future. But you might still try to approximate this. For example, ask your friends to take mobile phone pictures and send them to you. Once your app is launched, you can update your dev/test sets using actual user data. 

如果你真的毫无办法模拟实际数据,那或许你可以使用网络图片,但是你应该注意,这可能会导致系统功能不高的后果。

If you really don’t have any way of getting data that approximates what you expect to get in the future, perhaps you can start by using website images. But you should be aware of the risk of this leading to a system that doesn’t generalize well. 

花多少时间在开发开发集和测试集上,需要你根据实际情况做出判断。但千万不要让你的训练集和测试集是在同一份数据上挑出来的。试着去找一些,能够真正模拟在实际应用中你想要处理好数据的测试样本,而不是随便在用来训练的数据上上挑几个马虎了事。

It requires judgment to decide how much to invest in developing great dev and test sets. But don’t assume your training distribution is the same as your test distribution. Try to pick test examples that reflect what you ultimately want to perform well on, rather than whatever data you happen to have for training. 


1.一般需要将样本分成独立的三部分训练集(train set),验证集(validation set)和测试集(test set)。其中训练集用来估计模型,验证集用来确定网络结构或者控制模型复杂程度的参数,而测试集则检验最终选择最优的模型的性能如何。

人能力有限,如有错误欢迎改正,希望不吝赐教。

 

                                                                                                  ——译者:wexin_42141390 邮箱:[email protected]