Machine learning system design - Prioritizing what to work on: Spam classification example

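Abstract: This article is the original video transcript of Lesson 93, "Prioritizing what to work on", from Chapter 12, "Machine learning system design", of Andrew Ng's Machine Learning course. I wrote it down while studying the videos and corrected it to make it more concise and easier to read, so that I can refer back to it later. I am sharing it here. If you find any errors, corrections are welcome and sincerely appreciated. I also hope it helps with your own study.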
————————————————
In the next few videos, I'd like to talk about machine learning system design. These videos will touch on the main issues that you may face when designing a complex machine learning system, and I'd like to try to give advice on how to strategize putting such a system together. In case this next set of videos seems a little disjointed, that's because they will touch on a range of different issues that you may come across when designing complex machine learning systems. And even though the next set of videos may seem somewhat less mathematical, I think that this material may turn out to be very useful, and potentially a huge time saver when you're building big machine learning systems. Concretely, I'd like to begin with the issue of prioritizing how to spend your time on what to work on, and I'll begin with an example on spam classification.


Let's say you want to build a spam classifier. Here are a couple of examples of obvious spam and non-spam email. The one on the left is trying to sell things. Notice how spammers will sometimes deliberately misspell words, like medicine with a 1 in it, and mortgages. On the right is an obvious example of non-spam email, actually an email from my younger brother. Let's say we have a labeled training set of some number of spam emails and some non-spam emails, denoted with labels $y = 1$ (spam) or $y = 0$ (non-spam). How do we build a classifier using supervised learning to distinguish between spam and non-spam?


In order to apply supervised learning, the first decision we must make is how we want to represent $x$, that is, the features of the email. Given the features $x$ and the labels $y$ in our training set, we can then train a classifier, for example, using logistic regression. Here's one way to choose a set of features for our email. We could come up with, say, a list of maybe a hundred words that we think are indicative of whether an email is spam or non-spam. For example, if a piece of email contains the word "deal", maybe it's more likely to be spam. If it contains the word "buy", maybe it's more likely to be spam. A word like "discount" also makes it more likely to be spam. Whereas if a piece of email contains my name, "Andrew", maybe that means the person actually knows who I am, and that might mean it's less likely to be spam. And maybe for some reason I think the word "now" may be indicative of non-spam because I get a lot of urgent emails. And so on, and maybe we choose a hundred words or so.

Given a piece of email, we can then take it and encode it into a feature vector as follows. I'm going to take my list of a hundred words and sort them in alphabetical order. It doesn't have to be sorted. Now, given the piece of email that's shown on the right, I'm going to check whether or not each of these words appears in the email. And then I'm going to define a feature vector $x$ where, in this piece of email on the right, my name doesn't appear, so I'm gonna put a zero there. The word "buy" does appear, so I'm gonna put a one there. And I'm just gonna put ones or zeros: I'm gonna put a one even though the word "buy" occurs twice. I'm not going to count how many times the word occurs. The word "deal" appears, so I put a one there. The word "discount" doesn't appear, at least not in this short email, and so on. The word "now" does appear, and so on. So I put ones and zeros in this feature vector depending on whether or not a particular word appears. In this example, my feature vector would have dimension one hundred, if I chose one hundred words to use for this representation. Each of my features $x_j$ will be 1 if a particular word, which we'll call word $j$, appears in the email, and $x_j$ will be zero otherwise. So that gives me a feature representation of a piece of email.

By the way, even though I've described this process as manually picking a hundred words, in practice what's most commonly done is to look through a training set and pick the $n$ most frequently occurring words, where $n$ is usually between 10,000 and 50,000, and use those as your features.
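The encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's own code: the function names `tokenize`, `build_vocabulary` and `encode` are hypothetical, and lowercasing every word before matching is a simplification chosen here. The sample email is made up to mirror the lecture example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (an assumption of this sketch)."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(emails, n):
    """Return the n most frequent words across the training emails, sorted alphabetically."""
    counts = Counter()
    for email in emails:
        counts.update(tokenize(email))
    return sorted(word for word, _ in counts.most_common(n))

def encode(email, vocabulary):
    """Binary feature vector: x[j] = 1 if vocabulary[j] appears in the email, else 0."""
    present = set(tokenize(email))  # a set, so repeated words still yield 1
    return [1 if word in present else 0 for word in vocabulary]

# Hand-picked word list, as in the lecture example (only five of the hundred words)
vocabulary = ["andrew", "buy", "deal", "discount", "now"]
email = "Deal of the week! Buy now! Buy today!"
x = encode(email, vocabulary)
print(x)  # [0, 1, 1, 0, 1]
```

In practice, `build_vocabulary(training_emails, n)` would replace the hand-picked list, with `n` chosen as described above.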


Now, if you're building a spam classifier, one question you may face is: what's the best use of your time in order to make your spam classifier have higher accuracy and lower error? One natural inclination is to go and collect a lot of data. In fact, there's this tendency to think that the more data we have, the better the algorithm will do. And in the email spam domain, there are actually pretty serious projects called "honeypot" projects, which create fake email addresses and try to get these fake addresses into the hands of spammers, and use that to collect tons of spam email and therefore get a lot of spam data to train learning algorithms. But we've already seen in the previous set of videos that getting lots of data will often help, but not all the time.

For most machine learning problems, there are a lot of other things you could do to improve performance. For spam, one thing you might think of is to develop more sophisticated features based on the email routing information. This is the information contained in the email header. When spammers send email, very often they will try to obscure the origins of the email, and maybe use fake email headers, or send email through very unusual sets of computer servers, through very unusual routes, in order to get the spam to you. Some of this information will be reflected in the email header. So you can imagine looking at the email headers and trying to develop more sophisticated features that capture this sort of routing information to identify whether something is spam.

Something else you might consider doing is to look at the email message body, that is, the email text, and try to develop more sophisticated features. For example, should the word "discount" and the word "discounts" be treated as the same word? Or should we treat the words "deal" and "Dealer" as the same word, even though one is lower case and one is capitalized in this example? Or do we want more complex features about punctuation, because maybe spam uses exclamation marks a lot more? I don't know. And along the same lines, maybe we also want to develop more sophisticated algorithms to detect, and maybe correct, deliberate misspellings, like mortgage (m0rtgage), medicine (med1cine) and watches (w4tches). Spammers actually do this because, if you have "watches" with a 4 in there, then with the simple technique that we talked about just now, the spam classifier might not treat it as the same thing as the word "watches". So it may have a harder time realizing that something is spam with these deliberate misspellings. And this is why spammers do it.

While working on a machine learning problem, very often you can brainstorm a list of different things to try, like these. By the way, I've actually worked on the spam problem myself for a while. And even though I kind of understand the spam problem and actually know a bit about it, I would have a very hard time telling you which of these four options is the best use of your time. So frankly, what happens far too often is that a research group or a product group will randomly fixate on one of these options. And sometimes that turns out not to be the most fruitful way to spend your time, depending on which of these options someone ends up randomly fixating on.
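To make the message-body ideas above concrete, here is a minimal sketch of case folding, undoing digit-for-letter misspellings, very crude plural stripping, and an exclamation-mark count. Everything in it is an assumption for illustration: the substitution table `LEET_MAP`, the helper names, and the suffix rules are hypothetical and far simpler than a real stemmer or spelling corrector. Whether any of these features actually helps is exactly the open question raised in the lecture.

```python
import re

# Hypothetical digit-for-letter substitutions; a real system would need a larger, tuned map.
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s"})

def crude_stem(token):
    """Strip a plural ending. This is a toy rule set, not a real stemming algorithm."""
    for suffix in ("ches", "shes", "xes", "sses"):
        if token.endswith(suffix):
            return token[:-2]  # "watches" -> "watch"
    if len(token) > 4 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]      # "discounts" -> "discount"
    return token

def normalize_token(token):
    """Lowercase a token, undo digit-for-letter swaps, then apply the crude stemmer."""
    token = token.lower()
    # Only rewrite tokens that mix letters and digits, so plain numbers are left alone.
    if any(c.isalpha() for c in token) and any(c.isdigit() for c in token):
        token = token.translate(LEET_MAP)
    return crude_stem(token)

def body_features(text):
    """Return normalized tokens plus a simple punctuation feature."""
    tokens = [normalize_token(t) for t in re.findall(r"[A-Za-z0-9]+", text)]
    return {"tokens": tokens, "exclamation_count": text.count("!")}

features = body_features("Med1cine DISCOUNTS!!! Cheap w4tches and m0rtgages")
print(features["tokens"])
# ['medicine', 'discount', 'cheap', 'watch', 'and', 'mortgage']
print(features["exclamation_count"])  # 3
```

Note that this sketch leaves "deal" and "dealer" as different tokens; whether to merge them is a design decision the normalizer does not settle.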

By the way, if you even get to the stage where you brainstorm a list of different options to try, you're probably already ahead of the curve. Sadly, instead of trying to list out the options of things they might try, what far too many people do is wake up one morning and for some reason just have a weird gut feeling that, "Oh, let's have a huge honeypot project to go and collect tons more data." For whatever strange reason, they just wake up one morning, randomly fixate on one thing, and work on that for six months. But I think we can do better. In particular, what I'd like to do in the next video is tell you about the concept of error analysis, and talk about how you can have a more systematic way to choose among the many different things you might work on, and therefore be more likely to select what is actually a good way to spend your time for the next few days, the next few weeks, or the next few months.
