Failure Diagnosis Using Decision Trees
paper link: https://ieeexplore.ieee.org/abstract/document/1301345
Main idea
Train decision trees on the request traces from time periods in which user-visible failures are present. Paths through the tree are ranked according to their degree of correlation with failure, and nodes are merged according to the observed partial order of system components.
Goals
- High diagnosis rate: localize a large percentage of the failures in order to earn the confidence of the system recovery operator.
- Few false positives: false positives, i.e., correct system behaviors that are mistaken for errors, are costly in terms of time wasted on recovery and re-diagnosis.
- Robust to noise: an ideal algorithm should be able to filter out noise and prioritize the faults that are impacting the availability of the system.
- Near real-time turnaround: the algorithm needs to be fast enough to derive benefits over manual approaches.
Decision Tree Learning Approach
In this paper, we treat the problem as one of finding system components that are correlated with failure.
The tree itself is learned in the standard way; we can use C4.5, or MinEntropy, a greedy algorithm that has been deployed at eBay.
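A minimal sketch of the learning step, assuming request traces with categorical features and a success/failure label. The feature names, values, and the use of scikit-learn's CART with an entropy criterion as a stand-in for C4.5 are all assumptions, not the paper's setup:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical request traces: one row per request, labelled 1 if it failed.
traces = pd.DataFrame({
    "Machine":     ["x", "x", "x", "z", "z", "y"],
    "RequestType": ["y", "y", "w", "y", "w", "w"],
    "Version":     ["v1", "v1", "v1", "v2", "v2", "v2"],
    "failed":      [1, 1, 1, 1, 0, 0],
})

# One-hot encode the categorical features so the tree can split on them.
X = pd.get_dummies(traces[["Machine", "RequestType", "Version"]])
y = traces["failed"]

# criterion="entropy" approximates C4.5's information-gain splitting.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
```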
Failure Diagnosis from Decision Tree Output
We apply the following heuristics to the output of the decision tree to obtain the diagnosis we want (a sketch of the full post-processing pipeline appears after this list):
- Ignore the leaf nodes corresponding to successful requests. Most of them do not contain any failed requests and are thus useless for diagnosing failures (e.g., we can delete the “Success (20/10)” leaf).
- Noise Filtering: Ignore leaves containing less than c% of the total number of failures. This corresponds to the reasonable assumption that there are only a few independent sources of error, each accounting for a large fraction of the total number of failures.
- Node Merging: Before reporting the final diagnosis, we merge nodes on a path by eliminating ancestor nodes that are logically “subsumed” by descendant nodes. For example, if machine x only runs software version y, then the path (Version=y and Machine=x) is equivalent to (Machine=x). This kind of “subsumption” between system components defines a partial order on the features, where feature1 <= feature2 iff all requests containing feature1 also contain feature2.
- Ranking: Sort the predicted causes by failure counts to prioritize their importance.
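A minimal sketch of this post-processing, assuming each leaf has already been extracted from the tree as a root-to-leaf path of predicates together with its failure count, and that the component partial order is supplied as a `subsumes` callback. The predicate representation and all names here are illustrative, not from the paper:

```python
from typing import Callable, List, Tuple

Predicate = Tuple[str, str, str]          # (feature, operator, value), e.g. ("Machine", "=", "x")
Leaf = Tuple[Tuple[Predicate, ...], int]  # (root-to-leaf path, number of failed requests)

def diagnose(leaves: List[Leaf], total_failures: int, c: float,
             subsumes: Callable[[Predicate, Predicate], bool]) -> List[Leaf]:
    # Step 1: ignore leaves that contain no failed requests.
    candidates = [(path, n) for path, n in leaves if n > 0]
    # Step 2: noise filtering -- drop leaves holding less than c% of all failures.
    candidates = [(path, n) for path, n in candidates
                  if n >= (c / 100.0) * total_failures]
    # Step 3: node merging -- drop an ancestor predicate when a predicate deeper
    # on the same path subsumes it (subsumes(p, q) is True when every request
    # matching q also matches p, making p redundant).
    merged = []
    for path, n in candidates:
        kept = tuple(p for i, p in enumerate(path)
                     if not any(subsumes(p, q) for q in path[i + 1:]))
        merged.append((kept, n))
    # Step 4: rank the predicted causes by failure count, largest first.
    return sorted(merged, key=lambda leaf: leaf[1], reverse=True)
```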
Example (see the decision tree in Figure 1 of the paper):
- In Step 1, applying these heuristics to the example decision tree, the right-most leaf node is eliminated because it does not contain any failures.
- In Step 2, with a noise-filtering threshold of 10%, the remaining leaf nodes, holding 40% and 60% of the total failures, are both retained as candidates.
- In Step 3, we have two predicted sources of error: (Machine=x) and (Machine!=x and RequestType=y). If none of the failed requests with request type y executed on machine x (so every failed RequestType=y request also satisfies Machine!=x), the machine feature is subsumed by the request type and the two nodes merge into one. The diagnosis becomes (Machine=x) and (RequestType=y).
- Finally, in Step 4, the two predicted causes are ranked by failure count: (Machine=x) causes 15 failures, which places it before (RequestType=y) with 10 failures.
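Running the `diagnose` sketch above on the numbers from this walk-through reproduces it. The `subsumes` rule encodes the assumption that every failed RequestType=y request ran on a machine other than x, and the leaf counts (including modelling the eliminated right-most leaf as 0 failures) are taken from the example rather than real trace data:

```python
def subsumes(p: Predicate, q: Predicate) -> bool:
    # Partial order for this example only: every request with RequestType=y
    # also satisfies Machine!=x, so the machine predicate is redundant.
    return p == ("Machine", "!=", "x") and q == ("RequestType", "=", "y")

leaves = [
    ((("Machine", "=", "x"),), 15),                             # 60% of failures
    ((("Machine", "!=", "x"), ("RequestType", "=", "y")), 10),  # 40% of failures
    ((("Machine", "!=", "x"), ("RequestType", "!=", "y")), 0),  # right-most (successful) leaf
]
print(diagnose(leaves, total_failures=25, c=10, subsumes=subsumes))
# [((('Machine', '=', 'x'),), 15), ((('RequestType', '=', 'y'),), 10)]
```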