Failure Diagnosis Using Decision Trees
paper link: https://ieeexplore.ieee.org/abstract/document/1301345
Main idea
Train decision trees on the request traces from time periods in which user-visible failures are present. Paths through the tree are ranked according to their degree of correlation with failure, and nodes are merged according to the observed partial order of system components.
Goals
- High diagnosis rate: localize a large percentage of the failures in order to earn the confidence of the system recovery operator.
- Few false positives: false positives, i.e., correct system behaviors that are mistaken for errors, are costly in terms of time wasted on recovery and re-diagnosis.
- Robust to noise: an ideal algorithm should be able to filter out noise and prioritize the faults that are impacting the availability of the system.
- Near real-time turnaround: the algorithm needs to be fast enough to derive benefits over manual approaches.
Decision Tree Learning Approach
In this paper, we treat the problem as one of finding system components that are correlated with failure.
The tree itself is learned in the standard way; we can use C4.5, or MinEntropy, a greedy algorithm that has been deployed at eBay.
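A minimal sketch of the learning step, assuming request traces with categorical features and a success/failure label. The feature names, values, and the use of scikit-learn's CART with an entropy criterion as a stand-in for C4.5 are all assumptions, not the paper's setup:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical request traces: one row per request, labelled 1 if it failed.
traces = pd.DataFrame({
    "Machine":     ["x", "x", "x", "z", "z", "y"],
    "RequestType": ["y", "y", "w", "y", "w", "w"],
    "Version":     ["v1", "v1", "v1", "v2", "v2", "v2"],
    "failed":      [1, 1, 1, 1, 0, 0],
})

# One-hot encode the categorical features so the tree can split on them.
X = pd.get_dummies(traces[["Machine", "RequestType", "Version"]])
y = traces["failed"]

# criterion="entropy" approximates C4.5's information-gain splitting.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
```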
Failure Diagnosis from Decision Tree Output
We apply the following heuristics to the output of the decision tree to obtain the diagnosis we want (a sketch of the full post-processing pipeline appears after this list):
- Ignore the leaf nodes corresponding to successful requests. Most of them do not contain any failed requests and are thus useless for diagnosing failures (e.g., we can delete the “Success (20/10)” leaf).
- Noise Filtering: Ignore leaves containing less than c% of the total number of failures. This corresponds to the reasonable assumption that there are only a few independent sources of error, each accounting for a large fraction of the total number of failures.
- Node Merging: Before reporting the final diagnosis, we merge nodes on a path by eliminating ancestor nodes that are logically “subsumed” by descendant nodes. For example, if machine x only runs software version y, then the path (Version=y and Machine=x) is equivalent to (Machine=x). This kind of “subsumption” between system components defines a partial order on the features, where feature1 <= feature2 iff all requests containing feature1 also contain feature2.
- Ranking: Sort the predicted causes by failure counts to prioritize their importance.
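A minimal sketch of this post-processing, assuming each leaf has already been extracted from the tree as a root-to-leaf path of predicates together with its failure count, and that the component partial order is supplied as a `subsumes` callback. The predicate representation and all names here are illustrative, not from the paper:

```python
from typing import Callable, List, Tuple

Predicate = Tuple[str, str, str]          # (feature, operator, value), e.g. ("Machine", "=", "x")
Leaf = Tuple[Tuple[Predicate, ...], int]  # (root-to-leaf path, number of failed requests)

def diagnose(leaves: List[Leaf], total_failures: int, c: float,
             subsumes: Callable[[Predicate, Predicate], bool]) -> List[Leaf]:
    # Step 1: ignore leaves that contain no failed requests.
    candidates = [(path, n) for path, n in leaves if n > 0]
    # Step 2: noise filtering -- drop leaves holding less than c% of all failures.
    candidates = [(path, n) for path, n in candidates
                  if n >= (c / 100.0) * total_failures]
    # Step 3: node merging -- drop an ancestor predicate when a predicate deeper
    # on the same path subsumes it (subsumes(p, q) is True when every request
    # matching q also matches p, making p redundant).
    merged = []
    for path, n in candidates:
        kept = tuple(p for i, p in enumerate(path)
                     if not any(subsumes(p, q) for q in path[i + 1:]))
        merged.append((kept, n))
    # Step 4: rank the predicted causes by failure count, largest first.
    return sorted(merged, key=lambda leaf: leaf[1], reverse=True)
```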
Example (see the decision tree in Figure 1 of the paper):
- In Step 1, applying these heuristics to the example decision tree, the right-most leaf node is eliminated because it does not contain any failures.
- In Step 2, with a noise-filtering threshold of 10%, the remaining leaf nodes, holding 40% and 60% of the total failures, are both retained as candidates.
- In Step 3, we have two predicted sources of error: (Machine=x) and (Machine!=x and RequestType=y). If none of the failed requests with request type y executed on machine x (so every failed RequestType=y request also satisfies Machine!=x), the machine feature is subsumed by the request type and the two nodes merge into one. The diagnosis becomes (Machine=x) and (RequestType=y).
- Finally, in Step 4, the two predicted causes are ranked by failure count: (Machine=x) causes 15 failures, which places it before (RequestType=y) with 10 failures.
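Running the `diagnose` sketch above on the numbers from this walk-through reproduces it. The `subsumes` rule encodes the assumption that every failed RequestType=y request ran on a machine other than x, and the leaf counts (including modelling the eliminated right-most leaf as 0 failures) are taken from the example rather than real trace data:

```python
def subsumes(p: Predicate, q: Predicate) -> bool:
    # Partial order for this example only: every request with RequestType=y
    # also satisfies Machine!=x, so the machine predicate is redundant.
    return p == ("Machine", "!=", "x") and q == ("RequestType", "=", "y")

leaves = [
    ((("Machine", "=", "x"),), 15),                             # 60% of failures
    ((("Machine", "!=", "x"), ("RequestType", "=", "y")), 10),  # 40% of failures
    ((("Machine", "!=", "x"), ("RequestType", "!=", "y")), 0),  # right-most (successful) leaf
]
print(diagnose(leaves, total_failures=25, c=10, subsumes=subsumes))
# [((('Machine', '=', 'x'),), 15), ((('RequestType', '=', 'y'),), 10)]
```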