On the use of machine learning to predict the time and resources consumed by applications

为什么80%的码农都做不了架构师？>>> On the use of machine learning to predict the time and resources consumed by applications

现代机器学习技术可以使用大量的特征，考虑application- and system-specific attributes，例如CPU 微架构, size and speed of memory and storage, input data characteristics and input parameters。本文扩展了一种已有的分类树算法Predicting Query Runtime (PQR预测查询运行时间)，选择最好的回归算法。PQR2取得了最好的平均误差百分比（对于 predicting execution time, memory and disk consumption for two bioinformatics applications, BLAST and RAxML）

使用非线性函数处理系统和应用的特征，配置两种机器学习算法：Support Vector Machine and k-nearest neighbors。

机器学习相关的工作

预测（CPU, memory, disk and network）资源消耗是一个有监督学习问题，

假定有n个之前运行的任务作为历史数据，特征集合包含m个特征。

几个参数学习 (e.g.,linear regression, polynomial regression) and 非参数学习
(e.g., k-nn, locally weighted linear regression, decision trees)监督学习算法

参数学习方法定义了假设空间和损失函数。

训练数据用于提取模型参数以最小化损失函数（老生常谈）。

KNN：一个挑战就是找到理想的近邻数k，依赖于训练数据。

High value for k can reduce the influence of noise

可以用遗传算法确定k

Locally weighted polynomial regression (LWPR) is similar to k-nn algorithm，见文献[6]:W. Smith, “Prediction Services for Distributed Computing,” in Proc. 21st Int. Parallel Distributed Processing Symp., 2007.

线性回归是简单的模型，应用于工作流时间预测和LWPR，SVM,参见：R. Albers, E. Suijs, and P. H. N. de With, “Triple-C: Resource-usage prediction for semi-automatic parallelization of groups of dynamic image-processing tasks,” in Proc. 23rd Int. Parallel Distributed Processing Symp., 2009.

Decision table/tree (DT) algorithm 决策表决策树：给予分治策略

使用树形结构按照他们的特征分离数据

C4.5 classification tree was applied in I. Rodero, F. Guim, J. Corbalan et al., "The Grid Backfilling: a Multi-Site Scheduling Architecture with Data Mining Prediction Techniques," Grid Middleware and Services, pp.137-152, 2008.

使用模板而不是树形图的论文很多（[1][4][5][6][12]），从所有任务特征中选择一个子集合。选择统计函数，如平均，平均加1.96*标准差，平均加1.5标准差，根据单个特征的线性回归，根据单个特征的逆回归和对数回归。

人工神经网络：Radial Basis Function network (RBFn) is a feedforward ANN

径向基函数网络 Typically with three layers: input, hidden and output

RBFn is very similar to k-nn, except for the fact that RBFn is a parametric method.神经网络是参数学习

SVM：kernel method for solving classification [19] and regression [20]
problems, especially for scenarios with non-linear learning pattern. 非线性场景

时间序列方法：Network Weather Service (NWS) [22] or a probabilistic approach as a Markov-chain.马尔科夫时间序列

Triple-C ：Resource-usage prediction for semi-automatic parallelization of groups of dynamic image-processing tasks

PQR回归

Generates a binary tree that can combine a variety of classifiers

树的每个节点可以表示成二分类器（从分类算法池中依据精度选取的）

算法将数值特征离散化为二分类，m个训练特征值（ai,i=1,...,m）

找到第k大的gaps gaps_i=(a_i+1-ai)/ai，i=1,..,m-1成为潜在的分割点

The best combination of classifier and split location determines the two attribute
ranges.

Instead of outputting classes (a broad range) or a static value (e.g., range median), PQR2 selects the best regression model for the availab le data (LR and
SVM in the case of the leaves shown in the figure).

生物信息学应用-序列比对

Basic Local Alignment Search Tool (BLAST) [1] and Randomized Axelerated
Maximu m Likelihood (RAxML)

BLAST----the non-redundant (NR) protein sequence database from NCBI split into 1 frag ment (total of 3.5 GB of data)

树生成 by PQR and PQR2 algorithms

On the use of machine learning to predict the time and resources consumed by applications

The nodes of the tree (circles) are common to both methods, while leaves (方形叶子) of PQR2 (加粗) yield lower errors than PQR (正常字体).

The improvement comes from the ability of selecting the best regression method
from a pool, whereas leaf range median is used in PQR.

The number between square brackets represent the range of values of the attribute to be predicted, which is followed by the number of historical data points in each node/leaf.

The percentage value indicates the accuracy of each classifier (nodes 分类) or the percentage error of each regressior (leaves 回归).

The last value indicates the name of each cassification (PQR and PQR2) or regression (PQR2) algorithm selected.

BLAST运行时间如下图，运行时间与输入序列的长度成线性关系：

On the use of machine learning to predict the time and resources consumed by applications

需要尝试的机器学习预测算法：

On the use of machine learning to predict the time and resources consumed by applications

问题

Question 1: Which ML algorithm offers the best accuracy?

Question 2: Which attributes should be included in the training dataset?

Question 3: Which ML algorith m provides better accuracy when dealing with training datasets with low coverage?

Regression Error Characteristic Curves

Comparing accuracy of PQR and PQR2 for predicting BLAST output and execution time (2 graphs on the left) as well as RAxML memory consumption and execution time (graphs on the right). On the use of machine learning to predict the time and resources consumed by applications

On the use of machine learning to predict the time and resources consumed by applications

Question 4: Does PQR2 offer better accuracy than PQR?

总结与讨论

To better adapt to scenarios with different characteristics (linear and non-linear relationships, high and low density of training data points) by choosing different models for its nodes and leaves.

更一般的，using the largest dataset (BLAST), PQR2 required a few minutes to create the model and a few milliseconds to produce a single prediction, indicat ing practicality of PQR2 for production deployments.

PQR2 是最佳的方案 for BLAST and RAxML and should 作为备选的方案 for other applications.

2. Attributes can have high impact on the performance of the learning algorith ms. 特征对性能的影响

The use of system performance attributes showed to be relevant for execution time prediction whereas application specific attributes were pertinent for all scenarios.

This work makes the case for including as many attributes as available, while letting the algorithms analyze the relevance of the attributes when necessary. For cloud and grid computing scenarios, where resources are outsourced, the provision of this informat ion to its users (or services acting on behalf of the users) through the use of benchmarks and runtime mon itoring,especially of shared resources, can bring several benefits.

* Amazon CloudWatch is one such example limited to a virtual mach ine instance. AWS云端资源监控
改进预测能更好的利用系统资源，避免系统应用中断，量入为出的使用资源

转载于:https://my.oschina.net/lfxu/blog/1507313

On the use of machine learning to predict the time and resources consumed by applications

相关推荐