[Computer Science] [2011.05] [Source Code Included] SVM Classification of Microarray Data and Margin Distance Analysis

This is a master's thesis from the University of Akron, USA (author: Ameer Basha Shaik Abdul), 84 pages in total.

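A support vector machine (SVM) is a statistical classification algorithm that classifies data by separating two classes with a functional hyperplane. SVMs perform well on noisy, high-dimensional data such as microarrays. (Note: A microarray (DNA microarray), also called an oligonucleotide array, is a product of the progressive implementation of the Human Genome Project (HGP) and of the rapid development and application of molecular biology. Inspired by the manufacture and widespread use of computer chips, biologists combined microelectronics, life science, computer science, and photoelectrochemistry to develop this new technology on the basis of earlier nucleic acid hybridization methods (Northern, Southern). It is one of the main technologies of the third revolution (the genomic revolution) and is one type of biochip. The principle is to integrate gene probes of known sequence on a solid surface; the many labeled nucleic acid sequences from the cells or tissue under test hybridize with this probe array, and detecting the hybridized probes at the corresponding positions allows gene information to be read out quickly.)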

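The marginal region of the functional hyperplane is called the "danger zone". It is defined as the region between two parallel hyperplanes, which are determined by the average distances from the support vectors of the two classes to the functional hyperplane. The main aim of this study is to determine how the margin distance, that is, the width of the danger zone, affects classifier accuracy, and to analyze the role of the margin distance in feature selection. The study uses three microarray datasets. For each dataset, the equation of the functional hyperplane separating the two classes is derived and the corresponding support vectors are obtained. The relationship between the width of the danger zone and classification accuracy is investigated, as is the rate of change of the margin distance with respect to the number of features used to construct the SVM.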
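The danger-zone definition above can be illustrated with a minimal Python sketch. It assumes a linear hyperplane w·x + b = 0 and takes the zone width to be the sum of the two per-class average distances; this is one reading of the abstract, not necessarily the thesis's exact formula. The names `distance_to_hyperplane` and `danger_zone` are hypothetical, and the thesis's own code is in MATLAB.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Unsigned Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w

def danger_zone(w, b, support_vectors, labels):
    """Average support-vector distance per class and the resulting zone width.

    Assumption: labels are +1 / -1, each class has at least one support
    vector, and width = avg distance (+1) + avg distance (-1).
    """
    dist = {+1: [], -1: []}
    for x, y in zip(support_vectors, labels):
        dist[y].append(distance_to_hyperplane(w, b, x))
    avg_pos = sum(dist[+1]) / len(dist[+1])
    avg_neg = sum(dist[-1]) / len(dist[-1])
    return avg_pos, avg_neg, avg_pos + avg_neg

# Toy values for illustration only; they are not taken from the thesis.
w, b = [3.0, 4.0], -5.0
svs = [[3.0, 4.0], [1.0, 3.0], [0.0, 0.0], [1.0, -2.0]]
ys = [+1, +1, -1, -1]
avg_pos, avg_neg, width = danger_zone(w, b, svs, ys)
print(avg_pos, avg_neg, width)
```

With a soft-margin SVM, support vectors can lie inside the margin, so the two averages generally differ from each other and from 1/||w||; that is why the sketch averages per class rather than assuming a fixed geometric margin.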

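The results indicate that, although the correlation between margin distance and classification accuracy is not very strong, the rate of change of classification accuracy with respect to margin distance can be used to determine the optimal number of features for constructing a high-performance SVM.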
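As a rough illustration of the quantity mentioned here, the sketch below computes a finite-difference rate of change of accuracy with respect to margin distance across successive feature counts. The function name `accuracy_margin_rates` and the toy numbers are hypothetical. The summary does not say how the thesis turns this rate into a choice of feature count, so no selection rule is implemented.

```python
def accuracy_margin_rates(feature_counts, margins, accuracies):
    """Finite-difference d(accuracy)/d(margin) between consecutive feature counts.

    Returns a list of (feature_count, rate) pairs, one per step. The rate is
    None where the margin did not change, because the ratio is undefined there.
    """
    rates = []
    for i in range(1, len(feature_counts)):
        d_margin = margins[i] - margins[i - 1]
        d_accuracy = accuracies[i] - accuracies[i - 1]
        rate = d_accuracy / d_margin if d_margin != 0 else None
        rates.append((feature_counts[i], rate))
    return rates

# Toy values for illustration only; they are not results from the thesis.
counts = [10, 20, 30]
margins = [1.0, 1.5, 2.5]        # margin distance for each feature count
accuracies = [80.0, 90.0, 92.0]  # classification accuracy in percent
for n_features, rate in accuracy_margin_rates(counts, margins, accuracies):
    print(n_features, rate)
```

In practice the margins and accuracies would come from retraining the SVM on the top-ranked features at each count, for example features ranked by the t-test listed in the appendices.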

Support vector machine is statistical classification algorithm that classifies data by separating two classes with the help of a functional hyper plane. SVM is known for good performance on noisy and high dimensional data such as microarray. A marginal region of functional hyper plane named "danger zone" is defined to be the region between two parallel hyper planes that are determined by the average distances of the support vectors from the two classes to functional hyper plane. The main aim of this study was to determine the effect of margin distance, the width of the danger zone, on the accuracy of the classifier and to analyze the role of margin distance in feature selection. The study was carried out using three microarray datasets. For each dataset, equation of functional hyper plane separating the two classes of data was derived. The corresponding support vectors were obtained. The average distances between support vectors from the two classes to functional hyper plane were calculated. The relations between the width of the danger zone and the classification accuracy were investigated. The rate of change of the margin distance with respect to the number of features used for constructing the support vector machine was also examined. The results indicate that although correlation between margin and accuracy is not very strong, but the rate of change of classification accuracy with respect to margin distance can be employed to determine the optimal number of features for constructing high performance support vector machine for classifying microarray samples.

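1 Introduction

2 Review of Related Literature

3 Research Data and Methods

4 Results and Discussion

5 Conclusion

Appendix: MATLAB Source Code

Appendix A: Randomly Generating Training and Test Data

Appendix B: Scaling the Training and Test Datasets

Appendix C: T-Test on the Scaled Training Data

Appendix D: Computing the Margin Distance of the SVM Classifier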

Download link for the original English text:

http://page2.dfpan.com/fs/3lcj02214291a659985/
