Spark(day09) -- MLlib(1)

I. Overview

https://en.wikipedia.org/wiki/Machine_learning


II. Algorithms

(1) Linear Regression

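A minimal sketch of fitting a linear regression with Spark MLlib's DataFrame API; the toy data, column names, and the maxIter value are illustrative assumptions, not anything prescribed by this post.

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object LinearRegressionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("LinearRegressionSketch").master("local[*]").getOrCreate()

    // Toy data lying exactly on the line label = 2 * x + 1.
    val df = spark.createDataFrame(Seq(
      (3.0, Vectors.dense(1.0)),
      (5.0, Vectors.dense(2.0)),
      (7.0, Vectors.dense(3.0))
    )).toDF("label", "features")

    val model = new LinearRegression().setMaxIter(10).fit(df) // maxIter is an illustrative choice
    println(s"coefficients=${model.coefficients} intercept=${model.intercept}")

    spark.stop()
  }
}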


(2) K-Nearest Neighbor

In essence, the KNN algorithm uses distance to measure the similarity between samples.

The algorithm involves three main factors:

1) the training data set;

2) the distance or similarity metric;

3) the value of k.

[Figure: blue squares and red triangles scattered in a two-dimensional space, with an unknown green point to classify]


Algorithm description:

1) Two kinds of "prior" data are known, namely blue squares and red triangles, distributed in a two-dimensional space.

2) There is an unknown data point (the green point) that must be judged to belong to either "blue square" or "red triangle".

3) Examine the categories of the three (or k) data points closest to the green point; the majority category among them becomes the green point's classification.


The calculation steps are as follows (a code sketch follows the list):

1) Compute distances: for the given test object, compute its distance to every object in the training set.

2) Find neighbors: select the k training objects nearest to the test object as its neighbors.

3) Classify: assign the test object to the majority category among its k neighbors.
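Below is a minimal, self-contained sketch of these three steps in plain Scala (KNN is not a built-in MLlib algorithm, so everything here, including the object name and the toy data, is illustrative):

object KnnSketch {
  type Point = Array[Double]

  // Euclidean distance between two feature vectors.
  def euclidean(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Classify `query` by majority vote among its k nearest labeled neighbors.
  def classify(train: Seq[(Point, String)], query: Point, k: Int): String =
    train
      .map { case (p, label) => (euclidean(p, query), label) } // 1) distance to every training object
      .sortBy(_._1)                                            // 2) rank by distance...
      .take(k)                                                 //    ...and keep the k nearest
      .groupBy(_._2)                                           // 3) group the neighbors by category
      .maxBy(_._2.size)._1                                     //    the majority category wins

  def main(args: Array[String]): Unit = {
    val train = Seq(
      (Array(1.0, 1.0), "blue square"),
      (Array(1.2, 0.8), "blue square"),
      (Array(3.0, 3.2), "red triangle"),
      (Array(3.1, 2.9), "red triangle")
    )
    println(classify(train, Array(1.1, 1.0), k = 3)) // prints: blue square
  }
}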


Measuring similarity.

1. The closer two points are, the more likely they are to belong to the same category.

However, distance is not everything; some data are not well suited to distance-based measures.

2. Common similarity measures include Euclidean distance, the cosine of the angle between vectors, and so on.
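Written out (these are the standard textbook definitions, added here for reference), the two measures for n-dimensional vectors x and y are:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}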


Classification decision rules.

1. Simple voting: the minority follows the majority; the test point is assigned to the category held by most of its nearest neighbors.

2. Weighted voting: each nearest neighbor's vote is weighted by its distance, with closer neighbors receiving greater weight (the weight is the reciprocal of the squared distance).
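A sketch of the weighted rule, written as a method to add to the KnnSketch object above (the 1/d² weighting follows the parenthetical remark; the small epsilon guarding against a zero distance is an added assumption):

  // Weighted voting: each of the k nearest neighbors votes with weight 1 / d^2.
  def weightedClassify(train: Seq[(Point, String)], query: Point, k: Int): String =
    train
      .map { case (p, label) => (euclidean(p, query), label) }
      .sortBy(_._1)
      .take(k)
      .groupBy(_._2)
      .map { case (label, ns) =>
        // Total weight per category; 1e-9 avoids division by zero for exact matches.
        (label, ns.map { case (d, _) => 1.0 / (d * d + 1e-9) }.sum)
      }
      .maxBy(_._2)._1 // the category with the largest total weight wins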


Shortcomings of the algorithm.

1. Sample imbalance easily leads to incorrect results.

If one class has a very large sample size and another a very small one, the k neighbors of a new sample may be dominated by the large class purely because of its size.

Improvement: use weights, so that neighbors closer to the sample carry greater weight (as in the weighted voting rule above).



2. Heavy computation.

Every sample to be classified must have its distance to the entire known sample set computed before its k nearest neighbors can be found.

Improvement: edit the known sample points in advance, removing samples that contribute little to classification.

This method suits classes with large sample sizes, while classes with small sample sizes become more prone to misclassification.


(3) K-Means

K-means is a simple, classical distance-based clustering algorithm.

Distance is used as the similarity criterion: the closer two objects are, the greater their similarity.

The algorithm assumes that a cluster is composed of objects close to one another, so its ultimate goal is to obtain compact, well-separated clusters.



The core idea

Iteratively search for a partition of the data into k clusters such that, when each cluster is represented by the mean of its samples, the total error over all samples is minimized.

The resulting k clusters should have two properties: each cluster itself is as compact as possible, and the clusters are as separated from one another as possible.

The k-means algorithm is based on the minimum sum-of-squared-errors criterion: the more similar the samples within a cluster are to its mean, the smaller the squared error; summing the squared errors over all k clusters measures how good the partition into k classes is.

This cost function cannot be minimized analytically; only iterative methods are available.
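In symbols, with clusters C_1, ..., C_k and cluster means \mu_1, ..., \mu_k, the criterion just described is the standard sum-of-squared-errors cost (a textbook formulation, not quoted from this post):

J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2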


Algorithm steps.

[Figure: step-by-step illustration of the k-means iterations]

1. Select k initial centroids.

2. Compute the distance from each point to every centroid and assign each point to the class of its nearest centroid.

3. Take the mean of the points in each class and use that mean as the class's new centroid.

4. Repeat steps 2 and 3, reassigning points and recomputing centroids, until the means no longer change (a runnable MLlib sketch follows).
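Rather than hand-rolling this loop, Spark MLlib ships a KMeans implementation. Below is a minimal sketch of using it; the toy data, k = 2, and the seed are illustrative assumptions:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("KMeansSketch").master("local[*]").getOrCreate()

    // Toy 2-D points forming two obvious groups; MLlib expects a vector column of features.
    val df = spark.createDataFrame(Seq(
      Tuple1(Vectors.dense(1.0, 1.0)), Tuple1(Vectors.dense(1.2, 0.8)),
      Tuple1(Vectors.dense(8.0, 8.0)), Tuple1(Vectors.dense(8.1, 7.9))
    )).toDF("features")

    // k is specified by the user -- the first drawback discussed below.
    val model = new KMeans().setK(2).setSeed(1L).fit(df)
    model.clusterCenters.foreach(println) // the learned centroids

    spark.stop()
  }
}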


Drawbacks of the k-means algorithm.

The k-means algorithm is relatively simple, but it has several major disadvantages:

1. The value of k is specified by the user, and different values of k give very different results. As shown in the figure below, the left side shows the result for k = 3, which is too coarse: the blue cluster could be divided into two clusters.

On the right is the result for k = 5, where the red-diamond and blue-diamond clusters should be merged into one:

[Figure: clustering results for k = 3 (left) and k = 5 (right)]

2. The algorithm is sensitive to the choice of the k initial centroids and easily falls into a local minimum.

For example, different runs of the algorithm above can produce different results, such as the two scenarios below.

K-means still converges in such cases, but it converges to a local minimum:

[Figure: two runs of k-means converging to different local minima]

3. The algorithm has limitations; for example, it handles non-spherical data distributions like the following poorly:

[Figure: a non-spherical data distribution that k-means separates poorly]

4. When the data set is large, convergence is slow.


(4) Bayes' theorem

In probability theory and statistics, Bayes' theorem (alternatively Bayes' law or Bayes' rule) describes the probability of an event, based on prior knowledge of conditions that might be related to the event. For example, if cancer is related to age, then, using Bayes' theorem, a person's age can be used to more accurately assess the probability that they have cancer, compared to the assessment of the probability of cancer made without knowledge of the person's age. -- wiki

Stated as a formula, for events A and B with P(B) ≠ 0:

P(A | B) = P(B | A) P(A) / P(B)

where P(A) is the prior probability, P(B | A) the likelihood, and P(A | B) the posterior probability.
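A small worked example of how the theorem updates a prior (the numbers are invented purely for the arithmetic, not taken from the quote): suppose P(cancer) = 0.01, P(positive test | cancer) = 0.9, and P(positive test | no cancer) = 0.05. Then

P(\text{cancer} \mid \text{positive}) = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.05 \times 0.99} = \frac{0.009}{0.0585} \approx 0.154

so a positive result raises the probability of cancer from 1% to about 15%; the small prior keeps the posterior modest.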


Laplace's equation

Laplace's equation and Poisson's equation are the simplest examples of elliptic partial differential equations. The general theory of solutions to Laplace's equation is known as potential theory. The solutions of Laplace's equation are the harmonic functions, which are important in many fields of science, notably the fields of electromagnetism, astronomy, and fluid dynamics, because they can be used to accurately describe the behavior of electric, gravitational, and fluid potentials. In the study of heat conduction, the Laplace equation is the steady-state heat equation. -- wiki
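The equation itself, for a scalar function \varphi in three dimensions, is the standard statement (added for reference, not part of the excerpt above):

\nabla^2 \varphi = \frac{\partial^2 \varphi}{\partial x^2} + \frac{\partial^2 \varphi}{\partial y^2} + \frac{\partial^2 \varphi}{\partial z^2} = 0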