[All-English warning!] How to compare the similarity of two vectors

Jaccard Similarity

Basic Concept

A statistic used for measuring the similarity and diversity of sample sets $^{[2]}$.

The Jaccard coefficient measures the similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets:

$$J(A,B)=\frac{|A\cap B|}{|A\cup B|}=\frac{|A\cap B|}{|A|+|B|-|A\cap B|}$$

tips:

  1. If $A$ and $B$ are both empty, define $J(A,B)=1$.

  2. $0\le J(A,B)\le 1$
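The definition and the two tips above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `jaccard_similarity` is made up for this example.

```python
def jaccard_similarity(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two finite collections."""
    a, b = set(a), set(b)
    if not a and not b:
        # Convention from tip 1: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Two elements are shared out of four distinct elements in total.
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))
# Disjoint sets share nothing, so the coefficient is at its minimum.
print(jaccard_similarity({1}, {2}))
# Both sets empty: the convention applies.
print(jaccard_similarity(set(), set()))
```

Because the intersection is never larger than the union, the result always stays within the range stated in tip 2.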

The Jaccard distance
The Jaccard distance measures dissimilarity between sample sets and is complementary to the Jaccard coefficient:

$$d_{J}(A,B)=1-J(A,B)=\frac{|A\cup B|-|A\cap B|}{|A\cup B|}$$

An alternative interpretation of the Jaccard distance is as the ratio of the size of the symmetric difference $^{[1]}$ to the size of the union, where the symmetric difference is

$$A\Delta B=(A\cup B)-(A\cap B)$$
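As a quick check, the two ways of computing the Jaccard distance (one minus the coefficient, and symmetric difference over union) can be compared in Python. This is a sketch with hypothetical function names; returning 0 for two empty sets is an assumption that follows from the convention $J(A,B)=1$ in that case.

```python
import math


def jaccard_distance(a, b):
    """Jaccard distance computed as 1 - J(A, B)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: J = 1 for two empty sets, so the distance is 0
    return 1.0 - len(a & b) / len(union)


def jaccard_distance_symdiff(a, b):
    """Jaccard distance computed as |A Δ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # same assumption as above
    return len(a ^ b) / len(union)  # ^ is Python's symmetric difference


A = {"a", "b", "c"}
B = {"b", "c", "d", "e"}
# Both routes should agree up to floating-point rounding.
print(math.isclose(jaccard_distance(A, B), jaccard_distance_symdiff(A, B)))
```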

The Jaccard distance is commonly used to calculate an $n\times n$ distance matrix for clustering and multidimensional scaling (MDS) of $n$ sample sets.

If $\mu$ is a measure on a measurable space $X$, then we define the Jaccard coefficient by

$$J_{\mu}(A,B)=\frac{\mu(A\cap B)}{\mu(A\cup B)}$$

and

$$d_{\mu}(A,B)=1-J_{\mu}(A,B)=\frac{\mu(A\Delta B)}{\mu(A\cup B)}$$
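The step from $1-J_{\mu}$ to the symmetric-difference form can be spelled out as follows, assuming $0<\mu(A\cup B)<\infty$ (an assumption added here so that the ratios are well defined). Since $A\cup B$ is the disjoint union of $A\Delta B$ and $A\cap B$, additivity of $\mu$ gives $\mu(A\cup B)=\mu(A\Delta B)+\mu(A\cap B)$, and therefore

$$
\begin{aligned}
1-J_{\mu}(A,B) &= \frac{\mu(A\cup B)-\mu(A\cap B)}{\mu(A\cup B)} \\
&= \frac{\mu(A\Delta B)}{\mu(A\cup B)}
\end{aligned}
$$

With $\mu$ taken as the counting measure on finite sets, this reduces to the set-size formula for $d_{J}(A,B)$.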


Cosine Similarity $^{[3]}$

A measure of similarity between two non-zero vectors of an inner product space $^{[3]}$.

Definition

$$\text{similarity}=\cos(\theta)=\frac{A\cdot B}{\|A\|\|B\|}=\frac{\sum_{i=1}^{n}A_{i}B_{i}}{\sqrt{\sum_{i=1}^{n}A_{i}^{2}}\sqrt{\sum_{i=1}^{n}B_{i}^{2}}}$$
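The definition translates almost term by term into Python. The sketch below uses only the standard library; the function name and the choice to raise `ValueError` for zero vectors are assumptions made for this example, since the definition only covers non-zero vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))       # numerator: A . B
    norm_a = math.sqrt(sum(x * x for x in a))    # ||A||
    norm_b = math.sqrt(sum(y * y for y in b))    # ||B||
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [0, 1]))        # orthogonal vectors
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction
print(cosine_similarity([1, 0], [-1, 0]))       # opposite direction
```

Only the direction of the vectors matters: scaling either input by a positive constant leaves the result unchanged.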

Normalisation issues

Centering a vector subtracts the mean of its components from each component. For example:

$$\text{if}\ A=[A_{1},A_{2}]^{T}$$

$$\text{then}\ \bar{A}=\left[\frac{A_{1}+A_{2}}{2},\frac{A_{1}+A_{2}}{2}\right]^{T}$$

$$\text{so}\ A-\bar{A}=\left[\frac{A_{1}-A_{2}}{2},\frac{-A_{1}+A_{2}}{2}\right]^{T}$$
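The centering arithmetic above can be tried out in Python. Applying cosine similarity to the centered vectors is an assumption about what this section is leading to; that quantity is commonly called centered cosine similarity and coincides with the Pearson correlation coefficient. All function names here are hypothetical.

```python
import math


def center(v):
    """Subtract the mean of the components from every component."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]


def cosine_similarity(a, b):
    """Plain cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("undefined for zero vectors")
    return dot / (norm_a * norm_b)


def centered_cosine_similarity(a, b):
    """Cosine similarity after centering both vectors (assumed use of centering)."""
    return cosine_similarity(center(a), center(b))


# For A = [3, 1]: mean is 2, so A - mean = [(3 - 1) / 2, (-3 + 1) / 2].
print(center([3.0, 1.0]))
# Centering can change the result: compare the raw and centered values.
print(cosine_similarity([1, 2, 3], [3, 2, 1]))
print(centered_cosine_similarity([1, 2, 3], [3, 2, 1]))
```

A vector whose components are all equal becomes the zero vector after centering, so the centered variant is undefined for it and this sketch raises `ValueError`.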

Angular distance and similarity

When the vector elements may be positive or negative:

$$\text{angular distance}=\frac{\cos^{-1}(\text{cosine similarity})}{\pi}$$

$$\text{angular similarity}=1-\text{angular distance}$$

When the vector elements are always positive:

$$\text{angular distance}=\frac{2\cdot\cos^{-1}(\text{cosine similarity})}{\pi}$$

$$\text{angular similarity}=1-\text{angular distance}$$
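Both cases can be covered by one small Python helper. This is a sketch under stated assumptions: the `nonnegative` flag and the function names are made up for this example, and the clamping step is added only to guard against rounding pushing a cosine value slightly outside $[-1, 1]$.

```python
import math


def angular_distance(cos_sim, nonnegative=False):
    """Angular distance derived from a cosine similarity value.

    nonnegative=False: elements may be positive or negative (divide by pi).
    nonnegative=True:  elements are always positive (the factor 2 applies).
    """
    cos_sim = max(-1.0, min(1.0, cos_sim))  # guard against rounding error
    angle = math.acos(cos_sim)              # inverse cosine, in radians
    factor = 2.0 if nonnegative else 1.0
    return factor * angle / math.pi


def angular_similarity(cos_sim, nonnegative=False):
    """Angular similarity is one minus the angular distance."""
    return 1.0 - angular_distance(cos_sim, nonnegative)


print(angular_distance(0.0))                    # orthogonal, signed elements
print(angular_distance(0.0, nonnegative=True))  # orthogonal, positive elements
print(angular_similarity(1.0))                  # identical direction
```

The factor 2 reflects that vectors with only positive elements can be at most a right angle apart, so the distance still spans the full range from 0 to 1.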

Otsuka-Ochiai coefficient

Useful in set-similarity research; here $A$ and $B$ are both sets.

$$K=\frac{|A\cap B|}{\sqrt{|A|\times|B|}}$$
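A minimal Python sketch of the coefficient follows. The function name is hypothetical, and raising `ValueError` for an empty set is an assumption, since the formula divides by zero in that case.

```python
import math


def ochiai_coefficient(a, b):
    """Otsuka-Ochiai coefficient |A ∩ B| / sqrt(|A| * |B|) of two sets."""
    a, b = set(a), set(b)
    if not a or not b:
        # Assumption: treat the coefficient as undefined for an empty set.
        raise ValueError("both sets must be non-empty")
    return len(a & b) / math.sqrt(len(a) * len(b))


# 2 shared elements, sets of size 4 and 4: 2 / sqrt(16).
print(ochiai_coefficient({1, 2, 3, 4}, {3, 4, 5, 6}))
# Identical sets give the maximum value.
print(ochiai_coefficient({1, 2}, {1, 2}))
```

If each set is encoded as a binary indicator vector, this value equals the cosine similarity of the two vectors, which is why the coefficient appears in this section.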

Reference

[1] Symmetric difference
[2] Jaccard index
[3] Cosine similarity