[All-English warning!] How to compare the similarity of two vectors
Jaccard Similarity
Basic Concept
A statistic used for measuring the similarity and diversity of sample sets [2].
The Jaccard coefficient measures the similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets.
$$J(A,B)=\frac{|A\cap B|}{|A\cup B|}=\frac{|A\cap B|}{|A|+|B|-|A\cap B|}$$
Tips:
- If A and B are both empty, define $J(A,B)=1$
- $0\le J(A,B)\le 1$
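The definition and the empty-set convention can be sketched in a few lines of Python. This is only a minimal illustration; the function name `jaccard_similarity` is made up for this note and is not from any library.

```python
def jaccard_similarity(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two finite collections."""
    a, b = set(a), set(b)
    if not a and not b:
        # Convention from the tips: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)


print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # 2 shared / 4 in the union = 0.5
print(jaccard_similarity(set(), set()))          # 1.0 by convention
```

Because the intersection can never be larger than the union, the result always stays between 0 and 1.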
The Jaccard distance
The Jaccard distance measures the dissimilarity between sample sets and is complementary to the Jaccard coefficient:
$$d_{J}(A,B)=1-J(A,B)=\frac{|A\cup B|-|A\cap B|}{|A\cup B|}$$
An alternative interpretation of the Jaccard distance is the ratio of the size of the symmetric difference [1] to the size of the union, where
$$A\Delta B=(A\cup B)-(A\cap B)$$
The Jaccard distance is commonly used to compute an $n\times n$ matrix for clustering and multidimensional scaling (MDS) of $n$ sample sets.
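As a rough sketch of the symmetric-difference form and the pairwise matrix, the snippet below uses Python's `^` set operator. The helper names and the `samples` data are hypothetical, chosen only for illustration.

```python
def jaccard_distance(a, b):
    """Jaccard distance |A Δ B| / |A ∪ B|; 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # J = 1 for two empty sets, so the distance is 0
    return len(a ^ b) / len(union)  # a ^ b is the symmetric difference


def jaccard_distance_matrix(sets):
    """Symmetric n x n matrix (list of lists) of pairwise Jaccard distances."""
    return [[jaccard_distance(s, t) for t in sets] for s in sets]


samples = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}]  # hypothetical data
for row in jaccard_distance_matrix(samples):
    print(row)
```

The matrix has zeros on the diagonal and is symmetric, which is the shape of input that clustering or MDS routines usually expect as a precomputed distance matrix.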
If $\mu$ is a measure on a measurable space $X$, then we define the Jaccard coefficient by
$$J_{\mu}(A,B)=\frac{\mu(A\cap B)}{\mu(A\cup B)}$$
and the Jaccard distance by
$$d_{\mu}(A,B)=1-J_{\mu}(A,B)=\frac{\mu(A\Delta B)}{\mu(A\cup B)}$$
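One simple way to picture the measure version is to assume $\mu$ sums a non-negative weight over the elements of a finite set. That choice of measure, the `weights` table, and the function names below are all assumptions for illustration, not part of the original notes.

```python
def measure(s, weight):
    """mu(S): total weight of the elements of S (an assumed, simple measure)."""
    return sum(weight[x] for x in s)


def jaccard_mu(a, b, weight):
    """J_mu(A, B) = mu(A ∩ B) / mu(A ∪ B) for finite sets A and B."""
    a, b = set(a), set(b)
    union = measure(a | b, weight)
    if union == 0:
        return 1.0  # assumption: mirror the empty-set convention
    return measure(a & b, weight) / union


weights = {"x": 1.0, "y": 2.0, "z": 1.0}  # hypothetical weights
A, B = {"x", "y"}, {"y", "z"}
print(jaccard_mu(A, B, weights))       # mu({y}) / mu({x, y, z}) = 2 / 4 = 0.5
print(1 - jaccard_mu(A, B, weights))   # corresponding distance, also 0.5 here
```

With all weights equal to 1 this reduces to the ordinary Jaccard coefficient on set sizes.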
Cosine Similarity [3]
A measure of similarity between two non-zero vectors of an inner product space [3].
Definition
$$\text{similarity}=\cos(\theta)=\frac{A\cdot B}{\|A\|\,\|B\|}=\frac{\sum_{i=1}^{n}A_{i}B_{i}}{\sqrt{\sum_{i=1}^{n}A_{i}^{2}}\sqrt{\sum_{i=1}^{n}B_{i}^{2}}}$$
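The formula translates almost term by term into code. The sketch below uses only the standard library; the function name is illustrative, and raising an error for zero vectors is a design choice made here because the definition only covers non-zero vectors.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))        # numerator: sum of A_i * B_i
    norm_a = math.sqrt(sum(x * x for x in a))     # ||A||
    norm_b = math.sqrt(sum(y * y for y in b))     # ||B||
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [0, 1]))  # orthogonal vectors give 0.0
print(cosine_similarity([3, 4], [6, 8]))  # same direction, different length
```

Only the direction of the vectors matters: rescaling either input by a positive factor leaves the result unchanged.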
Normalisation issues
If $A=[A_{1},A_{2}]^{T}$, then the mean vector is
$$\bar{A}=\left[\frac{A_{1}+A_{2}}{2},\frac{A_{1}+A_{2}}{2}\right]^{T}$$
so the centered vector is
$$A-\bar{A}=\left[\frac{A_{1}-A_{2}}{2},\frac{-A_{1}+A_{2}}{2}\right]^{T}$$
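A small sketch of the centering step follows, assuming this section is about subtracting each vector's mean before taking the cosine similarity (sometimes called centered cosine similarity). The function names are illustrative only.

```python
import math


def center(v):
    """Subtract the mean of v from every component."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]


def centered_cosine_similarity(a, b):
    """Cosine similarity of the mean-centered vectors."""
    ca, cb = center(a), center(b)
    dot = sum(x * y for x, y in zip(ca, cb))
    norm_a = math.sqrt(sum(x * x for x in ca))
    norm_b = math.sqrt(sum(y * y for y in cb))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("undefined when a centered vector is all zeros")
    return dot / (norm_a * norm_b)


print(center([4, 2]))  # [(4 - 2) / 2, (-4 + 2) / 2] = [1.0, -1.0]
print(centered_cosine_similarity([1, 2, 3], [2, 4, 6]))
```

Computed this way, the centered cosine similarity has the same value as the Pearson correlation coefficient of the two vectors.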
Angular distance and similarity
When the vector elements may be positive or negative:
$$\text{angular distance}=\frac{\cos^{-1}(\text{cosine similarity})}{\pi}$$
$$\text{angular similarity}=1-\text{angular distance}$$
When the vector elements are always positive:
$$\text{angular distance}=\frac{2\cdot\cos^{-1}(\text{cosine similarity})}{\pi}$$
$$\text{angular similarity}=1-\text{angular distance}$$
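Both cases can be handled by one helper with a flag. This is a sketch under stated assumptions: the `nonnegative` parameter name is made up here, and the clamp is an added safeguard because a computed cosine similarity can drift slightly outside $[-1, 1]$ through rounding.

```python
import math


def angular_distance(cos_sim, nonnegative=False):
    """Angular distance in [0, 1] from a cosine similarity value.

    Set nonnegative=True when all vector elements are known to be positive,
    so the angle cannot exceed pi / 2 and is rescaled by a factor of 2.
    """
    cos_sim = max(-1.0, min(1.0, cos_sim))  # guard against rounding error
    angle = math.acos(cos_sim)
    scale = 2.0 if nonnegative else 1.0
    return scale * angle / math.pi


def angular_similarity(cos_sim, nonnegative=False):
    """Angular similarity = 1 - angular distance."""
    return 1.0 - angular_distance(cos_sim, nonnegative)


print(angular_distance(0.0))                    # right angle, general case
print(angular_distance(0.0, nonnegative=True))  # right angle, positive case
```

Unlike `1 - cosine similarity`, the angular distance is a proper metric, which is the usual reason for converting.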
Otsuka-Ochiai coefficient
Useful in set-similarity research; here $A$ and $B$ are both sets.
$$K=\frac{|A\cap B|}{\sqrt{|A|\times |B|}}$$
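A minimal sketch of the coefficient on Python sets is below. Returning 0.0 when either set is empty is an assumption made here, since the formula leaves that case undefined.

```python
import math


def ochiai_coefficient(a, b):
    """Otsuka-Ochiai coefficient |A ∩ B| / sqrt(|A| * |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0  # assumption: the formula is undefined for an empty set
    return len(a & b) / math.sqrt(len(a) * len(b))


print(ochiai_coefficient({1, 2, 3, 4}, {3, 4, 5, 6}))  # 2 / sqrt(16) = 0.5
```

If each set is encoded as a 0/1 indicator vector, this value coincides with the cosine similarity of the two vectors.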
References
[1] Symmetric difference
[2] Jaccard index
[3] Cosine similarity