Hashing (Introduction to Algorithms, CLRS) Study Notes

Hashing

  • Hashing fundamentals
  • Application: Unordered sets
  • Hash table with chaining
  • Practical universal hash functions
  • Application: Coordinated sampling

1. Hashing fundamentals

  • Notation:

    • For $n\in\mathbb{N}$:

      $[n]=\{0,\dots,n-1\}$

      $[n]_+=\{1,\dots,n-1\}$

    • Expectation of an indicator variable:

      $E[X]=\Pr[X=1]$

      $E[\sum_i X_i]=\sum_i E[X_i]$

1.1 Hash function

Given a universe $U$ of keys and a positive integer $m$.

Definition

  • A (random) hash function $h:U\rightarrow[m]$ is a randomly chosen function from $U$ to $[m]$. Equivalently, it is a function $h$ such that for each $x\in U$, $h(x)\in[m]$ is a random variable.

    (We randomly project the original universe $U$ onto the set $[m]$.)

  • Three things we care about:

    • Space needed to represent $h$;
    • Time needed to evaluate $h(x)$ given $x\in U$;
    • Properties of the random variable $h(x)$.
  • A truly random hash function is impractical to store and evaluate, so we introduce universal hashing instead.

1.2 Hash function types

  • $c$-approximately universal: for all $x\neq y\in U$: $\Pr[h(x)=h(y)]\le\frac{c}{m}$ (collision probability).
  • $c$-approximately strongly universal (a.k.a. 2-independent):
    • for all $x\neq y\in U$ and all $q,r\in[m]$: $\Pr[h(x)=q\land h(y)=r]\le\frac{c}{m^2}$;
    • with $c=1$: each key is hashed uniformly into $[m]$, and any two distinct keys hash independently.

2. Application: Unordered sets

We maintain a set $S$ of at most $n$ keys from some unordered universe $U$, supporting the following operations:

  • INSERT(x,S): insert key $x$ into $S$;

  • DELETE(x,S): delete key $x$ from $S$;

  • MEMBER(x,S): return whether $x\in S$.

We want these operations to run in expected constant time.

3. Hash table with chaining

In a hash table with chaining, we keep an array $L$ of $m$ linked lists and store each key $x\in S$ in the list $L[h(x)]$.

Theorem: for $x\notin S$, $E\big[|L[h(x)]|\big]\le1$ (assuming $h$ is universal and $m\ge n$).

Proof:

$$
\begin{aligned}
E[|L[h(x)]|]&=E[|\{y\in S\mid h(y)=h(x)\}|]\quad\text{(count the keys of }S\text{ colliding with }x\text{)}\\
&=E\Big[\sum_{y\in S}[h(y)=h(x)]\Big]\\
&=\sum_{y\in S}E\big[[h(y)=h(x)]\big]\\
&=\sum_{y\in S}\Pr[h(y)=h(x)]\\
&\le|S|\cdot\frac{1}{m}=\frac{n}{m}\le1
\end{aligned}
$$
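Putting this together, a minimal sketch of a chained hash table supporting the operations from section 2. Here Python's built-in `hash`, mixed with a random per-table seed, merely stands in for the universal hash functions discussed in the next section:

```python
import random

class ChainedHashSet:
    """Unordered set backed by a hash table with chaining."""

    def __init__(self, m):
        self.m = m                           # number of buckets
        self.L = [[] for _ in range(m)]      # L[i] is the chain for hash value i
        self.seed = random.getrandbits(32)   # randomizes the hash per table

    def _h(self, x):
        # Stand-in hash: built-in hash mixed with the random seed, reduced mod m.
        return hash((self.seed, x)) % self.m

    def insert(self, x):
        chain = self.L[self._h(x)]
        if x not in chain:
            chain.append(x)

    def delete(self, x):
        chain = self.L[self._h(x)]
        if x in chain:
            chain.remove(x)

    def member(self, x):
        return x in self.L[self._h(x)]
```

By the theorem above, each operation scans a chain of expected length at most $1$ when $m\ge n$, so all three operations run in expected constant time.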

4. Practical universal hash functions

  • A fixed hash function is of course not random; but if we choose its parameters $a,b$ independently and uniformly at random, the resulting family of functions becomes universal.
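The concrete scheme shown in the images lost from these notes is presumably the classic multiply-mod-prime family; a sketch under that assumption (the prime $p=2^{61}-1$ is an illustrative choice, and keys are assumed to be integers smaller than $p$):

```python
import random

def make_universal_hash(m, p=2**61 - 1):
    """Carter-Wegman style family: h(x) = ((a*x + b) mod p) mod m.

    p is a prime larger than any key; a in [p]_+ and b in [p] are chosen
    independently and uniformly at random, which makes the family universal.
    """
    a = random.randrange(1, p)   # a in [p]_+ = {1, ..., p-1}
    b = random.randrange(p)      # b in [p]  = {0, ..., p-1}
    return lambda x: ((a * x + b) % p) % m
```

Once $a$ and $b$ are fixed, the function is deterministic, so equal keys always land in the same bucket; the randomness lies only in the initial choice of $(a,b)$.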

5. Application: Coordinated sampling

  • There are several agents; $S_i$ is the sample that agent $i$ draws from the set of events $A_i$ it observes.
  • "Coordinated" means that samples from different agents can be combined, because all agents use the same hash function.
  • By definition, we sample exactly the events $x$ whose hash value satisfies $h(x)<t$.
  • There are $m$ possible hash values, of which $t$ satisfy $h(x)<t$. Since $h$ is strongly universal, $h(x)$ is uniform in $[m]$, so $\Pr[h(x)<t]=\frac{t}{m}$.
  • Then for any event set $A$, the expected sample size is $E[|S_{h,t}(A)|]=|A|\cdot\frac{t}{m}$, which gives an unbiased estimate of the size of $A$: $|A|\approx\frac{m}{t}|S_{h,t}(A)|$.
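A sketch of the scheme. The shared hash below is the multiply-mod-prime family as an assumed stand-in for a strongly universal function, and the sets `A1`, `A2` are made-up data:

```python
import random

m, t = 2**20, 2**14              # hash range [m] and sampling threshold t
p = 2**61 - 1                    # prime modulus for the shared hash
a, b = random.randrange(1, p), random.randrange(p)

def h(x):
    # One hash function shared by ALL agents: this sharing is what makes
    # the sampling "coordinated".
    return ((a * x + b) % p) % m

def sample(A):
    """S_{h,t}(A): keep exactly the events whose hash value is below t."""
    return {x for x in A if h(x) < t}

def estimate(S):
    """Unbiased estimate of |A| from its sample S: |A| ~ (m/t)*|S|."""
    return (m / t) * len(S)

A1 = set(range(0, 60_000))        # events observed by agent 1
A2 = set(range(40_000, 100_000))  # events observed by agent 2 (overlapping)

# Because the hash is shared, samples combine: the union of the samples
# is exactly the sample of the union, so |A1 ∪ A2| can be estimated
# from the agents' samples alone.
assert sample(A1) | sample(A2) == sample(A1 | A2)
```

With an independent sample per agent this identity would fail: the overlap $A_1\cap A_2$ would be sampled twice, and the union of samples would not correspond to any single sampling of $A_1\cup A_2$.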

Analysis of our estimate

By Chebyshev's inequality (using $\mathrm{Var}[X]\le\mu$, which holds for a sum of pairwise independent indicator variables): $\Pr[|X-\mu|\ge q\sqrt{\mu}]\le\frac{1}{q^2}$.

Let $X_a=[h(a)<t]$ and $X=\sum_{a\in A}X_a=|S_{h,t}(A)|$. For any distinct $a,b\in A$, $X_a$ and $X_b$ are independent (pairwise independence). Also, let $\mu=E[X]=\frac{t}{m}|A|$.

  • $h$ must be uniform for the estimate to be unbiased, and pairwise independent for the concentration bound above; strong universality gives both.
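A quick empirical check of this concentration. A truly random $h$ is simulated here, which is in particular uniform and pairwise independent, so each event is sampled independently with probability $t/m$:

```python
import random

def relative_error(size=10_000, m=2**16, t=2**10, trials=200):
    """Average relative error of the estimate (m/t)*X over random h."""
    mu = t / m * size                    # mu = E[X] = (t/m)|A|
    total = 0.0
    for _ in range(trials):
        # For a truly random h, each of the |A| events lands below t
        # independently with probability t/m, so X = |S_{h,t}(A)| is
        # a binomial count.
        X = sum(random.random() < t / m for _ in range(size))
        total += abs(X - mu) / mu
    return total / trials

# Var[X] <= mu, so typical relative deviations are about 1/sqrt(mu):
# here mu = 156.25, suggesting errors on the order of 8%.
```

Increasing the threshold $t$ (a larger sample) increases $\mu$ and shrinks the relative error, at the cost of storing more sampled events.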

6. Additional application
