Hashing (Introduction to Algorithms, CLRS) Study Notes

Hashing

  • Hashing fundamentals
  • Application: Unordered sets
  • Hash table with chaining
  • Practical universal hash functions
  • Application: Coordinated sampling

1. Hashing fundamentals

  • Notation:

    • For $n\in\mathbb{N}$:

      $[n]=\{0,\dots,n-1\}$

      $[n]_+=\{1,\dots,n-1\}$

    • Expectation of an indicator variable:

      $E[X]=\Pr[X=1]$

      $E[\sum_i X_i]=\sum_i E[X_i]$

1.1 Hash function

Given a universe $U$ of keys and a positive integer $m$.

Definition

  • A (random) hash function $h:U\rightarrow[m]$ is a randomly chosen function from $U$ to $[m]$. Equivalently, it is a function $h$ such that for each $x\in U$, $h(x)\in[m]$ is a random variable.

    (We randomly project the original universe $U$ onto the set $[m]$.)

  • Three things we care about:

    • Space needed to represent $h$;
    • Time needed to evaluate $h(x)$ given $x\in U$;
    • Properties of the random variable $h(x)$.
  • A truly random hash function is impractical to store and evaluate, so we introduce universal hashing instead.

1.2 Hash function types

  • $c$-approximately universal: for all $x\neq y\in U$: $\Pr[h(x)=h(y)]\le\frac{c}{m}$ (collision probability).
  • $c$-approximately strongly universal (a.k.a. 2-independent):
    • for all $x\neq y\in U$ and all $q,r\in[m]$: $\Pr[h(x)=q\land h(y)=r]\le\frac{c}{m^2}$;
    • with $c=1$: each key is hashed uniformly into $[m]$, and any two distinct keys hash independently.

2. Application: Unordered sets

We maintain a set $S$ of at most $n$ keys from some unordered universe $U$, supporting the following operations:

  • INSERT(x,S): insert key $x$ into $S$;

  • DELETE(x,S): delete key $x$ from $S$;

  • MEMBER(x,S): return whether $x\in S$.

We want these operations to run in expected constant time.

3. Hash table with chaining

In a hash table with chaining, we keep an array $L$ of $m$ linked lists and store each key $x\in S$ in the list $L[h(x)]$.

Theorem: for $x\notin S$, $E\big[|L[h(x)]|\big]\le1$ (assuming $h$ is universal and $m\ge n$).

Proof:

$$
\begin{aligned}
E[|L[h(x)]|]&=E[|\{y\in S\mid h(y)=h(x)\}|]\quad\text{(count the keys of }S\text{ colliding with }x\text{)}\\
&=E\Big[\sum_{y\in S}[h(y)=h(x)]\Big]\\
&=\sum_{y\in S}E\big[[h(y)=h(x)]\big]\\
&=\sum_{y\in S}\Pr[h(y)=h(x)]\\
&\le|S|\cdot\frac{1}{m}=\frac{n}{m}\le1
\end{aligned}
$$
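Putting this together, a minimal sketch of a chained hash table supporting the operations from section 2. Here Python's built-in `hash`, mixed with a random per-table seed, merely stands in for the universal hash functions discussed in the next section:

```python
import random

class ChainedHashSet:
    """Unordered set backed by a hash table with chaining."""

    def __init__(self, m):
        self.m = m                           # number of buckets
        self.L = [[] for _ in range(m)]      # L[i] is the chain for hash value i
        self.seed = random.getrandbits(32)   # randomizes the hash per table

    def _h(self, x):
        # Stand-in hash: built-in hash mixed with the random seed, reduced mod m.
        return hash((self.seed, x)) % self.m

    def insert(self, x):
        chain = self.L[self._h(x)]
        if x not in chain:
            chain.append(x)

    def delete(self, x):
        chain = self.L[self._h(x)]
        if x in chain:
            chain.remove(x)

    def member(self, x):
        return x in self.L[self._h(x)]
```

By the theorem above, each operation scans a chain of expected length at most $1$ when $m\ge n$, so all three operations run in expected constant time.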

4. Practical universal hash functions

  • A fixed hash function is of course not random; but if we choose its parameters $a,b$ independently and uniformly at random, the resulting family of functions becomes universal.
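The concrete scheme shown in the images lost from these notes is presumably the classic multiply-mod-prime family; a sketch under that assumption (the prime $p=2^{61}-1$ is an illustrative choice, and keys are assumed to be integers smaller than $p$):

```python
import random

def make_universal_hash(m, p=2**61 - 1):
    """Carter-Wegman style family: h(x) = ((a*x + b) mod p) mod m.

    p is a prime larger than any key; a in [p]_+ and b in [p] are chosen
    independently and uniformly at random, which makes the family universal.
    """
    a = random.randrange(1, p)   # a in [p]_+ = {1, ..., p-1}
    b = random.randrange(p)      # b in [p]  = {0, ..., p-1}
    return lambda x: ((a * x + b) % p) % m
```

Once $a$ and $b$ are fixed, the function is deterministic, so equal keys always land in the same bucket; the randomness lies only in the initial choice of $(a,b)$.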

5. Application: Coordinated sampling

  • There are several agents; $S_i$ is the sample that agent $i$ draws from the set of events $A_i$ it observes.
  • "Coordinated" means that samples from different agents can be combined, because all agents use the same hash function.
  • By definition, we sample exactly the events $x$ whose hash value satisfies $h(x)<t$.
  • There are $m$ possible hash values, of which $t$ satisfy $h(x)<t$. Since $h$ is strongly universal, $h(x)$ is uniform in $[m]$, so $\Pr[h(x)<t]=\frac{t}{m}$.
  • Then for any event set $A$, the expected sample size is $E[|S_{h,t}(A)|]=|A|\cdot\frac{t}{m}$, which gives an unbiased estimate of the size of $A$: $|A|\approx\frac{m}{t}|S_{h,t}(A)|$.
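A sketch of the scheme. The shared hash below is the multiply-mod-prime family as an assumed stand-in for a strongly universal function, and the sets `A1`, `A2` are made-up data:

```python
import random

m, t = 2**20, 2**14              # hash range [m] and sampling threshold t
p = 2**61 - 1                    # prime modulus for the shared hash
a, b = random.randrange(1, p), random.randrange(p)

def h(x):
    # One hash function shared by ALL agents: this sharing is what makes
    # the sampling "coordinated".
    return ((a * x + b) % p) % m

def sample(A):
    """S_{h,t}(A): keep exactly the events whose hash value is below t."""
    return {x for x in A if h(x) < t}

def estimate(S):
    """Unbiased estimate of |A| from its sample S: |A| ~ (m/t)*|S|."""
    return (m / t) * len(S)

A1 = set(range(0, 60_000))        # events observed by agent 1
A2 = set(range(40_000, 100_000))  # events observed by agent 2 (overlapping)

# Because the hash is shared, samples combine: the union of the samples
# is exactly the sample of the union, so |A1 ∪ A2| can be estimated
# from the agents' samples alone.
assert sample(A1) | sample(A2) == sample(A1 | A2)
```

With an independent sample per agent this identity would fail: the overlap $A_1\cap A_2$ would be sampled twice, and the union of samples would not correspond to any single sampling of $A_1\cup A_2$.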

Analysis of our estimate

By Chebyshev's inequality (using $\mathrm{Var}[X]\le\mu$, which holds for a sum of pairwise independent indicator variables): $\Pr[|X-\mu|\ge q\sqrt{\mu}]\le\frac{1}{q^2}$.

Let $X_a=[h(a)<t]$ and $X=\sum_{a\in A}X_a=|S_{h,t}(A)|$. For any distinct $a,b\in A$, $X_a$ and $X_b$ are independent (pairwise independence). Also, let $\mu=E[X]=\frac{t}{m}|A|$.

  • $h$ must be uniform for the estimate to be unbiased, and pairwise independent for the concentration bound above; strong universality gives both.
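A quick empirical check of this concentration. A truly random $h$ is simulated here, which is in particular uniform and pairwise independent, so each event is sampled independently with probability $t/m$:

```python
import random

def relative_error(size=10_000, m=2**16, t=2**10, trials=200):
    """Average relative error of the estimate (m/t)*X over random h."""
    mu = t / m * size                    # mu = E[X] = (t/m)|A|
    total = 0.0
    for _ in range(trials):
        # For a truly random h, each of the |A| events lands below t
        # independently with probability t/m, so X = |S_{h,t}(A)| is
        # a binomial count.
        X = sum(random.random() < t / m for _ in range(size))
        total += abs(X - mu) / mu
    return total / trials

# Var[X] <= mu, so typical relative deviations are about 1/sqrt(mu):
# here mu = 156.25, suggesting errors on the order of 8%.
```

Increasing the threshold $t$ (a larger sample) increases $\mu$ and shrinks the relative error, at the cost of storing more sampled events.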

6. Additional application
