Huffman Codes and Data Compression

哈夫曼编码让不同字符之间代码长度不同,保证经常出现的字符的代码要短。

Prefix Code

浅层定义:没有字符代码是别的字符代码的前缀。

定义:for a set SS of letters, a prefix code is a function γ\gamma that maps each letter xSx \in S to some sequence of zeros and ones, in such way that for distinct x,ySx, y \in S, the sequence γ(x)\gamma(x) is not a prefix of the sequence γ(y)\gamma(y).

Optimal Prefix Codes

假设对于任意一个字母 xSx \in S, fxf_x 代表字母 xx 出现的频率。

  • 假设总共有 nn 个字母,其中 nfxn f_x 个字母是 xx.
  • xSfx=1\sum_{x\in S} f_x = 1

γ(x)|\gamma(x)| 表示 γ(x)\gamma(x) 的长度,那么
encoding length=xSnfxγ(x)=nxSfxγ(x)\text{encoding length}=\sum_{x \in S} n f_x |\gamma(x)| = n \sum_{x \in S} f_x |\gamma(x)|

定义: the average number of bits required per letter ABL(γ)=xSfxγ(x)ABL(\gamma)=\sum_{x \in S} f_x |\gamma(x)|

目标:minimize ABL(γ)ABL(\gamma)

算法设计

A binary tree TT with the number of leaves equal to the size of the alphabet SS, and we label each leaf with a distinct letter in SS.

定理 (4.27): The encoding of SS constructed from TT is a prefix code.
证明:(反证法)
γ(x)\gamma(x)γ(y)\gamma(y) 的 prefix,the path from the root to xx would have to be a prefix of the path from
the root to yy
但是,这等同于说 xx would lie on the path from root to yy
xx 是 leaf 矛盾

ABL(T)=xSfxdepthT(x)ABL(T) = \sum_{x \in S} f_x \cdot \text{depth}_T(x)

定理 (4.28): The binary tree corresponding to the optimal prefix code is full.
证明:(exchange argument)
Let TT denote the binary tree corresponding to the optimal prefix code, and suppose it contains
a node uu with exactly one child vv.
\quad
Convert TT into a tree TT' by replacing node uu with vv.
分两类情况讨论:
1.  u\; u 是 root: 删除 uu, 并将 vv 变成 root。
2.  u\; u 不是 root:
\quadwwuu 的 parent in TT.
\quad 删除 uu,并让 vv 成为 ww 的 child.
\quad 这个改变 decrease the number of bits needed to encode any leaf in the subtree rooted at node uu,
\quad and it does not affect other leaves
  ABL(T)<ABL(T)\quad \; \therefore ABL(T')<ABL(T)
\quad 与 optimality of TT 矛盾

定理 (4.29): 假设已知 optimal binary tree TT^*。Suppose that uu and vv are lives of TT^*, such that depth(u)<depth(v)depth(u) < depth(v). Further, suppose that in a labeling of TT^* corresponding to an optimal prefix code, leaf uu is labeled with ySy\in S and leaf vv is labeled with zSz \in S. Then, fyfzf_y \geq f_z.
证明:用 exchange argument 证明,交换 u,vu, v 的 label。

定理 (4.30): Consider a leaf vv in TT^* with largest depth. Leaf vv has a parent uu, and by (4.28) TT^* is a full binary tree, so uu has another child ww. Then, ww is a leaf of TT^*. 反证法

定理 (4.31): There is an optimal prefix code, with corresponding tree TT^*, in which the two lowest-frequency letters are assigned to leaves that are siblings in TT^*.

算法伪代码

To construct a prefix code for an alphabet SS, with given frequencies:

  1. If SS has two letters then
  2. \quadEncode one letter using 0 and the other letter using 1
  3. Else
  4. \quadLet yy^* and zz^* be the two lowest-frequency letters
  5. \quadForm a new alphabet SS' by deleting yy* and zz^* and replacing them with a new letter ww of frequency fy+fzf_{y^*}+f_{z^*}
  6. \quadRecursively construct a prefix code γ\gamma' for SS', with tree TT'
  7. \quadDefine a prefix code for SS as follows:
  8. \quad \quadStart with TT'
  9. \quad \quadTake the leaf labeled ww and add two children below it, labeled yy^* and zz^*
  10. EndIf
    Huffman Codes and Data Compression