Compression Codes
Slide 1: Introduction to Data Compression

Data compression seeks to reduce the number of bits used to store or transmit information.

Slide 2: Lecture 1. Source Coding and Statistical Modeling
Alexander Kolesnikov

Outline: Entropy; Example; How to get probabilities; Shannon-Fano code; Huffman code; Prefix codes; Context modeling

Slide 3: Entropy

Set of symbols (alphabet): S = {s_1, s_2, ..., s_N}, where N is the number of symbols in the alphabet.
Probability distribution of the symbols: P = {p_1, p_2, ..., p_N}.
According to Shannon, the entropy H of an information source S is defined as:

    H = -\sum_{i=1}^{N} p_i \cdot \log_2(p_i)

Slide 4: Entropy

The amount of information in symbol s_i, in other words the number of bits needed to code it (its code length), is:

    H(s_i) = -\log_2(p_i)

The average number of bits for the source S is then:

    H = -\sum_{i=1}^{N} p_i \cdot \log_2(p_i)

Slide 5: Entropy of a binary source (N = 2)

S = {0, 1} with p_0 = p and p_1 = 1 - p:

    H = -(p \cdot \log_2(p) + (1 - p) \cdot \log_2(1 - p))

H = 1 bit for p_0 = p_1 = 0.5.

[Figure: H as a function of p on (0, 1), with its maximum of 1 bit at p = 0.5]

Slide 6: Entropy of a uniform distribution (p_i = 1/N)

    H = -\sum_{i=1}^{N} (1/N) \cdot \log_2(1/N) = \log_2(N)

Examples:
N = 2: p_i = 0.5; H = log_2(2) = 1 bit.
N = 256: p_i = 1/256; H = log_2(256) = 8 bits.

[Figure: uniform histogram p_i = 1/N over the symbols s_1, s_2, ..., s_N]

(A short computational sketch of these entropy formulas follows the slides.)

Slide 7: How to get the probability distribution?

1) Static modeling:
   a) The same code table is applied to all input data.
   b) One-pass method (encoding only).
   c) No side information needed.
2) Semi-adaptive modeling:
   a) Two-pass method: (1) analysis, (2) encoding.
   b) Side information needed (model, code table).
3) Adaptive (dynamic) modeling:
   a) One-pass method: analysis and encoding together.
   b) The model is updated during encoding and decoding.
   c) No side information needed.

Slide 8: Static vs. dynamic: an example

S = {a, b, c}; data: a, a, b, a, a, c, a, a, b, a.

1) Static model: p_i = 1/3;
   H = -log_2(1/3) = 1.58 bits.
2) Semi-adaptive model: p_a = 7/10, p_b = 2/10, p_c = 1/10;
   H = -(0.7 log_2 0.7 + 0.2 log_2 0.2 + 0.1 log_2 0.1) = 1.16 bits.

Slide 9: Adaptive method: an example

S = {a, b, c}; data: a, a, b, a, a, c, a, a, b, a.
Every count starts at 1. Before each symbol is coded, its probability is its current count divided by the total count; its count is incremented afterwards.

Step          1     2     3     4     5     6     7     8     9     10
Symbol        a     a     b     a     a     c     a     a     b     a
count(a)      1     2     3     3     4     5     5     6     7     7
count(b)      1     1     1     2     2     2     2     2     2     3
count(c)      1     1     1     1     1     1     2     2     2     2
p_i           1/3   2/4   1/5   3/6   4/7   1/8   5/9   6/10  2/11  7/12
              0.33  0.50  0.20  0.50  0.57  0.13  0.56  0.60  0.18  0.58
-log_2(p_i)   1.58  1.00  2.32  1.00  0.81  3.00  0.85  0.74  2.46  0.78

H = (1/10)(1.58 + 1.00 + 2.32 + 1.00 + 0.81 + 3.00 + 0.85 + 0.74 + 2.46 + 0.78) = 1.45 bits/char

Comparison: 1.16 < 1.45 < 1.58, i.e. semi-adaptive < adaptive < static.

(The adaptive-model sketch after the slides reproduces this average.)

Slide 10: Shannon-Fano code: a top-down approach

1) Sort the symbols according to their probabilities: p_1 ≤ p_2 ≤ ... ≤ p_N.
2) Recursively divide the symbols into two parts, each with approximately the same total count (probability), appending 0 to the codewords of one part and 1 to those of the other.

[Worked example: Shannon-Fano codes and bitstream for symbols A..E; the resulting code table appears on slide 16]

(A construction sketch follows the slides.)

Slide 16: Shannon-Fano code: decoding

Code table:
A - 00
B - 01
C - 10
D - 110
E - 111

[Figure: binary code tree with leaves A, B, C, D, E]

Average code length: 87/39 = 2.23 bits per symbol.

Slide 19: Huffman code: decoding

Code table:
A - 0
B - 100
C - 101
D - 110
E - 111

[Figure: binary code tree with leaves A, B, C, D, E]

[Worked decoding example omitted]

(A generic prefix-code decoder and a Huffman construction sketch follow the slides.)
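To make the entropy formulas on slides 3 through 6 concrete, here is a minimal Python sketch; the helper name `entropy` is mine, not from the slides.

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum(p * log2(p)) in bits; zero-probability
    symbols contribute nothing and are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                  # binary source, slide 5: 1.0 bit
print(entropy([1 / 256] * 256))             # uniform, slide 6: 8.0 bits
print(round(entropy([0.7, 0.2, 0.1]), 2))   # semi-adaptive model, slide 8: 1.16 bits
```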
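Slide 9's adaptive model can be reproduced directly. The sketch below assumes the counting scheme implied by the table: every count starts at 1, the coded symbol's probability is read off before coding, and its count is incremented afterwards. The function name `adaptive_cost` is mine.

```python
import math

def adaptive_cost(data, alphabet):
    """Average ideal code length (bits/symbol) under the adaptive model
    of slide 9: p(symbol) = count(symbol) / total count, updated as we go."""
    counts = {s: 1 for s in alphabet}
    total_bits = 0.0
    for s in data:
        p = counts[s] / sum(counts.values())
        total_bits += -math.log2(p)   # ideal code length for this occurrence
        counts[s] += 1
    return total_bits / len(data)

print(round(adaptive_cost("aabaacaaba", "abc"), 2))   # 1.45 bits/char, as on slide 9
```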
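A sketch of the top-down Shannon-Fano construction from slide 10. The split rule used here (minimize the imbalance between the two halves' total counts) is one common variant, and the symbol counts are hypothetical: they are chosen only so that the output matches the code table and the 87/39 average shown on slide 16.

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, count) pairs, sorted by count.
    Recursively split into two parts of roughly equal total count,
    prefixing one part's codewords with 0 and the other's with 1."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(count for _, count in symbols)
    best_diff, split, acc = None, 1, 0
    for i in range(1, len(symbols)):
        acc += symbols[i - 1][1]
        diff = abs(2 * acc - total)          # imbalance if we cut before index i
        if best_diff is None or diff < best_diff:
            best_diff, split = diff, i
    codes = {}
    for sym, code in shannon_fano(symbols[:split]).items():
        codes[sym] = "0" + code
    for sym, code in shannon_fano(symbols[split:]).items():
        codes[sym] = "1" + code
    return codes

counts = [("A", 15), ("B", 8), ("C", 7), ("D", 5), ("E", 4)]   # hypothetical, total 39
codes = shannon_fano(counts)
print(codes)        # {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}
bits = sum(n * len(codes[s]) for s, n in counts)
print(bits, round(bits / 39, 2))   # 87 bits over 39 symbols: 2.23 bits/symbol
```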
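Slides 16 and 19 decode by walking the binary code tree bit by bit. Because both codes are prefix-free (no codeword is a prefix of another), looking up the accumulated bits in an inverted code table is equivalent. A minimal sketch; the `decode` helper and the test bitstream are mine.

```python
def decode(bits, code_table):
    """Decode a prefix-code bitstream; the first codeword match is unambiguous."""
    inverse = {code: sym for sym, code in code_table.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    if buf:
        raise ValueError("bitstream ended in the middle of a codeword")
    return "".join(out)

huffman_table = {"A": "0", "B": "100", "C": "101", "D": "110", "E": "111"}  # slide 19
print(decode("1000101110111", huffman_table))   # BACDE
```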
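The extract shows only Huffman decoding, so for completeness here is a standard bottom-up Huffman construction with a heap; this is a generic sketch, not taken from the slides. With the same hypothetical counts as in the Shannon-Fano sketch it also averages 87/39, about 2.23 bits per symbol, although tie-breaking among equal weights can produce different, equally optimal codeword lengths (one such tree gives the 1-bit code for the most frequent symbol seen in the slide 19 table).

```python
import heapq
import itertools

def huffman_codes(counts):
    """Repeatedly merge the two least frequent subtrees (bottom-up);
    the running counter breaks ties so the heap never compares dicts."""
    tiebreak = itertools.count()
    heap = [(c, next(tiebreak), {s: ""}) for s, c in counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)
        c2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in t1.items()}
        merged.update({s: "1" + code for s, code in t2.items()})
        heapq.heappush(heap, (c1 + c2, next(tiebreak), merged))
    return heap[0][2]

counts = {"A": 15, "B": 8, "C": 7, "D": 5, "E": 4}   # hypothetical counts again
codes = huffman_codes(counts)
avg = sum(counts[s] * len(code) for s, code in codes.items()) / sum(counts.values())
print(codes)
print(round(avg, 2))   # 2.23 bits/symbol
```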