Introduction to Video Coding
Part 1: Transform Coding
Mozilla

Video Compression Overview
● Most codecs use the same basic ideas:
1) Motion compensation to eliminate temporal redundancy
   [Figure: Input ⊖ Reference frame = Residual]
2) A 2D transform (usually the DCT) to eliminate spatial redundancy
   [Figure: an 8×8 block of input pixel data and its transformed version; after the transform the energy is concentrated in a few coefficients (a DC value of 700 and a few low-frequency AC values of 100–200), and nearly everything else is zero]
3) Quantization to throw out unimportant details (this is the "lossy" part)
4) Lossless compression to store the result efficiently

The Ideal Linear Transform
● Karhunen-Loève Transform (KLT)
  – See also: Principal Component Analysis (PCA)
  – Just a change of basis (like any other linear transform)
    ● Transforms, e.g., an 8×8 block of pixels into 64 coefficients in another basis
● Goal: a sparse representation of the pixel data
  – Pick basis vectors one by one, minimizing the distance of the data from the subspace they span
    ● Equivalently: maximizing the percent of the data's variance contained in that subspace

Karhunen-Loève Transform
● Mathematically:
  – Compute the covariance matrix
      R_xx = (1/N) Σ_{i=0}^{N−1} (x_i − μ)(x_i − μ)^T
  – Compute the eigenvectors of R_xx
    ● Sort by the magnitudes of the eigenvalues
  – Project the pixel data onto the eigenvectors
● Transform is data-dependent
  – So we need data to estimate it from
  – And would need to transmit the eigenvectors

Transforming Natural Images
● Image data is highly correlated
  – Usually modeled as a first-order autoregressive process (an AR(1) process):
      x_i = ρ·x_{i−1} + ε_i,   ρ = 0.95 (typically)
    where ρ is the correlation coefficient and ε_i is Gaussian noise
  – Produces a simple cross-correlation matrix:
      R_xx = σ² · [ 1    ρ    ρ²   ρ³   ⋯
                    ρ    1    ρ    ρ²
                    ρ²   ρ    1    ρ
                    ρ³   ρ²   ρ    1
                    ⋮                  ⋱ ]

The Discrete Cosine Transform
● If we assume this model holds for all image blocks, we can design one transform in advance
  – This is the Discrete Cosine Transform (DCT)
● 1-D basis functions (for an 8-point transform):
  [Figure: the constant DC basis function, then the AC basis functions of increasing frequency]
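To make this concrete, here is a minimal numpy sketch (mine, not from the deck) that builds the 8-point DCT-II matrix and verifies the orthonormality property mentioned on the next slide:

    import numpy as np

    N = 8
    # DCT-II matrix: G[i,j] = c(i) * cos((2j+1) * i * pi / (2N)),
    # with c(0) = sqrt(1/N) and c(i) = sqrt(2/N) otherwise.
    j = np.arange(N)
    G = np.array([np.cos((2*j + 1) * i * np.pi / (2*N)) for i in range(N)])
    G *= np.sqrt(2.0 / N)
    G[0] /= np.sqrt(2.0)          # row 0 is the constant DC basis function

    # Orthonormal: the inverse is just the transpose.
    assert np.allclose(G @ G.T, np.eye(N))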
  – Orthonormal, so the inverse is just the transpose

The DCT in 2D
● In 2D, first transform rows, then columns: Y = G·X·Gᵀ
● Basis functions: [Figure: the 64 two-dimensional 8×8 basis functions]
● Two 8×8 matrix multiplies is 1024 mults, 896 adds
  – 16 mults/pixel

Fast DCT
● The DCT is closely related to the Fourier Transform, so there is also a fast decomposition
● 1-D: 16 mults, 26 adds
  [Figure: butterfly diagram of the 8-point fast DCT factorization]
● 2-D: 256 mults, 416 adds (4 mults/pixel)

DCT Example
[Figure: the 8×8 input data block and its transformed version from the overview. Shamelessly stolen from the MIT 6.837 lecture notes: http://groups.csail.mit.edu/graphics/classes/6.837/F01/Lecture03/Slide30.html]

Coding Gain
● Measures how well energy is compacted into a few coefficients:
    C_g = 10·log₁₀ [ σ_x² / ( ∏_{i=0}^{N−1} σ_{y_i}²·‖h_i‖² )^{1/N} ]
  where σ_x² is the variance of the input, σ_{y_i}² is the variance of the i-th transform coefficient, and ‖h_i‖ is the magnitude of the corresponding inverse basis function

Coding Gain (cont'd)
● Calculating from a transform matrix and a cross-correlation matrix:
    C_g = 10·log₁₀ [ 1 / ∏_{i=0}^{N−1} ( (G·R_xx·Gᵀ)_ii · (Hᵀ·H)_ii )^{1/N} ]
● Coding gain for a zero-mean, unit-variance AR(1) process with ρ = 0.95:
          4-point     8-point     16-point
  KLT     7.5825 dB   8.8462 dB   9.4781 dB
  DCT     7.5701 dB   8.8259 dB   9.4555 dB

Quantization
● Divide each coefficient by a quantizer
  – Integer division, with rounding
  – {700, 100, 200, 0, …}/32 → {22, 3, 6, 0, …}
  – Error: {22, 3, 6, 0, …}×32 → {704, 96, 192, 0, …}
● Only required lossy step in the entire process
● Resulting list has lots of zeros and small coefficients
  – That's what makes it easy to compress

The Contrast Sensitivity Function
● Contrast perception varies by spatial frequency
  [Figure: the CSF]

Quantization Matrices
● Choose the quantizer for each coefficient according to the CSF
  – Example matrix:
      16  11  10  16  24  40  51  61
      12  12  14  19  26  58  60  55
      14  13  16  24  40  57  69  56
      14  17  22  29  51  87  80  62
      18  22  37  56  68 109 103  77
      24  35  55  64  81 104 113  92
      49  64  78  87 103 121 120 101
      72  92  95  98 112 100 103  99
● But that's at the visibility threshold
  – Above the threshold the distribution is more even
● Most codecs vary quantization by scaling a single base matrix
  – Will always be less than ideal at some rates
  – Theora has a more flexible model

Blocking Artifacts
● When we have few bits, quantization errors may cause a step discontinuity between blocks
  – Error correlated along the block edge → highly visible
  [Image: low-bitrate Theora with the loop filter disabled]

Blocking Artifacts
● Standard solution: a "loop" filter
  – Move pixel values near block edges closer to each other
  [Image: low-bitrate Theora with the loop filter enabled]

Ringing Artifacts
● Any quantization error is spread out across an entire block
● In smooth regions near edges, easily visible
● HEVC plans to add in-loop filters for this, too

Low-pass Artifacts
● HF coefficients get quantized more coarsely, and cost lots of bits to code
  – They're often omitted entirely
● Resulting image lacks details/texture
● Not often treated as a class of artifact
  – Low-pass behavior looks good on PSNR charts
    ● This is one reason PSNR is a terrible metric
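A minimal sketch of the quantization step above, reproducing the {700, 100, 200, 0, …}/32 example (function names are mine); note how a coarse quantizer zeroes out small HF coefficients, which is exactly the low-pass behavior just described:

    import numpy as np

    def quantize(coeffs, q):
        # Integer division, with rounding: the only required lossy step.
        return np.round(coeffs / q).astype(int)

    def dequantize(levels, q):
        return levels * q

    coeffs = np.array([700, 100, 200, 0, 0, 0])
    levels = quantize(coeffs, 32)      # -> [22  3  6  0  0  0]
    recon  = dequantize(levels, 32)    # -> [704 96 192  0  0  0]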
Low-pass Artifacts (cont'd)
[Images: low-passed skin tones (from the original VP3 encoder) vs. better retention of HF (from the Thusnelda encoder)]

Transform Size
● How do we pick the size of the blocks?
  – Computation is N·log N
    ● Larger blocks require more operations per pixel
  – Larger blocks allow us to exploit more correlation → better compression
    ● Except when the data isn't correlated (edges)
  – Smaller blocks → fewer edges overlapped → less ringing
● Variable block size
  – H.264: 8×8 and 4×4; HEVC: may add 16×16

The Time-Frequency Tradeoff
● Raw pixel data has good "time" (spatial) resolution
● The DCT has good frequency resolution
● ΔT×ΔF ≥ constant
● Tilings of the "time-frequency plane" depict this trade-off
  [Figure: time-frequency tilings for raw pixels vs. the DCT]

Leakage
● In reality partition boundaries are not perfect
  [Figure: frequency response of the 8-point DCT basis functions]

Wavelets
● Good LF resolution (models correlation well)
● Better time resolution in HF (prevents ringing)
● Smooth basis functions (no blocking artifacts)
  [Figure: wavelet tiling of the time-frequency plane]

Wavelets (cont'd)
● Wavelets break down at low rates
  – HF "texture" requires spending even more bits to code it separately at every position
  – Extreme low-passing is typical
● Good for large-scale correlations (but Dirac doesn't use them for this)
  [Images: original; Dirac @ 67.2 kbps; Theora @ 17.8 kbps]

Wavelets (cont'd)
● Can do much better... but this is hard
  [Images: "House" at 0.1 bpp with fixed quantizers vs. contrast-based adaptation, from D. M. Chandler and S. S. Hemami: "Contrast-Based Quantization and Rate Control for Wavelet-Coded Images." In Proc. of the 5th International Conference on Image Processing (ICIP'02), vol. 3, pp. 233–236, June 2002.]

Lapped Transforms
● Can eliminate blocking artifacts entirely
  – Like wavelets, but keep the good DCT structure
  – The same idea has been used in audio forever
● Overlap basis functions with neighboring blocks
● Basis functions decay to zero at edges
  [Figure: synthesis basis functions for a 4-point lapped transform with 50% overlap]

Lapped Transforms (cont'd)
● No need for the transform to be orthogonal
● Separate analysis and synthesis filters
  [Figures: analysis and synthesis basis functions for a 4-point lapped transform with 50% overlap]

Lapped Transforms: Prefilter
● Implemented with a prefilter in the encoder
  – A linear transform that straddles block edges
  – Removes correlation across the edge
  – Inverse applied in the decoder
  [Figure: a prefilter P applied across each block edge before the blockwise DCT in the encoder; the decoder applies the IDCT, then the postfilter P⁻¹ across each edge]

Lapped Transforms: Prefilter (cont'd)
● Prefilter makes things blocky
● Postfilter removes blocking artifacts
  – Like a loop filter, but invertible
  – And simpler: no conditional logic to control filter strength

Lapped Transforms (cont'd)
● Cheaper than wavelets
  – 8×16 LT: 3.375 muls/pixel (minus the loop filter)
  – 9/7 wavelet (3 levels): 5.25 muls/pixel
● Better compression:
              4-point     8-point     16-point
  KLT         7.5825 dB   8.8462 dB   9.4781 dB
  DCT         7.5701 dB   8.8259 dB   9.4555 dB
  LT          8.6060 dB   9.5572 dB   9.8614 dB
  9/7 Wavelet: 9.46 dB
● Can keep most of the block-based DCT infrastructure
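A toy illustration of the prefilter/postfilter structure, assuming an arbitrary invertible filter V rather than an optimized lapped-transform design: the prefilter straddles each interior block edge before an ordinary blockwise DCT, and applying its exact inverse after the IDCT gives perfect reconstruction.

    import numpy as np

    def dct_matrix(N):
        j = np.arange(N)
        G = np.array([np.cos((2*j + 1) * i * np.pi / (2*N)) for i in range(N)])
        G *= np.sqrt(2.0 / N); G[0] /= np.sqrt(2.0)
        return G

    B  = 4
    G  = dct_matrix(B)
    I2 = np.eye(2); J2 = np.fliplr(I2); Z2 = np.zeros((2, 2))
    V  = np.diag([1.2, 1.6])               # illustrative choice, not a tuned design
    E  = np.block([[I2, J2], [J2, -I2]]) / np.sqrt(2.0)
    P  = E @ np.block([[I2, Z2], [Z2, V]]) @ E   # prefilter straddling one block edge

    x = np.random.randn(4 * B)             # four 1-D blocks (frame edges unfiltered)
    y = x.copy()
    for e in range(B, len(x), B):          # encoder: prefilter across each edge...
        y[e-2:e+2] = P @ y[e-2:e+2]
    coeffs = G @ y.reshape(-1, B).T        # ...then an ordinary blockwise DCT

    z = (G.T @ coeffs).T.reshape(-1)       # decoder: blockwise IDCT...
    for e in range(B, len(z), B):          # ...then the inverse postfilter
        z[e-2:e+2] = np.linalg.solve(P, z[e-2:e+2])
    assert np.allclose(z, x)               # perfect reconstruction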
Lapped Transforms (cont'd)
● Early experiments (by Greg Maxwell) suggest it works!
● But others don't think so
  – On2/Google (private communication)
  – Charles Bloom: http://cbloomrants.blogspot.com/2009/07/07-06-09-small-image-compression-notes.html
    ● "Obviously in a few contrived cases it does help, such as on very smooth images at very high compression... In areas that aren't smooth, lapping actually makes artifacts like ringing worse."
    ● This doesn't match published examples: Barbara (heavily textured) gets much more benefit than Lena (smooth)

TF Resolution Switching
● Recursively apply a DCT/IDCT to increase/decrease frequency resolution
  – Can apply to just part of the spectrum
  – Idea stolen from CELT/Opus (audio)
  [Figure: time-frequency tilings; splitting trades frequency resolution for time resolution, merging does the reverse]

TF Resolution Switching (cont'd)
● Potential benefits
  – Can spread texture over larger regions
    ● Only apply to HF: less complex than a larger transform
  – Can capture large-scale correlation better
    ● Like LF wavelet bands, but without the HF problems
  – Can reduce the number of transform sizes needed
● Cost
  – Signaling (need to code a per-band decision)
  – Encoder search (more possibilities to consider)

TF Resolution: Splitting
● Better time resolution → reduced ringing
  – But not as good as using a smaller transform
  [Figure: 8-point lapped transform basis functions after TF splitting]

TF Resolution: Merging
● Better frequency resolution → better coding gain
  – But not as good as a larger transform
    ● 25% overlap vs. 50% → effectively using a smaller window
  [Figure: 8-point lapped transform basis functions after TF merging (first 8 only)]

Intra Prediction
● Predict a block from its causal neighbors
● Explicitly code a direction along which to copy
● Extend the boundary of the neighbors into the new block along this direction

Intra Prediction (cont'd)
[Figure: example directional predictions]

Intra Pred. vs. LT
● Intra pred. and LT have similar roles
  – Both exploit correlation with neighboring blocks
● But very different mechanisms
  – Which is best depends on the image
  (R. G. de Oliveira and R. L. de Queiroz: "Intra Prediction versus Wavelets and Lapped Transforms in an H.264/AVC Coder." In Proc. 15th International Conference on Image Processing (ICIP'08), pp. 137–140, Oct. 2008.)

Intra Pred. vs. LT (cont'd)
● Combine lapped transforms with intra pred.
  – Better than either alone
  – Despite predicting from farther away, with only 4 of 9 modes (pixels not available for the others)
  (R. G. de Oliveira and B. Pesquet-Popescu: "Intra-Frame Prediction with Lapped Transforms for Image Coding." In Proc. of the 36th International Conference on Acoustics, Speech, and Signal Processing (ICASSP'11), pp. 805–808, May 2011.)

Intra Prediction as Transform
● Generalize using the idea of separate analysis & synthesis filters
  [Figure: domain of the analysis basis functions and range of the synthesis basis functions]
● Can train optimal transforms using standard techniques (similar to the KLT)
  (J. Xu, F. Wu, and W. Zhang: "Intra-Predictive Transforms for Block-Based Image Coding." IEEE Transactions on Signal Processing, 57(8):3030–3040, Aug. 2009.)

Intra Pred. in Frequency Domain
● Optimal KLT-like transforms are not sparse
  – E.g., a 64×64 matrix multiply for an 8×8 transform
  – May not even be separable
● Idea: use a standard transform, predict in the frequency domain
  – Signals are sparse in the frequency domain, so we should be able to enforce sparsity of the predictor
● Works with lapped transforms without restrictions
  – Don't have to worry about which pixels are available, etc.
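A simplified sketch of directional intra prediction (H.264-style DC/horizontal/vertical, plus VP8's "TM" mode discussed on the next slide); availability checks and exact rounding rules vary by codec and are omitted here:

    import numpy as np

    def intra_predict(top, left, topleft=0, mode='DC'):
        """Predict an NxN block from its causal neighbors: `top` is the row of
        N pixels above, `left` the column of N pixels to the left."""
        N = len(top)
        if mode == 'DC':    # one value for all pixels (no orientation)
            return np.full((N, N), (top.sum() + left.sum() + N) // (2 * N))
        if mode == 'V':     # copy the top row downward
            return np.tile(top, (N, 1))
        if mode == 'H':     # copy the left column rightward
            return np.tile(left.reshape(N, 1), (1, N))
        if mode == 'TM':    # VP8: P = T + L - TL (predicts gradients well)
            return top[None, :] + left[:, None] - topleft
        raise ValueError(mode)

    top, left = np.array([10, 12, 14, 16]), np.array([10, 11, 12, 13])
    pred = intra_predict(top, left, topleft=9, mode='TM')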
Intra Prediction Limitations
● Intra prediction is one of the main innovations of H.264
● But how is it actually used?
  – The most common mode is "DC" mode, which has no orientation and just uses one value for all pixels
  – Next are "horizontal" and "vertical"
    ● These align well with the DCT basis functions, so you can fix things cheaply when it screws up
  – Diagonal modes are only really useful on strong edges
● Prediction only uses the edge of a block
  – Can't extend texture; not needed for smooth regions

Intra Prediction Improvements
● VP8: "TM" mode: P = T + L − TL
  – Predicts gradients well; used over 50% of the time
● HEVC proposals:
  – More orientations (clustered around H and V)
  – Multi-stage:
    ● Predict 1 out of 4 pixels
    ● Decode the residual for those pixels
    ● Extend the prediction into the remaining pixels
    ● Effectively restricts prediction to LF
  – Orientation-adaptive transforms for the residual

Orientation-Adaptive Transforms
● Some images have many diagonally-oriented features
  – Including the output of diagonal intra predictors
  – Not compactly represented by the DCT
● Design custom transforms for each orientation
● Lots of research over the past 6 or 7 years
  – Issues (some solved, some not): separability, data dependencies, fast implementations, coefficient ordering, etc.

Orientation Coding Gain
● Looks impressive (at least at 45°), but...
  [Figure: coding gain vs. orientation]

Orientation Adaptive vs. Intra Pred. and Adaptive Block Size
[Charts: rate-distortion curves on highly-oriented images and on a normal test image, comparing fixed block size, adaptive block size, orientation adaptive + adaptive block size, intra prediction + adaptive block size, and everything combined]

Additional References
● H.S. Malvar: "Extended Lapped Transforms: Properties, Applications, and Fast Algorithms." IEEE Transactions on Acoustics, Speech, and Signal Processing, 40(11):2703–2714, Nov. 1992.
● T.D. Tran: "Lapped Transform via Time-Domain Pre- and Post-Filtering." IEEE Transactions on Signal Processing, 51(6):1557–1571, Jun. 2003.
● W. Dai and T.D. Tran: "Regularity-Constrained Pre- and Post-Filtering for Block DCT-Based Systems." IEEE Transactions on Signal Processing, 51(10):2568–2581, Oct. 2003.
● J. Hu and J.D. Gibson: "New Rate Distortion Bounds for Natural Videos Based on a Texture Dependent Correlation Model." In Proc. 46th Allerton Conference on Communication, Control, and Computing, pp. 996–1003, Sep. 2008.
● J. Han, A. Saxena, and V. Melkote: "Jointly Optimized Spatial Prediction and Block Transform for Video and Image Coding." IEEE Transactions on Image Processing (pre-print), Sep. 2011.
● C.-L. Chang: "Direction-Adaptive Transforms for Image Compression." Ph.D. Thesis, Stanford University, Jun. 2009.

Questions?

Introduction to Video Coding
Part 2: Entropy Coding
Mozilla

Review
[Figure: the Part 1 pipeline — the 8×8 DCT example, the lapped-transform prefilter/postfilter structure around the blockwise DCT/IDCT, and the resulting sparse transformed data]

Review (cont'd)
[Figure: TF resolution switching (splitting and merging) and intra prediction]

Shannon Entropy
● Minimum number of bits needed to encode a message, given the probability of its symbols:
    H(X) = −Σ_{i=1}^{n} p(x_i)·log₂ p(x_i)
● Represents the limit of lossless compression
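A one-line check of this bound, using the four-symbol source that appears in the Huffman example below:

    import math

    def entropy(probs):
        # H(X) = -sum p(x) log2 p(x): the limit of lossless compression.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([2/3, 1/6, 1/12, 1/12]))   # ~1.42 bits/symbol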
  (C. E. Shannon: "A Mathematical Theory of Communication." The Bell System Technical Journal, 27(3-4):379–423, 623–656, Jul., Oct. 1948.)

Two Questions
● How do we know p(x_i)?
  – More on this later
● How close to this rate can we actually compress data?
  – We'll tackle this first
  – Two approaches worth discussing:
    ● Huffman coding (fast)
    ● Arithmetic coding (better compression)

Huffman Codes
● Also known as
  – Variable-Length Codes
  – Prefix-free Codes
● Basic idea
  – Vary the number of bits per symbol
  – Use fewer bits for more frequent symbols
  – Use more bits for rare symbols
  (D. A. Huffman: "A Method for the Construction of Minimum-Redundancy Codes." Proceedings of the Institute of Radio Engineers, 40(9):1098–1101, Sep. 1952.)

Huffman Code Construction
● Take the two least probable symbols and put them in a binary tree with "0" and "1" branches
● Set the probability of the parent equal to the sum of its children's, and put it back in the list
● Repeat until there's only one item left in the list...
    Symbol   Frequency   Probability
    A        8           2/3
    B        2           1/6
    C        1           1/12
    D        1           1/12
  After merging C and D under a parent P:
    Symbol   Probability
    A        2/3
    B        1/6
    P        1/6

Huffman Code Construction (cont'd)
● Follow the path from the root to get each leaf's codeword:
    Symbol   Codeword
    A        0
    B        10
    C        110
    D        111
● Bits per symbol: Σ_{i=1}^{n} p(x_i)·length(x_i)
  – 1.5 bits (25% compression)
  – Entropy: 1.42 bits/symbol
● Huffman models the distribution
  – As if the probs are (1/2, 1/4, 1/8, 1/8)
  – Instead of (2/3, 1/6, 1/12, 1/12)
  – Called "cross-entropy"; the gap is called the Kullback-Leibler divergence

Huffman Code Problems
● Only optimal when probabilities are powers of 2
  – Can't represent probabilities greater than 1/2!
  – Code multiple symbols at once to approach the bound
    ● But the size of the code grows exponentially
● Not adaptive when the distribution is not known
  – Code construction is slow: O(N log N)
  – Need to either pick a fixed code (suboptimal) or transmit the code to use (overhead)
  – Adaptive versions exist, but are still slow and complex

Huffman Coding in Video
● Some form of VLC is used in every standard from H.261 (1988) to H.264-MVC (2009)
● Video is not text compression
  – There are many different kinds of symbols that need to be encoded
    ● Macroblock modes, motion vectors, DCT coefficient values, signs, run lengths, etc.
  – The size of the alphabet can differ for each one
  – They have vastly different probability distributions
  – Need to change which code we're using on a symbol-by-symbol basis
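A minimal sketch of the construction with heapq (the 0/1 labels depend on tie-breaking, but the codeword lengths match the slide):

    import heapq

    def huffman(freqs):
        """Repeatedly merge the two least probable nodes into a binary tree."""
        # Heap items: (frequency, tiebreak, {symbol: codeword-so-far}).
        heap = [(f, i, {s: ''}) for i, (s, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        n = len(heap)
        while len(heap) > 1:
            f0, _, c0 = heapq.heappop(heap)   # "0" branch
            f1, _, c1 = heapq.heappop(heap)   # "1" branch
            merged = {s: '0' + w for s, w in c0.items()}
            merged.update({s: '1' + w for s, w in c1.items()})
            heapq.heappush(heap, (f0 + f1, n, merged)); n += 1
        return heap[0][2]

    freqs = {'A': 8, 'B': 2, 'C': 1, 'D': 1}
    codes = huffman(freqs)                    # lengths: A:1, B:2, C:3, D:3
    avg = sum(freqs[s] * len(w) for s, w in codes.items()) / 12
    print(avg)                                # 1.5 bits/symbol vs. 1.42-bit entropy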
Huffman Coding in Video (cont'd)
● Video is not text compression (cont'd)
  – We know the resolution of the video
  – Hence we know exactly when to stop reading more symbols
  – No need to reserve special values to mark the end of the stream
  – By convention, we can use zeros if we run out of input bits
    ● Can strip trailing zeros from the output

Huffman Coding in Video (cont'd)
● Theora
  – Codes for block flags, MB modes, and MVs are fixed
  – Codes for DCT coefficients (the bulk of the data) are transmitted in a stream header
    ● Different codes are used for various coefficients
    ● Can pick which set to use on a frame-by-frame basis
● H.264: CAVLC ("Context Adaptive")
  – Fixed set of codes to choose from
  – Choice is made based on previously decoded symbols (the "context")

Huffman Decoding
● Read bits one at a time, traverse the tree (slow)
● Finite State Machine
  – Current state is the index of an interior node of the Huffman tree
  – Read n bits, use LUTs to get a new state and a list of zero or more decoded symbols
  – Can choose n independent of the code lengths
  – Fastest method, but need to know which code subsequent symbols will use
    ● This decision may be very complex

Huffman Decoding: libtheora
● Traverse multiple levels of the tree with a LUT
  – Peek ahead n bits
  – Use a LUT to find the node in the tree (up to n levels down) and how many bits to consume
    Root (2-bit) table:        P's (1-bit) table:
    Bits   Node   Depth        Bits   Node   Depth
    00     A      1            0      C      1
    01     A      1            1      D      1
    10     B      2
    11     P      2
● Example: 2-bit table → 5/6 of symbols need only one lookup
  – Usually one well-predicted branch per symbol
● n trades off table size (cache) and branches
  – Use a larger value for the root node

Arithmetic Coding
● The inefficiency of Huffman codes comes from using a whole number of bits per codeword
● Arithmetic coding doesn't partition the code space on bit boundaries
  – Can use fractional bits per codeword
● Gets very close to the Shannon bound
  – Limited only by the precision of the arithmetic

Huffman Coding Revisited
● Imagine the bitstream as a binary number in [0, 1)
● Code the message "ABC"
  – Start: range [0, 1)
  – All messages that start with A fall in the range [0, 1/2) {in binary: [0.0, 0.1)}
  – After "AB": range [2/8, 3/8) {in binary: [0.010, 0.011)}
● Encode the message with the shortest number in the range
● Number of bits ≤ ⌈−log₂(range)⌉ = −log₂(1/64) = 6
  – "ABC": range [22/64, 23/64) {in binary: [0.010110, 0.010111)}

Arithmetic Coding (cont'd)
● No need for the intervals to be powers of two
● Number of bits ≤ ⌈−log₂(1/108)⌉ = ⌈6.75⌉ = 7
  – "ABC": range [58/108, 59/108)
  – "AAA" dropped from 3 bits to 1.75

Arithmetic Coding (cont'd)
● Can represent the entire message with the shortest number (most trailing zeros) that lies in the range
● Infinite-precision arithmetic is called "Elias coding"
  – Gets within 1 bit of the Shannon bound
● Real arithmetic coding uses finite precision
  – Use a "sliding window" that gets rescaled
    ● L = lower bound of the range, R = size of the range

5-bit Window (Encoding)
● Example: L=0, R=21, want to code B
  – Partition R proportional to the probabilities
  – Update: L=14, R=3
● Renormalize: double L and R until R ≥ 16, emitting the top bit of L each time
  – Output 0 ← L=28, R=6
  – Output 01 ← L=24, R=12
  – Output 011 ← L=16, R=24
● Carry propagation: L can exceed 32
  – Update: output 011, L=36, R=2
  – Propagate the carry into the output: output 100, L=4, R=2
  – Then renormalize like normal: output 100001 ← L=0, R=16

5-bit Window (Decoding)
● Read bits into C, find the partition it's in
  – L=0, R=21, C=16 ← input 100001 → decode a B
● Update: same calculations as the encoder
  – L=14, R=3, C=16
● Renormalize: shift more bits into C
  – L=16, R=24, C=4 ← input 10000100
● If C isn't in [L, L+R), borrow (the inverse of a carry)
  – L=16, R=24, C=36 ← input 10000100

Arithmetic Coding: That's it!
● Variations
  – Renormalize 8 bits at a time ("range coding")
    ● Byte-wise processing is faster in software
  – Binary: less arithmetic, no search in the decoder
    ● Lots of optimizations and approximations only work with binary alphabets
    ● This is what all current video codecs use
  – Partition functions...
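Tying the walkthrough together: a minimal Elias coder over exact rationals, encoding "ABC" under the (2/3, 1/6, 1/12, 1/12) source and reproducing the [58/108, 59/108) range (the cdf layout is the natural one implied by the slides):

    from fractions import Fraction as F
    import math

    # Cumulative intervals: A:[0,2/3) B:[2/3,5/6) C:[5/6,11/12) D:[11/12,1)
    cdf = {'A': (F(0), F(2, 3)), 'B': (F(2, 3), F(5, 6)),
           'C': (F(5, 6), F(11, 12)), 'D': (F(11, 12), F(1))}

    low, width = F(0), F(1)
    for s in 'ABC':                        # narrow [low, low+width) per symbol
        lo, hi = cdf[s]
        low, width = low + width * lo, width * (hi - lo)
    print(low, low + width)                # 29/54 (= 58/108) and 59/108
    print(math.ceil(-math.log2(width)))    # 7 bits suffice for the message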
represent entire message with shortest number (most trailing zeros) that lies in range Infinite precision arithmetic called “Elias Coding” – ● Gets within 1 bit of Shannon bound Real arithmetic coding uses finite-precision – Use a “sliding window” that gets rescaled ● 72 L = lower bound of range, R = size of range Mozilla 5-bit Window (Encoding) ● Example: L = 0, R = 21, want to code B 1 D C B 21/32 A 0 73 0 window 5 10 Mozilla 5-bit Window (Encoding) ● Partition R proportional to probabilities ● Update: L=14, R=3 1 D C B 21/32 21/32 19/32 17/32 AD AC AB 14/32 R=3 A AA 0 74 0 window 5 10 Mozilla 5-bit Window (Encoding) ● Double L and R until R ≥ 16 ● Renormalize: output: 0 ← L=28, R=6 1 D C B AD AC AB 34/64 28/64 A AA 0 75 0 window 5 10 Mozilla 5-bit Window (Encoding) ● Double L and R until R ≥ 16 ● Renormalize: output: 01 ← L=24, R=12 1 D C B AD AC AB A AA 0 76 0 5 window 10 Mozilla 5-bit Window (Encoding) ● Double L and R until R ≥ 16 ● Renormalize: output: 011 ← L=16, R=24 1 D C B AD AC AB A AA 0 77 0 5 window 10 Mozilla 5-bit Window (Encoding) ● Carry propagation: L can exceed 32 ● Update: output: 011, L=36, R=2 1 D C B AD AC AB ABD ABC ABB ABA A AA 0 78 0 5 window 10 Mozilla 5-bit Window (Encoding) ● Carry propagation: L can exceed 32 ● Update: output: 100, L=4, R=2 1 D C B AD AC AB ABD ABC ABB ABA A AA 0 79 0 5 window 10 Mozilla 5-bit Window (Encoding) ● Then renormalize like normal ● Renormalize: output: 100001 ← L=0, R=16 1 D C B AD AC AB ABD ABC ABB ABA A AA 0 80 0 5 window 10 Mozilla 5-bit Window (Decoding) ● Decoding: Read bits into C, find partition it’s in ● L=0, R=21, C=16 ← input: 100001 1 D C B 21/32 21/32 19/32 17/32 AD AC AB A 14/32 R=3 Decode a B AA 0 81 0 window 5 10 Mozilla 5-bit Window (Decoding) ● Update: Same calculations as encoder ● L=14, R=3, C=16, 1 input: 100001 D C B 21/32 21/32 19/32 17/32 AD AC AB 14/32 R=3 A AA 0 82 0 window 5 10 Mozilla 5-bit Window (Decoding) ● Renormalize: Shift more bits into C ● L=16, R=24, C=4 ← input: 10000100 1 D C B AD AC AB A AA 0 83 0 5 window 10 Mozilla 5-bit Window (Decoding) ● If C isn’t in [L, L+R), borrow (inverse of carry) ● L=16, R=24, C=36 ← input: 10000100 1 D C B AD AC AB A AA 0 84 0 5 window 10 Mozilla Arithmetic Coding: That’s it! ● Variations – Renormalize 8 bits at a time (“Range Coding”) ● – Binary: Less arithmetic, no search in decoder ● ● ● 85 Byte-wise processing is faster in software Lots of optimizations and approximations only work with binary alphabets This is what all current video codecs use Partition functions... Mozilla Partition Functions ● Splits R according to cumulative frequency counts: f i =frequency of i th symbol i−1 c i= ∑ f k total =c N k=0 R R∗ci /total ● f0 f1 f 2 f3 (“CACM” coder, Witten et al. 1987) total = 12 R = 32 0 86 21 26 29 32 Mozilla Partition Functions (cotd.) R  R /total ∗c ● (Moffat et al. 1998) – Better accuracy for fixed register size, one less div – Over-estimates the probability of the last symbol ● f0 The error is small as long as total << R (less than 1% given 6 bits of headroom) f1 f 2 f3 total = 12 R = 32 0 16 (R/total)*f0 87 20 22 32 Rounding error Mozilla Partition Functions (cotd.) R max c , 2∗c−total R ● – Stuiver and Moffat 1998 – Requires total ≤ R < 2*total (shift up if not) – Distributes rounding error more evenly ● f0 But always has a significant amount f1 f 2 f3 total = 12 (scale by 2 to get 24 ≤ 32 < 48) R = 32 0 16 under-estimated 88 24 28 32 over-estimated Mozilla Partition Functions (cotd.) 
Partition Functions (cont'd)
● Table-driven (e.g., CABAC)
  – No multiplies or divides
  – Binary alphabets only
  – Small set of allowed c_i's
  – R restricted to 256...511; only bits 6 and 7 are used:
      f(c_i) = 0                                             if c_i = 0
      f(c_i) = R − rangeTabLPS[indexOf(c_i)][⌊R/64⌋ mod 4]   if 0 < c_i < total
      f(c_i) = R                                             if c_i = total
● Truncation error can be as large as 25%
  – All of it is added to the MPS; can waste 0.32 bits for the LPS

Arithmetic Coding References
● J.J. Rissanen: "Generalized Kraft Inequality and Arithmetic Coding." IBM Journal of Research and Development, 20(3):198–203, May 1976.
● I.H. Witten, R.M. Neal, and J.G. Cleary: "Arithmetic Coding for Data Compression." Communications of the ACM, 30(6):520–540, Jun. 1987.
● A. Moffat, R.M. Neal, and I.H. Witten: "Arithmetic Coding Revisited." ACM Transactions on Information Systems, 16(3):256–294, Jul. 1998.
● L. Stuiver and A. Moffat: "Piecewise Integer Mapping for Arithmetic Coding." In Proc. 8th Data Compression Conference (DCC'98), pp. 3–12, Mar. 1998.
● D. Marpe, H. Schwarz, and T. Wiegand: "Context-Based Adaptive Binary Arithmetic Coding in the H.264/AVC Video Compression Standard." IEEE Transactions on Circuits and Systems for Video Technology, 13(7):620–636, Jul. 2003.
● http://people.xiph.org/~tterribe/notes/range.html

Questions?

Introduction to Video Coding
Part 3: Probability Modeling
Mozilla

Review: Huffman Coding
● Relatively fast, but probabilities are limited to powers of 2
● LUTs can traverse multiple levels of the tree at a time for the decoder
● An FSM is even faster, but limited to a single codebook
  [Figure: the example Huffman tree and its lookup tables from Part 2]

Review: Arithmetic Coding
● Efficient for any probabilities, but somewhat slow
  – Binary coders are simplest; larger alphabets require more arithmetic and a codebook search
  [Figure: the "ABC" interval-subdivision example from Part 2]

Modeling
● "Modeling" assigns probabilities to each symbol to be coded
● Arithmetic coding completely separates modeling from the coding itself
  – This is its biggest advantage
  – Can switch from context to context on every symbol
● Basic approach: partition symbols into "contexts"
  – Model the distribution within a context as if it were independently, identically distributed (i.i.d.)

Context Modeling
● How do we get (approximately) i.i.d. data?
  – Partition it into "contexts" based on the observed values of a symbol's neighbors
  – Each context has its own probability estimator: P(x | y₁,...,y_k)
● Having a separate context for each possible set of values of y₁,...,y_k models the dependence of x on y₁,...,y_k

Context Modeling (cont'd)
● Example: a "skip macroblock" flag
  – Skipping is more likely if the neighbors were also skipped
  – Let c = 2×skip_above + skip_left be the context index
  – Use four different probability estimators, one for each c
  – All macroblocks with the same value of c code their skip flag using the same probability estimator
● But... how do we estimate probabilities?
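A minimal sketch of the skip-flag context model (the estimator uses the frequency-count scheme introduced on the next slide; class and function names are mine):

    class BoolContext:
        """Adaptive binary probability estimator (frequency counts)."""
        def __init__(self):  self.f = [1, 1]            # initialize f = 1
        def prob(self, bit): return self.f[bit] / sum(self.f)
        def update(self, bit): self.f[bit] += 2          # add 2 per observation

    skip_ctx = [BoolContext() for _ in range(4)]

    def skip_context(skip_above, skip_left):
        return 2 * skip_above + skip_left                # c = 2*above + left

    # All macroblocks with the same neighbor pattern share one estimator:
    c = skip_context(skip_above=1, skip_left=0)
    p = skip_ctx[c].prob(1)                              # P(this MB is skipped)
    skip_ctx[c].update(1)                                # adapt after coding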
Frequency Counts
● Most basic adaptive model
  – Initialize f(x_i) = 1 for all x_i
  – Each time x_i is decoded, add 2 to f(x_i)
  – p(x_i) = f(x_i)/total, where total = Σ f(x_j)
● Why add 2 instead of 1?
  – Minimizes the worst-case inefficiency
    ● Is that the right thing to minimize?
  – Called a Krichevsky-Trofimov estimator (Krichevsky and Trofimov '81)

Frequency Counts in Practice
● Rescaling
  – When total grows too large, scale all counts by ½ (rounding up to keep them non-zero)
● Distributions are not exactly i.i.d.
  – Can imply the need for a faster learning rate
  – Faster learning implies more frequent rescaling → faster forgetting
● Optimal learning rate varies greatly by context
● Biggest drawback: total is not a power of 2 (divisions!)

Finite State Machine
● Only used for binary coders
● Each state has associated probabilities and transition rules
  – A LUT maps state and coded symbol to a new state
● Example: CABAC (64 states plus reflections)
  – p₀ = 0.5, p_σ = α·p_{σ−1}, α = (3/80)^{1/63} ≈ 0.949217
  – Defines the probability of the "Least Probable Symbol"
    ● Probability of the "Most Probable Symbol" is (1 − p_σ)
  – Transitions:
      p_new = Quantize(α·p_old)             if the MPS was coded
      p_new = Quantize(1 − α·(1 − p_old))   if the LPS was coded

CABAC Transitions
[Figure: the CABAC state-transition diagram]

QM Coder (JPEG) Transitions
● "State" can include a notion of learning rate
  [Figure: LPS probability vs. probability state index for the QM coder]

Designing the FSM
● FSMs are popular (JPEG, JBIG, H.264, Dirac, OMS), but there is very little research on their design
● Existing methods are based on theory, not data
● But it looks much like a classic machine-learning problem: Hidden Markov Models
  – Hidden (unobserved) state (e.g., the learning rate)
  – Each state has an unknown probability distribution over the possible (observed) outputs
  – Good training algorithms exist
    ● Baum-Welch (1970)
    ● Baldi-Chauvin (1994)

Binarization
● All existing video standards use binary alphabets
  – But we want to code non-binary values
● Binarization: converting an integer to bits
  – Want to code as few symbols as possible, on average (for speed and modeling efficiency)
  – Compression comes from the arithmetic coder
    ● Binarization doesn't need to be perfect

Exp-Golomb Codes
● Binarization method for many syntax elements
  – Four types: unsigned, signed, "coded_block_pattern", and "truncated"
● Not optimal, but relatively simple
  – Strongly prefers 0's (coding a 1 takes 3 symbols!)
    Bit String      Value(s)
    0               0
    10x             1...2
    110xx           3...6
    1110xxx         7...14
    11110xxxx       15...30
    111110xxxxx     31...62
● Also uses unary, truncated unary, fixed-length, and for MB/sub-MB type a custom code

VP8 "Tree Coder"
● Binarization is the Huffman coding problem!
  – Uses Huffman trees for most syntax elements
● Each internal node is coded in its own context
  – But the decision on which set of contexts to use is made once, for the whole tree
  [Figure: example tree for per-block intra modes, with leaves DC, TM, VE, HE, LD, RD, VR, VL, HD, HU]

Binary FSM vs. Freq. Counts
● If a symbol can have N values, its distribution has N−1 degrees of freedom (DOFs)
  – Binary contexts: one per binary-tree internal node
  – Frequency counts: one per value, minus one constraint (probabilities sum to 1)
● When values are correlated, binary contexts adapt more quickly
  – Taking a branch raises the probability of everything on that side of the tree
  – Example: seeing a MV pointing up means it's more likely to see other MVs that point up
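Encoding and decoding sketches matching the Exp-Golomb bit-string table above (this deck writes the unary prefix with 1s; the bit-for-bit convention differs across specs):

    def exp_golomb_encode(v):
        """Unsigned v -> bit string: 0 -> '0', 1..2 -> '10x', 3..6 -> '110xx', ..."""
        k = (v + 1).bit_length() - 1           # number of info ('x') bits
        info = (v + 1) - (1 << k)              # v + 1 with its top bit removed
        return '1' * k + '0' + format(info, '0{}b'.format(k)) if k else '0'

    def exp_golomb_decode(bits):
        k = 0
        while bits[k] == '1': k += 1           # count the unary prefix
        info = int(bits[k+1:k+1+k] or '0', 2)
        return (1 << k) - 1 + info

    assert exp_golomb_encode(6) == '11011' and exp_golomb_decode('11011') == 6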
Context Dilution
● What is the cost of adaptation?
  – Learning probabilities adds (roughly) log(N) bits of overhead per DOF to code N symbols
    ● Assumes a static distribution... doesn't count the cost of re-learning if the distribution changes
● How does the number of contexts impact this?
  – Lots of dependencies means lots of contexts
  – Lots of contexts means few symbols per context: context dilution

Context Dilution Overhead
● Larger contexts amortize the learning overhead
  – Need 1000 symbols per DOF to get under 1%
● The gain from better modeling must offset the overhead
  [Chart: learning overhead (bits/symbol) vs. number of symbols per DOF]

Reducing Contexts
● Merge contexts: c = skip_above + skip_left
  – 4 contexts reduced to 3
● Replace the binary tree with a DAG:
  – 7 contexts reduced to 3
  [Figure: a depth-3 binary tree whose interior nodes are merged level by level into a DAG]

Parallelism
● Arithmetic coding is inherently serial
  – The state depends on a complicated sequence of rounding and truncation operations
  – Now the bottleneck in many hardware decoders
● Lots of ways to address this
  – All of them have drawbacks

Parallelism: Partitions
● Code different parts of the frame independently
  – Allowed by both H.264 and VP8
● Each uses an independent arithmetic coder
● Can all be decoded in parallel
● BUT
  – Must encode files specially (not on all the time)
  – Can't predict across partition boundaries
  – Have to re-learn statistics in each partition
  – Non-trivial bitrate impact

Parallelism: HEVC Approach
● Quantize the probabilities (just like with an FSM)
● Map each probability to a separate stream
  – Encodes symbols with just that one probability
  – Allows an efficient Huffman code (cheap!)
  – Allows FSM decoding (no codebook switching)
● Decode all streams in parallel and buffer the output
  – Complicated scheduling problem: not evenly split
● Software implementation half CABAC's speed
  – (Loren Merritt, personal communication, 2010)

Parallelism: Paired Coders
● Idea due to David Schleef
● Code pairs of symbols at a time with independent coders
● Carefully schedule when bits are written to/read from the bitstream
  – So they get sent to the right coder
● Drawbacks:
  – If you only want to code one symbol, you must clock a "null" (prob = 1.0) symbol through the other coder
  – Small speed-up

Parallelism: Non-Binary Coder
● Coding a symbol from an alphabet of size 16 is equivalent to coding 4 bits at once
  – This is just the libtheora Huffman decoding approach applied to the VP8 tree coder idea
● Most coding operations are per-symbol
  – Context selection, state update, renormalization check, probability estimator
● Exception: codebook search
  – Parallel search needs N multipliers for a log(N) speed-up
  – Can keep it serial for a speed/gate-count trade-off
  – Can't use an FSM to estimate probabilities

Non-Binary Coder (cont'd)
● Tested with the VP8 tree coder
  – Used mode and motion vector data (not DCT)
  – Binary decoder vs. multi-symbol decoder (pure C implementation):
    ● Includes all overheads for non-binary arithmetic coding
    ● Serial codebook search (no SIMD)
    ● Doesn't include overheads for probability estimation (VP8 uses static probabilities)
  – Speed-up almost directly proportional to the reduction in symbols coded (over a factor of 2 for VP8)

Non-Binary Coder (cont'd)
● Estimating probabilities is harder
  – Can use frequency counts, but there's no easy way to ensure total is a power of two
    ● Could use an approximate coder (e.g., Stuiver's)
    ● Could transmit explicit probabilities (à la VP8)
● Partitioning into contexts is harder
  – 2 neighbors with a binary value is 2² = 4 contexts
  – With 16 possible values, that's 16² = 256 contexts
    ● And each one has 15 parameters to estimate
  – Need other ways to condition contexts
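The codebook search above amounts to finding which partition interval the decoder's code value falls into; a serial sketch (boundaries reuse the CACM example from Part 2):

    import bisect

    def decode_symbol(value, boundaries):
        """Find i with boundaries[i] <= value < boundaries[i+1].
        Serial here; hardware can search in parallel at the cost of multipliers."""
        return bisect.bisect_right(boundaries, value) - 1

    print(decode_symbol(16, [0, 21, 26, 29, 32]))   # 0, i.e. symbol 'A'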
Reducing Freq. Count DOFs
● Frequency counts in large contexts
  – Only boost the probability of one value per symbol
  – Neighboring values are often correlated
● Idea: add something to the counts of the coded symbol's neighbors, too
  – SIMD makes it as cheap as normal frequency count updates
  – Can simulate any binary model (1st-order approx.)
  – A binary model with reduced DOFs yields reduced-DOF frequency count models

Reducing Freq. Count DOFs (cont'd)
● Example: update after coding a 0 from an 8-value context, given a 3-level binary-tree model:
    {1,1,1,1,0,0,0,0} + {1,1,0,0,1,1,0,0} + {1,0,1,0,1,0,1,0} = {3,2,2,1,2,1,1,0}
● All frequency counts are linear combinations of these 3 vectors or their inverses
  – 3 DOFs instead of 7

Beyond Binary Models
● No need to ape a binary context model
  – How do we train a model from data?
    ● Standard dimensionality reduction techniques (PCA, ICA, NMF, etc.)
● No need to use frequency counts
  – Some contexts are well-approximated by parametric distributions (Laplace, etc.) with 1 or 2 DOFs
  – Must re-compute the distribution after each update
    ● Can be combined with codebook search, early termination, etc.

Other Improvements
● Per-context learning rate
  – The optimal rate varies drastically
  – Easy to measure from real data (saves 3% bitrate)
  – FSM prob. estimator: expensive (lots of tables)
  – Freq. counts: cheap (1 shift/symbol)
● Rate-dependent context merging
  – Lower rates code fewer symbols
  – Should use fewer contexts (context dilution)
  – No existing codecs do this

Practical Examples: Run-Level Coding
● JPEG and the MPEG standards prior to H.264 use (run, level) coding
● Coefficients are scanned in zig-zag order
  – "run": the number of zero coefficients before the next non-zero coefficient
  – "level": the value of the non-zero coefficient

Practical Examples: Run-Level Coding (cont'd)
● A Huffman code is built for the most common (run, level) pairs
● An additional "End Of Block" symbol indicates no more non-zero coefficients
● An escape code signals an uncommon (run, level) pair
● Fast: lots of coefficients can be decoded from one symbol
● The Huffman code is static, so some codewords can produce more coefficients than remain in the block (wasted code space)

Concrete Example: H.264 DCT Coefficients
● Codes the locations of non-zero coefficients first, then their values second
● First pass (forwards):
  – coded_block_flag: 0 → skip the whole block
  – significant_coeff_flag[i]: one per coeff (except the last)
  – last_significant_coeff_flag[i]: one per significant coeff (except the last coeff): 1 → skip the rest of the block
● Second pass (backwards):
  – coeff_abs_level_minus1[i]: one per significant coeff
  – coeff_sign_flag[i]: one per significant coeff

H.264 DCT Coefficients (cont'd)
● 6 "categories", each with separate contexts:
    #  Type
    0  Luma DC coefficients from an Intra 16×16 MB
    1  Luma AC coefficients from an Intra 16×16 MB
    2  Luma coefficients from any other 4×4 block
    3  Chroma DC coefficients
    4  Chroma AC coefficients
    5  Luma coefficients from an 8×8 block
● Context model
  – coded_block_flag
    ● Separate contexts for each combination of coded_block_flag values from two neighboring blocks

H.264 DCT Coefficients (cont'd)
● Context model (cont'd)
  – significant_coeff_flag, last_significant_coeff_flag just use the position in the list for most categories
    ● Chroma DC and 8×8 luma coeffs merge some contexts
    ● No dependence on values in other blocks at all!
● coeff_abs_level_minus1
  – Values less than 14 use unary (0, 10, 110, etc.)
  – Values 14 and larger code an Exp-Golomb "suffix"
    ● Codes (coeff_abs_level_minus1 − 14) with Exp-Golomb
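A sketch of the zig-zag (run, level) scan described above, using the common 4×4 scan order (the exact order, codes, and escape handling vary by standard):

    # Zig-zag scan order for a 4x4 block (list index = scan position).
    ZIGZAG4 = [(0,0),(0,1),(1,0),(2,0),(1,1),(0,2),(0,3),(1,2),
               (2,1),(3,0),(3,1),(2,2),(1,3),(2,3),(3,2),(3,3)]

    def run_level(block):
        """(run, level) pairs plus an End-Of-Block marker."""
        pairs, run = [], 0
        for (i, j) in ZIGZAG4:
            v = block[i][j]
            if v == 0:
                run += 1                     # count zeros before the next level
            else:
                pairs.append((run, v)); run = 0
        pairs.append('EOB')
        return pairs

    blk = [[7, 2, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
    print(run_level(blk))   # [(0, 7), (0, 2), (1, 1), 'EOB']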
H.264 DCT Coefficients (cont'd)
● coeff_abs_level_minus1 context model:
  – First bit
    ● 1 context used if there's a value greater than 1 in the block
    ● 4 more contexts indexed by the number of 1's seen so far
  – All remaining bits in the unary prefix
    ● 5 contexts (except for chroma DC, which only uses 2)
    ● Indexed by the number of values greater than 1
  – Exp-Golomb suffix: no entropy coding (p = 0.5)
● coeff_sign_flag: no entropy coding (p = 0.5)

Alternate Design: SPIHT-based
● Lots of research on entropy coding in the 90's for wavelet codecs
● Major examples
  – EZW: Embedded Zerotrees of Wavelets (Shapiro '93)
  – SPIHT: Set Partitioning In Hierarchical Trees (Said and Pearlman '96)
  – EBCOT: Embedded Block Coding with Optimal Truncation (Taubman '98)
● Most are also applicable to DCT coefficients

Set Partitioning in Hierarchical Trees
● Arrange coefficients in hierarchical trees
  – Basis functions with similar orientation are grouped into the same tree
  [Figure: "Horizontal", "Vertical", and "Diagonal" groups with example parent/child relationships]
  – Other groupings are possible; can even switch between them for each block

SPIHT Coding
● "Bit-plane"/"embedded" coding method
  – Coefficients are converted to binary sign-magnitude
  – A coefficient is "significant" at level n if its magnitude is at least 2ⁿ
  – Scan each bitplane
    ● Identify coefficients that become significant at this level
      – Code the sign when the coefficient becomes significant
    ● Code another bit of the already-significant coefficients
● Called "embedded" because you can stop coding at any point to achieve a desired rate

Identifying Significant Coefficients
● "Partition" into three sets using the hierarchical trees:
  – List of Significant Pixels (LSP)
  – List of Insignificant Pixels (LIP)
  – List of Insignificant Sets (LIS)
● Two types of LIS entries
  – Type A, denoted D(i,j): all descendants of node (i,j)
  – Type B, denoted L(i,j): all descendants excluding the immediate children

Identifying Significant Coefficients: Algorithm
● Set LSP = {}, LIP = {(0,0)}, LIS = {D(0,0)} in the top plane
● For (i,j) in the LIP, output one bit to indicate significance
  – If significant, output a sign bit and move (i,j) to the LSP
● For each entry in the LIS, remove it, and then:
  – Type A: output one bit to indicate whether any of D(i,j) is significant
    ● If so, output four bits to indicate the significance of each child, move them to the LSP, and code a sign
    ● Then add L(i,j) to the LIS
  – Type B: output one bit to indicate whether any of L(i,j) is significant
    ● If so, add all four children to the LIS as Type A entries

SPIHT/DCT Results
[Figure: rate-distortion comparison of SPIHT vs. DCT+EZW, from Z. Xiong, O.G. Guleryuz, and M.T. Orchard: "A DCT-Based Embedded Image Coder." IEEE Signal Processing Letters, 3(11):289–290, Nov. 1996.]

SPIHT (cont'd)
● Gives good performance even without an entropy coder
  – Arithmetic coding adds another 0.25–0.3 dB
● But very slow
  – Repeated scanning, list manipulation, branches, etc.
● However...
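A sketch of the bit-plane significance/refinement scan from the "SPIHT Coding" slide, minus the tree-structured set partitioning (every coefficient is tested individually, so it is far less efficient than real SPIHT):

    def bitplane_encode(coeffs, nplanes):
        """Embedded coding: truncating `bits` anywhere still yields a usable,
        coarser approximation of `coeffs`."""
        bits, significant = [], [False] * len(coeffs)
        for n in reversed(range(nplanes)):           # most significant plane first
            for i, c in enumerate(coeffs):
                bit = (abs(c) >> n) & 1
                bits.append(bit)                     # significance or refinement bit
                if bit and not significant[i]:       # newly significant:
                    bits.append(1 if c < 0 else 0)   # ...code the sign, once
                    significant[i] = True
        return bits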
Non-Embedded SPIHT
● Can make a non-embedded version of SPIHT just by re-arranging the bits
  – Noticed independently by Guo et al. 2006 and Charles Bloom 2011: http://cbloomrants.blogspot.com/2011/02/02-11-11-some-notes-on-ez-trees.html
● Ironically, Said and Pearlman made a non-embedded coder strictly inferior to SPIHT (Cho et al. 2005)
● Details for the "Backwards" version are in Guo et al. 2006
  – Don't need the "Backwards" part (can buffer one block)
  – Single-pass, no lists

Non-Embedded SPIHT: Reducing Symbols/Block
● Code the difference in the level where D(i,j) and L(i,j) or the children become significant
  – Unary code: equivalent to SPIHT
  – But we can use a multi-symbol arithmetic coder
● When D(i,j) and L(i,j) are significant at different levels, there must be at least one significant child
  – Code the top bitplane for all four children at once
● After a coefficient becomes significant, can output the rest of its bits immediately

Additional References
● R.E. Krichevsky and V.K. Trofimov: "The Performance of Universal Encoding." IEEE Transactions on Information Theory, IT-27(2):199–207, Mar. 1981.
● J.M. Shapiro: "Embedded Image Coding Using Zerotrees of Wavelet Coefficients." IEEE Transactions on Signal Processing, 41(12):3445–3462, Dec. 1993.
● A. Said and W.A. Pearlman: "A New, Fast, and Efficient Image Codec Based on Set Partitioning in Hierarchical Trees." IEEE Transactions on Circuits and Systems for Video Technology, 6(3):243–250, Jun. 1996.
● D. Taubman: "High Performance Scalable Image Compression with EBCOT." IEEE Transactions on Image Processing, 9(7):1158–1170, Jul. 2000.
● Y. Cho, W.A. Pearlman, and A. Said: "Low Complexity Resolution Progressive Image Coding Algorithm: PROGRES (Progressive Resolution Decompression)." In Proc. 12th International Conference on Image Processing (ICIP'05), vol. 3, pp. 49–52, Sep. 2005.
● J. Guo, S. Mitra, B. Nutter, and T. Karp: "A Fast and Low Complexity Image Codec Based on Backwards Coding of Wavelet Trees." In Proc. 16th Data Compression Conference (DCC'06), pp. 292–301, Mar. 2006.

Questions?

Introduction to Video Coding
Part 4: Motion Compensation
Mozilla

Main Idea
[Figure: Input ⊖ Reference frame = Residual]
● "Block Matching Algorithms" (BMA)
  – Code an offset for each block (the "motion vector")
  – Copy pixel data from the previous frame relative to that offset
● If the match is good, the result is mostly zero

Motion Search: Exhaustive
● Check all offsets within some range (±16 pixels)
● Pick the one with the smallest Sum of Absolute Differences (SAD):
    SAD = Σ_{x=0}^{N−1} Σ_{y=0}^{M−1} |I_k(x, y) − I_{k−1}(x + MV_x, y + MV_y)|
  – The error function is not smooth in general
    ● The global optimum could be anywhere
  – Successive Elimination: an algorithm that lets you skip many candidates due to overlap
    (M. Yang, H. Cui, and K. Tang: "Efficient Tree Structured Motion Estimation Using Successive Elimination." IEE Proceedings – Vision, Image, and Signal Processing, 151(5):369–377, Oct. 2004.)

Aperture Problem
● In flat regions, most MVs give approximately the same error
  – Random noise decides which one you choose
● Along an edge, still ambiguous in a 1-D subset
  [Image from Chen and Willson, 2000]
● Solution: account for the coding cost of the MV: D + λR

Motion Search: Pattern-Based
● Start from (0,0)
  – Search, e.g., 4 neighbors (diamond search)
    ● Move to the best result
    ● Stop when there's no improvement
  – Fast: usually only have to check a few candidates
  – Lots of research into the best patterns
    ● "Hex" search (6 points): slower, but better quality
● "Zonal search" (Tourapis et al. 2001, 2002)
  – Predict several MVs from the neighbors, stop if under a threshold, pattern search from the best otherwise
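A minimal SAD-based diamond search as just described (bounds checks and the D + λR motion-vector cost from the aperture-problem slide are omitted for brevity):

    import numpy as np

    def sad(cur, ref, bx, by, mvx, mvy, B=16):
        # Sum of Absolute Differences; assumes the MV stays inside the frame.
        return np.abs(cur[by:by+B, bx:bx+B].astype(int) -
                      ref[by+mvy:by+mvy+B, bx+mvx:bx+mvx+B].astype(int)).sum()

    def diamond_search(cur, ref, bx, by, B=16):
        """Start at (0,0); move to the best of 4 neighbors until no improvement."""
        mv, best = (0, 0), sad(cur, ref, bx, by, 0, 0, B)
        while True:
            cands  = [(mv[0]+dx, mv[1]+dy) for dx, dy in ((1,0),(-1,0),(0,1),(0,-1))]
            scores = [(sad(cur, ref, bx, by, x, y, B), (x, y)) for x, y in cands]
            s, c = min(scores)
            if s >= best:
                return mv, best            # no improvement: stop
            mv, best = c, s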
Why SAD?
● It's fast
  – No multiplies, unlike the Sum of Squared Differences (SSD)
  – SAD of 2×8 pixels in one x86 instruction
● It fits the statistics of the residual better than SSD
  [Figure: residual distribution vs. Gaussian error (SSD metric), Cauchy error (log(1+(x/a)²) metric), and Laplacian error (SAD metric), from N. Sebe, M.S. Lew, and D.P. Huijsmans: "Toward Improved Ranking Metrics." IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10):1132–1143, Oct. 2000.]

Subpel Refinement
● Motion often isn't aligned with the pixel grid
● After the initial search, "refine" the motion vector to subpel precision
  – The error is well-behaved between pixel locations
  – No need for an exhaustive search at subpel resolution
    ● ½ pel alone would require 4× as many search points
  [Figure: whole-pel, half-pel, and quarter-pel search points]

Subpel Interpolation
● Need to interpolate between pixels in the reference frame for subpel
  – Why is this hard?
  – Aliasing: [Image from http://en.wikipedia.org/wiki/Aliasing]

Subpel Interpolation (cont'd)
● Real signals are not bandlimited
  – Hard edges have energy at all frequencies
● But even if they were, practical interpolation filters can't preserve the spectrum perfectly
  – MPEG-1...4 use bilinear
  – Theora uses linear
  – H.264, VP8 use 6-tap
  – HEVC proposes 12-tap, non-separable filters

Adaptive Interpolation Filters (HEVC proposals)
● Compute "optimal" filters for each frame
  – Transmit one filter per subpel location
  – Requires multiple passes in the encoder (slow)
  – Lots of coefficients, a significant amount of bitrate
    ● Need R-D optimization to decide which to update
● Some subpel locations are only used a few times
  – So many free parameters you're basically fitting noise
  – No reason to expect the same filter to be good over a whole frame
● Alternative: backwards-adaptive
  – Computationally expensive in the decoder
  – No original to compare to, won't be as good

Subpel Interpolation (cont'd)
● Aliasing error is maximal at halfpel locations
  (From T. Wedi and H.G. Musmann: "Motion- and Aliasing-Compensated Prediction for Hybrid Video Coding." IEEE Transactions on Circuits and Systems for Video Technology, 13(7):577–586, Jul. 2003.)
● Worth spending time to do a high-quality upsample by a factor of 2
  – Further subpel refinement can use cheaper interpolation
  – But the reference image is already corrupted by aliasing...

Subpel: Edge-Directed Interpolation
● Lots of recent research into nonlinear filters
  – Can do better than linear ones, but expensive
  – Based on assumptions about natural images
  (From A. Giachetti and N. Asuni: "Real-Time Artifact-Free Image Upscaling." IEEE Transactions on Image Processing, 20(10):2760–2768, Oct. 2011.)
● All we need is something fast enough for video

Coding MVs
● Compute a predictor from neighboring blocks
  – Usually uses a median to avoid outliers
    ● Median of 3 (MPEG codecs), median of 3 (Dirac), median of 4 (VC-1)
● Subtract the predictor, code the offset
● Creates a non-trivial dependency between MVs
  – Usually ignored

Variable Block Size
● Can give large gains (0.9 to 1.7 dB) at low rates
● Most codecs support at least two sizes:
  – 16×16 (the default)
  – 8×8 (called "4MV")
● H.264/VP8 add 8×16, 16×8, 4×4 partition sizes
  – 4×4 isn't that useful, according to Jason Garrett-Glaser
● HEVC expands this to 32×32 or even larger
  – Not necessary with good MV prediction and good searches in the encoder, but makes things easier
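A sketch of the median MV predictor from the "Coding MVs" slide (which three neighbors to use varies by codec; this just takes whatever three are passed in):

    def median_mv(a, b, c):
        """Component-wise median of 3 neighboring MVs (avoids outliers)."""
        med = lambda x, y, z: sorted((x, y, z))[1]
        return (med(a[0], b[0], c[0]), med(a[1], b[1], c[1]))

    def code_mv(mv, neighbors):
        pred = median_mv(*neighbors)
        return (mv[0] - pred[0], mv[1] - pred[1])   # code only the offset

    print(code_mv((3, -1), [(2, 0), (4, -1), (0, 0)]))   # (1, -1)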
Non-Rectangular Partitions
● Several HEVC proposals for 4×16, 16×4, L-shaped, split on an arbitrary line, etc.
● One good collection of partition shapes:
  [Figure from M. Paul and M. Murshed: "Video Coding Focusing on Block Partitioning and Occlusion." IEEE Transactions on Image Processing, 19(3):691–701, Mar. 2010.]

Non-Rectangular Partitions (cont'd)
● Most of the performance gain is at low bitrates
  – Odd partition shapes allow you to
    ● Spend fewer bits coding the split
    ● Code fewer MVs on a split
  – Not sure how important this is given dynamic programming in the encoder

Reference Frame Structure
● Types of frames:
  – I-Frames (Intra Frames, Key Frames)
    ● No motion compensation
  – P-Frames ("Predicted", Inter Frames)
    ● MPEG-1...4: a single reference frame (the last I- or P-frame)
    ● H.264: choose from multiple reference frames, all from the past
  – B-Frames ("Bi-Predicted")
    ● Up to two reference frames per block (averaged)
    ● MPEG-1...4: the closest I- or P-frame from the past and future
    ● H.264: basically arbitrary choice ("Generalized B-frames")

B-Frames: Direct Mode
● Bi-prediction without signaling
  – Scale the co-located partition's MV (mvCol) from the future frame by the temporal distances tb and td:
      mvL0 = mvCol·(tb/td)
      mvL1 = mvL0 − mvCol
  [Figure: past frame, B-frame, and future frame, with the direct-mode B-partition and the co-located partition]

"Multihypothesis" Motion Compensation
● Diminishing returns for more references
● Optimal MV precision depends on the noise level
● Half-pel is a form of multihypothesis prediction!
  (From B. Girod: "Efficiency Analysis of Multihypothesis Motion-Compensated Prediction for Video Coding." IEEE Transactions on Image Processing, 9(2):173–183, Feb. 2000.)

Weighted Prediction
● Fades between scenes are common, but very expensive to code
● Prior to H.264, B-frames could only average two frames
● H.264 added weighted prediction
  – Each reference gets one weight for the entire frame
    ● Can range from −128.0 to 127.0!
  – No way to disable it on a block-by-block basis
    ● Can't handle shadows and illumination changes that only affect part of a frame

Template Matching Weighted Prediction
● HEVC proposal
  – Look at the area around the block you're copying from in the reference frame
  – Compare it to the area you've already coded in the current frame
  – Compute the optimal prediction weight for the overlap
● The weight varies block by block, with no extra signaling required
  [Figure: template regions in the reference frame and the current frame]

Blocking Artifacts
● Lapped transforms eliminate blocking artifacts from the residual
● But block-based motion compensation adds more discontinuities on block boundaries
  – Hard to code with a lapped transform
● Two approaches to avoid them
  – Overlapped Block Motion Compensation (OBMC)
  – Control Grid Interpolation (CGI)

Overlapped Block Motion Compensation (OBMC)
● Overlap the predictions from multiple nearby MVs, and blend them with a window
  – Also a form of multihypothesis prediction
  (From H. Watanabe and S. Singhal: "Windowed Motion Compensation." In Proc. SPIE Visual Communications and Image Processing '91, vol. 1605, pp. 582–589, Nov. 1991.)
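A 1-D sketch of OBMC blending, assuming a raised-cosine window with 50% overlap (one of many valid window choices) so that overlapping weights sum to one:

    import numpy as np

    B = 8
    # Raised-cosine window of length 2B; shifted copies sum to 1.
    n = np.arange(2 * B)
    w = np.sin(np.pi * (n + 0.5) / (2 * B)) ** 2
    assert np.allclose(w[:B] + w[B:], 1.0)

    def obmc_blend(pred_left, pred_right):
        """Blend two overlapping predictions (each 2B long, made with the MVs
        of two adjacent blocks) over the B samples where they overlap."""
        return pred_left[B:] * w[B:] + pred_right[:B] * w[:B]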
OBMC (cont'd)
● Used by Dirac
  – It also wants to avoid blocking artifacts, with wavelets
● PSNR improvements as much as 1 dB
● Issues
  – Motion vectors are no longer independent
    ● Can use iterative refinement or dynamic programming (Chen and Willson, 2000); bigger cost of ignoring this
  – Low-pass behavior
    ● Blurs sharp features
  – Handling multiple block sizes (2-D window switching)

Multiresolution Blending
● Technique due to Burt and Adelson 1983
  – Decompose the predictor into low-pass and high-pass subbands LL, HL, LH, HH
    ● Just like a wavelet transform
  – Blend with a small window in the high-pass bands
  – Recursively decompose the LL band
● Proposed simplification
  – One level of Haar decomposition
  – Blend the LL band like OBMC, copy the rest like BMA
  – Reduces the OBMC multiplies by 75%

Control Grid Interpolation (CGI)
● Instead of blending predictors like OBMC, blend the MVs (like texture mapping)
  – More expensive: can't use large loads
  – Need subpel interpolation at much finer resolution
  – Can't model motion discontinuities or multiple reference frames
● Harder to estimate MVs
  – Can't ignore the dependencies anymore
  – Little PSNR improvement over BMA without iterative refinement, dynamic programming

Switching Between OBMC and CGI
● Various published research suggests using both techniques is better than either one alone
  – Switch on a block-by-block basis
  – In the range of 0.5 dB better
● None of them avoid blocking artifacts at the switch
● Alternate approach
  – Choose which method to use on the edges of blocks
  – Pick interpolation formulas that can achieve the desired behavior on each edge

Adaptive Switching
With m_i the four corner MVs, w_i the bilinear vector weights, and s_j the bilinear image weights, the interpolation formulas for each edge configuration (V = blend vectors, B = blend images) are:
● VVVV: I(w₀m₀ + w₁m₁ + w₂m₂ + w₃m₃)
● BVVV: I((w₀+w₁)m₀ + w₂m₂ + w₃m₃)·s₀ + I((w₀+w₁)m₁ + w₂m₂ + w₃m₃)·s₁ + I(w₀m₀ + w₁m₁ + w₂m₂ + w₃m₃)·(s₂+s₃)
● BVBV: I((w₀+w₁)m₀ + (w₂+w₃)m₃)·(s₀+s₃) + I((w₀+w₁)m₁ + (w₂+w₃)m₂)·(s₁+s₂)
● VVBB: I((1−w₁)m₀ + w₁m₁)·s₀ + I(w₀m₀ + w₁m₁ + w₂m₂ + w₃m₃)·s₁ + I(w₁m₁ + (1−w₁)m₂)·s₂ + I(m₃)·s₃
● VBBB: I((1−w₁)m₀ + w₁m₁)·s₀ + I(w₀m₀ + (1−w₀)m₁)·s₁ + I(m₂)·s₂ + I(m₃)·s₃
● BBBB: I(m₀)·s₀ + I(m₁)·s₁ + I(m₂)·s₂ + I(m₃)·s₃

Variable Block Size
● Need a way to change block size that doesn't create blocking artifacts
● Dirac subdivides all blocks to the smallest level and copies the MVs
  – Lots of setup overhead for the smaller blocks
  – Redundant computations for adjacent blocks with the same MV
  – Only works for OBMC, not CGI

Adaptive Subdivision
● Slight modifications to the previous formulas allow artifact-free subdivision in a 4-8 mesh
  – Neighbors differ by at most 1 level of subdivision
  – Fine-grained control (the MV rate doubles each level)
  – Efficient R-D optimization methods (Balmelli 2001)
    ● Developed for compressing triangle mesh/terrain data
  – Larger interpolation kernels, less setup overhead, fewer redundant calculations
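A toy sketch of CGI for one block: bilinearly interpolate the four corner MVs to a per-pixel MV, then sample the reference (nearest-neighbor here; a real codec would use fine subpel interpolation, per the CGI slide, and bounds checks are omitted):

    import numpy as np

    def cgi_block(ref, x0, y0, mvs, B=8):
        """Blend MVs, not predictors: (m00, m01, m10, m11) are the MVs at the
        block's four corners, each an (mvx, mvy) pair."""
        m00, m01, m10, m11 = (np.array(m) for m in mvs)
        out = np.empty((B, B))
        for i in range(B):
            for j in range(B):
                v, u = i / (B - 1), j / (B - 1)        # bilinear weights
                mv = ((1-v)*(1-u)*m00 + (1-v)*u*m01 +
                      v*(1-u)*m10 + v*u*m11)           # per-pixel motion vector
                out[i, j] = ref[y0 + i + int(round(mv[1])),
                                x0 + j + int(round(mv[0]))]
        return out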
Demo

References
● F. Dufaux and F. Moscheni: "Motion Estimation Techniques for Digital TV: A Review and a New Contribution." Proceedings of the IEEE, 83(6):858–876, Jun. 1995.
● A.M. Tourapis: "Enhanced Predictive Zonal Search for Single and Multiple Frame Motion Estimation." In Proc. SPIE Visual Communications and Image Processing, vol. 4671, pp. 1069–1079, Jan. 2002.
● A.M. Tourapis, O.C. Au, and M.L. Liou: "Highly Efficient Predictive Zonal Algorithms for Fast Block-Matching Motion Estimation." IEEE Transactions on Circuits and Systems for Video Technology, 12(10):934–947, Oct. 2002.
● M.C. Chen and A.N. Willson, Jr.: "Motion-Vector Optimization of Control Grid Interpolation and Overlapped Block Motion Compensation Using Iterated Dynamic Programming." IEEE Transactions on Image Processing, 9(7):1145–1157, Jul. 2000.
● K.-C. Hui and W.-C. Siu: "Extended Analysis of Motion-Compensation Frame Difference for Block-Based Motion Prediction Error." IEEE Transactions on Image Processing, 16(5):1232–1245, May 2007.
● W. Zheng, Y. Shishikui, M. Naemura, Y. Kanatsugu, and S. Itoh: "Analysis of Space-Dependent Characteristics of Motion-Compensated Frame Differences Based on a Statistical Motion Distribution Model." IEEE Transactions on Image Processing, 11(4):377–389, Apr. 2002.
● P.J. Burt and E.H. Adelson: "A Multiresolution Spline with Application to Image Mosaics." ACM Transactions on Graphics, 2(4):217–236, Oct. 1983.
● L. Balmelli: "Rate-Distortion Optimal Mesh Simplification for Communications." Ph.D. Thesis, École Polytechnique Fédérale de Lausanne, 2000.

Questions?