Xinqiao
Liu
December
1, 2000
For
conditional replenishment video coding algorithms, typically a change detection
mask is transmitted as side information in addition to the signal within the
changed areas. The change detector should therefore achieve the best
performance without exceeding a certain bit-rate available for encoding the
change detection mask and the signal. To find the optimal change detection algorithm,
a Context-based Arithmetic Encoder (CAE) is first implemented to transmit the
binary change detection mask. The CAE is originally used for shape coding in
the object-based MPEG-4 standard. Based on this compression scheme, two direct
change detection algorithms are implemented and their performances are analyzed
using rate-distortion theory. Further, we propose an adaptive change detection
algorithm that is based on the inherent noise characteristics of digital
imaging system. We demonstrate how the change detection threshold is locally
adapted to each pixel’s luminance value. For the rate-constrained conditional
replenishment compression, simulation proves that our adaptive change detection
algorithm achieves the best performance among all the three algorithms.
Conditional
replenishment (see, e.g. [1] [2] [3]) has been proposed as a technique for
taking advantage of the similarity between successive frames in video-telephony
or video conferencing where cameras typically are stationary and scenes usually
change slowly. The rudimentary algorithm behind conditional replenishment is to
transmit the pixel intensity value plus an address for each picture element
that is changed by more than a certain threshold since the previous displayed frame.
A binary change detection mask is used to represent the address information and
transmitted separately. Conditional replenishment is significantly simpler than
other video compression methods in terms of computational complexity, while is
still able to achieve respectful compression ratio for certain video
applications. Experiments with desktop video without motion compensation reveal
that on average less than 3% of the pixels need to be replenished in most
head-and-shoulders scenes with CIF/Q-NTSC frame sizes [5].
There
have been some studies on conditional replenishment video coding (see e.g. [1]
[2]). Most of the research, however, concentrate mainly on the image quality.
The correlation between transmission bit-rate and the choice of change
detection schemes still need to be explored. Recently, in [6], a
perception-based change detection method in conditional replenishment video
coding was presented. Based on the contrast sensitivity of human visual system,
the authors suggest that only the pixel whose intensity differences relative to
the stimulus background exceed the visibility threshold should be encoded.
Therefore, the proposed method reduces the perceptual redundancy in addition to
the spatial and temporal redundancy. However, the paper only studied the case
of distortionless coding, and the threshold values used in their simulation are
arbitrary and lack of theoretical support.
Figure1
shows the block diagram of conditional replenishment video coding method [4].
Figure 1
Block diagram of conditional replenishment.
Typically,
the information in the changed area is decomposed into a change detection mask
and the signal within the changed area. They are encoded and transmitted
separately through the transmission channel under certain bit-rate constrain.
In this study, we assume that the encoders for both paths are lossless. A
Context-based Arithmetic Encoder (CAE) is used to encoding the binary change
detection mask. The CAE method is originally used for shape coding in the object-based
MPEG-4 standard. Here we don’t specify the compression scheme used for encoding
the signal within the changed area. However, it is reasonable to assume that
the bit-rate required for encoding the signal (R2) is
proportional to the bit-rate required for encoding the change mask (R1)
since they are all proportional to the number of pixels need to be replenished.
Based on this assumption, we then are able to look at the trade-off between
different change detection schemes and the bit rate required using
rate-distortion theory. Figure 2 shows the model we will be using through this
study.
Figure 2
Conditional replenishment model used in this study
Given
limited channel bandwidth, there are several ways to control the bit rate in
the change detector: i) by subsampling the change detection mask and thus
reducing the number of pixels within the changed area; ii) by adjusting the
detection threshold for the whole frame; iii) by using adaptive threshold based
on the noise characteristics, which eliminates those pixels that have changed
due to noise rather than the input. As the number of changed pixels decreases,
however, the distortion of the reconstructed frame increases. In this report,
we simulate the rate-distortion relationship of the above three methods. The
unconstrained Lagrangian cost function is then introduced in order to find the
optimum detection parameters for each method. And finally, we demonstrate that
the proposed noise adaptive threshold method generate the best image quality
given the same channel bandwidth among the three methods.
This
report is organized as follows: in section
II, we describe the context-based arithmetic encoding (CAE) scheme used in
encoding the change detection mask; in section III,
we present the rate-distortion result from both the subsampling change
detection method and threshold-adjusting change detection method; in section III, we describe the noise
characteristics in a typical digital imaging system and propose an adaptive
change detection algorithm. The performance of these three algorithms is
compared in section IV.
Context-based
arithmetic encoding [8] is one of the binary bitmap-based shape coding scheme
used in the object-based MPEG-4 standard [7]. Here we only consider the
intraframe mode compression.
The
binary change detection mask is first divided into 16x16 macroblocks. Within
each macroblock, the coder exploits the spatial redundancy of the binary shape
information to be coded. Pixels are coded in scan-line order and row-by-row.
Three different types of the macroblocks are defined: "black" block
in which none of the pixel has changed (all pixels are 0); "white"
block whose pixels are all changed and need to be replenished (all pixels are
1); and those boundary macroblock which contains both changed and unchanged
pixels. For black and white macroblocks, encoder only need to signal the
macroblock type and the number of bits it used is negligible. For those
boundary macroblocks, a template of 10 pixels is used to define the causal
context for predicting the binary value of the current pixel (S0). The template
is shown in Figure 2.
Figure 3
Template for defining the context of the pixel to be coded
The
template extends up to 2 pixels to the left, to the right and to the top of the
pixel (S0) to be coded. For encoding the pixels in the two top and left rows of
a macroblock, parts of the template are defined by the shape information of the
already transmitted macroblocks on the top and left side of the current
macroblock. For the two right-most columns, each undefined pixel of the context
is set to the value of its closest neighbor inside the macroblock.
For
encoding the state transition, a context-based arithmetic encoder is used. For
this project, however, the arithmetic encoder is not implemented. Instead, the
theoretical conditional entropy based this template is calculated and this
entropy is approximated as the bit-rate required for encoding the change mask.
In order to calculate the conditional entropy, conditional probability table
with size of 1024 contexts is first derived from a bigger training sequence.
The conditional entropy is given by:
Same
as predictive coding, the context-based conditional entropy coding greatly
reduces the statistical dependencies between adjacent pixels, since:
A
standard change detector is shown in the following figure:
Figure 4
Change detector block diagram
Suppose
the previous frame is represented by matrix A1, the current
frame is represented by matrix A2, after change detection,
the change mask is given by matrix C, which is binary with 0 represents
unchanged pixels and 1 represent changed pixels. The signal in the changed area
is thus given by and the reconstructed
frame is:
The
mean-square distortion is defined as:
Assume
the bit-rate to transmit C is R1 where R1=H(C),
and the bit-rate to transmit the change signal is R2=H(A2C).
Since both R1 and R2 are proportional to
the number of changed pixels, it is reasonable to assume that even though we don’t
specify the compression scheme used to encoding A2C. Thus the
total bit-rate R is:
where
k is the ratio between R2 and R1. Based on this assumption, we can just study
the rate-distortion between D and R1 instead.
The
Lagrangian cost function given Lagrange multiplier l is defined as
which
is under the constrain that:
The
test video sequence used in this study is captured by a stationary high-speed
digital camera with a person moving cross the screen. Following is two
consecutive sample frames and an example of the detected change mask:
Figure 5
Example of two continuous frames and the change mask.
There
are two direct ways to reduce the number of detected pixels in the change
detection: subsampling and threshold adjusting.
To reduce the bit-rate as well as allow lossy compression, the macroblock can be subsampled by a factor of 2, 4 or 8, resulting in a subblock of size 8x8, 4x4 or 2x2, respectively. The subblock is encoded using the CAE encoder as described above. The signal in the changed area is subsampled as well by the same ratio. The encoder transmits to the decoder the subsampling factor such that the decoder decodes the change mask and the signal and then upsamples them. The upsampling filter used in this study is a simple pixel replication filter combined with a 3x3 median filter.
Following figure shows the change mask generated at different subsampling ratio:
Figure 6 Change mask generated at different subsampling ratio. (a) 1:1, (b) 1:2, (c) 1:4, (d) 1:8
Following figures show that as the subsampling ratio increases, the bit-rate decreases while the distortion increases:
The rate distortion curve and the Lagrangian cost function at different Lagrange multiplier l are plotted as follows:
Another method to control the bit-rate is by adjusting the change detector threshold. As the threshold increased, few pixels will be detected. Following figure shows the change detection mask under different threshold values.
Following figures show that as detection threshold increases, the bit-rate decreases while the distortion increases.
The rate distortion curve and the Lagrangian cost function at different Lagrange multiplier l are plotted as follows:
A fundamental problem in designing an optimum change detector is how to separate pixels whose change is due to noise from pixels whose value change is due to real input signal change. The two algorithms described above do not provide a good solution. For example, in the threshold-adjusting scheme, even though increasing threshold reduces the bit rate, pixels whose input have relative small change will be neglected while pixels with large noise variance will still be detected and transmitted. This will increase the distortion of the reconstructed frame and degrade the compression performance. Therefore, a noise adaptive change detection algorithm is necessary.
Today, most of images and videos are captured by either CCD or CMOS image sensors[9]. During frame exposure time, photo-detector inside each pixel converts incident light into photocurrent. The photocurrent is integrated onto a capacitor and the charge Qi,j (or voltage) is read out at the end of exposure time (here i and j are the pixel index). In this process, additive noise corrupts the output signal charge. The noise can be expressed as the sum of two independent components, (i) shot noise Ui,j which is zero mean signal dependent gaussian distribution with:
where q is the electron charge and (ii) readout circuit and reset noise Vi,j (including quantization noise) with zero mean and variance dV2. Thus the total noise variance of pixel (i,j) is:
The above equation implies that the noise is signal dependent, especially when the shot noise is larger than the readout noise. The stronger the luminance level, the noisier the pixel will be. Therefore, to separate pixels whose change is due to noise from pixels whose value change is due to real input signal change, the detection threshold should be locally adaptive.
In the proposed adaptive detection algorithm, we first calculate the local average value by averaging signals over a small area with size 8x8. The threshold Ti,j is then set to be multiples of the standard deviation of the noise of local average signals, i.e.,
where m is the sensitivity factor that is set globally over the whole frame. Note that by changing m, we effectively adjusting the detection sensitivity while the threshold is still locally adapted. Following figure shows the relation between input signal value and the adapted threshold value:
Following figure shows the threshold varying across the whole frame:
Using this adaptive detection algorithm, we can analysis the rate-distortion relation by varying the sensitivity factor m. Following figure shows the change detection mask under different m values.
Following figures show that as detection sensitivity factor m increases, the bit-rate decreases while the distortion increases.
The rate distortion curve and the Lagrangian cost function at different Lagrange multiplier l are plotted as follows:
In this section we compare the performance of the three detection methods presented in this report. First we define the frame PSNR as:
Following figure shows the comparison between subsampling change detection and threshold-adjusting change detection algorithm. It can be seen that given the same bit-rate, the threshold-adjusting achieves much higher PSNR than the subsampling method. On the other hand, the subsampling method can achieve lower bit-rate than the threshold-adjusting method.
Following figure shows the comparison between the adaptive detection method and the threshold-adjusting method. The adaptive algorithm achieves slightly higher PSNR. The gain of the adaptive detection algorithm should be higher when evaluated using different criteria. The reason is that our definition of the distortion is referenced to the current frame A2, which include some noisy pixels in the unchanged region. Since our adaptive detection algorithm eliminates those pixels, the still background should looks smother from frame to frame when using conditional replenishment than from the original video sequence.
This
project studied the optimum change detection algorithm in rate-constrained
conditional replenishment video coding scheme. Based on the inherent noise
characteristics of digital imaging system, we presented an adaptive change
detection algorithm that can efficiently separate pixels whose change is
due to noise from pixels whose value change is due to real input signal change.
Two other change detection algorithms were also implemented together with a
context-based arithmetic encoder (CAE) that is originally used for shape coding
in MPEG-4 standard. Simulation proves that our adaptive change detection
algorithm achieves the best PSNR among all the three algorithms.
REFERENCES
[1] B. Haskell, "Frame
replenishment coding of television," in Image Transmission Techniques, W.K. Pratt, Ed. New York: Academic,
1979.
[2] B. Haskell, F.W. Mounts
and J.C. Candy, "Interframe coding of videotelephone pictures," Proc. Of IEEE, Vol. 60, pp. 792-800,
July 1972.
[3] A.N. Netravali and B.G.
Haskell, Digital pictures:
representation, compression and standards, 2nd ed. Holmdel, NJ,
1995
[4] B. Girod, Image and video compression, EE368b
class note, Fall 2000-2001.
[5] Y.J. Chiu and T. Berger,
"Video compression using fax techniques," in Proc. 1996 IEEE Data Compression Conf., Apr. 1-3, 1996.
[6] Y.J. Chiu and T. Berger, "A
software-only videocodec using pixelwise conditional differential replenishment
and perceptual enhancements," IEEE
Trans. on Circuits and Systems for Video Technology, Vol. 9, No. 3, pp.
438-450, April 1999.
[7] A.K. Katsaggelos, L.P.
Kondi, F.W. Meier, J. Ostermann, and G.M. Schuster, "MPEG-4 and
rate-distortion-based shape-coding techniques", Proc. of the IEEE, Volume: 86 6, pp. 1126-1154, June 1998.
[8] N. Brady, F. Bossen, N.
Murphy, “Context-based arithmetic encoding of 2D shape sequences,” ICIP97, Santa
Barbara, CA 1997
[9] A. J.P. Theuwissen, Solid-State Imaging with Charge-Coupled
Devices, Kluwer Academic, 1995