A Framework for Very High Performance Compression of Table Tennis Video Clips
Tong Lai Yu
Department of Computer Science
California State University at San Bernardino

    Abstract

    In this paper, we present a framework to compress video clips of Table Tennis ( Ping Pong ) with a very high compression ratio for distribution or playback in embedded systems. Image processing techniques are utilized to extract the crucial parameters of the scenes in a table tennis video frame. In particular, the popular Active Contour Model is employed to track the players. The scene parameters thus obtained are used for animation by the decoder and are referred to as scene animation parameters ( SAPs ). The SAPs of a frame are compressed using a simple DPCM model, where the differences between the SAPs predicted from a previous frame and the SAPs of the current frame are calculated, transformed, quantized and entropy-encoded. The decoder carries out the usual entropy-decoding, inverse-quantization, inverse-transformation, and summation to obtain the reconstructed SAPs of each frame. It then uses a predefined graphics model to reconstruct the scene from the SAPs. Regardless of the original frame size ( image width x height ), data rates as low as 0.3 kbps can be obtained when the video is played at 20 frames per second.

    Keywords: Video Compression, MPEG4, Active Contour Model, Scene Graph, Low Bit Rate, Table Tennis

  1. Introduction

    Table Tennis ( Ping Pong ) is arguably the world's second or third most popular sport, with an estimated 300 million participants [1,2]. The sport is particularly popular in Europe and Asia. A match between top players in China can easily attract tens of millions of spectators. Numerous video clips of Table Tennis matches and lessons have been distributed on DVDs or over the Internet[2]. Therefore, it is desirable to develop a technique that compresses these video clips with a very high compression ratio while still maintaining high-quality video playback. Traditional video compression standards like MPEG4 and H.261 specify various kinds of techniques for compressing videos. Figure 1 shows a typical MPEG4 video compression/decompression process[3,4,5]. In the figure, T represents a transformation like the DCT ( Discrete Cosine Transform ) and Q represents quantization; T^-1 and Q^-1 are the corresponding inverse transformation and inverse quantization. ME is motion estimation and MC is motion compensation; uFn denotes an unfiltered frame.


    Figure 1. A typical MPEG4 video codec.

    While these standards do a good job of providing generic methods of compression, they are not customized to exploit the special features of a particular kind of video and therefore cannot use such features to achieve very high compression performance.

    In this article, we present a framework that is customized for Table Tennis ( TT ) video clip compression; by utilizing the special features of a TT match, we can achieve very high compression ratios by combining techniques of image processing and computer graphics. Our main concern here is the compression of TT video clips for distribution; the compression is done offline and only once, whereas the decompression is done many times and possibly online. Therefore, the compression stage is allowed to be time-consuming and tedious as long as the decompression is fast.

    When watching a match, most TT fans, including the author, are more interested in the players' movement and the ball motion than in the background audience or other irrelevant features such as the texture of the floor or the color of the table. The central theme of our framework is to replace the actual scenes in a frame with simulated graphics. Because of the characteristics of a TT match, only a small number of parameters are required to generate the simulated graphics. The graphics parameters become the video data and can be further compressed using traditional compression methods.

    Figure 2 shows the framework of our compression method, where SAPn is the set of Scene Animation Parameters generated from the image frame Fn using image processing techniques. As shown in the figure, SAPn can be compressed by a simple DPCM model[3] coupled with a transformation T and quantization Q; T can be a DCT or, in its simplest form, an identity transformation that leaves the prediction errors e unchanged. The lossless compression stage could be arithmetic coding, Huffman coding, or run-level coding followed by Huffman coding. The prediction errors e are obtained by subtracting the previous 'reconstructed-frame' SAPn-1 from the current 'frame' SAPn[3]. In the decoding process, the reconstructed scene animation parameters ( after going through lossless decompression, inverse quantization, inverse transformation and summation ) are fed into a scene animation model, which reconstructs the frame from the SAPs.



    Figure 2. Block diagram showing compression using image processing and animation.
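
    The DPCM loop described above can be sketched as follows. This is a minimal illustration only, not the author's implementation: it assumes an identity transformation T, a uniform quantizer with a hypothetical step size `step`, an all-zero initial prediction, and it omits the entropy-coding stage. The function names `dpcm_encode` and `dpcm_decode` are made up for this example.

```python
from typing import List, Sequence


def dpcm_encode(frames: Sequence[Sequence[int]], step: int = 1) -> List[List[int]]:
    """Return the quantized prediction errors for a sequence of SAP vectors."""
    prev = [0] * len(frames[0])  # assumed initial prediction: all zeros
    residuals = []
    for sap in frames:
        # e = current SAP - previous reconstructed SAP, then uniform quantization
        q = [round((s - p) / step) for s, p in zip(sap, prev)]
        residuals.append(q)
        # The encoder tracks the decoder's reconstruction so errors do not drift.
        prev = [p + qi * step for p, qi in zip(prev, q)]
    return residuals


def dpcm_decode(residuals: Sequence[Sequence[int]], step: int = 1) -> List[List[int]]:
    """Inverse-quantize the residuals and sum them to rebuild each SAP vector."""
    prev = [0] * len(residuals[0])
    frames = []
    for q in residuals:
        prev = [p + qi * step for p, qi in zip(prev, q)]
        frames.append(prev)
    return frames
```

    With `step = 1` the round trip is lossless for integer parameters; a larger step trades accuracy (at most half a step per parameter in this sketch) for smaller residuals.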

    Figure 3a shows a typical scene of a TT match. As shown in the figure, we need parameters to describe the following objects:

    1. the ball,
    2. the table,
    3. the two players holding paddles.

    The ball and the table are solid objects and are relatively easy to define and reconstruct. The players are deformable and are more difficult to track. The simplest object to specify is the ball; all we need is its pixel position in the current frame. We can use two 10-bit numbers to denote the coordinates of the ball center. Processing the table image is also simple; the table can be specified by the coordinates of its four corners. Since a player always holds her paddle, we regard a player-and-paddle as one object and, where there is no ambiguity, sometimes refer to it simply as a player. Tracking a human in a video and animating it is a state-of-the-art research problem. A lot of research on human synthesis and animation has been done in the past decade[6,7]. Sophisticated open-source applications for modeling and creating human graphics are also available[8]. In the following sections, we shall discuss recognizing, locating, specifying, and reconstructing the ball, the table and the players.
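
    As an illustration of the ball parameters mentioned above, two 10-bit coordinates fit into a single 20-bit word. The packing order ( x in the high bits, y in the low bits ) and the function names `pack_ball` and `unpack_ball` are assumptions made for this sketch; the text does not specify a bit layout.

```python
from typing import Tuple

COORD_BITS = 10                 # two 10-bit numbers, as stated in the text
COORD_MAX = (1 << COORD_BITS) - 1


def pack_ball(x: int, y: int) -> int:
    """Pack the ball-center pixel coordinates into one 20-bit integer."""
    if not (0 <= x <= COORD_MAX and 0 <= y <= COORD_MAX):
        raise ValueError("coordinate does not fit in 10 bits")
    return (x << COORD_BITS) | y  # assumed layout: x high, y low


def unpack_ball(word: int) -> Tuple[int, int]:
    """Recover (x, y) from a word produced by pack_ball."""
    return (word >> COORD_BITS) & COORD_MAX, word & COORD_MAX
```

    Ten bits per coordinate cover frames up to 1024 x 1024 pixels; larger frames would need more bits or coordinates scaled down before packing.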

  2. The Table

    Of all the objects listed above, the table is the easiest to locate as it is a rigid object and has a large size; its position does not change rapidly from frame to frame. We also know that it is blue. ( If it is not, we can obtain the color by first watching a few frames of the video and making an estimate of its color components. As we have mentioned earlier, the compression stage could be tedious. After all, under normal conditions, a table tennis video clip is always edited before it is distributed. ) To extract the table from an image frame, we first segment the image into regions. Many segmentation techniques have been developed in the past few decades[9-14]. In our work, we adopted the region-growing techniques [13,14] based on statistical concentration inequalities[15,16] to segment the images and to identify the table. In this method, an image is partitioned into regions that satisfy a similarity criterion such as matching a color or reaching a brightness level. At the beginning, a region starts with a point ( a pixel ) and "grows" by grouping all neighboring points that possess a similar property. Specifically, our similarity criterion is a statistical predicate[15,16]. This is basically a union-find problem and can be implemented efficiently with a method developed by Tarjan[17]. Figure 3b shows the segmented regions of the image shown in Figure 3a based on this statistical method.
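
    The union-find formulation of region growing can be sketched as follows. This is a simplified illustration, not the method of [13-16]: the statistical predicate is replaced by a plain test on the difference of region mean intensities, the image is a single-channel list of lists, and `segment` and `threshold` are hypothetical names.

```python
from typing import List


def segment(img: List[List[float]], threshold: float) -> List[List[int]]:
    """Label 4-connected regions whose mean intensities differ by <= threshold."""
    h, w = len(img), len(img[0])
    parent = list(range(h * w))          # union-find forest, one node per pixel
    size = [1] * (h * w)                 # pixel count of each region root
    total = [img[r][c] for r in range(h) for c in range(w)]  # intensity sums

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        # Stand-in predicate: merge only if the region means are close.
        if abs(total[ra] / size[ra] - total[rb] / size[rb]) > threshold:
            return
        if size[ra] < size[rb]:            # union by size
            ra, rb = rb, ra
        parent[rb] = ra
        size[ra] += size[rb]
        total[ra] += total[rb]

    for r in range(h):
        for c in range(w):
            i = r * w + c
            if c + 1 < w:
                union(i, i + 1)            # right neighbor
            if r + 1 < h:
                union(i, i + w)            # bottom neighbor
    return [[find(r * w + c) for c in range(w)] for r in range(h)]
```

    The result depends on the order in which neighboring pairs are visited; a more faithful implementation would use the predicate and merging order of the cited work.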

    The advantage of this method is that we can modify the predicate to identify specific regions with special features. In our case, in addition to the statistical predicate, we add the criterion of having blue color to grow regions, and the largest such region is identified as the table surface. Once the region is located, we apply a simple Hough Transform[9] to obtain the table edges. The intersection points of the edges give us the coordinates of the four corners of the table. The table surface with a net on it can be constructed from these coordinates, as the net is always at the middle of the table. Since the dimensions of a Ping Pong table are standardized, its length, width and height are fixed. Once we have determined the table surface projected on the screen, we also know how to draw the legs.
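
    The corner computation can be illustrated with a short sketch. It assumes that each table edge returned by the Hough Transform is given in the usual normal form ( rho, theta ), meaning x cos(theta) + y sin(theta) = rho, and that the caller already knows which pairs of edges are adjacent. The function name `intersect` and the tolerance `eps` are hypothetical.

```python
import math
from typing import Optional, Tuple

Line = Tuple[float, float]  # (rho, theta) in Hough normal form


def intersect(l1: Line, l2: Line, eps: float = 1e-9) -> Optional[Tuple[float, float]]:
    """Return the intersection (x, y) of two lines, or None if they are parallel."""
    (r1, t1), (r2, t2) = l1, l2
    a1, b1 = math.cos(t1), math.sin(t1)
    a2, b2 = math.cos(t2), math.sin(t2)
    det = a1 * b2 - a2 * b1          # determinant of the 2x2 system
    if abs(det) < eps:
        return None                  # (nearly) parallel edges
    # Cramer's rule for a1*x + b1*y = r1 and a2*x + b2*y = r2
    x = (r1 * b2 - r2 * b1) / det
    y = (a1 * r2 - a2 * r1) / det
    return x, y
```

    Applying `intersect` to each pair of adjacent edges yields the four corner coordinates that specify the table.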

  3. The Ball

    There have been a number of research works on ball localization and tracking in videos[18,19,20], where one reconstructs the trajectory of a moving ball by tracking the ball through the frames of a video sequence. The motion of the ball in a sports game usually has a characteristic trajectory depending on the type of game played. The characteristic trajectory is exploited to help track the ball. However, it is not easy to follow a fast-moving ball, because the rapid movement can cause its image to appear fuzzy in a frame; the ball appears as an elongated smear without sharp contours, confusing most computer vision algorithms. Figure 4 shows an example of this phenomenon.