In this paper, we present a framework to compress video clips of Table Tennis ( Ping Pong ) with a very high compression ratio for distribution or playback in embedded systems. Image processing techniques are used to extract the crucial parameters of the scene in a table tennis video frame. In particular, the popular Active Contour Model is employed to track the players. The scene parameters thus obtained are used for animation by the decoder and are referred to as scene animation parameters ( SAPs ). The SAPs of a frame are compressed using a simple DPCM model, where the differences between the SAPs predicted from a previous frame and the SAPs of the current frame are calculated, transformed, quantized and entropy-encoded. The decoder carries out the usual entropy-decoding, inverse-quantization, inverse-transformation, and summation to obtain the reconstructed SAPs of each frame. It then uses a predefined graphics model to reconstruct the scene from the SAPs. Regardless of the original frame size ( image width x height ), data rates as low as 0.3 kbps can be obtained when the video is played at 20 frames per second.
Keywords: Video Compression, MPEG4, Active Contour Model, Scene Graph, Low Bit Rate, Table Tennis
Table Tennis ( Ping Pong ) is arguably the world's second or third most popular sport, with an estimated 300 million participants [1,2]. The sport is particularly popular in Europe and Asia. A match between top players in China can easily attract tens of millions of spectators. Numerous video clips of Table Tennis matches and lessons have been distributed on DVDs or over the Internet[2]. Therefore, it is desirable to develop a technique that compresses these video clips with a very high compression ratio while still maintaining high-quality video playback. Traditional video compression standards like MPEG4 and H.261 specify various techniques for compressing videos. Figure 1 shows a typical MPEG4 video compression/decompression process[3,4,5]. In the figure, T represents a transformation like DCT ( Discrete Cosine Transform ) and Q represents quantization; T⁻¹ and Q⁻¹ are the corresponding inverse transformation and inverse quantization. ME is motion estimation and MC is motion compensation; uFn denotes an unfiltered frame.
While these standards do a good job of providing generic methods of compression, they are not customized to exploit the special features of a particular kind of video, and so they cannot use those features to achieve very high compression performance.
In this article, we present a framework that is customized for Table Tennis ( TT ) video clip compression; by utilizing the special features of a TT match, we can achieve very high compression ratios by combining techniques of image processing and computer graphics. Our main concern here is the compression of TT video clips for distribution; the compression is done offline and only once, whereas the decompression is done many times and possibly online. Therefore, the compression stage is allowed to be time-consuming and tedious as long as the decompression is fast.
When watching a match, most TT fans, including the author, are more interested in the players' movement and the ball motion than in the background audience or other irrelevant features like the texture of the floor or the color of the table. The central theme of our framework is to replace the actual scene in a frame with simulated graphics. Because of the characteristics of a TT match, only a small number of parameters are required to generate the simulated graphics. The graphics parameters become the video data and can be further compressed using traditional compression methods.
Figure 2 shows the framework of our compression method, where SAPn is the set of Scene Animation Parameters generated from the image frame Fn using image processing techniques. As shown in the figure, SAPn can be compressed by a simple DPCM model[3] coupled with a transformation T and quantization Q; T can be a DCT or, in its simplest form, an identity transformation that leaves the prediction errors e unchanged. The lossless compression could be arithmetic coding, Huffman coding, or run-level coding followed by Huffman coding. The prediction errors e are the differences between the current 'frame' SAPn and the previous 'reconstructed-frame' SAPn-1[3]. In the decoding process, the reconstructed scene animation parameters ( after going through lossless decompression, inverse quantization, inverse transformation and summation ) are fed into a scene animation model, which reconstructs the frame from the SAPs.
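The prediction loop described above can be sketched in a few lines. This is an illustration only, not the program used in the experiments: it assumes that T is the identity, that Q is a uniform quantizer with a hypothetical step size `step`, that the predictor is simply the previous reconstructed SAP vector, and that the error is taken as current minus predicted. The entropy-coding stage is omitted and all function names are illustrative.

```python
def dpcm_encode(frames, step=1):
    """Closed-loop DPCM over a sequence of SAP vectors (lists of ints).

    Assumptions: identity transform T, uniform quantizer Q with step `step`,
    prediction = previous reconstructed SAP vector (zeros before frame 0).
    """
    prev = [0] * len(frames[0])
    residuals = []
    for sap in frames:
        # Quantized prediction error e = Q(current - predicted).
        q = [round((cur - p) / step) for cur, p in zip(sap, prev)]
        residuals.append(q)
        # Track what the decoder will reconstruct, so that quantization
        # errors do not accumulate from frame to frame.
        prev = [p + qi * step for p, qi in zip(prev, q)]
    return residuals


def dpcm_decode(residuals, step=1):
    """Inverse quantization followed by summation with the previous frame."""
    prev = [0] * len(residuals[0])
    frames = []
    for q in residuals:
        prev = [p + qi * step for p, qi in zip(prev, q)]
        frames.append(prev)
    return frames


# Three hypothetical frames, each holding three SAPs.
saps = [[512, 300, 10], [515, 298, 10], [520, 295, 12]]
residuals = dpcm_encode(saps)
restored = dpcm_decode(residuals)
```

After the first frame the residuals are small numbers clustered around zero, which is what makes the subsequent entropy coding effective. With `step=1` the round trip is lossless; a larger step trades accuracy of the SAPs for fewer bits.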
Figure 3a shows a typical scene of a TT match. As shown in the figure, we need parameters to describe the following objects: the ball, the table, and the players ( each with a paddle ).
The ball and the table are rigid objects and are relatively easy to define and reconstruct. The players are deformable and are more difficult to track. The simplest object to specify is the ball; all we need is its pixel position in the current frame. We can use two 10-bit numbers to denote the coordinates of the ball center. Processing the table image is also simple; the table can be specified by the coordinates of its four corners. Since a player always holds her paddle, we regard a player-and-paddle as one object and, where there is no ambiguity, sometimes refer to it simply as a player. Obviously, a player is deformable. Tracking a human in a video and animating it is a state-of-the-art research problem. A lot of research on human synthesis and animation has been done in the past decade[6,7]. Sophisticated open-source applications for modeling and creating human graphics are also available[8]. In the following sections, we shall discuss recognizing, locating, specifying, and reconstructing the ball, the table and the players.
Of all the objects listed above, the table is the easiest to locate as it is a rigid object and has a large size; its position does not change rapidly from frame to frame. We also know that it is blue. ( If it is not, we can obtain the color by first watching a few frames of the video and making an estimate of its color components. As we have mentioned earlier, the compression stage could be tedious. After all, under normal conditions, a table tennis video clip is always edited before it is distributed. ) To extract the table from an image frame, we first segment the image into regions. Many segmentation techniques have been developed in the past few decades[9-14]. In our work, we adopted the region-growing techniques [13,14] based on statistical concentration inequalities[15,16] to segment the images and to identify the table. In this method, an image is partitioned into regions that satisfy a similarity criterion such as matching a color or reaching a brightness level. At the beginning, a region starts with a point ( a pixel ) and "grows" by grouping all neighboring points that possess a similar property. Specifically, our similarity criterion is a statistical predicate[15,16]. This is basically a union-find problem and can be implemented efficiently with a method developed by Tarjan[17]. Figure 3b shows the segmented regions of the image shown in Figure 3a based on this statistical method.
Figure 3a. A typical scene of a TT match.
Figure 3b. Segmented regions of the image in Figure 3a.
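The union-find formulation of region growing can be illustrated with a short sketch. It is a simplification under stated assumptions: the image is a small grey-level grid rather than a color frame, and the statistical predicate of [15,16] is replaced by a placeholder test that merges two regions when their mean intensities differ by at most a hypothetical `threshold`. The disjoint-set structure uses union by size with path compression; the function names are illustrative.

```python
def find(parent, i):
    """Return the representative of i's region, compressing the path."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i


def segment(image, threshold):
    """Label each pixel of a grey-level grid with its region representative."""
    h, w = len(image), len(image[0])
    parent = list(range(h * w))
    size = [1] * (h * w)
    total = [image[y][x] for y in range(h) for x in range(w)]  # intensity sums

    # Collect 4-connected neighbor pairs and visit the most similar first.
    edges = []
    for y in range(h):
        for x in range(w):
            i = y * w + x
            if x + 1 < w:
                edges.append((abs(image[y][x] - image[y][x + 1]), i, i + 1))
            if y + 1 < h:
                edges.append((abs(image[y][x] - image[y + 1][x]), i, i + w))
    edges.sort()

    for _, a, b in edges:
        ra, rb = find(parent, a), find(parent, b)
        if ra == rb:
            continue
        # Placeholder predicate (assumption): compare the region means.
        if abs(total[ra] / size[ra] - total[rb] / size[rb]) <= threshold:
            if size[ra] < size[rb]:
                ra, rb = rb, ra
            parent[rb] = ra  # union by size: attach the smaller region
            size[ra] += size[rb]
            total[ra] += total[rb]

    return [[find(parent, y * w + x) for x in range(w)] for y in range(h)]


# A tiny hypothetical frame: a dark area on the left, a bright strip on the right.
frame = [[10, 11, 200],
         [12, 10, 201],
         [11, 12, 199]]
labels = segment(frame, threshold=20)
```

Once the regions are labeled, a large region whose mean color is close to the expected table color would be a natural candidate for the table.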
There has been a considerable amount of research on ball localization and tracking in videos[18,19,20], where one reconstructs the trajectory of a moving ball by tracking the ball through the frames of a video sequence. The motion of the ball in a sports game usually has a characteristic trajectory depending on the type of game played. The characteristic trajectory is exploited to help track the ball. However, it is not easy to follow a fast-moving ball, because the rapid movement of the ball can cause its image to appear fuzzy in a frame; the ball appears as an elongated smear without sharp contours, which confuses most computer vision algorithms. Figure 4 shows an example of this phenomenon.
Figure 4. Motion-blur image of the ball
If the player is left-handed, the roles of Left Hand and Right Hand are interchanged. Each component is then tracked by the Active Contour Model ( ACM ). At the very beginning, a user has to provide a rough initial contour for each component. The ACM algorithm then iterates a number of times to find a better estimate of the actual component contour; these contours are then used as the starting point for finding the component contours in the next frame.
A player is then modeled as the set of components listed above connected by joints and is conveniently represented by a tree structure; the nodes of the tree define all the player components. For simplicity, we assume each component has only one degree of freedom ( DOF ) of movement with respect to its parent. We also need a few parameters to specify the orientation and position of the player as a whole. Moreover, we define a player model in its neutral state. Player Animation Parameters ( PAPs ), consisting of deviations of joint angles from the neutral state, are used to describe the state of the player at a certain instant. ( Note that the set of PAPs is a subset of the set of SAPs. ) When a neutral player model is deformed, a sequence of PAPs is generated. From the PAPs we obtain the joint angles, and once the joint angles are specified, the state of the animated player can be found by forward kinematics[22].
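The forward-kinematics step can be sketched for a planar ( 2D ) simplification of the component tree. The sketch assumes that each component is a rigid segment of fixed length, that a child is attached at the far end of its parent, and that each PAP is an angle deviation in radians added to the component's neutral joint angle. The component names, lengths and angles below are hypothetical and are not taken from the player model.

```python
import math


class Component:
    """A rigid segment with one rotational DOF relative to its parent."""

    def __init__(self, name, length, neutral_angle, parent=None):
        self.name = name
        self.length = length                # joint to far end of the segment
        self.neutral_angle = neutral_angle  # radians, relative to the parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def forward_kinematics(root, paps, origin=(0.0, 0.0), heading=0.0):
    """Return {name: (x, y)} giving the far end of every component.

    `paps` maps a component name to its deviation from the neutral angle;
    components that are not listed stay in the neutral state.
    """
    positions = {}
    stack = [(root, origin, heading)]
    while stack:
        node, base, parent_angle = stack.pop()
        # Absolute angle = parent's absolute angle + neutral angle + PAP.
        angle = parent_angle + node.neutral_angle + paps.get(node.name, 0.0)
        end = (base[0] + node.length * math.cos(angle),
               base[1] + node.length * math.sin(angle))
        positions[node.name] = end
        for child in node.children:
            stack.append((child, end, angle))
    return positions


# Hypothetical three-component chain: torso -> upper arm -> forearm.
torso = Component("torso", 2.0, math.pi / 2)
upper_arm = Component("upper_arm", 1.0, -math.pi / 2, parent=torso)
forearm = Component("forearm", 1.0, 0.0, parent=upper_arm)

neutral = forward_kinematics(torso, {})
bent = forward_kinematics(torso, {"forearm": math.pi / 2})
```

Because a child's pose is computed from its parent's absolute angle, a single PAP on a joint near the root moves the whole subtree below it, which is why a small number of angles is enough to describe the pose of the player.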
Graphics techniques are employed to reconstruct the scene based on the decoded SAPs. We have used OpenGL and SDL ( Simple DirectMedia Layer )[39] to render the graphics. Rendering the table and the ball is straightforward. Modeling and synthesizing humans is a lot more complex, as they are multibody or articulated objects. Reference 6 gives a detailed discussion of human modeling and animation and of current research on this topic. In general, to give a human a realistic look, each body part is constructed as a mesh[6]. At the moment, we simplify the human model by representing it as a scene graph, and we use only four graphics primitives ( sphere, cylinder, cube and cone ), together with transformations, to represent any body part. For example, the eyes of a player can be represented as scaled spheres, and the nose as a scaled cone. A scene graph is a hierarchical approach to describing objects and their relationships to each other. By making use of scene graphs to describe objects, we can separate scene representation from rendering. When a node undergoes a transformation like translation or rotation, all its children go through the same transformation too. The Virtual Reality Modeling Language ( VRML )[40] is an example of using scene graphs ( internally ) to describe the three-dimensional world. In our work, we have used the simple object-oriented scene graph toolkit developed by Edward Angel[41] to reconstruct the players in 3D; the toolkit is an open-source library whose foundation is supplied by OpenGL and Unix; the library includes some basic objects for creating a scene: camera, light, attributes, transformations and some geometry objects, including spheres, cylinders, cubes and cones. A player in the neutral state is constructed using this scene graph toolkit and is then transformed ( or "deformed" ) using the decoded PAPs to obtain the current state.
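The way a transformation on a node propagates to its children, and the separation of scene representation from rendering, can be illustrated with a very small scene graph. This sketch does not use the toolkit of [41] or any OpenGL call; the `Node` class and `flatten` function are hypothetical, and the transformation is restricted to a uniform scale plus a translation to keep the arithmetic short. Traversal produces a flat draw list that a renderer could consume.

```python
class Node:
    """Scene graph node: an optional primitive plus a local transformation."""

    def __init__(self, name, primitive=None, scale=1.0, offset=(0.0, 0.0, 0.0)):
        self.name = name
        self.primitive = primitive  # "sphere", "cylinder", "cube", "cone" or None
        self.scale = scale          # uniform scale relative to the parent
        self.offset = offset        # translation expressed in the parent's frame
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child


def flatten(node, parent_scale=1.0, parent_pos=(0.0, 0.0, 0.0)):
    """Depth-first traversal returning (name, primitive, scale, position)."""
    scale = parent_scale * node.scale
    pos = tuple(p + parent_scale * o for p, o in zip(parent_pos, node.offset))
    draw_list = []
    if node.primitive is not None:
        draw_list.append((node.name, node.primitive, scale, pos))
    for child in node.children:
        # Children inherit the accumulated transformation of their ancestors.
        draw_list.extend(flatten(child, scale, pos))
    return draw_list


# Hypothetical fragment of a player: a head with a nose attached to it.
player = Node("player", offset=(1.0, 0.0, 0.0))
head = player.add(Node("head", "sphere", scale=0.5, offset=(0.0, 2.0, 0.0)))
nose = head.add(Node("nose", "cone", scale=0.2, offset=(0.0, 0.0, 1.0)))

before = flatten(player)
player.offset = (3.0, 0.0, 0.0)  # translate the whole player
after = flatten(player)
```

Only the root node was changed between the two traversals, yet the head and the nose both move with it; the draw list itself knows nothing about how it will be rendered.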
We applied the above framework to compress a few short table tennis video clips downloaded from the Internet. In general, the program can track the table and the ball fairly well. However, the players, who are deformable, are more difficult to track; we have to adjust the component contours manually in some of the frames when the ACM fails to find the desired contours. So at the moment, our program only works semi-automatically; some manual work is required to obtain a decoded video that closely resembles the original one. However, it does demonstrate the concept of compressing table tennis video clips for distribution by combining techniques of image processing and computer graphics. If we assume a frame rate of 20 frames per second ( fps ), we can make an estimate of the bit rate in the worst case, where no compression on the SAPs is performed and each SAP requires a 10-bit number to represent. There are about 40 SAPs in total. The bit rate R under this situation is given by R = 40 SAPs/frame x 10 bits/SAP x 20 frames/s = 8000 bits/s = 8 kbps.
With compression on the SAPs, the bit rate R could be reduced to
Note that the bit rate is independent of the frame size. Even if the video image size is very large, we can still transmit at a rate of 0.2 to 2 kbps for a reasonably good playback.
We have presented a framework for compressing table tennis video clips by combining image processing and computer graphics techniques. A statistical region-growing segmentation algorithm and the Hough transform are used to locate the table in a frame. Simple correlation and interpolation techniques are used to locate the ball. The Active Contour Model is used to track the players, which are deformable. Scene animation parameters ( SAPs ) of each frame are extracted and compressed with a simple DPCM model and entropy-encoding. The reconstructed SAPs are fed to a graphics model to reconstruct the playing scene. Experiments done on a few short table tennis video clips show that the framework is feasible and yields a very low bit rate, though further fine-tuning of the Active Contour Model is needed to make the compression process fully automatic and flawless.