Object Tracker using the Kinect Sensor

In this section we describe our implementation of the Roomba tracker. We specify the requirements the tracker must satisfy and the simplifying assumptions it may make. In the Base-station section we described how depth data can be streamed from the Kinect sensor to the workstation; we therefore begin this discussion by assuming the availability of a continuous stream of depth data.



Figure 1: Kinect camera.

In the first part of this document we briefly explain two approaches that researchers use to implement object tracking. In the last part we describe our tracker. The discussion is kept informal, although the underlying theory of object tracking in Computer Vision is mathematically involved.



Figure 2: Complete Roomba tracker.

Multiple Blob Tracking: A very short survey

Tracking a variable number of objects against a dynamic background is a considerably challenging task, due to uncertainty in object locations, occlusion by moving and static objects, and clutter in the environment. For our project, however, several simplifying assumptions make the tracker implementation easier: a static background, a pre-determined number of moving objects in the scene, known shapes of the moving objects, and known trajectories of these objects.

We will briefly review two approaches in tracking multiple objects that looked promising to us. These are:
  1. Tracking by detection, and
  2. Image likelihood.
We describe each in turn.

Tracking by detection

The main idea of this approach is as follows: find blobs in the frame at time t, then associate each of them with the blobs detected at time t-1.
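The association step above can be sketched as a greedy nearest-neighbour matching between blob centres at consecutive frames. This is a minimal illustration, not our final implementation; the function name and the `max_dist` gating parameter are our own choices.

```python
import numpy as np

def associate(prev, curr, max_dist=50.0):
    """Greedily match blob centres at time t-1 to centres at time t.

    prev, curr: lists of (x, y) blob centres at t-1 and t.
    Returns a dict mapping each index in prev to an index in curr,
    or to None when no centre lies within max_dist pixels
    (e.g. the blob left the scene or was occluded).
    """
    prev = np.asarray(prev, dtype=float)
    curr = np.asarray(curr, dtype=float)
    taken = set()
    matches = {}
    for i, p in enumerate(prev):
        # distances from this previous centre to all current centres
        d = np.linalg.norm(curr - p, axis=1)
        matches[i] = None
        for j in np.argsort(d):
            if int(j) not in taken and d[j] <= max_dist:
                matches[i] = int(j)
                taken.add(int(j))
                break
    return matches
```

Greedy matching can fail when blobs cross paths; a global assignment (e.g. the Hungarian algorithm) is more robust but heavier.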

Object tracking with visual data can be done using appearance-based methods. These methods require high-quality images, from which they compute colour histograms to statistically separate the background from the foreground. They then subtract the background from the image to obtain the moving foreground objects, which are segmented based on their shapes and associated with the previously found foreground objects.

This approach also works for dynamic cameras in dynamic background.

Image Likelihood

The image-likelihood method works well for static cameras in static backgrounds. These methods statistically model the background using a mixture of Gaussian distributions. Instead of subtracting the background from the image, they assign probability distributions over object configurations in the scene; the distributions are then updated by a particle filter.
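One particle-filter iteration of the kind described above can be sketched as follows. Each particle is a hypothesised object configuration; it is reweighted by a Gaussian image likelihood and the particle set is then resampled. The single-Gaussian noise model, the function names, and the `noise_std` parameter are our simplifying assumptions, not a prescribed algorithm.

```python
import numpy as np

def particle_update(particles, weights, observed, predict, noise_std=5.0):
    """One particle-filter update: reweight by a Gaussian likelihood,
    then resample.

    particles: (n, d) array of hypothesised object configurations.
    predict(p): depth value(s) configuration p would produce.
    observed:   the corresponding measured depth value(s).
    """
    # Gaussian likelihood of the observation under each hypothesis
    lik = np.array([
        np.exp(-0.5 * np.sum((observed - predict(p)) ** 2) / noise_std ** 2)
        for p in particles
    ])
    weights = weights * lik
    weights = weights / weights.sum()
    # systematic resampling concentrates particles in likely regions
    idx = np.searchsorted(np.cumsum(weights),
                          (np.arange(len(weights)) + 0.5) / len(weights))
    return particles[idx], np.full(len(weights), 1.0 / len(weights))
```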

Implementation of Kinect Object Tracker

The Kinect reports depth data for a fixed grid of 640x480 points in its view. The depth value at each point varies from frame to frame even when the background is static, so we statistically model the static background before placing the satellite stations in the scene. We compute the average of the depth values at each of these fixed points. Optionally, we can also compute the standard deviation and probabilistically model the depth at each point with a Gaussian distribution with the computed mean and standard deviation. We can further cluster the data at each point into k clusters and compute a mean and standard deviation per cluster, yielding a mixture of Gaussian distributions for the depth value at each fixed point.
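The per-pixel mean and standard deviation over a stack of background frames can be sketched as below; the function name and the simulated frame stack are our own, and NumPy is assumed.

```python
import numpy as np

def model_background(frames):
    """Per-pixel background model from a stack of depth frames.

    frames: array-like of shape (n_frames, 480, 640) of raw depth values.
    Returns the per-pixel mean and standard deviation, which parameterise
    a single Gaussian per grid point.
    """
    stack = np.asarray(frames, dtype=np.float64)
    return stack.mean(axis=0), stack.std(axis=0)
```

The k-cluster mixture variant would replace the single mean/std pair with k of them per pixel (e.g. via k-means on each pixel's depth history).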

After the background is statistically modelled, the four satellite stations are placed in the scene. We then subtract the background depth values from the current depth values and take the absolute differences, and apply a threshold to filter out small differences due to sensor noise. The resulting data mostly contains connected patches of depth data belonging to the satellite stations, along with some isolated background points incorrectly identified as foreground.
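The subtraction-and-threshold step can be sketched as follows; the threshold value is a hypothetical placeholder to be tuned against the actual Kinect noise level.

```python
import numpy as np

def foreground_mask(depth, bg_mean, threshold=50.0):
    """Binary foreground mask via absolute background difference.

    depth, bg_mean: (480, 640) arrays of current and modelled depth.
    threshold: cut-off in the sensor's depth units (assumed value;
    tune for the observed per-pixel noise, e.g. a multiple of the std).
    """
    diff = np.abs(depth.astype(np.float64) - bg_mean)
    return diff > threshold
```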

Next, the background-subtracted and thresholded depth data is convolved with a 30x30 window of ones. The purpose of this step is to find the points with the highest number of foreground depth values in their neighbourhood, which reasonably approximates the centres of the tracked satellite stations. We then sort these scores from highest to lowest and take the three highest feasible points as the locations of the tracked satellites in the scene. By feasible points we mean that the satellite stations cannot overlap each other, and therefore their centres must be separated by at least the diameter of a satellite station.
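The convolution and feasible-peak selection can be sketched as below. We implement the box convolution with an integral image (equivalent to convolving with a window of ones); the function names and the `min_sep` value, standing in for the satellite diameter in pixels, are our own.

```python
import numpy as np

def box_sum(mask, size=30):
    """Convolve a binary mask with a size x size window of ones,
    via an integral image (summed-area table)."""
    m = mask.astype(np.float64)
    h, w = m.shape
    ii = np.pad(m, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    r = size // 2
    y0 = np.clip(np.arange(h) - r, 0, h)
    y1 = np.clip(np.arange(h) + size - r, 0, h)
    x0 = np.clip(np.arange(w) - r, 0, w)
    x1 = np.clip(np.arange(w) + size - r, 0, w)
    # window sum = ii[y1,x1] - ii[y0,x1] - ii[y1,x0] + ii[y0,x0]
    return (ii[np.ix_(y1, x1)] - ii[np.ix_(y0, x1)]
            - ii[np.ix_(y1, x0)] + ii[np.ix_(y0, x0)])

def pick_centres(score, n=3, min_sep=60):
    """Greedily take the n highest-scoring points whose centres are
    at least min_sep pixels apart (min_sep ~ satellite diameter)."""
    centres = []
    for idx in np.argsort(score, axis=None)[::-1]:
        y, x = np.unravel_index(idx, score.shape)
        if all((y - cy) ** 2 + (x - cx) ** 2 >= min_sep ** 2
               for cy, cx in centres):
            centres.append((int(y), int(x)))
            if len(centres) == n:
                break
    return centres
```

The minimum-separation check is what enforces feasibility: once a centre is taken, any candidate closer than one satellite diameter is skipped.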