
In this project, we implemented a 2D object classification system. The system assumes that the object in question is dark and stands out against a completely white background. We first worked on classifying still images and then implemented a live video version. To start, the program runs through a user-specified folder of labeled training images and calculates a feature vector for each image. The label for each training image is taken from its filename, which tells the system which object category the image belongs to. Together, the training labels and their associated feature vectors form a database of known objects, which is written out to a file for future use.
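As a rough illustration of the training step, the sketch below derives a label from a filename and writes the label/feature pairs to a file. The filename convention (label is the text before the first dot) and the CSV layout are assumptions made for this example, and the function names are hypothetical; the write-up does not specify either.

```python
import csv
from pathlib import Path


def label_from_filename(path):
    """Return the label, assumed here to be the filename text before the first '.'."""
    return Path(path).name.split(".")[0].lower()


def write_database(rows, out):
    """Write (label, feature_vector) pairs as CSV rows: label, f1, f2, ..."""
    writer = csv.writer(out)
    for label, features in rows:
        writer.writerow([label, *features])


def read_database(src):
    """Read the CSV back into a list of (label, [float, ...]) pairs."""
    return [(row[0], [float(v) for v in row[1:]]) for row in csv.reader(src) if row]
```

Both functions take an open file-like object, so the same code can write to disk or to an in-memory buffer.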

To compare images, we used the following features: bounding box fill ratio, bounding box width/height ratio, major/minor axis ratio, and the first six Hu moments. (We left out the seventh because it is intended for distinguishing mirrored images, and we were not sure whether it was translation-, rotation-, and scale-invariant.) The first two features are calculated using built-in OpenCV functions for outlining a region and finding its oriented bounding box (findContours, minAreaRect, contourArea). The major/minor axis ratio is calculated from the axes of a bounding ellipse. The Hu moment features come from the built-in OpenCV functions moments and HuMoments.
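To make the Hu moment features more concrete, here is a minimal pure-Python sketch of the standard formulas for the first six Hu moments of a binary region. It is not the project's code, which relies on OpenCV's moments and HuMoments; the function name and the list-of-rows mask format are assumptions for this example.

```python
def hu_moments(mask):
    """Return the first six Hu moments of a binary mask (list of rows of 0/1)."""
    pixels = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not pixels:
        raise ValueError("mask contains no foreground pixels")
    m00 = float(len(pixels))
    cx = sum(x for x, _ in pixels) / m00
    cy = sum(y for _, y in pixels) / m00

    def eta(p, q):
        # Central moment (translation-invariant), normalized by a power of
        # the area so that it is also scale-invariant.
        mu = sum((x - cx) ** p * (y - cy) ** q for x, y in pixels)
        return mu / m00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = n30 + n12, n21 + n03
    c, d = n30 - 3 * n12, 3 * n21 - n03
    return [
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        c ** 2 + d ** 2,
        a ** 2 + b ** 2,
        c * a * (a ** 2 - 3 * b ** 2) + d * b * (3 * a ** 2 - b ** 2),
        (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b,
    ]
```

Shifting the region within the image or rotating it by 90 degrees should leave all six values unchanged up to floating-point error, which is what makes them useful as classification features.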

Figure 1. Examples of connected component analysis output, color-coded by region

We defined two struct types to store data related to input images: a FeatureVector struct, to hold the features calculated for a particular image; and an ImageInfo struct, containing the original image, thresholded image, connected components, contours, bounding box, feature vector, axis endpoints, and label. Any time we read in an image, we compile the ImageInfo data for it so that we can classify and display it.
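The two record types could look roughly like the following, sketched here as Python dataclasses standing in for the structs. The field names, types, and defaults are assumptions for illustration; only the list of stored items comes from the description above, and the fill ratio is assumed to mean region area divided by bounding box area.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

Point = Tuple[float, float]


@dataclass
class FeatureVector:
    fill_ratio: float = 0.0        # assumed: region area / bounding box area
    box_aspect_ratio: float = 0.0  # bounding box width / height
    axis_ratio: float = 0.0        # major axis / minor axis
    hu: List[float] = field(default_factory=lambda: [0.0] * 6)

    def as_list(self) -> List[float]:
        """Flatten to a plain list so a classifier can treat it as a vector."""
        return [self.fill_ratio, self.box_aspect_ratio, self.axis_ratio, *self.hu]


@dataclass
class ImageInfo:
    original: Any = None           # image data; the type depends on the image library
    thresholded: Any = None
    components: Any = None
    contours: List[Any] = field(default_factory=list)
    bounding_box: List[Point] = field(default_factory=list)
    features: FeatureVector = field(default_factory=FeatureVector)
    axis_endpoints: List[Point] = field(default_factory=list)
    label: str = "unknown"
```

Keeping everything for one image in a single record means the classification and display code can each be handed one object.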

After reading in and processing the training images, our program enters classification mode. The user has two options: classifying still image input or real-time video input. Still image input is classified image by image, and each image is displayed individually as shown below. Video input is classified on each frame, with the label and all features updating onscreen in real time. With video input, the donut sometimes needed extra lighting for the camera to focus on it correctly.

Figure 2. A classified wrench as original image and thresholded image with bounding box, contour, axes, and features

Figure 3. A classified spatula as original image and thresholded image with bounding box, contour, and axes

Figure 4. A classified shovel as original image and thresholded image with bounding box, contour, and axes

For classification, we implemented two systems: one based on scaled Euclidean distance and one KNN classifier. The first loops through every label in the database and calculates the average scaled Euclidean distance between each feature vector associated with that label and the input image's feature vector. It classifies the image as the label with the smallest average distance. The KNN classifier calculates the distance from the input image to each individual feature vector in the database, sorts them so that the smallest distances come first, and then finds the most common label among the top K matches. We found that the two classifiers performed about equally well, though the Euclidean distance classifier may classify specific objects slightly better.
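The two classifiers can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the database is taken to be a list of (label, feature list) pairs, and "scaled" is assumed to mean dividing each feature difference by that feature's standard deviation over the training set, which the write-up does not spell out. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def feature_stdevs(database):
    """Per-feature standard deviation over all training vectors."""
    vectors = [vec for _, vec in database]
    n = len(vectors)
    stdevs = []
    for column in zip(*vectors):
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        stdevs.append(math.sqrt(var) or 1.0)  # fall back to 1.0 for constant features
    return stdevs


def scaled_distance(a, b, stdevs):
    """Euclidean distance with each feature difference divided by its stdev."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, stdevs)))


def classify_mean_distance(query, database, stdevs):
    """Return the label whose training vectors are closest to query on average."""
    per_label = defaultdict(list)
    for label, vec in database:
        per_label[label].append(scaled_distance(query, vec, stdevs))
    return min(per_label, key=lambda lab: sum(per_label[lab]) / len(per_label[lab]))


def classify_knn(query, database, stdevs, k=3):
    """Return the most common label among the k nearest training vectors."""
    ranked = sorted(database, key=lambda item: scaled_distance(query, item[1], stdevs))
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]
```

Scaling matters because the features have very different ranges; without it, the feature with the largest raw values would dominate the distance.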

To test our system, we ran it on the given testing image set and obtained the following confusion matrix. Our system performed well on this set, which is rather similar to our training set.

Figure 5. The confusion matrix for the testing set, using the Euclidean distance classifier
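For reference, a confusion matrix like this can be tallied from paired true and predicted labels. The sketch below is a generic illustration with a hypothetical function name; it does not reproduce the numbers in Figure 5.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Return (labels, matrix) where matrix[i][j] counts items whose true
    label is labels[i] and whose predicted label is labels[j]."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix
```

Correct classifications land on the diagonal, so a strong classifier produces a matrix with few off-diagonal counts.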

Here, you can see a demonstration of our program identifying objects in real time. Given that the video length turned out to be exactly 1 minute and 30 seconds, we decided to put an anime opening over it, and chose the most fitting one, with the name "Silhouette". Hopefully you find it more exciting than the background noise of people in the lab talking.


For one extension, we had the system recognize when an object is unknown. For the scaled Euclidean distance classifier, this meant setting a distance threshold: if the smallest average distance found for any label is not below this threshold, we do not consider the label to be a true match and return "unknown" instead. For the KNN classifier, we implemented something similar: we do not return the most common label among the K nearest neighbors if those neighbors have too large an average distance from the input image. We had some trouble choosing the threshold, because a value that is too high allows false positive identifications, while one that is too low produces false negatives. Normalizing some of the features helped a little with this issue. An example of an unknown input appears in the video linked above.
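The unknown-object check can be sketched as a threshold applied on top of either classifier. The version below is a simplified illustration: it uses plain Euclidean distance and assumes the features were already scaled, and the function names and the max_distance parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict

UNKNOWN = "unknown"


def distance(a, b):
    """Plain Euclidean distance; assumes features were already scaled."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_mean_distance_or_unknown(query, database, max_distance):
    """Best label by average distance, or UNKNOWN if even that is too far."""
    per_label = defaultdict(list)
    for label, vec in database:
        per_label[label].append(distance(query, vec))
    averages = {lab: sum(d) / len(d) for lab, d in per_label.items()}
    best = min(averages, key=averages.get)
    return best if averages[best] < max_distance else UNKNOWN


def classify_knn_or_unknown(query, database, max_distance, k=3):
    """Majority label of the k nearest vectors, or UNKNOWN if they are too far on average."""
    ranked = sorted((distance(query, vec), label) for label, vec in database)[:k]
    if sum(d for d, _ in ranked) / len(ranked) >= max_distance:
        return UNKNOWN
    return Counter(label for _, label in ranked).most_common(1)[0][0]
```

Raising max_distance accepts more inputs as known objects, and lowering it rejects more, which is the trade-off between false positives and false negatives described above.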

For another extension, we display additional information on screen for video input. In addition to the thresholded image and the label, we show all of the feature values, the bounding box, the region contour, and the major and minor axes. The bounding box and axes are drawn with the OpenCV function for drawing a line; the contour is drawn with the built-in OpenCV function for drawing contours. All of these appear in Figures 2-4 above.

Special thanks to Walker for suggesting that we normalize the Hu moment data for Euclidean distances.