Skip to end of metadata
Go to start of metadata


It is commonplace today to search for images on a search engine such as Google by searching for keywords that occur near the image. This works well when looking for a specific object like recycling bins, but does not perform so well when looking for something more vague, like a building with trees around it. In the latter case, the seeker probably has an image that is like what they want to find, and it would be easier to search with the image as the query than search words. This search method is known as "content-based image retrieval (CBIR)," and works by finding images from a database with similar characteristics to the query image.


The challenge in creating a CBIR system is determining how to compare images to determine which are 'similar' to the query. Color histograms and texture energy maps of images are two descriptors that compare the overall look of an image, and thus do a reasonable job of finding similar pictures.
Most photos have different colors and textures in different regions, so it is important to compare these descriptors region-by-region in the image. Regions can be simply defined by dividing an image into a grid (such as 4X4). For each region, a color histogram in LUV space (in which distances are perceptual) and Law's texture energy measures are calculated.
A quick way to compare two histograms is by finding the intersection. The equation is below:

By subtracting the main part of the equation from one, this distance metric returns 0 for identical images, and 1 for completely disjoint ones. An unfortunate drawback of histogram intersection is that it only compares the same bins, so if a histogram is shifted by one bin, the distance will be larger than it is perceptually. Color histograms are 3-dimensional, so more advanced distance metrics are more difficult to compute, so since histogram intersection gives reasonable results, it is what this program uses. A bin size of 16 was chosen by comparing results entirely dependent upon color and judging 16 bins to be more effective than 8.
The Law's texture energy maps are generated by individually correlating 16 different filters with the original (grayscale) image (first processed to remove illumination by subtracting the mean value of each 16X16 pixel region from each pixel). Some maps can be combined because they are essentially the same, resulting in 9 energy values per pixel. As in creating color histograms, the images are sectioned off into 16 rectangular regions, and the average for each of the nine 'channels' is computed. Two images' textures are compared by taking the sum of the L2 (Euclidean) distances between each map at each region.
Computing color histograms and texture maps is time-intensive, so it is important to save the descriptors for each image in a database that is used in the search, rather than re-computing them each time.


This program was implemented in python, which has a module shelve that is useful for saving image descriptors (numpy arrays). Each image takes about .6s to describe, and the available database contains 1316 pictures, so the descriptors took about 15 minutes to compute. The color histogram is initially a 4x4x16x16x16 array, and the textures are 9x4x4. To store them in a single array, both are flattened, and then concatenated with hstack. The resulting 1x65680 array is the final descriptor, and the histogram can be accessed by descriptor[:65536], and the textures by descriptor[65536:]. Two of either of these numpy arrays can be compared quickly by using numpy's methods. The two distances are then combined by adding them together. Since the range of the texture distance is so much greater than that of the color, it is first divided by 25, which was empirically determined to bring the values near to the color distances.
Searching is straight-forward: simply compute the distance between the target and every other image, sort the images by their distance from the target, and select the closest ones to display to the seeker. Despite expediting the process by using numpy arrays, the search can still take a couple minutes to execute, which is unacceptable if one desires to perform several queries.
This problem is alleviated by pre-clustering using k-means. This works by arbitrarily selecting k images as starting means, clustering the images to the 'mean' to which they are nearest, and then, from the clusters, calculating new means. This process is repeated until the means do not change from one iteration to the next. Once the images are clustered, searches can be restricted to the cluster to which the query belongs. This means that far fewer images must be compared, and therefore the search takes less than 10s.


Since clustering took a long time (almost two minutes per iteration), I stopped when the means moved by a combined distance of less than 1. Apparently this was not long enough to get good clusters (or maybe it would help to run it multiple times with different starting means). In the first image below (of a classroom), notice that the 15-means clustering managed to separate images 2 and 6 into different clusters. Another example of erroneous clustering is that the image of two blue recycling bins is in the 10-means clustering, but not in any of the others, even though it does have a low distance.

This image retrieval does not perform too well when looking for a specific object. There are images in the database of blue bins in different environments so that the overall pictures do not look alike, and this program will not find them. In the last search (of the building surrounded by trees), on the other hand, it is plausible to say that all but the image of people are 'matches' to the query. In comparison, a Google-search for "blue recycling bin" gives many good images, but a search for "building by trees" is not as good, as show below.


This program has a wide range of precision, based on the query. When looking for specific objects, a 1/12 success rate is typical, but for the right image, a nearly perfect return is possible. This is partly due to the limited size of the image bank available, as there are not images that truly match certain queries. It would be possible to limit with a threshold, but for broad searches, relatively large distances can return good matches, so that is not practical to assign a threshold for all searches. A better solution for finding specific objects is to do shape-matching, as others have done, that extracts a shape from the query (through clustering in the image, or by edge detection), and searches for similar shapes in the bank.

  1. That's interesting. I was criticizing how the clustering was working, but I never tried to visualize where the pictures were plotted? Surely I could have made a graph, and looked to see if there even were clusters.

    I must have learned a thing or two this summer!