15-463 Project 1: Images of the Russian Empire

For the first project of Computational Photography, students were asked to reconstruct color images from black-and-white photographs taken by Sergei Mikhailovich Prokudin-Gorskii. Each photograph was taken as three separate exposures on a single glass plate, one each through a red, green, and blue filter, so each exposure records the grayscale intensity of the corresponding color channel. Scans of the original plates are available from the Library of Congress' Prokudin-Gorskii archive. The results of my program running on the required images can be found here and the alignment offsets can be found here. The results of my program running on a few images I chose are here and their offsets are here. You can also view the original plate scans.

Given three aligned color channels, combining them into a color image is trivial with Matlab's cat function. The non-trivial part of this assignment lies in separating and aligning the channels from the scans. To separate the channels, I simply split the image into thirds vertically. As suggested on the project description page, I began by computing the L2 norm (also known as the sum of squared differences, or SSD) of the red and green channels against the blue channel over a [-15,15] window of offsets.
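The project itself was written in Matlab, but the idea is language-agnostic; here is a minimal NumPy sketch of the split-and-search step. The function names are mine, and np.roll's wraparound stands in for proper border cropping.

```python
import numpy as np

def split_plate(plate):
    """Split the vertical glass-plate scan into thirds.
    Top to bottom, the exposures are blue, green, red."""
    h = plate.shape[0] // 3
    return plate[:h], plate[h:2 * h], plate[2 * h:3 * h]

def ssd_align(channel, reference, window=15):
    """Try every (dy, dx) offset in [-window, window]^2 and return
    the one that minimizes the sum of squared differences."""
    best_score, best_offset = np.inf, (0, 0)
    for dy in range(-window, window + 1):
        for dx in range(-window, window + 1):
            shifted = np.roll(channel, (dy, dx), axis=(0, 1))
            score = np.sum((shifted - reference) ** 2)
            if score < best_score:
                best_score, best_offset = score, (dy, dx)
    return best_offset
```

Cropping the overlap instead of rolling would be more faithful at the borders, but the wraparound version keeps the sketch short.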

The SSD technique worked very well for the small sample images, but its computational cost on the larger scans is prohibitive. As suggested on the project description page, I instead computed the alignment recursively using a Gaussian pyramid for each image. The pyramid is essentially a coarse-to-fine search over the image: it first aligns downsampled versions of the two images, then uses that coarse alignment to pick a small window to search at the next larger scale. With the Gaussian pyramid and the SSD heuristic I was able to align all images quickly, but a little less than half of the images aligned properly.
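A NumPy sketch of that coarse-to-fine recursion, under my own assumptions: a 2x2 box average stands in for a true Gaussian pyramid level, the helper names are mine, and np.roll again stands in for cropping.

```python
import numpy as np

def _ssd_search(channel, reference, candidates):
    """Return the (dy, dx) offset among `candidates` with minimal SSD."""
    def score(off):
        return np.sum((np.roll(channel, off, axis=(0, 1)) - reference) ** 2)
    return min(candidates, key=score)

def _downsample(img):
    """Halve an image by averaging 2x2 blocks (a crude pyramid level)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return (img[0::2, 0::2] + img[1::2, 0::2] +
            img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def pyramid_align(channel, reference, window=2, min_size=32):
    if min(channel.shape) <= min_size:
        # Base case: exhaustive search on the smallest pyramid level.
        cands = [(dy, dx) for dy in range(-15, 16) for dx in range(-15, 16)]
        return _ssd_search(channel, reference, cands)
    # Align the half-size pair first, then refine the doubled coarse
    # offset within a small [-window, window] neighborhood.
    cy, cx = pyramid_align(_downsample(channel), _downsample(reference),
                           window, min_size)
    cands = [(2 * cy + dy, 2 * cx + dx)
             for dy in range(-window, window + 1)
             for dx in range(-window, window + 1)]
    return _ssd_search(channel, reference, cands)
```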

Running SSD directly over the channels is faulty because the intensity of one color channel may differ greatly from another. For example, a man wearing a highly saturated blue coat will have an intensity of 1.0 in the blue channel and 0.0 in the red and green channels. Such a large intensity difference penalizes the correct alignment under SSD, which caused many of the images to be heavily misaligned. While pixel intensities may differ across channels, the edges in each channel are positively correlated in most situations. By subtracting a Gaussian blur of each channel from the channel itself, I performed simple edge detection and fed the edge maps to the alignment algorithm. Running the program over a few images, it was evident that the sharp edges of the black and white striped borders were being matched more often than the content of the actual photographs. Thus, after the edge detection, I set the outer 10% border of each edge map to zero.
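A sketch of that edge step in NumPy: subtract a Gaussian blur of the channel from the channel itself, then zero the outer 10% border. The blur helper and the sigma/radius values here are my own choices, not the project's.

```python
import numpy as np

def gaussian_blur(img, sigma=1.0, radius=2):
    """Separable Gaussian blur via edge-padded 1-D convolutions."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    pad = np.pad(img, radius, mode='edge')
    # Convolve along rows, then along columns.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode='valid'), 1, pad)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode='valid'), 0, rows)

def edge_map(channel, border_frac=0.10):
    """High-pass the channel (channel minus its blur), then zero a 10%
    border so the striped plate edges don't dominate the alignment."""
    edges = channel - gaussian_blur(channel)
    h, w = edges.shape
    bh, bw = int(h * border_frac), int(w * border_frac)
    edges[:bh], edges[-bh:] = 0, 0
    edges[:, :bw], edges[:, -bw:] = 0, 0
    return edges
```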

After these improvements to the alignment algorithm, eight of the twenty images were still misaligned. I found that the refinement window in the pyramid search was too constrained. Increasing the window size from [-1,1] to [-2,2] brought me to eighteen of the twenty images correctly aligned.

For the two images that remained misaligned, the algorithm settled into a local minimum at one of the coarser pyramid levels that did not lead to the global SSD minimum. Interestingly, in the 01043u.tif image the red channel found the correct offset but the green channel did not. One possible solution would be to apply a median filter to the edge maps to remove the excess edge noise, since that noise appeared to be what drove the search into the wrong window.

I had some additional difficulties using Matlab for the first time. The richness of the Matlab libraries made this project both easier and harder to program. With so many built-in functions, much of the functionality I needed was already implemented for me; knowing which image and matrix manipulation routines I still had to write myself, however, was difficult precisely because my knowledge of those libraries is still so young. Among other things, I was confused by the interchangeable use of matrix and pixel access, where M(row, col) is the same as M(y, x), and by the use of column-major indices.

Finally, here are the resultant images.


Perl Amazon Scraper

A fellow CMU undergrad is working on a social recommendation website for movies, music, video games, restaurants, and the like. For his machine learning algorithm to find superb recommendations, he needs a diverse sample of user reviews for items in each of these categories. Being the awesome friend that I am, I offered to lend a hand by writing an Amazon Music scraper.

For simplicity of code (read: regular expressions), I decided to scrape the mobile version of Amazon. Luckily, Amazon was nice enough to further simplify the markup of its mobile site for the LWP, wget, and ChromeOS user agents.

To put some numbers out there, Amazon has roughly 2.5 million albums and 7.5 million reviews across those albums. The search pages used to index the albums show five albums apiece (likewise for the review listings). For each album, I have to request two additional pages: the product overview for the album and artist names, and the product description for the release year and track list. For each review, I only need the individual user review page, which holds the user's name and rating. My scraper will therefore pull down roughly 14.5 million pages. Assuming each page is roughly 4.5KB, the total data transferred will be approximately 65GB; on a 100 megabit connection, the transfer time alone amounts to around 1.5 hours. Initially, the Perl module was designed to make these requests sequentially. A typical SLA serves 99.9% of all requests within 300ms, so the total sequential latency amounts to nearly 50 days. Running the initial script for the first three hours, empirical data suggested the entire operation would take nearly 47 days (not too far off from my rough estimate!).
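Those back-of-the-envelope figures are easy to check in a few lines of Python. All constants come from the paragraph above; the 300ms figure is the assumed SLA latency, not a measured one.

```python
albums, reviews = 2.5e6, 7.5e6
per_index_page = 5                        # albums (or reviews) listed per page

search_pages = albums / per_index_page    # album index pages
review_index = reviews / per_index_page   # review listing pages
pages = search_pages + 2 * albums + reviews + review_index
assert pages == 14.5e6                    # ~14.5 million requests total

data_gb = pages * 4.5e3 / 1e9             # ~4.5 KB per page -> ~65 GB
transfer_h = data_gb * 8 / 0.1 / 3600     # 100 Mbit/s = 0.1 Gbit/s -> ~1.5 h
sequential_days = pages * 0.300 / 86400   # 300 ms each, in series -> ~50 days
```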

To solve the latency issue, I split the work over several threads on the same machine. Spawning one thread per album, the expected lifespan of a thread is 2.1 seconds (a figure still dominated by request latency). Once again assuming that a search page takes 300ms to load, the expected number of threads running at any given moment is 35, so I can expect the parallelized scraper to complete in just under 2 days. (If not for AFS orphaning TCP sockets and file descriptor symlinks, it would have. I have yet to find time to investigate this issue with my scraper, though I expect it not to surface outside a distributed network file system.)
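The 35-thread figure falls out of Little's law (mean concurrency = arrival rate x mean lifetime); here is a quick check, again reusing the constants from the text rather than anything measured.

```python
albums = 2.5e6
albums_per_search = 5      # albums spawned per search page
search_latency = 0.300     # seconds per search page (assumed SLA)
thread_lifetime = 2.1      # seconds per album thread, from the text

# Little's law: mean threads alive = arrival rate * mean time in system.
arrival_rate = albums_per_search / search_latency   # album threads born/sec
threads_in_flight = arrival_rate * thread_lifetime  # ~35

# The sequential crawl of the search pages now gates the whole scrape.
days = (albums / albums_per_search) * search_latency / 86400  # just under 2
```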


The next revision of this script will use the full Amazon website rather than the mobile version.



Where are my old projects?! I need to spend some time cleaning my room, my truck, and my hard drive. Something is bound to pop up!

Stop by on March 13th. I'm getting an itch to tidy up.
