A few days ago I had my first experience with Amazon EC2. I had been wanting to try it for some time, and had some really nice use cases for it at my previous job (one program would have been awesome to rewrite as a map/reduce), but never got to use it. Well, recently that changed.
We have a process that runs quarterly to take postal code line data (maps) and simplify them (reduce the number of points) so that they can be used in a webapp that we use to maintain service provider territories (more details here). Due to the nature of the graph algorithms, we have to load the entire US (48 states) into RAM and apply the simplification to the whole graph at once; if we did it zip code by zip code, the simplified points along shared borders might not match up and we would end up with weird holes. The data used to fit into 4GB of memory, but something recently changed that pushed the memory usage above 10GB. We aren't sure what changed; most likely our data provider started sending us more detailed maps. Unfortunately, we don't appear to keep previous versions of the data, so we can't go back and see what was different.
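Our actual tool is a bit more involved and isn't shown here, but to give a rough idea of the kind of point reduction involved, here is the classic Douglas-Peucker line simplification algorithm in Python (the function names and tolerance are just illustrative, not our real code). The reason the whole graph has to be in memory at once is that a border shared by two zip codes needs to go through this exactly once, so both neighbors end up with the same simplified points.

```python
import math

def perpendicular_distance(pt, start, end):
    """Distance from pt to the straight line through start and end."""
    (x, y), (x1, y1), (x2, y2) = pt, start, end
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(x - x1, y - y1)
    # Standard point-to-line distance formula.
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / math.hypot(dx, dy)

def simplify(points, tolerance):
    """Douglas-Peucker: drop points closer than `tolerance` to the chord
    between the endpoints, keeping the overall shape of the line."""
    if len(points) < 3:
        return list(points)
    # Find the interior point farthest from the chord.
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > max_dist:
            index, max_dist = i, d
    # Everything is close enough to the chord: keep only the endpoints.
    if max_dist <= tolerance:
        return [points[0], points[-1]]
    # Otherwise keep the farthest point and recurse on both halves.
    left = simplify(points[:index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right
```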
My PC has 6GB of RAM, as do most of the other developers' machines. The last time we had to update the maps, one developer was out on vacation, and his machine had just 4GB. I went ahead and borrowed his memory, made some tweaks to the graph library that we use (for example, it was storing Z coordinates for every point, while our data only has X and Y, so getting rid of Z reduced the footprint by about a third), and ran the process. It still consumed about 12GB, so there was a lot of thrashing, and it tied up my PC for about 3 hours while it ran.
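The graph library isn't written in Python and the real change was a little more involved, but the Z tweak boils down to something like this (class names here are made up for illustration):

```python
class Coordinate3D:
    """Roughly what the library stored: x, y, and a z we never used."""
    __slots__ = ("x", "y", "z")

    def __init__(self, x, y, z=0.0):
        self.x, self.y, self.z = x, y, z


class Coordinate2D:
    """Our data is 2D only. Storing two values per point instead of three
    is roughly where the one-third memory saving came from."""
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x, self.y = x, y
```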
That developer now has a full 6GB of memory and isn't too keen on me taking it away from him, so I needed to find something better. As I said, I had been interested in EC2 before, and it turns out they have added "High Memory" instances since the last time I looked. These instances start at 17GB of memory and $0.50 per hour. I ordered up an instance based on an Ubuntu 10.04 image and installed the necessary tools. Our dataset is only about 80MB compressed, so uploading it was no trouble. Thankfully, there were also good scripts already written to run our process, and they only needed minor tweaks to run on Linux. I kicked off the process and watched the memory usage in another session. In the end, it consumed about 13GB of memory and ran in under 20 minutes. From start to finish, I was able to upload the data, decompress it, run the program, recompress the output, download the results, and shut down the instance in about 45 minutes.
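None of the actual commands made it into this post, but the whole round trip can be scripted with nothing more than scp and ssh. Here is a rough Python sketch of that sequence; the key file, hostname, archive names, and the run_simplification.sh wrapper are all made up for illustration:

```python
import subprocess

# Hypothetical values -- the real key, host, and file names aren't shown here.
KEY = "my-ec2-key.pem"
HOST = "ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com"

def ssh(command):
    """Run a command on the instance over SSH."""
    subprocess.check_call(["ssh", "-i", KEY, HOST, command])

def scp(src, dst):
    """Copy a file to or from the instance."""
    subprocess.check_call(["scp", "-i", KEY, src, dst])

# Roughly the sequence from the post: upload, unpack, run, repack, download.
scp("maps.tar.gz", HOST + ":~/")
ssh("tar xzf maps.tar.gz")
ssh("./run_simplification.sh")          # hypothetical wrapper around our process
ssh("tar czf results.tar.gz output/")
scp(HOST + ":~/results.tar.gz", ".")
# Shutting down the instance afterwards was done separately (e.g., from the AWS console).
```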
Doing all of this was a whole lot easier than I expected, but there were a few things I learned along the way:
- The Ubuntu images don't work too well on the new Micro instances (~630MB of RAM). I had ordered one up to tinker with before trying out the XL instance, since I would rather waste $0.02/hr than $0.50/hr while learning. After a reboot, the instance would no longer come back up. There is a solution that worked for me here: http://www.flevour.net/ubuntu-image-wont-reboot-if-running-t1-micro-amazon-ec2
- Any SSH communication with the instance is done via a private key file. It had been a long time since I had done that with PuTTY, so I needed a quick refresher. I had to download my private key from Amazon, run it through the key converter (PuTTYgen), attach it to PuTTY, and then log in as 'ubuntu' (no password required) after connecting to the instance.