Data Scraping

Time to get some automation in place.  Rather than having to enter the draw by hand twice a week, it would be so much better to have the application grab the draw the morning after it’s made.  This is interesting stuff.  My application is going to access another server, grab the HTML, parse it and store the results in my database.  This kind of scripting has many uses; automating and creating my own data feeds is something I’ve done in the past.  I once purchased a subscription weather service, extracted key values from it and stored them in a database for use in a DSL line-quality metrics application.

A little poking around in Google searches yielded a gem called “Hpricot”.  Hpricot looked pretty cool.  It let you parse HTML and pull out divs and spans based on their HTML class and id.  Even better, they had a website where you could try it all out interactively.  I quickly realized that this would suit my needs exactly.

The MegaMillions site looked like it had been put together by people who understood what they were doing.  I say this because a great many sites aren’t.  This site had well-structured CSS and consistent naming conventions.  Extracting the draw data was as simple as:

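require 'hpricot'
require 'open-uri'

# Fetch the draw page and select the date, picks and Mega Ball divs by CSS class
doc = Hpricot(open("http://megamillions.com/includes/numberData_home.asp"))
draw_dates = doc/"div.num_date"
draw_picks = doc/"div.num_num"
draw_mega = doc/"div.num_mb"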

No more of that old string-parsing code; this method just returned exactly what I needed.  Brilliant!

I came up with a little date calculation algorithm to determine whether the application should go and get the draw.  This was based on how many days had passed since the last draw recorded in the database.  For this I needed to remember to add “require 'date'”.  And, just for fun, I decided to add a group of nine buttons, which would allow you to review the past nine draws.
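
For anyone wanting to try something similar, here is a minimal sketch of one way such a date check could look.  It is not necessarily the algorithm I used: it assumes draws fall on Tuesdays and Fridays, and the names DRAW_WDAYS, latest_draw_before and fetch_needed? are made up for illustration.

```ruby
require 'date'

# Assumption: draws take place on Tuesdays and Fridays (Date#wday 2 and 5).
DRAW_WDAYS = [2, 5].freeze

# Most recent draw date strictly before `today`. A draw made tonight
# only becomes available the following morning, so today is excluded.
def latest_draw_before(today)
  day = today - 1
  day -= 1 until DRAW_WDAYS.include?(day.wday)
  day
end

# True when a draw has taken place that is not yet in the database.
# `last_recorded` is the Date of the newest stored draw, or nil if none.
def fetch_needed?(last_recorded, today = Date.today)
  last_recorded.nil? || latest_draw_before(today) > last_recorded
end
```

Calling fetch_needed? with the date of the newest draw in the database tells the application whether a scrape is worth attempting, which keeps it from hitting the other server when there is nothing new to collect.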

Random Rules of Programming – #2 in an occasional series

Once a process has been started, it should be possible to interrupt and terminate it before completion.

Ideally, any changes already completed should be “rolled back” to the state they were in before the action was initiated.  It is recognized that this latter part can be an issue in some environments, which lack transaction-style processing.  Therefore, suitable warnings should be given when an irreversible action is about to be performed.
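
Where no transaction support is available, the rule can be approximated by hand.  The following is only a sketch under assumptions: each step is a hash carrying a :do and an :undo callable, a shape invented here for illustration, and run_with_rollback is a hypothetical helper rather than part of any library.

```ruby
# Runs each step in order. If a step is interrupted (e.g. Ctrl-C) or
# fails, the steps already completed are undone in reverse order.
# Returns true when every step finished, false when a rollback happened.
def run_with_rollback(steps)
  completed = []
  steps.each do |step|
    step[:do].call
    completed << step
  end
  true
rescue Interrupt, StandardError
  completed.reverse_each { |step| step[:undo].call }
  false
end
```

A step whose effect cannot be undone has no meaningful :undo, which is exactly the case where the user should be warned before the step runs.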