Try out the latest version of Skin Spider Live!
So far, Project Skin Spider is very exciting. I have been working off of a home-grown XML database and it is, not to boast, going awesome. I have created the get gallery links page. This page spiders over a thumbnail site, finds all links to galleries, and adds them to the queue. On the first pass, I was updating the in-memory query AND the data file for every single gallery link added. This was fine at first, but now, after some testing, I have over 700 items in the spider queue. Writing this file to disk for every link was causing some serious performance issues.
To overcome the spidering issues, I have added a "Commit" parameter to the AddRecord() method of the DatabaseService.cfc ColdFusion component. It defaults to true, flagging that at the end of the method call, all dirty tables should be committed to file. However, on pages such as the one mentioned above, I am adding so many items to the queue in one go that I want to flag the AddRecord() method to NOT commit files.
Then, once I am done adding all items to the queue, I call Commit() on the DatabaseService.cfc ColdFusion component. This will loop over all the tables and commit any dirty ones to file. A query table is considered dirty if it contains information that is NOT reflected in the data file.
To accommodate those changes, I had to pull the call to write to file out into its own function, Commit(). Beforehand, any time you added a row, only that table would be written to file. I have added a flag, IsDirty, to all in-memory queries. The Commit() method now loops over all tables, checks to see if they are dirty (IsDirty = true) and if so, writes them to file.
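Here is a minimal sketch of how the dirty flag and the deferred commit could fit together. This is not the actual DatabaseService.cfc: the Init() method, the argument names (Table, Data), the single "url" column, and the use of WDDX as the file format are all assumptions made for illustration.

```cfm
<!--- DatabaseService.cfc (simplified sketch, not the real component) --->
<cfcomponent output="false">

	<cffunction name="Init" access="public" returntype="any" output="false">
		<cfargument name="DataDirectory" type="string" required="true" />

		<cfset VARIABLES.DataDirectory = ARGUMENTS.DataDirectory />
		<cfset VARIABLES.Tables = StructNew() />

		<!--- Hypothetical one-column queue table with its dirty flag. --->
		<cfset VARIABLES.Tables[ "queue" ] = StructNew() />
		<cfset VARIABLES.Tables[ "queue" ].Data = QueryNew( "url" ) />
		<cfset VARIABLES.Tables[ "queue" ].IsDirty = false />

		<cfreturn THIS />
	</cffunction>

	<cffunction name="AddRecord" access="public" returntype="void" output="false">
		<cfargument name="Table" type="string" required="true" />
		<cfargument name="Data" type="struct" required="true" />
		<cfargument name="Commit" type="boolean" required="false" default="true" />

		<!--- Structs are passed by reference, so this points at the stored table. --->
		<cfset var objTable = VARIABLES.Tables[ ARGUMENTS.Table ] />
		<cfset var strColumn = "" />

		<!--- Update the in-memory query and flag the table as dirty. --->
		<cfset QueryAddRow( objTable.Data ) />
		<cfloop item="strColumn" collection="#ARGUMENTS.Data#">
			<cfset QuerySetCell( objTable.Data, strColumn, ARGUMENTS.Data[ strColumn ] ) />
		</cfloop>
		<cfset objTable.IsDirty = true />

		<!--- Default behavior: write to file right away. --->
		<cfif ARGUMENTS.Commit>
			<cfset THIS.Commit() />
		</cfif>
	</cffunction>

	<cffunction name="Commit" access="public" returntype="void" output="false">
		<cfset var strName = "" />
		<cfset var strXml = "" />

		<!--- Only tables whose data is not yet reflected on disk get written. --->
		<cfloop item="strName" collection="#VARIABLES.Tables#">
			<cfif VARIABLES.Tables[ strName ].IsDirty>
				<cfwddx action="cfml2wddx" input="#VARIABLES.Tables[ strName ].Data#" output="strXml" />
				<cffile action="write" file="#VARIABLES.DataDirectory##strName#.xml" output="#strXml#" />
				<cfset VARIABLES.Tables[ strName ].IsDirty = false />
			</cfif>
		</cfloop>
	</cffunction>

</cfcomponent>
```

A calling page that adds many items in one go would then skip the per-record write and commit once at the end. The URLs and the ./data/ directory (which must already exist) are placeholders:

```cfm
<cfset objDatabaseService = CreateObject( "component", "DatabaseService" ).Init( ExpandPath( "./data/" ) ) />

<cfloop index="strUrl" list="http://example.com/gallery-1/,http://example.com/gallery-2/">
	<cfset objRecord = StructNew() />
	<cfset objRecord.url = strUrl />

	<!--- Commit = false: update memory only, leave the file alone. --->
	<cfset objDatabaseService.AddRecord( Table = "queue", Data = objRecord, Commit = false ) />
</cfloop>

<!--- One file write for the whole batch. --->
<cfset objDatabaseService.Commit() />
```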
These changes have provided a significant performance increase. I was worried a bit about data corruption - I mean, what if the page crashes? But then I realized that all I am doing is adding items to a queue. This is NOT mission critical. If the page times out or crashes then all I lose is queue items. Not a big deal.
One other note about the database tables themselves. Since they are ColdFusion query objects, they are passed around by reference. This means that technically, I could get a reference to them via the GetTable() method just once and then use it throughout the system without ever re-getting it. This works because any updates made inside the DatabaseService.cfc ColdFusion component are also visible through any existing references to the table. Relying on this, however, creates HIGH COUPLING between my business logic and the DatabaseService.cfc inner workings. High coupling is always bad. Therefore, any time I need to get the database table, I call the GetTable() method. This will allow the inner workings of the DatabaseService.cfc ColdFusion component to change without affecting the calling pages.
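To see the pass-by-reference behavior in isolation, here is a small standalone sketch. The AppendRow() function and the "url" column are made up for the demo and are not part of the project:

```cfm
<!--- Hypothetical helper that adds a row to whatever query it is handed. --->
<cffunction name="AppendRow" returntype="void" output="false">
	<cfargument name="Query" type="query" required="true" />

	<cfset QueryAddRow( ARGUMENTS.Query ) />
	<cfset QuerySetCell( ARGUMENTS.Query, "url", "http://example.com/gallery-1/" ) />
</cffunction>

<cfset qQueue = QueryNew( "url" ) />

<!--- This is a second reference to the same query, not a copy. --->
<cfset qReference = qQueue />

<!--- The function modifies the query it receives by reference. --->
<cfset AppendRow( qQueue ) />

<!--- Outputs 1: the row added inside the function shows up here too. --->
<cfoutput>#qReference.RecordCount#</cfoutput>
```

If the component ever switched to handing out copies (for example via Duplicate()), a page holding on to an old reference would silently go stale, while a page that calls GetTable() each time would keep working.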
If you look on the spider_gallery_links.cfm page, you will see that I am processing the CFHttp content in the content area of the user's page. Ordinarily, I would do this sort of processing in the pre-page processing. In this case, however, I want to be able to provide feedback to the user on a per-gallery basis. To do so, I have to process the CFHttp content in the main area, then CFFlush to the user's browser after every database update. I am happy with this. I think it provides a much nicer user experience.
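The per-gallery feedback boils down to a loop like the following. This is only a sketch of the pattern, not the contents of spider_gallery_links.cfm; the URL list is a placeholder and the database update is reduced to a comment:

```cfm
<!--- Placeholder list standing in for the links found in the CFHttp content. --->
<cfset lstGalleryUrls = "http://example.com/gallery-1/,http://example.com/gallery-2/" />

<cfloop index="strUrl" list="#lstGalleryUrls#">

	<!--- The queue update for this gallery would happen here. --->

	<!--- Report progress and push it to the browser right away. --->
	<cfoutput><p>Queued: #HTMLEditFormat( strUrl )#</p></cfoutput>
	<cfflush />

</cfloop>
```

One thing to keep in mind with this approach: once CFFlush has sent content to the browser, tags that modify the HTTP headers (such as CFLocation, CFHeader or CFCookie) can no longer be used on that request.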
As you can tell, the current phase of the ColdFusion application development demo is to get the spidering to work. If the XML database is the backbone of the system, then the spidering is its raison d'etre - its reason to be. If I cannot get this working, then there is little point to going on with the application development demo.
To get a handle on where I wanted to go with it, I mapped out some spidering workflow:
Goto page for entering thumbnail gallery url
- Enter thumbnail gallery url
- Grab content of thumbnail gallery site
- Spider for links to galleries (without validating gallery itself)
- Assert: New galleries have been added to queue in database
Goto page for spidering gallery information
- Get next gallery item in the queue. If no gallery items left, goto video spider
- Grab content of gallery
- Check to see if it has video files (e.g. wmv, mov, mpeg, etc.)
- If it does NOT have movies, remove from queue, refresh page (for next gallery)
- Take a screen shot of the gallery and save
- Spider all links for movies (including any content between A tags)
- Add video links to queue along with inner content of A tag (the idea is we want to try to grab the thumbnail itself)
- Add gallery to database
- Goto video spidering page with url from just added gallery
Goto page for spidering video information
- Get videos for the previously spidered gallery
- Try to download first one
- Try to get a thumbnail (from the inner A tag information)
- Add the video to the database
- Refresh page to keep getting videos from this gallery
- If there are no videos left from this gallery go back to gallery spidering page
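As a rough illustration of the "spider all links for movies" step, here is one way to pull out both the video URL and the content between the A tags using the Java regular expression classes from ColdFusion. The sample HTML, the extension list, and the pattern itself are assumptions for the sketch; the pattern only handles double-quoted href attributes and real-world markup will need something more forgiving.

```cfm
<!--- Sample gallery markup (made up for this sketch). --->
<cfset strHtml = '<a href="/movies/clip1.wmv"><img src="/thumbs/clip1.jpg" /></a> <a href="/about.htm">About</a>' />

<!---
	Group 1: the href value ending in a video extension.
	Group 2: whatever sits between the opening and closing A tags.
	(?is) makes the pattern case-insensitive and lets . span line breaks.
--->
<cfset objPattern = CreateObject( "java", "java.util.regex.Pattern" ).Compile(
	JavaCast( "string", "(?is)<a\s[^>]*href=""([^""]+\.(?:wmv|mov|mpe?g))""[^>]*>(.*?)</a>" )
	) />

<cfset objMatcher = objPattern.Matcher( JavaCast( "string", strHtml ) ) />

<!--- Each successful Find() is one video link; non-video links are skipped. --->
<cfloop condition="objMatcher.Find()">
	<cfoutput>
		Video: #HTMLEditFormat( objMatcher.Group( JavaCast( "int", 1 ) ) )#<br />
		Inner: #HTMLEditFormat( objMatcher.Group( JavaCast( "int", 2 ) ) )#<br />
	</cfoutput>
</cfloop>
```

If the matcher never finds anything for a gallery, that is the "does NOT have movies" case from the workflow, and the gallery can be removed from the queue.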
Just thinking this out has helped me see how I want to codify it. In fact, by thinking it out, I saw that I needed to add the database table "queue". Originally I thought I would just add items to the standard database table with an is_queued column. But, then I realized that I don't want to constantly be manipulating the main files. The queue table allows me to store more specialized information and doesn't mess with main table performance.
I have also added a CFError template, site_error.cfm. This doesn't do anything at the moment except display a site-friendly error page. Eventually this would email me the error or display it so that I could stay on top of things. Be careful though, you never want to display error information on the live site as it may give away secret information like the directory structure of your server or DNS information.
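For reference, wiring up an error template like this takes a single tag. The sketch below assumes the tag lives in Application.cfm and that the template handles exceptions; the demo's actual settings may differ.

```cfm
<!--- Application.cfm: send uncaught exceptions to the site-friendly error page. --->
<cferror type="exception" template="site_error.cfm" />
```

With type="exception", the error template can itself contain CFML, which is what would make the "email me the error" step possible later on.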
I have also added a global errors object, REQUEST.PageErrors. This is just a simple array that you can add errors to. Each page that wants to display errors will then include the _page_errors.cfm template. This template expects the REQUEST.PageErrors object and will iterate over it to display errors in a site-wide consistent manner. Right now, this is just a ColdFusion array. In the next phase I will turn this into a Collection object that can have more logic to it (such as storing field names and errors, checking for error count, etc). The idea here was to factor out the error display and put it into one place. So far, it is working very nicely for such simple pages.
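A minimal sketch of the pattern, assuming a form field named "url" and some made-up markup for the error list (neither is taken from the actual templates):

```cfm
<!--- On the page: make sure the array exists, then validate. --->
<cfparam name="REQUEST.PageErrors" type="array" default="#ArrayNew( 1 )#" />
<cfparam name="FORM.url" type="string" default="" />

<cfif NOT Len( Trim( FORM.url ) )>
	<cfset ArrayAppend( REQUEST.PageErrors, "Please enter a thumbnail gallery url." ) />
</cfif>

<!--- Wherever the errors should appear in the page content. --->
<cfinclude template="_page_errors.cfm" />
```

```cfm
<!--- _page_errors.cfm: one place that controls how errors are displayed. --->
<cfif ArrayLen( REQUEST.PageErrors )>
	<ul class="page-errors">
		<cfloop index="intError" from="1" to="#ArrayLen( REQUEST.PageErrors )#">
			<li><cfoutput>#HTMLEditFormat( REQUEST.PageErrors[ intError ] )#</cfoutput></li>
		</cfloop>
	</ul>
</cfif>
```

Because the pages only ever append to REQUEST.PageErrors and include the template, the display markup can change in one file without touching the pages themselves.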