I have made some excellent updates to Project Skin Spider. In particular, I have just about finished the spidering of galleries, videos, and thumbnails. This was really the "proof of concept" code module, the one that would prove that the Skin Spider idea is viable. Check out the pages spider_gallery_links.cfm, spider_gallery.cfm, and spider_video.cfm. If you look at those, you will see that I am using ColdFusion, regular expressions, and some smooth CFHttp calls to find the appropriate links and download them into the system.
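As a rough illustration of that fetch-then-match idea (not the actual code in the spider_*.cfm pages), the sketch below grabs a page with CFHttp and pulls out video links with a regular expression. The URL, the list of file extensions, and the variable names are all assumptions made for the example, and it relies on REMatchNoCase() and array looping, which need ColdFusion 8 or later.

```cfml
<!--- Hypothetical gallery URL; not one taken from the project. --->
<cfset galleryUrl = "http://www.example.com/gallery/index.html" />

<!--- Fetch the page. resolveurl asks ColdFusion to rewrite relative links as absolute ones. --->
<cfhttp
	url="#galleryUrl#"
	method="get"
	resolveurl="true"
	timeout="30"
	result="galleryResponse"
	/>

<!--- Find every double-quoted href that ends in an (assumed) video extension. --->
<cfset videoLinks = REMatchNoCase(
	'href="[^"]+\.(mpg|mpeg|wmv|avi)"',
	galleryResponse.fileContent
	) />

<cfloop array="#videoLinks#" index="link">
	<!--- Strip the leading href=" (6 characters) and the trailing quote to leave the bare URL. --->
	<cfset videoUrl = mid( link, 7, len( link ) - 7 ) />
	<cfoutput>#htmlEditFormat( videoUrl )#<br /></cfoutput>
</cfloop>
```

A pattern this simple will miss single-quoted or unquoted href attributes, so a real spider would need something more forgiving.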
As I was building the spidering pages, I realized that as I was testing I kept downloading duplicate information. Even worse than that, I kept spidering pages that I had already spidered and found to have no videos on them. This led me to the use of a "blacklist" database table. Now, any time that I find a gallery that doesn't have videos or a video that doesn't download properly, I add it to the blacklist. Then, when I am going to add new records to the database, I first check to see if they have already been added (as valid entries) OR if they have been blacklisted. This is making the database much cleaner.
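The "already added OR blacklisted" check can be sketched as a single query. This is only an illustration: the datasource name, the gallery and blacklist tables, and the url column are hypothetical, and the project itself routes database work through DatabaseService.cfc rather than through inline queries like these.

```cfml
<!--- Hypothetical datasource, table, and column names. --->
<cfset candidateUrl = "http://www.example.com/gallery/index.html" />

<!--- Look for the URL among the valid entries and in the blacklist at the same time. --->
<cfquery name="existing" datasource="skinspider">
	SELECT url
	FROM gallery
	WHERE url = <cfqueryparam value="#candidateUrl#" cfsqltype="cf_sql_varchar" />
	UNION
	SELECT url
	FROM blacklist
	WHERE url = <cfqueryparam value="#candidateUrl#" cfsqltype="cf_sql_varchar" />
</cfquery>

<!--- Only insert when the URL is neither a known gallery nor blacklisted. --->
<cfif NOT existing.recordCount>
	<cfquery datasource="skinspider">
		INSERT INTO gallery ( url )
		VALUES ( <cfqueryparam value="#candidateUrl#" cfsqltype="cf_sql_varchar" /> )
	</cfquery>
</cfif>
```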
So far, in testing, it doesn't seem that hot linking is going to be much of an issue. I know that some servers are all hardcore about their bandwidth usage, but so far so good. When I make my CFHttp calls, I am sending a user agent both in the tag attribute and in a CFHttpParam tag. Additionally, I always supply a valid CFHttpParam tag for the CGI variable, http_referer. Basically, with the CFHttp tag, I try my best to mimic the workflow of an actual browser.
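A browser-like request along these lines might look like the sketch below. The URLs, the user agent string, and the output file name are made up for the example. It also assumes the referer is sent as a plain Referer header, which a ColdFusion server on the receiving end would expose as CGI.http_referer.

```cfml
<!--- Hypothetical user agent string. --->
<cfset browserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)" />

<cfhttp
	url="http://www.example.com/videos/clip.mpg"
	method="get"
	useragent="#browserAgent#"
	getasbinary="yes"
	timeout="60"
	result="videoResponse">

	<!--- Send the user agent a second time as an explicit header. --->
	<cfhttpparam type="header" name="User-Agent" value="#browserAgent#" />

	<!--- Claim to be arriving from the gallery page that linked to the video. --->
	<cfhttpparam type="header" name="Referer" value="http://www.example.com/gallery/index.html" />
</cfhttp>

<!--- statusCode looks like "200 OK"; val() keeps just the leading number. --->
<cfif val( videoResponse.statusCode ) EQ 200>
	<cffile
		action="write"
		file="#expandPath( './clip.mpg' )#"
		output="#videoResponse.fileContent#"
		/>
</cfif>
```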
I am not sure if I am going to need it, but currently, I am also taking a screen shot of the gallery page with the Web Shot command line utility. This is such a nifty little utility that I made a demo for it a while back. You basically use CFExecute and give it some arguments and BAM! You get a screen shot of the gallery. Right now, I am going to force that image, and all the other images, to be 100 x 75 pixels in dimension. The video thumbs are not all that size, so there is going to be some distortion. For now, though, I think this should work alright. Perhaps we can update this in Phase II.
I am making sure to leave plenty of execution time for gallery and video downloads. Using the CFSetting tag, I give the RequestTimeOut about 4-5 minutes on the very intensive pages. I am also trying to keep the content stream steady by alternating between spidering a gallery and spidering the videos. Every time I spider a gallery, I then jump over and try to spider the videos from that gallery. This should keep the influx of videos at a good rate.
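For reference, raising the page timeout takes a single tag at the top of the page. The sketch below assumes a five-minute limit and a hypothetical video URL. It also gives the CFHttp call its own, shorter timeout so that a stalled download fails before the page as a whole does.

```cfml
<!--- Allow this request up to five minutes (300 seconds) before ColdFusion times it out. --->
<cfsetting requesttimeout="300" />

<!--- Keep the timeout on the download itself below the page timeout. --->
<cfhttp
	url="http://www.example.com/videos/clip.mpg"
	method="get"
	getasbinary="yes"
	timeout="240"
	result="videoResponse"
	/>
```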
I try to keep the rate of content streaming good, but this smells like a job for asynchronous gateways. I can imagine putting all the videos to spider on a single queue and then just letting the ColdFusion gateways slam them with multiple threads. But the reality of the situation is that this is not an enterprise application. It is meant to be run on a personal computer that happens to be running ColdFusion on it. Of course, there might be ways to speed this up.
I had thought about not making it into a pop-up. I had thought about keeping it server side... but to be honest, I am not sure how to run pages like that. I suppose I could have done a scheduled task, but I think if you use a lot of CFLocation tags pages crap out because of redirect overflows or something. Still, I am sure that there is a way to make this perhaps a bit more streamlined.
Also, since I mentioned the use of CFFlush, it makes me think about the use of ColdFusion frameworks in later phases of the Skin Spider project. As I have said before, this phase, Phase I, is basically just a proof of concept. It is a "worst practices" method of building ColdFusion applications. I am trying to keep it clean, but at the same time, I am trying to make the same mistakes that many new programmers make. Well, mistakes is the wrong word. Basically, I am just not making it very "upper level" programming. But that's the whole point of the project - to learn - to take it from low level to high level.
But I digress; back to frameworks. As you can see in the code, I am using a lot of CFFlush tags. I think a lot of frameworks have problems with the CFFlush tag because they build their content templates from the inside out, and I think CFFlush conflicts with this idea. I am hoping to get some feedback from framework people on the next phase of development.
One funny thing about this batch of updates is that I realized that I have not made any way to update items in the database. The DatabaseService.cfc ColdFusion component only has ways to add new records and delete existing records. I have to come up with a way to update records.
Also, one final note, going back to the asynchronous processes. Right now, the application is designed to run one spidering process at a time. If I were to run more than one at a time, I would probably have to tighten up my calls to the database. Right now, I try to make fewer calls to the database to increase the page's performance. However, if I knew that I could have two pages spidering at the same time, then I would have to set a lot more database flags to ensure that no item was being spidered by two pages at the same time. This would probably involve some CFLock tags and double-check locking. But that will have to wait until the next phase.
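To give a feel for what double-check locking with CFLock could look like, here is a minimal sketch. It swaps the database flags for an in-memory stand-in: a hypothetical application.inProgress struct, assumed to be created at application start-up, that records which gallery IDs are currently being spidered. The lock name and the gallery ID are also made up.

```cfml
<!--- Hypothetical gallery ID; application.inProgress is assumed to exist already. --->
<cfset galleryId = 42 />
<cfset claimed = false />

<!--- First check, without a lock: cheap, but two requests could both pass it. --->
<cfif NOT structKeyExists( application.inProgress, galleryId )>

	<cflock name="skinspider-claim" type="exclusive" timeout="10">

		<!--- Second check, inside the lock: only one request at a time gets here. --->
		<cfif NOT structKeyExists( application.inProgress, galleryId )>
			<cfset application.inProgress[ galleryId ] = now() />
			<cfset claimed = true />
		</cfif>

	</cflock>

</cfif>

<cfif claimed>
	<!--- Spider the gallery here, then release the claim so it can be revisited later. --->
	<cfset structDelete( application.inProgress, galleryId ) />
</cfif>
```

The unlocked first check keeps most requests from ever waiting on the lock, while the second check inside the lock is what actually prevents two pages from claiming the same gallery.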