Ask Ben: CFHttp For Web Mining And Image Hot Linking
I like the pornography and I am trying to create a rather large library of it on my file server at home. I have built my own ColdFusion spider which crawls over adult web sites and downloads free graphics and videos. I am having a problem with CFHttp where only some of the content is downloaded. Most of it shows up as some weird "not available" graphic; however, when I go to the URL in question, the graphics show up fine and I can right-click and save. Am I not using CFHttp correctly?
Let me just start off by saying that you must be careful about copyright laws with this sort of thing. I don't know the laws, but just be careful about what you are "grabbing" from other companies' web sites. Remember that you are not only taking their content, you are also using their bandwidth, which can count against their file-transfer limits.
That being said, it sounds like you are using ColdFusion CFHttp correctly. The problem here lies on the target server that is serving up the requested graphics or videos. What you are doing is sometimes referred to as web mining or image hot linking.
Web mining is just a generic term for gathering information off of the web in some sort of systematic, usually automated fashion. Image hot linking is when you display on your site an image that is located under another domain (and probably on another server). Unfortunately, the server administrators don't want you "stealing" their images and bandwidth, and that "not available" image that is showing up a lot of the time is their attempt to stop you from grabbing content that they paid for and serve from their site.
So, how do you get around this issue? First, you have to understand how hot linking is being prevented (in most cases). Thomas Scott does a good job on A List Apart of explaining how to check the user's referrer URL to block hot linking. There are times, I find, when a server is doing something more complicated that I cannot crack, but those servers are few and far between and are generally large servers that specifically handle file-serving.
Ok, so now to the nitty gritty. To overcome this "issue," you have to extend your ColdFusion CFHttp call a bit to set the browser variables that the target server reads from its CGI scope. Let's work with an example. Say you are trying to grab the image:
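To make the referrer check concrete, here is a minimal sketch of what such a check could look like if the image server happened to be written in ColdFusion. This is not taken from the A List Apart article; the host name, file paths, and variable names are all assumptions made up for illustration.

```cfm
<!--- Hypothetical image-serving template. All names and paths are assumptions. --->

<!--- Assumption: the site's own pages are served from this host. --->
<cfset allowedHost = "www.example.com" />

<!--- Assumption: the real image and the "not available" placeholder live here. --->
<cfset requestedImage = ExpandPath( "./images/photo.jpg" ) />
<cfset placeholderImage = ExpandPath( "./images/not-available.jpg" ) />

<!---
	The browser reports the linking page in the Referer header, which
	ColdFusion exposes as CGI.http_referer. Serve the real image only
	when the referrer starts with our own host.
--->
<cfif ( FindNoCase( "http://" & allowedHost & "/", CGI.http_referer ) EQ 1 )>
	<cfset fileToServe = requestedImage />
<cfelse>
	<cfset fileToServe = placeholderImage />
</cfif>

<!--- Stream the chosen file back to the client. --->
<cfcontent type="image/jpeg" file="#fileToServe#" />
```

A request that arrives with an empty or foreign referrer gets the placeholder, which is the behavior described in the question.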
WARNING: Adult Image
... off of the page:
WARNING: Adult Site
... we want to perform a ColdFusion CFHttp grab using the target PAGE as the referrer for the image grab. This way, we can fool the server into thinking that we are a user on the page viewing their images (after all, your browser is just making requests to the server for images like we are). Additionally, you will want to change the user agent as ColdFusion sends its own user agent by default in the CFHttp call:
<cfhttp
	url="http://www.donovanphillips.com/galleries/ncg/fawn01/FawnNCG1-0033.JPG"
	method="GET"
	useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.2) Gecko/20060308 Firefox/1.5.0.2"
	getasbinary="yes"
	result="objHttp">
	<!--- Set referrer params. In this case, we want to override the referrer. --->
	<cfhttpparam
		type="CGI"
		name="http_referer"
		value="http://www.donovanphillips.com/galleries/ncg/fawn01/index.php"
		encoded="false"
		/>
</cfhttp>
As you can see, we set the user agent to be some flavor of the Mozilla Firefox browser and the referrer is the page upon which the image was originally linked. Now, as I said before, this does NOT work all the time, but it does work a good amount of the time. What you do with the binary image data (stored in objHttp.FileContent in the above example) is up to you. Be sure to check that the image is valid before you try to do anything with it:
<!--- Check to see if we found the image. --->
<cfif (
	FindNoCase( "200", objHttp.Statuscode ) AND
	FindNoCase( "image", objHttp.Responseheader["Content-Type"] )
	)>
	<!--- We have an image. --->
<cfelse>
	<!--- Blast! The image didn't come through. --->
</cfif>
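As one possible next step, the sketch below writes the binary data to disk once the check passes. It assumes objHttp is the result of the CFHttp call shown earlier; the target path and the downloads directory are hypothetical, and the directory must already exist.

```cfm
<!--- Hypothetical target path; the "downloads" directory is an assumption. --->
<cfset targetPath = ExpandPath( "./downloads/image.jpg" ) />

<!---
	Same check as above, with an extra guard so that a response
	without a Content-Type header does not throw an error.
--->
<cfif (
	FindNoCase( "200", objHttp.Statuscode ) AND
	StructKeyExists( objHttp.Responseheader, "Content-Type" ) AND
	FindNoCase( "image", objHttp.Responseheader[ "Content-Type" ] )
	)>

	<!--- FileContent is binary because getasbinary was set to "yes". --->
	<cffile
		action="write"
		file="#targetPath#"
		output="#objHttp.FileContent#"
		/>

</cfif>
```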
If you want to see this in action, please check out my ColdFusion CFHttp example in my ColdFusion Snippets section.
I'm trying to pull my TypePad RSS feed into a CFM document. I haven't had any luck doing this with ColdFusion. So I created a PHP file on my server that pulls my TypePad RSS feed. The PHP file works fine by itself, but I'm having a devil of a time getting it to be included in a CFM page.
When I use CFINCLUDE, it apparently just reads it as text and tries to process it as ColdFusion... which doesn't work. How do I get it to look at the PHP file, process it, and then bring the results into my CFM page?
ColdFusion will not inherently execute a PHP page. There are ways to run some PHP in ColdFusion, but I have never done that. You can either try to get the RSS reading to work in ColdFusion, or maybe have the PHP file write an XML file that ColdFusion then reads in and parses.
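For the first option, a rough sketch of reading a feed directly in ColdFusion might look like the following. The feed URL is a placeholder, and the XPath assumes an RSS 2.0 layout (items under /rss/channel/item); a feed in Atom or another format would need a different path.

```cfm
<!--- Hypothetical feed URL; replace with the real TypePad feed address. --->
<cfset feedUrl = "http://example.typepad.com/blog/rss.xml" />

<!--- Grab the raw feed text. --->
<cfhttp url="#feedUrl#" method="GET" result="feedHttp" />

<cfif ( FindNoCase( "200", feedHttp.Statuscode ) AND IsXml( feedHttp.FileContent ) )>

	<cfset feedXml = XmlParse( feedHttp.FileContent ) />

	<!--- Assumption: RSS 2.0, where each entry is an item under channel. --->
	<cfset items = XmlSearch( feedXml, "/rss/channel/item" ) />

	<cfoutput>
		<ul>
			<cfloop from="1" to="#ArrayLen( items )#" index="i">
				<li>
					<a href="#items[ i ].link.XmlText#">#HtmlEditFormat( items[ i ].title.XmlText )#</a>
				</li>
			</cfloop>
		</ul>
	</cfoutput>

<cfelse>

	<p>The feed could not be loaded.</p>

</cfif>
```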