Data Mining Document Text for Searching

Posted April 18, 2006 at 12:12 AM

Tags: ColdFusion

Right now I am in the process of getting this site up and running, and one of the milestones in that journey is the site search. Currently, the search works for the database content of the web log and the snippets. More will come as the site evolves. One of the details of site search is the uploaded document search. Snippets, for example, can have an uploaded sample code file. For a search on the text of that file to return the Snippet itself, I have to be able to search the content of the document.

I have tried using Verity over the years, and frankly, it has always been more headache than it's worth. Now, granted, maybe I am not the best at setting it up, but it always seems to carry so much setup and usage overhead. Not to mention that the CFMX7 version tends to crash our server. So right now, what I am trying to do is strip the text out of each document and store it in the database along with the file info and association info. So far, it has been working nicely. I can strip the text out of plain text, HTML (.html and .htm), Word, and Excel documents, among others. The one beast I am having trouble with right now is the PDF. The dreaded PDF. I think I am going to have to go third-party on this one, unless I can figure out a built-in Java way to extract text.

Now, even though the content that I get out of the documents is not 100% spot on (some words get deleted, punctuation gets removed), I still keep the gist of the content, and frankly, I think that's good enough to search on.
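To make the idea concrete, here is a minimal sketch of that kind of lossy extraction for the HTML case. It is written in plain Java (which ColdFusion can call into) and it is an illustration, not the code running on this site: the class name SearchTextSketch and the methods stripHtml and normalize are made up for the example, and the regular expressions are a rough approximation, not a real HTML parser.

```java
import java.util.Locale;

public class SearchTextSketch {

    // Hypothetical helper: drop script/style blocks, then anything that
    // looks like a tag, then decode a handful of common entities.
    static String stripHtml(String html) {
        String text = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        text = text.replaceAll("(?s)<[^>]*>", " ");
        return text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&amp;", "&");
    }

    // Hypothetical helper: lowercase the text and replace every run of
    // non-letter, non-digit characters with a single space. Punctuation
    // is lost, but the words (the gist) survive for searching.
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT)
                   .replaceAll("[^\\p{L}\\p{N}]+", " ")
                   .trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Snippet: CFQuery</h1>"
                    + "<p>Loops &amp; queries, explained.</p></body></html>";
        // Prints: snippet cfquery loops queries explained
        System.out.println(normalize(stripHtml(html)));
    }
}
```

The normalized string is the sort of thing that could be stored in a text column next to the file record and matched with a simple LIKE query. The search term would need to go through the same normalize step so that both sides lose the same punctuation.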


Reader Comments

Oct 17, 2008 at 5:03 PM

Hi Ben,

I came across your article but found no solution. I am attempting to perform a screen scrape from an online PDF which is dynamically populated via a URL.

For example:
http://www.domain.com/test.pdf?var=1

I'd like to pull that data into a variable and perform a screen scrape on it. That would be the best. Any ideas on the best method for this?


Oct 29, 2008 at 8:08 AM

This paper provides the reader with a very brief introduction to some of the theory and methods of text data mining. The intent of this article is to introduce the reader to some of the current methodologies that are employed within this discipline area while at the same time making the reader aware of some of the interesting challenges that remain to be solved within the area. Finally, the article serves as a very rudimentary tutorial on some of the techniques while also providing the reader with a list of references for additional study.

http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdfview_1&handle=euclid.ssu/1216238228

