Data Mining Document Text for Searching

Posted April 18, 2006 at 12:12 AM by Ben Nadel

Tags: ColdFusion

Right now I am in the process of getting this site up and running, and one of the milestones in that journey is the site search. Currently, the search works for the database content of the web log and the snippets. More will come as the site evolves. One of the details of site search is the uploaded document search. Snippets, for example, can have an uploaded sample code file. For a search on the text of that file to return the Snippet itself, I have to be able to search the content of the document.

I have tried using Verity over the years, and frankly, it has always been more headache than it is worth. Now, granted, maybe I am not the best at setting it up, but it always seems to carry a high setup and usage cost. Not to mention that the CFMX7 version tends to crash our server. So right now, what I am trying to do is strip the text out of each document and store it in the database along with the file info and association info. So far, it has been working nicely. I can strip the text out of text documents, HTML files (.html, .htm), Word, Excel, etc. The one beast I am having trouble with right now is the PDF. The dreaded PDF. I think I am going to have to go third-party on this one, unless I can figure out a built-in Java way to extract text.

Now, even though the content that I get out of the documents is not 100% spot on (some words get deleted, punctuation gets removed), I still keep the gist of the content, and frankly, I think that's good enough to search on.
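
As a rough illustration of that kind of lossy stripping, here is a minimal Java sketch for the HTML case. It is not the code behind this site's search: the class name `DocumentTextStripper` and the method `toSearchText` are hypothetical, and regex-based tag removal is an assumption chosen for brevity rather than a real HTML parser. It drops script/style blocks, tags, entities, and punctuation, then lower-cases what is left so it can be stored in a database column and searched.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper: reduces an HTML document to plain, searchable text.
final class DocumentTextStripper {

    // Whole <script>...</script> and <style>...</style> blocks (case-insensitive, spans lines).
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");

    // Any remaining tag, e.g. <p> or </div>.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");

    // Character entities such as &quot; or &#160; (dropped, not decoded: this is lossy).
    private static final Pattern ENTITY = Pattern.compile("&[#a-zA-Z0-9]+;");

    // Runs of anything that is not a letter or a digit (punctuation, whitespace).
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{N}]+");

    static String toSearchText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        text = ENTITY.matcher(text).replaceAll(" ");
        // Collapse punctuation and whitespace into single spaces.
        text = NON_WORD.matcher(text).replaceAll(" ");
        return text.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Hello, World!</h1>"
            + "<p>It's a &quot;test&quot;.</p></body></html>";
        // prints: hello world it s a test
        System.out.println(toSearchText(html));
    }
}
```

Note how "It's" comes out as "it s": the punctuation is gone and the word is slightly mangled, but the gist survives, which is the trade-off described above.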



Reader Comments

Oct 17, 2008 at 5:03 PM

Hi Ben,

I came across your article but found no solution. I am attempting to perform a screen scrape from an online PDF which is dynamically populated via a URL.

For example:
http://www.domain.com/test.pdf?var=1

I'd like to pull that data into a variable and perform a screen scrape on it. That would be ideal. Any ideas on the best method for this?


Oct 29, 2008 at 8:08 AM

This paper provides the reader with a very brief introduction to some of the theory and methods of text data mining. The intent of this article is to introduce the reader to some of the current methodologies that are employed within this discipline area while at the same time making the reader aware of some of the interesting challenges that remain to be solved within the area. Finally, the article serves as a very rudimentary tutorial on some of the techniques while also providing the reader with a list of references for additional study.

http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdfview_1&handle=euclid.ssu/1216238228

