Data Mining Document Text for Searching

Posted April 18, 2006 at 12:12 AM by Ben Nadel

Tags: ColdFusion

Right now I am in the process of getting this site up and running, and one of the milestones in that journey is the site search. Currently, the search works for the database content of the web log and the snippets. More will come as the site evolves. One of the details of site search is the uploaded document search. Snippets, for example, can have an uploaded sample code file. For a search on the text of that document to return the Snippet itself, I have to be able to search the content of the document.

I have tried using Verity over the years, and frankly, it has always been more of a headache than it was worth. Now, granted, maybe I am not the best at setting it up, but it always seems to carry so much setup and usage overhead. Not to mention that the CFMX7 version tends to crash our server. So right now, what I am trying to do is strip the text out of each document and store it in the database along with the file info and association info. So far, it has been working nicely. I can strip the text out of text documents, html, htm, word, excel, etc. The one beast I am having trouble with right now is the PDF. The dreaded PDF. I think I am going to have to go Third-Party on this one, unless I can figure out a built-in Java way to extract text.
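To make the idea concrete, here is a minimal sketch of the "strip the text, store it with the file info, search the stored text" approach for the HTML case. It is written in Python rather than ColdFusion and is not the code running on this site. The table name `document_text`, its columns, and the helper names `extract_text`, `index_document`, and `search` are all assumptions made up for the illustration.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, skipping script/style bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def index_document(conn, snippet_id, filename, html):
    """Store the extracted text next to the file info and its snippet association."""
    conn.execute(
        "INSERT INTO document_text (snippet_id, filename, content) VALUES (?, ?, ?)",
        (snippet_id, filename, extract_text(html)),
    )


def search(conn, term):
    """Return the ids of snippets whose uploaded document text contains the term."""
    rows = conn.execute(
        "SELECT DISTINCT snippet_id FROM document_text "
        "WHERE content LIKE ? ORDER BY snippet_id",
        ("%" + term + "%",),
    )
    return [row[0] for row in rows]


# Hypothetical schema and sample document, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document_text (snippet_id INTEGER, filename TEXT, content TEXT)"
)
index_document(
    conn,
    1,
    "sample.html",
    "<html><body><h1>Query Loop</h1><p>Looping over a query.</p></body></html>",
)
print(search(conn, "looping"))
```

The search here is a plain `LIKE` match, so `%` and `_` in the search term act as wildcards. Other file types would each need their own extractor feeding the same table.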

Now, even though the content that I get out of the documents is not 100% spot on (some words get deleted, punctuation gets removed), I still keep the gist of the content, and frankly, I think that's good enough to search on.
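One way to see why lossy text can still be good enough: if the search term is reduced the same way the document text was, missing punctuation stops mattering. The sketch below is Python, not ColdFusion, and only illustrates the idea. The `normalize` and `matches` helpers and the sample strings are assumptions, not what this site does.

```python
import re


def normalize(text):
    """Lowercase the text and keep only runs of letters and digits."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def matches(document_text, query):
    """True when the normalized query appears as a word sequence in the document."""
    needle = normalize(query)
    if not needle:
        return False
    haystack = " " + normalize(document_text) + " "
    return (" " + needle + " ") in haystack


# Hypothetical example: the extracted text has lost its punctuation.
original = '<cfquery name="qGetUsers" datasource="myDSN">'
extracted = "cfquery name qGetUsers   datasource myDSN"

print(matches(extracted, 'name="qGetUsers"'))
```

Words that were dropped entirely during extraction are still unfindable, of course. This only covers the punctuation and casing side of the loss.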



Reader Comments

Oct 17, 2008 at 5:03 PM

Hi Ben,

I came across your article but found no solution. I am attempting to perform a screen scrape from an online PDF which is dynamically populated via a URL.

For example:
http://www.domain.com/test.pdf?var=1

I'd like to pull that data into a variable and perform a screen scrape. That would be the best. Any ideas on the best method for this?


Oct 29, 2008 at 8:08 AM

This paper provides the reader with a very brief introduction to some of the theory and methods of text data mining. The intent of the article is to introduce the reader to some of the current methodologies employed within this discipline while also making the reader aware of some of the interesting challenges that remain to be solved. Finally, the article serves as a very rudimentary tutorial on some of the techniques while also providing the reader with a list of references for additional study.

http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdfview_1&handle=euclid.ssu/1216238228

