Data Mining Document Text for Searching

Posted April 18, 2006 at 12:12 AM by Ben Nadel

Tags: ColdFusion

Right now I am in the process of getting this site up and running, and one of the milestones in that journey is the site search. Currently, the search works for the database content of the web log and the snippets. More will come as the site evolves. One of the details of site search is the uploaded document search. Snippets, for example, can have an uploaded sample code file. For a search on that document's text to return the Snippet itself, I have to be able to search the content of the document.

I have tried using Verity over the years, and frankly, it has always been more headache than it was worth. Now, granted, maybe I am not the best at setting it up, but it always seems to carry so much setup and usage cost. Not to mention that the CFMX7 version tends to crash our server. So right now, what I am trying to do is strip the text out of each document and store it in the database along with the file info and association info. So far, it has been working nicely. I can strip the text out of text documents, HTML (.html and .htm), Word, Excel, etc. The one beast I am having trouble with right now is the PDF. The dreaded PDF. I think I am going to have to go third-party on this one, unless I can figure out a built-in Java way to extract text.
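As a rough illustration of the strip-and-store idea, here is a minimal sketch in Python (not ColdFusion, and not the code running on this site). It assumes an in-process SQLite database and handles only plain text and HTML. The table name `document_text`, its columns, and the helper names `extract_text`, `index_document`, and `search` are all hypothetical, chosen just for this example.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text of an HTML document, skipping script and style bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def extract_text(filename, raw):
    """Return searchable text for a file; only .html/.htm get tag stripping."""
    if filename.lower().endswith((".html", ".htm")):
        parser = TextExtractor()
        parser.feed(raw)
        parser.close()
        raw = " ".join(parser.parts)
    # Collapse runs of whitespace so the stored text is compact.
    return " ".join(raw.split())


def create_index(conn):
    # Hypothetical schema: extracted text plus file info and its association.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS document_text ("
        "snippet_id INTEGER NOT NULL, "
        "filename TEXT NOT NULL, "
        "content TEXT NOT NULL)"
    )


def index_document(conn, snippet_id, filename, raw):
    """Store the extracted text alongside the snippet it belongs to."""
    conn.execute(
        "INSERT INTO document_text (snippet_id, filename, content) VALUES (?, ?, ?)",
        (snippet_id, filename, extract_text(filename, raw)),
    )


def search(conn, term):
    """Return ids of snippets whose document text contains the term.

    Note: '%' and '_' inside term act as LIKE wildcards in this sketch.
    """
    rows = conn.execute(
        "SELECT DISTINCT snippet_id FROM document_text "
        "WHERE content LIKE ? ORDER BY snippet_id",
        ("%" + term + "%",),
    )
    return [row[0] for row in rows]
```

A search that hits the stored document text returns the id of the owning snippet, which is the association the post is after. Binary formats such as Word, Excel, and PDF would each need their own extractor and are left out of this sketch.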

Now, even though the content that I get out of the documents is not 100% spot on (some words get deleted, punctuation gets removed), I still keep the gist of the content, and frankly, I think that's good enough to search on.
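To show why lossy text can still be good enough to search on, here is a small Python sketch (again, not the site's ColdFusion code). It assumes that both the stored text and the search phrase are reduced to lowercase ASCII words before comparison, so lost punctuation no longer matters. The function names `normalize` and `matches` are made up for this example.

```python
import re


def normalize(text):
    """Lowercase the text and replace every run of non-alphanumerics with a space.

    Assumption: ASCII-only content; accented letters would be dropped here.
    """
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def matches(document_text, query):
    """True when every word of the query appears somewhere in the document."""
    doc_words = set(normalize(document_text).split())
    query_words = normalize(query).split()
    return bool(query_words) and all(word in doc_words for word in query_words)
```

With this kind of comparison, a document whose extracted text lost its angle brackets and semicolons still matches a query for the words that survived.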



Reader Comments

Oct 17, 2008 at 5:03 PM

Hi Ben,

I came across your article but found no solution. I am attempting to perform a screen scrape from an online PDF which is dynamically populated via a URL.

For example:
http://www.domain.com/test.pdf?var=1

I'd like to pull that data into a variable and perform a screen scrape. That would be the best. Any ideas on the best method for this?


Oct 29, 2008 at 8:08 AM

This paper provides the reader with a very brief introduction to some of the theory and methods of text data mining. The intent of this article is to introduce the reader to some of the current methodologies that are employed within this discipline area while at the same time making the reader aware of some of the interesting challenges that remain to be solved within the area. Finally, the article serves as a very rudimentary tutorial on some of the techniques while also providing the reader with a list of references for additional study.

http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdfview_1&handle=euclid.ssu/1216238228

