| important.documents | project.overview |
|
Final Presentation
This presentation is a very general summary of the work that I did here at ORNL.
Project Poster
This poster is meant to give the viewer a big-picture look at my project. It definitely helps if I'm standing next to it!
Noun Phrase Extraction Presentation
This presentation gives an overview of current noun phrase extraction techniques, with a primary focus on machine learning.
Personal Abstract
This is the mission statement that was required by the RAMS program. It states in very simple terms what my general goals are for this summer.
Notes Archive
This archive contains my daily notes from this internship. |
With the growing problem of data overabundance, many techniques for knowledge discovery have been formulated. Document clustering is one such method. It aims to group documents into meaningful categories that allow the user to quickly browse to pertinent information. Current techniques employ a word-by-word comparison of the documents to be clustered. Certain “stop words” are excluded from this process: determiners, conjunctions, prepositions, and pronouns add little information to a sentence. Even so, the comparison can be very computationally expensive. We propose a new method of feature selection for document clustering that allows us to perform fewer comparisons. Proprietary issues require me to be no more descriptive than that at this time. Preliminary results suggest a 9% increase in accuracy over the established technique, as well as an 82% reduction in comparisons. Because only the most significant words are used for clustering, documents are more likely to be clustered correctly. The drastic reduction in comparisons also means that this technique is much less computationally demanding. |
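For readers unfamiliar with the established approach described in the overview, here is a minimal sketch of stop-word filtering followed by a word-by-word document comparison. It illustrates only the baseline idea, not the proposed feature selection method, which is not described here. The stop-word list, the sample documents, the function names, and the choice of cosine similarity as the comparison measure are all assumptions made for illustration.

```python
from collections import Counter
import math

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "or", "but", "of", "in", "on", "to",
              "it", "they", "is", "are"}

def features(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return Counter(w for w in words if w and w not in STOP_WORDS)

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical sample documents.
docs = [
    "The cat sat on the mat.",
    "A cat and a dog sat on the mat.",
    "Stock prices fell in the market.",
]
vectors = [features(d) for d in docs]

# Every pair of documents is compared, so the cost grows with both the
# number of documents and the number of words kept per document.
sim_pets = cosine_similarity(vectors[0], vectors[1])
sim_mixed = cosine_similarity(vectors[0], vectors[2])
```

In this sketch the two pet-related documents score higher than the unrelated pair, which is the signal a clustering step would then use to group them. Keeping fewer words per document shrinks each vector and therefore the work done per comparison.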