Text classification, an aspect of natural-language processing, is a useful way to capture insights from large amounts of unstructured text. However, the widely used bag-of-words model of text classification can be expensive and has many limitations. At the recent Predictive Analytics World event in San Francisco, Galvanize's Michael Tamir and Personagraph's Daniel Hansen discussed how Google's open-source Word2Vec tool can address these problems.
Word2Vec creates a vector for each word in a given document or group of documents. By comparing these vectors, researchers can determine how the words relate to one another and get a sense of what the documents are about. "I can subtract vectors, I can rotate the vectors, I can look at how far one vector is from another," Tamir said. "So by embedding these words into a vector space, we can capture a lot of structure."
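The vector operations Tamir describes can be illustrated with a minimal Python sketch. It is not taken from the talk: the three-dimensional vectors below are invented for illustration (real Word2Vec embeddings are learned from text and usually have hundreds of dimensions), and the helper names `cosine_similarity` and `nearest` are hypothetical.

```python
import numpy as np

# Toy vectors, invented for illustration only. A real model would
# learn these from a corpus rather than have them written by hand.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(target, exclude=()):
    """Return the vocabulary word whose vector is most similar to target."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))

# "How far one vector is from another": Euclidean distance.
distance = np.linalg.norm(vectors["king"] - vectors["queen"])

# "I can subtract vectors": remove one word's vector, add another's,
# then look up the closest remaining word.
offset = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(offset, exclude=("king", "man", "woman")))  # prints "queen"
```

With these hand-picked numbers the lookup returns "queen"; whether a trained model behaves the same way depends on the corpus and training settings.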
They noted that using Word2Vec is also easy and inexpensive, and that the tool performs well in situations where the bag-of-words model does not. For example, in a test where significantly reducing the number of features dramatically reduced the accuracy of a bag-of-words model, Word2Vec's accuracy remained relatively stable.
From CIO Australia
Abstracts Copyright © 2015 Information Inc., Bethesda, Maryland, USA