Cornell University researchers have developed a technique that makes it easier for computers to find written articles that are related to one another based on their content, without relying on shared links. Understanding the general topic of a piece of writing or a speech is exceedingly difficult for computers. To compensate, computers use statistical tricks, such as the thorough analysis of Web links, which is one of the foundations of the Google search engine. Using the technique, the Cornell researchers explored how news cycles start and end and the role bloggers play in them.
Cornell graduate student Jure Leskovec says that for a computer to help answer those questions, it would need to know which articles are related to each other based on their subject. The analysis used Google's daily feed of news stories and blog items, which can include more than a million pieces per day.
Leskovec and colleague Lars Backstrom, along with Cornell professor Jon Kleinberg, stumbled upon what Leskovec describes as an "embarrassingly simple" idea: if different stories use the same words in quotation marks, they most likely should be grouped together. Searching for patterns of characters in quotation marks is simple for a computer. The researchers found that this method made it easy to count which phrases were used most frequently and was highly successful at finding similar content.
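The quotation-matching idea can be illustrated with a minimal Python sketch. This is not the researchers' implementation. The function names, the three-word minimum, the lowercasing and whitespace normalization, and the restriction to straight double quotes are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumption: quotations are delimited by straight double quotes only.
QUOTE_RE = re.compile(r'"([^"]+)"')


def quoted_phrases(text, min_words=3):
    """Return the set of normalized quoted phrases found in one article.

    min_words is a hypothetical filter that skips very short quotes.
    """
    phrases = set()
    for match in QUOTE_RE.findall(text):
        # Normalize case and collapse runs of whitespace.
        phrase = " ".join(match.lower().split())
        if len(phrase.split()) >= min_words:
            phrases.add(phrase)
    return phrases


def group_by_quote(articles):
    """Map each quoted phrase to the IDs of the articles that contain it.

    Only phrases shared by at least two articles are kept.
    """
    groups = defaultdict(set)
    for article_id, text in articles.items():
        for phrase in quoted_phrases(text):
            groups[phrase].add(article_id)
    return {p: ids for p, ids in groups.items() if len(ids) > 1}


def most_frequent(articles, n=3):
    """Count how many articles use each quoted phrase; return the top n."""
    counts = Counter()
    for text in articles.values():
        counts.update(quoted_phrases(text))
    return counts.most_common(n)
```

Given a dictionary that maps article IDs to article text, group_by_quote returns the clusters of articles that share a quotation, and most_frequent ranks quotations by the number of articles that use them. Exact matching of this kind would miss quotes that are shortened or reworded from one story to the next.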
From Forbes
Abstracts Copyright © 2009 Information Inc., Bethesda, Maryland, USA