
Crawling Toward a Wiser Web



The Common Crawl Foundation wants to give the public direct access to crawls of the Web.

Credit: Common Crawl Foundation

When search engines respond to a query, they search not the live Web but a local replica of it called a crawl. Carrying out and storing a crawl of the Web has historically required significant storage and computing resources, but the Common Crawl Foundation, established in 2007, is trying to change that.

Through its Common Crawl, the foundation seeks to give the public direct access to crawls of the Web. The most recent crawl was conducted in January 2015 and covered 1.8 billion Web pages. The data is hosted on Amazon Web Services, and anyone with sufficient storage capacity can download the entire 139 terabytes for their own use.
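Because the archive sits in a public Amazon S3 bucket, individual files can also be streamed over HTTP rather than downloading all 139 terabytes. As a minimal sketch (assuming the data.commoncrawl.org HTTP endpoint and the CC-MAIN-2015-06 prefix used for the January 2015 crawl, and using the third-party warcio library to parse the WARC archive format), a few lines of Python can read the crawl's file index and print the URLs captured in its first archive file:

```python
import gzip
import io

import requests
from warcio.archiveiterator import ArchiveIterator

BASE = "https://data.commoncrawl.org/"
# Assumption: the January 2015 crawl is published under the
# CC-MAIN-2015-06 prefix; warc.paths.gz lists every WARC file in it.
PATHS_URL = BASE + "crawl-data/CC-MAIN-2015-06/warc.paths.gz"


def first_warc_path():
    # Download the small gzipped index and take its first entry.
    raw = requests.get(PATHS_URL).content
    with gzip.open(io.BytesIO(raw), "rt") as f:
        return f.readline().strip()


def list_urls(warc_url, limit=20):
    # Stream the archive instead of loading it into memory.
    resp = requests.get(warc_url, stream=True)
    resp.raise_for_status()
    count = 0
    for record in ArchiveIterator(resp.raw):
        # 'response' records hold the crawled HTTP responses themselves.
        if record.rec_type == "response":
            print(record.rec_headers.get_header("WARC-Target-URI"))
            count += 1
            if count >= limit:
                break


if __name__ == "__main__":
    list_urls(BASE + first_warc_path())
```

Each crawl is split into thousands of such gzipped WARC files, on the order of a gigabyte apiece, so a sample can be examined on an ordinary machine while the bulk of the corpus stays on Amazon's servers.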

The Common Crawl website features more than 30 publications detailing studies that make use of Common Crawl data.

Linguists have been among the most enthusiastic adopters of the Common Crawl, using it to collect millions of examples of the same text rendered in different languages, which serve as raw material for machine translation systems. Other researchers have used the Common Crawl to analyze duplication of content on the Web, the frequency with which various numbers appear online, and the interconnectedness of the Web.

From American Scientist

 

Abstracts Copyright © 2015 Information Inc., Bethesda, Maryland, USA


 
