
Communications of the ACM

ACM Careers

GitHub Is Beta Testing a Source Code Search Engine



GitHub relies on sharding to keep the search index manageable.

GitHub has a lot of code to search — more than 200 million repositories — and says last November's beta version of a search engine optimized for source code has caused a "flurry of innovation."

In a recent blog post, GitHub engineer Timothy Clem delved into the technology used to scour just a quarter of those repos, a code search engine built in Rust called Blackbird.

Blackbird currently provides access to almost 45 million GitHub repositories, which together amount to 115 TBytes of code and 15.5 billion documents.

Using ripgrep on an 8-core Intel CPU to run an exhaustive regular expression query on a 13-GByte file held in memory takes 2.769 seconds, Clem explained, which works out to about 0.6 GByte/second/core. "We can see pretty quickly that this really isn't going to work for the larger amount of data we have," he said.
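The arithmetic behind that conclusion can be reproduced in a few lines. The sketch below uses only the figures quoted above; the function names are illustrative, and it assumes 1 TByte = 1,000 GByte and that scan time grows linearly with data size, neither of which is stated in the article.

```python
# Figures quoted in the article.
FILE_GBYTES = 13.0      # size of the in-memory test file
SCAN_SECONDS = 2.769    # time for one exhaustive regex scan
CORES = 8               # cores on the test CPU
CORPUS_TBYTES = 115.0   # total code currently indexed by Blackbird


def per_core_throughput(gbytes, seconds, cores):
    """Return GByte/second/core for one exhaustive scan."""
    return gbytes / seconds / cores


def full_scan_seconds(corpus_tbytes, gbytes_per_sec_per_core, cores):
    """Return seconds to scan the whole corpus by brute force.

    Assumption: 1 TByte = 1000 GByte and throughput scales linearly.
    """
    return corpus_tbytes * 1000.0 / (gbytes_per_sec_per_core * cores)


rate = per_core_throughput(FILE_GBYTES, SCAN_SECONDS, CORES)
one_machine = full_scan_seconds(CORPUS_TBYTES, rate, CORES)

print(f"{rate:.2f} GByte/second/core")
print(f"{one_machine / 3600:.1f} hours per query on one 8-core machine")
```

Under those assumptions a single brute-force query over the full 115 TBytes would occupy one such machine for several hours, which is why an exhaustive scan per query is not practical at this scale.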

So GitHub front-loaded much of the work into precomputed search indices. Even so, these indices are too large to fit in memory. GitHub Code Search is presently in beta testing.
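To illustrate what front-loading work into a precomputed, sharded index can look like, here is a toy sketch. The article does not describe how Blackbird's indices are built, so everything here is an assumption made for illustration: the trigram keys, the `ToyIndex` class, and the modulo-based shard assignment are hypothetical and are not GitHub's design.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class ToyIndex:
    """Hypothetical inverted index: trigram -> ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        # The expensive work happens once, at indexing time.
        self.docs[doc_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(doc_id)

    def search(self, needle):
        grams = trigrams(needle)
        if grams:
            # Only documents containing every trigram can match.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        else:
            # Needle shorter than 3 characters: fall back to a full scan.
            candidates = set(self.docs)
        # Verify candidates, since sharing trigrams does not guarantee a match.
        return sorted(d for d in candidates if needle in self.docs[d])


def build_shards(documents, shard_count):
    """Spread documents over shards (assumed scheme: doc_id modulo shard_count)."""
    shards = [ToyIndex() for _ in range(shard_count)]
    for doc_id, text in documents.items():
        shards[doc_id % shard_count].add(doc_id, text)
    return shards


def search_all(shards, needle):
    """Query every shard and merge the hits."""
    hits = []
    for shard in shards:
        hits.extend(shard.search(needle))
    return sorted(hits)


documents = {
    0: "fn main() {}",
    1: "def main(): pass",
    2: "int main(void) { return 0; }",
}
shards = build_shards(documents, 2)
results = search_all(shards, "main()")
print(results)
```

Each query touches only the posting lists for its own trigrams instead of every byte of source, and each shard holds a fraction of the index, so no single machine needs to hold all of it.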

From The Register