
Communications of the ACM

ACM TechNews

JPL Creates PDF Archive to Aid Malware Research


The 8 million PDFs were downloaded from websites across the globe.

The Digital Corpora project hosts the huge data archive as part of Amazon Web Services’ Open Data Sponsorship Program, and the files have been packaged in easily downloadable zip files.

Credit: Science RF/Adobe

Data scientists at the U.S. National Aeronautics and Space Administration's Jet Propulsion Laboratory (JPL) have compiled 8 million PDF files into an open-source archive intended to improve online security.

The corpus is part of the Defense Advanced Research Projects Agency (DARPA) Safe Documents program.

Experts can examine the archive to study how malware may be concealed within a file's code, which could help predict emerging online threats and improve PDF technology.

The researchers identified the PDFs for inclusion using Common Crawl, a public repository of Web-crawl data, while specialized software re-fetched truncated files.
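As a rough illustration of what flagging a truncated file might involve, the sketch below checks a downloaded PDF for its usual start and end markers. This is an assumed heuristic, not the method JPL used: the article does not describe how truncation was detected, and the function name `looks_truncated` and the `tail_window` parameter are hypothetical.

```python
def looks_truncated(data: bytes, tail_window: int = 1024) -> bool:
    """Heuristic check (an assumption, not JPL's method).

    A complete PDF normally begins with a "%PDF-" header and carries an
    "%%EOF" marker near its end. A file missing either is flagged as a
    candidate for re-fetching.
    """
    has_header = b"%PDF-" in data[:1024]
    has_eof = b"%%EOF" in data[-tail_window:]
    return not (has_header and has_eof)


# Hypothetical usage: decide which crawled payloads to fetch again.
complete = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\n%%EOF\n"
cut_off = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\n"
to_refetch = [name for name, blob in [("a.pdf", complete), ("b.pdf", cut_off)]
              if looks_truncated(blob)]
```

A marker check like this is cheap but imperfect: a file cut off after an intermediate `%%EOF` from an incremental update could still pass, so it would only be a first filter.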

The approximately 8-terabyte dataset is the largest publicly available corpus of its type.

From Jet Propulsion Laboratory

Abstracts Copyright © 2023 SmithBucklin, Washington, D.C., USA


 
