The Digital Corpora project hosts the huge data archive as part of Amazon Web Services’ Open Data Sponsorship Program, and the files have been packaged in easily downloadable zip files.
Credit: Science RF/Adobe
Data scientists at the U.S. National Aeronautics and Space Administration's Jet Propulsion Laboratory (JPL) have compiled 8 million PDF files into an open-source archive intended to improve online security.
The corpus is part of the Defense Advanced Research Projects Agency (DARPA) Safe Documents program.
Experts can search the archive for information on malware that could be concealed within a file's code, which could help them predict emerging online threats and improve PDF technology.
The researchers identified the PDFs for inclusion using Common Crawl, a public repository of Web-crawl data, while specialized software re-fetched truncated files.
The approximately 8-terabyte dataset is the largest publicly available corpus of its type.
From Jet Propulsion Laboratory
Abstracts Copyright © 2023 SmithBucklin, Washington, D.C., USA