Data scientists at the U.S. National Aeronautics and Space Administration's Jet Propulsion Laboratory (JPL) have compiled 8 million PDF files into an open-source archive intended to improve online security.
The corpus is part of the Defense Advanced Research Projects Agency (DARPA) Safe Documents program.
Experts can search the archive for information on malware that could be concealed within a file's code, which could help them predict emerging online threats and improve PDF technology.
The researchers identified PDFs for inclusion using Common Crawl, a public repository of Web-crawl data, and used specialized software to re-fetch files that had been truncated.
The approximately 8-terabyte dataset is the largest publicly available corpus of its type.
From Jet Propulsion Laboratory
Abstracts Copyright © 2023 SmithBucklin, Washington, D.C., USA