Communications of the ACM

The World-Wide Telescope


Most scientific data will never be examined directly by scientists; rather it is put into online databases where it is analyzed and summarized by computer programs. Scientists increasingly see their instruments through online scientific archives and analysis tools, rather than through the raw data. Today, this analysis is primarily driven by scientists posing queries, while the scientific archives are becoming active databases, self-organizing and able to recognize interesting and anomalous facts as data arrives. In some fields, data from many different archives is correlated to produce new insights. Astronomy is an excellent example of these trends, and federating astronomy archives poses interesting challenges for computer scientists.

Computational science is a new branch of most disciplines. A thousand years ago, science was primarily empirical. Over the past 500 years, each discipline has added a theoretical component. Theoretical models often motivate experiments and generalize our understanding. Most disciplines have both empirical and theoretical branches. In the past 50 years, many have added a third computational branch; for example, ecology has empirical, theoretical, and computational branches, as do physics and linguistics.

Computational science has traditionally meant simulation, which grew out of our inability to find closed-form solutions for complex mathematical models; today, computers simulate such complex models. Computational science has now also evolved to include information management, and scientists face mountains of data stemming from four converging trends:

New instruments. Driven by Moore's Law, new scientific instruments double their data output every year or so;

Simulations. The flood of data from simulations rises with computational power;

Storage. Petabytes of data are economically stored online; and

The Net. The Internet and computing Grid make all these archives accessible to anyone anywhere.

Scientific information management poses profound computer science challenges. Acquisition, organization, query, and visualization tasks scale almost linearly with data volume. By using parallelism, these problems can be solved in bounded times (minutes or hours). In contrast, most statistical analysis and data mining algorithms are nonlinear. Many tasks involve computing statistics among sets of data points in a metric space. Pairwise algorithms on N points scale as N². If the data size increases a thousandfold, the work and time can grow by a factor of a million. Many clustering algorithms scale even worse and are infeasible for terabyte-scale data sets.
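
To make the quadratic scaling concrete, here is a minimal Python sketch (ours, purely illustrative) of a pairwise statistic: the mean pairwise distance over N points requires N(N-1)/2 distance evaluations, so a thousandfold increase in N means a millionfold increase in work.

```python
import numpy as np

def mean_pairwise_distance(points):
    """Exact mean pairwise distance over N points: O(N^2) work."""
    n = len(points)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.linalg.norm(points[i] - points[j])
    return total / (n * (n - 1) / 2)

points = np.random.rand(1_000, 2)   # 1,000 points: ~500,000 distance evaluations
print(mean_pairwise_distance(points))
```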

The new online science needs new data mining algorithms that can be executed in parallel using near-linear processing, storage, and bandwidth. Unlike current algorithms that give exact answers, these algorithms will likely be heuristic and give approximate answers [3, 11].
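
As a hedged sketch of what such a near-linear approximation might look like (our construction, not an algorithm from the article or its references), the following code bins the points into a coarse grid in one O(N) pass and then estimates the same pairwise statistic from bin centers and counts; the follow-on work depends only on the fixed number of bins, not on N, and the answer is approximate.

```python
import numpy as np

def approx_mean_pairwise_distance(points, bins=32):
    """Approximate the O(N^2) statistic in near-linear time.

    One O(N) histogram pass, then work on bins**2 cell centers --
    independent of N. Pairs falling in the same cell are ignored,
    which is part of the approximation error.
    """
    counts, xe, ye = np.histogram2d(points[:, 0], points[:, 1], bins=bins)
    xc = (xe[:-1] + xe[1:]) / 2
    yc = (ye[:-1] + ye[1:]) / 2
    centers = np.array([(x, y) for x in xc for y in yc])
    weights = counts.ravel()          # same C-order as the centers list
    total, pairs = 0.0, 0.0
    for i in range(len(centers)):
        if weights[i] == 0:
            continue
        d = np.linalg.norm(centers[i + 1:] - centers[i], axis=1)
        w = weights[i] * weights[i + 1:]
        total += (d * w).sum()
        pairs += w.sum()
    return total / pairs if pairs else 0.0

points = np.random.rand(100_000, 2)   # far beyond comfortable O(N^2) range
print(approx_mean_pairwise_distance(points))
```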


Archetype for Online Science

Astronomy exemplifies these phenomena. For thousands of years it was primarily empirical. Theoretical astronomy, which began with Johannes Kepler (1571–1630), is co-equal with observation. Astronomy was early to adopt computational techniques to model stellar and galactic formation and celestial mechanics. Today, simulation is an important aspect of the field, producing new science and solidifying our grasp of existing theories.

Modern telescopes produce terabytes of data per year; next-generation telescopes will produce terabytes per night. In the era of photographic instruments, astronomers could carefully analyze individual photographic plates. In the era of electronic cameras, astronomers would need years to manually analyze just a single evening's observation. Instead, the data is fed into software pipelines using massive parallelism to analyze the images and recognize, classify, and catalog objects. Astronomers use data analysis tools to explore and visualize the data catalogs. Only when they recognize something anomalous do they go back to the source pixels; hence most source data is never examined directly by humans.

Astronomy data is collected by dedicated instruments around the world, as well as by those in Earth orbit and beyond. Each of them measures the light intensity (flux) in certain spectral bands. Using this information, astronomers extract hundreds of object attributes, including magnitude, extent, probable structure, and morphology. Even more can be learned by combining observations from various times and from different instruments.

Temporal and multispectral studies require integration of data from several archives. Until recently, this was very difficult. With the advent of the high-speed Internet and inexpensive online storage, it is much easier for scientists to compare data from multiple archives; Figure 1 shows a comparison of temporal and multispectral information about the same object, the Crab Nebula.

Astronomers typically acquire data from an archive by requesting large parts of it on magnetic tape or by requesting smaller parts by File Transfer Protocol (FTP) through the Internet. The data arrives encoded as Flexible Image Transport System (FITS) files, each with its own coordinate system and measurement regime [5]. The first task is to convert the "foreign" data into a "domestic" format and measurement system. Just as each computer scientist has dozens of definitions for the word "process," each astronomy subgroup has its own vocabulary and style. Astronomers then analyze the data with a combination of scripting languages (tcl and Python are popular) and personal analysis toolkits acquired throughout their careers. Using these tools, they "grep" the data, apply statistical tests, and look for common trends or for outliers. Figure 2 describes the general patterns astronomers look for using visualization packages to "see" the data as 2D and 3D scatter plots.
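
A minimal sketch of this FTP-then-grep style of analysis, written with the modern astropy library (which postdates the article); the file name and column names are hypothetical placeholders:

```python
from astropy.io import fits

# Open a catalog downloaded by FTP; "survey_catalog.fits" and the
# column names below are hypothetical.
with fits.open("survey_catalog.fits") as hdul:
    table = hdul[1].data              # first extension holds the object catalog
    # "grep" the data: keep bright, extended objects as outlier candidates
    outliers = table[(table["MAG"] < 15.0) & (table["EXTENT"] > 2.0)]
    for row in outliers:
        print(row["RA"], row["DEC"], row["MAG"])
```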

The FTP-GREP metaphor does not, however, work for terabyte-size data sets. FTPing or GREPing a gigabyte takes a minute, but FTPing or GREPing a terabyte can take a day or more, and sequentially scanning a petabyte takes years. So, the old analysis tools will not work on future massive astronomy data sets and databases.

Database query systems have automatic indexing and parallel search algorithms that can explore huge databases in parallel. A 100TB database occupies several thousand disks; searching them one at a time would take months, but a parallel search takes only an hour. More important, indices can focus the search to run in seconds or minutes. But because current data mining algorithms have superlinear computation and I/O costs, astronomers, as well as practitioners in other disciplines, need new linear-time approximations running on parallel computers.
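
The following toy sketch (not the SkyServer's implementation) shows the idea behind the parallel scan: partition the data, scan every partition concurrently, and elapsed time drops from a serial scan of the whole archive to roughly the total work divided by the number of workers. The file names and record layout are hypothetical.

```python
from multiprocessing import Pool

def scan_partition(path):
    """Scan one partition file, returning rows matching a predicate."""
    hits = []
    with open(path) as f:
        for line in f:
            magnitude = float(line.split(",")[2])   # hypothetical CSV layout
            if magnitude < 15.0:
                hits.append(line)
    return hits

if __name__ == "__main__":
    partitions = [f"catalog_part_{i}.csv" for i in range(64)]   # hypothetical
    with Pool(processes=16) as pool:            # 16 concurrent scanners
        results = pool.map(scan_partition, partitions)
    matches = [row for part in results for row in part]
    print(len(matches), "matching rows")
```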


All Astronomy Data Online

Nearly all the "old" astronomy data is online today as FITS files that can be FTPed to one's own site. Astronomers have a tradition of publishing their raw data after validating and analyzing it. We estimate that about half of the world's astronomical data is online—about 100TB in all.

Palomar Observatory, a 200-inch optical telescope run by the California Institute of Technology (www.astro.caltech.edu/observatories/palomar/), conducted a detailed optical sky survey in the 1950s using photographic plates. Originally, it published them via prints on plastic film for about $25,000 per copy. That data has now been digitized and is freely available on the Internet. Several new all-sky surveys begun in the 1990s are gathering deep, statistically uniform observations of the sky in about 20 spectral bands. Each survey will ultimately collect terabytes of data. They offer large sky coverage and sound statistical plans, are well-documented, and are being designed to be federated into a Virtual Observatory [2]. Software to process the data, present the catalogs to the public, and federate the data with other archives is a major part of each survey, often representing more than 25% of a project's budget.

The astronomy community was also early to put its literature online [6], and it has been cross-correlating the scientific literature with the archives [3, 7]. Today, these tools allow us to point at an object and quickly locate all available literature about it, as well as all the other archives with information about the object.


Description and Benefits

The World-Wide Telescope is emerging from the world's online astronomy archives. It collects observations in all the observed spectral bands from the best instruments on Earth and in space, reaching back to the beginning of recorded observation. The "seeing" is always good; the sun, moon, and clouds never cause dead time when we cannot observe. All this data will be cross-indexed with the online literature. Today, all astronomy literature is available online through www.google.com and www.AstroPh.com. In the future, it should be possible to find and analyze the underlying observational data just as readily. Indeed, anyone may even be able to request the acquisition of new data.

The World-Wide Telescope is having a democratizing effect on astronomy. Professional and amateur astronomers alike have nearly equal access to its data. The major difference is that some have much better data analysis tools and skills than others. Following up on a conjecture often requires a careful look at an object using an advanced instrument like the Hubble Space Telescope, so some still have privileged access to such instruments. But for studies of the global structure of the universe, tools to mine the online data will represent a wonderful telescope in their own right.

The World-Wide Telescope is an extraordinary tool for teaching astronomy, giving students at every grade level access to the world's best telescope. It may also be a great way to teach computational science skills, as the data is real and well-documented and has a strong visual component.

Building the World-Wide Telescope requires skills from many disciplines. First, the many astronomy groups that gather data and produce catalogs must make the data available. Theirs is at least 75% of the effort, but once accomplished, the data will still have to be made accessible and useful. Making the data useful requires three additional components:

Plumbing. Good database plumbing will store, organize, query, and access the data, both in huge single-site archives and in the distributed Internet database combining all the archives. This involves data management and, as important, meta-data management to integrate the different archives.

Data mining algorithms and tools. Data mining algorithms and tools help recognize and categorize data anomalies and trends. They draw heavily on techniques from statistics and machine learning but also require new approaches that scale linearly with data size. Most of them will be generic, but some will involve a deep understanding of astronomy.

Data visualization. Good data visualization tools will make it easy to pose queries in a visual way and understand the answers.

The World-Wide Telescope will be a vast database. Each spectral band is tens of terabytes. The multispectral and temporal dimensions grow data volumes to petabytes. Automatic parallel search and good indexing technologies are essential. We expect the large database search and index problems to be solved; there has been great progress in the past, and more is on the horizon.




In contrast, the problem of integrating heterogeneous data schemas, which has eluded solution for decades, is now even more pressing. Automatically combining data from multiple data sources, each with its own lineage, units, quality, and conventions, is a special challenge in its own right. Today, database archivists and data miners do schema integration one item at a time. The World-Wide Telescope must make it easy for astronomers to publish their data on the Internet in understandable formats. It must also make it easy for their colleagues to find and analyze the data using standard tools.


The Virtual Observatory and SkyServer

A prototype virtual telescope called the SkyServer (see SkyServer.SDSS.org/) is just a small part of the larger Virtual Observatory being built jointly by the international astronomy community [2]. SkyServer began as an attempt to make the Sloan Digital Sky Survey (SDSS) data readily available [12]. The project later expanded to include tools for data mining, an educational component, and an effort to federate the SDSS with other archives and with the literature.

The SkyServer gives interactive access to the data via a point-and-click virtual telescope view of the pixel data and via canned reports generated from the online catalogs. It also allows ad hoc catalog queries. All data is accessible via standard browsers. A Java GUI client lets users pose SQL queries, while Python and Emacs interfaces allow client scripts to access the database. All these clients use the same public HTTP/SOAP/XML interfaces.
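
A minimal sketch of such a client script; the article specifies only that clients use public HTTP/SOAP/XML interfaces, so the endpoint URL and parameter names below are hypothetical.

```python
import urllib.parse
import urllib.request

# Pose an SQL query over plain HTTP; the URL and parameters are invented.
sql = "SELECT TOP 10 objID, ra, dec FROM PhotoObj"
params = urllib.parse.urlencode({"cmd": sql, "format": "csv"})
url = "http://skyserver.example.org/query?" + params
with urllib.request.urlopen(url) as response:
    print(response.read().decode())
```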

We designed the SkyServer database to answer 20 queries typifying the questions astronomers might ask of an archive [10]; two examples are "Find gravitational lens candidates" and "Create a gridded count of galaxies satisfying a color cut." We were delighted to find that all the queries had fairly short SQL equivalents; indeed, most were only a single SQL statement [4].
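
For flavor, here is one plausible single-statement rendering of the gridded-count query (our reconstruction, not the statement from [4]); the table and column names follow SDSS conventions but should be read as illustrative.

```python
# Count galaxies (type = 3) passing a g-r color cut, binned on a
# 5-degree grid; schema details are illustrative, not verbatim SDSS.
gridded_count = """
SELECT CAST(ra / 5 AS INT)  AS ra_bin,
       CAST(dec / 5 AS INT) AS dec_bin,
       COUNT(*)             AS galaxies
FROM   PhotoObj
WHERE  type = 3                -- galaxies only
  AND  g - r > 1.0             -- the color cut
GROUP  BY CAST(ra / 5 AS INT), CAST(dec / 5 AS INT)
"""
```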

An anecdote conveys how the SkyServer's interactive data access can change the way astronomers work. A colleague challenged us to find "fast moving" asteroids. This was an excellent test case; he had written a 12-page tcl script that ran for three days on the flat files of the data set, so we had a benchmark against which we could compare our experience. It took a long day to debug our understanding of the data and develop an 18-line SQL query answering the question. The resulting query returned the answer in a few minutes. This interactive (not three-day) access allowed us to "play" with the data and identify other interesting objects. Being able to ask questions in a few hours and get answers in a few minutes changes the way data is viewed; experiments can be interactive. When queries take three days and hundreds of lines of code, one asks fewer questions and so gets far fewer answers. This and similar experiences convinced us that interactive access to scientific data and data mining tools can dramatically improve productivity.
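
We cannot reproduce the 18-line query here, but a condensed sketch in its spirit looks like the following; rowv and colv are, to our knowledge, the row- and column-velocity columns in the SDSS schema, and the velocity threshold is arbitrary.

```python
# A condensed sketch in the spirit of the fast-mover search, not the
# 18-line original.
fast_movers = """
SELECT objID,
       SQRT(POWER(rowv, 2) + POWER(colv, 2)) AS velocity
FROM   PhotoObj
WHERE  POWER(rowv, 2) + POWER(colv, 2) > 50   -- moving fast
  AND  rowv >= 0 AND colv >= 0
"""
```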

The SkyServer is also an educational tool. Several related interactive astronomy projects, from elementary to graduate level, have been developed in three languages: English, German, and Japanese. Interest in this aspect of the SkyServer continues to grow.

The SDSS data is public. Computer scientists have begun using it in data mining and visualization research. A 0.1% edition consists of about 1GB, and a 5% edition consists of about 100GB; the 100% edition will be about 25TB when complete in 2007. The 5% edition can be cloned for about $5,000. In parallel, our colleagues at CalTech built VirtualSky.org, putting most of the Digital Palomar Sky Survey data online.

Now that Web servers provide HTML access to the data, the next step is federating them into a single database with transparent access to all the data. As the data sets are already federated with the literature, one can point at an object and find everything written about it, along with all other archives cataloging that object [7, 8].

SkyQuery gives a taste of such a Virtual Observatory data federation (see SkyQuery.net). Using Web-services technologies, SkyQuery federates the optical SDSS archive at the Fermi National Accelerator Laboratory (FermiLab) in Batavia, IL, with a radio survey [1] archive at Johns Hopkins University and with the Two Micron All Sky Survey [9] archive at CalTech. Given a query, the SkyQuery portal polls these three SkyNodes and returns their combined answer. A query can be stated as: "For a certain class of objects, find all information about corresponding objects in the other surveys." The SkyQuery portal combines this information with a picture of the object generated by a fourth Web service. Automatically answering this query requires uniform naming, coordinate systems, measurement units, and error handling; in other words, it exposes many of the schema-integration problems the World-Wide Telescope will face.
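
At the heart of such a federation is positional cross-matching: deciding that records in two surveys describe the same object because their sky coordinates agree within some radius. A brute-force Python sketch (ours; a real SkyNode would use a spatial index rather than a double loop):

```python
import numpy as np

def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(np.radians, (ra1, dec1, ra2, dec2))
    h = (np.sin((dec2 - dec1) / 2) ** 2
         + np.cos(dec1) * np.cos(dec2) * np.sin((ra2 - ra1) / 2) ** 2)
    return np.degrees(2 * np.arcsin(np.sqrt(h)))

def cross_match(cat1, cat2, radius_deg=1.0 / 3600):   # 1 arcsecond
    """Match (ra, dec) lists by position; O(N*M), fine for small lists."""
    matches = []
    for i, (ra1, dec1) in enumerate(cat1):
        for j, (ra2, dec2) in enumerate(cat2):
            if angular_separation(ra1, dec1, ra2, dec2) < radius_deg:
                matches.append((i, j))
    return matches

optical = [(180.0000, 0.0000), (181.0000, 0.5000)]   # made-up positions
radio   = [(180.0001, 0.0001)]
print(cross_match(optical, radio))                    # -> [(0, 0)]
```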

Building the World-Wide Telescope Web service will require an objectified definition of astronomy objects. This object model will define a set of classes, as well as the methods on the classes. Each archive then becomes a Web service instantiating the classes it implements. Defining the object model is a fascinating challenge for both astronomers and computer scientists.
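
To suggest what "objectified" might mean in practice, here is a hypothetical Python sketch; every class, method, and band name is invented for illustration, since the article leaves the object model as an open design problem.

```python
from abc import ABC, abstractmethod

class SkyNode(ABC):
    """One archive exposed as a Web service implementing the common model."""

    @abstractmethod
    def cone_search(self, ra: float, dec: float, radius_deg: float) -> list:
        """Return catalog objects within radius_deg of (ra, dec)."""

    @abstractmethod
    def bands(self) -> list:
        """Spectral bands this archive observes."""

class OpticalNode(SkyNode):
    """A hypothetical optical archive; a real one would query its database."""

    def cone_search(self, ra, dec, radius_deg):
        return []   # placeholder: would run a spatial query on the catalog

    def bands(self):
        return ["u", "g", "r", "i", "z"]
```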


Conclusion

The primary goal of the World-Wide Telescope is to make astronomers more productive, helping them better understand their data. But it also represents an archetype for the evolution of computational science from its simulation roots to the broader field of capturing, organizing, analyzing, exploring, and visualizing scientific data; the World-Wide Telescope is a prototype for this new role. Similar trends are occurring in most other disciplines, including genomics, economics, and ecology.

This transformation poses special challenges for the database community, which will have to deal with huge data sets, richer datatypes, and much more complex queries. Federating the archives is a good test of distributed systems technologies, including Web services and distributed object stores. The data mining community is also challenged by the huge size and high dimensionality of the data. Because the data is public, the World-Wide Telescope is also an excellent place to compare and evaluate data mining algorithms. The World-Wide Telescope challenges statisticians to develop algorithms that run fast on very large data sets. It also poses the challenge of making it easy to visually explore the data, posing queries in natural ways and seeing the answers in intuitive formats. Finally, but perhaps most important, it can be a valuable resource for teaching the new astronomy, as well as for teaching computational science.


References

1. Becker, R., Helfand, D., White, R., Gregg, M., and Laurent-Muehleisen, S. Faint Images of the Radio Sky at 20 Centimeters (FIRST); see sundog.stsci.edu.

2. Brunner, R., Ed. Virtual Observatories of the Future. Astronomical Society of the Pacific, San Francisco, CA, 2001; see www.voforum.org/.

3. Connolly, A., Genovese, C., Moore, A., Nichol, R., Schneider, J., and Wasserman, L. Fast algorithms and efficient statistics: Density estimation in large astronomical data sets. Astronom. J. (2002).

4. Gray, J., Slutz, D., Szalay, A., Thakar, A., van den Berg, J., Kunszt, P., and Stoughton, C. Data Mining the SDSS SkyServer Database. Tech. Rep. MSR-TR-2002-01, Microsoft, Redmond, WA, Jan. 2002.

5. Hanisch, R. et al. Definition of the Flexible Image Transport System (FITS). NOST 100-2.0. NASA/Science Office of Standards and Technology, Code 633.2, NASA Goddard Space Flight Center, Greenbelt, MD, Mar. 29, 1999; see archive.stsci.edu/fits/fits_standard/.

6. Murray, S., Eichhorn, G., Kurtz, M., Accomazzi, A., Stern Grant, C., Bohlen, E., and Thompson, D. Astrophysics Data System (ADS). NASA and Harvard University, Cambridge, MA; see adswww.harvard.edu.

7. NASA. NASA/IPAC Extragalactic Database (NED). Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA; see nedwww.ipac.caltech.edu/.

8. Ochsenbein, F., Bauer, P., and Marcout, J. The VizieR database of astronomical catalogues. Astron. Astrophys. Suppl. Ser. 143, 221 (2000); see also the SIMBAD Astronomical Database, simbad.u-strasbg.fr/.

9. Skrutskie, M. et al. 2 Micron All Sky Survey (2MASS); see pegasus.phast.umass.edu.

10. Szalay, A., Gray, J., Thakar, A., Kunszt, P., Malik, T., Raddick, J., Stoughton, C., and van den Berg, J. The SDSS SkyServer: Public access to the Sloan Digital Sky server data. In Proceedings of ACM SIGMOD 2002 (Madison, WI, June 3–6). ACM Press, New York, 2002, 451–462.

11. Szapudi, I. et al. Estimation of correlations in large samples. In Proceedings of MPA/MPE/ESO Conference (Mining the Sky), A. Banday, S. Zaroubi, and M. Bartelman, Eds. Springer-Verlag, New York, 2001, 249–255.

12. York, D. et al. The Sloan Digital Sky Survey: Technical summary. Astronom. J. 120 (Sept. 2000), 1579–1587.


Authors

Jim Gray ([email protected]) is a distinguished engineer at Microsoft Research, San Francisco, CA.

Alex Szalay ([email protected]) is the Alumni Centennial Professor in the Department of Physics and Astronomy at The Johns Hopkins University, Baltimore, MD.


Footnotes

Jordon Raddick of Johns Hopkins University led the SkyServer education effort. Tom Barclay of Microsoft, Tamas Budavari of Johns Hopkins University, Tanu Malik of Johns Hopkins University, Peter Kunszt of CERN, Don Slutz of Microsoft, Jan van den Berg of Johns Hopkins University, Chris Stoughton of FermiLab, and Ani Thakar of Johns Hopkins University helped build the SkyServer (at FermiLab) and SkyQuery. Hewlett-Packard Co., Microsoft Corp., and FermiLab provide financial and technical support for the SkyServer. Roy Williams, Julian Bunn, and George Djorgovski, all of CalTech, are building VirtualSky.


Figures

Figure 1. The Crab Nebula, remnant of a supernova recorded a thousand years ago. These images show the importance of comparing old and new observations; such cataclysmic and variable objects reveal interesting temporal phenomena. The images from three different spectral bands show that different information is available from each instrument. Temporal and multispectral data on the same objects generally yield better models than single studies.

Figure 2. Scientists examine data sets looking for central clusters, isolated data clusters, points between clusters, holes, and isolated points.



©2002 ACM  0002-0782/02/1100  $5.00

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2002 ACM, Inc.


 
