Knowledge management (KM) within an organization is often considered as a way to increase competitive ability [10]. However, KM has not always been well received within many corporations. A Bain & Company report [11] evaluated 25 different types of tools, of which KM tools ranked 24th in satisfaction. The report also showed how KM software has a relatively high rate of defection at 13%. The primary reasons for this are the expense [5], the difficulty acquiring new knowledge [3], and knowledge dissemination. Consequently, many researchers have advocated data mining of external data sources to supplement organizational knowledge [1].
It has been established that programs can be written to retrieve and store data from the Web [6]. However, development and execution of these programs are quite complicated. Large programming efforts and high maintenance costs are duplicated across corporations to achieve similar or identical results. Also, data retrieved by such techniques is static. Figure 1 depicts a programmer who collects data from the Web and stores that data into the corporate database, as opposed to ad hoc queries used inside a database to query various information in different formats depending upon the users' needs. New information that is available for the Web will be made available with these ad hoc queries instead of the static ones that are stored. Another point is that the information available outside is not stored explicitly in the database, which means new information is always available when queried. However, as with traditional database views, Structured Query Language (SQL) commands can transfer this information to a permanent storage location.
In this article, we develop an SQL extension that allows corporate databases to be joined to explicit information contained on any corporate or external Web site. By using existing SQL/database technology, not only are costs minimal for implementation of this new SQL extension, but users can seamlessly retrieve information from database-Web joins (see Figure 1). We seek answers to two questions:
We undertake three tasks here. The first is to design new principled extensions to the SQL language called a WEBVIEW, allowing transparent joins between database and Web data. The second is to show the WEBVIEW is robust such that it can capture Web data of interest, and that identical uses of the extensions will yield identical results. And the third is to develop a tool that implements these extensions as a proof of concept that the WEBVIEW extension is practical for use.
We examine two different literature bases derived from information systems and computer science, including research on KM and data mining, SQL access of HTML, and multi-database systems (MDBSs). We continue with a description of how our proposed WEBVIEW can make contributions to KM, SQL, MDBS design, and HTML/ XML information capturing.
KM and Data Mining. Most KM literature centers on identifying sources of knowledge within a company and capturing that tacit knowledge known only by one or a few employees, and converting that knowledge to explicit knowledge inside a knowledge repository of some sort [10]. Software tools that aid KM have been reported to be expensive and of questionable value [5].
Mobasher, Cooley, and Srivastava [9] describe how pattern matching is not sufficient for data mining, and that "useful and quality" information needs to be identified from these patterns. We build upon their research by creating database constructs that allow ad hoc queries of patterns, thus allowing a dynamic retrieval of data patterns that are deemed useful. Chung and Gray [1] explain how KM, data warehousing, and data mining all work in conjunction with each other, and how the Web has added a new dimension to KM by facilitating the acquisition of new knowledge from external sources. We add to this literature by developing a language and tool that facilitate data collection and join it to existing database information.
SQL and HTML. SQL is the language used by most databases and has been advocated as a means to access specific Web data (for example, [4]). SQL is said to be relationally complete in that it can be used to express any query supported by predicate (or relational) calculus [2]. By tightly coupling Web data to SQL using SQL extensions, we get the benefit of being relationally complete (since SQL itself is relationally complete) and are left with simpler tasks of ensuring that our SQL extension is robust in that it is sufficient to capture all Web data, including hierarchical representations (XML) and relational representations (links). An SQL extension also ensures that users can access Web data transparently so that Web access is available to any SQL-based tool. Specifically, that any SQL statements, such as SELECT, remains unaltered when accessing the new WEBVIEW construct. Thus far, no single proposed tool for data mining has addressed the challenges of SQL transparency and robustness.
MDBS. There have been many articles that discuss SQL extensions, mainly in the area of MDBSs that can access disjoint relational SQL databases (for example, [7]). Lakshmanan, Sadri, and Subramanian [8] advocate five required features for SQL extensions: (1) the language must have expressive power that is independent of the schema where the database is structured; (2) the language must allow restructuring of one database to conform to the schema of another; (3) the language must be easy to use yet sufficiently expressive; (4) the language must provide full capabilities that are downward compatible with SQL, so that existing SQL functions properly in the presence of the MDBS; and (5) the language must be able to be efficiently implemented. We build upon Lakshmanan, Sadri, and Subramanian's work [8] by proposing an SQL extension to incorporate these five requirements into a WEBVIEW: (1) it must have expressive power that is independent of HTML, XML, or other Web-based markup languages; (2) it must allow the restructuring of Web data to conform to a database schema; (3) it must be shown to be sufficient to capture any Web data, including XML or HTML; (4) it must function like existing database constructs to allow transparency for the database developer; and (5) it must be efficiently implemented.
The nature of the WEBVIEW allows Web pages to be retrieved and joined to a database in an ad hoc manner, so that Web pages can be joined to new corporate data without the need for further programming.
The CREATE WEBVIEW command for creating ad hoc queries is shown in Code Example 1. The table here also summarizes the CREATE WEBVIEW clauses, which can be used in any order, except that the COLUMN command must follow the applicable ROW or NESTED ROW, and the CREATE WEBVIEW command must occur first.
To test the viability of the CREATE WEBVIEW command, we "piggyback" our engine on top of an existing Open Database Connectivity (ODBC) database manager utilizing virtual tables, and corresponding SQL statements are then sent to the database engine through the ODBC manager. Thus, CREATE WEBVIEW can be tested with any database that supports (or has third-party support for) ODBC (for example, Oracle, Sybase, SQL Server, and Access). The table and Code Example 1 describe the skeleton for the WEBVIEW scheme.
The tool we developed takes SQL statements, including the new CREATE WEBVIEW extension, and passes them to an ODBC database engine. The WEBVIEW tool shows the usability of the CREATE WEBVIEW statement and use of this statement in combination with existing SQL syntax.
CREATE WEBVIEW that captures data sets that span several Web pages. Code Example 2 shows how we can use the CREATE WEBVIEW statement to retrieve the results of a Froogle search. (Froogle is a popular "shop bot" that compares prices of various vendors.) Code Example 2 shows how a search string can be used to retrieve results shown in Figure 2. Here, we provide an example of querying Froogle by mimicking a search for a popular book, The Da Vinci Code, using its ISBN number. The LINK directive inside the CREATE WEBVIEW command allows this WEBVIEW to span 88 Web pages and retrieve 871 prices from various book sellers.
CREATE WEBVIEW that captures hierarchical data sets (XML). In order to be sufficient for the data-collecting task, the CREATE WEBVIEW statement needs to be able to retrieve hierarchical data from a Web page. Code Example 3 shows the XML used for instruction in an XML and B2B class at a midwestern university in the U.S. and shows how we can use the CREATE WEBVIEW statement to retrieve the results of XML similar to that shown in Code Example 2. Code Example 3 shows how the hierarchical nature of XML can be captured into a relational format by using the CREATE WEBVIEW statement with a NESTED clause. Notice that, in the XML, Joe Student does not have a customer number; WEBVIEWs set any missing field to NULL.
WEBVIEWS created via joins to database tables. On some data retrievals, complex behavior is required to get to the proper page. Code Example 3 shows how a WEBVIEW can be joined to a database table (or other WEBVIEWs) by using a SELECT statement within the USING clause of the WEBVIEW. In Figure 3, we combine the power of a SELECT statement inside the USING clause to retrieve a list of sites from a database, combining it with the recursive nature of the LINK clause, leading to a very powerful routine. The code used to create Figure 3 was used to retrieve eBay online auction sites from a database and use them to represent a data set containing 3,795 auctions in a little less than three minutes (on a high-speed line) from over 78 Web pages. Such a technique can be used for monitoring competitors within an industry on thousands of products, gathering data for market or academic research, or using KM and data mining tools to determine relationships that may be hidden from management. Furthermore, the nature of the WEBVIEW allows Web pages to be retrieved and joined to a database in an ad hoc manner, so that Web pages can be joined to new corporate data without the need for further programming. Finally, the simplicity of the WEBVIEW allows the database administrator to easily capture Web pages that may be prone to change.
This is the first research to identify a need for ad hoc joins to any Web data (regardless of format) and database data where the user can join existing corporate data to Web pages on the fly without major additional programming effort. With such a wide variety of Web pages, including varying HTML and XML formats, we feel this is best accomplished via a transparent SQL extension.
Our research addresses this need via WEBVIEW, an SQL language extension we developed that can collect and disseminate external Web data to a corporate database based on the varied information needs of the organization using SQL commands that have not changed other than the CREATE WEBVIEW addition.
External information brought into the firm from outside sources, especially for KM applications, is difficult to collect, refresh, and manage. The tool and the SQL language allow us to manipulate the data from the Web pages. It has the ability to download enormous amounts of data from large numbers of Web pages. Since data is not explicitly stored, the data derived is not static; up-to-date information is made available when the query is made. Also, data is not stored in the corporate databases in various formats to avoid redundancy and duplication of data. The tools developed using this extension have the potential to impact corporate competitive strategies, supplier and client relations, and corporate research. For researchers, this language and tool allow the building of relatively cost-free databases of actual transaction, economic, and market data on the Web. We have focused on the WEBVIEW SQL construct. Future research should specify a user application interface that can automatically code for some of the more common patterns found in Web pages to simplify WEBVIEW development.
1. Chung, H.M. and Gray, P. Special section: Data mining. Journal of Management Information Systems 16, 1 (Summer 1999).
2. Codd, E.F. Further normalization of the data base relational model. In Data Base Systems. Prentice-Hall, Englewood Cliffs, NJ, 1972, 3364.
3. Davenport, T.H. and Prusak, L. Working Knowledge: How Organizations Manage What they Know. Harvard Business Press, Cambridge, MA, 1998.
4. Deutsch, A., Fernandez, M., Florescu, D., Levy, A., and Suciu, D. A query language for XML. Computer Networks 31, 11 (May 1999), 11551169.
5. Horwitch, M. and Armacost, R. Helping knowledge management be all it can be. The Journal of Business Strategy 23, 3 (May/June 2002), 2631.
6. Kauffman, R.J., March, S.T., and Wood, C.A. Mapping out design aspects for data-collecting agents. International Journal of Intelligent Systems in Accounting, Finance, and Management 9, 4 (Dec. 2000), 217236.
7. Krishnan, R., Li, X., Steier, D, and Zhao, L. On heterogeneous database retrieval: A cognitively guided approach. Information Systems Research 12, 3 (Sept. 2001), 286303.
8. Lakshmanan, L.V.S., Sadri, F., and Subramanian, S.N. SchemaSQL: An extension to SQL for multidatabase interoperability. ACM Transactions on Database Systems 26, 4 (2001), 476519.
9. Mobasher, B., Cooley, R., and Srivastava, J. Automatic personalization based on Web usage mining. Commun. ACM 43, 8 (Aug. 2000), 142151.
10. Nonaka, I. Dynamic theory of organizational knowledge creation. Organization Science 5, 1 (Feb. 1994), 1437.
11. Rigby, D. Management Tools: Annual Survey of Senior Executives, 2001; www.bain.com/bainweb/expertise/tools/overview.asp.
Figure 1. Static versus dynamic representation of Web data.
Figure 2. Retrieval results for The Da Vinci Code search.
Figure 3. Retrieval results for eBay sites from a database.
Figure. Code Example 1. WEBVIEW specification.
©2005 ACM 0001-0782/05/0900 $5.00
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2005 ACM, Inc.
No entries found