The Web is an integral part of today's business dealings. Companies and institutions exploit the Web to conduct their business; customers make daily use of the Net to perform all kinds of transactions. In addition, most users browse through pages of personal interest. The Web, as we know, is massive, and its data is collected from countless sources. Consequently, search tools should be able to accurately extract, filter, and select information that would otherwise remain "hidden" in this mass of data.
Web Usage Mining (WUM) typically extracts knowledge by analyzing historical data such as Web server access logs, browser caches, or proxy logs. WUM techniques are important for several reasons. They make it possible to model user behavior and, therefore, to forecast users' future movements. The information mined can subsequently be used to personalize the contents of Web pages, to improve Web server performance, to structure a Web site according to the preferences expressed by the users, or to help a business reach a specific target group of users.
Web Personalization (WP) or recommender systems [3] are typical applications of WUM. These systems were introduced to improve Web site usage by customizing the contents of a Web site with respect to the users' needs. They provide mechanisms that collect information describing user activity and elaborate this information.
In a first stage, WUM can be used to determine the number of Web server accesses, the pages requested, the time interval between different user sessions, and the IP addresses of the Web server's users. The WP system elaborates this information to extract a user profile, based on the user's navigational behavior, that can be employed to provide personalized navigational information. Personalization systems usually process the information related to user sessions, where a session is a sequence of pages requested by the same user. For instance, the sequence Home → Science → Computer Science → Algorithms could be a typical session of a person browsing a directory (for example, Yahoo!) with an interest in computer algorithms.
The WUM-based personalization process is typically structured according to three phases: Preprocessing, Pattern Discovery, and Pattern Analysis.
Preprocessing consists of elaborating the raw Web access logs to produce data in a format usable by the Pattern Discovery phase. The goal is to prune all non-relevant data from the logs; which data counts as relevant depends on the kind of mining analysis to be carried out. At the end of this phase, a knowledge base is produced.
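As a rough illustration of this phase, the following Python sketch groups already-parsed log records into user sessions. The record layout, the list of ignored file suffixes, and the 30-minute inactivity timeout are assumptions made for the example; they are not taken from any particular WP system.

```python
from collections import defaultdict

# Hypothetical record layout: (user_id, unix_timestamp, requested_path).
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")  # assumed non-relevant requests
SESSION_TIMEOUT = 30 * 60  # assumed inactivity gap between sessions, in seconds


def build_sessions(records, timeout=SESSION_TIMEOUT):
    """Prune non-relevant requests and split each user's hits into sessions."""
    by_user = defaultdict(list)
    for user, ts, path in records:
        if path.lower().endswith(IGNORED_SUFFIXES):
            continue  # drop embedded resources such as images and style sheets
        by_user[user].append((ts, path))

    sessions = []
    for hits in by_user.values():
        hits.sort()  # chronological order for this user
        current, last_ts = [], None
        for ts, path in hits:
            # A long pause closes the current session and opens a new one.
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append(current)
                current = []
            current.append(path)
            last_ts = ts
        if current:
            sessions.append(current)
    return sessions
```

The output, a list of page sequences, plays the role of the knowledge base handed to the Pattern Discovery phase in this simplified picture.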
The goal of the Pattern Discovery phase is to discover useful data patterns in the knowledge base created in the previous phase (for example, user access patterns) by exploiting various data mining techniques such as Statistical Analysis, Frequent Pattern Analysis, Clustering, and Association Rules Mining.
Pattern Analysis aims to find interesting patterns. Once the correlations have been determined, one must decide which patterns to keep and which to reject with respect to the goal of the mining process.
In the past, WP system architectures comprised two components, one executed offline and one online with respect to the Web server activity. The offline component includes the Preprocessing and Pattern Discovery phases, while the online one implements the Pattern Analysis phase in order to generate personalized content such as links to pages, advertisements, or information relating to products or services estimated to be of interest to the current user.
We should note that, in order to enhance the navigation experience, these systems must have two main characteristics: they should be non-intrusive, and they should be scalable with respect to the Web site size. Moreover, the management of dynamic pages is another important issue for WP systems. Recent Web-based marketing strategies have mainly focused on the presentation of products and services and on interaction with clients. This has led Web designers to use dynamic pages extensively. The static approach, on the other hand, is preferable only in the case of "small" and "quasi-static" Web sites.
Here, we introduce SUGGEST, a novel solution to implement WP as a single online module that performs user profiling, model updating, and recommendation building [4]. SUGGEST is designed to dynamically generate personalized contents of potential interest for users of large Web sites made up of dynamically generated pages. It is based on an incremental personalization procedure tightly coupled with the Web server. It is able to update, incrementally and automatically, the knowledge base obtained from historical usage data and to dynamically generate a list of page links (suggestions). The suggestions are used to personalize the requested HTML page on-the-fly. The adoption of an LRU-based (Least Recently Used) algorithm to handle the knowledge base makes it possible for SUGGEST to manage large Web sites.
The majority of the existing WP systems are structured according to the offline and online modules (see Figure 1).
Quite a large number of commercial and open source solutions have been proposed and are often available on the market. IndexFinder [9] is a semi-automatic solution to develop adaptive Web sites. Based on statistical analysis and the visit-coherence assumption (that is, pages within the same session are in general conceptually related), Web logs are analyzed to identify clusters of frequently co-occurring pages. Then, index pages containing a collection of hyperlinks to related but unlinked pages regarding a specific topic are generated. An index page forms the suggestions presented to the Web master.
A solution proposed in [7] involves enhancing usage mining by enriching the set of information usually registered into Web logs with formal semantics based on the ontology underlying the site. Data mining techniques can be applied to enriched Web logs to extract knowledge that could be used to improve the navigational structure as well as exploited in recommender systems.
A recommender system proposed in [4] is based on both content and usage data and exploits semantic annotation of Web logs to produce suggestions including documents that are not in the same path, and whose content is relevant with respect to the page visited. In order to characterize every site page, a set of keywords is extracted by means of a text mining analysis and is mapped to categories according to a domain-specific taxonomy and thesaurus. The Web logs are then enriched with relevant keywords and categories. The documents are clustered, based on the similarity between the category terms, and used to expand the recommendation set suggested to the end user.
SETA is aimed at supporting e-commerce initiatives, in particular at assisting users as they navigate Web virtual stores [1]. The system is designed as a multi-agent architecture. Each specialized agent has been designed to support a different activity of the front-end of a Web store. Adaptation is carried out using a classification of the users based on Bayesian Networks that infer user behavior from the specified profile. The system requires an initial step of personal information collection.
Oracle 10gAS Personalization is a commercial product offering Web personalization functions [8]. A periodically built predictive model is used to generate real-time suggestions. The model itself is built by choosing the best model among those generated by two distinct classification algorithms: Predictive Association Rules [5] and Transactional Naïve Bayes [6]. The model is stored in a table of an Oracle database, making it possible to exploit the data mining model through PL/SQL procedures.
The main limitation of traditional systems is the loosely coupled integration of the Web personalization system with the ordinary activity of the Web server. Indeed, the use of two components has the disadvantage of an "asynchronous cooperation" between the components. The offline component must be run periodically to keep data patterns up-to-date, but the frequency of the updates is a problem that must be solved on a case-specific basis. On the other hand, the integration of the offline and online component functions in a single component poses other problems in terms of overall system performance, since the system should have a very low impact on user response times. Thus, the system must be able to generate personalized results in a small fraction of a user session. Moreover, the knowledge mined by using a single component must be comparable to, or better than, that mined by using two separate components. Figure 2 shows the architecture of SUGGEST, the completely online Web Recommender System we recently proposed [2].
SUGGEST is completely online and incremental, and it is aimed at providing the users with information about the pages they may find of interest. It bases personalization on a user's classification that evolves according to the user's requests.
Usage information is represented by means of an undirected graph whose nodes are associated with the identifiers of the accessed pages, and whose edges are each associated with a measure of the correlation existing between the two nodes (pages) they connect. This graph is incrementally modified to keep the user model up-to-date.
In our model the "interest" in a page does not depend on its contents but on the order by which a page is visited during a session. Therefore, to weight each edge of the graph we introduced a novel formula:

$$W_{ij} = \frac{N_{ij}}{\max\{N_i, N_j\}}$$

where $N_{ij}$ is the number of sessions containing both pages $i$ and $j$, and $N_i$ and $N_j$ are the number of sessions containing only page $i$ or $j$, respectively. Dividing $N_{ij}$ by the maximum between single occurrences of the two pages has the effect of discriminating internal pages from the so-called index pages. Index pages are those that do not generally contain useful content and are only used as a starting point for a browsing session. We decided to consider index pages to be of too little interest as potential suggestions because they are very likely to be included in too many sessions.
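To make the weighting concrete, here is a small Python sketch that computes the co-occurrence count of two pages divided by the larger of their individual counts, which is how the description above reads. It assumes that the per-page count is the number of sessions in which the page appears at all; the function name `edge_weights` and the batch (non-incremental) computation are choices made for the illustration, not part of SUGGEST.

```python
from collections import Counter
from itertools import combinations


def edge_weights(sessions):
    """Return {(i, j): N_ij / max(N_i, N_j)} for every co-occurring page pair.

    Assumption: N_i counts the sessions in which page i appears at all.
    Pairs are keyed with i < j so that each undirected edge appears once.
    """
    page_count = Counter()  # N_i
    pair_count = Counter()  # N_ij
    for session in sessions:
        pages = sorted(set(session))  # count each page once per session
        page_count.update(pages)
        pair_count.update(combinations(pages, 2))
    return {
        (i, j): n_ij / max(page_count[i], page_count[j])
        for (i, j), n_ij in pair_count.items()
    }


sessions = [
    ["home", "a", "b"],
    ["home", "a"],
    ["home", "c"],
    ["home", "b", "a"],
]
weights = edge_weights(sessions)
# "home" occurs in all four sessions, so its edge to the rarely visited
# page "c" gets a low weight: weights[("c", "home")] is 1 / 4.
```

Because an index page such as `home` appears in nearly every session, the denominator stays large for all of its edges, which is the discriminating effect described above.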
Index pages are used in other works (for example, [9]) to present the results of the personalization phase. In these cases index pages are not actually used to identify potentially useful information but just to present the personalization results.
SUGGEST uses user sessions (identified by means of a cookie-based protocol) to build "Session Clusters," which eventually lead to a list of suggestions. SUGGEST finds groups of strongly correlated pages by partitioning the graph according to its connected components. Each component in turn represents a different class, or cluster, of users. The connected components are obtained in an incremental way by using a derivation of the well-known Breadth-First Search (BFS) visit limited to the nodes involved in the request. Basically, we start from the current page identifier and we explore the component to which it belongs. If any nodes are not reached by the visit, a previously connected component has been split and the new component needs to be identified. We simply apply the BFS again, starting from one of the nodes not visited. Furthermore, in order to limit the number of edges of the graph we applied a threshold.
Edges whose weights are smaller than the predefined threshold are considered poorly correlated and thus discarded. Pages in the same cluster are ranked according to their co-occurrence frequency, and clusters with a size lower than a threshold value are discarded as not significant enough. Note that the update algorithm does not involve the exploration of the entire graph but just the nodes associated with the pages of the cluster containing the starting page. Usually, each cluster is composed of only a fraction of the entire node set, thus the cost of the algorithm is very limited.
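The clustering step can be sketched in Python as a thresholded BFS that is restarted on whatever nodes of the old cluster were not reached. This is a simplified reading of the procedure described above, not the SUGGEST code: the nested-dictionary graph layout and the names `component_of`, `refresh_clusters`, `minfreq`, and `min_size` are assumptions for the example.

```python
from collections import deque


def component_of(start, graph, minfreq):
    """BFS from `start`, following only edges whose weight is >= minfreq.

    `graph` maps each page to a dict {neighbor: weight} (assumed layout).
    """
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for neighbor, weight in graph.get(page, {}).items():
            if weight >= minfreq and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def refresh_clusters(old_cluster, start, graph, minfreq, min_size=2):
    """Re-partition the pages of one cluster after its edges have changed.

    Only nodes reachable from the old cluster are visited, never the
    whole graph. Components smaller than `min_size` are discarded.
    """
    pending = set(old_cluster) | {start}
    clusters = []
    root = start
    while pending:
        component = component_of(root, graph, minfreq)
        if len(component) >= min_size:
            clusters.append(component)
        pending -= component
        if pending:
            # Unreached nodes mean the old cluster was split: visit again.
            root = min(pending)
    return clusters
```

For example, if the edge between `a` and `c` drops below the threshold, a former cluster `{a, b, c, d}` with strong `a`-`b` and `c`-`d` edges is split into `{a, b}` and `{c, d}` by two successive visits.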
The data structure used to store the weights is an adjacency matrix where each entry contains the weight related to a pair of accessed pages.
In order to manage Web sites with an unknown number of pages, such as Web sites that make intensive use of dynamic pages, SUGGEST applies an innovative solution: it indexes a page only when required. To keep the adjacency matrix manageable in size, an LRU algorithm is applied. The Web master of a site may adjust the matrix size according to predetermined constraints such as available resources and performance level. Smaller matrix sizes, however, may lead to poor system performance due to frequent page replacements.
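One way to picture the on-demand indexing with LRU replacement is the following Python sketch. It is a minimal illustration under stated assumptions, not the actual SUGGEST data structure: the class name `LRUPageIndex`, the use of `collections.OrderedDict`, and the choice to zero the weights of an evicted page are all hypothetical.

```python
from collections import OrderedDict


class LRUPageIndex:
    """Maps page identifiers to rows of a fixed-size adjacency matrix."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._rows = OrderedDict()  # page -> row index, least recent first
        self.weights = [[0.0] * capacity for _ in range(capacity)]

    def __contains__(self, page):
        return page in self._rows

    def row_for(self, page):
        """Return the matrix row of `page`, indexing it on first access."""
        if page in self._rows:
            self._rows.move_to_end(page)  # mark as most recently used
            return self._rows[page]
        if len(self._rows) < self.capacity:
            row = len(self._rows)  # matrix not full yet: take the next free row
        else:
            _, row = self._rows.popitem(last=False)  # evict the LRU page
            self._clear(row)
        self._rows[page] = row
        return row

    def _clear(self, row):
        """Forget all weights that involved the evicted page."""
        for k in range(self.capacity):
            self.weights[row][k] = 0.0
            self.weights[k][row] = 0.0
```

With a capacity of 2, requesting `/a`, `/b`, `/a`, and then `/c` evicts `/b`, because `/a` was touched more recently. A small capacity makes such replacements frequent, which matches the performance caveat above.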
After the model has been updated, SUGGEST prepares the list of suggestions on the basis of a classification of the user session. This is done in a straightforward way by finding the cluster having the largest intersection with the pages belonging to the current session. The final suggestions are composed of the most relevant pages in the cluster, according to the ranking determined by the clustering phase. The suggestions are then inserted as a list of links in the requested page. Visited pages are not included in the suggestions; therefore, users belonging to the same class could have different sets of suggestions, depending on which pages have been visited in their active session.
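The suggestion step described above can be sketched in a few lines of Python. The sketch assumes that each cluster is available as a list of pages already ranked best-first; the function name `suggest` and the `max_links` cut-off are hypothetical.

```python
def suggest(session_pages, clusters, max_links=3):
    """Pick the cluster overlapping most with the session, skip visited pages.

    `clusters` is a list of page lists, each assumed to be ranked
    best-first by the clustering phase.
    """
    visited = set(session_pages)
    best = max(clusters, key=lambda c: len(visited & set(c)), default=None)
    if best is None or not visited & set(best):
        return []  # the session matches no known cluster
    return [page for page in best if page not in visited][:max_links]


clusters = [
    ["/algo", "/ds", "/cs", "/math"],
    ["/sports", "/news"],
]
# Two users classified into the same cluster get different links,
# because each one has already visited different pages.
first_user = suggest(["/cs", "/ds"], clusters, max_links=2)
second_user = suggest(["/algo"], clusters, max_links=2)
```

Here `first_user` receives `/algo` and `/math`, while `second_user` receives `/ds` and `/cs`.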
SUGGEST is implemented as a single Apache Web server module in order to allow easy deployment on potentially any kind of Web site currently available, without changing the site itself. Experimental results demonstrate that SUGGEST is able to provide significant suggestions as well as good system performance.
In order to validate our approach, we performed several experiments using three Web server access logs available at www.web-caching.com: NASA (27 days, 19K sessions), USASK (180 days, 10K sessions), and BERKLEY (22 days, 22K sessions).
The metric used to measure the quality of SUGGEST estimates the effectiveness of a recommendation system as its capacity to anticipate user requests that will be made farther in the future (see [2] for more details). Figure 3a shows the experimental results. The threshold used to prune the graph edges (minfreq) is represented on the x-axis, whereas the quality of the suggestions is given on the y-axis. It can be seen that, in the case of the USASK log, more than 50% of the suggested pages turned out to meet the users' needs. We also plotted the results of a recommender system giving random links as output. As expected, the results in this case are quite poor.
In terms of efficiency, we processed 100,000 requests, varying the number of requests performed simultaneously from 10 to 110. As shown in Figure 3b, the overhead introduced by SUGGEST is less than 8%. Moreover, if we consider that SUGGEST is able to anticipate the users' requests, this module will increase the efficiency of the whole Web server system, since users will spend less time navigating the Web server pages, thus freeing capacity for a larger number of users.
In this article we have introduced SUGGEST, a completely online Web recommender system that does not require user intervention on the model-building module. We also empirically demonstrated that SUGGEST effectively and efficiently provides recommendations to users.
1. Ardissono, L., Goy, A., Petrone, G., and Segnan, M. Personalization in business-to-customer interaction. Commun. ACM 45, 5 (May 2002), 52–53.
2. Baraglia, R. and Silvestri, F. An online recommender system for large Web sites. In Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence (Sept. 20–24, 2004).
3. Eirinaki, M. and Vazirgiannis, M. Web mining for Web personalization. ACM Trans. on Internet Technology 3, 1 (Feb. 2003), 1–27.
4. Eirinaki, M., Vazirgiannis, M., and Varlamis, I. Sewep: Using site semantics and a taxonomy to enhance the Web personalization process. In Proceedings of Knowledge Discovery in Data 2003 (Aug. 2003).
5. Megiddo, N. and Srikant, R. Discovering predictive association rules. In Proceedings of Knowledge Discovery and Data Mining (1998), 274–278.
6. Nath, S.V. Customer churn analysis in the wireless industry: A data mining approach. See the section on Transactional Naive Bayes.
7. Oberle, D., Berendt, B., Hotho, A., and Gonzalez, J. Conceptual user tracking. In Proceedings of Web Intelligence, First International Atlantic Web Intelligence Conference. E. Menasalvas Ruiz, J. Segovia, and P.S. Szczepaniak, Eds. (Madrid, Spain, May 5-6, 2003). Springer, 142–154.
8. Oracle Corporation. Oracle application server 10g business intelligence overview. See the section on the Personalization Tool.
9. Perkowitz, M. and Etzioni, O. Adaptive Web sites. Commun. ACM 43, 8 (Aug. 2000).
Figure 1. Architecture of a typical Web Recommender System.
Figure 2. Architecture of the SUGGEST Online Recommender System.
Figure 3. (a) Effectiveness of SUGGEST measured as the probability of recommending a page of potential future interest. (b) Comparison of the number of requests satisfied per second in the case of using SUGGEST vs. Apache without SUGGEST installed.
©2007 ACM 0001-0782/07/0200 $5.00