ERCIM News No. 60, January 2005

SUGGEST: An Online Recommender System for Large Web Sites

by Ranieri Baraglia and Fabrizio Silvestri

The continual increase in Web usage has led to the need for automatic Web-mining tools able to accurately extract, filter and select information of interest from the huge quantities of data available. In particular, Web Usage Mining (WUM) tools typically extract knowledge by analyzing the data logged during user navigation, and can be used to develop personalization or recommender systems whose main goal is to improve web site usability.

SUGGEST is a recommender system designed to dynamically generate personalized content of potential interest for users of large web sites. The system has been developed at ISTI-CNR, Pisa, in the context of a project of the Italian Ministry of Education and Research called 'Services for Enhanced Contents Delivery'. It is implemented as a module of the Apache Web server, and its usage does not require any modification to the site being examined. Personalization is achieved by means of a set of suggestions (page links) dynamically generated on the basis of the active user session, which are used to personalize the requested HTML page on the fly.

Typically, the WUM personalization process is structured as two components, performed off-line and online with respect to the web server activity. By analyzing the historical data (i.e. server access log files), the off-line component builds a knowledge base which is used in the online phase to generate the personalized content. This content can be expressed in several forms, such as links to pages or advertisements considered of interest for the current user.

The main limitation of this two-tier approach is the loosely coupled integration of the WUM system with the web server activity. This entails running the off-line component periodically to update the knowledge base; how often this update should be performed depends on the individual site. Merging the two components also raises other problems in terms of system efficiency. The integration must have little impact on user response times, and the knowledge mined by a single component must be comparable to or better than that obtained using two separate components.

The solution introduced by SUGGEST eliminates these drawbacks and satisfies the criteria mentioned above. By exploiting a single component working completely online with respect to the web server functionalities, the system can update the knowledge base incrementally and automatically, and can generate a list of suggestions.

SUGGEST is structured in the following three steps:

User Session Identification
User sessions are identified by means of cookies stored on the client side, which contain the keys identifying the client sessions. On each page request, SUGGEST identifies the URL requested and the URL from which the request originates. The knowledge base is updated according to the characteristics of the current session, and suggestions are then generated. To extract information about navigational patterns, SUGGEST models the web pages of a site as an undirected graph whose nodes are associated with the identifiers of the accessed pages, and whose edges are associated with weights representing the degree of correlation between pages. Presuming that interest in a page depends on its content and not on the order in which pages are visited during a session, the edge weight is computed as Wij = Nij/max{Ni, Nj}, where Nij is the number of sessions containing both pages i and j, and Ni and Nj are the number of sessions containing only page i or j, respectively. Dividing by the larger of the two single-page counts has the effect of discriminating internal pages from the so-called index pages (e.g. home pages), which are of little interest as potential suggestions.
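
The weighting scheme can be illustrated with a small Python sketch. It is not the SUGGEST implementation, which runs inside Apache and updates its counts incrementally; here the weights are computed in one batch from a list of sessions. The function name edge_weights and the sample sessions are hypothetical, and the sketch assumes that Ni counts the sessions in which page i appears.

```python
from collections import defaultdict
from itertools import combinations


def edge_weights(sessions):
    """Return {(i, j): W_ij} for every pair of pages that co-occur in a session."""
    page_count = defaultdict(int)   # N_i: sessions in which page i appears (assumed reading)
    pair_count = defaultdict(int)   # N_ij: sessions that contain both i and j

    for session in sessions:
        pages = sorted(set(session))          # visit order and repeats are ignored
        for page in pages:
            page_count[page] += 1
        for i, j in combinations(pages, 2):   # undirected graph: one key per pair, i < j
            pair_count[(i, j)] += 1

    return {
        (i, j): n_ij / max(page_count[i], page_count[j])
        for (i, j), n_ij in pair_count.items()
    }


# Hypothetical sessions: "home" is an index page present in most of them.
sessions = [
    ["home", "a", "b"],
    ["home", "a"],
    ["home", "c"],
    ["a", "b"],
]
weights = edge_weights(sessions)
```

In this toy data, page c is always seen together with home, yet the edge between them only gets weight 1/3, because home appears in three sessions. This is the damping of index pages that the division by the maximum is meant to achieve.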

In order to manage web sites with an a priori unknown number of pages, e.g. sites that use dynamic pages intensively, SUGGEST indexes pages only when they are requested. This solution can lead to a large increase in the size of the adjacency matrix M used to store the weights related to each pair of pages. To prevent M from growing to an unmanageable size, a 'Least-Recently-Used' algorithm is applied: information about the least recently accessed page is replaced with that for the currently accessed page. The smaller the matrix, the poorer the system performance, due to frequent page replacements. Parameters such as web site size, available resources and required performance level can help to define the size of M.
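
A minimal sketch of such a replacement policy is given below, assuming that each indexed page owns one row and column of a fixed-size matrix M. The class PageIndex and its method touch are hypothetical names, not part of SUGGEST. The caller is expected to clear the row and column of M whenever a page is evicted.

```python
from collections import OrderedDict


class PageIndex:
    """Maps page URLs to rows of a fixed-size matrix, replacing the least recently used page."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()   # url -> row/column index, least recently used first

    def touch(self, url):
        """Return (index, evicted_url); evicted_url is None if no page was replaced."""
        if url in self.slots:
            self.slots.move_to_end(url)          # mark as most recently used
            return self.slots[url], None
        evicted = None
        if len(self.slots) < self.capacity:
            index = len(self.slots)              # next unused row/column
        else:
            evicted, index = self.slots.popitem(last=False)  # reuse the oldest slot
        self.slots[url] = index
        return index, evicted


# With room for two pages, requesting a third evicts the least recently used one.
page_index = PageIndex(capacity=2)
results = [page_index.touch(url) for url in ["/a", "/b", "/a", "/c"]]
```

Here the request for /c replaces /b rather than /a, because /a was accessed again in the meantime.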

Page Clustering
To find groups of strongly correlated pages, the graph is partitioned according to its connected components. Starting from the current page identifier u, a Depth First Search is applied on the graph induced by M, and the connected component reachable from u is built. To reduce the contributions of poorly represented links, the computation of the connected components is driven by the predefined threshold values Minfreq and MinClusterSize. Edges with a weight below Minfreq identify poorly correlated elements, which are not considered by the connected components algorithm. Components smaller than MinClusterSize are considered insufficiently significant and are discarded. Pages in the same cluster are ranked according to their co-occurrence frequency.
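
The clustering step can be sketched as follows, using an iterative depth-first traversal over a dictionary of edge weights instead of the matrix M. The function name cluster_from and the sample weights are hypothetical. The ranking of pages inside a cluster is left out, since the exact ranking formula is not given here.

```python
def cluster_from(u, weights, min_freq, min_cluster_size):
    """Return the pages reachable from u over edges with weight >= min_freq.

    An empty list is returned when the component has fewer than
    min_cluster_size pages.
    """
    # Build adjacency lists, ignoring poorly correlated edges.
    neighbours = {}
    for (i, j), w in weights.items():
        if w >= min_freq:
            neighbours.setdefault(i, []).append(j)
            neighbours.setdefault(j, []).append(i)

    # Depth-first traversal with an explicit stack.
    visited = {u}
    stack = [u]
    while stack:
        page = stack.pop()
        for nxt in neighbours.get(page, []):
            if nxt not in visited:
                visited.add(nxt)
                stack.append(nxt)

    if len(visited) < min_cluster_size:
        return []                     # component too small to be significant
    return sorted(visited)            # alphabetical order, not a relevance ranking


# Hypothetical edge weights for a six-page site.
example_weights = {
    ("a", "b"): 0.67,
    ("a", "home"): 0.67,
    ("b", "home"): 0.33,
    ("c", "home"): 0.33,
    ("d", "e"): 0.90,
}
cluster_a = cluster_from("a", example_weights, min_freq=0.5, min_cluster_size=2)
cluster_c = cluster_from("c", example_weights, min_freq=0.5, min_cluster_size=2)
```

With a threshold of 0.5, page c loses its only edge and ends up in a component of size one, which is discarded.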

Suggestion Building
In order to build suggestions, the current user session must be classified. This is done in a straightforward manner by finding the cluster that includes the largest number of pages in that session. Suggestions are composed of the most relevant pages in the cluster, according to the order determined by the clustering phase. An example of how suggestions are presented to the user is given in the figure.
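
The classification step might look like the sketch below, which assumes that each cluster is already available as a list of pages in rank order. The function name suggest and the max_suggestions parameter are hypothetical, and leaving out pages the user has already visited is an assumption of this sketch, not something stated above.

```python
def suggest(session, clusters, max_suggestions=3):
    """Pick the cluster sharing the most pages with the session and return its top pages."""
    seen = set(session)
    # The cluster with the largest overlap classifies the session.
    best = max(clusters, key=lambda c: len(seen.intersection(c)), default=[])
    if not seen.intersection(best):
        return []                     # the session matches no cluster
    # Assumption: pages already visited in this session are not suggested again.
    return [page for page in best if page not in seen][:max_suggestions]


# Hypothetical clusters, each listed in rank order.
clusters = [["/a", "/b", "/c", "/d"], ["/x", "/y"]]
suggestions = suggest(["/b", "/x", "/a"], clusters)
```

The session shares two pages with the first cluster and one with the second, so the suggestions are drawn from the first.
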
More details about SUGGEST can be found in the paper: R. Baraglia, F. Silvestri 'An Online Recommender System for Large Web Sites', IEEE/WIC/ACM International Conference on Web Intelligence, Beijing, China, September 20-24, 2004 (best paper award).

Example of suggestions generated by SUGGEST.


Please Contact:
Ranieri Baraglia, ISTI-CNR, Italy
Tel: +39 050 315 2994

Fabrizio Silvestri, ISTI-CNR, Italy
Tel: +39 050 315 3011