Although association rule mining is an important pattern recognition and data analysis technique, extracting and finding significant rules from a large collection has always been challenging. The ability of information visualization to enable users to gain an understanding of high dimensional and large-scale data can play a major role in the exploration, identification, and interpretation of association rules. In this paper, we propose a method that provides multiple views of the association rules, linked together through a filtering mechanism. A visual inspection of the entire association rule set is enabled within a matrix view. Items of interest can be selected, resulting in their corresponding association rules being shown in a graph view. At any time, individual rules can be selected in either view, resulting in their information being shown in the detail view. The fundamental premise in this work is that by providing such a visual and interactive representation of the association rules, users will be able to find important rules quickly and easily, even as the number of rules that must be inspected becomes large. A user evaluation was conducted which validates this premise.
Association rule mining, as one of the important knowledge discovery and pattern recognition methods, looks for interesting relations among items in a database in the form of if-then rules [
Many techniques for exploring association rules employ visualization in order to provide a graphical representation of the data. However, when applying visualization methods to illustrate association rules, one quickly realizes that they are not easy to represent graphically. The reason for this problem is the multiple relational nature of association rules, which is difficult to show in a clear manner especially when there are a large number of rules or when the rules relate many items to one another. In addition, since important aspects of the relations are the interestingness measures (e.g., support and confidence), representing this information along with the relations further complicates any visual representations of the data.
Although several methods have been proposed for visualizing association rules, most of them show the entire set of rules in a single view. As a result, they often display an overwhelmingly large amount of data, making it hard for knowledge managers to evaluate and interpret the rules. This difficulty stems from screen clutter and occlusion problems that occur when presenting a large number of rules and relations. In this paper, we attempt to overcome this problem by presenting a novel Scalable Association Rule Visualization (SARV) technique which helps users find interesting association rules and understand the relations between them, even when the set of association rules is large. The main contribution of SARV is that it avoids screen clutter and occlusion problems by separating overview and detail views of the association rules. Further, it supports users in following Shneiderman’s advice for interacting with data through “overview first, zoom and filter, then details-on-demand” [2, p. 337].
By reducing the complexity of visualizing a large number of rules on a single screen, SARV helps users to understand and interpret the association rules easily, even in a large dataset. In addition, unlike previous works which employ clustering techniques and show representations of clusters instead of the specific rules [3-5], SARV employs highlighting and focusing techniques which allow users to explore the rules easily and interactively. From a cognitive point of view, SARV enables users to explore large collections of rules, easily identify potentially interesting rules, and subsequently focus on the details of such rules without losing the “big picture” perspective on the collection as a whole.
The remainder of this paper is organized as follows: In Section 2, an overview of association rules and techniques for their visualization are presented. In Section 3, the design rationale and features of SARV are described in detail. Section 4 outlines the evaluation framework; the results of the user study are explained in Section 5. The paper concludes with a summary of the primary contributions of this work and an outline of future work in Section 6.
In this section, after providing some principles about association rules, we review some previous works in visualizing association rules. Many systems have been developed in recent years for visualizing association rules. In order to provide a structured overview of these works, we categorize them based on their scalability and their ability to handle a large collection of rules.
In data mining and knowledge discovery, association rules are one of the popular techniques for representing knowledge in the form of relations between variables. One of the important applications of this technique is in Market Basket Analysis, which is used as a basis for decision making in marketing activities, promotional pricing, cross selling, and advertisement [
There are many different interestingness measures in data mining, which are used for selecting and ranking extracted patterns based on their potential value for decision makers [
Hilderman and Hamilton [
While such measures of interestingness may be of value to expert knowledge managers who can accurately interpret their meaning, novice or infrequent users may have some difficulty in understanding the implications of such measures. As such, the current implementation of SARV employs only the traditional measures of support and confidence, showing these in a visual manner in order to help decision makers interpret the quality of the rules.
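To make the two measures concrete, the following minimal Python sketch computes them using their standard definitions: the support of an itemset is the fraction of transactions that contain it, and the confidence of a rule LHS → RHS is the support of the combined itemset divided by the support of the LHS. The function names and the small market-basket data set are illustrative assumptions, not part of SARV.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(lhs: Iterable[str], rhs: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """support(LHS and RHS together) / support(LHS); 0.0 if the LHS never occurs."""
    lhs_set, rhs_set = frozenset(lhs), frozenset(rhs)
    lhs_support = support(lhs_set, transactions)
    if lhs_support == 0:
        return 0.0
    return support(lhs_set | rhs_set, transactions) / lhs_support


# Hypothetical toy transactions, for illustration only.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

rule_support = support({"bread", "milk"}, transactions)          # 2 of 4 transactions
rule_confidence = confidence({"bread"}, {"milk"}, transactions)  # 2 of the 3 with bread
```

In this toy data, the rule bread → milk holds in two of the four transactions, and in two of the three transactions that contain bread.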
Although some researchers have considered dynamic association rules [
Quantitative Association Rule (QAR) mining is an influential research problem because of the popularity of quantitative databases [
As mentioned by Berti-Équille [
Some works have considered weighted association rules (WAR), which associate a weight parameter with each item in a resulting association rule [
The simplest way to represent a small number of association rules is through textual descriptions, which can be examined with all the low-level details such as the items contained in the LHS and RHS, and the interestingness measures such as support and confidence [
Studies on human perception and information theory [14-16] have shown that graphical representations facilitate the search for patterns by harnessing the capabilities of the human visual system to elicit information. Such visual representations allow the user to see the important elements within the data without having to read the data in detail.
The goal of information visualization is to create graphical representations of abstract data or concepts [16,17]. In doing so, such visual representations promote cognitive activities in which the viewers are able to gain understanding or insight into the data being displayed [
At the most fundamental level, information visualization techniques are used when one draws pictures to visually represent data sets. However, when such data sets are large, high dimensional, or contain complex relationships, generating useful visual representations can be a challenging problem. While there are a number of visual features that are available for representing the various dimensions or attributes of the data (e.g., spatial location, colour, shape, size, etc.), care must be taken to select and use visual features that can be easily decoded and understood by the viewer. The goal is to display the data in a coherent manner, allowing the viewer to compare and explore the data visually [
The visualization of association rules can provide immediate insight into the primary characteristics of a set of rules (e.g., the items, the relations between them, and the support and confidence measures), which facilitates their evaluation. Several techniques have been proposed for visualizing association rules, which can be categorized into six different groups: table-based views, parallel coordinates, matrix views, graph views, mosaic plots, and 3D techniques.
In table-based views, the columns of a rule table include rule IDs, items in the LHS and RHS, and support and confidence measures. In this presentation technique, each row represents a specific association rule. Such table-based views were used extensively in early association rule visualization work [
Parallel coordinates is a method for representing high-dimensional data in two dimensions [
An alternate method for visually representing the high-dimensional nature of association rules is the use of matrix views. Within these, rows represent the LHS items and columns represent the RHS items. Support and confidence of rules are shown using different colours and shapes at the intersections between the LHS and RHS [24,27,28]. While the technique provides an effective overview of the rules, it has four important drawbacks: 1) shape and colour can be difficult to decode when they represent quantitative measures [
Graph views are another technique that is widely used to visualize association rules [1,5,22,24]. Although this form of visualization represents association rules in a more concise manner than that of matrix views, as the number of items and associations increases, graph-based visualizations can become cluttered and difficult to interpret. In this technique, nodes in the graph represent the items, and edges represent the relations between the LHS and RHS items. The area of a node often encodes the support of the rule, and colour can be used to encode the confidence measures. The most important disadvantage of this technique is that it is not easy to find a specific item in this graph, because of the somewhat arbitrary shape of the graph and location of the nodes, especially when there are a large number of rules and items. Some have attempted to address this issue through the use of radial graph layouts [
Mosaic plots provide a very compact representation of association rules [23,24,29]. In this technique, the contingency tables that are responsible for the rules are represented graphically, where individual LHS items are shown as horizontal bars along the x-axis and the support of an association is represented by the height of the vertical column above the specified item. Although this technique shows the generalization and specification relations between rules, the presentation of rules with mosaic plots is very complicated to visually decode, making it difficult to recognize the interestingness measures. While the technique may be suitable for focused discovery where the set of attributes under consideration is small, as the number of items increases, it is not easy to interpret the items and relations.
Although some researchers have proposed using 3D visualization as a means for providing more space for the representation of association rules [3,4,30,31], these techniques usually suffer from occlusion problems, especially when presenting a large number of rules. For example, Blanchard et al. [
In spite of the advantages of previous works in visualizing association rules, the most common problem they encounter is their inability to handle a large collection of rules. In general, this results in occlusion and screen clutter problems due to the need to compress the visual representation into a single view. In other words, by presenting a large number of rules over many items in a single view, it is not easy for users to recognize the relations between the items and their interestingness measures. An alternate approach is to display different characteristics of the rules simultaneously in different views. However, the fundamental trade-off is that it is not possible for users to perceive and compare all of these characteristics at once, requiring them to switch between different views in order to see different features of a specific rule.
In order to overcome this difficulty, we designed SARV to follow Shneiderman’s Visual Information Seeking Mantra: “overview first, zoom and filter, then details-on-demand” [2, p. 337]. The goal is to provide an effective visual representation of association rules that scales well with a large number of association rules. SARV employs three synchronized views of the association rule set: a matrix view that provides overview and filtering operations; a graph view that displays the details of a selection of potentially interesting items and their corresponding association rules; and a detail view that allows users to inspect the features of specific rules. In the section that follows, SARV and its components are described in detail.
The primary goal in the design of SARV was to provide support for the visual exploration of association rules that would scale well with the number of rules that were shown, and avoid the clutter and occlusion problems that were present in other systems. This was achieved by using three coordinated views of the association rules, as illustrated in
The three main parts of SARV are the matrix view, graph view, and detail view. The matrix view provides an overview of all of the association rules, allowing the user to filter and select rules that are potentially useful. The graph view shows the subset of rules selected from the matrix view, clearly illustrating the relationships between the LHS items and the RHS items. At any time, the features of specific rules can be accessed within the detail view by highlighting the rules within the matrix or graph views.
The matrix view, graph view, and detail view of the association rules are visualized separately since during association rule exploration, users often seek rules based on their interestingness measures first. Once the potentially large collection of rules is filtered to show only those rules that may be of value, users can then view the details of the subset of rules and the relations between the items. As such, the graph view allows SARV to present a smaller collection of selected and filtered rules, avoiding the screen clutter and occlusion issues that would occur by showing the entire set that is present in the matrix view. Whenever details are required for specific association rules, these can be obtained by highlighting the rule and viewing the data in the detail view.
The matrix view in SARV is a 2D matrix representation that provides a “big picture” overview of the rules and allows users to identify interesting aspects within the data based on the support measure. Due to the two-dimensional relationship between LHS and RHS items in association rules, a grid-based representation is a convenient method for providing an overview of the association rules in a single coherent manner.
The matrix view is an n × n grid in which the LHS items label the rows of the grid and RHS items label the columns (where n is the number of items in the database). Each cell in this grid, corresponding to row r and column c, is the representation of the rules that have item r in their LHS, and item c in their RHS. The rules that are mapped to each cell may have other items besides r in their LHS and other items besides c in their RHS. However, they necessarily have item r in their LHS, and item c in their RHS.
There have been numerous studies in the field of information visualization showing that good colour coding is an effective graphical device to reduce visual search time [
Within the matrix view, colour darkness is used in order to distinguish between selected and unselected items within the LHS and RHS sets. Light green is used as a background colour of selected LHS items and dark green for unselected LHS items. Similarly, light red is used as a background colour of selected RHS items and dark red for unselected RHS items.
The matrix view allows users to visually identify rules based on the support measure. It employs colours between black and white for values between 1 and 0. The colour of each cell, corresponding to row r and column c, represents the maximum support among all of the rules that have item r in the LHS and item c in the RHS. Although this colour coding does not make it possible to decode the specific quantitative support values, it does allow the user to perceive relative differences. Since determining the exact support values is not important in the overview window, the matrix view allows users to identify rules with higher support values (darker cells) and ignore rules with lower support values (lighter cells).
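The cell encoding described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the SARV implementation: the `Rule` record, the function names, the linear mapping to an 8-bit grey level, and the choice to draw cells with no rules as white are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule record: LHS -> RHS with its two measures."""
    lhs: FrozenSet[str]
    rhs: FrozenSet[str]
    support: float
    confidence: float


def max_support_matrix(rules: List[Rule], items: List[str]) -> List[List[float]]:
    """Cell (r, c) holds the maximum support over rules with r in LHS and c in RHS."""
    index = {item: i for i, item in enumerate(items)}
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]  # assumption: empty cells count as 0
    for rule in rules:
        for r in rule.lhs:
            for c in rule.rhs:
                i, j = index[r], index[c]
                matrix[i][j] = max(matrix[i][j], rule.support)
    return matrix


def grey_level(support: float) -> int:
    """Assumed linear map: support 0 -> 255 (white), support 1 -> 0 (black)."""
    clamped = min(1.0, max(0.0, support))
    return round(255 * (1.0 - clamped))


items = ["a", "b", "c"]
rules = [
    Rule(frozenset({"a"}), frozenset({"b"}), support=0.4, confidence=0.8),
    Rule(frozenset({"a", "c"}), frozenset({"b"}), support=0.6, confidence=0.9),
]
matrix = max_support_matrix(rules, items)
```

Because a rule with several LHS or RHS items contributes to several cells, the second rule here darkens both the (a, b) and (c, b) cells, and the (a, b) cell shows the larger of the two supports.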
The primary drawback of this approach is that the number of items that can be shown simultaneously is limited by the screen space available for showing the matrix view. Although the screenshots and test data sets used in this paper do not have an exceedingly large number of items, the approach can scale by using a high-resolution display, supporting vertical and horizontal scrolling, or supporting zooming.
Interaction is an important element in any visualization system. Allowing users to interact with the system and to browse the association rules helps them to perceive different aspects of the rules and the relations between them. In order to avoid clutter in the graph view (which will be described next) and present interesting relations between LHS and RHS items, the matrix view allows users to filter items to focus on those that are of interest. As shown in
When users select a specific cell in the matrix view, the details of the rules that are related to that cell (i.e., have item r in their LHS and item c in their RHS) are shown in the detail view. For example, as illustrated in
Once users select items from the LHS and RHS in the matrix view that they think may be important, they may wish to further understand the relationships among these rules. The goal of the graph view is to enable users to observe and inspect such relations, find the rules that are related to a specific item, and see the details of selected rules (items, relations, support and confidence).
As discussed previously, graphs can be used to represent association rules in a clear and concise manner. However, there is a limit on the number of relations and nodes that can reasonably be represented in a graph format; exceeding this limit results in visual clutter. In order to overcome this problem, a subset of the items may be selected in the matrix view. These items and the association rules that connect them are shown in the graph view.
In order to enable accurate decoding, a structured graph is employed to represent the subset of the association rules. Each green square on the left side of graph view represents an item that was selected from the LHS items of the matrix view. Similarly, each red square on the right side of the graph view represents an item that was selected from the RHS items of the matrix view. The blue circles in the middle represent the rules; the relations between the LHS and RHS items of the rules are represented with line segments that connect these items.
In this representation, the colour of the lines encodes the support of the rule (using the same grey-scale encoding that was used in the matrix view). In addition, the thickness of the lines encodes the confidence of the rules, where thicker lines represent higher confidence.
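The selection and encoding steps for the graph view can be sketched as follows. This is a hedged illustration, not the authors' code: the `Rule` and `EdgeStyle` types, the function names, the pixel range for line width, and the linear mappings are assumptions. In particular, the text does not state exactly which rules count as connecting the selected items; the sketch assumes a rule is shown when it contains at least one selected LHS item and at least one selected RHS item.

```python
from typing import FrozenSet, List, NamedTuple, Set


class Rule(NamedTuple):
    """Hypothetical rule record: LHS -> RHS with its two measures."""
    lhs: FrozenSet[str]
    rhs: FrozenSet[str]
    support: float
    confidence: float


class EdgeStyle(NamedTuple):
    grey: int     # 0 = black (high support), 255 = white (low support)
    width: float  # line thickness in pixels (assumed unit)


def rules_for_selection(rules: List[Rule], selected_lhs: Set[str],
                        selected_rhs: Set[str]) -> List[Rule]:
    """Assumed filter: keep rules touching a selected LHS item and a selected RHS item."""
    return [r for r in rules if r.lhs & selected_lhs and r.rhs & selected_rhs]


def edge_style(rule: Rule, min_width: float = 1.0, max_width: float = 6.0) -> EdgeStyle:
    """Darker line for higher support, thicker line for higher confidence."""
    grey = round(255 * (1.0 - rule.support))
    width = min_width + rule.confidence * (max_width - min_width)
    return EdgeStyle(grey, width)


rules = [
    Rule(frozenset({"a"}), frozenset({"b"}), support=1.0, confidence=1.0),
    Rule(frozenset({"c"}), frozenset({"d"}), support=0.2, confidence=0.0),
]
shown = rules_for_selection(rules, selected_lhs={"a"}, selected_rhs={"b"})
styles = [edge_style(r) for r in shown]
```

Using the same grey-scale function for the lines as for the matrix cells keeps the two views visually consistent, which is the point of the shared encoding.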
As with the matrix view, there is a constraint on the number of LHS items, RHS items, and rules that can be reasonably displayed within the graph view. If too many are shown, the display will become congested and difficult to interpret. However, the intent of this view is not to show all possible rules for all items, but instead to show more details for a smaller subset of items of interest (selected via the matrix view), along with their associated rules. This view will scale to support more data by increasing the screen resolution and implementing scrolling or zooming operations.
One of the important features of SARV is that it allows users to highlight rules related to specific items in the LHS or RHS. Once the user clicks an item in the LHS of the graph view, all rules that have this item in their LHS are highlighted in yellow. In the same way, when the user clicks an item in the RHS, all rules that have this item in their RHS are highlighted. Such highlighting mechanisms support the disambiguation of rules when there is a large degree of edge crossing within the graph view. In addition, the system shows the details of the highlighted rules in the detail view. Displaying the details of the rules helps users to further identify and understand specific information regarding these rules.
This highlighting feature not only helps users to find the relations between the rules, but also helps them to find the relations between items and rules. Furthermore, there is an additional highlighting feature that shows all related LHS and RHS items for a specific rule when the user clicks on the blue circles that represent the rules in the graph view. These two types of highlighting are shown in
This view of the data shows the details of the rules that have been selected within the matrix view or graph view in a textual format (as shown at the bottom of
The detail view is visually encoded as a yellow region of the display. This colour is intentionally coordinated with the cell selection colour in the matrix view, and with the item and rule selection colour in the graph view. Whenever the user makes such a selection in these primary views, their selected element is highlighted in yellow, and the corresponding information is presented in the yellow detail view. This allows users to easily relate the information presented in the detail view back to its source location within the matrix and graph views. Since this detail view is primarily for information purposes, no other interaction is supported.
The three distinct views of SARV (matrix view, graph view, and detail view) provide facilities for finding important association rules and the relations between them without losing the “big picture” perspective of the entire set. Employing the combination of these views allows users to conduct different types of rule exploration. The two main activities that are usually undertaken in rule discovery processes are finding important association rules across the entire set of items, and finding important association rules related to a specified set of items. Here, we define important rules as those with strong support and confidence.
In order to find the important association rules using SARV, a user would first select the LHS and RHS items that correspond to the darker cells in the matrix view. These darker cells identify the rules that have higher support measures. The user could then inspect the specific association rules for a given cell by clicking on the cell, resulting in the display of the corresponding rules in the detail view. Selecting specific LHS and RHS items of interest in the matrix view would result in the loading of the rules that relate to these items into the graph view. Focusing on this selected subset of items and rules, the user could then find the important association rules by choosing the darker and thicker lines between the items in the graph view (representing the support and confidence for the rules, respectively). By selecting the central (blue) nodes in the graph view, the users could then view the specific details regarding the rules of interest.
If the user is instead interested in finding important association rules that are related to a specified set of items, they may seek the dark cells in the matrix view which correspond to the specific set of LHS and RHS items. For example, in order to find the important rules related to items i, j, and k, both the rows and columns corresponding to these items could be selected by the user. Doing so would show any relations between these items in the graph view. If the user is seeking other items that may also be related to this item set, they can visually scan the matrix view, seeking dark cells that are in the same rows or columns of the items of interest, inspecting the rules within these cells. They may then add any additional items they find in order to include their corresponding rules within the graph view. From there, the subset of rules can then be examined in detail, focusing on those that have dark and thick lines connecting the items to the rules.
In both of these scenarios, the user is empowered to discover interesting features within the potentially large set of association rules. The visual representation within the matrix view supports visual scanning activities and the identification of patterns within the rules. The graph view supports the interpretation of the relationships between the LHS and RHS items via the selected subset of rules. At any time, the user may inspect the details of the rules with a simple click operation in the matrix and graph views, showing the corresponding rules in the detail view.
SARV was designed to allow a large number of association rules to be presented simultaneously, allowing users to select interesting features and explore aspects of the association rules that are potentially useful. The goal was to create a system that would scale well, continuing to support the task of finding interesting and useful rules even as the total number of rules represented within the system grew. Although the system can also scale well as the number of items increases (by increasing the screen size of the matrix and graph views), our primary focus here is on the scalability with respect to the number of rules.
As an information visualization system, SARV allows users to visually identify and focus on the important association rules quickly and easily, even when there are a relatively large number of rules. While we may rationalize the various design choices made in the development of the system, since SARV is primarily a user interface to the underlying data, the true value of the approach can only be validated with user evaluations. The goal is to design a study that empirically measures both quantitative and qualitative data related to the use of the system, and ultimately to be able to make well-supported statements regarding its value [
A user evaluation in a controlled laboratory setting was designed to study SARV. Two independent variables were defined and manipulated: the size of the data set (number of association rules), and the type of task to be performed. Five dependent variables were measured: time to task completion, error rates, confidence in the outcome of the task, ease of completing the task, and subjective measures regarding the system in general.
Three different data sets of varying sizes (50, 250, and 500 association rules) were made available to participants in the study. The goal in manipulating this independent variable was to determine whether the size of the underlying data set has an effect on the performance of the participants. Note that the number of items within these data sets remained constant (i.e., 30 items); what was manipulated was the number of association rules within the data set. In order to address potential learning effects, the order of exposure to the different data sets was varied using a 3 × 3 Latin Square, resulting in the assignment of participants to three different groups. As a result, the order in which the participants saw the data sets was eliminated as an independent variable.
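The counterbalancing scheme can be illustrated with a short Python sketch. The paper does not state which 3 × 3 Latin Square was used or how participants were allocated to groups, so the cyclic construction, the round-robin assignment, and the participant IDs below are assumptions made purely for illustration.

```python
from typing import Dict, List


def latin_square(conditions: List[int]) -> List[List[int]]:
    """Cyclic Latin square: each condition appears once per row and once per column."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]


def assign_groups(participant_ids: List[int],
                  square: List[List[int]]) -> Dict[int, List[int]]:
    """Assumed round-robin allocation of participants to the rows of the square."""
    return {pid: square[i % len(square)] for i, pid in enumerate(participant_ids)}


# Data set sizes (number of association rules) taken from the study description.
square = latin_square([50, 250, 500])

# Hypothetical participant IDs 1..12; each maps to one presentation order.
orders = assign_groups(list(range(1, 13)), square)
```

Each row of `square` is one group's presentation order, and every data set size appears exactly once in each ordinal position across the three groups.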
Two different tasks were devised for participants to perform using SARV. The goal of the first task was for the participant to find the ten most significant association rules from the data set. The goal of the second task was for the participant to find the five most significant association rules that are related to a given set of three items. Here, we define a significant rule as one that has high support and confidence.
The tasks were provided as specific scenarios, asking the participants to consider themselves a knowledge manager within a company. The three different data sets contained the same number of items and the same number of association rules that satisfied the conditions of the tasks. The purpose in manipulating this independent variable was to determine if the type of task undertaken has an effect on the performance of the participants as the size of the data set is manipulated.
In order to simplify the study design, the participants performed the first task on all three data sets (in the order dictated by their group assignment), followed by the second task on all three data sets. The participants did not use the same data set for these two sequential tasks, which mitigated their ability to learn the answers to the second task by having just completed the first task on the same data set.
After gaining informed consent for participating in the study, a pre-study questionnaire was administered to each participant to measure their education level, prior experience with databases or data mining techniques, familiarity with association rules, and experience with data mining software. Although participants were prescreened to ensure they had sufficient prior knowledge about data mining techniques, a brief overview of association rules was provided in order to ensure a common baseline level of understanding of the domain.
As the participants performed each task, they were required to write down the association rules they found. The time they took to complete each task was measured. In addition, the investigator carefully observed the participants and took detailed notes regarding their use of the system. After each task, participants completed a short questionnaire regarding their confidence in the results obtained and the ease of completing the task.
After all tasks were completed, a post-study questionnaire was administered to measure subjective reactions to the use of the system in general. All participants in the study were financially compensated for their time.
As a result of this study design, nine different hypotheses can be verified or refuted (four related to each of the two tasks, plus one related to the use of the system in general). In general, these hypotheses predict that the participants will be able to perform at a similar level, regardless of the increase in the number of association rules that need to be examined to complete the tasks.
H1: Participants will take a similar amount of time to find the ten most significant association rules, regardless of the size of the data set (50, 250, and 500 association rules).
H2: Participants will make a similar number of errors in finding the ten most significant association rules, regardless of the size of the data set (50, 250, and 500 association rules).
H3: Participants will report a similar level of confidence in finding the ten most significant association rules, regardless of the size of the data set (50, 250, and 500 association rules).
H4: Participants will report a similar level of ease in finding the ten most significant association rules, regardless of the size of the data set (50, 250, and 500 association rules).
H5: Participants will take a similar amount of time to find the five most significant association rules related to a given set of items, regardless of the size of the data set (50, 250, and 500 association rules).
H6: Participants will make a similar number of errors in finding the five most significant association rules related to a given set of items, regardless of the size of the data set (50, 250, and 500 association rules).
H7: Participants will report a similar level of confidence in finding the five most significant association rules related to a given set of items, regardless of the size of the data set (50, 250, and 500 association rules).
H8: Participants will report a similar level of ease in finding the five most significant association rules related to a given set of items, regardless of the size of the data set (50, 250, and 500 association rules).
H9: Participants will provide positive subjective responses to statements regarding the system in general.
A total of 12 participants were purposefully recruited from the graduate student population within our department. Participants were pre-screened to ensure they had taken at least one course in databases or data mining. The pre-study questionnaire verified a similar level of education and prior experience with association rules. As a result, we characterize the participants as knowledgeable users.
Because the study used a 3 × 2 (data set size × task type) within-subjects design, each participant performed a total of six tasks with the system. Since the two tasks differ (finding significant rules vs. finding significant rules associated with a given subset of items), their data cannot be combined, and direct comparisons between tasks are not informative. As such, the results from each task are presented separately in the sections that follow. The primary analysis examines whether data set size has an impact on the participants’ performance. An analysis of the participants’ subjective reactions to using SARV, collected after all the tasks were completed, is provided at the end of this section.
The average time to task completion for each data set is illustrated in
A pair-wise analysis of this data using ANOVA (see
Three different types of errors were identified based on the participants’ discovered association rules. These included participants identifying an association rule that did not meet the significance criteria, not being able to identify a sufficient number of association rules, and identifying duplicate association rules. The average number of each of these errors is illustrated in
Due to the extremely low error rate and the high degree of variability across the different types of errors, the errors were grouped together for the purposes of statistical verification, and a single ANOVA was performed. The results of this analysis show no statistically significant difference in the error rates as the number of association rules to be examined increases. As a result, we conclude that H2 is supported.
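The three error types described above, and their grouping into a single count, can be illustrated with a minimal sketch. Everything named here is hypothetical: the `ErrorTally` type, the `tally_errors` function, and the `is_significant` predicate are illustrative stand-ins, and the sketch assumes that a shortfall is counted as the number of distinct rules missing from the required total. It is not the procedure used in the study.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence


@dataclass(frozen=True)
class ErrorTally:
    not_significant: int  # distinct selected rules that fail the significance criteria
    missing: int          # how many distinct rules short of the required count
    duplicates: int       # repeated selections of an already chosen rule

    @property
    def total(self) -> int:
        # Grouped error count (all three types added together).
        return self.not_significant + self.missing + self.duplicates


def tally_errors(
    selected: Sequence[Hashable],
    required: int,
    is_significant: Callable[[Hashable], bool],
) -> ErrorTally:
    """Count the three error types for one participant on one task.

    `is_significant` is a hypothetical predicate standing in for the
    study's significance criteria, which are not restated here.
    """
    seen = set()
    duplicates = 0
    not_significant = 0
    for rule in selected:
        if rule in seen:
            duplicates += 1  # same rule reported more than once
            continue
        seen.add(rule)
        if not is_significant(rule):
            not_significant += 1
    # Assumption: a shortfall is measured against distinct rules reported.
    missing = max(0, required - len(seen))
    return ErrorTally(not_significant, missing, duplicates)


# Hypothetical example: five rules were required, four were reported.
significant_rules = {"r1", "r2", "r3", "r4", "r5"}
example = tally_errors(
    ["r1", "r2", "r2", "r9"],
    required=5,
    is_significant=lambda rule: rule in significant_rules,
)
```

In this made-up example, `r2` is a duplicate, `r9` does not meet the criteria, and two rules are missing, so the grouped count is four.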
Confidence in the results of the task was measured after each task was completed, using a five-point Likert scale ranging from very confident (5), through neutral (3), to very unconfident (1). Across all data sets, the scores ranged from three to five; the average scores are illustrated in
A pair-wise Wilcoxon-Mann-Whitney test was conducted on this subjective data to identify if there were any statistically significant differences as a result of completing the task with different sized data sets. The result of this analysis is reported in
As with the confidence measure, participants were asked to indicate the ease with which they were able to complete each task using a five-point Likert scale. The range of possible responses was from very simple (5), through neutral (3), to very difficult (1). The scores provided by the participants were similar to those for the confidence measure, ranging from three to five. The average scores are illustrated in
A pair-wise Wilcoxon-Mann-Whitney test was conducted to identify statistically significant differences between the data sets (see
measure, no statistically significant differences were found due to the changing number of association rules to be examined, providing support for H4 (participants will report a similar level of ease in completing the task, regardless of the data set size).
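The pair-wise comparisons of Likert ratings across data set sizes can be sketched as follows. This is a minimal illustration that computes only the Mann-Whitney U statistics, using average ranks for ties; it does not compute p-values, and the function names and the ratings in the example are invented for illustration rather than taken from the study.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (U_a, U_b) for two samples; U_a + U_b == len(a) * len(b)."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[: len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return u_a, u_b


def pairwise_u(scores: Dict[int, Sequence[float]]) -> Dict[Tuple[int, int], Tuple[float, float]]:
    """Compute U statistics for every pair of data set sizes."""
    return {
        (x, y): mann_whitney_u(scores[x], scores[y])
        for x, y in combinations(scores, 2)
    }


# Hypothetical Likert ratings keyed by data set size (not the study's data).
ratings = {50: [5, 4, 4], 250: [3, 4, 5], 500: [5, 4, 4]}
results = pairwise_u(ratings)
```

When the two U values for a pair are close to each other, as they are for identical rating distributions, the pair gives little evidence of a difference between those data set sizes; a significance decision would still require the corresponding p-value.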
The findings on Task 1 (finding the ten most significant association rules) illustrate the key benefit of using SARV to examine the set of association rules. Even though the number of rules increased by factors of 5 and 10 over the smallest group of rules, the time to find the required set of rules, the error rates, the perceived confidence, and the perceived ease did not change. In particular, the ability to filter the association rules using the matrix view, and then examine the rules using the graph view and detail view, kept the task equally easy even as the number of rules available for examination increased.
The average time to task completion for each data set is illustrated in
A pair-wise analysis of this data using ANOVA (see
between the 50 and 250 association rule data sets, and between the 50 and 500 association rule data sets. However, the difference between the 250 and 500 association rule data sets was not statistically significant. As a result, we conclude that H5 (no differences in time to task completion regardless of data set size) is not supported. That is, participants were able to perform the task faster when there were just 50 association rules to consider than when there were 250 or 500 association rules to consider.
With the smallest data set, the task required the participants to find five significant association rules from a set of 50, which were related to a given set of three items from the total of 30 items. As a result, the search space was already rather small, allowing the participants to complete the task quickly without needing to take advantage of the key benefits of SARV. However, as the data sets became larger, the task became more difficult, requiring the participants to do more exploration within the data. As a result, they took more time to complete the task.
A positive note in this finding is that once the data sets became large enough to make the task difficult (i.e., data sets of 250 and 500 association rules), there was no statistically significant difference in the participants’ time to task completion. So, while we continue to hold that H5 is not supported, we believe that this is a result of the task being somewhat trivial when there are very few association rules to consider. As such, with a sufficiently large data set as the baseline, this hypothesis may be supported in future studies.
The average number of errors encountered during the completion of the task is illustrated in
Confidence in the results of the task was measured using the same five-point Likert scale as in the first task. Across all data sets, the scores ranged from three to five; the average scores are illustrated in
A pair-wise Wilcoxon-Mann-Whitney test was conducted on this subjective data to identify statistically significant differences as a result of completing the task with different data sets (
A similar five-point Likert scale was used to measure participants’ impressions of the ease of completing the task. The scores provided by the participants were similar to those for the confidence measure, ranging from three to five. The average scores are illustrated in
A pair-wise Wilcoxon-Mann-Whitney test was conducted to identify statistically significant differences between the data sets (see
significant differences were found, providing support for H8 (participants experiencing a similar level of ease in completing the task, regardless of the data set size).
Similar to Task 1, the findings on Task 2 (finding the five most significant association rules related to a subset of items) highlight the benefits of using SARV to explore the association rules. Even as the number of association rules varied by a factor of ten, all of the measures except for the time to task completion remained essentially unchanged. When there were only 50 association rules in total, participants were able to find the five most significant rules that referenced the small subset of items much more quickly than when the number of rules was larger. Even so, the error rates, perceived confidence, and perceived ease were consistent across all sizes of the data.
After completing all of the tasks with all of the data sets, participants were asked to rate their agreement with a number of statements using a five-point Likert scale. The response options ranged from strongly agree, through neutral, to strongly disagree.
Since the study design did not provide a baseline against which these responses can be compared, conducting a statistical analysis of the data is not meaningful. However, we can conclude that there are substantial positive reactions to the learnability, visual encodings, ease of use, and utility of SARV, which support H9 (participants will provide positive subjective reactions to the system in general).
In this paper we proposed a new technique for the visualization of association rules. SARV presents the association rules in three synchronized views: the matrix view providing an overview of the rules, the graph view illustrating relationships and further details on a subset of rules, and the detail view showing the complete information of selected rules and/or items. The system was designed with the purpose of scaling well as the number of association rules becomes large, while avoiding the problems of occlusion and visual clutter.
SARV was designed to follow Shneiderman’s visual information seeking mantra of “overview first, zoom and filter, then details on demand”. An overview of the entire set of rules is provided in the matrix view. The user may use this view to filter the rules based on LHS and RHS items of interest, populating the graph view with the corresponding rules. This subset of the rules can be visually inspected further, allowing the user to gain a better understanding of the relationships that the rules represent. At any point, the specific details of the rules can be accessed and evaluated.
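The filtering step described above, in which selected LHS and RHS items determine which rules populate the graph view, can be sketched in a few lines. This is an illustrative sketch only, not SARV's implementation: the `Rule` type, the `make_rule` and `filter_rules` helpers, and the matching policy (a rule is kept when its LHS and RHS each share at least one item with the corresponding selection, and an empty selection imposes no constraint) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Rule:
    lhs: FrozenSet[str]   # antecedent items (the "if" part)
    rhs: FrozenSet[str]   # consequent items (the "then" part)
    support: float
    confidence: float


def make_rule(lhs: Iterable[str], rhs: Iterable[str], support: float, confidence: float) -> Rule:
    """Hypothetical convenience constructor that freezes the item sets."""
    return Rule(frozenset(lhs), frozenset(rhs), support, confidence)


def filter_rules(rules: Iterable[Rule], lhs_items: Iterable[str], rhs_items: Iterable[str]) -> List[Rule]:
    """Return the rules matching the selected LHS and RHS items.

    Assumed policy: a side matches when it shares at least one item with
    the selection; an empty selection leaves that side unconstrained.
    """
    lhs_sel, rhs_sel = frozenset(lhs_items), frozenset(rhs_items)
    kept = [
        r for r in rules
        if (not lhs_sel or r.lhs & lhs_sel) and (not rhs_sel or r.rhs & rhs_sel)
    ]
    # Order by confidence, then support, so stronger rules come first.
    return sorted(kept, key=lambda r: (r.confidence, r.support), reverse=True)


# Made-up rules for illustration.
all_rules = [
    make_rule(["bread"], ["butter"], support=0.30, confidence=0.80),
    make_rule(["bread", "milk"], ["eggs"], support=0.20, confidence=0.60),
    make_rule(["beer"], ["chips"], support=0.10, confidence=0.90),
]
bread_rules = filter_rules(all_rules, lhs_items=["bread"], rhs_items=[])
bread_to_eggs = filter_rules(all_rules, lhs_items=["bread"], rhs_items=["eggs"])
```

Selecting only `bread` on the LHS keeps the two rules whose antecedent mentions it; adding `eggs` on the RHS narrows the result to a single rule, whose support and confidence would then be what a detail view reports.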
A user study was performed to verify the scalability of SARV with respect to the number of rules being represented. The results of this study showed that with only one minor exception, participants were able to find a set of the most significant rules, as well as the significant rules related to a given set of items, in approximately the same time with the same error rate for data sets that contained small (50), medium (250) and large (500) numbers of rules. In addition, subjective measures showed that the participants’ perceptions of confidence and ease in completing the task did not differ with the size of the data set. Subjective responses to the system in general were also very positive. These results show the value of the visualization and interactive filtering features of SARV.
For future work, we plan to add features such as multiselection in order to allow users to compare the rules, the visual representation of other interestingness measures, and support for more complex association rules such as dynamic association rules, quantitative association rules, and weighted association rules. In addition, we plan to perform additional user studies to measure the scalability of the approach as the number of items within the dataset grows.