J. Intelligent Learning Systems & Applications, 2010, 2, 156-166
doi:10.4236/jilsa.2010.23019 Published Online August 2010 (http://www.SciRP.org/journal/jilsa)

Knowledge Discovery for Query Formulation for Validation of a Bayesian Belief Network

Gursel Serpen, Michael Riesen

Electrical Engineering and Computer Science, College of Engineering, University of Toledo; School of Law, University of Toledo, Toledo, USA.
Email: gserpen@eng.utoledo.edu, riesen@fraser-ip.com

Received February 23rd, 2010; revised July 6th, 2010; accepted July 20th, 2010.

ABSTRACT

This paper proposes machine learning techniques to discover knowledge in a dataset, in the form of if-then rules, for the purpose of formulating queries for validation of a Bayesian belief network model of the same data. Although domain expertise is often available, the query formulation task is tedious and laborious, and hence automation of query formulation is desirable. In an effort to automate the query formulation process, a machine learning algorithm is leveraged to discover knowledge in the form of if-then rules in the data from which the Bayesian belief network model under validation was also induced. The set of if-then rules is processed and filtered through domain expertise to identify a subset that consists of "interesting" and "significant" rules. The subset of interesting and significant rules is formulated into corresponding queries to be posed, for validation purposes, to the Bayesian belief network induced from the same dataset. The promise of the proposed methodology was assessed through an empirical study performed on a real-life dataset, the National Crime Victimization Survey, which has over 250 attributes and well over 200,000 data points. The study demonstrated that the proposed approach is feasible and provides automation, in part, of the query formulation process for validation of a complex probabilistic model, culminating in substantial savings in the human expert involvement and investment required.

Keywords: Rule Induction, Semi-Automated Query Generation, Bayesian Net Validation, Knowledge Acquisition Bottleneck, Crime Data, National Crime Victimization Survey

1. Introduction

Query formulation is an essential step in the validation of complex probabilistic reasoning models that are induced from data using machine learning or statistical techniques. Bayesian belief networks (BBN) have proven to be computationally viable empirical probabilistic models of data [1]. Advances in the machine learning, data mining, and knowledge discovery and extraction fields greatly aided the maturation of Bayesian belief networks, particularly for classification and probabilistic reasoning tasks. A Bayesian belief network can be created through a multitude of means: it can be induced solely from data, handcrafted by a domain expert, or built through a combination of these two techniques. A Bayesian belief network model essentially approximates the full joint probability distribution in the domain of interest. The development of a Bayesian belief network model is followed by a rigorous validation phase to ascertain that the model in fact approximates the full joint probability distribution reasonably well, even under the set of independence assumptions made. Validation is a comprehensive, multi-part process and often requires costly domain expert involvement and labor.
When a BBN model is used as a probabilistic reasoning engine, validation requires a complex and challenging approach, wherein a multitude of validation-related activities must be performed [2-5]; as part of one such activity, queries must be formed and posed to the network. Any subset of variables might be considered as evidence in such a query, which leads to the need to formulate an inordinate number of queries based on various subsets of variables. During validation by querying, a value assignment to some variables in the network is made and the posterior marginal probability or expectation of some other variables is desired. In other words, marginal probabilities and expectations can be calculated conditionally on any number of observations or evidence supplied to the network. It is also desirable, given that certain evidence is supplied, to ask for the values of non-evidence variables that result in the maximum possible posterior probability for the evidence, i.e., an explanation for the available evidence.
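In symbols, and purely as a sketch (with Q denoting a query variable, E = e the supplied evidence, and H the remaining non-evidence variables; this notation is ours, not the paper's), the two kinds of queries just described are:

```latex
% Posterior marginal of a query variable Q given evidence E = e:
P(Q \mid \mathbf{E} = \mathbf{e})
  = \frac{\sum_{\mathbf{h}} P(Q, \mathbf{H} = \mathbf{h}, \mathbf{E} = \mathbf{e})}
         {P(\mathbf{E} = \mathbf{e})}

% Explanation of the evidence: the assignment to the non-evidence
% variables with maximum posterior probability given E = e
\mathbf{h}^{*} = \arg\max_{\mathbf{h}} P(\mathbf{H} = \mathbf{h} \mid \mathbf{E} = \mathbf{e})
```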
One can specify a group of variables in the network to be estimated, or estimate all variables in the network collectively. The existing literature for validation of BBNs as probabilistic reasoning tools is sparse and mainly promotes ad hoc approaches or mechanisms.

The formulation of an appropriate "query" requires the use of extrinsic methods in order to discover relationships among attributes. More specifically, in forming a query, access to specific domain expertise can prove to be an efficient method for choosing which attributes to include as evidence and which attributes to identify for explanation or estimation. Experts in the domain of the data in focus can prove to be a useful resource in forming the queries. However, there are many challenges in utilizing domain experts in manual formulation of queries, and these challenges are in addition to the sheer cost and resources needed.

Conducting interviews with one or preferably more experts in the relevant field of interest is one of the preliminary steps in manual query formulation. Such interviews typically expose many issues and challenges associated with relying on experts in the field to focus and to form queries. Experts interviewed are likely to demonstrate an interest in forming unique queries that parallel their own expertise or interest, which might not fully overlap with the specific domain on which the model was built [6]. The list of potential queries suggested by the domain experts could prove to be inapplicable, as the specific dataset employed to develop the BBN model might not include all the attributes sought by the domain experts. In other circumstances, experts may be interested in applying local and regional attributes rather than the global or national attributes used in the dataset.

It is highly desirable to develop an automated procedure that formulates queries by leveraging the same dataset that was employed to induce the Bayesian belief network model. In similar terms, exploration of other, and possibly automated, options for generating useful and possibly non-obvious queries would be attractive. Data mining and machine learning techniques can be employed, through an inductive process, to automatically discover "queries" from a given dataset. More specifically, rule discovery and extraction algorithms can prove useful in "query formation". Examples of such algorithms are PART [7] and APRIORI [8].

1.1 Problem Statement

Validation of a complex Bayesian belief network, i.e., one that has on the order of hundreds of variables, induced from a large dataset like the National Crime Victimization Survey (NCVS), is a highly challenging task, since it requires major investment of resources and domain expertise while also being labor-intensive. Data mining and knowledge discovery algorithms are poised to offer a certain degree of relief from this challenge, and hence can be leveraged to automate segments of the overall process of query formation for validation. A machine learning or data mining algorithm can be leveraged to mine for rules in the dataset from which the Bayesian belief network model was induced, and these rules can then be formulated as queries for validation purposes.
The proposed study envisions processing a large and complex dataset through a rule-generation algorithm 1) to discover embedded knowledge in the form of if-then rules, and subsequently 2) to identify, through expert involvement, a subset of "interesting" and "significant" rules that can be formulated as queries for validation of the Bayesian belief network model of the dataset.

The next section discusses and elaborates on validation of a Bayesian belief network (BBN) model of a dataset, automatic query generation through a specific knowledge discovery tool, the NCVS dataset leveraged for this study, and the development of a BBN model on the same dataset. The subsequent section will demonstrate application of the proposed methodology to discover rules in the dataset, filtering of rules to identify an interesting and significant subset, mapping of chosen rules into queries, and application of such queries for validation purposes to a specific BBN model of a real-life-size dataset that has over 250 attributes and 200,000 data points, namely the National Crime Victimization Survey.

2. Background

This section discusses fundamental aspects of the problem being addressed. Elaborations on validating Bayesian belief networks when employed as probabilistic reasoning models, query formulation with the help of machine learning and data mining, the dataset used for the study, the National Crime Victimization Survey (NCVS), and the development of the Bayesian belief network model of the dataset are presented.

2.1 The NCVS Dataset

The National Crime Victimization Survey (NCVS) [9,10], previously the National Crime Survey (NCS), has been collecting data on personal and household victimization through an ongoing survey of a nationally representative sample of residential addresses since 1973. The geographic coverage is the 50 United States. The 'universe' is persons in the United States aged 12 and over in "core" counties within the top 40 National Crime Victimization Survey Metropolitan Statistical Areas (MSA). The sample used was a stratified multistage cluster sample. The NCVS MSA Incident data chosen for this study contains select household, person, and crime incident variables for persons who reported a violent crime within any of the core counties of the 40 largest MSAs from January 1979 through December 2004.
Household, person, and incident information for persons reporting non-violent crime are excluded from this file. The NCVS, which contains 216,203 instances and a total of 259 attributes, uses a labeling system for the attributes represented by letters and numbers. A typical attribute of interest is labeled by a five-character (alphanumeric) tag, e.g., V4529.

2.2 Bayesian Belief Network Model of NCVS Data

A Bayesian belief network (BBN) expresses a view of the joint probability distribution of a set of variables, given a collection of independence relationships. This means that a Bayesian belief network will correctly represent a joint probability distribution and simplify the computations if and only if the conditional independence assumptions hold. The task of determining a full joint distribution in a brute-force fashion is daunting: such calculations are computationally expensive and in some instances impossible. In order to address this formidable computational challenge, Bayesian belief networks are built upon conditional independence assumptions that appear to hold in many domains of interest.

A Bayesian belief network enables the user to extract a posterior belief. All causal relationships and conditional probabilities are incorporated into the network and are accessible through an automated inference process. A once tedious and computationally costly method of extracting posterior beliefs in a given domain becomes space-efficient and time-efficient. It is also possible to pose queries on any attribute of one's choosing, as long as it is one of those included in the model. One can just as easily adjust the prior evidence, enabling the user to compare and contrast posterior probabilities of a given attribute based on prior knowledge. The introduction of such a method has substantially increased the breadth and depth of feasible statistical analysis.

The BBN creation process consists of multiple phases. Following any preprocessing needed on a given dataset, the learning or training phase starts, wherein appropriate structure learner and parameter learner algorithms need to be selected empirically [11-17]. Learning a Bayesian belief network is a two-stage process: first learn a network structure, and then learn the probability tables. There are various software tools, some in the public domain and open source, to accomplish the development of a BBN through induction from data. For instance, the open-source and public-domain software tool WEKA [7], a machine learning tool that facilitates empirical development of clustering, classification, and functional approximation algorithms, has been leveraged to develop a BBN from the NCVS dataset for the study reported herein.

The validation phase can best be managed through a software tool that can implement the "probabilistic inferencing" procedure applicable to Bayesian belief networks. Another open-source and public-domain software tool, JavaBayes [18], was used for this purpose; it is able to import an already-built BBN model and facilitates, through its graphical user interface, querying of any attribute for its posterior probability value among many other options. A BBN model developed in WEKA can easily be imported into JavaBayes. Once imported, JavaBayes allows the user to identify and enter the evidence, and query a posterior belief of any attribute.
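The central object throughout this subsection is the factored joint distribution. As a brief formal reminder (standard BBN theory rather than anything specific to this paper's model): a network over variables X1, ..., Xn with parent sets Pa(Xi) encodes the joint distribution in the factored form

```latex
P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

which represents the full joint distribution exactly when the encoded conditional independence assumptions hold, and which reduces the storage cost from a table exponential in n to a set of tables exponential only in the maximum parent-set size.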
In this study, the BayesNet tool of WEKA has been used to induce a classifier with the "Victimization" attribute in the NCVS dataset as the class label [19]. The NCVS dataset has been split into training and test subsets with 66% and 33% ratios, respectively. Simulations were run for a variety of structure and parameter learning options. Results suggest that a number of BBN models performed exceptionally well as classifiers for the "Victimization" attribute in the NCVS dataset. All WEKA versions of the local hill climbers and local K2 search algorithms led to classification performances on the test subset with 98% or better accuracy. Since the classification accuracy rates were so close to each other, the value of the parameter "number of parent nodes" became significant, given that it directly relates to the approximation capability of the BBN with respect to the full joint distribution. Accordingly, the BBN model generated through the local K2 algorithm with Bayes learning and four parent nodes (the command-line syntax is "Local K2-P4-N-S BAYES" in WEKA format) was selected as the final network. This model, which can be obtained in BIF format from the authors upon request, has been used exclusively in the validation experiments reported in the following sections.
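A model of this kind might be induced programmatically with the WEKA 3.x Java API as in the minimal sketch below; this is an illustration under stated assumptions, not the authors' actual script. The file name ncvs.arff is a placeholder, the SimpleEstimator settings shown are WEKA defaults rather than values reported in the paper, and the K2 options mirror the quoted "Local K2-P4-N-S BAYES" selection.

```java
import weka.classifiers.bayes.BayesNet;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class InduceNcvsBbn {
    public static void main(String[] args) throws Exception {
        // Load the preprocessed NCVS data ("ncvs.arff" is a placeholder path).
        Instances data = new DataSource("ncvs.arff").getDataSet();
        // V4529 ("Victimization") serves as the class attribute; the name
        // must match the ARFF header of the actual file.
        data.setClassIndex(data.attribute("V4529").index());

        BayesNet bn = new BayesNet();
        // Local K2 search, at most 4 parents per node, empty initial
        // structure (-N), Bayes scoring, matching the paper's
        // "Local K2-P4-N-S BAYES" configuration; the estimator and its
        // -A 0.5 prior are assumed WEKA defaults.
        bn.setOptions(Utils.splitOptions(
            "-Q weka.classifiers.bayes.net.search.local.K2 -- -P 4 -N -S BAYES "
            + "-E weka.classifiers.bayes.net.estimate.SimpleEstimator -- -A 0.5"));
        bn.buildClassifier(data);

        // Export the learned network as XML BIF so it can be imported
        // into an inference tool such as JavaBayes.
        System.out.println(bn.toXMLBIF03());
    }
}
```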
2.3 Validation of Bayesian Belief Networks

Validation of a Bayesian belief network is a comprehensive process. Once the Bayesian belief network (BBN) is induced from the data and subsequently tuned by the domain experts, the next step is testing for validation of the premise that the network faithfully represents the full joint probability distribution subject to the conditional independence assumptions [5,20,21]. As part of the validation task, values computed by the BBN are compared with those supplied by the domain experts, statistical analysis, and the literature. Another distinct activity for validation entails querying any variable for its posterior distribution or posterior expectation, and obtaining an explanation for a subset of or all of the variables in the network. In that respect, knowledge discovery and data mining tools, in conjunction with the domain experts, are leveraged to formulate a set of so-called "interesting" and "significant" queries to pose to the BBN. Validating a BBN is no trivial task and necessitates ad hoc and empirical elements. More specifically, a comprehensive and rigorous process of evaluation and validation of a BBN model entails the following:

1) Perform an elicitation review that consists of reviewing the graph structure for the model, and reviewing and comparing probabilities with each other [22].

2) Carry out sensitivity analysis that measures the effect of one variable on another [3].

3) Implement validation using the data, which entails analysis of predictive accuracy and expected value calculations.

4) Conduct case-based evaluations that may include the following: run the model on test cases, compare the model output with expert judgment, and finally, compare the model predictions with the "ground truth" or accepted trends currently relied upon by experts in the domain of interest.

The case-based evaluation step is the most costly and challenging, since it requires substantial human expertise. In particular, elicitation of expert judgment to be leveraged for the validation of the Bayesian belief network poses a serious obstacle, since numerous test cases or "queries" must be generated and applied to the Bayesian belief network model. The expected values must be defined in advance by human experts to form a basis for comparison with those calculated by the network itself.

2.4 Query Formulation

Machine learning and data mining techniques may be leveraged to automatically discover "queries" for a given dataset. A query is the calculation of the posterior probabilities of any attribute or variable based upon the given prior evidence. When a user provides that a specific attribute is observed to have a (discrete) value, this 'evidence' may be used in calculating the posterior probability of a dependent variable. This is best understood by an example. Assume that the user makes a query for the posterior probability that a person will be a victim of burglary. This query is dependent upon the values observed for relevant attributes like the gender of the potential victim. If burglary is shown to be dependent upon the gender of the victim, then the prior observed value of male or female for the potential victim's gender will need to be supplied by the user in order to calculate the conditional probability of this incident.

This is analogous to an if-then rule: such a rule is a candidate for a query. One rule could postulate that "If the gender of the victim is female Then the probability of burglary will be greater than 0.60." By having such a rule at one's disposal, the process of making valid and knowledgeable queries can be streamlined. One does not necessarily have to rely solely on an expert for help to formulate "interesting" and "significant" queries. A rule set may be generated using one of many knowledge discovery algorithms, which can be structured to produce a set of if-then rules. Machine learning and data mining techniques prove useful for discovering knowledge that can be modeled as a set of if-then rules. Among the viable algorithms, PART [23], C4.5 or C5 [24], and RIPPER [25] from machine learning, and APRIORI [8] and its derivatives from data mining, are prominent.
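To make the rule-to-query correspondence concrete (using the illustrative burglary example above; the attribute names here are descriptive stand-ins, not actual NCVS labels), such a rule translates directly into a conditional probability statement whose computed value can be checked against the rule's claim:

```latex
% "If the gender of the victim is female Then the probability of
% burglary will be greater than 0.60" becomes the validation query
P(\text{Victimization} = \text{burglary} \mid \text{Gender} = \text{female})
  \;\stackrel{?}{>}\; 0.60
```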
3. Automation of Query Generation

This section presents the application of machine learning algorithms for knowledge discovery, in the form of if-then rules, on the NCVS dataset for the purpose of formulating queries to the Bayesian belief network model of the same dataset. Although data mining algorithms are also appropriate for knowledge extraction and subsequent automation of the query formulation process [26], their computational cost may quickly become prohibitive if care is not exercised. Decision tree or decision list based algorithms within the domain of machine learning are appealing in that they can generate a rule set for a given single attribute of interest, often within reasonable spatio-temporal cost bounds. Accordingly, the machine learning algorithm PART is chosen for the rule discovery and extraction task, given its desirable algorithmic and computational properties. The PART algorithm [23] combines two approaches, C4.5 [24] and RIPPER [25], in an attempt to avoid their respective disadvantages.

The main steps for validation of a Bayesian belief net model of data through automated query generation are shown in Figure 1.

Figure 1. Generic overview of steps for Bayesian belief net validation through automated query generation:
1) Use machine learning rule induction algorithms to derive a rule set in If-Then format from data.
2) Convert the rule set into a query set and filter the queries for "interestingness" and "significance" with the help of domain experts.
3) Apply selected queries to the Bayesian belief net model of the same data for validation purposes.
4) Solicit domain experts to evaluate the query responses by the Bayesian belief network model.

The rule induction algorithm PART is applied to the NCVS dataset in order to extract a set of rules. The same rules are leveraged, following further processing by domain experts, as queries to the BBN model of the NCVS
dataset for validation purposes. Initially, a subset of rules is labeled as "interesting" and "significant" by the domain experts, wherein "interesting" is a subjective labeling by a particular domain expert based upon the relationship of the evidence and the resultant projected probability of the THEN consequent variable. Next, these rules are formulated as queries, and the evidence associated with each query is supplied to the BBN model in JavaBayes. Posterior probability calculations performed by the JavaBayes reasoning or inferencing engine for the attribute(s) of interest, which can be any subset from the list, are compared to expected values. This is done to infer whether, in fact, the BBN model approximates reasonably well the joint probability distribution for the set of attributes entailed by the NCVS dataset.

3.1 PART Algorithm and Rules on NCVS Data

Rules that are derived from a dataset through a machine learning algorithm like PART expose the relationship between a subset of attributes and a single attribute of interest (the class label); in this case, the class label is designated as "Victimization" due to its significance in the domain. Any attribute can be designated as the class label, and doing so would require a separate run of the PART algorithm to generate the set of rules whose consequents are that class label. Through the PART algorithm, the knowledge entailed by the dataset is captured into a framework with a set of if-then rules. Specifically, the format for a rule complies with the following: IF premise THEN consequent, where the premise is a statement in the form of a logical conjunction of a subset of attribute-value pairs, and the consequent represents a certain type of victimization. We have used the WEKA implementation of the PART algorithm throughout this study. Available options for PART as implemented in the WEKA package, and their associated default settings, are shown in Table 1.

The NCVS Incident dataset was preprocessed prior to the rule induction step: the attribute count was reduced from 259 to 225 through removal of those that were not deemed to be relevant for the study. The attributes in the NCVS Incident dataset are represented, with a few exceptions, by a label that has four numeric characters preceded by the letter "V". The PART algorithm was applied to the NCVS dataset with default parameter values and V4529 (Victimization) as the class attribute. Values for the V4529 attribute are shown in Table 2. The algorithm was trained on a 66%-33% training-testing split of the NCVS dataset, and generated a list of 176 rules [27]. The rules output are in the traditional IF-THEN format, where the premise is the logical conjunction of a set of attribute-value pairs (i.e., evidence), followed by the consequent, which is a specific value of the class attribute. Table 3 illustrates one of the rules discovered by the PART algorithm on the NCVS data and its interpretation.

Table 1. Parameter options and default values for the WEKA PART algorithm

  Option       Explanation                                  Default
  -C <number>  Confidence threshold for pruning             0.25
  -M <number>  Minimum number of instances per leaf         2
  -R           Use reduced error pruning                    False
  -N <number>  Number of folds for reduced error pruning    3
  -B           Use binary splits for nominal attributes     False
  -U           Generate unpruned decision list              False
  -Q <seed>    Seed for random data shuffling               1
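A minimal sketch of this rule-induction step using the WEKA 3.x Java API follows; it is illustrative only. The file name ncvs.arff is a placeholder for the preprocessed 225-attribute dataset, and the random seed is an assumption consistent with, but not quoted from, the paper.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.PART;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MineNcvsRules {
    public static void main(String[] args) throws Exception {
        // "ncvs.arff" is a placeholder for the preprocessed dataset.
        Instances data = new DataSource("ncvs.arff").getDataSet();
        data.setClassIndex(data.attribute("V4529").index()); // class = Victimization

        // 66%/34% train-test split, mirroring the study's setup.
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        // Default PART settings correspond to Table 1 (-C 0.25 -M 2).
        PART part = new PART();
        part.buildClassifier(train);

        // The printed decision list is the set of IF-THEN rules whose
        // consequents are values of the class attribute V4529.
        System.out.println(part);

        // Held-out accuracy of the decision list on the test subset.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(part, test);
        System.out.println(eval.toSummaryString());
    }
}
```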
Table 2. Values for the NCVS attribute V4529

  Label   Description of Values for the "Victimization" Attribute V4529
  x60     Completed/attempted rape
  x61     Sexual attack/assault/serious assault
  x62     Attempted/completed robbery with injury from serious assault
  x63     Attempted/completed robbery with injury from minor assault
  x64     Attempted/completed robbery without injury
  x65     Attempted/completed aggravated assault
  x66     Threatened assault with weapon
  x67     Simple assault completed with injury
  x68     Assault without weapon without injury
  x69     Verbal threat of rape/sexual assault
  x70     Verbal threat of assault
  x71     Attempted/completed purse snatching and pocket picking
  x72     Burglary
  x73     Attempted forcible entry
  x74     Attempted/completed motor vehicle theft
  x75     Attempted/completed theft

3.2 Query Formulation Based on PART Rules

The process of query formulation using the PART rules, and posing the queries to the BBN model, entails human expert involvement and is the focus of the discussion in this section. A PART rule, which is captured through the "IF-premise-THEN-consequent" framework, readily lends itself to query formation: the premise becomes the prior evidence for a query, where a posterior probability value calculation is desired for the rule consequent. Such queries may be employed to validate, among other uses, the Bayesian belief network model of the full joint probability distribution of the 225 attributes in the NCVS dataset.
Table 3. A sample rule generated by the PART algorithm and its interpretation

  PART rule: V4113 = 0 & V4094 = 0 & V4119 = 0 & V4117 = 0 & V4118 = 0 & V4096 = 9: 67

  Interpretation: If the victim did not receive injuries from an attempted rape (V4113 = 0), and was not attacked in the form of rape (V4094 = 0), and was not knocked unconscious (V4119 = 0), and did not have broken bones or teeth as a result of the incident (V4117 = 0), and did not sustain any internal injuries (V4118 = 0), and could not answer whether (s)he was or was not a victim of sexual assault (V4096 = 9), Then there is a high probability that this person will be a victim of "Simple Assault Completed with Injury" (V4529 = x67).

The list of 176 rules generated by the PART algorithm was manually processed by domain experts, Gabrielle Davis [28] and Michael Riesen [27], to identify those that are interesting and significant for query formation, to serve as the validation set through the domain specialists' admittedly somewhat subjective perspective. The 49 rules identified accordingly, to be leveraged as queries to the BBN model of the NCVS dataset, are listed in [27].

Conversion of PART rules to queries, and posing the resulting queries to the JavaBayes realization of the BBN model, is a straightforward process and will be illustrated next. The middle column in Table 4 displays (in JavaBayes format) the posterior probability for the victimization attribute V4529 with no prior evidence observed, before any query is posed, as provided by the BBN model. One of the simple rules generated by PART, which will be used as an example query, is shown in Table 4. The premise part of the rule, i.e., V4127 = 2 AND V4095 = 1, is treated as prior evidence and supplied to the BBN model as such. Next, JavaBayes is asked to perform "reasoning" or "inference" through the BBN model of the NCVS data using the supplied prior evidence. Once the inferencing calculations are complete, the updated posterior probabilities for all discrete values of the victimization attribute are as shown in the rightmost column of Table 4. As an example, the probability value for the x60 value of the victimization attribute is now 0.612, a marked increase compared to the no-evidence case. Translating the NCVS notation of the above comparison, this rule indicates that when a victim is attacked in such a way that the victim perceived the incident as an attempted rape (V4095 = 1), and the victim was not injured to the extent that the victim received any medical care, including self-treatment (V4127 = 2), there is a 61% chance that this victim would be a victim of a completed or attempted rape (V4529 = x60).

Next, another, relatively more complex rule generated by the PART algorithm, shown in Table 5, was presented as a query to the BBN model in JavaBayes. In Table 6, the process of supplying the evidence as provided from this PART rule is shown. First, the prior evidence that the victim suffered no injuries related to attempted rape (V4113 = 0) is supplied. Then, further prior evidence is supplied through V4052 = 0, meaning that the offender did not use a rifle, shotgun, or any other gun different from a handgun.
More prior evidence is added in the form of V4050 = 3, indicating that there was a weapon used, but the specific type is not applicable as reported in the NCVS. In the final step, V4241 = 1 is provided as prior evidence. However, with this addition of V4241 = 1, JavaBayes running in the Java Runtime Environment generated an OutOfMemory exception, although the heap size was set to 3.5 GB. Nevertheless, for each of the reportable cases, the corresponding posterior probability table for the NCVS Victimization attribute V4529 is displayed. As shown in Table 6, the inclusion of each further piece of evidence has a direct effect on the posterior probability of the consequent (i.e., the so-called "Then" part of a rule), which can be observed through the value of the x65 discrete label for the class attribute V4529.

Table 4. A PART rule (V4127 = 2 & V4095 = 1: 60), associated JavaBayes query, and updated posterior probability values for V4529 with increasing evidence

  V4529 label        No evidence    Evidence: V4127 = 2 & V4095 = 1
  p(x60|evidence)    0.004          0.612
  p(x61|evidence)    0.001          0.005
  p(x62|evidence)    0.005          0.006
  p(x63|evidence)    0.005          0.009
  p(x64|evidence)    0.025          0.003
  p(x65|evidence)    0.036          0.003
  p(x66|evidence)    0.006          0.069
  p(x67|evidence)    0.022          0.008
  p(x68|evidence)    0.055          0.003
  p(x69|evidence)    0.000          0.007
  p(x70|evidence)    0.019          0.227
  p(x71|evidence)    0.018          0.010
  p(x72|evidence)    0.113          0.007
  p(x73|evidence)    0.032          0.009
  p(x74|evidence)    0.053          0.008
  p(x75|evidence)    0.598          0.008
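The mechanical part of this rule-to-query conversion is simple enough to automate. The following is a purely illustrative helper, not part of the paper's tooling: it splits a PART rule string, in the notation of Table 4, into its evidence assignments and its expected class value, ready to be entered as observations in a JavaBayes-style inference tool.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: parse a PART rule such as "V4127 = 2 & V4095 = 1: 60"
// into (a) the evidence assignments forming the query's prior evidence and
// (b) the class value whose posterior the query is expected to raise.
public class RuleToQuery {
    public static void main(String[] args) {
        String rule = "V4127 = 2 & V4095 = 1: 60";
        String[] parts = rule.split(":");
        String consequent = "x" + parts[1].trim();        // class label, e.g. x60

        Map<String, String> evidence = new LinkedHashMap<>();
        for (String clause : parts[0].split("&")) {
            String[] pair = clause.split("=");
            evidence.put(pair[0].trim(), pair[1].trim()); // attribute -> observed value
        }

        // The corresponding query asks for the posterior of V4529 given
        // this evidence, to be compared against the rule's consequent.
        System.out.println("probability(\"V4529\" | " + evidence
                + "), expected dominant value: " + consequent);
    }
}
```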
Table 5. PART rule and associated JavaBayes query

  PART rule: V4113 = 0 AND V4052 = 0 AND V4050 = 3 AND V4241 = 1: 65
  Corresponding JavaBayes query syntax: Posterior distribution: probability("V4529" | V4113 = 0, V4052 = 0, V4050 = 3, V4241 = 1)

Table 6. Posterior probabilities for the Victimization attribute V4529 with progressively increasing prior evidence (fractions truncated beyond the third significant digit)

  V4529 value        P(V4529|V4113=0)   P(V4529|V4113=0, V4052=0)   P(V4529|V4113=0, V4052=0, V4050=3)
  p(x60|evidence)    0.032              0.023                       0.028
  p(x61|evidence)    0.004              0.005                       0.003
  p(x62|evidence)    0.064              0.195                       0.210
  p(x63|evidence)    0.066              0.003                       0.002
  p(x64|evidence)    0.083              0.073                       0.086
  p(x65|evidence)    0.206              0.624                       0.630
  p(x66|evidence)    0.010              0.024                       0.010
  p(x67|evidence)    0.259              0.003                       0.002
  p(x68|evidence)    0.245              0.001                       0.001
  p(x69|evidence)    0.000              0.000                       0.000
  p(x70|evidence)    0.020              0.037                       0.021
  p(x71|evidence)    0.000              0.001                       0.000
  p(x72|evidence)    0.000              0.000                       0.000
  p(x73|evidence)    0.000              0.000                       0.000
  p(x74|evidence)    0.000              0.000                       0.000
  p(x75|evidence)    0.001              0.000                       0.000

3.3 Validation of the NCVS BBN Model through PART-Induced Queries

Each of the 49 rules that were identified as "interesting" and "significant" by the domain experts was carefully considered as a test query. In light of the memory limitation encountered earlier, the original rules had to be altered in order for the system to be able to compute the posterior probabilities within the memory constraints of the system available. Accordingly, some of the rules were eliminated due to memory limitations: a total of 22 rules were selected, revised, and included in the query list. Table 7 shows the revised versions of the rules supplied by the PART algorithm, which were computable and hence were applied as queries to the BBN model of the NCVS data. The attributes or evidence variables in each rule were ranked by domain experts [28,29] in order of interest (i.e., importance to study of the domain). The domain experts were able to classify two general groups of "interesting" and "significant" rules: 1) rules listing IF premises that produced an unexpected result; and 2) rules that were in direct alignment with the accepted standards in the domain.

Some attributes that originally appear in a specific rule and were ranked low by the experts were excluded from the corresponding query due to memory constraints. As a result of the exclusion of certain attribute-value pairs from many of the 22 rules used as queries, the consequent attribute value is likely to be affected and may change from the value indicated by the original rule induced by the PART rule discovery algorithm. Each revised rule in Table 7 is indicated with an (R) next to the number of the rule.

The posterior probabilities computed by JavaBayes for each rule in Table 7, upon its being posed as a query, are displayed in Table 8, where only significant probability values are denoted for the sake of presentation clarity. Table 9 presents the rules recovered from the computed probabilities in Table 8, to demonstrate comparatively the differences between the revised rules in Table 7 and those computed by the BBN model of the NCVS data. In formulating the rules in Table 9, any consequent attribute value that has a comparatively significant probability value was included.
Due to the revision of the original rules induced from the NCVS data, there are differences between the consequents of the rules in Tables 7 and 9. Although there are discrepancies between the consequents of the rules in Tables 7 and 9, the knowledge exposed by the PART rules is still present to a large degree. The value "x75" represents the crime of attempted or completed theft and is a dominant value for the victimization attribute: with no evidence being presented, "x75" represents nearly 60% of all crimes reported in the NCVS. Interestingly, the PART rules have extracted a second layer of usable information. The revised rules are not necessarily "incorrect" but show how a particular set of values can drastically affect the outcome of the victimization attribute. For example, rule 10 in unrevised form provides that the victimization attribute should have a large value for "x71". As noted in Tables 8 and 9, "x71" is not the dominant value for the revised rule 10. However, the change in the posterior probability for the value "x71" from 1.8% to 18% is nevertheless noteworthy. Where the rules generated by the PART algorithm are queried exactly as they appear, the consequents of the rules hold true as the dominant value. Since certain queries fail due to memory errors, the rules had to be revised to demonstrate at least a portion of the knowledge extracted by the original PART-induced rules.
Table 7. Revised query list based on PART rules; "(R)" marks a rule revised from its original PART form

  Rule 1 (R): IF V4065 = 1 & V4026 = 9 & V3018 = 1 & V3024 = 2 THEN V4529 = 75
  Rule 2 (R): IF V4052 = 0 & V4083 = 9 & V4094 = 0 & V4095 = 0 & V4024 = 7 THEN V4529 = 65
  Rule 3 (R): IF V4052 = 0 & V4112 = 0 & V4113 = 0 & V4095 = 0 & V4094 = 0 & V4024 = 1 THEN V4529 = 65
  Rule 4 (R): IF V4052 = 0 & V4094 = 0 & V4095 = 0 & V4111 = 0 & V4024 = 2 THEN V4529 = 65
  Rule 5 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 5 THEN V4529 = 71
  Rule 6 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 7 & V3018 = 2 & MSACC = 17 THEN V4529 = 71
  Rule 7 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 7 & V3018 = 2 & MSACC = 26 THEN V4529 = 71
  Rule 8 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 2 THEN V4529 = 71
  Rule 9 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 7 & MSACC = 4 THEN V4529 = 71
  Rule 10 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 7 & V3015 = 5 THEN V4529 = 71
  Rule 11 (R): IF V4322 = 9 & V4128 = 1 & V4094 = 0 & V4095 = 0 & V4052 = 0 & V4051 = 0 & V4289 = 2 THEN V4529 = 65
  Rule 12 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 7 THEN V4529 = 71
  Rule 13 (R): IF V4322 = 9 & V4065 = 1 & V4307 = 0 & V4024 = 8 THEN V4529 = 71
  Rule 14: IF V4322 = 9 & V4065 = 1 & V4285 = 9 & V4307 = 0 & V4024 = 7 & MSACC = 35 THEN V4529 = 71
  Rule 15 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 3 THEN V4529 = 71
  Rule 16 (R): IF V3024 = 2 & V3020 = 23 & V2045 = 1 THEN V4529 = 71
  Rule 23: IF V4073 = 0 & V4029 = 9 & V3018 = 2 & V4152 = 9 & V2045 = 2 & V3019 = 2 THEN V4529 = 75
  Rule 35: IF V4322 = 9 & V4052 = 0 & V4081 = 9 & V4095 = 0 & V4094 = 0 & V4096 = 9 & V4036 = 9 & V4024 = 5 THEN V4529 = 65
  Rule 45 (R): IF V4065 = 1 & V4029 = 9 & V3018 = 2 THEN V4529 = 75
  Rule 46 (R): IF V3020 = 8 THEN V4529 = 71
  Rule 47 (R): IF V3020 = 24 & V3014 = 3 THEN V4529 = 75
  Rule 48 (R): IF V4113 = 0 & V4052 = 0 & V4050 = 3 THEN V4529 = 65

The query results for the revised PART rules were reviewed by two domain experts [28,29]. In the majority of the cases, both experts found the predicted posterior probabilities to be reasonable and in accord with the current statistical trends provided by conventional means. As an example, the Bureau of Justice Statistics (BJS) provides periodic statistical reports [9]. The BJS reported that, based upon violent crime statistics from 1973-2005, beginning with the 25-34 age category, the rate at which persons were victims of violent crimes declined significantly as the age category increased [30]. The BJS also reports that, in general, males experienced higher victimization rates than females for all types of violent crime except rape/sexual assault [9]. Where the generated rules included attributes (e.g., V3014 (Age), V3018 (Gender), and V4024 (location of incident)) that were consistent with known and generally accepted trends, the experts were not surprised by the values predicted and agreed that the posterior probabilities based upon each set of evidence attributes were not in the extremes, based upon current publications in the field. The values were not unexpectedly high and thus did not trigger a shocked response. Conversely, the posterior values were not inordinately low compared to expected results, and thus the validity of the predicted values was not drawn into question. Rules 11, 35, and 48 were highlighted by the experts as the strongest rules, having the most sensible values for posterior prediction as compared to the generally accepted statistical values presented in currently available publications and studies. In particular, the experts easily identified a known relationship or correlation between the IF premise and the consequent for each of the rules 11, 35, and 48. In each of these three strongest rules, the experts found that the prior evidence values clearly set the stage for the associated posterior victimization predictions.
Overall, both experts indicated that the responses computed by the BBN model of the NCVS data to all queries posed were generally expected and reasonable, suggesting that the model is realistic and accordingly is a good approximation to the joint probability distribution.

As an exception to the generally positive feedback, rule 10 was found to be somewhat extraordinary. Rule 10 included the attribute indicating that the victim was never married (V3015 = 5). A value of 5 for V3015 shows a distinct increase in the probability of a purse snatching or pocket picking. Domain experts were surprised to find that this evidence value would have such an impact on the posterior probability of pocket picking. Although the posterior prediction was not necessarily discounted, the experts were skeptical, absent a more thorough explanation of the increased victimization. However, the skepticism did not detract from the intriguing prospect that the generated rule might have exposed "new" knowledge. As the experts reviewed the list of rules, the inclusion of certain "unusual" or unexpected attributes, similar to the attribute uncovered by rule 10, stimulated the most feedback from the domain experts. The experts were interested in further investigation of the "new" and "unusual"
combinations of attribute-value pairs presented in the generated rules, stating that the rules could provide a starting point for further research into factors that may not have been fully developed with conventional methods.

Table 8. Query results as probability values for revised PART rules in Table 7 (only the highest probability values are shown, listed in increasing order of the V4529 categories x61 through x75; fractions truncated beyond the second significant digit)

  Rule 1:  0.04, 0.20, 0.03, 0.68
  Rule 2:  0.26, 0.09, 0.61, 0.01
  Rule 3:  0.22, 0.06, 0.67
  Rule 4:  0.23, 0.74
  Rule 5:  0.05, 0.13, 0.06, 0.02, 0.36, 0.32
  Rule 6:  0.04, 0.09, 0.28, 0.12, 0.41
  Rule 7:  0.03, 0.01, 0.08, 0.36, 0.09, 0.38
  Rule 8:  0.02, 0.14, 0.02, 0.01, 0.30, 0.46
  Rule 9:  0.04, 0.15, 0.21, 0.12, 0.41
  Rule 10: 0.07, 0.17, 0.18, 0.01, 0.07, 0.43
  Rule 11: 0.98, 0.01
  Rule 12: 0.05, 0.15, 0.19, 0.09, 0.44
  Rule 13: 0.01, 0.12, 0.07, 0.69
  Rule 14: 0.01, 0.03, 0.35, 0.08, 0.49
  Rule 15: 0.01, 0.07, 0.16, 0.04, 0.02, 0.14, 0.47
  Rule 16: 0.02, 0.03, 0.11, 0.03, 0.05, 0.59
  Rule 23: 0.01, 0.20, 0.11, 0.02, 0.63
  Rule 35: 0.09, 0.06, 0.78, 0.03
  Rule 45: 0.02, 0.20, 0.10, 0.03, 0.62
  Rule 46: 0.03, 0.04, 0.03, 0.07, 0.08, 0.04, 0.59
  Rule 47: 0.05, 0.10, 0.03, 0.05, 0.60
  Rule 48: 0.21, 0.08, 0.63, 0.02

The implications of using a rule-generating algorithm such as PART to essentially generate queries are potentially profound. Limitations associated with user bias and limited domain knowledge may impede the self-generation of useful and interesting queries. Using PART as an automatic query generation tool could potentially uncover a not-so-obvious relationship between prior evidence and the resulting posterior probability of another attribute. Applying this principle to the NCVS data, the practical significance lies in uncovering the specific attributes of a victim or circumstance that make them more or less likely to be the victim of a specific crime. As an example of practical implementation within the context of criminal justice, by identifying the relationships that have the greatest impact on posterior probability, resources can be channeled into areas where they would be most effective in combating violent crime.

Domain experts indicated that automatic query generation using the PART algorithm or an equivalent would be helpful not only in discovering any hidden or novel relationships between attributes, but, more practically, as a method to reinforce trends and relationships already relied upon in the field. A second group of domain experts (six professors at the University of Toledo College of Law) were independently interviewed and asked to provide a list of self-generated queries that would be of personal interest. None of the second group was able to provide a list of more than three potential queries. The second group was then presented with the automatically generated queries. All experts in the second group found that
the collection of automatically generated queries was relatively easy to review compared to the alternative of postulating a user-defined list of rules and queries.

Table 9. Rules reconstructed from the probability values in Table 8 (only modified rules are shown)

  Rule 5 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 5 THEN V4529 = 74 & 75
  Rule 6 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 7 & V3018 = 2 & MSACC = 17 THEN V4529 = 71 & 75
  Rule 7 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 7 & V3018 = 2 & MSACC = 26 THEN V4529 = 71 & 75
  Rule 8 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 2 THEN V4529 = 74 & 75
  Rule 9 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 7 & MSACC = 4 THEN V4529 = 71 & 75
  Rule 10 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 7 & V3015 = 5 THEN V4529 = 70 & 71 & 75
  Rule 12 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 7 THEN V4529 = 70 & 71 & 75
  Rule 13 (R): IF V4322 = 9 & V4065 = 1 & V4307 = 0 & V4024 = 8 THEN V4529 = 71 & 75
  Rule 14: IF V4322 = 9 & V4065 = 1 & V4285 = 9 & V4307 = 0 & V4024 = 7 & MSACC = 35 THEN V4529 = 71 & 75
  Rule 15 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 3 THEN V4529 = 70 & 74 & 75
  Rule 16 (R): IF V3024 = 2 & V3020 = 23 & V2045 = 1 THEN V4529 = 72 & 75
  Rule 23: IF V4073 = 0 & V4029 = 9 & V3018 = 2 & V4152 = 9 & V2045 = 2 & V3019 = 2 THEN V4529 = 71 & 75
  Rule 35: IF V4322 = 9 & V4052 = 0 & V4081 = 9 & V4095 = 0 & V4094 = 0 & V4096 = 9 & V4036 = 9 & V4024 = 5 THEN V4529 = 62 & 65
  Rule 45 (R): IF V4065 = 1 & V4029 = 9 & V3018 = 2 THEN V4529 = 71 & 75
  Rule 46 (R): IF V3020 = 8 THEN V4529 = 75

Each of the experts in the second group agreed that it is sometimes difficult to consider the impact of a particular variable, especially if the particular variable is not one that has been extensively researched using other known techniques. In this way, the automatic rule generation may also be used as a reliable method to test prior hypotheses. Each member of the second group also agreed that an automatically generated list of rules provided a catalyst to the generation of user-defined rules and queries. At a minimum, the relationships of the attributes presented in the generated rules caused members of the second group to reflect upon their own conceptions of trends in victimization, which ultimately resulted in a wholesale request for more information on the resultant effect of certain unexpected attributes on the posterior probability of victimization.

4. Conclusions

This paper presented an approach to address the acquisition bottleneck problem in generating human-expert-formulated queries for validation of a Bayesian belief network model. A machine learning based approach for rule discovery from a dataset, with the discovered rules serving as potential queries, was proposed. The proposed technique employs machine learning (and potentially data mining) algorithms to generate a set of classification or association rules that can be converted into corresponding queries with minimal human intervention and processing, in the form of filtering for interestingness and significance by domain experts. The application and utility of the proposed methodology for semi-automated query formulation based on rule discovery was demonstrated through validation of a Bayesian belief network model of a real-life-size dataset from the domain of criminal justice.

REFERENCES

[1] D. Heckerman, "Bayesian Networks for Data Mining," Data Mining and Knowledge Discovery, Vol. 1, No. 1, 1997, pp. 79-119.

[2] K. B. Laskey and S. M. Mahoney, "Network Engineering for Agile Belief Network Models," IEEE Transactions on Knowledge and Data Engineering, Vol. 12, No. 4, 2000, pp. 487-498.
[3] K. B. Laskey, "Sensitivity Analysis for Probability Assessments in Bayesian Networks," Proceedings of the Ninth Annual Conference on Uncertainty in Artificial Intelligence, Washington, D.C., 1993, pp. 136-142.

[4] M. Pradhan, G. Provan, B. Middleton and M. Henrion, "Knowledge Engineering for Large Belief Networks," Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence, Seattle, Washington, 1994, pp. 484-490.

[5] O. Woodberry, A. E. Nicholson and C. Pollino, "Parameterising Bayesian Networks," In: G. I. Webb and X. Yu, Eds., Lecture Notes in Artificial Intelligence, Springer-Verlag, Berlin, Vol. 3339, 2004, pp. 1101-1107.

[6] S. Monti and G. Carenini, "Dealing with the Expert Inconsistency in Probability Elicitation," IEEE Transactions on Knowledge and Data Engineering, Vol. 12, No. 4, 2000, pp. 499-508.

[7] I. H. Witten and E. Frank, "Data Mining: Practical Machine Learning Tools and Techniques," 2nd Edition, Morgan Kaufmann, San Francisco, 2005.

[8] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, 1994, pp. 487-499.
[9] US Department of Justice, Bureau of Justice Statistics, "National Crime Victimization Survey: MSA Data, 1979-2004," Inter-university Consortium for Political and Social Research, Ann Arbor, 2007. http://www.icpsr.umich.edu/cocoon/NACJD/STUDY/04576.xml

[10] T. C. Hart and C. Rennison, Bureau of Justice Statistics, "Special Report," March 2003, NCJ 195710. http://www.ojp.usdoj.gov/bjs/abstract/rcp00.html

[11] R. Blanco, I. Inza and P. Larrañaga, "Learning Bayesian Networks in the Space of Structures by Estimation of Distribution Algorithms," International Journal of Intelligent Systems, Vol. 18, No. 1, 2003, pp. 205-220.

[12] R. Bouckaert, "Belief Networks Construction Using the Minimum Description Length Principle," Lecture Notes in Computer Science, Springer-Verlag, Berlin, Vol. 747, 1993, pp. 41-48.

[13] L. M. de Campos, J. M. Fernández-Luna and J. M. Puerta, "An Iterated Local Search Algorithm for Learning Bayesian Networks with Restarts Based on Conditional Independence Tests," International Journal of Intelligent Systems, Vol. 18, No. 2, 2003, pp. 221-235.

[14] J. Cheng, R. Greiner, J. Kelly, D. A. Bell and W. Liu, "Learning Bayesian Networks from Data: An Information-Theory Based Approach," Artificial Intelligence, Vol. 137, No. 1-2, 2002, pp. 43-90.

[15] D. Heckerman, D. Geiger and D. M. Chickering, "Learning Bayesian Networks: The Combination of Knowledge and Statistical Data," Machine Learning, Vol. 20, No. 3, 1995, pp. 197-243.

[16] T. V. Allen and R. Greiner, "Model Selection Criteria for Learning Belief Nets: An Empirical Comparison," Proceedings of the International Conference on Machine Learning, Stanford, 2000, pp. 1047-1054.

[17] Y. Guo and R. Greiner, "Discriminative Model Selection for Belief Net Structures," Proceedings of the Twentieth National Conference on Artificial Intelligence, Pittsburgh, 2005, pp. 770-776.

[18] F. G. Cozman, "JavaBayes Software Package," University of São Paulo, cited 2006. http://www.cs.cmu.edu/~fgcozman/home.html

[19] R. Bouckaert, "Bayesian Network Classifiers in Weka," Technical Report, Department of Computer Science, Waikato University, Hamilton, 2005.

[20] M. J. Druzdzel and L. C. van der Gaag, "Building Probabilistic Networks: Where Do the Numbers Come from?" IEEE Transactions on Knowledge and Data Engineering, Vol. 12, No. 4, 2000, pp. 481-486.

[21] T. Boneh, "Visualisation of Structural Dependencies for Bayesian Network Knowledge Engineering," Masters Thesis, University of Melbourne, Melbourne, 2002.

[22] M. J. Druzdzel and L. C. van der Gaag, "Elicitation of Probabilities for Belief Networks: Combining Qualitative and Quantitative Information," Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence, Seattle, 1995, pp. 141-148.

[23] E. Frank and I. H. Witten, "Generating Accurate Rule Sets without Global Optimization," Proceedings of the Fifteenth International Conference on Machine Learning, Madison, 1998, pp. 144-151.

[24] J. R. Quinlan, "C4.5: Programs for Machine Learning," Morgan Kaufmann Publishers, San Mateo, 1993.

[25] W. W. Cohen, "Fast Effective Rule Induction," Proceedings of the 12th International Conference on Machine Learning, Lake Tahoe, 1995, pp. 115-123.

[26] J. Hipp, U. Güntzer and G. Nakhaeizadeh, "Algorithms for Association Rule Mining - A General Survey and Comparison," ACM SIGKDD Explorations, Vol. 2, No. 1, 2000, pp. 58-64.
[27] M. Riesen, "Development of a Bayesian Belief Network Model of NCVS Data as a Generic Query Tool," Masters Project, Engineering, University of Toledo, Toledo, 2007.

[28] G. Davis, Private Communications, College of Law, University of Toledo, Toledo, 2008.

[29] P. Ventura, Private Communications, Criminal Justice, University of Toledo, Toledo, 2008.

[30] S. M. Catalano, "Criminal Victimization, 2005," Bureau of Justice Statistics, NCJ 214644. http://www.ojp.usdoj.gov/bjs/abstract/cv05.html