J. Intelligent Learning Systems & Applications, 2010, 2, 156-166
doi:10.4236/jilsa.2010.23019 Published Online August 2010 (http://www.SciRP.org/journal/jilsa)

Knowledge Discovery for Query Formulation for Validation of a Bayesian Belief Network

Gursel Serpen, Michael Riesen

Electrical Engineering and Computer Science, College of Engineering, University of Toledo; School of Law, University of Toledo, Toledo, USA.
Email: gserpen@eng.utoledo.edu, riesen@fraser-ip.com

Received February 23rd, 2010; revised July 6th, 2010; accepted July 20th, 2010.

ABSTRACT

This paper proposes machine learning techniques to discover knowledge in a dataset, in the form of if-then rules, for the purpose of formulating queries for validation of a Bayesian belief network model of the same data. Although domain expertise is often available, the query formulation task is tedious and laborious, and hence automation of query formulation is desirable. In an effort to automate the query formulation process, a machine learning algorithm is leveraged to discover knowledge in the form of if-then rules in the data from which the Bayesian belief network model under validation was also induced. The set of if-then rules is processed and filtered through domain expertise to identify a subset that consists of "interesting" and "significant" rules. The subset of interesting and significant rules is formulated into corresponding queries to be posed, for validation purposes, to the Bayesian belief network induced from the same dataset. The promise of the proposed methodology was assessed through an empirical study performed on a real-life dataset, the National Crime Victimization Survey, which has over 250 attributes and well over 200,000 data points. The study demonstrated that the proposed approach is feasible and provides automation, in part, of the query formulation process for validation of a complex probabilistic model, culminating in substantial savings in the human expert involvement and investment required.

Keywords: Rule Induction, Semi-Automated Query Generation, Bayesian Net Validation, Knowledge Acquisition Bottleneck, Crime Data, National Crime Victimization Survey

1. Introduction

Query formulation is an essential step in the validation of complex probabilistic reasoning models that are induced from data using machine learning or statistical techniques. Bayesian belief networks (BBN) have proven to be computationally viable empirical probabilistic models of data [1]. Advances in the machine learning, data mining, and knowledge discovery and extraction fields greatly aided the maturation of Bayesian belief networks, particularly for classification and probabilistic reasoning tasks. A Bayesian belief network can be created through a multitude of means: it can be induced solely from data, handcrafted by a domain expert, or built through a combination of these two techniques. A Bayesian belief network model essentially approximates the full joint probability distribution in the domain of interest. The development of a Bayesian belief network model is followed by a rigorous validation phase to ascertain that the model in fact approximates the full joint probability distribution reasonably well, even under the set of independence assumptions made. Validation is a comprehensive, multi-part process and often requires costly domain expert involvement and labor.
When a BBN model is used as a probabilistic reasoning engine, validation requires a complex and challenging approach, wherein a multitude of validation-related activities must be performed [2-5]; as part of one such activity, queries must be formed and posed to the network. Any subset of variables might be considered as evidence in such a query, which leads to the need to formulate an inordinate number of queries based on various subsets of variables. During validation by querying, a value assignment to some variables in the network is made and the posterior marginal probability or expectation of some other variables is desired. In other words, marginal probabilities and expectations can be calculated conditionally on any number of observations or evidence supplied to the network. It is also desirable, given that certain evidence is supplied, to ask for the values of non-evidence variables that result in the maximum possible posterior probability for the evidence, i.e., an explanation for the available evidence.
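In symbols, and purely as a sketch (with Q denoting a query variable, E = e the supplied evidence, and H the remaining non-evidence variables; this notation is ours, not the paper's), the two kinds of queries just described are:

```latex
% Posterior marginal of a query variable Q given evidence E = e:
P(Q \mid \mathbf{E} = \mathbf{e})
  = \frac{\sum_{\mathbf{h}} P(Q, \mathbf{H} = \mathbf{h}, \mathbf{E} = \mathbf{e})}
         {P(\mathbf{E} = \mathbf{e})}

% Explanation of the evidence: the assignment to the non-evidence
% variables with maximum posterior probability given E = e
\mathbf{h}^{*} = \arg\max_{\mathbf{h}} P(\mathbf{H} = \mathbf{h} \mid \mathbf{E} = \mathbf{e})
```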
One can specify a group of variables in the network to be estimated, or estimate all variables in the network collectively. The existing literature for validation of BBNs as probabilistic reasoning tools is sparse and mainly promotes ad hoc approaches or mechanisms.

The formulation of an appropriate "query" requires the use of extrinsic methods in order to discover relationships among attributes. More specifically, in forming a query, access to specific domain expertise can prove to be an efficient method for choosing which attributes to include as evidence and which attributes to identify for explanation or estimation. Experts in the domain of the data in focus can prove to be a useful resource in forming the queries. However, there are many challenges in utilizing domain experts in manual formulation of queries, and these challenges are in addition to the sheer cost and resources needed.

Conducting interviews with one or preferably more experts in the relevant field of interest is one of the preliminary steps in manual query formulation. Such interviews typically expose many issues and challenges associated with relying on experts in the field to focus and to form queries. Experts interviewed are likely to demonstrate an interest in forming unique queries that parallel their own expertise or interest, which might not fully overlap with the specific domain on which the model was built [6]. The list of potential queries suggested by the domain experts could prove to be inapplicable, as the specific dataset employed to develop the BBN model might not include all the attributes sought by the domain experts. In other circumstances, experts may be interested in applying local and regional attributes rather than the global or national attributes used in the dataset.

It is highly desirable to develop an automated procedure that formulates queries by leveraging the same dataset that was employed to induce the Bayesian belief network model. In similar terms, exploration of other, and possibly automated, options for generating useful and possibly non-obvious queries would be attractive. Data mining and machine learning techniques can be employed, through an inductive process, to automatically discover "queries" from a given dataset. More specifically, rule discovery and extraction algorithms can prove useful in "query formation". Examples of such algorithms are PART [7] and APRIORI [8].

1.1 Problem Statement

Validation of a complex Bayesian belief network, i.e., one that has on the order of hundreds of variables, induced from a large dataset like the National Crime Victimization Survey (NCVS), is a highly challenging task, since it requires major investment of resources and domain expertise while also being labor-intensive. Data mining and knowledge discovery algorithms are poised to offer a certain degree of relief from this challenge, and hence can be leveraged to automate segments of the overall process of query formation for validation. A machine learning or data mining algorithm can be leveraged to mine for rules in the dataset from which the Bayesian belief network model was induced, and these rules can then be formulated as queries for validation purposes.
The proposed study envisions processing a large and complex dataset through a rule-generation algorithm 1) to discover embedded knowledge in the form of if-then rules, and subsequently 2) to identify, through expert involvement, a subset of "interesting" and "significant" rules that can be formulated as queries for validation of the Bayesian belief network model of the dataset.

The next section discusses and elaborates on validation of a Bayesian belief network (BBN) model of a dataset, automatic query generation through a specific knowledge discovery tool, the NCVS dataset leveraged for this study, and the development of a BBN model on the same dataset. The subsequent section will demonstrate application of the proposed methodology to discover rules in the dataset, filtering of rules to identify an interesting and significant subset, mapping of chosen rules into queries, and application of such queries for validation purposes to a specific BBN model of a real-life-size dataset that has over 250 attributes and 200,000 data points, namely the National Crime Victimization Survey.

2. Background

This section discusses fundamental aspects of the problem being addressed. Elaborations on validating Bayesian belief networks when employed as probabilistic reasoning models, query formulation with the help of machine learning and data mining, the dataset used for the study, the National Crime Victimization Survey (NCVS), and the development of the Bayesian belief network model of the dataset are presented.

2.1 The NCVS Dataset

The National Crime Victimization Survey (NCVS) [9,10], previously the National Crime Survey (NCS), has been collecting data on personal and household victimization through an ongoing survey of a nationally representative sample of residential addresses since 1973. The geographic coverage is the 50 United States. The 'universe' is persons in the United States aged 12 and over in "core" counties within the top 40 National Crime Victimization Survey Metropolitan Statistical Areas (MSA). The sample used was a stratified multistage cluster sample. The NCVS MSA Incident data chosen for this study contains select household, person, and crime incident variables for persons who reported a violent crime within any of the core counties of the 40 largest MSAs from January 1979 through December 2004.
Household, person, and incident information for persons reporting non-violent crime are excluded from this file. The NCVS, which contains 216,203 instances and a total of 259 attributes, uses a labeling system for the attributes represented by letters and numbers. A typical attribute of interest is labeled by a five-character (alphanumeric) tag, e.g., V4529.

2.2 Bayesian Belief Network Model of NCVS Data

A Bayesian belief network (BBN) expresses a view of the joint probability distribution of a set of variables, given a collection of independence relationships. This means that a Bayesian belief network will correctly represent a joint probability distribution and simplify the computations if and only if the conditional independence assumptions hold. The task of determining a full joint distribution in a brute-force fashion is daunting: such calculations are computationally expensive and in some instances impossible. In order to address this formidable computational challenge, Bayesian belief networks are built upon conditional independence assumptions that appear to hold in many domains of interest.

A Bayesian belief network enables the user to extract a posterior belief. All causal relationships and conditional probabilities are incorporated into the network and are accessible through an automated inference process. A once tedious and computationally costly method of extracting posterior beliefs in a given domain becomes space-efficient and time-efficient. It is also possible to pose queries on any attribute of one's choosing, as long as it is one of those included in the model. One can just as easily adjust the prior evidence, enabling the user to compare and contrast posterior probabilities of a given attribute based on prior knowledge. The introduction of such a method has substantially increased the breadth and depth of feasible statistical analysis.

The BBN creation process consists of multiple phases. Following any preprocessing needed on a given dataset, the learning or training phase starts, wherein appropriate structure learner and parameter learner algorithms need to be selected empirically [11-17]. Learning a Bayesian belief network is a two-stage process: first learn a network structure, and then learn the probability tables. There are various software tools, some in the public domain and open source, to accomplish the development of a BBN through induction from data. For instance, the open-source and public-domain software tool WEKA [7], a machine learning tool that facilitates empirical development of clustering, classification, and functional approximation algorithms, has been leveraged to develop a BBN from the NCVS dataset for the study reported herein.

The validation phase can best be managed through a software tool that can implement the "probabilistic inferencing" procedure applicable to Bayesian belief networks. Another open-source and public-domain software tool, JavaBayes [18], was used for this purpose; it is able to import an already-built BBN model and facilitates, through its graphical user interface, querying of any attribute for its posterior probability value among many other options. A BBN model developed in WEKA can easily be imported into JavaBayes. Once imported, JavaBayes allows the user to identify and enter the evidence, and query a posterior belief of any attribute.
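The central object throughout this subsection is the factored joint distribution. As a brief formal reminder (standard BBN theory rather than anything specific to this paper's model): a network over variables X1, ..., Xn with parent sets Pa(Xi) encodes the joint distribution in the factored form

```latex
P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

which represents the full joint distribution exactly when the encoded conditional independence assumptions hold, and which reduces the storage cost from a table exponential in n to a set of tables exponential only in the maximum parent-set size.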
In this study, the BayesNet tool of WEKA has been used to induce a classifier with the "Victimization" attribute in the NCVS dataset as the class label [19]. The NCVS dataset has been split into training and test subsets with 66% and 33% ratios, respectively. Simulations were run for a variety of structure and parameter learning options. Results suggest that a number of BBN models performed exceptionally well as classifiers for the "Victimization" attribute in the NCVS dataset. All WEKA versions of the local hill climbers and local K2 search algorithms led to classification performances on the test subset with 98% or better accuracy. Since the classification accuracy rates were so close to each other, the value of the parameter "number of parent nodes" became significant, given that it directly relates to the approximation capability of the BBN with respect to the full joint distribution. Accordingly, the BBN model generated through the local K2 algorithm with Bayes learning and four parent nodes (the command-line syntax is "Local K2-P4-N-S BAYES" in WEKA format) was selected as the final network. This model, which can be obtained in BIF format from the authors upon request, has been used exclusively in the validation experiments reported in the following sections.
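A model of this kind might be induced programmatically with the WEKA 3.x Java API as in the minimal sketch below; this is an illustration under stated assumptions, not the authors' actual script. The file name ncvs.arff is a placeholder, the SimpleEstimator settings shown are WEKA defaults rather than values reported in the paper, and the K2 options mirror the quoted "Local K2-P4-N-S BAYES" selection.

```java
import weka.classifiers.bayes.BayesNet;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class InduceNcvsBbn {
    public static void main(String[] args) throws Exception {
        // Load the preprocessed NCVS data ("ncvs.arff" is a placeholder path).
        Instances data = new DataSource("ncvs.arff").getDataSet();
        // V4529 ("Victimization") serves as the class attribute; the name
        // must match the ARFF header of the actual file.
        data.setClassIndex(data.attribute("V4529").index());

        BayesNet bn = new BayesNet();
        // Local K2 search, at most 4 parents per node, empty initial
        // structure (-N), Bayes scoring, matching the paper's
        // "Local K2-P4-N-S BAYES" configuration; the estimator and its
        // -A 0.5 prior are assumed WEKA defaults.
        bn.setOptions(Utils.splitOptions(
            "-Q weka.classifiers.bayes.net.search.local.K2 -- -P 4 -N -S BAYES "
            + "-E weka.classifiers.bayes.net.estimate.SimpleEstimator -- -A 0.5"));
        bn.buildClassifier(data);

        // Export the learned network as XML BIF so it can be imported
        // into an inference tool such as JavaBayes.
        System.out.println(bn.toXMLBIF03());
    }
}
```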
2.3 Validation of Bayesian Belief Networks

Validation of a Bayesian belief network is a comprehensive process. Once the Bayesian belief network (BBN) is induced from the data and subsequently tuned by the domain experts, the next step is testing for validation of the premise that the network faithfully represents the full joint probability distribution subject to the conditional independence assumptions [5,20,21]. As part of the validation task, values computed by the BBN are compared with those supplied by the domain experts, statistical analysis, and the literature. Another distinct activity for validation entails querying any variable for its posterior distribution or posterior expectation, and obtaining an explanation for a subset of or all of the variables in the network. In that respect, knowledge discovery and data mining tools, in conjunction with the domain experts, are leveraged to formulate a set of so-called "interesting" and "significant" queries to pose to the BBN. Validating a BBN is no trivial task and necessitates ad hoc and empirical elements. More specifically, a comprehensive and rigorous process of evaluation and validation of a BBN model entails the following:

1) Perform an elicitation review that consists of reviewing the graph structure for the model, and reviewing and comparing probabilities with each other [22].

2) Carry out sensitivity analysis that measures the effect of one variable on another [3].

3) Implement validation using the data, which entails analysis of predictive accuracy and expected value calculations.

4) Conduct case-based evaluations that may include the following: run the model on test cases, compare the model output with expert judgment, and finally, compare the model predictions with the "ground truth" or accepted trends currently relied upon by experts in the domain of interest.

The case-based evaluation step is the most costly and challenging, since it requires substantial human expertise. In particular, elicitation of expert judgment to be leveraged for the validation of the Bayesian belief network poses a serious obstacle, since numerous test cases or "queries" must be generated and applied to the Bayesian belief network model. The expected values must be defined in advance by human experts to form a basis for comparison with those calculated by the network itself.

2.4 Query Formulation

Machine learning and data mining techniques may be leveraged to automatically discover "queries" for a given dataset. A query is the calculation of the posterior probabilities of any attribute or variable based upon the given prior evidence. When a user provides that a specific attribute is observed to have a (discrete) value, this 'evidence' may be used in calculating the posterior probability of a dependent variable. This is best understood by an example. Assume that the user makes a query for the posterior probability that a person will be a victim of burglary. This query is dependent upon the values observed for relevant attributes like the gender of the potential victim. If burglary is shown to be dependent upon the gender of the victim, then the prior observed value of male or female for the potential victim's gender will need to be supplied by the user in order to calculate the conditional probability of this incident.

This is analogous to an if-then rule: such a rule is a candidate for a query. One rule could postulate that "If the gender of the victim is female Then the probability of burglary will be greater than 0.60." By having such a rule at one's disposal, the process of making valid and knowledgeable queries can be streamlined. One does not necessarily have to rely solely on an expert for help to formulate "interesting" and "significant" queries. A rule set may be generated using one of many knowledge discovery algorithms, which can be structured to produce a set of if-then rules. Machine learning and data mining techniques prove useful for discovering knowledge that can be modeled as a set of if-then rules. Among the viable algorithms, PART [23], C4.5 or C5 [24], and RIPPER [25] from machine learning, and APRIORI [8] and its derivatives from data mining, are prominent.
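To make the rule-to-query correspondence concrete (using the illustrative burglary example above; the attribute names here are descriptive stand-ins, not actual NCVS labels), such a rule translates directly into a conditional probability statement whose computed value can be checked against the rule's claim:

```latex
% "If the gender of the victim is female Then the probability of
% burglary will be greater than 0.60" becomes the validation query
P(\text{Victimization} = \text{burglary} \mid \text{Gender} = \text{female})
  \;\stackrel{?}{>}\; 0.60
```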
3. Automation of Query Generation

This section presents the application of machine learning algorithms for knowledge discovery, in the form of if-then rules, on the NCVS dataset for the purpose of formulating queries to the Bayesian belief network model of the same dataset. Although data mining algorithms are also appropriate for knowledge extraction and subsequent automation of the query formulation process [26], their computational cost may quickly become prohibitive if care is not exercised. Decision tree or decision list based algorithms within the domain of machine learning are appealing in that they can generate a rule set for a given single attribute of interest, often within reasonable spatio-temporal cost bounds. Accordingly, the machine learning algorithm PART is chosen for the rule discovery and extraction task, given its desirable algorithmic and computational properties. The PART algorithm [23] combines two approaches, C4.5 [24] and RIPPER [25], in an attempt to avoid their respective disadvantages.

The main steps for validation of a Bayesian belief net model of data through automated query generation are shown in Figure 1.

Figure 1. Generic overview of steps for Bayesian belief net validation through automated query generation:
1) Use machine learning rule induction algorithms to derive a rule set in If-Then format from data.
2) Convert the rule set into a query set and filter the queries for "interestingness" and "significance" with the help of domain experts.
3) Apply selected queries to the Bayesian belief net model of the same data for validation purposes.
4) Solicit domain experts to evaluate the query responses by the Bayesian belief network model.

The rule induction algorithm PART is applied to the NCVS dataset in order to extract a set of rules. The same rules are leveraged, following further processing by domain experts, as queries to the BBN model of the NCVS
dataset for validation purposes. Initially, a subset of rules is labeled as "interesting" and "significant" by the domain experts, wherein "interesting" is a subjective labeling by a particular domain expert based upon the relationship of the evidence and the resultant projected probability of the THEN consequent variable. Next, these rules are formulated as queries, and the evidence associated with each query is supplied to the BBN model in JavaBayes. Posterior probability calculations performed by the JavaBayes reasoning or inferencing engine for the attribute(s) of interest, which can be any subset from the list, are compared to expected values. This is done to infer whether, in fact, the BBN model approximates reasonably well the joint probability distribution for the set of attributes entailed by the NCVS dataset.

3.1 PART Algorithm and Rules on NCVS Data

Rules that are derived from a dataset through a machine learning algorithm like PART expose the relationship between a subset of attributes and a single attribute of interest (the class label); in this case, the class label is designated as "Victimization" due to its significance in the domain. Any attribute can be designated as the class label, and doing so would require a separate run of the PART algorithm to generate the set of rules whose consequents are that class label. Through the PART algorithm, the knowledge entailed by the dataset is captured into a framework with a set of if-then rules. Specifically, the format for a rule complies with the following: IF premise THEN consequent, where the premise is a statement in the form of a logical conjunction of a subset of attribute-value pairs, and the consequent represents a certain type of victimization. We have used the WEKA implementation of the PART algorithm throughout this study. Available options for PART as implemented in the WEKA package, and their associated default settings, are shown in Table 1.

The NCVS Incident dataset was preprocessed prior to the rule induction step: the attribute count was reduced from 259 to 225 through removal of those that were not deemed to be relevant for the study. The attributes in the NCVS Incident dataset are represented, with a few exceptions, by a label that has four numeric characters preceded by the letter "V". The PART algorithm was applied to the NCVS dataset with default parameter values and V4529 (Victimization) as the class attribute. Values for the V4529 attribute are shown in Table 2. The algorithm was trained on a 66%-33% training-testing split of the NCVS dataset, and generated a list of 176 rules [27]. The rules output are in the traditional IF-THEN format, where the premise is the logical conjunction of a set of attribute-value pairs (i.e., evidence), followed by the consequent, which is a specific value of the class attribute. Table 3 illustrates one of the rules discovered by the PART algorithm on the NCVS data and its interpretation.

Table 1. Parameter options and default values for the WEKA PART algorithm

  Option       Explanation                                  Default
  -C <number>  Confidence threshold for pruning             0.25
  -M <number>  Minimum number of instances per leaf         2
  -R           Use reduced error pruning                    False
  -N <number>  Number of folds for reduced error pruning    3
  -B           Use binary splits for nominal attributes     False
  -U           Generate unpruned decision list              False
  -Q <seed>    Seed for random data shuffling               1
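A minimal sketch of this rule-induction step using the WEKA 3.x Java API follows; it is illustrative only. The file name ncvs.arff is a placeholder for the preprocessed 225-attribute dataset, and the random seed is an assumption consistent with, but not quoted from, the paper.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.PART;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MineNcvsRules {
    public static void main(String[] args) throws Exception {
        // "ncvs.arff" is a placeholder for the preprocessed dataset.
        Instances data = new DataSource("ncvs.arff").getDataSet();
        data.setClassIndex(data.attribute("V4529").index()); // class = Victimization

        // 66%/34% train-test split, mirroring the study's setup.
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        // Default PART settings correspond to Table 1 (-C 0.25 -M 2).
        PART part = new PART();
        part.buildClassifier(train);

        // The printed decision list is the set of IF-THEN rules whose
        // consequents are values of the class attribute V4529.
        System.out.println(part);

        // Held-out accuracy of the decision list on the test subset.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(part, test);
        System.out.println(eval.toSummaryString());
    }
}
```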
Table 2. Values for the NCVS attribute V4529

  Label   Description of Values for the "Victimization" Attribute V4529
  x60     Completed/attempted rape
  x61     Sexual attack/assault/serious assault
  x62     Attempted/completed robbery with injury from serious assault
  x63     Attempted/completed robbery with injury from minor assault
  x64     Attempted/completed robbery without injury
  x65     Attempted/completed aggravated assault
  x66     Threatened assault with weapon
  x67     Simple assault completed with injury
  x68     Assault without weapon without injury
  x69     Verbal threat of rape/sexual assault
  x70     Verbal threat of assault
  x71     Attempted/completed purse snatching and pocket picking
  x72     Burglary
  x73     Attempted forcible entry
  x74     Attempted/completed motor vehicle theft
  x75     Attempted/completed theft

3.2 Query Formulation Based on PART Rules

The process of query formulation using the PART rules, and posing the queries to the BBN model, entails human expert involvement and is the focus of the discussion in this section. A PART rule, which is captured through the "IF-premise-THEN-consequent" framework, readily lends itself to query formation: the premise becomes the prior evidence for a query, where a posterior probability value calculation is desired for the rule consequent. Such queries may be employed to validate, among other uses, the Bayesian belief network model of the full joint probability distribution of the 225 attributes in the NCVS dataset.
Table 3. A sample rule generated by the PART algorithm and its interpretation

  PART rule: V4113 = 0 & V4094 = 0 & V4119 = 0 & V4117 = 0 & V4118 = 0 & V4096 = 9: 67

  Interpretation: If the victim did not receive injuries from an attempted rape (V4113 = 0), and was not attacked in the form of rape (V4094 = 0), and was not knocked unconscious (V4119 = 0), and did not have broken bones or teeth as a result of the incident (V4117 = 0), and did not sustain any internal injuries (V4118 = 0), and could not answer whether (s)he was or was not a victim of sexual assault (V4096 = 9), Then there is a high probability that this person will be a victim of "Simple Assault Completed with Injury" (V4529 = x67).

The list of 176 rules generated by the PART algorithm was manually processed by domain experts, Gabrielle Davis [28] and Michael Riesen [27], to identify those that are interesting and significant for query formation, to serve as the validation set through the domain specialists' admittedly somewhat subjective perspective. The 49 rules identified accordingly, to be leveraged as queries to the BBN model of the NCVS dataset, are listed in [27].

Conversion of PART rules to queries, and posing the resulting queries to the JavaBayes realization of the BBN model, is a straightforward process and will be illustrated next. The middle column in Table 4 displays (in JavaBayes format) the posterior probability for the victimization attribute V4529 with no prior evidence observed, before any query is posed, as provided by the BBN model. One of the simple rules generated by PART, which will be used as an example query, is shown in Table 4. The premise part of the rule, i.e., V4127 = 2 AND V4095 = 1, is treated as prior evidence and supplied to the BBN model as such. Next, JavaBayes is asked to perform "reasoning" or "inference" through the BBN model of the NCVS data using the supplied prior evidence. Once the inferencing calculations are complete, the updated posterior probabilities for all discrete values of the victimization attribute are as shown in the rightmost column of Table 4. As an example, the probability value for the x60 value of the victimization attribute is now 0.612, a marked increase compared to the no-evidence case. Translating the NCVS notation of the above comparison, this rule indicates that when a victim is attacked in such a way that the victim perceived the incident as an attempted rape (V4095 = 1), and the victim was not injured to the extent that the victim received any medical care, including self-treatment (V4127 = 2), there is a 61% chance that this victim would be a victim of a completed or attempted rape (V4529 = x60).

Next, another, relatively more complex rule generated by the PART algorithm, shown in Table 5, was presented as a query to the BBN model in JavaBayes. In Table 6, the process of supplying the evidence as provided from this PART rule is shown. First, the prior evidence that the victim suffered no injuries related to attempted rape (V4113 = 0) is supplied. Then, further prior evidence is supplied through V4052 = 0, meaning that the offender did not use a rifle, shotgun, or any other gun different from a handgun.
More prior evidence is added in the form of V4050 = 3, indicating that there was a weapon used, but the specific type is not applicable as reported in the NCVS. In the final step, V4241 = 1 is provided as prior evidence. However, with this addition of V4241 = 1, JavaBayes running in the Java Runtime Environment generated an OutOfMemory exception, although the heap size was set to 3.5 GB. Nevertheless, for each of the reportable cases, the corresponding posterior probability table for the NCVS Victimization attribute V4529 is displayed. As shown in Table 6, the inclusion of each further piece of evidence has a direct effect on the posterior probability of the consequent (i.e., the so-called "Then" part of a rule), which can be observed through the value of the x65 discrete label for the class attribute V4529.

Table 4. A PART rule (V4127 = 2 & V4095 = 1: 60), associated JavaBayes query, and updated posterior probability values for V4529 with increasing evidence

  V4529 label        No evidence    Evidence: V4127 = 2 & V4095 = 1
  p(x60|evidence)    0.004          0.612
  p(x61|evidence)    0.001          0.005
  p(x62|evidence)    0.005          0.006
  p(x63|evidence)    0.005          0.009
  p(x64|evidence)    0.025          0.003
  p(x65|evidence)    0.036          0.003
  p(x66|evidence)    0.006          0.069
  p(x67|evidence)    0.022          0.008
  p(x68|evidence)    0.055          0.003
  p(x69|evidence)    0.000          0.007
  p(x70|evidence)    0.019          0.227
  p(x71|evidence)    0.018          0.010
  p(x72|evidence)    0.113          0.007
  p(x73|evidence)    0.032          0.009
  p(x74|evidence)    0.053          0.008
  p(x75|evidence)    0.598          0.008
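The mechanical part of this rule-to-query conversion is simple enough to automate. The following is a purely illustrative helper, not part of the paper's tooling: it splits a PART rule string, in the notation of Table 4, into its evidence assignments and its expected class value, ready to be entered as observations in a JavaBayes-style inference tool.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: parse a PART rule such as "V4127 = 2 & V4095 = 1: 60"
// into (a) the evidence assignments forming the query's prior evidence and
// (b) the class value whose posterior the query is expected to raise.
public class RuleToQuery {
    public static void main(String[] args) {
        String rule = "V4127 = 2 & V4095 = 1: 60";
        String[] parts = rule.split(":");
        String consequent = "x" + parts[1].trim();        // class label, e.g. x60

        Map<String, String> evidence = new LinkedHashMap<>();
        for (String clause : parts[0].split("&")) {
            String[] pair = clause.split("=");
            evidence.put(pair[0].trim(), pair[1].trim()); // attribute -> observed value
        }

        // The corresponding query asks for the posterior of V4529 given
        // this evidence, to be compared against the rule's consequent.
        System.out.println("probability(\"V4529\" | " + evidence
                + "), expected dominant value: " + consequent);
    }
}
```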
Table 5. PART rule and associated JavaBayes query

  PART rule: V4113 = 0 AND V4052 = 0 AND V4050 = 3 AND V4241 = 1: 65
  Corresponding JavaBayes query syntax: Posterior distribution: probability("V4529" | V4113 = 0, V4052 = 0, V4050 = 3, V4241 = 1)

Table 6. Posterior probabilities for the Victimization attribute V4529 with progressively increasing prior evidence (fractions truncated beyond the third significant digit)

  V4529 value        P(V4529|V4113=0)   P(V4529|V4113=0, V4052=0)   P(V4529|V4113=0, V4052=0, V4050=3)
  p(x60|evidence)    0.032              0.023                       0.028
  p(x61|evidence)    0.004              0.005                       0.003
  p(x62|evidence)    0.064              0.195                       0.210
  p(x63|evidence)    0.066              0.003                       0.002
  p(x64|evidence)    0.083              0.073                       0.086
  p(x65|evidence)    0.206              0.624                       0.630
  p(x66|evidence)    0.010              0.024                       0.010
  p(x67|evidence)    0.259              0.003                       0.002
  p(x68|evidence)    0.245              0.001                       0.001
  p(x69|evidence)    0.000              0.000                       0.000
  p(x70|evidence)    0.020              0.037                       0.021
  p(x71|evidence)    0.000              0.001                       0.000
  p(x72|evidence)    0.000              0.000                       0.000
  p(x73|evidence)    0.000              0.000                       0.000
  p(x74|evidence)    0.000              0.000                       0.000
  p(x75|evidence)    0.001              0.000                       0.000

3.3 Validation of the NCVS BBN Model through PART-Induced Queries

Each of the 49 rules that were identified as "interesting" and "significant" by the domain experts was carefully considered as a test query. In light of the memory limitation encountered earlier, the original rules had to be altered in order for the system to be able to compute the posterior probabilities within the memory constraints of the system available. Accordingly, some of the rules were eliminated due to memory limitations: a total of 22 rules were selected, revised, and included in the query list. Table 7 shows the revised versions of the rules supplied by the PART algorithm, which were computable and hence were applied as queries to the BBN model of the NCVS data. The attributes or evidence variables in each rule were ranked by domain experts [28,29] in order of interest (i.e., importance to study of the domain). The domain experts were able to classify two general groups of "interesting" and "significant" rules: 1) rules listing IF premises that produced an unexpected result; and 2) rules that were in direct alignment with the accepted standards in the domain.

Some attributes that originally appear in a specific rule and were ranked low by the experts were excluded from the corresponding query due to memory constraints. As a result of the exclusion of certain attribute-value pairs from many of the 22 rules used as queries, the consequent attribute value is likely to be affected and may change from the value indicated by the original rule induced by the PART rule discovery algorithm. Each revised rule in Table 7 is indicated with an (R) next to the number of the rule.

The posterior probabilities computed by JavaBayes for each rule in Table 7, upon its being posed as a query, are displayed in Table 8, where only significant probability values are denoted for the sake of presentation clarity. Table 9 presents the rules recovered from the computed probabilities in Table 8, to demonstrate comparatively the differences between the revised rules in Table 7 and those computed by the BBN model of the NCVS data. In formulating the rules in Table 9, any consequent attribute value that has a comparatively significant probability value was included.
Due to the revision of the original rules induced from the NCVS data, there are differences between the consequents of the rules in Tables 7 and 9. Although there are discrepancies between the consequents of the rules in Tables 7 and 9, the knowledge exposed by the PART rules is still present to a large degree. The value "x75" represents the crime of attempted or completed theft and is a dominant value for the victimization attribute: with no evidence being presented, "x75" represents nearly 60% of all crimes reported in the NCVS. Interestingly, the PART rules have extracted a second layer of usable information. The revised rules are not necessarily "incorrect" but show how a particular set of values can drastically affect the outcome of the victimization attribute. For example, rule 10 in unrevised form provides that the victimization attribute should have a large value for "x71". As noted in Tables 8 and 9, "x71" is not the dominant value for the revised rule 10. However, the change in the posterior probability for the value "x71" from 1.8% to 18% is nevertheless noteworthy. Where the rules generated by the PART algorithm are queried exactly as they appear, the consequents of the rules hold true as the dominant value. Since certain queries fail due to memory errors, the rules had to be revised to demonstrate at least a portion of the knowledge extracted by the original PART-induced rules.
Table 7. Revised query list based on PART rules; "(R)" marks a rule revised from its original PART form

  Rule 1 (R): IF V4065 = 1 & V4026 = 9 & V3018 = 1 & V3024 = 2 THEN V4529 = 75
  Rule 2 (R): IF V4052 = 0 & V4083 = 9 & V4094 = 0 & V4095 = 0 & V4024 = 7 THEN V4529 = 65
  Rule 3 (R): IF V4052 = 0 & V4112 = 0 & V4113 = 0 & V4095 = 0 & V4094 = 0 & V4024 = 1 THEN V4529 = 65
  Rule 4 (R): IF V4052 = 0 & V4094 = 0 & V4095 = 0 & V4111 = 0 & V4024 = 2 THEN V4529 = 65
  Rule 5 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 5 THEN V4529 = 71
  Rule 6 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 7 & V3018 = 2 & MSACC = 17 THEN V4529 = 71
  Rule 7 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 7 & V3018 = 2 & MSACC = 26 THEN V4529 = 71
  Rule 8 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 2 THEN V4529 = 71
  Rule 9 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 7 & MSACC = 4 THEN V4529 = 71
  Rule 10 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 7 & V3015 = 5 THEN V4529 = 71
  Rule 11 (R): IF V4322 = 9 & V4128 = 1 & V4094 = 0 & V4095 = 0 & V4052 = 0 & V4051 = 0 & V4289 = 2 THEN V4529 = 65
  Rule 12 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 7 THEN V4529 = 71
  Rule 13 (R): IF V4322 = 9 & V4065 = 1 & V4307 = 0 & V4024 = 8 THEN V4529 = 71
  Rule 14: IF V4322 = 9 & V4065 = 1 & V4285 = 9 & V4307 = 0 & V4024 = 7 & MSACC = 35 THEN V4529 = 71
  Rule 15 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 3 THEN V4529 = 71
  Rule 16 (R): IF V3024 = 2 & V3020 = 23 & V2045 = 1 THEN V4529 = 71
  Rule 23: IF V4073 = 0 & V4029 = 9 & V3018 = 2 & V4152 = 9 & V2045 = 2 & V3019 = 2 THEN V4529 = 75
  Rule 35: IF V4322 = 9 & V4052 = 0 & V4081 = 9 & V4095 = 0 & V4094 = 0 & V4096 = 9 & V4036 = 9 & V4024 = 5 THEN V4529 = 65
  Rule 45 (R): IF V4065 = 1 & V4029 = 9 & V3018 = 2 THEN V4529 = 75
  Rule 46 (R): IF V3020 = 8 THEN V4529 = 71
  Rule 47 (R): IF V3020 = 24 & V3014 = 3 THEN V4529 = 75
  Rule 48 (R): IF V4113 = 0 & V4052 = 0 & V4050 = 3 THEN V4529 = 65

The query results for the revised PART rules were reviewed by two domain experts [28,29]. In the majority of the cases, both experts found the predicted posterior probabilities to be reasonable and in accord with the current statistical trends provided by conventional means. As an example, the Bureau of Justice Statistics (BJS) provides periodic statistical reports [9]. The BJS reported that, based upon violent crime statistics from 1973-2005, beginning with the 25-34 age category, the rate at which persons were victims of violent crimes declined significantly as the age category increased [30]. The BJS also reports that, in general, males experienced higher victimization rates than females for all types of violent crime except rape/sexual assault [9]. Where the generated rules included attributes (e.g., V3014 (Age), V3018 (Gender), and V4024 (location of incident)) that were consistent with known and generally accepted trends, the experts were not surprised by the values predicted and agreed that the posterior probabilities based upon each set of evidence attributes were not in the extremes, based upon current publications in the field. The values were not unexpectedly high and thus did not trigger a shocked response. Conversely, the posterior values were not inordinately low compared to expected results, and thus the validity of the predicted values was not drawn into question. Rules 11, 35, and 48 were highlighted by the experts as the strongest rules, having the most sensible values for posterior prediction as compared to the generally accepted statistical values presented in currently available publications and studies. In particular, the experts easily identified a known relationship or correlation between the IF premise and the consequent for each of the rules 11, 35, and 48. In each of these three strongest rules, the experts found that the prior evidence values clearly set the stage for the associated posterior victimization predictions.
Overall, both experts indicated that the responses computed by the BBN model of the NCVS data to all queries posed were generally expected and reasonable, suggesting that the model is realistic and accordingly is a good approximation to the joint probability distribution.

As an exception to the generally positive feedback, rule 10 was found to be somewhat extraordinary. Rule 10 included the attribute indicating that the victim was never married (V3015 = 5). A value of 5 for V3015 shows a distinct increase in the probability of a purse snatching or pocket picking. Domain experts were surprised to find that this evidence value would have such an impact on the posterior probability of pocket picking. Although the posterior prediction was not necessarily discounted, the experts were skeptical, absent a more thorough explanation of the increased victimization. However, the skepticism did not detract from the intriguing prospect that the generated rule might have exposed "new" knowledge. As the experts reviewed the list of rules, the inclusion of certain "unusual" or unexpected attributes, similar to the attribute uncovered by rule 10, stimulated the most feedback from the domain experts. The experts were interested in further investigation of the "new" and "unusual"
combinations of attribute-value pairs presented in the generated rules, stating that the rules could provide a starting point for further research into factors that may not have been fully developed with conventional methods.

Table 8. Query results as probability values for revised PART rules in Table 7 (only the highest probability values are shown, listed in increasing order of the V4529 categories x61 through x75; fractions truncated beyond the second significant digit)

  Rule 1:  0.04, 0.20, 0.03, 0.68
  Rule 2:  0.26, 0.09, 0.61, 0.01
  Rule 3:  0.22, 0.06, 0.67
  Rule 4:  0.23, 0.74
  Rule 5:  0.05, 0.13, 0.06, 0.02, 0.36, 0.32
  Rule 6:  0.04, 0.09, 0.28, 0.12, 0.41
  Rule 7:  0.03, 0.01, 0.08, 0.36, 0.09, 0.38
  Rule 8:  0.02, 0.14, 0.02, 0.01, 0.30, 0.46
  Rule 9:  0.04, 0.15, 0.21, 0.12, 0.41
  Rule 10: 0.07, 0.17, 0.18, 0.01, 0.07, 0.43
  Rule 11: 0.98, 0.01
  Rule 12: 0.05, 0.15, 0.19, 0.09, 0.44
  Rule 13: 0.01, 0.12, 0.07, 0.69
  Rule 14: 0.01, 0.03, 0.35, 0.08, 0.49
  Rule 15: 0.01, 0.07, 0.16, 0.04, 0.02, 0.14, 0.47
  Rule 16: 0.02, 0.03, 0.11, 0.03, 0.05, 0.59
  Rule 23: 0.01, 0.20, 0.11, 0.02, 0.63
  Rule 35: 0.09, 0.06, 0.78, 0.03
  Rule 45: 0.02, 0.20, 0.10, 0.03, 0.62
  Rule 46: 0.03, 0.04, 0.03, 0.07, 0.08, 0.04, 0.59
  Rule 47: 0.05, 0.10, 0.03, 0.05, 0.60
  Rule 48: 0.21, 0.08, 0.63, 0.02

The implications of using a rule-generating algorithm such as PART to essentially generate queries are potentially profound. Limitations associated with user bias and limited domain knowledge may impede the self-generation of useful and interesting queries. Using PART as an automatic query generation tool could potentially uncover a not-so-obvious relationship between prior evidence and the resulting posterior probability of another attribute. Applying this principle to the NCVS data, the practical significance lies in uncovering the specific attributes of a victim or circumstance that make them more or less likely to be the victim of a specific crime. As an example of practical implementation within the context of criminal justice, by identifying the relationships that have the greatest impact on posterior probability, resources can be channeled into areas where they would be most effective in combating violent crime.

Domain experts indicated that automatic query generation using the PART algorithm or an equivalent would be helpful not only in discovering any hidden or novel relationships between attributes, but, more practically, as a method to reinforce trends and relationships already relied upon in the field. A second group of domain experts (six professors at the University of Toledo College of Law) were independently interviewed and asked to provide a list of self-generated queries that would be of personal interest. None of the second group was able to provide a list of more than three potential queries. The second group was then presented with the automatically generated queries. All experts in the second group found that
the collection of automatically generated queries was relatively easy to review compared to the alternative of postulating a user-defined list of rules and queries.

Table 9. Rules reconstructed from the probability values in Table 8 (only modified rules are shown)

  Rule 5 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 5 THEN V4529 = 74 & 75
  Rule 6 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 7 & V3018 = 2 & MSACC = 17 THEN V4529 = 71 & 75
  Rule 7 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 7 & V3018 = 2 & MSACC = 26 THEN V4529 = 71 & 75
  Rule 8 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 2 THEN V4529 = 74 & 75
  Rule 9 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 7 & MSACC = 4 THEN V4529 = 71 & 75
  Rule 10 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 7 & V3015 = 5 THEN V4529 = 70 & 71 & 75
  Rule 12 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 7 THEN V4529 = 70 & 71 & 75
  Rule 13 (R): IF V4322 = 9 & V4065 = 1 & V4307 = 0 & V4024 = 8 THEN V4529 = 71 & 75
  Rule 14: IF V4322 = 9 & V4065 = 1 & V4285 = 9 & V4307 = 0 & V4024 = 7 & MSACC = 35 THEN V4529 = 71 & 75
  Rule 15 (R): IF V4322 = 9 & V4065 = 1 & V4024 = 3 THEN V4529 = 70 & 74 & 75
  Rule 16 (R): IF V3024 = 2 & V3020 = 23 & V2045 = 1 THEN V4529 = 72 & 75
  Rule 23: IF V4073 = 0 & V4029 = 9 & V3018 = 2 & V4152 = 9 & V2045 = 2 & V3019 = 2 THEN V4529 = 71 & 75
  Rule 35: IF V4322 = 9 & V4052 = 0 & V4081 = 9 & V4095 = 0 & V4094 = 0 & V4096 = 9 & V4036 = 9 & V4024 = 5 THEN V4529 = 62 & 65
  Rule 45 (R): IF V4065 = 1 & V4029 = 9 & V3018 = 2 THEN V4529 = 71 & 75
  Rule 46 (R): IF V3020 = 8 THEN V4529 = 75

Each of the experts in the second group agreed that it is sometimes difficult to consider the impact of a particular variable, especially if the particular variable is not one that has been extensively researched using other known techniques. In this way, the automatic rule generation may also be used as a reliable method to test prior hypotheses. Each member of the second group also agreed that an automatically generated list of rules provided a catalyst to the generation of user-defined rules and queries. At a minimum, the relationships of the attributes presented in the generated rules caused members of the second group to reflect upon their own conceptions of trends in victimization, which ultimately resulted in a wholesale request for more information on the resultant effect of certain unexpected attributes on the posterior probability of victimization.

4. Conclusions

This paper presented an approach to address the acquisition bottleneck problem in generating human-expert-formulated queries for validation of a Bayesian belief network model. A machine learning based approach for rule discovery from a dataset, with the discovered rules serving as potential queries, was proposed. The proposed technique employs machine learning (and potentially data mining) algorithms to generate a set of classification or association rules that can be converted into corresponding queries with minimal human intervention and processing, in the form of filtering for interestingness and significance by domain experts. The application and utility of the proposed methodology for semi-automated query formulation based on rule discovery was demonstrated through validation of a Bayesian belief network model of a real-life-size dataset from the domain of criminal justice.

REFERENCES

[1] D. Heckerman, "Bayesian Networks for Data Mining," Data Mining and Knowledge Discovery, Vol. 1, No. 1, 1997, pp. 79-119.

[2] K. B. Laskey and S. M. Mahoney, "Network Engineering for Agile Belief Network Models," IEEE Transactions on Knowledge and Data Engineering, Vol. 12, No. 4, 2000, pp. 487-498.
[3] K. B. Laskey, "Sensitivity Analysis for Probability Assessments in Bayesian Networks," Proceedings of the Ninth Annual Conference on Uncertainty in Artificial Intelligence, Washington, D.C., 1993, pp. 136-142.

[4] M. Pradhan, G. Provan, B. Middleton and M. Henrion, "Knowledge Engineering for Large Belief Networks," Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence, Seattle, Washington, 1994, pp. 484-490.

[5] O. Woodberry, A. E. Nicholson and C. Pollino, "Parameterising Bayesian Networks," In: G. I. Webb and X. Yu, Eds., Lecture Notes in Artificial Intelligence, Springer-Verlag, Berlin, Vol. 3339, 2004, pp. 1101-1107.

[6] S. Monti and G. Carenini, "Dealing with the Expert Inconsistency in Probability Elicitation," IEEE Transactions on Knowledge and Data Engineering, Vol. 12, No. 4, 2000, pp. 499-508.

[7] I. H. Witten and E. Frank, "Data Mining: Practical Machine Learning Tools and Techniques," 2nd Edition, Morgan Kaufmann, San Francisco, 2005.

[8] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, 1994, pp. 487-499.
[9] US Department of Justice, Bureau of Justice Statistics, "National Crime Victimization Survey: MSA Data, 1979-2004," Inter-university Consortium for Political and Social Research, Ann Arbor, 2007. http://www.icpsr.umich.edu/cocoon/NACJD/STUDY/04576.xml

[10] T. C. Hart and C. Rennison, Bureau of Justice Statistics, "Special Report," March 2003, NCJ 195710. http://www.ojp.usdoj.gov/bjs/abstract/rcp00.html

[11] R. Blanco, I. Inza and P. Larrañaga, "Learning Bayesian Networks in the Space of Structures by Estimation of Distribution Algorithms," International Journal of Intelligent Systems, Vol. 18, No. 1, 2003, pp. 205-220.

[12] R. Bouckaert, "Belief Networks Construction Using the Minimum Description Length Principle," Lecture Notes in Computer Science, Springer-Verlag, Berlin, Vol. 747, 1993, pp. 41-48.

[13] L. M. de Campos, J. M. Fernández-Luna and J. M. Puerta, "An Iterated Local Search Algorithm for Learning Bayesian Networks with Restarts Based on Conditional Independence Tests," International Journal of Intelligent Systems, Vol. 18, No. 2, 2003, pp. 221-235.

[14] J. Cheng, R. Greiner, J. Kelly, D. A. Bell and W. Liu, "Learning Bayesian Networks from Data: An Information-Theory Based Approach," Artificial Intelligence, Vol. 137, No. 1-2, 2002, pp. 43-90.

[15] D. Heckerman, D. Geiger and D. M. Chickering, "Learning Bayesian Networks: The Combination of Knowledge and Statistical Data," Machine Learning, Vol. 20, No. 3, 1995, pp. 197-243.

[16] T. V. Allen and R. Greiner, "Model Selection Criteria for Learning Belief Nets: An Empirical Comparison," Proceedings of the International Conference on Machine Learning, Stanford, 2000, pp. 1047-1054.

[17] Y. Guo and R. Greiner, "Discriminative Model Selection for Belief Net Structures," Proceedings of the Twentieth National Conference on Artificial Intelligence, Pittsburgh, 2005, pp. 770-776.

[18] F. G. Cozman, "JavaBayes Software Package," University of São Paulo, cited 2006. http://www.cs.cmu.edu/~fgcozman/home.html

[19] R. Bouckaert, "Bayesian Network Classifiers in Weka," Technical Report, Department of Computer Science, Waikato University, Hamilton, 2005.

[20] M. J. Druzdzel and L. C. van der Gaag, "Building Probabilistic Networks: Where Do the Numbers Come from?" IEEE Transactions on Knowledge and Data Engineering, Vol. 12, No. 4, 2000, pp. 481-486.

[21] T. Boneh, "Visualisation of Structural Dependencies for Bayesian Network Knowledge Engineering," Masters Thesis, University of Melbourne, Melbourne, 2002.

[22] M. J. Druzdzel and L. C. van der Gaag, "Elicitation of Probabilities for Belief Networks: Combining Qualitative and Quantitative Information," Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence, Seattle, 1995, pp. 141-148.

[23] E. Frank and I. H. Witten, "Generating Accurate Rule Sets without Global Optimization," Proceedings of the Fifteenth International Conference on Machine Learning, Madison, 1998, pp. 144-151.

[24] J. R. Quinlan, "C4.5: Programs for Machine Learning," Morgan Kaufmann Publishers, San Mateo, 1993.

[25] W. W. Cohen, "Fast Effective Rule Induction," Proceedings of the 12th International Conference on Machine Learning, Lake Tahoe, 1995, pp. 115-123.

[26] J. Hipp, U. Güntzer and G. Nakhaeizadeh, "Algorithms for Association Rule Mining - A General Survey and Comparison," ACM SIGKDD Explorations, Vol. 2, No. 1, 2000, pp. 58-64.
[27] M. Riesen, "Development of a Bayesian Belief Network Model of NCVS Data as a Generic Query Tool," Masters Project, Engineering, University of Toledo, Toledo, 2007.

[28] G. Davis, Private Communications, College of Law, University of Toledo, Toledo, 2008.

[29] P. Ventura, Private Communications, Criminal Justice, University of Toledo, Toledo, 2008.

[30] S. M. Catalano, "Criminal Victimization, 2005," Bureau of Justice Statistics, NCJ 214644. http://www.ojp.usdoj.gov/bjs/abstract/cv05.html