We investigated the application of Causal Bayesian Networks (CBNs) to large data sets in order to predict user intent via internet search prediction. Sample data are taken from search engine logs (Excite, Altavista, and Alltheweb). These logs are parsed and sorted to create a data structure that is used to build a CBN. This network is used to predict the next term or terms that the user may be about to search (type). We examined the application of CBNs, compared with Naive Bayes and Bayes Net classifiers, on very large datasets. To simulate our proposed results, we took a small sample of search data logs to predict intentional query typing. Additionally, problems that arise with the use of such a data structure are addressed individually, along with the solutions used and their prediction accuracy and sensitivity.
Bayesian networks modeled with cause and effects with each variable represented by a node, and causal relationships by an arrow (an edge), are known as Causal Bayesian Networks (CBNs) [
The crow and fox problem is implemented in three tiers (
The elder care problem (shown in
These CBNs represent the inherent logical causes so that, when the user performs an action, the network can infer his/her intent and thus why he/she is performing the action. This differs somewhat from this project: while they attempt to determine why the agent is performing a given action, we attempt to determine what the agent is going to do next. Beyond the examples above from Pereira [
Accordingly, some recent work addresses dynamic BN application in traffic flow count [
For a large dataset, analyzing causal intention requires creating and using a large number of nodes, which is cumbersome and a challenging research issue. The reason is that search queries require the exploration of large heterogeneous data sets in order to deal with missing values and uncertainties and to determine patterns and relationships.
CBNs are currently used for intention prediction, specifically for assisting a user with a given task. Historically, however, these CBNs have been restricted to very small, controlled datasets and do not implement the ability to learn and modify their own behavior. Implementation with larger and evolving datasets creates several obstacles that must be addressed for such an implementation to be feasible. The use of search queries in particular creates additional problems with the acyclic nature of CBNs. Hence, incorporating causal variables (causes and effects) into a Bayesian Network (BN) is a rational approach to identifying and modeling intentional query searches, but it imposes some challenges. The first and foremost is the creation of the CBN itself, as most tools require a specific model for the implementation of a CBN. When working with Big Data, manual entry of data into a CBN is not feasible. As such, a method must be created to automatically populate such a structure and/or to create a unique implementation of a CBN specifically for use with large datasets. Secondly, the probabilities for such a large and specific dataset cannot be inferred, assumed, or calculated by hand. An algorithm must be created and used to determine the probabilities for each possible configuration of the CBN. The occurrence of novel data must also be accounted for and factored into the final product.
This paper aims to expand the use of CBNs to much larger data sets in order to test the potential scalability of such a network. As with any Big Data problem, memory usage must be considered. The storage and access of the data must be efficient in order to be practical. Growth factors for continued learning must also be considered in this respect. Finally, an overarching issue considered throughout this work is the run time of the algorithm used. When dealing with Big Data, algorithmic efficiency is a key factor, and algorithms that run in O(N) time should be the minimum standard.
For the parser, we chose a script format, which was more straightforward but resulted in several versions of the parser rather than a modular design. We opted against a modular design because each search log had to be checked manually for format. From there it was simplest to modify the existing script to take the new log into account. Log styles differed between search engines as well as between years.
Each search log used a different tab-delimited format. Some started with the ID of the user; some included information such as date and time. Extraneous information such as date and time was discarded first, by automatically excluding certain columns of data. We also found that users would often make the same search query repeatedly during the same session. These identical repeated search queries were discarded, as we felt that they did not represent unique relationships. From there, extraneous characters were removed, and all queries were converted to lower case in order to strengthen relationships between words and prevent duplicate nodes that differ only in capitalization. Then the ID of the user and their search queries were written to a log file to preserve them. The IDs ranged from a session ID to an IP address, but we felt that they were unique enough as identifiers to be mixed in their varying styles. The ID was kept only for reference, as it was trivially easy to take it out.
First, the file was read into memory and delimited into a list by the newline character. An initial length was taken, for record and debugging purposes, and to see how effective the script was.
The initial for loop was the primary difference between each script. It dictated which tab-delimited columns were kept and loaded into the list and which were ignored. This section was kept to a runtime of O(n). Columns that were useful were appended to the existing list in order to keep the runtime at O(n). Steps are shown in
The next for loop checks for duplicates. It first checks for empty nodes and nodes with no search entry, which were logged by some engines. We also removed any newline characters and any entries that were clearly searches for a web address, although this check was not robust. We decided that interpreting users’ intentions was outside the scope of this project; while a best effort was made to strip out and sanitize our nodes, if the user wanted to search “4.jpg,” that search, along with other unusual searches, was flushed out when we calculated probabilities. Spelling errors were also outside the scope of this project, but when running the project, we found some common spelling errors to be the “most frequent” next node. Again, the law of large numbers suggests that these differences will not be a problem; that is, frequent misspellings will form a large enough population that they will have a substantial data set of their own. The final step of this loop was checking for duplicate entries by the same user ID. We then filtered all the empty nodes out of the list.
After this, unwanted characters were filtered out. We took all printable characters and removed the letters, numbers, space, and period from that set, so that image file names such as “4.jpg” would not be split into two separate nodes; the characters that remained were the ones stripped from the queries. We then reduced all remaining white space to a single space in an O(n²) operation. Then we took the final length and printed our results.
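The parsing steps described above can be sketched compactly. The following is a minimal illustration in Python 3 (the appendix listings use Python 2 syntax), not the authors' script. The function names `clean_query` and `parse_log`, the column indices, and the choice to replace unwanted characters with a space are assumptions made for the example; the 20-entry duplicate window mirrors the constant that appears in the Note 1 listing.

```python
import re
import string

# Characters kept in a query: letters, digits, period, and space.
_KEEP = set(string.ascii_letters + string.digits + ". ")

def clean_query(text):
    """Lowercase, replace unwanted characters with spaces, collapse whitespace."""
    kept = "".join(ch if ch in _KEEP else " " for ch in text.lower())
    return re.sub(r" +", " ", kept).strip()

def parse_log(lines, id_col=0, query_col=1, window=20):
    """Return (user_id, query) pairs from tab-delimited log lines.

    id_col and query_col are hypothetical column positions; each real log
    format would need its own values.
    """
    rows = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= max(id_col, query_col):
            continue  # malformed or empty line
        query = clean_query(cols[query_col])
        if query:  # skip entries with no search text
            rows.append((cols[id_col], query))
    result = []
    for row in rows:
        # Drop a repeat of the same (id, query) among the last `window` kept rows.
        if row in result[-window:]:
            continue
        result.append(row)
    return result

sample = ["u1\tCheap  Computer!\n", "u1\tcheap computer\n", "u2\t4.jpg\n"]
print(parse_log(sample))
```

Using a regular expression to collapse whitespace keeps that step linear in the length of the query, in contrast to the repeated string replacement described in the text.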
The Bayesian Network (BN) is a class of multivariate statistical models applicable to many areas in science and technology. In particular, the Bayesian Network has become popular as an analytical framework in causal studies, where the causal relations are encoded by the structure (or topology) of the network. A Causal Bayesian Network (CBN) represents a Bayesian network as a directed acyclic graph (DAG).
In
A causal observation provides information about statistical relations among a number of events. There are three common statistical relations that represent the principle of common causes between two events “X” and “Y”: 1) X causes Y, 2) Y causes X, or 3) both events are generated by a third event “Z” or set of events, their common cause. For example, searching for a “computer” and searching for a “computer desk” are statistically related because owning a computer leads people to buy a desk for it. Similarly, searching for a “computer” may lead to searching for different computers or printers. In searching for different computers, the user may compare various features of the computer, including the price. In these ways, a user may search for a computer within his/her budget or a computer with various features regardless of price. Hence, the causal observation of one of these events helps the model infer whether other events within the underlying causal model will exist or not.
Interventions often enable us to differentiate among the different causal structures that are compatible with an observation. If we manipulate an event A and nothing happens to event B, then A cannot be the cause of B, but if a manipulation of event B leads to a change in A, then we know that B is a cause of A, although there might be other causes of A as well. Forcing some people to go on a diet can tell us whether the diet increases or decreases the risk of obesity. Alternatively, changing people’s weight by making them exercise would show whether body mass is causally responsible for dieting. In contrast to observations, however, interventions do not provide positive or negative diagnostic evidence about the causes of the event on which we intervened. Whereas observations of events allow us to reason diagnostically about their causes, interventions make the occurrence of events independent of their typical causes.
Counterfactual reasoning tells us what would have happened if events other than the ones we are currently observing had happened. If we are currently observing that both A and B are present, then we can ask ourselves if B would still be present if we had intervened on A and caused its absence. If we know that B is the cause of A, then we should infer that the absence of A makes no difference to the presence of B because effects do not necessarily affect their causes. But, if our intervention had prevented B from occurring, then we should infer that A also would not occur.
The graph (
Common cause:
Causal chain:
Common effect:
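The equations for these three structures did not survive in this copy of the text. As a hedged sketch only, the standard Markov factorizations for three-variable models are given below. The variable roles are assumptions: X is taken as the parentless cause in the common cause model (as the following paragraph states), and Z is taken as the joint effect of X and Y in the common effect model.

```latex
\begin{align*}
\text{Common cause } (Y \leftarrow X \rightarrow Z):&\quad P(X, Y, Z) = P(Y \mid X)\, P(Z \mid X)\, P(X) \\
\text{Causal chain } (X \rightarrow Y \rightarrow Z):&\quad P(X, Y, Z) = P(Z \mid Y)\, P(Y \mid X)\, P(X) \\
\text{Common effect } (X \rightarrow Z \leftarrow Y):&\quad P(X, Y, Z) = P(Z \mid X, Y)\, P(X)\, P(Y)
\end{align*}
```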
The equations specify the probability distribution of the events within the model in terms of the strength of the causal links and the base rates of the exogenous causes that have no parents (e.g., X in the common cause model). Implicit in the specification of the parameters of a Bayes net are rules specifying how multiple causes of a common effect combine to produce the effect (e.g., the noisy-OR rule) or, in the case of continuous variables, functional relations between variables. A parameterized causal model makes it possible to make specific predictions of the probabilities of individual events or patterns of events within the causal model.
Modeling Interventions
With the help of graph surgery (Pearl, 2000), the procedure used to model changes in a causal model caused by interventions, a “manipulated graph” is constructed. According to Pearl (2000), traditional Bayes nets and other probabilistic theories lack the expressive power to distinguish observational and interventional conditional probabilities [
[
Observation of Y:
Intervention on Y:
Equation (4) and Equation (5) signify that the probability of the consequences of interventions can be calculated if the variables of the causal model are known. This implies that Z occurs with the observational conditional probability given the presence of Y, P(Z|Y = 1), and X occurs with a probability corresponding to its base rate, P(X). This describes the intervention on Y in the causal chain model.
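The difference between observing Y and intervening on Y can be checked numerically. The sketch below, written in Python 3, assumes a binary causal chain X → Y → Z; all parameter values are hypothetical and chosen only for illustration. Observation conditions on Y in the full joint distribution, while intervention applies graph surgery by removing the X → Y link, so that X keeps its base rate.

```python
from itertools import product

# Hypothetical parameters for a binary causal chain X -> Y -> Z.
P_X = 0.3                        # base rate P(X=1)
P_Y_GIVEN_X = {1: 0.8, 0: 0.1}   # P(Y=1 | X=x)
P_Z_GIVEN_Y = {1: 0.9, 0: 0.2}   # P(Z=1 | Y=y)

def bern(p, value):
    """Probability of a binary value given P(value=1) = p."""
    return p if value == 1 else 1.0 - p

def joint(x, y, z):
    """P(X=x, Y=y, Z=z) for the chain: P(X) P(Y|X) P(Z|Y)."""
    return bern(P_X, x) * bern(P_Y_GIVEN_X[x], y) * bern(P_Z_GIVEN_Y[y], z)

def observe_y(x, z, y=1):
    """P(X=x, Z=z | Y=y): ordinary conditioning on an observed Y."""
    p_y = sum(joint(a, y, c) for a, c in product((0, 1), repeat=2))
    return joint(x, y, z) / p_y

def intervene_y(x, z, y=1):
    """P(X=x, Z=z | do(Y=y)): the X -> Y link is cut, so X keeps its base rate."""
    return bern(P_X, x) * bern(P_Z_GIVEN_Y[y], z)

p_x_obs = sum(observe_y(1, z) for z in (0, 1))   # P(X=1 | Y=1)
p_x_do = sum(intervene_y(1, z) for z in (0, 1))  # P(X=1 | do(Y=1))
print(p_x_obs, p_x_do)
```

With these values, observing Y = 1 raises the probability of X above its base rate, whereas setting Y = 1 by intervention leaves X at P(X), which is the behavior the text describes.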
CBN Models after intervention
Naturally, both of these values are significantly smaller and can be ignored for the sake of simplicity. Hence, the maximum possible entities (or event) can be represented.
It is noticeable that with the natural causal effect, or graph surgery, fewer variables need to be considered in the computation of interventional probabilities. The common cause can be computed from the probability corresponding to its base rate, and the first effect is determined by the base rate of its cause and the strength of the probabilistic relation between first and second causes.
The causal graph data structure can be implemented with either an adjacency matrix or a linked list of directed edge connections.
As is commonly known, there are two primary methods for representing graph adjacency: adjacency matrices and adjacency lists. The pros and cons of each method must be weighed to establish which is most efficient for a Big Data CBN. Traditionally, an adjacency matrix is the preferred method when dealing with large amounts of data, so as to prevent redundant storage of values with multiple links. A problem arises in this project because the quantity of zero values is drastically greater than that of non-zero values.
Initial evaluation of word frequency using logs from Alltheweb determined that only 9% of the 173665 unique search terms had 10 or more occurrences and thus 10 or more potential adjacencies.
Using an adjacency matrix would potentially lead to 158035 × (173665 − 10), or more than 27 billion, empty values per adjacency matrix per word position. Given the incredibly sparse nature of such an adjacency structure, an adjacency list was deemed the most appropriate option.
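The sparsity estimate can be reproduced directly, and the contrast with an adjacency list is easy to see on a toy example. The Python 3 sketch below uses the term counts quoted in the text; the helper `build_adjacency_list` and the sample queries are hypothetical and serve only to show that a list-style structure stores just the edges that actually occur.

```python
from collections import defaultdict

unique_terms = 173665
frequent_share = 0.09  # share of terms with 10 or more occurrences

# Terms with fewer than 10 occurrences, each leaving at least
# (unique_terms - 10) empty cells in its matrix row.
rare_terms = round(unique_terms * (1 - frequent_share))
empty_cells = rare_terms * (unique_terms - 10)
print(rare_terms, empty_cells)

def build_adjacency_list(queries):
    """Map each word to the words that follow it, with occurrence counts."""
    adjacency = defaultdict(dict)
    for query in queries:
        words = query.split()
        for current, following in zip(words, words[1:]):
            adjacency[current][following] = adjacency[current].get(following, 0) + 1
    return adjacency

adjacency = build_adjacency_list(["cheap computer desk", "cheap computer"])
stored_edges = sum(len(followers) for followers in adjacency.values())
print(stored_edges)
```

Evaluating the product gives roughly 27.4 billion empty cells, while the adjacency list for the two sample queries stores only two edges.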
The unique nature of search queries allowed for a significant reduction in the size of the truth tables used in the CBN for this project. This uniqueness comes from the mutual exclusivity of search terms for each node. For example, if the word Truck is the first word used in the search term, then no other word can be true as the first word. A full-valued truth table for a given node would have a number of entries given by Equation (6), where T is the number of entries in the truth table, N is the number of unique key words, and M is the word position of the current node.
The mutual exclusivity of the nodes in a given word position reduces this to the still large but much more manageable maximum shown in Equation (7).
In practice, both of these values are significantly smaller; these equations simply represent the maximum possible number of entries in a given node’s truth table.
Due to the large volume of data, a unique data structure needed to be created. This includes the establishment of a given node, the directed connection to following nodes, and the truth table needed to define the probabilities of a given state based on the existing previous node states. The code for this project was written in Python with intent to eventually transfer over to LISP code. Python was used due to the robust nature of the language combined with solid readability and pre-existing functions.
Nodes were created as a Python dictionary (referred to here as a library) with the individual words as keys. These words are the initial basis for each node and are used as identifiers for both node population and forward searching for prediction methods. Because a CBN is acyclic in nature, a given node could not exist in more than one location, so as to prevent a potential infinite loop.
Given the limitations on variable naming conventions in programming languages, a method to delineate the different occurrences of the same word in different locations is needed. In order for a node to be accessed normally, the word that is keyed to the node is used as an identifier. A problem occurs in that a word could potentially exist in any word position yet must be distinct for each position else the graphs become cyclical.
Initially, a method was established to append word position to the initial key to create a unique identifier for each node. This was ultimately rejected due to the additional time it would take to append this value during data mining and removal of this value for forwards searching through the nodes via string matching.
Instead, a sub-library was created within each node that indicates the starting position using a one-based word position (starting at 1 rather than 0). This method avoided both of the problems mentioned above, as the key words could maintain their string identity while still maintaining the acyclic nature of the CBN, which is a directed acyclic graph (DAG).
The sub-libraries contain the word occurrence frequency (the number of times that word occurs in that position) and the trimmed-down truth tables. These truth tables are calculated by dividing the number of occurrences of the specific path taken to reach that node by the total number of occurrences of that node in the given position.
These sub-libraries are then ranked by these probability factors during creation and modification so as to reduce the total amount of time between user input and program output. Further ordering (e.g., table entries with the same values) is arbitrary and will generally be chronological by creation time.
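The nested structure described above can be sketched as follows. This is a minimal Python 3 illustration modeled on the layout used in the Note 2 listing (word, then one-based position, then path with its count); it is not the authors' code, the function names are hypothetical, and, as a simplifying assumption, it omits any end-of-query marker.

```python
from collections import defaultdict

def build_relations(queries):
    """Build relations[word][position][path] -> count, with 1-based positions."""
    relations = defaultdict(lambda: defaultdict(dict))
    for query in queries:
        words = query.split()
        for i in range(len(words) - 1):
            # Path taken so far, up to and including the next word.
            path = " ".join(words[: i + 2])
            table = relations[words[i]][i + 1]
            table[path] = table.get(path, 0) + 1
    return relations

def path_probabilities(table):
    """Convert raw path counts to relative frequencies, highest first."""
    total = sum(table.values())
    ranked = sorted(table.items(), key=lambda item: item[1], reverse=True)
    return [(path, count / total) for path, count in ranked]

relations = build_relations(
    ["tree branch", "tree branch", "tree stump", "big tree branch"]
)
print(path_probabilities(relations["tree"][1]))
```

Because the word "tree" is stored under position 1 and position 2 separately, the same key word can appear in several positions without introducing a cycle.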
Several steps are taken in order to correctly interpret user input, search the CBN for the appropriate values, and return suggestions to the user. The initial search terms are taken as a single entry from the user, delimited by spaces. The interpreter then counts the number of terms being used and loads the last word in the search term as the KEY. Next, both the KEY and position index are used to locate the appropriate node and position to compare the entire search term to.
Once the correct truth table is located, the interpreter compares the entire string from user input against its entries until the top five matches have been found. These top five matches are returned to the user as the predictive text. The data were processed using the facilities of BIG RED II at Indiana University, IN. An algorithm to convert a sample Python Code for Creation of Data Structure from CSV is shown in
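The lookup steps just described can be sketched in a few lines. The Python 3 example below assumes the nested layout of the Note 2 listing (word, one-based position, path with count); the function name `predict_next` and the sample counts are hypothetical.

```python
def predict_next(relations, text, top_n=5):
    """Return up to top_n stored paths that extend the typed query."""
    words = text.strip().split()
    if not words:
        return []
    key = words[-1]        # the last typed word selects the node
    position = len(words)  # its one-based position selects the table
    table = relations.get(key, {}).get(position, {})
    # Keep only paths whose leading words match the whole typed query.
    matches = [
        (path, count)
        for path, count in table.items()
        if path.split(" ")[:position] == words
    ]
    matches.sort(key=lambda item: item[1], reverse=True)
    return [path for path, _ in matches[:top_n]]

relations = {
    "tree": {
        1: {"tree branch": 20, "tree stump": 11},
        2: {"big tree branch": 3, "oak tree leaf": 7},
    }
}
print(predict_next(relations, "tree"))
print(predict_next(relations, "big tree"))
```

Comparing the full prefix, rather than only the last word, is what keeps "oak tree leaf" from being suggested for the query "big tree".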
The final program was broken up into three primary sections: Data Parsing, Data Structuring, and Interpretation. This was primarily done for timing reasons. The parsing of the data takes a considerable amount of time (upwards of 20 minutes) for each search log. Given the independence of the parsing of each line, this could easily be broken up to run in parallel on a supercomputer. Splitting the program also allowed all the parsed data to be combined into a single log file for breaking into data structures.
The organizing of the “mega” log into data structures took significantly less time than the parsing (only about 5 minutes for the entire log). The primary reason these two steps were not performed in the same program was that, as individual steps, additional analysis could be performed on the parsed CSV file in order to help determine the best ways to construct the data structure and to perform any secondary calculations needed to support proposed ideas.
The transition of the data structure to the interpreter caused some difficulty in execution. Originally the structure was written to a Python script that would be loaded when the interpreter was launched. This script file was around 400 MB and took around four minutes to load into memory. An alternative was found through Python’s pickle module. Pickle serializes a data structure into a binary file that it can then read and load much faster when called. The result was a binary file that was only a few KB larger than the original script file, but with a 50% reduction in total load time in the interpreter.
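A minimal round trip with the standard `pickle` module looks like the following Python 3 sketch. The file name, the temporary directory, and the tiny sample structure are placeholders chosen for the example; the listings in Notes 2 and 3 show the calls the project itself used.

```python
import os
import pickle
import tempfile

# Small stand-in for the nested word/position/path structure.
relations = {"tree": {1: {"tree branch": 20, "tree stump": 11}}}

path = os.path.join(tempfile.mkdtemp(), "relations.p")

# Serialize the structure to a binary file.
with open(path, "wb") as handle:
    pickle.dump(relations, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Load it back, as the interpreter would at start-up.
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == relations)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.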
As previously mentioned, the mutual exclusivity of words in a given position reduces the storage requirements of the truth tables by a significant margin. This mutual exclusivity comes from the fact that two different words cannot exist in the same position. The causal nature of the CBN is also a contributing factor in the reduction of truth table sizes. Since our graph is causal, any word that never precedes a given term need not be included in our truth tables as they will never have any kind of influence on the probability of that term or word [
This reduction is calculated in the worst-case-scenario in Equations (6) and (7). These are maximum models for a potential node though, and are highly unlikely to ever fully be reached. A more practical representation can be seen in
the directed graph. It is because of all of these contributing factors that the data storage and subsequently the traversal time of such a CBN are reduced so drastically and allow for this model to be implemented in a practical way without the need for supercomputing.
A supercomputer would be most useful in the gathering and parsing of additional search logs. As the data are not dependent on any other part during this process, parsing can be set up in parallel to make extensive use of any supercomputer. The creation of the data structures could also be implemented to take advantage of such a machine, but would require more intensive modifications of the code, likely involving the use of mutexes. We compared query prediction accuracy and sensitivity with some built-in Bayesian classifiers included in the Weka machine learning tools. Obtained results are summarized in
Results in
Many additional features could be added to this program in order to make it more robust and helpful. The most basic is the inclusion of more search logs. While this project handled about six and a half million search terms, more data will always lead to more accurate results in terms of search prediction.
A simple learning algorithm could also be implemented within the interpreter that updates the data with new search terms and reinforces existing terms as they occur. This could also potentially lead to customized terms for individual users, which would only need to create an additional mutually exclusive precondition of some forms of user ID or could, depending on the desired format, be stored locally for the user. For even further customization, a localization could also be established using a similar method so that users in a specific geographical region would be more likely to get similar results. This would be useful for things like searching for local restaurants or other geographically oriented concepts.
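One way such an update step might look is sketched below in Python 3. It reinforces the counts along the path of a newly entered query, using the same word, position, and path layout as the Note 2 listing. The function name `update_relations` is hypothetical (the corresponding `updaterelations` function in Note 3 is left as a stub), and per-user or per-region customization is not modeled here.

```python
def update_relations(relations, words):
    """Add one observed query (a list of words) to the relation counts."""
    for i in range(len(words) - 1):
        path = " ".join(words[: i + 2])  # path up to and including the next word
        # Create the node and position table on first sight, then reinforce.
        table = relations.setdefault(words[i], {}).setdefault(i + 1, {})
        table[path] = table.get(path, 0) + 1
    return relations

relations = {}
update_relations(relations, ["tree", "branch"])
update_relations(relations, ["tree", "branch"])
update_relations(relations, ["tree", "stump", "removal"])
print(relations)
```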
Two significantly more robust additions could also be included. The first is letter-by-letter live word prediction as the user begins to type, an ability that Google Auto-Complete implements. This could potentially use the same methods of prediction as the rest of the project, but we speculate that a simply ordered word frequency list would be adequate for this implementation. The second would be the detection and correction of misspelled words. This would be a substantial undertaking unless pulled from some form of API, but could potentially reduce the number of nodes, and thus drastically reduce the amount of branching and storage space needed.
Algorithm | Product performance | Process performance | Parameters | Remark |
---|---|---|---|---|
CBN | 77.14% | 78.3% | | |
Weka_Naive Bayes | 73.65% | 70.19% | Default | |
Weka_Bayes Net | 71.38% | 68.32% | Simple estimator | BAN |
Weka_Bayes Net | 73.71% | 69.55% | Simple estimator | TAN |
 | Process (%), Low | Process (%), High | Product (%), Low | Product (%), High |
---|---|---|---|---|
CBN | 67.2 | 55.3 | 59.8 | 44.5 |
High = 100% of User 3, Req 5, and P & C4 | 30.4 | 74.5 | 44.6 | 53.78 |
High = 100% of Req 5, P & C1, P & C2, P & C4, and Term 5 | 38.2 | 64.8 | 44.1 | 56.9 |
High = 100% of User 5, Req 2, Req 3, P & C1, P & C2, P & C3, P & C4, P & C5, Term 5 | 24.7 | 77.5 | 29.5 | 76.7 |
The final and most significant projection for this project would be following further down the Bayesian Network to obtain predictions that exceed just the next word. This would likely require a bit more search time, but not much additional coding. This would likely be implemented with a limited-depth search for ordered values. Intuitively, the more words desired for prediction, the longer the search is going to take by a significant factor.
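A limited-depth search of this kind could be sketched as follows in Python 3. The version below is a greedy simplification that follows only the single most frequent continuation at each step, rather than exploring several ordered candidates. The function names, the sample counts, and the assumption that stored paths carry no empty end-of-query marker are all hypothetical.

```python
def best_extension(relations, words):
    """Return the most frequent stored path extending `words`, as a word list."""
    table = relations.get(words[-1], {}).get(len(words), {})
    matches = {
        path: count
        for path, count in table.items()
        if path.split(" ")[: len(words)] == words
    }
    if not matches:
        return None
    return max(matches, key=matches.get).split(" ")

def predict_ahead(relations, text, depth=2):
    """Greedily extend the query by up to `depth` additional words."""
    words = text.strip().split()
    for _ in range(depth):
        if not words:
            break
        extended = best_extension(relations, words)
        if extended is None or len(extended) <= len(words):
            break  # no known continuation
        words = extended
    return " ".join(words)

relations = {
    "cheap": {1: {"cheap computer": 5, "cheap flights": 3}},
    "computer": {2: {"cheap computer desk": 4, "cheap computer case": 1}},
}
print(predict_ahead(relations, "cheap", depth=2))
```

Each extra level of depth adds one more table lookup and scan in this greedy form; a search that keeps several candidates per level would grow correspondingly faster, in line with the cost noted in the text.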
As discussed previously, this project could lay the groundwork for future text-based data mining for prediction with large data sets. Internet searching need not be a limiting factor for the implementation of CBNs with text [
Any implementation, such as that just mentioned, would likely need to use selected key words so as to reduce the total number of nodes. This is because such a method would drastically reduce the need for word ordering and remove the mutual exclusivity that is important for controlling the input size of the current model.
Hossain, G., Haarbauer, J., Abdo, J. and King, B. (2016) Causal Analysis of User Search Query Intent. Journal of Computer and Communications, 4, 108-131. http://dx.doi.org/10.4236/jcc.2016.414009
Note 1: Parsing Search Data
import string

#print "What file would you like to open?" #comment this and the next line back in
filename = "97_03_10.log" #raw_input("?")
f = open(filename, "r")
filelines = f.readlines()
filedata = [len(filelines)] #initial line count, kept for the final report
parsedoc = []
del f
for line in filelines:
    #keep every tab-delimited column after the first
    parsedoc.append(line.strip().split("\t")[1:])
#del filelines
for i in range(len(parsedoc) - 1): #this is where the magic happens
    if (not parsedoc[i]):
        # print True
        continue
    if (len(parsedoc[i]) == 1):
        parsedoc[i] = []
        continue
    parsedoc[i][
    if ((parsedoc[i][
        parsedoc[i] = []
        continue
    #nextline is to prevent j from reaching into the land of the lost
    for j in range(i + 1, i + (20 if (20 + i < len(parsedoc)) else (len(parsedoc) - i - 1))):
        if (parsedoc[i] == parsedoc[j]):
            parsedoc[j] = [] #blank out a repeated entry so it is filtered below
parsedoc = filter(None, parsedoc)
#DOC SHOULD BE CLEAN. IF YOU WANT TO SPLIT it, do it now
wantedchars = string.ascii_letters + "." + string.digits
unwanted = string.printable
for i in wantedchars:
    unwanted = unwanted.replace(i, "")
for i in range(len(parsedoc)):
    # try: parsedoc[i][
    # except: continue
    for j in unwanted:
        parsedoc[i][
    while (" " in parsedoc[i][
        parsedoc[i][
    parsedoc[i][
filedata.append(len(parsedoc)) #final line count
print "Originally",
print filedata[
print "lines."
print "Currently",
print filedata[
print "lines."
for line in parsedoc:
    print line[
    for word in line[
        print word + ",",
    print ""
Note 2: Data Structure Creation
import pickle as pl

def createrelations():
    ourfile = "megalog"
    f = open(ourfile, "r")
    filelines = f.readlines()
    parsedoc = []
    for line in filelines:
        #drop the leading ID column and keep the query words
        parsedoc.append(line.strip().split(",")[1:])
        # for i in range(len(parsedoc[-1])):
        #     parsedoc[-1][i] = "_" + parsedoc[-1][i]
    """
    Example relations-
    relations = {"tree": {1: {"branch": 20, "stump": 11, "": 5}, 2: {...}}}
    """
    relations = {}
    for line in parsedoc:
        for i in range(len(line) - 1):
            word = line[i]
            #path so far, up to and including the following word
            nextword = " ".join(line[:i + 2])
            if not relations.has_key(word):
                relations.update({word: {}})
            posdict = relations[word]
            j = i + 1 #one-based word position
            if not posdict.has_key(j):
                posdict.update({j: {}})
            wordpos = posdict[j]
            if not wordpos.has_key(nextword):
                wordpos.update({nextword: 0})
            wordpos.update({nextword: wordpos[nextword] + 1})
    return relations

#this section formats relations into a list that can be used with
#lisp code
def formatlisp(relations):
    for key in relations.keys():
        print "(setq", key, "(".strip(),
        posdict = relations[key]
        for pos in posdict.keys():
            print "(".strip(),
            wordpos = posdict[pos]
            for word in wordpos.keys():
                if not word:
                    print "(".strip(), "nil", str(wordpos[word]).strip(), ")",
                else:
                    print "(".strip(), word, str(wordpos[word]).strip(), ")",
            print ")",
        print "))"

#formatlisp(createrelations())
relations = createrelations()
pl.dump(relations, open("megafile.p", "wb"))
Note 3: Interpreter
#from megafile import relations
#from createlist import relations
import pickle
from datetime import datetime

startTime = datetime.now()
relations = pickle.load(open("megafile.p", "rb"))

def getnext(node, numberofnext):
    global relations
    inputlist = node.strip().split(" ") #list of input words
    if relations.has_key(inputlist[-1]):
        currnode = relations[inputlist[-1]][len(inputlist)] #dictionary from current list
    else:
        updaterelations(inputlist)
        return [node, 0]
    nodelist = []
    for dictkey in currnode.keys():
        #compare the leading words of each stored path with the full input
        ourkey = " ".join(dictkey.split(" ")[:len(inputlist)])
        inputkey = " ".join(inputlist)
        if (ourkey == inputkey):
            nodelist.append([dictkey, currnode.get(dictkey)])
    #now we have [[woods, 1], [woods books, 2]...]
    nodelist = sorted(nodelist, key = lambda keypair: keypair[
    ourrange = numberofnext if numberofnext
    updaterelations(inputlist)
    return nodelist[:ourrange]

def printnodes(nodes):
    pass

def updaterelations(inputlist):
    pass

print "Time to load:"
print datetime.now() - startTime
while (1):
    node = raw_input(">>")
    if node == "/exit":
        exit(0)
    nextnodes = getnext(str(node), 5) #return a list
    for i in nextnodes:
        print i[