This paper examines the automatic recognition and extraction of tables from a large collection of heterogeneous documents. The documents are first pre-processed and converted to HTML, after which an algorithm recognises the table portions of each document. A Hidden Markov Model (HMM) is then applied to the HTML code to extract the tables. The model was trained and tested with 526 self-generated tables (321 for training and 205 for testing). The Viterbi algorithm was used in the testing phase. The system was evaluated in terms of accuracy, precision, recall and F-measure. The overall evaluation results show 88.8% accuracy, 96.8% precision, 91.7% recall and 88.8% F-measure, indicating that the method is effective for table extraction.

A table is made up of a series of rows and columns. Each cell of a table houses information, which can be alphabetic, numeric or alphanumeric; a table can also hold images, depending on its purpose. In the sciences, tables are used to summarise a topic or a statement; in accounting, they are used to track money spent, money at hand and money to be spent. A table can therefore be described as a set of facts or figures arranged in lines or columns.

A table can also be defined as a list of facts or numbers arranged in a special order, usually in rows and columns, or as an arrangement of numbers, words or items of any kind in a definite and compact form, so as to exhibit some set of facts or relations in a distinct and comprehensive way for convenience of study, reference or calculation.

According to [

Among many inspirational works is the work of [

Many researchers have used HMM to extract one thing or the other from documents. Ojokoh et al. [

Oro et al. [

Tengli et al. [

HMM is one of the most popular models for sequence modelling. It is widely used because it is simple enough that its parameters can actually be estimated, yet rich enough to handle real-world applications. An HMM is a doubly stochastic process with an underlying stochastic process that is not observable (it is hidden) and can only be observed through another set of stochastic processes that produce the sequence of observed symbols [

Information can be extracted using several standard approaches: hand-written regular expressions (perhaps stacked); classifiers, whether generative (such as the naïve Bayes classifier) or discriminative (such as maximum entropy models); and sequence models such as the Hidden Markov Model, the Conditional Markov Model (CMM)/Maximum-Entropy Markov Model (MEMM) and Conditional Random Fields (CRF), which are commonly used for information extraction (IE) tasks as varied as extracting information from research papers and extracting navigation instructions. Hybrid approaches, which combine some of the standard approaches listed above, can also be used for IE, but HMM was used in this research work.

This research work was therefore motivated by the need to recognise and extract tables from a large collection of documents of different types and structures. It achieves this aim by first converting the differently formatted documents to a common format (HTML) using a document converter.

This work shows that an HMM learns quickly even from a small data set and, being a machine learning model, can also perform well on a large one.

A Hidden Markov Model (HMM) is a finite state automaton comprising stochastic state transitions and symbol emissions. The automaton models a probabilistic generative process whereby a sequence of symbols is produced by starting at a designated start state, transitioning to a new state, emitting one of a set of output symbols selected by that state, transitioning again, emitting another symbol, and so on, until a designated final state is reached. Associated with each state is a probability distribution over the symbols in the emission vocabulary and a probability distribution over its set of outgoing transitions [

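Expressed mathematically, the model is defined by Equations (1) to (6).

The Hidden Markov Model is a five-tuple

$$\lambda = (S, A, \Pi, V, B) \qquad (1)$$

S is the set of states,

$$S = \{s_1, s_2, \ldots, s_N\} \qquad (2)$$

A is the state transition probability matrix,

$$A = \{a_{ij}\} \qquad (3)$$

where $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)$ is the probability of moving from state $s_i$ to state $s_j$.

$\Pi$ is the initial state probability distribution,

$$\Pi = \{\pi_i\}, \quad \pi_i = P(q_1 = s_i), \text{ the probability of being in state } s_i \text{ at time } t = 1 \qquad (4)$$

V is the set of emission symbols,

$$V = \{v_1, v_2, \ldots, v_M\} \qquad (5)$$

where M is the number of emission symbols in the discrete vocabulary, V.

B is the emission probability matrix,

$$B = \{b_j(k)\} \qquad (6)$$

where $b_j(k) = P(o_t = v_k \mid q_t = s_j)$ is the probability of emitting symbol $v_k$ while in state $s_j$.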
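As an illustration of the five-tuple and of the generative process described above, the following is a minimal Python sketch. The class name `HMM`, its field names, the two-state toy model and all probability values are assumptions made for this example and do not come from the paper. For simplicity the sketch draws the first state from the initial distribution and stops after a fixed number of steps, rather than using designated start and final states.

```python
import random
from dataclasses import dataclass


@dataclass
class HMM:
    """Container for the five-tuple (S, A, Pi, V, B); names are illustrative."""
    states: list[str]                # S
    vocabulary: list[str]            # V
    transition: list[list[float]]    # A[i][j] = P(next state s_j | current s_i)
    emission: list[list[float]]      # B[j][k] = P(symbol v_k | state s_j)
    initial: list[float]             # Pi[i]   = P(first state is s_i)

    def validate(self, tol: float = 1e-9) -> None:
        """Check that every probability row sums to one."""
        for row in [self.initial, *self.transition, *self.emission]:
            if abs(sum(row) - 1.0) > tol:
                raise ValueError(f"row does not sum to 1: {row}")

    def sample(self, length: int, rng: random.Random):
        """Generate `length` (state, symbol) steps from the model."""
        state_ids = range(len(self.states))
        symbol_ids = range(len(self.vocabulary))
        state = rng.choices(state_ids, weights=self.initial)[0]
        path, symbols = [], []
        for _ in range(length):
            path.append(self.states[state])
            k = rng.choices(symbol_ids, weights=self.emission[state])[0]
            symbols.append(self.vocabulary[k])
            state = rng.choices(state_ids, weights=self.transition[state])[0]
        return path, symbols


# Hypothetical toy model: the numbers below are invented for illustration.
toy_hmm = HMM(
    states=["TD", "P"],
    vocabulary=["number", "long statement"],
    transition=[[0.7, 0.3], [0.4, 0.6]],
    emission=[[0.8, 0.2], [0.1, 0.9]],
    initial=[0.6, 0.4],
)
toy_hmm.validate()
print(toy_hmm.sample(5, random.Random(0)))
```

Storing the matrices as row-stochastic lists keeps the sketch dependency-free; a real system would more likely use arrays and log-probabilities.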

A table contains elements such as td, tr, tbody, table and paragraph (that is, table data, table row, table body, table and paragraph, respectively). Each of these elements is a state that generates a sequence of observable/emission symbols as the model moves from one state to another. Given an HMM, extraction is performed by determining the sequence of states that is most likely to have generated the entire observable/emission symbol sequence. The Viterbi algorithm is a common way of finding the most probable path for the sequence [

1) For all the tables generated, pick a table t_{i} (i = 1, ..., n, where n = number of tables) if they are not exhausted.

2) Display the fetched table (t_{i})

3) Generate the HTML (h_{i}) for t_{i}

4) For all the tags (

5) Display the tag

6) Fetch the next token.

a) If it is not a tag but inner HTML, indent positively and display content.

b) If it is another tag indent positively (indent right) and display content.

c) If it is closing tag indent negatively and display content.
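The display steps above can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions: the class name `IndentedTagPrinter` and the helper `render_indented` are hypothetical, attributes are dropped from the displayed tags, and the input is assumed to be well nested (an unclosed tag such as a bare `<br>` would leave the indentation one level too deep).

```python
from html.parser import HTMLParser


class IndentedTagPrinter(HTMLParser):
    """Collects one display line per token, indented by nesting depth."""

    def __init__(self, indent: str = "  "):
        super().__init__()
        self.indent = indent
        self.depth = 0
        self.lines: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Opening tag: display it, then indent positively for what follows.
        self.lines.append(self.indent * self.depth + f"<{tag}>")
        self.depth += 1

    def handle_endtag(self, tag):
        # Closing tag: indent negatively, then display it.
        self.depth = max(self.depth - 1, 0)
        self.lines.append(self.indent * self.depth + f"</{tag}>")

    def handle_data(self, data):
        # Inner HTML text: display at the current (already increased) depth.
        text = data.strip()
        if text:
            self.lines.append(self.indent * self.depth + text)


def render_indented(html: str) -> str:
    """Return the HTML of one table as indented display lines."""
    printer = IndentedTagPrinter()
    printer.feed(html)
    printer.close()
    return "\n".join(printer.lines)


print(render_indented("<table><tr><td>1</td><td>Work done</td></tr></table>"))
```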

The tables extracted by the HMM were sometimes not identical to the original tables and could differ in any of the following ways:

1) the contents of two cells may merge into a single cell (that is, the content of cell B may be extracted into cell A, leaving cell B empty, or vice versa);

2) a cell may be missing;

3) a row may be missing.

Overall, however, about 90% to 98% of the original table is retained.

The system architecture shown in

The design covers all the tools used for pre-processing heterogeneous documents into HTML code; the generation of the Transition Matrix, the Initial Probability Matrix and the Observable Symbol Matrix; the smoothing of the Observable Symbol Matrix using the Laplace smoothing method; the Viterbi algorithm, which determines the best path; and finally the extracted tables and their contents.

The data set used in this research work was self-generated because there is no known standardised data set for table extraction. It contains 526 tables in 25 documents in Microsoft Word, Portable Document Format (PDF) and Hypertext Markup Language (HTML) formats. This set of documents is divided between two phases: the training phase and the testing phase. In the document pre-processor, the heterogeneous documents are converted to their HTML equivalents. A PDF document is first converted to a Word document using a PDF converter (PDFC) before being converted to its HTML equivalent.

To build an HMM for table extraction in documents, one needs to first decide how many states the model will contain and what transitions between states should be allowed [

Given:

This is a system of n states.

For all

The case study has five states: TD, TH, TBody, P,

1) Get the document object from CKEditor using the getData() API function

2) Determine which of the states of interest comes first

3) Find the probability of each of the states considering the number of documents used

4) Store this in a table.

5) Output the initial probability matrix.
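Steps 2 to 5 above amount to counting, over the training documents, how often each state of interest comes first. A minimal Python sketch follows. The function name `initial_probabilities` and the example sequences are hypothetical, and the state names are only the four that are legible in the text (TD, TH, TBody, P), so the sketch does not claim to reproduce the full state set of the case study.

```python
from collections import Counter


def initial_probabilities(sequences, states):
    """Estimate P(first state of interest = s) for every s in `states`.

    `sequences` holds one list of state labels per training document.
    Labels that are not states of interest are skipped when looking
    for the first state.
    """
    wanted = set(states)
    firsts = Counter()
    for seq in sequences:
        first = next((label for label in seq if label in wanted), None)
        if first is not None:
            firsts[first] += 1
    total = sum(firsts.values())
    if total == 0:
        raise ValueError("no document contains a state of interest")
    return {s: firsts[s] / total for s in states}


# Hypothetical tag sequences for four training documents.
example_states = ["TD", "TH", "TBODY", "P"]
example_sequences = [
    ["TBODY", "TH", "TD"],
    ["P", "TBODY", "TD"],
    ["TBODY", "TD", "TD"],
    ["DIV", "TD", "P"],
]
print(initial_probabilities(example_sequences, example_states))
```

Documents that contain none of the states of interest are left out of the denominator here; dividing by the total number of documents instead would be an equally defensible reading of step 3.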

Observable/emission symbols are the symbols extracted by the system.

In this research work, the following are the categories of observable symbols:

1) Numbers

2) Single words

3) Short statements

4) Long statements

5) Title case statements

6) Uppercase statements

The probability of each observable/emission symbol is computed by the algorithm, which stores the resulting matrix and generates it as output.

A description of the observable/emission symbols and samples from the data set is shown in

Since the maximum likelihood estimate assigns zero probability to unseen emissions, and retaining these zeroes would affect the result negatively, smoothing is needed. Many smoothing methods exist; in this research work, the Laplace method was used. The maximum likelihood model was modified such that one (1) is placed as the numerator and m is added to the denominator [

where V_{k} is the unseen symbol while transitioning from state i to j, and m is the sum of the total number of state sequences and the number of non-zero probabilities across the row.
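To show the effect of smoothing, here is a minimal Python sketch of the textbook add-one (Laplace) estimator applied to a matrix of emission counts. It is not the exact variant described above, whose formula and definition of m are specific to this work; the function name `laplace_smooth`, the `alpha` parameter and the example counts are assumptions made for illustration.

```python
def laplace_smooth(counts, alpha=1.0):
    """Textbook additive smoothing of a count matrix.

    counts[j][k] is how often state j emitted symbol k in training.
    Each row becomes (count + alpha) / (row_total + alpha * M), where
    M is the number of symbols, so no entry is left at zero and every
    row still sums to one.
    """
    smoothed = []
    for row in counts:
        denominator = sum(row) + alpha * len(row)
        smoothed.append([(c + alpha) / denominator for c in row])
    return smoothed


# Hypothetical counts: two states, three emission symbols.
example_counts = [
    [3, 0, 1],   # the second symbol was never seen in this state
    [0, 0, 2],
]
for row in laplace_smooth(example_counts):
    print([round(p, 3) for p in row])
```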

The testing phase uses 205 tables for the evaluation of the trained model.

After the transition probabilities and the observable/emission probabilities have been computed, a trained Hidden Markov Model has been formed. This model, consisting of the values of the transition and observable/emission probability matrices, is passed together with the sequence of symbols derived from the testing data set to the Viterbi algorithm, which produces a sequence of states (

S/N | Emission symbols | Examples from the data set |
---|---|---|
1 | Numbers | 2, 3, 4 |
2 | Words | Programming, Work done, Thanks |
3 | Short statements | ID card automation, Database design |
4 | Long statements | I have personally embarked on some projects and also embarked on some other projects as a member of a team. The following are some of the projects |
5 | Title case statements | Number of Staff in Statistics |
6 | Capital letter statements | FEMI VINCENT, REMI OLADAPO |
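The six symbol categories in the table can be assigned to cell contents with simple rules. The Python sketch below is one possible reading, not the authors' procedure: the function name `classify_cell`, the order in which the rules are tried, the list of minor words ignored for title case, and the ten-word threshold separating short from long statements are all assumptions. An extra `"empty"` label is returned for blank cells, which the table does not cover.

```python
import re

# Hypothetical settings; the paper does not state these values.
LONG_STATEMENT_MIN_WORDS = 10
MINOR_WORDS = {"a", "an", "and", "at", "for", "in", "of", "on", "or", "the", "to"}
NUMBER_RE = re.compile(r"[-+]?\d[\d,]*(\.\d+)?%?")


def classify_cell(text: str) -> str:
    """Map the text of one cell to an emission symbol category."""
    text = text.strip()
    words = text.split()
    if not words:
        return "empty"
    if NUMBER_RE.fullmatch(text):
        return "number"
    if len(words) == 1:
        return "word"
    if text.isupper():
        return "uppercase statement"
    # Title case: every word except minor words starts with a capital.
    major = [w for w in words if w.lower() not in MINOR_WORDS]
    if major and all(w[0].isupper() for w in major):
        return "title case statement"
    if len(words) >= LONG_STATEMENT_MIN_WORDS:
        return "long statement"
    return "short statement"


for sample in ["2", "Programming", "ID card automation",
               "Number of Staff in Statistics", "FEMI VINCENT, REMI OLADAPO"]:
    print(sample, "->", classify_cell(sample))
```

Under these rules a two-word entry such as "Work done" is treated as a short statement rather than a word, which differs from the table; the boundary between the two categories is one of the assumptions.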

The specific problem to be solved by the Viterbi algorithm is to obtain the most probable state sequence,

The Viterbi algorithm makes a number of assumptions. First, both the observed events and the hidden events must be in a sequence, which often corresponds to time. Second, the two sequences need to be aligned: each instance of an observed event must correspond to exactly one instance of a hidden event. Third, computing the most likely hidden sequence up to a certain point t must depend only on the observed event at point t and the most likely sequence at point t-1.

These assumptions can be elaborated as follows. The Viterbi algorithm operates on a state machine assumption: at any time, the system being modelled is in some state. There is a finite number of states, however large, that can be listed, and each state is represented as a node. Multiple sequences of states (paths) can lead to a given state, but one of them is the most likely path to that state, called the “survivor path”. This is a fundamental assumption because the algorithm examines all possible paths leading to a state and keeps only the most likely one. In this way the algorithm does not have to keep track of all possible paths, only one per state. A second key assumption is that a transition from a previous state to a new state is marked by an incremental metric, usually a number, which is computed from the event. The third key assumption is that the events are cumulative over a path in some sense, usually additive. The crux of the algorithm is therefore to keep a number for each state. When an event occurs, the algorithm examines moving forward to a new set of states by combining the metric of a possible previous state with the incremental metric of the transition due to the event, and chooses the best. The incremental metric associated with an event depends on the transition probability from the old state to the new state.

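The Viterbi algorithm is stated as follows in four major steps. Here $\pi_i$, $a_{ij}$ and $b_j(o_t)$ are the initial, transition and emission probabilities, $o_1, \ldots, o_T$ is the observation sequence, $\delta_t(i)$ is the probability of the most likely path that ends in state $s_i$ after the first t observations, and $\psi_t(i)$ records the predecessor state on that path.

1) Initialization:

$$\delta_1(i) = \pi_i\, b_i(o_1), \quad \psi_1(i) = 0, \quad 1 \le i \le N$$

2) Recursion:

$$\delta_t(j) = \max_{1 \le i \le N} \left[\delta_{t-1}(i)\, a_{ij}\right] b_j(o_t), \quad \psi_t(j) = \arg\max_{1 \le i \le N} \left[\delta_{t-1}(i)\, a_{ij}\right], \quad 2 \le t \le T,\ 1 \le j \le N$$

3) Termination:

$$P^{*} = \max_{1 \le i \le N} \delta_T(i), \quad q_T^{*} = \arg\max_{1 \le i \le N} \delta_T(i)$$

4) Path (state sequence) backtracking:

$$q_t^{*} = \psi_{t+1}(q_{t+1}^{*}), \quad t = T-1, T-2, \ldots, 1$$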
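The four steps can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this work: the function name `viterbi`, the dictionary-based parameter layout and the two-state toy model (states `TD` and `P`, symbols `number` and `long`) are assumptions, and the probabilities are invented. Plain probabilities are multiplied for readability; for long sequences, log-probabilities would be needed to avoid underflow.

```python
def viterbi(observations, states, initial, transition, emission):
    """Return (most probable state path, its probability).

    initial[s], transition[s_prev][s_next] and emission[s][symbol]
    are probabilities stored in plain dictionaries.
    """
    if not observations:
        raise ValueError("observation sequence must not be empty")

    # 1) Initialization
    delta = [{s: initial[s] * emission[s][observations[0]] for s in states}]
    psi = [{s: None for s in states}]

    # 2) Recursion
    for t in range(1, len(observations)):
        delta.append({})
        psi.append({})
        for j in states:
            best_prev = max(states,
                            key=lambda i: delta[t - 1][i] * transition[i][j])
            delta[t][j] = (delta[t - 1][best_prev] * transition[best_prev][j]
                           * emission[j][observations[t]])
            psi[t][j] = best_prev

    # 3) Termination
    last = max(states, key=lambda s: delta[-1][s])
    best_prob = delta[-1][last]

    # 4) Path (state sequence) backtracking
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, best_prob


# Hypothetical toy model; all numbers are invented for illustration.
toy_states = ["TD", "P"]
toy_initial = {"TD": 0.6, "P": 0.4}
toy_transition = {"TD": {"TD": 0.7, "P": 0.3}, "P": {"TD": 0.4, "P": 0.6}}
toy_emission = {"TD": {"number": 0.8, "long": 0.2},
                "P": {"number": 0.1, "long": 0.9}}

print(viterbi(["number", "number", "long"],
              toy_states, toy_initial, toy_transition, toy_emission))
```

Keeping only one back-pointer per state and time step is what the survivor-path idea amounts to in code: memory grows with the number of states times the sequence length, not with the number of possible paths.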

The best path returned by the Viterbi algorithm is then translated into its equivalent table in HTML code for evaluation.

The testing data set was evaluated to see how well the trained model performed the task of table extraction.

Evaluation was done by measuring per-token accuracy, precision, recall and F-measure for each extracted table in the tested documents. These evaluation measures are defined in Equations (16)-(21).

where A is the number of correctly extracted cells (True Positives), B is the number of cells that exist but were not extracted (False Negatives), C is the number of cells extracted but associated with wrong labels (False Positives) and D is the number of cells that did not exist and were not extracted (True Negatives).
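Given the four counts A, B, C and D defined above, the measures can be computed as in the following Python sketch. It uses the standard confusion-matrix definitions, with the F-measure taken as the harmonic mean of precision and recall; the paper's own Equations (16)-(21) are not legible in this text and may differ, for example in how per-document values are aggregated. The function name `extraction_metrics` and the example counts are hypothetical.

```python
def extraction_metrics(a: int, b: int, c: int, d: int) -> dict:
    """Standard measures from cell counts.

    a: true positives, b: false negatives,
    c: false positives, d: true negatives.
    A ratio with a zero denominator is reported as 0.0.
    """
    total = a + b + c + d
    accuracy = (a + d) / total if total else 0.0
    precision = a / (a + c) if (a + c) else 0.0
    recall = a / (a + b) if (a + b) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts for one extracted table.
for name, value in extraction_metrics(a=90, b=5, c=3, d=2).items():
    print(f"{name}: {value:.3f}")
```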

Experiments were carried out using a self-generated data set. The training set was used to train the HMM and the testing set was used to evaluate extraction performance.

Document | Accuracy | Precision | Recall | F-measure |
---|---|---|---|---|
16 | 0.930 | 0.982 | 0.943 | 0.958 |
17 | 0.877 | 0.965 | 0.898 | 0.895 |
18 | 0.816 | 0.938 | 0.856 | 0.850 |
19 | 0.680 | 0.933 | 0.865 | 0.783 |
20 | 0.850 | 0.945 | 0.882 | 0.859 |
21 | 0.923 | 0.981 | 0.938 | 0.956 |
22 | 0.816 | 0.926 | 0.847 | 0.844 |
23 | 0.836 | 0.931 | 0.870 | 0.876 |
24 | 0.805 | 0.911 | 0.852 | 0.871 |
25 | 0.815 | 0.918 | 0.852 | 0.874 |
Overall for 526 tables | 0.888 | 0.968 | 0.917 | 0.888 |
Average for 526 tables | 0.821 | 0.935 | 0.842 | 0.878 |
Overall for 66 tables | 0.922 | 0.980 | 0.937 | 0.935 |
Average for 66 tables | 0.885 | 0.966 | 0.903 | 0.919 |
Overall for 139 tables | 0.852 | 0.955 | 0.897 | 0.864 |
Average for 139 tables | 0.790 | 0.920 | 0.814 | 0.859 |


Method | Precision (%) | Recall (%) | F-measure (%) |
---|---|---|---|
Trigram HMM | 93.5 | 93.2 | 93.4 |
This work (HMM) | 96.8 | 91.7 | 88.8 |

This research work presented how tables can be recognised and extracted automatically from heterogeneous documents using a Hidden Markov Model (HMM). Heterogeneous documents (except HTML documents) were first pre-processed and converted to HTML code, after which an algorithm recognised the table portions and the HMM was applied to extract them. Laplace smoothing was applied to the observable symbol probability matrix, and the Viterbi algorithm was used to find the best path. A display algorithm then presented both the extracted table and its HTML equivalent; these were stored in a database and also displayed in web browsers. The model was trained and tested with 526 self-generated tables (321 for training and 205 for testing), with the Viterbi algorithm used in testing. Evaluation determined the accuracy, precision, recall and F-measure of the extraction. Automatic table recognition and extraction provide scalability and usability for digital libraries and their collections.

In this research work, only Word, PDF and HTML documents were used; future work could accommodate other document types such as Excel and PowerPoint. The work could also be extended in the near future so that extraction runs completely in all web browsers, not only Google Chrome. Four-fold cross-validation is suggested for future work, and evaluation measures other than accuracy, precision, recall and F-measure could also be used.

Florence Folake Babatunde, Bolanle Adefowoke Ojokoh, Samuel Adebayo Oluwadare (2015) Automatic