passed through a series of three categories of validation operators in which sub-entity sequences (extract sequences) are also labeled (valid or invalid), as described in Section 3.5. By characterizing sequence fields, the approach enables sub-entity occurrences to be either extended into value types (v-types) or transformed into basic structural types (b-types) or user-defined types (u-types) using an ML-based transformation model. Finally, the transformed and labeled extract sequences of the desired NHCE are used to generate output for solving ML problems such as validation or classification of the NHCE under consideration (or for further NHCE extraction). A schematic summarizing the involved processes (template matching, entity transformation and validation) is shown in Figure 2. We explain the motivations and requirements which led to this design in Section 3.3; the rest of the approach is presented in line with these requirements.
3.3. Requirements and Motivations
1) Annotation and Parsing: The objective of leveraging Web content necessitates an annotation process for identifying the text fragments which constitute the desired NHCE and a wrapper for parsing those fragments into sub-entities. Instead of manually supplying documents (as a text source or for training), an iterative and interactive extraction process allows us to leverage Web content and save human effort.
2) Validation of Sub-entity Sequences: Since knowledge about the extracts is initially not exact enough, a natural mechanism is to consider all sequences of sub-entities which constitute the desired NHCE extracts. To validate or invalidate an extracted text fragment and subsequently label it, a variety of means (e.g. statistical, expert input) is necessary if one is to expect comprehensive results.
3) Sub-entity Transformation: To cope with the evolving and transforming nature of Web information, sub-entities must be allowed, and guided, to transform in order to accommodate new occurrences as more NHCE extracts are seen.
3.4. Annotation and Parsing
Figure 2. NHCE extraction process: A schematic diagram showing template matching, entity transformation and sequence validation.

We adopt a top-down extraction process for annotating and parsing extracts. Generic rules are used to extract text fragments which constitute the desired extract. At this stage, semantic and structural relationships are ignored. For instance, when ssh commands are desired, ssh -R 5555:localhost:3389 and ssh -l 5555:localhost:3389 will both be extracted because the relationships between -R, -l and 5555:localhost:3389 are ignored. Figure 3 shows the generic rule format employed, written in JAPE.
In the generic rule format, the terms are logical expressions for matching sub-entities. Under JAPE, these can be organized into JAPE macros. A combination of sub-entities forming up a sequence is bound to a label which is used as a reference to the matched text fragment during post-processing after the rule has fired. For each label, there are intermediate labels which are used for parsing the matched text fragment. This annotation approach, therefore, also serves as a parser for the desired NHCE extract.
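To make the binding-label mechanism more concrete, the following minimal Java sketch (class, method and field names are hypothetical and not taken from the implementation) shows how a text fragment bound to a top-level label could be decomposed into sub-entity fields via intermediate labels; the field names loosely follow the ID/INPUT examples used later in the text, and DESTINATION is purely illustrative.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical holder for one matched NHCE fragment and its parsed sub-entities. */
class ExtractCandidate {
    final String fragment;                                            // text bound to the top-level label
    final Map<String, List<String>> fields = new LinkedHashMap<>();   // intermediate label -> occurrences

    ExtractCandidate(String fragment) {
        this.fragment = fragment;
    }

    /** Record one sub-entity occurrence under the field named by its intermediate label. */
    void addOccurrence(String intermediateLabel, String value) {
        fields.computeIfAbsent(intermediateLabel, k -> new ArrayList<>()).add(value);
    }
}

class PostProcessingDemo {
    public static void main(String[] args) {
        // The fragment matched by the top-level label of a fired rule.
        ExtractCandidate candidate = new ExtractCandidate("ssh -l user host");
        // Intermediate labels parse the fragment into sub-entity fields (illustrative values).
        candidate.addOccurrence("ID", "ssh");
        candidate.addOccurrence("INPUT", "-l user");
        candidate.addOccurrence("DESTINATION", "host");
        System.out.println(candidate.fields);
    }
}
```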
We use rules in this initial stage and conduct statistical processing in later semi-automated stages because rules are easier to interpret, develop and deploy, while statistical processing brings robustness to noise for unstructured data.
3.5. Sequence Validation
Extract sequences are then validated using template sequences and two other categories of operators (ontological and direct expert input), as shown in Figure 4(a). The operators are such that a sequence (of sub-entities) accepted by the template, ontological or expert operator is assigned the corresponding label. These operators bring flexibility into the validation process and enable hands-on involvement by an expert. Ontological and expert operators are case and domain dependent and are therefore described further with a case study in Section 4. The different validation results are then combined using Algorithm 1.

Figure 3. Pattern rule format.

Figure 4. Sequence validation.
Algorithm 1. Aggregation of validation results.
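The listing of Algorithm 1 is not reproduced here. As a rough illustration only, the Java sketch below shows one plausible aggregation policy under our own assumptions (any explicit invalidation overrides everything, otherwise a single validation suffices); it does not necessarily match Algorithm 1 exactly.

```java
/** Possible outcomes of a single validation operator. */
enum OperatorResult { VALID, INVALID, NOT_APPLIED }

class ValidationAggregator {
    /**
     * Aggregate the results of the template, ontological and expert operators.
     * Assumed policy: any INVALID wins; otherwise any VALID wins; otherwise the
     * sequence stays unlabeled (returned as false here).
     */
    static boolean aggregate(OperatorResult template, OperatorResult ontological, OperatorResult expert) {
        OperatorResult[] results = { template, ontological, expert };
        boolean anyValid = false;
        for (OperatorResult r : results) {
            if (r == OperatorResult.INVALID) {
                return false;                      // an explicit invalidation overrides everything
            }
            anyValid |= (r == OperatorResult.VALID);
        }
        return anyValid;                           // valid only if some operator actually validated it
    }
}
```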
Templates, as shown in Figure 3, are made up of all possible sequences of the desired NHCE and are in RDF format. Sequences in a template can be manually written or auto-generated using the specifications shown in Figure 4(b). They are defined in terms of fields and their types, whose occupants are either missing or entities defined in definition files to which the types act as pointers, with a specified level of allowed repetition. For instance, with ssh commands as the desired NHCE extracts, the first field (ID) can only have cmdid as its occupant, with allowed repetition 1. The third field (INPUT), however, can be occupied by any of the following: null, input, input input, etc.; i.e. it is possible to have multiple inputs occupying an INPUT field (allowed repetition 2).
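As an illustration of the specification in Figure 4(b), the sketch below (with hypothetical class and field names) models a template field by the sub-entity types allowed to occupy it and its allowed repetition, from which the sequences of a template such as the ssh one can be auto-generated.

```java
import java.util.List;

/** Hypothetical model of one field specification used for template auto-generation. */
class FieldSpec {
    final String name;              // e.g. "ID" or "INPUT"
    final List<String> occupants;   // sub-entity types allowed in this field (or "null" for a missing occupant)
    final int allowedRepetition;    // maximum number of occupants, e.g. 1 for ID, 2 for INPUT

    FieldSpec(String name, List<String> occupants, int allowedRepetition) {
        this.name = name;
        this.occupants = occupants;
        this.allowedRepetition = allowedRepetition;
    }
}

class SshTemplateSpecDemo {
    public static void main(String[] args) {
        // Field specifications following the examples given in the text.
        FieldSpec id = new FieldSpec("ID", List.of("cmdid"), 1);
        FieldSpec input = new FieldSpec("INPUT", List.of("null", "input"), 2);
        System.out.println(id.name + " allows " + id.occupants + " x" + id.allowedRepetition);
        System.out.println(input.name + " allows " + input.occupants + " x" + input.allowedRepetition);
    }
}
```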
Each sequence in a template is assigned an observation score called sequence frequency-inverse page frequency, sf-ipf (Equation (1)), which shows how significant it is. It is denoted as s in Figure 3. Equation (1) involves the total number of pages seen (for a document-based source like the Web, this is the total number of documents) and the count of the sequence in a template t (which has value 1 in this proposed approach). The definition of sf-ipf is given by Definition 6. sf-ipf is analogous to tf-idf, a well-known numeric statistic used in Information Retrieval and Text Mining for showing how important a word is in a document. In templates, there is also provision for defining a bias weight based on the source of the extracts, because for some sources (e.g. the Web) an extract can have a statistical advantage over others due to factors other than its information richness (e.g. social or commercial factors). Bias weights are specified using an approximated extracts distribution.
sf-ipf is a numerical statistical weight on a sequence (q) of types that reflects the significance of that sequence across all sequences (in a template t) which defines an extract.
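Equation (1) is not reproduced in the text above. By direct analogy with tf-idf, and using assumed notation of our own, sf-ipf can be read roughly as follows; this is a hedged reconstruction, not the exact published formula.

```latex
% Assumed reconstruction of Equation (1), by analogy with tf-idf; the notation is ours.
% sf(q,t): count of sequence q in template t (= 1 in this approach)
% N: total number of pages (documents) seen
% n_q: number of pages containing an extract matching q
\mathrm{sf\text{-}ipf}(q,t) \;=\; \mathrm{sf}(q,t)\cdot \log\frac{N}{n_q}
```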
3.5.1. Approximated Distribution
When parsing and matching an n-entity NHCE extract, we use n-bit binary flags to represent its matches, whereby each distinct combination of matched sub-entities yields a different flag. In the ssh command example, where extracts are 6-entity as described in Section 2, there are 32 different flags (100000 to 111111). Since it is a complex problem to model how sequences are distributed across a heterogeneous information source (e.g. the Web), our approach approximates this distribution using a minimum of information. We use the flag which corresponds to the commonly used extract occurrence and approximate a distribution from this flag value (as the mean) using a lognormal distribution (Equation (2)), under the following two assumptions.
Assumption 1: Extract occurrences are skew-distributed around a commonly used one, e.g. ssh -X user@host.
Assumption 2: The frequency of occurrence decreases with the number of matched sub-entities. For instance, it is more likely to encounter ssh user than ssh -F file.txt user@host.
Standard deviation provides a means to tune the distribution skewness depending on the desired extracts and respective source.
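Equation (2) is the standard lognormal density. Under our assumed notation, with the flag value of the commonly used occurrence mapped to the location parameter and the standard deviation controlling skewness (the mapping is our reading of the text, not a reproduced formula), it reads:

```latex
% Standard lognormal density used as the approximated distribution (Equation (2)).
% \mu: location parameter, set from the flag value of the commonly used occurrence (assumption)
% \sigma: standard deviation, used to tune the skewness
f(x;\mu,\sigma) \;=\; \frac{1}{x\,\sigma\sqrt{2\pi}}\,
\exp\!\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right), \qquad x > 0
```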
Validation is in two stages. The preliminary stage is during template matching (Figure 3) and the other is during validation and output generation. Sequences are validated using a series of validation operators, the first of which is statistical and based on sf-ipf. A sequence is considered valid and assigned a label once its sf-ipf has passed a threshold. Other case-specific operators are then applied and their labels assigned, and the various validation results are aggregated using a simple algorithm (Algorithm 1).
3.5.3. Role of Sequence Validation
In this approach, sequence validation plays two major roles. The first is to refine the just-extracted sequences of sub-entities by filtering out those which do not comply with the template through a matching process. The second is to validate template sequences using the statistical sf-ipf and other user operators (ontological and direct expert input).
3.6. Entity Transformation
To guide the transformation of an entity, we first characterize its field and then use the support of these characteristics to produce a transformation model. We statistically characterize entities using their observed structures (length and count) and types (b-type and u-type). The structural properties and types used are shown in Table 3.
Table 3. Entity characteristics.
During extraction and parsing, characterized features are formed for each entity field out of its occurrences. These are triplets of the form (characteristic, value, support). An occurrence of “-L 127.0.0.1” in the INPUT field of the ssh command could produce characteristics such as (“count”, 2, ·), (“b-type”, “alphanumeric”, ·), (“u-type”, “ip”, ·), etc. The support of a characteristic is the fraction of the characteristic’s count over all occurrences (Equation (3)).
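In our notation (the symbols are ours, since Equation (3) is not reproduced above), the support of a characteristic c of a field is simply its relative count over all occurrences observed for that field:

```latex
% Support of a characteristic c for a given field (reconstruction of Equation (3)).
% N_{\mathrm{occ}}: total number of occurrences observed for the field
\mathrm{support}(c) \;=\; \frac{\mathrm{count}(c)}{N_{\mathrm{occ}}}
```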
Using the support values of the four categories of characteristics in Table 3 (length, count, b-type and u-type), a set of fuzzy rules is produced to guide entity transformation. Transformation states are shown in Figure 5, and the conditions for the numbered transitions in Figure 5 are shown in Table 4.
4. Case Study
In this section we demonstrate the proposed approach and discuss its performance. We use Linux commands (ssh, find and shutdown) as desired extracts and demonstrate extraction and subsequent use in solving a command validation problem. In our case study, we treat the command validation problem as a classification problem. We use our approach to extract commands from the Web, parse them, validate them and ultimately label them to produce training data for building a classification model. In discussing and presenting this case study, we focus on one command only (ssh) and present the results for the others in the discussion (Section 4.5).
We implemented the approach in a Java environment using the tools shown in Table 5. Figure 6 shows the design schematic of the implemented system. The details of the implementation are as follows.
1) Template is implemented using RDF container sequences as shown in Figure 7. In a template, field types (e.g. option and destination) also act as pointers to files which define sub-entities (e.g. ip, email address, etc.). A template contained 12 ssh command sequences (shown in Table 7) which were auto-generated using the specifications shown in Figure 4(b). The shown bias weights and sf-ipf scores are those obtained after extraction and validation and are therefore described in Section 4.3.

Table 4. Entity transformation: Conditions.

Table 5. Used tools and environments.

Table 6. Sub-entity types of ssh command.

Table 7. Comma separated ssh template sequences.

Figure 5. Entity transformation: States.
2) Entity Type Definitions are shown in Table 6. We defined 10 u-types used in ssh commands.
3) Generic Rules, although they may seem complicated as shown in Figure 8, are easy to write because the details of sub-entity relationships are ignored at this stage. The rules do not cater for the COMMAND field of the ssh command.
4) Lookup Lists provide references to various NHCE keywords like cmdid (e.g. ssh), options (e.g. -l) etc.
5) Characterized Features give a statistical snapshot of observed sub-entities in the fields. Table 8 shows characteristic triplets in tabular form. The values in the table are discussed in Section 4.3.
6) Transformation Model is implemented using the C4.5 algorithm and is built out of manually crafted fuzzy rules based on the transitions shown in Figure 5 (an illustrative sketch follows this list).

Figure 6. Schematic of the implemented design.

Figure 7. Ssh template.

Figure 8. JAPE pattern rule for ssh commands.

Table 8. Field characteristic triplets in tabular form.
7) Only the template operator is used because we feel that the rationale behind any ontology or direct expert input would be ambiguous for evaluation purposes.
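The exact C4.5 implementation is listed in Table 5 and not reproduced here. As a hedged illustration, assuming a C4.5 implementation such as Weka's J48 classifier (our assumption), a transformation model trained on examples derived from the fuzzy rules could look roughly as follows; attribute names, class values and numbers are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;

import weka.classifiers.trees.J48;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

class TransformationModelSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative attributes: support values of field characteristics (Table 3 categories).
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(new Attribute("utype_support"));
        attrs.add(new Attribute("btype_support"));
        attrs.add(new Attribute("length_support"));
        // Illustrative target states, loosely following Figure 5 (b-type is not a final state).
        ArrayList<String> targets = new ArrayList<>(Arrays.asList("vtype", "utype", "failed"));
        attrs.add(new Attribute("target", targets));

        Instances data = new Instances("transformation", attrs, 0);
        data.setClassIndex(data.numAttributes() - 1);

        // Training rows would be generated from the manually crafted fuzzy rules (values illustrative).
        data.add(new DenseInstance(1.0, new double[] { 0.9, 0.7, 0.5, targets.indexOf("utype") }));
        data.add(new DenseInstance(1.0, new double[] { 0.1, 0.8, 0.6, targets.indexOf("vtype") }));
        data.add(new DenseInstance(1.0, new double[] { 0.0, 0.1, 0.2, targets.indexOf("failed") }));

        J48 tree = new J48();                  // Weka's open-source C4.5 implementation
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}
```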
4.2. Experimentation Data
We retrieved from the Web 120 html pages which contain information (usage and examples) about the ssh command, 50 html pages for the shutdown command and 70 for the find command. We used queries like “how to use...”, “linux command reference”, “how to connect using...”, “how to shutdown...”, “how to locate...”, “how to find...”, etc. against search engines to retrieve relevant pages.
4.3. Extraction, Validation and Results
This section and Section 4.4 are discussed using the ssh command. Documents were annotated using GATE’s ANNIE annotator, in which we used our generic rules. Annotation results showed that the documents contain about 959 ssh command candidates (extracts), with which we experimented. Our objectives were to investigate:
• Effectiveness of our post-rule processing in extracting and parsing NHCE.
• Performance of sf-ipf based validation method.
• Effects of bias weights.
• How rich the current Web content is in terms of diversity and utility of its extracts.
To realize these objectives, we manually prepared a benchmark list of 1250 valid ssh commands. We do not claim that this is a comprehensive list of ssh commands. The list covers only IPv4 and includes neither the COMMAND field nor redirections and pipelines (for simplicity and ease of experimentation). Threshold values were set proportionally, i.e. by assuming an even distribution across characteristic values, with the exception of the structural characteristics (count and length), which were set to match the b-type characteristics (for instance, the b-type threshold follows from the even split across the 3 b-types). The remaining thresholds were set in the same manner, and the sf-ipf threshold was set to the template average value of sf-ipf. The values of the bias weights used changed depending on the observed most common sequence. In the end, the bias weights were as shown in Table 7, with sequence No. 2 as the most common sequence and therefore used as the mean in the distribution of Equation (2).
4.3.1. Post-Rule Processing Results
Post-rule processing serves as both a parser and an initial validator. Out of the 959 candidate extracts, 275 were validated at this stage as containing ssh commands. These were contained in 56 of the 120 used documents, meaning that the positive predictive value (PPV) of Web retrieval was 47%. The distribution of extracts among documents is such that a document contains 4.9 extracts on average, the document with the highest number of extracts has 35, and the majority of documents contain only 1 extract. The validated candidates were also parsed into NHCE sub-entities, whose occurrences were used in characterizing the respective fields. Table 8 summarizes the characterized feature triplets and their observed support values.
4.3.2. Validation Results
In the template, 9 of the 12 template sequences were validated using sf-ipf. These resulted in 171 extracts, of which 70 were successfully transformed. For example, ssh -R 5555:localhost:3389 to ssh -R port:host:port is a successful transformation, while ssh -R 5555:localhost:3389 to ssh -R alphanumeric is a failed transformation. A field transformation into a non-existing type (neither a u-type nor a v-type) is considered a failure and renders the whole extract sequence transformation unsuccessful. As indicated in Figure 5, b-type is therefore not a final transformation state. Table 9 summarizes the results of transformation for each field.
When checked against the benchmark list, the precision of the successfully validated and transformed extracts was 83% while recall was 27.46%.
4.3.3. Effects of Bias Weights
Bias weights are meant to enable the user to reduce the effect of Web-inherent statistical advantage. By changing the parameters of the approximated distribution (Equation (2)), the significance of a template sequence can be tuned. Without bias weights, 9 out of 12 template sequences were validated, as pointed out earlier. With bias, however, sequence No. 6 is invalidated (Figure 9).
Table 9. Entity transformation summary.
Figure 9. Biasing sf-ipf.
4.4. Supervised Learning
The final validated and transformed extracts comprise sequences of v-types and u-types (e.g. ssh -v -l user host). These can be used for further validation of ssh commands, either directly as templates or as labeled examples in supervised learning. We experimented with the latter by using v-types and u-types as attributes in training datasets and building one-class SVM classifiers for validation of ssh commands. By using a one-class classifier, invalid commands are considered outliers. 10-fold cross-validation results, using the benchmark list as the testing dataset, yielded a classifier with an F-measure of 0.336 and an accuracy of 39.33%.
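The SVM implementation used is listed in Table 5 and not reproduced here. As one possible realization (our assumption), a one-class SVM over binary indicator attributes for the observed v-types and u-types can be trained with LIBSVM's Java bindings roughly as follows; the feature encoding and parameter values are illustrative only.

```java
import libsvm.*;

class OneClassCommandValidator {
    /** Train a one-class SVM on feature vectors of valid, transformed extracts. */
    static svm_model train(double[][] features) {
        svm_problem prob = new svm_problem();
        prob.l = features.length;
        prob.y = new double[features.length];       // labels are not used for one-class training
        prob.x = new svm_node[features.length][];
        for (int i = 0; i < features.length; i++) {
            prob.x[i] = toNodes(features[i]);
            prob.y[i] = +1;
        }
        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.ONE_CLASS;    // invalid commands are treated as outliers
        param.kernel_type = svm_parameter.RBF;
        param.gamma = 1.0 / features[0].length;      // illustrative default
        param.nu = 0.1;                              // fraction of training data allowed outside the boundary
        param.cache_size = 100;
        param.eps = 1e-3;
        return svm.svm_train(prob, param);
    }

    /** Returns +1 if the command is accepted as valid, -1 if it is an outlier (invalid). */
    static double predict(svm_model model, double[] feature) {
        return svm.svm_predict(model, toNodes(feature));
    }

    private static svm_node[] toNodes(double[] values) {
        svm_node[] nodes = new svm_node[values.length];
        for (int i = 0; i < values.length; i++) {
            nodes[i] = new svm_node();
            nodes[i].index = i + 1;                  // LIBSVM feature indices are 1-based
            nodes[i].value = values[i];
        }
        return nodes;
    }
}
```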
4.5. Discussion

The presented case study demonstrates our proposed approach in terms of how NHCE candidates are annotated from free text, parsed, validated, sub-entity transformed and finally used in validating similar NHCEs in a supervised learning setup. Table 10 summarizes the results for all three Linux commands which were experimented with.
Our rule-based annotation approach is rather brute-force in nature because relationships between sub-entities were not taken into account when writing the generic rules. The initial extracts were therefore noisy. For instance, a text fragment ssh -X -L 188.8.131.52 hostmachine led to two extract candidates: one with both -L 184.108.40.206 and hostmachine considered as occurrences of the same field, and another with -L 220.127.116.11 as an occurrence of one field and hostmachine as an occurrence of another. Nonetheless, the approach is preferred for a number of reasons, in addition to the fact that rule-based methods offer sound performance, as pointed out in Section 2. The first reason is that generic rules can be easily written and interpreted. The second reason is that, by considering all ambiguous text fragments in contention, fields in a template sequence are able to see more occurrences, which improves the richness of their entity characteristics. Improved entity characteristics lead to better entity transformation. In Table 9, for example, although only 10 different v-types appear in the final validated extracts, 14 were transformed.

Table 10. Results summary.
The use of hand-coded binding labels for parsing falls in the same line of reasoning as generic rules. These are equally important in order to maximize field observations in a top-down extraction approach.
The initial validation, during template matching, removes noisy extracts, a consequence of using generic rules. As presented, the initial validation saw candidate extracts reduced from 959 to 275 for the ssh command, which corresponds to 71% noise. shutdown and find do not appear as noisy, as shown in Table 10 (23.4% for shutdown and 22.1% for find), because of the relatively few documents used and their lesser popularity within Web content.
The sf-ipf based validation approach has shown strength in filtering valid, transformable extracts. This is supported by good precision values for all three commands (ssh 83.0%, shutdown 93.2% and find 63.2%). Because of the very low recall values, these results demonstrate neither the richness of Web content nor the practical utility of the extracts, but they do demonstrate the effectiveness of sf-ipf in filtering out the best NHCE candidates from the original extracts. From our experimentation, it is evident that sf-ipf alone does not produce comprehensive enough results for practical use. This is depicted by the low accuracy values when the extracts were used in supervised learning, and is a consequence of the method being only statistical and of the use of generic rules. For instance, since the occurrence shutdown -c is frequent, an extract shutdown -c now is likely to be validated even though it is not valid. As pointed out earlier, it is for this reason that our approach incorporates expert input in the form of either ontological operators or direct input operators. These operators are provided in terms of relationships between sub-entities, for example an ontology or list which describes all valid inputs for a particular input flag in ssh commands. When used, these improve the quality of the extracts. When we experimented with direct input by specifying all valid inputs for the most observed input flag (-L) in the ssh command, we noticed a 10% increase in supervised learning accuracy.
Another reason for the low accuracy values of the final classification models is the small number of documents used relative to what is available on the Web. Preparation of the experimentation documents was done manually and was therefore quite an expensive process.
5. Related Works
Although Bibpro is a citation parser, it resembles our work in the use of sequence templates. In Bibpro, structural properties and local properties of citation fields are used to create template sequences. This means that significant prior knowledge about these properties is needed in advance. In our approach, template sequences are created from user specifications about the sub-entity types which occupy NHCE fields. Also, Bibpro’s use of canonicalization makes it highly reliant on NLP. In terms of cardinality of sub-entities, “Booktitle” and “Month” are the fields of highest cardinality (12), which are nevertheless made up of only NLP keywords. We could not compare our approach against Bibpro because of differences in extract types and the difficulty of implementing Bibpro, as it is not open source.
The idea of using relational databases can also be applied to the recognition and classification of NHCEs. According to this idea, extracts are considered as database queries which have well-understood formats and in which relationships between sub-entities (columns) are not explicit. Although this idea is presented in an NLP context, it provides food for thought in dealing with NHCEs.
Transformation of sub-entities is closely related to the discovery of new attributes. Wong et al. employ semantic analysis on text fragments and use Bayesian learning to infer such attributes. The approach enables the discovery of attributes like “online price”, “publishing year”, etc. using an EM algorithm.
Statistical methods using state-of-the-art HMMs and CRFs are promising for dealing with NHCEs. The challenge, however, lies in making them easy to deploy, to understand and to integrate with expert input.
To the best of our knowledge, there is no published IE work on the extraction and classification of NHCEs. Most studied extracts are made up of keyword sub-entities, for which dictionaries and lists are preferred.
6. Conclusion

This paper presented an approach to recognize and classify Named High Cardinality Entities. The approach treats the sub-entities of an NHCE as an extract sequence and uses a series of three categories of operators to validate these extract sequences. The foundational operator is a template-based operator which scores template sequences with a statistical score called sequence frequency-inverse page frequency and uses these sequences to validate extract sequences. The approach also enables sub-entities to transform into either a value type (v-type) or a user-defined type (u-type). This transformation brings flexibility in how extraction knowledge is shared, how extract features are captured and how extracts are used.
We have demonstrated this approach using Linux commands as desired NHCEs. Experimentation results reveal the effectiveness of sequence-validation-based extraction in capturing both structural and user-defined features and types, characterizing them and subsequently using the characterized features to validate extract sequences. We have also shown how an approximated distribution can be used to tune the significance of template sequences as a way to regulate the statistical acceptability of extract sequences.
There are some areas where improvements are still needed in order to make this approach easier to use and deploy. These include automating the process of building a transformation model, improving the generic rules approach so that extracts are less noisy but still in sufficient numbers, and developing a framework by which ontological operators can be smoothly integrated. Other issues to be researched further concern alternatives to the generic rules approach in annotating and parsing NHCE extracts, e.g. statistical CRF-based methods.
Acknowledgements

A part of this work is supported by the third supplementary budget for 2011 of the Ministry of Internal Affairs and Communications of Japan, Research and Development of Applicable Resource Unit Construction and Reconstitution Technology for Communication Network on Large-Scale Disasters.