J. Biomedical Science and Engineering, 2008, 1, 204-209
Published Online November 2008 in SciRes. http://www.srpublishing.org/journal/jbise
Using schema transformation pathways for biological
data integration
Hao Fan & Fei Wang
International School of Software, Wuhan University, Hubei, P.R. China, 430072. Email: {hfan, feiwangg}@whu.edu.cn. Tel.: +86 27 6877-8605.
Received on September 10, 2008; revised and accepted on September 28, 2008
ABSTRACT
In web environments, proteomics data integra-
tion in the life sciences needs to handle the
problem of data conflicts arising from the het-
erogeneity of data resources and from incom-
patibilities between the inputs and outputs of
services used in the analysis of the resources.
The integration of complex, fast changing bio-
logical data repositories can be potentially sup-
ported by Grid computing to enable distributed
data analysis. This paper presents an approach
addressing the data conflict problems of pro-
teomics data integration. We describe a pro-
posed proteomics data integration architecture,
in which a heterogeneous data integration sys-
tem interoperates with Web Services and query
processing tools for the virtual and materialised
integration of a number of proteomics resources,
either locally or remotely. Finally, we discuss
how the architecture can be further used for
supporting data maintenance and analysis ac-
tivities.
Keywords: Proteomics, data integration, schema transformation pathways, Web services, BAV
1. INTRODUCTION
In the life sciences, as investigation into the human genome deepens, especially into its functionality, the study of proteomics has become a key issue of research. Since proteins reflect genome functionality, proteomics research investigates the structures and functions of proteins in order to interpret the mechanisms of variation of a life form in physiology or pathology. Issues relating to proteins' innate forms of existence and activity patterns, such as interpreting gene modifiers, protein interactions and configurations, require the study of the protein complement of the genome, i.e. proteomics, which is also an essential component of any comprehensive functional genomics study targeted at the elucidation of biological functions.
Proteome databases and genome pools are generally used for proteomics research. However, global protein expression analysis, in which the expression of all genes is monitored simultaneously, generates large amounts of data, and there is no universal system for the description of gene expression profiles. Global protein expression data are obtained predominantly as signal intensities on 2D protein gels.
A large amount of biological information is available
over the Internet, but the data are widely distributed and
it is therefore necessary to have efficient mechanisms for
data integration and data retrieval. Grid computing tech-
nologies are becoming established which enable distrib-
uted computational and data resources to be accessed in a
service-based environment. These technologies offer the
possibility of analysis of complex distributed
post-genomic resources. To support transparent access,
however, such heterogeneous resources need to be inte-
grated rather than simply accessed in a distributed fash-
ion.
This paper introduces a proteomics data integration architecture in which a heterogeneous data integration system interoperates with Web Services and query processing tools for the virtual and materialised integration of a number of local and remote proteomics resources. We also discuss how the architecture can be further used to support data maintenance and analysis activities, i.e. processing user queries, tracing data lineage, and maintaining data incrementally.
Paper outline: Section 2 gives an overview of protein database resources available on the Internet. Section 3 presents a web-based heterogeneous data integration architecture, including the BAV data integration technologies and Web services, the transformation pathways for creating the global schema, and the key issues addressed in applying the architecture to biological data integration. Finally, Section 4 gives our conclusions and directions of future work.
2. OVERVIEW OF PROTEOME DATABASE RESOURCES
Proteome databases are receiving more and more attention in proteomics research; they strive to provide a high level of annotation, such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc., together with a minimal level of redundancy and a high level of integration with other databases [2].
2.1. Protein Sequence Databases
2.1.1. General sequence databases
EXProt (see http://www.cmbi.kun.nl/EXProt/) is a non-redundant protein database containing a selection of entries from genome annotation projects and public databases, aiming to include only proteins with an experimentally verified function. Each entry has a unique ID number and contains information about the species, amino acid sequence, and functional annotation.
UniProt (see http://www.uniprot.org) was formed by uniting the activities of the Swiss-Prot, TrEMBL, and PIR protein databases. It provides a central resource on protein sequences and functional annotation, with three database components, each addressing a key need in protein bioinformatics.
TCDB (see http://www.tcdb.org) is a curated rela-
tional database containing sequence, classification, struc-
tural, functional and evolutionary information about
transport systems from a variety of living organisms.
TCDB is a repository for information compiled from more than 10,000 references, encompassing approximately 3,000 representative transporters and putative transporters, classified into over 400 families.
2.1.2. Protein properties
iProLINK (see http://pir.georgetown.edu/iprolink) fa-
cilitates text mining research in the area of litera-
ture-based database curation, named entity recognition,
and protein ontology development. This collection of
annotated data sources can be utilized by computational
and biological researchers to explore literature infor-
mation on proteins and their features.
PFD (see http://www.foldeomics.org/pfd/) is a searchable collection of all annotated structural, methodological, kinetic and thermodynamic data relating to experimental protein folding studies. Its database structure allows visualization of folding data in a useful and novel way, with the aim of facilitating data mining and bioinformatics approaches.
PINT (see http://www.bioinfodatabase.com/pint/) contains data on several thermodynamic parameters along
with sequence and structural information, experimental
conditions and literature information. Each entry contains
numerical data for the free energy change, dissociation
constant, association constant, enthalpy change, heat ca-
pacity change and so on of the interacting proteins upon
binding, which are important for understanding the
mechanism of protein-protein interactions.
2.1.3. Protein sequence motifs
PROSITE (see http://www.expasy.org/prosite) is a large
collection of biologically meaningful signatures de-
scribed as patterns or profiles. Each signature is linked to documentation that gives useful biological information
on the protein family, domain, or functional site identi-
fied by the signature.
Blocks (see http://blocks.fhcrc.org) is a database of ungapped multiple alignments corresponding to the most conserved regions of proteins, constructed from documented families of related proteins. A Blocks multiple alignment consists of ungapped conserved regions separated by unaligned regions of variable size.
PRINTS (see http://www.bioinf.man.ac.uk/dbbrowser/) houses a collection of protein family fingerprints, which may be used to make familial and tentative functional assignments for uncharacterised sequences. It specializes in the provision of hierarchical classifications of protein superfamilies, allowing fine-grained diagnoses, and provides the bulk of the hierarchical family annotation in InterPro. PRINTS also underpins the Blocks database from Seattle and the eMOTIF resource from Stanford.
2.1.4. Protein domains and classifications
iProClass (see http://pir.georgetown.edu/iproclass/) pro-
vides an integrated view of protein information and
serves as a bioinformatics framework for data integration
and associative analysis of proteins. It presents
value-added descriptions of all proteins in UniProtKB
and contains comprehensive, up-to-date protein information derived from over 90 biological databases. The
source databases include those of protein sequence, fam-
ily, function, pathway, protein-protein interaction, com-
plex, post-translational modification, protein expression,
structure and structural classification, gene and genome,
gene expression, disease, ontology, literature, and tax-
onomy.
2.2. Protein Structure Databases
PDB (see http://www.rcsb.org/pdb/) is the single world-
wide archive of structural data of biological macromole-
cules. The architecture and functionality of the systems used to collect, archive, distribute, and query the data have been described previously [3].
SCOP (see http://scop.mrc-lmb.cam.ac.uk/scop) pro-
vides a comprehensive and detailed description of the
evolutionary and structural relationships of the proteins
of known structure. It embodies an evolutionary classifi-
cation produced by human experts. This allows users to
use a theory of protein evolution that encompasses our
knowledge of the great variety, and the full extent, of the
different types of changes that take place during evolu-
tion.
CATH (see http://www.biochem.ucl.ac.uk/bsm/cathnew) currently contains 34,287 domain structures classified into 1,383 superfamilies and 3,285 sequence families. Each structural family is expanded with domain sequence relatives recruited from GenBank using a variety of efficient sequence search protocols and reliable thresholds. New sequence search protocols have been designed, based on these intermediate sequence libraries, to allow more regular updating of the classification.
2.3. Proteomics Resources
PEDRo (see http://pedrodb.man.ac.uk:8080/pedrodb)
provides access to a collection of descriptions of experi-
mental data sets in proteomics [8]. It was one of the
first databases used for storing proteomics experimental data and has also been used as a format for exchanging proteomics data.

Figure 1. A Proteomics Data Integration Architecture.

The integration of such a comprehen-
sive data model, in addition to its data content, provides
means for capturing a significant proportion of the pro-
teomic information that is captured by other proteomics
repositories.
gpmDB (see http://gpmdb.thegpm.org) is a publicly available database with over 2,200,000 proteins and almost 470,000 unique peptide identifications [4]. Al-
though the gpmDB is restricted to minimal information
relating to the protein/peptide identification, it provides
access to a wealth of interesting and useful peptide iden-
tifications from a range of different laboratories and in-
struments.
PepSeeker (see http://nwsr.smith.man.ac.uk/pepseeker)
is a database targeted directly at the identification stage
of the proteomics pipeline. It captures the identification
allied to the peptide sequence data, coupled to the under-
lying ion series and as a result it is a comprehensive re-
source of peptide/protein identifications [11]. The repository currently holds over 50,000 proteins and 50,000 unique peptide identifications.
PRIDE (see http://www.ebi.ac.uk/pride/) is a database
of protein and peptide identifications that have been de-
scribed in the scientific literature. These identifications
will typically be from specific species, tissues and
sub-cellular locations, perhaps under specific disease
conditions and may be annotated with supporting mass
spectra. PRIDE can be searched by experiment accession
number, protein accession number, literature reference
and sample parameters including species, tissue,
sub-cellular location and disease state. Data can be re-
trieved as machine-readable PRIDE or mzData XML, or
as human-readable HTML.
3. DATA INTEGRATION ARCHITECTURE
Figure 1 illustrates a heterogeneous data integration architecture in Web environments. In this architecture, remote data sources use web service platforms to support data exchange and access by external users. The BAV data integration system interoperates with web services and query processing tools for the virtual and materialised integration of a number of distributed data resources.

Specifically, in the scenario of integrating local databases, BAV wrappers apply to the local data sources directly, extracting data and data structure from the data sources and serving global user queries. In the scenario of web data integration, BAV wrappers exchange data with web services in order to access the remote data sources. Both virtual and materialised integrated views can be created by the data integration system.
In this section, we discuss how this architecture can be used for integrating proteomics data resources.
3.1. Data Integration Technologies
BAV Data Integration. To date, most data integration approaches have been either global-as-view (GAV) or local-as-view (LAV). In GAV, the constructs of a global schema are described as views over local schemas. In LAV, the constructs of a local schema are defined as views over a global schema. One disadvantage of GAV and LAV is that they do not readily support the evolution of both local and global schemas: GAV does not readily support the evolution of local schemas, while LAV does not readily support the evolution of global schemas. Furthermore, both GAV and LAV assume one common data model for the data transformation and integration process, typically the relational data model.
Both-as-view (BAV) is a new data integration ap-
proach based on the use of reversible sequences of primi-
tive schema transformations [10]. From these sequences,
it is possible to derive a definition of a global schema as
a view over the local schemas, and it is also possible to
derive a definition of a local schema as a view over a
global schema. BAV can therefore capture all the semantic information that is present in LAV and GAV derivation rules. A key advantage of BAV is that it readily supports the evolution of both local and global schemas, allowing transformation sequences and schemas to be incrementally modified as opposed to having to be regenerated.
Another advantage is that BAV offers the capability to handle virtual, materialised and indeed hybrid data integration across multiple data models. This is because
BAV supports a low-level hypergraph-based data model
(HDM) and provides facilities for specifying higher-level
modelling languages in terms of this HDM. For any
modelling language M specified in this way, the approach
provides a set of primitive schema transformations that
can be applied to schema constructs expressed in M. In
particular, for every construct of M there is an add and a
delete primitive transformation which add to/delete from
a schema an instance of that construct. For those con-
structs of M which have textual names, there is also a
rename primitive transformation. BAV schemas can be
incrementally transformed by applying to them a se-
quence of primitive transformations, each adding, delet-
ing or renaming just one schema construct.
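As a concrete illustration of this style of transformation, the following is a minimal sketch of ours (not the AutoMed implementation; all names and the query strings are illustrative), showing a schema being transformed one construct at a time by add, delete and rename steps:

```python
# Minimal sketch of BAV-style primitive schema transformations: each
# step produces a new schema differing in exactly one construct.
from dataclasses import dataclass, field

@dataclass
class Schema:
    constructs: dict = field(default_factory=dict)  # name -> defining query

def add(schema: Schema, name: str, query: str) -> Schema:
    """Introduce a construct whose extent is derivable via `query`."""
    s = Schema(dict(schema.constructs))
    s.constructs[name] = query
    return s

def delete(schema: Schema, name: str, query: str) -> Schema:
    """Drop a construct; `query` records how its extent could be rebuilt."""
    s = Schema(dict(schema.constructs))
    del s.constructs[name]
    return s

def rename(schema: Schema, old: str, new: str) -> Schema:
    """Rename a construct that has a textual name."""
    s = Schema(dict(schema.constructs))
    s.constructs[new] = s.constructs.pop(old)
    return s

# Applying a short pathway, one construct at a time:
s0 = Schema({"pedro:proteinhit": "source extent"})
s1 = add(s0, "global:proteinhit", "[p | p <- pedro:proteinhit]")
s2 = rename(s1, "pedro:proteinhit", "pedro:hit")
print(s2.constructs)
```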
Each add and delete transformation is accompanied by
a query specifying the extent of the added or deleted con-
struct in terms of the rest of the constructs in the schema.
This query is expressed in IQL, a functional, comprehensions-based query language; we refer the reader to [9] for details of its syntax, semantics and implementation. Such languages subsume query languages such as SQL-92 and OQL in expressiveness [1].
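To give a flavour of comprehension syntax (IQL itself is not shown here; this is only a Python analogue of a simple selection query, with invented data):

```python
# Comprehension syntax in Python as a rough analogue of an IQL query.
proteinhit = [("P12345", 67.2), ("Q99999", 12.0)]  # (protein, score) pairs

# SQL: SELECT protein FROM proteinhit WHERE score > 40
high_scoring = [protein for (protein, score) in proteinhit if score > 40]
print(high_scoring)  # ['P12345']
```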
Also available are extend and contract primitive trans-
formations which behave in the same way as add and
delete except that they state that the extent of the
new/removed construct cannot be precisely derived from
the other constructs present in the schema. More specifi-
cally, each extend and contract transformation takes a
pair of queries that specify a lower and an upper bound
on the extent of the construct. The lower bound may be
Void and the upper bound may be Any, which respec-
tively indicate no known information about the lower or
upper bound of the extent of the new construct. The que-
ries supplied with primitive transformations can be used
to translate queries or data along a transformation path-
way.
A sequence of primitive transformations from one
schema S1 to another schema S2 is termed a pathway from
S1 to S2. All source, intermediate, and integrated schemas,
and the pathways between them, are stored in a Schemas
& Transformations Repository.
Web Services. Web services provide a standard means
of interoperating between different software applications
(see http://www.w3.org/TR/ws-arch/), running on a
variety of platforms and/or frameworks. A Web service is
a software system designed to support interoperable ma-
chine-to-machine interaction over a network. It has an
interface described in a machine-processable format
(specifically Web Services Description Language,
WSDL). Other systems interact with the Web service in a
manner prescribed by its description using SOAP mes-
sages, typically conveyed using HTTP with an XML
serialization in conjunction with other Web-related stan-
dards.
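For instance, a client might invoke such a service by posting a SOAP envelope over HTTP; in the sketch below the endpoint, operation name and message body are hypothetical placeholders, not an actual proteomics service:

```python
# Minimal SOAP request over HTTP; endpoint and operation are invented.
import requests

envelope = """<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <getProteinHit xmlns="http://example.org/proteomics">
      <accession>P12345</accession>
    </getProteinHit>
  </soap:Body>
</soap:Envelope>"""

response = requests.post(
    "http://example.org/proteomics-service",   # hypothetical endpoint
    data=envelope.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8",
             "SOAPAction": "getProteinHit"},
)
print(response.status_code, response.text[:200])
```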
3.2. Data Integration Using Transformation
Pathways
The aim of the proteomics data integration is to build
technologies providing an environment for integrating
proteomics data, constructing and executing analyses
over such data, and a library of proteomics-aware com-
ponents that can act as building blocks for such analyses.
A web-based architecture enables the use of existing proteomics data resources, the creation of new resources, and the production of middleware technologies for the integration of protein data resources, including tools for data integration, data analysis, user query processing, visualisation applications and other types of client for biologist end users.
This section gives examples of using transformation pathways for creating a global schema. Suppose the table ProteinHit (all peptides matched, expect, score, threshold, protein, peptidehit), denoted ⟨⟨proteinhit⟩⟩, is in the global schema and is composed of data originally from four different databases: PEDRo, gpmDB, PepSeeker and PRIDE. The following transformation pathways are used to create the ⟨⟨proteinhit⟩⟩ construct, in which id2lsid is an IQL built-in function used to transform id features in the source schemas into the lsid feature of the global schema; an illustrative sketch of one such step is given after the listing.
1. Creating ProteinHit from PEDRo database:
2. Creating ProteinHit from gpmDB database:
3. Creating ProteinHit from PepSeeker database:
4. Creating ProteinHit from PRIDE database:
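The actual pathway listings are not reproduced above. Purely as a hypothetical illustration of what one such add step does, the following sketch populates the global construct from PEDRo-style records with an id2lsid-style helper; all field names here are invented for the example:

```python
# Hypothetical sketch only, not the paper's actual pathway: the effect of
# one add-style step populating the global proteinhit construct from
# PEDRo-style records, via a stand-in for the IQL id2lsid built-in.
def id2lsid(source: str, local_id: int) -> str:
    # Stand-in for the IQL id2lsid built-in described in the text.
    return f"urn:lsid:{source}:proteinhit:{local_id}"

pedro_proteinhit = [
    {"id": 17, "protein": "P12345", "score": 67.2, "threshold": 40.0},
]

# IQL-like comprehension, written in Python:
global_proteinhit = [
    {"lsid": id2lsid("pedro", r["id"]),
     "protein": r["protein"],
     "score": r["score"],
     "threshold": r["threshold"]}
    for r in pedro_proteinhit
]
print(global_proteinhit)
```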
3.3. Using Schema Transformation Pathways
In the previous section we showed how schema transformation pathways can be used for expressing the process of generating a global schema. In this section, we discuss how the transformation pathways can be used for the following further activities.
1) Schema Evolution: A recurring issue within the data
integration architecture is that the source schemas will
change as the owners of these autonomous data sources
evolve them over time. Changes in global schemas may
also be needed in order to support new requirements of
the client components. An advantage of the BAV ap-
proach over GAV or LAV data integration is that it read-
ily supports the evolution of both source and global
schemas by allowing transformation pathways to be ex-
tended — this means that the entire integration process
does not have to be repeated, and the schemas and path-
ways can instead be ‘repaired’. This process can be handled largely automatically, except in cases where new information content is being added to schemas, for which domain or expert human knowledge is needed regarding the semantics of the new schema constructs.
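Under the assumption that a pathway is stored as a sequence of steps (as sketched earlier), such a repair can amount to composing pathways rather than re-running the integration; a minimal illustration with hypothetical step strings:

```python
# Sketch: repairing an integration after source schema evolution. Suppose
# the stored pathway was LS_old -> GS and the owner evolves LS_old into
# LS_new. Because BAV transformations are reversible, the reversed
# evolution pathway LS_new -> LS_old can be prepended to the old pathway,
# giving LS_new -> GS without redoing the whole integration.
old_pathway = [
    "add <<global:proteinhit>> [p | p <- <<pedro:proteinhit>>]",
]                                                    # LS_old -> GS
evolution_reversed = [
    "rename <<pedro:hit>> <<pedro:proteinhit>>",
]                                                    # LS_new -> LS_old
repaired_pathway = evolution_reversed + old_pathway  # LS_new -> GS
print(repaired_pathway)
```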
2) Transforming Data Schemas: In previous work [7], we
show that transformation pathways can be used for ex-
pressing the process of generating global schemas in bio-
logical data integration environments. Each transforma-
tion step is accompanied by an add/delete operation with
a query specifying the extent of the added or deleted con-
struct in terms of the rest of the constructs in the original
schema.
3) Processing User Queries: A WS wrapper imports schema information from any data source, via web services, into a metadata repository. Thereafter, schema transformation/integration functionality can be used to create one or more virtual global schemas, together with the transformation pathways between these and the BAV representations of the data source schemas. Queries posed on a virtual global schema can be submitted to a query processor, which interacts with the web services via WS wrappers to evaluate these queries. After the integration of the data sources, the user is thus able to submit to the query processor a query to be evaluated with respect to a virtual global schema.
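One common way to evaluate such queries is to rewrite them along the pathway, substituting each global construct by the query recorded in the add step that created it. The following sketch (hypothetical names, string-based purely for brevity) illustrates the idea:

```python
# Sketch of query reformulation: a query on the virtual global schema is
# unfolded by substituting, for each global construct, the definition
# recorded in the pathway step that created it.
view_definitions = {
    "global:proteinhit": "[p | p <- pedro:proteinhit]",  # from the add step
}

def unfold(global_query: str) -> str:
    """Replace global construct names by their pathway-derived definitions."""
    for construct, definition in view_definitions.items():
        global_query = global_query.replace(construct, f"({definition})")
    return global_query

print(unfold("[p | p <- global:proteinhit; score p > 40]"))
```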
4) Tracing Data Lineage: In the data integration archi-
tecture, proteomics data is integrated from distributed,
autonomous and heterogeneous data sources, in order to
enable analysis and mining of the integrated information.
However, in addition to analyzing the data in the inte-
grated Grid, we sometimes also need to investigate how
certain integrated data was derived from the data sources,
which is the problem of data lineage tracing (DLT).
In [5], we present a DLT approach that uses the individual steps of these pathways to compute the lineage of the tracing data by traversing the pathways in reverse order, one step at a time. In particular, suppose a data source LD with schema LS is transformed into a global database GD with schema GS, and the transformation pathway LS → GS is ts1, …, tsn. Given tracing data td belonging to the extent of some schema construct in GD, we first find the transformation step tsi which creates that construct and obtain td's lineage, dli, from tsi. We then continue by tracing the lineage of dli from the remaining transformation pathway ts1, …, tsi−1. We continue in this fashion until we obtain the final lineage data from the data source LD.
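The loop below is a simplified sketch of this procedure, assuming each pathway step records which construct it creates and how lineage can be recovered through it (both representations are hypothetical, not the implementation of [5]):

```python
# Sketch of the lineage-tracing loop: walk the pathway in reverse, find
# the step that created the traced construct, recover the lineage it
# records, then keep tracing over the remaining prefix of the pathway.
def trace_lineage(pathway, construct, tracing_data):
    lineage = tracing_data
    for i in range(len(pathway) - 1, -1, -1):
        step = pathway[i]
        if step["creates"] == construct:
            lineage = step["lineage_of"](lineage)  # td's lineage at this step
            construct = step["source"]             # continue on ts1..ts_{i-1}
    return lineage

pathway = [
    {"creates": "mid:hit", "source": "pedro:proteinhit",
     "lineage_of": lambda td: {"pedro_row": td}},
    {"creates": "global:proteinhit", "source": "mid:hit",
     "lineage_of": lambda td: td},
]
print(trace_lineage(pathway, "global:proteinhit",
                    {"lsid": "urn:lsid:pedro:proteinhit:17"}))
```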
5) Maintaining Integrated Data Incrementally: The
global schema might be materialised by creating an inte-
grated database. A problem relating to materialised inte-
grated data is view maintenance. The materialised inte-
grated data need to be maintained either when the data of
a data source changes, or if there is an evolution of a data
source schema. If the source data is updated, the inte-
grated data has to be refreshed also so as to keep it
up-to-date.
If a materialised construct c is defined by a query q over other materialised constructs, [6] gives formulae for incrementally maintaining c if one of its ancestor constructs ca has new data inserted into it (an increment) or data deleted from it (a decrement). We do not actually use the whole view definition q generated for c, but instead track the changes from ca through each step of the pathway. In particular, at each add or rename step we use the set of increments and decrements computed so far to compute the increment and decrement for the schema construct being generated by this step of the pathway.
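A minimal sketch of this step-wise propagation, under the assumption that each add or rename step can be applied to individual rows (field names hypothetical, not the formulae of [6]):

```python
# Sketch of step-wise incremental maintenance: increments/decrements from
# a source construct are pushed through each pathway step rather than
# re-evaluating the whole view definition q of the materialised construct.
def propagate(steps, increment, decrement):
    for step in steps:
        increment = [step(row) for row in increment]
        decrement = [step(row) for row in decrement]
    return increment, decrement

# Per-row effect of an add step (derive lsid) and a rename-style step
# (drop the source id); both are illustrative only:
steps = [
    lambda r: {**r, "lsid": f"urn:lsid:pedro:proteinhit:{r['id']}"},
    lambda r: {k: v for k, v in r.items() if k != "id"},
]
inc, dec = propagate(steps, [{"id": 18, "protein": "Q99999"}], [])

materialised_view = []                    # the stored extent of c
materialised_view += inc                  # apply the increment
materialised_view = [r for r in materialised_view if r not in dec]
print(materialised_view)
```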
6) Extendable Exploitations: In peer-to-peer (P2P) sys-
tems, a number of autonomous servers, or peers, share
their computing resources, data and services. The loose
and dynamic association between peers has meant that, to
date, P2P systems have been based on the exchange of
files identified by a limited set of attributes. The lack of
information about the data within these files makes it
impossible to support general-purpose mechanisms by
which peers can exchange and translate heterogeneous
data. How the heterogeneous data integration architecture can be used to integrate proteome data in different peers would be an extendable exploitation of our work.
4. CONCLUSION
In this paper, we have given a brief overview of protein database resources available on the Internet, and presented an architecture combining web services and data integration software tools over autonomous data resources, which enables distributed query processing together with the resolution of semantic heterogeneity across those resources. We have also discussed the key issues in applying the architecture to biological data integration, namely schema evolution, transforming data schemas, processing user queries, tracing data lineage, maintaining integrated data incrementally, and extendable exploitations.
The final platform will provide researchers with more
information than any of the resources alone, so allowing
them to perform analyses that were previously prohibi-
tively difficult or impossible. This integration process
both builds on and provides impetus to the development
of data standards in the proteomics and related domains.
Future work includes designing and implementing wrappers for extracting data and data structure over web services, implementing DLT and IVM algorithms over the architecture, and investigating the extension of the system into P2P environments.
REFERENCES
[1] P. Buneman et al. (1994) Comprehension syntax. SIGMOD Re-
cord, 23(1):87–96.
[2] A. Bairoch and R. Apweiler. (2000) The SWISS-PROT protein
sequence database and its supplement TrEMBL in 2000. Nucleic
Acids Res, 28:45–48.
[3] H.M. Berman, J. Westbrook, and et al. (2000) The Protein Data
Bank. Nucleic Acids Res, 28:235–242.
[4] R. Craig, J.P. Cortens, and R.C. Beavis. (2004) Open source sys-
tem for analyzing, validating, and storing protein identification
data. Journal of Proteome Research, 3(6).
[5] H. Fan and A. Poulovassilis. (2005) Using schema transformation
pathways for data lineage tracing. In Proc. BNCOD’05, LNCS
3567, pages 133–144.
[6] H Fan. (2005) Investigating a Heterogeneous Data Integration
Approach for Data Warehousing. PhD thesis, Birkbeck College,
University of London.
[7] H. Fan and L. Li. (2007) Study on Metadata Applications for Pro-
teomics Data Integration. In Proc. ICBBE’07, IEEE.
[8] K. Garwood et al. (2004) Pedro: A database for storing, searching
and disseminating experimental proteomics data. BMC Genomics,
5.
[9] E. Jasper, A. Poulovassilis, and L. Zamboulis. (2003) Processing IQL queries and migrating data in the AutoMed toolkit. Technical Report 20, AutoMed Project.
[10] P. McBrien and A. Poulovassilis. (2003) Data integration by
bi-directional schema transformation rules. In Proc. ICDE’03,
pages 227–238.
[11] T. McLaughlin, J. A. Siepen, J. Selley, J. A. Lynch, K. W. Lau, H.
Yin, S. J. Gaskell, and S. J. Hubbard. (2006) PepSeeker: a database
of proteome peptide identifications for investigating fragmentation
patterns. Nucleic Acids Research, 34.
[12] D.N. Perkins, D.J. Pappin, D.M. Creasy, and J.S. Cottrell. (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20(18).
[13] L. Zamboulis, H. Fan, et al. (2006) Data Access and Integration in the ISPIDER Proteomics Grid. In Proc. DILS, pages 3–18.