MicrobIdentifier: A Microbial Identification Software Based on Mass-Spectrometry

doi:10.4236/jsea.2009.23028

J. Software Engineering & Applications, 2009, 2: 206-208

doi:10.4236/jsea.2009.23028 Published Online October 2009 (http://www.SciRP.org/journal/jsea)

Feng LIU, Lu LI, Chi ZHANG, Lingbing WANG, Pei LI

International School of Software, Wuhan University, Wuhan, China.

Email: wolflf@126.com, {lulu.li1989, chzhcn88}@gmail.com

Received May 18th, 2009; revised July 5th, 2009; accepted July 16th, 2009.

ABSTRACT

As microbial identification by mass cataloging has become widely used, we have developed the microbial identification software MicrobIdentifier, which integrates and automates the different steps in the procedure of rapid species identification based on mass-spectrometry. The software is written in Java so that it can run across platforms.

Keywords: Microbial Identification, Mass-Spectrometry

1. Introduction

With the development of the technology, microbial identification by mass cataloging has attracted considerable attention due to its high efficiency and automation. To further improve the efficiency and automation of this technique, we have developed microbial identification software based on the spectral coincidence function proposed in [1]. The software has two major functions. First, it searches for all possible primer pairs among the given genes of different species and evaluates these candidates by giving each pair a score, which has proved to be a useful reference during primer design. Second, it uses the spectral coincidence function to compare mass spectrometric observables with theoretical fragmentation patterns, and thereby determines the genetic affinity between the sample gene and the genes of known species in the database. This frees researchers from comparing the fragmentation patterns manually.

2. Algorithm

The core algorithm on which our work is based is the spectral coincidence function proposed in [1]:

$$C_{ij} = C(M_i, M_j) = \frac{2\, M_i \cdot M_j}{(M_i \cdot M_i) + (M_j \cdot M_j)}$$

The dot-product in the coincidence function is defined as

$$M \cdot M' = \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \delta(m_i - m'_j)$$

where $M$ is the mass vector of one sample’s fragmentation, which has $N_1$ elements with $m_i$ standing for the $i$th element, while $M'$ is the mass vector of the other sample, which has $N_2$ elements with $m'_j$ standing for the $j$th element. The discrete delta function $\delta$ is:

$$\delta(k) = \begin{cases} 1, & k = 0 \\ 0, & \text{otherwise} \end{cases}$$

Based on these formulas, the inner product is greater when the two samples have more fragments of the same mass. The coincidence function normalizes the inner-product value to a range between zero and one, and a higher value of the coincidence function indicates more similarity between the two genes being compared. Therefore, this function can be used to score similarity in both the primer search process and the identification process.

The algorithm of the primer search process is as follows:

1) Align all the gene sequences with the ClustalW algorithm [3].

2) Find regions where all the sequences have more than N nucleotides at the same place and in the same order; these are the conserved regions. If fewer than two such regions are found, exit.

3) Take two conserved regions and check whether the number of nucleotides is more than M. If not, take another pair of regions.

4) Cut the region spanned by the two conserved regions (conserved regions included) after every “G”, and filter out the fragments that have fewer than L nucleotides.

5) Calculate the mass of every fragment of each sequence, and form the sequence’s mass vector from these masses.

6) Take the mass vectors of one pair of gene sequences and calculate the score indicating their similarity using the coincidence function.

7) Repeat Step 6 until every pair of gene sequences has been compared. Calculate the average of all the scores obtained in Step 6. This average is the final score of the primer pair chosen in Step 3.

8) Repeat Steps 3 to 7 until all combinations of the conserved regions have been considered.
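Steps 4 to 7 can be sketched roughly as below. This is an illustration under stated assumptions, not the software's implementation: the sequences are assumed to be RNA strings over A, C, G, U; the residue masses are approximate average values chosen for the example, stored in tenths of a dalton so that comparisons are exact; end-group corrections are ignored; and all function names are hypothetical. At least two sequences are required.

```python
from collections import Counter
from itertools import combinations

# Assumed approximate residue masses in tenths of a dalton (illustrative only).
RESIDUE_MASS = {"A": 3292, "C": 3052, "G": 3452, "U": 3062}


def cut_after_g(seq, min_len):
    """Step 4: cut after every G and drop fragments shorter than min_len."""
    fragments, start = [], 0
    for i, base in enumerate(seq):
        if base == "G":
            fragments.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        fragments.append(seq[start:])  # trailing fragment that does not end in G
    return [f for f in fragments if len(f) >= min_len]


def mass_vector(seq, min_len):
    """Step 5: one summed residue mass per surviving fragment."""
    return [sum(RESIDUE_MASS[b] for b in f) for f in cut_after_g(seq, min_len)]


def coincidence(m1, m2):
    """Step 6: spectral coincidence with the exact delta function."""
    c1, c2 = Counter(m1), Counter(m2)
    cross = sum(n * c2[m] for m, n in c1.items())
    denom = sum(n * n for n in c1.values()) + sum(n * n for n in c2.values())
    return 2 * cross / denom if denom else 0.0


def primer_pair_score(regions, min_len):
    """Step 7: average coincidence over every pair of sequences."""
    vectors = [mass_vector(r, min_len) for r in regions]
    scores = [coincidence(a, b) for a, b in combinations(vectors, 2)]
    return sum(scores) / len(scores)


print(cut_after_g("AUGCCGAU", 2))  # ['AUG', 'CCG', 'AU']
```

Because a fragment's mass depends only on its composition, fragments such as AUG and UAG are indistinguishable in this sketch.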

Optimal primer pairs are those conserved regions with highly variable regions in between. A primer pair with a lower score is better than one with a higher score, since a lower score means less similarity among the sequences in the region the pair spans, so the test samples can be identified much more easily in the identification process.
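The search for conserved regions (Step 2) and the enumeration of candidate region pairs (Step 3) might look like the following sketch. It assumes the aligned sequences are equal-length strings with "-" for gaps, as in a typical alignment output. The `min_gap` and `max_gap` parameters are one possible reading of the threshold in Step 3, borrowed from the minimum and maximum variable-region lengths mentioned in Section 3. All names are hypothetical.

```python
from itertools import combinations


def conserved_regions(aligned, min_len):
    """Return (start, end) column ranges, end exclusive, in which every sequence
    carries the same nucleotide for more than min_len consecutive columns."""
    width = len(aligned[0])
    regions, start = [], None
    for col in range(width + 1):
        same = (
            col < width
            and aligned[0][col] != "-"
            and all(s[col] == aligned[0][col] for s in aligned)
        )
        if same and start is None:
            start = col  # a conserved run begins here
        elif not same and start is not None:
            if col - start > min_len:
                regions.append((start, col))
            start = None
    return regions


def candidate_pairs(regions, min_gap, max_gap):
    """Keep pairs of regions whose separating stretch has an acceptable length."""
    pairs = []
    for left, right in combinations(regions, 2):
        gap = right[0] - left[1]
        if min_gap <= gap <= max_gap:
            pairs.append((left, right))
    return pairs


aligned = ["ACGUAAAGGCC", "ACGUCCAGGCC"]
print(conserved_regions(aligned, 3))  # [(0, 4), (6, 11)]
```

Each surviving pair would then be scored as described in Steps 4 to 7, and the pairs sorted by ascending score so that the most discriminating candidates come first.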

The algorithm of the identification process is almost the same as Steps 3 to 6 of the primer search process, with one exception: in the identification process, the experimental data are compared with the computed mass vectors in the database. A higher score indicates more genetic affinity, suggesting a higher probability that the sample belongs to the same species.

Given inevitable experimental inaccuracy, the discrete delta function $\delta$ is further modified to be:

$$\delta(k) = \begin{cases} 1, & |k| \le \text{tolerance} \\ 0, & \text{otherwise} \end{cases}$$

Thus, mass differences within the tolerance are ignored.
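A rough sketch of the tolerance-based comparison, and of ranking database entries against an observed spectrum, is given below. The function names, the dictionary layout of the database, and the example masses are all hypothetical. Whether the boundary case is inclusive is also an assumption.

```python
def dot_tol(m1, m2, tolerance):
    """Double sum of the modified delta: count pairs whose masses differ by at most tolerance."""
    return sum(1 for a in m1 for b in m2 if abs(a - b) <= tolerance)


def coincidence_tol(m1, m2, tolerance):
    """Spectral coincidence using the tolerance-based delta function."""
    denom = dot_tol(m1, m1, tolerance) + dot_tol(m2, m2, tolerance)
    if denom == 0:
        return 0.0  # assumption: two empty mass vectors score zero
    return 2 * dot_tol(m1, m2, tolerance) / denom


def identify(observed, database, tolerance):
    """Score the observed masses against every database entry, highest score first."""
    scored = [
        (coincidence_tol(observed, masses, tolerance), name)
        for name, masses in database.items()
    ]
    return sorted(scored, reverse=True)


observed = [1000.2, 2000.1, 3000.0]
database = {
    "species_a": [1000.0, 2000.0, 3000.0],  # hypothetical theoretical masses
    "species_b": [1000.0, 2500.0, 3500.0],
}
print(identify(observed, database, 0.5)[0])  # (1.0, 'species_a')
```

One caveat of this sketch: when peaks within a single vector lie closer together than the tolerance, one mass can match several others, and the score is then no longer guaranteed to stay at or below one.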

3. Software

The software accepts a FASTA file as input and then invokes a new process running clustalw on that file. As long as the .fasta file is valid in format, a .aln file containing the result of clustalw’s alignment is created and then read back. By parsing both the .fasta file and the .aln file, the software constructs a data group. In the software, a data group is a pool of sequences together with a user configuration, ready for identification. Typically users need to assign four thresholds: the minimum length of a sequence fragment after simulated cutting; the minimum length of a primer; and the minimum and maximum length of the variable region between a primer pair. The same sequence pool with different configurations yields different data groups. The software ensures that users work on only one data group at a time, since the data group concept is flexible and reusable enough for users to handle microbial identification with a single data group in most situations. During this preprocessing phase, the software stores the user configuration as well as the data group sequences in the database in order to 1) enable access to previously processed data groups later and 2) provide threshold references for the identification process.
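The data group concept could be modelled along the following lines. This is only a sketch: the class name, field names, and threshold values are hypothetical, and the FASTA parser handles just the basic header-plus-sequence layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataGroup:
    """A sequence pool plus the four user thresholds (all field names hypothetical)."""
    sequences: tuple            # (name, sequence) pairs
    min_fragment_length: int    # shortest fragment kept after simulated cutting
    min_primer_length: int
    min_variable_length: int    # bounds on the variable region between a primer pair
    max_variable_length: int


def parse_fasta(text):
    """Parse FASTA text into a tuple of (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)  # sequence data may span several lines
    if name is not None:
        records.append((name, "".join(chunks)))
    return tuple(records)


pool = parse_fasta(">seq1\nACGU\nGGCC\n>seq2\nAUGC\n")
group_a = DataGroup(pool, 3, 15, 50, 500)  # illustrative threshold values
group_b = DataGroup(pool, 4, 15, 50, 500)
print(group_a == group_b)  # False: same pool, different configuration
```

Making the class frozen gives value-based equality and hashing, which matches the idea that the same pool under a different configuration is a different data group.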

Figure 1. MicrobIdentifier screenshot

The user interface shows the sequences in the pool. If the current data group is loaded from the database and its primer pair candidates were already computed under a proper configuration in a previous session, the primer selection thresholds and primer pair candidates are displayed as well. The more usual case, however, is that a new pool is supplied, parsed, and shown in the UI, after which the user sets up the basic configuration to calculate potential primer pairs. The list of primer pairs is sorted by score in ascending order. The configuration is saved to the database in association with the working data group.

To perform microbial identification, the software uses an ASCII spectrum .txt file exported from DataExplorer, which contains the mass spectrometry result from MALDI-TOF. Users are free to choose a subset of the proposed primer pair candidates, but they must provide some parameters describing the conditions of their mass-spectrometry experiment: the in vitro transcription enzyme, either SP6 or T7; the mass tolerance and minimum intensity threshold; and whether the electric charge is positive or negative during the MALDI-TOF experiment. The software parses the input file, generates a peak list after filtering out peaks below the intensity threshold, accounts for experimental inaccuracy by applying the tolerance, and finally provides the identification result.
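Peak-list generation might be sketched as follows. The exact layout of the exported spectrum file is not described here, so the sketch assumes one whitespace-separated "mass intensity" pair per line and skips any line that does not parse as two numbers. The function name is hypothetical.

```python
def parse_peak_list(text, min_intensity):
    """Return the masses of all peaks whose intensity reaches min_intensity."""
    peaks = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            mass, intensity = float(parts[0]), float(parts[1])
        except ValueError:
            continue  # skip header or comment lines
        if intensity >= min_intensity:
            peaks.append(mass)
    return peaks


spectrum = "m/z\tintensity\n1000.5\t120\n1500.2\t8\n2000.0\t50\n"
print(parse_peak_list(spectrum, 50))  # [1000.5, 2000.0]
```

The resulting list of masses plays the role of the experimental mass vector that is compared with the theoretical vectors in the database.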

Figure 1 shows the interface of MicrobIdentifier.

4. Acknowledgements

This paper is sponsored by the National Science and Technology Major Project 2009ZX10004-107 and The Natural Science Founds of Wuhan University F020504.

REFERENCES

[1] G. W. Jackson, R. J. McNichols, G. E. Fox, and R. C. Willson, “Bacterial genotyping by 16S rRNA mass cataloging,” BMC Bioinformatics, Vol. 7, pp. 321–335, June 2006.

[2] Z. D. Zhang, G. W. Jackson, G. E. Fox, and R. C. Willson, “Microbial identification by mass cataloging,” BMC Bioinformatics, Vol. 7, pp. 117–135, March 2006.

[3] J. D. Thompson, D. G. Higgins, and T. J. Gibson, “CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice,” Nucleic Acids Research, Vol. 22, pp. 4673–4680, September 1994.

[4] C. Honisch, Y. Chen, C. Mortimer, C. Arnold, O. Schmidt, D. van den Boom, C. R. Cantor, H. N. Shah, and S. E. Gharbia, “Automated comparative sequence analysis by base-specific cleavage and mass spectrometry for nucleic acid-based microbial typing,” Proceedings of the National Academy of Sciences, Vol. 104, pp. 10649–10654, June 2007.

[5] H. Steen and M. Mann, “The abc’s (and xyz’s) of Peptide Sequencing,” Molecular Cell Biology, Vol. 5, pp. 699–711, September 2004.