MicrobMatcher: a microbial comparison software based on matrix-assisted laser desorption/ionization with time-of-flight mass spectrometry

doi:10.4236/jbise.2011.45051

J. Biomedical Science and Engineering, 2011, 4, 403-409

doi:10.4236/jbise.2011.45051 Published Online May 2011 (http://www.SciRP.org/journal/jbise/)

Yizhen Fu, Feng Liu*, Yalu Ye

Department of Software Engineering, Wuhan University, Wuhan, China.

Email: wolflf@126.com, fafaisland@gmail.com, shell.yeyalu@gmail.com

Received 10 March 2011; revised 25 March 2011; accepted 25 April 2011.

ABSTRACT

Matrix-assisted laser desorption/ionization with time-of-flight mass spectrometry (MALDI-TOFMS) has been investigated as a method for the rapid identification of species. A current demand in microbial identification is to compare unknown strains with known ones quickly, semi-automatically and accurately. In this paper, we present a software tool that allows flexible microbial matching in a user-friendly way by letting users customize comparison parameters, including the in vitro transcription enzyme, mass tolerance, minimum fragment length, intensity threshold and corresponding weights. We provide three spectral scoring functions to compute the affinity between species, with the aim of increasing the precision of microbial comparison. To test and verify this tool, we employed experimental spectral data based on MALDI-TOFMS and the gene sequences of E.coli and Salmonella. The software is written in Java for cross-platform use.

Keywords: Microbial Comparison; Matrix-Assisted

Laser Desorption/Ionization with Time-of-Flight

Mass Spectrometry

1. INTRODUCTION

MALDI-TOFMS is an analytical technique that measures the mass-to-charge ratio of charged particles. It is used for determining the masses of particles, for determining the elemental composition of a sample or molecule, and for elucidating the chemical structures of molecules such as peptides and other chemical compounds. With the development of this technology, microbial identification by mass cataloging has attracted considerable attention owing to its high efficiency and automation. Meanwhile, there is a current demand to compare mass spectrometric observables with theoretical fragmentation patterns, and further to determine the genetic affinity between the sample gene and the genes of known species in a database quickly, semi-automatically and accurately.

Within this context, our paper presents a software tool that allows flexible microbial matching in a user-friendly way. For matching speed and accuracy, the software provides three spectral scoring functions to compute the coincidence between species. For semi-automation, the tool allows users to customize comparison parameters, including the transcription enzyme, mass tolerance, minimum fragment length, intensity threshold and corresponding weights.

To test and verify this tool, we employed experimental spectral data based on MALDI-TOFMS and the gene sequences of E.coli and Salmonella.

The remainder of the document is structured as follows. Section 2 presents three algorithms for computing the coincidence between the sample gene and the genes of known species. Sections 3 and 4 describe and verify the software, respectively. Related work is discussed in Section 5. Finally, Section 6 concludes.

2. ALGORITHM

2.1. Overall Algorithm

The overall algorithm of the comparison process is as follows:

1) Amend the gene sequence of the known reference species according to the transcription enzyme to form the theoretical gene sequence. If the promoter is T7, the nucleotide sequence “TTCTATAGTGTCACCTAAAT” is added to the original sequence. If the promoter is Sp6, the original gene sequence is reversed and complemented (A-T, G-C), and the nucleotide sequence “CCCTATAGTGAGTCGTATTAC” is then added as its subsequence.

2) Cut the theoretical gene sequence after every base ‘G’, omitting the fragments that have fewer than L nucleotides. L is determined by the user.

3) Calculate the mass of all fragments (also referred to

as ‘fingerprint biomarkers’) from the sequence and then

form the sequence’s mass vector. The mass of every

fragment is:

M_fragment = [expression in A, G, C and T; legible constants: 329.2, 305.2, 361.2, 18.0148, 1.0072] Da    (1)

A, G, C and T denote the total numbers of adenylic acid, guanylic acid, cytidylic acid and thymidylic acid residues, respectively, in each fragment.

4) Take the mass vector of the gene sequence and the experimental peak list, and calculate the score indicating their similarity by using one of the spectral scoring functions introduced below.
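Steps 1 and 2 above can be sketched in Java as follows. This is a minimal illustration, not the MicrobMatcher source: the class and method names are hypothetical, the promoter-specific sequence is assumed to be appended at the end of the sequence (the text only says it is "added"), and a trailing fragment that does not end in 'G' is assumed to be kept if it is long enough.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Sketch of steps 1 and 2: build the theoretical sequence, then cut it after every 'G'. */
class FragmentCutter {
    // Promoter-specific sequences quoted in step 1 of the text.
    static final String T7_ADDED = "TTCTATAGTGTCACCTAAAT";
    static final String SP6_ADDED = "CCCTATAGTGAGTCGTATTAC";

    /** Reverses a DNA string and complements each base (A-T, G-C). */
    static String reverseComplement(String seq) {
        StringBuilder sb = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            switch (seq.charAt(i)) {
                case 'A': sb.append('T'); break;
                case 'T': sb.append('A'); break;
                case 'G': sb.append('C'); break;
                case 'C': sb.append('G'); break;
                default:
                    throw new IllegalArgumentException("Unexpected base: " + seq.charAt(i));
            }
        }
        return sb.toString();
    }

    /** Step 1. ASSUMPTION: the added sequence is appended at the end. */
    static String theoreticalSequence(String gene, boolean useT7) {
        String s = gene.toUpperCase(Locale.ROOT);
        return useT7 ? s + T7_ADDED : reverseComplement(s) + SP6_ADDED;
    }

    /** Step 2: cut after every 'G' and drop fragments shorter than minLength (the user's L). */
    static List<String> cutAfterG(String seq, int minLength) {
        List<String> fragments = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < seq.length(); i++) {
            if (seq.charAt(i) == 'G') {
                if (i + 1 - start >= minLength) {
                    fragments.add(seq.substring(start, i + 1));
                }
                start = i + 1;
            }
        }
        // ASSUMPTION: a trailing fragment without a final 'G' is kept if long enough.
        if (start < seq.length() && seq.length() - start >= minLength) {
            fragments.add(seq.substring(start));
        }
        return fragments;
    }

    public static void main(String[] args) {
        String theoretical = theoreticalSequence("AAGCTGTTTA", true);
        System.out.println(cutAfterG(theoretical, 3));
    }
}
```

Each resulting fragment would then be converted to a mass with Equation (1) to form the mass vector used by the scoring functions.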

2.2. The First Spectral Scoring Function

The first spectral scoring function [1] in our work is as follows.

Let N denote the total number of fingerprint biomarkers in the given theoretical gene sequence. A vector u of length N is constructed, whose elements are 0's and 1's. The ith element of u is 1 if the mass of the ith fragment is observed within tolerance in the blinded sample, and 0 if the ith fingerprint peak is not observed. The number of 1's in u (that is, the sum of all elements of u) is the number of fingerprint biomarkers observed in the blinded sample.

For each blinded sample and each reference species, a likelihood is computed based on the number of fragments observed in the blinded sample. This likelihood is a value between 0 and 1. If the likelihood is close to 1, the reference bacterium is determined to be present. If the likelihood is close to 0, the blinded sample does not contain the significant fingerprint biomarkers, and the reference is determined to be absent.
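The construction of the vector u can be illustrated with a short Java sketch. The names are hypothetical, and the score shown here is simply the fraction of matched biomarkers, sum(u)/N, which is an assumption made for illustration; the likelihood used in [1] may be computed differently.

```java
/** Sketch of the match vector u and a simple matched-fraction score. */
class MatchVectorScore {

    /** u[i] is 1 if theoretical mass i has an observed peak within +/- tolerance, else 0. */
    static int[] matchVector(double[] theoreticalMasses, double[] observedMasses, double tolerance) {
        int[] u = new int[theoreticalMasses.length];
        for (int i = 0; i < u.length; i++) {
            for (double observed : observedMasses) {
                if (Math.abs(theoreticalMasses[i] - observed) <= tolerance) {
                    u[i] = 1;
                    break; // one matching peak is enough for this biomarker
                }
            }
        }
        return u;
    }

    /** ASSUMPTION: the score is the fraction of biomarkers observed, sum(u) / N. */
    static double matchedFraction(int[] u) {
        if (u.length == 0) {
            return 0.0;
        }
        int sum = 0;
        for (int x : u) {
            sum += x;
        }
        return (double) sum / u.length;
    }

    public static void main(String[] args) {
        double[] theoretical = {1000.0, 2000.0, 3000.0, 4000.0};
        double[] observed = {1000.3, 2999.8, 5000.0};
        int[] u = matchVector(theoretical, observed, 0.5);
        System.out.println(java.util.Arrays.toString(u) + " -> " + matchedFraction(u));
    }
}
```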

2.3. The Second Spectral Scoring Function

Building on the first method, the second spectral scoring function [2] in our work allows the user to define two intensity cutoffs, partitioning the experimental peaks into three lists: the first list, whose intensities are higher than the larger cutoff; the second list, whose intensities lie between the two cutoffs; and the third list, whose intensities are lower than the smaller cutoff. Users can then assign a credibility to each of the three intensity intervals in the form of a weight, and the three weights must sum to one. This method takes the reliability of the intensities into account and draws on the users' experience. The scoring function is as follows:



11223 3DMPWMPW MPWN (1)

where:

MP1 is the number of matches between the theoretical fragments and the experimental peaks whose intensities are higher than the larger defined intensity.

MP2 is the number of matches between the theoretical fragments and the experimental peaks whose intensities are lower than the larger defined intensity and higher than the smaller defined intensity.

MP3 is the number of matches between the theoretical fragments and the experimental peaks whose intensities are lower than the smaller defined intensity.

W1, W2 and W3 are the credibilities (weights) of the three intensity intervals corresponding to MP1, MP2 and MP3, respectively.

N is the total number of fingerprint biomarkers in the given theoretical gene sequence.

A higher score indicates greater genetic affinity and therefore a higher possibility of being the same species.
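The weighted score can be sketched in Java as below, taking D as the weighted sum of the three match counts divided by N, in line with the definitions above. The names are hypothetical, and two details are assumptions that the text does not settle: a theoretical fragment that matches several peaks is credited once, to the most intense matching peak, and a peak whose intensity equals a cutoff is placed in the higher band.

```java
/** Sketch of the second scoring function: D = (MP1*W1 + MP2*W2 + MP3*W3) / N. */
class WeightedBandScore {

    /**
     * weights[0], weights[1], weights[2] correspond to W1 (high band), W2 (middle), W3 (low).
     * lowCut and highCut are the two user-defined intensities.
     */
    static double score(double[] fragmentMasses, double[] peakMasses, double[] peakIntensities,
                        double tolerance, double lowCut, double highCut, double[] weights) {
        if (weights.length != 3 || peakMasses.length != peakIntensities.length) {
            throw new IllegalArgumentException("Expected 3 weights and paired peak arrays");
        }
        if (fragmentMasses.length == 0) {
            return 0.0;
        }
        int[] matched = new int[3]; // MP1, MP2, MP3
        for (double fragment : fragmentMasses) {
            // ASSUMPTION: credit each fragment once, using its most intense matching peak.
            double best = Double.NEGATIVE_INFINITY;
            for (int j = 0; j < peakMasses.length; j++) {
                if (Math.abs(fragment - peakMasses[j]) <= tolerance) {
                    best = Math.max(best, peakIntensities[j]);
                }
            }
            if (best == Double.NEGATIVE_INFINITY) {
                continue; // no peak within tolerance
            }
            // ASSUMPTION: an intensity equal to a cutoff falls in the higher band.
            int band = best >= highCut ? 0 : (best >= lowCut ? 1 : 2);
            matched[band]++;
        }
        double weighted = 0.0;
        for (int b = 0; b < 3; b++) {
            weighted += matched[b] * weights[b];
        }
        return weighted / fragmentMasses.length; // divide by N
    }

    public static void main(String[] args) {
        double[] fragments = {1000.0, 2000.0, 3000.0, 4000.0};
        double[] masses = {1000.1, 2000.2, 3000.0};
        double[] intensities = {900.0, 500.0, 100.0};
        System.out.println(score(fragments, masses, intensities, 0.5, 300.0, 800.0,
                new double[] {0.5, 0.3, 0.2}));
    }
}
```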

2.4. The Third Spectral Scoring Function

The third spectral scoring function [2,3] in our work is as follows:

$$C(M, M') = \frac{2\,(M \cdot M')}{M \cdot M + M' \cdot M'} \tag{3}$$

The scalar product (often referred to as a ‘dot-product’) of two mass vectors in the function is defined as:

$$M \cdot M' = \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \delta(m_i - m'_j) \tag{4}$$

where M is the mass vector of one sample's fragmentation, which has N1 fragments with m_i standing for the mass of the ith fragment, while M' is the mass vector of the other sample's fragmentation, which has N2 fragments with m'_j standing for the mass of the jth fragment. The discrete delta function δ is:

$$\delta(k) = \begin{cases} 1, & k = 0 \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

Given inevitable experimental inaccuracy, the discrete delta function δ can be further modified to be:

$$\delta(k) = \begin{cases} 1, & |k| \le \text{tolerance} \\ 0, & \text{otherwise} \end{cases} \tag{6}$$

Based on these formulas, the inner product is greater if the two samples have more fragments of the same mass. The spectral scoring function normalizes the inner-product value to a range between zero and one, and a higher value of the spectral scoring function indicates a higher possibility of being the same species.
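A minimal Java sketch of this normalized inner product follows. It assumes that the score takes the form 2(M·M') / (M·M + M'·M'), with the dot product counting pairs of masses that agree within the tolerance; the class and method names are hypothetical.

```java
/** Sketch of the third scoring function, using the tolerance-based delta function. */
class CoincidenceScore {

    /** Dot product: number of pairs (a[i], b[j]) whose masses differ by at most tol. */
    static int dot(double[] a, double[] b, double tol) {
        int count = 0;
        for (double x : a) {
            for (double y : b) {
                if (Math.abs(x - y) <= tol) {
                    count++;
                }
            }
        }
        return count;
    }

    /** ASSUMPTION: score = 2 (M . M') / (M . M + M' . M'). */
    static double score(double[] m, double[] mPrime, double tol) {
        int denominator = dot(m, m, tol) + dot(mPrime, mPrime, tol);
        return denominator == 0 ? 0.0 : 2.0 * dot(m, mPrime, tol) / denominator;
    }

    public static void main(String[] args) {
        double[] m = {1000.0, 2000.0, 3000.0};
        double[] mPrime = {1000.2, 3000.1, 4000.0, 5000.0};
        System.out.println(score(m, mPrime, 0.5));
    }
}
```

With this form, comparing a mass vector with itself gives a score of one, and two vectors with no masses in common give zero.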

3. SOFTWARE

To perform microbial comparison, the software takes two inputs: the ASCII spectrometry .txt file exported from DataExplorer (Figure 1), which contains the mass-intensity spectrum produced by MALDI-TOFMS, and the theoretical gene sequence of the known reference species, supplied either as a .txt file imported from the local file system or as text pasted directly into the input box. The software offers the three spectral scoring functions described above, and users can choose one of them to calculate the coincidence between the experimental data and the theoretical DNA sequence. In all three methods, users are free to customize parameters of their mass-spectrometry experiment, including the in vitro transcription enzyme (either T7 or Sp6), mass tolerance, minimum fragment length and intensity threshold. In addition, in Method 2, users can customize the intensity ranges and corresponding weights according to their previous experience of the importance of peaks within each relative intensity range. The software then parses the input file, generates peak lists after filtering out peaks below the intensity threshold, accounts for experimental inaccuracy by applying the tolerance, and finally provides the comparison result of the selected method. For further research, users can save the comparison result as a .txt report file. Figures 2, 3 and 4 show the user interfaces of the three scoring methods in the software, respectively.
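The parsing and filtering step might look like the following Java sketch. The layout of the DataExplorer export is not described in the paper, so the sketch assumes that each data line holds a mass and an intensity separated by whitespace and that any other line (a header, for example) can be skipped; the class, method and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of reading a mass-intensity peak list and applying the intensity threshold. */
class PeakListReader {

    /** One experimental peak. */
    static final class Peak {
        final double mass;
        final double intensity;

        Peak(double mass, double intensity) {
            this.mass = mass;
            this.intensity = intensity;
        }
    }

    /**
     * ASSUMPTION: each data line is "mass intensity" separated by spaces or tabs.
     * Lines that do not start with two numbers are ignored.
     */
    static List<Peak> parse(List<String> lines, double intensityThreshold) {
        List<Peak> peaks = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 2) {
                continue; // blank or malformed line
            }
            try {
                double mass = Double.parseDouble(fields[0]);
                double intensity = Double.parseDouble(fields[1]);
                // Keep only peaks at or above the user-defined threshold.
                if (intensity >= intensityThreshold) {
                    peaks.add(new Peak(mass, intensity));
                }
            } catch (NumberFormatException e) {
                // Header or comment line: skip it.
            }
        }
        return peaks;
    }

    public static void main(String[] args) {
        List<String> lines = java.util.Arrays.asList("Mass Intensity", "1000.5 120.0", "2000.1 30");
        System.out.println(parse(lines, 50.0).size());
    }
}
```

In practice the lines could be obtained with `java.nio.file.Files.readAllLines` on the exported .txt file before being passed to `parse`.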

4. VERIFICATION

This paper presents two sets of experiments, a negative control and a positive control, to verify the accuracy and utility of the software.

The negative control is divided into two parts. In each part, we use five sets of data from five separate experiments on one species and the DNA sequence of another species as input, to test the result for mismatched species. For example, we calculate the coincidence between the theoretical sequence of E.coli and each set of experimental data of Salmonella. To ensure a fair comparison, we keep the experimental conditions and parameters the same. Table 1 shows the results of these negative control experiments.

Figure 1. Mass-intensity spectrometry of the experimental gene in DataExplorer.

Figure 2. The user interface of Method 1.

Figure 3. The user interface of Method 2.

Figure 4. The user interface of Method 3.

Table 1. The results of the negative control experiments (coincidence).

Part     Theoretic sequence  Experimental data  Method 1            Method 2             Method 3
Part I   E.coli              Salmonella (1)     0                   0.00592592592592593  0
Part I   E.coli              Salmonella (2)     0                   0.00296296296296296  0
Part I   E.coli              Salmonella (3)     0                   0.00592592592592593  0
Part I   E.coli              Salmonella (4)     0                   0.00444444444444444  0
Part I   E.coli              Salmonella (5)     0                   0.00592592592592593  0
Part II  Salmonella          E.coli (1)         0.0357142857142857  0.0321428571428571   1.75E-04
Part II  Salmonella          E.coli (2)         0.0357142857142857  0.0321428571428571   2.42E-04
Part II  Salmonella          E.coli (3)         0                   0                    0
Part II  Salmonella          E.coli (4)         0.0357142857142857  0.0321428571428571   2.92E-04
Part II  Salmonella          E.coli (5)         0.0357142857142857  0.0321428571428571   0.00147275405007364

The coincidence values are all too low for the microbe to be classified as the species of the theoretical sequence. In other words, the results indicate that the experimental species is probably not the same as the theoretical species, which accords with our expectation.

In the positive control, which is likewise divided into two parts, we use five sets of data from five separate experiments on one species and the DNA sequence of the same species as input. For instance, the coincidence between the theoretical sequence of E.coli and its own experimental data is calculated. Again, the conditions of each experiment are kept the same to ensure fairness. Table 2 shows the results of the positive control experiments.

Table 2. The results of the positive control experiments (coincidence).

Part     Theoretic sequence  Experimental data  Method 1            Method 2            Method 3
Part I   E.coli              E.coli (1)         0.07407             0.06963             2.62E-04
Part I   E.coli              E.coli (2)         0.0740740740740741  0.0666666666666667  8.20E-04
Part I   E.coli              E.coli (3)         0.0740740740740741  0.0666666666666667  1.32E-04
Part I   E.coli              E.coli (4)         0.0740740740740741  0.0666666666666667  8.76E-04
Part I   E.coli              E.coli (5)         0.0740740740740741  0.0666666666666667  0.00441988950276243
Part II  Salmonella          Salmonella (1)     0.0357142857142857  0.035               4.28E-05
Part II  Salmonella          Salmonella (2)     0.0714285714285714  0.0642857142857143  0.0154639175257732
Part II  Salmonella          Salmonella (3)     0.0357142857142857  0.035               1.87E-04
Part II  Salmonella          Salmonella (4)     0.0357142857142857  0.035               1.82E-05
Part II  Salmonella          Salmonella (5)     0.0357142857142857  0.035               0.0103448275862069

Given the tolerance allowed during the experiment and previous experience, we find that the coincidence values are all within the acceptable range, which reflects a high probability that the two species in the comparison are the same, and which also demonstrates that our software is robust and accurate.

5. RELATED WORK

The software in this paper compares the known species in the databases with an unknown species for which mass-intensity data have been generated by MALDI-TOFMS. In the next phase, we will perform statistical analysis on a large number of spectra from one species and expect to compare the affinity among unknown species. Furthermore, we will try to build models for species and to search the range of possible species for an unknown species based on its MALDI-TOFMS data.

6. CONCLUSIONS

To allow flexible microbial matching in a user-friendly way, we designed the software MicrobMatcher. To perform microbial comparison, the software takes as inputs the ASCII spectrometry .txt file exported from DataExplorer, which contains the mass-intensity spectrum produced by MALDI-TOFMS, and the theoretical gene sequence of a known species in the database. The software offers three spectral scoring functions, and users can choose one of them. Users are then free to customize comparison parameters, including the in vitro transcription enzyme, mass tolerance, minimum fragment length, intensity threshold and corresponding weights. The software parses the input file, generates peak lists after filtering out peaks below the intensity threshold, accounts for experimental inaccuracy by applying the tolerance, and finally provides the comparison result.

The software computes the genetic affinity between the sample gene and the genes of known species in the database quickly, semi-automatically and accurately.

7. ACKNOWLEDGEMENTS

This work was sponsored by the National Science and Technology Major Project 2009ZX10004-107 and the Natural Science Funds of Wuhan University F020504.

REFERENCES

[1] Jarman, K.H., Cebula, S.T., Saenz, A.J., Petersen, C.E., Valentine, N.B., Kingsley, M.T. and Wahl, K.L. (2000) An algorithm for automated bacterial identification using matrix-assisted laser desorption/ionization mass spectrometry. Analytical Chemistry, 72, 1217-1223. doi:10.1021/ac990832j

[2] Jackson, G.W., McNichols, R.J., Fox, G.E. and Willson, R.C. (2007) Universal bacterial identification by mass spectrometry of 16S ribosomal RNA cleavage products. International Journal of Mass Spectrometry, 261, 218-226. doi:10.1016/j.ijms.2006.09.021

[3] Jackson, G.W., McNichols, R.J., Fox, G.E. and Willson, R.C. (2008) Toward universal flavivirus identification by mass cataloging. Journal of Molecular Diagnostics, 10, 135-141. doi:10.2353/jmoldx.2008.070130

[4] Zhou, X.H., Gonnet, G., Hallett, M., Münchbach, M., Folkers, G. and James, P. (2001) Cell fingerprinting: An approach to classifying cells according to mass profiles of digests of protein extracts. Proteomics, 1, 683-690. doi:10.1002/1615-9861(200104)1:5<683::AID-PROT683>3.0.CO;2-3

[5] Ressom, H.W., Varghese, R.S., Drake, S.K., Hortin, G.L., Abdel-Hamid, M., Loffredo, C.A. and Goldman, R. (2007) Peak selection from MALDI-TOF mass spectra using ant colony optimization. Bioinformatics, 23, 619-626. doi:10.1093/bioinformatics/btl678

[6] Jackson, G.W., McNichols, R.J., Fox, G.E. and Willson, R.C. (2007) Bacterial genotyping by 16S rRNA mass cataloging. BMC Bioinformatics, 7, 321. doi:10.1186/1471-2105-7-321

[7] Zhang, Z.D., Jackson, G.W., Fox, G.E. and Willson, R.C. (2006) Microbial identification by mass cataloging. BMC Bioinformatics, 7, 117. doi:10.1186/1471-2105-7-117

[8] Jones, N.C. and Pevzner, P.A. (2004) An introduction to bioinformatics algorithms, Chapter 6. MIT Press, Cambridge.