Here, we present the VisioBioshapeR package for R [R Core, 2014]. This new library is a comprehensive, multifunctional toolbox designed to analyse biological images automatically. The package extends other common libraries used for biological shape analysis (Momocs, ShapeR) by allowing the user to extract closed contour outlines automatically from binary images. Current functionalities of VisioBioshapeR include: random extraction of image coordinates, analysis of the shape of a biological image by the elliptic Fourier descriptor (EFD) method, and extraction of an image characteristic vector using multivariate principal component analysis (PCA) and geometrical analysis. The vector of image characteristics can be exported directly to a wide range of statistical packages in R and used for classification or other types of analysis in order to sort new images into classes. The package could prove useful in studies of any two-dimensional images and is presented with three examples of its application in ecology. The library is most useful when many images must be processed at a time and their analysis automated, for example for pattern-based image recognition.
Automatic identification of images, and biological images in particular (
The principal image processing tasks involve several processes such as grey-level transformation, binarisation, image filtering, image segmentation, visual object tracking, optical flow and image registration. After acquiring image information, statistical techniques must be used for characterisation, classification or description, depending on the target. Image pattern recognition is the technique used to classify an input image into one of a set of predefined classes [
Here, we introduce VisioBioshapeR, an R package designed to analyse biological images automatically with a small number of easy-to-use functions (
The R library presented here is a general-purpose library that addresses the problem of analysing numerous specimens at a time, so that biological images can be recognised automatically, efficiently and within a reasonable time. As with applications for the identification of organisms, and despite the existence of many image analysis applications, most free applications (SHERPA, ImageJ, Google’s Image Recognition Software) do not completely resolve the difficulties posed by biological imaging or processing, partly because of the variability and the large number of images needed to implement future automatic identification systems in biological imaging.
In this text, we present the mathematical basis needed to understand the functions contained in the VisioBioshapeR library.
Function | Description |
---|---|
coordtiff() | Extraction of X, Y coordinates from a Tiff image. Representation of the image using the first n harmonics. Processes one image at a time. |
image.to.coords() | Extraction of X, Y coordinates from a Tiff image. Faster algorithm for extracting coordinates from multiple images. |
e.fourier.graph() | Represents a Tiff image using n harmonics. |
automatic.PCAfourier.analysis1() | Performs an automatic multivariate analysis (PCA) of the EFD coefficients and a geometric image analysis (Euclidean distance and medoids) from the extracted coordinate matrix or a coordinate file. Processes one image at a time. |
automatic.PCAfourier.analysis2() | Second version of the previous function, with the parameters adapted so that it can be called from apply() to analyse multiple images at the same time (see the examples for further information). |
The functions e.fourier.graph() and automatic.PCAfourier.analysis1() are based on the elliptic Fourier descriptor (EFD) method, a mathematical way of describing the closed contours of images using coordinates previously extracted by the function coordtiff() or image.to.coords(). It was first proposed as an application of the Fourier transform method [
We implemented the EFD algorithm in the functions e.fourier.graph() and automatic.PCAfourier.analysis1() laid out in [
Let T be the perimeter of the outline, and this perimeter then becomes the period of the signal. One sets
where:
and:
where n is the index of the nth harmonic.
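The expansion described above is commonly written as follows. This is a sketch of the standard elliptic Fourier formulation, assuming an outline of $k$ points in which $\Delta x_p$, $\Delta y_p$ and $\Delta t_p$ are the coordinate and arc-length increments of the $p$-th segment and $t_p$ is the cumulative arc length up to point $p$. This notation is an assumption and may differ from that of the original equations.

```latex
\begin{align*}
\omega &= \frac{2\pi}{T} \\
x(t) &= \frac{a_0}{2} + \sum_{n=1}^{\infty} \left( a_n \cos n\omega t + b_n \sin n\omega t \right) \\
y(t) &= \frac{c_0}{2} + \sum_{n=1}^{\infty} \left( c_n \cos n\omega t + d_n \sin n\omega t \right) \\
a_n &= \frac{T}{2\pi^2 n^2} \sum_{p=1}^{k} \frac{\Delta x_p}{\Delta t_p}
       \left( \cos\frac{2\pi n t_p}{T} - \cos\frac{2\pi n t_{p-1}}{T} \right) \\
b_n &= \frac{T}{2\pi^2 n^2} \sum_{p=1}^{k} \frac{\Delta x_p}{\Delta t_p}
       \left( \sin\frac{2\pi n t_p}{T} - \sin\frac{2\pi n t_{p-1}}{T} \right) \\
c_n &= \frac{T}{2\pi^2 n^2} \sum_{p=1}^{k} \frac{\Delta y_p}{\Delta t_p}
       \left( \cos\frac{2\pi n t_p}{T} - \cos\frac{2\pi n t_{p-1}}{T} \right) \\
d_n &= \frac{T}{2\pi^2 n^2} \sum_{p=1}^{k} \frac{\Delta y_p}{\Delta t_p}
       \left( \sin\frac{2\pi n t_p}{T} - \sin\frac{2\pi n t_{p-1}}{T} \right)
\end{align*}
```

In this form each harmonic contributes four coefficients, $a_n$, $b_n$, $c_n$ and $d_n$, and the constant terms $a_0/2$ and $c_0/2$ locate the outline in the plane.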
The first harmonic defines the ellipse that best fits the outline. One can then use the parameter of the first harmonic to normalise the data so that they can be invariant to size, rotation, and the starting position of the outline trace. The function then calculates a new set of Fourier coefficients
One obtains the set of normalised coefficients following the equations:
where “scale” is the magnitude of the semi-major axis of the ellipse defined by the first harmonic,
To compute these parameters one can use the formulae supplied in [
with:
and:
and:
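A common form of this normalisation is given here as a sketch. It assumes the usual notation, none of which is confirmed by the text above: $\psi$ for the orientation of the first-harmonic ellipse, $\theta$ for the phase shift of the starting point, and $a^{*}$, $c^{*}$ as intermediate quantities.

```latex
\begin{align*}
\begin{bmatrix} A_n & B_n \\ C_n & D_n \end{bmatrix}
&= \frac{1}{\mathrm{scale}}
\begin{bmatrix} \cos\psi & \sin\psi \\ -\sin\psi & \cos\psi \end{bmatrix}
\begin{bmatrix} a_n & b_n \\ c_n & d_n \end{bmatrix}
\begin{bmatrix} \cos n\theta & -\sin n\theta \\ \sin n\theta & \cos n\theta \end{bmatrix} \\
\theta &= \frac{1}{2} \arctan \frac{2\,(a_1 b_1 + c_1 d_1)}{a_1^2 + c_1^2 - b_1^2 - d_1^2} \\
a^{*} &= a_1 \cos\theta + b_1 \sin\theta, \qquad c^{*} = c_1 \cos\theta + d_1 \sin\theta \\
\mathrm{scale} &= \sqrt{a^{*2} + c^{*2}}, \qquad \psi = \arctan\frac{c^{*}}{a^{*}}
\end{align*}
```

With these definitions the first normalised harmonic reduces to $A_1 = 1$ and $B_1 = C_1 = 0$, which is consistent with three coefficients carrying no shape information after normalisation.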
Using this method, the function e.fourier.graph() extracts valuable information from the image.
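As an illustration of how such coefficients can be computed from an outline, the following base R sketch implements the standard elliptic Fourier sums for a closed polygon. The function name efd_coefficients() and its arguments are hypothetical; this is not the package's code, and it assumes that consecutive outline points are distinct.

```r
# Minimal sketch of elliptic Fourier coefficients for a closed outline.
# M: matrix with one row per outline point (column 1 = x, column 2 = y).
# efd_coefficients() is a hypothetical name, not a VisioBioshapeR function.
efd_coefficients <- function(M, n.harmonics = 5) {
  k   <- nrow(M)
  nxt <- c(2:k, 1)                  # successor of each point (outline is closed)
  dx  <- M[nxt, 1] - M[, 1]         # x increment of each segment
  dy  <- M[nxt, 2] - M[, 2]         # y increment of each segment
  dt  <- sqrt(dx^2 + dy^2)          # segment lengths (must be > 0)
  t1  <- cumsum(dt)                 # arc length at the end of each segment
  t0  <- t1 - dt                    # arc length at the start of each segment
  per <- sum(dt)                    # perimeter, used as the period
  t(sapply(seq_len(n.harmonics), function(n) {
    w <- 2 * pi * n / per
    s <- per / (2 * pi^2 * n^2)
    c(an = s * sum(dx / dt * (cos(w * t1) - cos(w * t0))),
      bn = s * sum(dx / dt * (sin(w * t1) - sin(w * t0))),
      cn = s * sum(dy / dt * (cos(w * t1) - cos(w * t0))),
      dn = s * sum(dy / dt * (sin(w * t1) - sin(w * t0))))
  }))
}

# Usage: a circle of radius 2 sampled at 200 points, traced counter-clockwise.
theta  <- seq(0, 2 * pi, length.out = 201)[-201]
circle <- cbind(2 * cos(theta), 2 * sin(theta))
ef     <- efd_coefficients(circle, n.harmonics = 3)
```

For a circle, only the first harmonic should be appreciably different from zero, with an and dn close to the radius, which gives a quick sanity check of the sums.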
The automatic.PCAfourier.analysis1() and automatic.PCAfourier.analysis2() functions build, for each image, a vector of relevant features obtained from the geometric analysis of its form. Feature selection is arguably the most critical part of the framework, since the sensitivity and robustness of the identification algorithm depend closely on the information that is extracted automatically. Here, the characteristics extracted from the shape of the image are the principal components of the Fourier harmonics previously obtained by EFD, the geometric centre of the image and the Euclidean distance between its coordinates.
PCA is a mathematical tool for identifying patterns in data and for expressing the data in such a way as to highlight their similarities and differences. Patterns can be hard to find in high-dimensional data, such as the data contained in an image, where graphical representation is not available.
Conceptually, the goal of PCA is to reduce the number of variables of interest (in this case, the coefficients related to images) to a smaller set of components. PCA analyses all the variance in the variables and reorganises it into a new set of components equal in number to the original variables. The new components are uncorrelated and are ordered by decreasing share of the original variance: the first component captures the most variance, the second captures the second most, and so on until all the variance is accounted for. Only some components are retained for further study (dimension reduction); since the first few capture most of the variance, further analysis typically focuses on them.
So, we apply PCA to the data contained in the matrix of EFD coefficients obtained in the elliptic Fourier analysis, either for the EFD coefficients obtained up to a certain number of harmonics or for all the harmonics obtained in the EFD analysis.
We use the function princomp() of R [
Thus for each image, each of the four Fourier coefficients (
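The PCA step can be sketched with princomp() on a matrix of coefficients. The matrix below is simulated purely for illustration (15 harmonics in rows, the four coefficients in columns); how the package arranges its own coefficient matrix is not shown here, so the layout is an assumption.

```r
# Minimal sketch: PCA of a harmonics-by-coefficients matrix with princomp().
# The values are simulated stand-ins, not coefficients from a real image.
set.seed(42)
n.harm <- 15
efd <- cbind(an = rnorm(n.harm, sd = 1 / seq_len(n.harm)),
             bn = rnorm(n.harm, sd = 1 / seq_len(n.harm)),
             cn = rnorm(n.harm, sd = 1 / seq_len(n.harm)),
             dn = rnorm(n.harm, sd = 1 / seq_len(n.harm)))

pca      <- princomp(efd)                   # needs more rows than columns
prop.var <- pca$sdev^2 / sum(pca$sdev^2)    # share of variance per component
cum.var  <- cumsum(prop.var)                # cumulative share
scores   <- pca$scores[, 1:2]               # first two components retained
```

Inspecting cum.var is one way to decide how many components to keep before further analysis.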
The difference between automatic.PCAfourier.analysis1() and automatic.PCAfourier.analysis2() is that the first analyses a single image at a time, from either the matrix of coordinates or a coordinate file, whereas automatic.PCAfourier.analysis2() takes as input a list containing the coordinate matrices of one or more images. automatic.PCAfourier.analysis2() is therefore useful when multiple images are processed.
The automatic.PCAfourier.analysis1() function can also analyse the image geometrically to obtain the centroid (medoid) and the Euclidean distances between the coordinates.
One of the points of interest in an image is its geometric centre, which can be calculated in different ways. We selected the medoid as a good estimator of the image centroid. The medoid is the representative object among the X,Y coordinates whose average dissimilarity to all the other coordinates is minimal. The medoid is not equivalent to a median or a geometric median. A median is only defined for 1-dimensional data, and it minimises dissimilarity only for a specific distance metric. A geometric median is defined in any dimension, but is not necessarily one of the coordinates.
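The definition of the medoid can be made concrete with a few lines of base R. This is a minimal sketch with a hypothetical function name, medoid(), and it makes no claim about which routine the package calls internally.

```r
# Minimal sketch: the medoid is the observed point with the smallest total
# (and therefore smallest average) Euclidean distance to all other points.
# medoid() is a hypothetical helper, not a VisioBioshapeR function.
medoid <- function(M) {
  D <- as.matrix(dist(M))        # pairwise Euclidean distances
  M[which.min(rowSums(D)), ]     # row with minimal summed dissimilarity
}

pts <- rbind(c(0, 0), c(1, 0), c(0, 1), c(1, 1), c(0.4, 0.5), c(10, 10))
m   <- medoid(pts)               # one of the rows of pts
ctr <- colMeans(pts)             # arithmetic mean, pulled towards (10, 10)
```

Unlike the arithmetic mean, the result is always one of the input coordinates and is less affected by a single distant point.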
To calculate the medoid, we selected in R [
Another important piece of information obtained by automatic.PCAfourier.analysis1() is the distance between the X,Y coordinates of the image, which can carry information about relevant characteristics, such as the distance between tree rings, or can be used to compare images with one another. To calculate it, we computed the Euclidean distance between all pairs of coordinates and extracted the maximum.
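The maximum pairwise distance can be sketched in base R as follows. The function name is hypothetical, and restricting the search to the convex hull is a design choice of this sketch rather than something the text describes: the two most distant points always lie on the hull, so the result is unchanged while far fewer pairs are compared.

```r
# Minimal sketch: largest Euclidean distance between any two outline points.
# max_pairwise_distance() is a hypothetical helper, not a package function.
max_pairwise_distance <- function(M) {
  hull <- M[chull(M), , drop = FALSE]   # the farthest pair lies on the convex hull
  max(dist(hull))
}

rect  <- rbind(c(0, 0), c(3, 0), c(3, 4), c(0, 4), c(1, 1), c(2, 3))
d.max <- max_pairwise_distance(rect)    # diagonal of the 3 x 4 rectangle
```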
Many other metrics can be considered for data analysis purposes in the future.
To demonstrate the utility of the VisioBioshapeR package, we present three examples of its use. One of them uses real data: the automatic identification of 35 Tiff images of 3 genera of diatoms from an inventory of the River Ebro (Spain).
The first example consists of shape analysis of 35 diatoms from three genera: Navicula (n = 9), Cocconeis (n = 18) and Diatoma (n = 8), from the River Ebro inventory. Images can be retrieved from three folders (class 1, class 2 and class 3 respectively) in the examples directory (see “Documentation and availability” section). We then present two simpler examples: a dragonfly from a binarised digital camera image, and tree rings drawn in Microsoft Paint.
The VisioBioshapeR library, image samples and *.r script with some examples and analysis can be found at GitHub.
You will need to load VisioBioshapeR in R with the command library(VisioBioshapeR) and retrieve the example data files. To start the analysis, the path to the gallery of example images needs to be set to the folder “Templates”, which contains three folders, “class 1”, “class 2” and “class 3”, with the binarised diatom images. The root folder also contains the dfly.tif (dragonfly) and tree.tif (tree rings) images.
The tree ring image was designed and plotted using Paint software.
The first example of use refers to the automatic identification of diatoms; a requirement since 1999 [
The samples of diatoms presented in the example of the use of VisioBioshapeR (see
We selected the three genera Navicula, Cocconeis and Diatoma because of their relative abundance in the samples and thus the higher probability of finding the target diatom valves in the samples. Moreover, these genera are common in European rivers.
The taxonomic rank in this study was downgraded to genus so we can obtain more image samples from the ADIAC database [
The observation and identification were performed using a Carl Zeiss Jenaval bright-field microscope with phase-contrast Planachromat Phv 100×/1.3 lenses and normal 100×/1.3 lenses. The images were taken with an Olympus DP25 FW digital camera and the associated Olympus Cell^B software. The images have 2560 × 1920 pixels, where each pixel equals 0.03376 mm².
In each photography session we obtained background images to correct irregularities in the light source of the microscope. The background was corrected by subtracting the background image from the normal image, giving a standardised image with a regular background and colour correction.
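The subtraction step can be illustrated on grey-level matrices. This is only a sketch under stated assumptions: the images are taken to be numeric matrices of equal size, the result is rescaled to the range 0 to 1 (a choice of this sketch), and subtract_background() is a hypothetical name; the text does not specify how the correction was actually carried out.

```r
# Minimal sketch: background correction by subtracting a background image.
# img and bg are assumed to be numeric grey-level matrices of the same size.
# subtract_background() is a hypothetical helper.
subtract_background <- function(img, bg) {
  stopifnot(all(dim(img) == dim(bg)))
  d   <- img - bg                  # remove the uneven illumination
  rng <- range(d)
  if (rng[1] == rng[2]) return(d * 0)
  (d - rng[1]) / (rng[2] - rng[1]) # rescale to the interval [0, 1]
}

# Usage: an illumination gradient plus a darker 2 x 2 object.
bg  <- outer(1:4, 1:5, function(i, j) 0.5 + 0.05 * j)
obj <- matrix(0, 4, 5)
obj[2:3, 2:3] <- -0.3
img  <- bg + obj
corr <- subtract_background(img, bg)
```

After correction the background pixels share a single value, so a fixed threshold can separate the object regardless of where it sits in the field of view.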
We used the SHERPA software [
In the following sections we demonstrate how to use the different functions in the three examples (diatoms, dragonfly and tree rings).
To obtain the outline of each image of interest we propose two solutions. The first, the function Coordtiff(), analyses individual images; it is very versatile and allows the analysis of sub-images within a single image. The second, the function image.to.coords(), is computationally more efficient and is intended for the rapid extraction of coordinates from all the images in a directory at once. For the latter, each file must contain no more than a single Tiff image for the coordinates to be extracted correctly. Coordtiff() can display the image interactively once the coordinates have been extracted, whereas image.to.coords() cannot.
For the Coordtiff() function we used a modification of the Conte function [
The modified Conte function was named Coordtiff() and is defined as:
>Coordtiff(file, harmonic = F)
The file argument is used to read the images in Tiff format using the function readTiff [
An example of use of the function Coordtiff() is presented in
image.to.coords() is defined as:
>image.to.coords(filename, foldername, folder = F)
The filename argument is used to read a single image in Tiff format. The folder argument is a Boolean value that tells the function whether we are reading a set of images in a folder (T) or a single image (F). The foldername argument is the relative path to the folder. The algorithm for the outline extraction is from the raster library [
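To make the idea of outline extraction concrete, the following base R sketch finds the boundary pixels of a binary mask and orders them around their centre. It is not the raster-based algorithm used by the package: the function name outline_coords() is hypothetical, and the angular ordering is only adequate for convex or star-shaped objects.

```r
# Minimal sketch: X, Y outline coordinates from a binary image matrix.
# mask: logical (or 0/1) matrix, TRUE for object pixels.
# outline_coords() is a hypothetical helper, not a VisioBioshapeR function.
outline_coords <- function(mask) {
  mask <- mask != 0
  nr <- nrow(mask); nc <- ncol(mask)
  padded <- matrix(FALSE, nr + 2, nc + 2)       # background border around the image
  padded[2:(nr + 1), 2:(nc + 1)] <- mask
  # A pixel is interior when all four direct neighbours are foreground.
  inner <- padded[1:nr, 2:(nc + 1)] & padded[3:(nr + 2), 2:(nc + 1)] &
           padded[2:(nr + 1), 1:nc] & padded[2:(nr + 1), 3:(nc + 2)]
  edge <- mask & !inner
  rc <- which(edge, arr.ind = TRUE)
  xy <- cbind(x = rc[, "col"], y = nr - rc[, "row"] + 1)  # image rows run top to bottom
  ctr <- colMeans(xy)
  ang <- atan2(xy[, "y"] - ctr["y"], xy[, "x"] - ctr["x"])
  xy[order(ang), , drop = FALSE]                # counter-clockwise around the centre
}

# Usage: a filled 5 x 5 square inside a 9 x 9 image.
m <- matrix(FALSE, 9, 9)
m[3:7, 3:7] <- TRUE
xy <- outline_coords(m)
```

Objects with deep concavities need a proper contour-tracing algorithm, because ordering by angle can then visit the boundary pixels out of sequence.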
When all the outlines have been captured and stored in the form of X,Y coordinates, the shape coefficients of the image can be extracted using the function e.fourier.graph(). The EFD technique is performed using the efourier function [
age is represented with the reconstructed image using its harmonics (by default, 9 harmonics) by means of the iefourier function. The function e.fourier.graph() is a modification of efourier that adds the option of normalising the images with regard to size and rotation using the function NEFnormalization; it groups the coefficients with regard to size and rotation and then displays them.
The function is defined as:
>e.fourier.graph(M, n.harmonics = 9, coef = F, normalized = T)
The M argument supplies the X and Y coordinates of the images obtained with the function Coordtiff(). The n.harmonics argument sets the number of harmonics used to perform the EFD analysis by means of the function efourier and to represent the reconstruction of the original image graphically, using the function iefourier with the first n.harmonics harmonics.
The function returns the four coefficients
The normalized argument is used to normalise (or not) the images with regard to size and rotation, to group the coefficients with regard to size and rotation, and then to display the coefficients
One purpose of this function is to adjust the number of harmonics required for a suitable reconstruction of the images before extracting the vector of image information with the function automatic.PCAfourier.analysis1().
Once all the outlines in an image have been captured, the number of harmonics required has been adjusted and normalisation has been performed if necessary, the shape coefficients can be extracted and used for PCA using the function automatic.PCAfourier.analysis1().
By default, 15 harmonics are used, each giving 4 elliptic Fourier coefficients (if the normalised option is used, the first three coefficients are omitted because of the standardisation in relation to size, rotation and starting point).
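The effect of the number of harmonics on the reconstruction can be sketched with a small inverse transform. The function inverse_efd() and the coefficient values below are hypothetical and chosen only for illustration; they are not taken from the package or from any image.

```r
# Minimal sketch: rebuild an outline from elliptic Fourier coefficients
# using only the first n.harmonics harmonics.
# inverse_efd() is a hypothetical helper, not a VisioBioshapeR function.
inverse_efd <- function(an, bn, cn, dn, n.harmonics = length(an),
                        n.points = 100, a0 = 0, c0 = 0) {
  theta <- seq(0, 2 * pi, length.out = n.points + 1)[-(n.points + 1)]
  x <- rep(a0 / 2, n.points)
  y <- rep(c0 / 2, n.points)
  for (n in seq_len(n.harmonics)) {
    x <- x + an[n] * cos(n * theta) + bn[n] * sin(n * theta)
    y <- y + cn[n] * cos(n * theta) + dn[n] * sin(n * theta)
  }
  cbind(x = x, y = y)
}

# Illustrative coefficients: an ellipse plus a small third harmonic.
an <- c(3, 0, 0.4); bn <- c(0, 0, 0)
cn <- c(0, 0, 0);   dn <- c(2, 0, -0.4)
rec1 <- inverse_efd(an, bn, cn, dn, n.harmonics = 1)  # best-fitting ellipse only
rec3 <- inverse_efd(an, bn, cn, dn, n.harmonics = 3)  # finer detail added
```

Comparing reconstructions for increasing n.harmonics, for instance by plotting them over the original outline, is one way to choose the smallest number of harmonics that reproduces the shape adequately.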
The function is defined as:
>automatic.PCAfourier.analysis1(file, M, id.num, nom.label, tipus.input = 2, normalised = FALSE, num.harmonic = 15)
The file and M arguments supply the X, Y coordinates of the image: M is a coordinate matrix obtained with the function Coordtiff(), and file is a file containing the X,Y coordinates of the image, which is read directly.
The id.num argument is used to assign a number to the image analysed. The nom.label argument is used to assign a label (character) to the new image analysed. The argument tipus.input = 2 indicates to the function that we are reading a matrix, M, with the coordinates X and Y, and if we use tipus.input = 1 we are reading a file with the coordinates.
The normalised argument is used to normalise (or not) the images with regard to size and rotation, to group the coefficients with regard to size and rotation, and then to display the coefficients.
After the EFD analysis, in a second step, the function automatic.PCAfourier.analysis1() performs PCA of the first four coefficients
The function returns 21 variables containing information about the geometry of the image using the EFD, PCA and Euclidean analysis of the image:
・ Column 1: numerical ID of the image
・ Column 2: image label
・ Columns 3 - 10: PCA coefficients of all the EFD coefficients
・ Columns 11 - 18: PCA coefficients of selected (num.harmonic) elliptic Fourier coefficients
・ Columns 19 - 20: X and Y centroids of the image using the medoids
・ Column 21: maximum Euclidean distance between the X,Y coordinates.
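The pattern of producing one row of variables per image and stacking the rows into a table for classification can be sketched as follows. The extractor shape_features() is a hypothetical stand-in that computes only two simple descriptors; it does not reproduce the 21 variables listed above.

```r
# Minimal sketch: one feature row per outline, stacked into a data frame.
# shape_features() is a hypothetical stand-in, NOT the package function.
shape_features <- function(M, id.num, nom.label) {
  closed <- rbind(M, M[1, , drop = FALSE])             # close the outline
  data.frame(id        = id.num,
             label     = nom.label,
             perimeter = sum(sqrt(rowSums(diff(closed)^2))),
             max.dist  = max(dist(M)),
             stringsAsFactors = FALSE)
}

# A list of coordinate matrices, one per image (toy outlines).
outlines <- list(square = rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1)),
                 rect   = rbind(c(0, 0), c(3, 0), c(3, 4), c(0, 4)))

rows <- lapply(seq_along(outlines), function(i)
  shape_features(outlines[[i]], id.num = i, nom.label = names(outlines)[i]))
features <- do.call(rbind, rows)                       # one row per image
```

A table of this shape, with an identifier, a label and numeric descriptors, can be passed directly to most classification functions in R.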
Finally, the example R script contains a complete example of the classification of biological images, using 35 Tiff images of 3 frequent genera of diatoms (Navicula (n = 9), Cocconeis (n = 18) and Diatoma (n = 8)) from the River Ebro (Spain) that are shown in
The first step (see
In
We used the following discriminant methods to separate the three genera of diatoms using the variables calculated in the EFD-PCA:
Method | Navicula (%) | Cocconeis (%) | Diatoma (%) | All (%) |
---|---|---|---|---|
LDA(12) | 66.7 | 88.9 | 87.5 | 82.9 |
LDA(12CV) | 33.3 | 66.7 | 25 | 48.6 |
LDA(6CV) | 66.7 | 77.8 | 50 | 68.6 |
RRLDA(13) | 55.6 | 55.6 | 87.5 | 62.9 |
FDA(13) | 88.9 | 100 | 100 | 97.1 |
NN(13) | 100 | 100 | 100 | 100 |
NN(3) | 100 | 88.9 | 75 | 88.6 |
SVM(13) | 55.6 | 100 | 87.5 | 85.7 |
・ LDA: Fisher linear discriminant analysis (with CV: Cross validation) using the MASS library,
・ RRLDA: Robust Regularised Linear Discriminant Analysis using the rrlda library,
・ FDA: Flexible Discriminant Analysis using the mda library,
・ NN: Neural Nets using the nnet library,
・ SVM: support vector machine using the kernlab library.
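The difference between a resubstitution estimate and a cross-validated one can be sketched with lda() from the MASS library. The iris data set is used here only as a stand-in for a table of shape variables with three classes; it is not the diatom data, and the figures it produces are unrelated to those in the table.

```r
# Minimal sketch: LDA with and without leave-one-out cross-validation.
# iris is a stand-in data set, not the diatom feature table.
library(MASS)

fit       <- lda(Species ~ ., data = iris)
pred      <- predict(fit)$class                 # predictions on the training data
acc.resub <- mean(pred == iris$Species) * 100   # resubstitution accuracy (%)

cv     <- lda(Species ~ ., data = iris, CV = TRUE)  # leave-one-out predictions
acc.cv <- mean(cv$class == iris$Species) * 100      # cross-validated accuracy (%)

# Percentage of correct classification per class, as in a confusion summary.
per.class <- tapply(cv$class == iris$Species, iris$Species, mean) * 100
```

Resubstitution accuracy is measured on the same images used to fit the model and tends to be optimistic, especially with few images and many variables, whereas the cross-validated figure estimates performance on images not seen during fitting.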
The percentage of correct classification over all images ranges from 48.6% (LDA(12CV)) to 100% (NN(13)), so some of the methods in particular can be considered useful for classifying organisms from the variables produced by the library.
Another two examples are offered in the R script. First was a dragonfly (
The VisioBioshapeR library has demonstrated its usefulness in the three examples presented in this work. All the examples used in this paper are collected in an R script that illustrates its utilities.
As the above examples show (dragonfly, tree rings and diatoms), the VisioBioshapeR library is suitable for its intended purpose and has a wide variety of uses in the automatic identification of biological images.
We thank the research group in diatoms led by Professor Jaume Cambra, Nuria Flor and Andrea Burfeid of the Botanical Department at the UB for their help and support, analysis of samples and providing images of diatoms.
We also thank the Faculty of Biology at the UB for a grant it awarded in 2015 to support this project.
The website for VisioBioshapeR can be found on GitHub: https://github.com/BielStela/VisioBioShapeR.
It contains the library, an R script with the examples and a zip folder with all the images discussed and used in the examples.
Stela, B. and Monleón-Getino, A. (2016) Facilitating the Automatic Characterisation, Classification and Description of Biological Images with the VisionBioShape Package for R. Open Access Library Journal, 3: e3108. http://dx.doi.org/10.4236/oalib.1103108