The only known predictable aggregation of dwarf minke whales (Balaenoptera acutorostrata subsp.) occurs in the Australian offshore waters of the northern Great Barrier Reef in May-August each year. The identification of individual whales is required for research on the whales’ population characteristics and for monitoring the potential impacts of tourism activities, including commercial swims with the whales. At present, it is not cost-effective for researchers to manually process and analyze the tens of thousands of underwater images collated after each observation/tourist season, and a large database of historical non-identified imagery exists. This study reports the first proof of concept for recognizing individual dwarf minke whales using Deep Learning Convolutional Neural Networks (CNNs). The “off-the-shelf” Imagenet-trained VGG16 CNN was used as the feature-encoder of the per-pixel semantic segmentation Automatic Minke Whale Recognizer (AMWR). The most frequently photographed whale in a sample of 76 individual whales (MW1020) was identified in 179 images out of the total 1320 images provided. Training and image augmentation procedures were developed to compensate for the small number of available images. The trained AMWR achieved 93% prediction accuracy on the testing subset of 36 positive/MW1020 and 228 negative/not-MW1020 images, where each negative image contained at least one of the other 75 whales. Furthermore, on the test subset, AMWR achieved 74% precision, 80% recall, and a 4% false-positive rate, making the presented approach comparable to or better than other state-of-the-art individual animal recognition results.
The dwarf minke whale (Balaenoptera acutorostrata subsp.) is the second smallest baleen whale, born at approximately 2 m in length and growing to a maximum measured length of 7.8 m [
The identification of individual whales underpins much of the scientific research on dwarf minke whales and the monitoring of tourism activities. While in the GBR, these whales are highly inquisitive, readily approaching vessels and divers and often maintaining contact for prolonged periods [
Photo-ID is a simple, non-invasive technique widely used to study a range of biological and behavioral characteristics of wild animal populations. Ideal candidates for photo-ID are those with stable color patterns and/or other markings that are unique to each individual, so that individuals can be easily distinguished from each other and their identifiable markings remain the same over time. The automation of the photo-ID process is often highly specific to the required species, e.g. fin contour of great white sharks [
For minke whales, photo-ID has typically involved visual comparison of large numbers of photographs by trained researchers; thus, the process is time-intensive. Much of the imagery used for photo-identification of dwarf minke whales in recent years has come from tourists and crew aboard swim-with-whales dive tourism vessels [
Over the last few years, Deep Learning Convolutional Neural Networks (CNNs) have revolutionized the field of computer-vision image recognition [
A typical Imagenet-trained CNN is set up to classify as many as 1000 different types of objects. Therefore, it is plausible to expect that such a CNN could distinguish at least 1000 different individual dwarf minke whales if it is trained or re-trained appropriately. This direct approach, however, has a number of limiting factors. First, millions of images are available in the Imagenet for training CNNs, which is presently not feasible for dwarf minke whales, where the number of images available for an individual whale may vary between one and several thousand. Second, typical Imagenet object categories are very different from each other, e.g. dogs versus people, whereas all minke whales fit essentially the same Imagenet category (i.e. near-identical body shape, proportions and general color). Third, the output of a classification CNN is a single probability number for each available class, where category and class are used as equivalent terms in this study. Such a probability prediction has limited value to a marine biologist, as it does not explain why/how the CNN arrived at its prediction. This is known as the black-box perception and/or criticism of classification CNNs. The black-box CNN prediction is unavoidable in studies where animals are identified by their “faces”, e.g. for gorillas [
The black-box limitation of the classification CNNs has a natural solution
when the CNNs are configured to perform semantic segmentation of images, where an image is segmented into per-pixel categories [
The underwater imagery dataset used in this study consisted of 1320 digital photographs of dwarf minke whales (Balaenoptera acutorostrata subsp.). All images were sorted according to unique individual animals. In some cases, only the left or right side of a whale was identified, without knowing whether the corresponding images belonged to the same whale or not. Where it was possible to match the left and right sides to the same whale, the related imagery was labeled accordingly and placed together in the same folder. As a result, the dataset identified 76 different whales. The identification process was extremely time-consuming even for trained researchers, as it required recording and cataloguing the color patterns and scars of 76 different whales, and/or reviewing any new image against at least 76 other whale images, thus relying on the researchers’ memory to identify matches with any efficiency. The number of available images varied greatly between individuals; the MW1020 individual had the largest number of images (179), and several whales had only one image per individual.
As described in the introduction, this study used a segmentation CNN rather than a classification CNN to recognize an individual minke whale and localize the recognized unique features. Specifically, the most accurate segmentation FCN-8s model from the Fully Convolutional Networks (FCN) [
First, the FCN-8s model is based on the VGG16 CNN model [
Second, this study used the Deep Learning Python framework Keras [
Third, at the time of writing, the FCN-8s publication [
In terms of the actual implementation, the FCN-8s model was built by reusing all VGG16 convolutional layers, which were loaded with the Imagenet-trained VGG16 weights available in Keras [
The adopted FCN-8s [
Two image processing protocols were used. First, all available images were standardized by the following image-scaling procedure (ISP640). If a given image had H and W as its height and width, respectively, then L = min(H, W) is the minimum of H and W, and the image was resized by the scale S = 640/L. This step scaled all images so that their shortest side was 640 pixels long, hence the abbreviation ISP640.
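The ISP640 scaling rule can be sketched as a small helper computing the rescaled dimensions (a minimal illustration; the function name and the rounding choice are assumptions, not from the paper):

```python
def isp640_size(h, w, target=640):
    """Return (new_h, new_w) after ISP640 scaling:
    the shortest side becomes `target` pixels, via S = target / min(H, W)."""
    s = target / min(h, w)
    return round(h * s), round(w * s)
```

For example, a 960 x 1280 image is rescaled to 640 x 853, so the shortest side is exactly 640 pixels.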
The second or training augmentation protocol (TAP480) was applied to the ISP640 processed images, where each image was:
• Randomly rotated in the range [−45, +45] degrees, with the input image reflected to fill pixels outside the original boundary as required;
• Randomly resized in the scale range [0.75, 1.25], i.e. zoomed in or out by up to 25%;
• Randomly shifted in each color channel within the [−25.5, 25.5] range, where 25.5 is 10% of the maximum color value 255;
• Randomly gamma-shifted within the [−25.5, 25.5] range, where all color channels were shifted together;
• Randomly cropped to retain 480 × 480 pixels;
• Imagenet color mean values were subtracted, as is commonly done when working with the Imagenet-trained VGG16 model.
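The color-shift, crop and mean-subtraction steps of TAP480 can be sketched with NumPy as follows (a partial illustration only: the random rotation and zoom steps, done with an image library in practice, are omitted, and the Imagenet mean values are the commonly used VGG16 ones, assumed here rather than taken from the paper):

```python
import numpy as np

# Commonly used Imagenet RGB channel means for VGG16 inputs (assumed values).
IMAGENET_MEAN = np.array([123.68, 116.779, 103.939])

def tap480_color_and_crop(img, rng, crop=480):
    """Apply the color-shift, random-crop and mean-subtraction steps of TAP480.
    `img` is an H x W x 3 float array already processed by ISP640."""
    h, w, _ = img.shape
    img = img + rng.uniform(-25.5, 25.5, size=3)   # per-channel shift, 10% of 255
    img = img + rng.uniform(-25.5, 25.5)           # "gamma" shift: all channels together
    top = rng.integers(0, h - crop + 1)            # random 480 x 480 crop
    left = rng.integers(0, w - crop + 1)
    img = img[top:top + crop, left:left + crop]
    return img - IMAGENET_MEAN                     # subtract Imagenet channel means
```

A fresh random crop and color shift are drawn each time the function is called, so repeated cycles over the same images yield different training inputs.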
The following training workflow was adopted for this study. All available images were sequentially numbered and split into five approximately equal subsets. The first three subsets were used as a single training set, i.e. 60% of all available images. The fourth and fifth subsets became the validation and testing sets, respectively. More precisely, the ith image was allocated to the validation set if (i + 1) was a multiple of 5, or to the test set if i was a multiple of 5, with all remaining images assigned to the training set.
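The allocation rule above can be sketched directly (a minimal illustration of the described 60/20/20 split, using 1-based image numbering):

```python
def split_indices(n):
    """Allocate the ith image (1-based) per the paper's rule:
    validation if (i + 1) is a multiple of 5, test if i is a multiple of 5,
    training otherwise -- giving an approximate 60/20/20 split."""
    train, val, test = [], [], []
    for i in range(1, n + 1):
        if (i + 1) % 5 == 0:
            val.append(i)
        elif i % 5 == 0:
            test.append(i)
        else:
            train.append(i)
    return train, val, test
```

For n = 10, this yields training images {1, 2, 3, 6, 7, 8}, validation images {4, 9} and test images {5, 10}.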
The training of FCN-8s was done in up to 100 cycles. In each cycle, TAP480 was further applied to the already ISP640-processed images. The training images were loaded into memory as an X(Nt, M, M, C) tensor (a multidimensional matrix), where Nt = 200 was the number of images, M = 480 was the TAP480 cropping length, and C = 3 was the number of color channels. The ground-truth binary per-pixel masks corresponding to the loaded training images were loaded as a one-hot encoded Y(Nt, M, M, K) tensor, where Y(i, m, l, k) = 1 if the (m, l) pixel belonged to the kth class in the ith image and zero otherwise. The required number of classes was K = 1 for both the automatic whale locator and the single-whale classifier, as described later in this paper. The validation Xv and Yv tensors were constructed in a similar fashion.
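The one-hot encoding of the ground-truth masks can be sketched as follows (a minimal illustration under the assumption that per-pixel labels are integers in {0..K} with 0 denoting background; the helper name is hypothetical):

```python
import numpy as np

def one_hot_masks(labels, K):
    """Convert per-pixel integer label masks of shape (N, M, M), with values
    in {0..K} where 0 is background, into a one-hot Y(N, M, M, K) tensor so
    that Y[i, m, l, k] = 1 iff pixel (m, l) of image i belongs to class k + 1."""
    N, M, _ = labels.shape
    Y = np.zeros((N, M, M, K), dtype=np.float32)
    for k in range(K):
        Y[..., k] = (labels == k + 1)
    return Y
```

For the binary K = 1 case used in this study, background pixels simply get an all-zero class vector.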
The per-pixel binary cross-entropy loss function, e.g. p.231 of [
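The per-pixel binary cross-entropy loss can be written out explicitly (a NumPy sketch of the standard definition, averaged over all pixels of a batch; the clipping constant is a common numerical-stability choice, not from the paper):

```python
import numpy as np

def pixel_bce(y_true, y_pred, eps=1e-7):
    """Per-pixel binary cross-entropy averaged over a batch of masks.
    y_true, y_pred: (N, M, M, K) arrays with values in [0, 1]."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))
```

Predicting 0.5 everywhere against all-ones ground truth gives the expected loss of ln 2 ≈ 0.693.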
Being a segmentation model, the FCN-8s model required the ground-truth per-pixel binary mask for each of the training and validation images. Therefore, the auxiliary goal of this study was to design the required workflow to be as scalable as possible for future larger training datasets. Creating the ground-truth per-pixel binary masks was clearly the least scalable component of this study, and required a scalable solution. This was solved by training an instance of FCN-8s to be the Minke Whale Locator (MWL).
To train MWL, 100 images were segmented by hand (including 50 of the MW1020 individual) to produce a binary per-pixel ground-truth mask Y for each of the 100 images. Then MWL was trained as per the preceding Section 2.2 with the following modifications. In addition to TAP480, images were flipped horizontally with 0.5 probability. The available 100 images were split into 70 for training and 30 for validation, where the rest of the non-segmented images were considered to be the testing set. The Keras version of the RMSprop optimizer was used with a 10^−4 learning rate and a 10^−3 learning-rate decay after each weights update, where RMSprop “divides the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight” [
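The quoted RMSprop update can be sketched per weight as follows (a minimal illustration of the update rule only; the smoothing factor rho and epsilon are common defaults assumed here, not values from the paper, and the Keras learning-rate decay schedule is omitted):

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq, lr=1e-4, rho=0.9, eps=1e-7):
    """One RMSprop update: the learning rate for each weight is divided by a
    running average (RMS) of the magnitudes of its recent gradients."""
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2   # running average of squared gradients
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)       # scaled gradient step
    return w, avg_sq
```

Weights with persistently large gradients thus take proportionally smaller steps, which stabilizes training.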
The trained MWL was applied to all available images to automatically generate one largest rectangular binary mask per ISP640 pre-processed image. Note that since MWL was fully convolutional, it could be rebuilt to accommodate any required image dimensions, where one side was always 640 (due to ISP640) and the other side varied. The mask generation was done as follows. For each image, the per-pixel prediction heat-map Yp(i, j) was converted to a binary mask B via

B(i, j) = 1, if Yp(i, j) ≥ 0.8, (1)

where i and j are the row and column pixel location indices, respectively, and where the remaining mask values were set to zero, i.e. B(i, j) = 0 if Yp(i, j) < 0.8. The largest connected non-zero area was filled to complete its minimum enclosing rectangle, which was saved as the only non-zero region of the final binary mask.
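The thresholding of Equation (1) followed by the largest-component rectangle fill can be sketched as follows (a self-contained illustration; the breadth-first search with 4-connectivity stands in for whatever connected-component implementation the authors used):

```python
import numpy as np
from collections import deque

def heatmap_to_box_mask(y_pred, thr=0.8):
    """Threshold a per-pixel heat-map (Eq. 1), find the largest connected
    non-zero area (4-connectivity), and fill its minimum enclosing rectangle."""
    b = (y_pred >= thr).astype(np.uint8)
    seen = np.zeros(b.shape, dtype=bool)
    best = []  # pixel coordinates of the largest connected component
    rows, cols = b.shape
    for r in range(rows):
        for c in range(cols):
            if b[r, c] and not seen[r, c]:
                comp, q = [], deque([(r, c)])
                seen[r, c] = True
                while q:                      # breadth-first flood fill
                    i, j = q.popleft()
                    comp.append((i, j))
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ni, nj = i + di, j + dj
                        if (0 <= ni < rows and 0 <= nj < cols
                                and b[ni, nj] and not seen[ni, nj]):
                            seen[ni, nj] = True
                            q.append((ni, nj))
                if len(comp) > len(best):
                    best = comp
    mask = np.zeros(b.shape, dtype=np.uint8)
    if best:  # fill the minimum enclosing rectangle of the largest component
        rs = [p[0] for p in best]
        cs = [p[1] for p in best]
        mask[min(rs):max(rs) + 1, min(cs):max(cs) + 1] = 1
    return mask
```

Smaller spurious detections are discarded, so only one rectangular region per image survives.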
Similar to the preceding MWL model, an instance of the FCN-8s model was created for a required number K of individual whales to become the Automatic Minke Whale Recognition (AMWR) model. To train AMWR, the masks automatically created by MWL for the K whales were reviewed for correctness. Specifically, each MWL-generated rectangular mask was checked to make sure it enclosed the correct whale if multiple whales were present in an image. Also, if the mask did not enclose the whole whale, the mask was verified to enclose all of the whale’s features that a biologist could use to identify that whale, i.e. fin coloration patterns and distinct scars. Note that in this study, the MWL model was nothing more than a convenience tool to automate ground-truth mask creation. Therefore, where available, the manually segmented masks were used instead of the corresponding MWL masks. MWL produced acceptable bounding boxes in more than 90% of cases, confirming it to be a viable tool for this project.
The AMWR was trained as per the preceding Section 2.2 with the following modifications. For the K selected whales, the positive ground-truth masks (manually or automatically MWL-segmented) were used. The training masks for the remaining (76 − K) whales were automatically generated as negative or all-zero masks, since the K selected whales were absent from the remaining images. Then the training proceeded as per MWL but with an added regularization weight decay set to 10^−4.
The largest number (179) of images was available for the individual whale MW1020, so it was used as the benchmark of achievable accuracy for the utilized dataset and the AMWR model with K = 1. As per the preceding Sections 2.3 and 2.4, 50 masks were segmented manually, and the rest of the available MW1020 images (129) were segmented by MWL and quality-checked visually. The MW1020 training, validation and test sets contained 107, 36, and 36 images, respectively. The remaining images of other whales (1141) were automatically labeled as negative, and split 60% training, 20% validation, and 20% test. Because there were many more negative labels than positive, for each training cycle an equal number of images (100) was randomly selected from both the negative and positive/MW1020 training images. Similarly, all available 36 MW1020 validation images were used with 36 randomly selected negative validation images, where a new random selection of 36 negative images was made before each training cycle. Also, due to the highly unbalanced numbers of positive and negative examples, the AMWR classifier was assessed via precision, recall, and fp rate (false-positive rate), in addition to the standard accuracy [
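The balanced per-cycle sampling can be sketched as follows (a minimal illustration; the function name and seed handling are assumptions, and the example pool sizes below merely echo the 107 positive training images described above):

```python
import random

def balanced_cycle(pos, neg, n=100, seed=None):
    """Select an equal number (n) of positive and negative training images
    for one training cycle, re-sampling the much larger negative pool
    anew each cycle."""
    rng = random.Random(seed)
    return rng.sample(pos, n) + rng.sample(neg, n)
```

Calling this once per cycle keeps each cycle's effective class balance at 50/50 despite the roughly 10:1 imbalance in the full training set.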
precision = TP/(TP + FP), recall = TP/P, fp rate = FP/N, (2)

accuracy = (TP + TN)/(P + N), (3)
where TP, TN, FP and FN were the numbers of true-positive, true-negative, false-positive and false-negative predictions, respectively, and where P and N were the total numbers of positive (MW1020) and negative (non-MW1020-whale) images.
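Equations (2)-(3) translate directly into code (a sketch; the counts in the usage example are hypothetical, merely chosen to be consistent with a test set of 36 positive and 228 negative images):

```python
def scores(tp, tn, fp, fn):
    """Prediction scores from Equations (2)-(3),
    with P = TP + FN positives and N = TN + FP negatives."""
    P, N = tp + fn, tn + fp
    return {
        "precision": tp / (tp + fp),
        "recall": tp / P,
        "fp_rate": fp / N,
        "accuracy": (tp + tn) / (P + N),
    }
```

For instance, hypothetical counts TP = 29, TN = 218, FP = 10, FN = 7 give P = 36 and N = 228, and hence recall 29/36 ≈ 0.806.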
The main distinct advantage of a per-pixel classifier (rather than per-image) such as the presented AMWR, is the full control over how “conservative” or “liberal” [
Prediction Score | Train | Validation | Test
---|---|---|---
Accuracy | 0.984 | 0.924 | 0.935
Precision | 0.935 | 0.735 | 0.743
Recall | 0.953 | 0.694 | 0.805
Fp rate | 0.01 | 0.04 | 0.04
On the test subset, AMWR achieved 4% false-positive rate (
Due to the increasing abundance of underwater digital imagery, the manual identification of individual dwarf minke whales from images and videos has become cost-ineffective. It has become excessively time-consuming to manually check if an unsorted image contains a new whale or a known whale, e.g. from the 76 labeled whales of this study’s dataset. Considering that photo-identification of dwarf minke whales represents one of the few methods available to address key knowledge gaps for this species’ biology and life history, the application of automated recognition tools can potentially provide new scientific insights that would otherwise be inaccessible to scientists. The quantity of images for individual whales presented a theoretically challenging problem, where the number of available labeled images was too large for further manual labeling, but not large enough to apply Deep Learning classification CNNs. This study demonstrated how the Deep Learning per-pixel segmentation FCN-8s [
The authors are profoundly grateful for the contributions of passengers, crew and owners of the permitted swim-with-whales tourism vessels in the Great Barrier Reef who have helped to provide many of the minke whale images used in this study. We are also deeply grateful to the many Minke Whale Project volunteers who have helped to sort our minke images. We are particularly indebted to our research colleagues associated with the Minke Whale Project who have facilitated our photo-identification work, including especially Dr Susan Sobtzick (who developed our main MWP Catalogue), Chrystie Watson, Tara Stephens, Liz Forrest, A/Prof Trina Myers, Dr Dianna Hardy, Prof Ian Atkinson and Kent Adams.
Konovalov, D.A., Hillcoat, S., Williams, G., Birtles, R.A., Gardiner, N. and Curnock, M.I. (2018) Individual Minke Whale Recognition Using Deep Learning Convolutional Neural Networks. Journal of Geoscience and Environment Protection, 6, 25-36. https://doi.org/10.4236/gep.2018.65003