This paper introduces MapReduce as a distributed data processing model that uses the open-source Hadoop framework to manipulate large volumes of data. The huge volume of data in the modern world, particularly multimedia data, creates new requirements for processing and storage. As an open-source distributed computational framework, Hadoop provides the infrastructure needed to process large numbers of images on an arbitrarily large set of computing nodes. This paper introduces the framework, surveys current work, and discusses its advantages and disadvantages.
The amount of image data has grown considerably in recent years due to the growth of social networking, surveillance cameras, and satellite imagery. However, this growth is not limited to multimedia data. The huge volume of data in the world has created a new field of data processing called Big Data, which is nowadays positioned among the top ten strategic technologies [
Big Data refers to the massive amounts of data collected over time that are difficult to analyze and handle using common database management tools [
The next approach to processing large volumes of data and images was distributed systems with the Message Passing Interface (MPI). With its idea of parallel data processing on distributed computing nodes and of spreading the data across those nodes, this approach promised a bright future for new data processing needs. However, the problem this technique faced was that parallel coordination and the implementation of the required algorithms depended entirely on the system programmer and developer. It was therefore not widely embraced, due to the lack of experts and professional developers [
Google, as one of the leading companies in the field of big data, proposed the MapReduce programming model [
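The core of the model can be illustrated with a minimal single-machine sketch in plain Python (no Hadoop required; the word-count task and function names here are our own illustration, not part of any Hadoop API): a map function emits key-value pairs, the framework groups the values by key, and a reduce function aggregates each group.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input split."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values for one key (here, sum the counts)."""
    return (key, sum(values))

documents = ["the cat sat", "the dog sat"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts == {"the": 2, "cat": 1, "sat": 2, "dog": 1}
```

In a real cluster, the map calls run on many machines at once and the shuffle moves data over the network; only the two user-written functions change from problem to problem, which is what makes the model attractive.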
The main obstacle to the adoption of this model was provisioning a computing cluster for its implementation, which requires energy, cooling systems, physical space, and the necessary hardware and software. These requirements are costly for many small and mid-size companies and enterprises [
This barrier has now been removed by the popularity of cloud computing, which provides consumers with low-cost hardware and software billed by resource use: simply rent the required number of computing nodes and resources when needed, run your algorithm, and collect the result.
One of the well-known examples in this field is the generation of PDF files from the scanned daily archives of the New York Times in 2007. In that case, 11 million photos with a volume of about 4 terabytes were converted to PDF in only 24 hours using 100 Amazon cloud computing nodes. The task would have taken many years using conventional systems and algorithms [
In this paper, we introduce the MapReduce model as the basis of modern distributed processing, together with its open-source implementation, Hadoop, the work that has been done in this area, and its advantages and disadvantages as a framework for distributed processing, especially in image processing.
In this paper, it is assumed that readers are familiar with cloud computing. In summary, cloud computing provides online computation and processing for users without their having to worry about the number of required computers, resources, and other considerations. Users pay based on the amount of resources they consume. Refer to source number [
Hadoop is an open source framework for processing, storage, and analysis of large amounts of distributed and unstructured data [
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo at the time, named it after his son’s toy elephant. It was originally developed to support distribution for the Nutch search engine project [
The Apache Hadoop framework is composed of the following modules [
Hadoop Common contains libraries and utilities needed by other Hadoop modules.
Hadoop Distributed File System (HDFS)―a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN―a resource-management platform responsible for managing compute resources in clusters and using them to schedule users’ applications. This part is also known as MapReduce 2.0 (MRv2).
Hadoop MapReduce―a programming model for large scale data processing.
In this system, large data files, such as transaction logs, social network feeds, and other data sources, are segmented and then distributed across the network.
Sharing, storing, and retrieving large files on a Hadoop cluster is undertaken by its distributed file system called HDFS [
There are three types of compute nodes in HDFS [
Data nodes, which comprise the member computers of a Hadoop cluster, hold the file blocks. For each set of data nodes, the Hadoop system has a name (management) node. The third type is the secondary node, which keeps a copy of the name node's data, so that if the name node stops working, the data will not be lost.
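The division of labor between the name node and the data nodes can be sketched in a few lines of plain Python (illustrative only; the 4-byte block size and replication factor of 2 are toy values, whereas real HDFS uses large blocks, typically 128 MB, and 3 replicas by default): a file is cut into fixed-size blocks, and the name node records which data nodes hold each block.

```python
def store_file(data, block_size, data_nodes, replication=2):
    """Split a file into fixed-size blocks and assign each block to
    `replication` data nodes round-robin; return the blocks and the
    name-node table mapping block id -> data nodes holding a replica."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    name_node_table = {}
    for block_id in range(len(blocks)):
        replicas = [data_nodes[(block_id + r) % len(data_nodes)]
                    for r in range(replication)]
        name_node_table[block_id] = replicas
    return blocks, name_node_table

blocks, table = store_file(b"x" * 10, block_size=4,
                           data_nodes=["node1", "node2", "node3"])
# 10 bytes with 4-byte blocks -> 3 blocks, each replicated on 2 nodes
```

Because each block lives on more than one data node, the loss of a single machine does not lose data, and a computation can be scheduled on whichever replica is closest.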
After data is distributed in the Hadoop system, analysis and processing are carried out by the MapReduce component [
After that, the request is sent to each node. These nodes, which we call task trackers, perform data processing independently and in parallel by running the Map function [
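This division of labor can be mimicked on one machine with a thread pool (a rough sketch under our own assumptions, not Hadoop's actual scheduler; the image-histogram task and function names are hypothetical): the coordinating side partitions the input, each "task tracker" runs the Map function over its own split in parallel, and the partial results are merged at the end.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def task_tracker(images):
    """One task tracker: run the Map function over its own split
    (here, histogram the pixel values of its share of the images)."""
    hist = Counter()
    for image in images:
        hist.update(image)
    return hist

def job_tracker(images, n_workers=3):
    """Coordinator: partition the input, dispatch the splits to the
    workers in parallel, then merge the partial histograms."""
    splits = [images[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(task_tracker, splits)
    total = Counter()
    for partial in partials:
        total += partial
    return total

# Three tiny "images" as lists of grayscale pixel values.
images = [[0, 0, 255], [255, 255, 0], [128, 128, 128]]
hist = job_tracker(images)
# hist[0] == 3, hist[255] == 3, hist[128] == 3
```

The key property the sketch preserves is that each worker needs only its own split, so no coordination is required until the final merge.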
In addition to the above mechanism, which is the basis of Hadoop's distributed and parallel data processing, a number of supplemental projects have been designed for it. With their help, big data processing can be done more easily and professionally. In this section, we look at these libraries and Hadoop utilities.
To store and retrieve information in Hadoop in a more professional manner, NoSQL databases such as HBase and Cassandra can be used. These types of databases support the MapReduce mechanism and are specifically designed to store and retrieve large amounts of unstructured data [
In addition to Java, users' requests can be written in an open-source language called Pig, which is designed specifically for Hadoop and is relatively simple to learn.
To collect data from different sources and store them in Hadoop, the Flume framework was designed. Flume software agents, placed on web servers, mobile devices, and so on, collect data and forward them immediately.
For sequential execution of requests, the Oozie library was devised. It allows users to run multiple requests in sequence, where each request can use the output of the previous one as its input. The Whirr library is recommended for running Hadoop on cloud computing systems and virtual platforms.
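The Oozie idea of chaining requests, where each step consumes the previous step's output, can be sketched as a simple pipeline (illustrative only; the three "requests" below are hypothetical stand-ins, whereas real Oozie workflows are defined declaratively and run on the cluster):

```python
def run_workflow(initial_input, steps):
    """Run each step ('request') in order, feeding the output of one
    step into the next, as an Oozie-style sequential workflow does."""
    data = initial_input
    for step in steps:
        data = step(data)
    return data

# Three hypothetical "requests": tokenize, filter short words, count.
workflow = [
    lambda text: text.split(),
    lambda words: [w for w in words if len(w) > 2],
    lambda words: len(words),
]
result = run_workflow("a map and a reduce step", workflow)
# -> 4  (the words 'map', 'and', 'reduce', 'step' survive the filter)
```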
The Mahout library has been designed for data mining on the Hadoop distributed platform. It takes the most common data mining algorithms, such as clustering and regression, as input and converts them into appropriate MapReduce requests for Hadoop [
Hadoop, which is licensed and supported by the Apache Software Foundation, is accessible to researchers as an open-source framework. Besides Hadoop, the MapReduce model can also be used through Twister [
Li et al. [
Kennedy and colleagues [
In order to determine the subject of images, Yan et al. [
For Content Based Image Retrieval (CBIR), Shi et al. [
Zhao, Li, and Zhou from Peking University conducted a research on the use of MapReduce model for satellite imagery documentation and management and spatial data processing [
Yang and colleagues [
For identification through the cornea, Shelly and Raghava designed and implemented a system using Hadoop and cloud computing [
Kucakulak and Temizel proposed a Hadoop-based system for pattern recognition and image processing of intercontinental missiles [
Almeer designed and implemented a system for remote sensing image processing systems with the help of Hadoop and cloud computing systems with 112 compute nodes [
The most important advantage of Hadoop is its ability to process and analyze large amounts of unstructured or semi-structured data that until now have been impossible to process efficiently, in terms of cost and time [
The next advantage of Hadoop is its simple expansion and horizontal scalability. Data can easily be analyzed up to the exabyte level, so companies no longer need to work with samples or subsets of the original data. With the help of Hadoop, all types of data can be examined.
Another advantage is its low setup cost, mainly because it is free and there is no need for expensive, specialized hardware. In particular, with the spread of cloud computing and its reasonable prices for on-demand data processing, as well as private clouds, it takes only a few hours to set up a Hadoop system [
On the other hand, Hadoop and its subprojects are all in the early stages of development and are still unstable and immature. This leads to constant modification of the framework, which imposes continuous training costs on organizations.
Moreover, because of the novelty of this software model, few people have the necessary skills to set up and work on Hadoop-based systems. The lack of expert manpower is the most important challenge many companies face in using this system.
The novelty of this technology also means there are no valid standards and benchmarks for evaluating different algorithms in this area. Bajcsy et al. attempted to assess four different methods of Hadoop-based image processing on a cluster [
Another problem of Hadoop, inherent in its nature, is its inability to process data in real time. The request tracker must wait for every compute node in the system to finish its work before it can deliver the final answer to the user. However, this problem will be solved to some extent by the rapid growth of NoSQL database technologies and their combination with Hadoop. Moreover, frameworks such as Storm [
The huge volume of visual data in recent years, and the need to process it efficiently and effectively, stimulates the use of distributed frameworks in the image processing area, so much so that in the coming years many algorithms introduced in the fields of image processing and pattern recognition will have to take the requirements of large-scale image processing into account in order to be adopted by the outside world. This paper gives an overview of distributed processing methods and programming models and surveys work done in recent years using the open-source Hadoop framework. Hadoop and its processing model are newly formed and, like any other new technology, have their own issues, such as the unfamiliarity of most of the IT community with them, the shortage of expert personnel, and unwanted defects and problems due to their novelty. Nevertheless, this processing style, which uses the MapReduce model and a distributed file system, will be among the most useful tools for image processing and pattern recognition in the coming years due to its consistency with cloud computing structures.