Journal of Software Engineering and Applications, 2013, 6, 62-66
doi:10.4236/jsea.2013.63b014 Published Online March 2013 (http://www.scirp.org/journal/jsea)
Copyright © 2013 SciRes. JSEA
Carving Thumbnail/s and Embedded JPEG Files Using
Image Pattern Matching
Nurul Azma Abdullah, Rosziati Ibrahim, Kamaruddin Malik Mohamad
Faculty of Computer Science & Information Technology, UTHM, Malaysia.
Email: {azma, rosziati, malik}@uthm.edu.my
Received 2013
ABSTRACT
Images (typically JPEG) are used as evidence against cyber perpetrators. Typically the file is carved using standard pat-
terns. Many concentrate on carving JPEG files and overlook the important of thumbnail in assisting forensic investiga-
tion. However, a new un ique p attern is u sed to detect thumbnail/s and embedded JPEG file. This p aper is to introduce a
tool call PattrecCarv to recognize thumbnail/s or embedded JPEG files using unique h ex patterns (UHP). A tool called
PattrecCarv is developed to automatically carve thumbnail/s and embedded JPEG files using DFRWS 2006 and
DFRWS 2007 datasets. The tool successfully recovers 11.5% more thumbnails and embedded JPEG files than Pred-
Clus.
Keywords: Component; JPEG Image; Pattern Matching; File Carving; DFRWS 2006/07; Thumbnail
1. Introduction
A method to recover files from the disk collected from
the crime scene which is known as cyber evidences is
called file carving [1-3]. Year by year, with the increas-
ing number of computers and other digital devices usages,
file carving technique also evolve drastically. In carving
a JPEG images, the focus is in fragmentation file carving.
Although there are researchers discussed on fragmenta-
tion such as [2-4], it is not yet completely solved.
This paper is introducing a way to assist in reducing
number of JPEG files to be processed for fragmentation
point detection. A JPEG image can be embedded into
other file types such as doc, ppt, etc. Furthermore,
thumbnails can be embedded into a JPEG image itself to
ease the recovering and organizing of the original image.
A JPEG image can contains none, single or two thumb-
nails in the image itself. A thumbnail which is a reduced
version of an image carried similar feature as the original.
This thumbnail is always mistaken as another JPEG im-
age. Therefore, knowledge of thumbnail’s existence
helps investigator to separate JPEG files with thumb-
nail/s and concen trates to investigate correct point where
the real fragmentation occurs. They can then identify
which JPEG image is fragmented with another JPEG
images. With this way, they can ascertain that those
fragments are belongs to two or more different files, not
file/s within a file. This is important because during the
reassembling process, if a thumbnail is mistakenly iden-
tified as another JPEG files, the original file may corrupt
because of missing fragments. Consequently, this is also
accelerating the reassembling process by allowing inves-
tigators to concentrate on real fragmentation situatio n.
In this paper, a novel algorithm called PattrecCarv is
proposed. PattrecCarv is developed to recover thumb-
nail/s or embedded files from DFRWS 2006 and 2007
datasets using unique hex patterns (UHP). The output can
be used as a pre-processing data to simplify the process
for recovering of JPEG image.
The rest of the paper is organized as follows. Section
II is the related works consists of an overview of JPEG
standard and thumbnail and embedded JPEG files while
Section III brief on PredClus algorithm. Section IV de-
scribes the proposed PattrecCarv algorithm. Section V
describes the experimentations done. Section VI de-
scribes the result and discussion. Finally section VII con-
cludes this paper.
2. Related Works
2.1. An Overview of JPEG Standard
Computer forensics is to recover evidences resides on a
computer, by mean to solve pornography cases [1-3].
This involves image files obtained from the perpetrator
in certain format like Bitmap and JPEG but most com-
mon format is JPEG. JPEG is popular because of its
compressed file that can reduce the size required to allo-
cate an image. Joint Photographic Experts Group (JPEG)
was formed by International Telegraph and Telephone
Carving Thumbnail/s and Embedded JPEG Files Using Image Pattern Matching
Copyright © 2013 SciRes. JSEA
63
Consultative Committee in 1986 inspired by an effort of
International Organization of Standard (ISO) to find
ways to use high resolution graphics and pictures in
computers [4]. JPEG introduced compression standard
for both grayscale and color continuous-tone images. The
details of JPEG compressed data formats can be found in
[5]. There are two types of JPEG that are mostly used
today, JPEG File Interchange Format (JFIF) and JPEG
Exchangeable Image File Format (Exif) [6,7]. JFIF is
popular for internet file while EXIF is the popular image
file format used for digital camera [8].
2.2. Thumbnail and Embedded JPEG Files
Both in JFIF and Exif format allow for embedding
thumbnail/s into a JPEG file. A JPEG image with a com-
plete SOI/EOI can be embedded into an original JPEG
image to ease the recovering and organizing of the origi-
nal image. This file is known as thumbnail. Thumbnails
are reduced size version of images that can be used to
recover and organize the picture [9] while embedded
JPEG files are referred to original JPEG files that are
embedded to other types of files such as PPT, WORDS
and EXCEL. Thumbnails are used to speed up images
search or page load on the Internet and also being used in
image organizing programs. Thumbnails are compatible
on most modern operating systems or desktop environ-
ments such as Microsoft Windows, Mac OS X, KDE and
GNOME [10]. A JPEG image can contain none or a sin-
gle or two thumbnails. Therefore, a JPEG image can
have several SOI/EOI pairs [11]. Mohamad in [12] and
[13] asserted the role of thumbnail to serve as a method
of recognizing the corrupted images because of its small
size that have a better chance for full recovery without
corruption [14]. A thumbnail carried similar features as
the original. Hence, using thumbnail/s, crime investiga-
tors can identify which images or pictures that have po-
tential to be used as evidences against cyber perpetrator.
Guo in [9] proposed thumbnails as a method to recover
JPEG image from fragment data. In brief, thumbnails do
serve multiple roles. Besides contributing in the process
of recovering and organizing JPEG files, thumbnails help
in recognizing corrupted images and also, information
about thumbnail’s location can be used in carving frag-
mentation JPEG images to recover the original files.
Abdullah et al. [15] proposed PredClus as a method to
recognize thumbnail/s and embedded JPEG files. How-
ever, using PredClus which using cluster size to deter-
mine the location of thumbnail/s or embedded file may
miss some thumbnails that resides at the start of cluster.
This situation occurs when a JPEG image with thumb-
nail/s require more than one cluster to store the data.
Sometimes, the start of thumbnail will be at the start of
second cluster. In this situation, the thumbnail/s will be
ignored by PredClus. Hence, an alternative technique to
distinguish thumbnail or embedded JPEG file with the
original is by using pattern matching technique. In carv-
ing JPEG images especially fragmented JPEG files, it
will ease the process of preparing evidence if the carver
can distinguish between original images, thumbnails and
embedded images.
3. Predclus Algorithm
PredClus is developed to automatically determine cluster
size of a dataset. Using this information, JPEG images
that are not located at the starting address of any cluster
are marked as thumbnails or embedded JPEG files. The
algorithm of PredClus is introduced to predict cluster
size used in both DFRWS 2006 and 2007.
First, data from dataset is read. These data are in hex
values. The hex values then matched with the standard
JPEG header. However, in this experiment, additional
markers are also used instead of standard JPEG header,
0xFFD8 alone. The additional markers used are 0xFFE0,
0xFFE1, 0xFFE2, 0xFFC4 and 0xFFDB. When matched,
the offset for each markers matched is retrieved. Using
formula as mentioned earlier, the determinant value is
calculated. If the determinant value = 0, then file found is
counted. This is done for each cluster size which are
512-byte, 1-kb, 2-kb, 4-kb and 8-kb cluster. Please refer
to [15] for detail explanation of PredClus.
After the determinant value for all JPEG files in the
datasets is extracted, files found for each cluster size th en
are summed. The percentage for each cluster size is cal-
culated. Then, a report is prod uced .
4. Pattreccarv Algorithm
This section discusses on the development of the pro-
posed algorithm called PattrecCarv. The algorithm is
adapted from dual-byte-marker algorithm proposed by [1]
to detect JPEG headers (SOI), thumbnails and embedded
JPEG files. DFRWS 2006 and 2007 datasets are used for
testing this algorithm. Nevertheless, the algorithm can
also work with other datasets.
A thumbnail in JFIF format can be recognized using
UHP in Table 1 while a thumbnail in EXIF format as
shown in Table 2. On the other hand, embedded JPEG
files can be recognized using UHP as in Table 3.
Table 1. The first 17 bytes of JFIF thumbnail.
SOIAPP0LengthIdentifier VersionAdditional
Markers
0x4A4649 46 00
0XFFD
8 0XFFE
0 0X00
J FI F NULL
0x0102 0x010048
0048
2 bytes2 bytes1 byte5 bytes 2 bytes5 bytes
Carving Thumbnail/s and Embedded JPEG Files Using Image Pattern Matching
Copyright © 2013 SciRes. JSEA
64
Table 2. The first 6 bytes of non-JFIF thumbnail.
SOI Validated markers
0xFFD8 0xFFDB/0xFFC4 0x0084
2 bytes 2 bytes 2 bytes
Table 3. The first 14 bytes of embedded JPEG files.
SOI APP0 Length Identifier VersionAdditional
Markers
0x4A 4649 46 00
0XFFD8 0XFFE0 0X00
J F I F NULL
0x0102 0x0003
2 bytes 2 bytes 1 byte 5 bytes 2 bytes5 bytes
Basically, the PattrecCarv algorithm works as follows:
STEP 1: Identify start-of-image (SOI or 0xFFD8)
markers (refer to Table 1)
If two-byte structure read is a SOI marker, then
jump to the 9th hex value.
STEP 2: Once SOI is found, locate the embedded UHP
(refer to Table III)
If the 9th hex value is the embedded UHP, then em-
bedded JPEG header is found.
Else, read the next value.
STEP 3: If the read hex values are the UHP for
thumbnail (refer to Table 1 & Table 2), then thumbnail
is found.
Repeat STEP 1, STEP 2 and STEP 3 until end of data.
The algorithm of PattrecCarv is illustrated in Figure 1.
5. Experimentation
This section discusses on the experiment designed for
ThumbedCarv model as illustrated in Figure 2. Both
algorithms, PredClus and PattrecCarv are installed into
this model and the results from both algorithms are
compared. Both algorithms are developed using C++
language in Windows 7 with Intel® Core TM2 Quad
CPU and 2GB of physical memory.
The input of this model is from two datasets, DFRWS
2006 and DFRWS 2007. Comparisons are made for these
algorithms (PredClus and PattrecCarv) based on the
number of successfully JPEG thumbnails and embedded
JPEG files recovered.
The details of PredClus algorithm are discussed in [15].
PattrecCarv consists of function to carve thumbnails and
embedded JPEG files using pattern recognition technique.
To carve thumbnails, three sets of validated markers as
shown in Table 1 and Table 2 are used while validated
markers for carving embedded JPEG files are shown in
Table 3.
Figure 1. Algorithm used in PattrecCarv for carving
thumbnails and embedded JPEG file s.
Datasets PredClus
Report
PattrecCarv
Marke
r
search
Thumbnail/
embedded JPEG
markers found
Figure 2. ThumbedCarv model.
The algorithm starts with reading the dataset. Once
JPEG SOI markers are found, the next APP0 markers are
read. If it is matched, it reads the next 9th hex value. If
next two bytes hex values match the embedded JPEG file
markers, the current offset is recorded. If not, next one
byte hex value is read. Once, a JFIF thumbnails markers
are detected, the offset value of the thumbnail is recorded.
If the APP0 markers are not found after SOI markers, the
algorithm checks the next two bytes hex values. If they
match with validated markers as in Table 1, then thumb-
nail is detected. Finally, the report of thumbnails and
embedded JPEG files is generated.
1. Read data image
2. Initialize hex_values
3. Initialize thumb_markers
4. Initialize embedded_m arkers
5. Find JPEG header
6. If found
7. Jump to 9
th hex_values
8. If hex_values== embed-
ded_markers
9. Read CurrentOffset
10. Save the Curren-
tOffset
11. Else
12. Read next
hex_values
13. If
hex_values==thumb_markers
14. Go to the most
recent SOI
15. Save the Cur-
rentOffset (location of SOI)
16. Endif
17. Endif
18. If not end of data image, repeat step 5
19. Generate report
Carving Thumbnail/s and Embedded JPEG Files Using Image Pattern Matching
Copyright © 2013 SciRes. JSEA
65
6. Result and Discussion
The screenshot for PattrecCarv output can be clearly
examined in Figure 3 and Figure 4. Figure 4 depicts
total number of thumbnails with JFIF headers detected
from DFRWS 2006 dataset is 6 and none for embedded
JPEG files. There are also no thumbnails using UHP 0 x
FFD8 Figure 4 shows total number of thumbnails and
embedded JPEG files detected is 33 which is correspond
with 1 thumbnail with the JFIF header, 18 thumbnails
recognized using UHP of 0xFFD8 and 0xFFDB, 2
thumbnails detected using UHP of 0xFFD8 and 0xFFC4
and 12 embedded JPEG files.
Table 4 shows the co mparisons done on PredClus and
PattrecCarv algorithms. From the table, there is not a
distinct difference of execution time but there is some
interesting findings in term of thumbnail/ embedded file
found, original detect as thumbnail, thumbnail or em-
bedded JPEG files missed and false detection.
Figure 3. Screenshot of PattrecCarv output using DFRWS
2006 dataset.
Figure 4. Screenshot of PattrecCarv output using DFRWS
2007 dataset.
Table 4. Comparison done on PredClus and PattrecCarv.
Experiment 1
(PredClus) Experiment 2
(PattrecCarv)
Dataset Dataset
2006 Dataset
2007 Dataset
2006 Dataset
2007
Thumbnail/ embedded file
found 5 23 5 26
Thumbnail/ embedded file
that is missed 0 3 0 0
Original detected as
thumbnail 0 0 0 4
False detection 0 0 1 3
Time taken (sec) 3.0 27.0 4.0 27.0
Clearly, PattrecCarv detects more thumbnail/embed-
ded file compared to PredClus though it falsely detect 4
files as thumbnails. PredClus is using cluster size infor-
mation to determine the detected file is either thumbnail
or embedded file or original file. That is the reason it did
not make any mistake in determining thumbnail/embed-
ded file. However, using PredClus, some thumbnails can
be missed. This is caused by a big JPEG file that pushes
the header of thumbnail to be stored at the start of file.
Furthermore, PredClus cannot differentiate between
thumbnails and embedded files because it does not know
any information about the file detected; only the size of
cluster is known. Although PattrecCarv has falsely de-
tected 5 files, but it does not miss to carve any thumb-
nails/embedded files and separate between thumbnails
and embedded files. The original file detected as thumb-
nail is caused by fragmented data. An experiment has
been conducted manually to investigate the cause of this
condition. It is found that all original files detected as
thumbnails are fragmented with another JPEG files or
other files.
7. Conclusion
JPEG file can be in a form of original JPEG file, thumb-
nail or embedded in another file. However, the impor-
tance of thumbnail and embedded file should not be
overlooked in forensic investigation. This paper intro-
duces a unique file pattern matching technique, which is
embedded in a tool called PattrecCarv. PredClus assumes
that all images starting from the first byte of a cluster as
an image which may mistakenly detect a thumbnail as an
original file where as PattrecCarv uses uniqu e file pattern
matching technique. Based on experiments done using
DFRWS 2006 and 2007 data sets, PattrecCarv success-
fully carves thumbails and embedded JPEG files more
efficiently as compared to PredClus.
8. Acknowledgment
The authors would like to thank Universiti Tun Hussein
Onn Malaysia (UTHM) for supporting this research.
REFERENCES
[1] S. L. Garfinkel, “Digital forensic research: the next 10
years,” Digital Investigation, 7(1), 2010, S64-S73.
[2] A. Pal and N. Memon, “Aut omated reassembly of the file
fragmented images using greedy algorithms,” IEEE Trans.
Image Processing, 15(2), 2003, 385-393.
[3] M. Karresand and N. Shahmehri, “Reassembly of frag-
mented jpeg images containing restart markers,” in 2008
European conference on computer network defense.
[4] M. I. Cohen, “ Advanced carving techniques,” Digital
Investigation, 4(1-4), 2007, pp. 119-128
[5] The International Telegraph and Telephone Consultative
Carving Thumbnail/s and Embedded JPEG Files Using Image Pattern Matching
Copyright © 2013 SciRes. JSEA
66
Committee (CCITT) 1992 “Information technol-
ogy—digital compression and coding of continuous-tone
still images–requirements and guideline (ITU-T T.81),”
1992. Retrieved Sept. 5, 2012, from World Wide Web
Consortium (W3C):
http://www.w3.org/Graphics/JPEG/itu-t81.pdf
[6] K. M. Mohamad, and M. Mat Deris, “Visualization of
JPEG metadata,” in: Proceeding. of the 2009 first Inter-
national Visual Informatics Conference on Visual Infor-
matics.
[7] E. Hamilton, “JPEG file interchange file format version
1.02.” Retrieved Sept. 5, 2012 from JPEG Committee
Homepage: http://www.jpeg.org/public/jfif.pdf
[8] P. Alvarez, “ Using extended file information (exif) file
headers in digital evidence analysis,” International Jour-
nal of Digital Evidence. 2(3), 2004.
[9] H. Guo and M. Xu, “A method for recovering jpeg files
based on thumbnail,” in: Automation and Systems Engi-
neering (CASE) 2011 International Conference.1-4.
[10] Thumbnail. Retrieved Sept 5, 2012, from Wikipedia:
http://en.wikipedia.org/wiki/Thumbnail.
[11] A. Merola, “Data carving concepts,” 2008. Retrieved
Sept. 5, 2012,
[12] from SANS Institute:
http://www.sans.org/reading_room/whitepapers/forensics/
datacarving-concepts_32969Y.
[13] K. M. Mohamad, A. Patel and M. Mat Deris, “Carving
JPEG images and thumbnails using image pattern match-
ing,” in 2011 IEEE Symposium on Computers and Infor-
matics.
[14] K. M. Mohamad, A.Patel, T. Herawan and M. Mat Deris,
“myKarve: JPEG image and thumbnail carver,” Journal
of Digital Forensic Practice. 3, 2011, 74-97.
[15] K. Cohen, “Digital still camera forensics. Small Scale
Digital Device Forensics,” 1(1), 2007, 1-8.
[16] N.A. Abdullah, R. Ibrahim,and K.M. Mohamad,”Cluster
size determination using JPEG files,” in Proceedings of
the 12th international conference on Computational Sci-
ence and Its Applications.
[17] K.M. Mohamad, T.Herawan and M. Mat Deris,
“Dual-byte-marker algorithm for detecting JFIF
header.,”in Bandyopadyay, S. M., Adi, W., Kim, T. &
Xiao, Y. (eds.) Information Security and Assuranc,. 17-26,
2010,Springer,Heidelber.