
S. GUERON
8
nary tree), where the number of nodes is j + 1 and the
height of the tree is 2. As a special case of a tree mode,
the security properties of this construction follow from
the more general theory on tree hashing (e.g., [5,6] dis-
cuss the security properties of a tree hash in the context
of indifferentiability from an ideal hash function).
Note that the above definition is flexible enough to
cover several useful setups. One example is “interleav-
ing” segments of a given message (which we use here,
for directly taking advantage of SIMD architectures).
Another case is when the data is consumed from j loca-
tions (e.g., j pointers) of a message. This can occur, for
example, in an application that hashes a file system (or a
directory) where j is the number of files (and each file is
a node in the tree).
Hereafter, we assume that the processed messages are
sufficiently long to gain performance from the j-lanes
tree mode (and ignore trivially short message).
2.2. Applying j-Lanes Hashing to Derive a
4-Lanes SHA-256
We use SHA-256 as the underlying hash algorithm, and
generate a “j-lanes SHA-256”. Our motivation is to check
the potential performance advantage of the paralleliza-
tion supported by SIMD architectures (or multithreaded
implementations).
By splitting the message into j independent slices, the
hash computations are reduced to the problem of hashing
multiple independent messages, supplemented by the
fixed-cost Wrapping step. With this, we can apply the
techniques for utilizing SIMD architectures to hash mul-
tiple independent messages (of different lengths). These
techniques, and their resulting performance, are de-
scribed in detail in [2].
SHA-256 operates on 32-bit words. Therefore, on
processors that support the AVX (or SSE) architecture
that has 128-bit registers and the necessary integer in-
structions, a natural choice for j-lanes (SHA-256) hash-
ing is j = 4, with the obvious convention for slicing the
message: consecutive 128-bit chunks of the message are
treated as 4 consecutive 32-bit words, each one belongs
to a different slice. These 4 words can then be viewed as
4 “elements” of a single AVX register (xmm), and the
SHA-256 computations can be parallelized using the
SIMD architecture (see [2] for details). If the byte-length
of the message is divisible by 256, we set the slices to
have equal lengths. Otherwise, (at least) one of slice has
a different length. This situation requires different han-
dling in the last Update (with negligible performance
cost).
j-lanes hashing involves some overhead, and therefore,
the performance gains are expected to be (fully) mani-
fested only for sufficiently long messages.
To illustrate, we note that the performance of SHA-
256 is closely proportional to the number of invocations
of its compression function (“Update” hereafter). Con-
sider a message whose byte-length l is divisible by 256,
and write l = 256x for some integer x. Hashing (with
SHA-256) such a message requires 4x + 1 Updates,
where the last one due to the padding block. On the other
hand, 4-lanes SHA-256 for this message requires 4 (x + 1)
+ 3 Updates, accounting for 1 padding block for each
slice, and 3 Updates for the Wrapping step which re-
quires hashing of a 128 bytes message. Comparing the
linear (i.e., serial) SHA-256 to the 4-lanes SHA-256, we
see that the latter involves 6 additional Updates. How-
ever, from the total of 4x + 7 Updates, 4x can be paral-
lelized, in particular by using the AVX architecture. This
is the reason why the overall performance is expected to
improve.
3. Performance Studies
This section discusses some performance studies for of
j-lanes SHA-256. We first describe the measurement
methodology.
Each measured function was isolated, run 25,000
times (warm-up), followed by 100,000 iterations that
were timed (using the RDTSC instruction) and aver-
aged.
To minimize the effect of background tasks running
on the system, each experiment was repeated five
times, and the minimum result was recorded.
All the runs were carried out on a system where the
Intel® Hyper-Threading Technology, the Intel® Turbo
Boost Technology, and the Enhanced Intel Speed-
step® Technology, were disabled.
The runs were executed on the 3rd Generation Intel®
Core™ i7-3770 processor (previously known as “Ar-
chitecture Code name Ivy Bridge”).
In all cases, the reported performance numbers ac-
count for the full computations (i.e., including the
padding and, when relevant. the final hashing of the j
digests).
In our studies, we used two SHA-256 and three j-
lanes-SHA-256 (j = 4, 8, 16) implementations as fol-
lows:
OpenSSL (1.0.1c) linear: standard hashing using
OpenSSL function.
4-SMS linear: standard hashing using the n-SMS (n =
4) method (see [9,10]; we used here an improved ver-
sion of this implementation).
j-lanes using OpenSSL: using OpenSSL’s (1.0.1c)
SHA-256 function to implement j-lanes-SHA-256.
j-lanes using the n-SMS: using the n-SMS SHA-256
implementation ([9,10]) to implement j-lanes SHA-
256.
Copyright © 2013 SciRes. JIS