S. GUERON

8

nary tree), where the number of nodes is j + 1 and the

height of the tree is 2. As a special case of a tree mode,

the security properties of this construction follow from

the more general theory on tree hashing (e.g., [5,6] dis-

cuss the security properties of a tree hash in the context

of indiﬀerentiability from an ideal hash function).

Note that the above definition is flexible enough to

cover several useful setups. One example is “interleav-

ing” segments of a given message (which we use here,

for directly taking advantage of SIMD architectures).

Another case is when the data is consumed from j loca-

tions (e.g., j pointers) of a message. This can occur, for

example, in an application that hashes a file system (or a

directory) where j is the number of files (and each file is

a node in the tree).

Hereafter, we assume that the processed messages are

sufficiently long to gain performance from the j-lanes

tree mode (and ignore trivially short message).

2.2. Applying j-Lanes Hashing to Derive a

4-Lanes SHA-256

We use SHA-256 as the underlying hash algorithm, and

generate a “j-lanes SHA-256”. Our motivation is to check

the potential performance advantage of the paralleliza-

tion supported by SIMD architectures (or multithreaded

implementations).

By splitting the message into j independent slices, the

hash computations are reduced to the problem of hashing

multiple independent messages, supplemented by the

fixed-cost Wrapping step. With this, we can apply the

techniques for utilizing SIMD architectures to hash mul-

tiple independent messages (of different lengths). These

techniques, and their resulting performance, are de-

scribed in detail in [2].

SHA-256 operates on 32-bit words. Therefore, on

processors that support the AVX (or SSE) architecture

that has 128-bit registers and the necessary integer in-

structions, a natural choice for j-lanes (SHA-256) hash-

ing is j = 4, with the obvious convention for slicing the

message: consecutive 128-bit chunks of the message are

treated as 4 consecutive 32-bit words, each one belongs

to a different slice. These 4 words can then be viewed as

4 “elements” of a single AVX register (xmm), and the

SHA-256 computations can be parallelized using the

SIMD architecture (see [2] for details). If the byte-length

of the message is divisible by 256, we set the slices to

have equal lengths. Otherwise, (at least) one of slice has

a different length. This situation requires different han-

dling in the last Update (with negligible performance

cost).

j-lanes hashing involves some overhead, and therefore,

the performance gains are expected to be (fully) mani-

fested only for sufficiently long messages.

To illustrate, we note that the performance of SHA-

256 is closely proportional to the number of invocations

of its compression function (“Update” hereafter). Con-

sider a message whose byte-length l is divisible by 256,

and write l = 256x for some integer x. Hashing (with

SHA-256) such a message requires 4x + 1 Updates,

where the last one due to the padding block. On the other

hand, 4-lanes SHA-256 for this message requires 4 (x + 1)

+ 3 Updates, accounting for 1 padding block for each

slice, and 3 Updates for the Wrapping step which re-

quires hashing of a 128 bytes message. Comparing the

linear (i.e., serial) SHA-256 to the 4-lanes SHA-256, we

see that the latter involves 6 additional Updates. How-

ever, from the total of 4x + 7 Updates, 4x can be paral-

lelized, in particular by using the AVX architecture. This

is the reason why the overall performance is expected to

improve.

3. Performance Studies

This section discusses some performance studies for of

j-lanes SHA-256. We first describe the measurement

methodology.

Each measured function was isolated, run 25,000

times (warm-up), followed by 100,000 iterations that

were timed (using the RDTSC instruction) and aver-

aged.

To minimize the effect of background tasks running

on the system, each experiment was repeated five

times, and the minimum result was recorded.

All the runs were carried out on a system where the

Intel® Hyper-Threading Technology, the Intel® Turbo

Boost Technology, and the Enhanced Intel Speed-

step® Technology, were disabled.

The runs were executed on the 3rd Generation Intel®

Core™ i7-3770 processor (previously known as “Ar-

chitecture Code name Ivy Bridge”).

In all cases, the reported performance numbers ac-

count for the full computations (i.e., including the

padding and, when relevant. the final hashing of the j

digests).

In our studies, we used two SHA-256 and three j-

lanes-SHA-256 (j = 4, 8, 16) implementations as fol-

lows:

OpenSSL (1.0.1c) linear: standard hashing using

OpenSSL function.

4-SMS linear: standard hashing using the n-SMS (n =

4) method (see [9,10]; we used here an improved ver-

sion of this implementation).

j-lanes using OpenSSL: using OpenSSL’s (1.0.1c)

SHA-256 function to implement j-lanes-SHA-256.

j-lanes using the n-SMS: using the n-SMS SHA-256

implementation ([9,10]) to implement j-lanes SHA-

256.

Copyright © 2013 SciRes. JIS