A <i>j</i>-Lanes Tree Hashing Mode and <i>j</i>-Lanes SHA-256

doi:10.4236/jis.2013.41002

Paper Menu >>

Journal Menu >>

Journal of Information Security, 2013, 4, 7-11

http://dx.doi.org/10.4236/jis.2013.41002 Published Online January 2013 (http://www.scirp.org/journal/jis)

A j-Lanes Tree Hashing Mode and j-Lanes SHA-256

Shay Gueron1,2

1Department of Mathematics, University of Haifa, Haifa, Israel

2Intel Corporation, Israel Development Center, Haifa, Israel

Email: shay@math.haifa.ac.il

Received August 19, 2012; revised September 20, 2012; accepted October 12, 2012

ABSTRACT

j-lanes hashing is a tree mode that splits an input message to j slices, computes j independent digests of each slice, and

outputs the hash value of their concatenation. We demonstrate the performance advantage of j-lanes hashing on SIMD

architectures, by coding a 4-lanes-SHA-256 implementation and measuring its performance on the latest 3rd Generation

Intel® Core™. For messages whose lengths range from 2 KB to 132 KB, we show that the 4-lanes SHA-256 is between

1.5 to 1.97 times faster than the fastest publicly available implementation that we are aware of, and between ~2 to ~2.5

times faster than the OpenSSL 1.0.1c implementation. For long messages, there is no significant performance difference

between different choices of j. We show that the 4-lanes SHA-256 is faster than the two SHA3 finalists (BLAKE and

Keccak) that have a published tree mode implementation. Finally, we explain why j-lanes hashing will be faster on the

coming AVX2 architecture that facilitates using 256 bits registers. These results suggest that standardizing a tree mode

for hash functions (SHA-256 in particular) could be useful for performance hungry applications.

Keywords: Tree Mode Hashing; SHA-256; SHA3 Competition; SIMD Architecture; Advanced Vector Extensions

Architectures; AVX; AVX2

1. Introduction

The performance of hash functions plays an important

role in various situations (e.g., for SSL/TLS connections

that use HMAC for authenticated encryption). In par-

ticular, the performance of SHA-256 on high end proc-

essors is a performance baseline for the SHA3 competi-

tion [1].

Recently, [2] published a “Simultaneous Hashing” (S-

HASH) method, for using SIMD architectures to speed

up the computations of SHA-256 (and other hashes) over

multiple messages. In this paper, we apply this technique

to accelerate SHA-256 for a single message, using a tree

mode that we call j-lanes hashing. We show that the re-

sulting “j-lanes SHA-256” is significantly faster than the

standard (“linear” hereafter) SHA-256. This demon-

strates the performance benefits that applications could

gain if a tree mode for hash functions (in particular,

SHA-256 and SHA-512) is standardized. It is interesting

to compare our results to the two SHA3 finalists that

already have a j-lanes tree mode (j = 2) implementation

(see [3]): BLAKE and Keccak. We offer this comparison

in Section 3.

2. j-Lanes Hashing and the Special Case of

4-Lanes SHA-256

Tree hashing is a well known concept for efficient com-

putations of hash functions, and is an efficient way for

updating a hash value when only a portion of the mes-

sage is changed. Some relevant references are [4-8]. We

focus here on a specific tree construction, which is de-

fined in the following section.

2.1. j-Lanes Hashing

Definition 1 (message j-Slicing): given a message m, its

associated j-Sliced message is obtained by applying a

permutation to the bits of m, followed by slicing the re-

sulting message into consecutive disjoint slices. We de-

note this by m = permutation (m1ǁm2ǁm3ǁ…ǁmj) under

some agreed convention on how each slice is defined,

and what the permutation is (for simplicity we assume

that m has at least j bits, to avoid degenerate “null”

slices).

Definition 2 (j-lanes-hash): Let h = h (MESSAGE) be

a hash function. Its associated j-lanes-hash, is a hash

scheme that operates as follows:

1) j-Slicing the message to m = permutation (m1ǁm2ǁ

m3ǁ…ǁmj).

2) Computing t1 = h (m1), t2 = h (m2), …, tj = h (mj).

3) Computing t* = h (t1ǁt2ǁ…ǁtj).

4) Returning the digest t*.

(hereafter we call Step 3 the “Wrapping” step).

j-lanes-hash is a special form of a tree mode (not a bi-

S. GUERON

nary tree), where the number of nodes is j + 1 and the

height of the tree is 2. As a special case of a tree mode,

the security properties of this construction follow from

the more general theory on tree hashing (e.g., [5,6] dis-

cuss the security properties of a tree hash in the context

of indiﬀerentiability from an ideal hash function).

Note that the above definition is flexible enough to

cover several useful setups. One example is “interleav-

ing” segments of a given message (which we use here,

for directly taking advantage of SIMD architectures).

Another case is when the data is consumed from j loca-

tions (e.g., j pointers) of a message. This can occur, for

example, in an application that hashes a file system (or a

directory) where j is the number of files (and each file is

a node in the tree).

Hereafter, we assume that the processed messages are

sufficiently long to gain performance from the j-lanes

tree mode (and ignore trivially short message).

2.2. Applying j-Lanes Hashing to Derive a

4-Lanes SHA-256

We use SHA-256 as the underlying hash algorithm, and

generate a “j-lanes SHA-256”. Our motivation is to check

the potential performance advantage of the paralleliza-

tion supported by SIMD architectures (or multithreaded

implementations).

By splitting the message into j independent slices, the

hash computations are reduced to the problem of hashing

multiple independent messages, supplemented by the

fixed-cost Wrapping step. With this, we can apply the

techniques for utilizing SIMD architectures to hash mul-

tiple independent messages (of different lengths). These

techniques, and their resulting performance, are de-

scribed in detail in [2].

SHA-256 operates on 32-bit words. Therefore, on

processors that support the AVX (or SSE) architecture

that has 128-bit registers and the necessary integer in-

structions, a natural choice for j-lanes (SHA-256) hash-

ing is j = 4, with the obvious convention for slicing the

message: consecutive 128-bit chunks of the message are

treated as 4 consecutive 32-bit words, each one belongs

to a different slice. These 4 words can then be viewed as

4 “elements” of a single AVX register (xmm), and the

SHA-256 computations can be parallelized using the

SIMD architecture (see [2] for details). If the byte-length

of the message is divisible by 256, we set the slices to

have equal lengths. Otherwise, (at least) one of slice has

a different length. This situation requires different han-

dling in the last Update (with negligible performance

cost).

j-lanes hashing involves some overhead, and therefore,

the performance gains are expected to be (fully) mani-

fested only for sufficiently long messages.

To illustrate, we note that the performance of SHA-

256 is closely proportional to the number of invocations

of its compression function (“Update” hereafter). Con-

sider a message whose byte-length l is divisible by 256,

and write l = 256x for some integer x. Hashing (with

SHA-256) such a message requires 4x + 1 Updates,

where the last one due to the padding block. On the other

hand, 4-lanes SHA-256 for this message requires 4 (x + 1)

+ 3 Updates, accounting for 1 padding block for each

slice, and 3 Updates for the Wrapping step which re-

quires hashing of a 128 bytes message. Comparing the

linear (i.e., serial) SHA-256 to the 4-lanes SHA-256, we

see that the latter involves 6 additional Updates. How-

ever, from the total of 4x + 7 Updates, 4x can be paral-

lelized, in particular by using the AVX architecture. This

is the reason why the overall performance is expected to

improve.

3. Performance Studies

This section discusses some performance studies for of

j-lanes SHA-256. We first describe the measurement

methodology.

 Each measured function was isolated, run 25,000

times (warm-up), followed by 100,000 iterations that

were timed (using the RDTSC instruction) and aver-

aged.

 To minimize the effect of background tasks running

on the system, each experiment was repeated five

times, and the minimum result was recorded.

 All the runs were carried out on a system where the

Intel® Hyper-Threading Technology, the Intel® Turbo

Boost Technology, and the Enhanced Intel Speed-

step® Technology, were disabled.

 The runs were executed on the 3rd Generation Intel®

Core™ i7-3770 processor (previously known as “Ar-

chitecture Code name Ivy Bridge”).

 In all cases, the reported performance numbers ac-

count for the full computations (i.e., including the

padding and, when relevant. the final hashing of the j

digests).

 In our studies, we used two SHA-256 and three j-

lanes-SHA-256 (j = 4, 8, 16) implementations as fol-

lows:

 OpenSSL (1.0.1c) linear: standard hashing using

OpenSSL function.

 4-SMS linear: standard hashing using the n-SMS (n =

4) method (see [9,10]; we used here an improved ver-

sion of this implementation).

 j-lanes using OpenSSL: using OpenSSL’s (1.0.1c)

SHA-256 function to implement j-lanes-SHA-256.

 j-lanes using the n-SMS: using the n-SMS SHA-256

implementation ([9,10]) to implement j-lanes SHA-

256.

S. GUERON 9

 AVX j-lanes hashing (j-lanes hashing for short): an

optimized implementation of j-lanes SHA 256, using

the S-HASH implementation of [2], and the AVX ar-

chitecture.

The results are illustrated in Figures 1-3.

Figure 1 compares the different implementations for

an 8 KB message and j = 4. Without parallelizing the

hashing of the slices (as in j-lanes using OpenSSL and

j-lanes using the n-SMS), the 4-lanes SHA-256 is slower

than the linear implementation. This is due to the over-

heads of the j-lanes method. For example, OpenSSL

(1.0.1c) uses 129 Updates and performs at 12.87 Cy-

cles/Byte, while the 4-lanes SHA-256 implementation

that simply calls the OpenSSL functions, uses 135 Up-

dates, and performs at 13.57 Cycles/Bye. On the other

hand, the optimized (using AVX) 4-lanes SHA-256 im-

plementation is 2.45 times faster than OpenSSL.

Figure 2 illustrates the effect of the choice of j (= 4, 8,

16). Obviously, increasing j involves additional overhead

to the j-lanes hashing. For example, 16-lanes SHA-256

for an 8 KB message involves 153 Updates, and is there-

fore slower than the 4-lanes SHA-256 that uses only 135

Updates (see top panel). However, both 8-lanes and 16-

lanes SHA-256 are still significantly faster than the best

performing linear implementation. For long messages

(see bottom panel), the relative impact of the overheads

decreases, and we obtain roughly the same performance

for j = 4, 8, 16.

Figure 3 shows the performance advantage of the

4-lanes SHA-256 for messages of lengths varying from 2

KB to 128 KB: 4-lanes SHA-256 is between 1.55 to 2

times faster than the best serial implementation (and 1.97

- 2.53 times faster than OpenSSL 1.0.1c).

Figure 4 shows the performance advantage of the 4-

lanes SHA-256 for a long message of 1 MB, measured on

12.87

10.18

5.465.78 6.44

OpenSSL (linear)n-SMS (linear)j-lanes hashing

Cycles per byte

8,192 bytes message

j=4

j=8

j-16

12.87

10.18

13.57

10.81

OpenSSL

( linear)

n-SMS

(linear)

4-lane

using n-S

Cycles per byte

8,192 bytes message

5.46

4-lanes

hashing

4-lanes

using OpenSSL

Figure 1. Performance of different implementations of 4-

lanes SHA-256, compared to linear SHA-256, for a 8192-

byte message. Measurements taken on the 3rd Generation

Intel® Core™ Processor.

j=1 6

(a)

12.78

10.09

5.055.07 5.11

OpenSSL (linear)n-SMS (linear)4-lanes hashing

Cycles per byte

131,072 bytes message

j=4

j=8

j-16

j=1 6

(b)

Figure 2. Performance of j-lanes-SHA-256 for j = 4, 8, 16,

compared to linear SHA-256. The message length is 8192

bytes (top panel) and 131,072 bytes (bottom panel). Mea-

surements taken on the 3rd Generation Intel® Core™ Proc-

essor.

Figure 3. Performance of 4-lanes SHA-256, compared to

linear SHA-256, for different message lengths. Measure-

ments taken on the 3rd Generation Intel® Core™ Processor.

the two processors: the 2nd Generation and the 3rd Gene-

ration Intel® Core™ Processors. We can see that the 4-

lanes SHA-256 is 2.27x faster than the best linear im-

plementation of SHA-256, when measured on the 2nd

Generation Intel processor, and 2x faster when measured

S. GUERON

17.41

11.41

5.01

12.76

10.1

5.03

OpenSSL (linear)n-SMS (linear)4-lanes hashi

Cycles per byte

1MB message

12.87

10.18

5.46

7.55

5.93 5.82

11.03

7.04

n g

2nd

Generation

Intel

Processor

3rd

Generation

Intel

Processor

Cycles per byte

8192 bytes message

Figure 4. Performance of 4-lanes SHA-256, compared to

linear SHA-256, for a 1 MB message. Measurements taken

on the 2nd Generation and on the 3rd Generation Intel®

Core™ Processor.

on the 3rd Generation Intel processor.

A Comment on the SHA3 Finalists

We expect that the SHA3 finalists [1] could also gain

from using the j-lanes-hash, at least to some extent, and

the performance gains will further increase when the

AVX2 architecture becomes available. However, at this

point, it is hard to tell if these algorithms would outper-

form the j-lanes-SHA-256 and/or j-lanes-SHA-512, and

by what margin.

Since the two finalists BLAKE and Keccak already

have a tree mode implementation (j = 2 for BLAKE and

Keccak; see [3]), we show the performance comparisons

of SHA-256, BLAKE, and Keccak in linear and in j-

lanes mode in Figure 5. Although we tried a 4-lanes

BLAKE512, we could not get it to perform better than

the underlying hash function. This is due to the fact that

the optimized BLAKE512 implementation already ex-

ploits the parallelism offered by the AVX architecture.

As expected, the j-lanes (tree mode) implementation

improves the performance of all three algorithms. The

results show that the j-lanes SHA-256 implementation is

the fastest one of these three.

Recalling that SHA-256 (and SHA-512) is the per-

formance baseline for SHA3, we conclude (from the

currently available information) that considering the j-

lanes mode still does not offer a performance advantage

for SHA3 over SHA-256. This is consistent with the

findings of [6]: migration to a new SHA3 standard could

not be motivated by performance advantages on the high

end platforms1.

Figure 5. Performance of SHA-256, BLAKE256, and Kec-

cak, in “linear” mode and in tree mode (for a 8192-byte

message). BLAKE512 is brought here only for comparison

(4-lanes BLAKE512 is not shown because we could not gain

good performance from this implementation). Measure-

ments taken on the 3rd Generation Intel® Core™ Processor.

4. Conclusions

We demonstrated the performance gains of j-lanes hash-

ing, using SHA-256 as the underlying hash algorithm.

On the 3rd Generation Intel® Core™ Processor (with

AVX architecture) selecting j = 4, gives speedup factors

between 1.55x to 2x, compared to the best available im-

plementation (up to 2.53x when comparing to OpenSSL

1.0.1c). We focused on j = 4, as the natural choice for the

current AVX (and SSE) architectures. Interestingly, al-

though j = 4 yields the best results (the Wrapping over-

head is the smallest among the tested cases), we note that

the performance with all the studies choices j = 4, 8, 16

is roughly the same for long messages.

We also comment that with the near future AVX2 ar-

chitecture [11], a natural choice would be j = 8 for

SHA-1, SHA-256 and j = 4 for SHA-512, and the j-lanes

implementations will be significantly faster. Therefore, if

a j-lanes hashing mode is adopted, and the ecosystem

would prefer to support only a single value of j (to reduce

the interoperability complexities), it seems that selecting

j = 8 would be a good choice.

In general, the j-lanes-hash can be useful in other sce-

narios, and with different values of j. One example men-

tioned about is hashing a file system, where j is the

number of files (and each file is a node in the tree). Such

computations can be accelerated not only by using SIMD

architectures, but also by using the processing power of

multi-cores systems.

We conclude that the j-lanes-hash could alleviate com-

putational bottlenecks, and recommend that this mode (or

a general tree mode) should be standardized. To this end,

we comment that standardization of a j-lanes (or any tree)

mode should also properly define different initialization

1Since the paper was submitted, NIST has announced Keccak as the

winner of the SHA3 competition. The performance analysis (and com-

arison) that is provided here is still useful for assessing the implica-

tions of this selection (from the performance viewpoint). vectors (depending also on the value of j) in order to dis-

S. GUERON

tinguish the resulting digests from outputs of the linear

SHA-256 (analogously to the how a digest truncation

(e.g., SHA-224) is defined).

5. Acknowledgements

I thank Jean-Philippe Aumasson, Bart Preneel, and Jesse

Walker for helpful discussions.

REFERENCES

[1] National Institute of Standards and Technology, “Cryp-

tographic Hash Algorithm Competition,” 2005.

http://csrc.nist.gov/groups/ST/hash/sha-3/index.html

[2] S. Gueron and V. Krasnov, “Simultaneous Hashing of Mul-

tiple Messages”, Journal of Information Security, Vol. 3,

2012, pp. 319-325. doi:10.4236/jis.2012.34039

[3] Virtual Applications and Implementations Research Lab，

“SUPERCOP,” 2012.

http://bench.cr.yp.to/supercop.html

[4] G. Bertoni, J. Daemen, M. Peeters and G. Van Assche,

“Keccak Sponge Function Family Main Document,” 2009.

http://cuda-keccak.googlecode.com/svn/trunk/docs/Kecca

k-main-2.1.pdf

[5] G. Bertoni, J. Daemen, M. Peeters and G. Van Assche,

“Sufficient Conditions for Sound Tree and Sequential

Hashing Modes,” 2009. http://eprint.iacr.org/2009/210

[6] Y. Dodis, L. Reyzin, R. L. Rivest and E. Shen, “Indiffe-

rentiability of Permutation-Based Compression Functions

and Tree-Based Modes of Operation, with Applications to

MD6,” Proceedings of FSE 2009, Lecture Notes in Com-

puter Science, Leuven, 22-25 February 2009, pp. 104-

121.

[7] R. C. Merkle, “A Certified Digital Signature,” Advances

in Cryptology—Proceedings of CRYPTO’89, Lecture

Notes in Computer Science, Santa Barbara, 20-24 August

1989, pp. 218-238.

[8] P. Sarkar and P. J. Schellenberg, “A Parallelizable Design

Principle for Cryptographic Hash Functions,” 2002.

http://eprint.iacr.org/2002/031

[9] S. Gueron and V. Krasnov, “Parallelizing Message Sche-

dules to Accelerate the Computations of Hash Functions,”

Journal of Cryptographic Engineering, Vol. 4, No. 4,

2012, pp. 241-253.

[10] S. Gueron and V. Krasnov, “[PATCH] Efficient Imple-

mentations of SHA256 and SHA512, Using the Simul-

taneous Message Scheduling method,” 2012.

http://rt.openssl.org/Ticket/Display.html?id=2784&user=

guest&pass=guest

[11] M. Buxton, “Haswell New Instruction Descriptions Now

Available!” 2011.

http://software.intel.com/en-us/blogs/2011/06/13/haswell-

new-instruction-descriptions-now-available/