Applied Mathematics, 2013, 4, 177-182
http://dx.doi.org/10.4236/am.2013.41A028 Published Online January 2013 (http://www.scirp.org/journal/am)
Parallel Implementation of the Gauss-Seidel Algorithm on
k-Ary n-Cube Machine
Mohammad H. Al-Towaiq
Department of Mathematics and Statistics, Jordan University of Science and Technology, Irbid, Jordan
Email: towaiq@just.edu.jo
Received August 5, 2012; revised September 7, 2012; accepted September 14, 2012
ABSTRACT
In this paper, we present a parallel implementation of the Gauss-Seidel (GS) iterative algorithm for the solution of linear
systems of equations on a k-ary n-cube parallel machine using Open MPI as a parallel programming environment. The
proposed algorithm is of O(N^3/k^n) computational complexity and uses O(nN) communication time to decompose a
matrix of order N on the k-ary n-cube, provided N ≥ k^(n-1). The incurred communication time is better than the best known
results for the hypercube, O(N log n!), and the mesh, O(N n!), each with approximately n! nodes. The numerical results
show that the speedup improves as the number of processors increases and that the proposed algorithm has approximately
80% parallel efficiency.
Keywords: Cluster Computing; Parallel Computing; Linear Systems of Equations; Direct Methods; Iterative Methods
1. Introduction
In linear algebra the solution of the system of linear
equations
Ax = b   (1)
where A is an n by n dense matrix, b is a known n-vector
and x is an n-vector to be determined, is probably the most
important class of problems. It is needed in many areas
of science and engineering which require great computa-
tional speed and huge memory.
There are two classes of methods for solving linear
systems of equations: direct and iterative methods. In a
direct method, a fixed number of operations is carried
out once, at the end of which the solution is produced.
Gauss elimination and related strategies applied to a linear
system are an example of such methods. Direct methods are
the primary choice for solving linear systems, but those methods
are often too expensive in either computation time or
computer memory requirements, or possibly both. As an
alternative, such linear systems are usually solved with
iterative methods. A method is called iterative when it
consists of a basic series of operations which are carried
out over and over again, until the answer is produced, or
some exceptional error occurs, or the limit on the number
of steps is exceeded [1].
There are two different aspects of the iterative solution
of linear systems of Equation (1). The first one is the
particular acceleration technique for a sequence of itera-
tion vectors, that is the technique used to construct a new
approximation for the solution x, with information from
previous approximation. The second aspect is the trans-
formation of a given system to one that can be more effi-
ciently solved by a particular iteration method (precondi-
tioning). For the early parallel computers in existence in
the eighties, it was observed that the single iteration steps
of most iteration methods offered too little opportunity
for effective parallelism, in comparison with, for instance,
direct methods for dense matrices. In particular, the few
inner products that are required per iteration by many itera-
tion methods were identified as obstacles because of
communication. This led to attempts to combine it-
eration steps, or to combine the message passing for dif-
ferent inner products.
Recently, many parallel implementations have appeared, and com-
puting architectures have become increasingly parallel,
in an attempt to overcome this limitation [2-9]. Some imple-
mentations have been developed for regular problems
such as the Laplace equation [10,11], circuit simulation prob-
lems [12], the power load-flow problem, where an alter-
nating sequential/parallel multiprocessor is used [13],
and for many applications of inter-dependent constraints
or as a relaxation step in multi-grid methods [2]. Also,
[14] presented several parallelization strategies for the
dense Gauss-Seidel method. These strategies are com-
pared and evaluated through performance measurements
on a large range of hardware architectures. They found
that these new architectures do not offer the same trade-
off in terms of computation power versus communication
and synchronization overheads, as traditional high-per-
formance platforms.
In 1999, Feng Wang and Jinchao Xu [15] presented a
specific technique for solving convection-dominated prob-
lems. Their algorithm uses crosswind thin blocks in a
block Gauss-Seidel method. Their method is based on a
special partitioning technique for a block iterative method
for solving the linear system derived from a monotone
discretization scheme for convection-diffusion problems.
They conclude that crosswind grouping is essential for
the rapid convergence of the method.
In 2005, Jens Grabel et al. [16] presented two simple
techniques for improving the performance of the parallel
Gauss-Seidel method for 3D Poisson equation by opti-
mizing cache usage as well as reducing the number of
communication steps.
In 2006, O. Nobuhiko et al. [17] presented a novel par-
allel algorithm for the block Gauss-Seidel method. The algo-
rithm is devised by focusing on Reitzinger's coarsening
scheme for the linear systems derived from finite
element discretizations with first-order tetrahedral ele-
ments.
Most of the results show that the time spent on
communication between processors limits the parallel
computation speed. Motivated by this fact, and to overcome
this problem, we use the k-ary n-cube machine in order to
change the interconnection network topology of the parallel
computer, and we develop a cluster-based Gauss-Seidel
algorithm which is suitable for parallel computing. A
generic approach for the method is developed and
implemented. Execution time prediction models are
also presented and verified.
The rest of the paper is structured as follows: in Sec-
tion 2 we introduce and analyze the Gauss-Seidel se-
quential algorithm. We describe the proposed parallel al-
gorithms in Section 3. We evaluate the performance of
the parallel algorithm in Section 4. We conclude the pa-
per in Section 5.
2. Gauss-Seidel Sequential Algorithm
To evaluate the performance of the parallel Gauss-Seidel
(GS) algorithm, one must first look at the sequential per-
formance. GS is an improvement of the Jacobi algorithm.
That is, GS corrects the i-th component of the vector x^(k)
in the order i = 1, 2, ···, n. However, the approximate
solution is updated immediately after each new component
is determined. The newly computed components x_i^(k+1), i
= 1, 2, ···, n, can be changed within a working vector
which is redefined at each relaxation step, and this results
in the following iterative formula

x_i^(k+1) = ( b_i − Σ_{j=1}^{i−1} a_ij x_j^(k+1) − Σ_{j=i+1}^{n} a_ij x_j^(k) ) / a_ii,   a_ii ≠ 0.   (2)
In matrix notation, Equation (2) becomes

(D + L) x^(k+1) = −U x^(k) + b,   k ≥ 0,   (3)

where L, D, and U are the strictly lower-triangular, diagonal, and strictly up-
per-triangular parts of the matrix A, respectively.
Figure 1 presents the sequential GS algorithm.
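For concreteness, the following is a minimal Python/NumPy sketch of the sequential GS iteration of Figure 1; the function name gauss_seidel, the default tolerance, and the use of the infinity norm in the stopping test are our own illustrative choices rather than details taken from the paper's implementation.

import numpy as np

def gauss_seidel(A, b, x0, tolerance=1e-8, k_max=500):
    # Sequential Gauss-Seidel sweep following Equation (2) / Figure 1.
    n = len(b)
    x = np.array(x0, dtype=float)
    for k in range(k_max):
        x_old = x.copy()
        for i in range(n):
            # components j < i are already updated (x); components j > i are old (x_old)
            s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x_old[i + 1:]
            x[i] = (b[i] - s) / A[i, i]
        if np.linalg.norm(x - x_old, np.inf) < tolerance:
            return x, k + 1   # converged after k+1 sweeps
    return x, k_max           # hit the iteration limit

# Small strictly diagonally dominant test system
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
b = np.array([5.0, 6.0, 5.0])
print(gauss_seidel(A, b, np.zeros(3)))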
Performance Analysis
In this section we evaluate the behavior of the sequential
GS algorithm. It is well known that the algorithm
always converges if the matrix A is strictly or irreducibly
diagonally dominant. So, we used different problems
with different matrix sizes n = 32, 64, 128, 256, ···, 16368.
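As a side note, strict diagonal dominance (the sufficient condition for convergence mentioned above) is easy to check programmatically; a minimal sketch, where the helper name is_strictly_diagonally_dominant is our own:

import numpy as np

def is_strictly_diagonally_dominant(A):
    # |a_ii| > sum of |a_ij| over j != i, for every row i
    diag = np.abs(np.diag(A))
    off = np.sum(np.abs(A), axis=1) - diag
    return bool(np.all(diag > off))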
Figure 2 shows the relationship between different
sizes of matrix A and the real execution time (in seconds)
needed for the solution of the problem. The figure indi-
cates that the convergence of the GS algorithm depends
on the problem size, and the number of iterations in-
creases as the problem size becomes large (each iteration
is of order N^2, i.e., O(N^2), so k iterations require O(kN^2)
operations). Hence, the algorithm is not scalable.
3. Parallel Implementation of the Proposed
Algorithm
In this section we propose a parallel algorithm for Gauss-
Seidel on the k-ary n-cube machine, which is well-suited for
cluster computing. First, we give an overview of parallel com-
puting and present the data distribution methodology; then
we present the proposed parallel algorithm.
3.1. Overview
The computing tasks from the scientific and engineering
fields are more and more complex and require huge
memory and great computational speed. One way of in-
creasing the computational speed is by designing par-
allel computer systems to match this need.
Sequential_GS
  Input A, b, x^(0), tolerance
  for k = 0 to k_max do the following:
    for i = 1, ···, n
      sum = 0
      for j = 1, 2, ···, i − 1
        sum = sum + a_ij x_j^(k+1)
      end j
      for j = i + 1, ···, n
        sum = sum + a_ij x_j^(k)
      end j
      x_i^(k+1) = (b_i − sum) / a_ii
    end i
    if ||x^(k+1) − x^(k)|| < tolerance, then output the solution, stop
  end k
end Sequential_GS
Figure 1. Sequential Gauss-Seidel algorithm.
Figure 2. The relationship between the problem size and the real execution time (in seconds) needed for the GS solver.
Here, the single problem is split into small pieces, each processor
operates on one or more pieces, and the whole problem
can be computed more quickly than on a single processor.
Different processors communicate with each other in
most cases, so data and message passing are not avoidable.
This data and message passing is the most important
factor that limits the speed of parallel computers.
Recent development in microprocessor technology has
focused on simultaneous multi-threading [18]. In the
algorithm, a basic operation is recursively applied to
a sequence of array data structures. Gauss-Seidel has the
same type of internal data dependencies, and it is tradi-
tionally parallelized using multicolor orderings of the
grid points. Unfortunately, the algorithms arising from
these orderings are difficult to parallelize if cache-block-
ing techniques are added to increase data reuse. As an
alternative, we propose to use a much more synchroniza-
tion-intensive data flow parallelization technique for the
standard, natural ordering of the unknowns. This type of
scheme enables us to exploit recently cached data in a
better way than the standard multicolor technique.
3.2. Mapping the Matrix Elements onto the
Processors
Solving a linear system Ax = b requires distributing the
n-by-n matrix A over a set of processors.
The effective mapping of matrix elements to proces-
sors is the key factor to efficient implementation of the
algorithm in parallel. In this paper, we employ the gen-
eral distribution functions suggested by Al-Towaiq [19]
because they embrace a wide range of matrix distribution
functions, namely column/row blocked, column/row cy-
clic, column/row block cyclic, and column and row (or
2-D) block cyclic layouts [20-22]. The main issues in
choosing a mapping are load balancing and commu-
nication time. It was confirmed in previous studies [23]
that a cyclic layout offers reasonable performance and load
balancing, but not the best. A better layout is 2-D block
cyclic; the 2-D partitioning induces a smaller amount of com-
munication than the 1-D partitioning [17].
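To make the 2-D block-cyclic layout concrete, the following small Python sketch maps a matrix entry (i, j) to its owning processor on a Pr-by-Pc process grid with block size nb; the function name owner_2d_block_cyclic and the chosen parameters are ours, for illustration only, and not part of the distribution functions of [19].

def owner_2d_block_cyclic(i, j, nb, Pr, Pc):
    # Block-cyclic ownership: the block-row index is cycled over the Pr
    # process rows, the block-column index over the Pc process columns.
    p_row = (i // nb) % Pr
    p_col = (j // nb) % Pc
    return p_row, p_col

# e.g. entry (5, 9) with 2x2 blocks on a 2x2 process grid
print(owner_2d_block_cyclic(5, 9, nb=2, Pr=2, Pc=2))   # -> (0, 0)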
3.3. Parallel Implementation of Gauss-Seidel
Algorithm
In this section, we implement and analyze the perform-
ance of the proposed algorithm, via simulation, on the
k-ary n-cube Q_n^k machine.
Definition [24]: The k-ary n-cube Q_n^k has N = k^n
nodes, each of the form X = x_{n-1} x_{n-2} ··· x_0, where 0 ≤ x_i
< k for all 0 ≤ i < n. Two nodes X = x_{n-1} ··· x_0 and
Y = y_{n-1} ··· y_0 in Q_n^k are connected if and only if
there exists i, 0 ≤ i < n, such that x_i = y_i ± 1 (mod k) and x_j
= y_j for j ≠ i. It is shown in [25] that Q_n^k has degree 2n
and diameter n⌊k/2⌋.
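For illustration, a short Python sketch of this definition, listing the 2n neighbors of a node in Q_n^k (the function name neighbors_kary_ncube is our own, and the sketch assumes k > 2 so that the two neighbors along each dimension are distinct):

def neighbors_kary_ncube(x, k):
    # Return the neighbors of node x = (x_{n-1}, ..., x_0) in the k-ary n-cube:
    # change exactly one digit by +1 or -1 modulo k, keep the other digits equal.
    n = len(x)
    nbrs = []
    for i in range(n):
        for delta in (+1, -1):
            y = list(x)
            y[i] = (y[i] + delta) % k
            nbrs.append(tuple(y))
    return nbrs

# e.g. node (0, 0) in the 3-ary 2-cube Q_2^3 has degree 2n = 4
print(neighbors_kary_ncube((0, 0), k=3))   # [(1, 0), (2, 0), (0, 1), (0, 2)]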
Figure 3 illustrates an example of Q_2^3 network grouping.
The data dependence is apparent in the sequential algorithm, which shows that
each processor requires information from each previous
node after its computational loop is complete. That is, the
first node must complete its computation before the second
node can begin, and similarly for each successive node
(called pipelining). Our technique does not have this
problem, since the algorithm requires message passing
only between neighboring nodes.
The proposed algorithm uses an array of size q by N,
where q is the number of groups and N is the problem
size. These groups are chosen to reduce the communication
required by the computation operations. Each group par-
titions its nodes into interior nodes and boundary nodes.
The interior nodes can operate on the GS iteration loop
without communication. Only the boundary nodes require
message passing for the updated values x^(s+1) at each
step s of the GS iteration, as follows:
Figure 3. The OTIS-Q_2^3 network.
For i = 2 to q do the following
  - Group i sends x^(s+1) back to group i − 1,
    receives x^(s) from group i + 1, then
    receives x_i^(s+1) from group i − 1
  - Call the GS subroutine
  - Send x^(s+1) to group i + 1
Continue
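A hedged sketch of this boundary exchange using mpi4py (which runs on top of Open MPI) is given below; here each MPI rank plays the role of one group, and the function name exchange_boundaries, the tags, and the use of blocking Send/Recv in rank order are our own simplifications of the scheme above, not the paper's actual code.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def exchange_boundaries(my_left, my_right):
    # my_left / my_right hold this group's own updated boundary values x^(s+1);
    # the function returns the ghost values received from groups rank-1 and rank+1.
    ghost_left = np.empty_like(my_left)
    ghost_right = np.empty_like(my_right)
    if rank > 0:                          # exchange with group i - 1
        comm.Send(my_left, dest=rank - 1, tag=0)
        comm.Recv(ghost_left, source=rank - 1, tag=1)
    if rank < size - 1:                   # exchange with group i + 1
        comm.Recv(ghost_right, source=rank + 1, tag=0)
        comm.Send(my_right, dest=rank + 1, tag=1)
    return ghost_left, ghost_right

The exchange proceeds group by group along the chain, so only neighboring groups ever communicate, mirroring the interior/boundary split described above.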
The execution time consists of two parts: the computa-
tion time and the communication time. We utilize the
following standard formula to estimate
the communication time:

Communication time = λ + αβ,

where λ is the message latency, which equals 32 × 10^(-6)
sec, α is the per-value message transmission time, which equals 3 ×
10^(-3) sec, and β is the number of values transmitted be-
tween the boundary nodes of the groups. Using this
communication paradigm on the Q_n^k machine, significant improve-
ments in the performance of the algorithm were observed
compared to more traditional communication paradigms
that use the standard send and receive functions in con-
junction with packing data into communication buffers.
The GS process requires O(N^3/k^n) computational com-
plexity.
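As an illustration of how the computation and communication terms combine into an execution-time prediction, consider the small Python sketch below; the per-flop time t_flop, the helper name predicted_time, and the assumption that the (λ + αβ) exchange is incurred on the order of nN times are our own modeling assumptions, not measured constants from the paper.

def predicted_time(N, k, n, beta, t_flop=1e-8, lam=32e-6, alpha=3e-3):
    # Rough execution-time model: computation + communication.
    # Computation: O(N^3 / k^n) flops at t_flop seconds each.
    # Communication: one (lam + alpha * beta) exchange per unit of the O(nN) term.
    p = k ** n                       # number of processors
    t_comp = (N ** 3 / p) * t_flop
    t_comm = n * N * (lam + alpha * beta)
    return t_comp + t_comm

# e.g. N = 4096 on a 4-ary 2-cube (16 processors), beta = 8 boundary values
print(predicted_time(4096, k=4, n=2, beta=8))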
4. Experimental Results
In this section, we evaluate experimentally the perform-
ance of the proposed algorithm on a cluster of sixteen
Linux workstations; each of which has a single Pentium
IV with 128 MB of memory and 20 GB disk space.
These hosts are connected together using a 16-port Myrinet
switch providing full-duplex 2 + 2 Gbps data rate links.
The workstations use the Mandrake Linux 7.2 operating sys-
tem running kernel version 2.2.17-12mk. In the im-
plementation process we used cyclic distribution func-
tions. In the experiments we used matrix orders of 32,
64, ···, 16,368.
Figure 4 presents the real execution time (in seconds)
of the parallel GS algorithm for different problem sizes
run with 4, 8, and 16 processors. All the execution times
are normalized relative to the sequential GS. The experi-
mental results show that the proposed parallel algorithm
performs better for large problems.
Figure 5 presents the speedup curves for 4, 8, and 16 proces-
sors. The curves indicate that the speedup improves as
the number of processors increases.
Figure 6 shows the efficiency curves for different
problem sizes. The curves indicate that our proposed
algorithm is nearly perfectly parallel, giving approximately
80% efficiency.
Figure 4. Cluster Gauss-Seidel algorithm real execution
time.
Figure 5. The real speed up ratio for 4, 8, and 16 processors
of G-S algorithm.
Figure 6. The efficiency curves.
5. Conclusions
In this paper, we have designed and analyzed the com-
plexity of a parallel implementation of the Gauss-Seidel
algorithm for solving a system of linear equations on a
k-ary n-cube parallel machine. The proposed parallel algo-
rithm is of O(N^3/k^n) computational complexity and
uses O(nN) communication time to decompose a ma-
trix of order N on the k-ary n-cube, provided N ≥ k^(n-1).
The incurred communication time is better than the best
known results for the hypercube, O(N log n!), and the mesh,
O(N n!), each with approximately n! nodes. The parallel
algorithm takes advantage of the attractive topological
properties of the k-ary n-cube in order to reduce the in-
ter-node communication time involved during the vari-
ous tasks of the parallel computation.
The numerical results show the following interesting
observations:
1) A reduction in execution time is gained by the parallel algorithm.
2) Good processor load balancing.
3) Almost zero communication time is incurred on the inte-
rior nodes, while more communication is done at the bound-
ary nodes.
4) The speedup improves with 16 processors over
8 and 4 processors.
5) The parallel algorithm has approximately 80% par-
allel efficiency.
REFERENCES
[1] L. Adams and D. Xie, “New Parallel SOR Method by
Domain Partitioning,” SIAM Journal on Scientific Com-
puting, Vol. 20, No. 22, 1999, pp. 2261-2281.
[2] M. F. Adams. “A Distributed Memory Unstructured
Gauss-Seidel Algorithm for Multigrid Smoothers,” Pro-
ceedings of 2001 ACM/IEEE Conference on Supercom-
puting, Denver, 10-16 November 2001, p. 4.
[3] C. J. Hu, J. L. Zang, J. Wang, J. J. Li and L. Ding, “A
New Parallel Gauss-Seidel Method by Iterative Space
Alternate Tiling,” 16th International Conference on Par-
allel Architecture and Compilation Techniques, Brasov,
15-19 September 2007, p. 410.
[4] M. Murugan, S. Sridhar and S. Arvindam, "A Parallel
Implementation of the Gauss-Seidel Method on the
Flosolver," Technical Report, National Aeronautical La-
boratory, Bangalore, 24 July 2006.
[5] L. Olszewski. “A Timing Comparison of the Conjugate
Gradient and Gauss-Seidel Parallel Algorithms in a One-
Dimensional Flow Equation Using PVM,” Proceedings of
the 33rd Annual on Southeast Regional Conference,
Clemson, March 1995, pp. 205-212.
[6] U. Thongkrajay and T. Kulworawanichpong. “Conver-
gence Improvement of Gauss-Seidel Power Flow Solu-
tion Using Load Transfer Technique,” Proceedings of
Modelling, Identification, and Control, Innsbruck, 11-13
February 2008.
[7] D. Wallin, H. Lof, E. Hagersten and S. Holmgren, “Mul-
tigrid and Gauss-Seidel Smoothers Revisited: Paralleliza-
tion on Chip Multiprocessors,” Proceedings of ICS06
Conference, Cairns, 28-30 June 2006.
[8] T. Kim and C.-O. Lee. “A Parallel Gauss-Seidel Method
Using NR Data Flow Ordering,” Journal of Applied
Mathematics and Computation, Vol. 99, No. 2-3, 1999,
pp. 209-220. doi:10.1016/S0096-3003(98)00008-3
[9] M. Adams, M. Brezina, J. Hu and R. Tuminaro, "Parallel
Multigrid Smoothing: Polynomial versus Gauss-Seidel,"
Journal of Computational Physics, Vol. 188, No. 2, 2003,
pp. 593-610.
[10] G. Fox, M. Johnson, G. Lyzanga, S. Otto, J. Salmon and
D. Walker, “Solving Problems on Concurrent Proces-
sors," Prentice Hall, Upper Saddle River, 1988.
[11] G. Golub and J. M. Ortega, "Scientific Computing with an
Introduction to Parallel Computing,” Academic Press,
Boston, 1993.
[12] R. A. Saleh, K. A. Gallivan, M. Chang, I. N. Hajj, D.
Smart and T. N. Trick, "Parallel Circuit Simulation on
Supercomputers,” Proceedings of the IEEE, Vol. 77, No.
12, 1989, pp. 1915-1930. doi:10.1109/5.48832
[13] Y. Wallach, "Calculations and Programs for Power System
Networks," Prentice Hall, Upper Saddle River, 1986.
[14] H. Courtecuisse and J. Allard, “Parallel Dense Gauss-
Seidel Algorithm on Many-Core Processors,” High Per-
formance Computation Conference (HPCC), Seoul, 25-27
June 2009, pp. 139-147.
[15] F. Wang and J. Xu, “A Crosswind Block Iterative Method
for Convection-Dominated Problems,” SIAM Journal on
Scientific Computing, Vol. 21, No. 2, 1999, pp. 620-645.
doi:10.1137/S106482759631192X
[16] J. Grabel, B. Land and P. Uebertholz, “Performance Op-
timization for the Parallel Gauss-Seidel Smoother,” Pro-
ceedings in Applied Mathematics and Mechanics, Vol. 5,
No. 1, 2005, pp. 831-832.
[17] O. Nobuhiko, M. Takeshi, I. Takeshi and S. Masaaki, “A
Parallel Block Gauss-Seidel Smoother for Algebraic Mul-
tigrid Method in Edge-Element Analysis,” IEE Japan
Papers of Technical Meeting on Static Apparatus, Vol. 6,
No. 58-61, 63-75, 2006, pp. 55-60.
[18] P. Kongetira, K. Aingaran and K. Olukotun, "Niagara: A
32-Way Multithreaded SPARC Processor,” IEEE Micro,
Vol. 25, No. 2, 2005, pp. 21-29.
doi:10.1109/MM.2005.35
[19] M. Al-Towaiq, “Clustered Gauss-Huard Algorithm for
the Solution of Ax = b,” Journal of Applied Mathematics
and Computation, Vol. 184, No. 2, 2007, pp. 485-495.
[20] J. Demmel, "Applications of Parallel Computers," Course
Home Page, 1999. www.cs.berkeley.edu/~demmel/
[21] W. Lichtenstein and S. L. Johnsson, “Block Cyclic Dense
Linear Algebra," SIAM Journal on Scientific Computing,
Vol. 14, No. 6, 1993, pp. 1257-1286.
doi:10.1137/0914075
[22] M. Al-Towaiq, F. Masoud, A. B. Mnaour and K. Day,
“An Implementation of a Parallel Iterative Algorithm for
the Solution of Large Banded Systems on a Cluster of
Workstations,” International Journal of Modeling and
Simulation, Vol. 28, No. 4, 2008, pp. 378-386.
[23] M. Al-Towaiq and H. Al-Aamri, “A Parallel Implementa-
tion of GESPP on a Cluster of Silicon Graphics Worksta-
tions,” Proceedings of the 9th IEEE International Con-
ference on Parallel and Distributed Systems, Hsinchu,
17-20 December 2002, pp. 226-230.
[24] K. Day, “Optical Transpose k-Ary n-Cube Networks,”
Journal of Systems Architecture, Vol. 50, No. 11, 2004,
pp. 697-705. doi:10.1016/j.sysarc.2004.05.002
[25] K. Day and A. E. Al-Ayyoub, “Fault Diameter of k-Ary
n-Cube Networks,” IEEE Transactions on Parallel and
Distributed Systems, Vol. 8, No. 9, 1997, pp. 903-907.
doi:10.1109/71.615436