Applied Mathematics, 2013, 4, 177-182
http://dx.doi.org/10.4236/am.2013.41A028 Published Online January 2013 (http://www.scirp.org/journal/am)
Parallel Implementation of the Gauss-Seidel Algorithm on
k-Ary n-Cube Machine
Mohammad H. Al-Towaiq
Department of Mathematics and Statistics, Jordan University of Science and Technology, Irbid, Jordan
Email: towaiq@just.edu.jo
Received August 5, 2012; revised September 7, 2012; accepted September 14, 2012
ABSTRACT
In this paper, we present a parallel implementation of the Gauss-Seidel (GS) iterative algorithm for the solution of linear
systems of equations on a k-ary n-cube parallel machine using Open MPI as a parallel programming environment. The
proposed algorithm is of O(N^3/k^n) computational complexity and uses O(nN) communication time to decompose a
matrix of order N on the k-ary n-cube, provided N ≥ k^(n-1). The incurred communication time is better than the best known
results for the hypercube, O(N log n!), and the mesh, O(N n!), each with approximately n! nodes. The numerical results
show that the speedup improves as the number of processors increases and that the proposed algorithm has approximately
80% parallel efficiency.
Keywords: Cluster Computing; Parallel Computing; Linear Systems of Equations; Direct Methods; Iterative Methods
1. Introduction
In linear algebra the solution of the system of linear
equations
Ax = b   (1)
where A is an n by n dense matrix, b is a known n-vector
and x is an n-vector to be determined, is probably the most
important class of problems. It is needed in many areas
of science and engineering which require great computa-
tional speed and huge memory.
There are two classes of methods for solving linear
systems of equations: direct and iterative methods. In a
direct method, a fixed number of operations is carried
out once, at the end of which the solution is produced.
Gauss elimination and related strategies applied to a linear
system are an example of such methods. Direct methods are
the primary choice for solving linear systems, but those methods
are often too expensive in either computation time or
computer memory requirements, or possibly both. As an
alternative, such linear systems are usually solved with
iterative methods. A method is called iterative when it
consists of a basic series of operations which are carried
out over and over again, until the answer is produced, or
some exceptional error occurs, or the limit on the number
of steps is exceeded [1].
There are two different aspects of the iterative solution
of linear systems of Equation (1). The first one is the
particular acceleration technique for a sequence of itera-
tion vectors, that is the technique used to construct a new
approximation for the solution x, with information from
previous approximation. The second aspect is the trans-
formation of a given system to one that can be more effi-
ciently solved by a particular iteration method (precondi-
tioning). For the early parallel computers in existence in
the eighties, it was observed that the single iteration steps
of most iteration methods offered too little opportunity
for effective parallelism, in comparison with, for instance,
direct methods for dense matrices. In particular, the few
inner products that are required per iteration by many itera-
tion methods were identified as obstacles because of
communication. This led to attempts to combine it-
eration steps, or to combine the message passing for dif-
ferent inner products.
Recently, many parallel implementations have appeared, and com-
puting architectures have become increasingly parallel,
in an attempt to overcome this limitation [2-9]. Some imple-
mentations have been developed for regular problems
such as the Laplace equation [10,11], circuit simulation prob-
lems [12], the power load-flow problem, where an alter-
nating sequential/parallel multiprocessor is used [13],
and for many applications of inter-dependent constraints
or as a relaxation step in multi-grid methods [2]. Also,
[14] presented several parallelization strategies for the
dense Gauss-Seidel method. These strategies are com-
pared and evaluated through performance measurements
on a large range of hardware architectures. They found
that these new architectures do not offer the same trade-
off in terms of computation power versus communication
and synchronization overheads, as traditional high-per-
formance platforms.
In 1999, Feng Wang and Jinchao Xu [15] presented a
specific technique for solving convection-dominated prob-
lems. Their algorithm uses crosswind thin blocks in a
block Gauss-Seidel method. Their method is based on a
special partitioning technique for a block iterative method
for solving the linear system derived from a monotone
discretization scheme for convection-diffusion problems.
They conclude that crosswind grouping is essential for
the rapid convergence of the method.
In 2005, Jens Grabel et al. [16] presented two simple
techniques for improving the performance of the parallel
Gauss-Seidel method for 3D Poisson equation by opti-
mizing cache usage as well as reducing the number of
communication steps.
In 2006, O. Nobuhiko et al. [17] presented a novel par-
allel algorithm for the block Gauss-Seidel method. The algo-
rithm is devised by focusing on Reitzinger's coarsening
scheme for the linear systems derived from finite
element discretizations with first-order tetrahedral ele-
ments.
Most of the results show that the time spent on
communication between processors limits the parallel
computation speed. Motivated by this fact, and to overcome
this problem, we use the k-ary n-cube machine in order to
change the interconnection network topology of the parallel
computer, and we develop a cluster-based Gauss-Seidel
algorithm which is suitable for parallel computing. A
generic approach for the method is developed and
implemented. Execution time prediction models are
also presented and verified.
The rest of the paper is structured as follows: in Sec-
tion 2 we introduce and analyze the Gauss-Seidel se-
quential algorithm. We describe the proposed parallel al-
gorithms in Section 3. We evaluate the performance of
the parallel algorithm in Section 4. We conclude the pa-
per in Section 5.
2. Gauss-Seidel Sequential Algorithm
To evaluate the performance of the parallel Gauss-Seidel
(GS) algorithm, one must first look at the sequential per-
formance. GS is an improvement of the Jacobi algorithm.
That is, GS corrects the i-th component of the vector x^(k)
in the order i = 1, 2, ···, n. However, the approximate
solution is updated immediately after each new component
is determined. The newly computed components x_i^(k+1), i
= 1, 2, ···, n, can be changed within a working vector
which is redefined at each relaxation step, and this results
in the following iterative formula

x_i^(k+1) = ( b_i − Σ_{j=1}^{i−1} a_ij x_j^(k+1) − Σ_{j=i+1}^{n} a_ij x_j^(k) ) / a_ii,   a_ii ≠ 0.   (2)
In matrix notation, Equation (2) becomes

(D + L) x^(k+1) = −U x^(k) + b,   k ≥ 0,   (3)

where L, D, and U are the strictly lower-triangular, diagonal, and strictly up-
per-triangular parts of the matrix A, respectively.
Figure 1 presents the sequential GS algorithm.
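For concreteness, the following is a minimal Python/NumPy sketch of the sequential GS iteration of Figure 1; the function name gauss_seidel, the default tolerance, and the use of the infinity norm in the stopping test are our own illustrative choices rather than details taken from the paper's implementation.

import numpy as np

def gauss_seidel(A, b, x0, tolerance=1e-8, k_max=500):
    # Sequential Gauss-Seidel sweep following Equation (2) / Figure 1.
    n = len(b)
    x = np.array(x0, dtype=float)
    for k in range(k_max):
        x_old = x.copy()
        for i in range(n):
            # components j < i are already updated (x); components j > i are old (x_old)
            s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x_old[i + 1:]
            x[i] = (b[i] - s) / A[i, i]
        if np.linalg.norm(x - x_old, np.inf) < tolerance:
            return x, k + 1   # converged after k+1 sweeps
    return x, k_max           # hit the iteration limit

# Small strictly diagonally dominant test system
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
b = np.array([5.0, 6.0, 5.0])
print(gauss_seidel(A, b, np.zeros(3)))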
Performance Analysis
In this section we evaluate the behavior of the sequential
GS algorithm. It is well known that the algorithm
always converges if the matrix A is strictly or irreducibly
diagonally dominant. So, we used different problems
with different matrix sizes n = 32, 64, 128, 256, ···, 16368.
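As a side note, strict diagonal dominance (the sufficient condition for convergence mentioned above) is easy to check programmatically; a minimal sketch, where the helper name is_strictly_diagonally_dominant is our own:

import numpy as np

def is_strictly_diagonally_dominant(A):
    # |a_ii| > sum of |a_ij| over j != i, for every row i
    diag = np.abs(np.diag(A))
    off = np.sum(np.abs(A), axis=1) - diag
    return bool(np.all(diag > off))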
Figure 2 shows the relationship between different
sizes of matrix A and the real execution time (in seconds)
needed for the solution of the problem. The figure indi-
cates that the convergence of the GS algorithm depends
on the problem size, and the number of iterations in-
creases as the problem size becomes large (each iteration
is of order N^2, i.e., O(N^2), so k iterations require O(kN^2)
operations). Hence, the algorithm is not scalable.
3. Parallel Implementation of the Proposed
Algorithm
In this section we propose a parallel algorithm for Gauss-
Seidel on the k-ary n-cube machine, which is well-suited for
cluster computing. First, we give an overview of parallel com-
puting and present the data distribution methodology; then
we present the proposed parallel algorithm.
3.1. Overview
The computing tasks from the scientific and engineering
fields are more and more complex and require huge
memory and great computational speed. One way of in-
creasing the computational speed is by designing par-
allel computer systems to match this need.
Sequential_GS
  Input A, b, x^(0), tolerance
  for k = 0 to k_max do the following:
    for i = 1, ···, n
      sum = 0
      for j = 1, 2, ···, i − 1
        sum = sum + a_ij x_j^(k+1)
      end j
      for j = i + 1, ···, n
        sum = sum + a_ij x_j^(k)
      end j
      x_i^(k+1) = (b_i − sum) / a_ii
    end i
    if ||x^(k+1) − x^(k)|| < tolerance, then output the solution, stop
  end k
end Sequential_GS
Figure 1. Sequential Gauss-Seidel algorithm.
Figure 2. The relationship between the problem size and the real execution time (in seconds) needed for the GS solver.
Here, the single problem is split into small pieces, each processor
operates on one or more pieces, and the whole problem
can be computed more quickly than on a single processor.
Different processors communicate with each other in
most cases, so data and message passing are not avoidable.
This data and message passing is the most important
factor that limits the speed of parallel computers.
Recent development in microprocessor technology has
focused on simultaneous multi-threading [18]. In the
algorithm, a basic operation is recursively applied to
a sequence of array data structures. Gauss-Seidel has the
same type of internal data dependencies, and it is tradi-
tionally parallelized using multicolor orderings of the
grid points. Unfortunately, the algorithms arising from
these orderings are difficult to parallelize if cache-block-
ing techniques are added to increase data reuse. As an
alternative, we propose to use a much more synchroniza-
tion-intensive data flow parallelization technique for the
standard, natural ordering of the unknowns. This type of
scheme enables us to exploit recently cached data in a
better way than the standard multicolor technique.
3.2. Mapping the Matrix Elements onto the
Processors
Solving a linear system Ax = b requires distributing the
n-by-n matrix A over a set of processors.
The effective mapping of matrix elements to proces-
sors is the key factor to efficient implementation of the
algorithm in parallel. In this paper, we employ the gen-
eral distribution functions suggested by Al-Towaiq [19]
because they embrace a wide range of matrix distribution
functions, namely column/row blocked, column/row cy-
clic, column/row block cyclic, and column and row (or
2-D) block cyclic layouts [20-22]. The main issues in
choosing a mapping are load balancing and commu-
nication time. It was confirmed in previous studies [23]
that a cyclic layout offers reasonable performance and load
balancing, but not the best. A better layout is 2-D block
cyclic; the 2-D partitioning induces a smaller amount of com-
munication than the 1-D partitioning [17].
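To make the 2-D block-cyclic layout concrete, the following small Python sketch maps a matrix entry (i, j) to its owning processor on a Pr-by-Pc process grid with block size nb; the function name owner_2d_block_cyclic and the chosen parameters are ours, for illustration only, and not part of the distribution functions of [19].

def owner_2d_block_cyclic(i, j, nb, Pr, Pc):
    # Block-cyclic ownership: the block-row index is cycled over the Pr
    # process rows, the block-column index over the Pc process columns.
    p_row = (i // nb) % Pr
    p_col = (j // nb) % Pc
    return p_row, p_col

# e.g. entry (5, 9) with 2x2 blocks on a 2x2 process grid
print(owner_2d_block_cyclic(5, 9, nb=2, Pr=2, Pc=2))   # -> (0, 0)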
3.3. Parallel Implementation of Gauss-Seidel
Algorithm
In this section, we implement and analyze the perform-
ance of the proposed algorithm, via simulation, on the
k-ary n-cube Q_n^k machine.
Definition [24]: The k-ary n-cube Q_n^k has N = k^n
nodes, each of the form X = x_{n-1} x_{n-2} ··· x_0, where 0 ≤ x_i
< k for all 0 ≤ i < n. Two nodes X = x_{n-1} ··· x_0 and
Y = y_{n-1} ··· y_0 in Q_n^k are connected if and only if
there exists i, 0 ≤ i < n, such that x_i = y_i ± 1 (mod k) and x_j
= y_j for j ≠ i. It is shown in [25] that Q_n^k has degree 2n
and diameter n⌊k/2⌋.
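For illustration, a short Python sketch of this definition, listing the 2n neighbors of a node in Q_n^k (the function name neighbors_kary_ncube is our own, and the sketch assumes k > 2 so that the two neighbors along each dimension are distinct):

def neighbors_kary_ncube(x, k):
    # Return the neighbors of node x = (x_{n-1}, ..., x_0) in the k-ary n-cube:
    # change exactly one digit by +1 or -1 modulo k, keep the other digits equal.
    n = len(x)
    nbrs = []
    for i in range(n):
        for delta in (+1, -1):
            y = list(x)
            y[i] = (y[i] + delta) % k
            nbrs.append(tuple(y))
    return nbrs

# e.g. node (0, 0) in the 3-ary 2-cube Q_2^3 has degree 2n = 4
print(neighbors_kary_ncube((0, 0), k=3))   # [(1, 0), (2, 0), (0, 1), (0, 2)]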
Figure 3 illustrates an example of Q_2^3 network grouping.
The data dependence is apparent in the sequential algorithm, which shows that
each processor requires information from each previous
node after its computational loop is complete. That is, the
first node must complete its computation before the second
node can begin, and similarly for each successive node
(called pipelining). Our technique does not have this
problem, since the algorithm requires message passing
only between neighboring nodes.
The proposed algorithm uses an array of size q by N,
where q is the number of groups and N is the problem
size. These groups are chosen to reduce the communication
required by the computation operations. Each group par-
titions its nodes into interior nodes and boundary nodes.
The interior nodes can operate on the GS iteration loop
without communication. Only the boundary nodes require
message passing for the updated values x^(s+1) at each
step s of the GS iteration, as follows:
Figure 3. The OTIS-Q_2^3 network.
For i = 2 to q do the following
  - Group i sends x^(s+1) back to group i − 1,
    receives x^(s) from group i + 1, then
    receives x_i^(s+1) from group i − 1
  - Call the GS subroutine
  - Send x^(s+1) to group i + 1
Continue
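A hedged sketch of this boundary exchange using mpi4py (which runs on top of Open MPI) is given below; here each MPI rank plays the role of one group, and the function name exchange_boundaries, the tags, and the use of blocking Send/Recv in rank order are our own simplifications of the scheme above, not the paper's actual code.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def exchange_boundaries(my_left, my_right):
    # my_left / my_right hold this group's own updated boundary values x^(s+1);
    # the function returns the ghost values received from groups rank-1 and rank+1.
    ghost_left = np.empty_like(my_left)
    ghost_right = np.empty_like(my_right)
    if rank > 0:                          # exchange with group i - 1
        comm.Send(my_left, dest=rank - 1, tag=0)
        comm.Recv(ghost_left, source=rank - 1, tag=1)
    if rank < size - 1:                   # exchange with group i + 1
        comm.Recv(ghost_right, source=rank + 1, tag=0)
        comm.Send(my_right, dest=rank + 1, tag=1)
    return ghost_left, ghost_right

The exchange proceeds group by group along the chain, so only neighboring groups ever communicate, mirroring the interior/boundary split described above.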
The execution time consists of two parts: the computa-
tion time and the communication time. We utilize the
following standard formula to estimate
the communication time:

Communication time = λ + αβ,

where λ is the message latency, which equals 32 × 10^(-6)
sec, α is the per-value message transmission time, which equals 3 ×
10^(-3) sec, and β is the number of values transmitted be-
tween the boundary nodes of the groups. Using this
communication paradigm on the Q_n^k machine, significant improve-
ments in the performance of the algorithm were observed
compared to more traditional communication paradigms
that use the standard send and receive functions in con-
junction with packing data into communication buffers.
The GS process requires O(N^3/k^n) computational com-
plexity.
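As an illustration of how the computation and communication terms combine into an execution-time prediction, consider the small Python sketch below; the per-flop time t_flop, the helper name predicted_time, and the assumption that the (λ + αβ) exchange is incurred on the order of nN times are our own modeling assumptions, not measured constants from the paper.

def predicted_time(N, k, n, beta, t_flop=1e-8, lam=32e-6, alpha=3e-3):
    # Rough execution-time model: computation + communication.
    # Computation: O(N^3 / k^n) flops at t_flop seconds each.
    # Communication: one (lam + alpha * beta) exchange per unit of the O(nN) term.
    p = k ** n                       # number of processors
    t_comp = (N ** 3 / p) * t_flop
    t_comm = n * N * (lam + alpha * beta)
    return t_comp + t_comm

# e.g. N = 4096 on a 4-ary 2-cube (16 processors), beta = 8 boundary values
print(predicted_time(4096, k=4, n=2, beta=8))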
4. Experimental Results
In this section, we evaluate experimentally the perform-
ance of the proposed algorithm on a cluster of sixteen
Linux workstations; each of which has a single Pentium
IV with 128 MB of memory and 20 GB disk space.
These hosts are connected together using a 16-port Myrinet
switch providing full-duplex 2 + 2 Gbps data rate links.
The workstations use the Mandrake Linux 7.2 operating sys-
tem running kernel version 2.2.17-12mk. In the im-
plementation process we used cyclic distribution func-
tions. In the experiments we used matrix orders of 32,
64, ···, 16,368.
Figure 4 presents the real execution time (in seconds)
of the parallel GS algorithm for different problem sizes
run with 4, 8, and 16 processors. All the execution times
are normalized relative to the sequential GS. The experi-
mental results show that the proposed parallel algorithm
performs better for large problems.
Figure 5 presents the speedup curves for 4, 8, and 16 proces-
sors. The curves indicate that the speedup improves as
the number of processors increases.
Figure 6 shows the efficiency curves for different
problem sizes. The curves indicate that our proposed
algorithm is nearly perfectly parallel, giving approximately
80% efficiency.
Figure 4. Cluster Gauss-Seidel algorithm real execution
time.
Figure 5. The real speed up ratio for 4, 8, and 16 processors
of G-S algorithm.
Figure 6. The efficiency curves.
5. Conclusions
In this paper, we have designed and analyzed the com-
plexity of a parallel implementation of the Gauss-Seidel
algorithm for solving a system of linear equations on a
k-ary n-cube parallel machine. The proposed parallel algo-
rithm is of O(N^3/k^n) computational complexity and
uses O(nN) communication time to decompose a ma-
trix of order N on the k-ary n-cube, provided N ≥ k^(n-1).
The incurred communication time is better than the best
known results for the hypercube, O(N log n!), and the mesh,
O(N n!), each with approximately n! nodes. The parallel
algorithm takes advantage of the attractive topological
properties of the k-ary n-cube in order to reduce the in-
ter-node communication time involved during the vari-
ous tasks of the parallel computation.
The numerical results show the following interesting
observations:
1) A reduction in execution time is gained by the parallel algorithm.
2) Good processor load balancing.
3) Almost zero communication time is incurred on the inte-
rior nodes, while more communication is done at the bound-
ary nodes.
4) The speedup improves with 16 processors over
8 and 4 processors.
5) The parallel algorithm has approximately 80% par-
allel efficiency.
REFERENCES
[1] L. Adams and D. Xie, “New Parallel SOR Method by
Domain Partitioning,” SIAM Journal on Scientific Com-
puting, Vol. 20, No. 22, 1999, pp. 2261-2281.
[2] M. F. Adams. “A Distributed Memory Unstructured
Gauss-Seidel Algorithm for Multigrid Smoothers,” Pro-
ceedings of 2001 ACM/IEEE Conference on Supercom-
puting, Denver, 10-16 November 2001, p. 4.
[3] C. J. Hu, J. L. Zang, J. Wang, J. J. Li and L. Ding, “A
New Parallel Gauss-Seidel Method by Iterative Space
Alternate Tiling,” 16th International Conference on Par-
allel Architecture and Compilation Techniques, Brasov,
15-19 September 2007, p. 410.
[4] M. Murugan, S. Sridhar and S. Arvindam, "A Parallel
Implementation of the Gauss-Seidel Method on the
Flosolver," Technical Report, National Aeronautical La-
boratory, Bangalore, 24 July 2006.
[5] L. Olszewski. “A Timing Comparison of the Conjugate
Gradient and Gauss-Seidel Parallel Algorithms in a One-
Dimensional Flow Equation Using PVM,” Proceedings of
the 33rd Annual on Southeast Regional Conference,
Clemson, March 1995, pp. 205-212.
[6] U. Thongkrajay and T. Kulworawanichpong. “Conver-
gence Improvement of Gauss-Seidel Power Flow Solu-
tion Using Load Transfer Technique,” Proceedings of
Modelling, Identification, and Control, Innsbruck, 11-13
February 2008.
[7] D. Wallin, H. Lof, E. Hagersten and S. Holmgren, “Mul-
tigrid and Gauss-Seidel Smoothers Revisited: Paralleliza-
tion on Chip Multiprocessors,” Proceedings of ICS06
Conference, Cairns, 28-30 June 2006.
[8] T. Kim and C.-O. Lee. “A Parallel Gauss-Seidel Method
Using NR Data Flow Ordering,” Journal of Applied
Mathematics and Computation, Vol. 99, No. 2-3, 1999,
pp. 209-220. doi:10.1016/S0096-3003(98)00008-3
[9] M. Adams, M. Brezina, J. Hu and R. Tuminaro, "Parallel
Multigrid Smoothing: Polynomial versus Gauss-Seidel,"
Journal of Computational Physics, Vol. 188, No. 2, 2003,
pp. 593-610.
[10] G. Fox, M. Johnson, G. Lyzanga, S. Otto, J. Salmon and
D. Walker, “Solving Problems on Concurrent Proces-
sors," Prentice Hall, Upper Saddle River, 1988.
[11] G. Golub and J. M. Ortega, "Scientific Computing with an
Introduction to Parallel Computing,” Academic Press,
Boston, 1993.
[12] R. A. Saleh, K. A. Gallivan, M. Chang, I. N. Hajj, D.
Smart and T. N. Trick, "Parallel Circuit Simulation on
Supercomputers,” Proceedings of the IEEE, Vol. 77, No.
12, 1989, pp. 1915-1930. doi:10.1109/5.48832
[13] Y. Wallach, "Calculations and Programs for Power System
Networks," Prentice Hall, Upper Saddle River, 1986.
[14] H. Courtecuisse and J. Allard, “Parallel Dense Gauss-
Seidel Algorithm on Many-Core Processors,” High Per-
formance Computation Conference (HPCC), Seoul, 25-27
June 2009, pp. 139-147.
[15] F. Wang and J. Xu, “A Crosswind Block Iterative Method
for Convection-Dominated Problems,” SIAM Journal on
Scientific Computing, Vol. 21, No. 2, 1999, pp. 620-645.
doi:10.1137/S106482759631192X
[16] J. Grabel, B. Land and P. Uebertholz, “Performance Op-
timization for the Parallel Gauss-Seidel Smoother,” Pro-
ceedings in Applied Mathematics and Mechanics, Vol. 5,
No. 1, 2005, pp. 831-832.
[17] O. Nobuhiko, M. Takeshi, I. Takeshi and S. Masaaki, “A
Parallel Block Gauss-Seidel Smoother for Algebraic Mul-
tigrid Method in Edge-Element Analysis,” IEE Japan
Papers of Technical Meeting on Static Apparatus, Vol. 6,
No. 58-61, 63-75, 2006, pp. 55-60.
[18] P. Kongetira, K. Aingaran and K. Olukotun, "Niagara: A
32-Way Multithreaded SPARC Processor,” IEEE Micro,
Vol. 25, No. 2, 2005, pp. 21-29.
doi:10.1109/MM.2005.35
[19] M. Al-Towaiq, “Clustered Gauss-Huard Algorithm for
the Solution of Ax = b,” Journal of Applied Mathematics
and Computation, Vol. 184, No. 2, 2007, pp. 485-495.
[20] J. Demmel, "Applications of Parallel Computers," Course
Home Page, 1999. www.cs.berkeley.edu/~demmel/
[21] W. Lichtenstein and S. L. Johnsson, “Block Cyclic Dense
Linear Algebra," SIAM Journal on Scientific Computing,
Vol. 14, No. 6, 1993, pp. 1257-1286.
doi:10.1137/0914075
[22] M. Al-Towaiq, F. Masoud, A. B. Mnaour and K. Day,
“An Implementation of a Parallel Iterative Algorithm for
the Solution of Large Banded Systems on a Cluster of
Workstations,” International Journal of Modeling and
Simulation, Vol. 28, No. 4, 2008, pp. 378-386.
[23] M. Al-Towaiq and H. Al-Aamri, “A Parallel Implementa-
tion of GESPP on a Cluster of Silicon Graphics Worksta-
tions,” Proceedings of the 9th IEEE International Con-
ference on Parallel and Distributed Systems, Hsinchu,
17-20 December 2002, pp. 226-230.
[24] K. Day, “Optical Transpose k-Ary n-Cube Networks,”
Journal of Systems Architecture, Vol. 50, No. 11, 2004,
pp. 697-705. doi:10.1016/j.sysarc.2004.05.002
[25] K. Day and A. E. Al-Ayyoub, “Fault Diameter of k-Ary
n-Cube Networks,” IEEE Transactions on Parallel and
Distributed Systems, Vol. 8, No. 9, 1997, pp. 903-907.
doi:10.1109/71.615436