Parallel Algorithm – Introduction
An algorithm is a sequence of steps that takes input from the user and, after some computation, produces an output. A parallel algorithm is an algorithm that can execute several instructions simultaneously on different processing devices and then combine all the individual outputs to produce the final result.
Concurrent Processing
The easy availability of computers along with the growth of the Internet has changed the way we store and process data. We are living in a day and age where data is available in abundance. Every day we deal with huge volumes of data that require complex computing, and that too in quick time. Sometimes we need to fetch data from similar or interrelated events that occur simultaneously. This is where we require concurrent processing, which can divide a complex task and process it on multiple systems to produce the output in quick time.
Concurrent processing is essential where the task involves processing a huge bulk of complex data. Examples include − accessing large databases, aircraft testing, astronomical calculations, atomic and nuclear physics, biomedical analysis, economic planning, image processing, robotics, weather forecasting, web-based services, etc.
What is Parallelism?
Parallelism is the process of processing several sets of instructions simultaneously. It reduces the total computational time. Parallelism can be implemented by using parallel computers, i.e. computers with many processors. Parallel computers require parallel algorithms, programming languages, compilers, and operating systems that support multitasking.
In this tutorial, we will discuss only parallel algorithms. Before moving further, let us first discuss algorithms and their types.
What is an Algorithm?
An algorithm is a sequence of instructions followed to solve a problem. While designing an algorithm, we need to consider the architecture of the computer on which the algorithm will be executed. As per the architecture, there are two types of computers −
 Sequential Computer
 Parallel Computer
Depending on the architecture of computers, we have two types of algorithms −

Sequential Algorithm − An algorithm in which consecutive steps of instructions are executed in chronological order to solve a problem.

Parallel Algorithm − The problem is divided into sub-problems, which are executed in parallel to get individual outputs. Later on, these individual outputs are combined to get the final desired output.
It is not easy to divide a large problem into sub-problems. Sub-problems may have data dependencies among them. Therefore, the processors have to communicate with each other to solve the problem.
It has been found that the time the processors need to communicate with each other may exceed the actual processing time. So, while designing a parallel algorithm, proper CPU utilization should be considered to get an efficient algorithm.
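In code, this divide-compute-combine structure looks roughly as follows. The sketch is illustrative only: a Python thread pool stands in for separate processors (CPython threads do not give true parallel execution for CPU-bound work), and the function name `parallel_sum` is our own.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, workers=4):
    """Divide the input into chunks, sum each chunk concurrently,
    then combine the partial outputs into the final result."""
    data = list(data)
    chunk = max(1, len(data) // workers)
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, parts))   # the parallel step
    return sum(partial_sums)                        # the combine step

print(parallel_sum(range(1, 101)))  # 5050
```

Note that the combine step is sequential here; on a real parallel machine, even the combining can be parallelized.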
To design an algorithm properly, we must have a clear idea of the basic model of computation in a parallel computer.
Model of Computation
Both sequential and parallel computers operate on a set (stream) of instructions called an algorithm. This set of instructions (the algorithm) instructs the computer about what it has to do in each step.
Depending on the instruction stream and data stream, computers can be classified into four categories −
 Single Instruction stream, Single Data stream (SISD) computers
 Single Instruction stream, Multiple Data stream (SIMD) computers
 Multiple Instruction stream, Single Data stream (MISD) computers
 Multiple Instruction stream, Multiple Data stream (MIMD) computers
SISD Computers
SISD computers contain one control unit, one processing unit, and one memory unit.
In this type of computer, the processor receives a single stream of instructions from the control unit and operates on a single stream of data from the memory unit. During computation, at each step, the processor receives one instruction from the control unit and operates on a single piece of data received from the memory unit.
SIMD Computers
SIMD computers contain one control unit, multiple processing units, and shared memory or an interconnection network.
Here, one single control unit sends instructions to all processing units. During computation, at each step, all the processors receive a single set of instructions from the control unit and operate on different sets of data from the memory unit.
Each of the processing units has its own local memory unit to store both data and instructions. In SIMD computers, processors need to communicate among themselves. This is done by shared memory or by an interconnection network.
While some of the processors execute a set of instructions, the remaining processors wait for their next set of instructions. Instructions from the control unit determine which processor will be active (execute instructions) or inactive (wait for the next instruction).
MISD Computers
As the name suggests, MISD computers contain multiple control units, multiple processing units, and one common memory unit.
Here, each processor has its own control unit and they share a common memory unit. All the processors get instructions individually from their own control unit and operate on a single stream of data as per the instructions received from their respective control units. These processors operate simultaneously.
MIMD Computers
MIMD computers have multiple control units, multiple processing units, and shared memory or an interconnection network.
Here, each processor has its own control unit, local memory unit, and arithmetic and logic unit. They receive different sets of instructions from their respective control units and operate on different sets of data.
Note

An MIMD computer that shares a common memory is known as a multiprocessor, while one that uses an interconnection network is known as a multicomputer.

Based on the physical distance between the processors, multicomputers are of two types −

Multicomputer − When all the processors are very close to one another (e.g., in the same room).

Distributed system − When all the processors are far away from one another (e.g., in different cities).

Parallel Algorithm – Analysis
Analysis of an algorithm helps us determine whether the algorithm is useful or not. Generally, an algorithm is analyzed based on its execution time (time complexity) and the amount of space it requires (space complexity).
Since we have sophisticated memory devices available at reasonable cost, storage space is no longer an issue. Hence, space complexity is not given as much importance.
Parallel algorithms are designed to improve the computation speed of a computer. To analyze a parallel algorithm, we normally consider the following parameters −
 Time complexity (execution time),
 Total number of processors used, and
 Total cost.
Time Complexity
The main reason behind developing parallel algorithms was to reduce the computation time of an algorithm. Thus, evaluating the execution time of an algorithm is extremely important in analyzing its efficiency.
Execution time is measured on the basis of the time taken by the algorithm to solve a problem. The total execution time is calculated from the moment the algorithm starts executing to the moment it stops. If all the processors do not start or end execution at the same time, then the total execution time of the algorithm runs from the moment the first processor starts its execution to the moment the last processor stops its execution.
The time complexity of an algorithm can be classified into three categories −

Worst-case complexity − When the amount of time required by an algorithm for a given input is maximum.

Average-case complexity − When the amount of time required by an algorithm for a given input is average.

Best-case complexity − When the amount of time required by an algorithm for a given input is minimum.
Asymptotic Analysis
The complexity or efficiency of an algorithm is the number of steps executed by the algorithm to get the desired output. Asymptotic analysis is done to calculate the complexity of an algorithm in its theoretical analysis. In asymptotic analysis, a large input length is used to calculate the complexity function of the algorithm.
Note − Asymptotic is a condition in which a line tends to meet a curve, but they do not intersect. Here the line and the curve are said to be asymptotic to each other.
Asymptotic notation is the easiest way to describe the fastest and slowest possible execution times for an algorithm using upper and lower bounds on speed. For this, we use the following notations −
 Big O notation
 Omega notation
 Theta notation
Big O notation
In mathematics, Big O notation is used to represent the asymptotic characteristics of functions. It represents the behavior of a function for large inputs in a simple and accurate way. It is a method of representing the upper bound of an algorithm's execution time, i.e. the longest amount of time the algorithm could take to complete its execution. The function −
f(n) = O(g(n))
iff there exist positive constants c and n0 such that f(n) ≤ c * g(n) for all n where n ≥ n0.
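The definition can be checked mechanically for a concrete pair of functions. In the illustrative sketch below, the witness constants c = 4 and n0 = 10 are chosen by hand for f(n) = 3n + 10 and g(n) = n:

```python
def f(n):
    return 3 * n + 10

def g(n):
    return n

# Witness constants: f(n) = O(g(n)) with c = 4 and n0 = 10,
# since 3n + 10 <= 4n whenever n >= 10.
c, n0 = 4, 10

def holds(n):
    return f(n) <= c * g(n)

print(all(holds(n) for n in range(n0, 10_000)))  # True
```

Checking finitely many n is not a proof, of course; here the inequality 3n + 10 ≤ 4n for n ≥ 10 is easy to verify algebraically.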
Omega notation
Omega notation is a method of representing the lower bound of an algorithm's execution time. The function −
f(n) = Ω(g(n))
iff there exist positive constants c and n0 such that f(n) ≥ c * g(n) for all n where n ≥ n0.
Theta Notation
Theta notation is a method of representing both the lower bound and the upper bound of an algorithm’s execution time. The function −
f(n) = θ(g(n))
iff there exist positive constants c1, c2, and n0 such that c1 * g(n) ≤ f(n) ≤ c2 * g(n) for all n where n ≥ n0.
Speedup of an Algorithm
The performance of a parallel algorithm is determined by calculating its speedup. Speedup is defined as the ratio of the worst-case execution time of the fastest known sequential algorithm for a particular problem to the worst-case execution time of the parallel algorithm.

Speedup = Worst-case execution time of the fastest known sequential algorithm / Worst-case execution time of the parallel algorithm
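As a small worked example (with made-up timings, not measurements from any real system):

```python
def speedup(t_sequential, t_parallel):
    """Ratio of the worst-case time of the fastest known sequential
    algorithm to the worst-case time of the parallel algorithm."""
    return t_sequential / t_parallel

# Hypothetical timings: 12 s for the best sequential algorithm,
# 3 s for the parallel one.
print(speedup(12.0, 3.0))  # 4.0
```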
Numend up beingr of Processors Used
The number of processors used is an important factor in analyzing the efficiency of a parallel algorithm. The cost to buy, maintain, and run the computers is calculated. The larger the number of processors used by an algorithm to solve a problem, the more costly the obtained result becomes.
Total Cost
The total cost of a parallel algorithm is the product of its time complexity and the number of processors used in that particular algorithm.
Total Cost = Time complexity × Numend up beingr of processors used
Therefore, the efficiency of a parallel algorithm is −

Efficiency = Worst-case execution time of the fastest known sequential algorithm / (Number of processors used × Worst-case execution time of the parallel algorithm)
Parallel Algorithm – Models
The model of a parallel algorithm is developed by considering a strategy for dividing the data and the processing method, and by applying a suitable strategy to reduce interactions. In this chapter, we will discuss the following parallel algorithm models −
 Data parallel model
 Task graph model
 Work pool model
 Master-slave model
 Producer-consumer or pipeline model
 Hybrid model
Data Parallel
In the data parallel model, tasks are assigned to processes and each task performs similar types of operations on different data. Data parallelism is a consequence of a single operation being applied on multiple data items.
The data-parallel model can be applied to shared-address spaces and message-passing paradigms. In the data-parallel model, interaction overheads can be reduced by selecting a locality-preserving decomposition, by using optimized collective interaction routines, or by overlapping computation and interaction.
The primary characteristic of data-parallel model problems is that the intensity of data parallelism increases with the size of the problem, which in turn makes it possible to use more processes to solve larger problems.
Example − Dense matrix multiplication.
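A minimal data-parallel sketch (threads stand in for processes, and the helper names are ours): every worker applies the same operation to a different partition of the data.

```python
from concurrent.futures import ThreadPoolExecutor

def square_slice(chunk):
    # Every task performs the SAME operation on DIFFERENT data.
    return [x * x for x in chunk]

data = list(range(8))
slices = [data[0:4], data[4:8]]          # one partition per worker

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(square_slice, slices))

squares = results[0] + results[1]
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49]
```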
Task Graph Model
In the task graph model, parallelism is expressed by a task graph. A task graph can be either trivial or nontrivial. In this model, the correlation among the tasks is utilized to promote locality or to minimize interaction costs. This model is enforced to solve problems in which the quantity of data associated with the tasks is large compared to the amount of computation associated with them. The tasks are assigned to help improve the cost of data movement among the tasks.
Examples − Parallel quick sort, sparse matrix factorization, and parallel algorithms derived via the divide-and-conquer approach.
Here, problems are divided into atomic tasks and implemented as a graph. Each task is an independent unit of work that has dependencies on one or more antecedent tasks. After the completion of a task, the output of an antecedent task is passed to the dependent task. A task with antecedent tasks starts execution only when all of its antecedent tasks are completed. The final output of the graph is received when the last dependent task is completed (Task 6 in the above figure).
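The execution rule above can be sketched with a small hypothetical six-task graph (the dependency table below is invented for illustration; Task 6 depends, directly or indirectly, on all the others and therefore finishes last):

```python
from concurrent.futures import ThreadPoolExecutor, wait

# A hypothetical task graph: each task lists its antecedent tasks.
deps = {1: [], 2: [], 3: [1], 4: [1, 2], 5: [3], 6: [4, 5]}
results = {}

def run(task):
    # Each task combines the outputs of its antecedents with its own id.
    results[task] = task + sum(results[d] for d in deps[task])

done = set()
with ThreadPoolExecutor(max_workers=3) as pool:
    while len(done) < len(deps):
        # A task starts only when ALL of its antecedents are complete;
        # independent tasks in the same round run concurrently.
        ready = [t for t in deps if t not in done
                 and all(d in done for d in deps[t])]
        wait([pool.submit(run, t) for t in ready])
        done.update(ready)

print(results[6])  # 22
```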
Work Pool Model
In the work pool model, tasks are dynamically assigned to the processes for balancing the load. Therefore, any process may potentially execute any task. This model is used when the quantity of data associated with tasks is comparatively smaller than the computation associated with the tasks.
There is no desired pre-assigning of tasks onto the processes. Assigning of tasks is centralized or decentralized. Pointers to the tasks are saved in a physically shared list, in a priority queue, or in a hash table or tree, or they could be saved in a physically distributed data structure.
The tasks may be available at the beginning, or may be generated dynamically. If tasks are generated dynamically and assignment is decentralized, then a termination detection algorithm is required so that all the processes can actually detect the completion of the entire program and stop looking for more tasks.
Example − Parallel tree search
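A centralized work pool can be sketched with Python's standard queue module: idle workers keep pulling whatever task is next until the pool is empty, which balances the load dynamically. The task itself (squaring a number) is a placeholder.

```python
import queue
import threading

# The shared pool of tasks, filled before the workers start.
tasks = queue.Queue()
for n in range(1, 11):
    tasks.put(n)

results = []
lock = threading.Lock()

def worker():
    while True:
        try:
            n = tasks.get_nowait()   # dynamically grab the next task
        except queue.Empty:
            return                   # no more work: terminate
        with lock:
            results.append(n * n)

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```

Because all tasks exist before the workers start, no termination detection is needed here; an empty queue is enough.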
MasterSlave Model
In the master-slave model, one or more master processes generate tasks and allocate them to slave processes. The tasks may be allocated beforehand if −
 the master can estimate the volume of the tasks, or
 a random assignment can do a satisfactory job of load balancing, or
 slaves are assigned smaller pieces of tasks at different times.
This model is generally equally suitable to shared-address-space or message-passing paradigms, since the interaction is naturally two-way.
In some cases, a task may need to be completed in phases, and the task in each phase must be completed before the tasks in the next phase can be generated. The master-slave model can be generalized to a hierarchical or multi-level master-slave model in which the top-level master feeds a large portion of the tasks to second-level masters, who further subdivide the tasks among their own slaves and may perform a part of the task themselves.
Precautions in using the master-slave model
Care should be taken to ensure that the master does not become a bottleneck. This may happen if the tasks are too small or the workers are comparatively fast.
The tasks should be selected in a way that the cost of performing a task dominates the cost of communication and the cost of synchronization.
Asynchronous interaction may help overlap interaction and the computation associated with work generation by the master.
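A minimal master-slave sketch (function names are ours, and threads stand in for slave processes): the master generates the tasks, allocates them beforehand in equal pieces, and combines the results.

```python
from concurrent.futures import ThreadPoolExecutor

def slave(piece):
    """A slave: performs the work on one allocated piece."""
    return sum(piece)

def master(data, n_slaves=4):
    """The master: generates tasks (chunks), allocates them to
    the slaves, and gathers their results."""
    chunk = max(1, len(data) // n_slaves)
    pieces = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=n_slaves) as slaves:
        return sum(slaves.map(slave, pieces))

print(master(list(range(100))))  # 4950
```

Here the master pre-allocates equal chunks; handing out smaller pieces on demand instead would reduce the risk of the master becoming a bottleneck when task sizes vary.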
Pipeline Model
It is also known as the producer-consumer model. Here a set of data is passed on through a series of processes, each of which performs some task on it. The arrival of new data triggers the execution of a new task by a process in the queue. The processes could form a queue in the shape of linear or multidimensional arrays, trees, or general graphs with or without cycles.
This model is a chain of producers and consumers. Each process in the queue can be considered as a consumer of a sequence of data items for the process preceding it in the queue and as a producer of data for the process following it in the queue. The queue does not need to be a linear chain; it can be a directed graph. The most common interaction minimization technique applicable to this model is overlapping interaction with computation.
Example − Parallel LU factorization algorithm.
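A two-stage producer-consumer pipeline can be sketched with queues: each stage consumes from the queue before it and produces into the queue after it. The stage operations (squaring, adding one) are placeholders, and None marks the end of the stream.

```python
import queue
import threading

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()

def stage(inq, outq, op):
    # Consume from the previous queue, produce into the next one.
    while (item := inq.get()) is not None:
        outq.put(op(item))
    outq.put(None)            # propagate the end-of-stream marker

t1 = threading.Thread(target=stage, args=(q1, q2, lambda x: x * x))
t2 = threading.Thread(target=stage, args=(q2, q3, lambda x: x + 1))
t1.start(); t2.start()

for item in [1, 2, 3]:        # the producer feeds the pipeline
    q1.put(item)
q1.put(None)

results = []
while (item := q3.get()) is not None:
    results.append(item)
t1.join(); t2.join()

print(results)  # [2, 5, 10]
```

Once the pipeline is full, both stages work on different items at the same time, which is exactly the overlap of computation the model aims for.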
Hybrid Models
A hybrid algorithm model is required when more than one model may be needed to solve a problem.
A hybrid model may be composed of either multiple models applied hierarchically or multiple models applied sequentially to different phases of a parallel algorithm.
Example − Parallel quick sort
Parallel Random Access Machines
Parallel Random Access Machine (PRAM) is a model that is considered for most of the parallel algorithms. Here, multiple processors are attached to a single block of memory. A PRAM model contains −

A set of similar types of processors.

All the processors share a common memory unit. Processors can communicate among themselves through the shared memory only.

A memory access unit (MAU) connects the processors with the single shared memory.
Here, n processors can perform independent operations on n data items in a particular unit of time. This may result in simultaneous access to the same memory location by different processors.
To solve this problem, the following constraints have been enforced on the PRAM model −

Exclusive Read Exclusive Write (EREW) − Here no two processors are allowed to read from or write to the same memory location at the same time.

Exclusive Read Concurrent Write (ERCW) − Here no two processors are allowed to read from the same memory location at the same time, but are allowed to write to the same memory location at the same time.

Concurrent Read Exclusive Write (CREW) − Here all the processors are allowed to read from the same memory location at the same time, but are not allowed to write to the same memory location at the same time.

Concurrent Read Concurrent Write (CRCW) − All the processors are allowed to read from or write to the same memory location at the same time.
There are many methods to implement the PRAM model, but the most prominent ones are −
 Shared memory model
 Message passing model
 Data parallel model
Shared Memory Model
Shared memory emphasizes control parallelism more than data parallelism. In the shared memory model, multiple processes execute on different processors independently, but they share a common memory space. Due to any processor activity, if there is any change in any memory location, it is visible to the rest of the processors.
As multiple processors access the same memory location, it may happen that at any particular point of time, more than one processor is accessing the same memory location. Suppose one is reading that location and the other is writing to it. This may create confusion. To avoid this, some control mechanism, like a lock or semaphore, is implemented to ensure mutual exclusion.
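A minimal sketch of lock-based mutual exclusion on a shared memory location, using Python threads (the counter and the iteration counts are illustrative): the lock guarantees that only one thread at a time performs the read-modify-write, so no updates are lost.

```python
import threading

counter = 0                    # the shared memory location
lock = threading.Lock()

def increment(times):
    global counter
    for _ in range(times):
        with lock:             # mutual exclusion: one writer at a time
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 — no lost updates
```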
Shared memory programming has been implemented in the following −

Thread libraries − A thread library allows multiple threads of control to run concurrently in the same memory space. It provides an interface that supports multithreading through a library of subroutines. It contains subroutines for −
 Creating and destroying threads
 Scheduling execution of threads
 Passing data and messages between threads
 Saving and restoring thread contexts
Examples of thread libraries include − Solaris threads for Solaris, POSIX threads as implemented in Linux, Win32 threads available in Windows NT and Windows 2000, and Java threads as part of the standard Java Development Kit (JDK).

Distributed Shared Memory (DSM) Systems − DSM systems create an abstraction of shared memory on loosely coupled architectures in order to implement shared memory programming without hardware support. They implement standard libraries and use the advanced user-level memory management features present in modern operating systems. Examples include TreadMarks, Munin, IVY, Shasta, Brazos, and Cashmere.

Program Annotation Packages − These are implemented on architectures having uniform memory access characteristics. The most notable example of a program annotation package is OpenMP. OpenMP implements functional parallelism and mainly focuses on the parallelization of loops.
The concept of shared memory provides low-level control of the shared memory system, but it tends to be tedious and error-prone. It is more applicable to system programming than application programming.
Merits of Shared Memory Programming

Global address space gives a user-friendly programming approach to memory.

Due to the closeness of memory to the CPU, data sharing among processes is fast and uniform.

There is no need to explicitly specify the communication of data among processes.

Process-communication overhead is negligible.

It is very easy to learn.
Demerits of Shared Memory Programming
 It is not portable.
 Managing data locality is very difficult.
Message Passing Model
Message passing is the most commonly used parallel programming approach in distributed memory systems. Here, the programmer has to determine the parallelism. In this model, all the processors have their own local memory unit and they exchange data through a communication network.
Processors use message-passing libraries for communication among themselves. Along with the data being sent, the message contains the following components −

The address of the processor from which the message is being sent;

Starting address of the memory location of the data in the sending processor;

Data type of the data being sent;

Data size of the data being sent;

The address of the processor to which the message is being sent;

Starting address of the memory location for the data in the receiving processor.
Processors can communicate with each other by any of the following methods −
 Point-to-Point Communication
 Collective Communication
 Message Passing Interface
Point-to-Point Communication
Point-to-point communication is the simplest form of message passing. Here, a message can be sent from the sending processor to a receiving processor by any of the following transfer modes −

Synchronous mode − The next message is sent only after receiving a confirmation that the previous message has been delivered, to maintain the sequence of the messages.

Asynchronous mode − To send the next message, receipt of the confirmation of the delivery of the previous message is not required.
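The two endpoints of a point-to-point exchange can be sketched with threads and a queue as the channel. The message fields mirror the components listed earlier; this models the asynchronous mode, since the sender never waits for a per-message confirmation.

```python
import queue
import threading

channel = queue.Queue()        # the communication channel

def sender():
    for i in range(3):
        # Each message carries metadata alongside the data itself.
        channel.put({"src": 0, "dst": 1, "dtype": "int",
                     "size": 1, "data": i})
    channel.put(None)          # end of transmission

received = []

def receiver():
    while (msg := channel.get()) is not None:
        received.append(msg["data"])

ts = threading.Thread(target=sender)
tr = threading.Thread(target=receiver)
ts.start(); tr.start()
ts.join(); tr.join()

print(received)  # [0, 1, 2]
```

A synchronous version would block the sender on a second, confirmation-carrying queue after each put.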
Collective Communication
Collective communication involves more than two processors for message passing. The following modes allow collective communication −

Barrier − Barrier mode is possible if all the processors included in the communication run a particular block (known as the barrier block) for message passing.

Broadcast − Broadcasting is of two types −

One-to-all − Here, one processor with a single operation sends the same message to all other processors.

All-to-all − Here, all processors send messages to all other processors.

Broadcast messages may be of three types −

Personalized − Unique messages are sent to all other destination processors.

Non-personalized − All the destination processors receive the same message.

Reduction − In reduction broadcasting, one processor of the group collects all the messages from all the other processors in the group and combines them into a single message which all the other processors in the group can access.
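A reduction can be sketched as each worker producing a partial result that one combining step folds into a single value (the partitioning and the use of addition as the combining operator are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator

def partial(chunk):
    # Each "processor" produces its own partial message.
    return sum(chunk)

data = list(range(1, 9))
chunks = [data[0:2], data[2:4], data[4:6], data[6:8]]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial, chunks))

total = reduce(operator.add, partials)   # the combining step
print(total)  # 36
```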
Merits of Message Passing
 Provides low-level control of parallelism;
 It is portable;
 Less error prone;
 Less overhead in parallel synchronization and data distribution.
Demerits of Message Passing

As compared to parallel shared-memory code, message-passing code generally needs more software overhead.
Message Passing Libraries
There are many message-passing libraries. Here, we will discuss two of the most widely used message-passing libraries −
 Message Passing Interface (MPI)
 Parallel Virtual Machine (PVM)
Message Passing Interface (MPI)
It is a universal standard to provide communication among all the concurrent processes in a distributed memory system. Most of the commonly used parallel computing platforms provide at least one implementation of the Message Passing Interface. It has been implemented as a collection of predefined functions called a library and can be called from languages such as C, C++, Fortran, etc. MPI implementations are both fast and portable as compared to the other message-passing libraries.
Merits of Message Passing Interface

Runs on shared memory or distributed memory architectures;

Each processor has its own local variables;

As compared to large shared memory computers, distributed memory computers are less expensive.
Demerits of Message Passing Interface
 More programming changes are required for parallel algorithms;
 Sometimes difficult to debug; and
 Does not perform well in the communication network between the nodes.
Parallel Virtual Machine (PVM)
PVM is a portable message-passing system, designed to connect separate heterogeneous host machines to form a single virtual machine. It is a single manageable parallel computing resource. Large computational problems like superconductivity studies, molecular dynamics simulations, and matrix algorithms can be solved more cost-effectively by using the memory and the aggregate power of many computers. It manages all message routing, data conversion, and task scheduling in a network of incompatible computer architectures.
Features of PVM
 Very easy to install and configure;
 Multiple users can use PVM at the same time;
 One user can execute multiple applications;
 It's a small package;
 Supports C, C++, Fortran;
 For a given run of a PVM program, users can select the group of machines;
 It is a message-passing model;
 Process-based computation;
 Supports heterogeneous architectures.
Data Parallel Programming
The major focus of the data parallel programming model is on performing operations on a data set simultaneously. The data set is organized into some structure like an array, hypercube, etc. Processors perform operations collectively on the same data structure. Each task is performed on a different partition of the same data structure.
It is restrictive, as not all algorithms can be specified in terms of data parallelism. This is the reason why data parallelism is not universal.
Data parallel languages help to specify the data decomposition and its mapping to the processors. They also include data distribution statements that allow the programmer to have control over data – for example, which data will go to which processor – to reduce the amount of communication within the processors.
Parallel Algorithm – Structure
To apply any algorithm properly, it is very important that you select a proper data structure. This is because a particular operation performed on one data structure may take more time than the same operation performed on another data structure.
Example − Accessing the i-th element of a set using an array may take constant time, but using a linked list, the time required to perform the same operation may become polynomial.
Therefore, the selection of a data structure must be done considering the architecture and the type of operations to be performed.
The following data structures are commonly used in parallel programming −
 Linked List
 Arrays
 Hypercube Network
Linked List
A linked list is a data structure having zero or more nodes connected by pointers. Nodes may or may not occupy consecutive memory locations. Each node has two or three parts − one data part that stores the data, and the others are link fields that store the address of the previous or next node. The first node's address is stored in an external pointer called the head. The last node, known as the tail, generally does not contain any address.
There are three types of linked lists −
 Singly Linked List
 Doubly Linked List
 Circular Linked List
Singly Linked List
A node of a singly linked list contains data and the address of the next node. An external pointer called the head stores the address of the first node.
Doubly Linked List
A node of a doubly linked list contains data and the addresses of both the previous and the next node. An external pointer called the head stores the address of the first node, and an external pointer called the tail stores the address of the last node.
Circular Linked List
A circular linked list is very similar to a singly linked list except that the last node stores the address of the first node.
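A minimal singly linked list in Python, where object references play the role of addresses (the class and method names are ours):

```python
class Node:
    """A node of a singly linked list: data plus the address
    (reference) of the next node."""
    def __init__(self, data):
        self.data = data
        self.next = None

class SinglyLinkedList:
    def __init__(self):
        self.head = None          # external pointer to the first node

    def push_front(self, data):
        node = Node(data)
        node.next = self.head     # link the new node to the old first node
        self.head = node

    def to_list(self):
        out, cur = [], self.head
        while cur is not None:    # walk the chain until the tail
            out.append(cur.data)
            cur = cur.next
        return out

lst = SinglyLinkedList()
for x in [3, 2, 1]:
    lst.push_front(x)
print(lst.to_list())  # [1, 2, 3]
```

A doubly linked list would add a `prev` field to each node, and a circular list would point the tail's `next` back at the head.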
Arrays
An array is a data structure where we can store similar kinds of data. It can end up being onedimensional or multidimensional. Arrays can end up being generated statically or dynamically.

In statically declared arrays, the dimension and size of the arrays are known at the time of compilation.

In dynamically declared arrays, the dimension and size of the array are known at runtime.
For shared memory programming, arrays can be used as a common memory and for data parallel programming, they can be used by partitioning them into sub-arrays.
Hypercube Network
Hypercube architecture is helpful for those parallel algorithms where each task has to communicate with other tasks. Hypercube topology can easily embed other topologies such as ring and mesh. It is also known as n-cubes, where n is the number of dimensions. A hypercube can be constructed recursively.
Parallel Algorithm – Design Techniques
Selecting a proper designing technique for a parallel algorithm is the most difficult and important task. Most of the parallel programming problems may have more than one solution. In this chapter, we will discuss the following designing techniques for parallel algorithms −
 Divide and conquer
 Greedy Method
 Dynamic Programming
 Backtracking
 Branch & Bound
 Linear Programming
Divide and Conquer Method
In the divide and conquer approach, the problem is divided into a number of small subproblems. Then the subproblems are solved recursively and combined to get the solution of the original problem.
The divide and conquer approach involves the following steps at each level −

Divide − The original problem is divided into subproblems.

Conquer − The subproblems are solved recursively.

Combine − The solutions of the subproblems are combined together to get the solution of the original problem.
The divide and conquer approach is applied in the following algorithms −
 Binary search
 Quick sort
 Merge sort
 Integer multiplication
 Matrix inversion
 Matrix multiplication
Greedy Method
In the greedy method of optimizing a solution, the best available solution is chosen at any moment. A greedy algorithm is very easy to apply to complex problems. It decides which step will provide the most accurate solution in the next step.
This algorithm is called greedy because when the optimal solution to the smaller instance is provided, the algorithm does not consider the total program as a whole. Once a solution is considered, the greedy algorithm never reconsiders the same solution.
A greedy algorithm works recursively, creating a group of objects from the smallest possible component parts. Recursion is a procedure to solve a problem in which the solution to a specific problem is dependent on the solution of the smaller instance of that problem.
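A classic illustration of the greedy method is making change with the largest coins first. The Python sketch below is ours (the `greedy_change` name is illustrative), and the greedy choice is only optimal for canonical coin systems such as 1, 5, 10, 25:

```python
def greedy_change(amount, coins):
    """Greedy coin change: repeatedly take the largest coin that still fits.
    At each step the locally best choice is made and never reconsidered."""
    result = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            amount -= coin
            result.append(coin)
    return result
```

For example, 63 with coins {1, 5, 10, 25} yields two 25s, one 10, and three 1s.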
Dynamic Programming
Dynamic programming is an optimization technique, which divides the problem into smaller subproblems and, after solving each subproblem, combines the solutions to get the ultimate solution. Unlike the divide and conquer method, dynamic programming reuses the solution to the subproblems many times.
The memoized recursive algorithm for the Fibonacci series is an example of dynamic programming.
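A short Python sketch of this idea: Fibonacci numbers with memoization, so each subproblem is solved once and its solution reused.

```python
def fib(n, memo=None):
    """Fibonacci with memoization: each subproblem n is computed once,
    stored in memo, and reused on later calls."""
    if memo is None:
        memo = {}
    if n < 2:
        return n
    if n not in memo:
        memo[n] = fib(n - 1, memo) + fib(n - 2, memo)
    return memo[n]
```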
Backtracking Algorithm
Backtracking is an optimization technique to solve combinatorial problems. It is applied to both programmatic and real-life problems. The eight queens problem, Sudoku puzzles and going through a maze are popular examples where the backtracking algorithm is used.
In backtracking, we start with a possible solution, which satisfies all the required conditions. Then we move to the next level and if that level does not produce a satisfactory solution, we return one level back and start with a new option.
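The process above can be sketched with the classic n-queens problem; this is an illustrative Python version (names are ours) that counts solutions by extending a partial placement column by column and undoing it on conflict:

```python
def n_queens(n):
    """Count placements of n non-attacking queens by backtracking:
    extend a partial solution row by row, undo the choice on conflict."""
    solutions = []

    def safe(cols, col):
        # cols[r] is the column of the queen already placed in row r.
        row = len(cols)
        return all(c != col and abs(c - col) != row - r
                   for r, c in enumerate(cols))

    def place(cols):
        if len(cols) == n:
            solutions.append(tuple(cols))
            return
        for col in range(n):
            if safe(cols, col):
                cols.append(col)   # try an option
                place(cols)
                cols.pop()         # backtrack: return one level and retry

    place([])
    return len(solutions)
```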
Branch and Bound
A branch and bound algorithm is an optimization technique to get an optimal solution to the problem. It looks for the best solution for a given problem in the entire space of solutions. The bounds in the function to be optimized are merged with the value of the latest best solution. This lets the algorithm explore parts of the solution space completely.
The purpose of a branch and bound search is to maintain the lowest-cost path to a target. Once a solution is found, it can keep improving the solution. Branch and bound search is implemented in depth-bounded search and depth–first search.
Linear Programming
Linear programming describes a wide class of optimization jobs where both the optimization criterion and the constraints are linear functions. It is a technique to get the best outcome like maximum profit, shortest route, or lowest cost.
In this programming, we have a set of variables and we have to assign absolute values to them to satisfy a set of linear equations and to maximize or minimize a given linear objective function.
Parallel Algorithm – Matrix Multiplication
A matrix is a set of numerical and non-numerical data arranged in a fixed number of rows and columns. Matrix multiplication is an important multiplication design in parallel computation. Here, we will discuss the implementation of matrix multiplication on various communication networks like mesh and hypercube. Mesh and hypercube have higher network connectivity, so they allow a faster algorithm than other networks like the ring network.
Mesh Network
A topology where a set of nodes form a p-dimensional grid is called a mesh topology. Here, all the edges are parallel to the grid axis and all the adjacent nodes can communicate among themselves.
Total number of nodes = (number of nodes in row) × (number of nodes in column)
A mesh network can be evaluated using the following factors −
 Diameter
 Bisection width
Diameter − In a mesh network, the longest distance between two nodes is its diameter. A p-dimensional mesh network having k^{p} nodes has a diameter of p(k–1).
Bisection width − Bisection width is the minimum number of edges needed to be removed from a network to divide the mesh network into two halves.
Matrix Multiplication Using Mesh Network
We have considered a 2D mesh network SIMD model having wraparound connections. We will design an algorithm to multiply two n × n arrays using n^{2} processors in a particular amount of time.
Matrices A and B have elements a_{ij} and b_{ij} respectively. Processing element PE_{ij} represents a_{ij} and b_{ij}. Arrange the matrices A and B in such a way that every processor has a pair of elements to multiply. The elements of matrix A will move in the left direction and the elements of matrix B will move in the upward direction. These changes in the positions of the elements in matrices A and B present every processing element, PE, a new pair of values to multiply.
Steps in Algorithm
 Stagger two matrices.
 Calculate all products, a_{ik} × b_{kj}
 Calculate sums when step 2 is complete.
Algorithm
Procedure MatrixMulti
Begin
   for k = 1 to n-1
      for all P_{ij}; where i and j range from 1 to n
         if i is greater than k then
            rotate a in left direction
         end if
         if j is greater than k then
            rotate b in the upward direction
         end if
   for all P_{ij}; where i and j lie between 1 and n
      compute the product of a and b and store it in c
   for k = 1 to n-1 step 1
      for all P_{ij}; where i and j range from 1 to n
         rotate a in left direction
         rotate b in the upward direction
         c = c + a × b
End
Hypercube Network
A hypercube is an n-dimensional construct where edges are perpendicular among themselves and are of the same length. An n-dimensional hypercube is also known as an n-cube or an n-dimensional cube.
Features of Hypercube with 2^{k} nodes
 Diameter = k
 Bisection width = 2^{k–1}
 Number of edges per node = k
Matrix Multiplication using Hypercube Network
General specification of hypercube networks −

Let N = 2^{m} be the total number of processors. Let the processors be P_{0}, P_{1} … P_{N−1}.

Let i and i^{b} be two integers, 0 < i, i^{b} < N−1, whose binary representations differ only in position b, 0 < b < k–1.

Let us consider two n × n matrices, matrix A and matrix B.

Step 1 − The elements of matrix A and matrix B are assigned to the n^{3} processors such that the processor in position (i, j, k) will have a_{ji} and b_{ik}.

Step 2 − All the processors in position (i, j, k) compute the product
C(i,j,k) = A(i,j,k) × B(i,j,k)

Step 3 − The sum C(0,j,k) = ΣC(i,j,k) for 0 ≤ i ≤ n−1, where 0 ≤ j, k < n–1.
Block Matrix
A block matrix or partitioned matrix is a matrix where each element itself represents an individual matrix. These individual sections are known as a block or sub-matrix.
Example
In Figure (a), X is a block matrix where A, B, C, D are matrices themselves. Figure (f) shows the complete matrix.
Block Matrix Multiplication
When two block matrices are square matrices, then they are multiplied just the way we perform simple matrix multiplication. For example,
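A minimal Python sketch of block multiplication, assuming square matrices stored as lists of lists and a block size that divides n; sub-blocks are multiplied and accumulated exactly like scalar entries in ordinary matrix multiplication:

```python
def block_multiply(A, B, bs):
    """Multiply square matrices A and B block by block.
    n must be divisible by the block size bs; each (bs x bs) block
    product is accumulated into the result, mirroring scalar products."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, bs):          # block row of C
        for j0 in range(0, n, bs):      # block column of C
            for k0 in range(0, n, bs):  # block product to accumulate
                for i in range(i0, i0 + bs):
                    for j in range(j0, j0 + bs):
                        s = 0
                        for k in range(k0, k0 + bs):
                            s += A[i][k] * B[k][j]
                        C[i][j] += s
    return C
```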
Parallel Algorithm – Sorting
Sorting is a process of arranging elements in a group in a particular order, i.e., ascending order, descending order, alphabetic order, etc. Here we will discuss the following −
 Enumeration Sort
 Odd-Even Transposition Sort
 Parallel Merge Sort
 Hyper Quick Sort
Sorting a list of elements is a very common operation. A sequential sorting algorithm may not be efficient enough when we have to sort a huge volume of data. Therefore, parallel algorithms are used in sorting.
Enumeration Sort
Enumeration sort is a method of arranging all the elements in a list by finding the final position of each element in the sorted list. It is done by comparing each element with all other elements and finding the number of elements having a smaller value.
Therefore, for any two elements, a_{i} and a_{j}, any one of the following cases must be true −
 a_{i} < a_{j}
 a_{i} > a_{j}
 a_{i} = a_{j}
Algorithm
procedure ENUM_SORTING (n)
begin
   for each process P_{1,j} do
      C[j] := 0;
   for each process P_{i,j} do
      if (A[i] < A[j]) or (A[i] = A[j] and i < j) then
         C[j] := 1;
      else
         C[j] := 0;
   for each process P_{1,j} do
      A[C[j]] := A[j];
end ENUM_SORTING
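A sequential Python analogue of the pseudocode may help; each comparison that a process P_{i,j} would perform in parallel is carried out by a nested loop here (the `enumeration_sort` name is ours):

```python
def enumeration_sort(a):
    """Enumeration sort, sequential analogue: the final position of each
    element is the count of elements that must precede it (smaller value,
    or equal value at an earlier index to keep duplicates stable)."""
    n = len(a)
    rank = [0] * n
    for i in range(n):
        for j in range(n):
            if a[j] < a[i] or (a[j] == a[i] and j < i):
                rank[i] += 1
    out = [None] * n
    for i in range(n):
        out[rank[i]] = a[i]  # place a[i] directly at its final position
    return out
```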
Odd-Even Transposition Sort
Odd-Even Transposition Sort is based on the Bubble Sort technique. It compares two adjacent numbers and switches them if the first number is greater than the second number, to get an ascending order list. The opposite case applies for a descending order series. Odd-Even transposition sort operates in two phases − odd phase and even phase. In both phases, processes exchange numbers with the adjacent number to the right.
Algorithm
procedure ODD-EVEN_PAR (n)
begin
   id := process's label
   for i := 1 to n do
   begin
      if i is odd and id is odd then
         compare-exchange_min(id + 1);
      else
         compare-exchange_max(id - 1);
      if i is even and id is even then
         compare-exchange_min(id + 1);
      else
         compare-exchange_max(id - 1);
   end for
end ODD-EVEN_PAR
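In a sequential Python sketch, the alternating odd and even phases look like this; one process per element is simulated by a loop over index pairs, and n phases are enough to sort n elements:

```python
def odd_even_transposition_sort(a):
    """Odd-even transposition sort: n phases that alternate between
    odd-even and even-odd index pairs, swapping out-of-order neighbours."""
    a = list(a)
    n = len(a)
    for phase in range(n):
        # odd phase compares pairs (1,2), (3,4), ...; even phase (0,1), (2,3), ...
        start = 1 if phase % 2 == 0 else 0
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```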
Parallel Merge Sort
Merge sort first divides the unsorted list into the smallest possible sub-lists, compares each with the adjacent list, and merges them in sorted order. It implements parallelism very nicely by following the divide and conquer algorithm.
Algorithm
procedure parallelmergesort(id, n, data, newdata)
begin
   data = sequentialmergesort(data)
   for dim = 1 to n
      data = parallelmerge(id, dim, data)
   endfor
   newdata = data
end
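The sequential building block, merge sort itself, can be sketched in Python as follows; the two recursive calls are independent of each other, which is exactly what the parallel version exploits:

```python
def merge_sort(a):
    """Divide and conquer merge sort: split, sort each half recursively,
    then merge the two sorted halves."""
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]
```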
Hyper Quick Sort
Hyper quick sort is an implementation of quick sort on a hypercube. Its steps are as follows −
 Divide the unsorted list among each node.
 Sort each node locally.
 From node 0, broadcast the median value.
 Split each list locally, then exchange the halves across the highest dimension.
 Repeat steps 3 and 4 in parallel until the dimension reaches 0.
Algorithm
procedure HYPERQUICKSORT (B, n)
begin
   id := process’s label;
   for i := 1 to d do
   begin
      x := pivot;
      partition B into B1 and B2 such that B1 ≤ x < B2;
      if ith bit is 0 then
      begin
         send B2 to the process along the ith communication link;
         C := subsequence received along the ith communication link;
         B := B1 U C;
      endif
      else
         send B1 to the process along the ith communication link;
         C := subsequence received along the ith communication link;
         B := B2 U C;
      end else
   end for
   sort B using sequential quicksort;
end HYPERQUICKSORT
Parallel Search Algorithm
Searching is one of the fundamental operations in computer science. It is used in all applications where we need to find if an element is in the given list or not. In this chapter, we will discuss the following search algorithms −
 Divide and Conquer
 Depth-First Search
 Breadth-First Search
 Best-First Search
Divide and Conquer
In the divide and conquer approach, the problem is divided into a number of small subproblems. Then the subproblems are solved recursively and combined to get the solution of the original problem.
The divide and conquer approach involves the following steps at each level −

Divide − The original problem is divided into subproblems.

Conquer − The subproblems are solved recursively.

Combine − The solutions of the subproblems are combined to get the solution of the original problem.
Binary search is an example of a divide and conquer algorithm.
Pseudocode
Binarysearch(a, b, low, high)
   if low > high then
      return NOT FOUND
   else
      mid ← (low + high) / 2
      if b = key(mid) then
         return key(mid)
      else if b < key(mid) then
         return BinarySearch(a, b, low, mid−1)
      else
         return BinarySearch(a, b, mid+1, high)
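A runnable Python version of the same recursion, returning -1 in place of NOT FOUND, might look like this:

```python
def binary_search(a, key, low, high):
    """Recursive binary search on a sorted list a over indices [low, high].
    Returns the index of key, or -1 when the range is exhausted."""
    if low > high:
        return -1
    mid = (low + high) // 2
    if a[mid] == key:
        return mid
    if key < a[mid]:
        return binary_search(a, key, low, mid - 1)
    return binary_search(a, key, mid + 1, high)
```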
Depth-First Search
Depth-First Search (or DFS) is an algorithm for searching a tree or an undirected graph data structure. Here, the concept is to start from the starting node known as the root and traverse as far as possible in the same branch. If we get a node with no successor node, we return and continue with the vertex which is yet to be visited.
Steps of Depth-First Search

Consider a node (root) that is not visited previously and mark it visited.

Visit the first adjacent successor node and mark it visited.

If all the successor nodes of the considered node are already visited or it doesn’t have any more successor nodes, return to its parent node.
Pseudocode
Let v be the vertex where the search starts in Graph G.
DFS(G, v)
   Stack S := {};
   for each vertex u, set visited[u] := false;
   push S, v;
   while (S is not empty) do
      u := pop S;
      if (not visited[u]) then
         visited[u] := true;
         for each unvisited neighbour w of u
            push S, w;
      end if
   end while
END DFS()
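A runnable Python counterpart of this pseudocode; the adjacency-list representation (a dict mapping each vertex to its neighbours) is an assumption of the sketch:

```python
def dfs(graph, start):
    """Iterative depth-first traversal with an explicit stack,
    mirroring the pseudocode; graph maps each vertex to its neighbours."""
    visited, order, stack = set(), [], [start]
    while stack:
        u = stack.pop()
        if u not in visited:
            visited.add(u)
            order.append(u)
            for w in graph[u]:
                if w not in visited:
                    stack.append(w)
    return order
```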
Breadth-First Search
Breadth-First Search (or BFS) is an algorithm for searching a tree or an undirected graph data structure. Here, we start with a node and then visit all the adjacent nodes in the same level and then move to the adjacent successor node in the next level. This is also known as level-by-level search.
Steps of Breadth-First Search
 Start with the root node, mark it visited.
 As the root node has no node in the same level, go to the next level.
 Visit all adjacent nodes and mark them visited.
 Go to the next level and visit all the unvisited adjacent nodes.
 Continue this process until all the nodes are visited.
Pseudocode
Let v be the vertex where the search starts in Graph G.
BFS(G, v)
   Queue Q := {};
   for each vertex u, set visited[u] := false;
   insert Q, v;
   while (Q is not empty) do
      u := delete Q;
      if (not visited[u]) then
         visited[u] := true;
         for each unvisited neighbour w of u
            insert Q, w;
      end if
   end while
END BFS()
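A Python counterpart using a FIFO queue (`collections.deque`); the adjacency-list representation is again an assumption of the sketch:

```python
from collections import deque

def bfs(graph, start):
    """Breadth-first (level-by-level) traversal using a FIFO queue;
    graph maps each vertex to its neighbours."""
    visited, order, queue = {start}, [], deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in graph[u]:
            if w not in visited:
                visited.add(w)
                queue.append(w)
    return order
```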
Best-First Search
Best-First Search is an algorithm that traverses a graph to reach a target in the shortest possible path. Unlike BFS and DFS, Best-First Search follows an evaluation function to determine which node is the most appropriate to traverse next.
Steps of Best-First Search
 Start with the root node, mark it visited.
 Find the next appropriate node and mark it visited.
 Go to the next level, find the appropriate node and mark it visited.
 Continue this process until the target is reached.
Pseudocode
BFS( m )
   Insert( m.StartNode )
   Until PriorityQueue is empty
      c ← PriorityQueue.DeleteMin
      If c is the goal
         Exit
      Else
         For each neighbour n of c
            If n "Unvisited"
               Mark n "Visited"
               Insert( n )
         Mark c "Examined"
End procedure
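One way to sketch this in Python is with a priority queue (`heapq`) ordered by an evaluation function h; the graph, the heuristic values, and the node names below are illustrative assumptions:

```python
import heapq

def best_first_search(graph, h, start, goal):
    """Greedy best-first search: always expand the frontier node whose
    evaluation h(node) is smallest; returns the path found, or None."""
    visited = set()
    pq = [(h[start], start, [start])]  # (evaluation, node, path so far)
    while pq:
        _, node, path = heapq.heappop(pq)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nbr in graph[node]:
            if nbr not in visited:
                heapq.heappush(pq, (h[nbr], nbr, path + [nbr]))
    return None
```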
Graph Algorithm
A graph is an abstract notation used to represent the connection between pairs of objects. A graph consists of −

Vertices − Interconnected objects in a graph are called vertices. Vertices are also known as nodes.

Edges − Edges are the links that connect the vertices.
There are two kinds of graphs −

Directed graph − In a directed graph, edges have direction, i.e., edges go from one vertex to another.

Undirected graph − In an undirected graph, edges have no direction.
Graph Coloring
Graph colouring is a method to assign colours to the vertices of a graph so that no two adjacent vertices have the same colour. Some graph colouring problems are −

Vertex colouring − A way of colouring the vertices of a graph so that no two adjacent vertices share the same colour.

Edge colouring − It is the method of assigning a colour to each edge so that no two adjacent edges have the same colour.

Face colouring − It assigns a colour to each face or region of a planar graph so that no two faces that share a common boundary have the same colour.
Chromatic Number
The chromatic number is the minimum number of colours required to colour a graph. For example, the chromatic number of the following graph is 3.
The concept of graph colouring is applied in preparing timetables, mobile radio frequency assignment, Sudoku, register allocation, and colouring of maps.
Steps for graph colouring

Set the initial value of each processor in the n-dimensional array to 1.

Now, to assign a particular colour to a vertex, determine whether that colour is already assigned to the adjacent vertices or not.

If a processor detects the same colour in the adjacent vertices, it sets its value in the array to 0.

After making n^{2} comparisons, if any element of the array is 1, then it is a valid colouring.
Pseudocode for graph colouring
begin
   create the processors P(i_{0}, i_{1}, ..., i_{n−1}) where 0 ≤ i_{v} < m, 0 ≤ v < n
   status[i_{0}, ..., i_{n−1}] = 1
   for j varies from 0 to n−1 do
   begin
      for k varies from 0 to n−1 do
      begin
         if a_{j,k} = 1 and i_{j} = i_{k} then
            status[i_{0}, ..., i_{n−1}] = 0
      end
   end
   ok = ΣStatus
   if ok > 0, then display valid colouring exists
   else display invalid colouring
end
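The validity test at the heart of this pseudocode can be sketched sequentially in Python; `adj` is an adjacency matrix and `colour` the candidate assignment (both names are ours):

```python
def is_valid_colouring(adj, colour):
    """Return True iff no two adjacent vertices share a colour.
    adj is an n x n 0/1 adjacency matrix; colour[v] is vertex v's colour.
    This performs the same pairwise checks the processors do in parallel."""
    n = len(adj)
    return all(not (adj[j][k] == 1 and colour[j] == colour[k])
               for j in range(n) for k in range(n) if j != k)
```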
Minimal Spanning Tree
A spanning tree whose sum of weights (or lengths) of all its edges is less than all other possible spanning trees of graph G is known as a minimal spanning tree or minimum cost spanning tree. The following figure shows a weighted connected graph.
Some possible spanning trees of the above graph are shown end up beinglow −
Among all the above spanning trees, figure (d) is the minimum spanning tree. The concept of the minimum cost spanning tree is applied in the travelling salesman problem, designing electronic circuits, designing efficient networks, and designing efficient routing algorithms.
To implement the minimum cost spanning tree, the following two methods are used −
 Prim’s Algorithm
 Kruskal’s Algorithm
Prim's Algorithm
Prim’s algorithm is a greedy algorithm, which helps us find the minimum spanning tree for a weighted undirected graph. It selects a vertex first and finds an edge with the lowest weight incident on that vertex.
Steps of Prim’s Algorithm

Select any vertex, say v_{1} of Graph G.

Select an edge, say e_{1} of G such that e_{1} = v_{1} v_{2} and v_{1} ≠ v_{2} and e_{1} has minimum weight among the edges incident on v_{1} in graph G.

Now, following step 2, select the minimum weighted edge incident on v_{2}.

Continue this till n–1 edges have been chosen. Here n is the number of vertices.
The minimum spanning tree is −
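The steps above can be sketched compactly in Python with a heap of candidate edges; the adjacency-list format `v -> [(weight, u), ...]` is an assumption of this sketch:

```python
import heapq

def prim_mst_weight(graph, start):
    """Prim's algorithm: grow the tree from start, repeatedly adding the
    lightest edge that leaves the tree; graph maps v -> [(weight, u), ...].
    Returns the total weight of the minimum spanning tree."""
    visited, total = {start}, 0
    edges = list(graph[start])
    heapq.heapify(edges)
    while edges and len(visited) < len(graph):
        w, u = heapq.heappop(edges)
        if u not in visited:
            visited.add(u)
            total += w
            for e in graph[u]:
                if e[1] not in visited:
                    heapq.heappush(edges, e)
    return total
```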
Kruskal's Algorithm
Kruskal’s algorithm is a greedy algorithm, which helps us find the minimum spanning tree for a connected weighted graph, adding increasing cost arcs at each step. It is a minimum-spanning-tree algorithm that finds an edge of the minimum possible weight that connects any two trees in the forest.
Steps of Kruskal’s Algorithm

Select an edge of minimum weight, say e_{1} of Graph G, where e_{1} is not a loop.

Select the next minimum weighted edge connected to e_{1}.

Continue this till n–1 edges have been chosen. Here n is the number of vertices.
The minimum spanning tree of the above graph is −
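Kruskal's algorithm can be sketched in Python with a union-find structure to detect whether an edge joins two different trees of the forest; the names and the `(weight, u, v)` edge format are ours:

```python
def kruskal_mst_weight(n, edges):
    """Kruskal's algorithm: scan edges in increasing weight order and keep
    an edge iff its endpoints lie in different trees (checked via union-find).
    n is the number of vertices (0..n-1); edges are (weight, u, v) tuples."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    total = 0
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:        # edge connects two different trees: keep it
            parent[ru] = rv
            total += w
    return total
```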
Shortest Path Algorithm
A shortest path algorithm is a method of finding the minimum cost path from the source node (S) to the destination node (D). Here, we will discuss Moore’s algorithm, also known as the Breadth First Search Algorithm.
Moore’s algorithm

Label the source vertex S, label it i and set i = 0.

Find all unlabeled vertices adjacent to the vertex labeled i. If no vertices are connected to the vertex S, then vertex D is not connected to S. If there are vertices connected to S, label them i+1.

If D is labeled, then go to step 4, else go to step 2 to increase i = i+1.

Stop after the length of the shortest path is found.
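The steps above can be sketched in Python as a breadth-first labelling; a return value of -1 below stands in for "D is not connected to S", and the adjacency-list graph format is an assumption of the sketch:

```python
from collections import deque

def moore_shortest_path(graph, s, d):
    """Moore's algorithm (BFS): label vertices level by level with their
    distance from s; returns the shortest path length to d, or -1 if d
    is not connected to s. graph maps each vertex to its neighbours."""
    label = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == d:
            return label[u]
        for w in graph[u]:
            if w not in label:           # unlabeled vertex adjacent to level i
                label[w] = label[u] + 1  # label it i+1
                queue.append(w)
    return -1
```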