Parallel Arrays C++ Example

Posted By admin On 26/11/21

Example: parallel image processing. This example implements a halo exchange algorithm to speed up an image processing program. You will need to view images on screen, so connect to BlueCrystal with ssh -X or ssh -XY. Go to the example files: cd examples/9laplace. If using the Intel compiler, use Makefile.ifort: make -f Makefile.ifort.

The parallel directive can be used in coarse-grain parallel programs, where each thread in the parallel region decides what part of the global array x to work on, based on the thread number. The #pragma omp parallel for statement parallelizes a loop, which lets us initialize the matrix more efficiently. We also need a square matrix with zero values to store the answer.

OpenMP, short for “Open Multi-Processing”, is an API that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran on most platforms, processor architectures and operating systems. OpenMP consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior. In this post, we will be exploring OpenMP for C++.

In this tutorial, you will learn to work with arrays: how to declare, initialize and access array elements, with the help of examples. An array is a variable that can store multiple values.

Haskell'98 supports just one array constructor type, namely Array, which gives you immutable boxed arrays. 'Immutable' means that these arrays, like any other pure functional data structure, have contents fixed at construction time. You can't modify them, only query. There are 'modification' operations, but they just return new arrays and don't modify the original one. This makes it possible to use Arrays in pure functional code along with lists. 'Boxed' means that array elements are just ordinary Haskell (lazy) values, which are evaluated on demand, and can even contain bottom (undefined) values. I'd recommend that you learn how to use these arrays before proceeding to the rest of this page.

Nowadays the main Haskell compilers, GHC and Hugs, ship with the same set of Hierarchical Libraries, and these libraries contain a new implementation of arrays which is backward compatible with the Haskell'98 one, but which has far more features. Suffice it to say that these libraries support 9 types of array constructors: Array, UArray, IOArray, IOUArray, STArray, STUArray, DiffArray, DiffUArray and StorableArray. Each provides just one of two interfaces, and one of these you already know.


Immutable arrays (module Data.Array.IArray)

The first interface provided by the new array library is defined by the typeclass IArray (which stands for 'immutable array' and is defined in the module Data.Array.IArray) and defines the same operations that were defined for Array in Haskell'98. The big difference is that it is now a typeclass and there are 4 array type constructors, each of which implements this interface: Array, UArray, DiffArray, and DiffUArray. We will later describe the differences between them and the cases when these other types are preferable to use instead of the good old Array. Also note that to use the Array type constructor together with other new array types, you need to import the Data.Array.IArray module instead of Data.Array.

Mutable IO arrays (module Data.Array.IO)

The second interface is defined by the type class MArray (which stands for 'mutable array' and is defined in the module Data.Array.MArray) and contains operations to update array elements in-place. Mutable arrays are very similar to IORefs, only they contain multiple values. Type constructors for mutable arrays are IOArray and IOUArray, and operations which create, update and query these arrays all belong to the IO monad:

This program creates an array of 10 elements with all values initially set to 37. Then it reads the first element of the array. After that, the program modifies the first element of the array and then reads it again. The type declaration in the second line is necessary because our little program doesn't provide enough context to allow the compiler to determine the concrete type of `arr`. Unlike examples, real programs rarely need such declarations.

Mutable arrays in ST monad (module Data.Array.ST)

In the same way that IORef has its more general cousin STRef, IOArray has a more general version STArray (and similarly, IOUArray corresponds to STUArray). These array types allow one to work with mutable arrays in the ST monad:

Believe it or not, now you know all that is needed to use any array type. Unless you are interested in speed issues, just use Array, IOArray and STArray where appropriate. The following topics are almost exclusively about selecting the proper array type to make programs run faster.

DiffArray (module Data.Array.Diff)

Note, as of Jan 2012, DiffArray is not yet ready for production use; its practical (wall clock) performance does not live up to its theoretical advantages.

As we already stated, the update operation on immutable arrays (IArray) just creates a new copy of the array, which is very inefficient, but it is a pure operation which can be used in pure functions. On the other hand, updates on mutable arrays (MArray) are efficient but can be done only in monadic code. In theory, DiffArray combines the best of both worlds: it supports the IArray interface and therefore can be used in a purely functional way, but internally it uses the efficient update of MArrays.

(In practice, however, DiffArrays are 10-100x slower than MArrays, due to the overhead of maintaining an immutable interface. See bug report here: [1])

How does this trick work? DiffArray has a pure external interface, but internally it is represented as a reference to an IOArray.

When the '//' operator is applied to a diff array, its contents are physically updated in place. The old array silently changes its representation without changing the visible behavior: it stores a link to the new current array along with the difference to be applied to get the old contents.

So if a diff array is used in a single-threaded style, that is, after '//' application the old version is no longer used, a!i takes O(1) time and a//d takes O(length d). Accessing elements of older versions gradually becomes slower.

Updating an array which is not current makes a physical copy. The resulting array is unlinked from the old family. So you can obtain a version which is guaranteed to be current and thus has fast element access by a//[].

The library provides two 'differential' array constructors: DiffArray, made internally from IOArray, and DiffUArray, based on IOUArray. If you really need to, you can construct new 'differential' array types from any 'MArray' types living in the 'IO' monad. Since GHC-6.12, DiffArray has been split off into a separate package because it is 'unusably slow'. See Hackage documentation for further details.

Usage of DiffArray doesn't differ from that of Array; the only difference is memory consumption and speed:

You can use 'seq' to force evaluation of array elements prior to updating an array:

Unboxed arrays

In most implementations of lazy evaluation, values are represented at runtime as pointers to either their value, or code for computing their value. This extra level of indirection, together with any extra tags needed by the runtime, is known as a box. The default 'boxed' arrays consist of many of these boxes, each of which may compute its value separately. This allows for many neat tricks, like recursively defining an array's elements in terms of one another, or only computing the specific elements of the array which are ever needed. However, for large arrays, it costs a lot in terms of overhead, and if the entire array is always needed, it can be a waste.

Unboxed arrays are more like arrays in C: they contain just the plain values without this extra level of indirection, so that, for example, an array of 1024 values of type Int32 will use only 4 KB of memory. Moreover, indexing of such arrays can be significantly faster.

Of course, unboxed arrays have their own disadvantages. First, unboxed arrays can be made only of plain values having a fixed size: Int, Word, Char, Bool, Ptr, Double, etc. (see the full list in the Data.Array.Unboxed module). You can even implement unboxed arrays yourself for other simple types, including enumerations. But Integer, String and any other types defined with variable size cannot be elements of unboxed arrays. Second, without that extra level of indirection, all of the elements in an unboxed array must be evaluated when the array is evaluated, so you lose the benefits of lazy evaluation. Indexing the array to read just one element will construct the entire array. This is not much of a loss if you will eventually need the whole array, but it does prevent recursively defining the array elements in terms of each other, and may be too expensive if you only ever need specific values. Nevertheless, unboxed arrays are a very useful optimization instrument, and I recommend using them as much as possible.

All main array types in the library have unboxed counterparts:

  • Array and UArray
  • IOArray and IOUArray
  • STArray and STUArray
  • DiffArray and DiffUArray

So, basically, replacing boxed arrays in your program with unboxed ones is very simple: just add 'U' to the type signatures, and you are done! Of course, if you change Array to UArray, you also need to add 'Data.Array.Unboxed' to your imports list.

StorableArray (module Data.Array.Storable)

A storable array is an IO-mutable array which stores its contents in a contiguous memory block living in the C heap. Elements are stored according to the class 'Storable'. You can obtain the pointer to the array contents to manipulate elements from languages like C.

It is similar to 'IOUArray' (in particular, it implements the same MArray interface) but slower. The advantage is that it's compatible with C through the foreign function interface. The memory addresses of storable arrays are fixed, so you can pass them to C routines.

The pointer to the array contents is obtained by 'withStorableArray'. The idea is similar to 'ForeignPtr' (used internally here). The pointer should be used only during execution of the 'IO' action returned by the function passed as argument to 'withStorableArray'.

If you want to use this pointer afterwards, ensure that you call 'touchStorableArray' AFTER the last use of the pointer, so that the array will not be freed too early.

Additional comments: GHC 6.6 made access to 'StorableArray' as fast as to any other unboxed arrays. The only difference between 'StorableArray' and 'UArray' is that UArray lies in the relocatable part of the GHC heap while 'StorableArray' lies in the non-relocatable part and therefore keeps a fixed address, which allows this address to be passed to C routines and saved in C data structures.

GHC 6.6 also adds an 'unsafeForeignPtrToStorableArray' operation that allows the use of any Ptr as the address of a 'StorableArray' and in particular works with arrays returned by C routines. Here is an example of using this operation:

This example allocates memory for 10 Ints (which emulates an array returned by some C function), then converts the returned 'Ptr Int' to 'ForeignPtr Int' and 'ForeignPtr Int' to 'StorableArray Int Int'. It then writes and reads the first element of the array. At the end, the memory used by the array is deallocated by 'free', which again emulates deallocation by C routines. We can also enable the automatic freeing of the allocated block by replacing 'newForeignPtr_ ptr' with 'newForeignPtr finalizerFree ptr'. In this case memory will be automatically freed after the last array usage, as for any other Haskell objects.

The Haskell Array Preprocessor (STPP)

Using mutable (IO and ST) arrays in Haskell is not very handy. But there is one tool which adds syntactic sugar to make the use of such arrays very close to that of imperative languages. It is written by Hal Daume III.

Using this tool, you can index array elements in arbitrarily complex expressions with the notation 'arr[ i ]' and the preprocessor will automatically convert these forms to the appropriate calls to 'readArray' and 'writeArray'. Multi-dimensional arrays are also supported, with indexing in the form 'arr[ i ][ j ]'.

Repa package

Another option for arrays in Haskell which is worth consideration is REgular PArallel arrays (Repa). Repa is a Haskell library for high performance, regular, multi-dimensional parallel arrays. It makes it easy to take advantage of multi-core CPUs. Repa also provides list-like operations on arrays such as map, fold and zipWith; moreover, Repa arrays are instances of Num, which comes in handy for many applications.

Repa employs a different syntax for arrays, which is also used in the experimental accelerate package. Data.Array.Accelerate aims to gain performance by using GPGPU (via CUDA).

Repa possesses a number of other interesting features, such as exporting/importing arrays from ASCII or BMP files. For further information consult the Repa tutorial.

ArrayRef library

The ArrayRef library reimplements array libraries with the following extensions:

  • dynamic (resizable) arrays
  • polymorphic unboxed arrays

It also adds syntactic sugar which simplifies array usage. Although not as elegant as STPP, it is implemented entirely inside the Haskell language without requiring any preprocessors.

Unsafe indexing, freezing/thawing, running over array elements

There are operations that convert between mutable and immutable arrays of the same type, namely 'freeze' (mutable->immutable) and 'thaw' (immutable->mutable). They make a new copy of the array. If you are sure that a mutable array will not be modified or that an immutable array will not be used after the conversion, you can use unsafeFreeze/unsafeThaw. These operations convert the array in-place if the input and resulting arrays have the same memory representation (i.e. the same type and boxing). Please note that the 'unsafe*' operations modify memory: they set/clear a flag in the array header which specifies array mutability. So these operations can't be used together with multi-threaded access to arrays (using threads or some form of coroutines).

There are also operations that convert unboxed arrays to another element type, namely castIOUArray and castSTUArray. These operations rely on the actual type representation in memory and therefore there are no guarantees on their results. In particular, these operations can be used to convert any unboxable value to a sequence of bytes and vice versa. For example, they are used in the AltBinary library to serialize floating-point values. Please note that these operations don't recompute array bounds to reflect any changes in element size. You need to do that yourself using the 'sizeOf' operation.

While arrays can have any type of index, the internal representation only accepts Ints for indexing. The array libraries first use the Ix class to translate the polymorphic index into an Int. An internal indexing function is then called on this Int index. The internal functions are: unsafeAt, unsafeRead and unsafeWrite, found in the Data.Array.Base module. You can use these operations yourself in order to speed up your program by avoiding bounds checking. These functions are marked 'unsafe' for a good reason: they allow the programmer to access and overwrite arbitrary addresses in memory. These operations are especially useful if you need to walk through an entire array:

'unsafe*' operations in such loops are really safe because 'i' loops only through positions of existing array elements.

GHC-specific topics

Parallel arrays (module GHC.PArr)

As we already mentioned, the array library supports two array varieties: lazy boxed arrays and strict unboxed ones. A parallel array implements something intermediate: it's a strict boxed immutable array. This keeps the flexibility of using any data type as an array element while making both creation of and access to such arrays much faster. Array creation is implemented as one imperative loop that fills all the array elements, while accesses to array elements don't need to check the 'box'. It should be obvious that parallel arrays are not efficient in cases where the calculation of array elements is relatively complex and most elements will not be used. One more drawback of practical usage is that parallel arrays don't support the IArray interface, which means that you can't write generic algorithms which work both with Array and the parallel array constructor.

Like many GHC extensions, this is described in a paper: An Approach to Fast Arrays in Haskell, by Manuel M. T. Chakravarty and Gabriele Keller.

You can also look at the sources of GHC.PArr module, which contains a lot of comments.

The special syntax for parallel arrays is enabled by 'ghc -fparr' or 'ghci -fparr' which is undocumented in the GHC 6.4.1 user manual.

Welcome to the machine: Array#, MutableArray#, ByteArray#, MutableByteArray#, pinned and moveable byte arrays

The GHC heap contains two kinds of objects. Some are just byte sequences, while the others are pointers to other objects (so-called 'boxes'). This segregation allows the system to find chains of references when performing garbage collection and to update these pointers when memory used by the heap is compacted and objects are moved to new places. The internal (raw) GHC type Array# represents a sequence of object pointers (boxes). There is a low-level operation in the ST monad which allocates an array of specified size in the heap. Its type is something like (Int -> ST s Array#). The Array# type is used inside the Array type which represents boxed immutable arrays.

There is a different type for mutable boxed arrays (IOArray/STArray), namely MutableArray#. A separate type for mutable arrays is required because of the 2-stage garbage collection mechanism. The internal representations of Array# and MutableArray# are the same apart from some flags in the header, and this makes it possible to perform in-place conversion between MutableArray# and Array# (this is what the unsafeFreeze and unsafeThaw operations do).

Unboxed arrays are represented by the ByteArray# type. This is just a plain memory area in the Haskell heap, like a C array. There are two primitive operations that create a ByteArray# of specified size. One allocates memory in the normal heap and so this byte array can be moved when garbage collection occurs. This prevents the conversion of a ByteArray# to a plain memory pointer that can be used in C procedures (although it's still possible to pass a current ByteArray# pointer to an 'unsafe foreign' procedure if the latter doesn't try to store this pointer somewhere). The second primitive allocates a ByteArray# of a specified size in the 'pinned' heap area, which contains objects with a fixed location. Such a byte array will never be moved by garbage collection, so its address can be used as a plain Ptr and shared with the C world. The first way to create a ByteArray# is used inside the implementation of all UArray types, while the second way is used in StorableArray (although StorableArray can also point to data allocated by C malloc). A pinned ByteArray# is also used in ByteString.

There is also a MutableByteArray# type which is very similar to ByteArray#, but GHC's primitives support only monadic read/write operations for MutableByteArray#, and only pure reads for ByteArray#, as well as the unsafeFreeze/unsafeThaw operations which change appropriate fields in the headers of these arrays. This differentiation doesn't make much sense except for additional safety checks.

So, pinned MutableByteArray# or C malloced memory is used inside StorableArray, pinned ByteArray# or C malloced memory inside ByteString, unpinned MutableByteArray# inside IOUArray and STUArray, and unpinned ByteArray# is used inside UArray.

The APIs of boxed and unboxed arrays are almost identical:

Based on these primitive operations, the array library implements indexing with any type and with any lower bound, bounds checking and all other high-level operations. Operations that create immutable arrays just create them as mutable arrays in the ST monad, make all required updates on this array, and then use unsafeFreeze before returning the array from runST. Operations on IO arrays are implemented via operations on ST arrays using the stToIO operation.

Mutable arrays and GC

GHC implements 2-stage GC which is very fast. Minor GC occurs after each 256 KB allocated and scans only this area (plus recent stack frames) when searching for 'live' data. This solution uses the fact that normal Haskell data are immutable and therefore any data structures created before the previous minor GC can't point to data structures created after it, since due to immutability, data can contain only 'backward' references.

But this simplicity breaks down when we add mutable boxed references (IORef/STRef) and arrays (IOArray/STArray) to the language. On each GC, including minor ones, each element in a mutable data structure has to be scanned because it may have been updated since the last GC to point to data allocated since then.

For programs that contain a lot of data in mutable boxed arrays/references, GC times may easily outweigh the useful computation time. Ironically, one such program is GHC itself. The solution for such programs is to add a command-line option like '+RTS -A10m', which increases the size of minor GC chunks from 256 KB to 10 MB, making minor GC 40 times less frequent. You can see the effect of this change by using the '+RTS -sstderr' option: '%GC time' should significantly decrease.

There is a way to include this option in your executable so that it will be used automatically on each execution: you should just add the following line to your project's C source file:

Of course, you can increase or decrease this value according to your needs.

Increasing the '-A' value doesn't come for free. Aside from the obvious increase in memory usage, execution times (of useful code) will also grow. The default '-A' value is tuned to be close to modern CPU cache sizes, so that most memory references fall inside the cache. When 10 MB of memory are allocated before doing GC, this data locality no longer holds. So increasing '-A' can either increase or decrease program speed. You should try various settings between 64 KB and 16 MB while running the program with 'typical' parameters and try to select the best setting for your specific program and CPU combination.

There is also another way to avoid increasing GC times: use either unboxed or immutable arrays. Also note that immutable arrays are built as mutable ones and then 'frozen', so during the construction time GC will also scan their contents.

Hopefully, GHC 6.6 has fixed the problem: it remembers which references/arrays were updated since the last GC and scans only them. You can suffer from the old problems only if you use very large arrays.


Notes for contributors to this page

If you have any questions, please ask at the IRC/mailing list. If you have any answers, please submit them directly to this page. Please don't sign your contributions, so that anyone will feel free to further improve this page. But if you are a compiler/array libraries author, please sign your text to let us know that it is the Last Word of Truth :-)


OpenMP is one of the most popular solutions to parallel computation in C/C++. OpenMP is a mature API that has been around for about two decades; the first OpenMP API spec came out for Fortran (yes, FORTRAN). OpenMP provides a high level of abstraction and allows compiler directives to be embedded in the source code.

Ease of use and flexibility are amongst the main advantages of OpenMP. In OpenMP, you do not see how each and every thread is created, initialized, managed and terminated. You will not see a function declaration for the code each thread executes. You will not see how the threads are synchronized or how reduction will be performed to procure the final result. You will not see exactly how the data is divided between the threads or how the threads are scheduled. This, however, does not mean that you have no control. OpenMP has a wide array of compiler directives that allow you to decide each and every aspect of parallelization: how you want to split the data, static or dynamic scheduling, locks, nested locks, subroutines to set multiple levels of parallelism, etc.

Another important advantage of OpenMP is that it is very easy to convert a serial implementation into a parallel one. In many cases, serial code can be made to run in parallel with little or no change to the existing code beyond adding directives. This makes OpenMP a great option when converting a pre-written serial program into a parallel one. Further, it is still possible to run the program in serial; all the programmer has to do is remove the OpenMP directives.

Understanding OpenMP

First, let’s see what OpenMP is:

OpenMP, short for “Open Multi-Processing”, is an API that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran - on most platforms, processor architectures and operating systems.

OpenMP consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior. So basically when we use OpenMP, we use directives to tell the compiler the details of how our code should be run in parallel. Programmers do not have to write explicit threading code; they just have to inform the compiler to do so. It is imperative to note that the compiler does not check whether the given code is parallelizable or whether there are any race conditions; it is the responsibility of the programmer to do the required checks for parallelism.

OpenMP is designed for multi-processor/core, shared memory machines and can only be run on shared memory computers. OpenMP programs accomplish parallelism exclusively through the use of threads. There’s a master thread that forks a number of slave threads that do the actual computation in parallel. The master plays the role of a manager. All the threads exist within a single process.

By default, each thread executes the parallelized section of code independently. Work-sharing constructs can be used to divide a task among the threads so that each thread executes its allocated part of the code. Therefore, both task parallelism and data parallelism can be achieved using OpenMP.
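To make the distinction concrete, here is a minimal sketch contrasting the two styles. The function names fill_by_thread_id and fill_by_worksharing are made up for illustration, and the _OPENMP guards are an assumption added so that the code also builds serially when OpenMP is not enabled.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Coarse-grained style: every thread runs the same block and picks its own
// slice of x from its thread number.
void fill_by_thread_id(std::vector<int>& x) {
    const int n = static_cast<int>(x.size());
    #pragma omp parallel
    {
#ifdef _OPENMP
        const int tid = omp_get_thread_num();
        const int nthreads = omp_get_num_threads();
#else
        const int tid = 0;       // serial fallback: one "thread" owns everything
        const int nthreads = 1;
#endif
        assert(nthreads >= 1);
        const int chunk = (n + nthreads - 1) / nthreads;  // ceiling division
        const int begin = tid * chunk;
        const int end = (begin + chunk < n) ? begin + chunk : n;
        for (int i = begin; i < end; ++i) x[i] = i * i;
    }
}

// Work-sharing style: the for construct divides the iterations for us.
void fill_by_worksharing(std::vector<int>& x) {
    const int n = static_cast<int>(x.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) x[i] = i * i;
}
```

Both functions produce the same array; the second simply leaves the partitioning to the OpenMP runtime.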

Though not the most efficient method, OpenMP provides one of the easiest parallelization solutions for programs written in C and C++.

Linear Search

For our first example, let’s look at linear search.

Linear search or sequential search is a method for finding a target value within a list. It sequentially checks each element of the list for the target value until a match is found or until all the elements have been searched.

Linear search is one of the simplest algorithms to implement and has a worst-case complexity of O(n), i.e. the algorithm has to scan through the entire list to find the element. This happens when the required element isn’t in the list or is present right at the end.

By parallelizing the implementation, we make multiple threads split the data amongst themselves and then search for the element independently on their part of the list.

Here’s the serial implementation:
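The original listing is not reproduced here, so the following is a minimal sketch of a serial linear search. The name linear_search and the use of std::vector<int> are assumptions made for illustration.

```cpp
#include <cassert>
#include <vector>

// Returns the index of the first element equal to key, or -1 if it is absent.
int linear_search(const std::vector<int>& a, int key) {
    const int n = static_cast<int>(a.size());
    for (int i = 0; i < n; ++i) {
        if (a[i] == key) return i;  // early exit on the first match
    }
    return -1;
}

// Small usage example.
void linear_search_example() {
    const std::vector<int> data{4, 8, 15, 16};
    assert(linear_search(data, 15) == 2);
    assert(linear_search(data, 23) == -1);
}
```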

Parallelizing Linear Search through OpenMP

In order to use OpenMP’s library routines, we will have to include the header file 'omp.h'. When compiling with GCC or Clang, we’ll have to pass the flag -fopenmp. All the directives start with #pragma omp ... .

In the above serial implementation, there is a window to parallelize the for loop. To parallelize the for loop, the OpenMP directive is: #pragma omp parallel for. This directive tells the compiler to parallelize the for loop below. As I’ve said before, the compiler makes no checks to see if the loop is parallelizable; it is the responsibility of the programmer to make sure that the loop can be parallelized.

Whilst parallelizing the loop, it is not possible to return from within the if statement if the element is found. This is due to the fact that returning from the if will result in an invalid branch out of an OpenMP structured block. Hence we will have to change the implementation a bit.
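A minimal sketch of such a modified serial version, assuming a std::vector<int> input and a made-up function name: the match is recorded in a variable and the function returns only after the loop has finished.

```cpp
#include <cassert>
#include <vector>

// Scans the whole array and never leaves the loop early.
// Returns the index of the first match, or -1 if there is none.
int linear_search_no_early_return(const std::vector<int>& a, int key) {
    const int n = static_cast<int>(a.size());
    int found_at = -1;
    for (int i = 0; i < n; ++i) {
        // Record only the first match; later matches leave found_at alone.
        if (a[i] == key && found_at == -1) found_at = i;
    }
    return found_at;
}

// Small usage example.
void no_early_return_example() {
    const std::vector<int> data{7, 3, 7};
    assert(linear_search_no_early_return(data, 7) == 0);
    assert(linear_search_no_early_return(data, 9) == -1);
}
```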

The above snippet will keep on scanning the input till the end regardless of a match, so it does not have any invalid branches out of an OpenMP block. The loop body only reads the array; the single variable it writes is the one holding the result, so that is the only place where a race could occur. Now, let’s parallelize this:
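One way to sketch the parallel version is shown below. Note the swapped-in technique: instead of letting every thread write to a shared result variable, this sketch uses a min reduction (available since OpenMP 3.1), which avoids a data race on the result and also yields the lowest matching index. The function name is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Parallel linear search. Each thread keeps a private copy of `first`,
// and the copies are combined with min when the loop ends.
int parallel_linear_search(const std::vector<int>& a, int key) {
    const int n = static_cast<int>(a.size());
    int first = n;  // sentinel meaning "no match"
    #pragma omp parallel for reduction(min : first)
    for (int i = 0; i < n; ++i) {
        if (a[i] == key && i < first) first = i;
    }
    return first == n ? -1 : first;
}

// Small usage example.
void parallel_linear_search_example() {
    const std::vector<int> data{7, 3, 7, 3};
    assert(parallel_linear_search(data, 3) == 1);
    assert(parallel_linear_search(data, 9) == -1);
}
```

Without -fopenmp the pragma is ignored and the function runs as an ordinary serial loop with the same result.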

It is as simple as this: all that had to be done was adding the compiler directive and it gets taken care of, completely. The implementation didn’t have to be changed much. We didn’t have to worry about the actual implementation, scheduling, data split and other details. There’s a high level of abstraction. Also, the code will run in serial after the OpenMP directives have been removed, albeit with the modification.

It is worth mentioning that with the parallel implementation, each and every element will be checked regardless of a match, though in parallel. This is due to the fact that no thread can directly return after finding the element. So, our parallel implementation will be slower than the serial implementation if the element to be found is present in the range [0, (n/p)-1], where n is the length of the array and p is the number of parallel threads/sub-processes.

Further, if there is more than one instance of the required element present in the array, there is no guarantee that the parallel linear search will return the first match. The order in which threads run and terminate is non-deterministic. There is no way of knowing which thread will return first or last. To preserve the order of the matched results, another attribute (the index) has to be added to the results.


Selection Sort

Now, let’s look at our second example - Selection Sort.

Selection sort is an in-place comparison sorting algorithm. Selection sort is noted for its simplicity, and it has performance advantages over more complicated algorithms in certain situations, particularly where auxiliary memory is limited.

In selection sort, the list is divided into two parts, the sorted part at the left end and the unsorted part at the right end. Initially, the sorted part is empty and the unsorted part is the entire list.

The smallest/largest element is selected from the unsorted array and swapped with the leftmost element, and that element becomes a part of the sorted array. This process continues moving unsorted array boundary by one element to the right.

Selection Sort has a time complexity of O(n^2), making it unsuitable for large lists.

By parallelizing the implementation, we make multiple threads split the data amongst themselves and then search for the largest element independently on their part of the list. Each thread locally stores its own largest element. Then the local results are reduced into one final maximum.

Here’s the serial implementation:
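The original listing is missing, so here is a minimal sketch of a serial selection sort. As an assumption that matches the later discussion of the maximum element, it selects the largest remaining element and grows the sorted part from the left, which gives descending order. The function name is made up for illustration.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Serial selection sort, descending: repeatedly move the largest element of
// the unsorted part a[start..n-1] to position start.
void selection_sort(std::vector<int>& a) {
    const int n = static_cast<int>(a.size());
    for (int start = 0; start < n - 1; ++start) {
        int max_pos = start;
        for (int i = start + 1; i < n; ++i) {
            if (a[i] > a[max_pos]) max_pos = i;
        }
        std::swap(a[start], a[max_pos]);
    }
}

// Small usage example.
void selection_sort_example() {
    std::vector<int> v{3, 1, 2};
    selection_sort(v);
    assert((v == std::vector<int>{3, 2, 1}));
}
```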

Parallelizing Selection Sort through OpenMP

First, let’s look at potential parallelization windows. The outer loop is not parallelizable owing to the fact that there are frequent changes made to the array and that every ith iteration needs the (i-1)th to be completed.

In selection sort, the parallelizable region is the inner loop, where we can spawn multiple threads to look for the maximum element in the unsorted part of the array. This could be done by making sure each thread has its own local copy of the local maximum. Then we can reduce each local maximum into one final maximum.

Reduction can be performed in OpenMP through the reduction clause, reduction(op : va), where op defines the operation that needs to be applied whilst performing reduction on the variable va.
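As a minimal sketch of the clause in use, the following sums an array with the built-in + operation. The function name and the long long accumulator are choices made for this illustration.

```cpp
#include <cassert>
#include <vector>

// Each thread accumulates into a private copy of total (initialized to 0);
// the private copies are added together when the loop ends.
long long sum_array(const std::vector<int>& a) {
    const int n = static_cast<int>(a.size());
    long long total = 0;
    #pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < n; ++i) {
        total += a[i];
    }
    return total;
}

// Small usage example.
void sum_array_example() {
    assert(sum_array({1, 2, 3, 4}) == 10);
}
```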

However, in the implementation, we are not looking for the maximum element, instead we are looking for the index of the maximum element. For this we need to declare a new custom reduction. The ability to describe our own custom reduction is a testament to the flexibility that OpenMP provides.

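A reduction can be declared by using the directive #pragma omp declare reduction(name : type : combiner), optionally followed by an initializer clause.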

The declared reduction clause receives a struct. So, our custom maximum index reduction will look something like this:
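A minimal sketch of such a declaration follows. The struct name Compare, the reduction name maximum and the helper index_of_max are assumptions chosen for illustration; the initializer clause starts every private copy from the value the variable had before the loop.

```cpp
#include <cassert>
#include <vector>

// A value together with the position it was found at.
struct Compare {
    int val;
    int index;
};

// Combiner: keep whichever partial result holds the larger value.
#pragma omp declare reduction(maximum : Compare : \
    omp_out = omp_in.val > omp_out.val ? omp_in : omp_out) \
    initializer(omp_priv = omp_orig)

// Returns the index of a maximum element of a non-empty array.
int index_of_max(const std::vector<int>& a) {
    assert(!a.empty());
    const int n = static_cast<int>(a.size());
    Compare best{a[0], 0};
    #pragma omp parallel for reduction(maximum : best)
    for (int i = 1; i < n; ++i) {
        if (a[i] > best.val) {
            best.val = a[i];
            best.index = i;
        }
    }
    return best.index;
}
```

If the maximum occurs more than once, which of its positions is returned may depend on how the iterations are split among threads.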

Now, let’s work on parallelizing the inner loop through OpenMP. We’ll need to store both the maximum value as well as its index.
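Putting the pieces together, a parallel selection sort could be sketched as follows. The struct and reduction are repeated so that the block compiles on its own; all names are hypothetical, and the sort order is descending because the maximum is moved to the left end of the unsorted part.

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct Compare {
    int val;
    int index;
};

#pragma omp declare reduction(maximum : Compare : \
    omp_out = omp_in.val > omp_out.val ? omp_in : omp_out) \
    initializer(omp_priv = omp_orig)

// The outer loop stays serial; only the search for the maximum of the
// unsorted part a[start..n-1] is shared among the threads.
void parallel_selection_sort(std::vector<int>& a) {
    const int n = static_cast<int>(a.size());
    for (int start = 0; start < n - 1; ++start) {
        Compare best{a[start], start};
        #pragma omp parallel for reduction(maximum : best)
        for (int i = start + 1; i < n; ++i) {
            if (a[i] > best.val) {
                best.val = a[i];
                best.index = i;
            }
        }
        std::swap(a[start], a[best.index]);
    }
}

// Small usage example.
void parallel_selection_sort_example() {
    std::vector<int> v{5, 1, 4, 2, 3};
    parallel_selection_sort(v);
    assert((v == std::vector<int>{5, 4, 3, 2, 1}));
}
```

Because a parallel region is opened on every outer iteration, the threading overhead can dominate for small arrays.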



Now that we’ve parallelized our serial implementation, let’s see if the program produces the required output. For that, we can have a simple verify function that checks if the array is sorted.
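A minimal sketch of such a check is given below. The descending flag is an assumption added here so that the same helper can check either sort order; the original verify function is not shown in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true if a is sorted: non-decreasing by default,
// non-increasing when descending is true.
bool verify(const std::vector<int>& a, bool descending = false) {
    for (std::size_t i = 1; i < a.size(); ++i) {
        const bool out_of_order =
            descending ? a[i - 1] < a[i] : a[i - 1] > a[i];
        if (out_of_order) return false;
    }
    return true;
}

// Small usage example.
void verify_example() {
    assert(verify({1, 2, 2, 3}));
    assert(!verify({3, 1}));
    assert(verify({3, 1}, true));
}
```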

After running the new sort implementation with the verify function for 100000 elements:

So, the parallel implementation is equivalent to the serial implementation and produces the required output.



Merge Sort

Mergesort is one of the most popular sorting techniques. It is the typical example for demonstrating the divide-and-conquer paradigm.

Merge sort (also commonly spelled mergesort) is an efficient, general-purpose, comparison-based sorting algorithm.

Mergesort has a worst-case serial running time of O(n log n).

Sorting an array: A[p .. r] using mergesort involves three steps.

1) Divide Step

If a given array A has zero or one element, simply return; it is already sorted. Otherwise, split A[p .. r] into two subarrays A[p .. q] and A[q + 1 .. r], each containing about half of the elements of A[p .. r]. That is, q is the halfway point of A[p .. r].

2) Conquer Step


Conquer by recursively sorting the two subarrays A[p .. q] and A[q + 1 .. r].

3) Combine Step

Combine the elements back in A[p .. r] by merging the two sorted subarrays A[p .. q] and A[q + 1 .. r] into a sorted sequence. To accomplish this step, we will define a procedure MERGE (A, p, q, r).

We can parallelize the “conquer” step, where the left and right subarrays are sorted recursively. The two subarrays can be sorted in parallel.

Here’s the serial implementation:
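Since the original listing is not reproduced, the three steps above can be sketched as follows. The names merge_ranges (standing in for MERGE) and merge_sort are chosen for illustration, and the indices p, q and r are inclusive, as in the description.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// MERGE(A, p, q, r): merge the sorted runs a[p..q] and a[q+1..r] (inclusive)
// into one sorted run a[p..r], using temporary copies of both halves.
void merge_ranges(std::vector<int>& a, int p, int q, int r) {
    const std::vector<int> left(a.begin() + p, a.begin() + q + 1);
    const std::vector<int> right(a.begin() + q + 1, a.begin() + r + 1);
    std::size_t i = 0, j = 0;
    int k = p;
    while (i < left.size() && j < right.size()) {
        if (left[i] <= right[j]) a[k++] = left[i++];
        else a[k++] = right[j++];
    }
    while (i < left.size()) a[k++] = left[i++];
    while (j < right.size()) a[k++] = right[j++];
}

// Sorts a[p..r] (inclusive) in ascending order.
void merge_sort(std::vector<int>& a, int p, int r) {
    if (p >= r) return;             // zero or one element: already sorted
    const int q = p + (r - p) / 2;  // divide
    merge_sort(a, p, q);            // conquer the left half
    merge_sort(a, q + 1, r);        // conquer the right half
    merge_ranges(a, p, q, r);       // combine
}

// Small usage example.
void merge_sort_example() {
    std::vector<int> v{5, 2, 9, 1};
    merge_sort(v, 0, static_cast<int>(v.size()) - 1);
    assert((v == std::vector<int>{1, 2, 5, 9}));
}
```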

Parallelizing Merge Sort through OpenMP

As stated before, the parallelizable region is the “conquer” part. We need to make sure that the left and the right sub-arrays are sorted simultaneously. We need to run both the left and right sections in parallel.

This can be done in OpenMP using the directive #pragma omp parallel sections.

Each section that has to run in parallel should be a block preceded by the directive #pragma omp section.
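As a small sketch of the two directives working together, the following runs two independent computations, each in its own section. The Stats struct and the function name are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

struct Stats {
    long long sum;
    int max;
};

// Computes the sum and the maximum of a non-empty array. Each section may be
// executed by a different thread, and each writes a different member of s.
Stats sum_and_max(const std::vector<int>& a) {
    assert(!a.empty());
    Stats s{0, a[0]};
    #pragma omp parallel sections
    {
        #pragma omp section
        { s.sum = std::accumulate(a.begin(), a.end(), 0LL); }

        #pragma omp section
        { s.max = *std::max_element(a.begin(), a.end()); }
    }
    return s;
}
```

There is an implicit barrier at the end of the sections construct, so both members are set before the function returns.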

Now, let’s work on parallelizing both sections through OpenMP.

The above will parallelize both the left and right recursion.
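A minimal sketch of the idea, assuming inclusive indices p and r and a made-up function name. To keep the block short, the combine step uses std::inplace_merge from the standard library in place of a hand-written MERGE procedure.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sorts a[p..r] (inclusive) in ascending order; the two recursive calls are
// placed in separate sections so that they may run on different threads.
void parallel_merge_sort(std::vector<int>& a, int p, int r) {
    if (p >= r) return;
    const int q = p + (r - p) / 2;
    #pragma omp parallel sections
    {
        #pragma omp section
        { parallel_merge_sort(a, p, q); }

        #pragma omp section
        { parallel_merge_sort(a, q + 1, r); }
    }
    // Combine: merge the sorted halves a[p..q] and a[q+1..r].
    std::inplace_merge(a.begin() + p, a.begin() + q + 1, a.begin() + r + 1);
}

// Small usage example.
void parallel_merge_sort_example() {
    std::vector<int> v{5, 2, 9, 1};
    parallel_merge_sort(v, 0, static_cast<int>(v.size()) - 1);
    assert((v == std::vector<int>{1, 2, 5, 9}));
}
```

Many OpenMP runtimes serialize nested parallel regions unless nested parallelism is enabled, in which case only the top-level split actually runs on two threads; the result is correct either way.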


Now that we’ve parallelized our serial mergesort implementation, let’s see if the program produces the required output. For that, we can use the verify function that we used for our selection sort example.

Great, so the parallel implementation works.

That’s it for now, if you have any comments please leave them below.
