Skip to content

Commit 3146c23

Browse files
committed
Documents C++ last pointer trick
Adds section on the best practice to handle C++ first/last pointers. Fixes #26 Signed-off-by: Jeff Larkin <[email protected]>
1 parent 9134e05 commit 3146c23

File tree

1 file changed

+79
-0
lines changed

1 file changed

+79
-0
lines changed

05-Data-Locality.markdown

Lines changed: 79 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -481,6 +481,85 @@ reduce the cost of the second kernel launch.
481481
parallelism, there is no similar technique for `kernels`, but the `parallel`
482482
approach above can be easily placed between `kernels` regions.*
483483

484+
Best Practice: Dealing with C++ End/Last Pointers
485+
-------------------------------------------------
486+
It is a common practice for list/array-like objects in C++ to store two
487+
pointers, one to the first element and one that points to a memory address
488+
immediately following the last element. This pattern can be found in the
489+
C++ Standard Template Library, for instance. Below is an example:
490+
491+
```cpp
492+
const size_t N = 1024;
493+
float *first = (float*)malloc(N * sizeof(float));
494+
float *last = first + N; // Beyond the allocated memory
495+
```
496+
497+
This pattern works fine on machines with a single address space, however can
498+
be problematic on machines with discrete host and device memory spaces.
499+
The default behavior for scalar pointer variables used in `parallel` or
500+
`serial` regions is to treat the pointer implicitly as-if it had appeared
501+
in a `firstprivate` clause. Consider the following code, however, which
502+
would break on a discrete memory machine.
503+
504+
```cpp
505+
#pragma acc parallel loop copy(first[0:1024])
506+
for (int i = 0; i < 1024 ; i++)
507+
{
508+
// first is a device address
509+
// last is a host address
510+
if ( first != last )
511+
{
512+
first[i] = (float)i;
513+
}
514+
}
515+
```
516+
517+
Because `first` has been copied to the device, within the `parallel` region
518+
the device address will be used, but since `last` is implicitly firstprivate
519+
it will contain the host address. The `last` pointer does not actually point
520+
to any data and must remain relative to `first` in device memory, so
521+
attempting to copy it doesn't make sense. There is an unintuitive solution
522+
to this problem, however.
523+
524+
```cpp
525+
#include <cstdio>
526+
#include <cstdlib>
527+
int main(int argc, char **argv)
528+
{
529+
float *first = (float*)malloc(1024 * sizeof(float));
530+
float *last = first + 1024;
531+
#pragma acc parallel loop copy(first[0:1024]) copy(last[-1:0])
532+
for (int i = 0; i < 1024 ; i++)
533+
{
534+
if ( first != last )
535+
{
536+
first[i] = (float)i;
537+
}
538+
}
539+
printf("[%d] %f : [%d] %f\n", 0, first[0], 1023, first[1023]);
540+
free(first);
541+
return 0;
542+
}
543+
```
544+
545+
In this example we are copying the element immediately before `last`
546+
and copying zero elements. This may be surprising, since C++ does not
547+
have arbitrary array bounds and copying zero elements seems nonsensical.
548+
What happens, however, is that a present table entry is created for
549+
`last` and, because we're saying to copy the element before `last`,
550+
we're copying data that is already present on the device (which is defined
551+
as not copying any additional data). With this change, both `first`
552+
and `last` will use device addresses and will be relative to the same
553+
base address.
554+
555+
***Note:*** As-written this code assumes that the data clauses are
556+
evaluated from left to right, which is not strictly required. If processed
557+
from right to left the overlap between the two regions would likely
558+
result in a partially present error, since some of the memory already
559+
exists in the present table. At the time of writing OpenACC 3.4 is the
560+
current version and defining more tightly the order of operations to
561+
prevent this issue is a deferred topic.
562+
484563
Case Study - Optimize Data Locality
485564
-----------------------------------
486565
By the end of the last chapter we had moved the main computational loops of

0 commit comments

Comments
 (0)