@@ -481,6 +481,85 @@ reduce the cost of the second kernel launch.
481481parallelism, there is no similar technique for ` kernels ` , but the ` parallel `
482482approach above can be easily placed between ` kernels ` regions.*
483483
484+ Best Practice: Dealing with C++ End/Last Pointers
485+ -------------------------------------------------
486+ It is a common practice for list/array-like objects in C++ to store two
487+ pointers, one to the first element and one that points to a memory address
488+ immediately following the last element. This pattern can be found in the
489+ C++ Standard Template Library, for instance. Below is an example:
490+
491+ ``` cpp
492+ const size_t N = 1024 ;
493+ float *first = (float *)malloc(N * sizeof (float ));
494+ float *last = first + N; // Beyond the allocated memory
495+ ```
496+
497+ This pattern works fine on machines with a single address space, however can
498+ be problematic on machines with discrete host and device memory spaces.
499+ The default behavior for scalar pointer variables used in ` parallel ` or
500+ ` serial ` regions is to treat the pointer implicitly as-if it had appeared
501+ in a ` firstprivate ` clause. Consider the following code, however, which
502+ would break on a discrete memory machine.
503+
504+ ``` cpp
505+ #pragma acc parallel loop copy(first[0:1024])
506+ for (int i = 0 ; i < 1024 ; i++)
507+ {
508+ // first is a device address
509+ // last is a host address
510+ if ( first != last )
511+ {
512+ first[i] = (float)i;
513+ }
514+ }
515+ ```
516+
517+ Because `first` has been copied to the device, within the `parallel` region
518+ the device address will be used, but since `last` is implicitly firstprivate
519+ it will contain the host address. The `last` pointer does not actually point
520+ to any data and must remain relative to `first` in device memory, so
521+ attempting to copy it doesn't make sense. There is an unintuitive solution
522+ to this problem, however.
523+
524+ ```cpp
525+ #include <cstdio>
526+ #include <cstdlib>
527+ int main(int argc, char **argv)
528+ {
529+ float *first = (float*)malloc(1024 * sizeof(float));
530+ float *last = first + 1024;
531+ #pragma acc parallel loop copy(first[0:1024]) copy(last[-1:0])
532+ for (int i = 0; i < 1024 ; i++)
533+ {
534+ if ( first != last )
535+ {
536+ first[i] = (float)i;
537+ }
538+ }
539+ printf("[%d] %f : [%d] %f\n", 0, first[0], 1023, first[1023]);
540+ free(first);
541+ return 0;
542+ }
543+ ```
544+
545+ In this example we are copying the element immediately before ` last `
546+ and copying zero elements. This may be surprising, since C++ does not
547+ have arbitrary array bounds and copying zero elements seems nonsensical.
548+ What happens, however, is that a present table entry is created for
549+ ` last ` and, because we're saying to copy the element before ` last ` ,
550+ we're copying data that is already present on the device (which is defined
551+ as not copying any additional data). With this change, both ` first `
552+ and ` last ` will use device addresses and will be relative to the same
553+ base address.
554+
555+ *** Note:*** As-written this code assumes that the data clauses are
556+ evaluated from left to right, which is not strictly required. If processed
557+ from right to left the overlap between the two regions would likely
558+ result in a partially present error, since some of the memory already
559+ exists in the present table. At the time of writing OpenACC 3.4 is the
560+ current version and defining more tightly the order of operations to
561+ prevent this issue is a deferred topic.
562+
484563Case Study - Optimize Data Locality
485564-----------------------------------
486565By the end of the last chapter we had moved the main computational loops of
0 commit comments