Performance impact of programmer-inserted Data Prefetches for irregular access patterns with a case study of FMM VList algorithm
MetadataShow full item record
Data Prefetching is a well-known technique to speed up applications wherein hardware prefetchers or compilers speculatively prefetch data into caches closer to the processor to ensure it’s readily available when the processor demands it. Since incorrect speculation leads to prefetching useless data which, in turn, results in wasting memory bandwidth and polluting caches, prefetch mechanisms are usually conservative and prefetch on spotting fairly regular access patterns only. This gives the programmer with a knowledge of application, an opportunity to insert fine-grain software prefetches in the code to clinically prefetch the data that is certain to be demanded but whose access pattern is not too obvious for hardware prefetchers or compiler to detect. In this study, the author demonstrates the performance improvement obtained by such programmer-inserted prefetches with the case study of an FMM (Fast Multipole Method) VList application kernel run with several different configurations. The VList computation requires computing the Hadamard product of matrices. However, the way each node of the octree is stored in the memory, leads to indirect accessing of elements where memory accesses themselves are not sequential but the pointers pointing to those memory locations are still stored sequentially. Since compilers do not insert prefetches for indirect accesses, and to hardware, the access pattern appears random, programmer-inserted prefetching is the only solution for such a case. The author demonstrates the performance gain obtained by employing different prefetching choices in terms of what all structures in the code to prefetch and which level of cache to prefetch those to and also presents an analysis of the impact of different configuration parameters on performance gain. The author shows that there are several prefetching combinations which always bring performance gain without ever hurting the performance, and also identifies prefetching to L1 cache and prefetching all data structures in question, as the best prefetching recommendation for this application kernel. It is shown that this one combination gets the highest performance gain for most run configurations and an average performance gain of 10.14% across all run configurations.