Tuning Streamed Applications on Intel Xeon Phi: A Machine Learning Based Approach.

Authors
Peng Zhang, Jianbin Fang, Tao Tang, Canqun Yang, Zheng Wang

Many-core accelerators, as represented by the Xeon Phi coprocessors and GPGPUs, allow software to exploit spatial and temporal sharing of computing resources to improve overall system performance. Unlocking this performance potential requires software to effectively partition the hardware resources to maximize the overlap between host-device communication and accelerator computation, and to match the granularity of task parallelism to the resource partition. However, determining the right resource partition and task parallelism on a per-program, per-dataset basis is challenging, because the number of possible solutions is huge; choosing the right solution can bring large benefits, but mistakes can seriously hurt performance. In this paper, we present an automatic approach to determine the hardware resource partition and the task granularity for any given application, targeting the Intel Xeon Phi architecture. Instead of hand-crafting a heuristic, a process that would have to be repeated for each hardware generation, we employ machine learning techniques to learn it automatically. We achieve this by first learning a predictive model offline using training programs; we then use the learned model to predict the resource partition and task granularity for unseen programs at runtime. We apply our approach to 23 representative parallel applications and evaluate it on a CPU-Xeon Phi mixed heterogeneous many-core platform. Our approach achieves, on average, a 1.6x (up to 5.6x) speedup, which translates to 94.5% of the performance delivered by a theoretically perfect predictor.
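The abstract describes an offline-training, runtime-prediction workflow. Below is a minimal sketch of that general pattern, assuming hypothetical program features, an illustrative label encoding for the (resource partition, task granularity) configuration, and a scikit-learn classifier; the authors' actual feature set, model choice, and configuration space are not specified here.

```python
# Hypothetical sketch of the offline-train / runtime-predict workflow the
# abstract outlines. Feature names, the classifier, and the label encoding
# are illustrative assumptions, not the paper's actual design.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Offline phase: each training program/dataset contributes a feature vector
# (e.g. compute ratio, data transferred, thread count -- all hypothetical)
# and the best-found configuration label (a resource partition plus a task
# granularity) discovered by profiling.
train_features = np.array([
    [0.42, 128.0, 240],
    [0.10,  64.0, 120],
    [0.75, 512.0, 240],
])
train_labels = ["partition2-grain64", "partition4-grain32", "partition2-grain128"]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train_features, train_labels)

# Runtime phase: extract the same features from an unseen program and predict
# the configuration to use before launching work on the accelerator.
unseen_program = np.array([[0.55, 256.0, 240]])
predicted_config = model.predict(unseen_program)[0]
print(f"Use configuration: {predicted_config}")
```

The key property this sketch illustrates is that the costly search over configurations happens once, offline, over training programs; at deployment time only a cheap model lookup is needed per program.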
