Development of Deep Learning frameworks for Exascale
October 7, 2021 – Corey Adams of Argonne National Laboratory is leading efforts to deploy advanced deep learning frameworks on Aurora, the exascale system scheduled for delivery next year to the Argonne Leadership Computing Facility (ALCF), a U.S. Department of Energy (DOE) Office of Science user facility.
Adams, a computer scientist in the ALCF's Data Science group, holds a joint appointment with Argonne's Physics division. His research sits at the intersection of deep learning, AI, and fundamental physics, and includes contributions to Aurora's non-recurring engineering (NRE) efforts that target Aurora Early Science Program (ESP) projects, including connectomic brain-mapping work and the CANDLE cancer drug response prediction and treatment application, in addition to applications for astrophysics, neutrino physics, lattice quantum chromodynamics, Argonne's Advanced Photon Source (APS), and the Large Hadron Collider.
Collaborating with Intel to deliver Aurora
As the arrival of Aurora approaches, the Data Science group is working to ensure that the AI applications slated for deployment on the system will run at full performance from day one and will function correctly in relatively bug-free implementations. To this end, Adams and his colleagues have selected a number of Argonne workloads that represent innovative approaches to AI for science and that stand to benefit from the Aurora architecture.
In doing so, they relied on computer vision benchmarks established with Intel while developing the applications' capabilities from a scientific perspective; Adams serves as the ALCF point of contact for these deep learning projects.
Performance tracking is twofold: Intel reports metrics for selected applications, while the Argonne team uses GitLab CI/CD on Joint Laboratory for System Evaluation (JLSE) testbeds to track application performance and stability, running tests on a weekly basis.
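A weekly CI job of this kind typically compares measured throughput against stored baselines and fails the pipeline when a regression exceeds some tolerance. The sketch below illustrates that pattern in plain Python; the metric names, baseline numbers, and 10% tolerance are illustrative assumptions, not ALCF values.

```python
# Hypothetical sketch of a performance-regression check of the kind a
# GitLab CI/CD pipeline might run weekly on JLSE testbeds.
# Metric names and baselines below are invented for illustration.

BASELINES = {
    "app_a_images_per_sec": 250.0,   # assumed baseline throughput
    "app_b_samples_per_sec": 1200.0,
}
TOLERANCE = 0.10  # flag anything more than 10% below baseline


def find_regressions(measured: dict) -> list:
    """Return the names of metrics that fell below baseline * (1 - tolerance)."""
    regressions = []
    for name, baseline in BASELINES.items():
        value = measured.get(name)
        if value is None or value < baseline * (1.0 - TOLERANCE):
            regressions.append(name)
    return sorted(regressions)


# A CI job would exit nonzero if any metric regressed:
assert find_regressions({"app_a_images_per_sec": 260.0,
                         "app_b_samples_per_sec": 1250.0}) == []
```

In a real pipeline the measured values would come from the application runs themselves, and the failure would surface directly in the GitLab job status.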
Scaling up and scaling out with deep learning frameworks
Deep learning frameworks can be scaled up or scaled out.
The first process, scaling up, optimizes an application for the fastest possible performance on a single graphics processing unit (GPU). Scaling out, by contrast, distributes an application across multiple GPUs. The ALCF expects that Aurora, like other upcoming exascale systems, will derive most of its compute power from GPUs.
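The scale-out half of this picture usually means data parallelism: a global training batch is split evenly across workers, one per GPU, each of which processes its shard independently. A minimal sketch of that sharding step, with made-up worker counts and batch sizes:

```python
# Illustrative sketch of data-parallel "scaling out": a global batch is
# split into equal contiguous shards, one per worker (GPU).
# Worker count and batch contents are invented for illustration.

def shard_batch(batch, num_workers):
    """Split a batch into num_workers contiguous, equal-sized shards."""
    per_worker = len(batch) // num_workers
    return [batch[i * per_worker:(i + 1) * per_worker]
            for i in range(num_workers)]


global_batch = list(range(16))          # 16 samples
shards = shard_batch(global_batch, 4)   # 4 workers, 4 samples each
assert len(shards) == 4
assert all(len(s) == 4 for s in shards)
```

Scaling up, by contrast, happens inside each worker, in the kernels that execute operations like convolutions as fast as possible on one device.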
High-level Python frameworks such as TensorFlow and PyTorch rely on the Intel oneAPI Deep Neural Network Library (oneDNN) for compute-intensive GPU processes such as convolution operations, whose complex requirements frustrate attempts at out-of-the-box performance. This requires extensive iterations of development and testing before an effective kernel can be produced.
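To make concrete what such a kernel computes, here is a deliberately naive 2-D convolution (strictly, the cross-correlation that deep learning frameworks call "convolution") in pure Python. Libraries like oneDNN exist precisely because this triple-nested loop, done efficiently on a GPU, requires hardware-specific tuning far beyond what is shown here.

```python
# Naive 2-D "convolution" (cross-correlation, as DL frameworks define it),
# purely to illustrate the operation oneDNN provides tuned kernels for.
# No padding, no stride, single channel -- a toy version.

def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(image[i + di][j + dj] * kernel[di][dj]
                            for di in range(kh)
                            for dj in range(kw))
    return out


# A 2x2 kernel of ones sums each 2x2 neighborhood:
img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
assert conv2d(img, [[1, 1], [1, 1]]) == [[12, 16], [24, 28]]
```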
Once optimal performance is achieved on a single GPU, the Intel oneAPI Collective Communications Library (oneCCL) helps deliver optimal performance on multiple GPUs by providing optimized communication patterns that distribute data-parallel model training across an arbitrary number of nodes. oneCCL, and the synchrony it enables, thereby supports tasks such as the uniform collection of gradients from a training iteration.
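The "uniform collection of gradients" is an allreduce: every worker contributes its local gradients and every worker receives the element-wise mean. A pure-Python sketch of that semantics (the real library implements it with optimized inter-node communication, not a loop like this):

```python
# Sketch of allreduce-style gradient averaging, the collective that a
# library such as oneCCL performs during data-parallel training.
# Pure Python, for illustration only.

def allreduce_mean(worker_grads):
    """Element-wise mean of gradient vectors across all workers."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / n for i in range(length)]


# Two workers, each holding gradients for a 3-parameter model:
averaged = allreduce_mean([[1.0, 2.0, 3.0],
                           [3.0, 4.0, 5.0]])
assert averaged == [2.0, 3.0, 4.0]
```

After the collective, every worker applies the same averaged gradients, which keeps the model replicas synchronized across iterations.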
In other words, oneDNN provides fast, focused performance on a single GPU, while oneCCL provides fast distributed performance across multiple GPUs.
For more granular benchmarks, Adams and his team are working with Intel to track the performance of oneDNN and oneCCL independently of each other and of other GPU operations.
Source: Nils Heinonen, ALCF