Addressing the challenge of data processing at HL-LHC
The ATLAS Experiment Implements Heterogeneous Particle Reconstruction with Intel® oneAPI Tools
Using Intel® oneAPI tools, the ATLAS Experiment at the Large Hadron Collider is on track to achieve many-fold performance improvements using multi-architecture CPU+GPU systems in processing future data from the detector.
The ATLAS Experiment
The ATLAS Experiment is a general-purpose particle physics experiment at the Large Hadron Collider (LHC). Its goal is to understand the nature of the smallest building blocks of matter in our universe by studying collisions of protons and heavy nuclei at the high-energy frontier. The high-energy collisions also recreate conditions that would have existed mere moments after the Big Bang.
A simplified view of the components of the ATLAS Detector. (Image: ATLAS Collaboration/CERN)
Track Reconstruction in Particle Physics
Charged particle reconstruction is one of the most computationally challenging steps in analysing the data recorded by the ATLAS Detector. It is the process of identifying groups of measurements in different parts of the detector that came from energy deposits left by a single particle, then calculating the physical properties of each particle from the path it took through the detector as it interacted with the detector’s material and traversed a non-homogeneous magnetic field. Tracking is a complex combinatorial task, described in more detail by the A Common Tracking Software (ACTS) project, that in some cases assigns hundreds of thousands of measurements to thousands of particles. The challenge will increase even further in the coming years, as the High-Luminosity LHC (HL-LHC) era brings large increases in the proton-proton collision rate, allowing the total dataset to grow by an order of magnitude.
The High-Luminosity LHC Challenge
The current algorithms used to analyse ATLAS’s data do not scale well to the data expected to arrive in the future. Without significant improvements in the experiment’s data analysis software, execution times would become unacceptably high.
This is demonstrated by the accompanying plot, which shows the time taken by the ATLAS Experiment’s existing algorithms to reconstruct events with 20-90 proton-proton collisions per proton-bunch crossing. The LHC currently provides events with ~60 simultaneous proton-proton collisions to ATLAS. These require tens of seconds to reconstruct on a single CPU thread with current reconstruction algorithms. During the HL-LHC era, up to 200 proton-proton collisions are expected per LHC event. With the algorithmic approach used so far, processing each of those events would require more than 10 times the processing power currently in use.
In order to achieve the ATLAS experiment’s physics programme in a sustainable way, significant changes and improvements to its data processing are under study.
After decades of nuclear and particle physics experiments writing their track reconstruction software individually, with minimal cooperation, the ACTS project now aims to provide a general toolkit that experiments can use as the basis of their own reconstruction software. For the time being, the project’s main development focuses on implementing tools for use on CPUs.
The ACTS Parallelization R&D
To provide an independent environment for trying out new ideas, multiple R&D projects were started in 2020 with the aim of implementing the functionality provided by ACTS on accelerators such as GPUs. The work is currently developed in multiple separate software repositories, and the goal is to migrate its results back into the main ACTS project. This would make GPU-accelerated charged-particle reconstruction accessible to current and future nuclear and particle physics experiments around the world.
The following development projects were set up as part of the ACTS Parallelization R&D:
acts-project/vecmem: Provides infrastructure for allocating and managing memory using standard library containers and equivalents in both host and device code.
acts-project/algebra-plugins: Provides an abstraction for performing the linear algebra operations on small vectors and matrices that are required during track reconstruction. Allows a seamless switch between different linear algebra backends such as Eigen, SMatrix, and hand-written implementations.
acts-project/covfie: Provides a general way of storing and accessing a “vector field” in host and device code. Used for the storage of magnetic fields in the track reconstruction software.
acts-project/detray: Provides a compile-time polymorphic detector geometry description. This code is responsible for much of the logic needed for propagating and fitting particle tracks through/in a detector.
acts-project/traccc: Implements the high level algorithms for performing track reconstruction using CPUs and GPUs. This is the “primary project” of the R&D effort, which brings together all other projects into a single build of experimental libraries and executables.
During code development, some useful features of oneAPI’s multiarchitecture programming via SYCL were discovered:
Expressing asynchronous code execution in SYCL comes very naturally, as the API strongly encourages expressing all operations as interdependent tasks. In many cases, extra effort had to be spent to achieve the same level of asynchronicity in CUDA code.
The oneAPI compiler optimises accelerated code blocks very efficiently. This led in a number of cases to binaries compiled for the NVIDIA backend that run even faster than the corresponding binaries produced from native CUDA code.
A further general observation of the ACTS Parallelization R&D work has been that porting an algorithm from one GPU language to another is fundamentally a much easier task than porting an algorithm optimised for a single CPU thread so that it runs efficiently on a GPU. Some algorithms, first implemented using CUDA, could be modified fairly easily and quickly to work with SYCL, creating portable code during the code development.
The Status of ACTS Track Reconstruction on GPUs
Reconstructing the tracks of charged particles in a detector like ATLAS happens in multiple steps. After establishing the 3D positions in the detector where (charged) particles interacted with detector elements, tracks are reconstructed by first generating “seeds” of viable 3D position triplets, then extending those seeds with a Combinatorial Kalman Filter approach, and finally performing a combined fit of the entire track. The ACTS documentation gives a more detailed description of this process, which is also shown in the accompanying figure.
Implementation progress within the ACTS Parallelization R&D Project. (Image: ATLAS Collaboration/CERN)
The R&D project is nearly feature complete at the time of writing, as shown in the above table. Already at this point the project has proven the feasibility of implementing track reconstruction with significant code sharing between a classical CPU implementation and one designed specifically for efficient GPU acceleration. We used the following tools from the Intel® oneAPI Base Toolkit during the development:
The Intel® oneAPI DPC++/C++ Compiler is used for building all C++ source files of the project, with appropriate flags for building some of the source files as SYCL sources.
The oneapi-gdb debugger was used many times during the development to understand our code, and to validate the implementation of our algorithms.
The Intel® VTune™ Profiler continues to be very effective for understanding performance bottlenecks in both the host/CPU and device/GPU parts of our codebase. We are using it extensively during code development to understand which parts of the code to focus on with our optimizations.
Finally, the oneAPI Threading Building Blocks (oneTBB) task-based multithreading library was used to implement host-side multithreading in our applications. oneTBB will eventually allow us to conveniently integrate the ACTS GPU code into ATLAS’s full offline software, which is also based on oneTBB.
Performance Results
Based on early performance results, we believe GPU-based track reconstruction will be a viable path for ATLAS in the High-Luminosity LHC era. In tests with an early version of the ACTS GPU R&D code, running the already existing algorithm chain up to estimating the parameters of track seeds, Intel’s data centre GPUs offer performance competitive with the offerings from NVIDIA, as shown below, and the performance of SYCL code compiled for an NVIDIA backend is very close to that of native CUDA code executing the same algorithm. It is also worth noting that a single data centre GPU provides significantly higher performance with the traccc code than would be possible using even multiple traditional CPUs.
Schedule for the High-Luminosity LHC, which begins in 2029 with Run 4 of the LHC. (Image: CERN)
Once the traccc project becomes feature complete, after a review of the lessons learned from the R&D process, the code will be migrated back into the main ACTS codebase.
With the start of the High-Luminosity LHC era currently planned for 2029 (see the schedule above), ATLAS plans to make a final decision in 2025–6 about the hardware and software that it will use for track reconstruction during data taking. The oneAPI-aided implementation is on track to prove its viability by providing code portability together with performance and a flexible choice of hardware, and could be selected as the experiment’s solution for handling the huge data load coming from colliding particle beams of higher intensity than was ever possible before.
The work described here was performed as a part of a collaboration between the ACTS Parallelization R&D team (mostly composed of members of the ATLAS Heterogeneous Computing & Accelerator Forum), led by Attila Krasznahorkay, and the Intel oneAPI team.
Track Parameter Estimation Performance Information
Testing date:
Results are based on testing by the research team working on this study at Intel as of August 22, 2023.
Configuration data:
Intel Data Center GPU Max Series: 1-node, 2x Intel Xeon Platinum 8480+, 56 cores, HT On, Turbo On, NUMA 2, Total Memory 1024GB (16x64GB DDR5 4800 MT/s [4800 MT/s]), BIOS SE5C7411.86B.9525.D26.2305160804, 1x Ethernet Controller X710 for 10GBASE-T, 1x 960 GB Micron 7450 MTFDKBG960TFR, Ubuntu 22.04.2 LTS, 5.15.47+prerelease23.6.22, microcode 0x2b0001b0, 4x Intel Data Center GPU Max 1550, agama driver: agama-ci-devel-682.16, AMC Firmware Version: 6.6.0.0
Compilers/Tools used: Intel® oneAPI DPC++/C++ Compiler 2023.2.1
Compiler flags used: “-O2 -fsycl -fsycl-targets=intel_gpu_pvc -Xsycl-target-backend '-options -ze-intel-enable-auto-large-GRF-mode'”
Tested by the ATLAS Experiment at CERN on 22/08/23.
NVIDIA A100: 1-node, 2x Intel Xeon Platinum 8480+, 56 cores, HT On, Turbo On, NUMA 2, Total Memory 512GB (16x32GB DDR5 4800 MT/s [4800 MT/s]), BIOS SE5C7411.86B.9525.D26.2305160804, 1x Ethernet Controller X710 for 10GBASE-T, 1x 1 TB Intel SSD PE2KX010T8, Ubuntu 22.04 LTS, 5.15.0-79-generic, microcode 0x2b0004b1, NVIDIA A100 80GB PCIe GPU, Driver Version: 535.54.03, CUDA Version: 12.0
Compilers/Tools used: clang version 17.0.0 (https://github.com/intel/llvm aa5722c9b25b79c70756c77cbe8393ad524f6e5e)
Compiler flags used: “-fsycl -fsycl-targets=nvidia_gpu_sm_80”
Tested by the ATLAS Experiment at CERN on 22/08/23.