Lattice QCD on a novel vector architecture
Benjamin Huth, Nils Meyer, Tilo Wettig∗
The SX-Aurora TSUBASA PCIe accelerator card is the newest model of NEC’s SX architecture
family. Its multi-core vector processor features a vector length of 16 kbits and interfaces with up to
48 GB of HBM2 memory in the current models, available since 2018. The compute performance
is up to 2.45 TFlop/s peak in double precision, and the memory throughput is up to 1.2 TB/s
peak. New models with improved performance characteristics have been announced for the near future.
In this contribution we discuss key aspects of the SX-Aurora and describe how we enabled the
architecture in the Grid Lattice QCD framework.
∗ Speaker.
© Copyright owned by the author(s) under the terms of the Creative Commons
Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0). https://pos.sissa.it/
Lattice QCD on a novel vector architecture Tilo Wettig
1. Introduction
Grid [1] is a modern Lattice QCD framework targeting parallel architectures. Architecture-
specific code is confined to a few header files. The CPU implementations use compiler built-
in functions (a.k.a. intrinsics) and assembly. There is also a generic, architecture-independent
implementation based on C/C++ that relies on auto-vectorization.
Mainline Grid is limited to a vector register size of at most 512 bits. Here, we consider a
new architecture with 16-kbit vector registers. We describe how we modified Grid to enable larger
vector lengths and present initial performance benchmarks.
2.1 Overview
Figure 1: NEC SX-Aurora TSUBASA PCIe accelerator card (type 10). Picture published with permission from NEC, © NEC.
Figure 2: Liquid-cooled NEC A412-8 server presented at SC 19, featuring 8 SX-Aurora cards of novel type 10E attached to a single-socket AMD Rome host CPU and fitting in 2U.
The SX-Aurora TSUBASA, also called vector engine (VE), is the newest member of NEC’s
SX series [2]. In contrast to former vector supercomputer architectures, the SX-Aurora is designed
as an accelerator card, see Fig. 1. At present it is available with PCIe Gen3 x16 interconnect
(VE type 10). The accelerator hosts a vector processor with 8 cores. The card ships in 3 models,
which we list in Table 1. For instance, the type 10A flagship model clocks at 1.6 GHz and delivers
2.45 TFlop/s DP peak. The High Bandwidth Memory (HBM2) capacity is 48 GB with a throughput
of 1.2 TB/s peak. Improved accelerator models with higher main memory throughput (type 10E,
type 20) and 10 cores (type 20) have been announced [3, 4].
Multiple SX-Aurora platforms are available, including workstation, rack-mounted server and
supercomputer [2]. Up to 64 vector engines interconnected by InfiniBand fit into one A500 rack,
delivering 157 TFlop/s DP peak. In Fig. 2 we show the novel A412-8 server presented at SC 19.
Figure 3: High-level block diagram of the SX-Aurora type 10: PCIe interface, DMA engine, network on chip, 8 blocks of 2 MB LLC, and 6 HBM2 modules of 4/8 GB each.
The high-level architecture of the SX-Aurora type 10 is shown in Fig. 3 [5, 6]. The chip
contains 8 identical single-thread out-of-order cores. Each core comprises a scalar processing unit
(SPU) with 32 kB L1 cache and 256 kB L2 cache as well as a vector processing unit (VPU).
The VPU processes (optionally masked) 16-kbit vector registers (corresponding to 256 real DP
numbers) in 8 chunks of 2 kbits each.
There are 8 blocks of 2 MB last-level cache (LLC) connected by a 2d network on chip. The
VPUs directly access this (coherent) LLC. Two groups of 4 LLCs are connected to one memory
controller each. Every controller addresses 3 stacks of HBM2. A ring bus interconnects the LLCs
and allows for direct memory access (DMA) and PCIe traffic.
To explain the origin of this restriction, it is helpful to understand how the lattice sites are mapped
to Gridarrays. We first introduce some notation.
1. The lattice is decomposed into sublattices containing n lattice sites each. The number n_i
(i = 0, ..., d − 1) equals the number of sublattice sites in dimension i. Note that Grid performs
the decomposition in such a way that adjacent sites of the full lattice are mapped to different
sublattices. In fact, the sites of a given sublattice are as far away from one another in the full
lattice as possible (see also Fig. 4).
2. A given sublattice is mapped onto m Gridarrays. One Gridarray contains one of the m degrees
of freedom of all n sublattice sites (see Fig. 4 for m = 1 with real numbers).
We first consider a single Grid process (without MPI communication). Given the decomposition
just described, shifts can be done by a few simple methods (see Fig. 4, cases A and B):
• If no site of a sublattice is part of the lattice boundary in the shift direction, a simple copy of
all involved Gridarrays is sufficient.
• Otherwise, in addition to the copy, the elements within a Gridarray must be rearranged.
Mainline Grid can handle two cases:
– Case A: A SIMD layout with all entries but one equal to 1. Then the Grid function
rotate performs the rearrangement.
– Case B: Any other SIMD layout than case A, with the restriction that no entry is larger
than 2. Then the Grid function permute rearranges the sites in the right way.
Other SIMD layouts are not supported by mainline Grid, and thus the maximum SIMD layout for
a 4-dimensional lattice is {2, 2, 2, 2}. For a lattice with SP real numbers this corresponds to a maximum vector register size of 512 bits, which explains the restriction mentioned above. However, the
SX-Aurora with its 16-kbit vector registers requires at least {2, 4, 4, 4} (for a DP complex lattice).
Therefore, an extension of the shift and stencil algorithms is necessary.
The new transformation should have the functionality of rotate and permute, but for
an arbitrary SIMD layout. Furthermore, this operation should be vectorizable and have similar
performance. The new function split_rotate (shown in the following for the double-precision
case) fulfills these aims:
1 For example, if the Gridarray size is 512 bits, we have n = 4 for double complex, n = 8 for double real or single complex, and n = 16 for single real.
[Figure 4, three panels: Circular Shift (Case A), Circular Shift (Case B), and Circular Shift (new implementation); each panel shows the lattice sites before and after the shift and the Copy, Copy & Rotate, Copy & Permute, and Copy & Split-Rotate operations on the underlying Gridarrays.]
Figure 4: Examples for shifts in a 2d lattice with m = 1. Case A (rotate) has SIMD layout
{4,1}, case B (permute) {2,2}. In both cases n = 4. The new implementation split_rotate
(right-most figure, n = 8) is able to handle a SIMD layout of, e.g., {4,2}. The colored boxes denote
positions in memory, where all boxes with the same color are in contiguous memory. At the bottom
we display the transformations of the underlying Gridarrays.
The split parameter s specifies into how many subarrays the Gridarray is split. Then these subarrays
are rotated by r. In the case of complex numbers, the input r must be multiplied by 2. For s = 1, we
obtain mainline Grid’s rotate function. The effect of split_rotate on a 2d lattice is shown
in Fig. 4. Examples for shifts on a 3d sublattice by split_rotate are shown in Fig. 5.
We have replaced rotate and permute by split_rotate in shift and stencil operations and thereby enabled Grid to handle Gridarray sizes of 128 × 2^k bits (with k ∈ ℕ). Our implementation of split_rotate is done in generic C/C++ code, which is architecture-independent and thus applicable beyond the SX-Aurora.
The algorithm described above works for a single Grid process. When using Grid on multiple
nodes (with MPI communication), some Gridarrays have to be broken up and partially transferred
to the neighboring node in order to perform a lattice shift. In mainline Grid, this is also restricted to
the same SIMD layouts as described above. We have enabled the required functionality for larger Gridarray sizes of 128 × 2^k bits. However, the implementation still needs to be optimized.
[Figure 5, two panels: left, a 2×2×2 sublattice with axes x, y, z shifted via permute; right, a 4×4×4 sublattice shifted via split_rotate with (r, s) = (1, 16), (4, 4), and (16, 1) for the three dimensions.]
Figure 5: Shifts on a 3d sublattice with m = 1. Left: A small sublattice with side length = 2
and SIMD layout {2, 2, 2} is shifted by the permute function. Right: A larger sublattice with
side length = 4 and SIMD layout {4, 4, 4} is shifted using the split_rotate function. The
corresponding parameters s and r are also shown.
We have shown how to modify Grid to deal with larger vector register sizes than the current
512-bit limit and presented performance benchmarks on the SX-Aurora. Work on MPI support is
in progress. Once this support is available and further performance optimizations are implemented,
the SX-Aurora will be an interesting option for Lattice QCD.
Figure 6: Performance benchmarks.
(a) Single-thread custom SU(3) matrix-matrix multiplication without OpenMP, scaling up the size of the Gridarray: intrinsics (clang/LLVM VE) vs auto-vectorization (ncc); performance in GFlop/s vs Gridarray size in bytes (16 to 16384).
(b) Single-thread SU(3) matrix-matrix multiplication without OpenMP, increasing the lattice size: intrinsics vs auto-vectorization, custom vs Grid, on the NEC SX-Aurora and on an Intel KNL 7210; % of peak performance vs data size in MB.
(c) Multi-thread SU(3) matrix-matrix multiplication using OpenMP: intrinsics vs auto-vectorization, custom vs Grid; performance in GFlop/s vs number of cores (1 to 8).
(d) Application of the Wilson Dirac operator on KNL with OpenMP: mainline Grid implementation (in this case, permute only; generic kernel and unrolled loops) vs split_rotate (generic kernel); performance in GFlop/s vs lattice volume (8^4 to 32^4).
Acknowledgment
This work was supported by DFG in the framework of SFB/TRR 55 (project QPACE 4). We
thank Erlangen University, Germany, and KEK, Japan, for access to the SX-Aurora and for support.
References
[1] P. Boyle, A. Yamaguchi, G. Cossu and A. Portelli, PoS (LATTICE 2015) 023 [1512.03487]
[2] https://www.nec.com/en/global/solutions/hpc/sx
[3] S. Momose, 2nd Aurora Deep Dive Workshop at RWTH Aachen University (2019)
[4] https://fuse.wikichip.org/news/3073/nec-refreshes-sx-aurora-vector-engine-outlines-roadmap
[5] Y. Yamada and S. Momose, Hot Chips (2018)
[6] https://en.wikichip.org/wiki/nec/microarchitectures/sx-aurora
[7] K. Komatsu et al., Proceedings of SC 18 (2018) 54
[8] B. Huth, “Grid fork SX-Aurora”, https://github.com/benjaminhuth/Grid