A typical DL model can be represented as a graph, where nodes are operators and directed edges denote the dependences between nodes. Modern accelerators mostly focus on compute-bound operators such as convolution (CONV) and general matrix multiplication (GEMM) via specially designed compute units like systolic arrays. These units are able to process multiply-accumulate operations in a highly efficient manner. On the other hand, the accelerators depend on complex software-managed scratchpads. End-to-end performance will be limited if the memory references of a neural network are not well organized.

Current solutions, e.g., the XLA compiler for Google’s TPU (XLA Team, 2017), handle memory-access optimization within an operator but ignore opportunities to reduce the number of memory accesses across multiple operators. There is some global optimization work for DL models (Jia et al., 2019; Liu et al., 2019), but no one seems to have attacked global optimization of memory-access patterns for DL accelerators.

Our goal is to minimize inter-bank data movement between multiple operators (represented by multiple loop nests in our compiler). To achieve this goal, we first derive bank mappings for the operators with bank-mapping restrictions, e.g., conv2D, matmul, pooling, etc., then propagate these mappings across the network based on the data dependencies between operators. We perform a fixed-point iteration to propagate the mappings to cover all operators in the neural network and to make sure that the output of an operator maps to the memory banks required by the next operator. If a tensor t has conflicting mapping requirements during the propagation, i.e., the data layout changes between consecutive operators in the network, we introduce a tensor t′ and a memcopy between t and t′ to represent data movement between memory banks. Typically, for a high-dimensional tensor, we map its outer dimensions to different banks and use its inner dimensions to address different elements in the same bank to support sequential data access.

To conclude, this paper proposes a systematic approach to globally optimize the memory-access patterns of DL workloads on accelerators. Experimental results show that we are able to significantly reduce memory references for state-of-the-art networks on Inferentia, a homegrown AWS machine-learning inference chip.
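To make the propagation pass concrete, the following is a minimal Python sketch of the fixed-point iteration over a toy operator IR. Everything here (the `BankMapping` and `Op` types, the `propagate` function, and the example operators) is a hypothetical illustration of the technique described above, not the compiler's actual API.

```python
# Hypothetical toy IR illustrating the fixed-point bank-mapping
# propagation described above; not the actual compiler's data structures.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class BankMapping:
    # Outer dimensions are spread across different banks; the remaining
    # inner dimensions address elements within one bank, so innermost
    # loops still access memory sequentially.
    banked_dims: tuple

@dataclass
class Op:
    name: str
    inputs: list    # tensor names
    outputs: list
    # Bank-mapping restrictions (e.g., for conv2D/matmul/pooling),
    # keyed by tensor name.
    required: dict = field(default_factory=dict)

def propagate(ops):
    """Seed tensor mappings from restricted operators, push them along
    data dependencies to a fixed point, and resolve conflicts by
    materializing a copy tensor t' plus an inter-bank memcopy t -> t'."""
    mapping = {}    # tensor name -> BankMapping
    memcopies = []  # (src, dst) copies inserted to change layout
    changed = True
    while changed:
        changed = False
        for op in ops:
            for t in list(op.inputs) + list(op.outputs):
                want = op.required.get(t)
                if want is None:
                    continue  # unrestricted operand: keep whatever it has
                have = mapping.get(t)
                if have is None:
                    mapping[t] = want  # propagate the requirement
                    changed = True
                elif have != want:
                    # Layout changes between consecutive operators:
                    # introduce t' and a memcopy t -> t'.
                    t_prime = t + "'"
                    if t_prime not in mapping:
                        mapping[t_prime] = want
                        memcopies.append((t, t_prime))
                    # Rewire this operator to use the re-laid-out copy.
                    op.inputs = [t_prime if x == t else x for x in op.inputs]
                    op.outputs = [t_prime if x == t else x for x in op.outputs]
                    op.required[t_prime] = op.required.pop(t)
                    changed = True
    return mapping, memcopies

# Example: a matmul wants tensor "y" banked on dim 0, but the following
# pooling op wants it banked on dim 1, forcing one memcopy.
ops = [
    Op("matmul", ["x", "w"], ["y"], {"x": BankMapping((0,)), "y": BankMapping((0,))}),
    Op("pool",   ["y"],      ["z"], {"y": BankMapping((1,)), "z": BankMapping((1,))}),
]
mapping, copies = propagate(ops)
print(copies)  # [('y', "y'")]
```

In this sketch, `banked_dims` encodes the convention that outer tensor dimensions map to different banks while inner dimensions address elements within a bank, and a memcopy is materialized only when two restricted operators genuinely disagree on a tensor's layout, mirroring the t/t′ resolution described above.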