Literature Reading (38)

  • Title: Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture
  • Year: 2019
  • Conference: The International Symposium on Microarchitecture (MICRO)
  • Institution: NVIDIA

1 Abbreviations & References

  • MCM: multi-chip module
  • NoC: network-on-chip
  • NoP: network-on-package
  • GRS: ground-referenced signaling
  • GALS: globally asynchronous, locally synchronous
  • Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks, ISCA 2016
  • Timeloop: A Systematic Approach to DNN Accelerator Evaluation, International Symposium on Performance Analysis of Systems and Software (ISPASS) 2019
  • Zeppelin: An SoC for Multichip Architectures, ISSCC 2018

2 abstract & introduction & background

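A multi-chip module (MCM) is a newer packaging approach: one MCM can contain many small chiplets. Small chiplets are cheaper to design, and many chiplets together offer a high degree of parallelism, so the package can still deliver high performance.
Simba has 36 chiplets, each reaching a peak of 4 TOPS, and uses tiling optimizations to improve data locality for deep-learning inference.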

Three tail-latency-aware non-uniform tiling optimizations:

  1. non-uniform work partitioning to balance compute latency with communication latency
  2. communication-aware data placement to minimize interchiplet traffic
  3. cross-layer pipelining
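
As a rough illustration of the first optimization, the sketch below splits a layer's work so that every chiplet's communication latency plus compute latency ends at about the same time, which lowers the tail latency compared with an even split. The function names `partition_work` and `tail_latency`, the per-unit compute cost, and the latency values are all hypothetical; this is not the partitioning algorithm from the paper.

```python
def tail_latency(work, comm_latency, cost_per_unit=1.0):
    """Finish time of the slowest chiplet (communication + compute)."""
    return max(c + cost_per_unit * w for w, c in zip(work, comm_latency))


def partition_work(total_units, comm_latency, cost_per_unit=1.0):
    """Split total_units so that comm + compute finish together.

    comm_latency[i] is a hypothetical time chiplet i waits for its data.
    Solves comm_latency[i] + cost_per_unit * w[i] = T with sum(w) = total_units.
    """
    n = len(comm_latency)
    # Common finish time T if shares could be fractional.
    finish = (total_units * cost_per_unit + sum(comm_latency)) / n
    shares = [(finish - c) / cost_per_unit for c in comm_latency]
    if min(shares) < 0:
        raise ValueError("a chiplet is too far away to receive any work")
    # Round down, then give the leftover units to the chiplets with the
    # largest fractional remainders so the total is preserved.
    work = [int(s) for s in shares]
    leftover = total_units - sum(work)
    order = sorted(range(n), key=lambda i: shares[i] - work[i], reverse=True)
    for i in order[:leftover]:
        work[i] += 1
    return work


# Hypothetical example: four chiplets, the last one is farthest from the data.
latency = [0.0, 4.0, 4.0, 8.0]
uniform = [25, 25, 25, 25]
non_uniform = partition_work(100, latency)

print(non_uniform, tail_latency(uniform, latency), tail_latency(non_uniform, latency))
```

In this toy setting the even split is held back by the farthest chiplet, while the non-uniform split gives that chiplet less work so all four finish together.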

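The problem with MCMs is that package-level wires cannot provide the same communication density as on-chip wires. Intra-chiplet bandwidth is therefore far higher than inter-chiplet bandwidth, and this non-uniform bandwidth, latency, and energy must be taken into account.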
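
One way to picture why placement matters under non-uniform communication cost: if the chiplets are treated as a 6x6 grid and the cost of a transfer is assumed to grow with the number of chiplet-to-chiplet hops, then where a piece of data lives changes the total inter-chiplet traffic. The sketch below picks the chiplet that minimizes total hop count to the chiplets consuming the data. The grid topology, the Manhattan-distance cost, and the names `hops` and `best_placement` are assumptions made for illustration, not details taken from the notes above.

```python
from itertools import product

MESH = 6  # 6x6 chiplets per package


def hops(a, b):
    """Manhattan distance between two chiplet coordinates (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def total_hops(home, consumers):
    """Total hop count from the chiplet holding the data to all consumers."""
    return sum(hops(home, c) for c in consumers)


def best_placement(consumers):
    """Chiplet coordinate that minimizes total hops to all consumers."""
    candidates = product(range(MESH), repeat=2)
    return min(candidates, key=lambda c: total_hops(c, consumers))


# Hypothetical example: three chiplets need the same data.
consumers = [(0, 0), (0, 4), (4, 2)]
home = best_placement(consumers)

print(home, total_hops(home, consumers), total_hops((5, 5), consumers))
```

Here a communication-aware choice of home chiplet needs far fewer hops than an arbitrary corner placement such as (5, 5).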

3 Simba architecture and system

3.1 Simba architecture

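The design has three levels: package, chiplet, and PE.
A Simba package contains a 6x6 array of Simba chiplets; each chiplet contains an array of PEs, a global PE, an NoP router, and a controller.
Each PE has a distributed weight buffer, an input buffer, parallel vector MAC units, an accumulation buffer, and a post-processing unit.
PEs can also perform cross-PE reduction.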
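
The three-level hierarchy and the idea of cross-PE reduction can be sketched with a few small classes. This is only a toy model: the class and method names, the number of PEs per chiplet (left as a parameter because the notes do not state it), and the way input channels are split across PEs are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PE:
    accumulation_buffer: List[int] = field(default_factory=list)

    def vector_mac(self, weights, inputs):
        """Compute one dot product per weight row and store the partial sums."""
        self.accumulation_buffer = [
            sum(w * x for w, x in zip(row, inputs)) for row in weights
        ]
        return self.accumulation_buffer


@dataclass
class Chiplet:
    pes: List[PE]

    def cross_pe_reduce(self):
        """Element-wise sum of the partial sums held by all PEs."""
        buffers = [pe.accumulation_buffer for pe in self.pes]
        return [sum(vals) for vals in zip(*buffers)]


@dataclass
class Package:
    chiplets: List[List[Chiplet]]  # 6x6 grid of chiplets


def build_package(pes_per_chiplet, rows=6, cols=6):
    """Build a package; pes_per_chiplet is a free parameter of this sketch."""
    return Package([
        [Chiplet([PE() for _ in range(pes_per_chiplet)]) for _ in range(cols)]
        for _ in range(rows)
    ])


# Hypothetical example: 2 outputs, 4 input channels split across 2 PEs.
package = build_package(pes_per_chiplet=2)
chiplet = package.chiplets[0][0]
chiplet.pes[0].vector_mac([[1, 2], [5, 6]], [1, 0])  # input channels 0-1
chiplet.pes[1].vector_mac([[3, 4], [7, 8]], [2, 1])  # input channels 2-3
outputs = chiplet.cross_pe_reduce()

print(outputs)
```

Each PE only sees part of the input channels, so its accumulation buffer holds partial sums; summing them across PEs gives the same result as a single dot product over all four channels.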

3.2 Simba Silicon prototype

Each Simba chiplet has a RISC-V processor that configures and manages the PEs and the global PE by accessing memory-mapped registers over the AXI protocol.

Communication uses the NoC within a chiplet and the NoP between chiplets.

3.3 Simba baseline tiling

4 Simba characterization

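The chip communicates with an x86 host over PCIe, and counters on the software side are used to determine when each chiplet starts executing.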