Literature Reading (38)
- Title: Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture
- Year: 2019
- Venue: The International Symposium on Microarchitecture (MICRO)
- Institution: NVIDIA
1 Abbreviations & references
- MCM: multi-chip module
- NoC: network on chip
- NoP: network on package
- GRS: ground-referenced signaling
- GALS: globally asynchronous, locally synchronous
- Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks, ISCA 2016
- Timeloop: A Systematic Approach to DNN Accelerator Evaluation, International Symposium on Performance Analysis of Systems and Software (ISPASS) 2019
- Zeppelin: An SoC for Multichip Architectures, ISSCC 2018
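2 Abstract & introduction & background
A multi-chip module (MCM) is a relatively new packaging approach in which a single package holds many small chiplets. Small chiplets are cheaper to design, and many chiplets working in parallel can deliver high performance.
Simba has 36 chiplets, each with a peak performance of 4 TOPS. It targets deep-learning inference and uses tiling optimizations to improve data locality.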
Three tail-latency-aware non-uniform tiling optimizations:
- non-uniform work partitioning to balance compute latency with communication latency
- communication-aware data placement to minimize inter-chiplet traffic
- cross-layer pipelining
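The first optimization can be illustrated with a small sketch. It is not the paper's algorithm, only a minimal model under stated assumptions: every chiplet has the same compute throughput, each chiplet `i` has a fixed (hypothetical) communication latency `comm_latency[i]`, and work is infinitely divisible. Under those assumptions, giving less work to chiplets with higher communication latency lets all chiplets finish at the same time, which shortens the tail compared with a uniform split.

```python
def partition_work(total_work, comm_latency, throughput=1.0):
    """Split total_work so that every chiplet finishes at the same time.

    comm_latency[i]: hypothetical communication latency of chiplet i
    throughput: work units per time unit, assumed equal for all chiplets
    """
    n = len(comm_latency)
    # Solve sum(throughput * (T - c_i)) == total_work for the finish time T.
    finish = (total_work / throughput + sum(comm_latency)) / n
    shares = [throughput * (finish - c) for c in comm_latency]
    if min(shares) < 0:
        raise ValueError("a chiplet's latency is too high to receive any work")
    return shares, finish


# Hypothetical example: four chiplets at increasing distance from the data.
latencies = [1.0, 2.0, 3.0, 4.0]
shares, finish = partition_work(100.0, latencies)

# Uniform split for comparison: the slowest chiplet sets the finish time.
uniform_finish = max(100.0 / len(latencies) + c for c in latencies)
```

With these made-up numbers the non-uniform split finishes earlier than the uniform one, because the uniform split is bounded by the chiplet with the highest communication latency.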
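The drawback of an MCM is that package-level wires cannot provide the same communication density as on-chip wires, so intra-chiplet bandwidth is much higher than inter-chiplet bandwidth. This non-uniformity in bandwidth, latency, and energy has to be taken into account.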
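3 Simba architecture and system
3.1 Simba architecture
The design has three levels: package, chiplet, and PE.
A Simba package holds a 6x6 array of Simba chiplets. Each chiplet contains a PE array, a global PE, a NoP router, and a controller.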
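To make the cost of inter-chiplet traffic on the 6x6 array concrete, here is a minimal sketch of communication-aware placement. It assumes the chiplets form a 2D mesh where traffic cost is proportional to Manhattan hop count; that cost model, and the names `hops`, `total_hops`, and `best_source`, are illustrative assumptions rather than details taken from the notes.

```python
from itertools import product

MESH_DIM = 6  # 6x6 chiplets per package, as stated in the notes


def hops(src, dst):
    """Manhattan distance between two (row, col) chiplet coordinates."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def total_hops(source, consumers):
    """Total hop count when `source` sends data to every consumer chiplet."""
    return sum(hops(source, c) for c in consumers)


def best_source(consumers):
    """Chiplet position that minimizes total hop count to all consumers."""
    candidates = product(range(MESH_DIM), repeat=2)
    return min(candidates, key=lambda pos: total_hops(pos, consumers))


# Hypothetical example: four chiplets in the middle of the mesh need the data.
consumers = [(2, 2), (2, 3), (3, 2), (3, 3)]
corner_cost = total_hops((0, 0), consumers)  # data placed in a corner
best = best_source(consumers)
best_cost = total_hops(best, consumers)      # data placed next to its users
```

Placing the data on a chiplet close to its consumers cuts the hop count substantially in this toy case, which is the intuition behind communication-aware data placement.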
Each PE has a distributed weight buffer, an input buffer, parallel vector MAC units, an accumulation buffer, and a post-processing unit.
PEs can also perform cross-PE reduction.
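The PE datapath described above can be sketched functionally as follows. This is a simplified software model, not the hardware design: it assumes each vector MAC is a dot product, that cross-PE reduction is an element-wise sum of accumulation buffers, and that post-processing is a ReLU (a hypothetical choice for illustration).

```python
def vector_mac(weights, inputs):
    """Dot product of one weight vector with one input vector."""
    return sum(w * x for w, x in zip(weights, inputs))


def pe_partial_sums(weight_rows, inputs):
    """One PE: each weight row yields one partial sum (accumulation buffer)."""
    return [vector_mac(row, inputs) for row in weight_rows]


def cross_pe_reduce(partials_per_pe):
    """Element-wise sum of the accumulation buffers of several PEs."""
    return [sum(vals) for vals in zip(*partials_per_pe)]


def post_process(acc):
    """Hypothetical post-processing step: ReLU."""
    return [max(0, v) for v in acc]


# Two PEs each hold half of the input channels for the same two outputs.
pe0 = pe_partial_sums([[1, 2], [-1, 0]], [3, 4])
pe1 = pe_partial_sums([[0, 1], [-2, 1]], [5, 6])
outputs = post_process(cross_pe_reduce([pe0, pe1]))
```

Splitting the input channels across PEs is what makes the reduction step necessary: neither PE alone holds the complete sum for an output.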
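3.2 Simba silicon prototype
Each Simba chiplet has a RISC-V processor that configures and manages the PEs and the global PE by accessing memory-mapped registers over the AXI protocol.
Communication goes through the NoC and the NoP.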
3.3 Simba baseline tiling
4 Simba characterization
The chip communicates with an x86 host over PCIe. Counting is done on the software side to determine when each chiplet starts executing.