Design of intra-cluster full-switch architecture for distributed storage
JIANG Lin1, LIU Peng2, SHAN Rui2, LIU Yang3

(1. Integrated Circuit Design Laboratory, Xi'an University of Science and Technology, Xi'an 710054, China; 2. School of Electronic Engineering, Xi'an University of Posts and Telecommunications, Xi'an 710121, China; 3. School of Computer, Xi'an University of Posts and Telecommunications, Xi'an 710121, China)

array processor; distributed storage; access delay; parallel access

DOI: 10.13800/j.cnki.xakjdxxb.2018.0420

Using a distributed storage structure to address "memory wall" problems in array processors, such as on-chip access delay, has become a mainstream research direction. To address the interconnect problem within clusters of distributed storage, an intra-cluster full-switch (full-access) architecture is designed that has a simple circuit structure, high utilization efficiency and low delay. The structure allows the 16 processing elements within a cluster to access the memory cells in parallel. Experimental results show that, in the absence of conflicts, the maximum frequency reaches 223 MHz and the peak access bandwidth reaches 7.42 GB/s. Test results show that, compared with the line-row two-stage switch structure, the full-switch architecture has a smaller average access latency. Finally, the Canny edge detection algorithm was parallelized on this architecture for 256×256 and 512×512 images and its performance was compared. Relative to the processing time of a CPU+GPU architecture, the speedup is increased by 2.84 and 2.91 times, respectively.
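
As an illustration of what "no conflict" means for parallel access, the following is a minimal behavioral sketch in Python, not the circuit described in the paper. It assumes one memory bank per processing element and a fixed-priority rule in which the lowest-numbered processing element wins a contested bank. The bank count, the arbitration rule, and the names `arbitrate`, `NUM_PES` and `NUM_BANKS` are hypothetical and chosen only for this example.

```python
from collections import defaultdict

NUM_PES = 16    # processing elements per cluster (figure taken from the abstract)
NUM_BANKS = 16  # assumption: one memory bank per processing element


def arbitrate(requests):
    """Resolve one cycle of PE-to-bank requests through a full crossbar.

    requests maps a PE index to the bank index it wants to access.
    Returns (granted, stalled): granted maps each requested bank to the
    PE that wins it, and stalled is the sorted list of PEs that must retry.
    Assumption: fixed priority, the lowest-numbered PE wins a conflict.
    """
    by_bank = defaultdict(list)
    for pe, bank in requests.items():
        if not (0 <= pe < NUM_PES and 0 <= bank < NUM_BANKS):
            raise ValueError(f"invalid request: PE {pe} -> bank {bank}")
        by_bank[bank].append(pe)

    granted = {}
    stalled = []
    for bank, pes in by_bank.items():
        pes.sort()
        granted[bank] = pes[0]    # winner is served this cycle
        stalled.extend(pes[1:])   # losers wait for a later cycle
    return granted, sorted(stalled)


# Conflict-free cycle: every PE targets a distinct bank, so all 16 are served.
free_granted, free_stalled = arbitrate({pe: (pe + 3) % NUM_BANKS for pe in range(NUM_PES)})

# Conflicting cycle: PE 0 and PE 1 both target bank 5, so PE 1 must retry.
conf_granted, conf_stalled = arbitrate({0: 5, 1: 5, 2: 7})
```

In the conflict-free case all 16 requests complete in the same cycle, which is the condition under which the abstract reports its peak bandwidth. Any cycle with stalled processing elements delivers less than that peak.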