[1] JIANG Lin, LIU Peng, SHAN Rui, et al. Design of intra-cluster full-switch architecture for distributed storage [J]. Journal of Xi'an University of Science and Technology, 2018, (04): 656-662. [doi:10.13800/j.cnki.xakjdxxb.2018.0420]

Design of intra-cluster full-switch architecture for distributed storage

Journal of Xi'an University of Science and Technology [ISSN: 1672-9315 / CN: 61-1434/N]

Issue:
2018, No. 04
Pages:
656-662
Publication date:
2018-07-15

Article Info

Title:
Design of intra-cluster full-switch architecture for distributed storage
Article ID:
1672-9315(2018)04-0656-07
Author(s):
JIANG Lin (1), LIU Peng (2), SHAN Rui (2), LIU Yang (3)
1. Integrated Circuit Design Laboratory, Xi'an University of Science and Technology, Xi'an 710054, China
2. School of Electronic Engineering, Xi'an University of Posts and Telecommunications, Xi'an 710121, China
3. School of Computer, Xi'an University of Posts and Telecommunications, Xi'an 710121, China
Keywords:
array processor; distributed storage; access delay; parallel access
CLC number:
TP 302
DOI:
10.13800/j.cnki.xakjdxxb.2018.0420
Document code:
A
Abstract:
Distributed storage has become the mainstream approach to "memory wall" problems in array processors, such as on-chip access delay. To address the interconnection problem within a cluster of distributed storage in an array processor, an intra-cluster full-switch architecture is designed that has a simple circuit structure, high utilization efficiency, and low delay. The structure lets the 16 processing elements in a cluster access the memory cells in parallel. Experimental results show that, in the conflict-free case, the maximum frequency reaches 223 MHz and the peak access bandwidth reaches 7.42 GB/s. Test results show that, compared with the row-column two-stage switch (crossbar interconnect) structure, the full-switch architecture has a smaller average access latency. Finally, the Canny edge detection algorithm for 256×256 and 512×512 images is parallelized on the architecture and its performance is compared. Compared with the processing time of a GPU+CPU architecture, the speedup ratio is increased by 2.84 and 2.91 times, respectively.
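
The distinction the abstract draws between conflict-free and conflicting access can be illustrated with a toy model. The sketch below is not the circuit described in the paper. It assumes, purely for illustration, that each memory bank serves one request per cycle and that requests to the same bank are serialized. The function name `cycles_to_serve` and the bank numbering are hypothetical.

```python
from collections import Counter

NUM_PES = 16  # processing elements per cluster, as stated in the abstract


def cycles_to_serve(requests):
    """Return the cycles needed to serve one batch of memory requests.

    requests[i] is the index of the bank that PE i wants to access,
    or None if PE i is idle in this batch.

    Assumption (not from the paper): every bank serves exactly one
    request per cycle, so requests that target the same bank are
    serialized, while requests to different banks proceed in parallel.
    """
    # Count how many PEs target each bank, ignoring idle PEs.
    per_bank = Counter(bank for bank in requests if bank is not None)
    # The busiest bank determines how long the whole batch takes.
    return max(per_bank.values(), default=0)


# Conflict-free case: every PE targets a different bank.
conflict_free = list(range(NUM_PES))
# Worst case under this model: every PE targets bank 0.
all_same_bank = [0] * NUM_PES

print(cycles_to_serve(conflict_free))  # one cycle
print(cycles_to_serve(all_same_bank))  # NUM_PES cycles
```

Under these assumptions, the conflict-free batch finishes in one cycle, which is the case the reported peak bandwidth refers to, and any bank conflict stretches the batch to as many cycles as the busiest bank has requesters.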


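Memo

Received: 2017-11-20. Editor in charge: GAO Jia
Funding: National Natural Science Foundation of China (61772417, 61272120, 61634004, 61602377); Natural Science Foundation of Shaanxi Province (2015JM6326); Shaanxi Province Science and Technology Coordination and Innovation Project (2016KTZDGY02-04-02); Natural Science Research Program of the Shaanxi Provincial Department of Education (17JK0689); Shaanxi Province Key Research and Development Program (2017GY-060)
Corresponding author: JIANG Lin (born 1970), male, from Yangling, Shaanxi; professor and master's supervisor. E-mail: jianglin@xust.edu.cn
Last Update: 2018-08-29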