Design and Implementation of the Convolution Operation in Neural Network Systems

In recent years, convolutional neural networks (CNNs) have attracted considerable attention for their high accuracy in image recognition, and their use in many image recognition algorithms has increased significantly. The large volume of computation and data in these networks calls for high-performance accelerators in their hardware implementations, and many efficient accelerators have been proposed as a result. In the conventional design approach, the accelerator processes the CNN layers one at a time. Because of the large amount of intermediate data, such an accelerator must use off-chip memory to store the data passed between layers. In this work, by exploiting the dataflow across the convolutional layers, a selected region of the input data is held in on-chip memory and, with a suitable ordering of the calculations, adjacent CNN layers are computed in a pipeline without storing intermediate data. With this approach, only the output data of the last layer needs to be stored in off-chip memory. To evaluate the proposed accelerator, named the MLCP architecture, 3 adjacent convolution layers were processed concurrently in a pipeline structure. The results are compared with those of the SLCP architecture, in which the calculations are performed layer by layer. Both the SLCP and MLCP architectures were designed at the RTL level in Verilog HDL and implemented on a Zynq-7000 family FPGA. Compared with SLCP, the MLCP architecture shows a 73% reduction in on-chip storage when the intermediate data is kept in on-chip memory, and a 6.6 times lower off-chip memory access rate when the intermediate data is kept in off-chip memory. In addition, by applying optimization techniques and parallel computation, the throughput of the MLCP architecture is 2.7 times that of the SLCP architecture. The same approach was also used to implement the first two convolution layers of the VGG-16 network. Along with achieving a performance of 232 GOPS, this implementation reduces the number of BRAMs and the number of external memory accesses compared with traditional implementations, which increases its energy efficiency relative to other works.

Key Words: Convolutional Neural Network (CNN), Multi-Layer Processing, Pipeline Processing, Off-chip Memory Access, Hardware Implementation, FPGA.
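The idea of holding only a region of the input on chip and writing back only the last layer's output can be illustrated with some simple arithmetic. The following Python sketch is not the MLCP design itself. It assumes unpadded square convolutions, ignores weight traffic, and uses hypothetical layer shapes (`ConvLayer`, the 32x32 input, and the channel counts are illustrative and do not come from this work). It computes the input patch each layer needs for one output pixel of the last layer, and compares the off-chip words moved by a layer-by-layer schedule with those of an idealized fused schedule.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ConvLayer:
    """Hypothetical description of one convolution layer (no padding)."""
    in_channels: int
    out_channels: int
    kernel: int
    stride: int = 1


def required_input_tile(layers: List[ConvLayer], out_tile: int) -> List[int]:
    """Side length of the input patch each layer needs (first layer first)
    to produce an out_tile x out_tile patch at the output of the last layer."""
    sizes = []
    size = out_tile
    for layer in reversed(layers):
        # Standard receptive-field recurrence, applied from the last layer back.
        size = (size - 1) * layer.stride + layer.kernel
        sizes.append(size)
    return list(reversed(sizes))


def feature_map_sizes(layers: List[ConvLayer], in_size: int) -> List[int]:
    """Output side length of every layer for an in_size x in_size input."""
    sizes = []
    size = in_size
    for layer in layers:
        size = (size - layer.kernel) // layer.stride + 1
        sizes.append(size)
    return sizes


def offchip_words_layer_by_layer(layers: List[ConvLayer], in_size: int) -> int:
    """Every layer reads its input from off-chip memory and writes its output back."""
    total = 0
    size = in_size
    for layer, out_size in zip(layers, feature_map_sizes(layers, in_size)):
        total += layer.in_channels * size * size           # read the layer input
        total += layer.out_channels * out_size * out_size  # write the layer output
        size = out_size
    return total


def offchip_words_fused(layers: List[ConvLayer], in_size: int) -> int:
    """Idealized fused schedule: the first input is read once and only the
    last output is written; intermediate maps never leave the chip."""
    out_size = feature_map_sizes(layers, in_size)[-1]
    reads = layers[0].in_channels * in_size * in_size
    writes = layers[-1].out_channels * out_size * out_size
    return reads + writes


if __name__ == "__main__":
    # Three adjacent 3x3 layers with made-up channel counts.
    stack = [ConvLayer(3, 16, 3), ConvLayer(16, 16, 3), ConvLayer(16, 16, 3)]
    print("input patch per layer:", required_input_tile(stack, 1))
    print("layer-by-layer words:", offchip_words_layer_by_layer(stack, 32))
    print("fused words:", offchip_words_fused(stack, 32))
```

The ratio obtained this way depends entirely on the assumed layer shapes, so it should not be expected to reproduce the 6.6 times figure reported above.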

In recent years, convolutional neural networks (CNNs) have attracted researchers' attention because of their high accuracy in image recognition and have been used in many machine-learning-based image recognition algorithms. Because of the large volume of computation and data in these networks, high-performance accelerators are needed for their hardware implementation. As a result, extensive research has been carried out on accelerating these networks, and many accelerators have been proposed for hardware implementation. In these accelerators the layers are computed in the conventional layer-by-layer manner, in which off-chip memory is used for the intermediate data because its volume is too large to store in on-chip memory. In the present work, by focusing on how data flows between the convolutional layers, storing a specific region of the input data in on-chip memory, and using a particular approach to the manner and order of the computations, several adjacent layers are processed in a pipeline without the need for intermediate data. Consequently, only the output data of the last layer is stored in off-chip memory at the end. To evaluate the proposed design (the MLCP architecture), the computations of 3 adjacent convolutional layers were performed concurrently in a pipeline, and the results were compared with the SLCP architecture, in which the computations are performed in the conventional layer-by-layer manner. Both architectures were described at the RTL level in Verilog and implemented on a Zynq-7000 family FPGA chip. The MLCP implementation yields a 73 percent reduction in on-chip memory when the intermediate data is stored in that memory, and a 6.6-fold reduction in off-chip memory accesses when the intermediate data is stored in off-chip memory. In addition, by applying optimization techniques and parallelizing the computations, the throughput of the MLCP architecture is improved by a factor of 2.7 relative to the SLCP architecture. This approach was also used to implement the first two convolutional layers of the VGG-16 network model. With a throughput of 232 GOPS, this design reduces the number of memory blocks and the number of off-chip memory accesses compared with conventional implementations, which in turn increases the energy efficiency of this implementation relative to other works.
Keywords: convolutional neural networks (CNN), concurrent multi-layer processing, pipeline processing, off-chip memory access, hardware implementation, FPGA.
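The claim that adjacent layers can be pipelined without storing the full intermediate data can be checked on a toy example. The sketch below is a deliberate simplification and not the RTL design: it uses one-dimensional, single-channel "valid" convolutions, and the function names are hypothetical. Each stage is a generator that keeps only a sliding window of as many samples as its kernel has taps, so a value produced by one layer is consumed by the next as soon as it is available. The result is compared with a layer-by-layer computation that materializes every intermediate list.

```python
from collections import deque
from typing import Iterable, Iterator, List, Sequence


def conv1d_stream(samples: Iterable[float], weights: Sequence[float]) -> Iterator[float]:
    """Valid 1-D convolution (correlation form) over a stream.

    Only len(weights) samples are buffered at any time, which plays the
    role of the small on-chip buffer in this toy model."""
    window = deque(maxlen=len(weights))
    for x in samples:
        window.append(x)
        if len(window) == len(weights):
            yield sum(w * v for w, v in zip(weights, window))


def conv1d_full(signal: Sequence[float], weights: Sequence[float]) -> List[float]:
    """Reference version that needs the whole input list in memory."""
    k = len(weights)
    return [
        sum(weights[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def fused_pipeline(signal: Iterable[float], kernels: Sequence[Sequence[float]]) -> List[float]:
    """Chain the streaming stages; no intermediate feature map is stored in full."""
    stream: Iterator[float] = iter(signal)
    for weights in kernels:
        stream = conv1d_stream(stream, weights)
    return list(stream)  # only the last layer's output is collected


def layer_by_layer(signal: Sequence[float], kernels: Sequence[Sequence[float]]) -> List[float]:
    """Conventional schedule: each layer's full output is stored before the next starts."""
    data = list(signal)
    for weights in kernels:
        data = conv1d_full(data, weights)
    return data


if __name__ == "__main__":
    signal = list(range(10))
    kernels = [[1, 0, -1], [1, 2, 1], [1, 1, 1]]
    assert fused_pipeline(signal, kernels) == layer_by_layer(signal, kernels)
    print(fused_pipeline(signal, kernels))
```

In two dimensions the per-stage window would become a set of line buffers holding a few rows of the feature map, but the scheduling idea is the same: each stage starts as soon as enough of its input exists.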