TY - GEN
T1 - New and improved word-based unified and scalable architecture for radix 2 Montgomery modular multiplication algorithm
AU - Ibrahim, Atef
AU - Gebali, Fayez
AU - Elsimary, Hamed
PY - 2013
Y1 - 2013
AB - This paper presents a new and improved word-based processor array architecture for the unified and scalable radix-2 Montgomery modular multiplication algorithm. In this architecture, the multiplicand and modulus words are allocated to each processing element (PE) rather than pipelined between the PEs as in the previous architecture extracted by Ç. Koç, and the multiplier bits are fed serially to the first PE of the processor array every odd clock cycle. Moreover, the architecture is modified to reduce the critical path delay and area by replacing the two levels of carry-save adder (CSA) logic with a modified 4-to-2 CSA that uses only one level of dual-field adder (DFA) logic, taking advantage of the fact that the same PE processes two operand words. An ASIC implementation of the proposed architecture shows that it can perform 1024-bit modular multiplication (for word size w = 32) in about 17.07 μs. The results also show that its Area × Time values are smaller than those of all existing designs by ratios ranging from 11.6% to 47.8%, which makes it suitable for implementations where both area and performance are of concern. It also has higher throughput (by 1.8% to 39.5%) than most of the published unified and scalable architectures, the exception being the architecture extracted by Harris, whose throughput is slightly higher (by 4.5%) than that of the proposed one.
UR - https://www.scopus.com/pages/publications/84889017576
U2 - 10.1109/PACRIM.2013.6625466
DO - 10.1109/PACRIM.2013.6625466
M3 - Conference contribution
AN - SCOPUS:84889017576
SN - 9781479915019
T3 - IEEE Pacific RIM Conference on Communications, Computers, and Signal Processing - Proceedings
SP - 153
EP - 158
BT - 2013 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, PACRIM 2013
T2 - 14th IEEE Pacific Rim Conference on Communications, Computers, and Signal Processing, PACRIM 2013
Y2 - 27 August 2013 through 29 August 2013
ER -