Electronics, Vol. 14, Pages 2021: Efficient GPU Parallel Implementation and Optimization of ARIA for Counter and Exhaustive Key-Search Modes
Electronics doi: 10.3390/electronics14102021
Authors:
Siwoo Eum
Minho Song
Sangwon Kim
Hwajeong Seo
This paper proposes an optimized shared memory access technique that improves parallel processing performance and reduces memory accesses for the ARIA block cipher on GPUs. To work within the limited size of GPU shared memory, we merged ARIA’s four separate S-box tables into a single unified 32-bit table, reducing total table memory from 4 KB to 1 KB. The consolidated table is small enough to be replicated 32 times in shared memory, which resolves the memory-bank conflicts frequently encountered during parallel execution. We used CUDA’s built-in function __byte_perm() to reconstruct the desired outputs from the unified table without additional computational overhead. For exhaustive key search, we implemented on-the-fly key expansion, which significantly reduces memory usage per thread and improves parallel processing efficiency. Profiling on the RTX 3060 was used to analyze shared memory efficiency and the performance degradation caused by bank conflicts. Experiments on the RTX 3060 Mobile and RTX 4080 GPUs showed significant performance improvements over conventional methods. The RTX 4080 achieved a maximum throughput of 1532.42 Gbps in ARIA-CTR mode. On the RTX 3060, 128-bit ARIA-CTR performance improved by 2.34× compared to previous state-of-the-art implementations. For exhaustive key search on the 128-bit ARIA block cipher, a throughput of 1365.84 Gbps was achieved on the RTX 4080 GPU.
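The unified-table idea can be illustrated with a small host-side sketch. This is not the authors' code. It is a Python emulation written under several assumptions: the S-box contents are toy placeholders (not the real ARIA tables), the byte-lane layout of the unified word is hypothetical, the `spread` helper and its "three copies plus one zero lane" output pattern are illustrative guesses, and `replicated_index` shows only one common way to give each of the 32 warp lanes its own bank. Only the byte-selection rule of `__byte_perm()` (each selector nibble picks one of the eight input bytes) follows CUDA's documented behavior.

```python
# Toy stand-ins for ARIA's four 8-bit S-boxes (placeholders, NOT the real tables).
SB = [[(x * m + a) & 0xFF for x in range(256)]
      for m, a in ((7, 3), (11, 5), (13, 9), (17, 1))]

# Unified table: one 32-bit word per input byte, with S-box k stored in byte
# lane k (hypothetical layout). 256 entries * 4 bytes = 1 KB instead of 4 KB.
UNIFIED = [SB[0][x] | SB[1][x] << 8 | SB[2][x] << 16 | SB[3][x] << 24
           for x in range(256)]


def byte_perm(x, y, s):
    """Host-side emulation of CUDA's __byte_perm(x, y, s)."""
    src = (x & 0xFFFFFFFF) | ((y & 0xFFFFFFFF) << 32)  # bytes 0-3 from x, 4-7 from y
    out = 0
    for lane in range(4):
        sel = (s >> (4 * lane)) & 0x7  # low 3 bits of each selector nibble
        out |= ((src >> (8 * sel)) & 0xFF) << (8 * lane)
    return out


def spread(x, k, zero_lane):
    """Copy S-box k's output for x into every byte lane except zero_lane.

    With y = 0, selector value 4 reads byte 0 of y, which yields a zero byte.
    """
    sel = 0
    for lane in range(4):
        sel |= (4 if lane == zero_lane else k) << (4 * lane)
    return byte_perm(UNIFIED[x], 0, sel)


BANKS = 32  # shared memory banks, each 4 bytes wide


def replicated_index(x, lane_id):
    """Word index of entry x in the copy owned by warp lane lane_id (assumed layout)."""
    return x * BANKS + lane_id


table_bytes = len(UNIFIED) * 4          # 1024 bytes = 1 KB
replicated_bytes = table_bytes * BANKS  # 32 KB for 32 copies
```

In this layout, entry `x` for lane `lane_id` sits at a word index whose remainder modulo 32 equals `lane_id`, so the 32 threads of a warp read from 32 different banks regardless of the value of `x`. Replicating the original 4 KB of tables the same way would take 128 KB, which is why shrinking the table to 1 KB matters for this approach.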