provide a NEON version of arm/sgemm #5800

Open
notaz wants to merge 4 commits into OpenMathLib:develop from notaz:armv7_sgemm
notaz commented May 5, 2026

Surprisingly, OpenBLAS lacks NEON-optimized kernels for armv7, even though its build enables NEON by default: it passes -mfpu=neon to the compiler, so the compiler is free to use NEON instructions wherever it can.

The speedup on Cortex-A76 is significant, roughly 3.3x in benchmark/sgemm.goto. Before:

 M= 200, N= 200, K= 200 :     9262.97 MFlops   0.001727 sec

after:

 M= 200, N= 200, K= 200 :    30223.64 MFlops   0.000529 sec
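As a sanity check on these figures, here is a minimal sketch of the arithmetic behind them. It assumes the conventional flop count for real GEMM, 2*M*N*K (one multiply and one add per inner-product term); the helper names `gemm_mflops` and `speedup` are hypothetical and not taken from the OpenBLAS benchmark sources.

```c
/* Conventional flop count for real GEMM: one multiply and one add
 * for each (m, n, k) triple, i.e. 2*M*N*K operations in total.
 * This is an assumption about how the benchmark counts work. */
static double gemm_mflops(double m, double n, double k, double seconds)
{
    return 2.0 * m * n * k / seconds * 1e-6;
}

/* Ratio of two throughput figures measured on the same problem size. */
static double speedup(double before_mflops, double after_mflops)
{
    return after_mflops / before_mflops;
}
```

With M = N = K = 200, the printed times of 0.001727 s and 0.000529 s give about 9265 and 30246 MFlops, which agrees with the reported values up to the rounding of the printed times. `speedup(9262.97, 30223.64)` comes out at about 3.26.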

notaz added 4 commits May 5, 2026 22:36
Non-local labels interfere with profiling. The same was done for arm64 in
commit a0128aa.

According to ARM AAPCS (Procedure Call Standard) 5.1.2.1, only registers
s16-s31 must be preserved across subroutine calls; registers s0-s15
do not need to be preserved.

benchmark/sgemm.goto before:
 M= 200, N= 200, K= 200 :     9262.97 MFlops   0.001727 sec
after:
 M= 200, N= 200, K= 200 :    30223.64 MFlops   0.000529 sec

Conveniently the registers are already allocated suitably for vector
operation, so the conversion from vfpv3 was rather straightforward.

Prefetching was left out because it doesn't help Cortex-A76,
only hurts it slightly.
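
To illustrate what moving from scalar VFP multiply-accumulates to vector operation means, here is a rough sketch written with C NEON intrinsics. It is not the kernel from this pull request, which is hand-written assembly. The function name `sgemm_kernel_4x4`, the 4x4 tile size, and the packed layout of `a` and `b` are all assumptions made for the example. A scalar fallback is included so the code also compiles where NEON is not enabled.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#endif

/* Hypothetical 4x4 micro-kernel computing C += alpha * A * B.
 *   a: packed panel, 4 floats (one column of the A tile) per k step
 *   b: packed panel, 4 floats (one row of the B tile) per k step
 *   c: column-major 4x4 tile with leading dimension ldc */
static void sgemm_kernel_4x4(size_t k, float alpha,
                             const float *a, const float *b,
                             float *c, size_t ldc)
{
#ifdef SKETCH_HAVE_NEON
    /* One q register accumulates one column of the C tile. */
    float32x4_t acc0 = vdupq_n_f32(0.0f);
    float32x4_t acc1 = vdupq_n_f32(0.0f);
    float32x4_t acc2 = vdupq_n_f32(0.0f);
    float32x4_t acc3 = vdupq_n_f32(0.0f);

    for (size_t p = 0; p < k; p++) {
        float32x4_t av = vld1q_f32(a + 4 * p);
        float32x4_t bv = vld1q_f32(b + 4 * p);
        /* Each instruction performs four multiply-accumulates. */
        acc0 = vmlaq_lane_f32(acc0, av, vget_low_f32(bv), 0);
        acc1 = vmlaq_lane_f32(acc1, av, vget_low_f32(bv), 1);
        acc2 = vmlaq_lane_f32(acc2, av, vget_high_f32(bv), 0);
        acc3 = vmlaq_lane_f32(acc3, av, vget_high_f32(bv), 1);
    }

    /* Scale by alpha and add into the existing C tile. */
    vst1q_f32(c + 0 * ldc, vmlaq_n_f32(vld1q_f32(c + 0 * ldc), acc0, alpha));
    vst1q_f32(c + 1 * ldc, vmlaq_n_f32(vld1q_f32(c + 1 * ldc), acc1, alpha));
    vst1q_f32(c + 2 * ldc, vmlaq_n_f32(vld1q_f32(c + 2 * ldc), acc2, alpha));
    vst1q_f32(c + 3 * ldc, vmlaq_n_f32(vld1q_f32(c + 3 * ldc), acc3, alpha));
#else
    /* Scalar fallback: same arithmetic, one multiply-accumulate at a time. */
    float acc[4][4] = {{0.0f}};

    for (size_t p = 0; p < k; p++)
        for (size_t j = 0; j < 4; j++)
            for (size_t i = 0; i < 4; i++)
                acc[j][i] += a[4 * p + i] * b[4 * p + j];

    for (size_t j = 0; j < 4; j++)
        for (size_t i = 0; i < 4; i++)
            c[j * ldc + i] += alpha * acc[j][i];
#endif
}
```

In the NEON path each `vmlaq_lane_f32` stands in for four scalar multiply-accumulates, which is the kind of gain the benchmark numbers above reflect. Prefetching is omitted here as well, in line with the note above.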