Use loop unrolling
The actual work inside the loop takes only 3 cycles; the other 3 cycles are spent on SUBI and BNEZ, which are overhead to the computation.
A simple solution is to increase the number of useful instructions relative to the branch and the other loop-overhead instructions.
How? Replicate the loop body multiple times and adjust the loop termination code.
This may also increase the overlap possible among iterations, because instructions from different copies of the body can be scheduled together.
Constraint: use different registers for different iterations (this increases the number of registers required).
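Assume R1 is initially a multiple of 32, which means the number of iterations is a multiple of 4 (R1 steps by 8 per iteration, and 32 / 8 = 4), so a loop unrolled four times needs no cleanup code.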
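
The idea above can be sketched in C. This is only an illustration, not the original assembly: it assumes the loop adds a scalar to each element of an array of doubles, and the names add_scalar, add_scalar_unrolled4, x, n, and s are hypothetical. The temporaries t0..t3 stand in for the "different registers for different iterations", and the assert encodes the assumption that the iteration count is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled version (assumed loop body): one element per trip, so the
   index update and the branch are paid once per element. */
void add_scalar(double *x, size_t n, double s)
{
    for (size_t i = n; i > 0; i--) {
        x[i - 1] = x[i - 1] + s;
    }
}

/* Unrolled by four: the body is replicated four times and the loop
   termination code is adjusted to step by 4, so the index update and
   the branch are paid once per four elements. */
void add_scalar_unrolled4(double *x, size_t n, double s)
{
    /* Assumption from the notes: iteration count is a multiple of 4. */
    assert(n % 4 == 0);

    for (size_t i = n; i > 0; i -= 4) {
        /* A separate temporary per copy of the body, so the four
           copies do not depend on each other through a shared name. */
        double t0 = x[i - 1] + s;
        double t1 = x[i - 2] + s;
        double t2 = x[i - 3] + s;
        double t3 = x[i - 4] + s;

        x[i - 1] = t0;
        x[i - 2] = t1;
        x[i - 3] = t2;
        x[i - 4] = t3;
    }
}
```

Grouping the four loads and adds ahead of the four stores mirrors the extra overlap among iterations mentioned above; whether a given compiler or pipeline actually exploits it depends on the target.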