The problem is that the bus arbiter spends time on arbitration for EACH CYCLE.
To get higher throughput, you HAVE TO use the DMA controller and transfer large blocks of data at a time, so that the arbitration cost is not paid again on every cycle. This type of transfer is NOT possible with software; you need DMA.
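
To make the difference concrete, here is a rough back-of-the-envelope model. It is only a sketch: the function name and all numbers (4 arbitration cycles, 1 data cycle per word, 256-word blocks) are made up for illustration and are not taken from any datasheet, and it assumes that a DMA block transfer pays the arbitration cost once per block.

```python
def transfer_cycles(words, arbitration_cycles, data_cycles, burst_length):
    """Estimate total bus cycles to move `words` words.

    Assumption: arbitration is paid once per burst of `burst_length` words,
    and every word then costs `data_cycles` on the bus.
    """
    bursts = -(-words // burst_length)  # ceiling division
    return bursts * arbitration_cycles + words * data_cycles


# Hypothetical numbers, for illustration only.
WORDS = 1024
ARBITRATION_CYCLES = 4
DATA_CYCLES = 1

# Software-style access: arbitration on every single cycle (burst of 1).
single = transfer_cycles(WORDS, ARBITRATION_CYCLES, DATA_CYCLES, burst_length=1)

# DMA-style access: arbitration once per 256-word block.
block = transfer_cycles(WORDS, ARBITRATION_CYCLES, DATA_CYCLES, burst_length=256)

print("single-cycle accesses:", single, "bus cycles")
print("256-word DMA blocks:  ", block, "bus cycles")
```

With these invented numbers the single-cycle case is dominated by arbitration, while in the block case the arbitration cost almost disappears. The real ratio depends on your arbiter and bus timing.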
regards
Wolfgang