Running this on various PC-98 models confirms that unchecked blitting
(i.e., what you would intuitively consider to be the best method) is in
fact faster than checking either byte of a 16-pixel-wide sprite
beforehand, and has been throughout the PC-98's lifespan. For optimal
performance on the 286 and 386, we might want to use MOVS instead of
MOV, but even that difference is way too small to truly matter.
Also, nice to see turns out that our blitter outperforms a naive pure C
implementation by 2-4×, depending on the model. And master.lib is not
*that* much faster…
The gaiji in `Research/blitperf.bmp` were taken from the Unifont
version 15.0.01 glyphs for:
• U+2022 BULLET •
• U+23F1 STOPWATCH ⏱
• U+1F40C SNAIL 🐌
Part of P0233, funded by [Anonymous].
The fact that every sprite format comes with its own blitter is one of
the major sources of bloat in PC-98 Touhou, and of TH01 in particular.
So how about writing a single decently optimized blitter, and calling
into that from the entire game?
Especially because generating distinct blitting functions for every
width is a much better use of all that memory: It eliminates horizontal
loops, and ensures that we use the optimal MOV variant for each sprite
size. Removing any checks for empty bytes (which will turn out to never
have been a good idea for any PC-98 model ever) and unrolling the main
blitting loop using Duff's Device already gets us something that,
depending on the PC-98 model, is easily 2-4× faster than the typical
naive C implementation you'd find in TH01. With master.lib being not
that faster…
Making more use of C++ templates would have been fancy, but horizontal
sprite clipping can change the blit width depending on runtime values.
So, we're back to X macro code generation after all.
Part of P0233, funded by [Anonymous].