These break the build on TASM32 version 5.0… which means that
th03_main.asm didn't assemble on that version ever since 3072208.
Thanks to mu021 for telling me about the resulting crash!
The exact reason why pellets can be carried over from Sariel's first
form to her second… and why you probably shouldn't carelessly use that
redundant count of alive pellets to skip loops in the Anniversary
Edition, because that count will be incorrect after a reset.
Thanks to mu021 for reporting this issue!
spaztron64 suggested that the GRCG might not always write all 4 planes
if one of the tile registers is 0. By changing the color and comparing
the results of this benchmark, we can prove that real hardware has no
such optimization.
Completes P0233, funded by [Anonymous].
Running this on various PC-98 models confirms that unchecked blitting
(i.e., what you would intuitively consider to be the best method) is in
fact faster than checking either byte of a 16-pixel-wide sprite
beforehand, and has been throughout the PC-98's lifespan. For optimal
performance on the 286 and 386, we might want to use MOVS instead of
MOV, but even that difference is way too small to truly matter.
Also, nice to see turns out that our blitter outperforms a naive pure C
implementation by 2-4×, depending on the model. And master.lib is not
*that* much faster…
The gaiji in `Research/blitperf.bmp` were taken from the Unifont
version 15.0.01 glyphs for:
• U+2022 BULLET •
• U+23F1 STOPWATCH ⏱
• U+1F40C SNAIL 🐌
Part of P0233, funded by [Anonymous].
The fact that every sprite format comes with its own blitter is one of
the major sources of bloat in PC-98 Touhou, and of TH01 in particular.
So how about writing a single decently optimized blitter, and calling
into that from the entire game?
Especially because generating distinct blitting functions for every
width is a much better use of all that memory: It eliminates horizontal
loops, and ensures that we use the optimal MOV variant for each sprite
size. Removing any checks for empty bytes (which will turn out to never
have been a good idea for any PC-98 model ever) and unrolling the main
blitting loop using Duff's Device already gets us something that,
depending on the PC-98 model, is easily 2-4× faster than the typical
naive C implementation you'd find in TH01. With master.lib being not
that faster…
Making more use of C++ templates would have been fancy, but horizontal
sprite clipping can change the blit width depending on runtime values.
So, we're back to X macro code generation after all.
Part of P0233, funded by [Anonymous].
Yup, unaligned! The prefilling case is quite broken on T98-Next, but
given that this emulator hasn't seen any development since 2010 and
every other emulator gets it right, we can reasonably assume that to be
a bug in that emulator.
Completes P0232, funded by [Anonymous].
Choosing C++ RAII wrappers because there's at least one case where ZUN
misplaced a manual grcg_off(). This implementation combines safety with
the optimal instructions for both dynamic and static use cases.
Part of P0232, funded by [Anonymous].
Starting with the simple refresh rate-oblivious code from TH01, until
we've figured out what the rest of the master.lib code is doing and
have valid reasons to include it. Also extending the second counter to
32-bit because we *might* be measuring some processes that could take
longer than 19:35 minutes…
Part of P0232, funded by [Anonymous].
Moving the code from TH01 to a new platform layer, and deciding against
the `pc98_` prefix, which is sort of implied by the directory of the
header file it came from. Namespaces would be ideal, but Turbo C++ 4.0J
sadly doesn't support them.
Part of P0232, funded by [Anonymous].
We'd like to use this optimization in the platform layer as well.
Turning it into an inline function via __emit__() also allows us to
turn a bunch of other macros into proper inline functions.
Part of P0232, funded by [Anonymous].
Over on the `debloated` branch, we're going to use them in our own
platform-specific code, which obviously is not decompiled from
anything.
Part of P0232, funded by [Anonymous].
And dissolve the "vars" units, which would be too annoying to merge on
the `debloated` branch otherwise. This interpretation of these globals
further highlights the type differences between REIIDEN.EXE and
FUUIN.EXE.
Placing them in directly in `resident.hpp`; on the `debloated` branch,
we could neatly define each of these fields in a matching .cpp file,
but that file would need to be compiled twice due to the aforementioned
differences between binaries. Better to keep those in `op_01.cpp`,
`main_01.cpp`, or `fuuin_01.cpp`, respectively.
Part of P0229, funded by Ember2528.
Allows us to use them as switch cases in the `debloated` branch, in
exchange for turning some inline functions to macros.
Part of P0229, funded by Ember2528.
This variable makes more sense if we name it after its one actual and
consistent usage. All others make more sense when interpreted as a bug
or bloat.
Part of P0229, funded by Ember2528.
Same rationale here – this naming scheme clarifies how these variables
are just redundant copies out of the resident structure.
Part of P0229, funded by Ember2528.
The `credit_` prefix may seem redundant within the REIIDEN.CFG
structure, but consistency seems more important here. Makes it much
easier to follow how these fields are copied around.
Part of P0229, funded by Ember2528.
In all places visited during the next 6 pushes: The resident structure
and copies of its values, the packfile implementation, boss entities,
rendering font ROM glyphs to VRAM, and overall inconsistent code
between the three binaries.
Part of P0229, funded by Ember2528.
Neatly dissolves two of the three game-specific preprocessor branches
in `th01/hardware/grppsafx.h`. Moving at least one function into a
corresponding .cpp file will also simplify the corresponding debloating
commit on the respective branch.
Part of P0229, funded by Ember2528.