ReC98/CONTRIBUTING.md

15 KiB
Raw Blame History

Welcome!

If we've seen you doing any kind of reverse-engineering or modding work on the PC-98 Touhou games before, you might have already been invited as a collaborator. In that case, feel free to create separate branches for your work directly in this repository this will immediately inform anyone who watches this repo or subscribed to a webhook. If you prefer, you can still use your own fork though.

What can I do on these separate branches?

Anything reverse-engineering and decompilation of original ZUN code (which then could be merged back into master after review) or your own custom mods, no matter how large or small.

For starters, simply naming functions or global variables to reflect their actual intent will already be helpful. Any name is better than sub_<something>, and can always be fixed or improved later.

Contribution guidelines

Rule #1

master must never introduce code changes that change the decompressed program image, or the unordered set of relocations, of any original game binary, as compared using mzdiff. The only allowed exceptions are:

  1. different encodings of identical x86 instructions within code segments
  2. padding with 00 bytes at the end of the file.

These cases should gradually be removed as development goes along, though.

Taste issues

  • Use tabs for indentation.

  • Spaces for alignment are allowed, especially if they end up giving the code a nice visual structure, e.g. with multiple calls to the same function with varying pixel coordinates.

  • Don't indent extern "C" blocks that span the entire file.

  • Always use { brackets }, even around single-statement conditional branches and single-instruction inline assembly.

  • Add spaces around binary operators. for(i = 0; i < 12; i++)

  • Variables should be signed in the absence of any ASM instruction (conditional jump, arithmetic, etc.) or further context (e.g. parameters with a common source) that defines their signedness. If a variable is used in both signed and unsigned contexts, declare it as the more common one.

Compatibility

  • Despite the games' native encoding being Shift-JIS, most files should be encoded in UTF-8, as it's simply more comfortable to work with in modern tools. The only (necessary) exceptions are

    • the big .ASM dump files in the root directory,
    • and any files in the per-game shiftjis/ subdirectory. All hardcoded Shift-JIS strings should be put there. With files full of Shift-JIS text, it's also easier to see when an editor didn't recognize the encoding, which keeps the annoyance from accidentally destroyed files to a minimum.
  • Use _asm as the keyword for decently sane or temporary inline assembly. This variation has the biggest compiler support, which will ease potential future ports to other x86 systems:

    Compiler support asm _asm __asm
    Microsoft QuickC 2.51
    Turbo C++ 4.0
    Borland C++ 5.5.1
    Open Watcom 2.0
    Visual Studio 2022
    Clang 13 (default)
    Clang 13 (with -fms-extensions)
    • Conversely, use asm as the keyword for the particularly dumb small pieces of inline assembly that refer to or depend on register pseudovariables from surrounding code, and are just needed to ensure correct code generation. These should break on other compilers.

      Example:

      _CX = loop_count;
      loop_label: {
          // …
      
          // `asm`, with no underscore, because the x86 LOOP instruction
          // branches depending on the value in CX, which was set using a
          // pseudovariable access above.
          asm { loop	loop_label; }
      }
      

Build system

  • Whenever you edit the Tupfile, run tup generate Tupfile.bat to update the dumb batch fallback script, for systems that can't run Tup.

Code organization

  • Try to avoid repeating numeric constants after all, easy moddability should be one of the goals of this project. For local arrays, use sizeof() if the size can be expressed in terms of another array or type. Otherwise, #define a macro if there is a clear intent behind a number. (Counterexample: Small, insignificant amounts of pixels in e.g. entity movement code.)

  • Try rewriting padding instructions in ASM land into TASM directives:

    • db 0 / NOPeven / align 2
    • db ?evendata

    This makes mzdiffs a bit shorter in common cases where a single byte was erroneously added somewhere, by providing a chance for the code to catch up to its original byte positions.

  • Documenting function comments exclusively go into C/C++ header files, right above the corresponding function prototype, not into ASM slices.

  • If an ASM translation unit requires the .MODEL directive and uses 32-bit 80386 instructions via .386, make sure to specify the USE16 model modifier, as in

    .model use16 large
    

    Otherwise, some TASM versions might create 32-bit segments if .386 is specified before .MODEL, causing all sorts of issues and messing up segment alignments. (TASM32 version 5.3 is known to do this, for example.) Specifying USE16 is a lot more understandable than switching back and forth between CPUs, or relying on the order of the .MODEL and .386 directives to imply the default 16-bit behavior.

  • Newly named symbols in ASM land (functions, global variables, structs, and "sequence of numeric equate" enums) should immediately be reflected in C/C++ land, with the correct types and calling conventions. Typically, these definitions would go into header files, but they can stay in .c/.cpp files if they aren't part of a public interface, i.e., not used by unrelated functions.

  • Compress calls to known functions in ASM land to use TASM's one-line, interfaced call syntax, whenever all parameters are passed via consecutive PUSH instructions:

    • pascal:

      push param1
      push param2
      call foo
      call foo pascal, param1, param2
    • __cdecl, single call, single parameter:

      push param1
      call foo
      pop  cx
      call foo stdcall, param1
      pop  cx
    • __cdecl, single call, multiple parameters:

      push param2
      push param1
      call foo
      add  sp, 4
      call foo c, param1, param2
    • __cdecl, single call, 32-bit parameters (Note that you have to use large whenever a parameter happens to be 32-bit, even if the disassembly didn't need it):

      push  012345678h
      pushd param1
      call  foo
      add   sp, 8
      call foo c, large param1, large 012345678h
    • __cdecl, multiple calls with a single add sp instruction for their combined parameter size at the end:

      push  param2
      push  param1
      call  foo
      […]
      push  param2
      pushd param1
      call  bar
      add   sp, 0Ah
      call foo stdcall, param1, param2
      […]
      call bar stdcall, large param1, param2
      add  sp, 10
  • In ASM functions with ZUN's silly MOV BX, SP stack frame, use the arg_bx and ret_bx macros from th03/arg_bx.inc to declare parameters and return with the correct amount of bytes released from the stack. The parameter names only get a single @ as their prefix in this case:

    foo proc near
    arg_2 = byte ptr 2
    arg_0 = word ptr 4

    mov bx, sp
    mov al, ss:[bx+arg_2]
    mov bx, ss:[bx+arg_0]
    ret 2
    foo endp
    foo proc near
    arg_bx near, @arg_2:byte, @arg_0:word



    mov al, @arg_0
    mov bx, @arg_2
    ret_bx
    foo endp
  • Try moving repeated sections of code into a separate inline function before grabbing the #define hammer. Turbo C++ will generally inline everything declared as inline that doesn't contain do, for, while, goto, switch, break, continue, or case.

  • These inlining rules also apply to C++ class methods, so feel free to declare classes if you keep thinking "overloaded operators would be nice here" or "this code would read really nicely if this functionality was encapsulated in a method". (Sometimes, you will have little choice, in fact!) Despite Turbo C++'s notoriously outdated C++ implementation, there are quite a lot of possibilites for abstractions that inline perfectly. Subpixels, as seen in 9d121c7, are the prime example here. Don't overdo it, though use classes where they meaningfully enhance the original procedural code, not to replace it with an overly nested, "enterprise-y" class hierarchy.

  • Use #pragma option -zC and #pragma option -zP to rename code segments and their groups, not #pragma codeseg. Might look uglier, but has the advantage of not generating an empty segment with the default name and the default padding. This is particularly relevant if the -WX option is used to enforce word-aligned code segments: That empty default segment would otherwise also (unnecessarily) enforce word alignment for the segment that ends up following the empty default one.

    • These options can only be used "at the beginning" of a translation unit before the first non-preprocessor and non-comment C language token. Any other #pragma option settings should also be put there.

Decompilation

  • Don't try to decompile self-modifying code. Yes, it may be possible by calculating addresses relative to the start of the function, but as soon as someone starts modding or porting that function, things will crash at runtime. Inline ASM in C/C++ source files is fine, that will trip up future port developers at compile time. Self-modifying code can only do the same if it's kept in separate ASM files.

  • Don't use TCC's -a command-line option to force a particular code or data alignment. Instead, directly spell out the alignment by adding padding members to structures, and additional global variables. It's simply not worth requiring every structure to work around it. For functions with switch tables that originally were word-aligned, put a single #pragma option -a2 after all header inclusions.

Portability

  • Use __seg * wherever it doesn't make the code all too ugly. Type conversions into far pointers automatically set the offset to 0, so MK_FP is not necessary in such a case:

    void resident_set(resident __seg *seg)
    {
      // Redundant, and requires the MK_FP() macro to be declared
      resident_t far *resident = MK_FP(seg, 0);
    
      // Does the same, without requiring a macro
      resident_t far *resident = seg;
    }
    
  • All original .EXE binaries use the large memory model, meaning that both function and data pointers are far by default. Therefore, pointers and references should only explicitly be declared far if

    1. they are actually constructed from a segment and an offset via the two methods above, or
    2. the code performs segment/offset arithmetic on them.

Naming conventions

  • ASM file extensions: .asm if they emit code, .inc if they don't
  • Macros defining the number of instances of an entity: <ENTITY>_COUNT
  • Macros defining the number of distinct sprites in an animation: *_CELS
  • Frame variables counting from a frame count to 0: *_time
  • Frame variables and other counters starting from 0: *_frames
  • Generic 0-based IDs: *_id
  • Generic 1-based IDs, with 0 indicating some sort of absence: *_num
  • Functionally identical reimplementations or micro-optimizations of master.lib functions: z_<master.lib function name>

Identifiers from ZUN's original code

On some occasions, ZUN leaked pieces of the actual PC-98 Touhou source code during interviews. From these, we can derive ZUN's original names for certain variables, functions, or macros. To indicate one of those, put a /* ZUN symbol [reference] */ comment next to the declaration and definition of the identifier in question. Preserving any aspect from leaked ZUN code just for the sake of it is not mandatory though, and in fact tends to make the resulting code harder to understand. If you can come up with a better (or less wrong) name, go for it.

Currently, we know about the following [references]:

  • [Strings]: The symbol name is mentioned in error or debug messages. Can be easily verified by grepping over the ReC98 source tree.
  • [MAGNet2010]: Interview with ZUN for the NHK BS2 TV program MAG・ネット (MAG.Net), originally broadcast 2010-05-02. At 09m36s, ZUN's monitor briefly displays a piece of TH04's MAIN.EXE, handling demo recording and the setup of the game's EMS area.