ReC98/th01/grps2xsc.c

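/* Doubles every bit of [src16] horizontally (bit i ends up in bits 2i and
 * 2i+1), then swaps the two bytes within each 16-bit half of the result, so
 * that a little-endian 32-bit write to VRAM keeps the leftmost pixels in the
 * first byte. */
void scale_2x(unsigned long *dst32, int src16)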
{
	unsigned long mask = 1;
	unsigned long srcex = 0;
	unsigned long dst_local;
	int i;

	srcex = src16;
	dst_local = 0;
	*dst32 = 0;
	for(i = 0; i < 16; i++) {
		dst_local |= _lrotl(srcex & mask, ((i * 2) + 0) - i);
		dst_local |= _lrotl(srcex & mask, ((i * 2) + 1) - i);
		mask = _lrotl(mask, 1);
	}
	mask = 0x00FF00FF;	*dst32 |= _lrotl(dst_local & mask, 8);
	mask = 0xFF00FF00;	*dst32 |= _lrotr(dst_local & mask, 8);
}

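/* Scales the (w1 * h1) rectangle at (x1, y1) on VRAM page 1 to twice its
 * size and draws it at (x0, y0) on page 0. Works in units of 16 source
 * pixels, and writes every scaled source row to two destination rows.
 * Source pixels of color 0 leave the destination untouched. */
void graph_slow_2xscale_rect_1_to_0(int x0, int y0, int x1, int y1, int w1, int h1)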
{
	int row_p1 = (x1 / 8) + (y1 * ROW_SIZE);
	int row_p0 = (x0 / 8) + (y0 * ROW_SIZE);
	int col16;
	int row;
	vram_planar_16_pixels_t px16;
	int px16_nonzero;

	for(row = 0; row < h1; row++) {
		int p0 = row_p0;
		int p1 = row_p1;
		for(col16 = 0; col16 < w1 / 16; col16++) {
			int scale_p;
			graph_accesspage(1);
			px16.B = *(int*)(VRAM_PLANE_B + p1);
			px16.R = *(int*)(VRAM_PLANE_R + p1);
			px16.G = *(int*)(VRAM_PLANE_G + p1);
			px16.E = *(int*)(VRAM_PLANE_E + p1);
			px16_nonzero = px16.B | px16.R | px16.G | px16.E;
			for(scale_p = 0; scale_p < ROW_SIZE * 2; scale_p += ROW_SIZE) {
				unsigned long dst32;
				unsigned long px32_nonzero;

				graph_accesspage(0);
				scale_2x(&px32_nonzero, px16_nonzero);
				grcg_setcolor_rmw(0);
				*(long*)(VRAM_PLANE_B + p0 + scale_p) = px32_nonzero;
				grcg_off();

				scale_2x(&dst32, px16.B);
				*(long*)(VRAM_PLANE_B + p0 + scale_p) |= dst32;

				scale_2x(&dst32, px16.R);
				*(long*)(VRAM_PLANE_R + p0 + scale_p) |= dst32;

				scale_2x(&dst32, px16.G);
				*(long*)(VRAM_PLANE_G + p0 + scale_p) |= dst32;

				scale_2x(&dst32, px16.E);
				*(long*)(VRAM_PLANE_E + p0 + scale_p) |= dst32;
			}
			p1 += 2;
			p0 += 4;
		}
		row_p0 += ROW_SIZE * 2;
		row_p1 += ROW_SIZE;
	}
}
[C decompilation] [th01/fuuin] Slow 2x VRAM region scaling

This function raises one of those essential questions about the eventual ports we'd like to do. I'll explain everything more thoroughly here, since people who might complain about the ports not being faithful enough need to understand this.

----

The original plan was to aim for "100% frame-perfect" ports and advertise them as such. However, the PC-98 is not a console with fixed specs. As the name implies, it's a computer architecture, and a plethora of different, more and more powerful PC-98 models were released during its lifespan. Even if we only consider the subset of products that fulfills the minimum requirements to run the PC-98 Touhou games, that's still a sizable number of systems.

Therefore, the only true definition of a frame can be "everything that is drawn between two Vsync wait calls". Such a frame may contain certain expensive function calls, and certain systems may run these functions slower than the developer expected, thus effectively leading to more frames than the developer explicitly specified.

This is one of those functions. Here, we have a scaling function that appears to be written deliberately to run very slowly, which ends up creating the rolling effect you see in the route selection and the high score and continue screens of TH01. However, that doesn't change the fact that the function is still CPU-bound, and neither waits for Vsync nor is iteratively called by something that does. The faster your CPU, the faster the rolling effect gets… until ultimately, it's faster than one frame and therefore vanishes altogether. Mind you, this is true on both emulators and real hardware. The final PC-98 model, the Ra43, had a CPU clocked at 433 MHz, and it may have even been instant there. If you use a more optimized algorithm, it also runs faster on the same CPU (I tried this, and it worked beautifully)… you get the idea.

Still, it may very well be that this algorithm was not a deliberate choice and simply resulted from a lack of experience, especially since this was ZUN's first game. That leaves us with two approaches to porting functions like these:

1) Look at the recommended system requirements ZUN specified, configure the PC-98 emulator accordingly, measure how much of the work is done in each frame, then rewrite the function to be bound to that specific frame rate…
2) …or just continue using a CPU-bound algorithm, which will pretty much complete instantly on any modern system.

I'd argue that 2) is actually the more "faithful" approach. It will run faster than the typical clock speeds people emulate the games at, and maybe draw a bit of criticism because of that, but it seems a lot more rational than the approximation provided by 1). Not to mention that it's undeniably easier to implement, and hey, a faster game feels a lot better than a slower one, right?

… Oh well, maybe we'll still encounter some kind of CPU-bound animation that is so essential to the experience that we do want to lock it to a certain frame rate…

2015-03-09 16:58:30 +00:00			`void scale_2x(unsigned long *dst32, int src16)`
			`{`
			`unsigned long mask = 1;`
			`unsigned long srcex = 0;`
			`unsigned long dst_local;`
			`int i;`

			`srcex = src16;`
			`dst_local = 0;`
			`*dst32 = 0;`
			`for(i = 0; i < 16; i++) {`
			`dst_local |= _lrotl(srcex & mask, ((i * 2) + 0) - i);`
			`dst_local |= _lrotl(srcex & mask, ((i * 2) + 1) - i);`
			`mask = _lrotl(mask, 1);`
			`}`
			`mask = 0x00FF00FF; *dst32 |= _lrotl(dst_local & mask, 8);`
			`mask = 0xFF00FF00; *dst32 |= _lrotr(dst_local & mask, 8);`
			`}`
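
The commit message above mentions a more optimized algorithm without showing one. As a rough sketch, assuming a C99 compiler with `<stdint.h>` rather than Turbo C++: `scale_2x_ref()` restates the loop above with portable shifts, and `scale_2x_fast()` gets the same result from a handful of shift-and-mask steps. Both names are made up for illustration, and the bit-spreading trick is a swapped-in textbook technique, not necessarily the one that was actually tried back then.

```c
#include <stdint.h>

/* Portable restatement of scale_2x(): doubles every bit of [src16], then
 * swaps the two bytes within each 16-bit half of the result. */
static uint32_t scale_2x_ref(uint16_t src16)
{
	uint32_t doubled = 0;
	int i;
	for(i = 0; i < 16; i++) {
		uint32_t bit = (((uint32_t)src16 >> i) & 1);
		doubled |= ((bit << (i * 2)) | (bit << ((i * 2) + 1)));
	}
	return (((doubled & 0x00FF00FFu) << 8) | ((doubled & 0xFF00FF00u) >> 8));
}

/* Hypothetical faster variant: moves bit i to bit 2i with four
 * shift-and-mask steps, then copies every bit into its left neighbor. */
static uint32_t scale_2x_fast(uint16_t src16)
{
	uint32_t x = src16;
	x = ((x | (x << 8)) & 0x00FF00FFu);
	x = ((x | (x << 4)) & 0x0F0F0F0Fu);
	x = ((x | (x << 2)) & 0x33333333u);
	x = ((x | (x << 1)) & 0x55555555u);
	x |= (x << 1);
	return (((x & 0x00FF00FFu) << 8) | ((x & 0xFF00FF00u) >> 8));
}
```

For a `src16` of 0x0001, both return 0x00000300: the doubled bits 0 and 1 end up in the second byte because of the final byte swap. The fast variant needs no loop at all, which is exactly the kind of change that would make the rolling effect disappear even on the original hardware.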

Rename the _copy_region_ functions to _copy_rect_

TH01 copies a lot of different shapes from plane 1 to 0, so "region" feels awfully unspecific.

2015-03-10 12:59:12 +00:00			`void graph_slow_2xscale_rect_1_to_0(int x0, int y0, int x1, int y1, int w1, int h1)`
			`{`
			`int row_p1 = (x1 / 8) + (y1 * ROW_SIZE);`
			`int row_p0 = (x0 / 8) + (y0 * ROW_SIZE);`
			`int col16;`
			`int row;`
			`vram_planar_16_pixels_t px16;`
			`int px16_nonzero;`

			`for(row = 0; row < h1; row++) {`
			`int p0 = row_p0;`
			`int p1 = row_p1;`
			`for(col16 = 0; col16 < w1 / 16; col16++) {`
			`int scale_p;`
			`graph_accesspage(1);`
			`px16.B = *(int*)(VRAM_PLANE_B + p1);`
			`px16.R = *(int*)(VRAM_PLANE_R + p1);`
			`px16.G = *(int*)(VRAM_PLANE_G + p1);`
			`px16.E = *(int*)(VRAM_PLANE_E + p1);`
			`px16_nonzero = px16.B | px16.R | px16.G | px16.E;`
			`for(scale_p = 0; scale_p < ROW_SIZE * 2; scale_p += ROW_SIZE) {`
			`unsigned long dst32;`
			`unsigned long px32_nonzero;`

			`graph_accesspage(0);`
			`scale_2x(&px32_nonzero, px16_nonzero);`
			`grcg_setcolor_rmw(0);`
			`*(long*)(VRAM_PLANE_B + p0 + scale_p) = px32_nonzero;`
			`grcg_off();`

			`scale_2x(&dst32, px16.B);`
			`*(long*)(VRAM_PLANE_B + p0 + scale_p) |= dst32;`

			`scale_2x(&dst32, px16.R);`
			`*(long*)(VRAM_PLANE_R + p0 + scale_p) |= dst32;`

			`scale_2x(&dst32, px16.G);`
			`*(long*)(VRAM_PLANE_G + p0 + scale_p) |= dst32;`

			`scale_2x(&dst32, px16.E);`
			`*(long*)(VRAM_PLANE_E + p0 + scale_p) |= dst32;`
			`}`
			`p1 += 2;`
			`p0 += 4;`
			`}`
			`row_p0 += ROW_SIZE * 2;`
			`row_p1 += ROW_SIZE;`
			`}`
			`}`
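
Approach 1) from the commit message could look roughly like the sketch below: the row loop of `graph_slow_2xscale_rect_1_to_0()` becomes a resumable job that only handles a fixed number of rows per call. Everything here is hypothetical, including `scale_job_t`, `scale_job_step()`, and the `count_row()` stand-in. `rows_per_frame` is assumed to come from measuring the original on whatever reference system one picks, which is precisely the approximation the commit message is skeptical about.

```c
/* Hypothetical resumable scaling job; all names are made up. */
typedef struct {
	int row;            /* next source row to scale */
	int rows_total;     /* h1 in graph_slow_2xscale_rect_1_to_0() */
	int rows_per_frame; /* assumed to be measured on a reference system */
} scale_job_t;

typedef void scale_row_func_t(int row, void *ctx);

/* Scales at most [rows_per_frame] rows (but at least one), and returns
 * nonzero as long as rows are left for a later frame. */
static int scale_job_step(
	scale_job_t *job, scale_row_func_t *scale_row, void *ctx
)
{
	int budget = ((job->rows_per_frame < 1) ? 1 : job->rows_per_frame);
	int end = (job->row + budget);
	if(end > job->rows_total) {
		end = job->rows_total;
	}
	for(; job->row < end; job->row++) {
		scale_row(job->row, ctx);
	}
	return (job->row < job->rows_total);
}

/* Stand-in for the real per-row work: only counts the rows it is given. */
static void count_row(int row, void *ctx)
{
	(void)row;
	(*(int *)ctx)++;
}
```

A caller would invoke `scale_job_step()` once per frame and wait for Vsync in between, until it returns 0. With 100 rows and a budget of 8 rows per frame, the effect would then take 13 frames regardless of CPU speed.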