Scale2x algorithm for smoothing

lennyhome · Post by **lennyhome** » Sat Jun 04, 2022 1:26 pm

For reference:
https://www.scale2x.it/algorithm

In blit.c just after the start of smoothzoomblit_32_to_32bit:

if (smooth == 1 && zoom == 2) {

	uint32_t *srcbuffer2 = (uint32_t *)srcbuffer;
	uint32_t *destbuffer2 = (uint32_t *)destbuffer;

	int x, y;
	
	for (y = 1; y < size.h - 1; y++) {
		for (x = 1; x < size.w - 1; x++) {
		
			uint32_t B = srcbuffer2[(y - 1) * size.w + (x + 0)];
			uint32_t D = srcbuffer2[(y + 0) * size.w + (x - 1)];
			uint32_t E = srcbuffer2[(y + 0) * size.w + (x + 0)];
			uint32_t F = srcbuffer2[(y + 0) * size.w + (x + 1)];
			uint32_t H = srcbuffer2[(y + 1) * size.w + (x + 0)];
		
			uint32_t E0, E1, E2, E3;

			if (B != H && D != F) {
				E0 = D == B ? D : E;
				E1 = B == F ? F : E;
				E2 = D == H ? D : E;
				E3 = H == F ? F : E;
			} else {
				E0 = E;
				E1 = E;
				E2 = E;
				E3 = E;
			}

			destbuffer2[(y * zoom + 0) * pitch + x * zoom + 0] = E0;
			destbuffer2[(y * zoom + 0) * pitch + x * zoom + 1] = E1;
			destbuffer2[(y * zoom + 1) * pitch + x * zoom + 0] = E2;
			destbuffer2[(y * zoom + 1) * pitch + x * zoom + 1] = E3;
		}
	}
	
	return;
}

I did a quick test with Vikings:

A comparison:

Current smoothing algorithm on the left, Scale3x on the right. It fixes the lone pixel in the "C" and "F" letters.

Post by **TMC** » Sun Jun 05, 2022 10:41 am

Thanks! The existing smoothing algorithm was by Reaxor Jones (so let's call it Reax2x/Reax3x), and I have no idea where he got it or whether it was original. It's got a lot of flaws and I've been wanting to replace it for a long time. Flipping back and forth between Reax2x and Scale2x, the new is so much better. Actually the benefit is mainly for text and line art; it's not hugely better or worse for most sprites. It also handles 50% dithers a bit better. Happily it's also only a little slower (2.7ms vs 2.5ms for 2x scale of a view of Mannheim in Vikings at 640x400x8, compiled for x86, on this FX 6150 ~~clocked down to 3GHz no turbo~~).

However, this code doesn't handle a 1 pixel border around the screen edge.

Considering how simple and fast they are, Scale2x and Scale3x compare favourably to many other algorithms, e.g. see this comparison on Wikipedia. It does produce some weird stray pixels though when you have a pattern like

Code: Select all

AB
BA

Although looking at actual reults this seems far less of a problem than I expected. I found a post about how to fix it, unfortunately not very simple, but maybe it can be made not too inefficient on CPU . (That thread is about shaders, has dead links, and the author quickly goes way beyond merely tweaking Scale2x.)

Actually, I was about to look for a 3x scaler to use as a preprocessor before rotation, to reduce artifacts (similar to Rotsprite). That doesn't need to be super fast so I'm hoping for something better than Scale3x which is suitable (e.g. no antialiasing).

lennyhome · Post by **lennyhome** » Sun Jun 05, 2022 12:38 pm

The image border is computed reusing the value of the nearest pixel on the border when the value of a pixel over the border is required.

So as a speed trick it's possible to allocate a slightly larger source buffer, or just do some clamping. It's not a big deal on modern computers.

EDIT: blit.c fully patched to use Scale2x:
https://sourceforge.net/projects/snes9l ... z/download
Megaman sprite game looks really good with Scale3x and the new controls allow the player to follow the path with better precision.

This what I wanted to see. The original smoothing filter made that particular font hard to read.

Post by **TMC** » Tue Jun 07, 2022 5:06 am

OK, this is now a devlog entry...

Wow, umm, never mind about those timings I gave earlier. Because we're going to have to switch to separate builds for Win 2000 and earlier anyway it's time to start requiring CPUs with SSE2 in normal gfx_sdl2 builds. Well with SSE2 enabled, using GCC 10.3, with Scale2x is faster than not smoothing at all! 30% faster for 8bit, 10% faster for 32bit. Stunning. Looks like my attempted optimisation of scaled blitting is actually a pessimisation with SSE2.

I found a number of ways to fix the stay pixels problem in Scale2x which are much simpler than that one I linked to. Scale3x has the identical problem (but I haven't tackled that yet), and in Scale4x (Scale2x applied twice) it gets very noticeable, creating bow-tie shapes. 4x scaling is likely to be common, so I wanted to fix that.

I call my modified algorithm OHR2x:

2x original, Reax2x (prev method), Scale2x, OHR2x

As above, blown up 2x

4x original, Reax4x (prev method), Scale4x, OHR4x

The fastest bowtie fix I found (for 8-bit: only 35% slower with SSE2, 25% slower without) is

Code: Select all

else if (smooth == 1 && zoom == 2) {
	uint8_t *restrict srcbuffer2 = (uint8_t *)srcbuffer;
	uint8_t *restrict destbuffer2 = (uint8_t *)destbuffer;

	for (int y = 1; y < size.h - 1; y++) {
		for (int x = 1; x < size.w - 1; x++) {
			uint8_t A = srcbuffer2[(y - 1) * size.w + (x - 1)];
			uint8_t B = srcbuffer2[(y - 1) * size.w + (x + 0)];
			uint8_t C = srcbuffer2[(y - 1) * size.w + (x + 1)];
			uint8_t D = srcbuffer2[(y + 0) * size.w + (x - 1)];
			uint8_t E = srcbuffer2[(y + 0) * size.w + (x + 0)];
			uint8_t F = srcbuffer2[(y + 0) * size.w + (x + 1)];
			uint8_t G = srcbuffer2[(y + 1) * size.w + (x - 1)];
			uint8_t H = srcbuffer2[(y + 1) * size.w + (x + 0)];
			uint8_t I = srcbuffer2[(y + 1) * size.w + (x + 1)];
			uint8_t E0, E1, E2, E3;

			if (B != H && D != F) {
				E0 = D == B && E != A ? D : E;
				E1 = B == F && E != C ? F : E;
				E2 = D == H && E != G ? D : E;
				E3 = H == F && E != I ? F : E;
			} else {
				E0 = E;
				E1 = E;
				E2 = E;
				E3 = E;
			}
			destbuffer2[(y * zoom + 0) * pitch + x * zoom + 0] = E0;
			destbuffer2[(y * zoom + 0) * pitch + x * zoom + 1] = E1;
			destbuffer2[(y * zoom + 1) * pitch + x * zoom + 0] = E2;
			destbuffer2[(y * zoom + 1) * pitch + x * zoom + 1] = E3;
		}
	}
}

however that links up fewer diagonally-adjacent pixels into lines. So I'll go with this version instead (8-bit: 70% slower with SSE2, 20% slower without):

Code: Select all

else if (smooth == 1 && zoom == 2) {
	uint8_t *restrict outbuf = (uint8_t *)calloc(size.w, 2);
	uint8_t *restrict srcbuffer2 = (uint8_t *)srcbuffer;
	uint8_t *restrict destbuffer2 = (uint8_t *)destbuffer;

	for (int y = 1; y < size.h - 1; y++) {
		for (int x = 1; x < size.w - 1; x++) {
			uint8_t B = srcbuffer2[(y - 1) * size.w + (x + 0)];
			uint8_t D = srcbuffer2[(y + 0) * size.w + (x - 1)];
			uint8_t E = srcbuffer2[(y + 0) * size.w + (x + 0)];
			uint8_t F = srcbuffer2[(y + 0) * size.w + (x + 1)];
			uint8_t H = srcbuffer2[(y + 1) * size.w + (x + 0)];

			uint8_t E0, E1, E2, E3;

			uint8_t B2 = outbuf[x * zoom + 0];
			uint8_t B3 = outbuf[x * zoom + 1];
			bool swapB2 = B2 != B;
			bool swapB3 = B3 != B;

			if (B != H && D != F) {
				E0 = D == B && !swapB2 ? D : E;
				if (D == B) B2 = B;

				E1 = B == F && !swapB3 ? F : E;
				if (B == F) B3 = B;

				E2 = D == H ? D : E;
				E3 = H == F ? F : E;
			} else {
				E0 = E;
				E1 = E;
				E2 = E;
				E3 = E;
			}

			destbuffer2[(y * zoom + 0) * pitch + x * zoom + 0] = E0;
			destbuffer2[(y * zoom + 0) * pitch + x * zoom + 1] = E1;
			destbuffer2[(y * zoom - 1) * pitch + x * zoom + 0] = B2;
			destbuffer2[(y * zoom - 1) * pitch + x * zoom + 1] = B3;
			outbuf[x * zoom + 0] = E2;
			outbuf[x * zoom + 1] = E3;
		}
	}
	free(outbuf);
}

(The temp buffer can be avoided but then it isn't vectorisable.)

An odd thing about Scale2x/3x is that they actually handle 45° diagonals worse than other angles. I invented some Scale2x variants (e.g. "Diag2x" below) that partially improved that but they all caused text to look very angular, so I won't use them, but here's what they look like. I probably should have been concentrating on Scale3x, because its problem with 45° diagonal lines is far worse and all defects are far more noticeable at greater scale.

2x original, OHR2x, Diag2x

As above, blown up 2x

4x original, OHR4x, Diag4x

Another problem with Scale2x is that if a lone pixel is between two diagonal bands of colour it will be squeezed down to just 2 pixels in the scaled version (from the normal 4), and in Scale4x it gets squeezed to 6 pixels from 16. I tried improving that by disallowing squeezing two opposite corners, but it led to too many other problems.

I've been looking into other algorithms (for the non-realtime scaling) but so far all of the ones that I've looked at that give better results than Scalex2/3/4 have on the order of 20x more code. Scalex2/3 seem to occur a prominent position on the complex/quality and probably speed/quality Pareto frontiers.

I only just saw the edit you made to your post with the fully patched file. Thanks, that's excellent!
Could you please make multiple posts rather than such major edits though, I've often missed your edits. Double posting isn't as bad as made out to be.

lennyhome · Post by **lennyhome** » Tue Jun 07, 2022 6:27 am

Comparison between Scale3x on the left and Scale3xSFX on the right:

lennyhome · Post by **lennyhome** » Tue Jun 07, 2022 8:16 am

using GCC 10.3, with Scale2x is faster than not smoothing at all!

See, CPUs are such amazing things. You may accidentally hit a fast path that was secretly put in there at the request of a Microsoft engineer who wanted to give some product some unfair advantage. Because you know, empires don't build themselves.

About the quality/speed tradeoff. ScalexSFX is very good. In my implementation I've skipped the thresholding and blending steps but all the modes (2x, 3x, SFX or not) support thresholding and blending in principle.

For quality over speed I would suggest:
https://github.com/grom358/hqx
It's LGPL-licensed but you could still build a DLL and have the engine load it at runtime. However, thresholding and blending are mandatory and heavy compared to anything else. I know for a fact that it doesn't run at 60 fps on a Raspberry Pi.

In my version of blit.c here:
https://sourceforge.net/projects/snes9l ... z/download
you can easily switch between Scalex and ScalexSFX if you go to line 443 and type in:

Code: Select all

#define SCALE2X(_a) SCALE2XORIG(_a) 
#define SCALE3X(_a) SCALE3XORIG(_a)

or:

Code: Select all

#define SCALE2X(_a) SCALE2XSFX(_a) 
#define SCALE3X(_a) SCALE3XSFX(_a)

I hope you like the re-organization into macros. With that you should be able to experiment with the rules more easily.

An odd thing about Scale2x/3x is that they actually handle 45° diagonals worse than other angles.

Scale3x seems fine with diagonals. I don't know exactly but every algorithm has some quirk.

Post by **TMC** » Tue Jun 07, 2022 4:47 pm

Thanks, most excellent! The deduplication with macros is also very welcome.
For years I've been wanting to add a settings menu where you can toggle smoothing on/off (etc) rather than just an obscure commandline flag. Now I have a lot more reason to. People have also asked for a setting to make their games run like that by defaut.

Thanks also for implementing that Scale2/3xSFX variant, which I didn't bother to do, so that I can compare to my "OHR2/3x" variant. As I expected, I found that they are almost identical, differing very rarely. But my version is 8x faster for me (for 8-bit).

SIMD instructions are fast, but I place the real credit on the GCC developers. Older GCC versions can't do that level of auto-vectorisation.

However, I had assumed that GCC was fully optimising away all the redundancy between the expressions for accessing srcbuffer & destbuffer. Turns out that those row0/row1/row2 pointers you added gave another speed boost, adding up to ~50% faster after I added similar pointers for reading.

Clamping x and y to within image bounds is slow, but it's not necessary to process the first/last row & column, so I changed it to copy them from the neighbouring rows/columns.

Scale3x's problem with diagonal lines only affects isolated width-1 lines: they are lumpy, as seen in the letter "y" in your screenshots, and really blatant in e.g. this.

Yes, HQ3x is very impressive, but I'm undecided. I'm not sure whether I should avoid scaling algorithms that perform blending, or just convert back to 8-bit using the existing 32->8 bit routine I have (which is very fast, used for alpha blending).

lennyhome · Post by **lennyhome** » Wed Jun 08, 2022 12:30 am

For speed, it's a generally a bad idea to write two or three scanlines at once because there's the potential for a cache miss per pixel. If the pitch of the destination is sufficiently large it'll happen and it's more expensive than anything else.

Locality is the single most important factor on modern systems. Math is always fast because CPUs execute those operations out of order and they can draw from a large pool of hardware adders and multipliers.

It used to be the opposite.

EDIT:
The SFX version should be fixed now.

Post by **TMC** » Wed Jun 08, 2022 3:10 pm

Yes, there is a possibility of exhausting L1 cache. (I calculate the worst case, 32-bit 3x scaling has at most a risk of 0.75 cache misses per pixel: Three 64-byte cache lines are read per 64/4 = 16 input pixels, and three cache lines written per 64/4/3 = 5.333 input pixels. In other words, 48 bytes drawn into cache per input pixel.) Strides and ratios between read/write speeds and cache structure make it really complicated to figure out how soon you start clobbering cache lines for input pixels (however, 16kB L1d$ / 48 bytes = 342 pixels), but since all reads and writes are linear wouldn't they be automatically prefetched from L2 anyway?

Testing 32bit 3x scaling I saw no only minor change in time per pixel as I changed the scan line length, about 10% difference between fastest and slowest. I think this is because L1 cache miss latency is actually quite small relative to the CPU cycles per input pixel (about 48).
With 8bit 2x scaling the results are more interesting. It processes 16 input pixels/64 output pixels per inner loop iteration. 16-byte alignment and the extra pixels which don't divide into 16 processed with non-SIMD code seem to make a much bigger difference than cache effects though. For me, it runs in ~2.3 cycles per input pixel when the input width is 320, ~2.75 for width 321, ~3.3 for width 330 (didn't see higher than that), ~2.5 for width 400, ~2.4 for width 1200, etc.

Oh, I see you also did some other code cleanup, which I've taken.
I've been playing around with Scale3x (having ignored it before) and I see there really are a lot of problems with it, such as isolated pixels squeezed to just 3 output pixels (instead of 9). So I don't think anyone should bother with Scale3xSFX, that's a lot of complexity to make it just slightly better.
I tried making some modifications and ended up rewriting it. I got some nice results while still looking at only the neighbouring 8 pixels; I have Scale3x totally beat for text, and improved in other ways, but still a couple things not right. However the number of rules/patterns has grown greatly and I don't know whether it can be simplified. Maybe two passes would be more efficient.

Scale3x:

Mine:

Or with sharper corners (would be possible to also smooth out just the holes in 'o', 'P', etc), I think this looks so good that I'm going to add it (and the version above) as a feature for upscaling fonts!

Note also that Scale3x has lots of bowtie artifacts in the box border.

lennyhome · Post by **lennyhome** » Wed Jun 08, 2022 3:40 pm

I would look into LQX (HQX without thresholding and blending) for quality maybe. I've noticed that the implementations are machine written and probably more correct.

When attempting to implement thresholding and blending I've noticed the the 8 to 8 version of the blit function is not given a palette. It should be given a palette and an updated color table in case you ever want to use HQX.

Whenever possible you should iterate over the pixel at the destination rather than the pixels at the source. That's what scanline renderers such as Cairo do. In fact I think it's possible to re-write Scalex to work like that and it would be interesting to see if that improves anything.

I don't know whether it can be simplified.

I think the HQX/BRZ guys have some rather large Python scripts with grammars and whatnot that they use to write down the actual code. It seems that the ScalexSFX guy attempted to do it by hand and he was 90% successful, but I made mistakes while copying the rules and I also tried to swap some letters at random and I couldn't see any difference sometimes. That tells me there's redundancy.

You may love it, you may hate it. I'm ok with it.

Slime Salad

Scale2x algorithm for smoothing

Scale2x algorithm for smoothing

Re: Scale2x algorithm for smoothing

Re: Scale2x algorithm for smoothing

Re: Scale2x algorithm for smoothing

Re: Scale2x algorithm for smoothing

Re: Scale2x algorithm for smoothing

Re: Scale2x algorithm for smoothing

Re: Scale2x algorithm for smoothing

Re: Scale2x algorithm for smoothing

Re: Scale2x algorithm for smoothing