OK, this is now a devlog entry...
Wow, umm, never mind about those timings I gave earlier. Because we're going to have to switch to separate builds for Win 2000 and earlier anyway it's time to start requiring CPUs with SSE2 in normal gfx_sdl2 builds. Well with SSE2 enabled, using GCC 10.3,
with Scale2x is faster than not smoothing at all! 30% faster for 8bit, 10% faster for 32bit. Stunning. Looks like my attempted optimisation of scaled blitting is actually a pessimisation with SSE2.
I found a number of ways to fix the stay pixels problem in Scale2x which are much simpler than that one I linked to. Scale3x has the identical problem (but I haven't tackled that yet), and in Scale4x (Scale2x applied twice) it gets very noticeable, creating bow-tie shapes. 4x scaling is likely to be common, so I wanted to fix that.
I call my modified algorithm OHR2x:
2x original, Reax2x (prev method), Scale2x, OHR2x
As above, blown up 2x
4x original, Reax4x (prev method), Scale4x, OHR4x
The fastest bowtie fix I found (for 8-bit: only 35% slower with SSE2, 25% slower without) is
Code: Select all
else if (smooth == 1 && zoom == 2) {
uint8_t *restrict srcbuffer2 = (uint8_t *)srcbuffer;
uint8_t *restrict destbuffer2 = (uint8_t *)destbuffer;
for (int y = 1; y < size.h - 1; y++) {
for (int x = 1; x < size.w - 1; x++) {
uint8_t A = srcbuffer2[(y - 1) * size.w + (x - 1)];
uint8_t B = srcbuffer2[(y - 1) * size.w + (x + 0)];
uint8_t C = srcbuffer2[(y - 1) * size.w + (x + 1)];
uint8_t D = srcbuffer2[(y + 0) * size.w + (x - 1)];
uint8_t E = srcbuffer2[(y + 0) * size.w + (x + 0)];
uint8_t F = srcbuffer2[(y + 0) * size.w + (x + 1)];
uint8_t G = srcbuffer2[(y + 1) * size.w + (x - 1)];
uint8_t H = srcbuffer2[(y + 1) * size.w + (x + 0)];
uint8_t I = srcbuffer2[(y + 1) * size.w + (x + 1)];
uint8_t E0, E1, E2, E3;
if (B != H && D != F) {
E0 = D == B && E != A ? D : E;
E1 = B == F && E != C ? F : E;
E2 = D == H && E != G ? D : E;
E3 = H == F && E != I ? F : E;
} else {
E0 = E;
E1 = E;
E2 = E;
E3 = E;
}
destbuffer2[(y * zoom + 0) * pitch + x * zoom + 0] = E0;
destbuffer2[(y * zoom + 0) * pitch + x * zoom + 1] = E1;
destbuffer2[(y * zoom + 1) * pitch + x * zoom + 0] = E2;
destbuffer2[(y * zoom + 1) * pitch + x * zoom + 1] = E3;
}
}
}
however that links up fewer diagonally-adjacent pixels into lines. So I'll go with this version instead (8-bit: 70% slower with SSE2, 20% slower without):
Code: Select all
else if (smooth == 1 && zoom == 2) {
uint8_t *restrict outbuf = (uint8_t *)calloc(size.w, 2);
uint8_t *restrict srcbuffer2 = (uint8_t *)srcbuffer;
uint8_t *restrict destbuffer2 = (uint8_t *)destbuffer;
for (int y = 1; y < size.h - 1; y++) {
for (int x = 1; x < size.w - 1; x++) {
uint8_t B = srcbuffer2[(y - 1) * size.w + (x + 0)];
uint8_t D = srcbuffer2[(y + 0) * size.w + (x - 1)];
uint8_t E = srcbuffer2[(y + 0) * size.w + (x + 0)];
uint8_t F = srcbuffer2[(y + 0) * size.w + (x + 1)];
uint8_t H = srcbuffer2[(y + 1) * size.w + (x + 0)];
uint8_t E0, E1, E2, E3;
uint8_t B2 = outbuf[x * zoom + 0];
uint8_t B3 = outbuf[x * zoom + 1];
bool swapB2 = B2 != B;
bool swapB3 = B3 != B;
if (B != H && D != F) {
E0 = D == B && !swapB2 ? D : E;
if (D == B) B2 = B;
E1 = B == F && !swapB3 ? F : E;
if (B == F) B3 = B;
E2 = D == H ? D : E;
E3 = H == F ? F : E;
} else {
E0 = E;
E1 = E;
E2 = E;
E3 = E;
}
destbuffer2[(y * zoom + 0) * pitch + x * zoom + 0] = E0;
destbuffer2[(y * zoom + 0) * pitch + x * zoom + 1] = E1;
destbuffer2[(y * zoom - 1) * pitch + x * zoom + 0] = B2;
destbuffer2[(y * zoom - 1) * pitch + x * zoom + 1] = B3;
outbuf[x * zoom + 0] = E2;
outbuf[x * zoom + 1] = E3;
}
}
free(outbuf);
}
(The temp buffer can be avoided but then it isn't vectorisable.)
An odd thing about Scale2x/3x is that they actually handle 45° diagonals worse than other angles. I invented some Scale2x variants (e.g. "Diag2x" below) that partially improved that but they all caused text to look very angular, so I won't use them, but here's what they look like. I probably should have been concentrating on Scale3x, because its problem with 45° diagonal lines is far worse and all defects are far more noticeable at greater scale.
2x original, OHR2x, Diag2x
As above, blown up 2x
4x original, OHR4x, Diag4x
Another problem with Scale2x is that if a lone pixel is between two diagonal bands of colour it will be squeezed down to just 2 pixels in the scaled version (from the normal 4), and in Scale4x it gets squeezed to 6 pixels from 16. I tried improving that by disallowing squeezing two opposite corners, but it led to too many other problems.
I've been looking into other algorithms (for the non-realtime scaling) but so far all of the ones that I've looked at that give better results than Scalex2/3/4 have on the order of 20x more code. Scalex2/3 seem to occur a prominent position on the complex/quality and probably speed/quality Pareto frontiers.
I only just saw the edit you made to your post with the fully patched file. Thanks, that's excellent!
Could you please make multiple posts rather than such major edits though, I've often missed your edits. Double posting isn't as bad as made out to be.