Indent tool for HSpeak

Post by **TMC** » Mon Mar 30, 2020 9:11 am

Breaking compatibility wasn't a problem; those are pretty small changes. The file reorganisation mixed in with so many other changes took some disentangling. I've now merged most of your changes and pushed to svn, so I can continue to merge selected changes from you. But we will diverge further. I preferred to keep flow nodes as type "flow" instead of "function". And I'm intending to rewrite hs_tld.py to handle subscripts, begin and end, and handle top-level constructs with the grammar.

I have an idea for how to handle commas and newlines, and () vs begin/end, although handling both those things at once adds a lot of extra complication because of the necessary commas around begin/end. It won't be 100% the same as HSpeak, but stricter (which is a good thing), and probably close enough for 95+% of games.

But maybe I'm wasting time figuring out convoluted ways to replicate HSpeak's strange lexing/comma handling if I'll end up writing a custom lexer for PLY (or modifying PLY's lex) or switching to lark, which has a "contextual lexer":

The contextual lexer communicates with the parser, and uses the parser's lookahead prediction to narrow its choice of tokens. So at each point, the lexer only matches the subgroup of terminals that are legal at that parser state, instead of all of the terminals. It's surprisingly effective at resolving common terminal collisions, and allows to parse languages that LALR(1) was previously incapable of parsing.

Which sounds very convenient: in some places newlines could be treated like commas, and in others they could be ignored.
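
To make the idea concrete, here is a minimal pure-Python sketch. It is not lark's implementation: the token names, the patterns and the `next_token` helper are all made up for illustration, and the `legal` set stands in for whatever the parser would report as acceptable in its current state.

```python
import re

# Hypothetical terminals. A newline can be lexed either as a separator
# (like a comma) or as ignorable whitespace, depending on what is legal.
PATTERNS = [
    ("NUMBER", re.compile(r"\d+")),
    ("NAME", re.compile(r"[A-Za-z_]\w*")),
    ("COMMA", re.compile(r",|\n")),
    ("SKIP", re.compile(r"[ \t\n]+")),
]

def next_token(text, pos, legal):
    """Return (type, value, new_pos) for the first terminal in `legal` that matches at pos."""
    for name, regex in PATTERNS:
        if name not in legal:
            continue  # the parser cannot accept this terminal here, so don't try it
        m = regex.match(text, pos)
        if m:
            return name, m.group(), m.end()
    raise SyntaxError(f"no legal token at position {pos}")
```

With this, the same newline comes out as `COMMA` when the caller passes `{"COMMA", "NAME"}` and as `SKIP` when it passes `{"SKIP", "NAME"}`, which is the "newline is a comma here, ignored there" behaviour described above.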

I was disappointed and very skeptical about how PLY (and yacc) handles syntax error reporting, using p_error and 'error' symbols. What I really want, and which PLY doesn't provide, is a description of what it was expecting to see at the point of the error, like HSpeak provides in many of its error messages.

But on actually trying out writing 'error' rules, it's not as bad as I thought.
However it seems it's necessary to add a rule containing 'error' at the exact location of the error in order to explain exactly what's wrong, e.g. an extra comma inside an "if()" rather than just printing an error like "Expected condition after IF". But I think I can use a combination of p_error to point out the token where the error occurred with error rules to describe the general context. A lot of rules might be needed, but for comparison HSpeak has roughly 180 different warning and error messages.
Also, it's possible to examine the parser's symbol stack which might be helpful.

Interestingly, lark takes a completely different approach to error reporting based on pattern matching. But error recovery isn't even mentioned in lark's documentation (which is sparse compared to PLY's)
https://github.com/lark-parser/lark/blo ... ng_lalr.py

Post by **lennyhome** » Mon Mar 30, 2020 12:45 pm

About error reporting. The manual isn't especially clear on how to do it. I wasn't sure how to manually set the line number, but here's how I did it.

I've added:

Code: Select all

def t_eof(t):
    t.lexer.lineno = 1

to the lexer. And in the parser:

Code: Select all

def p_error(p):
    if p:
        AST_state.error = "Syntax error at '%s'" % (p.value)
        AST_state.lineno = p.lineno
    else:
        AST_state.error = "continue"

That way after a yacc.parse() call, lineno indicates a line offset into the buffer and it gives me a chance to add an offset before printing the message.

Writing your own lexer may be the way to go for compatibility. In the C world modern parser generators like Lemon tend to come without a lexer. A benefit is that you can avoid regular expressions entirely.

I don't think PLY has good support for autocompletion. You can get an idea of what the parser was expecting based on which state it was left in and by looking at parser.out.

In a general sense and as far as I know, automatic comma insertion is an unsolved problem. Every language does it differently. You may want to watch all videos you can find by Rob Pike and Brendan Eich. And then realize C has been doing it right all this time... by not doing it at all.

----

While I was working on C-style operators I thought of changing the meaning of "<<" from "less than" to "shift left". But then I realized that doing so would have made it impossible to port existing scripts. Instead I'm going to leave most two-letter operators undefined, so that you can choose to re-add them or change them in the scripts.

----

Finally, after many attempts, I was able to handle comments correctly. I've added:

Code: Select all

t_ignore_COMMENT = r'\#.*'

to the lexer and I've changed the comma insertion from ",\n" to "\n,". That way comments are terminated correctly. I also have an "automatic comma insertion suppressor" symbol now: "\" at the end of a line like in Python.

Post by **TMC** » Tue Mar 31, 2020 3:38 am

Oh, I got the line numbers to sync up a different way. And I already had a t_ignore_COMMENT rule identical to that, and had moved the line-end commas to fix it. Although I see I forgot to update hsi.py in the same way... actually, it turns out that just removing ',' from the line concatenation in hsi.py fixes most problems with newlines and commas, except then it doesn't allow statements on multiple lines inside e.g. a then() without commas. But this gives me some new ideas...

By inspecting the internal state of the parser (and a lot of reading PLY internals) I managed to write a function to describe what the parser expects to see. Unfortunately due to the nature of the algorithm used for LALR(1), in many locations you can't get an answer to that question because some reduces have already happened. But it should be possible to add specific errors to cover those cases as they're found.
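
As a rough illustration of the idea (not the actual function described in this post), an LALR action table maps each state to the terminals that have a shift or reduce action there, so the "expected" set for a state is just the keys of its row. The table below is a made-up toy, and the dict-of-dicts shape is an assumption about how such a table might be stored; `describe_expected` and `FRIENDLY` are hypothetical names.

```python
# Hypothetical LALR action table: state -> {terminal: action}.
# Positive = shift to that state, negative = reduce by that rule.
ACTION = {
    0: {"NUMBER": 3, "NAME": 4, "(": 5},
    3: {"+": -2, ")": -2, "$end": -2},
}

# Made-up display names for terminals that read badly as-is.
FRIENDLY = {"$end": "end of input"}

def describe_expected(action, state):
    """List the terminals that have any action in the given state."""
    tokens = sorted(t for t in action.get(state, {}) if t != "error")
    if not tokens:
        return "Expected to see (unknown)"
    names = [FRIENDLY.get(t, t) for t in tokens]
    if len(names) == 1:
        return "Expected to see " + names[0]
    return "Expected to see one of: " + " ".join(names)
```

In state 3 of the toy table every entry is a reduce, so the keys are only the lookaheads of that reduce. That is the situation where the answer can be incomplete, as described above.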

Examples:

Code: Select all

HSpeak> -4 // 23

Line 1    -4 // 23
              ^
Syntax error at '/': Expected to see expression

HSpeak> $*1 + "asd

Line 1    $*1 + "asd
           ^
Syntax error at '*': Expected to see string ref beginning with one of: NUMBER ( NAME

Line 1    $*1 + "asd
                ^^^^
String missing closing "

Line 1    $*1 + "asd
                ^
Strings can't be used as expressions; they can only appear as part of $...="..." or $...+"...".

HSpeak> if(1,2) not()

Line 1    if(1,2) not()
              ^
Syntax error at ',': Expected to see one of: )  an operator like +

Line 1    if(1,2) not()
            ~~~~~
Condition should be a (single) expression.

Line 1    if(1,2) not()
          ~~~~~~~~~
if() should be followed by then() block
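
Messages in this style need only a line number and a column span for the offending token or node. A minimal sketch of the formatting follows; `format_error` and its parameters are hypothetical names, not taken from the actual compiler.

```python
def format_error(line_no, source_line, col, width, message, marker="^"):
    """Show the source line with markers under columns col..col+width-1 (0-based)."""
    prefix = f"Line {line_no}    "
    # Pad past the prefix so the markers line up with the source text.
    underline = " " * (len(prefix) + col) + marker * width
    return f"{prefix}{source_line}\n{underline}\n{message}"
```

For example, `format_error(1, "-4 // 23", 4, 1, "Syntax error at '/': Expected to see expression")` reproduces the first example above, and passing `marker="~"` with a larger `width` gives the span-style underline.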

You could use <<< and >>> for shifting left/right.
I'm going to add a negate math function to the VM and could add << and >> too (we have sometimes needed them in plotscr.hsd), though I don't know what syntax to use.

Post by **lennyhome** » Tue Mar 31, 2020 1:05 pm

That errors/suggestions thing looks really cool. I had some doubt it could be done to any effect because after all lex/yacc are tools from the '70s and I'm told people used to read manuals back then.

I've found out that multi-letter operators put a noticeable burden on PLY's lexer just by being there, probably because they're matched via regular expressions. In my version I would like to give access to all operator functionality via function calls, but leave the non-essential or conflicting operators themselves undefined.

Looking at my reference, the missing functions in the VM are:

bitwise not - emulated as (-1 ^^ a) or (-1 - a)
sign negate - emulated as (a * -1)
left shift - emulated as (a * 2 ^ b)
arithmetic right shift - emulated as (a / 2 ^ b)
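
These identities can be checked quickly in Python. Python integers are arbitrary precision, so this sketch ignores 32-bit overflow. One caveat, stated as an assumption because it depends on how the VM's division rounds: dividing by 2^b only equals an arithmetic right shift for negative numbers if the division rounds toward negative infinity, as Python's `//` does. The function names are illustrative only.

```python
def bit_not(a):
    return -1 - a        # equals ~a in two's complement

def negate(a):
    return a * -1

def lshift(a, b):
    return a * 2 ** b    # equals a << b

def arshift(a, b):
    # Floor division matches a >> b; a division that truncates toward
    # zero would give a different result for negative, non-exact a.
    return a // 2 ** b
```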

I can't think of anything else at the moment, but those functions alone would certainly be useful. I haven't tried this yet but maybe I could make a little "bitop.hsd" library modeled after the Lua BitOp extension.

If you're working on better error reporting you should consider adding line and column information to AST nodes, because that's needed later when kind_and_id() wants to report a linkage error.

I'm not especially worried about that because scripts tend to be short and when you know a symbol is unresolved, it's easy to track it down, but that error phase could be much better.

----

Something like:

Code: Select all

script, bit:neg, a, begin
	return(a * -1)
end
script, bit:not, a, begin
	return(xor(-1, a))
end
script, bit:lshift, a, b, begin
	return(a * 2 ^ b)
end
script, bit:arshift, a, b, begin
	return(a / 2 ^ b)
end

I also got rid of the "^^" operator because from the manual I can't understand what it does.

Post by **TMC** » Tue Mar 31, 2020 3:20 pm

I rewrote "include" handling, which allowed getting rid of name_concat, which allowed getting rid of automatic commas at newlines, which allowed some significant changes to the grammar. I removed some uses of 'empty' (there's just one left), added statement, statement_list, nonempty_statement_list, merged NUMBER, HEX and BINARY, changed what's void vs an expression, added $+ and $=, and other stuff. I'm working on allowing if() else() next. After that, the only compile errors in baconthulhu will be due to 'begin' and 'end'.

I found one case where a possible terminal is missing from the "Expected to see..." message, rather than the whole lot being unknown. It's because a NAME (which can be followed by '(') got reduced to expression (which can't) before the error was detected. I think the only way to do something about it would be to modify yacc to record the previous state somewhere even after it's deleted from the statestack. Of course, there could be more than one reduce in-between.

Here's something I learnt about how to do error handling with yacc/PLY.
I had a rule "expression : STRING" to clearly tell that string literals can't be used freely and raise a SyntaxError. But then when I changed the string production to "expression : '$' expression '=' STRING" I got 24 reduce/reduce conflicts all in one state like:

Code: Select all

state 127

    (80) expression -> $ expression + STRING .
    (84) expression -> STRING .

  ! reduce/reduce conflict for * resolved using rule 80 (expression -> $ expression + STRING .)

Then I realised that I should make the rule "expression : error STRING" instead. This works, without any conflicts. When a string is used in place of an expression, first p_error is called (printing an extra syntax error), then (because p_error didn't do recovery) the string token is saved to the lookahead stack and the 'error' token is generated (as lookahead), which gets pushed onto the stack and reduced with the "expression : error STRING" rule.

Code: Select all

Line 1    if("asd")then()
             ^
Syntax error at 'asd': Expected to see expression
Strings can't be used as expressions; they can only appear as part of $...="..." or $...+"...".

I wonder what the overhead is of calling .match() on a compiled regex? I was hoping it would be low. It's not clear to me why hand-writing a pure-Python lexer that avoids regexes would be significantly faster than PLY's. I want to do a timing test to find out how much of the runtime is spent parsing vs lexing. I know the PLY docs say lexing is slow.

Yes, definitely want to add line/column information to nodes. I need it anyway not just for error reporting but also to add debug information to .hsz files (which is a not-quite-complete git branch).

The ^^ operator is logical-xor. I haven't come across any other language that has such an operator. a ^^ b is equivalent to bool(a) ^ bool(b) in Python. I grepped my scripts and found that I've used it twice, ever. Not likely that anyone else has used it.
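
A small sketch of that equivalence, contrasted with bitwise xor. The helper name `logical_xor` is made up for illustration.

```python
def logical_xor(a, b):
    """True when exactly one of a and b is nonzero."""
    return bool(a) ^ bool(b)

# Bitwise xor combines the bits of both operands;
# logical xor only looks at whether each operand is zero.
bitwise = 6 ^ 3              # 0b110 ^ 0b011 == 0b101
logical = logical_xor(6, 3)  # both nonzero, so False
```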

Post by **lennyhome** » Tue Mar 31, 2020 5:06 pm

I was able to compile "baconthulhu.hss" a few days ago with my version of the compiler by editing it by hand. While I was doing it I remember there was one straight missing parenthesis. I have no idea how it got there. Meanwhile I've completed the game and I couldn't find anything wrong with it.

And if you take a diff between my version of "plotscr.hsd" and the official one, they're really quite similar despite all my guessing, the features I've dropped and all the corners I've cut. And I still haven't read most of the manual for the language. I feel like a living Chinese room.

As an alternative I was considering re-purposing our indent tool to make it like 2to3 for Python because I thought there may be some value in using that approach.

Right now I'm bummed because they say this quarantine situation may go on till summer.

Post by **TMC** » Wed Apr 01, 2020 12:26 pm

Ah, I was wondering what glitches in Baconthulhu you meant, because I didn't see any when playing either.

You mean a tool to convert from existing syntax to new syntax? Actually that's a nice idea, then we can make syntax changes far more freely (though people will still have to re-learn) or have an alternative syntax.

Maybe I should have prioritised working on a builtin REPL instead of getting the grammar right.
The next OHRRPGCE release candidate is due in a few days so I better go work on that instead.

I'm really pleased with how well having unary - and binary - at the same time as -- works. -- is the biggest annoyance in HS for me. I definitely want to port this behaviour over to HSpeak.
There are a small number of games that use - in identifiers. Scanning the game lists a few years ago found 1273 games with scripts, about 12 of which had - in a script name. Unknown how many had - in a variable name (I have the script source for 552 of those, but haven't scanned them). It was actually more common for script names to begin with a digit; which is no longer allowed by HSpeak, so compatibility breaks have happened in the past.
Although it would be possible to support - in identifiers and also use it for negation and subtraction by using a kludge: if you declare an identifier containing - then it tells the lexer to treat that string as atomic. It could be implemented with negligible lexer slowdown, since t_NAME already looks up each name in 'reserved' anyway. It's ugly, but I'm sure it'll work well.
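
A toy sketch of how that kludge could behave, written as a standalone regex tokenizer rather than as PLY rules. Everything here (`TOKEN_RE`, `tokenize`, the `declared` set) is hypothetical: a name is matched greedily including any `-`, then trailing `-segments` are dropped until what remains is a declared name or contains no `-`.

```python
import re

TOKEN_RE = re.compile(
    r"\s*(?:(?P<num>\d+)|(?P<name>[A-Za-z_][\w-]*)|(?P<op>--|[-+*/(),]))"
)

def tokenize(text, declared):
    """Tokenize text; a name containing '-' is kept whole only if it was declared."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        kind = m.lastgroup
        tok = m.group(kind)
        if kind == "name":
            # Drop trailing '-segments' until the name is declared or has no '-' left.
            while "-" in tok and tok not in declared:
                tok = tok[: tok.rindex("-")]
        tokens.append(tok)
        # Resume right after the part that was actually consumed.
        pos = m.start(kind) + len(tok)
    return tokens
```

So `tokenize("hero-x - 1", {"hero-x"})` keeps `hero-x` as one token, while `tokenize("a-b", set())` yields a subtraction. The real lexer would do the lookup inside t_NAME instead of re-scanning, so treat this only as a description of the intended result.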

Because Python is interpreted, it could actually be possible to support defineoperator by regenerating the parser. However that's difficult and a waste of effort; there are only a couple uses of defineoperator in the wild, due to this: http://rpg.hamsterrepublic.com/ohrrpgce ... _Party_HSI
So the precedences that are actually used for custom operators are known, and that allows a simple implementation of defineoperator. Add OPERATOR_25 and OPERATOR_30 tokens, a couple 'expression' rules for them, and generate these tokens from t_NAME when a matching operator name is seen.
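
A minimal sketch of the lexer side of that plan, outside of PLY. It assumes the numbers in OPERATOR_25 and OPERATOR_30 are the precedence levels and that names are matched case-insensitively. `RESERVED`, `CUSTOM_OPERATORS`, `define_operator` and `classify_name` are illustrative names, not the compiler's.

```python
RESERVED = {"if": "IF", "then": "THEN"}   # illustrative subset of keywords
CUSTOM_OPERATORS = {}                      # operator name -> token type

def define_operator(precedence, name):
    """Register a custom operator at one of the two supported precedences."""
    if precedence not in (25, 30):
        raise ValueError(f"unsupported operator precedence {precedence}")
    CUSTOM_OPERATORS[name.lower()] = f"OPERATOR_{precedence}"

def classify_name(value):
    """Token type for an identifier-like lexeme, roughly what a t_NAME rule would decide."""
    key = value.lower()
    if key in RESERVED:
        return RESERVED[key]
    return CUSTOM_OPERATORS.get(key, "NAME")
```

The grammar would then need only one binary 'expression' rule per OPERATOR_nn token, with the precedence fixed in the parser's precedence table.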

Thought you were being optimistic when you said the restrictions could be reduced soon. Now NZ has been in (a proactive) lockdown for a week too, but I'm on my dad's lifestyle block, and being stuck here is no problem at all - I would leave rarely anyway. Where in Italy are you?

Post by **lennyhome** » Wed Apr 01, 2020 2:49 pm

I've just finished porting Void Pyramid. Took me maybe an hour to modify the script, same as Baconthulhu, not a big deal. But that confirmed my theory (not really my theory) that people always try to write C when they write for other languages. And this is no exception. Just look at the way I write Python code.

I'm in the middle of Italy. I'm not in pain or anything, it's just that after a while, even if you didn't use to go out much, the idea that you can't, it becomes bothersome.

I went out to buy cigars, which is allowed but heavily frowned upon and the guy at the counter who was wearing an industrial welder mask, a tissue mask underneath and gloves yelled at me and told me I couldn't enter the store without wearing a mask myself.

So as an alternative he told me to pull up my winter jacket over my face and I had to talk to him through it. I was like Kenny from South Park. I'm telling you people are getting crazier by the day and I honestly fear people more than this virus.

----

While testing Void Pyramid I've noticed that the shop didn't work correctly. I've tracked it down to this:

Code: Select all

decrement (mon, itemcost)
increment (moneyspent, itemcost)

It should be:

Code: Select all

decrement (@mon, itemcost)
increment (@moneyspent, itemcost)

Which is the correct behavior in my opinion but maybe there is special support for those functions in HSpeak? Just something to be aware of.

----

I've also noticed that sometimes variable names that contain certain symbols that are not explicitly allowed are compiled and linked correctly. As it is now, the lexer skips whatever it doesn't understand, so the unexpected symbol just vanishes from the variable name. Most of the time, it just works.

----

I've removed the ":=" operator and replaced both assignment and default value with "=". It worked out nicely. Of course it's still possible to alias ":=" to "=" in binop_table if you want to, but now I don't need the distinction anymore.

The two rules for assignment and default value were very similar and I realized I could just interpret the assignment node as easily as the previous default value node when needed.

",begin" keywords were never checked or needed but now I removed any special handling so a script declaration becomes:

Code: Select all

script, bit:not, a
	return(xor(a, -1))
end

If a ",begin" is there at the end of the declaration, it becomes a local variable called "begin" and then it's just never used, but probably won't break anything.

Post by **TMC** » Sat Apr 04, 2020 2:02 am

Crazy. Hardly seen anyone wear masks here, I don't think the government has even encouraged it.

I had a look at the Void Pyramid scripts. Interesting that Willy frequently puts another statement or 'end' on the same line as a closing ')'. I've only ever seen people do that because they were sloppy. But I don't see what you mean by "people always try to write C when they write for other languages", considering that most people (probably even most programmers) don't know C.

Ah, I think the best way to handle that increment/decrement problem is by turning those into references in post. Cleaner than a hack in generation. increment(x,y) is completely identical to x+=y in HSpeak.

Yes, extra unused arguments are quite harmless.

I changed the lexer to allow arbitrary unicode 'word' characters in identifiers. HSpeak is even more lax and allows arbitrary unicode, which is good because it means that combining characters can be used without needing to find canonical/normalised forms, which is a pain. HSpeak basically uses a blacklist, and everything else is allowed. But I haven't actually seen anyone use unicode in a script.

Removing the 'begin' is a lot like my suggestion to make 'begin' optional but continue to close the block with 'end'.

I mentioned macros in the other thread. This is something I've wanted for a long time for two purposes: mostly for aliases, but also inlined functions. Definitely should be at the AST level, not the token level. But it felt like too much trouble to implement in HSpeak.

Post by **lennyhome** » Sat Apr 04, 2020 4:07 am

Adapting Void Pyramid was less effort than Baconthulhu.

I was considering adding a limited interpreter, essentially the calc.py example from PLY, to the compiler because I'm going to need a sin table and it would be nice if I could relocate it easily in the constants space, like:

Code: Select all

define constant
10, sin_base + 0
20, sin_base + 1
30, sin_base + 2
...

I would like to be able to write expressions for constants that get solved at compile time essentially. That was part of the reason I wanted to get the operators in order.

If I ever decide to do it, I'm going to have an interpreter running in parallel with the AST generator and I'm going to periodically ask it: "is this expression solvable now?" If it is, emit the result instead of the operation.

PLY has an example for a cpp-style pre-processor. Maybe it could be adapted and used for macros, includes and expressions?

I don't know. I'm burned out for today.

Post by **TMC** » Sun Apr 05, 2020 4:09 am

sin and cos sure get used a lot, but I don't want to add them before we support floating point. I wrote a script which takes angle and multiplier args and consults a lookup table, for people who need it.
Although... come to think of it, I could add sin/cos/tan as two-argument math functions, where the second argument is a multiplier, and the return type is equal to the type of the multiplier (currently always int). In future the 2nd arg can become optional. Ditto for other mathematical functions.
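
A sketch of what such a two-argument function could compute, using Python floats internally. The angle unit (degrees), the rounding mode and the name `sin_scaled` are all assumptions made for illustration; the post doesn't specify them.

```python
import math

def sin_scaled(angle_degrees, multiplier):
    """Integer approximation of sin(angle) * multiplier."""
    return round(math.sin(math.radians(angle_degrees)) * multiplier)
```

With an integer multiplier such as 1000, a script gets three decimal digits of precision without the language needing a floating point type.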

lennyhome wrote: I'm going to have an interpreter running in parallel with the AST generator and I'm going to periodically ask it: "is this expression solvable now?" If it is, emit the result instead of the operation.

But expressions don't become solvable by adding more to the right; the partial parse output is an AST and that can be evaluated immediately. Or it can be done in post, like HSpeak does it. HSpeak doesn't support evaluating constant expressions outside of scripts, but it's theoretically possible. Might be tricky, but should be pretty easy to implement in this python compiler.
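
A minimal sketch of folding constants as a pass over a finished AST. The tuple-based node shape and the `fold` helper are invented for this example and are not the compiler's real node type: a subtree collapses to a number as soon as both operands are known, and anything involving an unknown name stays symbolic.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(node, constants):
    """Return an int if node is fully constant, else a (partially folded) node.

    Nodes are ints, names (str), or tuples (op, left, right).
    """
    if isinstance(node, int):
        return node
    if isinstance(node, str):
        return constants.get(node, node)   # unknown names stay symbolic
    op, left, right = node
    left, right = fold(left, constants), fold(right, constants)
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)
    return (op, left, right)
```

For instance, with `sin_base` defined as 100, the node `("+", "sin_base", 1)` folds to 101, which is the kind of relocatable constant asked about earlier in the thread.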

lennyhome wrote: PLY has an example for a cpp-style pre-processor. Maybe it could be adapted and used for macros, includes and expressions?

Definitely not, CPP is an abomination. It's the worst part of C. (For example, FreeBASIC has a preprocessor which at a glance appears identical, but it's way better, e.g. you can #undef function/var/type declarations, not just #defines. But any modern language should aim far higher, see e.g. D.) It's also the best part, since you can do so much with it; C without CPP would be a very crippled language. I want an AST-based preprocessor, not a token-based one.

Post by **lennyhome** » Sun Apr 05, 2020 10:33 am

About the whole physics thing, I wanted to explore it a little bit for curiosity, but I also think it's an abuse for this engine. I don't even think physics-based games are that fun. For the most part I'm fascinated by the machinery itself.

Although I suspect there is a reasonable way to do cylinder/cylinder collision checks, which would be useful to lead the player into narrow doors if it can move freely like in Pixel Walker.

I keep going back and forth in my head with this crazy idea that I could do constant propagation optimization and at the same time allow an expression any place a constant is expected just by adding a small amount of code, but then I think: "why bother?".

Would it even be necessary or useful? I'm going to have to visit a Tibetan monastery and then roam the lands to find an answer to that.