Compile Lua without automatic conversion between strings and numbers

Lua is generally a strongly-typed language, providing almost no implicit conversion between data types.
However, numbers and strings do get automatically coerced in a few cases:
Lua provides automatic conversion between string and number values at run time. Any arithmetic operation applied to a string tries to convert this string to a number, following the rules of the Lua lexer. (The string may have leading and trailing spaces and a sign.) Conversely, whenever a number is used where a string is expected, the number is converted to a string, in a reasonable format.
Thus:
local x,y,z = "3","8","11"
print(x+y,z) --> 11 11
print(x+y==z) --> false
print(x>z) --> true
I do not want this. How can I recompile the Lua interpreter to remove all automatic conversion?
I would prefer to have:
print(x+y) --> error: attempt to perform arithmetic on a string value
print(x>1) --> error: attempt to compare number with string
print(x..1) --> error: attempt to concatenate a number value

The illustrious LHF has commented above that this is not possible out of the box, and requires editing the innards of Lua, starting with http://www.lua.org/source/5.2/lvm.c.html#luaV_tonumber
Marking this as the answer in order to close this question. If anyone later chooses to provide an answer with in-depth details on what needs to be done, I will gladly switch the acceptance mark to that answer.

Related

Tcl String function that does "shimmering" ruined my customized tcl type defined in c

I have defined a custom Tcl type using the Tcl library in C/C++. I basically make Tcl_Obj.internalRep.otherValuePtr point to my own data structure. The problem arises when I call [string length myVar] or a similar string function: these perform so-called shimmering, which replaces my internalRep with Tcl's own string structure. After such a string command, myVar cannot be converted back, because it is a complicated data structure that cannot be reconstructed from the Tcl_Obj.bytes representation, and the type is no longer my custom type. How can I avoid that?

The string length command converts the internal representation of the values it is given to the special string type, which records information to allow many string operations to be performed rapidly. Apart from most of the string command's various subcommands, the regexp and regsub commands are the main ones that do this (for their string-to-match-the-RE-against argument). If you have a precious internal representation of your own and do not wish to lose it, you should avoid those commands; there are some operations that avoid the trouble. (Tcl mostly assumes that internal representations are not fragile, and therefore that they can be regenerated on demand. Beware when using fragile ones!)
The key operations that are mostly safe (as in they generate the bytes/length rep through calling the updateStringProc if needed, but don't clear the internal rep) are:
substitution into a string; the substituted value won't have the internal rep, but it will still be in the original object.
comparison with the eq and ne expression operators. This is particularly relevant for checks to see if the value is the empty string.
Be aware that there are many other operations that spoil the internal representation in other ways, but most don't catch people out so much.
[EDIT — far too long for a comment]: There are a number of relatively well-known extensions that work this way (e.g., TCOM and Tcl/Java both do this). The only thing you can really do is “be careful” as the values really are fragile. For example, put them in an array and then pass the indexes into the array around instead, as those need not be fragile. Or keep things as elements in a list (probably in a global variable) and pass around the list indices; those are just plain old numbers.
The traditional, robust approach is to put a map (e.g., a Tcl_HashTable or std::map) in your C or C++ code and have the indices into that be short strings with not too much meaning (I like to use the name of the type of value followed by either a sequence number or a serialisation of the pointer, such as you might get with the %p conversion in sprintf(); the printed pointer reveals more of the implementation details, is a little more helpful if you're debugging, and generally doesn't actually make that much difference in practice). You then have the removal of things from the map be an explicit deletion operation, and it is also easy to provide operations like listing all the known current values. This is safe, but prone to “leaking” (though it's not formally a memory leak if you provide the listing operation). It can be accelerated by caching the lookup in a Tcl_Obj*'s internal representation (a cheap way to handle deletion is to use a sequence number that you increment when you delete something, and only bypass the map lookup if the sequence number that you cache in the intrep is equal to the main sequence number) but it's not usually a big deal; only go to that sort of thing if you've measured a bottleneck in the lookups.
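The map-of-handles idea can be sketched in plain C as follows. This is a minimal illustration under stated assumptions, not Tcl code: it uses a small fixed-size array instead of a Tcl_HashTable or std::map, and the names handle_create, handle_lookup, handle_delete and the "mytype" prefix are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define MAX_HANDLES 64   /* assumption: a small fixed capacity is enough */
#define NAME_LEN 32

/* One registry slot: a short opaque name mapped to the caller's data. */
struct handle_entry {
    char name[NAME_LEN];
    void *value;
    int in_use;
};

static struct handle_entry table[MAX_HANDLES];
static unsigned next_id = 1;   /* sequence number used to build names */

/* Register value and write its new handle name into name_out.
   Returns 0 on success, -1 if the table is full. */
static int handle_create(void *value, char *name_out, size_t name_size)
{
    for (int i = 0; i < MAX_HANDLES; i++) {
        if (!table[i].in_use) {
            snprintf(table[i].name, NAME_LEN, "mytype%u", next_id++);
            table[i].value = value;
            table[i].in_use = 1;
            snprintf(name_out, name_size, "%s", table[i].name);
            return 0;
        }
    }
    return -1;
}

/* Return the data registered under name, or NULL if there is none. */
static void *handle_lookup(const char *name)
{
    for (int i = 0; i < MAX_HANDLES; i++)
        if (table[i].in_use && strcmp(table[i].name, name) == 0)
            return table[i].value;
    return NULL;
}

/* Explicit deletion: returns 0 if the handle existed, -1 otherwise. */
static int handle_delete(const char *name)
{
    for (int i = 0; i < MAX_HANDLES; i++) {
        if (table[i].in_use && strcmp(table[i].name, name) == 0) {
            table[i].in_use = 0;
            return 0;
        }
    }
    return -1;
}
```

The script side only ever sees the short name, which is an ordinary string and survives any amount of shimmering; the C side resolves it with handle_lookup each time it needs the real data.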
But I'd probably just live with fragility in my own code, and would just take care to ensure that I never bust the assumptions. The problem is really that you're being incautious about how you use the values; the Tcl code should just pass them around and nothing else really. Also, I've experimented a fair bit with wrapping such things up inside a TclOO object; it's far too heavyweight (by the design of TclOO) for values that you're making a lot of, but if you've only got a few of them and you're wanting to treat them as objects with methods, this can work very well indeed (and gives many more options for automatic cleanup).

Parsing an iCalendar file in C

I am looking to parse iCalendar files using C. I already have a structure set up and the file reading done, and I want to parse line by line into components.
For example I would need to parse something like the following:
UID:uid1@example.com
DTSTAMP:19970714T170000Z
ORGANIZER;CN=John Doe;SENT-BY="mailto:smith@example.com":mailto:john.doe@example.com
CATEGORIES:Project Report, XYZ, Weekly Meeting
DTSTART:19970714T170000Z
DTEND:19970715T035959Z
SUMMARY:Bastille Day Party
Here are some of the rules:
The first word on each line is the property name
The property name will be followed by a colon (:) or a semicolon (;)
If it is a colon, the property value is everything to the right of the colon, up to the end of the line
A further layer of complexity is added here, as a comma-separated list of values is allowed, which would then be stored in an array. So CATEGORIES, for example, would have 3 elements in its array of values
If a semicolon follows the property name, then optional parameters follow
The optional parameter format is ParamName=ParamValue. Again a comma separated list is supported here.
There can be more than one optional parameter as seen on the ORGANIZER line. There would just be another semicolon followed by the next parameter and value.
And to throw in yet another wrench, quotations are allowed in the values. If something is in quotes for the value it would need to be treated as part of the value instead of being part of the syntax. So a semicolon in a quotation would not mean that there is another parameter it would be part of the value.
I was going about this using strchr() and strtok() and have got some basic elements from that, however it is getting very messy and unorganized and does not seem to be the right way to do this.
How can I implement such a complex parser with the standard C libraries (or the POSIX regex library)? (not looking for whole solution, just starting point)

This answer is supposing that you want to roll your own parser using Standard C. In practice it is usually better to use an existing parser because they have already thought of and handled all the weird things that can come up.
My high level approach would be:
Read a line
Pass pointer to start of this line to a function parse_line:
Use strcspn on the pointer to identify the location of the first : or ; (aborting if no marker found)
Save the text so far as the property name
While the parsing pointer points to ;:
Call a function extract_name_value_pair passing address of your parsing pointer.
That function will extract and save the name and value, and update the pointer to point to the ; or : following the entry. Of course this function must handle quote marks in the value and the fact that there might be ; or : in the value
(At this point the parsing pointer is always on :)
Pass the rest of the string to a function parse_csv which will look for comma-separated values (again, being aware of quote marks) and store the results it finds in the right place.
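A rough sketch of the first few steps might look like this. It is only an illustration under assumptions: find_unquoted and split_property are hypothetical helper names, parameters are skipped rather than saved, and a double quote simply toggles "inside quotes" with no escape handling.

```c
#include <stddef.h>
#include <string.h>

/* Return a pointer to the first character of s that appears in delims
   and is not inside double quotes, or to the terminating '\0' if none. */
static const char *find_unquoted(const char *s, const char *delims)
{
    int in_quotes = 0;
    for (; *s != '\0'; s++) {
        if (*s == '"')
            in_quotes = !in_quotes;
        else if (!in_quotes && strchr(delims, *s) != NULL)
            return s;
    }
    return s;
}

/* Copy the property name of one content line into name_out and return a
   pointer to the value text after the unquoted ':'.
   Returns NULL if the line is malformed or the name does not fit. */
static const char *split_property(const char *line, char *name_out,
                                  size_t name_size)
{
    size_t name_len = strcspn(line, ":;");
    if (name_len == 0 || line[name_len] == '\0' || name_len >= name_size)
        return NULL;
    memcpy(name_out, line, name_len);
    name_out[name_len] = '\0';

    const char *p = line + name_len;
    while (*p == ';') {
        /* A real parser would extract and save the parameter here;
           this sketch only skips to the next unquoted ';' or ':'. */
        p = find_unquoted(p + 1, ";:");
    }
    return (*p == ':') ? p + 1 : NULL;
}
```

Given the ORGANIZER line from the question, split_property would store ORGANIZER as the name and return a pointer to the text after the last unquoted colon, because the colon inside the quoted SENT-BY value is ignored.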
The functions parse_csv and extract_name_value_pair should in fact be developed and tested first. Make a test suite and check that they work properly. Then write your overall parser function which calls those functions as needed.
Also, write all the memory allocation code as separate functions. Think of what data structure you want to store your parsed result in. Then code up that data structure, and test it, entirely independently of the parsing code. Only then, write the parsing code and call functions to insert the resulting data in the data structure.
You really don't want to have memory management code mixed up with parsing code. That makes it exponentially harder to debug.
When making a function that accepts a string (e.g. all three named functions above, plus any other helpers you decide you need) you have a few options as to their interface:
Accept pointer to null-terminated string
Accept pointer to start and one-past-the-end
Accept pointer to start, and integer length
Each way has its pros and cons: it's annoying to write null terminators everywhere and then unwrite them later if need be; but it's also annoying when you want to use strcspn or other string functions but you received a length-counted piece of string.
Also, when the function needs to let the caller know how much text it consumed in parsing, you have two options:
Accept pointer to character, return the number of characters consumed; the calling function adds the two together to find where parsing stopped
Accept pointer to pointer to character, and update the pointer to character. Return value could then be used for an error code.
There's no one right answer; with experience you will get better at deciding which option leads to the cleanest code.
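As an illustration of the pointer-to-pointer option, here is a small sketch of a comma-separated value reader. The name next_csv_item and its return codes are assumptions made for this example; it keeps quote characters and surrounding spaces in the output and leaves any trimming to the caller.

```c
#include <stddef.h>

/* Copy the next comma-separated item starting at *pp into out, treating
   commas inside double quotes as part of the item.
   Returns 0 on success and advances *pp past the item (setting it to NULL
   once the last item has been consumed), 1 when there is nothing left,
   and -1 if the item does not fit in out. */
static int next_csv_item(const char **pp, char *out, size_t out_size)
{
    const char *p = *pp;
    size_t n = 0;
    int in_quotes = 0;

    if (p == NULL)
        return 1;            /* a previous call consumed the last item */
    if (out_size == 0)
        return -1;

    for (; *p != '\0'; p++) {
        if (*p == '"')
            in_quotes = !in_quotes;
        else if (*p == ',' && !in_quotes)
            break;           /* unquoted comma ends this item */
        if (n + 1 >= out_size)
            return -1;
        out[n++] = *p;
    }
    out[n] = '\0';
    *pp = (*p == ',') ? p + 1 : NULL;
    return 0;
}
```

The caller keeps a single parsing pointer and calls next_csv_item(&p, buf, sizeof buf) in a loop until it returns nonzero, so the return value stays free for error reporting.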

Writing a C function to parse a Hex string and convert it to decimal?

I am just learning C and in my class this is a part of our first program.
The full description of the function I am trying to implement:
If that's tl;dr, a key point is that I am not allowed to use functions from other libraries (so something like strtol is ruled out).
int parseHexString(char *hexString, int *integerRead); The first
parameter is a null terminated C string, that represents a hexadecimal
integer. This function parses this string, accumulating the integer
value it represents. This integer value is placed at the location
pointed to by the second parameter, integerRead. If a bad hexadecimal
character, thus an invalid hex value, is encountered, this function
stops looking at further characters within the string and returns -1.
If a good hex value is parsed, it returns 0.
The correct way to implement this function looks at the first
character within the string first and does not use a stack to
accomplish the parsing. Your first assembly language program will need
to implement what this function accomplishes, so you will save
yourself time by implementing this function the correct way.
For this function, do not call any functions from any libraries; as an
exception, for debugging purposes only, you may use printf(). It will
help our grading if you remove your debugging code before you submit
your assignment.
I am NOT just looking for a full implementation of this function, just some tips or hints to get me started.
I feel as though there is some intuitive way of doing this, but right now I am blanking. I'm concerned with how I am supposed to start at the first character of the string and then go forward from there to convert it to decimal.

How about:
Get the length of the string
Walk the string from left to right
For each character:
Check if it's a valid hex character
Add its decimal value multiplied by 16^x, where x is the number of characters left on the right
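One possible shape for this, as a sketch only: instead of computing 16^x for each digit, it multiplies the running total by 16 before adding the next digit, which gives the same result and does not require the length up front. Treating an empty string as invalid and ignoring overflow are assumptions made here, not something the assignment text states.

```c
/* Parse a null-terminated hex string left to right, with no library calls.
   Returns 0 and stores the value in *integerRead on success,
   -1 as soon as a non-hex character is seen. */
int parseHexString(char *hexString, int *integerRead)
{
    unsigned int value = 0;
    int i;

    if (hexString[0] == '\0')
        return -1;               /* assumption: empty input is invalid */

    for (i = 0; hexString[i] != '\0'; i++) {
        char c = hexString[i];
        unsigned int digit;

        if (c >= '0' && c <= '9')
            digit = (unsigned int)(c - '0');
        else if (c >= 'a' && c <= 'f')
            digit = (unsigned int)(c - 'a') + 10;
        else if (c >= 'A' && c <= 'F')
            digit = (unsigned int)(c - 'A') + 10;
        else
            return -1;           /* bad hex character: stop immediately */

        /* Shift the digits read so far one hex place left, then add. */
        value = value * 16 + digit;
    }

    *integerRead = (int)value;   /* overflow is not checked in this sketch */
    return 0;
}
```

For example, "1A3" is processed as ((1 * 16) + 10) * 16 + 3 = 419.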

MalformedInputException when trying to read entire file

I have a 132 KB file (you can't really say it's big) and I'm trying to read it from the Scala REPL, but I can't read past 2048 chars because it gives me a java.nio.charset.MalformedInputException.
These are the steps I take:
val it = scala.io.Source.fromFile("docs/categorizer/usig_calles.json") // this is ok
it.take(2048).mkString // this is ok too
it.take(1).mkString // BANG!
java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:338)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
Any idea what could be wrong?
--
Apparently the problem was that the file was not UTF encoded.
I saved it as UTF and everything works; I just call mkString on the iterator and it retrieves the whole contents of the file.
The strange thing is that the error only arose after passing the first 2048 chars...

Cannot be certain without the file, but the documentation on the exception indicates it is thrown "when an input byte sequence is not legal for given charset, or an input character sequence is not a legal sixteen-bit Unicode sequence." (MalformedInputException javadoc)
I suspect that the character at position 2049 is the first one encountered that is not valid in whatever the default JVM character encoding is in your environment. Consider explicitly stating the character encoding of the file using one of the overloads of fromFile.
If the application will be cross-platform, you should know that the default character encoding on the JVM varies by platform, so if you work with a specific encoding you want to either set it explicitly as a command-line parameter when launching your application, or specify it at each call using the appropriate overload.

Any time you call take twice on the same iterator, all bets are off. Iterators are inherently imperative, and mixing them with functional idioms is dicey at best. Most of the iterators you come across in the standard library tend to be fairly well-behaved in this respect, but once you've used take, or drop, or filter, etc., you're in undefined-behavior land, and in principle anything could happen.
From the docs:
It is of particular importance to note that, unless stated otherwise,
one should never use an iterator after calling a method on it. The two
most important exceptions are also the sole abstract methods: next
and hasNext ...
def take(n: Int): Iterator[A] ...
Reuse: After calling this method, one should discard the iterator it
was called on, and use only the iterator that was returned. Using the
old iterator is undefined, subject to change, and may result in
changes to the new iterator as well.
So it's probably not worth trying to track down exactly what went wrong here.

If you just wish to convert the bytes to plain Latin data:
// File:
io.Source.fromFile(file)(io.Codec.ISO8859).mkString
// InputStream:
io.Source.fromInputStream(System.in)(io.Codec.ISO8859).mkString

use typeof in input validation

Can we use __typeof__ for input validation in C program run on a Linux platform and how?
If we can't then, are there any ways other than regex to achieve the same?

"typeof" is purely a compile-time directive. It cannot be used for "input validation."
Input validation rules can be complex. In C, they are made more complex by the fact that the tools at your disposal in the standard C library are pretty awful. One example is atoi(), which returns 0 if the string you pass in doesn't begin with a number (atoi("hello world") == 0), while atoi("1337isanumber") actually returns 1337. To simply validate that something is a number, you could (assuming ASCII and not Unicode) loop over the string up to the first null terminator (or the size of the memory you allocated for the string) and check that each character is in fact a digit. A similar procedure could be used to check whether something is alphanumeric, etc. As you mentioned, regexes can be used for a telephone number or some other relatively complex data format.
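The digit-checking loop described above could be sketched like this. The function name is_all_digits is hypothetical, and rejecting an empty string is an assumption of this example; signs, whitespace and decimal points are all treated as invalid.

```c
#include <stddef.h>

/* Return 1 if s is a non-empty run of ASCII digits, looking at no more
   than max_len bytes; return 0 otherwise. */
static int is_all_digits(const char *s, size_t max_len)
{
    size_t i;

    if (max_len == 0 || s[0] == '\0')
        return 0;                /* assumption: empty input is not a number */

    for (i = 0; i < max_len && s[i] != '\0'; i++) {
        if (s[i] < '0' || s[i] > '9')
            return 0;            /* found a character that is not a digit */
    }
    return 1;
}
```

With this check, "1337" is accepted while both "1337isanumber" and "hello world" are rejected, unlike with atoi().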
Your comment below references using "instanceof" in Java for input validation, but this isn't possible either. If you get user input from, say, the command line, or a query string parameter, or whatever, it really comes in as a string. If you're using a Scanner object to scan the standard input and use a method such as nextInt(), it's really converting a string (from the stream) into something, which can throw a runtime exception. You cannot use instanceof to determine a string's contents; a String is a String -- even if its contents are "42", it is not an instance of an Integer!