Can we use __typeof__ for input validation in C program run on a Linux platform and how?
If we can't then, are there any ways other than regex to achieve the same?
"typeof" is purely a compile-time directive. It cannot be used for "input validation."
Input validation rules can be complex. In C, they are made more complex by the fact that the tools that you have at your disposal in the standard C library are pretty awful. One example is atoi(), which will return 0 if the string that you pass in doesn't contain a number at the beginning (atoi("hello world") == 0), and "1337isanumber" will actually return 1337. To simply validate if something is a number, you could (assuming ASCII and not Unicode) use a loop and make sure each value up until the first null terminator (or the size of the memory you allocated for the string) that each digit is in fact numeric. A similar procedure could be done to check if something is alphanumeric, etc. As you mentioned, regexes can be used for a telephone number or some other relatively complex data format.
Your comment below references using "instanceof" in Java for input validation, but this isn't possible either. If you get user input from, say, the command line, or a query string parameter, or whatever, it really comes in as a string. If you're using a Scanner object to scan the standard input and use a method such as nextInt(), it's really converting a string (from the stream) into something, which can throw a runtime exception. You cannot use instanceof to determine a string's contents; a String is a String -- even if its contents are "42", it is not an instance of an Integer!
Related
I'm working on an assignment that is supposed to go over the basics of reading a file and storing the information from that file. I'm personally new to C and struggling with the lack of a "String" variable.
The file that the program is supposed to work with contains temperature values, but we are supposed to account for "corrupted data". The assignment states:
Every input item read from the file should be treated as a stream of characters (string), you can
use the function atof() to convert a string value into a floating point number (invalid data can be
set to a value lower than the lowest minimum to identify it as corrupt)."
The number of elements in the file is undetermined but an example given is:
37.8, 38.a, 139.1, abc.5, 37.9, 38.8, 40.5, 39.0, 36.9, 39.8
After reading the file we're supposed to allow a user to query these individual entries, but as mentioned if the data entry contains a non-numeric value, we are supposed to state that the specific data entry is corrupted.
Overall, I understand how to functionally write a program that can fulfill those requirements. My issue is not knowing what data structure to use and/or how to store the information to be called upon later.
The closest to an actual string datatype which you find in C is a sequence of chars which is terminated by a '\0' value. That is used for most things which you'd expect to do with strings.
Storing them requires just sufficent memory, as offered by a sufficiently large array of char, or as offered by malloc().
I think the requirements of your assignment would be met by making a char array as buffer, then reading in with fgets(), making sure to not read more than fits into your array and making sure that there is a '\0' at the end.
Then you can use atof() on the content of the array and if it fails do the handling of corrupted input. Though I would prefer sscanf() for its better feedback via separate return value.
I am looking to parse iCalendar files using C. I have an existing structure setup and reading in all ready and want to parse line by line with components.
For example I would need to parse something like the following:
UID:uid1#example.com
DTSTAMP:19970714T170000Z
ORGANIZER;CN=John Doe;SENT-BY="mailto:smith#example.com":mailto:john.doe#example.com
CATEGORIES:Project Report, XYZ, Weekly Meeting
DTSTART:19970714T170000Z
DTEND:19970715T035959Z
SUMMARY:Bastille Day Party
Here are some of the rules:
The first word on each line is the property name
The property name will be followed by a colon (:) or a semicolon (;)
If it is a colon then the property value will be directly to the right of the content to the end of the line
A further layer of complexity is added here as a comma separated list of values are allowed that would then be stored in an array. So the CATEGORIES one for example would have 3 elements in an array for the values
If after the property name a semi colon is there, then there are optional parameters that follow
The optional parameter format is ParamName=ParamValue. Again a comma separated list is supported here.
There can be more than one optional parameter as seen on the ORGANIZER line. There would just be another semicolon followed by the next parameter and value.
And to throw in yet another wrench, quotations are allowed in the values. If something is in quotes for the value it would need to be treated as part of the value instead of being part of the syntax. So a semicolon in a quotation would not mean that there is another parameter it would be part of the value.
I was going about this using strchr() and strtok() and have got some basic elements from that, however it is getting very messy and unorganized and does not seem to be the right way to do this.
How can I implement such a complex parser with the standard C libraries (or the POSIX regex library)? (not looking for whole solution, just starting point)
This answer is supposing that you want to roll your own parser using Standard C. In practice it is usually better to use an existing parser because they have already thought of and handled all the weird things that can come up.
My high level approach would be:
Read a line
Pass pointer to start of this line to a function parse_line:
Use strcspn on the pointer to identify the location of the first : or ; (aborting if no marker found)
Save the text so far as the property name
While the parsing pointer points to ;:
Call a function extract_name_value_pair passing address of your parsing pointer.
That function will extract and save the name and value, and update the pointer to point to the ; or : following the entry. Of course this function must handle quote marks in the value and the fact that their might be ; or : in the value
(At this point the parsing pointer is always on :)
Pass the rest of the string to a function parse_csv which will look for comma-separated values (again, being aware of quote marks) and store the results it finds in the right place.
The functions parse_csv and extract_name_value_pair should in fact be developed and tested first. Make a test suite and check that they work properly. Then write your overall parser function which calls those functions as needed.
Also, write all the memory allocation code as separate functions. Think of what data structure you want to store your parsed result in. Then code up that data structure, and test it, entirely independently of the parsing code. Only then, write the parsing code and call functions to insert the resulting data in the data structure.
You really don't want to have memory management code mixed up with parsing code. That makes it exponentially harder to debug.
When making a function that accepts a string (e.g. all three named functions above, plus any other helpers you decide you need) you have a few options as to their interface:
Accept pointer to null-terminated string
Accept pointer to start and one-past-the-end
Accept pointer to start, and integer length
Each way has its pros and cons: it's annoying to write null terminators everywhere and then unwrite them later if need be; but it's also annoying when you want to use strcspn or other string functions but you received a length-counted piece of string.
Also, when the function needs to let the caller know how much text it consumed in parsing, you have two options:
Accept pointer to character, Return the number of characters consumed; calling function will add the two together to know what happened
Accept pointer to pointer to character, and update the pointer to character. Return value could then be used for an error code.
There's no one right answer, with experience you will get better at deciding which option leads to the cleanest code.
We have a CGI-program, that processes POST-ed forms. Some of the POST-ed text can contain non-ASCII characters — and the browsers already helpfully convert these to UTF-8.
I need to "harden" the program to reject invalid strings — where a non-ASCII string is not a valid UTF-8 string either.
I thought, I'd rely on mbstowcs():
setlocale(LC_CTYPE, "en_US.UTF-8");
unilen = mbstowcs(NULL, foo, 0);
if (unilen == (size_t)-1) {
... report an error ...
}
However, I am having a hard time validating the method — it accepts valid strings alright, but I can't come up with an invalid one for it to reject...
Could someone, please, confirm, that this is a proper way and/or suggest an alternative?
Note, that I don't care for the actual result of conversion — once I'm confident, that the string is valid UTF-8, I'm copying it into an e-mail (with UTF-8 charset) and letting the recipient's e-mail program deal with it. The only reason I bother with the validation is to ensure, the form is not used to propagate arbitrary binaries (such as viruses).
Thanks!
The function documentation says
"If an invalid multibyte character is encountered, a (size_t)-1 value is returned."
So i believe your validation is pretty much fine. Personally, i always found this value corrupted for invalid data. You might submit an arbitrary hex sequence of even length to be certain.
If you are doubtful and need further validation, gnu iconv is a good alternate
utf-8 validation on SO
I have a file input, in which i have the following data.
1 1Apple 2Orange 10Kiwi
2 30Apple 4Orange 1Kiwi
and so on. I have to read this data from file and work on it but i dont know how to retrieve the data. I want to store 1(of 1 apple) as integer and then Apple as a string.
I thought of reading the whole 1Apple as a string. and then doing something with the stoi function.
Or I could read the whole thing character by character and then if the ascii value of that character lies b/w 48 to 57 then i will combine that as an integer and save the rest as string? Which one shall I do? Also how do I check what is the ASCII value of the char. (shall I convert the char to int and then compare, or is there any inbuilt function?)
How about using the fscanf() function if and only if your input pattern is not going to change. Otherwise you should probably use fgets() and perform checks if you want to separate the number from the string such as you suggested.
There is one easy right way to do this with standard C library facilities, one rather more difficult right way, and a whole lot of wrong ways. This is the easy right way:
Read an entire line into a char[] buffer using fgets.
Extract numbers from this line using strtol or strtoul.
It is very important to understand why the easier-looking alternatives (*scanf and atoi) should never be used. You might write less code initially, but once you start thinking about how to handle even slightly malformed input, you will discover that you should have used strtol.
The "rather more difficult right way" is to use lex and yacc. They are much more complicated but also much more powerful. You shouldn't need them for this problem.
I feel like this is a pretty common problem but I wasn't really sure what to search for.
I have a large file (so I don't want to load it all into memory) that I need to parse control strings out of and then stream that data to another computer. I'm currently reading in the file in 1000 byte chunks.
So for example if I have a string that contains ASCII codes escaped with ('$' some number of digits ';') and the data looked like this... "quick $33;brown $126;fox $a $12a". The string going to the other computer would be "quick brown! ~fox $a $12a".
In my current approach I have the following problems:
What happens when the control strings falls on a buffer boundary?
If the string is '$' followed by anything but digits and a ';' I want to ignore it. So I need to read ahead until the full control string is found.
I'm writing this in straight C so I don't have streams to help me.
Would an alternating double buffer approach work and if so how does one manage the current locations etc.
If I've followed what you are asking about it is called lexical analysis or tokenization or regular expressions. For regular languages you can construct a finite state machine which will recognize your input. In practice you can use a tool that understands regular expressions to recognize and perform different actions for the input.
Depending on different requirements you might go about this differently. For more complicated languages you might want to use a tool like lex to help you generate an input processor, but for this, as I understand it, you can use a much more simple approach, after we fix your buffer problem.
You should use a circular buffer for your input, so that indexing off the end wraps around to the front again. Whenever half of the data that the buffer can hold has been processed you should do another read to refill that. Your buffer size should be at least twice as large as the largest "word" you need to recognize. The indexing into this buffer will use the modulus (remainder) operator % to perform the wrapping (if you choose a buffer size that is a power of 2, such as 4096, then you can use bitwise & instead).
Now you just look at the characters until you read a $, output what you've looked at up until that point, and then knowing that you are in a different state because you saw a $ you look at more characters until you see another character that ends the current state (the ;) and perform some other action on the data that you had read in. How to handle the case where the $ is seen without a well formatted number followed by an ; wasn't entirely clear in your question -- what to do if there are a million numbers before you see ;, for instance.
The regular expressions would be:
[^$]
Any non-dollar sign character. This could be augmented with a closure ([^$]* or [^$]+) to recognize a string of non$ characters at a time, but that could get very long.
$[0-9]{1,3};
This would recognize a dollar sign followed by up 1 to 3 digits followed by a semicolon.
[$]
This would recognize just a dollar sign. It is in the brackets because $ is special in many regular expression representations when it is at the end of a symbol (which it is in this case) and means "match only if at the end of line".
Anyway, in this case it would recognize a dollar sign in the case where it is not recognized by the other, longer, pattern that recognizes dollar signs.
In lex you might have
[^$]{1,1024} { write_string(yytext); }
$[0-9]{1,3}; { write_char(atoi(yytext)); }
[$] { write_char(*yytext); }
and it would generate a .c file that will function as a filter similar to what you are asking for. You will need to read up a little more on how to use lex though.
The "f" family of functions in <stdio.h> can take care of the streaming for you. Specifically, you're looking for fopen(), fgets(), fread(), etc.
Nategoose's answer about using lex (and I'll add yacc, depending on the complexity of your input) is also worth considering. They generate lexers and parsers that work, and after you've used them you'll never write one by hand again.