C string formatting equivalence - c

On a project I work on, we recently ran into a problem: we need to check whether two strings have equivalent printf-style formatting (for translations).
/* A simple example: */
str = "%.200sSOMETEXT%.5fSOMEMORETEXT%d%ul%.*s%%";
/* Should be able to be validated to be the equivalent of: */
str = "%.200sBLAHBLAH%.5ftest%d%ul%.*s%%MORETEXT";
/* and... */
str = "%.200s%.5f%d%ul%.*s%%";
/* but not... */
str = "%.5f%.200s%d%ul%%%.*s";
So my question is:
Is there a way to validate that two strings have equivalent string formatting?
Perhaps the answer is a well-crafted regular expression, an existing tool, or some example code from another project. I can't imagine we're the first project to run into this problem.

Interesting problem.
I would try to implement a function that strips the non-formatting characters from a formatting string, thus leaving only the format specifiers. That should then, hopefully, be canonical enough to be compared.
Perhaps you'd need to further strip things like field widths, and (if you support it) argument indexes since those will differ for different translations.
It shouldn't be very hard to come up with the stripping function, since format specifiers are pretty simple. Drop characters until you find a %, then check the following character. If it's also a %, drop both; otherwise copy characters until you reach one of the "final" conversion characters (d, f, s, u and so on).
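A minimal sketch of that stripping function in C, for illustration only. It assumes the C99 printf conversion characters, keeps widths and precisions as they are (so "%5d" and "%d" compare as different), and does not handle positional arguments such as %1$s. The names strip_format and same_format and the 256-byte scratch buffers are made up for this example.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy only the conversion specifications of fmt into out, dropping
 * ordinary text and "%%". Returns 0 on success, -1 if out is too small
 * or a specification is never terminated (out is then unspecified).
 */
int strip_format(const char *fmt, char *out, size_t outsz)
{
    /* Characters that end a conversion specification (C99 printf). */
    static const char final[] = "diouxXeEfFgGaAcspn";
    size_t n = 0;

    if (outsz == 0)
        return -1;
    while (*fmt != '\0') {
        if (*fmt != '%') {              /* ordinary text: drop it */
            fmt++;
            continue;
        }
        if (fmt[1] == '%') {            /* literal percent sign: drop both */
            fmt += 2;
            continue;
        }
        /* Copy "%...X" up to and including the conversion character X. */
        for (;;) {
            char c = *fmt;

            if (c == '\0')
                return -1;              /* specification never terminated */
            if (n + 1 >= outsz)
                return -1;              /* output buffer too small */
            out[n++] = c;
            fmt++;
            if (strchr(final, c) != NULL)
                break;
        }
    }
    out[n] = '\0';
    return 0;
}

/* Return 1 if a and b carry the same specifiers in the same order,
 * 0 if they differ, -1 if either string could not be processed. */
int same_format(const char *a, const char *b)
{
    char sa[256], sb[256];

    if (strip_format(a, sa, sizeof sa) != 0 ||
        strip_format(b, sb, sizeof sb) != 0)
        return -1;
    return strcmp(sa, sb) == 0;
}
```

With the strings from the question, the first three all reduce to "%.200s%.5f%d%u%.*s" (the stray l after %u is treated as ordinary text), while the fourth reduces to "%.5f%.200s%d%u%.*s" and is rejected.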

Just as a follow-up/clarification, our use case is to validate translations (po files), as printf mismatches between the original string and the translated one can lead to nasty crashes...
Currently I'm using this regex (Python code, as we handle this in Python), which is a basic representation of printf syntax:
>>> import re
>>> _format = re.compile(r"(?<!%)(?:%%)*%[-+#0]?(?:\*|[0-9]+)?(?:\.(?:\*|[0-9]+))?(?:[hljztL]|hh|ll)?[tldiuoxXfFeEgGaAcspn]").findall
>>> _format("%.200sSOMETEXT%.5fSOMEMORETEXT%d%ul%.*s%%")
['%.200s', '%.5f', '%d', '%u', '%.*s']
>>> _format("%.200sBLAHBLAH%.5ftest%d%ul%.*s%%MORETEXT")
['%.200s', '%.5f', '%d', '%u', '%.*s']
>>> _format("%.200s%.5f%d%ul%.*s%%")
['%.200s', '%.5f', '%d', '%u', '%.*s']
So a mere comparison between returned lists tells us whether those strings are printf-compatible or not.
This probably does not address all possible corner cases, but it works pretty well...

Related

Unable to form the required regex in C

I am trying to write a regex that can search a string and return true if it matches and false otherwise.
The check should ensure that the string is a wildcard domain name of a website.
Example:
*.cool.dude is valid
*.cool is not valid
abc.cool.dude is not valid
So I wrote something like this:
\\*\\.[.*]\\.[.*]
However, this also accepts *.. as a valid string, because * means zero or more occurrences.
I am looking for something that ensures at least one occurrence of the string.
Example:
*.a.b -> valid but *.. -> invalid
How can I change the regex to support this?
I have already tried the following:
\\*\\.([.*]{1,})\\.([.*]{1,}) -> doesn't work
\\*\\.([.+])\\.(.+) -> doesn't work
^\\*\\.[a-zA-Z]+\\.[a-zA-Z]+ -> doesn't work
I have tried a bunch of other options as well and have failed to find a solution. Would be great if someone can provide some input.
PS. Looking for a solution which works in C.
[.*] does not mean "0 or more occurrences" of anything. It means "a single character, either a (literal) . or a (literal) *". [...] defines a character class, which matches exactly one character from the specified set. Brackets are not even remotely the same as parentheses.
So if you wanted to express "zero or more of any character except newline", you could just write .*. That's what .* means. And if you wanted "one or more" instead of "zero or more", you could change the * to a plus, as long as you remember that regex.h regexes should always be compiled with the REG_EXTENDED flag. Without that flag, + is just an ordinary character. (And there are a lot of other inconveniences.)
But that's probably not really what you want. My guess is that you want something like:
^[*]([.][A-Za-z0-9_]+){2,}$
although you'll have to correct the character class to specify the precise set of characters you think are legitimate.
Again, don't forget the crucial REG_EXTENDED flag when you call regcomp.
Some notes:
The {2,} requires at least two components after the *, so that *.cool doesn't match.
The ^ and $ at the beginning and end of the regex "anchor" the match to the entire input. That stops the pattern matching just a part of the input, but it might not be exactly what you want, either.
Finally, I deliberately used a single-character character class to force [*] and [.] to be ordinary characters. I find that a lot more readable than falling timber (\\) and it avoids having to think about the combination of string escaping and regex-escaping.
For more information, I highly recommend reading man regcomp and man 7 regex. A good introduction to regexes might be useful, as well.
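A minimal sketch of how the pattern above could be used with regcomp and regexec, assuming a POSIX system that provides <regex.h>. The function name is_wildcard_domain is made up for this example, and the character class is the same placeholder as above.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Return 1 if name looks like a wildcard domain such as "*.cool.dude",
 * 0 if it does not, and -1 if the pattern fails to compile.
 */
int is_wildcard_domain(const char *name)
{
    regex_t re;
    int rc;

    /* REG_EXTENDED makes + and {2,} operators; REG_NOSUB says we only
     * need a yes/no answer, not the positions of submatches. */
    if (regcomp(&re, "^[*]([.][A-Za-z0-9_]+){2,}$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, name, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Compiling the pattern on every call keeps the sketch short; code that validates many names would compile once and reuse the regex_t.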

JTidy not handling some characters correctly

Certain characters get mangled after I call Tidy.parse. Two examples are: ’ instead of ' and ∼ instead of ~
I'm guessing that these must have come from Word or something similar, but Tidy handles them very badly. Specifically, it converts them to individual entity representations for the diacritics, which then get converted to meaningless junk later in my process. I'm sure there are others, but these are the ones I have found so far. Is there any known way to convert these beforehand or have Tidy ignore them?
Tidy tidy = new Tidy();
tidy.setXHTML(true);
tidy.setForceOutput(true);
tidy.parse(inputStream, outputStream);
After printing out the config, I could see that the input and output encodings were not set to UTF-8 as I had thought so I just had to add this:
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");

How to wildcard search with capture in C?

I'm trying to write a routine in C to capture sequences of characters in a string argument. In addition to literal characters, the matching criteria can contain ? meaning exactly one character and * meaning zero or more characters (lazy).
e.g.
string: ok1ok1234567890
match: *(ok?2*)4*
The result should be the position of the match = 3 and the length of the match = 5
I have tried numerous ways of doing this, have put it aside, come back to it, put it aside again etc. I cannot crack it. It needs to be a purely C solution and able to capture multiple captures.
e.g. (*)(ok??)3(4*)8*
Every solution I come up with works in many cases but not all. I'm hoping someone somewhere might have done this already or have an insight to how it can be done.

Parsing a stream of data for control strings

I feel like this is a pretty common problem but I wasn't really sure what to search for.
I have a large file (so I don't want to load it all into memory) that I need to parse control strings out of and then stream that data to another computer. I'm currently reading in the file in 1000 byte chunks.
For example, suppose the data contains ASCII codes escaped as '$', some number of digits, then ';', and looks like this: "quick $33;brown $126;fox $a $12a". The string going to the other computer would be "quick !brown ~fox $a $12a".
In my current approach I have the following problems:
What happens when a control string falls on a buffer boundary?
If the string is '$' followed by anything but digits and a ';', I want to ignore it. So I need to read ahead until the full control string is found.
I'm writing this in straight C, so I don't have streams to help me.
Would an alternating double-buffer approach work, and if so, how does one manage the current locations, etc.?
If I've followed what you are asking, this is called lexical analysis or tokenization, and it is usually described with regular expressions. For regular languages you can construct a finite state machine which will recognize your input. In practice you can use a tool that understands regular expressions to recognize the input and perform different actions for each pattern.
Depending on different requirements you might go about this differently. For more complicated languages you might want to use a tool like lex to help you generate an input processor, but for this, as I understand it, you can use a much more simple approach, after we fix your buffer problem.
You should use a circular buffer for your input, so that indexing off the end wraps around to the front again. Whenever half of the data that the buffer can hold has been processed you should do another read to refill that. Your buffer size should be at least twice as large as the largest "word" you need to recognize. The indexing into this buffer will use the modulus (remainder) operator % to perform the wrapping (if you choose a buffer size that is a power of 2, such as 4096, then you can use bitwise & instead).
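The wrapping index might look like the following sketch, which assumes a power-of-two size so that bitwise & can replace %. The names ring, ring_put, ring_get and ring_count are made up for this example, and the read and write positions are kept as free-running counters that are only masked when indexing.

```c
#include <stddef.h>

#define RING_SIZE 4096u                 /* must be a power of two */
#define RING_MASK (RING_SIZE - 1u)

struct ring {
    unsigned char data[RING_SIZE];
    size_t rd;                          /* total bytes consumed so far */
    size_t wr;                          /* total bytes stored so far */
};

/* Number of bytes currently held in the buffer. */
size_t ring_count(const struct ring *r)
{
    return r->wr - r->rd;
}

/* Store one byte; returns 0 if the buffer is full. */
int ring_put(struct ring *r, unsigned char c)
{
    if (ring_count(r) == RING_SIZE)
        return 0;
    r->data[r->wr++ & RING_MASK] = c;   /* index wraps to the front */
    return 1;
}

/* Fetch the oldest byte; returns 0 if the buffer is empty. */
int ring_get(struct ring *r, unsigned char *c)
{
    if (ring_count(r) == 0)
        return 0;
    *c = r->data[r->rd++ & RING_MASK];
    return 1;
}
```

A refill step would call fread into a temporary block and ring_put each byte whenever ring_count drops below half of RING_SIZE.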
Now you just look at the characters until you read a $, output what you've looked at up until that point, and then knowing that you are in a different state because you saw a $ you look at more characters until you see another character that ends the current state (the ;) and perform some other action on the data that you had read in. How to handle the case where the $ is seen without a well formatted number followed by an ; wasn't entirely clear in your question -- what to do if there are a million numbers before you see ;, for instance.
The regular expressions would be:
[^$]
Any non-dollar sign character. This could be augmented with a closure ([^$]* or [^$]+) to recognize a string of non$ characters at a time, but that could get very long.
$[0-9]{1,3};
This would recognize a dollar sign followed by 1 to 3 digits followed by a semicolon.
[$]
This would recognize just a dollar sign. It is in the brackets because $ is special in many regular expression representations when it is at the end of a symbol (which it is in this case) and means "match only if at the end of line".
Anyway, in this case it would recognize a dollar sign in the case where it is not recognized by the other, longer, pattern that recognizes dollar signs.
In lex you might have
[^$]{1,1024} { write_string(yytext); }
$[0-9]{1,3}; { write_char(atoi(yytext + 1)); /* skip the leading '$' */ }
[$] { write_char(*yytext); }
and it would generate a .c file that will function as a filter similar to what you are asking for. You will need to read up a little more on how to use lex though.
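For comparison, the same three rules can be hand-coded as a small state machine that keeps its state between reads, so a control string split across two chunks is still recognized. This is only a sketch: it assumes codes of 1 to 3 digits (as in the rules above), does no range checking on the decoded value, and writes into a caller-supplied buffer. The names decoder, decode_chunk and decode_finish are made up for this example.

```c
#include <stddef.h>
#include <string.h>

#define MAX_DIGITS 3                    /* assumption: codes are 1 to 3 digits */

struct decoder {
    char pending[1 + MAX_DIGITS];       /* '$' plus the digits seen so far */
    size_t npending;                    /* 0 means "in plain text" */
};

/* Emit the held-back bytes literally and return to plain text. */
static size_t flush_pending(struct decoder *d, char *out)
{
    size_t n = d->npending;

    memcpy(out, d->pending, n);
    d->npending = 0;
    return n;
}

/*
 * Decode one chunk into out, which must have room for
 * len + sizeof d->pending bytes. Returns the number of bytes written.
 */
size_t decode_chunk(struct decoder *d, const char *in, size_t len, char *out)
{
    size_t n = 0;
    size_t i, k;

    for (i = 0; i < len; i++) {
        char c = in[i];

        if (d->npending > 0) {          /* inside a possible "$ddd;" */
            if (c >= '0' && c <= '9' && d->npending <= MAX_DIGITS) {
                d->pending[d->npending++] = c;
                continue;
            }
            if (c == ';' && d->npending > 1) {
                int code = 0;

                for (k = 1; k < d->npending; k++)
                    code = code * 10 + (d->pending[k] - '0');
                out[n++] = (char)(unsigned char)code;  /* no range check */
                d->npending = 0;
                continue;
            }
            /* Not a control string: emit it as is, then handle c below. */
            n += flush_pending(d, out + n);
        }
        if (c == '$')
            d->pending[d->npending++] = c;
        else
            out[n++] = c;
    }
    return n;
}

/* At end of input, emit an unfinished "$12" literally.
 * out must have room for sizeof d->pending bytes. */
size_t decode_finish(struct decoder *d, char *out)
{
    return flush_pending(d, out);
}
```

The caller zero-initializes a struct decoder, passes each 1000-byte block from fread to decode_chunk, sends the returned bytes on, and calls decode_finish once at end of file. At most four bytes are ever held back, so no read-ahead or second buffer is needed.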
The "f" family of functions in <stdio.h> can take care of the streaming for you. Specifically, you're looking for fopen(), fgets(), fread(), etc.
Nategoose's answer about using lex (and I'll add yacc, depending on the complexity of your input) is also worth considering. They generate lexers and parsers that work, and after you've used them you'll never write one by hand again.

use __typeof__ in input validation

Can we use __typeof__ for input validation in a C program running on Linux, and if so, how?
If we can't, are there any ways other than regex to achieve the same?
"typeof" is purely a compile-time directive. It cannot be used for "input validation."
Input validation rules can be complex. In C, they are made more complex by the fact that the tools at your disposal in the standard C library are pretty awful. One example is atoi(), which returns 0 if the string you pass in doesn't start with a number (atoi("hello world") == 0), while atoi("1337isanumber") returns 1337. To simply check whether something is a number, you could (assuming ASCII and not Unicode) loop over the string up to the first null terminator (or up to the size of the memory you allocated for it) and make sure that each character is in fact a digit. A similar procedure could check whether something is alphanumeric, etc. As you mentioned, regexes can be used for a telephone number or some other relatively complex data format.
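A short sketch of both ideas in C. The first function is the digit loop described above; the second uses strtol instead of atoi, a substitution made here because strtol reports where parsing stopped, so trailing junk can be rejected. The names is_all_digits and parse_int are made up for this example.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Return 1 if s is a non-empty string made only of decimal digits. */
int is_all_digits(const char *s)
{
    if (*s == '\0')
        return 0;
    for (; *s != '\0'; s++)
        if (!isdigit((unsigned char)*s))
            return 0;
    return 1;
}

/*
 * Parse s as a decimal int. Returns 1 and stores the value in *out only
 * if the whole string is a number that fits in an int; returns 0
 * otherwise. Like strtol itself, this accepts leading whitespace and
 * an optional sign.
 */
int parse_int(const char *s, int *out)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(s, &end, 10);
    if (end == s || *end != '\0')       /* no digits, or trailing junk */
        return 0;
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return 0;                       /* out of range for int */
    *out = (int)v;
    return 1;
}
```

Unlike atoi, parse_int rejects both "hello world" and "1337isanumber" instead of quietly returning 0 or 1337.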
Your comment below references using "instanceof" in Java for input validation, but this isn't possible either. If you get user input from, say, the command line, or a query string parameter, or whatever, it really comes in as a string. If you're using a Scanner object to scan the standard input and use a method such as nextInt(), it's really converting a string (from the stream) into something, which can throw a runtime exception. You cannot use instanceof to determine a string's contents; a String is a String -- even if its contents are "42", it is not an instance of an Integer!
