Is checking the first character before doing strcmp useful?

Which of the following is more efficient:
if (strcmp(str1,str2) != 0) {
...
}
OR
if (str1[0] != str2[0] || strcmp(str1,str2) != 0) {
...
}
Assume str2 is always unique, and there can be multiple values of str1.

There is no need for the second version, as strcmp is usually implemented very cleverly and can compare multiple characters at once.
In the second version, the short-circuit property of || may save a function call when the first characters differ. You should benchmark both versions against your actual data to get a reliable answer.
But my suggestion remains: there is no need for version 2 (str1[0] != str2[0] || strcmp(str1,str2) != 0) unless profiling proves strcmp to be a bottleneck for your workload and there is evidence that version 2 performs better.
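If you do benchmark, a minimal harness might look like the sketch below. The strings and iteration count here are made up; compile with optimizations (e.g. gcc -O2), and note that data with equal versus unequal first characters exercises different paths, so test with representative input.
#include <stdio.h>
#include <string.h>
#include <time.h>

#define ITERATIONS 10000000L

int main(void)
{
    const char *str1 = "hello world";   /* hypothetical test data */
    const char *str2 = "hello there";
    volatile long sink = 0;   /* volatile keeps the loops from being optimized away */
    clock_t t0;

    t0 = clock();
    for (long i = 0; i < ITERATIONS; i++)
        sink += (strcmp(str1, str2) != 0);
    printf("strcmp only:          %.3fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (long i = 0; i < ITERATIONS; i++)
        sink += (str1[0] != str2[0] || strcmp(str1, str2) != 0);
    printf("first-char || strcmp: %.3fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    return 0;
}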

strcmp(str1,str2) != 0
checks the first characters and returns immediately if they are not equal, so you need not explicitly check
str1[0] != str2[0].
Your str1[0] != str2[0] does the same thing that strcmp(str1,str2) does in its very first comparison.

strcmp starts by comparing the first character of each string. If they are equal to each other, it continues with the following pairs until the characters differ or a terminating null character is reached.
So in the second case, the extra condition on the first character of the string adds nothing:
strcmp already performs the str1[0] != str2[0] check itself.

As @Abhineet suggests, "test and see for yourself."
if (strcmp(str1,str2) != 0) and if ((str1[0] != str2[0]) || strcmp(str1,str2) != 0) are functionally the same when each is passed a C string. This, of course, is a requirement; otherwise, why compare performance?
C does not specify performance, so even if this approach works faster with a given compiler on a given machine, it may be worse with the next version of the compiler, some compiler option change, or a different string data set.
But in my experience writing string-heavy code on multiple platforms, this trick did improve performance on select machines and did not significantly slow the others. Your results may vary.
As with any linear improvement in performance, slight tweaks to heavily used code need a deep understanding of the target machine to know if they are always faster.
Typically, using your programming time to think about other approaches can reap far larger performance improvements.
1) Hash codes
2) Unique strings, which need only a pointer compare (see the interning sketch below)
3) Other "string" structures
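To illustrate point 2, here is a sketch of string interning with a hypothetical intern() helper: once every distinct string is stored exactly once in a pool, equality at use sites is a single pointer compare. A real pool would use a hash table instead of a linear scan; strdup is POSIX.
#include <stdlib.h>
#include <string.h>

#define POOL_MAX 1024
static const char *pool[POOL_MAX];
static size_t pool_len;

/* Return the canonical copy of s, adding it to the pool if new.
   Linear scan for brevity; a real implementation would hash. */
const char *intern(const char *s)
{
    for (size_t i = 0; i < pool_len; i++)
        if (strcmp(pool[i], s) == 0)
            return pool[i];
    if (pool_len == POOL_MAX)
        return NULL;                    /* pool full */
    return pool[pool_len++] = strdup(s);
}

/* Once both sides are interned, equality is one pointer compare:
   if (interned_a == interned_b) ...  with no strcmp at lookup sites. */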

Related

What is an efficient way to test for a zero-length string: `strlen(str) > 0` or `*str`?

Assuming char *str;, the following approaches work to test for an empty string:
/* Approach 1 */
if (str && strlen(str) > 0) {
...
}
/* Approach 2 */
if (str && *str) {
...
}
Which is preferable to use? I assume the second will be faster, as it does not have to iterate over the buffer to get the length. Also, are there any downsides to using the second?
I'll give a third option that I find better:
if (str && str[0]) {
// ...
}
The strlen method isn't ideal, since it has to iterate over the whole string when the string is non-empty. The compiler may optimize out that call entirely (as has been pointed out), but it won't on every compiler (and I assume the -ffreestanding option would disable this optimization), and it at least makes it look like more work needs to happen.
However, I consider the [0] to have much clearer intent than a *. I generally recommend using * when dereferencing a pointer to a single object, and using [0] when that pointer is to the first element of an array.
To be extra clear, you can do:
if (str && str[0] != '\0') {
// ...
}
but that starts tipping the special-to-alphanum-characters ratio towards hard-to-read.
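If readability is the main concern, yet another option is to wrap the test in a helper whose name states the intent (is_empty is a hypothetical name, not a standard function):
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: treats NULL as empty, so !is_empty(str)
   is equivalent to the (str && str[0]) test above. */
static inline bool is_empty(const char *str)
{
    return str == NULL || str[0] == '\0';
}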
Approach 2, because the first approach will iterate over the whole string if it's not empty. And no, there are no downsides to approach 2.
It's unlikely that the compiler would generate different code if you have optimization enabled. But if performance REALLY is an issue it's probably better to use the second approach. I say probably, because it's not 100% certain. Who knows what the optimizer will do? But there is a risk that the first approach will iterate over the whole string if it's not empty.
When it comes to readability, I'd say that the first is slightly more readable. However, using *str to test for an empty string is very idiomatic in C. Any seasoned C coder would instantly understand what it means. So TBH, the readability issue is mostly in case someone who is not a C programmer will read the code. If someone does not understand what if (str && *str) does, then you don't want them to modify the code either. ;)
If there is a coding standard for the code base you're working on, stick to that. If there's not, pick the one you like most. It does not really matter.
What is the most preferable to use?
Obviously number one, as that is most readable.
I would ignore performance issues. Both Clang and GCC will generate the same code with -O3.
See godbolt
The first approach expresses programmer's intentions a bit more explicitly.

How to check if any one of several pointers is null?

I have several pointer/integer variables and I want to check if any of them is 0. Right now I compare each to 0 in a large if statement that would short circuit once it hits one that is 0. I was wondering if there is any more clever or faster way of accomplishing this.
It doesn't really matter. Even if you stacked all the pointers into an array and looped over it, or OR-ed all the values together, you'd still have to examine them one after another. And if you write something like if( a != 0 && b != 0 && ... && z != 0 ), the compiler will emit roughly as many instructions as it would in the other cases.
The only thing you might save by using an array that you loop over is perhaps some memory at some point, but I don't think that is what you were looking for.
No, there is not. Think about it: to really be sure that not a single one of your values is zero, you absolutely have to look at each and every one of them. As you correctly noted, it is possible to short-circuit once a zero value has been found. I would recommend something similar to this:
int all_nonzero = 1;   /* stays 1 until a zero is seen */
for (int i = 0; i < null_list_len && all_nonzero; ++i)
    all_nonzero = (null_list[i] != 0);   /* logical test; bitwise &= would misfire on values like 1 and 2 */
if (all_nonzero)
    /* do stuff */
You can improve the run time if you have more assumptions about the values you are testing. If, for example, you knew that the null_list array was sorted, you would only have to check whether the very first entry is zero, as a nonzero first entry would imply that all the other values are also greater than zero.
Well, you could code whatever it is that sets the vars to zero to ensure that a common boolean is set to 'true'.
Checking would then be a matter of testing one boolean, no matter how many vars there are. If the bool is true, then you can do a sequential check, much as you are doing now.
That may or may not be possible, faster, or more efficient overall; a sketch of the idea follows.
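Here is that flag idea sketched out, with hypothetical names: every code path that stores a pointer also records whether it stored a NULL, so the common case tests a single boolean.
#include <stdbool.h>
#include <stddef.h>

#define NVARS 8
static void *vars[NVARS];
static bool saw_null = false;   /* set whenever any slot becomes NULL */

static void set_var(size_t i, void *p)
{
    vars[i] = p;
    if (p == NULL)
        saw_null = true;
}

static bool any_null(void)
{
    if (!saw_null)
        return false;                    /* fast path: one boolean test */
    for (size_t i = 0; i < NVARS; i++)   /* slow path: confirm which slot */
        if (vars[i] == NULL)
            return true;
    saw_null = false;                    /* flag was stale (slot was reassigned) */
    return false;
}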

Is there a known O(nm)-time/O(1)-space algorithm for POSIX filename matching (fnmatch)?

Edit: WHOOPS! Big admission, I screwed up the definition of the ? in fnmatch pattern syntax and seem to have proposed (and possibly solved) a much harder problem where it behaves like .? in regular expressions. Of course it actually is supposed to behave like . in regular expressions (matching exactly one character, not zero or one). Which in turn means my initial problem-reduction work was sufficient to solve the (now rather boring) original problem. Solving the harder problem is rather interesting still though; I might write it up sometime.
On the plus side, this means there's a much greater chance that something like 2way/SMOA needle factorization might be applicable to these patterns, which in turn could yield the better-than-originally-desired O(n) or even O(n/m) performance.
In the question title, let m be the length of the pattern/needle and n be the length of the string being matched against it.
This question is of interest to me because all the algorithms I've seen/used have either pathologically bad performance and possible stack overflow exploits due to backtracking, or required dynamic memory allocation (e.g. for a DFA approach or just avoiding doing backtracking on the call stack) and thus have failure cases that could also be dangerous if a program is using fnmatch to grant/deny access rights of some sort.
I'm willing to believe that no such algorithm exists for regular expression matching, but the filename pattern language is much simpler than regular expressions. I've already simplified the problem to the point where one can assume the pattern does not use the * character, and in this modified problem you're not matching the whole string but searching for an occurrence of the pattern in the string (like the substring match problem). If you further simplify the language and remove the ? character, the language is just composed of concatenations of fixed strings and bracket expressions, and this can easily be matched in O(mn) time and O(1) space, which perhaps can be improved to O(n) if the needle factorization techniques used in 2way and SMOA substring search can be extended to such bracket patterns. However, naively each ? requires trials with or without the ? consuming a character, bringing in a time factor of 2^q where q is the number of ? characters in the pattern.
Anyone know if this problem has already been solved, or have ideas for solving it?
Note: In defining O(1) space, I'm using the Transdichotomous_model.
Note 2: This site has details on the 2way and SMOA algorithms I referenced: http://www-igm.univ-mlv.fr/~lecroq/string/index.html
Have you looked into the re2 regular expression engine by Russ Cox (of Google)?
It's a regular expression matching engine based on deterministic finite automata, which is different than the usual implementations (Perl, PCRE) using backtracking to simulate a non-deterministic finite automaton. One of the specific design goals was to eliminate the catastrophic backtracking behaviour you mention.
It disallows some of the Perl extensions like backreferences in the search pattern, but you don't need that for glob matching.
I'm not sure if it guarantees O(mn) time and O(1) memory constraints specifically, but it was good enough to run the Google Code Search service while it existed.
At the very least it should be cool to look inside and see how it works. Russ Cox has written three articles about re2 - one, two, three - and the re2 code is open source.
Edit: As explained in the question's edit above, I misremembered the semantics of ? and actually solved a much harder problem where it behaves like .? in regular expressions. A possible solution to that harder problem follows below.
I have worked out what seems to be a solution in O(log q) space (where q is the number of question marks in the pattern, and thus q < m) and uncertain but seemingly better-than-exponential time.
First of all, a quick explanation of the problem reduction. First break the pattern at each *; it decomposes as a (possibly zero-length) initial and final component, and a number of internal components flanked on both sides by a *. This means once we've determined whether the initial/final components match up, we can apply the following algorithm for internal matches: starting with the last component, search for the match in the string that starts at the latest offset. This leaves the most possible "haystack" characters free to match earlier components; if they're not all needed, it's no problem, because the fact that a * intervenes allows us to later throw away as many as needed, so it's not beneficial to try "using more ? marks" of the last component or finding an earlier occurrence of it. This procedure can then be repeated for every component. Note that here I'm strongly taking advantage of the fact that the only "repetition operator" in the fnmatch expression is the *, which matches zero or more occurrences of any character. The same reduction would not work with regular expressions.
With that out of the way, I began looking for how to match a single component efficiently. I'm allowing a time factor of n, so that means it's okay to start trying at every possible position in the string, and give up and move to the next position if we fail. This is the general procedure we'll take (no Boyer-Moore-like tricks yet; perhaps they can be brought in later).
For a given component (which contains no *, only literal characters, brackets that match exactly one character from a given set, and ?), it has a minimum and maximum length string it could match. The minimum is the length if you omit all ? characters and count bracket expressions as one character, and the maximum is the length if you include ? characters. At each position, we will try each possible length the pattern component could match. This means we perform q+1 trials. For the following explanation, assume the length remains fixed (it's the outermost loop, outside the recursion that's about to be introduced). This also fixes a length (in characters) from the string that we will be comparing to the pattern at this point.
Now here's the fun part. I don't want to iterate over all possible combinations of which ? characters do/don't get used. The iterator is too big to store. So I cheat. I break the pattern component into two "halves", L and R, where each contains half of the ? characters. Then I simply iterate over all the possibilities of how many ? characters are used in L (from 0 to the total number that will be used based on the length that was fixed above) and then the number of ? characters used in R is determined as well. This also partitions the string we're trying to match into part that will be matched against pattern L and pattern R.
Now we've reduced the problem of checking if a pattern component with q ? characters matches a particular fixed-length string to two instances of checking if a pattern component with q/2 ? characters matches a particular smaller fixed-length string. Apply recursion. And since each step halves the number of ? characters involved, the number of levels of recursion is bounded by log q.
You can compute a hash of both strings and then compare those. The hash computation is done in O(m), while the search is O(m + n).
You can use something like this for calculating the hash of a string, where s[i] is a character:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
As you said, this is for file-name matching, and you can't use this where you have wildcards in the strings. Good luck!
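For reference, that polynomial can be evaluated in O(n) with Horner's rule (it is the same 31-based scheme that Java's String.hashCode uses); a minimal C sketch:
/* Horner-style evaluation of s[0]*31^(n-1) + ... + s[n-1].
   Arithmetic wraps modulo the width of unsigned long. */
unsigned long hash31(const char *s)
{
    unsigned long h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h;
}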
My feeling is that this is not possible.
Though I can't provide a bullet-proof argument, my intuition is that you will always be able to construct patterns containing q=Theta(m) ? characters where it will be necessary for the algorithm to, in some sense, account for all 2^q possibilities. This will then require O(q)=O(m) space to keep track of which of the possibilities you're currently looking at. For example, the NFA algorithm uses this space to keep track of the set of states it's currently in; the brute-force backtracking approach uses the space as stack (and to add insult to injury, it uses O(2^q) time in addition to the O(q) of space).
OK, here's how I solved the problem.
1) Attempt to match the initial part of the pattern, up to the first *, against the string. If this fails, bail out. If it succeeds, throw away this initial part of both the pattern and the string; we're done with them. (And if we hit the end of the pattern before hitting a *, we have a match iff we also reached the end of the string.)
2) Skip all the way to the end of the pattern (everything after the last *, which might be a zero-length pattern if the pattern ends with a *). Count the number of characters needed to match it, and examine that many characters from the end of the string. If they fail to match, we're done. If they match, throw away this component of the pattern and string.
3) Now we're left with a (possibly empty) sequence of subpatterns, all of which are flanked on both sides by *'s. We try searching for them sequentially in what remains of the string, taking the first match for each and discarding the beginning of the string up through the match. If we find a match for each component in this manner, we have a match for the whole pattern. If any component search fails, the whole pattern fails to match.
This algorithm has no recursion and only stores a finite number of offsets into the string/pattern, so in the transdichotomous model it is O(1) space. Step 1 is O(m) in time, step 2 is O(n+m) in time (or O(m) if we assume the input string length is already known, but I'm assuming a C string), and step 3 is (using a naive search algorithm) O(nm). Thus the algorithm overall is O(nm) in time. It may be possible to improve step 3 to O(n), but I haven't yet tried.
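For the reduced language with only literals, ? and * (no bracket expressions or multi-character collating elements), the same bounds are also reachable with the well-known "last star" backtracking matcher. The sketch below is that textbook variant rather than the exact three-step procedure above, but it likewise runs in O(nm) time and O(1) space:
#include <stdbool.h>

/* Glob match for patterns of literals, '?' (exactly one character)
   and '*' (zero or more characters). O(nm) time, O(1) space. */
static bool glob_match(const char *pat, const char *str)
{
    const char *star = NULL;    /* most recent '*' in pat */
    const char *resume = NULL;  /* where in str to retry from after it */

    while (*str) {
        if (*pat == '?' || *pat == *str) {
            pat++;              /* direct match of one character */
            str++;
        } else if (*pat == '*') {
            star = pat++;       /* record the star; first try matching */
            resume = str;       /* zero characters with it */
        } else if (star) {
            pat = star + 1;     /* mismatch: let the last star swallow */
            str = ++resume;     /* one more character, then retry */
        } else {
            return false;       /* mismatch with no star to fall back on */
        }
    }
    while (*pat == '*')         /* trailing stars match the empty suffix */
        pat++;
    return *pat == '\0';
}
Bracket expressions could be added by extending the first branch, since they still consume exactly one character.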
Finally, note that the original harder problem is perhaps still useful to solve. That's because I didn't account for multi-character collating elements, which most people implementing regex and such tend to ignore because they're ugly to get right and there's no standard API to interface with the system locale and obtain the necessary info to get them. But with that said, here's an example: Suppose ch is a multi-character collating element. Then [c[.ch.]] could consume either 1 or 2 characters. And we're back to needing the more advanced algorithm I described in my original answer, which I think needs O(log m) space and perhaps somewhat more than O(nm) time (I'm guessing O(n²m) at best). At the moment I have no interest in implementing multi-character collating element support, but it does leave a nice open problem...

Speeding up large switches and if-elses

What can I do to improve large switches and if-elses speed manually? I will probably need some kind of hash or lookup table.
I'm working with gcc and C code, I doubt that gcc has any inbuilt optimizations for this.
Edit:
My switch code is what every switch looks like: do something based on whether a particular int has some value.
My if-elses look like this:
if( !strcmp( "val1", str ) )
foo();
else if( !strcmp( "val2", str ) )
foo2();
...
I also have ifs that do this:
if( s.member1 != NULL )
foo();
if( s.member2 != NULL )
foo2();
EDIT2:
Thank you everyone. I'm not sure which one I should pick as an answer, because a lot of these answers have valid points and valuable insights. Unfortunately, I have to pick just one. But thanks all!
In the end, using a perfect hash table seems the best way to get O(1) lookup time for both the ifs and the switches.
To use a hash table:
Pick a hash function. This one is a biggie. There are tradeoffs between speed, the quality of the hash, and the size of the output. Encryption algorithms can make good hash functions. The hash function performs some computation using all the bits of your input value to return some output value with a smaller number of bits.
So the hash function takes a string and returns an integer between 0 and N-1.
Now you can look up a pointer to a function in a table of size N.
Each entry in the table will be a linked list (or some other searchable data structure) because of the chance of collisions, that is, two strings mapping to the same hash value.
E.g., let's say hash(char*) returns a value between 0 and 3:
hash("val1") returns 2
hash("val2") returns 0
hash("val3") also returns 0
hash("val4") returns 1
Now your hash table looks something like:
table[0] ("val2",foo2) ("val3", foo3)
table[1] ("val4",foo4)
table[2] ("val1",foo1)
table[3] <empty>
I hope you can see how the cost of matching using a hash table is bounded by the time it takes to compute the hash function plus the small time it takes to search the entry's chain in the table. If the hash table is large enough, most hash table entries will have very few items.
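A minimal C sketch of such a table (the hash function, table size, and names are invented for illustration):
#include <string.h>

typedef void (*handler_fn)(void);

struct entry {
    const char   *key;
    handler_fn    fn;
    struct entry *next;   /* collision chain */
};

#define TABLE_SIZE 16
static struct entry *table[TABLE_SIZE];   /* populated at startup (omitted) */

/* Toy hash for illustration; a real one should mix bits better. */
static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

static void dispatch(const char *str)
{
    for (struct entry *e = table[hash(str)]; e; e = e->next) {
        if (strcmp(e->key, str) == 0) {   /* verify: hashes can collide */
            e->fn();
            return;
        }
    }
    /* no handler found: fall through to a default action if needed */
}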
For strings, if you have a small finite number of possible strings, use a perfect hash and switch on the result. With only 30 or so strings, finding a perfect hash should be pretty easy. If you also need to validate the input, you'll have to do a single strcmp in each case, but that's pretty cheap.
Beyond that, just let the compiler optimize your switches. Only do anything fancier if you've done sufficient testing to know the time spent here is performance-critical.
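The shape of that perfect-hash-plus-switch approach, with a hypothetical perfect_hash() standing in for what a tool like gperf would generate:
#include <string.h>

void foo(void);
void foo2(void);
int perfect_hash(const char *s);   /* assumed: maps "val1", "val2", ...
                                      to distinct small integers */

void handle(const char *str)
{
    switch (perfect_hash(str)) {
    case 0:   /* hash of "val1" */
        if (strcmp(str, "val1") == 0)   /* one strcmp validates the input */
            foo();
        break;
    case 1:   /* hash of "val2" */
        if (strcmp(str, "val2") == 0)
            foo2();
        break;
    /* ... one case per known string ... */
    }
}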
I'm not sure what you are looking for, but branch prediction with gcc is discussed in this question.
It has. Just look at the generated code; it at least optimizes switches.
You may use a hash table to optimize your code, but I'm sure that GCC does the same for you.
Another thing is if-elses that contain complex boolean expressions; I will not answer that part of the question here.
It really depends on the code base you are working with and whether it is open to further/better modularization. Otherwise, if nothing else, I can recommend this.
If some cases are more common than others (one or two things happen more often than the rest), place them at the beginning of the switch/if/else; that way, in the more common cases, your program will only make the first one or two comparisons and short-circuit its path. Generally a good idea on its own for any code, IMHO.
It depends much on the strings that you are comparing. You could switch on some characteristic of the strings: if you know that they differ reliably at the 4th position, you could switch on str[3] and only then do the strcmp. Or look at some sort of checksum and switch on that.
But all of this is quite handcrafted; you should definitely check the assembly that gcc produces.
A hash table would be ideal for speeding up a bunch of string compares.
You might want to look into a string library that does not use nul-terminated strings the way the C stdlib does. Lots of string manipulation in C involves a lot of "look through the string for the nul, then do your operation".
A string library like SafeStr keeps information about the length of its strings, so there's no need to burn time scanning for nuls, especially for strings of unequal lengths.
(I'm quoting some of this from prior research I've written on the topic.)
The SPECint2006 benchmark 458.sjeng, which implements a chess simulator, uses many switch statements to process the different chess pieces. Each statement is of a form like:
switch (board[from]) {
case (wpawn): ...
case (wknight): ...
The compiler (gcc) generates this as an instruction sequence similar to the following:
40752b: mov -0x28(%rbp),%eax
40752e: mov 0x4238(,%rax,8),%rax
407536: jmpq *%rax
This assembly acts as a lookup table. You can speed up the compiled code further by splitting your switch...case into multiple switch statements, as sketched below. You'll want to keep the case values consecutive and put the most frequent cases into separate switch statements. This particularly improves indirect branch prediction.
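A sketch of that splitting, with invented piece values: the frequent, consecutive cases get their own dense switch (one small jump table), and the rarer cases go into a second one.
void handle_pawn(void);
void handle_knight(void);
void handle_bishop(void);
void handle_rook(void);
void handle_queen(void);
void handle_king(void);

void process_piece(int piece)
{
    switch (piece) {             /* hot cases: consecutive values 1..4 */
    case 1: handle_pawn();   return;
    case 2: handle_knight(); return;
    case 3: handle_bishop(); return;
    case 4: handle_rook();   return;
    }
    switch (piece) {             /* cold cases, reached only on fallthrough */
    case 5: handle_queen();  return;
    case 6: handle_king();   return;
    }
}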
I'll leave the remainder of your questions to others.
Other answers have already suggested using a hash table; I'd recommend generating a perfect hash function using gperf (or a minimal perfect hash function; see the Wikipedia page for a few links).

Using scanf in a while loop

Probably an extremely simple answer to this extremely simple question:
I'm reading "C Primer Plus" by Pratta and he keeps using the example
while (scanf("%d", &num) == 1)...
Is the == 1 really necessary? It seems like one could just write:
while (scanf("%d", &num))
It seems like the equality test is unnecessary, since scanf returns the number of items read, and 1 would make the while loop condition true. Is the reason to make sure that the number of items read is exactly 1, or is this totally superfluous?
In C, 0 evaluates to false and everything else to true. Thus, if scanf returned EOF, which is a negative value, the loop condition would evaluate to true, which is not what you'd want.
Since scanf returns the value EOF (which is -1) on end of file, the loop as written is correct. It runs as long as the input contains text that matches %d, and stops either at the first non-match or end of file.
It would have been clearer at a glance if scanf were expecting more than one input....
while (scanf("%d %d", &x, &y)==2) { ... }
would exit the loop the first time it was unable to match two values and returned some value less than 2: either EOF (which is -1) at end of file, or 0 on a matching failure (e.g., the input xyzzy 42 does not match %d %d, so scanf stops at the first failure and returns 0 without writing to either x or y).
Of course, scanf is not your friend when parsing real input from normal humans. There are many pitfalls in its handling of error cases.
Edit: Corrected an error: scanf returns EOF on end of file, or a non-negative integer counting the number of variables it successfully set.
The key point is that since any non-zero value is TRUE in C, failing to test the return value correctly in a loop like this can easily lead to unexpected behavior. In particular, while(scanf(...)) is an infinite loop unless it encounters input text that cannot be converted according to its format.
And I cannot emphasize strongly enough that scanf is not your friend. A combination of fgets and sscanf might be enough for some simple parsing, but even then it is easily overwhelmed by edge cases and errors.
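For instance, a sketch of that fgets-plus-sscanf combination, which consumes and reports a malformed line instead of leaving it stuck in the stream the way scanf("%d") would:
#include <stdio.h>

int main(void)
{
    char line[256];
    int num;

    /* Read a whole line, then parse it separately. */
    while (fgets(line, sizeof line, stdin) != NULL) {
        if (sscanf(line, "%d", &num) == 1)
            printf("read %d\n", num);
        else
            fprintf(stderr, "not a number: %s", line);
    }
    return 0;
}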
You understood the C code correctly.
Sometimes the reason for testing the number of items read is that someone wants to make sure that all items were read, instead of scanf quitting early when the input didn't match the expected type. In this particular case it didn't matter.
Usually scanf is a poor choice of function because it doesn't meet the needs of interactive input from a human user. Usually a combination of fgets and sscanf yields better results. In this particular case it didn't matter.
If later chapters explain why some kinds of coding practices are better than this trivial example, good. But if not, you should dump the book you're reading.
On the other hand, your substitute code isn't exactly a substitute: if scanf returns -1, your while loop will still execute.
While you are correct it is not strictly necessary, some people prefer it for several reasons.
First, by comparing to 1 the condition becomes an explicit boolean value (true or false). Without the comparison, you are testing an integer, which is valid in C but not in some later languages (like C#).
Secondly, some people would read the second version in terms of while([function]), instead of while([return value]), and be momentarily confused by testing a function, when what is clearly meant is testing the return value.
This can be completely a matter of personal preference, and as far as I'm concerned, both are valid.
One probably could write it without an explicit comparison (see JRL's answer, though), but why would one? I'd say that comparison-less conditions should only be used with values that have explicitly boolean semantics (like an isdigit() call, for example). Everything else should use an explicit comparison. In this case (the result of scanf) the semantics are pronouncedly non-boolean, so the explicit comparison is in order.
Also, the comparison one can usually omit is normally a comparison with zero. When you feel the urge to omit a comparison with something else (like 1 in this case), it is better to think twice and make sure you know what you are doing (see JRL's answer again).
In any case, when the comparison can be safely omitted and you actually omit it, the semantic meaning of the condition remains the same. It has absolutely no impact on the efficiency of the resulting code, if that's something you are worrying about.
