strstr vs regex in C

Let's say, for example, I have a list of user id's, access times, program names, and version numbers as a list of CSV strings, like this:
1,1342995305,Some Program,0.98
1,1342995315,Some Program,1.20
2,1342985305,Another Program,15.8.3
1,1342995443,Bob's favorite game,0.98
3,1238543846,Something else,
...
Assume this list is not a file, but is an in-memory list of strings.
Now let's say I want to find out how often certain programs have been accessed, broken down by version number (e.g. "Some Program version 1.20" was accessed 193 times, "Some Program version 0.98" was accessed 876 times, and "Some Program 1.0.1" was accessed 1,932 times).
Would it be better to build a regular expression and then use regexec() to find the matches and pull out the version numbers, or strstr() to match the program name plus comma, and then just read the following part of the string as the version number? If it makes a difference, assume I am using GCC on Linux.
Is there a performance difference? Is one method "better" or "more proper" than the other? Does it matter at all?

Go with strstr() - using a regex just to count occurrences is not a good idea, since you would need a loop anyway. So I would suggest a simple loop that searches for the position of the substring, increments a counter, and advances the search start position after each match.
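A minimal sketch of that loop might look like this (the function name and the shape of the in-memory list are my assumptions, not from the question):
#include <string.h>
/* Count non-overlapping occurrences of needle across an array of
 * nlines strings, resuming the search just past each match. */
size_t count_matches(const char **lines, size_t nlines, const char *needle)
{
    size_t count = 0;
    size_t nlen = strlen(needle);
    for (size_t i = 0; i < nlines; i++) {
        const char *p = lines[i];
        while ((p = strstr(p, needle)) != NULL) {
            count++;
            p += nlen;    /* advance past this match */
        }
    }
    return count;
}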

strchr/memcmp is how most libc versions implemented strstr. Hardware-dependent implementations of strstr in glibc do better: both the SSE2 and SSE4.2 (x86) instruction sets can do far better than scanning byte-by-byte. If you want to see how, I posted a couple of blog articles a while back, "SSE2 and strstr" and "SSE2 and BNDM search", that you might find interesting.

I'd do neither: I'm betting it would be faster to use strchr() to find the commas, and strcmp() to check the program name.
As for performance, I expect the string functions (strtok/strstr/strchr/strcmp...) all to run at more or less the same speed (i.e. really, really fast), and regex to run appreciably slower, albeit still quite fast.
The real performance benefit would come from properly designing the search, though: how many times must it run, is the number of programs fixed...?
For example, a single scan that gathers ALL the frequency data for all programs at once would be slower than a single scan that seeks one given program. But properly designed, all subsequent queries for other programs would then run far faster.
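To make the strchr()/strcmp() idea above concrete, here is a hedged sketch for one record of the form uid,time,program,version (the function name and layout assumptions are mine):
#include <string.h>
/* Return a pointer to the version field if the record's program name
 * equals prog, or NULL otherwise. Assumes names contain no commas. */
const char *version_if_program(const char *line, const char *prog)
{
    const char *p = strchr(line, ',');      /* skip user id */
    if (p == NULL) return NULL;
    p = strchr(p + 1, ',');                 /* skip access time */
    if (p == NULL) return NULL;
    p++;                                    /* start of program name */
    const char *end = strchr(p, ',');       /* end of program name */
    if (end == NULL) return NULL;
    size_t len = strlen(prog);
    if ((size_t)(end - p) == len && memcmp(p, prog, len) == 0)
        return end + 1;                     /* the version field */
    return NULL;
}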

strtok(), and break the data up into something more structured (like a list of structs).
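For instance, a hedged sketch of that approach (the record type and field sizes are my assumptions):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* Hypothetical struct for one parsed CSV record. */
struct access_record {
    long user_id;
    long access_time;
    char program[64];
    char version[32];
};
/* Parse one mutable CSV line in place. Note that strtok modifies its
 * input and skips empty fields, so a trailing empty version (as in the
 * last sample record) needs extra handling in real code. */
int parse_record(char *line, struct access_record *rec)
{
    char *tok = strtok(line, ",");
    if (tok == NULL) return -1;
    rec->user_id = atol(tok);
    tok = strtok(NULL, ",");
    if (tok == NULL) return -1;
    rec->access_time = atol(tok);
    tok = strtok(NULL, ",");
    if (tok == NULL) return -1;
    snprintf(rec->program, sizeof rec->program, "%s", tok);
    tok = strtok(NULL, ",");
    snprintf(rec->version, sizeof rec->version, "%s", tok ? tok : "");
    return 0;
}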

Related

Will comparing string length before checking if string is the same yield me non-negligible speed increases for C?

Very new to programming in C, so sorry if I have badly misunderstood something. I am currently doing the speller problem set from CS50, if anyone is familiar with it: I am given words from a text to check whether they are spelled correctly by comparing them against a given dictionary. I have sorted this dictionary into a hash table with about 17,000 buckets, each pointing to a linked list of around 100 nodes on average. There may be several hundred thousand words to spell-check.
My question is this: will checking whether the length of each dictionary word matches the length of the word to be spell-checked (using strlen()), and only calling strcmp() when the lengths match, be faster than just comparing the strings with strcmp() directly?
I do see that if many words have the same length as the word being checked, comparing lengths first works against you, but I am wondering whether the speed increase, if there is one, for words with a less common length makes up for this.
strcmp is an O(n) operation - it iterates over both strings until one of them ends or a mismatching pair of characters is encountered, so at first glance comparing the lengths sounds like a good idea. However, strlen in C is also an O(n) operation - it takes a char* and iterates until it hits a \0 character. So just using strlen naively would in fact probably make your program slower.
Either you explicitly keep the string bytes (as a flexible array member) together with the length in some struct, and then yes, you could win a tiny bit of performance, or you use strlen, which will scan all the bytes. Be aware of the CPU cache. For inspiration, study the source code of open-source libraries like GLib (they implement hash tables as you do...).
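A hedged sketch of the first option (all names are mine, not from any library):
#include <stdlib.h>
#include <string.h>
/* Length-prefixed string with a flexible array member, so the length
 * check costs O(1) before any byte-by-byte comparison. */
struct lstring {
    size_t len;
    char bytes[];    /* NUL-terminated for interoperability */
};
struct lstring *lstring_new(const char *s)
{
    size_t len = strlen(s);
    struct lstring *ls = malloc(sizeof *ls + len + 1);
    if (ls == NULL) return NULL;
    ls->len = len;
    memcpy(ls->bytes, s, len + 1);
    return ls;
}
/* Compare lengths first; fall back to memcmp only when they match. */
int lstring_equal(const struct lstring *a, const struct lstring *b)
{
    return a->len == b->len && memcmp(a->bytes, b->bytes, a->len) == 0;
}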
For more, read Modern C and study the source code of open source implementations such as GNU libc and GCC.
A similar question is implementing matrices in C. Then look into this answer.
Actually, you should benchmark.
If you use Linux and GCC, compile with gcc -pg -O2 -Wall then use gprof(1) or time(1) or perf(1) to profile your program. See of course time(7) and syscalls(2).
With other compilers or operating systems, read their documentation.
It may happen that in your code the performance gains are negligible in practice (a few percent). Most English words are fewer than 16 bytes long, which fits within a single L1 cache line (on current laptop processors in 2020).

Case-insensitive, exact substring matching/index for Node.js or C (no full-text search)

What libraries provide case-insensitive, exact substring matching in Node.js against a large corpus of strings? I'm specifically looking for index-based solutions.
As an example, consider a corpus consisting of millions of strings:
"Abc Gef gHi"
"Def Ghi xYz"
…
I need a library such that a search for "C ge" returns the first string above, but a search for "C  ge" (note the multiple spaces) does not. In other words, I'm not looking for fuzzy, intelligent full-text search with stemming and stop words; rather, I want the simplest (and fastest) exact substring matcher with an index that works at large scale.
Solutions in JavaScript are welcome, and so are solutions in C (as they can be turned into a native Node.js module). Alternatively, solutions in other programming languages such as Java are also possible; they can be used through the command-line. Preferably, solutions are disk-space-bound rather than memory-bound (e.g., rather not Redis), and they should write an index to disk so that subsequent startup time is low.
The problem with most solutions I found (such as the ones here) is that they are too intelligent, i.e., they apply various kinds of stemming or normalization, so the matches are not exact.
Thanks in advance for your help!
I'll list some of the solutions I found.
The most simple, but fitting would be https://github.com/martijnversluis/JsSuffixTrie
Then, more elaborate, hash based: https://github.com/fergiemcdowall/search-index
I can also suggest http://redis.io/. It's advanced, but still quite low-level, without too much fancy packaging.
Finally, this blog post discusses tries in JavaScript, where the problem seems to be mostly loading time: http://ejohn.org/blog/javascript-trie-performance-analysis/
Off the top of my head I can think of two possible solutions.
One is to use case-insensitive regex matching, with the string you search for (e.g. "C ge") as the regex.
Another is to store an all-lowercase (or all-uppercase) copy of every string and search those copies, returning the unmodified original on a match. Of course, the search string needs to be lowercased (or uppercased) as well for this to work.
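A hedged C sketch of that second approach (the question welcomes C solutions), building the lowercase shadow copy once at index time:
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
/* Return a newly allocated all-lowercase copy of s (caller frees).
 * Search these copies with strstr, but display the originals. */
char *lowered_copy(const char *s)
{
    size_t len = strlen(s);
    char *out = malloc(len + 1);
    if (out == NULL) return NULL;
    for (size_t i = 0; i <= len; i++)    /* <= copies the NUL too */
        out[i] = (char)tolower((unsigned char)s[i]);
    return out;
}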
It depends of course on the size of your dataset and minimum response times.
For many use cases standard Unix tools such as sed and grep are pretty unbeatable when it comes to pattern matching.

Accurately count number of keywords "if", "while" in a c file

Are there any libraries out there that I can pass my .c files through that will count the visible number of, for example, "if" statements?
We don't have to worry about "if" statements in other files called by the current file; just count those in the current file.
I can do a simple grep or regex, but wanted to check whether there is something better (yet still simple).
If you want to be sure it's done right, I'd probably make use of clang and walk the ast. A URL to get you started:
http://clang.llvm.org/docs/IntroductionToTheClangAST.html
First off, there is no way to use regular expressions or grep to give you the correct answer you are looking for. There are lots of ways that you would find those strings, but they could be buried in any amount of escape characters, quotations, comments, etc.
As some commenters have stated, you will need to use a parser/lexer that understands the C language. You want something simple, you said, so you won't be writing this yourself :)
This seems like it might be usable for you:
http://wiki.tcl.tk/3891
From the page:
lexes a string containing C source into a list of tokens
That will probably get you what you want, but even then it's not going to be trivial.
What everyone has said so far is correct; it seems a lot easier to just grep the shit out of your file. The performance hit of this is negligible compared to the alternative, which is to get the gcc source code (or whichever compiler you're using), go through the parsing code, and hook in what you want to do while it parses the syntax tree. This seems like a pain in the ass, especially when all you're worried about is conditional statements. If you actually care about the branches, you could just look at the object code and count the conditional branches in the assembly, which would correctly tell you the number of branches (rather than relying on how many times you typed a conditional, which will not translate exactly to the branching of the program).

Speeding up large switches and if-elses

What can I do to improve large switches and if-elses speed manually? I will probably need some kind of hash or lookup table.
I'm working with gcc and C code, I doubt that gcc has any inbuilt optimizations for this.
Edit:
My switch code is what every switch looks like: do something based on whether a particular int has some value.
My if-elses look like this:
if( !strcmp( "val1", str ) )
    foo();
else if( !strcmp( "val2", str ) )
    foo2();
...
I also have ifs that do this (s being a struct variable):
if( s.member1 != NULL )
    foo();
if( s.member2 != NULL )
    foo2();
EDIT2:
Thank you everyone. I'm not sure which one I should pick as an answer, because a lot of these answers have valid points and valuable insights. Unfortunately, I have to pick just one. But thanks all!
In the end, using a perfect hash table seems the best way to get O(1) access time for both the ifs and the switches.
To use a hash table:
Pick a hash function. This one is a biggie. There are tradeoffs between speed, the quality of the hash, and the size of the output. Encryption algorithms can make good hash functions. The hash function performs some computation using all the bits of your input value to return some output value with a smaller number of bits.
So the hash function takes a string and returns an integer between 0 and N-1.
Now you can look up a pointer to a function in a table of size N.
Each entry in the table will be a linked list (or some other searchable data structure) because of the chance of a collision, that is, two strings mapping to the same hash value.
E.g., let's say hash(char*) returns a value between 0 and 3.
hash("val1") returns 2
hash("val2") returns 0
hash("val3") also returns 0
hash("val4") returns 1
Now your hash table looks something like:
table[0] ("val2",foo2) ("val3", foo3)
table[1] ("val4",foo4)
table[2] ("val1",foo1)
table[3] <empty>
I hope you can see that the cost of matching with a hash table is bounded by the time it takes to compute the hash function plus the small time it takes to find the entry in its bucket. If the hash table is large enough, most buckets will hold very few items.
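For illustration, a hedged sketch of such a table in C, using chaining and a stand-in hash function (all names are mine):
#include <string.h>
typedef void (*handler_fn)(void);
struct entry {
    const char *name;
    handler_fn fn;
    struct entry *next;    /* chain for colliding strings */
};
#define NBUCKETS 64
static struct entry *table[NBUCKETS];
/* Toy djb2-style hash, purely for the sketch. */
static unsigned hash(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}
/* Replaces the strcmp ladder: hash, walk the short chain, dispatch. */
int dispatch(const char *str)
{
    for (struct entry *e = table[hash(str)]; e != NULL; e = e->next) {
        if (strcmp(e->name, str) == 0) {
            e->fn();
            return 1;
        }
    }
    return 0;    /* no handler registered for str */
}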
For strings, if you have a small finite number of possible strings, use a perfect hash and switch on the result. With only 30 or so strings, finding a perfect hash should be pretty easy. If you also need to validate the input, you'll have to do a single strcmp in each case, but that's pretty cheap.
Beyond that, just let the compiler optimize your switches. Only do anything fancier if you've done sufficient testing to know the time spent here is performance-critical.
I'm not sure what you are looking for, but branch prediction with gcc is discussed in this question.
It does; just look at the generated code. At the very least it optimizes switches.
You may use a hash table to optimize your code, but I'm sure that GCC does the same for you.
Another matter is the if-elses, when they contain complex boolean expressions; I will not answer that part of the question here.
It really depends on the code base you are working with and whether it is open to further/better modularization. Otherwise, if nothing else, I can recommend this:
If there are more common cases than others (one or two things happen more than the rest), place them at the beginning of the switch/if/else, that way in the more common cases your program will only make that first one or two comparisons and short circuit its path. Generally a good idea on its own for any code IMHO.
It depends very much on the strings that you are comparing. You could switch on some characteristic of the strings:
If you know that they differ pretty reliably in the 4th position, you could switch on str[3] and only then do the strcmp.
Or look at some sort of checksum and switch on that.
But all of this is quite handcrafted; you should definitely check the assembler that gcc produces.
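A hedged sketch of the str[3] idea, using the val1/val2 strings from the question:
#include <string.h>
extern void foo(void), foo2(void);    /* handlers from the question */
void dispatch(const char *str)
{
    if (strlen(str) < 4)
        return;              /* guard so str[3] is a valid read */
    switch (str[3]) {        /* branch on the distinguishing character */
    case '1':
        if (strcmp(str, "val1") == 0) foo();
        break;
    case '2':
        if (strcmp(str, "val2") == 0) foo2();
        break;
    default:
        break;               /* no candidate has this 4th character */
    }
}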
A hash table would be ideal for speeding up a bunch of string compares.
You might want to look into a string library that does not use nul terminated strings like the C stdlib does. Lots of string manipulation in C involves a lot of "look through the string for the nul, then do your operation".
A string library like SafeStr keeps information about the length of its strings, so there is no need to burn time scanning for nuls, especially when comparing strings of unequal length.
(I'm quoting here from some prior research I've written on this topic.)
The SPECint2006 benchmark 458.sjeng, which implements a chess simulator, uses many switch statements to process the different chess pieces. Each statement has a form like:
switch (board[from]) {
case (wpawn): ...
case (wknight): ...
The compiler (gcc) generates this as an instruction sequence similar to the following:
40752b: mov -0x28(%rbp),%eax
40752e: mov 0x4238(,%rax,8),%rax
407536: jmpq *%rax
This assembly acts as a lookup table. You can speed up the compiled code further by splitting your switch ... case into multiple switch statements. You'll want to keep the case values consecutive and put the most frequent cases into different switch statements. This particularly improves the indirect branch prediction.
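Here is a hedged sketch of that splitting, with made-up piece values (not from sjeng itself):
/* Two dense switches instead of one: the hot, consecutive cases come
 * first so each switch can compile to its own jump table. */
enum piece { wpawn, wknight, wbishop, wrook, wqueen, wking };
int piece_value(enum piece p)
{
    switch (p) {              /* frequent pieces */
    case wpawn:   return 1;
    case wknight: return 3;
    case wbishop: return 3;
    default:
        break;
    }
    switch (p) {              /* rarer pieces */
    case wrook:  return 5;
    case wqueen: return 9;
    default:
        return 0;             /* wking and anything else */
    }
}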
I'll leave the remainder of your questions to others.
Other answers have already suggested using a hash table; I'd recommend generating a perfect hash function with gperf (or a minimal perfect hash function; see the Wikipedia page for a few links).
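For the val1/val2 strings from the question, a gperf input file might look roughly like this (file name and workflow are illustrative):
%%
val1
val2
val3
%%
Running something like gperf keywords.gperf > keywords.c then generates a C lookup function (named in_word_set by default) built on a perfect hash; you call it once and dispatch on the matched keyword.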

Avoiding string copying in Lua

Say I have a C program which wants to call a very simple Lua function with two strings (let's say two comma separated lists, returning true if the lists intersect at all, false if not).
The obvious way to do this is to push them onto the stack with lua_pushstring, which works fine; however, from the docs it looks like lua_pushstring makes a copy of the string for Lua to work with.
That means that crossing over to the Lua function requires two string copies, which I could avoid by rewriting the Lua function in C. Is there any way to arrange things so that the existing C strings can be reused on the Lua side for the sake of performance (or would the copying cost pale into insignificance anyway)?
From my investigation so far (my first couple of hours looking seriously at Lua), light userdata seems like the sort of thing I want, but in the form of a string.
No, you cannot prevent Lua from making a copy of the string when you call lua_pushstring().
The reason: otherwise, the internal garbage collector would not be able to manage that memory (it has no way of knowing when your two input strings are freed).
Even if you used the light userdata facility (which would be overkill in this case), you would still have to call lua_pushstring() later, when the Lua program asks for the string.
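For completeness, a hedged sketch of making the call anyway (the Lua function name lists_intersect is hypothetical, taken from the question's scenario):
#include <lua.h>
/* Call the global Lua function lists_intersect(a, b) with two C strings
 * and return its boolean result, or -1 on error. */
int call_intersect(lua_State *L, const char *a, const char *b)
{
    lua_getglobal(L, "lists_intersect");
    lua_pushstring(L, a);    /* Lua copies (and interns) the bytes */
    lua_pushstring(L, b);
    if (lua_pcall(L, 2, 1, 0) != 0) {
        lua_pop(L, 1);       /* pop the error message */
        return -1;
    }
    int ok = lua_toboolean(L, -1);
    lua_pop(L, 1);           /* pop the result */
    return ok;
}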
Hmm... You could certainly write some C functions so that the work is done on the C side, but as the other answer points out, you might get stuck pushing the string or sections of it anyway.
Of note: Lua stores each distinct string only once. That is, if I push the string "The quick brown fox jumps over the lazy dog" into Lua and no other string object already contains it, Lua makes a new copy; if, on the other hand, it has already been inserted, you just get a pointer to that first string, so equality checks are very cheap. Importing strings can be a little expensive if done at high frequency, I would guess, but comparisons, again, are cheap.
I would try profiling what you're implementing and see if the performance is up to your expectations or not.
As the Lua Performance Tips document says (I recommend reading it if you are thinking about maximizing performance with Lua), the two programming maxims related to optimization are:
Rule #1: Don’t do it.
Rule #2: Don’t do it yet. (for experts only)
