Lexical Analyzer C program for identifying tokens

I wrote a C program for a lexical analyzer (a small piece of code) that should identify keywords, identifiers and constants. I am taking a string (C source code as a string) and then splitting it into words.
#include <stdio.h>
#include <conio.h>
#include <string.h>

char symTable[5][7] = { "int", "void", "float", "char", "string" };

int main() {
    int i, j, k = 0, flag = 0;
    char string[7];
    char str[] = "int main(){printf(\"Hello\");return 0;}";
    char *ptr;
    printf("Splitting string \"%s\" into tokens:\n", str);
    ptr = strtok(str, " (){};""");
    printf("\n\n");
    while (ptr != NULL) {
        printf("%s\n", ptr);
        for (i = k; i < 5; i++) {
            memset(&string[0], 0, sizeof(string));
            for (j = 0; j < 7; j++) {
                string[j] = symTable[i][j];
            }
            if (strcmp(ptr, string) == 0) {
                printf("Keyword\n\n");
                break;
            } else if (string[j] == 0 || string[j] == 1 || string[j] == 2 ||
                       string[j] == 3 || string[j] == 4 || string[j] == 5 ||
                       string[j] == 6 || string[j] == 7 || string[j] == 8 ||
                       string[j] == 9) {
                printf("Constant\n\n");
                break;
            } else {
                printf("Identifier\n\n");
                break;
            }
        }
        ptr = strtok(NULL, " (){};""");
        k++;
    }
    _getch();
    return 0;
}
With the above code, I am able to identify keywords and identifiers, but I couldn't obtain the result for numbers. I've tried using strspn() but to no avail. I even replaced 0,1,2,...,9 with '0','1',...,'9'.
Any help would be appreciated.

Here are some problems in your parser:
The test string[j] == 0 does not test if string[j] is the digit 0. The characters for digits are written '0' through '9'; their values are 48 to 57 in ASCII and UTF-8. Furthermore, you should be comparing *ptr instead of string[j] to test if you have a digit in the string indicating the start of a number.
Splitting the string with strtok() is not a good idea: it modifies the string and overwrites the first separator character with '\0': this will prevent matching operators such as (, )...
The string " (){};""" is exactly the same as " (){};". In order to escape " inside strings, you must use \".
To write a lexer for C, you should switch on the first character and check the following characters depending on the value of the first character:
if you have white space, skip it
if you have //, it is a line comment: skip all characters up to the newline.
if you have /*, it is a block comment: skip all characters until you get the pair */.
if you have a ', you have a character constant: parse the characters, handling escape sequences until you get a closing '.
if you have a ", you have astring literal. do the same as for character constants.
if you have a digit, consume all subsequent digits, you have an integer. Parsing the full number syntax requires much more code: leave that for later.
if you have a letter or an underscore: consume all subsequent letters, digits and underscores, then compare the word with the set of predefined keywords. You have either a keyword or an identifier.
otherwise, you have an operator: check if the next characters are part of a 2 or 3 character operator, such as == and >>=.
That's about it for a simple C parser. The full syntax requires more work, but you will get there one step at a time.
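To make that dispatch concrete, here is a minimal sketch of the approach; the token codes, the next_token() helper and the keyword list are invented for illustration (string, character, comment and multi-character operator handling are left out), so treat it as an assumption-laden outline rather than a finished lexer:
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical token codes, chosen above 255 so single characters
   can be returned as themselves. */
enum { TOK_EOF = 256, TOK_NUM, TOK_IDENT, TOK_KEYWORD };

static const char *keywords[] = { "int", "void", "float", "char", "return" };

/* Reads one token from the string at *pp and advances *pp past it. */
static int next_token(const char **pp, char *lexeme, size_t cap) {
    const char *p = *pp;
    size_t n = 0;
    while (isspace((unsigned char)*p)) p++;          /* skip white space */
    if (*p == '\0') { *pp = p; return TOK_EOF; }
    if (isdigit((unsigned char)*p)) {                /* integer constant */
        while (isdigit((unsigned char)*p) && n + 1 < cap) lexeme[n++] = *p++;
        lexeme[n] = '\0'; *pp = p; return TOK_NUM;
    }
    if (isalpha((unsigned char)*p) || *p == '_') {   /* keyword or identifier */
        while ((isalnum((unsigned char)*p) || *p == '_') && n + 1 < cap)
            lexeme[n++] = *p++;
        lexeme[n] = '\0'; *pp = p;
        for (size_t i = 0; i < sizeof keywords / sizeof *keywords; i++)
            if (strcmp(lexeme, keywords[i]) == 0) return TOK_KEYWORD;
        return TOK_IDENT;
    }
    /* everything else: return the character itself as an operator/punctuator */
    lexeme[0] = *p; lexeme[1] = '\0'; *pp = p + 1;
    return (unsigned char)lexeme[0];
}

int main(void) {
    const char *src = "int x = 42;";
    char lexeme[64];
    int tok;
    while ((tok = next_token(&src, lexeme, sizeof lexeme)) != TOK_EOF)
        printf("%d\t%s\n", tok, lexeme);
    return 0;
}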

When you're writing a lexer, always create a specific function that finds your tokens (the name yylex is used by the tool Lex, which is why I used that name). Writing the lexer in main is not a smart idea, especially if you want to do syntax and semantic analysis later on.
From your question it is not clear whether you just want to recognize which tokens are numbers, or whether you also want to fetch the number's value along with the token. I will assume the former.
This is example code, that finds whole numbers:
#include <stdio.h>
#include <ctype.h>

int yylex() {
    /* We read one char from standard input */
    int c = getchar();
    /* If we read a newline, we return the end-of-input token */
    if (c == '\n')
        return EOI;
    /* If we see a digit on the input, we cannot return the number token immediately:
       for example the input could be 123a, and that would be a lexical error */
    if (isdigit(c)) {
        while (isdigit(c = getchar()))
            ;
        ungetc(c, stdin);
        return NUM;
    }
    /* Additional code for keywords, identifiers, errors, etc. */
}
Tokens EOI, NUM, etc. should be defined on top. Later on, when you want to write syntax analysis, you use these tokens to figure out whether the code conforms to the language syntax or not. In lexical analysis, single-character tokens usually get no separate definitions at all; your lexer function would simply return ')' for example. Knowing that, named tokens should be defined with values above 255. For example:
#define EOI 256
#define NUM 257
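For context, here is a hypothetical driver showing how a later stage might consume these tokens; it is only a sketch built on the yylex() outline above, not part of the original answer:
#include <stdio.h>

#define EOI 256
#define NUM 257

int yylex(void); /* the function sketched above */

int main(void) {
    int tok;
    /* Read tokens until end of input; single characters come back as themselves,
       named tokens come back as values above 255. */
    while ((tok = yylex()) != EOI) {
        if (tok == NUM)
            printf("NUM\n");
        else
            printf("'%c' (%d)\n", tok, tok);
    }
    return 0;
}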
If you have any further questions, feel free to ask.

string[j]==1
This test is wrong(1) (on all C implementations I have heard of), since string[j] is some char encoded e.g. in ASCII (or UTF-8, or even the old EBCDIC used on IBM mainframes), and the encoding of the digit character 1 is not the number 1. On my Linux/x86-64 machine (and on most machines using ASCII or UTF-8, i.e. almost all of them), the character 1 is encoded as the byte 49 (that is, (char)49 == '1').
You probably want
string[j]=='1'
and you should consider using the standard isdigit (and related) function.
Be aware that UTF-8 is used practically everywhere, but it is a multi-byte encoding (of displayable characters).
Note (1): the string[j]==1 test is probably misplaced too! Perhaps you might test isdigit(*ptr) at some better place.
PS. Please get into the habit of compiling with all warnings and debug info (e.g. with gcc -Wall -Wextra -g if using GCC...)
and use the debugger (e.g. gdb). You would have found your bug in less time than it took you to get an answer here.
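Tying these suggestions back to the original strtok() loop, here is a minimal, runnable sketch of the classification step; classify() is an invented helper, and operators that are not in the strtok() delimiter set still come out as "Identifier", just as in the original code:
#include <ctype.h>
#include <stdio.h>
#include <string.h>

char symTable[5][7] = { "int", "void", "float", "char", "string" };

/* Classify one word produced by strtok() from the question's loop. */
const char *classify(const char *word) {
    if (isdigit((unsigned char)word[0]))     /* starts with a digit: constant */
        return "Constant";
    for (int i = 0; i < 5; i++)
        if (strcmp(word, symTable[i]) == 0)  /* exact match against the table */
            return "Keyword";
    return "Identifier";
}

int main(void) {
    char str[] = "int main(){return 0;}";
    for (char *ptr = strtok(str, " (){};"); ptr != NULL; ptr = strtok(NULL, " (){};"))
        printf("%s -> %s\n", ptr, classify(ptr));
    return 0;
}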

Related

Specific case of c scanf()

Input : [1,3,2,4]
I want to make arr[4] = {1, 3, 2, 4} from this input using scanf(). How can I do this in C language?
It is possible to parse input such as you describe with scanf, but each scanf call will parse up to a maximum number of fields determined by the given format. Thus, to parse an arbitrary number of fields requires an arbitrary number of scanf calls.
In comments, you wrote that
I want to find a method to ignore '[', ']', ',' and only accept integer units.
Taking that as the focus of the question, and therefore ignoring the issues of how you allocate space for the integers to be read when you do not know in advance how many there will be, and assuming that you may not use input functions other than scanf, it seems like you are looking for something along these lines:
int value;
char delim[2] = { 0 };

// Scan and confirm the opening '['
value = 0;
if (scanf("[%n", &value) == EOF) {
    // handle end of file or I/O error ...
} else if (value == 0) {
    // handle input not starting with a '[' ...
    // Note: value == zero because we set it so, and the %n directive went unprocessed
} else {
    // if value != 0 then it's because a '[' was scanned and the %n was processed
    assert(value == 1);
}

// scan the list items
do {
    // One integer plus trailing delimiter, either ',' or ']'
    switch (scanf("%d%1[],]", &value, delim)) {
    case EOF:
        // handle end of file or I/O error (before an integer is read) ...
        break;
    case 0:
        // handle input not starting with an integer ...
        // The input may be malformed, but this point will also be reached for an empty list
        break;
    case 1:
        // handle malformed input starting with an integer (which has been scanned) ...
        break;
    case 2:
        // handle valid (to this point) input. The scanned value needs to be stored somewhere ...
        break;
    default:
        // cannot happen
        assert(0);
    }
    // *delim contains the trailing delimiter that was scanned
} while (*delim == ',');

// assuming normal termination of the loop:
assert(*delim == ']');
Points to note:
it is essential to pay attention to the return value of scanf. Failure to do so and to respond appropriately will cause all manner of problems when unexpected input is presented.
the above will accept slightly more general input than you describe, with whitespace (including line terminators) permitted before each integer.
The directive %1[],] attempts to scan a 1-character string whose element is either ] or ,. This is a bit arcane. Also, because the input is scanned as a string, you must be sure to provide space for a string terminator to be written, too.
it would be easier to write a character-by-character parser for your specific format that does not rely on scanf. You could also use scanf to read one character at a time to feed such a parser, but that seems to violate the spirit of the exercise.
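For completeness, here is one hypothetical way to turn the fragments above into a small program; the fixed array size and the bare error messages are placeholders, and (as noted above) an empty list [] is reported as malformed by this particular driver:
#include <assert.h>
#include <stdio.h>

int main(void) {
    int arr[64];
    size_t count = 0;
    int value = 0;
    char delim[2] = { 0 };

    /* opening '[' */
    if (scanf("[%n", &value) == EOF || value == 0) {
        fprintf(stderr, "input does not start with '['\n");
        return 1;
    }

    /* list items: one integer plus a trailing ',' or ']' */
    do {
        if (scanf("%d%1[],]", &value, delim) != 2) {
            fprintf(stderr, "malformed input\n");
            return 1;
        }
        if (count < sizeof arr / sizeof *arr)
            arr[count++] = value;
    } while (*delim == ',');
    assert(*delim == ']');

    for (size_t i = 0; i < count; i++)
        printf("%d ", arr[i]);
    printf("\n");
    return 0;
}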
While I think that John Bollinger's answer is pretty good and complete (even without considering the wonderful %1[],]), I would go for a more compact and tolerant version like this:
#include <stdio.h>

size_t arr_input(int *arr, size_t max_size)
{
    size_t n;
    for (n = 0; n < max_size; ++n) {
        char c;
        int res = scanf("%c%d", &c, arr + n);
        if (res != 2
                || (n == 0 && c != '[')
                || (n > 0 && c != ',')
                || (n > 0 && c == ']')) {
            break;
        }
    }
    return n;
}

int main(void)
{
    char *test_strings[] = { "[1,2,3,4]", "[42]", "[1,1,2,3,5,8]", "[]",
                             "[10,20,30,40,50,60,70,80,90,100]", "[1,2,3]4" };
    size_t test_strings_n = sizeof test_strings / sizeof *test_strings;
    char filename[L_tmpnam];
    tmpnam(filename);
    for (size_t i = 0; i < test_strings_n; ++i) {
        freopen(filename, "w+", stdin);
        fputs(test_strings[i], stdin);
        rewind(stdin);
        int arr[9];
        size_t num_elem = arr_input(arr, 9);
        printf("%zu: %s -> ", i, test_strings[i]);
        for (size_t j = 0; j < num_elem; ++j) {
            printf("%d ", arr[j]);
        }
        printf("\n");
        fclose(stdin);
    }
    remove(filename);
    return 0;
}
The idea is that you allocate space for the maximum number of integers you accept, then ask the arr_input() function to fill it up to max_size elements.
The check after scanf() tries to cope with incorrect input, but is not very complete. If you trust your input to be correct (don't) you can even make it shorter, by dropping the three || cases.
The most complex part was writing the test driver with temp files, strings, reopening and such. Here I'd have loved to have std::istream so I could just drop in a std::stringstream. The fact that the FILE interface doesn't support strings really bugs me.
int arr[4];
for (int i = 0; i < 4; i++)
    scanf("%d", &arr[i]);
Are you asking for this? I was a little bit confused by your question; if this doesn't solve your query, then don't hesitate to ask again...
Use scanf to read a string input from the user, then parse that input into an integer array.
To parse, you can use the string function "find" to locate the ',' and '[' ']' characters, and then use "atoi" to convert each piece into an integer to fill the destination array.
Edit: find is a C++ function.
The C function is strchr.
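A hedged sketch of that idea in C, using strchr to recognize the delimiters and strtol in place of atoi (the buffer sizes are arbitrary placeholders):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char line[128];
    int arr[64];
    size_t count = 0;

    if (!fgets(line, sizeof line, stdin))
        return 1;

    for (char *p = line; *p && count < sizeof arr / sizeof *arr; ) {
        if (strchr("[], \n", *p)) {   /* skip the delimiters */
            p++;
            continue;
        }
        char *end;
        long v = strtol(p, &end, 10); /* convert the digit run, like atoi but with an end pointer */
        if (end == p)                 /* not a number: stop */
            break;
        arr[count++] = (int)v;
        p = end;
    }

    for (size_t i = 0; i < count; i++)
        printf("%d ", arr[i]);
    printf("\n");
    return 0;
}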

Stdin + Dictionary Text Replacement Tool -- Debugging

I'm working on a project in which I have two main files. Essentially, the program reads in a text file defining a dictionary with key-value mappings. Each key has a unique value and the file is formatted like this where each key-value pair is on its own line:
ipsum i%##!
fubar fubar
IpSum XXXXX24
Ipsum YYYYY211
Then the program reads in input from stdin, and if any of the "words" match the keys in the dictionary file, they get replaced with the value. There is a slight thing about upper and lower cases -- this is the order of "match priority"
The exact word is in the replacement set
The word with all but the first character converted to lower case is in the replacement set
The word converted completely to lower case is in the replacement set
Meaning if the exact word is in the dictionary, it gets replaced, but if not the next possibility (2) is checked and so on...
My program passes the basic cases we were provided but then the terminal shows
that the output vs reference binary files differ.
I went into both files (not C files, but binary files), and one was super long with tons of numbers and the other just had a line of random characters, so that didn't really help. I also reviewed my code and made some small tests, but it seems okay? A friend recommended I make sure I'm accounting for the null terminator in processInput(), and I already was (or at least I think so; correct me if I'm wrong). I also store the result of getchar() in an int to properly check for EOF, and allocated extra space for the char array. I also tried vimdiff and got more confused. I would love some help debugging this, please! I've been at it all day and I'm very confused.
There are multiple issues in the processInput() function:
the loop should not stop when the byte read is 0, you should process the full input with:
while ((ch = getchar()) != EOF)
the test for EOF should actually be done differently so the last word of the file gets a chance to be handled if it occurs exactly at the end of the file.
the cast in isalnum((char)ch) is incorrect: you should pass ch directly to isalnum. Casting as char is actually counterproductive because it will turn byte values beyond CHAR_MAX to negative values for which isalnum() has undefined behavior.
the test if(ind >= cap) is too loose: if word contains cap characters, setting the null terminator at word[ind] will write beyond the end of the array. Change the test to if (cap - ind < 2) to allow for a byte and a null terminator at all times.
you should check that there is at least one character in the word to avoid calling checkData() with an empty string.
char key[ind + 1]; is useless: you can just pass word to checkData().
checkData(key, ind) is incorrect: you should pass the size of the buffer for the case conversions, which is at least ind + 1 to allow for the null terminator.
the cast in putchar((char)ch); is useless and confusing.
There are some small issues in the rest of the code, but none that should cause a problem.
Start by testing your tokeniser with:
$ ./a.out <badhash2.c >zooi
$ diff badhash2.c zooi
$
Does it work for binary files, too?:
$ ./a.out <./a.out > zooibin
$ diff ./a.out zooibin
$
Yes, it does!
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>

void processInput(void);

int main(int argc, char **argv) {
    processInput();
    return 0;
}

void processInput() {
    int ch;
    char *word;
    int len = 0;
    int cap = 60;
    word = malloc(cap);

    while (1) {
        ch = getchar();                 // (1)
        if (ch != EOF && isalnum(ch)) { // (2)
            if (len + 1 >= cap) {       // (3)
                cap += cap / 2;
                word = realloc(word, cap);
            }
            word[len++] = ch;
        } else {
            if (len) {                  // (4)
#if 0
                char key[len + 1];
                memcpy(key, word, len); key[len] = 0;
                checkData(key, len);
#else
                word[len] = 0;
                fputs(word, stdout);
#endif
                len = 0;
            }
            if (ch == EOF) break;       // (5)
            putchar(ch);
        }
    }
    free(word);
}
I only repaired your tokeniser, leaving out the hash table and the search & replace stuff. It is now supposed to generate a verbatim copy of the input. (which is silly, but great for testing)
(1) If you want to allow binary input, you cannot use while((ch = getchar()) ...): a NUL in the input would cause the loop to end. You must postpone testing for EOF, because there could still be a final word in your buffer: ... && ch != EOF.
(2) Treat EOF just like a space here: it could be the end of a word.
(3) You must reserve space for the NUL ('\0'), too.
(4) If len == 0 there would be no word, so there is no need to look it up.
(5) We treated EOF just like a space, but we don't want to write it to the output. Time to break out of the loop.

Using sscanf to validate a string input

I have just started learning C after coding for some while in Java and Python.
I was wondering how I could "validate" a string input (if it stands in a certain criteria) and I stumbled upon the sscanf() function.
I had the impression that it acts kind of similarly to regular expressions, however I didn't quite manage to tell how I can create rather complex queries with it.
For example, lets say I have the following string:
char str[] = {"Santa-monica 123"};
I want to use sscanf() to check if the string has only letters, numbers and dashes in it.
Could someone please elaborate?
The fact that sscanf allows something that looks a bit like a character class by no means implies that it is anything at all like a regular expression library. In fact, Posix doesn't even require the scanf functions to accept character ranges inside character classes, although I suspect that it will work fine on any implementation you will run into.
But the scanning problem you have does not require regular expressions, either. All you need is a repeated character class match, and sscanf can certainly do that:
#include <stdbool.h>
bool check_string(const char* s) {
    int n = 0;
    sscanf(s, "%*[-a-zA-Z0-9]%n", &n);
    return s[n] == 0;
}
The idea behind that scanf format is that the first conversion will match and discard the longest initial sequence consisting of valid characters. (It might fail if the first character is invalid. Thanks to #chux for pointing that out.) If it succeeds, it will then set n to the current scan point, which is the offset of the next character. If the next character is a NUL, then all the characters were good. (This version returns OK for the empty string, since it contains no illegal characters. If you want the empty string to fail, change the return condition to return n && s[n] == 0;)
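As a quick, hypothetical usage example of check_string() (the test strings are made up; per the note above, the one containing a space comes back invalid and the empty string comes back valid):
#include <stdbool.h>
#include <stdio.h>

bool check_string(const char *s); /* the sscanf version above */

int main(void) {
    const char *tests[] = { "Santa-monica-123", "Santa-monica 123", "abc_def", "" };
    for (size_t i = 0; i < sizeof tests / sizeof *tests; i++)
        printf("\"%s\" -> %s\n", tests[i],
               check_string(tests[i]) ? "valid" : "invalid");
    return 0;
}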
You could also do this with the standard regex library (or any more sophisticated library, if you prefer, but the Posix library is usually available without additional work). This requires a little bit more code in order to compile the regular expression. For efficiency, the following attempts to compile the regex only once, but for simplicity I left out the synchronization to avoid data races during initialization, so don't use this in a multithreaded application.
#include <regex.h>
#include <stdbool.h>
bool check_string(const char* s) {
    static regex_t* re_ptr = NULL;
    static regex_t re;
    if (!re_ptr) regcomp((re_ptr = &re), "^[[:alnum:]-]*$", REG_EXTENDED);
    return regexec(re_ptr, s, 0, NULL, 0) == 0;
}
I want to use sscanf() to check if the string has only letters, numbers and dashes in it.
Variation of #rici good answer.
Create a scanset for letters, numbers and dashes.
"%*[-0-9A-Za-z]"
// *       The * indicates to scan, but not save the result.
// -       Dash (or minus sign), best to list first.
// 0-9     Digits
// A-Za-z  Letters, both cases
Use "%n" to detect how far the scan went.
Now we can determine whether:
Scanning stopped due to a null character (the whole string is valid)
Scanning stopped due to an invalid character
int n = 0;
sscanf(str, "%*[-0-9A-Za-z]%n", &n);
bool success = (str[n] == '\0');
sscanf does not have this functionality; the argument you are referring to is a format specifier and is not used for validation. See here: https://www.tutorialspoint.com/c_standard_library/c_function_sscanf.htm
As also mentioned, sscanf is meant for a different job. You can loop over the string using isalpha and isdigit to check whether each char in the string is a digit, an alphabetic character, or neither.
char str[] = {"Santa-monica 123"};
for (int i = 0; str[i] != '\0'; i++)
{
    if ((!isalpha(str[i])) && (!isdigit(str[i])) && (str[i] != '-'))
        printf("wrong character %c", str[i]); // this will be printed for spaces too
}
I want to ... check if the string has only letters, numbers and dashes in it.
In C that's traditionally done with isalnum(3) and friends.
bool valid( const char str[] ) {
    for( const char *p = str; p < str + strlen(str); p++ ) {
        if( !(isalnum((unsigned char)*p) || *p == '-') )
            return false;
    }
    return true;
}
You can also use your friendly neighborhood regex(3), but you'll find that requires a surprising amount of code for a simple scan.
After retrieving the value with sscanf(), you may use a regular expression to validate it.
Please see Regular Expressions in C.

How to change multicharacter signs by other ones in C?

I've got a UTF-8 text file containing several signs that I'd like to replace with other ones (only those between |( and |) ), but the problem is that some of these signs are not single characters but multi-character signs. (By this I mean they can't be put between single quotes like '∞' but only like this "∞", so char *?)
Here is my textfile :
Text : |(abc∞∪v=|)
For example :
∞ should be changed to ¤c
∪ to ¸!
= changed to "
So, as some signs (∞ and ∪) are multi-character, I decided to use fscanf to get all the text word by word. The problem with this method is that I have to put a space between each character... My file would have to look like this:
Text : |( a b c ∞ ∪ v = |)
fgetc can't be used because characters like ∞ can't be read as one single character. If I use it, I won't be able to strcmp a char with each sign (char *); I tried to convert my char to char* but strcmp != 0.
Here is my code in C to help you understanding my problem :
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    char *carac[] = {"∞", "=", "∪"}; // array with our signs
    FILE *flot, *flot3;
    flot = fopen("fichierdeTest2.txt", "r"); // input text file
    flot3 = fopen("resultat.txt", "w");      // output file
    int i = 0, j = 0;
    char a[1024]; // array that will contain each read word
    while (!feof(flot)) {
        fscanf(flot, "%s", &a[i]);
        if (strstr(&a[i], "|(") != NULL) { // if the word read contains |( then j=1
            j = 1;
            fprintf(flot3, "|(");
        }
        if (strcmp(&a[i], "|)") == 0)
            j = 0;
        if (j == 1) { // it means we are between |( and |) so the conversion can begin
            if (strcmp(carac[0], &a[i]) == 0) { fprintf(flot3, "¤c"); }
            else if (strcmp(carac[1], &a[i]) == 0) { fprintf(flot3, "\""); }
            else if (strcmp(carac[2], &a[i]) == 0) { fprintf(flot3, " ¸!"); }
            else fprintf(flot3, "%s", &a[i]); // a letter, number or sign that doesn't need to be converted
        } else { // when we are not between |( and |) just copy the word to the output file with a space after it
            fprintf(flot3, "%s", &a[i]);
            fprintf(flot3, " ");
        }
        i++;
    }
}
Thanks a lot for the future help !
EDIT: Every sign will be changed correctly if I put a space between each of them, but without spaces it won't work; that's what I'm trying to solve.
First of all, get the terminology right. Proper terminology is a bit confusing, but at least other people will understand what you are talking about.
In C, char is the same as byte. However, a character is something abstract like ∞ or ¤ or c. One character may contain a few bytes (that is a few chars). Such characters are called multi-byte ones.
Converting a character to a sequence of bytes (encoding) is not trivial. Different systems do it differently; some use UTF-8, while others may use UTF-16 big-endian, UTF-16 little endian, a 8-bit codepage or any other encoding.
When your C program has something inside quotes, like "∞" - it's a C-string, that is, several bytes terminated by a zero byte. When your code uses strcmp to compare strings, it compares each byte of both strings, to make sure they are equal. So, if your source code and your input file use different encodings, the strings (byte sequences) won't match, even though you will see the same character when examining them!
So, to rule out any encoding mismatches, you might want to use a sequence of bytes instead of a character in your source code. For example, if you know that your input file uses the UTF-8 encoding:
char *carac[]={
"\xe2\x88\x9e", // ∞
"=",
"\xe2\x88\xaa"}; // ∪
Alternatively, make sure the encodings (of your source code and your program's input file) are the same.
Another, less subtle, problem: when comparing strings, you actually have a big string and a small string, and you want to check whether the big string starts with the small string. Here strcmp does the wrong thing! You must use strncmp here instead:
if (strncmp(carac[0], &a[i], strlen(carac[0])) == 0)
{
    fprintf(flot3, "\xC2\xA4" "c"); // ¤c
}
Another problem (actually, a major bug): the fscanf function reads a word (text delimited by spaces) from the input file. If you only examine the first byte in this word, the other bytes will not be processed. To fix, make a loop over all bytes:
fscanf(flot, "%s", a);
for (i = 0; a[i] != '\0'; )
{
    if (strncmp(&a[i], "|(", 2) == 0) // start pattern (strncmp returns 0 on a match)
    {
        now_replacing = 1;
        i += 2;
        continue;
    }
    if (now_replacing)
    {
        if (strncmp(&a[i], whatever, strlen(whatever)) == 0)
        {
            fprintf(...);
            i += strlen(whatever);
        }
        else
        {
            fputc(a[i], output); // no pattern here: copy the byte and move on
            i += 1;
        }
    }
    else
    {
        fputc(a[i], output);
        i += 1; // processed just one char
    }
}
You're on the right track, but you need to look at characters differently than strings.
strcmp(carac[0], &a[i])
(Pretending i = 2) As you know this compares the string "∞" with &a[2]. But you forget that &a[2] is the address of the second character of the string, and strcmp works by scanning the entire string until it hits a null terminator. So "∞" actually ends up getting compared with "abc∞∪v=|)" because a is only null terminated at the very end.
What you should do is not use strings, but expand each character (8 bits) to a short (16 bits). And then you can compare them with your UTF-16 characters
if( 8734 == *((short *)&a[i])) { /* character is infinity */ }
The reason for that 8734 is that it's the UTF-16 value of infinity.
VERY IMPORTANT NOTE:
Depending if your machine is big-endian or little-endian matters for this case. If 8734 (0x221E) does not work, give 7714 (0x1E22) a try.
Edit Something else I overlooked is you're scanning the entire string at once. "%s: String of characters. This will read subsequent characters until a whitespace is found (whitespace characters are considered to be blank, newline and tab)." (source)
// feof = false.
fscanf(flot,"%s",&a[i]);
// feof = true.
That means you never actually iterate. You need to go back and rethink your scanning procedure.

Fast C comparison

As part of a protocol I'm receiving C string of the following format:
WORD * WORD
Where both WORDs are the same given string.
And, * - is any string of printable characters, NOT including spaces!
So the following are all legal:
WORD asjdfnkn WORD
WORD 234kjk2nd32jk WORD
And the following are illegal:
WORD akldmWORD
WORD asdm zz WORD
NOTWORD admkas WORD
NOTWORD admkas NOTWORD
Where (1) is missing a trailing space; (2) has 3 or more spaces; (3)/(4) do not open/end with the correct string (WORD).
Of-course this could be implemented pretty straight-forward, however I'm not sure what I'm doing is the most efficient.
Note: WORD is pre-set for a whole run, however could change from run to run.
Currently I'm strncmping each string against "WORD ".
If that matches, I manually (char by char) run over the string to check for the second space character.
If found, I then strcmp the rest (all the way) against "WORD".
I would love to hear your solution, with an emphasis on efficiency, as I'll be running over millions of these in real time.
I'd say, have a look at the algorithms in the Handbook of Exact String-Matching Algorithms, compare the complexities, choose the one that you like best, and implement it.
Or you can use some ready-made implementations.
You have some really classical algorithms for searching strings inside another string here:
KMP(Knuth-Morris-Pratt)
Rabin-Karp
Boyer-Moore
Hope this helps :)
Have you profiled?
There's not much gain to be had here, since you're doing basic string comparisons. If you want to go for the last few percent of performance, I'd change out the str... functions for mem... functions.
char *bufp, *bufe;  // pointer to buffer, one past end of buffer

if (bufe - bufp < wordlen * 2 + 2)
    error();
if (memcmp(bufp, word, wordlen) || bufp[wordlen] != ' ')
    error();
bufp += wordlen + 1;
char *datap = bufp;
char *datae = memchr(bufp, ' ', bufe - bufp);
if (!datae || bufe - datae < wordlen + 1)
    error();
if (memcmp(datae + 1, word, wordlen))
    error();
// Your data is in the range [datap, datae).
The performance gains are likely less than spectacular. You have to examine each character in the buffer since each character could be a space, and any character in the delimiters could be wrong. Changing a loop to memchr is slick, but modern compilers know how to do that for you. Changing a strncmp or strcmp to memcmp is also probably going to be negligible.
There is probably a tradeoff to be made between the shortest code and the fastest implementation. Choices are:
The regular expression ^WORD \S+ WORD$ (requires a regex engine)
strchr on "WORD " and a strrchr on " WORD" with a lot of messy checks (not really recommended)
Walking the whole string character by character, keeping track of the state you are in (scanning first word, scanning first space, scanning middle, scanning last space, scanning last word, expecting end of string).
Option 1 requires the least code but backtracks near the end, and Option 2 has no redeeming qualities. I think you can do option 3 elegantly. Use a state variable and it will look okay. Remember to manually enter the last two states based on the length of your word and the length of your overall string and this will avoid the backtracking that a regex will most likely have.
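For illustration, here is one hedged way to realize option 3; matches() is an invented name, the inputs are assumed NUL-terminated, and the last two "states" are entered by position exactly as described, so there is no backtracking. It should accept the two legal examples above and reject the four illegal ones:
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Single pass, no backtracking: the positions of the two delimiting
   spaces are fixed by the lengths of s and word. */
bool matches(const char *s, const char *word) {
    size_t wl = strlen(word);
    size_t len = strlen(s);

    if (len < 2 * wl + 3)                 /* WORD + ' ' + at least 1 char + ' ' + WORD */
        return false;
    if (strncmp(s, word, wl) != 0 || s[wl] != ' ')
        return false;                     /* leading word and first space */
    if (s[len - wl - 1] != ' ' || strcmp(s + len - wl, word) != 0)
        return false;                     /* second space and trailing word */
    for (size_t i = wl + 1; i < len - wl - 1; i++)
        if (!isprint((unsigned char)s[i]) || s[i] == ' ')
            return false;                 /* middle: printable, no spaces */
    return true;
}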
Do you know how long the string that is to be checked is? If not, you are somewhat limited in what you can do. If you do know how long the string is, you can speed things up a bit. You have not specified for sure that the '*' part has to be at least one character. You've also not stipulated whether tabs are allowed, or newlines, or ... is it only alphanumerics (as in your examples), or are punctuation and other characters allowed? Control characters?
You know how long WORD is, and can pre-construct both the start and end markers. The function error() reports an error (however you need it to be reported) and returns false. The test function might be bool string_is_ok(const char *string, int actstrlen);, returning true on success and false when there is a problem:
// Preset values characterizing the search
enum {
    wordlen   = 4,
    marklen   = wordlen + 1,
    minstrlen = 2 * marklen + 1      // Two blanks and one other character.
};
static char bword[] = "WORD ";       // Start marker
static char eword[] = " WORD";       // End marker
static char verboten[] = " ";        // Forbidden characters

bool string_is_ok(const char *string, int actstrlen)
{
    if (actstrlen < minstrlen)
        return error("string too short");
    if (strncmp(string, bword, marklen) != 0)
        return error("string does not start with WORD");
    if (strcmp(string + actstrlen - marklen, eword) != 0)
        return error("string does not finish with WORD");
    if (strcspn(string + marklen, verboten) != actstrlen - 2 * marklen)
        return error("string contains verboten characters");
    return true;
}
You probably can't reduce the tests by much if you want your guarantees. The part that would change most depending on the restrictions in the alphabet is the strcspn() line. That is relatively fast for a small list of forbidden characters; it will likely be slower as the number of characters forbidden is increased. If you only allow alphanumerics, you have 62 OK and 193 not OK characters, unless you count some of the high-bit set characters as alphabetic too. That part will probably be slow. You might do better with a custom function that takes a start position and length and reports whether all characters are OK. This could be along the lines of:
#include <stdbool.h>

static bool ok_chars[256] = { false };

static void init_ok_chars(void)
{
    const char *ok = "abcdefghijklmnopqrstuvwxyz...0123456789";
    int c;
    while ((c = (unsigned char)*ok++) != 0)
        ok_chars[c] = true;
}

static bool all_chars_ok(const char *check, int numchars)
{
    for (int i = 0; i < numchars; i++)
        if (!ok_chars[(unsigned char)check[i]])
            return false;
    return true;
}
You can then use:
return all_chars_ok(string + marklen, actstrlen - 2 * marklen);
in place of the call to strcspn().
If your "stuffing" should contain only '0'-'9', 'A'-'Z' and 'a'-'z' and are in some encoding based on ASCII (like most Unicode based encodings), then you can skip two comparisons in one of your loops, since only one bit differ between capital and minor characters.
Instead of
ch>='0' && ch<='9' && ch>='A' && ch<='Z' && ch>='a' && ch<='a'
you get
ch2 = ch & ~('a' ^ 'A')
ch>='0' && ch<='9' && ch2>='A' && ch2<='Z'
But you better look at the assembler code your compiler generate and do some benchmarking, depending on computer architecture and compiler, this trick could give slower code.
If branching is expensive compared to comparisons on your computer, you can also replace the && with &. But most modern compilers know this trick in most situations.
If, on the other hand, you test for any printable glyph from some large character encoding, then it is most likely less expensive to test for white-space glyphs, rather then printable glyph.
Also, compile specifically for the computer that the code will run on and don't forget turn of any generation of debugging-code.
Added:
Don't make subroutine calls within your scan loops, unless it is worth it.
Whatever trick you use to speed up your loops, it will be diminished if you have to make a subroutine call within one of them. It is fine to use built-in functions that your compiler inlines into your code, but if you use something like an external regex library and your compiler is unable to inline those functions (gcc can do that, sometimes, if you ask it to), then making that subroutine call will shuffle a lot of memory around, in the worst case between different types of memory (registers, CPU buffers, RAM, hard disk etc.), and it may mess up CPU predictions and pipelines. Unless your text snippets are very long, so that you spend much time parsing each of them, and the subroutine is effective enough to compensate for the cost of the call, don't do that. Some functions for parsing use callbacks; that might be more effective than making a lot of subroutine calls from your loops (since the function can scan several pattern matches in one sweep and bunch several callbacks together outside the critical loop), but that depends on how someone else has written that function, and basically it is the same thing as you making the call.
WORD is 4 characters, so with a uint32_t you could do a quick comparison. You will need a different constant depending on system endianness. The rest seems to be fine.
Since WORD can change, you have to precalculate the uint32_t, uint64_t, ... you need, depending on the length of the WORD.
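A hedged sketch of that idea (my own illustration, not from the answer): build the 4-byte key once per run with memcpy, then compare candidates the same way. Because the key and the candidates use the same in-memory byte order, this variant sidesteps the endianness concern, and memcpy avoids alignment problems:
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Build the 4-byte key once per run from the current WORD. */
static uint32_t word_key(const char *word4) {   /* word4 must be exactly 4 bytes */
    uint32_t k;
    memcpy(&k, word4, sizeof k);
    return k;
}

/* One 32-bit compare instead of a char-by-char loop; the caller must
   guarantee that at least 4 bytes are readable at s. */
static bool starts_with_word(const char *s, uint32_t key) {
    uint32_t v;
    memcpy(&v, s, sizeof v);
    return v == key;
}
For WORD lengths other than 4 you would fall back to memcmp (or a uint64_t for 8-byte words), as the answer suggests.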
Not sure from the description, but if you trust the source you could just chomp the first n+1 and last n+1 characters.
bool check_legal(
    const char *start, const char *end,
    const char *delim_start, const char *delim_end,
    const char **content_start, const char **content_end
) {
    const size_t delim_len = delim_end - delim_start;
    const char *p = start;
    if (start + delim_len + 1 + 0 + 1 + delim_len > end)  // too short to hold delim, two spaces and content
        return false;
    if (memcmp(p, delim_start, delim_len) != 0)
        return false;
    p += delim_len;
    if (*p != ' ')
        return false;
    p++;
    *content_start = p;
    while (p < end - 1 - delim_len && *p != ' ')
        p++;
    if (*p != ' ' || p + 1 + delim_len != end)
        return false;
    *content_end = p;
    p++;
    if (memcmp(p, delim_start, delim_len) != 0)
        return false;
    return true;
}
And here is how to use it:
const char *line = "who is who";
const char *delim = "who";
const char *start, *end;
if (check_legal(line, line + strlen(line), delim, delim + strlen(delim), &start, &end)) {
    printf("this %.*s nice\n", (int)(end - start), start);
}
(It's all untested.)
Using the STL, find the number of spaces; if they are not two, obviously the string is wrong. Using find (from <algorithm>) you can get the positions of the two spaces and the middle word. Check for WORD at the beginning and the end, and you are done.
This should return the true/false condition in O(n) time
int sameWord(char *str)
{
    char *word1, *word2;
    word1 = word2 = str;
    // word1, word2 point to the beginning of the line, where the first word is found
    while (*word2 && *word2 != ' ') ++word2; // skip to first space
    if (*word2 == ' ') ++word2;              // skip space
    // word1 points to the first word, word2 points to the middle filler
    while (*word2 && *word2 != ' ') ++word2; // skip to second space
    if (*word2 == ' ') ++word2;              // skip space
    // word1 points to the first word, word2 points to the second word
    // Now just compare that word1 and word2 point to identical strings.
    while (*word1 != ' ' && *word2)
        if (*word1++ != *word2++) return 0;  // false
    return *word1 == ' ' && (*word2 == 0 || *word2 == ' ');
}
