Boyer Moore Algorithm Implementation? - c

Is there a working example of the Boyer-Moore string search algorithm in C?
I've looked at a few sites, but they seem pretty buggy, including wikipedia.
Thanks.

The best site for substring search algorithms:
http://igm.univ-mlv.fr/~lecroq/string/

There are a couple of implementations of Boyer-Moore-Horspool (including Sunday's variant) on Bob Stout's Snippets site. Ray Gardner's implementation in BMHSRCH.C is bug-free as far as I know [1], and definitely the fastest I've ever seen or heard of. It's not, however, the easiest to understand -- he uses some fairly tricky code to keep the inner loop as simple as possible. I may be biased, but I think my version [2] in PBMSRCH.C is a bit easier to understand (though definitely a bit slower).
[1] Within its limits -- it was originally written for MS-DOS, and could use a rewrite for environments that provide more memory.
[2] This somehow got labeled as "Pratt-Boyer-Moore", but is actually Sunday's variant of Boyer-Moore-Horspool (though I wasn't aware of it at the time and didn't publish it, I believe I actually invented it about a year before Sunday did).
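For readers who have not seen the Horspool simplification, here is a minimal sketch of plain Boyer-Moore-Horspool. It is not Ray Gardner's code and not the PBMSRCH.C version; the name bmh_search and its signature are made up for illustration. It shifts by the table entry of the haystack character under the last pattern position, whereas Sunday's variant looks at the character just past the window.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool search (illustrative sketch, hypothetical name).
 * Returns a pointer to the first occurrence of needle in haystack,
 * or NULL if there is none. */
const char *bmh_search(const char *haystack, size_t hlen,
                       const char *needle, size_t nlen)
{
    size_t shift[UCHAR_MAX + 1];
    size_t i, pos;

    if (nlen == 0) return haystack;
    if (nlen > hlen) return NULL;

    /* A character that does not occur in the needle lets us skip
     * the whole needle length. */
    for (i = 0; i <= UCHAR_MAX; i++)
        shift[i] = nlen;

    /* For every needle character except the last, record its
     * distance to the end of the needle. */
    for (i = 0; i + 1 < nlen; i++)
        shift[(unsigned char)needle[i]] = nlen - 1 - i;

    pos = 0;
    while (pos <= hlen - nlen) {
        if (memcmp(haystack + pos, needle, nlen) == 0)
            return haystack + pos;
        /* Shift by the entry of the haystack character that is
         * aligned with the last needle position. */
        pos += shift[(unsigned char)haystack[pos + nlen - 1]];
    }
    return NULL;
}
```

Indexing the table through unsigned char matters on platforms where plain char is signed; otherwise bytes above 0x7F would produce negative indices.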

Here is a C90 implementation that I have stressed with a lot of strange test cases:
#include <stddef.h> /* size_t */
#include <string.h> /* strlen, strchr, strncmp */
#ifndef MAX
#define MAX(a,b) (((a) > (b)) ? (a) : (b))
#endif
void fillBadCharIndexTable (
/*----------------------------------------------------------------
function:
the table fits for 8 bit character only (including utf-8)
parameters: */
size_t aBadCharIndexTable [],
char const * const pPattern,
size_t const patternLength)
/*----------------------------------------------------------------*/
{
size_t i;
size_t remainingPatternLength = patternLength - 1;
for (i = 0; i < 256; ++i) {
aBadCharIndexTable [i] = patternLength;
}
for (i = 0; i < patternLength; ++i) {
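/* index via unsigned char: plain char may be signed, which would give negative indices */
aBadCharIndexTable [(unsigned char) pPattern [i]] = remainingPatternLength--;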
}
}
void fillGoodSuffixRuleTable (
/*----------------------------------------------------------------
function:
the table fits for patterns of length < 256; for longer patterns ... (1 of)
- increase the static size
- use variable length arrays and >= C99 compilers
- allocate (and finally release) heap according to demand
parameters: */
size_t aGoodSuffixIndexTable [],
char const * const pPattern,
size_t const patternLength)
/*----------------------------------------------------------------*/
{
size_t const highestPatternIndex = patternLength - 1;
size_t prefixLength = 1;
/* complementary prefix length, i.e. difference from highest possible pattern index and prefix length */
size_t cplPrefixLength = highestPatternIndex;
/* complementary length of recently inspected pattern substring which is simultaneously pattern prefix and suffix */
size_t cplPrefixSuffixLength = patternLength;
/* too hard to explain in a C source ;-) */
size_t iRepeatedSuffixMax;
aGoodSuffixIndexTable [cplPrefixLength] = patternLength;
while (cplPrefixLength > 0) {
if (!strncmp (pPattern, pPattern + cplPrefixLength, prefixLength)) {
cplPrefixSuffixLength = cplPrefixLength;
}
aGoodSuffixIndexTable [--cplPrefixLength] = cplPrefixSuffixLength + prefixLength++;
}
if (pPattern [0] != pPattern [highestPatternIndex]) {
aGoodSuffixIndexTable [highestPatternIndex] = highestPatternIndex;
}
for (iRepeatedSuffixMax = 1; iRepeatedSuffixMax < highestPatternIndex; ++iRepeatedSuffixMax) {
size_t iSuffix = highestPatternIndex;
size_t iRepeatedSuffix = iRepeatedSuffixMax;
do {
if (pPattern [iRepeatedSuffix] != pPattern [iSuffix]) {
aGoodSuffixIndexTable [iSuffix] = highestPatternIndex - iRepeatedSuffix;
break;
}
--iSuffix;
} while (--iRepeatedSuffix > 0);
}
}
char const * boyerMoore (
/*----------------------------------------------------------------
function:
find a pattern (needle) inside a text (haystack)
parameters: */
char const * const pHaystack,
size_t const haystackLength,
char const * const pPattern)
/*----------------------------------------------------------------*/
{
size_t const patternLength = strlen (pPattern);
size_t const highestPatternIndex = patternLength - 1;
size_t aBadCharIndexTable [256];
size_t aGoodSuffixIndexTable [256];
if (*pPattern == '\0') {
return pHaystack;
}
if (patternLength <= 1) {
return strchr (pHaystack, *pPattern);
}
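/* compare against the number of table entries, not the table size in bytes */
if (patternLength >= sizeof aGoodSuffixIndexTable / sizeof aGoodSuffixIndexTable [0]) {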
/* exit for too long patterns */
return 0;
}
{
char const * pInHaystack = pHaystack + highestPatternIndex;
/* search preparation */
fillBadCharIndexTable (
aBadCharIndexTable,
pPattern,
patternLength);
fillGoodSuffixRuleTable (
aGoodSuffixIndexTable,
pPattern,
patternLength);
/* search execution */
while (pInHaystack++ < pHaystack + haystackLength) {
int iPattern = (int) highestPatternIndex;
while (*--pInHaystack == pPattern [iPattern]) {
if (--iPattern < 0) {
return pInHaystack;
}
}
pInHaystack += MAX (aBadCharIndexTable [(unsigned char) *pInHaystack], aGoodSuffixIndexTable [iPattern]);
}
}
return 0;
}

Related

Check if Char Array contains special sequence without using string library on Unix in C

Let's assume we have a char array and a sequence. Next we would like to check whether the char array contains the special sequence WITHOUT using the <string.h> library: if yes -> return true; if no -> return false.
bool contains(char *Array, char *Sequence) {
// CONTAINS - Function
for (int i = 0; i < sizeof(Array); i++) {
for (int s = 0; s < sizeof(Sequence); s++) {
if (Array[i] == Sequence[i]) {
// How to check if Sequence is contained ?
}
}
}
return false;
}
// in Main Function
char *Arr = "ABCDEFG";
char *Seq = "AB";
bool contained = contains(Arr, Seq);
if (contained) {
printf("Contained\n");
} else {
printf("Not Contained\n");
}
Any ideas, suggestions, websites ... ?
Thanks in advance,
Regards, from ∆
The simplest way is the naive search function:
for (i = 0; i < lenS1; i++) {
for (j = 0; j < lenS2; j++) {
if (arr[i + j] != seq[j]) {
break; // seq is not present in arr at position i!
}
}
if (j == lenS2) {
return true;
}
}
Note that you cannot use sizeof here: it is evaluated at compile time, while the length of the string is only known at run time. Applied to a char pointer, sizeof returns the size of the pointer, so almost certainly four or eight whatever strings you use. You need to calculate the string lengths explicitly, which in C is done by knowing that the last character of the string is a zero:
lenS1 = 0;
while (string1[lenS1]) lenS1++;
lenS2 = 0;
while (string2[lenS2]) lenS2++;
An obvious and easy improvement is to limit i between 0 and lenS1 - lenS2, and if lenS1 < lenS2, immediately return false. Obviously if you haven't found "HELLO" in "WELCOME" by the time you've gotten to the 'L', there's no chance of five-character HELLO being ever contained in the four-character remainder COME:
if (lenS1 < lenS2) {
return false; // You will never find "PEACE" in "WAR".
}
lenS1minuslenS2 = lenS1 - lenS2;
for (i = 0; i <= lenS1minuslenS2; i++)
Further improvements depend on your use case.
Looking for the same sequence among lots of arrays, looking for different sequences always in the same array, looking for lots of different sequences in lots of different arrays - all call for different optimizations.
The length and distribution of characters within both array and sequence also matter a lot, because if you know that there only are (say) three E's in a long string and you know where they are, and you need to search for HELLO, there's only three places where HELLO might fit. So you needn't scan the whole "WE WISH YOU A MERRY CHRISTMAS, WE WISH YOU A MERRY CHRISTMAS AND A HAPPY NEW YEAR" string. Actually you may notice there are no L's in the array and immediately return false.
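The "no L's in the array" shortcut can be sketched as a cheap pre-check that runs before any real search. The function name may_contain is made up for this example, and like the rest of this thread it avoids <string.h>. A true result only means a full search is still needed; it does not prove the sequence is present.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns false when seq contains a character that never occurs in arr,
 * in which case seq cannot be a substring of arr. Returns true when a
 * full search is still required. (Hypothetical helper for illustration.) */
bool may_contain(const char *arr, const char *seq)
{
    bool seen[UCHAR_MAX + 1] = { false };
    size_t i;

    /* Record which byte values occur anywhere in arr. */
    for (i = 0; arr[i] != '\0'; i++)
        seen[(unsigned char)arr[i]] = true;

    /* Any character of seq that is missing from arr rules out a match. */
    for (i = 0; seq[i] != '\0'; i++)
        if (!seen[(unsigned char)seq[i]])
            return false;

    return true;
}
```

This costs one pass over the array, so it only pays off when the same array is searched many times and the seen table is kept around between searches.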
A balanced option for an average use case (it does have pathological cases) might be supplied by the Boyer-Moore string matching algorithm (C source and explanation supplied at the link). This has a setup cost, so if you need to look for different short strings within very large texts, it is not a good choice (there is a parallel-search version which is good for some of those cases).
This is not the most efficient algorithm but I do not want to change your code too much.
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
size_t mystrlen(const char *str)
{
const char *end = str;
while(*end++);
return end - str - 1;
}
bool contains(char *Array, char *Sequence) {
// CONTAINS - Function
bool result = false;
size_t s, i;
size_t arrayLen = mystrlen(Array);
size_t sequenceLen = mystrlen(Sequence);
if(sequenceLen <= arrayLen)
{
for (i = 0; i < arrayLen; i++) {
for (s = 0; s < sequenceLen; s++)
{
if (Array[i + s] != Sequence[s])
{
break;
}
}
if(s == sequenceLen)
{
result = true;
break;
}
}
}
return result;
}
int main()
{
char *Arr = "ABCDEFG";
char *Seq = "AB";
bool contained = contains(Arr, Seq);
if (contained)
{
printf("Contained\n");
}
else
{
printf("Not Contained\n");
}
}
Basically this is strstr
const char* strstrn(const char* orig, const char* pat, int n)
{
const char* it = orig;
do
{
const char* tmp = it;
const char* tmp2 = pat;
if (*tmp == *tmp2) {
while (*tmp == *tmp2 && *tmp != '\0') {
tmp++;
tmp2++;
}
/* count this position only if the whole pattern was consumed */
if (*tmp2 == '\0' && n-- == 0)
return it;
}
tmp = it;
tmp2 = pat;
} while (*it++ != '\0');
return NULL;
}
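The above is meant to return a pointer to the n-th occurrence (counting from zero) of the substring in the string, or NULL if there are fewer occurrences.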

Figure out why my RC4 implementation doesn't produce the correct result

Ok, I am new to C. I have programmed in C# for around 10 years now, so I'm still getting used to the whole language. I've been doing great in learning but I'm still having a few hiccups; currently I'm trying to write an implementation of the RC4 used on the Xbox 360 to encrypt KeyVault/Account data.
However, I've run into a snag: the code works but it outputs incorrect data. I have provided the original C# code I am working with, which I know works, and the snippet of code from my C project. Any help / pointers will be much appreciated :)
Original C# Code :
public struct RC4Session
{
public byte[] Key;
public int SBoxLen;
public byte[] SBox;
public int I;
public int J;
}
public static RC4Session RC4CreateSession(byte[] key)
{
RC4Session session = new RC4Session
{
Key = key,
I = 0,
J = 0,
SBoxLen = 0x100,
SBox = new byte[0x100]
};
for (int i = 0; i < session.SBoxLen; i++)
{
session.SBox[i] = (byte)i;
}
int index = 0;
for (int j = 0; j < session.SBoxLen; j++)
{
index = ((index + session.SBox[j]) + key[j % key.Length]) % session.SBoxLen;
byte num4 = session.SBox[index];
session.SBox[index] = session.SBox[j];
session.SBox[j] = num4;
}
return session;
}
public static void RC4Encrypt(ref RC4Session session, byte[] data, int index, int count)
{
int num = index;
do
{
session.I = (session.I + 1) % 0x100;
session.J = (session.J + session.SBox[session.I]) % 0x100;
byte num2 = session.SBox[session.I];
session.SBox[session.I] = session.SBox[session.J];
session.SBox[session.J] = num2;
byte num3 = data[num];
byte num4 = session.SBox[(session.SBox[session.I] + session.SBox[session.J]) % 0x100];
data[num] = (byte)(num3 ^ num4);
num++;
}
while (num != (index + count));
}
Now Here is my own c version :
typedef struct rc4_state {
int s_box_len;
uint8_t* sbox;
int i;
int j;
} rc4_state_t;
unsigned char* HMAC_SHA1(const char* cpukey, const unsigned char* hmac_key) {
unsigned char* digest = malloc(20);
digest = HMAC(EVP_sha1(), cpukey, 16, hmac_key, 16, NULL, NULL);
return digest;
}
void rc4_init(rc4_state_t* state, const uint8_t *key, int keylen)
{
state->i = 0;
state->j = 0;
state->s_box_len = 0x100;
state->sbox = malloc(0x100);
// Init sbox.
int i = 0, index = 0, j = 0;
uint8_t buf;
while(i < state->s_box_len) {
state->sbox[i] = (uint8_t)i;
i++;
}
while(j < state->s_box_len) {
index = ((index + state->sbox[j]) + key[j % keylen]) % state->s_box_len;
buf = state->sbox[index];
state->sbox[index] = (uint8_t)state->sbox[j];
state->sbox[j] = (uint8_t)buf;
j++;
}
}
void rc4_crypt(rc4_state_t* state, const uint8_t *inbuf, uint8_t **outbuf, int buflen)
{
int idx = 0;
uint8_t num, num2, num3;
*outbuf = malloc(buflen);
if (*outbuf) { // do not forget to test for failed allocation
while(idx != buflen) {
state->i = (int)(state->i + 1) % 0x100;
state->j = (int)(state->j + state->sbox[state->i]) % 0x100;
num = (uint8_t)state->sbox[state->i];
state->sbox[state->i] = (uint8_t)state->sbox[state->j];
state->sbox[state->j] = (uint8_t)num;
num2 = (uint8_t)inbuf[idx];
num3 = (uint8_t)state->sbox[(state->sbox[state->i] + (uint8_t)state->sbox[state->j]) % 0x100];
(*outbuf)[idx] = (uint8_t)(num2 ^ num3);
printf("%02X", (*outbuf)[idx]);
idx++;
}
}
printf("\n");
}
Usage (c#) :
byte[] cpukey = new byte[16]
{
...
};
byte[] hmac_key = new byte[16]
{
...
};
byte[] buf = new System.Security.Cryptography.HMACSHA1(cpukey).ComputeHash(hmac_key);
MessageBox.Show(BitConverter.ToString(buf).Replace("-", ""), "");
Usage(c):
const char cpu_key[16] = { 0xXX, 0xXX, 0xXX };
const unsigned char hmac_key[16] = { ... };
unsigned char* buf = HMAC_SHA1(cpu_key, hmac_key);
uint8_t buf2[20];
uint8_t buf3[8] = { 0x1E, 0xF7, 0x94, 0x48, 0x22, 0x26, 0x89, 0x8E }; // Encrypted Xbox 360 data
uint8_t* buf4;
// Allocated 8 bytes out.
buf4 = malloc(8);
int num = 0;
while(num < 20) {
buf2[num] = (uint8_t)buf[num]; // convert const char
num++;
}
rc4_state_t* rc4 = malloc(sizeof(rc4_state_t));
rc4_init(rc4, buf2, 20);
rc4_crypt(rc4, buf3, &buf4, 8);
Now I have the HMAC-SHA1 figured out: I'm using OpenSSL for that and I can confirm I am getting the correct HMAC/decryption key; it's just the RC4 that isn't working. I'm trying to decrypt part of the KeyVault that should == "Xbox 360"||"58626F7820333630"
The output is currently: "0000008108020000". I do not get any errors in the compilation. Again, any help would be great ^.^
Thanks to John's help I was able to fix it; it was an error in the C# version. Thanks John!
As I remarked in comments, your main problem appeared to involve how the output buffer is managed. You have since revised the question to fix that, but I describe it anyway here, along with some other alternatives for fixing it. The remaining problem is discussed at the end.
Function rc4_crypt() allocates an output buffer for itself, but it has no mechanism to communicate a pointer to the allocated space back to its caller. Your revised usage furthermore exhibits some inconsistency with rc4_crypt() with respect to how the output buffer is expected to be managed.
There are three main ways to approach the problem:
1. Function rc4_crypt() presently returns nothing, so you could let it continue to allocate the buffer itself, and modify it to return a pointer to the allocated output buffer.
2. You could modify the type of the outbuf parameter to uint8_t ** to enable rc4_crypt() to set the caller's pointer value indirectly.
3. You could rely on the caller to manage the output buffer, and make rc4_crypt() just write the output via the pointer passed to it.
The only one of those that might be tricky for you is #2; it would look something like this:
void rc4_crypt(rc4_state_t* state, const uint8_t *inbuf, uint8_t **outbuf, int buflen) {
*outbuf = malloc(buflen);
if (*outbuf) { // do not forget to test for failed allocation
// ...
(*outbuf)[idx] = (uint8_t)(num2 ^ num3);
// ...
}
}
And you would use it like this:
rc4_crypt(rc4, buf3, &buf4, 8);
... without otherwise allocating any memory for buf4.
The caller in any case has the responsibility for freeing the output buffer when it is no longer needed. This is clearer when it performs the allocation itself; you should document that requirement if rc4_crypt() is going to be responsible for the allocation.
The remaining problem appears to be strictly an output problem. You are apparently relying on print statements in rc4_crypt() to report on the encrypted data. I have no problem whatever with debugging via print statements, but you do need to be careful to print the data you actually want to examine. In this case you do not. You update the joint buffer index idx at the end of the encryption loop before printing a byte from the output buffer. As a result, at each iteration you print not the encrypted byte value you've just computed, but rather an indeterminate value that happens to be in the next position of the output buffer.
Move the idx++ to the very end of the loop to fix this problem, or change it from a while loop to a for loop and increment idx in the third term of the loop control statement. In fact, I strongly recommend for loops over while loops where the former are a good fit to the structure of the code (as here); I daresay you would not have made this mistake if your loop had been structured that way.
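To illustrate the caller-managed buffer (option 3) together with the for-loop structure recommended above, here is a minimal sketch. It is not the asker's code: the names rc4_ctx, rc4_setup and rc4_apply are made up, the S-box lives inside the struct so nothing is allocated, and uint8_t arithmetic stands in for the explicit % 0x100.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical RC4 context; uint8_t indices wrap around modulo 256. */
typedef struct {
    uint8_t sbox[256];
    uint8_t i, j;
} rc4_ctx;

/* Key schedule. keylen must be at least 1. */
void rc4_setup(rc4_ctx *ctx, const uint8_t *key, size_t keylen)
{
    size_t n;
    uint8_t j = 0, tmp;

    for (n = 0; n < 256; n++)
        ctx->sbox[n] = (uint8_t)n;

    for (n = 0; n < 256; n++) {
        j = (uint8_t)(j + ctx->sbox[n] + key[n % keylen]);
        tmp = ctx->sbox[n];
        ctx->sbox[n] = ctx->sbox[j];
        ctx->sbox[j] = tmp;
    }
    ctx->i = 0;
    ctx->j = 0;
}

/* XORs len bytes of in with the keystream and writes them to out.
 * The caller owns out; in and out may point to the same buffer. */
void rc4_apply(rc4_ctx *ctx, const uint8_t *in, uint8_t *out, size_t len)
{
    size_t n;
    uint8_t tmp;

    for (n = 0; n < len; n++) {   /* the index only changes here */
        ctx->i = (uint8_t)(ctx->i + 1);
        ctx->j = (uint8_t)(ctx->j + ctx->sbox[ctx->i]);
        tmp = ctx->sbox[ctx->i];
        ctx->sbox[ctx->i] = ctx->sbox[ctx->j];
        ctx->sbox[ctx->j] = tmp;
        out[n] = in[n] ^ ctx->sbox[(uint8_t)(ctx->sbox[ctx->i] + ctx->sbox[ctx->j])];
    }
}
```

Because RC4 is a plain XOR stream, running rc4_setup with the same key and then rc4_apply on the ciphertext gives back the plaintext, which makes a quick round-trip check easy to write.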

Fast string search to find string array element which matches the given pattern

I have an array of constant strings which I iterate through to find an index of element which is a string that contains a search pattern. Which search algorithm should I choose to improve the speed of finding this element? I am not limited in time before running the application for preparing the look up tables if any are necessary.
I corrected a question - I am not doing exact string match - I am searching for pattern inside the element, which is in an array
array of strings:
[0] Red fox jumps over the fence
[1] Blue table
[2] Red flowers on the fence
I need to find the element which contains the word 'table'; in this case it's element 1.
I do around 50000 iterations over a set of 30 arrays, which could contain up to 30000 strings of not less than 128 characters. Now I am using good old strstr brute force, which is too slow...
Ok, posting a part of my function, the first strstr - looks up in an uncut array of lines if there are any occurrences, then the brute search follows. I know I can speed this part, but I am not doing optimization on this approach...
// sheets[i].buffer - contains a page of a text which is split into lines
// fullfunccall - is the pattern
// sheets[i].cells[k] - are the pointers to lines in a buffer
for( i=0; i<sheetcount; i++) {
if( i!= currsheet && sheets[i].name && sheets[i].name[0] != '\0') {
if( strstr(sheets[i].buffer, fullfunccall )) {
usedexternally = 1;
int foundatleastone = 0;
for( k=0; k<sheets[i].numcells; k++ ) {
strncpy_s(testline, MAX_LINE_SIZE, sheets[i].cells[k]->line, sheets[i].cells[k]->linesize);
testline[sheets[i].cells[k]->linesize] = '\0';
if( strstr(testline, fullfunccall )) {
dependency_num++;
if( dependency_num >= MAX_CELL_DEPENDENCIES-1) {
printf("allocation for sheet cell dependencies is insuficcient\n");
return;
}
sheets[currsheet].cells[currcellposinsheet]->numdeps = dependency_num+1;
foundatleastone++;
sheets[currsheet].cells[currcellposinsheet]->deps[dependency_num] = &sheets[i].cells[k];
}
}
if( foundatleastone == 0 ) {
printf("error locating dependency for external func: %s\n", fullfunccall );
return;
}
}
};
}
You wrote that your 'haystack' (the set of strings to search through) is roughly 30000 strings with approx. 200 characters each. You also wrote that the 'needle' (the term to search for) is either a string of 5 or 20 characters.
Based on this, you could precompute a hashtable which maps any 5-character subsequence to the string(s) in the haystack it occurs in. For 30000 strings (200 characters each) there are at most 30000 * (200 - 5 + 1) = 5,880,000 different 5-character substrings. If you hash each of them to a 16-bit checksum you'd need a minimum of roughly 11.8 MB of memory (for the hash keys) plus some pointers pointing to the string(s) in which the substring occurs.
For instance, given a simplified haystack of
static const char *haystack[] = { "12345", "123456", "23456", "345678" };
you precompute a hash map which maps any possible 5-character string such that
12345 => haystack[0], haystack[1]
23456 => haystack[1], haystack[2]
34567 => haystack[3]
45678 => haystack[3]
With this, you could take the first five characters of your given key (either 5 or 20 characters long), hash it and then do a normal strstr through all the strings to which the key is mapped by the hash.
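A rough sketch of that idea follows. Everything here is an assumption made for illustration: the names gram_index, index_build, index_find and index_free, the simple multiplicative hash, and the choice to store string indices per bucket. Different 5-character substrings can share a bucket, so strstr still confirms each candidate.

```c
#include <stdlib.h>
#include <string.h>

#define GRAM 5            /* substring length used as the index key */
#define NBUCKETS 65536u   /* one bucket per 16-bit hash value */

typedef struct {
    size_t *ids;          /* indices of strings with a 5-gram hashing here */
    size_t count, cap;
} Bucket;

static Bucket gram_index[NBUCKETS];

static unsigned hash_gram(const char *s)
{
    unsigned h = 0;
    int i;
    for (i = 0; i < GRAM; i++)
        h = h * 31u + (unsigned char)s[i];
    return h & (NBUCKETS - 1u);
}

/* Builds the index; returns 0 on success, -1 on allocation failure. */
int index_build(const char **strings, size_t n)
{
    size_t i, off;
    for (i = 0; i < n; i++) {
        size_t len = strlen(strings[i]);
        for (off = 0; off + GRAM <= len; off++) {
            Bucket *b = &gram_index[hash_gram(strings[i] + off)];
            if (b->count > 0 && b->ids[b->count - 1] == i)
                continue;             /* string already listed in this bucket */
            if (b->count == b->cap) {
                size_t cap = b->cap ? b->cap * 2 : 4;
                size_t *p = realloc(b->ids, cap * sizeof *p);
                if (p == NULL) return -1;
                b->ids = p;
                b->cap = cap;
            }
            b->ids[b->count++] = i;
        }
    }
    return 0;
}

/* Returns the index of the first string containing key, or -1.
 * key must be at least GRAM characters long. */
long index_find(const char **strings, const char *key)
{
    const Bucket *b = &gram_index[hash_gram(key)];
    size_t k;
    for (k = 0; k < b->count; k++)
        if (strstr(strings[b->ids[k]], key) != NULL)
            return (long)b->ids[k];
    return -1;
}

/* Releases all bucket storage. */
void index_free(void)
{
    size_t k;
    for (k = 0; k < NBUCKETS; k++) {
        free(gram_index[k].ids);
        gram_index[k].ids = NULL;
        gram_index[k].count = gram_index[k].cap = 0;
    }
}
```

With the four-element haystack from the example, looking up "45678" only visits the bucket for that hash and confirms the match in the last string instead of scanning all four.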
For each sheet that you are treating, you could build a suffix array as described in this article. Before you start your search, read the sheet, find the line beginnings (as integer indices into the sheet buffer), create the suffix array and sort it as described in the article.
Now, if you are looking for the lines in which a pattern, say "table", occurs, you can search for the next entry after "table" and the next entry after "tablf", which is the first non-match, where you have moved the right-most letter, odometer-style.
If both indices are the same, there are no matches. If they are different, you'll get a list of pointers into the sheet:
"tab. And now ..."
----------------------------------------------------------------
"table and ..." 0x0100ab30
"table water for ..." 0x0100132b
"tablet computer ..." 0x01000208
----------------------------------------------------------------
"tabloid reporter ..."
This will give you a list of pointers from which, by subtracting the base pointer of the sheet buffer, you can get the integer offsets. Comparison with the line beginnings will give you the line numbers that correspond to these pointers. (The line numbers are sorted, so you can do binary search here.)
The memory overhead is an array of pointers with as many entries as the sheet buffer has characters, so for 30,000 strings of 200 chars, that will be about 48MB on a 64-bit machine. (The overhead of the line indices is negligible.)
Sorting the array will take long, but it is done only once for each sheet.
Edit: The idea seems to work well. I have implemented it and can scan a dictionary of about 130,000 words on a text file of nearly 600k in less than one second:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#define die(...) exit((fprintf(stderr, "Fatal: " __VA_ARGS__), \
putc(10, stderr), 1))
typedef struct Sheet Sheet;
struct Sheet {
size_t size; /* Number of chars */
char *buf; /* Null-terminated char buffer */
char **ptr; /* Pointers into char buffer */
size_t nline; /* number of lines */
int *line; /* array of offset of line beginnings */
size_t naux; /* size of scratch array */
char **aux; /* scratch array */
};
/*
* Count occurrence of c in zero-terminated string p.
*/
size_t strcount(const char *p, int c)
{
size_t n = 0;
for (;;) {
p = strchr(p, c);
if (p == NULL) return n;
p++;
n++;
}
return 0;
}
/*
* String comparison via pointers to strings.
*/
int pstrcmp(const void *a, const void *b)
{
const char *const *aa = a;
const char *const *bb = b;
return strcmp(*aa, *bb);
}
/*
* Pointer comparison.
*/
int ptrcmp(const void *a, const void *b)
{
const char *const *aa = a;
const char *const *bb = b;
if (*aa == *bb) return 0;
return (*aa < *bb) ? -1 : 1;
}
/*
* Create and prepare a sheet, i.e. a text file to search.
*/
Sheet *sheet_new(const char *fn)
{
Sheet *sheet;
FILE *f = fopen(fn, "r");
size_t n;
int last;
char *p;
char **pp;
if (f == NULL) die("Couldn't open %s", fn);
sheet = malloc(sizeof(*sheet));
if (sheet == NULL) die("Allocation failed");
fseek(f, 0, SEEK_END);
sheet->size = ftell(f);
fseek(f, 0, SEEK_SET);
sheet->buf = malloc(sheet->size + 1);
sheet->ptr = malloc(sheet->size * sizeof(*sheet->ptr));
if (sheet->buf == NULL) die("Allocation failed");
if (sheet->ptr == NULL) die("Allocation failed");
/* in text mode fread may deliver fewer bytes than ftell reported */
sheet->size = fread(sheet->buf, 1, sheet->size, f);
sheet->buf[sheet->size] = '\0';
fclose(f);
sheet->nline = strcount(sheet->buf, '\n');
sheet->line = malloc(sheet->nline * sizeof(*sheet->line));
sheet->aux = NULL;
sheet->naux = 0;
n = 0;
last = 0;
p = sheet->buf;
pp = sheet->ptr;
while (*p) {
*pp++ = p;
if (*p == '\n') {
sheet->line[n++] = last;
last = p - sheet->buf + 1;
}
p++;
}
qsort(sheet->ptr, sheet->size, sizeof(*sheet->ptr), pstrcmp);
return sheet;
}
/*
* Clean up sheet.
*/
void sheet_delete(Sheet *sheet)
{
free(sheet->buf);
free(sheet->ptr);
free(sheet->line);
free(sheet->aux);
free(sheet);
}
/*
* Binary range search for string pointers.
*/
static char **pstr_bsearch(const char *key,
char **arr, size_t high)
{
size_t low = 0;
while (low < high) {
size_t mid = (low + high) / 2;
int diff = strcmp(key, arr[mid]);
if (diff < 0) high = mid;
else low = mid + 1;
}
return arr + low;
}
/*
* Binary range search for line offsets.
*/
static const int *int_bsearch(int key, const int *arr, size_t high)
{
size_t low = 0;
while (low < high) {
size_t mid = (low + high) / 2;
int diff = key - arr[mid];
if (diff < 0) high = mid;
else low = mid + 1;
}
if (low < 1) return NULL;
return arr + low - 1;
}
/*
* Find occurrences of the string key in the sheet. Returns the
* number of lines in which the key occurs and assigns up to
* max lines to the line array. (If max is 0, line may be NULL.)
*/
int sheet_find(Sheet *sheet, char *key,
int line[], int max)
{
char **begin, **end;
int n = 0;
size_t i, m;
size_t last;
begin = pstr_bsearch(key, sheet->ptr, sheet->size);
if (begin == NULL) return 0;
key[strlen(key) - 1]++;
end = pstr_bsearch(key, sheet->ptr, sheet->size);
key[strlen(key) - 1]--;
if (end == NULL) return 0;
if (end == begin) return 0;
m = end - begin;
if (m > sheet->naux) {
if (sheet->naux == 0) sheet->naux = 0x100;
while (sheet->naux < m) sheet->naux *= 2;
sheet->aux = realloc(sheet->aux, sheet->naux * sizeof(*sheet->aux));
if (sheet->aux == NULL) die("Re-allocation failed");
}
memcpy(sheet->aux, begin, m * sizeof(*begin));
qsort(sheet->aux, m, sizeof(*begin), ptrcmp);
last = 0;
for (i = 0; i < m; i++) {
int offset = sheet->aux[i] - sheet->buf;
const int *p;
p = int_bsearch(offset, sheet->line + last, sheet->nline - last);
if (p) {
if (n < max) line[n] = p - sheet->line;
last = p - sheet->line + 1;
n++;
}
}
return n;
}
/*
* Example client code
*/
int main(int argc, char **argv)
{
Sheet *sheet;
FILE *f;
if (argc != 3) die("Usage: %s patterns corpus", *argv);
sheet = sheet_new(argv[2]);
f = fopen(argv[1], "r");
if (f == NULL) die("Can't open %s.", argv[1]);
for (;;) {
char str[80];
int line[50];
int i, n;
if (fgets(str, sizeof(str), f) == NULL) break;
strtok(str, "\n");
n = sheet_find(sheet, str, line, 50);
printf("%8d %s\n", n, str);
if (n > 50) n = 50;
for (i = 0; i < n; i++) printf(" [%d] %d\n", i, line[i] + 1);
}
fclose(f);
sheet_delete(sheet);
return 0;
}
The implementation has its rough edges, but it works. I'm not especially fond of the scratch array and the additional sorting on the found pointer range, but it turns out that even sorting the large suffix array doesn't take too long.
You can extend this solution to more sheets, if you like.
I believe the most practical would be a DFA, as it reads every character of input at most once; more precisely, it reads every input char once and stops as soon as the pattern definitely cannot match (if set up properly). With a DFA you can also check against multiple patterns simultaneously.
Two good (but different) implementations of DFA algorithms, well tested in practice, are
PIRE
Ragel
It's not possible to say which fits your task unless you provide more details.
edit: DFA stands for "Deterministic Finite Automaton"
edit: as you indicated your patterns are exact substrings, the most common solution is the KMP algorithm (Knuth-Morris-Pratt)
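For reference, a compact KMP sketch is shown here. The function name kmp_search is made up, and returning NULL on allocation failure is a simplification of this example, since it cannot be told apart from "not found". The failure table is built once per pattern, and the scan never moves backwards in the text.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Knuth-Morris-Pratt search (illustrative sketch, hypothetical name).
 * Returns a pointer to the first occurrence of pat in text, or NULL if
 * pat does not occur (or if the table could not be allocated). */
const char *kmp_search(const char *text, const char *pat)
{
    size_t n = strlen(text), m = strlen(pat);
    size_t *fail;
    size_t i, k;
    const char *result = NULL;

    if (m == 0) return text;
    if (m > n) return NULL;

    fail = malloc(m * sizeof *fail);
    if (fail == NULL) return NULL;

    /* fail[i] = length of the longest proper prefix of pat[0..i]
     * that is also a suffix of pat[0..i]. */
    fail[0] = 0;
    k = 0;
    for (i = 1; i < m; i++) {
        while (k > 0 && pat[i] != pat[k]) k = fail[k - 1];
        if (pat[i] == pat[k]) k++;
        fail[i] = k;
    }

    /* Scan the text; k is the number of pattern characters matched. */
    k = 0;
    for (i = 0; i < n; i++) {
        while (k > 0 && text[i] != pat[k]) k = fail[k - 1];
        if (text[i] == pat[k]) k++;
        if (k == m) {
            result = text + i + 1 - m;
            break;
        }
    }

    free(fail);
    return result;
}
```

When the same pattern is searched in many lines, the failure table could be computed once and reused; this sketch rebuilds it on every call to stay short.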

Suggestions to improve a C ReplaceString function?

I've just started to get in to C programming and would appreciate criticism on my ReplaceString function.
It seems pretty fast (it doesn't allocate any memory other than one malloc for the result string) but it seems awfully verbose and I know it could be done better.
Example usage:
printf("New string: %s\n", ReplaceString("great", "ok", "have a g grea great day and have a great day great"));
printf("New string: %s\n", ReplaceString("great", "fantastic", "have a g grea great day and have a great day great"));
Code:
#ifndef uint
#define uint unsigned int
#endif
char *ReplaceString(char *needle, char *replace, char *haystack)
{
char *newString;
uint lNeedle = strlen(needle);
uint lReplace = strlen(replace);
uint lHaystack = strlen(haystack);
uint i;
uint j = 0;
uint k = 0;
uint lNew;
char active = 0;
uint start = 0;
uint end = 0;
/* Calculate new string size */
lNew = lHaystack;
for (i = 0; i < lHaystack; i++)
{
if ( (!active) && (haystack[i] == needle[0]))
{
/* Start of needle found */
active = 1;
start = i;
end = i;
}
else if ( (active) && (i-start == lNeedle) )
{
/* End of needle */
active = 0;
lNew += lReplace - lNeedle;
}
else if ( (active) && (i-start < lNeedle) && (haystack[i] == needle[i-start]) )
{
/* Next part of needle found */
end++;
}
else if (active)
{
/* Didn't match the entire needle... */
active = 0;
}
}
active= 0;
end = 0;
/* Prepare new string */
newString = malloc(sizeof(char) * lNew + 1);
newString[sizeof(char) * lNew] = 0;
/* Build new string */
for (i = 0; i < lHaystack; i++)
{
if ( (!active) && (haystack[i] == needle[0]))
{
/* Start of needle found */
active = 1;
start = i;
end = i;
}
else if ( (active) && (i-start == lNeedle) )
{
/* End of needle - apply replacement */
active = 0;
for (k = 0; k < lReplace; k++)
{
newString[j] = replace[k];
j++;
}
newString[j] = haystack[i];
j++;
}
else if ( (active) && (i-start < lNeedle) && (haystack[i] == needle[i-start])
)
{
/* Next part of needle found */
end++;
}
else if (active)
{
/* Didn't match the entire needle, so apply skipped chars */
active = 0;
for (k = start; k < end+2; k++)
{
newString[j] = haystack[k];
j++;
}
}
else if (!active)
{
/* No needle matched */
newString[j] = haystack[i];
j++;
}
}
/* If still matching a needle... */
if ( active && (i-start == lNeedle))
{
/* If full needle */
for (k = 0; k < lReplace; k++)
{
newString[j] = replace[k];
j++;
}
newString[j] = haystack[i];
j++;
}
else if (active)
{
for (k = start; k < end+2; k++)
{
newString[j] = haystack[k];
j++;
}
}
return newString;
}
Any ideas? Thanks very much!
Don't call strlen(haystack). You are already checking every character in the string, so computing the string length is implicit to your loop, as follows:
for (i = 0; haystack[i] != '\0'; i++)
{
...
}
lHaystack = i;
It's possible you are doing this in your own way for practice. If so, you get many points for effort.
If not, you can often save time by using functions that are in the C Runtime Library (CRT) versus coding your own equivalent function. For example, you could use strstr to locate the string that's targeted for replacement. Other string manipulation functions may also be useful to you.
A good exercise would be to complete this example to your satisfaction and then recode using the CRT to see how much faster it is to code and execute.
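As a starting point for that exercise, here is one possible strstr-based version: a sketch under my own assumptions, not a drop-in replacement for the code in the question. The name replace_all is made up, matches are taken left to right without overlap, and an empty search string is rejected by returning NULL.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated copy of text in which every non-overlapping
 * occurrence of old_text is replaced by new_text. The caller must free
 * the result. Returns NULL if old_text is empty or allocation fails.
 * (Hypothetical helper for illustration.) */
char *replace_all(const char *old_text, const char *new_text, const char *text)
{
    size_t old_len = strlen(old_text);
    size_t new_len = strlen(new_text);
    size_t count = 0;
    const char *p, *hit;
    char *result, *out;

    if (old_len == 0) return NULL;

    /* Pass 1: count the matches so the result size is known up front. */
    for (p = text; (hit = strstr(p, old_text)) != NULL; p = hit + old_len)
        count++;

    result = malloc(strlen(text) - count * old_len + count * new_len + 1);
    if (result == NULL) return NULL;

    /* Pass 2: copy the text between matches, then the replacement. */
    out = result;
    for (p = text; (hit = strstr(p, old_text)) != NULL; p = hit + old_len) {
        memcpy(out, p, (size_t)(hit - p));
        out += hit - p;
        memcpy(out, new_text, new_len);
        out += new_len;
    }
    strcpy(out, p);   /* remainder after the last match, including the NUL */
    return result;
}
```

Like the original, it needs a single malloc for the result; the difference is that strstr and memcpy do the scanning and copying.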
While looping the first time, you should record the indices where replacements are needed and skip over those in the copy/replace part of the function. This would result in a loop where you only do strncpy from haystack or replacement to the new string.
Make the parameters const
char *ReplaceString(const char *needle, const char *replace, const char *haystack)
Oh ... is the function supposed to work only once per word?
ReplaceString("BAR", "bar", "BARBARA WENT TO THE BAR")
My one suggestion has nothing to do with improving performance, but with improving readability.
"Cute" parameter names are much harder to understand than descriptive ones. Which of the following parameters do you think better convey their purpose?
char *ReplaceString(char *needle, char *replace, char *haystack)
char *ReplaceString(char *oldText, char *newText, char *inString)
With one, you have to consciously map a name to a purpose. With the other, the purpose IS the name. Juggling a bunch of name mappings in your head while trying to understand a piece of code can become difficult, especially as the number of variables increases.
This might not seem so important when you're the only one using your code, but it's paramount when your code is being used or read by someone else. And sometimes, "someone else" is yourself, a year later, looking at your own code, wondering why you're searching through haystacks and trying to replace needles ;)

How would you implement the pilloried function in the Daily WTF?

The Daily WTF for 2008-11-28 pillories the following code:
static char *nice_num(long n)
{
int neg = 0, d = 3;
char *buffer = prtbuf;
int bufsize = 20;
if (n < 0)
{
neg = 1;
n = -n;
}
buffer += bufsize;
*--buffer = '\0';
do
{
*--buffer = '0' + (n % 10);
n /= 10;
if (--d == 0)
{
d = 3;
*--buffer = ',';
}
}
while (n);
if (*buffer == ',') ++buffer;
if (neg) *--buffer = '-';
return buffer;
}
How would you write it?
If you're a seasoned C programmer, you'll realize this code isn't actually that bad. It's relatively straightforward (for C), and it's blazingly fast. It has three problems:
It fails on the edge case of LONG_MIN (-2,147,483,648 for a 32-bit long), since negating this number produces itself in two's complement
It assumes 32-bit integers - for 64-bit longs, a 20-byte buffer is not big enough
It's not thread-safe - it uses a global static buffer, so multiple threads calling it at the same time will result in a race condition
Problem #1 is easily solved with a special case. To address #2, I'd separate the code into two functions, one for 32-bit integers and one for 64-bit integers. #3 is a little harder - we have to change the interface to make it completely thread-safe.
Here is my solution, based on this code but modified to address these problems:
static int nice_num(char *buffer, size_t len, int32_t n)
{
int neg = 0, d = 3;
char buf[16];
size_t bufsize = sizeof(buf);
char *pbuf = buf + bufsize;
if(n < 0)
{
if(n == INT32_MIN)
{
strncpy(buffer, "-2,147,483,648", len);
return len <= 14;
}
neg = 1;
n = -n;
}
*--pbuf = '\0';
do
{
*--pbuf = '0' + (n % 10);
n /= 10;
if(--d == 0)
{
d = 3;
*--pbuf = ',';
}
}
while(n > 0);
if(*pbuf == ',') ++pbuf;
if(neg) *--pbuf = '-';
strncpy(buffer, pbuf, len);
return len <= strlen(pbuf);
}
Explanation: it creates a local buffer on the stack and fills it in the same way as the original code. Then it copies the result into a buffer passed in as a parameter, making sure not to overflow it. It also has a special case for INT32_MIN. The return value is 0 if the caller's buffer was large enough, or 1 if the buffer was too small and the resulting string was truncated (in which case strncpy leaves it without a terminating NUL).
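As an alternative to the INT32_MIN special case, the magnitude can be computed in unsigned arithmetic, where negation is well defined. The following is only a sketch under that assumption; the name `nice_num32` is hypothetical, and unlike the code above it leaves the output empty instead of truncating when the buffer is too small:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical variant: writes n with comma grouping into out (capacity len).
   Returns 0 on success, 1 if out was too small (out is then left empty). */
int nice_num32(char *out, size_t len, int32_t n)
{
    char buf[16];                 /* "-2,147,483,648" is 14 chars + NUL */
    char *p = buf + sizeof buf;
    /* Negate in unsigned arithmetic, which is well defined even for INT32_MIN. */
    uint32_t mag = (n < 0) ? (uint32_t)0 - (uint32_t)n : (uint32_t)n;
    int digits = 0;
    size_t need;

    *--p = '\0';
    do {
        if (digits > 0 && digits % 3 == 0)
            *--p = ',';           /* separator only between digit groups */
        *--p = (char)('0' + mag % 10);
        mag /= 10;
        ++digits;
    } while (mag > 0);
    if (n < 0)
        *--p = '-';

    need = strlen(p) + 1;         /* characters plus terminating NUL */
    if (need > len) {
        if (len > 0)
            out[0] = '\0';
        return 1;
    }
    memcpy(out, p, need);
    return 0;
}
```

Placing the separator before a digit rather than after it also removes the need to strip a leading comma afterwards.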
Hmm... I guess I shouldn't admit this, but my int to string routine for an embedded system works in pretty much exactly the same way (but without putting in the commas).
It's not particularly straightforward, but I wouldn't call it a WTF if you're working on a system that you can't use snprintf() on.
The guy who wrote the above probably noted that the printf() family of routines can't do comma grouping, so he came up with his own.
Footnote: there are some libraries where the printf() style formatting does support grouping, but they are not standard. And I know that the posted code doesn't support other locales that group using '.'. But that's hardly a WTF, just a bug possibly.
That's probably pretty close to the way I would write it actually. The only thing I can immediately see that is wrong with the solution is that it doesn't work for LONG_MIN on machines where LONG_MIN is -(LONG_MAX + 1), which is most machines nowadays. I might use localeconv to get the thousands separator instead of assuming comma, and I might more carefully calculate the buffer size, but the algorithm and implementation seem pretty straightforward to me, not really much of a WTF for C (there are much better solutions for C++).
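A rough sketch of those two refinements, assuming hypothetical helpers `thousands_separator` and `format_grouped`. It reads only the `thousands_sep` field from `localeconv()` and ignores the locale's `grouping` field, always grouping by three:

```c
#include <limits.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Upper bound on decimal digits in an unsigned long: each digit carries
   more than 3 bits, so bits/3 + 1 is always enough. */
#define MAX_DIGITS (sizeof(unsigned long) * CHAR_BIT / 3 + 1)

/* Hypothetical helper: the current locale's thousands separator, falling
   back to "," when the locale defines none (as the "C" locale does). */
const char *thousands_separator(void)
{
    const struct lconv *lc = localeconv();
    if (lc != NULL && lc->thousands_sep != NULL && lc->thousands_sep[0] != '\0')
        return lc->thousands_sep;
    return ",";
}

/* Hypothetical helper: writes n into out (capacity len) with sep between
   groups of three digits. Returns 0 on success, 1 if out is too small. */
int format_grouped(char *out, size_t len, long n, const char *sep)
{
    char digits[MAX_DIGITS + 1];
    /* Unsigned negation is well defined even for LONG_MIN. */
    unsigned long mag = (n < 0) ? 0UL - (unsigned long)n : (unsigned long)n;
    int ndigits = sprintf(digits, "%lu", mag);
    size_t sep_len = strlen(sep);
    size_t need = (size_t)(n < 0) + (size_t)ndigits
                + (size_t)((ndigits - 1) / 3) * sep_len + 1;
    char *p = out;
    int i;

    if (need > len)
        return 1;
    if (n < 0)
        *p++ = '-';
    for (i = 0; i < ndigits; ++i) {
        /* A separator goes before each digit that starts a full group of three. */
        if (i > 0 && (ndigits - i) % 3 == 0) {
            memcpy(p, sep, sep_len);
            p += sep_len;
        }
        *p++ = digits[i];
    }
    *p = '\0';
    return 0;
}
```

A caller would typically run `setlocale(LC_NUMERIC, "")` once at startup and then pass `thousands_separator()` as `sep`. The separator is handled as a string because some locales use a multi-byte one.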
Lisp:
(defun pretty-number (x) (format t "~:D" x))
I'm surprised how easily I could do this. I'm not even past the first chapter in my Lisp book. xD (Or should I say, ~:D)
size_t
signed_as_text_grouped_on_powers_of_1000(char *s, ssize_t max, int n)
{
if (max <= 0)
return 0;
size_t r=0;
bool more_groups = n/1000 != 0;
if (more_groups)
{
r = signed_as_text_grouped_on_powers_of_1000(s, max, n/1000);
r += snprintf(s+r, max-r, ",");
n = abs(n%1000);
r += snprintf(s+r, max-r, "%03d",n);
} else
r += snprintf(s+r, max-r, "% 3d", n);
return r;
}
Unfortunately, this is about 10x slower than the original.
In pure C:
#include <stdio.h>
#include <limits.h>
static char *prettyNumber(long num, int base, char separator)
{
#define bufferSize (sizeof(long) * CHAR_BIT)
static char buffer[bufferSize + 1];
unsigned int pos = 0;
/* We're walking backwards because numbers are right to left. */
char *p = buffer + bufferSize;
*p = '\0';
int negative = num < 0;
do
{
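/* num % base is negative (or zero) for negative num, so take the magnitude. */
int digit = (int)(num % base);
if(digit < 0) digit = -digit;
*(--p) = (char)('0' + digit);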
++pos;
num /= base;
/* This the last of a digit group? */
if(pos % 3 == 0)
{
/* TODO Make this a user setting. */
#ifndef IM_AMERICAN
# define IM_AMERICAN_BOOL 0
#else
# define IM_AMERICAN_BOOL 1
#endif
/* Skip the separator after the final group, and handle the special thousands case. */
if(num == 0 || (!IM_AMERICAN_BOOL && pos == 3 && num < base && num > -base))
{
/* DO NOTHING */
}
else
{
*(--p) = separator;
}
}
} while(num);
if(negative)
*(--p) = '-';
return p;
#undef bufferSize
}
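int main(int argc, char **argv)
{
while(argc > 1)
{
long num = 0;
/* Always advance to the next argument, even when this one fails to
parse; otherwise a bad argument would loop forever. */
if(sscanf(argv[1], "%ld", &num) == 1)
printf("%ld = %s\n", num, prettyNumber(num, 10, ' '));
--argc;
++argv;
}
return 0;
}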
Normally I'd return an alloc'd buffer, which would need to be free'd by the user. This addition is trivial.
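For illustration, that variant might look roughly like the sketch below. The name `pretty_number_alloc` is hypothetical, and for brevity it handles base 10 only and drops the IM_AMERICAN special case:

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variant: returns a malloc'd string the caller must free(),
   or NULL on allocation failure. Decimal only, for brevity. */
char *pretty_number_alloc(long num, char separator)
{
    /* One char per bit is far more than the decimal digits, separators
       and sign can ever need. */
    char scratch[sizeof(long) * CHAR_BIT + 1];
    char *p = scratch + sizeof scratch - 1;
    /* Unsigned negation is well defined even for LONG_MIN. */
    unsigned long mag = (num < 0) ? 0UL - (unsigned long)num : (unsigned long)num;
    unsigned int pos = 0;
    size_t size;
    char *result;

    *p = '\0';
    do {
        if (pos > 0 && pos % 3 == 0)
            *--p = separator;      /* separator only between digit groups */
        *--p = (char)('0' + mag % 10);
        mag /= 10;
        ++pos;
    } while (mag != 0);
    if (num < 0)
        *--p = '-';

    /* Copy the finished text out of the scratch buffer into heap memory. */
    size = strlen(p) + 1;
    result = malloc(size);
    if (result != NULL)
        memcpy(result, p, size);
    return result;
}
```

Since each call returns its own buffer, this version is also safe to call from several threads, at the cost of one allocation per call.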
I got bored and made this naive implementation in Perl. Works.
sub pretify {
my $num = $_[0];
my $numstring = sprintf( "%f", $num );
# Split into whole/decimal
my ( $whole, $decimal ) = ( $numstring =~ /(^\d*)(\.\d+)?/ );
my @chunks;
my $output = '';
# Pad whole into multiples of 3
$whole = q{ } x ( 3 - ( length $whole ) % 3 ) . $whole;
# Split into chunks of 3 characters.
@chunks = $whole =~ /(.{3})/g;
# Reassemble with commas
$output = join ',', @chunks;
if ($decimal) {
$output .= $decimal;
}
# Strip Padding ( and spurious commas )
$output =~ s/^[ ,]+//;
# Strip excess tailing zeros
$output =~ s/0+$//;
# Ending with . is ugly
$output =~ s/\.$//;
return $output;
}
print "\n", pretify 100000000000000000000000000.0000;
print "\n", pretify 10_202_030.45;
print "\n", pretify 10_101;
print "\n", pretify 0;
print "\n", pretify 0.1;
print "\n", pretify 0.0001;
print "\n";

Resources