Finding unique elements in a string array in C

C bothers me with its handling of strings. I have a pseudocode like this in my mind:
char *data[20];
char *tmp; int i,j;
for(i=0;i<20;i++) {
tmp = data[i];
for(j=1;j<20;j++)
{
if(strcmp(tmp,data[j]))
//then except the uniqueness, store them in elsewhere
}
}
But when I coded this, the results were bad (I handled all the memory allocation and other small details). The problem is obviously in the second loop, but I cannot think of a solution. How do I find the unique strings in an array?
Example input: abc def abe abc def deg
Unique ones: abc def abe deg should be found.

You could use qsort to force the duplicates next to each other. Once sorted, you only need to compare adjacent entries to find duplicates. The result is O(N log N) rather than (I think) O(N^2).
Here is the 15 minute lunchtime version with no error checking:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef struct {
int origpos;
char *value;
} SORT;
int qcmp(const void *x, const void *y) {
int res = strcmp( ((SORT*)x)->value, ((SORT*)y)->value );
if ( res != 0 )
return res;
else
// they are equal - use original position as tie breaker
return ( ((SORT*)x)->origpos - ((SORT*)y)->origpos );
}
int main( int argc, char* argv[] )
{
SORT *sorted;
char **orig;
int i;
int num = argc - 1;
orig = malloc( sizeof( char* ) * ( num ));
sorted = malloc( sizeof( SORT ) * ( num ));
for ( i = 0; i < num; i++ ) {
orig[i] = argv[i + 1];
sorted[i].value = argv[i + 1];
sorted[i].origpos = i;
}
qsort( sorted, num, sizeof( SORT ), qcmp );
// remove the dups (sorting left relative position same for dups)
for ( i = 0; i < num - 1; i++ ) {
if ( !strcmp( sorted[i].value, sorted[i+1].value ))
// clear the duplicate entry however you see fit
orig[sorted[i+1].origpos] = NULL; // or free it if dynamic mem
}
// print them without dups in original order
for ( i = 0; i < num; i++ )
if ( orig[i] )
printf( "%s ", orig[i] );
free( orig );
free( sorted );
}

char *data[20];
int i, j, n, unique[20];
n = 0;
for (i = 0; i < 20; ++i)
{
for (j = 0; j < n; ++j)
{
if (!strcmp(data[i], data[unique[j]]))
break;
}
if (j == n)
unique[n++] = i;
}
The indexes of the first occurrence of each unique string should be in unique[0..n-1] if I did that right.

Why are you starting the second loop from 1?
You should start it from i+1, i.e.
for(j=i+1;j<20;j++)
Like if the list is
abc
def
abc
abc
lop
then, when i == 4, tmp is "lop". But the second loop runs from 1 to 19, so at some point j is also 4, and data[4], which is "lop", is compared with tmp and matches. So although "lop" is unique, it will be flagged as repeated.
Hope it was helpful.

Think a bit more about your problem -- what you really want to do is look at the PREVIOUS strings to see if you've already seen it. So, for each string n, compare it to strings 0 through n-1.
print element 0 (it is unique)
for i = 1 to n
unique = 1
for j = 0 to i-1 (compare this element to the ones preceding it)
if element[i] == element[j]
unique = 0
break from loop
if unique, print element i

Might it be that your test is if (strcmp (this, that)) which will succeed if the two are different? !strcmp is probably what you want there.

Related

How does a recursive code determine if palindrome work?

I have a problem question and a snippet code below. The snippet is filled already because I found out the solution but I do not understand why it is like that. Could you help me explain how the codes work?
Problem: Ten tiles each have a string of between 1 and 4 letters on them (hardcoded in the code below). The goal of this problem is to complete the code below so that it counts the number of different orders in which all of the tiles can be placed such that the string they form is a palindrome (a word that reads the same forwards and backwards). All of main is provided, as well as the function eval, which determines if a particular ordering of the tiles forms a palindrome. You may call this function in the function go. Complete the recursive function (named go) to complete the solution.
Snippet code:
#include <stdio.h>
#include <string.h>
#define N 10
#define MAXLEN 5
int go(int perm[], int used[], int k, char tiles[N][MAXLEN]);
int eval(int perm[], char tiles[N][MAXLEN]);
char MYTILES[N][MAXLEN] = {
"at", "ta", "g", "cc", "ccac", "ca", "cc", "gag", "cga", "gc"
};
int
main(void)
{
int perm[N];
int used[N];
for (int i = 0; i < N; i++)
used[i] = 0;
int res = go(perm, used, 0, MYTILES);
printf("Number of tile orderings that create palindromes is %d\n", res);
return 0;
}
int
go(int perm[], int used[], int k, char tiles[N][MAXLEN])
{
if (k == N)
return eval(perm, tiles);
int res = 0;
for (int i = 0; i < N; i++) {
if (used[i])
continue;
used[i] = 1;
perm[k] = i;
res += go(perm, used, k + 1, tiles);
used[i] = 0;
}
return res;
}
int
eval(int perm[], char tiles[N][MAXLEN])
{
char tmp[N * MAXLEN];
int idx = 0;
for (int i = 0; i < N; i++) {
int len = strlen(tiles[perm[i]]);
for (int j = 0; j < len; j++)
tmp[idx++] = tiles[perm[i]][j];
}
tmp[idx] = '\0';
for (int i = 0; i < idx / 2; i++)
if (tmp[i] != tmp[idx - 1 - i])
return 0;
return 1;
}
Thank you. I appreciate all help!!
To understand this code, add the following line to the start of eval():
for( int j = 0; j < N; j++ ) printf( "%d ", perm[j] ); putchar('\n');
The for() loop in go() causes a recursion that is 10 levels deep, ultimately generating 10! (~3.6 million) permutations of the 10 indices from 0 to 9. In sequence, each of those permutations is used to concatenate the 'tokens' (the short ACTG variations) into a single string that is then tested for being palindromic by eval().
This is called a "brute force" search through the possibility space.
Below I've revised the code to be slightly more compact, adding two "printf debugging" lines (marked "/**/") that report what the program is doing. You'll need some patience if you wish to watch millions of permutations of 0 to 9 scroll by, or simply comment out that line and recompile. I also shuffled things around and made the two interesting arrays global instead of "whacking the stack" by passing them up/down the recursion. Less code is better. This program is "single purpose". The clarity gained justifies using global variables in this instance, imho.
More interesting is the additional puts() line that reports the palindromic sequences.
#include <stdio.h>
#include <string.h>
#define N 10
#define MAXLEN 5
char MYTILES[N][MAXLEN] = { "AT","TA","G","CC","CCAC","CA","CC","GAG","CGA","GC" };
int perm[N], used[N] = { 0 };
int go( int k ) {
if (k == N) {
// At extent of recursion here.
/**/ for( int j = 0; j < k; j++ ) printf( "%d ", perm[j] ); putchar('\n');
// Make a string in this sequence
char tmp[N*MAXLEN] = {0};
for( int i = 0; i < N; i++ )
strcat( tmp, MYTILES[ perm[ i ] ] );
// Test string for being palindromic
for( int l = 0, r = strlen( tmp ) - 1; l <= r; l++, r-- )
if( tmp[l] != tmp[r] )
return 0; // Not palindrome
/**/ puts( tmp );
return 1; // Is palindrome
}
// recursively generate permutations here
int res = 0;
for( int i = 0; i < N; i++ )
if( !used[i] ) {
used[i] = 1;
perm[k] = i;
res += go( k+1 );
used[i] = 0;
}
return res;
}
int main( void ) {
printf( "Palindromic tile orderings: %d\n", go( 0 ) );
return 0;
}
An immediate 'speed-up' would be to test that the first letter of the 0th string to be permuted matches the last letter of the 9th string... Don't bother concatenating if a palindrome is impossible from the get-go. Other optimisations are left as an exercise for the reader...
BTW: It's okay to make a copy of code and add your own print statements so that the program reports what it is doing when... Or, you could single-step through a debugger...
UPDATE
Having added a preliminary generation of a 10x10 matrix to 'gate' the workload of generating strings to be checked as palindromic, with the 10 OP supplied strings, it turns out that 72% of those operations were doomed to fail from the start. Of the 3.6 million "brute force" attempts, a quick reference to this pre-generated matrix prevented about 2.6 million of them.
It's worthwhile trying to make code efficient.
UPDATE #2:
Bothered that there was still a lot of 'fat' in the execution after trying to improve on the "brute force" in a simple way, I've redone some of the code.
Using a few extra global variables (the state of processing), the following now does some "preparation" in main(), then enters the recursion. In this version, once the string being assembled from fragments is over half complete (in length), it is checked from the "middle out" if it qualifies as being palindromic. If so, each appended fragment causes a re-test. If the string would never become a palindrome, the recursion 'backs-up' and tries another 'flavour' of permutation. This trims the possibility space immensely (and really speeds up the execution.)
#include <stdio.h>
#include <string.h>
#include <stdint.h>
char *Tiles[] = { "AT","TA","G","CC","CCAC","CA","CC","GAG","CGA","GC" };
const int nTiles = sizeof Tiles/sizeof Tiles[0];
int used[ nTiles ];
char buildBuf[ 1024 ], *cntrL, *cntrR; // A big buffer and 2 pointers.
int fullLen;
int cntTested, goCalls; // some counters to report iterations
uint32_t factorial( uint32_t n ) { // calc n! (max 12! to fit uint32_t)
uint32_t f = 1;
while( n ) f *= n--;
return f;
}
int hope() { // center outward testing for palindromic characteristics
int i;
for( i = 0; cntrL[ 0 - i ] == cntrR[ 0 + i ]; i++ ) ; // looping
return cntrR[ 0 + i ] == '\0';
}
int go( int k ) {
goCalls++;
if( k == nTiles ) { // at full extent of recursion here
// test string being palindromic (from ends toward middle for fun)
cntTested++;
for( int l = 0, r = fullLen - 1; l <= r; l++, r-- )
if( buildBuf[l] != buildBuf[r] )
return 0; // Not palindrome
/**/ puts( buildBuf );
return 1; // Is palindrome
}
// recursively generate permutations here
// instead of building from sequence of indices
// this builds the (global) sequence string right here
int res = 0;
char *at = buildBuf + strlen( buildBuf );
for( int i = 0; i < nTiles; i++ )
if( !used[i] ) {
strcpy( at, Tiles[ i ] );
// keep recursing until > half assembled and hope persists
if( at < cntrL || hope() ) {
used[i] = 1;
res += go( k+1 ); // go 'deeper' in the recursion
used[i] = 0;
}
}
return res;
}
int main( void ) {
for( int i = 0; i < nTiles; i++ )
fullLen += strlen( Tiles[i] );
if( fullLen % 2 == 0 ) // even count
cntrR = (cntrL = buildBuf + fullLen/2 - 1) + 1; // 24 ==> 0-11 & 12->23
else
cntrR = cntrL = buildBuf + fullLen/2; // 25 ==> 0-12 & 12->24
printf( "Palindromic tile orderings: %d\n", go( 0 ) );
printf( "Potential: %d\n", factorial( nTiles ) );
printf( "Calls to go(): %d\n", goCalls );
printf( "Actual: %d\n", cntTested );
return 0;
}
ATCCACGAGCCGCCGAGCACCTA
ATCCACGAGCCGCCGAGCACCTA
ATCCACGCCGAGAGCCGCACCTA
ATCCACGCCGAGAGCCGCACCTA
ATCACCGAGCCGCCGAGCCACTA
ATCACCGCCGAGAGCCGCCACTA
ATCACCGAGCCGCCGAGCCACTA
ATCACCGCCGAGAGCCGCCACTA
TACCACGAGCCGCCGAGCACCAT
TACCACGAGCCGCCGAGCACCAT
TACCACGCCGAGAGCCGCACCAT
TACCACGCCGAGAGCCGCACCAT
TACACCGAGCCGCCGAGCCACAT
TACACCGCCGAGAGCCGCCACAT
TACACCGAGCCGCCGAGCCACAT
TACACCGCCGAGAGCCGCCACAT
CCACATGAGCCGCCGAGTACACC
CCACATGAGCCGCCGAGTACACC
CCACATGCCGAGAGCCGTACACC
CCACATGCCGAGAGCCGTACACC
CCACTAGAGCCGCCGAGATCACC
CCACTAGAGCCGCCGAGATCACC
CCACTAGCCGAGAGCCGATCACC
CCACTAGCCGAGAGCCGATCACC
CACCATGAGCCGCCGAGTACCAC
CACCATGCCGAGAGCCGTACCAC
CACCTAGAGCCGCCGAGATCCAC
CACCTAGCCGAGAGCCGATCCAC
CACCATGAGCCGCCGAGTACCAC
CACCATGCCGAGAGCCGTACCAC
CACCTAGAGCCGCCGAGATCCAC
CACCTAGCCGAGAGCCGATCCAC
Palindromic tile orderings: 32
Potential: 3628800
Calls to go(): 96712
Actual: 32
UPDATE #3 (having fun)
When there's too much code, and an inefficient algorithm, it's easy to get lost and struggle to work out what is happening.
Below produces exactly the same results as above, but shaves a few more operations from the execution. In short, go() is called recursively until at least 1/2 of the candidate string has been built-up. At that point, hope() is asked to evaluate the string "from the middle, out." As long as the conditions of being palindromic (from the centre, outward) are being met, that evaluation is repeated as the string grows (via recursion) toward its fullest extent. It is the "bailing-out early" that makes this version far more efficient than the OP version.
One further 'refinement' is that the bottom of the recursion is found without an extra call to go(). Once one has the concepts of recursion and permutation, this should all be straightforward...
char *Tiles[] = { "AT", "TA", "G", "CC", "CCAC", "CA", "CC", "GAG", "CGA", "GC" };
const int nTiles = sizeof Tiles/sizeof Tiles[0];
int used[ nTiles ];
char out[ 1024 ], *cntrL, *cntrR;
int hope() { // center outward testing for palindromic characteristics
char *pL = cntrL, *pR = cntrR;
while( *pL == *pR ) pL--, pR++;
return *pR == '\0';
}
int go( int k ) {
int res = 0;
char *at = out + strlen( out );
for( size_t i = 0; i < nTiles; i++ )
if( !used[i] ) {
strcpy( at, Tiles[ i ] );
if( at >= cntrL && !hope() ) // abandon this string?
continue;
if( k+1 == nTiles ) { // At extent of recursion here.
puts( out );
return 1;
}
used[i] = 1, res += go( k+1 ), used[i] = 0;
}
return res;
}
int main( void ) {
int need = 0;
for( size_t i = 0; i < nTiles; i++ )
need += strlen( Tiles[ i ] );
cntrL = cntrR = out + need/2; // odd eg: 25 ==> 0-12 & 12->24
cntrL -= (need % 2 == 0 ); // but, if even eg: 24 ==> 0-11 & 12->23
printf( "Palindromic tile orderings: %d\n", go( 0 ) );
return 0;
}

Checking whether a string consists of two repetitions

I am writing a function that returns 1 if a string consists of two repetitions, 0 otherwise.
Example: If the string is "hellohello", the function will return 1 because the string consists of the same two words "hello" and "hello".
The first test I did was to use a nested for loop but after a bit of reasoning I thought that the idea is wrong and is not the right way to solve, here is the last function I wrote.
It is not correct, even if the string consists of two repetitions, it returns 0.
Also, I know this problem could be handled differently with a while loop following another algorithm, but I was wondering if it could be done with the for as well.
My idea would be to divide the string in half and check it character by character.
This is the last function I tried:
int doubleString(char *s){
int true=1;
char strNew[50];
for(int i=0;i<strlen(s)/2;i++){
strNew[i]=s[i];
}
for(int j=strlen(s)/2;j<strlen(s);j++){
if(!(strNew[j]==s[j])){
true=0;
}
}
return true;
}
The problem in your function is with the comparison in the second loop: you are using the j variable as an index for both the second half of the given string and for the index in the copied first half of that string. However, for that copied string, you need the indexes to start from zero, so you need to subtract the half length (strlen(s)/2) from j when accessing its individual characters.
Also, it is better to use the size_t type when looping through strings and comparing to the results of functions like strlen (which return that type). You can also improve your code by saving the strlen(s)/2 value, so it isn't computed on each loop. You can also dispense with your local true variable, returning 0 as soon as you find a mismatch, or 1 if the second loop completes without finding such a mismatch:
int doubleString(char* s)
{
char strNew[50] = { 0, };
size_t full_len = strlen(s);
size_t half_len = full_len / 2;
for (size_t i = 0; i < half_len; i++) {
strNew[i] = s[i];
}
for (size_t j = half_len; j < full_len; j++) {
if (strNew[j - half_len] != s[j]) { // x != y is clearer than !(x == y)
return 0;
}
}
return 1;
}
In fact, once you have appreciated why you need to subtract that "half length" from the j index of strNew, you can remove the need for that temporary copy completely and just use the modified j as an index into the original string:
int doubleString(char* s)
{
size_t full_len = strlen(s);
size_t half_len = full_len / 2;
for (size_t j = half_len; j < full_len; j++) {
if (s[j - half_len] != s[j]) { // x != y is clearer than !(x == y)
return 0;
}
}
return 1;
}
This loop
for(int j=strlen(s)/2;j<strlen(s);j++){
if(!(strNew[j]==s[j])){
true=0;
}
}
is incorrect. The index in the array strNew shall start from 0 instead of the value of the expression strlen( s ) / 2.
But in any case your approach is flawed, if only because it uses an intermediate array sized with the magic number 50, while the user can pass a string of any length to the function.
char strNew[50];
The function can look much simpler.
For example
int doubleString( const char *s )
{
int double_string = 0;
size_t n = 0;
if ( ( double_string = *s != '\0' && ( n = strlen( s ) ) % 2 == 0 ) )
{
double_string = memcmp( s, s + n / 2, n / 2 ) == 0;
}
return double_string;
}
That is, the function first checks that the passed string is not empty and that its length is even. If so, it compares the two halves of the string.
Here is a demonstration program.
#include <stdio.h>
#include <string.h>
int doubleString( const char *s )
{
int double_string = 0;
size_t n = 0;
if (( double_string = *s != '\0' && ( n = strlen( s ) ) % 2 == 0 ))
{
double_string = memcmp( s, s + n / 2, n / 2 ) == 0;
}
return double_string;
}
int main( void )
{
printf( "doubleString( \"\" ) = %d\n", doubleString( "" ) );
printf( "doubleString( \"HelloHello\" ) = %d\n", doubleString( "HelloHello" ) );
printf( "doubleString( \"Hello Hello\" ) = %d\n", doubleString( "Hello Hello" ) );
}
The program output is
doubleString( "" ) = 0
doubleString( "HelloHello" ) = 1
doubleString( "Hello Hello" ) = 0
Note that the function parameter should have the const qualifier because the passed string is not changed within the function. You will then be able to call the function with constant arrays without needing to define a second function for constant character arrays.
It's better to do it with a while loop, since you don't always have to iterate through all the elements of the string, but since you want the for loop version, here it is (C++ version):
int doubleString(string s){
int s_length = s.length();
if(s_length%2 != 0) {
return 0;
}
for (int i = 0; i < s_length/2; i++) {
if (s[i] != s[s_length/2 + i]){
return 0;
}
}
return 1;
}

Error: assignment to expression with array type while using selection-sort

Basically, I'm trying to sort an agenda with 3 names using selection sort method. Pretty sure the selection sort part is OK. The problem is that apparently my code can identify the [0] chars of the string, but cannot pass one string to another variable. Here is my code:
#include <stdio.h>
typedef struct{
char name[25];
} NAME;
int main(){
int a, b;
char x, y[25];
static NAME names[]={
{"Zumbazukiba"},
{"Ademiro"},
{"Haroldo Costa"}
};
for(a=0; a<4; a++){
x = names[a].name[0];
y = names[a];
for(b=(a-1); b>=0 && x<(names[b].name[0]); b--){
names[b+1] = names[b];
}
names[b+1].name = y;
}
}
I keep getting this error message:
main.c:21:11: error: assignment to expression with array type
y = names[a];
There are at least two errors in your code, in the line flagged by your compiler. First, you can't copy character strings (or, indeed, any other array type) using the simple assignment (=) operator in C - you need to use the strcpy function (which requires a #include <string.h> line in your code).
Second, you have declared y as a character array (char y[25]) but names is an array of NAME structures; presumably, you want to copy the name field of the given structure into y.
So, instead of:
y = names[a];
you should use:
strcpy(y, names[a].name);
Feel free to ask for further clarification and/or explanation.
For starters I do not see the selection sort. It seems you mean the insertion sort.
Arrays do not have the assignment operator. So statements like this
names[b+1].name = y;
where you are trying to assign an array are invalid.
And in statements like this
y = names[a];
you are trying to assign an object of the structure type to a character array.
Moreover the loops are also incorrect.
The array has only 3 elements. So it is unclear what the magic number 4 is doing in this loop
for(a=0; a<4; a++){
and this loop
for(b=(a-1); b>=0 && x<(names[b].name[0]); b--){
skips the first iteration when a is equal to 0.
Here is a demonstrative program that shows how the selection sort can be applied to elements of your array.
#include <stdio.h>
#include <string.h>
#define LENGTH 25
typedef struct
{
char name[LENGTH];
} NAME;
int main(void)
{
NAME names[] =
{
{ "Zumbazukiba" },
{ "Ademiro" },
{ "Haroldo Costa" }
};
const size_t N = sizeof( names ) / sizeof( *names );
for ( size_t i = 0; i < N; i++ )
{
puts( names[i].name );
}
putchar( '\n' );
for ( size_t i = 0; i < N; i++ )
{
size_t min = i;
for ( size_t j = i + 1; j < N; j++ )
{
if ( strcmp( names[j].name, names[min].name ) < 0 )
{
min = j;
}
}
if ( i != min )
{
NAME tmp = names[i];
names[i] = names[min];
names[min] = tmp;
}
}
for ( size_t i = 0; i < N; i++ )
{
puts( names[i].name );
}
putchar( '\n' );
return 0;
}
The program output is
Zumbazukiba
Ademiro
Haroldo Costa
Ademiro
Haroldo Costa
Zumbazukiba

Array rotation to the left (recursive)

Disclaimer: it is an exercise, but it's not homework.
Now, here we go. The exercise asks for the rotation of a generic array to the left, putting the first element in the last position, and doing it so with recursion. My thoughts:
Here's the right rotation one I've written:
void moveArrayRight (int array[], int dim){
if(dim!=1){
int holder;
holder = array[dim-1];
array[dim-1]=array[dim-2];
array[dim-2]=holder;
moveArrayRight(array, dim-1);
}
}
The thing is: I cannot (I think) use the same technique for the left one. I could add another parameter (technically, I could use whatever I want to), but I deeply dislike it. If possible, I would like to retain only two parameters. I also thought of doing something like using the last element of the array to store what is going to be in the next cell, but I don't know how to implement it mainly for a reason: I have no idea how to retain the original dimension of the array.
Any thoughts, hints or something like that?
void rotate_left( int a[], size_t n )
{
if ( n > 1 )
{
int tmp = a[0];
a[0] = a[1];
a[1] = tmp;
rotate_left( a + 1, n - 1 );
}
}
Here is an example of the function usage
#include <stdio.h>
void rotate_left( int a[], size_t n )
{
if ( n > 1 )
{
int tmp = a[0];
a[0] = a[1];
a[1] = tmp;
rotate_left( a + 1, n - 1 );
}
}
int main( void )
{
int a[] = { 1, 2, 3, 4, 5 };
for ( size_t i = 0; i < sizeof( a ) / sizeof( *a ); i++ ) printf( "%d ", a[i] );
puts( "" );
rotate_left( a, 5 );
for ( size_t i = 0; i < sizeof( a ) / sizeof( *a ); i++ ) printf( "%d ", a[i] );
puts( "" );
return 0;
}
The output is
1 2 3 4 5
2 3 4 5 1
To use recursion effectively you would split the work, for example in the middle, instead of just slicing off one item at a time. Something like:
function shiftLeft(arr, start, len, carry) {
var result;
if (len == 1) {
result = arr[start];
arr[start] = carry;
} else {
var half = Math.floor(len / 2);
carry = shiftLeft(arr, start + half, len - half, carry);
result = shiftLeft(arr, start, half, carry);
}
return result;
}
Usage:
shiftLeft(arr, 0, arr.length, arr[0]);
(Disclaimer: The code is not tested and might have bugs, I wrote this on my phone.)

Optimization O(n^2) to O(n) (Unsorted String)

I'm having an optimization problem here. I would like to have this code running in O(n), which I tried for several hours now.
Byte-arrays c contains a string, e contains the same string, but sorted. Int-arrays nc and ne contain the indexes within the string, eg
c:
s l e e p i n g
nc:
0 0 0 1 0 0 0 0
e:
e e g i l n p s
ne:
0 1 0 0 0 0 0 0
The problem now is that get_next_index is linear - is there a way to solve this?
void decode_block(int p) {
BYTE xj = c[p];
int nxj = nc[p];
for (int i = 0; i < block_size; i++) {
result[i] = xj;
int q = get_next_index(xj, nxj, c, nc);
xj = e[q];
nxj = ne[q];
}
fwrite(result, sizeof (BYTE), block_size, stdout);
fflush(stdout);
}
int get_next_index(BYTE xj, int nxj, BYTE* c, int* nc) {
int i = 0;
while ( ( xj != c[i] ) || ( nxj != nc[i] ) ) {
i++;
}
return i;
}
This is part of a Burrows-Wheeler implementation.
It starts with
xj = c[p]
nxj = nc[p]
Next I have to block_size (= length c = length nc = length e = length ne) times
store the result xj in result
find the number index for which c[i] == xj
xj is now e[i]
ne and nc are only used to make sure that every character in e and c is unique (e_0 != e_1).
Since your universe (i.e. a char) is small, I think you can get away with linear time. You need a lookup table whose entries are linked lists (or any other sequence container) for this.
First you go through your sorted string and populate a lookup table that allows you to find the first list element for a given character. For instance, your lookup table could look like std::array<std::list<size_t>,(1<<CHAR_BIT)> lookup. If you don't want a list, you can also use a std::deque or even a std::pair<std::vector<size_t>,size_t>, where the second item represents the index of the first valid entry of the vector (that way you don't need to pop the element later on, but just increment the index).
So for each element c in your sorted string, you append its index to the container in lookup[c].
Now, when you iterate over your unsorted array, for each element, you can lookup the corresponding index in your lookup table. Once you're done, you pop the front element in the lookup table.
All in all this is linear time and space.
To clarify; When initialising the lookup table:
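// Instead of a list, a deque will likely perform better,
// but you have to test this yourself in your particular case.
// One bucket per possible char value; CHAR_BIT comes from <climits>.
std::array<std::list<size_t>,(1<<CHAR_BIT)> lookup;
for (size_t i = 0; i < sortedLength; i++) {
// index through unsigned char so a negative char cannot index out of range
lookup[static_cast<unsigned char>(sorted[i])].push_back(i);
}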
When finding the "first index" for the index i in the unsorted array:
size_t const j = lookup[unsorted[i]].front();
lookup[unsorted[i]].pop_front();
return j;
Scan xj and nxj once and build a lookup table. These are two O(n) operations.
The most sensible way would be to have a binary tree, sorted on the value of xj or nxj. The node would contain your sought index. This would reduce your lookup to O(lg n).
Here is my complete implementation of the Burrows-Wheeler transform:
u8* bwtCompareBuf;
u32 bwtCompareLen;
s32 bwtCompare( const void* v1, const void* v2 )
{
u8* c1 = bwtCompareBuf + ((u32*)v1)[0];
u8* c2 = bwtCompareBuf + ((u32*)v2)[0];
for ( u32 i = 0; i < bwtCompareLen; i++ )
{
if ( c1[i] < c2[i] ) return -1;
if ( c1[i] > c2[i] ) return +1;
}
return 0;
}
void bwtEncode( u8* inputBuffer, u32 len, u32& first )
{
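// doubled copy of the input, so that every rotation is a contiguous run of len bytes
u8* tmpBuf = (u8*)alloca( len * 2 );
memcpy( tmpBuf, inputBuffer, len );
memcpy( tmpBuf + len, inputBuffer, len );
u32* indices = new u32[len];
for ( u32 i = 0; i < len; i++ ) indices[i] = i;
bwtCompareBuf = tmpBuf;
bwtCompareLen = len;
qsort( indices, len, sizeof( u32 ), bwtCompare );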
u8* tbuf = (u8*)tmpBuf + ( len - 1 );
for ( u32 i = 0; i < len; i++ )
{
u32 idx = indices[i];
if ( idx == 0 ) idx = len;
inputBuffer[i] = tbuf[idx];
if ( indices[i] == 1 ) first = i;
}
delete[] indices;
}
void bwtDecode( u8* inputBuffer, u32 len, u32 first )
{
// To determine a character's position in the output string given
// its position in the input string, we can use the knowledge about
// the fact that the output string is sorted. Each character 'c' will
// show up in the output stream in position i, where i is the sum
// total of all characters in the input buffer that precede c in the
// alphabet, plus the count of all occurrences of 'c' previously in the
// input stream.
// compute the frequency of each character in the input buffer
u32 freq[256] = { 0 };
u32 count[256] = { 0 };
for ( u32 i = 0; i < len; i++ )
freq[inputBuffer[i]]++;
// freq now holds a running total of all the characters less than i
// in the input stream
u32 sum = 0;
for ( u32 i = 0; i < 256; i++ )
{
u32 tmp = sum;
sum += freq[i];
freq[i] = tmp;
}
// Now that the freq[] array is filled in, I have half the
// information needed to position each 'c' in the input buffer. The
// next piece of information is simply the number of characters 'c'
// that appear before this 'c' in the input stream. I keep track of
// that information in the count[] array as I go. By adding those
// two numbers together, I get the destination of each character in
// the input buffer, and I just write it directly to the destination.
u32* trans = new u32[len];
for ( u32 i = 0; i < len; i++ )
{
u32 ch = inputBuffer[i];
trans[count[ch] + freq[ch]] = i;
count[ch]++;
}
u32 idx = first;
s8* tbuf = alloca( len );
memcpy( tbuf, inputBuffer, len );
u8* srcBuf = (u8*)tbuf;
for ( u32 i = 0; i < len; i++ )
{
inputBuffer[i] = srcBuf[idx];
idx = trans[idx];
}
delete[] trans;
}
The decode runs in O(n).
