Optimization O(n^2) to O(n) (Unsorted String) - c

I'm having an optimization problem here. I would like this code to run in O(n), which I have been trying to achieve for several hours now.
Byte-arrays c contains a string, e contains the same string, but sorted. Int-arrays nc and ne contain the indexes within the string, eg
c:
s l e e p i n g
nc:
0 0 0 1 0 0 0 0
e:
e e g i l n p s
ne:
0 1 0 0 0 0 0 0
The problem now is that get_next_index is linear - is there a way to solve this?
void decode_block(int p) {
BYTE xj = c[p];
int nxj = nc[p];
for (int i = 0; i < block_size; i++) {
result[i] = xj;
int q = get_next_index(xj, nxj, c, nc);
xj = e[q];
nxj = ne[q];
}
fwrite(result, sizeof (BYTE), block_size, stdout);
fflush(stdout);
}
int get_next_index(BYTE xj, int nxj, BYTE* c, int* nc) {
int i = 0;
while ( ( xj != c[i] ) || ( nxj != nc[i] ) ) {
i++;
}
return i;
}
This is part of a Burrows-Wheeler implementation.
It starts with
xj = c[p]
nxj = nc[p]
Next I have to block_size (= length c = length nc = length e = length ne) times
store the result xj in result
find the number index for which c[i] == xj
xj is now e[i]
ne and nc are only used to make sure that every character in e and c is unique (e_0 != e_1).

Since your universe (i.e. a char) is small, I think you can get away with linear time. You need a linked list (or any other sequence container) and a lookup table for this.
First you go through your sorted string and populate a lookup table that allows you to find the first list element for a given character. For instance, your lookup table could look like std::array<std::list<size_t>, (1 << CHAR_BIT)> lookup, i.e. one list per possible char value. If you don't want a list, you can also use a std::deque or even a std::pair<std::vector<size_t>, size_t>, where the second item represents the index of the first valid entry of the vector (that way you don't need to pop the element later on, but just increment the index).
So for each element c in your sorted string, you append its index to your container in lookup[c].
Now, when you iterate over your unsorted array, for each element, you can look up the corresponding index in your lookup table. Once you're done, you pop the front element in the lookup table.
All in all this is linear time and space.
To clarify; When initialising the lookup table:
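// Instead of a list, a deque will likely perform better,
// but you have to test this yourself in your particular case.
// One bucket per possible char value: 1 << CHAR_BIT (from <climits>),
// not 1 << sizeof(char), which is only 2.
std::array<std::list<size_t>, (1 << CHAR_BIT)> lookup;
for (size_t i = 0; i < sortedLength; i++) {
    // The cast keeps a negative (signed) char from indexing out of bounds.
    lookup[static_cast<unsigned char>(sorted[i])].push_back(i);
}
When finding the "first index" for the index i in the unsorted array:
auto& bucket = lookup[static_cast<unsigned char>(unsorted[i])];
size_t const j = bucket.front();
bucket.pop_front();
return j;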
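Since the question is tagged C, the same idea can also be sketched without C++ containers. This is only a sketch under a few assumptions: the names build_lookup, get_next_index_fast, start and index_of are made up for illustration, BYTE is taken to be unsigned char, and nc[i] is assumed to be the occurrence rank of c[i] among equal characters, as in the question's example. Instead of one list per character, it uses a single index array partitioned by character, in the style of a counting sort.

```c
#include <stddef.h>

#define ALPHABET_SIZE 256   /* assumes 8-bit bytes */

/* Precompute, in two O(n) passes, where each (character, rank) pair
 * lives in c. start[ch] is the number of bytes in c smaller than ch;
 * index_of must have room for n entries. */
void build_lookup(const unsigned char *c, const int *nc, size_t n,
                  size_t start[ALPHABET_SIZE], size_t *index_of)
{
    size_t counts[ALPHABET_SIZE] = { 0 };
    for (size_t i = 0; i < n; i++)
        counts[c[i]]++;

    /* Exclusive prefix sums: first slot belonging to each character. */
    size_t sum = 0;
    for (size_t ch = 0; ch < ALPHABET_SIZE; ch++) {
        start[ch] = sum;
        sum += counts[ch];
    }

    /* Slot start[ch] + rank remembers the position of that occurrence. */
    for (size_t i = 0; i < n; i++)
        index_of[start[c[i]] + (size_t)nc[i]] = i;
}

/* O(1) replacement for the linear scan in get_next_index. */
size_t get_next_index_fast(unsigned char xj, int nxj,
                           const size_t start[ALPHABET_SIZE],
                           const size_t *index_of)
{
    return index_of[start[xj] + (size_t)nxj];
}
```

With this, decode_block would call build_lookup once per block and get_next_index_fast inside its loop, so the whole block is decoded in linear time.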

Scan xj and nxj once and build a lookup table. This is two O(n) operations.
The most sensible way would be to have a binary tree, sorted on the value of xj or nxj. The node would contain your sought index. This would reduce your lookup to O(lg n).
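As a rough illustration of the O(lg n) lookup, here is a sketch that swaps the binary tree for a sorted array searched with the standard bsearch(), which gives the same lookup cost with less code. The Entry type and the names build_entries and find_index are hypothetical, and the byte type is assumed to be unsigned char. Note that sorting makes the preparation O(n lg n) rather than O(n).

```c
#include <stdlib.h>

/* One record per position in c: the character, its occurrence rank,
 * and the index we want to get back. */
typedef struct {
    unsigned char ch;
    int rank;
    size_t index;
} Entry;

/* Order entries by character first, then by rank. */
static int compare_entries(const void *a, const void *b)
{
    const Entry *x = (const Entry *)a;
    const Entry *y = (const Entry *)b;
    if (x->ch != y->ch)
        return x->ch < y->ch ? -1 : 1;
    if (x->rank != y->rank)
        return x->rank < y->rank ? -1 : 1;
    return 0;
}

/* Fill entries[0..n-1] from c and nc, then sort them. */
void build_entries(const unsigned char *c, const int *nc, size_t n,
                   Entry *entries)
{
    for (size_t i = 0; i < n; i++) {
        entries[i].ch = c[i];
        entries[i].rank = nc[i];
        entries[i].index = i;
    }
    qsort(entries, n, sizeof entries[0], compare_entries);
}

/* Binary search for (xj, nxj); returns n when there is no such pair. */
size_t find_index(const Entry *entries, size_t n, unsigned char xj, int nxj)
{
    Entry key;
    key.ch = xj;
    key.rank = nxj;
    key.index = 0;
    const Entry *hit = (const Entry *)bsearch(&key, entries, n,
                                              sizeof entries[0],
                                              compare_entries);
    return hit ? hit->index : n;
}
```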

Here is my complete implementation of the Burrows-Wheeler transform:
u8* bwtCompareBuf;
u32 bwtCompareLen;
s32 bwtCompare( const void* v1, const void* v2 )
{
u8* c1 = bwtCompareBuf + ((u32*)v1)[0];
u8* c2 = bwtCompareBuf + ((u32*)v2)[0];
for ( u32 i = 0; i < bwtCompareLen; i++ )
{
if ( c1[i] < c2[i] ) return -1;
if ( c1[i] > c2[i] ) return +1;
}
return 0;
}
void bwtEncode( u8* inputBuffer, u32 len, u32& first )
{
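// The listing never fills tmpBuf; presumably it holds the input twice in a
// row, so that every rotation is a contiguous run of len bytes.
u8* tmpBuf = (u8*)alloca( len * 2 );
memcpy( tmpBuf, inputBuffer, len );
memcpy( tmpBuf + len, inputBuffer, len );
u32* indices = new u32[len];
for ( u32 i = 0; i < len; i++ ) indices[i] = i;
bwtCompareBuf = tmpBuf;
bwtCompareLen = len;
// indices is a raw array, so pass the pointer itself (it has no .data())
qsort( indices, len, sizeof( u32 ), bwtCompare );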
u8* tbuf = (u8*)tmpBuf + ( len - 1 );
for ( u32 i = 0; i < len; i++ )
{
u32 idx = indices[i];
if ( idx == 0 ) idx = len;
inputBuffer[i] = tbuf[idx];
if ( indices[i] == 1 ) first = i;
}
delete[] indices;
}
void bwtDecode( u8* inputBuffer, u32 len, u32 first )
{
// To determine a character's position in the output string given
// its position in the input string, we can use the fact that the
// output string is sorted. Each character 'c' will show up in the
// output stream in position i, where i is the total number of
// characters in the input buffer that precede c in the alphabet,
// plus the count of all occurrences of 'c' previously in the
// input stream.
// compute the frequency of each character in the input buffer
u32 freq[256] = { 0 };
u32 count[256] = { 0 };
for ( u32 i = 0; i < len; i++ )
freq[inputBuffer[i]]++;
// turn freq[] into running totals: after this loop, freq[i] is the
// number of characters in the input stream that are less than i
u32 sum = 0;
for ( u32 i = 0; i < 256; i++ )
{
u32 tmp = sum;
sum += freq[i];
freq[i] = tmp;
}
// Now that the freq[] array is filled in, I have half the
// information needed to position each 'c' in the input buffer. The
// next piece of information is simply the number of characters 'c'
// that appear before this 'c' in the input stream. I keep track of
// that information in the count[] array as I go. By adding those
// two numbers together, I get the destination of each character in
// the input buffer, and I just write it directly to the destination.
u32* trans = new u32[len];
for ( u32 i = 0; i < len; i++ )
{
u32 ch = inputBuffer[i];
trans[count[ch] + freq[ch]] = i;
count[ch]++;
}
u32 idx = first;
s8* tbuf = (s8*)alloca( len ); // C++ needs the cast from void*
memcpy( tbuf, inputBuffer, len );
u8* srcBuf = (u8*)tbuf;
for ( u32 i = 0; i < len; i++ )
{
inputBuffer[i] = srcBuf[idx];
idx = trans[idx];
}
delete[] trans;
}
The decode runs in O(n).

Related

How does a recursive code determine if palindrome work?

I have a problem statement and a code snippet below. The snippet is already filled in because I found the solution, but I do not understand why it works. Could you help explain how the code works?
Problem: Ten tiles each have strings of between 1 and 4 letters on them (hardcoded in the code below). The goal of this problem is to complete the code below so it counts the number of different orders in which all of the tiles can be placed such that the string they form creates a palindrome (a word that reads the same forwards and backwards). All of main is provided, as well as the function eval, which determines if a particular ordering of the tiles forms a palindrome. You may call this function in the function go. Complete the recursive function (named go) to complete the solution.
Snippet code:
#include <stdio.h>
#include <string.h>
#define N 10
#define MAXLEN 5
int go(int perm[], int used[], int k, char tiles[N][MAXLEN]);
int eval(int perm[], char tiles[N][MAXLEN]);
char MYTILES[N][MAXLEN] = {
"at", "ta", "g", "cc", "ccac", "ca", "cc", "gag", "cga", "gc"
};
int
main(void)
{
int perm[N];
int used[N];
for (int i = 0; i < N; i++)
used[i] = 0;
int res = go(perm, used, 0, MYTILES);
printf("Number of tile orderings that create palindromes is %d\n", res);
return 0;
}
int
go(int perm[], int used[], int k, char tiles[N][MAXLEN])
{
if (k == N)
return eval(perm, tiles);
int res = 0;
for (int i = 0; i < N; i++) {
if (used[i])
continue;
used[i] = 1;
perm[k] = i;
res += go(perm, used, k + 1, tiles);
used[i] = 0;
}
return res;
}
int
eval(int perm[], char tiles[N][MAXLEN])
{
char tmp[N * MAXLEN];
int idx = 0;
for (int i = 0; i < N; i++) {
int len = strlen(tiles[perm[i]]);
for (int j = 0; j < len; j++)
tmp[idx++] = tiles[perm[i]][j];
}
tmp[idx] = '\0';
for (int i = 0; i < idx / 2; i++)
if (tmp[i] != tmp[idx - 1 - i])
return 0;
return 1;
}
Thank you. I appreciate all help!!
To understand this code, add the following line to the start of eval():
for( int j = 0; j < N; j++ ) printf( "%d ", perm[j] ); putchar('\n');
The for() loop in go() causes a recursion that is 10 levels deep, ultimately generating 10! (~3.6 million) permutations of the 10 indices from 0 to 9. In sequence, each of those permutations is used to concatenate the 'tokens' (the short ACTG variations) into a single string that is then tested for being palindromic by eval().
This is called a "brute force" search through the possibility space.
Below I've revised the code to be slightly more compact, adding two "printf debugging" lines (marked "/**/") that report what the program is doing. You'll need some patience if you wish to watch millions of permutations of 0 to 9 scroll by, or simply comment out that line and recompile. I also shuffled things around and made the two interesting arrays global instead of "whacking the stack" by passing them up/down the recursion. Less code is better. This program is "single purpose". The clarity gained justifies using global variables in this instance, imho.
More interesting is the additional puts() line that reports the palindromic sequences.
#include <stdio.h>
#include <string.h>
#define N 10
#define MAXLEN 5
char MYTILES[N][MAXLEN] = { "AT","TA","G","CC","CCAC","CA","CC","GAG","CGA","GC" };
int perm[N], used[N] = { 0 };
int go( int k ) {
if (k == N) {
// At extent of recursion here.
/**/ for( int j = 0; j < k; j++ ) printf( "%d ", perm[j] ); putchar('\n');
// Make a string in this sequence
char tmp[N*MAXLEN] = {0};
for( int i = 0; i < N; i++ )
strcat( tmp, MYTILES[ perm[ i ] ] );
// Test string for being palindromic
for( int l = 0, r = strlen( tmp ) - 1; l <= r; l++, r-- )
if( tmp[l] != tmp[r] )
return 0; // Not a palindrome
/**/ puts( tmp );
return 1; // Is a palindrome
}
// recursively generate permutations here
int res = 0;
for( int i = 0; i < N; i++ )
if( !used[i] ) {
used[i] = 1;
perm[k] = i;
res += go( k+1 );
used[i] = 0;
}
return res;
}
int main( void ) {
printf( "Palindromic tile orderings: %d\n", go( 0 ) );
return 0;
}
An immediate 'speed-up' would be to test that the first letter of the 0th string to be permuted matches the last letter of the 9th string... Don't bother concatenating if a palindrome is impossible from the get-go. Other optimisations are left as an exercise for the reader...
BTW: It's okay to make a copy of code and add your own print statements so that the program reports what it is doing when... Or, you could single-step through a debugger...
UPDATE
Having added a preliminary generation of a 10x10 matrix to 'gate' the workload of generating strings to be checked as palindromic, with the 10 OP supplied strings, it turns out that 72% of those operations were doomed to fail from the start. Of the 3.6 million "brute force" attempts, a quick reference to this pre-generated matrix prevented about 2.6 million of them.
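The answer does not show how its matrix was built, so the following is only a guess at what such a gate might look like. It assumes non-empty tiles and checks just one necessary condition: the first letter of the tile placed first must equal the last letter of the tile placed last. The names build_gate and gate are hypothetical.

```c
#include <string.h>

#define N 10
#define MAXLEN 5

/* gate[i][j] is 1 when an ordering that starts with tile i and ends
 * with tile j could still be a palindrome, judged only by its two
 * outermost letters. A tile cannot be both first and last. */
void build_gate(char tiles[N][MAXLEN], int gate[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            size_t len_j = strlen(tiles[j]);
            gate[i][j] = (i != j)
                      && len_j > 0
                      && tiles[i][0] == tiles[j][len_j - 1];
        }
    }
}
```

A caller could consult gate[perm[0]][perm[N - 1]] before concatenating anything, and skip the string building whenever the entry is 0.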
It's worthwhile trying to make code efficient.
UPDATE #2:
Bothered that there was still a lot of 'fat' in the execution after trying to improve on the "brute force" in a simple way, I've redone some of the code.
Using a few extra global variables (the state of processing), the following now does some "preparation" in main(), then enters the recursion. In this version, once the string being assembled from fragments is over half complete (in length), it is checked from the "middle out" if it qualifies as being palindromic. If so, each appended fragment causes a re-test. If the string would never become a palindrome, the recursion 'backs-up' and tries another 'flavour' of permutation. This trims the possibility space immensely (and really speeds up the execution.)
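#include <stdio.h>
#include <string.h>
#include <stdint.h>
char *Tiles[] = { "AT","TA","G","CC","CCAC","CA","CC","GAG","CGA","GC" };
// enum rather than const int: standard C needs a constant expression to size a file-scope array
enum { nTiles = sizeof Tiles/sizeof Tiles[0] };
int used[ nTiles ];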
char buildBuf[ 1024 ], *cntrL, *cntrR; // A big buffer and 2 pointers.
int fullLen;
int cntTested, goCalls; // some counters to report iterations
uint32_t factorial( uint32_t n ) { // calc n! (max 12! to fit uint32_t)
uint32_t f = 1;
while( n ) f *= n--;
return f;
}
int hope() { // center outward testing for palindromic characteristics
int i;
// stop at the terminator so cntrL[ -i ] is never read before the start of buildBuf
for( i = 0; cntrR[ i ] != '\0' && cntrL[ -i ] == cntrR[ i ]; i++ ) ; // looping
return cntrR[ i ] == '\0';
}
int go( int k ) {
goCalls++;
if( k == nTiles ) { // at full extent of recursion here
// test string being palindromic (from ends toward middle for fun)
cntTested++;
for( int l = 0, r = fullLen - 1; l <= r; l++, r-- )
if( buildBuf[l] != buildBuf[r] )
return 0; // Not palindrome
/**/ puts( buildBuf );
return 1; // Is palindrome
}
// recursively generate permutations here
// instead of building from sequence of indices
// this builds the (global) sequence string right here
int res = 0;
char *at = buildBuf + strlen( buildBuf );
for( int i = 0; i < nTiles; i++ )
if( !used[i] ) {
strcpy( at, Tiles[ i ] );
// keep recursing until > half assembled and hope persists
if( at < cntrL || hope() ) {
used[i] = 1;
res += go( k+1 ); // go 'deeper' in the recursion
used[i] = 0;
}
}
return res;
}
int main( void ) {
for( int i = 0; i < nTiles; i++ )
fullLen += strlen( Tiles[i] );
if( fullLen % 2 == 0 ) // even count
cntrR = (cntrL = buildBuf + fullLen/2 - 1) + 1; // 24 ==> 0-11 & 12->23
else
cntrR = cntrL = buildBuf + fullLen/2; // 25 ==> 0-12 & 12->24
printf( "Palindromic tile orderings: %d\n", go( 0 ) );
printf( "Potential: %u\n", (unsigned)factorial( nTiles ) );
printf( "Calls to go(): %d\n", goCalls );
printf( "Actual: %d\n", cntTested );
return 0;
}
ATCCACGAGCCGCCGAGCACCTA
ATCCACGAGCCGCCGAGCACCTA
ATCCACGCCGAGAGCCGCACCTA
ATCCACGCCGAGAGCCGCACCTA
ATCACCGAGCCGCCGAGCCACTA
ATCACCGCCGAGAGCCGCCACTA
ATCACCGAGCCGCCGAGCCACTA
ATCACCGCCGAGAGCCGCCACTA
TACCACGAGCCGCCGAGCACCAT
TACCACGAGCCGCCGAGCACCAT
TACCACGCCGAGAGCCGCACCAT
TACCACGCCGAGAGCCGCACCAT
TACACCGAGCCGCCGAGCCACAT
TACACCGCCGAGAGCCGCCACAT
TACACCGAGCCGCCGAGCCACAT
TACACCGCCGAGAGCCGCCACAT
CCACATGAGCCGCCGAGTACACC
CCACATGAGCCGCCGAGTACACC
CCACATGCCGAGAGCCGTACACC
CCACATGCCGAGAGCCGTACACC
CCACTAGAGCCGCCGAGATCACC
CCACTAGAGCCGCCGAGATCACC
CCACTAGCCGAGAGCCGATCACC
CCACTAGCCGAGAGCCGATCACC
CACCATGAGCCGCCGAGTACCAC
CACCATGCCGAGAGCCGTACCAC
CACCTAGAGCCGCCGAGATCCAC
CACCTAGCCGAGAGCCGATCCAC
CACCATGAGCCGCCGAGTACCAC
CACCATGCCGAGAGCCGTACCAC
CACCTAGAGCCGCCGAGATCCAC
CACCTAGCCGAGAGCCGATCCAC
Palindromic tile orderings: 32
Potential: 3628800
Calls to go(): 96712
Actual: 32
UPDATE #3 (having fun)
When there's too much code, and an inefficient algorithm, it's easy to get lost and struggle to work out what is happening.
Below produces exactly the same results as above, but shaves a few more operations from the execution. In short, go() is called recursively until at least 1/2 of the candidate string has been built-up. At that point, hope() is asked to evaluate the string "from the middle, out." As long as the conditions of being palindromic (from the centre, outward) are being met, that evaluation is repeated as the string grows (via recursion) toward its fullest extent. It is the "bailing-out early" that makes this version far more efficient than the OP version.
One further 'refinement' is that the bottom of the recursion is found without an extra call to go(). Once one has the concepts of recursion and permutation, this should all be straightforward...
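#include <stdio.h>
#include <string.h>
char *Tiles[] = { "AT", "TA", "G", "CC", "CCAC", "CA", "CC", "GAG", "CGA", "GC" };
// enum rather than const int: standard C needs a constant expression to size a file-scope array
enum { nTiles = sizeof Tiles/sizeof Tiles[0] };
int used[ nTiles ];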
char out[ 1024 ], *cntrL, *cntrR;
int hope() { // center outward testing for palindromic characteristics
char *pL = cntrL, *pR = cntrR;
// stop at the terminator so pL never walks off the front of out
while( *pR != '\0' && *pL == *pR ) pL--, pR++;
return *pR == '\0';
}
int go( int k ) {
int res = 0;
char *at = out + strlen( out );
for( size_t i = 0; i < nTiles; i++ )
if( !used[i] ) {
strcpy( at, Tiles[ i ] );
if( at >= cntrL && !hope() ) // abandon this string?
continue;
if( k+1 == nTiles ) { // At extent of recursion here.
puts( out );
return 1;
}
used[i] = 1, res += go( k+1 ), used[i] = 0;
}
return res;
}
int main( void ) {
int need = 0;
for( size_t i = 0; i < nTiles; i++ )
need += strlen( Tiles[ i ] );
cntrL = cntrR = out + need/2; // odd eg: 25 ==> 0-12 & 12->24
cntrL -= (need % 2 == 0 ); // but, if even eg: 24 ==> 0-11 & 12->23
printf( "Palindromic tile orderings: %d\n", go( 0 ) );
return 0;
}

Checking whether a string consists of two repetitions

I am writing a function that returns 1 if a string consists of two repetitions, 0 otherwise.
Example: If the string is "hellohello", the function will return 1 because the string consists of the same two words "hello" and "hello".
The first test I did was to use a nested for loop but after a bit of reasoning I thought that the idea is wrong and is not the right way to solve, here is the last function I wrote.
It is not correct, even if the string consists of two repetitions, it returns 0.
Also, I know this problem could be handled differently with a while loop following another algorithm, but I was wondering if it could be done with the for as well.
My idea would be to divide the string in half and check it character by character.
This is the last function I tried:
int doubleString(char *s){
int true=1;
char strNew[50];
for(int i=0;i<strlen(s)/2;i++){
strNew[i]=s[i];
}
for(int j=strlen(s)/2;j<strlen(s);j++){
if(!(strNew[j]==s[j])){
true=0;
}
}
return true;
}
The problem in your function is with the comparison in the second loop: you are using the j variable as an index for both the second half of the given string and for the index in the copied first half of that string. However, for that copied string, you need the indexes to start from zero, so you need to subtract the strlen(s)/2 value from j when accessing its individual characters.
Also, it is better to use the size_t type when looping through strings and comparing to the results of functions like strlen (which return that type). You can also improve your code by saving the strlen(s)/2 value, so it isn't computed on each loop. You can also dispense with your local true variable, returning 0 as soon as you find a mismatch, or 1 if the second loop completes without finding such a mismatch:
int doubleString(char* s)
{
char strNew[50] = { 0, };
size_t full_len = strlen(s);
size_t half_len = full_len / 2;
for (size_t i = 0; i < half_len; i++) {
strNew[i] = s[i];
}
for (size_t j = half_len; j < full_len; j++) {
if (strNew[j - half_len] != s[j]) { // x != y is clearer than !(x == y)
return 0;
}
}
return 1;
}
In fact, once you have appreciated why you need to subtract that "half length" from the j index of strNew, you can remove the need for that temporary copy completely and just use the modified j as an index into the original string:
int doubleString(char* s)
{
size_t full_len = strlen(s);
size_t half_len = full_len / 2;
for (size_t j = half_len; j < full_len; j++) {
if (s[j - half_len] != s[j]) { // x != y is clearer than !(x == y)
return 0;
}
}
return 1;
}
This loop
for(int j=strlen(s)/2;j<strlen(s);j++){
if(!(strNew[j]==s[j])){
true=0;
}
}
is incorrect. The index in the array strNew shall start from 0 instead of the value of the expression strlen( s ) / 2.
But in any case your approach is flawed, if only because you are using an intermediate array sized with the magic number 50, whereas the user can pass a string of any length to the function.
char strNew[50];
The function can look much simpler.
For example
int doubleString( const char *s )
{
int double_string = 0;
size_t n = 0;
if ( ( double_string = *s != '\0' && ( n = strlen( s ) ) % 2 == 0 ) )
{
double_string = memcmp( s, s + n / 2, n / 2 ) == 0;
}
return double_string;
}
That is, the function first checks that the passed string is not empty and that its length is an even number. If so, the function compares the two halves of the string.
Here is a demonstration program.
#include <stdio.h>
#include <string.h>
int doubleString( const char *s )
{
int double_string = 0;
size_t n = 0;
if (( double_string = *s != '\0' && ( n = strlen( s ) ) % 2 == 0 ))
{
double_string = memcmp( s, s + n / 2, n / 2 ) == 0;
}
return double_string;
}
int main( void )
{
printf( "doubleString( \"\" ) = %d\n", doubleString( "" ) );
printf( "doubleString( \"HelloHello\" ) = %d\n", doubleString( "HelloHello" ) );
printf( "doubleString( \"Hello Hello\" ) = %d\n", doubleString( "Hello Hello" ) );
}
The program output is
doubleString( "" ) = 0
doubleString( "HelloHello" ) = 1
doubleString( "Hello Hello" ) = 0
Note that the function parameter should have the qualifier const because the passed string is not changed within the function. This also lets you call the function with constant arrays without needing to define one more function for constant character arrays.
It's better to do it with a while loop, since you don't always have to iterate through all the elements of the string, but since you want the for loop version, here it is (C++ version):
int doubleString(string s){
int s_length = s.length();
if(s_length%2 != 0) {
return 0;
}
for (int i = 0; i < s_length/2; i++) {
if (s[i] != s[s_length/2 + i]){
return 0;
}
}
return 1;
}

Concatenating arrays in place

I ran into an issue while implementing a circular buffer that must occasionally be aligned.
Say I have two arrays, leftArr and rightArr. I want to move the right array to byteArr and the left array to byteArr + the length of the right array. Both leftArr and rightArr are greater than byteArr, and rightArr is greater than leftArr. (this is not quite the same as rotating a circular buffer because the left array does not need to start at byteArr) Although the left and right arrays do not overlap, the combined array stored at byteArr may overlap with the current arrays, stored at leftArr and rightArr. All memory from byteArr to rightArr + rightArrLen can be safely written to. One possible implementation is:
void align(char* byteArr, char* leftArr, int leftArrLen, char* rightArr, int rightArrLen) {
char *t = malloc(rightArrLen + leftArrLen);
// form concatenated data
memcpy(t, rightArr, rightArrLen);
memcpy(t + rightArrLen, leftArr, leftArrLen);
// now replace
memcpy(byteArr, t, rightArrLen + leftArrLen);
free(t);
}
However, I must accomplish this with constant memory complexity.
What I have so far looks like this:
void align(char* byteArr, char* leftArr, int leftArrLen, char* rightArr, int rightArrLen)
{
// first I check to see if some combination of memmove and memcpy will suffice, if not:
unsigned int lStart = leftArr - byteArr;
unsigned int lEnd = lStart + leftArrLen;
unsigned int rStart = rightArr - byteArr;
unsigned int rEnd = rStart + rightArrLen;
unsigned int lShift = rEnd - rStart - lStart;
unsigned int rShift = -rStart;
char temp1;
char temp2;
unsigned int nextIndex;
bool alreadyMoved;
// move the right array
for( unsigned int i = 0; i < rEnd - rStart; i++ )
{
alreadyMoved = false;
for( unsigned int j = i; j < rEnd - rStart; j-= rShift )
{
if( lStart <= j + rStart - lShift
&& j + rStart - lShift < lEnd
&& lStart <= (j + rStart) % lShift
&& (j + rStart) % lShift < lEnd
&& (j + rStart) % lShift < i )
{
alreadyMoved = true;
}
}
if(alreadyMoved)
{
// byte has already been moved
continue;
}
nextIndex = i - rShift;
temp1 = byteArr[nextIndex];
while( rStart <= nextIndex && nextIndex < rEnd )
{
nextIndex += rShift;
temp2 = byteArr[nextIndex];
byteArr[nextIndex] = temp1;
temp1 = temp2;
while( lStart <= nextIndex && nextIndex < lEnd )
{
nextIndex += lShift;
temp2 = byteArr[nextIndex];
byteArr[nextIndex] = temp1;
temp1 = temp2;
}
if( nextIndex <= i - rShift )
{
// byte has already been moved
break;
}
}
}
// move the left array
for( unsigned int i = lStart; i < lShift && i < lEnd; i++ )
{
if( i >= rEnd - rStart )
{
nextIndex = i + lShift;
temp1 = byteArr[nextIndex];
byteArr[nextIndex] = byteArr[i];
while( nextIndex < lEnd )
{
nextIndex += lShift;
temp2 = byteArr[nextIndex];
byteArr[nextIndex] = temp1;
temp1 = temp2;
}
}
}
}
This code works in the case lStart = 0, lLength = 11, rStart = 26, rLength = 70 but fails in the case lStart = 0, lLength = 46, rStart = 47, rLength = 53. The solution that I can see is to add logic to determine when a byte from the right array has already been moved. While this would be possible for me to do, I was wondering if there's a simpler solution to this problem that runs with constant memory complexity and without extra reads and writes?
Here's a program to test an implementation:
bool testAlign(int lStart, int lLength, int rStart, int rLength)
{
char* byteArr = (char*) malloc(100 * sizeof(char));
char* leftArr = byteArr + lStart;
char* rightArr = byteArr + rStart;
for(int i = 0; i < rLength; i++)
{
rightArr[i] = i;
}
for(int i = 0; i < lLength; i++)
{
leftArr[i] = i + rLength;
}
align(byteArr, leftArr, lLength, rightArr, rLength);
for(int i = 0; i < lLength + rLength; i++)
{
if(byteArr[i] != i) return false;
}
return true;
}
Imagine dividing byteArr into regions (not necessarily to scale):
X1 Left X2 Right
|---|--------|---|------|
The X1 and X2 are gaps in byteArr before the start of the left array and between the two arrays. In the general case, any or all of those four regions may have zero length.
You can then proceed like this:
1. Start by partially or wholly filling in the leading space in byteArr:
   a. If Left has zero length then move Right to the front (if necessary) via memmove(). Done.
   b. Else if X1 is the same length as the Right array or larger then move the right array into that space via memcpy() and, possibly, move up the left array to abut it via memmove(). Done.
   c. Else, move the lead portion of the Right array into that space, producing the below layout. If X1 had zero length then R1 also has zero length, X2' == X2, and R2 == Right.
       R1    Left     X2'   R2
      |---|--------|------|---|
2. There are now two alternatives:
   a. If R2 is the same length as Left or longer, then swap Left with the initial portion of R2 to produce (still not to scale):
        R1'    X2''   Left   R2'
      |------|-----|-------|--|
   b. Otherwise, swap the initial portion of Left with all of R2 to produce (still not to scale):
        R1'   L2    X2''    L1
      |------|---|-------|----|
3. Now recognize that in either case, you have a strictly smaller problem of the same form as the original, where the new byteArr is the tail of the original starting immediately after region R1'. In the first case the new leftArr is the (final) Left region and the new rightArr is region R2'. In the other case, the new leftArr is region L2, and the new rightArr is region L1. Reset parameters to reflect this new problem, and loop back to step (1).
Note that I say to loop back to step 1. You could, of course, implement this algorithm (tail-)recursively, but then to achieve constant space usage you would need to rely on your compiler to optimize out the tail recursion, which otherwise consumes auxiliary space proportional to the length ratio of the larger of the two sub-arrays to the smaller.
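The procedure above can be sketched as a loop in C. This is a minimal sketch and not a tuned implementation: the names align_in_place and swap_blocks are made up, lengths are size_t rather than the int used in the question, and the early exit for an empty right array is an addition that the prose does not spell out. It relies on the layout stated in the question (byteArr <= leftArr, the left array ends at or before rightArr) and uses only a few scalar temporaries.

```c
#include <stddef.h>
#include <string.h>

/* Exchange two non-overlapping blocks of n bytes, one byte at a time. */
static void swap_blocks(char *a, char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        char t = a[i];
        a[i] = b[i];
        b[i] = t;
    }
}

/* Store Right followed by Left at byteArr, using O(1) extra memory. */
void align_in_place(char *byteArr, char *leftArr, size_t leftLen,
                    char *rightArr, size_t rightLen)
{
    for (;;) {
        if (leftLen == 0) {                 /* only Right remains */
            memmove(byteArr, rightArr, rightLen);
            return;
        }
        if (rightLen == 0) {                /* added case: only Left remains */
            memmove(byteArr, leftArr, leftLen);
            return;
        }
        size_t gap = (size_t)(leftArr - byteArr);   /* length of X1 */
        if (gap >= rightLen) {              /* Right fits entirely in X1 */
            memcpy(byteArr, rightArr, rightLen);
            memmove(byteArr + rightLen, leftArr, leftLen);
            return;
        }
        /* Fill X1 with the head of Right; afterwards byteArr == leftArr. */
        memcpy(byteArr, rightArr, gap);
        byteArr += gap;
        rightArr += gap;
        rightLen -= gap;

        if (rightLen >= leftLen) {
            /* Swap Left with the head of R2; Left now sits where R2 began. */
            swap_blocks(leftArr, rightArr, leftLen);
            byteArr += leftLen;
            leftArr = rightArr;
            rightArr += leftLen;
            rightLen -= leftLen;
        } else {
            /* Swap the head of Left (L1) with all of R2; L1 becomes the
             * new right array and the rest of Left (L2) the new left. */
            swap_blocks(leftArr, rightArr, rightLen);
            byteArr += rightLen;
            leftArr += rightLen;
            leftLen -= rightLen;
        }
    }
}
```

Every pass either returns or shrinks one of the two arrays by at least one byte, so the loop terminates.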

Finding unique elements in an string array in C

C bothers me with its handling of strings. I have a pseudocode like this in my mind:
char *data[20];
char *tmp; int i,j;
for(i=0;i<20;i++) {
tmp = data[i];
for(j=1;j<20;j++)
{
if(strcmp(tmp,data[j]))
//then except the uniqueness, store them in elsewhere
}
}
But when i coded this the results were bad.(I handled all the memory stuff,little things etc.) The problem is in the second loop obviously :D. But i cannot think any solution. How do i find unique strings in an array.
Example input : abc def abe abc def deg entered
unique ones : abc def abe deg should be found.
You could use qsort to force the duplicates next to each other. Once sorted, you only need to compare adjacent entries to find duplicates. The result is O(N log N) rather than (I think) O(N^2).
Here is the 15 minute lunchtime version with no error checking:
typedef struct {
int origpos;
char *value;
} SORT;
int qcmp(const void *x, const void *y) {
int res = strcmp( ((SORT*)x)->value, ((SORT*)y)->value );
if ( res != 0 )
return res;
else
// they are equal - use original position as tie breaker
return ( ((SORT*)x)->origpos - ((SORT*)y)->origpos );
}
int main( int argc, char* argv[] )
{
SORT *sorted;
char **orig;
int i;
int num = argc - 1;
orig = malloc( sizeof( char* ) * ( num ));
sorted = malloc( sizeof( SORT ) * ( num ));
for ( i = 0; i < num; i++ ) {
orig[i] = argv[i + 1];
sorted[i].value = argv[i + 1];
sorted[i].origpos = i;
}
qsort( sorted, num, sizeof( SORT ), qcmp );
// remove the dups (sorting left relative position same for dups)
for ( i = 0; i < num - 1; i++ ) {
if ( !strcmp( sorted[i].value, sorted[i+1].value ))
// clear the duplicate entry however you see fit
orig[sorted[i+1].origpos] = NULL; // or free it if dynamic mem
}
// print them without dups in original order
for ( i = 0; i < num; i++ )
if ( orig[i] )
printf( "%s ", orig[i] );
free( orig );
free( sorted );
}
char *data[20];
int i, j, n, unique[20];
n = 0;
for (i = 0; i < 20; ++i)
{
for (j = 0; j < n; ++j)
{
if (!strcmp(data[i], data[unique[j]]))
break;
}
if (j == n)
unique[n++] = i;
}
The indexes of the first occurrence of each unique string should be in unique[0..n-1] if I did that right.
Why are you starting second loop from 1?
You should start it from
i+1. i.e.
for(j=i+1;j<20;j++)
Like if the list is
abc
def
abc
abc
lop
then
when i==4
tmp="lop"
but then the second loop starts, which runs from 1 to 19. This means j will also take the value 4 at one stage, and then
data[4], which is "lop", will be the same as tmp. So although "lop" is unique, it will be flagged as repeated.
Hope it was helpful.
Think a bit more about your problem -- what you really want to do is look at the PREVIOUS strings to see if you've already seen it. So, for each string n, compare it to strings 0 through n-1.
print element 0 (it is unique)
for i = 1 to n
unique = 1
for j = 0 to i-1 (compare this element to the ones preceding it)
if element[i] == element[j]
unique = 0
break from loop
if unique, print element i
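In C, the pseudocode above might look like the following sketch. The function name find_first_occurrences and its out parameter are made up for illustration; rather than printing, it records the index of every string that has no equal predecessor, so the caller can print or copy those strings in their original order.

```c
#include <stddef.h>
#include <string.h>

/* Compare each string with the ones before it. The index of every
 * string seen for the first time is stored in out[], which must have
 * room for n entries. Returns how many indices were stored. */
size_t find_first_occurrences(const char *const items[], size_t n,
                              size_t out[])
{
    size_t found = 0;
    for (size_t i = 0; i < n; i++) {
        int unique = 1;
        for (size_t j = 0; j < i; j++) {
            if (strcmp(items[i], items[j]) == 0) {  /* 0 means equal */
                unique = 0;
                break;
            }
        }
        if (unique)
            out[found++] = i;
    }
    return found;
}
```

For the question's input (abc def abe abc def deg) this stores the indices 0, 1, 2 and 5. The work is still O(N^2) string comparisons in the worst case.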
Might it be that your test is if (strcmp (this, that)) which will succeed if the two are different? !strcmp is probably what you want there.

Removing Duplicates in an array in C

The question is a little complex. The problem here is to get rid of duplicates and save the unique elements of array into another array with their original sequence.
For example :
If the input is entered b a c a d t
The result should be : b a c d t in the exact state that the input entered.
So sorting the array and then checking can't work, since I would lose the original sequence. I was advised to use an array of indices, but I don't know how to do that. So what is your advice?
For those who are willing to answer the question I wanted to add some specific information.
char** finduni(char *words[100],int limit)
{
//
//Methods here
//
}
is the my function. The array whose duplicates should be removed and stored in a different array is words[100]. So, the process will be done on this. I firstly thought about getting all the elements of words into another array and sort that array but that doesn't work after some tests. Just a reminder for solvers :).
Well, here is a version for char types. Note it doesn't scale.
#include "stdio.h"
#include "string.h"
void removeDuplicates(unsigned char *string)
{
unsigned char allCharacters [256] = { 0 };
int lookAt;
int writeTo = 0;
for(lookAt = 0; lookAt < strlen(string); lookAt++)
{
if(allCharacters[ string[lookAt] ] == 0)
{
allCharacters[ string[lookAt] ] = 1; // mark it seen
string[writeTo++] = string[lookAt]; // copy it
}
}
string[writeTo] = '\0';
}
int main()
{
char word[] = "abbbcdefbbbghasdddaiouasdf";
removeDuplicates(word);
printf("Word is now [%s]\n", word);
return 0;
}
The following is the output:
Word is now [abcdefghsiou]
Is that something like what you want? You can modify the method if there are spaces between the letters, but if you use int, float, double or char * as the types, this method won't scale at all.
EDIT
I posted and then saw your clarification, where it's an array of char *. I'll update the method.
I hope this isn't too much code. I adapted this QuickSort algorithm and basically added index memory to it. The algorithm is O(n log n), as the 3 steps below are additive and that is the worst case complexity of 2 of them.
1. Sort the array of strings, but every swap should be reflected in the index array as well. After this stage, the i'th element of originalIndices holds the original index of the i'th element of the sorted array.
2. Remove duplicate elements in the sorted array by setting them to NULL, and setting the index value to elements, which is the highest any can be.
3. Sort the array of original indices, and make sure every swap is reflected in the array of strings. This gives us back the original array of strings, except the duplicates are at the end and they are all NULL.
For good measure, I return the new count of elements.
Code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
// capacity of the explicit stack of pending subranges
#define MAX_LEVELS 1000
// Iterative quicksort on arr; originalIndices is permuted in lockstep.
void sortArrayAndSetCriteria(char **arr, int elements, int *originalIndices)
{
    char *piv;
    int beg[MAX_LEVELS], end[MAX_LEVELS], i = 0, L, R;
    int idx, cidx;
    for (idx = 0; idx < elements; idx++)
        originalIndices[idx] = idx;
    beg[0] = 0;
    end[0] = elements;
    while (i >= 0)
    {
        L = beg[i];
        R = end[i] - 1;
        if (L < R)
        {
            piv = arr[L];
            cidx = originalIndices[L];
            if (i == MAX_LEVELS - 1)
                return; // stack exhausted: gives up, array may be only partly sorted
            while (L < R)
            {
                while (strcmp(arr[R], piv) >= 0 && L < R) R--;
                if (L < R)
                {
                    arr[L] = arr[R];
                    originalIndices[L++] = originalIndices[R];
                }
                while (strcmp(arr[L], piv) <= 0 && L < R) L++;
                if (L < R)
                {
                    arr[R] = arr[L];
                    originalIndices[R--] = originalIndices[L];
                }
            }
            arr[L] = piv;
            originalIndices[L] = cidx;
            beg[i + 1] = L + 1;
            end[i + 1] = end[i];
            end[i++] = L;
        }
        else
        {
            i--;
        }
    }
}
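// Expects arr to be sorted. Sets every repeat to NULL and gives it the
// sentinel index `elements`. Returns the number of unique strings.
int removeDuplicatesFromBoth(char **arr, int elements, int *originalIndices)
{
    int i = 1, newLimit = 1;
    char *curr;
    if (elements <= 0)
        return 0; // empty input: avoid reading arr[0]
    curr = arr[0];
    while (i < elements)
    {
        if (strcmp(curr, arr[i]) == 0)
        {
            arr[i] = NULL; // free this if it was malloc'd
            originalIndices[i] = elements; // place it at the end
        }
        else
        {
            curr = arr[i];
            newLimit++;
        }
        i++;
    }
    return newLimit;
}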
// Iterative quicksort on originalIndices; arr is permuted in lockstep,
// which restores the original order with the NULLed duplicates last.
void sortArrayBasedOnCriteria(char **arr, int elements, int *originalIndices)
{
    int piv;
    int beg[MAX_LEVELS], end[MAX_LEVELS], i = 0, L, R;
    char *cidx;
    beg[0] = 0;
    end[0] = elements;
    while (i >= 0)
    {
        L = beg[i];
        R = end[i] - 1;
        if (L < R)
        {
            piv = originalIndices[L];
            cidx = arr[L];
            if (i == MAX_LEVELS - 1)
                return;
            while (L < R)
            {
                while (originalIndices[R] >= piv && L < R) R--;
                if (L < R)
                {
                    arr[L] = arr[R];
                    originalIndices[L++] = originalIndices[R];
                }
                while (originalIndices[L] <= piv && L < R) L++;
                if (L < R)
                {
                    arr[R] = arr[L];
                    originalIndices[R--] = originalIndices[L];
                }
            }
            arr[L] = cidx;
            originalIndices[L] = piv;
            beg[i + 1] = L + 1;
            end[i + 1] = end[i];
            end[i++] = L;
        }
        else
        {
            i--;
        }
    }
}
// Returns the new count of elements, or -1 if the index array cannot be allocated.
int removeDuplicateStrings(char *words[], int limit)
{
    int *indices = malloc(limit * sizeof(int));
    int newLimit;
    if (indices == NULL)
        return -1;
    sortArrayAndSetCriteria(words, limit, indices);
    newLimit = removeDuplicatesFromBoth(words, limit, indices);
    sortArrayBasedOnCriteria(words, limit, indices);
    free(indices);
    return newLimit;
}
int main(void)
{
    char *words[] = { "abc", "def", "bad", "hello", "captain", "def", "abc", "goodbye" };
    int newLimit = removeDuplicateStrings(words, 8);
    int i;
    for (i = 0; i < newLimit; i++) printf(" Word # %d = %s\n", i, words[i]);
    return 0;
}
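Traverse the items in the array: an O(n) operation.
For each item, add it to another, sorted array.
Before adding it to the sorted array, check whether the entry already exists with a binary search: an O(log n) operation.
Overall this is O(n log n) comparisons. Note that inserting into a sorted array also shifts the elements after the insertion point, which is O(n) per insertion in the worst case.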
I think that in C you can create a second array, then copy each element from the original array only if it is not already in the second array.
This also preserves the order of the elements.
If you read the elements one by one, you can discard a duplicate before inserting it into the original array, which could speed up the process.
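A minimal sketch of this second-array idea for an array of strings, assuming the output array has room for as many pointers as the input. The name copy_unique is made up for illustration. Each string is compared against everything kept so far, so it needs O(n^2) comparisons in the worst case, but it keeps the first occurrence of each string in its original order.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies each string from in[] to out[] unless an
 * equal string is already in out[]. out[] must have room for n pointers.
 * Returns the number of strings written. */
static size_t copy_unique(const char *const in[], size_t n, const char *out[])
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        size_t j = 0;
        /* linear scan over the strings kept so far */
        while (j < count && strcmp(out[j], in[i]) != 0)
            j++;
        if (j == count)          /* not found, so keep it */
            out[count++] = in[i];
    }
    return count;
}
```

Only the pointers are copied, so the strings themselves must outlive the output array.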
As Thomas suggested in a comment, if each element of the array is guaranteed to be from a limited set of values (such as a char) you can achieve this in O(n) time.
Keep an array of 256 bool (or int if your compiler doesn't support bool) or however many different discrete values could possibly be in the array. Initialize all the values to false.
Scan the input array one-by-one.
For each element, if the corresponding value in the bool array is false, add it to the output array and set the bool array value to true. Otherwise, do nothing.
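The three steps above could look roughly like this for a byte array that is not necessarily NUL-terminated. This is a sketch that assumes 8-bit bytes and a C99 compiler with stdbool.h; the name unique_bytes is made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: copies the first occurrence of each byte value from
 * in[] to out[], preserving order. out[] needs room for min(n, 256) bytes.
 * Returns the number of bytes written. */
static size_t unique_bytes(const unsigned char *in, size_t n, unsigned char *out)
{
    bool seen[256] = { false };   /* one flag per possible byte value */
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (!seen[in[i]]) {
            seen[in[i]] = true;   /* first time this value appears */
            out[count++] = in[i];
        }
    }
    return count;
}
```

The flag array has a fixed size, so the whole pass is O(n) regardless of how many duplicates there are.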
You know how to do it for char type, right?
You can do the same thing with strings, but instead of an array of bools (which is technically an implementation of a "set"), you'll have to simulate the set with a linear array of the strings you have already encountered. That is, for each new string you check whether it is in the array of "seen" strings; if it is, you ignore it (not unique); if it is not, you add it to both the array of seen strings and the output. If you have a small number of different strings (below 1000), you can ignore performance optimizations and simply compare each new string with everything you have already seen.
With a large number of strings (a few thousand or more), however, you'll need to optimize things a bit:
1) Every time you add a new string to the array of strings you have already seen, re-sort the array with insertion sort. Don't use quicksort, because insertion sort tends to be faster when the data is almost sorted.
2) When checking if string is in array, use binary search.
If number of different strings is reasonable (i.e. you don't have billions of unique strings), this approach should be fast enough.
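A rough sketch of that optimized variant, assuming the caller provides an output array with room for n pointers. The names find_sorted and unique_strings are made up for illustration. Instead of running a full insertion sort after every addition, it performs the single insertion step that a new element needs, using the position the binary search already found.

```c
#include <stdlib.h>
#include <string.h>

/* Binary search in the sorted array seen[0..count). Returns 1 and sets *pos
 * to the match if key is found; otherwise returns 0 and sets *pos to the
 * position where key must be inserted to keep the array sorted. */
static int find_sorted(const char *const seen[], size_t count,
                       const char *key, size_t *pos)
{
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(seen[mid], key);
        if (cmp == 0) { *pos = mid; return 1; }
        if (cmp < 0) lo = mid + 1; else hi = mid;
    }
    *pos = lo;
    return 0;
}

/* Writes the first occurrence of each string to out[] in original order.
 * Returns the number kept, or (size_t)-1 if memory runs out. */
static size_t unique_strings(const char *const in[], size_t n, const char *out[])
{
    const char **seen = malloc(n * sizeof *seen);
    size_t count = 0;
    if (seen == NULL && n > 0)
        return (size_t)-1;
    for (size_t i = 0; i < n; i++) {
        size_t pos;
        if (find_sorted(seen, count, in[i], &pos))
            continue;                       /* already seen, skip it */
        /* shift the tail right by one slot: one insertion-sort step */
        memmove(&seen[pos + 1], &seen[pos], (count - pos) * sizeof *seen);
        seen[pos] = in[i];
        out[count++] = in[i];
    }
    free(seen);
    return count;
}
```

The lookup costs O(log n) comparisons per string, while the shift can still move O(n) pointers per insertion.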
