How to avoid duplicates when finding all k-length substrings

I want to display all substrings with k letters, one per line, but avoid duplicate substrings. I managed to write to a new string all the k length words with this code:
void subSent(char str[], int k) {
int MaxLe, i, j, h, z = 0, Length, count;
char stOu[1000] = {'\0'};
Length = (int)strlen(str);
MaxLe = maxWordLength(str);
if((k >= 1) && (k <= MaxLe)) {
for(i = 0; i < Length; i++) {
if((int)str[i] == 32) {
j = i = i + 1;
} else {
j = i;
}
for(; (j < i + k) && (Length - i) >= k; j++) {
if((int)str[j] != 32) {
stOu[z] = str[j];
} else {
stOu[z] = str[j + 1];
}
z++;
}
stOu[z] = '\n';
z++;
}
}
}
But I'm struggling with the part that should store each word only once.
For example, the string HAVE A NICE DAY
and k = 1 it should print:
H
A
V
E
N
I
C
D
Y

Your subSent() routine poses a couple of challenges: first, it neither returns nor prints its result, so you can only see it in a debugger; second, it calls maxWordLength(), which you didn't supply.
Although avoiding duplicates can be complicated, with your algorithm it's not hard to do. Since all your words have a fixed length, we can walk the output string k letters (plus a newline) at a time, comparing each earlier word against the new one with strncmp(). The new word is always the last one added, so we stop when the two pointers meet.
I've reworked your code below and added a duplicate-elimination routine. I didn't know what maxWordLength() does, so I just aliased it to strlen() to get things running:
#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#define maxWordLength strlen
// does the last (fixed size) word in string appear previously in string
bool isDuplicate(const char *string, const char *substring, size_t n) {
for (const char *pointer = string; pointer != substring; pointer += (n + 1)) {
if (strncmp(pointer, substring, n) == 0) {
return true;
}
}
return false;
}
void subSent(const char *string, int k, char *output) {
int z = 0;
size_t length = strlen(string);
int maxLength = maxWordLength(string);
if (k >= 1 && k <= maxLength) {
for (int i = 0; i < length - k + 1; i++) {
int start = z; // where does the newly added word begin
for (int j = i; (z - start) < k; j++) {
output[z++] = string[j];
while (string[j + 1] == ' ') {
j++; // assumes leading spaces already dealt with
}
}
output[z++] = '\n';
if (isDuplicate(output, output + start, k)) {
z -= k + 1; // last word added was a duplicate so back it out
}
while (string[i + 1] == ' ') {
i++; // assumes original string doesn't begin with a space
}
}
}
output[z] = '\0'; // properly terminate the string
}
int main() {
char result[1024];
subSent("HAVE A NICE DAY", 1, result);
printf("%s", result);
return 0;
}
I cleaned up your space-avoidance logic somewhat, but it can still be tripped up by leading spaces in the input string.
OUTPUT
subSent("HAVE A NICE DAY", 1, result);
H
A
V
E
N
I
C
D
Y
subSent("HAVE A NICE DAY", 2, result);
HA
AV
VE
EA
AN
NI
IC
CE
ED
DA
AY
subSent("HAVE A NICE DAY", 3, result);
HAV
AVE
VEA
EAN
ANI
NIC
ICE
CED
EDA
DAY
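
The leading-space caveat mentioned above can be sidestepped by copying the non-space characters into a scratch buffer first and only then sliding a k-wide window over it. The following is a minimal sketch of that idea, not the original answer's code: unique_substrings, its parameters, and the 1024-byte scratch buffer are all assumptions made for illustration.

```c
#include <string.h>

/* Hypothetical helper (not from the original answer): writes every distinct
 * k-letter substring of text, spaces ignored, to out, one per line.
 * Returns the number of substrings written, or -1 if out is too small.
 * Assumption: input beyond 1023 non-space characters is silently ignored. */
int unique_substrings(const char *text, size_t k, char *out, size_t out_size) {
    char letters[1024];
    size_t n = 0;

    /* Step 1: keep only the non-space characters. */
    for (const char *p = text; *p != '\0' && n < sizeof letters - 1; p++) {
        if (*p != ' ') {
            letters[n++] = *p;
        }
    }
    letters[n] = '\0';

    if (out_size == 0) {
        return -1;
    }
    out[0] = '\0';
    if (k == 0 || k > n) {
        return 0;
    }

    size_t z = 0;   /* write position in out */
    int count = 0;

    /* Step 2: slide a k-wide window; skip windows seen at an earlier offset. */
    for (size_t i = 0; i + k <= n; i++) {
        int seen = 0;
        for (size_t j = 0; j < i && !seen; j++) {
            seen = (strncmp(letters + j, letters + i, k) == 0);
        }
        if (seen) {
            continue;
        }
        if (z + k + 2 > out_size) {   /* k letters + '\n' + '\0' */
            return -1;
        }
        memcpy(out + z, letters + i, k);
        z += k;
        out[z++] = '\n';
        out[z] = '\0';
        count++;
    }
    return count;
}
```

Calling unique_substrings("HAVE A NICE DAY", 1, result, sizeof result) fills result with the same nine lines shown above. Like the reworked subSent(), this treats the input as one run of letters, so windows may span word boundaries.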

Related

why can't my program recognize similar words in a string?

I want to write a program that will take an input T. In the next T lines, each line will take a string as an input. The output would be how many ways the string can be reordered.
#include <stdio.h>
#include <stdlib.h>
int main() {
int T, i, l, count = 1, test = 0, word = 0, ans;
char line[200];
scanf("%d", &T);
for (i = 0; i < T; i++) {
scanf(" %[^\n]", line);
l = strlen(line);
for (int q = 0; q < l; q++) {
if (line[q] == ' ') {
word++;
}
}
ans = fact(word + 1);
word = 0;
for (int j = 0; j < l; j++) {
for (int k = j + 1; k < l; k++) {
if (line[k] == ' ' && line[k + 1] == line[j]) {
int m = j;
int n = k + 1;
for (;;) {
if (line[m] != line[n]) {
break;
} else
if (line[m] == ' ' && line[n] == ' ') {
test = 1;
break;
} else {
m++;
n++;
}
}
if (test == 1) {
count++;
ans = ans / fact(count);
count = 0;
test = 0;
}
}
}
}
printf("%d\n", ans);
}
}
int fact(int n) {
if (n == 1) {
return 1;
} else {
return n * fact(n - 1);
}
}
Now, in my program,
my output is like this:
2
no way no good
12
yes no yes yes no
120
If T = 2 and the first string is no way no good, it gives the right output, 12 (4!/2!). That means it has identified the two identical words.
But the second input string is yes no yes yes no, that is, 3 yes and 2 no, so the answer should be 5!/(3!2!) = 10. Why is the answer 120, and why can't it recognize the similar words?

The main problem in your duplicate detector is you test the end of word with if (line[m] == ' ' && line[n] == ' ') but this test fails to identify a duplicate that occurs with the last word because line[n] is '\0', not ' '.
Note these further problems:
you do not handle words that occur more than twice correctly: you should perform ans = ans / fact(count); only after the outer loop finishes. For example, if a word is present 3 times, it will be detected as 3 pairs of duplicates, effectively causing ans to be divided by 2^3 = 8, instead of 3! = 6.
you should protect against buffer overflow and detect invalid input with:
if (scanf(" %199[^\n]", line) != 1)
break;
the range of type int for ans is too small for a moderately large number of words: 13! is 6227020800, larger than INT_MAX on most systems.
The code is difficult to follow. You should consider parsing the line into an array of words and using a more conventional way of counting duplicates.
Here is a modified version using this approach:
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
static int cmpstr(const void *p1, const void *p2) {
char * const *pp1 = p1;
char * const *pp2 = p2;
return strcmp(*pp1, *pp2);
}
unsigned long long factorial(int n) {
unsigned long long res = 1;
while (n > 1)
res *= n--;
return res;
}
int main() {
int T, i, n, begin, count;
unsigned long long ans;
char line[200];
char *words[100];
if (!fgets(line, sizeof line, stdin) || sscanf(line, "%d", &T) != 1)
return 1;
while (T --> 0) {
if (!fgets(line, sizeof line, stdin))
break;
n = 0;
begin = 1;
for (char *p = line; *p; p++) {
if (isspace((unsigned char)*p)) {
*p = '\0';
begin = 1;
} else {
if (begin) {
words[n++] = p;
begin = 0;
}
}
}
qsort(words, n, sizeof(*words), cmpstr);
ans = factorial(n);
for (i = 0; i < n; i += count) {
for (count = 1; i + count < n && !strcmp(words[i], words[i + count]); count++)
continue;
ans /= factorial(count);
}
printf("%llu\n", ans);
}
return 0;
}
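
On the overflow point: even with unsigned long long, factorial(n) itself overflows once n exceeds 20, although the final answer may still fit. One possible workaround, sketched here as an alternative rather than as part of the code above, is to build the result as a product of binomial coefficients so that n! is never formed. The name arrangements and its parameters are hypothetical.

```c
/* Hypothetical helper: number of distinct orderings of a multiset whose
 * group sizes are counts[0..groups-1], i.e. n! / (c1! * c2! * ...),
 * computed without ever forming n!. */
unsigned long long arrangements(const int *counts, int groups) {
    unsigned long long result = 1;
    unsigned long long placed = 0;   /* words placed so far */

    for (int g = 0; g < groups; g++) {
        /* Multiply by C(placed + counts[g], counts[g]) one factor at a time.
         * After step t the running value is an integer, so the division
         * below is always exact. */
        for (int t = 1; t <= counts[g]; t++) {
            placed++;
            result = result * placed / (unsigned long long)t;
        }
    }
    return result;
}
```

With the group sizes from the question, {3, 2} for yes no yes yes no gives 10 and {2, 1, 1} for no way no good gives 12. The intermediate product result * placed can still overflow for large inputs; it just happens much later than with a plain factorial.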

how to write a function char* in c that returns words sorted in order of their length?

Write a function that takes a string as a parameter and returns its words sorted by length first and then in alphabetical order, on one line, separated by '^'.
Here are examples of the output.
There will be only spaces, tabs and alphanumeric characters in the strings.
You'll have only one space between words of the same size and ^ otherwise.
A word is a section of the string delimited by spaces/tabs or the start/end of the string. If a word has a single letter, it must be capitalized.
A letter is a character in the set [a-zA-Z].
Here is my code, but it returns nothing. I think the issue is in the last function:
#include <unistd.h>
#include <stdlib.h>
int is_upper(char c)
{
return c >= 'A' && c <= 'Z';
}
int my_lower(char c)
{
if (is_upper(c))
return c + 32;
return c;
}
int my_strlen(char *s)
{
int i = 0;
for (; s[i]; i++)
;
return i;
}
int my_is(char c)
{
return c == ' ' || c == '\t';
}
char *my_strsub(char *s, int start, int end)
{
char *res = malloc(end - start);
int i = 0;
while (start < end)
res[i++] = s[start++];
res[i] = 0;
return res;
}
int cmp_alpha(char *a, char *b)
{
while (*a && *b && *a == *b)
{
a++;
b++;
}
return my_lower(*a) <= my_lower(*b);
}
int cmp_len(char *a, char *b)
{
return my_strlen(a) <= my_strlen(b);
}
void my_sort(char *arr[], int n, int(*cmp)(char*, char*))
{
char *tmp;
for (int i = 0; i < n; i++)
for (int j = 0; j < n - 1; j++)
{
if ((*cmp)(arr[j], arr[j + 1]) == 0)
{
tmp = arr[j];
arr[j] = arr[j + 1];
arr[j + 1] = tmp;
}
}
}
char* ord_alphlong(char *s)
{
int start = 0, idx = 0;
char *words[my_strlen(s) / 2 + 1];
for (int i = 0; s[i]; i++)
{
if (!my_is(s[i]) && i > 0 && my_is(s[i - 1]))
start = i;
if (my_is(s[i]) && i > 0 && !my_is(s[i - 1]))
words[idx++] = my_strsub(s, start, i);
if (!s[i + 1] && !my_is(s[i]))
words[idx++] = my_strsub(s, start, i + 1);
}
my_sort(words, idx, &cmp_alpha);
my_sort(words, idx, &cmp_len);
char* res = malloc(100);
int pushed=0;
for (int i = 0; i < idx - 1; i++)
{
res[pushed]=*words[i];
if (my_strlen(&res[pushed]) < my_strlen(&res[pushed + 1]))
{
res[pushed]=res[94];
}
else
{
res[pushed]=res[32];
}
pushed++;
}
res[pushed]='\0';
return res;
}
int main()
{
ord_alphlong("Never take a gamble you are not prepared to lose");
return 0;
}

Apart from the off-by-one allocation error in my_strsub, separating and sorting the words seems to work well. After that, however, you confuse the result character array with an array of character pointers; e.g., with res[pushed]=*words[i] you write only the first character of a word to the result. The last for loop of ord_alphlong could instead be:
if (idx)
for (int i = 0; ; )
{
char *word = words[i];
int lng = my_strlen(word);
if (100 < pushed+lng+1) exit(1); // too long
for (int i = 0; i < lng; ) res[pushed++] = word[i++];
if (++i == idx) break; // last word
res[pushed++] = lng < my_strlen(words[i]) ? '^' // other size
: ' '; // same size
}
Of course in order to see the result of the function, you'd have to output it somehow.
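
The off-by-one allocation mentioned at the start comes from my_strsub reserving end - start bytes and then writing end - start characters plus a terminator. A possible corrected version is sketched below; the range and NULL checks are additions of mine, not something the original code had.

```c
#include <stdlib.h>

/* Copy s[start..end) into a freshly allocated, NUL-terminated string.
 * Returns NULL on an invalid range or allocation failure.
 * The caller is responsible for freeing the result. */
char *my_strsub(const char *s, int start, int end)
{
    if (start < 0 || end < start)
        return NULL;
    char *res = malloc((size_t)(end - start) + 1);  /* +1 for the terminator */
    if (res == NULL)
        return NULL;
    int i = 0;
    while (start < end)
        res[i++] = s[start++];
    res[i] = '\0';
    return res;
}
```

For example, my_strsub("Never take", 6, 10) yields the string take. In main, the buffer returned by ord_alphlong would then still need to be written out (with puts(), say, if stdio is allowed in your exercise) and freed.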

C - Cycle through all possible lowercase strings

I'm learning C with the CS50 course problem set 2, using the crypt function to brute force guess a password. Currently writing a function that prints all possible strings of a certain length, eg:
aa
ab
...
az
ba
...
zy
zz
I've written a fairly simple recursive function to do so:
#include <cs50.h>
#include <stdio.h>
#include <crypt.h>
#include <string.h>
void stringcycler(int n, int passLength, char *pass)
// Scrolls through all lowercase letter combinations for a string of length passLength
// Expects an integer value of the length of the string as both n and passLength
// Also expects a char* array of length passLength with all chars set to 'a' (and a null character)
{
if(n != 0)
{
for(pass[passLength - n] = 'a'; pass[passLength - n] < 'z'; pass[passLength - n]++)
{
stringcycler(n-1, passLength, pass);
printf("%s\n", pass);
// return 0;
}
}
}
int main()
{
// Initialise char *c, and scroll through letters
int passLength = 2; // The number of characters you want to brute force guess
char pass[passLength + 1]; // Add 1 for the null character
int i;
for(i = 0; i < passLength; i++) pass[i] = 'a'; // Set every char in pass to 'a'
pass[passLength] = '\0'; // Set null character at the end of string
stringcycler(passLength, passLength, pass);
return 0;
}
It works for the most part, but only goes to yz. Whenever it sees a z it basically skips, so it goes to yz, then never does za to zz. If I add an = to the for loop line:
pass[passLength - n] < 'z';
ie.
pass[passLength - n] <= 'z';
Then it prints '{' characters in the mix. Any help? And another question is, how can I change this to work for all combos of upper and lower case too, is there a neat way of doing it?

You print after you return from your recursion, but you should print when the recursion has reached the end (or beginning, in your case) of the string. In other words, printing should be an alternative branch to recursing:
void stringcycler(int n, int len, char *pass)
{
if (n != 0) {
for (pass[len - n] = 'a'; pass[len - n] <= 'z'; pass[len - n]++) {
stringcycler(n - 1, len, pass);
}
} else {
printf("%s ", pass);
}
}
The if part constructs the strings as it recurses further down. The else part does something with the constructed string. (Of course, you must include 'z' in your loop. Your original code only prints the z in the last place because it prints after the recursion returns, which means that the char buffer is in a state that wouldn't (re-)enter the loop.)
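
As for mixing upper and lower case: one option is to keep the same recursive structure but walk over an alphabet string instead of incrementing the character itself. This is only a sketch of that idea; cycle_alphabet and its parameters are names I made up, and the returned count is there purely so the result is easy to check.

```c
#include <stdio.h>

/* Hypothetical variant of stringcycler: fills pass[pos..len-1] with every
 * combination of characters taken from alphabet and prints each complete
 * string on its own line. Returns how many strings were printed.
 * pass must hold len + 1 bytes and pass[len] must already be '\0'. */
long cycle_alphabet(char *pass, int pos, int len, const char *alphabet)
{
    if (pos == len) {
        /* Every slot is filled: this is the branch that uses the string. */
        printf("%s\n", pass);
        return 1;
    }
    long printed = 0;
    for (const char *c = alphabet; *c != '\0'; c++) {
        pass[pos] = *c;
        printed += cycle_alphabet(pass, pos + 1, len, alphabet);
    }
    return printed;
}
```

With char pass[3] = {0}; a call such as cycle_alphabet(pass, 0, 2, "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ") would go through all 52 * 52 two-letter strings. Because the loop runs over the alphabet string and stops at its terminator, it never steps past the last letter, so nothing like '{' can leak into the output.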

Below is a generic backtracking algorithm for generating the passwords. The idea is to fill the slots of a given char array a one position at a time: for each position k, we generate the possible candidates. I have taken the candidates to be the lowercase ASCII letters a-z and the uppercase ASCII letters A-Z. If you want to include other ASCII characters, just modify the construct_candidates function accordingly.
Once the array is filled, i.e. k becomes PASS_LEN, we know we have generated a password and can process it however we like; I have just printed it here.
The value of the PASS_LEN macro can be adjusted to generate passwords of any desired length.
#include <stdio.h>
#include <stdlib.h>
#define PASS_LEN 2
static char* construct_candidates (char a[], int k, int *count)
{
/* Lower case ASCII */
int min1 = 97;
int max1 = 122;
/* Upper case ASCII */
int min2 = 65;
int max2 = 90;
*count = (max1 - min1 + 1) + (max2 - min2 + 1);
char *cand = calloc(*count, sizeof(char));
if (cand == NULL) {
printf("calloc failed\n");
return NULL;
}
int idx = 0;
for (int i = min1; i <= max1; i++) {
cand[idx] = i;
idx++;
}
for (int i = min2; i <= max2; i++) {
cand[idx] = i;
idx++;
}
return cand;
}
static void backtrack(char a[], int k)
{
int i;
if (k == PASS_LEN) {
for (i = 0; i < PASS_LEN; i++) {
printf("%c", a[i]);
}
printf("\n");
return;
}
int cand_count = 0;
char *cand = construct_candidates(a, k, &cand_count);
if (cand == NULL) {
printf("Failed to get candidates\n");
return;
}
for (i = 0; i < cand_count; i++) {
a[k] = cand[i];
backtrack(a, k + 1);
}
free(cand);
}
int main()
{
char a[PASS_LEN] = {'\0'};
backtrack(a, 0);
}

Reversing the order of words backwards in a string

Sorry for such a mediocre question, but I ran into what seems to be a tiny problem, but simply can't get over it. For my task I have to take a line of string from a file, and put it into another file backwards, for example:
one two three
four five six
would be
three two one
six five four
My problem is that I'm getting
three two one
si five four
So basically the flaw is that there is a space character at the beginning of each line and the last letter of the last word is always missing. Here's my reverse function:
void reverse(char input[], int length, char output[]) {
char space = 32;
input[length - 1] = space;
int value = 0;
int i, k = 0, j;
for (i = 0; i <= length; i++) {
if (input[i] == space) {
for (j = i - 1; j >= k; j--, value++) {
output[value] = input[j];
}
if (j == -1) {
output[value] = space;
value++;
}
k = i;
}
}
char c = 0;
for (int i = 0, j = length - 1; i <= j; i++, j--) {
c = output[i];
output[i] = output[j];
output[j] = c;
}
}
What I'm doing is first reversing each word by character, and then the whole line. If someone could help me find the last bits that I've missed I would greatly appreciate it.

The flaws come from your approach:
why do you force a space at offset length - 1? If you read the line with fgets(), there is probably a newline ('\n') at the end of the line, but it might be missing at the end of the input, which would explain the x getting overwritten on the last line.
you should not modify the input buffer.
Here is a simplified version, along with a simple main function:
#include <stdio.h>
#include <string.h>
void reverse(const char *input, int length, char *output) {
int i, j, k, v;
for (i = k = v = 0;; i++) {
if (i == length || input[i] == ' ') {
for (j = i; j-- > k; v++) {
output[v] = input[j];
}
for (; i < length && input[i] == ' '; i++) {
output[v++] = ' ';
}
if (i == length) {
output[v] = '\0';
break;
}
k = i;
}
}
for (i = 0, j = length - 1; i < j; i++, j--) {
char c = output[i];
output[i] = output[j];
output[j] = c;
}
}
int main() {
char input[256];
char output[256];
while (fgets(input, sizeof input, stdin)) {
reverse(input, strcspn(input, "\n"), output);
puts(output);
}
return 0;
}
Output:
three two one
six five four
Here is a simpler reverse function that operates in one pass:
#include <string.h>
void reverse(const char *input, int length, char *output) {
int i, j, k, v;
for (i = k = 0, v = length;; i++) {
if (i == length || input[i] == ' ') {
for (j = i; j-- > k;) {
output[--v] = input[j];
}
for (; i < length && input[i] == ' '; i++) {
output[--v] = ' ';
}
if (v == 0) {
output[length] = '\0';
break;
}
k = i;
}
}
}

Replace input[length - 1] = space; with input[length] = space;

Need help creating a FindMaxOverlap function

I'm trying to create a function that, given two C strings, it spits back the number of consecutive character overlap between the two strings.
For example,
String 1: "Today is monday."
String 2: " is monday."
The overlap here would be " is monday.", which is 11 characters (it includes the space and '.').

If you need something more efficient, consider that a partial mismatch between Strings 1 and 2 means you can jump the length of the remainder of String 2 along String 1. This means you don't need to search the entirety of String 1.
Take a look at the Boyer-Moore algorithm. Though it is used for string searching, you could implement this algorithm for finding the maximum-length substring using String 2 as your pattern and String 1 as your target text.
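
Before reaching for Boyer-Moore, a plain dynamic-programming baseline for the longest common run of consecutive characters may be enough. Note that this is a different technique from the one suggested above: it takes O(len(a) * len(b)) time and is shown only as a simple reference point. The name find_max_overlap and its parameters are assumptions for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: length of the longest run of consecutive characters
 * shared by a and b. If start_in_a is not NULL, it receives the offset of
 * that run in a. Returns -1 on allocation failure. */
int find_max_overlap(const char *a, const char *b, int *start_in_a)
{
    size_t la = strlen(a), lb = strlen(b);
    /* prev[j] and curr[j] hold the length of the common run that ends at
     * a[i-2]/b[j-1] and a[i-1]/b[j-1] respectively; index 0 stays 0. */
    int *prev = calloc(lb + 1, sizeof *prev);
    int *curr = calloc(lb + 1, sizeof *curr);
    int best = 0, best_end = 0;

    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return -1;
    }
    for (size_t i = 1; i <= la; i++) {
        for (size_t j = 1; j <= lb; j++) {
            curr[j] = (a[i - 1] == b[j - 1]) ? prev[j - 1] + 1 : 0;
            if (curr[j] > best) {
                best = curr[j];
                best_end = (int)i;   /* run ends just before a[i] */
            }
        }
        int *tmp = prev;             /* reuse the two rows */
        prev = curr;
        curr = tmp;
    }
    free(prev);
    free(curr);
    if (start_in_a != NULL)
        *start_in_a = best_end - best;
    return best;
}
```

For the strings in the question, find_max_overlap("Today is monday.", " is monday.", &start) returns 11 with start set to 5.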

There is probably a more efficient way to do this, but here's a simple approach:
#include <stdio.h>
#include <string.h>
int main() {
char s1[17] = "Today is monday.";
char s2[12] = " is monday.";
int max = 0;
int i_max = -1;
int j_max = -1;
int i = 0, j = 0, k=0;
int endl = 0, sl1, sl2;
char *ss1, *ss2;
for(i = 0; i < strlen(s1)-1; i++) {
ss1 = s1+i;
sl1 = strlen(ss1);
if(max >= sl1) {
break; // You found it.
}
for(j = 0; j < strlen(s2)-1; j++) {
ss2 = s2+j;
sl2 = strlen(ss2);
if(max >= sl2) {
break; // Can't find a bigger overlap.
}
endl = (sl1 > sl2)?sl2:sl1;
int n_char = 0;
for(k = 0; k < endl+1; k++) {
// printf("%s\t%s\n", ss1+k, ss2+k); // Uncomment if you want to see what it compares.
if(ss1[k] != ss2[k] || ss1[k] == '\0') {
n_char = k;
break;
}
}
if(n_char > max) {
max = n_char;
i_max = i;
j_max = j;
}
}
}
char nstr[max+1];
nstr[max] = '\0';
strncpy(nstr, s1+i_max, max);
printf("Maximum overlap is %d characters, substring: %s\n", max, nstr);
return 0;
}
Update: I have fixed the bugs. This definitely compiles. Here is the result: http://codepad.org/SINhmm7f
The problems were that endl was defined wrong and I wasn't checking for end-of-line conditions.
Hopefully the code speaks for itself.

Here is my solution, it will return the position of the overlap starting point, it's a bit complex, but that's how it's done in C:
#include <string.h>
int FindOverlap (const char * a, const char * b)
{
// iterators
const char * u = a;
const char * v = b;
const char * c = 0; // overlap iterator
char overlapee = 'b';
if (strlen(a) < strlen(b)) overlapee = 'a';
if (overlapee == 'b')
{
while (*u != '\0')
{
v = b; // reset b iterator
c = u;
while (*v != '\0')
{
if (*c != *v) break;
c++;
v++;
}
if (*v == '\0') return (u-a); // return overlap starting point
u++; // advance to the next starting position in a
}
}
else if (overlapee == 'a')
{
while (*v != '\0')
{
u = a; // reset a iterator
c = v;
while (*u != '\0')
{
if (*c != *u) break;
c++;
u++;
}
if (*u == '\0') return (v-b); // return overlap starting point
v++; // advance to the next starting position in b
}
}
return (-1); // not found
}
