Global alignment using affine gap penalty function - arrays

I have to write a program that performs global alignment of two sequences using an affine gap penalty function. The dynamic programming algorithm (a modified Needleman-Wunsch) computes the similarity (the maximum score, expressing how similar the sequences are) of two given sequences, s and t. It takes gaps into account (blocks of consecutive spaces in a sequence, which are more likely to occur than isolated spaces) by building three 2D arrays. Informally, the arrays are:
array C: keeps the maximum score for blocks that end with a character of sequence s aligned with a character of sequence t;
array BS: keeps the maximum score for blocks that end with a character of sequence t aligned with a space in sequence s;
array BT: keeps the maximum score for blocks that end with a character of sequence s aligned with a space in sequence t.
The algorithm has the following recurrence relations:
C[i][j] = v(s[i],t[j]) + max{C[i-1][j-1], BS[i-1][j-1], BT[i-1][j-1]}
BS[i][j] = max{C[i][j-1]-(h+g), BS[i][j-1]-g, BT[i][j-1]-(h+g)}
BT[i][j] = max{C[i-1][j]-(h+g), BS[i-1][j]-(h+g), BT[i-1][j]-g}
** v(s[i],t[j]) = value of a match (when both characters are identical) or of a mismatch (when they are not)
The similarity is the highest of the last entries of the three arrays. The problem is that when I run the program, it behaves strangely:
for a given pair of sequences, it gives different values depending on which sequence is s and which is t. Could you please help me find out why? Do you have any idea what I'm doing wrong? Here is the code:
int main (void){
int mat, mis, h, g,
sim, i, j, m, n;
/* mat = match, mis = mismatch, h = open gap penalty, g = extend gap penalty */
string s, t;
s = malloc(1500);
t = malloc(1500);
scanf("%d %d %d %d", &mat, &mis, &h, &g);
scanf("%s", s);
scanf("%s", t);
m = strlen(s);
n = strlen(t);
int C[m][n], BS[m][n], BT[m][n];
C[0][0] = 0;
for(j = 1; j<= n; j++)
C[0][j] = -32000;
for(i = 1; i<= m; i++)
C[i][0] = -32000;
for(j = 1; j <= n; j++)
BS[0][j] = -(h + g*j);
for(i = 0; i <= m; i++)
BS[i][0] = -32000;
for(j = 0; j <= n; j++)
BT[0][j] = -32000;
for(i = 1; i <= m; i++)
BT[i][0] = -(h + g*i);
for(i = 1; i <= m; i++){
for(j = 1; j <= n; j++){
C[i][j] = align(s[i-1],t[j-1],mat,mis) + max(C[i-1][j-1],BS[i-1][j-1],BT[i-1][j-1]);
BS[i][j] = max((C[i][j-1]-(h+g)),(BS[i][j-1]-g),(BT[i][j-1])-(h+g));
BT[i][j] = max((C[i-1][j]-(h+g)),(BS[i-1][j]-(h+g)),(BT[i-1][j]-g));
}
}
printf("\n");
printf("c[m][n]: %d bs[m][n]:%d bt[m][n]: %d\n", C[m][n], BS[m][n], BT[m][n]);
sim = max(C[m][n], BS[m][n], BT[m][n]);
printf("sim: %d\n", sim);
return 0;
}

OK, I finally found the problem after trying many printfs, since I don't know how to use debuggers. The first clue was the segmentation fault I got when I tried to read relatively long sequences from files. A segmentation fault can have several causes, but almost every time I have seen this error in my programs, it occurred because I was trying to access a position that didn't actually exist in the array.
Then the printfs showed values in the initialization step that differed from what the first row and first column of each array should contain. I connected the segmentation fault with those strange values and decided to check the array declarations as well as all the loop conditions. There it was! I had simply forgotten a very basic property of arrays: if an array has size n, its valid indices run from 0 to n-1. To fix my very newbie error, I added one row and one column to each array (since position (0,0) is not associated with any pair of aligned characters), so that each array has size [m][n], where m is (length + 1) of sequence s and n is (length + 1) of sequence t. I also changed all loop conditions so that they stop at position [m-1][n-1]. Now the program works fine.
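The fix described above might look roughly like the following sketch. It is only an illustration under a few assumptions: the function name affine_similarity, the helper max3, and the NEG_INF constant are made up here (the original program uses max, align and -32000), and the tables are allocated on the heap as flat arrays instead of as variable-length arrays.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for "minus infinity"; halving INT_MIN leaves headroom so that
   subtracting a penalty from it cannot overflow. (Assumed convention.) */
#define NEG_INF (INT_MIN / 2)

static int max3(int a, int b, int c) {
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Similarity of s and t under the recurrence from the question.
   mat/mis: match/mismatch scores; h: gap-open penalty; g: gap-extend penalty.
   Returns NEG_INF if memory cannot be allocated (an arbitrary choice). */
int affine_similarity(const char *s, const char *t,
                      int mat, int mis, int h, int g) {
    size_t m = strlen(s), n = strlen(t);
    size_t cols = n + 1;             /* extra column for the empty prefix */
    size_t cells = (m + 1) * cols;   /* extra row for the empty prefix */
    int *C  = malloc(cells * sizeof *C);
    int *BS = malloc(cells * sizeof *BS);
    int *BT = malloc(cells * sizeof *BT);
    int sim = NEG_INF;

    if (C && BS && BT) {
        /* Row 0: valid column indices are 0..n inclusive. */
        for (size_t j = 0; j <= n; j++) {
            C[j]  = NEG_INF;
            BS[j] = -(h + g * (int)j);
            BT[j] = NEG_INF;
        }
        /* Column 0: valid row indices are 0..m inclusive. */
        for (size_t i = 0; i <= m; i++) {
            C[i * cols]  = NEG_INF;
            BS[i * cols] = NEG_INF;
            BT[i * cols] = -(h + g * (int)i);
        }
        C[0] = 0;
        BS[0] = NEG_INF;
        BT[0] = NEG_INF;

        for (size_t i = 1; i <= m; i++) {
            for (size_t j = 1; j <= n; j++) {
                size_t cur = i * cols + j;
                size_t diag = cur - cols - 1, left = cur - 1, up = cur - cols;
                int v = (s[i - 1] == t[j - 1]) ? mat : mis;
                C[cur]  = v + max3(C[diag], BS[diag], BT[diag]);
                BS[cur] = max3(C[left] - (h + g), BS[left] - g, BT[left] - (h + g));
                BT[cur] = max3(C[up] - (h + g), BS[up] - (h + g), BT[up] - g);
            }
        }
        /* The last cell is (m, n), which exists because of the extra row/column. */
        sim = max3(C[cells - 1], BS[cells - 1], BT[cells - 1]);
    }
    free(C);
    free(BS);
    free(BT);
    return sim;
}
```

With tables of (m+1) by (n+1) cells, swapping the two arguments should give the same score, for example affine_similarity("ACGT", "AGT", 1, -1, 2, 1) and affine_similarity("AGT", "ACGT", 1, -1, 2, 1).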

Related

Having issues splitting string into chunks in C for cipher project

I'm working on a cybersecurity program where I need to split a string of plaintext into blocks of a certain size. The code I currently have does not work but is close. Certain letters are either skipped, or the blocks end up being larger than the block size. Additionally, characters appear that are not represented in the plaintext, but I am unsure of how this could even occur. Could anyone fix this code for me or illuminate where I am going wrong?
Example plaintext: abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
Example output:
In the following code, plaintext_no_pad is the string of plaintext I want to break into chunks of size size, and plaintext_len_no_pad is its length. The size in this case is 2.
// Split plaintext into blocks size n
int total_blocks = plaintext_len_no_pad / size;
printf("DEBUG Total blocks: %d\n", total_blocks);
char blocks[total_blocks][size + 1];
int k = 0;
for (int i = 0; i < total_blocks; i++)
{
for (int j = 0; j < size; j++)
{
blocks[i][j] = plaintext_no_pad[j + k];
}
blocks[i][k + 1] = '\0';
k += size;
}
for (int i = 0; i < total_blocks; i++)
{
printf("DEBUG Block %d: %s\n", i, blocks[i]);
}
I've tried adjusting where I put the string terminating character, and messed with different ways of splitting the string. This method I whipped up has a bug I cannot figure out. I have looked at related posts, but I have not found one that helped.
blocks[i][k + 1] = '\0';
is most definitely not right.
On the first iteration of the outer loop, k is 0, so this is equivalent to blocks[i][1]. On the second iteration it is equivalent to blocks[i][size + 1], which is out of bounds. After that it gets further and further out of bounds.
You should be using
blocks[i][size] = '\0';
instead.
Also be careful if plaintext_len_no_pad is not evenly divisible by size (i.e. when plaintext_len_no_pad % size != 0).
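As a rough sketch of both points (terminating each block at its own length, and coping with a length that is not a multiple of size), something like the following could work. The function name split_blocks and its parameter order are assumptions made for this example, and it relies on variable-length array parameters, as the original code already relies on VLAs.

```c
#include <stddef.h>
#include <string.h>

/* Copies text into consecutive NUL-terminated chunks of at most `size`
   characters. `blocks` must have at least (len + size - 1) / size rows.
   Returns the number of blocks written (0 if size is 0). */
size_t split_blocks(size_t size, char blocks[][size + 1], const char *text) {
    size_t len = strlen(text);
    size_t count = 0;

    if (size == 0)
        return 0;
    for (size_t start = 0; start < len; start += size, count++) {
        /* The last chunk may be shorter than size. */
        size_t chunk = (len - start < size) ? len - start : size;
        memcpy(blocks[count], text + start, chunk);
        blocks[count][chunk] = '\0';   /* terminator index depends on the chunk, not on k */
    }
    return count;
}
```

For a 5-character string and size 2, this needs 3 rows: two full blocks and a final block holding the single leftover character.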

More efficient way of iterating over every small square in big square array

I'm in my first few months of learning to code in C through a high school program. Someone recently mentioned to me that there's often a way to make code more efficient and I think I have a problem that could be made more efficient. I'm not sure how but I have a hunch that it could be made faster.
We're given a 2D square array of integers with row and col size n. We have subsquares within the 2D square array with row and col size s. We can always assume that s evenly divides n. I've written code to iterate over each subsquare.
Currently my code looks something like this:
int **grid;
int s, i, j, k, l;
// reading in inputs, other processing
for (i = 0; i < n; i += s) {
for (j = 0; j < n; j += s) {
for (k = 0; k < s; k++) {
for (l = 0; l < s; l++) {
printf("%d \n", grid[i + k][j + l]);
}
}
printf("next subsquare: \n");
}
}
As you can see, I've got 4 nested for loops and I feel like it's a bit messy to have it in this format. Is there a better way to do this? Later on I might be summing each subsquare or performing some other operation with each subsquare.
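The four loops already visit every element exactly once, so there is little to gain in speed; what can be improved is readability. One possible sketch, assuming the later goal is to sum each subsquare, moves the two inner loops into a helper. The name sum_subsquare is hypothetical and not part of the original code.

```c
/* Sums the s-by-s subsquare whose top-left corner is (row, col).
   Assumes grid has at least row + s rows and col + s columns. */
long sum_subsquare(int **grid, int row, int col, int s) {
    long total = 0;
    for (int k = 0; k < s; k++) {
        for (int l = 0; l < s; l++) {
            total += grid[row + k][col + l];
        }
    }
    return total;
}
```

The outer two loops then shrink to a call such as sum_subsquare(grid, i, j, s) for each i and j stepping by s.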

Is this implementation of insertion in array incorrect?

I was reading a tutorial at tutorialspoint.com for Data Structures.
In the section about the array data structure there is this implementation of inserting an element into an array, which accesses an element outside the bounds of the array:
int LA[] = {1,3,5,7,8};
int item = 10, k = 3, n = 5;
int i = 0, j = n;
printf("The original array elements are :\n");
for(i = 0; i<n; i++) {
printf("LA[%d] = %d \n", i, LA[i]);
}
n = n + 1;
while( j >= k){
LA[j+1] = LA[j];
j = j - 1;
}
LA[k] = item;
printf("The array elements after insertion :\n");
for(i = 0; i<n; i++) {
printf("LA[%d] = %d \n", i, LA[i]);
}
Isn't accessing an element outside of the bounds of an array undefined behaviour and therefore very bad practice?
If so why is this given in a tutorial?
Isn't accessing an element outside of the bounds of an array undefined behaviour
Yes. Good spotting.
and therefore very bad practice?
I can't argue with that, except maybe to say that it's a little understated. A program that exhibits UB is flat wrong.
If so why is this given in a tutorial?
I can only speculate, but that section of the tutorial is not only wrong, but altogether poorly conceived. C arrays have fixed length, therefore you cannot "insert" into a C array in any sense that preserves all the values that already were there. (I disregard dynamic memory approaches, which are not relevant to the code presented.)
You can use code similar to that presented in the example if you adjust it to avoid reading or writing past the end of the array. Such an approach to "inserting" an element must lose the element originally at the array's end.
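A minimal sketch of that bounded approach is shown here, assuming it is acceptable to drop the last element. The function name insert_drop_last is made up for illustration, and memmove replaces the tutorial's manual shifting loop.

```c
#include <stddef.h>
#include <string.h>

/* Inserts item at index k of a[0..n-1], shifting a[k..n-2] one place right.
   The array keeps its length n, so the old last element a[n-1] is lost.
   Does nothing if k is not a valid index. */
void insert_drop_last(int *a, size_t n, size_t k, int item) {
    if (k >= n)
        return;
    /* Moves n-1-k elements; never reads or writes past a[n-1]. */
    memmove(&a[k + 1], &a[k], (n - 1 - k) * sizeof a[0]);
    a[k] = item;
}
```

Applied to the tutorial's data {1,3,5,7,8} with k = 3 and item = 10, the array becomes {1,3,5,10,7} and the 8 is discarded.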
Of course it's a bad practice. When you try to do
while( j >= k){
LA[j+1] = LA[j];
j = j - 1;
}
... the first iteration (j = 5) reads LA[5] and writes LA[6], although LA has only 5 elements (valid indices 0 to 4). The behaviour is undefined: the program may be aborted with an error and a non-zero exit status (echo $?), meaning it did not finish successfully, or it may appear to work while silently corrupting memory.

C: Program crashes before completing for loop

Hello Stackoverflow crew. I'm a very amateur C programmer and I'm working on a program that reads some input about wedding gifts, and then outputs information that includes the maximum gift value, the minimum gift value, the total average of the gift values, and the average of the gifts that were valued at x > 0. I've finished writing everything, but the program always seems to crash after the first loop. I've been looking at it for the past few hours, so I'm having issues finding what the error might be. Here is the code I have:
#include <stdio.h>
#include <stdlib.h>
int main() {
//Opens the file and creates a pointer for it.
FILE *ifp;
ifp = fopen("gifts.txt", "r");
//Declares the variables
int i, j, k, l, m, n, o, p, q, x, y;
int gift_sets, num_gifts, prices, max_value, max, avg_val, no_zero;
//Scans the file and assigns the first line to variable "gift_sets"
fscanf(ifp, "%d", &gift_sets);
//Begins a for loop that repeats based on the value of gift_sets
for (i = 0; i < gift_sets; i++) {
printf("Wedding Gifts #%d\n", i + 1);
printf("Gift Value\t Number of Gifts\n");
printf("----------\t ---------------\n");
//Scans the price values into the array prices[num_gifts]
fscanf(ifp, "%d", &num_gifts);
int prices[num_gifts];
//Creates a loop through the prices array
for (j = 0; j < num_gifts; j++){
fscanf(ifp, "%d", &prices[j]);
}
//Declares a frequency array
int freq[max + 1];
for (k = 0; k <= max; k++) {
freq[k] = 0;
}
for (l = 0; l < num_gifts; l++) {
freq[prices[l]]++;
}
for (m = 0; m < max + 1; m++) {
if (freq[m] > 0){
printf("%d\t%d",m, freq[m]);
}
}
printf("\n");
//Zeroes the variable "max_val."
int max_val = prices[0];
//Loops through the array to find the maximum gift value.
for (n = 0; n < num_gifts; n++){
if (prices[n] > max_value)
max_value = prices[n];
}
// Zeroes "min_val."
int min_val = prices[0];
//Finds the lowest value within the array.
for(o = 0; o < num_gifts; o++){
if(prices[o] !=0){
if(prices[o] < min_val){
min_val = prices[o];
}
}
}
//Calculates the total number of gifts.
double sum_gifts = 0;
for(p = 0; p < num_gifts; p++){
sum_gifts = sum_gifts + prices[p];
}
//Calculates the average value of all the gifts.
avg_val = (sum_gifts / num_gifts);
//find non zero average
double x = 0;
int y = 0;
for(q = 0; q < num_gifts; q++){
if (prices[q] != 0){
x += prices[q];
y++;
}
}
//Calculates the average value of the gifts, excluding the gifts valued zero.
int no_zero = x / y;
//Prints the maximum gift value.
printf("The maximum gift value is: $%d", max_value);
printf("\n");
//Prints the minimum gift value.
printf("The minimum gift value is: $%d\n", min_val);
//Prints the average of all the gifts.
printf("The average of all gifts was $%.2lf\n",avg_val);
//Prints the no zero average value of the gifts.
printf("The average of all non-zero gifts was $%.2lf",no_zero);
printf("\n\n\n");
}
return 0;
}
Thanks in advance for the help guys. As always, it's much appreciated.
EDIT: To further elaborate, the "crash" is a windows error "gifts.exe has stopped working" when executing the program. It does say at the bottom of the window that "Process returned -1073741819 <0xC0000005>"
When you declare the array with the num_gifts variable, you get a variable-length array: space for num_gifts integers is reserved on the stack at run time, using whatever value num_gifts holds at that point. If the preceding fscanf failed, num_gifts was never set, so the length could be 0 or some other arbitrary value. When you then access the array, you may be accessing an array with zero elements, which could cause an access violation.
I'll tell you one thing you should do, straight away.
Check the return values from fscanf and its brethren. If, for some reason the scan fails, this will return less than you expect (it returns the number of items successfully scanned).
In that case, your data file is not what your code expects.
You should also be checking whether ifp is NULL - that could be the cause here since you blindly use it regardless.
One thing you'll find in IDEs is that you may not be in the directory you think you're in (specifically the one where gifts.txt is).
And, on top of that, max will be set to an arbitrary value (it is never initialized), so that int freq[max+1]; will give you an array of indeterminate size. If that size is less than the largest price, you'll be modifying memory beyond the end of the array with:
freq[prices[l]]++;
That's a definite no-no, "undefined behaviour" territory.
At least at first glance, it looks like you haven't initialized max before you (try to) use it to define the freq array.
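Putting these suggestions together, a hedged sketch could check each read and compute the maximum price before sizing the frequency table. The helper names read_int and build_freq are assumptions for this example, and the table is allocated with calloc instead of as a variable-length array.

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads one int; returns 1 on success, 0 if fp is NULL or the scan fails. */
int read_int(FILE *fp, int *out) {
    return fp != NULL && fscanf(fp, "%d", out) == 1;
}

/* Builds a zero-initialized frequency table indexed by price.
   The maximum price is found first, so the table is always large enough.
   Returns NULL for empty input, a negative price, or allocation failure.
   The caller frees the result. */
int *build_freq(const int *prices, size_t n, int *max_out) {
    if (n == 0)
        return NULL;
    int max = prices[0];
    for (size_t i = 0; i < n; i++) {
        if (prices[i] < 0)
            return NULL;            /* would index before the table */
        if (prices[i] > max)
            max = prices[i];
    }
    int *freq = calloc((size_t)max + 1, sizeof *freq);
    if (freq == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        freq[prices[i]]++;
    *max_out = max;
    return freq;
}
```

In the original program, a failed read_int for gift_sets, num_gifts, or any price would be a reason to print an error and stop rather than continue with indeterminate values.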

Counting Alphabetic Characters That Are Contained in an Array with C

I am having trouble with a homework question that I've been working at for quite some time.
I don't know exactly what the question is asking and need some clarification on that and also a push in the right direction.
Here is the question:
(2) Solve this problem using one single subscripted array of counters. The program uses an array of characters defined using the C initialization feature. The program counts the number of each of the alphabetic characters a to z (only lower case characters are counted) and prints a report (in a neat table) of the number of occurrences of each lower case character found. Only print the counts for the letters that occur at least once. That is do not print a count if it is zero. DO NOT use a switch statement in your solution. NOTE: if x is of type char, x-‘a’ is the difference between the ASCII codes for the character in x and the character ‘a’. For example if x holds the character ‘c’ then x-‘a’ has the value 2, while if x holds the character ‘d’, then x-‘a’ has the value 3. Provide test results using the following string:
“This is an example of text for exercise (2).”
And here is my source code so far:
#include<stdio.h>
int main() {
char c[] = "This is an example of text for exercise (2).";
char d[26];
int i;
int j = 0;
int k;
j = 0;
//char s = 97;
for(i = 0; i < sizeof(c); i++) {
for(s = 'a'; s < 'z'; s++){
if( c[i] == s){
k++;
printf("%c,%d\n", s, k);
k = 0;
}
}
}
return 0;
}
As you can see, my current solution is a little anemic.
Thanks for the help, and I know everyone on the net doesn't necessarily like helping with other people's homework. ;P
char c[] = "This is an example of text for exercise (2).";
int d[26] = {0}, i, value;
for(i=0; i < sizeof(c) - 1; i++){ //-1 to exclude the terminating null character
value = c[i]-'a';
if(value < 26 && value >= 0) d[value]++;
}
for(i=0; i < 26; i++){
if(d[i]) printf("Alphabet-%c Count-%d\n", 'a'+i, d[i]);
}
Corrected. Thanks caf and Leffler.
The intention of the question is for you to figure out how to efficiently convert a character between 'a' and 'z' into an index between 0 and 25. You are apparently allowed to assume ASCII for this (although the C standard does not guarantee any particular character set), which has the useful property that values of the characters 'a' through 'z' are sequential.
Once you've done that, you can increment the corresponding slot in your array d (note that you will need to initialise that array to all-zeroes to begin with, which can be done simply with char d[26] = { 0 };). At the end, you'd scan through the array d and print the counts that are greater than zero, along with the corresponding character (which will involve the reverse transformation, from an index 0 through 25 into a character 'a' through 'z').
Fortunately for you, you do not seem to be required to produce a solution that would work on an EBCDIC machine (mainframe).
Your inner loop needs to be replaced by a conditional:
if (c[i] is lower-case alphabetic)
increment the appropriate count in the d-array
After finishing the string, you then need a loop to scan through the d-array, printing out the letter corresponding to the entry and the count associated with it.
Your d-array uses 'char' for the counts; that is OK for the exercise but you would probably need to use a bigger integer type for a general purpose solution. You should also ensure that it is initialized to all zeros; it is difficult to get meaningful information out of random garbage (and the language does not guarantee that anything other than garbage will be on the stack where the d-array is stored).
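The conditional described above could be packaged as a small function, sketched here under the assumption of an ASCII-like character set where 'a' through 'z' are contiguous. The name count_lower is hypothetical, and int counters are used instead of char so that larger inputs cannot overflow them as easily.

```c
#include <stddef.h>

/* Counts occurrences of each lower-case letter in text.
   counts[0] is for 'a', counts[25] for 'z'; all 26 entries are reset first.
   Assumes 'a'..'z' have consecutive codes, as in ASCII. */
void count_lower(const char *text, int counts[26]) {
    for (size_t i = 0; i < 26; i++)
        counts[i] = 0;
    for (size_t i = 0; text[i] != '\0'; i++) {
        char ch = text[i];
        if (ch >= 'a' && ch <= 'z')
            counts[ch - 'a']++;     /* maps 'a'..'z' to 0..25 */
    }
}
```

A reporting loop would then print 'a' + i and counts[i] for every i whose count is non-zero.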
char c[] = "This is an example of text for exercise (2).";
char d[26];
int i;
int j;
for(i = 0; i < 26; i++)
{
d[i] = 0; // Set the frequency of the letter to zero before we start counting.
for(j = 0; j < strlen(c); j++)
{
if(c[j] == i + 'a')
d[i]++;
}
if(d[i] > 0) // If the frequency of the letter is greater than 0, show it.
printf("%c - %d\n", (i + 'a'), d[i]);
}
.
for(s = 'a'; s <= 'z'; s++){ // <= so that 'z' is counted too
j=0;
for(i = 0; i < sizeof(c); i++) {
if( c[i] == s )
j++;
}
if (j > 0)
printf("%c,%d\n", s, j);
}

Resources