Segmentation Fault with fread() from my created binary file - c

I'm new to c from python. I'm trying to write two c scripts, one that reads a plain-text file in FASTA format (for DNA/RNA/protein sequences). They look like this...
>sequence1
ATCTATGTCGCTCGCTCGAGAGCTA
>sequence2
CGTCGCTGGGATCGATTTCGATAGCT
>sequence3
AAATATAACTCGCTAGCTCGATCGATC
>sequence4
CTCTCTCCTCTCTCTATATAGGGG
...where individual sequences are separated by ">" characters. Within each sequence, the actual sequence and its label are separated by a newline character. (ie ">label \n sequence"). The script for reading the plain-text and then writing it to a binary file seems to work. However, when I try to read the binary file and print its contents, I get a Segmentation Fault (Core dump).
I tried to produce a reduced example for posting here, but that example seems to work without error. So, I feel forced to attach my whole code snippets here. I must be missing something.
Here's the first script which reads in a plain text fasta file, splits it first by the ">" character, and then by the newline character, to make "sequence" structures for each sequence in the above FASTA file. These structures are then written to "your_sequences.bin".
#include <stdio.h>
#include <string.h>
#define BUZZ_SIZE 1024
struct sequence {
char *sequence;
char *label;
};
int main(int argc, char *argv[]) {
FILE *fptr;
char buffer[BUZZ_SIZE];
char fasta[BUZZ_SIZE];
char *token;
char *seqs[3];
int idx = 0;
const char fasta_delim[2] = ">";
const char newline[3] = "\n";
/* Read-in plain-text */
fptr = fopen(argv[1],"r");
while (fgets(buffer, BUZZ_SIZE, fptr) != NULL) {
strcat(fasta, buffer);
}
fclose(fptr);
/* Process text, first by splitting by > and then by \n for each sequence, and then write to binary */
FILE *out;
out = fopen("your_sequences.bin","wb");
struct sequence final_entry;
token = strtok(fasta,fasta_delim);
while (token != NULL) {
seqs[idx++] = token;
token = strtok(NULL,fasta_delim);
}
for (idx=0; idx<4; idx++) {
token = strtok(seqs[idx],newline);
char *this_seq[1];
int p = 0;
while (token != NULL) {
this_seq[p] = token;
token = strtok(NULL,newline);
p++;
}
final_entry.label = this_seq[0];
final_entry.sequence = this_seq[1];
printf("%s\n%s\n\n", final_entry.label, final_entry.sequence);
fwrite(&final_entry, sizeof(struct sequence), 1, out);
}
fclose(out);
return(0);
}
This outputs, as expected from the printf() statement toward the bottom:
sequence1
ATCTATGTCGCTCGCTCGAGAGCTA
sequence2
CGTCGCTGGGATCGATTTCGATAGCT
sequence3
AAATATAACTCGCTAGCTCGATCGATC
sequence4
CTCTCTCCTCTCTCTATATAGGGG
I'm thinking the error has to be somewhere in the above script (ie my binary file is messed up), because the Segmentation Fault is caused by the fread() statement in the script below. I don't think I've made an error in calling fread(), but maybe I'm wrong.
#include <stdio.h>
#define BUZZ_SIZE 1024
struct sequence {
char *sequence;
char *label; };
int main(int argc, char *argv[]) {
struct sequence this_seq;
int n;
FILE *fasta_bin;
fasta_bin = fopen(argv[1],"rb");
for (n=0;n<4;n++) {
fread(&this_seq, sizeof(struct sequence), 1, fasta_bin);
printf (">%s\n%s\n", this_seq.label, this_seq.sequence);
}
fclose(fasta_bin);
return(0);
}
This outputs the segmentation fault
[1] 8801 segmentation fault (core dumped)
I've tinkered around with this and gone over it a good amount over the past couple of hours. I hope I haven't made some stupid mistake and wasted your time!
Thanks for your help.

"I'm thinking the error has to be somewhere in the above script (ie my binary file is messed up),"
Sort of.
"because the Segmentation Fault is caused by the fread() statement in the script below."
I'm fairly confident that the error occurs not in the fread() but in the following printf().
"I don't think I've made an error in calling fread(), but maybe I'm wrong."
Your fread() corresponds to the fwrite(). There is every reason to expect that you will accurately read back what was written. The main problem here is a common one for C neophytes: you've misunderstood the nature of C strings (a null-terminated array of char), and failed to appreciate the crucial, but subtle, distinction between arrays and pointers.
To expand on that a bit, C does not have a first-class string data type. Instead, the standard library provides "string" functions that operate on sequences of objects of type char, where the end of the sequence is marked by a terminator char with the value 0. Such sequences typically are contained in char arrays, and always can be treated as if they were. Because that's what the standard library supports, that convention is ubiquitously used in programs and third-party libraries, too.
C, however, has no mechanism for passing arrays to functions or receiving them as return values. Nor do the assignment operator or most others work on arrays -- not even the indexing operator, []. Instead, in most contexts, values of array type are automatically converted to pointers to the first array element, and these can be passed around and used as operands to a wide variety of operators. Seeing (part of) this, inexperienced C programmers often mistakenly identify strings with such pointers instead of with the pointed-to data.
Of course a pointer value is just an address. You can copy it around and store it at any number of locations in the program, but this does nothing to the pointed-to data. And now I finally come around to the point: you can also write out a pointer value and read it back in, as your programs do, but it is rarely useful to do so, because the pointed-to data don't come along when you do that. Unless you read the pointer back into the same process that wrote it, the read-back pointer value is unlikely to be valid, and it certainly does not have the same significance it did in the program that wrote it.
You must instead write the pointed-to data, but you have to choose a format. In particular, titles and sequences generally have varying lengths, and one of the key things you need to decide is how, if at all, your binary format should reflect that. If I might be so bold, however, I have a suggestion for a well-defined format you could use: Fasta format! Seriously.
There's not much you can do short of data compression to express fasta-format data more compactly, as that format does little more than it needs to do to express the varying-length data it conveys. The question you need to answer, then, is what exactly you're trying to achieve by your reformatting -- both the reason for reformatting at all, and based on that, what your target format actually is.
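If a binary format is still wanted, one common choice is to write the pointed-to bytes with a length prefix. The following is only a minimal sketch of that idea, not the asker's format: write_string and read_string are hypothetical helper names, and the length is stored in the host's byte order, so the file is only portable between machines with the same endianness.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Write one string as a 32-bit length followed by its bytes (no terminator).
   Returns 0 on success, -1 on a write error. */
int write_string(FILE *out, const char *s) {
    uint32_t len = (uint32_t)strlen(s);
    if (fwrite(&len, sizeof len, 1, out) != 1) return -1;
    if (len > 0 && fwrite(s, 1, len, out) != len) return -1;
    return 0;
}

/* Read one string written by write_string(). The caller must free() it.
   Returns NULL on end of file, read error, or allocation failure. */
char *read_string(FILE *in) {
    uint32_t len;
    if (fread(&len, sizeof len, 1, in) != 1) return NULL;
    char *s = malloc((size_t)len + 1);
    if (s == NULL) return NULL;
    if (len > 0 && fread(s, 1, len, in) != len) {
        free(s);
        return NULL;
    }
    s[len] = '\0'; /* restore the terminator that was not stored */
    return s;
}
```

With helpers like these, the writer would call write_string() once for the label and once for the sequence, and the reader would call read_string() twice per record, instead of passing the struct of pointers to fwrite() and fread().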

You are getting a segmentation fault because your program uses pointers without allocating memory for them:
printf (">%s\n%s\n", this_seq.label, this_seq.sequence);
You first need to allocate memory to this_seq.label and this_seq.sequence pointers, something like this:
this_seq.sequence = malloc(size_of_sequence);
if (this_seq.sequence == NULL)
exit(EXIT_FAILURE);
this_seq.label = malloc(size_of_label);
if (this_seq.label == NULL)
exit(EXIT_FAILURE);
and then read the data into them, like this:
fread(this_seq.sequence, size_of_sequence, 1, fasta_bin);
fread(this_seq.label, size_of_label, 1, fasta_bin);

The problem is that struct sequence doesn't actually carry any salvageable information, it only contains pointers.
Pointers carry memory addresses, they point where the actual information is in memory, but of course if you are reading the file in another process with an entirely different memory space, the information won't be there. In fact, you are likely to crash for trying to interact with memory space that wasn't properly initialized first.
A very simple solution is, don't use pointers, use arrays:
struct sequence
{
char sequence[1024];
char label[1024];
};
Now the structure actually carries the data, no longer just pointers. You will be able to read and write it to file with no worries. However, some code will need to be changed further.
You can no longer assign data to them like x.label = label, you need to use strcpy(), like strcpy(x.label, label). Those changes will need to be made everywhere in the code where you assign values to the properties of this structure.
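Here is a rough sketch of how the array-based struct could be written and read back. FIELD_SIZE, set_field, write_record and read_record are names made up for this illustration; snprintf is used instead of strcpy so that over-long input is truncated rather than overflowing the field.

```c
#include <stdio.h>
#include <string.h>

#define FIELD_SIZE 1024

struct sequence {
    char sequence[FIELD_SIZE];
    char label[FIELD_SIZE];
};

/* Copy at most FIELD_SIZE - 1 characters into a field and always terminate. */
void set_field(char *dst, const char *src) {
    snprintf(dst, FIELD_SIZE, "%s", src);
}

/* Fill a record and write it as one fixed-size block. Returns 0 on success. */
int write_record(FILE *out, const char *label, const char *seq) {
    struct sequence rec;
    memset(&rec, 0, sizeof rec); /* avoid writing uninitialized bytes */
    set_field(rec.label, label);
    set_field(rec.sequence, seq);
    return fwrite(&rec, sizeof rec, 1, out) == 1 ? 0 : -1;
}

/* Read one fixed-size record. Returns 0 on success, -1 on EOF or error. */
int read_record(FILE *in, struct sequence *rec) {
    return fread(rec, sizeof *rec, 1, in) == 1 ? 0 : -1;
}
```

The trade-off is that every record takes 2048 bytes on disk regardless of how short the sequence is, and anything longer than 1023 characters is cut off.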

Related

How do you assign a string in C

Printing the initials (first character) of the string held in the variable 'fn' and the variable 'ln'
#include <stdio.h>
#include <cs50.h>
int main(void)
{
string fn, ln, initials;
fn = get_string("\nFirst Name: ");
ln = get_string("Last Name: ");
initials = 'fn[0]', 'ln[0]';
printf("%s", initials)
}
Read more about C. In particular, read some good C programming book, and some C reference site and read the C11 standard n1570. Notice that cs50.h is not a standard C header (and I never encountered it).
The string type does not exist in standard C. So your example doesn't compile and is not valid C code.
An important (and difficult) notion in C is undefined behavior (UB). I won't explain what it is here, but see this, read much more about UB, and be really afraid of UB.
Even if you (wrongly) add something like
typedef char* string;
(and your cs50.h might do that) you need to understand that:
not every pointer is valid, and some pointers may contain an invalid address (such as NULL, or most random addresses; in particular an uninitialized pointer variable often has an invalid pointer). Be aware that in your virtual address space most addresses are invalid. Dereferencing an invalid pointer is UB (often, but not always, giving a segmentation fault).
even when a pointer to char is valid, it could point to something which is not a string (e.g. some sequence of bytes which is not NUL terminated). Passing such a pointer (to non-string data) to string-related functions (e.g. strlen, or printf with %s) is UB.
A string is a sequence of bytes, with additional conventions: at the very least it should be NUL terminated and you generally want it to be a valid string for your system. For example, my Linux is using UTF-8 (in 2017 UTF-8 is used everywhere) so in practice only valid UTF-8 strings can be correctly displayed in my terminals.
Arrays are decayed into pointers (read more to understand what that means, it is tricky). So in several occasions you might declare an array variable (a buffer)
char buf[50];
then fill it, perhaps using strcpy like
strcpy(buf, "abc");
or using snprintf like
int xx = something();
snprintf(buf, sizeof(buf), "x%d", xx);
and later you can use it as a "string", e.g.
printf("buf is: %s\n", buf);
In some cases (but not always!), you might even do some array accesses like
char c=buf[4];
printf("c is %c\n", c);
or pointer arithmetic like
printf("buf+8 is %s\n", buf+8);
BTW, since stdio is buffered, I recommend ending your printf control format strings with \n or using fflush.
Beware and be very careful about buffer overflows. It is another common cause of UB.
You might want to declare
char initials[8];
and fill that memory zone to become a proper string:
initials[0] = fn[0];
initials[1] = ln[0];
initials[2] = (char)0;
the last assignment (to initials[2]) is putting the NUL terminating byte and makes that initials buffer a proper string. Then you could output it using printf or fputs
fputs(initials, stdout);
and you had better output a newline with
putchar('\n');
(or you might just do puts(initials); ....)
Please compile with all warnings and debug info, so gcc -Wall -Wextra -g with GCC. Improve your code to get no warnings. Learn how to use your compiler and your debugger gdb. Use gdb to run your program step by step and query its state. Take time to read the documentation of every standard function that you are using (e.g. strcpy, printf, scanf, fgets) even if at first you don't understand all of it.
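Putting the steps above together, a small helper might look like the sketch below. make_initials is a hypothetical name, and the function assumes the caller supplies a buffer of at least 3 chars; it does not depend on cs50.h.

```c
#include <stddef.h>

/* Build a two-letter string from the first characters of two names.
   out must have room for at least 3 chars (two letters plus the NUL byte).
   Returns 0 on success, -1 if either name is NULL or empty. */
int make_initials(const char *first, const char *last, char *out) {
    if (first == NULL || last == NULL || first[0] == '\0' || last[0] == '\0')
        return -1;
    out[0] = first[0];
    out[1] = last[0];
    out[2] = '\0'; /* the terminator is what makes this a proper string */
    return 0;
}
```

A caller would declare char initials[3]; and, if make_initials(fn, ln, initials) returns 0, print the result with puts(initials).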
char initials[]={ fn[0], ln[0], '\0'};
This will form the char array and you can print it with
printf("%s", initials); // This is a string: a null-terminated character array.
There is no string datatype in C. We simulate it using a null-terminated character array.
If you don't put the \0 at the end, it won't be a null-terminated char array, and to print it you will have to index into the array and print the individual characters. (You can't pass it to printf with %s or to the standard string functions.)
char s[]={'h','i'}; // not null terminated
//But you can work with this, iterating over the elements.
for(size_t i=0; i< sizeof s; i++)
printf("%c",s[i]);
To explain further: since there is no string datatype in C, you simulate it using char [], and that is sufficient for this work.
For example, you have to do this to get a string
char fn[MAXLEN], ln[MAXLEN];
Reading an input can be like :
if(!fgets(fn, MAXLEN,stdin) ){
fprintf(stderr,"Error in input");
}
Do similarly for the second char array.
And then you initialize the array initials.
char initials[]={fn[0],ln[0],'\0'};
The benefit of the null-terminated char array is that you can pass it to the functions which work on char* and get a correct result, like strcmp() or strcpy().
Also, there are lots of ways to get input from stdin, and it is always better to check the return value of the standard functions that you use.
The standard doesn't require that all char arrays be null terminated. But if we don't do it that way, the array is hardly useful in common cases. The array I showed earlier (without the null terminator) can't be passed to strlen() or strcpy() etc.
Also, knowingly or unknowingly, you have used something interesting: the comma operator.
Suppose you write a statement like this
char initialChar = fn[0] , ln[0]; //This is error
char initialChar = (fn[0] , ln[0]); // This is correct and the result will be `ln[0]`
The , operator first evaluates the first expression fn[0], then moves on to the second, ln[0], and the value of the second becomes the value of the whole expression, which is assigned to initialChar.
You can check these helpful links to get you started
Beginner's Guide Away from scanf()
How to debug small programs

c programming "Access violation writing location 0x00000000."

I'm working on a program in which I need to get a string from the user and do some manipulation on it.
The problem is when my program ends I am getting "Access violation writing location 0x00000000." error message.
This is my code
}
//code
char *s;
s=gets();
//code
}
After some reading I realized that using gets() may cause some problems, so I changed char *s to s[20] just to check it out, and it worked fine without any errors at the end of the program.
The thing is that I don't know the string size in advance, and I'm not allowed (academic exercise) to declare the string with a huge size such as s[1000].
So I have no other choice but using gets() function.
Any way to solve my problem?
Thanks in advance
PS
Also tried using malloc as
char *temp;
char *s;
temp = gets();
s= (char*)malloc((strlen(temp) +1)* sizeof(char));
Error still popup at the end.
As long as I have *something = gets(); my program will throw an error at the end.
It looks like you are expecting gets to allocate an appropriately-sized string and return a pointer to it but that is not how it works. gets needs to receive the buffer as a parameter so you would still need to declare the array with a huge number. In fact, I am surprised that you managed to get your code to compile since you are passing the wrong number of arguments to gets.
char s[1000];
if (gets(s) == NULL) {
// handle error
}
The return value of gets is the same pointer that you passed as a parameter to it. The only use of the return value is to check for errors, since gets will return NULL if it reached the end of file before reading any characters.
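A function that works more similarly to what you want is getline, available in the GNU libc and specified by POSIX:
char *s = NULL; // must be NULL (or a malloc'd buffer) before the call
size_t n=0;
if (getline(&s, &n, stdin) != -1) {
    printf("%s", s); // Use the string here
}
free(s); //Then free it when done.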
Alternatively, you could do something similar using malloc and realloc inside a loop. Malloc a small buffer to start out then use fgets to read into that buffer. If the whole line fits inside the buffer you are done. If it didn't then you realloc the buffer to something larger (multiply its size by a constant factor each time) and continue reading from where you stopped.
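The malloc/realloc loop described above could be sketched roughly as follows. read_line is a hypothetical name, the starting capacity of 16 is arbitrary, and the cast to int assumes lines stay well below INT_MAX bytes.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read one line of any length from in, using only standard C.
   Returns a malloc'd string without the trailing newline; the caller frees it.
   Returns NULL at end of file (nothing read) or if allocation fails. */
char *read_line(FILE *in) {
    size_t cap = 16, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL) return NULL;

    while (fgets(buf + len, (int)(cap - len), in) != NULL) {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n') {
            buf[len - 1] = '\0'; /* whole line read: strip the newline */
            return buf;
        }
        /* No newline yet: double the buffer and continue where we stopped. */
        size_t new_cap = cap * 2;
        char *tmp = realloc(buf, new_cap);
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;
        cap = new_cap;
    }

    if (len == 0) { /* end of file or error before any character */
        free(buf);
        return NULL;
    }
    return buf; /* last line of the input had no trailing newline */
}
```

Doubling the capacity keeps the number of realloc calls logarithmic in the line length, which is why a constant factor is preferred over growing by a fixed amount.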
Another approach is to give up on reading arbitrarily large lines. The simplest thing you can do in C is to set up a maximum limit for line length (say, 255 characters), use fgets to read up to that number of characters and then abort with an error if you are given a line that is longer than that. This way you stick to functions in the standard library and you keep your logic as simple as possible.
You have not allocated temp.
Also, here are three things you should avoid in C:
void main(): use int main() instead
fflush(stdin)
gets(): use fgets() instead

How does a program shut down when reading farther than memory allocated to an array?

Good evening everybody, I am learning C++ on Dev C++ 5.9.2, and I am a real novice at it. I intentionally make my programs crash to get a better understanding of bugs. I've just learned that we can pass a char string to a function by initializing a pointer with the address of the array, and that this is the only way to do it. Therefore we should always pass the size of that string to the function so it can handle it properly. It also means that any procedure can run with a wrong size passed as an argument, hence I supposed we could read farther than the memory allocated to the string.
But how far can we do it? I've tested several integers and apparently it works fine below 300 bytes but not above 1000 (the program displays characters but ends up crashing). So my questions are:
How far can we read or write on the string out of its memory range?
Is it called an overflow?
How does the program detect that the procedure is doing something illegitimate?
Is it, the console or the code behind 'cout', that conditions the shutting down of the program?
What is the condition for the program to stop?
Does the limit depend on the console or the OS?
I hope my questions don't sound too trivial. Thank you for any answer. Good day.
#include <iostream>
using namespace std;
void change(char str[])
{
str[0] = 'C';
}
void display(char str[], int lim)
{
for(int i = 0; i < lim; i++) cout << str[i];
}
int main ()
{
char mystr[] = "Hello.";
change(mystr);
display(mystr, 300);
system("PAUSE");
return 0;
}
The behavior when you read past the end of an array is undefined. The compiler is free to implement the read operation in whatever way works correctly when you don't read beyond the end of the buffer, and then if you do read too far - well, whatever happens is what happens.
So there are no definite answers to most of your questions. The program could crash as soon as you read 1 byte too far, or you could get random data as you read several megabytes and never crash. If you crash, it could be for any number of reasons - though it likely will have something to do with how memory is managed by the OS.
As an aside, the normal way to let a function know where a string ends is to end it with a null character rather than passing a separate length value.

How strcpy works behind the scenes?

This may be a very basic question for some. I was trying to understand how strcpy works actually behind the scenes. for example, in this code
#include <stdio.h>
#include <string.h>
int main ()
{
char s[6] = "Hello";
char a[20] = "world isnsadsdas";
strcpy(s,a);
printf("%s\n",s);
printf("%d\n", sizeof(s));
return 0;
}
Since I am declaring s to be a static array with a size less than that of the source, I thought it wouldn't print the whole word, but it did print world isnsadsdas. So I thought that the strcpy function might be allocating a new size if the destination is smaller than the source. But when I check sizeof(s), it is still 6, yet it is printing out more than that. How is that actually working?
You've just caused undefined behaviour, so anything can happen. In your case, you're getting lucky and it's not crashing, but you shouldn't rely on that happening. Here's a simplified strcpy implementation (but it's not too far off from many real ones):
char *strcpy(char *d, const char *s)
{
char *saved = d;
while (*s)
{
*d++ = *s++;
}
*d = 0;
return saved;
}
sizeof is just returning you the size of your array from compile time. If you use strlen, I think you'll see what you expect. But as I mentioned above, relying on undefined behaviour is a bad idea.
http://natashenka.ca/wp-content/uploads/2014/01/strcpy8x11.png
strcpy is considered dangerous for reasons like the one you are demonstrating. The two buffers you created are local variables stored in the stack frame of the function. Here is roughly what the stack frame looks like:
http://upload.wikimedia.org/wikipedia/commons/thumb/d/d3/Call_stack_layout.svg/342px-Call_stack_layout.svg.png
FYI things are put on top of the stack meaning it grows backwards through memory (This does not mean the variables in memory are read backwards, just that newer ones are put 'behind' older ones). So that means if you write far enough into the locals section of your function's stack frame, you will write forward over every other stack variable after the variable you are copying to and break into other sections, and eventually overwrite the return pointer. The result is that if you are clever, you have full control of where the function returns. You could make it do anything really, but it isn't YOU that is the concern.
As you seem to know by making your first buffer 6 chars long for a 5 character string, C strings end in a null byte \x00. The strcpy function copies bytes until the source byte is 0, but it does not check that the destination is that long, which is why it can copy over the boundary of the array. This is also why your print is reading the buffer past its size: it reads until \x00. Interestingly, the strcpy may have written into the data of a, depending on the order the compiler gave the arrays on the stack, so a fun exercise could be to also print a and see if you get something like 'snsadsdas'. I can't be sure what it would look like even if it is polluting a, because there are sometimes bytes in between the stack entries for various reasons.
If this buffer holds say, a password to check in code with a hashing function, and you copy it to a buffer in the stack from wherever you get it (a network packet if a server, or a text box, etc) you very well may copy more data from the source than the destination buffer can hold and give return control of your program to whatever user was able to send a packet to you or try a password. They just have to type the right number of characters, and then the correct characters that represent an address to somewhere in ram to jump to.
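You can use strcpy if you check the bounds and maybe trim the source string, but it is considered bad practice. There are functions that take a maximum length, such as strncpy (http://www.cplusplus.com/reference/cstring/strncpy/), though note that strncpy does not null-terminate the destination when the source is at least as long as the limit.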
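As a sketch of what a bounds-checked copy can look like, here is a small helper that truncates and always terminates. copy_bounded is a hypothetical name, not a standard function; its return convention (the full source length) is borrowed from the style of snprintf so the caller can detect truncation.

```c
#include <stddef.h>
#include <string.h>

/* Copy src into dst, which has room for dst_size bytes.
   Truncates if necessary and always null-terminates when dst_size > 0.
   Returns strlen(src); a result >= dst_size means the copy was truncated. */
size_t copy_bounded(char *dst, size_t dst_size, const char *src) {
    size_t src_len = strlen(src);
    if (dst_size > 0) {
        size_t n = src_len < dst_size - 1 ? src_len : dst_size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return src_len;
}
```

With the arrays from the question, copy_bounded(s, sizeof s, a) would leave "world" in s and return 16, which tells the caller that the 6-byte buffer was too small.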
Oh and lastly, this is all called a buffer overflow. Some compilers add a small block of bytes, chosen at random when the program starts (a stack canary), between a function's local buffers and its return address. Before the function returns, the generated code checks these bytes against a saved copy and terminates the program if they differ. This solves a lot of security problems, but it is still possible to copy bytes far enough into the stack to overwrite the pointer to the function that handles what happens when those bytes have been changed, thus letting you do the same thing. It just becomes a lot harder to do right.
In C there is no bounds checking of arrays; it's a trade-off in order to have better performance at the risk of shooting yourself in the foot.
strcpy() doesn't care whether the target buffer is big enough, so copying too many bytes will cause undefined behavior.
That is one of the reasons C11 added an optional bounds-checked variant, strcpy_s(), where you specify the target buffer size.
Note that sizeof(s) is determined at compile time. Use strlen() to find the number of characters stored in s. When you perform strcpy(), the destination string is replaced by the source string, so your output won't be "Helloworld isnsadsdas".
#include <stdio.h>
#include <string.h>
int main(void)
{
char s[20] = "Hello"; /* large enough to hold the contents of a */
char a[20] = "world isnsadsdas";
strcpy(s,a);
printf("%s\n",s);
printf("%zu\n", strlen(s));
return 0;
}
You are relying on undefined behaviour, inasmuch as the compiler has chosen to place the two arrays where your code happens to work. This may not work in future.
As to the sizeof operator, this is figured out at compile time.
Once you use adequate array sizes you need to use strlen to fetch the length of the strings.
The best way to understand how strcpy works behind the scene is...reading its source code!
You can read the source for GLibC : http://fossies.org/dox/glibc-2.17/strcpy_8c_source.html . I hope it helps!
At the end of every string/character array there is a null terminator character '\0' which marks the end of the string/character array.
strcpy() performs its task until it sees the '\0' character.
printf() also performs its task until it sees the '\0' character.
sizeof() on the other hand is not interested in the content of the array, only its allocated size (how big it is supposed to be), thus not taking into consideration where the string/character array actually ends (how big it actually is).
As opposed to sizeof(), there is strlen() that is interested in how long the string actually is (not how long it was supposed to be) and thus counts the number of characters until it reaches the end ('\0' character) where it stops (it doesn't include the '\0' character).
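To make the difference concrete, here is a tiny sketch that reports both numbers for the same array. The struct sizes type and the measure function are invented for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Capacity of an array versus the length of the string stored in it. */
struct sizes {
    size_t capacity; /* from sizeof: fixed when the array is declared */
    size_t length;   /* from strlen: counts characters up to the '\0' */
};

struct sizes measure(void) {
    char s[20] = "Hello";
    struct sizes r;
    r.capacity = sizeof s; /* 20, regardless of the contents */
    r.length = strlen(s);  /* 5, the terminator is not counted */
    return r;
}
```

Note that sizeof only gives the array size where the array itself is in scope; inside a function that receives a char* parameter, sizeof yields the size of the pointer instead.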
Better Solution is
char *strcpy(char *p,char const *q)
{
char *saved=p;
while(*p++=*q++);
return saved;
}

How do I transfer text between different functions in C?

I have two .c files that I would like to compile into one executable. Having just started learning C, I'm finding it tricky to know how to transfer text as an argument between functions (which I've found to be incredibly simple in every other language).
Here are the two files:
Program.c
#include <stdio.h>
#include <string.h>
int main(){
char temp[40] = stringCall();
printf("%s",temp);
}
StringReturn.c
#include <stdio.h>
#include <string.h>
char[] stringCall(){
char toReturn[40] = "Hello, Stack Overflow!";
return toReturn;
}
I usually get a problem that says something like "Segmentation Failed (core dumped)" or alike. I've done a lot of Googling and, amazingly, I can't really find a solution, and certainly no simple tutorial along the lines of "This is how to move text between functions".
Any help would be appreciated :)
char toReturn[40] = "Hello, Stack Overflow!";
return toReturn;
This is invalid, you're returning an auto array which gets out of scope after the function returns - this invokes undefined behavior. Try returning the string literal itself (it's valid throughout the program):
return "Hello, Stack Overflow!";
Or a dynamic duplicate of your array:
char toReturn[40] = "Hello, Stack Overflow!";
return strdup(toReturn);
In this latter case, you'll need to free() the string in the caller function.
You are correct, this isn't simple in C because C can't treat strings as values.
The most flexible way to do it is this:
// Program.c
char temp[40];
if (stringCall(temp, sizeof temp) <= sizeof temp) {
puts(temp);
} else {
// error-handling
puts("My buffer wasn't big enough");
}
// StringReturn.c
int stringCall(char *buf, int size) {
const char toReturn[] = "Hello, Stack Overflow!";
if (sizeof toReturn <= size) {
strcpy(buf, toReturn);
}
return sizeof toReturn;
}
stringCall doesn't return the string data, it writes it to a buffer supplied by the caller
A buffer always comes with a size. You could use size_t or ssize_t for this rather than int.
stringCall checks the size before writing. There are other ways to do this in a single call, all of which are either not in C89 and C99, or are defective in some other way. C11 introduces strcpy_s. Use what tools you can, which should be logically equivalent to checking it yourself as in my code above. Never forget to make sure there's space for the nul terminator, which invisibly lurks at the end of every C string.
stringCall returns the number of bytes that it wants to write, even if it doesn't write anything. This means that a caller whose buffer is too small can allocate a bigger one and try again. For that matter a caller can do stringCall(NULL, 0) to get the size without trying any particular buffer.
I'm using sizeof here because I'm using arrays whose size is known by the compiler, but in practice stringCall might use strlen or some other way of knowing how much data it wants to write.
As I've written it, callers are required not to pass in a negative value for size. That's usually OK, because their buffer in point of fact cannot have a negative size, so their code is already buggy if they do. But if you want to be really sure, or if you want to help your callers catch those bugs, you could write if ((int) sizeof toReturn <= size) or if (size > 0 && sizeof toReturn <= size).
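The "ask for the size, allocate, then call again" usage mentioned in the list above might look like the sketch below. greet and greet_alloc are hypothetical names; greet plays the role of stringCall but generates its text with snprintf, which may be called with a NULL buffer and size 0 to measure the output.

```c
#include <stdio.h>
#include <stdlib.h>

/* Writes a greeting into buf (truncating if size is too small) and returns
   the number of bytes needed, including the terminator, or 0 on error. */
size_t greet(char *buf, size_t size, const char *name) {
    int n = snprintf(buf, size, "Hello, %s!", name);
    return n < 0 ? 0 : (size_t)n + 1;
}

/* Caller side: measure first, then allocate exactly enough and call again.
   Returns a malloc'd string that the caller must free(), or NULL on failure. */
char *greet_alloc(const char *name) {
    size_t need = greet(NULL, 0, name); /* first call writes nothing */
    if (need == 0) return NULL;
    char *buf = malloc(need);
    if (buf == NULL) return NULL;
    greet(buf, need, name); /* second call fills the buffer */
    return buf;
}
```

The cost is that the text is generated twice, which is the usual trade-off of this interface compared with returning a strdup'd string directly.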
"Most flexible" isn't always best, but this is good when the function is actually generating the text on the fly, especially when there isn't an easy way for the caller to know the length in advance without doing half the work that stringCall is supposed to do for them. The interface is similar to the C99 standard function snprintf.

Resources