Convert string to boolean array - c

I need to convert a string that consists of about a million 'zero' or 'one' characters (1039680 characters, to be specific) to a boolean array. The way I have it now takes a few seconds for a 300000-character string, and that is too long. I need to be able to do the whole million-character conversion in less than a second.
The way I tried to do it was to read a file with one line of (in this trial case) 300000 zeros.
I know my code will act funky for strings that contain stuff other than zeros or ones, but I know that the string will only contain those.
I also looked at atoi, but I don't think it would suit my needs.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#define BUFFERSIZE 1039680
int main ()
{
int i ;
char buffer[BUFFERSIZE];
bool boolList[BUFFERSIZE] ;
// READ FILE WITH A LOT OF ZEROS
FILE *fptr;
if ((fptr=fopen("300000zeros.txt","r"))==NULL){
printf("Error! opening file\n");
exit(1);
}
fscanf(fptr,"%[^\n]",buffer);
fclose(fptr);
// CONVERT STRING TO BOOLEAN ARRAY
for (i=0 ; i<strlen(buffer) ; i++) {
if (buffer[i] == '1') boolList[i] = 1 ;
}
return 0;
}

Try
char *sptr = buffer;
bool *bptr = boolList;
while (*sptr != '\0')
*bptr++ = *sptr++ == '1'? 1:0;

If the string length is always 1039680 characters, as you said, then why do you use strlen(buffer) in your code? Why not just loop BUFFERSIZE times? And if the string length can vary, you should cache the length in a variable, as others said, instead of calling strlen again on every iteration of the loop.
More importantly, you haven't included space for the null terminator in the buffer. When exactly BUFFERSIZE characters are read, there is no room for the terminator inside the array, so the char array is not a valid null-terminated string and calling strlen on it invokes undefined behavior.
If you want to read the file as text, then you must add one more char to buffer:
char buffer[BUFFERSIZE + 1];
Otherwise, open the file as binary and read the whole 1039680-byte block at once. That will be much faster:
fread(buffer, sizeof(buffer[0]), BUFFERSIZE, fptr);
Then just loop over the BUFFERSIZE bytes and convert each one from '0' or '1' to 0 or 1 without a branch:
for (i = 0 ; i < BUFFERSIZE; i++)
{
buffer[i] -= '0';
}
You don't need a separate boolList. Just use buffer as the boolean array, or rename it to boolList and drop the second array.
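If a separate bool array is still wanted, the conversion can be sketched as a single pass over a known length, with no strlen call in the loop condition. This is only a minimal sketch: the helper name chars_to_bools is made up for illustration, and it assumes every input byte is '0' or '1', as the question states.

```c
#include <stdbool.h>
#include <stddef.h>

/* Convert n characters of '0'/'1' text into bools.
 * Assumes every input byte is '0' or '1' (hypothetical helper). */
void chars_to_bools(const char *src, bool *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* The comparison itself yields 0 or 1, so no if/else is needed. */
        dst[i] = (src[i] == '1');
    }
}
```

It would be called as chars_to_bools(buffer, boolList, BUFFERSIZE). Note that two local arrays of about 1 MB each may not fit on the stack on every platform, so making them static or allocating them with malloc is probably safer.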

Related

Appending Characters to an Empty String in C

I'm relatively new to C, so any help understanding what's going on would be awesome!!!
I have a struct called Token that is as follows:
//Token struct
struct Token {
char type[16];
char value[1024];
};
I am trying to read from a file and append characters read from the file into Token.value like so:
struct Token newToken;
char ch;
ch = fgetc(file);
strncat(newToken.value, &ch, 1);
THIS WORKS!
My problem is that Token.value begins with several values I don't understand, preceding the characters that I appended. When I print the result of newToken.value to the console, I get #�����TheCharactersIWantedToAppend. I could probably figure out a band-aid solution to retroactively remove or work around these characters, but I'd rather not if I don't have to.
In analyzing the � characters, I see them as (in order from index 1-5): \330, \377, \377, \377, \177. I read that \377 is a special character for EOF in C, but also 255 in decimal? Do these values make up a memory address? Am I adding the address to newToken.value by using &ch in strncat? If so, how can I keep them from getting into newToken.value?
Note: I get a segmentation fault if I use strncat(newToken.value, ch, 1) instead of strncat(newToken.value, &ch, 1) (ch vs. &ch).
I'll try to consolidate the answers already given in the comments.
This version of the code uses strncat(), like yours, but fixes the problems noted by Nick (we must initialize the target) and Dúthomhas (the second parameter to strncat() must be a string, not a pointer to a single char). (Yes, a "string" is actually a char[] and the value passed to the function is a char*, but it must point to an array of at least two chars, the last one containing a '\0'.)
Please be aware that strncat(), strncpy() and related functions are tricky. They don't copy more than N chars from the source. But strncpy() only adds the final '\0' to the target string when the source has fewer than N chars, whereas strncat() always adds it, even if the source has exactly N chars or more (edited; thanks, @Clifford).
#include <stdio.h>
#include <string.h>
int main() {
FILE* file = stdin; // fopen("test.txt", "r");
if (file) {
struct Token {
char type[16];
char value[1024];
};
struct Token newToken;
newToken.value[0] = '\0'; // A '\0' at the first position means "empty"
int aux;
char source[2] = ""; // A literal "" has a single char with value '\0', but this syntax fills the entire array with '\0's
while ((aux = fgetc(file)) != EOF) {
source[0] = (char)aux;
strncat(newToken.value, source, 1); // This appends AT MOST 1 CHAR (and always adds a final '\0')
}
strncat(newToken.value, "", 1); // As the source string is empty, it just adds a final '\0' (superfluous in this case)
printf("%s", newToken.value); // Never pass input data as the format string itself
}
return 0;
}
This other version uses an index variable and writes each single char directly into the "current" position of the target string, without using strncat(). I think it is simpler and more secure, because it doesn't mix the confusing semantics of single chars and strings.
#include <stdio.h>
#include <string.h>
int main() {
FILE* file = stdin; // fopen("test.txt", "r");
if (file) {
struct Token {
size_t index; // C has no default member initializers, so index is zeroed below
char type[16];
char value[1024]; // Max size is 1023 chars + '\0'
};
struct Token newToken = { 0 }; // Zero-initializes index, type and value
int aux;
while ((aux = fgetc(file)) != EOF)
// Index will stop at 1023 (value[1022] will be the last "real" char, leaving space for a final '\0')
if (newToken.index < sizeof newToken.value - 1)
newToken.value[newToken.index++] = (char)aux;
newToken.value[newToken.index] = '\0';
printf("%s", newToken.value);
}
return 0;
}
Edited: fgetc() returns an int and we should check for EOF before casting it to a char (thanks, @chqrlie).
You are appending to a string that is not initialised, so it can contain anything. The end of a string is indicated by a NUL (0) character, and in your example there happened to be one after 6 bytes, but there need not be any within the value array, so the code is seriously flawed and will result in non-deterministic behaviour.
You need to initialise the newToken instance to empty string. For example:
struct Token newToken = { "", "" } ;
or to zero initialise the whole structure:
struct Token newToken = { 0 } ;
The point is that C does not initialise non-static objects without an explicit initialiser.
Furthermore using strncat() is very inefficient and has non-deterministic execution time that depends on the length of the destination string (see https://www.joelonsoftware.com/2001/12/11/back-to-basics/).
In this case you would do better to maintain a count of the number of characters added, and write the character and terminator directly to the array. For example:
size_t index = 0 ; // must start at zero; reading an uninitialised index is undefined behaviour
int ch = 0 ;
do
{
ch = fgetc(file);
if( ch != EOF )
{
newToken.value[index] = (char)ch ;
index++ ;
newToken.value[index] = '\0' ;
}
} while( ch != EOF &&
index < sizeof(newToken.value) - 1 ) ;
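The same count-and-write idea can be packaged as a small helper, which keeps the bounds check in one place. This is a sketch under assumptions: token_append is a hypothetical name, and the caller is assumed to start from a zero-initialised Token with the length set to 0.

```c
#include <stddef.h>

struct Token {
    char type[16];
    char value[1024];
};

/* Append ch at position *len of tok->value and keep the string terminated.
 * Returns 1 on success, 0 if there is no room left (hypothetical helper). */
int token_append(struct Token *tok, size_t *len, char ch)
{
    if (*len >= sizeof tok->value - 1)
        return 0;                  /* no space for ch plus the final '\0' */
    tok->value[(*len)++] = ch;
    tok->value[*len] = '\0';
    return 1;
}
```

A reading loop would then call token_append(&newToken, &len, (char)ch) for each value returned by fgetc that is not EOF, and stop as soon as it returns 0.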

How do you prevent buffer overflow using fgets?

So far I have been using if statements to check the size of the user-inputted strings. However, they don't seem to be very useful: no matter the size of the input, the while loop ends and it returns the input to the main function, which then just outputs it.
I don't want the user to enter anything longer than 10 characters, but when they do, the additional characters just overflow and are output on a new line. The whole point of these if statements is to stop that from happening, but I haven't been having much luck.
#include <stdio.h>
#include <string.h>
#define SIZE 10
char *readLine(char *buf, size_t sz) {
int true = 1;
while(true == 1) {
printf("> ");
fgets(buf, sz, stdin);
buf[strcspn(buf, "\n")] = 0;
if(strlen(buf) < 2 || strlen(buf) > sz) {
printf("Invalid string size\n");
continue;
}
if(strlen(buf) > 2 && strlen(buf) < sz) {
true = 0;
}
}
return buf;
}
int main(int argc, char **argv) {
char buffer[SIZE];
while(1) {
char *input = readLine(buffer, SIZE);
printf("%s\n", input);
}
}
Any help towards preventing buffer overflow would be much appreciated.
When the user enters in a string longer than sz, your program processes the first sz characters, but then when it gets back to the fgets call again, stdin already has input (the rest of the characters from the user's first input). Your program then grabs another up to sz characters to process and so on.
The call to strcspn is also deceiving because, if there is no "\n" in the characters you grabbed, it just returns the length of the string (at most sz-1), even though there's no newline.
After you've taken input from stdin, you can check whether the last character is a '\n' character. If it's not, it means that the input goes past your allowed size and the rest of stdin needs to be flushed. One way to do that is below. To be clear, you'd do this only when more characters than allowed have been entered; otherwise the loop would swallow the next line of input.
int c; // int, not char, so that EOF can be detected
while((c = getchar()) != '\n' && c != EOF)
{}
However, trying not to restructure your code too much how it is, we'll need to know if your buffer contains the newline before you set it to 0. It will be at the end if it exists, so you can use the following to check.
size_t len = strlen(buf);
int containsNewline = len > 0 && buf[len - 1] == '\n';
Also be careful with your size checks: you currently don't handle the case of a strlen of exactly 2 or sz. I would also never use identifier names like "true", which would be a possible value for a bool variable. It makes things very confusing.
If the input line is longer than 9 characters, your fgets() reads only the first 9 characters into buf, because one byte is reserved for the terminating '\0'. Since these characters don't contain the trailing \n, strcspn(buf, "\n") returns 9, so the assignment to buf[9] only overwrites the terminator and stays within the bounds of buf[] (the maximum index is 9). The rest of the line remains in the input stream and is returned by the next fgets() call, which is the extra output you see.
Additionally, never use true or false as the name of a variable; it makes the code confusing. Use something like 'ok' instead.
Finally, please clarify what output is expected when the input contains a string longer than 10 characters. Should it be truncated?
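Putting the suggestions from the answers together, a line reader that detects over-long input and discards the rest of the line could look roughly like this. It is a sketch, not the only way to do it: read_line and its return codes are invented for illustration, and it takes the stream as a parameter so it can be tried on a file as well as on stdin.

```c
#include <stdio.h>
#include <string.h>

/* Read one line into buf (capacity sz, assumed to be at least 2).
 * Returns 1 if the whole line fit, 0 if it was too long (the rest of the
 * line is discarded), and -1 on end of file or read error. */
int read_line(FILE *in, char *buf, size_t sz)
{
    if (fgets(buf, (int)sz, in) == NULL)
        return -1;

    size_t len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';       /* complete line: strip the newline */
        return 1;
    }
    if (len + 1 < sz)
        return 1;                  /* last line of input without a newline */

    /* The buffer filled up before a newline was seen: drain the rest. */
    int c;
    int extra = 0;
    while ((c = fgetc(in)) != '\n' && c != EOF)
        extra = 1;
    return extra ? 0 : 1;
}
```

The caller can print "Invalid string size" and prompt again whenever the function returns 0, without the leftover characters showing up as a second line of input.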

Find and replace a word in a file, how to avoid reading the entire file into a buffer?

I have an assignment where I'm supposed to write to a file, then perform a find and replace on it, with the condition that the old word must have the same length as the new one.
What I'm currently doing is finding the file size, then allocating a memory of that size and assign it to a buffer, read the entire file into the buffer, change the words, then write it back on the file.
This would fail if the files are too big, the only thing I can think of to avoid this is:
Check if the buffer contains \n
If it doesn't (the entire line wasn't read), then use realloc to increase its size by any amount (the original for example)
Delete the last n characters in the buffer, where n is the length of the word we want to replace. (To avoid reading the same data again)
Set the file pointer back by n. (Because the word could be cut)
Is there any other method? This feels complicated, and realloc causes some issues that might make the program need new buffers.
This is the current code where I read the entire file at once:
void replace_word(const char *s, const char *old_word, const char *new_word){
FILE *original_file;
if((original_file = fopen(s, "r+")) == NULL){
perror(s);
exit(EXIT_FAILURE);
}
const int BUFFER_SIZE = fsize(s);
char *buffer = malloc(BUFFER_SIZE);
char *init_loc = buffer;
int word_len = strlen(old_word);
int word_frequency = 0;
fgets(buffer, BUFFER_SIZE, original_file);
while((buffer = strstr(buffer, old_word))){
memcpy(buffer, new_word, word_len);
word_frequency++;
}
buffer = init_loc;
rewind(original_file);
fputs(buffer, original_file);
printf("'%s' found %i times\n", old_word, word_frequency);
fclose(original_file);
free(buffer);
}
You can do it with a "sliding window" algorithm using just one fixed buffer of any length that you want, as long as the buffer is longer than the word you are looking for.
The pseudocode to search for a word of length N would look as follows:
Begin with a buffer full of data from the file.
Loop:
Search for the word in the buffer; if found:
calculate the offset of the word in the file
write the replacement over it.
move the last N - 1 characters from the end of the buffer to the beginning of the buffer. (That's because these characters may contain part of the word, and the remaining part may be in the beginning of the next buffer that you will read.)
fill the remainder of the buffer from the file.
repeat the above loop until you reach the end of the file.
For this to perform well, the buffer must be much longer than the word. So, if your word is up to 100 characters long, the buffer should be at least 4 kilobytes long. But 64 and even 128 kilobyte buffers work well in modern systems.
Do not forget to seek to the right offset before each read operation.
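The pseudocode above can be sketched in C roughly as follows. Several things here are assumptions rather than part of the answer: the function name replace_in_file, the WINDOW size of 4096 bytes, a stream opened in "r+b" mode, and file offsets that fit in a long. Instead of always keeping exactly N - 1 bytes, the sketch keeps whatever tail has not been scanned yet, which is at most N - 1 bytes.

```c
#include <stdio.h>
#include <string.h>

#define WINDOW 4096   /* assumed buffer size; must be larger than the word */

/* Replace every occurrence of old_word with new_word (same length) in fp.
 * Returns the number of replacements, or -1 on error. */
long replace_in_file(FILE *fp, const char *old_word, const char *new_word)
{
    size_t n = strlen(old_word);
    if (n == 0 || n != strlen(new_word) || n >= WINDOW)
        return -1;

    char buf[WINDOW];
    long base = 0;     /* file offset of buf[0] */
    size_t have = 0;   /* number of valid bytes in buf */
    long count = 0;

    for (;;) {
        /* Seek to the first byte not yet in the buffer, then fill it up. */
        if (fseek(fp, base + (long)have, SEEK_SET) != 0)
            return -1;
        have += fread(buf + have, 1, WINDOW - have, fp);
        if (have < n)
            break;     /* fewer bytes left than one word: done */

        /* Check every position where a whole word still fits. */
        size_t i = 0;
        while (i + n <= have) {
            if (memcmp(buf + i, old_word, n) == 0) {
                /* The match sits at file offset base + i. */
                if (fseek(fp, base + (long)i, SEEK_SET) != 0 ||
                    fwrite(new_word, 1, n, fp) != n)
                    return -1;
                count++;
                i += n;
            } else {
                i++;
            }
        }

        /* Keep the unscanned tail (at most n - 1 bytes): it may be the
         * start of a word that continues in the next chunk. */
        memmove(buf, buf + i, have - i);
        base += (long)i;
        have -= i;
    }
    return fflush(fp) == 0 ? count : -1;
}
```

The explicit fseek before every read and every write is what lets reads and writes alternate on the same stream, and it is also the "seek to the right offset" step mentioned above.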
I don't know if this is the best solution or not, but I would just look at one word at a time. When you find the word you want to change, go back by the size of the word you read and overwrite it. As long as the replacement is the same size, it should work.
Use fgetc to get one char at a time from your file: replace getchar with fgetc in the code below.
The code is from K&R, the famous book on C, which I read 10 months ago to learn C. I've used it a few times in my own code, and it works fine.
#include <stdio.h>
#include <ctype.h>
/* getword: get next word or character from input */
int getword(char *word, int lim)
{
int c, getch(void);
void ungetch(int);
char *w = word;
while (isspace(c = getch()))
;
if (c != EOF)
*w++ = c;
if (!isalpha(c)) {
*w = '\0';
return c;
}
for ( ; --lim > 0; w++)
if (!isalnum(*w = getch())) {
ungetch(*w);
break;
}
*w = '\0';
return word[0];
}
#define BUFSIZE 100
char buf[BUFSIZE]; /* buffer for ungetch */
int bufp = 0; /* next free position in buf */
int getch(void) /* get a (possibly pushed-back) character */
{
return (bufp > 0) ? buf[--bufp] : getchar(); //change to fgetc
}
void ungetch(int c) /* push character back on input */
{
if (bufp >= BUFSIZE)
printf("ungetch: too many characters\n");
else
buf[bufp++] = c;
}
You can make the max size of the array anything you want. It's set to 100 since there should be no words bigger than 100 chars.
Just modify the code to read from fgetc, and stop when you hit EOF.

Dynamically allocate user inputted string

I am trying to write a function that does the following things:
Start an input loop, printing '> ' each iteration.
Take whatever the user enters (unknown length) and read it into a character array, dynamically allocating the size of the array if necessary. The user-entered line will end at a newline character.
Add a null byte, '\0', to the end of the character array.
Loop terminates when the user enters a blank line: '\n'
This is what I've currently written:
void input_loop(){
char *str = NULL;
printf("> ");
while(printf("> ") && scanf("%a[^\n]%*c",&input) == 1){
/*Add null byte to the end of str*/
/*Do stuff to input, including traversing until the null byte is reached*/
free(str);
str = NULL;
}
free(str);
str = NULL;
}
Now, I'm not too sure how to go about adding the null byte to the end of the string. I was thinking something like this:
last_index = strlen(str);
str[last_index] = '\0';
But I'm not too sure if that would work though. I can't test if it would work because I'm encountering this error when I try to compile my code:
warning: ISO C does not support the 'a' scanf flag [-Wformat=]
So what can I do to make my code work?
EDIT: changing scanf("%a[^\n]%*c",&input) == 1 to scanf("%as[^\n]%*c",&input) == 1 gives me the same error.
First of all, scanf format strings are not regular expressions, so I don't think something close to what you want will work. As for the warning you get: according to my trusty manual, the %a conversion specifier is for floating-point numbers, and it only exists since C99 (your compiler is probably configured for C90).
But then you have a bigger problem. Standard scanf expects you to pass it a previously allocated buffer for it to fill in with the input it reads. It does not malloc the string for you, so initializing str to NULL and calling free on it afterwards will not work with scanf.
The simplest thing you can do is to give up on arbitrary-length strings. Create a large buffer and forbid inputs that are longer than that.
You can then use the fgets function to populate your buffer. To check if it managed to read the full line, check if your string ends with a "\n".
char str[256+1];
while(true){
printf("> ");
if(!fgets(str, sizeof str, stdin)){
//error or end of file
break;
}
size_t len = strlen(str);
if(len + 1 == sizeof str && str[len - 1] != '\n'){
//buffer is full and no newline was read: user typed something too long
exit(1);
}
printf("user typed %s", str);
}
Another alternative is you can use a nonstandard library function. For example, in Linux there is the getline function that reads a full line of input using malloc behind the scenes.
No error checking, don't forget to free the pointer when you're done with it. If you use this code to read enormous lines, you deserve all the pain it will bring you.
#include <stdio.h>
#include <stdlib.h>
char *readInfiniteString() {
int l = 256;
char *buf = malloc(l);
int p = 0;
int ch; // int, not char, so that EOF can be told apart from real characters
ch = getchar();
while(ch != '\n' && ch != EOF) {
buf[p++] = ch;
if (p == l) {
l += 256;
buf = realloc(buf, l);
}
ch = getchar();
}
buf[p] = '\0';
return buf;
}
int main(int argc, char *argv[]) {
printf("> ");
char *buf = readInfiniteString();
printf("%s\n", buf);
free(buf);
}
If you are on a POSIX system such as Linux, you should have access to getline. It can be made to behave like fgets, but if you start with a null pointer and a zero length, it will take care of memory allocation for you.
You can use it in a loop like this:
#include <stdlib.h>
#include <stdio.h>
#include <string.h> // for strcmp
int main(void)
{
char *line = NULL;
size_t nline = 0;
for (;;) {
ssize_t n; // getline returns ssize_t, which POSIX declares in <stdio.h>
printf("> ");
// read line, allocating as necessary
n = getline(&line, &nline, stdin);
if (n < 0) break;
// remove trailing newline
if (n && line[n - 1] == '\n') line[n - 1] = '\0';
// do stuff
printf("'%s'\n", line);
if (strcmp("quit", line) == 0) break;
}
free(line);
printf("\nBye\n");
return 0;
}
The passed pointer and the length value must be consistent, so that getline can reallocate memory as required. (That means that you shouldn't change nline or the pointer line in the loop.) If the line fits, the same buffer is used in each pass through the loop, so that you have to free the line string only once, when you're done reading.
Some have mentioned that scanf is probably unsuitable for this purpose. I wouldn't suggest using fgets, either. Though it is slightly more suitable, there are problems that seem difficult to avoid, at least at first. Few C programmers manage to use fgets right the first time without reading the fgets manual in full. The parts most people manage to neglect entirely are:
what happens when the line is too large, and
what happens when EOF or an error is encountered.
The fgets() function shall read bytes from stream into the array pointed to by s, until n-1 bytes are read, or a <newline> is read and transferred to s, or an end-of-file condition is encountered. The string is then terminated with a null byte.
Upon successful completion, fgets() shall return s. If the stream is at end-of-file, the end-of-file indicator for the stream shall be set and fgets() shall return a null pointer. If a read error occurs, the error indicator for the stream shall be set, fgets() shall return a null pointer...
I don't feel I need to stress the importance of checking the return value too much, so I won't mention it again. Suffice to say, if your program doesn't check the return value your program won't know when EOF or an error occurs; your program will probably be caught in an infinite loop.
When no '\n' is present, the remaining bytes of the line are yet to have been read. Thus, fgets will always parse the line at least once, internally. When you introduce extra logic, to check for a '\n', to that, you're parsing the data a second time.
This allows you to realloc the storage and call fgets again if you want to dynamically resize the storage, or discard the remainder of the line (warning the user of the truncation is a good idea), perhaps using something like fscanf(file, "%*[^\n]");.
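For the "discard the remainder" option, a minimal sketch built around that fscanf call might look like the following. The function name discard_rest_of_line is invented here; the extra fgetc is needed because the scanset stops in front of the newline and leaves it in the stream.

```c
#include <stdio.h>

/* Discard everything up to and including the next '\n', or up to EOF
 * (hypothetical helper). */
void discard_rest_of_line(FILE *f)
{
    /* Skip any bytes that are not a newline. If the very next byte already
     * is '\n', nothing matches and fscanf returns 0, which is fine here. */
    if (fscanf(f, "%*[^\n]") == EOF)
        return;        /* end of file or read error: nothing left to drop */
    (void)fgetc(f);    /* consume the newline itself, if there is one */
}
```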
hugomg mentioned using multiplication in the dynamic resize code to avoid quadratic runtime problems. Along this line, it would be a good idea to avoid parsing the same data over and over each iteration (thus introducing further quadratic runtime problems). This can be achieved by storing the number of bytes you've read (and parsed) somewhere. For example:
char *get_dynamic_line(FILE *f) {
size_t bytes_read = 0;
char *bytes = NULL, *temp;
do {
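size_t alloc_size = bytes_read * 2 + 2; /* always leaves room for at least one new byte plus '\0' */
temp = realloc(bytes, alloc_size);
if (temp == NULL) {
free(bytes);
return NULL;
}
bytes = temp;
bytes[bytes_read] = '\0'; /* keeps the buffer a valid string if fgets reads nothing */
temp = fgets(bytes + bytes_read, alloc_size - bytes_read, f); /* Parsing data the first time */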
bytes_read += strcspn(bytes + bytes_read, "\n"); /* Parsing data the second time */
} while (temp && bytes[bytes_read] != '\n');
bytes[bytes_read] = '\0';
return bytes;
}
Those who do manage to read the manual and come up with something correct (like this) may soon realise the complexity of an fgets solution is at least twice as poor as the same solution using fgetc. We can avoid parsing data the second time by using fgetc, so using fgetc might seem most appropriate. Alas most C programmers also manage to use fgetc incorrectly when neglecting the fgetc manual.
The most important detail is to realise that fgetc returns an int, not a char. It may return typically one of 256 distinct values, between 0 and UCHAR_MAX (inclusive). It may otherwise return EOF, meaning there are typically 257 distinct values that fgetc (or consequently, getchar) may return. Trying to store those values into a char or unsigned char results in loss of information, specifically the error modes. (Of course, this typical value of 257 will change if CHAR_BIT is greater than 8, and consequently UCHAR_MAX is greater than 255)
char *get_dynamic_line(FILE *f) {
size_t bytes_read = 0;
char *bytes = NULL;
do {
if ((bytes_read & (bytes_read + 1)) == 0) {
void *temp = realloc(bytes, bytes_read * 2 + 1);
if (temp == NULL) {
free(bytes);
return NULL;
}
bytes = temp;
}
int c = fgetc(f);
bytes[bytes_read] = c >= 0 && c != '\n'
? c
: '\0';
} while (bytes[bytes_read++]);
return bytes;
}

Correctly store content of file by line to array and later print the array content

I'm getting some issues with reading the content of my array. I'm not sure if I'm storing it correctly as my result for every line is '1304056712'.
#include <stdio.h>
#include <stdlib.h>
#define INPUT "Input1.dat"
int main(int argc, char **argv) {
int data_index, char_index;
int file_data[1000];
FILE *file;
int line[5];
file = fopen(INPUT, "r");
if(file) {
data_index = 0;
while(fgets(line, sizeof line, file) != NULL) {
//printf("%s", line); ////// the line seems to be ok here
file_data[data_index++] = line;
}
fclose(file);
}
int j;
for(j = 0; j < data_index; j++) {
printf("%i\n", file_data[j]); // when i display data here, i get '1304056712'
}
return 0;
}
I think you need to say something like
file_data[data_index++] = atoi(line);
From your results I assume the file is a plain-text file.
You cannot simply read the line from file (a string, an array of characters) into an array of integers, this will not work. When using pointers (as you do by passing line to fgets()) to write data, there will be no conversion done. Instead, you should read the line into an array of chars and then convert it to integers using either sscanf(), atoi() or some other function of your choice.
fgets reads newline terminated strings. If you're reading binary data, you need fread. If you're reading text, you should declare line as an array of char big enough for the longest line in the file.
Because file_data is an array of int, file_data[data_index] is a single integer. It is being assigned a pointer (the base address of the int line[5] buffer). If reading binary data, file_data should be an array of integers. If reading strings, it should be an array of strings, i.e. char pointers, like char * file_data[1000].
You also need to initialize data_index=0 outside the if (file) ... block, because the output loop needs it to be set even if the file failed to open. And when looping and storing input, the loop should test that it has not reached the size of the array being stored into.
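Combining these points (a char line buffer, an explicit text-to-integer conversion, and a bounds check on the destination array), a possible sketch is shown below. The function name read_ints and the 64-byte line buffer are assumptions for illustration, and strtol is used here in place of atoi so that lines without a number can be detected.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read up to max integers, one per line, from f into out.
 * Lines that do not start with a number are skipped.
 * Returns the number of values stored (hypothetical helper). */
size_t read_ints(FILE *f, int *out, size_t max)
{
    char line[64];             /* assumed to be long enough for one line */
    size_t count = 0;

    while (count < max && fgets(line, sizeof line, f) != NULL) {
        char *end;
        long value = strtol(line, &end, 10);
        if (end == line)
            continue;          /* no digits found on this line */
        out[count++] = (int)value;
    }
    return count;
}
```

With the question's variables, the call would be data_index = (int)read_ints(file, file_data, 1000), after which the printing loop can stay as it is.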
