Showing the difference between 2 char * strings in C - c

I am new to C, and I am trying to figure out the best way to approach this problem. I have 2 strings that are both char *'s.
They have multiple \n characters within the strings themselves, and they are usually about 1000 characters in length. I want to display only single lines that are different. Typically only one character (or a relatively small number) would be different in the entire string. So I was hoping to make it so that I could display only that one changed line (the whole string from \n to \n).
I'm not asking for anybody to write the code, or even supply code examples, just in theory what would be the most efficient way to do this?
I've been looking into using strtok, using the '\n' symbol as a delimiter, and then using strcmp to compare the two strings, and if they were not equal then I could add that string to a "old_data" and "new_data" array. Would this be a bad way to do this?
Any advice would be a huge help.

It sounds like you're on the right track: strsep will let you chunk the string up by newline. One thing to keep in mind is that it operates on the original string in place, and doesn't allocate any new memory, which can be both a blessing and a curse.
Probably the most memory efficient way to do this would be to look into allocating arrays of pointers to hold your "old_data" and "new_data" values, and then just save the pointers that point directly into the original string rather than copying the strings themselves over. As long as your original two strings are going to stick around/not get freed from under you, this could save you a decent chunk of memory.
If you aren't ever going to be removing strings from the arrays, one naïve (but effective) way to implement your arrays is to maintain two state variables — a count, and a capacity — and double the capacity each time you're about to overflow the array. E.g.:
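char **strArray = NULL;
unsigned int capacity = 10;
unsigned int count = 0;
strArray = malloc(capacity * sizeof(char *));
/* check that strArray != NULL before using it */
/* on insert */
if (count == capacity)
{
    /* grow through a temporary so the old block is not lost if realloc fails */
    char **tmp = realloc(strArray, 2 * capacity * sizeof(char *));
    if (tmp == NULL)
        return -1; /* e.g. report out-of-memory; strArray is still valid here */
    strArray = tmp;
    capacity *= 2;
}
strArray[count++] = pointerIntoOriginalString;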
Good luck!
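For what it's worth, here is a minimal sketch of the comparison itself that avoids modifying or copying either string. It assumes the two buffers have the same number of lines in the same order (which matches the "one character changed" case in the question); the function name print_changed_lines is made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: walks two newline-separated buffers in step and
 * prints every pair of lines that differ. Lines are matched purely by
 * position, so inserted or deleted lines are NOT detected as such.
 * Returns the number of differing line pairs. Neither input is modified. */
int print_changed_lines(const char *old_text, const char *new_text)
{
    int changed = 0;
    const char *a = old_text;
    const char *b = new_text;

    while (*a != '\0' || *b != '\0') {
        size_t alen = strcspn(a, "\n");  /* length of the current old line */
        size_t blen = strcspn(b, "\n");  /* length of the current new line */

        if (alen != blen || memcmp(a, b, alen) != 0) {
            /* %.*s prints exactly alen/blen chars; no terminator needed */
            printf("- %.*s\n+ %.*s\n", (int)alen, a, (int)blen, b);
            changed++;
        }

        a += alen;
        b += blen;
        if (*a == '\n') a++;             /* step over the newline, if any */
        if (*b == '\n') b++;
    }
    return changed;
}
```

Calling print_changed_lines(old_string, new_string) prints each changed line once from each buffer. Because it only reads the inputs, it also works on string literals and const data.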

strtok() is not reentrant. If you were going to do this with strtok, you'd have to iterate over the arrays one after the other. I recommend using strtok_r(), which is the reentrant implementation of strtok.
The other consideration you need to worry about is making sure that your old_data and new_data arrays are big enough, or resizable. Matt's answer shows a simple example of resizing the array, although if you're new to C you might just want to declare something like:
char *new_data[2000];
char *old_data[2000];
Especially since it sounds like you have a good idea of how many lines are in your buffer.
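A rough sketch of how the two strings could be walked in step with strtok_r() is below. It assumes a POSIX system (strtok_r is not part of ISO C), that both buffers are writable, and that lines pair up by position; the function name collect_changed and its parameters are invented for this example. Note that strtok_r treats consecutive newlines as one delimiter, so empty lines are skipped.

```c
#define _POSIX_C_SOURCE 200809L   /* strtok_r is POSIX, not ISO C */
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: splits both buffers in place on '\n' and stores
 * pointers to each pair of differing lines in old_data[] and new_data[].
 * Stops when either buffer runs out of lines or max pairs were stored.
 * Returns the number of pairs stored. Both buffers are modified. */
size_t collect_changed(char *old_buf, char *new_buf,
                       char *old_data[], char *new_data[], size_t max)
{
    char *save_old = NULL;        /* separate state per string is what */
    char *save_new = NULL;        /* makes interleaved tokenizing safe */
    char *o = strtok_r(old_buf, "\n", &save_old);
    char *n = strtok_r(new_buf, "\n", &save_new);
    size_t count = 0;

    while (o != NULL && n != NULL && count < max) {
        if (strcmp(o, n) != 0) {
            old_data[count] = o;  /* pointers into the original buffers */
            new_data[count] = n;
            count++;
        }
        o = strtok_r(NULL, "\n", &save_old);
        n = strtok_r(NULL, "\n", &save_new);
    }
    return count;
}
```

With the fixed-size arrays above, the call would look like collect_changed(old_string, new_string, old_data, new_data, 2000).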

Related

How to iterate character by character in strings which lengths are larger than INT_MAX, or SIZE_MAX?

C language, How to iterate character by character in strings which lengths are larger than INT_MAX, or SIZE_MAX?
How can I find out whether the string length exceeds the maximum size that the code below can handle?
int len = strlen(item);
int i=0;
while (i <= len ) {
//do smth
i++;
}
You can access characters in a string (or elements in an array generally) without integer indices by using pointers:
for (char *p = item; *p; ++p)
{
// Do something.
// *p is the current character.
}
int len = strlen(item);
First, nothing prevents a string from being longer than INT_MAX; the problem only appears when you have to deal with its length. Note that strlen() returns a size_t, so storing the result in an int can overflow for such a string. If you think about how strlen() must be implemented, given how strings are defined (a sequence of chars in memory terminated by a null char), you'll see that the only possible implementation is to walk the string, incrementing the length until it reaches the first null char. This makes your code inefficient, because you first traverse the string searching for its end, and then you traverse it a second time to do the useful work.
int i=0;
while (i <= len ) {
//do smth
i++;
}
It is better to use a pointer directly, in a for loop like this one:
char *p;
for (p = item; *p; p++) {
// do something, knowing that `*p` is the current char.
}
In this way, you navigate the string and stop when you find the null char, and you will not have to traverse it twice.
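If a count or position is still needed while walking the string, a size_t counter avoids the int overflow from the question. The sketch below is only an illustration; count_char is a made-up helper, not something from the question.

```c
#include <stddef.h>

/* Hypothetical helper: counts how often ch occurs in item in a single
 * pass. The counter is a size_t, which can represent the size of any
 * object, so it cannot overflow the way an int length could. */
size_t count_char(const char *item, char ch)
{
    size_t n = 0;
    const char *p;

    for (p = item; *p != '\0'; ++p) {   /* stop at the terminating null */
        if (*p == ch)
            n++;
    }
    return n;
}
```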
By the way, having strings longer than INT_MAX is quite difficult, because normally (even with the newer 64-bit architectures) you are not allowed to create such a large contiguous memory structure: if you try to create a static array of that size, you will be fighting with the compiler, and if you try to malloc() such a huge amount of memory, you will end up fighting with the operating system.
Developers who have to deal with huge amounts of text normally use a different structure, hidden behind an interface, to hold large numbers of characters. Just imagine that you need to insert one char and this forces you to move one gigabyte of memory by one position because you have no other way to make room for it. A flat array of that size is simply impractical. A simple approach is to use a structure similar to the one used for file data on disk in a Unix system. The data has a series of direct pointers that point to fixed-size blocks of memory holding characters, but at some point those pointers become double pointers, pointing to an array of simple pointers, then triple pointers, etc. This way you can handle strings as short as one byte (with just one memory page) up to more than INT_MAX bytes, by selecting an appropriate size for the page and the number of pointers.
Another approach is the mbuf_t approach used by BSD software to handle networking packets. It is especially appropriate when you have to add data in front of the string (e.g. to add a new protocol header) or at the rear of the packet (to add payload and/or a checksum or trailing data).
One last thing: if you create an array of 5 GB, most probably any current operating system will swap part of it out as soon as you stop using that part. Your application will then start swapping as soon as you move through the array, and you will probably not be able to run it at all on a computer with a limited address space (such as a 32-bit machine).
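To make the idea of block-based storage concrete, here is a deliberately simplified sketch: a singly linked chain of fixed-size blocks, which is a much simpler relative of both the Unix indirect-block scheme and BSD mbufs, not an implementation of either. All names (struct chunk, struct chunked_str, chunked_append, chunked_free, CHUNK_SIZE) are invented for this example.

```c
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE 4096              /* assumed block size; tune as needed */

struct chunk {
    struct chunk *next;              /* next block in the chain, or NULL */
    size_t used;                     /* bytes of data[] currently in use */
    char data[CHUNK_SIZE];
};

struct chunked_str {
    struct chunk *head;              /* first block */
    struct chunk *tail;              /* last block, where appends go */
    size_t length;                   /* total characters stored */
};

/* Appends one character, allocating a new block when the last one is
 * full. Returns 0 on success, -1 if malloc fails. No existing data is
 * ever moved, unlike growing one huge array with realloc. */
int chunked_append(struct chunked_str *s, char c)
{
    if (s->tail == NULL || s->tail->used == CHUNK_SIZE) {
        struct chunk *blk = malloc(sizeof *blk);
        if (blk == NULL)
            return -1;
        blk->next = NULL;
        blk->used = 0;
        if (s->tail != NULL)
            s->tail->next = blk;
        else
            s->head = blk;
        s->tail = blk;
    }
    s->tail->data[s->tail->used++] = c;
    s->length++;
    return 0;
}

/* Releases every block and resets the string to empty. */
void chunked_free(struct chunked_str *s)
{
    struct chunk *blk = s->head;
    while (blk != NULL) {
        struct chunk *next = blk->next;
        free(blk);
        blk = next;
    }
    s->head = NULL;
    s->tail = NULL;
    s->length = 0;
}
```

A string starts out as struct chunked_str s = {0}; iterating over it means walking the chain and, within each block, the first used bytes of data.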

Dynamic Structures And Storing Data without stdlib.h

I have tried using Google, but not really sure how to phrase my search to get relevant results. The programming language is C. I was given a (homework) assignment which requires reading a text file and outputting the unique words in the text file. The restriction is that the only allowable import is <stdio.h>. So, is there a way to use dynamic structures without using <stdlib.h>? Would it be necessary to define those dynamic structures on my own? If this has already been addressed on Stack Overflow, then please point me to the question.
Clarification was provided today that the allowable imports now include <stdlib.h> as well as (though not necessary or desirable) the use of <string.h>, which in turn makes this problem easier (and I am tempted to say trivial).
It is telling that you couldn't find anything with Google. Assignments with completely arbitrary restrictions are idiotic. The assignment tells something profound about the quality of the course and the instructor. There is more to be learnt from an assignment that requires the use of realloc and other standard library functions.
You don't need a data structure, only a large enough 2-dimensional char array - you must know at compile time how long words you're going to have and how many of them are there going to be at most; or you need to read the file once and then you're going to allocate a two-dimensional variable-length array on the stack (and possibly blow the stack), reset the file pointer and read the file again into that array...
Then you read the words into it using fgets, loop over the words using 2 nested for loops and comparing the first and second strings together (of course you'd skip if both outer and inner loop are at the same index) - if you don't find a match in the inner loop, you'll print the word.
Doing the assignment this way doesn't teach anything useful about programming, but the only standard library routine you need to replicate yourself is strcmp, and at least you'll save your energy for something useful instead.
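A rough sketch of the fixed-array idea using only <stdio.h> might look like the following. It is a variant rather than the exact procedure described above: it reads words with fscanf and stores each distinct word once as it goes, which assumes "unique" means "distinct". The limits and the names words_equal and read_unique_words are made up for illustration.

```c
#include <stdio.h>

#define MAX_WORDS 1024       /* assumed upper bound on distinct words */
#define MAX_WORD_LEN 32      /* assumed longest word, incl. terminator */

/* Hand-written stand-in for strcmp(a, b) == 0. */
int words_equal(const char *a, const char *b)
{
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    return *a == *b;         /* equal only if both ended together */
}

/* Hypothetical helper: reads whitespace-separated words from fp and
 * stores each distinct word once. Words longer than 31 characters are
 * split by the %31s width (which must stay in sync with MAX_WORD_LEN).
 * Returns the number of words stored. */
size_t read_unique_words(FILE *fp, char words[][MAX_WORD_LEN],
                         size_t max_words)
{
    size_t count = 0;
    char word[MAX_WORD_LEN];

    while (count < max_words && fscanf(fp, "%31s", word) == 1) {
        size_t i;
        int seen = 0;

        for (i = 0; i < count && !seen; i++)
            seen = words_equal(words[i], word);

        if (!seen) {
            size_t k = 0;
            do {                         /* copy including the '\0' */
                words[count][k] = word[k];
            } while (word[k++] != '\0');
            count++;
        }
    }
    return count;
}
```

The caller would declare something like static char words[MAX_WORDS][MAX_WORD_LEN]; and then print words[0] through words[n - 1] with printf.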
It is not possible to code dynamic data structures in c using only stdio.h. That may be one of the reasons your teacher restricted you to using just stdio.h--they didn't want you going down the rabbit hole of trying to make a linked list or something in which to store unique words.
However, if you think about it, you don't need a dynamic data structure. Here's something to try: (1) make a copy of your source file. (2) declare a results text file to store your results. (3) Copy the first word in your source file to the results file. Then run through your source file and delete every copy of that word. Now there can't be any duplicates of that word. Then move on to the next word and copy and delete.
When you're done, your source file should be empty (thus the reason for the backup) and your results file should have one copy of every unique word from the original source file.
The benefit of this approach is that it doesn't require you to know (or guess) the size of the initial source file.
Agreed on the points above on "exercises with arbitrary constraints" mostly being used to illustrate a lecturer's favorite pet peeve.
However, if you are allowed to be naive you could do what others have said and assume a maximum size for your array of unique strings and use a simple buffer. I wrote a little stub illustrating what I was thinking. However, it is shared with the disclaimer that I am not a "real programmer", with all the bad habits and knowledge-gaps that follows...
I have obviously also ignored the topics of reading the file and filtering unique words.
#include <stdio.h> // scanf, printf, etc.
#include <string.h> // strncpy, strlen (only for convenience here)
#define NUM_STRINGS 1024 // maximum number of strings
#define MAX_STRING_SIZE 32 // maximum length of a string (in fixed buffer)
char fixed_buff[NUM_STRINGS][MAX_STRING_SIZE];
char * buff[NUM_STRINGS]; // <-- Will only work for string literals OR
// if the strings that populates the buffer
// are stored in a separate location and the
// buffer refers to the permanent location.
/**
* Fixed length of buffer (NUM_STRINGS) and max item length (MAX_STRING_SIZE)
*/
void example_1(char strings[][MAX_STRING_SIZE] )
{
// Note: terminates when first item in the current string is '\0'
// this may be a bad idea(?)
for(size_t i = 0; i < NUM_STRINGS && *strings[i] != '\0'; i++)
printf("strings[%zu] : %s (length %zu)\n", i, strings[i], strlen(strings[i])); // %zu is the size_t format
}
/**
* Fixed length of buffer (NUM_STRINGS), but arbitrary item length
*/
void example_2(char * strings[])
{
// Note: Terminating on reaching a NULL pointer as the number of strings is
// "unknown".
for(size_t i = 0; i < NUM_STRINGS && strings[i] != NULL; i++)
printf("strings[%zu] : %s (length %zu)\n", i, strings[i], strlen(strings[i])); // %zu is the size_t format
}
int main(int argc, char* argv[])
{
// Populate buffers
strncpy(fixed_buff[0], "foo", MAX_STRING_SIZE - 1);
strncpy(fixed_buff[1], "bar", MAX_STRING_SIZE - 1);
buff[0] = "mon";
buff[1] = "ami";
// Run examples
example_1(fixed_buff);
example_2(buff);
return 0;
}

What is the proper way to populate an array of Strings in C, such that each string is a single element in the array

I'm trying to initialize a 2D array of strings in C; which does not seem to work like any other language I've coded in. What I'm TRYING to do, is read input, and take all of the comments out of that input and store them in a 2d array, where each string would be a new row in the array. When I get to a character that is next line, I want to advance the first index of the array so that I can separate each "comment string". ie.
char arr[100][100];
<whatever condition>
arr[i][j] = "//First comment";
Then when I get to a '\n' I want to increment the first index such that:
arr[i+1][j] = "//Second comment";
I just want to be able to access each input as an individual element in my array. In Java I wouldn't need to do this, as each string would already be an individual element in a String array. I've only been working with c for 3 weeks now, and things that I used to take for granted as being simple, have proven to be quite frustrating in C.
My actual code is below. It gives me an infinite loop and prints out a ton of numbers:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
const int MAXLENGTH = 500;
int main(){
char comment[MAXLENGTH][MAXLENGTH];
char activeChar;
int cIndex = 0;
int startComment = 0;
int next = 0;
while((activeChar = getchar()) != EOF){
if(activeChar == '/'){
startComment = 1;
}
if(activeChar == '\n'){
comment[next][cIndex] = '\0';
next++;
}
if(startComment){
comment[next][cIndex] = activeChar;
cIndex++;
}
}
for(int x = 0 ; x < MAXLENGTH; x++){
for (int j = 0; j < MAXLENGTH; j++){
if(comment[x][j] != 0)
printf("%s", comment[x][j]);
}
}
return 0;
}
The problem you are having is that C was designed to be essentially a glorified assembler. That means that the only things it has 'built in' are things for which there is an obvious correct way to do them. Strings do not meet this criterion. As such, strings are not first-class citizens in C.
In particular there are at least three viable ways to deal with strings, C doesn't force you to use any of them, but instead allows you to use what you want for the job at hand.
Method 1: Static Array
This method appears to be similar to what you are trying to do, and is often used by new C programmers. In this method a string is just an array of characters exactly long enough to fit its contents. Assigning arrays becomes difficult, so this promotes treating strings as immutable. It seems likely that this is how most JVMs would implement strings. C code: char my_string[] = "Hello";
Method 2: Static Bounded Array
This method is what you are doing. You decide that your strings must be shorter than a specified length, and pre-allocate a large enough buffer for them. In this case it is relatively easy to assign strings and change them, but they must not become longer than the set length. C code: char my_string[MAX_STRING_LENGTH] = "Hello";
Method 3: Dynamic Array
This is the most advanced and risky method. Here you dynamically allocate your strings so that they always fit their content. If they grow too big, you resize. They can be implemented many ways (usually as a single char pointer that is realloc'd as necessary in combination with method 2, occasionally as a linked list).
Regardless of how you implement strings, to C's eyes they are all just arrays of characters. The only caveat is that to use the standard library you need to null terminate your strings manually (although many [all?] of them specify ways to get around this by manually specifying the length).
This is why Java strings are not primitive types, but rather objects of type String.
Interestingly enough, many languages actually use different String types for these solutions. For example Ada has String, Bounded_String, and Unbounded_String for the three methods above.
Solution
Look at your code again: char arr[100][100]; which method is this, and what is it?
Obviously it is method 2 with a MAX_STRING_LENGTH of 100. So you could pretend the line says my_strings arr[100], which makes your issue apparent: this is not a 2D array of strings, but a 2D array of characters that represents a 1D array of strings. To create a 2D array of strings in C you would use char arr[WIDTH][HEIGHT][MAX_STRING_LENGTH], which is easy to get wrong. As above, however, you have some logic errors in your code, and you can probably solve this problem with just a 1D array of strings (a 2D array of chars).
comment is a 2D array of chars, which are single characters. In C, a string is simply an array of characters, so your definition of comment is one way to define a 1D array of strings.
As far as the loading goes, the only obvious potential problem is that you don't ever reset startComment to zero (but you should use a debugger to make sure it's being loaded correctly), however your code to print it out is wrong.
Using printf() with a %s tells it to start printing the string at whatever address you give it, but you're giving it individual characters, not whole strings, so it's interpreting each character in each string (because C is a horrible, horrible language) as an address in RAM and trying to print that RAM. To print an individual character, use %c instead of %s. Or, just make a 1D for loop:
for(int x=0; x<MAXLENGTH; x++)
printf("%s\n", comment[x]);
It's also a bit confusing that you use the same MAXLENGTH for the number of lines in the array and the length of the string in each line.
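Putting those points together, one possible shape for the loading loop is sketched below. It keeps the question's simple rule that any '/' starts a comment running to the end of the line, and it assumes separate limits for the number of rows and the row length. The names collect_comments, MAX_COMMENTS and MAX_LENGTH are invented for this sketch.

```c
#include <stdio.h>

#define MAX_COMMENTS 100     /* assumed maximum number of comment rows */
#define MAX_LENGTH 100       /* assumed maximum row length, incl. '\0' */

/* Hypothetical helper: reads characters from in and stores everything
 * from the first '/' on a line up to the newline as one row.
 * Overlong comments are truncated. Returns the number of rows filled. */
size_t collect_comments(FILE *in, char comments[][MAX_LENGTH],
                        size_t max_rows)
{
    int ch;                  /* int, not char, so EOF stays distinct */
    size_t row = 0;
    size_t col = 0;
    int in_comment = 0;

    while (row < max_rows && (ch = getc(in)) != EOF) {
        if (ch == '\n') {
            if (in_comment) {
                comments[row][col] = '\0';  /* finish this row */
                row++;
                col = 0;                    /* restart at column 0 */
                in_comment = 0;             /* reset for the next line */
            }
            continue;
        }
        if (ch == '/')
            in_comment = 1;
        if (in_comment && col < MAX_LENGTH - 1)
            comments[row][col++] = (char)ch;
    }
    if (in_comment && row < max_rows) {     /* input ended mid-comment */
        comments[row][col] = '\0';
        row++;
    }
    return row;
}
```

Printing is then one printf("%s\n", comments[i]) per row, for i from 0 up to the returned count, so rows that were never filled are not touched.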

Appending a char w/ null terminator in C

perhaps a lil trivial, but im just learning C and i hate doing with 2 lines, what can be done with one(as long as it does not confuse the code of course).
anyway, im building strings by appending one character at a time. im doing this by keeping track of the char index of the string being built, as well as the input file string's(line) index.
str[strIndex] = inStr[index];
str[strIndex + 1] = '\0';
str is used to temporarily store one of the words from the input line.
i need to append the terminator every time i add a char.
i guess what i want to know; is there a way to combine these in one statement, without using strcat()(or clearing str with memset() every time i start a new word) or creating other variables?
Simple solution: Zero out the string before you add anything to it. The NULs will already be at every location ahead of time.
// On the stack:
char str[STRLEN] = {0};
// On the heap
char *str = calloc(STRLEN, sizeof(*str));
In the calloc case, for large allocations, you won't even pay the cost of zeroing the memory explicitly (in bulk allocation mode, it requests memory directly from the OS, which is either lazily zero-ed (Linux) or has been background zero-ed before you ask for it (Windows)).
Obviously, you can avoid even this amount of work by deferring the NUL termination of the string until you're done building it, but if you might need to use it as a C-style string at any time, guaranteeing it's always NUL-terminated up front isn't unreasonable.
I believe the way you are doing it now is the neatest that satisfies your requirement of
1) Not having string all zero to start with
2) At every stage the string is valid (as in always has a termination).
Basically you want to add two bytes each time. And really the most neat way to do that is the way you are doing it now.
If you are wanting to make the code seem neater by having the "one line" but not calling a function then perhaps a macro:
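#define SetCharAndNull( String, Index, Character ) \
do { \
(String)[(Index)] = (Character); \
(String)[(Index) + 1] = 0; \
} while (0) /* do/while(0) keeps the macro safe inside if/else; Index is evaluated twice */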
And use it like:
SetCharAndNull( str, strIndex, inStr[index]);
Otherwise the only other thing I can think of which would achieve the result is to write a "word" at a time (two bytes, so an unsigned short) in most cases. You could do this with some horrible typecasting and pointer arithmetic. I would strongly recommend against this though as it won't be very readable, also it won't be very portable. It would have to be written for a particular endianness, also it would have problems on systems that require alignment on word access.
[Edit: Added the following]
Just for completeness I'm putting that awful solution I mentioned here:
*((unsigned short*)&str[strIndex]) = (unsigned short)(inStr[index]);
This is type casting the pointer of str[strIndex] to an unsigned short which on my system (OSX) is 16 bits (two bytes). It is then setting the value to a 16 bit version of inStr[index] where the top 8 bits are zero. Because my system is little endian, then the first byte will contain the least significant one (which is the character), and the second byte will be the zero from the top of the word. But as I said, don't do this! It won't work on big endian systems (you would have to add in a left shift by 8), also this will cause alignment problems on some processors where you can not access a 16bit value on a non 16-bit aligned address (this will be setting address with 8bit alignment)
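If a function call is acceptable after all, a small helper is a safer alternative to the macro: it evaluates its arguments once and can check the buffer size. This is only a sketch; append_char and its parameters are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: writes c at position len and re-terminates the
 * string, provided both bytes fit in a buffer of cap bytes.
 * Returns the new length (unchanged if there was no room). */
size_t append_char(char *str, size_t len, size_t cap, char c)
{
    if (len + 1 < cap) {     /* room for c plus the terminator */
        str[len++] = c;
        str[len] = '\0';
    }
    return len;
}
```

In the question's terms the call site would be a single statement such as strIndex = append_char(str, strIndex, sizeof str, inStr[index]); assuming str is an array (so sizeof gives its capacity) and strIndex is a size_t.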
Declare a char array:
char str[100];
or,
char * str = (char *)malloc(100 * sizeof(char));
Add all the characters one by one in a loop:
for(i = 0; i<length; i++){
str[i] = inStr[index + i]; /* assumes index is where the word starts in inStr */
}
Finish it with a null character (outside the loop):
str[i] = '\0';

Scanning a file and allocating correct space to hold the file

I am currently using fscanf to get space delimited words. I establish a char[] with a fixed size to hold each of the extracted words. How would I create a char[] with the correct number of spaces to hold the correct number of characters from a word?
Thanks.
Edit: If I do a strdup on a char[1000] and the char[1000] actually only holds 3 characters, will the strdup reserve space on the heap for 1000 or 4 (for the terminating char)?
Here is a solution involving only two allocations and no realloc:
Determine the size of the file by seeking to the end and using ftell.
Allocate a block of memory this size and read the whole file into it using fread.
Count the number of words in this block.
Allocate an array of char * able to hold pointers to this many words.
Loop through the block of text again, assigning to each pointer the address of the beginning of a word, and replacing the word delimiter at the end of the word with 0 (the null character).
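The steps above can be sketched roughly as follows. This is an illustration under a few assumptions: words are separated by whitespace, the stream is seekable, and the names read_all and split_words are invented here.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads a whole seekable stream into one malloc'd,
 * NUL-terminated block (allocation 1). Returns NULL on failure. */
char *read_all(FILE *fp)
{
    long size;
    char *buf;
    size_t got;

    if (fseek(fp, 0, SEEK_END) != 0 || (size = ftell(fp)) < 0)
        return NULL;
    rewind(fp);
    buf = malloc((size_t)size + 1);
    if (buf == NULL)
        return NULL;
    got = fread(buf, 1, (size_t)size, fp);
    buf[got] = '\0';
    return buf;
}

/* Hypothetical helper: splits text in place on whitespace. Returns a
 * malloc'd array (allocation 2) of pointers into text and stores the
 * number of words in *count. Returns NULL if malloc fails. */
char **split_words(char *text, size_t *count)
{
    size_t n = 0;
    size_t i = 0;
    int in_word = 0;
    char **words;
    char *p;

    for (p = text; *p != '\0'; p++) {        /* pass 1: count words */
        int space = isspace((unsigned char)*p);
        if (!space && !in_word)
            n++;
        in_word = !space;
    }

    words = malloc((n ? n : 1) * sizeof *words);
    if (words == NULL)
        return NULL;

    in_word = 0;
    for (p = text; *p != '\0'; p++) {        /* pass 2: mark and cut */
        if (isspace((unsigned char)*p)) {
            *p = '\0';                       /* delimiter becomes NUL */
            in_word = 0;
        } else if (!in_word) {
            words[i++] = p;                  /* start of a new word */
            in_word = 1;
        }
    }
    *count = n;
    return words;
}
```

Cleanup is two free() calls, one for the pointer array and one for the text block; the individual words must not be freed because they live inside the block.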
Also, a slightly philosophical matter: If you think this approach of inserting string terminators in-place and breaking up one gigantic string to use it as many small strings is ugly, hackish, etc. then you should probably forget about programming in C and use Python or some other higher-level language. The ability to do radically-more-efficient data manipulation operations like this while minimizing the potential points of failure is pretty much the only reason anyone should be using C for this kind of computation. If you want to go and allocate each word separately, you're just making life a living hell for yourself by doing it in C; other languages will happily hide this inefficiency (and abundance of possible failure points) behind friendly string operators.
There's no one-and-only way. The idea is to just allocate a string large enough to hold the largest possible string. After you've read it, you can then allocate a buffer of exactly the right size and copy it if needed.
In addition, you can also specify a width in your fscanf format string to limit the number of characters read, to ensure your buffer will never overflow.
But if you allocated a buffer of, say, 250 characters, it's hard to imagine a single word not fitting in that buffer.
char *ptr;
ptr = (char*) malloc(size_of_string + 1);
char first = ptr[0];
/* etc. */
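As a sketch of that read-then-copy approach: the function below reads one word into a fixed 250-byte buffer with a width-limited fscanf, then copies it into a block of exactly strlen + 1 bytes. That is also the amount strdup() asks for, so a 3-character word held in a char[1000] would cost 4 bytes on the heap (plus whatever the allocator rounds up to), not 1000. The name read_word is made up for this example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads the next whitespace-delimited word from fp
 * and returns it in a malloc'd block sized to fit exactly.
 * Returns NULL at end of input or if malloc fails. Caller frees. */
char *read_word(FILE *fp)
{
    char buf[250];           /* generous scratch buffer for one word */
    size_t len;
    char *copy;

    /* The width 249 leaves room for the terminator, so buf cannot
     * overflow; longer words are split across calls. */
    if (fscanf(fp, "%249s", buf) != 1)
        return NULL;

    len = strlen(buf);
    copy = malloc(len + 1);  /* characters plus the terminating '\0' */
    if (copy != NULL)
        memcpy(copy, buf, len + 1);
    return copy;
}
```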

Resources