I have a structure holding the most frequent words in a huge text file: the field a (an array of char pointers) holds the words, and the field count holds their frequencies. How can I sort them from the longest word to the shortest, so that I can display them nicely to the user? Code:
typedef struct pair {
char * a[20000];
int count[32000];
} Pair;
Example print:
printf("%d, %d, %d\n", bag.count[0], bag.count[1], bag.count[2]); // -> 8, 7, 3
printf("%s, %s, %s\n", bag.a[0], bag.a[1], bag.a[2]); // -> abbes, abbey, abhor
I'd suggest turning the structure/array inside-out.
Having your arrays inside the struct does not feel right: you primarily have a pair of things, and secondarily you want one array of these pairs. Do you see what I mean?
It would look like this:
typedef struct pair
{
char* word;
int count;
} Pair;
Pair pairs[32000];
You'd also need to know how many pairs are filled. (You would have needed this anyway.):
int index; // Index of next free pair.
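Then use the C standard qsort() with a comparison function that puts longer words first:
#include <stdlib.h>
#include <string.h>
...
int comparePairs(const void *pairA, const void *pairB)
{
const Pair* a = (const Pair*)pairA;
const Pair* b = (const Pair*)pairB;
size_t lenA = strlen(a->word);
size_t lenB = strlen(b->word);
// Longest first: negative when a is longer than b, positive when it is shorter.
return (lenA < lenB) - (lenA > lenB);
}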
qsort(pairs, index, sizeof(Pair), comparePairs);
The index would start at 0, which indicates the next free Pair is at that index. Adding an element would be:
pairs[index].word = someWord; // someWord must be allocated elsewhere!
pairs[index].count = 1;
index++;
Note that, because your structure only has a char pointer, someWord must be allocated elsewhere. Without automatic memory management this is going to be rather cumbersome. A simpler alternative is to copy the word into the structure itself, using the following layout:
typedef struct pair
{
char word[50]; // Assumes a word is NEVER longer than 49 characters.
int count;
} Pair;
Adding a new element would then become:
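strncpy(pairs[index].word, someWord, 50 - 1);
pairs[index].word[50 - 1] = '\0'; // strncpy() does not terminate the string when it truncates.
pairs[index].count = 1;
index++;
The strncpy() above copies at most 49 characters, and the explicit '\0' guarantees the result is terminated even when the word is truncated. Choose the size (50 here) generously enough that strncpy() never chops off the ends of your longest words.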
Of course, to know whether to add a new Pair or simply increment the count of an existing one, you first need to search through the existing Pairs with a simple loop.
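The search-then-add step described above could look roughly like the sketch below. The names add_word, sort_by_length, pairCount, MAX_PAIRS and MAX_WORD are made up for this illustration (pairCount plays the role of index), and the 50-byte word limit is the same assumption as before.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_PAIRS 32000
#define MAX_WORD  50      /* assumed upper bound, including the '\0' */

typedef struct pair {
    char word[MAX_WORD];
    int count;
} Pair;

Pair pairs[MAX_PAIRS];
int pairCount = 0;        /* index of the next free pair */

/* Counts one occurrence of word. Returns 0 on success, -1 if the table is full. */
int add_word(const char *word)
{
    /* Linear search; only the part of the word that fits is compared. */
    for (int i = 0; i < pairCount; i++) {
        if (strncmp(pairs[i].word, word, MAX_WORD - 1) == 0) {
            pairs[i].count++;
            return 0;
        }
    }
    if (pairCount == MAX_PAIRS)
        return -1;
    strncpy(pairs[pairCount].word, word, MAX_WORD - 1);
    pairs[pairCount].word[MAX_WORD - 1] = '\0';
    pairs[pairCount].count = 1;
    pairCount++;
    return 0;
}

/* qsort() comparison: longer words sort before shorter ones. */
int compare_by_length(const void *pa, const void *pb)
{
    size_t lenA = strlen(((const Pair *)pa)->word);
    size_t lenB = strlen(((const Pair *)pb)->word);
    return (lenA < lenB) - (lenA > lenB);
}

void sort_by_length(void)
{
    qsort(pairs, (size_t)pairCount, sizeof pairs[0], compare_by_length);
}
```

Call add_word() once per word read from the file, then sort_by_length() before printing. The linear search is fine for a few thousand distinct words; for much larger vocabularies a hash table would be the usual next step.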
Related
I'm trying to create data structures that would help to store a dictionary with words and their definitions.
At first, I have defined the following structures that describe entry in the dictionary and structure of the dictionary itself, respectively:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
typedef struct{
char wordInDictionary[32];
int numberOfMeanings;
char *wordDefinitions[10];
}Entry;
typedef struct{
int entries;
Entry *arrayOfEntries[10000];
}Dictionary;
As the next step, I should have created the empty dictionary with 0 entries. Here is what I had an attempt on:
Dictionary createDictionary(){
Dictionary emptyDictionary;
emptyDictionary.entries = 0;
int i, j;
for(i=0;i<10000;i++){
emptyDictionary.arrayOfEntries[i]->numberOfMeanings = 0;
strcpy(emptyDictionary.arrayOfEntries[i]->wordInDictionary, "\0");
for(j=0;j<10;j++){
strcpy(emptyDictionary.arrayOfEntries[i]->wordDefinitions[j], "\0");
}
}
return emptyDictionary;
}
However when I test it in main() it shows that there is segmentation fault: 11:
int main(){
Dictionary dict = createDictionary();
printf("%s", dict.arrayOfEntries[0]->wordInDictionary);
printf("%i",dict.entries);
}
How could I fix the issue so that I would return a dictionary with 0 entries?
Your emptyDictionary.arrayOfEntries[] is an array of 10000 pointers that point to ... nothing.
The first line in createDictionary() creates the Dictionary as a local (stack) variable, so the array inside it is uninitialized. On some systems it may happen to contain NULL pointers, but in general you must assume it contains garbage.
As soon as you touch emptyDictionary.arrayOfEntries[0]->... in the first iteration of your for-loop, you are dereferencing that pointer (by using -> on it). That is the point of the crash.
You need to make each pointer refer to a real Entry by allocating one first.
#define DEFINITION_SIZE 40
.....
for(i=0;i<10000;i++){
// first allocate the entry before touching it.
emptyDictionary.arrayOfEntries[i] = (Entry*)malloc(sizeof(Entry)); // size of the Entry itself, not of a pointer to it
emptyDictionary.arrayOfEntries[i]->numberOfMeanings = 0;
strcpy(emptyDictionary.arrayOfEntries[i]->wordInDictionary, "\0");
for(j=0;j<10;j++){
// first allocate this item as well
emptyDictionary.arrayOfEntries[i]->wordDefinitions[j] =
(char*)malloc(DEFINITION_SIZE);
strcpy(emptyDictionary.arrayOfEntries[i]->wordDefinitions[j], "\0");
}
}
But there are a few problems:
- The structure is huge. 10000 entries, each containing another array of 32 chars and 10 pointers.
- The return statement in createDictionary() copies and returns the whole structure by value (a shallow copy: the pointers are copied, not the entries they point to).
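One way around both problems, sketched below under some assumptions, is to allocate the Dictionary itself on the heap, leave every slot NULL, and allocate an Entry only when a word is actually added. The helpers addEntry and destroyDictionary and the MAX_ENTRIES and MAX_MEANINGS macros are not in the question; they are names invented for this example.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_ENTRIES  10000
#define MAX_MEANINGS 10

typedef struct {
    char wordInDictionary[32];
    int numberOfMeanings;
    char *wordDefinitions[MAX_MEANINGS];
} Entry;

typedef struct {
    int entries;
    Entry *arrayOfEntries[MAX_ENTRIES];
} Dictionary;

/* Returns a heap-allocated dictionary with 0 entries, or NULL on failure. */
Dictionary *createDictionary(void)
{
    Dictionary *dict = malloc(sizeof *dict);
    if (dict == NULL)
        return NULL;
    dict->entries = 0;
    for (int i = 0; i < MAX_ENTRIES; i++)
        dict->arrayOfEntries[i] = NULL;   /* nothing allocated yet */
    return dict;
}

/* Allocates one Entry for word and appends it. Returns NULL on failure. */
Entry *addEntry(Dictionary *dict, const char *word)
{
    if (dict->entries >= MAX_ENTRIES)
        return NULL;
    Entry *entry = malloc(sizeof *entry);
    if (entry == NULL)
        return NULL;
    strncpy(entry->wordInDictionary, word, sizeof entry->wordInDictionary - 1);
    entry->wordInDictionary[sizeof entry->wordInDictionary - 1] = '\0';
    entry->numberOfMeanings = 0;
    for (int j = 0; j < MAX_MEANINGS; j++)
        entry->wordDefinitions[j] = NULL;
    dict->arrayOfEntries[dict->entries++] = entry;
    return entry;
}

/* Frees every definition, every entry, and the dictionary itself. */
void destroyDictionary(Dictionary *dict)
{
    if (dict == NULL)
        return;
    for (int i = 0; i < dict->entries; i++) {
        Entry *entry = dict->arrayOfEntries[i];
        for (int j = 0; j < entry->numberOfMeanings; j++)
            free(entry->wordDefinitions[j]);
        free(entry);
    }
    free(dict);
}
```

With this layout main() would hold a Dictionary * and must check dict->entries before touching dict->arrayOfEntries[0], because an empty dictionary has no first entry to print.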
I'm supposed to implement a sorting algorithm as CS homework.
It should read an arbitrary number of words, each ending with '\n'. After it reads a '.', it should print the words in alphabetical order.
E.g.:
INPUT:
apple
dog
austria
Apple
OUTPUT:
Apple
apple
Austria
dog
I want to store the words in a struct. I think that, to make it work for an arbitrary number of words, I should make an array of structs.
So far I've tried to create a typedef struct with only one member (a string), and I planned to make an array of structs from that, in which I would then store each of the words.
As for the "randomness" of the number of words, I wanted to declare the struct array in main after finding out how many words had been written, and then store each word in one element of the array.
My problem is:
1. I don't know how to find out the number of words. The only thing I tried was writing a function that counts how many times '\n' occurred, but it didn't work very well.
As for the data structure, I've come up with a struct that has only one string member:
typedef struct{
char string[MAX];
}sort;
Then, in the main function, I first read the number of words to come (not the actual assignment, only to get the code working),
and once I have "len" I declare a variable of type sort:
int main(){
/*code, scanf("%d", &len) and stuff*/
sort sort_t[len];
for(i = 0; i < len; i++){
scanf("%s", sort_t[i].string);
}
Question: Is such a thing "legal", and am I using a good approach?
Q2: How do I get to know the number of words to store (for the array of structs) before I start storing them?
IMHO the idea of reserving the same maximal storage for each and every string is a bit wasteful. You are probably better off sticking to dynamic NUL-terminated strings as usually done in C code. This is what the C library supports best.
As for managing an unknown number of strings, you have a choice. Possibility 1 is to use a linked list as mentioned by Xavier. It is probably the most elegant solution, but it could be time-consuming to debug, and ultimately you have to convert it to an array in order to use one of the common sort algorithms.
Possibility 2 is to use something akin to a C++ std::vector object. Say the task of allocating storage is delegated to some "bag" object. Code dealing with the "bag" has a monopoly on calling the realloc() function mentioned by Vlad. Your main function only calls bag_create() and bag_put(bag, string). This is less elegant but probably easier to get right.
As your focus is to be on your sorting algorithm, I would rather suggest using approach #2. You could use the code snippet below as a starting point.
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
typedef struct {
size_t capacity;
size_t usedSlotCount;
char** storage;
} StringBag;
StringBag* bag_create()
{
size_t initialSize = 4; /* start small */
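StringBag* myBag = malloc(sizeof(StringBag));
if (NULL == myBag) {
fprintf(stderr, "Out of memory while creating bag\n");
exit(1);
}
myBag->capacity = initialSize;
myBag->usedSlotCount = 0;
myBag->storage = (char**)malloc(initialSize*sizeof(char*));
if (NULL == myBag->storage) {
fprintf(stderr, "Out of memory while creating bag storage\n");
exit(1);
}
return myBag;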
}
void bag_put(StringBag* myBag, char* str)
{
if (myBag->capacity == myBag->usedSlotCount) {
/* have to grow storage */
size_t oldCapacity = myBag->capacity;
size_t newCapacity = 2 * oldCapacity;
myBag->storage = realloc(myBag->storage, newCapacity*sizeof(char*));
if (NULL == myBag->storage) {
fprintf(stderr, "Out of memory while reallocating\n");
exit(1);
}
fprintf(stderr, "Growing capacity to %lu\n", (unsigned long)newCapacity);
myBag->capacity = newCapacity;
}
/* save string to new allocated memory, as this */
/* allows the caller to always use the same static storage to house str */
char* str2 = malloc(1+strlen(str));
if (NULL == str2) {
fprintf(stderr, "Out of memory while copying string\n");
exit(1);
}
strcpy(str2, str);
myBag->storage[myBag->usedSlotCount] = str2;
myBag->usedSlotCount++;
}
static char inputLine[4096];
int main()
{
StringBag* myBag = bag_create();
/* read input data */
while(scanf("%4095s", inputLine) == 1) { /* width limit keeps long words from overflowing inputLine */
if (0 == strcmp(".", inputLine))
break;
bag_put(myBag, inputLine);
}
/* TODO: sort myBag->storage and print the sorted array */
}
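For the TODO above, here is a hedged sketch of the ordering itself. It uses the library qsort() purely as a stand-in, since the homework asks for your own sorting algorithm; the comparison function can be reused unchanged inside a hand-written sort. The function names are invented for this example, and the tie-break (uppercase before lowercase when two words differ only in case) is an assumption read off the expected output in the question.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Case-insensitive comparison; returns <0, 0 or >0 like strcmp(). */
int compare_ignoring_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        int ca = tolower((unsigned char)*a);
        int cb = tolower((unsigned char)*b);
        if (ca != cb)
            return ca - cb;
        a++;
        b++;
    }
    /* At least one string has ended; the shorter one sorts first. */
    return (unsigned char)*a - (unsigned char)*b;
}

/* qsort() passes pointers to the array elements, i.e. pointers to char*. */
int compare_words(const void *pa, const void *pb)
{
    const char *a = *(const char * const *)pa;
    const char *b = *(const char * const *)pb;
    int result = compare_ignoring_case(a, b);
    if (result == 0)
        result = strcmp(a, b);   /* tie-break: "Apple" before "apple" */
    return result;
}

void sort_words(char **words, size_t count)
{
    qsort(words, count, sizeof *words, compare_words);
}
```

In main() this would be called as sort_words(myBag->storage, myBag->usedSlotCount), followed by a loop that prints each myBag->storage[i].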
This function works well aside from not accepting 0 because it considers it NULL
void addtolist(int list[], int item){
for(int a=0;a<5;a++){
if(list[a]==NULL){
list[a]=item;
break;
}
}
}
Is there any way I can make the array accept zeroes?
Additional info: -List- is a simple int array, accepting -item- inputs with scanf
No, there's no way to distinguish between list[a] == NULL and list[a] == 0; in this comparison NULL is just another way of writing 0. list is nothing more than a block of memory in which every sizeof(int) bytes (typically 4) are treated as an integer.
The basic problem is that you want to distinguish between elements which are empty and those which have a value. There are a few ways to handle this.
Use a special integer.
You could define all negative numbers as "empty". Or, if you need negative values, perhaps just one value. INT_MIN is a good choice.
But special values lead to bugs, and there's no type checking to save you, and you need to specially initialize each list. There are better ways.
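A minimal sketch of the special-value approach, assuming INT_MIN never occurs as real input. The names EMPTY, init_list and add_to_list are made up for this example; note the explicit initialization step that the approach requires.

```c
#include <limits.h>
#include <stddef.h>

#define LIST_SIZE 5
#define EMPTY INT_MIN   /* assumption: INT_MIN is never a legitimate item */

/* Every slot must be marked empty before the list is used. */
void init_list(int list[], size_t size)
{
    for (size_t i = 0; i < size; i++)
        list[i] = EMPTY;
}

/* Stores item in the first empty slot.
   Returns the slot index, or -1 if the list is full. */
int add_to_list(int list[], size_t size, int item)
{
    for (size_t i = 0; i < size; i++) {
        if (list[i] == EMPTY) {
            list[i] = item;
            return (int)i;
        }
    }
    return -1;
}
```

With this, 0 is stored like any other number, but a user who types INT_MIN would silently be treated as "empty", which is exactly the kind of bug mentioned above.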
Use integer pointers.
Instead of storing integers, store pointers to integers. Now NULL works.
#include <stdio.h>
#include <stdlib.h>
int main() {
int **list = calloc(10, sizeof(int*));
int num = 42;
list[5] = &num;
for( size_t i = 0; i < 10; i++ ) {
int *entry = list[i];
if( entry == NULL ) {
continue;
}
printf("%d\n", *entry);
}
}
The downside is that you're storing pointers, so you have to remember to copy the values, lest you alter the original memory. Also, NULL can no longer be used as a sentinel to mark the end of the array.
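To illustrate the copying that the pointer approach requires, here is a small sketch in which each slot owns a private heap copy of its value. The helpers set_slot and clear_slot are hypothetical names, not part of the code above.

```c
#include <stdlib.h>

/* Stores a private heap copy of value in list[index].
   Returns 0 on success, -1 if the allocation fails. */
int set_slot(int **list, size_t index, int value)
{
    int *copy = malloc(sizeof *copy);
    if (copy == NULL)
        return -1;
    *copy = value;
    free(list[index]);      /* free(NULL) is a no-op, so empty slots are fine */
    list[index] = copy;
    return 0;
}

/* Releases the slot's copy and marks the slot empty again. */
void clear_slot(int **list, size_t index)
{
    free(list[index]);
    list[index] = NULL;
}
```

Because the slot points at its own copy, later changes to the variable the value came from no longer show up in the list. The cost is one allocation per stored integer.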
Use a hash or tree
The real solution is to change your data structure to better match the job. An array is great if you have a complete, ordered list of things easily indexed with integers. If you have an incomplete list with gaps, a hash might be a better fit. Here's an example using GLib's hash table.
#include <stdio.h>
#include <glib.h>
// A little convenience function for inserting integers into a hash
// that normally wants pointers.
gboolean hash_table_insert_int(
GHashTable *table, int key, int value
) {
return g_hash_table_insert( table, GINT_TO_POINTER(key), GINT_TO_POINTER(value) );
}
int main() {
GHashTable *numbers = g_hash_table_new(g_direct_hash, g_direct_equal);
// Add 5 -> 42 and 9 -> 23
hash_table_insert_int( numbers, 5, 42 );
hash_table_insert_int( numbers, 9, 23 );
// Iterate through the entries in the table.
GHashTableIter iter;
gpointer key, value;
g_hash_table_iter_init(&iter, numbers);
while( g_hash_table_iter_next(&iter, &key, &value) ) {
printf("%d -> %d\n", GPOINTER_TO_INT(key), GPOINTER_TO_INT(value));
}
}
GLib's data structures take some getting used to because they're designed to be type generic, but it's worth it for the robust flexibility they bring. Now you have a data structure in which you can explicitly insert and delete entries without messing with special values, which knows its memory boundaries, and which is memory and performance efficient.
Newbie here,
I have a struct for a word, which contains a char array for the word itself (the struct has other members, which are unrelated to my question), and I'm trying to store it in a hashmap, which is an array of word struct pointers. In my program, every time I see a new word, I create a new word struct and fill in the char array. However, after a few runs through the loop, it changes the old word to the new word, even though they are at different hashmap locations.
What I'm wondering is if it's possible to have the loop in which I create the new word struct point to a new address?
struct words add;
int b;
for(b = 0; b < strlen(LowerCaseCopy); b++)
{
add.word[b] = '\0';
}
for(b=0;b< strlen(LowerCaseCopy);b++)
{
add.word[b] = LowerCaseCopy[b];
}
hashmap[hashf] = &add;
This is the code in question.
An example of my problem:
the first runthrough of the loop, I set add.word to apple, which is stored at a specific hashmap slot.
the next runthrough of the loop, I set add.word to orange, which is stored at a different slot. The problem is that at the first slot, it no longer stores apple, it instead stores orange, so I have 2 slots that store orange, which is not what I want. How do I fix this?
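The problem is that add is a single variable, so &add is the same address on every iteration: every slot ends up pointing at the same struct, which holds whatever word was written last. A simple solution (I think) would be to put the functionality to add entries to the hashmap in a separate function. This function allocates a new words structure and puts that in the hashmap:
void add_to_hashmap(struct words **hashmap, char *lower_case_word)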
{
/* Using "calloc" we don't have to manually clear the structure */
struct words *words = calloc(1, sizeof(struct words));
/* Copy +1 to include the terminating '\0' */
memcpy(words->word, lower_case_word, strlen(lower_case_word) + 1);
/* Replace this with whatever you use to calculate the hash */
int hashf = calculate_hash(lower_case_word);
hashmap[hashf] = words;
}
If you remove an entry (i.e. setting it to NULL) you have to remember to free it first.
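Removal could then look like the sketch below. The struct layout and the WORD_CAPACITY size are placeholders standing in for your real struct words, and remove_from_hashmap is a name invented for this example.

```c
#include <stdlib.h>

#define WORD_CAPACITY 64   /* placeholder size, not taken from the question */

struct words {
    char word[WORD_CAPACITY];
};

/* Frees the entry stored in slot and marks the slot as empty.
   Safe to call on an already empty slot, because free(NULL) does nothing. */
void remove_from_hashmap(struct words **hashmap, size_t slot)
{
    free(hashmap[slot]);
    hashmap[slot] = NULL;
}
```

Setting the slot to NULL after free() also keeps a later lookup from following a dangling pointer.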
For an assignment at school, we have to use structs to make matrices that can store an infinite number of points for an infinite number of matrices (theoretically infinite).
For the assignment I decided to use calloc and realloc. How the sizes for the matrix go is: It doubles in size every time its limit is hit for its points (so it starts at 1, then goes to 2, then 4 and so on). It also doubles in size every time a matrix is added as well.
This is where my issue lies. After the initial matrix is added, and it goes to add the second matrix name and points, it gives me the following:
B???????????????????????????????? (followed by hundreds more '?' characters)
B is the portion of it that I want (as I use strcmp later on), but the ? marks are not supposed to be there. (obviously)
I am not sure why it is exactly doing this. Since the code is modular it isn't very easy to get portions of it to show exactly how it is going about this.
Note: I can access the points of the matrix via its method of: MyMatrix[1].points[0].x_cord; (this is just an example)
Sample code that produces problem:
STRUCTS:
struct matrice {
char M_name[256];
int num_points[128];
int set_points[128];
int hasValues[1];
struct matrice_points * points;
} * MyMatrix;
struct matrice_points {
int set[1];
double cord_x;
double cord_y;
};
Setup Matrix Function:
void setupMatrix(){
MyMatrix = calloc(1, sizeof(*MyMatrix));
numMatrix = 1;
}
Grow Matrix Function:
void growMatrix(){
MyMatrix = realloc(MyMatrix, numMatrix * 2 * sizeof(*MyMatrix));
numMatrix = numMatrix * 2;
}
Add Matrix Function which outputs this problem after growing the matrix once.
void addMatrix(char Name, int Location){
int exists = 0;
int existsLocation = 0;
for (int i = 0; i < numMatrix; i++){
if (strcmp(MyMatrix[i].M_name, &Name) == 0){
exists = 1;
existsLocation = i;
}
}
*MyMatrix[Location].M_name = Name;
printf("Stored Name: %s\n", MyMatrix[Location].M_name);
*MyMatrix[Location].num_points = 1;
*MyMatrix[Location].set_points = 0;
*MyMatrix[Location].hasValues = 1;
MyMatrix[Location].points = calloc(1, sizeof(*MyMatrix[Location].points));
}
void addMatrix(char Name, int Location)
char Name represents a single char, i.e. an integer-type quantity. A char is just a number; it's not a string at all.
When you do this:
strcmp(..., &Name)
you're assuming that the location where that one character is stored holds a valid C string. This is wrong: nothing guarantees that a terminating '\0' follows it in memory, so strcmp() may read past the variable. If you want to pass a C string to this function, you will need to declare it like this:
void addMatrix(char *Name, int Location)
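Then you need to copy that C string into the appropriate place in your matrix structure. It should look like this (note that strncpy() does not add a terminating '\0' when the source is too long, so set the last byte of M_name yourself):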
strncpy(... .M_name, Name, max_number_of_chars_you_can_store_in_M_Name);
Also these field definitions are strange in your struct:
int num_points[128];
int set_points[128];
int hasValues[1];
This means that your struct will contain an array of 128 ints called num_points, another array of 128 ints called set_points, and an array of one int (strange) called hasValues. If you only need to store the count of total points and set points, and a flag indicating whether values are stored, the definition should be:
int num_points;
int set_points;
int hasValues;
and correct the assignments in your addMatrix function.
If you do need those arrays, then your assignments as written are also wrong: *MyMatrix[Location].num_points = 1 only sets element 0 of the array.
Please turn on all warnings in your compiler.
Try adding '\0' to the end of your data.
*MyMatrix[Location].M_name = Name;
You're copying a single character here, not a string. If you want a string, Name should be defined as char *, and you should be using strcpy.
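Putting the suggestions from these answers together, a corrected version might look like the sketch below. It assumes the plain int fields proposed above and adds two things that are not in the original code: return values for error reporting, and a memset() in growMatrix(), because realloc() leaves the newly added elements uninitialized, so calling strcmp() on their M_name would read garbage.

```c
#include <stdlib.h>
#include <string.h>

struct matrice_points {
    int set;
    double cord_x;
    double cord_y;
};

struct matrice {
    char M_name[256];
    int num_points;
    int set_points;
    int hasValues;
    struct matrice_points *points;
};

struct matrice *MyMatrix = NULL;
int numMatrix = 0;

int setupMatrix(void)
{
    MyMatrix = calloc(1, sizeof *MyMatrix);
    if (MyMatrix == NULL)
        return -1;
    numMatrix = 1;
    return 0;
}

int growMatrix(void)
{
    struct matrice *grown =
        realloc(MyMatrix, (size_t)numMatrix * 2 * sizeof *MyMatrix);
    if (grown == NULL)
        return -1;              /* the old block is still valid */
    /* Zero the new half so that every new M_name is an empty string. */
    memset(grown + numMatrix, 0, (size_t)numMatrix * sizeof *grown);
    MyMatrix = grown;
    numMatrix *= 2;
    return 0;
}

/* Stores Name in slot Location.
   Returns 0 on success, -1 on a bad slot or a failed allocation. */
int addMatrix(const char *Name, int Location)
{
    if (Location < 0 || Location >= numMatrix)
        return -1;
    struct matrice *m = &MyMatrix[Location];
    strncpy(m->M_name, Name, sizeof m->M_name - 1);
    m->M_name[sizeof m->M_name - 1] = '\0';   /* always terminated */
    m->num_points = 1;
    m->set_points = 0;
    m->hasValues = 1;
    m->points = calloc(1, sizeof *m->points);
    return m->points != NULL ? 0 : -1;
}
```

The caller now passes a string, for example addMatrix("B", 1), and printf("%s") on M_name prints just B because the name is properly terminated.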