So, I have two files of financial data, say 'symbols', and 'volumes'. In symbols I have strings such as:
FOO
BAR
BAZINGA
...
In volumes, I have integer values such as:
0001387
0000022
0123374
...
The idea is that the stock symbols will repeat in the file and I need to find the total volume of each stock. So, each row where I observe foo I increment total volume of foo by the value observed in volumes. The problem is that these files can be huge: easily 5 - 100 million records. A typical day may have ~1K different symbols in the file.
Doing it using strcmp on symbols each new line will be very inefficient. I was thinking of using an associative array --- hash table library which allows string keys --- such as uthash or Glib's hashtable.
I am reading some pretty good things about Judy arrays? Is the licensing a problem in this case?
Any thoughts on the choice of an efficient hash-table implementation? And also, whether I should use hash tables at all or perhaps something else entirely.
Apologies for the omission earlier: I need a pure C solution.
Thanks.
A hash table definitely sounds good. You should look at the libiberty implementation.
You can find it in the GCC project here.
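For a pure C take on the hash table idea, a minimal sketch of a chained table keyed by symbol could look like the following. The names (vol_table, vol_add, vol_get), the bucket count and the djb2-style hash are all assumptions made for illustration; this is not libiberty's API.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4096  /* assumed size: a few times the ~1K distinct symbols */

struct vol_entry {
    char *symbol;              /* heap copy of the key */
    unsigned long total;       /* running total volume */
    struct vol_entry *next;    /* next entry in the same bucket */
};

struct vol_table {
    struct vol_entry *buckets[NBUCKETS];
};

/* djb2-style string hash, reduced to a bucket index. */
static size_t vol_hash(const char *s)
{
    unsigned long h = 5381;
    while (*s != '\0')
        h = h * 33 + (unsigned char) *s++;
    return (size_t) (h % NBUCKETS);
}

/* Add volume to the total for symbol. Returns 0 on success, -1 if out of memory. */
int vol_add(struct vol_table *t, const char *symbol, unsigned long volume)
{
    size_t b = vol_hash(symbol);
    struct vol_entry *e;

    for (e = t->buckets[b]; e != NULL; e = e->next) {
        if (strcmp(e->symbol, symbol) == 0) {
            e->total += volume;
            return 0;
        }
    }
    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->symbol = malloc(strlen(symbol) + 1);
    if (e->symbol == NULL) {
        free(e);
        return -1;
    }
    strcpy(e->symbol, symbol);
    e->total = volume;
    e->next = t->buckets[b];   /* push onto the front of the chain */
    t->buckets[b] = e;
    return 0;
}

/* Return the total for symbol, or 0 if it was never added. */
unsigned long vol_get(const struct vol_table *t, const char *symbol)
{
    const struct vol_entry *e;
    for (e = t->buckets[vol_hash(symbol)]; e != NULL; e = e->next)
        if (strcmp(e->symbol, symbol) == 0)
            return e->total;
    return 0;
}
```

A zero-initialized struct vol_table (for example one with static storage) is an empty table. With around 1K symbols spread over a few thousand buckets, most lookups should touch a single short chain, so each record costs roughly one hash and one strcmp.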
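I would use the C++ STL map. Here is roughly what the code looks like:
#include <fstream>
#include <iostream>
#include <map>
#include <string>
int main()
{
    std::ifstream symbols("symbols"), volumes("volumes");
    std::map<std::string, long> totals;
    std::string stock_name;
    long stock_value;
    // Read one symbol and one volume per record until either file ends.
    while (symbols >> stock_name && volumes >> stock_value)
        totals[stock_name] += stock_value;
    for (const auto &entry : totals)
        std::cout << entry.first << " " << entry.second << "\n";
}
Given the amount of data you described, it may be a bit inefficient, but I'd suggest it because it's much easier to implement.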
If the solution has to be strictly in C, then hashing is the best approach. But if you feel that implementing a hash table and writing the code to handle collisions is too complex, I have another idea: use a trie. It may sound odd, but it can help here too.
I would suggest you read this one. It has a nice explanation of what a trie is and how to construct it, and it also gives an implementation in C. You may wonder where to store the volume for each stock: it can be stored in the node that ends the stock's string and updated whenever needed.
But since you say you are new to C, I advise you to try the hash table implementation first and then try this one.
Why not stick to your associative array idea? I assume that at the end of execution you need a list of unique names with their aggregated values. The code below will work as long as you have enough memory to hold all the unique names. Of course, a linear scan like this is not that efficient, but a few tricks are possible depending on the patterns in your data.
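#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAX_NAMES 4096 /* this could be worse if all names are unique */
struct struct_Customers
{
    char *name; /* heap copy of the name */
    long value; /* aggregated value */
};
static struct struct_Customers Customers[MAX_NAMES];
static size_t Consolidate_Index = 0;
/* Add value to the entry for name, creating the entry if it is new. */
void consolidate_names(const char *name, long value)
{
    size_t i;
    for (i = 0; i < Consolidate_Index; i++) {
        if (strcmp(Customers[i].name, name) == 0) {
            Customers[i].value += value;
            return;
        }
    }
    if (Consolidate_Index == MAX_NAMES)
        return; /* table full; real code should report this */
    /* Not seen before: allocate memory for the name now. */
    Customers[Consolidate_Index].name = malloc(strlen(name) + 1);
    if (Customers[Consolidate_Index].name == NULL)
        return;
    strcpy(Customers[Consolidate_Index].name, name);
    Customers[Consolidate_Index].value = value;
    Consolidate_Index++;
}
int main(void)
{
    char name[64];
    long value;
    size_t i;
    FILE *names = fopen("symbols", "r");  /* file names taken from the question */
    FILE *values = fopen("volumes", "r");
    if (names == NULL || values == NULL)
        return 1;
    /* Read one name and one value per record until either file is done. */
    while (fscanf(names, "%63s", name) == 1 && fscanf(values, "%ld", &value) == 1)
        consolidate_names(name, value);
    for (i = 0; i < Consolidate_Index; i++)
        printf("%s %ld\n", Customers[i].name, Customers[i].value);
    fclose(names);
    fclose(values);
    return 0;
}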
My solution:
I did end up using the JudySL array to solve this problem. After some reading, the solution was quite simple to implement using Judy. I am replicating the solution here in full for it to be useful to anyone else.
#include <stdio.h>
#include <Judy.h>
const unsigned int BUFSIZE = 10; /* A symbol is only 8 chars wide. */
int main (int argc, char const **argv) {
if (argc < 4) return 1; /* expects: symbols file, volumes file, output file */
FILE *fsymb = fopen(argv[1], "r");
if (fsymb == NULL) return 1;
FILE *fvol = fopen(argv[2], "r");
if (fvol == NULL) return 1;
FILE *fout = fopen(argv[3], "w");
if (fout == NULL) return 1;
unsigned int lnumber = 0;
uint8_t symbol[BUFSIZE];
unsigned long volume;
/* Initialize the associative map as a JudySL array. */
Pvoid_t assmap = (Pvoid_t) NULL;
Word_t *value;
while (1) {
/* The width keeps the read inside symbol[BUFSIZE]; stop at end of file or on bad input. */
if (fscanf(fsymb, "%9s", (char *) symbol) != 1) break;
if (fscanf(fvol, "%lu", &volume) != 1) break;
++lnumber;
/* Insert a new symbol or return value if exists. */
JSLI(value, assmap, symbol);
if (value == PJERR) {
fclose(fsymb);
fclose(fvol);
fclose(fout);
return 2;
}
*value += volume;
}
symbol[0] = '\0'; /* Start from the empty string. */
JSLF(value, assmap, symbol); /* Find the next string in the array. */
while (value != NULL) {
fprintf(fout, "%s: %lu\n", symbol, *value); /* Print to output file. */
JSLN(value, assmap, symbol); /* Get next string. */
}
Word_t tmp;
JSLFA(tmp, assmap); /* Free the entire array. */
fclose(fsymb);
fclose(fvol);
fclose(fout);
return 0;
}
I tested the solution on a 'small' sample containing 300K lines. The output is correct and the elapsed time was 0.074 seconds.
I have a few questions about an assignment that I need to do. It might seem that what I'm looking for is the code; however, what I'm trying to do is learn, because after weeks of searching for information I'm lost. I'm really new to C.
Here is the assignment :
Given 3 files (foo.txt , bar.txt , foo2.txt) they all have a different amount of words (I need to use dynamic memory).
Create a program that ask for a word and tells you if that word is in any of the documents (the result is the name of the document where it appears).
Example :
Please enter a word: dog
"dog" is in foo.txt and bar.txt
(I guess I need to load the 3 files and create a hash table that has the key values for every word in the documents, but also has something that tells you which document the word is in.)
I guess i need to implement:
A Hash Function that converts a word into a HashValue
A Hash Table that stores the HashValue of every word (But i think i should also store the document index?).
Use of dynamic allocation.
Check for collisions while im inserting values into the hash table (Using Quadratic Probing and also Chaining).
Also i need to know how many times the word im looking for appears in the text.
I've been searching for hash map implementations, hash tables, quadratic probing, hash functions for strings... but my head is a mess right now and I don't really know where I should start.
So far I've read:
Algorithm to get a list of all words that are anagrams of all substrings (scrabble)?
Implementing with quadratic probing
Does C have hash/dictionary data structure?
https://gist.github.com/tonious/1377667
hash function for string
http://www.cs.yale.edu/homes/aspnes/pinewiki/C(2f)HashTables.html?highlight=(CategoryAlgorithmNotes)
https://codereview.stackexchange.com/questions/115843/dictionary-implementation-using-hash-table-in-c
Sorry for my english in advance.
Hope you can help me.
Thanks.
FIRST EDIT
Thanks for the quick responses.
I'm trying to put it all together and try something. However, @Shasha99, I cannot use the trie data structure; I'm checking the links you gave me.
@MichaelDorgan Thanks for posting a solution for beginners, but I must use hashing (it's for an Algorithms and Structures class), and the teacher told us we MUST implement a hash function, a hash table and probably another structure that stores important information.
After thinking for an hour I tried the following :
A Structure that stores the word, the number of documents where it appears and the index of those documents.
typedef struct WordMetadata {
char* Word;
int Documents[5];
int DocumentsCount;
} WordMetadata;
A function that Initializes that structure
void InitTable (WordMetadata **Table) {
Table = (WordMetadata**) malloc (sizeof(WordMetadata) * TABLESIZE);
for (int i = 0; i < TABLESIZE; i++) {
Table[i] = (WordMetadata*) NULL;
}
}
A function that Loads to memory the 3 documents and index every word inside the hash table.
A function that index a word in the mentioned structure
A function that search for the specific word using Quadratic Probing (If i solve this i will try with the chaining one...).
A function that calculates the hash value of a word (I think i will use djb2 or any of the ones i found here http://www.cse.yorku.ca/~oz/hash.html) but for now :
int Hash (char *WordParam) {
int i; /* declared outside the loop so it is still in scope for the return */
for (i = 0; *WordParam != '\0';) {
i += *WordParam++;
}
return (i % TABLESIZE);
}
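For comparison, here is a sketch of djb2 adapted to return a table index. The function name HashDjb2 and the TABLESIZE value are assumptions for illustration; the algorithm itself (start at 5381, then hash * 33 + c for each character) is the one described on the page linked above.

```c
#define TABLESIZE 4001 /* assumed: a prime table size */

/* djb2: start at 5381 and fold in each character as hash * 33 + c. */
int HashDjb2(const char *Word)
{
    unsigned long HashValue = 5381;
    while (*Word != '\0') {
        HashValue = HashValue * 33 + (unsigned char) *Word++;
    }
    return (int) (HashValue % TABLESIZE);
}
```

Unlike a plain sum of characters, this depends on character order, so anagrams such as "ab" and "ba" generally land in different slots. Using an unsigned accumulator also avoids negative results from signed overflow.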
EDIT 2
I tried to implement something. It's not working, but could you take a look and tell me what is wrong? (I know the code is a mess.)
EDIT 3
This code compiles and runs properly; however, some words are not found (maybe they are not indexed, I don't know). I'm thinking about moving to another hash function, as I mentioned in my first message.
Approximately 85% of the words from every text file (~200 words each) are correctly found by the program.
The others are random words that I think are not indexed correctly, or maybe I have an error in my search function...
Here is the current code (it compiles and runs):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define TABLESIZE 4001
#define LINESIZE 2048
#define DELIMITER " \t"
typedef struct TTable {
char* Word; /* The actual word */
int Documents[5]; /* Documents Index */
int DocumentsCount; /* Number of documents where the word exist */
} TTable;
int Hash (char *Word);
void Index (TTable **HashTable, char* Word, int DocumentIndex);
int Search (TTable **HashTable, char* Word);
int mystrcmp(char *s1, char *s2);
char* Documents[] = {"foo.txt","bar.txt","foo2.txt",NULL};
int main() {
FILE* file;
TTable **HashTable;
int DocumentIndex;
char Line[LINESIZE];
char* Word;
char* Tmp;
HashTable = (TTable**) malloc (sizeof(TTable)*TABLESIZE);
for (int i = 0; i < TABLESIZE; i++) {
HashTable[i] = (TTable*) NULL;
}
for (DocumentIndex = 0; Documents[DocumentIndex] != NULL; DocumentIndex++) {
file = fopen(Documents[DocumentIndex],"r");
if (file == NULL) {
fprintf(stderr, "Error%s\n", Documents[DocumentIndex]);
continue;
}
while (fgets (Line,LINESIZE,file) != NULL) {
Line[LINESIZE-1] = '\0';
Tmp = strtok (Line,DELIMITER);
do {
Word = (char*) malloc (strlen(Tmp)+1);
strcpy(Word,Tmp);
Index(HashTable,Word,DocumentIndex);
Tmp = strtok(NULL,DELIMITER);
} while (Tmp != NULL);
}
fclose(file);
}
printf("Enter the word:");
fgets(Line,100,stdin);
Line[strlen(Line)-1]='\0'; //fgets stores newline as well. so removing newline.
int i = Search(HashTable,Line);
if (i != -1) {
for (int j = 0; j < HashTable[i]->DocumentsCount; j++) {
printf("%s\n", Documents[HashTable[i]->Documents[j]]);
if ( j < HashTable[i]->DocumentsCount-1) {
printf(",");
}
}
}
else {
printf("Cant find word\n");
}
for (i = 0; i < TABLESIZE; i++) {
if (HashTable[i] != NULL) {
free(HashTable[i]->Word);
free(HashTable[i]);
}
}
return 0;
}
/* Theorem: If TableSize is prime and ? < 0.5, quadratic
probing will always find an empty slot
*/
int Search (TTable **HashTable, char* Word) {
int Aux = Hash(Word);
int OldPosition,ActualPosition;
ActualPosition = -1;
for (int i = 0; i < TABLESIZE; i++) {
OldPosition = ActualPosition;
ActualPosition = (Aux + i*i) % TABLESIZE;
if (HashTable[ActualPosition] == NULL) {
return -1;
}
if (strcmp(Word,HashTable[ActualPosition]->Word) == 0) {
return ActualPosition;
}
}
return -1; // Word not found
}
void Index (TTable **HashTable, char* Word, int DocumentIndex) {
int Aux; //Hash value
int OldPosition, ActualPosition;
if ((ActualPosition = Search(HashTable,Word)) != -1) {
for (int j = 0; j < HashTable[ActualPosition]->DocumentsCount;j++) {
if(HashTable[ActualPosition]->Documents[j] == DocumentIndex) {
return;
}
}
HashTable[ActualPosition]->Documents[HashTable[ActualPosition]->DocumentsCount] = DocumentIndex; HashTable[ActualPosition]->DocumentsCount++;
return;
}
ActualPosition = -1;
Aux = Hash(Word);
for (int i = 0; i < TABLESIZE; i++) {
OldPosition = ActualPosition;
ActualPosition = (Aux + i*i) % TABLESIZE;
if (OldPosition == ActualPosition) {
break;
}
if (HashTable[ActualPosition] == NULL) {
HashTable[ActualPosition] = (TTable*)malloc (sizeof(TTable));
HashTable[ActualPosition]->Word = Word;
HashTable[ActualPosition]->Documents[0] = DocumentIndex;
HashTable[ActualPosition]->DocumentsCount = 1;
return;
}
}
printf("No more free space\n");
}
int Hash (char *Word) {
int HashValue;
for (HashValue = 0; *Word != '\0';) {
HashValue += *Word++;
}
return (HashValue % TABLESIZE);
}
I would suggest using a trie to store the strings from all three files in memory, as a hash table would consume more space.
As the first step, read all three files one by one and, for each word in file_i, do the following:
If the word is already present in the trie, append the file index to that node or update the word count for that particular file. You may need 3 variables at each node (for file1, file2 and file3) to store the word counts.
If the word is not present, add the word and the file index to the trie.
Once the trie is built, checking whether a word is present takes time proportional to the length of the word, independent of how many words are stored.
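The trie steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: words contain just the lowercase letters 'a' to 'z', there are exactly 3 files, and the names trie_node, trie_new, trie_add and trie_count are made up for this sketch.

```c
#include <stdlib.h>

#define ALPHABET 26   /* assumption: words contain only 'a'..'z' */
#define NFILES 3

struct trie_node {
    struct trie_node *child[ALPHABET];
    int count[NFILES];   /* occurrences of the word ending here, per file */
};

/* Allocate a zeroed node; returns NULL if out of memory. */
struct trie_node *trie_new(void)
{
    return calloc(1, sizeof(struct trie_node));
}

/* Record one occurrence of word in file file_index. Returns 0 on success, -1 on error. */
int trie_add(struct trie_node *root, const char *word, int file_index)
{
    struct trie_node *n = root;
    for (; *word != '\0'; word++) {
        int c = *word - 'a';
        if (c < 0 || c >= ALPHABET)
            return -1;                    /* character outside the assumed alphabet */
        if (n->child[c] == NULL) {
            n->child[c] = trie_new();
            if (n->child[c] == NULL)
                return -1;
        }
        n = n->child[c];
    }
    n->count[file_index]++;
    return 0;
}

/* Return how often word occurs in file file_index (0 if absent). */
int trie_count(const struct trie_node *root, const char *word, int file_index)
{
    const struct trie_node *n = root;
    for (; *word != '\0' && n != NULL; word++) {
        int c = *word - 'a';
        if (c < 0 || c >= ALPHABET)
            return 0;
        n = n->child[c];
    }
    return n != NULL ? n->count[file_index] : 0;
}
```

To answer "which documents contain this word", call trie_count once per file index and report the files with a non-zero count. Note that every node carries 26 child pointers, so whether this really uses less memory than a hash table depends on how many prefixes the words share.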
If you are going with Hash Tables, then:
You should start with how to get hash values for strings.
Then read about open addressing, probing and chaining
Then understand the problems in open addressing and chaining approaches.
How will you delete and element in hash table with open addressing and probing ? here
How will the search be performed in case of chaining ? here
Making a dynamic hash table with open addressing ? Amortized analysis here and here.
Comparing between chaining and open addressing. here.
Think about how these problems can be resolved. May be TRIE ?
Problem in the code of your EDIT 2:
An outstanding progress from your side !!!
After a quick look, i found the following problems:
Don't use gets(); use fgets() instead. So replace:
gets(Line);
with the following:
fgets(Line,100,stdin);
Line[strcspn(Line, "\n")] = '\0'; //fgets stores the newline as well, so remove it (safe even if no newline was read).
The line:
if ( j < HashTable[j]->DocumentsCount-1){
is causing segmentation fault. I think you want to access HashTable[i]:
if ( j < HashTable[i]->DocumentsCount-1){
In the line:
HashTable[ActualPosition]->Documents[HashTable[ActualPosition]->DocumentsCount];
You were supposed to assign some value. May be this:
HashTable[ActualPosition]->Documents[HashTable[ActualPosition]->DocumentsCount] = DocumentIndex;
malloc returns a void pointer. In C it converts implicitly, so the cast is optional, but it is required if the code is compiled as C++:
HashTable[ActualPosition] = (TTable*)malloc (sizeof(TTable));
You should also initialize the Documents array with default value while creating a new node in Hash:
for(j=0;j<5;j++)HashTable[ActualPosition]->Documents[j]=-1;
You are removing everything from your HashTable after finding the
first word given by user. May be you wanted to place that code outside
the while loop.
Your while loop while(1) does not have any terminating condition, You
should have one.
All the best !!!
For a school assignment, you probably don't need to worry about hashing. For a first pass, you can just get away with a straight linear search instead:
Create 3 pointers to char arrays (or a char ** if you prefer), one for each dictionary file.
Scan each text/dictionary file to see how many individual words reside within it. Depending on how the file is formatted, this may be spaces, strings, newlines, commas, etc. Basically, count the words in the file.
Allocate an array of char * sized to the word count of each file and store it in the char ** for that file. (If 100 words are found in the file: num_words = 100; fooPtr = malloc(sizeof(char *) * num_words);)
Go back through the file a second time and allocate an array of chars to the size of each word in the file and store it in the previously created array. You now have a "jagged 2D array" for every word in each dictionary file.
Now, you have 3 arrays for your dictionaries and can use them to scan for words directly.
When given a word, setup a for loop to look through each file's char array. if the entered word matches with the currently scanned dictionary, you have found a match and should print the result. Once you have scanned all dictionaries, you are done.
Things to make it faster:
Sort each dictionary, then you can binary search them for matches (O(log n)).
Create a hash table and add each string to it for O(1) lookup time (This is what most professional solutions would do, which is why you found so many links on this.)
I've offered almost no code here, just a method. Give it a shot.
One final note: even if you decide to use the hash method or a list or whatever, the code you write with arrays will still be useful.
I'm looking for a way to check if a specific string exists in a large array of strings. The array is multi-dimensional: all_strings[strings][chars];. So essentially, this array is an array of character arrays. Each character array ends in '\0'
Given another array of characters, I need to check to see if those characters are already in all_strings, kind of similar to the python in keyword.
I'm not really sure how to go about this at all, I know that strcmp might help but I'm not sure how I could implement it.
As lurker suggested, the naive method is to simply loop on the array of strings calling strcmp. His string_in function is unfortunately broken due to a misunderstanding regarding sizeof(string_list), and should probably look like this:
#include <string.h>
int string_in(char *needle, char **haystack, size_t haystack_size) {
for (size_t x = 0; x < haystack_size; x++) {
if (strcmp(needle, haystack[x]) == 0) {
return 1;
}
}
return 0;
}
This is fairly inefficient, however. It'll do if you're only going to use it once in a while, particularly on a small collection of strings, but if you're looking for an efficient way to perform the search again and again, changing the search query for each search, the two options I would consider are:
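If all_strings is relatively static, you could sort your array like so: qsort(all_strings, strings, chars, compare_strings);... Then when you want to determine whether a word is present, you can use bsearch to execute a binary search like so: char *result = bsearch(search_query, all_strings, strings, chars, compare_strings);. Here compare_strings stands for a small wrapper of type int (*)(const void *, const void *) that calls strcmp on its arguments; passing strcmp itself is a pointer-type mismatch that compilers warn about. Note that when all_strings changes, you'll need to sort it again.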
If all_strings changes too often, you'll probably benefit from using some other data structure such as a trie or a hash table.
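As a sketch of the sorted-array option, assuming all_strings really is a two-dimensional char array with a fixed row width (CHARS below is an assumed value, and compare_rows, sort_strings and string_in_sorted are names invented for this example):

```c
#include <stdlib.h>
#include <string.h>

#define CHARS 16  /* assumed row width of all_strings */

/* qsort/bsearch comparator: each element is one row, i.e. a char[CHARS]. */
static int compare_rows(const void *a, const void *b)
{
    return strcmp((const char *) a, (const char *) b);
}

/* Sort the rows in place so that bsearch can be used on them. */
void sort_strings(char all_strings[][CHARS], size_t strings)
{
    qsort(all_strings, strings, CHARS, compare_rows);
}

/* Return 1 if query is one of the (sorted) rows, 0 otherwise. */
int string_in_sorted(const char *query, char all_strings[][CHARS], size_t strings)
{
    return bsearch(query, all_strings, strings, CHARS, compare_rows) != NULL;
}
```

Because the elements are the character rows themselves, the comparator receives a pointer to the first character of each row and can hand it straight to strcmp. With an array of char * pointers the comparator would need one more level of indirection.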
Use a for loop. C doesn't have a built-in like Python's in:
int i;
for ( i = 0; i < NUM_STRINGS; i++ )
if ( strcmp(all_strings[i], my_other_string) == 0 )
break;
// Here, i is the index of the matched string in all_strings.
// If i == NUM_STRINGS, then the string wasn't found
If you want it to act like Python's in, you could make it a function:
// Assumes C99
#include <string.h>
#include <stdbool.h>
bool string_in(char *my_str, char *string_list[], size_t num_strings)
{
for ( size_t i = 0; i < num_strings; i++ ) /* size_t matches the type of num_strings */
if (strcmp(my_str, string_list[i]) == 0 )
return true;
return false;
}
You could simply check if a string exists in an array of strings. A better solution might be to actually return the string:
/*
* haystack: The array of strings to search.
* needle: The string to find.
* max: The number of strings to search in "haystack".
*/
char *
string_find(char **haystack, const char *needle, size_t max)
{
char **end = haystack + max;
for (; haystack != end; ++haystack)
if (strcmp(*haystack, needle) == 0)
return *haystack;
return NULL;
}
If you're wanting the behavior of a set, where all strings in the array are unique, then you can use it that way:
typedef struct set_strings {
char **s_arr;
size_t count;
size_t max;
} StringSet;
.
.
.
int
StringSet_add(StringSet *set, const char *str)
{
// If string exists already, the add operation is "successful".
if (string_find(set->s_arr, str, set->count))
return 1;
// Add string to set and return success if possible.
/*
* Code to add string to StringSet would go here.
*/
return 1;
}
If you want to actually do something with the string, you can use it that way too:
/*
* Reverse the characters of a string.
*
* str: The string to reverse.
* n: The number of characters to reverse.
*/
void
reverse_str(char *str, size_t n)
{
char c;
char *end;
for (end = str + n; str < --end; ++str) {
c = *str;
*str = *end;
*end = c;
}
}
.
.
.
char *found = string_find(words, word, word_count);
if (found)
reverse_str(found, strlen(found));
As a general-purpose algorithm, this is reasonably useful and even can be applied to other data types as necessary (some re-working would be required of course). As pointed out by undefined behaviour's answer, it won't be fast on large amounts of strings, but it is good enough for something simple and small.
If you need something faster, the recommendations given in that answer are good. If you're coding something yourself, and you're able to keep things sorted, it's a great idea to do that. This allows you to use a much better search algorithm than a linear search. The standard bsearch is great, but if you want something suitable for fast insertion, you'd probably want a search routine that would provide you with the position to insert a new item to avoid searching for the position after bsearch returns NULL. In other words, why search twice when you can search once and accomplish the same thing?
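A search routine of that kind could be sketched as below. The name string_lower_bound and the found out-parameter are assumptions for illustration; the array is assumed to be sorted in strcmp order.

```c
#include <stddef.h>
#include <string.h>

/*
 * Binary search over a sorted array of strings.
 * Returns the index of the first element that is not less than needle.
 * *found is set to 1 if that element equals needle; otherwise it is set
 * to 0 and the returned index is where needle should be inserted.
 */
size_t
string_lower_bound(char **haystack, size_t count, const char *needle, int *found)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(haystack[mid], needle) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    *found = (lo < count && strcmp(haystack[lo], needle) == 0);
    return lo;
}
```

When found comes back 0, the caller can shift the elements from the returned index onward up by one slot (memmove works for that) and store the new string there, so the array stays sorted after a single search.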
This sort code fails for very large input file data because it takes too long for it to finish.
rewind(ptr);
j=0;
while(( fread(&temp,sizeof(temp),1,ptr)==1) &&( j!=lines-1)) //read object by object
{
i=j+1;
while(fread(&temp1,sizeof(temp),1,ptr)==1) //read next object , to compare previous object with next object
{
if(temp.key > temp1.key) //compare key value of object
{
temp2=temp; //if you don't want to change records and just want to change keys use three statements temp2.key =temp.key;
temp=temp1;
temp1=temp2;
fseek(ptr,j*sizeof(temp),0); //move stream to overwrite
fwrite(&temp,sizeof(temp),1,ptr); //you can avoid above swap by changing &temp to &temp1
fseek(ptr,i*sizeof(temp),0); //move stream to overwrite
fwrite(&temp1,sizeof(temp),1,ptr); //you can avoid above swap by changing &temp1 to &temp
}
i++;
}
j++;
fseek(ptr,j*sizeof(temp),0);
}
Any idea how to make this C code much faster? Also, would using qsort() (from the C standard library) be much faster, and how should it be applied to the above code?
You asked the question Sorting based on key from a file and were given various answers about how to sort in memory. You added a supplemental question as an answer, and then created this question instead (which was correct).
Your code here is basically a disk-based bubble sort, with O(N^2) complexity, and poor time performance because it is manipulating file buffers and disk. A bubble sort is a bad choice at the best of times: simple, yes, but slow.
The basic ways to speed up sorting programs are:
If possible, read all the data into memory, sort in memory, and write the result out.
If it won't all fit into memory, read as much into memory as possible, sort it, and write the sorted data to a temporary file. Repeat as often as necessary to sort all the data. Then merge the temporary files into one file. If the data set is truly astronomical (or the memory truly minuscule), you may have to create intermediate merge files. These days, though, you have to be sorting many hundreds of gigabytes for that to be an issue at all, even on a 32-bit computer.
Make sure you choose a good sorting algorithm. Quick sort with appropriate pivot selection is very good. You could look up 'introsort' too.
You'll find example in-memory sorting code in the answers to the cross-referenced question (your original question). If you choose to write your own sort, you can consider whether to base the interface on the standard C qsort() function. If you write a Quick Sort, you should look at Quicksort — Choosing the pivot where the answers have copious references.
You'll find example merging code in the answer to Merging multiple sorted files into one file. The merging code out-performs the system sort program in its merge mode, which is intriguing since it is not highly polished code (but it is reasonably workmanlike).
You could look at the external sort program described in Software Tools, though it is a bit esoteric in that it is written in 'RatFor' or Rational Fortran. The design, though, is readily transferrable to other languages.
Yes, by all means, use qsort(). Use it either as SpiderPig suggests by reading the whole file into memory, or as the in-memory sort for runs that do fit into memory when preparing for a merge sort. Don't worry about the worst-case performance. A decent implementation will take the median of (first, last, middle) to get fast sorting for the already-sorted and reverse-order pathological cases, plus better average performance in the random case.
This all-in-memory example shows you how to use qsort:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
typedef struct record_tag
{
int key;
char data[12];
} record_type, *record_ptr;
typedef const record_type *record_cptr;
void create_file(const char *filename, int n)
{
record_type buf;
int i;
FILE *fptr = fopen(filename, "wb");
for (i=0; i<n; ++i)
{
buf.key = rand();
snprintf(buf.data, sizeof buf.data, "%d", buf.key);
fwrite(&buf, sizeof buf, 1, fptr);
}
fclose(fptr);
}
/* Key comparison function used by qsort(): */
int compare_records(const void *x, const void *y)
{
const record_type *a = x; /* const void * converts without a cast, and const is kept */
const record_type *b = y;
return (a->key > b->key) - (a->key < b->key);
}
/* Read an input file of (record_type) records, sort by key field, and write to the output file */
void sort_file(const char *ifname, const char *ofname)
{
const size_t MAXREC = 10000;
int n;
FILE *ifile, *ofile;
record_ptr buffer;
ifile = fopen(ifname, "rb");
buffer = (record_ptr) malloc(MAXREC*sizeof *buffer);
n = fread(buffer, sizeof *buffer, MAXREC, ifile);
fclose(ifile);
qsort(buffer, n, sizeof *buffer, compare_records);
ofile = fopen(ofname, "wb");
fwrite(buffer, sizeof *buffer, n, ofile);
fclose(ofile);
}
void show_file(const char *fname)
{
record_type buf;
int n = 0;
FILE *fptr = fopen(fname, "rb");
while (1 == fread(&buf, sizeof buf, 1, fptr))
{
printf("%9d : %-12s\n", buf.key, buf.data);
++n;
}
printf("%d records read\n", n);
fclose(fptr);
}
int main(void)
{
srand(time(NULL));
create_file("test.dat", 99);
sort_file("test.dat", "test.out");
show_file("test.out");
return 0;
}
Notice the compare_records function. The qsort() function needs a comparison function that accepts const void pointers, so those pointers must be converted to the correct type. Then the pattern:
(left > right) - (left < right)
...will return 1 if the left argument is greater, 0 if they are equal or -1 if the right argument is greater.
The code could be improved. First, there is absolutely no error checking. That's not sensible in production code. Second, you could examine the input file to get the file size instead of guessing that it's less than some MAXxxx value. One way to do that is to use ftell. (Follow the link for a file size example.) Then, use that value to allocate a single buffer, just big enough to qsort the data.
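A minimal sketch of the ftell approach is shown here. The helper name file_size is an assumption, and the sketch assumes a stream opened in binary mode on a platform where seeking to SEEK_END is supported for such streams (true on POSIX systems, though the C standard does not strictly guarantee it).

```c
#include <stdio.h>

/*
 * Return the size in bytes of an open binary stream, or -1 on error.
 * The stream is left positioned at the start of the file.
 */
long file_size(FILE *fp)
{
    long size;

    if (fseek(fp, 0L, SEEK_END) != 0)
        return -1;
    size = ftell(fp);             /* offset of the end equals the size in bytes */
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return -1;
    return size;
}
```

In sort_file, the record count would then be file_size(ifile) / sizeof(record_type), and that count can replace MAXREC when sizing the malloc and the fread.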
If there is not enough room (if the malloc returns NULL) then you can fall back on sorting chunks (with qsort, as in the snippet) that do fit into memory, writing them to separate temporary files, and then merging them into a single output file. That's more complicated, and rarely done since there are sort/merge utility programs designed specifically for sorting large files.
I have a simulation program written in c and I need to create random numbers and write them to a txt file. Program only stops
- when a random number already generated is generated again or
- 1 billion random number are generated (no repetition)
My problem is that I could not search the generated long int random number in the txt file!
Text file format is:
9875
764
19827
2332
...
Any help is appreciated..
FILE * out;
int checkNumber(long int num){
char line[512];
long int number;
int result=0;
if((out = fopen("out.txt","r"))==NULL){
result= 1;
}
char buf[10];
itoa(num, buf, 10);
while(fgets(line, 512, out) != NULL)
{
if((strstr(line,buf)) != NULL){
result = 0;
}
}
if(out) {
fclose(out);
}
return result;
}
int main(){
int seed;
long int nRNs=0;
long int numberGenerated;
out = fopen ("out.txt","w");
nRNs=0;
seed = 12345;
srand (seed);
fprintf(out,"%d\n",numberGenerated);
while( nRNs != 1000000000 )
{
numberGenerated = rand();
nRNs++;
if(checkNumber(numberGenerated)==0){
fclose(out); break; system("pause");
}
else{
fprintf(out,"%d\n",numberGenerated);
}
}
fclose(out);
}
If the text file only contains randomly generated numbers separated by spaces, then you need the strtok() function (look up its usage) to split them, and you can put them into a binary tree structure as mentioned by @jacekmigacz. In any case, you will have to search the whole file at least once. Then use ftell() to get the location you have searched up to in the file. When another number is generated, you can use fseek() to get to the latest number. Remember to read the data line by line with fgets().
Take care of the memory requirements and use malloc() judiciously
Try with tree (data structure).
Searching linearly through the text file every time is gonna take forever with so many numbers. You could hold every number generated so far sorted in a data structure so that you can do a binary search for a duplicate. This is going to need a lot of RAM though. For 1 billion integers that's already 4GB on a system with 32-bit integers, and you'll need several more for the data structure overhead. My estimate is around 16GB in the worst case scenario (where you actually get to 1 billion unique integers.)
If you don't have a memory monster machine, you should instead write the data structure to a binary file and do the binary search there. Though that's still gonna be quite slow.
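This may work, or you can approach it like this (slow, but it will work):
#include <stdio.h>
#include <stdlib.h>
/* Returns 1 if new_rand is already in the file, 0 if not (appending it), -1 on error. */
int check_and_append(int new_rand)
{
    static int counter = 0;          /* how many numbers have been written so far */
    FILE *fptr = fopen("txt", "a+");
    int c, j = 0;
    char buf[16];
    if (fptr == NULL)
        return -1;
    rewind(fptr);                    /* make sure reading starts at the beginning */
    while ((c = getc(fptr)) != EOF)
    {
        if (c != ' ' && c != '\n')
        {
            if (j < (int) sizeof buf - 1)
                buf[j++] = (char) c; /* collect the digits of one number */
            continue;
        }
        if (j == 0)
            continue;                /* skip repeated separators */
        buf[j] = '\0';
        j = 0;
        if (atoi(buf) == new_rand)
        {
            fclose(fptr);
            return 1;
        }
    }
    if (counter < 1000000)
    {
        fprintf(fptr, "%d\n", new_rand); /* written as text so it can be read back */
        counter++;
    }
    fclose(fptr);
    return 0;
}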
Don't open and scan your file to checkNumber(). You'll be waiting forever.
Instead, keep your generated numbers in memory using a bit set data structure and refer to that.
Your bit set will need to be large enough to indicate every 32-bit integer, so it'll consume 2^32 / 8 bytes (or 512MiB) of memory. This may seem like a lot but it's much smaller than 32-bit * 1,000,000,000 (4GB). Also, both checking and updating will be done in constant time.
Edit: The wikipedia link doesn't do much to explain how to code one, so here's a rough sample: (There're faster ways of writing this, e.g.: using bit shifts instead of division, but this should be easier to understand.)
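int checkNumberOrUpdate(unsigned char *bitSet, unsigned long num){
unsigned char b = 1u << (num % 8); /* mask for this number's bit within its byte */
unsigned long w = num / 8; /* index of the byte holding that bit; a char would overflow */
if (bitSet[w] & b) { /* bit already set: the number was seen before */
return 1;
}
bitSet[w] |= b;
return 0;
}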
Note, bitSet needs to be calloc()d to the right size from your main function.
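To show how the allocation and the check could fit together, here is a hedged sketch. The names bitset_alloc, bitset_test_and_set and count_until_repeat are invented for this example, and it assumes the numbers come from rand(), so the bit set only needs to cover 0..RAND_MAX rather than every 32-bit value.

```c
#include <limits.h>
#include <stdlib.h>

/* Allocate a zeroed bit set able to hold the values 0..max_value. */
unsigned char *bitset_alloc(unsigned long max_value)
{
    return calloc(max_value / CHAR_BIT + 1, 1);
}

/* Return 1 if num was already marked; otherwise mark it and return 0. */
int bitset_test_and_set(unsigned char *bits, unsigned long num)
{
    unsigned char mask = (unsigned char) (1u << (num % CHAR_BIT));
    unsigned long byte = num / CHAR_BIT;

    if (bits[byte] & mask)
        return 1;
    bits[byte] |= mask;
    return 0;
}

/*
 * Draw values from rand() until one repeats or limit values have been drawn.
 * Returns the number of distinct values seen, or -1 if allocation fails.
 */
long count_until_repeat(long limit)
{
    unsigned char *bits = bitset_alloc((unsigned long) RAND_MAX);
    long n = 0;

    if (bits == NULL)
        return -1;
    while (n < limit && !bitset_test_and_set(bits, (unsigned long) rand()))
        n++;
    free(bits);
    return n;
}
```

Writing each new number to the text file can happen inside the loop; the file then never has to be read back during the run. Where RAND_MAX is 2^31 - 1 the bit set takes 256 MiB, and it is far smaller on platforms where RAND_MAX is 32767.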
I'm a beginner at C programming. I'm trying to learn how to code a spell checker that looks through all the words in a dictionary file, compares them with an article, and prints all the words that do not exist in the dictionary file to the console. Since I'm studying malloc in class, I've lowercased every word, removed all the punctuation in the article, and copied the result into malloc'd memory. I don't know what the next step should be; would someone give me a hint? Thanks
MAIN.C
#include <stdio.h>
#include <stdlib.h>
char dictionary[1000000];
char article[100000];
void spellCheck(char[], char[]);
int main(void) {
FILE* dict_file;
FILE* article_file;
int bytes_read;
char* p;
dict_file = fopen("american-english.txt", "r");
if (dict_file == 0) {
printf("unable to open dictionary file \"american-english.txt\"\n");
return -1;
}
article_file = fopen("article.txt", "r");
if (article_file == 0) {
printf("unable to open file \"article.txt\"\n");
return -1;
}
/* read dictionary */
p = dictionary;
p = fgets(p, 100, dict_file);
while (p != 0) {
while (*p != '\0') {
p += 1;
}
p = fgets(p, 100, dict_file);
}
/* read article */
p = article;
bytes_read = fread(p, 1, 1000, article_file);
p += bytes_read;
while (bytes_read != 0) {
bytes_read = fread(p, 1, 1000, article_file);
p += bytes_read;
}
*p = 0;
spellCheck(article, dictionary);
}
PROJECT.C
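#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void spellCheck(char article[], char dictionary[]) {
    size_t len = strlen(article) + 1;
    size_t i;
    char* tempArticle;
    tempArticle = malloc(len);
    if (tempArticle == NULL) {
        printf("spellcheck: Memory allocation failed.\n");
        return;
    }
    /* lower-case a copy of the article, including the terminating '\0' */
    for (i = 0; i < len; i++)
        tempArticle[i] = tolower((unsigned char)article[i]);
    /* blank out ASCII 33..64 (punctuation and digits) in the copy */
    for (i = 0; tempArticle[i] != '\0'; i++) {
        if (tempArticle[i] >= 33 && tempArticle[i] <= 64)
            tempArticle[i] = ' ';
    }
    printf("%s", tempArticle);
    free(tempArticle);
}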
How you organize your data structures will be important.
You may want to not only put your dictionary into a binary tree, as Zareth mentioned, but do the same with the article, so you can remove all duplicate words and have them sorted.
This way, when you search through the dictionary, you can stop as soon as you pass the point where your word would appear, since the dictionary is sorted.
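A minimal sketch of such a tree, assuming plain strcmp ordering. The names struct node, tree_insert and tree_contains are invented for this example. Inserting a word that is already present does nothing, which is what removes duplicates from the article.

```c
#include <stdlib.h>
#include <string.h>

struct node {
    char *word;
    struct node *left, *right;
};

/* Insert word into the tree rooted at *root. Returns 1 if inserted,
   0 if the word was already present, -1 on allocation failure. */
int tree_insert(struct node **root, const char *word)
{
    while (*root != NULL) {
        int cmp = strcmp(word, (*root)->word);
        if (cmp == 0)
            return 0;                       /* duplicate: nothing to do */
        root = (cmp < 0) ? &(*root)->left : &(*root)->right;
    }
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->word = malloc(strlen(word) + 1);
    if (n->word == NULL) {
        free(n);
        return -1;
    }
    strcpy(n->word, word);
    n->left = n->right = NULL;
    *root = n;
    return 1;
}

/* Return 1 if word is in the tree, 0 otherwise. */
int tree_contains(const struct node *root, const char *word)
{
    while (root != NULL) {
        int cmp = strcmp(word, root->word);
        if (cmp == 0)
            return 1;
        root = (cmp < 0) ? root->left : root->right;
    }
    return 0;
}
```

One caveat: this tree is not balanced, so feeding it an already sorted dictionary produces a long chain and lookups become linear. Inserting the words in shuffled order, or using a sorted array with binary search, avoids that.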
Congratulations, you have loaded the data into memory and you did everything right with checking the return values of the fopen calls. Now you need to do more things with your dictionary data:
1. Create an array of char * pointers, one pointing to each word.
   char * words[100000]; /* make sure you have enough space. */
2. For each word in your dictionary, make an entry in words. There are various ways to do this, for example you could use strndup to copy each word from dictionary after finding its length using isspace or strcspn.
3. Sort words (see qsort).
4. Read the article, word by word, using the same method as in step 2.
5. Search the dictionary (see bsearch) for the word.
6. Put the misspelled words into another array similar to words.
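The steps above could be sketched roughly as follows. Note one swap: instead of copying each word with strndup, this version splits the buffer in place by overwriting the whitespace after each word with '\0'. The function names split_words, sort_words and is_known are placeholders, not part of any library.

```c
#include <stdlib.h>
#include <string.h>

/* Comparison callback for qsort/bsearch over an array of char pointers. */
static int cmp_words(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}

/* Split buf in place on whitespace, storing up to max word pointers
   in words. Returns the number of words found. */
size_t split_words(char *buf, char **words, size_t max)
{
    size_t n = 0;
    char *p = buf;

    while (n < max) {
        p += strspn(p, " \t\r\n");      /* skip leading whitespace */
        if (*p == '\0')
            break;
        words[n++] = p;
        p += strcspn(p, " \t\r\n");     /* find the end of the word */
        if (*p == '\0')
            break;
        *p++ = '\0';                    /* terminate the word */
    }
    return n;
}

/* Sort the pointer array so that bsearch can be used on it. */
void sort_words(char **words, size_t n)
{
    qsort(words, n, sizeof *words, cmp_words);
}

/* Return 1 if word is in the sorted array, 0 otherwise. */
int is_known(const char *word, char **words, size_t n)
{
    return bsearch(&word, words, n, sizeof *words, cmp_words) != NULL;
}
```

The same split_words call can be run on the article buffer; any article word for which is_known() returns 0 goes into the array of misspelled words.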
If you want to get fancy, you might want to look into using stat to get the size of your files and allocating the memory for dictionary and article using malloc instead of using "magic numbers" or "very big numbers". For industrial strength C, you definitely need to do that.
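A hedged sketch of that idea, assuming a POSIX system (stat comes from sys/stat.h and is not part of standard C). The function name read_whole_file is made up for this example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Read an entire file into a freshly malloc'd, NUL-terminated buffer.
   Returns NULL on any failure; the caller frees the result. */
char *read_whole_file(const char *path, size_t *len_out)
{
    struct stat st;
    if (stat(path, &st) != 0 || st.st_size < 0)
        return NULL;

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    size_t size = (size_t)st.st_size;
    char *buf = malloc(size + 1);           /* +1 for the terminator */
    if (buf != NULL) {
        size_t got = fread(buf, 1, size, fp);
        buf[got] = '\0';
        if (len_out != NULL)
            *len_out = got;
    }
    fclose(fp);
    return buf;
}
```

With this in place, the fixed-size dictionary and article arrays could be replaced by two calls such as read_whole_file("american-english.txt", &dict_len), each paired with a free() at the end.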
The next step for your code would be to compare each article word to every word in the dictionary. The comparison is easily performed with strcmp, but the way you store the dictionary will force you to mess around with pointers to find the start of each new word in the dictionary.
Without any major changes you could do the comparison something like this, but it will require that you somehow determine when you've compared against all words in the dictionary, for example by counting how many words there are in the dictionary when you read it in from the file.
char* dictionary_word = dictionary;
int not_found = 1;
int i = 0;
for (; i < dictionary_word_count; ++i) {
if ((not_found = strcmp(tempArticle, dictionary_word)) == 0) {
break; /* Word found, we're done */
}
/* Add code to move dictionary_word to the next word here */
}
The problem with your current program is moving dictionary_word to the next word in a good way. It's possible to do so simply by advancing the pointer one character at a time and checking whether you've found the next word. I would instead recommend creating another array of char pointers, pointing them at the beginning of each word, and assigning them as you read the words from the dictionary file. That would allow you to do something like
dictionary_word = dictionary_word_pointers[i]; at the start of the for loop to get it to point to the correct word, instead of using a while loop to find the start of the next word. It would also have the added benefit of being easy to sort.
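Here is one way that might look, as a sketch only. It assumes one word per line and a buffer size that fits in an int (true for the 1000000-byte dictionary array). load_words and find_word are hypothetical names. Unlike the original read loop, it strips the newline that fgets leaves behind and keeps a '\0' between words so each pointer refers to a proper string.

```c
#include <stdio.h>
#include <string.h>

/* Read one word per line from fp into buf, recording the start of each
   word in ptrs. Words are stored back to back, each NUL-terminated.
   Returns the number of words read. */
size_t load_words(FILE *fp, char *buf, size_t bufsize,
                  char **ptrs, size_t max_words)
{
    size_t used = 0, count = 0;

    while (count < max_words && bufsize - used > 1 &&
           fgets(buf + used, (int)(bufsize - used), fp) != NULL) {
        char *word = buf + used;
        word[strcspn(word, "\r\n")] = '\0';  /* drop the line ending */
        if (*word == '\0')
            continue;                         /* skip blank lines */
        ptrs[count++] = word;
        used += strlen(word) + 1;             /* keep the terminator */
    }
    return count;
}

/* Linear search over the pointer array; returns 1 if word is present. */
int find_word(const char *word, char **ptrs, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        if (strcmp(word, ptrs[i]) == 0)
            return 1;
    return 0;
}
```

The value returned by load_words plays the role of dictionary_word_count, and ptrs is the dictionary_word_pointers array.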
You can sort the dictionary beforehand and use binary search to speed up the dictionary lookups if the dictionary is large and searching through it using linear search is too slow.
Is the 'dictionary' organized with one word per line? You could sensibly use 'strlen()' instead of the loop with 'p += 1'. Presumably the dictionary is also sorted?
Once you have the dictionary in memory, you don't need to read the whole of the article into memory. You could read one word at a time with 'fscanf()', then eliminate any punctuation so "t'other" appears as words "t" and "other" and "doesn't" appears as "doesn" and "t" - if you like. Or you could decide that isn't helpful. On the other hand, you probably do want to remove characters like question marks and double quotes.
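A small sketch of reading the article one word at a time. It deliberately uses fgetc rather than fscanf so that the splitting and punctuation handling happen in a single pass. It takes the "doesn" and "t" option described above, and it also lower-cases, which is an assumption carried over from the question. next_word is an invented name.

```c
#include <ctype.h>
#include <stdio.h>

/* Read the next word from fp into out (capacity cap, at least 1):
   letters only, lower-cased, with any other character treated as a
   separator. Letters beyond the capacity are dropped.
   Returns 1 if a word was stored, 0 at end of input. */
int next_word(FILE *fp, char *out, size_t cap)
{
    size_t n = 0;
    int c;

    while ((c = fgetc(fp)) != EOF) {
        if (isalpha(c)) {
            if (n + 1 < cap)
                out[n++] = (char)tolower(c);
        } else if (n > 0) {
            break;                  /* separator after a word: done */
        }
    }
    out[n] = '\0';
    return n > 0;
}
```

Calling next_word() in a loop until it returns 0 yields each article word in turn, ready to be looked up in the dictionary without holding the whole article in memory.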
Does your dictionary provide all variants on a word, or do you need to get involved in stemming? "Antidisestablishmentarianism" can be stemmed into "anti", "dis", "establish", "ment", "arian", "ism", I think, as an example.
You also need to consider whether it is correct to lower-case everything. You might decide that "IBM" is OK and "ibm" is not, for example; likewise with "ICBM" and "icbm" (and both "Ibm" and "Icbm" are bad under any reasonable definition of 'proper spelling').
You should be exploiting the fact that your dictionary is sorted to reduce the search time using a binary search or some similar mechanism.