File Descriptors and System Calls - C

I am merging k sorted streams using the read and write system calls.
After reading the first integer from each file and sorting them, the file that held the smallest element should be read from again.
I am not sure how to do this. I thought I could use a structure like:
struct filePointer {
int ptr;
int num;
}fptr[5];
Can someone tell me how to do this?
Thanks

Although reading integers one by one is not an efficient way of doing this, I will try to sketch the solution that you are looking for. However, this is not a real implementation, just the idea.
Your structure should be like this:
struct filePointer {
FILE * fp;
int num;
} fptr[k]; /* I assume k is constant, known at compile time */
You need a priority queue ( http://en.wikipedia.org/wiki/Priority_queue ), with priorities determined according to num.
First read the first number from each file and insert them into the priority queue (pq).
Then, while the pq is not empty, pop the first element, which holds the smallest integer in the pq.
Write the integer that element holds to the output file.
Using its file pointer (fp), try to read a new integer from that input file.
If EOF (end of file), do nothing;
otherwise insert a new element into the pq with num set to the value just read.
When the loop is finished, close all files and you will have a new file that contains all the elements from the input files and it will be sorted.
I hope this helps.
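The steps above can be sketched in C. This is just an illustration, not a real implementation: it assumes small k, uses stdio instead of raw read()/write() system calls, and replaces the priority queue with a linear scan for the minimum (fine for k = 5, but a heap scales better):

```c
#include <stdio.h>

struct filePointer {
    FILE *fp;
    int num;
    int alive; /* 0 once this stream hits EOF */
};

/* Merge k sorted streams of whitespace-separated integers into one file.
   Error checking is omitted for brevity. */
void merge_streams(const char **names, int k, const char *outname)
{
    struct filePointer fptr[16]; /* assume k <= 16 for this sketch */
    FILE *out = fopen(outname, "w");
    int i;

    /* prime each stream with its first integer */
    for (i = 0; i < k; i++) {
        fptr[i].fp = fopen(names[i], "r");
        fptr[i].alive = (fscanf(fptr[i].fp, "%d", &fptr[i].num) == 1);
    }

    for (;;) {
        int min = -1;
        /* linear scan stands in for the priority-queue pop */
        for (i = 0; i < k; i++)
            if (fptr[i].alive && (min < 0 || fptr[i].num < fptr[min].num))
                min = i;
        if (min < 0)
            break; /* every stream is exhausted */
        fprintf(out, "%d\n", fptr[min].num);
        /* read the next integer from the file that held the smallest one */
        fptr[min].alive = (fscanf(fptr[min].fp, "%d", &fptr[min].num) == 1);
    }

    for (i = 0; i < k; i++)
        fclose(fptr[i].fp);
    fclose(out);
}
```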

Related

Efficient way of writing large variable number of integers from array to text file

I have a program that results in an integer array of variable size and I need to write the elements to a text file. The elements need to be on the same line in the file.
This is a minimal example of what I am currently doing. I'm using the approach in this post https://stackoverflow.com/a/30234430/10163981
FILE *file = fopen("file.txt","w");
int nLines = 1000;
char breakstr[]="\n";
for(; ix<N; ix++){
char s[nLines*13];
int index = 0;
for(int jx = 0 ; jx<nLines; jx++){
index += sprintf(&s[index],"%03i %03i %03i ", array[ix][jx], array[ix][jx], array[ix][jx]);
// I need jx:th element in repeats of three, and may need to modify it externally for printing
}
fwrite(s,sizeof(char),strlen(s), file);
fwrite(breakstr,sizeof(char),strlen(breakstr), file);
}
fclose(file);
I am formatting the array contents as a string and using fwrite, as this method has been given to me as a requirement. My problem is that this implementation is way too slow. I have also tried using shorter strings and writing on each iteration, but this is even slower. There is not much I can do with regards to the outer ix loop, as the initial value of ix is variable. I included it for the sake of completeness.
nLines is expected to reach as high as 10000 at most.
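Not part of the original thread, but as a sketch of one likely speed-up: keep the running index from sprintf and pass it to fwrite() instead of re-scanning the buffer with strlen(), and fold the newline into the same buffer so each row is a single write. The helper name format_row is my own:

```c
#include <stdio.h>

/* Sketch: format one row of n values into buf (each value repeated three
   times, as in the question) and return the number of bytes written, so the
   caller can pass that length straight to fwrite() instead of calling
   strlen(), which re-scans the whole buffer. buf must hold n*12 + 1 bytes
   (for values up to three digits). */
int format_row(char *buf, const int *row, int n)
{
    int index = 0, jx;
    for (jx = 0; jx < n; jx++)
        index += sprintf(&buf[index], "%03i %03i %03i ",
                         row[jx], row[jx], row[jx]);
    buf[index++] = '\n'; /* fold the line break into the same write */
    return index;
}
```

Each row then becomes one call: `fwrite(buf, 1, len, file);`. Enlarging the stdio buffer with setvbuf() before the loop may also be worth trying.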

Using fseek() To Update a Binary File

I have looked and looked online for help on using fseek() efficiently, but no matter what I do, I am still not receiving the right results. Basically I am reading from a file of animals that have an "age" parameter. If the age is -1, then upon adding to this binary file, I should use fseek() to find the first -1 in the file and overwrite that entire record with new information that the user inputs. I have an array that traverses and finds all of the holes at the beginning of the file, and it is working correctly. My issue is that each new animal should go into the next empty slot with age -1, but when I go to refresh my file, all of the animals are appended to the end, even though their ids are the ids of the once-empty slots. Here is my code:
void addingAnimal(FILE *file, struct animal ani, int * availableHoles) {
int i;
int offset = ((sizeof(int) + sizeof(ani)) * ani.id -1);
if (availableHoles[0] != 0) {
fseek(file, offset, SEEK_SET);
ani.id = availableHoles[0];
fwrite(&ani, sizeof(ani), 1, file);
for (i = 0; i < sizeof(availableHoles) -1; i++) {
availableHoles[i] = availableHoles[i+1];
}
}
}
The very beginning of the file has an integer that tells us the number of holes within the file, so the offset accounts for that, and when I print the file it prints everything correctly. Then I check whether there are holes in the helper array I created; if there are, I want the animal's id to be the hole's id, and I am trying to seek to the record of the first -1 age to put my updated animal's information there, and then write it to the file. The last for-loop just shifts up the available holes. Oh, and as for opening the file, I am using r+b for reading and writing. Thank you in advance!
You cannot use sizeof(availableHoles) to iterate on the array. You are in a function that receives availableHoles as a pointer, its size is irrelevant to the number of holes.
Pass the number of elements of this array as a separate argument.
Using FILE streams in read/write mode is tricky: do you call fseek() systematically between read accesses and write accesses?
Post the calling code, the function addingAnimal alone is not enough to investigate your problem.
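To illustrate the points above, here is a sketch of the function with the hole count passed explicitly. The record layout and the name numHoles are my assumptions, since the full calling code was not posted; note also that the hole's id is assigned first and the offset is computed from it, unlike in the original code, which seeks before reassigning ani.id.

```c
#include <stdio.h>

struct animal {
    int id;
    int age;
    char name[32]; /* assumed layout; the real struct was not posted */
};

/* Sketch: numHoles is passed in explicitly, because sizeof(availableHoles)
   inside this function would only measure the pointer, not the array. */
void addingAnimal(FILE *file, struct animal ani, int *availableHoles,
                  int numHoles)
{
    if (numHoles > 0) {
        ani.id = availableHoles[0];
        /* skip the leading hole counter, then seek to the empty record;
           records are assumed to be stored back-to-back, 1-based by id */
        long offset = sizeof(int) + (long)sizeof(ani) * (ani.id - 1);
        fseek(file, offset, SEEK_SET);
        fwrite(&ani, sizeof(ani), 1, file);
        /* shift the remaining holes up */
        for (int i = 0; i < numHoles - 1; i++)
            availableHoles[i] = availableHoles[i + 1];
    }
}
```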

Why do we need unsigned char for Huffman tree code

I am trying to create a Huffman tree. The question I read is very strange to me; it is as follows:
Given the following data structure:
struct huffman
{
unsigned char sym; /* symbol */
struct huffman *left, *right; /* left and right subtrees */
};
write a program that takes the name of a binary file as sole argument,
builds the Huffman tree of that file assuming that atoms (elementary
symbols) are 8-bit unsigned characters, and prints the tree as well as
the dictionary.
allocations must be done using nothing else than
malloc(), and sorting can be done using qsort().
Here is the thing that confuses me: to write a program that creates a Huffman tree, we just need to do the following:
We need to take a frequency array (That could be Farray[]={.......})
Sort it and merge the two smallest nodes to form a tree, repeating until only one final node (the root) is left.
Now the question: why and where do we need that unsigned char data? (What kind of unsigned char data does this question want? I think the frequencies alone are enough to display a Huffman tree.)
If you purely want to display the shape of the tree, then yes, you just need to build it. However, for it to be of any use whatsoever you need to know what original symbol each node represents.
Imagine your input symbols are [ABCD]. An imaginary Huffman tree/dictionary might look like this:
( )
/ \ A = 1
( ) (A) B = 00
/ \ C = 010
(B) ( ) D = 011
/ \
(C) (D)
If you don't store sym, it looks like this:
( )
/ \ A = ?
( ) ( ) B = ?
/ \ C = ?
( ) ( ) D = ?
/ \
( ) ( )
Not very useful, that, is it?
Edit 2: The missing step in the plan is step 0: build the frequency array from the file (somehow I missed that you don't need to actually encode the file too). This isn't part of the actual Huffman algorithm itself and I couldn't find a decent example to link to, so here's a rough idea:
FILE *input = fopen("inputfile", "rb");
int freq[256] = {0};
int c;
while ((c = fgetc(input)) != EOF)
freq[c]++;
fclose(input);
/* do Huffman algorithm */
...
Now, that still needs improving since it neither uses malloc() nor takes a filename as an argument, but it's not my homework ;)
It's been a while since I did this, but I think the generated "dictionary" is required to encode data, while the "tree" is used to decode it. Of course, you can always build one from the other.
While decoding, you traverse the tree (left/right, according to successive input bits), and when you hit a terminal node (null pointer) then the 'sym' in the node is the output value.
Usually data compression is divided into 2 big steps; given a stream of data:
evaluate the probability that a given symbol will appear in the stream; in other words, you evaluate how frequently each symbol appears in the dataset
once you have counted the occurrences and created your table of symbols with their associated probabilities, you need to encode the symbols according to their probability. To achieve this magic you create a dictionary where the original symbol is often replaced with another symbol that is much smaller in size, especially for symbols that are frequently used in the dataset; the dictionary keeps track of these substitutions for both the encoding and decoding phases. Huffman gives you an algorithm to automate this process and get a fairly good result.
In practice it's a little bit more complicated than this, because trees are involved, but the main purpose is always to build the dictionary.
There is a complete tutorial here.
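To make the role of sym concrete, here is a rough sketch of the build-and-print steps using only malloc() and qsort(), as the assignment requires. The freq field, and the names build_tree and print_codes, are my own additions, not part of the given structure:

```c
#include <stdio.h>
#include <stdlib.h>

struct huffman {
    unsigned char sym;              /* symbol: needed to label the leaves */
    long freq;                      /* weight used while building (added) */
    struct huffman *left, *right;
};

static int cmp_freq(const void *a, const void *b)
{
    const struct huffman *x = *(struct huffman * const *)a;
    const struct huffman *y = *(struct huffman * const *)b;
    return (x->freq > y->freq) - (x->freq < y->freq);
}

/* Build the tree from a 256-entry frequency array; returns the root. */
struct huffman *build_tree(const long freq[256])
{
    struct huffman *nodes[256];
    int n = 0;

    for (int i = 0; i < 256; i++) {
        if (freq[i] == 0) continue;
        struct huffman *leaf = malloc(sizeof *leaf);
        leaf->sym = (unsigned char)i;  /* this is where sym is needed */
        leaf->freq = freq[i];
        leaf->left = leaf->right = NULL;
        nodes[n++] = leaf;
    }
    while (n > 1) {
        qsort(nodes, n, sizeof nodes[0], cmp_freq);
        struct huffman *parent = malloc(sizeof *parent);
        parent->left = nodes[0];       /* merge the two smallest weights */
        parent->right = nodes[1];
        parent->freq = nodes[0]->freq + nodes[1]->freq;
        parent->sym = 0;               /* internal node: sym unused */
        nodes[0] = parent;
        nodes[1] = nodes[--n];
    }
    return n == 1 ? nodes[0] : NULL;
}

/* Print the dictionary: without sym in the leaves, there would be nothing
   meaningful to print here. prefix must be large enough for the deepest code. */
void print_codes(const struct huffman *t, char *prefix, int depth)
{
    if (!t->left && !t->right) {
        prefix[depth] = '\0';
        printf("%c = %s\n", t->sym, prefix);
        return;
    }
    prefix[depth] = '0'; print_codes(t->left, prefix, depth + 1);
    prefix[depth] = '1'; print_codes(t->right, prefix, depth + 1);
}
```

Repeatedly calling qsort() on the shrinking array is wasteful compared to a real priority queue, but it satisfies the "sorting can be done using qsort()" constraint.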

Creating a File of Random Size [1...500] KB

Foreword: I have simplified the problem into its key functionalities, so if it sounds weird it is because this is a small aspect of the whole program.
Problem:
I want to create something like 100 text files: I'll loop and use my loop counter to name the files.
Then, I want to populate each file with random strings. I use my String struct, defined below, for this. I want to fill each file with anywhere from 1KB up to 500KB.
struct String // And yes I am using my own String library.
{
char *c;
int length;
int maxLength;
};
Let's assume I have the file opened (probably at the moment I create it, so it is empty). Now I would do something like this:
int range = Random.Range(0,500);
I would get a number that would predetermine the file size. So if range == 100 then the file would be populated with 100KB of "data".
I would first have my string created.
// Maybe making this 100 chars would help?
String *s1 = makeString("abcdefghijklmnopqrstuvwxyz");
How would I figure out how many times I have to write my String s1 into the file to make it the size of range? Preferably before writing to the file; I wouldn't want to write first, then check, then write again.
And how would I get a random integer value in C? I'm used to Random.Range in C#.
To keep it simple, it would be best to make your string's size a divisor of 1KB (1024 bytes), so you don't have to deal with fractions.
After that you can do as @naitoon mentioned above: (range*1024)/s1->length writes, if each character of your string is 1 byte long.
As for the random integer, you can call the standard library's rand(), which returns an integer between 0 and RAND_MAX (at least 32767).
Also, to keep the random number within your range, you can take the remainder of the return value:
range = rand() % 500 + 1; /* 1 to 500; rand() % 500 alone gives 0 to 499 */
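Putting the two answers together, a minimal sketch (the helper name fill_file and the 32-byte pattern are my own choices, not from the thread):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: fill one file with `kb` kilobytes built from repeated copies of s.
   This comes out exact when strlen(s) divides 1024, as suggested above. */
void fill_file(const char *name, int kb, const char *s)
{
    FILE *f = fopen(name, "w");
    long len = (long)strlen(s);
    long repeats = (kb * 1024L) / len; /* the (range*1024)/length idea */
    for (long i = 0; i < repeats; i++)
        fwrite(s, 1, len, f);
    fclose(f);
}
```

In the 100-file loop you would then seed once with srand(time(NULL)), build each name with sprintf(name, "file%d.txt", i), and call fill_file(name, rand() % 500 + 1, s1).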

Comparing strings in two files in C

I'm new to the C language, so I'll appreciate every bit of help :D
I need to compare the given words in the first file (" Albert\n Martin\n Bob ") with the words in the second file (" Albert\n Randy\n Martin\n Ohio ").
Whenever they're the same I need to put the word "Language" in the output file, and print every word that has no match in the second file.
Something like that:
Language
Language
Bob
needs to be in my third file.
I tried to come up with some ideas, but they don't work :p
Thanks for every answer in advance.
First, you need to open a stream to read the files.
If you need to do this in C, then you may use the strcmp function. It allows you to compare two strings.
For example:
int strcmp(const char *s1, const char *s2);
I'd open all three files to begin with (both input files and the output file). If you can't open all of them then you can't do anything useful (other than display an error message or something); and there's no point wasting CPU time only to find out that (for example) you can't open the output file later. This can also help to reduce race conditions (e.g. the second file changes while you're processing the first file).
Next, start processing the first file. Break it into words/tokens as you read it, and for each word/token calculate a hash value. Then use the hash value and the word/token itself to check if the new word/token is a duplicate of a previous (already known) word/token. If it's not a duplicate, allocate some memory and create a new entry for the word/token and insert the entry onto the linked list that corresponds to the hash.
Finally, process the second file. This is similar to how you processed the first file (break it into words/tokens, calculate the hash, use the hash to find out if the word/token is known), except if the word/token isn't known you write it to the output file, and if it is known you write " language" to the output file instead.
If you're not familiar with hash tables, they're fairly easy. For a simple method (not necessary the best method) of calculating the hash value for ASCII/text you could do something like:
hash = 0;
while(*src != 0) {
hash = hash ^ (hash << 5) ^ *src;
src++;
}
hash = hash % HASH_SIZE;
Then you have an array of linked lists, like "INDEX_ENTRY *index[HASH_SIZE]" that contains a pointer to the first entry for each linked list (or NULL if the linked list for the hash is empty).
To search, use the hash to find the first entry of the correct linked list then do "strcmp()" on each entry in the linked list. An example might look something like this:
INDEX_ENTRY *find_entry(uint32_t hash, char *new_word) {
INDEX_ENTRY *entry;
entry = index[hash];
while(entry != NULL) {
if(strcmp(new_word, entry->word) == 0) return entry;
entry = entry->next;
}
return NULL;
}
The idea of all this is to improve performance. For example, if both files have 1024 words then (without a hash table) you'd need to do "strcmp()" 1024*1024 times; but if you use a hash table with "#define HASH_SIZE 1024" you'll probably reduce that to about 2000 times (and end up with much faster code). Larger values of HASH_SIZE increase the amount of memory you use a little (and reduce the chance of different words having the same hash).
Don't forget to close your files when you're finished with them. Freeing the memory you used is a good idea if you do something else after this (but if you don't do anything after this then it's faster and easier to "exit()" and let the OS cleanup).
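Tying the pieces above together, a minimal sketch of the table itself. INDEX_ENTRY's exact layout, the helper add_word, and the renaming of index to index_table (to avoid clashing with the index() function some libcs declare) are my own choices; the hash is the one shown above:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HASH_SIZE 1024

typedef struct index_entry {
    struct index_entry *next;
    char *word;
} INDEX_ENTRY;

static INDEX_ENTRY *index_table[HASH_SIZE]; /* one linked list per hash */

/* the same simple hash as in the answer, wrapped in a function */
static unsigned hash_word(const char *src)
{
    unsigned hash = 0;
    while (*src != 0) {
        hash = hash ^ (hash << 5) ^ (unsigned char)*src;
        src++;
    }
    return hash % HASH_SIZE;
}

INDEX_ENTRY *find_entry(unsigned hash, const char *new_word)
{
    for (INDEX_ENTRY *e = index_table[hash]; e != NULL; e = e->next)
        if (strcmp(new_word, e->word) == 0)
            return e;
    return NULL;
}

/* Insert word if it is not already known; returns 1 if it was known. */
int add_word(const char *word)
{
    unsigned h = hash_word(word);
    if (find_entry(h, word) != NULL)
        return 1;
    INDEX_ENTRY *e = malloc(sizeof *e);
    e->word = malloc(strlen(word) + 1);
    strcpy(e->word, word);
    e->next = index_table[h];
    index_table[h] = e;
    return 0;
}
```

Processing the second file then reduces to: for each word, `add_word`-style lookup, and write either " Language" or the word itself to the output file.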
