What is an efficient way to map a file to array of structures? - c

I have a file in my system with 1024 rows,
student_db.txt
Name Subject-1 Subject-2 Subject-3
----- --------- --------- ---------
Alex 98 90 80
Bob 87 95 73
Mark 90 83 92
.... .. .. ..
.... .. .. ..
I have an array structures in my C code,
typedef struct
{
char name[10];
int sub1;
int sub2;
int sub3;
} student_db;
student_db stud_db[1024];
What is the efficient way to read this file and mapping to this array of structures ?
If the number entries is less then we can go for normal fgets in a while with strtok but here number of entries is 1024.
So please suggest some efficient way to do this task.

Check the size of the file, I think, it is max. 100 KByte. This is literally nothing, even a poorly written PHP script can read it in some millisec. There's no slow method to load such small quantity of data.
I assume, loading this file is only the first step, the real task will be to process this list (search, filter etc.). Instead of optimizing the loading speed, you should focus on the processing speed.
Premature optimization is evil. Make a working unoptimized code, see if you're satisfied with the result and speed. Probably you never should optimize it.

You can try to store data in a binary way. The way file is written now is character way. So everything is represented as a character (numbers, strings and other things).
If you store things binary, then it means you store numbers, strings and other stuff the way they are, so when you write integer to a file you write a number, when you open this in a text editor later, then you will see character, which corresponding number in eg. ASCII is the one you have written to a file.
You use standard functions to store things in a binary way:
fopen("test.bin","wb");
w stands for write and b stands for binary format. And function to write is:
fwrite(&someInt, sizeof(someInt), 1, &someFile);
where someInt is variable you want to write (function takes pointer),
sizeof(someInt) is sizeof element int,
1 stands for number of elements if first argument is an array, and someFile is a file, where you want to store your data.
This way size of the file can be reduced, so loading of the file also will be faster. It's also simplier to process sata in file

Related

logically Understanding a compression algorithm

this idea had been flowing in my head for 3 years and i am having problems to apply it
i wanted to create a compression algorithm that cuts the file size in half
e.g. 8 mb to 4 mb
and with some searching and experience in programming i understood the following.
let's take a .txt file with letters (a,b,c,d)
using the IO.File.ReadAllBytes function , it gives the following array of bytes : ( 97 | 98 | 99 | 100 ) , which according to this : https://en.wikipedia.org/wiki/ASCII#ASCII_control_code_chart is the decimal value of the letter.
what i thought about was : how to mathematically cut this 4-membered-array to only 2-membered-array by combining each 2 members into a single member but you can't simply mathematically combine two numbers and simply reverse them back as you have many possibilities,e.g.
80 | 90 : 90+80=170 but there is no way to know that 170 was the result of 80+90 not like 100+70 or 110+60.
and even if you could overcome that , you would be limited by the maximum value of bytes (255 bytes) in a single member of the array.
i understand that most of the compression algorithms use the binary compression and they were successful,but imagine cutting a file size in half , i would like to hear your ideas on this.
Best Regards.
It's impossible to make a compression algorithm that makes every file shorter. The proof is called the "counting argument", and it's easy:
There are 256^L possible files of length L.
Lets say there are N(L) possible files with length < L.
If you do the math, you find that 256^L = 255*N(L)+1
So. You obviously cannot compress every file of length L, because there just aren't enough shorter files to hold them uniquely. If you made a compressor that always shortened a file of length L, then MANY files would have to compress to the same shorter file, and of course you could only get one of them back on decompression.
In fact, there are more than 255 times as many files of length L as there are shorter files, so you can't even compress most files of length L. Only a small proportion can actually get shorter.
This is explained pretty well (again) in the comp.compression FAQ:
http://www.faqs.org/faqs/compression-faq/part1/section-8.html
EDIT: So maybe you're now wondering what this compression stuff is all about...
Well, the vast majority of those "all possible files of length L" are random garbage. Lossless data compression works by assigning shorter representations (the output files) to the files we actually use.
For example, Huffman encoding works character by character and uses fewer bits to write the most common characters. "e" occurs in text more often than "q", for example, so it might spend only 3 bits to write "e"s, but 7 bits to write "q"s. bytes that hardly ever occur, like character 131 may be written with 9 or 10 bits -- longer than the 8-bit bytes they came from. On average you can compress simple English text by almost half this way.
LZ and similar compressors (like PKZIP, etc) remember all the strings that occur in the file, and assign shorter encodings to strings that have already occurred, and longer encodings to strings that have not yet been seen. This works even better since it takes into account more information about the context of every character encoded. On average, it will take fewer bits to write "boy" than "boe", because "boy" occurs more often, even though "e" is more common than "y".
Since it's all about predicting the characteristics of the files you actually use, it's a bit of a black art, and different kinds of compressors work better or worse on different kinds of data -- that's why there are so many different algorithms.

How to save a 2-Dimensional array to a file in C?

I am an beginner C programmer and I am currently working on a project to implement viola jones object detection algorithm using C. I would like to know how I would be able to store data in a 2-Dimensional array to a file that can be easily ported and accessed by different program files(e.g. main.c, header_file.h etc.)
Thank you in advance.
There's not quite enough detail to be sure what you're looking for, but the basic structure of what you want to do is going to look something like this:
open file.csv for writing
for(iterate through one dimension of the array using i)
{
for(iterate through the other dimension of the array using j)
{
fprintf(yourfilehandle,"%d,",yourvalue[i][j]);
}
fprintf(yourfilehandle,"\n");
}
close your file
As has been suggested by others, this will leave you with a .CSV file, which is a pretty good choice, as it's easy to read in and parse, and you can open your file in Notepad or Excel and view it no problems.
This is assuming you really meant to do this with C file I/O, which is a perfectly valid way of doing things, some just feel it's a bit dated.
Note this leaves an extraneous comma at the end of the line. If that bugs you it's easy enough to do the pre and post conditions to only get commas where you want. Hint: it involves printing the comma before the entry inside the second for loop, reducing the number of entries you iterate over for the interior for loop, and printing out the first and last case of each row special, immediately before and after the inner for loop, respectively. Harder to explain that to do, probably.
Here is a reference for C-style file I/O, and here is a tutorial.
Without knowing anything about what type of data you're storing, I would say to store this as a matrix. You'll need to choose a delimiter to separate your elements (tab or space are common choices, aka 'tsv' and 'csv', respectively) and then something to mark the end of a row (new line is a good choice here).
So your saved file might look something like:
10 162 1 5
7 1 4 12
9 2 2 0
You can also define your format as having some metadata in the first line -- the number of rows and columns may be useful if you want to pre-allocate memory, along with other information like character encoding. Start simple and add as necessary!

Binary Search on Large Disk File in C - Problems

This question recurs frequently on StackOverflow, but I have read all the previous relevant answers, and have a slight twist on the question.
I have a 23Gb file containing 475 million lines of equal size, with each line consisting of a 40-character hash code followed by an identifier (an integer).
I have a stream of incoming hash codes - billions of them in total - and for each incoming hash code I need to locate it and print out corresponding identifier. This job, while large, only needs to be done once.
The file is too large for me to read into memory and so I have been trying to usemmap in the following way:
codes = (char *) mmap(0,statbuf.st_size,PROT_READ,MAP_SHARED,codefile,0);
Then I just do a binary search using address arithmetic based on the address in codes.
This seems to start working beautifully and produces a few million identifiers in a few seconds, using 100% of the cpu, but then after some, seemingly random, amount of time it slows down to a crawl. When I look at the process using ps, it has changed from status "R" using 100% of the cpu, to status "D" (diskbound) using 1% of the cpu.
This is not repeatable - I can start the process off again on the same data, and it might run for 5 seconds or 10 seconds before the "slow to crawl" happens. Once last night, I got nearly a minute out of it before this happened.
Everything is read only, I am not attempting any writes to the file, and I have stopped all other processes (that I control) on the machine. It is a modern Red Hat Enterprise Linux 64-bit machine.
Does anyone know why the process becomes disk-bound and how to stop it?
UPDATE:
Thanks to everyone for answering, and for your ideas; I had not previously tried all the various improvements before because I was wondering if I was somehow using mmap incorrectly. But the gist of the answers seemed to be that unless I could squeeze everything into memory, I would inevitable run into problems. So I squashed the size of the hash code to the size of the leading prefix that did not create any duplicates - the first 15 characters were enough. Then I pulled the resulting file into memory, and ran the incoming hash codes in batches of about 2 billion each.
The first thing to do is split the file.
Make one file with the hash-codes and another with the integer ids. Since the rows are the same then it will line up fine after the result is found. Also you can try an approach that puts every nth hash into another file and then stores the index.
For example, every 1000th hash key put into a new file with the index and then load that into memory. Then binary scan that instead. This will tell you the range of 1000 entries that need to be further scanned in the file. Yes that will do it fine! But probably much less than that. Like probably every 20th record or so will divide that file size down by 20 +- if I am thinking good.
In other words after scanning you only need to touch a few kilobytes of the file on disk.
Another option is to split the file and put it in memory on multiple machines. Then just binary scan each file. This will yield the absolute fastest possible search with zero disk access...
Have you considered hacking a PATRICIA trie algorithm up? It seems to me that if you can build a PATRICIA tree representation of your data file, which refers to the file for the hash and integer values, then you might be able to reduce each item to node pointers (2*64 bits?), bit test offsets (1 byte in this scenario) and file offsets (uint64_t, which might need to correspond to multiple fseek()s).
Does anyone know why the process becomes disk-bound and how to stop it?
Binary search requires a lot of seeking within the file. In the case where the whole file doesn't fit in memory, the page cache doesn't handle the big seeks very well, resulting in the behaviour you're seeing.
The best way to deal with this is to reduce/prevent the big seeks and make the page cache work for you.
Three ideas for you:
If you can sort the input stream, you can search the file in chunks, using something like the following algorithm:
code_block <- mmap the first N entries of the file, where N entries fit in memory
max_code <- code_block[N - 1]
while(input codes remain) {
input_code <- next input code
while(input_code > max_code) {
code_block <- mmap the next N entries of the file
max_code <- code_block[N - 1]
}
binary search for input code in code_block
}
If you can't sort the input stream, you could reduce your disk seeks by building an in-memory index of the data. Pass over the large file, and make a table that is:
record_hash, offset into file where this record starts
Don't store all records in this table - store only every Kth record. Pick a large K, but small enough that this fits in memory.
To search the large file for a given target hash, do a binary search in the in-memory table to find the biggest hash in the table that is smaller than the target hash. Say this is table[h]. Then, mmap the segment starting at table[h].offset and ending at table[h+1].offset, and do a final binary search. This will dramatically reduce the number of disk seeks.
If this isn't enough, you can have multiple layers of indexes:
record_hash, offset into index where the next index starts
Of course, you'll need to know ahead of time how many layers of index there are.
Lastly, if you have extra money available you can always buy more than 23 gb of RAM, and make this a memory bound problem again (I just looked at Dell's website - you pick up a new low-end workstation with 32 GB of RAM for just under $1,400 Australian dollars). Of course, it will take a while to read that much data in from disk, but once it's there, you'll be set.
Instead of using mmap, consider just using plain old lseek+read. You can define some helper functions to read a hash value or its corresponding integer:
void read_hash(int line, char *hashbuf) {
lseek64(fd, ((uint64_t)line) * line_len, SEEK_SET);
read(fd, hashbuf, 40);
}
int read_int(int line) {
lseek64(fd, ((uint64_t)line) * line_len + 40, SEEK_SET);
int ret;
read(fd, &ret, sizeof(int));
return ret;
}
then just do your binary search as usual. It might be a bit slower, but it won't start chewing up your virtual memory.
We don't know the back story. So it is hard to give you definitive advice. How much memory do you have? How sophisticated is your hard drive? Is this a learning project? Who's paying for your time? 32GB of ram doesn't seem so expensive compared to two days of work of person that makes $50/h. How fast does this need to run? How far outside the box are you willing to go? Does your solution need to use advanced OS concepts? Are you married to a program in C? How about making Postgres handle this?
Here's is a low risk alternative. This option isn't as intellectually appealing as the other suggestions but has the potential to give you significant gains. Separate the file into 3 chunks of 8GB or 6 chunks of 4GB (depending on the machines you have around, it needs to comfortably fit in memory). On each machine run the same software, but in memory and put an RPC stub around each. Write an RPC caller to each of your 3 or 6 workers to determine the integer associated with a given hash code.

How to search common passwords from two given files of size 20GB?

i have two files of size 20GB. i have to remove common passwords from either of one file.
i sorted the second file by calling sort command of UNIX. after this i splitted the sorted file into many files so that file could fit in RAM memory using split command. After splitting into n files i just used an structure array of size n to store first password of each splitted file and its corresponding file name.
then i applied binary search technique in that structure array for each key of first file to to first password stored in structure to get index of the corresponding file. and then i applied b search to that indexed splitted file.
i assumed 20 character as a max length of passwords
this program is not yet efficient.
Please help to make it efficient, if possible....
Please give me some advise to sort that 20GB file efficiently .....
in 64 bit stream with 8gb RAM and i3 quard processor.....
i just tested my program with two file of size 10MB. it took about 2.66 hours without using any optimization option. ....according to my program it will take about 7-8 hours to check each passwords of 20GB after splitting , sorting and binary searching.....
can i improve its time complexity? i mean can i make it to run more "faster"???
Check out external sorting. See http://www.umbrant.com/blog/2011/external_sorting.html which does have code at the end of the page (https://github.com/umbrant/extsort).
The idea behind external sorting is selecting and sorting equidistant samples from the file. Then partitioning the file at sampling points, sorting the partitions and merging the results.
example numbers = [1, 100, 2, 400, 60, 5, 0, 4]
example samples (distance 4) = 1, 60
chunks = {0,1,2,5,4} , {60, 100, 400}
Also, I don't think splitting the file is a good idea because you need to write 20GB to disk to split them. You might as well create the structure on the fly by seeking within the file.
For a previous SE question, "What algorithm to use to delete duplicates?" I described an algorithm for a probably-similar problem except with 50GB files instead of 20GB. The method is faster than sorting the big files in that problem.
Here is an adaptation of the method to your problem. Let's call the original two files A and B, and suppose A is larger than B. I don't understand from your problem description what is supposed to happen if or when a duplicate is detected, but in the following I assume you want to leave file A unchanged, and remove from B any items that also are in A. I also assume that entries within A are specified to be unique within A at the outset, and similarly for B. If that is not the case, the method needs more adapting and about twice as much I/O.
Suppose you can fit 1/k'th of file A into memory and still have room for the other required data structures. The whole file B can then be processed in k or fewer passes, as below, and this has a chance of being much faster than sorting either file, depending on line lengths and sort-algorithm constants. Sorting averages O(n ln n) and the process below is O(k n) worst case. For example, if lines average 10 characters and there are n = 2G lines, ln(n) ~ 21.4, likely to be about 4 times as bad as O(k n) if k=5. (Algorithm constants still can change the situation either way, but with a fast hash function the method has good constants.)
Process:
Let Q = B (ie rename or copy B to Q)
Allocate a few gigabytes for a work buffer W, and a gigabyte or so for a hash table H. Open input files A and Q, output file O, and temp file T. Go to step 2.
Fill work buffer W by reading from file A.
For each line L in W, hash L into H, such that H[hash[L]] indexes line L.
Read all of Q, using H to detect duplicates, writing non-duplicates to temp file T.
Close and delete Q, rename T to Q, open new temp file T.
If EOF(A), rename Q to B and quit, else go to step 2.
Note that after each pass (ie at start of step 6) none of the lines in Q are duplicates of what has been read from A so far. Thus, 1/k'th of the original file is processed per pass, and processing takes k passes. Also note that although processing will be I/O bound you can read and write several times faster with big buffers (eg 8MB) than line-by-line.
The algorithm as stated above does not include buffering details or how to deal with partial lines in big buffers.
Here is a simple performance example: Suppose A, B both are 20GB files, that each has about 2G passwords in it, and that duplicates are quite rare. Also suppose 8GB RAM is enough for work buffer W to be 4GB in size leaving enough room for hash table H to have say .6G 4-byte entries. Each pass (steps 2-5) reads 20% of A and reads and writes almost all of B, at each pass weeding out any password already seen in A. I/O is about 120GB read (1*A+5*B), 100GB written (5*B).
Here is a more involved performance example: Suppose about 1G randomly distributed passwords in B are duplicated in A, with all else as in previous example. Then I/O is about 100GB read and 70GB written (20+20+18+16+14+12 and 18+16+14+12+10, respectively).
Searching in external files is going to be painfully slow, even using binary search. You might speed it up by putting the data in an actual database designed for fast lookups. You could also sort both text files once and then do a single linear scan to filter out words. Something like the following pseudocode:
sort the files using any suitable sorting utility
open files A and B for reading
read wordA from A
read wordB from B
while (A not EOF and B not EOF)
{
if (wordA < wordB)
write wordA to output
read wordA from A
else if (wordA > wordB)
read wordB from B
else
/* match found, don't output wordA */
read wordA from A
}
while (A not EOF) /* output remaining words */
{
write wordA to output
read wordA from A
}
Like so:
Concatenate the two files.
Use sort to sort the total result.
Use uniq to remove duplicates from the sorted total.
If c++ it's an option for you, the ready to use STXXL should be able to handle your dataset.
Anyway, if you use external sort in c, as suggested by another answer, I think you should sort both files and then scan both sequentially. The scan should be fast, and the sort can be done in parallel.

Writing structure into a file in C

I am reading and writting a structure into a text file which is not readable. I have to write readable data into the file from the structure object.
Here is little more detail of my code:
I am having the code which reads and writes a list of itemname and code into a file (file.txt). The code uses linked list concept to read and write data.
The data are stored into a structure object and then writen into a file using fwrite.
The code works fine. But I need to write a readable data into the text file.
Now the file.txt looks like bellow,
㵅㡸䍏䥔䥆㘸䘠㵅㩃䠀\䵏㵈䑜㵅㡸䍏䥔䥆㘸䘠\㵅㩃䠀䵏㵈䑜㵅㡸䍏䥔䥆㘸䘠㵅㩃䠀䵏㵈\䑜㵅㡸䍏䥔䥆㘸䘠㵅㩃䠀䵏㵈䑜㵅㡸䍏䥔\䥆㘸䘠㵅㩃䠀䵏㵈
I am expecting the file should be like this,
pencil aaaa
Table bbbb
pen cccc
notebook nnnn
Here is the snippet:
struct Item
{
char itemname[255];
char dspidc[255];
struct Item *ptrnext;
};
// Writing into the file
printf("\nEnter Itemname: ");
gets(ptrthis->itemname);
printf("\nEnter Code: ");
gets(ptrthis->dspidc);
fwrite(ptrthis, sizeof(*ptrthis), 1, fp);
// Reading from the file
while(fread(ptrthis, sizeof(*ptrthis), 1, fp) ==1)
{
printf("\n%s %s", ptrthis->itemname,ptrthis->dspidc);
ptrthis = ptrthis->ptrnext;
}
Writing the size of an array that is 255 bytes will write 255 bytes to file (regardless of what you have stuffed into that array). If you want only the 'textual' portion of that array you need to use a facility that handles null terminators (i.e. printf, fprintf, ...).
Reading is then more complicated as you need to set up the idea of a sentinel value that represents the end of a string.
This speaks nothing of the fact that you are writing the value of a pointer (initialized or not) that will have no context or validity on the next read. Pointers (i.e. memory locations) have application only within the currently executing process. Trying to use one process' memory address in another is definitely a bad idea.
The code works fine
not really:
a) you are dumping the raw contents of the struct to a file, including the pointer to another instance if "Item". you can not expect to read back in a pointer from disc and use it as you do with ptrthis = ptrthis->ptrnext (i mean, this works as you "use" it in the given snippet, but just because that snippet does nothing meaningful at all).
b) you are writing 2 * 255 bytes of potential crap to the file. the reason why you see this strange looking "blocks" in your file is, that you write all 255 bytes of itemname and 255 bytes of dspidc to the disc .. including terminating \0 (which are the blocks, depending on your editor). the real "string" is something meaningful at the beginning of either itemname or dspidc, followed by a \0, followed by whatever is was in memory before.
the term you need to lookup and read about is called serialization, there are some libraries out there already which solve the task of dumping data structures to disc (or network or anything else) and reading it back in, eg tpl.
First of all, I would only serialize the data, not the pointers.
Then, in my opinion, you have 2 choices:
write a parser for your syntax (with yacc for instance)
use a data dumping format such as rmi serialization mechanism.
Sorry I can't find online docs, but I know I have the grammar on paper.
Both of those solution will be platform independent, be them big endian or little endian.

Resources