C: Store and index a HUGE amount of info! School project in C

I need to do a school project in C (I really don't know C++ all that well).
I need a data structure to index every word of about 34k documents (a lot of words) and to do some ranking of the words. I already did this project about two years ago (I took a break from school and came back this year) using a hash table of binary trees, but I got a low grade because my project took about two hours to index all the words. I need something a little faster... any suggestions?
Thanks,
Roberto

If you have the option, I'd strongly recommend using a database engine (MSSQL, MySQL, etc.), as that's exactly the sort of dataset and workload they're written for. Best not to reinvent the wheel.
Otherwise, why use a binary tree at all? From what you've described (and I realise we're probably not getting the full story...) a straight-up hash table with the word as the key and its rank/count of occurrences should do the job.

bogofilter (the spam filter) has to keep word counts. It uses dbm as a backend, since it needs persistent storage of the word -> count map. You might want to look at the code for inspiration. Or not, since you need to implement the db part of it for the school project, not so much the spam filter part.
Minimize the amount of pointer chasing you have to do. Data-dependent memory-load operations are slow, esp. on a large working set where you will have cache misses. So make sure your hash table is big enough that you don't need a big tree in each bucket. And maybe check that your binary trees are dense, not degenerate linked lists, when you do get more than one value in a hash bucket.
If it's slow, profile it, and see if your problem is one slow function, or if it's cache misses, or if it's branch mispredictions.
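As a minimal sketch of the plain word-to-count hash table suggested above (the bucket count, the djb2-style hash, and the struct names are my own choices, not from the thread):

    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS (1u << 20)             /* large table so chains stay short */

    typedef struct entry {
        char *word;
        unsigned count;
        struct entry *next;                 /* short chain per bucket */
    } entry;

    static entry *table[NBUCKETS];

    static unsigned long hash(const char *s) {  /* simple djb2 string hash */
        unsigned long h = 5381;
        while (*s) h = h * 33 + (unsigned char)*s++;
        return h;
    }

    void add_word(const char *word) {
        unsigned long b = hash(word) % NBUCKETS;
        for (entry *e = table[b]; e; e = e->next)
            if (strcmp(e->word, word) == 0) { e->count++; return; }
        entry *e = malloc(sizeof *e);       /* first time we see this word */
        e->word = strdup(word);
        e->count = 1;
        e->next = table[b];
        table[b] = e;
    }

With around a million buckets the chains stay very short for a 34k-document vocabulary, so each lookup involves little pointer chasing, which is exactly the point made above.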

Related

Datastore for large astrophysics simulation data

I'm a grad student in astrophysics. I run big simulations using codes mostly developed by others over a decade or so. For examples of these codes, you can check out gadget http://www.mpa-garching.mpg.de/gadget/ and enzo http://code.google.com/p/enzo/. Those are definitely the two most mature codes (they use different methods).
The outputs from these simulations are huge. Depending on your code, your data is a bit different, but it's always big data. You usually take billions of particles and cells to do anything realistic. The biggest runs are terabytes per snapshot and hundreds of snapshots per simulation.
Currently, it seems that the best way to read and write this kind of data is to use HDF5 http://www.hdfgroup.org/HDF5/, which is basically an organized way of using binary files. It's a huge improvement over unformatted binary files with a custom header block (they still give me nightmares), but I can't help but think there could be a better way to do this.
I imagine the sheer data size is the issue here, but is there some sort of datastore that can handle terabytes of binary data efficiently, or are binary files the only way at this point?
If it helps, we typically store data columnwise. That is, you have a block of all particle IDs, a block of all particle positions, a block of particle velocities, etc. It's not the prettiest, but it is the fastest for doing something like a particle lookup in some volume.
edit: Sorry for being vague about the issues. Steve is right that this might just be an issue of data structure rather than the data storage method. I have to run now, but I will provide more details late tonight or tomorrow.
edit 2: So the more I look into this, the more I realize that this probably isn't a datastore issue anymore. The main issue with unformatted binary was all the headaches reading the data correctly (getting the block sizes and order right and being sure about it). HDF5 pretty much fixed that and there isn't going to be a faster option until the file system limitations are improved (thanks Matt Turk).
The new issues probably come down to data structure. HDF5 is as performant as we can get, even if it is not the nicest interface to query against. Being used to databases, I thought it would be really interesting/powerful to be able to query something like "give me all particles with velocity over x at any time". You can do something like that now, but you have to work at a lower level. Of course, given how big the data is and depending on what you are doing with it, it might be a good thing to work at a low level for performance's sake.
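For readers who have not used HDF5, a minimal C sketch of writing one columnwise block (here, particle x-velocities) as its own dataset could look like the following; the file name and dataset name are illustrative, not taken from the post:

    #include <hdf5.h>

    /* Write n particle x-velocities as a single contiguous 1-D dataset. */
    void write_velocity_block(const double *vx, hsize_t n)
    {
        hid_t file  = H5Fcreate("snapshot_000.h5", H5F_ACC_TRUNC,
                                H5P_DEFAULT, H5P_DEFAULT);
        hsize_t dims[1] = { n };
        hid_t space = H5Screate_simple(1, dims, NULL);   /* 1-D dataspace */
        hid_t dset  = H5Dcreate2(file, "velocity_x", H5T_NATIVE_DOUBLE, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, vx);
        H5Dclose(dset);
        H5Sclose(space);
        H5Fclose(file);
    }

Each per-particle quantity gets its own dataset, which matches the columnwise layout described in the question and keeps volume lookups fast.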
MongoDB: http://www.mongodb.org/
Netezza products: http://www.netezza.com/data-warehouse-appliance-products/skimmer.aspx
Hadoop: http://hadoop.apache.org/
Wikipedia's List of Distributed File Systems: http://en.wikipedia.org/wiki/List_of_file_systems#Distributed_file_systems
EDIT
Rationale for my lack of explanation / etc.:
OP says: "[HDF5]'s a huge improvement over unformatted binary files with a custom header block (still give me nightmares), but I can't help but think there could be a better way to do this."
What does "better" mean? Better structured? He seems to allude to the "unformatted binary files" as being an issue - so maybe that's what he means by better. If so, he'll need something with some structure - hence the first couple suggestions.
OP says: "I imagine the sheer data size is the issue here, but is there some sort of datastore that can handle terabytes of binary data efficiently, or are binary files the only way at this point?"
Yes, there are several. Both structured, and "unstructured" - does he want structure, or is he happy to leave them in some sort of "unformatted binary format"? We still don't know - so I suggest checking out some Distributed File Systems.
OP says: "If it helps, we typically store data columnwise. That is, you have a block of all particle id's, block of all particle positions, block of particle velocites, etc. It's not the prettiest, but it is the fastest for doing something like a particle lookup in some volume."
Again, does the OP want better structure or not? It seems like he wants both: better structure AND faster... maybe scaling OUT will give him this. This further reinforces the first few options I listed.
OP says (in comments): "I don't know if we can take the hit on io though."
Are there IO requirements? Cost restrictions? What are they?
We can't get something for nothing here. There is no "silver-bullet" storage solution. All we have to go on here for requirements is "lots of data" and "I don't know if I like the lack of structure, but I'm not willing to increase my IO to accommodate any additional structure"... so I don't know what kind of answer he's expecting. He hasn't listed a single complaint about the current solution he has other than the lack of structure - and he's already said he's not willing to pay any overhead to do anything about that... so.... ?

Persisting a Large List for Membership Testing in C

Each item is an array of 17 32-bit integers. I can probably produce 120-bit unique hashes for them.
I have an algorithm that produces 9,731,643,264 of these items, and want to see how many of these are unique. I speculate that at most 1/36th of these will be unique but can't be sure.
At this size, I can't really do this in memory (as I only have 4 gigs), so I need a way to persist a list of these, do membership tests, and add each new one if it's not already there.
I am working in C(gcc) on Linux so it would be good if the solution can work from there.
Any ideas?
This reminds me of some of the problems I faced working on a solution to "Knight's Tour" many years ago. (A math problem which is now solved, but not by me.)
Even your hash isn't that much help... at nearly the size of a GUID, they could easily be unique across the entire known universe.
It will take approximately 0.75 terabytes just to hold the list on disk... 4 GB of memory or not, you'd still need a huge disk just to hold them. And you'd need double that much disk or more to do the sort/merge solutions I talk about below.
If you could SORT that list, then you could just go through the list one item at a time looking for unique copies next to each other. Of course, sorting that much data would require a custom sort routine (that you wrote) since it is binary (converting to hex would double the size of your data, but would allow you to use standard routines)... though likely even those would choke on that much data... so you are back to your own custom routines.
Some things to think about:
Sorting that much data will take weeks, months or perhaps years. While you can do a nice heap sort or whatever in memory, because you only have so much disk space, you will likely be doing a "bubble" sort of the files regardless of what you do in memory.
Depending on what your generation algorithm looks like, you could generate "one memory load" worth of data, sort it in place, then write it out to disk in a file (sorted). Once that was done, you just have to "merge" all those individual sorted files, which is a much easier task (even though there would be thousands of files, it would still be a relatively easy task).
If your generator can tell you ANYTHING about your data, use that to your advantage. For instance, in my case, as I processed the Knight's moves, I knew my output values were constantly getting bigger (because I was always adding one bit per move); that small piece of knowledge allowed me to optimize my sort in some unique ways. Look at your data and see if you know anything similar.
Making the data smaller is always good, of course. For instance, you talk about a 120-bit hash, but is that hash reversible? If so, sort the hash since it is smaller. If not, the hash might not be that much help (at least for my sorting solutions).
I am interested in the mechanics of issues like this and I'd be happy to exchange emails on this subject just to bang around ideas and possible solutions.
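A minimal C sketch of the "generate one memory load, sort it in place, write it out sorted" step described above; the batch size, the 16-byte key layout, and the next_key() generator hook are assumptions, not part of the original question:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    typedef struct { uint8_t b[16]; } key128;   /* 120-bit hash padded to 16 bytes */

    static int cmp_key(const void *a, const void *b) {
        return memcmp(a, b, sizeof(key128));    /* byte order doubles as sort order */
    }

    /* Hypothetical generator supplied by the caller; returns 0 when exhausted. */
    extern int next_key(key128 *out);

    int write_sorted_run(const char *path, key128 *buf, size_t cap)
    {
        size_t n = 0;
        while (n < cap && next_key(&buf[n]))    /* fill one memory-sized batch */
            n++;
        if (n == 0) return 0;
        qsort(buf, n, sizeof(key128), cmp_key); /* in-memory sort of the batch */
        FILE *f = fopen(path, "wb");
        if (!f) { perror("fopen"); exit(1); }
        fwrite(buf, sizeof(key128), n, f);      /* one sorted run on disk */
        fclose(f);
        return 1;
    }

The sorted runs can then be merged pairwise (or several at a time) while counting adjacent duplicates, as suggested above.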
You can probably make your life a lot easier if you can place some restrictions on your input data: Even assuming only 120 significant bits, the high number of duplicate values suggests an uneven distribution, as an even distribution would make duplicates unlikely for a given sample size of 10^10:
2^120 = (2^10)^12 > (10^3)^12 = 10^36 >> 10^10
If you have contiguous clusters (instead of sparse but repeated values), you can gain a lot by operating on ranges instead of atomic values.
What I would do:
fill a buffer with a batch of generated values
sort the buffer in-memory
write ranges to disk, i.e. each entry in the file consists of the start and end value of a contiguous group of values
Then you need to merge the individual files, which can be done online (i.e. as the files become available) the same way a stack-based mergesort operates: associate with each file a counter equal to the number of ranges in the file and push each new file onto a stack. When the file on top of the stack has a counter greater than or equal to that of the previous file, merge the two into a new file whose counter is the number of ranges in the merged file.
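A small sketch of the range-encoding step (collapsing a sorted in-memory batch into inclusive start/end pairs before writing it out); the 16-byte key type and the on-disk record format are assumptions for illustration:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    typedef struct { uint8_t b[16]; } key128;   /* 120-bit hash padded to 16 bytes */
    typedef struct { key128 lo, hi; } range;    /* inclusive bounds of a contiguous run */

    /* True if b equals a + 1 when both are read as big-endian integers. */
    static int is_successor(const key128 *a, const key128 *b)
    {
        key128 t = *a;
        for (int i = 15; i >= 0; i--)
            if (++t.b[i] != 0) break;           /* propagate the carry */
        return memcmp(&t, b, sizeof t) == 0;
    }

    /* Collapse a sorted buffer into ranges and append them to a run file. */
    void write_ranges(FILE *out, const key128 *sorted, size_t n)
    {
        if (n == 0) return;
        range r = { sorted[0], sorted[0] };
        for (size_t i = 1; i < n; i++) {
            if (is_successor(&r.hi, &sorted[i])) {
                r.hi = sorted[i];               /* extend the current run */
            } else if (memcmp(&r.hi, &sorted[i], sizeof(key128)) != 0) {
                fwrite(&r, sizeof r, 1, out);   /* close the run, start a new one */
                r.lo = r.hi = sorted[i];
            }                                   /* exact duplicates are skipped */
        }
        fwrite(&r, sizeof r, 1, out);
    }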

database vs flat file, which is a faster structure for "regex" matching with many simultaneous requests

Which structure returns results faster and/or is less taxing on the host server: a flat file or a database (MySQL)?
Assume many users (100 users) are simultaneously querying the file/db.
Searches involve pattern matching against a static file/db.
The file has 50,000 unique lines (all of the same data type).
There could be many matches.
There is no writing to the file/db, just read.
Is it possible to duplicate the file/db and write a logic switch to use the backup file/db if the main one is in use?
Which language is best for this type of structure? Perl for a flat file and PHP for a db?
Addition info:
Say I want to find all the cities that have the pattern "cis" in their names.
Which is better/faster, using regex or string functions?
Please recommend a strategy
TIA
I am a huge fan of simple solutions, and thus prefer -- for simple tasks -- flat file storage. A relational DB with its indexing capabilities won't help you much with arbitrary regex patterns at all, and the filesystem's caching ensures that this rather small file is in memory anyway. I would go the flat file + perl route.
Edit: (taking your new information into account) If it's really just about finding a substring in one known attribute, then using a fulltext index (which a DB provides) will help you a bit (depending on the type of index applied) and might provide an easy and reasonably fast solution that fits your requirements. Of course, you could implement an index yourself on the file system, e.g. using a variation of a suffix tree, which is hard to beat speed-wise.
Still, I would go the flat file route (and if it fits your purpose, have a look at awk), because if you had started implementing it, you'd be finished already ;) Further, I suspect that the number of users you talk about won't make the system feel the difference (your CPU will be bored most of the time anyway).
If you are uncertain, just try it! Implement that regex+perl solution, it takes a few minutes if you know perl, loop 100 times and measure with time. If it is sufficiently fast, use it, if not, consider another solution. You have to keep in mind that your 50,000 unique lines are really a low number in terms of modern computing. (compare with this: Optimizing Mysql Table Indexing for Substring Queries )
HTH,
alexander
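The answer suggests a quick perl prototype; since the rest of this compilation deals in C, here is an equally small C sketch of the same flat-file scan using a plain string function (the file name and the one-city-per-line format are assumptions):

    #include <stdio.h>
    #include <string.h>

    /* Print every line of the flat file that contains the given substring. */
    int main(void)
    {
        const char *needle = "cis";              /* pattern from the question */
        FILE *f = fopen("cities.txt", "r");      /* assumed: one city per line */
        if (!f) { perror("fopen"); return 1; }

        char line[256];
        while (fgets(line, sizeof line, f)) {
            if (strstr(line, needle))            /* plain substring match */
                fputs(line, stdout);
        }
        fclose(f);
        return 0;
    }

For a fixed substring like "cis", strstr() is both simpler and cheaper than a regex engine, which speaks to the "regex or string functions" part of the question.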
Depending on what your queries and your data look like, a full-text search engine like Lucene or Sphinx could be a good idea.

How are hash tables implemented internally in popular languages?

Can someone please shed some light on how popular languages like Python and Ruby implement hash tables internally for symbol lookup? Do they use the classic "array with linked lists" method, or a balanced tree?
I need a simple (fewer LOC) and fast method for indexing the symbols in a DSL written in C. Was wondering what others have found most efficient and practical.
The classic "array of hash buckets" you mention is used in every implementation I've seen.
One of the most instructive versions is the hash implementation in the Tcl language, in the file tcl/generic/tclHash.c. More than half of the lines in the file are comments explaining everything in detail: allocation, search, different hash table types, strategies, etc. Side note: the code implementing the Tcl language is really readable.
Perl uses an array with linked lists to hold collisions. It has a simple heuristic to automatically double the size of the array as necessary. There's also code to share keys between hashes to save a little memory. You can read about it in the dated but still relevant Perl Illustrated Guts under "HV". If you're truly adventurous you can dig into hv.c.
The hashing algorithm used to be pretty simple, but it's probably a lot more complicated now with Unicode. Because the algorithm was predictable, there was a DoS attack whereby the attacker generated data which would cause hash collisions. For example, a huge list of keys sent to a web site as POST data. The Perl program would likely split it and dump it into a hash, which then shoved it all into one bucket. The resulting hash was O(n) rather than O(1). Throw a whole lot of POST requests at a server and you might clog the CPU. As a result, Perl now perturbs the hash function with a bit of random data.
You also might want to look at how Parrot implements basic hashes which is significantly less terrifying than the Perl 5 implementation.
As for "most efficient and practical", use someone else's hash library. For god's sake don't write one yourself for production use. There's a hojillion robust and efficient ones out there already.
Lua tables use an utterly ingenious implementation which for arbitrary keys behaves like an "array of buckets", but if you use consecutive integers as keys, it has the same representation and space overhead as an array. In the implementation, each table has a hash part and an array part.
I think this is way cool :-)
Attractive Chaos has a comparison of hash table libraries and an update.
The source code is available, and it is in C and C++.
Balanced trees sort of defeat the purpose of hash tables since a hash table can provide lookup in (amortized) constant time, whereas the average lookup on a balanced tree is O(log(n)).
Separate chaining (array with linked list) really works quite well if you have enough buckets, and your linked list implementation uses a pooling allocator rather than malloc()ing each node from the heap individually. I've found it to be just about as performant as any other technique when properly tuned, and it is very easy and quick to write. Try starting with 1/8 as many buckets as source data.
You can also use open addressing with quadratic or polynomial probing, as Python does.
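A bare-bones illustration of separate chaining with a pooled node allocator, as described a couple of answers up (the bucket count, pool size, and integer key type are my own simplifications):

    #include <stdlib.h>

    #define NBUCKETS 4096
    #define POOLSIZE 65536

    typedef struct node {
        unsigned key;
        int value;
        struct node *next;          /* chain within the bucket */
    } node;

    static node pool[POOLSIZE];     /* nodes come from a preallocated pool,   */
    static size_t pool_used;        /* not from one malloc() call per node    */
    static node *buckets[NBUCKETS];

    static int put(unsigned key, int value)
    {
        size_t b = key % NBUCKETS;
        for (node *n = buckets[b]; n; n = n->next)
            if (n->key == key) { n->value = value; return 0; }
        if (pool_used == POOLSIZE) return -1;   /* pool exhausted */
        node *n = &pool[pool_used++];
        n->key = key;
        n->value = value;
        n->next = buckets[b];                   /* push onto the chain */
        buckets[b] = n;
        return 0;
    }

    static int *get(unsigned key)
    {
        for (node *n = buckets[key % NBUCKETS]; n; n = n->next)
            if (n->key == key) return &n->value;
        return NULL;
    }

Pooling keeps the nodes contiguous in memory, which cuts both allocator overhead and cache misses compared with malloc()ing each node individually.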
If you can read Java, you might want to check out the source code for its various map implementations, in particular HashMap, TreeMap and ConcurrentSkipListMap. The latter two keep the keys ordered.
Java's HashMap uses the standard technique you mention of chaining at each bucket position. It uses fairly weak 32-bit hash codes and stores the keys in the table. The Numerical Recipes authors also give an example (in C) of a hash table essentially structured like Java's but in which (a) you allocate the nodes of the bucket lists from an array, and (b) you use a stronger 64-bit hash code and dispense with storing keys in the table.
What Crashworks meant to say was...
The purpose of hash tables is constant-time lookup, insertion, and deletion. In algorithmic terms, all of these operations are O(1) amortized.
Whereas if you use a tree, the worst-case operation time will be O(log n) for a balanced tree, where n is the number of nodes. But do we really see hashes implemented as trees?

Linked lists or hash tables?

I have a linked list of around 5000 entries (NOT inserted simultaneously), and I occasionally traverse the list looking for a particular entry (though this is not very often). Should I consider a hash table as a better choice for this case, replacing the linked list (which is doubly linked and linear)? I'm using C on Linux.
If you have not found the code to be the slow part of the application via a profiler then you shouldn't do anything about it yet.
If it is slow, but the code is tested, works, and is clear, and there are other slower areas that you can work on speeding up do those first.
If it is buggy then you need to fix it anyway; go for the hash table, as it will be faster than the list. This assumes that the order in which the data is traversed does not matter. If you care about the insertion order, then stick with the list (you can do things with a hash table and keep the order, but that will make the code much trickier).
Given that you need to search the list only on occasion the odds of this being a significant bottleneck in your code is small.
Another data structure to look at is a "skip list" which basically lets you skip over a large portion of the list. This requires that the list be sorted however, which, depending on what you are doing, may make the code slower overall.
Whether a hash table is the better choice depends on the use case, which you have not described in detail. But more importantly, make sure the performance bottleneck is actually in this part of the code. If this code is called only once in a while and not on a critical path, there's no use bothering to change it.
Have you measured and found a performance hit with the lookup? A hash_map or hash table should be good.
If you need to traverse the list in order (not as a part of searching for elements, but say for displaying them) then a linked list is a good choice. If you're only storing them so that you can look up elements then a hash table will greatly outperform a linked list (for all but the worst possible hash function).
If your application calls for both types of operations, you might consider keeping both, and using whichever one is appropriate for a particular task. The memory overhead would be small, since you'd only need to keep one copy of each element in memory and have the data structures store pointers to these objects.
As with any optimization step that you take, make sure you measure your code to find the real bottleneck before you make any changes.
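A sketch of the "keep both" idea above: a doubly linked list for ordered traversal plus a hash index whose buckets point at the same nodes (the key size, bucket count, and type names are illustrative):

    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 1024

    typedef struct item {
        char key[32];
        struct item *prev, *next;   /* list links for in-order traversal */
        struct item *hnext;         /* chain link inside the hash bucket */
    } item;

    typedef struct {
        item *head, *tail;          /* insertion-ordered doubly linked list */
        item *buckets[NBUCKETS];    /* hash index over the same nodes */
    } store;

    static size_t hash(const char *s) {
        size_t h = 5381;
        while (*s) h = h * 33 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    /* O(1) average lookup through the hash index instead of walking the list. */
    item *find(store *st, const char *key) {
        for (item *it = st->buckets[hash(key)]; it; it = it->hnext)
            if (strcmp(it->key, key) == 0) return it;
        return NULL;
    }

    void insert(store *st, const char *key) {
        item *it = calloc(1, sizeof *it);
        strncpy(it->key, key, sizeof it->key - 1);
        it->prev = st->tail;                    /* append to the list tail */
        if (st->tail) st->tail->next = it; else st->head = it;
        st->tail = it;
        size_t b = hash(key);
        it->hnext = st->buckets[b];             /* and add it to the index */
        st->buckets[b] = it;
    }

Each element is stored once; the two structures only hold pointers to it, so the memory overhead really is small, as the answer points out.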
If you care about performance, you definitely should. If you're iterating through the thing to find a certain element with any regularity, it's going to be worth it to use a hash table. If it's a rare case, though, and the ordinary use of the list is not a search, then there's no reason to worry about it.
If you only traverse the collection, I don't see any advantage in using a hash map.
I advise against hashes in almost all cases.
There are two reasons. First, the size of the hash table is fixed.
Second, and much more importantly: the hashing algorithm. How do you know you've got it right? How will it behave with real data rather than test data?
I suggest a balanced B-tree. Always O(log n), no uncertainty with regard to a hash algorithm, and no size limits.

Resources