Each item is an array of 17 32-bit integers. I can probably produce 120-bit unique hashes for them.
I have an algorithm that produces 9,731,643,264 of these items, and want to see how many of these are unique. I speculate that at most 1/36th of these will be unique but can't be sure.
At this size, I can't really do this in memory (as I only have 4 gigs), so I need a way to persist a list of these, do membership tests, and add each new one if it's not already there.
I am working in C (gcc) on Linux, so it would be good if the solution works there.
Any ideas?
This reminds me of some of the problems I faced working on a solution to "Knight's Tour" many years ago. (A math problem which is now solved, but not by me.)
Even your hash isn't that much help . . . at nearly the size of a GUID, they could easily be unique across all the known universe.
It will take roughly 0.66 terabytes just to hold the list on disk (9.7 billion items at 68 bytes each) . . . 4 Gigs of memory or not, you'd still need a huge disk just to hold them. And you'd need double that much disk or more to do the sort/merge solutions I talk about below.
If you could SORT that list, then you could just go through the list one item at a time looking for duplicate copies, which will be sitting next to each other. Of course sorting that much data would require a custom sort routine (that you wrote) since it is binary (converting to hex would double the size of your data, but would allow you to use standard routines) . . . though likely even there they would probably choke on that much data . . . so you are back to your own custom routines.
Some things to think about:
Sorting that much data will take weeks, months or perhaps years. While you can do a nice heap sort or whatever in memory, because you only have so much disk space, you will likely be doing a "bubble" sort of the files regardless of what you do in memory.
Depending on what your generation algorithm looks like, you could generate "one memory load" worth of data, sort it in place, then write it out to disk in a file (sorted). Once that is done, you just have to "merge" all those individual sorted files, which is a much easier task (even though there would be thousands of files, it would still be relatively easy) - see the sketch after this list.
If your generator can tell you ANYTHING about your data, use that to your advantage. For instance, in my case, as I processed the Knight's moves, I knew my output values were constantly getting bigger (because I was always adding one bit per move), and that small bit of knowledge allowed me to optimize my sort in some unique ways. Look at your data, see if you know anything similar.
Making the data smaller is always good, of course. For instance, you talk about a 120-bit hash, but is that hash reversible? If so, sort the hashes, since they are smaller. If not, the hash might not be that much help (at least for my sorting solutions).
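To illustrate the "sort one memory load, write it out, then merge" idea, here is a minimal sketch in C. It assumes you have already reduced each item to its 120-bit hash (padded to 128 bits here) and shows only a two-way merge of runs; merging thousands of runs works the same way, just with a small heap over the current head of each file:

    /* Minimal sketch of the "sort one memory load, write a run, merge the runs"
     * approach. Assumptions not in the post above: items have been reduced to
     * 120-bit hashes stored in a 128-bit struct, and error handling is omitted. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    typedef struct { uint64_t hi, lo; } hash128;   /* 120-bit hash padded to 128 bits */

    static int cmp_hash(const void *a, const void *b)
    {
        const hash128 *x = a, *y = b;
        if (x->hi != y->hi) return x->hi < y->hi ? -1 : 1;
        if (x->lo != y->lo) return x->lo < y->lo ? -1 : 1;
        return 0;
    }

    /* Sort one in-memory batch and write it out as a single sorted run. */
    static void write_sorted_run(hash128 *buf, size_t n, const char *path)
    {
        qsort(buf, n, sizeof *buf, cmp_hash);
        FILE *f = fopen(path, "wb");
        fwrite(buf, sizeof *buf, n, f);
        fclose(f);
    }

    /* Merge two sorted runs, counting distinct hashes. With thousands of runs
     * you would do a k-way merge (a small heap over the head of each file),
     * but the duplicate detection works the same: equal values arrive together. */
    static uint64_t merge_and_count(const char *path_a, const char *path_b)
    {
        FILE *fa = fopen(path_a, "rb"), *fb = fopen(path_b, "rb");
        hash128 a, b, last = {0, 0};
        int have_a = fread(&a, sizeof a, 1, fa) == 1;
        int have_b = fread(&b, sizeof b, 1, fb) == 1;
        int have_last = 0;
        uint64_t distinct = 0;

        while (have_a || have_b) {
            hash128 cur;
            if (have_a && (!have_b || cmp_hash(&a, &b) <= 0)) {
                cur = a;
                have_a = fread(&a, sizeof a, 1, fa) == 1;
            } else {
                cur = b;
                have_b = fread(&b, sizeof b, 1, fb) == 1;
            }
            if (!have_last || cmp_hash(&cur, &last) != 0) {
                distinct++;                  /* duplicates sit next to each other */
                last = cur;
                have_last = 1;
            }
        }
        fclose(fa);
        fclose(fb);
        return distinct;
    }

    int main(void)
    {
        /* Toy demonstration; the real buffers would hold hundreds of millions
         * of entries per run. */
        hash128 run1[] = {{1, 5}, {2, 9}, {1, 5}}, run2[] = {{0, 3}, {2, 9}};
        write_sorted_run(run1, 3, "run1.bin");
        write_sorted_run(run2, 2, "run2.bin");
        printf("distinct hashes: %llu\n",
               (unsigned long long)merge_and_count("run1.bin", "run2.bin"));
        return 0;
    }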
I am interested in the mechanics of issues like this and I'd be happy to exchange emails on this subject just to bang around ideas and possible solutions.
You can probably make your life a lot easier if you can place some restrictions on your input data: Even assuming only 120 significant bits, the high number of duplicate values suggests an uneven distribution, as an even distribution would make duplicates unlikely for a given sample size of 10^10:
2^120 = (2^10)^12 > (10^3)^12 = 10^36 >> 10^10
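To put a number on that: under a uniform distribution the expected number of colliding pairs among n samples is about n(n-1)/2 divided by 2^120 (the birthday bound), which for n near 10^10 is vanishingly small. A few lines of C make that concrete:

    /* Quick birthday-bound check: expected colliding pairs among n uniform
     * samples from a space of 2^120 values is n*(n-1)/2 / 2^120. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double n = 9731643264.0;            /* number of generated items */
        double space = ldexp(1.0, 120);     /* 2^120 possible hash values */
        printf("expected colliding pairs: %.2e\n", n * (n - 1.0) / 2.0 / space);
        return 0;
    }

It prints something on the order of 10^-17, so any duplicates you see are coming from the generator, not from hash collisions.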
If you have continuous clusters (instead of sparse, but repeated values), you can gain a lot by operating on ranges instead of atomic values.
What I would do:
fill a buffer with a batch of generated values
sort the buffer in-memory
write ranges to disk, i.e. each entry in the file consists of the start and end value of a contiguous group of values
Then, you need to merge the individual files, which can be done online - i.e. as the files become available - the same way a stack-based mergesort operates: associate to each file a counter equal to the number of ranges in the file and push each new file on a stack. When the file on top of the stack has a counter greater than or equal to that of the file below it, merge the two files into a new file whose counter is the number of ranges in the merged file.
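To make the buffer-to-ranges step concrete, here is a small sketch in C. For brevity it treats values as 64-bit integers (the same scheme works for 120-bit hashes stored as two words); the count it returns is the per-file counter that the merge policy above compares:

    /* Sketch of the buffer -> sorted ranges step. Values are shown as 64-bit
     * integers for brevity; the same scheme works for 120-bit hashes stored
     * as two words. Error handling is omitted. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    typedef struct { uint64_t start, end; } range_t;   /* inclusive range */

    static int cmp_u64(const void *a, const void *b)
    {
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return x < y ? -1 : (x > y ? 1 : 0);
    }

    /* Sort one in-memory batch and write it out as contiguous ranges.
     * Duplicates and adjacent values collapse into a single entry, which is
     * where the space saving comes from. Returns the number of ranges written
     * (the per-file counter used by the merge policy described above). */
    static size_t write_ranges(uint64_t *buf, size_t n, const char *path)
    {
        qsort(buf, n, sizeof *buf, cmp_u64);

        FILE *f = fopen(path, "wb");
        size_t count = 0, i = 0;
        while (i < n) {
            range_t r = { buf[i], buf[i] };
            /* extend while the next value is a duplicate or a direct successor */
            while (i + 1 < n && buf[i + 1] <= r.end + 1) {
                if (buf[i + 1] > r.end) r.end = buf[i + 1];
                i++;
            }
            fwrite(&r, sizeof r, 1, f);
            count++;
            i++;
        }
        fclose(f);
        return count;
    }

    int main(void)
    {
        uint64_t batch[] = {7, 3, 4, 4, 5, 100, 101, 42};
        size_t ranges = write_ranges(batch, sizeof batch / sizeof batch[0], "batch0.rng");
        printf("wrote %zu ranges\n", ranges);   /* [3,5] [7,7] [42,42] [100,101] -> 4 */
        return 0;
    }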
I have a multi-dimensional array of strings (arbitrary length but generally no more than 5 or 6 characters) which needs to be random-read capable. Currently my issue is that I cannot create this array in any programming language (which keeps or loads the entire array into memory) since the array would far exceed my 32GB RAM.
All values in this database are interrelated, so if I break the database up into smaller more manageable pieces, I will need to have every "piece" with related data in it, loaded, in order to do computations with the held values. Which would mean loading the entire database, so we're back to square one.
The array dimensions are: [50,000] * [50,000] * [2] * [8] (I'll refer to this structure as X*Y*Z*M)
The array needs to be infinitely resizable on the X, Y, and M dimensions, though the M dimension would very rarely be changed, so a sufficiently high upper-bound would be acceptable.
While I do have a specific use-case for this, this is meant to be a more general and open-ended question about dealing with huge multi-dimensional arrays - what methods, structures, or tricks would you recommend to store and index the values? The array itself clearly needs to live on disk somewhere, as it is far too large to keep in memory.
I've of course looked into the basic options like an SQL database or a static file directory structure.
SQL doesn't seem like it would work, since there is an upper-bound limitation on column widths and it only supports tables with columns and rows - SQL doesn't seem to support the kind of multidimensionality that I require. Perhaps there's another DBMS for things like this which someone can recommend?
The static file structure seemed to work when I first created the database, but after shutting the PC down and everything being lost from the read cache, the PC will no longer read from the disk. It returns zero on every read and doesn't even attempt to actually read the files. Any attempt to enumerate the database's contents (right-clicking Properties on the directory) will BSOD the entire PC instantly. There are just too many files and directories; Windows can't handle it.
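To make the access pattern concrete: the random reads I need boil down to an offset computation into the X*Y*Z*M structure. The throwaway sketch below only illustrates that arithmetic and one way the whole thing could sit in a single flat file, under assumptions that don't all hold for my real data (fixed 8-byte slots, POSIX mmap); it is not what I currently do:

    /* Throwaway sketch, only to illustrate the offset arithmetic. Assumptions
     * not in my question: each string is padded to a fixed 8-byte slot, and
     * POSIX mmap is used (on Windows the equivalent would be CreateFileMapping
     * and MapViewOfFile). Requires a 64-bit process. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define DIM_X 50000ULL
    #define DIM_Y 50000ULL
    #define DIM_Z 2ULL
    #define DIM_M 8ULL
    #define SLOT  8ULL     /* fixed bytes per string; mine are mostly 5 or 6 chars */

    /* Row-major offset of element [x][y][z][m] in one flat file. */
    static uint64_t slot_offset(uint64_t x, uint64_t y, uint64_t z, uint64_t m)
    {
        return (((x * DIM_Y + y) * DIM_Z + z) * DIM_M + m) * SLOT;
    }

    int main(void)
    {
        uint64_t total = DIM_X * DIM_Y * DIM_Z * DIM_M * SLOT;   /* ~320 GB */
        int fd = open("bigarray.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, (off_t)total) != 0) {
            perror("open/ftruncate");
            return 1;
        }

        /* Map the whole file; the OS pages pieces in and out on demand, so
         * only the cells actually touched occupy RAM. */
        char *base = mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        char *cell = base + slot_offset(123, 456, 1, 3);
        strncpy(cell, "hello", SLOT);
        printf("cell[123][456][1][3] = %.*s\n", (int)SLOT, cell);

        munmap(base, total);
        close(fd);
        return 0;
    }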
I am working with an AS400 IBM i-series machine. The problem is that I have a packed-decimal field in one of my files. The design set the field length to 9 digits, but now my data is growing, and when a value exceeds 999,999,999 an overflow occurs: the high-order digits are lost and only the remaining digits of the value are written to the file. As my machine is under heavy load and there is a lot of data, I cannot change the file design. So what's the best way to handle this problem?
It's a really big problem and I would appreciate any help. Thanks
You really only have two choices: change your file or change your code.
If you truly have one billion records and this is a key field, then you should go through the process of changing the file. If all of these records do not need to be "live", then you might be able to offload some of these to a "historical" table with one extra key so that you would be able to delete records and reuse some of these numbers.
If you need to keep all of this data as "live", then you should plan to widen the field. First, make a modified version of the file in another library. Then manipulate your library list to test it with the existing code. Revise and compile as needed. Finally, schedule a known, limited downtime, put the latest data into the new file, and swap the new file and programs into the production library.
If you are not at that scale and want to try to change your code rather than the file, ask yourself why the data takes nine digits to describe. Are you skipping large blocks of numbers? Are you trying to make certain digits carry certain meanings? Those techniques can eat up numbers at a much faster rate than assigning them sequentially. You could change the code to change how these numbers are assigned, but if multiple parts of the application expect them to play by a certain set of rules, changing all of those places might be much more work than just changing the file. This can be better or worse depending on how the application is structured. That is a judgement you will have to make.
First of all let me start off by saying that I read this question.
So as I was strolling through the internet, I came across that algorithm and wondered how it worked. After reading about it, I understand how it counts the views by hashing and using bits.
What I haven't quite understood yet is how it can be sure to avoid counting the same view again. Do we store each hashed value we come across and, before incrementing the count, check if it already exists in our array or whatever?
Doesn't that make it a lot less efficient if we have 1000k+ items?
The cool thing about HyperLogLog is that you do not have to store the entire array of values you have seen, which would be O(n), and not even the unique values. What you do need to store is of order O(log(log(n))), which is much smaller.
Basically, if two objects have the same value, then their hash will be the same. This means that the leading bits will also be the same. So having multiple objects with the same value won't affect the computation at all.
This fact also allows easy parallelism - you can divide your population, and calculate the max separately, combining them later by calculating the maximum of your separate maxes.
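To make that concrete, here is a stripped-down C sketch of the underlying trick - closer to a single-register Flajolet-Martin style estimator than to full HyperLogLog, which keeps many registers and averages them. It shows that only the maximum leading-zero count is stored, so duplicates change nothing and two sketches merge by taking the larger maximum. hash64() is a stand-in mixing function, not from any particular library:

    #include <stdio.h>
    #include <stdint.h>
    #include <math.h>

    /* SplitMix64-style mixer used here as a stand-in hash function. */
    static uint64_t hash64(uint64_t x)
    {
        x += 0x9E3779B97F4A7C15ULL;
        x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
        x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
        return x ^ (x >> 31);
    }

    static int leading_zeros(uint64_t x)
    {
        int n = 0;
        if (x == 0) return 64;
        while (!(x & (1ULL << 63))) { x <<= 1; n++; }
        return n;
    }

    typedef struct { int max_lz; } sketch;

    /* Only the maximum leading-zero count is kept, so a duplicate value
     * (same hash, same leading zeros) can never change the sketch. */
    static void sketch_add(sketch *s, uint64_t value)
    {
        int lz = leading_zeros(hash64(value));
        if (lz > s->max_lz)
            s->max_lz = lz;
    }

    /* Parallel-friendly: two sketches combine by taking the larger maximum. */
    static sketch sketch_merge(sketch a, sketch b)
    {
        return a.max_lz > b.max_lz ? a : b;
    }

    static double sketch_estimate(sketch s)
    {
        /* Roughly 2^(max_lz + 1) distinct values; this single-register estimate
         * has a huge variance - real HyperLogLog keeps many registers and
         * combines them with a bias-corrected harmonic mean. */
        return ldexp(1.0, s.max_lz + 1);
    }

    int main(void)
    {
        sketch s1 = {0}, s2 = {0};
        for (uint64_t i = 0; i < 100000; i++) {
            sketch_add(&s1, i);      /* first "worker" sees every value...    */
            sketch_add(&s2, i);      /* ...and so does the second: duplicates */
        }
        sketch combined = sketch_merge(s1, s2);
        printf("estimate: %.0f (true distinct count: 100000)\n",
               sketch_estimate(combined));
        return 0;
    }

Full HyperLogLog splits each hash into a bucket index and a remainder, keeps one such maximum per bucket, and combines the buckets with a bias-corrected harmonic mean, which is what brings the variance down.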
Recently I was faced with having to store many 'versions' of an array in memory (think of an undo system or changes to a file in version-control - but could apply elsewhere too).
In case this isn't clear:
Arrays may be identical, share some data, or none at all.
Elements may be added or removed at any point.
The goal is to avoid storing an entirely new array when there are large sections of the array that are identical.
For the purpose of this question, changes such as adding a number to each value can be ignored (treated as different data).
I've looked into writing my own solution; in principle this can be done fairly simply (a rough sketch in C follows the list below):
divide the array into small blocks.
if nothing changes, reuse the blocks in each new version of the array.
if only one block changes, make a new block with changed data.
retrieving an array can be done by allocating the memory, then filling it with the data from each block.
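In case it helps, here is a stripped-down sketch of those first three points in C - not my actual implementation, just the scheme with arbitrary names, fixed-length arrays, and cleanup skipped:

    /* Stripped-down sketch: reference-counted blocks shared between versions,
     * copied only on write. Fixed-length arrays only; names are arbitrary. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLOCK_ELEMS 256                  /* elements per block */

    typedef struct {
        int refcount;
        int data[BLOCK_ELEMS];
    } block;

    typedef struct {
        size_t nblocks;
        block **blocks;                      /* shared, refcounted blocks */
    } version;

    static block *block_retain(block *b) { b->refcount++; return b; }

    static void block_release(block *b)
    {
        if (--b->refcount == 0) free(b);
    }

    /* A new version initially shares every block with its parent. */
    static version version_clone(const version *src)
    {
        version v = { src->nblocks, malloc(src->nblocks * sizeof(block *)) };
        for (size_t i = 0; i < v.nblocks; i++)
            v.blocks[i] = block_retain(src->blocks[i]);
        return v;
    }

    /* Write one element; copy the affected block only if it is shared. */
    static void version_set(version *v, size_t index, int value)
    {
        size_t bi = index / BLOCK_ELEMS;
        block *b = v->blocks[bi];
        if (b->refcount > 1) {               /* copy-on-write */
            block *copy = malloc(sizeof *copy);
            memcpy(copy, b, sizeof *copy);
            copy->refcount = 1;
            block_release(b);
            v->blocks[bi] = copy;
            b = copy;
        }
        b->data[index % BLOCK_ELEMS] = value;
    }

    int main(void)
    {
        version v0 = { 4, malloc(4 * sizeof(block *)) };   /* 4 blocks of zeros */
        for (size_t i = 0; i < 4; i++) {
            v0.blocks[i] = calloc(1, sizeof(block));
            v0.blocks[i]->refcount = 1;
        }

        version v1 = version_clone(&v0);     /* shares all 4 blocks with v0 */
        version_set(&v1, 300, 42);           /* copies only block 1 */

        printf("v0[300]=%d v1[300]=%d (3 of 4 blocks still shared)\n",
               v0.blocks[1]->data[300 % BLOCK_ELEMS],
               v1.blocks[1]->data[300 % BLOCK_ELEMS]);
        return 0;
    }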
Things become more involved when the array length changes or when the data is re-ordered.
Then it becomes a trade-off for how much time it's worth spending searching for duplicate blocks (in my case I hashed some data at the beginning of each block to help identify candidates to reuse).
I've got my implementation working (and can link to it if it's useful, though I'd rather avoid discussing my specific code, since it distracts from the general case).
I suspect my own code could be improved (using tried-and-tested memory hashing & searching methods). Possibly I'm not using the right terms, but I wasn't able to find information on this by searching online.
So my questions are:
Which methods are most efficient for recognizing and storing arrays that share some contiguous data?
Are there known, working methods which are considered best-practice to solve this problem?
Update: wrote a small(ish) single-file library and tests, as well as a Python reference version.
I need to do a school project in C (I really don't know C++ that well).
I need a data structure to index each word of about 34k documents (it's a lot of words), and I need to do some ranking on the words. I already did this project about 2 years ago (I took a break from school and came back this year) and I used a hash table of binary trees, but I got a low grade because my project took about 2 hours to index all the words. I need something a little faster... any suggestions?
Tkz
Roberto
If you have the option, I'd strongly recommend using a database engine (MSSQL, MySQL, etc.) as that's exactly the sort of datasets and operations these are written for. Best not to reinvent the wheel.
Otherwise, why use a tree at all? From what you've described (and I realise we're probably not getting the full story...) a straight-up hash table with the word as a key and its rank/count of occurrences should be enough?
bogofilter (the spam filter) has to keep word counts. It uses dbm as a backend, since it needs persistent storage of the word -> count map. You might want to look at the code for inspiration. Or not, since you need to implement the db part of it for the school project, not so much the spam filter part.
Minimize the amount of pointer chasing you have to do. Data-dependent memory-load operations are slow, esp. on a large working set where you will have cache misses. So make sure your hash table is big enough that you don't need a big tree in each bucket. And maybe check that your binary trees are dense, not degenerate linked lists, when you do get more than one value in a hash bucket.
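For example, a bare-bones version of that big hash table could look like the sketch below - open addressing with linear probing, the word as key and its count as value. The table size and hash function are arbitrary illustrative choices (and there is no grow-on-full handling), but it keeps each lookup to one probe sequence with no per-bucket tree to chase:

    /* Bare-bones sketch of a big word-count hash table: open addressing with
     * linear probing. Size the table well past the expected number of
     * distinct words so probe sequences stay short. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define TABLE_SIZE (1u << 21)            /* ~2M slots, power of two */

    typedef struct {
        char *word;                          /* NULL means the slot is empty */
        unsigned count;
    } slot;

    static slot table[TABLE_SIZE];

    /* FNV-1a string hash. */
    static unsigned hash_word(const char *s)
    {
        unsigned h = 2166136261u;
        for (; *s; s++) {
            h ^= (unsigned char)*s;
            h *= 16777619u;
        }
        return h;
    }

    /* Increment the count for a word, inserting it on first sight. */
    static void count_word(const char *word)
    {
        unsigned i = hash_word(word) & (TABLE_SIZE - 1);
        while (table[i].word != NULL) {
            if (strcmp(table[i].word, word) == 0) {
                table[i].count++;
                return;
            }
            i = (i + 1) & (TABLE_SIZE - 1);  /* linear probing */
        }
        size_t len = strlen(word) + 1;
        table[i].word = malloc(len);
        memcpy(table[i].word, word, len);
        table[i].count = 1;
    }

    int main(void)
    {
        const char *sample[] = { "the", "quick", "the", "fox", "the" };
        for (size_t i = 0; i < sizeof sample / sizeof sample[0]; i++)
            count_word(sample[i]);

        for (size_t i = 0; i < TABLE_SIZE; i++)
            if (table[i].word)
                printf("%-8s %u\n", table[i].word, table[i].count);
        return 0;
    }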
If it's slow, profile it, and see if your problem is one slow function, or if it's cache misses, or if it's branch mispredictions.