How to deal with HUGE (32GB+) multi-dimensional arrays? - database

I have a multi-dimensional array of strings (arbitrary length but generally no more than 5 or 6 characters) which needs to be random-read capable. Currently my issue is that I cannot create this array in any programming language (which keeps or loads the entire array into memory) since the array would far exceed my 32GB RAM.
All values in this database are interrelated, so if I break the database up into smaller more manageable pieces, I will need to have every "piece" with related data in it, loaded, in order to do computations with the held values. Which would mean loading the entire database, so we're back to square one.
The array dimensions are: [50,000] * [50,000] * [2] * [8] (I'll refer to this structure as X*Y*Z*M)
The array needs to be infinitely resizable on the X, Y, and M dimensions, though the M dimension would very rarely be changed, so a sufficiently high upper-bound would be acceptable.
While I do have a specific use-case for this, this is meant to be a more general and open-ended question about dealing with huge multi-dimensional arrays - what methods, structures, or tricks would you recommend to store and index the values? The array itself, clearly needs to live on disk somewhere, as it is far too large to keep in memory.
I've of course looked into the basic options like an SQL database or a static file directory structure.
SQL doesnt seem like it would work since there is an upper-bound limitation to column widths, and it only supports tables with columns and rows - SQL doesn't seem to support the kind of multidimensionality that I require. Perhaps there's another DBMS system for things like this which someone recommends?
The static file structure seemed to work when I first created the database, but upon shutting the PC down and everything being lost from the read cache, the PC will no longer read from the disk. It returns zero on every read and doesnt even attempt to actually read the files. Any attempt to enumerate the database's contents (right-clicking properties on the directory) will BSOD the entire PC instantly. There's just too many files and directories, windows can't handle it.

Related

Choosing the right DBM-like C++ library for sequential data

I am trying to choose a database for a newly developing application. There are so many alternatives and it’s so easy to choose a wrong one. First of all, there is a requirement to not use database servers. A required database should be a static or dynamic C++ library. The data that needs to be stored is an array of records. They vary but are fixed for a given dataset (so they can be stored in a table). The information in each row could be from several hundred bytes up to several megabytes. And a number of rows may be millions for now and expected to grow.
The index of the row could be used as a key. No need to maintain a separate key column.
Data is inserted sequentially. Read access will be performed only by iterating all the data or some segment of it sequentially (May need to iterate with steps like each 5th).
I don’t think that relational DBs are good feet for many reasons.
a. They are mostly server-based. I know about SQLite but as far as I know, it stores data in one file which I assume may lead to issues related to maximum file size.
b. We don’t need the power that SQL provides instead we would like to have more flexibility in stored data types.
There are Key/Value non-SQL dbms like BerkeleyDB, RocksDB, or something like luxio for lighter alternatives. The functionality they provide is more than enough for the task. And this might be the right choice however I don’t know how well they are optimized for such case where we have continuous integer keys. The associative key access (which is not required for us) may have some overhead in performance.
I know there are some type of non-SQL databases called “wide-column” which I am not familiar with. However, the name sounds like it is perfect for our task. All databases I can find are server of claud based. If you know dbm-like library for such type of database please advise.
I am not experienced in databases so please correct me if I am wrong in any of 3 above stamens.
If your row data can grow to megabytes, and you're talking about only millions of records, maybe just figure out a way to lay it out in a filesystem? If you need a more database-like index, use SQLite for the keys, and have the data records point to a location on the filesystem. This kind of thing will be far quicker to implement and get right than trying to do it all in one giant database.

Multiple Json Files or A single file with multiple arrays

I have a large blob (azure) file with 10k json objects in a single array. This does not perform because of its size. As I look to re-architect it, I can either create multiple files with a single array in each of 500-1000 objects or I could keep the one file, but burst the single array into an array of arrays-- maybe 10 arrays of 1000 objects each.
For simplicity, I'd rather break into multiple files. However, I thought this was worth asking the question and seeing if there was something to be learned in the answers.
I would think this depends strongly on your use-case. The multiple files or multiple arrays you create will partition your data somehow: will the partitions be used mostly together or mostly separate? I.e. will there be a lot of cases in which you only read one or a small number of the partitions?
If the answer is "yes, I will usually only care about a small number of partitions" then creating multiple files will save you having to deal with most of your data on most of your calls. If the answer is "no, I will usually need either 1.) all/most of my data or 2.) data from all/most of my partitions" then you probably want to keep one file just to avoid having to open many files every time.
I'll add: in this latter case, it may well turn out that the file structure (one array vs an array-of-arrays) doesn't change things very much, since a full scan is a full scan is a full scan etc. If that's the case, then you may need to start thinking about how to move to the prior case where you partition your data so that your calls fall neatly within few partitions, or how to move to a different data format.

Huge lookup table in SV/UVM

I have to build a big lookup table (~14k entries with strings as keys) by parsing an input file for a predictor. Am I better off using an associative array or use uvm_config_db from simulation performance point of view?
Since it is only one component, I'd use an associative array.
If it were multiple components, I'd be more inclined to put the the entire associative array in one class object, then register that class to the uvm_config_db. This they all components accessing the table are pointing to the same object; thereby limiting the memory footprint.
Still, loading the table may be a bit resource incentive and holding the table will likely require your simulator to use a lot of RAM. If you are only going to need a couple hundred or less entries per test/seed, I suggest loading only the entries you may need into the associative array; not the whole thing.
If you need the whole thing, then it probable will be worth figuring out how to put the table into a real database and use DPI or VPI to access it form simulation. DPI typicality has less overhead. Some tool-sets have IP for accessing memory images and other large data structures that you may be able to utilize.

When to use an array vs database

I'm a student and starting to relearn again the basics of programming.
The problem I stated above starts when I have read some Facebook posts that most of the programmers use arrays in their application and arrays are useful. And I started to realize that I never use arrays in my program.
I read some books but they only show the syntax of array and didn't discuss on when to apply them in creating real world applications. I tried to research this on the Internet but I cannot find any. Do you guys have circumstance when you use arrays. Can you please share it to me so I can have an idea.
Also, to clear my doubts can you please explain to me why arrays are good to store information because database can also store information. When is the right time for me to use database and arrays?
I hope to get a clear answer because I have one remaining semester before the internship and I want to clear my head on this. I do not include any specific programming language because I know most of the programming language have arrays.
I hope to get an answer that can I can easily understand.
When is the right time for me to use database and arrays?
I can see how databases and arrays may seem like competing solutions to the same problem, but you're comparing apples and oranges. Arrays are a way to represent structured data in memory. Databases are a tool to store data on disk until you need to retrieve it.
The question you pose is kind of like asking: "When is the right time to use an integer to hold a value, vs a piece of paper?" One of them is a structural representation in memory; the other is a storage tool.
Do you guys have circumstance when you use arrays
In most applications, databases and arrays work together. Applications often retrieve data from a database, and hold it in an array for easy processing. Here is a simple example:
Google allows you to receive an alert when something of interest is mentioned on the news. Let's call it the event. Many people can be interested in the event, so Google needs to keep a list of people to alert. How? Probably in a database.
When the event occurs, what does Google do? Well it needs to:
Retrieve the list of interested users from the DB and place it in an array
Loop through the array and send a notification to each user.
In this example, arrays work really well because users form a collection of similarly shaped data structures that needs to be put through a similar process. That's exactly what arrays are for!
Some other common uses of arrays
A bank wants to send invoice and payment due reminders at the end of the day. So it retrieves the users with past due payments from the DB, and loops through the users' array sending notifications.
An IT admin panel wants to check whether all critical websites in a list are still online. So it loops through the array of domains, pings each one and records the results in a log
An educational program wants to perform statistical functions on student test results. So it puts the results in an array to easily perform operations such as average, sum, standardDev...
Arrays are also awesome at keeping things in a predictable order. You can be certain that as you loop forward through an array, you encounter values in the order you put them in. If you're trying to simulate a checkout line at the store, the customers in a queue are a perfect candidate to represent in an array because:
They are similarly shaped data: each customer has a name, cart contents, wait time, and position in line
They will be put through a similar process: each customer needs methods for enter queue, request checkout, approve payment, reject payment, exit queue
Their order should be consistent: When your program executes next(), you should expect that the next customer in line will be the one at the register, not some customer from the back of the line.
Trying to store the checkout queue in a database doesn't make sense because we want to actively work with the queue while we run our simulation, so we need data in memory. The database can hold a historical record of all customers and their checkout outcomes, perhaps for another program to retrieve and use in another way (maybe build customized statistical reports)
There are two different points. Let's me try to explain the simple way:
Array: container objects to keep a fixed number of values. The array is stored in your memory. So it depends on your requirements but when you need a fixed and fast one, just use array.
Database: when you have a relational data or you would like to store it in somewhere and not really worry about the size of the objects. You can store 10, 100, 1000 records to you DB. It's also flexible and you can select/query/update the data flexible. Simple way to use is: have a relational data, large amount and would like to flexible it, use database.
Hope this help.
There are a number of ways to store data when you have multiple instances of the same type of data. (For example, say you want to keep information on all the people in your city. There would be some sort of object to hold the information on each person, and you might want to have a data structure that holds the information on every person.)
Java has two main ways to store multiple instances of data in memory: arrays and Collections.
Databases are something different. The difference between a database and an array or collection, as I see it, are:
databases are persistent, i.e. the data will stay around after your program has finished running;
databases can be shared between programs, often programs running in all different parts of the world;
databases can be extremely large, much, much larger than could fit in your computer's memory.
Arrays and collections, however, are intended only for use by one program as it runs. Your program may want to keep track of some information in order to do its calculations. But the data will be in your computer's memory, and therefore other programs on other computers won't be able to access it. And when your program is done running, the data is gone. However, since the data is in memory, it's much faster to use it than data in a database, which is stored on some sort of external device. (This is really an overgeneralization, and doesn't consider things like virtual memory and caching. But it's good enough for someone learning the basics.)
The Java run time gives you three basic kinds of collections: sets, lists, and maps. A set is an unordered collection of unique elements; you use that when the data doesn't belong in any particular order, and the main operations you want are to see if something is in the set, or return all the data in the set without caring about the order. A list is ordered, though; the data has a particular order, and provides operations like "get the Nth element" for some number N, and adding to the ends of the list or inserting in a particular place in the list. A map is unordered like a set, but it also attaches keys to the data, so that you can look for data by giving the key. (Again, this is an overgeneralization. Some sets do have order, like SortedSet. And I haven't even gotten into queues, trees, multisets, etc., some of which are in third-party libraries.)
Java provides a List type for ordered lists, but there are several ways to implement it. One is ArrayList. Like all lists, it provides the capability to get the Nth item in the list. But an ArrayList provides this capability faster; under the hood, it's able to go directly to the Nth item. Some other list implementations don't do that--they have to go through the first, second, etc., items, until they get to the Nth.
An array is similar to an ArrayList, but it has a different syntax. For an array x, you can get the Nth element by referring to x[n], while for an ArrayList you'd say x.get(n). As far as functionality goes, the biggest difference is that for an array, you have to know how big it is before you create it, while an ArrayList can grow. So you'd want to use an ArrayList if you don't know beforehand how big your list will be. If you do know, then an array is a little more efficient, I think. Although you can probably get by mostly with just ArrayList, arrays are still fundamental structures in every computer language. The implementation of ArrayList depends on arrays under the hood, for instance.
Think of an array as a book, and database as library. You can't share the book with others at the same time, but you can share a library. You can't put the entire library in one book, but you can checkout 1 book at a time.

Persisting a Large List for Membership Testing in C

Each item is an array of 17 32-bit integers. I can probably produce 120-bit unique hashes for them.
I have an algorithm that produces 9,731,643,264 of these items, and want to see how many of these are unique. I speculate that at most 1/36th of these will be unique but can't be sure.
At this size, I can't really do this in memory (as I only have 4 gigs), so I need a way to persist a list of these, do membership tests, and add each new one if it's not already there.
I am working in C(gcc) on Linux so it would be good if the solution can work from there.
Any ideas?
This reminds me of some of the problems I faced working on a solution to "Knight's Tour" many years ago. (A math problem which is now solved, but not by me.)
Even your hash isn't that much help . . . at the nearly the size of a GUID, they could easily be unique accross all the the known universe.
It will take approximately .75 Terrabytes just to hold the list on disk . . . 4 Gigs of memory or not, you'd still need a huge disk just to hold them. And you'd need double that much disk or more to do the sort/merge solutions I talk about below.
If you could SORT that list, then you could just go threw the list one item at a time looking for unique copies next to each other. Of course sorting that much data would required a custom sort routine (that you wrote) since it is binary (coverting to hex would double the size of your data, but would allow you to use standard routines) . . . though likely even there they would probably choke on that much data . . . so your are back to your own custom routines.
Some things to think about:
Sorting that much data will take weeks, months or perhaps years. While you can do a nice heap sort or whatever in memory, because you only have so much disk space, you will likely be doing a "bubble" sort of the files regardless of what you do in memory.
Depending on what your generation algorithm looks like, you could generate "one memory load" worth of data, sort it in place then write it out to disk in a file (sorted). Once that was done, you just have to "merge" all those individual sorted files, which is a much easier task (even thought there would be 1000s of files, it would still be a relatively easier task).
If your generator can tell you ANYTHING about your data, use that to your advantage. For instance in my case, as I processed the Knight's Moves, I know my output values were constantly getting bigger (because I was always adding one bit per move), that small knowledge allowed me to optimize my sort in some unique ways. Look at your data, see if you know anything similar.
Making the data smaller is always good of course. For instance you talk about a 120 hash, but is that has reversable? If so, sort the hash since it is smaller. If not, the hash might not be that much help (at least for my sorting solutions).
I am interested in the machanics of issues like this and I'd be happy to exchange emails on this subject just to bang around ideas and possible solutions.
You can probably make your life a lot easier if you can place some restrictions on your input data: Even assuming only 120 significant bits, the high number of duplicate values suggests an uneven distribution, as an even distribution would make duplicates unlikely for a given sample size of 10^10:
2^120 = (2^10)^12 > (10^3)^12 = 10^36 >> 10^10
If you have continuous clusters (instead of sparse, but repeated values), you can gain a lot by operating on ranges instead of atomic values.
What I would do:
fill a buffer with a batch of generated values
sort the buffer in-memory
write ranges to disk, ie each entry in the file consists of start and end value of a continuous group of values
Then, you need to merge the individual files, which can be done online - ie as the files become available - the same way a stack-based mergesort operates: associate to each file a counter equal to the number of ranges in the file and push each new file on a stack. When the file on top of the stack has a counter greater or equal to the previous file, merge the files into a new file whose counter is the number of ranges in the merged file.

Resources