Good database design for recall and comparison of 2D data arrays?

I am looking to store 2D arrays of 900x100 elements in a database. Efficient recall and comparison of the arrays is important. I could use a table with a schema like [A, x, y, A(x,y)], such that a single array would comprise 90,000 records. This seems like an OK table design for storing the array, and it would provide efficient recall of single elements, but recall of a whole array would be inefficient and array comparisons would be very inefficient.
Should I leave the table design this way and build and compare my arrays in code? Or is there a better way to structure the table such that I can get efficient array comparisons using database only operations?
thanks

If the type of data allows, store it in a concatenated format and compare in memory after it has been de-concatenated. The database operation will be much faster and the in-memory operations will be faster than database retrievals as well.
Who knows, you may even be able to compare it without de-concatenating.
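For the case where the elements are plain numbers, a concatenated layout might look roughly like the following Java sketch (the 900x100 double element type and byte-for-byte equality comparison are assumptions, not details from the question): the whole array becomes a single stored value, and, as noted above, exact comparison can even run on the stored bytes without unpacking them.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Minimal sketch: flatten a 900x100 array of doubles into one byte[]
// suitable for a single BLOB-style column, and compare arrays in memory.
public class ArrayBlobCodec {
    static final int ROWS = 900, COLS = 100;

    static byte[] toBytes(double[][] a) {
        ByteBuffer buf = ByteBuffer.allocate(ROWS * COLS * Double.BYTES);
        for (double[] row : a)
            for (double v : row)
                buf.putDouble(v);
        return buf.array();
    }

    static double[][] fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        double[][] a = new double[ROWS][COLS];
        for (int x = 0; x < ROWS; x++)
            for (int y = 0; y < COLS; y++)
                a[x][y] = buf.getDouble();
        return a;
    }

    // Exact comparison can be done directly on the stored bytes,
    // i.e. without "de-concatenating" at all.
    static boolean sameArray(byte[] blobA, byte[] blobB) {
        return Arrays.equals(blobA, blobB);
    }
}
```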

900 x 100 elements is actually very small (even if the elements are massive 1K things that'd only be 90 MB). Can't you just compare in memory when needed and store on disk in some serialized format?
It doesn't make sense to store 2D arrays in the database, especially if it is immutable data.

When I used to work in the seismic industry we used to just dump our arrays (typically 1D, of a few thousand elements) to binary files. The database was only used for what was essentially metadata (location, indexing, etc.). This was considerably quicker, and it also allowed the data to be decoupled from the database if necessary. In production this was usual: a few thousand elements doesn't sound like much, but a typical dataset could easily be hundreds of GB - this was the 1990s, so we had to decouple to tape.
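A rough sketch of that split in Java might look like this; the float sample type, the small length header, and the idea that the database row holds only (traceId, location, filePath, sampleCount) are illustrative assumptions rather than details from the original setup:

```java
import java.io.*;

// Rough sketch of the "arrays in flat binary files, metadata in the DB" split.
public class TraceDump {
    // Dump one 1D array to its own binary file; the database would only hold
    // a row such as (traceId, location, filePath, sampleCount).
    static void writeTrace(File file, float[] samples) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeInt(samples.length);           // small header: element count
            for (float s : samples) out.writeFloat(s);
        }
    }

    static float[] readTrace(File file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            float[] samples = new float[in.readInt()];
            for (int i = 0; i < samples.length; i++) samples[i] = in.readFloat();
            return samples;
        }
    }
}
```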

Related

HashMap, Hashtable, Map or any other method

I am looking to compare two values (e.g. whether one is greater than or less than the other) in a HashMap, Hashtable, Map or any other array-like type.
Could you please help me with this.
Here are some factors that would affect your selection of a data structure:
What is the purpose of the comparison?
What type of data are you comparing?
How often will data be inserted into this data structure?
How often will data be selected from this data structure?
When should you use a HashMap?
One should use a HashMap when the major requirement is only retrieving or modifying data based on a key. For example, in web applications the username is stored as a key and the user data is stored as a value in the HashMap, for faster retrieval of the user data corresponding to a username.
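As a concrete illustration of that username-as-key pattern, a minimal Java sketch could look like this (the UserData record and its fields are invented for the example):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the username-as-key pattern described above.
public class SessionStore {
    record UserData(String displayName, String email) {}

    private final Map<String, UserData> usersByName = new HashMap<>();

    void put(String username, UserData data) { usersByName.put(username, data); }

    // Average O(1) lookup by key, which is the main reason to pick a HashMap here.
    UserData lookup(String username) { return usersByName.get(username); }
}
```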
When should you not use a HashTable?
The input can't be hashed (e.g. you're given binary blobs and don't know which bits in there are significant, but you do have an int cmp(const T&, const T&) function you could use for a std::map), or
the available/possible hash functions are very collision prone, or
you want to avoid worst-case performance hits for:
handling lots of hash-colliding elements (perhaps "engineered" by someone trying to crash or slow down your software)
resizing the hash table: unless presized to be large enough (which can be wasteful and slow when excessive memory is used), most implementations will outgrow the arrays they're using for the hash table every now and then, then allocate a bigger array and copy content across; this can make the specific insertions that trigger the rehashing much slower than the normal O(1) behaviour, even though the average is still O(1). If you need more consistent behaviour in all cases, something like a balanced binary tree may serve you better (see the sketch after this list)
your access patterns are quite specialised (e.g. frequently operating on elements with keys that are "nearby" in some specific sort order), such that cache efficiency is better for other storage models that keep them nearby in memory (e.g. bucket-sorted elements), even if you're not exactly relying on the sort order for e.g. iteration
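As a sketch of the first and last points above in Java terms: when keys can only be compared, not hashed, a comparator-backed TreeMap (a balanced red-black tree) plays the role of std::map. The byte-blob keys and the lexicographic comparator below are assumptions made for illustration:

```java
import java.util.Comparator;
import java.util.TreeMap;

// The keys here are opaque byte blobs we cannot hash meaningfully, but we can
// still order them, so a TreeMap works where a HashMap would not.
public class BlobIndex {
    private static final Comparator<byte[]> LEXICOGRAPHIC = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Byte.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    };

    private final TreeMap<byte[], String> index = new TreeMap<>(LEXICOGRAPHIC);

    void put(byte[] key, String value) { index.put(key, value); }

    // O(log n) per lookup, but no rehashing spikes, and keys stay in sort order,
    // which also helps the "nearby keys" access pattern mentioned above.
    String get(byte[] key) { return index.get(key); }
}
```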

Why does Lucene use arrays instead of hash tables for its inverted index?

I was watching Adrien Grand's talk on Lucene's index architecture and a point he makes is that Lucene uses sorted arrays to represent the dictionary part of its inverted indices. What's the reasoning behind using sorted arrays instead of hash tables (the "classic" inverted index data structure)?
Hash tables provide O(1) insertion and access, which to me seems like it would help a lot with quickly processing queries and merging index segments. On the other hand, sorted arrays can only offer O(log N) access and (gasp) O(N) insertion, although merging two sorted arrays is the same complexity as merging two hash tables.
The only downsides to hash tables that I can think of are a larger memory footprint (this could indeed be a problem) and less cache friendliness (although operations like querying a sorted array require binary search which is just as cache unfriendly).
So what's up? The Lucene devs must have had a very good reason for using arrays. Is it something to do with scalability? Disk read speeds? Something else entirely?
Well, I will speculate here (should probably be a comment - but it's going to be too long).
A HashMap is in general a fast look-up structure with O(1) search time - meaning it's constant. But that is the average case; and since (at least in Java) a HashMap bucket can turn into TreeNodes, the search is O(log n) inside that bucket. Even if we treat their search complexity as O(1), that does not mean they take the same time - it just means the cost is constant for each separate data structure.
Memory: indeed - I will give an example here. In short, storing 15,000,000 entries in a HashMap would require a little over 1 GB of RAM; sorted arrays are probably much more compact, especially since they can hold primitives instead of objects.
Putting entries into a HashMap (usually) requires all the keys to be re-hashed when the table grows, which could be a significant performance hit, since they all potentially have to move to different locations.
One extra point here: range searches would probably require some kind of TreeMap, whereas sorted arrays are much better suited for that. I'm thinking about partitioning an index (maybe they do it internally).
I have the same idea as you - arrays are usually contiguous in memory, so they are probably much easier for a CPU to prefetch.
And the last point: put me in their shoes and I would have started with a HashMap first... I am sure there are compelling reasons for their decision. I wonder if they have actual tests that prove this choice.
I was thinking of the reasoning behind it. Just thought of one use-case that was important in the context of text search. I could be totally wrong :)
Why sorted array and not Dictionary?
Yes, it performs well on range queries, but IMO Lucene was mainly built for text searches. Now imagine you were to run a prefix-based query, e.g. country:Ind*: with a HashMap/Dictionary you would need to scan the whole structure, whereas with a sorted array this becomes O(log n).
Since the array is sorted, it would be inefficient to update it in place. Hence, in Lucene, segments (the inverted index resides in segments) are immutable.
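A small Java sketch of that prefix case: binary search over a sorted term dictionary finds the first term at or after the prefix in O(log n), and the matching terms are then read off contiguously. The example terms and prefix are made up:

```java
import java.util.Arrays;

// Prefix query against a sorted term dictionary.
public class PrefixScan {
    static int lowerBound(String[] sortedTerms, String prefix) {
        int pos = Arrays.binarySearch(sortedTerms, prefix);
        return pos >= 0 ? pos : -pos - 1;   // insertion point if not present
    }

    static void printMatches(String[] sortedTerms, String prefix) {
        for (int i = lowerBound(sortedTerms, prefix); i < sortedTerms.length
                && sortedTerms[i].startsWith(prefix); i++) {
            System.out.println(sortedTerms[i]);
        }
    }

    public static void main(String[] args) {
        String[] terms = { "iceland", "india", "indonesia", "iran", "italy" };
        printMatches(terms, "ind");   // prints: india, indonesia
    }
}
```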

Is there a Database engine that implements Random-Access?

By random access I do not mean selecting a random record; random access is the ability to fetch any record in equal time, the same way values are fetched from an array.
From wikipedia: http://en.wikipedia.org/wiki/Random_access
My intention is to store a very large array of strings, one that is too big for memory, but still have the benefit of random access to the array.
I usually use MySQL, but it seems it only has B-Tree and Hash index types.
I don't see a reason why it isn't possible to implement such a thing. The indexes would be like array indexes, starting from zero and incrementing by 1. I simply want to fetch a string by its index, not get the index for a given string.
The goal is to improve performance. I also cannot control the order in which the strings will be accessed; it'll be a remote DB server which will constantly receive indexes from clients and return the string for each index.
Is there a solution for this?
P.S. I don't think this is a duplicate of "Random-access container that does not fit in memory?", because in that question there are other requirements besides random access.
Given your definition, if you just use an SSD for storing your data, it will allow for what you call random access (i.e. uniform access speed across the data set). The fact that sequential access is less expensive than random access comes from the fact that sequential access to disk is much faster than random access (and any database tries its best to make up for this, by the way).
That said, even RAM access is not uniform, as sequential access is faster due to caching and NUMA. So uniform access is an illusion anyway, which begs the question of why you insist on having it in the first place. In other words, what do you think will go wrong with slow random access? It might still be fast enough for your use case.
You are talking about constant time, but you mention a unique incrementing primary key.
Unless such a key is gapless, you cannot use it as an offset, so you still need some kind of structure to look up the actual offset.
Finding a record by offset isn't usually particularly useful, since you will usually want to find it by some more friendly method, which will invariably involve an index. Searching a B-Tree index is worst case O(log n), which is pretty good.
Assuming you just have an array of strings - store it in a disk file of fixed length records and use the file system to seek to your desired offset.
Then benchmark against a database lookup.
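A minimal Java sketch of that fixed-length-record idea, assuming every string fits in a 256-byte slot (the record size and padding scheme are assumptions): record i starts at byte i * RECORD_SIZE, so a single seek fetches it.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Fixed-length records in a flat file: index -> byte offset -> one seek.
public class FixedRecordStore {
    static final int RECORD_SIZE = 256;

    static String readByIndex(RandomAccessFile file, long index) throws IOException {
        byte[] record = new byte[RECORD_SIZE];
        file.seek(index * RECORD_SIZE);       // O(1) offset computation
        file.readFully(record);
        return new String(record, StandardCharsets.UTF_8).trim();
    }

    static void writeByIndex(RandomAccessFile file, long index, String value) throws IOException {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        byte[] record = new byte[RECORD_SIZE];   // zero-padded slot
        System.arraycopy(bytes, 0, record, 0, Math.min(bytes.length, RECORD_SIZE));
        file.seek(index * RECORD_SIZE);
        file.write(record);
    }
}
```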

Sqlite3 Database versus populating Arrays

I am working on a program that requires me to input values for 12 objects, each with 4 arrays, each with 100 values (4,800 values in total). The 4 arrays represent possible outcomes based on 2 boolean values, i.e. YY, YN, NN, NY, and the 100 values in each array are what I want to extract based on another input variable.
I previously had all possible outcomes in a CSV file and have imported these into sqlite, where I can query them for the value using SQL. However, it has been suggested to me that an sqlite database is not the way to go, and that I should instead populate hardcoded arrays.
Which would be better during run time and for memory management?
If you only need to query the data (no update/delete/insert), I wouldn't suggest using sqlite. I think the hardcoded version beats sqlite in both run time and memory efficiency.
Most likely sqlite will always be less efficient than hardcoded variables, but sqlite would offer other advantages down the road, potentially making maintenance of the code easier. For the amount of data you are talking about, I would think it would be difficult to really notice a difference between 4,800 values stored in the code and stored in a database.
sqlite would easily beat your CSV as far as processing time goes, though, and memory management would depend a lot on how efficiently your language of choice handles .csv versus sqlite connectivity.
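For comparison, the hardcoded-array version could be as simple as the following Java sketch; the double element type and the YY/YN/NY/NN index mapping are assumptions about details the question doesn't spell out:

```java
// A 12 x 4 x 100 table indexed by object, the two booleans, and the input
// variable. The values themselves are placeholders; the real 4,800 numbers
// would be generated or pasted in.
public class OutcomeTable {
    // outcome index derived from the two booleans: YY=0, YN=1, NY=2, NN=3
    static int outcomeIndex(boolean first, boolean second) {
        return (first ? 0 : 2) + (second ? 0 : 1);
    }

    // values[object][outcome][inputVariable]
    static final double[][][] VALUES = new double[12][4][100];

    static double lookup(int object, boolean first, boolean second, int inputVariable) {
        return VALUES[object][outcomeIndex(first, second)][inputVariable];
    }
}
```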
Usually a database is used when you want to handle a lot of data (or could potentially handle a lot of data) and you want a faster way to search part of that data.
If you just need to store a few values, then you probably don't need a database engine.

Storing Array of Floats as a BLOB in Oracle

I am designing a new laboratory database. For some tests, I have several waveforms with ~10,000 data points acquired simultaneously. In the application (written in C), the waveforms are stored as an array of floats.
I believe I would like to store each waveform as a BLOB.
Questions:
Can the data in a BLOB be structured in such a way that Oracle can work with the data itself using only SQL or PL/SQL?
Determine max, min, average, etc
Retrieve index when value first exceeds 500
Retrieve 400th number
Create BLOB which is a derivative of first BLOB
NOTE: This message is a sub-question of Storing Waveforms in Oracle.
The relational data model was designed for this kind of analysis - and Oracle's SQL is more than capable of doing this, if you model your data correctly. I recommend you focus on transforming the array of floats into tables of numbers - I suspect you'll find that the time taken will be more than compensated for by the speed of performing these sorts of queries in SQL.
The alternative is to try to write SQL that will effectively do this transformation at runtime anyway, every time the SQL is run, which will probably be much less efficient.
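If the samples are modelled relationally, as suggested above, the listed operations become single SQL statements. The following JDBC sketch assumes a hypothetical table WAVEFORM_POINT(WAVEFORM_ID, IDX, VAL) with one row per sample; the table, column names and the 1-based IDX are assumptions:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Queries against a hypothetical WAVEFORM_POINT(WAVEFORM_ID, IDX, VAL) table.
public class WaveformQueries {
    // Max, min and average of one waveform.
    static void stats(Connection conn, long waveformId) throws SQLException {
        String sql = "SELECT MAX(val), MIN(val), AVG(val) FROM waveform_point WHERE waveform_id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, waveformId);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    System.out.printf("max=%f min=%f avg=%f%n",
                            rs.getDouble(1), rs.getDouble(2), rs.getDouble(3));
                }
            }
        }
    }

    // First index where the value exceeds 500 (null if it never does).
    static Long firstIndexOver500(Connection conn, long waveformId) throws SQLException {
        String sql = "SELECT MIN(idx) FROM waveform_point WHERE waveform_id = ? AND val > 500";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, waveformId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() && rs.getObject(1) != null ? rs.getLong(1) : null;
            }
        }
    }

    // The n-th number, e.g. valueAt(conn, id, 400), assuming idx is stored 1-based.
    static double valueAt(Connection conn, long waveformId, int idx) throws SQLException {
        String sql = "SELECT val FROM waveform_point WHERE waveform_id = ? AND idx = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, waveformId);
            ps.setInt(2, idx);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getDouble(1);
            }
        }
    }
}
```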
You may also wish to consider the VARRAY type. You do have to work with the entire array (no retrieval of subsets, partial updates, etc.), but you can define a max length and Oracle will store only what you use. You can declare VARRAYs of most any datatype, including BINARY_FLOAT or NUMBER. BINARY_FLOAT will minimize your storage, but it suffers from some minor precision issues (which can be important in financial applications). It is in IEEE 754 format.
Since you're planning to manipulate the data with PL/SQL I might back off from the BLOB design. VARRAYs will be more convenient to use. BLOBs would be very convenient to store an array of raw C floats for later use in another C program.
See PL/SQL Users Guide and Reference for how to use them.
I think that you could probably create PL/SQL functions that take the blob as a parameter and return information on it.
If you could use XMLType for the field, then you can definitely parse it in PL/SQL and write the functions you want.
http://www.filibeto.org/sun/lib/nonsun/oracle/11.1.0.6.0/B28359_01/appdev.111/b28369/xdb10pls.htm
Of course, XML will be quite a bit slower, but if you can't parse the binary data, it's an alternative.
