Is there a Database engine that implements Random-Access?

By random access I do not mean selecting a random record. Random access is the ability to fetch any record in equal time, the same way values are fetched from an array.
From Wikipedia: http://en.wikipedia.org/wiki/Random_access
My intention is to store a very large array of strings, one that is too big for memory, but still have the benefit of random access to the array.
I usually use MySQL, but it seems to offer only B-Tree and Hash index types.
I don't see a reason why it isn't possible to implement such a thing. The indexes would be like in an array, starting from zero and incrementing by 1. I want to simply fetch a string by its index, not get the index for a given string.
The goal is to improve performance. I also cannot control the order in which the strings will be accessed; it will be a remote DB server that constantly receives indexes from clients and returns the string for each index.
Is there a solution for this?
P.S. I don't think this is a duplicate of Random-access container that does not fit in memory? because in that question there are other requirements besides random access.

Given your definition, if you just use an SSD for storing your data, it will allow for what you call random access (i.e. uniform access speed across the data set). The fact that sequential access is cheaper than random access comes from the fact that sequential access to disk is much faster than random access (and any database tries its best to make up for this, by the way).
That said, even RAM access is not uniform, since sequential access is faster due to caching and NUMA. So uniform access is an illusion anyway, which raises the question of why you insist on having it in the first place, i.e. what you think will go wrong with slow random access. It might still be fast enough for your use case.

You are talking about constant time, but you mention a unique incrementing primary key.
Unless such a key is gapless, you cannot use it as an offset, so you still need some kind of structure to look up the actual offset.
Finding a record by offset isn't usually particularly useful, since you will usually want to find it by some more friendly method, which will invariably involve an index. Searching a B-Tree index is worst case O(log n), which is pretty good.
Assuming you just have an array of strings: store it in a disk file of fixed-length records and use the file system to seek to your desired offset.
Then benchmark against a database lookup.
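For example, here is a minimal sketch of that fixed-length-record approach in C. The file name, the 256-byte record size, and the NUL-padding convention are assumptions for the sake of illustration, not part of the question.

```c
#include <stdio.h>

#define RECORD_SIZE 256          /* fixed length of each stored string (assumption) */

/* Fetch the string stored at a given zero-based index.
   Returns 0 on success, -1 on failure. */
int fetch_string(FILE *f, long index, char *out)
{
    if (fseek(f, index * (long)RECORD_SIZE, SEEK_SET) != 0)
        return -1;
    if (fread(out, 1, RECORD_SIZE, f) != RECORD_SIZE)
        return -1;
    out[RECORD_SIZE - 1] = '\0'; /* records are assumed to be NUL-padded */
    return 0;
}

int main(void)
{
    FILE *f = fopen("strings.dat", "rb"); /* hypothetical file of fixed-length records */
    char buf[RECORD_SIZE];

    if (f && fetch_string(f, 42, buf) == 0)
        printf("record 42: %s\n", buf);
    if (f)
        fclose(f);
    return 0;
}
```

Each lookup is one seek plus one read, regardless of the index, which is as close to array-style random access as a disk file gets.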

Related

Hashmap, hashtable, map or any other method

I am looking to compare two values (like whether one is greater than or less than the other) in a HashMap, Hashtable, Map or any other array type.
Could you please help me with this.
Here are some factors that would affect your selection of a data structure:
What is the purpose of the comparison?
What type of data are you comparing?
How often will data be inserted into this data structure?
How often will data be selected from this data structure?
When should you use a HashMap?
One should use a HashMap when the major requirement is only retrieving or modifying data based on a key. For example, in web applications the username is stored as a key and the user data is stored as a value in the HashMap, for faster retrieval of the user data corresponding to a username.
When should you use a HashTable?
A hash table is usually the right default; consider something else (such as a balanced binary tree) when:
- The input can't be hashed (e.g. you're given binary blobs and don't know which bits in there are significant, but you do have an int cmp(const T&, const T&) function you could use for a std::map), or
- the available/possible hash functions are very collision prone, or
- you want to avoid worst-case performance hits for:
  - handling lots of hash-colliding elements (perhaps "engineered" by someone trying to crash or slow down your software)
  - resizing the hash table: unless presized to be large enough (which can be wasteful and slow when excessive memory is used), most implementations will outgrow the arrays they're using for the hash table every now and then, then allocate a bigger array and copy the content across; this can make the specific insertions that trigger the rehashing much slower than the normal O(1) behaviour, even though the average is still O(1); if you need more consistent behaviour in all cases, something like a balanced binary tree may serve you better
- your access patterns are quite specialised (e.g. frequently operating on elements with keys that are "nearby" in some specific sort order), such that cache efficiency is better for other storage models that keep them nearby in memory (e.g. bucket-sorted elements), even if you're not exactly relying on the sort order for e.g. iteration

Expected performance of tries vs bucket arrays with constant load-factor

I know that I can simply use a bucket array for an associative container if I have uniformly distributed integer keys, or keys that can be mapped into uniformly distributed integers. If I can create the array big enough to ensure a certain load factor (which assumes the collection is not too dynamic), then the expected number of collisions for a key will be bounded, because this is simply a hash table with the identity hash function.
Edit: I view strings as equivalent to positional fractions in the range [0..1]. So they can be mapped into any integer range by multiplication and taking the floor of the result.
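For illustration, a minimal sketch of that string-to-bucket mapping in C; using only the first four bytes of the string and the specific bucket count are assumptions, not something stated in the question.

```c
#include <stddef.h>

/* Map a string to a bucket index by treating its first bytes as a
   base-256 fraction in [0, 1) and scaling it to the bucket count. */
size_t bucket_index(const char *s, size_t nbuckets)
{
    double frac = 0.0, scale = 1.0 / 256.0;
    for (int i = 0; i < 4 && s[i] != '\0'; i++) {
        frac += (unsigned char)s[i] * scale;
        scale /= 256.0;
    }
    return (size_t)(frac * nbuckets);   /* floor of fraction * range */
}
```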
I can also do prefix queries efficiently, just like with tries. I presume (without knowing a proof) that the expected number of empty slots corresponding to a given prefix that have to be skipped sequentially before the first bucket with at least one element is reached is also going to be bounded by constant (again depending on the chosen load factor).
And of course, I can do stabbing queries in worst-case constant time, and range queries in solely output sensitive linear expected time (if the conjecture of denseness from the previous paragraph is indeed true).
What are the advantages of tries then?
If the distribution is uniform, I don't see anything that tries do better. But I may be wrong.
If the distribution has large uncompensated skew (because we had no prior probabilities or just looking at the worst case), the bucket array performs poorly, but tries also become heavily imbalanced, and can have linear worst case performance with strings of arbitrary length. So the use of either structure for your data is questionable.
So my question is - what are the performance advantages of tries over bucket arrays that can be formally demonstrated? What kind of distributions elicit those advantages?
I was thinking of distributions with self-similar structure at different scales. I believe those are called fractal distributions, of which I confess to know nothing. Maybe then, if the distribution is prone to clustering at every scale, tries can provide superior performance by keeping the load factor of each node similar, adding levels in dense regions as necessary, something that bucket arrays cannot do.
Thanks
Tries are good if your strings share common prefixes. In that case, the prefix is stored only once and can be queried with linear performance in the output string length. In a bucket array, all strings with the same prefixes would end up close together in your key space, so you have very skewed load where most buckets are empty and some are huge.
More generally, tries are also good if particular patterns (e.g. the letters t and h together) occur often. If there are many such patterns, the order of the trie's tree nodes will typically be small, and little storage is wasted.
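To make the prefix sharing concrete, here is a minimal trie-node sketch in C over a lowercase a-z alphabet (the alphabet restriction and the node layout are assumptions for illustration): every string with the same prefix walks the same chain of nodes, so the prefix is stored only once.

```c
#include <stdlib.h>

#define ALPHABET 26                /* lowercase a-z only (assumption) */

typedef struct TrieNode {
    struct TrieNode *child[ALPHABET];
    int is_terminal;               /* non-zero if a stored string ends here */
} TrieNode;

/* Insert a lowercase string; shared prefixes reuse existing nodes. */
void trie_insert(TrieNode *root, const char *s)
{
    TrieNode *node = root;
    for (; *s; s++) {
        int c = *s - 'a';
        if (!node->child[c])
            node->child[c] = calloc(1, sizeof(TrieNode));
        node = node->child[c];
    }
    node->is_terminal = 1;
}
```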
One of the advantages of tries I can think of is insertion. A bucket array may need to be resized at some point, and that is an expensive operation, so the worst-case insertion time into a trie is much better than into a bucket array.
Another thing is that you need to map each string to a fraction to use it with bucket arrays. So if you have short keys, a trie can theoretically be more efficient, because you don't need to do the mapping.

Optimize I/O to a large file in C

I have written a C program that reads data from a huge file (>3 GB). Each record in the file is a key-value pair. Whenever a query comes in, the program searches the file for the key and retrieves the corresponding value, and similarly for updates.
The queries come in at a fast rate, so this technique will eventually fail: the worst-case access time is too large. Loading everything into an in-memory object would again be a bad idea, because of the size.
Is there any way in which this problem can be sorted out?
Sure seems to me a file of that size wrapping a series of name-value pairs is begging to be migrated to an actual database; failing that, I'd probably at least explore the idea of a memory-mapped file, with only portions resident at any given time...
How large are the keys, in comparison to their corresponding values? If they are significantly smaller, you might try creating a table in memory between the keys and the corresponding locations within the file of their values.
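As a rough sketch of that in-memory key-to-offset idea in C (the integer keys and the sorted index array are assumptions): build the index once by scanning the file, then answer each query with a binary search followed by a single seek to the recorded offset.

```c
#include <stddef.h>

/* One index entry: a key and the byte offset of its value in the data file. */
struct index_entry {
    long key;
    long offset;
};

/* Binary-search a sorted in-memory index for a key.
   Returns the file offset of its value, or -1 if the key is absent;
   the caller can then fseek() straight to that offset. */
long find_offset(const struct index_entry *idx, size_t n, long key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (idx[mid].key < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo < n && idx[lo].key == key) ? idx[lo].offset : -1;
}
```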

Memory optimization in C for an array of 1 million records

I am writing a program which requires me to create an array of a million records. The array indices are unique IDs (0 to a million, each representing a unique product ID). At first all elements are initialized to zero; they are incremented as products are sold.
This approach, however, has a high space cost (4 bytes times a million records). Later I saw that only certain products need frequent updating. So is there any way I can reduce memory usage while still keeping track of all the products?
If you don't need frequent updating then you can store all the results in a file. Whenever you update an entry you can just create a temp file with all the other entries plus the updated one, and then rename the temp file over the original using rename(temp,new);.
That said, an array of a million records doesn't require that much memory (just 4 megabytes), so your approach is the best and easiest one.
The best approach (algorithmically) would be to use a hash table to store all the entries. But if you are not an expert in C, then implementing a hash table could be a problem for you.
This sounds more like a situation for a table in a database than an in-memory array to me. If your use case allows for it, I'd use a database instead.
Otherwise, if in your use case:
a significant fraction of the products will eventually be used,
RAM is limited,
external storage (disk, serial memory) is available,
average access performance comparable to RAM speeds is required, and
increased worst case access time is acceptable,
then you could try some sort of caching scheme (LRU, maybe?). This will use more code space, somewhat increase your average access time, and more significantly increase your worst-case access time.
If a large fraction of the products will not just be used infrequently but never used at all, then you should look into #fatrock92's suggestion of a hash table.
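A minimal sketch of that hash-table idea in C, assuming separate chaining and that only products actually sold get an entry; the bucket count and function names are hypothetical.

```c
#include <stdlib.h>

#define NBUCKETS 4096                      /* assumption: tune to the expected load */

struct entry {
    unsigned id;                           /* product id */
    unsigned count;                        /* units sold */
    struct entry *next;
};

static struct entry *buckets[NBUCKETS];

/* Increment the sales count for a product, creating its entry on first sale. */
void record_sale(unsigned id)
{
    struct entry *e = buckets[id % NBUCKETS];
    for (; e; e = e->next)
        if (e->id == id) { e->count++; return; }

    e = malloc(sizeof *e);
    if (!e) return;                        /* sketch: ignore allocation failure */
    e->id = id;
    e->count = 1;
    e->next = buckets[id % NBUCKETS];
    buckets[id % NBUCKETS] = e;
}
```

Memory is then proportional to the number of distinct products sold rather than to the full ID range.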
It's better to use dynamic memory allocation for the array.
Using malloc or realloc gives you a more flexible way to allocate memory.
I think you know how to use malloc and realloc.
You can use a linked list, so you can add or update elements in your list whenever you need to.
You can also hold the last access time in each node, so you'd be able to remove nodes that have not been used lately.

Is it faster to search for a large string in a DB by its hashcode?

If I need to retrieve a large string from a DB, is it faster to search for it using the string itself, or would I gain by hashing the string, storing the hash in the DB as well, and then searching based on that?
If yes, what hash algorithm should I use? (Security is not an issue, I am looking for performance.)
If it matters: I am using C# and MSSQL2005
In general: probably not, assuming the column is indexed. Database servers are designed to do such lookups quickly and efficiently. Some databases (e.g. Oracle) provide options to build indexes based on hashing.
However, in the end this can be only answered by performance testing with representative (of your requirements) data and usage patterns.
I'd be surprised if this offered huge improvement and I would recommend not using your own performance optimisations for a DB search.
If you use a database index there is scope for performance to be tuned by a DBA using tried and trusted methods. Hard-coding your own index optimisation will prevent this and may stop you from gaining any performance improvements in indexing in future versions of the DB.
Though I've never done it, it sounds like this would work in principle. There's a chance you may get false positives but that's probably quite slim.
I'd go with a fast algorithm such as MD5 as you don't want to spend longer hashing the string than it would have taken you to just search for it.
The final thing I can say is that you'll only know if it is better if you try it out and measure the performance.
Are you doing an equality match, or a containment match? For an equality match, you should let the db handle this (but add a non-clustered index) and just test via WHERE table.Foo = @foo. For a containment match, you should perhaps look at full-text indexing.
First - MEASURE it. That is the only way to tell for sure.
Second - If you don't have an issue with the speed of the string searching, then keep it simple and don't use a Hash.
However, for your actual question (and just because it is an interesting thought): it depends on how similar the strings are. Remember that the DB engine doesn't need to compare all the characters in a string, only enough to find a difference. If you are looking through 10 million strings that all start with the same 300 characters then the hash will almost certainly be faster. If, however, you are looking for the only string that starts with an x, then the string comparison could be faster. I think, though, that SQL will still have to read the entire string from disk, even if it then only uses the first byte (or first few bytes for multi-byte characters), so the total string length will still have an impact.
If you are trying the hash comparison then you should make the hash an indexed calculated column. It will not be faster if you are working out the hashes for all the strings each time you run a query!
You could also consider using SQL's CRC function. It produces an int, which will be even quicker to compare and faster to calculate. But you will have to double-check the results of this query by actually testing the string values, because the CRC function is not designed for this sort of usage and is much more likely to return duplicate values. You will need to do the CRC or hash check in one query, then have an outer query that compares the strings. You will also want to watch the generated QEP to make sure the optimiser is processing the query in the order you intended; it might decide to do the string comparisons first and the CRC or hash checks second.
As someone else has pointed out, this is only any good if you are doing an exact match. A hash can't help if you are trying to do any sort of range or partial match.
If your strings are short (fewer than 100 characters in general), the string search will be faster.
If the strings are large, a hash search may, and most probably will, be faster.
HashBytes(MD4) seems to be the fastest on DML.
If you use a fixed length field and an index it will probably be faster...
TIP: if you are going to store the hash in the database, an MD5 hash is always 16 bytes, so it can be saved in a uniqueidentifier column (and a System.Guid in .NET).
This might offer some performance gain over saving hashes in a different way (I use this method to check for binary/ntext field changes but not for strings/nvarchars).
The 'ideal' answer is definitely yes.
String matching against an indexed column will always be slower than matching a hash value stored in an indexed column. This is what hash values are designed for: they take a large dataset (e.g. 3000 comparison points, one per character) and coalesce it into a smaller dataset (e.g. 16 comparison points, one per byte).
So, the most optimized string comparison tool will be slower than the optimized hash value comparison.
However, as has been noted, implementing your own optimized hash function is dangerous and likely to not go well. (I've tried, and failed miserably.) Hash collisions are not particularly a problem, because then you will just have to fall back on the string matching algorithm, which means this would be (at worst) exactly as fast as your string comparison method.
But, this is all assuming that your hashing is done in an optimal fashion, (which it probably won't be) and that there will not be any bugs in your hashing component (which there will be) and that the performance increase will be worth the effort (probably not). String comparison algorithms, especially in indexed columns are already pretty fast, and the hashing effort (programmer time) is likely to be much higher than your possible gain.
And if you want to know about performance, Just Measure It.
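To illustrate the hash-then-verify idea outside the database, here is a small sketch in C; the 16-byte digest field stands in for a precomputed MD5 and is an assumption. The fixed-size memcmp rejects most non-matches cheaply, and the full string comparison runs only on a digest match, which also resolves collisions.

```c
#include <string.h>

struct record {
    unsigned char digest[16];   /* precomputed hash of the text (e.g. MD5) */
    const char *text;           /* the large string itself */
};

/* Return non-zero if the record matches the needle: compare digests first,
   then confirm with the full string comparison to rule out collisions. */
int record_matches(const struct record *r,
                   const unsigned char needle_digest[16],
                   const char *needle_text)
{
    if (memcmp(r->digest, needle_digest, 16) != 0)
        return 0;
    return strcmp(r->text, needle_text) == 0;
}
```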
I am confused and am probably misunderstanding your question.
If you already have the string (thus you can compute the hash), why do you need to retrieve it?
Do you use a large string as the key for something perhaps?
