All the records in a database are saved in (key, value) pairs. Records can always be retrieved by specifying the key. A data structure needs to be developed to handle the following scenarios:
Access all the records in a linear fashion (an array or linked list is the best data structure for this scenario, accessing everything in O(N) time)
Retrieve a record by providing its key (a hash table can be implemented to index the records with O(1) lookups)
Retrieve the set of records that have a given digit at a particular position in the key. Ex: list all records whose 2nd digit (10's place) is 5; if the keys are 256, 1452, 362 and 874, the records for keys 256 and 1452 should be returned
I am assuming your keys are at most d digits long (in decimal).
How about a normal hash table and an additional 10*d two-dimensional array of sets (let's call it A)? A[i][j] is the set of keys which have digit i in the jth position. The sets can support O(1) insert/delete if they are themselves implemented as hash tables.
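A minimal sketch of this in Python (the names and the NUM_DIGITS = 4 bound are my own, not from the question):

    NUM_DIGITS = 4  # assumed maximum key length d

    records = {}  # the normal hash table: key -> record
    # A[i][j]: the set of keys which have digit i in the jth position (0 = ones place)
    A = [[set() for _ in range(NUM_DIGITS)] for _ in range(10)]

    def insert(key, record):
        records[key] = record
        for j in range(NUM_DIGITS):
            A[(key // 10 ** j) % 10][j].add(key)

    def delete(key):
        del records[key]
        for j in range(NUM_DIGITS):
            A[(key // 10 ** j) % 10][j].discard(key)

    def records_with_digit(digit, position):
        """Scenario 3: all records whose key has `digit` at `position`."""
        return [records[k] for k in A[digit][position]]

With the keys from the question inserted, records_with_digit(5, 1) returns the records for 256 and 1452.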
For points 1 and 2, I think a linked hash map is a good choice.
For point 3, add a hash map with a (digit, position) tuple as the key and a list of pointers to the matching records as the value.
Both data structures can be wrapped inside one, and both will point to the same data, of course.
Store the keys in a trie. For the numbers in your example (assuming 4-digit, zero-padded numbers) it looks like this:
    *root*
      |
      0 -- 2 - 5 - 6
      |    |
      |    +- 3 - 6 - 2
      |    |
      |    +- 8 - 7 - 4
      |
      1 -- 4 - 5 - 2
This data structure can be traversed in a way that returns (1) or (3). It won't be quite as fast for (3) as maintaining an index for each digit would be, so I guess it's a question of whether space or lookup time is your primary concern. For (2), lookup is already O(log n), but if you need O(1), you could store the keys in both the trie and a hash table.
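A small Python sketch of such a digit trie (my own illustration; it zero-pads keys to a fixed width, as the diagram does):

    WIDTH = 4  # assumed fixed key width

    class TrieNode:
        def __init__(self):
            self.children = {}  # digit character -> TrieNode
            self.key = None     # set on leaf nodes

    def insert(root, key):
        node = root
        for ch in str(key).zfill(WIDTH):
            node = node.children.setdefault(ch, TrieNode())
        node.key = key

    def keys_with_digit(node, digit, position, depth=0):
        """Requirement (3): all keys having `digit` at `position` (0 = leftmost)."""
        if depth == WIDTH:
            yield node.key
            return
        for ch, child in node.children.items():
            if depth != position or ch == digit:
                yield from keys_with_digit(child, digit, position, depth + 1)

    root = TrieNode()
    for k in (256, 1452, 362, 874):
        insert(root, k)
    print(sorted(keys_with_digit(root, '5', 2)))  # 10's place of a 4-digit key -> [256, 1452]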
The first thing that comes to mind is embedding a pair of nodes in each record. One of the nodes would be part of a tree sorted by the record index, the other part of a tree sorted by the record key. You then have quick access to the records by index or key using these trees, and you can also quickly visit records in sequential index or key order. This covers the first and second requirements.
You can add another node for a tree of records whose keys contain 5 in the tens position. That covers the third requirement.
Extra benefit: the same tree handling code will be used in all cases.
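A simplified Python sketch of the idea (sorted lists stand in for the intrusive trees here; in C you would embed the tree nodes directly in the record struct):

    import bisect

    class Record:
        def __init__(self, index, key, payload):
            self.index, self.key, self.payload = index, key, payload

    by_index = []   # (index, record), sorted: stands in for the index-ordered tree
    by_key = []     # (key, record), sorted: stands in for the key-ordered tree
    tens_is_5 = []  # (key, record) for keys with 5 in the tens position

    def insert(rec):  # assumes indexes and keys are unique
        bisect.insort(by_index, (rec.index, rec))
        bisect.insort(by_key, (rec.key, rec))
        if (rec.key // 10) % 10 == 5:
            bisect.insort(tens_is_5, (rec.key, rec))

    for i, k in enumerate((256, 1452, 362, 874)):
        insert(Record(i, k, payload=None))
    print([k for k, _ in by_key])     # records in key order: [256, 362, 874, 1452]
    print([k for k, _ in tens_is_5])  # third requirement: [256, 1452]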
A dictionary (hash map, etc.) would easily handle those requirements, although your third requirement would be an O(N) operation: you just iterate over the keys and select those that match your criteria. You don't say what your desired performance is for that case.
But O(N) might be plenty fast enough. How many items are in your data set, and how often will you be performing that third function?
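For instance, the O(N) version of the third operation is just a filter over the dictionary (a sketch):

    records = {256: "a", 1452: "b", 362: "c", 874: "d"}

    # linear scan: keys with 5 in the 10's place
    matches = {k: v for k, v in records.items() if (k // 10) % 10 == 5}
    print(matches)  # {256: 'a', 1452: 'b'}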
I have a table with 3 clustering keys:
K1 K2 C1 C2 C3 V1 V2
Where K1 & K2 are the partition keys, C1, C2 and C3 are the clustering keys and V1 and V2 are two value columns.
In this example, C1, C2 and C3 represent the 3 coordinates of some shape (where each coordinate is a number from 1 to approx. 500). Each partition key in our table is linked to several hundred different values.
If I want to search for a row of K1 & K2 that has a clustering key equal to C1 = 50, C2 = 450 and C3 = 250, how would Scylla execute this search, assuming the clustering key is sorted from lowest to highest (ASC order)? Does Scylla always start searching from the beginning of the column to see whether a given key exists? In other words, I'm assuming Scylla will first search the C1 column for the value 50. If Scylla reaches a value of 51+ without finding 50, it can stop searching the rest of C1's data, since the data is sorted and there's no way for the value 50 to appear after 51. In that case it would not even need to check whether C2 contains the value 450, since all 3 clustering columns need to match. If, however, C1 does contain the value 50, it will move on to C2 and search (once again starting from the first entry of the C2 column) for the value 450.
However, when C1 = 50 it would indeed be more efficient to start from the beginning, but when C2 = 450 (and the highest value is 500) it would be more efficient to start from the end. This is assuming Scylla "knows" the lowest/highest value of each clustering column.
Does Scylla, in fact, optimize search in this fashion or does it take an entirely different approach?
Perhaps another way to phrase the question is as follows: does it normally take Scylla longer to search for C1 = 450 than for C1 = 50? Obviously, since we only have a small dataset in this example the effect won't be huge, but if the dataset contained tens of thousands of entries the effect would be more pronounced.
Thanks
Scylla has two data structures to search when executing a query on the replica:
The in-memory data structure, used by row-cache and memtables
The on-disk data structure, used by SSTables
In both cases, data is organized on two levels: the partition level and the row level. So Scylla will first look up the sought-after partition, then look up the sought-after row (by its clustering key) inside the already looked-up partition.
Both of these data structures are sorted on both levels and in general the lookup happens via a binary search.
We use a B-tree in memory; binary search in this works as expected.
The SSTable files include two indexes: an index, which covers all the partitions found in the data file, and a so-called "summary", which is a sampled index of the index; the latter is always kept in memory in its entirety. Furthermore, if a partition has a lot of rows, the index will contain a so-called "promoted index", which is a sampled index of the clustering keys found therein. So a query will first locate the "index page" using a binary search in the in-memory summary. The index page is the portion of the index file which contains the index entry for the sought-after partition. This index page is then read linearly until the index entry of interest is found. This gives us the start position of the partition in the data file. If the index entry also contains a promoted index, we can furthermore do a lookup in that to get a position closer to the start of the sought-after row in the data file. We then start parsing the data file at that position until the given row is found.
Note that clustering keys are not matched one column at a time. A clustering key, regardless of how many components it has, is treated as a single value: a tuple of one or more components. When a query doesn't specify all the components, an incomplete tuple called a "prefix" is created, which can still be compared against other partial or full keys.
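As an illustration of that last point (my own sketch, not Scylla's actual code), treating the clustering key as a sorted tuple makes both the full-key lookup and the prefix lookup a binary search:

    import bisect

    # rows of one partition, sorted by the whole clustering key (C1, C2, C3)
    rows = sorted([
        ((50, 450, 250), ("v1", "v2")),
        ((50, 451, 100), ("v3", "v4")),
        ((51,  10,  10), ("v5", "v6")),
    ])
    keys = [k for k, _ in rows]

    # full key: a single binary search over tuples, not a column-by-column scan,
    # so C1 = 450 is found just as fast as C1 = 50
    i = bisect.bisect_left(keys, (50, 450, 250))
    print(rows[i])

    # prefix (C1 = 50 only): matching rows form one contiguous, binary-searchable range
    lo = bisect.bisect_left(keys, (50,))
    hi = bisect.bisect_left(keys, (51,))
    print(rows[lo:hi])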
I'm having a hard time wrapping my head around this concept of Alternative 1 vs. Alternatives 2/3 for data entries in an index. Here is an excerpt from some notes:
Alternative 1: the actual data record (with key value k)
– If this is used, the index structure is a file organization for the data records (like heap files or sorted files).
– At most one index on a given collection of data records can use Alternative 1.
– This alternative saves pointer lookups but can be expensive to maintain with insertions and deletions.
Alternative 2: (k, rid of matching data record)
Alternative 3: (k, list of rids of matching data records)
– Easier to maintain than Alternative 1.
– If more than one index is required on a given file, at most one index can use Alternative 1; the rest must use Alternatives 2 or 3.
– Alternative 3 is more compact than Alternative 2, but leads to variable-sized data entries even if search keys are of fixed length.
– Even worse, for large rid lists the data entry would have to span multiple blocks!
Can someone help me understand this by providing some concrete examples?
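For what it's worth, here is one way to picture the three alternatives, using a made-up employee file where the search key k is age and a rid is a (page, slot) pair:

    # the heap file of actual data records, addressed by rid = (page, slot)
    heap_file = {
        (1, 0): {"name": "Ann", "age": 30},
        (1, 1): {"name": "Bob", "age": 30},
        (2, 0): {"name": "Cal", "age": 41},
    }

    # Alternative 1: the data entries ARE the records, so the index (e.g. a
    # B+ tree on age) is itself the file organization for the records
    alt1 = [(30, {"name": "Ann", "age": 30}),
            (30, {"name": "Bob", "age": 30}),
            (41, {"name": "Cal", "age": 41})]

    # Alternative 2: one (k, rid) data entry per matching record
    alt2 = [(30, (1, 0)), (30, (1, 1)), (41, (2, 0))]

    # Alternative 3: one (k, [rid, ...]) entry per key value; more compact, but
    # the entries are variable-sized, and a popular key can need a huge rid list
    alt3 = [(30, [(1, 0), (1, 1)]), (41, [(2, 0)])]

Only one index can use Alternative 1, because the records themselves can only be physically stored in one order.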
I have a set of data which has a name, some sub-values and then an associated numeric value. For example:
James Value1 Value2 "1.232323/1.232334"
Jim Value1 Value2 "1.245454/1.232999"
Dave Value1 Value2 "1.267623/1.277777"
There will be around 100,000 entries like this, stored in either a file or a database. I would like to know the quickest way of returning the results which match a search, along with their associated numeric value.
For example, a query of "J" would return both the James and Jim results, along with the numeric values in the last column.
I've heard people mention binary tree searching, dictionary searching and indexed searching. I have no idea which is a good route to pursue.
This is a poorly characterized problem. As with many optimization problems, there are trade-offs in resources. If you truly want the fastest response possible, then a likely approach is to compile all possible searches into a table of prepared results, so that, given a search key, you can look the search key up in the table and return the result.
Assuming your character set is limited to A-Z and a-z, a table with an entry for each search key from 0 to 4 characters will use a modest amount of memory by today’s standards. Each table entry merely needs to have two values in it: The start and end positions in a list of the numeric values. (Compile the list in this way: Sort the records by the name field. Extract just the numeric values from the records, maintaining the order, putting them in a list. Any search key must return a sublist of contiguous records from that list. This is because the search is for a prefix string of the name field, so any records that match the search key are adjacent, when sorted by the name field.)
Thus, to create a table to look up any key of 0 to 4 characters, you need fewer than 53^4 entries in a table of pairs, where each member of the pair contains a record number (32 bits or fewer). So 8•53^4 bytes ≈ 60.2 MiB suffices. (53 is because you have 52 characters plus one sentinel character to mark the end of the key. Alternate encodings could reduce this some.)
To support keys of more than 4 characters, you need to extend this. With typical data, 4 characters will have narrowed down the search greatly, so you can take the set of records indicated by the first 4 characters and prune it to get the final results. If the data has pathological cases where 4 characters does not reduce the search much, you can embellish this technique.
So, is that really what you want to do, make the speed as fast as possible, regardless of other resources (including engineering time) consumed? If not, what are your actual goals?
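If raw speed is not the overriding goal, the same "contiguous sublist" observation gives a much simpler scheme: sort once, then binary-search for the prefix range. A sketch:

    import bisect

    data = sorted([
        ("James", "1.232323/1.232334"),
        ("Jim",   "1.245454/1.232999"),
        ("Dave",  "1.267623/1.277777"),
    ])                                    # sorted by name, so any prefix's matches are adjacent
    names = [name for name, _ in data]

    def search(prefix):
        lo = bisect.bisect_left(names, prefix)
        hi = bisect.bisect_left(names, prefix + "\uffff")  # just past the last possible match
        return data[lo:hi]

    print(search("J"))  # [('James', '1.232323/1.232334'), ('Jim', '1.245454/1.232999')]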
I have data in the form of:
ID ATTR
3 10
1 20
1 20
4 30
... ...
Where ID and ATTR are unsorted and may contain duplicates. The IDs range from 1 to about 20,000, and ATTR is an unsigned int. There may be anywhere between 100,000 and 500,000 pairs that I need to process at a single time.
I am looking for:
The number of unique pairs.
The number of times a non-unique pair pops up.
So in the above data, I'd want to know that (1,20) appeared twice and that there were 3 unique pairs.
I'm currently using a hash table in my naive approach. I keep a counter of unique pairs, and decrement the counter if the item I am inserting is already there. I also keep an array of IDs of the non-unique pairs. (All on first encounters)
Performance and size are about equal concerns. I'm actually OK with a relatively high (say 0.5%) rate of false positives given those concerns. (I've also implemented this using a spectral Bloom filter.)
I'm not that smart, so I'm sure there's a better solution out there, and I'd like to hear about your favorite hash table implementations/any other ideas. :)
A hash table with keys like <id>=<attr> is an excellent solution to this problem. If you can tolerate errors, you can get smaller/faster with a Bloom filter, I guess. But do you really need to do that?
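A sketch of that approach with Python's Counter, using the (id, attr) tuple itself as the hash key:

    from collections import Counter

    pairs = [(3, 10), (1, 20), (1, 20), (4, 30)]

    counts = Counter(pairs)  # one hash-table entry per distinct pair
    print(len(counts))       # 3 unique pairs
    print({p: n for p, n in counts.items() if n > 1})  # {(1, 20): 2}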
I understand that some hash tables use "buckets", each of which is a linked list of "entries".
HashTable
-size //total possible buckets to use
-count // total buckets in use
-buckets //linked list of entries
Entry
-key //key identifier
-value // the object you are storing for reference
-next //the next entry
In order to get the bucket by index, you have to call:
myBucket = someHashTable[hashIntValue]
Then, you could iterate the linked list of entries until you find the one you are looking for, or reach null.
Does the hash function always return a NUMBER % HashTable.size? That way, you stay within the limit? Is that how the hash function should work?
Mathematically speaking, a hash function is usually defined as a mapping from the universe of elements you want to store in the hash table to the range {0, 1, 2, ..., numBuckets - 1}. This means that in theory, there's no requirement whatsoever that you use the mod operator to map some integer hash code into the range of valid bucket indices.
However, in practice, almost universally programmers will use a generic hash code that produces a uniformly-distributed integer value and then mod it down so that it fits in the range of the buckets. This allows hash codes to be developed independently of the number of buckets used in the hash table.
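For instance (a sketch; Python's built-in hash stands in for any generic hash code):

    num_buckets = 16
    key = "apple"

    # the hash code is computed independently of the table size,
    # then reduced to a valid bucket index
    bucket_index = hash(key) % num_buckets
    print(bucket_index)  # always in range(0, num_buckets)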
EDIT: Your description of a hash table is called a chained hash table and uses a technique called closed addressing. There are many other implementations of hash tables besides the one you've described. If you're curious - and I hope you are! :-) - you might want to check out the Wikipedia page on the subject.
What is a hash table?
It is also known as a hash map, and it is a data structure used to implement an associative array: a structure that can map keys to values.
How does it work?
A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found.
Advantages:
In a well-dimensioned hash table, the average cost for each lookup is independent of the number of elements stored in the table.
Many hash table designs also allow arbitrary insertions and deletions of key-value pairs.
In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure.
Disadvantages:
Hash tables are not effective when the number of entries is very small. (However, in some cases the high cost of computing the hash function can be mitigated by saving the hash value together with the key.)
Uses:
They are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches and sets.
There is no predefined rule for how a hash function should behave. You could have all of your values map to index 0, and that would be a perfectly valid hash function (it performs poorly, but it works).
Of course, if your hash function returns a value outside the range of indices of the associated array, it won't work correctly. That's not to say, however, that you need to use the formula (number % TABLE_SIZE).
No, the table is typically an array of entries. You don't iterate it until you find the same hash; you use the hash result (or, usually, the hash modulo numBuckets) to directly index into the array of entries. That gives you the O(1) behaviour (iterating would be O(n)).
When you try to store two different objects with the same hash result (called a 'hash collision'), you have to find some way to make space. Different implementations vary in how they handle collisions: you can create a linked list of all the objects with the same hash, or use some rehashing scheme to store the object in a different entry of the table.
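A minimal chained hash table along those lines (a sketch in Python, mirroring the Entry/HashTable layout from the question):

    class Entry:
        def __init__(self, key, value, next_entry=None):
            self.key, self.value, self.next = key, value, next_entry

    class HashTable:
        def __init__(self, size=16):
            self.size = size
            self.buckets = [None] * size  # each slot heads a linked list of entries

        def put(self, key, value):
            i = hash(key) % self.size     # direct index, no scanning: O(1) on average
            entry = self.buckets[i]
            while entry:                  # collision: walk this bucket's chain
                if entry.key == key:
                    entry.value = value   # key already present: overwrite
                    return
                entry = entry.next
            self.buckets[i] = Entry(key, value, self.buckets[i])  # prepend new entry

        def get(self, key):
            entry = self.buckets[hash(key) % self.size]
            while entry:
                if entry.key == key:
                    return entry.value
                entry = entry.next
            return None                   # not found

    table = HashTable()
    table.put("a", 1)
    table.put("b", 2)
    print(table.get("a"), table.get("b"), table.get("c"))  # 1 2 None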