I have a relation R with 10 blocks and a relation S with 1,000 blocks.
There are 50 distinct values for attribute A in relation R, and 5,000 distinct values for attribute A in relation S.
Each block holds 100 records.
Note that we assume a uniform distribution of the different values in each relation.
S has a clustering index on the join attribute A.
The question is: how many blocks of S store records that participate in the join with R? I need to answer for the best and the worst case.
I thought that since there are 50 distinct values of A in R and S has a clustering index on A, each value would take a minimum of 1 block and a maximum of 2, so the answer would be 50 or 100.
But why can't I pack 5 distinct values into each block, so that only 10 blocks are needed?
As far as I understand this is the situation:
S has 1,000 blocks with 100 records/block, which gives 100,000 records (max). Among those 100,000 records there are 5,000 distinct values for attribute A.
Edit:
If they are evenly distributed, every distinct value of A has 20 rows in S. If all 50 distinct values of A in R are also present in S, then 50 row groups would be fetched.
In the best case they are all stored together (thanks to the clustered index) and you need to read 10 blocks: (50 values of A × 20 rows per value in S) / 100 records per block = 10 blocks.
In the worst case the 20 rows for each value of A span 2 blocks. This would mean reading 100 blocks from S.
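A quick sanity check of that arithmetic (plain Python, only the numbers from the question):

    # Best/worst case for the number of S blocks read, uniform distribution assumed.
    records_per_block = 100
    distinct_a_in_r = 50                      # distinct A values coming from R
    rows_per_value_in_s = 100_000 // 5_000    # 20 rows in S per distinct A value

    matching_rows = distinct_a_in_r * rows_per_value_in_s    # 1,000 rows of S join with R

    # Best case: the clustering index packs the matching rows contiguously.
    best_case = matching_rows // records_per_block           # 10 blocks

    # Worst case: every group of 20 rows straddles a block boundary.
    worst_case = distinct_a_in_r * 2                         # 100 blocks

    print(best_case, worst_case)                             # 10 100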
To your second question:
Since you have a clustered index containing column A, all rows with the same value of A are stored together. They only use more than one block if they don't fit into one, or if the block is partly filled by other values so the group can't fit in a single block.
Attention: I may not have fully understood your initial question, so my answer could be totally wrong!
I have a table with 3 clustering keys:
K1 K2 C1 C2 C3 V1 V2
Where K1 & K2 are the partition keys, C1, C2 and C3 are the clustering keys and V1 and V2 are two value columns.
In this example, C1, C2 and C3 represent the 3 coordinates of some shape (where each coordinate is a number from 1 to approx. 500). Each partition key in our table is linked to several hundred different values.
If I want to search for a row of K1 & K2 that has a clustering key equal to C1 = 50, C2 = 450 and C3 = 250, how would Scylla execute this search, assuming the clustering key is sorted from lowest to highest (ASC order)? Does Scylla always start searching from the beginning of the column to see whether a given key exists?

In other words, I'm assuming Scylla will first search the C1 column for the value 50. If Scylla reaches a value of 51+ without finding 50, it could stop searching the rest of the C1 column, since the data is sorted and there's no way for the value 50 to appear after 51. In this case, it would not even need to check whether C2 contains the value 450, since we need all 3 clustering columns to match. If, however, C1 contains the value 50, it will move on to C2 and search (once again starting from the first entry of the C2 column) whether C2 contains the value 450.
However, when C1 = 50 it would indeed be more efficient to start from the beginning. But when C2 = 450 (and the highest index = 500) it would be more efficient to start from the end. This is assuming Scylla “knows” the lowest / highest value for each of the clustering columns.
Does Scylla, in fact, optimize search in this fashion or does it take an entirely different approach?
Perhaps another way to phrase the question is as follows: does it normally take Scylla longer to search for C1 = 450 vs. C1 = 50? Obviously, since we only have a small dataset in this example the effect won't be huge, but if the dataset contained tens of thousands of entries the effect would be more pronounced.
Thanks
Scylla has two data structures to search when executing a query on the replica:
The in-memory data structure, used by row-cache and memtables
The on-disk data structure, used by SStables
In both cases, data is organized on two levels: the partition level and the row level. So Scylla will first look up the sought-after partition, then look up the sought-after row (by its clustering key) within the already-located partition.
Both of these data structures are sorted on both levels and in general the lookup happens via a binary search.
We use a btree in memory, binary search in this works as expected.
The SStable files include two indexes: an index, which covers all the partitions found in the data file, and a so-called "summary", which is a sampled index of the index; the latter is always kept in memory in its entirety. Furthermore, if the partition has a lot of rows, the index will contain a so-called "promoted index", which is a sampled index of the clustering keys found therein. So a query will first locate the "index-page" using a binary search in the in-memory summary. The index-page is the portion of the index file which contains the index entry for the sought-after partition. This index-page is then read linearly until the index entry of interest is found. This gives us the start position of the partition in the data file. If the index entry also contains a promoted index, we can furthermore do a lookup in that to get a position closer to the start of said row in the data file. We then start parsing the data file at that position until the given row is found.
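A very stripped-down sketch of that lookup path (illustration only; summary and read_index_page are invented placeholders, not Scylla APIs):

    from bisect import bisect_right

    def locate_partition(summary, read_index_page, partition_key):
        # `summary` is a sorted, sampled list of (first_partition_key, index_offset)
        # pairs kept in memory; `read_index_page(offset)` yields (partition_key,
        # data_position) entries from the index file. Both are placeholders.

        # 1. Binary search the in-memory summary for the index page that may
        #    contain the sought-after partition.
        keys = [k for k, _ in summary]
        page = bisect_right(keys, partition_key) - 1
        if page < 0:
            return None
        _, offset = summary[page]

        # 2. Read that index page linearly until the entry of interest is found;
        #    it gives the partition's start position in the data file.
        for key, data_position in read_index_page(offset):
            if key == partition_key:
                return data_position
        return None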
Note that clustering keys are not found one column at a time. A clustering key, regardless of how many components it has, is treated as a single value: a tuple of 1+ components. When a query doesn't specify all the components, an incomplete tuple called a "prefix" is created. This can be compared to other partial or full keys.
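A toy illustration of that tuple/prefix comparison (Python tuples compare lexicographically, like the key comparator described above):

    from bisect import bisect_left

    # Clustering keys (C1, C2, C3) of one partition, kept in sorted order.
    rows = sorted([(50, 450, 250), (50, 451, 10), (51, 1, 1)])

    # A fully specified key is looked up with one binary search over whole tuples,
    # not one column at a time.
    full_key = (50, 450, 250)
    i = bisect_left(rows, full_key)
    print(i < len(rows) and rows[i] == full_key)                # True

    # A query that restricts only C1 builds a prefix and compares it the same way.
    prefix = (50,)
    j = bisect_left(rows, prefix)                               # first key >= the prefix
    print(j < len(rows) and rows[j][:len(prefix)] == prefix)    # True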
I have a reference table (REF), and a table of incomplete data (ICD). Both record structures are the same. In this case it happens to contain person related data. I am trying to fill in records in ICD with data from the best matching row in REF, if a best matching row exists within reason. The ICD table may have or be missing first name, middle name, last name, address, city, state, zip, dob, and some others.
I need to ensure that the data I fill in is 100% accurate, so I already know I will not be able to fill it in on ICD rows where there isn't a reasonable match.
To this point I have written close to a dozen variations of matching queries, running over currently unmatched rows, and bringing over data in the case of a match. I know there must be a better way.
What I would like to do is to set up a weight on certain criteria, and then take the summed weight for all satisfied criteria, and if above a certain threshold, honor the match. For example:
first name match = 15 points
first name soundex match = 7 points
last name match = 25 points
last name soundex match = 10 points
middle name match = 5 points
address match = 25 points
dob match = 15 points
Then I search for previously unmatched rows, and any match above a sum of 72 points is declared a confident match.
I can't seem to get my head around how to do this in a single query. I also have a unique id in each table (they don't align or overlap, but it may prove useful in keeping track of a subquery result).
Thanks!!!
Here are some basic ideas:
If there is no column which is guaranteed to match, you will have to consider every row in REF for every row in ICD. Basically this is just a join of the two tables without any join condition.
This join gives you pairs of rows, i.e. result rows with twice the number of columns of a single table. Now you'll have to apply your ranking system and compute an additional weight column. You can do this with a stored procedure or lots of decodes, something like
decode(REF.firstName, ICD.firstName, 15, 0) + decode ...
Now you select the ones which have a weight greater than 72. The problem is that one ICD row may have more than one confident match. If you want the best match only, then you'd better order your result by weight ascending and perform your updates in that order. If an ICD row has more than one confident REF row, you will do some redundant updates, but the best update will always come last and overwrite the previous ones.
Doing it all in a single update is yet another challenge.
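For illustration, the same scoring idea sketched outside SQL in Python; the field names are invented, soundex stands for whatever phonetic function you use, and I'm assuming the soundex points only apply when the exact match fails:

    WEIGHTS = {
        "first_name": 15, "first_name_soundex": 7,
        "last_name": 25,  "last_name_soundex": 10,
        "middle_name": 5, "address": 25, "dob": 15,
    }
    THRESHOLD = 72

    def score(ref, icd, soundex):
        s = 0
        for field in ("first_name", "last_name"):
            if ref[field] == icd[field]:
                s += WEIGHTS[field]                     # exact match
            elif soundex(ref[field]) == soundex(icd[field]):
                s += WEIGHTS[field + "_soundex"]        # phonetic match only
        for field in ("middle_name", "address", "dob"):
            if ref[field] == icd[field]:
                s += WEIGHTS[field]
        return s

    def best_match(icd_row, ref_rows, soundex):
        # The unconditioned join: score every REF row against the ICD row and
        # keep only the single best score, if it reaches the threshold.
        if not ref_rows:
            return None
        scored = [(score(r, icd_row, soundex), r) for r in ref_rows]
        top_score, top_row = max(scored, key=lambda t: t[0])
        return top_row if top_score >= THRESHOLD else None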
The comment by @tbone was the most correct, but I cannot mark it as an answer because it was just a comment.
I did in fact create a package to deal with this, which contains two functions.
The first function queries any record from the reference table that matches any of a variety of indexes. It then calculates a score for each field. For example, a last name exact match is worth 25, a last name soundex match is worth 16, a null is worth 0, and a total mismatch is worth -15. These are summed together, and then I ran various queries to tell how low the threshold could be before letting in junk. The first function returns each row from REF and the score associated with it, in descending order.
The second function uses LEAD and LAG to specifically consider the highest score returned by the first function, and then I added a second check: that the highest score was at least a reasonable number of points ahead of the second-highest score. This trick eliminated my false matches. The second function returns the single unique id of the match, if there is one.
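A rough sketch of that margin check (Python rather than the LEAD/LAG query; the threshold values are placeholders, not the ones actually used):

    MIN_SCORE = 72    # minimum score for a confident match (placeholder)
    MIN_LEAD = 10     # best score must beat the runner-up by at least this much (placeholder)

    def confident_match(scored_candidates):
        # scored_candidates: list of (score, ref_id) pairs for one ICD row,
        # as returned by the first function.
        if not scored_candidates:
            return None
        ranked = sorted(scored_candidates, reverse=True)
        best_score, best_id = ranked[0]
        runner_up = ranked[1][0] if len(ranked) > 1 else float("-inf")
        if best_score >= MIN_SCORE and best_score - runner_up >= MIN_LEAD:
            return best_id
        return None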
From there it was pretty easy to write an update statement on the ICD table.
All the records in a database are saved as (key, value) pairs. Records can always be retrieved by specifying the key. A data structure needs to be developed to handle the following scenarios:
1. Access all the records in a linear fashion (an array or linked list is the best data structure for accessing them in O(N) time).
2. Retrieve a record by providing its key (a hash table can index this with O(1) complexity).
3. Retrieve the set of records with a given value at a particular digit position in the key. Ex: the list of all records for which the 2nd digit (10's place) of the key is 5; if the keys are 256, 1452, 362, 874, the records for keys 256 and 1452 should be returned.
I am assuming your keys are at most d digits long (in decimal).
How about a normal hash table and an additional 10×d two-dimensional array of sets (let's call it A)? A[i][j] is the set of keys which have digit i in the jth position. The sets can support O(1) insert/delete if implemented themselves as hash tables.
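A minimal sketch of that in Python, with d = 4 and position 0 being the 1's place:

    D = 4                                                   # keys have at most D digits
    records = {}                                            # the normal hash table: key -> value
    A = [[set() for _ in range(D)] for _ in range(10)]      # A[digit][position] -> set of keys

    def insert(key, value):
        records[key] = value
        for pos in range(D):                                # position 0 is the 1's place
            A[(key // 10 ** pos) % 10][pos].add(key)

    for k in (256, 1452, 362, 874):
        insert(k, "record %d" % k)

    print(A[5][1])                                          # {256, 1452}: digit 5 in the 10's place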
For points 1 and 2, I think a linked hash map is a good choice.
For point 3, an additional hash map with a (digit, position) tuple as the key and a list of pointers to the matching records as the value.
Both data structures can be wrapped inside one, and both will point to the same data, of course.
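Roughly, in Python (a plain dict preserves insertion order and stands in for the linked hash map; the names are made up):

    from collections import defaultdict

    class Store:
        def __init__(self, digits=4):
            self.records = {}                    # insertion-ordered: covers points 1 and 2
            self.by_digit = defaultdict(list)    # (digit, position) -> keys: covers point 3
            self.digits = digits

        def put(self, key, value):
            self.records[key] = value
            for pos in range(self.digits):
                self.by_digit[((key // 10 ** pos) % 10, pos)].append(key)

        def get(self, key):                      # point 2: O(1) lookup by key
            return self.records[key]

        def scan(self):                          # point 1: linear access in insertion order
            return list(self.records.items())

        def with_digit(self, digit, pos):        # point 3
            return [(k, self.records[k]) for k in self.by_digit[(digit, pos)]]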
Store the keys in a trie. For the numbers in your example (assuming 4 digit numbers) it looks like this:
*root*
 |
 0 -- 2 - 5 - 6
 |    |
 |    +- 3 - 6 - 2
 |    |
 |    +- 8 - 7 - 4
 |
 1 -- 4 - 5 - 2
This data structure can be traversed in a way that returns (1) or (3). It won't be quite as fast for (3) as would maintaining an index for each digit, so I guess it's a question of whether space or lookup time is your primary concern. For (2), it is already O(log n), but if you need O(1), you could store the keys in both the trie and a hash table.
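A toy version of such a trie in Python, with a traversal that answers (3) by pruning on the digit at the queried position:

    def build_trie(keys, width=4):
        root = {}
        for k in keys:
            node = root
            for ch in str(k).zfill(width):       # fixed width, zero-padded, as in the drawing
                node = node.setdefault(ch, {})
            node["$"] = k                        # mark the end of a complete key
        return root

    def keys_with_digit(node, digit, pos, depth=0):
        # Collect every key below `node` whose digit at index `pos` (counting from
        # the left of the zero-padded string) equals `digit`.
        if "$" in node:
            return [node["$"]]
        found = []
        for ch, child in node.items():
            if depth == pos and ch != digit:
                continue                         # prune subtrees with the wrong digit here
            found.extend(keys_with_digit(child, digit, pos, depth + 1))
        return found

    trie = build_trie([256, 1452, 362, 874])
    print(keys_with_digit(trie, "5", pos=2))     # [256, 1452] -> 10's place of the 4-digit form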
The first thing that comes to mind is embedding a pair of nodes in each record. One of the nodes would be a part of a tree sorted by the record index and the other, part of a tree sorted by the record key. You can then have quick access to the records by index or key using these trees. With this you can also quickly visit records in sequential index or key order. This covers the first and second requirement.
You can add another node for a tree of records whose values contain 5 in the tens position. That covers the third requirement.
Extra benefit: the same tree handling code will be used in all cases.
A dictionary (hash map, etc.) would easily handle those requirements, although your third requirement would be an O(N) operation. You just iterate over the keys and select those that match your criteria. You don't say what your desired performance is for that case.
But O(N) might be plenty fast enough. How many items are in your data set, and how often will you be performing that third function?
I have data in the form of:
ID ATTR
3 10
1 20
1 20
4 30
... ...
Where ID and ATTR are unsorted and may contain duplicates. The range for the IDs is 1 to 20,000 or so, and ATTR is an unsigned int. There may be anywhere between 100,000 and 500,000 pairs that I need to process at a single time.
I am looking for:
The number of unique pairs.
The number of times a non-unique pair pops up.
So in the above data, I'd want to know that (1,20) appeared twice and that there were 3 unique pairs.
I'm currently using a hash table in my naive approach. I keep a counter of unique pairs, and decrement the counter if the item I am inserting is already there. I also keep an array of IDs of the non-unique pairs. (All on first encounters)
Performance and size are about equal concerns. I'm actually OK with a relatively high (say 0.5%) rate of false positives given the performance and size concerns. (I've also implemented this using a spectral bloom)
I'm not that smart, so I'm sure there's a better solution out there, and I'd like to hear about your favorite hash table implementations/any other ideas. :)
A hash table with keys like <id>=<attr> is an excellent solution to this problem. If you can tolerate errors, you can get smaller/faster with a bloom, I guess. But do you really need to do that?
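For scale, the exact hash-table version is tiny (Python's Counter here, keyed by the (ID, ATTR) pair); the Bloom variant only buys space in exchange for those false positives:

    from collections import Counter

    pairs = [(3, 10), (1, 20), (1, 20), (4, 30)]    # (ID, ATTR) pairs from the question

    counts = Counter(pairs)
    unique_pairs = len(counts)                                 # 3 distinct pairs
    repeats = {p: n for p, n in counts.items() if n > 1}       # {(1, 20): 2}

    print(unique_pairs, repeats)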
I've been trying to estimate the size of an Access table with a certain number of records.
It has 4 Longs (4 bytes each), and a Currency (8 bytes).
In theory: 1 record = 24 bytes, so 500,000 records = ~11.5MB.
However, the accdb file (even after compacting) increases by almost 30MB (~61 bytes per record). A few extra bytes for padding wouldn't be so bad, but 2.5X seems a bit excessive - even for Microsoft bloat.
What's with the discrepancy? The four Longs form a compound key; would that matter?
These are the results of my tests, all conducted with an A2003 MDB, not with an A2007 ACCDB:
98,304 IndexTestEmpty.mdb
131,072 IndexTestNoIndexesNoData.mdb
11,223,040 IndexTestNoIndexes.mdb
15,425,536 IndexTestPK.mdb
19,644,416 IndexTestPKIndexes1.mdb
23,838,720 IndexTestPKIndexes2.mdb
24,424,448 IndexTestPKCompound.mdb
28,041,216 IndexTestPKIndexes3.mdb
28,655,616 IndexTestPKCompoundIndexes1.mdb
32,849,920 IndexTestPKCompoundIndexes2.mdb
37,040,128 IndexTestPKCompoundIndexes3.mdb
The names should be pretty self-explanatory, I think. I used an append query with Rnd() to append 524,288 records of fake data, which made the file 11MBs. The indexes I created on the other fields were all non-unique. As you can see, the compound 4-column index increased the size from 11MBs (no indexes) to well over 24MBs, while a PK on the first column alone increased the size only from 11MBs to 15.4MBs (using fake MBs, of course, i.e., like hard drive manufacturers).
Notice how each single-column index added approximately 4MBs to the file size. If you consider that 4 columns with no indexes totalled 11MBs, that seems about right based on my comment above, i.e., that each index should increase the file size by about the amount of data in the field being indexed. I am surprised that the clustered index did this, too -- I thought that the clustered index would use less space, but it doesn't.
For comparison, a non-PK (i.e., non-clustered) unique index on the first column, starting from IndexTestNoIndexes.mdb is exactly the same size as the database with the first column as the PK, so there's no space savings from the clustered index at all. On the off chance that perhaps the ordinal position of the indexed field might make a difference, I also tried a unique index on the second column only, and this came out exactly the same size.
Now, I didn't read your question carefully, and omitted the Currency field, but if I add that to the non-indexed table and the table with the compound index and populate it with random data, I get this:
98,304 IndexTestEmpty.mdb
131,072 IndexTestNoIndexesNoData.mdb
11,223,040 IndexTestNoIndexes.mdb
15,425,536 IndexTestPK.mdb
15,425,536 IndexTestIndexUnique2.mdb
15,425,536 IndexTestIndexUnique1.mdb
15,482,880 IndexTestNoIndexes+Currency.mdb
19,644,416 IndexTestPKIndexes1.mdb
23,838,720 IndexTestPKIndexes2.mdb
24,424,448 IndexTestPKCompound.mdb
28,041,216 IndexTestPKIndexes3.mdb
28,655,616 IndexTestPKCompoundIndexes1.mdb
28,692,480 IndexTestPKCompound+Currency.mdb
32,849,920 IndexTestPKCompoundIndexes2.mdb
37,040,128 IndexTestPKCompoundIndexes3.mdb
The points of comparison are:
11,223,040 IndexTestNoIndexes.mdb
15,482,880 IndexTestNoIndexes+Currency.mdb
24,424,448 IndexTestPKCompound.mdb
28,692,480 IndexTestPKCompound+Currency.mdb
So, the currency field added another 4.5MBs, and its index added another 4MBs. And if I add non-unique indexes to the 2nd, 3rd and 4th long fields, the database grows to 41,336,832 bytes, an increase of just under 12MBs (or ~4MBs per additional index).
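Back-of-envelope, the deltas in those byte counts come out like this:

    no_idx           = 11_223_040    # 4 longs, no indexes
    pk_single        = 15_425_536    # + PK on the first column
    pk_compound      = 24_424_448    # + compound 4-column PK instead
    no_idx_currency  = 15_482_880    # no indexes, currency column added
    pk_comp_currency = 28_692_480    # compound PK, currency column added

    print(pk_single - no_idx)                # ~4.2 MB for a single-column index
    print(pk_compound - no_idx)              # ~13.2 MB for the 4-column compound PK
    print(no_idx_currency - no_idx)          # ~4.3 MB of currency data
    print(pk_comp_currency - pk_compound)    # ~4.3 MB again, on top of the compound PK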
So, this basically replicates your results, no? And I ended up with the same file sizes, roughly speaking.
The answer to your question is INDEXES, though there is obviously more overhead in the A2007 ACCDB format, since I saw an increase in size of only 20MBs, not 30MBs.
One thing I did notice was that I could implement an index that would make the file larger, then delete the index and compact, and it would return to exactly the same file size as it had before, so you should be able to take a single copy of your database and experiment with what removing the indexes does to your file size.