How are bitmap indexes helpful? - database

Wikipedia gives this example
Identifier   Gender        Bitmaps
                           F    M
1            Female        1    0
2            Male          0    1
3            Male          0    1
4            Unspecified   0    0
5            Female        1    0
But I do not understand this.
How is this an index in the first place? Isn't an index supposed to point to rows (using rowids) given the key?
What would be the typical queries where such indexes are useful? How are they better than B-tree indexes? I know that if we use a B-tree index on Gender here, we will get a lot of results if, for example, we look for Gender = Male, and those would need to be filtered further (so not very useful). How does a bitmap improve the situation?

A better representation of a bitmap index: given the sample above,
Identifier   Gender        RowID
1            Female        R1
2            Male          R2
3            Male          R3
4            Unspecified   R4
5            Female        R5
a bitmap index on the gender column would (conceptually) look like this:
Gender        R1   R2   R3   R4   R5
Female        1    0    0    0    1
Male          0    1    1    0    0
Unspecified   0    0    0    1    0
Bitmap indexes are used when the number of distinct values in a column is relatively low (consider the opposite case, where all values are unique: the bitmap would be as wide as the table is long, and just as tall, making it essentially one big identity matrix).
So with this index in place, for a query like
SELECT * FROM table1 WHERE gender = 'Male'
the database looks for a match among the gender values in the index, finds all the rowids where the bit is set to 1, and then goes and gets those rows from the table.
A query like:
SELECT * FROM table1 WHERE gender IN ('Male', 'Unspecified')
would get the 1 bits for Male, the 1 bits for Unspecified, do a bitwise-OR then go get the rows where the resulting bits are 1.
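To make the mechanics concrete, here is a minimal sketch (not any particular database's implementation) of the example bitmaps held as machine words, with the IN query resolved by a single bitwise OR:

#include <cstdint>
#include <cstdio>

int main() {
    // The three bitmaps from the example above, one bit per row (bit 0 = R1 ... bit 4 = R5).
    uint8_t female      = 0b10001;  // rows R1 and R5 (unused below; shown to mirror the index)
    uint8_t male        = 0b00110;  // rows R2 and R3
    uint8_t unspecified = 0b01000;  // row R4

    // WHERE gender IN ('Male', 'Unspecified')  ->  one bitwise OR
    uint8_t match = male | unspecified;

    // Resolve the set bits back to rowids, then fetch those rows from the table.
    for (int row = 0; row < 5; ++row)
        if ((match >> row) & 1)
            printf("fetch row R%d\n", row + 1);
    return 0;
}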
So the advantages of using a bitmap index over a B-tree index are storage (with low cardinality, bitmap indexes are pretty compact) and the ability to do bitwise operations before resolving the actual rowids, which can be very quick.
Note that bitmap indexes can have performance implications with inserts/deletes (conceptually, you add/remove a column to/from the bitmap and rejig it accordingly...), and can create a whole lot of contention as an update on a row can lock the entire corresponding bitmap entry and you can't update a different row (with the same bitmap value) until the first update is committed/rolled back.

The benefit comes when filtering on multiple columns, then the corresponding indexes can be merged with bitwise operations before actually selecting the data.
If you have gender, eye_colour, hair_colour
then the query
select * from persons where
gender = 'male' and
(eye_colour = 'blue' or hair_colour = 'blonde')
would first do a bitwise OR between the eye_colour['blue'] index and the hair_colour['blonde'] index, and finally a bitwise AND between the result and the gender['male'] index. These operations are really fast, both computationally and in terms of I/O.
The resulting bit stream would be used for picking the actual rows.
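As a rough sketch of that combination (same toy rows R1..R5 as the gender example; the eye and hair bitmaps are invented for illustration):

#include <cstdint>

// Invented example bitmaps over rows R1..R5 (bit 0 = R1).
uint8_t male        = 0b00110;  // gender['male']
uint8_t blue_eyes   = 0b00101;  // eye_colour['blue']    (hypothetical)
uint8_t blonde_hair = 0b10010;  // hair_colour['blonde'] (hypothetical)

// gender = 'male' AND (eye_colour = 'blue' OR hair_colour = 'blonde')
uint8_t result = male & (blue_eyes | blonde_hair);
// the set bits of 'result' are the rows to fetch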
Bitmap indexes are typically used in "star joins" in data warehouse applications.

As indicated in the Wikipedia article, they use bitwise operations, which can perform better than comparing data types such as integers, so the short answer is increased speed of queries.
Theoretically, it should take fewer computations and less time to select all males or all females from your example.
Just thinking about how this works under the hood should make it obvious why this is faster. A bit is logically either true or false. If you want to do a query using a WHERE clause, this will eventually evaluate to either true or false for each record in order to determine whether to include it in your results.
Preface - the rest of this is meant in layman's terms, non-techie
So the next question is: what does it take to evaluate to true? Even comparing numeric values means that the computer has to...
1. Allocate memory for the value you want to evaluate
2. Allocate memory for the control value
3. Assign the value to each (count this as two steps)
4. Compare the two - for a numeric this should be quick, but for strings there are more bytes to compare
5. Assign the result to a 0 (false) or 1 (true) value
6. Repeat if you're using a multiple-part WHERE clause such as WHERE this = this AND that = that
7. Perform bitwise operations on the results generated in step 5
8. Come up with the final value
9. Deallocate the memory allocated in steps 1-3
But using bitwise logic, you're just looking at 0 (false) and 1 (true) values. 90% of the overhead for the comparison work is eliminated.
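One way to see the saving: bitmaps are stored as machine words, so a single bitwise instruction evaluates the predicate for many rows at once. A hypothetical sketch:

#include <cstdint>

// One 64-bit word holds the match bits for 64 rows, so a single AND
// instruction evaluates "this = this AND that = that" for 64 rows at once,
// instead of running the allocate/assign/compare steps row by row.
uint64_t combine(uint64_t this_matches, uint64_t that_matches) {
    return this_matches & that_matches;
}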

Related

What is the way to select only first value in InfluxDB sequences?

I have a measurement that indicates the process of something changing its state. Every second the system asks whether it is changing; if so, it stores a 1 in the DB, otherwise nothing. So I have sequences of "ones".
The distance between consecutive points is 1s.
I want to get only the time of first point of each "one" sequence. On this particular example it would be
Time                  Value
2019-01-01 11:46:55   1
2019-01-01 12:36:45   1
Is there a way to do it using queries? Or may be easy python pattern?
P.S. The first() selector requires GROUP BY, but I cannot assume that sequences are shorter than some_time_interval
You could probably achieve what you need with a nested query and the difference function. https://docs.influxdata.com/influxdb/v1.7/query_language/functions/#difference
For example:
Let's say your measurement is called change and the field is called value
SELECT diff FROM
  (SELECT difference(value) AS diff FROM
    (SELECT max(value) AS value FROM change WHERE your_time_filter GROUP BY time(1s) fill(0))
  )
WHERE diff > 0
So you first fill in the empty spaces with zeros. Then take the differences between subsequent values. Then pick the differences that are positive. If the difference is 0 then it is between two zeros or two ones, no change. If it is 1 then the change is from zero to one (what you need). If it is -1 then the change is from one to zero, end of sequence of ones.
You may need to tweak the query a bit with time intervals and groupings.
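The question also asks about an easy programmatic pattern; the same fill-then-diff-then-filter idea, sketched here in C++ with made-up samples:

#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // Made-up 1s-resolution samples after fill(0): 1 where a point was stored, 0 where not.
    std::vector<int> value = {0, 1, 1, 1, 0, 0, 1, 1, 0};

    // difference(): diff[i] = value[i] - value[i-1]; keeping only diff > 0
    // picks exactly the 0 -> 1 transitions, i.e. the first point of each run of ones.
    for (size_t i = 1; i < value.size(); ++i)
        if (value[i] - value[i - 1] > 0)
            printf("run of ones starts at sample %zu\n", i);
    return 0;
}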

Need an optimized way to handle combination of entities to improve performance

So, I am working on a feature in a web application. The problem is like this -
I have four different entities. Let's say those are Item1, Item2, Item3, Item4. There are two phases to the feature. Let's say the first phase is - Choose entities. In the first phase, the user will have the option to choose multiple items for each entity, and for every combination from that choice, I need to do some calculation. Then in the second phase (let's say the Relocate phase) - based on the calculation done in the first phase, for each combination I would have to let the user choose another combination, to whose row the value of the first combination would get moved.
Here's the data model for further clarification -
EntityCombinationTable
(
    Id,
    Item1_Id,
    Item2_Id,
    Item3_Id,
    Item4_Id
)
ValuesTable
(
    Combination_Id,
    Value
)
So suppose I have the following values in both tables -
EntityCombinationTable
Id   Item1_Id   Item2_Id   Item3_Id   Item4_Id
1    1          1          1          1
2    1          2          1          1
3    2          1          1          1
4    2          2          1          1
ValuesTable
Combination_Id   Value
1                10
2                0
3                0
4                20
So if in the first phase I choose (1,2) for Item1, (1,2) for Item2, and 1 for both Item3 and Item4, then the total number of combinations would be 2*2*1*1 = 4.
Then in the second phase, for each of the combinations that has a value greater than zero, I would have to let the user choose a different combination to which the value would get relocated.
For example - as only the combinations with Id 1 and 4 have values greater than zero, only two relocation combinations would need to be shown in the second dialog. So if the user chooses (3,3,3,3) and (4,4,4,4) as relocation combinations in the second phase, then new rows will need to be inserted into
EntityCombinationTable for (3,3,3,3) and (4,4,4,4), and the values of (1,1,1,1) and (2,2,1,1) will be relocated respectively to the rows corresponding to (3,3,3,3) and (4,4,4,4) in the ValuesTable.
So the problem is - each of the entities can have up to 100 items or even more. So in the worst case the total number of combinations can be 10^8, which would lead to a very heavy load on the database (inserting and updating a huge number of rows), and generating all the combinations in code would also require substantial time.
I have thought about an alternative approach: not keeping the items as combinations, but rather keeping a separate table for each entity and building the combinations at runtime. That would also cause performance issues, as there are a lot of other stages where I might need the combinations, so every time I would need to generate them all.
I have also thought about creating a key-value-pair type table, where I would keep the combination as a string. But in this approach I am not actually reducing the number of rows to be inserted; only the number of columns is reduced.
So my question is - is there any better approach to this kind of situation, where I can keep track of the combinations and manipulate them in an optimized way?
Note - I am not sure if this would help or not, but a lot of the rows in the values table will probably have zero as their value. So in the second phase we would need to show far fewer rows than the actual number of possible combinations.

fast poker hand ranking

I am working on a simulation of poker and now I have to rank hands effectively:
Every hand is a combination of 5 cards and is represented as an uint64_t.
Every bit from 0 (Ace of Spades), 1 (Ace of Hearts) to 51 (Two of Clubs) indicates if the corresponding card is part (bit == 1) or isn't part (bit == 0) of the hand. The bits from 52 to 63 are always set to zero and don't hold any information.
I already know how I theoretically could generate a table so that every valid hand can be mapped to a rank (represented as uint16_t) between 1 (2,3,4,5,7 - not all of the same suit) and 7462 (Royal Flush), and all the others to rank zero.
So a naive lookup table (with the integer value of the hand as index) would have an enormous size of
2 bytes * 2^52 >= 9.007 PB.
Most of this memory would be filled with zeros, because almost all uint64_t's from 0 to 2^52-1 are invalid hands and therefore have a rank equal to zero.
The valuable data occupies only
2 bytes * 52!/(47!*5!) = 5.198 MB.
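(As a quick sanity check, those two figures can be reproduced like this:)

#include <cstdint>
#include <cstdio>

int main() {
    // Naive table: one 2-byte rank for every 52-bit value.
    double naive_bytes = 2.0 * (double)(1ULL << 52);
    // Valid hands only: C(52,5) = 52!/(47!*5!) entries.
    unsigned long long hands = 52ULL * 51 * 50 * 49 * 48 / (5 * 4 * 3 * 2 * 1);
    printf("naive table: %.3f PB\n", naive_bytes / 1e15);  // ~9.007 PB
    printf("useful data: %.3f MB\n", hands * 2 / 1e6);     // ~5.198 MB
    return 0;
}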
What method can I use for the mapping so that I only have to save the ranks of the valid hands plus some overhead (max. 100 MB) and still don't have to do an expensive search?
It should be as fast as possible!
If you have any other ideas, you're welcome! ;)
You need only a table of size 13^5 * 2, with the extra bit of information dictating whether all the cards are of the same suit. If for some reason 'heart' outranks 'diamond', you still need at most a table of size 13^6, as the last piece of information encodes '0 = no pattern, 1 = all spades, 2 = all hearts, etc.'.
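A minimal sketch of that indexing scheme, assuming the bit layout from the question (bit = rank*4 + suit) and a mask with exactly five bits set:

#include <algorithm>
#include <cstdint>

// Map a 5-card mask to an index into a 13^5 * 2 table.
// Precondition: exactly five bits of 'mask' are set (a valid hand).
uint32_t table_index(uint64_t mask) {
    int ranks[5], suits[5], n = 0;
    for (int bit = 0; bit < 52 && n < 5; ++bit)
        if ((mask >> bit) & 1) { ranks[n] = bit / 4; suits[n] = bit % 4; ++n; }
    std::sort(ranks, ranks + 5);                 // canonical order, so equal hands collide
    bool flush = suits[0] == suits[1] && suits[1] == suits[2]
              && suits[2] == suits[3] && suits[3] == suits[4];
    uint32_t idx = 0;
    for (int i = 0; i < 5; ++i)
        idx = idx * 13 + (uint32_t)ranks[i];     // 5-digit base-13 number
    return idx * 2 + (flush ? 1 : 0);            // one slot per (ranks, flush) pair
}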
A hash table is probably also a good and fast approach -- Creating a table from nCk(52,5) combinations doesn't take much time (compared to all possible hands). One would, however, need to store 65 bits of information for each entry to store both the key (52 bits) and the rank (13 bits).
Speeding up evaluation of the hand, one first rules out illegal combinations from the mask:
if (popcount(mask) != 5) the hand is invalid; afterwards one can use enough bits from e.g. crc32(mask), which has instruction-level support on the i7 architecture at least.
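A sketch of that filter-then-hash step; the table layout and power-of-two size are assumptions here, and collision handling is omitted:

#include <bit>          // std::popcount (C++20)
#include <cstdint>
#include <nmmintrin.h>  // _mm_crc32_u64 (requires SSE4.2)

uint16_t lookup_rank(uint64_t mask, const uint16_t* table, uint64_t table_size) {
    if (std::popcount(mask) != 5)            // rule out illegal combinations
        return 0;                            // rank 0 = invalid hand
    uint64_t h = _mm_crc32_u64(0, mask);     // hardware CRC32-C of the mask
    return table[h & (table_size - 1)];      // use enough bits as the bucket index
}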
If I understand your scheme correctly, you only need to know that the hamming weight of a particular hand is exactly 5 for it to be a valid hand. See Calculating Hamming Weight in O(1) for information on how to calculate the hamming weight.
From there, it seems you could probably work out the rest on your own. Personally, I'd want to store the result in some persistent memory (if it's available on your platform of choice) so that subsequent runs are quicker since they don't need to generate the index table.
This is a good source: Cactus Kev's.
For a hand you can take advantage of there being at most 4 suits and 13 ranks:
4 bits for the rank (0-12) and 2 bits for the suit.
6 bits * 5 cards is just 30 bits;
call it 4 bytes.
There are only 2,598,960 hands,
so the total size is a little under 10 MB.
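A minimal sketch of that packing, assuming cards are pre-sorted so equal hands pack identically:

#include <cstdint>

// 6 bits per card: 4 bits of rank (0-12) plus 2 bits of suit (0-3).
// 5 cards = 30 bits, stored in 4 bytes.
uint32_t pack_hand(const int rank[5], const int suit[5]) {
    uint32_t packed = 0;
    for (int i = 0; i < 5; ++i)
        packed = (packed << 6) | (uint32_t)((rank[i] << 2) | suit[i]);
    return packed;  // key for the ~2,598,960 distinct hands, under 10 MB in total
}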
A simple implementation that comes to mind would be to change your scheme to a 5-digit number in base 52. The resulting table to hold all of these values would still be larger than necessary, but very simple to implement and it would easily fit into RAM on modern computers.
edit: You could also cut down even more by only storing the rank of each card and an additional flag (e.g., lowest bit) to specify if all cards are of the same suit (i.e., flush is possible). This would then be in base 13 + one bit for the ranking representation. You would presumably then need to store the specific suits of the cards separately to reconstruct the exact hand for display and such.
I would represent your hand in a different way:
There are only 4 suits = 2 bits and only 13 ranks = 4 bits, for a total of 6 bits * 5 = 30 - so we fit into a 32-bit int - and we can also force this to always be sorted as per your ordering:
[suit 0][suit 1][suit 2][suit 3][suit 4][value 0][value 1][value 2][value 3][value 4]
Then I would use a separate hash for:
consecutive values (very small) [mask off the suits]
1 or more multiples (pair, 2 pair, full house) [mask off the suits]
suits that are all the same (very small) [mask off the values]
Then use the 3 hashes to calculate your rankings
At 5 MB you will likely have enough caching issues that a bit of math and three small lookups will be faster.
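A rough sketch of deriving those separate keys from the 30-bit packed hand described above (the names are made up):

#include <cstdint>

struct HandKeys {
    uint32_t rank_key;  // suits masked off: feeds the consecutive-values and multiples hashes
    uint32_t suit_key;  // ranks masked off: feeds the all-same-suit hash
};

HandKeys split_keys(uint32_t packed) {
    HandKeys k = {0, 0};
    for (int i = 0; i < 5; ++i) {
        uint32_t card = (packed >> (6 * i)) & 0x3F;   // one 6-bit card
        k.rank_key = k.rank_key * 13 + (card >> 2);   // keep the 4 rank bits
        k.suit_key = (k.suit_key << 2) | (card & 3);  // keep the 2 suit bits
    }
    return k;  // look up each key in its small table, then combine the results
}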

Database Bitmap index

I don't understand why bitmap indexes are useful:
Ident.   Name     Gender   Bitmaps
                           F    M
1        Ann      Female   1    0
2        John     Male     0    1
3        Jacob    Male     0    1
4        Pieter   Unsp.    0    0
5        Elise    Female   1    0
If a query needs to find all persons with a given Gender - that much is clear.
But what about when it needs, e.g., to find all whose name starts with "J"?
Bitmaps are generally useful only for columns like Gender where the number of distinct values is fairly small. You would not use a bitmap index on names. They are also more useful in data warehousing than in OLTP databases due to the higher cost of maintaining bitmap indexes.
One advantage of bitmap indexes is that a number of them can be ANDed and ORed together to answer queries very efficiently.

Looking for precision on how ROW_OVERFLOW_DATA happens

I'm currently in the initial phases of planning a rewrite for a large module in our CRM application.
One area I am currently looking into is database optimization. I haven't made any decisions yet, but I just want to make sure I understand the concept of ROW_OVERFLOW_DATA properly - http://msdn.microsoft.com/en-us/library/ms186981.aspx
We are using SQL Server 2005; it's my understanding that the row size limit is 8,060 bytes and that beyond that, overflow will occur.
I ran a query to get my max row size for a particular read intensive database
SELECT OBJECT_NAME (sc.[id]) tablename
, COUNT (1) nr_columns
, SUM (sc.length) maxrowlength
FROM syscolumns sc
join sysobjects so
on sc.[id] = so.[id]
WHERE so.xtype = 'U'
GROUP BY OBJECT_NAME (sc.[id])
ORDER BY SUM (sc.length) desc
This gave me a few tables with a maxrowlength slightly above 8,000 but under 10,000. Another query shows that the average row size is actually quite small, around 1,000 bytes.
My question is: is ROW_OVERFLOW_DATA based on each row, or is it per column? Once the 8,060-byte limit is exceeded, is the column that caused the overflow moved to another page for every row, or only for the specific row?
So for example given the following simplified schema:
col1 (int) | col2 (varchar(4000)) | col3 (varchar(5000))
1          | 4000 characters      | 5000 characters   *** this row is overflowing
2          | 4000 characters      | 100 characters
3          | 150 characters       | 150 characters
4          | 500 characters       | 600 characters
Would col3 in every row from 1 to 4 get replaced by a 24-byte pointer, or only in row 1?
I am wondering because if every row gets a pointer, it becomes important to fix this; if it's only a few rows, maybe we can take the performance hit.
Also, I've seen many blogs suggesting moving nullable columns toward the end of the table so that if the values are in fact NULL they don't take any row space. Is this true? We tend to keep our timestamp and tracking columns at the end because it's easier to visualize. Now I am wondering if maybe we shouldn't move them further up, as they are never NULL.
If you have one row in, say, 100 million that overflows, would you move the whole column? No.
For reference, here is a TechNet article from Paul Randal, who is the God of this stuff (my bold):
The feature you are using, row-overflow, is great for allowing the occasional row to be longer than 8,060 bytes, but it is not well suited for the majority of rows being oversized and can lead to a drop in query performance, as you are experiencing.
The reason for this is that when a row is about to become oversized, one of the variable-length columns in the row is pushed "off-row." This means the column is taken from the row on the data or index page and moved to a text page. In place of the old column value, a pointer is substituted that points to the new location of the column value in the data file.
And MSDN (my bold)
ROW_OVERFLOW_DATA Allocation Unit
For every partition used by a table (heap or clustered table), index, or indexed view, there is one ROW_OVERFLOW_DATA allocation unit. This allocation unit contains zero (0) pages until a data row with variable length columns (varchar, nvarchar, varbinary, or sql_variant) in the IN_ROW_DATA allocation unit exceeds the 8 KB row size limit. When the size limitation is reached, SQL Server moves the column with the largest width from that row to a page in the ROW_OVERFLOW_DATA allocation unit. A 24-byte pointer to this off-row data is maintained on the original page.
As for your NULLable columns, this is false. NULLable columns are stored at the end of the disk structure anyway, regardless of column order in the table definition. A reference from Paul Randal again: Inside the Storage Engine: Anatomy of a record. And some previous answers from me here on SO.
Only if a particular row overflows will the offending data for that row be moved off into a separate overflow page - imagine the headache if the entire table needed rebuilding just because one value in one column overflowed!
I'd not heard of the idea of moving NULLables to the end of the table - I'll have to check into that!
