I don't understand why bitmap indexes are useful:
Ident.  Name    Gender  Bitmap F  Bitmap M
1       Ann     Female  1         0
2       John    Male    0         1
3       Jacob   Male    0         1
4       Pieter  Unsp.   0         0
5       Elise   Female  1         0
If a query needs to find all persons with a given Gender, it is clear how the index helps.
But what about when it needs to find, for example, everyone whose name starts with "J"?
Bitmaps are generally useful only for columns like Gender where the number of distinct values is fairly small. You would not use a bitmap index on names. They are also more useful in data warehousing than in OLTP databases due to the higher cost of maintaining bitmap indexes.
One advantage of bitmap indexes is that a number of them can be ANDed and ORed together to answer queries very efficiently.
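For intuition, here is a minimal Python sketch of that idea. It is not how a database engine is implemented, and the second column (country) and its bitmap are invented purely for illustration:

# Bit i (least significant first) corresponds to row i+1 of the table above
# (1 = Ann, 2 = John, 3 = Jacob, 4 = Pieter, 5 = Elise).
FEMALE     = 0b10001   # Ann and Elise
# Hypothetical bitmap for a second low-cardinality column, say country = 'NL'.
COUNTRY_NL = 0b11001   # Ann, Pieter and Elise (made-up data)

# WHERE gender = 'Female' AND country = 'NL':
# AND the bitmaps, then count the set bits -- no table rows are touched yet.
match = FEMALE & COUNTRY_NL
print(bin(match).count("1"))   # 2 (Ann and Elise)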
Related
I have a simple table as follows:
day  order_id  customer_id
1    1         1
1    2         1
1    3         2
2    4         1
2    5         1
I want to find the number of unique customers from Day 1 to Day 2, and the answer is 2.
But the table is huge and querying it takes a long time, so I want to store aggregated data in another table to reduce the data size and query faster. I have created a new table out of the above table.
day  uniq_customer
1    2
2    1
Now if I want to find the number of unique customers from Day 1 to Day 2, I get 2 + 1 = 3, whereas the answer is 2.
Is there any workaround that avoids having to query the old table?
PS: I am using Druid as a data source.
This depends on the characteristics of your data. For example, if you have a small number of distinct customers and days, you can keep the customers in a bit vector per day. At the end, just OR the bit vectors of the days in the query range, and the result is the number of set bits. It may be tedious to implement.
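As a rough Python illustration of that bit-vector idea (assuming customer ids can be mapped to small bit positions; this is a sketch, not anything Druid-specific):

# One integer per day; bit c is set if customer c+1 ordered that day.
day_customers = {
    1: 0b11,   # day 1: customers 1 and 2
    2: 0b01,   # day 2: customer 1 only
}

def unique_customers(start_day, end_day):
    combined = 0
    for day in range(start_day, end_day + 1):
        combined |= day_customers.get(day, 0)   # OR the per-day bit vectors
    return bin(combined).count("1")             # number of set bits = unique customers

print(unique_customers(1, 2))   # 2, matching the expected answer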
If you have a large number of distinct customers and days, chunk the records per customer and sort them by date. Then for each customer, use binary search to get the index of the first row where the day is greater than or equal to the query start and the index of the last row where the day is less than or equal to the query end. The difference between those two indices plus 1 gives you the number of query-fitting days for that customer. The complexity becomes #customers x 2 x O(log #customerRecords).
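And a sketch of the per-customer binary-search idea using Python's bisect, adapted here to count customers with at least one order in the range (the dictionary below simply mirrors the sample table):

from bisect import bisect_left, bisect_right

# Sorted order days per customer, taken from the sample table.
days_by_customer = {1: [1, 1, 2, 2], 2: [1]}

def unique_customers(start_day, end_day):
    count = 0
    for days in days_by_customer.values():
        lo = bisect_left(days, start_day)   # first index with day >= start
        hi = bisect_right(days, end_day)    # first index with day > end
        if hi > lo:                         # at least one order in the range
            count += 1
    return count

print(unique_customers(1, 2))   # 2
print(unique_customers(2, 2))   # 1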
Apache Druid supports the use of approximations for this kind of query. Take a look at the tutorial on using approximations in Druid: https://druid.apache.org/docs/latest/tutorials/tutorial-sketches-theta.html
In Druid you can also partially aggregate into Theta sketches at ingestion time and aggregate them over time or over other grouping dimensions at query time. This is specifically designed to deal with large data volumes, and you can control the accuracy of the approximations.
I have a dataset containing a number of persons who have been involved in an accident. Each person has been in an accident at a different time, and I have coded a variable start_week which indicates in which week after a certain date (January 1st, 2011) the accident occurred.
For each individual I also have a variable for each week after January 1st, 2011, that shows whether or not this individual was hospitalized. I now need to count how many weeks a person was hospitalized in the XX weeks after the accident.
The desired result is a column like sum_week that sums the weekly indicators from the week given by start_week onwards.
Id  start_week  week_1  week_2  week_3  week_4  sum_week
1   2           1       0       1       1       2
2   3           1       0       0       1       1
I think this can be done using an array, but I have no idea how. If it isn't possible to count across columns based on the variable start_week, I plan to transpose my data; I would, however, prefer a solution that does not require transposing.
Any help is much appreciated!
Just use START_WEEK as the initial value in the DO loop you use to sum across the array. For the first row, start_week is 2, so the loop adds week_2 + week_3 + week_4 = 0 + 1 + 1 = 2.
data want;
  set have;
  array week_[4];
  sum_week = 0;
  do index = start_week to dim(week_);
    sum_week + week_[index];
  end;
  drop index;
run;
So, I am working on a feature in a web application. The problem is like this:
I have four different entities; let's say those are Item1, Item2, Item3, Item4. The feature has two phases. Let's say the first phase is "Choose entities". In the first phase, the user can choose multiple items for each entity, and for every combination resulting from those choices I need to do some calculation. Then in the second phase (let's say the "Relocate" phase), based on the calculation done in the first phase, for each combination I have to let the user choose another combination to which the value of the first combination gets moved.
Here's the data model for further clarification -
EntityCombinationTable
(
Id
Item1_Id,
Item2_Id,
Item3_Id,
Item4_Id
)
ValuesTable
(
Combination_Id,
Value
)
So suppose I have the following values in both tables:
EntityCombinationTable
Id  Item1_Id  Item2_Id  Item3_Id  Item4_Id
1   1         1         1         1
2   1         2         1         1
3   2         1         1         1
4   2         2         1         1
ValuesTable
Combination_Id  Value
1               10
2               0
3               0
4               20
So if in the first phase I choose (1,2) for Item1, (1,2) for Item2, and 1 for both Item3 and Item4, the total number of combinations is 2*2*1*1 = 4.
Then in the second phase, for each combination that has a value greater than zero, I have to let the user choose a different combination to which the value gets relocated.
For example, as only the combinations with Id 1 and 4 have values greater than zero, only two relocation combinations need to be shown in the second dialog. So if the user chooses (3,3,3,3) and (4,4,4,4) as relocation combinations in the second phase, new rows need to be inserted in EntityCombinationTable for (3,3,3,3) and (4,4,4,4). And the values of (1,1,1,1) and (2,2,1,1) are relocated respectively to the rows corresponding to (3,3,3,3) and (4,4,4,4) in the ValuesTable.
So the problem is that each entity can have up to 100 items or even more. So in the worst case the total number of combinations can be 10^8, which would put a very heavy load on the database (inserting and updating a huge number of rows), and generating all the combinations in code would also take substantial time.
I have thought about an alternative approach: not keeping the items as combinations, but rather keeping a separate table for each entity and building the combinations at runtime. That would also cause performance issues, because there are many other stages where I might need the combinations, so I would have to regenerate all of them every time.
I have also thought about creating a key-value style table, where I would store the combination as a string. But with this approach I am not actually reducing the number of rows to be inserted; only the number of columns is reduced.
So my question is: is there any better approach for this kind of situation, where I can keep track of the combinations and manipulate them in an optimized way?
Note: I am not sure if this helps, but a lot of the rows in the values table will probably have zero as their value, so in the second phase we would need to show far fewer rows than the actual number of possible combinations.
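To make those last two ideas concrete (generating combinations lazily and storing only non-zero values), here is a rough Python sketch; the variable names and structure are made up for illustration, not a schema recommendation:

from itertools import product

# Items chosen per entity in the first phase.
item1, item2, item3, item4 = [1, 2], [1, 2], [1], [1]

# Sparse store: only combinations with a non-zero value get an entry.
values = {(1, 1, 1, 1): 10, (2, 2, 1, 1): 20}

# Generate the 2*2*1*1 combinations lazily instead of materializing rows for all of them.
for combo in product(item1, item2, item3, item4):
    value = values.get(combo, 0)
    if value > 0:
        print(combo, value)   # only these need to appear in the relocation phase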
I have made a loop that creates a variable, expectedgpa.
So now I have 1,000 variables for each observation, labeled expectedgpa1, expectedgpa2...expectedgpa1000.
I want to get the average and standard deviation for all the expectedgpas for each observation.
So if I have this
Joe 1 2 1 2 4
Sally 2 4 2 4 3
Larry 3 3 3 3 3
I want a variable returned that gives
Joe 2
Sally 3
Larry 3
Any help?
First, for future questions:
Please post code showing what you've tried. Your question shows no research effort.
Second, to clarify the terminology:
You created 1000 variables, each one corresponding to some expected GPA. Each observation corresponds to a different person. You want, as a result, three variables: one with the person's id and another two with the mean and SD of the GPA (by person).
This is my interpretation, at least.
One solution involves reshaping your data:
clear all
set more off
input ///
str5 id exgpa1 exgpa2 exgpa3 exgpa4 exgpa5
Joe 1 2 1 2 4
Sally 2 4 2 4 3
Larry 3 3 3 3 3
end
list
reshape long exgpa, i(id) j(exgpaid)
collapse (mean) mexgpa=exgpa (sd) sdexgpa=exgpa, by(id)
list
Instead of collapse, you can also run by id: summarize exgpa after the reshape, but this doesn't create new variables.
See help reshape, help collapse and help summarize for details.
You should not have created 1000 new variables without a strategy for how you were going to analyse them!
You could also use egen functions rowmean() and rowsd() and keep the same data structure.
A review of working "rowwise" in Stata is accessible at http://www.stata-journal.com/sjpdf.html?articlenum=pr0046
Wikipedia gives this example
Identifier  Gender       Bitmap F  Bitmap M
1           Female       1         0
2           Male         0         1
3           Male         0         1
4           Unspecified  0         0
5           Female       1         0
But I do not understand this.
How is this an index in the first place? Isn't an index supposed to point to rows (using rowids) given the key?
What would be the typical queries where such indexes are useful? How are they better than B-tree indexes? I know that if we use a B-tree index on Gender here, we will get a lot of results if, for example, we look for Gender = Male, which then need to be filtered further (so not very useful). How does a bitmap improve the situation?
A better representation of a bitmap index is the following. Given the sample above:
Identifier  Gender       RowID
1           Female       R1
2           Male         R2
3           Male         R3
4           Unspecified  R4
5           Female       R5
a bitmap index on the gender column would (conceptually) look like this:
Gender       R1  R2  R3  R4  R5
Female       1   0   0   0   1
Male         0   1   1   0   0
Unspecified  0   0   0   1   0
Bitmap indexes are used when the number of distinct values in a column is relatively low (consider the opposite, where all values are unique: the bitmap index would be as wide as the table has rows, and just as long, making it essentially one big identity matrix).
So with this index in place, for a query like
SELECT * FROM table1 WHERE gender = 'Male'
the database looks for a match in the gender values in the index, finds all the rowids where the bit was set to 1, and then goes and gets the table results.
A query like:
SELECT * FROM table1 WHERE gender IN ('Male', 'Unspecified')
would get the 1 bits for Male, the 1 bits for Unspecified, do a bitwise-OR then go get the rows where the resulting bits are 1.
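Conceptually (this is only an illustration, not how the engine is actually implemented), that IN query does something like the following, shown here in Python with bit i standing for rowid R(i+1):

# Conceptual bitmaps from the index above; bit i corresponds to rowid R(i+1).
FEMALE      = 0b10001   # R1, R5
MALE        = 0b00110   # R2, R3
UNSPECIFIED = 0b01000   # R4

# WHERE gender IN ('Male', 'Unspecified'): OR the two bitmaps...
mask = MALE | UNSPECIFIED
# ...then fetch the rows whose bit is set.
rowids = [f"R{i + 1}" for i in range(5) if mask >> i & 1]
print(rowids)   # ['R2', 'R3', 'R4']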
So, the advantages of using a bitmap index over a b*tree index are storage (with low cardinality, bitmap indexes are pretty compact), and the ability to do bitwise operations before resolving the actual rowids which can be pretty quick.
Note that bitmap indexes can have performance implications with inserts/deletes (conceptually, you add/remove a column to/from the bitmap and rejig it accordingly...), and can create a whole lot of contention as an update on a row can lock the entire corresponding bitmap entry and you can't update a different row (with the same bitmap value) until the first update is committed/rolled back.
The benefit comes when filtering on multiple columns: the corresponding indexes can then be merged with bitwise operations before actually selecting the data.
If you have gender, eye_colour, hair_colour
then the query
select * from persons where
gender = 'male' and
(eye_colour = 'blue' or hair_colour = 'blonde')
would first compute a bitwise OR between the eye_colour['blue'] index and the hair_colour['blonde'] index, and finally a bitwise AND between that result and the gender['male'] index. This operation is very fast both computationally and in terms of I/O.
The resulting bit stream would be used for picking the actual rows.
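A toy Python sketch of that evaluation order, with bitmaps invented for an imaginary six-row persons table (bit i = row i):

# Made-up bitmaps; bit i corresponds to row i of an imaginary persons table.
gender_male = 0b101011
eye_blue    = 0b001110
hair_blonde = 0b100100

# gender = 'male' AND (eye_colour = 'blue' OR hair_colour = 'blonde')
mask = gender_male & (eye_blue | hair_blonde)
matching_rows = [i for i in range(6) if mask >> i & 1]
print(matching_rows)   # [1, 3, 5] for these made-up bitmaps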
Bitmap indexes are typically used in "star joins" in data warehouse applications.
As indicated in the Wikipedia article, they use bitwise operations, which can perform better than comparing data types such as integers, so the short answer is increased speed of queries.
Theoretically, it should take fewer computations and less time to select all males or all females from your example.
Just thinking about how this works under the hood should make it obvious why this is faster. A bit is logically either true or false. If you run a query with a WHERE clause, the condition eventually evaluates to true or false for each record in order to determine whether to include it in your results.
Preface - the rest of this is meant to be in layman's terms and non-techie.
So the next question is what does it take to evaluate to true? Even comparing numeric values means that the computer has to...
Allocate memory for the value you want to evaluate
Allocate memory for the control value
Assign the value to each (count this as two steps)
Compare the two - for a numeric this should be quick, but for strings there are more bytes to compare
Assign the result to a 0 (false) or 1 (true) value
Repeat if you're using a multi-part WHERE clause such as "this = this AND that = that"
Perform bitwise operations on the results generated in step 5
Come up with the final value
Deallocate the memory allocated in steps 1-3
But using bitwise logic, you're just looking at 0 (false) and 1 (true) values. 90% of the overhead for the comparison work is eliminated.
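A very rough Python illustration of the difference (this glosses over what a real engine does, but shows where the per-row comparison work goes):

rows = [("Ann", "Female"), ("John", "Male"), ("Jacob", "Male"),
        ("Pieter", "Unspecified"), ("Elise", "Female")]

# Without an index: compare the gender string in every row at query time.
matches = [i for i, (_, gender) in enumerate(rows) if gender == "Male"]

# With a bitmap index: the comparisons were done once, when the index was
# built; answering the query is just reading bits.
MALE_BITMAP = 0b00110   # bit i = row i
matches_from_bitmap = [i for i in range(len(rows)) if MALE_BITMAP >> i & 1]

print(matches, matches_from_bitmap)   # [1, 2] [1, 2]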