I have a simple table as follows:
day  order_id  customer_id
1    1         1
1    2         1
1    3         2
2    4         1
2    5         1
I want to find the number of unique customers from Day 1 to Day 2; the answer is 2.
But the table is huge and querying it takes a long time, so I want to store aggregated data in another table to reduce the data size and make queries faster. I have created a new table from the table above:
day  uniq_customer
1    2
2    1
Now if I want to find the unique customers from Day 1 to Day 2, I get 2 + 1 = 3, whereas the correct answer is 2.
Is there any workaround that avoids querying the old table?
PS: I am using Druid as a data source.
This depends on the characteristics of your data. For example, if you have a small number of distinct customers and days, you can keep the customers in a bit vector per day. At query time, just OR the bit vectors of the days in the range; the result is the number of set bits. It may be tedious to implement.
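A minimal sketch of that idea in Python, assuming customer ids are small integers that can be used directly as bit positions:

# One integer per day acts as a bit vector: bit i is set if customer i ordered that day.
day_bits = {}  # day -> int bitmask

def record_order(day: int, customer_id: int) -> None:
    day_bits[day] = day_bits.get(day, 0) | (1 << customer_id)

def unique_customers(start_day: int, end_day: int) -> int:
    merged = 0
    for day in range(start_day, end_day + 1):
        merged |= day_bits.get(day, 0)   # OR the per-day bit vectors
    return bin(merged).count("1")        # number of set bits = unique customers

# Data from the example table above:
for day, customer in [(1, 1), (1, 1), (1, 2), (2, 1), (2, 1)]:
    record_order(day, customer)

print(unique_customers(1, 2))  # 2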
If you have a large number of distinct customers and days, chunk the records per customer and sort each chunk by day. Then for each customer, use binary search to get the index of the first row whose day is greater than or equal to the query start and the index of the last row whose day is less than or equal to the query end. The difference between those two indices plus 1 gives you the number of in-range records for that customer, and every customer with at least one in-range record counts toward the unique total. Complexity becomes #customers x 2 x O(log #customerRecords).
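The same idea sketched with Python's bisect module (the per-customer day lists are assumed to be kept sorted):

from bisect import bisect_left, bisect_right

# customer_id -> sorted list of days on which that customer placed an order
days_by_customer = {
    1: [1, 1, 2, 2],
    2: [1],
}

def unique_customers(start_day: int, end_day: int) -> int:
    count = 0
    for days in days_by_customer.values():
        lo = bisect_left(days, start_day)   # first index with day >= start_day
        hi = bisect_right(days, end_day)    # first index with day > end_day
        if hi - lo > 0:                     # at least one order in the range
            count += 1
    return count

print(unique_customers(1, 2))  # 2
print(unique_customers(2, 2))  # 1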
Apache Druid supports the use of approximations for this kind of query. Take a look at the tutorial on using approximations in Druid: https://druid.apache.org/docs/latest/tutorials/tutorial-sketches-theta.html
In Druid you can also partially aggregate into Theta sketches at ingestion time and aggregate them over time or over other grouping dimensions at query time. This is specifically designed to deal with large data volumes, and you can control the accuracy of the approximations.
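To illustrate why sketches fix the non-additive problem, here is a rough sketch using the Apache DataSketches Python package (pip install datasketches); the per-day sketches play the role of the ingestion-time partial aggregates and the union plays the role of the query-time merge. Treat the exact API as an assumption to check against the package documentation:

from datasketches import update_theta_sketch, theta_union

# Build one partial sketch per day (analogous to ingestion-time rollup).
orders = [(1, "cust-1"), (1, "cust-1"), (1, "cust-2"), (2, "cust-1"), (2, "cust-1")]
per_day = {}
for day, customer in orders:
    per_day.setdefault(day, update_theta_sketch()).update(customer)

# Query time: union the per-day sketches for Day 1..2 instead of summing counts.
union = theta_union()
for day in (1, 2):
    union.update(per_day[day])

print(round(union.get_result().get_estimate()))  # ~2, not 3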
Looking to create pricing groups based on multiple criteria. Each group could have multiple items within it. I'm struggling with automatically creating the name of each group. I estimate there should be about 6.5K pricing groups out of 14K items.
Below are the criteria:
QTY per case - the number of bottles in a case
Size - the size of the bottle
Family Brand - contains a group of like items
Code - CS1 - my unique code for each group, which contains each of the above and the lowest possible case price.
[screenshot of the item list, including the desired "Thinking" column]
The "Thinking" column is how I want each group to look, but how do I do this with 14K items quickly?
If I understood correctly, your pricing group name consists of two parts: a simple combination of columns, and a "special" column that has to be counted.
Part 1 is simple: =C2&"-"&B2&"-"&A2&"-"
To make Part 2 easier, sort the data by Part 1 and then by CODE-CS1.
Having done this, you can use helper columns. If Part 1 is in column X and CODE-CS1 is in column Y, you can use a formula for
Part 2 (column Z, a plain counter): =IF(X1=X2;IF(Y1=Y2;Z1;Z1+1);1)
That means: if Part 1 changes, the counter restarts at 1; otherwise, if CODE-CS1 changes, it increments; otherwise it keeps the last number.
The resulting code would be =X2&"T"&Z2
This is untested and I use German Excel (note the semicolon separators), so it may need some adaptation, but in general it should work.
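If taking the 14K rows out of Excel is an option, the same restart-and-increment counter can be computed in one pass; here is a rough pandas sketch (not the Excel approach above, and the column names are made up, so they would need to match the real sheet):

import pandas as pd

# Hypothetical sample data; replace with pd.read_excel("items.xlsx").
df = pd.DataFrame({
    "qty_per_case": [12, 12, 6],
    "size": ["750ml", "750ml", "1L"],
    "family_brand": ["BrandA", "BrandA", "BrandB"],
    "code_cs1": ["CS1-001", "CS1-002", "CS1-003"],
})

# Part 1: the shared prefix built from the three criteria columns.
df["part1"] = df["family_brand"] + "-" + df["size"] + "-" + df["qty_per_case"].astype(str) + "-"

# Part 2: a counter that restarts for each new prefix and increments
# whenever the CS1 code changes within that prefix.
df = df.sort_values(["part1", "code_cs1"])
df["part2"] = "T" + df.groupby("part1")["code_cs1"].transform(
    lambda s: pd.factorize(s)[0] + 1).astype(str)

df["pricing_group"] = df["part1"] + df["part2"]
print(df[["code_cs1", "pricing_group"]])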
I have a dataset containing a number of persons who have been involved in an accident. Each person has been in an accident at a different time, and I have coded a variable start_week which indicates in what week number after a certain date (January 1st 2011) the accident occurred.
For each individual I also have a variable for each week after January 1st 2011 that shows whether or not this individual was hospitalized that week. I now need to count how many weeks a person has been hospitalized in the weeks from the accident onward.
The desired result should be a column like sum_week that sums the number of hospitalized weeks starting from the week given in start_week.
Id  start_week  week_1  week_2  week_3  week_4  sum_week
1   2           1       0       1       1       2
2   3           1       0       0       1       1
I think this can be done using an array, but I have no idea how. If it isn't possible to count across columns based on the variable start_week, I am planning on transposing my data. I would however prefer if this could be done without having to transpose my data.
Any help is much appreciated!
Just use START_WEEK as the initial value in the DO loop you use to step through the array.
data want;
  set have;
  array week_[4];                      /* maps to week_1 - week_4 */
  sum_week = 0;
  /* start summing at the accident week */
  do index = start_week to dim(week_);
    sum_week + week_[index];
  end;
  drop index;
run;
I have a system that stores measurements from machines with many transducers, once per second. I'm considering using Cassandra and would like to store the 1 second sample of machine state measurements in a single table, which would be something like:
create table inst_samples (
    machine_id text,
    batch_id int,
    sample_time timestamp,
    var1 double,
    var2 double,
    .....
    varN double,
    PRIMARY KEY ((machine_id, batch_id), sample_time)
);
There are approximately 20 machines with 400 state variables each and the batch_id will update every 1-2 hours. I have reviewed the documentation on the 2 billion cells maximum per table and noted similar questions
here: "What are the maximum number of columns allowed in Cassandra" and here: "Cassandra has a limit of 2 billion cells per partition, but what's a partition?"
If I am understanding this limit correctly I would hit the 2 billion cell limit for a single machine in the inst_samples table in approximately 60 days?
(2e9 cells / 400 cols/row) / (3600 rows / hour) / (24 hours / day) =~ 58 days?
I am a total Cassandra newbie. Thanks.
The 2 billion limit is per partition, and if you have a good data model you should have many partitions. In practice it's recommended to keep the number of cells per partition under control - something like no more than 100,000 cells per partition - otherwise there could be performance problems, etc. The actual limit depends on multiple factors, such as the Cassandra version, what queries are executed, etc.
In your case the partition key is machine_id + batch_id, which for a batch size of 2 hours gives 400 x 7200 = 2,880,000 - almost 3 million cells per partition. It may still work (it would be better to set the batch size to 1 hour), but it will require testing on real hardware - this could be done, for example, with NoSQLBench.
There are also other ways to optimize your data model - for example, instead of allocating a separate column for every variable, use a frozen<map<text, double>>. In this case all measurements of a sample are stored as a single cell. The drawback is that you can't change individual values without reading the map and re-inserting it with the changed value. Another drawback is that you'll need to read all measurements at once - but this could be OK.
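A minimal sketch of that map-based variant, using the DataStax Python driver; the keyspace, table and column names are assumptions for illustration:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("telemetry")  # hypothetical keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS inst_samples_map (
        machine_id   text,
        batch_id     int,
        sample_time  timestamp,
        measurements frozen<map<text, double>>,  -- all ~400 variables in one cell
        PRIMARY KEY ((machine_id, batch_id), sample_time)
    )
""")

# One second's worth of measurements is written (and later read back) as a whole.
session.execute(
    "INSERT INTO inst_samples_map (machine_id, batch_id, sample_time, measurements) "
    "VALUES (%s, %s, toTimestamp(now()), %s)",
    ("machine-07", 42, {"var1": 1.23, "var2": 4.56}),
)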
So, I am working on a feature in a web application. The problem is like this:
I have four different entities - let's say Item1, Item2, Item3 and Item4. The feature has two phases. In the first phase (Choose entities), the user can choose multiple items for each entity, and for every resulting combination I need to do some calculation. In the second phase (Relocate), based on the calculation done in the first phase, for each combination I have to let the user choose another combination, to whose row the value of the first combination gets moved.
Here's the data model for further clarification -
EntityCombinationTable
(
    Id,
    Item1_Id,
    Item2_Id,
    Item3_Id,
    Item4_Id
)

ValuesTable
(
    Combination_Id,
    Value
)
So suppose I have the following values in both tables:
EntityCombinationTable
Id  Item1_Id  Item2_Id  Item3_Id  Item4_Id
1   1         1         1         1
2   1         2         1         1
3   2         1         1         1
4   2         2         1         1
ValuesTable
Combination_Id  Value
1               10
2               0
3               0
4               20
So if in the first phase I choose (1, 2) for Item1, (1, 2) for Item2, and 1 for both Item3 and Item4, the total number of combinations is 2*2*1*1 = 4.
Then in the second phase, for each combination that has a value greater than zero, I have to let the user choose a different combination to which the value gets relocated.
For example, since only the combinations with Id 1 and 4 have values greater than zero, only two relocation combinations need to be shown in the second dialog. So if the user chooses (3,3,3,3) and (4,4,4,4) as relocation combinations in the second phase, new rows need to be inserted into EntityCombinationTable for (3,3,3,3) and (4,4,4,4), and the values of (1,1,1,1) and (2,2,1,1) are relocated to the rows corresponding to (3,3,3,3) and (4,4,4,4) in the ValuesTable.
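To make the flow just described concrete, here is a small Python sketch with made-up selections and an in-memory dictionary standing in for the two tables; it only illustrates how the chosen items expand into combinations and how the non-zero ones move on to the relocation phase:

from itertools import product

# User selections per entity (the example above).
chosen = {
    "Item1": [1, 2],
    "Item2": [1, 2],
    "Item3": [1],
    "Item4": [1],
}

# Stand-in for EntityCombinationTable + ValuesTable: combination tuple -> value.
values = {
    (1, 1, 1, 1): 10,
    (1, 2, 1, 1): 0,
    (2, 1, 1, 1): 0,
    (2, 2, 1, 1): 20,
}

# Phase 1: expand the selections into all combinations (2*2*1*1 = 4 here).
combinations = list(product(chosen["Item1"], chosen["Item2"],
                            chosen["Item3"], chosen["Item4"]))

# Phase 2 only needs the combinations whose value is greater than zero.
relocatable = [c for c in combinations if values.get(c, 0) > 0]
print(relocatable)  # [(1, 1, 1, 1), (2, 2, 1, 1)]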
So the problem is that each entity can have up to 100 items or even more, so in the worst case the total number of combinations can be 10^8. That would put a very heavy load on the database (inserting and updating a huge number of rows), and generating all the combinations in application code would also take substantial time.
I have thought about an alternative approach: not storing the items as combinations, but keeping a separate table for each entity and building the combinations at runtime. That would also cause performance issues, because there are a lot of other stages where I might need the combinations, so I would have to generate all of them every time.
I have also thought about creating a key-value style table where I keep each combination as a string. But this approach doesn't actually reduce the number of rows to be inserted, only the number of columns.
So my question is: is there any better approach to this kind of situation, where I can keep track of the combinations and manipulate them in an optimized way?
Note: I am not sure if this helps, but a lot of the rows in the values table will probably have zero as the value, so in the second phase we would need to show far fewer rows than the actual number of possible combinations.
How can I get a random element out of a Google App Engine datastore using Objectify? Should I fetch all of an entity's keys and choose randomly from them or is there a better way?
Assign a random number between 0 and 1 to each entity when you store it. To fetch a random record, generate another random number between 0 and 1 and query for the entity with the smallest stored value greater than that.
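A rough sketch of that pattern - not Objectify, but the same idea expressed with the Python NDB client purely for illustration; the RandomizedEntity model and its rand property are made up for the example, and the calls must run inside an NDB client context:

import random
from google.cloud import ndb

class RandomizedEntity(ndb.Model):
    rand = ndb.FloatProperty(indexed=True)  # assigned once, at write time
    payload = ndb.StringProperty()

def save(payload):
    RandomizedEntity(rand=random.random(), payload=payload).put()

def fetch_random():
    r = random.random()
    # Smallest stored rand greater than r; wrap around if we ran off the end.
    entity = (RandomizedEntity.query(RandomizedEntity.rand > r)
              .order(RandomizedEntity.rand).get())
    if entity is None:
        entity = RandomizedEntity.query().order(RandomizedEntity.rand).get()
    return entity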
You don't need to fetch all.
For example:
countall = ofy.query(X.class).count();
// http://groups.google.com/group/objectify-appengine/browse_frm/thread/3678cf34bb15d34d/82298e615691d6c5?lnk=gst&q=count#82298e615691d6c5
rnd = random number in [0..countall]
ofy.query(X.class).order("-date").limit(rnd); // "-date" or some other chronologically indexed field
The last id returned is yours.
(On average you fetch 50% of the records, or at least the first read is about 50% smaller.)
Improvements (to keep a smaller key table in the cache):
After the first read, remember every X-th element: cache their ids and positions. Next time, start the query from the nearest cached id (the maximum .limit(rnd % X) will be X - 1).
Random is just random; if it doesn't need to be close to 100% fair, you can guess based on the chronological field (for example, if you have 1000 records over 10 days, for random number 501 select the second element greater than the fifth day).
Another option, if you have a chronological field such as a date: fetch the elements between a random date and that random date + 1 (you need to know the first and the last date), then select randomly among the fetched records. If the query comes back empty, fall back to a greater-than query, etc.
Quoted from this post about selecting some random elements from an Objectified datastore:
If your ids are sequential, one way would be to randomly select 5
numbers from the id range known to be in use. Then use a query with an
"in" filter().
If you don't mind the 5 entries being adjacent, you can use count(),
limit(), and offset() to randomly find a block of 5 entries.
Otherwise, you'll probably need to use limit() and offset() to
randomly select one entry out at a time.
-- Josh
I pretty much adapted the algorithm provided by Matejc. However, three things:
1. Instead of using count() or the datastore service factory (DatastoreServiceFactory.getDatastoreService()), I have an entity that keeps track of the total count of the entities I am interested in. The reasons for this approach:
   a. count() can be expensive when you are dealing with a lot of objects.
   b. You can't test the datastore service factory locally - testing in prod is just bad practice.
2. Generating the random number: ThreadLocalRandom.current().nextLong(1, maxRange)
3. Instead of using limit(), I use offset, so I don't have to worry about "sorting".