Random selection of weighted items

Random selection of weighted items - arrays

My model TicketBuyer columns include ticket_buyer name and number_of_tickets purchased. I want to randomly select a winning ticket. I think my preference is to create another model (SelectTable) that has the ticket_buyer name replicated over the number of rows equal to the number_of_tickets purchased, thereby giving equal weight to each record. I can then just run a simple sort and pick the first record in the new table. I am having trouble auto creating the table with the correct number of rows for each ticket_buyer. Of course, there may be a more eloquent/ efficient way to do this as well. Advice is much appreciated.

If speed isn't a problem you can do it without creating any extra data structures.
Find the maximum number_of_tickets bought by anyone
Randomly select a ticket buyer
Do another draw, with a number_of_tickets / max_number_of_tickets chance of them becoming the winner
If they fail the second draw, randomly select another ticket buyer and repeat the process until someone wins
Unless there is some crazy distribution like one person buys a million tickets and everyone else buys one, this shouldn't take too long.
Pseudo-code:
max_tickets = max(table:number_of_tickets)
while true do:
// select a random buyer
buyer = select random row from table
// assume random(n) returns an integer number from 0 to n - 1 inclusive:
if random(max_tickets) < buyer.number_of_tickets then
return buyer // we have a winner
end

Related

Excel: Need to Generate IDs based on multiple criteria with repeating IDs

Looking to create pricing groups bases on multiple criteria. Each group could have multiple items within the group. I'm struggling with the autocreation the naming of each group. I estimate there should be about 6.5K pricing groups out of 14K items.
Below is the criteria -
QTY per case - is the number of bottles in a case
Size - size of the bottle
Family Brand - contains a group of like items
Code - CS1 - This is my unique code for each group that contains each of the above and lowest possible case price.
enter image description here
The "Thinking" column is how I want each group to look, but how do I do this with 14K items quickly?

If I understood correctly your pricing group name consists of two parts: a simple combination of columns and a "special" column, that should be counted.
Part 1 is simple: =C2&"-"&B2&"-"&A1&"-"
To make Part 2 easier you could sort, sorting fields Part 1, CODE-CS1.
After have done this you could use helping columns. If Part 1 is in column x and code-CS1 in column y you could find a formula for
Part 2 (column z): ="T"&IF(X1=X2;IF(Y1=Y2;Z1;Z1+1);1)
That means: If Part 1 is changing your counter starts with T1, if not so if your CODE CS1 changes, it counts, if not, so it keeps last number.
the result code would be =X2&Z2
It is untested and I use german excel, maybe the code doesn't work without any adaption, but in general it should work

How many trucks came empty but bought something

I think this is the hardest to date I have had to crack - so hard I had a hard time finding a good headline.
So we have a site where trucks come and buy say Gravel, or sand or other building materials.
Sometimes they also unload demolition waste first.
I need to find out a couple of things
how many trucks (and from what companys) came empty
if they came empty what did they buy from us.
what companys are sending full trucks and what are sending empty trucks.
a tope 10 of materials they will drive to us from to buy even when coming empty to our facility.
a list of all the order numbers that they drove to us til fill and came with empty trucks. ( I have distances linked to order numbers, so now I can estimate the value of our products)
The data I have available:
I have a full data set of when what customer buys what and / or pay to deliver.
E.G.:
I can see the parts I need to split the data into I think it should be something like this
find all unique licence plates
somehow map if they bought materials within 30 minutes of
offloading demolition waste (most trucks will come between 2 and 10
times per day)
Present all this data (on a normal day we have about 800 trucks = 2000 lines since they weigh in, weigh out, and then some buy something = 2 more weigh lines)
I can easily find unique licence plates per day (either by formula or by Excel function Data/delete doublets,
but after that I have no clue where to start.
I think I need some sheets in between, where I somehow mark if a material was bought from an "empty truck" and I need a counter for that .. somehow...
Any help on how to get started is appreciated.

It seems like the best way to start is with a helper column (in the following exampes, I have chosen "Column M") to flag whether the truck arrived empty.
In the helper column, you can use something similar to the following formula.
{=IF(ISBLANK(B2),0,IF(C2="In",0,IF(B2=$B$2:$B$13,IF($C$2:$C$13="In",IF($A$2:$A$13>(A2-TIME(0,30,0)),0,1),1),1)))}
This is an array formula, which means you have to press ctrl+shift+enter after pasting it in the cell. Then you can copy that cell down the column.
Just to explain, the first if statement knows the truck is not arriving empty if Column C is 'In'. The second if statement creates an array and tests to see if other the same truck appears in other rows. The third if statement checks to see if the same truck checked 'In' in the matching rows, and the fourth if statement verifies if the time they checked in was less than thirty minutes ago. You can adjust the length by editing the TIME(0,30,0) function. The format is TIME(hours,minuites,seconds). Unless the truck matches all three of the second, third and fourth if statements, it is marked as coming empty.
Once you have this helper column, just about all of your tasks are quite simple.
1a: How many trucks came empty? Sum Column M
1b: How many trucks from what company? Create a unique list of companies. Then create a COUNTIFS formula based on Column M = 1 and Column K = Company. For example, if C32 had Company B then the formula =COUNTIFS($M$2:$M$13,1,$K$2:$K$13,C32) would return 2
1c: How many times did a truck come empty? Similar to 1b, create a unique list of License Plates, then use a COUNTIFS based on Column M = 1 and Column B = License Plate.
2: Similar to 1b, just use a unique list of products tested against Column F
3: Similar to 1b, just create a second column, next to the first that uses =COUNTIFS($M$2:$M$13,0,$K$2:$K$13,C53,$C$2:$C$13,"In") Which tests that Column M reports the truck did not come empty, that matches the company in Column K and that the truck came 'In' so you don't double count the same truck when it goes 'out'
4: Just sort list created by number 2. You can highlight the range, right-click and select "Sort" > "Custom Sort", then select the column you want to sort on and largest to smallest.
5: There are a couple of different ways, you could do this. The formula
{=TEXTJOIN(", ",TRUE,IF($M$2:$M$13=1,$J$2:$J$13,""))}
(again, entered as an array formula)
would create a comma separated list of order numbers. An alternative if you want a column of order numbers (but would only work if they are actually numbers), is to paste the formula {=MAX(IF($M$2:$M$13=1,$J$2:$J$13,))} in the first row of the column (in my example, its O2) and then {=MAX(IF($M$2:$M$13=1,IF($J$2:$J$13<O2,$J$2:$J$13,)))} in the row below (change the reference to O2 if you pasted it in a different spot)(again, note that both of these are array formulas). Then copy and paste the second formula down the column. When order numbers of trucks that came in empty are exhausted, the formula will report 0.

Formula for sales volume in 'N' period

My report layout is linked to a stored procedure in SQL, which returns data regarding inventory from our SAP B1 system. This identifies appropriate Minimum and Maximum stock holding values.
I draw in several fields into calculated fields. A pair of parameter fields input a 'no. of periods' value (such as 6 months) and specify an item group from the list in SAP. I currently get the sum total of items invoiced out, minus the sum total of credited in all the way back to record one in the DB.
I need to be able to get a new sales total which is sum total of items invoiced out minus sum total of credits in, divided by the no. of periods value to get a new average sales value. but I simply can't think about how to go about creating the field.
How can I get the desired behavior?

Getting random entry from Objectify entity

How can I get a random element out of a Google App Engine datastore using Objectify? Should I fetch all of an entity's keys and choose randomly from them or is there a better way?

Assign a random number between 0 and 1 to each entity when you store it. To fetch a random record, generate another random number between 0 and 1, and query for the smallest entity with a random value greater than that.

You don't need to fetch all.
For example:
countall = query(X.class).count()
// http://groups.google.com/group/objectify-appengine/browse_frm/thread/3678cf34bb15d34d/82298e615691d6c5?lnk=gst&q=count#82298e615691d6c5
rnd = Generate random number [0..countall]
ofy.query(X.class).order("- date").limit(rnd); //for example -date or some chronic indexed field
Last id is your...
(in average you fatch 50% or at lest first read is in average 50% less)
Improvements (to have smaller key table in cache)!
After first read remember every X elements.
Cache id-s and their position. So next time query condition from selected id further (max ".limit(rnd%X)" will be X-1).
Random is just random, if it doesn't need to be close to 100% fair, speculate chronic field value (for example if you have 1000 records in 10 days, for random 501 select second element greater than fifth day).
Other options, if you have chronic field date (or similar), fetch elements older than random date and younger then random date + 1 (you need to know first date and last date). Second select random between fetched records. If query is empty select greater than etc...

Quoted from this post about selecting some random elements from an Objectified datastore:
If your ids are sequential, one way would be to randomly select 5
numbers from the id range known to be in use. Then use a query with an
"in" filter().
If you don't mind the 5 entries being adjacent, you can use count(),
limit(), and offset() to randomly find a block of 5 entries.
Otherwise, you'll probably need to use limit() and offset() to
randomly select one entry out at a time.
-- Josh

I pretty much adapt the algorithm provided Matejc. However, 3 things:
Instead of using count() or the datastore service factory (DatastoreServiceFactory.getDatastoreService()), I have an entity that keep track of the total count of the entities that I am interested in. The reason for this approach is that:
a. count() could be expensive when you are dealing with a lot of objects
b. You can't test the datastore service factory locally...testing in prod is just a bad practice.
Generating the random number: ThreadLocalRandom.current().nextLong(1, maxRange)
Instead of using limit(), I use offset, so I don't have to worry about "sorting."

Web stats: Calculating/estimating unique visitors for arbitary time intervals

I am writing an application which is recording some 'basic' stats -- page views, and unique visitors. I don't like the idea of storing every single view, so have thought about storing totals with a hour/day resolution. For example, like this:
Tuesday 500 views 200 unique visitors
Wednesday 400 views 210 unique visitors
Thursday 800 views 420 unique visitors
Now, I want to be able to query this data set on chosen time periods -- ie, for a week. Calculating views is easy enough: just addition. However, adding unique visitors will not give the correct answer, since a visitor may have visited on multiple days.
So my question is how do I determine or estimate unique visitors for any time period without storing each individual hit. Is this even possible? Google Analytics reports these values -- surely they don't store every single hit and query the data set for every time period!?
I can't seem to find any useful information on the net about this. My initial instinct is that I would need to store 2 sets of values with different resolutions (ie day and half-day), and somehow interpolate these for all possible time ranges. I've been playing with the maths, but can't get anything to work. Do you think I may be on to something, or on the wrong track?
Thanks,
Brendon.

If you are OK with approximations, I think tom10 is onto something, but his notion of random subsample is not the right one or needs a clarification. If I have a visitor that comes on day1 and day2, but is sampled only on day2, that is going to introduce a bias in the estimation. What I would do is to store full information for a random subsample of users (let's say, all users whose hash(id)%100 == 1). Then you do the full calculations on the sampled data and multiply by 100. Yes tom10 said about just that, but there are two differences: he said "for example" sample based on the ID and I say that's the only way you should sample because you are interested in unique visitors. If you were interested in unique IPs or unique ZIP codes or whatever you would sample accordingly. The quality of the estimation can be assessed using the normal approximation to the binomial if your sample is big enough. Beyond this, you can try and use a model of user loyalty, like you observe that over 2 days 10% of visitors visit on both days, over three days 11% of visitors visit twice and 5% visit once and so forth up to a maximum number of day. These numbers unfortunately can depend on time of the week, season and even modeling those, loyalty changes over time as the user base matures, changes in composition and the service changes as well, so any model needs to be re-estimated. My guess is that in 99% of practical situations you'd be better served by the sampling technique.

You could store a random subsample of the data, for example, 10% of the visitor IDs, then compare these between days.
The easiest way to do this is to store a random subsample of each day for future comparisons, but then, for the current day, temporarily store all your IDs and compare them to the subsampled historical data and determine the fraction of repeats. (That is, you're comparing the subsampled data to a full dataset for a given day and not comparing two subsamples -- it's possible to compare two subsamples and get an estimate for the total but the math would be a bit trickier.)

You don't need to store every single view, just each unique session ID per hour or day depending on the resolution you need in your stats.
You can keep these log files containing session IDs sorted to count unique visitors quickly, by merging multiple hours/days. One file per hour/day, one unique session ID per line.
In *nix, a simple one-liner like this one will do the job:
$ sort -m sorted_sid_logs/2010-09-0[123]-??.log | uniq | wc -l
It counts the number of unique visitors during the first three days of September.

You can calculate the uniqueness factor (UF) on each day and use it to calculate the composite (week by example) UF.
Let's say that you counted:
100 visits and 75 unique session id's on monday (you have to store the sessions ID's at least for a day, or the period you use as unit).
200 visits and 100 unique session id's on tuesday.
If you want to estimate the UF for the period Mon+Tue you can do:
UV = UVmonday + UVtuesday = TVmonday*UFmonday + TVtuesday*UFtuesday
being:
UV = Unique Visitors
TV = Total Visits
UF = Uniqueness Factor
So...
UV = (Sum(TVi*UFi))
UF = UV / TV
TV = Sum(TVi)
I hope it helps...
This math counts two visits of the same person as two unique visitors. I think it's ok if the only way you have to identify somebody is via the session ID.