So I play Heroes of Newerth. I want to build a statistical program that shows which team of 5 heroes wins most often against another team of 5. Given that there are 85 heroes, a matchup is 85 choose 5 vs 80 choose 5, which is a lot of combinations.
Essentially I'm going to take the stats data the game servers let me pull and just record a 1 in an array entry keyed by the heroes when they get a win: [1,2,3,4,5][6,7,8,9,10][W:1][L:0]
So after I parse the historical game data and build the array, I can enter any 5 heroes I want to see, and get back all the relevant game data telling me which 5-hero lineups have won/lost the most against them.
What I need help starting with is a simple algorithm to write out my array. Here's the kind of output I need (I have simplified this to heroes 1-10; in the real code I can just change 10 to x for however many heroes there are):
[1,2,3,4,5][6,7,8,9,10]
[1,2,3,4,6][5,7,8,9,10]
[1,2,3,4,7][5,6,8,9,10]
[1,2,3,4,8][5,6,7,9,10]
[1,2,3,4,9][5,6,7,8,10]
[1,2,3,4,10][5,6,7,8,9]
[1,2,3,5,6][4,7,8,9,10]
[1,2,3,5,7][4,6,8,9,10]
[1,2,3,5,8][4,6,7,9,10]
[1,2,3,5,9][4,6,7,8,10]
[1,2,3,5,10][4,6,7,8,9]
[1,2,3,6,7][4,5,8,9,10]
[1,2,3,6,8][4,5,7,9,10]
[1,2,3,6,9][4,5,7,8,10]
[1,2,3,6,10][4,5,7,8,9]
[1,2,3,7,8][4,5,6,9,10]
[1,2,3,7,9][4,5,6,8,10]
[1,2,3,7,10][4,5,6,8,9]
[1,2,3,8,9][4,5,6,7,10]
[1,2,3,8,10][4,5,6,7,9]
[1,2,3,9,10][4,5,6,7,8]
[1,2,4,5,6][3,7,8,9,10]
[1,2,4,5,7][3,6,8,9,10]
[1,2,4,5,8][3,6,7,9,10]
[1,2,4,5,9][3,6,7,8,10]
[1,2,4,5,10][3,6,7,8,9]
[1,2,4,6,7][3,5,8,9,10]
[1,2,4,6,8]...
[1,2,4,6,9]
[1,2,4,6,10]
[1,2,4,7,8]
[1,2,4,7,9]
[1,2,4,7,10]
[1,2,4,8,9]
[1,2,4,8,10]
[1,2,4,9,10]
...
You get the idea: no repeats, and order doesn't matter within a team. The order of the two teams doesn't matter either, which essentially cuts the list in half. I just need a list of all the combinations of teams that can be played against each other.
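For illustration, here is a minimal sketch of that enumeration, assuming Python (itertools does the heavy lifting; fixing the lowest hero on team A is what cuts the list in half):

from itertools import combinations

n = 10  # simplified pool; swap in the real hero count
pool = list(range(1, n + 1))

# Fixing the lowest hero on team A lists each 5-vs-5 split exactly once,
# since swapping the two teams would give the same matchup.
for rest in combinations(pool[1:], 4):
    team_a = [pool[0]] + list(rest)
    team_b = [h for h in pool if h not in team_a]
    print(team_a, team_b)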
EDIT: additional thinking...
After quite a bit of thinking, I have come up with some ideas. Writing out the entire array of [85*84*83*82*81][80*79*78*77*76] possible combinations of characters up front would have to be made larger every time new heroes are introduced, just to keep the array relevant and constantly updated.
Instead, I can parse the information as I read it from the server and build the array from there. It is much simpler to create an element in the array only when one is not found, i.e. when that combination has never been played before. Parsing the data is then a single pass, building the array as it goes. Yes, it might take a while, but the values that are created will be worth the wait, and it can be done over time too: start with a small test case, say 1000 games, and work up to the full number of matches that have been played. Another idea is to start from the current point in time and build the database from there. There is no need to go back to the first games ever played, given how much the heroes have changed over that time frame; going back 2-3 months would give it some foundation and reliable data, and with each passing day it only gets more accurate.
Example parse and build of the array:
get match(x)
if (length < 15) x++;            // decide the minimum match length we care about (15-25 minutes); discard anything under 15 for sure
if (players != 10) x++;          // skip the match because it didn't finish with 10 players
if (map != normal_mm_map) x++;   // rule out non-matchmaking maps, e.g. Mid Wars
if (mode != mm) x++;             // rule out custom games
// and so forth

match_psr = match(x).get(average_psr);
match_winner = match(x).get(winner);

// hero ids of winners
Wh1 = match(x).get(winner.player1.hero_id);
Wh2 = match(x).get(winner.player2.hero_id);
Wh3 = match(x).get(winner.player3.hero_id);
Wh4 = match(x).get(winner.player4.hero_id);
Wh5 = match(x).get(winner.player5.hero_id);
// hero ids of losers
Lh1 = match(x).get(loser.player1.hero_id);
Lh2 = match(x).get(loser.player2.hero_id);
Lh3 = match(x).get(loser.player3.hero_id);
Lh4 = match(x).get(loser.player4.hero_id);
Lh5 = match(x).get(loser.player5.hero_id);

// some sorting algorithm to put Wh1-Wh5 in ascending order of hero id
// some sorting algorithm to put Lh1-Lh5 in ascending order of hero id

key = ([Wh1,Wh2,Wh3,Wh4,Wh5],[Lh1,Lh2,Lh3,Lh4,Lh5]);
if (array(key) != null)
    array(key).wins += 1;        // and accumulate psr here as well
else
    array.add_element(key, wins = 1 /* plus psr info */);
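As a sketch of that lookup-or-create step, a dictionary keyed by the two sorted hero tuples creates entries automatically the first time a matchup is seen. Assuming Python, and with matches, hero_id, and average_psr as hypothetical stand-ins for whatever the server data actually provides:

from collections import defaultdict

# stats[(winning_lineup, losing_lineup)] -> [wins_recorded, psr_total]
stats = defaultdict(lambda: [0, 0])

def record_match(match):
    winners = tuple(sorted(p.hero_id for p in match.winner.players))
    losers = tuple(sorted(p.hero_id for p in match.loser.players))
    entry = stats[(winners, losers)]  # created on first sight
    entry[0] += 1                     # one more recorded win for this matchup
    entry[1] += match.average_psr     # accumulate psr for later averaging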
Any thoughts?
Encode each actor in the game using a simple scheme: 0 ... 84.
You can maintain an 85×85 2D matrix for the actors in the game.
Initialize each entry in this array to zero.
Now use just the upper triangular portion of your matrix.
So, for any two players P1,P2 you have a unique entry in the array, say array[small(p1,p2)][big(p1,p2)].
array(p1,p2) signifies how much p1 won against p2.
Your event loop can look like this:
For each stat "H = (H1,H2,H3,H4,H5) won against L = (L1,L2,L3,L4,L5)" do
    For each pair (h,l) in H×L do
        if h < l
            increment array[h][l] by one
        else
            decrement array[l][h] by one
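A minimal Python sketch of that loop, assuming the hero ids have already been parsed into two 5-tuples (all names here are placeholders):

N_HEROES = 85
matrix = [[0] * N_HEROES for _ in range(N_HEROES)]  # only the upper triangle [small][big] is used

def record(winners, losers):
    # winners, losers: iterables of 5 hero ids each, encoded 0..84
    for h in winners:
        for l in losers:
            if h < l:
                matrix[h][l] += 1  # positive: the lower-id hero wins this pairing
            else:
                matrix[l][h] -= 1  # negative: the higher-id hero wins this pairing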
Now, at the end of this loop, you have aggregate information about how the players fare against each other. The next step is an interesting optimization problem.
Wrong approach: select 5 fields in this matrix such that no two fields share a row or a column and the sum of their absolute values is maximal. I think you can find good optimization algorithms for this problem. Here we would pick five tuples (h1,l1), (h2,l2), (h3,l3), ... where h1's record against l1 is maximized, but you still haven't accounted for how good l1 is against h2.
The easier and correct option is to use brute force on the set of (85*84)C5 tuples.
I read that this question was asked in a software engineer job interview:
If there are 1000 websites and 1000 users, write a program and data structure such that I can run the following queries in real time:
1. Given any user, get the list of all sites he/she has visited.
2. Given any website, get the list of all users who have visited it.
I think they wanted some sort of pseudocode or algorithm design. Can you guys give any tips for this?
One thing is certain - in order to answer both queries, you need to store all the (user, website) pairs recording that a user has visited a website. So what I propose is the following:
You have a structure:
struct VisitPair {
    int websiteId;
    int userId;
    VisitPair* nextForUser;
    VisitPair* nextForWebsite;
};
nextForUser will point to the next pair for the given user, or NULL if there is none; similarly, nextForWebsite will point to the next pair for the website. User and Website will look something like:
struct User {
    char* name;
    VisitPair* firstPair;
};

struct Website {
    char* url;
    VisitPair* firstPair;
};
I assume both websites and users are stored in arrays; say these arrays are websites and users. Now adding a new VisitPair is relatively easy:
void addNewPair(int siteId, int userId) {
    VisitPair* newPair = (VisitPair*)malloc(sizeof(VisitPair));
    /* record which site and user this pair links */
    newPair->websiteId = siteId;
    newPair->userId = userId;
    /* push onto the front of the user's list */
    newPair->nextForUser = users[userId]->firstPair;
    users[userId]->firstPair = newPair;
    /* push onto the front of the website's list */
    newPair->nextForWebsite = websites[siteId]->firstPair;
    websites[siteId]->firstPair = newPair;
}
Printing all users for a website, and all websites for a user, is done by simply iterating over the corresponding list, so you should be able to do that.
In short, what I create is a structure that has two linked lists woven through it. I do not think there can be a solution with better complexity, as this one is linear in the size of the answer for queries and constant for adding a pair.
Hope this helps.
Since both the number of sites and the number of users are bounded and known in advance, you can use a 2D array of dimension 1000 x 1000, with users as one dimension and websites as the other. The array would be a boolean array:
bool tracker[1000][1000];
When user x visits website y, it is marked as 1 (true):
tracker[x][y] = 1;
To return all users who have visited website j, return all entries in column j that have value 1; to return all websites visited by user i, return all entries in row i that have value 1.
The complexity of lookup is O(n), but this approach is space efficient and updates are O(1), unlike the linked-list approach, which would require O(n) to add a user to a website's list or a website to a user's list (but which gives O(1) lookups).
For each website and user, keep a linked list of visitors and websites visited, respectively. Whenever a user visits a website, add an entry to the user's linked list as well as the website's linked list.
This has minimal memory overhead and fast updates and queries.
In the general case with N users and M sites, keep two maps for the queries:
map<user, set<site> > sitesOfUser;
map<site, set<user> > usersOfSite;
When user u visits site s, you update both with:
sitesOfUser[ u ].insert( s );
usersOfSite[ s ].insert( u );
set is used here to avoid duplication. If duplication is OK (or you will take care of it later), you can use just a list and shave another log factor off the update time.
In this case an update takes O( logN + logM ) time (or just O( logN ), see above) and a query takes O( logN ) time.
In your particular case, when the maximal number of sites and users is not too large and is known beforehand (let's say it's K), you can just have two arrays like
set<site> sitesOfUser[ K ];
set<user> usersOfSite[ K ];
Here you get O( logN ) time for an update (or O(1) if duplicated information is not a problem and you use a list or some other linear container), and O(1) time for a query.
Here is a summary of posted answers.
Let m be the number of sites, n the number of users.
For each data structure we give the complexity for update, resp. get.
two arrays of linked lists. O(1), resp. O(len(answer)).
an m×n matrix. O(1), resp. O(m) or O(n). The least memory usage if most users visit most sites, but not optimal in space or time if most users visit only a few sites.
two arrays of sets. O(log m) or O(log n), resp. O(len(answer)).
izomorphius's answer is very close to the linked-list option.
O(len(answer)) is the time required to read the whole answer, but for sets and lists one can get an iterator in O(1), with a next method that is also guaranteed O(1).
I have a data set that will potentially look like this:
user_name
time_1
time_2
time_3
Where the times are different hours on a given day when they are free. There are 22 slots each week, and each user picks three of them and submits them. I will have about 100-150 users, and I'm wondering about the best way to sort them so that people are distributed evenly across the time slots. My best guess for a starting approach is to see what it looks like if every user is put into their first slot (time_1), then their second and third, and compare which gives the best result; from there, look at what happens if a user is added to or removed from a slot and how that affects the overall distribution. Any help would be appreciated, as I haven't done a lot of optimization algorithms.
Regards,
I'm answering because previous answers apparently break down in cases where many people choose the same slot and many slots have no or few choosers. For example, if all users choose slots (1,2,3) in that order, topological sort will provide no help.
In the ideal situation, each person would choose one slot, and all slots would have the same number of choosers (+/- 1). If I were handling the problem myself, I'd try a first-come, first-served regime with a real-time online server, such that people can choose only from those slots that remain open at the time they log in.
If online first-come, first-served isn't feasible, I'd use a method that motivates people to choose distinct slots, possibly with an element of randomness. Here's one such method:
Let there be U people in all, vying for H time slots. (H=22.) Suppose each person is assigned to exactly one slot. Let P = [U/H] (that is, U/H truncated to integer) be the nominal number of persons per slot. (U mod H slots will have P+1 persons in them.) For slot j, let D_j be 3*R1j + 2*R2j + 1*R3j, where Rij is the number of times slot j is requested as choice i. D_j is higher for more-desired slots. Give each user k a score W_k = 1/D_{C1k} + 2/D_{C2k} + 3/D_{C3k}, where Cik is the i'th choice of user k. That is, a user gets more points for choosing slots with low D values, and 2nd- or 3rd-choice selections are weighted more heavily than 1st-choice selections.
Now sort the slots into increasing order by D_j. (The "busiest" slots will be filled first.) Sort the users into decreasing order by W_k scores, and call this list S.
Then, for each slot j: While j is not full, {Find first person k in S who chose slot j as choice 1; if found, move k from S to slot j. If none found, find first person k in S who chose slot j as choice 2; if found, move k from S to slot j. If none found, find first person k in S who chose slot j as choice 3; if found, move k from S to slot j. If none found, add the last person k from S to slot j, and remove k from S.}
In the bad case mentioned earlier, where all users choose slots (1,2,3) in order, this method would assign random sets of people to all slots. Given the problem statement, that's as good as can be expected.
Update 1: Completely filling busiest slots first may put some people into their professed 2nd or 3rd choice places when they could have been placed without conflict in their first-choice places. There are pros and cons to filling busiest-first, which game-theoretic analysis might resolve. Absent that analysis, it now seems to me better to fill via the following (simpler) method instead: As before, create sorted user list S, in decreasing order by W_k scores. Now go through list S in order, placing people into the first available slot they chose and fit into, else into the most-popular slot that still has an opening. For example, if user k chose slots p, q, r, put k into p if p has room, else q if q has room, else r if r has room, else j where j is among slots with openings and D_j is largest.
This approach would be easier to explain to users, is a little easier to program, and in general may come closer to optimal. In cases where slots can be filled without resorting to third-place choices, it will do so.
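For concreteness, here is a minimal sketch of the Update 1 method, assuming Python (the function and variable names are mine, not from any library):

H = 22  # number of time slots

def assign(choices):
    # choices[k] = ordered list of user k's three slot picks (slot ids 0..H-1)
    U = len(choices)
    base, extra = divmod(U, H)  # P persons per slot; U mod H slots get P+1
    capacity = [base + (1 if j < extra else 0) for j in range(H)]

    # D_j = 3*R1j + 2*R2j + 1*R3j: demand score per slot
    D = [0.0] * H
    for c in choices:
        for i, j in enumerate(c):
            D[j] += 3 - i

    # W_k = 1/D_C1k + 2/D_C2k + 3/D_C3k: users who picked low-demand slots score high
    def W(c):
        return sum((i + 1) / D[j] for i, j in enumerate(c))

    slot_of = {}
    for k in sorted(range(U), key=lambda k: -W(choices[k])):
        open_slots = [j for j in range(H) if capacity[j] > 0]
        # first listed choice with room, else the most popular slot still open
        pick = next((j for j in choices[k] if capacity[j] > 0),
                    max(open_slots, key=lambda j: D[j]))
        slot_of[k] = pick
        capacity[pick] -= 1
    return slot_of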
This is just a heuristic, but maybe it would work well enough:
For each timeslot, calculate the number of people who are available for that slot.
Take the timeslot with the fewest available people and fill it with (number of people)/22 people, or with as many people as are actually available for that slot, whichever is smaller.
Remove the added people from the pool and repeat the procedure for the remaining timeslots.
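A rough sketch of that heuristic, assuming Python (all names are mine):

def greedy_fill(available, n_slots=22):
    # available[k] = set of slot ids user k can attend (their three picks)
    users = set(available)
    per_slot = max(1, len(users) // n_slots)  # target size for an even spread
    assignment = {}
    remaining = set(range(n_slots))
    while remaining and users:
        # the slot with the fewest remaining available people goes first
        j = min(remaining, key=lambda s: sum(1 for k in users if s in available[k]))
        for k in [k for k in users if j in available[k]][:per_slot]:
            assignment[k] = j
            users.discard(k)
        remaining.discard(j)
    return assignment  # users left over at the end fit no remaining slot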
If you need an optimal result, you might want to use a constraint solver or a linear-programming solver.
This is a graph-theory problem and can be solved with a topological sort: http://en.wikipedia.org/wiki/Topological_sorting
I'm trying to write a GQL query that returns N random records of a specific kind. My current implementation works but requires N calls to the datastore. I'd like to make it 1 call to the datastore if possible.
I currently assign a random number to every entity of that kind that I put into the datastore. When I query for a random record I generate another random number and query for records > rand ORDER BY asc LIMIT 1.
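In code, the current approach looks roughly like this (a sketch assuming the old db API and a float property rand_num populated with random.random() at insert time; Record stands in for the actual kind):

import random

from google.appengine.ext import db

class Record(db.Model):            # stand-in for the actual kind
    rand_num = db.FloatProperty()  # set to random.random() when the entity is stored

def get_one_random():
    r = random.random()
    q = db.GqlQuery("SELECT * FROM Record WHERE rand_num > :1 "
                    "ORDER BY rand_num ASC LIMIT 1", r)
    return q.get()  # may be None if r exceeds the largest stored rand_num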
This works, however, it only returns 1 record so I need to do N queries. Any ideas on how to make this one query? Thanks.
"Under the hood" a single search query call can only return a set of consecutive rows from some index. This is why some GQL queries, including any use of !=, expand to multiple datastore calls.
N independent uniform random selections are not (in general) consecutive in any index.
QED.
You could probably use memcache to store the entities, and reduce the cost of grabbing N of them. Or if you don't mind the "random" selections being close together in the index, select a randomly-chosen block of (say) 100 in one query, then pick N at random from those. Since you have a field that's already randomised, it won't be immediately obvious to an outsider that the N items are related. At least, not until they look at a lot of samples and notice that items A and Z never appear in the same group, because they're more than 100 apart in the randomised index. And if performance permits, you can re-randomise your entities from time to time.
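A sketch of that block idea, assuming the same randomised rand_num property as in the question (Record stands in for the actual model):

import random

from google.appengine.ext import db

def get_n_random(n, block_size=100):
    r = random.random()
    block = (db.Query(Record)
             .filter("rand_num >", r)
             .order("rand_num")
             .fetch(block_size))
    if len(block) < n:  # ran off the top of the index; top up from the start
        block += db.Query(Record).order("rand_num").fetch(n - len(block))
    return random.sample(block, min(n, len(block)))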
What kind of tradeoffs are you looking for? If you are willing to put up with a small performance hit on inserting these entities, you can create a solution to get N of them very quickly.
Here's what you need to do:
When you insert your entities, specify the key yourself. Give keys to your entities in order, starting with 1 and going up from there. (This will require some effort, as App Engine doesn't have autoincrement(), so you'll need to keep track of the last id you used in some other entity; let's call it an IdGenerator.)
Now when you need N random entities, generate N random numbers between 1 and whatever the last id you generated was (your IdGenerator will know this). You can then do a batch get by key using the N keys, which will only require one trip to the datastore, and will be faster than a query as well, since key gets are generally faster than queries, AFAIK.
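A sketch of that, assuming the old db API and sequential integer ids (Record and last_id are illustrative names; last_id would come from the IdGenerator):

import random

from google.appengine.ext import db

def get_n_random(n, last_id):
    ids = random.sample(xrange(1, last_id + 1), n)
    keys = [db.Key.from_path("Record", i) for i in ids]
    entities = db.get(keys)  # a single batch round trip to the datastore
    # ids whose entity was deleted (or never stored) come back as None
    return [e for e in entities if e is not None]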
This method does require dealing with a few annoying details:
Your IdGenerator might become a bottleneck if you are inserting lots of these items on the fly (more than a few a second), which would require some kind of sharded IdGenerator implementation. If all this data is preloaded, or is not high volume, you have it easy.
You might find that some id no longer has an entity associated with it, because you deleted it or because a put() failed somewhere. If this happens you'd have to grab another random entity. (If you want to get fancy and reduce the odds of this, you could make such ids available to the IdGenerator to reuse, to "fill in the holes".)
So the question comes down to how fast you need these N items vs how often you will be adding and deleting them, and whether a little extra complexity is worth that performance boost.
Looks like the only method is to store a random integer value in a special property on each entity and query on that. This can be done almost automatically if you add an automatically initialized property.
Unfortunately this will require processing all existing entities once if your datastore is already filled in.
It's weird, I know.
I agree with Steve's answer: there is no way to retrieve N random rows in one query.
However, even the method of retrieving one single entity does not usually return results with evenly distributed probability. The probability of returning a given entity depends on the gap between its randomly assigned number and the next lower assigned number. E.g. if the random numbers 1, 2, and 10 have been assigned (and none of the numbers 3-9), the algorithm will return "10" 8 times more often than "2".
I have fixed this in a slightly more expensive way. If someone is interested, I am happy to share.
I just had the same problem. I decided not to assign IDs to my already-existing entries in the datastore and did the following, since I already had the totalcount from a sharded counter.
This selects "count" entries from "totalcount" entries, sorted by key.
import logging
import random

from google.appengine.ext import db

# select $count random offsets from the complete set of $totalcount entries
numberlist = random.sample(range(0, totalcount), count)
numberlist.sort()

pagesize = 1000

# init buckets: bucket b holds the in-page offsets that fall within page b
buckets = [[] for i in xrange(int(max(numberlist) / pagesize) + 1)]
for k in numberlist:
    thisb = int(k / pagesize)
    buckets[thisb].append(k - (thisb * pagesize))
logging.debug("Numbers: %s. Buckets %s", numberlist, buckets)

# page through the results
result = []
baseq = db.Query(MyEntries, keys_only=True).order("__key__")
wq = baseq  # current page's query; advanced at the end of each bucket
for b, l in enumerate(buckets):
    if len(l) > 0:
        result += [wq.fetch(limit=1, offset=e)[0] for e in l]
    if b < len(buckets) - 1:  # not the last bucket
        lastkey = wq.fetch(1, pagesize - 1)[0]
        wq = baseq.filter("__key__ >", lastkey)
Beware that this is somewhat complex, and I'm still not convinced that I don't have off-by-one or off-by-x errors.
Also beware that if count is close to totalcount this can be very expensive, and that on millions of rows it might not be possible to finish within App Engine's time limits.
If I understand correctly, you need to retrieve N random instances.
It's easy. Just do a keys-only query, do random.choice on the list of keys N times, then fetch the results by key.
import random

from google.appengine.ext import db

keys = MyModel.all(keys_only=True)

n = 5  # 5 random instances
all_keys = list(keys)
result_keys = []
for _ in range(0, n):
    key = random.choice(all_keys)
    all_keys.remove(key)  # avoid picking the same key twice
    result_keys.append(key)

# result_keys now contains 5 random keys; fetch the entities in one batch
entities = db.get(result_keys)