Six degrees of separation interview problem - database

I was asked an interesting question in an interview lately.
You have 1 million users
Each user has 1 thousand friends
Your system should efficiently answer the question "Do I know him?" for any pair of users. A user "knows" another one if they are connected through at most 6 levels of friends.
E.g. A is a friend of B, B is a friend of C, C is a friend of D, D is a friend of E, and E is a friend of F. So we can say that A knows F.
Obviously you can't solve this problem efficiently using BFS or another standard traversal technique. The question is - how to store this data structure in a DB, and how to perform this search quickly.

What's wrong with BFS?
Execute three levels of BFS from the first user, marking every reachable user with flag 1. With 1000 friends per user this touches on the order of 1000^3 = 10^9 nodes.
Execute three levels of BFS from the second user, marking reachable users with flag 2. If we ever hit a user already marked with flag 1 - bingo, the two users are within six levels of each other.
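A minimal sketch of this meet-in-the-middle idea, assuming the friendship graph fits in memory as adjacency lists (Graph, expand and knows are illustrative names); an in-database version would replace the in-memory sets with temporary tables or flags:

#include <cstdint>
#include <unordered_set>
#include <vector>

// In-memory adjacency list: friends[u] holds the ids of u's friends.
using Graph = std::vector<std::vector<int32_t>>;

// Expand the frontier by up to `levels` BFS levels and return every user reached.
static std::unordered_set<int32_t> expand(const Graph& friends, int32_t start, int levels) {
    std::unordered_set<int32_t> reached{start};
    std::vector<int32_t> frontier{start};
    for (int level = 0; level < levels; ++level) {
        std::vector<int32_t> next;
        for (int32_t u : frontier)
            for (int32_t v : friends[u])
                if (reached.insert(v).second)   // true if v has not been seen before
                    next.push_back(v);
        frontier = std::move(next);
    }
    return reached;
}

// "Does a know b within 6 levels?" via two 3-level expansions that must intersect.
bool knows(const Graph& friends, int32_t a, int32_t b) {
    std::unordered_set<int32_t> fromA = expand(friends, a, 3);
    std::unordered_set<int32_t> fromB = expand(friends, b, 3);
    for (int32_t u : fromB)
        if (fromA.count(u)) return true;
    return false;
}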

What about storing the data as a 1 million x 1 million matrix A, where A[i][j] is the minimum number of steps needed to reach user j from user i? Then you can answer a query almost instantly. Updates, however, are more costly.

Related

Stack to handle large computation and data explosion

I recently started working with a team that has been building a solution that involves parallel calculations and data explosion.
The input to the system is provided in a set of Excel files. Say there are 5 sets of data A, B, C, D and E; the calculated output is a multiple of A, B, C, D and E. This output also grows over the years - i.e. if the data is spread across 5 years, the output for year 1 is the smallest and the output for year 5 is the largest (~3 billion rows).
We currently use Microsoft SQL Server to store the input, Microsoft Orleans for computation, and Hadoop to store the calculated output. Some concerns I have here are that what we are doing seems to be the opposite of MapReduce, and that we have limited big-data skills on the team.
I wanted to see if someone has experience working on similar systems and what kind of solution stack was used.
Thanks

What is a good approach to check if an item is in a very big hashset?

I have a hashset that cannot be loaded entirely into memory. So let's say it has three parts, A, B and C, and each one could be loaded into memory on its own, but not all at the same time.
I also have random entries coming in from time to time, and I can hardly tell which part each one could potentially belong to. So one approach could be to load A first and check it, then B, then C. But the next entry could belong to B, so I have to unload C, then load A, then B... Hopefully this makes sense.
This would clearly be very slow, so I wonder: is there a better way to do this? (Assume using a DB is not an option.)
My guess is that you currently don't use any criterion to decide whether an entry goes to A, B or C. In other words, A, B and C are just the result of splitting the whole data set into 3 equal parts. Am I right? If so, I recommend you apply some criterion when adding a new entry to your set. For example, if your entries are numbers, put those that start with 0-3 into A, those that start with 4-6 into B, and those that start with 7-9 into C. When you search for something, you then know a priori that you only have to search in A, or in B, or in C. If your entries are words, the same solution works, but now the criterion is the first letter; maybe it's better to use not 3 sets but 26, one per letter of the English alphabet.
Note that you still have to keep one of the sets in memory at a time, but you gain one advantage: you do at most 1 load/unload operation per lookup, and you don't need to check all the sets, because you know which one could actually contain your value. This idea is widely used in databases under the name partitioning. If your sets store neither numbers nor words but some complex objects, you can still invent a simple criterion of this kind.
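A minimal sketch of this kind of partitioning for string entries; partitionOf, loadPartition and contains are illustrative names, and in practice loadPartition would read exactly one part of the set from disk:

#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

constexpr std::size_t kParts = 26;   // e.g. one part per first letter, or any fixed fan-out

// The same criterion is applied when inserting an entry and when looking it up,
// so only one part ever needs to be resident in memory for a given entry.
std::size_t partitionOf(const std::string& entry) {
    if (entry.empty()) return 0;
    char c = static_cast<char>(std::tolower(static_cast<unsigned char>(entry[0])));
    if (c >= 'a' && c <= 'z') return static_cast<std::size_t>(c - 'a');
    return std::hash<std::string>{}(entry) % kParts;   // fallback for non-letter entries
}

// Placeholder: in practice this would load one part of the set from disk.
std::unordered_set<std::string> loadPartition(std::size_t part) {
    (void)part;
    return {};
}

bool contains(const std::string& entry) {
    std::unordered_set<std::string> resident = loadPartition(partitionOf(entry));  // at most 1 load/unload
    return resident.count(entry) > 0;
}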

Designing an algorithm for large data

I read that this question was asked in a job interview for a software engineer position.
If there are 1000 websites and 1000 users, write a program and a data structure such that I can query the following in real time: 1. Given any user, get the list of all sites he/she has visited. 2. Given any website, get the list of all users who have visited it.
I think they wanted some sort of pseudocode or algorithm design.
Can you guys give any tips for this?
One thing is certain - in order to be able to answer both queries, you need to store all the pairs indicating that a given user has visited a given website. So what I propose is the following:
You have a structure:
struct VisitPair {
    int websiteId;
    int userId;
    VisitPair* nextForUser;     // next pair in this user's visit list
    VisitPair* nextForWebsite;  // next pair in this website's visit list
};
nextForUser will point to the next pair for that user, or NULL if there is no next pair for that user; similarly, nextForWebsite will point to the next pair for the website. User and Website will look something like:
struct User {
    char* name;
    VisitPair* firstPair;   // head of this user's list of visit pairs
};
struct Website {
    char* url;
    VisitPair* firstPair;   // head of this website's list of visit pairs
};
I assume both websites and users are stored in arrays of pointers; say these arrays are called websites and users. Now adding a new VisitPair is relatively easy:
void addNewPair(int siteId, int userId) {
    VisitPair* newPair = (VisitPair*)malloc(sizeof(VisitPair));
    newPair->websiteId = siteId;                            // remember which website/user this pair links
    newPair->userId = userId;
    newPair->nextForUser = users[userId]->firstPair;        // push onto the user's list
    users[userId]->firstPair = newPair;
    newPair->nextForWebsite = websites[siteId]->firstPair;  // push onto the website's list
    websites[siteId]->firstPair = newPair;
}
Printing all users for a website and all the websites for a user is done by simply iterating over a list so you should be able to do that.
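For instance, a minimal sketch of those two traversals, assuming the users and websites arrays of pointers described above (plus stdio.h for printf):

// Print every user who visited websites[siteId] by walking its pair list.
void printUsersOfWebsite(int siteId) {
    for (VisitPair* p = websites[siteId]->firstPair; p != NULL; p = p->nextForWebsite)
        printf("%s\n", users[p->userId]->name);
}

// Symmetrically, print every website visited by users[userId].
void printWebsitesOfUser(int userId) {
    for (VisitPair* p = users[userId]->firstPair; p != NULL; p = p->nextForUser)
        printf("%s\n", websites[p->websiteId]->url);
}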
In short, what I create is a structure that has two integrated lists. I do not think there can be a solution with better complexity, as this one has linear complexity with respect to the size of the answer and constant complexity for adding a pair.
Hope this helps.
Since both the number of sites and the number of users are bounded and known in advance, you can use a 2D array of dimension 1000 x 1000, with users along one dimension and websites along the other.
The array would be a boolean array.
bool tracker[1000][1000];
When user x visits website y, the entry is marked as 1 (true).
tracker[x][y] = 1;
To return all users who have visited website J,
return all indices in column J that have value 1;
to return all websites visited by user i, return all indices in row i that have value 1.
The complexity of a lookup is O(n), but this approach is space efficient, and updates are O(1),
unlike a linked list, which would require O(n) complexity to add a user to a website's linked list or a website to a user's linked list (but which gives O(1) complexity when doing lookups).
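A brief sketch of both queries under this layout (function names are illustrative):

#include <vector>

// All users who have visited website j: scan column j of the matrix.
std::vector<int> usersOfWebsite(const bool (&tracker)[1000][1000], int j) {
    std::vector<int> result;
    for (int x = 0; x < 1000; ++x)
        if (tracker[x][j]) result.push_back(x);
    return result;
}

// All websites visited by user i: scan row i.
std::vector<int> websitesOfUser(const bool (&tracker)[1000][1000], int i) {
    std::vector<int> result;
    for (int y = 0; y < 1000; ++y)
        if (tracker[i][y]) result.push_back(y);
    return result;
}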
For each web site and user, keep a linked list of visitors and web sites visited, respectively. Whenever a user visits a web site, add an entry to the user's linked list as well as to the web site's linked list.
This has minimal memory overhead and fast updates and queries.
In the general case with N users and M sites, keep two maps for the queries, like:
map<user, set<site> > sitesOfUser;
map<site, set<user> > usersOfSite;
When user u visits site s, you update these with:
sitesOfUser[ u ].insert( s );
usersOfSite[ s ].insert( u );
A set is used here to avoid duplication. If duplication is OK (or you will take care of it later), you can use just a list and reduce the update time by another log factor.
In this case an update will take O(log N + log M) time (or just O(log N), see above) and a query will take O(log N) time.
In your particular case, when the maximum number of sites and users is not too large and is known beforehand (let's say it's K), you can just have two arrays, like:
set<site> sitesOfUser[ K ];
set<user> usersOfSite[ K ];
Here you will get O(log N) time for an update (or O(1) if duplicate information is not a problem and you use a list or some other linear container), and O(1) time for a query.
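For concreteness, a small sketch of the array-of-sets variant, with plain integer ids standing in for user and site (names are illustrative):

#include <set>

using user = int;   // placeholder id types
using site = int;

constexpr int K = 1000;           // known upper bound on users and sites
std::set<site> sitesOfUser[K];
std::set<user> usersOfSite[K];

void recordVisit(user u, site s) {
    sitesOfUser[u].insert(s);     // O(log N); a plain vector would make this O(1) amortized
    usersOfSite[s].insert(u);
}

// Queries just return the pre-built sets, so locating the answer is O(1).
const std::set<user>& usersWhoVisited(site s) { return usersOfSite[s]; }
const std::set<site>& sitesVisitedBy(user u)  { return sitesOfUser[u]; }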
Here is a summary of posted answers.
Let m be the number of sites, n the number of users.
For each data structure we give the complexity for update, resp. get.
two arrays of linked lists. O(1), resp. O(len(answer)).
an m×n matrix. O(1), resp. O(m) or O(n). The least memory usage if most users visit most sites, but not optimal in space and time if most users visit only a few sites.
two arrays of sets. O(log m) or O(log n), resp. O(len(answer)).
izomorphius's answer is very close to linked lists.
O(len(answer)) is the time required to read the whole answer, but for sets and lists one can get an iterator in O(1), which has a next method that is also guaranteed O(1).

Algorithm or Script for Sorting Multiple User Schedules

I have a data set that will potentially look like this:
user_name
time_1
time_2
time_3
where the times are different hours on given days when the user is free. There are 22 slots each week, and each user is allowed to pick three of them and submit them. I will have about 100-150 users, and I'm wondering what the best way is to go about assigning them so that people are distributed evenly across the time slots. My best guess for a starting approach is to see what it looks like if all the users are put into their first slots (time_1), then their second and third, compare which one gives the best results, and from there look at what happens to the overall result if a user is added to or removed from a slot. Any help would be appreciated, as I haven't done much work with optimization algorithms.
Regards,
I'm answering because previous answers apparently break down in cases where many people choose the same slot and many slots have no or few choosers. For example, if all users choose slots (1,2,3) in that order, topological sort will provide no help.
In the ideal situation, each person would choose one slot, and all slots would have the same number of choosers (+/- 1). If I were handling the problem myself, I'd try a first-come, first-served regime with a real-time online server, such that people can choose only from those slots that remain open at the time they log in.
If online first-come, first-served isn't feasible, I'd use a method that motivates people to choose distinct slots, possibly with an element of randomness. Here's one such method:
Let there be U people in all, vying for H time slots. (H=22.) Suppose each person is assigned to exactly one slot. Let P = [U/H] (that is, U/H truncated to integer) be the nominal number of persons per slot. (U mod H slots will have P+1 persons in them.) For slot j, let D_j be 3*R1j + 2*R2j + 1*R3j, where Rij is the number of times slot j is requested as choice i. D_j is higher for more-desired slots. Give each user k a score W_k = 1/D_{C1k} + 2/D_{C2k} + 3/D_{C3k}, where Cik is the i'th choice of user k. That is, a user gets more points for choosing slots with low D values, and 2nd- or 3rd-choice selections are weighted more heavily than 1st-choice selections.
Now sort the slots into increasing order by D_j. (The "busiest" slots will be filled first.) Sort the users into decreasing order by W_k scores, and call this list S.
Then, for each slot j: While j is not full, {Find first person k in S who chose slot j as choice 1; if found, move k from S to slot j. If none found, find first person k in S who chose slot j as choice 2; if found, move k from S to slot j. If none found, find first person k in S who chose slot j as choice 3; if found, move k from S to slot j. If none found, add the last person k from S to slot j, and remove k from S.}
In the bad case mentioned earlier, where all users choose slots (1,2,3) in order, this method would assign random sets of people to all slots. Given the problem statement, that's as good as can be expected.
Update 1: Completely filling busiest slots first may put some people into their professed 2nd or 3rd choice places when they could have been placed without conflict in their first-choice places. There are pros and cons to filling busiest-first, which game-theoretic analysis might resolve. Absent that analysis, it now seems to me better to fill via the following (simpler) method instead: As before, create sorted user list S, in decreasing order by W_k scores. Now go through list S in order, placing people into the first available slot they chose and fit into, else into the most-popular slot that still has an opening. For example, if user k chose slots p, q, r, put k into p if p has room, else q if q has room, else r if r has room, else j where j is among slots with openings and D_j is largest.
This approach would be easier to explain to users, is a little easier to program, and in general may come closer to optimal. In cases where slots can be filled without resorting to third-place choices, it will do so.
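A rough sketch of the simpler method from Update 1, assuming each person's three slot choices are already collected; Person, assign and the helper variables are illustrative names, and the D and W values follow the definitions above:

#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

struct Person {
    std::array<int, 3> choice;   // slot indices of 1st, 2nd and 3rd choice
    double W = 0.0;              // priority score, computed below
};

// Assign people to H slots with at most `capacity` per slot (e.g. ceil(U / H)).
std::vector<int> assign(std::vector<Person> people, int H, int capacity) {
    // D[j] = 3*R1j + 2*R2j + 1*R3j: weighted demand for slot j.
    std::vector<double> D(H, 0.0);
    for (const Person& p : people)
        for (int i = 0; i < 3; ++i)
            D[p.choice[i]] += 3 - i;

    // W_k = 1/D_{C1k} + 2/D_{C2k} + 3/D_{C3k}: people who chose unpopular slots go first.
    for (Person& p : people)
        for (int i = 0; i < 3; ++i)
            p.W += (i + 1) / std::max(D[p.choice[i]], 1.0);

    std::vector<std::size_t> order(people.size());
    for (std::size_t k = 0; k < order.size(); ++k) order[k] = k;
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return people[a].W > people[b].W; });

    std::vector<int> slotOf(people.size(), -1);
    std::vector<int> load(H, 0);
    for (std::size_t k : order) {
        int best = -1;
        for (int c : people[k].choice)                        // first chosen slot with room
            if (load[c] < capacity) { best = c; break; }
        if (best == -1)                                       // else the most popular slot with room
            for (int j = 0; j < H; ++j)
                if (load[j] < capacity && (best == -1 || D[j] > D[best])) best = j;
        slotOf[k] = best;
        ++load[best];
    }
    return slotOf;   // slotOf[k] is the slot assigned to person k
}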
This is just a heuristic, but maybe it would work well enough:
For each time slot, calculate the number of people who are available for that slot.
Take the time slot with the fewest available people and fill it with (total number of people)/22 people, or with the maximum number of people available for that slot, whichever is smaller.
Remove the added people from the pool and repeat the procedure for the remaining time slots.
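A compact sketch of this scarcity-first heuristic, assuming availability is given as, for each slot, the list of user ids who listed it (the names and the avail layout are illustrative):

#include <vector>

// avail[j] = ids of people available for slot j; returns assignment[person] = slot (or -1).
std::vector<int> scarcityFirst(const std::vector<std::vector<int>>& avail, int numPeople) {
    int slots = static_cast<int>(avail.size());
    int share = (numPeople + slots - 1) / slots;            // even share per slot, rounded up
    std::vector<int> assignment(numPeople, -1);
    std::vector<bool> placed(numPeople, false);
    std::vector<bool> filled(slots, false);

    for (int round = 0; round < slots; ++round) {
        // Pick the still-open slot with the fewest remaining (unplaced) available people.
        int best = -1, bestCount = 0;
        for (int j = 0; j < slots; ++j) {
            if (filled[j]) continue;
            int count = 0;
            for (int u : avail[j]) if (!placed[u]) ++count;
            if (best == -1 || count < bestCount) { best = j; bestCount = count; }
        }
        // Fill it with up to its even share of available people and remove them from the pool.
        int taken = 0;
        for (int u : avail[best]) {
            if (taken == share) break;
            if (!placed[u]) { placed[u] = true; assignment[u] = best; ++taken; }
        }
        filled[best] = true;
    }
    return assignment;   // -1 means the person could not be placed in any slot they chose
}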
If you need an optimal result you might want to use a constraint solver or linear program solver.
This is a graph-theory problem and can be solved with a topological sort: http://en.wikipedia.org/wiki/Topological_sorting.

Determining which inputs to weigh in an evolutionary algorithm

I once wrote a Tetris AI that played Tetris quite well. The algorithm I used (described in this paper) is a two-step process.
In the first step, the programmer decides to track inputs that are "interesting" to the problem. In Tetris we might be interested in tracking how many gaps there are in a row because minimizing gaps could help place future pieces more easily. Another might be the average column height because it may be a bad idea to take risks if you're about to lose.
The second step is determining weights associated with each input. This is the part where I used a genetic algorithm. Any learning algorithm will do here, as long as the weights are adjusted over time based on the results. The idea is to let the computer decide how the input relates to the solution.
Using these inputs and their weights we can determine the value of taking any action. For example, if putting the straight line shape all the way in the right column will eliminate the gaps of 4 different rows, then this action could get a very high score if its weight is high. Likewise, laying it flat on top might actually cause gaps and so that action gets a low score.
I've always wondered if there's a way to apply a learning algorithm to the first step, where we find "interesting" potential inputs. It seems possible to write an algorithm where the computer first learns what inputs might be useful, then applies learning to weigh those inputs. Has anything been done like this before? Is it already being used in any AI applications?
In neural networks, you can select 'interesting' potential inputs by finding the ones that have the strongest correlation, positive or negative, with the classifications you're training for. I imagine you can do similarly in other contexts.
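For illustration, a small sketch of that kind of screening using the Pearson correlation between one candidate input and the target values (names are illustrative):

#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation between a candidate input x and the target y over the same samples;
// candidate inputs with |correlation| close to 1 are the "interesting" ones to keep.
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size();
    double mx = 0.0, my = 0.0;
    for (std::size_t i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
    mx /= n; my /= n;
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
    }
    return sxy / std::sqrt(sxx * syy);
}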
I think I might approach the problem you're describing by feeding more primitive data to a learning algorithm. For instance, a Tetris game state may be described by the list of occupied cells. A string of bits describing this information would be a suitable input to that stage of the learning algorithm. Actually training on that is still challenging, though; how do you know whether the results are useful? I suppose you could roll the whole algorithm into a single blob, where the algorithm is fed the successive states of play and the output would just be the block placements, with higher-scoring algorithms selected for future generations.
Another choice might be to use a large corpus of plays from other sources, such as recorded games from human players or a hand-crafted AI, and select the algorithms whose outputs bear a strong correlation to some interesting fact or other about the future play, such as the score earned over the next 10 moves.
Yes, there is a way.
If you have M candidate features, there are 2^M subsets of them, so there is a lot to look at.
I would do the following:
For each subset S:
    run your code to optimize the weights W
    save S and the corresponding W
Then, for each pair (S, W), run G games and save the score L for each game. Now you have a table like this:
feature1  feature2  feature3  ...  featureM  subset_code  game_number  scoreL
1         0         1         ...  1         S1           1            10500
1         0         1         ...  1         S1           2            6230
...
0         1         1         ...  0         S2           G+1          30120
0         1         1         ...  0         S2           G+2          25900
Now you can run some component-selection algorithm (PCA, for example) and decide which features are worth keeping to explain scoreL.
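A rough sketch of that outer loop; optimizeWeights and playGame are stand-ins for the existing GA and game code from the question, passed in as callbacks, and the bitmask encodes which features belong to subset S:

#include <cstdint>
#include <functional>
#include <vector>

struct Row { std::uint32_t subsetMask; int gameNumber; double scoreL; };

// Build the table described above: for every non-empty feature subset, optimize the
// weights once, then record the score of G games played with those weights.
std::vector<Row> buildScoreTable(
    int M, int G,
    const std::function<std::vector<double>(std::uint32_t)>& optimizeWeights,
    const std::function<double(std::uint32_t, const std::vector<double>&, unsigned)>& playGame) {
    std::vector<Row> table;
    for (std::uint32_t S = 1; S < (1u << M); ++S) {          // every non-empty subset of M features
        std::vector<double> W = optimizeWeights(S);          // weights tuned for this subset
        for (int g = 0; g < G; ++g)
            table.push_back({S, g + 1, playGame(S, W, 42u + g)});  // fixed seeds, per the tip below
    }
    return table;   // feed this table to PCA / feature selection
}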
A tip: When running the code to optimize W, seed the random number generator, so that each different 'evolving brain' is tested against the same piece sequence.
I hope this helps!

Resources