I'm trying to figure out the best method/program to handle this computation so that the most people end up happy, i.e. the highest value for each person while still keeping everyone's values almost equal.
There are 24 people, 100 days, and 4 people need to be selected for each day. All days must be full, i.e. the 24 people must be spread over the 400 slots, with each person getting roughly the same number of slots (400 / 24, so about 16 or 17 each).
How can I create a program/algorithm that lets each person rank all 100 days in order of preference, as well as the top 5 people they would prefer to be selected with? I was thinking that each day and each of the preferred people would get some sort of point value. The algorithm would then run through the data set and find the assignment that makes the most people the happiest while still keeping everyone roughly even.
Is this easily possible using something like Excel?
Thanks
Read up on "The Assignment Problem"; this is a well-studied class of problems. Off the top of my head, the Hungarian Assignment method and the Stable Marriage/Stable Roommate methods might be relevant.
You can solve this problem as a MILP using Solver, as shown in this video and many others like it, but I am afraid that the built-in Solver may not allow enough binary variables for it to work. Get a feel for how the problem works on a small scale and then download a nicer solver.
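For a feel of what the MILP looks like outside Excel, here is a minimal sketch in Python using the PuLP library; the preference scores, the 16-to-17-slot bounds, and all names are illustrative assumptions rather than the poster's actual data, and the pairwise "preferred people" bonus is left out because it would need extra linearisation variables.

from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary

people = range(24)
days = range(100)
# Placeholder preference scores; replace with scores derived from each person's day rankings.
pref = [[1 for d in days] for p in people]

prob = LpProblem("day_assignment", LpMaximize)
x = LpVariable.dicts("x", (people, days), cat=LpBinary)   # x[p][d] = 1 if person p gets day d

# Maximise the total preference score of the assignment.
prob += lpSum(pref[p][d] * x[p][d] for p in people for d in days)

for d in days:                                   # exactly 4 people per day
    prob += lpSum(x[p][d] for p in people) == 4
for p in people:                                 # keep workloads roughly even (400/24 ~ 16.7)
    prob += lpSum(x[p][d] for d in days) >= 16
    prob += lpSum(x[p][d] for d in days) <= 17

prob.solve()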
I have come across a problem.
I'm not asking for help with how to construct what I'm searching for, only for guidance toward what I'm looking for! 😊
The thing I want to create is some sort of ‘Sorting Algorithm/Mechanism’.
Example:
Imagine I have a database with over 1000 pictures of different vehicles.
A person sees a vehicle and tries to gather as many details about that vehicle as he can, such as:
Shape
number of wheels
number and shape of windows
number and shape of light(s)
number and shape of exhaust(s)
Etc…
He then gives me all information about that vehicle he saw. BUT! Without telling me anything about:
Make and model.
…
I will now take that information and have my database sort every vehicle, arranging all 1000 vehicles by best match based on the description it has been given.
But it should NOT exclude any vehicle!
So…
If the person tells me that the vehicle only has 4 wheels, but in reality it has 5 (he might not have seen the fifth wheel) it should just get a bad score in the # of wheels.
But if every other aspect matches that vehicle perfectly, it will still get a high score.
That way we don't exclude the vehicle he has seen, and we still have a chance to find the correct vehicle.
The whole point of this mechanism is, as said, to narrow things down, so that instead of looking through 1000 vehicles we only need to look through the best matches, which would hopefully be 10 to maybe 50 vehicles out of the 1000.
I tried to describe it as best I could in a language that isn't 'my father's tongue'. So bear with me.
Again, I'm not looking for anybody to tell me how to make this algorithm; I'm pretty sure nobody even wants to, or has the time to, do that for me without getting paid somehow...
But I just need to know where to look to learn and understand how to create this mess of a mechanism.
Kind regards
Gent!
Assuming that all your pictures have been indexed with the relevant fields (number of wheels, window shapes...), and given that they are not too numerous (a thousand is peanuts for a computer), you can proceed as follows:
For every criterion, weight the possible discrepancies (e.g. one wheel too many costs 5, one wheel too few costs 10, a wrong window shape costs 8...). Do this in a coherent way so that the costs of the different criteria are well balanced.
To perform a search, evaluate the total discrepancy cost of every car and sort the values in increasing order. Report the first ten.
Technically, what you are after is called a "nearest neighbor search" in a high dimensional space. This problem has been well studied. There are fast solutions but they are extremely complex, and in your case are absolutely not worth using.
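As a minimal sketch of that cost-and-sort approach in Python (the attribute names, penalty values, and record layout are illustrative assumptions, not something given in the question):

# Hypothetical scoring: penalise each discrepancy, sum the costs, sort ascending.
def discrepancy_cost(description, vehicle):
    cost = 0
    diff_wheels = vehicle["wheels"] - description["wheels"]
    cost += 5 * max(diff_wheels, 0) + 10 * max(-diff_wheels, 0)   # too many vs. too few wheels
    if vehicle["window_shape"] != description["window_shape"]:
        cost += 8
    return cost

def best_matches(description, vehicles, top_n=10):
    # Nothing is excluded: every vehicle gets a score, we just report the cheapest ones.
    return sorted(vehicles, key=lambda v: discrepancy_cost(description, v))[:top_n]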
The standard way of doing this, for example in artificial intelligence, is to encode all properties as a vector and apply a weight to each property. The distance can then be calculated using any metric you like; in your case the Manhattan distance should be fine. So in pseudocode:
distance(first_car, second_car):
    return abs(first_car.n_wheels - second_car.n_wheels) * wheels_weight + ... +
           abs(first_car.n_windows - second_car.n_windows) * windows_weight
This works fine for simple properties like the number of wheels. For more complex properties like the shape of a window, you'll probably need to split it up into multiple attributes, depending on your similarity requirements.
Weights are usually picked so as to normalize all values, if their range is known. Optionally, an additional factor can be multiplied in to increase the impact of a specific attribute on the overall distance.
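To illustrate that normalization idea concretely (the attribute names, ranges, and importance factors below are assumptions for the sake of the example):

# Hypothetical weight normalisation: scale each attribute by 1 / its value range,
# then multiply by an optional importance factor.
ranges = {"n_wheels": 6, "n_windows": 10, "n_lights": 8}            # assumed known value ranges
importance = {"n_wheels": 2.0, "n_windows": 1.0, "n_lights": 1.0}   # optional extra emphasis

weights = {attr: importance[attr] / ranges[attr] for attr in ranges}

def weighted_manhattan(car_a, car_b):
    return sum(weights[attr] * abs(car_a[attr] - car_b[attr]) for attr in weights)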
I'm not sure exactly what I'm trying to ask. I want to be able to write code that takes an initial state, a final state, and some rules, and determines paths/choices to get there.
So think, for example, of a game like Starcraft. To build a factory I need to have a barracks and a command center already built. So if I have nothing and I want a factory I might say: -> Command Center -> Barracks -> Factory. Each thing takes time and resources, and that should be noted and considered in the path. If I want my factory at 5 minutes there are fewer options than if I want it at 10.
Also, the engine should be able to calculate available resources and utilize them effectively. Those three buildings might cost 600 total minerals, but the engine should plan the Command Center for when it would have 200 (or whatever it costs).
This would ultimately have requirements like: 10 marines at 5 minutes, infantry weapons upgrade at 6:30, 30 marines at 10 minutes, Factory at 11, etc...
So, how do I go about doing something like this? My first thought was to use some procedural language and make all the decisions from the ground up. I could simulate the system, branching and making different choices. Ultimately, some choices are quickly going to make it impossible to reach goals later (if I build 20 Supply Depots I'm probably not going to get that factory on time).
So then I thought, weren't functional languages designed for this? I tried to write some Prolog, but I've been having trouble with things like time and distance calculations, and I'm not sure of the best way to return the "plan".
I was thinking I could write:
depends_on(factory, barracks)
depends_on(barracks, command_center)
builds_from(marine, barracks)
build_time(command_center, 60)
build_time(barracks, 45)
build_time(factory, 30)
minerals(command_center, 400)
...
build(X) :-
    depends_on(X, Y),
    build_time(X, T),
    minerals(X, M),
    ...
Here's where I get confused. I'm not sure how to construct this predicate, and a query, to get anything even close to what I want. I would have to somehow account for the rate at which minerals are gathered during the time spent building, and for other possible paths with extra minerals. If I only want 1 marine in 10 minutes, I would want the engine to generate lots of plans, because there are lots of ways to end up with 1 marine at 10 minutes (maybe cut it off after so many; I'm not sure how you do that in Prolog).
I'm looking for advice on how to continue down this path, or advice about other options. I haven't been able to find anything more useful than Towers of Hanoi and ancestry examples for AI, so even some good articles explaining how to use Prolog to DO REAL THINGS would be amazing. And if I somehow can get these rules set up in a useful way, how do I get the "plans" Prolog came up with (ways to solve the query), other than writing to stdout like all the Towers of Hanoi examples do? Or is that the preferred way?
My other question is: my main code is in Ruby (and potentially other languages), and the options to communicate with Prolog are calling my Prolog program from within Ruby, accessing a virtual file system from within Prolog, or some kind of database structure (unlikely). I'm using SWI-Prolog at the moment; would I be better off doing this procedurally in Ruby, or would constructing this in a declarative language like Prolog or Haskell be worth the extra effort of integrating?
I'm sorry if this is unclear, I appreciate any attempt to help, and I'll re-word things that are unclear.
Your question is typical and very common for users of procedural languages who first try Prolog. It is very easy to solve: You need to think in terms of relations between successive states of your world. A state of your world consists for example of the time elapsed, the minerals available, the things you already built etc. Such a state can be easily represented with a Prolog term, and could look for example like time_minerals_buildings(10, 10000, [barracks,factory]). Given such a state, you need to describe what the state's possible successor states look like. For example:
state_successor(State0, State) :-
    State0 = time_minerals_buildings(Time0, Minerals0, Buildings0),
    Time is Time0 + 1,
    can_build_new_building(Buildings0, Building),
    building_minerals(Building, MB),
    Minerals is Minerals0 - MB,
    Minerals >= 0,
    State = time_minerals_buildings(Time, Minerals, [Building|Buildings0]).
I am using the explicit naming convention (State0 -> State) to make clear that we are talking about successive states. You can of course also pull the unifications into the clause head. The example code is purely hypothetical and could look rather different in your final application. In this case, I am describing that the new state's elapsed time is the old state's time + 1, that the new amount of minerals decreases by the amount required to build Building, and that I have a predicate can_build_new_building(Bs, B), which is true when a new building B can be built assuming that the buildings given in Bs are already built. I assume it is a non-deterministic predicate in general, and will yield all possible answers (= new buildings that can be built) on backtracking, and I leave it as an exercise for you to define such a predicate.
Given such a predicate state_successor/2, which relates a state of the world to its direct possible successors, you can easily define a path of states that lead to a desired final state. In its simplest form, it will look similar to the following DCG that describes a list of successive states:
states(State0) -->
    (   { final_state(State0) } -> []
    ;   [State0],
        { state_successor(State0, State1) },
        states(State1)
    ).
You can then use for example iterative deepening to search for solutions:
?- initial_state(S0), length(Path, _), phrase(states(S0), Path).
Also, you can keep track of states you already considered and avoid re-exploring them etc.
The reason you get confused with the example code you posted is essentially that build/1 does not have enough arguments to describe what you want. You need at least two arguments: One is the current state of the world, and the other is a possible successor to this given state. Given such a relation, everything else you need can be described easily. I hope this answers your question.
Caveat: my Prolog is rusty and shallow, so this may be off base
Perhaps a 'difference engine' approach would be appropriate:
given a goal like 'build factory',
backwards-chaining relations would check for has-barracks and tell you first to build-barracks,
which would check for has-command-center and tell you to build-command-center,
and so on,
accumulating a plan (and costs) along the way
If this is practical, it may be more flexible than a state-based approach... or it may be the same thing wearing a different t-shirt!
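For what it's worth, here is a tiny sketch of that backward-chaining idea in Python rather than Prolog; the dependency table, the mineral costs, and the function names are illustrative assumptions only:

# Hypothetical difference-engine style planner: to build X, first plan its prerequisite,
# accumulating the build order and the total mineral cost along the way.
depends_on = {"factory": "barracks", "barracks": "command_center"}
minerals = {"command_center": 400, "barracks": 150, "factory": 200}   # assumed costs

def plan(goal, built=frozenset()):
    """Return (ordered build list, total mineral cost) needed to reach `goal`."""
    if goal in built:                       # already have it: nothing to do
        return [], 0
    steps, cost = [], 0
    prereq = depends_on.get(goal)
    if prereq is not None:                  # plan the missing prerequisite first
        steps, cost = plan(prereq, built)
    return steps + [goal], cost + minerals[goal]

print(plan("factory"))   # (['command_center', 'barracks', 'factory'], 750)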
The context:
I'm experimenting with using a feed-forward artificial neural network to create AI for a video game, and I've run into the problem that some of my input features are dependent upon the existence or value of other input features.
The most basic, simplified example I can think of is this:
feature 1 is the number of players (range 2...5)
features 2 to ? are the scores of each player (range >= 0)
The number of features needed to inform the ANN of the scores is dependent on the number of players.
The question: How can I represent this dynamic knowledge input to an ANN?
Things I've already considered:
Simply not using such features, or consolidating them into static input.
E.g. using the sum of the players' scores instead. I seriously doubt this is applicable to my problem; it would result in the loss of too much information and the ANN would fail to perform well.
Passing in an error value (e.g. -1) or a default value (e.g. 0) for non-existent input
I'm not sure how well this would work; in theory the ANN could easily learn from this input and model the function appropriately. In practice I'm worried about the sheer number of non-existent inputs causing problems for the ANN. For example, if the range of players was 2-10 and there were only 2 players, 80% of the input data would be non-existent, which would introduce a weird bias into the ANN and result in poor performance.
Passing in the mean value over the training set in place of non-existent input
Again, the amount of non-existent input would be a problem, and I'm worried this would introduce weird problems for discrete-valued inputs.
So I'm asking: does anybody have any other solutions I could think about? And is there a standard or commonly used method for handling this problem?
I know it's a rather niche and complicated question for SO, but I was getting bored of the "how do I fix this code?" and "how do I do this in PHP/Javascript?" questions :P, thanks guys.
It sounds like you have multiple data sets (one for each number of players) that aren't really compatible with each other. Would lessons learned from a 5-player game really apply to a 2-player game? Try simplifying the problem, as in your option 1, and see how the program performs. In AI, absurd simplifications can sometimes give you a lot of traction, like bag-of-words in spam filters.
Try thinking about some model like the following:
Say xi (e.g. x1) is one of the inputs of which a variable number can exist. You can have up to n of these (x1 to xn). Let y be the rest of the inputs.
On your first hidden layer, pass x1 and y to the first c nodes, x1,x2 and y to the next c nodes, x1,x2,x3 and y to the next c nodes, and so on. This assumes x1 and x3 can't both be active without x2. The model will have to change appropriately if this needs to be possible.
The rest of the network is a standard feed-forward network with all nodes connected to all nodes of the next layer, or however you choose.
Whenever you have w active inputs, disable all but the wth set of c nodes (completely exclude them from training for that input set, don't include them when calculating the value for the nodes they output to, don't update the weights for their inputs or outputs). This will allow most of the network to train, but for the first hidden layer, only parts applicable to that number of inputs.
I suggest choosing c such that c*n (the number of nodes in the first hidden layer) is greater than or equal to the number of nodes in the second hidden layer, and that c be at the very least 10 for a moderately sized network (into the 100s is also fine). I also suggest the network have at least 2 other hidden layers (so 3 in total, excluding input and output). This is not from experience, just what my intuition tells me.
This working is dependent on a certain (possibly undefinable) similarity between the different numbers of inputs, and might not work well, if at all, if this similarity doesn't exist. This also probably requires quite a bit of training data for each number of inputs.
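A rough numpy sketch of that grouped first hidden layer; the sizes n, c, and y_dim, the tanh activation, and the random weights are assumptions for illustration, not a tested design:

import numpy as np

n, c, y_dim = 4, 3, 2            # n possible per-player inputs, c nodes per group, 2 shared inputs
rng = np.random.default_rng(0)

# Group g (1-based) sees inputs x1..xg plus y, so its weight matrix is c x (g + y_dim).
W = [rng.normal(size=(c, g + y_dim)) for g in range(1, n + 1)]

def first_hidden_layer(x_active, y):
    """x_active: scores of the players actually present (length 1..n); y: the other inputs."""
    g = len(x_active)                      # which group of c nodes should be active
    h = np.zeros(n * c)                    # all other groups stay disabled (output 0)
    inp = np.concatenate([x_active, y])
    h[(g - 1) * c : g * c] = np.tanh(W[g - 1] @ inp)
    return h

# Example: a 2-player game activates only the second group of c nodes.
print(first_hidden_layer([0.4, 0.9], [1.0, 0.0]))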
If you try it, let me / us know if it works.
If you're interested in Artificial Intelligence discussions, I suggest joining a LinkedIn group dedicated to it; there are some that are quite active and have interesting discussions. There doesn't seem to be much happening on Stack Overflow when it comes to Artificial Intelligence. Or maybe we should just work to change that, or both.
UPDATE:
Here is a list of the names of a few decent Artificial Intelligence LinkedIn groups (unless they changed their policies recently, it should be easy enough to join):
'Artificial Intelligence Researchers, Faculty + Professionals'
'Artificial Intelligence Applications'
'Artificial Neural Networks'
'AGI — Artificial General Intelligence'
'Applied Artificial Intelligence' (not too much going on at the moment, and still dealing with some spam, but it is getting better)
'Text Analytics' (if you're interested in that)
Nominally a good problem to have, but I'm pretty sure it's because something funny is going on...
As context, I'm working on a problem in the facial expression/recognition space, so getting 100% accuracy seems incredibly implausible (not that it would be plausible in most applications...). I'm guessing there is either some consistent bias in the data set that is making it overly easy for an SVM to pull out the answer, or, more likely, I've done something wrong on the SVM side.
I'm looking for suggestions to help understand what is going on. Is it me (my usage of LibSVM)? Or is it the data?
The details:
~2500 labeled data vectors/instances (transformed video frames of individuals; fewer than 20 individual persons total), binary classification problem. ~900 features/instance. Unbalanced data set at about a 1:4 ratio.
Ran subset.py to separate the data into test (500 instances) and train (remaining).
Ran "svm-train -t 0 ". (Note: apparently no need for '-w1 1 -w-1 4'...)
Ran svm-predict on the test file. Accuracy=100%!
Things tried:
Checked about 10 times over that I'm not training & testing on the same data files, through some inadvertent command-line argument error
Re-ran subset.py (even with -s 1) multiple times and trained/tested on multiple different data splits (in case I randomly hit upon the most magical train/test pair)
ran a simple diff-like check to confirm that the test file is not a subset of the training data
svm-scale on the data has no effect on accuracy (accuracy = 100%). (Although the number of support vectors does drop from nSV=127, nBSV=64 to nSV=72, nBSV=0.)
(weird) Using the default RBF kernel (instead of linear, i.e., removing '-t 0') results in accuracy going to garbage (?!)
(sanity check) running svm-predict using a model trained on a scaled data set against an unscaled data set results in accuracy = 80% (i.e., it always guesses the dominant class). This is strictly a sanity check to make sure that somehow svm-predict is nominally acting right on my machine.
Tentative conclusion?:
Something with the data is whacked: somehow, within the data set, there is a subtle, experimenter-driven effect that the SVM is picking up on.
(This doesn't, on first pass, explain why the RBF kernel gives garbage results, however.)
Would greatly appreciate any suggestions on a) how to fix my usage of LibSVM (if that is actually the problem) or b) determine what subtle experimenter-bias in the data LibSVM is picking up on.
Two other ideas:
Make sure you're not training and testing on the same data. This sounds kind of dumb, but in computer vision applications you should take care: make sure you're not repeating data (say, two frames of the same video falling into different folds), that you're not training and testing on the same individual, etc. It is more subtle than it sounds.
Make sure you search for gamma and C parameters for the RBF kernel. There are good theoretical (asymptotic) results that justify that a linear classifier is just a degenerate RBF classifier. So you should just look for a good (C, gamma) pair.
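If it helps, here is a minimal (C, gamma) search sketch using scikit-learn's SVC instead of the LibSVM command line; the placeholder data and grid values are arbitrary assumptions, not recommendations for this data set:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data standing in for the real ~2500 x 900 feature matrix and labels.
X, y = make_classification(n_samples=500, n_features=50, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2, 1e-1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # cross-validated grid search
search.fit(X, y)
print(search.best_params_, search.best_score_)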
Notwithstanding that the devil is in the details, here are three simple tests you could try:
Quickie (~2 minutes): Run the data through a decision tree algorithm. This is available in Matlab via classregtree, or you can load into R and use rpart. This could tell you if one or just a few features happen to give a perfect separation.
Not-so-quickie (~10-60 minutes, depending on your infrastructure): Iteratively split the features (i.e. from 900 to 2 sets of 450), train, and test. If one of the subsets gives you perfect classification, split it again. It would take fewer than 10 such splits to find out where the problem variables are. If it happens to "break" with many variables remaining (or even in the first split), select a different random subset of features, shave off fewer variables at a time, etc. It can't possibly need all 900 to split the data.
Deeper analysis (minutes to several hours): try permutations of labels. If you can permute all of them and still get perfect separation, you have some problem in your train/test setup. If you select increasingly larger subsets to permute (or, if going in the other direction, to leave static), you can see where you begin to lose separability. Alternatively, consider decreasing your training set size and if you get separability even with a very small training set, then something is weird.
Method #1 is fast & should be insightful. There are some other methods I could recommend, but #1 and #2 are easy and it would be odd if they don't give any insights.
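A quick sketch of tests #1 and #3 in Python/scikit-learn, using a decision tree for both checks for speed; the data here is a random placeholder, and with the real X and y, high accuracy on permuted labels would point to a leak in the train/test setup:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder data standing in for the real feature matrix and labels.
X, y = make_classification(n_samples=2500, n_features=900, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=500, random_state=0)

# Test 1: can a shallow decision tree already separate the classes (almost) perfectly?
tree = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)
print("tree accuracy:", tree.score(X_te, y_te))

# Test 3: permute the training labels; near-perfect test accuracy here would mean
# the labels carry no real signal and something in the setup is leaking.
tree_perm = DecisionTreeClassifier(max_depth=3).fit(X_tr, np.random.permutation(y_tr))
print("permuted-label accuracy:", tree_perm.score(X_te, y_te))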
I am a research student. I am searching for large data sets for the knapsack problem. I want to test my algorithm on them, but I couldn't find large data. I need data with 1000 items; the capacity doesn't matter. The point is that the more items there are, the better for my algorithm. Is there any large data set available on the internet? Does anybody know? Please, guys, I need it urgently.
You can quite easily generate your own data. Just use a random number generator and generate lots and lots of values. To test that your algorithm gives the correct results, compare it to the results from another known working algorithm.
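For example, a few lines of Python will generate an instance of any size; the value ranges and the half-of-total-weight capacity below are arbitrary choices:

import random

# Generate a random 0-1 knapsack instance with n items.
def generate_instance(n, max_weight=100, max_value=1000, seed=0):
    random.seed(seed)
    weights = [random.randint(1, max_weight) for _ in range(n)]
    values = [random.randint(1, max_value) for _ in range(n)]
    capacity = sum(weights) // 2           # arbitrary: half the total weight
    return weights, values, capacity

weights, values, capacity = generate_instance(1000)
print(len(weights), capacity)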
I have the same requirement.
Obviously only brute force will guarantee the optimal answer, and that won't work for large problems.
However we could pitch our algorithms against each other...
To be clear, my algorithm works for 0-1 problems (i.e. 0 or 1 of each item), Integer or decimal data.
I also have a version that works for 2 dimensions (e.g. Volume and Weight vs. Value).
My file reader uses a simple CSV format (Item-name, weight, value):
X229257,9,286
X509192,11,272
X847469,5,184
X457095,4,88
etc....
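A minimal reader for that CSV layout in Python (the file name is an assumption; weights and values are read as decimals since the algorithm accepts integer or decimal data):

import csv

# Read (item-name, weight, value) rows in the format shown above.
items = []
with open("knapsack_items.csv", newline="") as f:
    for name, weight, value in csv.reader(f):
        items.append((name, float(weight), float(value)))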
If I recall correctly, I've tested mine on 1000 items too.
Regards.
PS:
I ran my algorithm against the problem on Rosetta Code that Mark highlighted (thank you). I got the same result, but my solution is much more scalable than the dynamic programming / LP solutions and will work on much bigger problems.