Recommended AI/machine learning technique: profile inputs, income prediction

My project looks like this: my data set is a bunch of profiles of people, with various attributes, e.g. a boolean hasJob and an int healthScore, and their income. Using this data, I'm trying to predict their future income. Each profile also has a history: e.g., what their attributes and income were in the past.
So in essence I'm trying to map multiple sets of (x booleans, y numbers) to a number (salary in the coming year).
I've considered neural networks, Bayes nets, and genetic algorithms for function-fitting. Any suggestions or input?
Thanks in advance!
--Emily

What you want to do is called "time series modeling". However, you probably have only very little data per series (per person). I think it is difficult to find one model that fits every person, because you would be making general assumptions such as everyone being equally career-oriented. This is also a very noisy target: it could be, for example, that you have to take into account whether someone is a sweet-talker. How do you measure such a thing? I'm pretty sure your current attributes carry enough noise to make it difficult to predict anything.
When you say health status, do you mean physical health only, or mental health as well? In different businesses different things are important. What about the business or industry they are working in? Its health and growth potential? I would assume this highly influences their income.
I also think that you have dependent variables, since attributes could be (and likely are) influenced by your target variable: e.g., people with higher income have better health.
It sounds like a very, very complex and difficult problem, and definitely not one where "I naively grouped my data and tried a bunch of methods" is going to give meaningful results. I would suggest learning more about time series modeling and especially about the data that you have. Maybe try starting out with clustering persons by their initial attributes and seeing how they develop. Are there any variables that correlate with this development?
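A minimal sketch of that clustering idea, assuming a pandas DataFrame with one row per person per year (the column names personId, year, hasJob, healthScore, and income are made up for illustration):

    # Cluster people by their earliest observed attributes, then see how
    # average income develops over time within each cluster.
    # Column names are hypothetical placeholders.
    import pandas as pd
    from sklearn.cluster import KMeans

    df = pd.read_csv("profiles.csv")  # assumed layout: one row per person per year

    first = df.sort_values("year").groupby("personId").first()
    features = first[["hasJob", "healthScore"]].astype(float)
    first["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)

    merged = df.merge(first[["cluster"]], left_on="personId", right_index=True)
    print(merged.groupby(["cluster", "year"])["income"].mean().unstack())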
What is your research question?


Auto incrementing from two different points on the same table?

Each customer has an ID. For every new customer it's incremented by 1.
My company wants to start selling to a new type of customer (let's call it customer B), and wants their IDs to start at 30k.
Is this a poor design? Our system has an option to set the customer type, so there is no need for this. We're not expected to reach this number with our other type of customer until 2050 (the counter is currently at 10k).
I'm trying to convince them that this is a bad idea... It also doesn't help that the third party implementing this is OK with it.
Confusing technical and functional keys is the really bad idea here. Technical keys are designed to be unchanging: they maintain referential integrity and speed up joins. Technical keys should also not be exposed at the application level (i.e., in screen forms).
Functional keys are visible to application users and may be subject to change according to new requirements.
Another bad idea is to use ranges of a number to separate data partitions (customer groups in your case). A character code with a structured format, like a social security number, is a stronger and more scalable solution (although it is a sort of denormalization, too).
The recommended solution is to have a technical ID as an auto-incremented number, but also add a functional customer code.
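A minimal sketch of that two-key layout, here in SQLite via Python for brevity (the schema and the customer-code format are illustrative, not a finished design):

    # Technical key: auto-incremented, never shown to users.
    # Functional key: visible customer code that may carry meaning and may change.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE customer (
        customer_id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- technical key
        customer_code TEXT NOT NULL UNIQUE,               -- functional key, e.g. 'B-00001'
        customer_type TEXT NOT NULL,                      -- 'A' or 'B' lives here, not in the ID
        name          TEXT NOT NULL
    );
    """)
    conn.execute("INSERT INTO customer (customer_code, customer_type, name) VALUES (?, ?, ?)",
                 ("B-00001", "B", "Acme Corp"))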
If the identifier is customer-facing or business user-facing, then it makes sense to design it with readability and usability in mind. Most people find it very natural to work with identification schemes that have information encoded within them: think of phone numbers, vehicle licence plate numbers, airline flight numbers. There is evidence that a well-structured identification scheme can improve usability and reduce the rate of errors in day-to-day use. You could expect that a well-designed "meaningful" identification scheme will result in fewer errors than an arbitrary incrementing number with no identifiable features.
There's nothing inherently wrong with doing this. Whether it makes sense in your business context depends a lot on how the number will be used. Technically speaking it isn't difficult to build multiple independent sequences, e.g. using the SEQUENCE feature in SQL. You may want to consider including other features as well, like check digits.
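If you do add a check digit, the standard Luhn algorithm is one common choice. A sketch (whether it's worth the trouble depends on how often the numbers are keyed in by hand):

    # Luhn check digit, as used for card and account numbers.
    def luhn_check_digit(number: str) -> int:
        total = 0
        for i, ch in enumerate(reversed(number)):
            d = int(ch)
            if i % 2 == 0:      # double every second digit, starting from the right
                d *= 2
                if d > 9:
                    d -= 9
            total += d
        return (10 - total % 10) % 10

    print(luhn_check_digit("30001"))  # append this digit to form the full customer number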

Is it a good idea to use a pattern recognition tool to predict traffic conditions using previous data?

I wish to use an artificial neural network pattern recognition tool to predict traffic flow in an urban area using previous traffic count data.
I want to know whether this is a good technique for predicting traffic conditions.
Probably should be posted on CrossValidated.
The exact effectiveness depends on what features you use to predict traffic conditions. The question "whether it's a good technique" is too vague: neural networks might work pretty well under certain circumstances and really badly in others. Without a specific context it's hard to tell.
Typically neural networks work pretty well at predicting patterns. If you can frame your problem as a specific pattern recognition task, then it's quite possible that neural networks will work well.
-- Update --
Based on the following comment:
What I need to predict is the vehicle count on a given road, at a given time and day, using a previous data set. As an example: when I enter the road name that I need to travel, the time I wish to travel, and the day, I need to get the vehicle count for that road at that time and day.
I would say be very cautious with neural networks here, because depending on your data source, your data may be really sparse. Let's say you have 10,000 roads; then for a one-month period, you are dividing your data set by 30 days, then 24 hours, then 10,000 roads.
If you want your neural network to work, you need at least enough data in each partition of your data set. Dividing the data set as described above already gives you 7,200,000 partitions; just think about how much data you need in total. With a small dataset, most of those 7 million partitions will have no data in them, which implies that your neural network prediction will fail most of the time, since you don't have data to start with.
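As a rough way to check that sparsity before committing to a model, you could count how many (road, day, hour) cells actually contain observations. A pandas sketch with made-up column names:

    # What fraction of (road, day, hour) cells have any data at all?
    import pandas as pd

    df = pd.read_csv("traffic_counts.csv")  # assumed columns: road, day, hour, vehicleCount

    cells_with_data = df.groupby(["road", "day", "hour"]).size()
    total_cells = df["road"].nunique() * df["day"].nunique() * df["hour"].nunique()
    print(f"{len(cells_with_data)} of {total_cells} cells have data "
          f"({len(cells_with_data) / total_cells:.1%})")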
This is part of the reason why big companies are sort of crazy about big data, because you just never get enough of it.
But anyway, do ask on CrossValidated since people there are more statistician-y and can provide better explanations.
And please note, there might be other ways to split your data (or you might not split it at all) to make it work. The above is just an example of a pitfall you might encounter.

Does a Decision Network / Decision Forest take into account relationships between inputs

I have experience with neural networks, specifically back-propagating ones, and I know that when a hidden layer is introduced, dependencies between the inputs passed to the trainer become part of the resulting model's knowledge.
Is the same true for decision networks?
I have found information about these algorithms (ID3, etc.) somewhat hard to find. I have been able to find the actual algorithms, but information such as expected/optimal dataset formats and other overviews is rare.
Thanks.
Decision trees are actually very easy to provide data to, because all they need is a table of data and an indication of which feature (column) you want to predict. The data can be discrete or continuous for any feature. There are several flavors of decision trees with different support for continuous and discrete values, and they work differently, so understanding how each one works can be challenging.
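For instance, with scikit-learn's CART-style implementation, "a table plus a target column" looks roughly like this (the data is made up, and the discrete feature is one-hot encoded because this particular implementation wants numbers):

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    df = pd.DataFrame({
        "outlook":  ["sunny", "sunny", "rain", "rain"],   # discrete feature
        "humidity": [80, 65, 70, 90],                     # continuous feature
        "play":     ["no", "yes", "yes", "no"],           # target column
    })
    X = pd.get_dummies(df[["outlook", "humidity"]])       # one-hot encode 'outlook'
    y = df["play"]
    tree = DecisionTreeClassifier(random_state=0).fit(X, y)

    new = pd.get_dummies(pd.DataFrame({"outlook": ["sunny"], "humidity": [75]}))
    new = new.reindex(columns=X.columns, fill_value=0)    # align columns with training data
    print(tree.predict(new))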
Different decision tree algorithms with comparison of complexity or performance
Depending on the type of algorithm you are interested in, it can be hard to find information without reading the actual papers if you want to try to implement it. I've implemented the CART algorithm, and the only option for that was to find the original 200-page book about it. Most other treatments discuss ideas like splitting in enough detail, but fail to discuss any other aspect at more than a high level.
As for whether they take into account dependencies between inputs: I believe a decision tree only assumes dependence between each input feature and the prediction feature. If an input were independent of the prediction feature, you couldn't use it as a split criterion. But between the input features themselves, I believe they are assumed to be independent of each other. I'd have to check the book to be sure, but off the top of my head I think that's true.

Data quality database model

I need an example of a database model to attach to a database for data quality. The best form of answer would be, at the very least, DDL that's executable in MySQL; other RDBMS DDLs are okay, I'll just post another question asking for a port of the code.
A good explanation would be a huge plus.
Questions, comments, feedback, etc.: just comment, thanks!
The biggest problem is identifying meaningful measures of quality. That's so highly application-dependent, I doubt that anybody will be able to help you very much. (At least not without a lot more information--perhaps more than you're allowed to give.)
But let's say your application records observations of birds by individuals. (I'm just throwing this together off the top of my head. Read it for the gist, and expect the details to crumble under scrutiny.) Under average field conditions,
some species are hard for even a beginner to get wrong
some species are hard for an expert to get right
a specific individual's ability varies irregularly over time (good days, bad days)
individuals usually become more skilled over time
you might be highly skilled at identifying hawks, and totally suck at identifying gulls
individuals are prone to suggestion (who they're with makes a difference in their reliability)
So, to take a shot at assessing the quality of an identification, you might try to record a lot of information besides the observation "3 red-tailed hawks at Cape May on 05-Feb-2011 at 4:30 pm". You might try to record
weather
lighting
temperature (some birders suck in the cold)
hours afield (some birders suck after 3 hours, or after 20 cold minutes)
names of others present
average difficulty of correctly identifying red-tailed hawks
probability that this individual could correctly identify red-tails under these field conditions
alcohol intake
Although this might be "meta" to field birders, to the database designer it's just data. And you'd design the tables just like you'd design them for any other application. (That's what I did, anyway.)
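As a rough illustration of how that "meta" becomes ordinary tables: a sketch in SQLite syntax via Python (you asked for MySQL DDL; this should port almost verbatim, and every table and column name here is illustrative):

    import sqlite3

    conn = sqlite3.connect("birding.db")
    conn.executescript("""
    CREATE TABLE observer (
        observer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE species (
        species_id        INTEGER PRIMARY KEY,
        common_name       TEXT NOT NULL,
        avg_id_difficulty REAL          -- average difficulty of correct identification
    );
    CREATE TABLE observation (
        observation_id INTEGER PRIMARY KEY,
        observer_id    INTEGER NOT NULL REFERENCES observer,
        species_id     INTEGER NOT NULL REFERENCES species,
        observed_at    TEXT NOT NULL,   -- e.g. '2011-02-05 16:30'
        location       TEXT NOT NULL,   -- e.g. 'Cape May'
        bird_count     INTEGER NOT NULL,
        weather        TEXT,
        lighting       TEXT,
        temperature_c  REAL,
        hours_afield   REAL,
        alcohol_units  REAL
    );
    CREATE TABLE observation_companion (  -- names of others present
        observation_id INTEGER REFERENCES observation,
        observer_id    INTEGER REFERENCES observer,
        PRIMARY KEY (observation_id, observer_id)
    );
    """)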

Best way to automate testing of AI algorithms?

I'm wondering how people test artificial intelligence algorithms in an automated fashion.
One example would be for the Turing test: say there were a number of submissions for a contest. Is there any conceivable way to score candidates in an automated fashion, other than just having humans test them out?
I've also seen some data sets (obscured images of numbers/letters, groups of photos, etc.) that can be fed in and learned over time. What good resources are out there for this?
One challenge I see: you don't want an algorithm that tailors itself to the test data over time, since you are trying to see how well it does in the general case. Are there any techniques to ensure it doesn't do this? Such as giving it a random test each time, or averaging its results over a bunch of random tests.
Basically, given a bunch of algorithms, I want some automated process to feed it data and see how well it "learned" it or can predict new stuff it hasn't seen yet.
This is a complex topic: good AI algorithms are generally the ones which can generalize well to "unseen" data. The simplest method is to have two datasets: a training set and an evaluation set used for measuring performance. But generally you want to "tune" your algorithm, so you may want three datasets: one for learning, one for tuning, and one for evaluation. What tuning means depends on your algorithm, but a typical example is a model with a few hyper-parameters (for example, parameters of your Bayesian prior under the Bayesian view of learning) that you would like to tune on a separate dataset. The learning procedure would already have set values for them (or maybe you hardcoded them), but having enough data may let you tune them separately.
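A sketch of that three-way division with scikit-learn (the 60/20/20 proportions and the iris data are arbitrary stand-ins):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # Carve off 40%, then split it half-and-half into tuning and evaluation sets.
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_tune, X_eval, y_tune, y_eval = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)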
As for making those separate datasets, there are many ways to do so, for example by dividing the data you have available into subsets used for different purposes. There is a tradeoff to be made because you want as much data as possible for training, but you want enough data for evaluation too (assuming you are in the design phase of your new algorithm/product).
A standard method to do so in a systematic way from a known dataset is cross validation.
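A minimal cross-validation sketch, again with arbitrary stand-in data and model:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # 5-fold cross validation: each example is used for both training and
    # evaluation across the folds, and the scores are averaged.
    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
    print(scores.mean(), scores.std())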
Generally when it comes to this sort of thing you have two datasets - one large "training set" which you use to build and tune the algorithm, and a separate smaller "probe set" that you use to evaluate its performance.
#Anon has the right of things - training and what I'll call validation sets. That noted, the bits and pieces I see about developments in this field point at two things:
Bayesian classifiers: there's probably something like this filtering your email. In short, you train the algorithm to make a probabilistic decision about whether a particular item is part of a group or not (e.g. spam vs. ham); a toy sketch follows this list.
Multiple classifiers: this is the approach that the winning group in the Netflix challenge took, whereby it's not about optimizing one particular algorithm (e.g. Bayesian, genetic programming, neural networks, etc.) but about combining several to get a better result.
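As promised above, a toy sketch of a Bayesian (naive Bayes) spam/ham classifier; the four training messages are made up:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train = ["win money now", "cheap pills offer", "meeting at noon", "lunch tomorrow"]
    labels = ["spam", "spam", "ham", "ham"]

    vec = CountVectorizer()
    model = MultinomialNB().fit(vec.fit_transform(train), labels)
    print(model.predict(vec.transform(["free money offer"])))  # expect ['spam']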
As for data sets, Weka has several available. I haven't explored other libraries for data sets, but mloss.org appears to be a good resource. Finally, data.gov offers a lot of sets that provide some interesting opportunities.
Training sets and test sets are very common for k-means and other clustering algorithms, but to build something artificially intelligent without supervised learning (which means without a training set), you are building a "brain", so to speak, based on:
In chess: all possible future states reachable from the current game state.
In most AI learning (reinforcement learning) you have a problem where the "agent" is trained by playing the game over and over. Basically, you ascribe a value to every state, then you assign an expected value to each possible action at a state.
So say you have S states and A actions per state (although you might have more possible moves in one state and fewer in another); then you want to figure out the most valuable states to be in, and the most valuable actions to take.
In order to figure out the values of states and their corresponding actions, you have to play the game through many times. Probabilistically, a certain sequence of states will lead to victory or defeat, and basically you learn which states lead to failure and are "bad states". You also learn which ones are more likely to lead to victory, and these are subsequently "good" states. Each gets a mathematical value associated with it, usually an expected reward.
Reward from second-last state to a winning state: +10
Reward if entering a losing state: -10
So states that give negative rewards pass negative reward backwards: to the state that led to the second-last state, then to the state that led to the third-last state, and so on.
Eventually, you have a mapping of expected reward based on which state you're in and which action you take, and you find the "optimal" sequence of steps to take. This is often referred to as an optimal policy.
Conversely, the ordinary courses of action that you step through while deriving the optimal policy are simply called policies; with respect to Q-learning, you are always implementing some "policy".
Usually, determining the reward is the interesting part. Suppose I reward you for each state transition that does not lead to failure. Then the value of walking through all the states until termination is however many increments I made, i.e. however many state transitions there were.
If certain states have extremely low value, loss is easy to avoid, because almost all bad states get avoided.
However, you don't want to discourage discovery of new, potentially more efficient paths that stray from the one route known to work. So you want to reward and punish the agent in such a way as to ensure "victory", or "keeping the pole balanced", or whatever, for as long as possible, but you don't want to be stuck at local maxima and minima: if failure is too painful, no new, unexplored routes will be tried. (Although there are many approaches in addition to this one.)
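A minimal sketch of these ideas: tabular Q-learning with epsilon-greedy exploration on a made-up five-state chain, where one end is a win (+10) and the other a loss (-10); all constants are arbitrary:

    import random

    N_STATES, ACTIONS = 5, [-1, +1]            # move left or right along the chain
    alpha, gamma, epsilon = 0.5, 0.9, 0.1      # learning rate, discount, exploration rate
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

    for episode in range(500):
        s = 2                                  # start in the middle
        while 0 < s < N_STATES - 1:
            if random.random() < epsilon:      # explore: try something new
                a = random.choice(ACTIONS)
            else:                              # exploit current value estimates
                a = max(ACTIONS, key=lambda a: Q[(s, a)])
            s2 = s + a
            if s2 == N_STATES - 1:
                r, future = +10, 0.0           # entering the winning state
            elif s2 == 0:
                r, future = -10, 0.0           # entering the losing state
            else:
                r, future = 0, max(Q[(s2, b)] for b in ACTIONS)
            # This update is what propagates reward backwards over episodes.
            Q[(s, a)] += alpha * (r + gamma * future - Q[(s, a)])
            s = s2

    # The learned policy: best action in each non-terminal state.
    print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(1, N_STATES - 1)})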
So when you ask "how do you test AI algorithms?", the best part is that the testing itself is how many of these "algorithms" are constructed. The algorithm is designed to test a certain course of action (policy). It's much more complicated than
"turn left every half mile"
it's more like
"turn left every half mile if I have turned right 3 times and then turned left 2 times and had a quarter in my left pocket to pay fare... etc etc"
It's very precise.
So the testing is usually actually how the AI is being programmed. Most models are just probabilistic representations of what is probably good and probably bad. Calculating every possible state is easier for computers (we thought!) because they can focus on one task for very long periods of time, and how much they remember is exactly how much RAM you have. However, we learn by affecting neurons in a probabilistic manner, which is why the memristor is such a great discovery: it's just like a neuron!
You should look at neural networks; they're mind-blowing. The first time I read about making a "brain" out of a matrix of fake neuron synaptic connections... a brain that can "remember"... it basically rocked my universe.
AI research is mostly probabilistic because we don't know how to make "thinking"; we just know how to imitate our own inner learning process of try, try again.
