Do you keep the true distribution between classes when you manually create your own dataset, or do you make it balanced? - dataset

(I struggled a bit to phrase the title - please feel free to suggest another title).
I have a text dataset that I need to classify; say there are three classes. I need to create the targets by manually setting the labels based on the text (say the three classes are dog, cat, bird).
When I do so I notice we have, say, 70% dog, 20% cat and 10% bird.
Since a lot of machine learning models struggle with imbalanced data, my first thought would be to force the dataset to be balanced by simply ignoring some of the dog and cat texts (i.e. "undersampling"), thus ending up with an (almost) balanced dataset that is easier to train a model on.
My concern, though, is that if we want to train e.g. a neural network and get a probability for each class, wouldn't training on the wrong distribution of the data result in over- or under-confident predictions?

Indeed, if your dataset is imbalanced, there is a risk of hurting the performance of your classifier.
You'll find plenty of libraries to help you deal with this problem (see below), and the bottom line is that having classes equally represented in your dataset can only help prevent bias in your classifier:
https://imbalanced-learn.org/stable/auto_examples/index.html#general-examples
https://github.com/ufoym/imbalanced-dataset-sampler
https://github.com/MaxHalford/pytorch-resample etc...
(But you can also do the sampling yourself; it shouldn't be too difficult, e.g. libraries like pandas have such functionality.)
As a safeguard, split your dataset into 3:
Training (e.g. 70% of your data): the bulk of the data, used for learning
Validation (e.g. 20%): what your classifier uses for regularization (i.e. to prevent overfitting)
Test (e.g. 10%): this data is NEVER exposed to your classifier for learning purposes; you keep it separate and only use it at the end, on your final model, to evaluate its true performance (you call predict and compare with the expected classes).
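For concreteness, here is a minimal sketch of both steps (undersampling with pandas, then a 70/20/10 stratified split with scikit-learn). The column names `text`/`label` and the file name are assumptions for illustration, not part of the original question:

```python
# Minimal sketch (assumed columns: "text", "label"; assumed file name).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("labelled_texts.csv")   # hypothetical labelled dataset

# Undersample every class down to the size of the rarest class.
min_count = df["label"].value_counts().min()
balanced = (
    df.groupby("label", group_keys=False)
      .apply(lambda g: g.sample(min_count, random_state=42))
)

# 70% train, 20% validation, 10% test (stratified so each split keeps the class ratio).
train, rest = train_test_split(balanced, test_size=0.30,
                               stratify=balanced["label"], random_state=42)
val, test = train_test_split(rest, test_size=1/3,
                             stratify=rest["label"], random_state=42)

print(train["label"].value_counts(), val.shape, test.shape)
```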
This should be a good starting point.

Related

How to eliminate "unnecessary" values to a neural network?

My professor asked my class to make a neural network to try to predict if a breast cancer is benign or malignant. To do this I'm using the Breast Cancer Wisconsin (Diagnostic) Data Set.
As a tip, my professor said that not all 30 attributes need to be used as inputs (there are 32, but the first 2 are the ID and Diagnosis). What I want to ask is: how am I supposed to take those 30 inputs (which would create 100+ weights depending on how many neurons I use) and reduce them to a smaller number?
I've already found how to "prune" a neural net, but I don't think that's what I want. I'm not trying to eliminate unnecessary neurons, but to shrink the input itself.
PS: Sorry for any English errors, it's not my native language.
That is a question that is under active research right now. It is called feature selection, and there are already several techniques for it. One is Principal Components Analysis (PCA), which reduces the dimensionality of your dataset by keeping the components that retain the most variance. Another thing you can do is check whether there are highly correlated variables: if two inputs are highly correlated, they may carry almost the same information, so one of them can be removed without worsening the performance of your classifier much. A third technique you could use is deep learning, which tries to learn the features that will later be used to feed your classifier. More info about deep learning and PCA can be found here: http://deeplearning.stanford.edu/wiki/index.php/Main_Page
This problem is called feature selection. It is mostly the same for neural networks as for other classifiers. You could prune your dataset while retaining the most variance using PCA. To go further, you could use a greedy approach and evaluate your features one by one by training and testing your network with each feature excluded in turn.
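As a rough illustration of the PCA route mentioned in both answers above, here is a minimal scikit-learn sketch; the array shape and the 95% variance threshold are arbitrary placeholder choices, not recommendations:

```python
# Minimal sketch: reduce the 30 numeric attributes to a few principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(569, 30)                    # placeholder for the 30 attributes

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale
pca = PCA(n_components=0.95)                   # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape, pca.explained_variance_ratio_)
```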
There is a technique for feature selection using just neural networks
Split your dataset into three groups:
Training data used for supervised training
Validation data used to verify that the neural network is able to generalize
Accuracy-testing data used to test which of the features are required
The steps:
Train a network on your training and validation set, just like you would normally do.
Test the accuracy of the network with the third dataset.
Locate the variable which yields the smallest drop in the accuracy test above when dropped (dropped meaning always feeding a zero as the input signal).
Retrain your network with the new selection of features
Keep doing this until either the network fails to train or there is just one variable left.
Here is a paper on the technique
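A minimal sketch of the zero-out procedure described above, using a scikit-learn MLP as a stand-in for whatever network you actually train (the data, network size, and split are placeholders):

```python
# Minimal sketch: train once, then measure the accuracy drop when each input is fed zeros.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = np.random.rand(500, 30), np.random.randint(0, 2, 500)   # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500).fit(X_train, y_train)
baseline = net.score(X_test, y_test)

drops = {}
for i in range(X.shape[1]):
    X_zeroed = X_test.copy()
    X_zeroed[:, i] = 0.0               # "drop" the feature by feeding zeros
    drops[i] = baseline - net.score(X_zeroed, y_test)

# The feature with the smallest accuracy drop is the best candidate to remove
# before retraining on the reduced feature set.
least_useful = min(drops, key=drops.get)
print(least_useful, drops[least_useful])
```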

How to make the database structure extensible for a turn-based game?

I'm working on a fantasy turn-based game.
I now have to create the database structure for my spells. The problem is that I don't really have a good idea of how to create it. Maybe the effects of those spells should not be stored in a database at all?
For instance, effects could be: increase attack, pull an enemy, heal, teleport, hide, place a mine and so on... Effects are pretty different, and I would like the database structure to be extensible.
Edit:
It's a turn-based game; time is measured in turns and distance in squares.
Some examples of what I mean below.
Let's say we have Incinerate:
it can target only one enemy (not an ally)
it can be cast at a distance of 3 squares
it deals 5 damage per turn
it lasts 3 turns
Now we can take Shock Wave:
it travels in a line for 4 squares
it starts from a square near the caster
it damages the first target it hits (ally or enemy)
it deals 5 damage to the target and knocks it back 1 square
And the last one, Rain Call:
it can be cast at any distance
it's a cloud the size of a 5x5 square
it can target both allies and enemies
only fire creatures take damage
while casting, the caster is immobilized and loses 5 mana per turn
As you can see there are a lot of possible columns: the distance it travels, turns, casting distance, type (damage, heal, armor, etc), value (+2), target (enemy, ally, both), size, etc.
I would not use a relational database for storing spells. Relational databases are good in cases when most of the following conditions apply:
you have a very large amount of data,
the data can logically be organized as n-ary relations (tables, rows, columns),
you have many users that access the data concurrently,
you need ACID properties,
et cetera
Databases are like trucks. They are big. They are difficult to use. They are expensive (in terms of needed expertise, maintenance time, run-time efficiency, etc., if not monetarily). They are very good at what they are good at, but not at anything else. Don't use a truck when a bicycle would suffice.
Let's come to your problem. The number of different types of spells is surely bounded and known at compile time, so why not define an interface ISpell and let each spell type be a class that implements ISpell? (You can also define an abstract class for common code.) Then a SpellFactory can construct and provide access to all the spells when the program starts. Do you really need the spells to be accessible from outside, independently of your code?
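A minimal sketch of that design in Python (the answer names ISpell and SpellFactory; the concrete spell fields and the `caster`/`target` methods are illustrative assumptions, not part of the original answer):

```python
# Minimal sketch of the ISpell / SpellFactory idea; fields are illustrative.
from abc import ABC, abstractmethod

class ISpell(ABC):
    @abstractmethod
    def apply(self, caster, target):
        """Apply the spell's effect for one turn."""

class Incinerate(ISpell):
    range_squares, damage_per_turn, duration_turns = 3, 5, 3
    def apply(self, caster, target):
        # Assumes a target object exposing hit_points (hypothetical).
        target.hit_points -= self.damage_per_turn

class ShockWave(ISpell):
    travel_squares, damage, knockback = 4, 5, 1
    def apply(self, caster, target):
        target.hit_points -= self.damage
        target.push_back(self.knockback)   # hypothetical target method

class SpellFactory:
    """Builds every known spell once, at program start."""
    def __init__(self):
        self._spells = {cls.__name__: cls() for cls in (Incinerate, ShockWave)}
    def get(self, name):
        return self._spells[name]
```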
If hard-coding a SpellFactory is not flexible enough for your purposes, you can use XML configuration files: <spell type="blind" description="bla bla" picture="file.jpg"> <effects> <effect .. /> .. </effects> <range>5</range> </spell>, etc. I don't know much about computer games, but this is what they did in Sid Meier's Civilization, for example. Then, instead of hard-coding the different spells in the SpellFactory, you can let it read them from the configuration file at startup.
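And a rough sketch of reading such a configuration file with Python's standard xml.etree module; the tag and attribute names simply follow the snippet above and would need to match whatever schema you settle on:

```python
# Minimal sketch: load spell definitions from an XML file like the one sketched above.
import xml.etree.ElementTree as ET

def load_spells(path="spells.xml"):          # hypothetical file name
    spells = {}
    for spell in ET.parse(path).getroot().findall("spell"):
        spells[spell.get("type")] = {
            "description": spell.get("description"),
            "range": int(spell.findtext("range", default="0")),
            "effects": [e.attrib for e in spell.findall("./effects/effect")],
        }
    return spells
```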
As far as I can see, using configuration files instead of a database has the following advantages:
It is a fast, easy, lightweight solution,
It is much more flexible than having all the spells share the same set of columns (most of which will not make sense for a specific spell),
It is much easier to have more than one version of the spell set at the same time, for experiments, variations, etc.,
You can let end users access and manipulate the XML files to customize the game, without giving them access to a database that would also contain sensitive data,
et cetera.
The disadvantages:
More people know about relational databases than about the XML format, so you might need a couple of hours to learn how to read and manipulate XML elements.
Your question is pretty broad. It depends on a lot of things: are you going to load the spells at runtime? Maybe you will load them at the beginning of the game? What database will you be using?
Amit Bhargava's suggestion is good and has the advantage of being user-understandable. However, strings are pretty slow, so what you could do is use flags in your spell table. Then, based on the flag, you know which type of spell it is.

Does a Decision Network / Decision Forest take into account relationships between inputs

I have experience dealing with neural networks, specifically ones of the back-propagating nature, and I know that dependencies between the inputs passed to the trainer become part of the resulting model's knowledge when a hidden layer is introduced.
Is the same true for decision networks?
I have found information about these algorithms (ID3, etc.) somewhat hard to find. I have been able to find the actual algorithms, but information such as expected/optimal dataset formats and other overviews is rare.
Thanks.
Decision trees are actually very easy to provide data to, because all they need is a table of data and an indication of which column (feature) you want to predict. That data can be discrete or continuous for any feature. There are several flavors of decision trees with different support for continuous and discrete values, and they work differently, so understanding how each one works can be challenging.
Different decision tree algorithms with comparison of complexity or performance
Depending on the type of algorithm you are interested in, it can be hard to find information without reading the actual papers if you want to try to implement it. I've implemented the CART algorithm, and the only option for that was to find the original 200-page book about it. Most other treatments only discuss ideas like splitting in enough detail, but fail to discuss any other aspect at more than a high level.
As for whether they take dependencies between inputs into account: I believe they only assume dependence between each input feature and the prediction feature. If an input were independent of the prediction feature, you couldn't use it as a split criterion. Between input features, however, I believe they are assumed to be independent of each other. I'd have to check the book to be sure, but off the top of my head I think that's true.
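To make the "table plus a target column" point above concrete, here is a minimal scikit-learn sketch; the file name and column name are hypothetical, and note that scikit-learn's trees expect numeric feature columns, so categorical text columns would need encoding first:

```python
# Minimal sketch: a decision tree only needs the table and the column to predict.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("my_table.csv")        # hypothetical table, one row per example
X = df.drop(columns=["target"])         # every other (numeric) column is a feature
y = df["target"]                        # the column we want to predict

tree = DecisionTreeClassifier(max_depth=5).fit(X, y)
print(tree.score(X, y))
```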

Feature selection and unsupervised learning for multilingual data + machine learning algorithm selection

Questions
I want to classify/categorize/cluster/group together a set of several thousand websites. There's data that we can train on, so we can do supervised learning, but it's not data that we've gathered and we're not adamant about using it -- so we're also considering unsupervised learning.
What features can I use in a machine learning algorithm to deal with multilingual data? Note that some of these languages might not have been dealt with in the Natural Language Processing field.
If I were to use an unsupervised learning algorithm, should I just partition the data by language and deal with each language differently? Different languages might have different relevant categories (or not, depending on your psycholinguistic theoretical tendencies), which might affect the decision to partition.
I was thinking of using decision trees, or maybe Support Vector Machines (SVMs) to allow for more features (from my understanding of them). This post suggests random forests instead of SVMs. Any thoughts?
Pragmatic approaches are welcome! (Theoretical ones, too, but those might be saved for later fun.)
Some context
We are trying to classify a corpus of many thousands of websites in 3 to 5 languages (maybe up to 10, but we're not sure).
We have training data in the form of hundreds of websites already classified. However, we may choose not to use that data set -- if other categories make more sense, we're open to not using the training data that we have, since it is not something we gathered in the first place. We are in the final stages of scraping data/text from the websites.
Now we must decide on the issues above. I have done some work with the Brown Corpus and the Brill tagger, but this will not work because of the multiple-languages issue.
We intend to use the Orange machine learning package.
According to the context you have provided, this is a supervised learning problem.
Therefore, you are doing classification, not clustering. If I misunderstood, please update your question to say so.
I would start with the simplest features, namely tokenize the Unicode text of the pages, use a dictionary to translate every new token to a number, and simply treat the presence of a token as a feature.
Next, I would use the simplest algorithm I can - I tend to go with Naive Bayes, but if you have an easy way to run an SVM, that is also nice.
Compare your results with some baseline - say assigning the most frequent class to all the pages.
Is the simplest approach good enough? If not, start iterating over algorithms and features.
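A minimal sketch of that baseline setup with scikit-learn: token-count features, multinomial Naive Bayes, and a most-frequent-class dummy for comparison. The toy pages and labels are placeholders for the real scraped data:

```python
# Minimal sketch: bag-of-words features, Naive Bayes, and a most-frequent baseline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.dummy import DummyClassifier

texts = ["first example page", "deuxieme page d'exemple",
         "third example page", "quatrieme page"]          # placeholder pages
labels = ["news", "shop", "news", "shop"]                 # placeholder classes

vec = CountVectorizer()                 # tokenizes and maps each token to a column index
X = vec.fit_transform(texts)            # token counts become the features

clf = MultinomialNB().fit(X, labels)
baseline = DummyClassifier(strategy="most_frequent").fit(X, labels)

# On real data, score held-out pages rather than the training pages.
print(clf.score(X, labels), baseline.score(X, labels))
```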
If you go the supervised route, then the fact that the web pages are in multiple languages shouldn't make a difference. If you go with, say, lexical features (bag-of-words style), then each language will end up yielding disjoint sets of features, but that's okay. All of the standard algorithms will likely give comparable results, so just pick one and go with it. I agree with Yuval that Naive Bayes is a good place to start, and only if that doesn't meet your needs should you try something like SVMs or random forests.
If you go the unsupervised route, though, the fact that the texts aren't all in the same language might be a big problem. Any reasonable clustering algorithm will first group the texts by language, and then within each language cluster by something like topic (if you're using content words as features). Whether that's a bug or a feature will depend entirely on why you want to classify these texts. If the point is to group documents by topic, irrespective of language, then it's no good. But if you're okay with having different categories for each language, then yeah, you've just got as many separate classification problems as you have languages.
If you do want a unified set of classes, then you'll need some way to link similar documents across languages. Are there any documents in more than one language? If so, you could use them as a kind of statistical Rosetta Stone, to link words in different languages. Then, using something like Latent Semantic Analysis, you could extend that to second-order relations: words in different languages that don't ever occur in the same document, but which tend to co-occur with words which do. Or maybe you could use something like anchor text or properties of the URLs to assign a rough classification to documents in a language-independent manner and use that as a way to get started.
But, honestly, it seems strange to go into a classification problem without a clear idea of what the classes are (or at least what would count as a good classification). Coming up with the classes is the hard part, and it's the part that'll determine whether the project is a success or failure. The actual algorithmic part is fairly rote.
Main answer is: try different approaches. Without actual testing it's very hard to predict what method will give best results. So, I'll just suggest some methods that I would try first and describe their pros and cons.
First of all, I would recommend supervised learning. Even if the data classification is not very accurate, it may still give better results than unsupervised clustering. One of the reasons for this is the number of random factors used during clustering. For example, the k-means algorithm relies on randomly selected starting points, which can lead to very different results across runs (though x-means modifications seem to normalize this behavior). Clustering will give good results only if the underlying elements form well-separated areas in the feature space.
One approach to treating multilingual data is to use multilingual resources as support points. For example, you can index some Wikipedia articles and create "bridges" between the same topics in different languages. Alternatively, you can create a multilingual association dictionary like the one this paper describes.
As for methods, the first thing that comes to mind is instance-based semantic methods like LSI. It uses the vector space model to calculate distances between words and/or documents. In contrast to other methods it can efficiently handle synonymy and polysemy. Its disadvantages are computational inefficiency and a lack of implementations. One of the phases of LSI uses a very big co-occurrence matrix, which for a large corpus of documents will require distributed computing and other special treatment. There's a modification of LSA called Random Indexing which does not construct the full co-occurrence matrix, but you'll hardly find an appropriate implementation for it. Some time ago I created a library in Clojure for this method, but it is pre-alpha now, so I can't recommend using it. Nevertheless, if you decide to give it a try, you can find the project 'Clinch' of a user 'faithlessfriend' on github (I'll not post a direct link to avoid unnecessary advertisement).
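If you want to experiment with the LSI/LSA idea without building the full co-occurrence matrix yourself, scikit-learn's TruncatedSVD over a TF-IDF matrix is a common approximation; a minimal sketch (the toy documents and the number of components are arbitrary placeholders):

```python
# Minimal sketch of LSA: TF-IDF followed by a truncated SVD projection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

docs = ["a toy document about cats",
        "another toy document about dogs",
        "un document sur les chats"]          # placeholder multilingual documents

lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
doc_vectors = lsa.fit_transform(docs)         # each document becomes a dense low-dimensional vector

print(doc_vectors.shape)
```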
Beyond special semantic methods, the rule "simplicity first" applies. From this point of view, Naive Bayes is the right place to start. The only note here is that the multinomial version of Naive Bayes is preferable: in my experience, word counts really do matter.
SVM is a technique for classifying linearly separable data, and text data is almost never linearly separable (at least several common words appear in any pair of documents). That doesn't mean SVM cannot be used for text classification - you should still try it, but the results may be much worse than for other machine learning tasks.
I don't have enough experience with decision trees, but using them for efficient text classification seems strange to me. I have seen some examples where they gave excellent results, but when I tried to use the C4.5 algorithm for this task, the results were terrible. I believe you should get some software where decision trees are implemented and test them yourself. It is always better to know than to guess.
There's much more to say on every topic, so feel free to ask more questions on a specific topic.

Best way to automate testing of AI algorithms?

I'm wondering how people test artificial intelligence algorithms in an automated fashion.
One example would be the Turing Test - say there were a number of submissions for a contest. Is there any conceivable way to score candidates in an automated fashion, other than just having humans test them out?
I've also seen some data sets (obscured images of numbers/letters, groups of photos, etc.) that can be fed in and learned over time. What good resources are out there for this?
One challenge I see: you don't want an algorithm that tailors itself to the test data over time, since you are trying to see how well it does in the general case. Are there any techniques to ensure it doesn't do this? Such as giving it a random test each time, or averaging its results over a bunch of random tests?
Basically, given a bunch of algorithms, I want some automated process to feed it data and see how well it "learned" it or can predict new stuff it hasn't seen yet.
This is a complex topic - good AI algorithms are generally the ones which generalize well to "unseen" data. The simplest method is to have two datasets: a training set and an evaluation set used for measuring performance. But generally, you want to "tune" your algorithm, so you may want 3 datasets: one for learning, one for tuning, and one for evaluation. What defines tuning depends on your algorithm, but a typical example is a model where you have a few hyper-parameters (for example, parameters in your Bayesian prior under the Bayesian view of learning) that you would like to tune on a separate dataset. The learning procedure would already have set a value for them (or maybe you hardcoded their value), but having enough data lets you tune them separately.
As for making those separate datasets, there are many ways to do so, for example by dividing the data you have available into subsets used for different purposes. There is a tradeoff to be made because you want as much data as possible for training, but you want enough data for evaluation too (assuming you are in the design phase of your new algorithm/product).
A standard method to do so in a systematic way from a known dataset is cross validation.
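A minimal sketch of cross validation with scikit-learn (the classifier and the built-in dataset are placeholders for your own model and data):

```python
# Minimal sketch: 5-fold cross validation on placeholder data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())      # average performance over 5 held-out folds
```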
Generally when it comes to this sort of thing you have two datasets - one large "training set" which you use to build and tune the algorithm, and a separate smaller "probe set" that you use to evaluate its performance.
#Anon has the right of things - training and what I'll call validation sets. That noted, the bits and pieces I see about developments in this field point at two things:
Bayesian Classifiers: there's something like this probably filtering your email. In short you train the algorithm to make a probabilistic decision if a particular item is part of a group or not (e.g. spam and ham).
Multiple Classifiers: this is the approach that the winning group in the Netflix challenge took, whereby it's not about optimizing one particular algorithm (e.g. Bayesian, Genetic Programming, Neural Networks, etc.) but about combining several to get a better result.
As for data sets Weka has several available. I haven't explored other libraries for data sets, but mloss.org appears to be a good resource. Finally data.gov offers a lot of sets that provide some interesting opportunities.
Training data sets and test sets are very common for k-means and other clustering algorithms, but to have something that's artificially intelligent without supervised learning (which means having a training set), you are building a "brain", so to speak, based on, for example:
In chess: all possible future states reachable from the current game state.
In most AI learning (reinforcement learning) you have a problem where the "agent" is trained by playing the game over and over. Basically, you ascribe a value to every state, then assign an expected value to each possible action in a state.
So say you have S states and A actions per state (although you might have more possible moves in one state and fewer in another); then you want to figure out the most valuable states to be in and the most valuable actions to take.
In order to figure out the value of states and their corresponding actions, you have to iterate through the game. Probabilistically, a certain sequence of states will lead to victory or defeat; basically, you learn which states lead to failure and are "bad states", and which ones are more likely to lead to victory - these are the "good" states. Each gets a mathematical value associated with it, usually an expected reward.
Reward from second-last state to a winning state: +10
Reward if entering a losing state: -10
States that give negative rewards propagate those negative rewards backwards: to the state that led to the second-to-last state, then to the state that led to the third-to-last state, and so on.
Eventually, you have a mapping of expected reward based on which state you're in, and based on which action you take. You eventually find the "optimal" sequence of steps to take. This is often referred to as an optimal policy.
Conversely, the ordinary courses of action that you step through while deriving the optimal policy are simply called policies; in Q-learning you are always following some "policy".
Usually the interesting part is how the reward is determined. Suppose I reward you for each state transition that does not lead to failure. Then the value of walking through all the states until termination is however many increments I made, i.e. however many state transitions there were.
If certain states have extremely low value, then loss is easy to avoid, because almost all bad states are avoided.
However, you don't want to discourage discovery of new, potentially more efficient paths that deviate from the one known to work. So you want to reward and punish the agent in such a way as to ensure "victory" or "keeping the pole balanced" or whatever for as long as possible, without getting stuck at local maxima and minima: if failure is too painful, no new, unexplored routes will be tried. (Although there are many approaches in addition to this one.)
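For readers who want to see the mechanics, here is a minimal sketch of the tabular Q-learning update being described; the toy environment, the +10/-10 rewards, and the epsilon-greedy exploration are illustrative choices, not part of the original answer:

```python
# Minimal sketch of tabular Q-learning with a placeholder environment.
import random

n_states, n_actions = 16, 4
Q = [[0.0] * n_actions for _ in range(n_states)]   # expected reward per (state, action)
alpha, gamma, epsilon = 0.1, 0.9, 0.1              # learning rate, discount, exploration

def step(state, action):
    """Placeholder environment: returns (next_state, reward, done)."""
    next_state = random.randrange(n_states)
    reward = 10 if next_state == n_states - 1 else (-10 if next_state == 0 else 0)
    return next_state, reward, next_state in (0, n_states - 1)

for episode in range(1000):
    state, done = random.randrange(1, n_states - 1), False
    while not done:
        # Explore occasionally, otherwise exploit the current best estimate.
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Propagate the reward backwards through the value estimates.
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

policy = [max(range(n_actions), key=lambda a: Q[s][a]) for s in range(n_states)]  # greedy policy
print(policy)
```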
So when you ask "how do you test AI algorithms", the best part is that the testing itself is how many of these "algorithms" are constructed. The algorithm is designed to test a certain course of action (policy). It's much more complicated than
"turn left every half mile"
it's more like
"turn left every half mile if I have turned right 3 times and then turned left 2 times and had a quarter in my left pocket to pay fare... etc etc"
It's very precise.
So the testing is usually how the A.I. is actually being programmed. Most models are just probabilistic representations of what is probably good and probably bad. Calculating every possible state is easier for computers (we thought!) because they can focus on one task for very long periods of time, and how much they remember is exactly how much RAM you have. However, we learn by affecting neurons in a probabilistic manner, which is why the memristor is such a great discovery - it's just like a neuron!
You should look at neural networks; they are mind-blowing. The first time I read about making a "brain" out of a matrix of fake-neuron synaptic connections, a brain that can "remember", it basically rocked my universe.
A.I. research is mostly probabilistic because we don't know how to make "thinking"; we just know how to imitate our own inner learning process of try, try again.

Resources