Bioinformatics databases with negative results?

Bioinformatics databases such as BioGRID collect a large number of interaction results for proteins and genes in different species from all sorts of publications and experiments, but such collations suffer from testing bias: not all combinations are tested, and some are tested more than once. Shouldn't they also collect negative results? Is there a resource that systematically collects both positive and negative interactions from high-throughput and low-throughput experiments?

These might help:
http://www.jnrbm.com/info/about/
http://www.jnr-eeb.org/index.php/jnr
and, as far as I know, databases of non-hitting or non-binding drug-like compounds also exist.

You should look for the 'Negatome', a database of non-interacting protein pairs.
Smialowski P, Pagel P, Wong P, Brauner B, Dunger I, Fobo G, Frishman G, Montrone C, Rattei T, Frishman D, Ruepp A. The Negatome database: a reference set of non-interacting protein pairs. Nucleic Acids Res. 2010 Jan;38(Database issue):D540-4. Epub 2009 Nov 17. PubMed PMID: 19920129; PubMed Central PMCID: PMC2808923. Available from: http://www.ncbi.nlm.nih.gov/pubmed/19920129

1) High-throughput screens published in peer-reviewed journals often include such data. Cesareni has published negative results regarding domain/peptide interactions.
2) You can contact databases like MINT/Reactome/etc. and mention that you want the negative results where they are available. Many such organizations are required by mandate to share such data with you, even if it's not on their site.
3) A good resource on this subject is here http://www.nature.com/nmeth/journal/v4/n5/full/nmeth0507-377.html

We have been working on an open-source protein interaction meta-database & prediction server (which does include data from BioGRID among other sources) that deals with both negative and positive data, as you asked...
MAYETdb does the following:
Classifies protein interactions as either "interacting" or "not interacting".
Includes data from a variety of experimental set-ups (Y2H, TAP-MS, and more) and species (yeast, human, C. elegans), including both literature-mined and databased data, e.g. BioGRID.
Also yields the false-positive and false-negative error rates of those classifications.
A random-forest machine-learning system makes predictions for previously untested interactions by learning from a wide variety of protein features, and works at rather high accuracy (~92% AUC).
It is not yet running on a server, but the source code is available and heavily commented if you are curiously impatient: https://bitbucket.org/dknightg/ppidb/src
Please ask if you have any queries :)
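For illustration, here is a minimal sketch (scikit-learn, purely synthetic data; this is not the MAYETdb code, and all names are made up) of the general random-forest idea described above: classify pairs as interacting or not from numeric pair features, and estimate AUC by cross-validation.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))            # hypothetical per-pair features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy labels: 1 = interacting, 0 = not

clf = RandomForestClassifier(n_estimators=200, random_state=0)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
print(f"cross-validated AUC: {auc:.2f}")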


Data Mining KNN Classifier

Suppose a data analyst working for an insurance company was asked to build a predictive model for predicting whether a customer will buy a mobile home insurance policy. S/he tried a kNN classifier with different numbers of neighbours (k = 1, 2, 3, 4, 5) and got the following F-scores measured on the training data: (1.0; 0.92; 0.90; 0.85; 0.82). Based on that, the analyst decided to deploy kNN with k = 1. Was it a good choice? How would you select an optimal number of neighbours in this case?
It is not a good idea to select a parameter of a prediction algorithm using the whole training set, as the result will be biased towards this particular training set and carries no information about generalization performance (i.e. performance on unseen cases). You should apply a cross-validation technique, e.g. 10-fold cross-validation, to select the best K (i.e. the K with the largest F-value) within a range.
This involves splitting your training data into 10 equal parts, retaining 9 parts for training and 1 for validation, and iterating so that each part is left out for validation once. If you use enough folds, this also lets you obtain statistics of the F-value, and you can then test whether the differences between K values are statistically significant.
See e.g. also:
http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_knn_training_crossvalidation.htm
The subtlety here, however, is that there is likely a dependency between the number of data points available for training and the K-value: if you apply cross-validation, you use only 9/10 of the training set for training... I am not sure whether any research has been done on this, or how to correct for it in the final training set. In any case, most software packages just use the above-mentioned technique, e.g. see SPSS in the link.
A solution is to use leave-one-out cross-validation (each data sample is left out once for testing); in that case you have N-1 training samples (where the original training set has N).
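For concreteness, a minimal sketch (scikit-learn, with synthetic data standing in for the insurance set) of selecting K by 10-fold cross-validation on the F-score rather than on training-set performance:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the insurance data (names and sizes illustrative).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 10-fold cross-validation over k, scored by F-measure; training-set
# accuracy would always favour k=1, as in the question.
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": [1, 2, 3, 4, 5]},
                      cv=10, scoring="f1")
search.fit(X, y)
print(search.best_params_, search.best_score_)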

How do I handle uncertainty/missing data in an Artificial Neural Network?

The context:
I'm experimenting with using a feed-forward artificial neural network to create AI for a video game, and I've run into the problem that some of my input features are dependent upon the existence or value of other input features.
The most basic, simplified example I can think of is this:
feature 1 is the number of players (range 2...5)
features 2 to ? are the score of each player (range >= 0)
The number of features needed to inform the ANN of the scores is dependent on the number of players.
The question: How can I represent this dynamic knowledge input to an ANN?
Things I've already considered:
Simply not using such features, or consolidating them into a static input,
i.e. using the sum of the players' scores instead. I seriously doubt this is applicable to my problem; it would result in the loss of too much information and the ANN would fail to perform well.
Passing in an error value (e.g. -1) or default value (e.g. 0) for non-existent input.
I'm not sure how well this would work; in theory the ANN could easily learn from this input and model the function appropriately. In practice I'm worried about the sheer number of non-existent inputs causing problems for the ANN. For example, if the range of players was 2-10 and there were only 2 players, 80% of the input data would be non-existent, which would introduce a weird bias into the ANN and result in poor performance.
Passing in the mean value over the training set in place of non-existent input.
Again, the amount of non-existent input would be a problem, and I'm worried this would introduce weird problems for discrete-valued inputs.
So, I'm asking: does anybody have any other solutions I could think about? And is there a standard or commonly used method for handling this problem?
I know it's a rather niche and complicated question for SO, but I was getting bored of the "how do I fix this code?" and "how do I do this in PHP/Javascript?" questions :P, thanks guys.
It sounds like you have multiple data sets (for each number of players) that aren't really compatible with each other. Would lessons learned from a 5-player game really apply to a 2-player game? Try simplifying the problem, such as #1, and see how the program performs. In AI, absurd simplifications can sometimes give you a lot of traction, like bag of words in spam filters.
Try thinking about some model like the following:
Say xi (e.g. x1) is one of the inputs of which a variable number can exist. You can have up to n of these (x1 to xn). Let y be the rest of the inputs.
On your first hidden layer, pass x1 and y to the first c nodes; x1, x2 and y to the next c nodes; x1, x2, x3 and y to the next c nodes; and so on. This assumes x1 and x3 can't both be active without x2; the model will have to change appropriately if that needs to be possible.
The rest of the network is a standard feed-forward network with all nodes connected to all nodes of the next layer, or however you choose.
Whenever you have w active inputs, disable all but the w-th set of c nodes (completely exclude them from training for that input set: don't include them when calculating the values of the nodes they feed into, and don't update the weights for their inputs or outputs). This allows most of the network to train, while only the applicable part of the first hidden layer trains for each number of inputs.
I suggest choosing c such that c*n (the number of nodes in the first hidden layer) is greater than or equal to the number of nodes in the second hidden layer (with c at the very least 10 for a moderately sized network; into the 100s is also fine), and I also suggest the network have at least 2 other hidden layers (so 3 in total, excluding input and output). This is not from experience, just what my intuition tells me.
This working is dependent on a certain (possibly undefinable) similarity between the different numbers of inputs, and might not work well, if at all, if this similarity doesn't exist. This also probably requires quite a bit of training data for each number of inputs.
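As a rough illustration, here is a minimal numpy sketch of the gating idea (all names and sizes are made up; a real trainer would also skip weight updates for the disabled nodes, which this forward pass only hints at by zeroing their outputs):

import numpy as np

n, c = 4, 10          # max number of variable inputs, nodes per group
d_y = 5               # number of fixed inputs y
rng = np.random.default_rng(0)

W = rng.normal(size=(n * c, n + d_y)) * 0.1   # first-hidden-layer weights
b = np.zeros(n * c)

def first_layer(x, y):
    """x: variable-length score vector (length w <= n); y: fixed inputs."""
    w = len(x)
    x_pad = np.zeros(n)
    x_pad[:w] = x                        # zero-pad the unused inputs
    h = np.tanh(W @ np.concatenate([x_pad, y]) + b)
    mask = np.zeros(n * c)
    mask[(w - 1) * c : w * c] = 1.0      # enable only the w-th group of c nodes
    return h * mask                      # disabled groups contribute nothing

h = first_layer(np.array([3.0, 7.0]), np.zeros(d_y))   # a 2-player example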
If you try it, let me / us know if it works.
If you're interested in Artificial Intelligence discussions, I suggest joining some LinkedIn group dedicated to it; there are some that are quite active and have interesting discussions. There doesn't seem to be much happening on Stack Overflow when it comes to Artificial Intelligence, or maybe we should just work to change that, or both.
UPDATE:
Here is a list of the names of a few decent Artificial Intelligence LinkedIn groups (unless they changed their policies recently, it should be easy enough to join):
'Artificial Intelligence Researchers, Faculty + Professionals'
'Artificial Intelligence Applications'
'Artificial Neural Networks'
'AGI — Artificial General Intelligence'
'Applied Artificial Intelligence' (not too much going on at the moment, and still dealing with some spam, but it is getting better)
'Text Analytics' (if you're interested in that)

Can a hypergraph represent a nondeterministic Turing machine?

Does anyone know of any papers, texts, or other documents that discuss using a hypergraph to implement or represent a nondeterministic Turing machine? Are they in fact equivalent?
I'm pretty sure that a hypergraph is able to properly and completely represent the state transitions of a nondeterministic Turing machine, for instance. But I've so far been unable to find anything in print that can verify this. This seems to me like such an obvious relationship, however the fact that I'm not finding prior art makes me think I'm on the wrong track. (It could also be the case that what I'm finding is just not accessible enough for me to understand what it's saying.) ;-)
Why I ask: I'm working on an open-source package that does distributed data storage and distributed computation in a peer-to-peer network. I'm looking for the most primitive data structure that might support the functionality needed. So far, a distributed hypergraph looks promising. My reasoning is that, if a hypergraph can support something as general as a nondeterministic Turing machine, then it should be able to support a higher-level Turing-complete DSL. (There are other reasons the "nondeterministic" piece might be valuable to me as well, having to do with version control of the distributed data and/or computation results. Trying to avoid a dissertation here though.)
Partial answers:
The OpenCog folks had a tantalizing discussion of how hypergraphs fit into different computing models; this apparently was related to the development of the HypergraphDB package: http://markmail.org/message/5oiv3qmoexvo4v5j
Over on MathOverflow, there's a question discussing what hypergraphs can do; no mention of Turing yet, but I'm about to add it: https://mathoverflow.net/questions/13750/what-are-the-applications-of-hypergraphs
If a hypergraph can represent a nondeterministic Turing machine, then I'd think a hypergraph with weighted edges would be equivalent to a probabilistic Turing machine. http://en.wikipedia.org/wiki/Probabilistic_Turing_machine
A hypergraph is just a graph G=(V,E) where V is the set of vertices (nodes) and E is a subset of the powerset of V. It is a data structure.
So a common graph is just a hypergraph of rank 2 (i.e. each set in E contains exactly two vertices). A directed hypergraph uses pairs (X,Y) as the edges, where X and Y are sets.
If you want to model a Turing machine, then you need to model the 'tape'. Do you want the tape 'embedded' in the graph? I think you might have more luck thinking about the Church-Turing thesis (Alonzo Church, lambda calculus). The lambda calculus is a form of rewriting system, and there is most certainly a branch that uses graph rewriting (and hypergraphs).
Of course the transitions can be modelled as a graph (I'm not sure what you had in mind, but the straightforward approach doesn't really help).
If you were modelling it normally, you would probably create a dictionary/hashmap with tuples (State, Symbol) as keys and (State, Rewrite, Left|Right) as values, e.g.
states = {1,2,3}
symbols = {a,b,c}
moves = L, R
delta = { (1,a) -> (1,b,R),
          (1,b) -> (2,c,L),
          ...
        }
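As a hedged sketch (plain Python, illustrative states and symbols), the same dictionary idea extends to a nondeterministic machine by mapping each (state, symbol) key to a set of possible outcomes:

L, R, BLANK = "L", "R", "_"

delta = {
    (1, "a"): {(1, "b", R)},
    (1, "b"): {(2, "c", L), (3, "a", R)},   # two outcomes -> nondeterminism
}

def step(state, tape, head):
    """Return all successor configurations of one configuration."""
    successors = []
    for new_state, write, move in delta.get((state, tape[head]), ()):
        new_tape = tape[:head] + write + tape[head + 1:]
        new_head = head + (1 if move == R else -1)
        if new_head < 0:                     # pad the tape with blanks as needed
            new_tape, new_head = BLANK + new_tape, 0
        elif new_head == len(new_tape):
            new_tape += BLANK
        successors.append((new_state, new_tape, new_head))
    return successors

print(step(1, "ab", 1))   # both branches of the nondeterministic move on 'b'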
So if you wanted a graph, you would first need V = states U symbols U moves.
Clearly these need to be disjoint sets,
since {1,a} -> {1,b,R} is by definition equal to {a,1} -> {b,R,1}, etc.
states = {1,2,3}
symbols = {a,b,c}
moves = L, R
V = {1,2,3,a,b,c,L,R}
E = { ({1,a},{1,b,R}),
      ({b,1},{L,2,c}),
      ...
    }
turing-hypergraph = (V,E)
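In Python, the same encoding might look like the following sketch (frozensets make the unordered endpoints of each edge explicit; the particular transitions are just the examples above):

states  = {1, 2, 3}
symbols = {"a", "b", "c"}
moves   = {"L", "R"}
V = states | symbols | moves            # these must be disjoint sets

E = {
    (frozenset({1, "a"}), frozenset({1, "b", "R"})),
    (frozenset({1, "b"}), frozenset({2, "c", "L"})),
}
turing_hypergraph = (V, E)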
As I mentioned earlier, look up graph rewriting or term rewriting.

Help--100% accuracy with LibSVM?

Nominally a good problem to have, but I'm pretty sure something funny is going on...
As context, I'm working on a problem in the facial expression/recognition space, so getting 100% accuracy seems incredibly implausible (not that it would be plausible in most applications...). I'm guessing there is either some consistent bias in the data set that is making it overly easy for the SVM to pull out the answer, or, more likely, I've done something wrong on the SVM side.
I'm looking for suggestions to help understand what is going on--is it me (=my usage of LibSVM)? Or is it the data?
The details:
About ~2500 labeled data vectors/instances (transformed video frames of individuals--<20 individual persons total), binary classification problem. ~900 features/instance. Unbalanced data set at about a 1:4 ratio.
Ran subset.py to separate the data into test (500 instances) and train (remaining).
Ran "svm-train -t 0 ". (Note: apparently no need for '-w1 1 -w-1 4'...)
Ran svm-predict on the test file. Accuracy=100%!
Things tried:
Checked about 10 times over that I'm not training & testing on the same data files, through some inadvertent command-line argument error
re-ran subset.py (even with -s 1) multiple times and ran train/test on multiple different data sets (in case I randomly hit upon the most magical train/test partition)
ran a simple diff-like check to confirm that the test file is not a subset of the training data
svm-scale on the data has no effect on accuracy (accuracy = 100%). (Although the number of support vectors does drop from nSV=127, nBSV=64 to nSV=72, nBSV=0.)
(weird) using the default RBF kernel (instead of linear, i.e. removing '-t 0') results in the accuracy going to garbage (?!)
(sanity check) running svm-predict using a model trained on a scaled data set against an unscaled data set results in accuracy = 80% (i.e., it always guesses the dominant class). This is strictly a sanity check to make sure that somehow svm-predict is nominally acting right on my machine.
Tentative conclusion?:
Something about the data is whacked: somehow, within the data set, there is a subtle, experimenter-driven effect that the SVM is picking up on.
(This doesn't, on first pass, explain why the RBF kernel gives garbage results, however.)
Would greatly appreciate any suggestions on a) how to fix my usage of LibSVM (if that is actually the problem) or b) determine what subtle experimenter-bias in the data LibSVM is picking up on.
Two other ideas:
Make sure you're not training and testing on the same data. This sounds kind of dumb, but in computer vision applications you should take care: make sure you're not repeating data (say, two frames of the same video falling in different folds), that you're not training and testing on the same individual, etc. It is more subtle than it sounds.
Make sure you search over the gamma and C parameters for the RBF kernel. There are good theoretical (asymptotic) results showing that a linear classifier is just a degenerate RBF classifier, so you should just look for a good (C, gamma) pair.
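A hedged sketch of that second idea (scikit-learn rather than the LibSVM command-line tools, with synthetic data standing in for the ~2500x900 feature matrix):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# Cross-validated search over the (C, gamma) grid for the RBF kernel.
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10, 100],
                     "gamma": [1e-4, 1e-3, 1e-2, 1e-1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)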
Notwithstanding that the devil is in the details, here are three simple tests you could try:
Quickie (~2 minutes): Run the data through a decision tree algorithm. This is available in Matlab via classregtree, or you can load into R and use rpart. This could tell you if one or just a few features happen to give a perfect separation.
Not-so-quickie (~10-60 minutes, depending on your infrastructure): Iteratively split the features (i.e. from 900 to 2 sets of 450), train, and test. If one of the subsets gives you perfect classification, split it again. It would take fewer than 10 such splits to find out where the problem variables are. If it happens to "break" with many variables remaining (or even in the first split), select a different random subset of features, shave off fewer variables at a time, etc. It can't possibly need all 900 to split the data.
Deeper analysis (minutes to several hours): try permutations of labels. If you can permute all of them and still get perfect separation, you have some problem in your train/test setup. If you select increasingly larger subsets to permute (or, going in the other direction, leave increasingly larger subsets static), you can see where you begin to lose separability. Alternatively, consider decreasing your training set size; if you get separability even with a very small training set, something is weird.
Method #1 is fast and should be insightful. There are some other methods I could recommend, but #1 and #2 are easy, and it would be odd if they gave no insights.
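As an illustration of test #1 (scikit-learn instead of Matlab/R, with synthetic data of the same shape as described in the question): fit a shallow decision tree and inspect which features it uses. Perfect training accuracy from one or two features would point at a leaking variable.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2500, n_features=900,
                           n_informative=5, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print("training accuracy:", tree.score(X, y))
suspicious = np.argsort(tree.feature_importances_)[::-1][:5]
print("most-used features:", suspicious)    # candidates for leaked variables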

What classifiers to use for deciding if two datasets depict the same individual?

Suppose I have pictures of faces of a set of individuals. The question I'm trying to answer is: "do these two pictures represent the same individual"?
As usual, I have a training set containing several pictures for a number of individuals. The individuals and pictures the algorithm will have to process are of course not in the training set.
My question is not about image processing algorithms or particular features I should use, but on the issue of classification. I don't see how traditional classifier algorithms such as SVM or Adaboost can be used in this context. How should I use them? Should I use other classifiers? Which ones?
NB: my real application is not faces (I don't want to disclose it), but it's close enough.
Note: the training dataset isn't enormous, in the low thousands at best. Each data item is pretty big though (a few megabytes), even if it doesn't hold a lot of real information.
You should probably look at the following methods:
P. Jonathon Phillips. Support Vector Machines Applied to Face Recognition. NIPS 1998: 803-809.
Haibin Ling, Stefano Soatto, Narayanan Ramanathan, and David W. Jacobs. A Study of Face Recognition as People Age. IEEE International Conference on Computer Vision (ICCV), 2007.
These papers describe applying SVMs to same-person/different-person problems like the one you describe. If the alignment of the features (eyes, nose, mouth) is good, these methods work very nicely.
How big is your dataset?
I would start this problem by coming up with some kind of distance metric (say, Euclidean) that characterizes differences between images, such as differences in color, shape, etc., or local differences. Two images representing the same individual would have a small distance compared to images representing different individuals, though this would depend heavily on the type of data set you are working with.
Forgive me for stating the obvious, but why not use any supervised classifier (SVM, GMM, k-NN, etc.), get one label for each test sample (e.g., face, voice, text, etc.), and then see if the two labels match?
Otherwise, you could perform a binary hypothesis test. H0 = two samples do not match. H1 = two samples match. For two test samples, x1 and x2, compute a distance, d(x1, x2). Choose H1 if d(x1, x2) < epsilon and H0 otherwise. Adjusting epsilon will adjust your probability of detection and probability of false alarm. Your application would dictate which epsilon is best; for example, maybe you can tolerate misses but cannot tolerate false alarms, or vice versa. This is called Neyman-Pearson hypothesis testing.
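A minimal numpy sketch of that hypothesis test (all data here is synthetic and illustrative): fix the false-alarm rate and choose epsilon accordingly.

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical feature vectors for pairs of samples.
same_a = rng.normal(size=(200, 64))
same_b = same_a + rng.normal(scale=0.5, size=(200, 64))   # same individual: small perturbation
diff_b = rng.normal(size=(200, 64))                       # different individual: independent

d_same = np.linalg.norm(same_a - same_b, axis=1)
d_diff = np.linalg.norm(same_a - diff_b, axis=1)

# Neyman-Pearson style: cap the false-alarm probability (declaring H1 for a
# non-matching pair) at 5% and take the epsilon that achieves it.
epsilon = np.percentile(d_diff, 5)       # 5% of different-pair distances fall below this
p_detect = np.mean(d_same < epsilon)     # fraction of true matches detected
print(f"epsilon={epsilon:.2f}, P_detect={p_detect:.2f}, P_false_alarm=0.05")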
