Nominal Attributes in LibSVM - artificial-intelligence

When creating a libsvm training file, how do you differentiate between a nominal attribute verses a numeric attribute? I'm trying to encode certain nominal attributes as integers, but I want to ensure libsvm doesn't misinterpret them as numeric values. Unfortunately, libsvm's site seems to have very little documentation. Pentaho's docs seem to imply libsvm makes this distinction, but I'm still not clear how it's made.

Don't do this I'm trying to encode certain nominal attributes as integers.
Rather, use a separate binary feature for each value of each nominal attribute.
The way SVMs are formulated, all attributes/features are numeric and class labels are nominal. Nominal attributes are essentially faked by using mutually exclusive binary features.

I think you cant do that in libsvm, weka or SVM-light. One approach that you could use is to use something like a decision tree for your nominal attributes and svm or any distance based classifier for your numeric attributes and then combine the results. I hope it helps.

Related

Comparing the two feature sets

I am working on a classification of two feature sets derived from a dataset. We first obtain two feature matrices derived from two feature extraction methods. Now, I need to compare them. However, the recognition accuracy for two feature sets, reaches almost the same recognition accuracy (using 10-fold cross validation by SVM). My question is:
Is there a way to design a meaningful experiment to show the difference between the two methods? What are your suggestions?
Note: I already saw the similar questions in stackoverflow, however, I am looking for another approach.
You can
Perform a dimensionality reduction on the features in their respective spaces. This will allow you to see differences in the distribution of data points. Kudos if you apply a kernel, being the one used by the SVM (linear kernel otherwise).
Do distribution testing on the features to see if they differ much.
Augment the predictions into an output space, and see distance between the vectors.

Format for storing EXIF metadata in database

I'm working on an application for which I need to be able to store EXIF metadata in a relational database. In the future I'd like to also support XMP and IPTC metadata, but at the moment my focus is on EXIF.
There are a few questions on Stack Overflow about what the table structure should look like when storing EXIF metadata. However, none of them really address my concern. The problem I have is that different EXIF tags have values in different formats, and there isn't really one column type which conveniently stores them all.
The most common type is a "rational" which is an array of two four-byte integers representing a fraction. But there are also non-fractional short and long integers, ASCII strings, byte arrays, and "undefined" (an 8-bit type which must be interpreted according to a priori knowledge of the specific tag.) I'd like to support all of these types, and I want to do so in a convenient, efficient, lossless (i.e. without converting the rationals to floats), extensible and searchable manner.
Here's what I've considered so far:
My current solution is to store everything as a string. This makes it pretty easy to store all of the different types, and is also convenient for searching and debugging. However, it's kind of clunky and inefficient because when I want to actually use the data, I have to do a bunch of string manipulation to convert the rational values into their fractional equivalents, e.g. fraction = float(value.split('/')[0]) / float(value.split('/')[1]). (It's not actually a messy one-liner like that in my real code, but this demonstrates the problem.)
I could grab the raw EXIF bytes for each value from the file and store them in a blob column, but then I'd have to reinterpret the raw bytes every time. This could be marginally more CPU-efficient than the string solution, but it's much, much worse in every other way - on the whole, not worth it.
I could have a different table for each different EXIF datatype. Using this pattern I can maintain my foreign key relationships while storing my values in several different tables. However, this will make my most common query, which is to select all EXIF metadata for a given photo, kind of nasty. It will also become unwieldy very quickly when I add support for other metadata formats.
I'm not a database expert by any means, so there some pattern or magic union-style column type I'm missing that can make this problem go away? Or am I stuck picking my poison from among the three options above?
This is probably a very cheap solution, but I would personally just store the json or something like that within the database.
There is a cool way to extract EXIF data and parse it to json.
Here is the link: Img2JSON
I hope this kind of helps you!

Regression Model for categorical data

I have very large dataset in csv file (1,700,000 raws and 300 sparse features).
- It has a lot of missing values.
- the data varies between numeric and categoral values.
- the dependant variable (the class) is binary (either 1 or 0).
- the data is highly skewed, the number of positive response is low.
Now what is required from me is to apply regression model and any other machine learning algorithm on this data.
I'm new on this and I need help..
-how to deal with categoral data in case of regression model? and does the missing values affects too much on it?
- what is the best prediction model i can try for large, sparse, skewed data like this?
- what program u advice me to work with? I tried Weka but it can't even open that much of data (memory failure). I know that matlab can open either numeric csv or categories csv not mixed, beside the missing values has to be imputed to allow it to open the file. I know a little bit of R.
I'm trying to manipulate the data using excel, access and perl script. and that's really hard with that amount of data. excel can't open more than almost 1M record and access can't open more than 255 columns. any suggestion.
Thank you for help in advance
First of all, you are talking about classification, not regression - classification allows to predict value from the fixed set (e.g. 0 or 1) while regression produces real numeric output (e.g. 0, 0.5, 10.1543, etc.). Also don't be confused with so called logistic regression - it is classifier too, and its name just shows that it is based on linear regression.
To process such a large amount of data you need inductive (updatable) model. In particular, in Weka there's a number of such algorithms under classification section (e.g. Naive Bayes Updatable, Neutral Networks Updatable and others). With inductive model you will be able to load data portion by portion and update model in appropriate way (for Weka see Knowledge Flow interface for details of how to use it easier).
Some classifiers may work with categorical data, but I can't remember any updatable from them, so most probably you still need to transform categorical data to numeric. Standard solution here is to use indicator attributes, i.e. substitute every categorical attribute with several binary indicator. E.g. if you have attribute day-of-week with 7 possible values you may substitute it with 7 binary attributes - Sunday, Monday, etc. Of course, in each particular instance only one of 7 attributes may hold value 1 and all others have to be 0.
Importance of missing values depend on the nature of your data. Sometimes it worth to replace them with some neutral value beforehand, sometimes classifier implementation does it itself (check manuals for an algorithm for details).
And, finally, for highly skewed data use F1 (or just Precision / Recall) measure instead of accuracy.

Is the HTM cortical learning algorithm defined by Numenta's paper restricted by Euclidean geometry?

Specifically, their most recent implementation.
http://www.numenta.com/htm-overview/htm-algorithms.php
Essentially, I'm asking whether non-euclidean relationships, or relationships in patterns that exceed the dimensionality of the inputs, can be effectively inferred by the algorithm in its present state?
HTM uses Euclidean geometry to determine "neighborship" when analyzing patterns. Consistently framed input causes the algorithm to exhibit predictive behavior, and sequence length is practically unlimited. This algorithm learns very well - but I'm wondering whether it has the capacity to infer nonlinear attributes from its input data.
For example, if you input the entire set of texts from Project Gutenberg, it's going to pick up on the set of probabilistic rules that comprise English spelling, grammar, and readily apparent features from the subject matter, such as gender associations with words, and so forth. These are first level "linear" relations, and can be easily defined with probabilities in a logical network.
A nonlinear relation would be an association of assumptions and implications, such as "Time flies like an arrow, fruit flies like a banana." If correctly framed, the ambiguity of the sentence causes a predictive interpretation of the sentence to generate many possible meanings.
If the algorithm is capable of "understanding" nonlinear relations, then it would be able to process the first phrase and correctly identify that "Time flies" is talking about time doing something, and "fruit flies" are a type of bug.
The answer to the question is probably a simple one to find, but I can't decide either way. Does mapping down the input into a uniform, 2d, Euclidean plane preclude the association of nonlinear attributes of the data?
If it doesn't prevent nonlinear associations, my assumption would then be that you could simply vary the resolution, repetition, and other input attributes to automate the discovery of nonlinear relations - in effect, adding a "think harder" process to the algorithm.
From what I understand of HTM's, the structure of layers and columns mimics the structure of the neocortex. See appendix B here: http://www.numenta.com/htm-overview/education/HTM_CorticalLearningAlgorithms.pdf
So the short answer would be that since the brain can understand non-linear phenomenon with this structure, so can an HTM.
Initial, instantaneous sensory input is indeed mapped to 2D regions within an HTM. This does not limit HTM's to dealing with 2D representations any more than a one dimensional string of bits is limited to representing only one dimensional things. It's just a way of encoding stuff so that sparse distributed representations can be formed and their efficiencies can be taken advantage of.
To answer your question about Project Gutenberg, I don't think an HTM will really understand language without first understanding the physical world on which language is based and creates symbols for. That said, this is a very interesting sequence for an HTM, since predictions are only made in one direction, and in a way the understanding of what's happening to the fruit goes backwards. i.e. I see the pattern 'flies like a' and assume the phrase applies to the fruit the same way it did to time. HTM's do group subsequent input (words in this case) together at higher levels, so if you used Fuzzy Grouping (perhaps) as Davide Maltoni has shown to be effective, the two halves of the sentence could be grouped together into the same high level representation and feedback could be sent down linking the two specific sentences. Numenta, to my knowledge has not done too much with feedback messages yet, but it's definitely part of the theory.
The software which runs the HTM is called NuPIC (Numenta Platform for Intelligent Computing). A NuPIC region (representing a region of neocortex) can be configured to either use topology or not, depending on the type of data it's receiving.
If you use topology, the usual setup maps each column to a set of inputs which is centred on the corresponding position in the input space (the connections will be selected randomly according to a probability distribution which favours the centre). The spatial pattern recognising component of NuPIC, known as the Spatial Pooler (SP), will then learn to recognise and represent localised topological features in the data.
There is absolutely no restriction on the "linearity" of the input data which NuPIC can learn. NuPIC can learn sequences of spatial patterns in extremely high-dimensional spaces, and is limited only by the presence (or lack of) spatial and temporal structure in the data.
To answer the specific part of your question, yes, NuPIC can learn non-Euclidean and non-linear relationships, because NuPIC is not, and cannot be modelled by, a linear system. On the other hand, it seems logically impossible to infer relationships of a dimensionality which exceeds that of the data.
The best place to find out about HTM and NuPIC, its Open Source implementation, is at NuPIC's community website (and mailing list).
Yes, It can do non-linear. Basically it is multilayer. And all multilayer neural networks can infer non linear relationships. And I think the neighborship is calculated locally. If it is calcualted locally then globally it can be piece wise non linear for example look at Local Linear Embedding.
Yes HTM uses euclidean geometry to connect synapses, but this is only because it is mimicking a biological system that sends out dendrites and creates connections to other nearby cells that have strong activation at that point in time.
The Cortical Learning Algorithm (CLA) is very good at predicting sequences, so it would be good at determining "Time flies like an arrow, fruit flies like a" and predict "banana" if it has encountered this sequence before or something close to it. I don't think it could infer that a fruit fly is a type of insect unless you trained it on that sequence. Thus the T for Temporal. HTMs are sequence association compressors and retrievers (a form of memory). To get the pattern out of the HTM you play in a sequence and it will match the strongest representation it has encountered to date and predict the next bits of the sequence. It seems to be very good at this and the main application for HTMs right now are predicting sequences and anomalies out of streams of data.
To get more complex representations and more abstraction you would cascade a trained HTMs outputs to another HTMs inputs along with some other new sequence based input to correlate to. I suppose you could wire in some feedback and do some other tricks to combine multiple HTMs, but you would need lots of training on primitives first, just like a baby does, before you will ever get something as sophisticated as associating concepts based on syntax of the written word.
ok guys, dont get silly, htms just copy data into them, if you want a concept, its going to be a group of the data, and then you can have motor depend on the relation, and then it all works.
our cortex, is probably way better, and actually generates new images, but a computer cortex WONT, but as it happens, it doesnt matter, and its very very useful already.
but drawing concepts from a data pool, is tricky, the easiest way to do it is by recording an invarient combination of its senses, and when it comes up, associate everything else to it, this will give you organism or animal like intelligence.
drawing harder relations, is what humans do, and its ad hoc logic, imagine a set explaining the most ad hoc relation, and then it slowly gets more and more specific, until it gets to exact motor programs... and all knowledge you have is controlling your motor, and making relations that trigger pathways in the cortex, and tell it where to go, from the blast search that checks all motor, and finds the most successful trigger.
woah that was a mouthful, but watch out dummies, you wont get no concepts from a predictive assimilator, which is what htm is, unless you work out how people draw relations in the data pool, like a machine, and if you do that, its like a program thats programming itself.
no shit.

Adaptive Neuro-Fuzzy Inference System (ANFIS)

Do you have an example or an explanation of ANFIS (Adaptive Neuro-Fuzzy Inference System), I am reading that this could be applied to classify some diseases, What do you think about it?
Usually in order to develop a fuzzy system you have to determine the if-then rules, suitable membership functions, and their parameters. This is not always a trivial task, especially the development of correct if-then rules may be time consuming as we first have to "extract" the expert knowledge somehow.
This is where ANFIS comes into play: Under certain circumstances it can automatically determine suitable parameters for the membership functions. This is the case in particular when we already have a set of input and related output variables and values. Like in an artificial neural network the ANFIS system is able to adapt its nodes and connections between them "automatically".
To your question: you could of course create an ANFIS system for your desease classification, as long as you already have input and output data for system training available. But its not necessarily tied to such systems, you can see ANFIS more an approach usable under the mentioned circumstances, than a tool for a specific problem. It all depends on the requirements for the system you want to create, as well as the known (external) preconditions...
Hope that helps!
As Matthias said ANFIS is not mapped to a particular problem, you can use it on the basis of problem requirement. But where to use ANFIS: You can use it with any problem where something is ambiguous.
Actually this is the property of FIS(Fuzzy Inference System). Adaptive come in role as Matthias explained.
For ex. took famous classification problem, classifying a input to any class is not always perfectly determined, it somewhat ambiguous. So there using ANFIS may give better results then other classification algorithms depending upon whether you are able to model the system correctly or not using ANFIS.
But using ANFIS is computationally expensive as compared to other non-fuzzy approches. As to make FIS to perfect model your problem you will add AN part to it. This only make membership function selection adaptive. What about if-then rules. For that you have to do unsupervised rule selection from the complete possible rule base(this is basically a kind of unsupervised clustering problem, where you are trying to group all the rules whose effect would be same).
So far I have found a university 'Monash' that explains (based on the guide of Matlabs's Fuzzy Logic Toolbox) ANFIS.
The fuzzy inference system that we have considered is a model that maps:
input characteristics to input
membership functions
input
membership function to rules
rules
to a set of output characteristics
output characteristics to output
membership functions
the output membership function to a
single-valued output, or, a decision
associated with the output.
Yes it can be used for Diseases Classification.
Since the idea of ANFIS is combine fuzzy system in architecture of ANN. In this case, ANFIS have two main benefit.
first, you can use fuzzy variable which is support for Linguistic variable and it's fit for Diseases's symptoms that are commonly used as system's input (example of input >> pain levels : low, mid, high).
Second, since the architecture is mapped to ANN layers, ANFIS can do training process which aims to create more accurate result (ex : use Backpropagation method).

Resources