Discriminant Correspondence Analysis - Nominal Data as Input or Integer Data as Input?

Here's where I'm getting the definition of Discriminant Correspondence Analysis from:
I'd like to use Discriminant Correspondence Analysis using functions from two different packages.
TExPosition accepts nominal data as input, which meets the definition.
But ade4 only accepts integer data, which does not meet the definition.
library(TExPosition) #https://cran.r-project.org/web/packages/TExPosition/
Usage: dica.res <- tepDICA(dica.wine$data,DESIGN=dica.wine$design,make_design_nominal=FALSE)
head(dica.wine$data) #nominal transformed to 0's and 1's.
library(ade4) #https://cran.r-project.org/web/packages/ade4/ade4.pdf
Usage: discrimin.coa(perthi02$tab, perthi02$cla, scan = FALSE)
head(perthi02$tab) #Integers, possibly counts, which do not look like they were derived from nominal data.
Can anyone explain how I could get ade4 to accept nominal data as input?


Error converting to stm after tf-idf weighting

For several dfms, I have no problem converting them to stm/lda/topicmodels format. However, if I weight the dfms with dfm_tfidf() before converting, I get the following error:
Error in convert.dfm(users_dfm, to = "stm") : cannot convert a
non-count dfm to a topic model format
Any idea why this might be? I've tried different weighting schemes for both term and document frequency (to try and make the weighted dfm a 'count' dfm), but I keep getting the error.
So, this works:
users_dfm <- dfm(users_tokens)
users_stm <- convert(users_dfm, to = "stm")
But this doesn't:
users_dfm <- dfm(users_tokens)
weighted_dfm <- dfm_tfidf(users_dfm)
users_stm <- convert(weighted_dfm, to = "stm")
This is because topic models require counts as inputs: that is the nature of the statistical distribution assumed by the latent Dirichlet allocation model. tf-idf weighting turns the dfm into a matrix of non-integer values, which are not valid inputs for stm (or any other topic model).
So in short, don't weight your dfm before using it with a topic model.
You should also note that conversion of a dfm to the stm format is not strictly required, since stm::stm() can take a dfm object directly as an input.
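To illustrate why a weighted matrix no longer counts as "counts", here is a minimal Python sketch (not quanteda code). It assumes the tf-idf scheme is raw count times log10(N / document frequency); the helper names tfidf and is_count_matrix are made up for this example. Other weighting schemes differ in detail but also tend to produce fractional values.

```python
import math

def tfidf(counts):
    """Weight a document-term count matrix (list of rows) by tf-idf.

    Assumption: idf = log10(n_docs / doc_freq) and tf = raw count.
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents that contain each term.
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log10(n_docs / df) if df else 0.0 for df in doc_freq]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in counts]

def is_count_matrix(matrix):
    """True if every cell is a non-negative whole number."""
    return all(v >= 0 and float(v).is_integer() for row in matrix for v in row)

counts = [[2, 0, 1],
          [0, 3, 1],
          [1, 1, 1]]
weighted = tfidf(counts)

print(is_count_matrix(counts))    # the raw matrix holds whole-number counts
print(is_count_matrix(weighted))  # the weighted matrix holds fractions
```

A term that occurs in every document (the third column here) gets an idf of zero, so its column vanishes entirely after weighting.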

fit a skewed t-distribution or normal distribution in Matlab

I have a dataset that I know for sure has some sort of skewness (and potentially excess kurtosis). I would like to fit a distribution to this dataset, and I thought the simplest choice would be a skewed Student's t-distribution or a skewed normal distribution. What sort of distribution in Matlab can I fit the data to?
There may be no pearspdf function in Matlab, because the seven distribution types of the Pearson distribution mostly correspond to or are based on extant functions for other distributions:
Type 0: Normal distribution, normpdf
Type I: Beta distribution, betapdf
Type II: Symmetric beta distribution, betapdf
Type III: Gamma distribution, gampdf
Type IV: Not related to any standard distribution
Type V: Inverse gamma distribution, calculated via gampdf
Type VI: F-distribution, fpdf
Type VII: Student's t-distribution/t location scale distribution, tpdf/prob.tLocationScaleDistribution
The summary above simplifies a lot of course and it would be useful to have one function that calculates your PDF according to the system, like pearsrnd does for random variate generation. Luckily someone has already done that and posted it on the MathWorks File Exchange: pearspdf.
You can also use the second output argument of the pearsrnd function, which returns the type of the distribution in the Pearson system (see this page for examples). If, for example, it suggests that your data is Type III, you could attempt to fit it directly using gamfit to estimate the parameter values. gamfit, and other similarly named functions, are based on maximum-likelihood estimation (MLE).
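The Pearson system is selected from the first four moments (mean, standard deviation, skewness, kurtosis), which is what pearsrnd expects as input. As a rough illustration of those quantities, here is a small Python sketch rather than Matlab code. It assumes the plain population (biased) estimators and non-excess kurtosis, so a normal distribution would give a kurtosis near 3; the function name sample_moments is made up for this example.

```python
import math

def sample_moments(data):
    """Return (mean, std, skewness, kurtosis) using population estimators.

    Kurtosis is not excess kurtosis: a normal sample gives roughly 3.
    """
    n = len(data)
    mean = sum(data) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return mean, math.sqrt(m2), m3 / m2 ** 1.5, m4 / m2 ** 2

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 10]

print(sample_moments(symmetric))
print(sample_moments(right_skewed))
```

A clearly non-zero skewness, or a kurtosis far from 3, is the kind of evidence that points away from a plain normal fit.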

ELKI how to increase the precision?

I am using the ELKI MiniGUI to cluster my data. I have some 1300 GPS data points which I would like to cluster (with DBSCAN and OPTICS). As the input file for dbc.in I am using a csv file with only 2 columns (X,Y). The problem is that my X,Y (projected) coordinates are very precise, up to 6 decimal places, but after running the clustering algorithm I get lower precision (up to 3 decimal places). How can I increase the precision of the output points?
Also, when it generates the clusters, it automatically assigns some virtual IDs which do not correspond to my actual point IDs (ID, X, Y). However, the ID is not given in the input csv, which comprises only two columns (X,Y).
ELKI relies on double for representing numbers. If you need a higher precision, you will have to implement your own parser and output modules (it's easy though, as we have a highly modular architecture).
Default output serialization to text is handled by Java. Precision is therefore what you get from Java by default. This should be 15-16 digits of precision, if you are using DoubleVector, and 7-8 digits if you are using FloatVector.
A quick check with groovysh:
new DoubleVector([12345.678901234567890, 3456.109453] as double[]);
===> 12345.678901234567 3456.109453
new FloatVector([12345.678901234567890, 3456.109453] as float[]);
===> 12345.679 3456.1094
yields only the loss to be expected from double and float precision.
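The same effect can be reproduced outside ELKI. The sketch below is plain Python, not ELKI code: it round-trips a value through a 4-byte IEEE 754 float using the standard struct module, which is roughly what storing a coordinate in a float-based vector does. The helper name as_float32 is made up for this example.

```python
import struct

def as_float32(x):
    """Round-trip x through a 4-byte IEEE 754 single-precision float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

x = 12345.678901234567   # a Python float is a 64-bit double
single = as_float32(x)

print(repr(x))        # the double keeps about 15-16 significant digits
print(repr(single))   # the single-precision copy keeps about 7-8
print(abs(single - x))
```

For projected coordinates with many digits before the decimal point, single precision leaves very few digits after it, which would match a drop from 6 to about 3 decimal places.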
The best way to get row labels is to... add row labels to your data.
Regarding your add-on question in the comments: the default parser will treat a text row at the beginning of your file as column labels. So just put "X Y" into the first line of your file.
A reasonable input format will therefore be:
X Y Label
1 2 Point7
3 4 "Point 8"
The following are not-so-good ideas:
5 6 123shouldwork
7 8 don't do this: 3 parser will retain the 3
Labels should be non-numeric, so that the parser treats them as labels automatically. Otherwise, you have to set the appropriate parameter.
DBIDs are meant for internal handling. Maybe we should not write them to the output at all. FixedDBIDFilter is a hackish work-around; it is meant to give reproducible hashing when you use algorithms that need ID-based hashing and do multiple runs in the MiniGUI, because across multiple runs DBIDs keep being enumerated continuously.
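If the existing file has only X and Y, a label column can be added with a few lines of scripting before loading it. This is a minimal Python sketch, not part of ELKI: the function name add_labels and the "Point" prefix are made up, and it assumes the input rows are separated by commas or whitespace. The coordinate text is copied as-is, so no digits are lost in the process.

```python
import re

def add_labels(rows, prefix="Point"):
    """Append a non-numeric label to each 'X,Y' or 'X Y' row.

    Coordinates are kept as strings, so their precision is untouched.
    """
    labelled = ["X Y Label"]  # header row, following the format shown above
    for i, row in enumerate(rows, start=1):
        x, y = re.split(r"[,\s]+", row.strip())
        labelled.append(f"{x} {y} {prefix}{i}")
    return labelled

rows = ["512345.123456,5412345.654321", "512346.000001 5412346.000002"]
out = add_labels(rows)
print("\n".join(out))
```

Because every label starts with a letter, it should not be mistaken for a third numeric column.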

nominal-value inputs for Neural Network

I have a set of training data, each item in this set consists of 4 numerical values and 1 nominal-value which is the name of the method that these values have been calculated with. (There are 8 methods)
I'm training a Neural Network with these. To get rid of the nominal-value I simply assigned a value from 1 to 8 to each method and used one input to pass it to Neural Network and 4 other inputs for numerical-values. It is sort of working, but the result is not as amazing as I want.
So my question is could it be because of this simple assignment of numbers to nominal-values? or maybe it is because of mixing two different categories of inputs which are not really at the same level (numbers and method types)
As a general note, a better way for coding nominal values would be a binary vector. In your case, in addition to the 4 continuous-valued inputs, you'd have 8 binary input neurons, where only one is activated (1) and the other 7 are inactive.
The way you did it implies an artificial relationship between the computation methods, which is almost certainly an artifact. For example, 1 and 2 are numerically (and from your network's point of view!) nearer than 1 and 8. But are the methods nr. 1 and 2 really more similar, or related, than the methods 1 and 8?
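As a concrete illustration of the binary-vector coding, here is a minimal Python sketch. The names one_hot and encode are made up for this example, and it assumes the methods are numbered 1 to 8 as in the question.

```python
def one_hot(index, n_classes):
    """Return a vector of zeros with a single 1.0 at position index."""
    vec = [0.0] * n_classes
    vec[index] = 1.0
    return vec

def encode(values, method, n_methods=8):
    """Build the network input: 4 numeric values plus 8 binary method inputs.

    method is assumed to be an integer from 1 to n_methods.
    """
    return list(values) + one_hot(method - 1, n_methods)

x = encode([0.2, 1.5, -0.7, 3.0], method=3)
print(x)       # 4 numeric inputs followed by 8 binary inputs
print(len(x))  # 12 inputs in total
```

With this coding every pair of distinct methods is equally far apart, so the network is no longer told that methods 1 and 2 are "closer" than methods 1 and 8.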
Since you don't provide much detail, my answer can't be very specific.
Generally speaking, neural networks tend to perform worse when nominal values are coded as numeric values, since the transformation will impose a (probably) false ordering on the variables. Mixing inputs with very different scales also tends to worsen the performance.
However, given the little information provided here, there is no way of telling whether this is the reason that the network's performance is "not as amazing" as you want. It could just as well be the case that you don't have enough training data, or that your training data contains a lot of noise. Perhaps you need to pre-scale your data, perhaps there is an error in your network code, perhaps you have chosen ill-suited values of constants for your learning algorithm...
The reasons a neural network doesn't perform as expected are many and diverse (one of them being unreasonably high expectations). Without much more information there is no way of knowing what the problem is in your case.
Mapping categories to numerical values is not a good practice in statistics. Especially in the case of neural networks. Bear in mind that neural networks tend to map similar inputs to similar outputs. If you map category A to 1 and category B to 2 (both as inputs), the NN will try to output similar values for both categories, even if they have nothing to do with each other.
A sparser representation is preferred. If you have 4 categories, map them like this:
A -> 0001
B -> 0010
Take a look at the "Subject: How should categories be encoded?" in this link:
The previous answers are right - do not map nominal values into arbitrary numeric ones. However, if the attribute has an ordinal nature ("Low", "Medium", "High" for example), you can replace the nominal values by ascending numeric values. Note that this may not be the optimal solution, since there is no guarantee, for example, that "High"=3 reflects the nature of your data. If in doubt, use one-hot bit encoding as suggested.
The reason for this is that a neural network is very similar to regression in the sense that multiple numeric values go through some kind of an aggregating function - but this happens multiple times. Each input is also multiplied by a weight.
So when you enter a numeric value, it undergoes a series of mathematical manipulations that adjust the weights in the network. So if you use numeric values for nominal data, nominal values that were mapped to close numeric values will be treated about the same in the best case; in the worst case, it can harm your model.

Efficient comparison of 1 million vectors containing (float, integer) tuples

I am working in a chemistry/biology project. We are building a web-application for fast matching of the user's experimental data with predicted data in a reference database. The reference database will contain up to a million entries. The data for one entry is a list (vector) of tuples containing a float value between 0.0 and 20.0 and an integer value between 1 and 18. For instance (7.2394 , 2) , (7.4011, 1) , (9.9367, 3) , ... etc.
The user will enter a similar list of tuples and the web-app must then return the - let's say - top 50 best matching database entries.
One thing is crucial: the search algorithm must allow for discrepancies between the query data and the reference data because both can contain small errors in the float values (NOT in the integer values). (The query data can contain errors because it is derived from a real-life experiment and the reference data because it is the result of a prediction.)
Edit - Moved text to answer -
How can we get an efficient ranking of 1 query on 1 million records?
You should add a physicist to the project :-) This is a very common problem to compare functions e.g. look here:
In the first link you can read: "The SEQUEST algorithm for analyzing mass spectra makes use of autocorrelation in conjunction with cross-correlation to score the similarity of an observed spectrum to an idealized spectrum representing a peptide."
An efficient linear scan of 1 million records of that type should take a fraction of a second on a modern machine; a compiled loop should be able to do it at about memory bandwidth, which would transfer that much data in two or three milliseconds.
But, if you really need to optimise this, you could construct a hash table of the integer values, which would divide the job by the number of integer bins. And, if the data is stored sorted by the floats, that improves the locality of matching by those; you know you can stop once you're out of tolerance. Storing the offsets of each of a number of bins would give you a position to start.
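A minimal Python sketch of that idea, assuming the floats are grouped by their integer value and each group is kept sorted, so a tolerance window can be found with the standard bisect module. The names build_index and matches are made up for this example; in a real index each stored float would also carry the ID of the entry it belongs to, which is omitted here.

```python
import bisect
from collections import defaultdict

def build_index(tuples):
    """Group float values by their integer tag and sort each group."""
    index = defaultdict(list)
    for value, tag in tuples:
        index[tag].append(value)
    for values in index.values():
        values.sort()
    return index

def matches(index, value, tag, tol):
    """Return stored floats with the same tag within +/- tol of value."""
    values = index.get(tag, [])
    # Binary search for both ends of the tolerance window.
    lo = bisect.bisect_left(values, value - tol)
    hi = bisect.bisect_right(values, value + tol)
    return values[lo:hi]

reference = [(7.2394, 2), (7.4011, 1), (9.9367, 3), (7.2500, 2), (7.9000, 2)]
index = build_index(reference)
hits = matches(index, 7.24, 2, tol=0.05)
print(hits)
```

Only the bucket for the query's integer value is touched, and within it the scan stops as soon as the values leave the tolerance window.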
I guess I don't see the need for a fancy algorithm yet... describe the problem a bit more, perhaps (you can assume a fairly high level of chemistry and physics knowledge if you like; I'm a physicist by training)?
Ok, given the extra info, I still see no need for anything better than a direct linear search, if there's only 1 million reference vectors and the algorithm is that simple. I just tried it, and even a pure Python implementation of linear scan took only around three seconds. It took several times longer to make up some random data to test with. This does somewhat depend on the rather lunatic level of optimisation in Python's sorting library, but that's the advantage of high level languages.
import random
r = [(random.uniform(0,20), random.randint(1,18)) for i in range(1000000)]
# this is a decorate-sort-undecorate pattern
# look for matches to (7,9)
# obviously, you can use whatever distance expression you want;
# absolute differences are summed here so that errors in x and y cannot cancel out
zz = [(abs(7-x) + abs(9-y), x, y) for x,y in r]
# sort by the distance stored in the first tuple element
zz.sort()
# return the 50 best matches
best = [(x,y) for a,x,y in zz[:50]]
Can't you sort the tuples and perform binary search on the sorted array ?
I assume your database is built once and for all, and that the positions of the entries are not important. You can sort this array so that the tuples are in a given order. When a tuple is entered by the user, you just look in the middle of the sorted array. If the query value is larger than the center value, you repeat the work on the upper half, otherwise on the lower one.
Worst case is O(log n).
If you can "map" your reference data to x-y coordinates on a plane there is a nifty technique which allows you to select all points under a given distance/tolerance (using Hilbert curves).
Here is a detailed example.
One approach we are trying ourselves, which allows for the discrepancies between query and reference, is binning the float values. We are testing this and want to offer the user a choice of bin sizes: 0.1, 0.2, 0.3 or 0.4. Binning leaves us with between 50 and 200 bins, each with a corresponding integer value between 0 and 18, where 0 means there was no value within that bin. The reference data can be pre-binned and stored in the database. We can then take the binned query data and compare it with the reference data. One approach could be, for all bins, to subtract the query integer value from the reference integer value. By summing up all differences we get the similarity score, with the most similar reference entries resulting in the lowest scores.
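The binning and scoring could be sketched like this in Python. The names bin_entry and score are made up for this example, and two assumptions go beyond what is described above: the differences are taken as absolute values (so that positive and negative differences cannot cancel), and when two floats fall into the same bin the later one overwrites the earlier one.

```python
def bin_entry(tuples, bin_size, max_value=20.0):
    """Turn (float, int) tuples into a fixed-length list of integers.

    A 0 means no value fell into that bin. If two floats share a bin,
    the later one wins (an assumption made for this sketch).
    """
    n_bins = int(round(max_value / bin_size))
    bins = [0] * n_bins
    for value, count in tuples:
        i = min(int(value / bin_size), n_bins - 1)
        bins[i] = count
    return bins

def score(query_bins, ref_bins):
    """Sum of absolute per-bin differences; lower means more similar."""
    return sum(abs(q - r) for q, r in zip(query_bins, ref_bins))

query = bin_entry([(7.21, 2), (7.45, 1), (9.99, 3)], bin_size=0.1)
ref_a = bin_entry([(7.2394, 2), (7.4011, 1), (9.9367, 3)], bin_size=0.1)
ref_b = bin_entry([(7.2394, 5), (12.05, 1)], bin_size=0.1)

ranked = sorted([("a", score(query, ref_a)), ("b", score(query, ref_b))],
                key=lambda item: item[1])
print(ranked)
```

One thing this makes visible: two floats that are very close but sit on opposite sides of a bin edge end up in different bins, so the tolerance is only approximate.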
Another (simpler) search option we want to offer is where the user only enters the float values. The integer values in both the query and the reference list can then be set to 1. We then use the Hamming distance to compute the difference between the binned query values and the binned reference values. I have previously asked about an efficient algorithm for that search.
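For the float-only variant, each binned entry reduces to a set of occupied bins, which fits in a single Python integer used as a bitmask. The sketch below assumes that representation; the names presence_mask and hamming are made up for this example.

```python
def presence_mask(values, bin_size):
    """Set bit i of an integer if any value falls into bin i."""
    mask = 0
    for v in values:
        mask |= 1 << int(v / bin_size)
    return mask

def hamming(a, b):
    """Number of bins occupied in exactly one of the two masks."""
    return bin(a ^ b).count("1")

query = presence_mask([7.21, 7.45, 9.99], 0.1)
ref = presence_mask([7.2394, 7.4011, 9.9367], 0.1)
other = presence_mask([7.2394, 12.05], 0.1)

print(hamming(query, ref))    # same occupied bins
print(hamming(query, other))  # several bins differ
```

XOR plus a bit count is cheap, so scanning a million precomputed masks this way should stay well within interactive response times.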
This binning is only one way of achieving our goal. I am open to other suggestions. Perhaps we can use Principal Component Analysis (PCA), as described here
