How to work with sparse data using ELKI?

I'm trying to use a sparse matrix as input data to ELKI's SOD algorithm to detect outliers.
I was looking for help on the HowTo and FAQ pages about sparse data, so I've tried to use SparseNumberVectorLabelParser and SparseVectorFieldFilter like this:
// data is an m x n matrix
ArrayAdapterDatabaseConnection dataArray = new ArrayAdapterDatabaseConnection(data);
SparseDoubleVector.Factory sparseVector = new SparseDoubleVector.Factory();
SparseNumberVectorLabelParser<SparseDoubleVector> parser = new SparseNumberVectorLabelParser<SparseDoubleVector>(Pattern.compile("\\s*[,;\\s]\\s*"), "\"", Pattern.compile("^\\s*(#|//|;).*$"), null, sparseVector);
SparseVectorFieldFilter<SparseDoubleVector> sparseFilter = new SparseVectorFieldFilter<SparseDoubleVector>();
ListParameterization params = new ListParameterization();
params.addParameter(AbstractDatabase.Parameterizer.DATABASE_CONNECTION_ID, dataArray);
params.addParameter(AbstractDatabaseConnection.Parameterizer.PARSER_ID, parser);
params.addParameter(AbstractDatabaseConnection.Parameterizer.FILTERS_ID, sparseFilter);
Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, params);
db.initialize();
params = new ListParameterization();
params.addParameter(SOD.Parameterizer.KNN_ID, 25);
params.addParameter(SharedNearestNeighborPreprocessor.Factory.NUMBER_OF_NEIGHBORS_ID, 10);
SOD<DoubleVector> sodAlg = ClassGenericsUtil.parameterizeOrAbort(SOD.class, params);
OutlierResult result = sodAlg.run(db);
But I get this runtime exception:
Exception in thread "main" de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field
Available types: DBID DoubleVector,field,mindim=7606,maxdim=12968
at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:154)
at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:80)
Is this the right way to use SparseNumberVectorLabelParser and SparseVectorFieldFilter within Java code?

ArrayAdapterDatabaseConnection is designed for dense data. For sparse data, it does not make much sense to first encode it into a dense array, then re-encode it into sparse vectors. Consider reading the data as sparse vectors directly to avoid overhead.
The error you are seeing has a different reason, though:
SOD is specified on a vector field of fixed dimensionality, but the sparse vectors yield a relation that has a variable dimensionality. So it doesn't find the requested data type (hence, NoSupportedDataTypeException).
You can force the data to be of fixed dimensionality using SparseVectorFieldFilter.
But I'm not sure if SOD is an appropriate algorithm to use on sparse data. Even though it should work then, the runtime and performance may be bad, because the algorithm isn't operating on data that satisfies the assumptions it was designed for. Sparse data is usually best handled with algorithms that exploit and handle data sparsity. (As is, you would also compute the shared nearest neighbors using Euclidean distance, which may not work well for sparse data. If the SNN are bad, SOD won't work well either.)
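As a rough sketch of the all-sparse route suggested above (assuming your ELKI version provides FileBasedDatabaseConnection with an INPUT_ID parameter; the file name is a placeholder, and the parser and filter objects are the ones constructed in the question):
// Sketch: read the sparse input directly, so no dense m x n array is ever built.
ListParameterization dbParams = new ListParameterization();
dbParams.addParameter(AbstractDatabase.Parameterizer.DATABASE_CONNECTION_ID, FileBasedDatabaseConnection.class);
dbParams.addParameter(FileBasedDatabaseConnection.Parameterizer.INPUT_ID, "data.sparse"); // placeholder path
dbParams.addParameter(AbstractDatabaseConnection.Parameterizer.PARSER_ID, parser);        // SparseNumberVectorLabelParser from above
dbParams.addParameter(AbstractDatabaseConnection.Parameterizer.FILTERS_ID, sparseFilter); // forces a fixed-dimensional vector field
Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, dbParams);
db.initialize();
With the filter in place, the resulting relation should advertise itself as a fixed-dimensionality vector field, which is the data type SOD requests.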

Related

Minimize a linear programming system in C

I need to minimize a huge linear programming system where all the related data (objective function, constraints) are stored in memory in arrays and structures, not in an LP file or CPLEX format.
I saw that there are many solvers (like here and here), but the problem is how I can minimize the model without loading it from a file in a special format.
I did the same thing previously in R and Python by solving the model directly after building it, without needing to first save it to a special file and then pass that file to the solver. Here is an example in Python:
from lpsolve55 import *
from lp_maker import *
from lp_solve import *
lp = lp_maker(obj_func, constraints, rhs, sense_equality)
solvestat = lpsolve('solve', lp)
obj = lpsolve('get_objective', lp)
I think this is possible to do in C, but I don't know where to find out how to do it.
One option is to use the APIs that commercial solvers like CPLEX and Gurobi provide for C/C++. Essentially, these APIs let you build the model in logical chunks (objective function, constraints, etc.). The APIs do the work of translating the logic of the model to the matrices and vectors that the solver actually needs in order to solve the model.
Another approach is to use a modeling language like AMPL or GAMS. AMPL, for example, also provides a C/C++ API.
Which one you choose probably depends on what solver you plan to use and how often you need to modify your model and/or data programmatically.

MPI Master/Slave with 2D array

I am quite new to MPI parallel processing.
I am dealing with the following problem related to the MASTER/SLAVE approach.
I have a 2D square array of SIZE=500, and I need to break it into several blocks of dimension D < SIZE.
I should implement a Master/Slave MPI scheme where each processor receives, and sends back to the master, N blocks, where N depends on the number of processors involved and the dimension D of the sub-blocks.
I managed to solve the problem by dividing the original array into stripes, but I don't know how to deal with squares!
In order to simplify your problem, D should be a divisor of 500. The total number of blocks is then blocks = (500/D)^2, and N should be something along the lines of n = blocks/cpus.
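For example (numbers chosen purely for illustration): with D = 100 and 5 worker processes, blocks = (500/100)^2 = 25, so each worker handles n = 25/5 = 5 blocks.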
IMHO the simplest way would be to build a square of DxD elements from the array and send that chunk of data to the client. Depending on the language and method, you can build small objects and send them to the client, or just replicate the full matrix and send the coordinates of the chunk.
Another option is to use MPI_Type_create_subarray() in order to create a derived datatype for a given (sub)square/rectangle of your array.
On the down side, this derived datatype cannot be used with collective operations such as MPI_Scatter[v]() and MPI_Gather[v](), which is usually the "natural" MPI way of distributing / re-assembling data.

How do I fill a histogram in MATLAB if one gets extremely many different copies of the vector to be histogrammed?

I was trying to collect statistics of a 6D vector and plot a 1D histogram for each coordinate. I get 729000000 different copies of this vector (each 6-dimensional). For this I create an array of zeros of size 729000000x6 before I get any of the actual W's, and this is a problem in MATLAB, since it says:
Error using zeros
Requested 729000000x6 (32.6GB) array exceeds maximum array size preference. Creation of arrays
greater than this limit may take a long time and cause MATLAB to become unresponsive. See array
size limit or preference panel for more information.
The reason I did this at first was that it was easy to fill W_history and then just feed it to the histogram plotter:
histogram(W_history(:,d),nbins,'Normalization','probability')
However, filling W_history seems impossible for a high number of copies of W. Is there a way to do this in MATLAB automatically? It feels like there should be, and I didn't want to re-invent the wheel.
I am sure I could create, for each coordinate, some array of counters that records how many times a specific value of that coordinate of W occurs. However, implementing that, including the checks for which bin each value should fall into, seemed inefficient or even unnecessary. Is this really the only solution, and what do MATLAB experts recommend? Is this re-inventing the wheel? It also seems inefficient if I implement it myself.
I also thought I could manually have MATLAB move things between memory and disk (store W_history on disk as it fills, push more to disk as it keeps filling, and eventually somehow plug it all into the histogram plotter), but that seemed like overkill and I hope I can avoid a solution like that. It feels wrong, since using MATLAB should be "easy" and high level, and going down to disk and memory management doesn't seem to be what MATLAB is intended for.
Currently, based on a comment that was given, the best solution I have so far is to use histcounts as follows:
W_hist_counts = zeros(1, numel(edges)-1); % one counter per histogram bin
for i = 2:iter+1
    % produce the next batch of W values
    W = get_new_W(W);
    % bin this batch and accumulate the counts
    [W_hist_counts_current, ~] = histcounts(W, edges);
    W_hist_counts = W_hist_counts + W_hist_counts_current;
end
However, after this it seems difficult to convert W_hist_counts to a pdf/probability or other normalizations, since it seems they have to be computed manually. Is there no official way to do this without the user having to implement the normalizations again?

Access main table's values (DISTANCE and SIZE)

I am storing a vast amount of mathematical formulas as Content MathML in BaseX databases. To speed up lookups with different search algorithms implemented as XQuery expressions, I want to access the main table's values, especially PRE, DISTANCE and SIZE. The plan is to discard all subtrees whose total number of nodes (SIZE) does not match what I am looking for.
The PRE value is available via the function db:node-pre and working just fine. How can I access the DISTANCE and SIZE values? I could not find a way in the documentation.
The short answer is: you don't; stay with the offered APIs.
If you really want those values, use the parent::node() and following-sibling::node()[1] axes and query their PRE values. The following equations hold:
PRE(.) = PRE(parent) + DIST(.)
PRE(following-sibling[1]) = PRE(.) + SIZE(.)
so you could read those values in constant time by reordering those equations.
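Rearranged, these give the two values you are after directly:
DIST(.) = PRE(.) - PRE(parent)
SIZE(.) = PRE(following-sibling[1]) - PRE(.)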
The long answer: dig deep into BaseX internals
You'll be touching the core (and probably shouldn't; kittens might die!). Implement a BaseX Java binding to get access to the queryContext variable, which holds the database context, which you can query to get a Data reference:
Data data = queryContext.context.data();
Once you have the Data reference, you get access to several functions to query values of the internal data structure:
int Data.dist(int pre, int kind)
int Data.size(int pre, int kind)
where kind is always 1 for element nodes.
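As a rough sketch (using only the calls quoted above; the class and function names are made up for illustration, and assume the usual QueryModule binding mechanism):
import org.basex.data.Data;
import org.basex.query.QueryModule;

// Hypothetical XQuery Java binding exposing DIST and SIZE for a given PRE value.
public class TableValues extends QueryModule {
  private static final int ELEM = 1; // kind = 1 for element nodes, as noted above

  public int dist(int pre) {
    Data data = queryContext.context.data();
    return data.dist(pre, ELEM);
  }

  public int size(int pre) {
    Data data = queryContext.context.data();
    return data.size(pre, ELEM);
  }
}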
Be brave, and watch your step, you're leaving the safe grounds now!

KD-Trees and missing values (vector comparison)

I have a system that stores vectors and allows a user to find the n most similar vectors to the user's query vector. That is, a user submits a vector (I call it a query vector) and my system spits out "here are the n most similar vectors." I generate the similar vectors using a KD-Tree and everything works well, but I want to do more. I want to present a list of the n most similar vectors even if the user doesn't submit a complete vector (a vector with missing values). That is, if a user submits a vector with only three dimensions, I still want to find the n nearest of my stored vectors (which have 11 dimensions).
I have a couple of obvious solutions, but I'm not sure either one seems very good:
Create multiple KD-Trees, each built using the most popular subset of dimensions a user will search for. That is, if a user submits a query vector of three dimensions, x, y, z, I match that query to an already-built KD-Tree which only contains vectors of three dimensions, x, y, z.
Ignore KD-Trees when a user submits a query vector with missing values and compare the query vector to the vectors (stored in a table in a DB) one by one using something like a dot product.
This has to be a common problem, any suggestions? Thanks for the help.
Your first solution might be fastest for queries (since the tree-building doesn't consider splits in directions that you don't care about), but it would definitely use a lot of memory. And if you have to rebuild the trees repeatedly, it could get slow.
The second option looks very slow unless you only have a few points. And if that's the case, you probably didn't need a kd-tree in the first place :)
I think the best solution involves getting your hands dirty in the code that you're working with. Presumably the nearest-neighbor search computes the distance between the point in the tree leaf and the query vector; you should be able to modify this to handle the case where the point and the query vector are different sizes. E.g. if the points in the tree are given in 3D, but your query vector is only length 2, then the "distance" between the point (p0, p1, p2) and the query vector (x0, x1) would be
sqrt( (p0-x0)^2 + (p1-x1)^2 )
I didn't dig into the Java code that you linked to, but I can try to find exactly where the change would need to go if you need help.
-Chris
PS - you might not need the sqrt in the equation above, since distance squared is usually equivalent.
EDIT
Sorry, didn't realize it would be so obvious in the source code. You should use this version of the neighbor function:
nearest(double [] key, int n, Checker<T> checker)
And implement your own Checker class; see their EuclideanDistance.java to see the Euclidean version. You may also need to comment out any KeySizeException that the query code throws, since you know that you can handle differently sized keys.
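Here is a sketch of just the distance part (not tied to the linked library's Checker interface; it assumes, purely as a convention for this sketch, that missing query dimensions are marked with Double.NaN):
// Squared Euclidean distance over only the dimensions the query actually supplies.
public static double partialSquaredDistance(double[] query, double[] stored) {
  double sum = 0.0;
  for (int i = 0; i < stored.length; i++) {
    // Skip dimensions the user did not provide.
    if (i >= query.length || Double.isNaN(query[i])) {
      continue;
    }
    double d = query[i] - stored[i];
    sum += d * d;
  }
  return sum; // no sqrt needed when only ranking neighbors
}
Plugging something like this into the tree's distance computation (or into a custom Checker) amounts to summing squares only over the supplied elements, as described above.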
Your second option looks like a reasonable solution for what you want.
You could also populate the missing dimensions with the most important (or average, or whatever you think is appropriate) values, if there are any.
You could try using the existing KD tree, by taking both branches when the split is on a dimension that is not supplied by the source vector. This should take less time than doing a brute-force search, and might be less trouble than trying to maintain a bunch of specialized trees for dimension subsets.
You would need to adapt your N-closest algorithm (without more info I can't advise you on that...), and for distance you would use the sum of the squares of only those elements supplied by the source vector.
Here's what I ended up doing: when a user didn't specify a value (when their query vector lacked a dimension), I simply adjusted my matching range (in the API) to something huge, so that I match any value.
