ELKI how to increase the precision? - dbscan

I am using the ELKI MiniGUI to cluster my data points. I have about 1300 GPS data points which I would like to cluster (with DBSCAN and OPTICS). As the input file for dbc.in I am using a CSV file with only 2 columns (X,Y). The problem is that my X,Y coordinates (projected) are very precise, up to 6 decimal places, but after running the clustering algorithm I get lower precision (up to 3 decimal places). How can I increase the precision of the output points?
Also, when it generates the clusters, it automatically introduces some virtual IDs which do not correspond to my actual point IDs (ID, X, Y). However, the ID is not given in the input CSV, which comprises only two columns (X,Y).

ELKI relies on double for representing numbers. If you need a higher precision, you will have to implement your own parser and output modules (it's easy though, as we have a highly modular architecture).
Default output serialization to text is handled by Java. Precision is therefore what you get from Java by default. This should be 15-16 digits of precision, if you are using DoubleVector, and 7-8 digits if you are using FloatVector.
A quick check with groovysh:
new DoubleVector([12345.678901234567890, 3456.109453] as double[]);
===> 12345.678901234567 3456.109453
new FloatVector([12345.678901234567890, 3456.109453] as float[]);
===> 12345.679 3456.1094
yields only the loss to be expected from double and float precision.
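The same effect can be reproduced outside ELKI. The following is a minimal Python sketch, not ELKI code: it assumes only that a Python float is an IEEE 754 double (like a Java double) and uses the standard struct module to round-trip the value through a 4-byte single-precision float.

```python
import struct

x = 12345.678901234567890

# A Python float is an IEEE 754 double, the same representation as a Java double.
as_double = x

# Round-trip through 4 bytes to see what a single-precision float retains.
as_float = struct.unpack("<f", struct.pack("<f", x))[0]

print(repr(as_double))    # the double keeps roughly 15-17 significant digits
print(repr(as_float))     # the float keeps roughly 7-8 significant digits
print(abs(as_float - x))  # size of the single-precision rounding error
```

If the output of a tool shows only 3 decimal places for values of this magnitude, that is consistent with single-precision storage somewhere in the pipeline rather than with double precision.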
The best way to get row labels is to... add row labels to your data.
Regarding your add-on question in the comments: the default parser will treat a text row at the beginning of your file as column labels. So just put "X Y" into the first line of your file.
A reasonable input format will therefore be:
X Y Label
1 2 Point7
3 4 "Point 8"
The following are not-so-good ideas:
5 6 123shouldwork
7 8 don't do this: 3 parser will retain the 3
The label should be non-numeric, so that the parser will treat it as a label automatically. Otherwise, you have to set the appropriate parameter.
DBIDs are meant for internal handling. Maybe we should not write them to the output at all. FixedDBIDFilter is a hackish work-around; it is meant to be used to get reproducible hashing when using algorithms that need ID-based hashing and doing multiple runs in the MiniGUI, because across multiple runs the DBIDs keep being enumerated continuously.


Named range of consistent random numbers

Background
Following on from a question I asked a while ago about getting an array of different (but not necessarily unique) random numbers to which the answer was this:
=RANDBETWEEN(ROW(A1:A10)^0,10)
To get an array of 10 random numbers between 1 and 10
The Problem
If I create a named range (called "randArray") with the formula above I hoped I would be able to reference randArray a number of times and get the same set of random numbers. Granted, they would change each time I press F9 or update the worksheet -- but change together.
This is what I get instead, two completely different sets of random numbers
I'm not surprised by this behavior but how can I achieve this without using VBA and without putting the random numbers onto the worksheet?
If you're interested
This example is intended to be MCVE. In my actual case, I am using random numbers to estimate Pi. The user stipulates how many random points to apply and gets an accordingly accurate estimation. The problem arises because I also graph the points and when there are a small number of points it's very clear to see that the estimation and the graph don't represent the same dataset
Update
I have awarded the initial bounty to @Michael for providing an interesting and different solution. I am still looking for a complete solution which allows the user to stipulate how many random points to use, and although there might not be a perfect answer I'm still interested in any other possible solutions and more than happy to put up further bounties.
Thank you to everyone who has contributed so far.
This solution generates 10 seemingly random numbers between 1 and 10 that persist for nearly 9 seconds at a time. This allows repeated calls of the same formula to return the same set of values in a single refresh.
You can modify the time frame if required. Shorter time periods allow for more frequent updates, but also slightly increase the extremely unlikely chance that some calls to the formula occur after the cutover point resulting in a 2nd set of 10 random numbers for subsequent calls.
Firstly, define an array "Primes" with 10 different prime numbers:
={157;163;167;173;179;181;191;193;197;199}
Then, define this formula that will return an array of 10 random numbers:
=MOD(ROUND(MOD(ROUND(NOW(),4)*70000,Primes),0),10)+1
Explanation:
We need to build our own random number generator that we can seed with the same value for an amount of time; long enough for the called formula to keep returning the same value.
Firstly, we create a seed: ROUND(NOW(),4) creates a new seed number every 0.0001 days = 8.64 seconds.
We can generate rough random numbers using the following formula:
Random = Seed * 7 mod Prime
https://cdsmith.wordpress.com/2011/10/10/build-your-own-simple-random-numbers/
Ideally, a sequence of random numbers is generated by taking input from the previous output, but we can't do that in a single function. So instead, this uses 10 different prime numbers, essentially starting 10 different random number generators. Now, this has less reliability at generating random numbers, but testing results further below shows it actually seems to do a pretty good job.
ROUND(NOW(),4)*70000 gets our seed up to an integer and multiplies by 7 at the same time
MOD(ROUND(NOW(),4)*70000,Prime) generates a sequence of 10 random numbers from 0 to the respective prime number
ROUND(MOD(ROUND(NOW(),4)*70000,Prime),0) is required to get us back to an integer because Excel seems to struggle with applying MOD to floating point numbers.
=MOD(ROUND(MOD(ROUND(NOW(),4)*70000,Prime),0),10)+1 takes just the value from the ones place (random number from 0 to 9) and shifts it to give us a random number from 1 to 10
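To make the steps above easier to follow, here is a rough Python sketch of the same arithmetic. It is only an illustration under assumptions: the function name pseudo_random_digits and the serial_time argument (standing in for the value of NOW()) are hypothetical, and Python's round() rounds halves to even while Excel's ROUND rounds them away from zero, so individual values can differ from the spreadsheet in rare tie cases.

```python
PRIMES = [157, 163, 167, 173, 179, 181, 191, 193, 197, 199]

def pseudo_random_digits(serial_time, primes=PRIMES):
    """Mirror MOD(ROUND(MOD(ROUND(NOW(),4)*70000,Primes),0),10)+1."""
    # ROUND(NOW(),4): the seed only changes every 0.0001 days (8.64 seconds).
    seed = round(serial_time, 4)
    # *70000 turns the seed into (roughly) an integer and multiplies it by 7.
    scaled = seed * 70000
    # One small generator per prime; keep the ones digit and shift to 1..10.
    return [int(round(scaled % p)) % 10 + 1 for p in primes]

# Two calls within the same 8.64 second window share a seed and agree.
print(pseudo_random_digits(45000.123412))
print(pseudo_random_digits(45000.123438))
```

Both calls round to the seed 45000.1234, which is why repeated references to the formula return the same set of values until the seed ticks over.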
Testing results:
I generated 500 lots of 10 random numbers (in columns instead of rows) for seed values incrementing by 0.0001 and counted the number of times each digit occurred for each prime number. You can see that each digit occurred nearly 500 times in total and that the distribution of each digit is nearly equal between each prime number. So, this may be adequate for your purposes.
Looking at the numbers generated in immediate succession you can see similarities between adjacent prime numbers, they're not exactly the same but they're pretty close in places, even if they're offset by a few rows. However, if the refresh is occurring at random intervals, you'll still get seemingly random numbers and this should be sufficient for your purposes. Otherwise, you can still apply this approach to a more complex random number generator or try a different mix of prime numbers that are further apart.
Update 1: Trying to find a way of being able to specify the number of random numbers generated without storing a list of primes.
Attempt 1: Using a single prime with an array of seeds:
=MOD(ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize))/10000,4)*70000,1013),0),10)+1
This does give you an even distribution, but it really is just repeating the exact same sequence of 10 numbers over and over. Any analysis of the sample would be identical to analysing =MOD(ROW(1:SampleSize),10)+1. I think you want more variation than that!
Attempt 2: Working on a 2-dimensional array that still uses 10 primes....
Update 2: Didn't work. It had terrible performance. A new answer has been submitted that takes a similar but different approach.
OK, here's a solution where users can specify the number of values in the defined name SampleSize:
=MOD(ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize)),4)*10000*163,1013),0)+ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize))*2,4)*10000*211,1013),0)+ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize))*3,4)*10000*17,1013),0)+ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize))*5,4)*10000*179,53),0)+ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize))*7,4)*10000*6101,1013),0),10)+1
It's a long formula, but has good efficiency and can be used in other functions. Attempts at a shorter formula resulted in unusably poor performance and arrays that for some reason couldn't be used in other functions.
This solution combines 5 different prime number generators to increase variety in the generated random numbers. Some arbitrary constants were introduced to try to reduce repeating patterns.
This has correct distribution and fairly good randomness. Repeated testing with a SampleSize of 10,000 resulted in frequencies of individual numbers varying between 960 and 1040 with no overall favoritism. However it seems to have the strange property of never generating the same number twice in a row!
You can achieve this using just standard spreadsheet formulas.
One way is to use the so called Lehmer random number method. It generates a sequence of random numbers in your spreadsheet that stays the same until you change the "seed number", a number you choose yourself and will recreate a different random sequence for each seed number you choose.
The short version:
In cell B1, enter your "seed" number, it can be any number from 1 to 2,147,483,647
In cell B2 enter the formula =MOD(48271*B1,2^31-1) , this will generate the first random number of your sequence.
Now copy this cell down as far as the random sequence you want to generate.
That's it. For your named range, go ahead and name the range from B2 down as far as your sequence goes. If you want a different set of numbers, just change the seed in B1. If you ever want to recreate the same set of numbers just use the same seed and the same random sequence will appear.
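For readers who want to check the spreadsheet against something else, the recurrence above can be sketched in Python. The function name lehmer_sequence is hypothetical; the multiplier 48271 and the modulus 2^31-1 are the ones from the formula in B2.

```python
MODULUS = 2**31 - 1   # 2,147,483,647
MULTIPLIER = 48271

def lehmer_sequence(seed, count):
    """Return `count` values of x -> 48271 * x mod (2^31 - 1), starting from seed."""
    values = []
    state = seed
    for _ in range(count):
        state = (MULTIPLIER * state) % MODULUS
        values.append(state)
    return values

# The same seed always reproduces the same sequence.
print(lehmer_sequence(1, 2))  # [48271, 182605794]
```

With a seed of 1 in B1, cells B2 and B3 should show the same two values.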
More details in this tutorial:
How to generate random numbers that don't change in Excel and Google Sheets
It's not a great answer, but given the limitations of volatile functions, one possible approach is to wrap the volatile function in an IF formula controlled by a switch variable placed somewhere in the worksheet.
I used the below formula to achieve the desired result
=IF(rngIsVolatile,randArray,A1:A10)
I set cell B12 as rngIsVolatile. The screenshots below show it working.
When rngIsVolatile is set to True, it picks up new values from randArray:
When rngIsVolatile is set to False, it picks up old values from A1:A10:

How to compress/archive a temperature curve effectively?

Summary: An industrial thermometer is used to sample the temperature at a technology device. For a few months, the samples are simply stored in the SQL database. Are there any well-known ways to compress the temperature curve so that a much longer history could be stored effectively (say, for audit purposes)?
More details: Actually, there are many more thermometers, and possibly other sensors related to the technology. And there are well-known time intervals where the curve belongs to a batch processed on the machine. The temperature curves should be added to the batch documentation.
My idea was that the temperature is a smooth function that could be interpolated somehow, say the way sound is compressed using the MP3 format. The compression need not be lossless. However, it must be possible to reconstruct the temperature curve (not necessarily the identical sample values or the identical sampling interval), say, to be able to plot the curve or to tell what the temperature was at a certain time.
The raw sample values from the SQL table would be processed, the compressed version would be stored elsewhere (possibly also in SQL database, as a blob), and later the raw samples can be deleted to save the database space.
Is there any well-known and widely used approach to the problem?
A simple approach would be to code the temperature into one or two bytes, depending on the range and precision you need, and then to write the first temperature to your output, followed by the difference between successive temperatures for all the rest. For two-byte temperatures you can restrict the range somewhat and write one or two bytes depending on the difference, using a variable-length integer. E.g. if the high bit of the first byte is set, then the next byte contains 8 more bits of difference, allowing for 15 bits of difference. Most of the time it will be one byte, based on your description.
Then take that stream and feed it to a standard lossless compressor, e.g. zlib.
Any lossiness should be introduced at the sampling step, encoding only the number of bits you really need to encode the required range and precision. The rest of the process should then be lossless to avoid systematic drift in the decompressed values.
Subtracting successive values is the simplest predictor. In that case the prediction of the next value is the value before it. It may also be the most effective, depending on the noisiness of your data. If your data is really smooth, then you could try a higher-order predictor to see if you get better performance. E.g. a predictor for the next point using the last two points is 2a - b, where a is the previous point and b is the point before that, or using the last three points 3a - 3b + c, where c is the point before b. (These assume equal time steps between each.)
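As a rough illustration of the pipeline described above (quantize, subtract successive values, then compress losslessly), here is a minimal Python sketch. The function names and the 0.1 degree step are assumptions, and for brevity it stores every difference as a fixed 4-byte integer rather than the variable-length one- or two-byte encoding described earlier; zlib removes most of that redundancy anyway.

```python
import struct
import zlib

def compress_curve(temps, step=0.1):
    """Quantize to multiples of `step`, delta-encode, then deflate."""
    if not temps:
        return zlib.compress(b"")
    # The only lossy step: round each sample to the required precision.
    q = [round(t / step) for t in temps]
    # First value as-is, then differences (the "previous value" predictor).
    deltas = [q[0]] + [b - a for a, b in zip(q, q[1:])]
    raw = struct.pack(f"<{len(deltas)}i", *deltas)
    return zlib.compress(raw, 9)

def decompress_curve(blob, step=0.1):
    """Invert compress_curve: inflate, undo the deltas, rescale."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 4}i", raw)
    values, total = [], 0
    for d in deltas:
        total += d
        values.append(total * step)
    return values

samples = [20.0 + 0.1 * (i % 7) for i in range(1000)]
blob = compress_curve(samples)
restored = decompress_curve(blob)
print(len(blob), "bytes for", len(samples), "samples")
```

Because everything after the quantization is exact integer arithmetic, the reconstructed values never drift: each one is within half a step of the original sample.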

Giving the shape as output using GA

The scenario is that I want to get a shape as output when the number of edges, the number of vertices and the interior angle are given as input. I am trying to do this using Genetic Algorithms.
My problem is that I am having trouble getting started. How would I create the initial population randomly for this case? And how could I define the chromosomes in a bitwise representation?
I was referring to some PPTs.
But in my case, I think I can't represent the chromosome as bits, because what I would be giving are numeric values, aren't they? Any clues to help me move forward?
Genetic Algorithms don't have to be represented as bits, although I prefer to do it this way. The best way is probably to just convert the numbers from binary to whatever form you need to represent your shapes and back again.
You can either scale the binary or clip the edges to make it fit whatever boundary you need.
In terms of initialisation all you need to do is work out how many bits you need to represent all your input and generate this randomly. For example, if you wanted 3 whole numbers between 0-255 you would need 24 bits (8 * 3). Just randomly generate this number for each chromosome in the population. When creating the shape you just split the chromosome into 3, convert into your 3 whole numbers and use them.
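The 24-bit example above could look roughly like this in Python. This is only a sketch: the function names, the population size of 20 and the interpretation of the three decoded numbers are assumptions made for illustration.

```python
import random

def random_chromosome(n_bits, rng=random):
    """A chromosome is simply a list of random bits."""
    return [rng.randint(0, 1) for _ in range(n_bits)]

def decode(chromosome, gene_bits=8):
    """Split the bit list into fixed-width genes and read each as an integer."""
    genes = []
    for i in range(0, len(chromosome), gene_bits):
        bits = chromosome[i:i + gene_bits]
        genes.append(int("".join(str(b) for b in bits), 2))
    return genes

# Initial population: 20 chromosomes of 3 genes x 8 bits = 24 bits each.
population = [random_chromosome(24) for _ in range(20)]
print(decode(population[0]))  # three whole numbers between 0 and 255
```

The decoded numbers can then be scaled or clipped to the valid range for edges, vertices or angles before the fitness of the resulting shape is evaluated.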

How to normalize multiple array of different size in matlab

I use a set of images for image processing in which each image generates a unique code (Freeman chain code). The size of the array varies for each image; however, the values range from 0 to 7. For example, the first image creates an array of 3124 elements and the second image creates an array of 1800 elements.
For further processing, I need those arrays to have a fixed size. So, is there any way to normalize them?
There is a reason why you are getting different sized arrays when applying a chain code algorithm to different images. This is because the contours that represent each shape are completely different. For example, the letter C and D will most likely contain chain codes that are of a different length because you are describing a shape as a chain of values from a starting position. The values ranging from 0-7 simply tell you which direction you need to look next given the current position of where you're looking in the shape. Usually, chain codes have the following convention:
3 2 1
4 x 0
5 6 7
0 means to move to the east, 1 means to move north east, 2 means to move north and so on. Therefore, if we had the following contour:
o o x
o
o o o
With the starting position at x, the chain code would be:
4 4 6 6 0 0
Chain codes encode how we should trace the perimeter of an object given a starting position. Now, what you are asking is whether or not we can take two different contours with different shapes and represent them using the same number of values that represent their chain code. You can't because of the varying length of the chain code.
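To make the convention concrete, here is a small Python sketch (not MATLAB, and not taken from any library) that follows a chain code from a starting position and returns the visited pixel coordinates. The function name trace and the choice of y pointing up are assumptions.

```python
# Offsets for directions 0..7: 0 = east, then counter-clockwise, with y pointing up.
STEPS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]

def trace(start, chain):
    """Follow a Freeman chain code from `start`; return the visited points."""
    x, y = start
    points = []
    for code in chain:
        dx, dy = STEPS[code]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points

# The contour from the example, with x at the top-right corner (2, 2).
print(trace((2, 2), [4, 4, 6, 6, 0, 0]))
# [(1, 2), (0, 2), (0, 1), (0, 0), (1, 0), (2, 0)]
```

Each contour pixel contributes one code, so the length of the chain code is tied directly to the length of the contour.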
tl;dr
In general, you can't. The different sized arrays mean that the contours that are represented by those chain codes are of different lengths. What you are actually asking is whether or not you can represent two different and unrelated contours / chain codes with the same amount of elements.... and the short answer is no.
What you need to think about is why you want to try and do this? Are you trying to compare the shapes between different contours? If you are, then doing chain codes is not the best way to do that due to how sensitive chain codes are with respect to how the contour changes. Adding the slightest bit of noise would result in an entirely different chain code.
Instead, you should investigate shape similarity measures. An authoritative paper by Remco Veltkamp talks about different shape similarity measures for the purposes of shape retrieval. See here: http://www.staff.science.uu.nl/~kreve101/asci/smi2001.pdf . Measures such as the Hausdorff distance, Minkowski distance... or even simple moments are some of the most popular measures that are used.

Efficient comparison of 1 million vectors containing (float, integer) tuples

I am working in a chemistry/biology project. We are building a web-application for fast matching of the user's experimental data with predicted data in a reference database. The reference database will contain up to a million entries. The data for one entry is a list (vector) of tuples containing a float value between 0.0 and 20.0 and an integer value between 1 and 18. For instance (7.2394 , 2) , (7.4011, 1) , (9.9367, 3) , ... etc.
The user will enter a similar list of tuples and the web-app must then return the - let's say - top 50 best matching database entries.
One thing is crucial: the search algorithm must allow for discrepancies between the query data and the reference data because both can contain small errors in the float values (NOT in the integer values). (The query data can contain errors because it is derived from a real-life experiment and the reference data because it is the result of a prediction.)
Edit - Moved text to answer -
How can we get an efficient ranking of 1 query on 1 million records?
You should add a physicist to the project :-) Comparing functions is a very common problem, e.g. look here:
http://en.wikipedia.org/wiki/Autocorrelation
http://en.wikipedia.org/wiki/Correlation_function
In the first link you can read: "The SEQUEST algorithm for analyzing mass spectra makes use of autocorrelation in conjunction with cross-correlation to score the similarity of an observed spectrum to an idealized spectrum representing a peptide."
An efficient linear scan of 1 million records of that type should take a fraction of a second on a modern machine; a compiled loop should be able to do it at about memory bandwidth, which would transfer that in two or three milliseconds.
But, if you really need to optimise this, you could construct a hash table of the integer values, which would divide the job by the number of integer bins. And, if the data is stored sorted by the floats, that improves the locality of matching by those; you know you can stop once you're out of tolerance. Storing the offsets of each of a number of bins would give you a position to start.
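A minimal Python sketch of that idea, assuming each (float, integer) tuple is indexed on its own: a dictionary keyed by the integer value holds a sorted list of floats, and the standard bisect module finds the range within tolerance. The names build_index and matches are hypothetical, and a real index would also store which database entry each value belongs to.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

def build_index(pairs):
    """Group float values by their integer label and sort each group."""
    index = defaultdict(list)
    for value, label in pairs:
        index[label].append(value)
    for values in index.values():
        values.sort()
    return index

def matches(index, value, label, tol):
    """Return stored floats with the same label within +/- tol of value."""
    values = index.get(label, [])
    lo = bisect_left(values, value - tol)
    hi = bisect_right(values, value + tol)
    return values[lo:hi]

index = build_index([(7.2394, 2), (7.4011, 1), (9.9367, 3), (7.25, 2)])
print(matches(index, 7.24, 2, 0.02))  # [7.2394, 7.25]
```

The integer lookup is exact, matching the requirement that only the float values may contain errors.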
I guess I don't see the need for a fancy algorithm yet... describe the problem a bit more, perhaps (you can assume a fairly high level of chemistry and physics knowledge if you like; I'm a physicist by training)?
Ok, given the extra info, I still see no need for anything better than a direct linear search, if there's only 1 million reference vectors and the algorithm is that simple. I just tried it, and even a pure Python implementation of linear scan took only around three seconds. It took several times longer to make up some random data to test with. This does somewhat depend on the rather lunatic level of optimisation in Python's sorting library, but that's the advantage of high level languages.
import random

# Build one million random (float, integer) tuples as test data.
r = [(random.uniform(0, 20), random.randint(1, 18)) for i in range(1000000)]
# This is a decorate-sort-undecorate pattern.
# Look for matches to (7, 9).
# Obviously, you can use whatever distance expression you want;
# absolute differences are used here so that the two terms cannot cancel.
zz = [(abs(7 - x) + abs(9 - y), x, y) for x, y in r]
zz.sort()
# Return the 50 best matches.
best = [(x, y) for a, x, y in zz[:50]]
print(best)
Can't you sort the tuples and perform binary search on the sorted array ?
I assume your database is built once and for all, and that the positions of the entries are not important. You can sort this array so that the tuples are in a given order. When a tuple is entered by the user, you just look in the middle of the sorted array. If the query value is larger than the center value, you repeat the work on the upper half, otherwise on the lower one.
The worst case is O(log n).
If you can "map" your reference data to x-y coordinates on a plane there is a nifty technique which allows you to select all points under a given distance/tolerance (using Hilbert curves).
Here is a detailed example.
One approach we are trying ourselves, which allows for the discrepancies between query and reference, is binning the float values. We are testing and want to offer the user the choice of different bin sizes. Bin sizes will be 0.1, 0.2, 0.3 or 0.4. So binning leaves us with between 50 and 200 bins, each with a corresponding integer value between 0 and 18, where 0 means there was no value within that bin. The reference data can be pre-binned and stored in the database. We can then take the binned query data and compare it with the reference data. One approach could be, for all bins, to subtract the query integer value from the reference integer value. By summing up all differences we get the similarity score, with the most similar reference entries resulting in the lowest scores.
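A minimal Python sketch of this binning scheme follows. It rests on several assumptions: the function names are hypothetical, the differences are summed as absolute values (so that positive and negative differences cannot cancel), and if two values fall into the same bin the later one simply overwrites the earlier one.

```python
def bin_entry(pairs, bin_size=0.1, max_value=20.0):
    """Turn a list of (float, integer) tuples into a fixed-length bin vector."""
    n_bins = int(round(max_value / bin_size))
    bins = [0] * n_bins  # 0 means "no value in this bin"
    for value, count in pairs:
        i = min(int(value / bin_size), n_bins - 1)
        bins[i] = count
    return bins

def score(query_bins, reference_bins):
    """Sum of absolute per-bin differences; lower means more similar."""
    return sum(abs(q - r) for q, r in zip(query_bins, reference_bins))

query = bin_entry([(7.2394, 2)])
close = bin_entry([(7.2401, 2)])  # same bin, same integer
far = bin_entry([(9.95, 3)])      # different bin and integer
print(score(query, close), score(query, far))  # 0 5
```

One caveat of plain binning is visible here: two values that differ only slightly but straddle a bin boundary are scored as a complete mismatch.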
Another (simpler) search option we want to offer is where the user only enters the float values. The integer values in both the query and the reference lists can then be set to 1. We then use the Hamming distance to compute the difference between the binned query and reference values. I have previously asked about an efficient algorithm for that search.
This binning is only one way of achieving our goal. I am open to other suggestions. Perhaps we can use Principal Component Analysis (PCA), as described here
