CDO regridding and calculating grid fractions - dataset

I have a global IGBP land use dataset in which the land cover consists of forest cover (denoted by '1') and non-forest cover (denoted by '0'), so each land grid cell has the value 1 or 0.
This dataset has a spatial resolution of approximately 1 km at the equator, but I am going to regrid it to approximately 100 km at the equator. For this new grid resolution I want to calculate the fraction of forest cover (the fraction of 1s) in each grid cell, but I am not sure how this can be done without GIS. Is there a way to do this with cdo remapping, or perhaps with Python?
Thank you in advance!

If you want to translate to a new grid whose cell size is an integer multiple of the original, then you can do
cdo gridboxmean,n,m in.nc out.nc
where n and m are the numbers of points to average over in the lon and lat directions.
Otherwise you can interpolate using conservative remapping, which means that you don't need to worry if the new grid is not a multiple of the old one:
cdo remapcon,new_grid_specification in.nc out.nc
Note that in the latter case, however, the result is only first order accurate. There is also a slightly slower second order conservative remapping available using the command remapcon2. The paper describing the two implemented conservative remapping methods is Jones (1999). For further info on remapping you can also see my video guide.
Thanks to Robert for also reminding me that you may need to convert to float, which would mean using the option
cdo -b f32
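For the Python route mentioned in the question, here is a minimal sketch of the integer-multiple case. It assumes the 0/1 mask is already loaded as a 2-D NumPy array with no missing values; the function name `forest_fraction` and the toy array are made up for illustration. It takes a plain unweighted mean over each box, so it does not account for cell area varying with latitude.

```python
import numpy as np

def forest_fraction(mask, n_lat, n_lon):
    """Average a 0/1 mask over non-overlapping n_lat x n_lon boxes.

    mask is a 2-D array (lat, lon) whose shape must be an exact
    multiple of the box size; the result is the fraction of 1s per box.
    """
    mask = np.asarray(mask, dtype=np.float32)  # float, so the mean is not truncated
    rows, cols = mask.shape
    if rows % n_lat or cols % n_lon:
        raise ValueError("grid shape must be a multiple of the box size")
    # Split each axis into (number of boxes, box size) and average inside boxes.
    boxes = mask.reshape(rows // n_lat, n_lat, cols // n_lon, n_lon)
    return boxes.mean(axis=(1, 3))

# Toy 4 x 4 mask coarsened with 2 x 2 boxes (hypothetical data).
toy = np.array([[1, 1, 0, 0],
                [1, 0, 0, 0],
                [0, 0, 1, 1],
                [0, 0, 1, 1]])
fractions = forest_fraction(toy, 2, 2)
```

Going from roughly 1 km to roughly 100 km would correspond to a box size of about 100 in each direction, provided the grid dimensions divide evenly.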

Related

How to efficiently access Microsoft.Maui.Devices.Sensor.Locations in SQL Server

This is more a design question so please bear with me.
I have a system that stores locations consisting of the ID, Longitude and Latitude.
I need to compare the distance between my current location and the locations in the database and only choose the ones that are within a certain distance.
I have the formula that calculates the distance between 2 locations based on the long/lat and that works great.
My issue is I may have tens of thousands of locations in the database and don't want to loop through them all every time I need a list of locations close by.
Not sure what other datapoint I can store with the location to make it so I only have to compare a smaller subset.
Thanks.
As was mentioned in the comments, SQL Server has had support for geospatial since (iirc) SQL 2008. And I know that there is support within .NET for that as well so you should be able to define the data and query it from within your application.
Since the datatype is index-able, k nearest neighbor queries are pretty efficient. There's even a topic in the documentation for that use case. Doing a lift and shift from that page:
DECLARE @g geography = 'POINT(-121.626 47.8315)';
SELECT TOP(7) SpatialLocation.ToString(), City
FROM Person.Address
WHERE SpatialLocation.STDistance(@g) IS NOT NULL
ORDER BY SpatialLocation.STDistance(@g);
If you need all the points within a given radius, omit the TOP clause and change the predicate on STDistance() to something like SpatialLocation.STDistance(@g) < 1000 (the SRID I typically use has meters as the unit of measure, so this would say 'within 1 km').
https://gis.stackexchange.com/ is a good place for in-depth advice on this topic.
A classic approach to quickly locating "nearby" values, is to "grid" the area of interest:
Associate each location with a "grid cell", where each cell is a convenient size. Pick a cell-edge-length such that most cells will hold a small number of values and/or that is similar to the distance range you typically query.
If cell edge is 1 km, and you need locations within 2 km, then get data from 5x5 cells centered at the "target" location.
This is guaranteed to include all data +- 2 km from any location within the central cell.
Apply distance formula to each returned location; some will be beyond 2 km.
I've only done this in memory, not from a DB. I think you add two columns, one for X cell number, other for Y cell number.
With indexes on both of those. So can efficiently get a range of Xs by a range of Ys.
Not sure if a combined "X,Y" index helps or not.
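The grid idea above can be sketched in memory as follows. This is only an illustration under assumptions: coordinates are taken to be planar x/y values in km (already projected), and the names `CELL_KM`, `cell_of`, `build_grid` and `nearby` are hypothetical. In a database, the two cell numbers would be the extra indexed columns.

```python
import math
from collections import defaultdict

CELL_KM = 1.0  # assumed cell edge length

def cell_of(x_km, y_km):
    """Integer (X, Y) cell numbers for a planar position in km."""
    return (math.floor(x_km / CELL_KM), math.floor(y_km / CELL_KM))

def build_grid(points):
    """points: iterable of (id, x_km, y_km); returns cell -> list of points."""
    grid = defaultdict(list)
    for pid, x, y in points:
        grid[cell_of(x, y)].append((pid, x, y))
    return grid

def nearby(grid, x, y, radius_km):
    """Ids within radius_km of (x, y), scanning only the surrounding cells."""
    reach = math.ceil(radius_km / CELL_KM)  # 2 km radius, 1 km cells -> 5x5 block
    cx, cy = cell_of(x, y)
    hits = []
    for ix in range(cx - reach, cx + reach + 1):
        for iy in range(cy - reach, cy + reach + 1):
            for pid, px, py in grid.get((ix, iy), ()):
                # Exact check: cells in the block's corners can lie beyond the radius.
                if math.hypot(px - x, py - y) <= radius_km:
                    hits.append(pid)
    return hits

points = [("a", 0.5, 0.5), ("b", 2.2, 0.5), ("c", 3.0, 3.0)]
grid = build_grid(points)
found = nearby(grid, 0.4, 0.4, 2.0)
```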

Spatial search with ravenDB

I have a rather specific spatial search I need to do. Basically, I have an object (let's call it obj1) with two locations; let's call them point A and point B.
I then have a collection of objects (let's call each one obj2), each with its own A and B locations.
I want to return the top 10 objects from the collection sorted by:
(distance from obj1 A to obj2A) + (the distance from obj1B to obj2B)
Any ideas?
Thanks,
Nick
Update:
Here's a little more detail on the documents and how I want to compare them.
The domain model:
Listing:
ListingId int
Title string
Price double
Origin Location
Destination Location
Location:
Post / Zipcode string
Latitude decimal
Longitude decimal
What I want to do is take a listing object (not in the database) and compare it with the collection of listings in the database. I want the query to return the top 12 (or x) listings sorted by the crow-flies distance between the origins plus the crow-flies distance between the destinations.
I don't care about the distance from origin to destination - only about the distance of origin to origin plus destination to destination.
Basically I'm trying to find listings where the starting and ending locations are close.
Please let me know if I can clarify more.
Thanks!
Here is how one would solve such a problem in MySQL 4.1 and MySQL 5.
The link from mysql 4.1 seems quite helpful, esp. the first example, it's pretty much what you are asking about.
But if this is not quite helpful, I guess you'd have to loop and do queries either on obj1 or obj2 against its counterpart table.
From an algorithmic perspective, I'd find the center of the bounding box, then pick candidates within an increasing radius until I find enough.
Also, I just want to point out that crow-flies distance over the globe is not Pythagorean distance, and a different formula must be used:
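private const double earthMeanRadiusMiles = 3958.8; // mean Earth radius in miles (approximate)

private static double DegreesToRadians(double degrees)
{
    return degrees * Math.PI / 180.0;
}

// Haversine great-circle distance in miles between two points given in degrees.
public static double GetDistance(double lat1, double lng1, double lat2, double lng2)
{
    double deltaLat = DegreesToRadians(lat2 - lat1);
    double deltaLong = DegreesToRadians(lng2 - lng1);
    double a = Math.Pow(Math.Sin(deltaLat / 2), 2) +
        Math.Cos(DegreesToRadians(lat1))
        * Math.Cos(DegreesToRadians(lat2))
        * Math.Pow(Math.Sin(deltaLong / 2), 2);
    return earthMeanRadiusMiles * (2 * Math.Atan2(Math.Sqrt(a), Math.Sqrt(1 - a)));
}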
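As a rough sketch of how that formula feeds the ranking asked for in the question, here is the same haversine calculation in Python, followed by a brute-force selection of the k listings with the smallest origin-to-origin plus destination-to-destination distance. The dictionary layout, the sample listings and the function names are assumptions for illustration; no spatial index is used, so every listing is scored.

```python
import heapq
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, approximate

def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance in miles between two points given in degrees."""
    d_lat = math.radians(lat2 - lat1)
    d_lng = math.radians(lng2 - lng1)
    a = (math.sin(d_lat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(d_lng / 2) ** 2)
    return EARTH_RADIUS_MILES * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

def closest_listings(query, listings, k=10):
    """The k listings minimising origin distance plus destination distance."""
    def score(listing):
        return (haversine_miles(*query["origin"], *listing["origin"])
                + haversine_miles(*query["destination"], *listing["destination"]))
    return heapq.nsmallest(k, listings, key=score)

# Hypothetical data: (lat, lng) pairs for origin and destination.
query = {"origin": (40.0, -75.0), "destination": (41.0, -74.0)}
listings = [
    {"id": 1, "origin": (40.1, -75.0), "destination": (41.0, -74.1)},
    {"id": 2, "origin": (45.0, -75.0), "destination": (41.0, -74.0)},
    {"id": 3, "origin": (40.0, -75.0), "destination": (41.0, -74.0)},
]
best = closest_listings(query, listings, k=2)
```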
Sounds like you're building a rideshare website. :)
The bottom line is that in order to sort your query result by surface distance, you'll need spatial indexing built into the database engine. I think your options here are MySQL with OpenGIS extensions (already mentioned) or PostgreSQL with PostGIS. It looks like it's possible in ravenDB too: http://ravendb.net/documentation/indexes/sptial
But if that's not an option, there's a few other ways. Let's simplify the problem and say you just want to sort your database records by their distance to location A, since you're just doing that twice and summing the result.
The simplest solution is to pull every record from the database and calculate the distance to location A one by one, then sort, in code. Trouble is, you end up doing a lot of redundant computations and pulling down the entire table for every query.
Let's once again simplify and pretend we only care about the Chebyshev (maximum) distance. This will work for narrowing our scope within the db before we get more accurate. We can do a "binary search" for nearby records. We must decide on an approximate number of closest records to return; let's say 10. Then we query inside a square area, let's say 1 degree latitude by 1 degree longitude (roughly 69 miles north-south by 50 miles east-west at this latitude), around the location of interest. Let's say our location of interest is lat,lng=43.5,86.5. Then our db query is SELECT COUNT(*) FROM locations WHERE (lat > 43 AND lat < 44) AND (lng > 86 AND lng < 87). If you have indexes on the lat/lng fields, that should be a fast query.
Our goal is to get just above 10 total results inside the box. Here's where the "binary search" comes in. If we only got 5 results, we double the box area and search again. If we got 100 results, we cut the area in half and search again. If we get 3 results immediately after that, we increase the box area by 50% (instead of 100%) and try again, proceeding until we get close enough to our 10 result target.
Finally we take this manageable set of records and calculate their euclidean distance from the location of interest, and sort, in code.
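A minimal sketch of that box search, under stated assumptions: an in-memory list of random points stands in for the table, `count_in_box` stands in for the SELECT COUNT(*) query, and the search bisects the box half-width rather than the area, which is a simplification of the procedure described above. All names are hypothetical.

```python
import random

def count_in_box(points, lat, lng, half_deg):
    """Stand-in for SELECT COUNT(*) with lat/lng range predicates."""
    return sum(1 for p_lat, p_lng in points
               if abs(p_lat - lat) < half_deg and abs(p_lng - lng) < half_deg)

def find_box(points, lat, lng, target=10, half_deg=0.5, max_steps=30):
    """Adjust the half-width until the box holds between target and 2*target points."""
    low, high = 0.0, None  # largest known too-small and smallest known too-large widths
    for _ in range(max_steps):
        n = count_in_box(points, lat, lng, half_deg)
        if target <= n <= 2 * target:
            break
        if n < target:
            low = half_deg
            # Keep doubling until some box is known to be too large, then bisect.
            half_deg = half_deg * 2 if high is None else (low + high) / 2
        else:
            high = half_deg
            half_deg = (low + high) / 2
    return half_deg

rng = random.Random(42)
points = [(rng.uniform(42.0, 45.0), rng.uniform(85.0, 88.0)) for _ in range(2000)]
half = find_box(points, 43.5, 86.5)
candidates = count_in_box(points, 43.5, 86.5, half)
```

The records inside the final box would then be fetched and sorted by their actual distance in code.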
Good luck!
I do not think that you will find a solution directly out of the box.
It'll be much more efficient if you use a bounding sphere instead of a bounding box to specify your object.
http://en.wikipedia.org/wiki/Bounding_sphere
C = ( A + B)/2 and R = distance(A,B) /2
You do not specify how much data you want to compare, or whether you want to find the closest or the farthest pairs of objects.
In both cases, I think that you have to encode the C coordinate as a path in an octree if you are using 3D, or a quadtree if you are using 2D.
http://en.wikipedia.org/wiki/Quadtree
This is a first draft; I can add more information if this is not enough.
If you are not familiar with 3D, start with 2D; it is easier to start with.
I saw your latest addition; it seems that your problem is very similar to a clash detection algorithm.
I think you could express the "end-point" in polar coordinates relative to the "start-point", round the radial coordinate to your tolerance (x miles), and order the objects by this value.

Calculate distance between Zip Codes... AND users.

This is more of a challenge question than something I urgently need, so don't spend all day on it guys.
I built a dating site (long gone) back in 2000 or so, and one of the challenges was calculating the distance between users so we could present your "matches" within an X mile radius. To just state the problem, given the following database schema (roughly):
USER TABLE
UserId
UserName
ZipCode
ZIPCODE TABLE
ZipCode
Latitude
Longitude
With USER and ZIPCODE being joined on USER.ZipCode = ZIPCODE.ZipCode.
What approach would you take to answer the following question: What other users live in Zip Codes that are within X miles of a given user's Zip Code.
We used the 2000 census data, which has tables for zip codes and their approximate latitude and longitude.
We also used the Haversine Formula to calculate distances between any two points on a sphere... pretty simple math really.
The question, at least for us, being the 19 year old college students we were, really became how to efficiently calculate and/or store distances from all members to all other members. One approach (the one we used) would be to import all the data and calculate the distance FROM every zip code TO every other zip code. Then you'd store and index the results. Something like:
SELECT User.UserId
FROM ZipCode AS MyZipCode
INNER JOIN ZipDistance ON MyZipCode.ZipCode = ZipDistance.MyZipCode
INNER JOIN ZipCode AS TheirZipCode ON ZipDistance.OtherZipCode = TheirZipCode.ZipCode
INNER JOIN User AS User ON TheirZipCode.ZipCode = User.ZipCode
WHERE ( MyZipCode.ZipCode = 75044 )
AND ( ZipDistance.Distance < 50 )
The problem, of course, is that the ZipDistance table is going to have a LOT of rows in it. It isn't completely unworkable, but it is really big. Also it requires complete pre-work on the whole data set, which is also not unmanageable, but not necessarily desirable.
Anyway, I was wondering what approach some of you gurus might take on something like this. Also, I think this is a common issue programmers have to tackle from time to time, especially if you consider problems that are just algorithmically similar. I'm interested in a thorough solution that includes at least HINTS on all the pieces to do this really quickly and efficiently. Thanks!
Ok, for starters, you don't really need to use the Haversine formula here. For large distances where a less accurate formula produces a larger error, your users don't care if the match is plus or minus a few miles, and for closer distances, the error is very small. There are easier (to calculate) formulas listed on the Geographical Distance Wikipedia article.
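One of the cheaper formulas is the equirectangular approximation, which treats the small patch of sphere between the two points as flat. The sketch below is an illustration rather than the answer's own code; the function name and the radius constant are assumptions, and the two coordinates are the 91210 and 91214 rows from the sample zip code file quoted elsewhere in this thread.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, approximate

def equirectangular_miles(lat1, lng1, lat2, lng2):
    """Flat-patch approximation of the distance in miles; inputs in degrees."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    # Scale the east-west difference by cos(latitude): meridians converge poleward.
    x = math.radians(lng2 - lng1) * math.cos(mean_lat)
    y = math.radians(lat2 - lat1)
    return EARTH_RADIUS_MILES * math.hypot(x, y)

d = equirectangular_miles(34.1419, -118.261, 34.2325, -118.246)
```

It needs one cosine and one square root per pair, against several trigonometric calls for haversine, and the error grows with distance and near the poles.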
Since zip codes are nothing like evenly spaced, any process that partitions them evenly is going to suffer mightily in areas where they are clustered tightly (east coast near DC being a good example). If you want a visual comparison, check out http://benfry.com/zipdecode and compare the zipcode prefix 89 with 07.
A far better way to deal with indexing this space is to use a data structure like a Quadtree or an R-tree. This structure allows you to do spatial and distance searches over data which is not evenly spaced.
Here's what a Quadtree looks like (figure not included):
To search over it, you drill down through each larger cell using the index of smaller cells that are within it. Wikipedia explains it more thoroughly.
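To make the drill-down concrete, here is a small point-quadtree sketch. It is a teaching illustration under assumptions, not what PostGIS or any database does internally: the class name, the `capacity` parameter and the use of (longitude, latitude) as plain x/y are all choices made here, and it assumes points have distinct coordinates (many identical points would keep splitting forever).

```python
class QuadTree:
    """Point quadtree over the box [x0, x1) x [y0, y1)."""

    def __init__(self, x0, y0, x1, y1, capacity=4):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.points = []      # points stored here while this node is a leaf
        self.children = None  # four sub-cells once the leaf has split

    def insert(self, x, y, item):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False  # point is outside this cell
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((x, y, item))
                return True
            self._split()
        # Exactly one child contains the point; any() stops at the first success.
        return any(child.insert(x, y, item) for child in self.children)

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [QuadTree(x0, y0, mx, my, self.capacity),
                         QuadTree(mx, y0, x1, my, self.capacity),
                         QuadTree(x0, my, mx, y1, self.capacity),
                         QuadTree(mx, my, x1, y1, self.capacity)]
        for px, py, item in self.points:
            any(child.insert(px, py, item) for child in self.children)
        self.points = []

    def query(self, qx0, qy0, qx1, qy1):
        """Items whose point lies in [qx0, qx1] x [qy0, qy1]."""
        x0, y0, x1, y1 = self.bounds
        if qx1 < x0 or qx0 >= x1 or qy1 < y0 or qy0 >= y1:
            return []  # the query box misses this cell, so skip the whole subtree
        found = [item for x, y, item in self.points
                 if qx0 <= x <= qx1 and qy0 <= y <= qy1]
        for child in self.children or []:
            found.extend(child.query(qx0, qy0, qx1, qy1))
        return found

# Coordinates taken from the sample zip code rows quoted in this thread.
tree = QuadTree(-180.0, -90.0, 180.0, 90.0, capacity=2)
for zip_code, lat, lng in [("00601", 18.166, -66.7236), ("00602", 18.383, -67.1866),
                           ("91210", 34.1419, -118.261), ("91214", 34.2325, -118.246),
                           ("91221", 34.1653, -118.289)]:
    tree.insert(lng, lat, zip_code)
la_area = tree.query(-119.0, 34.0, -118.0, 35.0)
```

Densely clustered areas simply get deeper subdivisions, which is why uneven zip code spacing is less of a problem than with a fixed grid.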
Of course, since this is a fairly common thing to do, someone else has already done the hard part for you. Since you haven't specified what database you're using, the PostgreSQL extension PostGIS will serve as an example. PostGIS includes the ability to do R-tree spatial indexes which allow you to do efficient spatial querying.
Once you've imported your data and built the spatial index, querying for distance is a query like:
SELECT zip
FROM zipcode
WHERE
geom && expand(transform(PointFromText('POINT(-116.768347 33.911404)', 4269),32661), 16093)
AND
distance(
transform(PointFromText('POINT(-116.768347 33.911404)', 4269),32661),
geom) < 16093
I'll let you work through the rest of the tutorial yourself.
http://unserializableone.blogspot.com/2007/02/using-postgis-to-find-points-of.html
Here are some other references to get you started.
http://www.bostongis.com/PrinterFriendly.aspx?content_name=postgis_tut02
http://www.manning.com/obe/PostGIS_MEAPCH01.pdf
http://postgis.refractions.net/docs/ch04.html
I'd simply just create a zip_code_distances table and pre-compute the distances between all 42K zipcodes in the US which are within a 20-25 mile radius of each other.
create table zip_code_distances
(
from_zip_code mediumint not null,
to_zip_code mediumint not null,
distance decimal(6,2) default 0.0,
primary key (from_zip_code, to_zip_code),
key (to_zip_code)
)
engine=innodb;
Only including zipcodes within a 20-25 mile radius of each other reduces the number of rows you need to store in the distance table from its maximum of 1.7 billion (42K ^ 2) - 42K to a much more manageable 4 million or so.
I downloaded a zipcode datafile from the web which contained the longitudes and latitudes of all the official US zipcodes in csv format:
"00601","Adjuntas","Adjuntas","Puerto Rico","PR","787","Atlantic", 18.166, -66.7236
"00602","Aguada","Aguada","Puerto Rico","PR","787","Atlantic", 18.383, -67.1866
...
"91210","Glendale","Los Angeles","California","CA","818","Pacific", 34.1419, -118.261
"91214","La Crescenta","Los Angeles","California","CA","818","Pacific", 34.2325, -118.246
"91221","Glendale","Los Angeles","California","CA","818","Pacific", 34.1653, -118.289
...
I wrote a quick and dirty C# program to read the file and compute the distances between every zipcode but only output zipcodes that fall within a 25 mile radius:
sw = new StreamWriter(path);
foreach (ZipCode fromZip in zips){
foreach (ZipCode toZip in zips)
{
if (toZip.ZipArea == fromZip.ZipArea) continue;
double dist = ZipCode.GetDistance(fromZip, toZip);
if (dist > 25) continue;
string s = string.Format("{0}|{1}|{2}", fromZip.ZipArea, toZip.ZipArea, dist);
sw.WriteLine(s);
}
}
The resultant output file looks as follows:
from_zip_code|to_zip_code|distance
...
00601|00606|16.7042215574185
00601|00611|9.70353520976393
00601|00612|21.0815707704904
00601|00613|21.1780461311929
00601|00614|20.101431539283
...
91210|90001|11.6815708119899
91210|90002|13.3915723402714
91210|90003|12.371251171873
91210|90004|5.26634939906721
91210|90005|6.56649623829871
...
I would then just load this distance data into my zip_code_distances table using load data infile and then use it to limit the search space of my application.
For example if you have a user whose zipcode is 91210 and they want to find people who are within a 10 mile radius of them then you can now simply do the following:
select
p.*
from
people p
inner join
(
select
to_zip_code
from
zip_code_distances
where
from_zip_code = 91210 and distance <= 10
) search
on p.zip_code = search.to_zip_code
where
p.gender = 'F'....
Hope this helps
EDIT: extended radius to 100 miles which increased the number of zipcode distances to 32.5 million rows.
quick performance check for zipcode 91210 runtime 0.009 seconds.
select count(*) from zip_code_distances
count(*)
========
32589820
select
to_zip_code
from
zip_code_distances
where
from_zip_code = 91210 and distance <= 10;
0:00:00.009: Query OK
You could shortcut the calculation by just assuming a box instead of a circular radius. Then when searching you simply calculate the lower/upper bound of lat/lon for a given point+"radius", and as long as you have an index on the lat/lon columns you could pull back all records that fall within the box pretty easily.
I know this post is very old, but while doing some research for a client I found a useful feature of the Google Maps API that is simple to implement: you pass the origin and destination ZIP codes in the URL, and it calculates the distance, even taking traffic into account. You can use it from any language:
origins = 90210
destinations = 93030
mode = driving
http://maps.googleapis.com/maps/api/distancematrix/json?origins=90210&destinations=93030&mode=driving&language=en-EN&sensor=false%22
Following the link, you can see that it returns JSON. Remember that you need an API key to use this on your own hosting.
source:
http://stanhub.com/find-distance-between-two-postcodes-zipcodes-driving-time-in-current-traffic-using-google-maps-api/
You could divide your space into regions of roughly equal size -- for instance, approximate the earth as a buckyball or icosahedron. The regions could even overlap a bit, if that's easier (e.g. make them circular). Record which region(s) each ZIP code is in. Then you can precalculate the maximum distance possible between every region pair, which has the same O(n^2) problem as calculating all the ZIP code pairs, but for smaller n.
Now, for any given ZIP code, you can get a list of regions that are definitely within your given range, and a list of regions that cross the border. For the former, just grab all the ZIP codes. For the latter, drill down into each border region and calculate against individual ZIP codes.
It's certainly more complex mathematically, and in particular the number of regions would have to be chosen for a good balance between the size of the table vs. the time spent calculating on the fly, but it reduces the size of the precalculated table by a good margin.
I would use latitude and longitude. For example, if you have a latitude of 45 and a longitude of 45 and were asked to find matches within 50 miles, then you could move 50/69ths of a degree up in latitude and 50/69ths of a degree down in latitude (1 degree of latitude ~ 69 miles). Select zip codes with latitudes in this range. Longitudes are a little different, because a degree of longitude gets smaller as you move closer to the poles.
But at 45 degrees, 1 degree of longitude ~ 49 miles, so you could move 50/49ths of a degree left in longitude and 50/49ths of a degree right in longitude, and select all zip codes from the latitude set that fall within this longitude range. This gives you all zip codes within a square with sides of a hundred miles. If you wanted to be really precise, you could then use the Haversine formula which you mentioned to weed out zips in the corners of the box, to give you a circle.
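The arithmetic above can be sketched as a small helper. The function name and the 69 miles per degree constant are assumptions taken from the figures in this answer; the sketch does not handle boxes that cross the poles or the 180-degree meridian.

```python
import math

MILES_PER_DEG_LAT = 69.0  # approximate figure used in the text above

def bounding_box(lat, lng, radius_miles):
    """Return (lat_min, lat_max, lng_min, lng_max) in degrees around a point."""
    d_lat = radius_miles / MILES_PER_DEG_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    miles_per_deg_lng = MILES_PER_DEG_LAT * math.cos(math.radians(lat))
    d_lng = radius_miles / miles_per_deg_lng
    return (lat - d_lat, lat + d_lat, lng - d_lng, lng + d_lng)

lat_min, lat_max, lng_min, lng_max = bounding_box(45.0, 45.0, 50.0)
```

The four values map directly onto two BETWEEN predicates on indexed latitude and longitude columns.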
Not every possible pair of zip codes are going to be used. I would build zipdistance as a 'cache' table. For each request calculate the distance for that pair and save it in the cache. When a request for a distance pair comes, first look in the cache, then compute if it's not available.
I do not know the intricacies of distance calculations, so I would also check whether computing on the fly is cheaper than looking up (also taking into consideration how often you have to compute).
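A minimal in-memory sketch of such a cache, assuming distance is symmetric so each unordered pair is stored once. The class name is hypothetical, a dictionary stands in for the zipdistance table, and `fake_distance` is a placeholder for a real coordinate lookup plus distance formula.

```python
class PairCache:
    """Compute-on-demand cache keyed by an unordered pair of zip codes."""

    def __init__(self, compute):
        self.compute = compute  # function (zip_a, zip_b) -> distance
        self.table = {}         # stands in for the zipdistance table
        self.misses = 0         # how many times the distance was computed

    def distance(self, zip_a, zip_b):
        # Normalise the key so (a, b) and (b, a) share one entry.
        key = (min(zip_a, zip_b), max(zip_a, zip_b))
        if key not in self.table:
            self.misses += 1
            self.table[key] = self.compute(*key)
        return self.table[key]

def fake_distance(zip_a, zip_b):
    # Placeholder only: NOT a real distance between zip codes.
    return abs(int(zip_a) - int(zip_b)) / 100.0

cache = PairCache(fake_distance)
first = cache.distance("91210", "91214")
second = cache.distance("91214", "91210")  # same pair, served from the cache
```

Counting misses against total requests is one way to check whether the lookup is actually saving work compared with computing on the fly.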
I have the problem running great, and pretty much everyone's answer got used. I was thinking about this in terms of the old solution instead of just "starting over." Babtek gets the nod for stating it in simplest terms.
I'll skip the code because I'll provide references to derive the needed formulas, and there is too much to cleanly post here.
1. Consider Point A on a sphere, represented by latitude and longitude. Figure out North, South, East, and West edges of a box 2X miles across with Point A at the center.
2. Select all points within the box from the ZipCode table. This includes a simple WHERE clause with two Between statements limiting by Lat and Long.
3. Use the haversine formula to determine the spherical distance between Point A and every point B returned in step 2.
4. Discard all points B where distance A -> B > X.
5. Select users where ZipCode is in the remaining set of points B.
This is pretty fast for > 100 miles. Longest result was ~ 0.014 seconds to calculate the match, and trivial to run the select statement.
Also, as a side note, it was necessary to implement the math in a couple of functions and call them in SQL. Once I got past a certain distance the matching number of ZipCodes was too large to pass back to SQL and use as an IN statement, so I had to use a temp table and join the resulting ZipCodes to User on the ZipCode column.
I suspect that using a ZipDistance table will not provide a long-term performance gain. The number of rows just gets really big. If you calculate the distance from every zip code to every other zip code (eventually) then the resultant row count from 40,000 zip codes would be ~ 1.6B. Whoa!
Alternately, I am interested in using SQL's built in geography type to see if that will make this easier, but good old int/float types served fine for this sample.
So... final list of online resources I used, for your easy reference:
Maximum Difference, Latitude and Longitude.
The Haversine Formula.
Lengthy but complete discussion of the whole process, which I found from Googling stuff in your answers.

About curse of dimensionality

My question is about this topic I've been reading about a bit. Basically my understanding is that in higher dimensions all points end up being very close to each other.
The doubt I have is whether this means that calculating distances the usual way (euclidean for instance) is valid or not. If it were still valid, this would mean that when comparing vectors in high dimensions, the two most similar wouldn't differ much from a third one even when this third one could be completely unrelated.
Is this correct? Then in this case, how would you be able to tell whether you have a match or not?
Basically the distance measurement is still correct, however, it becomes meaningless when you have "real world" data, which is noisy.
The effect we talk about here is that a high distance between two points in one dimension gets quickly overshadowed by small distances in all the other dimensions. That's why in the end, all points somewhat end up with the same distance. There exists a good illustration for this:
Say we want to classify data based on their value in each dimension. We just say we divide each dimension once (which has a range of 0..1). Values in [0, 0.5) are positive, values in [0.5, 1] are negative. With this rule, in 3 dimensions, the all-positive region covers 12.5% of the space. In 5 dimensions, it is only 3.1%. In 10 dimensions, it is less than 0.1%.
So in each dimension we still allow half of the overall value range! Which is quite much. But all of it ends up in 0.1% of the total space -- the differences between these data points are huge in each dimension, but negligible over the whole space.
You can go further and say in each dimension you cut only 10% of the range. So you allow values in [0, 0.9). You still end up with less than 35% of the whole space covered in 10 dimensions. In 50 dimensions, it is 0.5%. So you see, wide ranges of data in each dimension are crammed into a very small portion of your search space.
That's why you need dimensionality reduction, where you basically disregard differences on less informative axes.
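The percentages quoted above follow from raising the per-axis fraction to the power of the number of dimensions. A short sketch to reproduce them (the function name is made up for illustration):

```python
def covered_fraction(per_axis_fraction, dimensions):
    """Share of the unit hypercube kept when each axis keeps the same fraction."""
    return per_axis_fraction ** dimensions

# The cases discussed above: half of each axis, then 90% of each axis.
for kept, dims in [(0.5, 3), (0.5, 5), (0.5, 10), (0.9, 10), (0.9, 50)]:
    share = covered_fraction(kept, dims)
    print(f"keep {kept:.0%} per axis, {dims:>2} dims: {share:.2%} of the space")
```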
Here is a simple explanation in layman terms.
I tried to illustrate this with a simple illustration shown below.
Suppose you have some data features x1 and x2 (you can assume they are blood pressure and blood sugar levels) and you want to perform K-nearest neighbor classification. If we plot the data in 2D, we can easily see that the data nicely group together, each point has some close neighbors that we can use for our calculations.
Now let's say we decide to consider a new third feature x3 (say age) for our analysis.
Case (b) shows a situation where all of our previous data comes from people the same age. You can see that they are all located at the same level along the age (x3) axis.
Now we can quickly see that if we want to consider age for our classification, there is a lot of empty space along the age(x3) axis.
The data that we currently have only cover a single level for the age. What happens if we want to make a prediction for someone who has a different age (red dot)?
As you can see, there are not enough data points close to this point to calculate the distance and find some neighbors. So, if we want to have good predictions with this new third feature, we have to go and gather more data from people of different ages to fill the empty space along the age axis.
Case (c) essentially shows the same concept. Here, assume our initial data were gathered from people of different ages (i.e. we did not care about age in our previous two-feature classification task and might have assumed that this feature has no effect on our classification).
In this case, assume our 2D data come from people of different ages (the third feature). Now, what happens to our relatively closely located 2D data if we plot them in 3D? We can see that they are now more distant from each other (more sparse) in our new higher-dimensional space (3D). As a result, finding the neighbors becomes harder, since we don't have enough data for different values along our new third feature.
You can imagine that as we add more dimensions the data become more and more spread apart. (In other words, we need more and more data if we want to avoid sparsity in our data.)
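A small experiment can make the sparsity visible. The sketch below is an illustration under assumptions (uniform random points in the unit cube, a fixed sample of 200 points, a fixed seed): it measures the average distance from each point to its nearest neighbour as features are added.

```python
import math
import random

def mean_nearest_neighbor_distance(dims, n_points=200, seed=0):
    """Average distance to the closest other point for uniform random data."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dims)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        # Brute-force nearest neighbour: fine for a couple of hundred points.
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points

sparsity = {dims: mean_nearest_neighbor_distance(dims) for dims in (2, 3, 10)}
```

With the sample size held fixed, the nearest neighbour drifts further away as dimensions are added, which is the "empty space" effect described above.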

Efficient comparison of 1 million vectors containing (float, integer) tuples

I am working in a chemistry/biology project. We are building a web-application for fast matching of the user's experimental data with predicted data in a reference database. The reference database will contain up to a million entries. The data for one entry is a list (vector) of tuples containing a float value between 0.0 and 20.0 and an integer value between 1 and 18. For instance (7.2394 , 2) , (7.4011, 1) , (9.9367, 3) , ... etc.
The user will enter a similar list of tuples and the web-app must then return the - let's say - top 50 best matching database entries.
One thing is crucial: the search algorithm must allow for discrepancies between the query data and the reference data because both can contain small errors in the float values (NOT in the integer values). (The query data can contain errors because it is derived from a real-life experiment and the reference data because it is the result of a prediction.)
Edit - Moved text to answer -
How can we get an efficient ranking of 1 query on 1 million records?
You should add a physicist to the project :-) This is a very common problem to compare functions e.g. look here:
http://en.wikipedia.org/wiki/Autocorrelation
http://en.wikipedia.org/wiki/Correlation_function
In the first link you can read: "The SEQUEST algorithm for analyzing mass spectra makes use of autocorrelation in conjunction with cross-correlation to score the similarity of an observed spectrum to an idealized spectrum representing a peptide."
An efficient linear scan of 1 million records of that type should take a fraction of a second on a modern machine; a compiled loop should be able to do it at about memory bandwidth, which would transfer that in two or three milliseconds.
But, if you really need to optimise this, you could construct a hash table of the integer values, which would divide the job by the number of integer bins. And, if the data is stored sorted by the floats, that improves the locality of matching by those; you know you can stop once you're out of tolerance. Storing the offsets of each of a number of bins would give you a position to start.
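One way to read that suggestion, sketched under assumptions: bucket every reference tuple by its integer value, keep each bucket sorted by the float, and use binary search to jump to the start of the tolerance window. The function names, the tolerance and the "count matched peaks" score are illustrative choices, not something specified in the question; a reference entry with two values inside one window is counted twice.

```python
import bisect
from collections import Counter, defaultdict

def build_index(entries):
    """entries: {entry_id: [(float_value, int_value), ...]} -> int -> sorted list."""
    index = defaultdict(list)
    for entry_id, peaks in entries.items():
        for value, kind in peaks:
            index[kind].append((value, entry_id))
    for bucket in index.values():
        bucket.sort()  # sorted by float, so a tolerance window is contiguous
    return index

def rank(index, query, tol=0.05, top=50):
    """Rank entries by how many query tuples they match within +/- tol."""
    hits = Counter()
    for value, kind in query:
        bucket = index.get(kind, [])
        # (value - tol,) sorts before any (value - tol, id), so this is the window start.
        start = bisect.bisect_left(bucket, (value - tol,))
        for ref_value, entry_id in bucket[start:]:
            if ref_value > value + tol:
                break  # out of tolerance, and the rest of the bucket is larger still
            hits[entry_id] += 1
    return hits.most_common(top)

# Hypothetical reference entries; "A" reuses the example values from the question.
entries = {
    "A": [(7.2394, 2), (7.4011, 1), (9.9367, 3)],
    "B": [(7.31, 2), (9.90, 3)],
    "C": [(1.0, 5)],
}
index = build_index(entries)
ranked = rank(index, [(7.25, 2), (9.93, 3)])
```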
I guess I don't see the need for a fancy algorithm yet... describe the problem a bit more, perhaps (you can assume a fairly high level of chemistry and physics knowledge if you like; I'm a physicist by training)?
Ok, given the extra info, I still see no need for anything better than a direct linear search, if there's only 1 million reference vectors and the algorithm is that simple. I just tried it, and even a pure Python implementation of linear scan took only around three seconds. It took several times longer to make up some random data to test with. This does somewhat depend on the rather lunatic level of optimisation in Python's sorting library, but that's the advantage of high level languages.
import random
# one million random (float, integer) reference tuples
r = [(random.uniform(0, 20), random.randint(1, 18)) for i in range(1000000)]
# this is a decorate-sort-undecorate pattern
# look for matches to (7, 9)
# obviously, you can use whatever distance expression you want
zz = [(abs(7 - x) + abs(9 - y), x, y) for x, y in r]
zz.sort()
# return the 50 best matches
best = [(x, y) for a, x, y in zz[:50]]
Can't you sort the tuples and perform a binary search on the sorted array?
I assume your database is built once and for all, and the positions of the entries are not important. You can sort this array so that the tuples are in a given order. When a tuple is entered by the user, you just look in the middle of the sorted array. If the query value is larger than the center value, you repeat the work on the upper half, otherwise on the lower one.
Worst case is O(log n).
If you can "map" your reference data to x-y coordinates on a plane there is a nifty technique which allows you to select all points under a given distance/tolerance (using Hilbert curves).
Here is a detailed example.
One approach we are trying ourselves, which allows for the discrepancies between query and reference, is binning the float values. We are testing and want to offer the user the choice of different bin sizes. Bin sizes will be 0.1, 0.2, 0.3 or 0.4. So binning leaves us with between 50 and 200 bins, each with a corresponding integer value between 0 and 18, where 0 means there was no value within that bin. The reference data can be pre-binned and stored in the database. We can then take the binned query data and compare it with the reference data. One approach could be, for all bins, to subtract the query integer value from the reference integer value. By summing up all differences we get the similarity score, with the most similar reference entries resulting in the lowest scores.
Another (simpler) search option we want to offer is where the user only enters the float values. The integer values in both the query and the reference lists can then be set to 1. We then use Hamming distance to compute the difference between the query and the reference binned values. I have previously asked about an efficient algorithm for that search.
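The binning and the two scores can be sketched as follows. Several details are assumptions made for this illustration and are not stated above: differences are summed as absolute values, a later value overwrites an earlier one that lands in the same bin, and the Hamming variant compares only whether a bin is occupied. The function names are hypothetical.

```python
def bin_entry(peaks, bin_size, max_value=20.0):
    """Turn [(float, int), ...] into a fixed-length list of integers (0 = empty bin)."""
    n_bins = int(round(max_value / bin_size))
    bins = [0] * n_bins
    for value, kind in peaks:
        idx = min(int(value / bin_size), n_bins - 1)  # clamp value == max_value
        bins[idx] = kind  # assumption: a later value overwrites one in the same bin
    return bins

def difference_score(query_bins, ref_bins):
    """Sum of absolute per-bin differences; lower means more similar."""
    return sum(abs(q - r) for q, r in zip(query_bins, ref_bins))

def hamming_score(query_bins, ref_bins):
    """Number of bins occupied in one list but not in the other."""
    return sum((q != 0) != (r != 0) for q, r in zip(query_bins, ref_bins))

query = bin_entry([(7.25, 2), (9.93, 3)], 0.2)
reference = bin_entry([(7.2394, 2), (7.4011, 1), (9.9367, 3)], 0.2)
diff = difference_score(query, reference)
hamming = hamming_score(query, reference)
```

A value close to a bin edge can still land in a different bin from its counterpart, which is one reason the choice of bin size matters.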
This binning is only one way of achieving our goal. I am open to other suggestions. Perhaps we can use Principal Component Analysis (PCA), as described here.

Resources