Memory-efficient line stitching in very large images - c

Background
I work with very large datasets from Synthetic Aperture Radar satellites. These can be thought of as high dynamic range greyscale images of the order of 10k pixels on a side.
Recently, I've been developing applications of a single-scale variant of Lindeberg's scale-space ridge detection method for detecting linear features in a SAR image. This is an improvement on directional filters and the Hough Transform, both of which have been used previously, because it is less computationally expensive than either. (I will be presenting some recent results at JURSE 2011 in April, and I can upload a preprint if that would be helpful.)
The code I currently use generates an array of records, one per pixel, each of which describes a ridge segment in the rectangle to bottom right of the pixel and bounded by adjacent pixels.
struct ridge_t { unsigned char top, left, bottom, right; };
int rows, cols;
struct ridge_t *ridges; /* An array of rows*cols ridge entries */
An entry in ridges contains a ridge segment if exactly two of top, left, right and bottom have values in the range 0 - 128. Suppose I have:
struct ridge_t entry;
entry.top = 25; entry.left = 255; entry.bottom = 255; entry.right = 76;
Then I can find the ridge segment's start (x1,y1) and end (x2,y2):
float x1, y1, x2, y2;
x1 = (float) col + (float) entry.top / 128.0;
y1 = (float) row;
x2 = (float) col + 1;
y2 = (float) row + (float) entry.right / 128.0;
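The same calculation can be generalised to any pair of edges. The following is a minimal sketch, not the code actually in use: the helper name `ridge_endpoints` and the constant `RIDGE_MAX` are hypothetical, and the formulas for the left and bottom edges are assumed to mirror the top and right formulas shown above.

```c
struct ridge_t { unsigned char top, left, bottom, right; };

/* A value in 0..128 marks a crossing on that edge; anything else means "none". */
#define RIDGE_MAX 128

/* Writes the endpoints of the segment in cell (row, col) to xy as
   x1, y1, x2, y2. Returns 1 if the cell holds a segment (exactly two
   valid edges) and 0 otherwise. The left and bottom formulas are an
   assumption made by symmetry with the top and right ones. */
int ridge_endpoints(const struct ridge_t *e, int row, int col, float xy[4])
{
    float px[4], py[4];
    int n = 0;

    if (e->top <= RIDGE_MAX) {
        px[n] = col + e->top / 128.0f;    py[n] = (float)row;              n++;
    }
    if (e->left <= RIDGE_MAX) {
        px[n] = (float)col;               py[n] = row + e->left / 128.0f;  n++;
    }
    if (e->bottom <= RIDGE_MAX) {
        px[n] = col + e->bottom / 128.0f; py[n] = (float)(row + 1);        n++;
    }
    if (e->right <= RIDGE_MAX) {
        px[n] = (float)(col + 1);         py[n] = row + e->right / 128.0f; n++;
    }
    if (n != 2)
        return 0;

    xy[0] = px[0]; xy[1] = py[0];
    xy[2] = px[1]; xy[3] = py[1];
    return 1;
}
```

For the example entry above at row 0, column 0, this gives the start point (25/128, 0) on the top edge and the end point (1, 76/128) on the right edge.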
When these individual ridge segments are rendered, I get an image something like this (a very small corner of a far larger image):
Each of those long curves is rendered from a series of tiny ridge segments.
It's trivial to determine whether two adjacent locations which contain ridge segments are connected. If I have ridge1 at (x, y) and ridge2 at (x+1, y), then they are parts of the same line if 0 <= ridge1.right <= 128 and ridge2.left == ridge1.right.
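As a small sketch of that test in C (the function names are hypothetical, and the vertical case is an assumption made by symmetry, since only the horizontal rule is stated above):

```c
struct ridge_t { unsigned char top, left, bottom, right; };

/* The segment at (x, y) joins the one at (x+1, y) when both cells record
   the same crossing on their shared vertical edge. */
int connected_right(const struct ridge_t *r1, const struct ridge_t *r2)
{
    return r1->right <= 128 && r2->left == r1->right;
}

/* Assumed mirror rule for (x, y) and (x, y+1) on the shared horizontal edge. */
int connected_below(const struct ridge_t *r1, const struct ridge_t *r2)
{
    return r1->bottom <= 128 && r2->top == r1->bottom;
}
```

Because each check reads only two neighbouring records and writes nothing, it can be evaluated for every pixel independently.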
Problem
Ideally, I would like to stitch together all of the ridge segments into lines, so that I can then iterate over each line found in the image to apply further computations. Unfortunately, I'm struggling to find an algorithm for doing this that is low complexity, memory-efficient and suitable for multiprocessing (all important considerations when dealing with really huge images!).
One approach that I have considered is scanning through the image until I find a ridge which only has one linked ridge segment, and then walking the resulting line, flagging any ridges in the line as visited. However, this is unsuitable for multiprocessing, because there's no way to tell if there isn't another thread walking the same line from the other direction (say) without expensive locking.
What do readers suggest as a possible approach? It seems like the sort of thing that someone would have figured out an efficient way to do in the past...

I'm not entirely sure this is correct, but I thought I'd throw it out for comment. First, let me introduce a lockless disjoint set algorithm, which will form an important part of my proposed algorithm.
Lockless disjoint set algorithm
I assume the presence of a two-pointer-sized compare-and-swap operation on your choice of CPU architecture. This is available on x86 and x64 architectures at the least.
The algorithm is largely the same as described on the Wikipedia page for the single-threaded case, with some modifications for safe lockless operation. First, we require the rank and parent elements to both be pointer-sized, and aligned to 2*sizeof(pointer) in memory, for atomic CAS later on.
Find() need not change; the worst case is that the path compression optimization will fail to have full effect in the presence of simultaneous writers.
Union() however, must change:
function Union(x, y)
redo:
    x = Find(x)
    y = Find(y)
    if x == y
        return
    xSnap = AtomicRead(x) -- read both rank and pointer atomically
    ySnap = AtomicRead(y) -- this operation may be done using a CAS
    if (xSnap.parent != x || ySnap.parent != y)
        goto redo
    -- Ensure x has lower rank (meaning y will be the new root)
    if (xSnap.rank > ySnap.rank)
        swap(xSnap, ySnap)
        swap(x, y)
    -- if same rank, use pointer value as a fallback sort
    else if (xSnap.rank == ySnap.rank && x > y)
        swap(xSnap, ySnap)
        swap(x, y)
    yNew = ySnap
    yNew.rank = max(yNew.rank, xSnap.rank + 1)
    xNew = xSnap
    xNew.parent = y
    if (!CAS(y, ySnap, yNew))
        goto redo
    if (!CAS(x, xSnap, xNew))
        goto redo
    return
This should be safe in that it will never form loops, and will always result in a proper union. We can confirm this by observing that:
First, prior to termination, one of the two roots will always end up with a parent pointing to the other. Therefore, as long as there is no loop, the merge succeeds.
Second, rank always increases. After comparing the order of x and y, we know x has lower rank than y at the time of the snapshot. In order for a loop to form, another thread would need to have increased x's rank first, then merged x and y. However in the CAS that writes x's parent pointer, we check that rank has not changed; therefore, y's rank must remain greater than x.
In the event of simultaneous mutation, it is possible that y's rank may be increased, then return to redo due to a conflict. However, this implies that either y is no longer a root (in which case rank is irrelevant) or that y's rank has been increased by another process (in which case the second go around will have no effect and y will have correct rank).
Therefore, there should be no chance of loops forming, and this lockless disjoint-set algorithm should be safe.
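For concreteness, here is a rough C11 sketch of the Union() pseudocode. It departs from the description in one respect, which is purely an assumption of this sketch: instead of a two-pointer-wide CAS, each node packs a 32-bit rank and a 32-bit parent index into a single 64-bit word, so an ordinary `atomic_compare_exchange_strong` updates both at once. The names (`ds_node`, `ds_find`, `ds_union`, `DS_MAX`) are hypothetical, Find() omits path compression for brevity, and the code has not been exercised under real contention.

```c
#include <stdatomic.h>
#include <stdint.h>

#define DS_MAX 1024                      /* fixed capacity, for the sketch only */
static _Atomic uint64_t ds_node[DS_MAX]; /* (rank << 32) | parent index */

static uint64_t ds_pack(uint32_t rank, uint32_t parent)
{
    return ((uint64_t)rank << 32) | parent;
}
static uint32_t ds_parent(uint64_t w) { return (uint32_t)w; }
static uint32_t ds_rank(uint64_t w)   { return (uint32_t)(w >> 32); }

/* Every element starts as its own root with rank 0. */
void ds_init(uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        atomic_store(&ds_node[i], ds_pack(0, i));
}

/* Follows parent links to the root; no path compression in this sketch. */
uint32_t ds_find(uint32_t x)
{
    for (;;) {
        uint32_t p = ds_parent(atomic_load(&ds_node[x]));
        if (p == x)
            return x;
        x = p;
    }
}

void ds_union(uint32_t x, uint32_t y)
{
    for (;;) {                           /* each "continue" is a "goto redo" */
        x = ds_find(x);
        y = ds_find(y);
        if (x == y)
            return;
        uint64_t xs = atomic_load(&ds_node[x]);
        uint64_t ys = atomic_load(&ds_node[y]);
        if (ds_parent(xs) != x || ds_parent(ys) != y)
            continue;
        /* Make x the lower-ranked root; break ties on the index. */
        if (ds_rank(xs) > ds_rank(ys) ||
            (ds_rank(xs) == ds_rank(ys) && x > y)) {
            uint64_t ts = xs; xs = ys; ys = ts;
            uint32_t t = x;   x = y;   y = t;
        }
        uint32_t yr = ds_rank(ys);
        if (yr < ds_rank(xs) + 1)
            yr = ds_rank(xs) + 1;
        /* Raise y's rank first, then hang x under y. */
        if (!atomic_compare_exchange_strong(&ds_node[y], &ys, ds_pack(yr, y)))
            continue;
        if (!atomic_compare_exchange_strong(&ds_node[x], &xs,
                                            ds_pack(ds_rank(xs), y)))
            continue;
        return;
    }
}
```

Packing into one word limits the structure to 2^32 elements, but avoids depending on a double-width CAS instruction.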
And now on to the application to your problem...
Assumptions
I make the assumption that ridge segments can only intersect at their endpoints. If this is not the case, you will need to alter phase 1 in some manner.
I also make the assumption that co-habitation of a single integer pixel location is sufficient for ridge segments to be connected. If not, you will need to change the array in phase 1 to hold multiple candidate ridge segments+disjoint-set pairs, and filter through to find ones that are actually connected.
The disjoint set structures used in this algorithm shall carry a reference to a line segment in their structures. In the event of a merge, we choose one of the two recorded segments arbitrarily to represent the set.
Phase 1: Local line identification
We start by dividing the map into sectors, each of which will be processed as a separate job. Multiple jobs may be processed in different threads, but each job will be processed by only one thread. If a ridge segment crosses a sector boundary, it is split into two segments, one for each sector.
For each sector, an array mapping pixel position to a disjoint-set structure is established. Most of this array will be discarded later, so its memory requirements should not be too much of a burden.
We then proceed over each line segment in the sector. For each one, we choose a disjoint set representing the entire line the segment forms a part of. We look up each endpoint in the pixel-position array to see if a disjoint-set structure has already been assigned. If exactly one of the endpoints is already in this array, we use the assigned disjoint set. If both are in the array, we perform a merge on the two disjoint sets, and use the new root as our set. Otherwise, we create a new disjoint set and associate with it a reference to the current line segment. We then write our set's root back into the pixel-position array for each of our endpoints.
This process is repeated for each line segment in the sector; by the end, we will have identified all lines completely within the sector by a disjoint set.
Note that since the disjoint sets are not yet shared between threads, there's no need to use compare-and-swap operations yet; simply use the normal single-threaded union-merge algorithm. Since we do not free any of the disjoint set structures until the algorithm completes, allocation can also be made from a per-thread bump allocator, making memory allocation (virtually) lockless and O(1).
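A simplified single-threaded sketch of this phase is shown below. It is not the endpoint-keyed scheme described above: as an assumption made to keep the example short, it gives every cell of the sector its own union-find slot and merges cells using the adjacency test from the question (with the vertical rule assumed by symmetry). The function names are hypothetical.

```c
struct ridge_t { unsigned char top, left, bottom, right; };

/* A cell holds a segment when exactly two edges carry a value in 0..128. */
static int has_segment(const struct ridge_t *r)
{
    return (r->top <= 128) + (r->left <= 128) +
           (r->bottom <= 128) + (r->right <= 128) == 2;
}

/* Plain single-threaded find with path halving. */
static int uf_find(int *parent, int i)
{
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    return i;
}

static void uf_union(int *parent, int a, int b)
{
    a = uf_find(parent, a);
    b = uf_find(parent, b);
    if (a != b)
        parent[a] = b;
}

/* Labels one rows x cols sector. On return, label[i] is the root cell of
   the line through cell i, or -1 if cell i holds no segment. Returns the
   number of distinct lines found inside the sector. */
int label_sector(const struct ridge_t *ridges, int rows, int cols, int *label)
{
    int n = rows * cols, lines = 0;

    for (int i = 0; i < n; i++)
        label[i] = i;
    for (int row = 0; row < rows; row++) {
        for (int col = 0; col < cols; col++) {
            int i = row * cols + col;
            const struct ridge_t *r = &ridges[i];
            if (!has_segment(r))
                continue;
            if (col + 1 < cols && has_segment(&ridges[i + 1]) &&
                r->right <= 128 && ridges[i + 1].left == r->right)
                uf_union(label, i, i + 1);
            if (row + 1 < rows && has_segment(&ridges[i + cols]) &&
                r->bottom <= 128 && ridges[i + cols].top == r->bottom)
                uf_union(label, i, i + cols);
        }
    }
    for (int i = 0; i < n; i++) {
        if (!has_segment(&ridges[i])) {
            label[i] = -1;
            continue;
        }
        label[i] = uf_find(label, i);
        if (label[i] == i)
            lines++;
    }
    return lines;
}
```

Only the labels of cells on the sector border would need to be kept for the cross-sector merge.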
Once a sector is completely processed, all data in the pixel-position array is discarded; however data corresponding to pixels on the edge of the sector is copied to a new array and kept for the next phase.
Since iterating over the entire image is O(x*y), and disjoint-merge is effectively O(1), this operation is O(x*y) and requires working memory O(m+2*x*y/k+k^2) = O(x*y/k+k^2), where t is the number of sectors, k is the width of a sector, and m is the number of partial line segments in the sector (depending on how often lines cross borders, m may vary significantly, but it will never exceed the number of line segments). The memory carried over to the next operation is O(m + 2*x*y/k) = O(x*y/k)
Phase 2: Cross-sector merges
Once all sectors have been processed, we then move to merging lines that cross sectors. For each border between sectors, we perform lockless merge operations on lines that cross the border (ie, where adjacent pixels on each side of the border have been assigned to line sets).
This operation has running time O(x+y) and consumes O(1) memory (we must retain the memory from phase 1 however). Upon completion, the edge arrays may be discarded.
Phase 3: Collecting lines
We now perform a multi-threaded map operation over all allocated disjoint-set structure objects. We first skip any object which is not a root (ie, where obj.parent != obj). Then, starting from the representative line segment, we move out from there and collect and record any information desired about the line in question. We are assured that only one thread is looking at any given line at a time, as intersecting lines would have ended up in the same disjoint-set structure.
This has O(m) running time, and memory usage dependent on what information you need to collect about these line segments.
Summary
Overall, this algorithm should have O(x*y) running time, and O(x*y/k + k^2) memory usage. Adjusting k gives a tradeoff between transient memory usage on the phase 1 processes, and the longer-term memory usage for the adjacency arrays and disjoint-set structures carried over into phase 2.
Note that I have not actually tested this algorithm's performance in the real world; it is also possible that I have overlooked concurrency issues in the lockless disjoint-set union-merge algorithm above. Comments welcome :)

You could use a non-generalized form of the Hough Transform. It appears that it reaches an impressive O(N) time complexity on N x N mesh arrays (if you've got access to ~10000x10000 SIMD arrays and your mesh is N x N - note: in your case, N would be a ridge struct, or cluster of A x B ridges, NOT a pixel). Click for Source. More conservative (non-kernel) solutions list the complexity as O(kN^2) where k = [-π/2, π]. Source.
However, the Hough Transform does have some steep-ish memory requirements, and the space complexity will be O(kN) but if you precompute sin() and cos() and provide appropriate lookup tables, it goes down to O(k + N), which may still be too much, depending on how big your N is... but I don't see you getting it any lower.
Edit: The problem of cross-thread/kernel/SIMD/process line elements is non-trivial. My first impulse tells me to subdivide the mesh into recursive quad-trees (dependent on a certain tolerance), check immediate edges and ignore all edge ridge structs (you can actually flag these as "potential long lines" and share them throughout your distributed system); just do the work on everything INSIDE that particular quad and progressively move outward. Here's a graphical representation (green is the first pass, red is the second, etc). However, my intuition tells me that this is computationally expensive.

If the ridges are resolved enough that the breaks are only a few pixels then the standard dilate - find neighbours - erode steps you would do for finding lines / OCR should work.
Joining longer contours from many segments, and knowing when to create a neck or when to make a separate island, is much more complex.

Okay, so having thought about this a bit longer, I've got a suggestion that seems like it's too simple to be efficient... I'd appreciate some feedback on whether it seems sensible!
1) Since I can easily determine whether each ridge_t ridge segment is connected to zero, one or two adjacent segments, I could colour each one appropriately (LINE_NONE, LINE_END or LINE_MID). This can easily be done in parallel, since there is no chance of a race condition.
2) Once colouring is complete:
for each `LINE_END` ridge segment X found:
traverse line until another `LINE_END` ridge segment Y found
if X is earlier in memory than Y:
change X to `LINE_START`
else:
change Y to `LINE_START`
This is also free of race conditions, since even if two threads are simultaneously traversing the same line, they will make the same change.
3) Now every line in the image will have exactly one end flagged as LINE_START. The lines can be located and packed into a more convenient structure in a single thread, without having to do any look-ups to see if the line has already been visited.
It's possible that I should consider whether statistics such as line length should be gathered in step 2), to help with the final re-packing...
Are there any pitfalls that I've missed?
Edit: The obvious problem is that I end up walking the lines twice, once to locate the LINE_STARTs and once to do the final re-packing, leading to a computational inefficiency. It still appears to be O(N) in terms of storage and computation time, though, which is a good sign...

Related

Matlab's bvp4c: output arrays not always the same length as the initial guess

The Matlab function bvp4c solves boundary value problems. It takes a differential equation, boundary conditions and an initial guess as input, and returns a structure array containing arrays of x, y and yp (which stands for "y prime", or y').
The length of the output arrays should be the same as that of the initial guess, but I found that it isn't always. I have checked the dimensions of the input (the initial guess, always 1x101 double for x and 16x101 double for y) and the output (sometimes 1x101 double for x and 16x101 double for y and yp as it should be, but often different values, such as 1x91 double and 16x91 double or 1x175 double and 16x175 double).
Looking at the output array x when its length is off, some extra values are squeezed in, or some are taken out. For example, the initial guess has 101 positions from x=0 to x=1, and the x array should be [0 0.01 0.02 ... 1], but sometimes a new position like 0.015 shows up.
Question: Why does this happen, and how can this be solved?
"The length of the output arrays should be the same as that of the initial guess ...." This is incorrect.
As described in the bvp4c documentation, sol.x contains a "[mesh] selected by bvp4c" with an "[approximation] to y(x) at the mesh points of sol.x". In order to evaluate bvp4c's solution on your mesh, use deval.
Why does bvp4c choose a mesh? Quoting from the cited paper [1], which you can get in full here if you have a MathWorks account:
Because BVPs can have more than one solution, BVP codes require users to supply a guess for the solution desired. The guess includes a guess for an initial mesh that reveals the behavior of the desired solution. The codes then adapt the mesh so as to obtain an accurate numerical solution with a modest number of mesh points.
Because a steady BVP generally has a global behavior strongly dependent on its boundary values, the spatial mesh between the two boundaries may need to be refined in order to properly approximate the desired solution with the locally chosen basis functions for the method. However, there may also be portions of the mesh that do not need to be refined and can even be coarsened in some cases to maintain a reasonably small residual and accurate approximation. Therefore, for general efficiency, the guess mesh is adaptively refined or coarsened depending on some locally chosen metric (since bvp4c is collocation based, the metric is probably point-based or division-integrated based) such that the mesh returned by bvp4c is, in some sense, adequate enough for generic interpolation within the boundaries.
I'll also note that this is different from numerically solving IVPs since their state is not global across the entire time integration locus and only depends on the current state to the next time-step, and possibly previous time steps if using a multi-step method or solving a delay differential equation, which makes the refinement inherently local. This local behavior of IVPs is what allows functions like ode45 to return a solution at pre-selected time values because it can locally refine the solution at the selected point while performing the time march (this is known as dense output).
[1] Shampine, L.F., M.W. Reichelt, and J. Kierzenka, "Solving Boundary Value Problems for Ordinary Differential Equations in MATLAB with bvp4c".

Efficiency of Fortran ndarray versus n*1d arrays

This is something I've been struggling with since I started working on the code I'm currently using.
My advisor has written this code over the past ten years and, at some point, had to store values that we would usually keep in matrices or tensors.
Specifically, we look at a matrix with six independent components calculated from the virial theorem (from a molecular dynamics simulation), and his habit was to store six 1D arrays, one for each component, at each recorded step, i.e. xy(n), xz(n), yz(n)..., n being the number of records.
I assume that a single array s(n,3,3) could be more efficient, as the values will be stored closer to one another (xy(n) and xz(n) have no reason to be stored side by side in memory), and would give rise to fewer errors from corrupted memory or wrong memory accesses. I tried to discuss it in the lab, but no one really cares and, again, this is just an assumption.
This would not have bugged me if everything in the code weren't stored like that. Every 3D quantity is stored in three different arrays instead of one, and this feels odd to me with regard to the performance of the code.
Is there any comparable effect for long calculations and large data sizes? I decided to post here after resolving an error caused by a wrong memory access with one of these arrays, as I find the single-array code more readable and the data easier to work with (s = s+... instead of six lines of xy = xy+..., for example).
The fact that the columns are close to each other is not very important, especially if the leading dimension n is large. Your CPU has multiple prefetch streams and can prefetch simultaneously in different arrays of different columns.
If you make some random access in an array A(n,3,3) where A is allocatable, the dimensions are not known at compile time. Therefore, the address of a random element A(i,j,k) will be address_of(A(1,1,1)) + (i-1) + (j-1)*n + (k-1)*3*n (in units of one element), and it will have to be calculated at run time every time you make a random access to the array. The calculation of the address involves 3 integer multiplications (3 CPU cycles each) and at least 3 adds (1 cycle each). But regular (predictable) accesses can be optimized by the compiler using relative addresses.
If you have separate 1-index arrays, the calculation of the address involves only one integer add (1 cycle), so you get a performance penalty of at least 11 cycles for each access when using a single 3-index array.
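To make the address arithmetic concrete, here is a small C sketch of the column-major offset calculation (C is used only for illustration; the function names are hypothetical). Offsets are zero-based and counted in elements, so the one-based Fortran indices each have 1 subtracted.

```c
#include <stddef.h>

/* Zero-based element offset of Fortran's A(i,j,k) for an array declared
   A(n,3,3). Storage is column-major, so the first index varies fastest. */
size_t offset_n33(size_t i, size_t j, size_t k, size_t n)
{
    return (i - 1) + (j - 1) * n + (k - 1) * 3 * n;
}

/* The same element if the data were reorganised as A(3,3,n): the nine
   components of record i then sit next to each other in memory. */
size_t offset_33n(size_t j, size_t k, size_t i)
{
    return (j - 1) + (k - 1) * 3 + (i - 1) * 9;
}
```

With n = 100, A(5,2,3) lies 704 elements from the start in the first layout, whereas in the second layout all nine components of record 1 occupy offsets 0 to 8.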
Moreover, if you have 9 different arrays, each one of them can be aligned on a cache-line boundary, whereas you would be forced to use padding at the end of lines to ensure this behavior with a single array.
So I would say that in the particular case of A(n,3,3), as the two last indices are small and known at compile time, you can safely do the transformation into 9 different arrays to potentially gain some performance.
Note that if you often use the data of the 9 arrays at the same index i in a random order, re-organizing the data into A(3,3,n) will give you a clear performance increase. If A is in double precision, A(4,4,n) could be even better if A is aligned on a 64-byte boundary, as every A(1,1,i) will be located at the 1st position of a cache line.
Assuming that you always loop along n and inside each loop need to access all the components in the matrix, storing the array like s(6,n) or s(3,3,n) will benefit from cache optimization.
do i=1,n
! do some calculation with s(:,i)
enddo
However, if your inner loop looks like this
resultarray(i)=xx(i)+yy(i)+zz(i)+2*(xy(i)+yz(i)+xz(i))
don't bother changing the array layout, because you may break the SIMD optimization.

Rendering image using Multithread

I have a ray tracing algorithm, which works with only 1 thread and I am trying to make it work with any number of threads.
My question is, which way can I divide this task among threads.
At first my Instructor told me to just divide the width of the image, for example if I have an 8x8 image, and I want 2 threads to do the task, let thread 1 render 0 to 3 horizontal area ( of course all the way down vertically ) and thread 2 render 4 to 7 horizontal area.
I found this approach to work perfectly when both my image width and the number of threads are powers of 2, but I have no idea how to deal with an odd number of threads, or any number of threads that can't divide the width without a remainder.
My approach to this problem was to let the threads render the image by alternating. For example, say I have an 8x8 image and 3 threads:
thread 1 renders pixels 0,3,6 in horizontal direction
thread 2 renders pixels 1,4,7 in horizontal direction
thread 3 renders pixels 2,5 in horizontal direction
Sorry that I can't provide all my code, since there are more than 5 files with a few hundred lines of code in each one.
Here are the for loops that loop through the horizontal area; the vertical loop is inside these, but I am not going to provide it here.
My instructor's suggestion
for( int px=(threadNum*(width/nthreads)); px < ((threadNum+1)*(width/nthreads)); ++px )
threadNum is the current thread that I am on (meaning thread 0,1,2 and so on)
width is the width of the image
nthreads is the overall number of threads.
My solution to this problem
for( int px= threadNum; px< width; px+=nthreads )
I know my question is not so clear, and sorry that I can't provide the whole code here, but basically all I am asking is which way is best to divide the rendering of the image among a given number of threads (which can be any positive number). Also, I want the threads to render the image by columns, meaning I can't touch the part of the code that handles vertical rendering.
Thank you, and sorry for the chaotic question.
First, let me tell you that, under the assumption that the rendering of each pixel is independent of the other pixels, your task is what the HPC field calls an "embarrassingly parallel problem"; that is, a problem that can be efficiently divided between any number of threads (until each thread has a single "unit of work"), without any intercommunication between the processes (which is very good).
That said, it doesn't mean that any parallelization scheme is as good as any other. For your specific problem, I would say that the two main factors to keep in mind are load balancing and cache efficiency.
Load balancing means that you should divide the work among threads in a way that each thread has roughly the same amount of work: in this way you prevent one or more threads from waiting for that one last thread that has to finish its last job.
E.g.
You have 5 threads and you split your image into 5 big chunks (let's say 5 horizontal strips, but they could be vertical and it wouldn't change the point). The problem being embarrassingly parallel, you expect a 5x speedup, and instead you get a meager 1.2x.
The reason might be that your image has most of its computationally expensive details in the lower part (I know nothing of rendering, but I assume that a reflective object might take far more time to render than a flat empty space), because it is composed of a set of polished metal marbles on the floor of an otherwise empty frame.
In this scenario, only one thread (the one with the bottom 1/5 of the image) does all the work anyway, while the other 4 sit idle after finishing their brief tasks.
As you can imagine, this isn't a good parallelization: keeping load balancing alone in mind, the best parallelization scheme would be to assign interleaved pixels to each core for them to process, under the (very reasonable) assumption that the complexity of the image would be averaged on each thread (true for natural images, might yield surprises in very very limited scenarios).
With this solution, your image is evenly distributed among threads (statistically) and the worst case scenario is N-1 threads waiting for a single thread to compute a single pixel (you wouldn't notice, performance-wise).
To do that you need to cycle over all pixels forgetting about lines, in this way (pseudo code, not tested):
for (i = thread_num; i < width * height; i += num_threads)
The second factor, cache efficiency, deals with the way computers are designed: specifically, the fact that they have many layers of cache to speed up computations and prevent the CPUs from starving (remaining idle while waiting for data), and that accessing data in the "right way" can speed up computations considerably.
It's a very complex topic, but in your case, a rule of thumb might be "feeding each thread the right amount of memory will improve the computation" (emphasis on "right amount" intended...).
It means that, even if passing interleaved pixels to each thread is probably the perfect balancing, it's probably the worst possible memory access pattern you could devise, and you should pass "bigger chunks" to them, because this would keep the CPU busy (note: memory alignment also comes heavily into play: if your image has padding after each line to keep lines multiples of, say, 32 bytes, like some image formats, you should take it into consideration!!)
Without expanding an already verbose answer to alarming sizes, this is what I would do (I'm assuming the memory of the image is consecutive, without padding between lines!):
1. Create a program that splits the image into chunks of N consecutive pixels (use a preprocessor constant or a command argument for N, so you can change it!) and deals them out to each of M threads in turn, like this (N = 8, M = 4):
1111111122222222333333334444444411111111
2. Do some profiling for various values of N, stepping from 1 to, let's say, 2048, by powers of two (good values to test might be: 1 to get a baseline, 32, 64, 128, 256, 512, 1024, 2048).
3. Find out where the perfect balance is between perfect load balancing (N=1) and best caching (N <= the biggest cache line in your system).
4a. Try the program on more than one system, and keep the smallest value of N that gives you the best test results among the machines, in order to make your code run fast everywhere (as the caching details vary among systems).
4b. If you really really want to squeeze every cycle out of every system you install your code on, forget step 4a, and create code that automatically finds the best value of N by rendering a small test image before tackling the appointed task :)
5. Fool around with SIMD instructions (just kidding... sort of :) )
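The chunked assignment from the first step could be sketched as follows. This is only an illustration: the function names are hypothetical, threads are numbered from 0 rather than 1, and the inner loop body stands in for the actual per-pixel rendering.

```c
/* Thread that owns pixel index i when chunks of n consecutive pixels are
   dealt out to m threads in round-robin order. */
int chunk_owner(long i, int n, int m)
{
    return (int)((i / n) % m);
}

/* Visits every pixel index that thread t owns out of total pixels and
   returns how many there are. */
long count_owned(long total, int n, int m, int t)
{
    long count = 0;
    for (long start = (long)t * n; start < total; start += (long)m * n) {
        long end = (start + n < total) ? start + n : total;
        for (long i = start; i < end; i++)
            count++;    /* a renderer would shade pixel i here */
    }
    return count;
}
```

With n = 1 this degenerates to fully interleaved pixels, and with n = total / m (rounded up) it becomes one contiguous block per thread, so the same loop covers both ends of the trade-off being profiled.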
A bit theoretical (and overly long...), but still I hope it helps!
An alternating division of the columns will probably lead to a suboptimal cache usage. The threads should operate on a larger continuous range of data. By the way, if your image is stored row-wise it would also be better to distribute the rows instead of the columns.
This is one way to divide the data equally with any number of threads:
#define min(x,y) ((x) < (y) ? (x) : (y))
/*...*/
int q = width / nthreads;
int r = width % nthreads;
int w = q + (threadNum < r);
int start = threadNum*q + min(threadNum,r);
for( int px = start; px < start + w; px++ )
/*...*/
The remainder r is distributed over the first r threads. This is important when calculating the start index for a thread.
For the 8x8 image this would lead to:
thread 0 renders columns 0-2
thread 1 renders columns 3-5
thread 2 renders columns 6-7
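The same split can be wrapped in a small helper, sketched here with a hypothetical name and a half-open range convention that is not part of the snippet above:

```c
/* Splits width columns over nthreads threads. The first (width % nthreads)
   threads get one extra column. Writes the half-open range [*start, *end)
   of columns for thread thread_num (numbered from 0). */
void column_range(int width, int nthreads, int thread_num, int *start, int *end)
{
    int q = width / nthreads;
    int r = width % nthreads;
    int extra = thread_num < r ? thread_num : r;

    *start = thread_num * q + extra;
    *end = *start + q + (thread_num < r);
}
```

Each thread would then loop with `for (int px = start; px < end; px++)`. If there are more threads than columns, the surplus threads simply receive an empty range.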

Fast spatial data structure for nearest neighbor search amongst non-uniformly sized hyperspheres

Given a k-dimensional continuous (euclidean) space filled with rather unpredictably moving/growing/shrinking  hyperspheres I need to repeatedly find the hypersphere whose surface is nearest to a given coordinate. If some hyperspheres are of the same distance to my coordinate, then the biggest hypersphere wins. (The total count of hyperspheres is guaranteed to stay the same over time.)
My first thought was to use a KDTree but it won't take the hyperspheres' non-uniform volumes into account.
So I looked further and found BVH (Bounding Volume Hierarchies) and BIH (Bounding Interval Hierarchies), which seem to do the trick. At least in 2-/3-dimensional space. However while finding quite a bit of info and visualizations on BVHs I could barely find anything on BIHs.
My basic requirement is a k-dimensional spatial data structure that takes volume into account and is either super fast to build (off-line) or dynamic with barely any unbalancing.
Given my requirements above, which data structure would you go with? Any other ones I didn't even mention?
Edit 1: Forgot to mention: hyperspheres are allowed (actually highly expected) to overlap!
Edit 2: Looks like instead of "distance" (and "negative distance" in particular) my described metric matches the power of a point much better.
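As a baseline to compare any spatial data structure against, the query can be written as a brute-force scan. The sketch below assumes the power-of-a-point metric mentioned in Edit 2 (squared distance to the center minus squared radius) and breaks ties in favour of the larger radius; the function names and the flat row-per-sphere layout of `centers` are hypothetical.

```c
#include <stddef.h>

/* Power of point p with respect to a hypersphere with center c and
   radius r in k dimensions: |p - c|^2 - r^2. It is negative inside the
   sphere and zero on its surface. */
double point_power(const double *p, const double *c, double r, size_t k)
{
    double d2 = 0.0;
    for (size_t i = 0; i < k; i++) {
        double d = p[i] - c[i];
        d2 += d * d;
    }
    return d2 - r * r;
}

/* Linear scan over n spheres; centers holds n rows of k coordinates.
   Returns the index with the smallest power, preferring the larger
   radius on ties, or n if there are no spheres. */
size_t nearest_sphere(const double *p, const double *centers,
                      const double *radii, size_t n, size_t k)
{
    size_t best = n;
    double best_pow = 0.0;

    for (size_t s = 0; s < n; s++) {
        double pw = point_power(p, centers + s * k, radii[s], k);
        if (best == n || pw < best_pow ||
            (pw == best_pow && radii[s] > radii[best])) {
            best = s;
            best_pow = pw;
        }
    }
    return best;
}
```

This costs O(n*k) per query, which is the figure a tree-based structure has to beat once its rebuild or update cost for moving and resizing spheres is included.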
I'd expect a QuadTree/Octree/generalized to 2^K-tree for your dimensionality of K would do the trick; these recursively partition space, and presumably you can stop when a K-subcube (or K-rectangular brick if the splits aren't even) does not contain a hypersphere, or contains one or more hyperspheres such that partitioning doesn't separate any, or alternatively contains the center of just a single hypersphere (probably easier).
Inserting and deleting entities in such trees is fast, so a hypersphere changing size just causes a delete/insert pair of operations. (I suspect you can optimize this if your sphere size changes by local additional recursive partition if the sphere gets smaller, or local K-block merging if it grows).
I haven't worked with them, but you might also consider binary space partitions. These let you use binary trees instead of k-trees to partition your space. I understand that KDTrees are a special case of this.
But in any case I thought the insertion/deletion algorithms for 2^K trees and/or BSP/KDTrees were well understood and fast. So hypersphere size changes cause deletion/insertion operations but those are fast. So I don't understand your objection to KD-trees.
I think the performance of all these are asymptotically the same.
I would use the R*Tree extension for SQLite. A table would normally have 1 or 2 dimensional data. SQL queries can combine multiple tables to search in higher dimensions.
The formulation with negative distance is a little weird. Distance is positive in geometry, so there may not be much helpful theory to use.
A different formulation that uses only positive distances may be helpful. Read about hyperbolic spaces. This might help to provide ideas for other ways to describe distance.

About curse of dimensionality

My question is about this topic I've been reading about a bit. Basically my understanding is that in higher dimensions all points end up being very close to each other.
The doubt I have is whether this means that calculating distances the usual way (euclidean for instance) is valid or not. If it were still valid, this would mean that when comparing vectors in high dimensions, the two most similar wouldn't differ much from a third one even when this third one could be completely unrelated.
Is this correct? Then in this case, how would you be able to tell whether you have a match or not?
Basically the distance measurement is still correct, however, it becomes meaningless when you have "real world" data, which is noisy.
The effect we talk about here is that a high distance between two points in one dimension gets quickly overshadowed by small distances in all the other dimensions. That's why in the end, all points somewhat end up with the same distance. There exists a good illustration for this:
Say we want to classify data based on their value in each dimension. We just say we divide each dimension once (which has a range of 0..1). Values in [0, 0.5) are positive, values in [0.5, 1] are negative. With this rule, in 3 dimensions, 12.5% of the space is covered. In 5 dimensions, it is only 3.1%. In 10 dimensions, it is less than 0.1%.
So in each dimension we still allow half of the overall value range, which is quite a lot. But all of it ends up in 0.1% of the total space: the differences between these data points are huge in each dimension, but negligible over the whole space.
You can go further and say in each dimension you cut only 10% of the range. So you allow values in [0, 0.9). You still end up with less than 35% of the whole space covered in 10 dimensions. In 50 dimensions, it is 0.5%. So you see, wide ranges of data in each dimension are crammed into a very small portion of your search space.
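These percentages are simply the kept fraction of each axis raised to the number of dimensions. A minimal sketch for checking them (the function name is hypothetical):

```c
/* Fraction of the unit hypercube [0,1]^dims that is covered when every
   axis keeps a sub-range of length keep (0 < keep <= 1): keep^dims. */
double covered_fraction(double keep, int dims)
{
    double f = 1.0;
    for (int i = 0; i < dims; i++)
        f *= keep;
    return f;
}
```

For example, `covered_fraction(0.5, 3)` is 0.125, `covered_fraction(0.9, 10)` is about 0.349 and `covered_fraction(0.9, 50)` is about 0.005, matching the figures quoted above.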
That's why you need dimensionality reduction, where you basically disregard differences on less informative axes.
Here is a simple explanation in layman terms.
I tried to illustrate this with a simple illustration shown below.
Suppose you have some data features x1 and x2 (you can assume they are blood pressure and blood sugar levels) and you want to perform K-nearest neighbor classification. If we plot the data in 2D, we can easily see that the data nicely group together, each point has some close neighbors that we can use for our calculations.
Now let's say we decide to consider a new third feature x3 (say age) for our analysis.
Case (b) shows a situation where all of our previous data comes from people the same age. You can see that they are all located at the same level along the age (x3) axis.
Now we can quickly see that if we want to consider age for our classification, there is a lot of empty space along the age(x3) axis.
The data that we currently have only cover a single level for the age. What happens if we want to make a prediction for someone of a different age (red dot)?
As you can see, there are not enough data points close to this point to calculate the distance and find some neighbors. So, if we want to have good predictions with this new third feature, we have to go and gather more data from people of different ages to fill the empty space along the age axis.
Case (c) shows essentially the same concept. Here, assume our initial data were gathered from people of different ages (i.e. we did not care about age in our previous 2-feature classification task and might have assumed that this feature has no effect on our classification).
In this case, assume our 2D data come from people of different ages (the third feature). Now, what happens to our relatively closely located 2D data if we plot them in 3D? We can see that they are now more distant from each other (more sparse) in our new higher-dimensional space (3D). As a result, finding the neighbors becomes harder, since we don't have enough data for different values along our new third feature.
You can imagine that as we add more dimensions, the data move further and further apart. (In other words, we need more and more data if we want to avoid sparsity in our data.)
