Efficiency of Fortran ndarray versus n*1d arrays

This is something I've been struggling with since I started working on the code I'm using right now.
My advisor has been writing this code for the past ten years and at some point needed to store values that we usually keep in matrices or tensors.
Specifically, we look at a matrix with six independent components calculated from the virial theorem (in a molecular dynamics simulation), and his habit was to store six 1-D arrays, one for each component, at each recorded step, i.e. xy(n), xz(n), yz(n)..., n being the number of records.
I assume that a single array s(n,3,3) could be more efficient, as the values would be stored closer to one another (xy(n) and xz(n) have no reason to sit side by side in memory) and would cause fewer errors from corrupted memory or wrong memory accesses. I tried to discuss it in the lab, but in the end nobody cares, and again, this is just an assumption.
This would not have bugged me if everything in the code weren't stored like that: every 3-D quantity is kept in three different arrays instead of one, and this feels odd to me performance-wise.
Is there any comparable effect for long calculations and large data sizes? I decided to post here after resolving an error caused by a wrong memory access in one of these arrays, and because I find the combined version more readable and the data easier to work with (s = s + ... instead of six lines of xy = xy + ..., for example).

The fact that the columns are close to each other is not very important, especially if the leading dimension n is large. Your CPU has multiple prefetch streams and can prefetch simultaneously from different arrays or different columns.
If you make some random access in an array A(n,3,3) where A is allocatable, the dimensions are not known at compile time. Therefore, the address of a random element A(i,j,k) will be address_of(A(1,1,1)) + i + (j-1)*n + (k-1)*3*n, and it will have to be calculated at execution time every time you make a random access to the array. The calculation of the address involves 3 integer multiplications (3 CPU cycles each) and at least 3 adds (1 cycle each). Regular (predictable) accesses, however, can be optimized by the compiler using relative addresses.
If you have separate 1-index arrays, the calculation of the address involves only one integer add (1 cycle), so you get a performance penalty of at least 11 cycles for each random access to a single 3-index array.
Moreover, if you have 9 different arrays, each one of them can be aligned on a cache-line boundary, whereas you would be forced to use padding at the end of lines to ensure this behavior with a single array.
So I would say that in the particular case of A(n,3,3), as the two last extents are small and known at compile time, you can safely do the transformation into 9 different arrays to potentially gain some performance.
Note that if you often access the data of the 9 arrays at the same index i in a random order, reorganizing the data into A(3,3,n) will give you a clear performance increase. If A is in double precision, A(4,4,n) could be even better if A is aligned on a 64-byte boundary, as every A(1,1,i) would then sit at the first position of a cache line.
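As a rough illustration, here is a hedged sketch in C (the question is Fortran, but the layout trade-off is the same; the names and sizes below are made up). Layout (a) mirrors the separate per-component arrays, layout (b) mirrors A(3,3,n), where all components of record i sit next to each other in memory.

#include <stddef.h>

#define NREC 100000                 /* assumed number of records */

/* (a) one array per component: fine when a loop touches a single component */
double xx[NREC], xy[NREC], xz[NREC], yy[NREC], yz[NREC], zz[NREC];

/* (b) components fastest-varying: the nine values of record i are contiguous
   (72 bytes), so they load with far fewer cache misses when used together */
double s[NREC][3][3];

double pressure_a(size_t i) { return (xx[i] + yy[i] + zz[i]) / 3.0; }
double pressure_b(size_t i) { return (s[i][0][0] + s[i][1][1] + s[i][2][2]) / 3.0; }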

Assuming that you always loop over n and inside each iteration need to access all the components of the matrix, storing the array as s(6,n) or s(3,3,n) will benefit from cache optimization:
do i=1,n
! do some calculation with s(:,i)
enddo
However, if your inner loop looks like this:
resultarray(i) = xx(i) + yy(i) + zz(i) + 2*(xy(i) + yz(i) + xz(i))
don't bother changing the array layout, because you may break the SIMD optimization.


Data structure for a recursive function with two parameters, one of which is large and the other small

Mathematician here looking for a bit of help. (If you ever need math help I'll try to reciprocate on math.stackexchange!) Sorry if this is a dup. Couldn't find it myself.
Here's the thing. I write a lot of code (mostly in C) that is extremely slow and I know it could be sped up considerably but I'm not sure what data structure to use. I went to school 20 years ago and unfortunately never got to take a computer science course. I have watched a lot of open-course videos on data structures but I'm still a bit fuddled never taking an actual class.
Mostly my functions just take integers to integers. I almost always use 64-bit numbers and I have three use cases that I'm interested in. I use the word small to mean no more than a million or two in quantity.
Case 1: Small numbers as input. Outputs are arbitrary.
Case 2: Any 64-bit values as input, but only a small number of them. Outputs are arbitrary.
Case 3: Two parameter functions with one parameter that's small in value (say less than two million), and the other parameter is Large but with only a small number of possible inputs. Outputs are arbitrary.
For Case 1, I just make an array to cache the values. Easy and fast.
For Case 2, I think I should be using a hash. I haven't yet done this but I think I could figure it out if I took the time.
Case 3 is the one I'd like help with and I'm not even sure what I need.
For a specific example, take a function F(n,p) that takes large inputs n for the first parameter and a prime p for the second. The prime is at most the square root of n, so even if n is about 10^12, the primes only go up to about a million. Suppose this function is recursive or otherwise difficult to calculate (expensive) and will be called over and over with the same inputs. What might be a good data structure to use to easily create and retrieve the possible values of F(n,p), so that I don't have to recalculate it every time? The total number of possible inputs should be 10 or 20 million at most.
Help please! and Thank you in advance!
You are talking about memoization, I presume. I'll try to answer without a concrete example...
If you have to retrieve values keyed by a small range (the 2nd parameter), say from 0 to 10^6, lookups need to be very fast, and you have enough memory, you can simply declare an array of int (long, ...) that stores the output value for every possible input.
To make things simple, let's say the value 0 means no value has been set yet:
long *small = calloc(MAX, sizeof(*small)); // calloc initializes to 0
Then, in the function that returns the value for a small-range input:
if (small[ input ]) return small[ input ];
....calculate
small[ input ] = value;
+/-
+ Very fast
- Memory consumption takes the whole range, [ 0, MAX-1 ].
If you need to store arbitrary input, use one of the many libraries available (there are a lot of them). Use a map/set structure that tells you whether an item exists or not.
if (set.exists( input )) return set.get( input );
....calculate
set.set( input, value );
+/-
+ less memory usage
+ still fast (said to be O(1))
- but, not as fast as a mere array
Add to this the hashed variants (hash sets/maps), which are faster because, probabilistically, the values (hashes) are better distributed.
+/-
+ less memory usage than array
+ faster than a simple Set
- but, not as fast as a mere array
- use more memory than a simple Set
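For Case 3 specifically, the same idea can be applied to the pair (n, p) directly. Below is a hedged sketch, not anything from the answer above: a single open-addressing table keyed on the pair, where the table size, the hash mix and the name F_memo are all assumptions (2^25 slots, roughly 800 MB with this entry layout, chosen because the question mentions 10-20 million inputs at most).

#include <stdint.h>
#include <stdlib.h>

#define TABLE_BITS 25
#define TABLE_SIZE ((size_t)1 << TABLE_BITS)

typedef struct { uint64_t n; uint32_t p; uint8_t used; int64_t value; } entry_t;

static entry_t *cache;

static size_t hash_pair(uint64_t n, uint32_t p)
{
    uint64_t h = n * 0x9e3779b97f4a7c15ULL ^ ((uint64_t)p * 0xc2b2ae3d27d4eb4fULL);
    h ^= h >> 31;
    return (size_t)(h & (TABLE_SIZE - 1));
}

int64_t F(uint64_t n, uint32_t p);              /* your expensive function, defined elsewhere */

int64_t F_memo(uint64_t n, uint32_t p)
{
    if (!cache) cache = calloc(TABLE_SIZE, sizeof *cache);
    size_t h = hash_pair(n, p);
    while (cache[h].used && (cache[h].n != n || cache[h].p != p))
        h = (h + 1) & (TABLE_SIZE - 1);         /* linear probing */
    if (cache[h].used) return cache[h].value;
    int64_t v = F(n, p);                        /* compute once, then cache */
    cache[h].n = n; cache[h].p = p; cache[h].used = 1; cache[h].value = v;
    return v;
}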

Algorithm - How to select one number from each column in an array so that their sum is as close as possible to a particular value

I have an m x n matrix of real numbers. I want to choose a single value from each column such that the sum of my selected values is as close as possible to a pre-specified total.
I am not an experienced programmer (although I have an experienced friend who will help). I would like to achieve this using Matlab, Mathematica or C++ (MySQL if necessary).
The code only needs to run a few times, once every few days - it does not necessarily need to be optimised. I will have 16 columns and about 12 rows.
Normally I would suggest dynamic programming, but there are a few features of this situation suggesting an alternative approach. First, the performance demands are light; this program will be run only a couple times, and it doesn't sound as though a running time on the order of hours would be a problem. Second, the matrix is fairly small. Third, the matrix contains real numbers, so it would be necessary to round and then do a somewhat sophisticated search to ensure that the optimal possibility was not missed.
Instead, I'm going to suggest the following semi-brute-force approach. 12**16 ~ 1.8e17, the total number of possible choices, is too many, but 12**9 ~ 5.2e9 is doable with brute force, and 12**7 ~ 3.6e7 fits comfortably in memory. Compute all possible choices for the first seven columns. Sort these possibilities by total. For each possible choice for the last nine columns, use an efficient search algorithm to find the best mate among the first seven. (If you have a lot of memory, you could try eight and eight.)
I would attempt a first implementation in C++, using std::sort and std::lower_bound from the <algorithm> standard header. Measure it; if it's too slow, then try an in-memory B+-tree (does Boost have one?).
I spent some more time thinking about how to implement what I wrote above in the simplest way possible. Here's an approach that will work well for a 12 by 16 matrix on a 64-bit machine with roughly 4 GB of memory.
The number of choices for the first eight columns is 12**8. Each choice is represented by a 4-byte integer between 0 and 12**8 - 1. To decode a choice index i, the row for the first column is given by i % 12. Update i /= 12;. The row for the second column now is given by i % 12, et cetera.
A vector holding all choices requires roughly 12**8 * 4 bytes, or about 1.6 GB. Two such vectors require 3.2 GB. Prepare one for the first eight columns and one for the last eight. Sort them by the sum of the entries they indicate. Use saddleback search to find the best combination. (Initialize an iterator into the first vector and a reverse iterator into the second. While neither iterator is at its end, compare the current combination against the current best and update the current best if necessary. If the current combination sums to less than the target, increment the first iterator. If the sum is greater than the target, increment the second iterator.)
I would estimate that this requires less than 50 lines of C++.
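To make the earlier 7 + 9 variant concrete, here is a hedged sketch in plain C rather than the C++ suggested above (qsort and a hand-written lower_bound stand in for std::sort and std::lower_bound); ROWS, COLS, the matrix contents and the target are placeholders, and a real version would also record which half-choices achieved the best total.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define ROWS 12
#define COLS 16
#define LEFT 7                         /* columns enumerated and kept in memory */

static double m[ROWS][COLS];           /* fill with your data */

typedef struct { double sum; unsigned idx; } half_t;

/* Sum of the choice encoded by idx over columns [c0, c0 + len). */
static double decode_sum(unsigned long long idx, int c0, int len)
{
    double s = 0.0;
    for (int c = c0; c < c0 + len; c++) { s += m[idx % ROWS][c]; idx /= ROWS; }
    return s;
}

static int cmp(const void *a, const void *b)
{
    double d = ((const half_t *)a)->sum - ((const half_t *)b)->sum;
    return (d > 0) - (d < 0);
}

int main(void)
{
    const double target = 100.0;                     /* assumed target value */

    size_t nleft = 1;
    for (int c = 0; c < LEFT; c++) nleft *= ROWS;    /* 12**7 ~ 3.6e7 choices */
    half_t *left = malloc(nleft * sizeof *left);
    for (size_t i = 0; i < nleft; i++)
        left[i] = (half_t){ decode_sum(i, 0, LEFT), (unsigned)i };
    qsort(left, nleft, sizeof *left, cmp);

    unsigned long long nright = 1;
    for (int c = LEFT; c < COLS; c++) nright *= ROWS;   /* 12**9, brute-forced */
    double best = HUGE_VAL;
    for (unsigned long long r = 0; r < nright; r++) {
        double rsum = decode_sum(r, LEFT, COLS - LEFT);
        double want = target - rsum;
        size_t lo = 0, hi = nleft;                   /* lower_bound on left sums */
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (left[mid].sum < want) lo = mid + 1; else hi = mid;
        }
        for (size_t k = (lo > 0 ? lo - 1 : 0); k < nleft && k <= lo; k++) {
            double s = left[k].sum + rsum;
            if (fabs(s - target) < fabs(best - target)) best = s;
        }
    }
    printf("best achievable total: %f\n", best);
    free(left);
    return 0;
}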
Without knowing the range of values that might fill the arrays, how about something generic like this (a rough sketch follows the steps):
1. Divide the target by the number of remaining columns.
2. Pick the number from that column closest to that value.
3. Repeat from 1 until a number has been picked from every column.
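A minimal sketch of that greedy pick (matrix dimensions and names are placeholders; this yields an approximation, not the optimum):

#include <math.h>

#define ROWS 12
#define COLS 16

double greedy_pick(const double m[ROWS][COLS], double target, int picked_row[COLS])
{
    double total = 0.0;
    for (int c = 0; c < COLS; c++) {
        double want = (target - total) / (COLS - c);   /* fair share per remaining column */
        int best = 0;
        for (int r = 1; r < ROWS; r++)
            if (fabs(m[r][c] - want) < fabs(m[best][c] - want)) best = r;
        picked_row[c] = best;
        total += m[best][c];
    }
    return total;   /* sum of the picked values */
}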

Memory-efficient line stitching in very large images

Background
I work with very large datasets from Synthetic Aperture Radar satellites. These can be thought of as high dynamic range greyscale images of the order of 10k pixels on a side.
Recently, I've been developing applications of a single-scale variant of Lindeberg's scale-space ridge detection algorithm for detecting linear features in a SAR image. This is an improvement on using directional filters or the Hough transform, methods that have both been used previously, because it is less computationally expensive than either. (I will be presenting some recent results at JURSE 2011 in April, and I can upload a preprint if that would be helpful.)
The code I currently use generates an array of records, one per pixel, each of which describes a ridge segment in the rectangle to bottom right of the pixel and bounded by adjacent pixels.
struct ridge_t { unsigned char top, left, bottom, right; };
int rows, cols;
struct ridge_t *ridges; /* An array of rows*cols ridge entries */
An entry in ridges contains a ridge segment if exactly two of top, left, right and bottom have values in the range 0 - 128. Suppose I have:
ridge_t entry;
entry.top = 25; entry.left = 255; entry.bottom = 255; entry.right = 76;
Then I can find the ridge segment's start (x1,y1) and end (x2,y2):
float x1, y1, x2, y2;
x1 = (float) col + (float) entry.top / 128.0;
y1 = (float) row;
x2 = (float) col + 1;
y2 = (float) row + (float) entry.right / 128.0;
When these individual ridge segments are rendered, I get an image something like this (a very small corner of a far larger image):
Each of those long curves are rendered from a series of tiny ridge segments.
It's trivial to determine whether two adjacent locations which contain ridge segments are connected. If I have ridge1 at (x, y) and ridge2 at (x+1, y), then they are parts of the same line if 0 <= ridge1.right <= 128 and ridge2.left == ridge1.right.
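A hedged helper expressing that test on the struct above (the 0 <= bound is implicit because the fields are unsigned char):

/* ridge1 at (x, y), ridge2 at (x+1, y): parts of the same line if ridge1's
   right-edge crossing exists and matches ridge2's left-edge crossing. */
static int horizontally_connected(struct ridge_t r1, struct ridge_t r2)
{
    return r1.right <= 128 && r2.left == r1.right;
}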
Problem
Ideally, I would like to stitch together all of the ridge segments into lines, so that I can then iterate over each line found in the image to apply further computations. Unfortunately, I'm finding it hard to find an algorithm for doing this which is both low-complexity and memory-efficient, and suitable for multiprocessing (all important considerations when dealing with really huge images!)
One approach that I have considered is scanning through the image until I find a ridge which only has one linked ridge segment, and then walking the resulting line, flagging any ridges in the line as visited. However, this is unsuitable for multiprocessing, because there's no way to tell if there isn't another thread walking the same line from the other direction (say) without expensive locking.
What do readers suggest as a possible approach? It seems like the sort of thing that someone would have figured out an efficient way to do in the past...
I'm not entirely sure this is correct, but I thought I'd throw it out for comment. First, let me introduce a lockless disjoint set algorithm, which will form an important part of my proposed algorithm.
Lockless disjoint set algorithm
I assume the presence of a two-pointer-sized compare-and-swap operation on your choice of CPU architecture. This is available on x86 and x64 architectures at the least.
The algorithm is largely the same as described on the Wikipedia page for the single-threaded case, with some modifications for safe lockless operation. First, we require the rank and parent elements to both be pointer-sized and aligned to 2*sizeof(pointer) in memory, for the atomic CAS later on.
Find() need not change; the worst case is that the path compression optimization will fail to have full effect in the presence of simultaneous writers.
Union() however, must change:
function Union(x, y)
  redo:
    x = Find(x)
    y = Find(y)
    if x == y
        return
    xSnap = AtomicRead(x) -- read both rank and pointer atomically
    ySnap = AtomicRead(y) -- this operation may be done using a CAS
    if (xSnap.parent != x || ySnap.parent != y)
        goto redo
    -- Ensure x has lower rank (meaning y will be the new root)
    if (xSnap.rank > ySnap.rank)
        swap(xSnap, ySnap)
        swap(x, y)
    -- if same rank, use pointer value as a fallback sort
    else if (xSnap.rank == ySnap.rank && x > y)
        swap(xSnap, ySnap)
        swap(x, y)
    yNew = ySnap
    yNew.rank = max(yNew.rank, xSnap.rank + 1)
    xNew = xSnap
    xNew.parent = y
    if (!CAS(y, ySnap, yNew))
        goto redo
    if (!CAS(x, xSnap, xNew))
        goto redo
    return
This should be safe in that it will never form loops, and will always result in a proper union. We can confirm this by observing that:
First, prior to termination, one of the two roots will always end up with a parent pointing to the other. Therefore, as long as there is no loop, the merge succeeds.
Second, rank always increases. After comparing the order of x and y, we know x has lower rank than y at the time of the snapshot. In order for a loop to form, another thread would need to have increased x's rank first, then merged x and y. However, in the CAS that writes x's parent pointer, we check that the rank has not changed; therefore, y's rank must remain greater than x's.
In the event of simultaneous mutation, it is possible that y's rank is increased and we then return to redo due to a conflict. However, this implies that either y is no longer a root (in which case rank is irrelevant) or that y's rank was increased by another process (in which case the second time around will have no effect and y will have the correct rank).
Therefore, there should be no chance of loops forming, and this lockless disjoint-set algorithm should be safe.
And now on to the application to your problem...
Assumptions
I make the assumption that ridge segments can only intersect at their endpoints. If this is not the case, you will need to alter phase 1 in some manner.
I also make the assumption that co-habitation of a single integer pixel location is sufficient for ridge segments to be considered connected. If not, you will need to change the array in phase 1 to hold multiple candidate ridge-segment + disjoint-set pairs, and filter through to find ones that are actually connected.
The disjoint set structures used in this algorithm shall carry a reference to a line segment in their structures. In the event of a merge, we choose one of the two recorded segments arbitrarily to represent the set.
Phase 1: Local line identification
We start by dividing the map into sectors, each of which will be processed as a separate job. Multiple jobs may be processed in different threads, but each job will be processed by only one thread. If a ridge segment crosses a sector boundary, it is split into two segments, one for each sector.
For each sector, an array mapping pixel position to a disjoint-set structure is established. Most of this array will be discarded later, so its memory requirements should not be too much of a burden.
We then proceed over each line segment in the sector. We first choose a disjoint set representing the entire line the segment forms a part of. We first look up each endpoint in the pixel-position array to see if a disjoint set structure has already been assigned. If one of the endpoints is already in this array, we use the assigned disjoint set. If both are in the array, we perform a merge on the disjoint sets, and use the new root as our set. Otherwise, we create a new disjoint-set, and associate with the disjoint-set structure a reference to the current line segment. We then write back into the pixel-position array our new disjoint set's root for each of our endpoints.
This process is repeated for each line segment in the sector; by the end, we will have identified all lines completely within the sector by a disjoint set.
Note that since the disjoint sets are not yet shared between threads, there's no need to use compare-and-swap operations yet; simply use the normal single-threaded union-merge algorithm. Since we do not free any of the disjoint set structures until the algorithm completes, allocation can also be made from a per-thread bump allocator, making memory allocation (virtually) lockless and O(1).
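For reference, a hedged sketch of that ordinary single-threaded disjoint set, carrying the per-set segment reference mentioned earlier (names are illustrative; the lockless CAS version above is only needed once sets are shared across threads in phase 2):

struct dset {
    struct dset *parent;
    size_t rank;
    struct ridge_t *segment;   /* representative line segment for this set */
};

static struct dset *find(struct dset *x)
{
    while (x->parent != x) {               /* path halving */
        x->parent = x->parent->parent;
        x = x->parent;
    }
    return x;
}

static struct dset *merge(struct dset *a, struct dset *b)
{
    a = find(a); b = find(b);
    if (a == b) return a;
    if (a->rank < b->rank) { struct dset *t = a; a = b; b = t; }
    b->parent = a;                         /* a becomes the root */
    if (a->rank == b->rank) a->rank++;
    return a;
}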
Once a sector is completely processed, all data in the pixel-position array is discarded; however data corresponding to pixels on the edge of the sector is copied to a new array and kept for the next phase.
Since iterating over the entire image is O(x*y), and a disjoint-set merge is effectively O(1), this operation is O(x*y) and requires working memory O(m + 2*x*y/k + k^2) = O(x*y/k + k^2), where k is the width of a sector and m is the number of partial line segments in the sector (depending on how often lines cross borders, m may vary significantly, but it will never exceed the number of line segments). The memory carried over to the next phase is O(m + 2*x*y/k) = O(x*y/k).
Phase 2: Cross-sector merges
Once all sectors have been processed, we then move to merging lines that cross sectors. For each border between sectors, we perform lockless merge operations on lines that cross the border (ie, where adjacent pixels on each side of the border have been assigned to line sets).
This operation has running time O(x+y) and consumes O(1) memory (we must retain the memory from phase 1 however). Upon completion, the edge arrays may be discarded.
Phase 3: Collecting lines
We now perform a multi-threaded map operation over all allocated disjoint-set structure objects. We first skip any object which is not a root (ie, where obj.parent != obj). Then, starting from the representative line segment, we move out from there and collect and record any information desired about the line in question. We are assured that only one thread is looking at any given line at a time, as intersecting lines would have ended up in the same disjoint-set structure.
This has O(m) running time, and memory usage dependent on what information you need to collect about these line segments.
Summary
Overall, this algorithm should have O(x*y) running time, and O(x*y/k + k^2) memory usage. Adjusting k gives a tradeoff between transient memory usage on the phase 1 processes, and the longer-term memory usage for the adjacency arrays and disjoint-set structures carried over into phase 2.
Note that I have not actually tested this algorithm's performance in the real world; it is also possible that I have overlooked concurrency issues in the lockless disjoint-set union-merge algorithm above. Comments welcome :)
You could use a non-generalized form of the Hough transform. It appears to reach an impressive O(N) time complexity on N x N mesh arrays (if you've got access to ~10000x10000 SIMD arrays and your mesh is N x N; note that in your case, N would be a ridge struct, or a cluster of A x B ridges, NOT a pixel) (source). More conservative (non-kernel) solutions list the complexity as O(kN^2), where k = [-π/2, π] (source).
However, the Hough transform does have some steep-ish memory requirements: the space complexity will be O(kN), but if you precompute sin() and cos() and provide appropriate lookup tables, it goes down to O(k + N), which may still be too much, depending on how big your N is... but I don't see you getting it any lower.
Edit: The problem of cross-thread/kernel/SIMD/process line elements is non-trivial. My first impulse is to subdivide the mesh into recursive quad-trees (dependent on a certain tolerance), check immediate edges and ignore all edge ridge structs (you can actually flag these as "potential long lines" and share them throughout your distributed system); then do the work on everything INSIDE that particular quad and progressively move outward. Here's a graphical representation (green is the first pass, red is the second, etc). However, my intuition tells me that this is computationally expensive.
If the ridges are resolved enough that the breaks are only a few pixels then the standard dilate - find neighbours - erode steps you would do for finding lines / OCR should work.
Joining longer contours from many segments and knowing when to create a neck or when to make a separate island is much more complex
Okay, so having thought about this a bit longer, I've got a suggestion that seems like it's too simple to be efficient... I'd appreciate some feedback on whether it seems sensible!
1) Since I can easily determine whether each ridge_t ridge segment is connected to zero, one or two adjacent segments, I can colour each one appropriately (LINE_NONE, LINE_END or LINE_MID). This can easily be done in parallel, since there is no chance of a race condition.
2) Once colouring is complete:
for each `LINE_END` ridge segment X found:
    traverse line until another `LINE_END` ridge segment Y found
    if X is earlier in memory than Y:
        change X to `LINE_START`
    else:
        change Y to `LINE_START`
This is also free of race conditions, since even if two threads are simultaneously traversing the same line, they will make the same change.
3) Now every line in the image will have exactly one end flagged as LINE_START. The lines can be located and packed into a more convenient structure in a single thread, without having to do any look-ups to see if the line has already been visited.
It's possible that I should consider whether statistics such as line length should be gathered in step 2), to help with the final re-packing...
Are there any pitfalls that I've missed?
Edit: The obvious problem is that I end up walking the lines twice, once to locate LINE_STARTs and once to do the final re-packing, leading to some computational inefficiency. It still appears to be O(N) in terms of storage and computation time, though, which is a good sign...

Finding a number appearing again among numbers stored in a file

Say I have 10 billion numbers stored in a file. How would I find a number that has already appeared previously?
I can't just load billions of numbers into an array at once and then use a simple nested loop to check whether a number has appeared before.
How would you approach this problem?
Thanks in advance :)
I had this as an interview question once.
Here is an algorithm that is O(N)
Use a hash table. Sequentially store pointers to the numbers, where the hash key is computed from the number value. When you hit a key that is already present, you have found your duplicate.
Author Edit:
Below, @Phimuemue makes the excellent point that 4-byte integers have a fixed bound on the number of distinct values: 2^32, or approximately 4 billion. When considered alongside the conversation accompanying this answer, the worst-case memory consumption of this algorithm is dramatically reduced.
Furthermore, using the bit array as described below reduces memory consumption to 512 MB (one bit per possible value). On many machines, the computation then becomes possible without resorting to either a persistent hash or the less performant sort-first strategy.
Note that longer numbers or double-precision numbers are less effective scenarios for the bit-array strategy.
Phimuemue Edit:
Of course, one needs to use a somewhat "special" hash table:
Take a hash table consisting of 2^32 bits. Since the question asks about 4-byte integers, there are at most 2^32 different values, i.e. one bit for each possible number. 2^32 bits = 512 MB.
So one just has to determine the location of the corresponding bit in the bitmap and set it. If one encounters a bit which is already set, the number occurred in the sequence already.
The important question is whether you want to solve this problem efficiently, or whether you want to solve it accurately.
If you truly have 10 billion numbers and just one single duplicate, then you are in a "needle in the haystack" type of situation. Intuitively, short of a very grimy and unstable solution, there is no hope of solving this without storing a significant fraction of the numbers.
Instead, turn to probabilistic solutions, which have been used in most any practical application of this problem (in network analysis, what you are trying to do is look for mice, i.e., elements which appear very infrequently in a large data set).
A possible solution, which can be made to find exact results: use a sufficiently high-resolution Bloom filter. Either use the filter on its own to decide whether an element has already been seen, or, if you want perfect accuracy, use the filter (in the same way kbrimington suggested using a standard hash table) to filter out elements which you can't possibly have seen and, on a second pass, determine exactly which elements you actually see twice.
And if your problem is slightly different---for instance, you know that you have at least 0.001% elements which repeat themselves twice, and you would like to find out how many there are approximately, or you would like to get a random sample of such elements---then a whole score of probabilistic streaming algorithms, in the vein of Flajolet & Martin, Alon et al., exist and are very interesting (not to mention highly efficient).
Read the file once, creating a hash table that stores the number of times you encounter each item. But wait! Instead of using the item itself as a key, use a hash of the item, for example its least significant bits, let's say 20 bits (about 1M buckets).
After the first pass, all items whose counter is > 1 may point to a duplicated item, or be a false positive. Rescan the file, considering only items that may lead to a duplicate (by looking up each item in the first table), and build a new hash table using the real values as keys now, storing the count again.
After the second pass, items with count > 1 in the second table are your duplicates.
This is still O(n), just twice as slow as a single pass.
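A hedged sketch of that two-pass scheme (the file name, bucket width, hash mix and exact-table size are all assumptions; the exact table must be sized to the number of candidate values or it will fill up):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BUCKET_BITS 20
#define NBUCKETS    (1u << BUCKET_BITS)
#define EXACT_BITS  24
#define EXACT_SIZE  ((size_t)1 << EXACT_BITS)

typedef struct { unsigned value; unsigned count; } slot_t;

int main(void)
{
    unsigned x;
    uint16_t *rough = calloc(NBUCKETS, sizeof *rough);     /* pass-1 counters */
    slot_t   *exact = calloc(EXACT_SIZE, sizeof *exact);   /* pass-2 exact keys */
    FILE     *f     = fopen("numbers.txt", "r");           /* assumed input file */
    if (!rough || !exact || !f) return 1;

    while (fscanf(f, "%u", &x) == 1) {                     /* pass 1 */
        unsigned b = x & (NBUCKETS - 1);
        if (rough[b] < UINT16_MAX) rough[b]++;             /* saturating count */
    }

    rewind(f);
    while (fscanf(f, "%u", &x) == 1) {                     /* pass 2 */
        if (rough[x & (NBUCKETS - 1)] < 2) continue;       /* cannot be a duplicate */
        size_t h = (x * 2654435761u) & (EXACT_SIZE - 1);   /* Knuth-style hash */
        while (exact[h].count && exact[h].value != x)
            h = (h + 1) & (EXACT_SIZE - 1);                /* linear probing */
        exact[h].value = x;
        if (++exact[h].count == 2)
            printf("duplicate: %u\n", x);
    }
    fclose(f);
    return 0;
}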
How about:
Sort the input using an algorithm that only needs a portion of the input in RAM at a time (an external merge sort, for example).
Seek duplicates in the output of the first step -- you only need space for 2 elements of the input in RAM at a time to detect repetitions (a minimal scan is sketched below).
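A hedged sketch of that second step, assuming the sorted numbers arrive one per line on standard input:

#include <stdio.h>

int main(void)
{
    unsigned prev, cur;
    if (scanf("%u", &prev) != 1) return 0;         /* empty input */
    while (scanf("%u", &cur) == 1) {
        if (cur == prev) {                         /* sorted, so repeats are adjacent */
            printf("duplicate is %u\n", cur);
            return 0;
        }
        prev = cur;
    }
    printf("no duplicate\n");
    return 0;
}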
Finding duplicates
Noting that it's a 32-bit integer means that you're going to have a large number of duplicates, since a 32-bit int can only represent about 4.3 billion different numbers and you have "10 billion".
If you use a tightly packed bit set, you can represent all the possibilities in 512 MB, which easily fits in current RAM sizes. As a start, this lets you recognise whether a number is duplicated or not.
Counting Duplicates
If you need to know how many times a number is duplicated, you're looking at keeping a hashmap that contains only the duplicates (using the 512 MB bit set above to tell efficiently whether a number should be in the map or not). In a worst-case scenario with a large spread, you're not going to be able to fit that into RAM.
Another approach, if the duplicates are spread fairly evenly, is to use a tightly packed array with 2-8 bits per value, taking about 1-4 GB of RAM and allowing you to count up to 3-255 occurrences of each number (a 2-bit variant is sketched below).
It's going to be a hack, but it's doable.
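A hedged sketch of the 2-bit-per-value variant (1 GB for the full 32-bit range; counts saturate at 3):

#include <stdint.h>
#include <stdlib.h>

static uint8_t *counts;                        /* 2 bits per possible value */

static int count_init(void)
{
    counts = calloc((size_t)1 << 30, 1);       /* 2^32 values * 2 bits = 1 GB */
    return counts != NULL;
}

static unsigned count_get(uint32_t x)
{
    return (counts[x >> 2] >> ((x & 3u) * 2)) & 3u;
}

static void count_bump(uint32_t x)
{
    unsigned c = count_get(x);
    if (c < 3)                                 /* saturate at 3 */
        counts[x >> 2] = (uint8_t)((counts[x >> 2] & ~(3u << ((x & 3u) * 2)))
                                   | ((c + 1) << ((x & 3u) * 2)));
}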
You need to implement some sort of looping construct to read the numbers one at a time since you can't have them in memory all at once.
How? Oh, what language are you using?
You have to read each number and store it in a hashmap, so that if a number occurs again you will detect it immediately.
If the possible range of the numbers in the file is not too large, you can use a bit array to indicate which of the numbers in the range have appeared.
If the range of the numbers is small enough, you can use a bit field to store if it is in there - initialize that with a single scan through the file. Takes one bit per possible number.
With a large range (like int) you need to read through the file every time. The file layout may allow for more efficient lookups (e.g. binary search in the case of a sorted array).
If time is not an issue and RAM is, you could read each number and then compare it to each subsequent number by reading from the file without storing it in RAM. It will take an incredible amount of time but you will not run out of memory.
I have to agree with kbrimington and his idea of a hash table, but first of all, I would like to know the range of the numbers that you're looking for. Basically, if you're looking for 32-bit numbers, you would need a single array of 4,294,967,296 bits. You start by setting all bits to 0, and every number in the file sets a specific bit. If the bit is already set, then you've found a number that has occurred before. Do you also need to know how often they occur? Still, it would need 536,870,912 bytes at least (512 MB). It's a lot and would require some crafty programming skills. Depending on your programming language and personal experience, there would be hundreds of ways to solve it this way.
Had to do this a long time ago.
What I did... I sorted the numbers as much as I could (there was a time-constraint limit) and arranged them like this while sorting:
1 to 10, 12, 16, 20 to 50, 52 would become..
[1,10], 12, 16, [20,50], 52, ...
Since in my case I had hundreds of numbers that were very "close" ($a-$b=1), from a few million sets I had very low memory usage.
P.S. Another way to store them:
1, -9, 12, 16, 20, -30, 52, ...
(this works when there are no numbers below zero)
After that I applied the various algorithms (described by other posters) to the reduced data set.
#include <stdio.h>
#include <stdlib.h>
/* Macro is overly general but I left it 'cos it's convenient */
#define BITOP(a,b,op) \
((a)[(size_t)(b)/(8*sizeof *(a))] op (size_t)1<<((size_t)(b)%(8*sizeof *(a))))
int main(void)
{
    unsigned x = 0;
    int dup = 0;
    /* One bit per possible 32-bit value: 2^32/8 = 512 MB, zero-initialized */
    size_t *seen = calloc((size_t)1 << (8 * sizeof(unsigned) - 3), 1);
    if (!seen) return 1;
    while (scanf("%u", &x) > 0) {
        if (BITOP(seen, x, &)) { dup = 1; break; }
        BITOP(seen, x, |=);
    }
    if (dup) printf("duplicate is %u\n", x);
    else printf("no duplicate\n");
    free(seen);
    return 0;
}
This is a simple problem that can be solved very easily (a few lines of code) and very quickly (a few minutes of execution) with the right tools.
My personal approach would be to use MapReduce:
MapReduce: Simplified Data Processing on Large Clusters
I'm sorry for not going into more detail, but once you are familiar with the concept of MapReduce it becomes very clear how to target the solution.
Basically we are going to implement two simple functions:
Map(key, value)
Reduce(key, values[])
so all in all:
open file and iterate through the data
for each number -> Map(number, line_index)
in the reduce we will get the number as the key and the total occurrences as the number of values (including their positions in the file)
so in Reduce(key, values[]), if the number of values is > 1 then it's a duplicate number
print the duplicates : number, line_index1, line_index2,...
Again, this approach can result in very fast execution depending on how your MapReduce framework is set up; it is highly scalable and very reliable, and there are many different implementations of MapReduce in many languages.
There are several top companies offering ready-made cloud computing environments, like Google, Microsoft Azure, Amazon AWS, ...
or you can build your own and set up a cluster with any provider offering virtual computing environments at very low hourly costs.
good luck :)
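A hedged, framework-free sketch of the two callbacks described above; emit() is a placeholder for whatever your MapReduce framework actually provides, not a real API:

#include <stddef.h>
#include <stdio.h>

static void emit(const char *key, const char *value)      /* placeholder */
{
    printf("emit %s -> %s\n", key, value);
}

static void map(const char *line_index, const char *number)
{
    emit(number, line_index);       /* key: the number, value: where it was seen */
}

static void reduce(const char *number, const char **line_indices, size_t count)
{
    if (count > 1) {                /* the number appeared on more than one line */
        printf("duplicate %s at lines:", number);
        for (size_t i = 0; i < count; i++) printf(" %s", line_indices[i]);
        printf("\n");
    }
}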
Another, simpler approach could be to use Bloom filters.
Implement a BitArray such that the i-th byte of the array corresponds to the numbers 8*i to 8*i+7, i.e. the first bit of the i-th byte is 1 if we have already seen 8*i, the second bit is 1 if we have already seen 8*i+1, and so on.
Initialize this bit array with size Integer.MAX/8, and whenever you see a number k, set bit k%8 of index k/8 to 1. If that bit is already 1, it means you have seen this number before.

About curse of dimensionality

My question is about this topic I've been reading about a bit. Basically my understanding is that in higher dimensions all points end up being very close to each other.
The doubt I have is whether this means that calculating distances the usual way (Euclidean, for instance) is valid or not. If it were still valid, it would mean that when comparing vectors in high dimensions, the two most similar wouldn't differ much from a third one even when this third one could be completely unrelated.
Is this correct? Then in this case, how would you be able to tell whether you have a match or not?
Basically, the distance measurement is still correct; however, it becomes meaningless when you have "real world" data, which is noisy.
The effect we are talking about here is that a high distance between two points in one dimension gets quickly overshadowed by small distances in all the other dimensions. That's why, in the end, all points somewhat end up at the same distance. There is a good illustration for this:
Say we want to classify data based on their value in each dimension. We just divide each dimension once (each dimension has a range of 0..1). Values in [0, 0.5) are positive, values in [0.5, 1] are negative. With this rule, in 3 dimensions, 12.5% of the space is covered. In 5 dimensions, it is only 3.1%. In 10 dimensions, it is less than 0.1%.
So in each dimension we still allow half of the overall value range, which is quite a lot. But all of it ends up in 0.1% of the total space -- the differences between these data points are huge in each dimension, but negligible over the whole space.
You can go further and say that in each dimension you cut only 10% of the range, so you allow values in [0, 0.9). You still end up with less than 35% of the whole space covered in 10 dimensions. In 50 dimensions, it is 0.5%. So you see, wide ranges of data in each dimension are crammed into a very small portion of your search space.
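A quick check of these fractions: keeping a fraction r of each axis covers r^d of a d-dimensional unit cube.

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* 0.1250, 0.0312, 0.0010: half of each axis kept, in 3, 5, 10 dimensions */
    printf("%.4f %.4f %.4f\n", pow(0.5, 3), pow(0.5, 5), pow(0.5, 10));
    /* 0.3487, 0.0052: 90% of each axis kept, in 10 and 50 dimensions */
    printf("%.4f %.4f\n", pow(0.9, 10), pow(0.9, 50));
    return 0;
}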
That's why you need dimensionality reduction, where you basically disregard differences on less informative axes.
Here is a simple explanation in layman's terms.
I tried to illustrate this with the simple sketch described below.
Suppose you have some data features x1 and x2 (you can assume they are blood pressure and blood sugar levels) and you want to perform K-nearest neighbor classification. If we plot the data in 2D, we can easily see that the data nicely group together, each point has some close neighbors that we can use for our calculations.
Now let's say we decide to consider a new third feature x3 (say age) for our analysis.
Case (b) shows a situation where all of our previous data come from people of the same age. You can see that they are all located at the same level along the age (x3) axis.
Now we can quickly see that if we want to consider age for our classification, there is a lot of empty space along the age(x3) axis.
The data that we currently have cover only a single level of age. What happens if we want to make a prediction for someone of a different age (the red dot)?
As you can see, there are not enough data points close to this point to calculate the distance and find some neighbors. So, if we want to have good predictions with this new third feature, we have to go and gather more data from people of different ages to fill the empty space along the age axis.
Case (c) essentially shows the same concept. Here, assume our initial data were gathered from people of different ages (i.e. we did not care about age in our previous two-feature classification task and might have assumed that this feature has no effect on our classification).
In this case, what happens to our relatively closely located 2-D data if we plot them in 3-D? They become more distant from each other (more sparse) in our new higher-dimensional space (3-D). As a result, finding the neighbors becomes harder, since we don't have enough data for the different values along our new third feature.
You can imagine that as we add more dimensions, the data points move further and further apart. (In other words, we need more and more data if we want to avoid sparsity.)
