How can I choose the right cluster to split? - loops

For a project I have to implement the bisecting k-means algorithm. So far I have written the lines of code below.
My problem, however, is that this loop always applies kmeans to one of the two newest clusters, without taking the previously found clusters into account, which is what I should be doing.
For example, with the first kmeans I get the clusters 'A' and 'B'. I then notice that cluster 'B' has the higher WCSS (within-cluster sum of squares), so I apply kmeans to it and get the clusters 'C' and 'D'. At this point I have to reapply kmeans to the cluster with the highest WCSS value, but I have to take into account all the clusters obtained so far, so not only 'C' and 'D' but also 'A', and so on.
I have to repeat all of this until I reach the set number of clusters.
How can I solve this?
X = rand(1482,74);
nCluster = 12;
[idx,C,sumd] = kmeans(X,2);
for pp = 3 : nCluster
    [~, index_max_cluster] = max(sumd);
    max_wcss_cluster = X(idx==index_max_cluster, :);
    [idx,C,sumd] = kmeans(max_wcss_cluster,2);
end
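One way to keep track of all clusters is to hold them in a list and, at each iteration, split only the one with the largest WCSS, putting its two halves back into the list. Below is a minimal sketch of that bookkeeping, written in Python with NumPy and scikit-learn purely as an illustration; the names are mine, and the same structure carries over to MATLAB's kmeans.

# Hedged sketch (Python/NumPy + scikit-learn, illustration only): keep every
# cluster found so far in a list, and at each step split only the one with
# the largest WCSS, leaving the other clusters untouched.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters):
    clusters = [X]                      # list of point subsets, one per cluster
    while len(clusters) < n_clusters:
        # pick the cluster with the largest within-cluster sum of squares
        wcss = [((c - c.mean(axis=0))**2).sum() for c in clusters]
        worst = int(np.argmax(wcss))
        target = clusters.pop(worst)
        km = KMeans(n_clusters=2, n_init=10).fit(target)
        clusters.append(target[km.labels_ == 0])
        clusters.append(target[km.labels_ == 1])
    return clusters

# Example: clusters = bisecting_kmeans(np.random.rand(1482, 74), 12)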

Related

How to efficiently store 1 million words and query them by starts_with, contains, or ends_with?

How do sites like this store tens of thousands of words "containing c", or like this, "words with d and c", or go even further and "unscramble" a word like CAUDK, finding that the database has duck? I'm curious from an algorithms/efficiency perspective how they would accomplish this:
Would a database be used, or would the words simply be stored in memory and quickly traversed? If a database were used (and each word were a record), how would you make these sorts of queries (with PostgreSQL, for example: contains, starts_with, ends_with, and unscrambles)?
I guess the easiest thing would be to store all words in memory (sorted?) and just traverse the whole list of a million or fewer words to find the matches? But what about the unscramble one?
Basically I'm wondering what the efficient way to do this would be.
"Containing C" amounts to count(C) > 0. Unscrambling CAUDC amounts to count(C) <= 2 && count(A) <= 1 && count(U) <= 1 && count(D) <= 1. So both queries could be efficiently answered by a database with 26 indices, one for the count of each letter in the alphabet.
Here is a quick and dirty python sqlite3 demo:
from collections import defaultdict, Counter
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
alphabet = [chr(ord('A')+i) for i in range(26)]
alphabet_set = set(alphabet)
columns = ['word TEXT'] + [f'{c}_count TINYINT DEFAULT 0' for c in alphabet]
create_cmd = f'CREATE TABLE abc ({", ".join(columns)})'
cur.execute(create_cmd)
for c in alphabet:
    cur.execute(f'CREATE INDEX {c}_index ON abc ({c}_count)')

def insert(word):
    counts = Counter(word)
    columns = ['word'] + [f'{c}_count' for c in counts.keys()]
    counts = [f'"{word}"'] + [f'{n}' for n in counts.values()]
    var_str = f'({", ".join(columns)})'
    val_str = f'({", ".join(counts)})'
    insert_cmd = f'INSERT INTO abc {var_str} VALUES {val_str}'
    cur.execute(insert_cmd)

def unscramble(text):
    counts = {a:0 for a in alphabet}
    for c in text:
        counts[c] += 1
    where_clauses = [f'{c}_count <= {n}' for (c, n) in counts.items()]
    select_cmd = f'SELECT word FROM abc WHERE {" AND ".join(where_clauses)}'
    cur.execute(select_cmd)
    return list(sorted([tup[0] for tup in cur.fetchall()]))

print('Building sqlite table...')
with open('/usr/share/dict/words') as f:
    word_set = set(line.strip().upper() for line in f)
for word in word_set:
    if all(c in alphabet_set for c in word):
        insert(word)
print('Table built!')

d = defaultdict(list)
for word in unscramble('CAUDK'):
    d[len(word)].append(word)
print("unscramble('CAUDK'):")
for n in sorted(d):
    print(' '.join(d[n]))
Output:
Building sqlite table...
Table built!
unscramble('CAUDK'):
A C D K U
AC AD AK AU CA CD CU DA DC KC UK
AUK CAD CUD
DUCK
I don't know for sure what they're doing, but I suggest this algorithm for contains and unscramble (which, I think, can be trivially extended to starts-with or ends-with):
1. The user submits a set of letters in the form of a string. Say the user submits bdsfa.
2. The algorithm sorts the string from step (1), so the query becomes abdfs.
3. Then, to find all words containing those letters, the algorithm simply accesses the directory database/a/b/d/f/s/ and lists the words stored there. If it finds that directory empty, it goes one level up, to database/a/b/d/f/, and shows the results there.
So the question now is how to index the database of millions of words as done in step (3). The database/ directory will have 26 directories inside it, for a to z, each of which will have 25 directories, for all letters except its parent's. E.g.:
database/a/{b,c,...,z}
database/b/{a,c,...,z}
...
database/z/{a,b,...,y}
This tree structure will be only 26 levels deep. Each branch will have no more than 26 elements, so browsing this directory structure is scalable.
Words will be stored in the leaves of this tree. So the word apple will be stored in database/a/e/l/p/leaf_apple. In that place you will also find other words, such as leap. More specifically:
database/
    a/
        e/
            l/
                p/
                    leaf_apple
                    leaf_leap
                    leaf_peal
                    ...
This way, you can efficiently reach the subset of target words in O(log n), where n is the total number of words in your database.
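Here is a minimal in-memory sketch of the same indexing idea, using a plain Python dict keyed by the sorted distinct letters of each word in place of actual directories; all names here are illustrative, not part of the original proposal.

# Minimal in-memory sketch of the directory-index idea described above:
# each word is filed under the sorted set of its distinct letters, which is
# what a path like database/a/e/l/p/ encodes. Names are illustrative only.
from collections import defaultdict

index = defaultdict(list)

def key_for(word):
    return ''.join(sorted(set(word.lower())))

def add_word(word):
    index[key_for(word)].append(word)

def words_with_exactly(letters):
    # equivalent to listing the leaf_* files in database/<sorted letters>/
    return index.get(key_for(letters), [])

for w in ['apple', 'leap', 'peal', 'duck']:
    add_word(w)

print(words_with_exactly('aelp'))   # ['apple', 'leap', 'peal']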
You can further optimise this by adding additional indices. For example, there are too many words containing a, and the website won't display them all (at least not on the 1st page). Instead, the website may say there are 500,000 words in total containing 'a', and here are 100 examples. In order to obtain that 500,000 count efficiently, the number of children at every level can be recorded during indexing, e.g.:
database/
    {a,b,...,z}/
        {num_children, ...}/
            {num_children, ...}/
                ...
Here, num_children is just a leaf node, just like leaf_WORD. All leaves are files.
Depending on the load this website has, it may not need to load this database into memory at all. It can simply leave it to the operating system to decide which portion of its file system to cache in memory, as a read-time optimisation.
Personally, as a criticism of typical applications, I think developers tend to jump to requiring RAM too quickly, even when a simple file-system trick can do the job without any noticeable difference to the end user.

Stop Matlab from treating a 1xn matrix as a column vector

I'm very frustrated with MATLAB right now. Let me illustrate the problem. I'm going to use informal notation here.
I have a column cell vector of strings called B. For now, let's say B = {'A';'B';'C';'D'}.
I want to have a matrix G, which is m-by-n, and I want to replace the numbers in G with the respective elements of B. For example, let's say G is [4 3; 2 1].
Let's say I have a variable n which says how many rows of G I want to take out.
When I do B(G(1:2,:)), I get what I want: ['D' 'C'; 'B' 'A'].
However, if I do B(G(1:1,:)), I get ['D';'C'], when what I really want is ['D' 'C'].
I am using 1:n, and I want it to behave the same for n = 1 as it does for n = 2 and n = 3. Basically, G is actually an n-by-1500 matrix, and I want to take the top n rows and use them as indexes into B.
I could use an if statement that transposes the result if n = 1, but that seems so unnecessary. Is there really no way to make it stop treating my 1-by-n matrix as if it were a column vector?
According to this post by Loren Shure:
Indexing with one array, C = A(B), produces output the size of B unless both A and B are vectors.
When both A and B are vectors, the number of elements in C is the number of elements in B, with the orientation of A.
You are in the second case, hence the behaviour you see.
To make it work, you need the output to keep as many columns as G has. To achieve that, you can do something like this:
out = reshape(B(G(1:n,:)),[],size(G,2))
Thus, with n = 1:
out =
'D' 'C'
With n = 2:
out =
'D' 'C'
'B' 'A'
I think this will only happen in the 1-D case. By default, MATLAB returns a column vector, since that is how it stores matrices. If you want a row vector, you can just transpose. In my opinion it should be fine when n > 1.

improve growing cell array in matlab

I would like to know how to run this part of my program efficiently. For background on what I aim to do: I want to cluster points. There are around 10,000 points. These points have computed "forces" between them, stored in matrix F, so the force between point 1 and point 2 is F(1,2). I would then like to cluster points with sufficient F acting between them (a force setting/threshold); that is, two points with sufficient F between them belong to the same cluster.
I have the code shown below. A cell array CLUSTER was made to contain the cluster assignments, so CLUSTER{i} is the cell containing the ith cluster and the points assigned to it.
However, for some F settings, the implementation takes forever. I have read about preallocation and parfor (parfor can't be used since there is a dependency between iterations). But does preallocation of cell arrays mean that individual cells are not preallocated with memory? Is there any other way around this? Profiling tells me that ismember has the biggest share of the computing time. I hope to improve the code with your suggestions. Thanks a lot!
CLUSTER = {};
for fi = 1:srow
    for fj = 1:scol
        if fj > fi % to eliminate redundancy, diagonal mirror elements of F !check on this
            if F(fi,fj) >= 2000 % Force setting
                if( (~ismember(1,cellfun(@(x)ismember(fi,x),CLUSTER))) && (~ismember(1,cellfun(@(x)ismember(fj,x),CLUSTER))) ) % fi & fj are not in CLUSTER
                    CLUSTER{end+1} = [fi fj];
                end
                %if( (ismember(1,cellfun(@(x)ismember(fi,x),CLUSTER))) && (ismember(1,cellfun(@(x)ismember(fj,x),CLUSTER))) ) % fi & fj are both in CLUSTER
                %    do nothing since fi and fj are already in CLUSTER
                %end
                if( (ismember(1,cellfun(@(x)ismember(fi,x),CLUSTER))) && (~ismember(1,cellfun(@(x)ismember(fj,x),CLUSTER))) ) % fi in CLUSTER, fj not in CLUSTER
                    c = find(cellfun(@(x)ismember(fi,x),CLUSTER));
                    CLUSTER{c} = [CLUSTER{c} fj];
                end
                if( (~ismember(1,cellfun(@(x)ismember(fi,x),CLUSTER))) && (ismember(1,cellfun(@(x)ismember(fj,x),CLUSTER))) ) % fi not in CLUSTER, fj in CLUSTER
                    c = find(cellfun(@(x)ismember(fj,x),CLUSTER));
                    CLUSTER{c} = [CLUSTER{c} fi];
                end
            end
        end
    end
end
I am not sure if this is the best way, but my thought is that it should be possible with some kind of update algorithm. Begin by looking at node 1. Find all the nodes clustered with node 1 as
clusteredNbr = find(F(1,:)>=2000); % (a)
This will give you the node numbers of all nodes that are clustered with node 1. Then find all nodes that are clustered with clusteredNbr, by repeating (a) for each new node. These nodes can then be added to the same vector as the old nodes in clusteredNbr. You may get some duplicates here, but they can be removed later. Then check for new unique nodes in the resulting vector. If there are any, repeat (a) for them. Continue until you have found all the nodes in the first cluster; then you know that none of the nodes in the first cluster are clustered with any of the remaining nodes.
Repeat this process for the next cluster, and so on. The gain here is that you use find instead of ismember, and that the operation can be vectorized, which lets you run only one for loop.
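A rough sketch of this frontier-expansion idea, written in Python with NumPy purely for illustration (the MATLAB version would use find on rows of F in the same way); the function name and the threshold value are assumptions taken from the question.

import numpy as np

def force_clusters(F, threshold=2000):
    """Group points i, j into one cluster whenever F[i, j] >= threshold,
    expanding each cluster's frontier with vectorized lookups (the (a) step)."""
    n = F.shape[0]
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        seed = unassigned.pop()
        members, frontier = {seed}, {seed}
        while frontier:
            new = set()
            for node in frontier:
                # the (a) step: all nodes strongly linked to this node
                new |= set(np.flatnonzero(F[node, :] >= threshold))
            frontier = new - members          # keep only genuinely new nodes
            members |= frontier
        unassigned -= members
        # note: nodes with no strong links end up as singleton clusters
        clusters.append(sorted(members))
    return clusters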

Efficient histogram implementation using a hash function

Is there a more efficient approach to computing a histogram than a binary search for a non-linear bin distribution?
I'm actually only interested in the part of the algorithm that matches the key (value) to the bin (the transfer function?), i.e. for a bunch of floating-point values I just want to know the appropriate bin index for each value.
I know that for a linear bin distribution you can get O(1) by dividing the value by the bin width, and that for non-linear bins a binary search gets you O(log N). My current implementation uses a binary search on unequal bin widths.
In the spirit of improving efficiency I was curious as to whether you could use a hash function to map a value to its appropriate bin and achieve O(1) time complexity when you have bins of unequal widths?
In some simple cases you can get O(1).
Suppose, your values are 8-bit, from 0 to 255.
If you split them into 8 bins of sizes 2, 2, 4, 8, 16, 32, 64, 128, then the bin value ranges will be: 0-1, 2-3, 4-7, 8-15, 16-31, 32-63, 64-127, 128-255.
In binary these ranges look like:
0000000x (bin 0)
0000001x
000001xx
00001xxx
0001xxxx
001xxxxx
01xxxxxx
1xxxxxxx (bin 7)
So, if you can quickly (in O(1)) count how many most significant zero bits there are in the value, you can get the bin number from it.
In this particular case you can precalculate a look-up table of 256 elements containing the bin numbers, and finding the appropriate bin for a value is then just one table look-up.
Actually, with 8-bit values you can use bins of arbitrary sizes since the look-up table is small.
If you were to go with bins of sizes of powers of 2, you could reuse this look-up table for 16-bit values as well. And you'd need two look-ups. You can extend it to even longer values.
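For concreteness, here is a minimal sketch of that look-up table, assuming the 8-bit bin layout described above (sizes 2, 2, 4, 8, 16, 32, 64, 128); the names are illustrative.

# Build a 256-entry look-up table mapping an 8-bit value to its bin,
# for the bin sizes 2, 2, 4, 8, 16, 32, 64, 128 described above.
sizes = [2, 2, 4, 8, 16, 32, 64, 128]
lut = []
for bin_index, size in enumerate(sizes):
    lut.extend([bin_index] * size)       # 2+2+4+...+128 = 256 entries

def bin_of(value):                        # value in 0..255
    return lut[value]                     # O(1): one table look-up

assert bin_of(0) == 0 and bin_of(3) == 1 and bin_of(200) == 7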
Ordinary hash functions are intended to scatter different values quite randomly across some range. A single-bit difference in arguments may lead to dozens of bits different in results. For that reason, ordinary hash functions are not suitable for the situation described in the question.
An alternative is to build an array P with entries that index into the table B of bin limits. Given some value x, we find the bin j it belongs to (or sometimes a nearby bin) via j = P[⌊x·r⌋] where r is a ratio that depends on the size of P and the maximum value in B. The effectiveness of this approach depends on the values in B and the size of P.
The behavior of functions like P[⌊x·r⌋] can be seen via the python code shown below. (The method is about the same in any programming language. However, tips for Python-to-C are given below.) Suppose the code is stored in file histobins.py and loaded into the ipython interpreter with the command import histobins as hb. Then a command like hb.betterparts(27, 99, 9, 80,155) produces output like
At 80 parts, steps = 20 = 7+13
At 81 parts, steps = 16 = 7+9
At 86 parts, steps = 14 = 6+8
At 97 parts, steps = 13 = 12+1
At 108 parts, steps = 12 = 3+9
At 109 parts, steps = 12 = 8+4
At 118 parts, steps = 12 = 6+6
At 119 parts, steps = 10 = 7+3
At 122 parts, steps = 10 = 3+7
At 141 parts, steps = 10 = 5+5
At 142 parts, steps = 10 = 4+6
At 143 parts, steps = 9 = 7+2
These parameters to betterparts set nbins=27, topsize=99, seed=9, plo=80, phi=155, which creates a test set of 27 bins for values from 0 to 99, with random seed 9, and sizes of P from 80 to 155-1. The number of “steps” is the number of times the two while loops in testparts() operated during a test with 10*nbins values from 0 to topsize. E.g., “At 143 parts, steps = 9 = 7+2” means that when the size of P is 143, out of 270 trials, P[⌊x·r⌋] produced the correct index at once 261 times; 7 times the index had to be decreased, and twice it had to be increased.
The general idea of the method is to trade off space for time. Another tradeoff is preparation time versus operation time. If you are going to be doing billions of lookups, it is worthwhile to do a few thousand trials to find a good value of |P|, the size of P. If you are going to be doing only a few millions of lookups, it might be better to just pick some large value of |P| and run with it, or perhaps just run betterparts over a narrow range. Instead of doing 75 tests as above, if we start with larger |P| fewer tests may give a good enough result. For example, 10 tests via “hb.betterparts(27, 99, 9, 190,200)” produces
At 190 parts, steps = 11 = 5+6
At 191 parts, steps = 5 = 3+2
At 196 parts, steps = 5 = 4+1
As long as P fits into some level of cache (along with other relevant data) making |P| larger will speed up access. So, making |P| as large as practical is a good idea. As |P| gets larger, the difference in performance between one value of |P| and the next gets smaller and smaller. The limiting factors on speed then include time to multiply and time to set up while loops. One approach for faster multiplies may be to choose a power of 2 as a multiplier; compute |P| to match; then use shifts or adds to exponents instead of multiplies. One approach to spending less time setting up while loops is to move the statement if bins[bin] <= x < bins[bin+1]: (or its C equivalent, see below) to before the while statements and do the while's only if the if statement fails.
Python code is shown below. Note, in translating from Python to C,
• # begins a comment
• def begins a function
• a statement like ntest, right, wrong, x = 10*nbins, 0, 0, 0 assigns values to respective identifiers
• a statement like return (ntest, right, wrong, stepdown, stepup) returns a tuple of 5 values that the caller can assign to a tuple or to respective identifiers
• the scope of a def, while, or if ends with a line not indented farther than the def, while, or if
• bins = [0] initializes a list (an extendible indexable array) with value 0 as its initial entry
• bins.append(t) appends value t at the end of list bins
• for i,j in enumerate(p): runs a loop over the elements of iterable p (in this case, p is a list), making the index i and corresponding entry j == p[i] available inside the loop
• range(nparts) stands for a list of the values 0, 1, ... nparts-1
• range(plo, phi) stands for a list of the values plo, plo+1, ... phi-1
• if bins[bin] <= x < bins[bin+1] means if ((bins[bin] <= x) && (x < bins[bin+1]))
• int(round(x*float(nparts)/topsize)) actually rounds x·r, instead of computing ⌊x·r⌋ as advertised above
import random  # needed at module level, since makebins() uses random.random()

def makebins(nbins, topsize):
    bins, t = [0], 0
    for i in range(nbins):
        t += random.random()
        bins.append(t)
    for i in range(nbins+1):
        bins[i] *= topsize/t
    bins.append(topsize+1)
    return bins
#________________________________________________________________
def showbins(bins):
    print ''.join('{:6.2f} '.format(x) for x in bins)
def showparts(nbins, bins, topsize, nparts, p):
    ratio = float(topsize)/nparts
    for i,j in enumerate(p):
        print '{:3d}. {:3d} {:6.2f} {:7.2f} '.format(i, j, bins[j], i*ratio)
    print 'nbins: {} topsize: {} nparts: {} ratio: {}'.format(nbins, topsize, nparts, ratio)
    print 'p = ', p
    print 'bins = ',
    showbins(bins)
#________________________________________________________________
def testparts(nbins, topsize, nparts, seed):
    # Make bins and make lookup table p
    if seed > 0: random.seed(seed)
    bins = makebins(nbins,topsize)
    ratio, j, p = float(topsize)/nparts, 0, range(nparts)
    for i in range(nparts):
        while j<nbins and i*ratio >= bins[j+1]:
            j += 1
        p[i] = j
    p.append(j)
    #showparts(nbins, bins, topsize, nparts, p)
    # Count # of hits and steps with avg. of 10 items per bin
    ntest, right, wrong, x = 10*nbins, 0, 0, 0
    delta, stepdown, stepup = topsize/float(ntest), 0, 0
    for i in range(ntest):
        bin = p[min(nparts, max(0, int(round(x*float(nparts)/topsize))))]
        while bin < nbins and x >= bins[bin+1]:
            bin += 1; stepup += 1
        while bin > 0 and x < bins[bin]:
            bin -= 1; stepdown += 1
        if bins[bin] <= x < bins[bin+1]: # Test if bin is correct
            right += 1
        else:
            wrong += 1
            print 'Wrong bin {} {:7.3f} at x={:7.3f} Too {}'.format(bin, bins[bin], x, 'high' if bins[bin] > x else 'low')
        x += delta
    return (ntest, right, wrong, stepdown, stepup)
#________________________________________________________________
def betterparts(nbins, topsize, seed, plo, phi):
    beststep = 1e9
    for parts in range(plo, phi):
        ntest, right, wrong, stepdown, stepup = testparts(nbins, topsize, parts, seed)
        if wrong: print 'Error with ', parts, ' parts'
        steps = stepdown + stepup
        if steps <= beststep:
            beststep = steps
            print 'At {:3d} parts, steps = {:d} = {:d}+{:d}'.format(parts, steps, stepdown, stepup)
#________________________________________________________________
Interpolation search is your friend. It's kind of an optimistic, predictive binary search where it guesses where the bin should be based on a linear assumption about the distribution of inputs, rather than just splitting the search space in half at each step. It will be O(1) if the linear assumption is true, but still works (though more slowly) when the assumption is not. To the degree that its predictions are accurate, the search is fast.
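As a rough illustration of the idea, here is a hedged sketch of interpolation search over sorted bin edges; the names and the toy edge values are made up, and it assumes edges[0] <= x < edges[-1].

# Illustrative sketch of interpolation search over sorted bin edges.
# edges[i] <= x < edges[i+1] means x falls in bin i (names are made up here).
def interp_bin(edges, x):
    lo, hi = 0, len(edges) - 2            # candidate bin indices
    while lo < hi:
        # guess where x sits, assuming edges are roughly linearly spaced
        span = edges[hi + 1] - edges[lo]
        guess = lo + int((hi - lo) * (x - edges[lo]) / span) if span else lo
        guess = min(max(guess, lo), hi)
        if x < edges[guess]:
            hi = guess - 1
        elif x >= edges[guess + 1]:
            lo = guess + 1
        else:
            return guess
    return lo

edges = [0.0, 1.0, 2.5, 4.0, 10.0, 20.0]  # 5 bins of unequal width
assert interp_bin(edges, 3.0) == 2
assert interp_bin(edges, 15.0) == 4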
It depends on the implementation of the hashing and the type of data you're working with. For smaller data sets a simpler algorithm like binary search might outperform constant lookup if the lookup overhead of hashing is larger on average.
The usual implementation of hashing consists of an array of linked lists and a hash function that maps a string to an index in the array of linked lists. There's a quantity called the load factor, which is the number of elements in the hash map divided by the length of the linked-list array. For load factors < 1 you'll achieve constant lookup in the best case, because no linked list will contain more than one element.
There's only one way to find out which is better: implement a hash map and see for yourself. You should be able to get something near constant lookup :)
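As a toy illustration of the chained layout and load factor described above (not a production hash map; all names are invented for the example):

# Toy chained hash map: an array of buckets (lists) plus a hash function.
# load factor = number of stored items / number of buckets.
class ChainedMap:
    def __init__(self, n_buckets=64):
        self.buckets = [[] for _ in range(n_buckets)]
        self.count = 0

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for pair in bucket:
            if pair[0] == key:
                pair[1] = value           # overwrite existing key
                return
        bucket.append([key, value])
        self.count += 1

    def get(self, key):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        raise KeyError(key)

    def load_factor(self):
        return self.count / len(self.buckets)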

What's the fastest way to find deepest path in a 3D array?

I've been trying to find a solution to my problem for more than a week, and I couldn't come up with anything better than a million-iteration program, so I think it's time to ask someone for help.
I've got a 3D array. Let's say we're talking about the ground, and the first layer is the surface.
The other layers are floors below the ground. I have to find the deepest path's length, the number of isolated caves underground, and the size of the biggest cave.
Here's the visualisation of my problem.
Input:
5 5 5 // x, y, z
xxxxx
oxxxx
xxxxx
xoxxo
ooxxx
xxxxx
xxoxx
and so...
Output:
5 // deepest path - starting from the surface
22 // size of the biggest cave
3 // number of isolated caves (red ones) (isolated - a cave that doesn't reach the surface)
Note that even though the red cell on the 2nd floor is placed next to a green one, it's not the same cave, because it's placed diagonally, and that doesn't count.
I've been told that the best way to do this might be a recursive "divide and conquer" algorithm, but I don't really know what it would look like.
I think you should be able to do it in O(N).
When you parse your input, assign each node a 'caveNumber' initialized to 0. Set it to a valid number whenever you visit a cave:
CaveCount = 0, IsolatedCaveCount = 0
AllSizes = new Vector.
For each node,
    ProcessNode(size:0, depth:0);

ProcessNode(size, depth):
    If node.isCave and !node.caveNumber
        if (size==0) ++CaveCount
        if (size==0 and depth!=0) IsolatedCaveCount++
        node.caveNumber = CaveCount
        AllSizes[CaveCount]++
        For each neighbor of node,
            if (goingDeeper) depth++
            ProcessNode(size+1, depth).
You will visit each node 7 times in the worst case: once from the outer loop, and possibly once from each of its six neighbors. But you'll only do work on each one once, since after that its caveNumber is set and you ignore it.
You can do the depth tracking by adding a depth parameter to the recursive ProcessNode call, and only incrementing it when visiting a lower neighbor.
The solution shown below (as a python program) runs in time O(n lg*(n)), where lg*(n) is the nearly-constant iterated-log function often associated with union operations in disjoint-set forests.
In the first pass through all cells, the program creates a disjoint-set forest, using routines called makeset(), findset(), link(), and union(), just as explained in section 22.3 (Disjoint-set forests) of edition 1 of Cormen/Leiserson/Rivest. In later passes through the cells, it counts the number of members of each disjoint forest, checks the depth, etc. The first pass runs in time O(n lg*(n)) and later passes run in time O(n) but by simple program changes some of the passes could run in O(c) or O(b) for c caves with a total of b cells.
Note that the code shown below is not subject to the error contained in a previous answer, where the previous answer's pseudo-code contains the line
if (size==0 and depth!=0) IsolatedCaveCount++
The error in that line is that a cave with a connection to the surface might have underground rising branches, which the other answer would erroneously add to its total of isolated caves.
The code shown below produces the following output:
Deepest: 5 Largest: 22 Isolated: 3
(Note that the count of 24 shown in your diagram should be 22, from 4+9+9.)
v=[0b0000010000000000100111000, # Cave map
   0b0000000100000110001100000,
   0b0000000000000001100111000,
   0b0000000000111001110111100,
   0b0000100000111001110111101]
nx, ny, nz = 5, 5, 5
inlay, ncells = (nx+1) * ny, (nx+1) * ny * nz
masks = []
for r in range(ny):
    masks += [2**j for j in range(nx*ny)][nx*r:nx*r+nx] + [0]
p = [-1 for i in range(ncells)]  # parent links
r = [ 0 for i in range(ncells)]  # rank
c = [ 0 for i in range(ncells)]  # forest-size counts
d = [-1 for i in range(ncells)]  # depths

def makeset(x): # Ref: CLR 22.3, Disjoint-set forests
    p[x] = x
    r[x] = 0
def findset(x):
    if x != p[x]:
        p[x] = findset(p[x])
    return p[x]
def link(x,y):
    if r[x] > r[y]:
        p[y] = x
    else:
        p[x] = y
        if r[x] == r[y]:
            r[y] += 1
def union(x,y):
    link(findset(x), findset(y))

fa = 0  # fa = floor above
bc = 0  # bc = floor's base cell #
for f in v:  # f = current-floor map
    cn = bc-1  # cn = cell#
    ml = 0
    for m in masks:
        cn += 1
        if m & f:
            makeset(cn)
            if ml & f:
                union(cn, cn-1)
            mr = m>>nx
            if mr and mr & f:
                union(cn, cn-nx-1)
            if m & fa:
                union(cn, cn-inlay)
        ml = m
    bc += inlay
    fa = f

for i in range(inlay):
    findset(i)
    if p[i] > -1:
        d[p[i]] = 0
for i in range(ncells):
    if p[i] > -1:
        c[findset(i)] += 1
        if d[p[i]] > -1:
            d[p[i]] = max(d[p[i]], i//inlay)
isola = len([i for i in range(ncells) if c[i] > 0 and d[p[i]] < 0])
print "Deepest:", 1+max(d), " Largest:", max(c), " Isolated:", isola
It sounds like you're solving a "connected components" problem. If your 3D array can be converted to a bit array (e.g. 0 = bedrock, 1 = cave, or vice versa) then you can apply a technique used in image processing to find the number and dimensions of either the foreground or background.
Typically this algorithm is applied in 2D images to find "connected components" or "blobs" of the same color. If possible, find a "single pass" algorithm:
http://en.wikipedia.org/wiki/Connected-component_labeling
The same technique can be applied to 3D data. Googling "connected components 3D" will yield links like this one:
http://www.ecse.rpi.edu/Homepages/wrf/pmwiki/pmwiki.php/Research/ConnectedComponents
Once the algorithm has finished processing your 3D array, you'll have a list of labeled, connected regions, and each region will be a list of voxels (volume elements analogous to image pixels). You can then analyze each labeled region to determine volume, closeness to the surface, height, etc.
Implementing these algorithms can be a little tricky, and you might want to try a 2D implementation first. Though it might not be as efficient as you'd like, you could create a 3D connected-component labeling algorithm by applying a 2D algorithm iteratively to each layer and then relabeling the connected regions from the top layer to the bottom layer:
For layer 0, find all connected regions using the 2D connected component algorithm
For layer 1, find all connected regions.
If any labeled pixel in layer 0 sits directly over a labeled pixel in layer 1, change all the labels in layer 1 to the label in layer 0.
Apply this labeling technique iteratively through the stack until you reach layer N.
One important consideration in connected-component labeling is how one decides that regions are connected. In a 2D image (or 2D array) of bits, we can consider either the "4-connected" region of neighbor elements
X 1 X
1 C 1
X 1 X
where "C" is the center element, "1" indicates neighbors that would be considered connected, and "X" are adjacent neighbors that we do not consider connected. Another option is to consider "8-connected neighbors":
1 1 1
1 C 1
1 1 1
That is, every element adjacent to a central pixel is considered connected. At first this may sound like the better option, but in real-world 2D image data a chessboard pattern of noise or a diagonal string of single noise pixels would be detected as one connected region, so we typically test for 4-connectivity.
For 3D data you can consider either 6-connectivity or 26-connectivity: 6-connectivity considers only the neighbor pixels that share a full cube face with the center voxel, and 26-connectivity considers every adjacent pixel around the center voxel. You mention that "diagonally placed" doesn't count, so 6-connectivity should suffice.
You can view it as a graph where (non-diagonally) adjacent elements are connected if they are both empty (part of a cave). Note that you don't have to convert it to a graph; you can use the normal 3D array representation.
Finding caves is then the same task as finding the connected components in a graph (O(N)), and the size of a cave is the number of nodes in that component.
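A minimal sketch of that connected-components view in plain Python, using a BFS flood fill with 6-connectivity over a 3D boolean grid; the function name and grid layout (grid[z][y][x], with z = 0 as the surface) are assumptions for illustration.

# Sketch: connected components over a 3D cave grid with 6-connectivity.
# grid[z][y][x] is True for a cave cell; layer z = 0 is the surface.
from collections import deque

def analyse(grid):
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    seen = set()
    deepest, largest, isolated = 0, 0, 0
    for z0 in range(nz):
        for y0 in range(ny):
            for x0 in range(nx):
                if not grid[z0][y0][x0] or (z0, y0, x0) in seen:
                    continue
                # BFS flood fill of one component
                queue, cells = deque([(z0, y0, x0)]), []
                seen.add((z0, y0, x0))
                while queue:
                    z, y, x = queue.popleft()
                    cells.append((z, y, x))
                    for dz, dy, dx in ((1,0,0),(-1,0,0),(0,1,0),(0,-1,0),(0,0,1),(0,0,-1)):
                        az, ay, ax = z+dz, y+dy, x+dx
                        if (0 <= az < nz and 0 <= ay < ny and 0 <= ax < nx
                                and grid[az][ay][ax] and (az, ay, ax) not in seen):
                            seen.add((az, ay, ax))
                            queue.append((az, ay, ax))
                largest = max(largest, len(cells))
                if any(z == 0 for z, _, _ in cells):      # touches the surface
                    deepest = max(deepest, 1 + max(z for z, _, _ in cells))
                else:
                    isolated += 1
    return deepest, largest, isolated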
