I am trying to analyze 2 billion rows (of text files in HDFS). Each file's lines contain an array of sorted integers:
[1,2,3,4]
The integer values can be 0 to 100,000. I am looking to overlap within each array of integers all possibly combinations (one-way aka 1,2 and 2,1 are not necessary). Then reduce and sum the counts of those overlaps. For example:
File:
[1,2,3,4]
[2,3,4]
Final Output:
(1,2) - 1
(1,3) - 1
(1,4) - 1
(2,3) - 2
(2,4) - 2
(3,4) - 2
The methodology that I have tried is using Apache Spark, to create a simple job that parallelizes the processing and reducing of blocks of data. However I am running into issues where the memory can't hold a hash of ((100,000)^2)/2 options and thus I am having to result in running traditional map reduce of map, sort, shuffle, reduce locally, sort, shuffle, reduce globally. I know creating the combinations is a double for loop so O(n^2) but what is the most efficient way to programmatically do this so I can minimally write to disk? I am trying to perform this task sub 2 hours on a cluster of 100 nodes (64gb ram/2 cores) Also any recommended technologies or frameworks. Below is what I have been using in Apache Spark and Pydoop. I tried using more memory optimized Hashs, however they still were too much memory.
import collection.mutable.HashMap
import collection.mutable.ListBuffer
def getArray(line: String):List[Int] = {
var a = line.split("\\x01")(1).split("\\x02")
var ids = new ListBuffer[Int]
for (x <- 0 to a.length - 1){
ids += Integer.parseInt(a(x).split("\\x03")(0))
}
return ids.toList
}
var textFile = sc.textFile("hdfs://data/")
val counts = textFile.mapPartitions(lines => {
val hashmap = new HashMap[(Int,Int),Int]()
lines.foreach( line => {
val array = getArray(line)
for((x,i) <- array.view.zipWithIndex){
for (j <- (i+1) to array.length - 1){
hashmap((x,array(j))) = hashmap.getOrElse((x,array(j)),0) + 1
}
}
})
hashmap.toIterator
}).reduceByKey(_ + _)
Also Tried PyDoop:
def mapper(_, text, writer):
columns = text.split("\x01")
slices = columns[1].split("\x02")
slice_array = []
for slice_obj in slices:
slice_id = slice_obj.split("\x03")[0]
slice_array.append(int(slice_id))
val array = getArray(line)
for (i, x) in enumerate(array):
for j in range(i+1, len(array) - 1):
write.emit((x,array[j]),1)
def reducer(key, vals, writer):
writer.emit(key, sum(map(int, vals)))
def combiner(key, vals, writer):
writer.count('combiner calls', 1)
reducer(key, vals, writer)
I think your problem can be reduced to word count where the corpus contains at most 5 billion distinct words.
In both of your code examples, you're trying to pre-count all of the items appearing in each partition and sum the per-partition counts during the reduce phase.
Consider the worst-case memory requirements for this, which occur when every partition contains all of the 5 billion keys. The hashtable requires at least 8 bytes to represent each key (as two 32-bit integers) and 8 bytes for the count if we represent it as a 64-bit integer. Ignoring the additional overheads of Java/Scala hashtables (which aren't insignificant), you may need at least 74 gigabytes of RAM to hold the map-side hashtable:
num_keys = 100000**2 / 2
bytes_per_key = 4 + 4 + 8
bytes_per_gigabyte = 1024 **3
hashtable_size_gb = (num_keys * bytes_per_key) / (1.0 * bytes_per_gigabyte)
The problem here is that the keyspace at any particular mapper is huge. Things are better at the reducers, though: assuming a good hash partitioning, each reducer processes an even share of the keyspace, so the reducers only require roughly (74 gigabytes / 100 machines) ~= 740 MB per machine to hold their hashtables.
Performing a full shuffle of the dataset with no pre-aggregation is probably a bad idea, since the 2 billion row dataset probably becomes much bigger once you expand it into pairs.
I'd explore partial pre-aggregation, where you pick a fixed size for your map-side hashtable and spill records to reducers once the hashtable becomes full. You can employ different policies, such as LRU or randomized eviction, to pick elements to evict from the hashtable. The best technique might depend on the distribution of keys in your dataset (if the distribution exhibits significant skew, you may see larger benefits from partial pre-aggregation).
This gives you the benefit of reducing the amount of data transfer for frequent keys while using a fixed amount of memory.
You could also consider using a disk-backed hashtable that can spill blocks to disk in order to limit its memory requirements.
Related
I need to use the lexical data from Google Books N-grams to construct a (sparse!) matrix of term co-occurrences (where rows are words and columns are the same words, and the cells reflect how many times they appear in the same context window). The resulting tcm would then be used to measure a bunch of lexical statistics and serve as input into vector semantics methods (Glove, LSA, LDA).
For reference, the Google Books (v2) dataset is formatted as follows (tab-separated)
ngram year match_count volume_count
some word 1999 32 12 # example bigram
However, problem is of course, these data be superhuge. Although, I will only need a subset of the data from certain decades (about 20 years worth of ngrams), and I am happy with a context window of up to 2 (i.e., use the trigram corpus). I have a few ideas but none seem particularly, well, good.
-Idea 1- initially was more or less this:
# preprocessing (pseudo)
for file in trigram-files:
download $file
filter $lines where 'year' tag matches one of years of interest
find the frequency of each of those ngrams (match_count)
cat those $lines * $match_count >> file2
# (write the same line x times according to the match_count tag)
remove $file
# tcm construction (using R)
grams <- # read lines from file2 into list
library(text2vec)
# treat lines (ngrams) as documents to avoid unrelated ngram overlap
it <- itoken(grams)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab, skip_grams_window = 2)
tcm <- create_tcm(it, vectorizer) # nice and sparse
However, I have a hunch this might not be the best solution. The ngram data files already contain the co-occurrence data in the form of n-grams, and there is a tag that gives the frequency. I have a feeling there should be a more direct way.
-Idea 2- I was also thinking of cat'ing each filtered ngram only once into the new file (instead of replicating it match_count times), then creating an empty tcm and then looping over the whole (year-filtered) ngram dataset and record instances (using the match_count tag) where any two words co-occur to populate the tcm. But, again, the data is big, and this kind of looping would probably take ages.
-Idea 3- I found a Python library called google-ngram-downloader that apparently has a co-occurrence matrix creation function, but looking at the code, it would create a regular (not sparse) matrix (which would be massive, given most entries are 0s), and (if I got it right) it simply loops through everything (and I assume a Python loop over this much data would be superslow), so it seems to be more aimed at rather smaller subsets of data.
edit -Idea 4- Came across this old SO question asking about using Hadoop and Hive for a similar task, with a a short answer with a broken link and a comment about MapReduce (none of which I am familiar with, so I would not know where to start).
But I'm thinking I can't be the first one with the need to tackle such a task, given the popularity of the Ngram dataset, and the popularity of (non-word2vec) distributed semantics methods that operate on a tcm or dtm input; hence ->
...the question: what would be a more reasonable/effective way of constructing a term-term co-occurrence matrix from Google Books Ngram data? (be it a variation of the proposed ideas of something completely different; R preferred but not necessary)
I will give an idea of how you can do this. But it can be improved in several places. I specially wrote in a "spagetti-style" for better interpretability, but it can be generalized to more than tri-grams
ngram_dt = data.table(ngram = c("as we know", "i know you"), match_count = c(32, 54))
# here we split tri-grams to obtain words
tokens_matrix = strsplit(ngram_dt$ngram, " ", fixed = T) %>% simplify2array()
# vocab here is vocabulary from chunk, but you can be interested first
# to create vocabulary from whole corpus of ngrams and filter non
# interesting/rare words
vocab = unique(tokens_matrix)
# convert char matrix to integer matrix for faster downstream calculations
tokens_matrix_int = match(tokens_matrix, vocab)
dim(tokens_matrix_int) = dim(tokens_matrix)
ngram_dt[, token_1 := tokens_matrix_int[1, ]]
ngram_dt[, token_2 := tokens_matrix_int[2, ]]
ngram_dt[, token_3 := tokens_matrix_int[3, ]]
dt_12 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_1, token_2)]
dt_23 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_2, token_3)]
# note here 0.5 - discount for more distant word - we follow text2vec discount of 1 / distance
dt_13 = ngram_dt[, .(cnt = 0.5 * sum(match_count)), keyby = .(token_1, token_3)]
dt = rbindlist(list(dt_12, dt_13, dt_23))
# "reduce" by word indices again - sum pair co-occurences which were in different tri-grams
dt = dt[, .(cnt = sum(cnt)), keyby = .(token_1, token_2)]
tcm = Matrix::sparseMatrix(i = dt$token_1, j = dt$token_2, x = dt$cnt, dims = rep(length(vocab), 2), index1 = T,
giveCsparse = F, check = F, dimnames = list(vocab, vocab))
I want to sample from a Scala array, the sample size can be much larger than the length of the array. How can I do this efficiently? By using the following code the running time is linear to the sample size, when the sample size is very big it is slow if we need to do the sampling many times:
def getSample(dataArray: Array[Double], sampleSize: Int, seed: Int): Array[Double] =
{
val arrLength = dataArray.length
val r = new scala.util.Random(seed)
Array.fill(sampleSize)(dataArray(r.nextInt(arrLength)))
}
val myArr= Array(1.0,5.0,9.0,4.0,7.0)
getSample(myArr, 100000, 28)
The probability that any given element of an array of length $n$ appears at least once in a sample of size $k$ is $1-(1-1/n)^k$. If this value is close to 1, which occurs when $k$ is large compared to $n$, then the following algorithm might be a good choice depending on your needs:
import org.apache.commons.math3.random.MersennseTwister
import org.apache.commons.math3.distribution.BinomialDistribution
def getSampleCounts[T](data: Array[T], k: Int, seed: Long): Array[Int] = {
val rng = new MersenneTwister(seed)
val counts = new Array[Int](data.length)
var i = k
do {
val j = new BinomialDistribution(rng.nextLong(), i, 1.0/i)
counts(i) = j
i -= j
} while (i > 0)
counts
}
Note that this algorithm does not return a sample. Instead it returns an Array[Int] whose $i$-th entry is equal to the number of times data(i) appears in the random sample. This may not be suitable for all applications, but for some use cases having the sample in the form of some sort of Iterable over (value, count) pairs (which can be obtained by data.view.zip(getSampleCounts(data, k, seed)), for example) is actually very convenient since it often enables us to do a computation once for groups of samples (since they are equal.) For example, suppose I had an expensive function f: T => Double and I wanted to compute the sample mean of f applied to a random sample of size $k$ draw from data. Then we could do the following:
data.view.zip(getSampleCounts(data, k, seed)).map({case (x, count) => f(x)*count}).sum/k
This computation for the sample mean evaluates f $n$ instead of $k$ times (recall that we are assuming that $k$ is large compared to $n$.)
Note that getSampleCounts will loop at most $n$ times where $n$ is data.length. Also, sampling from the binomial distribution in each iteration, assuming this is done in a reasonable fashion in the apache.commons.math3 library, should have complexity no worse than $O(\log k)$ (inverse CDF method and binary search.) So the complexity of the above algorithm is $O(n \log k)$ where $n$ is data.length and $k$ is the number of samples you want to draw.
There is no way around it. If you need to take N elements with constant time element access the complexity will be O(n) (linear) no matter what.
You can deffer/amortize the cost by making it lazy. For instance you can return a Stream or Iterator that evaluates each element as you access it. This will help you save on memory usage if you can fold that stream as you are consuming it. In other words you can skip the copy part and work directly with initial array - not always possible, depends on the task.
To make this sampling program run faster, use Akka actor framework to run the sampling jobs in parallel.
Create a master actor for distributing the sampling works to Worker actors and also to concatenate the elements from different workers. So each Worker actor would prepare/collect a fixed number of sample elements and give back the resulting collection as an immutable array to the master. Upon receiving the 'WorkDone' user-defined message from Worker, the Master actor concatenates the elements into the final collection.
it is easy with a list. Use the following implicit function
object ListImplicits {
implicit class SampledArray[T](in: List[T]) {
def sample(n: Int, seed:Option[Long]=None): List[T] = {
seed match {
case Some(s) => Random.setSeed(s)
case _ => // nothing
}
Random.shuffle(in).take(n)
}
}
}
And then import the object and use collection conversions to switch from Array to list (slight overhead):
import ListImplicits.SampledArray
val n = 100000
val list = (0 to n).toList.map(i => Random.nextInt())
val array = list.toArray
val t0 = System.currentTimeMillis()
array.toList.sample(5).toArray
val t1 = System.currentTimeMillis()
list.sample(5)
val t2 = System.currentTimeMillis()
println( "Array (conversion) => delta = " + (t1-t0) + " ms") // 10 ms
println( "List => delta = " + (t2-t1) + " ms") // 8 ms
For one of my homework problems, we had to write a function that creates an array containing n random numbers between 1 and 365. (Done). Then, check if any of these n birthdays are identical. Is there a shorter way to do this than doing several loops or several logical expressions?
Thank you!
CODE SO FAR, NOT DONE YET!!
function = [prob] bdayprob(N,n)
N = input('Please enter the number of experiments performed: N = ');
n = input('Please enter the sample size: n = ');
count = 0;
for(i=1:n)
x(i) = randi(365);
if(x(i)== x)
count = count + 1
end
return
If I'm interpreting your question properly, you want to check to see if generating n integers or days results in n unique numbers. Given your current knowledge in MATLAB, it's as simple as doing:
n = 30; %// Define sample size
N = 10; %// Define number of trials
%// Define logical array where each location tells you whether
%// birthdays were repeated for a trial
check = false(1, N);
%// For each trial...
for idx = 1 : N
%// Generate sample size random numbers
days = randi(365, n, 1);
%// Check to see if the total number of unique birthdays
%// are equal to the sample size
check(idx) = numel(unique(days)) == n;
end
Woah! Let's go through the code slowly shall we? We first define the sample size and the number of trials. We then specify a logical array where each location tells you whether or not there were repeated birthdays generated for that trial. Now, we start with a loop where for each trial, we generate random numbers from 1 to 365 that is of n or sample size long. We then use unique and figure out all unique integers that were generated from this random generation. If all of the birthdays are unique, then the total number of unique birthdays generated should equal the sample size. If we don't, then we have repeats. For example, if we generated a sample of [1 1 1 2 2], the output of unique would be [1 2], and the total number of unique elements is 2. Since this doesn't equal 5 or the sample size, then we know that the birthdays generated weren't unique. However, if we had [1 3 4 6 7], unique would give the same output, and since the output length is the same as the sample size, we know that all of the days are unique.
So, we check to see if this number is equal to the sample size for each iteration. If it is, then we output true. If not, we output false. When I run this code on my end, this is what I get for check. I set the sample size to 30 and the number of trials to be 10.
check =
0 0 1 1 0 0 0 0 1 0
Take note that if you increase the sample size, there is a higher probability that you will get duplicates, because randi can be considered as sampling with replacement. Therefore, the larger the sample size, the higher the chance of getting duplicate values. I made the sample size small on purpose so that we can see that it's possible to get unique days. However, if you set it to something like 100, or 200, you will most likely get check to be all false as there will most likely be duplicates per trial.
Here are some more approaches that avoid loops. Let
n = 20; %// define sample size
x = randi(365,n,1); %// generate n values between 1 and 365
Any of the following code snippets returns true (or 1) if there are two identical values in x, and false (or 0) otherwise:
Sort and then check if any two consecutive elements are the same:
result = any(diff(sort(x))==0);
Do all pairwise comparisons manually; remove self-pairs and duplicate pairs; and check if any of the remaining comparisons is true:
result = nnz(tril(bsxfun(#eq, x, x.'),-1))>0;
Compute the distance between distinct values, considering each pair just once, and then check if any distance is 0:
result = any(pdist(x(:))==0);
Find the number of occurrences of the most common value (mode):
[~, occurs] = mode(x);
result = occurs>1;
I don't know if I'm supposed to solve the problem for you, but perhaps a few hints may lead you in the right direction (besides I'm not a matlab expert so it will be in general terms):
Maybe not, but you have to ask yourself what they expect of you. The solution you propose requires you to loop through the array in two nested loops which will mean n*(n-1)/2 times through the loop (ie quadratic time complexity).
There are a number of ways you can improve the time complexity of the problem. The most straightforward would be to have a 365 element table where you can keep track if a particular number has been seen yet - which would require only a single loop (ie linear time complexity), but perhaps that's not what they're looking for either. But maybe that solution is a little bit ad-hoc? What we're basically looking for is a fast lookup if a particular number has been seen before - there exists more memory efficient structures that allows look up in O(1) time and O(log n) time (if you know these you have an arsenal of tools to use).
Then of course you could use the pidgeonhole principle to provide the answer much faster in some special cases (remember that you only asked to determine whether two or more numbers are equal or not).
So I have a problem which i'm pretty sure is solvable, but after many, many hours of thinking and discussion, only partial progress has been made.
The issue is as follows. I'm building a BTree of, potentially, a few million keys. When searching the BTree, it is paged on demand from disk into memory, and each page in operation is relatively expensive. This effectively means that we want to need to traverse as few nodes as possible (although after a node has been traversed to, the cost of traversing through that node, up to that node is 0). As a result, we don't want to waste space by having lots of nodes nearing minimum capacity. In theory, this should be preventable (within reason) as the structure of the tree is dependent on the order that the keys were inserted in.
So, the question is how to reorder the keys such that after the BTree is built the fewest number of nodes are used. Here's an example:
I did stumble on this question In what order should you insert a set of known keys into a B-Tree to get minimal height? which unfortunately asks a slightly different question. The answers, also don't seem to solve my problem. It is also worth adding that we want the mathematical guarantees that come from not building the tree manually, and only using the insert option. We don't want to build a tree manually, make a mistake, and then find it is unsearchable!
I've also stumbled upon 2 research papers which are so close to solving my question but aren't quite there!
Time-and Space-Optimality in B-Trees and Optimal 2,3-Trees (where I took the above image from in fact) discuss and quantify the differences between space optimal and space pessimal BTrees, but don't go as far as to describe how to design an insert order as far as I can see.
Any help on this would be greatly, greatly appreciated.
Thanks
Research papers can be found at:
http://www.uqac.ca/rebaine/8INF805/Automne2007/Sujets2007Automne/p174-rosenberg.pdf
http://scholarship.claremont.edu/cgi/viewcontent.cgi?article=1143&context=hmc_fac_pub
EDIT:: I ended up filling a btree skeleton constructed as described in the above papers with the FILLORDER algorithm. As previously mentioned, I was hoping to avoid this, however I ended up implementing it before the 2 excellent answers were posted!
The algorithm below should work for B-Trees with minimum number of keys in node = d and maximum = 2*d I suppose it can be generalized for 2*d + 1 max keys if way of selecting median is known.
Algorithm below is designed to minimize the number of nodes not just height of the tree.
Method is based on idea of putting keys into any non-full leaf or if all leaves are full to put key under lowest non full node.
More precisely, tree generated by proposed algorithm meets following requirements:
It has minimum possible height;
It has no more then two nonfull nodes on each level. (It's always two most right nodes.)
Since we know that number of nodes on any level excepts root is strictly equal to sum of node number and total keys number on level above we can prove that there is no valid rearrangement of nodes between levels which decrease total number of nodes. For example increasing number of keys inserted above any certain level will lead to increase of nodes on that level and consequently increasing of total number of nodes. While any attempt to decrease number of keys above the certain level will lead to decrease of nodes count on that level and fail to fit all keys on that level without increasing tree height.
It also obvious that arrangement of keys on any certain level is one of optimal ones.
Using reasoning above also more formal proof through math induction may be constructed.
The idea is to hold list of counters (size of list no bigger than height of the tree) to track how much keys added on each level. Once I have d keys added to some level it means node filled in half created in that level and if there is enough keys to fill another half of this node we should skip this keys and add root for higher level. Through this way, root will be placed exactly between first half of previous subtree and first half of next subtree, it will cause split, when root will take it's place and two halfs of subtrees will become separated. Place for skipped keys will be safe while we go through bigger keys and can be filled later.
Here is nearly working (pseudo)code, array needs to be sorted:
PushArray(BTree bTree, int d, key[] Array)
{
List<int> counters = new List<int>{0};
//skip list will contain numbers of nodes to skip
//after filling node of some order in half
List<int> skip = new List<int>();
List<Pair<int,int>> skipList = List<Pair<int,int>>();
int i = -1;
while(true)
{
int order = 0;
while(counters[order] == d) order += 1;
for(int j = order - 1; j >= 0; j--) counters[j] = 0;
if (counters.Lenght <= order + 1) counters.Add(0);
counters[order] += 1;
if (skip.Count <= order)
skip.Add(i + 2);
if (order > 0)
skipList.Add({i,order}); //list of skipped parts that will be needed later
i += skip[order];
if (i > N) break;
bTree.Push(Array[i]);
}
//now we need to add all skipped keys in correct order
foreach(Pair<int,int> p in skipList)
{
for(int i = p.2; i > 0; i--)
PushArray(bTree, d, Array.SubArray(p.1 + skip[i - 1], skip[i] -1))
}
}
Example:
Here is how numbers and corresponding counters keys should be arranged for d = 2 while first pass through array. I marked keys which pushed into the B-Tree during first pass (before loop with recursion) with 'o' and skipped with 'x'.
24
4 9 14 19 29
0 1 2 3 5 6 7 8 10 11 12 13 15 16 17 18 20 21 22 23 25 26 27 28 30 ...
o o x x o o o x x o o o x x x x x x x x x x x x o o o x x o o ...
1 2 0 1 2 0 1 2 0 1 2 0 1 ...
0 0 1 1 1 2 2 2 0 0 0 1 1 ...
0 0 0 0 0 0 0 0 1 1 1 1 1 ...
skip[0] = 1
skip[1] = 3
skip[2] = 13
Since we don't iterate through skipped keys we have O(n) time complexity without adding to B-Tree itself and for sorted array;
In this form it may be unclear how it works when there is not enough keys to fill second half of node after skipped block but we can also avoid skipping of all skip[order] keys if total length of array lesser than ~ i + 2 * skip[order] and skip for skip[order - 1] keys instead, such string after changing counters but before changing variable i might be added:
while(order > 0 && i + 2*skip[order] > N) --order;
it will be correct cause if total count of keys on current level is lesser or equal than 3*d they still are split correctly if add them in original order. Such will lead to slightly different rearrangement of keys between two last nodes on some levels, but will not break any described requirements, and may be it will make behavior more easy to understand.
May be it's reasonable to find some animation and watch how it works, here is the sequence which should be generated on 0..29 range: 0 1 4 5 6 9 10 11 24 25 26 29 /end of first pass/ 2 3 7 8 14 15 16 19 20 21 12 13 17 18 22 23 27 28
The algorithm below attempts to prepare the order the keys so that you don't need to have power or even knowledge about the insertion procedure. The only assumption is that overfilled tree nodes are either split at the middle or at the position of the last inserted element, otherwise the B-tree can be treated as a black box.
The trick is to trigger node splits in a controlled way. First you fill a node exactly, the left half with keys that belong together and the right half with another range of keys that belong together. Finally you insert a key that falls in between those two ranges but which belongs with neither; the two subranges are split into separate nodes and the last inserted key ends up in the parent node. After splitting off in this fashion you can fill the remainder of both child nodes to make the tree as compact as possible. This also works for parent nodes with more than two child nodes, just repeat the trick with one of the children until the desired number of child nodes is created. Below, I use what is conceptually the rightmost childnode as the "splitting ground" (steps 5 and 6.1).
Apply the splitting trick recursively, and all elements should end up in their ideal place (which depends on the number of elements). I believe the algorithm below guarantees that the height of the tree is always minimal and that all nodes except for the root are as full as possible. However, as you can probably imagine it is hard to be completely sure without actually implementing and testing it thoroughly. I have tried this on paper and I do feel confident that this algorithm, or something extremely similar, should do the job.
Implied tree T with maximum branching factor M.
Top procedure with keys of length N:
Sort the keys.
Set minimal-tree-height to ceil(log(N+1)/log(M)).
Call insert-chunk with chunk = keys and H = minimal-tree-height.
Procedure insert-chunk with chunk of length L, subtree height H:
If H is equal to 1:
Insert all keys from the chunk into T
Return immediately.
Set the ideal subchunk size S to pow(M, H - 1).
Set the number of subtrees T to ceil((L + 1) / S).
Set the actual subchunk size S' to ceil((L + 1) / T).
Recursively call insert-chunk with chunk' = the last floor((S - 1) / 2) keys of chunk and H' = H - 1.
For each of the ceil(L / S') subchunks (of size S') except for the last with index I:
Recursively call insert-chunk with chunk' = the first ceil((S - 1) / 2) keys of subchunk I and H' = H - 1.
Insert the last key of subchunk I into T (this insertion purposefully triggers a split).
Recursively call insert-chunk with chunk' = the remaining keys of subchunk I (if any) and H' = H - 1.
Recursively call insert-chunk with chunk' = the remaining keys of the last subchunk and H' = H - 1.
Note that the recursive procedure is called twice for each subtree; that is fine, because the first call always creates a perfectly filled half subtree.
Here is a way which would lead to minimum height in any BST (including b tree) :-
sort array
Say you can have m key in b tree
Divide array recursively in m+1 equal parts using m keys in parent.
construct the child tree of n/(m+1) sorted keys using recursion.
example : -
m = 2 array = [1 2 3 4 5 6 7 8 9 10]
divide array into three parts :-
root = [4,8]
recursively solve :-
child1 = [1 2 3]
root1 = [2]
left1 = [1]
right1 = [3]
similarly for all childs solve recursively.
So is this about optimising the creation procedure, or optimising the tree?
You can clearly create a maximally efficient B-Tree by first creating a full Balanced Binary Tree, and then contracting nodes.
At any level in a binary tree, the gap in numbers between two nodes contains all the numbers between those two values by the definition of a binary tree, and this is more or less the definition of a B-Tree. You simply start contracting the binary tree divisions into B-Tree nodes. Since the binary tree is balanced by construction, the gaps between nodes on the same level always contain the same number of nodes (assuming the tree is filled). Thus the BTree so constructed is guaranteed balanced.
In practice this is probably quite a slow way to create a BTree, but it certainly meets your criteria for constructing the optimal B-Tree, and the literature on creating balanced binary trees is comprehensive.
=====================================
In your case, where you might take an off the shelf "better" over a constructed optimal version, have you considered simply changing the number of children nodes can have? Your diagram looks like a classic 2-3 tree, but its perfectly possible to have a 3-4 tree, or a 3-5 tree, which means that every node will have at least three children.
Your question is about btree optimization. It is unlikely that you do this just for fun. So I can only assume that you would like to optimize data accesses - maybe as part of database programming or something like this. You wrote: "When searching the BTree, it is paged on demand from disk into memory", which means that you either have not enough memory to do any sort of caching or you have a policy to utilize as less memory as possible. In either way this may be the root cause for why any answer to your question will not be satisfying. Let me explain why.
When it comes to data access optimization, memory is your friend. It does not matter if you do read or write optimization you need memory. Any sort of write optimization always works on the assumption that it can read information in a quick way (from memory) - sorting needs data. If you do not have enough memory for read optimization you will not have that for write optimization too.
As soon as you are willing to accept at least some memory utilization you can rethink your statement "When searching the BTree, it is paged on demand from disk into memory", which makes up room for balancing between read and write optimization. A to maximum optimized BTREE is maximized write optimization. In most data access scenarios I know you get a write at any 10-100 reads. That means that a maximized write optimization is likely to give a poor performance in terms of data access optimization. That is why databases accept restructuring cycles, key space waste, unbalanced btrees and things like that...
Is there a more efficient approach to computing a histogram than a binary search for a non-linear bin distribution?
I'm actually only interested in the bit of the algorithm that matches the key (value) to the bin (the transfer function?) , i.e. for a bunch of floating point values I just want to know the appropriate bin index for each value.
I know that for a linear bin distribution you can get O(1) by dividing the value by the bin width, and that for non linear bins a binary search gets you O(logN). My current implementation uses a binary search on unequal bin widths.
In the spirit of improving efficiency I was curious as to whether you could use a hash function to map a value to its appropriate bin and achieve O(1) time complexity when you have bins of unequal widths?
In some simple cases you can get O(1).
Suppose, your values are 8-bit, from 0 to 255.
If you split them into 8 bins of sizes 2, 2, 4, 8, 16, 32, 64, 128, then the bin value ranges will be: 0-1, 2-3, 4-7, 8-15, 16-31, 32-63, 64-127, 128-255.
In binary these ranges look like:
0000000x (bin 0)
0000001x
000001xx
00001xxx
0001xxxx
001xxxxx
01xxxxxx
1xxxxxxx (bin 7)
So, if you can quickly (in O(1)) count how many most significant zero bits there are in the value, you can get the bin number from it.
In this particular case you may precalculate a look-up table of 256 elements, containing the bin number and finding the appropriate bin for a value is just one table look-up.
Actually, with 8-bit values you can use bins of arbitrary sizes since the look-up table is small.
If you were to go with bins of sizes of powers of 2, you could reuse this look-up table for 16-bit values as well. And you'd need two look-ups. You can extend it to even longer values.
Ordinary hash functions are intended to scatter different values quite randomly across some range. A single-bit difference in arguments may lead to dozens of bits different in results. For that reason, ordinary hash functions are not suitable for the situation described in the question.
An alternative is to build an array P with entries that index into the table B of bin limits. Given some value x, we find the bin j it belongs to (or sometimes a nearby bin) via j = P[⌊x·r⌋] where r is a ratio that depends on the size of P and the maximum value in B. The effectiveness of this approach depends on the values in B and the size of P.
The behavior of functions like P[⌊x·r⌋] can be seen via the python code shown below. (The method is about the same in any programming language. However, tips for Python-to-C are given below.) Suppose the code is stored in file histobins.py and loaded into the ipython interpreter with the command import histobins as hb. Then a command like hb.betterparts(27, 99, 9, 80,155) produces output like
At 80 parts, steps = 20 = 7+13
At 81 parts, steps = 16 = 7+9
At 86 parts, steps = 14 = 6+8
At 97 parts, steps = 13 = 12+1
At 108 parts, steps = 12 = 3+9
At 109 parts, steps = 12 = 8+4
At 118 parts, steps = 12 = 6+6
At 119 parts, steps = 10 = 7+3
At 122 parts, steps = 10 = 3+7
At 141 parts, steps = 10 = 5+5
At 142 parts, steps = 10 = 4+6
At 143 parts, steps = 9 = 7+2
These parameters to betterparts set nbins=27, topsize=99, seed=9, plo=80, phi=155 which creates a test set of 27 bins for values from 0 to 99, with random seed 9, and size of P from 80 to 155-1. The number of “steps” is the number of times the two while loops in testparts() operated during a test with 10*nbins values from 0 to topsize. Eg, “At 143 parts, steps = 9 = 7+2” means that when the size of P is 143, out of 270 trials, 261 times P[⌊x·r⌋] produced the correct index at once; 7 times the index had to be decreased, and twice it had to be increased.
The general idea of the method is to trade off space for time. Another tradeoff is preparation time versus operation time. If you are going to be doing billions of lookups, it is worthwhile to do a few thousand trials to find a good value of |P|, the size of P. If you are going to be doing only a few millions of lookups, it might be better to just pick some large value of |P| and run with it, or perhaps just run betterparts over a narrow range. Instead of doing 75 tests as above, if we start with larger |P| fewer tests may give a good enough result. For example, 10 tests via “hb.betterparts(27, 99, 9, 190,200)” produces
At 190 parts, steps = 11 = 5+6
At 191 parts, steps = 5 = 3+2
At 196 parts, steps = 5 = 4+1
As long as P fits into some level of cache (along with other relevant data) making |P| larger will speed up access. So, making |P| as large as practical is a good idea. As |P| gets larger, the difference in performance between one value of |P| and the next gets smaller and smaller. The limiting factors on speed then include time to multiply and time to set up while loops. One approach for faster multiplies may be to choose a power of 2 as a multiplier; compute |P| to match; then use shifts or adds to exponents instead of multiplies. One approach to spending less time setting up while loops is to move the statement if bins[bin] <= x < bins[bin+1]: (or its C equivalent, see below) to before the while statements and do the while's only if the if statement fails.
Python code is shown below. Note, in translating from Python to C,
• # begins a comment
• def begins a function
• a statement like ntest, right, wrong, x = 10*nbins, 0, 0, 0 assigns values to respective identifiers
• a statement like return (ntest, right, wrong, stepdown, stepup) returns a tuple of 5 values that the caller can assign to a tuple or to respective identifiers
• the scope of a def, while, or if ends with a line not indented farther than the def, while, or if
• bins = [0] initializes a list (an extendible indexable array) with value 0 as its initial entry
• bins.append(t) appends value t at the end of list bins
• for i,j in enumerate(p): runs a loop over the elements of iterable p (in this case, p is a list), making the index i and corresponding entry j == p[i] available inside the loop
• range(nparts) stands for a list of the values 0, 1, ... nparts-1
• range(plo, phi) stands for a list of the values plo, plo+1, ... phi-1
• if bins[bin] <= x < bins[bin+1] means if ((bins[bin] <= x) && (x < bins[bin+1]))
• int(round(x*float(nparts)/topsize))) actually rounds x·r, instead of computing ⌊x·r⌋ as advertised above
def makebins(nbins, topsize):
bins, t = [0], 0
for i in range(nbins):
t += random.random()
bins.append(t)
for i in range(nbins+1):
bins[i] *= topsize/t
bins.append(topsize+1)
return bins
#________________________________________________________________
def showbins(bins):
print ''.join('{:6.2f} '.format(x) for x in bins)
def showparts(nbins, bins, topsize, nparts, p):
ratio = float(topsize)/nparts
for i,j in enumerate(p):
print '{:3d}. {:3d} {:6.2f} {:7.2f} '.format(i, j, bins[j], i*ratio)
print 'nbins: {} topsize: {} nparts: {} ratio: {}'.format(nbins, topsize, nparts, ratio)
print 'p = ', p
print 'bins = ',
showbins(bins)
#________________________________________________________________
def testparts(nbins, topsize, nparts, seed):
# Make bins and make lookup table p
import random
if seed > 0: random.seed(seed)
bins = makebins(nbins,topsize)
ratio, j, p = float(topsize)/nparts, 0, range(nparts)
for i in range(nparts):
while j<nbins and i*ratio >= bins[j+1]:
j += 1
p[i] = j
p.append(j)
#showparts(nbins, bins, topsize, nparts, p)
# Count # of hits and steps with avg. of 10 items per bin
ntest, right, wrong, x = 10*nbins, 0, 0, 0
delta, stepdown, stepup = topsize/float(ntest), 0, 0
for i in range(ntest):
bin = p[min(nparts, max(0, int(round(x*float(nparts)/topsize))))]
while bin < nbins and x >= bins[bin+1]:
bin += 1; stepup += 1
while bin > 0 and x < bins[bin]:
bin -= 1; stepdown += 1
if bins[bin] <= x < bins[bin+1]: # Test if bin is correct
right += 1
else:
wrong += 1
print 'Wrong bin {} {:7.3f} at x={:7.3f} Too {}'.format(bin, bins[bin], x, 'high' if bins[bin] > x else 'low')
x += delta
return (ntest, right, wrong, stepdown, stepup)
#________________________________________________________________
def betterparts(nbins, topsize, seed, plo, phi):
beststep = 1e9
for parts in range(plo, phi):
ntest, right, wrong, stepdown, stepup = testparts(nbins, topsize, parts, seed)
if wrong: print 'Error with ', parts, ' parts'
steps = stepdown + stepup
if steps <= beststep:
beststep = steps
print 'At {:3d} parts, steps = {:d} = {:d}+{:d}'.format(parts, steps, stepdown, stepup)
#________________________________________________________________
Interpolation search is your friend. It's kind of an optimistic, predictive binary search where it guesses where the bin should be based on a linear assumption about the distribution of inputs, rather than just splitting the search space in half at each step. It will be O(1) if the linear assumption is true, but still works (though more slowly) when the assumption is not. To the degree that its predictions are accurate, the search is fast.
Depends on the implementation of the hashing and the type of data you're working with. For smaller data sets a more simple algorithm like binary search might outperform constant lookup if the lookup-overhead of hashing is larger on average.
The usual implementation of hashing, consists of an array of linked lists and a hashing function that maps a string to an index in the array of linked lists. There's a thing called the load factor, which is the number of elements in the hash map / length of the linked-list array. Thus for load factors < 1 you'll achieve constant lookup in the best case because no linked-list will contain more than one element (best case).
There's only one way to find out which is better - implement a hash map and see for yourself. You should be able to get something near constant lookup :)