Replacing cell array element with cell subarray

I have to implement a Huffman encoder in MATLAB/Octave (unfortunately), so I am using a cell array for maximum flexibility. I want to append symbol indices as they are merged to the earlier indices so I can track what symbols are merged as the processing goes up the tree.
Here's an example:
Index  Prob.           Index  Prob.           Index  Prob.
1      0.5             1      0.5             1      0.5
2      0.2             3,4    0.3 ---> 0.4    2,3,4  0.4
3      0.2 ---> 0.3    2      0.1 ---/
4      0.1 ---/
As you can see, symbols 3 and 4 get merged, and then the indices are merged and both lists are resorted in descending order of probability.
So I declared a cell array:
% cell index
myinds = num2cell(1:numel(probs));
D = 2; % binary
Unfortunately, when I try to merge the two, I get a size mismatch error:
% 3. add subtrees
myinds(end-(D-1)) = [myinds(end-(D-1)) mergedinds(2:end)];
% remove merged leaves
myinds((end-D):end) = []
The output of either side seems conceptually to be what I want, though:
octave> [myinds(end-(D-1)) mergedinds(2:end)]
ans =
{
[1,1] = 3
[1,2] = 4
}
octave> myinds(end-(D-1))
ans =
{
[1,1] = 6
}
I would like to store the Index column above as the algorithm steps through the illustrated process. I can just grow a matrix each time but that's slow and inefficient. As I understand it a cell array will do what I want, but I can't make it work.
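To make the intended bookkeeping concrete, here is a minimal sketch in Python rather than MATLAB/Octave. It assumes each node carries a list of the original symbol indices, which is the role the contents of a cell would play in a cell array. The function name merge_history and the use of integer weights (to avoid floating-point ties when re-sorting) are illustrative assumptions, not part of the original code.

```python
def merge_history(weights, D=2):
    """Return the Index column after every merge step (1-based symbol indices)."""
    # Each node is (weight, [symbol indices]); the list grows as nodes are merged.
    nodes = [(w, [i]) for i, w in enumerate(weights, start=1)]
    nodes.sort(key=lambda node: -node[0])      # descending, stable on ties
    history = [[idx for _, idx in nodes]]
    while len(nodes) > 1:
        tail = nodes[-D:]                      # the D least probable nodes
        del nodes[-D:]
        merged_weight = sum(w for w, _ in tail)
        merged_idx = sorted(i for _, idx in tail for i in idx)
        nodes.append((merged_weight, merged_idx))
        nodes.sort(key=lambda node: -node[0])  # re-sort after every merge
        history.append([idx for _, idx in nodes])
    return history


# Weights in percent, standing in for probabilities 0.5, 0.2, 0.2, 0.1.
for column in merge_history([50, 20, 20, 10]):
    print(column)
```

Each printed row is one Index column of the kind shown in the example: the merged entry simply concatenates the index lists of the nodes it replaces.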

Extracting positions of elements from two Matlab vectors satisfying some criteria

Consider three row vectors in Matlab, A, B, C, each with size 1xJ. I want to construct a matrix D of size Kx3 listing every triplet (a,b,c) such that:
a is the position in A of A(a).
b is the position in B of B(b).
A(a)-B(b) is an element of C.
c is the position in C of A(a)-B(b).
A(a) and B(b) are different from Inf, -Inf.
For example,
A=[-3 3 0 Inf -Inf];
B=[-2 2 0 Inf -Inf];
C=[Inf -Inf -1 1 0];
D=[1 1 3; %-3-(-2)=-1
2 2 4; % 3-2=1
3 3 5]; % 0-0=0
I would like this code to be efficient, because in my real example I have to repeat it many times.
This question relates to my previous question here, but now I'm looking for the positions of the elements.

You can use combvec (part of the Deep Learning Toolbox; ndgrid is one toolbox-free alternative among many) to get all pairings of indices a and b for the corresponding arrays A and B. Then it's simply a case of following your criteria:
Find the differences
Check which differences are in C
Remove elements you don't care about
Like so:
% Generate all index pairings
D = combvec( 1:numel(A), 1:numel(B) ).';
% Calculate deltas
delta = A(D(:,1)) - B(D(:,2));
delta = delta(:); % make it a column
% Get delta index in C (0 if not present)
[~,D(:,3)] = ismember(delta,C);
% If A or B are inf then the delta is Inf or NaN, remove these
idxRemove = isinf(delta) | isnan(delta) | D(:,3) == 0;
D(idxRemove,:) = [];
For your example, this yields the expected results from the question.
You said that A and B are at most 7 elements long, so you have up to 49 pairings to check. This isn't too bad, but note that the number of pairings is numel(A)*numel(B), so it grows quickly for larger inputs.
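For readers who want to check the logic outside MATLAB, the same pairing idea can be sketched in plain Python. This is only an illustration under stated assumptions: the function name matching_triplets is made up, positions are reported 1-based to match the question, and a dictionary lookup stands in for ismember (first occurrence wins).

```python
import math
from itertools import product


def matching_triplets(A, B, C):
    """Return (a, b, c) with 1-based positions such that A[a] - B[b] == C[c]."""
    # Map each value of C to the position of its first occurrence.
    first_pos = {}
    for pos, value in enumerate(C, start=1):
        first_pos.setdefault(value, pos)

    rows = []
    for (a, x), (b, y) in product(enumerate(A, 1), enumerate(B, 1)):
        if math.isinf(x) or math.isinf(y):
            continue                      # infinite entries are excluded
        c = first_pos.get(x - y)
        if c is not None:
            rows.append((a, b, c))
    return rows


inf = float("inf")
A = [-3, 3, 0, inf, -inf]
B = [-2, 2, 0, inf, -inf]
C = [inf, -inf, -1, 1, 0]
print(matching_triplets(A, B, C))  # [(1, 1, 3), (2, 2, 4), (3, 3, 5)]
```

The rows come out ordered by a first and then b, which may differ from the row order combvec produces for other inputs.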

Stuck in implementing a method for mapping symbols to an interval - if-else loop not working properly, implementation does not match theory

I am trying out an encoding - decoding method that had been asked in this post
https://stackoverflow.com/questions/40820958/matlab-help-in-implementing-a-mathematical-equation-for-generating-multi-level
and a related one Generate random number with given probability matlab
There are 2 parts to this question - encoding and decoding. Encoding of a symbolic sequence is done using inverse interval mapping using the map f_inv. The method of inverse interval mapping yields a real valued number. Based on the real valued number, we iterate the map f(). The solution in the post in the first link does not work - because once the final interval is found, the iteration of the map f() using the proposed solution does not yield the same exact symbolic array. So, I tried by directly implementing the equations for the forward iteration f() given in the paper for the decoding process, but the decoding does not generate the same symbolic sequence.
Here is a brief explanation of the problem.
Let there be an array b = [1,3,2,6,1] containing N = 5 integer-valued elements, in which the unique integers 1, 2, 3 and 6 occur with probabilities 0.4, 0.2, 0.2 and 0.2 respectively. The array b can take any integers from the unique symbol set 1,2,3,4,5,6,7,8. Let n = 8 be the number of elements in the symbol set. In essence, the probability for the above data b is
p= [ 0.4 (for symbol 1), 0.2 (for symbol 2) , 0.2 (symbol 3), 0 (for symbol 4 not occurring), 0 (for symbol 5), 0.2(for symbol 6), 0 (for symbol 7), 0 (for symbol 8)]
The interval [0,1] is split into 8 regions. Let the interval for the data b be assumed known as
Interval_b = [0, 0.4, 0.6, 0.8, 1];
In general, for n = 8 unique symbols, there are n = 8 intervals I_1, I_2, I_3, I_4, I_5, I_6, I_7, I_8, and each of these intervals is assigned a symbol from [ 1 2 3 4 5 6 7 8]
Let x = 0.2848 be the value obtained from the reverse interval mapping for the symbol array b, using the solution for the encoding procedure in the link. There is a mapping rule which maps x to a symbol depending on the interval in which x lies, and we should obtain the same symbol elements as in b. The rule is

It looks like the argument Interval passed to the function ObtainSymbols should contain entries for all symbols, including the ones with probability 0. This can be done by adding the statement
Interval = cumsum([0, p_arr]);
immediately before the calls to function ObtainSymbols.
The following is the output with this modification:
...
p_arr = [p_1,p_2,p_3,p_4,p_5,p_6,p_7,p_8];
% unchanged script above this
% recompute Interval for all symbols
Interval = cumsum([0, p_arr]);
% [0 0.4 0.6 0.8 0.8 0.8 1.0 1.0 1.0]
% unchanged script below
[y1,symbol1] = ObtainSymbols(x(1),p_arr,Interval);
[y2,symbol2] = ObtainSymbols(y1,p_arr,Interval);
[y3,symbol3] = ObtainSymbols(y2,p_arr,Interval);
[y4,symbol4] = ObtainSymbols(y3,p_arr,Interval);
[y5,symbol5] = ObtainSymbols(y4,p_arr,Interval);
Symbols = [symbol1,symbol2,symbol3,symbol4,symbol5]
y = [y1,y2,y3,y4,y5]
% Symbols = [1 3 2 6 1]
% y = [0.7136 0.5680 0.8400 0.2000 0.5000]
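As a rough illustration of why the zero-probability entries matter, the lookup step alone can be sketched in Python. This does not reproduce ObtainSymbols or the map f(), whose rule is not shown here; the helper names make_interval and symbol_for are assumptions, and the x values fed in are simply the iterates reported in the output above.

```python
from bisect import bisect_right
from itertools import accumulate


def make_interval(p_arr):
    """Cumulative boundaries, one per symbol plus a leading 0 (like cumsum([0, p_arr]))."""
    return [0.0] + list(accumulate(p_arr))


def symbol_for(x, interval):
    """1-based symbol whose half-open interval [lo, hi) contains x, for 0 <= x < 1."""
    n = len(interval) - 1
    # Zero-width intervals are skipped automatically: they never contain x.
    return min(bisect_right(interval, x), n)


p_arr = [0.4, 0.2, 0.2, 0, 0, 0.2, 0, 0]
interval = make_interval(p_arr)
xs = [0.2848, 0.7136, 0.5680, 0.8400, 0.2000]   # x followed by y1..y4 from the output
print([symbol_for(x, interval) for x in xs])    # [1, 3, 2, 6, 1]
```

With only the five boundaries [0, 0.4, 0.6, 0.8, 1], the fourth region would be numbered 4 instead of 6, which is the mismatch the cumsum fix removes.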

Finding indexes of maximum values of an array

How do I find the index of the 2 maximum values of a 1D array in MATLAB? Mine is an array with a list of different scores, and I want to print the 2 highest scores.

You can use sort, as @LuisMendo suggested:
[B,I] = sort(array,'descend');
This gives you the sorted version of your array in the variable B and the indexes of the original position in I sorted from highest to lowest. Thus, B(1:2) gives you the highest two values and I(1:2) gives you their indices in your array.

I'll go for an O(k*n) solution, where k is the number of maximum values you're looking for, rather than O(n log n):
x = [3 2 5 4 7 3 2 6 4];
y = x; %// make a copy of x because we're going to modify it
[~, m(1)] = max(y);
y(m(1)) = -Inf;
[~, m(2)] = max(y);
m =
5 8
This is only practical if k is less than log n. In fact, if k>=3 I would put it in a loop, which may offend the sensibilities of some. ;)
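The loop version mentioned above might look like the following Python sketch. The function name top_k_indices is an assumption, and the indices are 0-based here, so MATLAB's 5 and 8 become 4 and 7.

```python
def top_k_indices(values, k):
    """Indices of the k largest values, found with k linear passes (O(k*n))."""
    work = list(values)          # copy, because entries are overwritten below
    picked = []
    for _ in range(min(k, len(work))):
        best = max(range(len(work)), key=work.__getitem__)
        picked.append(best)
        work[best] = float("-inf")   # mask the winner before the next pass
    return picked


x = [3, 2, 5, 4, 7, 3, 2, 6, 4]
print(top_k_indices(x, 2))  # [4, 7]
```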

To get the indices of the two largest elements: use the second output of sort to get the sorted indices, and then pick the last two:
x = [3 2 5 4 7 3 2 6 4];
[~, ind] = sort(x);
result = ind(end-1:end);
In this case,
result =
8 5

Counting according to query

Given an array A of N positive elements. Let's suppose we list all N × (N+1) / 2 non-empty contiguous subarrays of the array A and then replace each subarray with the maximum element present in it. So now we have N × (N+1) / 2 elements, where each element is the maximum of its subarray.
Now we have Q queries, where each query is one of 3 types:
1 K : We need the count of numbers strictly greater than K among those N × (N+1) / 2 elements.
2 K : We need the count of numbers strictly less than K among those N × (N+1) / 2 elements.
3 K : We need the count of numbers equal to K among those N × (N+1) / 2 elements.
The main problem I am facing is that N can be up to 10^6, so I can't generate all those N × (N+1) / 2 elements. Please help me solve this problem.
Example : Let N=3 and we have Q=2. Let array A be [1,2,3] then all sub arrays are :
[1] -> [1]
[2] -> [2]
[3] -> [3]
[1,2] -> [2]
[2,3] -> [3]
[1,2,3] -> [3]
So now we have [1,2,3,2,3,3]. As Q=2:
Query 1 : 3 3
It means we need the count of numbers equal to 3. The answer is 3, as there are 3 numbers equal to 3 in the generated array.
Query 2 : 1 4
It means we need the count of numbers greater than 4. The answer is 0, as no number in the generated array is greater than 4.
Both N and Q can be up to 10^6. How can this problem be solved, and which data structure is suitable for it?

I believe I have a solution in O(N + Q*log N), plus one sort of the distinct values during preparation (O(N log N) in general). The trick is to do a lot of preparation with your array before even the first query arrives.
For each number, figure out where is the first number on left / right of this number that is strictly bigger.
Example: for array: 1, 8, 2, 3, 3, 5, 1 both 3's left block would be position of 8, right block would be the position of 5.
This can be determined in linear time. This is how: keep the previous maximums in a stack. When a new number arrives, remove maximums from the stack until you get to an element strictly bigger than the current one. Illustration:
Suppose the stack currently holds [15, 13, 11, 10, 7, 3] (you will of course keep the indexes, not the values; I just use values for better readability).
Now we read 8. 8 >= 3, so we remove 3 from the stack and repeat. 8 >= 7, remove 7. 8 < 10, so we stop removing. We set 10 as 8's left block, and add 8 to the maximums stack.
Also, whenever you remove from the stack (3 and 7 in this example), set the right block of the removed number to the current number. One problem though: the right block would be set to the next number bigger or equal, not strictly bigger. You can fix this by simply checking and relinking right blocks.
Compute how many times each number is the maximum of some subarray.
Since for each number you now know where the next bigger number on the left / right is, I trust you with finding the appropriate math formula for this. (For an element at index i with left block l and right block r it is (i - l) * (r - i), provided ties between equal values are broken consistently so that no subarray is counted twice.)
Then, store the results in a hashmap: the key is the value of a number, and the value is how many times that number is the maximum of some subarray. For example, record [4->12] would mean that number 4 is the maximum in 12 subarrays.
Lastly, extract all key-value pairs from the hashmap into an array, and sort that array by the keys. Finally, create a prefix sum for the values of that sorted array.
Handle a request
For request "exactly k", just binary search in your array; for more/less than k, binary search for key k and then use the prefix array.
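The steps above can be sketched in Python as follows. This is a minimal illustration, not the answerer's code: the names build_index and answer are assumptions, and it deliberately keeps the asymmetric convention (strictly greater on the left, greater-or-equal on the right) instead of relinking, since that already makes each subarray count toward exactly one of several equal maxima.

```python
from bisect import bisect_left, bisect_right
from collections import Counter
from itertools import accumulate


def build_index(a):
    """Return (keys, prefix): sorted distinct values and cumulative subarray counts."""
    n = len(a)
    left = [-1] * n    # previous index holding a strictly greater value
    right = [n] * n    # next index holding a greater-or-equal value
    stack = []         # indices whose values are strictly decreasing
    for i, v in enumerate(a):
        while stack and a[stack[-1]] <= v:
            right[stack.pop()] = i
        if stack:
            left[i] = stack[-1]
        stack.append(i)

    counts = Counter()
    for i, v in enumerate(a):
        counts[v] += (i - left[i]) * (right[i] - i)

    keys = sorted(counts)
    prefix = [0] + list(accumulate(counts[k] for k in keys))
    return keys, prefix


def answer(keys, prefix, kind, k):
    """kind 1: maxima > k, kind 2: maxima < k, kind 3: maxima == k."""
    below = prefix[bisect_left(keys, k)]    # maxima strictly less than k
    up_to = prefix[bisect_right(keys, k)]   # maxima less than or equal to k
    if kind == 1:
        return prefix[-1] - up_to
    if kind == 2:
        return below
    return up_to - below


keys, prefix = build_index([1, 2, 3])
print(answer(keys, prefix, 3, 3))  # 3
print(answer(keys, prefix, 1, 4))  # 0
```

Preparation costs one linear stack pass plus a sort of the distinct values; each query is then two binary searches.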

This answer is an adaptation of this other answer I wrote earlier. The first part is exactly the same, but the others are specific for this question.
Here's an O(n log n + q log n) implementation using a simplified version of a segment tree.
Creating the segment tree: O(n)
In practice, what it does is to take an array, let's say:
A = [5,1,7,2,3,7,3,1]
And construct an array-backed tree that looks like this:
In the tree, the first number is the value and the second is the index where it appears in the array. Each node is the maximum of its two children. This tree is backed by an array (pretty much like a heap tree) where the children of the index i are in the indexes i*2+1 and i*2+2.
Then, for each element, it becomes easy to find the nearest greater elements (before and after each element).
To find the nearest greater element to the left, we go up in the tree searching for the first parent where the left node has value greater and the index lesser than the argument. The answer must be a child of this parent, then we go down in the tree looking for the rightmost node that satisfies the same condition.
Similarly, to find the nearest greater element to the right, we do the same, but looking for a right node with an index greater than the argument. And when going down, we look for the leftmost node that satisfies the condition.
Creating the cumulative frequency array: O(n log n)
From this structure, we can compute the frequency array, that tells how many times each element appears as maximum in the subarray list. We just have to count how many lesser elements are on the left and on the right of each element and multiply those values. For the example array ([1, 2, 3]), this would be:
[(1, 1), (2, 2), (3, 3)]
This means that 1 appears only once as maximum, 2 appears twice, etc.
But we need to answer range queries, so it's better to have a cumulative version of this array, that would look like:
[(1, 1), (2, 3), (3, 6)]
The (3, 6) means, for example, that there are 6 subarrays with maxima less than or equal to 3.
Answering q queries: O(q log n)
Then, to answer each query, you just have to make binary searches to find the value you want. For example, if you need to find the exact number of 3, you may want to do: query(F, 3) - query(F, 2). If you want to find those lesser than 3, you do: query(F, 2). If you want to find those greater than 3: query(F, float('inf')) - query(F, 3).
Implementation
I've implemented it in Python and it seems to work well.
import bisect
from collections import defaultdict
from math import log, ceil
# padding for unused leaves; it compares below every real (value, index) pair
EMPTY = (float('-inf'), -1)
def make_tree(A):
    # leaves hold (value, index); each parent holds the max of its two children
    n = 2**(int(ceil(log(len(A), 2))))
    T = [EMPTY]*(2*n-1)
    for i, x in enumerate(A):
        T[n-1+i] = (x, i)
    for i in reversed(range(n-1)):
        T[i] = max(T[i*2+1], T[i*2+2])
    return T
def print_tree(T):
    # dump the tree in Graphviz dot format
    print('digraph {')
    for i, x in enumerate(T):
        print(' ' + str(i) + '[label="' + str(x) + '"]')
        if i*2+2 < len(T):
            print(' ' + str(i)+ '->'+ str(i*2+1))
            print(' ' + str(i)+ '->'+ str(i*2+2))
    print('}')
def find_generic(T, i, fallback, check, first, second):
    j = len(T)//2+i
    original = T[j]
    j = (j-1)//2
    #go up in the tree searching for a value that satisfies check
    while j > 0 and not check(T[second(j)], original):
        j = (j-1)//2
    #go down in the tree searching for the left/rightmost node that satisfies check
    while j*2+1<len(T):
        if check(T[first(j)], original):
            j = first(j)
        elif check(T[second(j)], original):
            j = second(j)
        else:
            return fallback
    return j-len(T)//2
def find_left(T, i, fallback):
    return find_generic(T, i, fallback,
        lambda a, b: a[0]>b[0] and a[1]<b[1], #value greater, index before
        lambda j: j*2+2, #rightmost first
        lambda j: j*2+1 #leftmost second
    )
def find_right(T, i, fallback):
    return find_generic(T, i, fallback,
        lambda a, b: a[0]>=b[0] and a[1]>b[1], #value greater or equal, index after
        lambda j: j*2+1, #leftmost first
        lambda j: j*2+2 #rightmost second
    )
def make_frequency_array(A):
    T = make_tree(A)
    D = defaultdict(int)
    for i, x in enumerate(A):
        left = find_left(T, i, -1)
        right = find_right(T, i, len(A))
        D[x] += (i-left) * (right-i)
    F = sorted(D.items())
    #turn the per-value counts into cumulative counts
    for i in range(1, len(F)):
        F[i] = (F[i][0], F[i-1][1] + F[i][1])
    return F
def query(F, n):
    #number of subarrays whose maximum is <= n (0 if n is below every key)
    idx = bisect.bisect_right(F, (n, float('inf')))
    return F[idx-1][1] if idx > 0 else 0
F = make_frequency_array([1,2,3])
print(query(F, 3)-query(F, 2)) #3 3
print(query(F, float('inf'))-query(F, 4)) #1 4
print(query(F, float('inf'))-query(F, 1)) #1 1
print(query(F, 2)) #2 3

Your problem can be divided into several steps:
For each element of the initial array, calculate the number of "subarrays" where the current element is the maximum. This will involve a bit of combinatorics. First you need to know, for each element, the index of the previous and next element that is bigger than the current element. Then calculate the number of subarrays as (i - iprev) * (inext - i). Finding iprev and inext requires two traversals of the initial array: in forward and backward order. For iprev you need to traverse your array left to right. During the traversal, maintain a BST that contains the biggest of the previous elements along with their indexes. For each element of the original array, find the minimal element in the BST that is bigger than the current one. Its index, stored as the value, will be iprev. Then remove from the BST all elements that are smaller than the current one. This operation should be O(logN), as you are removing whole subtrees. This step is required, as the current element you are about to add will "override" all elements that are less than it. Then add the current element to the BST with its index as the value. At each point in time, the BST will store the descending subsequence of previous elements where each element is bigger than all the elements that follow it in the array (for previous elements {1,2,44,5,2,6,26,6} the BST will store {44,26,6}). The backward traversal to find inext is similar.
After the previous step you'll have pairs K→P, where K is the value of some element from the initial array and P is the number of subarrays where this element is the maximum. Now you need to group these pairs by K. This means calculating the sum of the P values of the elements with equal K. Be careful about the corner cases where two equal elements could share the same subarrays.
As Ritesh suggested: put all grouped K→P into an array, sort it by K and calculate the cumulative sum of P for each element in one pass. In this case your queries will be binary searches in this sorted array. Each query will be performed in O(log(N)) time.

Create a sorted value-to-index map. For example,
[34,5,67,10,100] => {5:1, 10:3, 34:0, 67:2, 100:4}
Precalculate the queries in two passes over the value-to-index map:
Top to bottom - maintain an augmented tree of intervals. Each time an index is added,
split the appropriate interval and subtract the relevant segments from the total:
indexes    intervals     total sub-arrays with maximum greater than
4          (0,3)         67 => 15 - (4*5/2) = 5
2,4        (0,1)(3,3)    34 => 5 + (4*5/2) - 2*3/2 - 1 = 11
0,2,4      (1,1)(3,3)    10 => 11 + 2*3/2 - 1 = 13
3,0,2,4    (1,1)         5 => 13 + 1 = 14
Bottom to top - maintain an augmented tree of intervals. Each time an index is added,
adjust the appropriate interval and add the relevant segments to the total:
indexes    intervals     total sub-arrays with maximum less than
1          (1,1)         10 => 1*2/2 = 1
1,3        (1,1)(3,3)    34 => 1 + 1*2/2 = 2
0,1,3      (0,1)(3,3)    67 => 2 - 1 + 2*3/2 = 4
0,1,3,2    (0,3)         100 => 4 - 4 + 4*5/2 = 10
The third query can be pre-calculated along with the second:
indexes    intervals     total sub-arrays with maximum exactly
1          (1,1)         5 => 1
1,3        (3,3)         10 => 1
0,1,3      (0,1)         34 => 2
0,1,3,2    (0,3)         67 => 3 + 3 = 6
Insertion and deletion in augmented trees are of O(log n) time-complexity. Total precalculation time-complexity is O(n log n). Each query after that ought to be O(log n) time-complexity.

Is there a more elegant way of doing this?

Given an array of positive integers a, I want to output an array of integers b so that b[i] is the closest number to a[i] that is smaller than a[i] and is in {a[0], ... a[i-1]}. If no such number exists, then b[i] = -1.
Example:
a = 2 1 7 5 7 9
b = -1 -1 2 2 5 7
b[0] = -1 since there is no number that is smaller than 2
b[1] = -1 since there is no number that is smaller than 1 from {2}
b[2] = 2, closest number to 7 that is smaller than 7 from {2,1} is 2
b[3] = 2, closest number to 5 that is smaller than 5 from {2,1,7} is 2
b[4] = 5, closest number to 7 that is smaller than 7 from {2,1,7,5} is 5
I was thinking about implementing balanced binary tree, however it will require a lot of work. Is there an easier way of doing this?

Here is one approach:
b[0] = -1 // nothing precedes the first element
for i ← 1 to (length(A)-1) {
    // A[0 .. i-1] is already sorted; save A[i] to make a hole at index j
    item = A[i]
    j = i
    // keep moving the hole to the next smaller index until A[j - 1] is <= item
    while j > 0 and A[j - 1] > item {
        A[j] = A[j - 1] // move hole to next smaller index
        j = j - 1
    }
    A[j] = item // put item in the hole
    // if there is an element to the left of A[j] in the sorted sequence A[0 .. i], store it in b
    // TODO : run loop so that duplicate entries won't hamper results
    if j > 0
        b[i] = A[j-1]
    else
        b[i] = -1
}
Dry run (positions are 1-based here):
a = 2 1 7 5 7 9
a[1] = 2
it's straightforward, set b[1] to -1
a[2] = 1
insert into subarray : [1, 2]
any elements before 1 in sorted array ? no.
So set b[2] to -1. b: [-1, -1]
a[3] = 7
insert into subarray : [1, 2, 7]
any elements before 7 in sorted array ? yes, it's 2
So set b[3] to 2. b: [-1, -1, 2]
a[4] = 5
insert into subarray : [1, 2, 5, 7]
any elements before 5 in sorted array ? yes, it's 2
So set b[4] to 2. b: [-1, -1, 2, 2]
and so on..

Here's a sketch of a (nearly) O(n log n) algorithm that's somewhere in between the difficulty of implementing an insertion sort and balanced binary tree: Do the problem backwards, use merge/quick sort, and use binary search.
Pseudocode:
let c be a copy of a
let b be an array sized the same as a
sort c using an O(n log n) algorithm
for i from a.length-1 to 1
    binary search over c for key a[i] // O(log n) time
    remove the item found // Could take O(n) time
    if there exists an item to the left of that position, b[i] = that item
    otherwise, b[i] = -1
b[0] = -1
return b
There are a few implementation details that can make this have poor runtime.
For instance, since you have to remove items, doing this on a regular array and shifting things around will make this algorithm still take O(n^2) time. So, you could store key-value pairs instead. One would be the key, and the other would be the number of those keys (kind of like a multiset implemented on an array). "Removing" one would just be decrementing the second item of the pair and so on.
Eventually you will be left with a bunch of 0-value keys. This would eventually make the if there exists an item to the left take roughly O(n) time, and therefore, the entire algorithm would degrade to a O(n^2) for that reason. So another optimization might be to batch remove all of them periodically. For instance, when 1/2 of them are 0-values, perform a pruning.
The ideal option might be to implement another data structure that has a much more favorable remove time. Something along the lines of a modified unrolled linked list with indices could work, but it would certainly increase the implementation complexity of this approach.
I've actually implemented this. I used the first two optimizations above (storing key-value pairs for compression, and pruning when 1/2 of them are 0s). Here's some benchmarks to compare using an insertion sort derivative to this one:
a.length    This method    Insert sort method
100         0.0262ms       0.0204ms
1000        0.2300ms       0.8793ms
10000       2.7303ms       75.7155ms
100000      32.6601ms      7740.36 ms
300000      98.9956ms      69523.6 ms
1000000     333.501 ms     ????? Not patient enough
So, as you can see, this algorithm grows much, much slower than the insertion sort method I posted before. However, it took 73 lines of code vs 26 lines of code for the insertion sort method. So in terms of simplicity, the insertion sort method might still be the way to go if you don't have time requirements/the input is small.
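For reference, the unoptimized form of the backwards idea fits in a few lines of Python. This sketch is not the benchmarked 73-line implementation: it removes items with a plain list deletion, so it has exactly the O(n) removal cost discussed above, and the function name closest_smaller_backwards is an assumption.

```python
from bisect import bisect_left


def closest_smaller_backwards(a):
    """b[i] = largest earlier value strictly smaller than a[i], or -1."""
    c = sorted(a)                   # multiset of all values, sorted
    b = [-1] * len(a)
    for i in range(len(a) - 1, -1, -1):
        pos = bisect_left(c, a[i])  # leftmost copy of a[i]
        del c[pos]                  # c now holds exactly a[0 .. i-1]
        if pos > 0:
            b[i] = c[pos - 1]       # everything left of pos is strictly smaller
    return b


print(closest_smaller_backwards([2, 1, 7, 5, 7, 9]))  # [-1, -1, 2, 2, 5, 7]
```

Searching for the leftmost copy is what keeps duplicates such as the second 7 from matching the first one.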

You could treat it like an insertion sort.
Pseudocode:
let arr be one array with enough space for every item in a
let b be another array with, again, enough space for all elements in a
For each item in a:
    perform insertion sort on item into arr
    After performing the insertion, if there exists a number to the left, append that to b.
    Otherwise, append -1 to b
return b
The main thing you have to worry about is making sure that you don't make the mistake of reallocating arrays (because it would reallocate n times, which would be extremely costly). This will be an implementation detail of whatever language you use (std::vector's reserve for C++ ... arr.reserve(n) for D ... ArrayList's ensureCapacity in Java...)
A potential downfall with this approach compared to using a binary tree is that it's O(n^2) time. However, the constant factors using this method vs binary tree would make this faster for smaller sizes. If your n is smaller than 1000, this would be an appropriate solution. However, O(n log n) grows much slower than O(n^2), so if you expect a's size to be significantly higher and if there's a time limit that you are likely to breach, you might consider a more complicated O(n log n) algorithm.
There are ways to slightly improve the performance (such as using a binary insertion sort: using binary search to find the position to insert into), but generally they won't improve performance enough to matter in most cases since it's still O(n^2) time to shift elements to fit.
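A binary-insertion variant of the pseudocode might look like this in Python, using the standard bisect module. It is a sketch under the same O(n^2) caveat (insort still shifts elements), and the function name closest_smaller_before is an assumption.

```python
from bisect import bisect_left, insort


def closest_smaller_before(a):
    """b[i] = largest value among a[0 .. i-1] that is strictly smaller than a[i], or -1."""
    seen = []                         # sorted copy of the items processed so far
    b = []
    for item in a:
        pos = bisect_left(seen, item)   # first position whose value is >= item
        b.append(seen[pos - 1] if pos > 0 else -1)
        insort(seen, item)              # binary search plus O(n) shift
    return b


print(closest_smaller_before([2, 1, 7, 5, 7, 9]))  # [-1, -1, 2, 2, 5, 7]
```

Looking up the neighbour before inserting, and using bisect_left rather than bisect_right, means an equal earlier value is never reported as "smaller".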

Consider this:
a = 2 1 7 5 7 9
b = -1 -1 2 2 5 7
c   0   1   2   3   4   5   6   7   8   9
0   -   -   -   -   -   -   -   -   -   -
Here the index into c is the value of a[i], so 0, 3, 4, 6 and 8 would keep null values,
and the first dimension of c contains the closest value to a[i] found so far.
So at step a[3] we have the following
c   0   1   2   3   4   5   6   7   8   9
0   -  -1  -1   -   -   2   -   2   -   -
and by step a[5] we have the following
c   0   1   2   3   4   5   6   7   8   9
0   -  -1  -1   -   -   2   -   5   -   7
This way when we get to the 2nd 7 at a[4] we know that 2 is the largest value to date and all we need to do is loop back through a[i-1] until we encounter a 7 again comparing the a[i] value to that in c[7] if bigger, replace c[7]. Once a[i-1] = the 7 we put c[7] into b[i] and move on to next a[i].
The main downfalls to this approach that I can see are:
footprint size depending on how big the c[] needs to be dimensioned..
the fact that you have to revisit elements of a[] that you've already touched. If the distribution of data is such that there are significant spaces between the two 7's then keeping track of the highest value as you go would presumably be faster. Alternatively it might be better to gather statistics on the a[i] up front to know what distributions exist and then use a hybrid method maintaining the max until such time that no more instances of that number are in the statistics.
