You are given an array A of size N containing numbers from 0 to N. For each prefix Si of the array (the sub-array that starts at index 0 and has i elements), let Bi be the smallest non-negative number that is not present in Si.
We need to find the maximum possible sum of all Bi of this array.
We may rearrange the array to obtain the maximum sum.
For example:
A = 1, 2, 0 , N = 3
then lets say we rearranged it as A= 0, 1, 2
S1 = 0, B1= 1
S2 = 0,1 B2= 2
S3 = 0,1,2 B3= 3
Hence the sum is 6
In every example I have tried, the sorted array gives the maximum sum. Am I correct, or am I missing something here?
Please help to find the correct logic for this problem. I am not looking for optimal solution but just the correct logic.
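Yes, sorting the array maximizes the sum of 𝐵𝑖, as long as duplicates are moved to the end: place one copy of each distinct value in increasing order first, then the leftover duplicates in any order. (Plainly sorting [0, 0, 1] gives 1 + 1 + 2 = 4, while [0, 1, 0] gives 1 + 2 + 2 = 5.)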
As the input size is 𝑛, it does not include every number in the range {0, ..., 𝑛}, as that is a set of 𝑛 + 1 numbers. Let's say it only lacks value 𝑘, then 𝐵𝑖 is 𝑘 for all 𝑖 >= 𝑘. If there are other numbers that are missing, but greater than 𝑘, there is no impact on any 𝐵𝑖.
Thus we need to find out the minimum missing value 𝑘 in the range {0, ..., 𝑛}. And then the maximised sum is 1 + 2 + ... + 𝑘 + (𝑛−𝑘)𝑘. This is 𝑘(𝑘+1)/2 + (𝑛−𝑘)𝑘 = 𝑘(1 + 2𝑛 − 𝑘)/2
To find the value of 𝑘, create a boolean array of size 𝑛 + 1, and set the entry at index 𝑣 to true when 𝑣 is encountered in the input. 𝑘 is then the first index at which that boolean array still has a false value.
Here is a little implementation in a JavaScript snippet:
function maxSum(arr) {
    const n = arr.length;
    // isUsed[v] is true when the value v occurs in the input
    const isUsed = Array(n + 1).fill(false);
    for (const value of arr) {
        isUsed[value] = true;
    }
    // k is the smallest value in {0, ..., n} that is missing from the input
    const k = isUsed.indexOf(false);
    return k * (1 + 2*n - k) / 2;
}

console.log(maxSum([0, 1, 2])); // 6
console.log(maxSum([0, 2, 2])); // 3
console.log(maxSum([1, 0, 1])); // 5
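As a cross-check, here is a minimal Python sketch that compares the closed form with a brute-force search over all orderings. The function names are my own and are not part of the answer above; the brute force is only practical for very small inputs.

```python
from itertools import permutations

def mex_prefix_sum(order):
    """Sum of B_i over all prefixes of `order` (B_i = smallest missing value)."""
    seen = set()
    mex = 0
    total = 0
    for value in order:
        seen.add(value)
        while mex in seen:      # the mex can only grow as the prefix grows
            mex += 1
        total += mex
    return total

def max_sum_closed_form(arr):
    """k * (1 + 2n - k) / 2, where k is the smallest value missing from arr."""
    n = len(arr)
    present = set(arr)
    k = 0
    while k in present:
        k += 1
    return k * (1 + 2 * n - k) // 2   # the product is always even

def max_sum_brute_force(arr):
    """Try every ordering; exponential, for checking small cases only."""
    return max(mex_prefix_sum(p) for p in permutations(arr))

for sample in ([0, 1, 2], [0, 2, 2], [1, 0, 1], [0, 0, 1]):
    print(sample, max_sum_closed_form(sample), max_sum_brute_force(sample))
```

For [0, 0, 1] both functions report 5, although the plainly sorted order only reaches 4, which is why the leftover duplicates have to go after the distinct values.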
Problem statement:
We are given three arrays A1, A2, A3 of lengths n1, n2, n3. Each array contains some (or no) natural numbers (i.e. > 0). These numbers denote program execution times.
At each step we pick the first element of any non-empty array, execute that program, and remove it from its array.
For example:
if A1=[3,2] (n1=2),
A2=[7] (n2=1),
A3=[1] (n3=1)
then we can execute programs in various orders like [1,7,3,2] or [7,1,3,2] or [3,7,1,2] or [3,1,7,2] or [3,2,1,7] etc.
Now if we take S=[1,3,2,7] as the order of execution the waiting time of various programs would be
for S[0] waiting time = 0, since executed immediately,
for S[1] waiting time = 0+1 = 1, taking previous time into account, similarly,
for S[2] waiting time = 0+1+3 = 4
for S[3] waiting time = 0+1+3+2 = 6
Now the score of array is defined as sum of all wait times = 0 + 1 + 4 + 6 = 11, This is the minimum score we can get from any order of execution.
Our task is to find this minimum score.
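To make the definition of the score concrete, here is a small Python sketch that computes it for a given execution order. The name waiting_score is hypothetical and only used for illustration.

```python
def waiting_score(order):
    """Sum of waiting times when the programs run in the given order."""
    elapsed = 0   # time spent on all programs executed so far
    total = 0     # accumulated waiting time
    for run_time in order:
        total += elapsed      # this program waited for everything before it
        elapsed += run_time
    return total

print(waiting_score([1, 3, 2, 7]))
```

Equivalently, each program contributes its own run time once for every program that runs after it, so the score of [1, 3, 2, 7] is 1*3 + 3*2 + 2*1 + 7*0 = 11.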
How can we solve this problem? I tried with approach trying to pick minimum of three elements each time, but it is not correct because it gets stuck when two or three same elements are encountered.
One more example:
if A1=[23,10,18,43], A2=[7], A3=[13,42] minimum score would be 307.
The simplest way to solve this is with dynamic programming (which runs in cubic time).
For each array A: Suppose you take the first element from array A, i.e. A[0], as the next process. Your total cost is the wait-time contribution of A[0] (i.e., A[0] * (total_remaining_elements - 1)), plus the minimal wait time sum from A[1:] and the rest of the arrays.
Take the minimum cost over each possible first array A, and you'll get the minimum score.
Here's a Python implementation of that idea. It works with any number of arrays, not just three.
import functools
from typing import List, Tuple

def dp_solve(arrays: List[List[int]]) -> int:
    """Given list of arrays representing dependent processing times,
    return the smallest sum of wait_time_before_start for all job orders"""

    arrays = [x for x in arrays if len(x) > 0]  # Remove empty

    # Memoize on the tuple of remaining lengths, so each state is solved once
    @functools.lru_cache(100000)
    def dp(remaining_elements: Tuple[int, ...],
           total_remaining: int) -> int:
        """Returns minimum wait time sum when suffixes of each array
        have lengths in 'remaining_elements' """
        if total_remaining == 0:
            return 0

        rem_elements_copy = list(remaining_elements)
        best = 10 ** 20

        for i, x in enumerate(remaining_elements):
            if x == 0:
                continue
            # arrays[i][-x] is the first element still left in array i;
            # every other remaining job has to wait for it
            cost_here = arrays[i][-x] * (total_remaining - 1)
            if cost_here >= best:
                continue
            rem_elements_copy[i] -= 1
            best = min(best,
                       dp(tuple(rem_elements_copy), total_remaining - 1)
                       + cost_here)
            rem_elements_copy[i] += 1
        return best

    return dp(tuple(map(len, arrays)), sum(map(len, arrays)))
Better solutions
The naive greedy strategy of 'smallest first element' doesn't work, because it can be worth it to do a longer job to get a much shorter job in the same list done, as the example of
A1 = [100, 1, 2, 3], A2 = [38], A3 = [34],
best solution = [100, 1, 2, 3, 34, 38]
by user3386109 in the comments demonstrates.
A more refined greedy strategy does work. Instead of the smallest first element, consider each possible prefix of the array. We want to pick the array with the smallest prefix, where prefixes are compared by average process time, and perform all the processes in that prefix in order.
A1 = [ 100, 1, 2, 3]
Prefix averages = [(100)/1, (100+1)/2, (100+1+2)/3, (100+1+2+3)/4]
= [ 100.0, 50.5, 34.333, 26.5]
A2=[38]
A3=[34]
Smallest prefix average in any array is 26.5, so pick
the prefix [100, 1, 2, 3] to complete first.
Then [34] is the next prefix, and [38] is the final prefix.
And here's a rough Python implementation of the greedy algorithm. This code computes subarray averages in a completely naive/brute-force way, so the algorithm is still quadratic (but an improvement over the dynamic programming method). Also, it computes 'maximum suffixes' instead of 'minimum prefixes' for ease of coding, but the two strategies are equivalent.
import math
from typing import List

def greedy_solve(arrays: List[List[int]]) -> int:
    """Given list of arrays representing dependent processing times,
    return the smallest sum of wait_time_before_start for all job orders"""

    def max_suffix_avg(arr: List[int]):
        """Given arr, return value and length of max-average suffix"""
        if len(arr) == 0:
            return (-math.inf, 0)
        best_len = 1
        best = -math.inf
        curr_sum = 0.0
        for i, x in enumerate(reversed(arr), 1):
            curr_sum += x
            new_avg = curr_sum / i
            if new_avg >= best:
                best = new_avg
                best_len = i
        return (best, best_len)

    # Remove empty arrays and copy the rest, so the caller's lists are not mutated
    arrays = [list(x) for x in arrays if len(x) > 0]
    if not arrays:
        return 0
    total_time_sum = sum(sum(x) for x in arrays)

    my_averages = [max_suffix_avg(arr) for arr in arrays]
    total_cost = 0

    # The schedule is built back to front: the max-average suffix runs last
    while True:
        largest_avg_idx = max(range(len(arrays)),
                              key=lambda y: my_averages[y][0])
        _, n_to_remove = my_averages[largest_avg_idx]

        if n_to_remove == 0:
            break

        for _ in range(n_to_remove):
            total_time_sum -= arrays[largest_avg_idx].pop()
            total_cost += total_time_sum

        # Recompute the changed array's avg
        my_averages[largest_avg_idx] = max_suffix_avg(arrays[largest_avg_idx])

    return total_cost
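For comparison, the same greedy idea can be written directly in its 'minimum prefix' form, building the schedule front to back. This is a rough sketch under the same brute-force assumptions; the names min_avg_prefix and greedy_prefix_solve are my own.

```python
from typing import List, Tuple

def min_avg_prefix(arr: List[int], start: int) -> Tuple[float, int]:
    """Average and length of the minimum-average prefix of arr[start:]."""
    best_avg, best_len = float("inf"), 0
    running = 0
    for length, x in enumerate(arr[start:], 1):
        running += x
        avg = running / length
        if avg < best_avg:
            best_avg, best_len = avg, length
    return best_avg, best_len

def greedy_prefix_solve(arrays: List[List[int]]) -> int:
    """Repeatedly run the prefix with the smallest average across all arrays."""
    starts = [0] * len(arrays)   # index of the first unexecuted job per array
    elapsed = 0                  # total run time of the jobs executed so far
    total_wait = 0
    while True:
        candidates = [(min_avg_prefix(arr, s), i)
                      for i, (arr, s) in enumerate(zip(arrays, starts))
                      if s < len(arr)]
        if not candidates:
            break
        (_, length), i = min(candidates)
        for x in arrays[i][starts[i]:starts[i] + length]:
            total_wait += elapsed    # this job waited for everything before it
            elapsed += x
        starts[i] += length
    return total_wait

print(greedy_prefix_solve([[23, 10, 18, 43], [7], [13, 42]]))
```

On the two examples from the question this gives 11 and 307, and it schedules [100, 1, 2, 3] before 34 and 38 in the counterexample above.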
How can I calculate the index of an element in the list of all strings of a given length with a given number of distinct characters, sorted according to the input alphabet? A brute-force version:
from itertools import product

def bruteforce_item_index(item, alphabet, length, distinct):
    skipped = 0
    for u in product(alphabet, repeat=length):
        v = ''.join(u)
        if v == item:
            return skipped
        if len(set(u)) == distinct:
            skipped += 1
As an example
bruteforce_item_index('0123456777', alphabet='0123456789', length=10, distinct=8)
Runs in ~1 minute and gives the answer 8245410. The run time here is proportional to the index of the given item.
I want an efficient implementation that is able to calculate that index in a fraction of second.
In other words: How to solve this problem? A mathematical approach has been provided on the same page. I want a python or java or c# code as a solution.
In this answer I will explain how to get to a function that will enable you to get the index of an element in the sequence as follows
print("Item 3749832414 is at (0-based) index %d" %
item_index('3749832414', alphabet='0123456789', length=10, distinct=8))
print("Item 7364512193 is at (0-based) index %d" %
item_index('7364512193', alphabet='0123456789', length=10, distinct=8))
> Item 3749832414 is at (0-based) index 508309342
> Item 7364512193 is at (0-based) index 1005336982
Enumeration method
By the nature of your problem it is natural to solve it recursively, adding digits one by one and keeping track of the number of distinct digits used. Python provides generators, so you can produce items one by one without storing the whole sequence.
Basically all the items can be arranged in a prefix tree, and we walk the tree yielding the leaf nodes.
def iter_seq(alphabet, length, distinct, prefix=''):
    if distinct < 0:
        # the prefix used more than the allowed number of distinct digits
        return
    if length == 0:
        # if distinct > 0 it means that prefix did not use
        # enough distinct digits
        if distinct == 0:
            yield prefix
    else:
        for d in alphabet:
            if d in prefix:
                # the number of distinct digits in prefix + d is the same
                # as in prefix.
                yield from iter_seq(alphabet, length-1, distinct, prefix + d)
            else:
                # the number of distinct digits in prefix + d is one more
                # than the distinct digits in prefix.
                yield from iter_seq(alphabet, length-1, distinct-1, prefix + d)
Let's test it with examples that can be visualized
list(iter_seq('0123', 5, 1))
['00000', '11111', '22222', '33333']
import numpy as np
np.reshape(list(iter_seq('0123', 4, 2)), (12, 7))
array([['0001', '0002', '0003', '0010', '0011', '0020', '0022'],
['0030', '0033', '0100', '0101', '0110', '0111', '0200'],
['0202', '0220', '0222', '0300', '0303', '0330', '0333'],
['1000', '1001', '1010', '1011', '1100', '1101', '1110'],
['1112', '1113', '1121', '1122', '1131', '1133', '1211'],
['1212', '1221', '1222', '1311', '1313', '1331', '1333'],
['2000', '2002', '2020', '2022', '2111', '2112', '2121'],
['2122', '2200', '2202', '2211', '2212', '2220', '2221'],
['2223', '2232', '2233', '2322', '2323', '2332', '2333'],
['3000', '3003', '3030', '3033', '3111', '3113', '3131'],
['3133', '3222', '3223', '3232', '3233', '3300', '3303'],
['3311', '3313', '3322', '3323', '3330', '3331', '3332']],
dtype='<U4')
Counting items
As you noticed in your previous question, the number of items in a sequence depends only on the length of each string, the size of the alphabet, and the number of distinct symbols.
If we look at the loop of the above function, there are only two cases: (1) the current digit is in the prefix, or (2) the digit is not in the prefix. The number of digits that fall in case (1) is exactly the number of distinct digits in the prefix. So we can add an argument used to keep track of the number of digits already used instead of the actual prefix. With only two recursive calls per level, the complexity drops from O(n_symbols**length) to O(2**length).
Additionally we use the lru_cache decorator, which memoizes the values and returns them without calling the function again if the arguments are repeated. This makes the function run in O(length**2) time and space.
from functools import lru_cache

@lru_cache(maxsize=None)
def count_seq(n_symbols, length, distinct, used=0):
    # distinct: how many new symbols still have to appear
    # used: how many different symbols the prefix already contains
    if distinct < 0:
        return 0
    if length == 0:
        return 1 if distinct == 0 else 0
    else:
        return \
            count_seq(n_symbols, length-1, distinct-0, used+0) * used + \
            count_seq(n_symbols, length-1, distinct-1, used+1) * (n_symbols - used)
We can check that it is consistent with iter_seq
assert(sum(1 for _ in iter_seq('0123', 4, 2)) == count_seq(4, 4, 2))
We can also test that it agrees with the example you calculated by hand
assert(count_seq(10, 10, 8) == 1360800000)
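The same count can also be cross-checked with a closed form: choose which symbols appear, then count the strings that use every chosen symbol at least once via inclusion-exclusion. This is a sketch of that standard identity, not part of the derivation above; count_seq_closed_form is a name I made up, and math.comb needs Python 3.8 or later.

```python
from math import comb

def count_seq_closed_form(n_symbols, length, distinct):
    """C(n_symbols, distinct) times the number of surjections onto the chosen symbols."""
    # Inclusion-exclusion: strings of the given length over `distinct`
    # symbols in which every one of those symbols appears at least once.
    surjections = sum((-1) ** j * comb(distinct, j) * (distinct - j) ** length
                      for j in range(distinct + 1))
    return comb(n_symbols, distinct) * surjections

print(count_seq_closed_form(4, 4, 2))
print(count_seq_closed_form(10, 10, 8))
```

For the two cases tested above this gives 84 and 1360800000, matching iter_seq and the hand calculation.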
Item at index
This part is not necessary to get the final answer but it is a good exercise. Furthermore it will give us a way to compute larger sequences that would be tedious by hand.
This could be achieved by iterating iter_seq the given number of times. The function below does it more efficiently by comparing the number of leaves in a given subtree (the number of items produced by a specific call) with the remaining distance to the requested index. If the requested index lies beyond the items produced by a call, we can skip that call entirely and jump directly to the next sibling in the tree.
def item_at(idx, alphabet, length, distinct, used=0, prefix=''):
    if distinct < 0:
        return
    if length == 0:
        return prefix
    else:
        for d in alphabet:
            if d in prefix:
                branch_count = count_seq(len(alphabet),
                                         length-1, distinct, used)
                if branch_count <= idx:
                    idx -= branch_count
                else:
                    return item_at(idx, alphabet,
                                   length-1, distinct, used, prefix + d)
            else:
                branch_count = count_seq(len(alphabet),
                                         length-1, distinct-1, used+1)
                if branch_count <= idx:
                    idx -= branch_count
                else:
                    return item_at(idx, alphabet,
                                   length-1, distinct-1, used+1, prefix + d)
We can test that it is consistent with iter_seq
for i, w in enumerate(iter_seq('0123', 4, 2)):
    assert w == item_at(i, '0123', 4, 2)
Index of a given item
Remembering that we are walking in a prefix tree, given a string we can walk directly to the desired node. The way to find the index is to sum the size of all the subtrees that are left behind on this path.
def item_index(item, alphabet, length, distinct, used=0, prefix=''):
    if distinct < 0:
        return 0
    if length == 0:
        return 0
    else:
        offset = 0
        for d in alphabet:
            if d in prefix:
                if d == item[0]:
                    return offset + item_index(item[1:], alphabet,
                                               length-1, distinct, used, prefix + d)
                else:
                    offset += count_seq(len(alphabet),
                                        length-1, distinct, used)
            else:
                if d == item[0]:
                    return offset + item_index(item[1:], alphabet,
                                               length-1, distinct-1, used+1, prefix + d)
                else:
                    offset += count_seq(len(alphabet),
                                        length-1, distinct-1, used+1)
And again we can test the consistency between this and iter_seq
for i, w in enumerate(iter_seq('0123', 4, 2)):
    assert i == item_index(w, '0123', 4, 2)
Or to query for the example numbers you gave as I promised in the beginning of the post
print("Item 3749832414 is at (0-based) index %d" %
item_index('3749832414', alphabet='0123456789', length=10, distinct=8))
print("Item 7364512193 is at (0-based) index %d" %
item_index('7364512193', alphabet='0123456789', length=10, distinct=8))
> Item 3749832414 is at (0-based) index 508309342
> Item 7364512193 is at (0-based) index 1005336982
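Since the pieces above are spread over several snippets, here is a compact, self-contained sketch of the same subtree-skipping idea written iteratively. The names count_completions, item_index_iterative and all_items are hypothetical, and the counting helper tracks the number of symbols used so far against the target instead of counting down.

```python
from functools import lru_cache
from itertools import product

@lru_cache(maxsize=None)
def count_completions(n_symbols, remaining, used, distinct):
    """Ways to append `remaining` symbols to a prefix that already contains
    `used` different symbols so that the whole string has exactly `distinct`."""
    if used > distinct:
        return 0
    if remaining == 0:
        return 1 if used == distinct else 0
    return (used * count_completions(n_symbols, remaining - 1, used, distinct)
            + (n_symbols - used)
            * count_completions(n_symbols, remaining - 1, used + 1, distinct))

def item_index_iterative(item, alphabet, distinct):
    """0-based index of `item` among the valid strings, in alphabet order."""
    seen = set()
    index = 0
    for pos, ch in enumerate(item):
        remaining = len(item) - pos - 1
        for d in alphabet:
            if d == ch:
                break
            # skip the whole subtree rooted at the smaller sibling d
            used_after = len(seen) + (d not in seen)
            index += count_completions(len(alphabet), remaining,
                                       used_after, distinct)
        seen.add(ch)
    return index

def all_items(alphabet, length, distinct):
    """Brute-force listing, only for checking small cases."""
    return [''.join(u) for u in product(alphabet, repeat=length)
            if len(set(u)) == distinct]

for i, w in enumerate(all_items('0123', 4, 2)):
    assert item_index_iterative(w, '0123', 2) == i
```

The sketch assumes that item really is a valid member of the sequence; it does not check that.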
Bonus: Larger sequences
Let's calculate the index of UCP3gzjGPMwjYbYtsFu2sDHRE14XTu8AdaWoJPOm50YZlqI6skNyfvEShdmGEiB0
in the sequences of length 64 and 50 distinct symbols
item_index('UCP3gzjGPMwjYbYtsFu2sDHRE14XTu8AdaWoJPOm50YZlqI6skNyfvEShdmGEiB0',
alphabet='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz',
distinct=50, length=64)
Surprisingly it is 10000...000 = 10**110. How could I find that particular string??
If we choose 3 symbols from the set {a, b, c, d, e, f}, there are 20 possible combinations. We can record these combinations as an integer, such as:
{a, b, c} => 1
{a, b, d} => 2
{a, b, e} => 3
...
{d, e, f} => 20
After choosing the 3 symbols, there are 3^6 = 729 possible strings of length 6 over them. Each position needs 2 bits, so a whole string can be represented in 12 bits.
Take {a, b, c} for example, the representation can be:
aaaaaa => 00 00 00 00 00 00
aaaaab => 00 00 00 00 00 01
aaaaac => 00 00 00 00 00 10
...
cccccb => 10 10 10 10 10 01
cccccc => 10 10 10 10 10 10
You can then use the pair of that integer and the 12-bit value to index your string.
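A minimal Python sketch of this encoding, assuming the combinations are numbered from 1 in lexicographic order as in the listing above. The function name encode and its parameters are hypothetical.

```python
from itertools import combinations

def encode(word, universe, chosen):
    """Return (combination number, packed bits) for `word` over `chosen`."""
    symbols = sorted(chosen)
    combos = list(combinations(sorted(universe), len(symbols)))
    combo_number = combos.index(tuple(symbols)) + 1   # 1-based, as in the listing
    # 2 bits per position are enough for 3 symbols
    bits_per_symbol = max(1, (len(symbols) - 1).bit_length())
    packed = 0
    for ch in word:
        packed = (packed << bits_per_symbol) | symbols.index(ch)
    return combo_number, packed

number, bits = encode('cccccb', 'abcdef', 'abc')
print(number, format(bits, '012b'))
```

Note that this representation also assigns codes to strings that do not use all three chosen symbols, so it identifies a string but is not a dense index into the sorted sequence.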
You can try the factorial number system. It is pretty complicated to explain, but it lets you convert between an index and a permutation directly, in a number of steps that depends only on the length of the string and not on the index. This is Project Euler's Lexicographic permutations problem.
The code below finds a permutation by its index. You can probably rewrite it to find the index of a given permutation.
// requires: import java.util.ArrayList; import java.util.Comparator; import java.util.List;
public static String lexicographicPermutation(String str, int n) {
    // fact[i] = i!
    long[] fact = new long[str.length()];
    List<Character> letters = new ArrayList<>(str.length());

    for (int i = 0; i < str.length(); fact[i] = i == 0 ? 1 : i * fact[i - 1], i++)
        letters.add(str.charAt(i));

    letters.sort(Comparator.naturalOrder());
    n--;    // the index n is 1-based

    StringBuilder buf = new StringBuilder(str.length());

    // each factorial-base digit of n selects one of the remaining letters
    for (int i = str.length() - 1; i >= 0; n %= fact[i], i--)
        buf.append(letters.remove((int)(n / fact[i])));

    return buf.toString();
}
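The inverse direction (index of a given permutation) might look like the following Python sketch. It assumes all symbols are distinct and returns a 0-based rank, whereas the Java method above takes a 1-based index; permutation_rank is a name I chose for illustration.

```python
from math import factorial

def permutation_rank(perm):
    """0-based lexicographic rank of `perm` among permutations of its symbols."""
    remaining = sorted(perm)
    rank = 0
    for i, ch in enumerate(perm):
        pos = remaining.index(ch)            # smaller symbols not yet used
        rank += pos * factorial(len(perm) - 1 - i)
        remaining.pop(pos)
    return rank

print(permutation_rank('2783915460') + 1)
```

The printed value is 1000000, which agrees with the Project Euler problem mentioned above, where 2783915460 is the millionth permutation of the digits 0 to 9.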
I am trying to generate a matrix, that has all unique combinations of [0 0 1 1], I wrote this code for this:
v1 = [0 0 1 1];
M1 = unique(perms([0 0 1 1]),'rows');
• This isn't ideal, because perms() is seeing each vector element as unique and doing:
4! = 4 * 3 * 2 * 1 = 24 combinations.
• With unique() I tried to delete all the repetitive entries so I end up with the combination matrix M1 →
only 4! / (2! * (4-2)!) = 6 combinations!
Now, when I try to do something very simple like:
n = 15;
i = 1;
v1 = [zeros(1,n-i) ones(1,i)];
M = unique(perms(v1),'rows');
• Instead of getting 15! / (1! * (15-1)!) = 15 combinations, the perms() function is trying to do
15! = 1.3077e+12 combinations and it's interrupted.
• How would you go about doing this in a much better way? Thanks in advance!
You can use nchoosek to return the indices which should be 1. I think in your heart you knew this must be possible, because you were using the definition of nchoosek to determine the expected final number of permutations! So we can use:
idx = nchoosek( 1:N, k );
Where N is the number of elements in your array v1, and k is the number of elements which have the value 1. Then it's simply a case of creating the zeros array and populating the ones.
v1 = [0, 0, 1, 1];
N = numel(v1); % number of elements in array
k = nnz(v1); % number of non-zero elements in array
colidx = nchoosek( 1:N, k ); % column index for ones
rowidx = repmat( 1:size(colidx,1), k, 1 ).'; % row index for ones
M = zeros( size(colidx,1), N ); % create output
M( rowidx(:) + size(M,1) * (colidx(:)-1) ) = 1;
This works for both of your examples without the need for a huge intermediate matrix.
Aside: since you'd have the indices using this approach, you could instead create a sparse matrix, but whether that's a good idea or not would depend on what you're doing after this point.
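For readers outside MATLAB, the same idea (choose the positions of the ones instead of permuting the whole vector) can be sketched in Python with itertools.combinations. This is only an analogue of the nchoosek approach; binary_rows is a hypothetical name.

```python
from itertools import combinations

def binary_rows(n, k):
    """All length-n 0/1 rows that contain exactly k ones."""
    rows = []
    for ones in combinations(range(n), k):   # column indices of the ones
        row = [0] * n
        for col in ones:
            row[col] = 1
        rows.append(row)
    return rows

print(len(binary_rows(4, 2)), len(binary_rows(15, 1)))
```

This produces 6 rows for n = 4, k = 2 and 15 rows for n = 15, k = 1, without ever enumerating the 15! permutations.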
I have a text file with P random entries in binary (or hex) for processing. From those P entries I have to take N entries that are as different from each other as possible, so that I have a good representative sample of the possible population.
So far, I have thought of comparing the current entry with an average of the array that contains the elements, using a modified version of the algorithm in: How do I calculate similarity of two integers?
Another idea is to keep a cumulative score of similarity (the higher, the more different) between the next candidate and all the elements in the array, choose the best one, and repeat until the required N have been selected.
I do not know if there is a better solution to this.
Ex.
[00011111, 00101110, 11111111, 01001010 , 00011000, 10010000, 01110101]
P = 7
N = 3
Result: [00011111, 10010000, 00101110]
Thanks in advance
You should compare them pairwise. This comparison problem is the shortest common supersequence problem: a shortest common supersequence of strings x and y is a shortest string z such that both x and y are subsequences of z. It is closely related to the longest common subsequence problem, for which the standard solution is dynamic programming.
You could calculate the Hamming distances for all combinations if you want to choose the most different binary representation (see https://en.wikipedia.org/wiki/Hamming_distance ).
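As a minimal sketch of the 'all combinations' idea, the following Python code computes Hamming distances with XOR and picks the subset of size N whose pairwise distances sum to the largest value. Choosing the sum of pairwise distances as the objective is my assumption, and the exhaustive search is only feasible for small P; the names hamming and most_spread_subset are hypothetical.

```python
from itertools import combinations

def hamming(x, y):
    """Number of bit positions in which the integers x and y differ."""
    return bin(x ^ y).count("1")

def most_spread_subset(values, n):
    """Subset of size n maximizing the sum of pairwise Hamming distances."""
    def spread(subset):
        return sum(hamming(x, y) for x, y in combinations(subset, 2))
    best = max(combinations(values, n), key=spread)
    return list(best), spread(best)

entries = [0b00011111, 0b00101110, 0b11111111, 0b01001010,
           0b00011000, 0b10010000, 0b01110101]
subset, score = most_spread_subset(entries, 3)
print([format(v, '08b') for v in subset], score)
```

Several subsets can tie for the best score, so the particular subset returned depends on the enumeration order.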
Edit: small hack
import numpy as np

a = [0b00011111, 0b00101110, 0b11111111, 0b01001010, 0b00011000, 0b10010000, 0b01110101]
N = 3

b = []
for i in a:
    b.append(np.unpackbits(np.uint8(i))) #to binary representation

valuesWithBestOverallDiffs = []

def getItemWithBestOverallDiff(b):
    itemWithBestOverallDiff = [0, 0] #idx, value
    for biter, bval in enumerate(b):
        hammDistSum = 0
        for biter2, bval2 in enumerate(b):
            if biter == biter2:
                continue
            print("iter1: " + str(biter) + " = " + str(bval))
            print("iter2: " + str(biter2) + " = " + str(bval2))
            hammDist = len(np.bitwise_xor(bval, bval2).nonzero()[0])
            print(" => " + str(hammDist))
            hammDistSum = hammDistSum + hammDist
        if hammDistSum > itemWithBestOverallDiff[1]:
            itemWithBestOverallDiff = [biter, hammDistSum]
    #print(itemWithBestOverallDiff)
    return itemWithBestOverallDiff

for i in range(N):
    itemWithBestOverallDiff = getItemWithBestOverallDiff(b)
    print("adding item nr " + str(itemWithBestOverallDiff[0]) + " with value 0b" + str(b[itemWithBestOverallDiff[0]]) + " = " + str(a[itemWithBestOverallDiff[0]]))
    val = a.pop(itemWithBestOverallDiff[0])
    b.pop(itemWithBestOverallDiff[0])
    valuesWithBestOverallDiffs.append(val)

print("result: ")
print(valuesWithBestOverallDiffs)
The final output is
result:
[144, 117, 255]
which is 0b10010000, 0b01110101, 0b11111111