In Pig, finding complement of entries in a table - database

I have a table A which contains a list, and a table B which contains a sub-list of the items in A. How should I get a table C which contains the complement of B in A?
I know how to do it in SQL, but I am not sure how to approach it in Pig.
Thanks.

In Pig terms, you have two "bags" A and B, where B is a subset of A.
If B only contains values in A, you can do C = DIFF(A,B).
However, note that DIFF removes duplicates, so you will get the complement of B in A reduced to unique values.
In general, DIFF returns the union of the complement of B in A and the complement of A in B, i.e. their symmetric difference.
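
The behaviour described in this answer can be modelled with plain Python sets. This is only a sketch of the semantics as the answer states them, not Pig code; the function names and sample values are made up for illustration.

```python
def complement(a, b):
    """Unique items of a that do not appear in b."""
    return set(a) - set(b)


def symmetric_difference(a, b):
    """Items that appear in exactly one of the two collections."""
    return set(a) ^ set(b)


# Hypothetical sample data: table_b is a sub-list of table_a.
table_a = ["x", "y", "y", "z"]
table_b = ["x"]

# When b is a subset of a, both functions give the same result,
# and the duplicate "y" collapses to a single entry.
print(complement(table_a, table_b))
print(symmetric_difference(table_a, table_b))
```

If B contained values that are not in A, only the symmetric difference would include them, which is why the subset condition matters.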

Related

How to get an array that summarizes sometimes repeated negative or positive values, also writing the number of repetitions in a separate column?

I have a list of "Move" values (Column C).
How can I create a "Sorted" list (Columns E and F) that repeats the number in Column F and writes the action count in Column E as 1 if the next value goes the opposite way from zero,
and sums 2 or more numbers in Column F, writing the action count in Column E as 2 or more, if they go the same way from zero?
Screenshot shows correct result:
Google Sheets example link:
https://docs.google.com/spreadsheets/d/13WL9pD7glAZxOIhsXq6k-ZG0Z8bbGAmMOhtGrW3mRTA/edit?usp=sharing
Here's what I came up with in cell E1 on the mk.idea tab
=ARRAYFORMULA(array_constrain(query({C2:C\LOOKUP(A2:A;filter({A2:A;1}; sign({C2:C;0})<>sign(n(C:C))))};"select Count(Col2),sum(Col1),Col2 where Col2>1 group by Col2 label Count(Col2)'Sorted',Sum(Col1)''";0);9^9;2))
Does that give you what you're after?
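
As a cross-check of what the formula is meant to compute, the same grouping can be sketched in Python: consecutive values with the same sign form one run, and each run is reported as (count, sum). The function names and sample moves are hypothetical, and treating a zero as its own run is an assumption that the question does not settle.

```python
from itertools import groupby


def sign(value):
    """Return -1, 0 or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def summarize_runs(moves):
    """Collapse consecutive same-sign values into (count, total) pairs."""
    result = []
    for _, group in groupby(moves, key=sign):
        run = list(group)
        result.append((len(run), sum(run)))
    return result


# Hypothetical "Move" column.
print(summarize_runs([1, -2, -3, 4, 5, 6, -1]))
```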

Sort objects by timestamp, but then group dependencies

Let's say I have a list (array) of objects. Each of these objects has two properties: a timestamp and an optional parent object, which can be null. I'd like to first sort this array by timestamps, which is easy enough; but then, I'd want the dependent objects to be kept consecutive.
For example, consider this simplified example: three objects, A, B, and C. B's parent is A, but the timestamps are A=1, B=3, C=2. Sorting by timestamp gives [A, C, B], but then because B's parent is A, I want B to come after A; so the ideal result should be [A, B, C] after all.
Note that if two or more objects have the same parent, they should all be adjacent, but they should be relatively sorted by timestamp still.
What's the best way to do this? The only way I can think of is to sort by timestamp, then iterate through the array and, for each dependent object, move it after its parent; but that seems inefficient since it calls for an extra round of iteration. Is there some way to incorporate the grouping into the initial sorting so it can complete with only one round of sorting? (I'm currently using QuickSort, but if need be, I can switch to another algorithm.)
Brute-force approach that does not work: to perform the sorting in a single operation, you would need to make the parent part of the sort key, and then sort by {order(Node.Parent), timestamp(Node)} pairs using any algorithm you like. The order function would have to satisfy:
"A is parent of B" => "order(A) < order(B)" and
"C.timestamp < D.timestamp" => order(C) < order(D)
Unfortunately, this "order" function requires sorting all child nodes first to satisfy the second condition, which breaks the "one sort" requirement.
To get a single sort, you can use a composite key that includes the timestamps of all parent nodes, and then sort by that composite key.
The easiest way to build the composite key is to construct a tree based on the parent objects and, using any tree traversal, set each node's key to the concatenation of its parent's key and its own timestamp.
Sample:
Data
A (ts = 5) parent of B (ts = 7),C (ts = 2)
B parent of D (ts = 3)
Building tree:
A -> B -> D
  -> C
Pre-order traversal: A, B, D, C
Composite key:
A -> A.timestamp = 5
B -> key(A) concat B.timestamp = 5.7
C -> key(A) concat C.timestamp = 5.2
D -> key(B) concat D.timestamp = 5.7.3
Data for sorting by {order, timestamp} pairs:
A {order(no-parent), ts} = {0, 5}
B {order(A), ts} = {1, 7}
C {1, 2}
D {2, 3}
Sorted sequences: {5}, {5.2}, {5.7}, {5.7.3}, mapping back to nodes: A, C, B, D
The complexity of this approach is O(n log(n) max_depth):
building the tree, walking the tree and building the keys is O(n);
the sort costs the complexity of the sorting algorithm (usually O(n log(n))) multiplied by the cost of comparing keys, which depends on the depth of the parent-child tree. This part dominates the O(n) needed for the preparation phase.
Alternatively, you can just build the tree, sort each level by timestamp, and then put the nodes back into a list via pre-order traversal. This removes the cost of key comparison but breaks the requirement of a single sort.
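
The composite-key idea can be sketched in Python as follows. This is a minimal sketch under assumptions: the Node class and function names are hypothetical, each key is built by walking the parent chain rather than by a separate tree traversal, the node name is added to every key component as a tie-breaker for siblings with equal timestamps, and the parent links are assumed to contain no cycles.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    timestamp: int
    parent: Optional["Node"] = None


def composite_key(node):
    """Return the (timestamp, name) pairs from the root down to this node."""
    path = []
    current = node
    while current is not None:
        path.append((current.timestamp, current.name))
        current = current.parent
    return tuple(reversed(path))


def sort_with_parents(nodes):
    # Tuples compare lexicographically, so a parent's key is a prefix of
    # its descendants' keys and the parent sorts directly before them.
    return sorted(nodes, key=composite_key)


# Sample data from the answer: A (ts 5) is the parent of B (ts 7) and C (ts 2),
# and B is the parent of D (ts 3).
a = Node("A", 5)
b = Node("B", 7, a)
c = Node("C", 2, a)
d = Node("D", 3, b)
print([n.name for n in sort_with_parents([d, c, b, a])])
```

Because sorted computes each key once, the extra work per comparison is bounded by the depth of the parent chain, which matches the max_depth factor in the complexity estimate.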
You could sort the objects into lexicographical order using a sequence of one or two numbers as the sort key: if an object has no parent, its sequence has a single element, which is its own number; if an object has a parent, the first element in the sequence is its parent's number and the second is its own number.
So A, B, and C get sequences {1}, {1, 3}, and {2} and B sorts just after its parent.

A unique key for a two dimensional array of letters

I have a two dimensional array of letters. Any letter can vary according to a certain alphabet.
I want to make a unique key for this array according to the letters and its position.
For example, if the array is 3 * 3 and the alphabet is {0, a, b, c, *}, the array can be in the form like:
0 b c
b * a
a a 0
I have tried Key = sum(code(letter)*(r*3+c)) for all r and c, where r and c are the row and the column, but it still gives me the same key for different array forms.
What am I missing?
P.S. code(letter) is a mapping function to convert the letter into a value.
You need to take into account the size of alphabet. If code and indices are all zero based it would be:
key = Sum(code(letter)*pow(L, r*C+c))
where L is the number of letters and C is the number of columns. However, watch out for numeric overflow. For larger alphabets or matrices you need to use one of the following:
Relax the requirement that keys be unique and use a hash (hash combiner).
Use a larger number type for the key, or even an arbitrary-precision arithmetic type such as the one in the GMP library.
Use compression such as arithmetic coding if the distribution of letters is not even. However, you still run the risk of not being able to fit or compress a specific matrix into the key.
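
The positional formula above can be sketched in Python, where integers have arbitrary precision, so the overflow concern does not apply. The zero-based code mapping and the function names are assumptions made for illustration; the decoder is included only to show that the key can be inverted for a fixed grid size, which is what makes it unique.

```python
ALPHABET = ["0", "a", "b", "c", "*"]
# Zero-based code for each letter (an assumed mapping).
CODE = {letter: index for index, letter in enumerate(ALPHABET)}


def grid_key(grid):
    """Encode the grid as a base-L number, one digit per cell."""
    base = len(ALPHABET)
    cols = len(grid[0])
    key = 0
    for r, row in enumerate(grid):
        for c, letter in enumerate(row):
            key += CODE[letter] * base ** (r * cols + c)
    return key


def key_to_grid(key, rows, cols):
    """Invert grid_key for a grid of known size."""
    base = len(ALPHABET)
    cells = []
    for _ in range(rows * cols):
        key, digit = divmod(key, base)
        cells.append(ALPHABET[digit])
    return [cells[r * cols:(r + 1) * cols] for r in range(rows)]


grid = [["0", "b", "c"],
        ["b", "*", "a"],
        ["a", "a", "0"]]
key = grid_key(grid)
print(key, key_to_grid(key, 3, 3) == grid)
```

The formula from the question collides because multiplying by the position is not positional notation: a "b" (code 2) at position 1 and an "a" (code 1) at position 2 both contribute 2. With powers of L, each cell owns its own digit.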

Order insensitive hash function for an array

I'm looking for a hash-function which will produce the same result for unordered sequences containing same elements.
For example:
Array_1: [a, b, c]
Array_2: [b, a, c]
Array_3: [c, b, a]
The hash-function should return the same result for each of these arrays.
How to achieve this?
The most popular answer is to sort elements by some rule, then concatenate, then take hash.
Is there any other method?
If a, b, c are numbers, you could sum them up and then build a hash on the sum.
You may multiply them, too.
But watch out for zeros!
XOR-ing the numbers is also an approach.
For very small numbers, you may consider setting the bit indexed by each number. Building a long (64 bits) as input for the hash this way allows only element values in the range 0-63.
The more elements you have, the more collisions you will get.
In the end you map n elements of m bits each (a range of 2^(m*n)) to a hash value of k bits.
Usually m and k are constants, but n varies.
Please be aware that any access by hash requires a test to check whether you got the correct element. In general a hash is NOT unique.
Otherwise, sort the elements and then compute the hash as proposed.
Regarding the comment from CodesInChaos:
To be able to omit the test, the number of bits in the hash should be much greater than the total number of bits in the elements, say at least 64 bits more. In general this is not the case.
One common case of a secure hash/unique id is a GUID, which effectively means 128 bits.
A random sequence of text characters reaches this number of bits within 20-25 characters.
Longer texts are very likely to produce collisions. It depends on the use case whether this is still acceptable.
XOR | Sum | Sum of squares | ...
where | denotes concat.
or
XOR of hash of elements
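
The commutative combinations mentioned in these answers can be sketched in Python as below. This is an illustration under assumptions, not a vetted hash design: the function names are hypothetical, each element is first hashed with SHA-256 truncated to 64 bits (the built-in hash of strings varies between Python processes), and the "concatenation" is represented as a tuple of the three parts.

```python
import hashlib
from functools import reduce

MASK = (1 << 64) - 1  # keep every part within 64 bits


def element_hash(item):
    """Stable 64-bit hash of one element, based on its repr."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def unordered_hash(items):
    """Combine element hashes with operations that ignore order."""
    hashes = [element_hash(item) for item in items]
    xor_part = reduce(lambda acc, h: acc ^ h, hashes, 0)
    sum_part = sum(hashes) & MASK
    square_part = sum(h * h for h in hashes) & MASK
    # XOR | Sum | Sum of squares, kept as a tuple instead of one bit string.
    return (xor_part, sum_part, square_part)


print(unordered_hash(["a", "b", "c"]) == unordered_hash(["c", "b", "a"]))
```

XOR alone cancels repeated elements (two equal elements contribute zero), which is one reason to combine it with the sum and the sum of squares. Collisions remain possible, so an equality check on the actual elements is still needed.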

Excel: Pair things up between two columns (A and B), and show unpaired values in columns C and D

Excel buffs:
There are many results showing up when searching for comparisons between two columns (e.g. using VLOOKUP), but none of the results I have looked at so far seems to do what I need in this particular way:
Column A has following values: Z, Q, V, V, T, T
Column B has following values: V, T, T, M
Column C will display Z, Q, V (here we have one V because one set of 'V' pairs up, leaving us with one unpaired 'V')
Column D will display M
The other examples I've seen so far assume Column C will not have 'V' in it because it's already found in Column A, regardless of how many times it showed up.
Basically, instead, I need values between the two columns paired up and removed, leaving me with any "odd ones" out.
I've been unable to figure this one out using formulae - I've resorted to sorting everything first, then shifting cells in either Column A or B downwards until Columns A and B either have matching values or an odd one out in each row.
Thanks in advance
Edit: Another way of saying it: I'd like to "eliminate" paired up values from Columns A and B, until all pairs have been removed, leaving me with remaining values in Column A and B
Let's make the assumption that the answers in C/D are allowed to reside in the same row as the unmatched originals in A/B.
Here's the formula for C1, copy and paste it down:
=REPT(A1,
MAX(0,MIN(1,COUNTIF($A:$A,A1)
-COUNTIF($B:$B,A1)
-IF(ROW()=1,0,COUNTIF(OFFSET($C$1,0,0,ROW()-1,1),A1))
)))
Basically, we want to "repeat" the corresponding value in column A if we haven't found a match for it in B and we haven't already accounted for it in C so far.
There's logic in there to ensure that OFFSET() doesn't refer to a zero-height range, and that we repeat either 0 or 1 time, no more and no less, on each row of C.
The formula for D1 is similar, but reversed to compare B back to A:
=REPT(B1,
MAX(0,MIN(1,COUNTIF($B:$B,B1)
-COUNTIF($A:$A,B1)
-IF(ROW()=1,0,COUNTIF(OFFSET($D$1,0,0,ROW()-1,1),B1))
)))
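
As a cross-check of the pairing logic, the expected contents of columns C and D can be computed outside Excel with a multiset difference. This Python sketch is not part of the formula solution; the function name is hypothetical, and it returns the leftovers as compact lists rather than leaving them in their original rows.

```python
from collections import Counter


def unpaired(left, right):
    """Values of left that remain after pairing each one off against right."""
    # Counter subtraction keeps only positive counts, so every value in
    # right cancels at most one matching value in left.
    remaining = Counter(left) - Counter(right)
    return list(remaining.elements())


column_a = ["Z", "Q", "V", "V", "T", "T"]
column_b = ["V", "T", "T", "M"]

print(unpaired(column_a, column_b))  # leftovers for column C
print(unpaired(column_b, column_a))  # leftovers for column D
```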

Resources