Generating a relation from second field in pig - union

I have performed a "group by" on a relation and the result is similar to the following:
g1,{a1,a2,a3}
g2,{b1,b2,b3,b4}
g3,{c1,c2,c3,c4,c5,c6}
...
so the first field is the group and the second field is a bag of tuples, where each bag may have a different number of elements. What I want to do is generate a new relation which includes all the elements in the second fields. Therefore, the output will be:
B={a1,a2,a3,b1,b2,b3,b4,c1,c2,c3,c4,c5,c6}
Could you help with this?
Sara

If I understand correctly, you want to create a new relation in which each tuple that was formerly inside a bag from the grouping becomes a record of its own. To do this, use the FLATTEN operator, which blows up the bag into multiple records. If you can assume that all of the tuples in the bags have the same schema, you can additionally FLATTEN those to promote the tuple elements to full-fledged fields:
If A is the result of the grouping, and
DESCRIBE A;
{(key:chararray, bag:{})}
You can do
B = FOREACH A GENERATE FLATTEN(bag) AS tuple;
Then to convert the tuples into full rows, do
C = FOREACH B GENERATE FLATTEN(tuple);
You can read more about FLATTEN here.
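For readers who find a general-purpose language easier to follow, here is a rough Python model of what FLATTEN does to the grouped data. It is only an analogy, not Pig code: the names `grouped` and `flattened` are made up for illustration, and the values are the sample ones from the question. Unlike this list, Pig does not promise any particular output order.

```python
# Model of the grouped relation: (group key, bag of elements).
grouped = [
    ("g1", ["a1", "a2", "a3"]),
    ("g2", ["b1", "b2", "b3", "b4"]),
    ("g3", ["c1", "c2", "c3", "c4", "c5", "c6"]),
]

# FLATTEN on the bag emits one output record per bag element.
# The group key is dropped here because it is not generated.
flattened = [element for _key, bag in grouped for element in bag]

print(flattened)
```

Each of the 13 elements ends up as its own record, which is the relation the question asks for.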

In order to get to a result like what you started with, I did:
grunt> A = LOAD '../../../input/tuplesSample.txt' using PigStorage(' ') AS (grupo:chararray, charo:chararray);
grunt> DESCRIBE A;
A: {grupo: chararray,charo: chararray}
grunt> B = GROUP A by grupo;
grunt> DESCRIBE B;
B: {group: chararray,A: {(grupo: chararray,charo: chararray)}}
grunt> C = FOREACH B GENERATE $0 as grupo, $1.charo as charos;
grunt> DESCRIBE C;
C: {grupo: chararray,charos: {(charo: chararray)}}
grunt> DUMP C;
(g1,{(a1),(a2),(a3)})
(g2,{(b4),(b3),(b2),(b1)})
(g3,{(c4),(c5),(c6),(c2),(c1),(c3)})
Then I did this, to give you the new relation (E below) that contains all the elements in a single bag.
grunt> D = FOREACH C GENERATE FLATTEN($1) as charos;
grunt> DESCRIBE D;
D: {charos: chararray}
grunt> E = GROUP D ALL;
grunt> DESCRIBE E;
E: {group: chararray,D: {(charos: chararray)}}
grunt> DUMP E;
(all,{(c3),(c1),(c2),(c6),(c5),(c4),(b1),(b2),(b3),(b4),(a3),(a2),(a1)})

Related

Sort objects by timestamp, but then group dependencies

Let's say I have a list (array) of objects. Each of these objects has two properties: a timestamp and an optional parent object, which can be null. I'd like to first sort this array by timestamps, which is easy enough; but then, I'd want the dependent objects to be kept consecutive.
For example, consider this simplified example: three objects, A, B, and C. B's parent is A, but the timestamps are A=1, B=3, C=2. Sorting by timestamp gives [A, C, B], but then because B's parent is A, I want B to come after A; so the ideal result should be [A, B, C] after all.
Note that if two or more objects have the same parent, they should all be adjacent, but they should be relatively sorted by timestamp still.
What's the best way to do this? The only way I can think of is to sort by timestamp, then iterate through the array and, for each dependent object, move it after its parent; but that seems inefficient since it calls for an extra round of iteration. Is there some way to incorporate the grouping into the initial sorting so it can complete with only one round of sorting? (I'm currently using QuickSort, but if need be, I can switch to another algorithm.)
Brute-force approach (which does not work with a single sort): to sort in one operation, you would need to make the parent part of the sort key and then sort by {order(Node.Parent), timestamp(Node)} pairs using any algorithm you like, where "order" satisfies:
"A is parent of B" => "order(A) < order(B)" and
"C.timestamp < D.timestamp" => "order(C) < order(D)"
Unfortunately, this "order" function requires sorting all child nodes first to satisfy the second condition, which breaks the "one sort" requirement.
To get a single sort, you can use a composite key that includes the timestamps of all parent nodes, and then sort by that composite key.
The easiest way to build the composite key is to construct a tree from the parent links and, using any tree traversal, set each node's key to the concatenation of its parent's key and its own timestamp.
Sample:
Data
A (ts = 5) parent of B (ts = 7),C (ts = 2)
B parent of D (ts = 3)
Building tree:
A -> B -> D
  -> C
Pre-order traversal: A, B, D, C
composite key -
A -> A.timestamp = 5
B -> key(A) concat B.timestamp = 5.7
C -> key(A) concat C.timestamp = 5.2
D -> key(B) concat D.timestamp = 5.7.3
data for sorting by {order, timestamp} pairs
A {order(no-parent), ts} = {0, 5}
B {order(A), ts} = {1,7}
C {1,2}
D {2,3}
sorted sequences - {5}, {5.2}, {5.7}, {5.7.3} mapping back to nodes - A, C, B, D
Complexity of this approach is O(n log(n) max_depth):
build tree/walk tree/build keys - O(n)
sort - the cost of the sort itself (usually O(n log(n)) comparisons) multiplied by the cost of comparing two keys, which depends on the depth of the parent-child tree. This part dominates the O(n) needed for the preparation phase.
Alternatively, you can just build the tree, sort each level by timestamp, and then put the nodes back in a list via pre-order traversal. This removes the cost of key comparison but breaks the single-sort requirement.
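The composite-key idea can be sketched in Python roughly as follows. This is a minimal illustration rather than a definitive implementation: the `Node` class and the `composite_key` helper are hypothetical names, and the sketch assumes timestamps are distinct among nodes that share a parent (and among root nodes), since equal timestamps would make keys collide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    ts: int
    parent: Optional["Node"] = None


def composite_key(node):
    """Return the timestamps from the root down to this node, as a tuple."""
    key = []
    while node is not None:
        key.append(node.ts)
        node = node.parent
    return tuple(reversed(key))


# Sample data from the answer: A(5) is parent of B(7) and C(2); B is parent of D(3).
a = Node("A", 5)
b = Node("B", 7, a)
c = Node("C", 2, a)
d = Node("D", 3, b)

# One sort call; tuples compare lexicographically, and a parent's key is a
# prefix of each descendant's key, so the parent sorts first.
ordered = sorted([d, c, b, a], key=composite_key)
print([n.name for n in ordered])  # ['A', 'C', 'B', 'D']
```

Here the key is rebuilt by walking up the parent links instead of walking down a tree, which gives the same keys for this data. Each key has length equal to the node's depth, which is where the max_depth factor in the complexity estimate comes from.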
You could sort the objects into lexicographical order using a sequence of one or two numbers as the sort key: if an object has no parent, its sequence has a single element, which is its own number; if an object has a parent, the first element in the sequence is its parent's number and the second is its own number.
So A, B, and C get sequences {1}, {1, 3}, and {2} and B sorts just after its parent.
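A minimal Python sketch of this one-or-two-number key, using the A, B, C example from the question. The tuple layout `(name, timestamp, parent name)` and the helper `sort_key` are assumptions made for illustration, and the sketch only covers a single level of parent-child nesting, as the answer describes.

```python
# (name, timestamp, parent name or None); values from the question's example.
objects = [("A", 1, None), ("B", 3, "A"), ("C", 2, None)]

# Look up any object's timestamp by name, so a child can find its parent's number.
timestamp = {name: ts for name, ts, _parent in objects}


def sort_key(obj):
    """Return (own ts,) for a root, or (parent ts, own ts) for a child."""
    _name, ts, parent = obj
    if parent is None:
        return (ts,)
    return (timestamp[parent], ts)


result = [name for name, _ts, _parent in sorted(objects, key=sort_key)]
print(result)  # ['A', 'B', 'C']
```

Python compares tuples lexicographically, so (1,) sorts before (1, 3), which in turn sorts before (2,).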

Stop Matlab from treating a 1xn matrix as a column vector

I'm very frustrated with MATLAB right now. Let me illustrate the problem. I'm going to use informal notation here.
I have a column cell vector of strings called B. For now, let's say B = {'A';'B';'C';'D'}.
I want to have a matrix G, which is m-by-n, and I want to replace the numbers in G with the respective elements of B... For example, let's say G is [4 3; 2 1]
Let's say I have a variable n which says how many rows of G I want to take out.
When I do B(G(1:2,:)), I get what I want ['D' 'C'; 'B' 'A']
However, if I do B(G(1:1,:)) I get ['D';'C'] when what I really want to get is ['D' 'C']
I am using 1:n, and I want it to have the same behavior for n = 1 as it does for n = 2 and n = 3. Basically, G is actually an n-by-1500 matrix, and I want to take the top n rows and use them as indexes into B.
I could use an if statement that transposes the result if n = 1, but that seems so unnecessary. Is there really no way to stop it from treating my 1-by-n matrix as if it were a column vector?
According to this post by Loren Shure:
Indexing with one array C = A(B) produces output the size of B unless both A and B are vectors.
When both A and B are vectors, the number of elements in C is the number of elements in B and with orientation of A.
You are in the second case, hence the behaviour you see.
To make it work, the output needs to keep as many columns as G has. To achieve that, you can do something like this:
out = reshape(B(G(1:n,:)),[],size(G,2))
Thus, with n = 1:
out =
'D' 'C'
With n = 2:
out =
'D' 'C'
'B' 'A'
I think this will only happen in the 1-D case. By default, MATLAB returns a column vector here, which matches the way it stores matrices. If you want a row vector, you could just transpose the result. In my opinion it should be fine when n > 1.

Excel: Removing Duplicate Values Using Array Formula For Multiple Columns

I am trying to remove the duplicates from 7 different columns and combine the unique values into one column, and I can't find a way to do that using an Excel formula.
I've tried the array approach below, but it doesn't work for more than one column:
=INDEX($A$11:$A$100000, MATCH(0, COUNTIF($C$11:C11,$A$11:$A$100000), 0))
Here's what I'd like ideally:
Starting data:
Column 1: a b d c b i
Column 2: c g h f d c
Column 3: f e a g b a
Ending result:
a
b
c
d
e
f
g
h
i
...
(order not important)
Any solutions would be appreciated.
Not sure if this answers the question exactly, but you could try using COUNTIFS to identify rows where combinations of two or more columns contain duplicate values:
=COUNTIFS($B:$B,$B1,$C:$C,$C1)
This formula returns the number of rows in which the combination of values in B1 and C1 appears, so a result greater than 1 marks a duplicate. You can copy and paste it down to every row of your data, or use it as an array formula.
There's more on how to do this here:
http://fiveminutelessons.com/learn-microsoft-excel/find-duplicate-rows-excel-across-multiple-columns

Matlab cell array to string vector - unique

Going nuts with cell array, because I just can't get rid of it... However, it will be an easy one for you guys out here.
So here is why:
I have a dataset (data) which contains two variables: A (Numbers) and B (cell array).
Unfortunately, I can't even reconstruct the problem. Nevertheless, my imported table looks like this:
data=dataset;
data.A = [1;1;3;3;3];
data.B = ['A';'A';'BUU';'BUU';'A'];
where data.B is of type 5x1 cell, which I can't reconstruct.
All I want now is the unique rows, like
ans= [1 A;3 BUU;3 A]
The result should be in a dataset or just two vectors where the rows are equivalent.
But unique([dataA dataB],'rows') can't handle cell arrays, and I can't find anywhere on the web how to simply convert the cell array B to a vector of strings (does that exist?).
cell2mat() didn't work for me, because of the different word length ('A' vs 'BUU').
So, two things I would love to learn: how to turn a 5x1 cell into a string vector,
and how to find unique rows out of numbers and strings (or cells).
Thank you very much!
Cheers Dominik
The problem is that the A and B fields are of different types. Although they could be concatenated into a cell array, unique can't handle that. A general trick for cases like this is to "translate" the elements of each field (column) to unique identifiers, i.e. numbers. This translation can be done by applying unique to each field separately and taking its third output. The obtained identifiers can now be concatenated into a matrix, so that each row of this matrix is a "composite identifier". Finally, unique with the 'rows' option can be applied to this matrix.
So, in your case:
[~, ~, kA] = unique(data.A);
[~, ~, kB] = unique(data.B);
[~, jR] = unique([kA kB], 'rows');
Now build the result as (same format as data)
result.A = data.A(jR);
result.B = data.B(jR);
or as (2D cell array)
result = cat(2, mat2cell(data.A(jR), ones(1,numel(jR))), data.B(jR));
Here is my clumsy solution
tt.A = [1;1;3;3;3];
tt.B = {'A';'A';'BUU';'BUU';'A'};
Convert integers to characters, then merge and find unique strings
tt.C = cellstr(num2str(tt.A));
tt.D = cellfun(@(x,y) [x y],tt.C,tt.B,'UniformOutput',0);
[tt.F,tt.E] = unique(tt.D);
Display results
tt.F

In Pig, finding complement of entries in a table

I have a table A which contains a list, and a table B which contains a sub-list of the items in A. How should I get a table C which contains the complement of B in A?
I know how to do it in SQL. I am not sure how to approach it in Pig.
Thanks.
In Pig terms, you have two "bags" A and B, where B is a subset of A.
If B only contains values in A, you can do C = DIFF(A,B).
However, consider that DIFF removes duplicates, so you will get the complement of B in A reduced to unique values.
Generally, DIFF provides the union of both the complement of B in A and that of A in B.
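To make the last two points concrete, here is a small Python model of the behaviour described above, using sets. It is an analogy only, not Pig code: the variable names and sample values are made up, and it assumes DIFF behaves as a duplicate-free symmetric difference, as stated in this answer.

```python
# Hypothetical sample data; B is a subset of A, and A contains a duplicate.
a = ["x", "y", "y", "z", "w"]
b = ["y", "w"]

# Symmetric difference: elements that appear in exactly one of the two bags.
sym_diff = set(a) ^ set(b)

# Complement of B in A, reduced to unique values.
complement = set(a) - set(b)

# When B only contains values from A, the two coincide.
print(sorted(sym_diff))    # ['x', 'z']
print(sorted(complement))  # ['x', 'z']

# If the second bag contains a value that is not in A, that value shows up too.
not_subset = ["y", "q"]
print(sorted(set(a) ^ set(not_subset)))  # ['q', 'w', 'x', 'z']
```

This is why the subset condition matters: without it, the result is no longer just the complement of B in A.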
