Hadoop Pig: joining on any matching tuple values - arrays

I'm new to Pig and trying to use it to process a dataset. I have a set of records that look like:
id elements
--------------
1 ["a","b","c"]
2 ["a","f","g"]
3 ["f","g","h"]
The idea is that I want to create tuples of ids whose records have any overlapping elements. If elements were just a single item instead of an array, I could do a simple join like:
A = LOAD 'mydata' ...
B = FOREACH A GENERATE id as id_2, elements as elements_2;
C = JOIN A BY elements, B BY elements_2;
But since elements is an array, this won't work when there is only a partial overlap. Any thoughts on how to do this in Pig?
The intended output would give the tuples that have overlap:
(1,2)
(2,3)

I don't think it's possible to use JOIN for this.
One (not so elegant) solution is to CROSS both relations and then do a FILTER operation.
The FILTER condition could either be a UDF or some kind of regex_extract_all and a matching of the produced fields. If the size of the array is always 3, I would probably go for the regex_extract_all solution.
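A minimal sketch of that CROSS + FILTER idea, assuming elements loads as a bag of single-field tuples and using a hypothetical boolean UDF myudfs.Overlaps(bag, bag) that returns true when the two bags share at least one element; the id < id_2 condition keeps each pair only once:
-- REGISTER the jar defining the hypothetical myudfs.Overlaps UDF first
A = LOAD 'mydata' AS (id:int, elements:bag{t:tuple(e:chararray)});
B = FOREACH A GENERATE id AS id_2, elements AS elements_2;
C = CROSS A, B;
-- keep one copy of each pair and test for overlapping elements
D = FILTER C BY id < id_2 AND myudfs.Overlaps(elements, elements_2);
E = FOREACH D GENERATE id, id_2;
DUMP E;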

Related

How to extract list of unique combinations from multiline array, where only every two consecutive rows are the basis for combinations?

From a multiline table I'm trying to extract (and count) all unique combinations between every two rows.
I have found a way by creating intermediate tables:
1) I use JOIN to create the combinations between each "city" and "zone" row.
2) I rearrange the result into one single column (as necessary for the following UNIQUE function) by entering a TRANSPOSE function every 7th row (because there are 7 columns).
3) I then use UNIQUE and COUNTIF in a third table for the final result.
Link to the example with data and desired result on first sheet, and my solution on the second sheet: Google Sheets file
As my final sheet will be rather large (~2000 rows, 40 columns, ~4000 unique combinations expected), my main problem with my own solution is the manual step to rearrange the city/zone combinations into one single column as preparation for the UNIQUE function.
Is there a way to achieve the same final result without the intermediate steps from my solution?
C11:
=ARRAYFORMULA(QUERY(TRANSPOSE(SPLIT(QUERY( TRANSPOSE( QUERY("☯"&QUERY(TO_TEXT(B1:H8),"skipping 2",0)&" "&QUERY(B2:H8,"skipping 2",0),,2^99)),,2^99),"☯"))," Select Col1, count(Col1) group by Col1",0))
Create two arrays with QUERY's skipping clause, one containing cities and the other containing zones.
JOIN those arrays with a delimiter, using QUERY headers.
SPLIT and TRANSPOSE to create a single column.
QUERY to create a frequency count.

Retrieve Index by Values From an Array Through a ComboBox

I have a transposed dynamic n x 2 array that is used to populate a combobox. The primary column alone is not descriptive enough to identify rows uniquely. I would like to use the row index to identify the untransposed column uniquely. Using both columns in the array could also be used for this but may prove problematic down the line. This question is closely related to this question.
I have used Me.cbo.ListIndex = 0 to retrieve the index value. Ideally, I'd like to assign the index of the row chosen in the combobox to a variable. The ultimate goal is to use the index in two ways:
For finding the correct column to use in future calculations
As a method for comparison against another combobox that uses the same array in order to ensure that the same row has not been chosen in both comboboxes
To visually illustrate the above, the original data looks like this:
a b c b
1 2 3 4
A B C B
The transposed array looks like this:
A 1
B 2
C 3
B 4
I would like to be able to make a distinction between selecting B2 and B4, ideally by preserving and comparing index 1 and 3 respectively (0-based).
ListIndex is from the documentation. There is no documentation that I could find about retrieving the index from the name, except where the value in the selection is unique. Any help is greatly appreciated.
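A minimal sketch of the comparison use case described above, assuming two comboboxes cboA and cboB (hypothetical names) populated from the same array; ListIndex returns the 0-based index of the selected row, or -1 when nothing is selected:
Private Sub cboB_Change()
    ' ListIndex is -1 when no row is selected
    Dim idx As Long
    idx = Me.cboB.ListIndex
    If idx <> -1 And idx = Me.cboA.ListIndex Then
        MsgBox "The same row is selected in both comboboxes."
    End If
End Sub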

join columns of arrays in MATLAB

I have the following inputs
dataset 1 with tens of thousands of rows and 5 array columns
dataset 2 with tens of thousands of rows and 3 array columns
I want to join/merge (add) the 3rd column of dataset 1 to a new 4th array column of dataset 2 for the elements for which the ID is the same (the same value in column 1 of dataset 1 and column 1 of dataset 2). Mathematically, I think you can write it like this:
dataset2(i,4)=dataset1(find(dataset1(:,1)==c(i,1)),3);
but how to put it in MATLAB?
None of the methods mentioned in the MATLAB help function or elsewhere on the internet seem to work. I have already tried merge, join, ismember, vectors, but I can't solve the problem.
Does someone have any ideas? I know the problem can be solved with for loops, but I'm not allowed to use them, so I am searching for alternatives.
I believe this is what you want:
%We keep the index of all the matching rows
%NOTICE: I changed c(i,1) to dataset2(:,1)
%matches_in_col_1 = find(dataset1(:,1)==dataset2(:,1));
%EDIT: HOW TO COMPARE MORE THAN 2 COLUMNS
%If you want to find matches across 4 datasets, combine the pairwise
%comparisons with & (chaining == does not compare all four columns):
matches_in_col_1 = find(dataset1(:,1)==dataset2(:,1) & ...
                        dataset2(:,1)==dataset3(:,1) & ...
                        dataset3(:,1)==dataset4(:,1));
%now copy the values from those rows into the corresponding rows
%of dataset2
dataset2(matches_in_col_1,4) = dataset1(matches_in_col_1,3);
I'm not 100% sure. Why is i present? Were you trying a loop implementation? My solution also assumes that c was supposed to be dataset2.
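If the IDs are not guaranteed to sit on the same rows in both datasets, a sketch using ismember (my assumption, not part of the answer above) matches them by value instead of by position:
% Match IDs by value rather than by row position
[found, loc] = ismember(dataset2(:,1), dataset1(:,1));
% loc(found) indexes the first matching row of dataset1 for each hit
dataset2(found, 4) = dataset1(loc(found), 3);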

Finding arrays that contain a subset of another array without using @> with PostgreSQL

I have a table with 1.5 MM records. Each record has a row number and an array with between 1 and 1,000 elements in the array. I am trying to find all of the arrays that are a subset of the larger arrays.
When I use the code below, I get ERROR: statement requires more resources than resource queue allows (possibly because there are over a trillion possible combinations):
select a.array as dup
from table a
left join table b
  on b.array @> a.array
  and a.row_number <> b.row_number
Is there a more efficient way to identify which arrays are subsets of the other arrays and mark them for removal, other than using @>?
Your example code suggests that you are only interested in finding arrays that are subsets of any other array in another row of the table.
However, your query with a JOIN returns all combinations, possibly multiplying results.
Try an EXISTS semi-join instead, returning qualifying rows only once:
SELECT a.array as dup
FROM table a
WHERE EXISTS (
   SELECT 1
   FROM table b
   WHERE a.array <@ b.array
   AND a.row_number <> b.row_number
);
With this form, Postgres can stop iterating rows as soon as the first match is found. If this won't go through either, try partitioning your query. Add a clause like
AND table_id BETWEEN 0 AND 10000
and iterate through the table. That should be valid for this case.
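For example, one pass over a slice might look like this (table_id and the bounds are placeholders from the clause above):
SELECT a.array AS dup
FROM table a
WHERE a.table_id BETWEEN 0 AND 10000
AND EXISTS (
   SELECT 1
   FROM table b
   WHERE a.array <@ b.array
   AND a.row_number <> b.row_number
);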
Aside: it's a pity that your derivative (Greenplum) doesn't seem to support GIN indexes, which would make this operation much faster. (The index itself would be big, though.)
Well, I don't see how to do this efficiently in a single declarative SQL statement without appropriate support from an index. I don't know how well this would work with a GIN index, but using a GIN index would certainly avoid the need to compare every possible pair of rows.
The first thing I would do is to carefully investigate the types of indexes you do have at your disposal, and try creating one as necessary.
If that doesn't work, the first thing that comes to my mind, procedurally speaking, would be to sort all the arrays, then sort the rows into a graded lexicographic order on the arrays. Then start with the shortest arrays, and work upwards as follows: e.g. for [1,4,9], check all the arrays with length <= 3 that start with 1 to see if they are a subset, then check all arrays with length <= 2 that start with 4, and then check all the arrays of length <= 1 that start with 9, removing any found subsets from consideration as you go so that you don't keep re-checking the same rows over and over again.
I'm sure you could tune this algorithm a bit, especially depending on the particular nature of the data involved. I wouldn't be surprised if there was a much better algorithm; this is just the first thing that I thought of. You may be able to work from this algorithm backwards to the SQL you want, or you may have to dump the table for client-side processing, or some hybrid thereof.
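As a rough illustration of the client-side route, here is a naive Python sketch of the core check (it sorts rows by array length and omits the prefix-based pruning described above; fetch_rows is a hypothetical helper yielding (row_number, array) pairs):
# Sort rows by array length so only later rows can be supersets
rows = sorted(((rn, frozenset(arr)) for rn, arr in fetch_rows()),
              key=lambda r: len(r[1]))
dups = set()
for i, (rn_a, set_a) in enumerate(rows):
    for rn_b, set_b in rows[i + 1:]:
        # mark rn_a as soon as any other row contains all its elements
        if set_a <= set_b:
            dups.add(rn_a)
            break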

Turn a matrix into a sorted list in Google Docs/Spreadsheets

I've created a large-ish matrix by doing a =pearson( analysis on survey responses in Google Docs/Spreadsheets and would like to convert it into a sorted list.
The matrix has labels (the survey questions) in row 2 and column b. Each intersecting cell has the value. Here's what the formula looks like.
=pearson(FILTER( Pc!$C$2:$AW$999 ; Pc!$C$2:$AW$2= C$2 ),FILTER(Pc!$C$2:$AW$999 ;Pc!$C$2:$AW$2=$B3))
This is what I'd like to get to:
a b c
Question one question 2 correlation
Then sorting by column c is easy.
How can I get all the points out of the matrix/array, along with the labels in this way?
Ideally I'd be able to do this only for points below the diagonal, as there are of course dupes above.
Thanks!
I think I found a solution for placing the combination of the headers in a single column.
It involves a series of auxiliary columns, but it works.
Let's say we have a single column with all unique headers in column A; I'll assume there are 6 values. So, on cell B1 we paste:
=ArrayFormula(join(";";A1&","&A2:A$6))
And then copy it down to B5. On C1 we join it all and split it, making a single column:
=transpose(split(join(";";B1:B5);";"))
If needed, we can split the combination into two columns again on D1:
=ArrayFormula(split(C1:C15;","))
I don't know why, but the value on E1 does not work correctly, so I just pasted =A2.
With these columns you can easily do your nice Pearson-Filter trick again to have it all in a single column. Hope this helps :)
Maybe something like this will help:
=ArrayFormula(transpose(split(CONCATENATE(transpose(C2:AW999)&char(9)), char(9))))
(C2:AW999 is your data range)
