A web application can send to a function an array of arrays like
[
  [
    [1,2],
    [3,4]
  ],
  [
    [],
    [4,5,6]
  ]
]
The outer array length is n > 0. The middle arrays are of a constant length, 2 in this example. And the inner arrays' lengths are >= 0.
I could string-build it like this:
with t(a, b) as (
values (1, 4), (2, 3), (1, 4), (7, 3), (7, 4)
)
select distinct a, b
from t
where
(a = any(array[1,2]) or array_length(array[1,2],1) is null)
and
(b = any(array[3,4]) or array_length(array[3,4],1) is null)
or
(a = any(array[]::int[]) or array_length(array[]::int[],1) is null)
and
(b = any(array[4,5,6]) or array_length(array[4,5,6],1) is null)
;
a | b
---+---
7 | 4
1 | 4
2 | 3
But I think I can do better, like this:
with t(a, b) as (
values (1, 4), (2, 3), (1, 4), (7, 3), (7, 4)
), u as (
select unnest(a)::text[] as a
from (values
(
array[
'{"{1,2}", "{3,4}"}',
'{"{}", "{4,5,6}"}'
]::text[]
)
) s(a)
), s as (
select a[1]::int[] as a1, a[2]::int[] as a2
from u
)
select distinct a, b
from
t
inner join
s on
(a = any(a1) or array_length(a1, 1) is null)
and
(b = any(a2) or array_length(a2, 1) is null)
;
a | b
---+---
7 | 4
2 | 3
1 | 4
Notice that a text array was passed and then cast inside the function. That was necessary because PostgreSQL can only deal with arrays of matching dimensions, while the passed inner arrays can vary in length. I could "fix" them before passing by padding with some special value like zero to make them all the same length as the longest one, but I think it is cleaner to deal with that inside the function.
Am I missing something? Is it the best approach?
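For illustration, here is a sketch in Python (with hypothetical helper names) of how the web application could build the one-dimensional text[] elements from the nested lists before passing them to the function:

```python
def int_array_literal(xs):
    """Render a list of ints as a Postgres int[] literal, e.g. [1, 2] -> '{1,2}'."""
    return "{" + ",".join(map(str, xs)) + "}"

def wrap_group(group):
    """Render one middle-level group as a text[] literal whose elements are the
    stringified inner arrays, e.g. [[1, 2], [3, 4]] -> '{"{1,2}","{3,4}"}'."""
    return "{" + ",".join('"' + int_array_literal(inner) + '"' for inner in group) + "}"

payload = [[[1, 2], [3, 4]], [[], [4, 5, 6]]]
text_array_elements = [wrap_group(g) for g in payload]
# -> ['{"{1,2}","{3,4}"}', '{"{}","{4,5,6}"}']
```

Each element can then be cast to int[] inside the function, exactly as the queries above do.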
I like your second approach.
SELECT DISTINCT t.*
FROM (VALUES (1, 4), (5, 1), (2, 3), (1, 4), (7, 3), (7, 4)) AS t(a, b)
JOIN (
SELECT arr[1]::int[] AS a1
,arr[2]::int[] AS b1
FROM (
SELECT unnest(ARRAY['{"{1,2}", "{3,4}"}'
,'{"{}" , "{4,5,6}"}'
,'{"{5}" , "{}"}' -- added element to 1st dimension
])::text[] AS arr -- 1d text array
) sub
) s ON (a = ANY(a1) OR a1 = '{}')
AND (b = ANY(b1) OR b1 = '{}')
;
Suggesting only minor improvements:
Subqueries instead of CTEs for slightly better performance.
Simplified test for an empty array: comparing against the literal '{}' instead of a function call.
One less subquery level for unwrapping the array.
Result:
a | b
--+---
2 | 3
7 | 4
1 | 4
5 | 1
For the casual reader: wrapping the multi-dimensional integer array is necessary, since Postgres demands that (quoting the error message):
multidimensional arrays must have array expressions with matching dimensions
An alternate route would be a 2-dimensional text array, unnested with generate_subscripts():
WITH a(arr) AS (SELECT '{{"{1,2}", "{3,4}"}
,{"{}", "{4,5,6}"}
,{"{5}", "{}"}}'::text[] -- 2d text array
)
SELECT DISTINCT t.*
FROM (VALUES (1, 4), (5, 1), (2, 3), (1, 4), (7, 3), (7, 4)) AS t(a, b)
JOIN (
SELECT arr[i][1]::int[] AS a1
,arr[i][2]::int[] AS b1
FROM a, generate_subscripts(a.arr, 1) i -- using implicit LATERAL
) s ON (t.a = ANY(s.a1) OR s.a1 = '{}')
AND (t.b = ANY(s.b1) OR s.b1 = '{}');
Might be faster, can you test?
In versions before 9.3 one would use an explicit CROSS JOIN instead of lateral cross joining.
(Disclaimer: I've simplified my problem to the salient points, what I want to do is slightly more complicated but I describe the core issue here.)
I am trying to build a network using keras to learn properties of some 5 by 5 matrices.
The input data is in the form of a 1000 by 5 by 5 numpy array, where each 5 by 5 sub-array represents a single matrix.
What I want the network to do is to use the properties of each row in the matrix, so I would like to split each 5 by 5 array into individual 1 by 5 arrays and pass each of these 5 arrays on to the next part of the network.
Here is what I have so far:
input_mat = keras.Input(shape=(5,5), name='Input')

part_list = list()
for i in range(5):
    part_list.append(keras.layers.Lambda(lambda x: x[i,:])(input_mat))

dense_list = list()
for i in range(5):
    dense_list.append(keras.layers.Dense(10, activation='selu',
                                         use_bias=True)(part_list[i]))

conc = keras.layers.Concatenate(axis=-1, name='Concatenate')(dense_list)
dense_out = keras.layers.Dense(1, name='D_out', activation='sigmoid')(conc)

model = keras.Model(inputs=input_mat, outputs=dense_out)
model.compile(optimizer='adam', loss='mean_squared_error')
My problem is that this does not appear to train well, and looking at the model summary I am not sure that the network is splitting the inputs as I would like:
Layer (type) Output Shape Param # Connected to
==================================================================================================
Input (InputLayer) (None, 5, 5) 0
__________________________________________________________________________________________________
lambda_5 (Lambda) (5, 5) 0 Input[0][0]
__________________________________________________________________________________________________
lambda_6 (Lambda) (5, 5) 0 Input[0][0]
__________________________________________________________________________________________________
lambda_7 (Lambda) (5, 5) 0 Input[0][0]
__________________________________________________________________________________________________
lambda_8 (Lambda) (5, 5) 0 Input[0][0]
__________________________________________________________________________________________________
lambda_9 (Lambda) (5, 5) 0 Input[0][0]
__________________________________________________________________________________________________
dense (Dense) (5, 10) 60 lambda_5[0][0]
__________________________________________________________________________________________________
dense_1 (Dense) (5, 10) 60 lambda_6[0][0]
__________________________________________________________________________________________________
dense_2 (Dense) (5, 10) 60 lambda_7[0][0]
__________________________________________________________________________________________________
dense_3 (Dense) (5, 10) 60 lambda_8[0][0]
__________________________________________________________________________________________________
dense_4 (Dense) (5, 10) 60 lambda_9[0][0]
__________________________________________________________________________________________________
Concatenate (Concatenate) (5, 50) 0 dense[0][0]
dense_1[0][0]
dense_2[0][0]
dense_3[0][0]
dense_4[0][0]
__________________________________________________________________________________________________
D_out (Dense) (5, 1) 51 Concatenate[0][0]
==================================================================================================
Total params: 351
Trainable params: 351
Non-trainable params: 0
The input and output nodes of the Lambda layers look wrong to me, though I'm afraid I'm still struggling to understand the concept.
Lambda layers are to be avoided. Subclass keras.layers.Layer instead:
import tensorflow as tf
from tensorflow import keras

class Slice(keras.layers.Layer):
    def __init__(self, begin, size, **kwargs):
        super(Slice, self).__init__(**kwargs)
        self.begin = begin
        self.size = size

    def get_config(self):
        config = super().get_config().copy()
        config.update({
            'begin': self.begin,
            'size': self.size,
        })
        return config

    def call(self, inputs):
        return tf.slice(inputs, self.begin, self.size)
In the line
part_list.append(keras.layers.Lambda(lambda x: x[i,:])(input_mat))
you are indexing along the batch dimension, basically taking the first 5 of the 1000 matrices instead of the 5 rows of each matrix, which is not what you want to do.
To achieve what you want, try tensorflow's unstack operation:
part_list = tf.unstack(input_mat, axis=1)
This should give you a list having 5 elements, each element having shape [1000, 5]
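As a sanity check, here is a numpy analogue of what unstack does along axis 1, assuming a batch of 1000 matrices (illustration only, no TensorFlow required):

```python
import numpy as np

batch = np.arange(1000 * 5 * 5, dtype=float).reshape(1000, 5, 5)

# Analogue of tf.unstack(batch, axis=1): one (1000, 5) array per matrix row
rows = [batch[:, i, :] for i in range(5)]
```

Each element of rows holds the i-th row of every matrix in the batch, which is exactly the split the question is after.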
I was trying to implement pivoting similar to SQL Server's in Spark.
As of now, I'm using sqlContext and applying all the transformations within the SQL.
I would like to know if I can do a direct pull from SQL Server and implement the pivot function using Spark.
Below is an example of what I'm trying to achieve.
SQL Server queries below-
create table #temp(ID Int, MonthPrior int, Amount float);
insert into #temp
values (100,1,10),(100,2,20),(100,3,30),(100,4,10),(100,5,20),(100,6,60),(200,1,10),(200,2,20),(200,3,30),(300,4,10),(300,5,20),(300,6,60);
select * from #temp;
|ID |MonthPrior|Amount|
|-------|----------|------|
|100 |1 |10|
|100 |2 |20|
|100 |3 |30|
|100 |4 |10|
|100 |5 |20|
|100 |6 |60|
|200 |1 |10|
|200 |2 |20|
|200 |3 |30|
|300 |4 |10|
|300 |5 |20|
|300 |6 |60|
Select ID,coalesce([1],0) as Amount1Mth, coalesce([1],0)+coalesce([2],0)+coalesce([3],0) as Amount1to3Mth, coalesce([1],0)+coalesce([2],0)+coalesce([3],0)+coalesce([4],0)+coalesce([5],0)+coalesce([6],0) as Amount_AllMonths from (select * from #temp) A
pivot
( sum(Amount) for MonthPrior in ([1],[2],[3],[4],[5],[6]) ) as Pvt
|ID |Amount1Mth |Amount1to3Mth |Amount_AllMonths|
|-------|-------|-------|---|
|100 |10 |60 |150|
|200 |10 |60 |60|
|300 |0 |0 |90|
If your Amount column is of Decimal type, it would be best to use java.math.BigDecimal as the corresponding argument type. Note that methods + and sum are no longer applicable and are replaced with add and reduce, respectively.
import org.apache.spark.sql.functions._
import java.math.BigDecimal
val df = Seq(
(100, 1, new BigDecimal(10)),
(100, 2, new BigDecimal(20)),
(100, 3, new BigDecimal(30)),
(100, 4, new BigDecimal(10)),
(100, 5, new BigDecimal(20)),
(100, 6, new BigDecimal(60)),
(200, 1, new BigDecimal(10)),
(200, 2, new BigDecimal(20)),
(200, 3, new BigDecimal(30)),
(300, 4, new BigDecimal(10)),
(300, 5, new BigDecimal(20)),
(300, 6, new BigDecimal(60))
).toDF("ID", "MonthPrior", "Amount")
// UDF to combine 2 array-type columns to map
def arrayToMap = udf(
(a: Seq[Int], b: Seq[BigDecimal]) => (a zip b).toMap
)
// Create array columns which get zipped into a map
val df2 = df.groupBy("ID").agg(
collect_list(col("MonthPrior")).as("MonthList"),
collect_list(col("Amount")).as("AmountList")
).select(
col("ID"), arrayToMap(col("MonthList"), col("AmountList")).as("MthAmtMap")
)
// UDF to sum map values for keys from 1 thru n (0 for all)
def sumMapValues = udf(
(m: Map[Int, BigDecimal], n: Int) =>
if (n > 0)
m.collect{ case (k, v) => if (k <= n) v else new BigDecimal(0) }.reduce(_ add _)
else
m.collect{ case (k, v) => v }.reduce(_ add _)
)
val df3 = df2.withColumn( "Amount1Mth", sumMapValues(col("MthAmtMap"), lit(1)) ).
withColumn( "Amount1to3Mth", sumMapValues(col("MthAmtMap"), lit(3)) ).
withColumn( "Amount_AllMonths", sumMapValues(col("MthAmtMap"), lit(0)) ).
select( col("ID"), col("Amount1Mth"), col("Amount1to3Mth"), col("Amount_AllMonths") )
df3.show(truncate=false)
+---+--------------------+--------------------+--------------------+
| ID| Amount1Mth| Amount1to3Mth| Amount_AllMonths|
+---+--------------------+--------------------+--------------------+
|300| 0E-18| 0E-18|90.00000000000000...|
|100|10.00000000000000...|60.00000000000000...|150.0000000000000...|
|200|10.00000000000000...|60.00000000000000...|60.00000000000000...|
+---+--------------------+--------------------+--------------------+
One approach would be to create a map-type column from arrays of MonthPrior and Amount, and apply a UDF that sums the map values based on an integer parameter:
val df = Seq(
(100, 1, 10),
(100, 2, 20),
(100, 3, 30),
(100, 4, 10),
(100, 5, 20),
(100, 6, 60),
(200, 1, 10),
(200, 2, 20),
(200, 3, 30),
(300, 4, 10),
(300, 5, 20),
(300, 6, 60)
).toDF("ID", "MonthPrior", "Amount")
import org.apache.spark.sql.functions._
// UDF to combine 2 array-type columns to map
def arrayToMap = udf(
(a: Seq[Int], b: Seq[Int]) => (a zip b).toMap
)
// Aggregate columns into arrays and apply arrayToMap UDF to create map column
val df2 = df.groupBy("ID").agg(
collect_list(col("MonthPrior")).as("MonthList"),
collect_list(col("Amount")).as("AmountList")
).select(
col("ID"), arrayToMap(col("MonthList"), col("AmountList")).as("MthAmtMap")
)
// UDF to sum map values for keys from 1 thru n (0 for all)
def sumMapValues = udf(
(m: Map[Int, Int], n: Int) =>
if (n > 0) m.collect{ case (k, v) => if (k <= n) v else 0 }.sum else
m.collect{ case (k, v) => v }.sum
)
// Apply sumMapValues UDF to the map column
val df3 = df2.withColumn( "Amount1Mth", sumMapValues(col("MthAmtMap"), lit(1)) ).
withColumn( "Amount1to3Mth", sumMapValues(col("MthAmtMap"), lit(3)) ).
withColumn( "Amount_AllMonths", sumMapValues(col("MthAmtMap"), lit(0)) ).
select( col("ID"), col("Amount1Mth"), col("Amount1to3Mth"), col("Amount_AllMonths") )
df3.show
+---+----------+-------------+----------------+
| ID|Amount1Mth|Amount1to3Mth|Amount_AllMonths|
+---+----------+-------------+----------------+
|300| 0| 0| 90|
|100| 10| 60| 150|
|200| 10| 60| 60|
+---+----------+-------------+----------------+
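The arrayToMap / sumMapValues logic can also be sketched in plain Python, with a dict standing in for the map column (illustration only, no Spark required):

```python
from collections import defaultdict

rows = [(100, 1, 10), (100, 2, 20), (100, 3, 30), (100, 4, 10), (100, 5, 20),
        (100, 6, 60), (200, 1, 10), (200, 2, 20), (200, 3, 30),
        (300, 4, 10), (300, 5, 20), (300, 6, 60)]

# Analogue of the MthAmtMap column: ID -> {MonthPrior: Amount}
mth_amt = defaultdict(dict)
for id_, month, amount in rows:
    mth_amt[id_][month] = amount

def sum_map_values(m, n):
    """Sum amounts for months 1..n; n == 0 means all months."""
    return sum(v for k, v in m.items() if n == 0 or k <= n)

result = {id_: (sum_map_values(m, 1), sum_map_values(m, 3), sum_map_values(m, 0))
          for id_, m in mth_amt.items()}
# result[100] == (10, 60, 150)
```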
Thanks @LeoC, the above solution worked. I also tried the following:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
lazy val months = (((df select ($"MonthPrior") distinct) sort
($"MonthPrior".asc)).rdd map (_.getAs[Int](0)) collect).toList
lazy val sliceSpec = List((0, 2, "1-2"), (0, 3, "1-3"), (0, 4, "1-4"), (0, 5, "1-5"), (0, 6, "1-6"))
lazy val createGroup: List[Any] => ((Int, Int, String) => Column) = sliceMe => (start, finish, aliasName) =>
sliceMe slice (start, finish) map (value => col(value.toString)) reduce (_ + _) as aliasName
lazy val grouper = createGroup(months).tupled
lazy val groupedCols = sliceSpec map (group => grouper(group))
val pivoted = df groupBy ($"ID") pivot ("MonthPrior") agg (sum($"Amount"))
val writeMe = pivoted select ((pivoted.columns map col) ++ (groupedCols): _*)
z.show(writeMe sort ($"ID".asc))
I am trying to have a numpy array with random numbers from 0 to 1:
import numpy as np
x = np.random.random((3,3))
yields
[[ 0.11874238 0.71885484 0.33656161]
[ 0.69432263 0.25234083 0.66118676]
[ 0.77542651 0.71230397 0.76212491]]
And, from this array, I need the row,column combinations which have values bigger than 0.3. So the expected output should look like:
(0,1),(0,2),(1,0),(1,2),(2,0),(2,1),(2,2)
I want to be able to extract each such item (the value of x[row][column]) and write the output to a file. I tried the following:
with open('newfile.txt', 'w') as fd:
    for row in x:
        for item in row:
            if item > 0.3:
                print(item)
                for row in item:
                    for col in item:
                        print(row, column, '\n')
                        fd.write(row, column, '\n')
However, it raises an error :
TypeError: 'numpy.float64' object is not iterable
Also, I searched but could not find how to start the numpy index from 1 instead of 0. For example, the expected output would look like this:
(1,2),(1,3),(2,1),(2,3),(3,1),(3,2),(3,3)
Do you know how to get these outputs?
Get the indices along first two axes that match that criteria with np.nonzero/np.where on the mask of comparisons and then simply index with integer array indexing -
r,c = np.nonzero(x>0.3)
out = x[r,c]
If you are looking to get those indices a list of tuples, zip those indices -
zip(r,c)
To get those starting from 1, add 1 and then zip -
zip(r+1,c+1)
On Python 3.x, you would need to wrap it with list() : list(zip(r,c)) and list(zip(r+1,c+1)).
Sample run -
In [9]: x
Out[9]:
array([[ 0.11874238, 0.71885484, 0.33656161],
[ 0.69432263, 0.25234083, 0.66118676],
[ 0.77542651, 0.71230397, 0.76212491]])
In [10]: r,c = np.nonzero(x>0.3)
In [14]: zip(r,c)
Out[14]: [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1), (2, 2)]
In [18]: zip(r+1,c+1)
Out[18]: [(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2), (3, 3)]
In [13]: x[r,c]
Out[13]:
array([ 0.71885484, 0.33656161, 0.69432263, 0.66118676, 0.77542651,
0.71230397, 0.76212491])
Writing indices to file -
Use np.savetxt with int format, like so -
In [69]: np.savetxt("output.txt", np.argwhere(x>0.3), fmt="%d", comments='')
In [70]: !cat output.txt
0 1
0 2
1 0
1 2
2 0
2 1
2 2
With the 1 based indexing, add 1 to np.argwhere output -
In [71]: np.savetxt("output.txt", np.argwhere(x>0.3)+1, fmt="%d", comments='')
In [72]: !cat output.txt
1 2
1 3
2 1
2 3
3 1
3 2
3 3
You could use np.where, which (when applied to a 2D array) returns two arrays holding the indices of the rows (and corresponding columns) that satisfy the condition you specify as an argument.
Then you can zip these two arrays to get back a list of tuples:
list(zip(*np.where(x > 0.3)))
If you want to add 1 to every element of every tuple (1-based indexing), either loop over the tuples or add 1 to each array returned by where. Note that np.where returns a tuple, so unpack it first; an in-place += on the tuple's items would raise a TypeError:
r, c = np.where(x > 0.3)
r += 1  # adds one to every element of r thanks to broadcasting
c += 1
list(zip(r, c))
I have a table
id int | data json
With data:
1 | [1,2,3,2]
2 | [2,3,4]
I want to modify rows to delete array element (int) 2
Expected result:
1 | [1,3]
2 | [3,4]
As a_horse_with_no_name suggests in his comment the proper data type is int[] in this case. However, you can transform the json array to int[], use array_remove() and transform the result back to json:
with my_table(id, data) as (
values
(1, '[1,2,3,2]'::json),
(2, '[2,3,4]')
)
select id, to_json(array_remove(translate(data::text, '[]', '{}')::int[], 2))
from my_table;
id | to_json
----+---------
1 | [1,3]
2 | [3,4]
(2 rows)
Another possibility is to unnest the arrays with json_array_elements(), eliminate unwanted elements and aggregate the result:
select id, json_agg(elem)
from (
select id, elem
from my_table,
lateral json_array_elements(data) elem
where elem::text::int <> 2
) s
group by 1;
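For comparison, the same remove-and-reaggregate idea sketched client-side in Python with the json module (illustration only):

```python
import json

data = {1: '[1,2,3,2]', 2: '[2,3,4]'}

# Parse each JSON array, drop every element equal to 2, serialize back
result = {rid: json.dumps([e for e in json.loads(txt) if e != 2],
                          separators=(',', ':'))
          for rid, txt in data.items()}
# -> {1: '[1,3]', 2: '[3,4]'}
```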
I have some main tests. Each main test consists of other tests, each of which in turn consists of other tests, and so on. See the trees below as an example.
Main Test 1 (ID:1)
├── Test ID:2 (+)
│   ├── Test ID:5 (+)
│   │   └── Test ID:12 (o)
│   └── Test ID:6 (o)
├── Test ID:3 (o)
└── Test ID:4 (+)
    ├── Test ID:7 (+)
    │   ├── Test ID:9 (o)
    │   └── Test ID:10 (o)
    └── Test ID:8 (o)

Main Test 2 (ID:2)
├── Test ID:3 (+)
│   ├── Test ID:5 (o)
│   ├── Test ID:10 (o)
│   └── Test ID:7 (o)
└── Test ID:8 (+)
Symbols:
'o' are leaves
'+' are parents
Main Test 1 and Main Test 2 are main tests (root tests).
Within each main test, test ids are unique, but a test id used within one main test can be repeated within another main test, as the trees above show.
I have an input table, let's say, "INPUT" with below columns:
ID_MainTest | ID_TEST | PASSED
With this input table we indicate which tests for each main test are passed.
Also we have another table that contains the representation of the trees above, let's say table "Trees":
ID_MainTest | ID_TEST | PARENT_ID_TEST
Finally we have another table, let's say table "TESTS", which contains all the tests and indicates the current result (PENDING, FAILED, PASSED) for each test:
ID_MainTest | ID_TEST | RESULT
So suppose the tables' content is the following:
INPUT table (ID_MainTest and ID_Test are primary keys):
ID_MainTest | ID_TEST | PASSED
1 4 1
1 5 1
1 6 1
1 2 1
1 3 1
2 3 1
TREES table (ID_MainTest and ID_Test are primary keys):
ID_MainTest | ID_TEST | PARENT_ID_TEST
1 2 NULL
1 3 NULL
1 4 NULL
1 5 2
1 6 2
1 7 4
1 8 4
1 12 5
1 9 7
1 10 7
2 3 NULL
2 8 NULL
2 5 3
2 10 3
2 7 3
TESTS table (ID_MainTest and ID_Test are primary keys):
ID_MainTest | ID_TEST | RESULT
1 2 PENDING
1 3 FAILED
1 4 FAILED
1 5 PASSED
1 6 PENDING
1 7 PASSED
1 8 FAILED
1 12 PASSED
1 9 PASSED
1 10 PENDING
2 3 PENDING
2 8 FAILED
2 5 PASSED
2 10 PENDING
2 7 PENDING
The functionality is the following:
A test (those indicated in the input table) will be switched to passed if and only if all its children figure as passed. If any of its children (or descendants) is failed, then the parent will be set to failed, despite being indicated as passed in the input table.
If a test is indicated as passed in the input table, all its children (and descendants) will be switched to passed, from the parent down to the leaves, when possible: children (and descendants) may only be switched to passed if they figure as pending. If a child (or descendant) figures as failed, it cannot be switched to passed (it stays failed). If a child (or descendant) already figures as passed, it simply stays passed.
A parent indicated as passed in the input table can be switched to passed if all its descendants figure as passed, regardless of whether this parent itself figures as failed or pending in the tests table (this is an exception).
So, taking into account the functionality and table contents above, I would like to obtain the result table below, containing only the tests we tried to switch to passed (successfully or not), switched to passed, or kept as failed or passed, including those indicated in the input table:
(ID_MainTest and ID_Test are primary keys):
ID_MainTest | ID_TEST | RESULT
1 2 PASSED
1 3 PASSED
1 4 FAILED
1 5 PASSED
1 6 PASSED
1 7 PASSED
1 8 FAILED
1 12 PASSED
1 9 PASSED
1 10 PASSED
2 3 PASSED
2 5 PASSED
2 10 PASSED
2 7 PASSED
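The switching rules can be modelled in plain Python (an illustrative sketch with hypothetical names, not the T-SQL solution itself) to check that they reproduce this expected table:

```python
# (main_test, test) -> parent test id (None for roots); mirrors the TREES table
trees = {
    (1, 2): None, (1, 3): None, (1, 4): None, (1, 5): 2, (1, 6): 2,
    (1, 7): 4, (1, 8): 4, (1, 12): 5, (1, 9): 7, (1, 10): 7,
    (2, 3): None, (2, 8): None, (2, 5): 3, (2, 10): 3, (2, 7): 3,
}
# (main_test, test) -> current result; mirrors the TESTS table
tests = {
    (1, 2): 'PENDING', (1, 3): 'FAILED', (1, 4): 'FAILED', (1, 5): 'PASSED',
    (1, 6): 'PENDING', (1, 7): 'PASSED', (1, 8): 'FAILED', (1, 12): 'PASSED',
    (1, 9): 'PASSED', (1, 10): 'PENDING',
    (2, 3): 'PENDING', (2, 8): 'FAILED', (2, 5): 'PASSED', (2, 10): 'PENDING',
    (2, 7): 'PENDING',
}
inputs = [(1, 4), (1, 5), (1, 6), (1, 2), (1, 3), (2, 3)]  # INPUT rows, PASSED = 1

def descendants(mt, t):
    """All descendants of test t within main test mt."""
    out = []
    for (m, c), p in trees.items():
        if m == mt and p == t:
            out.append(c)
            out.extend(descendants(mt, c))
    return out

result = dict(tests)
# Rule: pending descendants of an indicated test are switched to passed
for mt, t in inputs:
    for d in descendants(mt, t):
        if result[(mt, d)] == 'PENDING':
            result[(mt, d)] = 'PASSED'
# Rule: an indicated test passes iff none of its descendants failed
for mt, t in inputs:
    failed = any(result[(mt, d)] == 'FAILED' for d in descendants(mt, t))
    result[(mt, t)] = 'FAILED' if failed else 'PASSED'
# Keep only the tests that were touched
touched = set(inputs) | {(mt, d) for mt, t in inputs for d in descendants(mt, t)}
output = {k: result[k] for k in sorted(touched)}
```

Running this reproduces the expected table above, including tests 4 and 8 of main test 1 staying FAILED.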
I provide the initial tables below:
DECLARE #INPUT AS TABLE
(
ID_MainTest int,
ID_TEST int,
PASSED bit
)
INSERT INTO #INPUT VALUES
(1, 4, 1),
(1, 5, 1),
(1, 6, 1),
(1, 2, 1),
(1, 3, 1),
(2, 3, 1)
DECLARE #TREES AS TABLE
(
ID_MainTest int,
ID_TEST int,
PARENT_ID_TEST int
)
INSERT INTO #TREES VALUES
(1, 2, NULL),
(1, 3, NULL),
(1, 4, NULL),
(1, 5, 2),
(1, 6, 2),
(1, 7, 4),
(1, 8, 4),
(1, 12, 5),
(1, 9, 7),
(1, 10, 7),
(2, 3, NULL),
(2, 8, NULL),
(2, 5, 3),
(2, 10, 3),
(2, 7, 3)
DECLARE #TESTS AS TABLE
(
ID_MainTest int,
ID_TEST int,
RESULT NVARCHAR(50)
)
INSERT INTO #TESTS VALUES
(1, 2, 'PENDING'),
(1, 3, 'FAILED'),
(1, 4, 'FAILED'),
(1, 5, 'PASSED'),
(1, 6, 'PENDING'),
(1, 7, 'PASSED'),
(1, 8, 'FAILED'),
(1, 12, 'PASSED'),
(1, 9, 'PASSED'),
(1, 10, 'PENDING'),
(2, 3, 'PENDING'),
(2, 8, 'FAILED'),
(2, 5, 'PASSED'),
(2, 10, 'PENDING'),
(2, 7, 'PENDING')