I am trying to pass a list as a parameter to my function.
The list contains different coefficients to apply when lagging a number of columns.
However, I only manage to generate the columns in my dataframe for the first value of the list.
This is my actual result:
"col1", "col2", "col1_0.2", "col2_0.2"
What is expected:
"col1", "col2", "col1_0.2", "col2_0.2", "col1_0.4", "col2_0.4", "col1_0.6", "col2_0.6"
I must have missed something in my loop?
selected_col = col_selector(df, ["col1", "col2"])
w = Window.partitionBy("student").orderBy("date")
coef = (.1, .4, .6)

def custom_coef(col, w, coef):
    for x in coef:
        return sum(
            pow(i, x) * F.lag(F.col(col), i, default=0).over(w)
            for i in range(1)
        ).alias(col + "_" + str(x))

new_df = df.select(
    F.col("*"),
    *[custom_coef(col, w, coef) for col in selected_col]
)
thanks
The return statement in the custom_coef function ends the function during the first iteration of the loop over coef. This means that custom_coef always returns the first column definition only, i.e. the column definition for coefficient 0.1. Since the function is called once per column in selected_col, you get the result you are describing.
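As a minimal illustration (plain Python, hypothetical names, nothing Spark-specific):
def labels_return(coef):
    for x in coef:
        return "col1_" + str(x)   # leaves the function on the first pass

def labels_yield(coef):
    for x in coef:
        yield "col1_" + str(x)    # produces one label per coefficient

print(labels_return((.1, .4, .6)))       # col1_0.1
print(list(labels_yield((.1, .4, .6))))  # ['col1_0.1', 'col1_0.4', 'col1_0.6']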
One way to fix the problem without changing the structure of the code is to replace return with yield. This way custom_coef creates one generator per element of selected_col. These generators can be chained with itertools.chain, and the result can be used as the parameter list of the select statement:
from itertools import chain

def custom_coef(col, w, coef):
    for x in coef:
        yield sum(  # use yield instead of return
            pow(i, x) * F.lag(F.col(col), i, default=0).over(w)
            for i in range(1)
        ).alias(col + "_" + str(x))

new_df = df.select(
    F.col("*"),
    *chain(*[custom_coef(col, w, coef) for col in selected_col])  # chain the generators
)
new_df.show()
Goal: write a function in PostgreSQL SQL that takes as input an integer array in which each element is 0, 1, or -1, and returns an array of the same length, where each element of the output array is the sum of all adjacent nonzero values in the input array at the same or a lower index.
Example, this input:
{0,1,1,1,1,0,-1,-1,0}
should produce this result:
{0,1,2,3,4,0,-1,-2,0}
Here is my attempt at such a function:
CREATE FUNCTION runs(input int[], output int[] DEFAULT '{}')
RETURNS int[] AS $$
  SELECT
    CASE WHEN cardinality(input) = 0 THEN output
         ELSE runs(input[2:],
                   array_append(output, CASE
                     WHEN input[1] = 0 THEN 0
                     ELSE output[cardinality(output)] + input[1]
                   END)
              )
    END
$$ LANGUAGE SQL;
Which gives unexpected (to me) output:
# select runs('{0,1,1,1,1,0,-1,-1,-1,0}');
runs
----------------------------------------
{0,1,2,3,4,5,6,0,0,0,-1,-2,-3,-4,-5,0}
(1 row)
I'm using PostgreSQL 14.4. While I don't understand why there are more elements in the output array than in the input, the cardinality() call in the recursive invocation seems to be causing it, as does using array_length() or array_upper() in the same place.
Question: how can I write a function that gives me the output I want (and why is the function I wrote failing to do that)?
Bonus extra: for context, this input array comes from array_agg() invoked on a table column, and the output will go back into a table using unnest(). I'm converting to/from an array since I see no way to do this directly on the table, in particular because WITH RECURSIVE forbids references to the recursive table in either an outer join or a subquery. A way around using arrays (especially given the lack of tail-recursion optimization) would answer the general question, but I am still very curious why I'm seeing the extra elements in the output array.
Everything indicates that you have found a reportable Postgres bug. The function should work properly, and a slight modification unexpectedly changes its behavior. Add SELECT; right after $$ to get the function to run as expected; see Db<>fiddle.
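For reference, here is the OP's function with only that workaround added, a sketch of what the linked fiddle runs (CREATE OR REPLACE is used so it can overwrite the original definition):
-- the OP's function plus the suggested empty SELECT; (sketch; see the linked fiddle)
CREATE OR REPLACE FUNCTION runs(input int[], output int[] DEFAULT '{}')
RETURNS int[] AS $$
  SELECT;  -- the extra empty SELECT that works around the behavior described above
  SELECT
    CASE WHEN cardinality(input) = 0 THEN output
         ELSE runs(input[2:],
                   array_append(output, CASE
                     WHEN input[1] = 0 THEN 0
                     ELSE output[cardinality(output)] + input[1]
                   END)
              )
    END
$$ LANGUAGE SQL;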
A good alternative to a recursive solution is a simple iterative function. Handling arrays in PL/pgSQL is typically simpler and faster than recursion.
create or replace function loop_function(input int[])
returns int[] language plpgsql as $$
declare
  val int;
  tot int = 0;
  res int[];
begin
  foreach val in array input loop
    if val = 0 then tot = 0;
    else tot := tot + val;
    end if;
    res := res || tot;
  end loop;
  return res;
end $$;
Test it in Db<>fiddle.
The OP wrote:
this input array is coming from array_agg() invoked on a table column and the output will go back into a table using unnest().
You can calculate these cumulative sums directly in the table with the help of window functions.
select id, val, sum(val) over w
from (
  select
    id,
    val,
    case val
      when 0 then 0
      else sum((val = 0)::int) over w
    end as series
  from my_table
  window w as (order by id)
) t
window w as (partition by series order by id)
order by id
Test it in Db<>fiddle.
I am trying to apply a loop with a condition to sum up the respective rows (one field). The WHERE condition should be correct, but when the program runs it ignores the condition and sums up all rows. Any suggestion on how to fix this problem?
SELECT * FROM LIPS INTO CORRESPONDING FIELDS OF TABLE LT_LIPS
  WHERE VGBEL = LT_BCODE_I-VGBEL   "getDN number
    AND VGPOS = LT_BCODE_I-VGPOS.  " get vgpos = 01/02/03

LOOP AT LT_BCODE_I INTO LT_BCODE_I WHERE VGBEL = LT_LIPS-VGBEL AND VGPOS = LT_LIPS-VGPOS.
  SUM.
  LT_BCODE_I-MENGE = LT_BCODE_I-MENGE.
ENDLOOP.
Although you are asking about LOOP, I think the issue is more about how you use SUM.
The statement SUM can only be specified within a LOOP and is only respected within an AT-ENDAT control structure.
Here is an excerpt from the ABAP documentation, for "Calculation of a sum with SUM at AT LAST. All lines of the internal table are evaluated":
DATA:
  BEGIN OF wa,
    col TYPE i,
  END OF wa,
  itab LIKE TABLE OF wa WITH EMPTY KEY.

itab = VALUE #( FOR i = 1 UNTIL i > 10 ( col = i ) ).

LOOP AT itab INTO wa.
  AT LAST.
    SUM.
    cl_demo_output=>display( wa ).
  ENDAT.
ENDLOOP.
I am looking into a way of computing argmin or argmax aggregation of multiple rows in Snowflake similar to Hive or Presto.
In Hive, one can use a workaround with (named) structs because the aggregation function gets applied to the first element of the struct. Here is an example:
SELECT max(named_struct('y', y, 'x', x)).x FROM t
Now I am asking myself if there is a similar way to do this in Snowflake.
In Snowflake we have an OBJECT datatype with similar properties. Can I use the following code to compute argmin or argmax like in the Hive example? Are min/max aggregations for objects also performed on the first element of the object?
SELECT max(object_construct('y', y, 'x', x)).x FROM t
Running the above code returns an error: SQL compilation error: Function MAX does not support OBJECT argument type. It actually doesn't support any complex type.
If I have correctly understood the logic of argmin(), then you could implement it as a JavaScript UDF like so:
create or replace function argmin("a" object, "b" object)
returns object
language javascript
as
$$
  for (let [k, v] of Object.entries(a))
    if (v == b[k])
      continue
    else
      return v < b[k] ? a : b
  return b
$$;
And applied like this:
with t as (
  select
    object_construct('x',1, 'y',2) a,
    object_construct('x',2, 'y',1) b
)
select argmin(t.a, t.b):y from t;
Actually this functionality is built in:
select
  object_construct('x',1, 'y',2) a,
  object_construct('x',2, 'y',1) b,
  iff(a < b, a, b):y
;
Even more concisely:
select
  object_construct('x',1, 'y',2) a,
  object_construct('x',2, 'y',1) b,
  least(a,b):y,
  greatest(a,b):y;
I have a cell array called BodyData in MATLAB that has around 139 columns and 3,500-odd rows of skeletal tracking data.
I need to extract all rows between two string values (timestamps marking when an event happened) that I already have,
e.g.
BodyData{} =
Column:  1                2         3
         '10:15:15.332'   'BASE05'  ...
         ...
         '10:17:33.230'   'BASE05'  ...
The two timestamps should match a value in the array but might also be within a few ms of those in the array e.g.
TimeStamp1 = '10:15:15.560'
TimeStamp2 = '10:17:33.233'
I have several questions!
How can I return an array of all the data between the two string values, plus or minus a small threshold of, say, 100 ms?
Can I also add another condition, so that all string values in column 2 must be the same and the row is otherwise ignored? For example, only return the timestamps between A and B if column 2 is 'BASE02'.
Many thanks,
The best approach to the first part of your problem is probably to change from strings to numeric date values. In MATLAB this can be done quite painlessly with datenum.
For the second part you can use logical indexing: this is where you put a condition (i.e. that the second column is 'BASE02') inside the indexing expression.
A self-contained example:
% some example data:
BodyData = {'10:15:15.332', 'BASE05', 'foo';...
'10:15:16.332', 'BASE02', 'bar';...
'10:15:17.332', 'BASE05', 'foo';...
'10:15:18.332', 'BASE02', 'foo';...
'10:15:19.332', 'BASE05', 'bar'};
% create column vector of numeric times, and define start/end times
dateValues = datenum(BodyData(:, 1), 'HH:MM:SS.FFF');
startTime = datenum('10:15:16.100', 'HH:MM:SS.FFF');
endTime = datenum('10:15:18.500', 'HH:MM:SS.FFF');
% select data in range, and where second column is 'BASE02'
BodyData(dateValues > startTime & dateValues < endTime & strcmp(BodyData(:, 2), 'BASE02'), :)
Returns:
ans =
'10:15:16.332' 'BASE02' 'bar'
'10:15:18.332' 'BASE02' 'foo'
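If you also want the ±100 ms tolerance from the question, one option (a sketch, reusing the question's timestamps) is to widen the window before indexing; datenum values are in days, so the threshold has to be converted:
% 100 ms expressed in datenum units (days)
threshold = 0.100 / (24*60*60);
startTime = datenum('10:15:15.560', 'HH:MM:SS.FFF') - threshold;
endTime   = datenum('10:17:33.233', 'HH:MM:SS.FFF') + threshold;
% same selection as above, just with the widened window
BodyData(dateValues >= startTime & dateValues <= endTime & strcmp(BodyData(:, 2), 'BASE02'), :)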
References: datenum manual page, matlab help page on logical indexing.
My action passes a list of values from a column x in table y to the view. How do I write the following SQL, SELECT x FROM y, using the DAL "language", when x and y are variables given by the view? Here is what I have, using executesql().
def myAction():
    x = request.args(0, cast=str)
    y = request.args(1, cast=str)
    myrows = db.executesql('SELECT ' + x + ' FROM ' + y)
    # Let's convert it to a list:
    mylist = []
    for row in myrows:
        value = row  # this line doesn't work
        mylist.append(value)
    return dict(mylist=mylist)
Also, is there a more convenient way to convert that data to a list?
First, note that you must create table definitions for any tables you want to access (i.e., db.define_table('mytable', ...)). Assuming you have done that and that y is the name of a single table and x is the name of a single field in that table, you would do:
myrows = db().select(db[y][x])
mylist = [r[x] for r in myrows]
Note that if any records are returned, .select() always produces a Rows object, which comprises a set of Row objects (even if only a single field was selected). So, to extract the individual values into a list, you have to iterate over the Rows object and extract the relevant field from each Row object. The above code does so via a list comprehension.
Also, you might want to add some code to check whether db[y] and db[y][x] exist.
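For example, a minimal sketch of such a guard, assuming the usual web2py globals (db, HTTP) are available in the controller:
if y not in db.tables or x not in db[y].fields:
    raise HTTP(404, "Unknown table or field")
myrows = db().select(db[y][x])
mylist = [r[x] for r in myrows]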