How do I do Multiple Window Aggregations in Apache Flink

We have millions of drivers earning tips. A driver can choose to be notified about tips earned hourly, every 6 hours, every 12 hours, or daily. Essentially, there are 4 window sizes for aggregating driver tips (1, 6, 12 and 24 hours), and each driver falls into exactly one of them.
Data Model
{drive_id, driver_id, aggregation_window_preference, tip_earned}
Sample Input:
Time: 00:00 - {drive_id: 1, driver_id: d1, aggregation_window_preference: 6, tip_earned: 1.00}
Time: 00:15 - {drive_id: 2, driver_id: d2, aggregation_window_preference: 12, tip_earned: 3.00}
Time: 07:00 - {drive_id: 3, driver_id: d1, aggregation_window_preference: 6, tip_earned: 2.00}
Time: 07:15 - {drive_id: 4, driver_id: d3, aggregation_window_preference: 24, tip_earned: 4.00}
Time: 08:00 - {drive_id: 5, driver_id: d3, aggregation_window_preference: 24, tip_earned: 1.00}
Time: 08:30 - {drive_id: 6, driver_id: d2, aggregation_window_preference: 12, tip_earned: 2.00}
Time: 09:00 - {drive_id: 7, driver_id: d2, aggregation_window_preference: 12, tip_earned: 3.00}
Time: 10:30 - {drive_id: 8, driver_id: d1, aggregation_window_preference: 6, tip_earned: 1.00}
Sample Output
Time: 06:00 - {driver_id: d1, tips_earned: 1.00}
Time: 12:00 - {driver_id: d1, tips_earned: 3.00}
Time: 12:00 - {driver_id: d2, tips_earned: 8.00}
Time: 24:00 - {driver_id: d3, tips_earned: 5.00}
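The intended semantics of the sample can be sketched outside Flink. The following is a plain-Python illustration, not Flink code: it assumes each event is assigned to a tumbling window whose size is that event's own aggregation_window_preference, with windows aligned to midnight. The names EVENTS and aggregate_tips are hypothetical.

```python
from collections import defaultdict

# (minutes since midnight, driver_id, aggregation_window_preference in hours, tip_earned)
EVENTS = [
    (0, "d1", 6, 1.00),
    (15, "d2", 12, 3.00),
    (7 * 60, "d1", 6, 2.00),
    (7 * 60 + 15, "d3", 24, 4.00),
    (8 * 60, "d3", 24, 1.00),
    (8 * 60 + 30, "d2", 12, 2.00),
    (9 * 60, "d2", 12, 3.00),
    (10 * 60 + 30, "d1", 6, 1.00),
]


def aggregate_tips(events):
    """Sum tips per (window end, driver), using each event's own window size."""
    totals = defaultdict(float)
    for minute, driver_id, window_hours, tip in events:
        window_minutes = window_hours * 60
        # End of the midnight-aligned tumbling window that contains this event.
        window_end = (minute // window_minutes + 1) * window_minutes
        totals[(window_end, driver_id)] += tip
    return dict(sorted(totals.items()))


RESULT = aggregate_tips(EVENTS)
for (window_end, driver_id), tips in RESULT.items():
    print(f"Time: {window_end // 60:02d}:00 - {{driver_id: {driver_id}, tips_earned: {tips:.2f}}}")
```

Because the window size comes from the event itself, a driver with a 24-hour preference only ever appears in a 24-hour bucket.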
I realize that the problem I am trying to solve is very similar to this Stack Overflow post.
What makes this question a bit different:
The answer there aggregates tips over multiple windows, but here drivers fall into disjoint time windows. For example, driver d3 needs to be notified only every 24 hours. How do we ensure that driver d3 is NOT notified hourly, every 6 hours, or every 12 hours?
Thanks!
My question is an extension to this question https://stackoverflow.com/a/71833148/21061133. I tried searching stackoverflow but did not find an answer that solves my problem.


Google Sheets: Dynamic array formula that prints a date range into an array for each 'ID' entry [duplicate]

Hoping to get a single dynamic formula in Google Sheets that gives me the following:
Scenario 1
Input:
| ID | Date |
| 1 | Jan 1, 2023 |
| 2 | Jan 2, 2023 |
| 3 | Jan 3, 2023 |
| | Jan 4, 2023 |
| | Jan 5, 2023 |
Desired output:
ID Date
----------------
1 Jan 1, 2023
1 Jan 2, 2023
1 Jan 3, 2023
1 Jan 4, 2023
1 Jan 5, 2023
2 Jan 1, 2023
2 Jan 2, 2023
2 Jan 3, 2023
2 Jan 4, 2023
2 Jan 5, 2023
3 Jan 1, 2023
3 Jan 2, 2023
3 Jan 3, 2023
3 Jan 4, 2023
3 Jan 5, 2023
Scenario 2 (preferred)
Input:
ID Start Date End Date
----------------------------
1 Jan 1, 2023 Jan 10, 2023
2 Jan 5, 2023 Jan 10, 2023
3 Jan 8, 2023 Jan 10, 2023
Desired output:
ID Date
---------------
1 Jan 1, 2023
1 Jan 2, 2023
1 Jan 3, 2023
1 Jan 4, 2023
1 Jan 5, 2023
1 Jan 6, 2023
1 Jan 7, 2023
1 Jan 8, 2023
1 Jan 9, 2023
1 Jan 10, 2023
2 Jan 5, 2023
2 Jan 6, 2023
2 Jan 7, 2023
2 Jan 8, 2023
2 Jan 9, 2023
2 Jan 10, 2023
3 Jan 8, 2023
3 Jan 9, 2023
3 Jan 10, 2023
You may see this sample document for reference.
I was able to find a solution for the first scenario, but it doesn't seem efficient.
Given these inputs:
Date / ID 1 2 3
Jan 1, 2023
Jan 2, 2023
Jan 3, 2023
Jan 4, 2023
Jan 5, 2023
And with this formula,
=SORT(ARRAYFORMULA(split(transpose(split(TEXTJOIN("^",true,(B1:D1&"!"&A2:A6)),"^")),"!")),1, true, 2, true)
I was able to get this output:
Date / ID 1 2 3
---------------------------
Jan 1, 2023 1 Jan 1, 2023
Jan 2, 2023 1 Jan 2, 2023
Jan 3, 2023 1 Jan 3, 2023
Jan 4, 2023 1 Jan 4, 2023
Jan 5, 2023 1 Jan 5, 2023
2 Jan 1, 2023
2 Jan 2, 2023
2 Jan 3, 2023
2 Jan 4, 2023
2 Jan 5, 2023
3 Jan 1, 2023
3 Jan 2, 2023
3 Jan 3, 2023
3 Jan 4, 2023
3 Jan 5, 2023
This works but, as I mentioned above, a solution for scenario 2 is preferred.
Here's a possible solution using REDUCE:
=ArrayFormula(
REDUCE({"ID","Date"},
FILTER(ROW(A4:C)-3,
BYROW(A4:C,LAMBDA(row,COUNTBLANK(row)=0))),
LAMBDA(acc,cur,
{acc;
SPLIT(INDEX(A4:C,cur,1)&"❄️"&
SEQUENCE(
INDEX(A4:C,cur,3)-
INDEX(A4:C,cur,2)+1,1,
INDEX(A4:C,cur,2)),
"❄️")})))
Another option combining two REDUCE with SEQUENCE and INDEX:
The first REDUCE iterates over the number of items in A2:A (SEQUENCE(COUNTA(A2:A))). With that value v you can index into columns A, B and C, so each INDEX(A2:A,v) or similar returns that column's value for row v.
The second REDUCE iterates over the number of days between columns B and C, adding one day at a time. The two nested REDUCE calls look like this:
=REDUCE({"ID","Date"},SEQUENCE(COUNTA(A2:A)),
LAMBDA(a,v,{a;
REDUCE({INDEX(A2:A,v),INDEX(B2:B,v)}, SEQUENCE (INDEX(C2:C,v)-INDEX(B2:B,v)),
LAMBDA(b,w,{b;INDEX(A2:A,v),INDEX(B2:B,v)+w}))}))
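
For readers who find the nested REDUCE hard to follow, the same expansion can be written as two plain loops. This is a Python sketch of the logic only, using the Scenario 2 sample data; RANGES and expand_ranges are hypothetical names, not part of either formula.

```python
from datetime import date, timedelta

# Scenario 2 input: (ID, start date, end date)
RANGES = [
    (1, date(2023, 1, 1), date(2023, 1, 10)),
    (2, date(2023, 1, 5), date(2023, 1, 10)),
    (3, date(2023, 1, 8), date(2023, 1, 10)),
]


def expand_ranges(ranges):
    """Return one (ID, date) row for every day in each ID's range, inclusive."""
    rows = []
    for row_id, start, end in ranges:          # outer loop: one pass per input row
        days = (end - start).days + 1          # inclusive day count
        for offset in range(days):             # inner loop: add one day at a time
            rows.append((row_id, start + timedelta(days=offset)))
    return rows


ROWS = expand_ranges(RANGES)
for row_id, day in ROWS:
    print(row_id, f"{day:%b} {day.day}, {day.year}")
```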

How to take the price from this Json object?

I tried to get the price value from this JSON object but had no luck.
Can someone help me? Thanks.
{
"Id":10069,
"UrlHash":"3963aa68aac23b61ffc1275ad6e0f43d",
"BrandId":1,
"Name":"Nokia 8.3 5G",
"Picture":"https://fdn2.gsmarena.com/vv/bigpic/nokia-83-5g.jpg",
"ReleasedAt":"Released 2020, September 15",
"Body":"220g, 9mm thickness",
"Os":"Android 10, up to Android 11, Android One",
"Storage":"64GB/128GB storage, microSDXC",
"DisplaySize":"6.81\\\"",
"DisplayResolution":"1080x2400 pixels",
"CameraPixels":"64 MP ",
"VideoPixels":"2160p",
"Ram":"6/8 GB RAM ",
"Chipset":"Snapdragon 765G 5G",
"BatterySize":"4500 mAh ",
"BatteryType":"Li-Po",
"Specifications":"{\\\"Technology\\\":\\\"GSM \\\\/ HSPA \\\\/ LTE \\\\/ 5G\\\",\\\"2G bands\\\":\\\"GSM 850 \\\\/ 900 \\\\/ 1800 \\\\/ 1900 - SIM 1 & SIM 2\\\",\\\"3G bands\\\":\\\"HSDPA 850 \\\\/ 900 \\\\/ 1700(AWS) \\\\/ 1900 \\\\/ 2100 \\\",\\\"4G bands\\\":\\\"1, 2, 3, 4, 5, 7, 8, 12, 13, 17, 20, 28, 32, 38, 39, 40, 41, 66, 71\\\",\\\"5G bands\\\":\\\"1, 2, 3, 5, 7, 8, 28, 38, 40, 41, 66, 71, 78 SA\\\\/NSA\\\",\\\"Speed\\\":\\\"HSPA 42.2\\\\/5.76 Mbps, LTE-A (4CA) Cat18 1200\\\\/150 Mbps, 5G 2.4\\\\/1.2 Gbps\\\",\\\"Announced\\\":\\\"2020, March 19\\\",\\\"Status\\\":\\\"Available. Released 2020, September 15\\\",\\\"Dimensions\\\":\\\"171.9 x 78.6 x 9 mm (6.77 x 3.09 x 0.35 in)\\\",\\\"Weight\\\":\\\"220 g (7.76 oz)\\\",\\\"SIM\\\":\\\"Single SIM (Nano-SIM) or Hybrid Dual SIM (Nano-SIM, dual stand-by)\\\",\\\"Type\\\":\\\"Li-Po 4500 mAh, non-removable\\\",\\\"Size\\\":\\\"6.81 inches, 112.0 cm2 (~82.9% screen-to-body ratio)\\\",\\\"Resolution\\\":\\\"1080 x 2400 pixels, 20:9 ratio (~386 ppi density)\\\",\\\"OS\\\":\\\"Android 10, upgradable to Android 11, Android One\\\",\\\"Chipset\\\":\\\"Qualcomm SM7250 Snapdragon 765G 5G (7 nm)\\\",\\\"CPU\\\":\\\"Octa-core (1x2.4 GHz Kryo 475 Prime & 1x2.2 GHz Kryo 475 Gold & 6x1.8 GHz Kryo 475 Silver)\\\",\\\"GPU\\\":\\\"Adreno 620\\\",\\\"Card slot\\\":\\\"microSDXC (uses shared SIM slot)\\\",\\\"Internal\\\":\\\"64GB 6GB RAM, 64GB 8GB RAM, 128GB 8GB RAM\\\",\\\"Quad\\\":\\\"64 MP, f\\\\/1.9, (wide), 1\\\\/1.72\\\\\\\", 0.8\\\\u00b5m, PDAF\\\\r\\\\n 12 MP, f\\\\/2.2, 120\\\\u02da (ultrawide), 1\\\\/2.43\\\\\\\", 1.4\\\\u00b5m, AF\\\\r\\\\n 2 MP, (macro)\\\\r\\\\n 2 MP, (depth)\\\",\\\"Features\\\":\\\"Zeiss optics, HDR\\\",\\\"Video\\\":\\\"1080p#30fps\\\",\\\"Single\\\":\\\"24 MP, f\\\\/2.0, (wide), 1\\\\/2.8\\\\\\\", 0.9\\\\u00b5m\\\",\\\"Loudspeaker \\\":\\\"Yes\\\",\\\"3.5mm jack \\\":\\\"Yes\\\",\\\"WLAN\\\":\\\"Wi-Fi 802.11 a\\\\/b\\\\/g\\\\/n\\\\/ac, dual-band, Wi-Fi Direct, 
hotspot\\\",\\\"Bluetooth\\\":\\\"5.0, A2DP, EDR, LE\\\",\\\"GPS\\\":\\\"Yes, with A-GPS, GLONASS, BDS\\\",\\\"NFC\\\":\\\"Yes\\\",\\\"Radio\\\":\\\"FM radio\\\",\\\"USB\\\":\\\"USB Type-C 2.0, USB On-The-Go\\\",\\\"Sensors\\\":\\\"Fingerprint (side-mounted), accelerometer, gyro, proximity, compass\\\",\\\"Charging\\\":\\\"Fast charging 18W\\\",\\\"Colors\\\":\\\"Polar Night\\\",\\\"Models\\\":\\\"TA-1243, TA-1251\\\",\\\"SAR\\\":\\\"0.96 W\\\\/kg (head) 1.41 W\\\\/kg (body) \\\",\\\"SAR EU\\\":\\\"0.96 W\\\\/kg (head) 1.41 W\\\\/kg (body) \\\",\\\"Price\\\":\\\"$ 433.90 \\\\/ € 574.35 \\\\/ £ 349.00\\\"}",
"DeletedAt":null,
"CreatedAt":"2021-09-10T00:45:32",
}
The Price property is inside Specifications:
\"Price\":\"$ 433.90 \\/ € 574.35 \\/ £ 349.00\"
The property Specifications has been corrupted. Someone has tried to escape the character for inches ", found in the Quad and Single properties, but they have escaped every single " in the JSON string! This needs to be fixed at the source.
Here's the Quad property found in Specifications
\\\"Quad\\\":\\\"64 MP, f\\\\/1.9, (wide), 1\\\\/1.72\\\\\\\", 0.8\\\\u00b5m, PDAF\\\\r\\\\n 12 MP, f\\\\/2.2, 120\\\\u02da (ultrawide), 1\\\\/2.43\\\\\\\", 1.4\\\\u00b5m, AF\\\\r\\\\n 2 MP, (macro)\\\\r\\\\n 2 MP, (depth)\\\"
and here's what it should be:
"Quad":"64 MP, f/1.9, (wide), 1/1.72\\\", 0.8u00b5m, PDAFrn 12 MP, f/2.2, 120u02da (ultrawide), 1/2.43\\\", 1.4u00b5m, AFrn 2 MP, (macro)rn 2 MP, (depth)"
Also, the Specifications value should not be surrounded in quotations.
"Specifications": {"Technology": "GSM / HSPA / LTE / 5G", ... },
I get the feeling someone is manually creating JSON strings instead of using a library. This is a great example of why you should never do that.
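You just need to parse the JSON. Note that Specifications is itself a JSON string, so it has to be parsed a second time:
try {
  // Parse the outer object first.
  const myObject: any = JSON.parse(YOUR_JSON_STRING_HERE);
  // Specifications arrives as a string, so decode it as well.
  const specs = typeof myObject.Specifications === "string"
    ? JSON.parse(myObject.Specifications)
    : myObject.Specifications;
  const price = specs.Price;
  console.log("Price:", price);
} catch (error) {
  console.log(error);
}
If it's an array, apply the same steps to the element you need:
const specs = JSON.parse(myObjects[0].Specifications);
const price = specs.Price;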
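
In the response shown in the question, Specifications is a string that contains JSON, so one way to reach Price is to decode twice. The sketch below shows that idea in Python on a shortened stand-in payload; SPECS, RAW and extract_price are hypothetical names, and the real response carries many more fields.

```python
import json

# Shortened stand-in for the response in the question: "Specifications" is
# itself a JSON document stored as a string, so it must be decoded again.
SPECS = {"Announced": "2020, March 19", "Price": "$ 433.90 / € 574.35 / £ 349.00"}
RAW = json.dumps({"Id": 10069, "Name": "Nokia 8.3 5G", "Specifications": json.dumps(SPECS)})


def extract_price(raw):
    """Decode the outer object, then the nested Specifications string if needed."""
    outer = json.loads(raw)
    specs = outer["Specifications"]
    if isinstance(specs, str):
        # Double-encoded: the value is JSON text, not an object.
        specs = json.loads(specs)
    return specs["Price"]


PRICE = extract_price(RAW)
print("Price:", PRICE)
```

The isinstance check keeps the function working if the source is later fixed to send Specifications as a real object.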

How to calculate years for a column of "Jan 1" formatted dates?

I have several columns of data, imported via ImportXml from a website, in which all dates are formatted like this:
| A | B | Year for A | Year for B |
| Jan 22 | Feb 3 | =Sequence(366-21, 1, 2020, 0) | =Sequence(366-31-2, 1, 2020, 0) |
| Jan 23 | Feb 4 | | |
| ...one row per day in 2020... | | | |
| Dec 31 | ... | | |
| Jan 1 | ... | =Sequence(365, 1, 2021, 0) | ... |
| ... | | | |
I need to convert these to real dates, so I need a year for each row (starting with 2020). I currently fill in the year columns via Sequence, which is a hassle because I have to write a custom formula for each column based on its start date, and I have to repeat the sequence for each new year.
Does somebody have a better idea?
Text conversion to date is done like this:
=ARRAYFORMULA(IF(A2:B="",,A2:B*1))
UPDATE:
=INDEX(IF(A2:A="",, DATE(2020+IFNA(VLOOKUP(ROW(A2:A), IF(
IFERROR(1/(1/COUNTIFS(A2:A, A2:A, A2:A, "=jan 1",
ROW(A2:A), "<="&ROW(A2:A))))<>"", {ROW(A2:A),
IFERROR(1/(1/COUNTIFS(A2:A, A2:A, A2:A, "=jan 1",
ROW(A2:A), "<="&ROW(A2:A))))}), 2, 1), 0), MONTH(A2:A*1), DAY(A2:A*1))))
=INDEX(IF(B2:B="",, DATE(2020+IFNA(VLOOKUP(ROW(B2:B), IF(
IFERROR(1/(1/COUNTIFS(A2:A, A2:A, A2:A, "=dec 31",
ROW(B2:B), "<="&ROW(B2:B))))<>"", {ROW(B2:B),
IFERROR(1/(1/COUNTIFS(A2:A, A2:A, A2:A, "=dec 31",
ROW(B2:B), "<="&ROW(B2:B))))}), 2, 1), 0), MONTH(B2:B*1), DAY(B2:B*1))))
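
The underlying idea is that the year advances whenever the calendar wraps around. As a rough illustration outside Sheets, here is a Python sketch that bumps the year each time a label is earlier in the calendar than the one before it. It assumes the labels are in chronological order with no gap of a year or more and that English month abbreviations parse in the current locale; add_years is a hypothetical helper name.

```python
from datetime import date, datetime


def add_years(labels, start_year=2020):
    """Attach a year to chronologically ordered 'Jan 1' style labels."""
    year = start_year
    previous = None
    result = []
    for label in labels:
        # Parse against a leap year so that "Feb 29" is accepted.
        parsed = datetime.strptime(f"{label} 2020", "%b %d %Y")
        month_day = (parsed.month, parsed.day)
        # A jump backwards in the calendar means a new year has started.
        if previous is not None and month_day < previous:
            year += 1
        previous = month_day
        result.append(date(year, parsed.month, parsed.day))
    return result


DATES = add_years(["Jan 22", "Jan 23", "Dec 31", "Jan 1"])
for d in DATES:
    print(d.isoformat())
```

Unlike a per-column Sequence, this does not need to know the start date of each column in advance.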

Run duration sysjobs - sysjobshistory

I have a question regarding the run_duration column of sysjobhistory. The official docs seem to contradict themselves:
https://learn.microsoft.com/en-us/sql/relational-databases/system-tables/dbo-sysjobhistory-transact-sql?view=sql-server-2017
First it specifies:
run_duration int Elapsed time in the execution of the job or step in HHMMSS format.
But then they also mention a query for a friendlier time format:
SELECT sj.name,
sh.run_date,
sh.step_name,
STUFF(STUFF(RIGHT(REPLICATE('0', 6) + CAST(sh.run_time as varchar(6)), 6), 3, 0, ':'), 6, 0, ':') 'run_time',
STUFF(STUFF(STUFF(RIGHT(REPLICATE('0', 8) + CAST(sh.run_duration as varchar(8)), 8), 3, 0, ':'), 6, 0, ':'), 9, 0, ':') 'run_duration (DD:HH:MM:SS) '
FROM msdb.dbo.sysjobs sj
JOIN msdb.dbo.sysjobhistory sh
ON sj.job_id = sh.job_id
Notice the difference, days are 'suddenly' brought into the picture.
In my real life example, I came across a job that ran really long. The results are as follows:
run_duration (DD:HH:MM:SS) run_duration
01:49:39:39 1493939
So how do I read this? Is this actually 149 hours, 39 minutes and 39 secs?
One day and 49 makes no sense.
Thanks a lot for the feedback!
When a duration is written in a time notation, the largest unit is not capped: you only wrap hours at 24 if there is a days unit above them. In the HHMMSS format there is no days unit, so HH can be any value from 0 upwards. It is the same as counting in months: if two dates are 16 months apart, you would not stop at 12 just because a year boundary was crossed.
As you noted, 1 day and 49 hours makes no sense. 1493939 should be read as 149 hours, 39 minutes and 39 seconds.
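
The arithmetic behind that reading can be checked with a few integer operations. The following Python sketch treats the last two digits as seconds, the two before them as minutes, and everything to the left as hours; parse_run_duration is a hypothetical helper name, not something provided by SQL Server.

```python
from datetime import timedelta


def parse_run_duration(run_duration):
    """Decode an HHMMSS-style integer where HH is not capped at 24."""
    hours = run_duration // 10000          # everything left of MMSS
    minutes = (run_duration // 100) % 100  # the MM pair
    seconds = run_duration % 100           # the SS pair
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


DURATION = parse_run_duration(1493939)
print(DURATION)  # 6 days, 5:39:39
```

149 hours is 6 days and 5 hours, which is why the DD:HH:MM:SS formatting in the documentation query gives a misleading result for long runs.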

Adjusting dates to avoid overlap of days

If I have three dates, e.g. Jan 1, Jan 25, and Feb 20, but I want the dates to be separated by 30 days, how can I do it?
For example, what I want to end up with is Jan 1, Jan 30, Feb 29.
I am very new to R, but the logic should be something like this: if the 2nd date is before (1st date + 30), then adjust the 2nd date to (1st + 31), and similarly for the 3rd date.
Any help will be much appreciated!
Since you want a fixed distance between each adjacent pair of dates, you don't need to "adjust" any dates; rather, you can just compute the desired date vector from scratch, starting with the first date.
This can actually be done with a single call to the S3 generic seq(), which will dispatch to seq.Date():
seq(as.Date('2000-01-01'),by=30,length.out=3);
## [1] "2000-01-01" "2000-01-31" "2000-03-01"
Also note that you seem to have made an error in deriving your expected dates; 30 days from Jan 1 is Jan 31, not Jan 30.
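# Alternative: adjust the second date only if it falls within 30 days of the first.
# as.Date() fills in the current year when the format has none (2015 in the output below).
d1 = as.Date("01-01",format="%m-%d")
d2 = as.Date("01-25",format="%m-%d")
if (abs(as.numeric(difftime(d2,d1,units="days")))<30) d2 = d1 + 30
>d2
[1] "2015-01-31"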
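
The two approaches above, building the dates from scratch or pushing each date forward only when it is too close, can be compared side by side. This is a Python sketch rather than R, with the year 2000 assumed as in the seq() example; spaced_dates and push_apart are hypothetical names.

```python
from datetime import date, timedelta


def spaced_dates(first, count, gap_days=30):
    """Build `count` dates starting at `first`, each `gap_days` after the last."""
    return [first + timedelta(days=gap_days * i) for i in range(count)]


def push_apart(dates, gap_days=30):
    """Keep each date unless it is less than `gap_days` after the previous kept date."""
    adjusted = [dates[0]]
    for current in dates[1:]:
        earliest = adjusted[-1] + timedelta(days=gap_days)
        adjusted.append(max(current, earliest))
    return adjusted


FIXED = spaced_dates(date(2000, 1, 1), 3)
PUSHED = push_apart([date(2000, 1, 1), date(2000, 1, 25), date(2000, 2, 20)])
print(FIXED)
print(PUSHED)
```

For the three dates in the question both functions give the same result; they differ only when an input date is already more than 30 days after its predecessor, in which case push_apart leaves it alone.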
