I have a very large table (4+ billion rows) structured as follows :
user_id month nb_logins
-----------------------------
1 01 0
1 02 1
1 03 4
1 04 0
... ... ...
2 01 5
2 02 0
2 03 0
2 04 1
I would like to add a new column which is simply the cumulative sum of nb_logins partitioned by user_id and ordered by month.
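To make the target column concrete, here is a minimal Python sketch of the per-user running total on the sample rows above. The names `rows` and `cumulative_logins` are illustrative assumptions, not part of the real pipeline; in Snowflake itself the same result would normally come from a window function along the lines of SUM(nb_logins) OVER (PARTITION BY user_id ORDER BY month).

```python
from itertools import accumulate, groupby
from operator import itemgetter

# Sample rows from the table above: (user_id, month, nb_logins).
rows = [
    (1, "01", 0), (1, "02", 1), (1, "03", 4), (1, "04", 0),
    (2, "01", 5), (2, "02", 0), (2, "03", 0), (2, "04", 1),
]

def cumulative_logins(rows):
    """Return (user_id, month, nb_logins, cum_logins) for every row."""
    # Sort by user, then by month (zero-padded strings sort correctly).
    ordered = sorted(rows, key=itemgetter(0, 1))
    result = []
    for _, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        # The running total restarts for each user, so users are independent.
        totals = accumulate(r[2] for r in group)
        result.extend(r + (t,) for r, t in zip(group, totals))
    return result

result = cumulative_logins(rows)
```

Because the running total restarts for every user, each user_id range can in principle be computed without looking at any other range.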
I used to compute the whole thing in one query; however, I decided to parallelize this because each user is independent (i.e. I can compute the cumsum for user 1 and user 2 in parallel).
To parallelize, I've created a list of "partitions" based on user_id (it's an evenly distributed int). Let's say I have the following two partitions:
user_id between 0 and 10
user_id between 11 and 20
I then run two MERGE statements in parallel, one for each partition. However, most of them end up BLOCKED in Snowflake because they try to write to the same micro-partition.
Question: what would be the best way to parallelize this operation?
I am trying to perform an INDEX/MATCH lookup in Excel across two large datasets, where each lookup can return multiple unique results.
I have illustrated a simplified version of the data below. The two match conditions are A1 in table 2 = A:A in Table 1 and B2 in table 2 = B:B in table 1. This will result in multiple results and I want a formula I can drag from cells C3 across to D4 in table 2 to show the results of this index and match.
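Table 1
First Name   Second Name   Food allergy code
---------------------------------------------
Bob          Johnson       03
Bob          Johnson       04
Table 2
First Name   Second Name   Food allergy code 1   Food allergy code 2
----------------------------------------------------------------------
Bob          Johnson       03                    04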
I have used the formula below which returns the first match, but when I drag this from cell C2 to D2 it returns the same value. I'm not sure how to rewrite this formula so that it provides each unique Food allergy code given both match conditions are met.
=TRANSPOSE(INDEX(Table1!C:C,MATCH(1,(Table1!A:A=A2)*(Table1!B:B=B2),0)))
Any help would be appreciated.
You could use the following, but it is computationally rather inefficient. If you need to drag this formula over many cells, leave a comment so that a more efficient solution can be worked out.
=TRANSPOSE(FILTER(D:D,(B:B=G4)*(C:C=H4)))
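For readers who want to see the logic outside Excel, here is a minimal Python sketch of what the FILTER plus TRANSPOSE combination does: keep every code whose row matches both names, then lay the results out as one horizontal row. The names `table1` and `matching_codes` are hypothetical and only serve the illustration.

```python
# Table 1 rows from the question: (first name, second name, allergy code).
table1 = [
    ("Bob", "Johnson", "03"),
    ("Bob", "Johnson", "04"),
]

def matching_codes(table, first_name, second_name):
    """Collect every code whose row matches both names, in table order."""
    return [
        code
        for first, second, code in table
        if first == first_name and second == second_name
    ]

# One horizontal "row" of results, like the spilled TRANSPOSE(FILTER(...)) output.
codes_row = matching_codes(table1, "Bob", "Johnson")
```

Unlike MATCH, which stops at the first hit, the filter keeps all matching rows, which is why the second code shows up.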
I have a data source with data formatted like this:
ID   Visits   Charges   Date         Location
----------------------------------------------
33   21       375       2022-01-29   A
34   4285     4400      2022-01-29   B
35   12       2165      2022-01-29   C
36   31       4285      2022-01-30   A
37   40       5881      2022-01-31   A
38   29       4715      2022-01-31   B
39   8        1390      2022-01-31   C
I want to get the aggregated visits of all locations per day, and from there get the maximum daily value for the time period chosen by the user, shown on a ScoreCard and in a Table. At the moment, when I choose the max value of the Visits metric, it only gives me the max value of the column (4285), not of the data aggregated per day.
The value I am looking for, in the time period between 28-01 and 31-01, should be 4318 (the sum of all 3 locations for 29-01, which is the highest of the 3 days).
Thanks!
I would suggest using a Pivot Table, set up as follows:
Choose Date as the row dimension, then choose Visits as the metric (with aggregation set to SUM).
Remember to sort this table by Visits in descending order, so that the maximum value is on top. If you want to see only this maximum value, you can resize the pivot table so that only the first row stays visible.
This should work with additional controls too.
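As a cross-check of the expected number, the two steps (sum Visits per Date, then take the largest daily total) can be sketched in Python. The names `records` and `max_daily_visits` are assumptions for this illustration, and the date filter stands in for the report's date range control.

```python
from collections import defaultdict

# Rows from the question: (ID, Visits, Charges, Date, Location).
records = [
    (33, 21, 375, "2022-01-29", "A"),
    (34, 4285, 4400, "2022-01-29", "B"),
    (35, 12, 2165, "2022-01-29", "C"),
    (36, 31, 4285, "2022-01-30", "A"),
    (37, 40, 5881, "2022-01-31", "A"),
    (38, 29, 4715, "2022-01-31", "B"),
    (39, 8, 1390, "2022-01-31", "C"),
]

def max_daily_visits(records, start, end):
    """Sum visits per date within [start, end], then return the best day.

    Dates are ISO strings, so plain string comparison orders them correctly.
    Raises ValueError if no record falls inside the period.
    """
    daily = defaultdict(int)
    for _id, visits, _charges, date, _location in records:
        if start <= date <= end:
            daily[date] += visits
    return max(daily.items(), key=lambda item: item[1])

best_day, best_total = max_daily_visits(records, "2022-01-28", "2022-01-31")
```

The order of operations matters: taking the maximum before summing gives 4285 (a single row), while summing per day first gives the daily total the question asks for.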
I am building a report in Google Data Studio, and I run into a problem with the aggregation for a couple of metrics.
As an example, I have the following table:
name   M1   M2   M3
--------------------
A      23   45   1,9
B      45   6    0,1
C      23   45   1,9
D      12   34   2,8
E      4    2    0,5
Where
M3 = M2/M1
Now, when I display this in a table in GDS, the totals for M1 and M2 are the sum of the values, and that is ok, but I can only choose between fixed aggregation operations, and the total for M3 should be:
Total M3 = sum(M2)/sum(M1)
Any idea if this is possible?
You can create your own custom field:
https://support.google.com/datastudio/answer/6299685?hl=en
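The reason a custom field helps is that the total has to be a ratio of sums rather than a sum (or average) of the per-row ratios. A minimal Python sketch with the numbers from the question shows the difference; the variable names are assumptions, and in the report the custom field formula would presumably be something like SUM(M2)/SUM(M1).

```python
# Rows from the question: name -> (M1, M2).
rows = {
    "A": (23, 45),
    "B": (45, 6),
    "C": (23, 45),
    "D": (12, 34),
    "E": (4, 2),
}

total_m1 = sum(m1 for m1, _ in rows.values())
total_m2 = sum(m2 for _, m2 in rows.values())

# Desired total: aggregate first, then divide.
total_m3 = total_m2 / total_m1

# What a plain SUM over the per-row M3 column would give instead.
sum_of_ratios = sum(m2 / m1 for m1, m2 in rows.values())
```

Because the division happens after aggregation, the same expression yields the per-row M3 in each table row and sum(M2)/sum(M1) in the totals row.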
I have a query that returns results that describe a numeric range, with some of these data falling within the range of other data returned in the same query. How can I easily eliminate those?
I have the following data:
Code Start End
----- ------- -------
abc 1 1
abc 2 2
abc 3 8
abc 4 4
abc 5 5
xyz 1 1
xyz 2 5
xyz 3 3
In this case, where code is "abc", there are two rows: start=4,end=4 and start=5,end=5. But preceding them is a row where start=3,end=8. So both of those rows should not be returned in my result set.
I can do this with a temp table, cursor, etc. But I'd like to know if there's an elegant way to do this within the query.
I would do this with a WHERE NOT EXISTS() clause.
The EXISTS() function would be to check for another row where the Start is less than or equal to my Start and the End is greater than or equal to my End.
There are no exact duplicate rows in your sample data, but if it's possible for them to exist in your real data, you will have to consider what you want to do with those as well.
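A minimal sketch of that idea, run against an in-memory SQLite database from Python so it can be tried as is. The table name `ranges` and the column names `range_start` and `range_end` are assumptions (chosen to avoid the reserved word END), and the extra inequality is one possible way to stop a row from eliminating itself; it also means exact duplicates would all be kept.

```python
import sqlite3

# Sample data from the question: (code, start, end).
rows = [
    ("abc", 1, 1), ("abc", 2, 2), ("abc", 3, 8), ("abc", 4, 4), ("abc", 5, 5),
    ("xyz", 1, 1), ("xyz", 2, 5), ("xyz", 3, 3),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ranges (code TEXT, range_start INTEGER, range_end INTEGER)"
)
conn.executemany("INSERT INTO ranges VALUES (?, ?, ?)", rows)

query = """
SELECT r.code, r.range_start, r.range_end
FROM ranges AS r
WHERE NOT EXISTS (
    SELECT 1
    FROM ranges AS o
    WHERE o.code = r.code
      AND o.range_start <= r.range_start
      AND o.range_end >= r.range_end
      -- Skip the row itself (and exact duplicates of it).
      AND (o.range_start <> r.range_start OR o.range_end <> r.range_end)
)
ORDER BY r.code, r.range_start
"""

kept = conn.execute(query).fetchall()
conn.close()
```

On the sample data this drops abc 4-4 and abc 5-5 (inside 3-8) and xyz 3-3 (inside 2-5).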
I am attempting to optimise a query in my application that is causing problems when scaling my application.
The table contains two columns: FROM and TO which each contain values. Here is an example:
Row | From | To
1 | AA | Z
2 | B | C
3 | JA | JZ
4 | JM | JZ
The query is passed a name (JOHN) and should return a list of ranges from the table that could contain the name.
select * from Ranges where From <= 'JOHN' and To >= 'JOHN'
Using the table above this would result in rows 1, 3 and 4 being returned (JM sorts before JOHN, so row 4 matches as well).
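The range predicate can be sketched in Python to check which rows qualify for a given name. This assumes plain ordinal string comparison on upper-case values, which may differ from the database's collation; `ranges` and `candidate_rows` are hypothetical names. Under that assumption row 4 (JM to JZ) also contains JOHN, because 'JM' sorts before 'JOHN'.

```python
# Example table from the question: (row, from, to).
ranges = [
    (1, "AA", "Z"),
    (2, "B", "C"),
    (3, "JA", "JZ"),
    (4, "JM", "JZ"),
]

def candidate_rows(ranges, name):
    """Return the row numbers whose [from, to] range contains the name."""
    return [row for row, low, high in ranges if low <= name <= high]

john_rows = candidate_rows(ranges, "JOHN")
mark_rows = candidate_rows(ranges, "MARK")
```

Note that each half of the predicate on its own can match a large share of the table, which is one reason two names can have very different costs even when few rows satisfy both halves.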
The problem I am having is one of query consistency.
All indexes are in place but if I search for JOHN the query returns in 20 milliseconds, whereas MARK returns in 250 milliseconds.
Looking at query analyzer shows me that JOHN is actually searching for more rows than MARK but I'm struggling to understand how or why MARK takes so long.
If the time difference was 20 - 40 milliseconds, I could live with that but 250 is so large a difference that the overall performance of my application is terrible.
Does anybody have any idea how I could narrow down why I get such variance in my queries, or know a better way of storing and searching string ranges (which could contain letters and numbers)?
Many thanks in advance.
EDIT - One thing I forgot to mention is that the original table contains approximately 15 million rows (it's actually postcodes).