gSheets horizontal with ARRAYFORMULA(SPLIT(...)) - arrays

I'm creating a dashboard in gSheets to report key metrics for A/B tests on my company's billing page. The data auto-pulls from GA via gSheets' addon, and I've decided to dynamically populate the list of tests on the dashboard by using the UNIQUE(...) function. This, in turn, allows metrics matching each test (by name) to pull into the dashboard from the data dump.
Case: Because it's a dynamic list of tests that will change over time (there are currently 4, there may be 3 next time, or 7, who knows), I've decided to set up the names horizontally with the metrics vertical, like so:
New Users
                | Test A | Test B | Test C | Test D |
----------------|--------|--------|--------|--------|
Sessions        |
Conversions     |
Conversion Rate |
Return Users
                | Test A | Test B | Test C | Test D |
----------------|--------|--------|--------|--------|
Sessions        |
Conversions     |
Conversion Rate |
The example above is how I'd like it to look. In reality, the test names are extremely long due to identifying keywords, which help in GA but make it impossible to tell which is which in the spreadsheet dashboard. The test name is at the end.
Here's what I'm working with:
Optimizely_AB_Test_Alternate_Billing_(Prod)(Property) (AccountNum): devicesImage
I want to split names on " ", preferably with an arrayformula so it populates horizontally for however many columns have a test. Then I can hide the row with the full name (allowing it to still be used for reference), while displaying the shorter name so people can tell what the test is.
Problem: I'm stuck trying to make the arrayformula go horizontally.
Current approach: =ARRAYFORMULA(TRANSPOSE(SPLIT(A13:13," ")))
Current approach output:
                | coversImage | blank | blank | blank |
----------------|-------------|-------|-------|-------|
Sessions        |
Conversions     |
Conversion Rate |
Desired output:
                | coversImage | devicesImage | lifestyleImage | stringCentered |
----------------|-------------|--------------|----------------|----------------|
Sessions        |
Conversions     |
Conversion Rate |

I have to assume your test names are initially this:
Optimizely_AB_Test_Alternate_Billing_(Prod)(Property) (AccountNum): devicesImage
Optimizely_AB_Test_Alternate_Billing_(Prod)(Property) (AccountNum): coversImage
Optimizely_AB_Test_Alternate_Billing_(Prod)(Property) (AccountNum): lifestyleImage
Optimizely_AB_Test_Alternate_Billing_(Prod)(Property) (AccountNum): stringCentered
In that case, we can use a regular expression to extract everything after ": ":
=ArrayFormula(TRANSPOSE(REGEXEXTRACT(E26:E29, "(?:\: )(\w*)$")))
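To see what that pattern captures outside the spreadsheet, here is a minimal Python sketch. It assumes the four full test names listed above and uses Python's `re` module rather than the Sheets regex engine, so treat it as an illustration of the pattern only. The colon is left unescaped, which matches the same text as `\:`.

```python
import re

# The four full names assumed in the answer above.
full_names = [
    "Optimizely_AB_Test_Alternate_Billing_(Prod)(Property) (AccountNum): devicesImage",
    "Optimizely_AB_Test_Alternate_Billing_(Prod)(Property) (AccountNum): coversImage",
    "Optimizely_AB_Test_Alternate_Billing_(Prod)(Property) (AccountNum): lifestyleImage",
    "Optimizely_AB_Test_Alternate_Billing_(Prod)(Property) (AccountNum): stringCentered",
]

# Non-capturing group for ": ", then capture the word characters at the end.
pattern = re.compile(r"(?:: )(\w*)$")

short_names = [pattern.search(name).group(1) for name in full_names]
print(short_names)
```

Because `\w` only matches letters, digits, and underscores, a short name containing spaces or hyphens would be cut off or not matched at all; the pattern would need widening in that case.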

Related

Need help creating a simple form for reviewing a (very) large number of diagnosis codes

OK, been lurking here for a long time, but never asked a question before. Apologies for the long and complicated question. I have a very large Excel sheet with nearly 40,000 unique codes from the ICD-10 classification system, which classifies essentially all known diseases. This is a hierarchical classification system where codes are organized in 20-something chapters and become gradually more specific, with 3 or more positions. For example, the code A22 is anthrax, with a number of sub-codes: A22.0=Cutaneous anthrax, A22.1=Pulmonary anthrax, etc. However, for some diseases, there are no 4-digit codes under the 3-digit codes (e.g. C01, below) or only one 4-digit code that is meaningful for us to recognize (e.g. C00, below). For other diseases, we want full precision (e.g. G23).
Example table
| 3-digit code | Specific code | Description |
| -------- | -------- |-------- |
| C00 | C00.0 | External upper lip |
| C00 | C00.1 | External lower lip |
| C00 | C00.2 | External lip, unspecified |
| C00 | C00.3 | Upper lip, inner aspect |
| C01 | C01 | Malignant neoplasm of base of tongue |
| G23 | G23 | Other degenerative diseases of basal ganglia |
| G23 | G23.0 | Hallervorden-Spatz disease |
| G23 | G23.1 | Progressive supranuclear ophthalmoplegia [Steele-Richardson-Olszewski] |
| G23 | G23.2 | Multiple system atrophy, parkinsonian type [MSA-P] |
| G23 | G23.3 | Multiple system atrophy, cerebellar type [MSA-C] |
The issue at hand is that I'm conducting a large-scale research study based on a health register where diagnoses are coded using this system. Due to a policy of information minimization/data privacy, we need to select for which of these 40,000 codes we need full precision (i.e. the 4-digit level) and for which 3-digit codes are sufficient. This is a very tedious task and I need to make it as efficient as possible. My idea is to create a simple form that links to my large table (which has the exact format as above, only longer) and presents each 3-digit code one by one, with a simple checkbox or something that allows me to select or not select whether this group should have full precision. I'm envisioning something simple like this:
[Image: mock-up of the proposed form]
Sorry for the stupidly long prelude, but my question is much simpler: what would be a simple way to achieve this? I don't "know" any graphical programming languages, but have used SAS, R and statistical programming systems for about 20 years, so I really just need a push in the right direction. Could it, for example, be done using an Access form? Any help would be much appreciated!
Thanks,
Gustaf
So, I haven't really tried anything yet as I don't even know where to start.
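As one possible starting point, the review loop described above can be sketched without any GUI at all. The Python below is only an illustration under assumptions: the function names (`group_by_parent`, `review`, `ask_on_console`) are hypothetical, the rows are a small excerpt of the example table, and a console prompt stands in for the checkbox form.

```python
def group_by_parent(rows):
    """Group (parent_code, specific_code, description) rows by 3-digit code."""
    groups = {}
    for parent, specific, description in rows:
        groups.setdefault(parent, []).append((specific, description))
    return groups


def review(rows, decide):
    """Call `decide` once per 3-digit group; return {parent_code: keep_full_precision}."""
    return {
        parent: bool(decide(parent, children))
        for parent, children in group_by_parent(rows).items()
    }


def ask_on_console(parent, children):
    """Interactive stand-in for the form: show one group, ask yes or no."""
    print(f"\n{parent}")
    for specific, description in children:
        print(f"  {specific}  {description}")
    return input("Keep full precision? [y/N] ").strip().lower() == "y"


# Excerpt of the example table from the question.
rows = [
    ("C00", "C00.0", "External upper lip"),
    ("C00", "C00.1", "External lower lip"),
    ("C01", "C01", "Malignant neoplasm of base of tongue"),
    ("G23", "G23", "Other degenerative diseases of basal ganglia"),
    ("G23", "G23.0", "Hallervorden-Spatz disease"),
]

# Non-interactive demo: pretend the reviewer ticks only G23.
decisions = review(rows, lambda parent, children: parent == "G23")
print(decisions)
```

Passing `ask_on_console` instead of the lambda would turn this into an interactive session with one prompt per 3-digit code; the resulting dictionary could then be written back next to the original table.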

Efficient data retention policy other than time in timescaledb

I have a hypertable which looks like this:
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
-------------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
state | text | | | | extended | | |
device | text | | | | extended | | |
time | bigint | | not null | | plain | | |
Indexes:
"device_state_time" btree ("time")
Triggers:
ts_insert_blocker BEFORE INSERT ON "device_state" FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.insert_blocker()
Child tables: _timescaledb_internal._hyper_4_2_chunk
Access method: heap
I have 100k devices, each sending its state at different time intervals. For example, device1 sends its state every second, device2 every day, device3 every 5 days, etc. And I MUST keep at least the 10 latest states for each device. So, I can't really use the default data retention policy provided by Timescale.
Is there any way to achieve this efficiently other than manually selecting the latest 10 entries for each device and deleting the rest?
Thanks!
That sounds like a corner case because the chunks are time-based. Can you categorize these devices in advance?
Maybe you can insert data into different hypertables based on the insert timeframe if you still want to use the retention policies.
For example, Promscale uses one table for each metric, allowing users to redefine the retention policy for every metric.
It will depend on how you read the data later; maybe fragmenting it into several hypertables will make it harder.
Also, consider experimenting with the optional arguments for hypertable creation; maybe you can get something out of partitioning_func and time_partitioning_func.
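To make the requirement from the question concrete, here is a small Python sketch of the "keep the 10 newest rows per device" rule. It is an in-memory illustration, not a TimescaleDB feature: the function name `rows_to_delete` and the sample data are assumptions. In the database, the same rule would usually be written with a window function such as `ROW_NUMBER()` partitioned by device.

```python
from collections import defaultdict


def rows_to_delete(rows, keep=10):
    """Return the (device, time, state) rows that are NOT among the
    `keep` newest rows of their device."""
    by_device = defaultdict(list)
    for row in rows:
        by_device[row[0]].append(row)

    stale = []
    for device_rows in by_device.values():
        # Newest first, so everything past index `keep` can go.
        device_rows.sort(key=lambda r: r[1], reverse=True)
        stale.extend(device_rows[keep:])
    return stale


# Hypothetical sample: device1 reports every second, device2 once a day.
rows = [("device1", t, "on") for t in range(25)]
rows += [("device2", t * 86400, "off") for t in range(3)]

stale = rows_to_delete(rows)
print(len(stale), "rows could be deleted")
```

Note that a purely time-based retention policy would treat both devices the same, which is why device2, with only three rows, must be left untouched here.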

Combining fields in Google Data Studio

I have a CSV file of the form (unimportant columns hidden)
player,game1,game2,game3,game4,game5,game6,game7,game8
Example data:
Alice,0,-10,-30,-60,-30,-50,-10,30
Bob,10,20,30,40,50,60,70,80
Charlie,20,0,20,0,20,0,20,0
Derek,1,2,3,4,5,6,7,8
Emily,-40,-30,-20,-10,10,20,30,40
Francine,1,4,9,16,25,36,49,64
Gina,0,0,0,0,0,0,0,0
Hank,-50,50,-50,50,-50,50,-50,50
Irene,-20,-20,-20,50,50,-20,-20,-20
I am looking for a way to make a Data Studio view where I can see a chart of all the results of a certain player. How would I make a custom field that combines the data from game1 to game8 so I can make a chart of it?
| Name | Scores |
|----------|---------------------------------|
| Alice | [0,-10,-30,-60,-30,-50,-10,30] |
| Bob | [10,20,30,40,50,60,70,80] |
| Charlie | [20,0,20,0,20,0,20,0] |
| Derek | [1,2,3,4,5,6,7,8] |
| Emily | [-40,-30,-20,-10,10,20,30,40] |
| Francine | [1,4,9,16,25,36,49,64] |
| Gina | [0,0,0,0,0,0,0,0] |
| Hank | [-50,50,-50,50,-50,50,-50,50] |
| Irene | [-20,-20,-20,50,50,-20,-20,-20] |
The goal of the resulting chart would be something like this, where game1 is the first point and so on.
If this is not possible, how would I best represent the data so what I am looking for can work in Data Studio? I currently have it implemented in a Google Sheet, but the issue is there's no way to make views, so when someone selects a row it changes for everyone viewing it.
If you have two game files as data sources, I guess that you want to combine them by name, right?
You can do that with the data blending option. I think the option is Resource > manage blends.
Then you can create a blend data source merging it by the name.
You can add also both score fields, with different labels.
This is some documentation about it: https://support.google.com/datastudio/answer/9061420?hl=en
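On the second part of the question (how best to represent the data), one common layout for charting tools is "long" format: one row per player and game, with the game number as a dimension and the score as a metric. Whether this suits a particular Data Studio chart is an assumption to verify; the Python sketch below only shows the reshaping, using a hypothetical `wide_to_long` helper and the first three players from the example.

```python
import csv
import io

WIDE_CSV = """
player,game1,game2,game3,game4,game5,game6,game7,game8
Alice,0,-10,-30,-60,-30,-50,-10,30
Bob,10,20,30,40,50,60,70,80
Charlie,20,0,20,0,20,0,20,0
"""


def wide_to_long(text):
    """Turn one row per player into one row per (player, game)."""
    reader = csv.DictReader(io.StringIO(text.strip()))
    long_rows = []
    for record in reader:
        for column, value in record.items():
            if column.startswith("game"):
                long_rows.append({
                    "player": record["player"],
                    "game": int(column[len("game"):]),  # "game3" -> 3
                    "score": int(value),
                })
    return long_rows


long_rows = wide_to_long(WIDE_CSV)
for row in long_rows[:3]:
    print(row)
```

With the data in this shape, a filter on `player` selects one person's results, and `game` can serve as the x-axis.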

ad-hoc slowly-changing dimensions materialization from external table of timestamped csvs in a data lake

Question
main question
How can I ephemerally materialize a slowly changing dimension type 2 from a folder of daily extracts, where each csv is one full extract of a table from a source system?
rationale
We're designing ephemeral data warehouses as data marts for end users that can be spun up and burned down without consequence. This requires we have all data in a lake/blob/bucket.
We're ripping daily full extracts because:
we couldn't reliably extract just the changeset (for reasons out of our control), and
we'd like to maintain a data lake with the "rawest" possible data.
challenge question
Is there a solution that could give me the state as of a specific date and not just the "newest" state?
existential question
Am I thinking about this completely backwards and there's a much easier way to do this?
Possible Approaches
custom dbt materialization
There's an insert_by_period dbt materialization in the dbt-utils package that I think might be exactly what I'm looking for? But I'm confused, as what I need is essentially dbt snapshot, but:
run dbt snapshot for each file incrementally, all at once; and,
built directly off of an external table?
Delta Lake
I don't know much about Databricks's Delta Lake, but it seems like it should be possible with Delta Tables?
Fix the extraction job
Is our problem solved if we can make our extracts contain only what has changed since the previous extract?
Example
Suppose the following three files are in a folder of a data lake. (Gist with the 3 csvs and desired table outcome as csv).
I added the Extracted column in case parsing the timestamp from the filename is too tricky.
2020-09-14_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/14 |
| 2 | B | 3 - Propose | | 9/12 | 9/14 |
2020-09-15_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/15 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/15 |
| 3 | C | 1 - Lead | | 9/14 | 9/15 |
2020-09-16_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/16 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/16 |
| 3 | C | 2 - Qualify | | 9/15 | 9/16 |
End Result
Below is SCD-II for the three files as of 9/16. SCD-II as of 9/15 would be the same, except that OppId=3 has only one row, with valid_from=9/15 and valid_to=null.
| OppId | CustId | Stage | Won | LastModified | valid_from | valid_to |
|-------|--------|-------------|-----|--------------|------------|----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/14 | null |
| 2 | B | 3 - Propose | | 9/12 | 9/14 | 9/15 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/15 | null |
| 3 | C | 1 - Lead | | 9/14 | 9/15 | 9/16 |
| 3 | C | 2 - Qualify | | 9/15 | 9/16 | null |
Interesting concept, and of course it would take a longer conversation than is possible in this forum to fully understand your business, stakeholders, data, etc. I can see that it might work if you had a relatively small volume of data, your source systems rarely changed, your reporting requirements (and hence, datamarts) also rarely changed and you only needed to spin up these datamarts very infrequently.
My concerns would be:
If your source or target requirements change how are you going to handle this? You will need to spin up your datamart, do full regression testing on it, apply your changes and then test them. If you do this as/when the changes are known then it's a lot of effort for a Datamart that's not being used - especially if you need to do this multiple times between uses; if you do this when the datamart is needed then you're not meeting your objective of having the datamart available for "instant" use.
I'm not sure your statement "we have a DW as code that can be deleted, updated, and recreated without the complexity that goes along with traditional DW change management" is true. How are you going to test updates to your code without spinning up the datamart(s) and going through a standard test cycle with data - and then how is this different from traditional DW change management?
What happens if there is corrupt/unexpected data in your source systems? In a "normal" DW where you are loading data daily this would normally be noticed and fixed on the day. In your solution the dodgy data might have occurred days/weeks ago and, assuming it loaded into your datamart rather than erroring on load, you would need processes in place to spot it and then potentially have to unravel days of SCD records to fix the problem.
(Only relevant if you have a significant volume of data) Given the low cost of storage, I'm not sure I see the benefit of spinning up a datamart when needed as opposed to just holding the data so it's ready for use. Loading large volumes of data every time you spin up a datamart is going to be time-consuming and expensive. A possible hybrid approach might be to only run incremental loads when the datamart is needed rather than running them every day - so you have the data from when the datamart was last used ready to go at all times and you just add the records created/updated since the last load.
I don't know whether this is the best or not, but I've seen it done. When you build your initial SCD-II table, add a column that is a stored HASH() value of all of the values of the record (you can exclude the primary key). Then, you can create an External Table over your incoming full data set each day, which includes the same HASH() function. Now, you can execute a MERGE or INSERT/UPDATE against your SCD-II based on primary key and whether the HASH value has changed.
Your main advantage doing things this way is you avoid loading all of the data into Snowflake each day to do the comparison, but it will be slower to execute this way. You could also load to a temp table with the HASH() function included in your COPY INTO statement and then update your SCD-II and then drop the temp table, which could actually be faster.
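The hash-comparison idea can be illustrated in plain Python on the three example extracts. This is a sketch under assumptions, not Snowflake or dbt code: `row_hash` and `build_scd2` are hypothetical names, SHA-256 stands in for the warehouse's HASH() function, the Extracted column is left out of the hash, and rows that disappear from a later extract are not closed.

```python
import hashlib


def row_hash(row, key="OppId", ignore=("Extracted",)):
    """Hash every column except the key and the extract date."""
    parts = [f"{k}={row[k]}" for k in sorted(row) if k != key and k not in ignore]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()


def build_scd2(extracts, key="OppId", as_of=None):
    """extracts maps an ISO date to that day's full extract (list of dicts)."""
    open_rows = {}  # key value -> (hash, currently open SCD record)
    history = []
    for day in sorted(extracts):
        if as_of is not None and day > as_of:
            break  # ignore extracts newer than the requested state
        for row in extracts[day]:
            h = row_hash(row, key)
            current = open_rows.get(row[key])
            if current and current[0] == h:
                continue  # unchanged since the previous extract
            if current:
                current[1]["valid_to"] = day  # close the old version
            record = {k: v for k, v in row.items() if k != "Extracted"}
            record.update(valid_from=day, valid_to=None)
            history.append(record)
            open_rows[row[key]] = (h, record)
    return sorted(history, key=lambda r: (r[key], r["valid_from"]))


def opp(opp_id, cust, stage, won, modified, extracted):
    return {"OppId": opp_id, "CustId": cust, "Stage": stage, "Won": won,
            "LastModified": modified, "Extracted": extracted}


extracts = {
    "2020-09-14": [opp(1, "A", "2 - Qualify", "", "9/1", "9/14"),
                   opp(2, "B", "3 - Propose", "", "9/12", "9/14")],
    "2020-09-15": [opp(1, "A", "2 - Qualify", "", "9/1", "9/15"),
                   opp(2, "B", "4 - Closed", "Y", "9/14", "9/15"),
                   opp(3, "C", "1 - Lead", "", "9/14", "9/15")],
    "2020-09-16": [opp(1, "A", "2 - Qualify", "", "9/1", "9/16"),
                   opp(2, "B", "4 - Closed", "Y", "9/14", "9/16"),
                   opp(3, "C", "2 - Qualify", "", "9/15", "9/16")],
}

scd = build_scd2(extracts)
scd_as_of_915 = build_scd2(extracts, as_of="2020-09-15")
for record in scd:
    print(record)
```

The `as_of` argument addresses the challenge question in the simplest way possible: extracts dated after the requested day are simply not replayed.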

Pulling ill-formatted data in Libre Calc: What Function will work with this?

I am working on a project where I am pulling tables from a Fandom Wikia page and feeding it into a spreadsheet named 'WikiPullSheet'. The data in the wiki tables is irregular in format; sometimes using multiple rows for the same entry.
Here is an example of some rows as described above from the sheet:
Name | Power | Stamina | Agility
Townsman Shield | 2 | 1 | 2
Starter | | |
Broken Shield | 4(+1) | 2(+1) | 2(+1)
Z1 | | |
Heater | 2(+1) | 4(+1) | 2(+1)
Z1 | | |
Wood Elf Shield | 2(+1) | 2(+1) | 4(+1)
Z1 | | |
Shiv | 4 | 4 | 3
Z1 Shop | | |
Deimos* | 26 | 16 | 26
| 34 | 22 | 34
I want the sheet to auto-update from the wikia page but this format will not allow me to reference items as the sheet expands. For instance, if on another sheet I want to have a drop down list of all the names for items in this list, I would be referencing the blank and starter cells even though they are not actually unique items in the table. I have done research on VLOOKUP, COUNTIF, REGEX options, MATCH, and more, but none of these seem to work for the issue I am having.
How would I take this input and either create a formula to reformat it or pull from the sheet as is and use the columns appropriately for a drop-down box containing only the item names from the NAME column?
Desired Output:
I need the data to end up formatted with each row representing a different unique item. Since the information is pulling with rows that contain location of the item in the name column (Z1 for instance), this is proving difficult. I could simply remove the rows that cause problems such as 'Z1' & 'Z1 Shop' in the above example, however this does not help when an item has multiple upgrade paths like in the case of the 'Deimos' row entry.
If you insert a pivot table (there is an icon to do so; select ColumnA first) based on ColumnA (assuming that is where Name is to be found), you should get a sorted list of the unique entries.
It is far from a complete solution (you don't show what the desired output should be) but I thought a sorted list, with each entry unique and the blanks at least out of the way, might have been a start.
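If preprocessing outside Calc is an option, the irregular rows can also be folded into one record per item. The Python sketch below rests on two assumptions about the wiki layout that the question only hints at: a row with a name but no stats is a location note for the item above it, and a row with stats but no name is an extra upgrade tier of the item above it. The `tidy` function and its field names are hypothetical.

```python
def tidy(rows):
    """Fold (name, power, stamina, agility) rows into one dict per item."""
    items = []
    for name, power, stamina, agility in rows:
        name = name.strip()
        stats = (power.strip(), stamina.strip(), agility.strip())
        has_stats = any(stats)
        if name and has_stats:
            # A regular item row starts a new record.
            items.append({"name": name, "location": "", "tiers": [stats]})
        elif name and items:
            # Name only: assumed to be the location of the previous item.
            items[-1]["location"] = name
        elif has_stats and items:
            # Stats only: assumed to be another upgrade tier.
            items[-1]["tiers"].append(stats)
    return items


# Excerpt of the example rows from the question.
rows = [
    ("Townsman Shield", "2", "1", "2"),
    ("Starter", "", "", ""),
    ("Broken Shield", "4(+1)", "2(+1)", "2(+1)"),
    ("Z1", "", "", ""),
    ("Shiv", "4", "4", "3"),
    ("Z1 Shop", "", "", ""),
    ("Deimos*", "26", "16", "26"),
    ("", "34", "22", "34"),
]

items = tidy(rows)
names = [item["name"] for item in items]
print(names)
```

The `names` list is what a drop-down would need, while the location and the extra tiers stay attached to their item instead of appearing as separate rows.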
