To ask my question I first have to show you my data and my proposed solution to the dual-key problem:
The data has one of two keys, x and y. Sometimes x is present, sometimes y. One type of event has both:
Type 1: keys x and y
Type 2: key x
Type 3: key y
To have the full session at the end of the pipeline, we need all data under one key: x+y.
To achieve this, I duplicate the messages that have both keys, keying one copy by x and the other by y. Then, in the following processor, I enrich both the x- and y-keyed types.
Each message looks like this: [Flink key, potentialX, potentialY, rest of msg...]
[Diagram: pipeline]
Here is my scenario: I have a close-session message, which is type 2. This will be propagated to the key-X processor, where it will be enriched, and we can shut down the appropriate processors in the rest of the pipeline. However, key y is never evicted, because it never gets the close-session message.
[Diagram: close-message flow]
Now for the question: how can I close the state in the Y processor?
Initially I thought to duplicate the type 2 message in the enricher, emit it via a side output, grab that side output before the keyBy, and thereby have it go to the correct processor. This is not possible, as a side output can only be consumed after the operator where it was created. Then I found some Jira tickets about side inputs, but that does not seem to be an actual feature yet.
Lastly, I thought I might add a sink for the side output mentioned above and a source at the keyBy. This seems a bit hacky, though.
I really hope someone can help!
Edit:
Adding a new diagram to try to clarify the original flow. In the original drawings I tried to make the flow of data easier to understand by drawing two boxes for the enrichment processor. I've tried to make the flow more correct with this new drawing:
[Diagram: improved drawing]
That's a bit complicated to follow, but I've seen this pattern before when trying to unify logged-out sessions with logged-in sessions from web logs. If I've understood the details well enough, I think you could take a side output from the X processor, and feed it into the Y processor, like this:
+------------+ +-------+
| +--------------------------> |
+--------+ +-------+ X | X proc | | |
| | | +-----> | sideout +-----------+ | X + Y |
| | | | | +---------> | | |
| source +-----> split | +------------+ | +----> |
| | | | | Y proc | +-------+
| | | +----------------------------> |
+--------+ +-------+ Y +-----------+
I have a hypertable which looks like this:
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
-------------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
state | text | | | | extended | | |
device | text | | | | extended | | |
time | bigint | | not null | | plain | | |
Indexes:
"device_state_time" btree ("time")
Triggers:
ts_insert_blocker BEFORE INSERT ON "device_state" FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.insert_blocker()
Child tables: _timescaledb_internal._hyper_4_2_chunk
Access method: heap
I have 100k devices, each sending its state at different time intervals. For example, device1 sends state every second, device2 every day, device3 every 5 days, etc. And I MUST keep at least the 10 latest states for each device, so I can't really use the default data retention policy provided by Timescale.
Is there any way to achieve this efficiently other than manually selecting the latest 10 entries for each device and deleting the rest?
Thanks!
That sounds like a corner case because the chunks are time-based. Can you categorize these devices in advance?
Maybe you can insert data into different hypertables based on the insert timeframe if you still want to use the retention policies.
For example, Promscale's solution uses one table for each metric, allowing users to redefine the retention policy for every metric.
It will depend on how you read the data later; maybe fragmenting it into several hypertables will make it harder.
Also, consider the optional arguments of hypertable creation; maybe you can get something from partitioning_func and time_partitioning_func.
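If you do end up doing the cleanup manually as you describe, here is a minimal sketch of the keep-the-latest-10 delete, assuming the device_state table from the question (you would want to batch it and schedule it off-peak):

DELETE FROM device_state AS ds
USING (
    -- rank each device's states, newest first
    SELECT device, "time",
           row_number() OVER (PARTITION BY device ORDER BY "time" DESC) AS rn
    FROM device_state
) AS ranked
WHERE ds.device = ranked.device
  AND ds."time" = ranked."time"
  AND ranked.rn > 10;   -- keep the 10 newest per device

Note that row-level deletes keep the chunks themselves alive, which is exactly why the chunk-based retention policies are so much cheaper.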
Let's say that I have the following SQL table where each value has a reference to the previous one:
ChainedTable
+------------------+--------------------------------------+------------+--------------------------------------+
| SequentialNumber | GUID | CustomData | LastGUID |
+------------------+--------------------------------------+------------+--------------------------------------+
| 1 | 792c9583-12a1-4c95-93a4-3206855d284f | OtherData1 | 0 |
+------------------+--------------------------------------+------------+--------------------------------------+
| 2 | 1022ffd3-afda-4e20-9d45-eec884bc2a50 | OtherData2 | 792c9583-12a1-4c95-93a4-3206855d284f |
+------------------+--------------------------------------+------------+--------------------------------------+
| 3 | 83729ad4-2564-4146-b451-00d82585bd96 | OtherData3 | 1022ffd3-afda-4e20-9d45-eec884bc2a50 |
+------------------+--------------------------------------+------------+--------------------------------------+
| 4 | d7197e87-d7d6-4175-8172-12656043a69d | OtherData4 | 83729ad4-2564-4146-b451-00d82585bd96 |
+------------------+--------------------------------------+------------+--------------------------------------+
| 5 | c1d3d751-ef34-4079-a73c-8952f93d17db | OtherData5 | d7197e87-d7d6-4175-8172-12656043a69d |
+------------------+--------------------------------------+------------+--------------------------------------+
If I were to insert the sixth row, I would retrieve the data of the last row using a query like this:
SELECT TOP 1 SequentialNumber, GUID FROM ChainedTable ORDER BY SequentialNumber DESC;
After that selection and before the insertion of the next row, an operation outside the database will take place.
That would suffice if it were ensured that only one entity uses the table at a time. However, if more entities can perform this same operation, there is a risk of a race condition: one entity could read the last row, and a second entity could read the same last row before the first one does its insert.
At first, I thought of creating a new table with a value that indicates whether the table is in use (the value can be null or the identifier of the process that has access to the table). In that solution, an entity won't read the last row if the value indicates that the table is being used by another process. However, the process using the table can die without releasing it, blocking the whole system.
I'm sure this is a "typical" computer science problem and that there are well known solutions to implement this. Can anyone point me in the right direction, please?
I think using a transaction in SQL may solve the problem. For example, if you read the last row and insert the new one inside a single transaction with appropriate locking, no one else will be able to do the same until the first transaction completes.
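To make that concrete, here is a minimal T-SQL sketch (I'm assuming the GUID columns are uniqueidentifier; adjust the types to your schema). The lock hints serialize the read-then-insert, so a second writer blocks on the SELECT until the first transaction commits:

BEGIN TRANSACTION;

DECLARE @LastSeq int, @LastGuid uniqueidentifier;

-- UPDLOCK + HOLDLOCK keeps competing transactions from reading
-- the same "last row" until we commit
SELECT TOP 1 @LastSeq = SequentialNumber, @LastGuid = GUID
FROM ChainedTable WITH (UPDLOCK, HOLDLOCK)
ORDER BY SequentialNumber DESC;

-- the operation outside the database would happen here

INSERT INTO ChainedTable (SequentialNumber, GUID, CustomData, LastGUID)
VALUES (@LastSeq + 1, NEWID(), 'OtherData6', @LastGuid);

COMMIT TRANSACTION;

Be aware that this holds the lock for the whole duration of the external operation. If that operation is slow, an optimistic alternative is a unique constraint on SequentialNumber: read without locks, then retry the insert if it fails with a duplicate-key error.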
Good Morning all. I'm trying to write some of my first scripts and I'm having a difficult time doing so. I'm trying to match data from one file to another, and add a label to that row in the original.
I'm using two different data sources to accomplish this, and there are tens of thousands of rows to match. I'm trying to take one column of zip codes in data source one, match it to the same zip codes in data source two, and add a new column to data source one labeling the location. See the example below.
Data Source One:
| A     | B |
|-------|---|
| 13329 | X |
| 22193 | X |
| 13211 | X |
Data source two:
| A     | B          |
|-------|------------|
| 13211 | Syracuse   |
| 22193 | D.C. Metro |
| 13329 | Utica Rome |
New Data Source one:
| A     | B | C          |
|-------|---|------------|
| 13329 | X | Utica-Rome |
| 22193 | X | D.C. Metro |
| 13211 | X | Syracuse   |
New Data Source One is the desired end state. I am dealing with rows that will have no new labels and can be labeled as N/A or NA (whichever way is fine). I hope I have explained the problem and the desired result well enough. Please help.
More commonly than "matching and labeling", this is called joining.
join <(sort DS1) <(sort DS2)
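The join above matches on the first field of each file, which is why both need to be sorted first; with GNU join you can also keep unmatched zip codes and label them, e.g. -a 1 -e NA -o auto. If the data lives in a database instead, the same operation is a plain left join. A sketch with hypothetical table names ds1(zip, flag) for Data Source One and ds2(zip, location) for Data Source Two:

SELECT ds1.zip,
       ds1.flag,
       COALESCE(ds2.location, 'NA') AS location  -- NA when no match exists
FROM ds1
LEFT JOIN ds2 ON ds2.zip = ds1.zip;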
Question
main question
How can I ephemerally materialize a slowly changing dimension type 2 from a folder of daily extracts, where each csv is one full extract of a table from a source system?
rationale
We're designing ephemeral data warehouses as data marts for end users that can be spun up and burned down without consequence. This requires we have all data in a lake/blob/bucket.
We're ripping daily full extracts because:
- we couldn't reliably extract just the changeset (for reasons out of our control), and
- we'd like to maintain a data lake with the "rawest" possible data.
challenge question
Is there a solution that could give me the state as of a specific date and not just the "newest" state?
existential question
Am I thinking about this completely backwards and there's a much easier way to do this?
Possible Approaches
custom dbt materialization
There's an insert_by_period dbt materialization in the dbt-utils package that I think might be exactly what I'm looking for. But I'm confused, as it would be like dbt snapshot, but:
- run dbt snapshot for each file incrementally, all at once; and
- built directly off of an external table?
Delta Lake
I don't know much about Databricks's Delta Lake, but it seems like it should be possible with Delta Tables?
Fix the extraction job
Is our problem solved if we can make our extracts contain only what has changed since the previous extract?
Example
Suppose the following three files are in a folder of a data lake. (Gist with the 3 csvs and desired table outcome as csv).
I added the Extracted column in case parsing the timestamp from the filename is too tricky.
2020-09-14_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/14 |
| 2 | B | 3 - Propose | | 9/12 | 9/14 |
2020-09-15_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/15 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/15 |
| 3 | C | 1 - Lead | | 9/14 | 9/15 |
2020-09-16_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/16 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/16 |
| 3 | C | 2 - Qualify | | 9/15 | 9/16 |
End Result
Below is SCD-II for the three files as of 9/16. SCD-II as of 9/15 would be the same, except OppId=3 would have only one row, with valid_from=9/15 and valid_to=null.
| OppId | CustId | Stage | Won | LastModified | valid_from | valid_to |
|-------|--------|-------------|-----|--------------|------------|----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/14 | null |
| 2 | B | 3 - Propose | | 9/12 | 9/14 | 9/15 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/15 | null |
| 3 | C | 1 - Lead | | 9/14 | 9/15 | 9/16 |
| 3 | C | 2 - Qualify | | 9/15 | 9/16 | null |
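For reference, here is the transformation I'm after expressed as plain SQL, assuming all the daily csvs were readable as one table (extracts is a hypothetical name, with Extracted parsed as a date); what I don't know is how to materialize this ephemerally:

WITH versions AS (
    -- one row per distinct version of an opportunity,
    -- stamped with the first extract date it appeared in
    SELECT OppId, CustId, Stage, Won, LastModified,
           MIN(Extracted) AS valid_from
    FROM extracts
    GROUP BY OppId, CustId, Stage, Won, LastModified
)
SELECT OppId, CustId, Stage, Won, LastModified, valid_from,
       -- the next version of the same OppId closes this one out
       LEAD(valid_from) OVER (PARTITION BY OppId ORDER BY valid_from) AS valid_to
FROM versions
ORDER BY OppId, valid_from;

(If something like this works, the challenge question becomes a simple filter: the state as of date D is WHERE valid_from <= D AND (valid_to > D OR valid_to IS NULL). I'm aware the GROUP BY assumes a row never reverts to an exactly identical earlier version; if that can happen, a gaps-and-islands approach would be needed instead.)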
Interesting concept, and of course it would take a longer conversation than is possible in this forum to fully understand your business, stakeholders, data, etc. I can see that it might work if you had a relatively small volume of data, your source systems rarely changed, your reporting requirements (and hence, datamarts) also rarely changed, and you only needed to spin up these datamarts very infrequently.
My concerns would be:
If your source or target requirements change how are you going to handle this? You will need to spin up your datamart, do full regression testing on it, apply your changes and then test them. If you do this as/when the changes are known then it's a lot of effort for a Datamart that's not being used - especially if you need to do this multiple times between uses; if you do this when the datamart is needed then you're not meeting your objective of having the datamart available for "instant" use.
I'm not sure your statement "we have a DW as code that can be deleted, updated, and recreated without the complexity that goes along with traditional DW change management" is true. How are you going to test updates to your code without spinning up the datamart(s) and going through a standard test cycle with data? And then how is this different from traditional DW change management?
What happens if there is corrupt/unexpected data in your source systems? In a "normal" DW where you are loading data daily, this would normally be noticed and fixed on the day. In your solution the dodgy data might have occurred days/weeks ago and, assuming it loaded into your datamart rather than erroring on load, you would need processes in place to spot it and then potentially have to unravel days of SCD records to fix the problem.
(Only relevant if you have a significant volume of data.) Given the low cost of storage, I'm not sure I see the benefit of spinning up a datamart when needed as opposed to just holding the data so it's ready for use. Loading large volumes of data every time you spin up a datamart is going to be time-consuming and expensive. A possible hybrid approach might be to only run incremental loads when the datamart is needed rather than running them every day, so you have the data from when the datamart was last used ready to go at all times and you just add the records created/updated since the last load.
I don't know whether this is the best or not, but I've seen it done. When you build your initial SCD-II table, add a column that is a stored HASH() value of all of the values of the record (you can exclude the primary key). Then, you can create an External Table over your incoming full data set each day, which includes the same HASH() function. Now, you can execute a MERGE or INSERT/UPDATE against your SCD-II based on primary key and whether the HASH value has changed.
The main advantage of doing things this way is that you avoid loading all of the data into Snowflake each day to do the comparison, but it will be slower to execute. You could also load to a temp table with the HASH() function included in your COPY INTO statement, then update your SCD-II and drop the temp table, which could actually be faster.
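A rough sketch of that pattern, with hypothetical names (scd2_dim is the SCD-II table, ext_daily the external table over today's file, and row_hash a stored HASH() over the non-key columns). It's done in two steps because a single MERGE can't both expire the old version and insert the new one for the same key:

-- step 1: expire current rows whose incoming hash differs
MERGE INTO scd2_dim d
USING ext_daily s
  ON d.OppId = s.OppId AND d.valid_to IS NULL
WHEN MATCHED AND d.row_hash <> s.row_hash THEN
  UPDATE SET valid_to = s.Extracted;

-- step 2: open a new current version for new or changed keys
INSERT INTO scd2_dim (OppId, CustId, Stage, Won, LastModified, row_hash, valid_from, valid_to)
SELECT s.OppId, s.CustId, s.Stage, s.Won, s.LastModified,
       s.row_hash, s.Extracted, NULL
FROM ext_daily s
LEFT JOIN scd2_dim d
  ON d.OppId = s.OppId AND d.valid_to IS NULL
WHERE d.OppId IS NULL;

Run both steps in one transaction so a failure in between can't leave a key with no current row.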
In a sequence diagram I am trying to model a loop that creates a bunch of objects. I have found little information online regarding the creation of multiple objects in an SD, so I turn to you.
The classes are Deck and Card.
Cards are created by fillDeck(), which is called by the constructor of Deck (FYI, the objects are stored in an ArrayList in Deck).
There are many types of cards with varying properties. Suppose I want 8 cards of type A to be made, 12 of type B, and 3 of type C.
How would I go about modelling such a thing? This is the idea I have in mind so far, but it is obviously incomplete.
Hope someone can help! Thanks!
+------+
| Deck |
+------+
|
+--+-------+--------------+
| loop 8x / |
+--+-----+ +----------+ |
| |-------->| Card(A) | |
| | +-----+----+ |
+--+----------------------+
| |
+--+--------+------|-----------------------+
| loop 12x / | |
+--+------+ | +---------+ |
| |------------------------->| Card(B) | |
| | | +----+----+ |
|--+---------------------------------------+
| | | |
+--+-------+----------------------------------------------+
| loop 3x / | | |
+--+-----+ | | +---------+ |
| |--------------------------------------->| Card(C) | |
| | | | +----+----+ |
|--+------------------------------------------------------+
| | | |
"A sequence diagram describes an Interaction by focusing on the sequence of Messages that are exchanged, along with their corresponding OccurrenceSpecifications on the Lifelines." (UML standard) A lifeline are defined by one object. But that doesn't mean you must keep all objects in lifelines. You should show only these lifelines, that are exchanging messages you are thinking about.
And you needn't show all the message-sequence logic in one diagram. One SD normally shows one Interaction, or maybe a few of them if they are simple.
So, if your SD shows one logical concept, it is correct. If there is another interaction between some objects, you will draw another SD for it, containing only the objects participating in that second interaction.
UML standard 2.5. Figure 17.25 - Overview of Metamodel elements of a Sequence Diagram