I'm a newbie and I'm building a database with a table which contains a list of jobs. Assuming the current time is between 8.00 and 8.30, an example is:
| job | start_time | end_time | status     |
|-----|------------|----------|------------|
| A   | 8.00       | 8.30     | processing |
| B   | 8.30       | 9.00     | to do      |
| C   | 9.15       | 9.30     | to do      |
As you can see, job A's status is 'processing'. It terminates at 8.30, and its status should change to 'terminated' at that time. At 8.30 the first job ends, and if there is a job starting at that time it should be started, like job B, which starts when job A ends. But there may not be a job starting when the last one has finished, as job C doesn't start when job B finishes.
Now, the problem is: how can I manage this real-time update of a table? I'm using PostgreSQL.
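A hedged sketch of one common approach (not from the thread): don't store the status at all, but derive it from the clock at read time, so nothing has to fire at exactly 8.30. The table name jobs and the timestamp column types are assumptions here:

CREATE VIEW jobs_with_status AS
SELECT job,
       start_time,
       end_time,
       CASE
           WHEN now() >= end_time   THEN 'terminated'  -- past its end
           WHEN now() >= start_time THEN 'processing'  -- currently running
           ELSE 'to do'
       END AS status
FROM jobs;

If something really must happen at the boundary (e.g. kicking off job B), a scheduled poller such as cron or the pg_cron extension querying this view is the usual complement.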
Let's say that I have the following SQL table where each value has a reference to the previous one:
ChainedTable
+------------------+--------------------------------------+------------+--------------------------------------+
| SequentialNumber | GUID | CustomData | LastGUID |
+------------------+--------------------------------------+------------+--------------------------------------+
| 1 | 792c9583-12a1-4c95-93a4-3206855d284f | OtherData1 | 0 |
+------------------+--------------------------------------+------------+--------------------------------------+
| 2 | 1022ffd3-afda-4e20-9d45-eec884bc2a50 | OtherData2 | 792c9583-12a1-4c95-93a4-3206855d284f |
+------------------+--------------------------------------+------------+--------------------------------------+
| 3 | 83729ad4-2564-4146-b451-00d82585bd96 | OtherData3 | 1022ffd3-afda-4e20-9d45-eec884bc2a50 |
+------------------+--------------------------------------+------------+--------------------------------------+
| 4 | d7197e87-d7d6-4175-8172-12656043a69d | OtherData4 | 83729ad4-2564-4146-b451-00d82585bd96 |
+------------------+--------------------------------------+------------+--------------------------------------+
| 5 | c1d3d751-ef34-4079-a73c-8952f93d17db | OtherData5 | d7197e87-d7d6-4175-8172-12656043a69d |
+------------------+--------------------------------------+------------+--------------------------------------+
If I were to insert the sixth row, I would retrieve the data of the last row using a query like this:
SELECT TOP 1 SequentialNumber, GUID FROM ChainedTable ORDER BY SequentialNumber DESC;
After that selection and before the insertion of the next row, an operation outside the database will take place.
That would suffice if it were ensured that only one entity uses the table at a time. However, if several entities can perform this same operation, there is a risk of a race condition: one entity may read the last row and, before it performs its insert, a second entity may read the same row, so both end up chaining their new rows to the same parent.
At first, I thought of creating a new table with a value that indicates whether the main table is in use (the value can be null or the identifier of the process that has access to the table). In that solution, an entity won't start reading the last row if the value indicates that the table is being used by another process. However, one of the things that can happen in this scenario is that the process using the table dies without releasing it, blocking the whole system.
I'm sure this is a "typical" computer science problem and that there are well known solutions to implement this. Can anyone point me in the right direction, please?
I think using a transaction in SQL may solve the problem. For example, if you create a transaction that adds the new row, no one else will be able to run the same transaction until the first one is completed.
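A minimal sketch of that idea in T-SQL (assuming SQL Server, since the question uses TOP; the UPDLOCK/HOLDLOCK hints make a second writer block on the read of the last row until the first transaction commits):

BEGIN TRANSACTION;

DECLARE @lastGuid UNIQUEIDENTIFIER;

-- lock the current last row; a concurrent writer blocks here
SELECT TOP 1 @lastGuid = GUID
FROM ChainedTable WITH (UPDLOCK, HOLDLOCK)
ORDER BY SequentialNumber DESC;

-- ... the operation outside the database happens here ...

INSERT INTO ChainedTable (GUID, CustomData, LastGUID)
VALUES (NEWID(), 'OtherData6', @lastGuid);

COMMIT TRANSACTION;

Note that this holds the lock across the external operation. If that operation can be slow, an optimistic alternative is a unique constraint on LastGUID, so the slower of two racing writers fails its insert and simply retries.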
main question
How can I ephemerally materialize a slowly changing dimension type 2 from a folder of daily extracts, where each csv is one full extract of a table from a source system?
rationale
We're designing ephemeral data warehouses as data marts for end users that can be spun up and burned down without consequence. This requires we have all data in a lake/blob/bucket.
We're ripping daily full extracts because:
- we couldn't reliably extract just the changeset (for reasons out of our control), and
- we'd like to maintain a data lake with the "rawest" possible data.
challenge question
Is there a solution that could give me the state as of a specific date and not just the "newest" state?
existential question
Am I thinking about this completely backwards and there's a much easier way to do this?
Possible Approaches
custom dbt materialization
There's an insert_by_period dbt materialization in the dbt-utils package that I think might be exactly what I'm looking for? But I'm confused, as what I need is essentially dbt snapshot, but:
- run for each file incrementally, all at once; and,
- built directly off of an external table?
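For context, a plain dbt snapshot looks roughly like the sketch below (the snapshot and source names are illustrative). It captures state as of each run, which is why a single run can't rebuild history from a folder of old extracts by itself:

{% snapshot crm_extract_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='OppId',
        strategy='timestamp',
        updated_at='LastModified'
    )
}}

-- 'lake' and 'crm_extract' are assumed source names
select * from {{ source('lake', 'crm_extract') }}

{% endsnapshot %}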
Delta Lake
I don't know much about Databricks's Delta Lake, but it seems like it should be possible with Delta Tables?
Fix the extraction job
Is our problem solved if we can make our extracts contain only what has changed since the previous extract?
Example
Suppose the following three files are in a folder of a data lake. (Gist with the 3 csvs and desired table outcome as csv).
I added the Extracted column in case parsing the timestamp from the filename is too tricky.
2020-09-14_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/14 |
| 2 | B | 3 - Propose | | 9/12 | 9/14 |
2020-09-15_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/15 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/15 |
| 3 | C | 1 - Lead | | 9/14 | 9/15 |
2020-09-16_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/16 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/16 |
| 3 | C | 2 - Qualify | | 9/15 | 9/16 |
End Result
Below is the SCD-II for the three files as of 9/16. The SCD-II as of 9/15 would be the same, except that OppId=3 would have only one row, with valid_from=9/15 and valid_to=null.
| OppId | CustId | Stage | Won | LastModified | valid_from | valid_to |
|-------|--------|-------------|-----|--------------|------------|----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/14 | null |
| 2 | B | 3 - Propose | | 9/12 | 9/14 | 9/15 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/15 | null |
| 3 | C | 1 - Lead | | 9/14 | 9/15 | 9/16 |
| 3 | C | 2 - Qualify | | 9/15 | 9/16 | null |
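For what it's worth, a hedged sketch of deriving that end result in PostgreSQL-flavored SQL, assuming the daily csvs have been stacked into a single table I'll call raw_extracts and that Stage, Won and LastModified are the tracked columns:

WITH changes AS (
    SELECT *,
           LAG(Stage)        OVER w AS prev_stage,
           LAG(Won)          OVER w AS prev_won,
           LAG(LastModified) OVER w AS prev_modified
    FROM raw_extracts
    WINDOW w AS (PARTITION BY OppId ORDER BY Extracted)
),
versions AS (
    -- keep the first appearance of a key plus every row that differs
    -- from the previous day's extract of the same key
    SELECT * FROM changes
    WHERE prev_stage IS NULL
       OR (Stage, Won, LastModified)
          IS DISTINCT FROM (prev_stage, prev_won, prev_modified)
)
SELECT OppId, CustId, Stage, Won, LastModified,
       Extracted AS valid_from,
       LEAD(Extracted) OVER (PARTITION BY OppId ORDER BY Extracted) AS valid_to
FROM versions;

This also bears on the challenge question: filtering the output with valid_from <= :date AND (valid_to > :date OR valid_to IS NULL) gives the state as of any date, not just the newest.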
Interesting concept, and of course it would be a longer conversation than is possible in this forum to fully understand your business, stakeholders, data, etc. I can see that it might work if you had a relatively small volume of data, your source systems rarely changed, your reporting requirements (and hence, datamarts) also rarely changed, and you only needed to spin up these datamarts very infrequently.
My concerns would be:
If your source or target requirements change how are you going to handle this? You will need to spin up your datamart, do full regression testing on it, apply your changes and then test them. If you do this as/when the changes are known then it's a lot of effort for a Datamart that's not being used - especially if you need to do this multiple times between uses; if you do this when the datamart is needed then you're not meeting your objective of having the datamart available for "instant" use.
I'm not sure your statement "we have a DW as code that can be deleted, updated, and recreated without the complexity that goes along with traditional DW change management" is true. How are you going to test updates to your code without spinning up the datamart(s) and going through a standard test cycle with data - and then how is this different from traditional DW change management?
What happens if there is corrupt/unexpected data in your source systems? In a "normal" DW where you are loading data daily this would normally be noticed and fixed on the day. In your solution the dodgy data might have occurred days or weeks ago and, assuming it loaded into your datamart rather than erroring on load, you would need processes in place to spot it and then potentially have to unravel days of SCD records to fix the problem.
(Only relevant if you have a significant volume of data.) Given the low cost of storage, I'm not sure I see the benefit of spinning up a datamart when needed as opposed to just holding the data so it's ready for use. Loading large volumes of data every time you spin up a datamart is going to be time-consuming and expensive. A possible hybrid approach might be to only run incremental loads when the datamart is needed rather than running them every day - so you have the data from when the datamart was last used ready to go at all times, and you just add the records created/updated since the last load.
I don't know whether this is the best or not, but I've seen it done. When you build your initial SCD-II table, add a column that is a stored HASH() value of all of the values of the record (you can exclude the primary key). Then, you can create an External Table over your incoming full data set each day, which includes the same HASH() function. Now, you can execute a MERGE or INSERT/UPDATE against your SCD-II based on primary key and whether the HASH value has changed.
The main advantage of doing things this way is that you avoid loading all of the data into Snowflake each day to do the comparison, but it will be slower to execute. You could also load to a temp table with the HASH() function included in your COPY INTO statement, then update your SCD-II and drop the temp table, which could actually be faster.
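A hedged two-statement sketch of that pattern, Snowflake-flavored since HASH() and external tables are mentioned. Here ext_daily (an external table over today's extract) and scd2 are illustrative names, and scd2 is assumed to store a row_hash column:

-- 1) close the current version of every key whose incoming hash differs
UPDATE scd2 t
SET valid_to = s.Extracted
FROM ext_daily s
WHERE t.OppId = s.OppId
  AND t.valid_to IS NULL
  AND t.row_hash <> HASH(s.Stage, s.Won, s.LastModified);

-- 2) open a new version for changed keys (just closed above) and brand-new keys
INSERT INTO scd2 (OppId, CustId, Stage, Won, LastModified,
                  valid_from, valid_to, row_hash)
SELECT s.OppId, s.CustId, s.Stage, s.Won, s.LastModified,
       s.Extracted, NULL, HASH(s.Stage, s.Won, s.LastModified)
FROM ext_daily s
LEFT JOIN scd2 t
  ON t.OppId = s.OppId AND t.valid_to IS NULL
WHERE t.OppId IS NULL;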
My requirement is to calculate over an incrementally sized window on a batch table.
For example, the first window has 1 row, the second window has 2 rows (the row from the 1st window plus a new row), the 3rd window has 3 rows (the 2 rows from the 2nd window plus a new row), and so on.
For example:
Source table:
| datetime | productId | price |
|----------|-----------|-------|
| 3-1      | p1        | 10    |
| 3-2      | p1        | 20    |
| 3-3      | p1        | 30    |
| 3-4      | p1        | 40    |
Result table:
| datetime | productId | average         |
|----------|-----------|-----------------|
| 3-1      | p1        | 10/1            |
| 3-2      | p1        | (10+20)/2       |
| 3-3      | p1        | (10+20+30)/3    |
| 3-4      | p1        | (10+20+30+40)/4 |
I am trying to find a way to implement this requirement in SQL. It seems the OVER clause can do that, but it is not yet implemented in Flink, so I need an alternative way.
BTW:
I tried to use a TUMBLE window of 1 day and store the previous value in the user-defined aggregation object, but failed, because the aggregation object is reused across all products rather than there being one object per product.
The OVER clause on a batch table is not supported by Flink's SQL yet. You can track the status of this effort here.
However, did you consider implementing this behavior on a streaming table instead? Streaming tables can also read from static files such as CSV files, and many operations are supported there as well. This depends on the other operations you want to use in your query, though.
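For reference, a sketch of what the requirement looks like as a standard-SQL OVER query (the form that, per the above, Flink's batch SQL did not yet support; source_table stands in for the table from the question):

SELECT
    datetime,
    productId,
    -- cumulative average over all rows up to and including the current one
    AVG(price) OVER (
        PARTITION BY productId
        ORDER BY datetime
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS average
FROM source_table;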
First of all, I'll explain my problem:
I'm developing an ecommerce website. One of its features is the possibility for customers to create purchasing rules. With these rules a customer can set a start date, a periodicity and a product to purchase. The result is that the product will be purchased every [periodicity] days from [start date].
The system is developed with NodeJS as back-end, MongoDB as database and AngularJS as front-end.
I've found some projects for scheduling tasks in NodeJS. Two of them are:
- node-schedule
- node-cron
Both of them are great tools, but I have the same problem with each: I need to be able to create scheduled tasks as well as stop them. With these tools it is very clear how to schedule a function to be executed over time, but how can I stop it at any moment?
The objects provided by node-schedule and node-cron have a cancel() or stop() method to stop the scheduling, but to invoke that method I need to have the object.
My question is whether there is a way to "store" the scheduled tasks in the database in order to be able to stop them at any moment, from anywhere other than the function where they were created.
And if this is not possible, whether there is another tool better suited than those I've mentioned to do what I need.
Thank you very much for reading and any help would be appreciated.
Ok, I've found a better solution for my problem than adding a bunch of scheduled tasks, one per purchasing rule. Now my purchasing_rules table looks like this:
+--------------------+
| purchase_rules |
+--------------------+
| customer_id |
| product_id |
| quantity |
| start_date |
| periodicity (Days) |
+--------------------+
My solution is to add the field next_run, so my table will look like:
+--------------------+
| purchase_rules |
+--------------------+
| customer_id |
| product_id |
| quantity |
| start_date |
| periodicity (Days) |
| next_run |
+--------------------+
By default, [next_run] will be [start_date] + [periodicity].
And now, the magic:
I will use node-schedule to schedule a job every day at a certain hour. That job will do the following:
1. Look in the purchase_rules table for any rule whose next_run = today
2. For every rule found:
   - Purchase the desired [quantity] of product [product_id] for the customer [customer_id]
   - Make [next_run] = [next_run] + [periodicity]
3. Finish
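A minimal sketch of the two database operations in that job, written as SQL against the table drawn above (the stack in the question is MongoDB, so the real queries would be the Mongo equivalents; the purchase itself stays in application code):

-- step 1: find every rule that is due today
SELECT customer_id, product_id, quantity
FROM purchase_rules
WHERE next_run = CURRENT_DATE;

-- step 2 (after purchasing): push each rule forward to its next occurrence
UPDATE purchase_rules
SET next_run = next_run + periodicity   -- periodicity is stored in days
WHERE next_run = CURRENT_DATE;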
This way the database will not be accessed as many times as with the Agenda package, and for "stopping" a purchasing rule I only have to remove it from the database.
I hope someone will find this useful some day.
I'm writing a simple booking program for a car rental (a school assignment). My buddy and I are trying to make the system a little more advanced than the assignment dictates, but we're having some problems we hoped you could help us with.
The idea is that you can reserve a certain car type, and when you get the car it will be one of that type (you don't reserve a specific car, as our assignment dictates, but only a type). Only one customer can have the car on a specific date. As the reservations tick in, we have to make sure that we don't hire out more cars of each type than we've got. The reservations are basically stored with a start date, an end date, and a car type.
If we ignore the car type for now (let's say we only have one type) then the reservations could graphically look something like this:
1/12 2/12 3/12 4/12 5/12 6/12 7/12
|-------------------|
|-----------------|
|-----|
|-------|
|-----------|
|-------------|
If the rental only has three cars, it would be possible to rent a car from 3/12 to 5/12, since each of those days only has 2 reservations. But how do we know this? Do we have to check each date and count() the number of reservations that span that date?
And what if somebody had reserved a car on 4/12? Then 3/12 and 5/12 would still only have 2 reservations, but 4/12 would have 3.
Would it be possible to do with a query somehow, or do we have to step through each date in the program to check that the number of reservations doesn't exceed the number of cars?
(This is easy enough with only full dates, but consider the scenario where you could rent the cars on an hourly basis, not only a daily one as here. Then it could be a tough one to step through each hour if we have a lot of reservations and cars and the timespan is long...)
Hope you have some nice ideas that will help us along. Thanks for taking the time to read the question :)
Mikkel, Denmark
Assume you have such a reservation situation in real life:
1/12 2/12 3/12 4/12 5/12 6/12 7/12
Car1: |-------------------|
Car2: |-----------------|
Car3: |-------| |-----------| |-----|
Car4: |-------------|
Table car
| id | type | registration |
|----|------|--------------|
| 1  | 1    | HH1111       |
| 2  | 1    | HH3333       |
| 3  | 2    | HH77         |
| 4  | 3    | DD999        |
Table reservation
| car_id | date_from  | date_to    |
|--------|------------|------------|
| 1      | 2013-12-01 | 2013-12-04 |
| 2      | 2013-12-04 | 2013-12-07 |
| 3      | 2013-12-01 | 2013-12-02 |
| 3      | 2013-12-03 | 2013-12-05 |
| 3      | 2013-12-06 | 2013-12-07 |
| 4      | 2013-12-01 | 2013-12-03 |
Now, by really simple logic, you must select all available cars for the period from 2013-12-05 to 2013-12-06:
"Select ALL cars which do not have any reservation with dates that block them for usage"
with this brilliant MySQL select:
select * from car where not exists ( select * from reservation
where car.id = reservation.car_id AND
date_from < '2013-12-06' AND
date_to > '2013-12-05' )
"Would it be possible to do with a query some how, or do we have to step through each date in the program to check the number of reservations didn't exceed the number of cars? (This is easy enough with only full dates,"
The nature of your problem is that a violation of the constraint could appear on any individual date. So logically speaking, it is indeed necessary to do the check for each individual date comprised in a new reservation. The only optimisation possible is to do the check at the level of the "smallest intervals". To do that, you must first compute all the intervals that already appear in the database and that overlap with your new reservation.
For example, a new reservation for 4/12-6/12 would have to be split into 4/12-5/12 (second line) and 5/12-6/12 (third line). Those individual intervals might be longer than one single day, and you can do the checks at the level of those individual intervals. (They are the same as individual days in this particular example, but a reservation for 7/12-19/12 would not have to be split at all.)
However, computing this might prove difficult, and there's another caveat: when you're looking at multi-row inserts, you should also be splitting over the other rows being inserted (and that requires you to record all the inserted rows in a temporary table, otherwise you won't be able to access them).
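A hedged sketch of that per-interval check in SQL, reusing the reservation table from the answer above and the 3/12-5/12 request from the question: within the requested period, the overlap count can only change at an existing reservation's start date, so it suffices to count active reservations at the period's own start plus each such boundary.

-- candidate boundaries: the new start date plus every existing
-- reservation start inside the requested period
SELECT p.d,
       COUNT(r.car_id) AS active_reservations
FROM (
    SELECT CAST('2013-12-03' AS DATE) AS d
    UNION
    SELECT date_from FROM reservation
    WHERE date_from > '2013-12-03' AND date_from < '2013-12-05'
) p
LEFT JOIN reservation r
       ON r.date_from <= p.d
      AND r.date_to   >  p.d
GROUP BY p.d;

The new booking fits if the largest active_reservations is below the fleet size; to respect car types, join reservation to car and filter on the requested type.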