Watermark Strategy after a simple transformation on the table - apache-flink

Sorry for the simple question, but I am struggling to understand how to find out whether the result of a given query has a watermark or not.
For example, I define my source with the DataStream API and then convert it to the Table API, leveraging the SOURCE_WATERMARK() feature.
Using input.printSchema() I clearly see the watermark strategy:
(
`id` BIGINT NOT NULL,
`amount` DECIMAL(38, 18),
`rowtime` TIMESTAMP_LTZ(3) *ROWTIME* METADATA,
WATERMARK FOR `rowtime`: TIMESTAMP_LTZ(3) AS SOURCE_WATERMARK(),
CONSTRAINT `PK_id` PRIMARY KEY (`id`) NOT ENFORCED
)
Then I print the schema of a simple select:
input2 = input.select('id, 'rowtime)
input2.printSchema()
and the result is
(
`id` BIGINT NOT NULL,
`rowtime` TIMESTAMP_LTZ(3) *ROWTIME*
)
Does this mean the input2 table doesn't have a watermark?
When I do a regular join (such as a self-join), the result table's schema looks similar, with just rowtime. I noticed watermarks in the Flink UI as well, but I came across this note by David Anderson:
The result of a regular join cannot have a well-defined watermark strategy
I would appreciate any help understanding these concepts better.
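For reference, the first printed schema corresponds to a DDL-style declaration along these lines (a sketch only; the connector options in WITH are elided, and SOURCE_WATERMARK() simply tells the planner to reuse the watermarks emitted by the underlying DataStream source):

```sql
-- Sketch: DDL-style equivalent of the schema printed above.
CREATE TABLE input (
  `id` BIGINT NOT NULL,
  `amount` DECIMAL(38, 18),
  `rowtime` TIMESTAMP_LTZ(3) METADATA,
  WATERMARK FOR `rowtime` AS SOURCE_WATERMARK(),
  PRIMARY KEY (`id`) NOT ENFORCED
) WITH ( ... );
```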

Related

SQL Server - smart approach for combining GUID and identity column

I am trying to come up with a design for my database where across all my tables I'd like to have the combination of a GUID column (uniqueidentifier data type) and an identity column (int data type).
The GUID column is going to be a NONCLUSTERED index whilst the identity column is going to be the CLUSTERED index. I was wondering if the script below is a correct/safe approach when it comes to database design:
CREATE TABLE country
(
guid uniqueidentifier DEFAULT NEWID() NOT NULL,
code int IDENTITY(1, 1) NOT NULL,
isoCode nvarchar(5) NOT NULL,
description nvarchar(255) NOT NULL,
created date NOT NULL DEFAULT GETDATE(),
updated date NOT NULL DEFAULT GETDATE(),
inactive bit DEFAULT 0,
CONSTRAINT NIX_guid PRIMARY KEY NONCLUSTERED(guid),
CONSTRAINT AK_code UNIQUE(code),
CONSTRAINT AK_isoCode UNIQUE(isoCode)
)
GO
CREATE UNIQUE CLUSTERED INDEX [IX_code] ON country ([code] ASC)
GO
That's how it looks after running the above script:
Any tips would be much appreciated!
The domain of all possible countries is never going to be more than a few hundred, so performance should not be a concern.
You already have an isoCode. That is a canonically defined candidate key. I understand what you mean when talking about GUIDs being useful because they can never collide when created on separate servers/application instances/etc. But ISO country codes can never collide either, because they're already defined by an external authority. You don't need the GUID.
Why is your existing isoCode column an nvarchar(5)? There are 2-letter and 3-letter ISO 3166 standards. No Unicode characters are required, so you can use char(2) or char(3) depending on which standard you pick, both of which would be narrower than a 4-byte int.
Yes, an identity-based clustered index does mean not having to worry about page splits on insert. But these are countries. We already know all of the countries you need to insert right now, and are you really worried that the handful of changes that might be made over the next few decades will kill the performance of your system due to page splits on insert? No, so you don't need the identity column either.
Eliminate both surrogates, and just go with the alpha-2 or alpha-3 ISO country code as your clustered primary key.
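Following that advice, a minimal sketch of the table without either surrogate might look like this (column names taken from the question; the alpha-3 choice is illustrative):

```sql
CREATE TABLE country
(
    isoCode char(3) NOT NULL,  -- ISO 3166-1 alpha-3 code, the natural key
    description nvarchar(255) NOT NULL,
    created date NOT NULL DEFAULT GETDATE(),
    updated date NOT NULL DEFAULT GETDATE(),
    inactive bit DEFAULT 0,
    -- the natural key doubles as the clustered primary key
    CONSTRAINT PK_country PRIMARY KEY CLUSTERED (isoCode)
);
```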

Choice of a primary key in SQL table

I want to make an SQL table to keep track of notes that are added/edited/deleted. I want to be able to display the state of each NOTEID at this moment in a table, display log of changes of selected note and be able to delete all notes marked with a given NOTEID.
CREATE TABLE [dbo].[NOTES] (
NOTEID [varchar](128) NOT NULL,
CREATEDBY [varchar](128) NOT NULL, /* is this redundant? */
TIMECREATED DATE NOT NULL, /* is this redundant? */
MODIFIEDBY [varchar](128) NOT NULL,
TIMEMODIFIED DATE NOT NULL,
NOTE [varchar](2000) NULL,
PRIMARY KEY ( /* undecided */ )
);
What is the natural way of making this table? Should I autogenerate the primary ID, or should I use (NOTEID, TIMEMODIFIED) as the primary key? What kind of foolproof protection should be added?
I would like to be able to display all notes in a "Note history" window. So, I should store note from 3 days ago, when it was created, note from 2 days ago and from today, when it was modified.
However, the "Notes" table will show the final state for each NOTEID. That is
SELECT NOTE from NOTES where NOTEID = 'selected_note_id' and date = latest
The best way is to create two tables:
NOTES (
NOTE_ID -- primary key and autogenerated / autonumeric
CREATEDBY -- only appear once
TIMECREATED -- only appear once
NOTE
)
NOTES_UPDATE (
NOTES_UPDATE_ID -- primary key and autogenerated / autonumeric
NOTE_ID -- Foreign Key to NOTES
MODIFIEDBY
TIMEMODIFIED
NOTE
)
You can get your note updates with:
SELECT N.*, NU.*
FROM NOTES N
JOIN NOTES_UPDATE NU
ON N.NOTE_ID = NU.NOTE_ID
and to get the last update just add:
ORDER BY NOTES_UPDATE_ID DESC
LIMIT 1 -- LIMIT is PostgreSQL syntax
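As a concrete sketch of that two-table design in PostgreSQL syntax (names follow the outline above; types and the sample id are illustrative):

```sql
CREATE TABLE NOTES (
    NOTE_ID     serial PRIMARY KEY,          -- autogenerated
    CREATEDBY   varchar(128) NOT NULL,       -- appears only once
    TIMECREATED timestamp NOT NULL DEFAULT now(),
    NOTE        varchar(2000)
);

CREATE TABLE NOTES_UPDATE (
    NOTES_UPDATE_ID serial PRIMARY KEY,      -- autogenerated
    NOTE_ID         int NOT NULL REFERENCES NOTES (NOTE_ID),
    MODIFIEDBY      varchar(128) NOT NULL,
    TIMEMODIFIED    timestamp NOT NULL DEFAULT now(),
    NOTE            varchar(2000)
);

-- Latest state of a single note (42 is an illustrative id):
SELECT NU.NOTE
FROM NOTES_UPDATE NU
WHERE NU.NOTE_ID = 42
ORDER BY NU.NOTES_UPDATE_ID DESC
LIMIT 1;
```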
SIMPLE ANSWER:
The PRIMARY KEY should be the value that uniquely identifies each row in your table. In your particular case, NOTEID should be your id.
ELABORATING:
It is important to remember that a PRIMARY KEY creates an index by default, which means that whenever you do a query similar to:
SELECT * FROM table WHERE NOTEID = something
The query will execute a lot faster than without an index (which is mostly relevant for bigger tables). The PRIMARY KEY is also forced to be unique, hence no two rows can have the same PRIMARY KEY.
A general rule is that you should have an INDEX for any value that will often be used within the WHERE ... part of the statement. If NOTEID is not the only value you will be using in the WHERE .... part of the query, consider creating more indexes
HOWEVER! TREAD WITH CAUTION. Indexes help speed up SELECT searches, but they make UPDATE and INSERT slower.
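For example, if queries will also filter on MODIFIEDBY, an additional index could be created along these lines (a sketch; the index name is made up):

```sql
-- Speeds up WHERE MODIFIEDBY = ... lookups, at some cost to writes.
CREATE INDEX IX_NOTES_MODIFIEDBY ON NOTES (MODIFIEDBY);
```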
I think your current table design is fine, though you might want to make NOTEID the primary key and auto-increment it. I don't see the point of making (NOTEID, TIMEMODIFIED) a composite primary key, because a given note ID should ideally appear only once in the table. If the modified time changes, the ID should remain the same.
If we treat notes like files on a computer, there should be only one table (the file system) that stores them. If a given note gets modified, its timestamp changes to reflect this.

Building comment system for different types of entities

I'm building a comment system in PostgreSQL where I can comment on different entities that I already have (such as products, articles, photos, and so on), as well as "like" those comments. For the moment, I came up with this:
(note: the foreign key between comment_board and product/article/photo is very loose here. ref_id is just storing the id, which is used in conjunction with the comment_board_type to determine which table it is)
Obviously, this doesn't seem like good data integrity. What can I do to give it better integrity? Also, I know every product/article/photo will need a comment_board. Does that mean I should add a comment_board_id to each product/article/photo entity, like this?:
I do recognize this SO solution, but it made me second-guess supertypes and the complexities of it: Database design - articles, blog posts, photos, stories
Any guidance is appreciated!
I ended up just pointing the comments directly at the product/photo/article tables. Here is what I came up with:
CREATE TABLE comment (
id SERIAL PRIMARY KEY,
created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT (now()),
updated_at TIMESTAMP WITH TIME ZONE,
account_id INT NOT NULL REFERENCES account(id),
text VARCHAR NOT NULL,
-- commentable sections
product_id INT REFERENCES product(id),
photo_id INT REFERENCES photo(id),
article_id INT REFERENCES article(id),
-- constraint to make sure this comment appears in only one place
CONSTRAINT comment_entity_check CHECK(
(product_id IS NOT NULL)::INT
+
(photo_id IS NOT NULL)::INT
+
(article_id IS NOT NULL)::INT
= 1
)
);
CREATE TABLE comment_likes (
id SERIAL PRIMARY KEY,
created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT (now()),
updated_at TIMESTAMP WITH TIME ZONE,
account_id INT NOT NULL REFERENCES account(id),
comment_id INT NOT NULL REFERENCES comment(id),
-- comments can only be liked once by an account.
UNIQUE(account_id, comment_id)
);
Resulting in:
This makes it so that I have to do one less join to an intermediary table. Also, it lets me add a field and update the constraints easily.
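For instance, fetching all comments on a given product, together with like counts, needs no intermediary table (a sketch against the schema above; 123 is an illustrative product id):

```sql
SELECT c.id, c.text, c.created_at,
       COUNT(cl.id) AS likes
FROM comment c
LEFT JOIN comment_likes cl ON cl.comment_id = c.id
WHERE c.product_id = 123
GROUP BY c.id, c.text, c.created_at
ORDER BY c.created_at;
```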

Large sample database for HSQLDB?

I'm taking a database class and I'd like to have a large sample database to experiment with. My definition of large here is that there's enough data in the database so that if I try a query that's very inefficient, I'll be able to tell by the amount of time it takes to execute. I've googled for this and not found anything that's HSQLDB specific, but maybe I'm using the wrong keywords. Basically I'm hoping to find something that's already set up, with the tables, primary keys, etc. and normalized and all that, so I can try things out on a somewhat realistic database. For HSQLDB I guess that would just be the .script file. Anyway if anybody knows of any resources for this I'd really appreciate it.
You can use the MySQL Sakila database schema and data (open source, on MySQL web site), but you need to modify the schema definition. You can delete the view and trigger definitions, which are not necessary for your experiment. For example:
CREATE TABLE country (
country_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
country VARCHAR(50) NOT NULL,
last_update TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (country_id)
)ENGINE=InnoDB DEFAULT CHARSET=utf8;
Modified for HSQLDB:
CREATE TABLE country (
country_id SMALLINT GENERATED BY DEFAULT AS IDENTITY,
country VARCHAR(50) NOT NULL,
last_update TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (country_id)
)
Some MySQL DDL syntax is supported in the MYS syntax mode of HSQLDB, for example AUTO_INCREMENT is translated to IDENTITY, but others need manual editing. The data is mostly compatible, apart from some binary strings.
You need to access the database with a tool that reports the query time. The HSQLDB DatabaseManager does this when the query output is in Text mode.

SQL Server: Clustering by timestamp; pros/cons

I have a table in SQL Server where I want inserts to be added to the end of the table (as opposed to a clustering key that would cause them to be inserted in the middle). This means I want the table clustered by some column that will constantly increase.
This could be achieved by clustering on a datetime column:
CREATE TABLE Things (
...
CreatedDate datetime DEFAULT getdate(),
[timestamp] timestamp,
CONSTRAINT [IX_Things] UNIQUE CLUSTERED (CreatedDate)
)
But I can't guarantee that two Things won't have the same time, so my requirements can't really be achieved by a datetime column.
I could add a dummy identity int column, and cluster on that:
CREATE TABLE Things (
...
RowID int IDENTITY(1,1),
[timestamp] timestamp,
CONSTRAINT [IX_Things] UNIQUE CLUSTERED (RowID)
)
But you'll notice that my table already contains a timestamp column, which is guaranteed to be monotonically increasing. This is exactly the characteristic I want for a candidate cluster key.
So I cluster the table on the rowversion (aka timestamp) column:
CREATE TABLE Things (
...
[timestamp] timestamp,
CONSTRAINT [IX_Things] UNIQUE CLUSTERED (timestamp)
)
Rather than adding a dummy identity int column (RowID) to ensure an order, I use what I already have.
What I'm looking for are thoughts of why this is a bad idea; and what other ideas are better.
Note: Community wiki, since the answers are subjective.
So I cluster the table on the rowversion (aka timestamp) column:
Rather than adding a dummy identity int column (RowID) to ensure an order, I use what I already have.
That might sound like a good idea at first - but it's really almost the worst option you have. Why?
The main requirements for a clustered key are (see Kim Tripp's blog post for more excellent details):
stable
narrow
unique
ever-increasing if possible
Your rowversion violates the stable requirement, and that's probably the most important one. The rowversion of a row changes with each modification to the row - and since your clustering key is being added to each and every non-clustered index in the table, your server will be constantly updating loads of non-clustered indices and wasting a lot of time doing so.
In the end, adding a dummy identity column probably is a much better alternative for your case. The second-best choice would be the datetime column - but there you run the risk of SQL Server having to add "uniqueifiers" to your entries when duplicates occur, and with 3.33 ms accuracy that could definitely happen. Not optimal, but still much better than the rowversion idea...
From the link for timestamp in the question:
The timestamp syntax is deprecated. This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature.
and
Duplicate rowversion values can be generated by using the SELECT INTO statement in which a rowversion column is in the SELECT list. We do not recommend using rowversion in this manner.
So why on earth would you want to cluster by either, especially since their values always change when the row is updated? Just use an identity as the PK and cluster on it.
You were on the right track already. You can use a datetime column that holds the created date and create a clustered but non-unique index on it:
CREATE TABLE Things (
...
CreatedDate datetime DEFAULT getdate(),
[timestamp] timestamp,
)
CREATE CLUSTERED INDEX [IX_CreatedDate] ON [Things]
(
[CreatedDate] ASC
)
If this table gets a lot of inserts, you might be creating a hot spot that interferes with updates, because all of the inserts will be happening on the same physical/index pages. Check your locking setup.
