If a key is present (or has a value) only 40% of the time in a JSON document, should I still create a column for that key in Postgres? The online literature says to use a normalized column when the value for a key is always present, but I couldn't find any guidance on what to do when a key is filled only most of the time, or what the threshold is. I am trying to work out when to prefer JSONB over a column. In other words, how often does a key need to have a value to justify getting its own column?
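For concreteness, here is a minimal sketch of the two layouts I am weighing (table, column, and key names are made up for illustration):

-- Option 1: keep everything in JSONB and index the sparse key with a partial expression index
CREATE TABLE events (
    id      bigserial PRIMARY KEY,
    payload jsonb NOT NULL
);
CREATE INDEX events_discount_code_idx
    ON events ((payload ->> 'discount_code'))
    WHERE payload ? 'discount_code';

-- Option 2: promote the key to a nullable column (NULL for the ~60% of rows that lack it)
-- and keep the remaining keys in JSONB
CREATE TABLE events_promoted (
    id            bigserial PRIMARY KEY,
    discount_code text NULL,
    payload       jsonb NOT NULL
);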
I have a dilemma choosing a primary key between a column of type TIMESTAMP and introducing a surrogate key as an identity column.
I have created a small app in which I use the .NET SqlBulkCopy class to copy data from a source table to a destination table.
The app does not perform well when the primary key is a column of type TIMESTAMP, and it performs well when the primary key is an identity column.
This holds until the number of records I have to copy reaches some magic number. After that, both variants perform well.
I do not know what the reason for the slow performance might be when the primary key is a TIMESTAMP column.
Any ideas?
I have tried to find additional resources about the TIMESTAMP type.
All I have found is that the value increases monotonically and that the type occupies 8 bytes.
The type is deprecated (or not, depending on which documentation you read).
I would like a logical explanation of why the app does not perform well when the PK is of type TIMESTAMP.
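For reference, a simplified sketch of the two destination-table variants I am comparing (the names here are invented and my real schema is larger; in SQL Server, TIMESTAMP is a deprecated synonym for ROWVERSION):

-- Variant A: primary key on the TIMESTAMP/ROWVERSION column
CREATE TABLE dbo.Destination_Timestamp (
    Ts      rowversion NOT NULL,
    Payload nvarchar(200) NULL,
    CONSTRAINT PK_Destination_Timestamp PRIMARY KEY (Ts)
);

-- Variant B: primary key on an identity column
CREATE TABLE dbo.Destination_Identity (
    Id      int IDENTITY(1,1) NOT NULL,
    Payload nvarchar(200) NULL,
    CONSTRAINT PK_Destination_Identity PRIMARY KEY (Id)
);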
Is it a good idea to generate a dimension id from an alpha-numeric character combination instead of an integer in a Snowflake data warehouse? (https://www.snowflake.com/) For example: let's say I have to build a dimension table from a source table with a combination of 3 keys. Normally we build an incrementing integer surrogate key as the dimension id. Instead, is it better to create a string column key1_key2_key3 (the concatenated source keys) as the surrogate key for the dimension id? Since Snowflake is a distributed database and performs well, I feel this should be okay. I'm trying to see whether there is any unforeseen impact.
What it seems like you are asking is: should you use a surrogate key (a monotonically increasing integer) or a concatenation of the business keys as the primary key in your dimension?
Apart from the storage and performance benefits of using a surrogate key you also need to consider the main reason for using surrogate keys - slowly changing dimensions. If you decide to track the changes to your dimension records at some point you'll want to use surrogate keys in your dimensions since the concatenation of your business keys will duplicate over time.
I would create the dimension id as an integer and add the concatenated key as another column. That way you follow the standard and have an integer key like all the other dimension tables. If you think the concatenated key will be meaningful and will be used in joins/filters, feel free to add it as well.
My point is that keeping the dimension id as an integer in that particular dimension table will prevent you from deviating from best practices.
This link explains when and where using a surrogate key makes sense.
https://www.kimballgroup.com/1998/05/surrogate-keys/
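As a rough sketch of that layout (hypothetical table and column names, including the source table src_products), the integer dimension id stays the join key and the concatenated business keys are kept as an ordinary column:

-- Snowflake: integer surrogate dimension id plus the concatenated business keys
CREATE TABLE dim_product (
    product_dim_id INTEGER IDENTITY(1,1),   -- surrogate dimension id used in fact-table joins
    key1           VARCHAR,
    key2           VARCHAR,
    key3           VARCHAR,
    business_key   VARCHAR,                 -- key1_key2_key3, kept for lookups if it is useful
    product_name   VARCHAR
);

INSERT INTO dim_product (key1, key2, key3, business_key, product_name)
SELECT key1, key2, key3, key1 || '_' || key2 || '_' || key3, product_name
FROM src_products;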
I have googled this many times but didn't find an exact explanation.
I am working on complex database structures (in Oracle 10g) where I hardly ever have a primary key on a single column, except for the static tables.
Now my question: consider a composite primary key (LXI, VCODE, IVID, GHID). Since it's a primary key, Oracle will create an index for it by default.
Will I get ONE (system-generated) index for the primary key as a whole, or will its individual columns be indexed as well?
I am asking because I am retrieving data (millions of records) based on individual columns as well. If the system generates indexes for the individual columns too, why does my query run faster than it does when I explicitly define an index on each individual column?
Please give a satisfactory answer
Thanks in advance
A primary key is a non-NULL unique key. In your case, the unique index has four columns, LXI, VCODE, IVID, GHID, in the order of declaration.
If you have a condition on VCODE but not on LXI, then most databases would not use the index. Oracle has a special type of index scan called the "skip scan", which allows for this very situation. It is described in the documentation.
I would expect an index skip scan to be a bit slower than an index range scan on individual columns. However, which is better might also depend on the complexity of the where clause. For instance, three equality conditions on VCODE, IVID and GHID connected by AND might be a great example for the skip scan. And, such an index would cover the WHERE clause -- a great efficiency -- and better than one-column indexes.
As a note: index skip scans were introduced in Oracle 9i, so they are available in Oracle 10.
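A sketch of the situation (using the column names from the question; the table name and extra column are invented), with the access paths you would expect:

CREATE TABLE orders (
    lxi    NUMBER,
    vcode  NUMBER,
    ivid   NUMBER,
    ghid   NUMBER,
    amount NUMBER,
    CONSTRAINT orders_pk PRIMARY KEY (lxi, vcode, ivid, ghid)
);

-- Predicate on the leading column: an ordinary index range scan on ORDERS_PK.
SELECT * FROM orders WHERE lxi = 1001;

-- No predicate on LXI: the optimizer may still use ORDERS_PK via an index skip scan.
EXPLAIN PLAN FOR
    SELECT * FROM orders WHERE vcode = 7 AND ivid = 3 AND ghid = 9;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);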
It will not generate an index for each individual column; it will generate a single composite index.
The index is built first on LXI, then on the next column, and so on, in a tree structure.
If you search on the first column of the primary key, the query can use the index. To use the index for the second column, you have to combine it with the first column. For example:
SELECT ... WHERE LXI = ?                 -- will use the PK index
SELECT ... WHERE LXI = ? AND VCODE = ?   -- will also use the PK index
SELECT ... WHERE VCODE = ?               -- will not use it (no condition on LXI)
I'm looking at Amazon's DynamoDB as it looks like it takes away all of the hassle of maintaining and scaling your database server. I'm currently using MySQL, and maintaining and scaling the database is a complete headache.
I've gone through the documentation and I'm having a hard time trying to wrap my head around how you would structure your data so it could be easily retrieved.
I'm totally new to NoSQL and non-relational databases.
From the Dynamo documentation it sounds like you can only query a table on the primary hash key, and the primary range key with a limited number of comparison operators.
Or you can run a full table scan and apply a filter to it. The catch is that it will only scan 1 MB at a time, so you'd likely have to repeat your scan to find X number of results.
I realize these limitations allow them to provide predictable performance, but it seems like it makes it really difficult to get your data out. And performing full table scans seems like it would be really inefficient, and would only become less efficient over time as your table grows.
For instance, say I have a Flickr clone. My Images table might look something like:
Image ID (Number, Primary Hash Key)
Date Added (Number, Primary Range Key)
User ID (String)
Tags (String Set)
etc
So using query I would be able to list all images from the last 7 days and limit it to X number of results pretty easily.
But if I wanted to list all images from a particular user I would need to do a full table scan and filter by username. Same would go for tags.
And because you can only scan 1 MB at a time you may need to do multiple scans to find X number of images. I also don't see a way to easily stop at X number of images. If you're trying to grab 30 images, your first scan might find 5, and your second may find 40.
Do I have this right? Is it basically a trade-off? You get really fast predictable database performance that is virtually maintenance free. But the trade-off is that you need to build way more logic to deal with the results?
Or am I totally off base here?
Yes, you are correct about the trade-off between performance and query flexibility.
But there are a few tricks to reduce the pain - secondary indexes/denormalising probably being the most important.
You would have another table keyed on user ID, listing all their images, for example. When you add an image, you update this table as well as adding a row to the table keyed on image ID.
You have to decide what queries you need, then design the data model around them.
I think you need to create your own secondary index, using another table.
This table "schema" could be:
User ID (String, Primary Key)
Date Added (Number, Range Key)
Image ID (Number)
--
That way you can query by User ID and filter by Date as well
You can use a composite hash-range key as the primary index.
From the DynamoDB Page:
A primary key can either be a single-attribute hash key or a composite
hash-range key. A single attribute hash primary key could be, for
example, “UserID”. This would allow you to quickly read and write data
for an item associated with a given user ID.
A composite hash-range key is indexed as a hash key element and a
range key element. This multi-part key maintains a hierarchy between
the first and second element values. For example, a composite
hash-range key could be a combination of “UserID” (hash) and
“Timestamp” (range). Holding the hash key element constant, you can
search across the range key element to retrieve items. This would
allow you to use the Query API to, for example, retrieve all items for
a single UserID across a range of timestamps.
In tables where you need only one column as the key, and the values in that column can be integers, when shouldn't you use an identity field?
Conversely, for the same table and column, when would you generate its values manually rather than use an autogenerated value for each record?
I guess it would be the case when there are lots of inserts and deletes on the table. Am I right? What other situations could there be?
If you have already settled on the surrogate side of the Great Primary Key Debacle then I can't find a single reason not to use identity keys. The usual alternatives are GUIDs (which have many disadvantages, primarily their size and randomness) and application-layer generated keys. But creating a surrogate key in the application layer is a little bit harder than it seems, and it also does not cover non-application data access (i.e. batch loads, imports, other apps etc.). The one special case is distributed applications, where GUIDs and even sequential GUIDs may offer a better alternative to site id + identity keys.
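To make the two usual surrogate choices concrete, a minimal sketch (hypothetical table and column names; SQL Server syntax assumed):

-- Identity surrogate key
CREATE TABLE dbo.OrdersIdentity (
    OrderId     int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerRef nvarchar(50) NULL
);

-- Sequential GUID surrogate key (an option when rows are generated at multiple sites)
CREATE TABLE dbo.OrdersGuid (
    OrderId     uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID() PRIMARY KEY,
    CustomerRef nvarchar(50) NULL
);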
I suppose if you are creating a many-to-many linking table, where both fields are foreign keys, you don't need an identity field.
Nowadays I imagine that most ORMs expect there to be an identity field in every table. In general, it is a good practice to provide one.
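For example, a sketch of the many-to-many case mentioned above (hypothetical tables), where the two foreign keys together form the primary key and no identity column is needed:

CREATE TABLE Student (StudentId int IDENTITY(1,1) PRIMARY KEY, Name  nvarchar(100) NOT NULL);
CREATE TABLE Course  (CourseId  int IDENTITY(1,1) PRIMARY KEY, Title nvarchar(100) NOT NULL);

-- Linking table: the composite of the two foreign keys is the primary key
CREATE TABLE StudentCourse (
    StudentId int NOT NULL REFERENCES Student(StudentId),
    CourseId  int NOT NULL REFERENCES Course(CourseId),
    PRIMARY KEY (StudentId, CourseId)
);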
I'm not sure I understand enough about your context, but I interpret your question to be:
"If I need the database to create a unique column (for whatever reason), when shouldn't it be a monotonically increasing integer (identity) column?"
In those cases, there's no reason to use anything other than the facility provided by the DBMS for the purpose; in your case (SQL Server?) that's an identity.
Except:
If you'll ever need to merge the table with data from another source, use a GUID, which will prevent key collisions.
If you need to merge databases it's a lot easier if you don't have to regenerate keys.
One case where you would not want an identity field is a one-to-one relationship. The secondary table would have as its primary key the same value as the primary table. The only reason to have an identity field in that situation would seem to be to satisfy an ORM.
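A sketch of that one-to-one pattern (hypothetical tables), where the secondary table reuses the primary table's key as both its primary key and a foreign key:

CREATE TABLE Person (
    PersonId int IDENTITY(1,1) PRIMARY KEY,
    Name     nvarchar(100) NOT NULL
);

-- One-to-one: no identity column, the PK is also the FK to Person
CREATE TABLE PersonPassport (
    PersonId       int NOT NULL PRIMARY KEY REFERENCES Person(PersonId),
    PassportNumber nvarchar(20) NOT NULL
);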
You cannot (normally) specify values when inserting into identity columns, so for example if the column "id" was specified as an identity, the following SQL would fail:
INSERT INTO MyTable (id, name) VALUES (1, 'Smith')
In order to perform this sort of insert you need to turn IDENTITY_INSERT on for that table - it is not intended to be on normally, and it can only be on for one table in a session at any point in time.
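For example, assuming a table like the one below, the insert can be made to work by wrapping it in SET IDENTITY_INSERT:

CREATE TABLE MyTable (
    id   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    name nvarchar(50) NULL
);

SET IDENTITY_INSERT MyTable ON;
INSERT INTO MyTable (id, name) VALUES (1, 'Smith');   -- now succeeds with an explicit id
SET IDENTITY_INSERT MyTable OFF;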
If I need a surrogate, I would either use an IDENTITY column or a GUID column depending on the need for global uniqueness.
If there is a natural primary key, or the primary key is defined as a unique combination of other foreign keys, then I typically do not have an IDENTITY, nor do I use it as the primary key.
There is an exception: snapshot configuration tables that I track with an audit trigger. In this case there is usually a logical "primary key" (typically the date of the snapshot plus the natural key of the row - like a cost center or GL account number for which the row is a configuration record). But instead of using that natural "primary key" as the primary key, I add an IDENTITY, make it the primary key, and put a unique index or constraint on the date and natural key. Although in theory the date and natural key shouldn't change in these tables, if a user changes them instead of adding a new row and deleting the old one, I want the audit (which reflects a change to a row identified by its primary key) to really reflect a change in the row - not the disappearance of one key and the appearance of another.
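A sketch of that pattern (hypothetical names): the IDENTITY is the primary key the audit trigger keys on, while the logical key is enforced with a unique constraint:

CREATE TABLE CostCenterConfigSnapshot (
    SnapshotRowId  int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate PK referenced by the audit trail
    SnapshotDate   date          NOT NULL,
    CostCenterCode nvarchar(20)  NOT NULL,                  -- natural key of the row
    BudgetAmount   decimal(18,2) NULL,
    CONSTRAINT UQ_CostCenterConfigSnapshot_Logical UNIQUE (SnapshotDate, CostCenterCode)
);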
I recently implemented a Suffix Trie in C# that could index novels and then allow searches to be done extremely fast, in time linear in the length of the search string. Part of the requirements (this was a homework assignment) was to use offline storage, so I used MS SQL and needed a structure to represent a Node in a table.
I ended up with the following structure: NodeID, Character, ParentID, etc., where NodeID was the primary key.
I didn't want this to be an autoincrementing identity column, for two main reasons:
How do I get the value of a NodeID after I add it to the database/data table?
I wanted more control when it came to generating my own IDs.
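A sketch of the node table (column types are my assumption), with NodeID supplied by the application rather than generated by the database:

CREATE TABLE TrieNode (
    NodeID    int      NOT NULL PRIMARY KEY,   -- generated by the application, not IDENTITY
    Character nchar(1) NULL,                   -- NULL for the root node
    ParentID  int      NULL REFERENCES TrieNode(NodeID)
);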