Hi my intention is to reduce the bill that we pay for using the snowflakes cloud service.
Part#1
I like to know is there a way to find the unused data(that was not be reference by select statements and other DML) so that i can go ahead and truncate the data.I know we need to pay for truncating the data but thats one time which in turn helps in future storage spendings .
Part#2
How to list down the cluster keys defined in a database ?and how to drop unwanted ones ,so that we can avoid cost during the updates to data in the table that has unwanted cluster key.
The only current way to handle #1 is to use query_history to see what tables are being used. This is obviously difficult since parsing query_text is difficult and if views are used, you will not see which tables are used underneath. Snowflake recently released EXPLAIN, which I believe is the first step towards providing more lineage to tables and what is being queried over time. But, today it isn't easy.
The 2nd part is a bit easier, though. A SHOW TABLES command provides you with the full list of tables for a schema and the cluster keys are included in that query. To drop them, you simply run an ALTER TABLE statement and remove the cluster keys.
Related
The company I work for is building a data mart that wants 7 years of data maintained within it. Unfortunately, one table is well over 1 billion records.
My question is this: what would be the best way to keep this table current? (Daily update or quicker)
I know the MERGE statement is quite beneficial for this but I'm hoping to not have to parse through 1 billion records for each MERGE. Table partitioning is out as we do not have the Enterprise edition of SQL Server.
Any direction would be greatly appreciated :)
there are several options here; two are explained above on the comment.
The answer depends on which operation you want to do on the records.
If you just want to modify your recent records, the best way is to keep those records on an active table and move other records to an archive table as an archive. this way you need a scheduled job to move unnecessary records to the archive table.
If you want also to have a reporting module, you may need to provide an additional table which holds some abstracts of data in such a way that you can extract the reports you want.
You will want to seriously consider splitting the table. For example, see the operational versus archive paradigm.
The first step of splitting the data up from such a massive table is to identify the clustering index (if it has one) and all the other indexes as well, because you will want to avoid operations that cause big rebuilds and data shifting.
Otherwise, if you need to keep the status quo, with good indexes (you should bite the bullet and define one if they have somehow survived this long without one), you can rely on the query optimizer to do a quick index seek (much better than a scan, especially a table scan, which is the concern you seem to have). So just write your MERGE statement and make sure to use indexes in the ON clause (and avoid using functions on indexed columns at all costs!).
Consider a database with several (3-4) tables with a lot of columns (from 15 to 40). In each table we have several thousand records generated per year and about a dozen of changes made for each record.
Right now we need to add a following functionality to our system: every time user makes a change to the record of one of our tables, the system needs to keep track of it - we need to have complete history of changes and also be able to restore row data to selected point.
For some reasons we cannot keep "final" and "historic" data in the same table (so we cannot add some columns to our tables to keep some kind of versioning information, i.e. like wordpress does when it comes to keeping edit history of posts).
What would be best approach to this problem? I was thinking about two solutions:
For each tracked table we have a mirror table with the same columns, and with additional columns where we keep information about versions (i.e. timestamps, id of "original" row etc...)
Pros:
we have data stored exactly in the same way it was in original tables
whenever we need to add a new column to the original table, we can do the same to mirror table
Cons:
we need to create one additional mirror table for each tracked table.
We create one table for "history" revisions. We keep some revisioning information like timestamps etc., and also we keep the track from which table the data originates. But the original data row is being stored in large text column in JSON.
Pros:
we have only one history table for all tracked tables
we don't need to create new mirror tables every time we add new tracked table,
Cons:
there can be some backward compatibility issues while trying to restore data after structure of the original table was changed (i.e. new column was added)
Maybe some other solution?
What would be the best way of keeping the history of versions in such system?
Additional information:
each of the tracked tables can change in the future (i.e. new columns added),
number of tracked tables can change in the future (i.e. new tables added).
FYI: we are using laravel 5.3 and mysql database.
How often do you need access to the auditing data? Is cost of storage ever a concern? Do you need it in the same system that you need the normal data?
Basically, having a table called foo and a second table called foo_log isn't uncommon. It also lets you store foo_log somewhere differently, even possibly a secondary DB. If foo_log is on a spindle disk and foo is on flash, you still get fast reads, but you get somewhat cheaper storage of the backups.
If you don't ever need to display this data, and just need it for legal reasons, or to figure out how something went wrong, the single-table isn't a terrible plan.
But if the issue is backups, which it sounds like it might be, why not just backup the MySQL database on a regular basis and store the backups elsewhere?
Suppose I have a table which contains relevant information. However, the data is only relevant for, let's say, 30 minutes.
After that it's just database junk, so I need to get rid of it asap.
If I wanted, I could clean this table periodically, setting an expiration date time for each record individually and deleting expired records through a job or something. This is my #1 option, and it's what will be done unless someone convince me otherwise.
But I think this solution may be problematic. What if someone stops the job from running and no one notices? I'm looking for something like a built-in way to insert temporary data into a table. Or a table that has "volatile" data itself, in a way that it automagically removes data after x amount of time after its insertion.
And last but not least, if there's no built-in way to do that, could I be able to implement this functionality in SQL server 2008 (or 2012, we will be migrating soon) myself? If so, could someone give me directions as to what to look for to implement something like it?
(Sorry if the formatting ends up bad, first time using a smartphone to post on SO)
As another answer indicated, TRUNCATE TABLE is a fast way to remove the contents of a table, but it's aggressive; it will completely empty the table. Also, there are restrictions on its use; among others, it can't be used on tables which "are referenced by a FOREIGN KEY constraint".
Any more targeted removal of rows will require a DELETE statement with a WHERE clause. Having an index on relevant criteria fields (such as the insertion date) will improve performance of the deletion and might be a good idea (depending on its effect on INSERT and UPDATE statements).
You will need something to "trigger" the DELETE statement (or TRUNCATE statement). As you've suggested, a SQL Server Agent job is an obvious choice, but you are worried about the job being disabled or removed. Any solution will be vulnerable to someone removing your work, but there are more obscure ways to trigger an activity than a job. You could embed the deletion into the insertion process-- either in whatever stored procedure or application code you have, or as an actual table trigger. Both of those methods increase the time required for an INSERT and, because they are not handled out of band by the SQL Server Agent, will require your users to wait slightly longer. If you have the right indexes and the table is reasonably-sized, that might be an acceptable trade-off.
There isn't any other capability that I'm aware of for SQL Server to just start deleting data. There isn't automatic data retention policy enforcement.
See #Yuriy comment, that's relevant.
If you really need to implement it DB side....
Truncate table is fast way to get rid of records.
If all you need is ONE table and you just need to fill it with data, use it and dispose it asap you can consider truncating a (permanent) "CACHE_TEMP" table.
The scenario can become more complicated you are running concurrent threads/jobs and each is handling it's own data.
If that data is just existing for a single "job"/context you can consider using #TEMP tables. They are a bit volatile and maybe can be what you are looking for.
Also you maybe can use table variables, they are a bit more volatile than temporary tables but it depends on things you don't posted, so I cannot say what's really better.
While working on a content management system, I've hit a bit of a wall. Coming back to my data model, I've noticed some issues that could become more prevalent with time.
Namely, I want to maintain a audit trail (change log) of record modification by user (even user record modifications would be logged). Due to the inclusion of an arbitrary number of modules, I cannot use a by-table auto incrementation field for my primary keys, as it will inevitably cause conflicts while attempting to maintain their keys in a single table.
The audit trail would keep records of user_id, record_id, timestamp, action (INSERT/UPDATE/DELETE), and archive (a serialized copy of the old record)
I've considered a few possible solutions to the issue, such as generating a UUID primary key in application logic (to ensure cross database platform compatibility).
Another option I've considered (and I'm sure the consensus will be negative for even considering this method) is, creating a RecordKey table, to maintain a globally auto-incremented key. However, I'm sure there are far better methods to achieve this.
Ultimately, I'm curious to know of what options I should consider in attempting to implement this. For example, I intend on permitting (to start at least) options for MySQL and SQLite3 storage, but I'm concerned about how each database would handle UUIDs.
Edit to make my question less vague: Would using global IDs be a recommended solution for my problem? If so, using a 128 bit UUID (application or database generated) what can I do in my table design that would help maximize query efficiency?
Ok, you've hit a brick wall. And you realise that actually the db design has problems. And you are going to keep hitting this same brick wall many times in the future. And your future is not looking bright. And you want to change that. Good.
But what you have not yet done is, figure what the actual cause of this is. You cannot escape from the predictable future until you do that. And if you do that properly, there will not be a brick wall, at least not this particular brick wall.
First, you went and stuckIdiot columns on all the tables to force uniqueness, without really understanding the Identifiers and keys that used naturally to find the data. That is the bricks that the wall is made from. That was an unconsidered knee-jerk reaction to a problem that demanded consideration. That is what you will have to re-visit.
Do not repeat the same mistake again. Whacking GUIDs or UUIDs, or 32-byteIdiot columns to fix yourNUMERIC(10,0) Idiot columns will not do anything, except make the db much fatter, and all accesses, especially joins, much slower. The wall will be made of concrete blocks and it will hit you every hour.
Go back and look at the tables, and design them with a view to being tables, in a database. That means your starting point is No Surrrogate Keys, noIdiot columns. When you are done, you will have very fewId columns. Not zero, not all tables, but very few. Therefore you have very few bricks in the wall. I have recently posted a detailed set of steps required, so please refer to:
Link to Answer re Identifiers
What is the justification of having one audit table containing the audit "records" of all tables ? Do you enjoy meeting brick walls ? Do you want the concurrency and the speed of the db to be bottlenecked on the Insert hot-spot in one file ?
Audit requirements have been implemented in dbs for over 40 years, so the chances of your users having some other requirement that will not change is not very high. May as well do it properly. The only correct method (for a Rdb) for audit tables, is to have one audit table per auditable real table. The PK will be the original table PK plus DateTime (Compound keys are normal in a modern database). Additional columns will be UserId and Action. The row itself will be the before image (the new image is the single current row in the main table). Use the exact same column names. Do not pack it into one gigantic string.
If you do not need the data (before image), then stop recording it. It is a very silly to be recording all that volume for no reason. Recovery can be obtained from the backups.
Yes, a single RecordKey table is a monstrosity. And yet another guaranteed method of single-threading the database.
Do not react to my post, I can already see from your comments that you have all the "right" reasons for doing the wrong thing, and keeping your brick walls intact. I am trying to help you destroy them. Consider it carefully for a few days before responding.
How about keeping all the record_id local to each table, and adding another column table_name (to the audit table) to make for a composite key?
This way you can also easily filter your audit log by table_name (which will be tricky with arbitrary UUID or sequence numbers). So even if you do not go with this solution, consider adding the table_name column anyway for the sake of querying the log later.
In order to fit the record_id from all tables into the same column, you would still need to enforce that all tables use the same data type for their ids (but it seems like you were planning to do that anyway).
A more powerful scheme is to create an audit table that mirrors the structure of each table rather than put all the audit trail into one place. The "shadow" table model makes it easier to query the audit trail.
DBA (with only 2 years of google for training) has created a massive data management table (108 columns and growing) containing all neccessary attribute for any data flow in the system. Well call this table BFT for short.
Of these columns:
10 are for meta-data references.
15 are for data source and temporal tracking
1 instance of new/curr columns for textual data
10 instances of new/current/delta/ratio/range columns for multi-value numeric updates
:totaling 50 columns.
Multi valued numeric updates usually only need 2-5 of the update groups.
Batches of 15K-1500K records are loaded into the BFT and processed by stored procs with logic to validate those records shuffle them off to permanent storage in about 30 other tables.
In most of the record loads, 50-70 of the columns are empty through out the entire process.
I am no database expert, but this model and process seems to smell a little, but I don't know enough to say why, and don't want to complain without being able to offer an alternative.
Given this very small insight to the data processing model, does anyone have thoughts or suggestions? Can the database (SQL Server) be trusted to handle records with mostly empty columns efficiently, or does processing in this manner wasted lots of cycles/memory,etc.
Sounds like he reinvented BizTalk.
I typically have multiple staging tables corresponding to the input loads. These may or may not correspond to the destination tables, but we don't do what you're talking about. If he doesn't like to have a lot of what are basically temporary work tables, they could be put into their own schema or even a separate database.
As far as the columns which are empty, if they aren't referenced in the particular query which is processing BFT it doesn't matter - HOWEVER, what will happen is that the indexing becomes much more crucial that the index chosen is a non-clustered covering index. When your BFT is used and a table scan or clustered index scan is chosen, the unused column have to be read and ignored or skipped, and this definitely seems to affect processing in my experience. Whereas with a non-clustered index scan or seek, less columns are read, and hopefully this doesn't include (m)any of the unused columns.
Normalization is the keyword here. If you have so many NULL values, chances are high that you're wasting a lot of space. Normalizing the table should also make data integrity in this table easier to enforce.
One thing that might make things a little more flexible (other than normalizing) could be to create one or more views or table functions to present the data. Particularly if the table is outside your control, these would enable you to filter the spurious crap out and grab only what you need from the table.
However, if you're going to be one of the people who will be working with (and frowning every time you have to crack open) that massive table, you might want to trump the DBA's "design" and normalize that beast, and maybe give the DBA the task of creating some views and/or table functions to help you out.
I currently work with a similar but not so huge table which has been around on our system for years and has had new fields and indices and constraints rather hastily tacked on Frankenstein-style. Unfortunately some other workgroups rely on the structure as gospel, so we've created such views and functions to enable us to "shape" the data the way we need it.