Is there an option to force partitions on a Snowflake table - snowflake-cloud-data-platform

I want to look at creating partitions on Snowflake tables and forcing the end user to use them in a query. The intention is to prevent accidental huge queries on large tables.
Is that possible?

Snowflake does not support "traditional" partitions; it has "micro-partitions", which are created automatically. Please read the following document to understand Snowflake's micro-partitions:
https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html#what-are-micro-partitions
Queries will always try to prune micro-partitions based on the query filters, but there is no way to force your users to specify a filter condition or partition.
To prevent accidental high credit usage, you may consider the "resource monitor" feature. You can create a smaller warehouse, set a resource limit, and assign this warehouse to your users.
https://docs.snowflake.com/en/user-guide/resource-monitors.html#creating-resource-monitors
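As a rough sketch of that setup (the monitor name, credit quota, thresholds, and warehouse name below are invented for illustration; creating resource monitors requires the ACCOUNTADMIN role):
-- Cap the credits the reporting warehouse may burn per billing cycle
create resource monitor reporting_monitor with
  credit_quota = 10
  triggers
    on 90 percent do notify
    on 100 percent do suspend;

-- Attach the monitor to the (smaller) warehouse the end users are given
alter warehouse reporting_wh set resource_monitor = reporting_monitor;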

No, you can't create partitions manually in Snowflake; micro-partitions are created automatically, based on when the data arrives rather than on what it contains. You can, however, use clustering keys to order the data within and across micro-partitions, which helps prune micro-partitions when a query is executed.
You could force users to put predicates on their queries by using a BI tool, or by parameterising your (secure?) views with a SQL session variable and restricting access to the underlying tables. This would mean your users would need to set their session variables correctly somehow, which is probably a bit too restrictive in my opinion, but here is an example:
-- Set variable before creating the view
set predicate_var = 'Mid Atlantic';

-- Create view using the variable in the where condition
create or replace view test_db.public.parameterised_view as
select *
from "SAMPLE_DATA"."TPCDS_SF100TCL"."CALL_CENTER"
where CC_NAME = $predicate_var;

-- Select from the view. Only records with 'Mid Atlantic' are returned
select CC_CALL_CENTER_SK, CC_NAME from test_db.public.parameterised_view;

-- Results:
-- |-------------------+--------------|
-- | CC_CALL_CENTER_SK | CC_NAME      |
-- |-------------------+--------------|
-- | 2                 | Mid Atlantic |
-- | 3                 | Mid Atlantic |
-- |-------------------+--------------|

-- Change the variable to something different to prove it works
set predicate_var = 'NY Metro';

-- Select from the view to show that the predicate has changed
select CC_CALL_CENTER_SK, CC_NAME from test_db.public.parameterised_view;

-- Results:
-- |-------------------+----------|
-- | CC_CALL_CENTER_SK | CC_NAME  |
-- |-------------------+----------|
-- | 1                 | NY Metro |
-- |-------------------+----------|

Related

How to set up one database to simulate an online database?

There is a simulated database environment that runs like the production environment, but with no data in it; the statistical information comes from the real production environment. The environment is the latest version of OceanBase Community Edition.
Is there a way to make the execution plans depend on the statistical information, so that the execution plan we get in the simulated environment is the same as in the production environment?
That way, the execution plans of SQL statements could be analyzed in the simulated environment to determine whether a statement has a performance problem.
Due to the nature of our work, we cannot obtain execution plans from the production environment.
At the moment, in our tests, OceanBase's execution plans depend mainly on the estimated number of rows in the storage layer, not on the statistical information.
By the way, this usage scenario worked when we used Oracle before.
Any suggestions on how to set up a database that simulates the online database?
Thanks.
You can use DBMS_STATS.EXPORT_... routines to copy all statistics for a table, schema, or even database from one database to another. (You would then import them with DBMS_STATS.IMPORT...)
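A rough sketch of that flow, assuming your OceanBase Oracle-mode tenant exposes these standard DBMS_STATS procedures (the APP schema and the STATS_STAGE staging-table name are invented for illustration):
-- On the production database: stage the schema statistics in a transportable table
BEGIN
  DBMS_STATS.CREATE_STAT_TABLE(ownname => 'APP', stattab => 'STATS_STAGE');
  DBMS_STATS.EXPORT_SCHEMA_STATS(ownname => 'APP', stattab => 'STATS_STAGE');
END;
/

-- Copy APP.STATS_STAGE to the simulated database (data pump, DB link, CSV, ...), then load it:
BEGIN
  DBMS_STATS.IMPORT_SCHEMA_STATS(ownname => 'APP', stattab => 'STATS_STAGE');
END;
/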
The issue here is that whilst you'll probably then get identical execution plans across the two databases, I'm not sure what benefit that yields.
For example, let's say you see this on your empty database:
SQL> select * from emp
  2  where ename = 'JOE';

Execution Plan
--------------------------------------------------
Plan hash value: 3956160932

--------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes |
--------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |    37 |
|*  1 |  TABLE ACCESS FULL| EMP  |     1 |    37 |
--------------------------------------------------
(or even your production database for that matter).
Is that a good plan? A bad plan? It's really impossible to say in the majority of cases. Copying stats can be a good technique when you have comparable data sizes between databases. But if one is empty, I don't see a lot of benefit.

Subscriber-only trigger

There are two servers. The first is the ERP system in production. The second is the BI server for heavy analytical queries. We update the BI server on a daily basis via backups. However, that isn't enough: some users want to see their data changes more often than the next day. I have no access to the ERP server and can't do anything except ask for backups or replication.
Before starting to ask for replication, I want to understand whether it's possible to use triggers on the subscriber side to process only the changed data rather than all of it. There is an ETL process that makes some queries faster (indexing, transformation, etc.). Triggers should do the trick, but I can't find a way to use them on the subscriber side only. The ERP system doesn't allow any changes at the DB level, so the subscriber database seems to be the right place for triggers (they don't affect the ERP server's performance). Nonetheless, I can't find a way to set them up. Processing all the data is an insane overhead.
Use case: Simplified example, say, we have two replicated tables:
+------------+-------------+--------+
| dt         | customer_id | amount |
+------------+-------------+--------+
| 2017-01-01 | 1           | 234    |
| 2017-01-02 | 1           | 123    |
+------------+-------------+--------+

+------------+-------------+------------+------------+
| manager_id | customer_id | date_from  | date_to    |
+------------+-------------+------------+------------+
| 1          | 1           | 2017-01-01 | 2017-01-02 |
| 2          | 1           | 2017-01-02 | null       |
+------------+-------------+------------+------------+
I need to transform them into the following indexed table:
+----------+-------------+------------+--------+
| dt_id    | customer_id | manager_id | amount |
+----------+-------------+------------+--------+
| 20170101 | 1           | 1          | 234    |
| 20170102 | 1           | 2          | 123    |
+----------+-------------+------------+--------+
So I created yet another database where I store the table above. Now, in order to update that table, I have to truncate it and re-insert all the data again. I could join the tables to check the diffs, but that is too heavy for big tables as well. A trigger helps to track only the changing records. The first input table can use a trigger:
create trigger SaleInsert
on Table1
after insert
as
begin
    insert into NewDB..IndexedTable
    select
        -- some transformation of the inserted and joined columns
    from inserted
    left join Table2
        on inserted.customer_id = Table2.customer_id
        and inserted.dt >= Table2.date_from
        and (inserted.dt < Table2.date_to or Table2.date_to is null)
end
The same idea applies to update and delete, and a similar approach to the second table. I could get an automatically updated DWH with only a small lag. Yes, I expect a performance hit on heavily loaded databases, but theoretically it should work smoothly on servers with the same configuration.
But, again, there are no triggers on the subscriber side only. Any ideas, alternatives?
MS SQL Server has a "Change Tracking" feature that may be of use to you. You enable change tracking on the database and configure which tables you wish to track. SQL Server then creates change records for every update, insert, and delete on a table, and lets you query for the changes made to records since the last time you checked. This is very useful for syncing changes and is more efficient than using triggers. It's also easier to manage than building your own tracking tables. This has been a feature since SQL Server 2008.
How to: Use SQL Server Change Tracking
Change tracking only captures the primary keys of the tables and lets you query which columns might have been modified. You then query the tables, joining on those keys, to get the current data. If you want the changed data itself captured as well, you can use Change Data Capture, but it requires more overhead and at least SQL Server 2008 Enterprise Edition.
Change Data Capture
The general process is:
Get the current sync version
Get the last sync version you used to get changes
Get all the primary keys of the tables that have changed since that last version (inserts, updates, and deletes).
Join the keys with the data and pull down the data (if not using change capture).
Save the data and the current sync version (from the first step).
Then you repeat this process whenever you want to subscribe to the next set of changes. SQL Server does all the magic behind the scenes for you of storing the changes and the versioning. You also might want to look into Snapshot Isolation... it works well with it. The linked article has more information about that.
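A minimal sketch of that loop, assuming change tracking can be enabled on the database holding the tables you read from (the BI_Stage database, the dbo.Sales table with primary key (dt, customer_id), and the dbo.SyncStatus bookkeeping table are all invented names):
-- One-time setup
ALTER DATABASE BI_Stage
    SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.Sales ENABLE CHANGE_TRACKING;

-- On each sync run:
DECLARE @current_version bigint, @last_sync_version bigint;
SET @current_version = CHANGE_TRACKING_CURRENT_VERSION();
SELECT @last_sync_version = last_version FROM dbo.SyncStatus;

-- Primary keys + operation for everything changed since the last run, joined back for the current data
SELECT ct.dt, ct.customer_id, ct.SYS_CHANGE_OPERATION, s.amount
FROM CHANGETABLE(CHANGES dbo.Sales, @last_sync_version) AS ct
LEFT JOIN dbo.Sales AS s
    ON s.dt = ct.dt AND s.customer_id = ct.customer_id;

-- Remember where we got to, for the next run
UPDATE dbo.SyncStatus SET last_version = @current_version;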
This answer is somewhat roundabout, but given your tight limitations perhaps you'll consider it.
First, go for replication as you seem to have decided. You mentioned creating yet another database but were stuck on how to create triggers to populate it. The answer lies in the ability to run post-snapshot scripts: when creating the replication publication, the DBA can specify a script to run on the Subscriber after the snapshot is applied.
You can have the script create all the triggers you require.
Also, to prevent replication from overwriting your triggers with "no trigger" (as defined in the ERP database), the DBA will need to verify that the Copy user triggers property is set to False for each table on which you have triggers.
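A hedged sketch of what the DBA-side setup could look like (the publication name and script path are invented; @post_snapshot_script is the documented sp_addpublication parameter for running a script on the Subscriber after the snapshot is applied):
-- One-time: enable the database for publishing
EXEC sp_replicationdboption
    @dbname  = N'ERP',
    @optname = N'publish',
    @value   = N'true';

-- Create the publication and point it at a script full of CREATE TRIGGER statements
EXEC sp_addpublication
    @publication          = N'ERP_to_BI',
    @status               = N'active',
    @post_snapshot_script = N'\\fileserver\repl\create_bi_triggers.sql';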
You cannot get updates from the ERP database until you implement triggers or change tracking on it. If you have no access to the ERP server and can't do anything except ask for backups or replication, the best way is to go for replication.
If you are using the full recovery model on the ERP system database, you could use log shipping to accomplish this. There would still be some level of delay between the production system and the reporting system; if some delay between DML statements being issued on the ERP system and appearing on the reporting system is acceptable, this solution would work. If you need nearly instant access to the data in the reporting system, replication and the associated overhead is your only good option. For more information on configuring log shipping:
https://learn.microsoft.com/en-us/sql/database-engine/log-shipping/about-log-shipping-sql-server

SQL 2005/2008 Database track field change

I do not want Auditing or History tracking.
I have an application that pulls its data from a external source.
We mirror the external tables in our own DB. We allow our users to update the data in our DB.
The next time the system syncs with the external data, we only overwrite fields that we have not changed locally.
Off the top of my head I can think of two ways to do this:
1) Store two rows for each object. The first is the external version; the second links to the external version but only has data in a field if that field has been changed locally.
e.g.
id | parentId | field1  | field2
1  | null     | foo     | bar
2  | 1        | new foo | null
This illustrates what the data would look like when a local user changed field1.
If no change occurred there would only be the first row.
2) Double the number of columns in the table, e.g.
name_external
name_internal
I like option 1 better as it seems like it would provide better separation and make it easier to query and to do in-code comparisons between the two objects. The only downside is that it will result in more rows, but the DB will be pretty small.
Is there any other patterns I could use? Or a reason I shouldn't go with option 1.
I will be using .NET 4 WCF services
Solution
I will go with the two-table answer provided below, and use the following SQL to return a row that has the locally changed fields merged with the original values:
SELECT
    a.[ID],
    isnull(b.[firstName], a.[firstName]),
    isnull(b.[lastName], a.[lastName]),
    isnull(b.[dob], a.[dob]),
    isnull(b.active, a.active)
FROM tbl1 a
LEFT JOIN tbl2 b ON a.[ID] = b.[ID]
In my case the DB will only ever be updated via the UI, and I can ensure people are not allowed to enter NULL as a value; instead I force them to enter a blank string. This lets me sidestep the issue of what happens if a user updates a value to NULL.
There are two issues I can think of if you choose option 1.
Allowing users to update the data means you will either have to write procedures to perform the insert/update/delete statements for them, managing the double row structure, or you have to train all the users to update the table correctly.
The other problem is modelling fields in your table which can be NULL. If you are using NULL to represent that a field has not changed, how can you represent a field changing to NULL?
The second option of doubling the number of columns avoids the complexity of updates and allows you to store NULL values but you may see performance decrease due to the increased row size and therefore the amount of data the server has to move around (without testing it I realise this claim is unsubstantiated but I thought it worth mentioning).
Another suggestion would be to duplicate the tables, perhaps putting them in another schema, holding a snapshot of the data taken just after the sync with the external data source.
I know you are not wanting a full audit; this would simply be a copy of the data at a point in time. You can then allow users to modify the data as they wish, without the complexity of the double row/column, and when you come to re-sync with the external source you can find out which fields have changed.
I found a great article: The shortest, fastest, and easiest way to compare two tables in SQL Server: UNION! describing such a comparison.
Consider having an update trigger that maintains an UpdatedBy field. We have a dedicated user for our imports that no other process is allowed to use, and the field records whether the last change was made by the import. So if anything other than user -1 was the last to update a row, you can easily tell it was changed locally. Then you can use the MERGE statement to insert/update/delete.
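A hedged sketch of that sync (the column names echo the ISNULL query above; the UpdatedBy column, the import user id of -1, and the tbl_external staging table are assumptions):
MERGE tbl1 AS target
USING tbl_external AS source
    ON target.[ID] = source.[ID]
WHEN MATCHED AND target.UpdatedBy = -1 THEN
    -- only overwrite rows whose last change came from the import user
    UPDATE SET firstName = source.firstName,
               lastName  = source.lastName,
               dob       = source.dob,
               active    = source.active
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([ID], firstName, lastName, dob, active, UpdatedBy)
    VALUES (source.[ID], source.firstName, source.lastName, source.dob, source.active, -1)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;   -- optional: drop rows that disappeared from the external source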

Advice on how to scale and improve execution times of a "pivot-based query" on a billion-row table, growing by one million rows a day

Our company is developing an internal project to parse text files.
Those text files are composed of metadata which is extracted using regular expressions.
Ten computers are parsing the text files 24/7 and feeding a high-end Intel Xeon SQL Server 2005 database with the extracted metadata.
The simplified database schema looks like this:
Items
| Id | Name |
|----|--------|
| 1 | Sample |
Items_Attributes
| ItemId | AttributeId |
|--------|-------------|
| 1 | 1 |
| 1 | 2 |
Attributes
| Id | AttributeTypeId | Value |
|----|-----------------|-------|
| 1 | 1 | 500mB |
| 2 | 2 | 1.0.0 |
AttributeTypes
| Id | Name |
|----|---------|
| 1 | Size |
| 2 | Version |
There are many distinct text file types with distinct metadata inside. For every text file we have an Item, and for every extracted metadata value we have an Attribute.
Items_Attributes lets us avoid duplicate Attribute values, which keeps the database size from growing roughly tenfold.
This particular schema allows us to dynamically add new regular expressions and to obtain new metadata from newly processed files, no matter which internal structure they have.
Additionally, this allows us to filter the data and obtain dynamic reports based on the user's criteria. We filter by Attribute and then pivot the result set (http://msdn.microsoft.com/en-us/library/ms177410.aspx). So this example pseudo-SQL query
SELECT FROM Items WHERE Size = #A AND Version = #B
would return a pivoted table like this
| ItemName | Size | Version |
|----------|-------|---------|
| Sample | 500mB | 1.0.0 |
The application has been running for months and performance has degraded terribly, to the point where it is no longer usable. Reports should take no more than 2 seconds, and the Items_Attributes table grows by an average of 10,000,000 rows per week.
Everything is properly indexed and we have spent considerable time analyzing and optimizing query execution plans.
So my question is, how would you scale this in order to decrease report execution times?
We came up with these possible solutions:
Buy more hardware and set up a SQL Server cluster (we need advice on the proper "clustering" strategy).
Use a key/value database like HBase (we don't really know if it would solve our problem).
Use an ODBMS rather than an RDBMS (we have been considering db4o).
Move our software to the cloud (we have zero experience).
Statically generate reports at runtime (we don't really want to).
Static indexed views for common reports (performance is almost the same).
De-normalize the schema (some of our reports involve up to 50 tables in a single query).
Perhaps this white paper by SQL Server CAT team on the pitfalls of Entity-Attribute-Value database model can help: http://sqlcat.com/whitepapers/archive/2008/09/03/best-practices-for-semantic-data-modeling-for-performance-and-scalability.aspx
I'd start by posting the exact table metadata (along with indexing details), the exact query text, and the execution plan.
With your current table layout, a query similar to this:
SELECT FROM Items WHERE Size = #A AND Version = #B
cannot benefit from using a composite index on (Size, Version), since it's impossible to build such an index.
You cannot even build an indexed view, since it would contain a self-join on attributes.
Probably the best decision would be to denormalize the table like this:
id name size version
and create an index on (size, version)
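A hedged sketch of that layout in T-SQL (the table name and column types are illustrative):
CREATE TABLE ItemsDenormalized (
    Id      int           NOT NULL PRIMARY KEY,
    Name    nvarchar(255) NOT NULL,
    Size    nvarchar(50)  NULL,
    Version nvarchar(50)  NULL
);

-- The composite index the original EAV layout could never have
CREATE INDEX IX_ItemsDenormalized_Size_Version
    ON ItemsDenormalized (Size, Version);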
I've worked with such schemas a lot. They never perform well.
The best thing is to just store the data as you need it, in the form:
| ItemName | Size | Version |
|----------|-------|---------|
| Sample | 500mB | 1.0.0 |
Then you don't need to pivot. And BTW, please do not call your original EAV schema "normalized" - it is not normalized.
This looks to me like issuing OLAP queries against a database optimized for OLTP transactions. Not knowing the details, I'd recommend building a separate "data warehouse" optimized for the kind of queries you are doing. That would involve aggregating data (if possible), denormalizing, and working with a database that is a day old or so. You would incrementally update the data each day, or at whatever interval you wish.
Please post the exact DDL and indexes; if you have indexes on the ID columns then your query will result in a scan.
instead of something like this
SELECT FROM Items WHERE Size = #A AND Version = #B
you need to do this
SELECT FROM Items WHERE ID = 1
in other words, you need to grab the text values, find the ids that you are indexing on, and then use those ids in your query to return the results
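A hedged sketch of that two-step approach against the question's simplified schema (the literal lookup values are examples only):
-- Step 1: resolve the text values to surrogate attribute ids
DECLARE @SizeAttrId int, @VersionAttrId int;

SELECT @SizeAttrId = a.Id
FROM Attributes a
JOIN AttributeTypes t ON t.Id = a.AttributeTypeId
WHERE t.Name = 'Size' AND a.Value = '500mB';

SELECT @VersionAttrId = a.Id
FROM Attributes a
JOIN AttributeTypes t ON t.Id = a.AttributeTypeId
WHERE t.Name = 'Version' AND a.Value = '1.0.0';

-- Step 2: filter on the ids, which the Items_Attributes indexes can seek on
SELECT i.Id, i.Name
FROM Items i
JOIN Items_Attributes sa ON sa.ItemId = i.Id AND sa.AttributeId = @SizeAttrId
JOIN Items_Attributes va ON va.ItemId = i.Id AND va.AttributeId = @VersionAttrId;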
It is probably also a good idea to look at the partition function to distribute your data.
Clustering is done for availability, not performance: if one node (the active node) dies, the other (passive) node becomes active. Of course there is also active-active clustering, but that is another story.
A short term fix may be to use horizontal partitioning. I am assuming your largest table is Items_Attributes. You could horizontally partition this table, putting each partition on a separate filegroup on a separate disk controller.
That's assuming you are not trying to report across all ItemIds at once.
You mention 50 tables in a single query. Whilst SQL server supports up to 256 tables in a single, monolithic query, taking this approach reduces the chances of the optimiser producing an efficient plan.
If you are wedded to the schema as it stands, consider breaking your reporting queries down into a series of steps which materialise their results into temporary (#) tables.
This approach enables you to carry out the most selective parts of the query in isolation, and can, in my experience, offer big performance gains. The queries are generally more maintainable too.
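A hedged sketch of the temp-table approach against the question's simplified schema (@SizeAttrId stands in for an attribute id resolved beforehand; the value used here is hypothetical):
DECLARE @SizeAttrId int;
SET @SizeAttrId = 1;  -- hypothetical id, resolved from Attributes/AttributeTypes beforehand

-- Step 1: materialise the most selective filter into a temp table
SELECT ia.ItemId
INTO #size_matches
FROM Items_Attributes ia
WHERE ia.AttributeId = @SizeAttrId;

CREATE CLUSTERED INDEX IX_size_matches ON #size_matches (ItemId);

-- Step 2: run the remaining joins against the much smaller temp table
SELECT i.Id, i.Name
FROM #size_matches m
JOIN Items i ON i.Id = m.ItemId;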
Also (a bit of a long shot, this) you don't say which SQL server version you're on; but if you're on SQL 2005, given the number of tables involved in your reports and the volume of data, it's worth checking that your SQL server is patched to at least SP2.
I worked on an ETL project using tables with rowcounts in the hundreds of millions, where we found that the query optimiser in SQL 2005 RTM/SP1 could not consistently produce efficient plans for queries joining more than 5 tables where one or more of the tables was of this scale. This issue was resolved in SP2.

How to log in an Oracle database?

I am interested in what methods of logging are common in an Oracle database.
Our method is the following:
We create a log table for the table to be logged. The log table contains all the columns of the original table plus some special fields, including a timestamp, the modification type (insert, update, delete), and the modifier's id. A trigger on the original table creates one log row for each insertion and deletion, and two rows for an update. Log rows contain the data before and after the alteration of the original row.
Although the state of the records can be mined back in time using this method, it has some drawbacks:
Introducing a new column in the original table does not automatically involve modifying the log.
Changing the log affects both the log table and the trigger, and it is easy to mess up.
The state of a record at a specific past time cannot be determined in a straightforward way.
...
What other possibilities exist?
What kind of tools can be used to solve this problem?
I only know of log4plsql. What are the pros/cons of this tool?
Edit: Based on Brian's answer I have found the following reference that explains standard and fine grain auditing.
It sounds like you are after 'auditing'. Oracle has a built-in feature called Fine-Grained Auditing (FGA). In a nutshell, you can audit everything or only specific conditions. What is really cool is that you can 'audit' selects as well as transactions. A simple command to get started with auditing:
audit UPDATE on SCOTT.EMP by access;
Think of it as a 'trigger' for select statements. For example, you create policies:
begin
  dbms_fga.add_policy (
    object_schema => 'BANK',
    object_name   => 'ACCOUNTS',
    policy_name   => 'ACCOUNTS_ACCESS'
  );
end;
After you have defined the policy, when a user queries the table in the usual way, as follows:
select * from bank.accounts;
the audit trail records this action. You can see the trail by issuing:
select timestamp,
       db_user,
       os_user,
       object_schema,
       object_name,
       sql_text
from dba_fga_audit_trail;

TIMESTAMP DB_USER OS_USER OBJECT_ OBJECT_N SQL_TEXT
--------- ------- ------- ------- -------- ----------------------
22-OCT-08 BANK    ananda  BANK    ACCOUNTS select * from accounts
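If you only want to audit particular rows or columns, dbms_fga.add_policy also accepts a condition and a column list; a hedged sketch (the BALANCE column and the threshold are invented, the parameter names are standard DBMS_FGA ones):
begin
  dbms_fga.add_policy (
    object_schema   => 'BANK',
    object_name     => 'ACCOUNTS',
    policy_name     => 'ACCOUNTS_LARGE_BALANCE',
    audit_condition => 'BALANCE >= 10000',
    audit_column    => 'BALANCE',
    statement_types => 'SELECT,UPDATE'
  );
end;
/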
Judging from your description, I wonder if what you really need is not a logging mechanism but rather some sort of historical view of the table. If that is the case, then maybe you are better off using some kind of temporal database design (using VALID_FROM and VALID_TO fields). You can also track changes in the database using the Oracle LogMiner tool.
As for your scenario, I would rather store the change data in this kind of schema:
+-------------+-----------------------------------------------------------+
| Column Name | Function                                                  |
+-------------+-----------------------------------------------------------+
| Id          | PRIMARY_KEY value of the SOURCE table                     |
| TimeStamp   | Time stamp of the action                                  |
| User        | User who made the action                                  |
| ActionType  | INSERT, UPDATE, or DELETE                                 |
| OldValues   | All field values from the source table, separated by '|'  |
| NewValues   | All field values from the source table, separated by '|'  |
+-------------+-----------------------------------------------------------+
With this type of logging table, you can easily determine:
The historical change actions of a particular record (using Id)
The state of a specific record at some point in time
Of course, this kind of logging cannot easily determine all valid values of the table at a specific point in time. For that, you need to change your table design to a temporal database design.
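A hedged DDL sketch of the layout described above (Oracle syntax; names and sizes are illustrative, and "User"/"TimeStamp" are renamed to avoid reserved words):
CREATE TABLE source_table_log (
    id          NUMBER                            NOT NULL,  -- PK value of the source row
    log_ts      TIMESTAMP    DEFAULT SYSTIMESTAMP NOT NULL,
    log_user    VARCHAR2(30) DEFAULT USER,
    action_type VARCHAR2(6)  CHECK (action_type IN ('INSERT','UPDATE','DELETE')),
    old_values  CLOB,   -- source fields concatenated with '|'
    new_values  CLOB
);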
In the similar question (How to Audit Database Activity without Performance and Scalability Issues?) the accepted answer mentions the monitoring of database traffic using a network traffic sniffer as an interesting alternative.
log4plsql is a completely different thing; it's for logging debug info from PL/SQL.
For what you want, you need one of the following:
Set up a trigger.
Set up a PL/SQL interface around the tables; CRUD operations happen via this interface, and the interface ensures the log tables are updated.
Set up an interface in your application layer, as with the PL/SQL interface, just higher up.
Oracle 11g contains versioned tables, I have not used this at all though, so can make no real comment.
If you are just interested in knowing what the data looked like in the recent past, you could use Oracle's flashback query functionality to query the data as of a specific time in the past. How far in the past depends on how much disk space you have and how much database activity there is. The bright side of this solution is that new columns are automatically included. The downside is that you can't run a flashback query across past DDL operations.
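For example, a hedged flashback query (the table and column names are illustrative, and it only works while the required undo data is still retained):
select *
from emp as of timestamp (systimestamp - interval '1' hour)
where ename = 'JOE';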
