How to set up one database to simulate an online database?

There is a simulated database environment that runs like the production environment, but it holds no data; the statistical information comes from the real production environment. The environment is the latest version of OceanBase Community Edition.
Is there a way to make the database's execution plans depend on the statistical information, so that the execution plan we get in the simulated environment is the same as in the production environment?
That way, the execution plans of some SQL statements can be analyzed in the simulated environment to determine whether a SQL statement has a performance problem.
Due to the particularity of our work, we cannot obtain execution plans from the production environment.
In our tests so far, OceanBase's execution plans depend mainly on the row counts estimated by the storage layer, not on the statistical information.
By the way, this usage scenario worked for us when we used Oracle before.
Any suggestions on how to set up one database to simulate the online database?
Thanks.

You can use the DBMS_STATS.EXPORT_... routines to copy all statistics for a table, schema, or even an entire database from one database to another. (You would then import them with the DBMS_STATS.IMPORT_... routines.)
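As a rough sketch (Oracle-style DBMS_STATS syntax; the APP schema and the STATS_STAGE staging table are illustrative, and OceanBase's DBMS_STATS package may differ in detail):
-- On the production database: stage and export the schema statistics
begin
  dbms_stats.create_stat_table(ownname => 'APP', stattab => 'STATS_STAGE');
  dbms_stats.export_schema_stats(ownname => 'APP', stattab => 'STATS_STAGE');
end;
-- Copy the STATS_STAGE table to the simulated database (e.g. dump and reload it), then:
begin
  dbms_stats.import_schema_stats(ownname => 'APP', stattab => 'STATS_STAGE');
end;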
The issue here is that whilst you'll probably then get identical execution plans across the two databases, I'm not sure what benefit that yields.
For example, let's say you see this on your empty database:
SQL> select * from emp
2 where ename = 'JOE';
Execution Plan
--------------------------------------------------
Plan hash value: 3956160932
--------------------------------------------------
| Id | Operation | Name | Rows | Bytes |
--------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 37 |
|* 1 | TABLE ACCESS FULL| EMP | 1 | 37 |
--------------------------------------------------
(or even your production database for that matter).
Is that a good plan? A bad plan? It's really impossible to say in the majority of cases. Copying stats can be a good technique when you have comparable data sizes between databases. But if one is empty, I don't see a lot of benefit.

Related

The same SQL query has a different performance and execution plan on two similar databases

We have two PostgreSQL 11.16 databases, running on the same Linux distribution, Red Hat Enterprise Linux Server release 7.9 (Maipo), but on different hosts. The hosts are similar in sizing and the databases hold similar data. When we execute the following query, it takes about 26 seconds on database A and less than 1 second on database B.
select all pdb_personeelsdocument.r_object_id,
       pdb_personeelsdocument.r_object_type "r_object_type",
       pdb_personeelsdocument.i_vstamp,
       dm_repeating.r_aspect_name,
       pdb_personeelsdocument.i_is_replica,
       pdb_personeelsdocument.i_is_reference,
       pdb_personeelsdocument.acl_name
from pdb_personeelsdocument_sp pdb_personeelsdocument,
     pdb_personeelsdocument_rp dm_repeating
where ((pdb_personeelsdocument.r_object_id in
          (select r_object_id from dm_sysobject_r where i_folder_id = '123'))
       and (pdb_personeelsdocument.document_status = 'Goedgekeurd'))
  and (pdb_personeelsdocument.i_has_folder = 1 and pdb_personeelsdocument.i_is_deleted = 0)
  and dm_repeating.r_object_id = pdb_personeelsdocument.r_object_id
order by 1
We looked at the query execution plan (using explain (analyze)) in pgAdmin. On database A we see that the execution plan first performs a nested loop (see attachment), while on database B it doesn't. Furthermore, showing the execution plan graphically on database A doesn't work and gives an empty screen; on database B we get a nice graphical execution plan. Finally, if we remove the 'order by 1' from the query, it performs well on both databases.
Query Execution plans (text):
Database A | https://explain.depesz.com/s/4Wfu#html
Database B | https://explain.depesz.com/s/eRur#html
We tried reindexing all the indexes and ran an analyze on all the tables used in the execution plan, but with no result. Does anyone know what further actions we could take?
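For reference, a rough sketch of the maintenance commands described above, plus a plan capture that also reports buffer usage for comparison between the two hosts (dm_sysobject_r is one of the relations from the query; if it is actually a view in your repository, run the commands against its underlying table instead):
-- Rebuild indexes and refresh planner statistics for one of the relations in the plan
REINDEX TABLE dm_sysobject_r;
ANALYZE dm_sysobject_r;
-- Re-capture the plan with timing and buffer details
EXPLAIN (ANALYZE, BUFFERS)
SELECT r_object_id FROM dm_sysobject_r WHERE i_folder_id = '123';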

Is there an option to force partitions on a Snowflake table

I want to look at creating partitions on Snowflake tables and forcing the end user to use them in a query. The intention is to prevent accidentally huge queries on large tables.
Is that possible?
Snowflake does not support "traditional" partitions; it has "micro-partitions", which are created automatically. Please read the following document to understand Snowflake's micro-partitions:
https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html#what-are-micro-partitions
The queries will always try to prune partitions based on query filters, but there is no way to force your users to specify a filter condition/partition.
To prevent accidental high credit usage, you may consider the "resource monitor" feature. You can create a smaller warehouse, set a resource limit on it, and assign this warehouse to the users.
https://docs.snowflake.com/en/user-guide/resource-monitors.html#creating-resource-monitors
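A minimal sketch of that setup (the warehouse name, monitor name, and limits are illustrative):
-- Suspend the warehouse once 10 credits are consumed; warn at 90%
CREATE RESOURCE MONITOR adhoc_rm WITH CREDIT_QUOTA = 10
  TRIGGERS ON 90 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;
-- A small warehouse for ad-hoc users, tied to the monitor
CREATE WAREHOUSE adhoc_wh WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60;
ALTER WAREHOUSE adhoc_wh SET RESOURCE_MONITOR = adhoc_rm;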
No, you can't create partitions manually in Snowflake; micro-partitions are created automatically based on when the data arrives rather than what the data contains. You can, however, use cluster keys to order the data within and across micro-partitions, which helps with pruning partitions when a query is executed.
You could force users to put predicates on their queries by using a BI tool, or by parameterising your (secure?) views with a SQL variable and restricting access to the underlying tables. Obviously this would mean that your users would need to set their session variables correctly somehow, which is probably a bit too restrictive in my opinion, but here is an example:
-- Set variable before creating the view
set predicate_var = 'Mid Atlantic';
-- Create view using parameter in the where condition
create or replace view test_db.public.parameterised_view as
select
*
from "SAMPLE_DATA"."TPCDS_SF100TCL"."CALL_CENTER"
where CC_NAME = $predicate_var;
-- Select from the view. Only records with 'Mid Atlantic' are returned
select CC_CALL_CENTER_SK, CC_NAME from test_db.public.parameterised_view;
--Results:
--|-------------------+--------------|
--| CC_CALL_CENTER_SK | CC_NAME |
--|-------------------+--------------|
--| 2 | Mid Atlantic |
--| 3 | Mid Atlantic |
--|-------------------+--------------|
-- Change the var to something different to prove it works
set predicate_var = 'NY Metro';
-- Select from the view to show that the predicate has changed
select CC_CALL_CENTER_SK, CC_NAME from test_db.public.parameterised_view;
--Results:
--|-------------------+----------|
--| CC_CALL_CENTER_SK | CC_NAME |
--|-------------------+----------|
--| 1 | NY Metro |
--|-------------------+----------|

Subscriber-only trigger

There are two servers. The first is an ERP system in production. The second is the BI server for heavy analytical queries. We update the BI server on a daily basis via backups. However, it's not enough; some users want to see their data changes more often than the next day. I have no access to the ERP server and can't do anything except ask for backups or replication.
Before starting to ask for replication, I want to understand whether it's possible to use subscriber-side triggers in order to process only the changed data rather than all of it. There is an ETL process to make some queries faster (indexing, transformation, etc.). Triggers should do the trick, but I can't find a way to use them on the subscriber side only. The ERP system doesn't allow any changes at the DB level, so the subscriber database seems to be the right place for triggers (they don't affect the ERP server's performance). Nonetheless, I can't find a way to set them up. Processing all the data is an insane overhead.
Use case: a simplified example. Say we have two replicated tables:
+------------+-------------+--------+
| dt | customer_id | amount |
+------------+-------------+--------+
| 2017-01-01 | 1 | 234 |
| 2017-01-02 | 1 | 123 |
+------------+-------------+--------+
+------------+-------------+------------+------------+
| manager_id | customer_id | date_from | date_to |
+------------+-------------+------------+------------+
| 1 | 1 | 2017-01-01 | 2017-01-02 |
| 2 | 1 | 2017-01-02 | null |
+------------+-------------+------------+------------+
I need to transform them into the following indexed table:
+----------+-------------+------------+--------+
| dt_id | customer_id | manager_id | amount |
+----------+-------------+------------+--------+
| 20170101 | 1 | 1 | 234 |
| 20170102 | 1 | 2 | 123 |
+----------+-------------+------------+--------+
So, I created yet another database where I store the table above. Now, in order to update the table, I have to truncate it and reinsert all the data again. I could join everything to compute the diffs, but that is too heavy for big tables as well. A trigger would help to track only the changing records. The first input table could use a trigger like this:
create trigger SaleInsert
on Table1
after insert
as
begin
    insert into NewDB..IndexedTable
    select
        -- some transformation
    from inserted
    left join Table2
        on inserted.customer_id = Table2.customer_id
       and inserted.dt >= Table2.date_from
       and (inserted.dt < Table2.date_to or Table2.date_to is null) -- open-ended periods
end
The same idea applies to update and delete, and a similar approach works for the second table. I could get an automatically updated DWH with little lag. Yes, I expect some performance impact on heavily loaded databases, but theoretically it should work smoothly on servers with the same configuration.
But, again, there is no way to have triggers on the subscriber side only. Any ideas or alternatives?
MS SQL Server has a "Change Tracking" feature that may be of use to you. You enable the database for change tracking and configure which tables you wish to track. SQL Server then creates change records for every update, insert, and delete on a table, and lets you query for the changes made since the last time you checked. This is very useful for syncing changes and is more efficient than using triggers. It's also easier to manage than building your own tracking tables. Change tracking has been available since SQL Server 2008.
How to: Use SQL Server Change Tracking
Change tracking only captures the primary keys of the tables and lets you query which columns might have been modified. You then join those keys back to the tables to get the current data. If you also want the data itself to be captured, you can use Change Data Capture, but it requires more overhead and at least SQL Server 2008 Enterprise Edition.
Change Data Capture
The general process is:
Get the current sync version
Get the last sync version you used to get changes
Get all the primary keys of the tables that have changed since that last version (inserts, updates, and deletes).
Join the keys with the data and pull down the data (if not using change capture).
Save the data and the current sync version (from the first step).
Then you repeat this process whenever you want to pick up the next set of changes. SQL Server does all the magic of storing the changes and the versioning behind the scenes for you. You might also want to look into Snapshot Isolation; it works well with change tracking. The linked article has more information about that.
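A hedged sketch of that loop on the reporting side (the database name, table, key column, and SyncState bookkeeping table are illustrative, not from the question):
-- One-time setup on the tracked database
ALTER DATABASE BiStage SET CHANGE_TRACKING = ON
    (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);
ALTER TABLE dbo.Sales ENABLE CHANGE_TRACKING;
-- Each sync cycle
DECLARE @current_version bigint = CHANGE_TRACKING_CURRENT_VERSION();
DECLARE @last_version bigint =
    (SELECT last_sync_version FROM dbo.SyncState WHERE table_name = 'Sales');
SELECT ct.sale_id,
       ct.SYS_CHANGE_OPERATION,    -- I, U or D
       s.dt, s.customer_id, s.amount
FROM CHANGETABLE(CHANGES dbo.Sales, @last_version) AS ct
LEFT JOIN dbo.Sales AS s
       ON s.sale_id = ct.sale_id;  -- deleted rows have no match
UPDATE dbo.SyncState
SET last_sync_version = @current_version
WHERE table_name = 'Sales';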
This answer is somewhat roundabout, but given your tight limitations perhaps you'll consider it.
First, go for replication, as you seem to have decided. You mentioned creating yet another database but were stuck on how to create the triggers that populate it. The answer lies in the ability to run post-snapshot scripts: when creating the replication publication, the DBA can specify a script to run on the Subscriber after the snapshot is applied.
You can have that script create all the triggers you require.
Also, to prevent replication from overwriting your triggers with "no trigger" (as defined in the ERP database), the DBA will need to verify that the Copy user triggers property is set to False for each table on which you have triggers.
You cannot get updates from the ERP database until you implement triggers or change tracking on the ERP database. If you have no access to the ERP server and can't do anything except ask for backups or replication, the best way is to go for replication.
If the ERP system database uses the full recovery model, you could use log shipping to accomplish this. There would still be some delay between the production system and the reporting system; if some lag between DML statements being issued on the ERP system and their appearance on the reporting system is acceptable, this solution would work. If you need near-instant access to the data in the reporting system, replication and its associated overhead is your only good option. For more information on configuring log shipping:
https://learn.microsoft.com/en-us/sql/database-engine/log-shipping/about-log-shipping-sql-server

Advice on how to scale and improve execution times of a "pivot-based query" on a billion-row table, increasing by one million rows a day

Our company is developing an internal project to parse text files.
Those text files are composed of metadata, which is extracted using regular expressions.
Ten computers parse the text files 24/7 and feed a high-end Intel Xeon SQL Server 2005 database with the extracted metadata.
The simplified database schema looks like this:
Items
| Id | Name |
|----|--------|
| 1 | Sample |
Items_Attributes
| ItemId | AttributeId |
|--------|-------------|
| 1 | 1 |
| 1 | 2 |
Attributes
| Id | AttributeTypeId | Value |
|----|-----------------|-------|
| 1 | 1 | 500mB |
| 2 | 2 | 1.0.0 |
AttributeTypes
| Id | Name |
|----|---------|
| 1 | Size |
| 2 | Version |
There are many distinct text file types with distinct metadata inside. For every text file we have an Item, and for every extracted metadata value we have an Attribute.
Items_Attributes allows us to avoid duplicate Attribute values, which keeps the database size from growing by roughly an order of magnitude.
This particular schema allows us to dynamically add new regular expressions and to obtain new metadata from newly processed files, no matter which internal structure they have.
Additionally, this allows us to filter the data and to obtain dynamic reports based on the user's criteria. We filter by Attribute and then pivot the result set (http://msdn.microsoft.com/en-us/library/ms177410.aspx). So this example pseudo-SQL query
SELECT FROM Items WHERE Size = #A AND Version = #B
would return a pivoted table like this
| ItemName | Size | Version |
|----------|-------|---------|
| Sample | 500mB | 1.0.0 |
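For concreteness, one way to produce that pivoted shape from the EAV tables above is conditional aggregation (a sketch over the simplified schema rather than the PIVOT operator itself):
SELECT i.Name AS ItemName,
       MAX(CASE WHEN t.Name = 'Size'    THEN a.Value END) AS Size,
       MAX(CASE WHEN t.Name = 'Version' THEN a.Value END) AS Version
FROM Items AS i
JOIN Items_Attributes AS ia ON ia.ItemId = i.Id
JOIN Attributes AS a        ON a.Id = ia.AttributeId
JOIN AttributeTypes AS t    ON t.Id = a.AttributeTypeId
GROUP BY i.Name;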
The application has been running for months and performance has degraded terribly, to the point where it is no longer usable. Reports should take no more than 2 seconds, and the Items_Attributes table grows by an average of 10,000,000 rows per week.
Everything is properly indexed and we have spent considerable time analyzing and optimizing query execution plans.
So my question is, how would you scale this in order to decrease report execution times?
We came up with these possible solutions:
Buy more hardware and set up a SQL Server cluster (we need advice on the proper "clustering" strategy).
Use a key/value database like HBase (we don't really know whether it would solve our problem).
Use an ODBMS rather than an RDBMS (we have been considering db4o).
Move our software to the cloud (we have zero experience).
Statically generate reports at runtime (we don't really want to).
Static indexed views for common reports (performance is almost the same).
De-normalize the schema (some of our reports involve up to 50 tables in a single query).
Perhaps this white paper by the SQL Server CAT team on the pitfalls of the entity-attribute-value database model can help: http://sqlcat.com/whitepapers/archive/2008/09/03/best-practices-for-semantic-data-modeling-for-performance-and-scalability.aspx
I'd start by posting the exact table metadata (along with indexing details), the exact query text, and the execution plan.
With your current table layout, a query similar to this:
SELECT FROM Items WHERE Size = #A AND Version = #B
cannot benefit from a composite index on (Size, Version), since it is impossible to build such an index over the EAV layout.
You cannot even build an indexed view, since it would contain a self-join on Attributes.
Probably the best decision would be to denormalize the table like this:
id name size version
and create an index on (size, version)
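A rough sketch of that denormalized layout (the table name, column types, and index name are assumptions, not from the question):
CREATE TABLE ItemsFlat (
    Id      int          NOT NULL PRIMARY KEY,
    Name    varchar(255) NOT NULL,
    Size    varchar(50)  NULL,
    Version varchar(50)  NULL
);
CREATE INDEX IX_ItemsFlat_Size_Version ON ItemsFlat (Size, Version);
-- The report filter now hits one composite index instead of self-joining Attributes
SELECT Name, Size, Version
FROM ItemsFlat
WHERE Size = '500mB' AND Version = '1.0.0';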
I have worked with such schemas a lot, and they never perform well.
The best thing is to just store the data as you need it, in the form:
| ItemName | Size | Version |
|----------|-------|---------|
| Sample | 500mB | 1.0.0 |
Then you don't need to pivot. And by the way, please do not call your original EAV schema "normalized" - it is not normalized.
This looks to me like issuing OLAP queries against a database optimized for OLTP transactions. Not knowing the details, I'd recommend building a separate "data warehouse" optimized for the kind of queries you are doing. That would involve aggregating data (if possible), denormalization, and accepting a database that is a day old or so. You would incrementally update the data each day, or at whatever interval you wish.
Please post the exact DDL and indexes. If you only have indexes on the ID columns, your query will result in a scan.
Instead of something like this:
SELECT FROM Items WHERE Size = #A AND Version = #B
you need to do this
SELECT FROM Items WHERE ID = 1
In other words, you need to grab the text values, find the IDs you are indexing on, and then use those IDs in your query to return results.
It is probably also a good idea to look at partition functions to distribute your data.
Clustering is done for availability, not performance: if one node (the active node) dies, the other node (the passive node) becomes active. Of course there is also active-active clustering, but that is another story.
A short-term fix may be to use horizontal partitioning. I am assuming your largest table is Items_Attributes. You could horizontally partition this table, putting each partition on a separate filegroup on a separate disk controller.
That's assuming you are not trying to report across all ItemIds at once.
You mention 50 tables in a single query. Whilst SQL Server supports up to 256 tables in a single, monolithic query, taking this approach reduces the chances of the optimiser producing an efficient plan.
If you are wedded to the schema as it stands, consider breaking your reporting queries down into a series of steps that materialise their results into temporary (#) tables.
This approach enables you to carry out the most selective parts of the query in isolation, and can, in my experience, offer big performance gains. The queries are generally more maintainable, too.
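A hedged sketch of that staged approach, using the simplified schema from the question (the attribute type ids and literal values are illustrative):
-- Step 1: resolve the most selective filter into a temp table
SELECT DISTINCT ia.ItemId
INTO #size_matches
FROM Items_Attributes AS ia
JOIN Attributes AS a ON a.Id = ia.AttributeId
WHERE a.AttributeTypeId = 1       -- Size
  AND a.Value = '500mB';
-- Step 2: narrow further by the next attribute
SELECT DISTINCT ia.ItemId
INTO #size_and_version
FROM #size_matches AS s
JOIN Items_Attributes AS ia ON ia.ItemId = s.ItemId
JOIN Attributes AS a        ON a.Id = ia.AttributeId
WHERE a.AttributeTypeId = 2       -- Version
  AND a.Value = '1.0.0';
-- Step 3: the final report only joins the surviving items
SELECT i.Name
FROM #size_and_version AS m
JOIN Items AS i ON i.Id = m.ItemId;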
Also (a bit of a long shot, this): you don't say which SQL Server version you're on, but if you're on SQL 2005, then given the number of tables involved in your reports and the volume of data, it's worth checking that your SQL Server is patched to at least SP2.
I worked on an ETL project using tables with row counts in the hundreds of millions, where we found that the query optimiser in SQL 2005 RTM/SP1 could not consistently produce efficient plans for queries joining more than five tables where one or more of the tables was at this scale. This issue was resolved in SP2.

How to do logging in an Oracle database?

I am interested in what methods of logging are common in an Oracle database.
Our method is the following:
We create a log table for each table to be logged. The log table contains all the columns of the original table plus some special fields, including a timestamp, the modification type (insert, update, delete), and the modifier's id. A trigger on the original table creates one log row for each insertion and deletion, and two rows for an update; the log rows contain the data before and after the alteration of the original row.
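As a hedged illustration of that trigger-based approach, here is a minimal sketch for a hypothetical source table emp(id, name, salary); the column set and the I/U/D codes are assumptions, not part of the question:
create table emp_log (
  log_ts    timestamp,
  mod_type  varchar2(1),    -- I / U / D
  modifier  varchar2(30),
  state     varchar2(3),    -- OLD or NEW
  id        number,
  name      varchar2(100),
  salary    number
);
create or replace trigger emp_log_trg
after insert or update or delete on emp
for each row
begin
  if inserting or updating then
    insert into emp_log values (systimestamp, case when inserting then 'I' else 'U' end,
                                user, 'NEW', :new.id, :new.name, :new.salary);
  end if;
  if updating or deleting then
    insert into emp_log values (systimestamp, case when deleting then 'D' else 'U' end,
                                user, 'OLD', :old.id, :old.name, :old.salary);
  end if;
end;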
Although the state of the records can be mined back in time using this method, it has some drawbacks:
Introducing a new column in the original table does not automatically involve a change to the log.
A log change affects both the log table and the trigger, and it is easy to mess up.
The state of a record at a specific past time cannot be determined in a straightforward way.
...
What other possibilities exist?
What kind of tools can be used to solve this problem?
I only know of log4plsql. What are the pros/cons of this tool?
Edit: based on Brian's answer, I have found the following reference that explains standard and fine-grained auditing.
It sounds like you are after 'auditing'. Oracle has a built-in feature called Fine-Grained Auditing (FGA). In a nutshell, you can audit everything or only under specific conditions. What is really cool is that you can 'audit' selects as well as transactions. A simple command to get started with standard auditing:
audit UPDATE on SCOTT.EMP by access;
Think of FGA as a 'trigger' for select statements. For example, you create a policy:
begin
  dbms_fga.add_policy (
    object_schema => 'BANK',
    object_name   => 'ACCOUNTS',
    policy_name   => 'ACCOUNTS_ACCESS'
  );
end;
After you have defined the policy, when a user queries the table in the usual way, as follows:
select * from bank.accounts;
the audit trail records this action. You can see the trail by issuing:
select timestamp,
db_user,
os_user,
object_schema,
object_name,
sql_text
from dba_fga_audit_trail;
TIMESTAMP DB_USER OS_USER OBJECT_ OBJECT_N SQL_TEXT
--------- ------- ------- ------- -------- ----------------------
22-OCT-08 BANK ananda BANK ACCOUNTS select * from accounts
Judging from your description, I wonder if what you really need is not a logging mechanism, but rather some sort of historical record of a table's values. If that is the case, you may be better off with some kind of temporal database design (using VALID_FROM and VALID_TO fields). You can also track changes in the database using the Oracle LogMiner tool.
As for your scenario, I would rather store the change data in this kind of schema:
+----------------------------------------------------------------------------+
| Column Name | Function |
+----------------------------------------------------------------------------+
| Id | PRIMARY_KEY value of the SOURCE table |
| TimeStamp | Time stamp of the action |
| User | User who make the action |
| ActionType | INSERT, UPDATE, or DELETE |
| OldValues | All field values from the source table, separated by '|' |
| NewValues | All field values from the source table, separated by '|' |
+----------------------------------------------------------------------------+
With this type of logging table, you can easily determine:
The history of actions on a particular record (using Id)
The state of a specific record at some point in time
Of course, this kind of logging cannot easily determine all valid values of the table at a specific point in time. For that, you need to change your table design to a temporal database design.
In the similar question (How to Audit Database Activity without Performance and Scalability Issues?) the accepted answer mentions the monitoring of database traffic using a network traffic sniffer as an interesting alternative.
log4plsql is a completely different thing; it's for logging debug info from PL/SQL.
For what you want, you need one of the following:
Set up a trigger.
Set up a PL/SQL interface around the tables; CRUD operations happen via this interface, and the interface ensures the log tables are updated.
Set up an interface in your application layer, doing the same as the PL/SQL interface, just higher up.
Oracle 11g contains versioned tables; I have not used this at all, though, so I can make no real comment.
If you are just interested in knowing what the data looked like in the recent past, you could use Oracle's flashback query functionality to query the data as of a specific time in the past. How far back you can go depends on how much undo space you have and how much database activity there is. The bright side of this solution is that new columns are picked up automatically. The downside is that you can't run a flashback query across DDL operations.
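A minimal sketch of a flashback query (reusing the emp example from the first answer; it returns the rows as they existed an hour ago, provided the undo data for that period is still available):
select *
from emp as of timestamp (systimestamp - interval '1' hour)
where ename = 'JOE';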
