Slow copying of virtual table rtree - database

The code copies a database from the file system into an in-memory database.
Measuring the copying process showed that the virtual table takes by far the most time. My assumption is that this has to do with the shadow tables.
First, here are the results:
Attached to database on file system in 7 milliseconds.
Created in-memory table-1 in 2 milliseconds.
Created in-memory table-2 in 2 milliseconds.
Copied table-1 in 2953 milliseconds.
Copied table-2 in 3086 milliseconds.
Create in-memory coordinates table in 4 milliseconds.
Copied coordinates table in 78813 milliseconds.
Detached from database on file system in 12 milliseconds.
Completed time was 84880 milliseconds.
The table schema on the hard disk looks like this:
CREATE VIRTUAL TABLE coordinates USING rtree(
id,
min_latitude,
max_latitude,
min_longitude,
max_longitude
);
The copying process looks like this in Java:
statement = connection.createStatement();
// Attach the on-disk database so its tables can be read from this connection
statement.execute("ATTACH '" + databasePath + "' AS fs");
// Recreate the R-tree virtual table in the in-memory database
statement.execute("CREATE VIRTUAL TABLE mCoordinates USING rtree(id, min_latitude, max_latitude, min_longitude, max_longitude)");
// Copy all rows from the on-disk table into the in-memory virtual table
statement.execute("INSERT INTO mCoordinates (id, min_latitude, max_latitude, min_longitude, max_longitude) SELECT id, min_latitude, max_latitude, min_longitude, max_longitude FROM fs.coordinates");
Now my question: is there a way to get better performance?

Constructing 150,000 R-tree entries from scratch just takes some time.
However, if both databases have the same page size, you could just copy the contents of the shadow tables:
BEGIN;
DELETE FROM mCoordinates_node;
INSERT INTO mCoordinates_node SELECT * FROM fs.coordinates_node;
INSERT INTO mCoordinates_parent SELECT * FROM fs.coordinates_parent;
INSERT INTO mCoordinates_rowid SELECT * FROM fs.coordinates_rowid;
COMMIT;
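Since this shortcut only works when both databases use the same page size, it is worth verifying that up front. A minimal sketch, assuming the on-disk database is still attached as fs as in the question:
-- Compare the page sizes of the in-memory database (main) and the attached database (fs)
PRAGMA main.page_size;
PRAGMA fs.page_size;
-- If they differ, the in-memory page size would have to be set before the first table is created
PRAGMA main.page_size = 4096;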

Related

Snapshot of a table daily for 30 days with task and stored procedure

I wanted to have a view of a table at the end of each day at midnight in the PST timezone. These tables are very small, only 300 entries per day on average.
I want to track changes to the rows based on the ids of the table and take a snapshot of their state each day, where each new table view would have a 'date' status.
The challenge is that the original table is growing so each new 'snapshot' will be a different size.
Here is an example of my data:
Day 1
Id  Article Title                        Genre          Views  Date
1   "I Know Why the Caged Bird Sings"    "Non-Fiction"  10     01-26-2019
2   "In the Cold"                        "Non-Fiction"  20     01-26-2019
Day 2
Id  Article Title                        Genre          Views  Date
1   "I Know Why the Caged Bird Sings"    "Non-Fiction"  20     02-27-2019
2   "In the Cold"                        "Non-Fiction"  40     02-27-2019
3   "Bury My Heart At Wounded Knee"      "Non-Fiction"  100    02-27-2019
I have a stored procedure that I would like to create to copy the state of the current table. However, since it is not recommended to create a table inside a stored procedure, I am trying to create a task that manages the table creation and the stored procedure call:
USE WAREHOUSE "TEST";
CREATE DATABASE "Snapshots";
USE DATABASE "Snapshots";
Create or Replace Table ArticleLibrary (id int, Title string, Genre string, Viewed number, date_captured timestamp );
INSERT INTO ArticleLibrary Values
(1, 'The man who walked the moon', 'Non-Fiction', 10, CURRENT_DATE() ),
(2, 'The girl who went to Vegas', 'Fiction', 20 , CURRENT_DATE())
;
SELECT * FROM ArticleLibrary;
//CREATE Stored Procedure
create procedure Capture_daily()
Returns string
LANGUAGE JAVASCRIPT
AS
$$
var rs = snowflake.execute({sqlText: "})
var rs = snowflake.execute( {sqlText: "COPY INTO "ARTICLELIBRARY"+CURRENT_DATE() * FROM ArticleLibrary; "} ); );
return 'success';
$$
;
CREATE OR REPLACE TASK IF NOT EXISTS Snapshot_ArticleLibrary_Task
WAREHOUSE = 'TEST'
SCHEDULE = '{ 1440 MINUTE | USING CRON 0 America/Los_Angeles }'
AS
CREATE TABLE "ARTICLELIBRARY"+"CURRENT_DATE()";
CALL Capture_daily();
//Run tomorrow
INSERT INTO ArticleLibrary Values
(3, 'The Best Burger in Town', 'News', 100, CURRENT_DATE());
I need some help improving the stored procedure and the task I set up; I am not sure how to call more than one SQL statement at the end of the task.
I am open to advice on how to better achieve this, considering this is a small amount of data and just an experiment to demonstrate compute cost on a small scale. I am also considering using a window function with a window frame over one large table that takes inserts from each new day, where the date is the status; the ids would then not be unique.
Since you're talking about daily snapshots and such a small amount of data, I would insert each day's snapshot into a single table with CURRENT_DATE() as a new column called "snapshot_id", for example.
You can have a view on top of this table that shows the latest day or even a UDF that can take the day as a parameter and return the results for any day. This table will be extremely quick since it'll be naturally clustered by the "snapshot_id" column and you will have all of your history in one spot which is nice and clean.
I've done this in the past where our source tables had millions of records and you can get quite far with this approach.
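A minimal sketch of that approach (the history table and view names below are just examples, not part of the original setup):
-- One history table that accumulates a copy of the source table per day
CREATE TABLE IF NOT EXISTS ArticleLibrary_History (
  snapshot_id date,
  id int, Title string, Genre string, Viewed number, date_captured timestamp
);
-- Append today's state of the source table, tagged with the snapshot date
INSERT INTO ArticleLibrary_History
SELECT CURRENT_DATE() AS snapshot_id, a.*
FROM ArticleLibrary a;
-- Convenience view that always shows the most recent snapshot
CREATE OR REPLACE VIEW ArticleLibrary_Latest AS
SELECT *
FROM ArticleLibrary_History
WHERE snapshot_id = (SELECT MAX(snapshot_id) FROM ArticleLibrary_History);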
I would recommend leveraging the Zero-Copy Cloning functionality of Snowflake for this. You can create a clone every day with a simple command, it will take no time, and if the underlying data isn't completely changing every day, then you're not going to use any additional storage, either.
https://docs.snowflake.net/manuals/sql-reference/sql/create-clone.html
You would still need an SP to dynamically create the table name based on the date, and you can execute that SP from a TASK.
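A sketch of what the daily clone might look like (the date-suffixed table name here is just an example; the SP would build it dynamically):
-- Zero-copy clone: created instantly and shares storage with the source until the data diverges
CREATE OR REPLACE TABLE ArticleLibrary_20190227 CLONE ArticleLibrary;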
Regarding the question, "I am not sure how to call more than one SQL statement at the end of the task":
One approach would be to embed the multiple SQL commands, in their desired order, in a stored procedure and call that stored procedure from the task.
create or replace procedure capture_daily()
returns string
language javascript
as
$$
var sql_command1 = snowflake.createStatement({ sqlText: `Create or Replace Table
"ARTICLELIBRARY".....` });
var sql_command2 = snowflake.createStatement({ sqlText: `COPY INTO
"ARTICLELIBRARY" ...` });
var sql_command3 = snowflake.createStatement({ sqlText: `Any Other DML
Command OR CALL sp_name` });
 
try 
{ 
sql_command1.execute(); 
sql_command2.execute(); 
sql_command3.execute(); 
return "Succeeded."; 
} 
catch (err) 
{ 
return "Failed: " + err;
} 
$$
;
 
CREATE OR REPLACE TASK Snapshot_ArticleLibrary_Task
WAREHOUSE = 'TEST'
SCHEDULE = 'USING CRON 0 0 * * * America/Los_Angeles'
AS
CALL Capture_daily();
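Also note that, as far as I know, a newly created task is suspended by default, so it won't run on its schedule until you resume it:
-- Enable the task so the schedule actually fires
ALTER TASK Snapshot_ArticleLibrary_Task RESUME;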

Table insertions using stored procedures?

(Submitting for a Snowflake User, hoping to receive additional assistance)
Is there a faster way to perform table insertion using a stored procedure?
I started building a usp with the purpose of inserting a million or so rows of test data into a table for load testing.
I got to the stage shown below and set the iteration value to 10,000.
It took over 10 minutes to iterate 10,000 times, inserting a single integer into the table on each iteration.
Yes, I am using an XS data warehouse, but even if this is increased to the maximum, this is way too slow to be of any use.
--build a test table
CREATE OR REPLACE TABLE myTable
(
myInt NUMERIC(18,0)
);
--testing a js usp using a while statement with the intention to insert multiple rows into a table (Millions) for load testing
CREATE OR REPLACE PROCEDURE usp_LoadTable_test()
RETURNS float
LANGUAGE javascript
EXECUTE AS OWNER
AS
$$
//set the number of iterations
var maxLoops = 10;
//set the row Pointer
var rowPointer = 1;
//set the Insert sql statement
var sql_insert = 'INSERT INTO myTable VALUES(:1);';
//Insert the first Value
sf_startInt = rowPointer + 1000;
resultSet = snowflake.execute( {sqlText: sql_insert, binds: [sf_startInt] });
//Loop through to insert all other values
while (rowPointer < maxLoops)
{
rowPointer += 1;
sf_startInt = rowPointer + 1000;
resultSet = snowflake.execute( {sqlText: sql_insert, binds: [sf_startInt] });
}
return rowPointer;
$$;
CALL usp_LoadTable_test();
So far, I've received the following recommendations:
Recommendation #1
One thing you can do is use a "feeder table" containing 1,000 or more rows instead of INSERT ... VALUES, e.g.:
INSERT INTO myTable SELECT <some transformation of columns> FROM "feeder table"
Recommendation #2
When you perform a million single row inserts, you consume one million micropartitions - each 16MB.
That 16 TB chunk of storage might be visible on your Snowflake bill ... Normal tables are retained for 7 days minimum after drop.
To optimize storage, you could define a clustering key and load the table in ascending order with each chunk filling up as much of a micropartition as possible.
Recommendation #3
Use data generation functions that work very fast if you need sequential integers: https://docs.snowflake.net/manuals/sql-reference/functions/seq1.html
Any other ideas?
This question was also asked at the Snowflake Lodge some weeks ago.
If the answers you received there still leave you feeling unanswered, then maybe hint at why?
If you just want a table with a single column of sequence numbers, use GENERATOR() as in #3 above. Otherwise, if you want more advice, share your specific requirements.
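For completeness, a minimal sketch of that set-based approach, producing the same values as the loop in the question (1001, 1002, ...) in a single statement; the row count here is just an example:
-- One INSERT generates all the rows at once instead of one round trip per row
INSERT INTO myTable
SELECT ROW_NUMBER() OVER (ORDER BY SEQ8()) + 1000 AS myInt
FROM TABLE(GENERATOR(ROWCOUNT => 1000000));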

SSIS Package how to successfully do a foreach or for loop to auto increment a value for field insert?

First of all I have never attempted something like this in SSIS and I am very new to SSIS package development.
I need to build a component in my package that will run through a table of data (say 80 rows) and set a field titled DisplayOrder to an auto-incremented number. The catch is that one of the records HAS to be set to 0 and then the rest of the records set to the auto-incremented number.
As for code, I am not even sure what code or screenshots to attach to this question.
I finally figured it out and there is no need for a loop.
Create a SQL Task to clear the linked Table.
Script Used
DELETE FROM [Currency].[ExchangeRates]
Create a SQL Task to clear the main table.
Script Used
DELETE FROM [Currency].[CurrencyList]
Load the values into the main table.
Actions Used
Load values from XML Source
Dump values to [ExchangeRates] Table
Create a SQL Task to load the Values from the main table to the linked table.
Script Used
INSERT INTO [Currency].[CurrencyList] (CurrencyCode, CurrencyName, ExchangeRateID, DisplayOrder)
SELECT [er].[TargetCurrency] AS [CurrencyCode],
       [er].[TargetName] AS [CurrencyName],
       [er].[ID] AS [ExchangeRateID],
       ROW_NUMBER() OVER (ORDER BY [er].[TargetName]) AS [DisplayOrder]
FROM [Currency].[ExchangeRates] AS [er]
ORDER BY [CurrencyName]
Create a SQL Task to load a new record to the main table for use as DisplayOrder 0.
Script Used
INSERT INTO [Currency].[ExchangeRates] ([Title], [Link], [Description], [PubDate], [BaseCurrency], [TargetCurrency], [TargetName], [ExchangeRate])
VALUES ('1 USD = 1 USD',
        'http://www.floatrates.com/usd/usd/',
        '1 U.S. Dollar = 1 U.S. Dollar',
        (SELECT TOP 1 [PubDate] FROM [Currency].[ExchangeRates]),
        'USD', 'USD', 'United States Dollar', '1')
Create a SQL Task to reference the newly created record from the main table.
Script Used
INSERT INTO [Currency].[CurrencyList] (CurrencyCode, CurrencyName, ExchangeRateID, DisplayOrder)
SELECT [er].[TargetCurrency] AS [CurrencyCode],
       [er].[TargetName] AS [CurrencyName],
       [er].[ID] AS [ExchangeRateID],
       0 AS [DisplayOrder]
FROM [Currency].[ExchangeRates] AS [er]
WHERE [er].[TargetCurrency] = 'USD'

Correct method of deleting over 2100 rows (by ID) with Dapper

I am trying to use Dapper to support data access for my server app.
My server app has another application that drops records into my database at a rate of 400 per minute.
My app pulls them out in batches, processes them, and then deletes them from the database.
Since data continues to flow into the database while I am processing, I don't have a good way to say delete from myTable where allProcessed = true.
However, I do know the PK value of the rows to delete. So I want to do a delete from myTable where Id in @listToDelete.
The problem is that if my server goes down for even 6 minutes, then I have over 2100 rows to delete.
Since Dapper takes my @listToDelete and turns each one into a parameter, my call to delete fails. (This causes my data purging to fall even further behind.)
What is the best way to deal with this in Dapper?
NOTES:
I have looked at Table-Valued Parameters, but from what I can see they are not very performant. This piece of my architecture is the bottleneck of my system and I need it to be very, very fast.
One option is to create a temp table on the server and then use the bulk load facility to upload all the IDs into that table at once. Then use a join, EXISTS or IN clause to delete only the records that you uploaded into your temp table.
Bulk loads are a well-optimized path in SQL Server and it should be very fast.
For example:
Execute the statement CREATE TABLE #RowsToDelete(ID INT PRIMARY KEY)
Use a bulk load to insert keys into #RowsToDelete
Execute DELETE FROM myTable where Id IN (SELECT ID FROM #RowsToDelete)
Execute DROP TABLE #RowsToDelete (the table will also be automatically dropped if you close the session)
(Assuming Dapper) code example:
conn.Open();
var columnName = "ID";
conn.Execute(string.Format("CREATE TABLE #{0}s({0} INT PRIMARY KEY)", columnName));
using (var bulkCopy = new SqlBulkCopy(conn))
{
bulkCopy.BatchSize = ids.Count;
bulkCopy.DestinationTableName = string.Format("#{0}s", columnName);
var table = new DataTable();
table.Columns.Add(columnName, typeof (int));
bulkCopy.ColumnMappings.Add(columnName, columnName);
foreach (var id in ids)
{
table.Rows.Add(id);
}
bulkCopy.WriteToServer(table);
}
//or do other things with your table instead of deleting here
conn.Execute(string.Format(@"DELETE FROM myTable where Id IN
(SELECT {0} FROM #{0}s)", columnName));
conn.Execute(string.Format("DROP TABLE #{0}s", columnName));
To get this code working, I went dark side.
Since Dapper makes my list into parameters, and SQL Server can't handle a lot of parameters (I have never needed even double-digit parameters before), I had to go with dynamic SQL.
So here was my solution:
string listOfIdsJoined = "("+String.Join(",", listOfIds.ToArray())+")";
connection.Execute("delete from myTable where Id in " + listOfIdsJoined);
Before everyone grabs their torches and pitchforks, let me explain.
This code runs on a server whose only input is a data feed from a Mainframe system.
The list I am dynamically creating is a list of longs/bigints.
The longs/bigints are from an Identity column.
I know constructing dynamic SQL is bad juju, but in this case, I just can't see how it leads to a security risk.
Dapper expects a list of objects with the parameter as a property, so in the above case a list of objects with an Id property will work.
connection.Execute("delete from myTable where Id in (@Id)", listOfIds.AsEnumerable().Select(i => new { Id = i }).ToList());
This will work.

How to find the size of data returned from a table

I have a table that holds snapshots of data. These snapshots are all tagged with 'jan2010' or 'april2011'. All the snapshots will grow exponentially over time and I wanted to see if I could forecast when we'd need to upgrade our storage.
Is there any way to do something like this:
select monthlysnapshot, sum(size)
from tblclaims_liberty
group by monthlysnapshot
order by monthlysnapshot desc
What am I missing to get the size of the data returned? Is there a system function I can call?
EXEC sp_spaceused 'tablename'
This will return a single result set that provides the following information:
Name - the name of the table
Rows - the number of rows in the table
Reserved - amount of total reserved space for the table
Data - amount of space used by the data for the table
Index_Size - amount of space used by the table's indexes
Unused - amount of unused space in the table
Is it what you are looking for?
C# Getting the size of the data returned from an SQL query
Changed:
EXEC sp_spaceused 'tablename'
If you can do this in your code, then in C# (change the code to whatever language you are using):
long size = 0;
object o = new object();
using (Stream s = new MemoryStream()) {
BinaryFormatter formatter = new BinaryFormatter();
formatter.Serialize(s, o);
size = s.Length;
}
Code copied from: How to get object size in memory?
I think you can insert the selected data into a table via "select * into newtable from table" and get the size of that newly created table.
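A rough sketch of that idea for one snapshot tag, reusing sp_spaceused from above (the scratch table name is made up):
-- Copy a single snapshot into its own table, measure it, then clean up
SELECT *
INTO snapshot_size_check
FROM tblclaims_liberty
WHERE monthlysnapshot = 'jan2010';

EXEC sp_spaceused 'snapshot_size_check';

DROP TABLE snapshot_size_check;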
