I need to insert 36 million rows from Oracle into MSSQL. The code below works, but even with chunking at 1,000 rows (since a single INSERT ... VALUES statement in MSSQL is limited to 1,000 rows) it is not quick at all. Current estimates have this taking around 100 hours, which won't cut it :)
def method(self):
    # get IDs and dates from Oracle
    ids_and_dates = self.get_ids_and_dates()
    # take 2 at a time
    for chunk in chunks(ids_and_dates, 2):
        # set up list for storing each where clause
        where_clauses = []
        for id, last_change_dt in chunk:
            # key names assumed to match the placeholders in the 'where' template
            where_clauses.append(self.queries['where'] % {"id": id, "last_change_dt": last_change_dt})
        # set up final SELECT statement
        details_query = self.queries['details'] % " OR ".join(where_clauses)
        details_rows = [str(r).replace("None", "null") for r in self.src_adapter.fetchall(details_query)]
        for tup in chunks(details_rows, 1000):
            # tup in the form of ["(VALUES_QUERY)"], remove []""
            insert_query = self.queries['insert'] % ', '.join(c for c in tup if c not in '[]{}""')
            self.dest_adapter.execute(insert_query)
I realize fetchall isn't ideal from what I've been reading. Should I consider implementing something else? And should I try out executemany instead of using execute for the inserts?
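For reference, a parameterized batch insert could look something like the sketch below, assuming the destination is reachable via pyodbc (the adapters above are not shown here, and dest_table plus its column names are placeholders rather than names from my code). On the Oracle side, iterating the cursor or using fetchmany instead of fetchall would avoid holding all 36 million rows in memory at once.

import pyodbc

def bulk_insert(rows, conn_str):
    # rows: an iterable of tuples fetched from Oracle (e.g. via cursor.fetchmany)
    cnxn = pyodbc.connect(conn_str)
    cursor = cnxn.cursor()
    # fast_executemany sends each parameter batch to the server in one round trip
    # instead of issuing one INSERT per row
    cursor.fast_executemany = True
    insert_sql = "INSERT INTO dest_table (col1, col2, col3) VALUES (?, ?, ?)"
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == 10000:  # batch size is tunable; the 1,000-row VALUES limit does not apply here
            cursor.executemany(insert_sql, batch)
            batch = []
    if batch:
        cursor.executemany(insert_sql, batch)
    cnxn.commit()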
The standalone Oracle query was really slow, so I broke it up into a few queries:
query1 gets IDs and dates.
query2 uses the IDs and dates from query1 and selects more columns (chunked at a maximum of 2 OR statements).
query3 takes the query2 data and inserts it into MSSQL.
Related
Basic issue: I have a process to extract records from a CDC table which is 'missing' records.
I am pulling from a MS SQL 2019 (Data Center Ed) DB with CDC enabled on 67 tables. One table in particular houses 323 million rows, and is ~125 columns wide. During a nightly process, around 12 million of these rows are updated, therefore around 20 million rows are generated in the _CT table. During this nightly process, CDC capture is still running using default settings. It can 'get behind', but we check for this.
After the nightly process is complete, I have a Python 3.6 extractor which connects to the SQL server using ODBC. I have a loop which goes over each of the 67 source tables. Before the loop begins, I ensure that the CDC capture is 'caught up'.
For each table, the extractor begins the process by reading the last successfully loaded LSN from the target database, which is in Snowflake.
The Python script passes the table name, last loaded LSN, and table PKEY to the following function, which looks up the current MAX LSN for the table and counts the incremental rows to process:
def get_incr_count(self, table_name, pk, last_loaded_lsn):
    try:
        cdc_table_name = self.get_cdc_table(table_name)
        max_lsn = self.get_max_lsn(table_name)
        incr_count_query = """with incr as
            (
                select
                    row_number() over
                    (
                        partition by """ + pk + """
                        order by
                            __$start_lsn desc,
                            __$seqval desc
                    ) as __$rn,
                    *
                from """ + cdc_table_name + """
                where
                    __$operation <> 3 and
                    __$start_lsn > """ + last_loaded_lsn + """ and
                    __$start_lsn <= """ + max_lsn + """
            )
            select COUNT(1) as count from incr where __$rn = 1 ;
            """
        lsn_df = pd.read_sql_query(incr_count_query, self.cnxn)
        incr_count = lsn_df['count'][0]
        return incr_count
    except Exception as e:
        raise Exception('Could not get the count of the incremental load for ' + table_name + ': ' + str(e))
If that query finds records to process, the extractor then runs the following function. Pulling 500,000 records at a time is a memory limitation of the virtual machine that runs this code; more than that maxes out the available memory.
def get_cdc_data(self, table_name, pk, last_loaded_lsn, offset_iterator=0, fetch_count=500000):
    try:
        cdc_table_name = self.get_cdc_table(table_name)
        max_lsn = self.get_max_lsn(table_name)
        # Get the last LSN loaded from the ODS.LOG_CDC table for the current table
        last_lsn = last_loaded_lsn
        incremental_pull_query = """with incr as
            (
                select
                    row_number() over
                    (
                        partition by """ + pk + """
                        order by
                            __$start_lsn desc,
                            __$seqval desc
                    ) as __$rn,
                    *
                from """ + cdc_table_name + """
                where
                    __$operation <> 3 and
                    __$start_lsn > """ + last_lsn + """ and
                    __$start_lsn <= """ + max_lsn + """
            )
            select CONVERT(VARCHAR(max), __$start_lsn, 1) as __$conv_lsn, *
            from incr where __$rn = 1
            order by __$conv_lsn
            offset """ + str(offset_iterator) + """ rows
            fetch first """ + str(fetch_count) + """ rows only;
            """
        # Load the incremental data into a dataframe using the SQL Server connection and the incremental query
        full_df = pd.read_sql_query(incremental_pull_query, self.cnxn)
        # Trim all cdc columns except __$operation
        df = full_df.drop(['__$conv_lsn', '__$rn', '__$start_lsn', '__$end_lsn', '__$seqval', '__$update_mask', '__$command_id'], axis=1)
        return df
    except Exception as e:
        raise Exception('Could not get the incremental load dataframe for ' + table_name + ': ' + str(e))
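The surrounding paging loop is not shown in the post; for context, a hypothetical driver (all names other than get_incr_count and get_cdc_data are assumptions, not code from the extractor) would advance offset_iterator in fetch_count-sized steps, something like:

# Hypothetical paging driver, not from the original extractor: pages through the
# change rows in fetch_count-sized slices until the counted total has been pulled.
def extract_table(self, table_name, pk, last_loaded_lsn, fetch_count=500000):
    total = self.get_incr_count(table_name, pk, last_loaded_lsn)
    offset = 0
    while offset < total:
        df = self.get_cdc_data(table_name, pk, last_loaded_lsn,
                               offset_iterator=offset, fetch_count=fetch_count)
        self.write_to_snowflake(table_name, df)  # assumed helper for the file / merge step described below
        offset += fetch_count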
The file is then moved into Snowflake and merged into a table. If every import loop succeeds, we update the MAX LSN in the target DB to set the next starting point. If any fail, we leave the MAX LSN alone and retry on the next pass. In the scenario described here, there are no identified errors.
We are finding evidence that this second query is not pulling every valid record between the starting LSN and the MAX LSN as it loops through. There is no discernible pattern to which records are missed, other than that if one LSN is missed, all changes within it are missed.
I think it may have something to do with how we are ordering records: order by __$conv_lsn. This value is the LSN converted from BINARY to VARCHAR(MAX)... so I am wondering whether ordering on a more reliable key would be advisable. I cannot think of a way to audit this without adding additional work to a process that is extremely time sensitive, which makes troubleshooting much more difficult.
I suspect that your problem is here.
row_number() over
(
partition by """ + pk + """
order by
__$start_lsn desc,
__$seqval desc
) as __$rn,
...
from incr where __$rn = 1
If a given key was affected more than once between those LSNs, its change rows will be enumerated 1-N and only row 1 survives the __$rn = 1 filter. Even that is a little hand-wavy; I'm not sure exactly what happens if a row is affected more than once within a single transaction (I'd need to set up a test and... well... I'm lazy).
But all that said, this workflow feels weird to me. I've worked with CDC in the past and, while admittedly I wasn't targeting Snowflake, the extraction part should be similar and fairly straightforward:
Get max LSN using sys.fn_cdc_get_max_lsn(); (i.e. no need to query the CDC data itself to obtain this value)
Select from cdc.fn_cdc_get_all_changes_«capture_instance»() or cdc.fn_cdc_get_net_changes_«capture_instance»() using the LSN endpoints (min from either the previous run for that table or from sys.fn_cdc_get_min_lsn(«capture_instance») for a first run; max from above)
Stream the results to wherever (i.e. you shouldn't need to hold a significant number of change records in memory at once).
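A rough sketch of that flow from Python, assuming a pyodbc connection and a capture instance name (both assumptions; the batch size, row filter, and downstream write are left to the caller):

import pyodbc

def stream_changes(cnxn, capture_instance, last_loaded_lsn_hex):
    cur = cnxn.cursor()
    # 1. High-water mark straight from the CDC metadata, no need to scan the _CT table
    to_lsn = cur.execute("SELECT sys.fn_cdc_get_max_lsn()").fetchone()[0]
    # Start just past the last LSN already loaded (stored as a '0x...' string)
    from_lsn = cur.execute(
        "SELECT sys.fn_cdc_increment_lsn(CONVERT(binary(10), ?, 1))",
        last_loaded_lsn_hex).fetchone()[0]
    # 2. All changes between the two endpoints for this capture instance
    cur.execute(
        "SELECT * FROM cdc.fn_cdc_get_all_changes_" + capture_instance + "(?, ?, N'all')",
        from_lsn, to_lsn)
    # 3. Stream the result out in batches instead of one big fetchall
    while True:
        batch = cur.fetchmany(50000)
        if not batch:
            break
        yield batch
    # the caller records to_lsn as the next starting point once everything has landed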
(Submitting for a Snowflake User, hoping to receive additional assistance)
Is there a faster way to insert rows into a table from a stored procedure?
I started building a stored procedure (usp) to insert a million or so rows of test data into a table for load testing.
I got to the stage shown below and set the iteration value to 10,000.
It took over 10 minutes to run 10,000 iterations, inserting a single integer into the table on each iteration.
Yes - I am using an XS data warehouse, but even if this is increased to MAX, this is way too slow to be of any use.
--build a test table
CREATE OR REPLACE TABLE myTable
(
myInt NUMERIC(18,0)
);
--testing a js usp using a while statement with the intention to insert multiple rows into a table (Millions) for load testing
CREATE OR REPLACE PROCEDURE usp_LoadTable_test()
RETURNS float
LANGUAGE javascript
EXECUTE AS OWNER
AS
$$
//set the number of iterations
var maxLoops = 10;
//set the row Pointer
var rowPointer = 1;
//set the Insert sql statement
var sql_insert = 'INSERT INTO myTable VALUES(:1);';
//Insert the first value
sf_startInt = rowPointer + 1000;
resultSet = snowflake.execute( {sqlText: sql_insert, binds: [sf_startInt] });
//Loop through to insert all other values
while (rowPointer < maxLoops)
{
    rowPointer += 1;
    sf_startInt = rowPointer + 1000;
    resultSet = snowflake.execute( {sqlText: sql_insert, binds: [sf_startInt] });
}
return rowPointer;
$$;
CALL usp_LoadTable_test();
So far, I've received the following recommendations:
Recommendation #1
One thing you can do is to use a "feeder table" containing 1000 or more rows instead of INSERT ... VALUES, eg:
INSERT INTO myTable SELECT <some transformation of columns> FROM "feeder table"
Recommendation #2
When you perform a million single row inserts, you consume one million micropartitions - each 16MB.
That 16 TB chunk of storage might be visible on your Snowflake bill ... Normal tables are retained for 7 days minimum after drop.
To optimize storage, you could define a clustering key and load the table in ascending order with each chunk filling up as much of a micropartition as possible.
Recommendation #3
Use data generation functions that work very fast if you need sequential integers: https://docs.snowflake.net/manuals/sql-reference/functions/seq1.html
Any other ideas?
This question was also asked at the Snowflake Lodge some weeks ago.
Given the answers you received there, if you still feel it is unanswered, maybe hint at why?
If you just want a table with a single column of sequence numbers, use GENERATOR() as in #3 above. Otherwise, if you want more advice, share your specific requirements.
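For example, the row-at-a-time loop in the procedure above collapses to a single set-based INSERT. Here is a sketch using the Python connector; the connection arguments are placeholders, and myTable matches the table created above:

import snowflake.connector

conn = snowflake.connector.connect(account="...", user="...", password="...",
                                    warehouse="...", database="...", schema="...")
conn.cursor().execute("""
    INSERT INTO myTable (myInt)
    SELECT SEQ8() + 1001                       -- mirrors rowPointer + 1000, starting at 1001
    FROM TABLE(GENERATOR(ROWCOUNT => 1000000))
""")
# SEQ8() is not guaranteed to be gap-free; if the integers must be strictly
# consecutive, use ROW_NUMBER() OVER (ORDER BY SEQ8()) + 1000 instead.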
I am trying to read data from a table in a PostgreSQL database into a Pandas dataframe. Before the DB existed, I used to store the data as msgpack files (created using Pandas to_msgpack; ~25 GB of data on an SSD).
The table is around 4.5 GB with around 10^8 rows, smaller due to type conversion and duplicate removal. The database is stored in a tablespace on an SSD drive.
Now I am rewriting my code so it works with the DB instead of the msgpack files. The only major difference in the data is that time in the msgpack files is stored as a UNIX timestamp, while in the DB it is PostgreSQL's timestamp.
The key part is importing the data.
My current Python code:
db_connection = psycopg2.connect("dbname={} user={} host={} password={} port={}".format(database_name, user, host, password, port))
df = pd.read_sql(('SELECT * '
                  'FROM "raw_data" '
                  # 'WHERE "time" > %(dstart)s AND "time" < %(dfinish)s '
                  # 'ORDER BY "time" ASC'
                  ),
                 db_connection,
                 params={"dstart": start_date, "dfinish": end_date},
                 index_col=['time'])
It takes around 5-6 minutes to execute this code, whereas reading the data from msgpack takes around 3 minutes. I thought it might be caused by the DB -> Pandas DataFrame conversion, but when I run the same query in the SQL console it takes around the same time to display the results. The strange thing is that EXPLAIN says the execution time is only about 10 s. Is there any way to improve the performance of this process? (A sketch of one possible alternative follows the config below.)
My CPU: 8-core Xeon, SSD and 64 GB RAM.
DB structure:
CREATE DOMAIN T AS double precision;
CREATE TABLE IF NOT EXISTS raw_data(
time timestamp PRIMARY KEY NOT NULL UNIQUE,
ib1 T,
ib2 T,
pb1 boolean,
pb2 boolean,
fn int,
bm smallint
);
Explain:
test=# explain analyse select * from raw_data;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on raw_data (cost=0.00..1596470.56 rows=98799456 width=32) (actual time=0.014..6935.492 rows=98801394 loops=1)
Planning time: 0.048 ms
Execution time: 10199.595 ms
(3 rows)
Postgresql config:
superuser_reserved_connections = 3 # (change requires restart)
dynamic_shared_memory_type = posix # the default is the first option
log_destination = 'stderr' # Valid values are combinations of
logging_collector = on # Enable capturing of stderr and csvlog
log_directory = 'log' # directory where log files are written,
log_filename = 'postgresql-%a.log' # log file name pattern,
log_truncate_on_rotation = on # If on, an existing log file with the
log_rotation_age = 1d # Automatic rotation of logfiles will
log_rotation_size = 0 # Automatic rotation of logfiles will
log_line_prefix = '%m [%p] ' # special values:
log_timezone = 'Europe/Vaduz'
datestyle = 'iso, mdy'
timezone = 'UTC'
default_text_search_config = 'pg_catalog.english'
default_statistics_target = 100
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
effective_cache_size = 44GB
work_mem = 320MB
wal_buffers = 16MB
shared_buffers = 15GB
max_connections = 20
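One commonly suggested alternative, sketched here only under the assumption that the bottleneck really is the per-row driver conversion, is to let PostgreSQL serialise the result with COPY and hand pandas a CSV stream. It reuses db_connection from the code above and buffers the whole CSV in memory, which should fit in 64 GB for this table:

import io
import pandas as pd

# COPY the whole query result to an in-memory CSV buffer in one server-side pass
buf = io.StringIO()
cur = db_connection.cursor()
cur.copy_expert('COPY (SELECT * FROM "raw_data") TO STDOUT WITH (FORMAT CSV, HEADER TRUE)', buf)
buf.seek(0)
# Parse the CSV with pandas, restoring the time index
df = pd.read_csv(buf, parse_dates=['time'], index_col='time')

Whether this helps depends on where the 5-6 minutes actually go; running the original read with pd.read_sql(..., chunksize=...) is a cheap way to check whether the transfer or the DataFrame construction dominates.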
I've been working to parameterize a SQL statement that uses an IN clause in the WHERE clause. I'm using the RODBCext library for parameterizing, but it seems to lack expansion of a list.
I was hoping to write code such as:
sqlExecute("SELECT * FROM table WHERE name IN (?)", c("paul", "ringo", "john", "george"))
I'm using the following code but wondered if there's an easier way.
library(RODBC)
library(RODBCext)
# Search inputs
names <- c("paul", "ringo", "john", "george")
# Build SQL statement
qmarks <- replicate(length(names), "?")
stringmarks <- paste(qmarks, collapse = ",")
sql <- paste("SELECT * FROM tableA WHERE name IN (", stringmarks, ")")
# expand to Columns - seems to be the magic step required
bindnames <- rbind(names)
# Execute SQL statement
dbhandle <- RODBC::odbcDriverConnect(connectionString)
result <- RODBCext::sqlExecute(dbhandle, sql, bindnames, fetch = TRUE)
RODBC::odbcClose(dbhandle)
It works, but I feel I'm using R to expand the strings in the wrong way (I'm a bit new to R - so many ways to do the same thing wrong). Somebody will probably say "that creates factors - never do that" :-)
I found this article, which suggests I'm on the right track, but it doesn't discuss having to expand the "?" and turn the list into columns of a data.frame:
R RODBC putting list of numbers into an IN() statement
Thank you.
UPDATE: As Benjamin shows below, the sqlExecute function can handle a list() of inputs. However, upon inspection of the resulting SQL I discovered that it uses cursors to roll up the results. This significantly increases the CPU and I/O compared with the sample code I show above.
While the library can indeed solve this for you, for large results it may be too expensive. There are two answers and it depends upon your needs.
Since the only parameter in your query is the collection for IN, you could get away with:
sqlExecute(dbhandle,
"SELECT * FROM table WHERE name IN (?)",
list(c("paul","ringo","john", "george")),
fetch = TRUE)
sqlExecute will bind the values in the list to the question mark. Here, it will actually repeat the query four times, once for each value in the vector. It may seem kind of silly to do it this way, but when trying to pass strings, it's a lot safer in many ways to let the binding take care of setting up the appropriate quote structure rather than trying to paste it in yourself. You will generate fewer errors this way and avoid a lot of database security concerns.
What if you declare a table variable in a character string and then concatenate it with the query?
library(RODBC)
library(RODBCext)
# Search inputs
names <- c("paul", "ringo", "john", "george")
# Build SQL statement
sql_top <- paste0( "SET NOCOUNT ON \r\n DECLARE @LST_NAMES TABLE (ID NVARCHAR(20)) \r\n INSERT INTO @LST_NAMES VALUES ('", paste(names, collapse = "'), ('" ) , "')")
sql_body <- paste("SELECT * FROM tableA WHERE name IN (SELECT id FROM @LST_NAMES)")
sql <- paste0(sql_top, "\r\n", sql_body)
# Execute SQL statement
dbhandle <- RODBC::odbcDriverConnect(connectionString)
result <- RODBCext::sqlExecute(dbhandle, sql, fetch = TRUE)  # no bind parameters are needed here
RODBC::odbcClose(dbhandle)
The resulting query will be (the SET NOCOUNT ON is important for retrieving the results):
SET NOCOUNT ON
DECLARE @LST_NAMES TABLE (ID NVARCHAR(20))
INSERT INTO @LST_NAMES VALUES ('paul'), ('ringo'), ('john'), ('george')
SELECT * FROM tableA WHERE name IN (SELECT id FROM @LST_NAMES)
I have a query for example:
var personList = context.People;
People is a view that has 2 joins on it and about 2500 rows and takes ~10 seconds to execute.
Looking at the Estimated Execution plan tells me that it is using a nested loop.
Now if I do this:
var personList = context.People.Where(r => r.Surname.Length > -1);
Execution time is under a second and the execution plan is using a Hash Join.
Adding "OPTION (HASH JOIN)" to the generated SQL has the desired effect of increasing performance.
So my question is ...
How can I get the query to use a hash join? It can't be added to the view (I tried; it errors).
Is there an option in EF4 that will force this, or will I have to put it in a stored procedure?
RE: View
SELECT dbo.DecisionResults.ID, dbo.DecisionResults.UserID, dbo.DecisionResults.HasAgreed, dbo.DecisionResults.Comment,
dbo.DecisionResults.DateResponded, Person_1.Forename, Person_1.Surname, Site_1.Name, ISNULL(dbo.DecisionResults.StaffID, - 999)
AS StaffID
FROM dbo.DecisionResults INNER JOIN
Server2.DB2.dbo.Person AS Person_1 ON Person_1.StaffID = dbo.DecisionResults.StaffID INNER JOIN
Server2.DB2.dbo.Site AS Site_1 ON Person_1.SiteID = Site_1.SiteID
ORDER BY Person_1.Surname
If I add OPTION (HASH JOIN) to the end, it errors with:
'Query hints' cannot be used in this query type.
But running that script as a query works fine.