I have a Snowpipe that is copying a CSV into a staging table. On a routine, I run a MERGE command and then remove the processed rows from the staging table. To ensure that I only remove rows that have been processed, and not rows that have been inserted since the merge process began, I am only merging rows that are <= the current maximum ingest row number. Once I have processed those rows, I delete rows from the staging table that are <= that number.
My ingest table looks like this:
CREATE TABLE IF NOT EXISTS ingest_staging (
collapse_key VARCHAR NOT NULL,
target VARCHAR NOT NULL,
action VARCHAR NOT NULL,
params OBJECT NOT NULL,
ingest_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP(),
ingest_row BIGINT AUTOINCREMENT START 1 INCREMENT 1
);
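The routine looks roughly like the sketch below (the production table name here is a placeholder, and using collapse_key as the merge key is just for illustration):

-- Rough sketch only: events_prod and the merge key are illustrative, not my real code.
SET max_ingest_row = (SELECT MAX(ingest_row) FROM ingest_staging);

MERGE INTO events_prod AS p
USING (
    SELECT collapse_key, target, action, params
    FROM ingest_staging
    WHERE ingest_row <= $max_ingest_row
) AS s
ON p.collapse_key = s.collapse_key
WHEN MATCHED THEN
    UPDATE SET p.target = s.target, p.action = s.action, p.params = s.params
WHEN NOT MATCHED THEN
    INSERT (collapse_key, target, action, params)
    VALUES (s.collapse_key, s.target, s.action, s.params);

-- Delete only what was (supposedly) just merged.
DELETE FROM ingest_staging WHERE ingest_row <= $max_ingest_row;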
I noticed that while I was running the copy and merge routine, not all rows that were inserted into the staging table got merged into the production table, yet all rows were deleted. I disabled the merge process and just ran the copy process on a newly created ingest table (to reset the autoincrement field) and immediately noticed the table count was out of sync with the autoincrement field:
SELECT COUNT(*) as "count", MAX(INGEST_ROW) as "max_ingest_row" FROM ingest_staging;
The first batch of rows copied in showed a 28k discrepancy between the two.
count: 368640, max_ingest_row: 397312
COUNT AND INGEST OUT OF SYNC: 28672
And after 15 minutes there was a discrepancy of 3.3 million between the number of rows and the autoincrement field.
count: 15624757, max_ingest_row: 18953955
COUNT AND INGEST OUT OF SYNC: 3329198
The count matches exactly the number of source rows I'm copying from, so I know no rows got dropped.
My guess is that, while the numbers auto-increment, there is no guarantee that a row with a lower auto-increment value won't be inserted after a row with a higher one. Is this what is happening?
Why is the autoincrement column out of sync with the number of rows in the table?
Snowflake leverages SEQUENCES for autoincrement columns and does not guarantee a "gapless" increment of numbers in that sequence. Snowflake guarantees that the numbers are increasing and unique, but not necessarily gap-free.
Take a look here in the documentation if you are interested:
https://docs.snowflake.com/en/user-guide/querying-sequences.html#sequence-semantics
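If you want to see the size of the gaps in any given batch, a quick illustrative check against your staging table:

SELECT COUNT(*)                                          AS row_count,
       MAX(ingest_row) - MIN(ingest_row) + 1             AS ingest_row_span,
       MAX(ingest_row) - MIN(ingest_row) + 1 - COUNT(*)  AS gap_size
FROM ingest_staging;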
Related
I have a table that looks something like this:
CREATE TABLE Records
(
ID UNIQUEIDENTIFIER PRIMARY KEY NONCLUSTERED,
owner UNIQUEIDENTIFIER,
value FLOAT,
timestamp DATETIME
)
There is a multi-column clustered index on some other columns not relevant to this question.
The table currently has about 500,000,000 rows, and I need to operate on the table but it's too large to deal with currently (I am hampered by slow hardware). So I decided to work on it in chunks.
But if I say
SELECT ID
FROM records
WHERE ID LIKE '0000%'
The execution plan shows that the ENTIRE TABLE is scanned. I thought that with an index, only those rows that matched the original condition would be scanned until SQL reached the '0001' records. With the % in front, I could clearly see why it would scan the whole table. But with the % at the end, it shouldn't have to scan the whole table.
I am guessing this works differently with GUIDs than with CHAR or VARCHAR columns.
So my question is this: how can I search for a subsection of GUIDs without having to scan the whole table?
From your comments, I see the actual need is to break the rows of random GUID values into chunks (ordered) based on range. In this case, you can specify a range instead of LIKE along with a filter on the desired start/end values in the last group:
SELECT ID
FROM dbo.records
WHERE
ID BETWEEN '00000000-0000-0000-0000-000000000000'
AND '00000000-0000-0000-0000-000FFFFFFFFF';
This article explains how uniqueidentifiers (GUIDs) are stored and ordered in SQL Server, comparing and sorting the last group first rather than left-to-right as you might expect. By filtering on the last group, you'll get a sargable expression and touch only those rows in the specified range (assuming an index on ID is used).
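A quick standalone check (not tied to your table) that shows this ordering:

-- Returns 'last group wins' because uniqueidentifiers are compared starting
-- from the last (rightmost) byte group.
SELECT CASE
         WHEN CAST('00000000-0000-0000-0000-000000000001' AS uniqueidentifier)
            > CAST('FFFFFFFF-FFFF-FFFF-FFFF-000000000000' AS uniqueidentifier)
         THEN 'last group wins'
         ELSE 'left-to-right'
       END AS guid_ordering;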
I have a RecyclerView list of items that uses an SQLite database to store user input data. I use the traditional _id column as INTEGER PRIMARY KEY AUTOINCREMENT. If I understand correctly, newly inserted rows in the database are added below existing rows and the new ROWID takes the largest existing ROWID and increments it by +1. Therefore, a cursor search for the latest insert will have to scan down the entire set of rows to reach the bottom of the database. For example, after 10 inserts, the cursor has to search down from 1, 2, 3,... until it gets to row 10.
To avoid a lengthy search of the entire set of ROWIDs, is there any way to have new inserts be added to the top of the database and not the bottom? That way a cursor search for the latest insert using moveToFirst() will be very fast since the cursor will stop at the first row it searches, the top of the database. The cursor would search 10, 9, 8,...3,2,1 and therefore the search would be very fast since it would stop at 10, the first row at the top of the database.
You are thinking too much about the database internals. Indexes are designed for this kind of optimisation.
Make a new numeric column that holds your desired ordering as a value and use ORDER BY on it in your selects. Do not forget to create an index on this column and to verify that your selects actually use it (EXPLAIN QUERY PLAN).
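A minimal SQLite sketch of that idea (the items table and sort_order column are placeholder names, not from your schema):

ALTER TABLE items ADD COLUMN sort_order INTEGER;
CREATE INDEX idx_items_sort_order ON items(sort_order);

-- Latest entry first; LIMIT 1 stops after one index probe.
SELECT * FROM items ORDER BY sort_order DESC LIMIT 1;

-- Verify the index is actually used:
EXPLAIN QUERY PLAN SELECT * FROM items ORDER BY sort_order DESC LIMIT 1;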
First, if you are concerned about overheads, then use the recommended INTEGER PRIMARY KEY as opposed to INTEGER PRIMARY KEY AUTOINCREMENT. Both will result in a unique id; the latter has overheads as per :-
The AUTOINCREMENT keyword imposes extra CPU, memory, disk space, and
disk I/O overhead and should be avoided if not strictly needed. It is
usually not needed.
SQLite Autoincrement
If I understand correctly, newly inserted rows in the database are
added below existing rows and the new ROWID takes the largest existing
ROWID and increments it by +1.
Generally yes, but not necessarily; there is no guarantee that the value will increment by exactly 1.
AUTOINCREMENT utilises a table called sqlite_sequence that has a single row per table, storing the highest sequence number used so far along with the table name. The next sequence number will be that value + probably 1, UNLESS the highest rowid in the table is greater than the value in the sqlite_sequence table.
Without AUTOINCREMENT then the next sequence is the highest rowid + probably 1.
AUTOINCREMENT guarantees a higher number. Without AUTOINCREMENT a lower (previously used) number can be reused (BUT not until the next number would be greater than 9223372036854775807). If AUTOINCREMENT would need a number higher than this, an SQLITE_FULL exception is raised.
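You can inspect that bookkeeping yourself (the table name in the filter is a placeholder):

-- One row per AUTOINCREMENT table: the highest sequence number handed out so far.
SELECT name, seq FROM sqlite_sequence WHERE name = 'items';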
Again with regard to rowid's and searching :-
The data for rowid tables is stored as a B-Tree structure containing
one entry for each table row, using the rowid value as the key. This
means that retrieving or sorting records by rowid is fast. Searching
for a record with a specific rowid, or for all records with rowids
within a specified range is around twice as fast as a similar search
made by specifying any other PRIMARY KEY or indexed value.
ROWIDs and the INTEGER PRIMARY KEY
To avoid a lengthy search of the entire set of ROWIDs, is there any
way to have new inserts be added to the top of the database and not
the bottom?
Yes there is: simply specify the value for the rowid (or, typically, its alias) when inserting (but beware of using an already-used value, and good luck with managing the numbering). However, I doubt that doing so would result in a faster search. Tables have a rowid by default largely because the rowid is optimised for searching by rowid.
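As a rough sketch (items and its columns are placeholder names), forcing the rowid and the usual fast query for the latest row look like this:

-- Supplying the rowid alias explicitly (risky: the value must not already exist).
INSERT INTO items (_id, note) VALUES (999, 'explicit rowid');

-- The normal way to get the most recent insert; this walks the rowid B-tree
-- from the high end and stops after one row.
SELECT * FROM items ORDER BY _id DESC LIMIT 1;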
I have a table in MS SQL Server 2008 and I have set its primary key to increment automatically, but if I delete any row from this table and then insert new rows, it starts from the next identity value, which creates a gap in the identity values. My program requires all the identities or keys to be in sequence.
Like:
Assignment Table has total 16 rows with sequence identities(1-16) but if I delete a value at 16th position
Delete From Assignment Where assignment_id=16;
and after this operation when I insert a new row
Insert into Assignment(assignment_title)Values('myassignment');
Rather than assigning 16 as a primary key to this new value it assigns 17.
How can I solve this Problem ?
Renumbering primary key values is not a good database management practice. I suggest you keep the primary key as is, and create a separate indexed column holding the values you want renumbered. Then create a trigger that runs a routine to renumber every row in the order you expect, by seeking out the "gaps" and filling them with values incremented from the previous value.
This is SQL Server's standard behaviour. If you deleted a row with ID=8 in your example, you would still have a gap.
All you could do is write a function getSmallestFreeID in SQL Server that you call for every insert and that would get you the smallest unassigned ID. But you would have to take great care with transactions and ACID.
The behavior you desire isn't possible without some post processing logic to renumber the rows.
Consider this scenario:
Session 1 begins a transaction, inserts a row (id=16), but doesn't commit yet.
Session 2 begins a transaction, inserts a row (id=17) and commits.
Session 1 rolls back.
Whether 16 will or will not exist in the table is decided after 17 is committed.
And you can't renumber these in a trigger, you'll get deadlocked.
What you probably need to do is to query the data adding a row number that is a sequential integer.
Gaps in identity values aren't a problem.
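If you need gap-free numbers for display or processing, generate them at query time, for example against the Assignment table from the question:

SELECT assignment_id,
       assignment_title,
       ROW_NUMBER() OVER (ORDER BY assignment_id) AS seq_no
FROM Assignment;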
Well, I recently faced the same problem: I needed the ID values in an external C# application in order to retrieve files named exactly as the ID.
Here is what I did to avoid the identity property: I entered the id values manually because it was a small table. If that is not an option in your case, use a SEQUENCE (SQL Server 2012 and later).
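A rough sketch of the SEQUENCE route (the sequence name is made up, and note that sequences can also skip values after a rollback):

CREATE SEQUENCE AssignmentSeq AS int START WITH 1 INCREMENT BY 1;

INSERT INTO Assignment (assignment_id, assignment_title)
VALUES (NEXT VALUE FOR AssignmentSeq, 'myassignment');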
Use UPDATE instead of DELETE to keep the id values in sequence.
I'm importing more than 600.000.000 rows from an old database/table that has no primary key set; the table is in a SQL Server 2005 database. I created a tool to import this data into a new database with a very different structure. The problem is that I want to be able to resume the process from wherever it stopped, for any reason, like an error or a network error. As this table doesn't have a primary key, I can't check whether a row was already imported or not. Does anyone know how to identify each row so I can check if it was already imported? This table has duplicated rows; I already tried to compute the hash of all the columns, but it's not working due to the duplicated rows...
thanks!
I would bring the rows into a staging table if this is coming from another database -- one that has an identity set on it. Then you can identify the rows where all the other data is the same except for the id and remove the duplicates before trying to put it into your production table.
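A sketch of that de-duplication step (the staging table, its identity column StagingId, and the business columns col1..col3 are placeholders):

;WITH numbered AS (
    SELECT StagingId,
           ROW_NUMBER() OVER (PARTITION BY col1, col2, col3
                              ORDER BY StagingId) AS rn
    FROM Staging
)
DELETE FROM numbered
WHERE rn > 1;   -- keep one copy of each duplicate group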
So: you are loading umpteen bazillion rows of data, the rows cannot be uniquely identified, the load can (and, apparently, will) be interrupted at any point at any time, and you want to be able to resume such an interrupted load from where you left off, despite the fact that for all practical purposes you cannot identify where you left off. Ok.
Loading into a table containing an additional identity column would work, assuming that however and whenever the data load is started, it always starts at the same item and loads items in the same order. Wildly inefficient, since you have to read through everything every time you launch.
Another clunky option would be to first break the data you are loading into manageably-sized chunks (perhaps 10,000,000 rows). Load them chunk by chunk, keeping track of which chunk you have loaded. Use a staging table, so that you know and can control when a chunk has been "fully processed". If/when interrupted, you only have to toss the chunk you were working on when interrupted, and resume work with that chunk.
With duplicate rows, even row_number() is going to get you nowhere, as that can change between queries (due to the way MSSQL stores data). You need to either bring it into a landing table with an identity column or add a new identity column to the existing table (alter table oldTbl add NewId int identity(1,1)).
You could use row_number(), and then back out the last n rows if they have more than the count in the new database for them, but it would be more straight-forward to just use a landing table.
Option 1: duplicates can be dropped
Try to find a somewhat-unique field combination (duplicates are allowed) and join on it together with a hash of the rest of the fields, which you store in the destination table.
Assume a table:
create table t_x(id int, name varchar(50), description varchar(100))
create table t_y(id int, name varchar(50), description varchar(100), hash varbinary(8000))
select *
from t_x x
where not exists(select *
                 from t_y y
                 where x.id = y.id
                   and hashbytes('sha1', x.name + '~' + x.description) = y.hash)
The reason to join on as many fields as possible is to reduce the chance of hash collisions, which are a real concern on a dataset of 600.000.000 records.
Option 2: duplicates are important
If you really need the duplicate rows, you should add a unique id column to your big table. To achieve this in a performant way, follow these steps:
Alter the table and add a uniqueidentifier or int field
populate it, either through a DEFAULT of newsequentialid() (for the uniqueidentifier option) or with row_number() (for the int option)
create an index on this field
add the id field to your destination table.
once all the data is moved over, the field can be dropped.
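A rough sketch of those steps using the int/row_number() variant (BigTable and the column names are placeholders):

ALTER TABLE BigTable ADD RowSeq bigint NULL;

-- Number every row once; the ORDER BY is arbitrary since the table has no key.
;WITH numbered AS (
    SELECT RowSeq,
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
    FROM BigTable
)
UPDATE numbered SET RowSeq = rn;

CREATE INDEX IX_BigTable_RowSeq ON BigTable (RowSeq);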
I want to store only the last 200 rows in a database table (when the 201st row comes in, I need to delete the oldest row, and so on). Is there any logic for this?
Thanks
I don't think you can do this in a single statement. Suppose a table has a primary key column 'id', a data column 'dataCol', and a time stamp column 'dataTime'. Here's an approach (untested) that deletes the oldest row whenever the number of rows exceeds 200:
INSERT INTO t (dataCol, dataTime) VALUES ('data', NOW());
DELETE FROM t
WHERE 200 < (SELECT COUNT (*) FROM t)
AND id = (SELECT id FROM t ORDER BY dataTime LIMIT 1);
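If your engine (MySQL, for example) complains that you can't select from the table you are deleting from, wrapping each subquery in a derived table is a common workaround (equally untested here):

DELETE FROM t
WHERE 200 < (SELECT cnt FROM (SELECT COUNT(*) AS cnt FROM t) AS c)
  AND id = (SELECT id FROM (SELECT id FROM t ORDER BY dataTime LIMIT 1) AS oldest);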
If you aren't going to delete data for any reason other than that the table is full, then another approach is possible. (No idea if this is better.) Start by pre-populating the table with 200 entries. Then create a second table with bookkeeping information about what's in the first table. The second table would have a single row that keeps the following information:
the next slot to be filled
whether the first table is full
The idea is that you insert new data into the first table at the index specified by the second table, increment the second table's slot column modulo 200, and set the full column to 1 the first time the slot pointer wraps back to 0.
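A MySQL-flavoured sketch of that idea (all table and column names are made up; data_ring is pre-populated with 200 rows having slot 0..199, and bookkeeping holds one row with next_slot and is_full):

-- Overwrite the row at the current slot instead of inserting.
UPDATE data_ring r
JOIN bookkeeping b ON r.slot = b.next_slot
SET r.dataCol = 'new data',
    r.dataTime = NOW();

-- Advance the slot pointer; mark the buffer full the first time it wraps.
UPDATE bookkeeping
SET is_full = CASE WHEN next_slot = 199 THEN 1 ELSE is_full END,
    next_slot = (next_slot + 1) % 200;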
As already mentioned, it probably is not possible to do this in one statement because you cannot update a table you are selecting from. Instead, use a single-char flag field that is set when a row needs to be marked as deleted; the marked rows can then be ignored by SELECT statements and cleaned out by a separate housekeeping routine.