Database rows limit

I want to store only the last 200 rows in a database table (when the 201st row comes in, I need to delete one row, and so on). Is there any logic to do this?
Thanks

I don't think you can do this in a single statement. Suppose a table has a primary key column 'id', a data column 'dataCol', and a timestamp column 'dataTime'. Here's an approach (untested) that deletes the oldest row whenever the number of rows exceeds 200:
INSERT INTO t (dataCol, dataTime) VALUES ('data', NOW());
DELETE FROM t
WHERE 200 < (SELECT COUNT(*) FROM t)
AND id = (SELECT id FROM t ORDER BY dataTime LIMIT 1);
If you aren't going to delete data for any reason other than that the table is full, then another approach is possible. (No idea if this is better.) Start by pre-populating the table with 200 entries. Then create a second table with bookkeeping information about what's in the first table. The second table would have a single row that keeps the following information:
the next slot to be filled
whether the first table is full
The idea is that you insert new data into the first table at the index specified by the second table, increment the second table's slot column modulo 200, and set the full column to 1 the first time the slot pointer wraps back to 0.
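A sketch of that bookkeeping approach in MySQL (the slot column and the t_meta table are assumptions, and the data table t is assumed to be pre-populated with 200 rows whose slot values run from 0 to 199):
-- Single-row bookkeeping table: where to write next, and whether t has wrapped at least once.
CREATE TABLE t_meta (
    next_slot INT NOT NULL,
    is_full TINYINT NOT NULL
);
INSERT INTO t_meta (next_slot, is_full) VALUES (0, 0);

-- On each new data point, overwrite the row at next_slot, then advance the pointer modulo 200.
UPDATE t
SET dataCol = 'data', dataTime = NOW()
WHERE slot = (SELECT next_slot FROM t_meta);

UPDATE t_meta
SET is_full = IF(next_slot = 199, 1, is_full),
    next_slot = (next_slot + 1) % 200;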

As already mentioned, it probably is not possible to do this in one statement, since you cannot update a table you are selecting from. Instead, use a single-char flag field that is set when a row needs to be marked as deleted; the marked rows can then be ignored by SELECT statements and cleaned out by a separate housekeeping routine.
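A sketch of that flag approach in MySQL, reusing the columns from the answer above (the deleted column is an assumption, and the inner derived table works around MySQL's restriction on selecting from the table being updated):
-- Flag everything except the newest 200 rows.
UPDATE t
SET deleted = 'Y'
WHERE deleted = 'N'
  AND id NOT IN (
      SELECT id FROM (
          SELECT id FROM t WHERE deleted = 'N' ORDER BY dataTime DESC LIMIT 200
      ) AS newest
  );

-- Readers ignore flagged rows.
SELECT dataCol, dataTime FROM t WHERE deleted = 'N';

-- The housekeeping routine removes them later.
DELETE FROM t WHERE deleted = 'Y';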

Auto Increment Growing Faster than Number of Rows?

I have a Snowpipe that is copying a CSV into a staging table. On a schedule, I run a merge command and then remove the processed rows from the staging table. To ensure that I only remove rows that have been processed, and not rows that have been inserted since the merge process began, I am only merging rows that are <= the current maximum ingest row number. Once I have processed those rows, I delete the rows from the staging table that are <= that number.
My ingest table looks like this:
CREATE TABLE IF NOT EXISTS ingest_staging (
collapse_key VARCHAR NOT NULL,
target VARCHAR NOT NULL,
action VARCHAR NOT NULL,
params OBJECT NOT NULL,
ingest_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP(),
ingest_row BIGINT AUTOINCREMENT START 1 INCREMENT 1
);
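The routine described above follows roughly this pattern (a sketch only; the production table ingest_production, the match key, and the session variable are assumptions, not taken from the question):
-- Capture the high-water mark once, then use it for both the merge and the delete.
SET max_row = (SELECT MAX(ingest_row) FROM ingest_staging);

MERGE INTO ingest_production p
USING (
    SELECT * FROM ingest_staging WHERE ingest_row <= $max_row
) s
ON p.collapse_key = s.collapse_key
WHEN MATCHED THEN UPDATE SET p.target = s.target, p.action = s.action, p.params = s.params
WHEN NOT MATCHED THEN INSERT (collapse_key, target, action, params, ingest_date)
    VALUES (s.collapse_key, s.target, s.action, s.params, s.ingest_date);

-- Only rows that were visible to the merge are removed.
DELETE FROM ingest_staging WHERE ingest_row <= $max_row;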
I noticed that while I was running the copy and merge routine, not all rows that were inserted into the staging table got merged into the production table, yet all rows were deleted. I disabled the merge process and just ran the copy process on a newly created ingest table (to reset the auto-increment field) and immediately noticed that the table count was out of sync with the autoincrement field:
SELECT COUNT(*) as "count", MAX(INGEST_ROW) as "max_ingest_row" FROM ingest_staging;
The first batch of rows copied in showed a 28k discrepancy between the two.
count: 368640, max_ingest_row: 397312
COUNT AND INGEST OUT OF SYNC: 28672
And after 15 minutes there was a discrepancy of 3.3 million between the number of rows and the autoincrement field.
count: 15624757, max_ingest_row: 18953955
COUNT AND INGEST OUT OF SYNC: 3329198
The count matches exactly the number of source rows I'm copying from, so I know no rows got dropped.
My guess is that, while the number auto-increments, it's not guaranteed that a row with a lower auto-increment number won't be inserted at a later time than a row with a higher one. Is this what is happening?
Why is the autoincrement column out of sync with the number of rows in the table?
Snowflake leverages SEQUENCES for autoincrement columns and does not guarantee a "gapless" increment of numbers in that sequence. Snowflake guarantees that the numbers are sequential and unique, but not necessarily gap-free.
Take a look here in the documentation if you are interested:
https://docs.snowflake.com/en/user-guide/querying-sequences.html#sequence-semantics
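You can reproduce the same behaviour with an explicit sequence (a minimal sketch; the sequence name is arbitrary):
CREATE OR REPLACE SEQUENCE seq_demo START = 1 INCREMENT = 1;

-- Each call returns a unique, increasing value, but consecutive values are not
-- guaranteed to be contiguous, so MAX(ingest_row) can run ahead of COUNT(*).
SELECT seq_demo.NEXTVAL;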

Fast duplicates filtering in T-SQL procedure

I need to store some records in a table. It's crucial that the records are unique. All duplicates must be silently ignored.
I wrote a procedure which accepts a user-defined table type; I send a record collection to it and try to have only NEW, UNIQUE records stored.
How do I determine unique? I calculate SHA1 from a couple of columns. In my table, I have a Hash column. It has a UNIQUE index constraint.
Here comes the tricky part. Instead of using IF EXISTS(SELECT ..) I use TRY/CATCH blocks and let the INSERT silently fail on a duplicate hash.
So I use a cursor to fetch my rows, then I calculate the hash for each row, then I try to insert this row. If it fails, the next row is processed.
It works. It's quite fast. However, I'm very disappointed with my identity column.
If I try to enter 3 identical records and 1 new one, I get the following IDs: 1, 4. I would expect 1 and 2, not 1 and 4. So the identity is incremented on each failed insert. I need to avoid that.
I tried to wrap the INSERT in a TRANSACTION and ROLLBACK in the CATCH block. It doesn't help: the inserts still work, just the IDs are wrong.
Is there a way to use the UNIQUE constraint to filter the duplicates efficiently, or is the only way the IF EXISTS method?
Is using a UNIQUE constraint really faster than IF EXISTS?
UPDATE:
The typical scenario would look like 1000 duplicates and 2 new rows. There will be some concurrent calls to this procedure. I just don't want it to slow the server down considerably once I have a couple of million rows in my table.
You can use SET IDENTITY_INSERT (with your table name) and control that identity field yourself till the end of the insert:
SET IDENTITY_INSERT dbo.YourTable ON
-- in the cursor loop, keep your own counter for the identity column and set it explicitly on each insert
SET IDENTITY_INSERT dbo.YourTable OFF
By the way, are you sure you cannot avoid cursors?
You could use EXCEPT to get only the values that are not already present and insert them in a single statement, which would definitely be faster; just to give an idea:
INSERT INTO DestTable
SELECT * FROM (SELECT * FROM SourceTable
               EXCEPT
               SELECT * FROM DestTable) AS NewRows
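Applied to the question's setup (a table-valued parameter and a Hash column with a unique index), a set-based version might look like the following sketch; @NewRecords, dbo.Records, and the column names are assumptions, not taken from the question:
INSERT INTO dbo.Records (ColA, ColB, Hash)
SELECT n.ColA, n.ColB, HASHBYTES('SHA1', n.ColA + '~' + n.ColB)
FROM @NewRecords AS n
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.Records AS r
    WHERE r.Hash = HASHBYTES('SHA1', n.ColA + '~' + n.ColB)
);
-- If the incoming batch can contain duplicates of itself, de-duplicate it
-- (e.g. GROUP BY the hash) before or during this insert.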

How can the date a row was added be in a different order to the identity field on the table?

I have a 'change history' table in my SQL Server DB called tblReportDataQueue that records changes to rows in other source tables.
There are triggers on the source tables in the DB which fire after INSERT, UPDATE or DELETE. The triggers all call a stored procedure that just inserts data into the change history table that has an identity column:
INSERT INTO tblReportDataQueue
(
[SourceObjectTypeID],
[ActionID],
[ObjectXML],
[DateAdded],
[RowsInXML]
)
VALUES
(
@SourceObjectTypeID,
@ActionID,
@ObjectXML,
GetDate(),
@RowsInXML
)
When a row in a source table is updated multiple times in quick succession the triggers fire in the correct order and put the changed data in the change history table in the order that it was changed. The problem is that I had assumed that the DateAdded field would always be in the same order as the identity field but somehow it is not.
So my table is in the order that things actually happened when sorted by the identity field but not when sorted by the 'DateAdded' field.
How can this happen?
Screenshot of the example problem: the 'DateAdded' of the last row shown is earlier than that of the first row shown.
You are using a surrogate key. One very important characteristic of a surrogate key is that it cannot be used to determine anything about the tuple it represents, not even the order of creation. All systems which have auto-generated values like this, including Oracle's sequences, make no guarantee as to order, only that the next value generated will be unique from previously generated values. That is all that is required, really.
We all do it, of course. We look at a row with ID of 2 and assume it was inserted after the row with ID of 1 and before the row with ID of 3. That is a bad habit we should all work to break because the assumption could well be wrong.
You have the DateAdded field to provide the information you want. Order by that field and you will get the rows in order of insertion (if that field is not updateable, that is). The auto generated values will tend to follow that ordering, but absolutely do not rely on that!
Try using SEQUENCE...
"Using the identity attribute for a column, you can easily generate auto-incrementing numbers (which are often used as a primary key). With Sequence, it will be a different object which you can attach to a table column while inserting. Unlike identity, the next number for the column value will be retrieved from memory rather than from the disk – this makes Sequence significantly faster than Identity.
Unlike identity column values, which are generated when rows are inserted, an application can obtain the next sequence number before inserting the row by calling the NEXT VALUE FOR function. The sequence number is allocated when NEXT VALUE FOR is called even if the number is never inserted into a table. The NEXT VALUE FOR function can be used as the default value for a column in a table definition. Use sp_sequence_get_range to get a range of multiple sequence numbers at once."
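A minimal example of a sequence used as a column default in SQL Server 2012 or later (object and column names here are illustrative, not taken from the question):
CREATE SEQUENCE dbo.ReportDataQueueSeq
    AS BIGINT
    START WITH 1
    INCREMENT BY 1;

CREATE TABLE dbo.tblReportDataQueueDemo (
    QueueID   BIGINT   NOT NULL DEFAULT (NEXT VALUE FOR dbo.ReportDataQueueSeq),
    ObjectXML XML      NULL,
    DateAdded DATETIME NOT NULL DEFAULT (GETDATE())
);

-- Or fetch a value up front and insert it explicitly:
SELECT NEXT VALUE FOR dbo.ReportDataQueueSeq;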

Auto-increment id when field is empty

How can I avoid the auto-increment in a database table when the fields are empty? (Whenever I leave project_id, project_name, etc. empty, it still automatically increments the id.) What should I do to avoid this?
I don't think (as far as I know) auto-increment can be stopped. Any insert into the table will cause the counter to increment (hence the name auto-increment).
Suppose you inserted a row and the auto-increment value was 10. Now delete that row and insert a new row. The value 10 won't be reused.
My advice would be to be more careful while inserting values into the database (add as many checks as required), or remove the auto-increment and provide a key manually: query to get the max of that column, increment it by 1 and use this new value for the insert (not a particularly good idea, especially if the database is huge).
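A sketch of that manual-key approach (the table and column names are assumptions; note this is not safe under concurrent inserts):
-- Compute max(id) + 1 and insert in one statement.
INSERT INTO project (id, project_name)
SELECT COALESCE(MAX(id), 0) + 1, 'New project'
FROM project;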

How to uniquely identify rows in a table without a primary key

I'm importing more than 600.000.000 rows from an old database/table that has no primary key set; this table is in a SQL Server 2005 database. I created a tool to import this data into a new database with a very different structure. The problem is that I want to be able to resume the process from where it stopped for any reason, like an error or a network error. As this table doesn't have a primary key, I can't check whether a row was already imported or not. Does anyone know how to identify each row so I can check if it was already imported? This table has duplicated rows; I already tried to compute a hash of all the columns, but it's not working due to the duplicated rows...
thanks!
I would bring the rows into a staging table if this is coming from another database -- one that has an identity set on it. Then you can identify the rows where all the other data is the same except for the id and remove the duplicates before trying to put it into your production table.
So: you are loading umpteen bazillion rows of data, the rows cannot be uniquely identified, the load can (and, apparently, will) be interrupted at any point at any time, and you want to be able to resume such an interrupted load from where you left off, despite the fact that for all practical purposes you cannot identify where you left off. Ok.
Loading into a table containing an additional identity column would work, assuming that however and whenever the data load is started, it always starts at the same item and loads items in the same order. Wildly inefficient, since you have to read through everything every time you launch.
Another clunky option would be to first break the data you are loading into manageably-sized chunks (perhaps 10,000,000 rows). Load them chunk by chunk, keeping track of which chunk you have loaded. Use a staging table, so that you know and can control when a chunk has been "fully processed". If/when interrupted, you only toss the chunk you were working on when interrupted, and resume work with that chunk.
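A sketch of that chunked load, assuming the data has first been landed in a staging table with an identity column StagingId and the last completed boundary is kept in a small progress table (all names here are illustrative):
DECLARE @chunk BIGINT, @start BIGINT;
SET @chunk = 10000000;
SET @start = (SELECT last_loaded_id FROM dbo.LoadProgress);

-- Copy one chunk into the new structure.
INSERT INTO dbo.NewTable (ColA, ColB)
SELECT s.ColA, s.ColB
FROM dbo.StagingTable AS s
WHERE s.StagingId >  @start
  AND s.StagingId <= @start + @chunk;

-- Only after the chunk commits, record the new high-water mark.
UPDATE dbo.LoadProgress SET last_loaded_id = @start + @chunk;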
With duplicate rows, even row_number() is going to get you nowhere, as that can change between queries (due to the way MSSQL stores data). You need to either bring it into a landing table with an identity column or add a new identity column onto the existing table (ALTER TABLE oldTbl ADD NewId INT IDENTITY(1,1)).
You could use row_number() and then back out the last n rows if they have more than the count in the new database for them, but it would be more straightforward to just use a landing table.
Option 1: duplicates can be dropped
Try to find a somewhat unique field combination (duplicates are allowed) and join on it together with a hash of the rest of the fields, which you store in the destination table.
Assume a source table t_x and a destination table t_y:
create table t_x(id int, name varchar(50), description varchar(100))
create table t_y(id int, name varchar(50), description varchar(100), hash varbinary(8000))
select * from t_x x
where not exists(select *
                 from t_y y
                 where x.id = y.id
                 and hashbytes('sha1', x.name + '~' + x.description) = y.hash)
The reason to join on as many fields as possible is to reduce the chance of hash collisions, which are a real concern on a dataset with 600.000.000 records.
Option 2: duplicates are important
If you really need the duplicate rows, you should add a unique id column to your big table. To achieve this in a performant way, do the following steps (a sketch follows the list):
Alter the table and add a uniqueidentifier or int field
populate it via a default of newsequentialid() (for a uniqueidentifier) or an update based on row_number() (for an int)
create an index on this field
add the id field to your destination table.
once all the data is moved over, the field can be dropped.
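A sketch of the int/row_number() variant of those steps (the table and column names are assumptions):
-- 1. Add a nullable id column to the big table.
ALTER TABLE dbo.OldBigTable ADD RowId BIGINT NULL;

-- 2. Populate it with an arbitrary but persisted numbering.
WITH numbered AS (
    SELECT RowId, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
    FROM dbo.OldBigTable
)
UPDATE numbered SET RowId = rn;

-- 3. Index it so resume checks against the destination stay cheap.
CREATE INDEX IX_OldBigTable_RowId ON dbo.OldBigTable (RowId);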
