Updating data in Clickhouse - database

I went over the documentation for ClickHouse and did not see an option to UPDATE or DELETE. It seems to me it's an append-only system.
Is it possible to update existing records, or is there some workaround, such as truncating a partition that contains changed records and then re-inserting the entire data for that partition?

With an ALTER query in ClickHouse, we can delete or update rows in a table.
For delete, the query should be constructed as:
ALTER TABLE testing.Employee DELETE WHERE Emp_Name='user4';
For update, the query should be constructed as:
ALTER TABLE testing.employee UPDATE AssignedUser='sunil' WHERE AssignedUser='sunny';

UPDATE: This answer is no longer true, look at https://stackoverflow.com/a/55298764/3583139
ClickHouse doesn't support real UPDATE/DELETE.
But there are a few possible workarounds:
Try to organize the data in a way that it does not need to be updated.
You could write a log of update events to a table, and then calculate reports from that log. So, instead of updating existing records, you append new records to a table.
Use a table engine that does data transformation in the background during merges. For example, the (rather specific) CollapsingMergeTree table engine:
https://clickhouse.yandex/reference_en.html#CollapsingMergeTree
There is also the ReplacingMergeTree table engine (not documented yet; you can find an example in the tests: https://github.com/yandex/ClickHouse/blob/master/dbms/tests/queries/0_stateless/00325_replacing_merge_tree.sql)
The drawback is that you don't know when the background merge will be done, or whether it will ever be done.
Also look at samdoj's answer.
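The "append new records instead of updating" workaround above can be sketched roughly as follows. This is only an illustration under assumptions: the table employee_versions and its columns are hypothetical, and the version column is one possible way to mark which row is the newest.

```sql
-- Hypothetical table: every "update" is inserted as a new row with a higher version.
CREATE TABLE employee_versions
(
    emp_id        UInt64,
    assigned_user String,
    version       UInt64
)
ENGINE = ReplacingMergeTree(version)
ORDER BY emp_id;

INSERT INTO employee_versions VALUES (1, 'sunny', 1);
INSERT INTO employee_versions VALUES (1, 'sunil', 2);  -- logical update of emp_id 1

-- Background merges may not have run yet, so deduplicate at read time with FINAL.
SELECT emp_id, assigned_user
FROM employee_versions FINAL;

-- Alternative that does not depend on merges: take the latest row per key.
SELECT emp_id, argMax(assigned_user, version) AS assigned_user
FROM employee_versions
GROUP BY emp_id;
```

Both queries pay the deduplication cost at read time, which is the trade-off for not knowing when a merge will happen.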

You can drop and create new tables, but depending on their size this might be very time-consuming.
For deletion, something like this could work:
INSERT INTO tableTemp SELECT * from table1 WHERE rowID != #targetRowID;
DROP TABLE table1;
-- recreate table1 with the same schema before copying the rows back
INSERT INTO table1 SELECT * from tableTemp;
Similarly, to update a row, you could first delete it in this manner, and then add it back with the new values.

Functionality to UPDATE or DELETE data has been added in recent ClickHouse releases, but it is an expensive batch operation that can't be performed too frequently.
See https://clickhouse.yandex/docs/en/query_language/alter/#mutations for more details.
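Because mutations are executed asynchronously in the background, it can help to check whether one has finished. A minimal sketch, assuming a ClickHouse version that exposes the system.mutations table:

```sql
-- List mutations that have not completed yet.
SELECT database, table, mutation_id, command, create_time, is_done
FROM system.mutations
WHERE is_done = 0;
```

An empty result suggests that all submitted ALTER ... UPDATE / DELETE mutations have been applied.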

It's an old question, but updates are now supported in Clickhouse. Note it's not recommended to do many small changes for performance reasons. But it is possible.
Syntax:
ALTER TABLE [db.]table UPDATE column1 = expr1 [, ...] WHERE filter_expr
Clickhouse UPDATE documentation

Related

How to write a code to timetravel using a specific transaction ID

I would like to use the Time Travel feature in Snowflake to restore the original table.
I dropped and recreated the table using the following commands:
DROP TABLE "SOCIAL_LIVE"
CREATE TABLE "SOCIAL_LIVE" (...)
I would like to go back to the original table as it was before I dropped it.
I used the following code (with the transaction ID replaced by 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'):
Select "BW"."PUBLIC"."SOCIAL_LIVE".* From "BW"."PUBLIC"."SOCIAL_LIVE";
select * from SOCIAL_LIVE before(statement => 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx');
Received an error message:
Statement xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx cannot be used to specify time for time travel query.
How can we go back to the original table and restore it on snowflake?
The documentation states:
After dropping a table, creating a table with the same name creates a
new version of the table. The dropped version of the previous table
can still be restored using the following method:
Rename the current version of the table to a different name.
Use the UNDROP TABLE command to restore the previous version.
If you need further information, this page is useful:
https://docs.snowflake.net/manuals/sql-reference/sql/drop-table.html#usage-notes
You will need to undrop the table in order to access that data, though. Time-travel is not maintained by name alone. So, once you dropped and recreated the table, the new table has its own, new time travel.
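The two documented steps can be sketched like this for the table in the question. SOCIAL_LIVE_NEW is a hypothetical name chosen for the recreated table; any unused name would do.

```sql
-- Step 1: move the recreated table out of the way (new name is hypothetical).
ALTER TABLE "SOCIAL_LIVE" RENAME TO "SOCIAL_LIVE_NEW";

-- Step 2: restore the most recently dropped version under the original name.
UNDROP TABLE "SOCIAL_LIVE";
```

After this, "SOCIAL_LIVE" refers to the original table, and its own Time Travel history is reachable again.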
It looks like there are three common reasons this error is seen, each with its own solution:
the table has been dropped and recreated
    see this answer
the time travel period has been exceeded
    no solution: target a statement within the time travel period for the table
the wrong statement type is being targeted
    only certain statement types can be targeted. Currently, these include SELECT, BEGIN, COMMIT, and DML (INSERT, UPDATE, etc.). See documentation here.
Statement xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx cannot be used to specify time for time travel query.
We usually get the above error when trying to travel back to a point before the object's creation time. Try Time Travel with the OFFSET option instead.
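A short sketch of what the offset form looks like, using the table from the question. The five-minute offset and the timestamp value are placeholders and must fall after the table's creation time and within its retention period.

```sql
-- Table contents as of five minutes ago (offset is in seconds, negative means the past).
SELECT * FROM SOCIAL_LIVE AT(OFFSET => -60*5);

-- Table contents as of a specific point in time (placeholder timestamp).
SELECT * FROM SOCIAL_LIVE AT(TIMESTAMP => '2019-05-01 12:00:00'::timestamp_tz);
```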

Stored procedure to update different columns

I have an API that I'm trying to read that gives me just the updated fields. I'm trying to take that and update my tables using a stored procedure. So far the only way I have been able to figure out how to do this is with dynamic SQL, but I would prefer not to do that if there is another way.
If it was just a couple of columns, I'd just write a proc for each, but we are talking about 100 fields and any of them could be updated together. One ticket might just need a timestamp updated at this time, but the next ticket might be a timestamp and who modified it, while the next one might just be a note.
Everything I've read and been taught has told me that dynamic SQL is bad, and while I'll write it if I have to, I'd prefer to have a proc.
You can perhaps do something like this:
IF EXISTS (SELECT * FROM NEWTABLE EXCEPT SELECT * FROM OLDTABLE)
BEGIN
UPDATE OLDTABLE
SET OLDRECORDS = NEWTABLE.NEWRECORDS
FROM OLDTABLE
JOIN NEWTABLE ON OLDTABLE.PRIMARYKEY = NEWTABLE.PRIMARYKEY
END
The best way to solve your problem is using MERGE:
Performs insert, update, or delete operations on a target table based on the results of a join with a source table. For example, you can synchronize two tables by inserting, updating, or deleting rows in one table based on differences found in the other table.
As you can see your update could be more complex but more efficient as well. Using MERGE requires some proficiency, but when you start to use it you'll use it with pleasure again and again.
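As a rough illustration of the MERGE suggestion, assuming the rows received from the API are first loaded into a staging table. The names dbo.Ticket, dbo.TicketStaging and their columns are hypothetical.

```sql
-- Hypothetical tables: dbo.Ticket is the target, dbo.TicketStaging holds rows from the API.
MERGE dbo.Ticket AS t
USING dbo.TicketStaging AS s
    ON t.TicketId = s.TicketId
WHEN MATCHED THEN
    -- Existing ticket: overwrite the listed columns with the staged values.
    UPDATE SET t.ModifiedAt = s.ModifiedAt,
               t.ModifiedBy = s.ModifiedBy,
               t.Note       = s.Note
WHEN NOT MATCHED BY TARGET THEN
    -- New ticket: insert it.
    INSERT (TicketId, ModifiedAt, ModifiedBy, Note)
    VALUES (s.TicketId, s.ModifiedAt, s.ModifiedBy, s.Note);
```

Note that this overwrites every listed column, so the staging rows would need to carry the current value for fields the API did not send.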
I am not sure how your business logic works that determines what columns are updated at what time. If there are separate business functions that require updating different but consistent columns per function, you will probably want to have individual update statements for each function. This will ensure that each process updates only the columns that it needs to update.
On the other hand, if your API is such that you really don't know ahead of time what needs to be updated, then building a dynamic SQL query is a good idea.
Another option is to build a save proc that sets every user-configurable field. As long as the calling process has all of that data, it can call the save procedure and pass every updateable column. There is no harm in having an UPDATE MyTable SET MyCol = @MyCol with the same values on each side.
Note that even if all of the values are the same, the rowversion (or timestamp) columns will still be updated, if present.
With our software, the tables that users can edit have a widely varying range of columns. We chose to create a single save procedure for each table that has all of the update-able columns as parameters. The calling processes (our web servers) have all the required columns in memory. They pass all of the columns on every call. This performs fine for our purposes.
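A single save procedure of this kind might look like the sketch below. The table dbo.Ticket, its columns, and the procedure name are hypothetical. The COALESCE defaults are an added assumption, not part of the answers above: they let a caller omit fields it did not receive, at the cost of not being able to set a column to NULL through this procedure.

```sql
CREATE PROCEDURE dbo.SaveTicket
    @TicketId   INT,
    @ModifiedAt DATETIME     = NULL,
    @ModifiedBy VARCHAR(100) = NULL,
    @Note       VARCHAR(MAX) = NULL
AS
BEGIN
    SET NOCOUNT ON;

    -- A parameter left as NULL keeps the column's current value.
    UPDATE dbo.Ticket
    SET ModifiedAt = COALESCE(@ModifiedAt, ModifiedAt),
        ModifiedBy = COALESCE(@ModifiedBy, ModifiedBy),
        Note       = COALESCE(@Note, Note)
    WHERE TicketId = @TicketId;
END
```

A caller that only received a note would then run something like EXEC dbo.SaveTicket @TicketId = 42, @Note = 'Customer called back';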

sql server to dump big table into other table

I'm currently changing the Id field of table to be an IDENTITY field. This is simple: Create a temp-table, copy all the data to the temp-table, adjust all the references from and to the table to point from and to the new temp-table, drop the old table, rename the temp-table to the original name.
Now I've got the problem that the copy step is taking too long. Actually the table doesn't have too many entries (~7.5 million rows), but it still takes multiple hours to do this.
I'm currently moving the data with a query like this:
SET IDENTITY_INSERT MyTable_Temp ON
INSERT INTO MyTable_Temp ([Fields]) SELECT [Fields] FROM MyTable
SET IDENTITY_INSERT MyTable_Temp OFF
I've had a look at bcp in combination with cmdshell and a following BULK INSERT, but I don't like the solution of first writing the data to a temp-file and afterwards dumping it back into the new table.
Is there a more efficient way to copy or move the data from the old to the new table? And can this be done in "pure" T-SQL?
Keep in mind, the data is correct (no external sources involved) and no changes are being made to the data during transfer.
Your approach seems fair, but the transaction generated by the insert command is too large and that is why it takes so long.
My approach when dealing with this in the past, was to use a cursor and a batching mechanism.
Perform the operation for only 100000 rows at a time, and you will see major improvements.
After the copy is made you can rebuild your references and eventually remove the old table... and so on. Be careful to reseed your new table accordingly after the data is copied.
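The batching idea can be sketched in pure T-SQL. This version uses a WHILE loop over key ranges rather than a cursor, which is a substitution on my part. It assumes MyTable.Id is an integer key, and Col1 and Col2 are placeholders for the real field list.

```sql
-- Assumes an integer Id key; Col1 and Col2 stand in for [Fields].
DECLARE @BatchSize INT = 100000;
DECLARE @FromId INT, @MaxId INT;

SELECT @FromId = MIN(Id), @MaxId = MAX(Id) FROM MyTable;

SET IDENTITY_INSERT MyTable_Temp ON;

WHILE @FromId <= @MaxId
BEGIN
    -- Each INSERT commits on its own, so no single transaction covers all rows.
    INSERT INTO MyTable_Temp (Id, Col1, Col2)
    SELECT Id, Col1, Col2
    FROM MyTable
    WHERE Id >= @FromId AND Id < @FromId + @BatchSize;

    SET @FromId = @FromId + @BatchSize;
END

SET IDENTITY_INSERT MyTable_Temp OFF;

-- Bring the identity seed in line with the highest copied Id.
DBCC CHECKIDENT ('MyTable_Temp', RESEED);
```

The batch covers a range of 100000 key values rather than exactly 100000 rows, so gaps in the Id sequence simply produce smaller batches.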

Audit each inserted row in a Trigger

I am trying to do an audit history by adding triggers to my tables and inserting rows into my Audit table. I have a stored procedure that makes doing the inserts a bit easier because it saves code; I don't have to write out the entire insert statement, but instead execute the stored procedure with a few parameters for the columns I want to insert.
I am not sure how to execute a stored procedure for each of the rows in the "inserted" table. I think maybe I need to use a cursor, but I'm not sure. I've never used a cursor before.
Since this is an audit, I am going to need to compare the value for each column old to new to see if it changed. If it did change I will execute the stored procedure that adds a row to my Audit table.
Any thoughts?
I would trade space for time and not do the comparison. Simply push the new values to the audit table on insert/update. Disk is cheap.
Also, I'm not sure what the stored procedure buys you. Can't you do something simple in the trigger like:
insert into dbo.mytable_audit
(select *, getdate(), getdate(), 'create' from inserted)
Here the trigger runs on insert and you are adding created time, last updated time, and modification type fields. For an update, it's a little trickier since you'll need to name the target columns explicitly, as the created time shouldn't be updated:
insert into dbo.mytable_audit (col1, col2, ...., last_updated, modification)
(select *, getdate(), 'update' from inserted)
Also, are you planning to audit only successes or failures as well? If you want to audit failures, you'll need something other than triggers I think since the trigger won't run if the transaction is rolled back -- and you won't have the status of the transaction if the trigger runs first.
I've actually moved my auditing to my data access layer and do it in code now. It makes it easier to do both success and failure auditing and (using reflection) it is pretty easy to copy the fields to the audit object. The other thing it allows me to do is give the user context, since I don't give the actual user permissions to the database and run all queries using a service account.
If your database needs to scale past a few users this will become very expensive. I would recommend looking into 3rd party database auditing tools.
There is already a built in function UPDATE() which tells you if a column has changed (but it is over the entire set of inserted rows).
You can look at some of the techniques in Paul Nielsen's AutoAudit triggers which are code generated.
What it does is check both:
IF UPDATE(<column_name>)
INSERT Audit (...)
SELECT ...
FROM Inserted
JOIN Deleted
ON Inserted.KeyField = Deleted.KeyField -- (AutoAudit does not support multi-column primary keys, but the technique can be done manually)
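AND (Inserted.<column_name> <> Deleted.<column_name>
OR (Inserted.<column_name> IS NULL AND Deleted.<column_name> IS NOT NULL)
OR (Inserted.<column_name> IS NOT NULL AND Deleted.<column_name> IS NULL)) -- NULL-safe test: also catches changes to or from NULL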
But it audits each column change as a separate row. I use it for auditing changes to configuration tables. I am not currently using it for auditing heavy change tables. (But in most transactional systems I've designed, rows on heavy activity tables are typically immutable, you don't have a lot of UPDATEs, just a lot of INSERTs - so you wouldn't even need this kind of auditing). For instance, orders or ledger entries are never changed, and shopping carts are disposable - neither would have this kind of auditing. On low volume change tables, like customer, you can use this kind of auditing.
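Put together, a hand-written trigger using this technique for one column might look like the sketch below. It is not AutoAudit's generated code; dbo.Customer, its Email column, and the dbo.Audit table layout are all hypothetical. It is set-based, so no cursor over the inserted rows is needed.

```sql
-- Hypothetical tables: dbo.Customer (CustomerId key, Email) and dbo.Audit.
CREATE TRIGGER trg_Customer_Audit
ON dbo.Customer
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- UPDATE() is true when the column appears in the SET list, for the whole statement.
    IF UPDATE(Email)
        INSERT INTO dbo.Audit (TableName, KeyValue, ColumnName, OldValue, NewValue, ChangedAt)
        SELECT 'Customer', i.CustomerId, 'Email', d.Email, i.Email, GETDATE()
        FROM inserted AS i
        JOIN deleted AS d
            ON i.CustomerId = d.CustomerId
        -- Keep only rows where the value really changed, treating NULL as a value.
        WHERE i.Email <> d.Email
           OR (i.Email IS NULL AND d.Email IS NOT NULL)
           OR (i.Email IS NOT NULL AND d.Email IS NULL);
END
```

One IF UPDATE block per audited column gives one audit row per changed column per changed row, as described above.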
Jeff,
I agree with Zodeus: a good option is to use a third-party tool.
I have used auditdatabase, a free web tool that generates audit triggers (you do not need to write a single line of T-SQL code).
Another good tool is Apex SQL Audit, but it's not free.
I hope this helps you,
F. O'Neill

Deleting Rows from a SQL Table marked for Replication

I erroneously deleted all the rows from an MS SQL 2000 table that is used in merge replication (the table is on the publisher). I then compounded the issue by using a DTS operation to retrieve the rows from a backup database and repopulate the table.
This has created the following issue:
The delete operation marked the rows for deletion on the clients but the DTS operation bypasses the replication triggers so the imported rows are not marked for insertion on the subscribers. In effect the subscribers lose the data although it is on the publisher.
So I thought "no worries" I will just delete the rows again and then add them correctly via an insert statement and they will then be marked for insertion on the subscribers.
This is my problem:
I cannot delete the DTSed rows because I get a "Cannot insert duplicate key row in object 'MSmerge_tombstone' with unique index 'uc1MSmerge_tombstone'." error. What I would like to do is somehow delete the rows from the table bypassing the merge replication trigger. Is this possible? I don't want to remove and redo the replication because the subscribers are 50+ windows mobile devices.
Edit: I have tried the Truncate Table command. This gives the following error "Cannot truncate table xxxx because it is published for replication"
Have you tried truncating the table?
You may have to truncate the table and reset the ID field back to 0 if you need the inserted rows to have the same ID. If not, just truncate and it should be fine.
You also could look into temporarily dropping the unique index and adding it back when you're done.
Look into sp_mergedummyupdate
Would creating a second table be an option? You could create a second table, populate it with the needed data, add the constraints/indexes, then drop the first table and rename your second table. This should give you the data with the right keys...and it should all consist of SQL statements that are allowed to trickle down the replication. It just isn't probably the best on performance...and definitely would impose some risk.
I haven't tried this first hand in a replicated environment...but it may be at least worth trying out.
Thanks for the tips...I eventually found a solution:
I deleted the merge delete trigger from the table
Deleted the DTSed rows
Recreated the merge delete trigger
Added my rows correctly using an insert statement.
I was a little worried about fiddling with the merge triggers, but everything appears to be working correctly.

Resources