Cassandra, counters and delete by field

I have this specific use case: I'm storing counters in a table, keyed by a timestamp:
CREATE TABLE IF NOT EXISTS metrics(
timestamp timestamp,
value counter,
PRIMARY KEY ((timestamp))
);
And I would like to delete all the metrics whose timestamp is earlier than a specific value, for example:
DELETE FROM metrics WHERE timestamp < '2015-01-22 17:43:55-0800';
But this command is returning the following error:
code=2200 [Invalid query] message="Invalid operator < for PRIMARY KEY part timestamp"
How could I implement this functionality?

For a delete to work, you will need to provide a precise key with an equals operator. Deleting with a greater/less than operator does not work. Basically, you would have to obtain a list of the timestamps that you want to delete, and iterate through them with a (Python?) script or short (Java/C#) program.
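For illustration, here is a rough sketch of that iterate-and-delete approach in Python, assuming the DataStax cassandra-driver package, a keyspace called 'mykeyspace' and a node on localhost (none of these details come from the question):
from datetime import datetime
from cassandra.cluster import Cluster

# Hypothetical connection details; adjust contact points and keyspace as needed.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('mykeyspace')

# The cutoff from the question, converted to UTC; the driver returns
# timestamp columns as naive UTC datetimes.
cutoff = datetime(2015, 1, 23, 1, 43, 55)

# The partition key cannot be range-filtered server-side, so pull the keys
# back (the driver pages through them) and filter on the client.
delete_stmt = session.prepare("DELETE FROM metrics WHERE timestamp = ?")
for row in session.execute("SELECT timestamp FROM metrics"):
    if row.timestamp < cutoff:
        session.execute(delete_stmt, [row.timestamp])

cluster.shutdown()
Bear in mind this scans the whole table and creates a tombstone for every deleted row, so it is best run in batches or during quiet periods.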
One possible solution (if you happen to know how long you want to keep the data for) would be to set a time to live (TTL) on the data. On a table with counter columns you cannot do this as part of an UPDATE command; the only option is to set it when you create the table:
CREATE TABLE IF NOT EXISTS metrics(
timestamp timestamp,
value counter,
PRIMARY KEY ((timestamp))
) WITH default_time_to_live=259200;
This will remove data from the table three days (259200 seconds) after it is written.
EDIT
It turns out that this possible solution isn't actually possible. Even though Cassandra lets you create a counter table with default_time_to_live set, it doesn't enforce it.
So, back to my original paragraph: the only way to execute a DELETE is to provide the specific key that you are deleting, and for counter tables that looks like the only way possible.

Related

DynamoDB update entire column efficiently

I have a DynamoDB table called 'Table'. There are a few columns in the table, including one called 'updated'. I want to set the 'updated' field to '0' on every item without having to provide a key, to avoid fetching and searching the table.
I tried a batch write, but it seems update_item requires Key inputs. How can I efficiently update the entire column so that every value is 0?
I am using a python script.
Thanks a lot.
At this point you cannot do this; you have to pass a key (partition key, or partition key and sort key) to update an item.
Currently, the only way is to scan the table (with a filter on the 'updated' attribute, if you only need the items whose value isn't already 0) and collect the respective keys.
Then pass each of those keys to update_item and set the value (a rough sketch follows below).
Hopefully, in the future AWS will come up with something better.
Alternatively, if you can obtain the partition keys some other way, you can simply update each item by its key.
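Here is a rough sketch of that scan-then-update loop with boto3; the table's partition key is assumed to be an attribute called 'id', which the question does not specify:
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Table')

# Scan only the key attribute; '#pk' is an alias for the (assumed) key name 'id'.
scan_kwargs = {
    'ProjectionExpression': '#pk',
    'ExpressionAttributeNames': {'#pk': 'id'},
}
while True:
    response = table.scan(**scan_kwargs)
    for item in response['Items']:
        table.update_item(
            Key={'id': item['id']},
            UpdateExpression='SET #u = :zero',
            ExpressionAttributeNames={'#u': 'updated'},  # alias, in case the name clashes with a reserved word
            ExpressionAttributeValues={':zero': 0},
        )
    if 'LastEvaluatedKey' not in response:
        break
    scan_kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']
Each item still costs at least one write, so this is no cheaper than updating by key; it only automates collecting the keys.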

Dynamo DB Data model

I am designing a DynamoDB table with the following attributes:
uniqueID | TimeStamp | Type | Content | flag
I need to get a list of all rows that have flag set to true, sorted by TimeStamp.
uniqueID is system generated ID.
TimeStamp is system time while populating table.
The number of distinct Type values will be less than 10.
flag: true/false
I can think of the following 3 approaches:
Make uniqueID the partition key of the table, and create a Global Secondary Index with flag and TimeStamp as its partition and sort keys respectively. I can then query the GSI with flag as the hash value and get items sorted on TimeStamp.
The problem here is that flag only takes the values true and false, and the number of rows with flag set to false is very small compared to true, so there would effectively be only 2 partitions. This loses all the scaling characteristics of DynamoDB.
Another alternative is making Type the partition key and TimeStamp the sort key of the Global Secondary Index. This is better, but when querying I can't select all values of Type at once, since DynamoDB requires the hash key in a Query; I would need to query this GSI once per distinct Type value.
Scan the table (Scan operation): a Scan can return all data with flag set to true (using a filter) without requiring a hash key, but it won't give me results sorted on TimeStamp.
After analyzing the use case, I think approach 1 is the best for now.
Could you please suggest any approach better than this?
Thanks in advance!
Any partition key based on flag or Type will be bad, as there are only a few possible values (2 and fewer than 10, respectively) and the way your data goes into partitions will be skewed. You need something that provides a good distribution, and in your case the best candidate for the table's partition key is uniqueID.
The problem is that when you want to get the results based on flag, especially when flag is true, you will get a lot of records, possibly the large majority. So the scaling of DynamoDB won't give you much anyway if you need to get back most of the records.
You can try to create a GSI with flag as the partition key and TimeStamp as the range key (a query sketch follows below). This is not an ideal set of keys, but it covers what you need. Having a good key for the base table means that you can later easily switch to another solution (e.g. scanning and not using the GSI). Keep in mind that if you want to avoid also querying the base table when using the GSI, you will have to project the attributes you want returned into the GSI.
So, summing up, I think you can choose between the GSI and scanning:
Scanning can be slower (test it) but won't require additional data storage.
The GSI can be faster (test it) but will require additional data storage.
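For illustration, querying such a GSI with boto3 might look like the sketch below. The table and index names are made up, and flag is assumed to be stored as the string 'true'/'false', since DynamoDB key attributes cannot be booleans:
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('MyTable')            # hypothetical table name

query_kwargs = {
    'IndexName': 'flag-TimeStamp-index',     # hypothetical GSI name
    'KeyConditionExpression': Key('flag').eq('true'),
    'ScanIndexForward': True,                # ascending by TimeStamp; False for newest first
}

items = []
while True:
    response = table.query(**query_kwargs)
    items.extend(response['Items'])
    if 'LastEvaluatedKey' not in response:   # keep paging past the 1 MB per-query limit
        break
    query_kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']
Note that when flag is true for most items this still reads most of the table, which is the skew problem described above.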

How can the date a row was added be in a different order to the identity field on the table?

I have a 'change history' table in my SQL Server DB called tblReportDataQueue that records changes to rows in other source tables.
There are triggers on the source tables in the DB which fire after INSERT, UPDATE or DELETE. The triggers all call a stored procedure that just inserts data into the change history table that has an identity column:
INSERT INTO tblReportDataQueue
(
[SourceObjectTypeID],
[ActionID],
[ObjectXML],
[DateAdded],
[RowsInXML]
)
VALUES
(
@SourceObjectTypeID,
@ActionID,
@ObjectXML,
GetDate(),
@RowsInXML
)
When a row in a source table is updated multiple times in quick succession the triggers fire in the correct order and put the changed data in the change history table in the order that it was changed. The problem is that I had assumed that the DateAdded field would always be in the same order as the identity field but somehow it is not.
So my table is in the order that things actually happened when sorted by the identity field but not when sorted by the 'DateAdded' field.
How can this happen?
screenshot of example problem
In the example image, the 'DateAdded' of the last row shown is earlier than that of the first row shown.
You are using a surrogate key. One very important characteristic of a surrogate key is that it cannot be used to determine anything about the tuple it represents, not even the order of creation. All systems that have auto-generated values like this, including Oracle's sequences, make no guarantee as to order, only that the next value generated will be distinct from previously generated values. That is all that is required, really.
We all do it, of course. We look at a row with an ID of 2 and assume it was inserted after the row with ID 1 and before the row with ID 3. That is a bad habit we should all work to break, because the assumption could well be wrong.
You have the DateAdded field to provide the information you want. Order by that field and you will get the rows in order of insertion (provided that field is not updateable, that is). The auto-generated values will tend to follow that ordering, but absolutely do not rely on that!
Try using a Sequence...
"Using the identity attribute for a column, you can easily generate auto-incrementing numbers (which are often used as a primary key). With Sequence, it is a different object which you can attach to a table column while inserting. Unlike identity, the next number for the column value will be retrieved from memory rather than from the disk – this makes Sequence significantly faster than Identity.
Unlike identity column values, which are generated when rows are inserted, an application can obtain the next sequence number before inserting the row by calling the NEXT VALUE FOR function. The sequence number is allocated when NEXT VALUE FOR is called even if the number is never inserted into a table. The NEXT VALUE FOR function can be used as the default value for a column in a table definition. Use sp_sequence_get_range to get a range of multiple sequence numbers at once."
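To make that concrete, here is a minimal sketch in Python with pyodbc against SQL Server; the connection string, sequence name and demo table are all made up:
import pyodbc

# Hypothetical connection string; adjust driver/server/database to your setup.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=MyDatabase;Trusted_Connection=yes;"
)
cur = conn.cursor()

# One-time setup: a sequence, and a table whose key defaults to NEXT VALUE FOR it.
cur.execute("CREATE SEQUENCE dbo.seqDemo AS bigint START WITH 1 INCREMENT BY 1;")
cur.execute(
    "CREATE TABLE dbo.tblSequenceDemo ("
    " ID bigint NOT NULL DEFAULT (NEXT VALUE FOR dbo.seqDemo) PRIMARY KEY,"
    " Payload nvarchar(100) NOT NULL);"
)

# Unlike IDENTITY, the application can fetch the next number before inserting the row.
next_id = cur.execute("SELECT NEXT VALUE FOR dbo.seqDemo;").fetchval()
cur.execute("INSERT INTO dbo.tblSequenceDemo (ID, Payload) VALUES (?, ?);", next_id, "example row")
conn.commit()
Note that, like identity, a sequence guarantees uniqueness but not strictly chronological values for committed rows, so it does not by itself resolve the DateAdded ordering question above.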

Inserting a record into a database with looping foreign keys

I have this relationship between two tables, ArcadeGame and ArcadeGameVersions:
CurrentVersionID in ArcadeGame points to the current active version of the game, and GameID in ArcadeGameVersions points to the associated ArcadeGame record.
Problem is, I can't insert either record:
The INSERT statement conflicted with the FOREIGN KEY constraint "FK_ArcadeGame_ArcadeGameVersions". The conflict occurred in database "Scirra", table "dbo.ArcadeGameVersions", column 'ID'.
Is this a badly formed data structure? Otherwise what is the best solution to overcome this?
This structure can work if you need it to be this way. Assuming the IDs are identity fields, I believe you will need to do this in 5 steps:
Insert an ArcadeGame record with a null value for CurrentVersionId
Determine the ID value of the record just added, using a statement like: SELECT @arcadeGameId = SCOPE_IDENTITY()
Insert an ArcadeGameVersion record, setting the GameID to the value determined in the previous step
Determine the ID value of the record just added (again using SCOPE_IDENTITY())
Update the ArcadeGame record (where the ID matches that determined in step 2) and set the CurrentVersionId to the value determined in the previous step.
You will (most likely) want to do the above within a transaction, as in the sketch below.
If the IDs aren't identity fields and you know the values ahead of time, you can mostly follow the same steps as above but skip the SELECT SCOPE_IDENTITY() steps.
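A minimal sketch of those five steps in Python with pyodbc: the connection string and the Name column are made up (only the key columns and the database name come from the question), and the version row is assumed to need nothing beyond GameID:
import pyodbc

# Hypothetical connection string; the database name comes from the error message.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=Scirra;Trusted_Connection=yes;"
)
conn.autocommit = False   # run all five steps inside one transaction
cur = conn.cursor()
try:
    # Steps 1-2: insert the game with no current version yet and capture its new ID.
    # SET NOCOUNT ON keeps the INSERT's row count from hiding the SELECT's result set.
    game_id = cur.execute(
        "SET NOCOUNT ON;"
        " INSERT INTO ArcadeGame (Name, CurrentVersionID) VALUES (?, NULL);"
        " SELECT SCOPE_IDENTITY();",
        "Asteroids",                     # hypothetical Name value
    ).fetchval()

    # Steps 3-4: insert the version pointing back at the game and capture its ID.
    version_id = cur.execute(
        "SET NOCOUNT ON;"
        " INSERT INTO ArcadeGameVersions (GameID) VALUES (?);"
        " SELECT SCOPE_IDENTITY();",
        game_id,
    ).fetchval()

    # Step 5: close the loop by pointing the game at its current version.
    cur.execute("UPDATE ArcadeGame SET CurrentVersionID = ? WHERE ID = ?;", version_id, game_id)
    conn.commit()
except Exception:
    conn.rollback()
    raise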
That seems badly formed; I cannot see why you need this circular reference.
I would use only one table, ArcadeGame, with additional fields CurrentVersion and UploadDate.
You can then query it based on UploadDate, for example, depending on your needs. If you explain what you want from that DB, the answer could be more specific.

Can DB constraints ignore existing records and apply only to new data?

I want to learn the answer for different DB engines, but in our case:
we have some records that are not unique in a column, and now we want to make that column unique, which forces us to remove the duplicate values.
We use Oracle 10g. Is this reasonable? Or is this something like a goto statement :)? Should we really delete? What if we had millions of records?
To answer the question as posted: No, it can't be done on any RDBMS that I'm aware of.
However, like most things, you can work around it by doing the following.
Create a composite key, with a new column and the existing column.
You can make it unique without deleting anything by adding a new column; call it PartialKey.
For existing rows, set PartialKey to a unique value (starting at zero).
Create a unique constraint on the existing column and PartialKey (you can do this because each of these rows will now be unique).
For new rows, only use the default value of zero for PartialKey (because zero has already been used); this will force the existing column to have unique values in the table.
IMPORTANT EDIT
This is weak: if you delete the row with PartialKey 0, another row can then be added with a value that already exists in the original column, because the 0 in PartialKey will make the composite key appear unique.
You would need to ensure that either:
You never delete the row with PartialKey 0, or
You always have a dummy row with PartialKey 0, and you never delete it (or you immediately reinsert it automatically).
Edit: Bite the bullet and clean the data
If, as you said, you've just realised that the column should be unique, then you should (if possible) clean up the data. The above approach is a hack, and you'll find yourself writing more hacks when accessing the table (you may find you have two sets of logic for dealing with queries against that table: one for where the column IS unique, and one for where it's NOT). I'd clean this up now or it'll come back and bite you in the arse a thousand times over.
This can be done in SQL Server.
When you create a check constraint, you can set an option to apply it either to new data only or to existing data as well. The option of applying the constraint to new data only is useful when you know that the existing data already meets the new check constraint, or when a business rule requires the constraint to be enforced only from this point forward.
For example:
ALTER TABLE myTable
WITH NOCHECK
ADD CONSTRAINT myConstraint CHECK ( column > 100 )
You can do this using the ENABLE NOVALIDATE constraint state, but deleting the duplicates is the much preferred way.
You have to set your records straight before adding the constraints.
In Oracle you can put a constraint into the ENABLE NOVALIDATE state. When a constraint is in this state, all subsequent statements are checked for conformity to the constraint; however, any existing data in the table is not checked. A table with ENABLE NOVALIDATE constraints can contain invalid data, but it is not possible to add new invalid data to it. Enabling constraints in the NOVALIDATE state is most useful in data warehouse configurations that are uploading valid OLTP data.
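As a sketch of what that can look like from Python (using the python-oracledb driver and made-up credentials and table/column names; an older 10g instance would need the cx_Oracle driver or thick mode instead): because duplicates already exist, the unique constraint has to be backed by a non-unique index, otherwise Oracle would try to build a unique index and fail.
import oracledb

# Hypothetical credentials and object names.
conn = oracledb.connect(user="app", password="secret", dsn="localhost/XEPDB1")
cur = conn.cursor()

# Existing duplicates are left alone; only new or changed rows are checked,
# because the constraint is enabled NOVALIDATE over a NON-unique index.
# (DDL commits implicitly in Oracle.)
cur.execute("""
    ALTER TABLE my_table
      ADD CONSTRAINT uq_my_col UNIQUE (my_col)
      USING INDEX (CREATE INDEX ix_my_col ON my_table (my_col))
      ENABLE NOVALIDATE
""")
The non-unique index is the key detail: a unique index could not even be built while the duplicate values are still present.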
