Maximum value of global secondary index range key in DynamoDB

I have a table in DynamoDB with id as the primary key and a global secondary index (GSI). The GSI has p_identifier as its hash key and financialID as its range key. financialID is a 6-digit number starting at 100000. I need to get the maximum financialID so that the next record to be added can have its financialID incremented by 1.
Can anyone help me make this work? Also, is there any alternative way to do this?

I would use a different approach.
From your requirements I am assuming financialID should be unique.
The database won't prevent you from duplicating it, so some other part of your application has to keep these numbers in sync. What you need is an atomic counter.
If you must use DynamoDB alone, you should set up a table just for this task: a table with a hash primary key called financial_id_counter, which you atomically increment by 1, using the value returned as the next financialID.
This is not ideal, but it can work when issuing updates with UpdateItem ADD.
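A minimal sketch of that counter pattern with boto3 (the table name "counters", the key attribute "counter_name", and the attribute "current_value" are assumptions, not from the original post):

import boto3

counters = boto3.resource("dynamodb").Table("counters")

def next_financial_id():
    # ADD is applied atomically on the server, so two concurrent callers
    # can never receive the same value back.
    response = counters.update_item(
        Key={"counter_name": "financial_id_counter"},
        UpdateExpression="ADD current_value :inc",
        ExpressionAttributeValues={":inc": 1},
        ReturnValues="UPDATED_NEW",
    )
    # If ids must start at 100000, seed the counter item at 99999 once
    # before first use (an assumption about the setup).
    return int(response["Attributes"]["current_value"])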

If you need the financialID values to be in strict order, the approach suggested by @Chen is good.
On the other hand, if you just need a unique id here, you can use a UUID. There is still a very small chance of collision, so you should use the API with an existence condition: the call will fail if the id already exists, and you can then retry with another UUID.
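The "Exists" condition mentioned above belongs to the older API; current SDKs express the same idea with a condition expression. A rough sketch with boto3, assuming financialID is the partition key of the table being written (a conditional write only checks the item addressed by its own key), and with the table name made up for illustration:

import uuid
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("financial_records")  # table name assumed

def put_with_unique_id(item):
    # Retry with a fresh UUID in the unlikely event of a collision.
    for _ in range(3):
        item["financialID"] = str(uuid.uuid4())
        try:
            table.put_item(
                Item=item,
                ConditionExpression="attribute_not_exists(financialID)",
            )
            return item["financialID"]
        except ClientError as err:
            if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise
    raise RuntimeError("could not generate a unique financialID")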

Firstly, incrementing is not a good idea in DynamoDB, but the following can be a workaround:
You have to query using the equal-to operator on the hash key, say
p_identifier = 101
and set
ScanIndexForward = false
(which sorts the results descending by your range key), take the first item, and increment its financialID (see the sketch below).
If you don't know p_identifier, then you need to scan (which is not recommended) and manually find the largest key and increment it.
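A sketch of that descending query with boto3 (the table and index names are assumptions; the key names come from the question):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my_table")  # table name assumed

def max_financial_id(p_identifier):
    response = table.query(
        IndexName="p_identifier-financialID-index",   # GSI name assumed
        KeyConditionExpression=Key("p_identifier").eq(p_identifier),
        ScanIndexForward=False,  # descending by the range key, financialID
        Limit=1,                 # only the item with the largest financialID
    )
    items = response["Items"]
    return int(items[0]["financialID"]) if items else None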

Related

DynamoDB data model

I am designing a DynamoDB table with the following attributes:
uniqueID | TimeStamp | Type | Content | flag
I need to get a sorted list (by TimeStamp) of all rows that have flag set to true.
uniqueID is system generated ID.
TimeStamp is system time while populating table.
Number of distinct Type will be less than 10.
flag: true/false
I can think of the following 3 approaches:
1. Make uniqueID the partition key of the table, and create a global secondary index with flag as the partition key and TimeStamp as the sort key. Then I can query the GSI with flag as the hash key and get items sorted by TimeStamp.
The problem here is that flag has only two values (true and false), and the number of rows with flag set to false is very small compared to true, so there will be only 2 partitions. This loses all of DynamoDB's scaling characteristics.
2. Another alternative is making Type the partition key and TimeStamp the sort key of the GSI. This is better, but when querying I can't select all values of Type at once, since DynamoDB requires the hash key in a Query. So I would need to query this GSI once per Type value.
3. Scan the table (Scan operation): Scan returns all items with flag set to true without requiring a hash key, but it won't give me results sorted by TimeStamp.
After analyzing the use case, I think approach 1 is the best for now.
Could you please suggest any approach better than this?
Thanks in advance!
Any partition key based on flag or Type will be bad, because there are only a few possible values (2 and fewer than 10, respectively), so the way your data spreads across partitions will be skewed. You need something that provides a good distribution, and in your case the best candidate for the table's partition key is uniqueID.
The problem is that when you want to get results based on flag, especially when flag is true, you will get a lot of records, possibly the large majority. So DynamoDB's scaling won't help much anyway if you need to read back most of the records.
You can try to create a GSI with flag as the partition key and timestamp as the range key. This is not an ideal set of keys but covers what you need. Having a good key for the table means that you can later easily switch to another solution (e.g. scanning and not using the GSI). Keep in mind that if you want to avoid querying the table when using the GSI, you will have to project those attributes you want to return into the GSI.
So summing up, I think you can choose between the GSI and scanning:
Scanning can be slower (test it) but won't require additional data storage
GSI can be faster (test it) but will require additional data storage.
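A sketch of querying such a GSI with boto3 (the table and index names are assumptions; note that a key attribute cannot be a boolean, so flag is assumed here to be stored as the string "true"/"false"):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("events")  # table name assumed

def rows_with_flag_true(ascending=True):
    # Page through the GSI; items come back sorted by TimeStamp.
    items, start_key = [], None
    while True:
        kwargs = {
            "IndexName": "flag-TimeStamp-index",              # GSI name assumed
            "KeyConditionExpression": Key("flag").eq("true"),
            "ScanIndexForward": ascending,
        }
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        response = table.query(**kwargs)
        items.extend(response["Items"])
        start_key = response.get("LastEvaluatedKey")
        if not start_key:
            return items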

Cassandra, counters and delete by field

I have this specific use case: I'm storing counters in a table associated with a timestamp:
CREATE TABLE IF NOT EXISTS metrics(
    timestamp timestamp,
    value counter,
    PRIMARY KEY ((timestamp))
);
And I would like to delete all the metrics whose timestamp is lower than a specific value, for example:
DELETE FROM metrics WHERE timestamp < '2015-01-22 17:43:55-0800';
But this command is returning the following error:
code=2200 [Invalid query] message="Invalid operator < for PRIMARY KEY part timestamp"
How could I implement this functionality?
For a delete to work, you will need to provide a precise key with an equals operator. Deleting with a greater/less than operator does not work. Basically, you would have to obtain a list of the timestamps that you want to delete, and iterate through them with a (Python?) script or short (Java/C#) program.
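A rough sketch of that approach with the DataStax Python driver (cassandra-driver); the contact point and keyspace name are assumptions:

from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

# The driver returns timestamp columns as (UTC) datetime objects.
cutoff = datetime(2015, 1, 23, 1, 43, 55)   # 2015-01-22 17:43:55-0800 expressed in UTC

# The partition key can't be range-restricted in a DELETE, so read the keys first...
old_keys = [row.timestamp
            for row in session.execute("SELECT timestamp FROM metrics")
            if row.timestamp < cutoff]

# ...then issue one single-key DELETE per partition.
delete = session.prepare("DELETE FROM metrics WHERE timestamp = ?")
for ts in old_keys:
    session.execute(delete, [ts])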
One possible solution (if you happen to know how long you want to keep the data for) would be to set a time to live (TTL) on the data. On a table with counter columns, you cannot do this as part of an UPDATE command. The only option is to set it when you create the table:
CREATE TABLE IF NOT EXISTS metrics(
    timestamp timestamp,
    value counter,
    PRIMARY KEY ((timestamp))
) WITH default_time_to_live=259200;
This will remove data from the table 3 days (259200 seconds) after it is written.
EDIT
And it turns out that this possible solution isn't actually possible. Even though Cassandra lets you create a counter table with default_time_to_live set, it doesn't enforce it.
Back to my original paragraph, the only way to execute a DELETE is to provide the specific key that you are deleting. And for counter tables, it looks like that is probably the only way possible.

How can I get current autoincrement value

How can I get the last autoincrement value of a specific table right after I open the database? It's not last_insert_rowid(), because there has been no insert in this connection. In other words, I want to know in advance which number autoincrement will choose when a new row is inserted into a particular table.
It depends on how the autoincremented column has been defined.
If the column definition is INTEGER PRIMARY KEY AUTOINCREMENT, then SQLite will keep the largest ID in an internal table called sqlite_sequence.
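For example, a minimal sketch of reading that value with Python's built-in sqlite3 module (the table name is an assumption):

import sqlite3

conn = sqlite3.connect("mydatabase.db")
row = conn.execute(
    "SELECT seq FROM sqlite_sequence WHERE name = ?", ("mytable",)
).fetchone()

# sqlite_sequence only has a row for the table once at least one insert has
# happened; the next AUTOINCREMENT id will normally be seq + 1.
last_used_id = row[0] if row else 0
print(last_used_id)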
If the column definition does NOT contain the keyword AUTOINCREMENT, SQLite will use its ‘regular’ routine to determine the new ID. From the documentation:
The usual algorithm is to give the newly created row a ROWID that is one larger than the largest ROWID in the table prior to the insert. If the table is initially empty, then a ROWID of 1 is used. If the largest ROWID is equal to the largest possible integer (9223372036854775807) then the database engine starts picking positive candidate ROWIDs at random until it finds one that is not previously used. If no unused ROWID can be found after a reasonable number of attempts, the insert operation fails with an SQLITE_FULL error. If no negative ROWID values are inserted explicitly, then automatically generated ROWID values will always be greater than zero.
I remember reading that, for columns without AUTOINCREMENT, the only surefire way to determine the next ID is to VACUUM the database first; that will reset all ID counters to the largest existing ID for that table + 1. But I can’t find that quote anymore, so this may no longer be true.
That said, I agree with slash_rick_dot that fetching auto-incremented IDs beforehand is a bad idea, especially if there’s even a remote chance that another process might write to the database at the same time.
Different databases implement auto-increment differently. But as far as I know, none of them will answer the question you are asking.
The auto increment feature is intended to create a unique ID for a newly added table row. If a row hasn't been inserted yet, then the feature hasn't produced the id.
And it makes sense... If you did get the next auto-increment number, what would you do with it? Presumably you would assign it as the primary key of the not-yet-inserted row. But between the time you got the id and the time you used it to insert the row, the database could have handed that id to a row inserted by another process.
Your choices are this: manage the creation of ids yourself, or wait until rows are inserted before using their auto-created identifiers.

Is there a way to create a unique constraint on a column larger than 900 bytes?

I'm fairly new to SQL Server, so if anything I say doesn't make sense, there's a good chance I'm just confused by something. Anyway...
I have a simple mapping table. It has two columns, Before and After. All I want is a constraint that the Before column is unique. Originally it was set to be a primary key, but this created errors when the value was too large. I tried adding an ID column as a primary key and then adding UNIQUE to the Before column, but I have the same problem with the max length exceeding 900 bytes (I guess the constraint creates an index).
The only option I can think of is to change the id column to a checksum column and make that the primary key, but I dislike this option. Is there a different way to do this? I just need two simple columns.
The only way I can think of to guarantee uniqueness inside the database is to use an INSTEAD OF trigger. The link I provided to MSDN has an example for checking uniqueness. This solution will most likely be quite slow indeed, since you won't be able to index on the column being checked.
You could speed it up somewhat by using a computed column that stores a hash of the Before column, perhaps using the HASHBYTES function. You could then create a non-unique index on that hash column, and inside your trigger check for the negative case first: if no row with the same hash exists, exit the trigger. If there is another row with the same hash, you can then do the more expensive check for an exact duplicate, and raise an error if the user enters a duplicate value. You might also be able to simplify the check by comparing both the hash value and the Before value in one EXISTS() clause, but I haven't played around with the performance of that solution.
(Note that the HASHBYTES function I referred to itself can hash only up to 8000 bytes. If you want to go bigger than that, you'll have to roll your own hash function or live with the collisions caused by the CHECKSUM() function)
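A rough sketch of that setup, issued here through pyodbc (the connection string, the table name dbo.Mapping, and the column/index/trigger names are assumptions; the HASHBYTES input cap mentioned above still applies):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=MyDb;Trusted_Connection=yes"
)
cur = conn.cursor()

# Persisted hash of the wide column, plus a small non-unique index on it.
cur.execute("ALTER TABLE dbo.Mapping ADD BeforeHash AS "
            "(CAST(HASHBYTES('SHA1', Before) AS binary(20))) PERSISTED;")
cur.execute("CREATE INDEX IX_Mapping_BeforeHash ON dbo.Mapping (BeforeHash);")

# INSTEAD OF INSERT trigger: cheap hash probe first, full string compare only on a hit.
# (Duplicates inside a single multi-row INSERT are not caught by this simple check.)
cur.execute("""
CREATE TRIGGER dbo.trg_Mapping_UniqueBefore ON dbo.Mapping
INSTEAD OF INSERT AS
BEGIN
    IF EXISTS (SELECT 1
               FROM dbo.Mapping m
               JOIN inserted i
                 ON m.BeforeHash = CAST(HASHBYTES('SHA1', i.Before) AS binary(20))
                AND m.Before = i.Before)
    BEGIN
        RAISERROR('Duplicate value in Before column.', 16, 1);
        ROLLBACK TRANSACTION;
        RETURN;
    END;
    INSERT INTO dbo.Mapping (Before, [After])
    SELECT Before, [After] FROM inserted;
END;
""")
conn.commit()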

Hash functions - SQL Studio Express

I need to create a hash key on my tables for uniqueness, and someone mentioned MD5 to me. But I have read about checksum and binary sum; would these not serve the same purpose, i.e. to ensure no duplicates in a specific field?
Now I have managed to implement this and I see the hash keys in my tables.
Do I need to alter the index keys originally created, since I created a new index key with these hash keys? Also, do I need to change the keys?
How do I change my queries, for example SELECT statements?
I guess I am still unsure how hash keys really help in queries, other than for uniqueness.
If your goal is to ensure no duplicates in a specific field, why not just apply a unique index to that field and let the database engine do what it was meant to do?
It makes no sense to write a unique function to replace SQL Server unique constraints/indexes.
How are you going to ensure the hash is unique? With a constraint?
If you index it (which may not be allowed because of determinism), then the optimiser will treat it as non-unique. As well as killing performance.
And you only have a few hundred thousand rows. Peanuts.
Given time I could come up with more arguments, but I'll summarise: don't do it.
There's always the HashBytes() function. It supports MD5, but if you don't like that there's an option for SHA1.
As for how this can help queries: one simple example is if you have a large varchar column (maybe varchar(max)) and in your query you want to know whether the contents of this column match a particular string. If you have to compare your search string with every single record, it could be slow. But if you hash your search string and compare the hashes instead, things can go much faster, since it is now just a very short binary comparison.
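A small sketch of that lookup from Python via pyodbc (the table and column names are assumptions; body_md5 is presumed to hold HASHBYTES('MD5', body) for an nvarchar body column):

import hashlib
import pyodbc

def find_matches(conn, search_text):
    # HASHBYTES over an NVARCHAR column hashes the UTF-16-LE bytes,
    # so hash the same byte representation here.
    digest = hashlib.md5(search_text.encode("utf-16-le")).digest()
    cur = conn.cursor()
    # The short binary compare on the (indexed) hash narrows the candidates;
    # the full string compare confirms the match and catches hash collisions.
    cur.execute("SELECT id FROM dbo.Documents WHERE body_md5 = ? AND body = ?",
                digest, search_text)
    return [row.id for row in cur.fetchall()]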
Cryptographically secure hash functions are one-way functions, and they consume more resources (CPU cycles) than functions that are not cryptographically secure. If you just need a hash key, you do not need that property. All you need is a low probability of collisions, which is related to uniformity. Try CRC if you have strings, or modulo for numbers.
http://en.wikipedia.org/wiki/Hash_function
Why don't you use a GUID column with a default of NEWSEQUENTIALID()? Don't use NEWID(), since it is horrible for clustering; see here: Best Practice: Do not cluster on UniqueIdentifier when you use NewId.
Make this column the primary key and you are pretty much done.
