Managing IDs in an asynchronous world - database

When a new record can be created at potentially any number of locations (i.e. different mobile devices), how do you guarantee that each record gets a unique identity?
(In my SQL-steeped worldview, the default type of an ID is an int or long, though I gladly consider other possibilities.)
The solutions I've considered are:
Assign each device a pile of IDs that is (hopefully) more than it will use between syncs, and replenish it when syncing.
Assign each newly created record a temporary ID (a Guid) until it can be assigned a "real" ID by the System of Record.
Use Guids as IDs.
Block the creation process until the ID is provided by the System of Record (not preferred, due to possible network interruption).
Use a primary value (e.g. Name) as an ID (also not preferred, due to the potential for the primary value to change).
These are what I've come up with on my own, but since this is the type of problem that has certainly already been solved ten million times, what are the accepted solutions?

You could give each device a unique id (which could be set during an initial on-line registration) and have each device do its own numbering. The records themselves would use a composite primary key, (originDeviceId, recordId), which is then guaranteed to be unique across all devices and has several other advantages: there is no need to change the key when syncing with the server, and the key can be used to build relations on the off-line remote device right from the start.
The main downside is that you need two columns to reference a record. A slightly hacky workaround is to keep those two columns as ints and add another, computed, column as a bigint built from the two of them. The catch is that most RDBMSs have no left-shift operator, but the same effect can be had by multiplying by a power of two. You then build relations using that computed field, for example:
SELECT f.* FROM t_File AS f
JOIN t_User AS u ON f.UserId = u.Id
-- t_File.UserId is a bigint, and t_User.Id is the computed column deviceId * POWER(2, 32) + recordId
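For concreteness, here is a minimal sketch of such a computed key column (SQL Server syntax; the table layout is illustrative and not taken from the answer above):
-- Sketch (SQL Server syntax; table and column names are illustrative): per-device numbering
-- with a computed 64-bit surrogate key that other tables can reference through a single bigint column.
CREATE TABLE t_User (
    DeviceId int NOT NULL,            -- assigned once, during the initial on-line registration
    RecordId int NOT NULL,            -- numbered locally by the originating device
    Name     nvarchar(100) NOT NULL,
    -- bigint equivalent of (DeviceId, RecordId): DeviceId "shifted left" 32 bits via multiplication
    Id AS (CAST(DeviceId AS bigint) * CAST(4294967296 AS bigint) + RecordId) PERSISTED,
    CONSTRAINT PK_t_User PRIMARY KEY (DeviceId, RecordId),
    CONSTRAINT UQ_t_User_Id UNIQUE (Id)
);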
Another downside is that it limits you to the int maximum for records per device, which may or may not be enough in your case, but at least uniqueness is guaranteed.
The last downside I see is the need for that initial registration step to get assigned a unique device id.

Related

DynamoDB - Design 1 to Many relationship

I'm new to DynamoDB but not to NoSQL (I've already done some projects using Firebase).
Having read that a DynamoDB best practice is one table per application, I've been having a hard time working out how to design my 1-to-N relationship.
I have this entity (pseudo-json):
{
  machineId: 'HASH_ID',
  machineConfig: { /* a lot of fields */ }
}
A machineConfig is unique to each machine and changes rarely, and only by an administrator (so there is no consistency issue here).
The issue is that I have to manage a log of data from the sensors of each machine. The log is described as:
{
  machineId: 'HASH_ID',
  sensorsData: [
    /* Huge list of: */
    { timestamp: ..., data: { /* lot of fields */ } },
    ...
  ]
}
I want to keep my machineConfig in one place. The log list can't be embedded in the machine entity, because it's a continuous stream of data accumulated over time.
Furthermore, I don't understand what the composite key should be. The partition key is obviously the machineId, but what about the sort key?
How should I design this relationship, taking into account the potential size of the data?
You could do this with 1 table. The primary key could be (machineId, sortKey) where machineId is the partition key and sortKey is a string attribute that is going to be used to cover the 2 cases. You could probably come up with a better name.
To store the machineConfig you would insert an item with primary key (machineId, "CONFIG"). The sortKey attribute would have the constant value CONFIG.
To store the sensorsData you could use the timestamp as the sortKey value. You would insert a new item for each piece of sensor data, storing the timestamp as a string (time since the epoch, ISO 8601, etc.).
Then to query everything about a machine you would run a Dynamo query specifying just the machineId partition key - this would return many items including the machineConfig and the sensor data.
To query just the machineConfig you would run a Dynamo query specifying the machineId partition key and the constant CONFIG as the sortKey value.
To query the sensor data you could specify an exact timestamp or a timestamp range for the sortKey. If you need to query the sensor data by other values then this design might not work as well.
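A sketch of those three access patterns, written as PartiQL statements (DynamoDB's SQL-compatible query language); the table name MachineData and the literal values are assumptions, not from the answer above:
-- Sketch using PartiQL (DynamoDB's SQL-compatible query language).
-- Table name "MachineData", attribute name "sortKey" and the literal values are assumptions.

-- Everything about one machine: the config item and all sensor items share the partition key.
SELECT * FROM "MachineData" WHERE machineId = 'machine-123';

-- Just the config item: the constant sort key isolates it.
SELECT * FROM "MachineData" WHERE machineId = 'machine-123' AND sortKey = 'CONFIG';

-- Sensor data for a time range, relying on the sort key being an ISO 8601 timestamp string.
SELECT * FROM "MachineData"
WHERE machineId = 'machine-123'
  AND sortKey BETWEEN '2020-01-01T00:00:00Z' AND '2020-01-31T23:59:59Z';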
Edit, to answer the follow-up question:
You would have to resort to a scan with a filter to return all machines with their machineId and machineConfig. If you end up inserting a lot of sensor data then this will be a very expensive operation to perform as Dynamo will look at every item in the table. If you need to do this you have a couple of options.
If there are not a lot of machines you could insert an item with a primary key like ("MACHINES", "ALL") and a list of all the machineIds. You would query on that key to get the list of machineIds, then you would do a bunch of queries (or a batch get) to retrieve all the related machineConfigs. However since the max Dynamo item size is 400KB you might not be able to fit them all.
If there are too many machines to fit in one item you could alter the above approach a bit and have ("MACHINES", $machineIdSubstring) as a primary key and store chunks of machineIds under each sort key. For example, all machineIds that start with 0 go in ("MACHINES", "0"). Then you would query by each primary key 0-9, build a list of all machineIds and query each machine as above.
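For example, under the same naming assumptions as before, fetching the index item that lists the machineIds (the single ("MACHINES", "ALL") variant) would look like:
-- Sketch (PartiQL again; names are assumptions): read the item that holds the list of machineIds.
SELECT machineIds FROM "MachineData" WHERE machineId = 'MACHINES' AND sortKey = 'ALL';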
Alternatively, you don't have to put everything in 1 table - it is just a guideline that fits a lot of use cases. If there are too many machines to fit in less than 400KB but there aren't tens of thousands and you aren't trying to query all of them all the time, you could have a separate table of machineId and machineConfig that you resort to scanning when necessary.

DB Sync Issue with PK ID Conflict - Design Help Needed

The scenario:
I have a local DB and a remote public DB. Both are synced using SQLyog SJA job files, which sync the two DBs to be the same. It works well.
The part of the DB with the issue is a user comments table.
The local DB contains thousands of user comments, and more are always being added through various means. These are all synced to the remote DB's comments table.
The remote DB's comments table also accepts user comments entered directly. These then sync back to the local comments table.
It is a two way sync, where neither one deletes from the other. This actually seems to work well and is automated through SJA.
The problem:
The primary key IDs for the comments are auto-incremented on both sides.
So if both tables are in exact sync and the key count is at 50, and a user makes a remote entry, it gets key 51. Meanwhile the local DB is also growing, and a different entry is created there under key 51. When the next sync runs there is a problem, because the keys conflict.
Possible solutions:
My first thought was to add a large offset to the remote comments' PK IDs as they are created. That way the primary keys would never conflict at sync time, because the local PK ID would never get that high.
It worked well on the first sync, but the auto-increment feature continues from the highest existing value, even when there is a large gap between the keys, so this solution does not work.
I would like to maintain a single table for user comments and have a seamless sync, but the conflicting primary keys are a problem.
I am interested if other people have some thoughts on the matter.
I hope I described the problem clearly.
Thanks.
------ EDIT -----------
I have found a solution that works.
I changed the primary key ID to just a normal INT with auto-increment. I then created a second ID field holding a random number about 10 digits long. I now use the two ID fields together as a composite primary key. The chance of a conflict between the two DBs is now essentially non-existent: the auto-incremented ID and the long random ID would both have to collide on the same entry, which is highly unlikely at the volume I'm dealing with.
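A minimal sketch of that layout (MySQL syntax, since SQLyog implies MySQL; the column names are illustrative, and the random column is shown as BIGINT because a 10-digit value does not fit in a signed INT):
-- Sketch of the layout described above (MySQL syntax; names are illustrative).
CREATE TABLE user_comments (
    local_id  INT NOT NULL AUTO_INCREMENT,   -- still auto-incremented independently on each side
    random_id BIGINT NOT NULL,               -- e.g. FLOOR(RAND() * 10000000000), generated at insert time
    comment   TEXT,
    PRIMARY KEY (local_id, random_id)        -- both parts would have to collide for a sync conflict
);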
Not the best solution but it works well.
Hope this helps someone else out.

Single Big SQL Server lookup table

I have a SQL Server 2008 database with a snowflake-style schema, so lots of different lookup tables: Language, Countries, States, Status, etc. All these lookup tables have almost identical structures: two columns, Code and Decode. My project manager would like all of these different tables to be one BIG table, so I would need another column, say CodeCategory, and the primary key columns for this big table would be CodeCategory and Code. The problem is that for any of the tables that hold an actual code (say Language Code), I cannot establish a foreign key relationship into this big decode table, because the CodeCategory would not be in the fact table, just the code. And codes by themselves will not be unique (they are only unique within a CodeCategory), so I cannot make an FK from just the fact table's code field into the big lookup table's Code field.
So am I missing something, or is this impossible to do and still be able to do FKs in the related tables? I wish I could do this: have a FK where one of the columns I was matching to in the lookup table would match to a string constant. Like this (I know this is impossible but it gives you an idea what I want to do):
ALTER TABLE [dbo].[Users] WITH CHECK ADD CONSTRAINT [FK_User_AppCodes]
FOREIGN KEY('Language', [LanguageCode])
REFERENCES [dbo].[AppCodes] ([AppCodeCategory], [AppCode])
The above does not work, but if it did I would have the FK I need. Where I have the string 'Language', is there any way in T-SQL to substitute the table name from code instead?
I absolutely need the FKs, so if nothing like this is possible then I will have to stick with my many little lookup tables. Any assistance would be appreciated.
Brian
It is not impossible to accomplish this, but it is impossible to accomplish this and not hurt the system on several levels.
While a single lookup table (as has been pointed out already) is a truly horrible idea, I will say that this pattern does not require a single field PK or that it be auto-generated. It requires a composite PK comprised of ([AppCodeCategory], [AppCode]) and then BOTH fields need to be present in the fact table that would have a composite FK of both fields back to the PK. Again, this is not an endorsement of this particular end-goal, just a technical note that it is possible to have composite PKs and FKs in other, more appropriate scenarios.
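As a sketch of what that composite FK looks like with the names from the question (the extra category column on Users, its DEFAULT, and the CHECK constraint are illustrative additions, not part of the original post):
-- Sketch (T-SQL), using the names from the question. Assumes dbo.AppCodes has its composite
-- PK on (AppCodeCategory, AppCode) as described above. The extra category column on Users,
-- its DEFAULT and the CHECK constraint are illustrative additions.
ALTER TABLE dbo.Users ADD
    LanguageCodeCategory varchar(30) NOT NULL
        CONSTRAINT DF_Users_LanguageCodeCategory DEFAULT ('Language')
        CONSTRAINT CK_Users_LanguageCodeCategory CHECK (LanguageCodeCategory = 'Language');

ALTER TABLE dbo.Users WITH CHECK
    ADD CONSTRAINT FK_User_AppCodes
    FOREIGN KEY (LanguageCodeCategory, LanguageCode)
    REFERENCES dbo.AppCodes (AppCodeCategory, AppCode);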
The main problem with this type of approach to constants is that each constant is truly its own thing: Languages, Countries, States, Statii, etc. are all completely separate entities. While the structure of them in the database is the same (as of today), the data within that structure does not represent the same things. You would be locked into a model that either disallows adding additional lookup fields later (such as ISO codes for Language and Country but not the others, or something related to States that is not applicable to the others), or would require adding NULLable fields with no way to know which category or categories they apply to (have fun debugging issues related to that, and/or explaining to the new person -- who has been there for 2 days and is tasked with writing a new report -- that the 3-digit ISO Country Code does not apply to the "Deleted" status).
This approach also requires that you maintain an arbitrary "Category" field in all related tables. And that is per lookup. So if you have CountryCode, LanguageCode, and StateCode in the fact table, each of those FKs gets a matching CategoryID field, so now that is 6 fields instead of 3. Even if you were able to use TINYINT for CategoryID, if your fact table has even 200 million rows, then those three extra 1 byte fields now take up 600 MB, which adversely affects performance. And let's not forget that backups will take longer and take up more space, but disk is cheap, right? Oh, and if backups take longer, then restores also take longer, right? Oh, but the table has closer to 1 billion rows? Even better ;-).
While this approach looks maybe "cleaner" or "easier" now, it is actually more costly in the long run, especially in terms of wasted developer time, as you (and/or others) in the future try to work around issues related to this poor design choice.
Has anyone even asked your project manager what the intended benefit of this is? It is a reasonable question if you are going to spend some amount of hours making changes to the system that there be a stated benefit for that time spent. It certainly does not make interacting with the data any easier, and in fact will make it harder, especially if you choose a string for the "Category" instead of a TINYINT or maybe SMALLINT.
If your PM still presses for this change, then it should be required, as part of that project, to also change any enums in the app code accordingly so that they match what is in the database. Since the database is having its values munged together, you can accomplish that in C# (assuming your app code is in C#; if not, translate to whatever is appropriate) by setting the enum values explicitly, with a pattern where the first X digits are the "category" and the remaining Y digits are the "value". For example:
Assume the "Country" category == 1 and the "Language" catagory == 2, you could do:
enum AppCodes
{
    // Countries
    UnitedStates  = 1000001,
    Canada        = 1000002,
    SomewhereElse = 1000003,
    // Languages
    EnglishUS     = 2000001,
    EnglishUK     = 2000002,
    French        = 2000003
};
Absurd? Completely. But also analogous to the request of merging all lookup tables into a single table. What's good for the goose is good for the gander, right?
Is this being suggested so you can minimise the number of admin screens you need for CRUD operations on your standing data? I've been here before and decided it was better/safer/easier to build a generic screen which used metadata to decide what table to extract from/write to. It was a bit more work to build but kept the database schema 'correct'.
All the standing data tables had the same basic structure, they were mainly for dropdown population with occasional additional fields for business rule purposes.

General database design: Is it ever considered "okay" to create a non-normalized table on purpose?

After-edit: Wow, this question got long. Please forgive =\
I am creating a new table consisting of over 30 columns. These columns are largely populated by selections made from dropdown lists and their options are largely logically related. For example, a dropdown labeled Review Period will have options such as Monthly, Semi-Annually, and Yearly. I came up with a workable method to normalize these options down to numeric identifiers by creating a primitives lookup table that stores values such as Monthly, Semi-Annually, and Yearly. I then store the IDs of these primitives in the table of record and use a view to join that table out to my lookup table. With this view in place, the table of record can contain raw data that only the application understands while allowing external applications and admins to run SQL against the view and return data that is translated into friendly information.
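For illustration, a minimal sketch of the arrangement described above, with made-up table and column names rather than the actual schema:
-- Sketch of the arrangement described above (generic SQL; table and column names are
-- illustrative, not the poster's actual schema).
create table primitives (
    primitive_id int not null primary key,
    value        varchar(50) not null       -- 'Monthly', 'Semi-Annually', 'Yearly', ...
);

create table table_of_record (
    record_id        int not null primary key,
    review_period_id int not null,
    -- ... roughly 30 more dropdown-backed columns ...
    foreign key (review_period_id) references primitives (primitive_id)
);

-- The view translates the raw IDs back into friendly text for admins and external apps.
create view vw_table_of_record as
select r.record_id,
       p.value as review_period
from table_of_record r
join primitives p on p.primitive_id = r.review_period_id;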
It just got complicated. Now these dropdown lists are going to have non-logically-related items. For example, the Review Period dropdown list now needs to have options of NA and Manual. This blows my entire grouping scheme out of the water.
Similar constructs that have been used in this application have resorted to storing repeated string values across multiple records. This means you could have hundreds of records with the string 'Monthly' stored in the table's ReviewPeriod column. The thought of this happening has made me cringe since I've started working here, but now I am starting to think that non-normalized data may be the best option here.
The only other way I can think of doing this using my initial method while allowing it to be dynamic and support the constant adding of new options to any dropdown list at any time is this: When saving the data to the database, iterate through every single property of my business object (.NET class in this case) and check for any string value that exists in the primitives table. If it doesn't, add it and return the auto-generated unique identifier for storage in the table of record. It seems so complicated, but is this what one is to go through for the sake of normalized data?
Anything is possible. Nobody is going to haul you off to denormalization jail and revoke your DBA card. I would say that you should know the rules and what breaking them means. Once you have those in hand, it's up to you and your best judgement to do what you think is best.
I came up with a workable method to normalize these options down to numeric identifiers by creating a primitives lookup table that stores values such as Monthly, Semi-Annually, and Yearly. I then store the IDs of these primitives in the table of record and use a view to join that table out to my lookup table.
Replacing text with ID numbers has nothing at all to do with normalization. You're describing a choice of surrogate keys over natural keys. Sometimes surrogate keys are a good choice, and sometimes surrogate keys are a bad choice. (More often a bad choice than you might believe.)
This means you could have hundreds of records with the string 'Monthly' stored in the table's ReviewPeriod column. The thought of this happening has made me cringe since I've started working here, but now I am starting to think that non-normalized data may be the best option here.
Storing the string "Monthly" in multiple rows has nothing to do with normalization. (Or with denormalization.) This seems to be related to the notion that normalization means "replace all text with id numbers". Storing text in your database shouldn't make you cringe. VARCHAR(n) is there for a reason.
The only other way I can think of doing this using my initial method while allowing it to be dynamic and support the constant adding of new options to any dropdown list at any time is this: When saving the data to the database, iterate through every single property of my business object (.NET class in this case) and check for any string value that exists in the primitives table. If it doesn't, add it and return the auto-generated unique identifier for storage in the table of record.
Let's think about this informally for a minute.
Foreign keys provide referential integrity. Their purpose is to limit the values allowed in a column. Informally, the referenced table provides a set of valid values. Values that aren't in that table aren't allowed in the referencing column of other tables.
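For example, a minimal sketch with made-up names, where the text itself is the key and the foreign key does the constraining:
-- Minimal illustration (generic SQL; table and column names are made up for this example):
-- the lookup table defines the set of valid values, and the foreign key rejects anything else.
create table review_period (
    review_period varchar(20) not null primary key
);

insert into review_period (review_period)
values ('Monthly'), ('Semi-Annually'), ('Yearly'), ('NA'), ('Manual');

create table review (
    review_id     int not null primary key,
    review_period varchar(20) not null,
    foreign key (review_period) references review_period (review_period)
);

-- Succeeds: 'Monthly' is in the set of valid values.
insert into review (review_id, review_period) values (1, 'Monthly');
-- Fails with a foreign key violation: 'Fortnightly' is not.
insert into review (review_id, review_period) values (2, 'Fortnightly');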
But no matter what the user types in, you're going to add it to that table of valid values.
If you're going to accept everything the user types in the first place, why use a foreign key at all?
The main problem here is that you've been poorly served by the people who taught you (mis-taught you) the relational model. (And, probably, equally poorly by the people who taught you SQL.) I hope you can unlearn those mistaken notions quickly, and soon make real progress.

Database design, huge number of parameters, denormalise?

Given the table tblProject, which has a myriad of properties: width, height, etc. Dozens of them.
I'm adding a new module which lets you specify settings for your project for mobile devices. This is a 1-1 relationship, so all the mobile settings should be stored in tblProject. However, the list is getting huge, and there will be some ambiguity among properties (i.e. I will have to prefix all mobile fields with MOBILE so that Mobile_width isn't confused with width).
How bad is it to denormalise and store the mobile settings in another table? Or is there a better way to store these settings? The properties are becoming unwieldy and hard to modify/find in the table.
I want to respond to @Alexander Sobolev's suggestion and provide my own.
@Alexander Sobolev suggests an EAV model. This trades maximum flexibility for poor performance and complexity, since you need to join multiple times to get all the values for an entity. The way you typically work around those issues is to keep all the entity metadata (i.e. tblProperties) in memory so you don't join to it at runtime, and to denormalize the values (i.e. tblProjectProperties) as a CLOB (e.g. XML) on the root table. Thus you only use the values table for querying and sorting, not to actually retrieve the data. You also usually end up caching the actual entities by ID so you don't pay the cost of deserialization each time. The issues you then run into are cache invalidation of the entities and their metadata. So overall it is a non-trivial approach.
What I would do instead is create a separate table, perhaps more than one depending on your data, with a discriminator/type column:
create table properties (
    root_id int,
    type_id int,
    height  int,
    width   int
    -- ...etc...
)
Make the unique key a combination of root_id and type_id, where type_id would be representative of mobile, for instance (assuming a separate lookup table, as in the sketch below).
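For example, a minimal sketch of that constraint and the lookup table it assumes (the property_types table and its contents are illustrative):
-- Sketch of the unique constraint and the type lookup mentioned above (names are illustrative).
create table property_types (
    type_id int not null primary key,
    name    varchar(32) not null          -- e.g. 'desktop', 'mobile'
);

alter table properties
    add constraint fk_properties_type foreign key (type_id) references property_types (type_id);

alter table properties
    add constraint uq_properties_root_type unique (root_id, type_id);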
There is nothing wrong with storing the mobile section in another table. It could even bring some savings, depending on how much this information is used.
You can store it in another table, or use an even more elaborate version with three tables: your tblProject, a tblProperties, and a tblProjectProperties.
create table tblProperties (
    id int identity(1,1) not null primary key,
    prop_name nvarchar(32),
    prop_description nvarchar(1024)
)
create table tblProjectProperties
(
    ProjectUid int not null,
    PropertyUid int not null,
    PropertyValue nvarchar(256)
)
with a foreign key tblProjectProperties.ProjectUid -> tblProject.uid
and a foreign key tblProjectProperties.PropertyUid -> tblProperties.id.
The thing is, if you have different types of projects which use different properties, you don't need to store all those unused NULLs; you store only the properties you really need for a given project. The schema above gives you some flexibility. You can create views for the different project types and use them to avoid too many joins in user selects, as sketched below.
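A minimal sketch of such a view, pivoting a couple of illustrative properties back into columns for one project type:
-- Sketch of a view that pivots the EAV rows back into columns for one project type.
-- The property names 'Mobile_width' and 'Mobile_height' are illustrative; the tables are the ones above.
create view vw_MobileProjectProperties as
select pr.uid as ProjectUid,
       max(case when p.prop_name = 'Mobile_width'  then pp.PropertyValue end) as Mobile_width,
       max(case when p.prop_name = 'Mobile_height' then pp.PropertyValue end) as Mobile_height
from tblProject pr
join tblProjectProperties pp on pp.ProjectUid = pr.uid
join tblProperties p         on p.id = pp.PropertyUid
group by pr.uid;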
