We store documents in our database (SQL Server); the documents are spread across various tables, so there is no one table that contains all of them.
I now have a requirement to give all documents a system-wide unique ID, one that is semi-readable, not like a GUID.
I've seen this done before by creating a single table with a single row/column with just a number that gets incremented when a new document is created.
Is this the best way to go about it, and how do I ensure that no one reads the current number while someone else is about to update it, and vice versa?
In this case the number can be something like 001 and auto-increment as required, I'm mainly worried about stopping collisions rather than getting a fancy identifier.
If you want the single row/column approach, I've used:
declare @MyRef int
update CoreTable set @MyRef = LastRef = LastRef + 1
The update is safe: each caller who executes it will receive a distinct result in @MyRef. This is safer than doing a separate read, increment, and update.
Table definition:
create table CoreTable (
X char(1) not null,
LastRef int not null,
constraint PK_CoreTable PRIMARY KEY (X),
constraint CK_CoreTable_X CHECK (X = 'X')
)
insert into CoreTable (X,LastRef) values ('X',0)
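Putting it together, a minimal sketch of claiming the next reference as a stored procedure (the procedure name is an illustrative assumption):
create procedure GetNextDocumentRef
    @NextRef int output
as
begin
    -- Increment and read in a single atomic statement; the row
    -- stays locked until the surrounding transaction completes.
    update CoreTable set @NextRef = LastRef = LastRef + 1
end
Because the increment and the read happen in one atomic UPDATE, two concurrent callers can never receive the same value; the second caller simply blocks until the first one's lock is released.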
You can use Redis for this. Have a look at this article: http://rediscookbook.org/create_unique_ids.html
Redis is a very fast in-memory NoSQL database with persistence capabilities. You can quickly set up a Redis instance and use it to generate incremental numbers that are guaranteed to be unique.
You can then leverage Redis for many other purposes in your app.
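For example, in a redis-cli session (the key name is an assumption), INCR creates the key on first use and returns the new value each time:
INCR next_document_id
(integer) 1
INCR next_document_id
(integer) 2
INCR is atomic, so two concurrent clients can never receive the same number.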
Another suggestion for your inquiry that does not involve installing Redis is to use a single DB row/column as you suggested, while encapsulating it in a transaction. That way you won't run into conflicts.
One 'classic' approach would indeed be to have a separate table (e.g. Documents) with (at least) an ID column (int identity). Then, add a foreign key column and constraint to all your existing document tables. This ensures the uniqueness of the document ID across all tables.
Something like this:
CREATE TABLE Documents (Id int identity not null)
ALTER TABLE DocumentTypeOne
ADD CONSTRAINT DocumentTypeOne_Documents_FK
FOREIGN KEY (DocumentId) REFERENCES Documents (Id)
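Creating a new document of type one then becomes a two-step insert; a minimal sketch in T-SQL (the Title column is an illustrative assumption):
-- Claim a new system-wide document id.
INSERT INTO Documents DEFAULT VALUES
DECLARE @NewDocId int = SCOPE_IDENTITY()
-- Store the actual document row under that id.
INSERT INTO DocumentTypeOne (DocumentId, Title) VALUES (@NewDocId, 'Example')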
Suppose I have the following tables
Companies
--CompanyID
--CompanyName
and
Locations
--LocationID
--CompanyID
--LocationName
Every company has at least one location. I want to track a primary location for each company (and yes, every company will have exactly one primary location). What's the best way to set this up? Add a primaryLocationID in the Companies table?
Add a primaryLocationID in the Companies table?
Yes; however, that creates a circular reference which could prevent you from inserting new data: a new Location needs an existing Company, while a new Company needs an existing primary Location.
One way to resolve this chicken-and-egg problem is to simply leave Company.PrimaryLocationID NULL-able, so you can temporarily disable one of the circular FKs. This unfortunately means the database will enforce only "1:0..1", but not the strict "1:1" relationship (so you'll have to enforce it in the application code).
However, if your DBMS supports deferred constraints (such as Oracle or PostgreSQL), you can simply defer one of the FKs to break the cycle while the transaction is still in progress. By the end of the transaction both FKs have to be in place, resulting in a real "1:1" relationship.
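For example, in PostgreSQL (table and column names follow the question; the constraint name is illustrative):
ALTER TABLE Companies
    ADD CONSTRAINT Companies_PrimaryLocation_FK
    FOREIGN KEY (PrimaryLocationID) REFERENCES Locations (LocationID)
    DEFERRABLE INITIALLY DEFERRED;
Inside a transaction you can then insert the company and its primary location in either order; the deferred FK is only checked at COMMIT, by which point both rows exist.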
The alternative solution is to have a flag in the Locations table that is set for a primary location, and NULL non-primary locations (note the U1, denoting a UNIQUE constraint, ensuring a company cannot have multiple primary locations):
CREATE TABLE Location (
LocationID INT PRIMARY KEY,
CompanyID INT NOT NULL, -- References Company table, not shown here.
LocationName VARCHAR(50) NOT NULL, -- Possibly UNIQUE?
IsPrimary INT CHECK (IsPrimary IS NULL OR IsPrimary = 1), -- Use a BIT or BOOLEAN if supported by your DBMS.
CONSTRAINT Locations_U1 UNIQUE (CompanyID, IsPrimary)
);
Unfortunately, this has some problems:
It can only guarantee up to "1:0..1" (but not the real "1:1") even on a DBMS that supports deferred constraints.
It requires an additional index (in support of the UNIQUE constraint). Every index brings certain overhead, mostly for INSERT/UPDATE/DELETE performance. Furthermore, secondary indexes in clustered tables contain a copy of the PK, which may make them "fatter" than expected.
It depends on ANSI-compliant composite UNIQUE constraints, which allow duplicate rows if any (but not necessarily all) of the fields are NULL. Unfortunately not all DBMSes follow the standard, so the above would not work out-of-the-box under Oracle or MS SQL Server (but would work under PostgreSQL and MySQL). You could use a filtered unique index instead of the UNIQUE constraint to work around that, but not all DBMSes support that either.
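On SQL Server, for instance, that work-around looks like this (the index name is illustrative):
CREATE UNIQUE INDEX UQ_Location_PrimaryPerCompany
    ON Location (CompanyID)
    WHERE IsPrimary = 1;
Only rows with IsPrimary = 1 enter the index, so each company can have at most one primary location while non-primary rows remain unconstrained. PostgreSQL supports the same idea as a partial unique index.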
BaBL86's solution models M:N, while your requirement seems to be 1:N. Nonetheless, that model could be "coerced" into 1:N by placing a key on {LocationID} (and on {CompanyID, TypeOfLocation} to ensure there cannot be multiple locations of the same type for the same company), but it is probably over-engineered for a simple "is primary" requirement.
I think your own solution is the best one - this ensures that every company can only have one primary location. By making it a NOT NULL column, you can even enforce that every company should have a primary location.
Using BaBL86's solution, you don't have those constraints: a company could have anywhere from zero to an unlimited number of 'primary locations', which obviously shouldn't be possible.
Do note that, if you use foreign key constraints AND define primaryLocationID as a NOT NULL column, you'll run into problems, because you basically have a loop (Location points to Company, Company points to location). You cannot create a new Company (because it needs a primary location), nor can you create a new Location (because it needs a company).
I do it with a pivot table:
CompanyLocations
--CompanyID
--LocationID
--TypeOfLocation (primary, office, warehouse etc.)
In this case you can select all locations and then use the type as you like. If you create a PrimaryLocationID, you need two joins on the same table and more complex logic. That is worse than this approach.
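For illustration, fetching each company's primary location through the pivot table might look like this (a sketch, assuming the table names from the question):
SELECT c.CompanyName, l.LocationName
FROM Companies c
JOIN CompanyLocations cl
    ON cl.CompanyID = c.CompanyID
   AND cl.TypeOfLocation = 'primary'
JOIN Locations l
    ON l.LocationID = cl.LocationID;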
I have encountered a problem in my project using PostgreSQL.
Say there are two tables A and B, both A and B have a (unique) field named ID. The ID column of table A is declared as a primary key, while the ID column of table B is declared as a foreign key pointing back to table A.
My problem is that every time new data is entered into the database, the values in table B tend to be updated prior to the ones in table A (this cannot be avoided, as the project is designed this way). So I have to modify the relationship between A and B.
My goal is to achieve a situation where I can insert data into A and B separately while having the ON DELETE CASCADE clause enabled. What's more, INSERT and DELETE queries may happen at the same time.
Any suggestions?
It sounds like you have a badly designed project, if you can't use deferred constraints. Your basic problem is that you can't guarantee internal consistency of the data because transactions may occur which do not move the data from one consistent state to another.
Here is what I would do to be honest:
Catalog affected keys.
Drop affected key constraints.
Write a periodic job that looks for orphaned rows. Use LEFT JOIN because antijoins do not perform as well in PostgreSQL.
The problem with a third table is it doesn't solve your basic problem, which is that writes are not atomically consistent. And once you sacrifice that a lot of your transactional controls go out the window.
Long term, the project needs to be rewritten.
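A sketch of the periodic orphan check suggested above, assuming both tables are keyed on ID as described in the question:
-- Rows in B whose parent row in A is missing.
SELECT b.ID
FROM B b
LEFT JOIN A a ON a.ID = b.ID
WHERE a.ID IS NULL;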
I have been creating database tables using only a primary key of the datatype int, and I have always had great performance, but now I need to set up merge replication with updatable subscribers.
The tables use a typical primary key: data type int, with an identity increment. Setting up merge replication, I have to add a rowguid column to all tables, with the newsequentialid() function as the default value. I noticed that the rowguid column is marked as indexable, and was wondering whether I still need the primary key.
Is it okay to have 2 indexes, the primary key int and the rowguid? What is the best layout for a merge replication table? Do I keep the int id for easy row referencing and just remove the index but keep the primary key? Not sure what route to take, Thanks.
Remember that if you remove the int id column and replace it with a GUID, you may need to rework a good deal of your data and your queries. And do you really want to do queries like:
select * from orders where customer_id = '2053995D-4EFE-41C0-8A04-00009890024A'
Remember, if your IDs are exposed to any users (often the case with a customer ID, because the customer table usually has no natural key since names are not unique), they will find the GUID daunting to work with.
There is nothing wrong in an existing system with having both. In a new system, you could plan to not use the ints, but there is a great risk of introducing bugs if you try to remove them in a system already using them.
The only downside of replacing the integer primary key with the guid (that I know of) is that GUIDs are larger, so the btree (index space used) will be larger and if you have foreign keys to this table (which you'd also need to change) a lot more space may end up being used across (potentially) many tables.
In tables where you need only one column as the key, and the values in that column can be integers, when shouldn't you use an identity field?
Conversely, for the same table and column, when would you generate its values manually rather than use an autogenerated value for each record?
I guess it would be the case when there are lots of inserts and deletes on the table. Am I right? What other situations could there be?
If you have already settled on the surrogate side of the Great Primary Key Debate, then I can't find a single reason not to use identity keys. The usual alternatives are GUIDs (they have many disadvantages, primarily their size and randomness) and application-layer generated keys. But creating a surrogate key in the application layer is a little bit harder than it seems, and it also does not cover non-application data access (i.e. batch loads, imports, other apps, etc.). The one special case is distributed applications, where GUIDs and even sequential GUIDs may offer a better alternative to site ID + identity keys.
I suppose if you are creating a many-to-many linking table, where both fields are foreign keys, you don't need an identity field.
Nowadays I imagine that most ORMs expect there to be an identity field in every table. In general, it is a good practice to provide one.
I'm not sure I understand enough about your context, but I interpret your question to be:
"If I need the database to create a unique column (for whatever reason), when shouldn't it be a monotonically increasing integer (identity) column?"
In those cases, there's no reason to use anything other than the facility provided by the DBMS for the purpose; in your case (SQL Server?) that's an identity.
Except:
If you'll ever need to merge the table with data from another source, use a GUID, which will prevent key collisions.
If you need to merge databases it's a lot easier if you don't have to regenerate keys.
One case of not wanting an identity field would be in a one to one relationship. The secondary table would have as its primary key the same value as the primary table. The only reason to have an identity field in that situation would seem to be to satisfy an ORM.
You cannot (normally) specify values when inserting into identity columns, so for example if the column "id" was specified as an identity, the following SQL would fail:
INSERT INTO MyTable (id, name) VALUES (1, 'Smith')
In order to perform this sort of insert you need IDENTITY_INSERT turned on for that table. This is not intended to be on normally, and it can only be on for one table in the database at any point in time.
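When you do genuinely need such an insert, the switch looks like this in T-SQL:
SET IDENTITY_INSERT MyTable ON
INSERT INTO MyTable (id, name) VALUES (1, 'Smith')
SET IDENTITY_INSERT MyTable OFF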
If I need a surrogate, I would either use an IDENTITY column or a GUID column depending on the need for global uniqueness.
If there is a natural primary key, or the primary key is defined as a unique combination of other foreign keys, then I typically do not have an IDENTITY, nor do I use it as the primary key.
There is an exception, which is snapshot configuration tables which I am tracking with an audit trigger. In this case, there is usually a logical "primary key" (usually date of the snapshot and natural key of the row - like a cost center or gl account number for which the row is a configuration record), but instead of using the natural "primary key" as the primary key, I add an IDENTITY and make that the primary key and make a unique index or constraint on the date and natural key. Although theoretically the date and natural key shouldn't change, in these tables, if a user does that instead of adding a new row and deleting the old row, I want the audit (which reflects a change to a row identified by its primary key) to really reflect a change in the row - not the disappearance of a key and the appearance of a new one.
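A sketch of that layout (all names are illustrative assumptions):
CREATE TABLE CostCenterConfigSnapshot (
    SnapshotId   int IDENTITY NOT NULL,     -- surrogate key the audit trigger tracks
    SnapshotDate date NOT NULL,             -- logical key, part 1
    CostCenter   varchar(20) NOT NULL,      -- logical key, part 2: the natural key of the row
    BudgetAmount decimal(18,2) NOT NULL,    -- example configuration value
    CONSTRAINT PK_Snapshot PRIMARY KEY (SnapshotId),
    CONSTRAINT UQ_Snapshot_DateCostCenter UNIQUE (SnapshotDate, CostCenter)
)
If a user edits the date or cost center in place, the audit trail then shows a change to the same primary key rather than one key vanishing and another appearing.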
I recently implemented a Suffix Trie in C# that could index novels, and then allow searches to be done extremely fast, linear to the size of the search string. Part of the requirements (this was a homework assignment) was to use offline storage, so I used MS SQL, and needed a structure to represent a Node in a table.
I ended up with the following structure : NodeID Character ParentID, etc, where the NodeID was a primary key.
I didn't want this to be done as an autoincrementing identity, for two main reasons:
How do I get the value of a NodeID after I add it to the database/data table?
I wanted more control when it came to generating my own IDs.
I'm going to running thousands of queries into SQL and I need to prevent the duplication of field 'domain'. Never had to do this before and any help would be appreciated.
You probably want to create a UNIQUE constraint on the field "domain" - this constraint will raise an error if you create two rows that have the same domain in the database. For an explanation, see this tutorial at W3Schools:
http://www.w3schools.com/sql/sql_unique.asp
If this doesn't solve your problem, please clarify which database you have chosen to use (MySQL?).
NOTE: This constraint is completely separate from your choice of PHP as a programming language, it is a SQL database definition thing. A huge advantage of expressing this constraint in SQL is that you can trust the database to preserve the constraint even when people import / export data from the database, your application is buggy or another application shares the database.
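A minimal sketch, assuming MySQL and an illustrative table and constraint name:
ALTER TABLE mydata ADD CONSTRAINT uq_mydata_domain UNIQUE (domain);
Any INSERT that repeats an existing domain then fails with a duplicate-key error, which your PHP code can catch and handle.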
If this is an absolute database integrity requirement (It's not likely to change, nor does existing data have this problem), I would enforce it at the database with a unique constraint.
As far as detecting it before or after the attempt in order to notify the user, there are a number of techniques which could be used.
Where is the data coming from? Is this something you only want to run once, or a couple of times, or often? If the domain-value already exists, do you just want to skip the insert or do something else (ie increment a counter)?
Depending on your answers, there are many possible solutions:
Pre-sort your data, eliminate duplicates, then insert
(assumes relatively static data, empty table to begin with)
Use an associative array in PHP as a local domain-value cache
(if table already contains data, start by reading existing content;
not thread-safe, but works if it only runs once at a time)
Make domain a UNIQUE column and write wrapper code to handle return errors
Make domain a UNIQUE or PRIMARY KEY column and use an ON DUPLICATE KEY clause:
INSERT INTO mydata ( domain, count ) VALUES
( 'firstdomain', 1 ),
( 'seconddomain', 1 ),
( 'thirddomain', 1 )
ON DUPLICATE KEY
UPDATE count = count+1
Insert all data into the table, then remove duplicates
Note that batching inserts (ie using multiple value clauses per statement) can be significantly faster.
I'm not really sure I understood your question, but perhaps you are looking for SQL's "UNIQUE" constraint. If the query tries to insert a pre-existing value to a field, you (PHP) will be notified about this constraint breach.
There are a bunch of ways to approach this. You could set a unique constraint (like a primary key) on that column. This will cause the insert to fail if that domain has also been inserted. You could also insert all of the duplicate domains and just delete them later on. This will work well if not that many of the domains are duplicated. There are a few questions posted already on finding duplicate rows.
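For instance, a common MySQL pattern for removing the duplicates afterwards (a sketch; the table and id column are illustrative assumptions):
-- Keep the row with the lowest id for each domain, delete the rest.
DELETE t1 FROM mydata t1
JOIN mydata t2
    ON t1.domain = t2.domain
   AND t1.id > t2.id;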
This can be done with SQL rather than with PHP.
I am assuming that you are using MySQL, but the same principles will work with different databases.
Make the domain column the primary key (makes sense, as it has to be unique).
Rather than using INSERT, use REPLACE.
If the primary key you are trying to put into the table already exists, REPLACE will overwrite the existing tuple rather than creating a duplicate.
So existing data is overwritten when the new values differ, and effectively unchanged when they are identical.
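A minimal sketch of that approach in MySQL (the table name is illustrative):
CREATE TABLE mydata (
    domain VARCHAR(255) PRIMARY KEY
);
-- The second statement does not create a duplicate row; it
-- replaces the existing row that has the same primary key.
REPLACE INTO mydata (domain) VALUES ('firstdomain');
REPLACE INTO mydata (domain) VALUES ('firstdomain');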