I'm writing a server that allows multiple users to modify a post.
So I created a Permission table that contains a user ID, a post ID, and permission data.
I wanted to query this with only one value (I assumed that querying by one value is more efficient than querying by two), so I googled and found this.
However, I also found that the Cantor Pairing Function isn't unique,
so we can't use the Cantor Pairing Function as a primary key.
But that claim only covers the Cantor function, not the Elegant Pairing Function (by Matthew Szudzik).
How about the Elegant Pairing Function?
Is it safe to use an Elegant pairing value as a primary key in a database?
Or should I just give up and query with two values?
Unless a single-ID-field constraint is enforced by your storage, I believe this is an example of the famous
"Premature optimization is the root of all evil." (D. Knuth)
Note that the Cantor pairing function is not unique for real numbers, but it is unique for integers, and I don't think your IDs are non-integer numbers. The same goes for the Elegant Pairing Function you reference, because structurally it is based on the same idea. If you need a specific counter-example over the reals, here is one:
ElegantPair(1, 2) = 2^2 + 1 = 5 = 2.1^2 + 0.59 = ElegantPair(0.59, 2.1)
On the other hand, the real problem is that you can't fit two 32-bit (or whatever size you use) int values into a single int value of the same size, no matter what clever trick you use. The trick behind pairing functions relies on the fact that N is infinite and NxN has the same "size" as N itself, which is obviously not true for fixed-size, real-world (computer) integers. Thus, whatever mapping you use over fixed-size ints, it will not be unique.
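In other words, just use both values with a composite primary key. A minimal sketch of that alternative, assuming integer IDs and generic SQL (table and column names are illustrative):

CREATE TABLE Permission (
    user_id INT NOT NULL,
    post_id INT NOT NULL,
    permission_data VARCHAR(50) NOT NULL,
    PRIMARY KEY (user_id, post_id)  -- composite key; no pairing trick needed
);

-- Looking up one user's permission on one post:
SELECT permission_data FROM Permission WHERE user_id = 42 AND post_id = 7;

A seek on both columns of the composite primary key is one probe into one index, so querying with two values costs essentially the same as querying with one.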
Is storing a user's religion in a "User" table, such that if you look down the column you would see "Christian" many times, "Muslim" many times, etc., considered a failure of a normal form? If so, which form?
The way I see it:
1NF: There are no repeating columns.
2NF: There is no concatenated primary key, so this does not apply.
3NF: There is no dependency on a non-key attribute.
Storing user religion this way does not seem to fail any normal form, yet it seems very inefficient. Comments?
Your design supports all normal forms. It's fine that your attribute has a string value. The size of the data type is irrelevant for normalization.
The goal of normalization is not physical storage efficiency -- the goal is to prevent anomalies. And to support logical efficiency, i.e. store a given fact only once. In this case, the fact that the user on a given row is Christian.
The principal disadvantage to storing the column in that manner is the storage space it consumes as the number of rows scales up.
Rather than a character column, you could use an ENUM() if you have a fixed set of choices that will rarely, if ever, change, and still avoid creating an additional table of religion options to which this one has a foreign key. However, if the choices will be fluid, normalization rules would prefer that the choices be placed into their own table with a foreign key in your user table.
There are other advantages besides storage space to keeping them in another table. Modifying them is a snap. To change Christian to Christianity, you can make a single change in the religions table, rather than doing the potentially expensive (if you have lots of rows and religion is not indexed)
UPDATE users SET religion='Christianity' WHERE religion='Christian'
... you can do the much simpler and cheaper
UPDATE religions SET name='Christianity' WHERE id=123
Of course, you also enforce data integrity by keying against a religions table. It becomes impossible to insert an invalid value like the misspelled Christain.
I'm assuming that there's a list of valid religions; if you've just got the user entering their own string, then you have to store it in the user table and this is all moot.
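For concreteness, a minimal sketch of the foreign-key approach described above (table and column names are illustrative, generic SQL):

CREATE TABLE religions (
    id INT NOT NULL PRIMARY KEY,
    name VARCHAR(50) NOT NULL UNIQUE  -- 'Christianity', 'Islam', ...
);

CREATE TABLE users (
    id INT NOT NULL PRIMARY KEY,
    user_name VARCHAR(100) NOT NULL,
    religion_id INT,  -- NULL if the user didn't specify one
    FOREIGN KEY (religion_id) REFERENCES religions(id)
);

Renaming a religion is then the cheap one-row UPDATE shown above, and a misspelled value simply cannot be inserted into users.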
Assume that religions are stored in their own table. If you're following well-established practices, this table will have a primary key which is an integer, and all references to entries in the table in other tables (such as the user table) will be foreign keys. The string method of storing religion doesn't violate any normal form (since the name of a religion is a candidate key for the religion table), but it does violate the practice of not using strings as keys.
(This is an interesting difference between the theory and practice of relational algebra. In theory, a string is no different from an integer; they're both atomic mathematical values. In practice, strings have a lot of overhead that leads programmers not to use them as keys.)
Of course, there are other ways (such as ENUM for some RDBMSes) of storing a list of possible values, each with their own advantages and disadvantages.
Your normal forms are a little awry. Second normal form is that the rest of the row depends on "the whole key". Third normal form is that the rest of the row depends on "nothing but the key." (So help me Codd).
No, your situation as described does not violate any of the first three normal forms. (It might violate the sixth, depending on other factors).
There are a few cons to this approach (compared to using a foreign key) that you will need to be OK with:
1 - It wastes storage.
2 - It is slower to query by religion.
3 - Someone might put data in there that doesn't match; e.g. manually insert "Jedi" or something you might not consider correct.
4 - There's no way to keep a list of possible religions: if no user of a certain religion (say, Zoroastrian) is in your table, you can't record that it is still a valid option.
5 - Incorrect capitalization might cause problems.
6 - White space around the string might cause problems.
The main pro of this technique is that the data is quicker to pull out (no joining on a table) and also quicker for a human to read.
In one of my database class assignments, I wrote that I specifically didn't assign lengths to my NUMBER columns acting as surrogate keys since it would unnecessarily limit the number of records able to be stored in the table, and because there is literally no difference in performance or physical storage between NUMBER(n) and NUMBER.
My professor wrote back that it would be technically possible but "impractical" for large databases, and that most DBAs in real-life situations would not do that.
There is no difference whatsoever between NUMBER(n) and NUMBER as far as physical storage or performance goes, and thus no reason to specify a length for a NUMBER-based surrogate key column. Why does this professor think that using NUMBER alone would be "impractical"?
In my experience, most production DBAs in real life would likely do as you suggested and declare key columns as NUMBER rather than NUMBER(n).
It would be worthwhile to ask the professor what makes this approach impractical in his or her opinion. There are a couple of possibilities I can think of:
Assuming that you are using a data modeling tool to design your schema, a reasonable tool will ensure that the data type of a key is the same in the table where it is defined as a primary key and in the child tables where it appears as a foreign key. If you specify a length for the primary key, forcing the tool to generate foreign keys without length limits would be impractical. Of course, the counter to this is that you can just declare both the primary and foreign key columns as NUMBER.
DBAs tend to be extremely picky (and I mean this as a compliment). They like to see everything organized "just so". Adding a length qualifier to a field, whether it be a NUMBER or a VARCHAR2, serves as an implicit constraint that ensures incorrect data does not get stored. Ideally, you would know, when designing a table, a reasonable upper bound on the number of rows you'll insert over the table's lifetime (i.e. if your PERSON table ended up with more than 10 billion rows, something would likely be seriously wrong). Applying length constraints to numeric columns demonstrates to the DBA that you've done this sort of analysis.
Today, however, that sort of analysis is rather unlikely to happen, at least with respect to numeric columns, both because it is more in keeping with waterfall planning methods (which would generally involve that sort of detailed design discussion) and because people are less concerned with the growth analysis that would traditionally have been done at the same time. If you were designing a database schema 20 or 30 years ago, it wouldn't be uncommon to provide the DBA with a table-by-table breakdown of the projected size of each table at go-live and over the next few years. Today, it's more cost-effective to potentially spend more on disk rather than investing the time to do this analysis up front.
It would probably be better, from a readability and self-documentation standpoint, to limit what can be stored in the column to the range of numbers that is expected. That said, I agree that I don't see how it's impractical.
From this thread about NUMBER:
"number(n) is an edit -- restricting the number to n digits in length. If you store the number '55', it takes the same space in a number(2) as it does in a number(38). The storage required is a function of the number actually stored."
Left to my own devices, I would declare surrogate primary keys as NUMBER(38) on Oracle instead of NUMBER, and possibly add a check constraint to make the key > 0, primarily to serve as documentation to outside systems about what they can expect in the column and what they need to be able to handle.
In theory, when building an application that reads the surrogate primary key, seeing NUMBER means one needs to handle the full range of NUMBER, whereas NUMBER(38) means the application needs to handle an integer with up to 38 digits.
If I were working in an environment where all the front ends were going to use a 32-bit integer for surrogate keys, I'd define it as NUMBER(10) with an appropriate check constraint.
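As a sketch of what that might look like in Oracle (table, column, and constraint names are illustrative):

CREATE TABLE person (
    person_id NUMBER(38) NOT NULL
        CONSTRAINT person_id_positive CHECK (person_id > 0),
    person_name VARCHAR2(100),
    CONSTRAINT person_pk PRIMARY KEY (person_id)
);

The NUMBER(38) length and the check constraint change nothing about storage or performance; they simply document to client code that the key is a positive integer of at most 38 digits.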
I have a column in a table that will store an enum value, e.g. Large, Medium, Small, or the days of the week. This will correspond to displayed text on a web page or a user selection from a droplist. What is the best design?
Store the values as an int, and then perhaps have a table that maps each enum/int to its corresponding string.
Just store the values in the column as a string, to make queries a little more self-explanatory.
At what point/quantity of values is it best to use ints rather than strings?
Thanks.
Assuming your RDBMS of choice doesn't have an ENUM type (which handles this for you), I think it's best to use ids instead of strings directly when the values can change (either in value or in quantity).
You might think that days of the week won't change, but what if your application needs to add internationalization support? (or an evil multinational corporation decides to rename them after taking control of the world?)
Also, that Large, Medium, and Small categorization will probably change after a while. Most values you think cannot change can change after a while.
So, mainly to anticipate change, I think it's best to use ids: you just need to change the translation table and everything works painlessly. For i18n, you can simply expand the translation table and pull the proper records automatically, as sketched below.
Most likely (it will depend on various factors) ints are going to perform better, at the very least in the amount of storage required. But I wouldn't choose ints for performance reasons; I'd choose ints for flexibility reasons.
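A sketch of such a translation table with i18n support (schema and names are illustrative, generic SQL):

CREATE TABLE sizes (
    size_id INT NOT NULL PRIMARY KEY
);

CREATE TABLE size_translations (
    size_id INT NOT NULL,
    locale CHAR(5) NOT NULL,      -- e.g. 'en-US', 'es-ES'
    label VARCHAR(50) NOT NULL,   -- 'Large', 'Grande', ...
    PRIMARY KEY (size_id, locale),
    FOREIGN KEY (size_id) REFERENCES sizes(size_id)
);

-- Pull the display labels for the user's locale:
SELECT s.size_id, t.label
FROM sizes s
JOIN size_translations t ON t.size_id = s.size_id
WHERE t.locale = 'es-ES';

Renaming "Large" or adding a new locale touches only size_translations; rows that reference size_id never change.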
This is an interesting question. You definitely have to take performance targets into consideration here. If you want to go for speed, int is a must; a database can index integers a bit better than strings, although I must say the performance loss is not at all bad.
One example is the Oracle database itself, where they have the luxury of using all-caps enums as strings in their system tables. Things like USER_ALLOCATION_TYPE and the like are the norm there. It's like you say: strings can be more "extensible" and more readable, but in any case in the code you will end up with:
static final String USER_ALLOCATION_TYPE = "USER_ALLOCATION_TYPE";
in place of
static final int USER_ALLOCATION_TYPE = 5;
Because if you don't do this, you will end up with string literals scattered everywhere, just aching for someone to go there and misplace a char! :)
In my company we use tables with integer primary keys; all the tables have a serial primary key, because even if you don't think you need one, sooner or later you'll regret not having it.
In the case you are describing, what we do is have a table with (PK int, Description string) and then define views over the master tables, with joins to get the descriptions. That way we get to see the joined description fields when we must, and we keep the performance up.
Also, with a separate description table you can attach EXTRA information to those ids that you would never have thought about. For example, let's say a user should have access to some entries in the combo box only if they have a certain property. You could use extra fields in the description table to store that, in place of ad-hoc code.
My two cents.
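A sketch of that view-over-master-table pattern (all names are illustrative, generic SQL):

CREATE TABLE order_status (
    status_id INT NOT NULL PRIMARY KEY,
    description VARCHAR(50) NOT NULL
);

CREATE TABLE orders (
    order_id INT NOT NULL PRIMARY KEY,
    status_id INT NOT NULL,
    FOREIGN KEY (status_id) REFERENCES order_status(status_id)
);

-- The view exposes the human-readable description; the base table keeps the int key:
CREATE VIEW orders_v AS
SELECT o.order_id, o.status_id, s.description AS status_description
FROM orders o
JOIN order_status s ON s.status_id = o.status_id;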
Going with your first example, let's say you create a lookup table, Sizes. It has the following columns:
Id - primary key + identity
Name - varchar / nvarchar
You'd have three rows in the table, Small, Medium, and Large, with values 1, 2, 3 if you inserted them in that order.
If you have another table that uses those values, you can use the identity value as the foreign key... or you could create a third column which is a shorthand value for the three entries. It would have the values S, M & L. You could use that as the foreign key instead; you'd just have to create a unique constraint on the column.
As far as the dropdown, you could use either one as the value behind the scenes.
You could also make the S/M/L value itself the primary key.
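A sketch of this lookup table in T-SQL (names are illustrative; here the shorthand code is kept unique rather than made the primary key):

CREATE TABLE Sizes (
    Id INT IDENTITY(1,1) PRIMARY KEY,
    Name NVARCHAR(20) NOT NULL,     -- 'Small', 'Medium', 'Large'
    Code CHAR(1) NOT NULL UNIQUE    -- 'S', 'M', 'L'
);

-- A referencing table can key on the shorthand column because it is unique:
CREATE TABLE Shirts (
    ShirtId INT IDENTITY(1,1) PRIMARY KEY,
    SizeCode CHAR(1) NOT NULL REFERENCES Sizes(Code)
);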
For your other question about when it's best to use ints vs. strings: there is probably a lot of debate on the subject. A lot of people only like using identity values as their primary keys; other people say it's better to use a natural key. If you are not using an identity as the primary key, then it's just important to make sure you have a good candidate for the primary key (one that will always be unique and whose value does not change).
I too would be interested in people's thinking on this. I've always gone the route of storing the enum in a lookup table, and then in any data table that referenced the enum I would store the ID and use a FK relationship. In a certain way I still like this approach, but there is something plain and simple about putting the string value directly in the table.
Going purely by size, an int is 4 bytes, whereas the string is n bytes (where n is the number of characters). The shortest value in your lookup is 5 characters, the longest is 6, so storing the actual value would use up more space eventually (if that were a problem).
Going by performance, I'm not sure whether an index on an int or on a varchar would show any difference in speed / optimization / index size.
I need to create a hash key on my tables for uniqueness, and someone mentioned MD5 to me. But I have read about checksum and binary checksum; would those not serve the same purpose, i.e. ensuring no duplicates in a specific field?
Now I have managed to implement this, and I can see the hash keys in my tables.
Do I need to alter the index keys originally created, since I created a new index with these hash keys? Also, do I need to change the keys?
How do I change my queries, for example SELECT statements?
I guess I am still unsure how hash keys really help in queries, other than for uniqueness.
If your goal is to ensure no duplicates in a specific field, why not just apply a unique index to that field and let the database engine do what it was meant to do?
It makes no sense to write a unique function to replace SQL Server unique constraints/indexes.
How are you going to ensure the hash is unique? With a constraint?
If you index it (which may not be allowed because of determinism), then the optimiser will treat it as non-unique. As well as killing performance.
And you only have a few 100,000 rows. Peanuts.
Given time I could come up with more arguments, but I'll summarise: Don't do it
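That is, instead of hand-rolling a hash, just declare the uniqueness (a sketch; the table and column names are placeholders):

CREATE UNIQUE INDEX ux_mytable_myfield ON mytable (myfield);

Your existing SELECT statements don't change at all; the engine rejects duplicate inserts on its own and can use the index for lookups.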
There's always the HashBytes() function. It supports MD5, but if you don't like that there's an option for SHA1.
As for how this can help queries: one simple example is if you have a large varchar column (maybe varchar(max)) and you want to know whether the contents of this column match a particular string. If you have to compare your search string with every single record, it could be slow. But if you hash your search string and compare hashes instead, things can go much faster, since now it's just a very short binary compare.
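A sketch of that pattern in T-SQL (table, column, and index names are illustrative; note that HASHBYTES input was capped at 8,000 bytes before SQL Server 2016):

-- Persisted computed column holding the MD5 of the big column:
ALTER TABLE documents
    ADD body_hash AS CAST(HASHBYTES('MD5', body) AS BINARY(16)) PERSISTED;

CREATE INDEX ix_documents_body_hash ON documents (body_hash);

-- Seek on the 16-byte hash, then re-check the full value to rule out collisions:
SELECT doc_id
FROM documents
WHERE body_hash = CAST(HASHBYTES('MD5', @search) AS BINARY(16))
  AND body = @search;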
Cryptographically secure hash functions are one-way functions, and they consume more resources (CPU cycles) than functions that are not cryptographically secure. If you just need a hash key, you do not need that property. All you need is a low probability of collisions, which is related to uniformity. Try CRC if you have strings, or modulo for numbers.
http://en.wikipedia.org/wiki/Hash_function
Why don't you use a GUID with a default of NEWSEQUENTIALID()? Don't use NEWID(), since it is horrible for clustering; see here: Best Practice: Do not cluster on UniqueIdentifier when you use NewId.
Make this column the primary key and you are pretty much done.
If you were trying to create a domain object in a database schema, and in your code said domain object has a hashtable/list member, like so:
public class SpaceQuadrant : PersistentObject
{
public SpaceQuadrant()
{
}
public virtual Dictionary<SpaceCoordinate, SpaceObject> Space
{
get;
set;
}
}
A Dictionary is just a hashtable/list mapping keys to values. I've come up with multiple ways to do this, creating various join tables or loading techniques, but they all kind of suck in terms of getting that O(1) access time that you get in a hashtable.
How would you represent the SpaceQuadrant, SpaceCoordinate, and Space Object in a database schema?
A simple schema code description would be nice, i.e.:
table SpaceQuadrant
{
ID int not null primary key,
EntryName varchar(255) not null,
SpaceQuadrantJoinTableId int not null
foreign key references ...anothertable...
}
but any thoughts at all would be nice as well, thanks for reading!
More Information:
Thanks for the great answers already. I've only skimmed them, and I want to take some time thinking about each before I respond.
If you think there is a better way to define these classes, then by all means show me an example; any language you're comfortable with is cool.
Relations are not hash tables; they are sets.
I wouldn't organize the database using the coordinates as the key. What if an object changes location? Instead, I would probably treat coordinates as attributes of an object.
Also, I assume there is a fixed number of dimensions, for example, three. If so, then you can store these attributes of an object in fixed columns:
CREATE TABLE SpaceQuadrant (
quadrant_id INT NOT NULL PRIMARY KEY,
quadrant_name VARCHAR(20)
-- other attributes
);
CREATE TABLE SpaceObject (
object_id INT NOT NULL PRIMARY KEY,
x NUMERIC(9,2) NOT NULL,
y NUMERIC(9,2) NOT NULL,
z NUMERIC(9,2) NOT NULL,
object_name VARCHAR(20) NOT NULL,
-- other attributes
quadrant_id INT NOT NULL,
FOREIGN KEY (quadrant_id) REFERENCES SpaceQuadrant(quadrant_id)
);
In your object-oriented class, it's not clear why your objects are in a dictionary. You mention accessing them in O(1) time, but why do you do that by coordinate?
If you're using that to optimize finding objects that are near a certain point (the player's spaceship, for instance), you could also build into your SQL query that populates this SpaceQuadrant a calculation of every object's distance from that given point, and sort the results by distance.
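For instance, a sketch against the SpaceObject table above (the @-style parameters for the reference point and quadrant are placeholders):

SELECT object_id, object_name,
       SQRT(POWER(x - @px, 2) + POWER(y - @py, 2) + POWER(z - @pz, 2)) AS distance
FROM SpaceObject
WHERE quadrant_id = @quadrant_id
ORDER BY distance;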
I don't know enough about your program to know if these suggestions are relevant. But are they at least making you think of different ways of organizing the data?
In the simplest case, the dictionary has a key which would map to the primary key of a table - so that when you specify the values of the key, you can immediately find the matching data via a simple lookup.
In this case, you would need a table SpaceQuadrant with any general (single-valued) attributes that describe or characterize a space quadrant. The SpaceQuadrant table would have a primary key, possibly a generated ID, possibly a natural value. The hashtable would then consist of a table with the primary key value for cross-referencing the SpaceQuadrant, with the position (a SpaceCoordinate) and the attributes of the quadrant and coordinate.
Now, if you have an extensible DBMS, you can define a user-defined type for the SpaceCoordinate; failing that, you can use a trio of columns - x, y, z or r, theta, rho, for example - to represent the position (SpaceCoordinate).
In general terms, the structure I'm describing is quite similar to Bill Karwin's; the key (pun not intended until after I was rereading the message) difference is that it is perfectly OK in my book to have the position as part of the primary key of the sub-ordinate table if you are sure that's the best way to organize it. You might also have an object ID column that is an alternative candidate key. Alternatively, if objects have an existence independent of the space quadrant they happen to be in at the moment (or can exist in multiple positions - because they aren't points but are space stations or something), then you might have the SpaceObject in a separate table. What is best depends on information that we don't have available to us.
You should be aware of the limitations of using a SpaceCoordinate as part of the primary key:
no two objects can occupy the same position (that's called a collision in a hash table, as well as in 3D space),
if the position changes, then you have to update the key data, which is more expensive than an update of non-key data,
proximity lookups will be hard - exact lookups are easy enough.
The same is true of your dictionary in memory; if you change the coordinates, you have to remove the record from the old location and place it in the new location in the dictionary (or the language has to do that for you behind the scenes).
A dictionary is a table. The hash is a question of what kind of index is used. Most RDBMSs assume that tables are big and densely packed, making a hashed index inappropriate.
table SpaceQuadrant {
ID Primary Key,
-- whatever other attributes are relevant
}
table Space {
SpaceCoordinate Primary Key,
Quadrant Foreign Key SpaceQuadrant(ID),
SpaceObject -- whatever the object is
}
Your Space objects have FK references to the Quadrant in which they're located.
Depending on your RDBMS, you might be able to find a hash-based index that gets you the performance you're hoping for. For example, MySQL's MEMORY (HEAP) storage engine supports HASH indexes.
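A sketch of such a table, assuming MySQL (the layout is illustrative):

CREATE TABLE space_lookup (
    x NUMERIC(9,2) NOT NULL,
    y NUMERIC(9,2) NOT NULL,
    z NUMERIC(9,2) NOT NULL,
    quadrant_id INT NOT NULL,
    object_id INT NOT NULL,
    PRIMARY KEY USING HASH (x, y, z, quadrant_id)
) ENGINE = MEMORY;

Note that MEMORY tables do not survive a server restart, which is part of the persistence trade-off discussed below.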
First, dedicated support for geo-located data exists in many databases: different algorithms can be used (a spatial version of a B-Tree exists, for instance), and support for proximity searches will probably exist.
Since you have a different hash table for each SpaceQuadrant, you'd need something like (edited from S.Lott's post):
table Space {
SpaceCoordinate,
Quadrant Foreign Key SpaceQuadrant(ID),
SpaceObject -- whatever the object is (by ID)
Primary Key(SpaceCoordinate, Quadrant)
}
This is a (SpaceCoordinate, Quadrant) -> SpaceObjectId dictionary.
=====
Now, about your O(1) performance concern: there are a lot of reasons why it's misdirected.
You can use a hash index for memory-based tables in many DBs, as somebody told you. But if you need persistent storage, you'd need to update two tables (the in-memory one and the persistent one) instead of one (if there is no built-in support for this). To discover whether that's worth it, you'd need to benchmark on the actual data (with actual data sizes).
Also, forcing a table into memory can have worse implications.
If something ever gets swapped, you're dead; if you had used a B-Tree (i.e. a normal disk-based index), its algorithms would have minimized the needed I/O. Otherwise, all DBMSs would use hash tables and rely on swapping, instead of B-Trees. You can try to anticipate whether you'll fit in memory, but...
Moreover, B-Trees are not O(1), but they are O(log_512(N)) or something like that (I know that collapses to O(log N), but bear with me on this). You'd need (2^9)^4 = 2^36 = 64Gi entries for that logarithm to reach 4, and if you have that much data you'd need a big-iron server anyway for it to fit in memory. So it's almost O(1), and the constant factors are what actually matter.
Ever heard of low-asymptotic-complexity, big-constant-factor algorithms that are faster than simple ones only on impractically large data sizes?
Finally, I think DB authors are smarter than you and me. Especially given the declarative nature of SQL, hand-optimizing it this way isn't gonna pay. If an index fits in memory, I guess they could choose to build and use a hash-table version of the disk index as needed, if it were worth it. Investigate your docs for that.
But the bottom line is that premature optimization is evil, especially when it's of this kind (weird optimizations we think up on our own, as opposed to standard SQL optimizations), and with a declarative language.