Databases - Why case insensitive?

I saw one or two threads talking globally about case sensitivity, but my question is more specific.
I understand the interest of case insensitive searches on text values for example.
But why would we use case-insensitive database names, tables and columns?
Isn't it going to lead to mistakes? The scripting languages that talk to databases are all case-sensitive, so, for example, if we don't use the right case for a field, it will not be found...

The SQL:2008 and SQL-99 standards define databases to be case-insensitive for identifiers unless they are quoted. I've found most ORMs will quote identifiers in the SQL they generate.
However, as you probably know, not all relational databases strictly adhere to the standards. DB2 and Oracle follow the standard on this point and fold unquoted identifiers to upper case. PostgreSQL is mostly compliant, except that it folds anything that isn't quoted to lower case instead (which personally I prefer).
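To make the folding rule concrete, here is a minimal Python sketch of how an unquoted versus a double-quoted identifier ends up being stored. The function names and the dialect labels ("standard", "postgresql") are made up for this illustration; they are not options of any real driver, and real engines have more rules than this.

```python
def fold_identifier(identifier, dialect="standard"):
    """Return the name a database would store for an identifier.

    Double-quoted identifiers keep their case; unquoted ones are folded.
    The dialect labels are hypothetical names used only in this sketch.
    """
    if len(identifier) >= 2 and identifier[0] == '"' and identifier[-1] == '"':
        # Delimited identifier: strip the quotes, preserve the case.
        return identifier[1:-1].replace('""', '"')
    if dialect == "postgresql":
        return identifier.lower()  # PostgreSQL folds to lower case
    return identifier.upper()      # the standard, Oracle and DB2 fold to upper case


def same_object(a, b, dialect="standard"):
    """True if two spellings resolve to the same stored name."""
    return fold_identifier(a, dialect) == fold_identifier(b, dialect)


print(fold_identifier("MyTable"))                   # MYTABLE
print(fold_identifier("MyTable", "postgresql"))     # mytable
print(fold_identifier('"MyTable"', "postgresql"))   # MyTable
print(same_object("mytable", '"MyTable"'))          # False
```

The last line is the trap with quoting: once a name has been created with quotes and mixed case, the unquoted spelling no longer reaches it.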
MySQL gets a bit weird, since it stores each table as a file on the file system. For this reason, it's subject to the case sensitivity of the file system. On Windows:
CREATE TABLE FOO (a INTEGER);
CREATE TABLE `Foo` (a INTEGER); -- Errors out, already exists
Whereas on Linux:
CREATE TABLE FOO (a INTEGER); -- Creates FOO table
CREATE TABLE `Foo` (a INTEGER); -- Creates Foo table
SQL Server is even stranger. It preserves the case on creation, but lets you refer to the object with any casing afterwards (even if you quote the name!). You can't create two tables whose only difference is their casing. Note: SQL Server does have configuration options that control this, since the case sensitivity of identifiers depends on the default collation of the database. How confusing!
While for the most part I agree with you that computers (programming languages, databases, file systems, URLs, passwords, etc.) should be case-sensitive, all systems are implemented independently and may or may not adhere to standards that may or may not exist. Implementing a case-sensitive database is definitely possible, if you know the ins and outs of your particular database system and how it behaves.
It's really your responsibility to implement your system in a way that works for you, not the responsibility of the entire technology industry to implement everything consistently to make your life easier.

The main advantage of developing against a case-sensitive database is that when we deploy at the client site, our DB works regardless of whether the client's SQL Server is set up case-sensitive or not. So yes, relying on case insensitivity really isn't a good idea, and I don't know why anyone would use case-insensitive database tables/columns.

If you were to redo the whole IT industry today, with today's knowledge and technology, you might default to making everything case-sensitive, with the only exceptions being things explicitly requested to be case-insensitive.
But back in the days before I was born, and even when I started working (OK, playing) with computers, many computers couldn't even differentiate between upper and lower case letters. I built a mighty complicated card to plug into my fake Apple II to make it understand the difference.
So I guess in those days having a difference between upper and lower case was something like having a retina display nowadays. It's cool if you have it. And in 10 years we might ask why anybody ever created an application without such displays in mind, but today it just isn't that relevant.
Same is true for databases (and file systems) since many of them and their respective standards go back to the 70s at least.

Related

Good or bad idea to include numbers in SQL table names?

It's clear that you can use numeric characters in SQL table names and use them so long as they're not at the beginning. (There's a discussion here on one of the side effects: SQLite issue with Table Names using numbers?)
The database I'm targeting is Oracle 10g/11g.
I'm designing a reporting database where naming some of the entities clearly is best done by describing the reports, which are named after numbers ('part 45', '102S', '401'). It's just the business domain language: these reports just aren't commonly referred to by any other name. The entities I'm modelling really are best named this way.
My question is: am I going to have difficulties with maintenance or programmability if I put numbers in a table name? I'm always worried about ancillary software around the database: drivers, ETL code that might not play nice with a non-plain-vanilla name. But there's a real benefit in intelligibility in this business domain, so am I just being squeamish?
My question put simply is: are there any 'gotchas' or corner cases that would rule out a table name like PART_45_AUDIT?
If PART_45_AUDIT is really the clearest description of the entity you're modeling (which would be very rare), there shouldn't be any gotchas to having numbers in the middle of a name. Putting numbers at the front of the name would be a different story because that would require using double-quoted identifiers and there are plenty of tools that don't fully support double-quoted identifiers. Plus, of course, it's rather annoying to have to type the double-quotes every time you reference the table.
CREATE TABLE "102S" (
  col1 number
);
SELECT *
  FROM "102S";
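If you want to check candidate names up front, something like the following Python sketch can flag which ones would force double quotes. It assumes the Oracle rule for nonquoted identifiers (start with a letter, then letters, digits, _, $ or #) and the 30-character limit that applies to 10g/11g. The function name is made up, only ASCII letters are considered, and reserved words are not checked.

```python
import re

# Assumed rule for Oracle nonquoted identifiers; ASCII-only simplification.
UNQUOTED_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_$#]*")
MAX_LENGTH = 30  # identifier length limit in Oracle 10g/11g


def classify_table_name(name):
    """Return 'plain', 'needs double quotes' or 'invalid' for a candidate name."""
    if not name or len(name) > MAX_LENGTH:
        return "invalid"
    if UNQUOTED_NAME.fullmatch(name):
        return "plain"
    return "needs double quotes"


for candidate in ["PART_45_AUDIT", "102S", "part 45"]:
    print(candidate, "->", classify_table_name(candidate))
# PART_45_AUDIT -> plain
# 102S -> needs double quotes
# part 45 -> needs double quotes
```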

SQL Server Collation Choices

I've spent a lot of time this evening trying to find guidance about which choice of collation to apply in my SQL Server 2008 R2 installation, but almost everything online basically says "choose what is right for you." Extremely unhelpful.
My context is new application development. I am not worrying about backward compatibility with a prior version of SQL Server (viz. <= 2005). I am very interested in storing data representing languages from around the globe - not just Latin based. What very little help I've found online suggests I should avoid all "SQL_" collations. This narrows my choice to using either a binary or "not binary" collation based on the Windows locale.
If I use binary, I gather I should use "BIN2." So this is my question. How do I determine whether I should use BIN2 or just "Latin1_General_100_XX_XX_XX"? My spider-sense tells me that BIN2 will provide collation that is "less accurate," but more generic for all languages (and fast!). I also suspect the binary collation is case sensitive, accent sensitive, and kana-sensitive (yes?). In contrast, I suspect the non-binary collation would work best for Latin-based languages.
The documentation doesn't support my claims above, I'm making educated guesses. But this is the problem! Why is the online documentation so thin that the choice is left to guesswork? Even the book "SQL Server 2008 Internals" discussed the variety of choices, without explaining why and when binary collation would be chosen (compared with non-binary windows collation). Criminy!!!
"SQL Server 2008 Internals" has a good discussion on the topic imho.
Binary collation is tricky: if you intend to support text search for human beings, you'd better go with non-binary. Binary is good for gaining a tiny bit of performance once you have tuned everything else (architecture first), and in cases where case sensitivity and accent sensitivity are the desired behavior, like password hashes for instance. Binary collation is actually "more precise" in the sense that it never treats similar texts as equal. The sort orders you get out of it are good for machines only, though.
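To see why binary sort orders are "good for machines only", here is a rough Python sketch comparing code-point ordering with a case-insensitive, accent-insensitive key. The `ci_ai_key` helper is made up for this illustration and only approximates what a linguistic collation does; it is not how SQL Server implements any collation.

```python
import unicodedata


def ci_ai_key(text):
    """Rough case-insensitive, accent-insensitive sort key (not a real collation)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks such as the acute accent, then ignore case.
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


words = ["zebra", "Zulu", "apple", "\u00c9clair", "eclair"]  # \u00c9 is É

# Plain sorted() compares code points, which is roughly what a binary collation does.
print(sorted(words))                 # ['Zulu', 'apple', 'eclair', 'zebra', 'Éclair']
print(sorted(words, key=ci_ai_key))  # ['apple', 'Éclair', 'eclair', 'zebra', 'Zulu']

# Equality differs too: binary never treats similar texts as equal.
print("R\u00e9sum\u00e9" == "resume")                        # False
print(ci_ai_key("R\u00e9sum\u00e9") == ci_ai_key("resume"))  # True
```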
There is only a slight difference between the SQL_* collations and the native windows ones. If you're not constrained with compatibility, go for the native ones as they are the way forward afaik.
Collation decides sort order and equality. Choose whatever best suits your users. It's understood that you will use the Unicode types (like nvarchar) for your data to support international text. Collation also affects what can be stored in a non-Unicode column, but that does not affect you then.
What really matters is that you avoid mixing collations in a WHERE clause, because that's where you pay the price: the indexes don't get used. Afaik there's no silver-bullet collation that supports all languages. You can either choose one for the majority of your users or go into localization support with a different column for each language.
One important thing is to keep the server collation the same as your database collation. It will make your life much easier if you plan to use temporary tables: temporary tables created with "CREATE TABLE #ttt..." pick up the server collation, and you'd run into collation conflicts that you need to resolve by specifying an explicit collation. This has a performance impact too.
Please do not consider my answer as complete, but you should take into consideration the following points:
(As said by @Anthony) All text fields must use the nvarchar data type. This will allow you to store any character from any language, as defined by the Unicode character set (SQL Server stores nvarchar data as UTF-16, not UTF-8)! If you do not do so, you will not be able to mix text from different origins (Latin, Cyrillic, Arabic, etc.) in your tables.
This said, your collation choice will mainly affect the following:
The collating sequence, or sorting rules to be set between characters such as 'e' and 'é', or 'c' and 'ç' (should they be considered equal or not?). In some cases, collating sequences treat specific letter combinations as letters in their own right, as in Hungarian, where C and CS, or D, DZ and DZS, are sorted independently.
The way spaces (or other non letter characters) are analysed: which one is the correct 'alphabetical' order?
this one (spaces are considered as 'first rank' characters)?
San Juan
San Teodoro
Santa Barbara
or this one (spaces are not considered in the ordering)?
San Juan
Santa Barbara
San Teodoro
Collation also impacts on case sensitivity: do capital letters have to be considered as similar to small letters?
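The two orderings of the city names above can be reproduced with a small Python sketch. The two key functions are made up for this illustration and only imitate the two behaviours; a real collation has far more rules.

```python
cities = ["Santa Barbara", "San Teodoro", "San Juan"]


def spaces_first(name):
    """Spaces take part in the comparison and sort before letters."""
    return name.casefold()


def spaces_ignored(name):
    """Spaces are dropped before comparing."""
    return name.replace(" ", "").casefold()


print(sorted(cities, key=spaces_first))    # ['San Juan', 'San Teodoro', 'Santa Barbara']
print(sorted(cities, key=spaces_ignored))  # ['San Juan', 'Santa Barbara', 'San Teodoro']
```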
The best default collation for a global database (e.g. a website) is probably Latin1_General_CI_AS. More important than collation is making sure that all textual columns use the nvarchar data type.
As long as you use NVARCHAR columns (as you should for mixed international data), all *_BIN and *_BIN2 collations perform the same binary comparison/sorting based on the Unicode code points. It doesn't matter which one you pick. Latin1_General_BIN2 looks like a reasonable generic choice.
Source: http://msdn.microsoft.com/en-us/library/ms143350(v=sql.105).aspx

Table Naming: Underscore vs Camelcase? namespaces? Singular vs Plural?

I've been reading a couple of questions/answers on StackOverflow trying to find the 'best', or should I say most accepted, way to name tables in a database.
Most of the developers tend to name the tables depending on the language that requires the database (JAVA, .NET, PHP, etc). However I just feel this isn't right.
The way I've been naming tables till now is doing something like:
doctorsMain
doctorsProfiles
doctorsPatients
patientsMain
patientsProfiles
patientsAntecedents
The things I'm concerned about are:
Legibility
Quick identification of the module the table belongs to (doctors||patients)
Easy to understand, to prevent confusion.
I would like to read any opinions regarding naming conventions.
Thank you.
Being consistent is far more important than what particular scheme you use.
I typically use PascalCase and the entities are singular:
DoctorMain
DoctorProfile
DoctorPatient
It mimics the naming conventions for classes in my application keeping everything pretty neat, clean, consistent, and easy to understand for everybody.
Since the question is not specific to a particular platform or DB engine, I must say for maximum portability, you should always use lowercase table names.
/[a-z_][a-z0-9_]*/ is really the only pattern of names that seamlessly translates between different platforms. Lowercase alpha-numeric+underscore will always work consistently.
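If you want to enforce that pattern in a migration or review script, a check along these lines would do. This is only a sketch in Python: the function name is made up, and it tests nothing beyond the regular expression quoted above (no reserved words, no length limits).

```python
import re

# The pattern from the answer, anchored by fullmatch so the whole name must fit.
PORTABLE_NAME = re.compile(r"[a-z_][a-z0-9_]*")


def is_portable_name(name):
    """True if the name is lowercase alphanumerics/underscores, not starting with a digit."""
    return PORTABLE_NAME.fullmatch(name) is not None


print(is_portable_name("doctor_profile"))  # True
print(is_portable_name("DoctorProfile"))   # False
print(is_portable_name("2nd_opinion"))     # False
```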
As mentioned elsewhere, relation (table) names should be singular: http://www.teamten.com/lawrence/programming/use-singular-nouns-for-database-table-names.html
The case-insensitive nature of SQL favors an Underscores_Scheme. Modern software supports any kind of naming scheme. However, nasty bugs, errors or the human factor can sometimes lead to UPPERCASINGEVERYTHING, in which case those who combined Pascal_Case with underscores keep their nerves intact.
An aggregation of most of the above:
don't rely on case in the database
don't consider the case or separator part of the name - just the words
do use whatever separator or case is the standard for your language
Then you can easily translate (even automatically) names between environments.
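Treating a name as a list of words makes the automatic translation fairly mechanical. Here is a minimal Python sketch; all function names are made up for this illustration, and the word-splitting regex is one reasonable guess at the boundaries, not a standard. Note that digits become their own word, so a name like table2 does not round-trip unchanged.

```python
import re


def split_words(name):
    """Split snake_case, camelCase or PascalCase into lowercase words."""
    words = []
    for part in re.split(r"[_\s-]+", name):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [word.lower() for word in words]


def to_snake(name):
    return "_".join(split_words(name))


def to_pascal(name):
    return "".join(word.capitalize() for word in split_words(name))


def to_camel(name):
    words = split_words(name)
    if not words:
        return ""
    return words[0] + "".join(word.capitalize() for word in words[1:])


print(to_snake("DoctorProfile"))     # doctor_profile
print(to_pascal("doctor_profile"))   # DoctorProfile
print(to_camel("doctor_profile"))    # doctorProfile
print(to_snake("PART_45_AUDIT"))     # part_45_audit
```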
But I'd add another consideration: you may find that there are other factors when you move from a class in your app to a table in your database: the database object has views, triggers, stored procs, indexes, constraints, etc - that also need names. So for example, you may find yourself only accessing tables via views that are typically just a simple "select * from foo". These may be identified as the table name with just a suffix of '_v' or you could put them in a different schema. The purpose for such a simple abstraction layer is that it can be expanded when necessary to allow changes in one environment to avoid impacting the other. This wouldn't break the above naming suggestions - just a few more things to account for.
I use underscores. I did an Oracle project some years ago, and it seemed that Oracle forced all my object names to upper case, which kind of blows any casing scheme. I am not really an Oracle guy, so maybe there was a way around this that I wasn't aware of, but it made me use underscores and I have never gone back.
I tend to agree with the people who say it depends on the conventions of language you're using (e.g. PascalCase for C# and snake_case for Ruby).
Never camelCase, though.
After reading a lot of other opinions, I think it's very important to use the naming conventions of the language. Consistency is more important than naming conventions only if you are (and will remain) the only developer of the application. If you want readability (which is of huge importance), you'd better use the naming conventions of each language. In MySQL, for example, I don't suggest using CamelCase, since not all platforms are case-sensitive. So underscores work better there.
These are my five cents. I came to the conclusion that if DBs from different vendors are used for one project, there are two best ways:
Use underscores.
Use camel case with quotes.
The reason is that some databases convert all characters to uppercase and some to lowercase. So, if you have myTable, it will become MYTABLE or mytable when you work with the DB.
Naming conventions exist within the scope of a language, and different languages have different naming conventions.
SQL is case-insensitive by default; so, snake_case is a widely used convention. SQL also supports delimited identifiers; so, mixed case is an option, like camelCase (Java, where fields == columns) or PascalCase (C#, where tables == classes and columns == fields). If your DB engine can't support the SQL standard, that's its problem. You can decide to live with that or choose another engine. (And why C# just had to be different is a point of aggravation for those of us who code in both.)
If you intend to ever only use one language in your services and applications, use the conventions of that language at all layers. Else, use the most widely used conventions of the language in the domain where that language is used.
C# approach
Singular/Plural
Singular if the column in a row holds just one value.
If it holds an array, go for plural. This also reads naturally when you foreach over the elements. E.g. your array column MostVisitedLocations contains: London, NewYork, Bratislava
then:
foreach(var mostVisitedLocation in MostVisitedLocations){
//go through each array element
}
Casing
PascalCase for table names and camelCase for columns made the most sense to me. But in my case, in .NET 5, when I had JSON objects saved in the DB with JSON property names in camelCase, System.Text.Json wasn't able to deserialise them into an object. That is because your model has to be public, and public properties are PascalCase. So mapping table columns (camelCase) and JSON property names (camelCase) to these properties can result in errors (because the mapping is case-sensitive).
Btw, with Newtonsoft.Json this problem is not present.
So I ended up with:
Tables: App.Admin, App.Pricing, UserData.Account
Columns: Id, Price, IsOnline.
2 suggestions based on use cases:
Singular table names.
Although I once believed in pluralizing table names, I found in practice that there is little to no benefit to it, other than the human tendency to think of tables as collections.
When singularising the table names, you can silently add -table to the singular table name in your head, and then it all makes sense again.
SELECT username FROM UserTable
Sounds more natural than
SELECT username FROM UsersTable
But postfixing every table name with Table is just a waste.
The actual practical argumentation for singularising table names:
What is the plural of person: persons or people?
This is still ok.
But how do you like a table with postfix -status? Statuses?
That sucks, sorry.
It is easy to inadvertently make a human mistake by singularizing the status table, but pluralizing the other tables.
PascalCasing + Underscore convention.
Given table User, Role and a many-to-many table User_Role.
An underscore-cased user_role is ambiguous when all table names use underscores by default.
Is user_role a table that contains user roles? In this case it is not, it is a join table.
When deciding on table name conventions, I think it is useful to let go of personal preference and take into account the practical considerations of real-life problems, in order to minimize ambiguous situations.
As the many answers and opinions have indicated, whatever your personal opinion is, different people think differently, and you will not be the only person working on the database despite being the one who sets it up (unless you are, in which case you're only helping yourself).
Therefore it is useful to have practical argumentation (practical in the sense of, does it help my future co-workers to avoid dubious situations) when your past decision is being questioned.
Unfortunately there is no "best" answer to this question. As @David stated, consistency is far more important than the naming convention.
There's wide variability in how to separate words, so there you'll have to pick whatever you like better; but at the same time, it seems there's near consensus that the table name should be singular.

Are there benefits to a case sensitive database?

We have just 'migrated' an SQL Server 2005 database from DEVEL into TEST. Somehow during the migration process the DB was changed from case insensitive to sensitive - so most SQL queries broke spectacularly.
What I would like to know, is - are there any clear benefits to having a case sensitive schema?
NOTE: By this I mean table names, column names, stored proc names etc. I am NOT referring to the actual data being stored in the tables.
At first inspection, I cannot find a valid reason that offers benefits over case insensitivity.
I just found out why WE make it case sensitive. It is to ensure that when we deploy it on the client site, our DB works regardless whether the client's SQL Server is set up case sensitive or not.
That is one answer I wasn't expecting.
I really can't think of any good reason SQL identifiers should be case-sensitive. I can think of one bad one: it's the one MySQL gives for why their table names are case-sensitive. Each table is a file on disk, your filesystem is case-sensitive and the MySQL devs forgot to table_file = lc(table_name). This is heaps of fun when you move a MySQL schema to a case-insensitive filesystem.
I can think of one big reason why they shouldn't be case sensitive.
Some schema author is going to be clever and decide that this_table obviously means something different from This_Table and make those two tables (or columns). You might as well write "insert bugs here" at that point in the schema.
Also, case-insensitivity lets you be more expressive in your SQL to emphasize tables and columns vs commands without being held to what the schema author decided to do.
SELECT this, that FROM Table;
Not all sections of Unicode have a bijective mapping between upper and lower-case characters — or even two sets of cases.
In those regions, "case-insensitivity" is a little meaningless, and probably misleading.
That's about all I can think of for now; in the ASCII set, unless you want Foo and foo to be different, I don't see the point.
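A couple of concrete cases of that non-bijective mapping, as a short Python sketch using only built-in string methods. The `round_trips` helper is a made-up name for this illustration.

```python
def round_trips(text):
    """True if upper-casing and then lower-casing gives back the original text."""
    return text.upper().lower() == text


sharp_s = "\u00df"   # ß, LATIN SMALL LETTER SHARP S
dotted_i = "\u0130"  # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE

print(sharp_s.upper())           # SS  (one character becomes two)
print(round_trips(sharp_s))      # False: "SS".lower() is "ss", not "ß"
print(len(dotted_i.lower()))     # 2: 'i' followed by a combining dot above
print("\u65e5\u672c".upper() == "\u65e5\u672c")  # True: this script has no case at all
```

So "case-insensitive" has to mean some specific folding rule, and two engines need not pick the same one.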
Most languages out there are case-sensitive, as are most comparison algorithms, most file systems, etc. Case insensitivity is for lazy users. Although it does tend to make things easier to type, and does lead to many variants of the same names differing only by case.
Personally, between (MyTable, mytable, myTable, MYTABLE, MYTable, myTABLE, MyTaBlE), I would please like to see one universal version.
Case insensitivity is a godsend when you have developers that fail to follow any sort of conventions when writing SQL or come from development languages where case insensitivity is the norm such as VB.
Generally speaking I find it easier to deal with databases where there is no possibility that ID, id, and Id are distinct fields.
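If you are stuck with a case-sensitive schema and want to find names that differ only by case, a quick check is easy to script. This Python sketch assumes you have already pulled the column or table names into a list; the function name is made up.

```python
from collections import defaultdict


def find_case_collisions(names):
    """Group names that are identical once case is ignored."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    # Keep only groups with at least two distinct spellings.
    return [spellings for spellings in groups.values() if len(set(spellings)) > 1]


print(find_case_collisions(["ID", "id", "Id", "name"]))  # [['ID', 'id', 'Id']]
```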
Other than a personal preference for torture, I would strongly recommend you stay with case insensitivity.
The only database I ever worked on that was set up for case sensitivity was Great Plains. I found having to remember every single casing of their schema namings was painful. I have not had the privilege of working with more recent versions.
Unless it has changed, and if my memory serves, the kind of case sensitivity you are speaking of is determined at installation time and applies to all databases. With the SQL Server installation that ran the Great Plains database I mentioned, all databases on that installation were case-sensitive.
I like case-sensitivity, mostly because that's what I'm used to from programming in Perl (and most any other language too). I like using StudlyCaps for table names and all lower case with underscores for columns.
Of course, many databases allow you to quote names to enforce casing, like Postgres does. That seems like a reasonable approach as well.
I do support for Sybase Advantage Database Server and it uses a flat file format allowing DBF's as well as our own proprietary ADT format. The case where I see case sensitivity being an issue is when using our Linux version of the server. Linux is a case sensitive OS so we have an option in our db to lowercase all calls. This requires that the table files be lower case.
I'm pretty sure the SQL spec requires case folding (which is effectively the same as insensitivity) for identifiers. PostgreSQL folds to lower case, Oracle folds to upper case.

Obfuscate a SQL Server Db schema

When posting example code or filing bug reports based on a real production app, it would be helpful to have some way to change the table and column names to not potentially give away information about the internals of the app. Doing it by hand without breaking things is time consuming. Does anything automatic exist? Ideally it would use real English words so they are more easily referred to than random text strings.
As long as you don't use real data, I don't see what the issue is. Most apps are fairly obvious based on the requirements. ie CRM system = (customer name, address, etc...) or (customer name, addressid, etc.. with some address table with parts of the address, etc...). By knowing your schema I have no idea how you implement your app. Generally without the stored procedures/program code it would be hard to steal any intellectual property. Even if you were the NSA or something (InternetIP, PacketHeadingID, PacketDetailID, TimeStampID). Even with the structure of the tables I still would have no information on how your system to log all the internet traffic actually works. I also wouldn't know anything that is logged.
I don't know of anything off hand to do what you are requesting, but I would think it is fairly easy to write a script to do it on your own. Look at the table columns and datatypes and call text columns "TextColumn1", int columns "IntColumn2", etc. and build a table of substitutions, then perform the substitutions globally in the script file. I would think this is a fairly easy Python/Perl/PowerShell/Ruby/VbScript program.
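A rough Python version of that idea might look like the following. It assumes you already have the real column names and their types in a dict (how you extract them is up to you); the function names and the "Other" prefix are made up for this sketch. It is a plain text substitution, so it will also rewrite matching words inside string literals and comments.

```python
import re


def _prefix_for(sql_type):
    """Pick a generic prefix from the SQL type name (very rough classification)."""
    lowered = sql_type.lower()
    if "char" in lowered or "text" in lowered:
        return "Text"
    if "int" in lowered:
        return "Int"
    return "Other"


def build_substitutions(columns):
    """Map each real column name to a generic one such as TextColumn1, IntColumn2."""
    return {
        name: f"{_prefix_for(sql_type)}Column{index}"
        for index, (name, sql_type) in enumerate(columns.items(), start=1)
    }


def apply_substitutions(script, substitutions):
    """Replace whole-word occurrences of every real name in the script."""
    if not substitutions:
        return script
    # Longest names first so a short name never shadows a longer one.
    names = sorted(substitutions, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda match: substitutions[match.group(0)], script)


columns = {"CustomerName": "nvarchar(100)", "OrderCount": "int"}
mapping = build_substitutions(columns)
print(mapping)  # {'CustomerName': 'TextColumn1', 'OrderCount': 'IntColumn2'}
print(apply_substitutions("SELECT CustomerName, OrderCount FROM Orders;", mapping))
# SELECT TextColumn1, IntColumn2 FROM Orders;
```

Table names can go through the same mapping step; keeping the substitution table around lets you translate answers back to the real names afterwards.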
I agree that there's no real need to do so, but if you feel that way, take a look at anonymizers, usually used to protect the data and not the schemas, but you could easily apply those approaches to schemas as well.
See this paper (which is the description of this framework), especially page 8 and onwards, for different anonymization methods, although replacing column names with static strings would probably be good enough anyway.

Resources