Here is the case:
There are a lot of columns across my databases on one SQL Server that contain the same data, but there is a big inconsistency in the data type lengths defined for them.
For example, I have a column called "name" in the stage and dbo schemas in DB1, and the same column in DB2. In all those places the column has a different data type length:
stage.name is defined as varchar(10),
dbo.name is defined as varchar(20),
column "name" in DB2 is defined as varchar(max)
Is there any tool that can help me to fix that?
I mean something other than writing SQL queries against INFORMATION_SCHEMA.COLUMNS and then manually generating ALTER scripts to implement the changes.
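To be clear, the manual route I'm trying to avoid looks roughly like this (a sketch only; the column name 'name' and the target length varchar(50) are examples, and it would have to be run per database):
-- Find every varchar definition of the column that does not match the target length
-- and generate the corresponding ALTER statement (CHARACTER_MAXIMUM_LENGTH = -1 means varchar(max)).
SELECT  TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME,
        DATA_TYPE, CHARACTER_MAXIMUM_LENGTH,
        'ALTER TABLE ' + QUOTENAME(TABLE_SCHEMA) + '.' + QUOTENAME(TABLE_NAME) +
        ' ALTER COLUMN ' + QUOTENAME(COLUMN_NAME) + ' varchar(50);' AS AlterScript
FROM    INFORMATION_SCHEMA.COLUMNS
WHERE   COLUMN_NAME = 'name'
  AND   DATA_TYPE = 'varchar'
  AND   CHARACTER_MAXIMUM_LENGTH <> 50;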
As they are in different schemas and even in a different database, do you know for sure that they actually represent the same piece of information? A generic term like name could mean many things in different contexts - e.g. Australian State Names have a very specific max size that I can define, but a City Name could vary hugely. Don't change anything without understanding the actual domain of the data and then ensure the size is appropriate.
Background
I'm using Azure Data Factory v2 to load data from on-prem databases (for example SQL Server) to Azure Data Lake Gen2. Since I'm going to load thousands of tables, I've created a dynamic ADF pipeline that loads the data as-is from the source based on parameters for schema, table name, modified date (for identifying increments) and so on. This obviously means I can't specify any type of schema or mapping manually in ADF. This is fine since I want the data lake to hold a persistent copy of the source data in the same structure. The data is loaded into ORC files.
Based on these ORC files I want to create external tables in Snowflake with virtual columns. I have already created normal tables in Snowflake with the same column names and data types as in the source tables, which I'm going to use in a later stage. I want to use the information schema for these tables to dynamically create the DDL statement for the external tables.
The issue
Since column names are always UPPER case in Snowflake, and it's case-sensitive in many ways, Snowflake is unable to parse the ORC file with the dynamically generated DDL statement as the definition of the virtual columns no longer corresponds to the source column name casing. For example it will generate one virtual column as -> ID NUMBER AS(value:ID::NUMBER)
This will return NULL as the column is named "Id" with a lower case D in the source database, and therefore also in the ORC file in the data lake.
This feels like a major drawback with Snowflake. Is there any reasonable way around this issue? The only options I can think of are to:
1. Load the information schema from the source database to Snowflake separately and use that data to build a correct virtual column definition with correct cased column names.
2. Load the records in their entirety into some variant column in Snowflake, converted to UPPER or LOWER.
Both options add a lot of complexity or even mess up the data. Is there any straightforward way to only return the column names from an ORC file? Ultimately I would need to be able to use something like Snowflake's DESCRIBE TABLE on the file in the data lake.
Unless you have set the parameter QUOTED_IDENTIFIERS_IGNORE_CASE = TRUE, you can declare your column in the casing you want by quoting the identifier:
CREATE TABLE "MyTable" ("Id" NUMBER);
If your dynamic SQL carefully uses "Id" and not just Id you will be fine.
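For example, a generated external table definition could quote both the virtual column name and the path into the variant (a sketch; the database, table, stage and file format details here are placeholders, not taken from the question):
-- value:"Id" matches the ORC column "Id" case-sensitively; an unquoted value:ID would look for "ID".
CREATE EXTERNAL TABLE my_db.my_schema.ext_my_table (
    "Id"   NUMBER  AS (value:"Id"::NUMBER),
    "Name" VARCHAR AS (value:"Name"::VARCHAR)
)
WITH LOCATION = @my_db.my_schema.my_stage/my_table/
FILE_FORMAT = (TYPE = ORC)
AUTO_REFRESH = FALSE;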
Found an even better way to achieve this, so I'm answering my own question.
With the query below we can get the path/column names directly from the ORC file(s) in the stage, with a hint of the data type from the source. It filters out columns that only contain NULL values. I will most likely create some kind of data type ranking table for the final data type determination of the virtual columns we're aiming to define dynamically for the external tables.
SELECT f.path as "ColumnName"
     , TYPEOF(f.value) as "DataType"
     , COUNT(1) as NbrOfRecords
FROM (
    SELECT $1 as "value"
    FROM @<db>.<schema>.<stg>/<directory>/ (FILE_FORMAT => '<fileformat>')
),
lateral flatten(value, recursive=>true) f
WHERE TYPEOF(f.value) != 'NULL_VALUE'
GROUP BY f.path, TYPEOF(f.value)
ORDER BY 1
I'm getting a little confused by DB2 table naming. In DB2 we have tablespace, schema, database name, username, table name, etc.
My (silly) question is: when querying (i.e. SELECT * FROM ???), what is the full and complete name of a table? In some situations it doesn't work if I put just the table name (i.e. the tablespace and schema seem to be required as well).
For example, in MySQL it is [database].[table]. But what about DB2?
Thanks a lot for your attention.
In DB2 you connect to a database and then select from a schema and table name
schema.tablename
For an end user that is it. In the background there can be crazy stuff going on with nicknames -- any surfaced schema.tablename can be an alias to anywhere else (even other servers with federation), but from the query point of view it is in that schema.tablename location.
tablespaces and indexspaces are used internally to map where the data is stored on disk and only matter when you are creating the table.
bufferpools are used internally to map where the data is stored in memory and only matter when you create tablespaces.
Other objects (Views, Indexes, Stored Procedures, Functions, and Sequences) are just like tables
schema.<objectname>
Special Characters
In DB2 you can have special characters in names. You use double quotes (") around such names:
schema."tablename with space"
Quotes are also needed if you mix upper and lower case in your names. DB2 folds unquoted names to upper case, so a leading practice is to use upper case and underscores in all names -- then you don't need to worry about quoting or matching case.
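A quick illustration of the folding behaviour (the names are just examples):
-- Unquoted identifiers are folded to upper case, so these both resolve to MYSCHEMA.MYTABLE:
SELECT * FROM myschema.mytable;
SELECT * FROM MYSCHEMA.MYTABLE;
-- A quoted mixed-case name keeps its casing and must be quoted the same way every time:
CREATE TABLE MYSCHEMA."MyTable" (ID INTEGER);
SELECT * FROM MYSCHEMA."MyTable";   -- works
-- SELECT * FROM MYSCHEMA.MyTable;  -- fails: folds to MYSCHEMA.MYTABLE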
Consider a situation where the schema of a database table may change, that is, the fields, number of fields, and types of those fields may vary based on, say a client ID.
Take, for example, a Users table. Typically we would represent this as a horizontal table with the following fields:
FirstName
LastName
Age
However, as I mentioned, each client may have different requirements.
I was thinking that to represent a multi-schema approach to Users in a relational database like SQL Server, this would be done with two tables:
UsersFieldNames - {FieldNameId, ClientId, FieldName, FieldType}
UsersValues - {UserValueId, FieldNameId, FieldValue}
To retrieve the data (using Entity Framework DB First), I'm thinking of a pivot, and the use of something like LINQ Extensions - Pivot Extensions may be useful.
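For illustration, the kind of pivot I have in mind written as plain SQL against the two tables above (a sketch only; I've assumed UsersValues also carries a UserId to group values per user, and @clientId plus the field names are example values):
-- Pivot the name/value rows for one client's users back into a wide shape.
SELECT  v.UserId,
        MAX(CASE WHEN f.FieldName = 'FirstName' THEN v.FieldValue END) AS FirstName,
        MAX(CASE WHEN f.FieldName = 'LastName'  THEN v.FieldValue END) AS LastName,
        MAX(CASE WHEN f.FieldName = 'Age'       THEN v.FieldValue END) AS Age
FROM    UsersValues v
JOIN    UsersFieldNames f ON f.FieldNameId = v.FieldNameId
WHERE   f.ClientId = @clientId
GROUP BY v.UserId;
Since the field list differs per client, the CASE branches (or a PIVOT column list) would themselves need to be generated dynamically, which is part of what makes me wonder about better approaches.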
I would like to know of any other approaches that would satisfy this requirement.
I'm asking this question for my own curiosity, as I recall similar conversations coming up in the past, and in relation to this question posed.
Thanks.
While I think a NoSQL database would work best for this, I once tried something similar.
Have a table named something like METATABLES, defined like this
METATABLE = {table_name, field name}
and another,
ACTUAL_DATA ={table_name, field_name, actual_data_id, float_value, string_value, double_value, varchar_value}
In ACTUAL_DATA, the fields table_name and field_name would be foreign keys pointing to METATABLES. In METATABLES you define the specific fields each client requires. The ACTUAL_DATA table holds the actual values of those fields, stored in the appropriate value column depending on the data type (if the field value is a string, it would be stored in the string_value field).
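A rough DDL sketch of what I described (column types, sizes and keys are assumptions, not the original script):
CREATE TABLE METATABLES (
    table_name varchar(128) NOT NULL,
    field_name varchar(128) NOT NULL,
    PRIMARY KEY (table_name, field_name)
);

CREATE TABLE ACTUAL_DATA (
    actual_data_id int IDENTITY(1,1) PRIMARY KEY,
    table_name     varchar(128) NOT NULL,
    field_name     varchar(128) NOT NULL,
    -- exactly one of the typed value columns is populated, depending on the field's data type
    float_value    real NULL,
    string_value   varchar(max) NULL,
    double_value   float NULL,
    varchar_value  varchar(255) NULL,
    FOREIGN KEY (table_name, field_name) REFERENCES METATABLES (table_name, field_name)
);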
This approach is probably not the most efficient, though. Hope it helps.
I think it would be a mistake to have the schema vary. It is typically something you want to be standard.
In this case you may have users that have different attributes. In the user table you store attributes that are common across all users:
USER {id(primary key), username, first, last, DOB, etc...}
Note: Age is something that should not be stored; it should be calculated.
Then you could have a USER_ATTRIBUTE table:
{userId,key,value}
So users can have multiple attributes that are unrelated to one another without the schema changing.
Changing the schema often breaks the application.
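A minimal sketch of that pair of tables in SQL Server terms (column names follow the outline above; types and sizes are arbitrary):
CREATE TABLE [USER] (
    id       int IDENTITY(1,1) PRIMARY KEY,
    username nvarchar(64) NOT NULL,
    first    nvarchar(64) NULL,
    last     nvarchar(64) NULL,
    DOB      date NULL
);

CREATE TABLE USER_ATTRIBUTE (
    userId int           NOT NULL REFERENCES [USER](id),
    [key]  nvarchar(64)  NOT NULL,   -- e.g. 'MiddleInitial' for one client, 'FavouriteColour' for another
    value  nvarchar(256) NULL,
    PRIMARY KEY (userId, [key])
);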
I am using SQL Server 2005 Express and Visual Studio 2008.
I have a database which has a table with 400 columns. Things were just about manageable until I had to perform bi-directional sync between several databases.
I am wondering what the arguments are for and against keeping a single 400-column table versus splitting it into something like 40 tables?
The table is not normalised and consists mainly of nvarchar(64) columns and some TEXT columns (proper data types were never assigned because the data was converted from text files).
There is one other table that links to this table in a 1-1 relationship (i.e. one entry relates to one entry in the 400-column table).
The table is a list of files that contain parameters that are "plugged" into an application.
I look forward to your replies.
Thank you
Based on your process description I would start with something like this. The model is simplified and does not capture history, etc. -- but it is a good starting point (a rough DDL sketch follows the bullet points below). Note: parameter = property.
- Setup is a collection of properties. One setup can have many properties, one property belongs to one setup only.
- Machine can have many setups, one setup belongs to one machine only.
- Property is of a specific type (temperature, run time, spindle speed), there can be many properties of a certain type.
- Measurement and trait are types of properties. Measurement is a numeric property, like speed. Trait is a descriptive property, like color or some text.
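A rough DDL sketch of the model (names, data types and sizes are placeholders; history tracking is omitted):
CREATE TABLE Machine (
    MachineId   int IDENTITY(1,1) PRIMARY KEY,
    MachineName nvarchar(64) NOT NULL
);

CREATE TABLE Setup (
    SetupId   int IDENTITY(1,1) PRIMARY KEY,
    MachineId int NOT NULL REFERENCES Machine(MachineId),
    SetupDate datetime NOT NULL
);

CREATE TABLE PropertyType (
    PropertyTypeId int IDENTITY(1,1) PRIMARY KEY,
    TypeName       nvarchar(64) NOT NULL   -- temperature, run time, spindle speed, ...
);

-- Property is the supertype; Measurement and Trait are its subtypes.
CREATE TABLE Property (
    PropertyId     int IDENTITY(1,1) PRIMARY KEY,
    SetupId        int NOT NULL REFERENCES Setup(SetupId),
    PropertyTypeId int NOT NULL REFERENCES PropertyType(PropertyTypeId)
);

CREATE TABLE Measurement (
    PropertyId   int PRIMARY KEY REFERENCES Property(PropertyId),
    NumericValue decimal(18,4) NOT NULL
);

CREATE TABLE Trait (
    PropertyId int PRIMARY KEY REFERENCES Property(PropertyId),
    TextValue  nvarchar(256) NOT NULL
);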
For having a wide table:
Quick to report on as it's presumably denormalized and so no joins are needed.
Easy to understand for end-consumers as they don't need to hold a data model in their heads.
Against having a wide table:
Probably need to have multiple composite indexes to get good query performance.
More difficult to maintain data consistency, i.e. you need to update multiple rows when data changes if that data is duplicated across rows.
As you're having to update multiple rows and maintain multiple indexes, concurrent performance for updates may become an issue as locks escalate.
You might end up with records with loads of nulls in columns if the attribute isn't relevant to the entity on that row, which can make handling results awkward.
If lazy developers do a SELECT * from the table you end up dragging loads of data across the network, so you generally have to maintain suitable subset views.
So it all really depends on what you're doing. If the main purpose of the table is OLAP reporting and updates are infrequent and affect few rows then perhaps a wide, denormalized table is the right thing to have. In an OLTP environment then it's probably not and you should prefer narrower tables. (I generally design in 3NF and then denormalize for query performance as I go along.)
You could always take the approach of normalizing and providing a wide-view for readers if that's what they want to see.
Without knowing more about the situation it's not really possible to say more about the pros and cons in your particular circumstance.
Edit:
Given what you've said in your comments, have you considered just having a long & skinny name=value pair table, so you'd just have UserId, PropertyName and PropertyValue columns? You might want to add some other meta-attributes too: a timestamp, a version, or whatever. SQL Server is quite efficient at handling these sorts of tables, so don't discount a simple solution like this out of hand.
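A sketch of that long & skinny shape with a timestamp meta-attribute and a "latest value" query (all names here are placeholders):
CREATE TABLE UserProperty (
    UserId        int           NOT NULL,
    PropertyName  nvarchar(64)  NOT NULL,
    PropertyValue nvarchar(256) NULL,
    RecordedAt    datetime      NOT NULL DEFAULT GETUTCDATE(),
    PRIMARY KEY (UserId, PropertyName, RecordedAt)
);

-- Latest value of each property for each user:
SELECT UserId, PropertyName, PropertyValue
FROM (
    SELECT UserId, PropertyName, PropertyValue,
           ROW_NUMBER() OVER (PARTITION BY UserId, PropertyName
                              ORDER BY RecordedAt DESC) AS rn
    FROM UserProperty
) AS latest
WHERE rn = 1;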
We have a situation where we need to store form data in our SQL Server, but each new job we set up will have different fields with different field names and lengths. An example:
Job 1:
Field 1: first_name - varchar(20)
Field 2: last_name - varchar(30)
Job 2:
Field 1: first_name - varchar(15)
Field 2: middle_initial - varchar(1)
Field 3: last_name - varchar(30)
Initially we were setting up separate tables to store this data, conforming exactly to the form in question. But it led to a maintenance nightmare, as there were so many tables, procs, DTS and SSIS packages to change each time in order to accommodate the dynamic nature of this data.
We came up with a different solution: store all of the data in XML fields, which solves most of the problem. It now looks similar to this:
<Record>
<first_name>value</first_name>
<last_name>value</last_name>
</Record>
Then we would create views to pull this data out of the table
SELECT
      IsNull(data.value('(/Record/first_name)[1]', 'varchar(20)'),'') as first_name
    , IsNull(data.value('(/Record/last_name)[1]', 'varchar(30)'),'') as last_name
FROM FormTable
Now this is much better than what we had before, but it also means that we still need to create a custom view each time. I'd much rather maintain some type of table that lists the fields and have that query built for me:
Field Name | Field Type | Field Length
first_name | varchar | 20
last_name | varchar | 30
I'm pretty sure I cannot create a dynamic view. One option that could work is a table-valued function. But is there something that I am overlooking here? Would there be a better option for dynamically storing the data in this way (without moving away from SQL Server, since I know other databases such as CouchDB will do this natively)?
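To make the idea concrete, I imagine the view could be generated from such a field-list table with dynamic SQL along these lines (a sketch; the FormFields and vwFormData names are made up, and drop/alter handling is omitted), but I'd like to know if there's a better way:
-- Metadata table describing the fields of one form.
CREATE TABLE FormFields (
    FieldName   varchar(128) NOT NULL,
    FieldType   varchar(20)  NOT NULL,
    FieldLength int          NOT NULL
);

-- Build the column list from the metadata and create the view.
DECLARE @cols nvarchar(max), @sql nvarchar(max);

SELECT @cols = ISNULL(@cols + CHAR(13) + '    , ', '      ') +
       'IsNull(data.value(''(/Record/' + FieldName + ')[1]'', ''' +
       FieldType + '(' + CAST(FieldLength AS varchar(10)) + ')' +
       '''), '''') AS ' + QUOTENAME(FieldName)
FROM FormFields;

SET @sql = 'CREATE VIEW dbo.vwFormData AS' + CHAR(13) +
           'SELECT' + CHAR(13) + @cols + CHAR(13) + 'FROM FormTable;';

EXEC sp_executesql @sql;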
I'm pretty sure a table-valued function would not work, since you can't use dynamic SQL or temp tables (which you would almost certainly need in order to do this).
A stored procedure would be the obvious choice - it can do everything you need to do, but the problem is of course that you can't SELECT from a stored procedure.
I was looking at this page that discusses a bunch of options for doing things like this. One of the options he mentions is using OPENQUERY, which seems like it would be very easy, but there could be performance problems:
SELECT * FROM OPENQUERY(LOCALSERVER, 'EXEC sp_getformdata')
Anyways, you might want to check out that link to get some additional ideas.