Inconsistent data in table - database

A partner where I work has created the customer table with the following fields:
first_name, middle_name, last_name, second_last_name, full_name
Where full_name is the concatenation of the other fields.
Can you give me the best explanation of why this is bad practice?

It's not ideal because sooner or later, someone or something is going to update last_name or first_name without updating full_name, or vice versa, and you'll have something like this in your database:
first_name  last_name  full_name
John        White      John Black
And then you get to try to figure out where the discrepancy is coming from and what this guy's last name is really supposed to be, which is no fun. If you're going to denormalize a table like this, there ought to be some compelling reason for doing so. What's your partner's rationale for wanting full_name to be a separate field?
You should probably investigate alternatives. For instance, you could define a view that returns the various name components from your table and also assembles them into a full_name. Depending on your RDBMS, you may have other options as well. For instance, in SQL Server you can put a computed column right into your table.
declare @customer table (first_name varchar(50), last_name varchar(50), full_name as first_name + ' ' + last_name);
insert @customer values ('John', 'B');
select * from @customer;
Result:
first_name  last_name  full_name
John        B          John B
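The view alternative mentioned above can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for whatever RDBMS you actually run, and the view name customer_with_full_name is made up for the example.

```python
import sqlite3

# SQLite stands in for the real RDBMS; table and view names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (first_name TEXT, last_name TEXT)")

# The view derives full_name on every read, so it cannot drift from its parts.
conn.execute(
    "CREATE VIEW customer_with_full_name AS "
    "SELECT first_name, last_name, "
    "first_name || ' ' || last_name AS full_name "
    "FROM customer"
)

conn.execute("INSERT INTO customer VALUES ('John', 'White')")
# Only the component column is updated; there is no stored full_name to forget.
conn.execute("UPDATE customer SET last_name = 'Black' WHERE first_name = 'John'")

rows = conn.execute("SELECT * FROM customer_with_full_name").fetchall()
print(rows)
```

Because full_name is never stored, the "John White / John Black" discrepancy described earlier has nowhere to live.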

If full_name is persisted in storage, you are duplicating data: the names take up roughly twice the space for no benefit, and every edit or other maintenance task carries extra overhead.
If the full_name column is actually computed (e.g. calculated from the other columns of the same row), the solution is fine!
Depending on the database engine you use, such calculated columns may be read-only (you have to update the other columns to change their value) or even read-write. Writing to such a column is handled by another function, which could, for example, parse the full name and store the parts in the row being updated.

Related

SSIS - Split multiple columns into single rows

Below is a very small example of the flat table I need to split. The first table is a Lesson table which has the ID, Name and Duration. The second table is a student table which only has the Student Name as a PK. And the third table will be a Many to Many of Lesson ID and Student Name.
Lesson Id  Lesson Name  Lesson Duration  Student1  Student2  Student3  Student4
1          Maths        1 Hour           Jean      Paul      Jane      Doe
2          English      1 Hour           Jean      Jane      Doe       NULL
I don't know how, using SSIS, I can assign Jean, Paul, Jane and Doe to their own tables using the Student 1, 2, 3 and 4 columns. When I figure this out, I imagine I can use the same logic to map the Lesson ID and columns to the third Many to Many table?
How do I handle duplicate entries? For example, Jean, Jane and Doe already exist from the first row, so they do not need to be added to the Students table again.
I assume I use a conditional split to skip null values? For example Student4 on the second row is Null.
Thanks for the assistance.
Were it me, I would design this as 3 data flows.
Data flow 1 - student population
Since we're assuming the name is what makes a student unique, we need to build a big list of the unique names.
SELECT D.*
FROM
(
SELECT S.Student1 AS StudentName
FROM dbo.MyTable AS S
UNION
SELECT S.Student2 AS StudentName
FROM dbo.MyTable AS S
UNION
SELECT S.Student3 AS StudentName
FROM dbo.MyTable AS S
UNION
SELECT S.Student4 AS StudentName
FROM dbo.MyTable AS S
)D
WHERE D.StudentName IS NOT NULL
ORDER BY D.StudentName;
The UNION in the query handles deduplication, and we wrap it in a derived table to filter out the NULLs.
I add an explicit ORDER BY, not because it's strictly needed, but since I'm assuming the name is the primary key, pre-sorting lets us avoid a sort operation when we land the data.
Add an OLE DB Source to your data flow and instead of picking a table in the drop down, you'll use the above query.
Add an OLE DB Destination to the same data flow and connect the two. Assuming your target table looks something like
CREATE TABLE dbo.Student
(
StudentName varchar(50) NOT NULL CONSTRAINT PK__dbo__Student PRIMARY KEY(StudentName)
);
Data Flow 2 - Lessons
Dealer's choice here: you can either write the query or just point at the source table.
A very good practice to get into with SSIS is to only bring the data you need into the buffers so I would write a query like
SELECT DISTINCT S.[Lesson Id], S.[Lesson Name], S.[Lesson Duration]
FROM dbo.MyTable AS S;
I favor a distinct here as I don't know enough about your data but if it were extended and a second Maths class was offered to accommodate another 4 students, it might be Lesson Id 1 again. Or it might be 3 as it indicates course time or something else.
Add an OLE DB Destination and land the data.
Data Flow 3 - Many to Many
There's a few different ways to handle this. I'd favor the lazy way and repeat our approach from the first data flow
SELECT D.*
FROM
(
SELECT S.Student1 AS StudentName, S.[Lesson Id]
FROM dbo.MyTable AS S
UNION
SELECT S.Student2 AS StudentName, S.[Lesson Id]
FROM dbo.MyTable AS S
UNION
SELECT S.Student3 AS StudentName, S.[Lesson Id]
FROM dbo.MyTable AS S
UNION
SELECT S.Student4 AS StudentName, S.[Lesson Id]
FROM dbo.MyTable AS S
)D
WHERE D.StudentName IS NOT NULL
ORDER BY D.StudentName;
And then land in your bridge table with an OLE DB Destination and be done with it.
If this has been homework/an assignment to have you learn the native components...
Do keep with the 3 data flow approach. Trying to do too much in one go is a recipe for trouble.
The operation of moving wide data to narrow data is an Unpivot operation. You'd use that in the Student and bridge table data flows but honestly, I think I've used that component less than 10 times in my career and/or answering SSIS questions here and I do a lot of that.
If the Unpivot operation generates a NULL, then yes, you'd likely want to use a Conditional Split to filter those rows out.
If your reference tables were more complex, then you'd likely be adding a Lookup component to your bridge table population step to retrieve the surrogate key.
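To make the Unpivot and Conditional Split steps concrete, here is a rough sketch of the same wide-to-narrow logic in plain Python rather than SSIS. The flat_rows list is a hand-typed copy of the sample rows from the question, and the function and variable names are made up for illustration.

```python
# Hand-typed copy of the flat source rows from the question:
# (lesson_id, lesson_name, duration, student1, student2, student3, student4)
flat_rows = [
    (1, "Maths", "1 Hour", "Jean", "Paul", "Jane", "Doe"),
    (2, "English", "1 Hour", "Jean", "Jane", "Doe", None),
]


def unpivot(rows):
    """Emit one (student_name, lesson_id) pair per non-NULL student column."""
    pairs = []
    for lesson_id, _name, _duration, *students in rows:
        for student in students:
            if student is not None:  # the Conditional Split step: drop NULLs
                pairs.append((student, lesson_id))
    return pairs


# Data flow 3: the many-to-many bridge rows.
bridge = sorted(set(unpivot(flat_rows)))
# Data flow 1: unique student names (the set plays the role of UNION).
students = sorted({name for name, _ in bridge})
# Data flow 2: distinct lessons.
lessons = sorted({(r[0], r[1], r[2]) for r in flat_rows})

print(students)
print(lessons)
print(bridge)
```

The three outputs line up with the three data flows: one list per target table, each built independently from the same source rows.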

Creating a unique ID using first 3 character of last name and a sequence number

I have an employee table in SQL Server, and I want every employee to have a unique ID made of the first three letters of their last name plus a sequence number.
I don't remember how to do this at all; I haven't used SQL in a year and have forgotten pretty much everything.
Can anyone refresh my memory on how to do this? Google has been of no help on this matter. Thanks
Firstly, I suggest that the numeric portion of your identifier be unique in and of itself, in case the employee gets married and changes their last name. The prefix can still appear to the left of it, but should not be necessary to be unique.
If you agree with this design, then you can simply use a numeric identity column on the Employee table and combine that with the last name when retrieving the data, using a computed column. I suggest you seed the identity with a value that has enough digits to keep your identifier lengths consistent, so for example to support 90,000 employees you can use a seed of 10,000 which ensures all identifiers are 8 characters long (three letters of the name plus five numeric).
Simple example:
CREATE TABLE Employee
(
EmployeeNo int IDENTITY(10000,1) PRIMARY KEY,
LastName VarChar(64),
EmployeeID AS SUBSTRING(UPPER(LastName), 1, 3) + RIGHT('0000' + CONVERT(char(5), EmployeeNo), 5)
)
INSERT Employee (LastName) VALUES ('Smith')
SELECT * FROM Employee
Results:
EmployeeNo LastName EmployeeID
10000 Smith SMI10000
For the purposes of your SQL and table design, your tables should all use EmployeeNo as foreign key, since it is compact and unique. Apply the three-letter prefix during data retrieval and only for customer-facing purposes.
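@John Wu is right. However, if you don't want to rely on the employee number, you can use the NEWID() function, which generates a new GUID on every call. Store it in its own column with a default so that it is generated once per row; a non-persisted computed column that calls NEWID() directly would return a different value on every read. Below is the code.
CREATE TABLE EmployeeDetails
(
EmployeeCode int IDENTITY(1,1),
FirstName varchar(50),
LastName varchar(50),
EmpGuid uniqueidentifier NOT NULL DEFAULT NEWID(), -- generated once, at insert time
Empid AS LEFT(LastName, 3) + CONVERT(varchar(36), EmpGuid)
)
INSERT EmployeeDetails (FirstName, LastName) VALUES ('Atul', 'Jain')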

SQL JOIN all tables from one data

I am trying to get all the data from all tables in one DB.
I have looked around, but I haven't been able to find any solution that works for my current problem.
I made a C# program that creates a table for each day the program runs. The table name will look like tbl18_12_2015 for today's date (Danish date format).
Now, in order to make a yearly report, I would love to get ALL the data from all the tables in the DB that stores these reports. I have no way of knowing how many tables there will be or what they are called, other than the format (tblDD_MM_YYYY).
I'm thinking of something like this (which obviously doesn't work):
SELECT * FROM DB_NAME.*
All the tables have the same columns, and one of them is a primary key, that auto increments.
Here is a table named tbl17_12_2015
ID  PERSONID  NAME      PAYMENT  TYPE  RESULT  TYPE
3   92545     TOM       20,5     A     NULL    NULL
4   92545     TOM       20,5     A     NULL    NULL
6   117681    LISA      NULL     NULL  207     R
Here is a table named tbl18_12_2015
ID  PERSONID  NAME      PAYMENT  TYPE  RESULT  TYPE
3   117681    LISA      30       A     NULL    NULL
4   53694     DAVID     78       A     NULL    NULL
6   58461     MICHELLE  NULL     NULL  207     R
What i would like to get is something like this(from all tables in the DB):
PERSONID  NAME      PAYMENT  TYPE  RESULT  TYPE
92545     TOM       20,5     A     NULL    NULL
92545     TOM       20,5     A     NULL    NULL
117681    LISA      NULL     NULL  207     R
117681    LISA      30       A     NULL    NULL
53694     DAVID     78       A     NULL    NULL
58461     MICHELLE  NULL     NULL  207     R
I have tried a few different queries, but none of them returned this, just a lot of info about the tables.
Thanks in advance, and happy holidays
edit: corrected tbl18_12_2015 col 3 header to English rather than Danish
Thanks to all those who tried to help me solve this question, but I can't (most likely due to my skill set) get the UNION to work, so I decided to refactor my DB.
While you could store the table names in a database and use dynamic sql to union them together, this is NOT a good idea and you shouldn't even consider it - STOP NOW!!!!!
What you need to do is create a new table with the same fields - and add an ID (auto-incrementing identity column) and a DateTime field. Then, instead of creating a new table for each day, just write your data to this table with the DateTime. Then, you can use the DateTime field to filter your results, whether you want something from a day, week, month, year, decade, etc. - and you don't need dynamic sql - and you don't have 10,000 database tables.
I know some people posted comments expressing the same sentiments, but, really, this should be an answer.
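A minimal sketch of the single-table design described above, using Python's built-in sqlite3 module as a stand-in for the real database. The table name report, the column names, and the sample dates are assumptions made for the example; the names and amounts are borrowed from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One table for every day, with an auto-incrementing ID and a date column.
conn.execute(
    "CREATE TABLE report ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " recorded_on TEXT NOT NULL,"  # ISO date, e.g. '2015-12-18'
    " person_id INTEGER, name TEXT, payment REAL, payment_type TEXT)"
)
conn.executemany(
    "INSERT INTO report (recorded_on, person_id, name, payment, payment_type) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("2015-12-17", 92545, "TOM", 20.5, "A"),
        ("2015-12-18", 117681, "LISA", 30.0, "A"),
        ("2016-01-04", 53694, "DAVID", 78.0, "A"),  # invented date, next year
    ],
)


def rows_between(start, end):
    """Return rows with start <= recorded_on < end (ISO dates sort as text)."""
    return conn.execute(
        "SELECT person_id, name, payment FROM report "
        "WHERE recorded_on >= ? AND recorded_on < ? ORDER BY id",
        (start, end),
    ).fetchall()


year_2015 = rows_between("2015-01-01", "2016-01-01")
one_day = rows_between("2015-12-18", "2015-12-19")
print(year_2015)
print(one_day)
```

The yearly report and the daily report are the same query with different bounds, which is the point of storing the date as data instead of encoding it in the table name.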
If you had all the tables in the same database, you would be able to use the UNION operator to combine them.
Maybe you can do something like this to select all the tables names from a given database
For SQL Server:
SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE' AND TABLE_CATALOG='dbName'
For MySQL:
SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE' AND TABLE_SCHEMA='dbName'
Once you have the list of tables, you can move all the tables into one database and create your report using UNIONs.
You will need to use a UNION between each select query.
Do not use *, always list the name of the columns you are bringing up.
If you want duplicates, then UNION ALL is what you want.
If you want unique records based on PERSONID but the rows are likely to differ, then I'd guess an UPDATE_DATE column would be useful for deciding which one to keep. But what if records with the same PERSONID have each lived a life of their own in different tables?
You'd need to define business rules for which specific changes to keep and merge into the single resulting record, and there you'd be on your own.
What is "Skyttenavn"? Is it Danish? If it is the same as "NAME", you'd want to alias that column as 'NAME' in the select query, although it's the order of the columns as listed that counts when determining what to unite.
You'd need a new auto-incremented ID as a unique primary key, by the way, if you are likely to have conflicting IDs. If you want to carry existing ID values into a new identity column, you'd set IDENTITY_INSERT to ON for the target table while inserting them, then back to OFF to resume natural incrementation.
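The difference between UNION ALL and UNION with an explicit column list can be sketched like this. It uses Python's built-in sqlite3 module as a stand-in for SQL Server or MySQL, with two of the daily tables from the question reduced to three columns for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("tbl17_12_2015", "tbl18_12_2015"):
    conn.execute(
        f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, personid INTEGER, name TEXT)"
    )
conn.executemany(
    "INSERT INTO tbl17_12_2015 VALUES (?, ?, ?)",
    [(3, 92545, "TOM"), (4, 92545, "TOM"), (6, 117681, "LISA")],
)
conn.executemany(
    "INSERT INTO tbl18_12_2015 VALUES (?, ?, ?)",
    [(3, 117681, "LISA"), (4, 53694, "DAVID"), (6, 58461, "MICHELLE")],
)

columns = "personid, name"  # explicit column list instead of *

# UNION ALL keeps every row, including repeats.
keep_duplicates = conn.execute(
    f"SELECT {columns} FROM tbl17_12_2015 "
    f"UNION ALL SELECT {columns} FROM tbl18_12_2015"
).fetchall()

# UNION removes rows that are identical across the selected columns.
drop_duplicates = conn.execute(
    f"SELECT {columns} FROM tbl17_12_2015 "
    f"UNION SELECT {columns} FROM tbl18_12_2015"
).fetchall()

print(len(keep_duplicates), len(drop_duplicates))
```

Leaving the per-table id out of the column list is what lets UNION treat the two TOM rows, and the two LISA rows, as duplicates.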

Generate Unique hash for a field in SQL Server

I'm in the process of writing a Membership Provider for use with our existing membership base. I use EF4.1 for all of my database access, and one of the issues I'm running into is that when the DB was originally set up, the relationships were done programmatically instead of in the DB. One of the relationships needs to be made on a column that isn't required for all of our users, but in order to make the relationship it does need to be unique (from my understanding).
My solution that I believe will work is to do an MD5 hash on the userid field (which is unique ...which would/should guarantee a unique value in that field). The part that I'm having issues with on sql server is the query that would do this WITHOUT replacing the existing values stored in the employeeNum field (the one in question).
So in a nutshell my question is. What is the best way to get a unique value in the employeeNum field (possibly based on an md5 hash of the userid field) on all the rows in which a value isn't already present. Also, to a minor/major extent...does this sound like a good plan?
If your question is just how to generate a hash value for userid, you can do it this way using a computed column (or generate this value as part of the insert process). It isn't clear to me whether you know about the HASHBYTES function or what other criteria you're looking at when you say "best."
DECLARE @foo TABLE
(
userid INT,
hash1 AS HASHBYTES('MD5', CONVERT(VARCHAR(12), userid)),
hash2 AS HASHBYTES('SHA1', CONVERT(VARCHAR(12), userid))
);
INSERT @foo(userid) SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 500;
SELECT userid, hash1, hash2 FROM @foo;
Results:
userid hash1 hash2
------ ---------------------------------- ------------------------------------------
1 0xC4CA4238A0B923820DCC509A6F75849B 0x356A192B7913B04C54574D18C28D46E6395428AB
2 0xC81E728D9D4C2F636F067F89CC14862C 0xDA4B9237BACCCDF19C0760CAB7AEC4A8359010B0
500 0xCEE631121C2EC9232F3A2F028AD5C89B 0xF83A383C0FA81F295D057F8F5ED0BA4610947817
In SQL Server 2012, I highly recommend at least SHA2_256 instead of either of the above. (You forgot to mention what version you're using - always useful information.)
All that said, I still want to call attention to the point I made in the comments: the "best" solution here is to fix the model. If employeeNum is optional, EF shouldn't be made to think it is required or unique, and it shouldn't be used in relationships if it is not, in fact, some kind of identifier. Why would a user care about collisions between employeeNum and userid if you're using the right attribute for the relationship in the first place?
EDIT as requested by OP
So what is wrong with saying UPDATE table SET EmployeeNum = 1000000 + UserID WHERE EmployeeNum IS NULL? If EmployeeNum will stay below 1000000 then you've guaranteed no collisions and you've avoided hashing altogether.
You could generate similar padding if employeeNum might contain a string, but again is it EF that promotes these horrible column names? Why would a column with a Num suffix contain anything but a number?
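The offset idea above can be tried out with a small sketch. It uses Python's built-in sqlite3 module instead of SQL Server, and the users table, its column names, and the existing employee number 4021 are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (userid INTEGER PRIMARY KEY, employee_num INTEGER UNIQUE)"
)
# One user already has an employee number (invented value); two do not.
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, 4021), (2, None), (500, None)],
)

# Fill only the missing values; existing employee numbers are left untouched.
conn.execute(
    "UPDATE users SET employee_num = 1000000 + userid WHERE employee_num IS NULL"
)

result = conn.execute(
    "SELECT userid, employee_num FROM users ORDER BY userid"
).fetchall()
print(result)
```

The generated values stay unique only under the assumption stated in the answer: real employee numbers remain below 1000000.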
You could also use a uniqueidentifier column with its default value set to (newid()).
Create a new column EmployeeNum as uniqueidentifier, then:
UPDATE Employees SET EmployeeNum = newid()
Then set it as the primary key.
UPDATE EMPLOYEE
SET EMPLOYEENUM = HASHBYTES('SHA1', CAST(USERID AS VARCHAR(20)))
WHERE EMPLOYEENUM IS NULL

Sql Server Column with Auto-Generated Data

I have a customer table, and my requirement is to add a new varchar column that automatically obtains a random unique value each time a new customer is created.
I thought of writing an SP that randomizes a string, then check and re-generate if the string already exists. But to integrate the SP into the customer record creation process would require transactional SQL stuff at code level, which I'd like to avoid.
Help please?
edit:
I should've emphasized, the varchar has to be 5 characters long with numeric values between 1000 and 99999, and if the number is less than 10000, pad 0 on the left.
if it has to be varchar, you can cast a uniqueidentifier to varchar.
to get a random uniqueidentifier do NewId()
here's how you cast it:
CAST(NewId() as varchar(36))
EDIT
as per your comment to @Brannon:
are you saying you'll NEVER have over 99k records in the table? if so, just make your PK an identity column, seed it with 1000, and take care of "0" left padding in your business logic.
This question gives me the same feeling I get when users won't tell me what they want done, or why, they only want to tell me how to do it.
"Random" and "Unique" are conflicting requirements unless you create a serial list and then choose randomly from it, deleting the chosen value.
But what's the problem this is intended to solve?
With your edit/update, sounds like what you need is an auto-increment and some padding.
Below is an approach that uses a bogus table, then adds an IDENTITY column (assuming that you don't have one) which starts at 1000, and then which uses a Computed Column to give you some padding to make everything work out as you requested.
CREATE TABLE Customers (
CustomerName varchar(20) NOT NULL
)
GO
INSERT INTO Customers
SELECT 'Bob Thomas' UNION
SELECT 'Dave Winchel' UNION
SELECT 'Nancy Davolio' UNION
SELECT 'Saded Khan'
GO
ALTER TABLE Customers
ADD CustomerId int IDENTITY(1000,1) NOT NULL
GO
ALTER TABLE Customers
ADD SuperId AS right(replicate('0',5)+ CAST(CustomerId as varchar(5)),5)
GO
SELECT * FROM Customers
GO
DROP TABLE Customers
GO
I think Michael's answer with the auto-increment should work well - your customer will get "01000" and then "01001" and then "01002" and so forth.
If you want to or have to make it more random, in this case, I'd suggest you create a table that contains all possible values, from "01000" through "99999". When you insert a new customer, use a technique (e.g. randomization) to pick one of the existing rows from that table (your pool of still available customer ID's), and use it, and remove it from the table.
Anything else will become really bad over time. Imagine you've used up 90% or 95% of your available customer ID's - trying to randomly find one of the few remaining possibility could lead to an almost endless retry of "is this one taken? Yes -> try a next one".
Marc
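The pool-of-available-IDs approach could look roughly like this. It is a sketch using Python's built-in sqlite3 module rather than SQL Server; the id_pool table and the take_random_code function are hypothetical names, and a real multi-user setup would need proper locking around the pick-and-delete step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE id_pool (code TEXT PRIMARY KEY)")
# Pre-fill the pool with every allowed value, zero-padded to 5 characters.
conn.executemany(
    "INSERT INTO id_pool VALUES (?)",
    ((f"{n:05d}",) for n in range(1000, 100000)),
)


def take_random_code():
    """Pick one unused code at random and remove it from the pool."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT code FROM id_pool ORDER BY RANDOM() LIMIT 1"
        ).fetchone()
        if row is None:
            raise RuntimeError("no codes left in the pool")
        conn.execute("DELETE FROM id_pool WHERE code = ?", (row[0],))
        return row[0]


first, second = take_random_code(), take_random_code()
remaining = conn.execute("SELECT COUNT(*) FROM id_pool").fetchone()[0]
print(first, second, remaining)
```

Each pick costs the same whether the pool is nearly full or nearly empty, which avoids the retry loop described above.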
Does the random string data need to be a certain format? If not, why not use a uniqueidentifier?
insert into Customer ([Name], [UniqueValue]) values (@Name, NEWID())
Or use NEWID() as the default value of the column.
EDIT:
I agree with @rm: use a numeric value in your database, and handle the conversion to string (with padding, etc.) in code.
Try this:
ALTER TABLE Customer ADD AVarcharColumn varchar(50)
CONSTRAINT DF_Customer_AVarcharColumn DEFAULT CONVERT(varchar(50), GETDATE(), 109)
It returns a date and time with millisecond precision, which would be enough in most cases.
Do you really need a unique value?
