Planning out basic TSQL to manipulate table data - sql-server

Can someone help me get into the mindset of knowing how to fix data in SQL tables (by trying NOT to give me any SQL routines I could run)?
OK, this is the situation… Suppose I have a single table that has a column called ColumnA, which contains lots of duplicate values. I need to remove all the duplicate entries from the table in question. The question is: if I had to write pseudo-code as a plan, what SQL should be written?
Many thanks to anyone who can offer me any pointers.
Kind Regards
James

I think you've already articulated very basic pseudocode for the issue you describe in stating that you wish to delete duplicate values from ColumnA.
For this example, I would tend to:
1. Find all instances of duplicates.
2. Work out a method of determining which one to keep (Google "MAX N in Group" for ideas). There are good articles here on SO and on DBA StackExchange, as well as other external articles with examples.
3. Write your delete to cater for the records you identify as unwanted duplicates in the previous step.
For me, when working through these types of issues in SQL Server, I tend to write a series of Common Table Expressions (CTEs) to identify my target records and then delete based on that.
For example:
;WITH Duplicates AS (
    -- Write your select query to identify which subset of your records is affected by duplicate values in ColumnA
), TargetRows AS (
    -- Further select from Duplicates, using some MAX N in Group method, to identify which of the rows are unwanted
)
-- Then DELETE from your table based upon your findings from above
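To show that skeleton filled in, here is a minimal sketch (assuming a hypothetical table named MyTable; ROW_NUMBER() numbers the rows within each group of identical ColumnA values, so everything after the first row in a group is an unwanted duplicate):

;WITH Ranked AS (
    SELECT ColumnA,
           ROW_NUMBER() OVER (PARTITION BY ColumnA ORDER BY ColumnA) AS rn
    FROM MyTable
)
-- Deleting through the CTE removes the underlying rows
DELETE FROM Ranked
WHERE rn > 1;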

Table of tables: use SQL PIVOT to extract unique fields for each table

I'm new to Stack Overflow, as my job has me doing more SQL querying than I'm used to (super basic queries). And since this is the best online resource... :)
Preamble...my company has developed an SQL database that contains a giant table of tables. In other words, they extracted a series of tables (200+) from external sources and put them all into one massive table to be used for reporting purposes in other systems. For example, if one of these external tables has 5 fields and 10 rows of data, that translates to 50 rows in this 'table of tables' (Table1, Field1, Value...Table1, Field2, Value....TableX, FieldX, Value...etc.)
Requirement...I need to 'pivot' the data for the purposes of getting a list of all the fields in all the tables. In other words ignore the values (just TableX, FieldX). I need to do this in order to find 'like' fields across all the tables. Being new to using PIVOT in SQL queries, I know the basic structure of the SQL query, but I'm getting lost in the organization of it. Maybe I don't even use PIVOT. Here's what I have...
SELECT * FROM (
    SELECT [FieldName], [iModelTable]
    FROM [H352090DataMart].[dbo].[HA_iModelTableData]
) AS src
PIVOT (MAX([FieldName]) FOR [iTableName] IN
    (
    -- not sure what would go here, if anything
    )
) AS pvt
Any help is greatly appreciated.
Austin.
Never mind, I figured it out. It was as simple as using the DISTINCT keyword. As I said, I'm a bit of a noob when it comes to scripting. So for those who read this post, the answer is as follows (the ORDER BY is extra)...
SELECT DISTINCT
[iModelTable]
,[FieldName]
FROM [H352090DataMart].[dbo].[HA_iModelTableData]
ORDER BY iModelTable
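For reference, a true PIVOT would have required listing every table name as a column, which gets unwieldy with 200+ tables. A sketch of the shape it would take, using two hypothetical table names ([Table1], [Table2]):

SELECT [FieldName], [Table1], [Table2]
FROM (
    SELECT [iModelTable], [FieldName], 1 AS present
    FROM [H352090DataMart].[dbo].[HA_iModelTableData]
) AS src
PIVOT (
    MAX(present) FOR [iModelTable] IN ([Table1], [Table2])
) AS pvt;

Each output row is a field name, with a 1 under every table that contains it, which makes 'like' fields easy to spot.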

How can I get one of my foreign key outputs to repeat in a merge transformation in SSIS?

I tried asking this question before and it seemed to have gotten swept under the rug.
First things first, here are two pictures to show the table structure and the current output I get in SSIS.
Table Diagram
Current Output
So in table three, there is only one entry. This entry (name) applies to the other foreign keys though. What I want the final output to look like is like my current output, but instead of the NULLS, there should just be ones.
I was able to get this far on my own through researching and learning about the merge transformations but I can't seem to find anything on manipulating the data in the way that I want.
I greatly appreciate any tips or advice you can offer.
EDIT: Since the images can't be seen apparently, I will try and describe them.
The table diagram has four tables, the top one in the waterfall has a primary key formed from the three foreign keys for the three different tables.
Trying to accomplish filling out this table in SSIS, my output has each foreign key id from the first two tables, but only one in the third table. The rest from the third foreign key are all NULLS. I believe this is because there is only one entry in that table for now, but this entry applies to all of the foreign key ids and so it should be repeating.
It should look like this:
ID1 ID2 ID3
1 1 1
2 2 1
3 3 1
But instead, I am only getting nulls in the ID3 field after the first record. How do I make the single id repeat in ID3?
EDIT 2: Some additional screenshots of my data flow and merge transformation as requested.
[Screenshot: SSIS data flow]
After working on this for a few weeks, and with some tips from a colleague, a solution to this question was found. Surprisingly, it was quite simple, and I'm slightly shocked that no one on here could provide the answer.
The solution was simply this: using a data source, write the following SQL code in the data access mode (SQL Command):
SELECT a.T1ID,
       b.T2ID,
       c.T3ID
FROM Table1 AS a
JOIN Table2 AS b
    ON a.T1ID = b.T2ID
CROSS JOIN Table3 AS c
ORDER BY a.[T1ID] ASC
If Table3 will always have just a single row, the simplest solution would be to use an Execute SQL task to save the T3id to a variable (Control Flow), then use a Derived Column task (Data Flow) to add the variable as a new column.
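For that first approach, the Execute SQL task's statement could be as simple as the sketch below (using the Table3/T3ID names from the example); set the task's Result Set to "Single row", map T3ID to a package variable, and reference that variable in the Derived Column expression:

SELECT TOP (1) T3ID FROM Table3;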
If that won't work for you (or your data), you can take a look here to see how to fudge the Merge Join task to do what you want.

Insert into two SQL Server tables and pass unique ID from one to the other

I'm hoping someone can help me because, despite finding numerous other questions like this one, I can't seem to find an answer specific to this problem.
I have two tables: TblReportsStore and TblReportsStoreComments.
I want to add a new record to TblReportsStore, and then pass the TblReportsStoreID of that record to the intReportID column of a new record in TblReportsStoreComments. If it helps at all, the data is coming from a spreadsheet where the only changing data is the txtSchoolID and the txtComment.
One way is to use OUTPUT. This assumes that TblReportsStoreID is an identity column; otherwise, you could use NEWID() with the values:
INSERT INTO TblReportsStore(...)
OUTPUT INSERTED.TblReportsStoreID,txtComment
INTO TblReportsStoreComments(intReportID,txtComment)
VALUES ...
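A fleshed-out sketch of the same pattern (the column lists are hypothetical beyond txtSchoolID and txtComment from the question; this assumes TblReportsStore also stores the comment, since the OUTPUT clause of an INSERT can only reference the inserted row, and note that a table targeted by OUTPUT ... INTO cannot have triggers or sit on either side of a foreign key relationship):

INSERT INTO TblReportsStore (txtSchoolID, txtComment)
OUTPUT INSERTED.TblReportsStoreID, INSERTED.txtComment
INTO TblReportsStoreComments (intReportID, txtComment)
VALUES ('SCH001', 'Example comment');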

cakephp linked data HABTM by JOIN need both related data?

This should be a simple Yes/No answer so here goes.
If I set up 3 tables, 2 typical recordsets and 1 that joins them by the ids of the 2 tables, do I need the id from both tables in order to have an entry in the join table?
The scenario is a Jobs table and a Parts table linked by JobsParts table. But some parts are not in the Parts table, they are just freetext entries (so as to avoid stock control issues) belonging to a Job.
Hope this is enough to explain my question.
Thanks
BTW using CakePHP 2.0
For database sanity, I'd say the join table 'jobs_parts' should have both IDs.
If you try entering free-form parts into the join table, you're not only going to increase the size of the join table, but you'll effectively lose the ability to grow/expand - i.e. what if you want to add a few more fields to this unknown part? Or what if it turns out to be a part that you actually want in your normal parts table? It just gets confusing.
There are other options for dealing with free-form parts vs actual parts:
- have a field in the parts table that's a tinyint(1) for whether or not it's a verified part
- OR make an UnknownParts model/table
In my opinion, go with what makes logical sense for ease of understanding and for future updates to your database/website, etc. And IMO, adding a freeform part into the join table would not fit that bill.
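A quick sketch of the two alternatives in SQL (column names are hypothetical, following the jobs_parts convention from the thread):

-- Option recommended above: the join table keeps both IDs
CREATE TABLE jobs_parts (
    id      INT NOT NULL PRIMARY KEY,
    job_id  INT NOT NULL,
    part_id INT NOT NULL
);

-- Alternative: keep free-text entries in the parts table itself,
-- flagged so they can be promoted to verified parts later
-- ALTER TABLE parts ADD verified TINYINT NOT NULL DEFAULT 0;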

How do you avoid adding timestamp fields to your tables? [closed]

I have a question regarding the two additional columns (timeCreated, timeLastUpdated) for each record that we see in many solutions. My question: Is there a better alternative?
Scenario: You have a huge DB (in terms of tables, not records), and then the customer comes and asks you to add "timestamping" to 80% of your tables.
I believe this can be accomplished by using a separate table (TIMESTAMPS). This table would have, in addition to the obvious timestamp column, the table name and the primary key for the table being updated. (I'm assuming here that you use an int as primary key for most of your tables, but the table name would most likely have to be a string).
To picture this, suppose this basic scenario. We would have two tables:
PAYMENT :- (your usual records)
TIMESTAMP :- {current timestamp} + {TABLE_UPDATED, id_of_entry_updated, timestamp_type}
Note that in this design you don't need those two "extra" columns in your native payment object (which, by the way, might make it through your ORM solution) because you are now indexing by TABLE_UPDATED and id_of_entry_updated. In addition, timestamp_type will tell you if the entry is for insertion (e.g. "1"), update (e.g. "2"), or anything else you may want to add, like "deletion".
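A sketch of what that table might look like (the column types are assumptions):

CREATE TABLE TIMESTAMPS (
    ts                  DATETIME     NOT NULL, -- the timestamp itself
    TABLE_UPDATED       VARCHAR(128) NOT NULL, -- name of the table touched
    id_of_entry_updated INT          NOT NULL, -- primary key of the affected row
    timestamp_type      TINYINT      NOT NULL  -- 1 = insert, 2 = update, 3 = delete
);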
I would like to know what you think about this design. I'm most interested in best practices and what works and scales over time. References, links, and blog entries are more than welcome. I know of at least one patent (pending) that tries to address this problem, but it seems details are not public at this time.
Cheers,
Eduardo
While you're at it, also record the user who made the change.
The flaw with the separate-table design (in addition to the join performance highlighted by others) is that it makes the assumption that every table has an identity column for the key. That's not always true.
If you use SQL Server, the new 2008 version supports something they call Change Data Capture that should take away a lot of the pain you're talking about. I think Oracle may have something similar as well.
Update: Apparently Oracle calls it the same thing as SQL Server. Or rather, SQL Server calls it the same thing as Oracle, since Oracle's implementation came first ;)
http://www.oracle.com/technology/oramag/oracle/03-nov/o63tech_bi.html
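In SQL Server, enabling Change Data Capture is roughly the following sketch (the database must be CDC-enabled before any table; PAYMENT is the hypothetical table from the question):

EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'PAYMENT',
    @role_name     = NULL;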
I have used a design where each table to be audited had two tables:
create table NAME (
    name_id    int,
    first_name varchar(50),
    last_name  varchar(50)
    -- any other table/column constraints
)
create table NAME_AUDIT (
    name_audit_id int,
    name_id       int,
    first_name    varchar(50),
    last_name     varchar(50),
    update_type   char(1), -- 'U', 'D', 'C'
    update_date   datetime
    -- no table constraints really, outside of name_audit_id as PK
)
A database trigger is created that populates NAME_AUDIT every time anything is done to NAME. This way you have a record of every single change made to the table, and when. The application has no real knowledge of this, since it is maintained by a database trigger.
It works reasonably well and doesn't require any changes to application code to implement.
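A minimal sketch of the update trigger (assuming the tables above, with name_audit_id as an identity column; the insert and delete triggers would follow the same pattern with 'C' and 'D'):

CREATE TRIGGER trg_NAME_audit_update
ON NAME
AFTER UPDATE
AS
BEGIN
    -- Copy the new state of every changed row into the audit table
    INSERT INTO NAME_AUDIT (name_id, first_name, last_name, update_type, update_date)
    SELECT name_id, first_name, last_name, 'U', GETDATE()
    FROM inserted;
END;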
I think I prefer adding the timestamps to the individual tables. Joining your timestamp table on a composite key (one column of which is a string) is going to be slower, and if you have a large amount of data it will eventually be a real problem.
Also, a lot of the time when you are looking at timestamps, it's when you're debugging a problem in your application and you'll want the data right there, rather than always having to join against the other table.
One nightmare with your design is that every single insert, update or delete would have to hit that table. This can cause major performance and locking issues. It is a bad idea to generalize a table like that (not just for timestamps). It would also be a nightmare to get the data out of.
If your code would break at the GUI level from adding fields you don't want the user to see, you are writing the GUI code incorrectly; it should specify only the minimum number of columns you need and never SELECT *.
The advantage of the method you suggest is that it gives you the option of adding other fields to your TIMESTAMP table, like tracking the user who made the change. You can also track edits to sensitive fields, for example who repriced this contract?
Logging record changes in a separate file means you can show multiple changes to a record, like:
mm/dd/yy hh:mm:ss Added by XXX
mm/dd/yy hh:mm:ss Field PRICE changed by XXX
mm/dd/yy hh:mm:ss Record deleted by XXX
One disadvantage is the extra code that will perform inserts into your TIMESTAMPS table to reflect changes in your main tables.
If you set up the time-stamp stuff to run off of triggers, then any action that can set off a trigger (reads?) can be logged. Also there might be some locking advantages.
(Take all that with a grain of salt, I'm no DBA or SQL guru)
Yes, I like that design, and use it with some systems. Usually, some variant of:
LogID int
Action varchar(1) -- ADDED (A)/UPDATED (U)/DELETED (D)
UserID varchar(20) -- UserID of culprit :)
Timestamp datetime -- Date/Time
TableName varchar(50) -- Table Name or Stored Procedure ran
UniqueID int -- Unique ID of record acted upon
Notes varchar(1000) -- Other notes Stored Procedure or Application may provide
I think the extra joins you will have to perform to get the timestamps will be a slight performance hit and a pain in the neck. Other than that, I see no problem.
We did exactly what you did. It is great for the object model and the ability to add new stamps and different types of stamps to our model with minimal code. We were also tracking the user that made the change, and a lot of our logic was heavily based on these stamps. It worked very well.
One drawback is reporting, and/or showing a lot of different stamps on one screen. If you are doing it the way we did it, it causes a lot of joins. Also, back-end changes were a pain.
Our solution is to maintain a "Transaction" table, in addition to our "Session" table. UPDATE, INSERT and DELETE instructions are all managed through a "Transaction" object, and each of these SQL instructions is stored in the "Transaction" table once it has been successfully executed on the database. This "Transaction" table has other fields such as transactionType (I for INSERT, D for DELETE, U for UPDATE), transactionDateTime, etc., and a foreign key "sessionId", telling us finally who sent the instruction. It is even possible, through some code, to identify who did what and when (Gus created the record on Monday, Tim changed the Unit Price on Tuesday, Liz added an extra discount on Thursday, etc.).
Pros for this solution are:
- you're able to tell "what, who and when", and to show it to your users! (you'll need some code to analyse the SQL statements)
- if your data is replicated, and replication fails, you can rebuild your database through this table
Cons are:
- 100 000 data updates per month mean 100 000 records in Tbl_Transaction
- finally, this table tends to be 99% of your database volume
Our choice: all records older than 90 days are automatically deleted every morning.
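A sketch of the Transaction table described (names beyond those mentioned above, and all column types, are assumptions):

CREATE TABLE Tbl_Transaction (
    transactionId       INT IDENTITY PRIMARY KEY,
    sessionId           INT NOT NULL,            -- FK to the Session table: who sent the instruction
    transactionType     CHAR(1) NOT NULL,        -- I = INSERT, U = UPDATE, D = DELETE
    transactionDateTime DATETIME NOT NULL,
    sqlStatement        NVARCHAR(MAX) NOT NULL   -- the instruction as executed
);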
Philippe,
Don't simply delete those older than 90 days; move them first to a separate DB or write them to a text file. Do something to preserve them, just move them out of the main production DB.
If it ever comes down to it, most often it is a case of "he with the most documentation wins"!
