I have a database schema with tables containing long fields (of type "text" in both MS SQL Server and Sybase), and I need to retrieve distinct rows.
The tables look like
create table node (id int primary key, … a few more fields … data text);
create table ref (id int primary key, node_id int, … a few more fields);
For one row in "node", there may be zero or more rows in "ref".
Now I have a query like
SELECT node.* FROM node, ref WHERE node.id = ref.node_id AND ... some more restrictions.
This query returns duplicate and triplicate rows when there is more than a single row in "ref" for some "node_id".
But I need unique rows!
Using SELECT DISTINCT node.* does not work because of the columns of type "text" :-(
In Sybase there is a trick: just add "GROUP BY node.id" to the query, and voila! You get unique rows returned.
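For illustration, the Sybase form of the query described above would be:
SELECT node.*
FROM node, ref
WHERE node.id = ref.node_id AND ... some more restrictions
GROUP BY node.id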
Is there a similarly simple trick for MS SQL Server?
I already have a solution with temporary tables, but it seems to be a lot slower; maybe the reason is just the larger number of statements transferred to the database?
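For reference, a sketch of what such a temporary-table workaround might look like (the post gives no details, so the exact statements are an assumption):
SELECT DISTINCT node.id AS node_id
INTO #unique_ids
FROM node, ref
WHERE node.id = ref.node_id AND ... some more restrictions

SELECT node.*
FROM node
JOIN #unique_ids u ON node.id = u.node_id

DROP TABLE #unique_ids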
It looks like you are approaching this problem from the wrong direction. Joins are typically used to expand on keys where relevant data is stored in different tables. So it's no surprise you are getting more than one row per node_id.
In your query, you join the two tables together, but then you ignore everything from ref. It looks like you're just trying to filter out ids from node that are not referenced in ref. If that is the case, then you don't want to use a join. The following will work much better:
select *
from node
where id in (
    select node_id
    from ref
    where [any restrictions placed on the ref table go here]
)
and [any restrictions placed on the node table go here]
Furthermore, at the risk of teaching you bad join practices, the same thing can be accomplished the way you were originally trying to do it, but it's more painful to write and it's not good practice:
select node.col1, node.col2, ..., node.last_col
from node
inner join ref on node.id = ref.node_id
where [some restrictions]
group by node.col1, node.col2, ..., node.last_col
For sync purposes, I am trying to get a subset of the existing objects in a table.
The table has two fields, [Group] and Member, which are both stringified Guids.
All rows together may be too large to fit into a DataTable; I already encountered an OutOfMemory exception. But I have to check that everything I need right now is in the DataTable. So I take the Guids I want to check (they come in chunks of 1000) and query only for the related objects.
So, instead of filling my datatable once with all
SELECT * FROM Group_Membership
I am running the following SQL query against my SQL database to get related objects for one thousand Guids at a time:
SELECT *
FROM Group_Membership
WHERE
[Group] IN (@Guid0, @Guid1, @Guid2, @Guid3, @Guid4, @Guid5, ..., @Guid999)
The table in question now contains a total of 142 entries, and the query already times out (CommandTimeout = 30 seconds). On other tables, which are not as sparsely populated, similar queries don't time out.
Could someone shed some light on the logic of SQL Server here, and on whether/how I could hint it in the right direction?
I already tried to add a nonclustered index on the column Group, but it didn't help.
I'm not sure that WHERE IN will be able to make full use of an index on [Group], if at all. However, if you had a second table containing the GUID values, and if that column had an index, then a join might perform very fast.
Create a temporary table for the GUIDs and populate it:
CREATE TABLE #Guids (
    Guid varchar(255)
)

INSERT INTO #Guids (Guid)
VALUES (@Guid0), (@Guid1), (@Guid2), (@Guid3), (@Guid4), ...

CREATE INDEX Idx_Guid ON #Guids (Guid);
Now try rephrasing your current query using a join instead of a WHERE IN (...):
SELECT *
FROM Group_Membership t1
INNER JOIN #Guids t2
ON t1.[Group] = t2.Guid;
As a disclaimer, if this doesn't improve the performance, it could be because your table has low cardinality. In such a case, an index might not be very effective.
I'm using SQL Server 2014. My request, I believe, is rather simple. I have one table containing a field holding a date value that is stored as VARCHAR, and another table containing a field holding a date value that is stored as INT.
The date value in the VARCHAR field is stored like this: 2015M01
The date value in the INT field is stored like this: 201501
I need to compare these tables against each other using EXCEPT. My thought process was to somehow extract or TRIM the "M" out of the VARCHAR value and see if it would let me compare the two. If anyone has a better idea such as using CAST to change the date formats or something feel free to suggest that as well.
I am also concerned that even extracting the "M" out of the VARCHAR may still prevent the comparison, since one value will still be VARCHAR while the other is INT. If it's possible to convert on the fly in a T-SQL query, that would be great advice as well. :)
REPLACE the string and then CONVERT to integer:
SELECT A.*, B.*
FROM TableA A
INNER JOIN (
    SELECT intField
    FROM TableB
) AS B
    ON CONVERT(INT, REPLACE(A.varcharField, 'M', '')) = B.intField
Since you say you already have the query and are using EXCEPT, you can simply change the definition of that one "date" field in the query containing the VARCHAR value so that it matches the INT format of the other query. For example:
SELECT Field1, CONVERT(INT, REPLACE(VarcharDateField, 'M', '')) AS [DateField], Field3
FROM TableA
EXCEPT
SELECT Field1, IntDateField, Field3
FROM TableB
HOWEVER, while I realize that this might not be feasible, your best option, if you can make this happen, would be to change how the data in the table with the VARCHAR field is stored so that it is actually an INT in the same format as the table with the data already stored as an INT. Then you wouldn't have to worry about situations like this one.
Meaning:
Add an INT field to the table with the VARCHAR field.
Do an UPDATE of that table, setting the INT field to the string value with the M removed (see the sketch after this list).
Update any INSERT and/or UPDATE stored procedures used by external services (app, ETL, etc) to do that same M removal logic on the way in. Then you don't have to change any app code that does INSERTs and UPDATEs. You don't even need to tell anyone you did this.
Update any "get" / SELECT stored procedures used by external services (app, ETL, etc) to do the opposite logic: convert the INT to VARCHAR and add the M on the way out. Then you don't have to change any app code that gets data from the DB. You don't even need to tell anyone you did this.
This is one of many reasons that having a Stored Procedure API to your DB is quite handy. I suppose an ORM can just be rebuilt, but you still need to recompile, even if all of the code references are automatically updated. But making a datatype change (or even moving a field to a different table, or even replacing a field with a simple CASE statement) "behind the scenes", and masking it so that any code outside of your control doesn't know that a change happened, is not nearly as difficult as most people might think.
I have done all of these operations (datatype change, move a field to a different table, replace a field with simple logic, etc.) and it buys you a lot of time until the app code can be updated. That might be another team who handles that. Maybe their schedule won't allow for making any changes in that area (plus testing) for 3 months. OK. It will be there waiting for them when they are ready. And if there are several areas to update, they can be done one at a time. You can even create new stored procedures, running in parallel, for any updated app code to use with the proper INT datatype as the input parameter. And once all references to the VARCHAR value are gone, delete the original versions of those stored procedures.
If you want everything in the first table that is not in the second, you might consider something like this:
select t1.*
from t1
where not exists (select 1
from t2
where cast(replace(t1.varcharfield, 'M', '') as int) = t2.intfield
);
This should be close enough to EXCEPT for your purposes.
I should add that you might need to include other columns in the WHERE clause. However, the question only mentions one column, so I don't know what those are.
You could create a persisted view on the table with the char column, with a computed column where the M is removed. Then you could JOIN the view to the table containing the INT column.
CREATE VIEW dbo.PersistedView
WITH SCHEMABINDING
AS
SELECT ConvertedDateCol = CONVERT(INT, REPLACE(VarcharCol, 'M', ''))
    --, other columns including the PK, etc
FROM dbo.TablewithCharColumn;

CREATE UNIQUE CLUSTERED INDEX IX_PersistedView
    ON dbo.PersistedView (<the PK column>);
SELECT *
FROM dbo.PersistedView pv
INNER JOIN dbo.TableWithIntColumn ic ON pv.ConvertedDateCol = ic.IntDateCol;
If you provide the actual details of both tables, I will edit my answer to make it clearer.
A persisted view with a computed column will perform far better for the SELECT statement that joins the two columns, compared with doing the CONVERT and REPLACE every time you run the SELECT statement.
However, a persisted view will slightly slow down inserts into the underlying table(s), and the schema binding will prevent you from making DDL changes to the underlying tables.
If you're looking to not persist the values via a schema-bound view, you could create a non-persisted computed column on the table itself, then create a non-clustered index on that column. If you are using the computed column in WHERE or JOIN clauses, you may see some benefit.
By way of example:
CREATE TABLE dbo.PCT
(
    PCT_ID INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_PCT
        PRIMARY KEY CLUSTERED
    , SomeChar VARCHAR(50) NOT NULL
    , SomeCharToInt AS CONVERT(INT, REPLACE(SomeChar, 'M', ''))
);
CREATE INDEX IX_PCT_SomeCharToInt
ON dbo.PCT(SomeCharToInt);
INSERT INTO dbo.PCT(SomeChar)
VALUES ('2015M08');
SELECT SomeCharToInt
FROM dbo.PCT;
Results:
201508
How can I transfer rows from two tables (Patient and ContactDetails) from DB1 to DB2?
Both DBs already have these 2 tables with data; I just want to add the data from these two tables in DB1 to DB2.
I tried following that, but it didn't work, because there are some rows with the same keys, and overwriting is forbidden.
Is there another way to do it? Or am I missing something?
The relationship between patient and contactdetails is:
patient inner join contactdetails
on (foreign_key) patient.contactdetailsid = (primary_key) contactdetails.id
Loop on the source contactdetails table, inserting each row one at a time and saving in a temp table the old contactdetails id and the matching new contactdetails id (here is an example of a SQL loop).
The temp table should be something like:
create table #temptableforcopy (
    oldcontactdetailsid [insertheretherightdatatype],
    newcontactdetailsid [insertheretherightdatatype]
)
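A minimal sketch of that loop, assuming contactdetails has an IDENTITY primary key and a single data column field1 (both assumptions; the post gives no DDL):
declare @oldid int
declare @field1 varchar(100) -- placeholder for the real columns

declare contact_cursor cursor local fast_forward for
    select id, field1 from olddb.oldschema.contactdetails

open contact_cursor
fetch next from contact_cursor into @oldid, @field1

while @@fetch_status = 0
begin
    insert into newdb.newschema.contactdetails (field1)
    values (@field1)

    -- remember which new id matches which old id
    insert into #temptableforcopy (oldcontactdetailsid, newcontactdetailsid)
    values (@oldid, scope_identity())

    fetch next from contact_cursor into @oldid, @field1
end

close contact_cursor
deallocate contact_cursor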
Then copy the data from the patient table joined to the temp table from the previous step, like this:
insert into newdb.newschema.patient (contactdetailsid, field1, field2, ...)
select TT.newcontactdetailsid,
old.field1,
old.field2,
...
from olddb.oldschema.patient old
join #temptableforcopy TT on TT.oldcontactdetailsid = old.contactdetailsid
Please note that my proposal is just a wild guess: you gave no information about structure, keys, or constraints, no detail about which key is preventing the copy or with which error message, the solution you already discarded, or the amount of data you have to deal with...
I have a database schema like this:
[Patients] [Referrals]
| |
[PatientInsuranceCarriers] [ReferralInsuranceCarriers]
\ /
[InsuranceCarriers]
PatientInsuranceCarriers and ReferralInsuranceCarriers are identical, except for the fact that they reference either Patients or Referrals. I would like to merge those two tables, so that it looks like this:
[Patients] [Referrals]
\ /
[PatientInsuranceCarriers]
|
[InsuranceCarriers]
I have two options here:
either create two new columns - ID_PatientOrReferral + IsPatient (which will tell me which table to reference),
or create two different columns - ID_Patient and ID_Referral, both nullable.
Generally, I try to avoid nullable columns, because I consider them a bad practice (meaning, if you can live w/o nulls, then you don't really need a nullable column) and they are more difficult to work with in code (e.g., LINQ to SQL).
However, I am not sure the first option would be a good idea. I saw that it is possible to create two FKs on ID_PatientOrReferral (one for Patients and one for Referrals), though I can't set any update/delete behavior there for obvious reasons, and I don't know if the constraint check on insert works that way either, so it looks like the FKs would be there only to mark that the relationships exist. Alternatively, I could create no foreign keys at all and instead add the relationships in the DBML manually.
Is any of the approaches better and why?
To expand on my somewhat terse comment:
I would like to merge those two tables
I believe this would be a bad idea. At the moment you have two tables with good clear relation predicates (briefly, what it means for there to exist a record in the table) - and crucially, these relation predicates are different for the two tables:
A record exists in PatientInsuranceCarriers <=> that Patient is associated with that Insurance Carrier
A record exists in ReferralInsuranceCarriers <=> that Referral is associated with that Insurance Carrier
Sure, they are similar, but they are not the same. Consider now what would be the relation predicate of a combined table:
A record exists in ReferralAndPatientInsuranceCarriers <=> {(IsPatient is true and the Patient with ID ID_PatientOrReferral) or alternatively (IsPatient is false and the Referral with ID ID_PatientOrReferral)} is associated with that Insurance Carrier
or if you do it with NULLs
A record exists in ReferralAndPatientInsuranceCarriers <=> {(ID_Patient is not NULL and the Patient with ID ID_Patient) or alternatively (ID_Referral is not NULL and the Referral with ID ID_Referral)} is associated with that Insurance Carrier
Now, I'm not one to automatically suggest that more complicated relation predicates are necessarily worse; but I'm fairly sure that either of the two above are worse than those they would replace.
To address your concerns:
we now have two LINQ to SQL entities, separate controllers and views for each
In general I would agree with reducing duplication; however, only duplication of the same things! Here, is it not the case that all the above are essentially 'boilerplate', and their construction and maintenance can be delegated to suitable development tools?
and have to merge them when preparing data for reports
If you were to create a VIEW, containing a UNION, for reporting purposes, you would keep the simplicity of the actual data and still have the ability to report on a combined list; eg (making assumptions about column names etc):
CREATE VIEW InterestingInsuranceCarriers
AS
SELECT
IC.Name InsuranceCarrierName
, P.Name CounterpartyName
, 'Patient' CounterpartyType
FROM InsuranceCarriers IC
INNER JOIN PatientInsuranceCarriers PIC ON IC.ID = PIC.InsuranceCarrierID
INNER JOIN Patient P ON PIC.PatientId = P.ID
UNION
SELECT
IC.Name InsuranceCarrierName
, R.Name CounterpartyName
, 'Referral' CounterpartyType
FROM InsuranceCarriers IC
INNER JOIN ReferralInsuranceCarriers RIC ON IC.ID = RIC.InsuranceCarrierID
INNER JOIN Referral R ON RIC.ReferralId = R.ID
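With the view in place, reports can query the combined list as a single source, e.g.:
SELECT InsuranceCarrierName, CounterpartyName, CounterpartyType
FROM InterestingInsuranceCarriers
ORDER BY InsuranceCarrierName;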
Copying my answer from this question
If you really need A_or_B_ID in TableZ, you have two similar options:
1) Add nullable A_ID and B_ID columns to table z, make A_or_B_ID a computed column using ISNULL on these two columns, and add a CHECK constraint such that only one of A_ID or B_ID is not null
2) Add a TableName column to table z, constrained to contain either A or B. Now create A_ID and B_ID as computed columns, which are only non-null when their appropriate table is named (using a CASE expression). Make them persisted too.
In both cases, you now have A_ID and B_ID columns which can have appropriate foreign keys to the base tables. The difference is in which columns are computed. Also, you don't need TableName in option 2 above if the domains of the 2 ID columns don't overlap, so long as your CASE expression can determine which domain A_or_B_ID falls into.
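A sketch of option 1, with hypothetical tables A, B, and Z (all names and types are assumptions):
CREATE TABLE dbo.Z
(
    Z_ID INT NOT NULL PRIMARY KEY
    , A_ID INT NULL REFERENCES dbo.A (A_ID)    -- FK to base table A
    , B_ID INT NULL REFERENCES dbo.B (B_ID)    -- FK to base table B
    , A_or_B_ID AS ISNULL(A_ID, B_ID)          -- computed: whichever id is present
    , CONSTRAINT CK_Z_ExactlyOne CHECK (
          (A_ID IS NOT NULL AND B_ID IS NULL)
          OR (A_ID IS NULL AND B_ID IS NOT NULL)
      )
);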
Sorry if the title is poorly descriptive, but I can't do better right now =(
So, I have this master-detail schema, with the detail being a tree structure (a one-to-many self relation) with n levels (on SQL Server 2005).
I need to copy a detail structure from one master to another using a stored procedure, by passing the source master id and the target master id as parameters (the target is new, so it doesn't have details).
I'm having trouble, and I'm asking for your kind help in finding a way to keep track of parent ids and insert the children without using cursors or nasty things like that...
This is a sample model, of course, and what I'm trying to do is to copy the detail structure from one master to other. In fact, I'm creating a new master using an existing one as template.
If I understand the problem, this might be what you want:
INSERT dbo.Master VALUES (@NewMaster_ID, @NewDescription)

INSERT dbo.Detail (parent_id, master_id, [name])
SELECT detail_ID, @NewMaster_ID, [name]
FROM dbo.Detail
WHERE master_id = @OldMaster_ID

UPDATE NewChild
SET parent_id = NewParent.detail_id
FROM dbo.Detail NewChild
JOIN dbo.Detail OldChild
    ON NewChild.parent_id = OldChild.detail_id
JOIN dbo.Detail NewParent
    ON NewParent.parent_id = OldChild.parent_ID
WHERE NewChild.master_id = @NewMaster_ID
    AND NewParent.master_id = @NewMaster_ID
    AND OldChild.master_id = @OldMaster_ID
The trick is to use the old detail_id as the new parent_id in the initial insert. Then join back to the old set of rows using this relationship, and update the new parent_id values.
I assumed that detail_id is an IDENTITY value. If you assign them yourself, you'll need to provide details, but there's a similar solution.
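For reference, the table shapes this answer assumes (hypothetical, since the question provides no DDL):
CREATE TABLE dbo.Master
(
    master_id INT NOT NULL PRIMARY KEY
    , [description] VARCHAR(100) NOT NULL
);

CREATE TABLE dbo.Detail
(
    detail_id INT IDENTITY(1,1) NOT NULL PRIMARY KEY
    , parent_id INT NULL    -- self-reference; NULL for top-level details
    , master_id INT NOT NULL REFERENCES dbo.Master (master_id)
    , [name] VARCHAR(100) NOT NULL
);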
You'll have to provide CREATE TABLE and INSERT INTO statements for a little sample data,
and the expected results based on that sample data.