Retrieve data from multiple tables and store it in one table without duplicates - sql-server

I'm working on some critical data-retrieval report tasks and am finding it difficult to proceed. The data belongs to the medical area, it is distributed across several tables, and I can't change the database table design. In order to finish my report, I need the following steps:
1- Divide the whole report into several parts and retrieve the data for each part using several joins. For example, part A can be retrieved by:
select a1.field1, a2.field2 from a1 left join a2 on a1.fieldA = a2.fieldA
Then I have all the data for part A.
2- The same thing happens for part B:
select b1.field1, b2.field2 from b1 left join b2 on b1.fieldB = b2.fieldB
Then I also get all the data for part B.
3- The same goes for part C, part D... and so on.
The reason I divide them is that each part already needs more than 8 joins (medical data is always complex), so I can't do all of it in a single query (more than 50 joins, which is impossible to finish...).
After that, I run my Spring Batch program to insert all the data from part A, part B, part C, and so on into one table as my final report table. The problem is that not every part returns the same number of rows: part A may return 10 rows while part B returns 20. Since the time condition for each part is the same (1 day) and can't be changed, I'm wondering how I can store all these different parts of data in one table with minimum overhead. I don't want too many duplicates. Thanks for the great help.
Lei

Looks to me like what you need are joins over the "data from part A", "data from part B" & "data from part C". Let's call them da, db & dc.
It's perfectly alright that the number of rows in da/db/dc differs. But since you're trying to put them all in a single table at the end, there is obviously some relation between them. Without a better description of that relation it's not possible to provide a more concrete answer, so I'll just write down my thoughts, which you might already know, but anyway...
The simplest way is to join the results of your 3 [inner] queries in a higher-level [outer] query:
select j.x, j.y, j.z
from (
    -- da join db join dc
) j;
If this is not possible (due to way too many joins as you said) then try one of these:
Create 3 separate materialized views (one each for da, db & dc) and perform the join on these views. Materialized is optional (i.e. you can use "normal" views too), but it should improve performance greatly if available in your DB (in SQL Server the closest equivalent is an indexed view).
First run the queries for da/db/dc, fetch the data, and put it into intermediate tables. Then run a join on those tables (a rough sketch of this approach follows).
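A minimal sketch of the intermediate-table approach, assuming two hypothetical part queries and a shared key field that relates them (all table and column names here are placeholders, not taken from the question):
-- Stage each part into its own intermediate (temp) table.
select a1.field1, a2.field2, a1.fieldA as keyField
into #partA
from a1 left join a2 on a1.fieldA = a2.fieldA;

select b1.field1, b2.field2, b1.fieldB as keyField
into #partB
from b1 left join b2 on b1.fieldB = b2.fieldB;

-- Join the staged parts on whatever relates them (assumed here to be keyField)
-- and insert the combined rows into the final report table in one pass.
insert into FinalReport (a_field1, a_field2, b_field1, b_field2)
select da.field1, da.field2, db.field1, db.field2
from #partA da
left join #partB db on da.keyField = db.keyField;
The joining column and the shape of FinalReport depend entirely on the relationship between the parts, which is exactly the missing piece of the question.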
PS: If you want to run reports (many/frequent/large) on some data, then that data should be designed appropriately, or you'll run into a heap of trouble in the future.
If you want something more concrete, please post the relationship between da/b/c.

Related

SQL Server query taking a long time to execute

I have the query below. It currently takes 5.5 hours to execute and returns 5,838 rows. The whole thing runs in less than 2 minutes if I remove the timestamp limiter and just run it against all the results of the views.
I am looking for a way to make this execute faster when trying to limit the time/date range and am open to any suggestions. I can go into more detail about the datasets if needed.
SELECT
m.rpp
,m.SourceName
,m.ckt AS l_ckt
,m.amp AS l_amp
,MAX(m.reading) AS l_reading
,k.ckt AS r_ckt
,k.amp AS r_amp
,MAX(k.reading) AS r_reading
FROM
vRPPPanelLeft m
INNER JOIN
vRPPPanelRight k ON (m.sourcename = k.sourcename)
AND (CAST(m.ckt AS int) + 1 = CAST(k.ckt AS int))
WHERE
m.timestampserverlocal BETWEEN '2020-10-10' AND '2020-10-12'
AND k.timestampserverlocal BETWEEN '2020-10-10' AND '2020-10-11'
GROUP BY
m.rpp, m.sourcename, m.ckt, m.amp, k.ckt, k.amp
I'm assuming m and k are views, and potentially rather complex ones and/or ones containing GROUP BY statements.
I'm guessing that when you look at the execution plans and/or statistics, you will find it is actually running at least one of the views many times - e.g., once for each row in the other view.
Quick fix
If you are able to (e.g., it's in a stored procedure), I suggest the following process (a rough sketch follows the steps):
Create temporary tables with the same structure as m and k (or with only the columns relevant for this purpose). Include columns for the modified versions of ckt (e.g., ints for both). Consider a PK (or at least a clustered index) of sourcename and ckt (the int) for both temporary tables - it may help, or it may not.
Run the m and k views independently, storing the results in these temporary tables. Use filtering (e.g., the WHERE clause on timestampserverlocal) and potentially the relevant GROUP BY clauses to reduce the number of rows created.
Run the original SQL against the temporary tables rather than the views, without the need for the WHERE clause. It may still need the GROUP BY.
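A rough sketch of that process, using the columns shown in the question (the column data types below are guesses, so treat this as an outline rather than a drop-in implementation):
-- Stage the filtered output of each view once, so each view runs a single time.
CREATE TABLE #left  (rpp varchar(50), sourcename varchar(50), ckt_int int, amp decimal(10,2), reading decimal(10,2));
CREATE TABLE #right (sourcename varchar(50), ckt_int int, amp decimal(10,2), reading decimal(10,2));

INSERT INTO #left (rpp, sourcename, ckt_int, amp, reading)
SELECT rpp, sourcename, CAST(ckt AS int), amp, reading
FROM vRPPPanelLeft
WHERE timestampserverlocal BETWEEN '2020-10-10' AND '2020-10-12';

INSERT INTO #right (sourcename, ckt_int, amp, reading)
SELECT sourcename, CAST(ckt AS int), amp, reading
FROM vRPPPanelRight
WHERE timestampserverlocal BETWEEN '2020-10-10' AND '2020-10-11';

-- Join the small staged sets instead of the views.
SELECT m.rpp, m.sourcename, m.ckt_int AS l_ckt, m.amp AS l_amp, MAX(m.reading) AS l_reading,
       k.ckt_int AS r_ckt, k.amp AS r_amp, MAX(k.reading) AS r_reading
FROM #left m
INNER JOIN #right k
        ON m.sourcename = k.sourcename
       AND m.ckt_int + 1 = k.ckt_int
GROUP BY m.rpp, m.sourcename, m.ckt_int, m.amp, k.ckt_int, k.amp;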
Longer fix
I suggest doing the quick fix first to confirm that running the views many times really is the problem.
If it is, one longer-term fix is to stop using the views in the FROM clause, put it all together in one query, and then try to simplify the code.

How do I return a complete set of records if any one of them has been updated?

I am making a script that gets the incremental updates of data from a DB.
I need some help getting the whole set of data contained in two joined tables whenever at least one of the records in the second table has been updated.
So far the only thing I could come up with is to get the Ids from the first table that were in any way involved in the change and then use those Ids as a filter in a more general query. It works, but it's not pretty and I feel there might be a better-optimised solution.
Here's a link to an example:
https://rextester.com/TLCIA97116
(could not get SQL Fiddle to open).
The above example shows two SELECT queries. The second one is my solution and its result is what I'm looking for. In other words - "give me the records from table A and their related (joined) records from table B, whenever one of the records in table B is added/changed".
The first query returns the set that contains only the updated row, but I need all of them to recalculate some values later, and those calculations depend on all records being there.
The best and easiest way to handle this is with triggers. Add a trigger on the second table which will push changes to / call updates on the first table.
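As a rough illustration of that idea (not from the original answer), a trigger on the second table could stamp the parent rows so the extract can pick up the whole related set later. It assumes the Placements/PlacementCommissions tables from the question and a hypothetical LastTouched column on Placements:
CREATE TRIGGER trg_PlacementCommissions_Touch
ON PlacementCommissions
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Mark every parent placement whose commissions changed.
    -- LastTouched is a hypothetical audit column, not part of the original schema.
    UPDATE p
    SET p.LastTouched = SYSUTCDATETIME()
    FROM Placements p
    JOIN inserted i ON i.PlacementId = p.PlacementId;
END;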
After consulting with a couple of colleagues, it looks like my solution is the one we have to go with (mostly because of the lack of better ones).
The query is as follows:
select *
from Placements as p
join PlacementCommissions as pc on p.PlacementId = pc.PlacementId
where p.PlacementId in (
    select distinct p2.PlacementId
    from Placements as p2
    join PlacementCommissions as pc2 on p2.PlacementId = pc2.PlacementId
    where (pc2.TimeStamp >= @lastSuccessfullETLRun
           and pc2.TimeStamp < @startTimeBullhorn)
       or (p2.TimeStamp >= @lastSuccessfullETLRun
           and p2.TimeStamp < @startTimeBullhorn)
)

How do I compare SQL Server databases and update another?

I am currently dealing with a scenario in which I have 2 SQL databases and I need to compare the data in each of the tables in the 2 databases. The databases have exactly the same tables, but may have differing data in each of them. The first hurdle is how to compare the data sets, and the second challenge is, once I have identified the differences, how to (for example) update data that is missing in database 2 but exists in database 1.
Why not use SQL Data Compare instead of re-inventing the wheel? It does exactly what you're asking - it compares two databases and writes scripts that will sync them in either direction. I work for a vendor that competes with some of their tools, and it's still the one I would recommend.
http://www.red-gate.com/products/sql-development/sql-data-compare/
One powerful operator for comparing data is EXCEPT. With it, you can compare two tables with the same structure simply by doing the following:
SELECT * FROM Database1.dbo.Table
EXCEPT
SELECT * FROM Database2.dbo.Table
This will give you all the rows that exist in Database1 but not in Database2, including rows that exist in both but are different (because it compares every column). Then you can reverse the order of the queries to check the other direction.
Once you have identified the differences, you can use INSERT or UPDATE to transfer the changes from one side to the other. For example, assuming you have a primary key field PK, and new rows only come into Database2, you might do something like:
INSERT INTO Database1.dbo.Table
SELECT T2.*
FROM Database2.dbo.Table T2
LEFT JOIN Database1.dbo.Table T1 on T2.PK = T1.PK
WHERE T1.PK IS NULL -- insert rows that didn't match, i.e., are new
The actual methods used depend on how the two tables are related together, how you can identify matching rows, and what the sources of changes might be.
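For rows that exist on both sides but differ, one common pattern (again just a sketch, assuming the same hypothetical PK plus a couple of illustrative data columns Col1 and Col2) is to update the matching row wherever a per-row EXCEPT finds a difference:
UPDATE T1
SET T1.Col1 = T2.Col1,
    T1.Col2 = T2.Col2
FROM Database1.dbo.Table T1
JOIN Database2.dbo.Table T2 ON T1.PK = T2.PK
WHERE EXISTS (SELECT T2.Col1, T2.Col2
              EXCEPT
              SELECT T1.Col1, T1.Col2); -- true only when the rows differ (NULL-safe)
The EXCEPT inside the EXISTS compares the columns in a NULL-safe way, which plain <> comparisons would not.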
You can also look at the Data Compare feature in Visual Studio 2010 (Premium and higher). I use it to make sure the configuration tables in all my environments (i.e. development, test and production) are in sync. It has made my life enormously easier.
You can select the tables you want to compare and choose the columns to compare. What I haven't learned yet, though, is how to save my selections for future use.
You can do this with SQL Compare, which is a great tool for development, but if you want to do this on a scheduled basis a better solution might be Simego's Data Sync Studio. I know it can do a compare of about 100m rows (30 columns wide) with 16GB on an i3 iMac (Boot Camp). In reality it is comfortable with 1m-20m rows on each side. It uses a column-storage engine.
In this scenario it would only take a couple of minutes to download, install and test it.
I hope this helps as I always look for the question mark to work out what someone is asking.

SQL query gets too slow

A few days ago I wrote a query that executed quickly, but nowadays it takes 1 hour.
This query runs on my SQL 7 server and takes about 10 seconds.
The same query exists on another SQL 7 server and until last week it also took about 10 seconds.
The configuration of both servers is the same; only the hardware is different.
Now, on the second server, this query takes about 30 minutes to extract the same details, and nobody has changed anything.
If I execute this query without the WHERE clause, it shows me the details in 7 seconds; with the WHERE clause it takes far longer, so the WHERE clause seems to be the problem.
Without seeing the query, and probably the data, I can't do a lot other than offer tips.
Can you put more constraints on the query? If you can reduce the amount of data involved, this will speed up the query.
Look at the columns used in your joins, WHERE and HAVING clauses, and ORDER BY. Check that the tables those columns belong to have indexes on those columns.
Do you need to use the user-defined function, or can it be done another way?
Are you using subqueries? If so, can they be pulled out into separate views?
Hope this helps.
Without knowing how much data is going into your tables, and without knowing your schema, it's hard to give a definitive answer, but things to look at:
Try running UPDATE STATISTICS or DBCC DBREINDEX.
Do you have any indexes on the tables? If not, try adding indexes to columns used in WHERE clauses and JOIN predicates.
Avoid cross-table OR clauses (i.e., where you do WHERE table1.col1 = @somevalue OR table2.col2 = @someothervalue). SQL Server can't use indexes effectively with this construct and you may get better performance by splitting the query into two and UNIONing the results (a sketch follows these tips).
What do your functions (UDFs) do and how are you using them? It's worth noting that dropping them into the column list of a query gets expensive, as the function is executed once per row returned: if a function does a SELECT against the database, you end up running n + 1 queries against the database (where n = number of rows returned by the main SELECT). Try to engineer the function out if possible.
Make sure your JOINs are correct -- where you're using a LEFT JOIN, revisit the logic and see whether it really needs to be a LEFT JOIN or whether it can be turned into an INNER JOIN. Sometimes people use LEFT JOINs, but when you examine the logic in the rest of the query it becomes apparent that the LEFT JOIN gives you nothing (because, for example, someone added a WHERE col IS NOT NULL predicate against the joined table). INNER JOINs can be faster, so it's worth reviewing all of these.
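To illustrate the cross-table OR point above, here is a sketch with made-up table and column names (the variables are assumed to be declared elsewhere) showing the OR query split into a UNION of two index-friendly queries:
-- Original form: the OR spanning two tables tends to force scans.
SELECT t1.id, t2.id
FROM table1 t1
JOIN table2 t2 ON t2.t1_id = t1.id
WHERE t1.col1 = @somevalue OR t2.col2 = @someothervalue;

-- Rewritten: each branch can use an index on its own column,
-- and UNION removes any duplicate rows produced by both branches.
SELECT t1.id, t2.id
FROM table1 t1
JOIN table2 t2 ON t2.t1_id = t1.id
WHERE t1.col1 = @somevalue
UNION
SELECT t1.id, t2.id
FROM table1 t1
JOIN table2 t2 ON t2.t1_id = t1.id
WHERE t2.col2 = @someothervalue;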
It would be a lot easier to suggest things if we could see the query.

TSQL "LIKE" or Regular Expressions?

I have a bunch (750K) of records in one table that I have to check for in another table. The second table has millions of records, and the data is something like this:
Source table
9999-A1B-1234X, with the middle part potentially being longer than three digits
Target table
DescriptionPhrase9999-A1B-1234X(9 pages) - yes, the parens and the words are in the field.
Currently I'm running a .NET app that loads the source records, then runs through them and searches with a LIKE (using a T-SQL function) to determine whether there are any matching records. If yes, the source table is updated with a positive. If not, the record is left alone.
The app processes about 1,000 records an hour. When I did this as a cursor-based sproc on SQL Server, I got pretty much the same speed.
Any ideas if regular expressions or any other methodology would make it go faster?
What about doing it all in the DB, rather than pulling records into your .NET app:
UPDATE s SET some_field = 1
FROM source_table s
WHERE EXISTS
(
    SELECT target_join_field FROM target_table t
    WHERE t.target_join_field LIKE '%' + s.source_join_field + '%'
)
This reduces the total number of queries from 750K individual updates down to 1 UPDATE.
First, I would redesign if at all possible. It is better to add a column that contains the correct value and join on it. If you still need the long value, you can use a trigger to extract the data into the new column at insert time.
If you have data you can match on, you need neither LIKE '%somestuff%' (which can't use indexes) nor a cursor, both of which are performance killers. This should be a set-based task if the tables have been designed properly. If the design is bad and can't be changed to a good one, I see no good way to get decent performance using T-SQL, and I would attempt the regular-expression route. Not knowing how many different phrases there are or the structure of each, I can't say whether the regular-expression route would be easy or even possible. But short of a redesign (which I strongly suggest you do), I don't see another possibility.
BTW, if you are working with tables that large, I would resolve never to write another cursor. They are extremely bad for performance, especially when you start talking about that number of records. Learn to think in sets, not in record-by-record processing.
One thing to be aware of with a single update (mbeckish's answer) is that the transaction log (which enables a rollback if the query is cancelled) will be huge. This will drastically slow down your query, so it is probably better to process the rows in blocks of 1,000 or so, as sketched below.
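A rough sketch of that batching, reusing the hypothetical column names from the earlier answer and assuming some_field starts out NULL; the 1,000-row batch size is just the figure mentioned above:
-- Process matching rows in small batches to keep each transaction (and the log) small.
WHILE 1 = 1
BEGIN
    UPDATE TOP (1000) s
    SET some_field = 1
    FROM source_table s
    WHERE s.some_field IS NULL   -- not flagged yet (assumes NULL = unprocessed)
      AND EXISTS (SELECT 1 FROM target_table t
                  WHERE t.target_join_field LIKE '%' + s.source_join_field + '%');

    IF @@ROWCOUNT = 0 BREAK;     -- no more matching rows left to flag
END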
Also, the condition (b.field like '%' + a.field + '%') will need to check every single record in b (millions) for every record in a (750,000). That equates to more than 750 billion string comparisons. Not great.
The gut feel "index stuff" won't help here either. An index keeps things in order, so the first character(s) dictate the position in the index, not the ones you're interested in.
First Idea
For this reason I would actually consider creating another table, and parsing the long/messy value into something nicer. An example would be to strip off any text from the last '(' onwards (this assumes all the values follow that pattern). That would simplify the query condition to (b.field like '%' + a.field).
Still, an index wouldn't help here either, as the important characters are at the end. So, bizarrely, it could well be worthwhile storing the characters of both tables in reverse order; the index on your temporary table would then come into use (a sketch of the idea follows the next paragraph).
It may seem very wasteful to spend that much time, but in this case a small benefit would yield a great reward. (A few hours' work to halve the comparisons from 750 billion to 375 billion, for example. And if you can bring the index into play you could reduce this a thousand-fold, thanks to an index being a tree search, not just an ordered table...)
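A sketch of the reversed-string idea, with hypothetical table and column names; it assumes the target value has already been trimmed so that the case number sits at the very end of the string:
-- Store the trimmed target value reversed, so the case number becomes a prefix
-- and an ordinary index on the reversed column can be used for the match.
ALTER TABLE #target ADD value_rev AS REVERSE(trimmed_value) PERSISTED;
CREATE INDEX IX_target_value_rev ON #target (value_rev);

-- A prefix LIKE (no leading wildcard) can seek on that index.
SELECT t.recId
FROM #target t
WHERE t.value_rev LIKE REVERSE(@source_value) + '%';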
Second Idea
Assuming you do copy the target table into a temp table, you may benefit further from processing the source in blocks of 1,000 by also deleting the matching records from the target table. (This is only worthwhile if you delete a meaningful number of rows from the target table, such that after all 750,000 records have been checked the target table is, for example, half the size it started at.)
EDIT:
Modified Second Idea
Put the whole target table into a temp table.
Pre-process the values as much as possible to make the string comparison faster, or even to bring indexes into play.
Loop through each record from the source table one at a time, using the following logic in your loop...
DELETE target WHERE field LIKE '%' + @source_field + '%'
IF (@@ROWCOUNT = 0)
[no matches]
ELSE
[matches]
The continuous deleting makes the query faster on each loop, and you only run one query against the data (instead of one to find matches and a second to delete them).
Try this --
update t1
set ContainsBit = 1
from SourceTable t1
join (select TargetField from dbo.TargetTable) t2
on charindex(t1.SourceField, t2.TargetField) > 0
First thing is to make sure you have an index on that column in the searched table. Second is to do the LIKE without a % sign on the left side. Check the execution plan to see whether you are doing a table scan on every row.
As le dorfier correctly pointed out, there is little hope if you are using a UDF.
There are lots of ways to skin the cat - I would think the first important thing to know is whether this is a one-time operation or a regular task that needs to be completed frequently.
Not knowing all the details of your problem, if it were me, and this was a one-time (or infrequent) operation, which it sounds like it is, I'd probably extract just the pertinent fields from the two tables, including the primary key from the source table, and export them to a local machine as text files. The file sizes will likely be significantly smaller than the full tables in your database.
I'd run it locally on a fast machine using a routine written in something like C/C++ or another "lightweight" language that has raw processing power, and write out a table of the primary keys that match, which I would then load back into SQL Server and use as the basis of an update query (i.e. update the source table where the id is in the set of ids from the temp table).
You might spend a few hours writing the routine, but it would run in a fraction of the time you are seeing in SQL.
By the sound of your SQL, you may be trying to do 750,000 table scans against a multi-million-record table.
Tell us more about the problem.
Holy smoke, what great responses!
The system is on a disconnected network, so I can't copy and paste, but here's the retype.
Current UDF:
Create function CountInTrim
(@caseno varchar(255))
returns int
as
Begin
declare @reccount int
select @reccount = count(recId) from targettable where title like '%' + @caseno + '%'
return @reccount
end
Basically, if there's a record count, then there's a match, and the .NET app updates the record. The cursor-based sproc had the same logic.
Also, this is a one-time process, determining which entries in a legacy record/case management system migrated successfully into the new system, so I can't redesign anything. Of course, the developers of either system are no longer available, and while I have some SQL experience, I am by no means an expert.
I parsed the case numbers out of the crazy way the old system stored them to make the source table, and that's the only thing in common with the new system: the case number format. I COULD attempt to parse out the case number in the new system and then run matches against the two sets, but with a possible set of data like:
DescriptionPhrase1999-A1C-12345(5 pages)
Phrase/Two2000-A1C2F-5432S(27 Pages)
DescPhraseThree2002-B2B-2345R(8 pages)
Parsing that became a bit more complex, so I thought I'd keep it simpler.
I'm going to try the single update statement, then fall back to regex in the CLR if needed.
I'll update with the results. And since I've already processed more than half the records, that should help.
Try either Dan R's update query from above:
update t1
set ContainsBit = 1
from SourceTable t1
join (select TargetField
from dbo.TargetTable) t2
on charindex(t1.SourceField, t2.TargetField) > 0
Alternatively, if the timeliness of this is important and this is sql 2005 or later, then this would be a classic use for a calculated column using SQL CLR code with Regular Expressions - no need for a standalone app.
