ETL Transformation of Generic Transaction into Bet and Win - sql-server

I've got a problem in my database: I need to work out whether a generic Transaction is a Bet or a Win. Currently both are combined in one transaction.
I have an additional field which can help -> Bet Amount, which is constant.
Here is what the table looks like:
Basically, the Amount field in the first transaction is the amount the customer starts with. Whenever the amount of the next transaction is higher than that of the previous transaction, that Bet has also got a Win; if it is lower, it is only a Bet.
I need to create an ETL process which will create this table:
I hope you can help me write efficient SQL Server code to create the requested table.
Thanks in advance,

I'm not sure I'd try to do this purely in SSIS, but here's a query that should do the right thing:
select *,
       case
           when LAG(Amount, 1, Amount) over (order by TransactionID) = Amount then 'Bet'
           else 'Win'
       end
from dbo.Transactions;
I'm using the LAG() function to look back one row at the value of the Amount column in the previous row (where "previous" is defined by the ordering of TransactionID†). If that value is the same, consider it a Bet; otherwise it's a Win.
† It seems like you'd want to consider games in isolation. If that's the case, change the OVER() clause on the LAG() function to partition by GameID order by TransactionID.
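A minimal sketch of that per-game variant (GameID and TransactionTypeID are column names taken from the question; the alias is only illustrative):
-- Per-game classification: the window restarts for every GameID, so the first
-- transaction of each game compares against itself (the LAG default).
select *,
       case
           when LAG(Amount, 1, Amount) over (partition by GameID order by TransactionID) = Amount then 'Bet'
           else 'Win'
       end as TransactionTypeID
from dbo.Transactions;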

Use the script below as the SQL Command in your OLE DB Source:
SELECT TransactionID, DateTime, GameID,
       CASE
           WHEN ISNULL(Amount, '') = '' THEN 'StartGame'
           WHEN Amount > BetAmount THEN 'win'
           ELSE 'bet'
       END AS TransactionTypeID,
       BetAmount, Amount
FROM [your table name]


How can I speed up this sql server query?

-- Holds last 30 valdates
create table #valdates(
    date int
)

insert into #valdates
select distinct top (30) valuation_date
from tbsm.tbl_key_rates_summary
where valuation_date <= 20150529
order by valuation_date desc

select
    sum(fv_change), sc_group, valuation_date
from
    (select *
     from tbsm.tbl_security_scorecards_summary
     where valuation_date in (select date from #valdates)) as fact
join
    (select *
     from tbsm.tbl_security_classification
     where sc_book = 'UC') as dim
    on fact.classification_id = dim.classification_id
group by
    valuation_date, sc_group

drop table #valdates
This query takes around 40 seconds to return because the fact table has almost 13 million rows. Can I do anything about this?
Given that there is no proper index supporting the fetch, adding one is probably the easiest (or only) option to really improve the performance. Most likely an index like this would improve the situation a lot:
create index idx_security_scorecards_summary_1 on
tbl_security_scorecards_summary (valuation_date, classification_id)
include (fv_change)
Everything depends, of course, on how good the selectivity of the valuation_date and classification_id fields is (that is, how big a portion of the table needs to be fetched); it might work better with the fields in the opposite order. The fv_change field is in the INCLUDE section so that it is part of the index structure and does not need to be fetched from the base table.
Included fields help if SQL Server has to fetch a lot of rows from the table; if the number of rows this touches is small, they might not help at all. As always with indexing, this slows down inserts and updates, and it is optimized for this case only, so you should of course look at the bigger picture too.
The select is written in a slightly unusual way; I'm not sure whether that makes any difference, but you could also try the more conventional way of writing it:
select
    sum(fact.fv_change), dim.sc_group, fact.valuation_date
from
    tbsm.tbl_security_scorecards_summary fact
    join tbsm.tbl_security_classification dim
        on fact.classification_id = dim.classification_id
where
    fact.valuation_date in (select date from #valdates) and
    dim.sc_book = 'UC'
group by
    fact.valuation_date,
    dim.sc_group
Looking at the "statistics io" output should give you a good idea of which table is causing the slowness, and looking at the query plan to see whether there are any strange operators might help you understand the situation better.
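As a rough sketch of how to capture those figures (run from SSMS in the same session that created #valdates; the per-table logical reads show up on the Messages tab):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- run the query under test in the same batch, e.g. the rewritten join above
select sum(fact.fv_change), dim.sc_group, fact.valuation_date
from tbsm.tbl_security_scorecards_summary fact
join tbsm.tbl_security_classification dim on fact.classification_id = dim.classification_id
where fact.valuation_date in (select date from #valdates)
  and dim.sc_book = 'UC'
group by fact.valuation_date, dim.sc_group;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;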

Computed column performance

I can't find an answer to my problem anywhere on the web.
When exactly are computed columns computed? (not persisted ones)
When I select TOP 100 from thousands of records, are they calculated for only those selected rows?
What if I add a WHERE clause for the computed column? Does this change?
The main problem is that I have a one-to-many relationship, but I want information on the parent side about, let's say, MAX(somecolumn) of the child table.
I'm using Entity Framework. I decided to make a computed column.
Is this a good idea? Are there any other options? Any help appreciated. Thanks.
EDIT:
My column is defined like this:
[ComputedNextClassDate] as [dbo].[ComputeNextClassDate]([Id]),
And my function:
CREATE FUNCTION [dbo].[ComputeNextClassDate](@id INT)
RETURNS DATETIME
AS
BEGIN
    DECLARE @nextDate DATETIME;
    DECLARE @now DATETIME = GETUTCDATE();

    SELECT @nextDate = MIN(Start)
    FROM [dbo].[Events]
    WHERE [Start] > @now AND [GroupClassId] = @id;

    RETURN @nextDate;
END;
For calculated columns with no persistence, the calculation result is never stored.
On query execution, the SQL Server engine builds an execution plan. If your query is well written, the value will be calculated only once, even if it is used in many places in your query.
In my opinion: I never use calculated columns without persistence. The calculation should be done at insertion time or at read time. SQL Server, like other databases, is usually inefficient at this kind of calculation.
Calling into the CLR is catastrophic in terms of performance. Avoid it.
Prefer multiple tables with joins, like:
SELECT p.product_name
, SUM(ISNULL(sales,0))
FROM product p
LEFT OUTER JOIN sales s ON p.product_id = s.product_id
GROUP BY p.product_name
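Applied to the question's Events table, the same read-time idea might look like the sketch below; dbo.GroupClasses is an assumed name for the parent table, not taken from the question:
-- Sketch only: dbo.GroupClasses stands in for the parent table keyed by Id.
SELECT gc.Id,
       MIN(e.[Start]) AS NextClassDate   -- next upcoming class per group
FROM dbo.GroupClasses gc
LEFT OUTER JOIN dbo.[Events] e
    ON e.[GroupClassId] = gc.Id
   AND e.[Start] > GETUTCDATE()
GROUP BY gc.Id;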

SQL Server 2005 SELECT TOP 1 from VIEW returns LAST row

I have a view that may contain more than one row, looking like this:
[rate] | [vendorID]
  8374 |       1234
  6523 |       4321
  5234 |       9374
In a SPROC, I need to set a param equal to the value of the first column from the first row of the view. Something like this:
DECLARE #rate int;
SET #rate = (select top 1 rate from vendor_view where vendorID = 123)
SELECT #rate
But this ALWAYS returns the LAST row of the view.
In fact, if I simply run the subselect by itself, I only get the last row.
With 3 rows in the view, TOP 2 returns the FIRST and THIRD rows in order. With 4 rows, it's returning the top 3 in order. Yet still top 1 is returning the last.
DERP?!?
This works:
DECLARE #rate int;
CREATE TABLE #temp (vRate int)
INSERT INTO #temp (vRate) (select rate from vendor_view where vendorID = 123)
SET #rate = (select top 1 vRate from #temp)
SELECT #rate
DROP TABLE #temp
.. but can someone tell me why the first behaves so fudgely and how to do what I want? As explained in the comments, there is no meaningful column by which I can do an order by. Can I force the order in which rows are inserted to be the order in which they are returned?
[EDIT] I've also noticed that: select top 1 rate from ([view definition select]) also returns the correct values time and again.[/EDIT]
That is by design.
If you don't specify how the query should be sorted, the database is free to return the records in any order that is convenient. There is no natural order in a table that serves as a default sort order.
What the order will actually be depends on how the query is planned, so you can't even rely on the same query giving a consistent result over time, as the database will gather statistics about the data and may change how the query is planned based on that.
To get the record that you expect, you simply have to specify how you want them sorted, for example:
select top 1 rate
from vendor_view
where vendorID = 123
order by rate
I ran into this problem on a query that had worked for years. We upgraded SQL Server and all of a sudden, an unordered select top 1 was not returning the final record in a table. We simply added an order by to the select.
My understanding is that SQL Server will generally give you results based on the clustered index if no ORDER BY is provided, or based on whatever index the engine happens to pick. But this is not a guarantee of any particular order.
If you don't have something to order off of, you need to add it. Either add a date-inserted column that defaults to GETDATE() or add an identity column. It won't help you historically, but it addresses the issue going forward.
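A minimal sketch of that fix; the underlying table name vendor_rates is a placeholder, not something from the question:
-- Placeholder table name: adjust to the table behind vendor_view.
-- Option 1: a surrogate identity column as a deterministic tiebreaker.
ALTER TABLE dbo.vendor_rates ADD RowID INT IDENTITY(1,1) NOT NULL;

-- Option 2: an insert timestamp defaulting to the current date/time.
ALTER TABLE dbo.vendor_rates ADD DateInserted DATETIME NOT NULL
    CONSTRAINT DF_vendor_rates_DateInserted DEFAULT (GETDATE());

-- With either column exposed through the view, TOP 1 becomes deterministic:
-- SELECT TOP 1 rate FROM vendor_view WHERE vendorID = 123 ORDER BY RowID DESC;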
While it doesn't necessarily make sense that the results of the query should be consistent, in this particular instance they are so we decided to leave it 'as is'. Ultimately it would be best to add a column, but this was not an option. The application this belongs to is slated to be discontinued sometime soon and the database server will not be upgraded from SQL 2005. I don't necessarily like this outcome, but it is what it is: until it breaks it shall not be fixed. :-x

Select query optimisation

I have a large table with ID, date, and some other columns. ID is indexed and sequential.
I want to select all rows after a certain date. Given that the IDs are sequential, if the rows are ordered by ID in decreasing order, then once the first row that fails the date test is reached, there is no need to carry on checking. How can I make use of the index to optimise this?
You could do something like this:
With FirstFailDate AS
(
    -- You start by selecting the boundary row: the newest row (highest ID)
    -- that fails the date test
    SELECT TOP 1 * FROM YOUR_TABLE WHERE /* DATE TEST FAILING */ ORDER BY ID DESC
)
SELECT *
FROM YOUR_TABLE t
-- Then, you join your table with that boundary row, and get all the records
-- that come after it (by ID)
JOIN FirstFailDate f
    ON t.ID > f.ID
I don't think there is a good "legal" way to do this without actually indexing date.
However, you could try something like this:
Issue the following query to the DBMS: SELECT * FROM YOUR_TABLE ORDER BY ID DESC.
Start fetching the rows in your client application.
As you fetch, check the date.
Stop fetching (and close the cursor) when the date passes the limit.
The idea is that the DBMS sometimes doesn't have to finish the whole query before starting to send partial results to the client. In this case, the hope is that the DBMS will perform an index scan on ID (due to the ORDER BY ID DESC), and you'll be able to get the results as they are produced and then stop before the query has even finished.
NOTE: If your DBMS gives you an option to balance between getting the first row fast, versus getting the whole result fast, pick the first option (such as /*+ FIRST_ROWS */ hint under Oracle).
Of course, perform measurements on realistic amounts of data, to make sure this actually works in your particular situation.
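If you'd rather keep that early-stop loop on the database server instead of in the client application, a rough T-SQL cursor sketch of the same idea (YOUR_TABLE, ID, and SomeDate are placeholders) could look like this:
-- Sketch: walk rows in descending ID order and stop at the first row
-- that no longer passes the date test.
DECLARE @ID int, @SomeDate datetime, @cutoff datetime;
SET @cutoff = '20150101';

DECLARE c CURSOR FAST_FORWARD FOR
    SELECT ID, SomeDate FROM YOUR_TABLE ORDER BY ID DESC;

OPEN c;
FETCH NEXT FROM c INTO @ID, @SomeDate;
WHILE @@FETCH_STATUS = 0 AND @SomeDate > @cutoff
BEGIN
    -- process or collect the row here
    FETCH NEXT FROM c INTO @ID, @SomeDate;
END

CLOSE c;
DEALLOCATE c;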

MAX keyword taking a lot of time to select a value from a column

Well, I have a table with 40,000,000+ records, but when I try to execute a simple query, it takes ~3 min to finish. Since I am using the same query in my C# solution, where it needs to execute 100+ times, the overall performance of the solution takes a big hit.
This is the query that I am using in a proc
DECLARE #Id bigint
SELECT #Id = MAX(ExecutionID) from ExecutionLog where TestID=50881
select #Id
Any help to improve the performance would be great. Thanks.
What indexes do you have on the table? It sounds like you don't have anything even close to useful for this particular query, so I'd suggest trying to do:
CREATE INDEX IX_ExecutionLog_TestID ON ExecutionLog (TestID, ExecutionID)
...at the very least. Your query filters by TestID, so this needs to be the leading column in the composite index: if you have no index on TestID, then SQL Server will resort to scanning the entire table in order to find rows where TestID = 50881.
It may help to think of indexes on SQL tables in the same way as those you'd find in the back of a big book: hierarchical and multi-level. If you were looking for something, you'd manually look under 'T' for TestID, and then there'd be a sub-heading under TestID for ExecutionID. Without an index entry for TestID, you'd have to read through the entire book looking for TestID, and then see if there's a mention of ExecutionID with it. This is effectively what SQL Server has to do.
If you don't have any indexes at all, then you'll find it useful to review all the queries that hit the table and to ensure that one of the indexes you create is a clustered index (rather than non-clustered).
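For instance, if this lookup is the dominant access pattern and nothing else already owns the clustered index, a sketch along the same lines would be:
-- Sketch only: valid if ExecutionLog has no clustered index yet
-- (or its current one can be replaced).
CREATE CLUSTERED INDEX CIX_ExecutionLog_TestID_ExecutionID
    ON ExecutionLog (TestID, ExecutionID);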
Try to re-work everything into something that works in a set-based manner.
So, for instance, you could write a select statement like this:
;With OrderedLogs as (
    Select ExecutionID, TestID,
           ROW_NUMBER() OVER (PARTITION BY TestID ORDER BY ExecutionID desc) as rn
    from ExecutionLog
)
select * from OrderedLogs where rn = 1 and TestID in (50881, 50882, 50883)
This would then find the maximum ExecutionID for 3 different tests simultaneously.
You might need to store that result in a table variable or temp table, but hopefully, instead, you can continue building up a larger, single query that processes all of the results in parallel.
This is the sort of processing that SQL is meant to be good at - don't cripple the system by iterating through the TestIDs in your code.
If you need to pass many test IDs into a stored procedure for this sort of query, look at Table Valued Parameters.
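A minimal sketch of that table-valued parameter pattern; the type and procedure names here are illustrative, not from the question, and the query mirrors the CTE above:
-- Hypothetical table type holding the test IDs to look up.
CREATE TYPE dbo.TestIdList AS TABLE (TestID int NOT NULL PRIMARY KEY);
GO

CREATE PROCEDURE dbo.GetLatestExecutions
    @TestIds dbo.TestIdList READONLY
AS
BEGIN
    SET NOCOUNT ON;

    ;WITH OrderedLogs AS (
        SELECT el.ExecutionID, el.TestID,
               ROW_NUMBER() OVER (PARTITION BY el.TestID ORDER BY el.ExecutionID DESC) AS rn
        FROM ExecutionLog el
        JOIN @TestIds t ON t.TestID = el.TestID
    )
    SELECT ExecutionID, TestID
    FROM OrderedLogs
    WHERE rn = 1;
END
GO
From C#, the parameter is passed as a DataTable (or DbDataReader) with SqlDbType.Structured, so a single round trip covers all the test IDs.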
