SQL/pivot Calculated fields, is this possible - sql-server

I'm using Microsoft SQL Server and Excel. I'm having trouble getting my head around how to calculate some fields so that the process runs faster. I have a large data set that gets dropped into Excel and stuck into a pivot table.
At its simplest the table will contain a number of fields similar to those below.
Date, User, WorkType, Count, TotalTime
My issue is that I need to calculate an average in a particular way. Each user may have several work types on any given day. The formula I have is, for each Date and User, Sum(TotalTime)/Sum(Count), to get me the following:
Date, User, Average
Currently I dump a select query into Excel, apply a formula to a column to get my averages, then construct the pivot table using the personal details and the averages.
The calculation over 20,000+ rows, however, takes about 5-7 minutes.
So my question is: is it possible to do that type of calculation in either SQL or the pivot table to cut down the processing time? I'm not very confident with pivot tables, and I'm fairly inexperienced at SQL compared to people here. I can manage bits of this, but pulling it all together with the conditions of matching Date and User is beyond me right now.
I could parse the recordset into an array to do my calculations that way before it gets written to the spreadsheet, but I just feel that there should be a better way to achieve the same end.

Pre-calculated aggregates in SQL can go very wrong in an Excel Pivot.
Firstly, you can accidentally take an average of an average.
Secondly, once users start re-arranging the pivot you can get very strange sub-totals and totals.
Try to ensure you do all of your aggregation in one place.
If possible, try to use SQL with SSRS; you can base a report on a parameterised stored procedure. That way you push all of the hard work onto the SQL box, and you restrict users from pivoting things around improperly.
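For illustration, a minimal sketch of such a parameterised procedure, assuming a hypothetical source table dbo.WorkLog with the columns from the question (if TotalTime is stored as a time datatype, you would need conversions like those in the query further down):
CREATE PROCEDURE dbo.GetUserDailyAverages
    @StartDate date,
    @EndDate   date
AS
BEGIN
    SET NOCOUNT ON;

    -- One aggregation, done in one place: Sum(TotalTime)/Sum(Count) per Date and User.
    -- If both columns are integers, multiply by 1.0 to avoid integer division.
    SELECT  [Date],
            [User],
            SUM(1.0 * TotalTime) / SUM([Count]) AS [Average]
    FROM    dbo.WorkLog
    WHERE   [Date] BETWEEN @StartDate AND @EndDate
    GROUP BY [Date], [User]
    ORDER BY [Date], [User];
END;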

SELECT Raw_Data.ID,
       Raw_Data.fldDate,
       CONVERT(datetime, DATEADD(second, SUM(DATEDIFF(second, 0, Raw_Data.TotalHandle)) / SUM(Raw_Data.[Call Count]), 0), 108) AS avgHandled
FROM Raw_Data
WHERE (Raw_Data.fldDate BETWEEN '05-12-2016' AND '07-12-2016')
GROUP BY Raw_Data.fldDate, Raw_Data.ID
For anyone interested, here are the results of my searching. Thank you for the help that pointed me in the right direction. It seems quite clumsy with the conversions due to a time datatype, but it's working.

Related

Trying to find distinct rows as a percentage for multiple columns in SQL Server

So I am extremely new to SQL; I know the basics, but this one is a hard one I guess :)
I have an assignment where I need to find the distinct rows in each column as a percentage (the query should be able to determine the % of rows in that particular column that are distinct), and this needs to be done for multiple columns. I have to do more, but I'll save that for once I figure this out.
Can someone please help me out? A starting point would be nice.
Appreciate the help!!
I would go as simple as this; I don't think your teacher wants you to mess with the specifics of SQL Server if you are learning SQL:
SELECT (100.0 * COUNT(DISTINCT MyFirstColumn)) / COUNT(*) AS FirstPercentage,
       (100.0 * COUNT(DISTINCT MySecondColumn)) / COUNT(*) AS SecondPercentage
FROM MyTable
Notice the extra 100.0 * factor to make it an actual percentage; the .0 makes it a decimal number, otherwise you would get integer division, which would result in 0.
There is no straightforward answer for this question in T-SQL, but there are many tools.
If you are using SSIS, this can easily be achieved using the Data Profiling Task with the Column Value Distribution Profile Request option (Data Profiling Task), or use another data profiling tool like Datamartist...

Selecting data across day boundaries from Star schema with separate date and time dimensions

What is the correct way to model data in a star schema such that a BI tool (such as PowerBI) can select a date range crossing multiple days?
I've currently got fact tables that have separate date and time dimensions. My time resolution is to the second, date resolution is to the day.
It's currently very easy to do aggregation providing the data of interest is in the same day, or even multiple complete days, but it becomes a lot more complicated when you're asking for, say, a 12 hour rolling window that crosses the midnight boundary.
Yes, I can write a SQL statement that first pulls out all rows for the entirety of the days in question and then, by storing the actual date-time as a field in the fact table, filters further down to the actual time range I'm interested in, but that's not trivial (or even possible in some cases) to do in BI reporting tools.
However this must be a frequent scenario in data warehouses... So how should it be done?
An example would be: give me the count of ordered items from the fact_orders table between 2017/Jan/02 16:00 and 2017/Jan/03 04:00.
Orders are stored individually in the fact_orders table.
In my actual scenario I'm using Azure SQL database, but it's more of a general design question.
Thank you.
My first option would be (as you mention in the question) to include a calculated column (Date + Time) in the SQL query and then filter the time part inside the BI tool.
If that doesn't work, you can create a view in the database to achieve the same effect. The easiest is to take the full joined fact + dimensions SQL query that you'd like to use in the BI tool and add the date-time column in the view.
Be sure to still filter on the Date field itself to allow index use! So for your sliding window, your parameters would be something like
WHERE Date between 2017/Jan/02 AND 2017/Jan/03
AND DateTime between 2017/Jan/02 1600 and 2017/Jan/03 0400
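A rough sketch of such a view, with hypothetical table and column names (fact_orders with date_key/time_key, dim_date.full_date, dim_time.seconds_since_midnight); adjust to your actual schema:
-- Expose a combined DateTime next to the keys so the BI tool can filter across
-- midnight while the plain Date filter still benefits from indexes.
CREATE VIEW dbo.vw_fact_orders AS
SELECT  f.*,
        d.full_date AS [Date],
        DATEADD(second, t.seconds_since_midnight, CAST(d.full_date AS datetime)) AS [DateTime]
FROM    dbo.fact_orders AS f
JOIN    dbo.dim_date    AS d ON d.date_key = f.date_key
JOIN    dbo.dim_time    AS t ON t.time_key = f.time_key;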
If that doesn't perform well enough due to data volumes, you might want to set up and maintain a separate table or materialized view (depending on your DB and ETL options) that does a Cartesian join of the time dimension with a small range of the Date dimension (only the last week or whatever period you are interested in partial day reports), then join the fact table to that.
The DateTimeWindow table/view would be indexed on the DateTime column and have only two extra columns: DateKey and TimeKey. Inner join that to the fact table using both keys and you should get exactly the window you want when the BI tool supplies a datetime range.
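A sketch of that helper table under the same hypothetical names, restricted to the last week of dates:
-- Cartesian join of the time dimension with a small slice of the date dimension,
-- indexed on the combined DateTime.
SELECT  d.date_key AS DateKey,
        t.time_key AS TimeKey,
        DATEADD(second, t.seconds_since_midnight, CAST(d.full_date AS datetime)) AS [DateTime]
INTO    dbo.DateTimeWindow
FROM    dbo.dim_date d
CROSS JOIN dbo.dim_time t
WHERE   d.full_date >= DATEADD(day, -7, CAST(GETDATE() AS date));

CREATE CLUSTERED INDEX IX_DateTimeWindow ON dbo.DateTimeWindow ([DateTime]);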
That is not easily modeled. A solution would be to build an additional dimension with date + time. Of course, this could mean you have to severely limit the granularity of the time dimension.
10 year hour granularity: 365 * 10 * 24 = 87600 rows
10 year minute granularity: 365 * 10 * 24 * 60 = 5256000 rows
You could use just this dimension, or (better) add it and not show it to all users. It would mean an additional key in the fact table; if the fact table is not gigantic, that's no big deal.
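As a rough sketch (T-SQL, hypothetical table and column names), an hour-grain combined dimension could be generated from the existing date dimension like this:
-- Cross-join the date dimension with the 24 hours of a day (~87,600 rows for 10 years).
WITH hours AS (
    SELECT TOP (24) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS hour_of_day
    FROM sys.all_objects
)
INSERT INTO dbo.dim_datetime (datetime_key, full_datetime)
SELECT  CAST(d.date_key AS bigint) * 100 + h.hour_of_day,
        DATEADD(hour, h.hour_of_day, CAST(d.full_date AS datetime))
FROM    dbo.dim_date d
CROSS JOIN hours h;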

Coldfusion Compare Two Query Results from Same Database

I've done some research on this site for an issue I'm having, however, I'm finding that the solution is not exactly what I'm looking for, or the implementation doesn't relate to what I'm trying to do. Or, simply put, I just can't seem to figure it out. Here is my issue.
We have a monthly query that we run and send to a third party, listing physicians, their degree, specialty and clinic. I have the query established already. But recently they asked us to export only the new results compared to the previous month's data, instead of the whole results list.
So I thought I would create a tool, starting out by simply importing the previous month's data. Then, taking the query I had been using, I would put it in a ColdFusion page and run it, and it would show me the new records for the current month compared to the previous month. When I run the report of new data each month, it saves that data in the database with the columns r_month and r_year, which simply mean report month/year.
To initially populate the database I just imported October's data so I would have a base, with r_month/r_year being "10" and "2014" respectively. There are 674 records. I then created my page with a button that runs the same query and saves those results, with r_month and r_year saved as "11" and "2014" respectively. When I do that, I have 682 records. So, for the month of November, there are 8 "different" or new records compared to the previous month (October).
My question is: what is the best way to run a query that takes the data from October (10/2014), compares it to November's data (11/2014), and gives me just the 8 records that are new in November?
Sorry this is long, but I wanted to give you the detail so you have as much information as possible. I don't really have a code sample I can provide, because the way I was attempting this before (using loops, etc.) was just not working. I tried looping through the previous month's query and the current month's query, trying to find a difference, but that wasn't working. Once again, I've tried using similar samples I've found on here, but they are either not what I'm looking for, or I just can't figure them out. Basically, at the end of the process, there needs to be a button that exports only the new records (in this example, the 8) into an Excel sheet that we can simply email to them.
Any help would be greatly appreciated.
SOLUTION 1 - Since you are using SQL Server, you can do this pretty easily within the query. You have already logged the previous data, so you presumably have a key for the "old" physicians in your log table. Try something like this:
<cfquery name="getNewPHys" datasource="#dsn#">
SELECT *
FROM sourceTable
WHERE physID NOT IN
(SELECT physID FROM logtable
WHERE daterange between #somerange# AND #someotherrange#)
</cfquery>
You would have to add your own values and vars but you get the idea.
NOTE: This is pseudo-code. You would OF COURSE use cfqueryparam for any of your variables.
SOLUTION 2
Another way to do this is by using a dateAdded or lastUpdated column. Every time a row is updated, you update the lastUpdated column with the current date/time. Then selecting recent records is a matter of selecting any records which have been updated within your range. That's what Leigh suggested in her comment.
I would add one other comment. You seem to be trying to solve this problem without changing anything in your data table. That's not going to work. You need to think about your schema a bit more. For example, solution 2 would involve adding an additional column, and you could even add an MSSQL trigger that automatically updates that field whenever the record is updated. Wouldn't that work?
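A rough sketch of such a trigger, assuming a hypothetical dbo.physicians table with a physID key and a LastUpdated column:
-- Stamp LastUpdated whenever a physician row changes.
CREATE TRIGGER trg_physicians_touch
ON dbo.physicians
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE p
    SET    LastUpdated = GETDATE()
    FROM   dbo.physicians AS p
    INNER JOIN inserted   AS i ON i.physID = p.physID;
END;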
I still think we are missing something. Are you perchance overwriting your data each time? Or producing duplicate records - 674 this month, 682 next month with duplicates? If so, that's what you need to correct. Anything else is going to be a bolt on solution that creates more problems down the road.
Step 1 - Add a computed column to your table. Make sure you persist the data so you can index it. The computation should result in values like '201401' for January 2014, etc. Let's call that column YearMonth
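For example, something along these lines, assuming r_year and r_month are stored as integers in a table here called yourtable (adjust the conversions if they are strings):
-- Persisted, indexable YearMonth column ('YYYYMM') built from r_year and r_month.
ALTER TABLE yourtable
    ADD YearMonth AS (CONVERT(char(4), r_year) + RIGHT('00' + CONVERT(varchar(2), r_month), 2)) PERSISTED;

CREATE INDEX IX_yourtable_YearMonth ON yourtable (YearMonth);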
Then your code and query looks like this:
<cfset ControlYearMonth = "201410"> <!--- October 2014 --->
<cfquery>
select field1, field2, etc
from yourtable
where YearMonth = <cfqueryparam value="#ControlYearMonth#">
except
select field1, field2, etc
from yourtable
where YearMonth < <cfqueryparam value="#ControlYearMonth#">
</cfquery>

Why does an SSRS 2005 report take a long time to execute when using a parameter?

I am trying to execute one SSRS 2005 report. This report takes one parameter.
If I don't use the parameter and write the value directly, then it runs in 10 seconds, e.g.
Select * from table1 where id = 122
If I use a parameter, then it takes a long time, like 10 to 15 minutes, e.g.
Select * from table1 where id = #id
I don't know why this is happening.
Thanks in advance.
It's impossible to answer the question as asked: only you have the info to determine why things aren't performing well.
What we can do, however, is answer the question "How do I investigate SSRS performance issues?". One of the best tools I've found so far is the ExecutionLog2 view in the ReportServer catalog database. In your case, the important columns to look at are:
TimeDataRetrieval, for time spent connecting to the data source and retrieving data rows
TimeProcessing, for time spent turning the data rows into the report
TimeRendering, for time spent creating the final output (pdf, html, excel, etc)
This will give you a starting point for investigating further. Most likely (from your description) I'd guess the problem lies in the first bit. A suitable follow up step would be to analyze the query that is executed by SSRS, possibly using the execution plan.
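As a starting point, a query along these lines against the catalog database can show the split (a sketch assuming the default ReportServer database name; on an older instance the view may just be ExecutionLog, which identifies reports by ID rather than path; replace YourReport with your report's name):
-- Recent executions of one report, split into the three phases (times are in ms).
SELECT TOP (20)
       ReportPath,
       TimeStart,
       TimeDataRetrieval,   -- connecting to the data source and fetching rows
       TimeProcessing,      -- turning the rows into the report
       TimeRendering,       -- producing the final output (pdf, html, excel, ...)
       Status
FROM   ReportServer.dbo.ExecutionLog2
WHERE  ReportPath LIKE '%YourReport%'
ORDER BY TimeStart DESC;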
1) Try replacing your subqueries with join logic wherever possible. I know subqueries often feel more logical, as they make the problem flow when we are thinking in a macro view: [this result set] gets [that result set's output].
2) You can also add an index. And since the column is an int, it will be faster.

How can I handle the time consuming SQL?

We have a table with 6 million records, and a SQL query which needs around 7 minutes to return the result. I think the SQL cannot be optimized any further.
The query time causes our WebLogic to throw the max stuck thread exception.
Is there any recommendation for how to handle this problem?
Following is the query, but it's hard for me to change it,
SELECT * FROM table1
WHERE trim(StudentID) IN ('354354','0')
AND concat(concat(substr(table1.LogDate,7,10),'/'),substr(table1.LogDate,1,5))
BETWEEN '2009/02/02' AND '2009/03/02'
AND TerminalType='1'
AND RecStatus='0' ORDER BY StudentID, LogDate DESC, LogTime
However, I know it's time consuming to use strings to compare dates, but someone wrote it that way before and I cannot change the table structure...
LogDate was defined as a string, and the format is mm/dd/yyyy, so we need to substring and concat it before we can use BETWEEN ... AND ... I think it's hard to optimize here.
The odds are that this query is doing a full table scan, because your WHERE conditions are unlikely to be able to take advantage of any indexes.
Is LogDate a date field or a text field? If it's a date field, then don't do the substr's and concat's. Just say "LogDate BETWEEN '2009-02-02' AND '2009-02-03'" or whatever the date range is. If it's defined as a text field, you should seriously consider redefining it to a date field. (If your date really is text and is written mm/dd/yyyy, then your ORDER BY ... LogDate DESC is not going to give useful results if the dates span more than one year.)
Is it necessary to do the trim on StudentID? It is far better to clean up your data before putting it in the database than to try to clean it up every time you retrieve it.
If LogDate is defined as a date and you can trim studentid on input, then create indexes on one or both fields and the query time should fall dramatically.
Or if you want a quick and dirty solution, create an index on "trim(studentid)".
If that doesn't help, give us more info about your table layouts and indexes.
SELECT * ... WHERE trim(StudentID) IN ('354354','0')
If this is a normal construct, then you need a function-based index. Without it you force the DB server to perform a full table scan.
As a rule of thumb, you should avoid the use of functions in the WHERE clause as much as possible. The trim(StudentID) and substr(table1.LogDate,7,10) prevent the DB server from using any index or applying any optimization to the query. Try to use native data types as much as possible, e.g. DATE instead of VARCHAR for the LogDate. StudentID should also be managed properly in the client software by, e.g., trimming the data before INSERT/UPDATE.
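If the trim really cannot be removed, a function-based index matching the expression is one option; a minimal sketch in Oracle syntax (the to_date/substr style of the query suggests an Oracle-like database, but adjust for yours):
-- Index the exact expression used in the WHERE clause so it can be used for lookups.
CREATE INDEX ix_table1_trim_studentid ON table1 (TRIM(StudentID));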
If your database supports it, you might want to try a materialized view.
If not, it might be worth thinking about implementing something similar yourself, by having a scheduled job that runs a query that does the expensive trims and concats and refreshes a table with the results so that you can run a query against the better table and avoid the expensive stuff. Or use triggers to maintain such a table.
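A rough sketch of that idea in Oracle syntax, assuming the mm/dd/yyyy format described in the question and that every LogDate value is a valid date:
-- Precompute the cleaned-up columns once, then query and index the result.
CREATE MATERIALIZED VIEW mv_table1_clean
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT t.*,
       TRIM(t.StudentID)                AS StudentID_clean,
       TO_DATE(t.LogDate, 'MM/DD/YYYY') AS LogDate_dt
FROM   table1 t;

CREATE INDEX ix_mv_clean ON mv_table1_clean (StudentID_clean, LogDate_dt);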
But the query time causes our WebLogic to throw the max stuck thread exception.
If the query takes 7 minutes and cannot be made faster, you have to stop running this query real-time. Can you change your application to query a cached results table that you periodically refresh?
As an emergency stop-gap before that, you can implement a latch (in Java) that allows only one thread at a time to execute this query. A second thread would immediately fail with an error (instead of bringing the whole system down). That is probably not making users of this query happy, but at least it protects everyone else.
I updated the query; could you give me some advice?
Those string manipulations make indexing pretty much impossible. Are you sure you cannot at least get rid of the "trim"? Is there really redundant whitespace in the actual data? If so, you could narrow down on just a single student_id, which should speed things up a lot.
You want a composite index on (student_id, log_date), and hopefully the complex log_date condition can still be resolved using a index range scan (for a given student id).
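A minimal sketch of that index, assuming the trim can be dropped so the optimizer can actually use the leading StudentID column:
-- Composite index: equality on StudentID plus a range on LogDate.
CREATE INDEX ix_table1_student_logdate ON table1 (StudentID, LogDate);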
Without any further information about what kind of query you are executing and whether you are using indexes or not, it is hard to give any specific information.
But here are a few general tips.
Make sure you use indexes on the columns you often filter/order by.
If it is only a certain query that is way too slow, then perhaps you can avoid executing that query by automatically maintaining the results as the database changes. For example, instead of a count() you can usually keep a count stored somewhere.
Try to remove the trim() from the query by automatically calling trim() on your data before/while inserting it into the table. That way you can simply use an index to find the StudentID.
Also, the date filter should be possible natively in your database. Without knowing which database, it might be more difficult, but something like this should probably work: LogDate BETWEEN '2009-02-02' AND '2009-02-02'
If you also add an index on all of these columns together (i.e. StudentID, LogDate, TerminalType, RecStatus and EmployeeID), then it should be lightning fast.
Without knowing what database you are using and what your table structure is, it's very difficult to suggest any improvement, but queries can be improved by using indexes, hints, etc.
In your query the following part
concat(concat(substr(table1.LogDate,7,10),'/'), substr(table1.LogDate,1,5)) BETWEEN '2009/02/02' AND '2009/02/02'
looks funny. BETWEEN '2009/02/02' AND '2009/02/02'?? Man, what are you trying to do?
Can you post your table structure here?
And 6 million records is not a big thing anyway.
As has been said a lot already, your problem is in the date field. You definitely need to change your date from a string field to a native date type. If it is a legacy field that is used in your app in this exact way, you may still create a to_date(LogDate, 'MM/DD/YYYY') function-based index that transforms your "string" date into a "date" date, and allows the fast BETWEEN search already mentioned without modifying your table data.
This should speed things up a lot.
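A sketch of that function-based index in Oracle syntax, using the mm/dd/yyyy format from the question (it will only build cleanly if every LogDate string is a valid date in that format):
-- Index the string date through TO_DATE so a native date range can use an index range scan.
CREATE INDEX ix_table1_logdate_dt ON table1 (TO_DATE(LogDate, 'MM/DD/YYYY'));

-- The query must then filter on the same expression, e.g.:
-- WHERE TO_DATE(LogDate, 'MM/DD/YYYY') BETWEEN DATE '2009-02-02' AND DATE '2009-03-02'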
With the little information you have provided, my hunch is that the following clause gives us a clue:
... WHERE trim(StudentID) IN ('354354','0')
If you have large numbers of records with unidentified student (i.e. studentID=0) an index on studentID would be very imbalanced.
Of the 6 million records, how many have studentId=0?
Your main problem is that your query is treating everything as a string.
If LogDate is a Date WITHOUT a time component, you want something like the following
SELECT * FROM table1
WHERE StudentID IN (:SearchStudentId,0)
AND table1.LogDate = :SearchDate
AND TerminalType='1'
AND RecStatus='0'
ORDER BY EmployeeID, LogDate DESC, LogTime
If LogDate has a time component, and SearchDate does NOT have a time component, then something like this. (The .99999 will set the time to 1 second before midnight)
SELECT * FROM table1
WHERE StudentID IN (:SearchStudentId,:StudentId0)
AND table1.LogDate BETWEEN :SearchDate AND :SearchDate+0.99999
AND TerminalType='1'
AND RecStatus='0'
ORDER BY EmployeeID, LogDate DESC, LogTime
Note the use of bind variables for the parameters that change between calls. It won't make the query much faster, but it is 'best practice'.
Depending on your calling language, you may need to add TO_DATE, etc, to cast the incoming bind variable into a Date type.
If StudentID is a char (usually the reason for using trim()) you may be able to get better performance by padding the variables instead of trimming the field, like this (assuming StudentID is a char(10)):
StudentID IN (lpad('354354',10),lpad('0',10))
This will allow the index on StudentID to be used, if one exists.
