How to perform statistical computations in a query? - sql-server

I have a table which is filled with float values. I need to calculate the number of results grouped by their distribution around the mean value (Gaussian Distribution). Basically, it is calculated like this:
SELECT COUNT(*), FloatColumn - AVG(FloatColumn) - STDEV(FloatColumn)
FROM Data
GROUP BY FloatColumn - AVG(FloatColumn) - STDEV(FloatColumn)
But for obvious reasons, SQL Server gives this error: Cannot use an aggregate or a subquery in an expression used for the group by list of a GROUP BY clause.
My question is: can I somehow leave this computation to SQL Server? Or do I have to do it the old-fashioned way: retrieve all the data and do the calculation myself?

To get the aggregate of the whole set, you can use an empty OVER clause:
WITH T(Result)
AS (SELECT FloatColumn - AVG(FloatColumn) OVER () - STDEV(FloatColumn) OVER ()
    FROM Data)
SELECT COUNT(*),
       Result
FROM T
GROUP BY Result

You can perform a pre-aggregation of the data, and join back to the table.
Schema Setup:
create table data(floatcolumn float);
insert data values
(1234.56),
(134.56),
(134.56),
(234.56),
(1349),
(900);
Query 1:
SELECT COUNT(*) C, D.FloatColumn - A
FROM
(
    SELECT AVG(FloatColumn) + STDEV(FloatColumn) A
    FROM Data
) preagg
CROSS JOIN Data D
GROUP BY FloatColumn - A;
Results:
| C | COLUMN_1 |
--------------------------
| 2 | -1196.876067819572 |
| 1 | -1096.876067819572 |
| 1 | -431.436067819572 |
| 1 | -96.876067819572 |
| 1 | 17.563932180428 |

Related

Joining 2nd Table with Random Row to each record

I need to join Table B to Table A, where Table B's records are assigned randomly. Most of the queries out there are based on having a key between the tables and join conditions; I just want to randomly join records without a key.
I'm not sure where to start, as none of the queries I've found do this. I assume a nested join could be helpful, but how can I randomly assign the records on join?
**Table A**
| Associate ID| Statement|
|:----: |:------:|
| 33691| John is |
| 82451| Susie is |
| 25485| Sam is|
| 26582| Lonnie is|
| 52548| Carl is|
**Table B**
| RowID | List|
|:----: |:------:|
| 1| admirable|
| 2| astounding|
| 3| excellent|
| 4| awesome|
| 5| first class|
The result would be something like this, where items from the list are not looped through in order, but random:
**Result Table**
| Associate ID| Statement| List|
|:----: |:------:|:------:|
| 33691| John is |astounding|
| 82451| Susie is |first class|
| 25485| Sam is|admirable|
| 26582| Lonnie is|excellent|
| 52548| Carl is|awesome|
These are some of the queries I've tried:
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/aeb83251-e132-435a-8630-e5b842a69368/random-join-between-tables?forum=sqldataaccess
- This seems to loop through the values from Table B in order, not randomly.
https://www.daveperrett.com/articles/2009/08/11/mysql-select-random-row-with-join
- This relies on a common key between the two tables and returns one of the records for that key, which I do not have.
SQL Join help when selecting random row
- I'll be honest, I don't understand this one, but it doesn't seem to assign a random row to each record from Table A; it's more of an overall selection, like the link above.
Join One Table To Get Random Rows from 2nd Table
- This seems to be specific to a key, rather than overall randomness.
Using two CTEs, we generate a row number for each table based on a random order, and then join on that row number (the CTEs are named so they don't collide with the base table names).
Use a CTE to get N times the records in B, as described here:
Repeat Rows N Times According to Column Value (not included below). Note: to get the "N", take the row counts of A and B, divide one by the other, and add 1.
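A hedged sketch of that repeat step (table names as in the question; the @N variable and the sys.messages row source are my own choices):
-- Compute N = count(A) / count(B) + 1
DECLARE @N int = (SELECT COUNT(*) FROM A) / (SELECT COUNT(*) FROM B) + 1;

-- Repeat every row of B @N times by cross joining a small numbers list
WITH Numbers AS (
    SELECT TOP (@N) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.messages
)
SELECT b.*
FROM B b
CROSS JOIN Numbers;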
Assuming Even Distribution
WITH RandomA AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY NEWID()) RN
    FROM A
),
RandomB AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY NEWID()) RN
    FROM B
)
SELECT *
FROM RandomA
INNER JOIN RandomB
    ON RandomA.RN = RandomB.RN
Or use this (assuming uneven distribution):
SELECT *
FROM A
CROSS APPLY (SELECT TOP 1 * FROM B ORDER BY NewID()) Z
This method assumes you know in advance which is the smaller table.
First it assigns an ascending row numbering from 1. This does not have to be randomized.
Then for each row in the larger table it uses the modulus operator to randomly calculate a row number in the range to join onto.
WITH Small
     AS (SELECT *,
                ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS RN
         FROM SmallTable),
     Large
     AS (SELECT *,
                1 + CRYPT_GEN_RANDOM(3) % (SELECT COUNT(*) FROM SmallTable) AS RND
         FROM LargeTable
         ORDER BY RND
         OFFSET 0 ROWS)
SELECT *
FROM Large
INNER JOIN Small
    ON Small.RN = Large.RND
The ORDER BY RND OFFSET 0 ROWS is to get the random numbers materialized in advance.
This will allow a MERGE join on the smaller table. It also avoids an issue that can sometimes happen where the CRYPT_GEN_RANDOM is moved around in the plan and only evaluated once rather than once per row as required.

SQL Server: Performance issue: OR statement substitute in WHERE clause

I want to select only the records from table Stock based on the column PostingDate.
The PostingDate should be after the InitDate in another table called InitClient. There are currently 2 clients in both tables (client 1 and client 2), each with a different InitDate.
With the code below I get exactly what I need, based on the sample data also included underneath. However, two problems arise. First, with millions of records the query takes way too long (hours). Second, it isn't dynamic at all: it has to be edited every time a new client is included.
A potential option to cover the performance issue would be to write two separate queries, one for Client 1 and one for Client 2, with a UNION in between. Unfortunately, that isn't dynamic enough, since multiple clients are possible.
SELECT
     Material
    ,Stock
    ,Stock.PostingDate
    ,Stock.Client
FROM Stock
LEFT JOIN (SELECT InitDate FROM InitClient WHERE Client = 1) C1 ON 1=1
LEFT JOIN (SELECT InitDate FROM InitClient WHERE Client = 2) C2 ON 1=1
WHERE
(
    (Stock.Client = 1 AND Stock.PostingDate > C1.InitDate) OR
    (Stock.Client = 2 AND Stock.PostingDate > C2.InitDate)
)
Sample dataset:
CREATE TABLE InitClient
(
Client varchar(300),
InitDate date
);
INSERT INTO InitClient (Client,InitDate)
VALUES
('1', '5/1/2021'),
('2', '1/31/2021');
SELECT * FROM InitClient
CREATE TABLE Stock
(
Material varchar(300),
PostingDate varchar(300),
Stock varchar(300),
Client varchar(300)
);
INSERT INTO Stock (Material,PostingDate,Stock,Client)
VALUES
('322', '1/1/2021', '5', '1'),
('101', '2/1/2021', '5', '2'),
('322', '3/2/2021', '10', '1'),
('101', '4/13/2021', '5', '1'),
('400', '5/11/2021', '170', '2'),
('401', '6/20/2021', '200', '1'),
('322', '7/20/2021', '160', '2'),
('400', '8/9/2021', '93', '2');
SELECT * FROM Stock
Desired result, but then with a substitute for the OR statement to ramp up the performance:
| Material | PostingDate | Stock | Client |
|----------|-------------|-------|--------|
| 322 | 1/1/2021 | 5 | 1 |
| 101 | 2/1/2021 | 5 | 2 |
| 322 | 3/2/2021 | 10 | 1 |
| 101 | 4/13/2021 | 5 | 1 |
| 400 | 5/11/2021 | 170 | 2 |
| 401 | 6/20/2021 | 200 | 1 |
| 322 | 7/20/2021 | 160 | 2 |
| 400 | 8/9/2021 | 93 | 2 |
Any suggestions for a substitute for the OR construction in the above code, to keep performance while making it dynamic?
You can optimize this query quite a bit.
Firstly, those two LEFT JOINs are basically just semi-joins, because you don't actually return any results from them. So we can turn them into a single EXISTS.
You will also get an implicit conversion, because Client is varchar while 1 and 2 are int literals (int takes precedence, so the column gets converted). Change them to '1' and '2', or change the column type.
PostingDate is also varchar; it should really be date.
SELECT
s.Material
,s.Stock
,s.PostingDate
,s.Client
FROM Stock s
WHERE s.Client IN ('1','2')
AND EXISTS (SELECT 1
FROM InitClient c
WHERE s.PostingDate > c.InitDate
AND c.Client = s.Client
);
Next you want to look at indexing. For this query (not accounting for any other queries being run), you probably want the following indexes (remove the INCLUDE for a clustered index):
InitClient (Client, InitDate)
Stock (Client) INCLUDE (PostingDate, Material, Stock)
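Spelled out as DDL, that would be something like this (the index names are my own):
CREATE INDEX IX_InitClient_Client_InitDate
    ON InitClient (Client, InitDate);

CREATE INDEX IX_Stock_Client
    ON Stock (Client) INCLUDE (PostingDate, Material, Stock);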
It is possible that even with these indexes you may still get a scan on Stock, because IN functions like an OR. This does not always happen, but it's worth checking. If it does, you can instead rewrite the query to use UNION ALL:
SELECT
s.Material
,s.Stock
,s.PostingDate
,s.Client
FROM (
SELECT *
FROM Stock s
WHERE s.Client = '1'
UNION ALL
SELECT *
FROM Stock s
WHERE s.Client = '2'
) s
WHERE EXISTS (SELECT 1
FROM InitClient c
WHERE s.PostingDate > c.InitDate
AND c.Client = s.Client
);
There is nothing wrong in expecting your query to be dynamic. However, in order to make it more performant, you may need to reach a compromise between two conflicting expectations. I will present a few ways to optimize your query; some of them involve drastic changes, but eventually it is you or your client who decides how this needs to be improved. Also, some of the improvements might turn out ineffective, so do not take anything for granted: test everything. Without further ado, let's see the suggestions.
The query
First I would try to change the query a little; maybe something like this could help you:
SELECT
     Material
    ,Stock
    ,Stock.PostingDate
    ,C1.InitDate
    ,C2.InitDate
    ,Stock.Client
FROM Stock
LEFT JOIN InitClient C1 ON C1.Client = 1
LEFT JOIN InitClient C2 ON C2.Client = 2
WHERE
(
    (Stock.Client = 1 AND Stock.PostingDate > C1.InitDate) OR
    (Stock.Client = 2 AND Stock.PostingDate > C2.InitDate)
)
Sometimes the simple step of getting rid of subselects does the trick.
The indexes
You may want to speed up the process by creating indexes, for example on Stock.PostingDate.
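In DDL form, that suggestion would look something like this (the index name is an assumption):
CREATE INDEX IX_Stock_PostingDate ON Stock (PostingDate);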
Helper table
You can create a helper table to store the relevant data from the Stock records, so that you run the slow query ONCE in a while (maybe once a week, or each time a new client enters the stage) and store the results in the helper table. Once the prerequisite calculation is done, you can query just the helper table with its few records, getting lightning-fast behavior. The idea is to execute the slow query rarely, cache/store the results, and reuse them instead of recalculating every time.
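A minimal sketch of the idea, assuming a helper table named Stock_AfterInit:
DROP TABLE IF EXISTS Stock_AfterInit;

-- Run the slow query once in a while and persist the result
SELECT s.Material, s.Stock, s.PostingDate, s.Client
INTO Stock_AfterInit
FROM Stock s
INNER JOIN InitClient c
    ON c.Client = s.Client
   AND s.PostingDate > c.InitDate;

-- Day-to-day reads then hit only the precomputed table
SELECT * FROM Stock_AfterInit;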
A new column
You could create a column in your Stock table named InitDate and fill it with data for each record periodically. The first execution will take a long while, but afterwards you will be able to query only the Stock table, with no joins or subselects.
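A minimal sketch, assuming PostingDate has already been converted to date as recommended in the other answer:
ALTER TABLE Stock ADD InitDate date NULL;
GO -- the new column must exist before the next batch compiles

-- Periodic (slow) refresh of the new column
UPDATE s
SET s.InitDate = c.InitDate
FROM Stock s
INNER JOIN InitClient c ON c.Client = s.Client;

-- After that, the filter is a plain single-table comparison
SELECT Material, Stock, PostingDate, Client
FROM Stock
WHERE PostingDate > InitDate;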

T-SQL Verify each character of each value in a column against values of another table

I have a table in a database with a column whose values look like XX-xx-cccc-ff-gg. Let's assume this is table ABC and the column is called ABC_FORMAT_STR. In another table, ABC_FORMAT_ELEMENTS, I have a column called CHARS with values like A, B, C, D... X, Y, Z, a, d, f, g, x, y, z etc. (please don't assume I have all ASCII values there; it's mainly some letters and numbers plus some special characters like *, ;, -, & etc.).
I need to add a constraint on the [ABC].[ABC_FORMAT_STR] column so that each and every character of every value in that column exists in [ABC_FORMAT_ELEMENTS].[CHARS].
Is this possible? Can someone help me with this?
Thank you very much in advance.
This is an example with simple names, keeping the names of the objects above for clarity:
Example
SELECT [ABC_FORMAT_STR] FROM [ABC]
Nick
George
Adam
SELECT [CHARS] FROM [ABC_FORMAT_ELEMENTS]
A
G
N
a
c
e
g
i
k
o
r
After the constraint:
SELECT [ABC_FORMAT_STR] FROM [ABC]
Nick
George
Note on the result:
"Adam" cannot be included because the "d" and "m" characters are not in the [ABC_FORMAT_ELEMENTS] table.
Here is a simple and natural solution based on the TRANSLATE() function.
It works on SQL Server 2017 onwards.
SQL
-- DDL and sample data population, start
DECLARE @ABC TABLE (ABC_FORMAT_STR VARCHAR(50));
INSERT INTO @ABC VALUES
('Nick'),
('George'),
('Adam');

DECLARE @ABC_FORMAT_ELEMENTS TABLE (CHARS CHAR(1));
INSERT INTO @ABC_FORMAT_ELEMENTS VALUES
('A'), ('G'), ('N'), ('a'), ('c'), ('e'), ('g'),
('i'), ('k'), ('o'), ('r');
-- DDL and sample data population, end

SELECT a.*
     , t1.legitChars
     , t2.badChars
FROM @ABC AS a
CROSS APPLY (SELECT STRING_AGG(CHARS, '') FROM @ABC_FORMAT_ELEMENTS) AS t1(legitChars)
CROSS APPLY (SELECT TRANSLATE(a.ABC_FORMAT_STR, t1.legitChars, SPACE(LEN(t1.legitChars)))) AS t2(badChars)
WHERE TRIM(t2.badChars) = '';
Output
+----------------+-------------+----------+
| ABC_FORMAT_STR | legitChars | badChars |
+----------------+-------------+----------+
| Nick | AGNacegikor | |
| George | AGNacegikor | |
+----------------+-------------+----------+
Output with WHERE clause commented out
Just to see why the row with the 'Adam' value was filtered out.
+----------------+-------------+----------+
| ABC_FORMAT_STR | legitChars | badChars |
+----------------+-------------+----------+
| Nick | AGNacegikor | |
| George | AGNacegikor | |
| Adam | AGNacegikor | d m |
+----------------+-------------+----------+
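The question asks for an actual constraint. One way to get there, as a hedged sketch (the function and constraint names are my own; the real tables ABC and ABC_FORMAT_ELEMENTS are assumed), is to wrap the same TRANSLATE() check in a scalar function used by a CHECK constraint:
CREATE FUNCTION dbo.AllCharsAllowed (@s VARCHAR(50))
RETURNS BIT
AS
BEGIN
    -- Blank out every legitimate character; anything left over is illegal
    DECLARE @legit VARCHAR(4000) =
        (SELECT STRING_AGG(CHARS, '') FROM ABC_FORMAT_ELEMENTS);
    RETURN CASE WHEN TRIM(TRANSLATE(@s, @legit, SPACE(LEN(@legit)))) = ''
                THEN 1 ELSE 0 END;
END;
GO

ALTER TABLE ABC
    ADD CONSTRAINT CK_ABC_FORMAT_STR CHECK (dbo.AllCharsAllowed(ABC_FORMAT_STR) = 1);
Keep in mind that a constraint calling a function that reads another table is only evaluated when rows in ABC change; later edits to ABC_FORMAT_ELEMENTS will not re-validate existing rows.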
Based on your sample data, here's one method to identify valid/invalid rows in ABC. You could easily adapt this to be part of a trigger that checks inserted or updated rows (via the inserted pseudo-table) and rolls back if any rows violate the criteria; a sketch follows the query below.
This uses a tally/numbers table (very often used for splitting strings). Here one is defined in a CTE, but a permanent solution would have a reusable permanent numbers table.
The logic is to split the strings into rows, count the rows whose character exists in the lookup table, and reject any string whose matching-row count is less than its length.
with
numbers (n) as (
    select top 100 Row_Number() over (order by (select null))
    from sys.messages
),
strings as (
    select a.ABC_FORMAT_STR, Count(*) over (partition by a.ABC_FORMAT_STR) n
    from abc a
    cross join numbers n
    where n.n <= Len(a.ABC_FORMAT_STR)
      and exists (select * from ABC_FORMAT_ELEMENTS e
                  where e.chars = Substring(a.ABC_FORMAT_STR, n.n, 1))
)
select ABC_FORMAT_STR
from strings
where Len(ABC_FORMAT_STR) = n
group by ABC_FORMAT_STR
/* change to where Len(ABC_FORMAT_STR) <> n to find rows that aren't allowed */
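The trigger adaptation mentioned above could look something like this hedged sketch (the trigger name and error message are my own):
CREATE TRIGGER trg_ABC_CheckChars
ON ABC
AFTER INSERT, UPDATE
AS
BEGIN
    -- Reject the whole statement if any inserted/updated value
    -- contains a character that is not in ABC_FORMAT_ELEMENTS
    IF EXISTS (
        SELECT 1
        FROM inserted i
        CROSS APPLY (SELECT TOP (LEN(i.ABC_FORMAT_STR))
                            ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
                     FROM sys.messages) nums
        WHERE NOT EXISTS (SELECT * FROM ABC_FORMAT_ELEMENTS e
                          WHERE e.CHARS = SUBSTRING(i.ABC_FORMAT_STR, nums.n, 1))
    )
    BEGIN
        ROLLBACK TRANSACTION;
        THROW 50000, 'ABC_FORMAT_STR contains characters not in ABC_FORMAT_ELEMENTS', 1;
    END
END;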

How to use group by in SQL Server query?

I have a problem with GROUP BY in SQL Server.
I have this simple SQL statement:
select *
from Factors
group by moshtari_ID
and I get this error:
Msg 8120, Level 16, State 1, Line 1
Column 'Factors.ID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Where is my problem?
In general, once you start GROUPing, every column listed in your SELECT must be either a column in your GROUP or some aggregate thereof. Let's say you have a table like this:
| ID | Name    | City        |
|----|---------|-------------|
| 1  | Foo bar | San Jose    |
| 2  | Bar foo | San Jose    |
| 3  | Baz Foo | Santa Clara |
If you wanted to get a list of all the cities in your database, and tried:
SELECT * FROM table GROUP BY City
...that would fail, because you're asking for columns (ID and Name) that aren't in the GROUP BY clause. You could instead:
SELECT City, count(City) as Cnt FROM table GROUP BY City
...and that would get you:
| City        | Cnt |
|-------------|-----|
| San Jose    | 2   |
| Santa Clara | 1   |
...but would NOT get you ID or Name. You can do more complicated things with e.g. subselects or self-joins, but basically what you're trying to do isn't possible as-stated. Break down your problem further (what do you want the data to look like?), and go from there.
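For instance, here is a sketch of one such subselect, pulling the full row of the lowest ID in each city from the hypothetical table above:
SELECT t.ID, t.Name, t.City
FROM [table] t
WHERE t.ID = (SELECT MIN(t2.ID) FROM [table] t2 WHERE t2.City = t.City);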
Good luck!
When you group, you can select only the columns you group by. Other columns need to be aggregated, which can be done with functions like min(), avg(), count(), and so on.
Why is this? Because GROUP BY collapses multiple records into one. But what about the columns that aren't unique within a group? The DB needs a rule for how to display them: aggregation.
You need to apply an aggregate function such as max(), avg(), or count() with GROUP BY.
For example, this query will sum totalAmount for each moshtari_ID:
select moshtari_ID,sum (totalAmount) from Factors group by moshtari_ID;
The output will be:
moshtari_ID   SUM
2             120000
1             200000
Try it:
select *
from Factors
group by ID, date, time, factorNo, trackingNo, totalAmount, createAt, updateAt, bark_ID, moshtari_ID
If you are applying a GROUP BY clause, then you can only use grouped columns and aggregate functions in the select list.
Syntax:
SELECT expression1, expression2, ... expression_n,
aggregate_function (aggregate_expression)
FROM tables
[WHERE conditions]
GROUP BY expression1, expression2, ... expression_n
[ORDER BY expression [ ASC | DESC ]];

Set-based approach to updating multiple tables, rather than a WHILE loop?

Apparently I'm far too used to procedural programming, and I don't know how to handle this with a set-based approach.
I have several temporary tables in SQL Server, each with thousands of records. Some of them have tens of thousands of records each, but they're all part of one record set. I'm basically loading a bunch of XML data that looks like this:
<root>
<entry>
<id-number>12345678</id-number>
<col1>blah</col1>
<col2>heh</col2>
<more-information>
<col1>werr</col1>
<col2>pop</col2>
<col3>test</col3>
</more-information>
<even-more-information>
<col1>czxn</col1>
<col2>asd</col2>
<col3>yyuy</col3>
<col4>moat</col4>
</even-more-information>
<even-more-information>
<col1>uioi</col1>
<col2>qwe</col2>
<col3>rtyu</col3>
<col4>poiu</col4>
</even-more-information>
</entry>
<entry>
<id-number>12345679</id-number>
<col1>bleh</col1>
<col2>sup</col2>
<more-information>
<col1>rrew</col1>
<col2>top</col2>
<col3>nest</col3>
</more-information>
<more-information>
<col1>234k</col1>
<col2>fftw</col2>
<col3>west</col3>
</more-information>
<even-more-information>
<col1>asdj</col1>
<col2>dsa</col2>
<col3>mnbb</col3>
<col4>boat</col4>
</even-more-information>
</entry>
</root>
Here's a brief display of what the temporary tables look like:
Temporary Table 1 (entry):
+------------+--------+--------+
| UniqueID | col1 | col2 |
+------------+--------+--------+
| 732013 | blah | heh |
| 732014 | bleh | sup |
+------------+--------+--------+
Temporary Table 2 (more-information):
+------------+--------+--------+--------+
| UniqueID | col1 | col2 | col3 |
+------------+--------+--------+--------+
| 732013 | werr | pop | test |
| 732014 | rrew | top | nest |
| 732014 | 234k | ffw | west |
+------------+--------+--------+--------+
Temporary Table 3 (even-more-information):
+------------+--------+--------+--------+--------+
| UniqueID | col1 | col2 | col3 | col4 |
+------------+--------+--------+--------+--------+
| 732013 | czxn | asd | yyuy | moat |
| 732013 | uioi | qwe | rtyu | poiu |
| 732014 | asdj | dsa | mnbb | boat |
+------------+--------+--------+--------+--------+
I am loading this data from an XML file, and have found that this is the only way I can tell which information belongs to which record, so every single temporary table has the following inserted at the top:
T.value('../../id-number[1]', 'VARCHAR(8)') UniqueID,
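For context, a hedged sketch of how one of those temp tables might be populated (the temp table name and column types are assumptions, and the number of '../' steps depends on which node the .nodes() call targets):
DECLARE @xml XML = N'<root>...</root>'; -- the document shown above

INSERT INTO #MoreInformation (UniqueID, col1, col2, col3)
SELECT T.value('../id-number[1]', 'VARCHAR(8)'),
       T.value('col1[1]', 'VARCHAR(50)'),
       T.value('col2[1]', 'VARCHAR(50)'),
       T.value('col3[1]', 'VARCHAR(50)')
FROM @xml.nodes('/root/entry/more-information') AS X(T);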
As you can see, each temporary table has a UniqueID assigned to its particular record to indicate that it belongs to the main record. I have a large set of items in the database, and I want to update every single column in each non-temporary table using a set-based approach, but it must be restricted by UniqueID.
In tables other than the first one, there is a Foreign_ID based on the PrimaryKey_ID of the main table, and the UniqueID will not be inserted... it's just to help tell what goes where.
Here's the exact logic that I'm trying to figure out:
If id-number currently exists in the main table, update tables based on the PrimaryKey_ID number of the main table, which is the same exact number in every table's Foreign_ID. The foreign-key'd tables will have a totally different number than the id-number -- they are not the same.
If id-number does not exist, insert the record. I have done this part.
However, I'm currently stuck in the mind-set that I have to set temporary variables, such as #IDNumber and #ForeignID, and then loop through it. Not only am I getting multiple results instead of the correct result, but everyone says WHILE shouldn't be used, especially for such a large volume of data.
How do I update these tables using a set-based approach?
Assuming you already have this XML extracted, you could do something similar to:
UPDATE ent
SET ent.col1 = tmp1.col1,
ent.col2 = tmp1.col2
FROM dbo.[Entry] ent
INNER JOIN #TempEntry tmp1
ON tmp1.UniqueID = ent.UniqueID;
UPDATE mi
SET mi.col1 = tmp2.col1,
mi.col2 = tmp2.col2,
mi.col3 = tmp2.col3
FROM dbo.[MoreInformation] mi
INNER JOIN dbo.[Entry] ent -- mapping of Foreign_ID ->UniqueID
ON ent.PrimaryKey_ID = mi.Foreign_ID
INNER JOIN #TempMoreInfo tmp2
ON tmp2.UniqueID = ent.UniqueID
AND tmp2.SomeOtherField = mi.SomeOtherField; -- need 1 more field
UPDATE emi
SET emi.col1 = tmp3.col1,
    emi.col2 = tmp3.col2,
    emi.col3 = tmp3.col3,
    emi.col4 = tmp3.col4
FROM dbo.[EvenMoreInformation] emi
INNER JOIN dbo.[Entry] ent -- mapping of Foreign_ID -> UniqueID
    ON ent.PrimaryKey_ID = emi.Foreign_ID
INNER JOIN #TempEvenMoreInfo tmp3
    ON tmp3.UniqueID = ent.UniqueID
    AND tmp3.SomeOtherField = emi.SomeOtherField; -- need 1 more field
Now, I should point out that if the goal is truly to
update every single column in each non-temporary table
then there is a conceptual issue for any sub-tables that have multiple records. If there is no record in that table that will remain the same outside of the Foreign_ID field (and I guess the PK of that table?), then how do you know which row is which for the update? Sure, you can find the correct Foreign_ID based on the UniqueID mapping already in the non-temporary Entry table, but there needs to be at least one field that is not an IDENTITY (or UNIQUEIDENTIFIER populated via NEWID or NEWSEQUENTIALID) that will be used to find the exact row.
If it is not possible to find a stable, matching field, then you have no choice but to do a wipe-and-replace method instead.
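A hedged sketch of that wipe-and-replace approach for one of the sub-tables, using the same assumed names as above:
BEGIN TRANSACTION;

-- Wipe the child rows belonging to the entries being reloaded
DELETE mi
FROM dbo.[MoreInformation] mi
INNER JOIN dbo.[Entry] ent
    ON ent.PrimaryKey_ID = mi.Foreign_ID
INNER JOIN (SELECT DISTINCT UniqueID FROM #TempMoreInfo) tmp
    ON tmp.UniqueID = ent.UniqueID;

-- Replace them from the temp table
INSERT INTO dbo.[MoreInformation] (Foreign_ID, col1, col2, col3)
SELECT ent.PrimaryKey_ID, tmp2.col1, tmp2.col2, tmp2.col3
FROM #TempMoreInfo tmp2
INNER JOIN dbo.[Entry] ent
    ON ent.UniqueID = tmp2.UniqueID;

COMMIT TRANSACTION;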
P.S. I used to recommend the MERGE command but have since stopped due to learning of all of the bugs and issues with it. The "nicer" syntax is just not worth the potential problems. For more info, please see Use Caution with SQL Server's MERGE Statement.
You can use MERGE, which does an upsert (update and insert) in a single statement.
First, merge the entries into the main table.
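A hedged sketch of that first step (table names assumed from the question, with the hyphenated column name bracketed):
MERGE mainTable AS Dest
USING #tempEntry AS Source
    ON Dest.[id-number] = Source.[id-number]
WHEN MATCHED THEN
    UPDATE SET Dest.col1 = Source.col1,
               Dest.col2 = Source.col2
WHEN NOT MATCHED THEN
    INSERT ([id-number], col1, col2)
    VALUES (Source.[id-number], Source.col1, Source.col2);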
For the other tables, you can join with the main table to get the foreign ID mapping:
MERGE Table2 AS Dest
USING ( SELECT t2.*, m.PrimaryKey_ID AS foreign_ID
        FROM #tempTable2 t2
        JOIN mainTable m
            ON t2.[id-number] = m.[id-number]
      ) AS Source
ON Dest.Foreign_ID = Source.foreign_ID
WHEN MATCHED THEN
    UPDATE SET Dest.col1 = Source.col1
WHEN NOT MATCHED THEN
    INSERT (Foreign_ID, col1, col2, ...)
    VALUES (Source.foreign_ID, Source.col1, Source.col2, ...);
