SQL Server fast random sampling

SQL Server fast random sampling - sql-server

I have been looking for methods for fast sampling from sql table and I found this very useful article:
https://www.sisense.com/blog/how-to-sample-rows-in-sql-273x-faster/
The query looks like:
select *
from users
where id in (select round(random() * 21e6)::integer as id
from generate_series(1, 110)
group by id -- Discard duplicates
)
limit 100
However for SQL Server, I got errors like:
'random' is not a recognized built-in function name.
Do we have some SQL Server counter part of this fast sampling method? Thanks.
I tried edit as:
select top 100 * from users
where id in (
select round(RAND() * 21e6) CAST integer as id
from generate_series(1, 110)
group by id -- Discard duplicates
)
still having error:
The round function requires 2 to 3 arguments.

For T-SQL it is RAND(), not RANDOM()
Also, apart from the comment regarding having to use TOP(n) instead of limit - you also have to use CAST or CONVERT for the data type conversion. And generate_series is not a built-in function in sql server.

From Sean's comment: TABLESAMPLE . Would be the solution, now it is extremely fast to sample.

Related

Using Sum(Cast(Replace syntax

I am trying to write a simple query in order to change some of our stage data. I have a varchar $ column (unfortunately) that needs to be summed. My issue is that because of commas, I cannot change the datatype.
So, I can use REPLACE(AMT,',','') to remove the commas but It still wont let me cast it as a decimal and I get
Error converting data type varchar to numeric.
I am trying the following below with no luck. Any ideas? Can this be done or am I using the wrong syntax here?
Select SUM(Cast(REPLACE(Amt,',','') as Decimal (18,2)) )

I was able to resolve this with #HABO suggestion. I used Cast(Ltrim(rtrim(table.AMT)) as Money) for all instances of the varchar amount. This removed white space and removed the commas from the numbers.

This should work... including an example.
Edit: if you are on SQL Server 2012+, you may be able to shorten your task by using Try_Convert
DECLARE #SomeTable AS TABLE (Amt Varchar(100));
INSERT INTO #Sometable (Amt) VALUES ('abc123,456.01'),(' 123,456.78 '),(Null),('asdad'),('');
With NumericsOnly AS
(
SELECT
REPLACE(Left(SubString(Amt, PatIndex('%[0-9.-,]%', Amt), 8000), PatIndex('%[^0-9.,-]%', SubString(Amt, PatIndex('%[0-9.,-]%', Amt), 8000) + 'X')-1),',','') AS CleanAmt
FROM
#SomeTable
)
SELECT
SUM(CONVERT(DECIMAL(18,2), CleanAmt)) AS TotalAmt
FROM
NumericsOnly
WHERE
IsNumeric(CleanAmt)=1
General methodology is taken from here

I wouldn't use money as a data type as it is notorious for rounding error.
The error is due to SQL order of operations within your SUM(CAST(REPLACE... operation. This issue can be resolved by summing the column AFTER it's been staged to be summed via a subquery:
SELECT SUM(Field),...
FROM ( SELECT
Cast(REPLACE(Amt,',','') as NUMERIC) as 'Field'
,...
) [Q]
If the table you're summing is administered by a BI Team, get them to stage the data there. Happy Data Happy life.

Calling oracle package from SSRS

I want to send the date range and employee ids from SSRS to Oracle package. Does anyone knows how to do it? Even if i try to declare an oracle package with three in parameters from_date, to_date and employee_id it works fine with one employee id. But fails if i select multiple employees in SSRS web interface saying wrong number of parameters. Does anyone know how to handle this?

Normally with SQL and SSRS you would do something like
Select * From Table where Table.Field in #Param
Oracle is different of course and depends on the Connection Type you are using.
ODBC connection you would use something like:
Select * From Table where Table.Field = ?
Order of the parameters is important here so be careful.
OLEDB connection you would use something like:
Select * From Table where Table.Field = :Param
However, none of the above work with multi selecting of the Parameters. You could do this:
=”Select * From Table where Table.Field in (‘” + Join(Parameters!parameter.Value,”‘, ‘”) + “‘)”
or
=”Select * From Table where Table.Field in(” + Join(Parameters!parameter.Value,”, “) + “)” if the values or your field is only numeric.
Here are a couple of good articles that you can look at. It has a better explanation than I can give personally without more information; Details of your Oracle Environment, Connection Type, etc.
Quick Reference:
Be sure your Oracle version is 9 or greater, none of this works on
versions 8 or older.
When using parameters with Oracle, use a : instead of an # sign –
:param instead of #param.
Add an ‘ALL’ option to your datasets that supply values for
multivalued drop down parameters.
Check for the ALL in your where clause by using “where ( field1 in (
:prmField1 ) or ‘ALL’ in ( :prmField1 ) )” syntax.
You can execute your query from the dataset window, but can only
supply 1 value. However that value can be ‘ALL’.
Educate your users on ‘ALL’ versus ‘(select all)’ .
Another good article about Multiple Parameters in SSRS with Oracle: Davos Collective Which is referenced in the top portion.
Hope this helps!

String column Search/Replace GUIDs

i have a SQL Profiler trace saved to a table in SQL Server.
i want to perform sum/avg/count analysis of CPU/Reads/Duration on the queries in the trace. But most of the profiler data records calls to stored procedures with uniqueidentifer parameter(s):
EXECUTE GetTransactionCounts #BankGUID = '{231281D7-F6C2-4EAE-98AE-E9196D8016F0}', #SessionGUID='{7F34361F-CEEA-4CEA-8CBD-2704FFE92DEF}'
SELECT SUM(Total) AS Total FROM fn_BalancingAdditionsUS('{C08961DB-0B6A-4E67-A82B-5BBBA0A84A74}')
EXEC CreateCloser '{7F34361F-CEEA-4CEA-8CBD-2704FFE92DEF}', NULL , '{08E74DBB-3BC4-49A7-AA10-95AA6BD24784}'
EXECUTE GetMachineImpressmentForSession #SessionGUID = '{446881BA-1439-4AD8-B33B-C784120EFBA2}'
SELECT SUM(Total) AS Total FROM fn_BalancingAdditionsCanadian('{446881BA-1439-4AD8-B33B-C784120EFBA2}')
SELECT SUM(Total) AS Total FROM fn_BalancingSubtractionsUS('{446881BA-1439-4AD8-B33B-C784120EFBA2}')
So when i try to aggregate the profiler trace data to find the worst performing queries:
SELECT
Description,
COUNT(*) AS EventCount,
AVG(CPU) AS CPU, SUM(CPU) AS CpuTotal,
AVG(Reads) AS Reads, SUM(Reads) AS ReadsTotal,
AVG(Duration) AS Duration, SUM(Duration) AS DurationTotal
FROM SlowQueriesTrace
GROUP BY Description
then no aggregation occurs, because every GUID is unique. What i need is some way to replace the uniqueidentifier parameters with a generic %g marker:
EXECUTE GetTransactionCounts #BankGUID = %g, #SessionGUID=%g
SELECT SUM(Total) AS Total FROM fn_BalancingAdditionsUS(%g)
EXEC CreateCloser %g, NULL , %g
EXECUTE GetMachineImpressmentForSession #SessionGUID = %g
SELECT SUM(Total) AS Total FROM fn_BalancingAdditionsCanadian(%g)
SELECT SUM(Total) AS Total FROM fn_BalancingSubtractionsUS(%g)
Then my aggregation will work.
Aside from exporting the table to Excel and hand editing all 10,270 events, can anything think of any way to perform GUID search & replace pattern matching inside SQL Server?
Other hacks i tried:
Trim description to first 40 characters (i.e. CAST(description AS varchar(40))):
EXECUTE GetTransactionCounts #BankGUID =
SELECT SUM(Total) AS Total FROM fn_Balan
EXEC CreateCloser '{7F34361F-CEEA-4CEA-8
EXECUTE GetMachineImpressmentForSession
SELECT SUM(Total) AS Total FROM fn_Balan
SELECT SUM(Total) AS Total FROM fn_Balan
Except that merges items that shouldn't be merged, and other items that should be merged are not.
Use SoundEx:
E223
S423
E220
E223
S423
Except that you can see lines that are completely different are given the same soundex. Also i am unable to determine what query S338 corresponds to.
The hack i ended up using was to create a new Category column, initally null. i then spent two hours with carefully selected LIKE clauses to pick out a particular query and then "tag" them all with the query. e.g.:
UPDATE QueryTrace
SET Category = 'EXECUTE GetTransactionCounts #BankGUID ='
WHERE Description LIKE 'EXECUTE GetTransactionCounts #BankGUID =%'
and
UPDATE QueryTrace
SET Category = 'SELECT SUM(Total) AS Total FROM fn_BalancingAdditionsCanadian'
WHERE Description LIKE '%FROM fn_BalancingAdditionsCanadian%'
That doesn't mean i don't need a solution using this question.

Have you tried using ClearTrace which performs certain query parameterisations/normalisations?
Another option is to use a CLR function: Determining Poorly Performing Queries for Tuning from SQL Server Workload Trace Files
Whenever you gather workload traces to
identify poorly performing queries,
you need to import this data into a
database table, and to "normalise" and
aggregate this information to identify
the worst offenders. This can be done
in a variety of ways. One way is to
define a regular expression such as
this SQL CLR method based on work done
by Itzik Ben-Gan and modified by Adam
Machanic:
[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true)]
public static SqlString sqlsig(SqlString querystring)
{
return (SqlString)Regex.Replace(
querystring.Value,
#"([\s,(=<>!](?![^\]]+[\]]))(?:(?:(?:(?:(?# expression coming
)(?:([N])?(')(?:[^']'')*('))(?# character
)(?:0x[\da-fA-F]*)(?# binary
)(?:[-+]?(?:(?:[\d]*\.[\d]*[\d]+)(?# precise number
)(?:[eE]?[\d]*)))(?# imprecise number
)(?:[~]?[-+]?(?:[\d]+))(?# integer
)(?:[nN][uU][lL][lL])(?# null
))(?:[\s]?[\+\-\*\/\%\&\\^][\s]?)?)+(?# operators
)))",
#"$1$2$3#$4");
}
Edit by OP: i had not heard of ClearTrace. i tried it:
Edit: Did you use the right trace template to gather the trace?

What is a good way to paginate out of SQL 2000 using Start and Length parameters?

I have been given the task of refactoring an existing stored procedure so that the results are paginated. The SQL server is SQL 2000 so I can't use the ROW_NUMBER method of pagination. The Stored proc is already fairly complex, building chunks of a large sql statement together before doing an sp_executesql and has various sorting options available.
The first result out of google seems like a good method but I think the example is wrong in that the 2nd sort needs to be reversed and the case when the start is less than the pagelength breaks down. The 2nd example on that page also seems like a good method but the SP is taking a pageNumber rather than the start record. And the whole temp table thing seems like it would be a performance drain.
I am making progress going down this path but it seems slow and confusing and I am having to do quite a bit of REPLACE methods on the Sort order to get it to come out right.
Are there any other easier techniques I am missing?

There are two SQL Server 2000 compliant answers in this StackOverflow question - skip the accepted one, which is 2005-only:

No, I'm afraid not - SQL Server 2000 doesn't have any of the 2005 niceties like Common Table Expression (CTE) and such..... the method described in the Google link seems to be one way to go.
Marc

Also take a look here
http://databases.aspfaq.com/database/how-do-i-page-through-a-recordset.html
scroll down to Stored Procedure Methods

Depending on your application architecture (and your amount of data, it's structure, DB server load etc.) you could use the DB access layer for paging.
For example, with ADO you can define a page size on the record set (DataSet in ADO.NET) object and do the paging on the client. Classic ADO even lets you use a server side cursor, though I don't know if that scales well (I think this was removed altogether in ADO.NET).
MSDN documentation: Paging Through a Query Result (ADO.NET)

After playing with this for a while there seems to be only one way of really doing this (using Start and Length parameters) and that's with the temp table.
My final solution was to not use the #start parameter and instead use a #page parameter and then use the
SET #sql = #sql + N'
SELECT * FROM
(
SELECT TOP ' + Cast( #length as varchar) + N' * FROM
(
SELECT TOP ' + Cast( #page*#length as varchar) + N'
field1,
field2
From Table1
order by field1 ASC
) as Result
Order by Field1 DESC
) as Result
Order by Field 1 ASC'
The original query was much more complex than what is shown here and the order by was ordered on at least 3 fields and determined by a long CASE clause, requiring me to use a series of REPLACE functions to get the fields in the right order.

We've been using variations on this query for a number of years. This example gives items 50,000 to 50,300.
select top 300
Items.*
from Items
where
Items.CustomerId = 1234 AND
Items.Active = 1 AND
Items.Id not in
(
select top 50000 Items.Id
from Items
where
Items.CustomerId = 1234 AND
Items.Active = 1
order by Items.id
)
order by Items.Id

Hidden Features of SQL Server

Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
What are some hidden features of SQL Server?
For example, undocumented system stored procedures, tricks to do things which are very useful but not documented enough?
Answers
Thanks to everybody for all the great answers!
Stored Procedures
sp_msforeachtable: Runs a command with '?' replaced with each table name (v6.5 and up)
sp_msforeachdb: Runs a command with '?' replaced with each database name (v7 and up)
sp_who2: just like sp_who, but with a lot more info for troubleshooting blocks (v7 and up)
sp_helptext: If you want the code of a stored procedure, view & UDF
sp_tables: return a list of all tables and views of database in scope.
sp_stored_procedures: return a list of all stored procedures
xp_sscanf: Reads data from the string into the argument locations specified by each format argument.
xp_fixeddrives:: Find the fixed drive with largest free space
sp_help: If you want to know the table structure, indexes and constraints of a table. Also views and UDFs. Shortcut is Alt+F1
Snippets
Returning rows in random order
All database User Objects by Last Modified Date
Return Date Only
Find records which date falls somewhere inside the current week.
Find records which date occurred last week.
Returns the date for the beginning of the current week.
Returns the date for the beginning of last week.
See the text of a procedure that has been deployed to a server
Drop all connections to the database
Table Checksum
Row Checksum
Drop all the procedures in a database
Re-map the login Ids correctly after restore
Call Stored Procedures from an INSERT statement
Find Procedures By Keyword
Drop all the procedures in a database
Query the transaction log for a database programmatically.
Functions
HashBytes()
EncryptByKey
PIVOT command
Misc
Connection String extras
TableDiff.exe
Triggers for Logon Events (New in Service Pack 2)
Boosting performance with persisted-computed-columns (pcc).
DEFAULT_SCHEMA setting in sys.database_principles
Forced Parameterization
Vardecimal Storage Format
Figuring out the most popular queries in seconds
Scalable Shared Databases
Table/Stored Procedure Filter feature in SQL Management Studio
Trace flags
Number after a GO repeats the batch
Security using schemas
Encryption using built in encryption functions, views and base tables with triggers

In Management Studio, you can put a number after a GO end-of-batch marker to cause the batch to be repeated that number of times:
PRINT 'X'
GO 10
Will print 'X' 10 times. This can save you from tedious copy/pasting when doing repetitive stuff.

A lot of SQL Server developers still don't seem to know about the OUTPUT clause (SQL Server 2005 and newer) on the DELETE, INSERT and UPDATE statement.
It can be extremely useful to know which rows have been INSERTed, UPDATEd, or DELETEd, and the OUTPUT clause allows to do this very easily - it allows access to the "virtual" tables called inserted and deleted (like in triggers):
DELETE FROM (table)
OUTPUT deleted.ID, deleted.Description
WHERE (condition)
If you're inserting values into a table which has an INT IDENTITY primary key field, with the OUTPUT clause, you can get the inserted new ID right away:
INSERT INTO MyTable(Field1, Field2)
OUTPUT inserted.ID
VALUES (Value1, Value2)
And if you're updating, it can be extremely useful to know what changed - in this case, inserted represents the new values (after the UPDATE), while deleted refers to the old values before the UPDATE:
UPDATE (table)
SET field1 = value1, field2 = value2
OUTPUT inserted.ID, deleted.field1, inserted.field1
WHERE (condition)
If a lot of info will be returned, the output of OUTPUT can also be redirected to a temporary table or a table variable (OUTPUT INTO #myInfoTable).
Extremely useful - and very little known!
Marc

sp_msforeachtable: Runs a command with '?' replaced with each table name.
e.g.
exec sp_msforeachtable "dbcc dbreindex('?')"
You can issue up to 3 commands for each table
exec sp_msforeachtable
#Command1 = 'print ''reindexing table ?''',
#Command2 = 'dbcc dbreindex(''?'')',
#Command3 = 'select count (*) [?] from ?'
Also, sp_MSforeachdb

Connection String extras:
MultipleActiveResultSets=true;
This makes ADO.Net 2.0 and above read multiple, forward-only, read-only results sets on a single database connection, which can improve performance if you're doing a lot of reading. You can turn it on even if you're doing a mix of query types.
Application Name=MyProgramName
Now when you want to see a list of active connections by querying the sysprocesses table, your program's name will appear in the program_name column instead of ".Net SqlClient Data Provider"

TableDiff.exe
Table Difference tool allows you to discover and reconcile differences between a source and destination table or a view. Tablediff Utility can report differences on schema and data. The most popular feature of tablediff is the fact that it can generate a script that you can run on the destination that will reconcile differences between the tables.
Link

A less known TSQL technique for returning rows in random order:
-- Return rows in a random order
SELECT
SomeColumn
FROM
SomeTable
ORDER BY
CHECKSUM(NEWID())

In Management Studio, you can quickly get a comma-delimited list of columns for a table by :
In the Object Explorer, expand the nodes under a given table (so you will see folders for Columns, Keys, Constraints, Triggers etc.)
Point to the Columns folder and drag into a query.
This is handy when you don't want to use heinous format returned by right-clicking on the table and choosing Script Table As..., then Insert To... This trick does work with the other folders in that it will give you a comma-delimited list of names contained within the folder.

Row Constructors
You can insert multiple rows of data with a single insert statement.
INSERT INTO Colors (id, Color)
VALUES (1, 'Red'),
(2, 'Blue'),
(3, 'Green'),
(4, 'Yellow')

If you want to know the table structure, indexes and constraints:
sp_help 'TableName'

HashBytes() to return the MD2, MD4, MD5, SHA, or SHA1 hash of its input.

Figuring out the most popular queries
With sys.dm_exec_query_stats, you can figure out many combinations of query analyses by a single query.
Link
with the commnad
select * from sys.dm_exec_query_stats
order by execution_count desc

The spatial results tab can be used to create art.
enter link description here http://michaeljswart.com/wp-content/uploads/2010/02/venus.png

EXCEPT and INTERSECT
Instead of writing elaborate joins and subqueries, these two keywords are a much more elegant shorthand and readable way of expressing your query's intent when comparing two query results. New as of SQL Server 2005, they strongly complement UNION which has already existed in the TSQL language for years.
The concepts of EXCEPT, INTERSECT, and UNION are fundamental in set theory which serves as the basis and foundation of relational modeling used by all modern RDBMS. Now, Venn diagram type results can be more intuitively and quite easily generated using TSQL.

I know it's not exactly hidden, but not too many people know about the PIVOT command. I was able to change a stored procedure that used cursors and took 2 minutes to run into a speedy 6 second piece of code that was one tenth the number of lines!

useful when restoring a database for Testing purposes or whatever. Re-maps the login ID's correctly:
EXEC sp_change_users_login 'Auto_Fix', 'Mary', NULL, 'B3r12-36'

Drop all connections to the database:
Use Master
Go
Declare #dbname sysname
Set #dbname = 'name of database you want to drop connections from'
Declare #spid int
Select #spid = min(spid) from master.dbo.sysprocesses
where dbid = db_id(#dbname)
While #spid Is Not Null
Begin
Execute ('Kill ' + #spid)
Select #spid = min(spid) from master.dbo.sysprocesses
where dbid = db_id(#dbname) and spid > #spid
End

Table Checksum
Select CheckSum_Agg(Binary_CheckSum(*)) From Table With (NOLOCK)
Row Checksum
Select CheckSum_Agg(Binary_CheckSum(*)) From Table With (NOLOCK) Where Column = Value

I'm not sure if this is a hidden feature or not, but I stumbled upon this, and have found it to be useful on many occassions. You can concatonate a set of a field in a single select statement, rather than using a cursor and looping through the select statement.
Example:
DECLARE #nvcConcatonated nvarchar(max)
SET #nvcConcatonated = ''
SELECT #nvcConcatonated = #nvcConcatonated + C.CompanyName + ', '
FROM tblCompany C
WHERE C.CompanyID IN (1,2,3)
SELECT #nvcConcatonated
Results:
Acme, Microsoft, Apple,

If you want the code of a stored procedure you can:
sp_helptext 'ProcedureName'
(not sure if it is hidden feature, but I use it all the time)

A stored procedure trick is that you can call them from an INSERT statement. I found this very useful when I was working on an SQL Server database.
CREATE TABLE #toto (v1 int, v2 int, v3 char(4), status char(6))
INSERT #toto (v1, v2, v3, status) EXEC dbo.sp_fulubulu(sp_param1)
SELECT * FROM #toto
DROP TABLE #toto

In SQL Server 2005/2008 to show row numbers in a SELECT query result:
SELECT ( ROW_NUMBER() OVER (ORDER BY OrderId) ) AS RowNumber,
GrandTotal, CustomerId, PurchaseDate
FROM Orders
ORDER BY is a compulsory clause. The OVER() clause tells the SQL Engine to sort data on the specified column (in this case OrderId) and assign numbers as per the sort results.

Useful for parsing stored procedure arguments: xp_sscanf
Reads data from the string into the argument locations specified by each format argument.
The following example uses xp_sscanf
to extract two values from a source
string based on their positions in the
format of the source string.
DECLARE #filename varchar (20), #message varchar (20)
EXEC xp_sscanf 'sync -b -fproducts10.tmp -rrandom', 'sync -b -f%s -r%s',
#filename OUTPUT, #message OUTPUT
SELECT #filename, #message
Here is the result set.
-------------------- --------------------
products10.tmp random

Return Date Only
Select Cast(Floor(Cast(Getdate() As Float))As Datetime)
or
Select DateAdd(Day, 0, DateDiff(Day, 0, Getdate()))

dm_db_index_usage_stats
This allows you to know if data in a table has been updated recently even if you don't have a DateUpdated column on the table.
SELECT OBJECT_NAME(OBJECT_ID) AS DatabaseName, last_user_update,*
FROM sys.dm_db_index_usage_stats
WHERE database_id = DB_ID( 'MyDatabase')
AND OBJECT_ID=OBJECT_ID('MyTable')
Code from: http://blog.sqlauthority.com/2009/05/09/sql-server-find-last-date-time-updated-for-any-table/
Information referenced from:
SQL Server - What is the date/time of the last inserted row of a table?
Available in SQL 2005 and later

Here are some features I find useful but a lot of people don't seem to know about:
sp_tables
Returns a list of objects that can be
queried in the current environment.
This means any object that can appear
in a FROM clause, except synonym
objects.
Link
sp_stored_procedures
Returns a list of stored procedures in
the current environment.
Link

Find records which date falls somewhere inside the current week.
where dateadd( week, datediff( week, 0, TransDate ), 0 ) =
dateadd( week, datediff( week, 0, getdate() ), 0 )
Find records which date occurred last week.
where dateadd( week, datediff( week, 0, TransDate ), 0 ) =
dateadd( week, datediff( week, 0, getdate() ) - 1, 0 )
Returns the date for the beginning of the current week.
select dateadd( week, datediff( week, 0, getdate() ), 0 )
Returns the date for the beginning of last week.
select dateadd( week, datediff( week, 0, getdate() ) - 1, 0 )

Not so much a hidden feature but setting up key mappings in Management Studio under Tools\Options\Keyboard:
Alt+F1 is defaulted to sp_help "selected text" but I cannot live without the adding Ctrl+F1 for sp_helptext "selected text"

Persisted-computed-columns
Computed columns can help you shift the runtime computation cost to data modification phase. The computed column is stored with the rest of the row and is transparently utilized when the expression on the computed columns and the query matches. You can also build indexes on the PCC’s to speed up filtrations and range scans on the expression.
Link

There are times when there's no suitable column to sort by, or you just want the default sort order on a table and you want to enumerate each row. In order to do that you can put "(select 1)" in the "order by" clause and you'd get what you want. Neat, eh?
select row_number() over (order by (select 1)), * from dbo.Table as t

Simple encryption with EncryptByKey

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight