SQL Server GUID sort algorithm. Why? - sql-server

Problem with UniqueIdentifiers
We have an existing database which uses uniqueidentifiers extensively (unfortunately!), both as primary keys and in some nullable columns of some tables. We came across a situation where some reports that run on these tables sort on these uniqueidentifiers because there is no other column in the table that would give a meaningful sort (isn't that ironic!). The intent was to show the items in the order they were inserted, but they were not inserted using NewSequentialId() - hence the sort is a waste of time.
Fact about the Sort Algorithm
Anyway, SQL Server sorts uniqueidentifiers by byte group, giving the highest precedence to the ending 5th byte group (6 bytes) and moving towards the 1st byte group (4 bytes), with the byte order within a group reversing at the 3rd byte group (2 bytes) from right-to-left to left-to-right.
My Question
I was curious to know if there is any real-life situation where this kind of sort helps at all.
How does SQL Server store the uniqueidentifier internally? That might provide insight into why it has this whacky sort algorithm.
Reference:
Alberto Ferrari's discovery of the SQL Server GUID sort
Example
Uniqueidentifiers are sorted as shown below when you use an ORDER BY on a uniqueidentifier column containing the data below.
Please note that the data below is sorted in ascending order, and the highest sort precedence runs from the 5th byte group towards the 1st byte group (backwards).
-- 1st byte group of 4 bytes sorted in the reverse (left-to-right) order below --
01000000-0000-0000-0000-000000000000
10000000-0000-0000-0000-000000000000
00010000-0000-0000-0000-000000000000
00100000-0000-0000-0000-000000000000
00000100-0000-0000-0000-000000000000
00001000-0000-0000-0000-000000000000
00000001-0000-0000-0000-000000000000
00000010-0000-0000-0000-000000000000
-- 2nd byte group of 2 bytes sorted in the reverse (left-to-right) order below --
00000000-0100-0000-0000-000000000000
00000000-1000-0000-0000-000000000000
00000000-0001-0000-0000-000000000000
00000000-0010-0000-0000-000000000000
-- 3rd byte group of 2 bytes sorted in the reverse (left-to-right) order below --
00000000-0000-0100-0000-000000000000
00000000-0000-1000-0000-000000000000
00000000-0000-0001-0000-000000000000
00000000-0000-0010-0000-000000000000
-- 4th byte group of 2 bytes sorted in the straight (right-to-left) order below --
00000000-0000-0000-0001-000000000000
00000000-0000-0000-0010-000000000000
00000000-0000-0000-0100-000000000000
00000000-0000-0000-1000-000000000000
-- 5th byte group of 6 bytes sorted in the straight (right-to-left) order below --
00000000-0000-0000-0000-000000000001
00000000-0000-0000-0000-000000000010
00000000-0000-0000-0000-000000000100
00000000-0000-0000-0000-000000001000
00000000-0000-0000-0000-000000010000
00000000-0000-0000-0000-000000100000
00000000-0000-0000-0000-000001000000
00000000-0000-0000-0000-000010000000
00000000-0000-0000-0000-000100000000
00000000-0000-0000-0000-001000000000
00000000-0000-0000-0000-010000000000
00000000-0000-0000-0000-100000000000
Code:
Alberto's code, extended to demonstrate that sorting is on the bytes and not on the individual bits.
With Test_UIDs As (-- 0 1 2 3 4 5 6 7 8 9 A B C D E F
Select ID = 1, UID = cast ('00000000-0000-0000-0000-100000000000' as uniqueidentifier)
Union Select ID = 2, UID = cast ('00000000-0000-0000-0000-010000000000' as uniqueidentifier)
Union Select ID = 3, UID = cast ('00000000-0000-0000-0000-001000000000' as uniqueidentifier)
Union Select ID = 4, UID = cast ('00000000-0000-0000-0000-000100000000' as uniqueidentifier)
Union Select ID = 5, UID = cast ('00000000-0000-0000-0000-000010000000' as uniqueidentifier)
Union Select ID = 6, UID = cast ('00000000-0000-0000-0000-000001000000' as uniqueidentifier)
Union Select ID = 7, UID = cast ('00000000-0000-0000-0000-000000100000' as uniqueidentifier)
Union Select ID = 8, UID = cast ('00000000-0000-0000-0000-000000010000' as uniqueidentifier)
Union Select ID = 9, UID = cast ('00000000-0000-0000-0000-000000001000' as uniqueidentifier)
Union Select ID = 10, UID = cast ('00000000-0000-0000-0000-000000000100' as uniqueidentifier)
Union Select ID = 11, UID = cast ('00000000-0000-0000-0000-000000000010' as uniqueidentifier)
Union Select ID = 12, UID = cast ('00000000-0000-0000-0000-000000000001' as uniqueidentifier)
Union Select ID = 13, UID = cast ('00000000-0000-0000-0001-000000000000' as uniqueidentifier)
Union Select ID = 14, UID = cast ('00000000-0000-0000-0010-000000000000' as uniqueidentifier)
Union Select ID = 15, UID = cast ('00000000-0000-0000-0100-000000000000' as uniqueidentifier)
Union Select ID = 16, UID = cast ('00000000-0000-0000-1000-000000000000' as uniqueidentifier)
Union Select ID = 17, UID = cast ('00000000-0000-0001-0000-000000000000' as uniqueidentifier)
Union Select ID = 18, UID = cast ('00000000-0000-0010-0000-000000000000' as uniqueidentifier)
Union Select ID = 19, UID = cast ('00000000-0000-0100-0000-000000000000' as uniqueidentifier)
Union Select ID = 20, UID = cast ('00000000-0000-1000-0000-000000000000' as uniqueidentifier)
Union Select ID = 21, UID = cast ('00000000-0001-0000-0000-000000000000' as uniqueidentifier)
Union Select ID = 22, UID = cast ('00000000-0010-0000-0000-000000000000' as uniqueidentifier)
Union Select ID = 23, UID = cast ('00000000-0100-0000-0000-000000000000' as uniqueidentifier)
Union Select ID = 24, UID = cast ('00000000-1000-0000-0000-000000000000' as uniqueidentifier)
Union Select ID = 25, UID = cast ('00000001-0000-0000-0000-000000000000' as uniqueidentifier)
Union Select ID = 26, UID = cast ('00000010-0000-0000-0000-000000000000' as uniqueidentifier)
Union Select ID = 27, UID = cast ('00000100-0000-0000-0000-000000000000' as uniqueidentifier)
Union Select ID = 28, UID = cast ('00001000-0000-0000-0000-000000000000' as uniqueidentifier)
Union Select ID = 29, UID = cast ('00010000-0000-0000-0000-000000000000' as uniqueidentifier)
Union Select ID = 30, UID = cast ('00100000-0000-0000-0000-000000000000' as uniqueidentifier)
Union Select ID = 31, UID = cast ('01000000-0000-0000-0000-000000000000' as uniqueidentifier)
Union Select ID = 32, UID = cast ('10000000-0000-0000-0000-000000000000' as uniqueidentifier)
)
Select * From Test_UIDs Order By UID, ID

The algorithm is documented by the SQL Server team here: How are GUIDs compared in SQL Server 2005? I quote it here (since it's an old article that may be gone forever in a few years):
In general, equality comparisons make a lot of sense with
uniqueidentifier values. However, if you find yourself needing general
ordering, then you might be looking at the wrong data type and should
consider various integer types instead.
If, after careful thought, you decide to order on a uniqueidentifier
column, you might be surprised by what you get back.
Given these two uniqueidentifier values:
@g1 = '55666BEE-B3A0-4BF5-81A7-86FF976E763F'
@g2 = '8DD5BCA5-6ABE-4F73-B4B7-393AE6BBB849'
Many people think that @g1 is less than @g2, since '55666BEE' is
certainly smaller than '8DD5BCA5'. However, this is not how SQL Server
2005 compares uniqueidentifier values.
The comparison is made by looking at byte "groups" right-to-left, and
left-to-right within a byte "group". A byte group is what is delimited
by the '-' character. More technically, we look at bytes {10 to 15}
first, then {8-9}, then {6-7}, then {4-5}, and lastly {0 to 3}.
In this specific example, we would start by comparing '86FF976E763F'
with '393AE6BBB849'. Immediately we see that @g2 is indeed greater
than @g1.
Note that in .NET languages, Guid values have a different default sort
order than in SQL Server. If you find the need to order an array or
list of Guid using SQL Server comparison semantics, you can use an
array or list of SqlGuid instead, which implements IComparable in a
way which is consistent with SQL Server semantics.
Plus, the sort follows byte groups endianness (see here: Globally unique identifier). The groups 10-15 and 8-9 are stored as big endian (corresponding to the Data4 in the wikipedia article), so they are compared as big endian. Other groups are compared using little endian.
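The practical consequence of the .NET note above is easy to see in code. Here is a minimal C# sketch of my own (the two values are taken from the example data in the question; SqlGuid lives in System.Data.SqlTypes and implements the SQL Server ordering):
using System;
using System.Data.SqlTypes;

class GuidOrderDemo
{
    static void Main()
    {
        var a = new Guid("01000000-0000-0000-0000-000000000000");
        var b = new Guid("00000000-0000-0000-0000-000000000001");

        // System.Guid compares its fields left to right, so a sorts after b (positive result).
        Console.WriteLine(a.CompareTo(b));

        // SqlGuid compares byte groups right to left like SQL Server, so a sorts before b
        // (negative result), exactly as in the ORDER BY listing above.
        Console.WriteLine(new SqlGuid(a).CompareTo(new SqlGuid(b)));
    }
}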

A special service for those who find the accepted answer a bit vague. The code speaks for itself; the magical parts are:
System.Guid g
g.ToByteArray();
int[] m_byteOrder = new int[16] // 16 Bytes = 128 Bit
{10, 11, 12, 13, 14, 15, 8, 9, 6, 7, 4, 5, 0, 1, 2, 3};
public int Compare(Guid x, Guid y)
{
byte byte1, byte2;
//Swap to the correct order to be compared
for (int i = 0; i < NUM_BYTES_IN_GUID; i++)
{
byte1 = x.ToByteArray()[m_byteOrder[i]];
byte2 = y.ToByteArray()[m_byteOrder[i]];
if (byte1 != byte2)
return (byte1 < byte2) ? (int)EComparison.LT : (int)EComparison.GT;
} // Next i
return (int)EComparison.EQ;
}
Full code:
namespace BlueMine.Data
{
public class SqlGuid
: System.IComparable
, System.IComparable<SqlGuid>
, System.Collections.Generic.IComparer<SqlGuid>
, System.IEquatable<SqlGuid>
{
private const int NUM_BYTES_IN_GUID = 16;
// Comparison orders.
private static readonly int[] m_byteOrder = new int[16] // 16 Bytes = 128 Bit
{10, 11, 12, 13, 14, 15, 8, 9, 6, 7, 4, 5, 0, 1, 2, 3};
private byte[] m_bytes; // the SqlGuid is null if m_value is null
public SqlGuid(byte[] guidBytes)
{
if (guidBytes == null || guidBytes.Length != NUM_BYTES_IN_GUID)
throw new System.ArgumentException("Invalid array size");
m_bytes = new byte[NUM_BYTES_IN_GUID];
guidBytes.CopyTo(m_bytes, 0);
}
public SqlGuid(System.Guid g)
{
m_bytes = g.ToByteArray();
}
public byte[] ToByteArray()
{
byte[] ret = new byte[NUM_BYTES_IN_GUID];
m_bytes.CopyTo(ret, 0);
return ret;
}
int CompareTo(object obj)
{
if (obj == null)
return 1; // https://msdn.microsoft.com/en-us/library/system.icomparable.compareto(v=vs.110).aspx
System.Type t = obj.GetType();
if (object.ReferenceEquals(t, typeof(System.DBNull)))
return 1;
if (object.ReferenceEquals(t, typeof(SqlGuid)))
{
SqlGuid ui = (SqlGuid)obj;
return this.Compare(this, ui);
} // End if (object.ReferenceEquals(t, typeof(UInt128)))
return 1;
} // End Function CompareTo(object obj)
int System.IComparable.CompareTo(object obj)
{
return this.CompareTo(obj);
}
int CompareTo(SqlGuid other)
{
return this.Compare(this, other);
}
int System.IComparable<SqlGuid>.CompareTo(SqlGuid other)
{
return this.Compare(this, other);
}
enum EComparison : int
{
LT = -1, // itemA precedes itemB in the sort order.
EQ = 0, // itemA occurs in the same position as itemB in the sort order.
GT = 1 // itemA follows itemB in the sort order.
}
public int Compare(SqlGuid x, SqlGuid y)
{
byte byte1, byte2;
//Swap to the correct order to be compared
for (int i = 0; i < NUM_BYTES_IN_GUID; i++)
{
byte1 = x.m_bytes[m_byteOrder[i]];
byte2 = y.m_bytes[m_byteOrder[i]];
if (byte1 != byte2)
return (byte1 < byte2) ? (int)EComparison.LT : (int)EComparison.GT;
} // Next i
return (int)EComparison.EQ;
}
int System.Collections.Generic.IComparer<SqlGuid>.Compare(SqlGuid x, SqlGuid y)
{
return this.Compare(x, y);
}
public bool Equals(SqlGuid other)
{
return Compare(this, other) == 0;
}
bool System.IEquatable<SqlGuid>.Equals(SqlGuid other)
{
return this.Equals(other);
}
}
}
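A minimal usage sketch, assuming the class above is compiled into your project (the two sample values again come from the question's example data):
using System;
using System.Collections.Generic;

class SqlGuidSortDemo
{
    static void Main()
    {
        var list = new List<BlueMine.Data.SqlGuid>
        {
            new BlueMine.Data.SqlGuid(new Guid("00000000-0000-0000-0000-000000000001")),
            new BlueMine.Data.SqlGuid(new Guid("01000000-0000-0000-0000-000000000000"))
        };

        // List<T>.Sort picks up IComparable<SqlGuid>, so this reproduces SQL Server's ORDER BY.
        list.Sort();

        foreach (var g in list)
            Console.WriteLine(new Guid(g.ToByteArray()));
        // 01000000-0000-0000-0000-000000000000 comes out first, as in the listing above.
    }
}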

Here's a different approach. The GUID is simply shuffled around, ready for a normal string comparison as it occurs in SQL Server. This is JavaScript, but it is very easy to convert to any language.
function guidForComparison(guid) {
/*
character positions:
11111111112222222222333333
012345678901234567890123456789012345
00000000-0000-0000-0000-000000000000
byte positions:
111111111111
00112233 4455 6677 8899 001122334455
*/
return guid.substr(24, 12) +
guid.substr(19, 4) +
guid.substr(16, 2) +
guid.substr(14, 2) +
guid.substr(11, 2) +
guid.substr(9, 2) +
guid.substr(6, 2) +
guid.substr(4, 2) +
guid.substr(2, 2) +
guid.substr(0, 2);
};
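If you would rather do the same shuffle in .NET, here is a rough C# port of the function above (my own sketch; it assumes the usual 36-character hyphenated GUID string):
// Reorder the hex digits so that an ordinary ordinal string comparison
// matches SQL Server's uniqueidentifier order.
static string GuidForComparison(string guid)
{
    return guid.Substring(24, 12) +
           guid.Substring(19, 4) +
           guid.Substring(16, 2) +
           guid.Substring(14, 2) +
           guid.Substring(11, 2) +
           guid.Substring(9, 2) +
           guid.Substring(6, 2) +
           guid.Substring(4, 2) +
           guid.Substring(2, 2) +
           guid.Substring(0, 2);
}
// GuidForComparison("55666BEE-B3A0-4BF5-81A7-86FF976E763F")
//   -> "86FF976E763F81A7F54BA0B3EE6B6655"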

Related

Linq How to run queries in chunks

I have a bundle of delete queries like the following:
DELETE FROM [Entry]
WHERE CompanyId = 1
AND EmployeeId IN (3, 4, 6, 7, 14, 17, 20, 21, 22,....100 more)
AND Entry_Date = '2016-12-01'
AND Entry_Method = 'I'
So in my code, I run this list of queries as below:
using (var ctx = new ApplicationDbContext(schemaName))
{
foreach (var item in queries)
{
ctx.Database.ExecuteSqlCommand(item);
}
}
But due to the large number of queries executing, it creates a lock on SQL Server, so I decided to execute the queries in chunks, and I found the code below:
SET ROWCOUNT 500
delete_more:
DELETE FROM [Entry]
WHERE CompanyId = 1
AND EmployeeId IN (3, 4, 6, 7, 14, 17, 20, 21, 22,....100 more)
AND Entry_Date = '2016-12-01'
AND Entry_Method = 'I'
IF @@ROWCOUNT > 0 GOTO delete_more
SET ROWCOUNT 0
Now the problem is: how do I run this as I was running the queries previously, through ctx.Database.ExecuteSqlCommand?
How can I run this chunked query code from LINQ?
I would create a SQL Server stored procedure that gets the employee ids as a parameter. Let's call it 'sp_deleteEmployees' with the param @ids.
Then in C# you create a string of the ids:
string idsList = "3, 4, 6, 7, 14, 17, 20, 21, 22";
context.Database.ExecuteSqlCommand("sp_deleteEmployees @ids={0} ", idsList);
EDIT
Sorry, I guess I didn't understand the problem. If you need to delete the employees in chunks, you can split the list of employees with this:
public static List<IEnumerable<T>> Partition<T>(this IEnumerable<T> source, int length)
{
var count = source.Count();
var numberOfPartitions = count / length + ( count % length > 0 ? 1 : 0);
List<IEnumerable<T>> result= new List<IEnumerable<T>>();
for (int i = 0; i < numberOfPartitions; i++)
{
result.Add(source.Skip(length*i).Take(length));
}
return result;
}
You can use this method to split the list to small chunks and delete them one chunk at a time
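For example, a minimal sketch putting the two pieces together (my own illustration: Partition must live in a static class to be usable as an extension method; the table and column names are the ones from the question, and employeeIds/schemaName are assumed to come from the surrounding code):
using (var ctx = new ApplicationDbContext(schemaName))
{
    // Delete at most 500 employee ids per DELETE statement instead of one huge IN list.
    foreach (var chunk in employeeIds.Partition(500))
    {
        var ids = string.Join(", ", chunk);
        ctx.Database.ExecuteSqlCommand(
            "DELETE FROM [Entry] " +
            "WHERE CompanyId = 1 " +
            "AND EmployeeId IN (" + ids + ") " +
            "AND Entry_Date = '2016-12-01' " +
            "AND Entry_Method = 'I'");
    }
}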

PostgreSQL 8.3.11 locked; orphaned pg_toast database object recovery

Howdy Slack Overflowvians.
So I came across this PostgreSQL server running 8.3.11 (yeah I know), that was in a locked state with:
ERROR: database is not accepting commands to avoid wraparound data loss in database "postgres"
HINT: Stop the postmaster and use a standalone backend to vacuum that database.
Normally the autovacuum daemon (autovacuum=on) would handle this, but it could not because of the following four corrupt TOAST database objects (TOAST allows storage of large field values in 8 kB slices, like bread). Because of these corrupt objects, the XID of this database was never reset.
Below is a snippet of the output when running the server in single-user mode with the admin user:
SELECT oid, relname, age(relfrozenxid) FROM pg_class WHERE relkind = 't' ORDER BY age(relfrozenxid) DESC LIMIT 4;
----
1: oid = "2421459" (typeid = 26, len = 4, typmod = -1, byval = t)
2: relname = "pg_toast_2421456" (typeid = 19, len = 64, typmod = -1, byval = f)
3: age = "2146484084" (typeid = 23, len = 4, typmod = -1, byval = t)
----
1: oid = "2421450" (typeid = 26, len = 4, typmod = -1, byval = t)
2: relname = "pg_toast_2421447" (typeid = 19, len = 64, typmod = -1, byval = f)
3: age = "2146484084" (typeid = 23, len = 4, typmod = -1, byval = t)
----
1: oid = "2421435" (typeid = 26, len = 4, typmod = -1, byval = t)
2: relname = "pg_toast_2421432" (typeid = 19, len = 64, typmod = -1, byval = f)
3: age = "2146484084" (typeid = 23, len = 4, typmod = -1, byval = t)
----
1: oid = "2421426" (typeid = 26, len = 4, typmod = -1, byval = t)
2: relname = "pg_toast_2421423" (typeid = 19, len = 64, typmod = -1, byval = f)
3: age = "2146484084" (typeid = 23, len = 4, typmod = -1, byval = t)
Notice the age is well above the vacuum_freeze_min_age (the value set after a successful VACUUM) on this server, which is why it was issuing the original errors above. The above was AFTER running a VACUUM FULL; all other tables are fine.
SELECT relfilenode FROM pg_class WHERE oid=2421459;
So when we looked on disk (using the pg_class.relfilenode value for each table above), the toast table's file was missing:
$ find /var/lib/pgsql/data/ -type f -name '2421426' | wc -l # Bad toast
0
and when we looked on disk at the index of the toast:
SELECT relfilenode FROM pg_class WHERE oid = (SELECT reltoastidxid FROM pg_class WHERE oid=2421459);
$ find /var/lib/pgsql/data/ -type f -name '2421459' | wc -l # Bad toast's index
0
We then tried to find the table that the bad toast record is related to with:
SELECT * FROM pg_class WHERE reltoastrelid=2421459;
We got 0 results for each table above! There are no tables for the VACUUM command to reset the XID of these relations.
We then checked the pg_depend table and found that these TOAST tables have NO references:
SELECT * FROM pg_depend WHERE refobjid IN(2421459,2421450,2421435,2421426)
Question
1. Can you delete the bad TOAST table and TOAST table indexes from the pg_class table (e.g. DELETE FROM pg_class WHERE oid=2421459)?
2. Are there any other tables from which we also need to remove the relation?
3. Could we just create a temp table and link it to the TOAST's index's oid?
Example for #3 above:
CREATE TABLE adoptedparent (colnameblah char(1));
UPDATE pg_class SET reltoastrelid=2421459 WHERE relname='adoptedparent';
VACUUM FULL VERBOSE adoptedparent
EDIT:
select txid_current() is 3094769499 so these tables were corrupted a long time ago. We don't need to recover the data. We are running ext4 file system on Linux 2.6.18-238.el5. We checked the relevant lost+found/ directories and the files were not there.
Just for the home audience, in this particular case the resolution was to edit pg_class directly. And update the server to a supported version of Postgres, of course!
Specific answers:
1. Yes you can, although in most cases it's better to create an empty table, attach the toast relation to that table, add the pg_depend entries, and drop the table. In this case, that didn't make sense because there were truly no other objects depending on those toast tables.
2. Usually toast tables also have an index in pg_index, and entries in pg_depend. These did not.
3. See above.

Convert SQL Server varbinary(max) into a set of primary keys of type int

Disclaimer: not my code, not my database design!
I have a column of censusblocks(varbinary(max), null) in a MS SQL Server 2008 db table (call it foo for simplicity).
This column actually holds either NULL or a list of 1 to n ints. The ints are actually foreign keys to another table (call it censusblock, with a pk id of type int), numbering from 1 to ~9600000.
I want to query to extract the censusblocks list from foo, and use the extracted list of int from each row to look up the corresponding censusblock row. There's a long, boring rest of the query that will be used from there, but it needs to start with the census blocks pulled from the foo table's censusblocks column.
This conversion-and-look-up is currently handled on the middle tier, with a small .NET utility class to convert from List<int> to byte[] (and vice versa), which is then written into/read from the db as varbinary. I would like to do the same thing, purely in SQL.
The desired query would go something along the lines of
SELECT f.id, c.id
FROM foo f
LEFT OUTER JOIN censusblock c ON
c.id IN f.censusblocks --this is where the magic happens
where f.id in (1,2)
Which would result in:
f.id | c.id
1 8437314
1 8438819
1 8439744
1 8441795
1 8442741
1 8444984
1 8445568
1 8445641
1 8447953
2 5860657
2 5866881
2 5866881
2 5866858
2 5862557
2 5870475
2 5868983
2 5865207
2 5863465
2 5867301
2 5864057
2 5862256
NB: the 7-digit results are coincidental. The range is, as stated above, 1-7 digits.
The actual censusblocks column looks like
SELECT TOP 2 censusblocks FROM foo
which results in
censublocks
0x80BE4280C42380C7C080CFC380D37580DC3880DE8080DEC980E7D1
0x596D3159858159856A59749D59938B598DB7597EF7597829598725597A79597370
For further clarification, here's the guts of the .NET utility class's conversion methods:
public static List<int> getIntegersFromBytes(byte[] data)
{
List<int> values = new List<int>();
if (data != null && data.Length > 2)
{
long ids = data.Length / 3;
byte[] oneId = new byte[4];
oneId[0] = 0;
for (long i = 0; i < ids; i++)
{
oneId[0] = 0;
Array.Copy(data, i * 3, oneId, 1, 3);
if (BitConverter.IsLittleEndian)
{ Array.Reverse(oneId); }
values.Add(BitConverter.ToInt32(oneId, 0));
}}
return values;
}
public static byte[] getBytesFromIntegers(List<int> values)
{
byte[] data = null;
if (values != null && values.Count > 0)
{
data = new byte[values.Count * 3];
int count = 0;
byte[] idBytes = null;
foreach (int id in values)
{
idBytes = BitConverter.GetBytes(id);
if (BitConverter.IsLittleEndian)
{ Array.Reverse(idBytes); }
Array.Copy(idBytes, 1, data, count * 3, 3);
count++;
} }
return data;
}
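As a quick sanity check, feeding the first sample row through getIntegersFromBytes reproduces the c.id list shown for f.id = 1 (a minimal sketch, assuming the methods above are in scope):
// Turn the hex literal from the question into bytes, then into the 3-byte ids.
string hex = "80BE4280C42380C7C080CFC380D37580DC3880DE8080DEC980E7D1";
byte[] data = new byte[hex.Length / 2];
for (int i = 0; i < data.Length; i++)
    data[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
List<int> ids = getIntegersFromBytes(data);
// ids -> 8437314, 8438819, 8439744, 8441795, 8442741, 8444984, 8445568, 8445641, 8447953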
An example of how this might be done. It is unlikely to scale brilliantly.
If you have a numbers table in your database it should be used in place of nums_cte.
This works by converting the binary value to a literal hex string, then reading it in 8-character chunks.
-- create test data
DECLARE @foo TABLE
(id int ,
censusblocks varbinary(max)
)
DECLARE @censusblock TABLE
(id int)
INSERT @censusblock (id)
VALUES(1),(2),(1003),(5030),(5031),(2),(6)
INSERT @foo (id,censusblocks)
VALUES (1,0x0000000100000002000003EB),
(2,0x000013A6000013A7)
--query
DECLARE @biMaxLen bigint
SELECT @biMaxLen = MAX(LEN(CONVERT(varchar(max),censusblocks,2))) FROM @foo
;with nums_cte
AS
(
SELECT TOP (@biMaxLen) ((ROW_NUMBER() OVER (ORDER BY a.type) - 1) * 8) AS n
FROM master..spt_values as a
CROSS JOIN master..spt_values as b
)
,binCTE
AS
(
SELECT d.id, CAST(CONVERT(binary(4),SUBSTRING(s,n + 1,8),2) AS int) as cblock
FROM (SELECT Id, CONVERT(varchar(max),censusblocks,2) AS s FROM @foo) AS d
JOIN nums_cte
ON n < LEN(d.s)
)
SELECT *
FROM binCTE as b
LEFT JOIN @censusblock c
ON c.id = b.cblock
ORDER BY b.id, b.cblock
You could also consider adding your existing .Net conversion methods into the database as an assembly and accessing them through CLR functions.
This is off-topic, but I couldn't resist writing these conversions so they use IEnumerables instead of arrays and Lists. This might not be faster per se, but is more general and would allow you to perform the conversion without loading the whole array at once, which may be helpful if the arrays you are dealing with are large.
Here it is, for what it's worth:
static IEnumerable<int> BytesToInts(IEnumerable<byte> bytes) {
var buff = new byte[4];
using (var en = bytes.GetEnumerator()) {
while (en.MoveNext()) {
buff[0] = en.Current;
if (en.MoveNext()) {
buff[1] = en.Current;
if (en.MoveNext()) {
buff[2] = en.Current;
if (en.MoveNext()) {
buff[3] = en.Current;
if (BitConverter.IsLittleEndian)
Array.Reverse(buff);
yield return BitConverter.ToInt32(buff, 0);
continue;
}
}
}
throw new ArgumentException("Wrong number of bytes.", "bytes");
}
}
}
static IEnumerable<byte> IntsToBytes(IEnumerable<int> ints) {
if (BitConverter.IsLittleEndian)
return ints.SelectMany(
b => {
var buff = BitConverter.GetBytes(b);
Array.Reverse(buff);
return buff;
}
);
return ints.SelectMany(BitConverter.GetBytes);
}
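A quick round-trip check of these helpers (my own sketch; note they use a 4-byte encoding, unlike the 3-byte format in the question):
var ids = new List<int> { 8437314, 8438819, 8439744 };
IEnumerable<byte> asBytes = IntsToBytes(ids);      // 12 bytes, big-endian
List<int> back = BytesToInts(asBytes).ToList();    // { 8437314, 8438819, 8439744 }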
Your code seems to like encoding an int into 3 bytes instead of 4, which would cause problems with values that don't fit into 3 bytes (including negatives) - is that intentional?
BTW, you should be able to adapt this (or your) code for execution in SQL Server CLR. This is not exactly "in SQL", but is "in DBMS".
You can use CONVERT(int, censusBlock) to convert the value to an int value.
Then you can join on that column.
Or have I misunderstood the question?

Optimize delete query generated by Castle ActiveRecord

Let's say I have a list of Ids (primary keys) that I want to delete (e.g. 1, 2, 3, 4).
Using this query :
Console.WriteLine ("DELETE DATA :");
ActiveRecordMediator<PostgrePerson>.DeleteAll ("Id IN (1, 2, 3, 4)");
I expect the console output is :
DELETE DATA :
NHibernate: DELETE FROM Person WHERE Id IN (1, 2, 3, 4)
but the actual console output is (I use the showsql option):
DELETE DATA :
NHibernate: select postgreper0_.Id as Id5_, postgreper0_.Name as Name5_, postgreper0_.Age as Age5_, postgreper0_.Address as Address5_ from Person postgreper0_ where postgreper0_.Id in (1 , 2 , 3 , 4)
NHibernate: DELETE FROM Person WHERE Id = :p0;:p0 = 1
NHibernate: DELETE FROM Person WHERE Id = :p0;:p0 = 2
NHibernate: DELETE FROM Person WHERE Id = :p0;:p0 = 3
NHibernate: DELETE FROM Person WHERE Id = :p0;:p0 = 4
What should I do to make Castle ActiveRecord generate the expected (optimized) query?
Update
This is my implementation based on accepted answer.
int[] idList = GetIdList ();
ActiveRecordMediator<PostgrePerson>.Execute ((session, obj) => {
string hql = "DELETE PostgrePerson WHERE Id IN (:idList)";
return session.CreateQuery (hql)
.SetParameterList ("idList", idList)
.ExecuteUpdate ();
}, null);
Use the Execute callback method and run a DML-style HQL DELETE on the NHibernate ISession.

Natural (human alpha-numeric) sort in Microsoft SQL 2005

We have a large database on which we have DB side pagination. This is quick, returning a page of 50 rows from millions of records in a small fraction of a second.
Users can define their own sort, basically choosing what column to sort by. Columns are dynamic - some have numeric values, some dates and some text.
While most columns sort as expected, text sorts in a dumb way. Well, I say dumb; it makes sense to computers, but it frustrates users.
For instance, sorting by a string record id gives something like:
rec1
rec10
rec14
rec2
rec20
rec3
rec4
...and so on.
I want this to take account of the number, so:
rec1
rec2
rec3
rec4
rec10
rec14
rec20
I can't control the input (otherwise I'd just format in leading 000s) and I can't rely on a single format - some are things like "{alpha code}-{dept code}-{rec id}".
I know a few ways to do this in C#, but I can't pull down all the records to sort them, as that would be too slow.
Does anyone know a way to quickly apply a natural sort in SQL Server?
We're using:
ROW_NUMBER() over (order by {field name} asc)
And then we're paging by that.
We can add triggers, although we wouldn't. All their input is parametrised and the like, but I can't change the format - if they put in "rec2" and "rec10" they expect them to be returned just like that, and in natural order.
We have valid user input that follows different formats for different clients.
One might go rec1, rec2, rec3, ... rec100, rec101
While another might go: grp1rec1, grp1rec2, ... grp20rec300, grp20rec301
When I say we can't control the input I mean that we can't force users to change these standards - they have a value like grp1rec1 and I can't reformat it as grp01rec001, as that would be changing something used for lookups and linking to external systems.
These formats vary a lot, but are often mixtures of letters and numbers.
Sorting these in C# is easy - just break it up into { "grp", 20, "rec", 301 } and then compare sequence values in turn.
However, there may be millions of records and the data is paged, so I need the sort to be done on the SQL server.
SQL server sorts by value, not comparison - in C# I can split the values out to compare, but in SQL I need some logic that (very quickly) gets a single value that consistently sorts.
@moebius - your answer might work, but it does feel like an ugly compromise to add a sort-key for all these text values.
order by LEN(value), value
Not perfect, but works well in a lot of cases.
Most of the SQL-based solutions I have seen break when the data gets complex enough (e.g. more than one or two numbers in it). Initially I tried implementing a NaturalSort function in T-SQL that met my requirements (among other things, handles an arbitrary number of numbers within the string), but the performance was way too slow.
Ultimately, I wrote a scalar CLR function in C# to allow for a natural sort, and even with unoptimized code the performance calling it from SQL Server is blindingly fast. It has the following characteristics:
will sort the first 1,000 characters or so correctly (easily modified in code or made into a parameter)
properly sorts decimals, so 123.333 comes before 123.45
because of above, will likely NOT sort things like IP addresses correctly; if you wish different behaviour, modify the code
supports sorting a string with an arbitrary number of numbers within it
will correctly sort numbers up to 25 digits long (easily modified in code or made into a parameter)
The code is here:
using System;
using System.Data.SqlTypes;
using System.Text;
using Microsoft.SqlServer.Server;
public class UDF
{
[SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic=true)]
public static SqlString Naturalize(string val)
{
if (String.IsNullOrEmpty(val))
return val;
while (val.Contains("  "))
val = val.Replace("  ", " ");
const int maxLength = 1000;
const int padLength = 25;
bool inNumber = false;
bool isDecimal = false;
int numStart = 0;
int numLength = 0;
int length = val.Length < maxLength ? val.Length : maxLength;
//TODO: optimize this so that we exit for loop once sb.ToString() >= maxLength
var sb = new StringBuilder();
for (var i = 0; i < length; i++)
{
int charCode = (int)val[i];
if (charCode >= 48 && charCode <= 57)
{
if (!inNumber)
{
numStart = i;
numLength = 1;
inNumber = true;
continue;
}
numLength++;
continue;
}
if (inNumber)
{
sb.Append(PadNumber(val.Substring(numStart, numLength), isDecimal, padLength));
inNumber = false;
}
isDecimal = (charCode == 46);
sb.Append(val[i]);
}
if (inNumber)
sb.Append(PadNumber(val.Substring(numStart, numLength), isDecimal, padLength));
var ret = sb.ToString();
if (ret.Length > maxLength)
return ret.Substring(0, maxLength);
return ret;
}
static string PadNumber(string num, bool isDecimal, int padLength)
{
return isDecimal ? num.PadRight(padLength, '0') : num.PadLeft(padLength, '0');
}
}
To register this so that you can call it from SQL Server, run the following commands in Query Analyzer:
CREATE ASSEMBLY SqlServerClr FROM 'SqlServerClr.dll' --put the full path to DLL here
go
CREATE FUNCTION Naturalize(@val as nvarchar(max)) RETURNS nvarchar(1000)
EXTERNAL NAME SqlServerClr.UDF.Naturalize
go
Then, you can use it like so:
select *
from MyTable
order by dbo.Naturalize(MyTextField)
Note: If you get an error in SQL Server along the lines of Execution of user code in the .NET Framework is disabled. Enable "clr enabled" configuration option., follow the instructions here to enable it. Make sure you consider the security implications before doing so. If you are not the db admin, make sure you discuss this with your admin before making any changes to the server configuration.
Note2: This code does not properly support internationalization (e.g., it assumes the decimal marker is ".", it is not optimized for speed, etc.). Suggestions on improving it are welcome!
Edit: Renamed the function to Naturalize instead of NaturalSort, since it does not do any actual sorting.
I know this is an old question, but I just came across it, and since it doesn't have an accepted answer, I'll add my approach.
I have always used ways similar to this:
SELECT [Column] FROM [Table]
ORDER BY RIGHT(REPLICATE('0', 1000) + LTRIM(RTRIM(CAST([Column] AS VARCHAR(MAX)))), 1000)
The only common times that this has issues is if your column won't cast to a VARCHAR(MAX), or if LEN([Column]) > 1000 (but you can change that 1000 to something else if you want), but you can use this rough idea for what you need.
Also, this performs much worse than a normal ORDER BY [Column], but it does give you the result asked for in the OP.
Edit: Just to further clarify, the above will not work if you have decimal values such as 1, 1.15 and 1.5 (they will sort as {1, 1.5, 1.15}), as that is not what is asked for in the OP, but that can easily be handled by:
SELECT [Column] FROM [Table]
ORDER BY REPLACE(RIGHT(REPLICATE('0', 1000) + LTRIM(RTRIM(CAST([Column] AS VARCHAR(MAX)))) + REPLICATE('0', 100 - CHARINDEX('.', REVERSE(LTRIM(RTRIM(CAST([Column] AS VARCHAR(MAX))))), 1)), 1000), '.', '0')
Result: {1, 1.15, 1.5}
And still all entirely within SQL. This will not sort IP addresses because you're now getting into very specific number combinations as opposed to simple text + number.
RedFilter's answer is great for reasonably sized datasets where indexing is not critical; however, if you want an index, several tweaks are required.
First, mark the function as not doing any data access and being deterministic and precise:
[SqlFunction(DataAccess = DataAccessKind.None,
SystemDataAccess = SystemDataAccessKind.None,
IsDeterministic = true, IsPrecise = true)]
Next, MSSQL has a 900 byte limit on the index key size, so if the naturalized value is the only value in the index, it must be at most 450 characters long. If the index includes multiple columns, the return value must be even smaller. Two changes:
CREATE FUNCTION Naturalize(@str AS nvarchar(max)) RETURNS nvarchar(450)
EXTERNAL NAME ClrExtensions.Util.Naturalize
and in the C# code:
const int maxLength = 450;
Finally, you will need to add a computed column to your table, and it must be persisted (because MSSQL cannot prove that Naturalize is deterministic and precise), which means the naturalized value is actually stored in the table but is still maintained automatically:
ALTER TABLE YourTable ADD nameNaturalized AS dbo.Naturalize(name) PERSISTED
You can now create the index!
CREATE INDEX idx_YourTable_n ON YourTable (nameNaturalized)
I've also made a couple of changes to RedFilter's code: using chars for clarity, incorporating duplicate space removal into the main loop, exiting once the result is longer than the limit, setting maximum length without substring etc. Here's the result:
using System.Data.SqlTypes;
using System.Text;
using Microsoft.SqlServer.Server;
public static class Util
{
[SqlFunction(DataAccess = DataAccessKind.None, SystemDataAccess = SystemDataAccessKind.None, IsDeterministic = true, IsPrecise = true)]
public static SqlString Naturalize(string str)
{
if (string.IsNullOrEmpty(str))
return str;
const int maxLength = 450;
const int padLength = 15;
bool isDecimal = false;
bool wasSpace = false;
int numStart = 0;
int numLength = 0;
var sb = new StringBuilder();
for (var i = 0; i < str.Length; i++)
{
char c = str[i];
if (c >= '0' && c <= '9')
{
if (numLength == 0)
numStart = i;
numLength++;
}
else
{
if (numLength > 0)
{
sb.Append(pad(str.Substring(numStart, numLength), isDecimal, padLength));
numLength = 0;
}
if (c != ' ' || !wasSpace)
sb.Append(c);
isDecimal = c == '.';
if (sb.Length > maxLength)
break;
}
wasSpace = c == ' ';
}
if (numLength > 0)
sb.Append(pad(str.Substring(numStart, numLength), isDecimal, padLength));
if (sb.Length > maxLength)
sb.Length = maxLength;
return sb.ToString();
}
private static string pad(string num, bool isDecimal, int padLength)
{
return isDecimal ? num.PadRight(padLength, '0') : num.PadLeft(padLength, '0');
}
}
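As a quick illustration of what the naturalized value looks like (my own sketch, run outside SQL Server; padLength is 15 in the code above):
Console.WriteLine(Util.Naturalize("rec2").Value);   // rec000000000000002
Console.WriteLine(Util.Naturalize("rec10").Value);  // rec000000000000010
// An ordinal string comparison of the two results now puts rec2 before rec10.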
Here's a solution written for SQL 2000. It can probably be improved for newer SQL versions.
/**
* Returns a string formatted for natural sorting. This function is very useful when having to sort alpha-numeric strings.
*
* @author Alexandre Potvin Latreille (plalx)
* @param {nvarchar(4000)} string The formatted string.
* @param {int} numberLength The length each number should have (including padding). This should be the length of the longest number. Defaults to 10.
* @param {char(50)} sameOrderChars A list of characters that should have the same order. Ex: '.-/'. Defaults to empty string.
*
* @return {nvarchar(4000)} A string for natural sorting.
* Example of use:
*
* SELECT Name FROM TableA ORDER BY Name
* TableA (unordered) TableA (ordered)
* ------------ ------------
* ID Name ID Name
* 1. A1. 1. A1-1.
* 2. A1-1. 2. A1.
* 3. R1 --> 3. R1
* 4. R11 4. R11
* 5. R2 5. R2
*
*
* As we can see, humans would expect A1., A1-1., R1, R2, R11 but that's not how SQL is sorting it.
* We can use this function to fix this.
*
* SELECT Name FROM TableA ORDER BY dbo.udf_NaturalSortFormat(Name, default, '.-')
* TableA (unordered) TableA (ordered)
* ------------ ------------
* ID Name ID Name
* 1. A1. 1. A1.
* 2. A1-1. 2. A1-1.
* 3. R1 --> 3. R1
* 4. R11 4. R2
* 5. R2 5. R11
*/
ALTER FUNCTION [dbo].[udf_NaturalSortFormat](
@string nvarchar(4000),
@numberLength int = 10,
@sameOrderChars char(50) = ''
)
RETURNS varchar(4000)
AS
BEGIN
DECLARE @sortString varchar(4000),
@numStartIndex int,
@numEndIndex int,
@padLength int,
@totalPadLength int,
@i int,
@sameOrderCharsLen int;
SELECT
@totalPadLength = 0,
@string = RTRIM(LTRIM(@string)),
@sortString = @string,
@numStartIndex = PATINDEX('%[0-9]%', @string),
@numEndIndex = 0,
@i = 1,
@sameOrderCharsLen = LEN(@sameOrderChars);
-- Replace all char that have the same order by a space.
WHILE (@i <= @sameOrderCharsLen)
BEGIN
SET @sortString = REPLACE(@sortString, SUBSTRING(@sameOrderChars, @i, 1), ' ');
SET @i = @i + 1;
END
-- Pad numbers with zeros.
WHILE (@numStartIndex <> 0)
BEGIN
SET @numStartIndex = @numStartIndex + @numEndIndex;
SET @numEndIndex = @numStartIndex;
WHILE(PATINDEX('[0-9]', SUBSTRING(@string, @numEndIndex, 1)) = 1)
BEGIN
SET @numEndIndex = @numEndIndex + 1;
END
SET @numEndIndex = @numEndIndex - 1;
SET @padLength = @numberLength - (@numEndIndex + 1 - @numStartIndex);
IF @padLength < 0
BEGIN
SET @padLength = 0;
END
SET @sortString = STUFF(
@sortString,
@numStartIndex + @totalPadLength,
0,
REPLICATE('0', @padLength)
);
SET @totalPadLength = @totalPadLength + @padLength;
SET @numStartIndex = PATINDEX('%[0-9]%', RIGHT(@string, LEN(@string) - @numEndIndex));
END
RETURN @sortString;
END
I know this is a bit old at this point, but in my search for a better solution, I came across this question. I'm currently using a function to order by. It works fine for my purpose of sorting records which are named with mixed alpha numeric ('item 1', 'item 10', 'item 2', etc)
CREATE FUNCTION [dbo].[fnMixSort]
(
@ColValue NVARCHAR(255)
)
RETURNS NVARCHAR(1000)
AS
BEGIN
DECLARE @p1 NVARCHAR(255),
@p2 NVARCHAR(255),
@p3 NVARCHAR(255),
@p4 NVARCHAR(255),
@Index TINYINT
IF @ColValue LIKE '[a-z]%'
SELECT @Index = PATINDEX('%[0-9]%', @ColValue),
@p1 = LEFT(CASE WHEN @Index = 0 THEN @ColValue ELSE LEFT(@ColValue, @Index - 1) END + REPLICATE(' ', 255), 255),
@ColValue = CASE WHEN @Index = 0 THEN '' ELSE SUBSTRING(@ColValue, @Index, 255) END
ELSE
SELECT @p1 = REPLICATE(' ', 255)
SELECT @Index = PATINDEX('%[^0-9]%', @ColValue)
IF @Index = 0
SELECT @p2 = RIGHT(REPLICATE(' ', 255) + @ColValue, 255),
@ColValue = ''
ELSE
SELECT @p2 = RIGHT(REPLICATE(' ', 255) + LEFT(@ColValue, @Index - 1), 255),
@ColValue = SUBSTRING(@ColValue, @Index, 255)
SELECT @Index = PATINDEX('%[0-9,a-z]%', @ColValue)
IF @Index = 0
SELECT @p3 = REPLICATE(' ', 255)
ELSE
SELECT @p3 = LEFT(REPLICATE(' ', 255) + LEFT(@ColValue, @Index - 1), 255),
@ColValue = SUBSTRING(@ColValue, @Index, 255)
IF PATINDEX('%[^0-9]%', @ColValue) = 0
SELECT @p4 = RIGHT(REPLICATE(' ', 255) + @ColValue, 255)
ELSE
SELECT @p4 = LEFT(@ColValue + REPLICATE(' ', 255), 255)
RETURN @p1 + @p2 + @p3 + @p4
END
Then call
select item_name from my_table order by dbo.fnMixSort(item_name)
It easily triples the processing time for a simple data read, so it may not be the perfect solution.
Here is another solution that I like:
http://www.dreamchain.com/sql-and-alpha-numeric-sort-order/
It's not Microsoft SQL, but since I ended up here when I was searching for a solution for Postgres, I thought adding this here would help others.
EDIT: Here is the code, in case the link goes away.
CREATE or REPLACE FUNCTION pad_numbers(text) RETURNS text AS $$
SELECT regexp_replace(regexp_replace(regexp_replace(regexp_replace(($1 collate "C"),
E'(^|\\D)(\\d{1,3}($|\\D))', E'\\1000\\2', 'g'),
E'(^|\\D)(\\d{4,6}($|\\D))', E'\\1000\\2', 'g'),
E'(^|\\D)(\\d{7}($|\\D))', E'\\100\\2', 'g'),
E'(^|\\D)(\\d{8}($|\\D))', E'\\10\\2', 'g');
$$ LANGUAGE SQL;
"C" is the default collation in postgresql; you may specify any collation you desire, or remove the collation statement if you can be certain your table columns will never have a nondeterministic collation assigned.
usage:
SELECT * FROM wtf w
WHERE TRUE
ORDER BY pad_numbers(w.my_alphanumeric_field)
For the following varchar data:
BR1
BR2
External Location
IR1
IR2
IR3
IR4
IR5
IR6
IR7
IR8
IR9
IR10
IR11
IR12
IR13
IR14
IR16
IR17
IR15
VCR
This worked best for me:
ORDER BY substring(fieldName, 1, 1), LEN(fieldName)
If you're having trouble loading the data from the DB to sort in C#, then I'm sure you'll be disappointed with any approach at doing it programmatically in the DB. When the server is going to sort, it's got to calculate the "perceived" order just as you would have -- every time.
I'd suggest that you add an additional column to store the preprocessed sortable string, using some C# method, when the data is first inserted. You might try to convert the numerics into fixed-width ranges, for example, so "xyz1" would turn into "xyz00000001". Then you could use normal SQL Server sorting.
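For instance, a minimal C# sketch of that preprocessing idea (my own illustration; the fixed width of 8 digits is an assumption):
using System.Text.RegularExpressions;

// Pad every run of digits to a fixed width so a plain string sort becomes a natural sort.
static string ToSortable(string value)
{
    return Regex.Replace(value, @"\d+", m => m.Value.PadLeft(8, '0'));
}
// ToSortable("xyz1")  -> "xyz00000001"
// ToSortable("rec10") -> "rec00000010", which sorts after ToSortable("rec2") -> "rec00000002"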
At the risk of tooting my own horn, I wrote a CodeProject article implementing the problem as posed in the CodingHorror article. Feel free to steal from my code.
Simply sort by
ORDER BY
cast (substring(name,(PATINDEX('%[0-9]%',name)),len(name))as int)
I've just read an article somewhere about such a topic. The key point is: you only need the integer value to sort the data, while the 'rec' string belongs to the UI. You could split the information into two fields, say alpha and num, sort by alpha and num (separately) and then show a string composed of alpha + num. You could use a computed column to compose the string, or a view.
Hope it helps
You can use the following code to resolve the problem:
Select *,
substring(Cote,1,len(Cote) - Len(RIGHT(Cote, LEN(Cote) - PATINDEX('%[0-9]%', Cote)+1)))alpha,
CAST(RIGHT(Cote, LEN(Cote) - PATINDEX('%[0-9]%', Cote)+1) AS INT)intv
FROM Documents
left outer join Sites ON Sites.IDSite = Documents.IDSite
Order BY alpha, intv
I'm fashionably late to the party as usual. Nevertheless, here is my attempt at an answer that seems to work well (I would say that). It assumes text with digits at the end, like in the original example data.
First a function that won't end up winning a "pretty SQL" competition anytime soon.
CREATE FUNCTION udfAlphaNumericSortHelper (
@string varchar(max)
)
RETURNS @results TABLE (
txt varchar(max),
num float
)
AS
BEGIN
DECLARE @txt varchar(max) = @string
DECLARE @numStr varchar(max) = ''
DECLARE @num float = 0
DECLARE @lastChar varchar(1) = ''
set @lastChar = RIGHT(@txt, 1)
WHILE @lastChar <> '' and @lastChar is not null
BEGIN
IF ISNUMERIC(@lastChar) = 1
BEGIN
set @numStr = @lastChar + @numStr
set @txt = Substring(@txt, 0, len(@txt))
set @lastChar = RIGHT(@txt, 1)
END
ELSE
BEGIN
set @lastChar = null
END
END
SET @num = CAST(@numStr as float)
INSERT INTO @results select @txt, @num
RETURN;
END
Then call it like below:
declare @str nvarchar(250) = 'sox,fox,jen1,Jen0,jen15,jen02,jen0004,fox00,rec1,rec10,jen3,rec14,rec2,rec20,rec3,rec4,zip1,zip1.32,zip1.33,zip1.3,TT0001,TT01,TT002'
SELECT tbl.value --, sorter.txt, sorter.num
FROM STRING_SPLIT(@str, ',') as tbl
CROSS APPLY dbo.udfAlphaNumericSortHelper(value) as sorter
ORDER BY sorter.txt, sorter.num, len(tbl.value)
With results:
fox
fox00
Jen0
jen1
jen02
jen3
jen0004
jen15
rec1
rec2
rec3
rec4
rec10
rec14
rec20
sox
TT01
TT0001
TT002
zip1
zip1.3
zip1.32
zip1.33
I still don't understand (probably because of my poor English).
You could try:
ROW_NUMBER() OVER (ORDER BY dbo.human_sort(field_name) ASC)
But it won't work for millions of records.
That's why I suggested using a trigger that fills a separate column with the human-sortable value.
Moreover:
built-in T-SQL functions are really slow, and Microsoft suggests using .NET functions instead;
the human-sortable value is constant, so there is no point calculating it each time the query runs.
