How to do hit-highlighting of results from a SQL Server full-text query - sql-server

We have a web application that uses SQL Server 2008 as the database. Our users are able to do full-text searches on particular columns in the database. SQL Server's full-text functionality does not seem to provide support for hit highlighting. Do we need to build this ourselves or is there perhaps some library or knowledge around on how to do this?
BTW the application is written in C# so a .Net solution would be ideal but not necessary as we could translate.

Expanding on Ishmael's idea, it's not the final solution, but I think it's a good way to start.
Firstly we need to get the list of words that have been retrieved with the full-text engine:
declare #SearchPattern nvarchar(1000) = 'FORMSOF (INFLECTIONAL, " ' + #SearchString + ' ")'
declare #SearchWords table (Word varchar(100), Expansion_type int)
insert into #SearchWords
select distinct display_term, expansion_type
from sys.dm_fts_parser(#SearchPattern, 1033, 0, 0)
where special_term = 'Exact Match'
There is already quite a lot one can expand on, for example the search pattern is quite basic; also there are probably better ways to filter out the words you don't need, but it least it gives you a list of stem words etc. that would be matched by full-text search.
After you get the results you need, you can use RegEx to parse through the result set (or preferably only a subset to speed it up, although I haven't yet figured out a good way to do so). For this I simply use two while loops and a bunch of temporary table and variables:
declare #FinalResults table
while (select COUNT(*) from #PrelimResults) > 0
begin
select top 1 #CurrID = [UID], #Text = Text from #PrelimResults
declare #TextLength int = LEN(#Text )
declare #IndexOfDot int = CHARINDEX('.', REVERSE(#Text ), #TextLength - dbo.RegExIndexOf(#Text, '\b' + #FirstSearchWord + '\b') + 1)
set #Text = SUBSTRING(#Text, case #IndexOfDot when 0 then 0 else #TextLength - #IndexOfDot + 3 end, 300)
while (select COUNT(*) from #TempSearchWords) > 0
begin
select top 1 #CurrWord = Word from #TempSearchWords
set #Text = dbo.RegExReplace(#Text, '\b' + #CurrWord + '\b', '<b>' + SUBSTRING(#Text, dbo.RegExIndexOf(#Text, '\b' + #CurrWord + '\b'), LEN(#CurrWord) + 1) + '</b>')
delete from #TempSearchWords where Word = #CurrWord
end
insert into #FinalResults
select * from #PrelimResults where [UID] = #CurrID
delete from #PrelimResults where [UID] = #CurrID
end
Several notes:
1. Nested while loops probably aren't the most efficient way of doing it, however nothing else comes to mind. If I were to use cursors, it would essentially be the same thing?
2. #FirstSearchWord here to refers to the first instance in the text of one of the original search words, so essentially the text you are replacing is only going to be in the summary. Again, it's quite a basic method, some sort of text cluster finding algorithm would probably be handy.
3. To get RegEx in the first place, you need CLR user-defined functions.

It looks like you could parse the output of the new SQL Server 2008 stored procedure sys.dm_fts_parser and use regex, but I haven't looked at it too closely.

You might be missing the point of the database in this instance. Its job is to return the data to you that satisfies the conditions you gave it. I think you will want to implement the highlighting probably using regex in your web control.
Here is something a quick search would reveal.
http://www.dotnetjunkies.com/PrintContent.aspx?type=article&id=195E323C-78F3-4884-A5AA-3A1081AC3B35

Some details:
search_kiemeles=replace(lcase(search),"""","")
do while not rs.eof 'The search result loop
hirdetes=rs("hirdetes")
data=RegExpValueA("([A-Za-zöüóőúéáűíÖÜÓŐÚÉÁŰÍ0-9]+)",search_kiemeles) 'Give back all the search words in an array, I need non-english characters also
For i=0 to Ubound(data,1)
hirdetes = RegExpReplace(hirdetes,"("&NoAccentRE(data(i))&")","<em>$1</em>")
Next
response.write hirdetes
rs.movenext
Loop
...
Functions
'All Match to Array
Function RegExpValueA(patrn, strng)
Dim regEx
Set regEx = New RegExp ' Create a regular expression.
regEx.IgnoreCase = True ' Set case insensitivity.
regEx.Global = True
Dim Match, Matches, RetStr
Dim data()
Dim count
count = 0
Redim data(-1) 'VBSCript Ubound array bug workaround
if isnull(strng) or strng="" then
RegExpValueA = data
exit function
end if
regEx.Pattern = patrn ' Set pattern.
Set Matches = regEx.Execute(strng) ' Execute search.
For Each Match in Matches ' Iterate Matches collection.
count = count + 1
Redim Preserve data(count-1)
data(count-1) = Match.Value
Next
set regEx = nothing
RegExpValueA = data
End Function
'Replace non-english chars
Function NoAccentRE(accent_string)
NoAccentRE=accent_string
NoAccentRE=Replace(NoAccentRE,"a","§")
NoAccentRE=Replace(NoAccentRE,"á","§")
NoAccentRE=Replace(NoAccentRE,"§","[aá]")
NoAccentRE=Replace(NoAccentRE,"e","§")
NoAccentRE=Replace(NoAccentRE,"é","§")
NoAccentRE=Replace(NoAccentRE,"§","[eé]")
NoAccentRE=Replace(NoAccentRE,"i","§")
NoAccentRE=Replace(NoAccentRE,"í","§")
NoAccentRE=Replace(NoAccentRE,"§","[ií]")
NoAccentRE=Replace(NoAccentRE,"o","§")
NoAccentRE=Replace(NoAccentRE,"ó","§")
NoAccentRE=Replace(NoAccentRE,"ö","§")
NoAccentRE=Replace(NoAccentRE,"ő","§")
NoAccentRE=Replace(NoAccentRE,"§","[oóöő]")
NoAccentRE=Replace(NoAccentRE,"u","§")
NoAccentRE=Replace(NoAccentRE,"ú","§")
NoAccentRE=Replace(NoAccentRE,"ü","§")
NoAccentRE=Replace(NoAccentRE,"ű","§")
NoAccentRE=Replace(NoAccentRE,"§","[uúüű]")
end function

Related

Correct usage of Dynamic SQL in SQL Server

I’m using static SQL for 99% of the time, but a recent scenario led me to write a dynamic SQL and I want to make sure I didn’t miss anything before this SQL is released to production.
The tables’ names are a combination of a prefix, a 2 letters variable and a suffix and column name is a prefix + 2 letters variable.
First I’ve checked that #p_param is 2 letters length and is “whitelisted”:
IF (LEN(#p_param) = 2 and (#p_param = ‘aa’ or #p_param = ‘bb’ or #p_param = ‘cc’ or #p_param = ‘dd’ or #p_param = ‘aa’)
BEGIN
set #p_table_name = 'table_' + #p_param + '_suffix';
set #sql = 'update ' + QUOTENAME(#p_table_name) + ' set column_name = 2 where id in (1,2,3,4);';
EXEC sp_executesql #sql;
--Here I’m checking the second parameter that I will create the column name with
IF (LEN(#p_column) = 2 and (#p_column = 'ce' or #p_column = 'pt')
BEGIN
Set #column_name = 'column_name_' + #p_column_param;
set #second_sql = 'update ' + QUOTENAME(#p_table_name) + ' set ' +
QUOTENAME(#column_name) + ' = 2 where id in (#p_some_param);';
EXEC sp_executesql #second_sql, N'#p_some_param NVARCHAR(200)', #p_some_param = #p_some_param;
END
END
Is this use case safe? Are there any pitfalls I should be a ware of?
Seems like you've lost some things in the translation to meaningless names to prepare your query to post here, so it's kinda hard to tell. However, the overall approach seems OK to me.
Using a whitelist with QUOTENAME for the identifiers will protect you from SQL injections using the identifiers parameters, and passing the value parameters as a parameter to sp_executeSql will protect you from SQL injections using the value parameters, so I would say you are doing fine on that front.
There are a couple of things I would change, though.
In addition to testing your tables and columns names against a hard coded white list, I would also test then against information_schema.columns, just to make sure that the procedure will not raise an error in case a table or column is missing.
Also, Your whitelist conditions can be improved - Instead of:
IF (LEN(#p_param) = 2 and (#p_param = ‘aa’ or #p_param = ‘bb’ or #p_param = ‘cc’ or #p_param = ‘dd’ or #p_param = ‘aa’)
You can simply write:
IF #p_param IN('aa', 'bb', 'cc','dd')

Query fails on "converting character string to smalldatetime data type"

I've been tasked with fixing some SQL code that doesn't work. The query reads from a view against a predicate. The query right now looks like so.
SELECT TOP (100) Beginn
FROM V_LLAMA_Seminare
//Removal of the following line makes the query successful, keeping it breaks it
where Beginn > (select cast (getdate() as smalldatetime))
order by Beginn desc
When I run the above query, I am greeted with the following error.
Msg 295, Level 16, State 3, Line 1
Conversion failed when converting character string to smalldatetime data type.
I decided to remove the WHERE clause, and now it runs returning 100 rows.
At first, I thought that behind the scenes, SQL Server was somehow including my predicate when bringing back the View . But then I investigated how the View was being created, especially the Beginn field, and at no point does it return a String.
Long story short, the column that becomes the Beginn field is a BIGINT timestamp like 201604201369.... The original user transforms this BIGINT to a smalldatetime using the following magic.
....
CASE WHEN ma.datum_dt = 0
THEN null
ELSE CONVERT(smalldatetime, SUBSTRING(CAST(ma.datum_dt AS varchar(max)),0,5) + '-' +
SUBSTRING(CAST(ma.datum_dt AS varchar(max)),5,2) + '-' +
SUBSTRING(CAST(ma.datum_dt AS varchar(max)),7,2) + ' ' +
SUBSTRING(CAST(ma.datum_dt AS varchar(max)),9,2) +':'+
SUBSTRING(CAST(ma.datum_dt AS varchar(max)),11,2) +':' +
RIGHT(CAST(ma.datum_dt AS varchar(max)),2)) END AS Beginn
...
My last attempt at finding the problem was to query the view and run the function ISDATE over the Beginn column and see if it returned a 0 which it never did.
So my question is two fold, "Why does a predicate break something" and two "Where on earth is this string error coming from when the Beginn value is being formed from a BIGINT".
Any help is greatly appreciated.
This problem is culture related...
Try this and then change the first SET LANGUAGE to GERMAN
SET LANGUAGE ENGLISH;
DECLARE #bi BIGINT=20160428001600;
SELECT CASE WHEN #bi = 0
THEN null
ELSE CONVERT(datetime, SUBSTRING(CAST(#bi AS varchar(max)),0,5) + '-' +
SUBSTRING(CAST(#bi AS varchar(max)),5,2) + '-' +
SUBSTRING(CAST(#bi AS varchar(max)),7,2) + ' ' +
SUBSTRING(CAST(#bi AS varchar(max)),9,2) +':'+
SUBSTRING(CAST(#bi AS varchar(max)),11,2) +':' +
RIGHT(CAST(#bi AS varchar(max)),2)) END AS Beginn
It is a very bad habit to think, that date values look the same everywhere (Oh no, my small application will never go international ...)
Try to stick to culture independent formats like ODBC or ISO
EDIT
A very easy solution for you actually was to replace the blank with a "T"
SUBSTRING(CAST(ma.datum_dt AS varchar(max)),7,2) + 'T' +
Then it's ISO 8601 and will convert...
The solution was found after looking through #Shnugo's comment. When I took my query which contained the Bigint->Datetime conversion logic, and put it into a CTE with "TOP 100000000" to avoid any implicit conversion actions, my query worked. Here is what my view looks like now with some unimportant parts omitted.
---Important part---
CREATE VIEW [dbo].[V_SomeView] AS
WITH CTE AS (
SELECT TOP 1000000000 ma.id AS MA_ID,
---Important part---
vko.extkey AS ID_VKO,
vko.text AS Verkaufsorganisation,
fi.f7000 AS MDM_Nr,
vf.f7105 AS SAPKdnr,
CASE WHEN ma.datum_dt = 0 --Conversion logic
CASE WHEN ma.endedatum_dt = 0 --Conversion logic
CONVERT(NVARCHAR(MAX),art.text) AS Art,
.....
FROM [ucrm].[dbo].[CRM_MA] ma,
[ucrm].[dbo].[CRM_fi] fi,
[ucrm].[dbo].[CRM_vf] vf,
[ucrm].[dbo].[CRM_ka] vko,
[ucrm].[dbo].[CRM_ka] art,
[ucrm].[dbo].[CRM_ka] kat
where ma.loskz = 0
and fi.loskz = 0
and vf.loskz = 0
and fi.F7029 = 0
and vf.F7023 = 0
...
GROUP BY ma.id,
vko.extkey,
vko.text,
fi.f7000 ,
vf.f7105,
ma.datum_dt,
ma.endedatum_dt,
....
)
select * FROM CTE;

MS SQL REPLACE based on 1 character to the left of the $

I am not a SQL expert so please forgive me if this is SQL 101 :).
In a select statement there are 2 replace functions. They look for a Servername and it's admin share d$ by it's UNC path. Example '\SERVERNAME\d$'
It then replaces '\SERVERNAME\d$' with 'D:'.
Here is the query currently:
select Replace(p.Path,'\\SERVERNAME\d$','D:') as searchpath
,p.path as fullpath
,s.ShareName
,s.SharePath
,p.Member
,p.Access
From Paths As p
Left Outer Join Shares as s on
Replace(p.Path,'\\SERVERNAME\d$','D:') Like s.SharePath + '\%'
Up until now it has always been d$.
Today my needs have changed and I need the query to find ANY servername UNC path admin share regardless of share letter (c$, d$, e$, f$...etc) and replace it with it's respective drive letter (D:, E:, F:... etc).
My thought is replace function could find the $ and look one character to the left of it to get the proper share letter, then use that for the replace. The issue I have, not being a SQL professional, is that I know SQL can likley do what I need it to do...I just don't know how to get there. I've googled and found some examples, but haven't had any luck in getting them to work.
Any help would be greatly appreciated.
You can use a combination of STUFF, PATINDEX, LEN to get what you want.
Sample Query
DECLARE #ReplaceChar VARCHAR(100) = '[prefixcharacters]\\SERVERNAME\d$[postcharacter]'
DECLARE #SearchString VARCHAR(100) = '\\SERVERNAME\_$'
SELECT
STUFF(#ReplaceChar,PATINDEX('%' + #SearchString + '%',#ReplaceChar),LEN(#SearchString),
UPPER(SUBSTRING(#ReplaceChar,PATINDEX('%' + #SearchString + '%',#ReplaceChar) + LEN(#SearchString) - 2,1)) + ':') as searchpath
WHERE PATINDEX('%' + #SearchString + '%',#ReplaceChar) > 0
Output
[prefixcharacters]D:[postcharacter]
Alternate Query
You can shorten the query if you want to get the previous character before $ as per your title. Something like this
DECLARE #ReplaceChar VARCHAR(100) = '[prefixcharacters]\\SERVERNAME\d$[postcharacter]'
DECLARE #SearchString VARCHAR(100) = '\\SERVERNAME\_$'
SELECT
STUFF(#ReplaceChar,
PATINDEX('%'+#SearchString+'%',#ReplaceChar),
LEN(#SearchString),
UPPER(SUBSTRING(#ReplaceChar,CHARINDEX('$',#ReplaceChar) -1,1)) + ':')
WHERE PATINDEX('%'+#SearchString+'%',#ReplaceChar) > 0
In this query
STUFF replaces your pattern with with the character before $ + ':'
Start of pattern is identified by PATINDEX('%'+#SearchString+'%',#ReplaceChar)
D is identified by getting the charindex of '$' and then getting the previous character using SUBSTRING
What about ¸
select Replace(SUBSTRING(p.path, 14, Len(#spath)-14),'$',':') as searchpath
,p.path as fullpath
,s.ShareName
,s.SharePath
,p.Member
,p.Access
From Paths As p
Left Outer Join Shares as s on
Replace(SUBSTRING(p.path, 14, Len(#spath)-14),'$',':') Like s.SharePath + '\%
select as searchpath
DECLARE #str nvarchar (100)
SET #str = '\\SERVERNAME\d$'
IF #str LIKE '\\SERVERNAME\_$'
SET #str = UPPER(SUBSTRING(#str, 14, 1)) + ':'
SELECT #str
Starting from previous, something like
select UPPER(SUBSTRING(p.path, 14, 1)) + ':' as searchpath
,p.path as fullpath
,s.ShareName
,s.SharePath
,p.Member
,p.Access
From Paths As p
Left Outer Join Shares as s on
SUBSTRING(p.path, 14, 1) + ':' Like s.SharePath + '\%'
I am no mysql expert either :)
Based on the logic you mentioned in the last part of the question, I have used concat and substring to get to the drive letter in the column.
Hope this helps
select replace(path, concat(substring(path, 1, locate('$', path) - 2), substring(path, locate('$', path) - 1, 1) , '$'), concat(substring(path, locate('$', path) - 1, 1) , ':')) as searchpath ...
The remaining part of the query would be the same.

Query to fetch data between two characters in informix

I have a value in informix which is like this :
value AMOUNT: <15000000.00> USD
I need to fetch 15000000.00 afrom the above.
I am using this query to fetch the data between <> as workaround
select substring (value[15,40]
from 1 for length (value[15,40]) -5 )
from tablename p where value like 'AMOUNT%';
But, this is not generic as the lenght may vary.
Please help me with a generic query for this, fetch the data between <>.
The database I am using is Informix version 9.4.
It's a diabolical problem, created by whoever chose to break one of the fundamental rules of database design: that the content of a column should be a single, indivisible value.
The best solution would be to modify the table to contain a value_descr = "AMOUNT", a value = 15000000.00, and a value_type = "USD", and ensure that the incoming data is stored in that fashion. Easier said than done, I know.
Failing that, you'll have to write a UDR that parses the string and returns the numeric portion of it. This would be feasible in SPL, but probably very slow. Something along the lines of:
CREATE PROCEDURE extract_value (inp VARCHAR(255)) RETURNING DECIMAL;
DEFINE s SMALLINT;
DEFINE l SMALLINT;
DEFINE i SMALLINT;
FOR i = 1 TO LENGTH(inp)
IF SUBSTR(inp, i, 1) = "<" THEN
LET s = i + 1;
ELIF SUBSTR(inp, i, 1) = ">" THEN
LET l = i - s - 1;
RETURN SUBSTR(inp, s, l)::DECIMAL;
END IF;
END FOR;
RETURN NULL::DECIMAL; -- could not parse out number
END PROCEDURE;
... which you would execute thus:
SELECT extract_value(p.value)
FROM tablename AS p
WHERE p.value LIKE 'AMOUNT%'
NB: that procedure compiles and produces output in my limited testing on version 11.5. There is no validation done to ensure the string between the <> parses as a number. I don't have an instance of 9.4 handy, but I haven't used any features not available in 9.4 TTBOMK.

Natural (human alpha-numeric) sort in Microsoft SQL 2005

We have a large database on which we have DB side pagination. This is quick, returning a page of 50 rows from millions of records in a small fraction of a second.
Users can define their own sort, basically choosing what column to sort by. Columns are dynamic - some have numeric values, some dates and some text.
While most sort as expected text sorts in a dumb way. Well, I say dumb, it makes sense to computers, but frustrates users.
For instance, sorting by a string record id gives something like:
rec1
rec10
rec14
rec2
rec20
rec3
rec4
...and so on.
I want this to take account of the number, so:
rec1
rec2
rec3
rec4
rec10
rec14
rec20
I can't control the input (otherwise I'd just format in leading 000s) and I can't rely on a single format - some are things like "{alpha code}-{dept code}-{rec id}".
I know a few ways to do this in C#, but can't pull down all the records to sort them, as that would be to slow.
Does anyone know a way to quickly apply a natural sort in Sql server?
We're using:
ROW_NUMBER() over (order by {field name} asc)
And then we're paging by that.
We can add triggers, although we wouldn't. All their input is parametrised and the like, but I can't change the format - if they put in "rec2" and "rec10" they expect them to be returned just like that, and in natural order.
We have valid user input that follows different formats for different clients.
One might go rec1, rec2, rec3, ... rec100, rec101
While another might go: grp1rec1, grp1rec2, ... grp20rec300, grp20rec301
When I say we can't control the input I mean that we can't force users to change these standards - they have a value like grp1rec1 and I can't reformat it as grp01rec001, as that would be changing something used for lookups and linking to external systems.
These formats vary a lot, but are often mixtures of letters and numbers.
Sorting these in C# is easy - just break it up into { "grp", 20, "rec", 301 } and then compare sequence values in turn.
However there may be millions of records and the data is paged, I need the sort to be done on the SQL server.
SQL server sorts by value, not comparison - in C# I can split the values out to compare, but in SQL I need some logic that (very quickly) gets a single value that consistently sorts.
#moebius - your answer might work, but it does feel like an ugly compromise to add a sort-key for all these text values.
order by LEN(value), value
Not perfect, but works well in a lot of cases.
Most of the SQL-based solutions I have seen break when the data gets complex enough (e.g. more than one or two numbers in it). Initially I tried implementing a NaturalSort function in T-SQL that met my requirements (among other things, handles an arbitrary number of numbers within the string), but the performance was way too slow.
Ultimately, I wrote a scalar CLR function in C# to allow for a natural sort, and even with unoptimized code the performance calling it from SQL Server is blindingly fast. It has the following characteristics:
will sort the first 1,000 characters or so correctly (easily modified in code or made into a parameter)
properly sorts decimals, so 123.333 comes before 123.45
because of above, will likely NOT sort things like IP addresses correctly; if you wish different behaviour, modify the code
supports sorting a string with an arbitrary number of numbers within it
will correctly sort numbers up to 25 digits long (easily modified in code or made into a parameter)
The code is here:
using System;
using System.Data.SqlTypes;
using System.Text;
using Microsoft.SqlServer.Server;
public class UDF
{
[SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic=true)]
public static SqlString Naturalize(string val)
{
if (String.IsNullOrEmpty(val))
return val;
while(val.Contains(" "))
val = val.Replace(" ", " ");
const int maxLength = 1000;
const int padLength = 25;
bool inNumber = false;
bool isDecimal = false;
int numStart = 0;
int numLength = 0;
int length = val.Length < maxLength ? val.Length : maxLength;
//TODO: optimize this so that we exit for loop once sb.ToString() >= maxLength
var sb = new StringBuilder();
for (var i = 0; i < length; i++)
{
int charCode = (int)val[i];
if (charCode >= 48 && charCode <= 57)
{
if (!inNumber)
{
numStart = i;
numLength = 1;
inNumber = true;
continue;
}
numLength++;
continue;
}
if (inNumber)
{
sb.Append(PadNumber(val.Substring(numStart, numLength), isDecimal, padLength));
inNumber = false;
}
isDecimal = (charCode == 46);
sb.Append(val[i]);
}
if (inNumber)
sb.Append(PadNumber(val.Substring(numStart, numLength), isDecimal, padLength));
var ret = sb.ToString();
if (ret.Length > maxLength)
return ret.Substring(0, maxLength);
return ret;
}
static string PadNumber(string num, bool isDecimal, int padLength)
{
return isDecimal ? num.PadRight(padLength, '0') : num.PadLeft(padLength, '0');
}
}
To register this so that you can call it from SQL Server, run the following commands in Query Analyzer:
CREATE ASSEMBLY SqlServerClr FROM 'SqlServerClr.dll' --put the full path to DLL here
go
CREATE FUNCTION Naturalize(#val as nvarchar(max)) RETURNS nvarchar(1000)
EXTERNAL NAME SqlServerClr.UDF.Naturalize
go
Then, you can use it like so:
select *
from MyTable
order by dbo.Naturalize(MyTextField)
Note: If you get an error in SQL Server along the lines of Execution of user code in the .NET Framework is disabled. Enable "clr enabled" configuration option., follow the instructions here to enable it. Make sure you consider the security implications before doing so. If you are not the db admin, make sure you discuss this with your admin before making any changes to the server configuration.
Note2: This code does not properly support internationalization (e.g., assumes the decimal marker is ".", is not optimized for speed, etc. Suggestions on improving it are welcome!
Edit: Renamed the function to Naturalize instead of NaturalSort, since it does not do any actual sorting.
I know this is an old question but I just came across it and since it's not got an accepted answer.
I have always used ways similar to this:
SELECT [Column] FROM [Table]
ORDER BY RIGHT(REPLICATE('0', 1000) + LTRIM(RTRIM(CAST([Column] AS VARCHAR(MAX)))), 1000)
The only common times that this has issues is if your column won't cast to a VARCHAR(MAX), or if LEN([Column]) > 1000 (but you can change that 1000 to something else if you want), but you can use this rough idea for what you need.
Also this is much worse performance than normal ORDER BY [Column], but it does give you the result asked for in the OP.
Edit: Just to further clarify, this the above will not work if you have decimal values such as having 1, 1.15 and 1.5, (they will sort as {1, 1.5, 1.15}) as that is not what is asked for in the OP, but that can easily be done by:
SELECT [Column] FROM [Table]
ORDER BY REPLACE(RIGHT(REPLICATE('0', 1000) + LTRIM(RTRIM(CAST([Column] AS VARCHAR(MAX)))) + REPLICATE('0', 100 - CHARINDEX('.', REVERSE(LTRIM(RTRIM(CAST([Column] AS VARCHAR(MAX))))), 1)), 1000), '.', '0')
Result: {1, 1.15, 1.5}
And still all entirely within SQL. This will not sort IP addresses because you're now getting into very specific number combinations as opposed to simple text + number.
RedFilter's answer is great for reasonably sized datasets where indexing is not critical, however if you want an index, several tweaks are required.
First, mark the function as not doing any data access and being deterministic and precise:
[SqlFunction(DataAccess = DataAccessKind.None,
SystemDataAccess = SystemDataAccessKind.None,
IsDeterministic = true, IsPrecise = true)]
Next, MSSQL has a 900 byte limit on the index key size, so if the naturalized value is the only value in the index, it must be at most 450 characters long. If the index includes multiple columns, the return value must be even smaller. Two changes:
CREATE FUNCTION Naturalize(#str AS nvarchar(max)) RETURNS nvarchar(450)
EXTERNAL NAME ClrExtensions.Util.Naturalize
and in the C# code:
const int maxLength = 450;
Finally, you will need to add a computed column to your table, and it must be persisted (because MSSQL cannot prove that Naturalize is deterministic and precise), which means the naturalized value is actually stored in the table but is still maintained automatically:
ALTER TABLE YourTable ADD nameNaturalized AS dbo.Naturalize(name) PERSISTED
You can now create the index!
CREATE INDEX idx_YourTable_n ON YourTable (nameNaturalized)
I've also made a couple of changes to RedFilter's code: using chars for clarity, incorporating duplicate space removal into the main loop, exiting once the result is longer than the limit, setting maximum length without substring etc. Here's the result:
using System.Data.SqlTypes;
using System.Text;
using Microsoft.SqlServer.Server;
public static class Util
{
[SqlFunction(DataAccess = DataAccessKind.None, SystemDataAccess = SystemDataAccessKind.None, IsDeterministic = true, IsPrecise = true)]
public static SqlString Naturalize(string str)
{
if (string.IsNullOrEmpty(str))
return str;
const int maxLength = 450;
const int padLength = 15;
bool isDecimal = false;
bool wasSpace = false;
int numStart = 0;
int numLength = 0;
var sb = new StringBuilder();
for (var i = 0; i < str.Length; i++)
{
char c = str[i];
if (c >= '0' && c <= '9')
{
if (numLength == 0)
numStart = i;
numLength++;
}
else
{
if (numLength > 0)
{
sb.Append(pad(str.Substring(numStart, numLength), isDecimal, padLength));
numLength = 0;
}
if (c != ' ' || !wasSpace)
sb.Append(c);
isDecimal = c == '.';
if (sb.Length > maxLength)
break;
}
wasSpace = c == ' ';
}
if (numLength > 0)
sb.Append(pad(str.Substring(numStart, numLength), isDecimal, padLength));
if (sb.Length > maxLength)
sb.Length = maxLength;
return sb.ToString();
}
private static string pad(string num, bool isDecimal, int padLength)
{
return isDecimal ? num.PadRight(padLength, '0') : num.PadLeft(padLength, '0');
}
}
Here's a solution written for SQL 2000. It can probably be improved for newer SQL versions.
/**
* Returns a string formatted for natural sorting. This function is very useful when having to sort alpha-numeric strings.
*
* #author Alexandre Potvin Latreille (plalx)
* #param {nvarchar(4000)} string The formatted string.
* #param {int} numberLength The length each number should have (including padding). This should be the length of the longest number. Defaults to 10.
* #param {char(50)} sameOrderChars A list of characters that should have the same order. Ex: '.-/'. Defaults to empty string.
*
* #return {nvarchar(4000)} A string for natural sorting.
* Example of use:
*
* SELECT Name FROM TableA ORDER BY Name
* TableA (unordered) TableA (ordered)
* ------------ ------------
* ID Name ID Name
* 1. A1. 1. A1-1.
* 2. A1-1. 2. A1.
* 3. R1 --> 3. R1
* 4. R11 4. R11
* 5. R2 5. R2
*
*
* As we can see, humans would expect A1., A1-1., R1, R2, R11 but that's not how SQL is sorting it.
* We can use this function to fix this.
*
* SELECT Name FROM TableA ORDER BY dbo.udf_NaturalSortFormat(Name, default, '.-')
* TableA (unordered) TableA (ordered)
* ------------ ------------
* ID Name ID Name
* 1. A1. 1. A1.
* 2. A1-1. 2. A1-1.
* 3. R1 --> 3. R1
* 4. R11 4. R2
* 5. R2 5. R11
*/
ALTER FUNCTION [dbo].[udf_NaturalSortFormat](
#string nvarchar(4000),
#numberLength int = 10,
#sameOrderChars char(50) = ''
)
RETURNS varchar(4000)
AS
BEGIN
DECLARE #sortString varchar(4000),
#numStartIndex int,
#numEndIndex int,
#padLength int,
#totalPadLength int,
#i int,
#sameOrderCharsLen int;
SELECT
#totalPadLength = 0,
#string = RTRIM(LTRIM(#string)),
#sortString = #string,
#numStartIndex = PATINDEX('%[0-9]%', #string),
#numEndIndex = 0,
#i = 1,
#sameOrderCharsLen = LEN(#sameOrderChars);
-- Replace all char that have the same order by a space.
WHILE (#i <= #sameOrderCharsLen)
BEGIN
SET #sortString = REPLACE(#sortString, SUBSTRING(#sameOrderChars, #i, 1), ' ');
SET #i = #i + 1;
END
-- Pad numbers with zeros.
WHILE (#numStartIndex <> 0)
BEGIN
SET #numStartIndex = #numStartIndex + #numEndIndex;
SET #numEndIndex = #numStartIndex;
WHILE(PATINDEX('[0-9]', SUBSTRING(#string, #numEndIndex, 1)) = 1)
BEGIN
SET #numEndIndex = #numEndIndex + 1;
END
SET #numEndIndex = #numEndIndex - 1;
SET #padLength = #numberLength - (#numEndIndex + 1 - #numStartIndex);
IF #padLength < 0
BEGIN
SET #padLength = 0;
END
SET #sortString = STUFF(
#sortString,
#numStartIndex + #totalPadLength,
0,
REPLICATE('0', #padLength)
);
SET #totalPadLength = #totalPadLength + #padLength;
SET #numStartIndex = PATINDEX('%[0-9]%', RIGHT(#string, LEN(#string) - #numEndIndex));
END
RETURN #sortString;
END
I know this is a bit old at this point, but in my search for a better solution, I came across this question. I'm currently using a function to order by. It works fine for my purpose of sorting records which are named with mixed alpha numeric ('item 1', 'item 10', 'item 2', etc)
CREATE FUNCTION [dbo].[fnMixSort]
(
#ColValue NVARCHAR(255)
)
RETURNS NVARCHAR(1000)
AS
BEGIN
DECLARE #p1 NVARCHAR(255),
#p2 NVARCHAR(255),
#p3 NVARCHAR(255),
#p4 NVARCHAR(255),
#Index TINYINT
IF #ColValue LIKE '[a-z]%'
SELECT #Index = PATINDEX('%[0-9]%', #ColValue),
#p1 = LEFT(CASE WHEN #Index = 0 THEN #ColValue ELSE LEFT(#ColValue, #Index - 1) END + REPLICATE(' ', 255), 255),
#ColValue = CASE WHEN #Index = 0 THEN '' ELSE SUBSTRING(#ColValue, #Index, 255) END
ELSE
SELECT #p1 = REPLICATE(' ', 255)
SELECT #Index = PATINDEX('%[^0-9]%', #ColValue)
IF #Index = 0
SELECT #p2 = RIGHT(REPLICATE(' ', 255) + #ColValue, 255),
#ColValue = ''
ELSE
SELECT #p2 = RIGHT(REPLICATE(' ', 255) + LEFT(#ColValue, #Index - 1), 255),
#ColValue = SUBSTRING(#ColValue, #Index, 255)
SELECT #Index = PATINDEX('%[0-9,a-z]%', #ColValue)
IF #Index = 0
SELECT #p3 = REPLICATE(' ', 255)
ELSE
SELECT #p3 = LEFT(REPLICATE(' ', 255) + LEFT(#ColValue, #Index - 1), 255),
#ColValue = SUBSTRING(#ColValue, #Index, 255)
IF PATINDEX('%[^0-9]%', #ColValue) = 0
SELECT #p4 = RIGHT(REPLICATE(' ', 255) + #ColValue, 255)
ELSE
SELECT #p4 = LEFT(#ColValue + REPLICATE(' ', 255), 255)
RETURN #p1 + #p2 + #p3 + #p4
END
Then call
select item_name from my_table order by fnMixSort(item_name)
It easily triples the processing time for a simple data read, so it may not be the perfect solution.
Here is an other solution that I like:
http://www.dreamchain.com/sql-and-alpha-numeric-sort-order/
It's not Microsoft SQL, but since I ended up here when I was searching for a solution for Postgres, I thought adding this here would help others.
EDIT: Here is the code, in case the link goes away.
CREATE or REPLACE FUNCTION pad_numbers(text) RETURNS text AS $$
SELECT regexp_replace(regexp_replace(regexp_replace(regexp_replace(($1 collate "C"),
E'(^|\\D)(\\d{1,3}($|\\D))', E'\\1000\\2', 'g'),
E'(^|\\D)(\\d{4,6}($|\\D))', E'\\1000\\2', 'g'),
E'(^|\\D)(\\d{7}($|\\D))', E'\\100\\2', 'g'),
E'(^|\\D)(\\d{8}($|\\D))', E'\\10\\2', 'g');
$$ LANGUAGE SQL;
"C" is the default collation in postgresql; you may specify any collation you desire, or remove the collation statement if you can be certain your table columns will never have a nondeterministic collation assigned.
usage:
SELECT * FROM wtf w
WHERE TRUE
ORDER BY pad_numbers(w.my_alphanumeric_field)
For the following varchar data:
BR1
BR2
External Location
IR1
IR2
IR3
IR4
IR5
IR6
IR7
IR8
IR9
IR10
IR11
IR12
IR13
IR14
IR16
IR17
IR15
VCR
This worked best for me:
ORDER BY substring(fieldName, 1, 1), LEN(fieldName)
If you're having trouble loading the data from the DB to sort in C#, then I'm sure you'll be disappointed with any approach at doing it programmatically in the DB. When the server is going to sort, it's got to calculate the "perceived" order just as you would have -- every time.
I'd suggest that you add an additional column to store the preprocessed sortable string, using some C# method, when the data is first inserted. You might try to convert the numerics into fixed-width ranges, for example, so "xyz1" would turn into "xyz00000001". Then you could use normal SQL Server sorting.
At the risk of tooting my own horn, I wrote a CodeProject article implementing the problem as posed in the CodingHorror article. Feel free to steal from my code.
Simply you sort by
ORDER BY
cast (substring(name,(PATINDEX('%[0-9]%',name)),len(name))as int)
##
I've just read a article somewhere about such a topic. The key point is: you only need the integer value to sort data, while the 'rec' string belongs to the UI. You could split the information in two fields, say alpha and num, sort by alpha and num (separately) and then showing a string composed by alpha + num. You could use a computed column to compose the string, or a view.
Hope it helps
You can use the following code to resolve the problem:
Select *,
substring(Cote,1,len(Cote) - Len(RIGHT(Cote, LEN(Cote) - PATINDEX('%[0-9]%', Cote)+1)))alpha,
CAST(RIGHT(Cote, LEN(Cote) - PATINDEX('%[0-9]%', Cote)+1) AS INT)intv
FROM Documents
left outer join Sites ON Sites.IDSite = Documents.IDSite
Order BY alpha, intv
regards,
rabihkahaleh#hotmail.com
I'm fashionably late to the party as usual. Nevertheless, here is my attempt at an answer that seems to work well (I would say that). It assumes text with digits at the end, like in the original example data.
First a function that won't end up winning a "pretty SQL" competition anytime soon.
CREATE FUNCTION udfAlphaNumericSortHelper (
#string varchar(max)
)
RETURNS #results TABLE (
txt varchar(max),
num float
)
AS
BEGIN
DECLARE #txt varchar(max) = #string
DECLARE #numStr varchar(max) = ''
DECLARE #num float = 0
DECLARE #lastChar varchar(1) = ''
set #lastChar = RIGHT(#txt, 1)
WHILE #lastChar <> '' and #lastChar is not null
BEGIN
IF ISNUMERIC(#lastChar) = 1
BEGIN
set #numStr = #lastChar + #numStr
set #txt = Substring(#txt, 0, len(#txt))
set #lastChar = RIGHT(#txt, 1)
END
ELSE
BEGIN
set #lastChar = null
END
END
SET #num = CAST(#numStr as float)
INSERT INTO #results select #txt, #num
RETURN;
END
Then call it like below:
declare #str nvarchar(250) = 'sox,fox,jen1,Jen0,jen15,jen02,jen0004,fox00,rec1,rec10,jen3,rec14,rec2,rec20,rec3,rec4,zip1,zip1.32,zip1.33,zip1.3,TT0001,TT01,TT002'
SELECT tbl.value --, sorter.txt, sorter.num
FROM STRING_SPLIT(#str, ',') as tbl
CROSS APPLY dbo.udfAlphaNumericSortHelper(value) as sorter
ORDER BY sorter.txt, sorter.num, len(tbl.value)
With results:
fox
fox00
Jen0
jen1
jen02
jen3
jen0004
jen15
rec1
rec2
rec3
rec4
rec10
rec14
rec20
sox
TT01
TT0001
TT002
zip1
zip1.3
zip1.32
zip1.33
I still don't understand (probably because of my poor English).
You could try:
ROW_NUMBER() OVER (ORDER BY dbo.human_sort(field_name) ASC)
But it won't work for millions of records.
That why I suggested to use trigger which fills separate column with human value.
Moreover:
built-in T-SQL functions are really
slow and Microsoft suggest to use
.NET functions instead.
human value is constant so there is no point calculating it each time
when query runs.

Resources