Natural Sort in NHibernate - sql-server

I've got an NHibernate query that returns a list of objects, and I want to order by a title. This is a user-maintained field, and some of our customers like prefixing their titles with numbers; this isn't something I can control. The data is something like this:
- 1 first thing
- 2 second thing
- 5 fifth thing
- 10 tenth thing
- 20 twentieth thing
- A thing with no number
A traditional
.AddOrder(Order.Asc("Name"))
is resulting in a textual sort:
- 1 first thing
- 10 tenth thing
- 2 second thing
- 20 twentieth thing
- 5 fifth thing
- A thing with no number
This is correct, as it's an nvarchar field, but is there any way I can get the numbers to sort numerically as well?
There seem to be several workarounds involving prefixing all values with leading 0s and the like, but do any of these work through NHibernate?
This application runs interchangeably on Oracle and MsSQL.
Cheers,
Matt

You might consider creating a custom sort order implementation that allows you to specify the collation on the column to achieve your desired results. Something along the lines of:
ORDER BY Name COLLATE Latin1_General_BIN ASC
But with an appropriate collation (I don't believe Latin1_General_BIN is specifically what you need).

Add an additional field (e.g. named Name_Number) to the object definition and populate it on the server side.
For Oracle, populate this field with to_number(Name) as Name_Number.
For MS SQL, populate it with cast(Name as numeric) as Name_Number.
Then sort from the client side with .AddOrder(Order.Asc("Name_Number"))
P.S. I am not sure about this, because I don't have enough experience with Hibernate/NHibernate and dislike ORMs in general :)
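If the number is followed by text (as in "10 tenth thing"), a plain to_number/cast will error, so you would need to extract and zero-pad the leading digits first. Below is a rough sketch of the SQL Server flavour of that ordering (the Items table name is a placeholder, and Oracle would need LPAD/REGEXP_SUBSTR instead); you could surface it through a formula-mapped property or native SQL rather than AddOrder:
SELECT Name
FROM Items   -- placeholder table name
ORDER BY
    CASE WHEN Name LIKE '[0-9]%'
         THEN RIGHT(REPLICATE('0', 10) + LEFT(Name, PATINDEX('%[^0-9]%', Name + ' ') - 1), 10)
         ELSE Name
    END,
    Name;    -- secondary sort for titles sharing the same numeric prefix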

Related

Word popularity leaderboard in SQL Server based message-board

In a SQL server database, I have a table Messages with the following columns:
Id INT IDENTITY(1,1)
Detail VARCHAR(5000)
DatetimeEntered DATETIME
PersonEntered VARCHAR(25)
Messages are pretty basic, and only allow alphanumeric characters and a handful of special characters, which are as follows:
`¬!"£$%^&*()-_=+[{]};:'##~\|,<.>/?
Ignoring the bulk of the special characters bar the apostrophe, what I need is a way to list each word along with how many times the word occurs in the Detail column, which I can then filter by PersonEntered and DatetimeEntered.
Example output:
Word Frequency
-----------------
a 11280
the 10102
and 8845
when 2024
don't 2013
.
.
.
It doesn't need to be particularly clever. It is perfectly fine if dont and don't are treated as separate words.
I'm having trouble splitting out the words into a temporary table called #Words.
Once I have a temporary table, I would apply the following query:
SELECT
    Word,
    COUNT(*) AS WordCount
FROM #Words
GROUP BY Word
ORDER BY COUNT(*) DESC
Please help.
Personally, I would strip out almost all the special characters, and then use a splitter on the space character. Of your permitted characters, only ' is going to appear in a word; anything else is going to be grammatical.
You haven't posted what version of SQL Server you're using, so I'm going to use SQL Server 2017 syntax. If you don't have the latest version, you'll need to replace TRANSLATE with a nested REPLACE (so REPLACE(REPLACE(REPLACE(REPLACE(... REPLACE(M.Detail, '¬',' '),...),'/',' '),'?',' ')) and find a string splitter (for example, Jeff Moden's DelimitedSplit8K).
USE Sandbox;
GO
CREATE TABLE [Messages] (Detail varchar(5000));
INSERT INTO [Messages]
VALUES ('Personally, I would strip out almost all the special characters, and then use a splitter on the space character. Of your permitted characters, only `''` is going to appear in a word; anything else is going to be grammatical. You haven''t posted what version of SQL you''re using, so I''ve going to use SQL Server 2017 syntax. If you don''t have the latest version, you''ll need to replace `TRANSLATE` with a nested `REPLACE` (So `REPLACE(REPLACE(REPLACE(REPLACE(... REPLACE(M.Detail, ''¬'','' ''),...),''/'','' ''),''?'','' '')`, and find a string splitter (for example, Jeff Moden''s [DelimitedSplit8K](http://www.sqlservercentral.com/articles/Tally+Table/72993/)).'),
('As a note, this is going to perform **AWFULLY**. SQL Server is not designed for this type of work. I also imagine you''ll get some odd results and it''ll include numbers in there. Things like dates are going to get split out,, numbers like `9,000,000` would be treated as the words `9` and `000`, and hyperlinks will be separated.')
GO
WITH Replacements AS (
    SELECT TRANSLATE(Detail,
                     '`¬!"£$%^&*()-_=+[{]};:##~\|,<.>/?',
                     REPLICATE(' ', 33)) AS StrippedDetail -- TRANSLATE needs both character lists to be the same length; 33 characters are being stripped
    FROM [Messages] M)
SELECT SS.[value], COUNT(*) AS WordCount
FROM Replacements R
CROSS APPLY string_split(R.StrippedDetail, ' ') SS
WHERE LEN(SS.[value]) > 0
GROUP BY SS.[value]
ORDER BY WordCount DESC;
GO
DROP TABLE [Messages];
As a note, this is going to perform AWFULLY. SQL Server is not designed for this type of work. I also imagine you'll get some odd results, and it'll include numbers in there. Things like dates are going to get split out, numbers like 9,000,000 would be treated as the words 9 and 000, and hyperlinks will be separated.

Fulltext index search has large number of page reads

I have a full-text index on a column in a table that contains data like this:
searchColumn
90210 Brooks Diana Miami FL diana.brooks@email.com 5612233395
The column is an aggregate of Zip, last name, first name, city, state, e-mail and phone number.
I use this column to search for a customer based on any of this possible information.
The issue I am worried about is with the high number of reads that occurs when doing a query on this column. The query I am using is:
declare @searchTerm varchar(100) = ' "FL" AND "90210*" AND "Diana*" AND "Brooks*" '
select *
from CustomerInformation c
where contains(c.searchColumn, @searchTerm)
Now, when running Profiler I can see that this search performs about 50,000 page reads to return a single row, as opposed to a different approach using regular indexes and multiple variables, broken down into @FirstName and @LastName, like below:
WHERE C.FirstName like coalesce(@FirstName + '%' , C.FirstName)
AND C.LastName like coalesce(@LastName + '%' , C.LastName)
etc.
Using this approach I get only around 140 page reads. I know the approaches are quite different, but I'm trying to understand why the full-text version does so many more reads, and whether there is any way I can bring that down to something closer to the numbers I get with regular indexes.
I have a couple of thoughts on this. First, the SELECT * will generate a great number of page reads because it has to pull every column, whether or not it is indexed. When you pull every column, it most likely will not make use of the best index plan out there.
As to your WHERE clause: when using @searchTerm with the value ' "FL" AND "90210*" AND "Diana*" AND "Brooks*" ', it has to check the data pages multiple times each time it is run. Think of how you would look up this information if you had to do it by hand. You look at a piece of paper with the info on it and see if the search column contains FL. Now, does it contain FL and 90210*? Now, does it contain both of those plus Diana*? And so on.
You can see why it would keep having to go back to the page and read it over and over again. The second query only has to look at two narrowly defined columns.
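If you want to measure this yourself, STATISTICS IO reports the page reads for each variant, and trimming the select list down from * is the easiest first step (any column names other than searchColumn are placeholders):
SET STATISTICS IO ON;

DECLARE @searchTerm varchar(100) = ' "FL" AND "90210*" AND "Diana*" AND "Brooks*" ';

SELECT c.searchColumn                        -- list only the columns you actually need instead of *
FROM CustomerInformation c
WHERE CONTAINS(c.searchColumn, @searchTerm);

SET STATISTICS IO OFF;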
If you want more information on this, I would suggest a class by Brent Ozar that is free right now.
How to think like the SQL Server Engine
I hope that helps.

Sql Server 2008 - FullText rounding money values?

Let's assume we have a full-text indexed table with these records:
blabla bla bla 101010,65 blabla bla bla
blabla bla bla 1012344,34 blabla bla bla
(The decimal separator in Portuguese is "," not "." as in English)
When we execute a query like:
where contains(field, "101011") or
where contains(field, "1012344")
The full text engine is returning those records because it seems to me that it is rounding the numbers as:
101010,65 becomes 101011
1012344,34 becomes 1012344
Is there any way of avoiding that?
EDIT
Sorry, I forgot to say that the column is a varchar(max) column, not a currency column. This is happening in this field when it contains a float-like value, despite the fact that it is a varchar column.
EDIT2
This is not the only data I have in my column. Numbers like those appears frequently on my indexed texts. It is not concatenated. As I said, this is part of the original text and I have done nothing to the original text. I guess this is a behavior of the word breaker, but who knows for sure?
EDIT:
< Ignore >
The reason you are seeing this behaviour is that the default word breakers for SQL full-text search are defined by the English language (locale 1033). In English, a comma is a valid word breaker, thereby breaking your number into two different numbers. However, if you use the Portuguese word breaker, FTS quite cleverly keeps the number together. Try running the following query on your SQL Server to see how the full-text engine parses the same input differently depending on the locale specified:
--use locale English
select * from sys.dm_fts_parser('"12345,10"',1033,NULL,0)
--use locale Portuguese
select * from sys.dm_fts_parser('"12345,10"',2070,NULL,0)
< /Ignore >
UPDATE:
Alright, I have managed to replicate your scenario, and yes, it does seem to be default behaviour with SQL Server FTS. However, it only seems to round up to the nearest 1/10th of the number (the nearest 10 centavos in your case), and NOT to the nearest whole number.
So, for example, 12345,88 would be returned in searches for both 12345,88 and 12345,9, while 56789,98 would appear in searches for both 56789,98 and 56790. However, a number such as 45678,60 will remain intact with no rounding up or down, so it's not as bad as you think.
Not sure if there is anything you can do to change this behaviour though. A quick search on Google returned nothing.
My suggestion would be to not use the MONEY data type in the first place. All it buys you is a little formatting ease (which you should be doing at the presentation layer anyway), but it brings other complications and inflexibility. I'm not sure DECIMAL/NUMERIC would solve this particular issue, as I'm not a full-text guy, but I try to steer people away from problematic data types like MONEY whenever I can. See this previous question for lots of discussion about this: Should you choose the MONEY or DECIMAL(x,y) datatypes in SQL Server?

SQL LIKE Operator doesn't work with Asian Languages (SQL Server 2008)

Dear Friends,
I've run into a problem I never thought I'd face. It seems very simple, but I can't find a solution to it.
I have a SQL Server database column that is of type nvarchar and is filled with standard Persian characters. When I try to run a very simple query on it that uses the LIKE operator, the result set comes back empty, although I know the search term is present in the table. Here is the very simple example query that doesn't act correctly:
SELECT * FROM T_Contacts WHERE C_ContactName LIKE '%ف%'
ف is a Persian character, and the C_ContactName column contains multiple entries which contain that character.
Please tell me how should I rewrite the expression or what change should I apply. Note that my database's collation is SQL_Latin1_General_CP1_CI_AS.
Thank you very much
Also, if those values are stored as NVARCHAR (which I hope they are!!), you should always use the N'..' prefix for any string literals to make sure you don't get any unwanted conversions back to non-Unicode VARCHAR.
So you should be searching:
SELECT * FROM T_Contacts
WHERE C_ContactName COLLATE Persian_100_CI_AS LIKE N'%ف%'
Shouldn't it be:
SELECT * FROM T_Contacts WHERE C_ContactName LIKE N'%ف%'
i.e., with the N in front of the string being compared, so it's treated as an nvarchar?

Ordering numbers that are stored as strings in the database

I have a bunch of records in several tables in a database that have a "process number" field, that's basically a number, but I have to store it as a string both because of some legacy data that has stuff like "89a" as a number and some numbering system that requires that process numbers be represented as number/year.
The problem arises when I try to order the processes by number. I get stuff like:
1
10
11
12
And the other problem is when I need to add a new process. The new process' number should be the biggest existing number incremented by one, and for that I would need a way to order the existing records by number.
Any suggestions?
Maybe this will help.
Essentially:
SELECT process_order FROM your_table ORDER BY process_order + 0 ASC
Can you store the numbers as zero padded values? That is, 01, 10, 11, 12?
I would suggest creating a new numeric field used only for ordering and updating it from a trigger.
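A rough sketch of that idea, assuming SQL Server 2012+ for TRY_CAST and a table named processes with an id primary key (both names are illustrative):
ALTER TABLE processes ADD process_order int NULL;
GO
CREATE TRIGGER trg_processes_set_order
ON processes
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- keep the numeric ordering column in sync with the leading digits of the string
    UPDATE p
    SET process_order = TRY_CAST(LEFT(i.process_number,
                                      PATINDEX('%[^0-9]%', i.process_number + 'x') - 1) AS int)
    FROM processes p
    INNER JOIN inserted i ON i.id = p.id;
END;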
Can you split the data into two fields?
Store the 'process number' as an int and the 'process subtype' as a string.
That way:
• you can easily get the MAX processNumber and increment it when you need to generate a new number
• you can ORDER BY processNumber ASC, processSubtype ASC to get the correct order, even if multiple records have the same base number with different years/letters appended
• when you need the 'full' number you can just concatenate the two fields
Would that do what you need?
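A minimal sketch of that layout and the two queries it enables (all names are illustrative):
CREATE TABLE process_records (
    processNumber  int         NOT NULL,
    processSubtype varchar(10) NOT NULL DEFAULT ''   -- e.g. '89a' -> 89 + 'a', '15/2009' -> 15 + '/2009'
);

-- next number to assign:
SELECT MAX(processNumber) + 1 FROM process_records;

-- correct ordering, with the full number reassembled for display:
SELECT processNumber, processSubtype,
       CAST(processNumber AS varchar(10)) + processSubtype AS fullNumber
FROM process_records
ORDER BY processNumber ASC, processSubtype ASC;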
Given that your process numbers don't seem to follow any fixed patterns (from your question and comments), can you construct/maintain a process number table that has two fields:
create table process_ordering ( processNumber varchar(N), processOrder int )
Then select all the process numbers from your tables and insert into the process number table. Set the ordering however you want based on the (varying) process number formats. Join on this table, order by processOrder and select all fields from the other table. Index this table on processNumber to make the join fast.
select my_processes.*
from my_processes
inner join process_ordering on my_processes.processNumber = process_ordering.processNumber
order by process_ordering.processOrder
It seems to me that you have two tasks here.
• Convert the strings to numbers by legacy format / strip off the junk
• Order the numbers
If you have a practical way of introducing string-parsing regular expressions into your process (and your issue has enough volume to be worth the effort), then I'd
• Create a reference table such as
CREATE TABLE tblLegacyFormatRegularExpressionMaster(
LegacyFormatId int,
LegacyFormatName varchar(50),
RegularExpression varchar(max)
)
• Then, with a way of invoking the regular expressions, such as the CLR integration in SQL Server 2005 and above (the .NET Common Language Runtime integration that allows calls to compiled .NET methods from within SQL Server as ordinary, Microsoft-extended T-SQL), you should be able to solve your problem.
• See
http://www.codeproject.com/KB/string/SqlRegEx.aspx
I apologize if this is way too much overhead for your problem at hand.
Suggestion:
• Make your column a fixed width text (i.e. CHAR rather than VARCHAR).
• Pad the existing values with enough leading zeros to fill the column, plus a trailing space where the values do not end in 'a' (or whatever).
• Add a CHECK constraint (or equivalent) to ensure new values conform to the pattern e.g. something like
CHECK (process_number LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][ab ]')
• In your insert/update stored procedures (or equivalent), pad any incoming values to fit the pattern.
• Remove the leading/trailing zeros/spaces as appropriate when displaying the values to humans.
Another advantage of this approach is that the incoming values '1', '01', '001', etc would all be considered to be the same value and could be covered by a simple unique constraint in the DBMS.
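A hedged sketch of the padding step described above, assuming a width of six digits plus one optional suffix character:
DECLARE @incoming varchar(10) = '89a';
DECLARE @numPart  varchar(10) = LEFT(@incoming, PATINDEX('%[^0-9]%', @incoming + 'x') - 1);
DECLARE @suffix   varchar(10) = SUBSTRING(@incoming, LEN(@numPart) + 1, 10);

SELECT RIGHT('000000' + @numPart, 6)                     -- zero-pad the numeric part to six digits
     + CASE WHEN @suffix = '' THEN ' ' ELSE @suffix END  -- pad a trailing space when there is no suffix
       AS padded_process_number;                         -- => '000089a'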
BTW, I like the idea of splitting the trailing 'a' (or whatever) into a separate column; however, I got the impression the data element in question is an identifier, and therefore it would not be appropriate to split it.
You need to cast your field as you're selecting. I'm basing this syntax on MySQL - but the idea's the same:
select * from table order by cast(field AS UNSIGNED);
Of course UNSIGNED could be SIGNED if required.
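On SQL Server (2012 and later) a comparable form is TRY_CAST, which yields NULL instead of an error for the non-numeric values (table and column names are placeholders):
SELECT *
FROM your_table
ORDER BY TRY_CAST(process_number AS int),   -- non-numeric values all become NULL and sort first
         process_number;                    -- raw string as a tie-breaker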
