Full Text not indexing varbinary column (with html) - sql-server

I have a table with HTML data that I want to search using a full-text index via the HTML filter.
So I created a catalog and an index:
CREATE FULLTEXT CATALOG myCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON myTable (Body TYPE COLUMN Filetype)
KEY INDEX PK_myTable;
Body is a varbinary(max) column with HTML. Filetype is a computed column that returns '.html'.
No results are being returned.
I verified that the .html filter is installed. Full-text indexing is also installed properly and works fine if I convert the column to nvarchar and create just a "plain text" index (not HTML).
No errors in the SQL log or FTS log.
The keywords table is just empty!
SELECT *
FROM sys.dm_fts_index_keywords
(DB_ID('myDatabase'), OBJECT_ID('myTable'))
All it returns is the "END OF FILE" symbol.
It says "document count 35", which means the documents were processed, but no keywords were extracted.
PS. I have SQL Server 2012 Express Edition (with all advanced features, including full text). Can this be the reason? But again, the "plain" full-text search works just fine!
PPS. I asked a coworker to test this on SQL Express 2016 - same result... Tried on our production server (Enterprise edition) - same.
UPDATE
OK, it turns out the full-text index DOES NOT SUPPORT UNICODE in varbinary columns!! When I converted the column to non-Unicode (by converting it to nvarchar, then to varchar, and then back to varbinary), it started working.
Anyone knows any workarounds?

OK, so it turns out the full-text index DOES support Unicode data in varbinary, but pay attention to this:
If your varbinary column is created from nvarchar data, be sure to include the 0xFFFE Unicode signature (the UTF-16 LE byte-order mark) at the beginning.
For example, I'm using a computed column for full text index, so I had to change my computed column to this:
ALTER TABLE myTable
ADD FTS_Body AS 0xFFFE + CAST(HtmlBody AS VARBINARY(MAX))
-- HtmlBody is my nvarchar column that contains HTML
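Putting the pieces together, a minimal sketch of the working setup (the primary key index name PK_myTable and the existing Filetype computed column returning '.html' are taken from the question; the DROP/CREATE sequence is an assumption about how you'd switch the index to the new column):

```sql
-- FTS_Body prefixes the UTF-16 LE signature so the HTML filter detects Unicode.
ALTER TABLE myTable
    ADD FTS_Body AS 0xFFFE + CAST(HtmlBody AS VARBINARY(MAX));

-- Rebuild the full-text index against the new column.
DROP FULLTEXT INDEX ON myTable;
CREATE FULLTEXT INDEX ON myTable (FTS_Body TYPE COLUMN Filetype)
    KEY INDEX PK_myTable;

-- Once the crawl completes, keywords should appear:
SELECT display_term, document_count
FROM sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID('myTable'));
```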

Related

DB collation VS Column collation when INSERTing

I've created two demo DBs.
Server Collation - Hebrew_CI_AS
DB1 Collation - Hebrew_CI_AS
DB2 Collation - Latin1_General_CS_AS.
In DB2 I have one column with Hebrew_CI_AS collation. I'm trying to insert Hebrew text into that column. The data type is nvarchar(250).
This is the sample script:
INSERT INTO [Table] (HebCol)
VALUES('1בדיקה')
When I run this on DB1, everything works fine.
On DB2, although the column has the Hebrew collation, I get question marks instead of the Hebrew text.
Why is the result different if the collation is identical?
P.S: I cannot add N before the text. In the real world an app is doing the inserts.
When using literal strings, the collation used is that of the database, not the destination column. As the collation of the database you are inserting into is Latin1_General_CS_AS, most of the characters in the literal string '1בדיקה' are outside that collation's code page; thus you get ? for those characters, as they are unknown.
As such, there are only two solutions to stop the ? appearing in the column:
Fix your application and define your literal string(s) as nvarchar, not varchar; you are, after all, storing an nvarchar, so it makes sense to pass a literal nvarchar.
Change the collation of your database to be the same as your other database, Hebrew_CI_AS.
Technically there is a 3rd, which is use a UTF-8 collation if you are on SQL Server 2019, but such collations come with caveats that I don't think are in scope of this question.
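A minimal repro of the behavior (the table variable and names are illustrative):

```sql
-- Run in a database whose collation is Latin1_General_CS_AS.
DECLARE @t TABLE (HebCol NVARCHAR(250) COLLATE Hebrew_CI_AS);

INSERT INTO @t VALUES ('1בדיקה');   -- varchar literal: interpreted in the database's
                                    -- code page first, so the Hebrew becomes '?'
INSERT INTO @t VALUES (N'1בדיקה');  -- nvarchar literal: stored intact

SELECT HebCol FROM @t;
```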

SQL Server Full-Text search Empty catalog

I'm trying to get a full-text search running on SQL Server in Azure.
My data is in a varbinary(max) column, with all rows containing data. The data is strings of HTML.
The SearchableData column is computed and filled using:
CONVERT(VARBINARY(MAX),[Title] + [Body])
Doing a SELECT and converting back yields the data.
I would like to utilize the built-in HTML filter of SQL Server.
If I do the following, I can search and everything works, but without a filter:
CREATE FULLTEXT INDEX
ON ArticleContent (Body LANGUAGE 0, Title LANGUAGE 0)
KEY INDEX PK_ArticleContent ON AcademyFTS
WITH (STOPLIST = SYSTEM, CHANGE_TRACKING AUTO)
However, I want to be able to utilize the .html filtering.
I've created the following:
CREATE FULLTEXT CATALOG AcademyFTS WITH ACCENT_SENSITIVITY = OFF AS DEFAULT
and
CREATE FULLTEXT INDEX
ON ArticleContent (SearchableData TYPE COLUMN FileExtension LANGUAGE 0)
KEY INDEX PK_ArticleContent ON AcademyFTS
WITH (STOPLIST = SYSTEM, CHANGE_TRACKING AUTO)
However, the catalog is empty and I don't get any results back from a simple search:
SELECT *
FROM ArticleContent
WHERE FREETEXT(SearchableData, 'wiki')
I've been using these two guides:
Hands on Full-Text Search in SQL Server
How to implement a full-text search on HTML documents with Microsoft SQL Server
I found the answer!
You can't full-text search on a computed column! :)
Microsoft Docs
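One workaround consistent with that finding is to keep the searchable bytes in a real (non-computed) column and maintain it yourself; a sketch under that assumption, reusing the names from the question (the application, or a trigger, would have to keep the column in sync):

```sql
-- Replace the computed column with a real varbinary column.
ALTER TABLE ArticleContent DROP COLUMN SearchableData;
ALTER TABLE ArticleContent ADD SearchableData VARBINARY(MAX) NULL;

-- 0xFFFE is the UTF-16 LE signature; it lets the word breaker detect
-- Unicode in the varbinary data (see the first question above).
UPDATE ArticleContent
SET SearchableData = 0xFFFE + CAST([Title] + [Body] AS VARBINARY(MAX));

CREATE FULLTEXT INDEX
    ON ArticleContent (SearchableData TYPE COLUMN FileExtension LANGUAGE 0)
    KEY INDEX PK_ArticleContent ON AcademyFTS
    WITH (STOPLIST = SYSTEM, CHANGE_TRACKING AUTO);
```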

SQL Server nchar column data equality

I have a strange problem. In my SQL Server database there is a table containing an nchar(8) column. I have inserted several rows with different Unicode data in that column.
Now when I run a query like this:
select *
from table
where ncharColumnName = N'㜲㤵㠱㠷㔳'
It gives me a row containing '㤱㔱㄰〴㐰' as the Unicode data in the nchar(8) column.
How does SQL Server compare unicode data?
You should configure your column (or entire database) collation so that the equality and LIKE operators work as expected - Simplified Chinese or whatever fits your data.
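One way to check whether the stored values really differ is to force a binary collation in the comparison, which compares code points directly (a sketch; the table and column names are illustrative):

```sql
SELECT *
FROM myTable
WHERE ncharColumnName = N'㜲㤵㠱㠷㔳' COLLATE Latin1_General_BIN2;  -- code-point-wise equality
```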

Inserting Arabic characters into SQL Server database from Grails

I have a Grails application that is intended to insert Arabic characters from the interface into a SQL Server database. First I tried a field of type varchar, but the text appeared as ???????. Then I found that the field type should be nvarchar, but with that change the application simply fails to insert any row, giving the following error:
java.sql.SQLException: Cannot insert the value NULL into column 'id', table 'smsCampaigner.dbo.message'; column does not allow nulls. INSERT fails.
I know this is a constraint error - id cannot be null - but this was not the case before changing the field type. Now I want to know whether we can specify from Config.groovy which encoding should be used for the database? Any other kind of help will be appreciated.
Here you go:
When you create the domain class, use the String type for that column, and that's it. You can insert any character and it will be accepted.
For example, TestDomain.groovy:
class TestDomain {
String column1
}
In MySQL you can set the collation of the database/table/column to utf8_unicode_ci, and if I am not wrong that works for saving Arabic characters. Not sure how to do it with SQL Server, but it should be something similar.
Change the collation of the database, tables, and columns to Arabic_CI_AS.
In SQL Server Management Studio (SSMS), execute:
ALTER DATABASE DB06 COLLATE Arabic_CI_AS
where DB06 is the name of the database.
Right-click on the table, then click Design, select the desired column, and in its properties set the collation to Arabic_CI_AS.
By the way, the error:
Cannot insert the value NULL into column
is not related to Arabic; it occurs because you are trying to insert NULL into a column that does not allow NULLs.
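If you prefer T-SQL over the designer for the column step, something like this works (the column name is illustrative; the table name is taken from the error message above):

```sql
ALTER TABLE dbo.message
    ALTER COLUMN MessageText NVARCHAR(250) COLLATE Arabic_CI_AS;
```

Note that an nvarchar column can hold Arabic text regardless of its collation; the collation mainly affects sorting and comparison.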

2 different collations conflict when merging tables with Sql Server?

I have DB1, which has a Hebrew collation.
I also have DB2, which has a Latin1 general collation.
I was asked to merge a table (write a query) between DB1.dbo.tbl1 and DB2.dbo.tbl2
I could write in the query:
insert into ... SELECT Col1 COLLATE Latin1_General_CI_AS ...
But I'm sick of doing that.
I want to convert both DBs/tables to the same collation so I don't have to write COLLATE every time.
The question is -
Should I convert Latin -> Hebrew or Hebrew -> Latin?
We need to store everything from everywhere (and all our text columns are nvarchar(x)).
And if so, how do I do it?
If you are using Unicode data types - nvarchar(x) - in the destination database, then you can omit COLLATE in the INSERT. SQL Server will convert data from your source collation to Unicode automatically. So you should not need to convert anything if you are inserting into an nvarchar column.
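For example, an insert like this needs no COLLATE clause when the destination column is nvarchar (the table and column names are illustrative):

```sql
INSERT INTO DB2.dbo.tbl2 (Col1)   -- Col1 here is NVARCHAR(x)
SELECT Col1                       -- source column in Hebrew_CI_AS collation
FROM DB1.dbo.tbl1;                -- SQL Server widens to Unicode implicitly
```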
