Using Full-Text indexing to crawl binary blobs - sql-server
If I store binary files (e.g. doc, html, xml, xps, docx, pdf) inside a varbinary(max) column in SQL Server, how can I use Full-Text indexing to crawl the binary files?
Imagine I create a table to store binary files:
CREATE TABLE Documents (
    DocumentID int IDENTITY,
    Filename nvarchar(4000), -- nvarchar(n) allows at most 4000; nvarchar(32000) is rejected
    Data varbinary(max)
)
How can I leverage the IFilter system provided by Windows to crawl these binary files and extract useful, searchable information?
The motivation for this, of course, is that Microsoft's Indexing Service is deprecated and has been replaced by Windows Search. Indexing Service provided an OLE DB provider (MSIDX) that SQL Server could use to query the Indexing Service catalog: the Indexing Service OLE DB Provider.
Windows Search, on the other hand, has no way to query the catalog. There is no way for SQL Server to access Windows Search.
Fortunately, the capabilities of Windows Search (and Indexing Service before it) were brought into SQL Server proper. The SQL Server Full-Text indexing service uses the same IFilter mechanism that has been around for 19 years.
The question is: how to use it to crawl blobs stored in the database.
SQL Server full-text search can index varbinary and image columns.
You can see the list of all file types currently supported by SQL Server:
SELECT * FROM sys.fulltext_document_types
For example:
| document_type | class_id | path | version | manufacturer |
|---------------|--------------------------------------|----------------------------------------------------------------------------------|-------------------|-----------------------|
| .doc | F07F3920-7B8C-11CF-9BE8-00AA004B9986 | C:\Windows\system32\offfilt.dll | 2008.0.9200.16384 | Microsoft Corporation |
| .txt | C7310720-AC80-11D1-8DF3-00C04FB6EF4F | c:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Binn\msfte.dll | 12.0.6828.0 | Microsoft Corporation |
| .xls | F07F3920-7B8C-11CF-9BE8-00AA004B9986 | C:\Windows\system32\offfilt.dll | 2008.0.9200.16384 | Microsoft Corporation |
| .xml | 41B9BE05-B3AF-460C-BF0B-2CDD44A093B1 | c:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Binn\xmlfilt.dll | 12.0.9735.0 | Microsoft Corporation |
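To check whether particular extensions have a filter before storing such files, the same view can be narrowed down. This is a minimal sketch; the extensions listed are only examples, and a row is returned only for types that actually have a filter registered on your instance.

```sql
-- List the registered filter (if any) for a few extensions of interest.
-- An extension that returns no row has no filter available to this instance.
SELECT document_type, path, manufacturer
FROM sys.fulltext_document_types
WHERE document_type IN ('.doc', '.docx', '.pdf')
ORDER BY document_type;
```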
When creating the varbinary (or image) column to contain your binary file, you must have another string column that gives the file type through its extension (e.g. ".doc"):
CREATE TABLE Documents (
    DocumentID int IDENTITY,
    Filename nvarchar(4000), -- nvarchar(n) allows at most 4000; nvarchar(32000) is rejected
    Data varbinary(max),
    DataType varchar(50) --contains the file extension (e.g. ".docx", ".pdf")
)
When adding the binary column to the full-text index, SQL Server needs you to tell it which column contains the data type string:
ALTER FULLTEXT INDEX ON [dbo].[Documents]
ADD ([Data] TYPE COLUMN [DataType])
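The ALTER statement above assumes the table already has a full-text index. If it does not, one can be created with the type column from the start. The following is a sketch under assumptions: the catalog name Scratch matches the one queried further down, the constraint name PK_Documents is hypothetical, and LANGUAGE 1033 (English) is only an example. A full-text index needs a unique, single-column, non-nullable key index, which is why a primary key is added first.

```sql
-- Catalog that will hold the full-text index (name taken from the status query in this answer).
CREATE FULLTEXT CATALOG Scratch;

-- Full-text indexing requires a unique, single-column, non-nullable key index.
-- PK_Documents is a hypothetical name chosen for this example.
ALTER TABLE dbo.Documents
    ADD CONSTRAINT PK_Documents PRIMARY KEY CLUSTERED (DocumentID);

-- Index the blob column, telling SQL Server which column holds the file extension.
CREATE FULLTEXT INDEX ON dbo.Documents
    (Data TYPE COLUMN DataType LANGUAGE 1033)
    KEY INDEX PK_Documents
    ON Scratch
    WITH CHANGE_TRACKING AUTO;
```

Use either this CREATE or the ALTER above, not both: adding a column that is already part of the index fails.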
You can test by importing a binary file from the filesystem on the server:
INSERT INTO Documents (Filename, DataType, Data)
SELECT
    'Managing Storage Spaces with PowerShell.doc' AS Filename,
    '.doc' AS DataType,
    Data.BulkColumn -- the single varbinary(max) column returned by SINGLE_BLOB
FROM OPENROWSET(BULK N'C:\Managing Storage Spaces with PowerShell.doc', SINGLE_BLOB) AS Data
You can view the catalog status using:
DECLARE @CatalogName varchar(50);
SET @CatalogName = 'Scratch';
SELECT
    CASE FULLTEXTCATALOGPROPERTY(@CatalogName, 'PopulateStatus')
        WHEN 0 THEN 'Idle'
        WHEN 1 THEN 'Full population in progress'
        WHEN 2 THEN 'Paused'
        WHEN 3 THEN 'Throttled'
        WHEN 4 THEN 'Recovering'
        WHEN 5 THEN 'Shutdown'
        WHEN 6 THEN 'Incremental population in progress'
        WHEN 7 THEN 'Building index'
        WHEN 8 THEN 'Disk is full. Paused.'
        WHEN 9 THEN 'Change tracking'
        ELSE 'Unknown'
    END+' ('+CAST(FULLTEXTCATALOGPROPERTY(@CatalogName, 'PopulateStatus') AS varchar(50))+')' AS PopulateStatus,
    FULLTEXTCATALOGPROPERTY(@CatalogName, 'ItemCount') AS ItemCount,
    CAST(FULLTEXTCATALOGPROPERTY(@CatalogName, 'IndexSize') AS varchar(50))+ ' MiB' AS IndexSize,
    CAST(FULLTEXTCATALOGPROPERTY(@CatalogName, 'UniqueKeyCount') AS varchar(50))+' words' AS UniqueKeyCount,
    FULLTEXTCATALOGPROPERTY(@CatalogName, 'PopulateCompletionAge') AS PopulateCompletionAge,
    -- PopulateCompletionAge is measured in seconds since 1990-01-01, not 1900-01-01
    DATEADD(second, FULLTEXTCATALOGPROPERTY(@CatalogName, 'PopulateCompletionAge'), '19900101') AS PopulateCompletionDate
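The catalog-level figures do not say whether individual blobs failed to filter. As a rough sketch, the per-table full-text properties exposed through OBJECTPROPERTYEX can be read alongside them; dbo.Documents is the example table from this answer, and a non-zero fail count is only a hint that some rows (for instance ones with no matching filter) were not indexed.

```sql
DECLARE @TableId int;
SET @TableId = OBJECT_ID(N'dbo.Documents');

-- Per-table population figures for the full-text index on dbo.Documents.
SELECT
    OBJECTPROPERTYEX(@TableId, 'TableFulltextPopulateStatus') AS PopulateStatus,
    OBJECTPROPERTYEX(@TableId, 'TableFulltextItemCount')      AS ItemCount,
    OBJECTPROPERTYEX(@TableId, 'TableFulltextDocsProcessed')  AS DocsProcessed,
    OBJECTPROPERTYEX(@TableId, 'TableFulltextFailCount')      AS FailCount;
```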
And you can query the catalog:
SELECT * FROM Documents
WHERE FREETEXT(Data, 'Bruce')
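FREETEXT is the loosest of the full-text predicates. Two other query shapes are sketched below, assuming DocumentID is the full-text key column of the index: CONTAINS for an exact phrase, and FREETEXTTABLE when a relevance rank is wanted. The search terms are just examples.

```sql
-- Exact-phrase search against the text extracted from the blobs.
SELECT DocumentID, Filename
FROM dbo.Documents
WHERE CONTAINS(Data, '"Storage Spaces"');

-- Ranked search: FREETEXTTABLE returns the full-text key and a relevance rank.
-- Joining on [KEY] assumes DocumentID is the key column of the full-text index.
SELECT d.DocumentID, d.Filename, ft.[RANK]
FROM FREETEXTTABLE(dbo.Documents, Data, 'Bruce') AS ft
JOIN dbo.Documents AS d ON d.DocumentID = ft.[KEY]
ORDER BY ft.[RANK] DESC;
```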
Additional IFilters
SQL Server has a limited set of built-in filters. It can also use IFilter implementations registered on the system (e.g. Microsoft Office 2010 Filter Pack that provides docx, msg, one, pub, vsx, xlsx and zip support).
You must enable the use of the OS-level filters by enabling the option:
sp_fulltext_service 'load_os_resources', 1
and restart the SQL Server service.
load_os_resources int
Indicates whether operating system word breakers, stemmers, and filters are registered and used with this instance of SQL Server. One of:
0: Use only filters and word breakers specific to this instance of SQL Server.
1: Load operating system filters and word breakers.
By default, this property is disabled to prevent inadvertent behavior changes by updates made to the operating system. Enabling use of operating system resources provides access to resources for languages and document types registered with Microsoft Indexing Service that do not have an instance-specific resource installed. If you enable the loading of operating system resources, ensure that the operating system resources are trusted signed binaries; otherwise, they cannot be loaded when verify_signature is set to 1.
If using SQL Server before SQL Server 2008, you must also restart the Full-Text indexing service after enabling this option:
net stop msftesql
net start msftesql
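Put together, the steps for picking up an OS-level filter might look like the sketch below. It assumes the filter pack is already installed and that the example table dbo.Documents has a full-text index; the extensions checked are only examples.

```sql
-- 1. Allow SQL Server to load filters registered with the operating system.
EXEC sp_fulltext_service 'load_os_resources', 1;

-- 2. Restart the SQL Server service here, as described above.

-- 3. Confirm the new document types are now visible to full-text search.
SELECT document_type, path
FROM sys.fulltext_document_types
WHERE document_type IN ('.docx', '.pdf');

-- 4. Re-crawl existing rows so blobs stored before the filter was available get indexed.
ALTER FULLTEXT INDEX ON dbo.Documents START FULL POPULATION;
```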
Microsoft provides filter packs that contain IFilters for the Office 2007 and Office 2010 file types:
Microsoft Office 2007 Filter Packs
Microsoft Office 2010 Filter Packs
And Adobe provides an IFilter for indexing PDFs (Foxit provides one, but theirs is not free):
Adobe PDF iFilter 64 11.0.01
Bonus Reading
KB945934: How to register Microsoft Filter Pack IFilters with SQL Server
MS Blogs: Getting a custom IFilter working with SQL Server 2008/R2 (IFilterSample)
Related
MS Access: How to specify a ODBC Connection String with country specific date queries?
I am having serious trouble setting up an MS Access application that uses linked tables to a SQL Server 2012 database. The problem is that SQL queries have trouble interpreting German dates: e.g. "31.12.2019" doesn't work, while "01.01.2019" works. So I suspect it is a localization problem. E.g. select * from table where date >= [Forms]![someForm]![fromDate], where [Forms]![someForm]![fromDate] is a string in a form, edited by a date picker. I was able to solve the problem by using the ODBC Microsoft SQL Server Setup Wizard and selecting "Ländereinstellungen verwenden" (English: "use country-specific settings"). I would like to specify this in a classic ODBC connection string, e.g. DRIVER=ODBC Driver 13 for SQL Server;SERVER=.\SqlExpress2012;Trusted_Connection=Yes;APP=Microsoft Office;DATABASE=suplattform;?country-specific=yes? However, I did not find such a parameter in any documentation. Is this possible? Best regards, Michael
Also, specify the data type of the parameter - and date is a reserved word in Access SQL: parameters [Forms]![someForm]![fromDate] DateTime; select * from table where [date] >= [Forms]![someForm]![fromDate]
Insert on linked table (from MS Access to SQL Server) fails because of field length
I'm trying to do an append query from MS Access into SQL Server. The SQL Server column is varchar(max), which I thought meant it can accept more than 4000 characters. I get the following error when running this query from VBA in MS Access:
Run Time Error 3155 ODBC - insert on a linked table failed [Microsoft][SQL Server Native Client 10.0][SQL Server] The size (7596) given to the parameter '@P6' exceeds the maximum allowed (4000). (#2717)
Adding my queries. This query is based on a linked Outlook folder, Deleted Items:
SELECT
Trim(Mid([contents],InStr([contents],"Short Description: ")+19,(InStr([contents],"Requestor: ")-1)-(InStr([contents],"Short Description: ")+19)-3)) AS ShortDesc,
Trim(Mid([contents],InStr([contents],"Requestor: ")+10,(InStr([contents],"Requestor EMail: ")-1)-(InStr([contents],"Requestor: ")+10)-3)) AS Requester,
Trim(Mid(Mid([contents],InStr([contents],"Office Location: ")+3),InStr(Mid([contents],InStr([contents],"Office Location: ")),"Description:")+13,(InStr(Mid([contents],InStr([contents],"Office Location: ")),"Assigned Task: ")-1)-(InStr(Mid([contents],InStr([contents],"Office Location: ")),"Description:")+13)-3)) AS Description,
Trim(Mid([contents],InStr([contents],"Request Item: ")+14,12)) AS TicketNoText,
Val(Mid([contents],InStr([contents],"Request Item: ")+19,7)) AS TicketNo,
Val(Mid([contents],InStr([contents],"Assigned Task: ")+19,7)) AS TaskNo,
Mid([contents],InStr([contents],"Delivery Date: ")+15,10) AS DeliveryDate,
Trim(Mid([contents],InStr([contents],"Requestor EMail: ")+17,(InStr([contents],"Office Location: ")-1)-(InStr([contents],"Requestor EMail: ")+17)-3)) AS RequesterEMail
FROM [Deleted Items]
WHERE ((([Deleted Items].From)="xxxxxxx@service-now.com") AND (([Deleted Items].Subject)="you just assigned a ticket to yourself"));
Then the append query is based on this one and a few other ones:
INSERT INTO PROJECTS ( TaskNo, RequesterID, Description, TicketNo, ProjectFolderLink, SNLink, OpenedOn, DateDue )
SELECT QSNNew.taskno,
cmbRequesters.RequesterID,
"SHORT DESCRIPTION: " & [shortdesc] & Chr(13) & Chr(10) & Chr(13) & Chr(10) & "DESCRIPTION: " & [Description] AS Expr4,
QSNNew.TicketNo,
"#\\link to a network folder" & lpad([ticketno],"0",7) & "\#" AS Expr1,
"#https://xxxx.service-now.com/nav_to.do?uri=sc_task.do?sysparm_query=number=TASK" & lpad([taskno],"0",7) & "#" AS Expr2,
Now() AS Expr3,
Mid([DeliveryDate],6,2) & "/" & Right([DeliveryDate],2) & "/" & Left([DeliveryDate],4) AS Expr5
FROM (QSNNew LEFT JOIN PROJECTS ON QSNNew.TicketNo = PROJECTS.TicketNo) LEFT JOIN cmbRequesters ON QSNNew.[Requester] = cmbRequesters.RequesterName
WHERE (((PROJECTS.TicketNo) Is Null));
In case anyone is wondering what I'm doing: I'm loading tickets from ServiceNow into an Access database, and there's no other way of doing it than parsing the notification emails I get from ServiceNow when a ticket is assigned to me. So I'm parsing those emails and creating my own version with links to the ServiceNow page, network folders for the ticket, etc.
It's a matter of which driver you use: SQL Native Client and ODBC Driver 17 are limited to 4000 chars. If you use the SQL Server driver (10.09.18362.01), the limit is 64000 chars. As yu2 suggested, an ADODB query would avoid that (an ODBC pass-through query should too). The parameter @P6 is produced by ODBC; see "Understanding Dynasets" in Optimizing Microsoft Office Access Applications Linked to SQL Server.
This answers your question: 3155 Insert into Linked Table error. At that link, Microsoft recommends using ADODB instead of ODBC if you can't decrease the field size in the application. ODBC protocol is generally for big data -- sql server -- which uses "Fire Hose" bandwidth. Access uses (my term) garden hose bandwidth (with all due respect for mini RDBMSs) because Access is basically a mini RDBMS (relational database management system) which is also file based. Unless everything (front/back ends) is set up perfectly and conditions are ideal -- you will encounter the problem you are having. Microsoft came up with ADODB as a workaround for this problem. When I have to interface between sql server and Access -- I use ADODB. This has proven to be much more reliable and consistent between the small and large RDBMSs. Here is some sample ADODB code for reading from and writing to a Sql Server from Access '--add a reference in Tools/References to Microsoft ActiveX Data Objects 2.x Library '--(2.5 or higher) ...
The Access "Long Text" column can contain a text string up to a gigabyte in size. The message says you are trying to fit 7596 characters into a 4000 character field. If so, your SQL server database should be exposing an ODBC LongVarChar column instead of VarChar. LongVarChar is an ODBC type. The mapping is done by the ODBC driver. If you use an ODBC driver that maps VarChar(MAX) to a VarChar ODBC column, you can either get a different driver, or, possibly, use a SQL SERVER 'TEXT' column instead. TEXT is the old SQL Server column type, from when VarChar could only go to 4000. Old ODBC drivers recognize that TEXT columns map to LongVarChar.
I think the error message is quite helpful. You are trying to fit a string with length of 7596 while your maximum varchar length is 4000. I guess you either truncate it or store it as a blob.
SSIS Oracle table w/BLOBs XML to SQL Server table
We have an Oracle table that contains archived data in the table format: BLOB | BLOBID Each BLOB is an XML file that contains all the business objects we need. Every BLOB needs to be read, the XML parsed and put into 1 SQL Server table that will hold all data. The table has 5 columns R | S | T | D | BLOBID Sample XML derived from BLOB: <N> <E> <R>33</R> <S>1</S> <T>2012-01-25T09:48:43.213</T> <D>6.9534619e+003</D> </E> <E> <R>33</R> <S>1</S> <T>2012-01-25T09:48:45.227</T> <D>1.1085871e+004</D> </E> <E> <R>33</R> <S>1</S> <T>2012-01-25T09:48:47.227</T> <D>1.1561764e+004</D> </E> </N> There are a few million BLOBs and we want to avoid copying all the data over as an XML column then to a table, instead we want to go BLOB to table in one step. What is the best approach to doing this with SSIS/SQL Server? The code below almost does what we are looking for but only in Oracle Developer and only for one BLOB: ALTER SESSION SET NLS_TIMESTAMP_FORMAT='yyyy-mm-dd HH24:MI:SS.FF'; SELECT b.BLOBID, a.R as R, a.S as S, a.T as T, cast(a.D AS float(23)) as D FROM XMLTABLE('/N/E' PASSING (SELECT xmltype.createxml(BLOB, NLS_CHARSET_ID('UTF16'), null) FROM CLOUD --Oracle BLOB Cloud WHERE BLOBID = 23321835) COLUMNS R int PATH 'R', S int PATH 'S', T TIMESTAMP(3) PATH 'T', D VARCHAR(23) PATH 'D' ) a Removing WHERE BLOBID = 23321835 gives the following error ORA-01427: single-row subquery returns more than one row since there are millions of BLOBS. Even so is there a way to run this through SSIS? Adding the SQL to the OLE DB Source did not work for pulling the data from Oracle even for 1 BLOB and would result in errors. Using SQL Server 2012 and Oracle 10g To summarize, how would we go from a Oracle BLOB containing XML to SQL Server table with business objects derived from XML with SSIS? I'm new to working with Oracle, any help would be greatly appreciated! 
Update: I was able to get some of my code to work in SSIS by modifying the Oracle Source in SSIS to use the SQL command code above minus the first line, ALTER SESSION SET NLS_TIMESTAMP_FORMAT='yyyy-mm-dd HH24:MI:SS.FF'; SSIS doesn't like this line. Error message with the ALTER SESSION line above included: No column information was returned by the SQL Command Would there be another way to format the date without losing data? I'll try experimenting more, possible using varchar(23) for the date instead of timestamp.
How do I avoid MSSQL date columns becoming null in Pivotal HAWQ during DBMS migration
We are trying to pull data from an external source (MSSQL) into Postgres. When I checked, the invoicedate column entries are blank in Postgres, while MSSQL shows invoicedate values for the same entries. We ran the following query on both DBMSs. In SQL Server: select * from tablename where salesorder='168490' returns 12 rows where the invoicedate column is '2015-10-26 00:00:00.000'. But the same query on Postgres: select "InvoceDt" from tablename where salesorder='168490' returns 12 rows where the invoicedate column is null. Why is the data different between SQL Server and Postgres for this particular column?
Vicps, you aren't using Postgres and that is why a_horse_with_no_name is having such a hard time trying to understand your question. You are using Pivotal HDB (formerly called HAWQ). HAWQ is now associated with the incubator project, "Apache HAWQ", and the commercial version is "Pivotal HDB". Pivotal HDB is a fork of Pivotal Greenplum database, which is a fork of PostgreSQL 8.2. It has many similarities to Postgres but it is most definitely not Postgres. You are also using Spring-XD to move the data from SQL Server to HDFS, which is critical in understanding what the true problem is. You provided this example: CREATE TABLE tablename ( "InvoiceDt" timestamp ) LOCATION ('pxf://hostname/path/to/hdfs/?profile=HdfsTextSimple') FORMAT 'csv' ( delimiter '^' null 'null' quote '~'); Your file only has one column in it? How is this possible? Above, you mention the salesorder column. Secondly, have you tried looking at the file written by Spring-XD? hdfs dfs -cat hdfs://hostname:8020/path/to/hdfs | grep 168490 I bet you have an extra delimiter, null character, or an escape character in the data which is causing the problem. You may also want to tag your question with spring-xd.
decimal(18,10) values in SQL Server appear rounded in Access linked table
I have an Access database with a linked table from SQL Server. The SQL Server table has the following structure and values (Variance has data type decimal(18,10)):
| Name | Total | Esc | Variance     |
|------|-------|-----|--------------|
| Name | 29    | 2   | 0.0689655170 |
But if I open the Access linked table, it shows the Variance value rounded to two decimal places:
| Name | Total | Esc | Variance |
|------|-------|-----|----------|
| Name | 29    | 2   | 0.07     |
How can I make the Access table look the same as the SQL Server table?
I had this same problem with a Decimal(18,10) on SQL2005 with an Access 2010 front end: the linked table rounds/truncates data. I have found that this occurs with the default sqlsrv32 ODBC driver. Updating to the SQL Native Client driver and creating a new ODBC connection that uses the sqlncli driver seems to fix the error.