I have a SQL Server 2008 database with a large amount of varchar(max) data that is currently indexed with full-text search. Unfortunately, row-level compression in SQL Server 2008 does not support LOB data.
I am toying with the idea of using SQLCLR to compress the data and a custom IFilter to enable the compressed data to be indexed with full-text search.
I'm interested in getting some feedback on this idea. Could it work? Has it been done before? What are the possible pitfalls? Can you recommend a better solution?
A long time ago, I built a mini-SharePoint that compressed incoming files with a zip library and stored the bytes in a varbinary(max) column. Since the spec called for searching metadata as opposed to actual file contents, I didn't have to worry about full-text search.
You could achieve the same thing with CLR now. The main pitfall would be the CPU load from decompressing the data for indexing and searching, but CPUs are fast now.
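The compression half is the easy part. Here's a minimal sketch of a SQLCLR scalar function pair (the function names are mine; it assumes GZip is an acceptable format and that you wrap reads and writes in calls to dbo.CompressData / dbo.DecompressData from T-SQL):

    using System.Data.SqlTypes;
    using System.IO;
    using System.IO.Compression;
    using Microsoft.SqlServer.Server;

    public static class CompressionFunctions
    {
        // Deploy as SQLCLR scalar functions; the SAFE permission set is
        // enough, since GZipStream lives in System.dll.
        [SqlFunction(IsDeterministic = true)]
        public static SqlBytes CompressData(SqlBytes input)
        {
            if (input.IsNull) return SqlBytes.Null;
            byte[] data = input.Value;
            using (var output = new MemoryStream())
            {
                using (var gzip = new GZipStream(output, CompressionMode.Compress))
                    gzip.Write(data, 0, data.Length);
                return new SqlBytes(output.ToArray());
            }
        }

        [SqlFunction(IsDeterministic = true)]
        public static SqlBytes DecompressData(SqlBytes input)
        {
            if (input.IsNull) return SqlBytes.Null;
            using (var gzip = new GZipStream(new MemoryStream(input.Value), CompressionMode.Decompress))
            using (var output = new MemoryStream())
            {
                // Stream.CopyTo isn't available in the CLR 2.0 runtime that
                // SQL Server 2008 hosts, so copy through a buffer manually.
                var buffer = new byte[4096];
                int read;
                while ((read = gzip.Read(buffer, 0, buffer.Length)) > 0)
                    output.Write(buffer, 0, read);
                return new SqlBytes(output.ToArray());
            }
        }
    }

The hard part remains the full-text side: your custom IFilter would have to decompress on the fly so the indexer sees plain content.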
Option two? Buy more storage.
Can anyone explain in detail what the varbinary(max) value represents if, say, a BLOB file (a .pdf) is stored in the file system through the FILESTREAM attribute in SQL Server?
And how does it get copied across databases on different servers using the usual T-SQL queries?
Storing BLOB data using a FILESTREAM setup enables you to store your documents on disk while keeping the documents' reference information in the database. This approach is sometimes advised when your file storage is cheap while your database storage is not, but it really depends on your requirements.
If you are working with small BLOB files, it might be better to leave the FILESTREAM setup alone, as configuring and maintaining it comes with some overhead; consider, for instance, your comment's example of copying data from one server to another.
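As for the copying part of your question: the varbinary(max) value you SELECT from a FILESTREAM column is simply the file's bytes, so the usual T-SQL works. A minimal client-side sketch (table and column names are placeholders; a real FILESTREAM table also needs its ROWGUIDCOL populated, here called RowId):

    using System.Data.SqlClient;

    class DocumentCopy
    {
        // Reads the BLOB as ordinary varbinary(max) from the source server and
        // re-inserts it on the target; SQL Server materializes the file in the
        // target's FILESTREAM container automatically.
        static void CopyDocument(string sourceConn, string targetConn, int id)
        {
            byte[] data;
            using (var source = new SqlConnection(sourceConn))
            using (var select = new SqlCommand(
                "SELECT FileData FROM Documents WHERE Id = @id", source))
            {
                select.Parameters.AddWithValue("@id", id);
                source.Open();
                data = (byte[])select.ExecuteScalar();
            }

            using (var target = new SqlConnection(targetConn))
            using (var insert = new SqlCommand(
                "INSERT INTO Documents (Id, RowId, FileData) VALUES (@id, NEWID(), @data)",
                target))
            {
                insert.Parameters.AddWithValue("@id", id);
                insert.Parameters.AddWithValue("@data", data);
                target.Open();
                insert.ExecuteNonQuery();
            }
        }
    }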
I am writing an asp.net web application that stores APPLICANTS data in a SQL Server database.
An applicant might post a name, address, telephone number, and a file.
The file might have any extension: .docx for a resume, .jpg or .pdf for photos, or even an Excel file.
Is it possible to store files with all of these extensions in my database? Or will that get unwieldy?
Good question! Personally, I would use FILESTREAM in your case, and here's why:
In SQL Server, BLOBs can be standard varbinary(max) data that stores the data in tables, or FILESTREAM varbinary(max) objects that store the data in the file system. The size and use of the data determines whether you should use database storage or file system storage. If the following conditions are true, you should consider using FILESTREAM:

- Objects that are being stored are, on average, larger than 1 MB.
- Fast read access is important.
- You are developing applications that use a middle tier for application logic.

For smaller objects, storing varbinary(max) BLOBs in the database often provides better streaming performance.
You can read up on FILESTREAM here.
Also consider using it in conjunction with FILETABLE.
Finally, here's a .NET C# example of how to read from a FILESTREAM column.
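Something along these lines (a minimal sketch; the Documents table and FileData column are placeholders, and note that SqlFileStream requires an explicit transaction):

    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using System.IO;

    class FileStreamRead
    {
        static byte[] ReadDocument(string connectionString, int documentId)
        {
            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();
                using (SqlTransaction transaction = connection.BeginTransaction())
                {
                    // Ask SQL Server for the NTFS path and the transaction context.
                    var command = new SqlCommand(
                        "SELECT FileData.PathName(), GET_FILESTREAM_TRANSACTION_CONTEXT() " +
                        "FROM Documents WHERE Id = @id", connection, transaction);
                    command.Parameters.AddWithValue("@id", documentId);

                    string path;
                    byte[] txContext;
                    using (SqlDataReader reader = command.ExecuteReader())
                    {
                        reader.Read();
                        path = reader.GetString(0);
                        txContext = (byte[])reader[1];
                    }

                    // Stream the BLOB directly from the file system via the Win32 API.
                    using (var fileStream = new SqlFileStream(path, txContext, FileAccess.Read))
                    using (var buffer = new MemoryStream())
                    {
                        fileStream.CopyTo(buffer);
                        transaction.Commit();
                        return buffer.ToArray();
                    }
                }
            }
        }
    }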
Please note that FILESTREAM is available in SQL Server starting with the 2008 version.
Hope it helps!
I am building a search engine, and I have finished the first phase, which is spidering (fetching HTML documents and parsing each document to extract its links). Now I must index the content of the HTML documents. At first I decided to use a DBMS (like SQL Server) for this purpose, but then I found another library called Lucene.NET.
What is the difference between Lucene.NET and SQL Server, and which one is better for indexing HTML documents? I have read a lot about Lucene.NET and was surprised that it gives better performance than SQL Server. Can anyone explain this to me?
SQL Server is a general-purpose RDBMS that is not optimized for very fast text indexing (yes, it has full-text indexes, but it does lots of other things at the same time).
Lucene.NET is not an RDBMS; its main function is fast text indexing.
It's not that surprising that it is better at this than SQL Server.
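To make the difference concrete, here's a minimal sketch of indexing and querying with Lucene.NET (written against the Lucene.Net 3.x API; the field names are mine):

    using System;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;
    using Lucene.Net.Store;
    using Version = Lucene.Net.Util.Version;

    class LuceneDemo
    {
        static void Main()
        {
            // In-memory index for demonstration; use FSDirectory.Open(path) for disk.
            var directory = new RAMDirectory();
            var analyzer = new StandardAnalyzer(Version.LUCENE_30);

            using (var writer = new IndexWriter(directory, analyzer,
                                                IndexWriter.MaxFieldLength.UNLIMITED))
            {
                var doc = new Document();
                doc.Add(new Field("url", "http://example.com",
                                  Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.Add(new Field("content", "plain text extracted from the html page",
                                  Field.Store.NO, Field.Index.ANALYZED));
                writer.AddDocument(doc);
            }

            using (var searcher = new IndexSearcher(directory, true)) // read-only
            {
                var parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
                Query query = parser.Parse("html");
                TopDocs hits = searcher.Search(query, 10);
                foreach (ScoreDoc hit in hits.ScoreDocs)
                    Console.WriteLine(searcher.Doc(hit.Doc).Get("url"));
            }
        }
    }

Everything here goes straight into an inverted index built for text; there's no relational machinery in the path, which is where the performance difference comes from.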
I need to implement a service to search PDFs. Initially I started using SQL Server 2008 FTS, but soon realized that my PDFs would have to be stored in the DB itself. I was then pointed to Indexing Services as well as to the SQL 2008 FILESTREAM data type so that I could store the PDFs in the file system. So how do these three (Indexing Services, FTS, and the FILESTREAM option) relate to each other? Do I need to use all three together to implement my search?
Also, do hosting services like DiscountASP typically have these enabled? Or should I consider switching to Lucene.NET?
We used to use a PDF IFilter, which allows you to store the PDF in the DB and then perform FTS against it. However, we now convert our PDFs to text and store the text in the full-text index. This allows us to store all our docs (we store .doc, .pdf, etc.) in the same index.
DiscountASP does allow FTS/iFTS on the hosted database.
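For the convert-to-text approach, the extraction step looks roughly like this with a library such as iTextSharp (the library choice is my assumption; any PDF text extractor works):

    using System.Text;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    class PdfToText
    {
        // Extracts plain text from a PDF so it can be stored in an
        // nvarchar(max) column covered by the full-text index, instead of
        // indexing the raw binary through an IFilter.
        static string ExtractText(string pdfPath)
        {
            var text = new StringBuilder();
            var reader = new PdfReader(pdfPath);
            try
            {
                for (int page = 1; page <= reader.NumberOfPages; page++)
                    text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
            }
            finally
            {
                reader.Close();
            }
            return text.ToString();
        }
    }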
If you know in advance what you want to find (e.g. you get hundreds of PDFs a day and will need to find the ones containing certain "known-before-reception" strings), then you could make a text version on reception, create index entries for the PDF file, and then throw away the text.
If you do not know the search terms in advance, life becomes much slower :( There is a program called PDF Search that claims to do full-text search in PDF files. I haven't needed to use it, so I can't say how it is, but it's here: http://www.getpdf.com/.
Hope this helps
Quick question; it could be a silly one given my (lack of) findings on Google so far.
I have a Database. In this database is a Table with some Data. The Data is a large BLOB but can't be compressed (for reasons out of my control).
I have an Application that talks to this Database. I would really like to be able to ensure that the Data is compressed during transit.
As I understand it, the Database Provider would handle compression etc.
Is this the case? Are there settings on common ones, say SQL Server to enable compression?
For SQL Server, I found this "connect" entry, but no: I don't think TDS is currently compressed. You could (although I don't like it much) use SQL-CLR to compress it in .NET code, but it could have too much overhead.
I know it isn't an option in this case (from the question), but it is usually preferable to store BLOBs the way you want to get them. So if you want to get them compressed, store them compressed. SQL isn't a good tool for manipulating binary ;-p Such a strategy also means that you aren't using vendor-specific features - just the ability to store an opaque BLOB.
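Where storing compressed is an option, the client-side half is only a few lines (a sketch; the Blobs table and column names are placeholders):

    using System.Data.SqlClient;
    using System.IO;
    using System.IO.Compression;

    static class BlobTransfer
    {
        // Compress on the application side before the INSERT, so the bytes
        // are small both on the wire and at rest.
        static byte[] Compress(byte[] data)
        {
            using (var output = new MemoryStream())
            {
                using (var gzip = new GZipStream(output, CompressionMode.Compress))
                    gzip.Write(data, 0, data.Length);
                return output.ToArray();
            }
        }

        static byte[] Decompress(byte[] compressed)
        {
            using (var gzip = new GZipStream(new MemoryStream(compressed),
                                             CompressionMode.Decompress))
            using (var output = new MemoryStream())
            {
                gzip.CopyTo(output);
                return output.ToArray();
            }
        }

        static void Save(string connectionString, byte[] blob)
        {
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(
                "INSERT INTO Blobs (Data) VALUES (@data)", connection))
            {
                command.Parameters.AddWithValue("@data", Compress(blob));
                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }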
If your database access layer does not provide compression, you can set up a VPN link between the database server and the application host. Most serious VPN solutions compress data in transit. OpenVPN is a simple and easy to set up solution for quickly creating a tunnel. Data is compressed in transit. Probably won't be as efficient as a native compression, but it's a possible solution. And you get encryption thrown in for free :).
SQL Server 2008 is the first version of SQL Server to natively support compression of backups. Pre-2008, you needed to do it with third-party products.
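For reference, the 2008 feature is just a WITH COMPRESSION clause on the backup statement (Enterprise edition in 2008; Standard gained it in 2008 R2). Issued from C#, with placeholder names:

    using System.Data.SqlClient;

    class BackupDemo
    {
        static void BackupCompressed(string connectionString)
        {
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(
                "BACKUP DATABASE MyDatabase " +
                "TO DISK = 'C:\\Backups\\MyDatabase.bak' " +
                "WITH COMPRESSION", connection))
            {
                command.CommandTimeout = 0; // backups can outlast the default 30-second timeout
                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }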