sql server - full-text search - sql-server

So let's say I have two databases, one for production purposes and another one for development purposes.
When we copied the development database, the full-text catalog did not get copied properly, so we decided to create the catalog ourselves. We matched all the tables and indexes and created the database and the search feature seems to be working okay too (but been entirely tested yet).
However, the former catalog had a lot more files in its folder than the one we manually created. Is that fine? I thought they would have exact same number of files (but the size may vary)

First...when using full text search I would suggest that you don't manually try to create what the wizard does for you. I have to wonder about missing more than just some data. Why not just recreate the indexes?
Second...I suggest that you don't use freetext feature of sql server unless you have no other choice. I used to be a big believer in freetext but was shown an example of creating a Lucene(.net) index and searching it in comparison to creating an index in SQL Server and searching it. Creating a SQL Server index in comparison to creating a Lucene index is considerably slower and hard to maintain. Searching a SQL Server index is considerably less accurate (poor results) in comparison to Lucene. Lucene is like having your own personal Google for searching data.
How? Index your data (only the data you need to search) in Lucene and include the Primary Key of the data that you are indexing for use later. Then search the index using your language and the Lucene(.net) API (many articles written on this topic). In your search results make sure you return the PK. Once you have identified the records you are interested in you can then go get the rest of the data and/or any related data based on the PK that was returned.
Gotchas? Updating the index is also much quicker and easier. However, you have to roll your own for creating the index, updating the index, and searching the index. SUPER EASY to do...but still...there are no wizards or one handed coding here! Also, the index is on the file system. If the file is open and being searched and you try to open it again for another search you will obviously have some issues...so writing some form of infrastructure around opening and reading these indexes needs to be built.
How does this help in SQL Server? You can easily wrap your Lucene search in a CLR function or proc which can be installed in the database that you can then use as though it were native to your t-SQL queries.

Related

Semantic Search without using filetable in sql server 2012

We have an file upload system and would like to use the new MSSQL2012 semantic search feature for sql server 2012. Is that possible without using filetables?
This is our schema:
I think there are two questions here.
Can you use Semantic Search without using filetable?
Yes, you can. It can be used on any table with Full-Text indexing turned on.
Here is the list of prerequisites:
link.
Basically you can use it on the data, which is loaded into the database.
The second question is whether your schema benefit from Semantic Search an to what extent.
Looking at your scheema I understand, that your database hold only paths to the documents and their "descriptions". Therefore, you can enable Semantic Search on the columns in your database. It will allow to use Semantic Search on FileName and Description, but not on documents' contents.
In order to use Semantic Search on the contents of these documents you'll need to store these documnets in SQL database. FileTable structure helps this task, although you can choose another way of storing whole documnets in your database.

Full Text Search Auto-Partition Schemes and Functions

We have some full text searches running on our SQL Server 2012 Development (Enterprise) database. We noticed that partition schemes and functions are being (periodically) added to the DB. I can only assume that the partitions are for FTS as they have the following form:
Scheme:
CREATE PARTITION SCHEME [ifts_comp_fragment_data_space_46093FC3] AS PARTITION [ifts_comp_fragment_partition_function_46093FC3] TO ([FTS], [FTS], [FTS])
Function:
CREATE PARTITION FUNCTION [ifts_comp_fragment_partition_function_46093FC3](varbinary(128)) AS RANGE LEFT FOR VALUES (0x00330061007A00660073003200360036, 0x0067006F00730066006F00720064)
The problem is that our production servers are running SQL Server 2012 Standard which does not support partitions. Thus it adds an extra admin burden on our schema compares (using SSDT) to exclude these partitions every time. When one does (inevitably) creep in it is a pain to remove. We have done some extensive research and have not been able to come up with any answer as to why this is even happening. Any ideas?
Yes, those are internal to the fulltext search functionality. You have no control over them.
However, I would consider it a bug that they show up in your schema compares. You'll never create/alter/drop them yourselves, and they completely maintained by sql server, so I would file a bug report on http://connect.microsoft.com

DB technology for efficient search in tabular data?

We have a repository of tables. Around 200 tables, each table can be thousands of rows, all tables are originally in Excel sheets.
Each table has a different scheme. All data is text or numbers.
We would like to create an application that allows free text search on all tables (we define which columns will be searched in each table) efficiently - speed is important.
The main dilemma is which DB technology we should choose.
We created a mock up by importing all tables to MS SQL Server, and creating a full text index over them. The search is done using the CONTAINS keyword. This solution works well for a small number of tables, but it doesn't scale.
We thought about a NoSQL solution, but we don't yet have any experience in it.
Our limitations (which unfortunately I can not effect): Windows servers only. But we can install on them whatever we want.
Thank you.
Check out ElasticSearch! It's a search server based on Apache Lucene and has a clean REST- and JavaScript-based API. Although it's used usually as a search-index for a primary database, it can also be used stand-alone. So you may want to write a backup routine for a few of your tables/data and try it out.
http://www.elasticsearch.org/
http://en.wikipedia.org/wiki/ElasticSearch
Comparison of ElasticSearch and Apache Solr (another Lucene-based search server):
https://docs.google.com/present/view?id=dc6zhtt5_1frfxwfff&pli=1

Sql Server XML columns substitute for Document DB?

Is it possible to use Sql Server XML columns as a substitute for a real Document DB (such as Couch or Mongo) ?
If I were to create a table with a guid PK Id and an XML column for the document.
What would be the main problems compared to using a document DB?
Sql Server supports indexing over XML columns so querying should not be completely horrible?
You've got several questions in here:
Is it possible to use Sql Server XML columns as a substitute for a real Document DB (such as Couch or Mongo) ? Yes, you can use it as a substitute, but no, you probably wouldn't be satisfied with performance if you're exclusively storing XML and not leveraging any of SQL Server's relational tools.
If I were to create a table with a guid PK Id and an XML column for the document. What would be the main problems compared to using a document DB? In a nutshell, scaling out. SQL Server doesn't scale this kind of thing out well. You can do it with replication, but it's painful to manage relative to a "real" Document DB.
Sql Server supports indexing over XML columns so querying should not be completely horrible? The problem is that SQL Server's XML indexes can take several times the storage space of the original data. These indexes can't be maintained online (as in defrags), so you end up with locking issues during maintenance windows.
I'm doing some experimenting with this on:
http://rogeralsing.com/2011/03/02/linq-to-sqlxml-projections/
Query speed is 'decent' , it's nothing I'd use for scaling.
But the joy of schema free storage running on standard infrastructure is quite nice.
Yes, you can. Storing a document inside a SqlServer XML column will work and if you use standard XML serialization that will leave you with a decent ACID complant key/value store. Also, it will allow you to do queries on it with relative ease and you can join the results to data that you store in a more relational way. We do so, it works. If you store content in XML fields, storage demands are a lot lower than using NTEXT and querying it will be more flexible and faster.
What SqlServer will not get you (comparing to mongo) is the seamless failover of replica-sets an the autosharding of mongo. Also, atomic operations like incrementing a specific property deep inside a document is hard (though not impossible with the XQuery update function). Updates tend to be faster on most NoSql databases, because they are more relaxed on the "data is only safe on disk" principle.
Yes, it is possible. As to whether it's a good idea, this is just my 2 cents...
Before the XML datatype came along I worked on a system storing XML in an NTEXT column - that wasn't pleasant, and to get any real use out of the data meant shredding some of that data out into relational form.
OK, the XML datatype now makes it easier to query an XML blob and to extract certain values/index them. But personally, in general, I wouldn't. I'm not saying never use XML as there are scenarios for that - rather if that's all your planning on doing then I'd be thinking "is this the right tool for the job". Using a RDBMS as a document database makes me feel a bit uneasy. Whereas something like MongoDB has been built from the ground up as a document database.
In all honesty, I haven't done any performance testing on storing data as XML so I can't give you an indication of what performance would be like. Would be interested to know how this performs at scale.

Changing VC++6 app database from Access to SQL Server - can I use linked tables?

We have a Visual C++ 6 app that stores data in an Access database using DAO. The database classes have been made using the ClassWizard, basing them on CDaoRecordset.
We need to move from Access to SQL Server because some clients have huge (1.5Gb+) databases that are really slow to run reports on (using Crystal Reports and a different app).
We're not too worried about performance on this VC++ app - it is downloading data from data recorders and putting it in the database.
I used the "Microsoft SQL Server Migration Assistant 2008 for Access" to migrate my database from Access into SQL Server - it then linked the tables in the original Access database. If I open the Access database then I can browse the data in the SQL Server database.
I've then tried to use that database with my app and keep running into problems.
I've changed all my recordsets to be dbOpenDynaset instead of dbOpenTable. I also changed the myrecordsetptr->open() calls to be myrecordsetptr->open(dbOpenDynaset, NULL, dbSeeChanges) so that I don't get an exception.
But... I'm now stuck getting an exception 3251 - 'Operation is not supported for this type of object' for one of my tables when I try to set the current index using myrecordsetptr->->SetCurrentIndex(_T("PrimaryKey"));
Are there any tricks to getting the linked tables to work without rewriting all the database access code?
[UPDATE 17/7/09 - thanks for the tips - I'll change all the Seek() references to FindFirst() / FindNext() and update this based on how I go]
Yes, but I don't think you can set/change the index of a linked table in the recordset, so you'll have to change the code accordingly.
For instance: If your code is expecting to set an index & call seek, you'll basically have to rewrite it use the Find method instead.
Why are you using SetCurrentIndex when you have moved your table from Access to SQL Server?
I mean - you are using Access only for linked table.
Also, as per this page - it says that SetCurrentIndex can be used for table type recordset.
In what context are you using the command SetCurrentIndex? If it's a subroutine that uses SEEK you can't use it with linked tables.
Also, it's Jet-only and isn't going to be of any value with a different back end.
I advise against the use of SEEK (even in Access with Jet tables) except for the most unusual situations where you need to jump around a single table thousands of times in a loop. In all other DAO circumstances, you should either be retrieving a limited number of records by using a restrictive WHERE clause (if you're using SEEK to get to a single record), or you should be using .FindFirst/FindNext. Yes, the latter two are proportionally much slower than SEEK, but they are much more portable, and also the absolute performance difference is only going to be relevant if you're doing thousands of them.
Also, if your SEEK is on an ordered field, you can optimize your navigation by checking whether the sought value is greater or lesser than the value of the current record, and choosing .FindPrevious or .FindNext, accordingly (because the DAO recordset Find operations work sequentially through the index).

Resources