Full-text Search on Joined, Hierarchical Records in SQL Server 2008 - sql-server

Probably a noob question, but I'll go for it nevertheless.
For the sake of example, I have a Person table, a Tag table and a ContactMethod table. A Person will have multiple Tag records and multiple ContactMethod records associated with them.
I'd like to have a forgiving search which will search among several fields from each table. So I can find a person by their email (via ContactMethod), their name (via Person) or a tag assigned to them.
As a complete noob to FTS, two approaches come to mind:
Build some complex query which addresses each field individually
Build some sort of lookup table which concatenates the fields I want to index and just do a full-text query on that derived table.
(Feel free to edit for clarity; I'm not in it for the rep points.)

If your SQL Server edition supports it, you can create an indexed view and full-text search that; you can then use CONTAINSTABLE(view_name, *, '"chris"') to search across all the indexed columns at once.
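A rough sketch of that approach; the table, view and column names are assumptions rather than anything from the question, and it assumes a default full-text catalog already exists:

CREATE VIEW dbo.vw_PersonContactSearch WITH SCHEMABINDING AS
SELECT c.ContactMethodId, c.PersonId, p.FullName, c.ContactValue
FROM dbo.Person p
JOIN dbo.ContactMethod c ON c.PersonId = p.PersonId;
GO
-- The full-text KEY INDEX must be a unique, single-column, non-nullable index,
-- so the view is keyed on the contact method's primary key.
CREATE UNIQUE CLUSTERED INDEX IX_vw_PersonContactSearch
    ON dbo.vw_PersonContactSearch (ContactMethodId);
GO
CREATE FULLTEXT INDEX ON dbo.vw_PersonContactSearch (FullName, ContactValue)
    KEY INDEX IX_vw_PersonContactSearch;
GO
-- Search every indexed column at once and join back for the person details:
SELECT v.PersonId, v.FullName, v.ContactValue, ft.RANK
FROM CONTAINSTABLE(dbo.vw_PersonContactSearch, *, '"chris"') ft
JOIN dbo.vw_PersonContactSearch v ON v.ContactMethodId = ft.[KEY]
ORDER BY ft.RANK DESC;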
If it doesn't support it, then because the fields all come from different tables I'd think about scalability: if you can easily populate the fields into a single row per record in a separate table, I would full-text search that rather than the individual records. You will end up with a less complex FTS catalog, and your queries will not need to run four full-text searches at a time; running lots of separate FTS queries over different tables at the same time is a ticket to query performance issues, in my experience. The downside of doing this is that you lose the ability to search for Surname on its own; if that is something you need, you might have to look at an alternative.
In our app we found that the single table was quicker (we can't rely on customers having Enterprise SQL at hand), so we populate the data, space-separated, into an FTS table through an update stored procedure, and our main contact lookup runs a search over that list. We have two separate searches to handle finding things with precision (i.e. names or phone numbers) or just free text. The other nice thing about the table is that it is relatively easy and cheap to add further columns to the lookup (we have been asked for social security number, for example; to do it we just added the column to the update SP and we were away with little or no impact).
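A minimal sketch of that single lookup table and the update procedure. All object and column names here are invented for illustration; SQL Server 2008 has no STRING_AGG, so FOR XML PATH is used to glue the child rows together (it will XML-escape characters such as '&'):

CREATE TABLE dbo.PersonSearch
(
    PersonId   INT NOT NULL CONSTRAINT PK_PersonSearch PRIMARY KEY,
    SearchText NVARCHAR(MAX) NOT NULL
);
GO
CREATE FULLTEXT CATALOG ftPersonSearch AS DEFAULT;
GO
CREATE FULLTEXT INDEX ON dbo.PersonSearch (SearchText) KEY INDEX PK_PersonSearch;
GO
CREATE PROCEDURE dbo.RebuildPersonSearch
AS
BEGIN
    DELETE FROM dbo.PersonSearch;

    -- Concatenate the person's name, contact methods and tags, separated by spaces.
    INSERT INTO dbo.PersonSearch (PersonId, SearchText)
    SELECT p.PersonId,
           p.FirstName + N' ' + p.LastName + N' '
           + ISNULL((SELECT c.ContactValue + N' '
                     FROM dbo.ContactMethod c
                     WHERE c.PersonId = p.PersonId
                     FOR XML PATH('')), N'')
           + ISNULL((SELECT t.TagTitle + N' '
                     FROM dbo.PersonTag t
                     WHERE t.PersonId = p.PersonId
                     FOR XML PATH('')), N'')
    FROM dbo.Person p;
END;
GO
-- The contact lookup then runs a single full-text query:
SELECT PersonId FROM dbo.PersonSearch WHERE CONTAINS(SearchText, N'"chris*"');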

One possibility is to make a view which has these columns: PersonID, ContentType, Content. ContentType would be something like "Email", "PhoneNumber", etc... and Content would hold that. You'd be searching on the Content column, and you'd be able to see what the person's ID is. I'm not 100% sure how full text search works though, so I'm not sure if you could use that on a view.
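For what it's worth, a sketch of that view might look like the following (table and column names are guesses). One caveat: a view containing UNION cannot be turned into an indexed view, and SQL Server only supports full-text indexes on base tables and indexed views, so in practice you would probably materialise these rows into a real table before full-text indexing them.

CREATE VIEW dbo.vw_PersonContent AS
SELECT PersonId, 'Name'  AS ContentType, FullName AS Content FROM dbo.Person
UNION ALL
SELECT PersonId, 'Email' AS ContentType, Email    AS Content FROM dbo.ContactMethod
UNION ALL
SELECT PersonId, 'Tag'   AS ContentType, TagTitle AS Content FROM dbo.PersonTag;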

FTS can search multiple fields out of the box: the CONTAINS predicate accepts a list of columns to search, and so does CONTAINSTABLE.
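For example (column names are illustrative; both columns must already be covered by the table's full-text index):

-- One predicate over several columns:
SELECT PersonId, FullName
FROM dbo.Person
WHERE CONTAINS((FullName, Nickname), N'"chris*"');

-- The CONTAINSTABLE form also takes a column list and returns a RANK:
SELECT ft.[KEY] AS PersonId, ft.RANK
FROM CONTAINSTABLE(dbo.Person, (FullName, Nickname), N'"chris*"') ft;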

Related

NetSuite - UNION ALL equivalent in saved search?

I'm in the process of writing a SuiteTalk integration, and I've hit an interesting data transformation issue. In the target system, we have a sort of notes table which has a category column and then the notes column. Data going into that table from NetSuite could be several different fields on a single entity in NetSuite terms, but several records of different categories in our terms.
If you take the example of a Sales Order, you might have two text fields that we need to bring across as notes. For each of those fields I need to create a row, with the notes in the same column but on separate rows, plus a dynamic column that gives the category for each of those fields.
So instead of
SO number    notes 1       notes 2
SO1234567    some text1    some text2
You’d get
SO Number    Category      Text
SO1234567    category 1    some text1
SO1234567    category 2    some text2
The two problems I’m really trying to solve here are:
Where can I store the category name? It can’t be the field name in NetSuite. It needs to be configurable per customer as the number of notes fields in each record type might vary across implementations. This is currently my main blocker.
Performance – I could create a saved search for each type of note, and bring one row across each time, but that’s not really an acceptable performance hit if I can do it all in one call.
I use Saved Searches in NetSuite to provide a configurable way of filtering the data to import into the target system.
If I were writing a SQL query, I would use the UNION clause, with the first column being a dynamic column denoting the category and the second column being the actual data field from NetSuite. My ideal would be if I could somehow do a similar thing either as a single saved search, or as one saved search per entity, without having to create any additional fields within NetSuite itself, so that from the SuiteTalk side I can just query the search and pull in the data.
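For reference, the plain-SQL shape of that transformation would be something like the following (table, field and category names are invented; this only illustrates the UNION ALL with a literal category column, it is not something a saved search can express directly):

SELECT so_number, 'category 1' AS category, notes_1 AS text FROM sales_orders
UNION ALL
SELECT so_number, 'category 2' AS category, notes_2 AS text FROM sales_orders;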
As a temporary kludge, I now have multiple saved searches in NetSuite, one per category, and within the ID of the saved search I encode the category name and an indicator of the record type. I then have a parent search which gives me the searches for that record type - it's very clunky, and ultimately results in far too many round trips for me to be satisfied.
Any idea if something like this is at all possible? Or if not, is there a way of solving this without hard-coding the category values in the front end? Even just being able to bring back multiple record sets in one call would be a performance improvement.
I've asked the same question on the NetSuite forums but to no avail.
Thanks
At first read it sounds like you are trying to query a set of fields from entities. The fields may be custom fields or built-in fields. Can you not just query the entities with a saved search that returns all the potential category columns, and then transform the received data into categories?
Otherwise, please provide more specifics, in NetSuite terms, about what you are trying to do.

Dynamic columns in database tables vs EAV

I'm trying to decide which way to go if I have an app that needs to be able to change the db schema based on the user input.
For example, if I have a "car" object that contains car properties, like year, model, # of doors etc, how do I store it in the DB in such a way, that the user should be able to add new properties?
I read about EAV tables and they seem right for this thing, but the problem is that queries will get pretty complicated when I try to get a list of cars filtered by a set of properties.
Could I generate the tables dynamically instead? I see that Sqlite has support for ADD COLUMN, but how fast is it when the table reaches many records? And it looks like there's no way to remove a column. I have to create a new table without the column I want to remove, and copy the data from the old table. That's certainly slow on large tables :(
I will assume that SQLite (or another relational DBMS) is a requirement.
EAVs
I have worked with EAVs and generic data models, and I can say that the data model is very messy and hard to work with in the long run.
Let's say that you design a data model with three tables: entities, attributes, and entity_attributes:
CREATE TABLE entities
(entity_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE attributes
(attribute_id INTEGER PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE entity_attributes
(entity_id INTEGER, attribute_id INTEGER, value TEXT,
PRIMARY KEY(entity_id, attribute_id));
In this model, the entities table will hold your cars, the attributes table will hold the attributes that you can associate with your cars (brand, model, color, ...) and their type (text, number, date, ...), and entity_attributes will hold the values of the attributes for a given entity (for example "red").
Take into account that with this model you can store as many entities as you want and they can be cars, houses, computers, dogs or whatever (ok, maybe you need a new field on entities, but it's enough for the example).
INSERTs are pretty straightforward. You only need to insert a new entity, a bunch of attributes and their relations. For example, to insert a new entity with 3 attributes you will need to execute 7 inserts (one for the entity, three for the attributes, and three more for the relations).
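For illustration, storing one car with a brand, model and colour would take the seven statements below (the values are invented):

-- the entity
INSERT INTO entities (entity_id, name) VALUES (1, 'car1');
-- its attributes
INSERT INTO attributes (attribute_id, name, type) VALUES (1, 'brand', 'text');
INSERT INTO attributes (attribute_id, name, type) VALUES (2, 'model', 'text');
INSERT INTO attributes (attribute_id, name, type) VALUES (3, 'color', 'text');
-- the relations holding the values
INSERT INTO entity_attributes (entity_id, attribute_id, value) VALUES (1, 1, 'Pontiac');
INSERT INTO entity_attributes (entity_id, attribute_id, value) VALUES (1, 2, 'Firebird');
INSERT INTO entity_attributes (entity_id, attribute_id, value) VALUES (1, 3, 'red');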
When you want to perform an UPDATE, you will need to know which entity you want to update, and update the desired attribute by joining through the relation between the entity and its attributes.
When you want to perform a DELETE, you will also need to know which entity you want to delete, delete its attributes, delete the relation between your entity and its attributes, and then delete the entity.
But when you want to perform a SELECT the thing becomes nasty (you need to write really difficult queries) and the performance drops horribly.
Imagine a data model to store car entities and its properties as in your example (say that we want to store brand and model). A SELECT to query all your records will be
SELECT brand, model FROM cars;
If you design a generic data model as in the example, the SELECT to query all your stored cars will be really difficult to write and will involve a 3 table join. The query will perform really bad.
Also, think about the definition of your attributes. All your attributes are stored as TEXT, and this can be a problem. What if somebody makes a mistake and stores "red" as a price?
Indexes are another thing you cannot benefit from (or at least not as much as would be desirable), and they become very necessary as the stored data grows.
As you say, the main concern as a developer is that the queries are really hard to write, hard to test and hard to maintain (how much would a client have to pay to buy all red, 1980, Pontiac Firebirds that you have?), and will perform very poorly when the data volume increases.
The only advantage of using EAVs is that you can store virtually everything with the same model, but it is like having a box full of stuff in which you want to find one concrete, small item.
Also, to use an argument from authority, I will say that Tom Kyte argues strongly against generic data models:
http://tkyte.blogspot.com.es/2009/01/this-should-be-fun-to-watch.html
https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:10678084117056
Dynamic columns in database tables
On the other hand you can, as you say, generate the tables dynamically, adding (and removing) columns when needed. In this case you could, for example, create a car table with the basic attributes that you know you will use, and then add columns dynamically when you need them (for example, the number of exhausts).
The disadvantage is that you will need to add columns to an existing table and (maybe) build new indexes.
This model, as you say, also has another problem when working with SQLite as there's no direct way to delete columns and you will need to do this as stated on http://www.sqlite.org/faq.html#q11
BEGIN TRANSACTION;
-- assuming t1 originally had an extra column that we want to drop
CREATE TEMPORARY TABLE t1_backup(a,b);
INSERT INTO t1_backup SELECT a,b FROM t1;
DROP TABLE t1;
-- recreate t1 without the dropped column and copy the data back
CREATE TABLE t1(a,b);
INSERT INTO t1 SELECT a,b FROM t1_backup;
DROP TABLE t1_backup;
COMMIT;
Anyway, I don't really think that you will need to delete columns (or at least it will be a very rare scenario). Maybe someone adds the number of doors as a column and stores a car with this property. Before deleting the column you would need to make sure that none of your cars still use this property, to prevent losing data. But this, of course, depends on your concrete scenario.
Another drawback of this solution is that you will need a table for each kind of entity you want to store (one table to store cars, another to store houses, and so on...).
Another option (pseudo-generic model)
A third option could be to have a pseudo-generic model, with a table having columns to store id, name, and type of the entity, and a given (enough) number of generic columns to store the attributes of your entities.
Let's say that you create a table like this:
CREATE TABLE entities
(entity_id INTEGER PRIMARY KEY,
name TEXT,
type TEXT,
attribute1 TEXT,
attribute2 TEXT,
...
attributeN TEXT
);
In this table you can store any entity (cars, houses, dogs) because you have a type field and you can store as many attributes for each entity as you want (N in this case).
If you need to know what attribute37 stands for when it holds a value such as "red", you would need to add another table that relates each entity type's attributes to a description of the attribute.
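A possible shape for that lookup table (the name and the sample row are invented):

CREATE TABLE type_attributes
(type TEXT, attribute TEXT, description TEXT,
 PRIMARY KEY (type, attribute));

-- e.g. this row records that, for cars, attribute37 holds the colour:
INSERT INTO type_attributes VALUES ('car', 'attribute37', 'color');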
And what if you find that one of your entities needs more attributes? Then simply add new columns to the entities table (attributeN+1, ...).
In this case, the attributes are always stored as TEXT (as in EAVs), with the disadvantages that entails.
But you can use indexes, the queries are really simple, the model is generic enough for your case, and in general, I think that the benefits of this model are greater than the drawbacks.
Hope it helps.
Follow up from the comments:
With the pseudo-generic model your entities table will have a lot of columns. From the documentation (https://www.sqlite.org/limits.html), the default setting for SQLITE_MAX_COLUMN is 2000. I have worked with SQLite tables with over 100 columns with great performance, so 40 columns shouldn't be a big deal for SQLite.
As you say, most of your columns will be empty for most of your records, and you will need to index all of your columns for performance, but you can use partial indexes (https://www.sqlite.org/partialindex.html). This way, your indexes will be small, even with a high number of rows, and the selectivity of each index will be great.
If you implement an EAV with only two tables, the number of joins between tables will be lower than in my example, but the queries will still be hard to write and maintain, and you will need to do several (outer) joins to extract data, which will reduce performance, even with a good index, when you store a lot of data. For example, imagine that you want to get the brand, model and color of your cars. Your SELECT would look like this:
-- (for readability, 'brand', 'model' and 'color' stand in for the numeric attribute_id values)
SELECT e.name, a1.value brand, a2.value model, a3.value color
FROM entities e
LEFT JOIN entity_attributes a1 ON (e.entity_id = a1.entity_id and a1.attribute_id = 'brand')
LEFT JOIN entity_attributes a2 ON (e.entity_id = a2.entity_id and a2.attribute_id = 'model')
LEFT JOIN entity_attributes a3 ON (e.entity_id = a3.entity_id and a3.attribute_id = 'color');
As you see, you would need one (left) outer join for each attribute you want to query (or filter). With the pseudo-generic model the query will be like this:
SELECT name, attribute1 brand, attribute7 model, attribute35 color
FROM entities;
Also, take into account the potential size of your entity_attributes table. If you can potentially have 40 attributes for each entity, let's say that 20 of them are not null for each one. If you have 10,000 entities, your entity_attributes table will have 200,000 rows, and you will be querying it using one huge index. With the pseudo-generic model you will have 10,000 rows and one small index for each column.
It all depends on the way in which your application needs to reason about the data.
If you need to run queries which need to do complicated comparisons or joins on data whose schema you don't know in advance, SQL and the relational model are rarely a good fit.
For instance, if your users can set up arbitrary data entities (like "car" in your example), and then want to find cars whose engine capacity is greater than 2000cc, with at least 3 doors, made after 2010, whose current owner is part of the "little old ladies" table, I'm not aware of an elegant way of doing this in SQL.
However, you could achieve something like this using XML, XPath etc.
If your application has a set of data entities with known attributes, but users can extend those attributes (a common requirement for products like bug trackers), "add column" is a good solution. However, you may need to invent a custom query language to allow users to query those columns. For instance, Atlassian's Jira bug tracking solution has JQL, a SQL-like language for querying bugs.
EAV is great if your task is to store and then show data. However, even moderately complex queries become very hard in an EAV schema - imagine how you'd execute my made up example above.
For your use case, a document oriented database like MongoDB would do great.
Another option that I haven't seen mentioned above is to use denormalized tables for the extended attributes. This is a combination of the pseudo-generic model and the dynamic columns in database tables. Instead of adding columns to existing tables, you add columns or groups of columns into new tables with FK indexes to the source table. Of course, you'll want a good naming convention (car, car_attributes_door, car_attributes_littleOldLadies)
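A small sketch of that layout and the resulting query, using the naming convention above (the column names are invented for illustration):

CREATE TABLE car
(car_id INTEGER PRIMARY KEY, brand TEXT, model TEXT);

-- one extra table per attribute (or group of attributes), keyed by the source table's id
CREATE TABLE car_attributes_door
(car_id INTEGER PRIMARY KEY REFERENCES car(car_id), door_count INTEGER);

CREATE TABLE car_attributes_littleOldLadies
(car_id INTEGER PRIMARY KEY REFERENCES car(car_id), owner_name TEXT);

-- selection is one LEFT OUTER JOIN per extended attribute group
SELECT c.brand, c.model, d.door_count, l.owner_name
FROM car c
LEFT OUTER JOIN car_attributes_door d ON d.car_id = c.car_id
LEFT OUTER JOIN car_attributes_littleOldLadies l ON l.car_id = c.car_id;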
Your selection problem becomes that of applying a LEFT OUTER JOIN to include the extended attributes that you want to include.
Slower than normalized, but not as slow as EAV.
Adding new extended attributes becomes a problem of adding a new table.
Harder than EAV, easier/faster than modifying table schema.
Deleting attributes becomes a problem of dropping whole tables.
Easier/faster than modifying table schema.
These new attributes can be strongly typed.
As good as modifying table schema, faster than EAV or generic columns.
The biggest advantage to this approach that I can see is that deleting unused attributes is quite easy compared to any of the others via a single DROP TABLE command. You also have the option to later normalize often-used attributes into larger groups or into the main table using a single ALTER TABLE process rather than one for each new column you were adding as you added them, which helps with the slow LEFT OUTER JOIN queries.
The biggest disadvantage is that you're cluttering up your table list, which admittedly is often not a trivial concern. That, and I'm not sure how much better LEFT OUTER JOINs actually perform than EAV table joins; it's definitely closer to EAV join performance than to normalized table performance.
If you're doing a lot of comparisons/filters of values that benefit greatly from strongly typed columns, but you add/remove these columns frequently enough to make modifying a huge normalized table intractable, this seems like a good compromise.
I would try EAV.
Adding columns based on user input doesn't sound right to me, and you can quickly run out of capacity. Queries on a very wide, flat table can also be a problem. Do you want to create hundreds of indexes?
Instead of writing everything to one table, I would store as many common properties as possible (price, name, color, ...) in the main table and the less common properties in an "extra" attributes table. You can always rebalance them later with a little effort.
EAV can perform well for small to middle-sized data sets. Since you want to use SQLite, I guess it won't be a problem.
You may also want to avoid "over"-normalizing your data. With the cheap storage we currently have, you can use one table to store all "extra" attributes, instead of two:
ent_id, ent_name, ...
ent_id, attr_name, attr_type, attr_value ...
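In SQLite those two tables might look like this (the column types and the composite key are my assumptions):

CREATE TABLE cars
(ent_id INTEGER PRIMARY KEY, ent_name TEXT, price NUMERIC, color TEXT);

CREATE TABLE car_extra_attributes
(ent_id INTEGER REFERENCES cars(ent_id),
 attr_name TEXT, attr_type TEXT, attr_value TEXT,
 PRIMARY KEY (ent_id, attr_name));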
People against EAV will say its performance is poor on large databases. It certainly won't perform as well as a normalized structure, but you don't want to change the structure of a 3TB table either.
I have a low-quality but possible answer, inspired by HTML tags such as <tag width="10px" height="10px" ... />.
In this dirty way you have just one varchar(max) column, say Props, for all the properties, and you store data in it like this:
Props
------------------------------------------------------------
Model:Model of car1|Year:2010|# of doors:4
Model:Model of car2|NewProp1:NewValue1|NewProp2:NewValue2
This way all the work moves into the business layer, using functions such as a concatCustom that takes an array and returns a string, and an unconcatCustom that takes a string and returns an array.
To cope with the special characters ':' and '|' appearing in values, I suggest using '#:#' and '#|#' (or something rarer) as the separators.
In a similar way you can use a text or binary field and store an XML data in the column.

How can I simplify my database?

I am working on a project in which I generate a customer's unique id from the first letter of the customer's last name, and store the customer's information in a different table per letter: if a customer's name starts with A, the whole record goes into the Registration_A table, and so on up to Registration_Z. But retrieving data with such a structure is quite difficult. Can you suggest another way to save the data so that retrieval becomes more flexible?
Put all of your registration data into one table. There's absolutely no need for you to break it into alphabetical pieces like that unless you have some serious performance issues.
When querying for registration data, use SQL's WHERE clause to narrow down your results.
You should merge this into one table, Registration, and then let the database take care of unique ids. How depends on your database, but searching for PRIMARY KEY or AUTO INCREMENT should give you lots of results.
If you did the splitting for performance reasons, you can add an index on the user's last name instead.
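A sketch of the merged table (column names are illustrative, and the exact auto-increment syntax depends on your DBMS):

-- one table for everyone; the id is generated by the database
-- (IDENTITY, AUTO_INCREMENT or a sequence, depending on your DBMS)
CREATE TABLE Registration
(
    customer_id INT PRIMARY KEY,   -- make this an identity/auto-increment column
    first_name  VARCHAR(100),
    last_name   VARCHAR(100)
    -- ... the columns you currently repeat in Registration_A .. Registration_Z
);

CREATE INDEX idx_registration_last_name ON Registration (last_name);

-- narrow the results with WHERE instead of picking a per-letter table
SELECT * FROM Registration WHERE last_name LIKE 'A%';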

How do you implement a fulltext search over multiple columns in sql server?

I am trying to implement a fulltext search on two columns which I created a view for: VendorName, ProductName. I have the full text index etc working but the actual query is what is causing some issues for me.
I want users to be able to use some standard search conventions, AND OR NOT and grouping of terms by () which is fine but I want to apply the search over both the columns so for example if I were to run a query such as:
SELECT * FROM vw_Search
WHERE CONTAINS((VendorName, ProductName), 'Apple AND iTunes')
It seems to apply the query to each column individually, i.e. checking the vendor name for both terms and then checking the product name for both terms, which won't match unless the vendor was called "Apple iTunes".
If I change the query to :
SELECT * FROM vw_Search
WHERE CONTAINS(VendorName, 'Apple OR iTunes')
AND CONTAINS(ProductName, 'Apple OR iTunes')
then it works, but it breaks other search queries (such as searching for just "apple"), and from the user's point of view it doesn't make much sense: what they are naturally going to write is AND, but it takes an OR to work.
What I want is for it to match when the search term is satisfied across the two columns combined, so it would, for example, match all vendors named Apple with a product named iTunes.
Should I create a separate field in the view that concatenates the Vendor and Product fields and performs the first query on that new field or is there something I am missing out?
Aside from that would anyone know of an existing method of validating the queries?
In earlier versions of SQL Server, the queries matched across multiple columns.
However, this was considered a bug.
To match across multiple columns, you should concatenate them in a computed column and create a full-text index over that column.
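A sketch of that suggestion, assuming both columns live in (or have been denormalised into) one table. The names are hypothetical, and it further assumes your edition accepts a persisted computed column in CREATE FULLTEXT INDEX and that a default full-text catalog exists, which is worth verifying:

ALTER TABLE dbo.VendorProduct
    ADD SearchText AS (ISNULL(VendorName, N'') + N' ' + ISNULL(ProductName, N'')) PERSISTED;
GO
CREATE FULLTEXT INDEX ON dbo.VendorProduct (SearchText)
    KEY INDEX PK_VendorProduct;  -- any unique, single-column, non-nullable index
GO
-- 'Apple AND iTunes' now matches a vendor called Apple selling a product called iTunes:
SELECT *
FROM dbo.VendorProduct
WHERE CONTAINS(SearchText, N'Apple AND iTunes');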

How to design data storage for partitioned tagging system?

How to design data storage for huge tagging system (like digg or delicious)?
There is already discussion about it, but that assumes a centralized database. Since the data is expected to grow, we'll need to partition the data into multiple shards sooner or later. So the question becomes: how to design data storage for a partitioned tagging system?
The tagging system basically has 3 tables:
Item (item_id, item_content)
Tag (tag_id, tag_title)
TagMapping(map_id, tag_id, item_id)
That works fine for finding all items for a given tag and finding all tags for a given item, as long as the table is stored in one database instance. If we need to partition the data into multiple database instances, it is not that easy.
For table Item, we can partition its content by its key item_id. For table Tag, we can partition its content by its key tag_id. For example, say we want to partition table Tag into K databases: we can simply store a given tag in database number (tag_id % K).
But, how to partition table TagMapping?
The TagMapping table represents the many-to-many relationship. The only approach I can imagine is duplication: the same content of TagMapping has two copies, one partitioned by item_id and the other partitioned by tag_id. To find the tags for a given item, we use the copy partitioned by item_id; to find the items for a given tag, we use the copy partitioned by tag_id.
As a result, there is data redundancy. And, the application level should keep the consistency of all tables. It looks hard.
Is there any better solution to solve this many-to-many partition problem?
I doubt there is a single approach that optimizes all possible usage scenarios. As you said, there are two main scenarios that the TagMapping table supports: finding tags for a given item, and finding items with a given tag. I think there are some differences in how you will use the TagMapping table for each scenario that may be of interest. I can only make reasonable assumptions based on typical tagging applications, so forgive me if this is way off base!
Finding Tags for a Given Item
A1. You're going to display all of the tags for a given item at once
A2. You're going to ensure that all of an item's tags are unique
Finding Items for a Given Tag
B1. You're going to need some of the items for a given tag at a time (to fill a page of search results)
B2. You might allow users to specify multiple tags, so you'd need to find some of the items matching multiple tags
B3. You're going to sort the items for a given tag (or tags) by some measure of popularity
Given the above, I think a good approach would be to partition TagMapping by item. This way, all of the tags for a given item are on one partition. Partitioning can be more granular, since there are likely far more items than tags and each item has only a handful of tags. This makes retrieval easy (A1) and uniqueness can be enforced within a single partition (A2). Additionally, that single partition can tell you if an item matches multiple tags (B2).
Since you only need some of the items for a given tag (or tags) at a time (B1), you can query partitions one at a time in some order until you have as many records needed to fill a page of results. How many partitions you will have to query will depend on how many partitions you have, how many results you want to display and how frequently the tag is used. Each partition would have its own index on tag_id to answer this query efficiently.
The order you pick partitions in will be important as it will affect how search results are grouped. If ordering isn't important (i.e. B3 doesn't matter), pick partitions randomly so that none of your partitions get too hot. If ordering is important, you could construct the item id so that it encodes information relevant to the order in which results are to be sorted. An appropriate partitioning scheme would then be mindful of this encoding. For example, if results are URLs that are sorted by popularity, then you could combine a sequential item id with the Google Page Rank score for that URL (or anything similar). The partitioning scheme must ensure that all of the items within a given partition have the same score. Queries would pick partitions in score order to ensure more popular items are returned first (B3). Obviously, this only allows for one kind of sorting and the properties involved should be constant since they are now part of a key and determine the record's partition. This isn't really a new limitation though, as it isn't easy to support a variety of sorts, or sorts on volatile properties, with partitioned data anyways.
The rule is that you partition by the field that you are going to query by; otherwise you'll have to look through all partitions. Are you sure you'll need to query the Tag table by tag_id only? I believe not; you'll also need to query by tag title. It's not so obvious for the Item table, but you will probably also want to query by something like URL, to find the item_id when another user assigns tags to it.
But note that the Tag and Item tables have an immutable title and URL. That means you can use the following technique:
Choose the partition from the title (for Tag) or the URL (for Item).
Use that partition's sequence to generate the id.
You either use the partition-localID pair as a global identifier or use non-overlapping number sets. Either way, you can now compute the partition from both the id and the title/URL fields. Don't know the number of partitions in advance, or worried it might change in the future? Create more of them and join them into groups, so that you can regroup them later.
Sure, you can't do the same for the TagMapping table, so you have to duplicate it. You need to query it by map_id, by tag_id and by item_id, right? So even without partitioning you have to duplicate data by creating 3 indexes. The difference is that you use a different partitioning (by a different field) for each index. I see no reason to worry about it.
Most likely your queries are going to be related to a user or a topic, meaning that you should keep all the info related to each of those in one place.
You're talking about distributing the DB, which is mostly a synchronization issue. Reading, which is usually about 90% of the work, can be done against a replicated database. The issue is how to update one DB and stay consistent with all the others without killing performance. This depends on the details of your scenario.
The other possibility is to partition, as you asked, all the data without overlapping. You would probably partition by user ID or topic ID. If you partition by topic ID, one database could reference all topics and simply tell which dedicated DB holds the data; you can then query the correct one. Since you partition by ID, all info related to that topic can live on that specialized database. You could also partition by language or country for an international website.
Last but not least, you'll probably end up mixing the two: some non-overlapping data, and some overlapping (replicated) data. First find the usual operations, then work out how to perform them against one DB in as few queries as possible.
PS: Don't forget about caching; it'll save you more than a distributed DB will.
