Is a full text search on one table faster than two tables? - sql-server

In the full text search page http://msdn.microsoft.com/en-us/library/ms189760.aspx on MSDN it says that if you want to do a full text search on multiple tables just "use a joined table in your FROM clause to search on a result set that is the product of two or more tables."
My question is, isn't this going to be really slow if you have to merge two very large tables?
If I'm merging a product table with a category table and there are millions of records, won't the join take a long time and then have to search after the join?

Joins on millions of records can still be fast if the join is optimized for performance, for example by joining on a single int column that is indexed in both tables. But there can be other factors at play, so the best approach is to try it and gauge the performance yourself.
If the join doesn't perform well, you have a couple of options:
Create an indexed view of the tables joined together (a full text index needs a unique key index, so a plain view is not enough), create a full text index on that view, and run your full text queries against that view.
Create a third table which is a combination of the two tables you are joining, create a full text index on it, and run your full text queries against that table. You'll need something like an ETL process to keep it updated.

SSIS Lookup Transform use Table or Query

I have a Lookup Transformation on a table with 30 columns, but I am only using two of them: the ID column for the join and the Update column as output.
On the connection should I enter a query Select ID, Update From T1 or Use Table in the drop down?
Would using Table in the drop down be like doing Select * From T1, or is SSIS clever enough to know I only need 2 columns?
I'm thinking I should go with the Query Select ID, Update From T1.
"On the connection should I enter a query Select ID, Update From T1 or Use Table in the drop down?"
It is best to specify which columns you want.
"Using table in Drop down, would this be like doing Select * From T1"
Yes, it is a SELECT *.
"or is SSIS clever enough to know I only need 2 columns?"
Nope.
Keep in mind that Lookups are good for pulling data from Dimension Tables where the row count and record set is small. If you are dealing with large amounts of unique data, then it will be better to perform a MERGE JOIN, instead. The performance difference can be substantial. For example, when using a Lookup on 20K rows of data, you could experience run times in the tens of minutes. A MERGE JOIN, however, would run within seconds.
Lookups have the drawback of behaving like correlated sub-queries in that they fire off a query to the server for every row passing through them. You can have the Lookup cache the data, which means SSIS will store the results in memory and then check the memory before going to the server for all subsequent rows passing through the Lookup. As a result, this is only effective if there are a large number of matching records for a small cache set. In other words, Lookups are not optimal when there is a large number of distinct IDs to look up. At that point, caching data is almost pointless.
This is where you would switch over to using a MERGE JOIN. Note: you will need to perform a SORT on both of the data flows before the MERGE JOIN because the MERGE JOIN component requires the incoming rows to be sorted.
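To see why the MERGE JOIN needs sorted input, here is a minimal Python sketch of a merge join over two lists of (key, value) pairs. It is a toy model of the general algorithm, not SSIS code; the function name and sample data are made up for illustration.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs that are ALREADY sorted by key."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1          # left key has no partner yet, advance left
        elif lkey > rkey:
            j += 1          # right key has no partner yet, advance right
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lkey:
                for k in range(j, j_end):
                    result.append((lkey, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return result


# Hypothetical sample data; sorted() plays the role of the SORT step.
orders = sorted([(3, "order-c"), (1, "order-a"), (2, "order-b"), (3, "order-d")])
customers = sorted([(2, "Bo"), (1, "Al"), (4, "Di"), (3, "Cy")])
joined = merge_join(orders, customers)

# Feeding unsorted rows silently drops matches, because each input is read only once.
unsorted_result = merge_join([(3, "x"), (1, "y")], [(1, "Al"), (3, "Cy")])
```

Each input is walked exactly once, which is why the join is cheap once the data is sorted, and why skipping the sort produces missing rows rather than an error in this sketch.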
When handled incorrectly, a single poorly placed Lookup can bring an entire package to its knees - lookups can be huge performance bottlenecks. Though, handled correctly, a Lookup can simplify the design of the dataflow and speed development by removing the extra development required to MERGE JOIN data flows.
The bottom line to all of this is that you want the Lookup performing the fewest number of queries against the server.
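The caching argument can be made concrete with a small Python simulation. This is only a toy model of the behavior described above (one server query per row, with an optional in-memory cache filled as rows arrive); it does not use any SSIS API, and the row counts are invented.

```python
def count_server_queries(incoming_ids, use_cache):
    """Count simulated round trips to the server for a stream of lookup keys."""
    cache = set()
    queries = 0
    for row_id in incoming_ids:
        if use_cache and row_id in cache:
            continue            # answered from memory, no round trip
        queries += 1            # simulated query against the server
        if use_cache:
            cache.add(row_id)
    return queries


# 20,000 rows that reference only 10 distinct IDs (dimension-style data).
few_distinct = [i % 10 for i in range(20_000)]
# 20,000 rows that each reference a different ID.
all_distinct = list(range(20_000))

no_cache = count_server_queries(few_distinct, use_cache=False)
cached_few = count_server_queries(few_distinct, use_cache=True)
cached_all = count_server_queries(all_distinct, use_cache=True)
```

In this model the cache cuts the query count from 20,000 to 10 when the keys repeat, and saves nothing when every key is distinct.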
If you need only two columns from the lookup table, then it is better to use a select query than to select the table from the drop down list, but the columns specified must contain the primary key (ID). Reading all columns consumes more resources, even if the effect may not be meaningful on small tables.
You can refer to the following answer on database administrators community for more information:
SSIS OLE DB Source Editor Data Access Mode: “SQL command” vs “Table or view”
Note that what @JWeezy mentioned about lookups on large tables is right. Lookups are not designed for large tables; I would use SQL JOINs instead.

DB design for combining two one to one table

I have a legacy application with the tables below, which have a 1 to 1 mapping:
customer (already has 40 columns)
customer_additional_attributes (has 20 columns)
My question: wouldn't it be a better design if the customer and customer_additional_attributes tables were combined, as that would sometimes save an extra join or query to fetch data from customer_additional_attributes?
Is there any disadvantage to a single table (as in the above scenario) with a large number of columns?
The data format that you have is called "vertical partitioning". This is when the columns of an entity are split across multiple tables. In a normalized structure, this is problematic, because inserts of rows (for instance) are not necessarily atomic -- they affect two tables.
But there are good reasons for doing this. The most obvious is when the rows are too wide. If the columns together are too wide, they simply will not fit in one table, so they are spread through multiple tables.
Similarly, if some columns are much larger -- and rarely used -- then putting them in another table can be a big win on performance.
Before combining the tables, you should find out whether the data structure is intentional. It might simply be the result of "laziness": the first table was created, and then additional attributes came along, so they were put into another table. Or it could be quite intentional, and you would want to understand why.
Note that the join between the two tables should be pretty fast, particularly if the same primary key is used for both.
You may have a many-to-many relationship, in which case you may have to create an intermediate table: one table for customer, one for customer_attributes, and one for customer_additional_attributes containing the IDs of the other two tables.

How to create a 'sanitized' copy of our SQL Server database?

We're a manufacturing company, and we've hired a couple of data scientists to look for patterns and correlation in our manufacturing data. We want to give them a copy of our reporting database (SQL 2014), but it must be in a 'sanitized' form. This means that all table names get converted to 'Table1', 'Table2' etc., and column names in each table become 'Column1', 'Column2' etc. There will be roughly 100 tables, some having 30+ columns, and some tables have 2B+ rows.
I know there is a hard way to do this. This would be to manually create each table, with the sanitized table name and column names, and then use something like SSIS to bulk insert the rows from one table to another. This would be rather time consuming and tedious because of the manual SSIS column mapping required, and manual setup of each table.
I'm hoping someone has done something like this before and has a much faster, more efficient way.
By the way, the 'sanitized' database will have no indexes or foreign keys. Also, it may not seem to make any sense why we would want to do this, but this is what was agreed to by our Director of Manufacturing and the data scientists, as the first round of an analysis which will involve many rounds.
You basically want to scrub the data and objects, correct? Here is what I would do.
1. Restore a backup of the db.
2. Drop all objects not needed (indexes, constraints, stored procedures, views, functions, triggers, etc.)
3. Create a table with two columns and populate it so that each row has the original table name and the new table name.
4. Write a script that iterates through the table, row by row, and renames your tables. Better yet, put the data into Excel, and create a third column that builds the T-SQL you want to run, then cut/paste and execute in SSMS.
5. Repeat step 4, but for all columns. Best to query sys.columns to get all the objects you need, put them in Excel, and build your T-SQL.
6. Repeat again for any other objects needed.
Backup/restore will be quicker than dabbling in SSIS and data transfer.
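The rename statements can also be generated by a short script instead of Excel. The following is a minimal Python sketch, assuming you have already pulled (schema, table, column) rows out of the restored copy (for example from sys.columns); the function name and the sample table names are hypothetical. It emits calls to the sp_rename system procedure, renaming columns first so that the original table names are still valid when the column statements run. It assumes plain identifiers with no dots or quotes in them, and no existing column already named like Column1.

```python
def build_rename_script(columns):
    """Build sp_rename statements from (schema, table, column) tuples.

    Tables become Table1, Table2, ... in first-seen order; columns become
    Column1, Column2, ... within each table, in the order given.
    """
    table_map = {}       # (schema, table) -> new table name
    col_counters = {}    # (schema, table) -> columns seen so far
    col_statements = []
    for schema, table, column in columns:
        key = (schema, table)
        if key not in table_map:
            table_map[key] = f"Table{len(table_map) + 1}"
            col_counters[key] = 0
        col_counters[key] += 1
        new_col = f"Column{col_counters[key]}"
        col_statements.append(
            f"EXEC sp_rename '{schema}.{table}.{column}', '{new_col}', 'COLUMN';"
        )
    # Tables are renamed last, after every column statement has run.
    table_statements = [
        f"EXEC sp_rename '{schema}.{table}', '{new_name}';"
        for (schema, table), new_name in table_map.items()
    ]
    return col_statements + table_statements, table_map


# Hypothetical metadata rows, standing in for a query against sys.columns.
sample_columns = [
    ("dbo", "Orders", "OrderID"),
    ("dbo", "Orders", "Qty"),
    ("dbo", "Parts", "PartNo"),
]
script, mapping = build_rename_script(sample_columns)
print("\n".join(script))
```

Keep the returned mapping somewhere private: it is the only record of which sanitized name corresponds to which real table.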
They can see the data but they can't see the column names? What can that possibly accomplish? What are you protecting by not revealing the table or column names? How is a data scientist supposed to evaluate data without context? Without a FK all I see is a bunch of numbers in a column named colx. What are you expecting to accomplish? Get a confidentiality agreement. Consider an FK column customerID versus a materialID. Their patterns have widely different meanings and call for different analysis. I would correlate a quality measure with materialID or shiftID but not with a customerID.
Oh look, there is a correlation between tableA.colB and tableX.colY. Well yes, that customer is a college team and they use aluminum bats.
On top of that you strip indexes (on tables with 2B+ rows), so the analysis they run will be slow. What does that accomplish?
As for the question as stated, do a backup and restore. Using the system tables, drop all triggers, FKs, indexes, and constraints. Don't forget to drop the triggers and constraints - they may disclose some trade secret. Then rename the columns and then the tables.

Full-Text Index Design Considerations (SQL Server 2008)

My website has a requirement that the user can search a number of different tables and columns. So I'm working to implement this using full-text search.
I'd like to get some input from someone with more FTS experience on the following issues.
While FTS allows you to search multiple columns from the same table in a single search, I'm not seeing an option to search multiple columns from multiple tables in a single search. Is this in fact the case?
If I need multiple searches to search across multiple tables, does it make sense to put the index for each table in a different full-text catalog? The wizards seem to recommend a new catalog for larger tables, but I have no idea what "large" means in this case.
Finally, is there any way to order the results such that matches in one column of a table come before matches in another column?
1. While FTS allows you to search multiple columns from the same table in a single search, I'm not seeing an option to search multiple columns from multiple tables in a single search. Is this in fact the case?
A FTIndex on a single table cannot include columns from another table. So typically, you'd just have to write your query so that it makes multiple searches (you alluded to this in #2).
Another option would be to create an Indexed View (see requirements) that spans multiple tables and then build a FTIndex on top of the view. I believe this is possible, but you should test to be certain.
2. If I need multiple searches to search across multiple tables, does it make sense to put the index for each table in a different full-text catalog? The wizards seem to recommend a new catalog for larger tables, but I have no idea what "large" means in this case.
It shouldn't make a difference in SQL2008, since the catalog is just a logical grouping. You might, however, consider putting the FTIndexes on different filegroups if you have a disk sub-system where that makes sense (similar considerations to partitioning tables across filegroups on different disks, to spread the IO).
3. Finally, is there any way to order the results such that matches in one column of a table come before matches in another column?
I don't believe this is possible...

Sql Server 2005 Indexed View

I have created a unique, clustered index on a view. The clustered index contains 5 columns (out of the 30 on this view), but a typical select using this view will want all 30 columns.
Doing some testing shows that the time it takes to query for the 5 columns is way faster than all 30 columns. Is this because that is just natural overhead regarding selecting on 6x as many columns, or because the indexed view is not storing the non-indexed columns in a temp table, and therefore needs to perform some extra steps to gather the missing columns (joins on base tables I guess?)
If the latter, what are some steps to prevent this? Well, even if the former... what are some ways around this!
Edit: for comparison purposes, a select on the indexed view with just the 5 columns is about 10x faster than the same query on the base tables. But a select on all columns is basically equivalent in speed to the query on the base tables.
A clustered index, by definition, contains every field in every row in the table. It basically is a recreation of the table, but with the physical data pages in order by the clustered index, with b-tree sorting to allow quick access to specified values of your clustered key(s).
Are you just pulling values or are you getting aggregate functions like MIN(), MAX(), AVG(), SUM() for the other 25 fields?
An indexed view is a copy of the data, stored (clustered) potentially (and normally) in a different way to the base table. For all purposes you now have two copies of the data.
SQL Server is smart enough to see that the view and table are aliases of each other for queries that involve only the columns in the indexed view. If the indexed view contains all columns, it is considered a full alias and can be used (substituted) by the optimizer wherever the table is queried. The indexed view can be used as just another index for the base table.
When you select only the 5 columns from tbl (which has an indexed view ivw), SQL Server completely ignores your table and just gives you data from ivw. Because the data pages are shorter (5 columns only), more records can be grabbed into memory in each page retrieval, so you get a 5x increase in speed.
When you select all 30 columns - there is no way for the indexed view to be helpful. The query completely ignores the view, and just selects data from the base table.
If you select data from all 30 columns, but the query filters on the first 4 columns of the indexed view, and the filter is very selective (will result in a very small subset of records), SQL Server can use the indexed view (scanning/seeking) to quickly generate a small result set, which it can then use to JOIN back to the base table to get the rest of the data.
However, similarly to regular indexes, an index on (a,b,c,d,e) or in this case clustered indexed view on (a,b,c,d,e) does NOT help a query that searches on (b,d,e) because they are not the first columns in the index.
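The leading-column rule can be illustrated with a small Python sketch. It is a deliberately simplified model (equality filters only, and it ignores the optimizer's option of scanning the whole index), not a description of how SQL Server actually chooses a plan; the function name is made up.

```python
def seekable_prefix(index_columns, filter_columns):
    """Return the leading index columns that a filter on filter_columns can seek on."""
    prefix = []
    for col in index_columns:
        if col not in filter_columns:
            break               # first gap ends the usable prefix
        prefix.append(col)
    return prefix


index = ["a", "b", "c", "d", "e"]

first_four = seekable_prefix(index, {"a", "b", "c", "d"})   # all four leading columns
skips_a = seekable_prefix(index, {"b", "d", "e"})           # nothing usable: 'a' is missing
gap_after_a = seekable_prefix(index, {"a", "c"})            # only 'a' can be used for the seek
```

In this model a filter on (b,d,e) yields an empty prefix, which matches the point above that such a query gets no seek benefit from an index on (a,b,c,d,e).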
