I am new to Vertica and I have some questions about flex tables:
How many rows can a flex table manage? Is it a good idea to handle up to half a billion rows with a flex table?
Can I delete certain rows from a big flex table without any performance problems? Do I need to run something like MySQL's OPTIMIZE TABLE afterwards?
For running SQL queries and CSV exports, is it better to work from the flex table or from the associated view?
I have yet to find a limit on the number of rows you can put into a flex table, so the idea is good.
I would personally not use a flex table for DML operations. As soon as you want to keep the table for reporting and further processing, I would make it a standard table: If the view you created is foo_view, run: CREATE TABLE foo_fix AS SELECT * FROM foo_view;. That's the best possible optimisation you can get. Vertica flex tables are a perfect means to discover what is in an unknown data source. But for serious querying, they are orders of magnitude slower than standard Vertica ROS tables.
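As an illustration of that step, a minimal Vertica sketch (the flex table name foo, and the use of COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW to build foo_view, are assumptions based on the example above):

    -- Discover the keys in the flex table and (re)build its view.
    -- 'foo' is the assumed flex table behind foo_view.
    SELECT COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW('foo');

    -- Materialise the discovered structure into a standard ROS table for fast querying.
    CREATE TABLE foo_fix AS SELECT * FROM foo_view;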
Selecting from flex table directly (knowing all keys of the flex table) is roughly the same as selecting from the view generated on it.
I have a table in a production server with 350 million rows and approximately 25 GB in size. It has a single clustered identity index.
The queries targeting this table need some missing indexes for better performance.
I need to delete unnecessary data (approx. 200 million rows) and then create two non-clustered indexes.
However, I have some concerns:
I need to avoid increasing the log too much
Keep the database downtime as low as possible.
Keep the identity (primary key) the same in the remaining data.
I would like to hear your opinion on the best solution to adopt.
The following is a guideline on how you might do this:
Suspend insert/update operations or start logging them explicitly (this might result in degraded performance).
Select the records to keep into a new table.
Then you have two options. If this is the only table in your universe:
Build the indexes on the new table.
Stop the system.
Rename the existing table to something else.
Rename the new table to the real table name.
Turn the system back on.
If there are other tables (such as foreign key relationships):
Truncate the existing table
Insert the data into the existing table
Build the secondary indexes
Turn the system back on
Depending on your user requirements, one of the above variations is likely to work for your problem.
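As an illustration, here is a minimal T-SQL sketch of the first variation; the table and column names (dbo.BigTable, CreatedDate, Id, Col1, Col2) are hypothetical:

    -- 1. Copy the rows you want to keep. SELECT INTO is minimally logged under the
    --    SIMPLE or BULK_LOGGED recovery model, which keeps the log small, and the
    --    existing identity values come across unchanged.
    SELECT *
    INTO   dbo.BigTable_new
    FROM   dbo.BigTable
    WHERE  CreatedDate >= '2014-01-01';   -- whatever identifies the rows to keep

    -- 2. Rebuild the clustered primary key and the two new non-clustered indexes.
    ALTER TABLE dbo.BigTable_new ADD CONSTRAINT PK_BigTable_new PRIMARY KEY CLUSTERED (Id);
    CREATE NONCLUSTERED INDEX IX_BigTable_new_Col1 ON dbo.BigTable_new (Col1);
    CREATE NONCLUSTERED INDEX IX_BigTable_new_Col2 ON dbo.BigTable_new (Col2);

    -- 3. Swap the tables during the short downtime window.
    BEGIN TRAN;
    EXEC sp_rename 'dbo.BigTable',     'BigTable_old';
    EXEC sp_rename 'dbo.BigTable_new', 'BigTable';
    COMMIT;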
Note that there are other more internally intensive techniques. For instance, create a replicated database and once that is working, you have two systems and can do the clean-up work on one at a time (a method such as this would be the preferred method for a system with near 100% uptime requirements). Or create a separate table that is just right and swap the table spaces.
We're a manufacturing company, and we've hired a couple of data scientists to look for patterns and correlation in our manufacturing data. We want to give them a copy of our reporting database (SQL 2014), but it must be in a 'sanitized' form. This means that all table names get converted to 'Table1', 'Table2' etc., and column names in each table become 'Column1', 'Column2' etc. There will be roughly 100 tables, some having 30+ columns, and some tables have 2B+ rows.
I know there is a hard way to do this. This would be to manually create each table, with the sanitized table name and column names, and then use something like SSIS to bulk insert the rows from one table to another. This would be rather time consuming and tedious because of the manual SSIS column mapping required, and manual setup of each table.
I'm hoping someone has done something like this before and has a much faster, more efficient way.
By the way, the 'sanitized' database will have no indexes or foreign keys. Also, it may not seem to make any sense why we would want to do this, but this is what was agreed to by our Director of Manufacturing and the data scientists as the first round of an analysis that will involve many rounds.
You basically want to scrub the data and objects, correct? Here is what I would do.
Restore a backup of the db.
Drop all objects not needed (indexes, constraints, stored procedures, views, functions, triggers, etc.)
Create a mapping table with two columns and populate it; each row holds the original table name and the new table name.
Write a script that iterates through that table, row by row, and renames your tables. Better yet, put the data into Excel and add a third column that builds the T-SQL you want to run, then cut/paste and execute it in SSMS.
Repeat step 4, but for all columns (see the sketch after this list). Best to query sys.columns to get all the objects you need, put them into Excel, and build your T-SQL.
Repeat again for any other objects needed.
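For illustration, a rough T-SQL sketch of steps 4 and 5 done directly against the system catalog instead of Excel; the mapping table dbo.TableMap and its columns are assumptions:

    -- dbo.TableMap(OrigName, NewName) is the mapping table from step 3.

    -- Generate the table renames; review the output, then execute it.
    SELECT 'EXEC sp_rename ''' + OrigName + ''', ''' + NewName + ''';'
    FROM   dbo.TableMap;

    -- Generate the column renames (Column1, Column2, ... per table).
    -- Assumes the tables have already been renamed to their new names.
    SELECT 'EXEC sp_rename ''' + t.NewName + '.' + c.name + ''', ''Column'
           + CAST(ROW_NUMBER() OVER (PARTITION BY t.NewName ORDER BY c.column_id) AS varchar(10))
           + ''', ''COLUMN'';'
    FROM   dbo.TableMap AS t
    JOIN   sys.columns  AS c ON c.object_id = OBJECT_ID(t.NewName);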
Backup/restore will be quicker than dabbling in SSIS and data transfer.
They can see the data but they can't see the column names? What can that possibly accomplish? What are you protecting by not revealing the table or column names? How is a data scientist supposed to evaluate data without context? Without an FK, all I see is a bunch of numbers in a column named colx. What are you expecting to accomplish? Get a confidentiality agreement. Consider FK columns: a customerID versus a materialID. The patterns have widely different meanings and analyses. I would correlate a quality measure with materialID or shiftID, but not with a customerID.
Oh look, there is a correlation between tableA.colB and tableX.colY. Well yes, that customer is a college team and they use aluminum bats.
On top of that you strip indexes (on tables with 2B+ rows) so the analysis they run will be slow. What does that accomplish?
As for the question as stated: do a backup/restore. Using the system tables, drop all triggers, FKs, indexes, and constraints. Don't forget to drop the triggers and constraints - they may disclose some trade secrets. Then rename the columns, and then the tables.
In the full text search page http://msdn.microsoft.com/en-us/library/ms189760.aspx on MSDN it says that if you want to do a full text search on multiple tables just "use a joined table in your FROM clause to search on a result set that is the product of two or more tables."
My question is, isn't this going to be really slow if you have to merge two very large tables?
If I'm merging a product table with a category table and there are millions of records, won't the join take a long time and then have to search after the join?
Joins on millions of records can still be fast if the join is optimized for performance, for example, a single int column that is indexed in both tables. But there can be other factors at play so the best approach is to try it and gauge the performance yourself.
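For instance, a hypothetical query of that shape (table, column, and key names are assumptions, with full-text indexes already created on the searched columns):

    -- Product and Category joined on an indexed int key, with full-text indexes
    -- on Product.Description and Category.Name (all names hypothetical).
    SELECT p.ProductID, p.Name, c.Name AS CategoryName
    FROM   dbo.Product  AS p
    JOIN   dbo.Category AS c ON c.CategoryID = p.CategoryID   -- int, indexed in both tables
    WHERE  CONTAINS(p.Description, 'aluminum')
       OR  CONTAINS(c.Name, 'aluminum');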
If the join doesn't perform well, you have a couple of options:
Create a view of the tables joined together, create a full text index on that view, and run your full text queries against that view.
Create a 3rd table which is a combination of the 2 tables you are joining, create a full text index on it, and run your full text queries against that table. You'll need something like an ETL process to keep it updated.
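A rough sketch of that second option, using hypothetical table and catalog names (dbo.ProductSearch, ftCatalog):

    -- Combined search table populated from the two joined tables.
    CREATE TABLE dbo.ProductSearch (
        ProductID    int           NOT NULL CONSTRAINT PK_ProductSearch PRIMARY KEY,
        ProductName  nvarchar(200) NOT NULL,
        CategoryName nvarchar(200) NOT NULL
    );

    INSERT INTO dbo.ProductSearch (ProductID, ProductName, CategoryName)
    SELECT p.ProductID, p.Name, c.Name
    FROM   dbo.Product  AS p
    JOIN   dbo.Category AS c ON c.CategoryID = p.CategoryID;

    -- Full-text index keyed on the primary key; 'ftCatalog' is an existing catalog.
    CREATE FULLTEXT INDEX ON dbo.ProductSearch (ProductName, CategoryName)
        KEY INDEX PK_ProductSearch ON ftCatalog;

    SELECT ProductID
    FROM   dbo.ProductSearch
    WHERE  CONTAINS((ProductName, CategoryName), 'aluminum');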
My website has a requirement that the user can search a number of different tables and columns. So I'm working to implement this using full-text search.
I'd like to get some input from someone with more FTS experience on the following issues.
While FTS allows you to search multiple columns from the same table in a single search, I'm not seeing an option to search multiple columns from multiple tables in a single search. Is this in fact the case?
If I need multiple searches to search across multiple tables, does it make sense to put the index for each table in a different full-text catalog? The wizards seem to recommend a new catalog for larger tables, but I have no idea what "large" means in this case.
Finally, is there any way to order the results such that matches in one column of a table come before matches in another column?
1. While FTS allows you to search multiple columns from the same table in a single search, I'm not seeing an option to search multiple columns from multiple tables in a single search. Is this in fact the case?
A FTIndex on a single table cannot include columns from another table. So typically, you'd just have to write your query so that its making multiple searches (you alluded to this in #2).
Another option, would be to create an Indexed View (see requirements) that spans multiple tables and then build a FTIndex on top of the view. I believe this is possible, but you should test for certainty.
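As a rough sketch of that approach, assuming two hypothetical tables dbo.Product and dbo.Category and an existing full-text catalog ftCatalog (check the indexed-view requirements such as SCHEMABINDING before relying on this):

    -- Schema-bound view spanning the two tables.
    CREATE VIEW dbo.vProductCategory
    WITH SCHEMABINDING
    AS
    SELECT p.ProductID, p.Name AS ProductName, c.Name AS CategoryName
    FROM   dbo.Product  AS p
    JOIN   dbo.Category AS c ON c.CategoryID = p.CategoryID;
    GO

    -- The unique clustered index materialises the view and serves as the
    -- KEY INDEX that a full-text index requires.
    CREATE UNIQUE CLUSTERED INDEX IX_vProductCategory ON dbo.vProductCategory (ProductID);

    CREATE FULLTEXT INDEX ON dbo.vProductCategory (ProductName, CategoryName)
        KEY INDEX IX_vProductCategory ON ftCatalog;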
2. If I need multiple searches to search across multiple tables, does it make sense to put the index for each table in a different full-text catalog? The wizards seem to recommend a new catalog for larger tables, but I have no idea what "large" means in this case.
It shouldn't make a difference in SQL 2008, since the catalog is just a logical grouping. You might, however, consider putting the FTIndexes on different filegroups if you have a disk sub-system where that makes sense (similar considerations to partitioning tables across filegroups on different disks...to spread the IO).
3. Finally, is there any way to order the results such that matches in one column of a table come before matches in another column?
I don't believe this is possible...
I'm looking for some advice for a table structure in sql.
Basically I will have a table with about 30 columns of strings, ints and decimals. A service will be writing to this table about 500 times a day. Each record in the table can either be 'inactive' or 'active'. This table will constantly grow and at any one time there will be about 100 'active' records that need to be returned.
While the table is small the performance to return the 'active' records is responsive. My concern comes 12-18 months down the line when the table is much larger or even later when there will be millions of records in the table.
Is it better, from a performance point of view, to maintain two tables (one for 'active' records and one for 'inactive' records), or will creating an index on the active column solve any potential performance issues?
It certainly will be more performant to have a small "active" table. The most obvious cost is that maintaining the records correctly is more troublesome than with one table. I would probably not do so immediately, but bear it in mind as a potential optimisation.
An index on the active column is going to massively improve matters. Even better would be a multi-column index (or indices) appropriate for the query (or queries) used most often. For example, if you often ask for active rows created after a certain date, then an index on both date and active could serve that retrieval from a single index. Likewise, if you wanted all active rows ordered by id, then an index on both id and active could be used.
Testing with Database Engine Tuning Advisor can be very informative here, though it is not as good at predicting the best approach for data you expect to change in the months to come - as you do here.
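As a sketch with hypothetical table and column names, such a composite index might look like:

    -- Composite index on the flag and the date used in the common query.
    CREATE NONCLUSTERED INDEX IX_Records_Active_CreatedDate
        ON dbo.Records (Active, CreatedDate);

    -- A typical query it would serve:
    SELECT *
    FROM   dbo.Records
    WHERE  Active = 1
      AND  CreatedDate >= '2014-01-01';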
An indexed view may well be your best approach, as that way you can create the closest thing to a partial index that is available in SQLServer 2005 (which your tags suggest you are using). See http://technet.microsoft.com/en-us/library/cc917715.aspx#XSLTsection124121120120 This will create an index based on your general search/join/order criteria, but only on the relevant rows (ignoring the others entirely).
Better still, if you can use SQLServer 2008, then use a filtered index (what Microsoft have decided to call partial indices). See http://technet.microsoft.com/en-us/library/cc280372.aspx for more on them.
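A sketch of such a filtered index, again with hypothetical names:

    -- Only the ~100 active rows are kept in the index, however large the table grows.
    CREATE NONCLUSTERED INDEX IX_Records_Active
        ON dbo.Records (CreatedDate)
        INCLUDE (Col1, Col2)
        WHERE Active = 1;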
If you'd tagged with 2008 rather than 2005 I'd definitely be suggesting filtered indices, as is I'd probably go for the indexed view, but might just go for the multi-column index.
Index the active field and rebuild the index each weekend and you will be good for ages if it's really only 500 records a day.
365 days times 500 is 182500 and you wrote
millions of records in the table
but with only 500 a day that would take eleven years.
An index is probably the way to go for performance on a table like that.
You can also consider moving data that you are sure you won't use, except in certain specific reports, into another table.