I have data that may need to be pushed to SQL CE in a third-party Windows Phone application, but I don't have anywhere to run a test, so I need to figure out whether we'll exceed the 4 GB maximum database size (many millions of records).
I know the sizes of the various data types, but are there additional overheads for indexes, row IDs, etc.? Also, this data will need to be synchronized/replicated, so I assume every row needs a GUID or the like as well?
Table1 (first 2 fields are clustered primary key)
nvarchar(20)
int
int
datetime
Table2 (First field is primary key)
int
int
datetime
Table3 (First two fields are clustered primary key)
int
int
int
I have access to SQL Server (not CE), but I'm an Oracle guy and don't know my way around it very well. Any help or insight is appreciated.
This will be a starting point: http://support.microsoft.com/kb/827968
I have command line tools to migrate from SQL Server to SQL Compact that will give you more precise results: http://exportsqlce.codeplex.com
Also, Merge replication adds columns and system tables to your database.
Luckily your tables are very narrow, so the 4 GB can be stretched to a ton of rows. Every row will need a GUID, you're correct. Look into NEWSEQUENTIALID(), which keeps the GUIDs in some sort of order and reduces some of the performance hindrances of GUIDs.
Do you currently have access to the data, or do you have a rough estimate of how many records you'll be storing? If you have the data, I'd create a clean DB, create your tables, and insert it. Index it to your liking and check the size. Indexes can take up quite a bit of space, but you shouldn't need much in the way of indexes on these narrow tables.
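If you go down the "build it in SQL Server and measure" route, a minimal sketch looks like this (column names are made up, since only the data types were posted, and a ROWGUIDCOL is included because merge replication needs one):

-- Scratch test in a regular SQL Server database; column names are
-- placeholders, only the data types come from the question.
CREATE TABLE dbo.Table1 (
    Code       nvarchar(20)     NOT NULL,
    Id1        int              NOT NULL,
    Id2        int              NOT NULL,
    EventDate  datetime         NOT NULL,
    rowguid    uniqueidentifier ROWGUIDCOL NOT NULL
               CONSTRAINT DF_Table1_rowguid DEFAULT NEWSEQUENTIALID(),
    CONSTRAINT PK_Table1 PRIMARY KEY CLUSTERED (Code, Id1)
);

-- Load a representative sample of rows, then check the footprint:
EXEC sp_spaceused 'dbo.Table1';
-- Extrapolate: (reserved KB / sample row count) * expected row count.

The numbers won't match SQL CE byte for byte, but they give you a sanity check against the 4 GB ceiling before the migration tools above refine it.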
Related
I have a large table consisting of 4 billion+ rows and 50 columns, most of which are either datetime or numeric, except for a few that are varchar.
Data will be inserted into the table on a weekly basis (about 20 million rows).
I expect queries with where clauses on some of the datetime columns and a couple of the varchar columns. There is no primary key in the table.
There are no indexes, nor the table is partitioned. I am using SQL Server 2016.
I understand that I need to partition or index the table, but I am not sure which approach to take or both in-fact.
Since the table is large, should I create the indexes first or should I create the partitions first? If I do create the indexes and then create the partitions, what should I do to maintain these with new data coming in weekly?
EDIT: Also, minimal updates and deletes are expected on the table
I understand that I need to partition or index the table
You need to understand what you gain from partitioning. It is not at all the case that SQL Server requires partitioning on big tables to function adequately. SQL Server scales to arbitrary table sizes without any inherent issues.
Common benefits of partitioning are:
Mass deletion in constant time
Different storage for older partitions
Not backing up old partitions
Sometimes in special situations (e.g. columnstore), partitioning can help as a strategy to speed up queries. Normally, indexing is better for that.
Essentially, partitioning splits the table physically into multiple sub-tables. Most often this has a negative effect on query plans. Indexes are perfectly capable of restricting the set of data that needs to be touched; partitions are worse for that.
Most of the queries will be filtering on the datetime columns and on some of the varchar columns. Like, get data for a certain date range for a certain entity. With the indexes, there will be a lot of fragmentation because of new inserts, and rebuilding/reorganising the indexes will also consume a lot of time. I can do it, but again I'm not sure which approach to take.
It seems you can best solve this by indexing:
Index according to the queries you expect.
Maintain the indexes properly. This is not too hard. For example, rebuild them after the weekly load.
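As a rough sketch (table and column names are assumptions, since the schema wasn't posted):

-- Index matched to a typical filter: a date range plus an entity column.
CREATE NONCLUSTERED INDEX IX_BigTable_EventDate_Entity
    ON dbo.BigTable (EventDate, EntityCode);

-- After the weekly load, rebuild to take care of fragmentation:
ALTER INDEX IX_BigTable_EventDate_Entity ON dbo.BigTable REBUILD;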
Since the table is large, should I create the indexes first or should I create the partitions first?
Set up the partitioning objects first. Then, create or rebuild the clustered index on the new partition scheme. If possible, drop the other indexes first and recreate them afterwards (this might not be workable due to availability restrictions).
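For illustration only, with made-up object names and boundary dates, the order of operations is roughly:

-- 1. Partitioning objects
CREATE PARTITION FUNCTION pfEventDate (datetime)
    AS RANGE RIGHT FOR VALUES ('2016-01-01', '2016-02-01', '2016-03-01');

CREATE PARTITION SCHEME psEventDate
    AS PARTITION pfEventDate ALL TO ([PRIMARY]);

-- 2. Rebuild the clustered index onto the partition scheme
--    (assumes a plain clustered index named CIX_BigTable already exists;
--    a clustered PK would need the constraint dropped and recreated instead).
CREATE CLUSTERED INDEX CIX_BigTable
    ON dbo.BigTable (EventDate)
    WITH (DROP_EXISTING = ON)
    ON psEventDate (EventDate);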
what should I do to maintain these with new data coming in weekly.
What concerns do you have? New data will be stored in the appropriate partitions automatically. Make sure to create new partitions before loading the data. Keep partitions ready 2 weeks in advance. The latest partition must always be empty to avoid costly splits.
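A sketch of the weekly housekeeping, reusing the hypothetical names from above:

-- Add an empty boundary ahead of the incoming data so the split is metadata-only.
ALTER PARTITION SCHEME psEventDate NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfEventDate() SPLIT RANGE ('2016-04-01');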
There is no primary key in the table.
Most often this is not a good design. Most tables should have a primary key and a clustered index. If there is no natural key, use an artificial one such as a bigint identity.
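If you add one, a minimal sketch (names assumed) could be:

-- Surrogate key on the existing heap; made nonclustered here so the
-- clustered index can stay on the partitioned date column from the earlier sketch.
ALTER TABLE dbo.BigTable ADD Id bigint IDENTITY(1,1) NOT NULL;
ALTER TABLE dbo.BigTable
    ADD CONSTRAINT PK_BigTable PRIMARY KEY NONCLUSTERED (Id);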
You definitely can apply partitioning, but my feeling is that it will not gain you what you may expect. It will force you to take on additional maintenance burdens, possibly reduce performance, and there is a risk of making mistakes that threaten availability. Simplicity is important.
My question is about performance on SQL server tables.
Assume I have a table that has many columns, for example 30 columns, with 1 column indexed. This table has approximately 30,000 rows.
If I perform a select that selects the indexed column, and one more, for example this:
SELECT IndexedColumn, column1
FROM table
Will this be slower than performing the same select on a table that only has 2 columns, and doing a SELECT * ...
So basically, will the existence of the extra columns slow down the select query even if I am not retrieving the data from the extra columns?
There will be a minor difference at the very end of the process, since you don't have to send the rest of the information to the end client (either SSMS or another app).
When performing a read based on the clustered index, all of the columns (excluding BLOBs) are stored in the same set of pages, so to read the data you have to access the same set of pages anyway.
You would see a performance increase if you had a nonclustered index on the column list you are after, as those columns are then saved in their own structure of data pages (so there would be less to read).
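For example, a covering index for that exact query could look like this (index name invented, column1 taken from the query):

-- The query can then be answered from the narrow index pages alone,
-- without touching the 30-column clustered index rows.
CREATE NONCLUSTERED INDEX IX_table_IndexedColumn_covering
    ON dbo.[table] (IndexedColumn)
    INCLUDE (column1);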
Assuming that you are using the default clustered index created by SQL Server when defining the primary key on the table in both scenarios, then no, there shouldn't be any performance difference between these two scenarios. Maybe worth just checking it out and generating an actual execution plan to see for yourself? -- Actually, I'm not sure the above is true: given this is rowstore, the first table won't be able to fit as many rows onto each page, so it will suffer more of an IO/disk overhead when reading data.
I manage a SQL Server 2008 R2 single instance server. I have one table that is my largest, as well as my most diversely used table. It is basically an event table that logs about 400k events a day, and holds 13 months of history.
For the sake of this solution, changing the design of this table or the data in it is not an option. Because this table is
huge (135 million records, 41 GB in size)
queried using a multitude of field combinations
queried both by tools using consistently structured queries, as well as ad hoc queries
important for queries to be relatively fast
managing indexes on this table has been a bear.
The table currently has 1 clustered index (PK on an int identity field) and 23 Nonclustered indexes. Total Index storage is 372 GB, 9x larger than the table itself. The table is updated once per day, then all other activity is "SELECT" statements. Most of the fields being used in WHERE clauses are varchar(50) fields, with a few datetime fields as well.
On the performance side, the table queries pretty quickly in nearly all situations, so no complaints there...
ASK:
I just wonder if there is a better way to index this table to make it more "generic" to support the multitude of ways it can be queried without taking up so much disk space... Thoughts? Looking for some high level theories or general best practices with a situation like this.
The best indexing, IMO, is usage based: run Profiler, capture the queries that are run against this table, and fine-tune for those queries.
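Alongside a Profiler trace, the index usage DMV gives a quick first pass at which of the 23 nonclustered indexes actually earn their keep (table name assumed; the counters reset when the instance restarts):

SELECT i.name,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
    ON  s.object_id = i.object_id
    AND s.index_id = i.index_id
    AND s.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID('dbo.EventLog')   -- your table name here
  AND i.type_desc = 'NONCLUSTERED'
ORDER BY ISNULL(s.user_seeks + s.user_scans + s.user_lookups, 0);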
If you're able to change the Partitioning or Clustered index strategy, this would give you a big boost.
Q: Why is there a PK on an Identity column on a table used for reporting purposes? Is it heavily used in JOINs? If not, is it merely for uniqueness?
You can check whether there are indexes you could combine into one index. The column order of INCLUDED columns is irrelevant. For example:
Index 1:
Key Columns (A, B, C, D, E) Includes (L, M, N)
Index 2:
Key Columns (A, B, C, D, E, F, G) Includes (N, M, L)
So you could drop Index 1. But you may do more I/O because Index 2 is larger.
On the other hand, you then don't have to keep two indexes in RAM and on disk/in backups.
It can also be that changing the order of the less selective index columns doesn't cost much more. As you may know, the key columns of an index should be ordered from the most selective first down to less and less selective columns.
Do you want the same indexing strategy for current data as for older data? You could use filtered indexes: fewer indexes for older data and a more flexible index strategy for newer data. Older data may not need to be queried as quickly as it is today, and how often is it queried in comparison to new data?
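A hypothetical filtered index along those lines (column names and the cutoff date are assumptions, and the literal date has to be moved forward manually over time):

-- Index only the most recent months of events; older history stays unindexed
-- on this column combination.
CREATE NONCLUSTERED INDEX IX_EventLog_Recent
    ON dbo.EventLog (SourceSystem, EventDate)
    WHERE EventDate >= '20120101';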
My database has one very large table with over 2 billion rows with 3 columns.
Id(uniqueidentity), Type(int, between 0-10. 0 = most used. 10 = least used), Data(Binary data between 1-10MB)
What are some ways I can optimize this database? (primarily select queries)
*Note: I might add a few more columns to this table later (eg: location, date...)
Assuming that the id column is the clustered index key, and assuming that by uniqueidentity you mean uniqueidentifier:
do you need the uniqueidentifier type? Why?
What other alternatives have you considered?
Do you populate the data using sequential GUIDs or not?
GUIDs are a notoriously poor choice for clustered keys. See GUIDs as PRIMARY KEYs and/or the clustering key for a more detailed discussion:
But, a GUID that is not sequential - like one that has its values generated in the client (using .NET) OR generated by the newid() function (in SQL Server) - can be a horribly bad choice, primarily because of the fragmentation that it creates in the base table but also because of its size. It's unnecessarily wide (it's 4 times wider than an int-based identity - which can give you 2 billion (really, 4 billion) unique rows). And, if you need more than 2 billion you can always go with a bigint (8-byte int) and get 2^63-1 rows.
Also read Disk space is cheap...That's not the point! as a follow up.
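To make that concrete, here is a minimal sketch of the two alternatives (column names guessed from the question):

-- Option 1: keep the GUID but generate it sequentially on the server,
-- so the clustered index doesn't fragment on every insert.
CREATE TABLE dbo.DataStore (
    Id   uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID(),
    Type int              NOT NULL,
    Data varbinary(max)   NOT NULL,
    CONSTRAINT PK_DataStore PRIMARY KEY CLUSTERED (Id)
);

-- Option 2: a narrow surrogate key as the clustered key instead, e.g.
-- Id bigint IDENTITY(1,1) NOT NULL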
Other than this, you need to do your homework and post the required details for such a question: exact table and index definitions, prevalent data access patterns (by key, by range, filters, sort order, joins, etc.).
Have you done any work to identify problems so far? If not, start with Waits and Queues, a proven methodology to identify performance bottlenecks. Once you measure and find places that need improvement, we can advise how to improve.
Add an index (or indexes). Decide which column(s) would make the most appropriate clustered index.
Decide if storing 10MB of binary data in each (otherwise small) row is a good use of a database
[Updated in response to Remus's comment]
The database I'm working with is currently over 100 GiB and promises to grow much larger over the next year or so. I'm trying to design a partitioning scheme that will work with my dataset but thus far have failed miserably. My problem is that queries against this database will typically test the values of multiple columns in this one large table, ending up in result sets that overlap in an unpredictable fashion.
Everyone (the DBAs I'm working with) warns against having tables over a certain size, and I've researched and evaluated the solutions I've come across, but they all seem to rely on a data characteristic that allows for logical table partitioning. Unfortunately, I do not see a way to achieve that given the structure of my tables.
Here's the structure of our two main tables to put this into perspective.
Table: Case
Columns:
Year
Type
Status
UniqueIdentifier
PrimaryKey
etc.
Table: Case_Participant
Columns:
Case.PrimaryKey
LastName
FirstName
SSN
DLN
OtherUniqueIdentifiers
Note that any of the columns above can be used as query parameters.
Rather than guess, measure. Collect statistics of usage (queries run), look at the engine's own statistics like sys.dm_db_index_usage_stats, and then make an informed decision: the partitioning that best balances data size and gives the best affinity for the most often run queries will be a good candidate. Of course you'll have to compromise.
Also don't forget that partitioning is per index (where 'table' = one of the indexes), not per table, so the question is not what to partition on, but which indexes to partition or not and what partitioning function to use. Your clustered indexes on the two tables are going to be the most likely candidates obviously (not much sense to partition just a non-clustered index and not partition the clustered one) so, unless you're considering redesign of your clustered keys, the question is really what partitioning function to choose for your clustered indexes.
If I were to venture a guess, I'd say that for any data that accumulates over time (like 'cases' with a 'year'), the most natural partition is the sliding window.
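A sliding-window setup on Case.Year might start out like this (names and boundary values are invented):

CREATE PARTITION FUNCTION pfCaseYear (int)
    AS RANGE RIGHT FOR VALUES (2007, 2008, 2009, 2010);

CREATE PARTITION SCHEME psCaseYear
    AS PARTITION pfCaseYear ALL TO ([PRIMARY]);

-- The Case table's clustered index would then be created or rebuilt
-- ON psCaseYear (Year), and a new boundary split in ahead of each year.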
If you have no other choice, you can partition by key modulo the number of partition tables.
Let's say that you want to partition into 10 tables.
You will define tables:
Case00
Case01
...
Case09
And partition your data by UniqueIdentifier or PrimaryKey modulo 10, placing each record in the corresponding table (depending on how your UniqueIdentifier values are generated, you might need to start allocating IDs manually).
When performing a query, you will need to run the same query on all tables and use UNION to merge the result sets into a single query result.
It's not as good as partitioning the tables based on some logical separation that corresponds to the expected queries, but it's better than hitting the size limit of a table.
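A sketch of the query side, assuming Case00..Case09 all share one schema:

-- One view over the manually split tables so callers keep a single name.
CREATE VIEW dbo.CaseAll
AS
SELECT * FROM dbo.Case00
UNION ALL
SELECT * FROM dbo.Case01
-- ... Case02 through Case08 ...
UNION ALL
SELECT * FROM dbo.Case09;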
Another possible thing to look at (before partitioning) is your model.
Are you using a normalized database? Are there further steps that could improve performance through different choices in normalization, denormalization, or partial normalization? Are there options to transform the data into a Kimball-style dimensional star model, which is optimal for reporting/querying?
If you aren't going to drop partitions of the table (sliding window, as mentioned) or treat different partitions differently (you say any columns can be used in the query), I'm not sure what you are trying to get out of the partitioning that you won't already get out of your indexing strategy.
I'm not aware of any table limits on rows. AFAIK, the number of rows is limited only by available storage.