Huge database table design issue [closed] - sql-server

In my DB design I've encountered an issue. My app implements a workflow on a specific kind of media, which has 6 stages, as follows:
Resources
Ingest
Review
VideoRepair
Listing
Backup
In all the stages the type of data being added (or updated) in the table is the same; only the column names change. For example, in Ingest we have the following columns:
CaptureSup_Name, Assign_DateByCaptureSup, AssignedCaptureOp_Name,
LastCapture_Date, LastCaptureOp_Name, LastCapture_Date,
In Review we have exactly the same columns, except that the Ingest-specific part of each name is replaced with Review, and almost the same holds for the rest of the stages (give or take one or two columns). For the sake of better performance on SELECT queries, I decided not to assign one table to each stage (traditional normalization) and instead wrapped them all into one unified table.
Now my table has 30 columns, and that number scares me because I have never designed such a big table. Which of the following scenarios is most suitable for my case, considering that my database is supposed to handle a large amount of data (about 1,500 records added daily) and speed is vital?
Follow the traditional normalization approach and break my big table into 6 or 7 tables, each with about 5 or 6 columns, so that I have to write (n-1) joins to retrieve the complete data of a cycle (see the sketch after this list).
Keep my current design (one table with 30 columns) and find a way to reduce the size of the logs, which would grow much larger in this case (because of the updates).
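
To make the first option concrete, the (n-1) joins would look roughly like this. This is only a sketch; the per-stage table and column names (Media, Ingest, Review, VideoRepair, MediaId, LastReviewOp_Name) are invented for illustration and are not taken from the actual schema.

    SELECT m.MediaId,
           i.AssignedCaptureOp_Name,
           i.LastCapture_Date,
           r.LastReviewOp_Name
    FROM   Media       AS m
    JOIN   Ingest      AS i ON i.MediaId = m.MediaId
    JOIN   Review      AS r ON r.MediaId = m.MediaId
    JOIN   VideoRepair AS v ON v.MediaId = m.MediaId;
    -- ...one more join per remaining stage (Listing, Backup)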

Thirty columns is not a wide table. SQL Server supports up to 1,024 columns in an ordinary table.
1,500 records per day is not a large volume of inserts. That is only about 500 thousand rows per year. SQL Server can handle billions of rows.
If you are having an issue with logs, then you have several options (sketched after this list), such as:
switching to the simple recovery model;
periodically backing up the database and truncating the log; and,
using database checkpoints.
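
As a rough T-SQL illustration of these options (the database name and backup path are placeholders, not taken from the question):

    -- Option 1: switch to the SIMPLE recovery model so the log truncates at checkpoints.
    ALTER DATABASE MyWorkflowDb SET RECOVERY SIMPLE;

    -- Option 2: stay in FULL recovery and back up the log regularly so its space can be reused.
    BACKUP LOG MyWorkflowDb TO DISK = N'D:\Backups\MyWorkflowDb_log.trn';

    -- A manual checkpoint (mainly relevant under the SIMPLE recovery model).
    CHECKPOINT;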
You should break your table up into other tables, if that makes sense in terms of the relational model -- this is likely, because 30 columns in a table would often be combinations of other entities. But your data structure is easily in the realm of what databases readily support. And, it can grow much, much larger with no problems (assuming none of your columns are really humongous).

Related

How does sortkey in Redshift work internally? [closed]

I'm a beginner to Redshift and Data Warehouses in general.
When a numeric or timestamp column is specified as sortkey, does the Redshift DBMS use binary search during a query to find the desired row as efficiently as possible?
I feel that knowing more about this would improve my table design skill.
Amazon Redshift is a columnar datastore, which means that each column is stored separately. This is great for wide tables because Redshift only needs to read in the columns that are specifically used in the query. The most time-consuming part of database queries is disk access, so anything that reduces/avoids disk access is a good thing.
When data is stored on disk, it is stored in 1MB disk blocks. Each column can consume multiple blocks, and each block only contains data relating to one column. Redshift keeps a Zone Map of each block, which stores the minimum and maximum values stored in the block. For example, if a query is searching for data from 2021 and Redshift knows that the timestamp column for a particular block has a maximum value in 2018, it does not need to read the block from disk to examine the contents. This greatly reduces query time.
Data is stored in the blocks based upon the selected Compression Encoding. These are very clever techniques for reducing the storage space for data. For example, if a column contains a list of Countries and the rows are sorted in alphabetical order by country, then Redshift could simply store the fact that the block contains Jamaica x 63, then Japan x 104, then Jordan x 26. This might only require 24 bytes to store 193 rows of data, and don't forget that each block is 1MB in size. Thus, compression reduces the amount of disk access required to retrieve data, again making queries faster.
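
As a rough illustration of column-level encoding (the table and column names are invented, and Redshift can also choose encodings automatically when data is loaded):

    CREATE TABLE customer_location (
        country VARCHAR(64)  ENCODE runlength,  -- long runs of repeated, sorted values compress well
        city    VARCHAR(128) ENCODE lzo
    )
    SORTKEY (country);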
To answer your question about how Redshift would find the desired rows:
If the SORTKEY is used in the WHERE clause, then Redshift can quickly find the relevant blocks that potentially contain the desired data. I'm not sure whether it does that with a binary search.
If the WHERE clause does not use the SORTKEY, finding the right rows is less efficient, because the matching rows are not sorted together and can be spread across many blocks on disk.
Redshift can still 'skip over' blocks whose Zone Maps show they contain no matching data for the filtered columns, avoiding the need to read those blocks from disk. Compression on the various columns can also reduce the number of blocks that need to be read.
The general rules for Amazon Redshift are (illustrated in the sketch below):
Set the DISTKEY to the column that is most frequently used in JOINs
Set the SORTKEY to the column that is most frequently used in WHERE clauses
See: Tuning query performance - Amazon Redshift
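
For illustration only (the table and column names are made up, not part of the question), a table following these rules might be declared like this:

    -- DISTKEY on the column most often used in JOINs,
    -- SORTKEY on the column most often used in WHERE filters.
    CREATE TABLE sales (
        sale_id     BIGINT,
        customer_id BIGINT,
        sale_ts     TIMESTAMP,
        amount      DECIMAL(12,2)
    )
    DISTKEY (customer_id)
    SORTKEY (sale_ts);

    -- A range filter on the SORTKEY lets Redshift skip blocks via the Zone Maps:
    SELECT SUM(amount)
    FROM   sales
    WHERE  sale_ts >= '2021-01-01' AND sale_ts < '2022-01-01';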

Structuring database for financial data [closed]

I have a lot of stock price data saved in CSV files that I've been collecting for a while and intend to keep collecting, but now into a DB instead of CSVs.
There are 73 files (a file for each asset), each with around 2 million rows. The data is formatted the same way in all of them:
date, timestamp, open, high, low, close, volume
I want to create an individual table for each of the CSV files because:
For the uses I have in mind, I won't need more than one asset at once.
I know 140 million lines isn't a heavy load for an RDBMS, but I think searching a table of 2M records instead of 140M would perform better.
Separating by asset, I can put a unique constraint on a column (like date or timestamp) and prevent duplicate records.
Are any of those points a wrong assumption or bad practice? Is there a compelling reason to save them all in a single table?
I've read this question; although it is a similar problem, I don't think the answer applies to my case.
In case it wasn't clear, I don't have much experience with DBs, so guidance and educational answers are heavily appreciated.
I would store them in a single table, just because I wouldn't have to maintain 73 tables.
If you update your data on a daily, weekly, or even monthly basis, you would have to insert into 73 tables from 73 CSV files, or maintain an automated script for that purpose, which I think is a bit much for this.
For the uses I have in mind, I won't need more than one asset at once. -> I don't understand this.
Separating by asset, I can put a unique constraint on a column (like date or timestamp) and prevent duplicate records. -> If you store them in a single table, you could add an asset_id column, which identifies which asset each date/timestamp row belongs to.
I know 140 million lines isn't a heavy load for an RDBMS, but I think searching a table of 2M records instead of 140M would perform better. -> You could partition your table on date and asset_id. That is a much broader discussion, but with the details you've given, this is what I would do (see the sketch below).
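
A sketch of the single-table approach. The column names follow the CSV layout, but the exact types and partitioning syntax depend on which RDBMS you pick; PostgreSQL is assumed here, and the asset_id values are placeholders.

    CREATE TABLE price_bar (
        asset_id INTEGER   NOT NULL,
        ts       TIMESTAMP NOT NULL,  -- combines the CSV date + timestamp
        open     NUMERIC(18,6),
        high     NUMERIC(18,6),
        low      NUMERIC(18,6),
        close    NUMERIC(18,6),
        volume   BIGINT,
        PRIMARY KEY (asset_id, ts)    -- prevents duplicate records per asset
    ) PARTITION BY LIST (asset_id);

    -- One partition per asset keeps scans close to the 2M-row case you describe:
    CREATE TABLE price_bar_asset_1 PARTITION OF price_bar FOR VALUES IN (1);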

SQL Server - Inserting new data worsens query performance

We have a 4-5 TB SQL Server database. The largest table is around 800 GB and contains 100 million rows; 4-5 other comparable tables are 1/3-2/3 of this size. We went through a process of creating new indexes to optimize performance. While the performance certainly improved, we saw that the newly inserted data was the slowest to query.
It's a financial reporting application with a BI tool working on top of the database. The data is loaded overnight, continuing into the late morning, though the majority of the data is loaded by 7am. Users start to query data around 8am through the BI tool and are most concerned with the latest (daily) data.
I wanted to know whether newly inserted data causes indexes to go out of order. Is there anything we can do to get better performance on the newly inserted data than on the old data? I hope I have explained the issue well here. Let me know in case of any missing information. Thanks.
Edit 1
Let me describe the architecture a bit.
I have a base table (let's call it Base) with (Date, Id) as the clustered index.
It has around 50 columns.
Then we have 5 derived tables (Derived1, Derived2, ...), one per metric type, which also have (Date, Id) as the clustered index and a foreign key constraint on the Base table.
Tables Derived1 and Derived2 have 350+ columns; Derived3, 4 and 5 have around 100-200 columns. There is one large view created to join all the data tables, due to limitations of the BI tool. Date and Id are the joining columns for all the tables that form the view (hence I created the clustered index on those columns). The main concern is the BI tool's performance. The BI tool always uses the view and generally sends similar queries to the server.
There are other indexes as well on other filtering columns.
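For context, the view is roughly of this shape; the metric column names below are placeholders, not the real ones.

    CREATE VIEW dbo.vw_AllMetrics AS
    SELECT b.[Date], b.Id,
           d1.Metric1, d1.Metric2,   -- a handful of Derived1's 350+ columns
           d2.Metric3                -- ...and so on for the other derived tables
    FROM   dbo.Base     AS b
    JOIN   dbo.Derived1 AS d1 ON d1.[Date] = b.[Date] AND d1.Id = b.Id
    JOIN   dbo.Derived2 AS d2 ON d2.[Date] = b.[Date] AND d2.Id = b.Id;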
The main question remains - how to prevent performance from deteriorating.
In addition, I would like to know:
Would an NCI (nonclustered index) on (Date, Id) on all tables be a better bet, in addition to the clustered index on (Date, Id)?
Does it make sense to have 150 columns as included columns in the NCI for the derived tables?
You have about 100 million rows, growing every day with new portions, and those new portions are the ones usually selected. With those numbers I would use partitioned indexes, not regular indexes.
Your solution within SQL Server would be partitioning. Take a look at SQL Server partitioning and see if you can adopt it. Partitioning is a form of clustering where groups of data share a physical block. If you partition by year and month, for example, all 2018-09 records share the same physical space and are easy to find. So if you select records with those filters (plus more), it is as if the table were only the size of the 2018-09 records. That is not exactly accurate, but it is quite close. Be careful with the data values you partition on: unlike standard PK clusters, where each value is unique, the partitioning column(s) should produce a manageable set of distinct combinations, i.e. partitions. A sketch follows below.
If you cannot use partitioning, you have to create 'partitions' yourself using regular indexes. This will require some experimentation. The basic idea is a column (a number, say) indicating a wave, or set of waves, of imported data: data imported today and over the next 10 days is wave '1', the next 10 days is wave '2', and so on. By filtering on the latest 10 waves, you work on only the latest 100 days of imports and effectively skip all the rest of the data. Roughly, if you divided your existing 100 million rows into 100 waves, started at wave 101, and searched for waves 90 and greater, you would have only about 10 million rows to search, provided SQL Server is persuaded to use the new index first (it will be, eventually).
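
A minimal sketch of date-based partitioning in SQL Server; the object names and boundary values are placeholders, and CIX_Base_Date_Id is assumed to be the name of the existing clustered index (adjust to your own).

    -- Monthly partition function on the Date column (boundary values are examples only).
    CREATE PARTITION FUNCTION pf_ByMonth (date)
        AS RANGE RIGHT FOR VALUES ('2018-08-01', '2018-09-01', '2018-10-01');

    -- Map every partition to a filegroup (all to PRIMARY here for simplicity).
    CREATE PARTITION SCHEME ps_ByMonth
        AS PARTITION pf_ByMonth ALL TO ([PRIMARY]);

    -- Rebuilding the clustered index on the scheme partitions the table itself.
    CREATE CLUSTERED INDEX CIX_Base_Date_Id
        ON dbo.Base ([Date], Id)
        WITH (DROP_EXISTING = ON)
        ON ps_ByMonth ([Date]);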
This is a broad question, especially without knowing your system. But one thing I would try is manually updating the statistics on the indexes/tables once you are done loading the data. With tables that big, it is unlikely that you will modify enough rows to trigger an auto-update, and without fresh statistics SQL Server won't have an accurate histogram of your data.
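For example, after the nightly load finishes you might run something like the following (table names are taken from the question's description; FULLSCAN is the most accurate but slowest option, and a SAMPLE percentage also works):

    UPDATE STATISTICS dbo.Base     WITH FULLSCAN;
    UPDATE STATISTICS dbo.Derived1 WITH FULLSCAN;
    UPDATE STATISTICS dbo.Derived2 WITH FULLSCAN;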
Next, dive into your execution plans and see what operators are the most expensive.

Will enabling row movement on a list-partitioned table cause performance problems? (Oracle 11g R2) [closed]

Our incident management system has run for years, and the large quantity of closed incident tickets makes the table rather huge and slows down the search queries.
Our database is Oracle 11g R2 Enterprise Edition (with the Partitioning option). The two large tables concerned are incident and incident_area_info (one incident corresponds to multiple incident_area_info rows, joined on incident_id). So I want to use the following strategy:
Split the incident table into two partitions, closed_incidents and active_incidents, using list partitioning with status as the partition key. I also manually enabled row movement on the incident and incident_area_info tables, so that an incident can be closed (moved between partitions).
Split incident_area_info using reference partitioning.
Drop the original indexes and replace them with corresponding partitioned local indexes.
Search open incidents only by default.
I have applied this strategy in my development environment, and the search operation's execution time drops to roughly 10% of the original on average (we have nearly 4 million closed incidents and only about 40,000 active ones).
But row movement is disabled by default, so maybe enabling it will cause some performance problems. Of course, when a row moves, the update is carried out as a delete plus re-insert, with all relevant index entries adjusted accordingly, and the rowid changes after the row is moved (I'm quite sure we do not use rowid in our system, so this won't be a problem). A simplified sketch of the DDL is shown below.
Question 1:
Besides those mentioned above, will there be any other bad side effects of enabling row movement?
Question 2:
I suspect that moving rows will create space holes in the original partition and that the data file will become fragmented after running for a long time. Is this true?
Question 3:
If Question 2 is true, is there a way to remove these space holes, such as ALTER TABLE mytable SHRINK SPACE;?
Question 4:
Here one guy said 'everybody should be carefull when enabling row movement in production system since enabling row movement invalidates all dependent views, which could result into plenty invalidate objects', but in my development environment, after moving rows in the incident table, the materialized view based on the incident table still works, and in dba_mviews everything seems fine. So, did I misunderstand what he meant?
Any suggestion is greatly appreciated.
As long as you don't have any code that uses the ROWID and your status changes aren't causing too many rows to move between partitions (which seems unlikely if you're just creating two partitions), you should be fine. The other downsides are generally of the "someone might do something that causes issues because they didn't understand the implications" variety, such as shrinking a table.
Any space that is freed up when a row is deleted from the active partition should be reused on subsequent inserts into the table (assuming you have a standard OLTP application that is doing conventional-path inserts). You shouldn't need to worry about "space holes".
I read the comment you quote as noting that the actual ALTER TABLE <> ENABLE ROW MOVEMENT command is DDL, so it invalidates dependent objects and forces Oracle to recompile them. It would be unfortunate if you ran that DDL at noon on a busy day in an OLTP system.

Oracle I/O performance & optimizing database design [closed]

I'm optimizing my Oracle database.
I'm confused about the I/O performance of writing 10 concurrent requests to one table versus writing 10 concurrent requests to 10 tables.
If I have 10 types of data that could be stored in one table, which approach gives the best insert performance: 1 table or 10 tables?
Does anybody know?
Performance tuning is a large topic, and questions can't really be answered with so little information to begin with. But I'll try to provide some basic pointers.
If I got it right, you are mainly concerned with insert performance into what is currently a single table.
The first step should be to find out what is actually limiting your performance. Let's consider some scenarios:
Disk I/O: disks are slow, so get the fastest disks you can; this might well mean SSDs. Put them in a RAID that is tuned for performance ("striping" is the key word, as far as I know). Of course SSDs will fail, just as HDs do, so you want to plan for that. HDs are also faster when they aren't completely full (I never really verified that). Partitioned tables might help as well (see below). But most of the time we can reduce the I/O load itself, which is far more effective than more and faster hardware.
Contention on locks (on primary keys, for example).
Partitioned tables and indexes might be a solution. A partitioned table is logically one table (you can select from it and write to it just like a normal table), but internally the data is spread across multiple segments. A partitioned index is the same idea applied to an index. This might help because the index underlying a unique key gets locked when a new value is added, so two sessions can't insert the same value; if the values are spread across n index partitions, contention on such locks may be reduced. Partitions can also be spread over different tablespaces/disks, so there is less waiting on the physical storage.
Time to verify constraints: if you have constraints on the table, they need time to do their job. If you do batch inserts, you should consider deferrable constraints, which are only checked at commit time instead of on every insert (see the sketch after this list). If you are careful in your application, you can even disable them and re-enable them afterwards without validating, which is fast, but of course you have to be really, really sure the constraints actually hold. And make sure your constraints have all the indexes they need to perform well.
Talking about batch inserts: if you are doing those, you might want to look into direct-path loads: http://docs.oracle.com/cd/A58617_01/server.804/a58227/ch_dlins.htm (I think this is the Oracle 8 version; I'm sure there is updated documentation somewhere).
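
A minimal sketch of the deferrable-constraint idea; the table, column, and constraint names (orders, customers, fk_orders_customer) are hypothetical, not from the question.

    -- Declare the constraint as deferrable so the check can wait until COMMIT.
    ALTER TABLE orders
        ADD CONSTRAINT fk_orders_customer
        FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
        DEFERRABLE INITIALLY IMMEDIATE;

    -- For a batch insert, defer the check to commit time:
    SET CONSTRAINT fk_orders_customer DEFERRED;
    -- ...many INSERTs...
    COMMIT;  -- the constraint is verified once, here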
To wrap it up: without knowing exactly where your performance problem is, there is no way to tell how to fix it. So find out where your problem is, then come back with a more precise question.
