We create indexes on a database to speed up search queries by reducing the number of records/rows in a table that need to be examined.
I was asked in an interview when one should not index a database. I answered: when the database is small, when it has few columns, and when disk space matters. But the interviewer did not seem convinced.
What are the main reasons not to index a database?
If the number of records is small (e.g. below 50,000; the real threshold depends on the hardware: whether it is a PC or an embedded device, and how much RAM and CPU it has).
If you have intensive inserts/updates and rare selects. Data modification becomes expensive if there are (many) indexes.
There are a few scenarios where adding an index might not be beneficial.
You always read all records in the table and you have frequent writes. In this scenario an index has no benefit and would just slow down the writes.
The number of writes is significantly higher than the number of reads and you do not care about read performance. For example, if you log to a table in production, you may want writes to be as fast as possible. As you generally only query this table when an error occurs, you may be happy to perform a table scan (see the sketch below).
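To make the write overhead concrete, here is a minimal sketch (the table and index names are made up): every secondary index has to be maintained on each INSERT/UPDATE, which is why a write-heavy, rarely-read table such as a production log is often left unindexed, or has its indexes dropped for a bulk load and recreated afterwards.

    -- Hypothetical write-heavy, rarely-read logging table
    CREATE TABLE app_log (
        logged_at  TIMESTAMP NOT NULL,
        severity   VARCHAR(10),
        message    VARCHAR(4000)
    );

    -- An index makes the occasional troubleshooting query faster ...
    CREATE INDEX ix_app_log_logged_at ON app_log (logged_at);

    -- ... but every INSERT now has to maintain the index as well, so for a
    -- large bulk load it can pay off to drop it and recreate it afterwards
    -- (exact DROP INDEX syntax varies between RDBMSs):
    DROP INDEX ix_app_log_logged_at;
    -- ... bulk insert here ...
    CREATE INDEX ix_app_log_logged_at ON app_log (logged_at);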
I have a lot of stock price data saved in CSV files that I've been collecting for a while and intend to keep collecting, but now into a DB instead of CSVs.
There are 73 files (a file for each asset), each with around 2 million rows. The data is formatted the same way in all of them:
date, timestamp, open, high, low, close, volume
I want to create an individual table for each of the CSV files because:
For the uses I have in mind, I won't need more than one asset at once.
I know 140 million lines isn't a heavy load for an RDBMS, but I think it would perform better searching a table of 2M records instead of 140M.
Separating by asset I can have a column with a unique constraint (like date or timestamp) and prevent records from being duplicated.
Are any of those points a wrong assumption or bad practice? Is there a compelling reason to save them all in a single table?
I've read this question; although it is a similar problem, I don't think the answer applies to my case.
In case it wasn't clear, I don't have much experience with DBs, so guidance and educational answers are heavily appreciated.
I would store them in a single table just because I wouldn't have to maintain 73 tables.
If you update your data on a daily, weekly, or even monthly basis, you would have to insert into 73 tables from 73 CSV files, or maintain an automated script for that purpose, which I think is a bit much for this.
For the uses I have in mind, I won't need more than one asset at once. -> I don't understand this point.
Separating by asset I can have a column with a unique constraint (like date or timestamp) and prevent records from being duplicated. -> If you store them in a single table you could add an asset_id column, which identifies which asset each (date, timestamp) row belongs to.
I know 140 million lines isn't a heavy load for an RDBMS, but I think it would perform better searching a table of 2M records instead of 140M. -> You could partition your table on date and asset_id, but that is a much broader discussion; with the details you've given, this is what I would do.
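A minimal sketch of that single-table layout (all names are invented and the NUMERIC precisions are placeholders). The unique constraint gives the per-asset duplicate protection you asked about, and its underlying index keeps single-asset lookups fast; range partitioning on the date column would be an optional further step.

    -- One table for all 73 assets, identified by asset_id
    CREATE TABLE price_bar (
        asset_id    INTEGER       NOT NULL,
        trade_date  DATE          NOT NULL,
        ts          TIME          NOT NULL,
        open_price  NUMERIC(12,4),
        high_price  NUMERIC(12,4),
        low_price   NUMERIC(12,4),
        close_price NUMERIC(12,4),
        volume      BIGINT,
        -- prevents duplicate bars per asset, not globally
        CONSTRAINT uq_price_bar UNIQUE (asset_id, trade_date, ts)
    );

    -- Typical single-asset query: uses the index behind the unique constraint
    SELECT trade_date, ts, close_price
    FROM   price_bar
    WHERE  asset_id = 42
    AND    trade_date >= '2020-01-01';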
When we have 1 million rows in a single table, anything we do with this table becomes very slow, even simple commands like Select * From TBL_USERS Where ID = 20, but if we work with any other table it is as fast as normal, which indicates that the defect is not in the database itself.
Is there an explanation for this phenomenon, and a solution?
Typically there are two things that contribute to the time it takes to process a query:
How many rows SQL Server has to read to satisfy the conditions of the query. If all the rows have to be read (and possibly the pages the data is on brought into memory), this is called a scan and takes a while. If SQL Server has an index and only needs to read one or a few pages, this is much faster and is referred to as a seek.
The second part is how many rows (and columns) have to be returned. If millions of rows have to be pushed across a LAN, this will take time. This is why it is a good idea to avoid using * (as in select * from tableA) and to use where conditions as much as possible, which narrows down the number of rows being returned.
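For the query in the question, the usual remedy is an index on the filtered column (assuming ID is not already the clustered primary key, which it normally would be) so SQL Server can seek instead of scan, combined with listing only the columns you need; the column names below are made up.

    -- Lets SQL Server seek straight to ID = 20 instead of scanning 1M rows
    CREATE INDEX IX_TBL_USERS_ID ON TBL_USERS (ID);

    -- Return only the columns you actually need, instead of SELECT *
    SELECT Name, Email
    FROM   TBL_USERS
    WHERE  ID = 20;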
In my DB design I've encountered an issue. My app consists of a workflow on a specific kind of media, which has 6 stages as follows:
Resources
Ingest
Review
VideoRepair
Listing
backup
In all the stages the type of data being added (or updated) to the table is the same; only the column names change. For example, in Ingest we have the following columns:
CaptureSup_Name, Assign_DateByCaptureSup, AssignedCaptureOp_Name,
LastCapture_Date, LastCaptureOp_Name, LastCapture_Date,
and in Review we have exactly the same columns, only with Ingest replaced by Review, and almost the same holds for the other columns of the table (with one or two columns more or less). For the sake of better performance on select queries, I decided not to assign one table to each stage (traditional normalization); instead I wrapped them all into one unified table.
Now my table has 30 columns, and that number scares me because I have never designed such a big table. Which of the following scenarios is most suitable for my case, considering my database is supposed to support a large amount of data (about 1,500 records being added daily) and speed is vital?
Follow the traditional normalization approach and break my big table into 6 or 7 tables, each with about 5 or 6 columns, so that I have to write (n-1) joins to retrieve the complete data of a cycle.
Keep my current design (one table with 30 columns) and find a solution for reducing the size of the logs, which would be more massive in this case (because of the updates).
30 fields is not a wide table. SQL Server supports over 1,000 fields per table.
1,500 records per day is not a large volume of inserts. That is only about 500 thousand rows per year. SQL Server can handle billions of rows.
If you are having an issue with logs, then you have several options (see the sketch after this list), such as:
switching to simple recovery mode;
periodically backing up the database and truncating the log; and,
using database checkpoints.
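For example, in SQL Server these look roughly like this (the database name and backup path are placeholders):

    -- Option 1: simple recovery model, log space is reused automatically
    ALTER DATABASE MyWorkflowDb SET RECOVERY SIMPLE;

    -- Option 2: stay in FULL recovery but back up the log regularly,
    -- which lets the inactive portion be truncated and reused
    BACKUP LOG MyWorkflowDb TO DISK = N'D:\Backups\MyWorkflowDb_log.trn';

    -- Option 3: force a checkpoint manually (SQL Server normally issues
    -- these automatically)
    CHECKPOINT;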
You should break your table up into other tables, if that makes sense in terms of the relational model -- this is likely, because 30 columns in a table would often be combinations of other entities. But your data structure is easily in the realm of what databases readily support. And, it can grow much, much larger with no problems (assuming none of your columns are really humongous).
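If you do normalize, one possible shape is a single stage table keyed by (media_id, stage) instead of repeating the same six columns once per stage. This is purely a sketch: the column names are guesses based on the ones quoted in the question, and SQL Server types are assumed.

    -- One row per media item
    CREATE TABLE media (
        media_id INT PRIMARY KEY,
        title    NVARCHAR(200)
    );

    -- One row per media item per stage ('Ingest', 'Review', 'VideoRepair', ...)
    CREATE TABLE media_stage (
        media_id        INT         NOT NULL REFERENCES media (media_id),
        stage           VARCHAR(20) NOT NULL,
        supervisor_name NVARCHAR(100),
        assigned_date   DATETIME2,
        operator_name   NVARCHAR(100),
        last_date       DATETIME2,
        PRIMARY KEY (media_id, stage)
    );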
The time-series timestamps are not regular; ticks are in seconds. The crucial data are from the past 10 days, so these data should be accessed very fast and must be presented in a real-time HTML/JS chart (the chart will auto-update through asynchronous requests). Data older than 10 days should be stored too, but can be in some zipped form (blob? file? what?).
For each user I might need millions of entries of data each year. So my problem is (a) scalability, and (b) the ease and speed of opening the time series and computing statistics (median values, etc.). We decided that the user will be able to view real-time timestamped values for the past 10 days only.
For the project we will be using Django and the Python/pandas library.
We deal with tick data using PostgreSQL.
In order to keep queries fast, we partition the tables first by date, and then also alphabetically (instruments starting with letters A-B-C-D in one partition, E-F-G-H in another...)
Adding indices on the timestamps and ticker-id, as well as regular VACUUM cleaning and clustering of the tables at the end of the day, also helps, but partitions give the single biggest increase in performance, by far.
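A minimal sketch of that layout using PostgreSQL declarative partitioning (all names are invented, and the original setup may well use the older inheritance/trigger-based partitioning instead); the last query shows the kind of statistic mentioned in the question, restricted to the recent window.

    -- Parent table, partitioned by date first
    CREATE TABLE tick (
        ticker_id integer     NOT NULL,
        ticker    text        NOT NULL,
        ts        timestamptz NOT NULL,
        price     numeric(12,4),
        volume    bigint
    ) PARTITION BY RANGE (ts);

    -- One partition per day, itself split alphabetically by ticker
    CREATE TABLE tick_2024_01_02 PARTITION OF tick
        FOR VALUES FROM ('2024-01-02') TO ('2024-01-03')
        PARTITION BY LIST (left(ticker, 1));

    CREATE TABLE tick_2024_01_02_ad PARTITION OF tick_2024_01_02
        FOR VALUES IN ('A', 'B', 'C', 'D');

    -- Indexes on the lookup columns mentioned in the answer
    CREATE INDEX ON tick (ticker_id, ts);

    -- Example statistic over the "hot" 10-day window (median price of one instrument)
    SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY price)
    FROM   tick
    WHERE  ticker_id = 1
    AND    ts >= now() - interval '10 days';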
I'm optimizing my Oracle database.
I'm confused about the I/O performance of writing 10 concurrent requests to one table versus writing 10 concurrent requests to 10 tables.
If I have 10 types of data that can be stored in 1 table, which approach gives the best insert performance: 1 table or 10 tables?
Does anybody know about this?
Performance tuning is a large topic, and questions can't really be answered with so little information to begin with. But I'll try to provide some basic pointers.
If I got it right, you are mainly concerned with insert performance into what is currently a single table.
The first step should be to find out what is actually limiting your performance. Let's consider some scenarios:
Disk I/O: disks are slow, so get the fastest disks you can; this might well mean SSDs. Put them in a RAID that is tuned for performance ("striping" is the key word, as far as I know). Of course SSDs will fail, as your HDs do, so you want to plan for that. HDs are also faster when they aren't completely full (I never really checked that). Partitioned tables might help as well (see below). But most of the time you can reduce the I/O load instead, which is far more efficient than more and faster hardware...
Contention on locks (on primary keys, for example).
Partitioned tables and indexes might be a solution (there is a sketch of this, and of the next two points, after this list). A partitioned table is logically one table (you can select from it and write to it just like a normal table), but internally the data is spread across multiple tables. A partitioned index is the same idea applied to an index. This might help because the index underlying a unique key gets locked when a new value is added, so two sessions can't insert the same value; if the values are spread between n indexes, this can reduce the contention on such locks. Partitions can also be spread over different tablespaces/disks, so there is less waiting on the physical storage.
Time to verify constraints: if you have constraints on the table, they need time to do their job. If you do batch inserts, you should consider deferred constraints, which only get checked at commit time instead of on every insert. If you are careful with your application, you can even disable them and enable them afterwards without checking them. This is fast, but of course you have to be really, really sure the constraints actually hold. Of course, you should also make sure your constraints have all the indexes they need to perform well.
Talking about batch inserts: if you are doing those, you might want to look into direct load: http://docs.oracle.com/cd/A58617_01/server.804/a58227/ch_dlins.htm (I think this is the Oracle 8 version; I'm sure there is updated documentation somewhere).
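To make those last points concrete, here is a rough Oracle sketch covering a hash-partitioned table with a local index (to spread insert and index-block contention), a deferrable constraint, and a direct-path insert via the APPEND hint; every object name here is a placeholder.

    -- Hash-partitioned table: inserts are spread over 8 segments
    CREATE TABLE orders (
        order_id    NUMBER NOT NULL,
        customer_id NUMBER,
        payload     VARCHAR2(4000)
    )
    PARTITION BY HASH (order_id) PARTITIONS 8;

    -- LOCAL index: one index segment per partition, so concurrent sessions
    -- contend less on the same index blocks
    CREATE UNIQUE INDEX orders_pk_ix ON orders (order_id) LOCAL;
    ALTER TABLE orders ADD CONSTRAINT orders_pk
        PRIMARY KEY (order_id) USING INDEX orders_pk_ix;

    -- Deferrable constraint: checked at COMMIT instead of per row
    ALTER TABLE orders ADD CONSTRAINT orders_customer_fk
        FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
        DEFERRABLE INITIALLY DEFERRED;

    -- For a batch load you can also disable the constraint, load with a
    -- direct-path (APPEND) insert, and re-enable it afterwards without
    -- re-checking existing rows (only if you are sure it still holds).
    -- Note: an enabled foreign key or trigger silently turns the APPEND
    -- hint back into a conventional insert.
    ALTER TABLE orders DISABLE CONSTRAINT orders_customer_fk;

    -- Direct-path insert: writes blocks above the high-water mark, bypassing
    -- the buffer cache; this session can only read the table again after COMMIT
    INSERT /*+ APPEND */ INTO orders
    SELECT order_id, customer_id, payload FROM orders_staging;
    COMMIT;

    ALTER TABLE orders ENABLE NOVALIDATE CONSTRAINT orders_customer_fk;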
To wrap it up: without knowing where exactly your performance problem is, there is no way to tell how to fix it. So find out where your problem is, then come back with a more precise question.