Structuring a database for financial data [closed]

I have a lot of stock price data saved in CSV files that I've been collecting for a while and intend to keep collecting, but from now on into a DB instead of CSVs.
There are 73 files (a file for each asset), each with around 2 million rows. The data is formatted the same way in all of them:
date, timestamp, open, high, low, close, volume
I want to create an individual table for each of the CSV files because:
For the uses I have in mind, I won't need more than one asset at once.
I know 140 million rows isn't a heavy load for an RDBMS, but I think searches would perform better against a table of 2M records than against one of 140M.
By separating by asset, I can put a UNIQUE constraint on a column (like date or timestamp) and prevent duplicate records.
Are any of those points a wrong assumption or bad practice? Is there a compelling reason to save them all in a single table?
I've read this question; although it describes a similar problem, I don't think the answer applies to my case.
In case it wasn't clear, I don't have much experience with DBs, so guidance and educational answers are greatly appreciated.

I would store them in a single table just because I wouldn't have to maintain 73 tables.
If you update your data on a daily, weekly, or even monthly basis, you would have to insert into 73 tables from 73 CSV files, or maintain an automated script for that purpose, which I think is a bit too much for this.
For the uses I have in mind, I won't need more than one asset at once. -> I don't understand this.
By separating by asset, I can put a UNIQUE constraint on a column (like date or timestamp) and prevent duplicate records. -> If you store them in a single table, you could add an asset_id column, which identifies which asset each date/timestamp row belongs to.
I know 140 million rows isn't a heavy load for an RDBMS, but I think searches would perform better against a table of 2M records than against one of 140M. -> You could partition your table on date and asset_id, but that is a much broader discussion; with the details you've given, that is what I would do.
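A minimal sketch of that single-table layout (PostgreSQL 11+ syntax; the table and column names are illustrative, not from the question):

    -- One table for all assets; the composite UNIQUE constraint prevents
    -- duplicate rows for the same asset at the same point in time.
    CREATE TABLE price_bar (
        asset_id  integer      NOT NULL,
        ts        timestamptz  NOT NULL,   -- the CSV date + timestamp combined
        open      numeric(18,6),
        high      numeric(18,6),
        low       numeric(18,6),
        close     numeric(18,6),
        volume    bigint,
        UNIQUE (asset_id, ts)
    ) PARTITION BY LIST (asset_id);        -- optional: one partition per asset

    -- Example partition for a single asset (asset_id 1).
    CREATE TABLE price_bar_asset_1 PARTITION OF price_bar FOR VALUES IN (1);

A query that touches a single asset (WHERE asset_id = 1 AND ts >= ...) then only reads that asset's partition, which gives roughly the per-asset performance the question was hoping for, without maintaining 73 separate tables.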

Related

How does sortkey in Redshift work internally? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 1 year ago.
Improve this question
I'm a beginner to Redshift and Data Warehouses in general.
When a numeric or timestamp column is specified as sortkey, does the Redshift DBMS use binary search during a query to find the desired row as efficiently as possible?
I feel that knowing more about this would improve my table design skill.
Amazon Redshift is a columnar datastore, which means that each column is stored separately. This is great for wide tables because Redshift only needs to read in the columns that are specifically used in the query. The most time-consuming part of database queries is disk access, so anything that reduces/avoids disk access is a good thing.
When data is stored on disk, it is stored in 1MB disk blocks. Each column can consume multiple blocks, and each block only contains data relating to one column. Redshift keeps a Zone Map of each block, which stores the minimum and maximum values stored in the block. For example, if a query is searching for data from 2021 and Redshift knows that the timestamp column for a particular block has a maximum value in 2018, it does not need to read the block from disk to examine the contents. This greatly reduces query time.
Data is stored in the blocks based upon the selected Compression Encoding. These are very clever techniques for reducing the storage space for data. For example, if a column contains a list of Countries and the rows are sorted in alphabetical order by country, then Redshift could simply store the fact that the block contains Jamaica x 63, then Japan x 104, then Jordan x 26. This might only require 24 bytes to store 193 rows of data, and don't forget that each block is 1MB in size. Thus, compression reduces the amount of disk access required to retrieve data, again making queries faster.
To answer your question about how Redshift would find the desired rows:
If the SORTKEY is used in a WHERE statement, then Redshift can quickly find the relevant blocks that potentially contain the desired data. I'm not sure if it does that with a binary search.
If the WHERE clause does not use the SORTKEY, then finding the right rows is less efficient: because the data is not sorted on those columns, many blocks on disk might contain rows that match the WHERE clause, so more blocks have to be read.
Redshift can still 'skip over' blocks whose Zone Maps show they cannot contain matching data for the referenced columns, avoiding the need to read those blocks from disk. Plus, compression on the various columns can reduce the number of blocks that need to be read from disk.
The general rules for Amazon Redshift are:
Set the DISTKEY to the column that is most frequently used in JOIN
Set the SORTKEY to the column that is most frequently used in WHERE
See: Tuning query performance - Amazon Redshift
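A minimal sketch of those rules (the table, column names, and encoding choice are hypothetical, not from the question):

    -- Hypothetical fact table: distribute on the join column, sort on the
    -- column most often filtered in WHERE so Zone Maps can skip blocks.
    CREATE TABLE sales_fact (
        customer_id  BIGINT        NOT NULL,
        sale_ts      TIMESTAMP     NOT NULL,
        country      VARCHAR(50)   ENCODE RUNLENGTH,  -- run-length compression, as described above
        amount       DECIMAL(12,2)
    )
    DISTKEY (customer_id)   -- most frequently used in JOINs
    SORTKEY (sale_ts);      -- most frequently used in WHERE filters

    -- Only blocks whose Zone Maps overlap 2021 need to be read from disk:
    SELECT SUM(amount) FROM sales_fact WHERE sale_ts >= '2021-01-01';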

Why does the database become so slow when it has 1 million rows? [closed]

When we have 1 million rows in a single table, anything dealing with that table becomes very slow, even simple commands like SELECT * FROM TBL_USERS WHERE ID = 20, but if we deal with any other table it is as fast as normal, which indicates the problem is not the database as a whole.
Is there an explanation for this phenomenon, and a solution?
Typically there are two things that contribute to the time it takes to process a query:
How many rows SQL Server has to read to satisfy the conditions of the query. If all the rows have to be read (and possibly the pages the data sits on brought into memory), this is called a scan and takes a while. If SQL Server has an index and only needs to read one or a few pages, this is much faster and is referred to as a seek.
The second part of a query is how many rows (and columns) have to be returned. If millions of rows have to be pushed across a LAN, this will take time. This is why it is a good idea to avoid SELECT * (as in SELECT * FROM tableA) and to use WHERE conditions as much as possible to narrow down the number of rows being returned.
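As a sketch of the first point, an index on the filtered column lets SQL Server seek instead of scan (TBL_USERS and ID come from the question; the index name and the Name/Email columns are made up for illustration):

    -- Without an index on ID, the query has to scan the whole table.
    -- With this index, SQL Server can seek directly to the matching row.
    CREATE INDEX IX_TBL_USERS_ID ON TBL_USERS (ID);

    -- Return only the columns you need instead of SELECT *.
    SELECT Name, Email FROM TBL_USERS WHERE ID = 20;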

Huge database table design issue [closed]

In my DB design I've encountered an issue. My app consists of a workflow on a specific kind of media which has 6 stages, as follows:
Resources
Ingest
Review
VideoRepair
Listing
Backup
In all the stages the type of data being added (or updated) to the table is the same; only the column names change. For example, in Ingest we have the following columns:
CaptureSup_Name, Assign_DateByCaptureSup, AssignedCaptureOp_Name,
LastCapture_Date, LastCaptureOp_Name, LastCapture_Date,
In Review we have exactly the same columns, only with Ingest replaced by Review, and almost the same happens for the other stages (with one or two columns more or less). For better performance on SELECT queries, I decided not to assign one table to each stage (traditional normalization) and instead wrapped them all in one unified table.
Now my table has 30 columns, and that number scares me because I have never designed such a big table. Which of the following scenarios is most suitable for my case, considering that my database is supposed to support a huge amount of data (about 1,500 records added daily) and speed is vital?
Follow the traditional normalization approach and break my big table into 6 or 7 tables, each with about 5 or 6 columns, so that I have to write (n-1) joins to retrieve the complete data of a cycle.
Keep my current design (one table with 30 columns) and find a solution for reducing the size of the logs, because in this case the logs would be more massive (because of updates).
30 fields is not a wide table. SQL Server supports over 1,000 fields per table.
1,500 records per day is not a large volume of inserts. That is only about 500 thousand rows per year. SQL Server can handle billions of rows.
If you are having an issue with logs, then you have several options, such as:
switching to simple recovery mode;
periodically backing up the database and truncating the log; and,
using database checkpoints.
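For the first of those options, a minimal sketch (SQL Server syntax; the database name is made up):

    -- Simple recovery mode lets the log be truncated automatically at each
    -- checkpoint, at the cost of point-in-time restore.
    ALTER DATABASE MyWorkflowDb SET RECOVERY SIMPLE;

    -- Issue a checkpoint manually if you want the log space reused right away.
    CHECKPOINT;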
You should break your table up into other tables, if that makes sense in terms of the relational model -- this is likely, because 30 columns in a table would often be combinations of other entities. But your data structure is easily in the realm of what databases readily support. And, it can grow much, much larger with no problems (assuming none of your columns are really humongous).
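If you do decide to break it up, one possible shape (a sketch only; the column names are simplified stand-ins for the ones in the question) is a single per-stage event table instead of six near-identical column groups:

    -- One row per media item per stage, instead of 6 repeated column groups.
    CREATE TABLE StageEvent (
        MediaId         INT           NOT NULL,
        Stage           VARCHAR(20)   NOT NULL,  -- 'Ingest', 'Review', 'VideoRepair', ...
        SupervisorName  VARCHAR(100),
        AssignDate      DATETIME,
        AssignedOpName  VARCHAR(100),
        LastOpName      VARCHAR(100),
        LastDate        DATETIME,
        PRIMARY KEY (MediaId, Stage)
    );

Retrieving the whole cycle for one item is then a single filtered query (plus at most one join back to the media table) rather than n-1 joins.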

How to store large timeseries objects in a relational database? [closed]

The timeseries timestamps are irregular. Ticks are in seconds. The crucial data are from the past 10 days, so these data should be accessed very fast and must be presented in a real-time HTML/JS chart (the chart will auto-update through asynchronous requests). Data older than 10 days should be stored too, but can be in some zipped form (a blob? a file? what?).
For each user I might need millions of entries of data each year. So my problem is (a) scalability, and (b) ease and speed of opening the timeseries and computing statistics (median values, etc.). We decided that the user will be able to view real-time timestamped values for the past 10 days only.
For the project we will be using Django and the Python/pandas library.
We deal with tick data using PostgreSQL.
In order to keep queries fast, we partition the tables first by date, and then also alphabetically (instruments starting with letters A-B-C-D in one partition, E-F-G-H in another...)
Adding indices on the timestamps and ticker-id, as well as regular VACUUM cleaning and clustering of the tables at the end of day, also helps, but partitioning gives the single biggest increase in performance, by far.
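A minimal sketch of that kind of layout (PostgreSQL 11+ declarative partitioning; the table and column names are illustrative, and the second, alphabetical level of partitioning described above is omitted for brevity):

    -- Tick data partitioned by date range, indexed on instrument + time.
    CREATE TABLE tick (
        ticker_id  integer      NOT NULL,
        ts         timestamptz  NOT NULL,
        price      numeric(18,6),
        volume     bigint
    ) PARTITION BY RANGE (ts);

    -- One partition per month; create new ones as time advances.
    CREATE TABLE tick_2024_01 PARTITION OF tick
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

    -- An index on the parent cascades to every partition (PostgreSQL 11+).
    CREATE INDEX ON tick (ticker_id, ts);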

Oracle IO performance & optimizing database design [closed]

I'm optimizing my Oracle database.
I'm confused about the IO performance of writing 10 concurrent requests to one table versus writing 10 concurrent requests to 10 tables.
If I have 10 types of data that could be stored in 1 table, which approach gives the best insert performance: 1 table or 10 tables?
Does anybody know about this?
Performance tuning is a large topic, and questions can't really be answered with so little information to begin with. But I'll try to provide some basic pointers.
If I got it right, you are mainly concerned with insert performance into what is currently a single table.
The first step should be to find out what is actually limiting your performance. Let's consider some scenarios:
disk I/O: Disks are slow, so get the fastest disks you can. This might well mean SSDs. Put them in a RAID that is tuned for performance; "striping" is the key word as far as I know. Of course SSDs will fail, as your HDs do, so you want to plan for that. HDs are also faster when they aren't completely full (I never really checked that). Partitioned tables might help as well (see below). But most of the time we can reduce the I/O load, which is far more efficient than more and faster hardware...
contention on locks (of primary keys for example).
Partitioned tables and indexes might be a solution. A partitioned table is logically one table (you can select from it and write to it just like a normal table), but internally the data is spread across multiple segments. A partitioned index is the same idea applied to an index. This might help because the index underlying a unique key gets locked when a new value is added, so two sessions can't insert the same value; if the values are spread across n index partitions, contention on those locks is reduced. Also, partitions can be spread over different tablespaces/disks, so there is less waiting on the physical I/O.
time to verify constraints: If you have constraints on the table, they need time to do their job. If you do batch inserts, you should consider deferred constraints; they are only checked at commit time instead of on every insert. If you are careful with your application, you can even disable them and re-enable them afterwards without validating them. This is fast, but of course you have to be really, really sure the constraints actually hold. You should also make sure your constraints have all the indexes they need to perform well.
talking about batch inserts: if you are doing those, you might want to look into direct-path loads: http://docs.oracle.com/cd/A58617_01/server.804/a58227/ch_dlins.htm (I think this is the Oracle 8 version; I'm sure there is updated documentation somewhere).
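A minimal sketch of the deferred-constraint and direct-path ideas as two separate illustrations (Oracle syntax; the table and constraint names are made up):

    -- 1) A deferrable constraint is checked once at COMMIT instead of per row.
    ALTER TABLE orders ADD CONSTRAINT orders_uk
        UNIQUE (order_no) DEFERRABLE INITIALLY IMMEDIATE;

    SET CONSTRAINTS orders_uk DEFERRED;
    -- ... batch inserts into orders here ...
    COMMIT;   -- the constraint is verified here

    -- 2) Direct-path (append) insert for bulk loads from a staging table.
    INSERT /*+ APPEND */ INTO trades
    SELECT * FROM staging_trades;
    COMMIT;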
To wrap it up: without knowing where exactly your performance problem is, there is no way one can tell how to fix it. So find out where your problem is, then come back with a more precise question.
