I'm working on an application that stores many large time/value datasets (chart data: essentially sensor readings taken every day, hour, or 15 minutes over a year or more). Currently we're storing them in 2 MySQL tables: a datasets table that stores the info (ID, name, etc.) for a dataset, and a table containing (dataset ID, timestamp, value) triplets. This second table is already well over a million rows, and the amount of data to be stored is expected to become many times larger.
The common operations such as retrieving all points for a particular dataset in a range are running quickly enough, but some other more complex operations can be painful.
Is this the best way to organize the data? Is a relational database even particularly suited to this sort of thing? Or do I just need to learn to define better indexes and optimize the queries?
A relational database is definitely what you need for this kind of large structured dataset. If individual queries are causing problems, it's worth profiling each one to find out whether different indexes are required or whether the query itself can be restructured.
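For example, here is one quick way to check whether a composite index actually covers the common "all points in a range" query. This is a sketch in SQLite with assumed table and column names mirroring the question; MySQL's `EXPLAIN` serves the same purpose there.

```python
import sqlite3

# Hypothetical schema mirroring the one described above; names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE datasets (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE points (dataset_id INTEGER, ts INTEGER, value REAL);
    -- Composite index covering the common "all points in a range" query.
    CREATE INDEX idx_points_dataset_ts ON points (dataset_id, ts);
""")

# Ask the planner how it would run the range query before trusting the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT ts, value FROM points "
    "WHERE dataset_id = ? AND ts BETWEEN ? AND ?",
    (1, 0, 86400),
).fetchall()
print(plan)  # the plan should mention idx_points_dataset_ts
```

If the plan shows a full table scan instead of the index, that's the first thing to fix before reaching for a different storage engine.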
I have multiple sources of input with different schemas. To do some analytics using ClickHouse, I thought of 2 approaches to handling the analytic workload, using a join or an aggregation operation:
Using a join involves defining a table corresponding to each input.
Using aggregate functions requires a single table with a predefined set of columns. The number of columns and their types would be based on my approximations, and may change in the future.
My question is: if I go with the second approach and define lots of columns, let's say hundreds of columns, how does it affect the performance, storage cost, etc.?
Generally speaking, a large table with all your values plus the use of aggregate functions is often the use case ClickHouse was designed for.
Join-based queries only start being efficient on large datasets when the queries are distributed between machines. But if you can afford to keep your data on a single SSD RAID, try using a single table and aggregate functions.
Of course, that's generic advice, it really depends on your data.
As far as irregular data goes, depending on how varied it can be, you may want to look into a dynamic solution (e.g. Spark or Elasticsearch) or a database that supports "sparse" columns (e.g. Cassandra or ScyllaDB).
If you want to use Clickhouse for this, look into using arrays and tuples to hold them.
Overall, ClickHouse is pretty clever about compressing data, so adding a lot of empty values should be fine: they add almost nothing to query time and take up essentially no extra space. Queries are column-based, so if you don't need a column for a specific query, performance isn't affected by the mere fact that said column exists (as it would be in a row-oriented RDBMS).
So even if your table has, say, 200 columns, as long as your query only uses 2 of those columns, it will be basically as efficient as if the table only had 2 columns. Also, the lower the cardinality of a column, the faster the queries on that column (with some caveats). That being said, if you plan to query hundreds of columns in the same query, it's probably going to be fairly slow; but ClickHouse is very good at parallelizing work, so if your data is in the lower dozens of TB (uncompressed), a machine with some large SSDs and two Xeons will usually do the trick.
But, again, this all depends heavily on the dataset; you'd have to describe your data and the types of queries you need in order to get a more meaningful answer.
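To make the column-pruning point above concrete, here is a toy Python sketch. This is not ClickHouse itself, just an illustration of why a columnar layout makes a 200-column table cheap to query when only 2 columns are touched: each column is a separate contiguous array, and a query only ever reads the arrays it names.

```python
# Toy columnar "table": 200 columns, each stored as its own list.
columns = {f"col_{i}": list(range(1000)) for i in range(200)}

def query_sum(table, wanted):
    # Only the named columns are ever touched; the other 198 are never read.
    return {name: sum(table[name]) for name in wanted}

result = query_sum(columns, ["col_0", "col_199"])
print(result)  # {'col_0': 499500, 'col_199': 499500}
```

A row-oriented store would instead read all 200 values of every row to answer the same query, which is the difference the answer is describing.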
I am using SQLite in my application. The scenario is that I have stock market data and each company is a database with 1 table. That table stores records which can range from a couple of thousand to half a million.
Currently, when I update the data in real time, I open a connection, check whether that particular record exists, insert it if it doesn't, and close the connection. This is done in a loop, and each database (representing a company) is updated. The number of records inserted is low and is not the problem. But is the process okay?
An alternative is to have 1 database with many tables (one table per company), where each table can hold a lot of records. Is this better or not?
You can expect around 500 companies. I am coding in VS 2010; the language is VB.NET.
The optimal organization for your data is to make it properly normalized, i.e., put all data into a single table with a company column.
This is better for performance because the table- and database-related overhead is reduced.
Queries can be sped up with indexes, but what indexes you need depends on the actual queries.
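A minimal sketch of that single-table layout in SQLite (the column names are made up): a UNIQUE constraint plus `INSERT OR IGNORE` replaces the whole open-connection/check-exists/insert loop with one statement.

```python
import sqlite3

# One normalized table with a company column, instead of 500 databases.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE quotes (
        company TEXT NOT NULL,
        ts      INTEGER NOT NULL,
        price   REAL,
        UNIQUE (company, ts)   -- the "does it exist?" check becomes a constraint
    );
    CREATE INDEX idx_quotes_company ON quotes (company);
""")

rows = [("ACME", 1, 10.0), ("ACME", 1, 10.0), ("GLOBEX", 1, 20.0)]
# INSERT OR IGNORE silently skips rows that violate the UNIQUE constraint.
conn.executemany("INSERT OR IGNORE INTO quotes VALUES (?, ?, ?)", rows)
count = conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]
print(count)  # 2 -- the duplicate ACME row was skipped
```

Keeping one connection open and batching inserts with `executemany` is also considerably faster in SQLite than opening and closing a connection per record.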
I did something similar, with similar-sized data, in another field. It depends a lot on your indexes. Ultimately, separating each large table was best (one table per file, each representing a cohesive unit, in your case one company). Plus you gain the advantage of every company table having the same name, versus x differently named tables sharing the same schema (and no sanitizing of company names is required to make new tables).
Internally, other DBMSs often keep at least one file per table in their internal structure; SQL is thus just a layer of abstraction above that. SQLite (despite its creators' claims) is meant for small projects, and querying larger data models takes more care to make it work well.
I'm a long time programmer who has little experience with DBMSs or designing databases.
I know there are similar posts regarding this, but am feeling quite discombobulated tonight.
I'm working on a project which will require that I store large reports, multiple times per day, and have not dealt with storage or tables of this magnitude. Allow me to frame my problem in a generic way:
The process:
A script collects roughly 300 rows of information, set A, 2-3 times per day.
The structure of these rows never changes; each row contains two columns, both integers.
The script also collects roughly 100 rows of information, set B, at the same time. The structure of these rows does not change either; each row contains eight columns, all strings.
I need to store all of this data. Set A will be used frequently, and daily for analytics. Set B will be used frequently on the day that it is collected and then sparingly in the future for historical analytics. I could theoretically store each row with a timestamp for later query.
If stored linearly, with each set of data in its own table in a DBMS, the data will reach ~300k rows per year. With my limited DBMS experience, that sounds like a lot for two tables to manage.
I feel as though throwing this information into a database on each pass of the script will lead to slow read times and poor general responsiveness. For example, generating an Access database and tossing this information into two tables seems like too easy a solution.
I suppose my question is: how many rows is too many rows for a table in terms of performance? I know that it would be in very poor taste to create tables for each day or month.
Of course this only melts into my next, but similar, issue, audit logs...
300 rows about 50 times a day for 6 months is not a big load for any DB. Which DB are you going to use? Most will handle this very easily. There are a couple of techniques for handling data fragmentation if the row count exceeds a few hundred million per table, but with effective indexing and cleaning you can achieve the performance you desire. I myself deal with heavy tables of more than 200 million rows every week.
Make sure you have indexes in place for the queries you will issue to fetch that data. Whatever you have in the WHERE clause should have an appropriate index in the DB for it.
If your row counts per table exceed many millions, you should look at partitioning the tables. DBs actually store data in the filesystem as files, so partitioning helps by splitting the data into smaller groups of files based on some predicate, e.g. a date or some unique column. You would still see it as a single table, but on the file system the DB would store the data in different file groups.
Then you can also try table sharding, which is actually what you mentioned: different tables based on some predicate like date.
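A hand-rolled sketch of that date-based sharding idea, using Python and SQLite (table and column names are invented; a production MySQL setup would usually prefer native `PARTITION BY` over doing this by hand):

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")

def shard_for(day: date) -> str:
    # One table per month; the date is the sharding predicate.
    return f"events_{day.year}_{day.month:02d}"

def insert(day: date, a: int, b: int) -> None:
    table = shard_for(day)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (day TEXT, a INTEGER, b INTEGER)"
    )
    conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?)", (day.isoformat(), a, b))

insert(date(2023, 1, 15), 1, 2)
insert(date(2023, 2, 1), 3, 4)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['events_2023_01', 'events_2023_02']
```

The downside, as noted elsewhere in this thread, is that every query now has to know which shard(s) to hit, which is exactly the bookkeeping that native partitioning hides from you.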
Hope this helps.
You are overthinking this. 300k rows is not significant; just about any relational or NoSQL database will have no problems with it.
Your design sounds fine; however, I highly advise that you use the database's facility to add a primary key to each row, using whatever mechanism is available to you. Typically this means AUTO_INCREMENT or a sequence, depending on the database. If you use a NoSQL store like MongoDB, it will add an id for you. Relational theory depends on having a primary key, and it's often helpful to have one for diagnostics.
So your basic design would be:
Table A tableA_id | A | B | CreatedOn
Table B tableB_id | columns… | CreatedOn
The CreatedOn will facilitate date range queries that limit data for summarization purposes and allow you to GROUP BY on date boundaries (Days, Weeks, Months, Years).
Make sure you have an index on CreatedOn, if you will be doing this type of grouping.
Also, use the smallest data types you can for any of the columns. For example, if the range of the integers falls below a particular limit, or is non-negative, you can usually choose a datatype that will reduce the amount of storage required.
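A minimal SQLite sketch of the Table A design above (names are illustrative; SQLite's `INTEGER PRIMARY KEY AUTOINCREMENT` plays the AUTO_INCREMENT role, and the CreatedOn column drives the daily grouping):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "Table A": auto-assigned primary key, the two integer columns, and CreatedOn.
conn.execute("""
    CREATE TABLE tableA (
        tableA_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        a          INTEGER,
        b          INTEGER,
        created_on TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
# Index CreatedOn so the GROUP BY / date-range queries stay fast.
conn.execute("CREATE INDEX idx_tableA_created ON tableA (created_on)")

conn.executemany(
    "INSERT INTO tableA (a, b, created_on) VALUES (?, ?, ?)",
    [(1, 2, "2023-01-01"), (3, 4, "2023-01-01"), (5, 6, "2023-01-02")],
)
# Summarize on day boundaries, as described above.
daily = conn.execute(
    "SELECT created_on, SUM(a) FROM tableA GROUP BY created_on ORDER BY created_on"
).fetchall()
print(daily)  # [('2023-01-01', 4), ('2023-01-02', 5)]
```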
We are developing an application that processes some codes and outputs a large number of rows (millions!) each time. We want to save these rows in a database because the processing itself may take a couple of hours to complete.
1. What is the best way to save these records?
2. Is a NoSQL solution usable here?
Assume that we are saving five million records per day, and may be retrieving from it once in a while.
It depends very much on how you intend to use the data after it is generated. If you will only be looking it up by primary key then NoSQL will probably be fine, but if you ever want to search or sort the data (or join rows together) then an SQL database will probably work better.
Basically, NoSQL is really good at stuffing opaque data into a store and retrieving any individual item very quickly. Relational databases are really good at indexing data that may be joined together or searched.
Any modern SQL database will easily handle 5 million rows per day - disk space is more likely to be your bottleneck, depending on how big your rows are. I haven't done a lot with NoSQL, but I'd be surprised if 5 million items per day would cause a problem.
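As a back-of-envelope check on that disk-space point (the bytes-per-row figure below is an assumption; measure your actual rows, including index overhead):

```python
# Rough yearly storage estimate under an assumed average row width.
rows_per_day = 5_000_000
bytes_per_row = 100  # assumption: ~100 bytes per row including index overhead

per_year_gb = rows_per_day * 365 * bytes_per_row / 1e9
print(f"~{per_year_gb:.0f} GB per year")
```

So at 5 million rows a day you are in the high tens to hundreds of GB per year depending on row width, which is well within what a single machine can hold, but worth planning for.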
It depends on exactly what kind of data you want to store - could you elaborate on that? If the data is neatly structured into tables then you don't necessarily need a NoSQL approach. If, however, your data has a graph or network-like structure to it, then you should consider a NoSQL solution. If the latter is true for you, then maybe the following will be helpful to give you an overview of some of the NoSQL databases: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
I'm currently using a MySQL table for an online game under LAMP.
One of the tables is huge (soon millions of rows) and contains only integers (IDs, timestamps, booleans, scores).
I did everything to never have to JOIN on this table. However, I'm worried about the scalability. I'm thinking about moving this single table to another faster database system.
I use intermediary tables to calculate the scores, but in some cases I have to use SUM() or AVG() directly on some filtered row sets of this table.
For you, what is the best database choice for this table?
My requirements/specs:
This table contains only integers (around 15 columns)
I need to filter by certain columns
I'd like to have UNIQUE KEYS
It would be nice to have "INSERT ... ON DUPLICATE KEY UPDATE", but I suppose my scripts can manage it by themselves.
I have to use SUM() or AVG().
thanks
Just make sure you have the correct indexes in place, and selecting should be quick.
Millions of rows in a table isn't huge. You shouldn't expect any problems selecting, filtering, or upserting data if you index on relevant keys, as @Tom-Squires suggests.
Aggregate queries (SUM and AVG) may pose a problem, though. The reason is that they require a full table scan and thus multiple fetches of data from disk to memory. A couple of methods to increase their speed:
If your data changes infrequently then caching those query results in your code is probably a good solution.
If it changes frequently, then the quickest way to improve performance is probably to ensure that your database engine keeps the table in memory. A quick calculation of expected size: 15 columns × 8 bytes × a few million rows ≈ hundreds of MB, which is not really an issue (unless you're on a shared host). If your RDBMS does not support pinning a specific table in memory, then simply put it in a different database schema; that shouldn't be a problem, since you're not doing any joins on this table. Most engines will allow you to tune that.
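For the caching option above, a minimal sketch (SQLite stand-in with invented table names): cache each aggregate result in your code and invalidate the cached value on every write that could change it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player_id INTEGER, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [(1, 10), (1, 20), (2, 30)])

_cache = {}

def avg_score(player_id):
    # Recompute the aggregate only on a cache miss.
    if player_id not in _cache:
        _cache[player_id] = conn.execute(
            "SELECT AVG(score) FROM scores WHERE player_id = ?",
            (player_id,)).fetchone()[0]
    return _cache[player_id]

def add_score(player_id, score):
    conn.execute("INSERT INTO scores VALUES (?, ?)", (player_id, score))
    _cache.pop(player_id, None)  # invalidate so the next read recomputes

print(avg_score(1))  # 15.0 -- computed once, then served from the cache
```

This works well when writes are infrequent relative to reads; if writes are constant, the cache is invalidated so often that you fall back to the keep-it-in-memory approach.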