Creating Multiple Tables Dynamically in PostgreSQL - database

I'm seeking for a solution for a time-series database that will hold measurement data of IoT devices used by multiple tenants.
Each tenant have their own device types that sends different measurement data (for example pressure, temperature, etc.)
After some research I arrived an SQL solution with TimeScaleDB for postgreSQL and dynamically creating a new SQL table for each device type - each with it's own set of columns.
Assuming I might have 1000 tenants, each with a few device types, is it a good design to have a database with thousands of tables?
As far as I understand, postgreSQL has a limitation of 8KB page size - so even in minimal cases where only 1 device is sending data for a month, I will most probably exceed that - which is a good thing for the multi-table design.
Thanks.

Related

Multi Tenancy in ClickHouse

A lot of people don't want to use ClickHouse just to do analytics for their company or project. They want to use it as the backbone for SaaS data/analytics projects. Which most of the time would require supporting semi-structured json data, which could result in creating a lot of columns for each user you have.
Now, some experinced ClickHouse users say less tables means more performance. So having a seperate table for each user is not an option.
Also, having the data of too many users into the same database will result in a very large number of columns, which some experiments say it could make CH unresponsive.
So what about something like 20 users per database having each user limited to 50 columns.
But what if you got thousands of users? Should you create thousands of databases?
What is the best solution possible to this problem?
Note: In our case, data isolation is not an issue, we are solving it on the application level.
There is no difference between 1000 tables in a single database and 1000 databases with a single table.
There is ALMOST no difference between 1000 tables and a table with *1000 partitions partition by (tenant_id, .some_expression_from_datetime.)
The problem is in overhead from MergeTree and ReplicatedMergeTree Engines. And is in number of files you need to create / read (data locality problem, not related to files, will be the same without a filesystem).
If you have 1000 tenants, the only way is to use order by (tenant_id,..) + restrictions using row policies or on application level.
I have an experience with customers who have 700 Replicated tables -- it's constant straggle with the replication, need to adjust background pools, the problem with Zookeeper (huge DB size, enormous network traffic between CH and ZK). Clickhouse starts for 4 hours because it needs to read metadata from all 1000000 parts. Partition pruning works slower because it iterates through all parts during query analysis for every query.
The source of the issue is the original design, they had about 3 tables in metrika i guess.
Check this for example https://github.com/ClickHouse/ClickHouse/issues/31919

Loading data from SQL Server to Elasticsearch

Looking for suggesting on loading data from SQL Server into Elasticsearch or any other data store. The goal is to have transactional data available in real time for Reporting.
We currently use a 3rd party tool, in addition to SSRS, for data analytics. The data transfer is done using daily batch jobs and as a result, there is a 24 hour data latency.
We are looking to build something out that would allow for more real time availability of the data, similar to SSRS, for our Clients to report on. We need to ensure that this does not have an impact on our SQL Server database.
My initial thought was to do a full dump of the data, during the weekend, and writes, in real time, during weekdays.
Thanks.
ElasticSearch's main use cases are for providing search type capabilities on top of unstructured large text based data. For example, if you were ingesting large batches of emails into your data store every day, ElasticSearch is a good tool to parse out pieces of those emails based on rules you setup with it to enable searching (and to some degree querying) capability of those email messages.
If your data is already in SQL Server, it sounds like it's structured already and therefore there's not much gained from ElasticSearch in terms of reportability and availability. Rather you'd likely be introducing extra complexity to your data workflow.
If you have structured data in SQL Server already, and you are experiencing issues with reporting directly off of it, you should look to building a data warehouse instead to handle your reporting. SQL Server comes with a number of features out of the box to help you replicate your data for this very purpose. The three main features to accomplish this that you could look into are AlwaysOn Availability Groups, Replication, or SSIS.
Each option above (in addition to other out-of-the-box features of SQL Server) have different pros and drawbacks. For example, AlwaysOn Availability Groups are very easy to setup and offer the ability to automatically failover if your main server had an outage, but they clone the entire database to a replica. Replication let's you more granularly choose to only copy specific Tables and Views, but then you can't as easily failover if your main server has an outage. So you should read up on all three options and understand their differences.
Additionally, if you're having specific performance problems trying to report off of the main database, you may want to dig into the root cause of those problems first before looking into replicating your data as a solution for reporting (although it's a fairly common solution). You may find that a simple architectural change like using a columnstore index on the correct Table will improve your reporting capabilities immensely.
I've been down both pathways of implementing ElasticSearch and a data warehouse using all three of the main data synchronization features above, for structured data and unstructured large text data, and have experienced the proper use cases for both. One data warehouse I've managed in the past had Tables with billions of rows in it (each Table terabytes big), and it was highly performant for reporting off of on fairly modest hardware in AWS (we weren't even using Redshift).

SQL server scalability question

We are trying to build an application which will have to store billions of records. 1 trillion+
a single record will contain text data and meta data about the text document.
pl help me understand about the storage limitations. can a databse SQL or oracle support this much data or i have to look for some other filesystem based solution ? What are my options ?
Since the central server has to handle incoming load from many clients, how will parallel insertions and search scale ? how to distribute data over multiple databases or tables ? I am little green to database specifics for such scaled environment.
initally to fill the database the insert load will be high, later as the database grows, search load will increase and inserts will reduce.
the total size of data will cross 1000 TB.
thanks.
1 trillion+
a single record will contain text data
and meta data about the text document.
pl help me understand about the
storage limitations
I hope you have a BIG budget for hardware. This is big as in "millions".
A trillion documents, at 1024 bytes total storage per document (VERY unlikely to be realistic when you say text) is a size of about 950 terabyte of data. Storage limitations means you talk high end SAN here. Using a non-redundant setup of 2tb discs that is 450 discs. Make the maths. Adding redundancy / raid to that and you talk major hardware invesment. An this assumes only 1kb per document. If you have on average 16kg data usage, this is... 7200 2tb discs.
THat is a hardware problem to start with. SQL Server does not scale so high, and you can not do that in a single system anyway. The normal approach for a docuemnt store like this would be a clustered storage system (clustered or somehow distributed file system) plus a central database for the keywords / tagging. Depending on load / inserts possibly with replciations of hte database for distributed search.
Whatever it is going to be, the storage / backup requiments are terrific. Lagre project here, large budget.
IO load is gong to be another issue - hardware wise. You will need a large machine and get a TON of IO bandwidth into it. I have seen 8gb links overloaded on a SQL Server (fed by a HP eva with 190 discs) and I can imagine you will run something similar. You will want hardware with as much ram as technically possible, regardless of the price - unless you store the blobs outside.
SQL row compression may come in VERY handy. Full text search will be a problem.
the total size of data will cross 1000
TB.
No. Seriously. It will be a bigger, I think. 1000tb would assume the documents are small - like the XML form of a travel ticket.
According to the MSDN page on SQL Server limitations, it can accommodate 524,272 terabytes in a single database - although it can only accommodate 16TB per file, so for 1000TB, you'd be looking to implement partitioning. If the files themselves are large, and just going to be treated as blobs of binary, you might also want to look at FILESTREAM, which does actually keep the files on the file system, but maintains SQL Server notions such as Transactions, Backup, etc.
All of the above is for SQL Server. Other products (such as Oracle) should offer similar facilities, but I couldn't list them.
In the SQL Server space you may want to take a look at SQL Server Parallel Data Warehouse, which is designed for 100s TB / Petabyte applications. Teradata, Oracle Exadata, Greenplum, etc also ought to be on your list. In any case you will be needing some expert help to choose and design the solution so you should ask that person the question you are asking here.
When it comes to database its quite tricky and there can be multiple components involved to get performance like Redis Cache, Sharding, Read replicas etc.
Bellow post describes simplified DB scalability.
http://www.cloudometry.in/2015/09/relational-database-scalability-options.html

What is the best database technology for storing OHLC historical prices?

Just for End-Of-Day data there will be billions of rows. What is the best way to store all that data. Is SQL Server 2008 good enough for that or should I look towards NoSQL solution, like MongoDB. Any suggestions?
That would be cool to have one master db with read/write permissions and one ore more replications of it for read only operations. Only master database will be used for adding new prices into the storage. Also that would be cool to be able replicate OHLC prices for most popular securities individually in order to optimize read access.
This data then will be streamed to a trading platform on clients' machines.
You should consider Oracle Berkeley DB which is in production doing this within the infrastructure of a few well known stock exchanges. Berkeley DB will allow you to record information at a master as simple key/value pairs, in your case I'd imagine a timestamp for the key and an encoded OHLC set for the value. Berkeley DB supports single master multi-replica replication (called "HA" for High Availability) to support exactly what you've outlined - read scalability. Berkeley DB HA will automatically fail-over to a new master if/when necessary. Using some simple compression and other basic features of Berkeley DB you'll be able to meet your scalability and data volume targets (billions of rows, tens of thousands of transactions per second - depending on your hardware, OS, and configuration of BDB - see the 3n+1 benchmark with BDB for help) without issue.
When you start working on accessing that OHLC data consider Berkeley DB's support for bulk-get and make sure you use the B-Tree access method (because your data has order and locality will provide much faster access). Also consider the Berkeley DB partitioning API to split your data (perhaps based on symbol or even based on time). Finally, because you'll be replicating the data you can relax the durability constraints to DB_TXN_WRITE_NOSYNC as long as your replication acknowledgement policy is requires a quorum of replicas ACK a write before considering it durable. You'll find that a fast network beats a fast disk in this case. Also, to offload some work from your master, enable peer-to-peer log replica distribution.
But, first read the replication manager getting started guide and review the rep quote example - which already implements some of what you're trying to do (handy, eh?).
Just for the record, full disclosure I work as a product manager at Oracle on Berkeley DB products. I have for the past nine years, so I'm a tad biased. I'd guess that the other solutions - SQL based or not - might eventually give you a working system, but I'm confident that Berkeley DB can without too much effort.
If you're really talking billions of new rows a day (Federal Express' data warehouse isn't that large), then you need an SQL database that can partition across multiple computers, like Oracle or IBM's DB2.
Another alternative would be a heavy-duty system managed storage like IBM's DFSMS.

A huge data storage problem

I'm starting to design a new application that will be used by about 50000 devices. Each device generates about 1440 registries a day, this means that will be stored over 72 million of registries per day. These registries keep coming every minute, and I must be able to query this data by a Java application (J2EE). So it need to be fast to write, fast to read and indexed to allow report generation.
Devices only insert data and the J2EE application will need to read then occasionally.
Now I'm looking to software alternatives to support this kind of operation.
Putting this data on a single table would lead to a catastrophic condition, because I won't be able to use this data due to its amount of data stored over a year.
I'm using Postgres, and database partitioning seems not to be a answer, since I'd need to partition tables by month, or may be more granular approach, days for example.
I was thinking on a solution using SQLite. Each device would have its own SQLite database, than the information would be granular enough for good maintenance and fast insertions and queries.
What do you think?
Record only changes of device positions - most of the time any device will not move - a car will be parked, a person will sit or sleep, a phone will be on unmoving person or charged etc. - this would make you an order of magnitude less data to store.
You'll be generating at most about 1TB a year (even when not implementing point 1), which is not a very big amount of data. This means about 30MB/s of data, which single SATA drive can handle.
Even a simple unpartitioned Postgres database on not too big hardware should manage to handle this. The only problem could be when you'll need to query or backup - this can be resolved by using a Hot Standby mirror using Streaming Replication - this is a new feature in soon to be released PostgreSQL 9.0. Just query against / backup a mirror - if it is busy it will temporarily and automatically queue changes, and catch up later.
When you really need to partition do it for example on device_id modulo 256 instead of time. This way you'd have writes spread out on every partition. If you partition on time just one partition will be very busy on any moment and others will be idle. Postgres supports partitioning this way very well. You can then also spread load to several storage devices using tablespaces, which are also well supported in Postgres.
Time-interval partitioning is a very good solution, even if you have to roll your own. Maintaining separate connections to 50,000 SQLite databases is much less practical than a single Postgres database, even for millions of inserts a day.
Depending on the kind of queries that you need to run against your dataset, you might consider partitioning your remote devices across several servers, and then query those servers to write aggregate data to a backend server.
The key to high-volume tables is: minimize the amount of data you write and the number of indexes that have to be updated; don't do UPDATEs or DELETEs, only INSERTS (and use partitioning for data that you will delete in the future—DROP TABLE is much faster than DELETE FROM TABLE!).
Table design and query optimization becomes very database-specific as you start to challenge the database engine. Consider hiring a Postgres expert to at least consult on your design.
Maybe it is time for a db that you can shard over many machines? Cassandra? Redis? Don't limit yourself to sql db's.
Database partition management can be automated; time-based partitioning of the data is a standard way of dealihg with this type of problem, and I'm not sure that I can see any reason why this can't be done with PostgreSQL.
You have approximately 72m rows per day - assuming a device ID, datestamp and two floats for coordinates you will have (say) 16-20 bytes per row plus some minor page metadata overhead. A back-of-fag-packet capacity plan suggests around 1-1.5GB of data per day, or 400-500GB per year, plus indexes if necessary.
If you can live with periodically refreshed data (i.e. not completely up to date) you could build a separate reporting table and periodically update this with an ETL process. If this table is stored on separate physical disk volumes it can be queried without significantly affecting the performance of your transactional data.
A separate reporting database for historical data would also allow you to prune your operational table by dropping older partitions, which would probably help with application performance. You could also index the reporting tables and create summary tables to optimise reporting performance.
If you need low latency data (i.e. reporting on up-to-date data), it may also be possible to build a view where the lead partitions are reported off the operational system and the historical data is reported from the data mart. This would allow the bulk queries to take place on reporting tables optimised for this, while relatively small volumes of current data can be read directly from the operational system.
Most low-latency reporting systems use some variation of this approach - a leading partition can be updated by a real-time process (perhaps triggers) and contains relatively little data, so it can be queried quickly, but contains no baggage that slows down the update. The rest of the historical data can be heavily indexed for reporting. Partitioning by date means that the system will automatically start populating the next partition, and a periodic process can move, re-index or do whatever needs to be done for the historical data to optimise it for reporting.
Note: If your budget runs to PostgreSQL rather than Oracle, you will probably find that direct-attach storage is appreciably faster than a SAN unless you want to spend a lot of money on SAN hardware.
That is a bit of a vague question you are asking. And I think you are not facing a choice of database software, but an architectural problem.
Some considerations:
How reliable are the devices, and how
well are they connected to the
querying software?
How failsafe do
you need the storage to be?
How much extra processing power do the devices
have to process your queries?
Basically, your idea of a spatial partitioning is a good idea. That does not exclude a temporal partition, if necessary. Whether you do that in postgres or sqlite depends on other factors, like the processing power and available libraries.
Another consideration would be whether your devices are reliable and powerful enough to handle your queries. Otherwise, you might want to work with a centralized cluster of databases instead, which you can still query in parallel.

Resources