Warehouse PostgreSQL database architecture recommendation - database

Background:
I am developing an application that allows users to generate lots of different reports. The data is stored in PostgreSQL and has natural unique group key, so that the data with one group key is totally independent from the data with others group key. Reports are built only using 1 group key at a time, so all of the queries uses "WHERE groupKey = X;" clause. The data in PostgreSQL updates intensively via parallel processes which adds data into different groups, but I don't need a realtime report. The one update per 30 minutes is fine.
Problem:
There are about 4 gigs of data already and I found that some reports takes significant time to generate (up to 15 seconds), because they need to query not a single table but 3-4 of them.
What I want to do is to reduce the time it takes to create a report without significantly changing the technologies or schemes of the solution.
Possible solutions
What I was thinking about this is:
Splitting one database into several databases for 1 database per each group key. Then I will get rid of WHERE groupKey = X (though I have index on that column in each table) and the number of rows to process each time would be significantly less.
Creating the slave database for reads only. Then I will have to sync the data with replication mechanism of PostgreSQL for example once per 15 minutes (Can I actually do that? Or I have to write custom code)
I don't want to change the database to NoSQL because I will have to rewrite all sql queries and I don't want to. I might switch to another SQL database with column store support if it is free and runs on Windows (sorry, don't have Linux server but might have one if I have to).
Your ideas
What would you recommend as the first simple steps?

Two thoughts immediately come to mind for reporting:
1). Set up some summary (aka "aggregate") tables that are precomputed results of the queries that your users are likely to run. Eg. A table containing the counts and sums grouped by the various dimensions. This can be an automated process -- a db function (or script) gets run via your job scheduler of choice -- that refreshes the data every N minutes.
2). Regarding replication, if you are using Streaming Replication (PostgreSQL 9+), the changes in the master db are replicated to the slave databases (hot standby = read only) for reporting.

Tune the report query. Use explain. Avoid procedure when you could do it in pure sql.
Tune the server; memory, disk, processor. Take a look at server config.
Upgrade postgres version.
Do vacuum.
Out of 4, only 1 will require significant changes in the application.

Related

Saving Large Temp Tables into Perm tables, SQL Server

Am writing some processes to pre-format certain data for another downstream process to consume. The pre-formatting essentially involves gathering data from several permanent tables in one DB, applying some logic, and saving the results into another DB.
The problem i am running into is the volume of data. the resulting data set that i need to commit has about 132.5million rows. The commit itself takes almost 2 hours. I can cut that by changing the logging to simple, but it's still quite substantial (seeing as the generating of the 132.5 million rows into a temp table only takes 9 mins).
I have been reading on best methods to migrate large data, but most of the solutions implicitly assumes that the source data already resides in a single file/data table (which is not the case here). Some solutions like using SSMS task option makes it difficult to embed some of the logic applications that i need.
Am wondering if anyone here has some solutions.
Assuming you're on SQL Server 2014 or later the temp table is not flushed to disk immediately. So the difference is probably just disk speed.
Try making the target table a Clustered Columnstore to optimize for compression and minimize IO.

Can I use Hadoop to speed up a slow SQL stored procedure?

The problem:
I have 2 SQL Server databases from 2 different applications. They describe different aspects of industrial machines: one is about "how many consumables were spent per order", the other is about "how many good/bad production items were produced per operator". Sometimes many operators are working on 1 order one after another, sometimes one operator is working on multiple small orders, and there is no connection Order-Operator in the database.
I want to have united fact table, where for every timestamp I know MachineID, OrderID and OperatorID. If a timestamp exists in DB1, then the record will have numeric measures from it (Consumables); if it exists in DB2, then it will have numeric measures from DB2 (good/bad production items). If it exists in both databases, then it have all numeric measures. A simple UNION ALL is not enough, because I want to have MachineID, OrderID and OperatorID for every record.
I created a T-SQL stored procedure to make FULL JOIN by timestamp and MachineID. But on large data sets (multiple machines, multiple customers) it becomes very slow. Both applications support editing history, so I need to merge full history from both databases at every nightly load.
To speed up the process, I would like to put calculations into multiple parallel threads, separated by Customer, MachineID, and Year.
I tried to do it by using SQL Server stored procedures, running in parallel by SQL Agent with different parameters, but I found that it didn't help the performance. Instead it created multiple deadlocks when updating staging and final tables.
I am looking for an alternative way to resolve this problem, but I don't know what is the right tool. Can Hadoop or similar parallel processing tool help with this task?
I am looking for solution with minimal cost, because it is needed for just one specific task. For everything else, SQL Server and PowerBI reporting are working just fine for me.
Hadoop seems hard to justify in this use case, given limited scope. The thing about Hadoop is that it scales well not only due to parallel processing but thanks to parallel IO, when data is distributed across multiple servers/storage media. Unless you happy to copy all data to HDFS distributed among multiple nodes, it likely will not help much. If you want to spin up a Hadoop cluster and run multiple jobs querying single SQL server, it'll likely end up badly for the later.
Have you considered optimizations which will allow you to limit the amount of data you processing nightly?
E.g. what is 'timestamp' field? Does it reflect last update time? Can you use it to filter rows which haven't been updated since the previous run?
Even if the 'timestamp' is not the time of last updates, can you add an "updateTime" field and triggers on updates which will populate the field, so you don't need to import rows which have not changed since the previous run? If you build an index on the field, then, if the number of updates during the day is not high relative to total table size, a query with a filter on such field will hit the index, and fetching of incremental changes should be fast.
Another thing to consider - are those DBs running on the same node/SQL server? Access to remote DBs is slow, so if that's the case, think about how to fix this first.

Index/Statistics on volatile tables

One of my application has the following use-case:
user inputs some filters and conditions about orders (delivery date ranges,...) to analyze
the application compute a lot of data and save it on several support tables (potentially thousands of record for each analysis)
the application starts a report engine that use data from these tables
when exiting, the application deletes computed record from support tables
Actually I'm analyzing how to ehnance queries performance adding indexes/stastics to support tables and the SQL Profiler suggests me to create 3-4 indexes and 20-25 statistics.
The record in supports tables are costantly created and removed: it's correct to create all this indexes/statistics or there is the risk that all these data will be easily outdated (with the only result of a costant overhead for maintaining indexes/statistics)?
DB server: SQL Server 2005+
App language: C# .NET
Thanks in advance for any hints/suggestions!
First seems like a good situation for a data cube. Second, yes you should update stats before running your query once the support tables are populated. You should disable your indexes when inserting the data. Then the rebuild command will bring your indexes and stats up to date in one go. Profiler these days is usually quite good at these suggestions, but test the combinations to see what actully gives the best performance gains. To look as os cubes here What are the open source tools and techniques to build a complete data warehouse platform?

Copy Multiple Tables into ONE Table (From Multiple Databases)

I've got multiple identical databases (distributed on several servers) and need to gather them to one single point to do data mining, etc.
The idea is to take Table1, Table2, ..., TableN from each database and merge them and put the result into one single big database.
To be able to write queries, and to know from which database each row came from we will add a single column DatabaseID to target table, describing where the row came from.
Editing the source tables is not an option, it belongs to some proprietary software.
We've got ~40 servers, ~170 databases and need to copy ~40 tables.
Now, how should we implement this given that it should be:
Easy to setup
Easy to maintain
Preferably easy to adjust if database schema changes
Reliable, logging/alarm if something fails
Not too hard to add more tables to copy
We've looked into SSIS, but it seemed that we would have to add each table as a source/transformation/destination. I'm guessing it would also be quite tied to the database schema. Right?
Another option would be to use SQL Server Replication, but I don't see how to add the DatabaseID column to each table. It seems it's only possible to copy data, not modify it.
Maybe we could copy all the data into separate databases, and then to run a local job on the target server to merge the tables?
It also seems like a lot of work if we'd need to add more tables to copy, as we'd have to redistribute new publications for each database (manual work?).
Last option (?) is to write a custom application to our needs. Bigger time investment, but it'd at least do precisely what we'd like.
To make it worse... we're using Microsoft SQL Server 2000.
We will upgrade to SQL Server 2008 R2 within 6 months, but we'd like the project to be usable sooner.
Let me know what you guys think!
UPDATE 20110721
We ended up with a F# program opening a connection to the SQL Server where we would like the aggregated databases. From there we query the 40 linked SQL Servers to fetch all rows (but not all columns) from some tables, and add an extra row to each table to say which DatabaseID the row came from.
Configuration of servers to fetch from, which tables and which columns, is a combination of text file configuration and hard coded values (heh :D).
It's not super fast (sequential fetching so far) but it's absolutely manageable, and the data processing we do afterwards takes far longer time.
Future improvements could be to;
improve error handling if it turns out to be a problem (if a server isn't online, etc).
implement parallel fetching, to reduce the total amount of time to finish fetching.
figure out if it's enough to fetch only some of the rows, like only what's been added/updated.
All in all it turned out to be quite simple, no dependencies to other products, and it works well in practice.
Nothing fancy but couldn't you do something like
DROP TABLE dbo.Merged
INSERT INTO dbo.Merged
SELECT [DatabaseID] = "Database1", * FROM ServerA.dbo.Table
UNION ALL SELECT [DatabaseID] = "Database2", * FROM ServerB.dbo.Table
...
UNION ALL SELECT [DatabaseID] = "DatabaseX", * FROM ServerX.dbo.Table
Advantages
Easy to setup
Easy to maintain
Easy to adjust
Easy to add more tables
Disadvantages
Performance
Reliable logging
We had a similar requirement where we took a different approach. first created a central database to collect the data. Then we created a inventory table to store the list of target servers / databases. Then a small vb.net based CLR procedure which take the path of SQL query, target SQL Instance name and the target table which will store the data(This would eliminate the setup of linked server when new targets are added). This also adds two additional columns to the result set. The Target server name and the timestamp when the data is captured.
Then we set up a service broker queue/service and pushed list of target servers to interogate.
The above CLR procedure is wrapped in another procedure which dequeues the message, executes the SQL on the target server provided. The wrapper procedure is then configured as the activated procedure for the queue.
With this we are able to achieve a bit of parallelism to capture the data.
Advantages :
Easy to setup Easy to manage (Add / Remove targets)
Same framework works for multiple queries
Logging tables to check for failed queries.
Works independent of each target, so if one of the target fails to
respond, others still continue.
Workflow can be pause gracefully by disabling the queue (for
maintenance on central server) and then resume collection be
re-enabling it.
Disadvantage:
requires good understanding of service brokers.
should properly handle poison messages.
Please Let me know if it helps

Horizontally partitioning data into an "archive" in SQL Server taking months to execute?

There is a project in flight at my organization to move customer data and all the associated records (billing transactions, etc) from one database to another, if the customer has not had account activity within a certain timeframe.
The total number of rows in all the tables is in the millions. Perhaps 100 million rows, with all the various tables combined. The schema is more-or-less normalized. The project's designers have decided on SSIS to execute this and initial analysis is showing 5 months of execution time.
Basically, the process:
Fills an "archive" database that has the same schema as the database of origin
Delete the original rows from the source database
I can provide more detail if necessary. What I'm wondering is, is SSIS the correct approach? Is there some sort of canonical way to move very large quantities of data around? Are there common performance pitfalls to avoid?
I just can't believe that this is going to take months to run and I'd like to know if there's something else that we should be looking into.
SSIS is just a tool. You can write a 100M rows transfer in SSIS to take 24h, you can write it to take 5 mo. The problem is what you write (ie. the workflow in SSIS case), not SSIS.
There isn't anything specific to SSID that would dictate 'the transfer cannot be done faster than 5 mo'.
The guiding principles for such a task (logically partition the data, process each logical partition in parallel, eliminate access and update contention between processing, batch commit changes, don't transfer more data that is necessary on the wire, use set based processing as much as possible, be able to suspend and resume etc etc) can be implemented on SSIS just as well as any other technology (if not better).
For the record, the ETL world speed record stands at about 2TB per hour. Using SSIS. And just as a matter of fact, I just finished a transfer of 130M rows, ~200Gb of data, took some 24h (I'm lazy and not shooting for ETL record).
I would understand 5mo for development, testing and deployment, but not 5mo for actual processing. That is like 7 rows a second, and is realy realy lame.
SSIS is probably not the right choice if you are simply deleting records.
This might be of interest: Performing fast SQL Server delete operations
UPDATE: as Remus correctly points out, SSIS can perform well or badly depending on how the flows are written, and there have been some huge benchmarks (on high end systems). But for just deletes there are simply ways, such as a SQL Agent job running a TSQL delete in batches.

Resources