SQL Server to PostgreSQL - Migration and design concerns - sql-server

Currently migrating from SQL Server to PostgreSQL and attempting to improve a couple of key areas on the way:
I have an Articles table:
CREATE TABLE [dbo].[Articles](
[server_ref] [int] NOT NULL,
[article_ref] [int] NOT NULL,
[article_title] [varchar](400) NOT NULL,
[category_ref] [int] NOT NULL,
[size] [bigint] NOT NULL
)
Data (comma delimited text files) is dumped on the import server by ~500 (out of ~1000) servers on a daily basis.
Importing:
Indexes are disabled on the Articles table.
For each dumped text file:
- Data is BULK copied to a temporary table.
- The temporary table is updated.
- Old data for the server is dropped from the Articles table.
- The temporary table's data is copied into the Articles table.
- The temporary table is dropped.
Once this process is complete for all servers, the indexes are built and the new database is copied to a web server.
I am reasonably happy with this process, but there is always room for improvement as I strive for a real-time (haha!) system. Is what I am doing correct? The Articles table contains ~500 million records and is expected to grow. Searching across this table is okay but could be better, e.g.
SELECT * FROM Articles WHERE server_ref=33 AND article_title LIKE '%criteria%'
has been satisfactory, but I want to improve the speed of searching. Obviously the LIKE is my problem here. Suggestions?
SELECT * FROM Articles WHERE article_title LIKE '%criteria%'
is horrendous.
Partitioning is a feature of SQL Server Enterprise, but that means $$$, so built-in partitioning is one of the many exciting prospects of PostgreSQL. What performance hit will be incurred by the import process (drop data, insert data) and by building the indexes? Will the database grow by a huge amount?
The database currently stands at 200 GB and will grow. Copying this across the network is not ideal, but it works. I am putting thought into changing the hardware structure of the system. The thinking behind having an import server and a web server is that the import server can do the dirty work (WITHOUT indexes) while the web server (WITH indexes) presents reports. Maybe reducing the system to one server would let me skip the copy-across-the-network stage. That one server would hold two versions of the database: one with indexes for delivering reports, and one without for importing new data. The databases would swap daily. Thoughts?
This is a fantastic system, and believe it or not there is some method to my madness in giving it a big shake-up.
UPDATE: I am not looking for help with relational databases, but hoping to bounce ideas around with data warehouse experts.

I am not a data warehousing expert, but here are a couple of pointers.
It seems like your data can be easily partitioned. See the PostgreSQL documentation on partitioning for how to split the data into different physical tables. This lets you manage the data at your natural per-server granularity.
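For concreteness, a minimal sketch of what a per-server layout could look like with declarative partitioning (PostgreSQL 10+; older versions would use table inheritance instead, and the column list is taken straight from the question):

-- Parent table partitioned by the source server; one partition per server.
CREATE TABLE articles (
    server_ref    int          NOT NULL,
    article_ref   int          NOT NULL,
    article_title varchar(400) NOT NULL,
    category_ref  int          NOT NULL,
    size          bigint       NOT NULL
) PARTITION BY LIST (server_ref);

-- Example partition for server 33; repeat (or script) for the other servers.
CREATE TABLE articles_srv_33 PARTITION OF articles FOR VALUES IN (33);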
You can use PostgreSQL's transactional DDL to avoid some copying. The process will then look something like this for each input file (see the sketch after the list):
create a new table to store the data.
use COPY to bulk load data into the table.
create any necessary indexes and do any processing that is required.
In a single transaction, drop the old partition, rename the new table, and add it as a partition.
If you do it like this, you can swap out the partitions on the go if you want to. Only the last step requires locking the live table, and it's a quick DDL metadata update.
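A rough sketch of that per-file cycle, assuming the partition layout above (the file path, server number, and index choice are illustrative):

-- 1. Load and index the new data in a standalone table, outside any long transaction.
CREATE TABLE articles_srv_33_new (LIKE articles INCLUDING DEFAULTS);
COPY articles_srv_33_new FROM '/import/server_33.csv' WITH (FORMAT csv);
CREATE INDEX ON articles_srv_33_new (category_ref);
-- A matching CHECK constraint lets the later ATTACH skip its validation scan.
ALTER TABLE articles_srv_33_new ADD CONSTRAINT srv_33_chk CHECK (server_ref = 33);

-- 2. Swap it in; only this part locks the live table, and it is a metadata change.
BEGIN;
ALTER TABLE articles DETACH PARTITION articles_srv_33;
DROP TABLE articles_srv_33;
ALTER TABLE articles_srv_33_new RENAME TO articles_srv_33;
ALTER TABLE articles ATTACH PARTITION articles_srv_33 FOR VALUES IN (33);
COMMIT;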
Avoid deleting and reloading data in an indexed table - that will lead to considerable table and index bloat because of the MVCC mechanism PostgreSQL uses. If you just swap out the underlying table, you get a nice compact table and indexes. If your queries have any data locality on top of the partitioning, either order your input data on that, or, if that's not possible, use PostgreSQL's CLUSTER functionality to reorder the data physically.
To speed up the text searches, use a GIN full-text index if its constraints are acceptable (it can only search at word boundaries), or a trigram index (supplied by the pg_trgm extension module) if you need to search for arbitrary substrings.
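Hedged examples of both options against the Articles table (the index names are made up):

-- Full-text search: fast, but matches on whole words only.
CREATE INDEX articles_title_fts ON articles
    USING gin (to_tsvector('english', article_title));
-- query form: WHERE to_tsvector('english', article_title) @@ plainto_tsquery('english', 'criteria')

-- Trigram index: supports arbitrary substrings, so the existing LIKE '%criteria%' can use it.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX articles_title_trgm ON articles
    USING gin (article_title gin_trgm_ops);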

Related

Saving Large Temp Tables into Perm tables, SQL Server

I am writing some processes to pre-format certain data for another downstream process to consume. The pre-formatting essentially involves gathering data from several permanent tables in one DB, applying some logic, and saving the results into another DB.
The problem I am running into is the volume of data. The resulting data set that I need to commit has about 132.5 million rows, and the commit itself takes almost 2 hours. I can cut that by changing the logging to simple, but it's still quite substantial (seeing as generating the 132.5 million rows into a temp table only takes 9 minutes).
I have been reading up on the best methods for migrating large data sets, but most of the solutions implicitly assume that the source data already resides in a single file/data table (which is not the case here). Some solutions, like using the SSMS task option, make it difficult to embed some of the logic that I need to apply.
I am wondering if anyone here has some solutions.
Assuming you're on SQL Server 2014 or later, the temp table is not flushed to disk immediately, so the difference is probably just disk speed.
Try making the target table a clustered columnstore to optimize for compression and minimize I/O.
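A minimal sketch of that approach (SQL Server 2014+; the table and column names are illustrative):

-- Target table with a clustered columnstore index for compression and reduced I/O.
CREATE TABLE dbo.PreFormattedResults (
    result_id bigint       NOT NULL,
    payload   varchar(400) NOT NULL
);
CREATE CLUSTERED COLUMNSTORE INDEX cci_PreFormattedResults
    ON dbo.PreFormattedResults;

-- Insert from the staging temp table; TABLOCK can enable minimal logging
-- (and, on SQL Server 2016+, parallel insert) into the columnstore.
INSERT INTO dbo.PreFormattedResults WITH (TABLOCK) (result_id, payload)
SELECT result_id, payload
FROM #Staging;

Batches of at least 102,400 rows land directly in compressed rowgroups rather than going through the delta store, which is what you want at this volume.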

How do I minimize storage impact (bytes) of audit columns in database tables?

I'm a senior database developer who has always practiced creating four auditing columns on most database tables, if not all, as follows:
DATE_INSERTED
USER_INSERTED
DATE_UPDATED
USER_UPDATED
The reason I wish to capture this information is NOT to comply with some external auditing requirement like Sarbanes-Oxley. It's simply for troubleshooting purposes: when Development is asked to investigate some data scenario in the database, knowing who originally inserted a record and when, as well as who last updated it and when, strongly aids the troubleshooting effort. Storing every version/state of a record that ever existed is probably overkill, but might be useful in some cases.
I'm new to both Azure SQL Database (essentially SQL Server) and its memory-optimized tables (in-memory tables), which I'm planning to implement. The number of bytes you can store in memory database-wide is limited, so I'm being very conscious of limiting the number of bytes I put into memory, especially for large tables. These four auditing columns will eat up a significant number of in-memory bytes if I add them to large in-memory tables, and I'd love to avoid that. The audit columns will never be queried or displayed in app code or reports, but they will likely be populated by a combination of column default values, triggers, and app code.
My question is whether there's a good data modeling strategy to keep these four columns out of memory while keeping the remainder of the table in memory. The worst-case scenario seems to be creating the main in-memory table, then creating a separate on-disk table with the 4 audit columns and a UNIQUE FOREIGN KEY column pointing to the in-memory table (thereby creating a 1-to-1 FK instead of 1-to-many). But I was hoping there was a more elegant approach to accomplish this in-memory/on-disk split, perhaps leveraging some SQL Server feature I haven't been able to find. As a bonus, it'd be nice if the main table and the audit columns appeared as a single table to SQL without having to implement a database view.
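For concreteness, the worst-case split I have in mind would look roughly like this (the names are illustrative, and as far as I know a declared foreign key between a memory-optimized and a disk-based table isn't allowed, so the 1-to-1 link would have to be enforced by triggers or app code):

-- Hot data stays in the memory-optimized table.
CREATE TABLE dbo.Orders (
    order_id int         NOT NULL PRIMARY KEY NONCLUSTERED,
    customer varchar(50) NOT NULL
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

-- Audit columns live in a plain on-disk table, keyed 1-to-1 on the same id.
CREATE TABLE dbo.Orders_Audit (
    order_id      int       NOT NULL PRIMARY KEY,
    DATE_INSERTED datetime2 NOT NULL DEFAULT SYSUTCDATETIME(),
    USER_INSERTED sysname   NOT NULL DEFAULT ORIGINAL_LOGIN(),
    DATE_UPDATED  datetime2 NULL,
    USER_UPDATED  sysname   NULL
);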
Thanks in advance for any suggestions!

Database tables optimized for both read and write

We have a web service that pumps data into 3 database tables and a web application that reads that data in aggregated format in a SQL Server + ASP.Net environment.
There is so much data arriving in the database tables, and so much data being read from them at such high velocity, that the system has started to fail.
The tables have indexes on them, one of them is unique. One of the tables has billions of records and occupies a few hundred gigabytes of disk space; the other table is a smaller one, with only a few million records. It is emptied daily.
What options do I have to eliminate the obvious problem of simultaneously reading from and writing to multiple database tables?
I am interested in every optimization trick, although we have tried every trick we came across.
We don't have the option to install SQL Server Enterprise edition to be able to use partitions and in-memory-optimized tables.
Edit:
The system is used to collect fitness tracker data from tens of thousands of devices and to display data to thousands of them on their dashboard in real-time.
The requirements and specifics are way too broad to give a concrete answer. But a suggestion would be to set up a second database and do log shipping over to it, so the original db would be the "write" database and the new db would be the "read" database (see the sketch after this list).
Cons
- Disk space
- The read db would be out of date by the length of time the log transfer takes
Pros
- You could possibly drop some of the indexes on the "write" db, which would/could increase performance
- You could then summarize the tables in the "read" database in order to increase query performance
https://msdn.microsoft.com/en-us/library/ms187103.aspx
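As a rough illustration, the read side of such a setup keeps applying the shipped log backups WITH STANDBY so the database stays readable between restores (the database name and paths below are made up):

-- On the "read" server: apply the shipped log backup but leave the database queryable.
RESTORE LOG FitnessData
FROM DISK = N'\\backupshare\FitnessData_0100.trn'
WITH STANDBY = N'D:\standby\FitnessData_undo.bak';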
Here are some ideas, some more complicated than others; their usefulness depends really heavily on the usage, which isn't fully described in the question. Disclaimer: I am not a DBA, but I have worked with some great ones on my DB projects.
[Simple] More system memory always helps
[Simple] Use multiple files for tempdb (one filegroup, one file for each core on your system; even if a query is being done entirely in memory, it can still block on the number of I/O threads).
[Simple] Keep the transaction log on SIMPLE rather than FULL recovery.
[Simple] Write the transaction log to a separate spindle from the rest of the data.
[Complicated] Split your data into separate tables yourself, then union them in your queries.
[Complicated] Try and put data which is not updated into a separate table so static data indices don't need to be rebuilt.
[Complicated] If possible, make sure you are doing append-only inserts (auto-incrementing PK/clustered index should already be doing this). Avoid updates if possible, obviously.
[Complicated] If queries don't need the absolute latest data, change read queries to use WITH (NOLOCK) on tables and remove row and page locks from indices. You won't get incomplete rows, but you might miss a few rows if they are being written at the same time you are reading.
[Complicated] Create separate filegroups for table data and index data. Place those filegroups on separate disk spindles if possible. SQL Server has separate I/O threads for each file so you can parallelize reads/writes to a certain extent.
Also, make sure all of your large tables are in separate filegroups, on different spindles as well.
[Complicated] Remove transactional locks from inserts where possible.
[Complicated] Use bulk-insert for data
[Complicated] Remove unnecessary indices
Prefer included columns over indexed (key) columns if sorting isn't required on them (see the example below).
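For the last point, a hedged example (the table and columns are made up to fit the fitness-tracker scenario):

-- The key columns support seeks and ordering on (device_id, reading_time); the
-- measures are carried as included columns, so the query is covered without
-- widening the index key.
CREATE NONCLUSTERED INDEX ix_Readings_Device_Time
    ON dbo.Readings (device_id, reading_time)
    INCLUDE (heart_rate, step_count);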
That's kind of a generic list of things I've done in the past on various DB projects I've worked on. Database optimizations tend to be highly specific to your situation... which is why DBAs have jobs. Some of the 'complicated' items could be simple if your architecture already supports them.

Index/Statistics on volatile tables

One of my application has the following use-case:
user inputs some filters and conditions about orders (delivery date ranges,...) to analyze
the application computes a lot of data and saves it in several support tables (potentially thousands of records for each analysis)
the application starts a report engine that uses data from these tables
when exiting, the application deletes the computed records from the support tables
I'm currently analyzing how to enhance query performance by adding indexes/statistics to the support tables, and SQL Profiler suggests that I create 3-4 indexes and 20-25 statistics.
The records in the support tables are constantly created and removed: is it correct to create all these indexes/statistics, or is there a risk that they will quickly become outdated (with the only result being a constant overhead for maintaining the indexes/statistics)?
DB server: SQL Server 2005+
App language: C# .NET
Thanks in advance for any hints/suggestions!
First, this seems like a good situation for a data cube. Second, yes, you should update stats once the support tables are populated and before running your query. You should disable your indexes when inserting the data; the rebuild command will then bring your indexes and stats up to date in one go (see the sketch below). Profiler these days is usually quite good at these suggestions, but test the combinations to see what actually gives the best performance gains. For open source cubes, have a look at What are the open source tools and techniques to build a complete data warehouse platform?
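A hedged sketch of that load cycle on one support table (the table and index names are made up; note that only nonclustered indexes should be disabled, since a disabled clustered index makes the table inaccessible):

-- Before the bulk insert: disable the nonclustered indexes on the support table.
ALTER INDEX ix_Support_DeliveryDate ON dbo.AnalysisSupport DISABLE;

-- ... bulk load the computed rows here ...

-- After the load: REBUILD re-enables the index and refreshes its statistics.
ALTER INDEX ix_Support_DeliveryDate ON dbo.AnalysisSupport REBUILD;

-- Refresh any column statistics Profiler suggested as well.
UPDATE STATISTICS dbo.AnalysisSupport WITH FULLSCAN;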

drop index at partition level

Do you know if there's any way of doing this in SQL Server (2008)?
I'm working on a DataWarehouse loading process, so what I want to do is drop the indexes of the partition being loaded so I can perform a quick bulk load, and then rebuild the index at partition level.
I think that in Oracle it's possible to achieve this, but maybe not in SQL Server.
thanks,
Victor
No, you can't drop a table's indexes for just a single partition. However, SQL 2008 provides a methodology for bulk loading that involves setting up a second table with exactly the same schema on a separate partition on the same filegroup, loading it, indexing it in precisely the same way, then "switching" your new partition with an existing, empty partition on the production table.
This is a highly simplified description, though. Here's the MSDN article for SQL 2008 on implementing this:
http://msdn.microsoft.com/en-us/library/ms191160.aspx
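In outline, the switch step looks something like this (the table names and partition number are illustrative; the staging table has to match the target's schema, indexes, constraints, and filegroup, and the target partition must be empty):

-- Metadata-only operation: the loaded, indexed staging table becomes partition 4
-- of the production fact table.
ALTER TABLE dbo.FactArticles_Staging
    SWITCH TO dbo.FactArticles PARTITION 4;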
I know it wasn't possible in SQL 2005. I haven't heard of anything that would let you do this in 2008, but it could be there (I've read about it but have not yet used it). The closest I could get was disabling the index, but if you disable a clustered index you can no longer access the table. Not all that useful, imho.
My solution for our Warehouse ETL project was to create a table listing all the indexes and indexing constraints (PKs, UQs). During ETL, we walk through the table (for the desired set of tables being loaded), drop the indexes/indexing constraints, load the data, then walk through the table again and recreate the indexes/constraints. Kind of ugly and a bit awkward, but once up and running it won't break--and has the added advantage of freshly built indexes (i.e. no fragmentation, and fillfactor can be 100). Adding/modifying/dropping indexes is also awkward, but not all that hard.
You could do it dynamically--read and store the indexes/constraints definitions from the target table, drop them, load data, then dynamically build and run the (re)create scripts from your stored data. But, if something crashes during the run, you are so dead. (That's why I settled on permanent tables.)
I find this to work very well with table partitioning, since you do all the work on "Loading" tables, and the live (dbo, for us) tables are untouched.
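Roughly, the metadata table looks something like this (the column names here are illustrative, not our exact schema):

-- One row per index or indexing constraint on each table being loaded.
CREATE TABLE etl.IndexDefinitions (
    table_name    sysname       NOT NULL,
    index_name    sysname       NOT NULL,
    is_constraint bit           NOT NULL,  -- 1 = PK/UNIQUE constraint, 0 = plain index
    drop_sql      nvarchar(max) NOT NULL,  -- DROP INDEX / ALTER TABLE ... DROP CONSTRAINT
    create_sql    nvarchar(max) NOT NULL   -- full CREATE INDEX / ADD CONSTRAINT statement
);
-- The ETL walks this table: run drop_sql for the tables being loaded, bulk load,
-- then run create_sql so the indexes come back unfragmented at the desired fillfactor.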
