SQL Server 2008 - multiple import processes simultaneously - database

I have a scenario where multiple users will be doing import processes, but all of them will be working for different clients.
I have one core table which gets the most hits whenever import processes run. I have 2 options now
To have one core table and do sequential imports by making a queue for the import processes.
To have 300 core tables, one for each client; this will allow the users to work on the import processes simultaneously without waiting for one another.
Can anyone suggest which one is better and why?
I am giving my requirements in more detail this time. Can you all have another look and provide your comments after going through the requirements?
The query is regarding data modeling for core functionality of my application.
I have a scenario where multiple users will be doing import processes, but all of them will be working for different clients. Also, at the same time a client's data could be shown to the user and modified/inserted too, while the import process for the same or a different client is running.
I have 2 core tables which get the most hits whenever import processes run.
I have 2 options now
1. To have 2 core tables and do sequential imports by making a queue for the import processes.
Table 1
ID
ClientID
SourceID
Count
AnotherCol1
AnotherCol2
AnotherCol3
Table 2
ID
ClientID
OrderID
Count
AnotherCol4
AnotherCol5
AnotherCol6
2. To have 1000 core tables, 2 for each client (I may have a maximum of 500 clients); this will allow the users to work on the import processes simultaneously without waiting for one another.
More information about the import process:
1. These tables are not going to be used in any reporting.
2. Each import process will insert 20k-30k records (7 columns) into each of these tables, and there will be around 40-50 such imports in a day.
3. While the import process is going on, data could be retrieved from these tables by some other user and inserted or updated too.
4. These are going to be among the most heavily used tables in the application.
5. BULK INSERT will be used for insertion.
6. The clustered index is on the primary key, which is an identity column.
7. We are considering table partitioning too.
Can you please suggest which option is better and why?
Also, if you suggest going with option 2, would it not be a performance hit to create so many tables in the database? Should we create a separate database for these 1000 tables in that case?

It's not really a question with a definitive answer, as each option has its own benefits and drawbacks.
Scenario 1: Central core table
Pros: Central table, easy global modifications
Cons: Slower import, more difficult client-level modifications
Scenario 2: 300 core tables
Pros: Faster imports, easy client customization
Cons: More difficult to deploy changes against all 300 core tables, reporting which needs to touch all tables will be more complicated and probably slower as well
In the end, the answer is whatever really works for you.

Another option is a third scenario where you have a single table, but you still can do the imports in parallel by having a batch identifier in the table which stops people stepping on each other.
The main problem with having multiple people in the same table is that you cannot do things like TRUNCATE.
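For illustration only, a minimal sketch of that batch-identifier idea (the table, column, and source names here are made up, not from the original design):

CREATE TABLE dbo.CoreImportStaging (
    ID int IDENTITY(1,1) PRIMARY KEY,
    ImportBatchID uniqueidentifier NOT NULL,  -- identifies which import run owns the row
    ClientID int NOT NULL,
    SourceID int NOT NULL,
    [Count] int NOT NULL
);

-- Each import run tags its rows with its own batch ID, so parallel imports don't collide.
DECLARE @Batch uniqueidentifier = NEWID();

INSERT INTO dbo.CoreImportStaging (ImportBatchID, ClientID, SourceID, [Count])
SELECT @Batch, ClientID, SourceID, [Count]
FROM dbo.ImportFeed;          -- hypothetical source of the incoming records

-- Clean up only this batch; TRUNCATE is off the table because it would hit everyone.
DELETE FROM dbo.CoreImportStaging WHERE ImportBatchID = @Batch;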
For me the decision would relate to where the data eventually goes. Is this just a staging table for convenience because some SQL transform or lookup will be run against it after the load? Would it be possible to make such tables in a separate database or schema, with unique names, so they are easy to clean up without interfering with or bloating the transaction log in your primary database? Do you need to insert in bulk, then apply indexes, and eventually drop the table? Is such a table even necessary if you are using SSIS to load the data? You can often do a lot of work in the pipeline without needing a staging table.
All these things would play into my decision making process on the architecture.

Related

How do you optimize query performance on marketplace shared views in Snowflake? Note that incremental materialization probably would not work

I am working with data from the Snowflake Marketplace; these are large, multi-billion-record tables.
I have two conflicting needs: speed & up-to-date data
I am able to have up-to-date data by working exclusively with views - meaning the data is up to date from the vendor's perspective at the moment I make a query. However, performance is terrible (the vendor does not cluster their tables the way I would do it)
I can also materialize copies of the tables with my chosen cluster keys. This works great for performance, but it introduces a 10-20h lag every time the tables are updated - which is not good.
My main issue is that this data is "changed" all the time, i.e. current and historical values are updated by the vendor in place (not appended). This makes incremental runs almost impossible.
Does Snowflake have any feature that could help in this context?
You mentioned that it's a table, not a view that they're sharing. You can request permission from the vendor of this data to be able to place a stream on their table. Once you have a stream on their table, you'll get all the rows you need to complete a synchronization of their table changes with your local copy. This should reduce the 10-20h lag because without setting a stream on their side you'll wind up doing full refreshes. This approach will allow you to handle incremental changes.
When you try to create a stream on a shared table, unless you've already arranged it with the vendor or the vendor has already enabled this for another share consumer, you may get this message:
SQL access control error: Insufficient privileges to operate on stream
source without CHANGE_TRACKING enabled 'MY_TABLE'
This just means the sharing vendor must enable change tracking on their side. On the sharing account side:
alter table MY_TABLE set change_tracking = true;
As soon as they make that change, any and all sharing consumers will be able to create a stream on the table:
status
Stream MY_STREAM successfully created.
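Once change tracking is on, a rough sketch of the consumer side might look like this (the database, table, and column names are placeholders, not your actual objects):

create stream MY_STREAM on table SHARED_DB.PUBLIC.MY_TABLE;

-- Periodically fold the captured changes into your locally clustered copy.
-- Updates arrive in the stream as DELETE+INSERT pairs, so taking only the
-- INSERT rows covers both new and changed records (deletes ignored for brevity).
merge into MY_COPY t
using (select * from MY_STREAM where METADATA$ACTION = 'INSERT') s
  on t.ID = s.ID
when matched then update set t.VALUE = s.VALUE
when not matched then insert (ID, VALUE) values (s.ID, s.VALUE);

Consuming the stream in a DML statement like this advances its offset, so the next run only sees changes made after this merge.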

Can (should) I use Flink like an in-memory database?

I've used batch Beam but am new to the streaming interface. I'm wondering about the appropriateness of using Apache Flink / Beam kind of like an in-memory database -- I'd like to constantly recompute and materialize one specific view of my data based on edge triggered updates.
More details: I have a few tables in a (normal) database, ranging from thousands to millions of rows, and each one has a many-to-many (M2M) relationship with other ones. Picture to explain:
Hosts <-M2M #1-> Table 1 <-M2M #2-> Table 2 <-M2M #3-> Table 3
Table 1 is a set of objects that the hosts need to know about, and each host needs to know about all downstream rows referenced directly or indirectly by the objects in Table 1 that it's related to. When changes happen anywhere other than the first many-to-many relationship M2M #1, it's not obvious which hosts need to be updated without traversing "left" to find the hosts and then traversing "right" to get all the necessary configuration. The objects and relationships at most levels change frequently, and I need sub-second latency to go from "a record or relationship changed" to recalculating any flattened config files with changes in them so that I can push updates to the hosts very quickly.
Is this an appropriate use case for streaming Flink / Beam? I have worked with Beam in a different system but only in batch mode, and I think that it would be a great tool to use here if I could edge-trigger it. The part I'm getting stuck on is, in batch mode, the PCollections are all "complete" in the sense that I can always join all records in one table with all records in another table. But with streaming, once I process a record once, it gets removed from its PCollection and can't be joined against future updates that arrive later on and relate to it, right? IIUC, it's only available within a window, but I effectively want an infinitely long window where only outdated versions of items in a PCollection (e.g. versions of them which have been overwritten by a new version that came in over the stream) would be freed up.
(Also, to bootstrap this system, I would need to scan the whole database to prefill all the state before I could start reading from a stream of edge-triggered updates. Is that a common pattern?)
I don't know enough about Beam to answer that part of the question, but you can certainly use Flink in the way you've described. The simplest way to accomplish this with Flink is with a streaming join, using the SQL/Table API. The runtime will materialize both tables into managed Flink state, and produce new and/or updated results as new and updated records are processed from the input tables. This is a commonly used pattern.
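For a rough idea of what that looks like, here is a sketch of a Flink SQL streaming join over two of the relationship tables (the connector settings, topics, and column names are placeholders, not a working config for your environment):

-- Each CDC-backed table becomes a dynamic table that Flink materializes into managed state.
CREATE TABLE hosts_to_objects (
    host_id BIGINT,
    object_id BIGINT
) WITH (
    'connector' = 'kafka',
    'topic' = 'hosts_to_objects',
    'properties.bootstrap.servers' = 'localhost:9092',
    'format' = 'debezium-json'
);

CREATE TABLE objects_to_configs (
    object_id BIGINT,
    config_id BIGINT
) WITH (
    'connector' = 'kafka',
    'topic' = 'objects_to_configs',
    'properties.bootstrap.servers' = 'localhost:9092',
    'format' = 'debezium-json'
);

-- A regular (unbounded) join: whenever either side changes, Flink re-emits updated
-- results, which you can sink wherever the flattened config is assembled.
SELECT h.host_id, c.config_id
FROM hosts_to_objects AS h
JOIN objects_to_configs AS c ON h.object_id = c.object_id;

Because it is a regular join, Flink keeps both sides in state indefinitely, which is effectively the "infinitely long window" you describe.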
As for initially bootstrapping the state, before continuing to ingest the updates, I suggest using a CDC-based approach. You might start by looking at https://github.com/ververica/flink-cdc-connectors.

Auto sharding PostgreSQL?

I have a problem where I need to load a lot of data (5+ billion rows) into a database very quickly (ideally less than 30 minutes, but quicker is better), and I was recently suggested to look into PostgreSQL (I failed with MySQL and was looking at HBase/Cassandra). My setup is: I have a cluster (currently 8 servers) that generates a lot of data, and I was thinking of running databases locally on each machine in the cluster so it writes quickly locally, and then at the end (or throughout the data generation) the data is merged together. The data is not in any order, so I don't care which specific server it's on (as long as it's eventually there).
My questions are: are there any good tutorials or places to learn about PostgreSQL auto sharding (I found results of firms like Skype doing auto sharding, but no tutorials; I want to play with this myself)? Is what I'm trying to do possible? Because the data is not in any order, I was going to use auto-incrementing ID numbers; will that cause a conflict if data is merged? (This is not a big issue anymore.)
Update: Frank's idea below kind of eliminated the auto-incrementing conflict issue I was asking about. The question is basically now: how can I learn about auto sharding, and would it support distributed uploads of data to multiple servers?
First: Do you really need to insert the generated data from your cluster straight into a relational database? You don't mind merging it at the end anyway, so why bother inserting into a database at all? In your position I'd have your cluster nodes write flat files, probably gzip'd CSV data. I'd then bulk import and merge that data using a tool like pg_bulkload.
If you do need to insert directly into a relational database: that's (part of) what PgPool-II and (especially) PgBouncer are for. Configure PgBouncer to load-balance across different nodes and you should be pretty much sorted.
Note that PostgreSQL is a transactional database with strong data durability guarantees. That also means that if you use it in a simplistic way, doing lots of small writes can be slow. You have to consider what trade-offs you're willing to make between data durability, speed, and cost of hardware.
At one extreme, each INSERT can be its own transaction that's synchronously committed to disk before returning success. This limits the number of transactions per second to the number of fsync()s your disk subsystem can do, which is often only in the tens or hundreds per second (without battery backup RAID controller). This is the default if you do nothing special and if you don't wrap your INSERTs in a BEGIN and COMMIT.
At the other extreme, you say "I really don't care if I lose all this data" and use unlogged tables for your inserts. This basically gives the database permission to throw your data away if it can't guarantee it's OK - say, after an OS crash, database crash, power loss, etc.
The middle ground is where you will probably want to be. This involves some combination of asynchronous commit, group commits (commit_delay and commit_siblings), batching inserts into groups wrapped in explicit BEGIN and END, etc. Instead of INSERT batching you could do COPY loads of a few thousand records at a time. All these things trade data durability off against speed.
For fast bulk inserts you should also consider inserting into tables without any indexes except a primary key. Maybe not even that. Create the indexes once your bulk inserts are done. This will be a hell of a lot faster.
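As a rough sketch of that middle ground on one node (table and file names are hypothetical; pick the durability trade-offs that suit you):

-- Trade durability for speed on this session only.
SET synchronous_commit = off;

-- Or go further: an unlogged table survives a clean restart but not a crash.
CREATE UNLOGGED TABLE readings_staging (
    reading_time  timestamptz,
    instrument_id int,
    value         double precision
);

-- COPY whole chunks at a time instead of row-by-row INSERTs.
COPY readings_staging FROM '/data/readings_chunk_001.csv' WITH (FORMAT csv);

-- Build indexes only after the bulk load has finished.
CREATE INDEX ON readings_staging (reading_time);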
Here are a few things that might help:
The DB on each server should have a small metadata table with that server's unique characteristics, such as which server it is; servers can be numbered sequentially. Apart from the contents of that table, it's probably wise to try to keep the schema on each server as similar as possible.
With billions of rows you'll want bigint IDs (or UUIDs or the like). With bigints, you could allocate a generous range for each server and set its sequence up to use it, e.g. server 1 gets 1..1000000000000000, server 2 gets 1000000000000001 to 2000000000000000, etc. (see the sketch after these tips).
If the data is simple data points (like a temperature reading from exactly 10 instruments every second) you might get efficiency gains by storing it in a table with columns (time timestamp, values double precision[]) rather than the more correct (time timestamp, instrument_id int, value double precision). This is an explicit denormalisation in aid of efficiency. (I blogged about my own experience with this scheme.)
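For example, the non-overlapping ID ranges could be set up roughly like this (the values and names are only illustrative):

-- On server 1:
CREATE SEQUENCE readings_id_seq
    START WITH 1
    MINVALUE 1
    MAXVALUE 1000000000000000;

-- On server 2:
CREATE SEQUENCE readings_id_seq
    START WITH 1000000000000001
    MINVALUE 1000000000000001
    MAXVALUE 2000000000000000;

-- Use the sequence as the default for a bigint primary key on every node.
CREATE TABLE readings (
    id    bigint PRIMARY KEY DEFAULT nextval('readings_id_seq'),
    value double precision
);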
Use Citus for PostgreSQL auto sharding. Also this link is helpful.
Sorry I don't have a tutorial at hand, but here's an outline of a possible solution:
Load one eighth of your data into a PG instance on each of the servers
For optimum load speed, don't use inserts but the COPY method
When the data is loaded, do not combine the eight databases into one. Instead, use plProxy to launch a single statement to query all databases at once (or the right one to satisfy your query)
As already noted, keys might be an issue. Use non-overlapping sequences or uuids or sequence numbers with a string prefix, shouldn't be too hard to solve.
You should start with a COPY test on one of the servers and see how close to your 30-minute goal you can get. If your data is not important and you have a recent Postgresql version, you can try using unlogged tables which should be a lot faster (but not crash-safe). Sounds like a fun project, good luck.
You could use MySQL, which supports auto-sharding across a cluster.

How to set up a new SQL Server database to allow for possible replication in the future?

I'm building a system which has the potential to require support for 500+ concurrent users, each making dozens of queries (selects, inserts AND updates) each minute. Based on these requirements and tables with many millions of rows I suspect that there will be the need to use database replication in the future to reduce some of the query load.
Having not used replication in the past, I am wondering if there is anything I need to consider in the schema design?
For instance, I was once told that it is necessary to use GUIDs for primary keys to enable replication. Is this true?
What special considerations or best practices for database design are there for a database that will be replicated?
Due to time constraints on the project I don't want to waste any time by implementing replication when it may not be needed. (I have enough definite problems to overcome at the moment without worrying about having to solve possible ones.) However, I don't want to have to make potentially avoidable schema changes when/if replication is required in the future.
Any other advice on this subject, including good places to learn about implementing replication, would also be appreciated.
While every table must have a rowguid column, you are not required to use a GUID for your primary key. In reality, you aren't even required to have a primary key (though you will be stoned to death for failing to create one). Even if you define your primary key as a GUID, not making it the rowguid column will result in Replication Services creating an additional column for you. You definitely can do this, and it's not a bad idea, but it is by no means necessary nor particularly advantageous.
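For illustration, a table set up for merge replication might look something like this (the table itself is made up):

CREATE TABLE dbo.Customer (
    CustomerID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Name nvarchar(100) NOT NULL,
    -- Merge replication needs a uniqueidentifier marked ROWGUIDCOL;
    -- if you don't create one yourself, replication adds it for you.
    rowguid uniqueidentifier ROWGUIDCOL NOT NULL DEFAULT NEWSEQUENTIALID() UNIQUE
);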
Here are some tips:
Keep table (or, rather, row) sizes small; unless you use column-level replication, you'll be downloading/uploading the entire contents of a row, even if only one column changes. Additionally, smaller tables make conflict resolution both easier and less frequent.
Don't use sequential or deterministic algorithm-driven primary keys. This includes identity columns. Yes, Replication Services will handle identity columns and allocating key allotments by itself, but it's a headache that you don't want to deal with. This alone is a great argument for using a Guid for your primary key.
Don't let your applications perform needless updates. This is obviously a bad idea to begin with, but this issue is made exponentially worse in replication scenarios, both from a bandwidth usage and a conflict resolution perspective.
You may want to use GUIDs for primary keys - in a replicated system rows must be unique throughout your entire topology, and GUID PKs is one way of achieving this.
Here's a short article about use of GUIDs in SQL Server
I'd say your real question is not how to handle replication, but how to handle scale out, or at least scale out for queryability. And while there are various answers to this conundrum, one answer will stand out: not using replication.
The problem with replication, especially with merge replication, is that writes get multiplied in replication. Say you have a system which handles a load of 100 queries (90 reads and 10 writes) per second. You want to scale out and you choose replication. Now you have 2 systems, each handling 50 queries: 45 reads and 5 writes each. Now those writes have to be replicated, so the actual number of writes is not 5+5, but 5+5 (the original writes) and then another 5+5 (the replica writes), so you have 90 reads and 20 writes. So while the load on each system was reduced, the ratio of writes to reads has increased. This not only changes the IO patterns, but most importantly it changes the concurrency pattern of the load. Add a third system and you'll have 90 reads and 30 writes, and so on and so forth. Soon you'll have more writes than reads, and the replication update latency combined with the concurrency issues and merge conflicts will derail your project. The gist of it is that 'soon' is much sooner than you expect, and soon enough to justify looking into scaling up instead, since you're talking about a scale-out of 6-8 peers at best anyway, and a 6-8 times capacity increase using scale-up will be faster, much simpler, and possibly even cheaper to start with.
And keep in mind that all these are just purely theoretical numbers. In practice what happens is that the replication infrastructure is not free; it adds its own load to the system. Writes need to be tracked, changes have to be read, a distributor has to exist to store changes until they are distributed to subscribers, and then changes have to be written and mediated for possible conflicts. That's why I've seen very few deployments that could claim success with a replication-based scale-out strategy.
One alternative is to scale out only the reads, and here replication does work, usually using transactional replication, but so do log shipping or mirroring with a database snapshot.
The real alternative is partitioning (i.e. sharding). Requests are routed in the application to the proper partition and land on the server containing the appropriate data. Changes on one partition that need to be reflected on another partition are shipped via asynchronous (usually messaging-based) means. Data can only be joined within a partition. For a more detailed discussion of what I'm talking about, read how MySpace does it. Needless to say, such a strategy has a major impact on the application design and cannot simply be glued in after v1.

Importing new database table

Where I'm at, there is a main system that runs on a big AIX mainframe. To facilitate reporting and operations there is a nightly dump from the mainframe into SQL Server, such that each of our 50-ish clients is in their own database with identical schemas. This dump takes about 7 hours to finish each night, and there's not really anything we can do about it: we're stuck with what the application vendor has provided.
After the dump into SQL Server we use that to run a number of other daily procedures. One of those procedures is to import data into a kind of management reporting sandbox table, which combines records from a particularly important table across the different databases into one table that managers who don't know SQL can use to run ad-hoc reports without hosing up the rest of the system. This, again, is a business thing: the managers want it, and they have the power to see that we implement it.
The import process for this table takes a couple of hours on its own. It filters down about 40 million records spread across 50 databases into about 4 million records, and then indexes them on certain columns for searching. Even at a couple of hours it's still less than a third as long as the initial load, but we're running out of time for overnight processing; we don't control the mainframe dump, but we do control this. So I've been tasked with looking for ways to improve the existing procedure.
Currently, the philosophy is that it's faster to load all the data from each client database and then index it afterwards in one step. Also, in the interest of avoiding bogging down other important systems in case it runs long, a couple of the larger clients are set to always run first (the main index on the table is by a clientid field). One other thing we're starting to do is load data from a few clients at a time in parallel, rather than each client sequentially.
So my question is, what would be the most efficient way to load this table? Are we right in thinking that indexing later is better? Or should we create the indexes before importing data? Should we be loading the table in index order, to avoid massive re-ordering of pages, rather than the big clients first? Could loading in parallel make things worse by causing too much disk access all at once or by removing our ability to control the order? Any other ideas?
Update
Well, something is up. I was able to do some benchmarking during the day, and there is no difference at all in the load time whether the indexes are created at the beginning or at the end of the operation, but we save the time building the index itself (it of course builds nearly instantly with no data in the table).
I have worked with loading bulk sets of data in SQL Server quite a bit and did some performance testing on having the index in place while inserting versus adding it afterwards. I found that it was BY FAR more efficient to create the index after all the data was loaded. In our case it took 1 hour to load with the index added at the end, and 4 hours with the index in place during the load.
I think the key is to get the data moved as quickly as possible. I am not sure if loading it in order really helps; do you have any stats on load time vs. index time? If you do, you could start to experiment a bit on that side of things.
Loading with the indexes dropped is better, as a live index will generate several I/Os for every row in the database. 4 million rows is small enough that you would not expect to get a significant benefit from table partitioning.
You could get a performance win by using bcp to load the data into the staging area and running several tasks in parallel (SSIS will do this). Write a generic batch file wrapper for bcp that takes the file path (and table name if necessary) and invoke a series of jobs in half a dozen threads with 'Execute Process' tasks in SSIS. For 50 jobs it's probably not worth trying to write a data-driven load controller process. Wrap these tasks up in a sequence container so you don't have to maintain all of the dependencies explicitly.
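As a placeholder example of what each parallel task might invoke (server, database, and file names are made up):

bcp ReportingSandbox.dbo.StagingTable in "D:\extracts\client01.dat" -S MYSERVER -T -n -b 10000 -h "TABLOCK"

Here -T uses a trusted connection, -n is native format, -b sets the batch size, and the TABLOCK hint lets the load be minimally logged when the recovery model allows it.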
You should definitely drop and re-create the indexes as this will greatly reduce the amount of I/O during the process.
If the 50 sources are being treated identically, try loading them into a common table or building a partitioned view over the staging tables.
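Hypothetically, a partitioned view over per-client staging tables could be sketched like this (only two clients and two columns shown to keep it short):

-- Each member table carries a CHECK constraint on ClientID so the optimizer
-- can target just the relevant table when a query filters on ClientID.
CREATE TABLE dbo.Staging_Client1 (
    ID int NOT NULL,
    ClientID int NOT NULL CHECK (ClientID = 1),
    CONSTRAINT PK_Staging_Client1 PRIMARY KEY (ClientID, ID)
);

CREATE TABLE dbo.Staging_Client2 (
    ID int NOT NULL,
    ClientID int NOT NULL CHECK (ClientID = 2),
    CONSTRAINT PK_Staging_Client2 PRIMARY KEY (ClientID, ID)
);

CREATE VIEW dbo.AllClientStaging
AS
SELECT ID, ClientID FROM dbo.Staging_Client1
UNION ALL
SELECT ID, ClientID FROM dbo.Staging_Client2;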
Index at the end, yes. Also consider setting the recovery model to BULK_LOGGED to minimize writes to the transaction log. Just remember to set it back to FULL after you've finished.
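A quick sketch of that, assuming the database is called ReportingSandbox:

ALTER DATABASE ReportingSandbox SET RECOVERY BULK_LOGGED;
-- ... run the bulk loads and index builds ...
ALTER DATABASE ReportingSandbox SET RECOVERY FULL;
-- A log backup after switching back is a good idea so the log chain stays useful.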
To the best of my knowledge, you are correct - it's much better to add the records all at once and then index once at the end.
