Architecting a high performing "inserting solution"

Architecting a high performing "inserting solution" - database

I am tasked with putting together a solution that can handle a high level of inserts into a database. There will be many AJAX type calls from web pages. It is not only one web site/page, but several different ones.
It will be dealing with tracking people's behavior on a web site, triggered by various javascript events, etc.
It is important for the solution to be able to handle the heavy database inserting load.
After it has been inserted, I don't mind migrating the data to an alternative/supplementary data store.
We are initial looking at using the MEAN stack with MongoDB and migrating some data to MySql for reporting purposes. I am also wondering about the use of some sort of queue-ing before insert into db or caching like memcached
I didn't manage to find much help on this elsewhere. I did see this post but it is now close to 5 years old, feels a bit outdated and don't quite ask the same questions.
Your thoughts and comments are most appreciated. Thanks.

Why do you need a stack at all? Are you looking for a web-application to do the inserting? Or do you already have an application?
It's doubtful any caching layer will outrun your NoSQL database for inserts, but you should probably confirm that you even need a NoSQL database. MySQL has pretty solid raw insert performance, as long as your load can be handled on a single box. Most NoSQL solutions scale better horizontally. This is probably worth a read. But realistically, if you already have MySQL in-house, and you separate your reporting from your insert instances, you will probably be fine with MySQL.

Some initial theory
To understand how you can optimize for the heavy insert workload, I suggest to understand the main overheads involved in inserting data in a database. Once the various overheads are understood, all kings of optimizations will come to you naturally. The bonus is that you will both have more confidence in the solution, you will know more about databases, and you can apply these optimizations to multiple engines (MySQL, PostgreSQl, Oracle, etc.).
I'm first making a non-exhaustive list of insertion overheads and then show simple solutions to avoid such overheads.
1. SQL query overhead: In order to communicate with a database you first need to create a network connection to the server, pass credentials, get the credentials verified, serialize the data and send it over the network, and so on.
And once the query is accepted, it needs to be parsed, its grammar validated, data types must be parsed and validated, the objects (tables, indexes, etc.) referenced by the query searched and access permissions are checked, etc. All of these steps (and I'm sure I forgot quite a few things here) represent significant overheads when inserting a single value. The overheads are so large that some databases, e.g. Oracle, have a SQL cache to avoid some of these overheads.
Solution: Reuse database connections, use prepared statements, and insert many values at every SQL query (1000s to 100000s).
2. Ensuring strong ACID guarantees: The ACID properties of a DB come at the cost of logging all logical and physical modification to the database ahead of time and require complex synchronization techniques (fine-grained locking and/or snapshot isolation). The actual time required to deal with the ACID guarantees can be several orders of magnitude higher than the time it takes to actually copy a 200B row in a database page.
Solution: Disable undo/redo logging when you import data in a table. Alternatively, you could also (1) drop the isolation level to trade off weaker ACID guarantees for lower overhead or (2) use asynchronous commit (a feature that allows the DB engine to complete an insert before the redo logs are properly hardened to disk).
3. Updating the physical design / database constraints: Inserting a value in a table usually requires updating multiple indexes, materialized views, and/or executing various triggers. These overheads can again easily dominate over the insertion time.
Solution: You can consider dropping all secondary data structures (indexes, materialized views, triggers) for the duration of the insert/import. Once the bulk of the inserts is done you can re-created them. For example, it is significantly faster to create an index from scratch rather than populate it through individual insertions.
In practice
Now let's see how we can apply these concepts to your particular design. The main issues I see in your case is that the insert requests are sent by many distributed clients so there is little chance for bulk processing of the inserts.
You could consider adding a caching layer in front of whatever database engine you end up having. I dont think memcached is good for implementing such a caching layer -- memcached is typically used to cache query results not new insertions. I have personal experience with VoltDB and I definitely recommend it (I have no connection with the company). VoltDB is an in-memory, scale-out, relational DB optimized for transactional workloads that should give you orders of magnitude higher insert performance than MongoDB or MySQL. It is open source but not all features are free so I'm not sure if you need to pay for a license or not. If you cannot use VoltDB you could look at the memory engine for MySQL or other similar in-memory engines.
Another optimization you can consider is to have a different database for doing the analytics. Most likely, a database with a high data ingest volume is quite bad at executing OLAP-style queries and the other way around. Coming back to my recommendation, VoltDB is no exception and is also suboptimal at executing long analytical queries. The idea would be to create a background process that reads all new data in the frontend DB (i.e. this would be a VoltDB cluster) and moves it in bulk to the backend DB for the analytics (MongoDB or maybe something more efficient). You can then apply all the optimizations above for the bulk data movement, create a rich set of additional index structures to speed up data access, then run your favourite analytical queries and save the result as a new set of tables/materialized for later access. The import/analysis process can be repeated continuously in the background.

Tables are usually designed with the implied assumption that queries will far outnumber DML of all sorts. So the table is optimized for queries with indexes and such. If you have a table where DML (particularly Inserts) will far outnumber queries, then you can go a long way just by eliminating any indexes, including a primary key. Keys and indexes can be added to the table(s) the data will be moved to and subsequently queried from.
Fronting your web application with a NoSQL table to handle the high insert rate then moving the data more or less at your leisure to a standard relational db for further processing is a good idea.

Related

SQL Server In-Memory table use case with huge data

I have SQL Server table with 160+ million records having continuous CRUD operations from UI, batch jobs etc. basically from multiple sources
Currently I have partitioned the table on a column to have better performance on the table.
I came across In-Memory tables which can be used in case of tables with frequent updates and also if updates happening from multiple sources it won't put a lock instead it will maintain row versioning, so concurrent updates is better using this approach.
So what are my options in this case ?
Partition the table or Create In-Memory table
As I have read SQL server is not supporting In-Memory table when table is partitioned.
What is the better option in this case In-Memory table or partitioned table.

It depends.
In-memory tables look great on theory, but you really need to spend time learning the details in order to make the right implementation. You may find some details disturbing. For example:
there are no parallel inserts in in-memory tables which make creation of rows slower compare to parallel insert in traditional table stored in SSD
not all index operations supported by dis-based indexes are available in in-memory table indexes
not all data types are supported
there are both unsupported features and T-SQL constructs
you may need more RAM then you think
If you are ready to pay the price for using Hekaton, you may start with reading its white-paper.
The partitioning itself comes with benefits but there is no guarantee it will heal your system. Only particular queries and case-scenarios can benefit from it. For example, if 99% of your workload is touching the data in one partition you may see no optimization at all. On the other hand, if your reports are based on historical data and your inserts/updates/deletes touch another partition it will be better.
Both of the technologies are good, but need to be examine in details and applied carefully. Often, folks believe that using some new tech will solve their problems, when the problems can be solved just applying some basic concepts.
For example, you said that you are performing CRUD over 160+ millions records. Ask yourself:
is my table normalized - when data is stored in normalized way you gain two things - first, you will perform CRUD only on part of the data, the engine may read only the data that is needed for particular query (without the need to support an index)
are my T-SQL statements write well - row by agonizing row, calling stored procedures in loops or not processing the data in batches are common sources of slow queries
which are the blocking and deadlocked queries - for example, there is a possibility one long running query to block all your inserts - identify these types of issues first and try to resolve them with data pre-calculation (indexed view) or creating covering indexes (which can be filtered with include columns, too)
are readers and writers being blocked - you can try different isolation levels to solve this type of issues - RCSI is the Azure default isolation level. You may need to add more RAM to your RAMDISK used by your TempDB, but since your are looking at Hekaton, this will be easier to test (and rollback) compare to it(or partitioning)

A huge data storage problem

I'm starting to design a new application that will be used by about 50000 devices. Each device generates about 1440 registries a day, this means that will be stored over 72 million of registries per day. These registries keep coming every minute, and I must be able to query this data by a Java application (J2EE). So it need to be fast to write, fast to read and indexed to allow report generation.
Devices only insert data and the J2EE application will need to read then occasionally.
Now I'm looking to software alternatives to support this kind of operation.
Putting this data on a single table would lead to a catastrophic condition, because I won't be able to use this data due to its amount of data stored over a year.
I'm using Postgres, and database partitioning seems not to be a answer, since I'd need to partition tables by month, or may be more granular approach, days for example.
I was thinking on a solution using SQLite. Each device would have its own SQLite database, than the information would be granular enough for good maintenance and fast insertions and queries.
What do you think?

Record only changes of device positions - most of the time any device will not move - a car will be parked, a person will sit or sleep, a phone will be on unmoving person or charged etc. - this would make you an order of magnitude less data to store.
You'll be generating at most about 1TB a year (even when not implementing point 1), which is not a very big amount of data. This means about 30MB/s of data, which single SATA drive can handle.
Even a simple unpartitioned Postgres database on not too big hardware should manage to handle this. The only problem could be when you'll need to query or backup - this can be resolved by using a Hot Standby mirror using Streaming Replication - this is a new feature in soon to be released PostgreSQL 9.0. Just query against / backup a mirror - if it is busy it will temporarily and automatically queue changes, and catch up later.
When you really need to partition do it for example on device_id modulo 256 instead of time. This way you'd have writes spread out on every partition. If you partition on time just one partition will be very busy on any moment and others will be idle. Postgres supports partitioning this way very well. You can then also spread load to several storage devices using tablespaces, which are also well supported in Postgres.

Time-interval partitioning is a very good solution, even if you have to roll your own. Maintaining separate connections to 50,000 SQLite databases is much less practical than a single Postgres database, even for millions of inserts a day.
Depending on the kind of queries that you need to run against your dataset, you might consider partitioning your remote devices across several servers, and then query those servers to write aggregate data to a backend server.
The key to high-volume tables is: minimize the amount of data you write and the number of indexes that have to be updated; don't do UPDATEs or DELETEs, only INSERTS (and use partitioning for data that you will delete in the future—DROP TABLE is much faster than DELETE FROM TABLE!).
Table design and query optimization becomes very database-specific as you start to challenge the database engine. Consider hiring a Postgres expert to at least consult on your design.

Maybe it is time for a db that you can shard over many machines? Cassandra? Redis? Don't limit yourself to sql db's.

Database partition management can be automated; time-based partitioning of the data is a standard way of dealihg with this type of problem, and I'm not sure that I can see any reason why this can't be done with PostgreSQL.
You have approximately 72m rows per day - assuming a device ID, datestamp and two floats for coordinates you will have (say) 16-20 bytes per row plus some minor page metadata overhead. A back-of-fag-packet capacity plan suggests around 1-1.5GB of data per day, or 400-500GB per year, plus indexes if necessary.
If you can live with periodically refreshed data (i.e. not completely up to date) you could build a separate reporting table and periodically update this with an ETL process. If this table is stored on separate physical disk volumes it can be queried without significantly affecting the performance of your transactional data.
A separate reporting database for historical data would also allow you to prune your operational table by dropping older partitions, which would probably help with application performance. You could also index the reporting tables and create summary tables to optimise reporting performance.
If you need low latency data (i.e. reporting on up-to-date data), it may also be possible to build a view where the lead partitions are reported off the operational system and the historical data is reported from the data mart. This would allow the bulk queries to take place on reporting tables optimised for this, while relatively small volumes of current data can be read directly from the operational system.
Most low-latency reporting systems use some variation of this approach - a leading partition can be updated by a real-time process (perhaps triggers) and contains relatively little data, so it can be queried quickly, but contains no baggage that slows down the update. The rest of the historical data can be heavily indexed for reporting. Partitioning by date means that the system will automatically start populating the next partition, and a periodic process can move, re-index or do whatever needs to be done for the historical data to optimise it for reporting.
Note: If your budget runs to PostgreSQL rather than Oracle, you will probably find that direct-attach storage is appreciably faster than a SAN unless you want to spend a lot of money on SAN hardware.

That is a bit of a vague question you are asking. And I think you are not facing a choice of database software, but an architectural problem.
Some considerations:
How reliable are the devices, and how
well are they connected to the
querying software?
How failsafe do
you need the storage to be?
How much extra processing power do the devices
have to process your queries?
Basically, your idea of a spatial partitioning is a good idea. That does not exclude a temporal partition, if necessary. Whether you do that in postgres or sqlite depends on other factors, like the processing power and available libraries.
Another consideration would be whether your devices are reliable and powerful enough to handle your queries. Otherwise, you might want to work with a centralized cluster of databases instead, which you can still query in parallel.

When to build a separate reporting database?

We're building an application that has a database (yeah, pretty exciting huh :). The database is mainly transactional (to support the app) and also does a bit of "reporting" as part of the app - but nothing too strenuous.
Above and beyond that we have some reporting requirements - but they're pretty vague and high-level at the moment. We have a standard reporting tool that we-use in-house which we'll use to do the "heavier" reporting as the requirements solidify.
My question is: how do you know when a separate database for reporting is required?
What sort of questions need to be asked? What sort of things would make you decide a separate reporting database was necessary?

In general, the more mission critical the transactional app and the more sophisticated the reporting requirements, the more splitting makes sense.
When transaction performance is critical.
When it's hard to get a maintenance window on the transactional app.
If reporting needs to correlate results not only from this app, but from other application silos.
If the reports need to support trending or other types of reporting that are best suited for a star schema/Business Intelligence environment.
If the reports are long running.
If the transactional app is on an expensive hardware resource (cluster, mainframe, etc.)
If you need to do data cleansing/extract-transform-load operations on the transactional data (e.g., state names to canonical state abbreviations).
It adds non-trivial complexity, so imo, there has to be a good reason to split.

Typically, I would try to report off the transactional database initially.
Ensure that any indexes you add to facilitate efficient reporting are all frequently used. The more indexes you add, the poorer performance is going to be on inserts and (if you alter keys) updates.
When you do go to a reporting database, remember there are only a few reasons you are going there:
Ultimately, the number one thing about reporting databases is that you are removing locking contention from the OLTP database. So if your reporting database is a straight copy of the same database, you're simply using delayed snapshots which won't interfere with production transactions.
Next, you can have a separate indexing strategy to support the reporting usage scenarios. These extra indexes are OK to maintain in the reporting database, but would cause unnecessary overhead in the OLTP database.
Now both the above could be done on the same server (even the same instance in a separate database or even just in a separate schema) and still see benefits. When CPU and IO are completely pegged, at that point, you definitely need to have it on a completely separate box (or upgrade your single box).
Finally, for ultimate reporting flexibility, you denormalize the data (usually into a dimensional model or star schemas) so that the reporting database is the same data in a different model. Reporting of large amounts of data (particularly aggregates) is extremely fast in dimensional models because the star schemas are very efficient for that. It also is efficient for a larger variety of queries without a lot of re-indexing or analysis to change indexes, because the dimensional model lends itself better to unforeseen usage patterns (the old "slice and dice every which way" request). You could view this is a kind of mini-data warehouse where you use data warehousing techniques, but aren't necessarily implementing a full-blown data warehouse. Also, star schemas are particular easy for users to get to grips with, and data dictionaries are much simpler and easier to build for BI tools or reporting tools from star schemas. You could do this on the same box or different box etc, just like discussed earlier.

This question requires experience rather than science.
As a BI architect, the approach I take on designing each BI solution for my clients are very different. I don't go through a checklist. It requires a general understanding of their system, their reporting requirements, budget and man power.
I personally prefer to keep the reporting processes as much as possible on the database side (Best practice in BI world). REPORTING TOOLS ARE FOR DISPLAYING PURPOSE ONLY (MAXIMUM FOR SMALL CALCULATIONS). This approach requires a lot of pre-processing of data which requires different staging tables, triggers and etc.
When you said:
I work on projects with hundreds of millions of rows with real time reporting along with hundreds of users accessing the application/database at the same time with out issue.
There are a few things wrong with your statement.
Hundreds of millions of rows are A LOT. even today's in memory tools like Cognos TM1 or Qlikview would struggle to get such a results. (look at SAP HANA from SAP to understand how giants in the industry handle it).
If you have Hundreds of millions of rows in database, it doesn't necessarily mean that the report needs to go through all those records. maybe the report worked on 1000s not millions. probably that's what you saw.
Transactional reports are very different than dashboards. Most dashboard tools pre-processing and cache the data.
My point is that it all comes to experience for deciding when to:
design a new schema
create a semantic database
work on the same transactional database
or even use a reporting tool (Sometimes handwritten dashboards with Java/JSF/Ajax/jQuery or JSP would work fine for client)

The main reason you would need a separate database for reporting issues is when the generation of the reports interferes with the transactional responsibilities of the app. E.g. if a report takes 20 minutes to generate and utilizes 100% of the CPU/Disk/etc... during a time of high activity you might think of using a separate database for reporting.
As for questions, here are some basic one:
Can I do the high intensity reports during non-peak hours?
Does it interfere with the users using the system?
If yes to #2, what are the costs of the interference Vs the cost of another database server, refactoring code, etc...?

I would also add another reason for which you might use a reporting database, and that is: CQRS pattern (Command Query Responsibility Separation).
If you have a large number of users accessing and writing to a small set of data, you would do wise to consider this pattern. It basicly, in its simplest form, means that all your commands (Create, Update, Delete) are pushed to the transactional database.
All of your queries (Read) are from your reporting database. This lets you freely scopy your architecture and upgrade function.
There are MUCH more to it in the pattern, I just mentioned the bit which was interesting due to your question regarding reporting database.

Basically, when the database load from the app becomes incompatible with the database load for reporting. This could be due to:
Reporting consuming inordinate amount of database server resources impacting the app's DB performance.
A part of this category would be the app DB work having to wait on a majorly slow report query due to locking, though it might be possible to resolve with less drastic methods like locking tuning.
Reporting queries being very incompatible with app queries as far as tuning (e.g. indices but not limited to that) - the most dumb example would be something like a hot spot affecting app inserts because of the reporting-purpose index.
Timing issues. E.g. the only small windows for DB maintenance available (due to application usage) are the times of heavy reporting work
Reporting data's sheer volume (e.g. logging, auditing, statistics) is so big that your primary DB server architecture is a bad solution for such reporting (see Sybase ASE vs. Sybase IQ). BTW, this is a real scenario - we moved our performance reporting to IQ because of this.

I would also add that transactional databases are meant to hold current state and oftentimes do so to be self-maintaining. You don't want transactional databases growing beyond their necessary means. When a workflow or transaction is complete then move that data out and into a Reporting database, which is much better designed to hold historical data.

We failed trying database per custom installation. Plan to recover?

There is a web application which is in production mode for 3 years or so by now. Historically, because of different reasons there was made a decision to use database-per customer installation.
Now we came across the fact that now deployments are very slow.
Should we ever consider moving all the databases back to single one to reduce environment complexity? Or is it too risky idea?
The problem I see now is that it's very hard to merge these databases with saving referential integrity(primary keys of different database' tables can not be obviously differentiated).
Databases are not that much big, so we don't have much benefits of reduced load by having multiple databases.

Your question is quite broad.
a) Ensure that merged databases don't suffer from degraded performance with things like JOIN statements when, say, 1000 databases are merged even though each is small. As for your referential integrity ... which I assume is auto_increment based ... you can replace these relationships by altering the schema and supplanting UUID or a similar unique, non-sequential value. Or even a surrogate key pair in addition to your auto increment PK.
b) Do benchmarking to ensure your application would respond within performance limits
c) Is there a direct ROI for doing this? What are the long term cost benefits vs the expense of migration? Is the decreased complexity worth increased (if any) cost?
d) How does this impact your backup and disaster recovery plans? Does it make them cheaper? Slower? More expensive?
Abstraction and management tools approach:
if it were me, depending on the situation, I would keep the scalability that comes with per-client sharding and create a set of management tools to abstractly create one virtual database. Using these tools you can acquire the simplified management without loosing technical flexibility. I suspect you want to simplify the cost of managing all these databases (based on your deployment statement). Creating a 'control panel' for your farm can be a good way to simplify a complex system (especially when deployments may use different schema versions).

For the migrated data... customer one database UUIDs can start with 10000000, Customer two database UUIDs can start with 20000000. Customer three database UUIDs can start with 30000000.....
In my opinion when you host the database for your customers, a single database that handles multiple customers is a better idea overall. Of course you need to add a "customers" table to record the customers, and a "customer_id" column on all top-level data that is within the table, and include checks in all your SQL to ensure the customer's view is limited to their own data.
I'd set up a new database with the additional columns, and then test it with a dummy customer or three for a while to ensure all bugs are wiped out. Then I'd migrate all the customers across, one by one, doing checks that the data will fit.

What should I keep in mind if I wish to merge many DBs into one DB?

I am working with a half dozen DBs. The DBs all have the same schemas, the same SPs, etc. Speaking to the person who originally designed the DBs, a big part of the motivation for using many DBs was efficiency; the alternative would be to add a column to pretty much every table and sp in the database indicating which set of data was being worked in, resulting in one giant (and thus slower) DB instead of several samll DBs. In place of having a column to indicate which set of data is being queried, the connection string is used to select which database is being hit.
The only reason I really dislike this organization is that it involves a lot of code duplication and thus hurts maintenance. For example, every time I wish to change a stored procedure, I need to run the alter statement on every database.
One solution I have considered is to combine all of the data into one big database, adding an extra column all over the place to indicate which database the data would be in if I had not combined it. Then, I could partition all of the tables by this column's value. In theory, the result of all of this is that the underlying representation of all of the data itself will be morally the same as it is now, but without the redundancies in the indexes, schemas, SPs, etc.
My questions are this:
Is this a good idea? Is there a better way to accomplish this?
Are there any gotchas in doing this?
Will this have any impact on performance?

Everyone will deal with this at some point. My own personal opinion is that multiple databases are a pain in the backside and are not faster. They are a pain because of the maintenance headaches. Adding an extra column in each table as necessary will not slow your process done that much, if indexing is set properly. And your maintenance will be much easier. Plus, doing transactions across multiple DB's can be a hassle and involve MTC.
BTW, using a single database is often called a multi-tenant database. You might want to research this a bit. But I would avoid multiple DB's like this if possible.

I'm of a different mind than Randy.
The multi-tenant model has its advantages.
For one, maintenance is not really much different whether you have 5 databases or 500. At some point you stop looking at maintenance of individual databases and look at the set. Yes you must serialize backups and you can't be performing index reorg/rebuild across all databases at once.
But for code changes across multiple more-or-less identical databases, there are easy ways to script a lot of things to be done to multiple databases without really lifting an extra finger. I use a tool called SQLFarms Combine (now sold by JNetDirect), but there are other offerings such as RedGate MultiScript that I haven't played with.
What I like most about the multi-tenant model is that when you grow and scale and suddenly need a new database server, it is very easy to move one of the tenants (say, the busiest or fastest growing) to the new server. If everybody is jammed into the same database, this extraction of only their data becomes quite difficult, especially if there is to be minimized downtime. In the multi-tenant model, you can set up mirroring for just their database, and then switch the primary when you're ready.

I'd be in favor of combining these databases. There are other facilities built into SQL Server to account for the potential performance downfalls of a very large database, like additional indexing on a second physical disk, partitioning, clustering, etc. The headache and overhead involved in deploying schema updates to that many different databases can be time consuming when it's easily handled in a single database. I think SQL Server scales really well in cases like this - let the database server do what it's designed to do and provide responsive access to your data. You can focus on application design and leave the storage model to SQL Server.
Also, though this isn't mentioned above, I'd suspect that there's some level of dynamic SQL involved in the applications that use this "many database" model because you've got to switch between databases based on something you know, so it can't be hard coded into the application or in a configuration file, meaning that either connection strings or actual SQL statements have to be generated on the fly, and that can be a really big security risk (read about "SQL Injection" if you're unfamiliar with the potential risks of dynamic SQL).

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight