What does Sharding means from Database Design Perspective? - database

What is the concept of Sharding from Database Design perspecitve ?

Partitioning the key space across a cluster of DB servers to distribute the service load and promote scalability of the overall system.

Well you could start by reading the wikipedia article on it, and perhaps the Hibernate Shards documentation. If you then have a more specific question, then ask that.
Horizontal partitioning is a design
principle whereby rows of a database
table are held separately, rather than
splitting by columns (as for
normalization). Each partition forms
part of a shard, which may in turn be
located on a separate database server
or physical location. The advantage is
the number of rows in each table is
reduced (this reduces index size, thus
improves search performance).

Related

What is the best approach in Snowflake for delete operations on a really large table?

We're thinking of moving our ODS from S3 into Snowflake but have some concerns on performance. Deleting 39 million rows from a 1.5 Billion (this would be on the smaller side) row table took 15 minutes on an x-small warehouse, 8 on a small, and 5 on a medium. We could throw money at larger instances, but really don't want to do that until all other measures were explored.
We were also thinking about implementing a manual partition system, to chunk up the table, but there would be a dev cost to create the supporting functionality.
Does Snowflake have a partitioning system that I'm not aware about that's equivalent to SQL Server? I know that's a reach, but swapping partitions was great.
Thanks for any feedback!
Snowflake doesn't have partitions like SQL Server as such, but the storage method of data in Snowflake is micro-partitions, which are similar...sort of. You can use Snowflake's automatic re-clustering service to align those micro-partitions on one or multiple fields, which would then make deleting on those keys a much faster operation. Leveraging the fields that you are deleting off of to cluster your tables should help quite a bit.
The approach of doing explicit clustering on a table requires thoughtful planning to consider various access patterns and workloads (ingestions, queries etc) touching that same table and cost considerations, so I am not sure if doing explicit clustering for the delete case is good enough reason to change the table layout.
What if instead of deleting from large able, use CTAS to create another table on the surviving rows and the drop original table?

How to scale writings in your DB without recurring to sharding?

How would you scale writings without recurring to sharding (specially with SQL Server 2008)?
Normally ... avoid indexes and foreign keys in big tables. Every insert/update on a indexed column implies rebuilding partially the index and sometimes this can be very costly. Of course, you'll have to trade query speed VS writing speed but this is a known issue in database design. You can combine this with a NoSQL database with a some sort of mechanism for caching queries. Maybe a fast NoSQL system sitting in front of your transactional system.
Another option is to use transactions in order to do many writes in one go, when you commit the transaction the indexes will be rebuilt but just once per transaction not one per write.
Why not shard? The complexities in the code can be avoided by using transparent sharding tools, which ease all the heavy lifting associated with sharding.
Check out ScaleBase for more info

A huge data storage problem

I'm starting to design a new application that will be used by about 50000 devices. Each device generates about 1440 registries a day, this means that will be stored over 72 million of registries per day. These registries keep coming every minute, and I must be able to query this data by a Java application (J2EE). So it need to be fast to write, fast to read and indexed to allow report generation.
Devices only insert data and the J2EE application will need to read then occasionally.
Now I'm looking to software alternatives to support this kind of operation.
Putting this data on a single table would lead to a catastrophic condition, because I won't be able to use this data due to its amount of data stored over a year.
I'm using Postgres, and database partitioning seems not to be a answer, since I'd need to partition tables by month, or may be more granular approach, days for example.
I was thinking on a solution using SQLite. Each device would have its own SQLite database, than the information would be granular enough for good maintenance and fast insertions and queries.
What do you think?
Record only changes of device positions - most of the time any device will not move - a car will be parked, a person will sit or sleep, a phone will be on unmoving person or charged etc. - this would make you an order of magnitude less data to store.
You'll be generating at most about 1TB a year (even when not implementing point 1), which is not a very big amount of data. This means about 30MB/s of data, which single SATA drive can handle.
Even a simple unpartitioned Postgres database on not too big hardware should manage to handle this. The only problem could be when you'll need to query or backup - this can be resolved by using a Hot Standby mirror using Streaming Replication - this is a new feature in soon to be released PostgreSQL 9.0. Just query against / backup a mirror - if it is busy it will temporarily and automatically queue changes, and catch up later.
When you really need to partition do it for example on device_id modulo 256 instead of time. This way you'd have writes spread out on every partition. If you partition on time just one partition will be very busy on any moment and others will be idle. Postgres supports partitioning this way very well. You can then also spread load to several storage devices using tablespaces, which are also well supported in Postgres.
Time-interval partitioning is a very good solution, even if you have to roll your own. Maintaining separate connections to 50,000 SQLite databases is much less practical than a single Postgres database, even for millions of inserts a day.
Depending on the kind of queries that you need to run against your dataset, you might consider partitioning your remote devices across several servers, and then query those servers to write aggregate data to a backend server.
The key to high-volume tables is: minimize the amount of data you write and the number of indexes that have to be updated; don't do UPDATEs or DELETEs, only INSERTS (and use partitioning for data that you will delete in the future—DROP TABLE is much faster than DELETE FROM TABLE!).
Table design and query optimization becomes very database-specific as you start to challenge the database engine. Consider hiring a Postgres expert to at least consult on your design.
Maybe it is time for a db that you can shard over many machines? Cassandra? Redis? Don't limit yourself to sql db's.
Database partition management can be automated; time-based partitioning of the data is a standard way of dealihg with this type of problem, and I'm not sure that I can see any reason why this can't be done with PostgreSQL.
You have approximately 72m rows per day - assuming a device ID, datestamp and two floats for coordinates you will have (say) 16-20 bytes per row plus some minor page metadata overhead. A back-of-fag-packet capacity plan suggests around 1-1.5GB of data per day, or 400-500GB per year, plus indexes if necessary.
If you can live with periodically refreshed data (i.e. not completely up to date) you could build a separate reporting table and periodically update this with an ETL process. If this table is stored on separate physical disk volumes it can be queried without significantly affecting the performance of your transactional data.
A separate reporting database for historical data would also allow you to prune your operational table by dropping older partitions, which would probably help with application performance. You could also index the reporting tables and create summary tables to optimise reporting performance.
If you need low latency data (i.e. reporting on up-to-date data), it may also be possible to build a view where the lead partitions are reported off the operational system and the historical data is reported from the data mart. This would allow the bulk queries to take place on reporting tables optimised for this, while relatively small volumes of current data can be read directly from the operational system.
Most low-latency reporting systems use some variation of this approach - a leading partition can be updated by a real-time process (perhaps triggers) and contains relatively little data, so it can be queried quickly, but contains no baggage that slows down the update. The rest of the historical data can be heavily indexed for reporting. Partitioning by date means that the system will automatically start populating the next partition, and a periodic process can move, re-index or do whatever needs to be done for the historical data to optimise it for reporting.
Note: If your budget runs to PostgreSQL rather than Oracle, you will probably find that direct-attach storage is appreciably faster than a SAN unless you want to spend a lot of money on SAN hardware.
That is a bit of a vague question you are asking. And I think you are not facing a choice of database software, but an architectural problem.
Some considerations:
How reliable are the devices, and how
well are they connected to the
querying software?
How failsafe do
you need the storage to be?
How much extra processing power do the devices
have to process your queries?
Basically, your idea of a spatial partitioning is a good idea. That does not exclude a temporal partition, if necessary. Whether you do that in postgres or sqlite depends on other factors, like the processing power and available libraries.
Another consideration would be whether your devices are reliable and powerful enough to handle your queries. Otherwise, you might want to work with a centralized cluster of databases instead, which you can still query in parallel.

We failed trying database per custom installation. Plan to recover?

There is a web application which is in production mode for 3 years or so by now. Historically, because of different reasons there was made a decision to use database-per customer installation.
Now we came across the fact that now deployments are very slow.
Should we ever consider moving all the databases back to single one to reduce environment complexity? Or is it too risky idea?
The problem I see now is that it's very hard to merge these databases with saving referential integrity(primary keys of different database' tables can not be obviously differentiated).
Databases are not that much big, so we don't have much benefits of reduced load by having multiple databases.
Your question is quite broad.
a) Ensure that merged databases don't suffer from degraded performance with things like JOIN statements when, say, 1000 databases are merged even though each is small. As for your referential integrity ... which I assume is auto_increment based ... you can replace these relationships by altering the schema and supplanting UUID or a similar unique, non-sequential value. Or even a surrogate key pair in addition to your auto increment PK.
b) Do benchmarking to ensure your application would respond within performance limits
c) Is there a direct ROI for doing this? What are the long term cost benefits vs the expense of migration? Is the decreased complexity worth increased (if any) cost?
d) How does this impact your backup and disaster recovery plans? Does it make them cheaper? Slower? More expensive?
Abstraction and management tools approach:
if it were me, depending on the situation, I would keep the scalability that comes with per-client sharding and create a set of management tools to abstractly create one virtual database. Using these tools you can acquire the simplified management without loosing technical flexibility. I suspect you want to simplify the cost of managing all these databases (based on your deployment statement). Creating a 'control panel' for your farm can be a good way to simplify a complex system (especially when deployments may use different schema versions).
For the migrated data... customer one database UUIDs can start with 10000000, Customer two database UUIDs can start with 20000000. Customer three database UUIDs can start with 30000000.....
In my opinion when you host the database for your customers, a single database that handles multiple customers is a better idea overall. Of course you need to add a "customers" table to record the customers, and a "customer_id" column on all top-level data that is within the table, and include checks in all your SQL to ensure the customer's view is limited to their own data.
I'd set up a new database with the additional columns, and then test it with a dummy customer or three for a while to ensure all bugs are wiped out. Then I'd migrate all the customers across, one by one, doing checks that the data will fit.

Database design: one huge table or separate tables?

Currently I am designing a database for use in our company. We are using SQL Server 2008. The database will hold data gathered from several customers. The goal of the database is to acquire aggregate benchmark numbers over several customers.
Recently, I have become worried with the fact that one table in particular will be getting very big. Each customer has approximately 20.000.000 rows of data, and there will soon be 30 customers in the database (if not more). A lot of queries will be done on this table. I am already noticing performance issues and users being temporarily locked out.
My question, will we be able to handle this table in the future, or is it better to split this table up into smaller tables for each customer?
Update: It has now been about half a year since we first created the tables. Following the advices below, I created a handful of huge tables. Since then, I have been experimenting with indexes and decided on a clustered index on the first two columns (Hospital code and Department code) on which we would have partitioned the table had we had Enterprise Edition. This setup worked fine until recently, as Galwegian predicted, performance issues are springing up. Rebuilding an index takes ages, users lock each other out, queries frequently take longer than they should, and for most queries it pays off to first copy the relevant part of the data into a temp table, create indices on the temp table and run the query. This is not how it should be. Therefore, we are considering to buy Enterprise Edition for use of partitioned tables. If the purchase cannot go through I plan to use a workaround to accomplish partitioning in Standard Edition.
Start out with one large table, and then apply 2008's table partitioning capabilities where appropriate, if performance becomes an issue.
Datawarehouses are supposed to be big (the clue is in the name). Twenty million rows is about medium by warehousing standards, although six hundred million can be considered large.
The thing to bear in mind is that such large tables have a different physics, like black holes. So tuning them takes a different set of techniques. The other thing is, users of a datawarehouse must understand that they are dealing with huge amounts of data, and so they must not expect sub-second response (or indeed sub-minute) for every query.
Partitioning can be useful, especially if you have clear demarcations such as, as in your case, CUSTOMER. You have to be aware that partitioning can degrade the performance of queries which cut across the grain of the partitioning key. So it is not a silver bullet.
Splitting tables for performance reasons is called sharding. Also, a database schema can be more or less normalized. A normalized schema has separate tables with relations between them, and data is not duplicated.
I am assuming you have your database properly normalized. It shouldn't be a problem to deal with the data volume you refer to on a single table in SQL Server; what I think you need to do is review your indexes.
Since you've tagged your question as 'datawarehouse' as well I assume you know some things about the subject. Depending on your goals you could go for a star-schema (a multidemensional model with a fact and dimensiontables). Store all fastchanging data in 1 table (per subject) and the slowchaning data in another dimension/'snowflake' tables.
An other option is the DataVault method by Dan Lindstedt. Which is a bit more complex but provides you with full flexibility.
http://danlinstedt.com/category/datavault/
In a properly designed database, that is not a huge anmout of records and SQl server should handle with ease.
A partioned single table is usually the best way to go. Trying to maintain separate indivudal customer tables is very costly in termas of time and effort and far more probne to errors.
Also examine you current queries if you are experiencing performance issues. If you don't have proper indexing (did you for instance index the foreign key fields?) queries will be slow, if you don't have sargeable queries they will be slow if you used correlated subqueries or cursors, they will be slow. Are you returning more data than is striclty needed? If you have select * anywhere in your production code, get rid of it and only return the fields you need. If you used views that call views that call views or if you used EAV table, you willhave performance iisues at this level. If you allowed a framework to autogenerate SQl code, you may well have badly perforimng queries. Remember Profiler is your friend. Of course you could also have a hardware issue, you need a pretty good sized dedicated server for that number of records. It won't work to run this on your web server or a small box.
I suggest you need to hire a professional dba with performance tuning experience. It is quite complex stuff. Databases desigend by application programmers often are bad performers when they get a real number of users and records. Database MUST be designed with data integrity, performance and security in mind. If you didn't do that the changes of having them are slim indeed.
Partioning is definately something to look into. I had a database that had 2 tables sharded. Each table contained around 30-35million records. I have since merged this into one large table and assigned some good indexes. So far, I've not had to partition this table as it's working a treat, but I'm keep partitioning in mind. One thing that I have noticed, compared to when the data was sharded, and that's the data import. It is now slower, but I can live with that as the Import tool can be re-written ;o)
One table and use table partitioning.
I think the advice to use NOLOCK is unjustified based on the information given. NOLOCK means you will get inaccurate and unreliable results from your queries (dirty and phantom reads). Before using NOLOCK you need to be sure that's not going to be a problem for your customers.
Is this a single flat table (no particular model)? Typically in data warehouses, you either have a normalized data model (third normal form at least - usually in an entity-relationship-model) or you have dimensional data (Kimball method or variations - usually fact tables with associated dimension tables in a set of stars).
In both cases, indexes play a large part, and partitioning can also play a part in getting queries to perform (but partitioning is not usually about performance but about maintenance being able to add and drop partitions quickly) over very large data sets - but it really depends on the order of aggregation and the types of queries.
One table, then worry about performance. That is, assuming you are collecting the exact same information for each customer. That way, if you have to add/remove/modify a column, you are only doing it in one place.
If you're on MS SQL server and you want to keep the single table, table partitioning could be one solution.
Keep one table - 20M rows isn't huge, and customers aren't exactly the kind of table that you can easily 'archive off', and the aggrevation of searching multiple tables to find a customer isn't worth the effort (SQL is likely to be much more efficient at BTree searching than your own invention is)
You will need to look into the performance and locking issues however - this will prevent your db from scaling.
You can also create supplemental tables that hold already calculated details on historical information if there are common queries.

Resources