Best way to move a data row to another shard?

The question says it all.
Example: I'm planning to shard a database table. The table contains customer orders which are flagged as "active", "done" and "deleted". I also have three shards, one for each flag.
As far as I understand, a row has to be moved to the right shard when the flag is changed.
Am I right?
What's the best way to do this?
Can triggers be used?
I thought about not moving the row immediately, but only at the end of the day/week/month, but then it is not determined in which shard a row with a specific flag resides, and searches always have to be done over all shards.
EDIT: Some clarification:
In general I have to choose a criterion to decide in which shard a row resides. In this case I want it to be the flag described above, because (in my opinion) it's the most natural way to shard this kind of data. There is only a limited number of active orders, which are accessed very often; a large number of finished orders, which are seldom accessed; and a huge number of data rows which are almost never accessed.
If I want to know where a specific data row resides, I don't have to search all shards. If the user wants to load an active order, I already know in which database I have to look.
Now the flag, which is my sharding criterion, changes, and I want to know the best way to deal with this case. If I just kept the record in its original database, eventually all data would accumulate in a single shard.

In my opinion, keeping all active records in a single shard may not be a good idea. With such a sharding strategy, all IO will be performed on a single database instance, leaving all the others highly underutilized.
An alternative sharding strategy is to distribute newly created rows among the shards using some kind of hash function. This will allow:
- Quick lookup of a row.
- Distribution of IO across all the shard instances.
- No need to move data from one shard to another (except when you want to increase the number of shards).
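A minimal sketch of what such hash routing might look like; the shard names and the choice of hash here are assumptions for illustration, not part of the original answer:

    import hashlib

    # Hypothetical shard names; any stable hash works.
    SHARDS = ["orders_shard_0", "orders_shard_1", "orders_shard_2"]

    def shard_for(order_id: str) -> str:
        # A stable hash of an immutable key: the same order id always
        # maps to the same shard, so a status change never forces a move.
        digest = hashlib.md5(order_id.encode("utf-8")).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    print(shard_for("order-12345"))  # always the same shard for this id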

Sharding usually refers to separating data into different databases on different servers. Oracle can do what you want using a feature called partitioned tables.
If you're using triggers (AFTER/BEFORE UPDATE/INSERT), it would be an immediate move; other methods would result in having different types of data in the first shard (active) until it is cleaned up.
I would also suggest doing this by date (like a monthly job that moves anything that's inactive and older than a month to another "Archive" Database).
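As a sketch of what such a periodic job might look like -- the database files, the orders table, and its status/updated_at columns are all assumptions for illustration -- the copy and the delete can run in one transaction:

    import sqlite3

    # active.db holds current orders; archive.db holds an identical table.
    active = sqlite3.connect("active.db")
    active.execute("ATTACH DATABASE 'archive.db' AS archive")

    with active:  # one transaction: copy, then delete
        active.execute(
            "INSERT INTO archive.orders "
            "SELECT * FROM main.orders "
            "WHERE status != 'active' AND updated_at < date('now', '-1 month')"
        )
        active.execute(
            "DELETE FROM main.orders "
            "WHERE status != 'active' AND updated_at < date('now', '-1 month')"
        )

Running the copy and the delete in a single transaction means a failed job leaves each row in exactly one place.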
I'd like to ask you to reconsider doing this if you're doing it to increase performance (unless you have terabytes of data in this table). Please tell us why you want to shard, and we'll all think about ways to solve your problem.

Related

Deleting Huge Data In Cassandra Cluster

I have a Cassandra cluster with three nodes. We have close to 7 TB of data from the last 4 years. Now, because of limited space available on the server, we would like to keep only the last 2 years of data. But we don't want to delete everything older than 2 years; we want to keep specific data even if it is older than that.
Currently I can think of one approach:
1) A Java client using a "MutationBatch" object. I can get all the record keys which fall into the date range, excluding the rows which we don't want to delete, and then delete the records in a batch. But this solution raises performance concerns, as the data is huge.
Is it possible to handle this at the server level (OpsCenter)? I read about TTL, but how can I apply it to existing data while still keeping some of the data that is older than 2 years?
Please help me in finding out the best solution.
The main thing that you need to understand is that when you remove data in Cassandra, you're actually adding data by writing a tombstone; the deletion of the actual data happens later, during compaction.
So it's very important to perform deletion correctly. There are different types of deletes -- individual cells, rows, ranges, and whole partitions (from least effective to most effective by the number of tombstones generated). The best option for you is to remove by partition; the second best is by ranges inside a partition. The following article describes how the data is removed in great detail.
You may need to perform the deletion in several steps so that you don't add too much data as tombstones. You also need to check that you have enough disk space for compaction.
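A hedged sketch of partition-level deletion using the DataStax Python driver, assuming the data were partitioned by a time bucket (the schema below is made up for illustration; the question does not state the actual table layout):

    from cassandra.cluster import Cluster  # DataStax Python driver

    # Assumed layout:
    #   CREATE TABLE ks.events (day text, id timeuuid, payload blob,
    #                           PRIMARY KEY (day, id));
    # Deleting a whole partition writes a single partition tombstone
    # instead of one tombstone per row.
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()
    delete_day = session.prepare("DELETE FROM ks.events WHERE day = ?")

    # Partitions older than 2 years, minus the ones that must be kept.
    days_to_drop = ["2015-01-01", "2015-01-02"]
    for day in days_to_drop:
        session.execute(delete_day, (day,))

If the existing table is not partitioned by time like this, the data would first have to be rewritten into such a layout, so treat this as a sketch of the target design rather than a drop-in fix.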

hbase data modeling for activity feeds/news feeds/timeline

I decided to use HBase in a project to store users' activities in a social network. Despite the fact that HBase has a simple way to express data (column oriented), I'm facing some difficulties deciding how I should represent the data.
So, imagine that you have millions of users, and each user generates an activity when they, for example, comment in a thread, publish something, like, or vote. I thought of basically two approaches with an Activity HBase table:
- The key could be the user reference + the timestamp of activity creation, and the value all the activity metadata (most of the time fixed size).
- The key is just the user reference, and each activity is stored as a new column inside a column family.
I saw examples for other types of systems (such as blogs) that use the 2nd approach. The first approach (with fixed columns, varying only when you change the schema) is more commonly seen.
What would be the impact on the way I access the data for these 2 approaches?
In general you are asking whether your table should be wide or tall. HBase works with both, up to a point. Wide tables should never have a row that exceeds the region size (256 MB by default) -- so a really prolific user may crash the system if you store large chunks of data for their actions. However, if you are only storing a few bytes per action, then putting all user activity in one row will allow you to get their full history with one get. You will, however, be retrieving the full row, which could cause some slowdown for a long history (tens of seconds for rows over 100 MB).
Going with a tall table and an inverse timestamp would allow you to get a user's recent activity very quickly (start a scan with the key = user id).
Using timestamps at the end of a key is a good idea if you want to query by time, but it is a bad idea if you want to optimize writes to your database (writes will always be in the most recent region in the system, causing hot spotting).
You might also want to consider putting more information (such as the activity) in the key so that you can pick up all activity of a particular type more easily.
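A sketch of the tall-table key design with an inverted timestamp described above; the table name, column family, and the happybase client are assumptions for illustration:

    import struct
    import time
    import happybase  # one possible HBase client (via Thrift)

    LONG_MAX = 2**63 - 1

    def activity_key(user_id: str, ts_millis: int) -> bytes:
        # user id + inverted timestamp: the newest activity sorts first,
        # so a prefix scan returns recent activity without reading history.
        return user_id.encode("utf-8") + struct.pack(">q", LONG_MAX - ts_millis)

    conn = happybase.Connection("localhost")
    table = conn.table("activity")  # table and column family are assumptions

    table.put(activity_key("user42", int(time.time() * 1000)),
              {b"d:type": b"comment", b"d:thread": b"123"})
    for key, data in table.scan(row_prefix=b"user42", limit=10):
        print(key, data)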
Another example to look at is OpenTSDB.

Does storing aggregated data go against database normalization?

On sites like SO, I'm sure it's absolutely necessary to store as much aggregated data as possible to avoid performing all those complex queries/calculations on every page load. For instance, storing a running tally of the vote count for each question/answer, or storing the number of answers for each question, or the number of times a question has been viewed so that these queries don't need to be performed as often.
But does doing this go against db normalization, or any other standards/best practices? And what is the best way to do this -- e.g., should every table have another table for its aggregated data, should it be stored in the same table it represents, and when should the aggregated data be updated?
Thanks
Storing aggregated data is not itself a violation of any Normal Form. Normalization is concerned only with redundancies due to functional dependencies, multi-valued dependencies and join dependencies. It doesn't deal with any other kinds of redundancy.
The phrase to remember is "Normalize till it hurts, Denormalize till it works"
It means: normalise all your domain relationships (to at least Third Normal Form (3NF)). If you measure a lack of performance, then investigate (and measure) whether denormalisation will provide performance benefits.
So, Yes. Storing aggregated data 'goes against' normalisation.
There is no 'one best way' to denormalise; it depends what you are doing with the data.
Denormalisation should be treated the same way as premature optimisation: don't do it unless you have measured a performance problem.
Too much normalization will hurt performance, so in the real world you have to find your balance.
I've handled a situation like this in two ways.
1) Using DB2, I used an MQT (Materialized Query Table) that works like a view, only it's driven by a query, and you can schedule how often you want it to refresh, e.g. every 5 minutes. That table then stored the count values.
2) In the software package itself, I set information like that as a system variable. So in Apache you can set a system-wide variable and refresh it every 5 minutes. Then it's somewhat accurate, but you're only running your "count(*)" query once every five minutes. You can have a daemon run it or have it driven by page requests.
I used a wrapper class to do it, so it's been a while, but I think in PHP it was as simple as:
$_SERVER['report_page_count'] = array('timeout'=>1234569783, 'count'=>15);
Nonetheless, however you store that single value, it saves you from running the query with every request.
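For comparison, a sketch of the same timeout idea in Python; run_count_query is a stand-in for the real "count(*)" statement:

    import time

    _cache = {"expires": 0.0, "count": None}
    TTL_SECONDS = 300  # refresh at most once every five minutes

    def cached_report_count(run_count_query):
        # Re-run the expensive query only after the cached value expires.
        now = time.time()
        if _cache["count"] is None or now >= _cache["expires"]:
            _cache["count"] = run_count_query()
            _cache["expires"] = now + TTL_SECONDS
        return _cache["count"]

    print(cached_report_count(lambda: 15))  # hits the database once per TTL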

database archiving vs timeperiod based tables/fields

I am working on an employee objectives web application.
A lead/manager sets objectives for team members after discussing them with the team. This happens yearly/half-yearly/quarterly, depending on the appraisal cycle the organization follows.
Now the question is: is it a better approach to add time-period-based fields, or to archive the previous quarter's/year's data? When a user wants to see previous objectives (not a frequent activity), the archive that belongs to that date could be restored into some temp table and shown to the employee.
Points to start with:
- Archiving: reduces db size and results in simpler db queries, but adds overhead when someone tries to see old data.
- Time-period-based fields/tables: one or more extra joins in queries, but previous data is treated the same as current data, so there is no overhead in retrieving old data.
PS: it is not about storage cost; my point is whether we can achieve some optimization in terms of performance, as this is a web app and at peak times all the employees in an organization will be viewing/updating it. Removing the time period would make my queries a lot simpler.
Thanks
Assuming you're talking about data that changes over time, as opposed to logging-type data, my preferred approach is to keep only the "latest" version of the data in your primary table(s), and to automatically copy the previous version of the data into an archive table. This archive table would mirror the primary, with the addition of version fields, such as timestamps. This archiving can be done with a trigger.
The main benefit that I see with this approach is that it doesn't compromise your database design. In particular, you don't have to worry about using composite keys that incorporate the version fields (in fact using time-based fields as keys may not even be permitted by your database).
If you need to go and look at the old data, you can run a select against the archive table and add version constraints to the query.
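A minimal sketch of that trigger, using SQLite for self-containedness; the table and column names are made up for illustration:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE objectives (id INTEGER PRIMARY KEY, employee_id INT, text TEXT);
        CREATE TABLE objectives_archive (
            id INT, employee_id INT, text TEXT,
            archived_at TEXT DEFAULT CURRENT_TIMESTAMP  -- the version field
        );
        CREATE TRIGGER archive_objective AFTER UPDATE ON objectives
        BEGIN
            -- copy the previous version of the row into the archive
            INSERT INTO objectives_archive (id, employee_id, text)
            VALUES (OLD.id, OLD.employee_id, OLD.text);
        END;
    """)

    db.execute("INSERT INTO objectives VALUES (1, 7, 'Q1 goal')")
    db.execute("UPDATE objectives SET text = 'Q2 goal' WHERE id = 1")
    print(db.execute("SELECT * FROM objectives_archive").fetchall())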
I would start off adding your time-period fields and wait until size becomes an issue. The kind of data you are describing does not sound like it will consume a lot of storage space.
Should it grow uncontrollably, you can always look at the archive approach later -- but the coding will take much longer than simply storing the relevant period with your data.
It seems to me that if you have the requirement that a user can look arbitrarily far back in the past, then you really must keep the data accessible.
This just won't be sustainable:
the archive that belongs to that date may be restored in some temp table and shown to employee.
My recommendation would be to periodically (read: only when absolutely necessary) move 'very old' data to another table for this purpose. Disk space is extremely cheap at this point, so keeping that data around is not nearly as expensive as implementing a system that can go back to an arbitrary time and restore an archive.

Searching across shards?

Short version
If I split my users into shards, how do I offer a "user search"? Obviously, I don't want every search to hit every shard.
Long version
By shard, I mean having multiple databases where each contains a fraction of the total data. For (a naive) example, the databases UserA, UserB, etc. might contain users whose names begin with "A", "B", etc. When a new user signs up, I simply examine his name and put him into the correct database. When a returning user signs in, I again look at his name to determine the correct database to pull his information from.
The advantage of sharding over read replication is that read replication does not scale your writes: all the writes that go to the master have to go to each slave. In a sense, they all carry the same write load, even though the read load is distributed.
Meanwhile, shards do not care about each other's writes. If Brian signs up on the UserB shard, the UserA shard does not need to hear about it. If Brian sends a message to Alex, I can record that fact on both the UserA and UserB shards. In this way, when either Alex or Brian logs in, he can retrieve all his sent and received messages from his own shard without querying all shards.
So far, so good. What about searches? In this example, if Brian searches for "Alex" I can check UserA. But what if he searches for Alex by his last name, "Smith"? There are Smiths in every shard. From here, I see two options:
Have the application search for Smiths on each shard. This can be done slowly (querying each shard in succession) or quickly (querying each shard in parallel), but either way, every shard needs to be involved in every search. In the same way that read replication does not scale writes, having searches hit every shard does not scale your searches. You may reach a time when your search volume is high enough to overwhelm each shard, and adding shards does not help you, since they all get the same volume.
Some kind of indexing that itself is tolerant of sharding. For example, let's say I have a constant number of fields by which I want to search: first name and last name. In addition to UserA, UserB, etc. I also have IndexA, IndexB, etc. When a new user registers, I attach him to each index I want him to be found on. So I put Alex Smith into both IndexA and IndexS, and he can be found on either "Alex" or "Smith", but no substrings. In this way, you don't need to query each shard, so search might be scalable.
So can search be scaled? If so, is this indexing approach the right one? Is there any other?
There is no magic bullet.
Searching each shard in succession is out of the question, obviously, due to the incredibly high latency you will incur.
So you want to search in parallel, if you have to.
There are two realistic options, and you already listed them -- indexing, and parallelized search. Allow me to go into a little more detail on how you would go about designing them.
The key insight you can use is that in search, you rarely need the complete set of results. You only need the first (or nth) page of results. So there is quite a bit of wiggle room you can use to decrease response time.
Indexing
If you know the attributes on which the users will be searched, you can create custom, separate indexes for them. You can build your own inverted index, which will point to the (shard, recordId) tuple for each search term, or you can store it in the database. Update it lazily and asynchronously. I do not know your application requirements; it might even be possible to just rebuild the index every night (meaning you will not have the most recent entries on any given day -- but that might be OK for you). Make sure to optimize this index for size so it can fit in memory; note that you can shard this index, too, if you need to.
Naturally, if people can search for something like "lastname='Smith' OR lastname='Jones'", you can read the index for Smith, read the index for Jones, and compute the union -- you do not need to store all possible queries, just their building parts.
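A minimal in-memory sketch of that inverted index; the names and structure are illustrative, not a production design:

    from collections import defaultdict

    # Each search term maps to the (shard, record_id) tuples where it occurs.
    index = defaultdict(set)

    def add_user(shard, record_id, first, last):
        index[first.lower()].add((shard, record_id))
        index[last.lower()].add((shard, record_id))

    def search(*terms):
        # OR-query: the union of the posting lists for each term.
        return set().union(*(index[t.lower()] for t in terms))

    add_user("UserA", 1, "Alex", "Smith")
    add_user("UserB", 2, "Brian", "Jones")
    print(search("smith", "jones"))  # {('UserA', 1), ('UserB', 2)}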
Parallel Search
For every query, send off requests to every shard unless you know which shard to look for because the search happens to be on the distribution key. Make the requests asynchronous. Reply to the user as soon as you get the first page-worth of results; collect the rest and cache locally, so that if the user hits "next" you will have the results ready and do not need to re-query the servers. This way, if some of the servers are taking longer than others, you do not need to wait on them to service the request.
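A sketch of that fan-out using asyncio; query_shard is a placeholder for the real per-shard search call, and the latency is simulated:

    import asyncio
    import random

    async def query_shard(shard, term):
        await asyncio.sleep(random.random() * 0.05)  # simulated shard latency
        return [f"{shard}: match for {term}"]

    async def search_all(shards, term, page_size=10):
        # Query every shard concurrently; return as soon as one page is full.
        tasks = [asyncio.create_task(query_shard(s, term)) for s in shards]
        results = []
        for finished in asyncio.as_completed(tasks):
            results.extend(await finished)
            if len(results) >= page_size:
                break  # first page is ready; stragglers can fill a local cache
        return results[:page_size]

    hits = asyncio.run(search_all(["UserA", "UserB", "UserC"], "Smith", page_size=2))
    print(hits)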
While you are at it, log the response times of the sharded servers to observe potential problems with uneven data and/or load distribution.
I'm assuming you are talking about shards a la :
http://highscalability.com/unorthodox-approach-database-design-coming-shard
If you read that article, he goes into some detail on exactly your question; but long story short, you write custom application code to bring your disparate shards together. You can do some smart hashing both to query individual shards and to insert data into them. You need to ask a more specific question to get a more specific answer.
You actually do need every search to hit every shard, or at least every search needs to be performed against an index that contains the data from all shards, which boils down to the same thing.
Presumably you shard based on a single property of the user, probably a hash of the username. If your search feature allows the user to search based on other properties of the user it is clear that there is no single shard or subset of shards that can satisfy a query, because any shard could contain users that match the query. You can't rule out any shards before performing the search, which implies that you must run the query against all shards.
You may want to look at Sphinx (http://www.sphinxsearch.com/articles.html). It supports distributed searching. GigaSpaces has parallel query and merge support. This can also be done with MySQL Proxy (http://jan.kneschke.de/2008/6/2/mysql-proxy-merging-resultsets).
Building a non-sharded index kind of defeats the purpose of sharding in the first place :-) A centralized index probably won't work if shards were necessary.
I think all the shards need to be hit in parallel. The results need to be filtered, ranked, sorted, grouped and the results merged from all the shards. If the shards themselves become overwhelmed you have to do the usual (reshard, scale up, etc) to underwhelm them again.
RDBMSs are not a good tool for textual search. You will be much better off looking at Solr. The performance difference between Solr and a database will be on the order of 100x.
