HBase for storing gamers' last 1000 key hits

For my use case, I need to save only the last 1000 key hits of each gamer. There will be only two fields: gamerId (all numeric) and keyId (also all numeric). So, let's say gamer 1123 already has 999 keyIds stored; when the 1000th keyId comes in for that gamer, it's a normal insertion. However, once the 1001st keyId comes in, we need to remove the earliest recorded keyId for that gamer and persist the 1001st. So, at all times, there can be at most 1000 keyIds per gamer in the db. We have roughly 100 million gamers and very high keyId traffic, and this table will be read and written very frequently.
Will HBase be suitable for this? If not, what could be an alternative?
Thanks

In principle, you can get this done in HBase very easily thanks to versioning. I've never tried something as extreme as 1,000 versions per column (normally 5-10), but I don't think there is any specific restriction on how many versions you can have. You should just see if it creates any performance implications. Also check out this discussion: https://www.quora.com/Is-there-a-limit-to-the-number-of-versions-for-an-HBase-cell
When you define your table and your column family, you can specify the max versions parameter. Then, as you simply keep doing Puts with the same row key, the cell keeps accumulating new versions (they are all time-stamped as well). Once you do your 1,001st Put, the 1st one is automatically deleted, and so on, on a FIFO basis. Similarly, when you do a Get on that row key, there are various methods to retrieve a range of versions. How easy that is depends on which API you use to get the values (it's straightforward with the native Java API; I'm not sure about other access methods).
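To make that concrete, here is a rough sketch (not from the original answer) using the classic HBase Java client; the table name, column family, and the choice of encoding keyIds as longs are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class GamerKeyHits {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            TableName tn = TableName.valueOf("gamer_keys");

            // One-time setup: a single column family that keeps up to 1000 versions.
            try (Admin admin = conn.getAdmin()) {
                HTableDescriptor desc = new HTableDescriptor(tn);
                desc.addFamily(new HColumnDescriptor("k").setMaxVersions(1000));
                admin.createTable(desc);
            }

            try (Table table = conn.getTable(tn)) {
                // Every key hit is just a Put against the same row and column;
                // HBase timestamps it, and versions beyond 1000 age out automatically.
                Put put = new Put(Bytes.toBytes("1123"));          // gamerId as row key
                put.addColumn(Bytes.toBytes("k"), Bytes.toBytes("keyId"), Bytes.toBytes(42L));
                table.put(put);

                // Read back up to the last 1000 key hits for that gamer, newest first.
                Get get = new Get(Bytes.toBytes("1123"));
                get.setMaxVersions(1000);
                Result result = table.get(get);
                for (Cell cell : result.getColumnCells(Bytes.toBytes("k"), Bytes.toBytes("keyId"))) {
                    long keyId = Bytes.toLong(CellUtil.cloneValue(cell));
                    System.out.println(cell.getTimestamp() + " -> " + keyId);
                }
            }
        }
    }
}

One caveat: two Puts that land on the same cell in the same millisecond get the same timestamp and overwrite each other rather than creating two versions, so at very high per-gamer rates you may need to assign timestamps explicitly.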
100 million rows is quite small for HBase, so generally it shouldn't be a problem. But of course, if each of your rows really has 1,000 versions, then you are looking at 100 billion key-values. Again, I'd say it's doable for HBase, but you should see empirically whether this causes any performance problems, and you should size your cluster appropriately.

Related

Removing PAGELATCH with randomized ID instead of GUID

We have two tables which receive 1 million+ insertions per minute. These tables are heavily indexed, and the indexes can't be removed because they support business requirements. Due to such a high volume of insertions, we are seeing PAGELATCH_EX and PAGELATCH_SH waits, which further slow down insertions.
A commonly accepted solution is to change the identity column to a GUID so that insertions land on a random page every time. We could do this, but changing the IDs would require a development cycle for migration scripts so that existing production data can be converted.
I tried another approach which seems to be working well in our load tests. Instead of changing to GUIDs, we are now generating IDs in a randomized pattern using the following logic:
-- Scatter generated IDs across widely separated positive and negative ranges
DECLARE @ModValue INT = (DATEPART(NANOSECOND, GETDATE()) % 14);

INSERT xxx(id)
SELECT NEXT VALUE FOR Sequence * (@ModValue + IIF(@ModValue IN (0,1,2,3,4,5,6), 100000000000, -100000000000));
It has eliminated the PAGELATCH_EX and PAGELATCH_SH waits, and our insertions are quite fast now. I also think a GUID as the PK of such a critical table is less efficient than a bigint ID column.
However, some team members are skeptical about this, as negative IDs generated on a random basis are not a common solution. There is also an argument that the support team may struggle with large negative IDs; a common habit of writing select * from table order by 1 will need to change.
I am wondering what the community's take on this solution is. If you could point out any disadvantages of the suggested approach, that would be highly appreciated.
However, some team members are skeptical about this, as negative IDs generated on a random basis are not a common solution.
You have an uncommon problem, and so uncommon solutions might need to be entertained.
Also, there is an argument that the support team may struggle with large negative IDs. A common habit of writing select * from table order by 1 will need to change.
Sure. The system as it exists now has a high (but not perfect) correlation between IDs and time. That is, in general a higher ID for a row means that it was created after one with a lower ID. So it's convenient to order by IDs as a stand-in for ordering by time. If that's something that they need to do (i.e. order the data by time), give them a way to do that in the new proposal. Conversely, play out the hypothetical scenario where you're explaining to your CTO why you didn't fix performance on this table for your end users. Would "so that our support personnel don't have to change the way they do things" be an acceptable answer? I know that it wouldn't be for me but maybe the needs of support outweigh the needs of end users in your system. Only you (and your CTO) can answer that question.

Deleting Huge Data In Cassandra Cluster

I have a Cassandra cluster with three nodes. We have close to 7 TB of data from the last 4 years. Now, because of limited space available on the servers, we would like to keep only the last 2 years of data. But we don't want to delete everything older than 2 years; we want to keep specific data even if it is older than 2 years.
Currently I can think of one approach:
1) A Java client using the MutationBatch object. I can get all the record keys that fall into the date range, excluding the rows we don't want to delete, and then delete the records in a batch. But this solution raises concerns over performance, as the data is huge.
Is it possible to handle this at the server level (OpsCenter)? I read about TTL, but how can I apply it to existing data while still keeping some of the data that is older than 2 years?
Please help me in finding out the best solution.
The main thing you need to understand is that when you remove data in Cassandra, you're actually adding data by writing a tombstone; the deletion of the actual data happens later, during compaction.
So it's very important to perform deletion correctly. There are different types of deletes (individual cell, row, range, partition, from least efficient to most efficient by the number of tombstones generated). The best option for you is to remove by partition; the second best is by ranges inside a partition. There are articles that describe how data is removed in great detail.
You may need to perform deletion in several steps, so you don't add too much data as tombstones. You also need to check that you have enough disk space for compaction.
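As a rough illustration only (not from the original answer), a partition-level delete with the DataStax Java driver could look like the sketch below; the contact point, keyspace, table, and key names are all hypothetical.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class PurgeOldPartitions {
    public static void main(String[] args) {
        // Contact point is a placeholder; use your cluster's nodes.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // Hypothetical set of partition keys selected for removal: everything
            // older than 2 years except the partitions you want to keep.
            long[] partitionsToDrop = {1001L, 1002L, 1003L};

            for (long pk : partitionsToDrop) {
                // A partition delete writes a single partition tombstone, which is far
                // cheaper (in tombstone count) than deleting rows or cells one by one.
                session.execute("DELETE FROM my_keyspace.events WHERE device_id = ?", pk);
            }
        }
    }
}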

How can I store the date with Datastore?

The Datastore documentation is very clear that there is an issue with "hotspots" if you index 'monotonically increasing values' (like the current Unix time); however, no good alternative is mentioned, nor is it addressed whether storing the exact same value (rather than increasing values) would create "hotspots":
"Do not index properties with monotonically increasing values (such as a NOW() timestamp). Maintaining such an index could lead to hotspots that impact Cloud Datastore latency for applications with high read and write rates."
https://cloud.google.com/datastore/docs/best-practices
I would like to store the time when each particular entity is inserted into the datastore, if that's not possible though, storing just the date would also work.
That almost seems more likely to cause "hotspots", though, since every new entity for 24 hours would get added under the same index value (that's my understanding, anyway).
Perhaps there's something more going on with how indexes work (I'm having trouble finding great explanations of exactly how they work), and indexing the same value over and over again is fine while incrementing values are not.
I would appreciate if anyone has an answer to this question, or else better documentation for how datastore indexes work.
Is your application actually planning on querying the date? If not, consider simply not indexing that property. If you only need to read that property infrequently, consider writing a mapreduce rather than indexing.
That advice is given due to the way BigTable tablets work, which is described here: https://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
To the best of my knowledge, it's more important to have the primary key of an entity not be a monotonically increasing number. It would be better to have a string key, so the entity can be stored with better distribution.
But saying this as a non-expert, I can't imagine that indexes on individual properties with monotonic values would be as problematic, if it's legitimately needed. I know with the Nomulus codebase for example, we had a legitimate need for an index on time, because we wanted to delete commit logs older than a specific time.
One cool thing I think happens with these monotonic indexes is that, when these tablet splits don't happen, fetching the leftmost or rightmost element in the index actually has better latency properties than fetching stuff in the middle of the index. For example, if you do a query that just grabs the first result in the index, it can actually go faster than a key lookup.
There is a key quote in the page that Justine linked to that is very helpful:
As a developer, what can you do to avoid this situation? ... Lower your write rate, or figure out how to better distribute values.
It is ok to store an indexed time stamp as long as that entity has a low write rate.
If you have an entity where you want to store an indexed time stamp and the entity has a high write rate, then the solution is to split the entity into two entities. Entity A will have properties that need to be updated frequently and entity B will have the time stamp and properties that don't get updated often.
When I do this, I have a common ID for the two entities to make it really easy to get from one to the other.
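For illustration (this sketch is not from the original answer), here is roughly what that split could look like with the App Engine Java Datastore API; the kind names, property names, and shared ID are made up.

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import java.util.Date;

public class SplitEntityExample {
    public static void main(String[] args) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        long sharedId = 12345L; // common ID so you can hop between the two entities

        // Entity A: updated on every request; keep the hot, changing values unindexed.
        Entity stats = new Entity("PlayerStats", sharedId);
        stats.setUnindexedProperty("score", 990L);
        ds.put(stats);

        // Entity B: written once, so its indexed timestamp sees a low write rate.
        Entity meta = new Entity("PlayerMeta", sharedId);
        meta.setProperty("created", new Date()); // indexed by default
        ds.put(meta);
    }
}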
You could try storing just the date and putting random hours, minutes, and seconds into the timestamp, then throwing away that extra data later (or keep the hours and minutes and use random seconds, for example). I'm not 100% sure this would work, but if you need to index the date it's worth trying.
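A minimal sketch of that idea (my own illustration, not from the answer): keep the real date but scatter the time-of-day randomly, so the indexed values for a given day are not written in ascending order.

import java.util.Date;
import java.util.concurrent.ThreadLocalRandom;

public class ScatteredTimestamp {
    public static void main(String[] args) {
        // Truncate "now" to midnight UTC, then add a random offset within the day.
        long dayMillis = 24L * 60 * 60 * 1000;
        long midnight = (System.currentTimeMillis() / dayMillis) * dayMillis;
        long offset = ThreadLocalRandom.current().nextLong(dayMillis);

        Date scattered = new Date(midnight + offset);   // date preserved, time randomized
        System.out.println(scattered);
    }
}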

How do I write a trigger to hash value before insert?

I have a table called employees with 3 columns: FirstName, LastName, and SSN.
Data is fed into this table nightly by a .Net service, something I'm not comfortable updating.
I'd like to have a trigger that says:
Hey, I see you're trying to insert something in the SSN column... let's HASH that before it goes in.
One way is to use an INSTEAD OF TRIGGER:
CREATE TRIGGER dbo.HashSSN
ON dbo.tablename
INSTEAD OF INSERT
AS
BEGIN
SET NOCOUNT ON;
INSERT dbo.tablename(FirstName, LastName, SSN)
SELECT FirstName, LastName, HASHBYTES('SHA1', SSN)
FROM inserted;
END
GO
Business Rule Compliance and Staging Tables
Another way is to not insert to the final table but to use a staging table. The staging table is a sort of permanent temporary table that has no constraints, allows NULLs, is in a schema such as import and is simply a container for an external data source to drop data into. The concept is then that a business process with proper business logic can be set up to operate on the data in the container.
This is a kind of "data scrubbing" layer where the SSN hashing could be done, along with other business processes and rules: nullability or allowed omissions, capitalization, lengths, naming, duplicate elimination, key lookup, change notification, and so on, before finally performing the insert. The benefit is that a set of bad data, instead of being inserted, forced to roll back, and blowing up the original process, can be detected, preserved intact without loss, and ultimately handled properly (moved to an error queue, notifications sent, and so on).
Many people would use SSIS for tasks like this, though I personally find SSIS very hard to work with: it suffers from brittleness, difficulty using SPs containing temp tables, deployment challenges, not being part of database backups, and other problems.
If such a scheme seems like overkill to you so that you wouldn't even consider it, step back for a second and think about it: you have an external process that is supposed to be inserting proper, exact, scrubbed, and certainly-known data into a table. But, it's not doing that. Instead, it's inserting data that does not conform to business rules. I think that slapping on a trigger could be a way to handle it, but this is also an opportunity for you to think more about the architecture of the system and explore why you're having this problem in the first place.
How do you think untrusted or non-business-rule-compliant data should be become trusted and business-rule-compliant? Where do transformation tasks such as hashing an SSN column belong?
Should the inserting process be aware of such business rules? If so, is this consistent across the organization, the architecture, and the type of process the inserter is? If not, how will you address this so that going forward you're not putting patches on fixes on kludges?
The Insecurity of an SSN Hash
Furthermore, I would like to point something else out. There are only about 889 million possible SSNs (888,931,098) if there are no TINs. How long do you think it would take to run through all of them and compare the hashes to those in your table? Hashing certainly reduces quick exposure; you can't just read the SSN out directly. But given that it takes under a billion tries, it's a matter of days or even hours to pop all of them, depending on resources and planning.
A rainbow table with all SSNs and their SHA1 hashes would only take on the order of 25-30 GB -- quite achievable even on a relatively inexpensive home computer, where once created it would allow popping any SSN in a split second. Even using a longer or more computationally expensive hash isn't going to help much. In a matter of days or weeks a rainbow table can be built. A few hundred bucks can buy multiple terabytes of storage nowadays.
You could salt the SSN hash, which will mean that if someone runs a brute force crack against your table they will have to do it once for each row rather than be able to get all the rows at once. This is certainly better, but it only delays the inevitable. A serious hacker probably has a bot army backing him up that can crack a simple SSN + salt in a matter of seconds.
Further Thoughts
I would be interested in the business rules that on the one hand require you to be able to verify SSNs and use them as a type of password, but on the other hand don't allow you to store the full values. Do you have security concerns about your database? Now that you've updated your question to say these are employees, my question about excluding non-SSN-holders is moot. However, I'm still curious why you need to hash the values and can't just store them. It's not just fine but required for an employer to have its employees' SSNs so it can report earnings and deductions to the government.
If on the other hand, your concern isn't really about security but more about deniability ("your SSN is never stored on our servers!") then that isn't really true, now, is it? All you've done is transform it in a way that can be reversed through brute-force, and the search space is small enough that brute force is quite reasonable. If someone gives you the number 42, and you multiply it by 2 and save 84, then tell the person that his number was not stored, but you can simply divide 84 by 2 to get the original number, then you're not really being completely straightforward.
Certainly, "one-way" hashing is much harder to reverse than multiplying, but we're not dealing with a problem such as "find the original 200 thousand-character document (or whatever) from its hash" but "find a 9 digit number from its hash". Sure, many different inputs will hash to the same values as one particular SSN, but I doubt that there are very many collisions of exactly 9-character strings consisting exclusively of numeric digits.
Actual SHA-1 SSN Hash Reversal Testing
I just did some testing. I have a table with about 3200 real SSNs in it. I hashed them using SHA1 and put those hashes into a temp table containing just the one column. I was able to pop 1% of the SSNs in about 8 minutes searching upward from 001-01-0001. Based on the speed of processing and the total search space it will be done in less than 3 hours (it's taking ~2 minutes per 10 million SSNs, so 88.89 * 2 minutes). And this is from inside SQL Server, not running a compiled program that could be much, much faster. That's not very secure!
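Outside SQL Server, a compiled program doing the same exhaustive search would indeed be much faster. The sketch below is my own illustration of the kind of loop being described; it assumes the table stores unsalted SHA-1 hashes of the bare 9-digit string, as the HASHBYTES('SHA1', SSN) trigger above would produce, and the target hash is just a placeholder.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class SsnBruteForce {
    public static void main(String[] args) throws Exception {
        // Placeholder target: in a real attack this hash would come from the stolen table.
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] target = sha1.digest("123456789".getBytes(StandardCharsets.US_ASCII));

        // Walk the entire 9-digit space: under a billion candidates.
        for (int candidate = 0; candidate < 1_000_000_000; candidate++) {
            String ssn = String.format("%09d", candidate);
            byte[] hash = sha1.digest(ssn.getBytes(StandardCharsets.US_ASCII));
            if (Arrays.equals(hash, target)) {
                System.out.println("Recovered SSN: " + ssn);
                break;
            }
        }
    }
}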

HBase data modeling for activity feeds/news feeds/timeline

I decided to use HBase in a project to store users' activities in a social network. Even though HBase has a simple, column-oriented way to express data, I'm having some difficulty deciding how to represent it.
So, imagine that you have millions of users, and each user generates an activity when they, for example, comment in a thread, publish something, like, vote, etc. I basically thought of two approaches for an Activity HBase table:
1) The key is the user reference + the timestamp of activity creation, and the value is all the activity metadata (most of the time of fixed size).
2) The key is the user reference, and each activity is stored as a new column inside a column family.
I saw examples for other types of systems (such as blogs) that use the 2nd approach. The first approach (with fixed columns, varying only when you change the schema) is more commonly seen.
What would be the impact on the way I access the data for these two approaches?
In general you are asking whether your table should be wide or tall. HBase works with both, up to a point. Wide tables should never have a row that exceeds the region size (by default 256 MB), so a really prolific user could crash the system if you store large chunks of data for their actions. If you are only storing a few bytes per action, though, putting all of a user's activity in one row will let you get their full history with a single Get. You will, however, be retrieving the full row, which could cause some slowdown for a long history (tens of seconds for rows over 100 MB).
Going with a tall table and an inverse timestamp would allow you to get a user's recent activity very quickly (start a scan with the key = user id).
Using timestamps as the end of a key is a good idea if you want to query by time, but it is a bad idea if you want to optimize writes to your database (writes will always be in the most recent region in the system, causing hot spotting).
You might also want to consider putting more information (such as the activity) in the key so that you can pick up all activity of a particular type more easily.
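A rough sketch (my own, not from the answers) of the tall-table layout with a reversed timestamp in the row key; the table, column family, and qualifier names are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ActivityFeed {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("activity"))) {

            // Row key = userId + (Long.MAX_VALUE - timestamp), so the newest activity
            // sorts first within that user's key range. In practice you would pad or
            // hash the user id to a fixed length so one user's prefix can't collide
            // with another's.
            byte[] userId = Bytes.toBytes("user42");
            long reversedTs = Long.MAX_VALUE - System.currentTimeMillis();
            byte[] rowKey = Bytes.add(userId, Bytes.toBytes(reversedTs));

            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("a"), Bytes.toBytes("type"), Bytes.toBytes("comment"));
            table.put(put);

            // Recent activity first: scan everything whose key starts with the user id.
            Scan scan = new Scan();
            scan.setRowPrefixFilter(userId);
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner.next(20)) {   // e.g. first page of the feed
                    // process r ...
                }
            }
        }
    }
}

Because the user id leads the key, each user's activities stay time-ordered within their own key range, which is what makes the "recent activity" scan cheap.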
Another example to look at is OpenTSDB

Resources