Riak backend choice: bitcask vs leveldb

I'm planning to use Riak as a backend for a service that stores user session data. The main key used to retrieve the data (a binary blob) is named UUID and is, in fact, a UUID, but sometimes the data may be retrieved using one or two other keys (e.g. the user's email).
The natural option would be to pick the leveldb backend, which offers secondary indexes for such a scenario, but since secondary-index lookups are not very common (around 10%-20% of lookups), I was wondering whether it would be better to have a separate "indexes" bucket where a mapping such as email->uuid is stored.
In that scenario, when looking up by a "secondary" index, I would first look up the uuid in the "indexes" bucket and then read the data normally using the primary key.
Knowing that bitcask is much more predictable when it comes to latency, and possibly faster, would you recommend such a design, or should I stick with leveldb and secondary indexes?

I think both scenarios would work. One way to choose between them is whether you need expiration. I assume you'll want expiration for user sessions; if that's the case, I would go with the second scenario, as bitcask offers a very good, fully customizable expiration feature.
If you go that path, you'll have to clean up the metadata bucket (in eleveldb) that you use for secondary indexes. That can be done easily by also keeping an index on the last modification time of the metadata keys: run a periodic batch that does a 2i query to fetch the old metadata and deletes it. Make sure you use the latest Riak, which supports aggressive deletion and reclaiming of disk space in leveldb.
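For illustration, a minimal sketch of that cleanup batch with the official Riak Python client might look like this; the bucket name, the last_modified_int index and the age cutoff are assumptions, not part of your setup:

```python
import riak
import time

# Minimal sketch of the 2i cleanup batch (bucket/index names are made up).
client = riak.RiakClient(protocol='pbc', pb_port=8087)
metadata = client.bucket('session_metadata')

# When writing metadata, each object would get the index added, e.g.:
#   obj = metadata.new(email, data=uuid)
#   obj.add_index('last_modified_int', int(time.time()))
#   obj.store()

def cleanup_old_metadata(max_age_seconds=30 * 24 * 3600):
    cutoff = int(time.time()) - max_age_seconds
    # 2i range query on the "last modified" index, then delete the hits.
    for key in metadata.get_index('last_modified_int', 0, cutoff):
        metadata.delete(key)
```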
That said, maybe you can have everything in bitcask, and avoid secondary indexes altogether. Consider this data design:
one "data" bucket: keys are uuid, value is the session
one "mapping_email" bucket: keys are email, values are uuid
one "mapping_otherstuff" bucket: same for other properties
This works fine if:
most of the time you let your data expire, which means you have no bookkeeping to do
you don't have too many mappings, as it's cumbersome to add more
you are ready to properly implement a client library that manages the 3 buckets when creating / updating / deleting values (see the sketch at the end of this answer)
You could start with that, because it's easier on administration, bookkeeping, batch creation (none at all) and performance (secondary index queries can be expensive).
Then later on, if you need it, you can add the leveldb route. Make sure you use multi_backend from the start.
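As an illustration of the client-library point above, here is a minimal sketch using the Riak Python client; the bucket names, function names and the (absent) error handling are assumptions, not a prescribed design:

```python
import riak

# Illustrative wrapper around the three bitcask buckets described above.
client = riak.RiakClient(protocol='pbc', pb_port=8087)
data = client.bucket('data')               # uuid -> session blob
by_email = client.bucket('mapping_email')  # email -> uuid

def create_session(uuid, email, session_blob):
    # Write the data first, then the mapping, so a mapping never points
    # at a session that does not exist yet.
    data.new(uuid, data=session_blob).store()
    by_email.new(email, data=uuid).store()

def get_by_uuid(uuid):
    obj = data.get(uuid)
    return obj.data if obj.exists else None

def get_by_email(email):
    # "Secondary" lookup: two bitcask reads instead of a 2i query.
    mapping = by_email.get(email)
    return get_by_uuid(mapping.data) if mapping.exists else None

def delete_session(uuid, email):
    data.delete(uuid)
    by_email.delete(email)
```

Expiration itself would be handled by bitcask's expiry setting in the backend configuration, not in client code like this.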

Related

Sharding database by user_id vs by entity_id

My current employer has a huge table of items. Each item has user_id and, obviously, item_id properties. To improve performance and high availability, my team decided to shard the table.
We are discussing two strategies:
Shard by item_id
In terms of high availability, if a shard is down then all users temporarily lose 1/N of their items. Performance will be even across all shards (random distribution).
Shard by user_id
If a shard is down then 1 in N users won't be able to access their items at all. Performance might not be even, because we have users with thousands of items as well as users with just one item. There is also a big disadvantage: we now need to pass both item_id and user_id in order to access an item.
So my question is: which one should we choose? Maybe you can guide me with some mathematical formula to decide which one is better in different circumstances.
P.S. We already have replicas, but they no longer help with our write throughput.
UPDATE
We have SERP pages where we need to get items by their ids, as well as pages like the user profile where a user wants to see his/her items. The first access pattern is used far more frequently than the second.
We can easily give up ACID transactions because we've started building microservices (so eventually almost all big entities will be encapsulated in a specific microservice).
I see a couple of ways to attack this:
How do you intend to shard? Separate master servers, or separate schemas served by the same server but backed by different storage backends?
How do you access this data? Is it basically key/value? Do you need to query all of a user's items at once? How transactional do your CRUD operations need to be?
Do you foresee unbalanced shards being a problem, based on the data you're storing?
Do you need to do relational queries of this data against other data in your system?
Trade-offs
If you split shards across server/database-instance boundaries, sharding by item_id means you will not be able to do a single query for info about a single user_id; you will need to query every shard and then aggregate the results at the application level. I find that aggregation has a lot more pitfalls than you'd think, so it's better to keep this in the database.
If you can use a single database instance, sharding by creating tables/schemas that are backed by different storage subsystems would allow you to scale writes while still being able to do relational queries across them. All of your eggs are still in one server basket with this method, though.
If you shard by user_id, and you want to rebalance your shards by moving a user to another shard, you will need to atomically move all of the user's rows at once. This can be difficult if there are lots of rows. If you shard by item_id, you can move one item at a time. This allows you to incrementally rebalance your shards, which is awesome.
If you intend to split these into separate servers such that you cannot do relational queries across schemas, it might be better to use a key/value store such as DynamoDB. Then you only have to worry about one endpoint, and the sharding is done at the database layer. No middleware to determine which shard to use!
The key tradeoff seems to be the ability to query about all of a particular user's data (sharding by user_id), vs easier balancing and rebalancing of data across shards (sharding by item_id).
I would focus on the question of how you need to store and access your data. If you truly only need access by item_id, then shard by item_id. Avoid splitting your database in ways counterproductive to how you query it.
If you're still unsure, note that you can shard by item_id now and then choose to shard by user_id later (you would do this by rebalancing based on user_id and then enforcing that new rows are only written to the shard their user_id belongs to).
Based on your update, it sounds like your primary concerns are not relational queries, but rather scaling writes to this particular pool of data. If that's the case, sharding by item_id allows you the most flexibility to rebalance your data over time, and is less likely to develop hot spots or become unbalanced in the first place. This comes at the price of having to aggregate queries based on user_id across shards, but as long as those "all items for a given user" queries do not need consistency guarantees, you should be fine.
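To make the routing/aggregation trade-off concrete, here is a minimal sketch of modulo-hash routing by item_id; `shards`, `get_item` and `get_items_by_user` are hypothetical per-shard clients and methods, not anything from your stack:

```python
import hashlib

NUM_SHARDS = 8  # illustrative shard count

def shard_for(item_id):
    # Stable hash so a given item always lands on the same shard.
    digest = hashlib.md5(str(item_id).encode('utf-8')).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def get_item(shards, item_id):
    # SERP-style access: one shard hit per item.
    return shards[shard_for(item_id)].get_item(item_id)

def get_items_for_user(shards, user_id):
    # Profile-style access: with item_id sharding this fans out to every
    # shard, and the results are aggregated at the application level.
    results = []
    for shard in shards:
        results.extend(shard.get_items_by_user(user_id))
    return results
```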
I'm afraid there is no formula that can calculate the answer for all cases. It depends on your data schema and on your system's functional requirements.
If individual item_ids are meaningful in your system and your users usually work with data from individual item_ids (as in an Instagram-like service where item_ids correspond to user photos), I would suggest sharding by item_id, because this choice has a lot of advantages from a technical point of view:
it ensures even load across all shards
it ensures graceful degradation of your service: when a shard is down, users lose access to 1/N of their items but can still work with the rest
you do not have to pass user_id to access an item
There are also some disadvantages to this approach. For example, it will be more difficult to back up all items of a given user.
When only a user's complete set of items has sensible meaning, it is more reasonable to shard by user_id.

Best approach for caching lists of objects in memcache

Our Google AppEngine Java app involves caching recent users that have requested information from the server.
The current working solution is that we store the users' information in a list, which is then cached.
When we need a recent user we simply grab one from this list.
The list of recent users is not vital to our app working, and if it's dropped from the cache it's simply rebuilt as users continue to request from the server.
What I want to know is: Can I do this a better way?
With the current approach there is only a certain number of users we can store before the list gets too large for memcache (we are currently limiting the list to 1000 and dropping the oldest when we insert a new one). Also, the list needs to be updated very frequently, which involves retrieving the full list from memcache just to add a single user.
Having each user stored in the cache separately would be beneficial to us, as we need recent users to expire after 30 minutes. At the moment we handle this manually to make sure the list does not include expired users.
What is the best approach for this scenario? If it's storing the users separately in cache, what's the best approach to keeping track of the user so we can retrieve it?
You could keep in the memcache list just "pointers" that you can use to build individual memcache keys, accessing user entities stored separately in memcache. This makes the list's memcache footprint a lot smaller and easier to deal with.
If the user entities have parents, then the pointers would have to be their keys, which are unique and so can be used as memcache keys as well (or their urlsafe versions if needed).
But if the user entities don't have parents (i.e. they're root entities in their entity groups), then you can use their datastore key IDs as pointers, which are typically shorter than the full keys. Better yet, if the IDs are numerical you can even store them as numbers rather than strings. The IDs are unique for those entities, but they might not be unique enough to serve as memcache keys, so you may need to add a prefix/suffix to make the respective memcache keys unique (to your app).
When you need a user entity's data, you first obtain the "pointer" from the list, build the user entity's memcache key and retrieve the entity with that key.
This, of course, assumes you do have reasons to keep that list in place. If the list itself is not mandatory all you need is just the recipe to obtain the (unique) memcache keys for each of your entities.
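A minimal sketch of that pointer-list approach using App Engine's Python memcache API might look like this; the key names and the 1000-entry cap are illustrative:

```python
from google.appengine.api import memcache

RECENT_IDS_KEY = 'recent_user_ids'   # key names here are made up
USER_KEY_PREFIX = 'recent_user:'
TTL_SECONDS = 30 * 60                # recent users expire after 30 minutes

def remember_recent_user(user_id, user_data):
    # Each user is cached separately with its own 30-minute expiry...
    memcache.set(USER_KEY_PREFIX + str(user_id), user_data, time=TTL_SECONDS)
    # ...while the cached list holds only small pointers (ids).
    ids = memcache.get(RECENT_IDS_KEY) or []
    ids = ([user_id] + [i for i in ids if i != user_id])[:1000]
    memcache.set(RECENT_IDS_KEY, ids)

def get_recent_users():
    ids = memcache.get(RECENT_IDS_KEY) or []
    # Expired users are simply missing from the get_multi result.
    cached = memcache.get_multi([USER_KEY_PREFIX + str(i) for i in ids])
    return list(cached.values())
```

Note that the read-modify-write of the pointer list above is not atomic; if concurrent requests update it, you would want to guard it, for example with memcache's gets/cas support via memcache.Client.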
If you use NDB caching, it will take care of memcache for you. Simply request the users with the key using ndb.Key(Model, id).get() or Model.get_by_id(id), where id is the User id.
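For completeness, a tiny sketch of that, with a stand-in model:

```python
from google.appengine.ext import ndb

class User(ndb.Model):           # stand-in model for your user entity
    name = ndb.StringProperty()

# Key-based gets like these go through NDB's in-context cache and
# memcache automatically, so no manual memcache handling is needed.
user = ndb.Key(User, 12345).get()
same_user = User.get_by_id(12345)
```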

Which Database to use for massive update

I need help selecting the right database for my data.
I have a table of usersItems with the following columns:
userId , itemId , attribute1 ,attribute2,attribute3 .......,attribute10
There are roughly 1,000 users, and every user has about 100,000 items on average.
The data in the table is updated every 3 hours from a third-party API (I get a file for each user with the updated items; not all of them have actually changed).
The data from this table is used as is, without aggregations. Each user can see his items on the website.
Today I'm using MySQL and have a few problems with the massive update of records.
I'm thinking of migrating the data to Redshift or one of the NoSQL databases.
I'll be happy to hear your recommendations.
I'd look into Aerospike for this kind of workload. This is what we've been using over here, and we are quite happy with it. It's an open source NoSQL database designed for both in-memory and solid-state-disk operation. It can handle a lot of IOPS (100k+ IOPS in memory, like Redis), if you manage to avoid ultra-hot keys (more than 1000 IOPS on a single 'row'). It can be configured to replicate all data and has synchronous (SSD only) as well as asynchronous (HDD) persistence support.
For your use case, you'd have to decide whether lists can be bounded in size to 128k-1MB or whether you need infinitely growable lists per user. This makes the difference between using a normal list (limited by the record size, 128k-1M) and using a large ordered list (unbounded). Note that you overcome your MySQL limitations the moment you have a single primary key for the list you are trying to query; no joins or anything are required. It only gets a bit fuzzy if list entries need their own primary key (e.g. m:n relations), but there are concepts that work around that, like de-normalization (sketched below).
Once you give it a few days to figure out what works best, Aerospike can help you with consistently low latencies that only a product grown up in the ad space can offer. You might not need that right now, but we found that working with SSDs gives us a lot more freedom in what we store, thanks to the much higher capacity compared to memory.
Other options I'd evaluate would be Redis or Couchbase - if asynchronous persistence is not an issue for you.
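To make the de-normalization idea above concrete, here is a minimal sketch with the Aerospike Python client; the default test namespace, the user_items set and the composite key scheme are assumptions, and a per-user list (normal or large ordered) would be modelled differently:

```python
import aerospike

# Assumes a local node, the default "test" namespace and a made-up set name.
client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

def upsert_item(user_id, item_id, attributes):
    # One record per (user, item); the composite primary key keeps every
    # lookup a single-key get, with no joins involved.
    key = ('test', 'user_items', '%s:%s' % (user_id, item_id))
    client.put(key, attributes)

def get_item(user_id, item_id):
    key = ('test', 'user_items', '%s:%s' % (user_id, item_id))
    _, _, bins = client.get(key)
    return bins
```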
You should try an in-memory database with persistence: Redis, CouchBase, Tarantool, Aerospike.
Each of them should handle your workload of heavy updates, because these databases don't rewrite the table space on each update but only append to the transaction log, which is the fastest possible way to persist updates.
So if your update workload is less than 100 MB/sec (the linear write speed of a spinning disk), these databases should help you.
Everything depends on your specific workload, though. You can test all of these databases and choose the one that works best.
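As a rough sanity check of that limit against the numbers in the question, plus a minimal redis-py sketch of a per-item hash layout (the key scheme and the ~200-byte row size are assumptions):

```python
import redis

# Back-of-the-envelope check, assuming ~200 bytes per changed row:
# 1,000 users * 100,000 items = 100M rows per 3-hour cycle; even if every
# row changed, 100M * 200 B / 10,800 s is roughly 1.9 MB/s of appended
# log, far below the ~100 MB/s linear-write figure above.

r = redis.Redis(host='localhost', port=6379)

def apply_user_file(user_id, items):
    # items: {item_id: {'attribute1': ..., ..., 'attribute10': ...}}
    pipe = r.pipeline(transaction=False)  # batch writes to cut round trips
    for item_id, attrs in items.items():
        pipe.hset('user:%s:item:%s' % (user_id, item_id), mapping=attrs)
    pipe.execute()
```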

Regularly updated data and the Search API

I have an application which requires very flexible searching functionality. As part of this, users will need to have the ability to do full-text searches over a number of text fields, but also to filter by a number of numeric fields whose data is updated on a regular basis (at times more than once or twice a minute). This data is stored in an NDB datastore.
I am currently using the Search API to create document objects and indexes to search the text data, and I am aware that I can also add numeric values to these documents for indexing. However, with the dynamic nature of these numeric fields I would be constantly updating (deleting and recreating) the documents in the Search API index. Even if I allowed the Search API to use older data for a period, it would still need to be updated a few times a day. To me this doesn't seem like an efficient way to store the data for searching, particularly given that the number of search queries will be considerably smaller than the number of updates to the data.
Is there an effective way I can deal with this dynamic data that is more efficient than having to be constantly revising the search documents?
My only thought so far is to implement a two-step process where the results of a full-text search are then either used in a query against the NDB datastore or filtered manually in Python. Neither seems ideal, but I'm out of ideas. Thanks in advance for any assistance.
It is true that the Search API's documents can include numeric data, and can easily be updated, but as you say, if you're doing a lot of updates, it could be non-optimal to be modifying the documents so frequently.
One design you might consider would store the numeric data in Datastore entities but make heavy use of a cache as well, either memcache or an in-memory cache on a backend. Cross-reference the docs and their associated entities: design the entities to include a field with the associated doc id, and the docs to include a field with the associated entity key. If your application domain is such that the doc id and the datastore entity key name can be the same string, this is even more straightforward.
Then, in the cache, index the numeric field information by doc id. This would let you efficiently fetch the associated numeric information for the docs retrieved by your queries. You'd of course need to manage the cache on updates to the datastore entities.
This could work well as long as the size of your cache does not need to be prohibitively large.
If your doc id and associated entity key name can be the same string, then I think you may be able to leverage ndb's caching support to do much of this.
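A minimal sketch of that cross-referencing design, assuming doc ids and entity key names are the same strings; the Listing model, the price field and the index name are made up:

```python
from google.appengine.api import memcache, search
from google.appengine.ext import ndb

class Listing(ndb.Model):           # illustrative entity; "price" stands in
    title = ndb.StringProperty()    # for a frequently-updated numeric field
    price = ndb.FloatProperty()

INDEX = search.Index(name='listings')  # index name is made up

def update_price(entity_id, price):
    # Frequent numeric updates touch only the entity and the cache;
    # the search document (text fields) is left alone.
    entity = Listing.get_by_id(entity_id)
    if entity is None:
        return
    entity.price = price
    entity.put()
    memcache.set('price:' + entity_id, price)

def search_listings(query_string):
    results = INDEX.search(query_string)
    # Doc ids double as entity key names, so the cached numeric values
    # can be fetched in one round trip and filtered in the application.
    prices = memcache.get_multi(['price:' + doc.doc_id for doc in results])
    return [(doc.doc_id, prices.get('price:' + doc.doc_id)) for doc in results]
```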

PIG Latin script for Database access

I am trying to implement a surrogate key generator using Pig.
I need to persist the last generated key in a database and query the database for the next available key.
Is there any support in Pig for querying a database via ODBC?
If so, please provide guidance or some samples.
Sorry for not answering your question directly, but this is not something you want to be doing. For a few reasons:
Your MapReduce job is going to hammer your database as a single performance chokepoint (you are basically defeating the purpose of Hadoop).
With speculative execution, the same data can get loaded twice, so some of the generated identifiers won't correspond to anything once one of the duplicate tasks gets killed.
I think if you can conceivably hit the database once per record, you can just do this surrogate key enrichment without MapReduce, in a single thread (see the sketch below).
Either way, building surrogate keys or automatic counters is not easy in Hadoop because of the shared-nothing nature of the thing.
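If you do take the single-threaded route, a sketch of a block-reservation variant over ODBC (via pyodbc) might look like this; the DSN, the KEY_COUNTER table and the helper names are all hypothetical:

```python
import pyodbc

# Hypothetical single-row table KEY_COUNTER(last_key BIGINT) that holds
# the last surrogate key handed out.
conn = pyodbc.connect('DSN=warehouse')  # DSN name is made up

def reserve_keys(count):
    # Reserve a contiguous block of keys in one transaction, so the
    # database is hit once per batch rather than once per record.
    cur = conn.cursor()
    cur.execute("UPDATE KEY_COUNTER SET last_key = last_key + ?", count)
    cur.execute("SELECT last_key FROM KEY_COUNTER")
    last = cur.fetchone()[0]
    conn.commit()
    return range(last - count + 1, last + 1)

def enrich(records):
    # Single-threaded enrichment: zip the reserved keys onto the records.
    keys = reserve_keys(len(records))
    return [dict(record, surrogate_key=k) for k, record in zip(keys, records)]
```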

Resources