Activity Feed with Riak - database

This week I read an interesting article which explains how the authors implemented an activity feed. Basically, they use two approaches to handle activities, which I'm adapting to my scenario. Suppose we have a user foo who has a certain number (x) of followers:
if x<500, then the activity will be copied to every follower's feed
this means slow writes, fast reads
if x>500, only a link will be made between foo and his followers
in theory, fast writes but slow reads
So when a user accesses their activity feed, the server fetches and merges all the data: a fast lookup in their own copied activities, followed by a query across the links. If a timeline has a limit of 20, then I fetch 10 of each and then merge.
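To make the copy-versus-link idea concrete, here is a minimal in-memory sketch. It does not use Riak at all: the `HybridFeed` class, its method names and the tuple layout `(timestamp, author, text)` are assumptions made for illustration, and plain dictionaries stand in for the stored feeds.

```python
from collections import defaultdict


class HybridFeed:
    """Toy model: copy activities for small accounts, link for large ones."""

    def __init__(self, fanout_limit=500):
        self.fanout_limit = fanout_limit    # the "x" threshold from the question
        self.followers = defaultdict(set)   # author -> readers following them
        self.following = defaultdict(set)   # reader -> authors they follow
        self.inbox = defaultdict(list)      # reader -> activities copied to them
        self.outbox = defaultdict(list)     # author -> their own activities

    def follow(self, reader, author):
        self.followers[author].add(reader)
        self.following[reader].add(author)

    def publish(self, author, timestamp, text):
        item = (timestamp, author, text)
        self.outbox[author].append(item)
        # Few followers: copy now (slower write, fast read).
        if len(self.followers[author]) < self.fanout_limit:
            for reader in self.followers[author]:
                self.inbox[reader].append(item)
        # Many followers: copy nothing; readers pull from the outbox later.

    def read(self, reader, limit=20):
        # Start from the copied activities, then follow the "links".
        candidates = set(self.inbox[reader])
        for author in self.following[reader]:
            if len(self.followers[author]) >= self.fanout_limit:
                candidates.update(self.outbox[author])
        # The set removes duplicates if an author crossed the threshold over time.
        return sorted(candidates, reverse=True)[:limit]
```

The read path merges everything and then truncates, rather than taking a fixed 10 from each side, so a feed dominated by one source is not cut short. Whether that matters depends on the data.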
I'm trying to do this with Riak and its Linking feature, so this is my question: is linking faster than copying? Is my architecture good enough? Are there other solutions and/or technologies I should look at?
PS: I'm not implementing an activity feed for production; it's just for learning how to implement one that performs well, and to use Riak a bit.

Two thoughts.
1) No, Linking (in the sense of Riak Link Walking) is very likely not the right way to implement this. For one, each link is stored as a separate HTTP header, and there is a recommended limit in the HTTP spec on how many header fields you should send. (Although, to be fair, in tests you can use upwards of 1000 links in the header with Riak and it seems to work fine. But it's not recommended.) More importantly, querying those links via the Link Walking API actually uses MapReduce on the backend, and is fairly slow for the kind of usage you're intending it for.
This is not to say that you can't store JSON objects that are lists of links; that's a valid approach. I'm just recommending against using Riak links for this.
2) As for how to properly implement it, that's a harder question, and depends on your traffic and use case. But your general approach is valid -- copy the feed for some X value of updates (whether X is 500 or much smaller should be determined in testing), and link when the number of updates is greater than X.
How should you link? You have 3 choices, all with tradeoffs. 1) Use Secondary Indices (2i), 2) Use Search, or 3) Use links "manually", meaning, store JSON documents with URLs that you dereference manually (versus using link walking queries).
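The third option, "manual" links, can be sketched roughly as follows. This is only an illustration: a plain dictionary stands in for the key/value bucket, the links are stored as keys rather than full URLs, and the names `put`, `get` and `follow_links` are made up for the example, not part of any Riak client.

```python
import json

store = {}  # stand-in for a key/value bucket: key -> JSON string


def put(key, value):
    store[key] = json.dumps(value)


def get(key):
    raw = store.get(key)
    return None if raw is None else json.loads(raw)


def follow_links(key, field="links"):
    """Fetch a document, then fetch every document its link list points to."""
    doc = get(key)
    if doc is None:
        return []
    linked = (get(linked_key) for linked_key in doc.get(field, []))
    # Skip dangling links instead of failing the whole read.
    return [d for d in linked if d is not None]
```

Each dereference is a separate lookup, so a long link list means many round trips. That read cost is the tradeoff against copying.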
I highly recommend watching this video: http://vimeo.com/album/2258285/page:2/sort:preset/format:thumbnail (Building a Social Application on Riak), by the Clipboard engineers, to see how they solved this problem. (They used Search for linking, basically).

Related

Message storage duplication for messaging systems

In many sub-system designs for messaging applications (Twitter, Facebook, etc.) I notice duplication in where user message history is stored. On one hand they use a tokenizing indexer like Elasticsearch or Solr, which is good for search. On the other hand they still use some sort of DB for history. Why duplicate? Why can't the same instance of ES/Solr/EarlyBird be used for history? It is in fact able to do that.
The usual problem is the following: you want to search, and ideally you also want to be able to index the data in a different manner (e.g. wipe the index and try a new awesome analyzer that you forgot to include initially). Separating the data source and the index from each other makes the system less coupled. You're not afraid of losing data in Elasticsearch/Solr.
I am usually strongly against calling Elasticsearch/Solr a database, since in fact it's not. For example, neither of them supports transactions, which makes your life harder if you want to update multiple documents following standard relational logic.
Last but not least, one of the hardest operations in Elasticsearch/Solr is retrieving stored values, since it's not well optimised for that, especially if you want to return 10k documents at once. In this case a separate data source also helps, since you can return only the matched document ids from Elasticsearch/Solr, then retrieve the needed content from the data source and return it to the user.
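A rough sketch of that split, using toy in-memory stand-ins rather than Elasticsearch or Solr: the search side only returns ids, the full records come from a separate store, and the index can be wiped and rebuilt from that store at any time. All names here (`documents`, `inverted_index`, `search_ids`, `rebuild_index`) are hypothetical.

```python
from collections import defaultdict

documents = {}                      # primary store: id -> full record
inverted_index = defaultdict(set)   # toy search index: term -> matching ids


def index(doc_id, text):
    for term in text.lower().split():
        inverted_index[term].add(doc_id)


def save(doc_id, text):
    documents[doc_id] = {"id": doc_id, "text": text}
    index(doc_id, text)


def search_ids(query):
    """Return only the ids of records containing every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(inverted_index.get(t, set()) for t in terms))
    return sorted(hits)


def search(query):
    # Content is fetched from the primary store, not from the index.
    return [documents[i] for i in search_ids(query) if i in documents]


def rebuild_index():
    """Wipe the index and recreate it from the primary store."""
    inverted_index.clear()
    for doc_id, record in documents.items():
        index(doc_id, record["text"])
```

Because the index holds nothing that cannot be regenerated, changing how the text is tokenised only requires editing `index` and calling `rebuild_index()`.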
The summary is simple: Elasticsearch/Solr should be thought of as search engines, not data storage.
True that ES is NOT a database per se and will never be. But no one says you cannot use it as such, and many people actually do. It really depends on your specific use case(s), and in the end it's all a question of the trade-offs you are ready to make to support your specific needs. As with pretty much any technology in general, there is no one-size-fits-all approach and with ES (and the like) it's no different.
A primary source of truth might not necessarily be a relational DBMS and they are not necessarily "duplicating" the data in the sense that you meant, it can be anything that has a copy of your data and allows you to rebuild your ES indexes in case something goes wrong. I've seen many many different "sources of truth". It could simply be:
your raw flat files containing your historical logs or business data
Kafka topics that you can replay anytime easily
a snapshot that you take from ES on a regular basis
a relational DB
you name it...
The point is that if something goes wrong for any reason (and that happens), you want to be able to recreate your ES indexes, be it from a real DB, from backups or from raw data. You should see that as a safety net. Even if all you have is a MySQL DB, you usually have a backup of it, so you're already "duplicating" the data in some way.
One thing that you need to think of, though, when architecting your system, is that you might not necessarily need to have the entirety of your data in ES, since ES is a search and analytics engine, you should only store in there what is necessary to support your search and analytics needs and be able to recreate that information anytime. In the end, ES is just a subsystem of your whole architecture, just like your DB, your messaging queue or your web server.
Also worth reading: Using ElasticSeach as primary source for part of my DB

Redis database snapshot diffs or other suggested DB for network/resource monitoring

I have a monitoring service that polls a REST API for information about the latest resources (list of hosts/list of licenses). The monitoring service caches all this data in a Redis database. Everything works great for discovering new resources.
However, the problem I am facing is when a host drops off the network. I have no way of knowing that the host has disappeared from the list of hosts. The REST API only gives me a way of querying a list of hosts.
One way I can come up with (theoretically) is taking a diff of the rdb at different time intervals. However, this does not seem efficient to me, and honestly I am not sure how I would do this with Redis.
The suggestions I am looking for are, maybe some frameworks which are best suited for this kind of an operation or if need be a different database that might be as efficient as redis yet gives me the functionality I need to take diffs. Time series databases spring to mind but I have no experience in them and not sure how they can be used to solve this problem precisely.
There's no need to resort to anywhere besides Redis itself - it is robust enough to continue serving your requirements as long as you tell it what to do (like any other software ;)).
The following is an example but as you didn't specify how you're caching your data, I'll assume for simplicity's sake that you have a key per every host/license in your list where you store some string/binary value, like:
SET acme.org "some cached value"
You have a lot of such keys because the monitoring REST API returns a list, so a common way to keep everything in order is to use another key to store that list for each response returned by the API. You can achieve that with a Set:
SADD request:<timestamp> acme.org foo.bar ...
Sets are particularly useful here because you can perform Set operations, SDIFF and SINTER and store-variants in your case, to keep track of the current online and dropped hosts. For example:
MULTI
SINTERSTORE online:<timestamp> request:<timestamp> request:<previous-timestamp>
SDIFFSTORE dropped:<timestamp> request:<previous-timestamp> request:<timestamp>
EXEC
Note: as you're caching things, it is good practice to set an expiry (TTL) on all relevant keys and to use an appropriate eviction policy.
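The same bookkeeping can be tried out without a Redis server, since Python's built-in sets behave like Redis Sets for these operations. The function name `compare_polls` and the result keys are assumptions for illustration. Note the argument order for the difference: SDIFF returns the members of the first set that are missing from the following ones, so dropped hosts are "previous minus current".

```python
def compare_polls(previous, current):
    """Classify hosts by comparing two consecutive API responses."""
    previous, current = set(previous), set(current)
    return {
        "online": current & previous,    # like SINTER current previous
        "dropped": previous - current,   # like SDIFF previous current
        "new": current - previous,       # like SDIFF current previous
    }
```

For example, `compare_polls(["acme.org", "foo.bar"], ["foo.bar", "new.host"])` reports `acme.org` as dropped and `new.host` as new.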

Ideal database for a minimalist blog engine

So I'm designing this blog engine and I'm trying to just keep my blog data without considering comments or membership system or any other type of multi-user data.
The blog itself revolves around 2 types of data. The first is the actual blog post entry, which consists of: title, post body, and metadata (mostly dates and statistics), so it's really simple and can be represented by a simple JSON object. The second type of data is the blog admin configuration and personal information. The comment system and other features will be implemented using Disqus.
My main concern here is the ability of such an engine to scale with spiked visits (I know you might argue with this, but let's take it for granted). Since I started this project I've been moving well with the rest of my stack, except for the data layer. I've been having a dilemma choosing the database. I've considered MongoDB, but some reviews and articles/benchmarks suggested slow reads after collections reach a certain size. Next I looked at Redis and its persistence features, RDB and AOF. While Redis is good at both fast reading and writing, I'm afraid of using it because I'm not familiar with it. And this whole search keeps going on to things like "PostgreSQL 9.4 is now faster than MongoDB for storing JSON documents" etc.
So is there any way I can settle this issue for good, considering that I only need to represent my data in a key/value structure, and only require fast reads (not writes) and fault tolerance?
Thank you
If I were you I would start small and not try to optimize for big data just yet. A lot of blogs you read about the downsides of a NoSQL solution are around large data sets - or people that are trying to do relational things with a database designed for de-normalized data.
My list of databases to consider:
Mongo. It has huge community support and based on recent funding - it's going to be around for a while. It runs very well on a single instance and a basic replica set. It's easy to set up and free, so it's worth spending a day or two running your own tests to settle the issue once and for all. Don't trust a blog.
Couchbase. Supports key/value storage and also has persistence to disk. http://www.couchbase.com/couchbase-server/features Also has had some recent funding so hopefully that means stability. =)
CouchDB/PouchDB. You can use PouchDB purely on the client side and it can connect to a server side CouchDB. CouchDB might not have the same momentum as Mongo or Couchbase, but it's an actively supported product and does key/value with persistence to disk.
Riak. http://basho.com/riak/. Another NoSQL that scales and is a key/value store.
You can install and run a proof-of-concept on all of the above products in a few hours. I would recommend this for the following reasons:
A given database might scale and hit your points, but be unpleasant to use. Consider picking a database that feels fun! Sort of akin to picking Ruby/Python over Java because the syntax is nicer.
Your use case and domain will be fairly unique. Worth testing various products to see what fits best.
Each database has quirks and you won't find those until you actually try one. One might have quirks that are passable, one will have quirks that are a show stopper.
The benefit of trying all of them is that they all support schemaless data, so if you write JSON, you can use all of them! No need to create objects in your code for each database.
If you abstract the database correctly in code, swapping out data stores won't be that painful. In other words, your code will be happier if you make it easy to swap out data stores.
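One possible shape for that abstraction, as a sketch only: the blog code talks to a small interface, and each candidate database gets its own implementation behind it. `PostStore`, `MemoryStore`, `JsonFileStore` and `render_title` are illustrative names. A real MongoDB or Riak backend would be another class with the same two methods.

```python
import json
from pathlib import Path
from typing import Optional, Protocol


class PostStore(Protocol):
    """The only interface the rest of the blog engine depends on."""

    def get(self, slug: str) -> Optional[dict]: ...
    def put(self, slug: str, post: dict) -> None: ...


class MemoryStore:
    def __init__(self):
        self._posts = {}

    def get(self, slug):
        return self._posts.get(slug)

    def put(self, slug, post):
        self._posts[slug] = post


class JsonFileStore:
    """Stores each post as one JSON file; stands in for a real database."""

    def __init__(self, directory):
        self._dir = Path(directory)
        self._dir.mkdir(parents=True, exist_ok=True)

    def _path(self, slug):
        return self._dir / f"{slug}.json"

    def get(self, slug):
        path = self._path(slug)
        if not path.exists():
            return None
        return json.loads(path.read_text(encoding="utf-8"))

    def put(self, slug, post):
        self._path(slug).write_text(json.dumps(post), encoding="utf-8")


def render_title(store: PostStore, slug: str) -> str:
    # Application code never knows which backend it is talking to.
    post = store.get(slug)
    return post["title"] if post else "Not found"
```

Swapping data stores then means writing one new class, and the same proof-of-concept test can be run against every candidate.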
This is only an option for really simple CMSes, but it sounds like that's what you're building.
If your blog is super-simple as you describe and your main concern is very high traffic then the best option might be to avoid a database entirely and have your CMS generate static files instead. By doing this, you eliminate all your database concerns completely.
It's not the best option if you're doing anything dynamic or complex, but in this small use case it might fit the bill.
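As a rough illustration of the static-file approach, the sketch below renders each post to its own HTML file once, at publish time, so that serving a page needs no database at all. The template, the `build_site` function and the post fields `slug`, `title` and `body` are assumptions for the example.

```python
import html
from pathlib import Path

# Deliberately tiny page template; a real engine would use proper templates.
PAGE = "<!doctype html>\n<title>{title}</title>\n<h1>{title}</h1>\n<div>{body}</div>\n"


def build_site(posts, out_dir):
    """Write one HTML file per post and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for post in posts:
        page = PAGE.format(
            title=html.escape(post["title"]),
            body=html.escape(post["body"]),
        )
        path = out / f"{post['slug']}.html"
        path.write_text(page, encoding="utf-8")
        written.append(path)
    return written
```

The output directory can then be served by any web server or CDN, which is where the resilience to traffic spikes comes from.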

Redis for cakePHP app

I want to start a big cakePHP project where performance will be an issue. I will have a users table with act as tree behavior and many financial data related to the users. This application will make a lot of dynamic reports aggregating data for different tree nodes etc.
Since there is on github an easy to use library which sets data source of model to redis, I was wondering if it's a good idea to use it for entire app? Is there anyone who has experience with it, and what could be potential problems if I decide to depend on redis as main/only data storage?
EDIT: I have installed redis and tried to use RedisModel for two models with a simple HasMany/BelongsTo relation. When I tried to use those models like standard AppModels, it simply didn't work (Redis Error: Missing key). Apparently you can't use Model->find, Model->save etc. in the standard way. You have to use redis methods instead (setKeyValue etc.). This means that pagination and other cakePHP features will also not work. So maybe it is not the best idea to use RedisModel for all my models...
I cannot speak for CakePHP specifically, but I'll talk about redis in general and the points of your question in particular, it should be applicable to your framework of choice in the end. Let's see:
You mention you want to start an application where performance will be an issue. I just wanted to mention you should be careful with the assumption that you will need a NoSQL solution, because this is hard to assess beforehand. Redis is hella fast, but MySQL for instance has been proven capable of handling millions of records and operations just fine, provided it's properly configured and used, and it's much simpler if you need lots of relational structures.
Concerning Redis as the main and only data store:
Redis is perfectly stable for the job. Instagram reportedly stored 300 million key-value pairs pseudo-sharded using hashes to great effect, and while it's not the only data storage system they use, it goes to show redis is pretty reliable. This very site (Stack Overflow) also uses redis extensively for caching purposes.
Redis is also reported to have an overall excellent continuous uptime on average (which shouldn't be surprising considering the point above)
Options exist to mitigate downtime issues: replication is supported to some extent, and Redis Cluster is coming soon to support proper distributed approaches.
The main problem you could face is not properly understanding how its persistence works. You should absolutely read this and this article before you get started because this point is important. In a nutshell, redis does not write changes immediately to disk, which means that depending on your configuration, a crash can cause a data loss ranging from a few seconds to several minutes since the last disk write. This might or might not be a problem depending on your use case; if the data is extremely sensitive (i.e., financial records) you might want to think twice before jumping to redis, or build a system where redis is not exclusively used but rather combined with another storage system.
Relational structures in a non-relational data store like redis mean doing more work and often duplicating/denormalizing data. It can be done, but it's something to consider; in your question you mention you'll need to aggregate data to generate dynamic reports, are you sure you want to use redis for this? it sounds like a relational database would give you way more flexibility at a very small cost of performance. If you know in advance you'll need to run complex queries over your data, it could be a good idea not to reinvent the wheel unless you absolutely need to.
My advice here would be to first get a better feeling for what redis is and how it works, potentially build your own models instead of relying on others to better understand what can and cannot be done, and from there assess where you want to take it. Redis is reliable enough to be used standalone, but at the end of the day what's smart is to use the right tool for the right job, and you might find some parts of your app work well with redis while others are better off in a more traditional storage system.

Are there any algorithms for creating playlists that don't require a massive database (or that work with a publicly accessible one)?

Is there any algorithm with which I can automatically create a playlist of songs that go well with each other -- similarly to services like iTunes Genius -- that a single developer can actually implement? It should either a) not require any sort of remote database of listening habits etc. or b) require such a database, but work with one that is freely available.
i did this, and i used the last.fm database as described by tomasz. i didn't use "related artist" directly, but instead constructed my own relationship graph by comparing tags associated with different artists (this is not the approach suggested by lcfseth btw - i have quite a large range of music and i wanted to explore "natural" connections that might not be common partners in "normal" playlists; also i wasn't sure how uniform the related artists were).
i also used a local database to cache data from last.fm, because calls to the api are rate limited, and i experimented with using other parts of the api to improve / normalize the information i was reading from mp3 tags.
generating a useful graph of related artists was actually quite hard; largely because some nodes in the graph naturally tend to be more important than others. if you don't "even out" the graph then your playlist will keep returning to the "important" artists.
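A small sketch of what a tag-based graph with some "evening out" could look like. This is not the code from the project mentioned here: Jaccard similarity over tag sets and dividing each edge by the target's total edge weight are assumptions chosen to illustrate the idea, and all function names are made up.

```python
import random


def jaccard(a, b):
    """Overlap between two tag sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_graph(artist_tags):
    """Connect every pair of artists that share at least one tag."""
    graph = {artist: {} for artist in artist_tags}
    artists = sorted(artist_tags)
    for i, a in enumerate(artists):
        for b in artists[i + 1:]:
            weight = jaccard(artist_tags[a], artist_tags[b])
            if weight > 0:
                graph[a][b] = weight
                graph[b][a] = weight
    return graph


def even_out(graph):
    """Damp edges that lead into heavily connected ("important") artists."""
    strength = {a: sum(edges.values()) for a, edges in graph.items()}
    return {
        a: {b: w / strength[b] for b, w in edges.items()}
        for a, edges in graph.items()
    }


def next_artist(graph, current, rng=random):
    """Pick the next artist at random, weighted by edge strength."""
    edges = graph[current]
    if not edges:
        return None
    names = sorted(edges)
    return rng.choices(names, weights=[edges[n] for n in names], k=1)[0]
```

Calling `next_artist` repeatedly on the evened-out graph gives a random walk that still prefers related artists but is pulled less strongly toward the hubs.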
the final result did work well, in that the selection of music had a good balance between "central theme" and variation. but the implementation is not at all polished, the calculation of the graph can take a long time (many hours), the program takes up a fair amount of memory when running, and it still seems to play elvis costello a little more than expected ;o)
if you are interested, the code is at http://code.google.com/p/uykfe/
the best part of all, from my point of view as a user, is that it can update logitech media server (squeezeserver) playlists in "realtime", adding a new track whenever the list is empty. that works really well in continuing from whatever music you select "by hand". it can also generate one-off playlists, of course, and, finally, by tweaking parameters you can get a kind of "random walk" through your music collection - it will play related tunes but slowly drift from one style to another (in fact, this is really the "default" mode - to get it to stay on a single theme i needed extra logic that biased it towards whatever music it had played earlier).
ps also, the dump of the final graph to gephi was really cool - i had it printed out and it's now pinned to the wall...
pps i also experimented with the musicbrainz database, which in theory sounds like a fantastic resource. but in practice it is over-complex and poorly documented.
I don't know iTunes Genius, but I think the last.fm database and API might be useful for you. Every time you view a track, it shows you a list of similar tracks, based on other users' preferences. The same information can be obtained using the track.getSimilar API method.
The idea behind most of these databases is to see what other users listen to after they listen to a given song. The accuracy of these statistics depends on the number of users, therefore it is probably hard to do this locally. The algorithm itself is not that hard to implement.
The alternative would be to sort songs based on genre, singer, and similar information that is usually (but not always) embedded in the songs. Winamp has this feature, but it won't work for old songs unless you manually set the information or use an online song database.
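A minimal sketch of that "what do people play next" idea, assuming the listening data is available as one ordered list of track names per user. The functions `count_transitions` and `similar_tracks` are hypothetical names and have nothing to do with the last.fm API.

```python
from collections import Counter, defaultdict


def count_transitions(histories):
    """Count how often each track is followed directly by another track."""
    transitions = defaultdict(Counter)
    for history in histories:
        for current, following in zip(history, history[1:]):
            if current != following:   # ignore a track played on repeat
                transitions[current][following] += 1
    return transitions


def similar_tracks(transitions, track, limit=5):
    """Tracks most often played right after the given one, best first."""
    counts = transitions.get(track, Counter())
    return [name for name, _ in counts.most_common(limit)]
```

With only a handful of listeners the counts are too sparse to be meaningful, which is the accuracy problem described above.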

Resources