I'm working on an app that needs to keep track of YouTube videos. I want to periodically pull the info on the relevant videos into Datomic and then serve them as embeds with titles, descriptions, etc. A naive way to do that would be to periodically fetch all the info I want and upsert it into my db.
But most of the time, the information won't have changed. Titles and descriptions can change (and I want to notice when they do), but usually they won't. Using the naive approach, I'd be updating entities with the same value over and over again.
Is that bad? Will I just fill up my storage with history? Will it cause a lot of reindexing? Or should I not worry about that, and let Datomic take care of itself?
A less-naive approach would look at the current values and see if they need updating. If that's a better idea, is there an easy way to do that, or should I expect to be writing a lot of custom code for it?
Upserting the same values over and over is definitely a performance concern for the database. Yes, it will cause indexing issues, and it's not an ideal solution in terms of speed either.
If your app's performance matters, I'd write custom code that checks the current values and only updates when something has actually changed.
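For the check-then-update step, here is a minimal sketch against Datomic's Java peer API. It assumes a schema with a unique :video/id plus :video/title and :video/description attributes; those names and the shape of the fetched data are assumptions for illustration, not anything from your actual schema.

```java
// Hedged sketch only: attribute names and the fetched-data shape are assumptions.
import datomic.Connection;
import datomic.Database;
import datomic.Entity;
import datomic.Util;

import java.util.HashMap;
import java.util.Map;

public class VideoSync {

    /** Transacts only the attributes whose values actually changed. */
    public static void syncVideo(Connection conn, String youtubeId,
                                 String fetchedTitle, String fetchedDescription)
            throws Exception {
        Database db = conn.db();

        // Look up the existing entity via a lookup ref on the (assumed) unique id.
        Entity existing = db.entity(Util.list(Util.read(":video/id"), youtubeId));

        Map<Object, Object> changed = new HashMap<>();
        putIfChanged(changed, existing, ":video/title", fetchedTitle);
        putIfChanged(changed, existing, ":video/description", fetchedDescription);

        if (!changed.isEmpty()) {
            // :video/id is assumed to be :db.unique/identity, so this map form upserts.
            changed.put(Util.read(":video/id"), youtubeId);
            conn.transact(Util.list(changed)).get();
        }
        // If nothing changed, no transaction is submitted at all.
    }

    private static void putIfChanged(Map<Object, Object> changed, Entity existing,
                                     String attr, String newValue) {
        Object current = (existing == null) ? null : existing.get(Util.read(attr));
        if (newValue != null && !newValue.equals(current)) {
            changed.put(Util.read(attr), newValue);
        }
    }
}
```

The same logic in Clojure would be a handful of lines around d/entity and d/transact; the point is simply that when nothing changed, nothing is written.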
This is a general caching question, regardless of the underlying implementation, but for the record I am using Ehcache for Java.
In the classic situation where a system has to load a dynamic list of elements from a database (so the query is based on some criteria), are there any known tricks to improve loading performance by leveraging the cache?
My guess would be to load a list of IDs instead of a list of elements and then fetch each one individually, so we can leverage the caching of the entities.
Thanks for your help.
PS: I hope the question is clear enough. Any suggestion is welcome.
The generic answer is: "It all depends".
So, yes, if you expect the entities to already be cached by ID, loading a list of IDs and then fetching each one from the cache can be faster. However, you need to be sure the cache is actually populated, because otherwise fetching one entity after another is tremendously slow.
But then, querying straight from the DB can be fast too, if it doesn't happen too often. Optimizing the DB-to-Java mapping can improve performance as well.
Another trick is to retrieve only the data you need, not the entire entity. I usually start with that.
So as I said, it depends.
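For what it's worth, here is a rough sketch of the IDs-then-cache idea from your question, using the Ehcache 2.x API. The cache name, the MyEntity type, and the two load methods are placeholders for your own code:

```java
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

import java.util.ArrayList;
import java.util.List;

public class CachedListLoader {

    /** Placeholder domain type -- substitute your own entity class. */
    public static class MyEntity { }

    private final Cache entityCache;

    public CachedListLoader(CacheManager cacheManager) {
        // Assumes a cache named "entities" is configured in ehcache.xml.
        this.entityCache = cacheManager.getCache("entities");
    }

    public List<MyEntity> loadList(String criteria) {
        // Step 1: a cheap query that returns only the matching IDs.
        List<Long> ids = loadIdsMatching(criteria);

        // Step 2: resolve each ID through the cache; only misses touch the database.
        List<MyEntity> result = new ArrayList<>(ids.size());
        for (Long id : ids) {
            Element cached = entityCache.get(id);
            if (cached != null) {
                result.add((MyEntity) cached.getObjectValue());
            } else {
                MyEntity entity = loadFromDatabase(id);
                entityCache.put(new Element(id, entity));
                result.add(entity);
            }
        }
        return result;
    }

    // Hypothetical data-access methods -- replace with your real queries.
    private List<Long> loadIdsMatching(String criteria) { throw new UnsupportedOperationException(); }
    private MyEntity loadFromDatabase(Long id) { throw new UnsupportedOperationException(); }
}
```

Note the failure mode mentioned above: if the cache is cold, this degenerates into one database round trip per ID (the classic N+1 problem), so it only pays off when the entities are usually already cached.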
I've been looking at the Baobab library and am very attracted to the "single-tree" approach, which I interpret as essentially a single store. But so many Flux tutorials seem to advocate many stores, even a "store per entity." Having multiple stores seems to me to present all kinds of concurrency issues. My question is: why is a single store a bad idea?
It depends on what you want to do and how big your project is. There are a few reasons why having several stores is a good idea:
If your project is not so small after all, you may end up with a huge 2,000-3,000-line store, and you don't want that. That's the point of writing modules in general: you want to avoid files bigger than 1,000 lines (and below 500 is even nicer :) ).
Writing everything in one store means you can't take advantage of the dispatcher's dependency management via the waitFor function. It's also going to be harder to check dependencies and spot potential circular dependencies between your models (since they all live in one store). I would suggest you take a look at https://facebook.github.io/flux/docs/chat.html for that.
A single store is harder to read. With several stores you can figure out at a glance what kinds of data you have, and with a constants file for the dispatcher events you can see all your events.
So it's possible to keep everything in one store and it may work perfectly, but if your project grows you may regret it badly and end up rewriting everything into several modules/stores. That's just my opinion; I prefer to have clean modules and data workflows.
Hope it helps!
From my experience, working with a single store is definitely not a bad idea. It has some advantages, such as:
A single store to access all data can make it easier to query and reason about relationships between different pieces of data. Using multiple stores can make this a little more difficult (though definitely not impossible).
It will be easier to make atomic updates to the application state (aka data store).
But the way you implement the Flux pattern will influence your experience with a single data store. The folks at Facebook have been experimenting with this, and it seems like they encourage the use of a single data store with their new Relay+GraphQL stuff (read more about it here: http://facebook.github.io/react/blog/2015/02/20/introducing-relay-and-graphql.html).
As I've been working with traditional relational databases for a long time, moving to NoSQL, especially Cassandra, is a big change. I usually design my applications so that everything in the database is loaded into the application's internal caches on startup, and if there is any update to a database table, its corresponding cache is updated as well. For example, if I have a table Student, on startup all data in that table is loaded into a StudentCache, and when I want to insert/update/delete, I call a service which updates both of them at the same time. The aim of this design is to avoid selecting directly from the database.
In Cassandra, since the idea is to build tables containing all the needed data so that joins are unnecessary, I wonder whether my favorite design is still useful, or whether it is more effective to query data directly from the database (i.e. from one table) when required.
Based on your described use case, I'd say that querying data as you need it prevents storing data you don't need. Plus, what if your dataset is 5 GB? Are you still going to load the entire thing?
Maybe consider a design where you don't load all the data on startup, but load it as needed, then store it and check that store before querying again, which is exactly what a cache does!
Cassandra is built to scale, and your design can't handle scaling: you'll reach a point where your dataset is too large. Based on that, you should think about a trade-off: lots of on-the-fly querying vs. storing everything in the client. I would advise direct queries, but store the data when you do carry out a query; don't discard it and then run the same query again!
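As a sketch of that "load on demand, then remember it" suggestion, here is roughly what a read-through store could look like with the DataStax Java driver (3.x style API). The student table, its columns, and the unbounded in-memory map are assumptions for illustration; a real cache would cap its size and evict entries (e.g. with Guava, Caffeine, or Ehcache) instead of using a plain map.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class StudentRepository {

    private final Session session;
    // Simplified local store; a production cache should be bounded and evicting.
    private final Map<UUID, Student> cache = new ConcurrentHashMap<>();

    public StudentRepository(Session session) {
        this.session = session;
    }

    public Student findById(UUID id) {
        // Check the local store first; query Cassandra only on a miss.
        return cache.computeIfAbsent(id, key -> {
            Row row = session.execute(
                    "SELECT id, name FROM student WHERE id = ?", key).one();
            return row == null ? null : new Student(row.getUUID("id"), row.getString("name"));
        });
    }

    public void update(Student student) {
        // Write through: keep Cassandra and the local copy in sync,
        // much like the service in the question does today.
        session.execute("UPDATE student SET name = ? WHERE id = ?",
                student.name, student.id);
        cache.put(student.id, student);
    }

    public static class Student {
        final UUID id;
        final String name;
        Student(UUID id, String name) { this.id = id; this.name = name; }
    }

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        StudentRepository repo = new StudentRepository(cluster.connect("school"));
        // The first findById for a given id hits Cassandra; later calls are served locally.
    }
}
```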
I would suggest querying the data directly, since keeping all the data in the application ties its performance to the size of the input. Now, this might be fine if you know that the amount of data will never exceed your target machine's memory.
Should you decide later that this limit needs to grow, however, you will be faced with a problem. This approach is fast when it comes to searching (assuming you sort the results at startup), but it pretty much kills maintainability.
Your former favorite approach is still useful, though, should you choose to accept that constraint.
This week I read an interesting article that explains how the authors implemented an activity feed. Basically, they use two approaches to handle activities, which I'm adapting to my scenario. So suppose we have a user foo who has a certain number (x) of followers:
if x < 500, the activity will be copied to every follower's feed
this means slow writes, fast reads
if x > 500, only a link will be made between foo and his followers
in theory fast writes, but slow reads
So when some user accesses their activity feed, the server will fetch and merge all the data: fast lookups in their own copied activities, then a query across the links. If the timeline has a limit of 20, I fetch 10 of each and then merge.
I'm trying to do this with Riak and its Linking feature, so my question is: is linking faster than copying? Is my architecture good enough? Are there other solutions and/or technologies I should look at?
PS: I'm not implementing an activity feed for production; it's just to learn how to implement one that performs well, and to use Riak a bit.
Two thoughts.
1) No, Linking (in the sense of Riak Link Walking) is very likely not the right way to implement this. For one, each link is stored as a separate HTTP header, and there is a recommended limit in the HTTP spec on how many header fields you should send. (Although, to be fair, in tests you can use upwards of 1000 links in the headers with Riak and it seems to work fine. Still not recommended.) More importantly, querying those links via the Link Walking API actually uses MapReduce on the backend, and it is fairly slow for the kind of usage you're intending.
This is not to say that you can't store JSON objects that are lists of links, sure, that's a valid approach. I'm just recommending against using Riak links for this.
2) As for how to properly implement it, that's a harder question, and it depends on your traffic and use case. But your general approach is valid -- copy the feed below some threshold X (whether X is 500 or much smaller should be determined by testing), and link when the count is greater than X.
How should you link? You have 3 choices, all with tradeoffs. 1) Use Secondary Indices (2i), 2) Use Search, or 3) Use links "manually", meaning, store JSON documents with URLs that you dereference manually (versus using link walking queries).
I highly recommend watching this video: http://vimeo.com/album/2258285/page:2/sort:preset/format:thumbnail (Building a Social Application on Riak), by the Clipboard engineers, to see how they solved this problem. (They used Search for linking, basically).
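To make option 3 above a bit more concrete, here is a rough, storage-agnostic sketch in Java of the copy-vs-reference split and the read-time merge described in the question. The threshold, the helper methods, and the document layout are all assumptions; the helpers stand in for plain Riak reads and writes of JSON feed documents.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ActivityFeeds {

    static final int FAN_OUT_THRESHOLD = 500;   // tune by testing, as noted above
    static final int PAGE_SIZE = 20;

    public static class Activity {
        final String authorId;
        final long timestamp;
        final String text;
        Activity(String authorId, long timestamp, String text) {
            this.authorId = authorId; this.timestamp = timestamp; this.text = text;
        }
    }

    /** Write path: copy into each follower's feed for small authors, store only once for big ones. */
    public void publish(String authorId, Activity activity, List<String> followerIds) {
        if (followerIds.size() < FAN_OUT_THRESHOLD) {
            for (String followerId : followerIds) {
                List<Activity> feed = fetchFeed(followerId);   // copied activities
                feed.add(activity);
                storeFeed(followerId, feed);
            }
        } else {
            // Only append to the author's own feed; followers keep a reference to it.
            List<Activity> authorFeed = fetchAuthorFeed(authorId);
            authorFeed.add(activity);
            storeAuthorFeed(authorId, authorFeed);
        }
    }

    /** Read path: merge copied activities with those pulled through references. */
    public List<Activity> timeline(String userId, List<String> bigAuthorsFollowed) {
        List<Activity> merged = new ArrayList<>(fetchFeed(userId));
        for (String authorId : bigAuthorsFollowed) {
            merged.addAll(fetchAuthorFeed(authorId));
        }
        merged.sort(Comparator.comparingLong((Activity a) -> a.timestamp).reversed());
        return merged.subList(0, Math.min(PAGE_SIZE, merged.size()));
    }

    // Hypothetical storage helpers -- in practice these would read/write JSON
    // documents in Riak (for example, one document per user feed).
    List<Activity> fetchFeed(String userId) { throw new UnsupportedOperationException(); }
    void storeFeed(String userId, List<Activity> feed) { throw new UnsupportedOperationException(); }
    List<Activity> fetchAuthorFeed(String authorId) { throw new UnsupportedOperationException(); }
    void storeAuthorFeed(String authorId, List<Activity> feed) { throw new UnsupportedOperationException(); }
}
```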
95% of my app costs are related to write operations. In the last few days I've paid $150. And I wouldn't consider the amount of data stored to be that huge.
First I suspected that there might be a lot of write operations because of exploding indexes, but I read that this usually happens when you have two list properties in the same model, and during the model design phase those were limited to one per model at most.
I also went through my models and passed indexed=False on every property that I won't need to order or filter by.
One other thing I should disclose about my app is that writes are bundled: for some entities, when they need to be stored I call a proxy function that stores the entity and its derivative entities along with it. It's not done transactionally, since it's not a big deal if there's a write failure every now and then, but I don't really see a way around it given how the models are related.
So I'm eager to hear whether someone else has faced this problem and which approach/tools/etc. they used to solve it, or whether there are just some general things one can do.
You can try the Appstats tool. It shows per-request statistics for datastore calls and other RPCs, so you can see which requests are generating the write operations.