AWS Amplify GraphQL optimal nesting depth - reactjs

I've created a GraphQL schema in AWS AppSync (1,100 lines before autogenerating the resources, queries, mutations and subscriptions, 4,800 lines afterwards) to use in React. In my React project folder I executed
amplify init
amplify add codegen --apiId myId
then during execution I was asked for the desired nesting depth. My types are deeply nested, so a greater depth would be ideal, but unfortunately I discovered an enormous increase in file size:
3 levels: 2.5 MB
4 levels: 9 MB
5 levels: 50 MB
From the nesting point of view I would prefer 5 levels or maybe even more, but I want the app to be as performant as possible.
No customer wants to wait for 50 MB to load.
What is your experience with loading large generated files in GraphQL? Is it possible to split the large files automatically into smaller ones? At level 5 the files have up to one million lines of code each.
I use lazy loading in React for my components; it would be great if something similar were possible here as well.
Do you prefer one large nested query with many levels of depth or several cascaded smaller queries with a low depth?
Regards, Christian

Lazy...
... you probably don't need the full info about each node (all properties) at once.
You can fetch only id and name as deep as you need to get the structure/relations, and later fetch the additional properties required for rendering.
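A minimal sketch of that idea with Amplify's JavaScript API, assuming a hypothetical schema with Project and Task types (the type and field names here are invented for illustration, not taken from the question): instead of importing the huge generated deep query, hand-write a shallow query that only pulls id and name for the tree, and fetch full details for a single node on demand, e.g. when the user expands it.

import { API, graphqlOperation } from 'aws-amplify';

// Shallow query: only id/name down the tree, no deep details.
const listProjectsShallow = /* GraphQL */ `
  query ListProjectsShallow {
    listProjects {
      items {
        id
        name
        tasks {
          items { id name }
        }
      }
    }
  }
`;

// Detail query for a single node, fetched lazily on demand.
const getTaskDetails = /* GraphQL */ `
  query GetTaskDetails($id: ID!) {
    getTask(id: $id) {
      id
      name
      description
      assignee { id name }
    }
  }
`;

export async function loadStructure() {
  // One small request up front gives the whole tree skeleton.
  return API.graphql(graphqlOperation(listProjectsShallow));
}

export async function loadTaskDetails(id: string) {
  // Called only when a node is actually rendered/expanded.
  return API.graphql(graphqlOperation(getTaskDetails, { id }));
}

This keeps the hand-written queries tiny regardless of the codegen depth setting, at the cost of maintaining a few queries yourself next to the generated ones.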

Related

Symfony 4 - Cache vs DB solution

I am currently using the cache for my current project, but I'm not sure if it is the right thing to do.
I need to retrieve a lot of data from a web API (nodes that can be a picture, node, folder, gallery...). Those nodes will change very often, so I need fast access (loading up to 300-400 elements at once). Currently I store them in the cache (the key is an md5 of the node_id, so it is easy to retrieve and update).
It is working great so far, but if I clear the cache it takes up to 1 minute to rebuild it all.
Should I use a database to store those nodes? Would it be quicker, slower, or about the same?
Your question is very broad and thus hard to answer. Saving 300-400 elements under a cache key sounds problematic to me. Serializing the data when storing it in the cache and deserializing it when retrieving it can cause problems, and whenever your cache service is down your app will be practically unusable.
If you already run into problems when clearing/updating the cache, you might want to look for an alternative. This might be a database or Elasticsearch; advanced cache features like tagged caching could help prevent you from having to clear the whole cache when part of the information updates. You might also want to use something like the chain provider to store things in multiple caches, to prevent the aforementioned problem of an unreachable cache "breaking" your app (see the sketch after this answer). You could also look into a pattern that is common with CQRS called a read model.
There are a lot of variables that come into play. If you want to know which option will yield the best results, i.e. which one is quicker, you should do frequent performance tests with realistic data using Symfony's debug toolbar & profiler or a 3rd-party service like blackfire.io or tideways. You might also want to do capacity tests with a tool like JMeter to ensure those results still hold true when there are multiple simultaneous users.
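The chain-cache idea above is shown here as a language-agnostic sketch in TypeScript rather than Symfony's PHP cache component; the class and method names are invented for illustration. The point is simply that a fast in-process layer is consulted first and a slower shared layer second, so clearing or losing one layer degrades performance instead of breaking the app.

// Minimal interface both cache layers implement.
interface CacheLayer {
  get(key: string): Promise<string | undefined>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

// Simple in-process layer backed by a Map (TTL handling omitted for brevity).
class InMemoryCache implements CacheLayer {
  private store = new Map<string, string>();
  async get(key: string) {
    return this.store.get(key);
  }
  async set(key: string, value: string, _ttlSeconds: number) {
    this.store.set(key, value);
  }
}

// Chain cache: read from the fastest layer that has the key, then backfill
// the layers in front of it so the next read is cheap.
export class ChainCache implements CacheLayer {
  constructor(private layers: CacheLayer[]) {}

  async get(key: string): Promise<string | undefined> {
    for (let i = 0; i < this.layers.length; i++) {
      const value = await this.layers[i].get(key);
      if (value !== undefined) {
        await Promise.all(
          this.layers.slice(0, i).map((layer) => layer.set(key, value, 60))
        );
        return value;
      }
    }
    return undefined;
  }

  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    await Promise.all(
      this.layers.map((layer) => layer.set(key, value, ttlSeconds))
    );
  }
}

// Usage: put the fast local layer first, a shared/remote layer second, e.g.
// const cache = new ChainCache([new InMemoryCache(), redisBackedLayer]);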

DynamoDB - Do I need lots of read capacities to handle multiple getItem-calls per page?

I'm using DynamoDB to store items that are necessary to deliver a specific webpage. However, for one page load, the web server may easily need hundreds of items from about 2-5 different tables. If I have only one read capacity unit, I can only make 2 eventually consistent reads per second. Of course, if I need these items to deliver a webpage, I cannot wait one second for every DB call.
I already use batchGetItems to reduce the workload. Do I now simply need a lot more read capacity, or am I getting something wrong?
You should be thinking caching, not fetching.
Use either AWS ElastiCache (memcached) or Varnish-like caching.
You can also implement in-process caching using Google Guava.
It's possible to tune your read capacity based on usage, and that's one of the advantages of using a hosted solution like DynamoDB. You can set up CloudWatch alarms, receive notifications through an SNS topic, and create a simple app to increase/decrease your capacity. There is a nice post about it at: http://engineeringblog.txtweb.com/2013/09/txtweb-scaling-with-dynamodb/
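A minimal sketch of that "simple app to increase/decrease capacity" idea using the AWS SDK for JavaScript v3. The table name, region, and capacity numbers are invented for illustration; a real handler would be triggered by the CloudWatch/SNS notification and would scale within sensible limits.

import { DynamoDBClient, UpdateTableCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({ region: "eu-west-1" });

// Adjust provisioned read capacity on a hypothetical "PageItems" table.
export async function setReadCapacity(readCapacityUnits: number) {
  await client.send(
    new UpdateTableCommand({
      TableName: "PageItems",
      ProvisionedThroughput: {
        ReadCapacityUnits: readCapacityUnits,
        WriteCapacityUnits: 5, // example value, left unchanged here
      },
    })
  );
}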

How can I combine similar tasks to reduce total workload?

I use App Engine, but the following problem could very well occur in any server application:
My application uses memcache to cache both large (~50 KB) and small (~0.5 KB) JSON documents which aggregate information which is expensive to refresh from the datastore. These JSON documents can change often, but the changes are sparse in the document (i.e., one item out of hundreds may change at a time). Currently, the application invalidates an entire document if something changes, and then will lazily re-create it later when it needs it. However, I want to move to a more efficient design which updates whatever particular value changed in the JSON document directly from the cache.
One particular concern is contention from multiple tasks / request handlers updating the same document, but I have ways to detect this issue and mitigate it. However, my main concern is that it's possible that there could be rapid changes to a set of documents within a small period of time coming from different request handlers, and I don't want to have to edit the JSON document in the cache separately for each one. For example, it's possible that 10 small changes affecting the same set of 20 documents of 50 KB each could be triggered in less than a minute.
So this is my problem: What would be an effective solution to combine these changes together? In my old solution, although it is expensive to re-create an entire document when a small item changes, the benefit at least is that it does it lazily when it needs it (which could be a while later). However, to update the JSON document with a small change seems to require that it be done immediately (not lazily). That is, unless I come up with a complex solution that lazily applies a set of changes to the document later on. I'm hoping for something efficient but not too complicated.
Thanks.
Pull queue. Everyone using GAE should watch this video:
http://www.youtube.com/watch?v=AM0ZPO7-lcE
When a call comes in, update memcache and do an async_add to your task pull queue. You could likely run a process that handles thousands of updates each minute without a lot of overhead (i.e. instance issues). You will still have an issue should memcache get purged prior to your updates, but that is not too hard to work around. HTH. -stevep
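The pull-queue suggestion above is GAE-specific (its Python/Java task queue APIs), but the coalescing idea behind it is general. Here is a hedged, framework-agnostic sketch in TypeScript: request handlers push small patches onto a queue, and a single worker periodically drains the backlog, groups patches by document key, and rewrites each cached JSON document once per batch instead of once per change. All names are invented for illustration, and the in-memory array stands in for a real pull queue or message broker.

// A small change produced by a request handler.
interface Patch {
  docKey: string;   // which cached JSON document to change
  path: string;     // e.g. "items.42.price"
  value: unknown;   // new value to write at that path
}

const pending: Patch[] = [];

export function enqueuePatch(patch: Patch) {
  pending.push(patch);
}

// Worker: runs every few seconds, drains the backlog and touches each
// document only once, no matter how many patches it accumulated.
export function applyPendingPatches(
  cache: Map<string, Record<string, unknown>>
) {
  const batch = pending.splice(0, pending.length);
  const byDoc = new Map<string, Patch[]>();
  for (const p of batch) {
    const list = byDoc.get(p.docKey) ?? [];
    list.push(p);
    byDoc.set(p.docKey, list);
  }
  for (const [docKey, patches] of byDoc) {
    const doc = cache.get(docKey);
    if (!doc) continue; // evicted document: let it be rebuilt lazily later
    for (const p of patches) {
      setByPath(doc, p.path, p.value);
    }
    cache.set(docKey, doc); // one write per document per batch
  }
}

// Tiny dotted-path setter used above.
function setByPath(obj: Record<string, unknown>, path: string, value: unknown) {
  const keys = path.split(".");
  let cur: any = obj;
  for (const key of keys.slice(0, -1)) {
    cur = cur[key] ?? (cur[key] = {});
  }
  cur[keys[keys.length - 1]] = value;
}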

Strategy for caching of remote service; what should I be considering?

My web app contains data gathered from an external API over which I have no control. I'm limited to about 20,000 API requests per hour. I have about 250,000 items in my database, and each of them is essentially a cached version of an item from the API. It takes 1 request to update the cache of 1 item, so obviously it is not possible to have a perfectly up-to-date cache under these circumstances. What should I be considering when developing a strategy for caching the data? These are the things that come to mind, but I'm hoping someone has some good ideas I haven't thought of:
time since item was created (less time means more important)
number of 'likes' a particular item has (could mean higher probability of being viewed)
time since last updated
A few more details: the items are photos, and every photo belongs to an event. Events that are currently occurring are more likely to be viewed by clients, so they should take priority. Though I only have 250K items in the database now, that number increases rather rapidly (it will not be long until the 1 million mark is reached, maybe 5 months).
Would http://instagram.com/developer/realtime/ be any use? It appears that Instagram is willing to POST to your server when there's new (and maybe updated?) images for you to check out. Would that do the trick?
Otherwise, I think your problem sounds much like the problem any search engine has—have you seen Wikipedia on crawler selection criteria? You're dealing with many of the problems faced by web crawlers: what to crawl, how often to crawl it, and how to avoid making too many requests to an individual site. You might also look at open-source crawlers (on the same page) for code and algorithms you might be able to study.
Anyway, to throw out some thoughts on standards for crawling:
Update most often the things that have changed frequently in the past. So, if an item hasn't changed in the last five updates, then maybe you can assume it won't change as often and update it less.
Create a score for each image, and update the ones with the highest scores. Or the lowest scores (depending on what kind of score you're using). This is a similar thought to what is used by LilyPond to typeset music. Some ways to create input for such a score:
A statistical model of the chance of an image being updated and needing to be recached.
An importance score for each image, using things like the recency of the image, or the currency of its event.
Update things that are being viewed frequently.
Update things that have many views.
Does time affect the probability that an image will be updated? You mentioned that newer images are more important, but what about the probability of changes on older ones? Slow down the frequency of checks of older images.
Allocate part of your requests to slowly updating everything, and split up the other parts to process the results of several different algorithms simultaneously. So, for example, you could have the following (the numbers are for show/example only -- I just pulled them out of a hat); a rough sketch of this kind of score-plus-budget split in code follows the list:
5,000 requests per hour churning through the complete contents of the database (provided they've not been updated since the last time that crawler came through)
2,500 requests processing new images (which you mentioned are more important)
2,500 requests processing images of current events
2,500 requests processing images that are in the top 15,000 most viewed (as long as there has been a change in the last 5 checks of that image, otherwise, check it on a decreasing schedule)
2,500 requests processing images that have been viewed at least
Total: 15,000 requests per hour.
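To make the scoring and allocation ideas above concrete, here is a hedged TypeScript sketch. The score weights, bucket sizes, and the Photo shape are all invented for illustration (they only loosely mirror the example split above) and would need tuning against real usage data.

// Invented shape for a cached photo record.
interface Photo {
  id: string;
  createdAt: Date;
  lastCheckedAt: Date;
  unchangedChecks: number;   // consecutive checks that found no change
  views: number;
  eventIsLive: boolean;      // the photo's event is currently occurring
}

const HOUR_MS = 3_600_000;

function ageHours(p: Photo, now: Date): number {
  return (now.getTime() - p.createdAt.getTime()) / HOUR_MS;
}

// Rough "refresh priority" score: live-event, new, popular photos that are
// stale and still changing come first. Weights are arbitrary examples.
function refreshScore(p: Photo, now: Date): number {
  const staleHours = (now.getTime() - p.lastCheckedAt.getTime()) / HOUR_MS;
  return (
    (p.eventIsLive ? 50 : 0) +
    Math.max(0, 48 - ageHours(p, now)) +   // recency bonus for ~2 days
    Math.log1p(p.views) * 5 +              // popularity
    staleHours -                           // longer unchecked => higher priority
    p.unchangedChecks * 2                  // de-prioritize stable images
  );
}

// Split the hourly budget into buckets, then fill each bucket with the
// highest-scoring photos that match its predicate.
export function planHourlyRefresh(photos: Photo[], now: Date): Photo[] {
  const buckets: Array<{ size: number; match: (p: Photo) => boolean }> = [
    { size: 5000, match: () => true },                    // slow sweep of everything
    { size: 2500, match: (p) => ageHours(p, now) < 24 },  // new images
    { size: 2500, match: (p) => p.eventIsLive },          // current events
    { size: 2500, match: (p) => p.views > 1000 },         // frequently viewed
    { size: 2500, match: () => true },                    // spare budget, best scores
  ];
  const ranked = [...photos].sort(
    (a, b) => refreshScore(b, now) - refreshScore(a, now)
  );
  const chosen = new Set<Photo>();
  for (const bucket of buckets) {
    let used = 0;
    for (const p of ranked) {
      if (used >= bucket.size) break;
      if (!chosen.has(p) && bucket.match(p)) {
        chosen.add(p);
        used++;
      }
    }
  }
  return [...chosen]; // at most 15,000 photos to refresh this hour
}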
How many (unique) photos / events are viewed on your site per hour? Photos that are not viewed probably don't need to be updated often. Do you see any patterns in views for old events / photos? Old events might not be as popular, so perhaps they don't have to be checked that often.
andyg0808 has good, detailed information; however, it is important to know the patterns of your data usage before applying it in practice.
At some point you will find that 20,000 API requests per hour will not be enough to update frequently viewed photos, which might lead you to different questions as well.

Dumping Twitter Streaming API tweets as-is to Apache Cassandra for post-processing

I am using the Twitter Streaming API to monitor several keywords/users. I am planning to dump the tweet JSON strings I get from Twitter directly, as-is, into a Cassandra database and do post-processing on them later.
Is such a design practical? Will it scale up when I have millions of tweets?
Things I will do later include getting the top followed users, top hashtags, etc. I would like to save the stream as-is so I can mine it later for any new information that I may not know of now.
What is important is not so much the number of tweets as the rate at which they arrive. Cassandra can easily handle thousands of writes per second, which should be fine (Twitter currently generates around 1200 tweets per second in total, and you will probably only get a small fraction of those).
However, tweets per second are highly variable. In the aftermath of a heavy spike in writes, you may see some slowdown in range queries. See the Acunu blog posts on Cassandra under heavy write load part i and part ii for some discussion of the problem and ways to solve it.
In addition to storing the raw JSON, I would extract some common features that you are almost certain to need, such as the user ID and the hashtags, and store those separately as well (a sketch of this follows below). This will save you a lot of processing effort later on.
Another factor to consider is to plan for how the data stored will grow over time. Cassandra can scale very well, but you need to have a strategy in place for how to keep the load balanced across your cluster and how to add nodes as your database grows. Adding nodes can be a painful experience if you haven't planned out how to allocate tokens to new nodes in advance. Waiting until you have an overloaded node before adding a new one is a good way to make your cluster fall down.
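A minimal sketch of that "raw JSON plus extracted columns" idea using the DataStax Node.js driver (cassandra-driver). The keyspace, table, and column names are invented for illustration, and the schema would need to match how you actually plan to query the data.

import { Client, types } from "cassandra-driver";

// Assumed table, e.g. created with:
//   CREATE TABLE tweets_raw (tweet_id bigint PRIMARY KEY, raw_json text,
//                            user_id bigint, hashtags set<text>, created_at timestamp);
const client = new Client({
  contactPoints: ["127.0.0.1"],
  localDataCenter: "datacenter1",
  keyspace: "twitter",
});

export async function storeTweet(rawJson: string) {
  const tweet = JSON.parse(rawJson);
  const hashtags: string[] = (tweet.entities?.hashtags ?? []).map(
    (h: { text: string }) => h.text.toLowerCase()
  );

  // Store the untouched payload plus the few fields we already know we need,
  // so later jobs (top users, top hashtags) don't have to re-parse everything.
  await client.execute(
    `INSERT INTO tweets_raw (tweet_id, raw_json, user_id, hashtags, created_at)
     VALUES (?, ?, ?, ?, ?)`,
    [
      types.Long.fromString(tweet.id_str),
      rawJson,
      types.Long.fromString(tweet.user.id_str),
      hashtags,
      new Date(tweet.created_at),
    ],
    { prepare: true }
  );
}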
You can easily store millions of tweets in Cassandra.
For processing the tweets and getting stats such as the top followed users and hashtags, look at Brisk from DataStax, which builds on top of Cassandra.
