Compiler options providing CPU cache line alignment are considered to achieve more cache hits. But in multi threaded or multi processing systems what guarantees to cache usage of the optimized code will obtain expected cache usage at run time while other processes or threads also use the same cache. In especially, multi core systems provide multiple cpu threads in parallel using cpu cache.
Cache line alignment can improve the hit rate because data is less likely to be split between lines, reducing the number of lines whose invalidation could cause a cache miss for that data. That's true whether it's a single thread or multiple threads sharing the cache.
Related
I have a Flink job processing data at around 200k qps. Without checkpoints, the job is running fine.
But when I tried to add checkpoints (with interval 50mins), it causes backpressue at the first task, which is adding a key field to each entry, the data lagging goes up constantly as well.
the lagging of my two Kafka topics, first half was having checkpoints enabled, lagging goes up very quickly. second part(very low lagging was having checkpoints disabled, where the lagging is within milliseconds)
I am using at least once checkpoint mode, which should be asynchronized process. Could anyone suggest?
My checkpointing setting
env.enableCheckpointing(1800000,
CheckpointingMode.AT_LEAST_ONCE);
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
env.getCheckpointConfig()
.enableExternalizedCheckpoints(
CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
env.getCheckpointConfig()
.setCheckpointTimeout(10min);
env.getCheckpointConfig()
.setFailOnCheckpointingErrors(
jobConfiguration.getCheckpointConfig().getFailOnCheckpointingErrors());
my job has 128 containers.
With 10mins checkpoint time, following is the stats:
I am trying to use a 30mins checkpoint and see
I was trying to tune memory usage, but it seems not working.
But in the task manager, it's still:
TLDR; it's sometimes hard to analyse the problem. I have two lucky guesses/shots - if you are using RocksDB state backend, you could switch to FsStateBackend - it's usually faster and RocksDB makes most sense with large state sizes, that do not fit into memory (or if you really need incremental checkpointing feature). Second is to fiddle with parallelism, either increasing or decreasing.
I would suspect the same thing that #ArvidHeise wrote. You checkpoint size is not huge, but it's also not trivial. It can add the extra overhead to bring the job over the threshold of barely keeping up with the traffic, to not keeping up and causing the backpressure. If you are under backpressure, the latency will just keep accumulating, so even a change in couple of % of extra overhead can make a difference between end to end latencies of milliseconds vs unbounded ever growing value.
If you can not just simply add more resources, you would have to analyse what's exactly adding this extra over head and what resource is the bottleneck.
Is it CPU? Check CPU usage on the cluster. If it's ~100%, that's the thing you need to optimise for.
Is it IO? Check IO usage on the cluster and compare it against the maximal throughput/number of requests per second that you can achieve.
If both CPU & IO usage is low, you might want to try to increase parallelism, but...
Keep in mind data skew. Backpressure could be caused by a single task and in that case it makes it hard to analyse the problem, as it will be a single bottlenecked thread (on either IO or CPU), not whole machine.
After figuring out what resource is the bottleneck, next question would be why? It might be immediately obvious once you see it, or it might require digging in, like checking GC logs, attaching profiler etc.
Answering those questions could either give you information what you could try to optimise in your job or allow you to tweak configuration or could give us (Flink developers) an extra data point what we could try to optimise on the Flink side.
Any kind of checkpointing adds computation overhead. Most of the checkpointing is asynchronously (as you have stated), but it still adds up general I/O operations. These additional I/O request may, for example, congest your access to external systems. Also if you enable checkpointing, Flink needs to keep track of more information (new vs. already checkpointed).
Have you tried to add more resources to your job? Could you share your whole checkpointing configuration?
My problem is exactly similar to this except that Backpressure in my application is coming as "OK".
I thought the problem was with my local machine not having enough configuration, so I created a 72 core Windows machine, where I am reading data from Kafka, processing it in Flink and then writing the output back in Kafka. I have checked, writing into Kafka Sink is not causing any issues.
All I am looking for are the areas that may be causing a split in Throughput among task slots by increasing parallelism?
Flink Version: 1.7.2
Scala version: 2.12.8
Kafka version: 2.11-2.2.1
Java version: 1.8.231
Working of application: Data is coming from Kafka (1 partition) which is deserialized by Flink (throughput here is 5k/sec). Then the deserialized message is passed through basic schema validation (Throughput here is 2k/sec).
Even after increasing the parallelism to 2, throughput at Level 1 (deserializing stage) remains same and doesn't increase two fold as per expectation.
I understand, without the code, it is difficult to debug so I am asking for the points which you can suggest for this problem, so that I can go back to my code and try that.
We are using 1 Kafka partition for our input topic.
If you want to process data in parallel, you actually need to read data in parallel.
There are certain requirements to read data in parallel. The most important once are that the source is able to actually split the data into smaller work chunks. For example, if you read from a file system, you have multiple files, or the system subdivides the files into splits. For Kafka, this necessarily means that you have to have more partitions. Ideally, you have at least as many partitions than you have max consumer parallelism.
The 5k/s seems to be the maximum throughput that you can achieve on one partition. You can also calculate the number of partitions by the maximum throughput you want to achieve. If you need to achieve 50k/s, you need at least 10 partitions. You should use more to also catch up in case of reprocessing or failure recovery.
Another way to distribute the work is to add a manual shuffle step. That means, if you keep the single input partition, you would still only reach 5k/s, but after that the work is actually redistributed and processed in parallel, such that you will not see a huge decline in your throughput afterwards. After a shuffle operation, work is somewhat evenly distributed among the parallel downstream tasks.
What could be the reasons for Redis slow work/response?
i.e. I found on Stackoverflow that storing large files or data in Redis makes it slow. What's else?
There is no simple answer to this question. With all NoSQL or SQL based storage solutions, there are plenty of conditions that could result in high latency or slowness of the storage engine. Redis is no exception.
I would suggest to start by reading:
How fast is Redis?
Redis latency problems troubleshooting
Here is a non exhaustive list of potential reasons:
Inadequate hardware (network, memory, CPU)
Software based virtualization (Xen on low-end hardware for instance)
Not enough memory, generating swapping at the OS level
Too many O(n) operations (like KEYS) executed in the single-threaded engine
Large objects stored in Redis, leading to uncontrolled expansion of the communication buffers
Huge number of simultaneous sessions (>30000)
Too many connection operations per second (Redis is not a webserver, connections are supposed to be permanent, not transient).
Too many roundtrips generated by the client application (no pipelining or aggregated command usage)
Large fork operations generated by bgsave or AOF rewrite (especially on VMs)
I/O related latencies when AOF is used
Accumulation of many expire operations triggered at the same time
Accumulation of memory in client and master/slave communication buffers, or slow log data
TCP incast conditions when network bandwidth consumption is significant
Using distributed storage (and especially cloudy ones such as EC2 EBS) to store dump or AOF files
There are probably many other reasons, related to the workload generated by your own application.
If some people think about other general reasons, we can add them to this list.
As was mentioned new connections, > 200 per minute could cause slownesses. A Possible solution is to add a proxy that keeps constant number of connections:
twemproxy
envoy
I deployed an instance of Solr onto a ubuntu machine with tomcat. Then i have a single thread client program to read and inject data into Solr. I am observing memory and cpu usages, and realized that I still have a lot of resources (in terms of memory and CPUs) to use. I wonder if I should change my indexing code to multi-threading to inject into Solr? To index 20 millions of data using current single thread program, it needs about 14 hours. This is why i wonder if i should change to use multi-threading as well. Thanks in advance for your suggestions and help! :)
Multi-threading while indexing in Solr is widely used.
What you say is not very clear if you can also multi-thread the reading from your source, but I think that is the way to go.
I suggest you try it, but first try to analize your code and see which part of the code is the slowest and include that in the multi-threading.
Also keep an eye on your commit strategy.
From the Solr documentation: (http://wiki.apache.org/solr/SolrPerformanceFactors)
"In general, adding many documents per update request is faster than one per update request. ...
Reducing the frequency of automatic commits or disabling them entirely may speed indexing. Beware that this can lead to increased memory usage, which can cause performance issues of its own, such as excessive swapping or garbage collection."
Here's the deal. We would have taken the complete static html road to solve performance issues, but since the site will be partially dynamic, this won't work out for us.
What we have thought of instead is using memcache + eAccelerator to speed up PHP and take care of caching for the most used data.
Here's our two approaches that we have thought of right now:
Using memcache on >>all<< major queries and leaving it alone to do what it does best.
Usinc memcache for most commonly retrieved data, and combining with a standard harddrive-stored cache for further usage.
The major advantage of only using memcache is of course the performance, but as users increases, the memory usage gets heavy. Combining the two sounds like a more natural approach to us, even though the theoretical compromize in performance.
Memcached appears to have some replication features available as well, which may come handy when it's time to increase the nodes.
What approach should we use?
- Is it stupid to compromize and combine the two methods? Should we insted be focusing on utilizing memcache and instead focusing on upgrading the memory as the load increases with the number of users?
Thanks a lot!
Compromize and combine this two method is a very clever way, I think.
The most obvious cache management rule is latency v.s. size rule, which is used in CPU cached also. In multi level caches each next level should have more size for compensating higher latency. We have higher latency but higher cache hit ratio. So, I didn't recommend you to place disk based cache in front of memcache. Сonversely it's should be place behind memcache. The only exception is if you cache directory mounted in memory (tmpfs). In this case file based cache could compensate high load on memcache, and also could have latency profits (because of data locality).
This two storages (file based, memcache) are not only storages that are convenient for cache. You also could use almost any KV database as they are very good at concurrency control.
Cache invalidation is separate question which can engage your attention. There are several tricks you could use to provide more subtle cache update on cache misses. One of them is dog pile effect prediction. If several concurrent threads got cache miss simultaneously all of them go to backend (database). Application should allow only one of them to proceed and rest of them should wait on cache. Second is background cache update. It's nice to update cache not in web request thread but in background. In background you can control concurrency level and update timeouts more gracefully.
Actually there is one cool method which allows you to do tag based cache tracking (memcached-tag for example). It's very simple under the hood. With every cache entry you save a vector of tags versions which it is belongs to (for example: {directory#5: 1, user#8: 2}). When you reading cache line you also read all actual vector numbers from memcached (this could be effectively performed with multiget). If at least one actual tag version is greater than tag version saved in cache line then cache is invalidated. And when you change objects (for example directory) appropriate tag version should be incremented. It's very simple and powerful method, but have it's own disadvantages, though. In this scheme you couldn't perform efficient cache invalidation. Memcached could easily drop out live entries and keep old entries.
And of course you should remember: "There are only two hard things in Computer Science: cache invalidation and naming things" - Phil Karlton.
Memcached is quite a scalable system. For instance, you can replicate cache to decrease access time for certain key buckets or implement Ketama algorithm that enables you to add/remove Memcached instances from pool without remap of all keys. In this way, you can easily add new machines dedicated to Memcached when you happen to have extra memory. Furthermore, as its instance can be run with different sizes, you can throw up one instance by adding more RAM to an old machine. Generally, this approach is more economic and to some extent does not inferior to the first one, especially for multiget() requests. Regarding a performance drop with data growth, the runtime of the algorithms used in Memcached does not vary with the size of the data, and therefore the access time depend only on number of simultaneous requests. Finally, if you want to tune your memory/performance priorities you can set expire time and available memory configuration values which will strict RAM usage or increase cache hits.
At the same time, when you use a hard-disk the file system can become a bottleneck of your application. Besides general I/O latency, such things as fragmentation and huge directories can noticeably affect your overall request speed. Also, beware that default Linux hard disk settings are tuned more for compatibility than for speed, so it is advisable to configure it properly before usage (for instance, you can try hdparm utility).
Thus, before adding one more integrating point, I think you should tune the existent system. Usually, properly designed database, configured PHP, Memcached and handling of static data should be enough even for a high-load web site.
I would suggest that you first use memcache for all major queries. Then, test to find queries that are least used or data that is rarely changed and then provide a cache for this.
If you can isolate common data from rarely used data, then you can focus on improving performance on the more commonly used data.
Memcached is something that you use when you're sure you need to. You don't worry about it being heavy on memory, because when you evaluate it, you include the cost of the dedicated boxes that you're going to deploy it on.
In most cases putting memcached on a shared machine is a waste of time, as its memory would be better used caching whatever else it does instead.
The benefit of memcached is that you can use it as a shared cache between many machines, which increases the hit rate. Moreover, you can have the cache size and performance higher than a single box can give, as you can (and normally would) deploy several boxes (per geographical location).
Also the way memcached is normally used is dependent on a low latency link from your app servers; so you wouldn't normally use the same memcached cluster in different geographical locations within your infrastructure (each DC would have its own cluster)
The process is:
Identify performance problems
Decide how much performance improvement is enough
Reproduce problems in your test lab, on production-grade hardware with necessary driver machines - this is nontrivial and you may need a lot of dedicated (even specialised) hardware to drive your app hard enough.
Test a proposed solution
If it works, release it to production, if not, try more options and start again.
You should not
Cache "everything"
Do things without measuring their actual impact.
As your performance test environment will never be perfect, you should have sufficient instrumentation / monitoring that you can measure performance and profile your app IN PRODUCTION.
This also means that every single thing that you cache should have a cache hit/miss counter on it. You can use this to determine when the cache is being wasted. If a cache has a low hit rate (< 90%, say), then it is probably not worthwhile.
It may also be worth having the individual caches switchable in production.
Remember: OPTIMISATIONS INTRODUCE FUNCTIONAL BUGS. Do as few optimisations as possible, and be sure that they are necessary AND effective.
You can delegate the combination of disk/memory cache to the OS (if your OS is smart enough).
For Solaris, you can actually even add SSD layer in the middle; this technology is called L2ARC.
I'd recommend you to read this for a start: http://blogs.oracle.com/brendan/entry/test.