SWUpdate takes too much time to update the system

I am working with SWUpdate to update my system. The update strategy is asymmetric (the system consists of a rootfs plus a recovery partition).
SWUpdate succeeds in updating the system (running from the rescue partition), but the problem is that the update process takes a long time (about 10 minutes for a 256 MB image).
Since I am working on a constrained system with real-time requirements, a simple update should not take this long.
In this case, are there any configuration options I can tune to reduce the time SWUpdate takes?
Thanks,
Amal.

Related

Clickhouse DB slows down on a daily basis at 10am for seemingly no reason

I have been using Clickhouse at work for analytics purposes for a while now.
I am currently running Clickhouse v22.6.3 revision 54455 on-premise on a VM with:
fast storage
200 GB of RAM
no swap
a 40-core CPU.
I have a few TB of data, but no table bigger than 300 GB. I do not use distributed tables or replication yet, and I write frequently into Clickhouse (but I don't use deletes or updates and prefer using things like the ReplacingMergeTree engine). I also leverage the MaterializedView feature for a few tables. Let me know if you need any more context or parameters; I use a pretty standard configuration.
Now, for a few months I have been experiencing performance issues where the server significantly slows down every day at 10am, and I cannot figure out why.
Based on Clickhouse's built-in Graphite monitoring, the "symptoms" of the issue seem to be as follows:
At 10am:
On the server side:
Both load and RAM usage remain reasonable. Load goes up a little.
Disk write await time goes up (which I suspect is what leads to higher load)
Disk utilization % skyrockets to something between 90 and 100%
On Clickhouse side:
DiskSpaceReservedForMerge stays roughly the same (i.e. between 0 and 70 GB)
both OpenFileForRead and OpenFileForWrite go up by a factor of ~2
BackgroundCommonPoolTask goes slightly up, so does BackgroundSchedulePoolTask (which I found weird, because I thought this pool was dedicated to distributed operations - which I don't use) - both numbers remain seemingly reasonable
The number of active Merge tasks per minute drops significantly, but I'm unsure whether it's a consequence of the slow writes or the cause of them
Both insert and general query times are multiplied by ~10, which renders the database effectively unusable even for small tasks
Restarting Clickhouse usually fixes the problem, but I obviously do not want to restart my main database every day at 10am. Most of the heavy load I put on the DB (such as data extraction and transformation, etc.) happens earlier in the morning (and ends around 7-8am) and runs fine. I do not have any heavy tasks running at 10am. The Clickhouse VM takes most of its host's resources, and I have confirmed with the DevOps team that there doesn't seem to be a problem on the host or anything else scheduled on it at that time.
Is there any kind of background task or process that Clickhouse runs on a daily basis and that could have a high impact on our disk capacity? What else can I monitor to figure out what is causing this problem?
Again, let me know if I can be more thorough on our settings and the state of the DB when the "bug" occurs.
Do you use https://github.com/innogames/graphite-ch-optimizer?
Do you use TTL?
select * from system.merges;
select * from system.part_log where event_time between '... 09:30:00' and '... 10:30:00';  -- i.e. a window around the 10am slowdown on an affected day

How to make Flink job with huge state finish

We are running a Flink cluster to process terabytes of historic streaming data. The computation keeps a huge state, for which we use keyed state (Value and Map states) with the RocksDB backend. At some point in the job the performance starts degrading, and the input and output rates drop to almost 0. At that point exceptions like "Communication with TaskManager X timeout error" can be seen in the logs, although the job is already compromised before that.
I presume the problem we are facing has to do with RocksDB's disk backend. As the state of the job grows it needs to access the disk more often, which drags performance towards 0. We have played with some of the options and have set those which make sense for our particular setup:
We are using the SPINNING_DISK_OPTIMIZED_HIGH_MEM predefined profile, further tuned with optimizeFiltersForHits and some other options, which has somewhat improved performance. However, none of this gives us a stable computation, and on a re-run against a bigger data set the job halts again.
What we are looking for is a way to modify the job so that it progresses at SOME speed even as the input and the state grow. We are running on AWS, with the Task Manager memory limited to around 15 GB and no limit on disk space.
Using SPINNING_DISK_OPTIMIZED_HIGH_MEM costs a lot of off-heap memory for RocksDB's memtables. Seeing as you are running the job with a memory limit of around 15 GB, I think you will run into OOM issues; but if you choose the default predefined profile, you will face write stalls or CPU overhead from decompressing RocksDB's cached blocks. So I think you should increase the memory limit.
Here are some posts about RocksDB, FYI:
https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB
https://www.ververica.com/blog/manage-rocksdb-memory-size-apache-flink
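To make the knobs discussed above concrete, here is a minimal Java sketch (assuming a reasonably recent Flink, 1.13+) of how the RocksDB backend and the predefined profile are typically wired into a job. Treat it as an illustration rather than a recommended configuration for this particular workload; the overall memory budget itself is normally set outside the code, e.g. via taskmanager.memory.managed.size in flink-conf.yaml.

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbJobSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Incremental checkpoints avoid re-uploading the full (huge) state on every checkpoint.
        EmbeddedRocksDBStateBackend backend = new EmbeddedRocksDBStateBackend(true);

        // The predefined profile discussed above; it trades extra memtable/off-heap memory
        // for fewer disk reads, so it only helps if the TaskManager has memory to spare.
        backend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM);

        // Per-column-family tweaks such as optimizeFiltersForHits would go into a custom
        // RocksDBOptionsFactory passed via backend.setRocksDBOptions(...).

        env.setStateBackend(backend);

        // ... define sources, keyed state, sinks as usual, then:
        // env.execute("huge-state-job");
    }
}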

How can I tell if my Postgres 9.2 database is running in memory?

I'd like to know whether the database's contents are in my system's RAM when I run a query. The dataset is approx. 4.1 GB and my machine has 8 GB of RAM. Am I reading from disk every time I run a SELECT or UPDATE query?
Aside from monitoring IO activity as others have suggested, you can also run a query to take advantage of PostgreSQL's stats tracking.
The following query will show your cache hit rate. If you are hitting only cache, the hit rate should be somewhere around 0.99 or higher; if you're doing a lot of disk reads, it'll be lower.
SELECT
sum(heap_blks_read) as heap_read,
sum(heap_blks_hit) as heap_hit,
sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio
FROM
pg_statio_user_tables;
This query, and other performance queries can be found here
All systems allow you to track I/O activity, so you can use system monitoring tools; for example, there is iotop on Linux.
If that query is the only thing actively running on the system, use the system tools (vmstat, sar) to see if there is a spike in IO when it executes. If there is a lot of other things going on, it can be very hard to figure out what you want, as there is no easy way to distinguish data actually read from disk from data read from the OS's file system cache. You can turn on track_io_timing and see if the resulting times are consistent with the data coming from RAM.

Is relational database appropriate for soft real-time system?

I'm working on a real-time video analysis system which processes the video stream frame by frame. At each frame it can generate several events which should be recorded and some delivered to another system via network. The system is soft real-time, i.e. message latencies higher than 25ms are highly undesirable, but not fatal.
Are relational databases (specifically, MySQL and Postgres) appropriate as the datastore for such system?
Can I expect the DB to work well when it is installed on its own server and receives ~50 streams of single-row SQL inserts at 25 fps each over the network?
EDIT: I think in general performance would not be a problem, but I worry about the latency variance. If it will occasionally delay for 1000 ms, that would be very bad.
Oh, and the system runs 24/7 so the DB could grow arbitrarily big. Does that degrade the insert latency?
I wouldn't worry too much about performance when choosing a relational database over another type of datastore, choose the solution that best meets your requirements for accessing that data later. However, if you do choose not only a RDBMS but one over the network then you might want to consider buffering events to a local disk briefly on their way over to the DB. Use a separate thread or process or something to push events into the DB to keep the realtime system unaffected.
The biggest problems are how unpredictable the latency will be, and that it never goes down, always up. But modern hardware comes to the rescue: specify a machine with enough CPU cores. You can count on at least two, and getting four is easy. So you can spin up a thread and dedicate one core to the database updates, isolating them from your soft real-time code. Now you don't care about the variability in the delays, at least as long as the updates don't take so long that you generate data faster than the database can consume it.
Set up a database server and load it up with fake data, double the amount you think it will ever need to store. Test continuously while you develop, and add the instrumentation code you need to measure how it is doing at an early stage in the project.
As I've written, if you queue the rows that need to be saved and save them asynchronously (so as not to stall the "main" thread), there shouldn't be any problem... BUT!!!
You want to save them in a DB... so someone else will read the rows AT THE SAME TIME they are being written. Sadly, it's normally quite difficult to tell a DB "this work is very high priority, everything else can be stalled but not this". So if someone does:
BEGIN TRANSACTION
SELECT COUNT(*) FROM TABLE
WAITFOR DELAY '01:00:00'
(I'm using T-SQL here... but I think it's quite clear: ask for the COUNT(*) of the table, so that there is a lock on the table, and then WAITFOR an hour)
then the writes could be stalled and time out. In general, if you configure everyone except the app to only be able to do reads, these problems shouldn't be present.
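To illustrate the buffering/queueing approach suggested in the answers above, here is a minimal Java sketch of a background writer that drains an in-memory queue and inserts events in batches. The FrameEvent record, the frame_events table, and the JDBC URL are all hypothetical, and a real system would add reconnection and spill-to-disk handling.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncEventWriter implements Runnable {

    // Hypothetical event carrying a frame timestamp and a payload (requires Java 16+ for records).
    public record FrameEvent(long frameTimeMillis, String payload) {}

    private final BlockingQueue<FrameEvent> queue = new LinkedBlockingQueue<>(100_000);

    // Called from the real-time pipeline; never blocks the frame loop.
    public boolean offer(FrameEvent e) {
        return queue.offer(e); // drop or spill to local disk if the queue is full
    }

    @Override
    public void run() {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://dbhost/events");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO frame_events (frame_time, payload) VALUES (?, ?)")) {
            List<FrameEvent> batch = new ArrayList<>();
            while (true) {
                batch.clear();
                batch.add(queue.take());    // wait for at least one event
                queue.drainTo(batch, 499);  // then grab whatever else is pending
                for (FrameEvent e : batch) {
                    ps.setTimestamp(1, new Timestamp(e.frameTimeMillis()));
                    ps.setString(2, e.payload());
                    ps.addBatch();
                }
                ps.executeBatch();          // one round trip per batch, not per event
            }
        } catch (Exception ex) {
            ex.printStackTrace();           // in a real system: reconnect or spill to disk
        }
    }

    public static void main(String[] args) {
        AsyncEventWriter writer = new AsyncEventWriter();
        new Thread(writer, "db-writer").start();
        // the real-time loop would call writer.offer(new FrameEvent(System.currentTimeMillis(), "..."));
    }
}

The real-time loop only ever calls offer(), so a slow or locked database shows up as a growing queue rather than as added frame latency.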

Database Network Latency

I am currently working on an n-tier system and battling some database performance issues.
One area we have been investigating is the latency between the database server and the application server. In our test environment the average ping time between the two boxes is in the region of 0.2 ms, but on the client's site it is more in the region of 8.2 ms. Is that something we should be worried about?
For your average system, what do you consider a reasonable latency, and how would you go about testing/measuring it?
Yes, network latency (measured by ping) can make a huge difference.
If the database itself responds in a fraction of a millisecond, then going from a 0.2 ms to an 8 ms ping will have a huge impact. I've heard that database protocols are chatty, which, if true, means they are affected more by network latency than HTTP would be.
More than likely, if you are running one query, adding 8 ms to get the reply from the DB is not going to matter. But if you are doing 10,000 queries, which generally happens with bad code or non-optimized use of an ORM, then you will wait an extra 80 seconds with an 8 ms ping, whereas with a 0.2 ms ping you would wait only about 2 seconds.
As a matter of policy for myself, I never let client applications contact the database directly. I require that client applications always go through an application server (e.g. a REST web service). That way, if I accidentally have an "N+1" ORM issue, it is not nearly as impactful. I would still try to fix the underlying problem...
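To make the round-trip arithmetic above concrete, here is a hypothetical JDBC sketch contrasting the chatty per-row pattern with a single set-based query; the customers and todays_orders tables are made up for illustration.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

public class RoundTripDemo {

    // Chatty pattern: one network round trip per id (roughly ids.size() * ping time).
    static void loadOneByOne(Connection conn, List<Integer> ids) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT name FROM customers WHERE id = ?")) {
            for (int id : ids) {
                ps.setInt(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) { /* consume row */ }
                }
            }
        }
    }

    // Set-based pattern: a single round trip regardless of how many rows come back.
    static void loadInOneQuery(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                "SELECT id, name FROM customers WHERE id IN (SELECT customer_id FROM todays_orders)")) {
            while (rs.next()) { /* consume rows */ }
        }
    }
}

With an 8 ms ping, the first method pays roughly ids.size() * 8 ms in network time alone, while the second pays it once.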
In short: no!
What you should monitor is the global performance of your queries (i.e. transport to the DB + execution + transport back to your server).
What you could do is use a performance counter to monitor the time your queries usually take to execute.
You'll probably see your results are over the millisecond area.
There's no such thing as "Reasonable latency". You should rather consider the "Reasonable latency for your project", which would vary a lot depending on what you're working on.
People don't have the same expectation for a real-time trading platform and for a read only amateur website.
On a Linux-based server you can test the effect of latency yourself using the tc command.
For example, this command will add a 10 ms delay to all packets going out via eth0:
tc qdisc add dev eth0 root netem delay 10ms
use this command to remove the delay
tc qdisc del dev eth0 root
More details available here:
http://devresources.linux-foundation.org/shemminger/netem/example.html
All applications will differ, but I have definitely seen situations where 10ms latency has had a significant impact on the performance of the system.
One of the head honchos at answers.com said that, according to their studies, a 400 ms wait for a web page load is about the point at which people first start cancelling the page load and going elsewhere. My advice is to look at the whole process, from the original client request to fulfillment, and if you're doing well there, there's no need to optimize further. 8.2 ms vs 0.2 ms is about a 40x difference mathematically, but from a human standpoint no one can really perceive an 8 ms difference. It's why they have photo finishes in races ;)
