Why does Redis save partial data sets?

Why does Redis save partial data sets? - c

Version: Redis 5.0.3
In redis.conf there is an option to set snapshot period. I set this period as every 5 sec if one value is changed to see the Redis performance when it dumps. I ran two application; one is redis-server and the other is redis-benchmark.
While I was watching a log, I found out some interesting thing like below.
7269:C 27 Feb 2019 14:48:39.463 * RDB: 4535 MB of memory used by copy-on-write
7257:M 27 Feb 2019 14:48:39.939 * Background saving terminated with success
7257:M 27 Feb 2019 14:48:45.085 * 10 changes in 5 seconds. Saving...
7257:M 27 Feb 2019 14:48:45.187 * Background saving started by pid 7270
7270:C 27 Feb 2019 14:49:00.313 * DB saved on disk
7270:C 27 Feb 2019 14:49:00.401 * RDB: 4535 MB of memory used by copy-on-write
7257:M 27 Feb 2019 14:49:00.882 * Background saving terminated with success
7257:M 27 Feb 2019 14:49:06.011 * 10 changes in 5 seconds. Saving...
7257:M 27 Feb 2019 14:49:06.114 * Background saving started by pid 7271
7271:C 27 Feb 2019 14:49:21.086 * DB saved on disk
7271:C 27 Feb 2019 14:49:21.173 * RDB: 4534 MB of memory used by copy-on-write
7257:M 27 Feb 2019 14:49:21.706 * Background saving terminated with success
7257:M 27 Feb 2019 14:49:27.048 * 10 changes in 5 seconds. Saving...
7257:M 27 Feb 2019 14:49:27.155 * Background saving started by pid 7273
7273:C 27 Feb 2019 14:49:42.295 * DB saved on disk
7273:C 27 Feb 2019 14:49:42.382 * RDB: 4529 MB of memory used by copy-on-write
7257:M 27 Feb 2019 14:49:42.846 * Background saving terminated with success
7257:M 27 Feb 2019 14:49:48.023 * 10 changes in 5 seconds. Saving...
7257:M 27 Feb 2019 14:49:48.126 * Background saving started by pid 7274
7274:C 27 Feb 2019 14:50:05.251 * DB saved on disk
7274:C 27 Feb 2019 14:50:05.367 * RDB: 15 MB of memory used by copy-on-write
7257:M 27 Feb 2019 14:50:05.583 * Background saving terminated with success
As you can see, the dumped data has almost same size with others and the last one is even little. The thing that I don't understand is why the size is the same and why the last one has small size. (While the redis is dumping, the client requests set operation and the last dump probably means the end of set operation and start of get operation.)
To find out the reason, I look up the code but still don't know why the number has shown like above.
If you see rdb.c in the redis package, you can find this kind of source code.
int rdbSave(char *filename, rdbSaveInfo *rsi) {
...
snprintf(tmpfile, 256, "temp-%d.rdb", (int) getpid());
fp = fopen(tmpfile, "w");
...
rdbSaveRio(...);
}
From my understanding, everytime redis dumps the in-memory data, it should overwrite the previous saved data and this data should be lager than before. However, based on the log, the size is not linearly increased and even it is decreased at the last dump.
Am I missing some part of Redis features?
Edit
According to the comment, I definitely misinterpreted the logs. However, I still have a question about the performance. When snapshotting happens and if there are private dirty memories, the system saves them into the disk. In this point, based on the Redis dump mechanism, although the system sees the private dirty memory caused by set operation and only records this number on the log, it saves the all data from the memory. It means that everytime dumping happens, the size of disk expands and I'm pretty sure that it would lead to worse performance. But, when I see the result of benchmark, I can see the same performance drop despite of increasing disk size. I wonder why this shows the same drop rate and what is going on internally.
graph
In the above graph, the blue line indicates the throughput and you can see that it drops when snapshot happens and you can also notice that even though the second drop phase saves the larger disk size than the first phase, the drop rate is the same. So, my question would be is performance only affected by the private memory saving?

Related

Store Large Amount of Data (Statistical Data, Log Data) and retrieve it easily

I have created an advertising system where the system will collecting the data every second,
For example:
Start Time : 6:00pm 14 Dec 2020
End Time : 6:10pm 14 Dec 2020
RelatedId : 6d32683d71hd3x286s62327hd
The data will be stored every second for each site of viewing, meaning one user view = one-time record, roughly every ten seconds will have 200 new row data insert.
With this amount of data, is there any way to store it properly and could be retrieved easily?
Sample
Given Date range : 1 Jan 2020 to 1 Jan 2021
-retrieve all collected data and generate the report
The amount of time to retrieve the data is too long (cause error frequently) due to the too large amount of data inside the database(billions rows of data)
Summary:
How to store this kind of huge amount of data properly? Any Machanism?

How to deal with the categorical varaible that is not in training set (new categories in test set)?

I want to create a categorical variable for the semester column in my dataset. I have other additional variables with the target-not shown in the table.
Training set: include 2016-2017
Test set or validation set: include only 2018
My Concern is when I make the predictive model I will have categorical variables (factors) that do not exist in the training set (i.e SPRING 2018, SUMMER 2018–First SESSION,...etc). Is this will be a problem theoretically? How to deal with that?
Start End Semester
Jan 19,2016 May 6,2016 SPRING 2016
May 16,2016 Jun 25,2016 SUMMER 2016-FIRST SESSION
Jun 27,2016 Aug 6,2016 SUMMER 2016-SECOND SESSION
Aug 24,2016 Dec 16,2016 FALL 2016
Jan 17,2017 May 5,2017 SPRING 2017
May 15,2017 Jun 24,2017 SUMMER 2017–First SESSION
Jun 26,2017 Aug 5,2017 SUMMER 2017-SECOND SESSION
Aug 23,2017 Dec 15,2017 FALL 2017
Jan 16,2018 May 4,2018 SPRING 2018
May 14,2018 June 23,2018 SUMMER 2018–First SESSION
Jun 25,2018 Aug 4,2018 SUMMER 2018-SECOND SESSION
Aug 22,2018 Dec 14,2018 Fall 2018

The machine learning algorithms learn patterns in data, if we do not have any repeated pattern then with high probability they failed to provide adequate answer. I think you need to transform adequate information to your model for getting a rational output. Regarding to your research question, it can be different:
For instance, If you want to answer the question of when is the starting and ending time of semester x in year y?
You can convert the semester column into 4 ordinal categorical variables of 1 to 4 for Spring to Fall. In addition you should provide a year column in your data and DD,MM for ending and starting time.

sqlite3 transaction on 200 columns table takes long to complete sporadically

I am using sqlite3 as db manager for my application, developed on a rapsberry pi3.
My table is composed of around 200 columns (not so much), mostly boolean and numeric fields.
I add a (complete) record every minute. DB is accessed in a C program using transactions.
the transaction includes one insert and 6 updates (to maintain the code readable), avoiding to write a very long single insertion query.
The db file is on the filesystem (hence on the sd card) inside the home folder.
Every transsaction the db is opened, the pragmas
PRAGMA synchronous = NORMAL;
PRAGMA journal_mode = WAL;
are set and the query is performed.
I have good performance averagely but from the timing log I see a peak every once in a while.
Extract of the log is reported:
Apr 28 07:06:13 db write took 45.200000 ms
Apr 28 07:07:13 db commit took 0.302000 ms
Apr 28 07:07:13 db write took 75.858000 ms
Apr 28 07:08:13 db commit took 0.354000 ms
Apr 28 07:08:13 db write took 75.395000 ms
Apr 28 07:09:13 db commit took 0.268000 ms
Apr 28 07:09:13 db write took 40.620000 ms
Apr 28 07:10:13 db commit took 0.437000 ms
Apr 28 07:10:13 db write took 81.910000 ms
Apr 28 07:11:13 db commit took 0.205000 ms
Apr 28 07:11:13 db write took 43.315000 ms
Apr 28 07:12:13 db commit took 0.301000 ms
Apr 28 07:12:13 db write took 75.456000 ms
Apr 28 07:13:15 db commit took 1872.488000 ms <-----
Apr 28 07:13:15 db write took 1951.572000 ms <-----
Apr 28 07:14:13 db commit took 7.934000 ms
Apr 28 07:14:13 db write took 62.853000 ms
Apr 28 07:15:13 db commit took 0.274000 ms
Apr 28 07:15:13 db write took 80.568000 ms
Apr 28 07:16:13 db commit took 0.277000 ms
The arrow points to one of the time peak that are recurring (with variable periods) during the execution.
To bettere understand the situation, analyzing the benchmark I had two peaks in the last 12 hours, one is about 1 sec (not reported) and this one.
Could the time peaks happen because of filesystem activity on the sd?
Could making a different partition on the sd card have an impact on such performance?
Is there any other pragma that could protect my application from this behaviour?
Adding the pragmas has significantly improved the situation so far but I think is not acceptable yet.
Thanks for your time and patience.
Any hint is welcomed.
Regards,
mopyot

The database regularly moves the data from the write-ahead log into the actual database file; this is called checkpointing:
By default, SQLite will automatically checkpoint whenever a COMMIT occurs that causes the WAL file to be 1000 pages or more in size, or when the last database connection on a database file closes. […]
But programs that want more control can force a checkpoint using the wal_checkpoint pragma … The automatic checkpoint threshold can be changed or automatic checkpointing can be completely disabled using the wal_autocheckpoint pragma …

Oracle db stops working frequently

Recently, I faced a problem with Oracle 11g. It stops working frequently, every few hours, and respectively, must be started again. This was never happened before, it just began during last month.
Here is the content of clsc.log file
2016-10-06 07:00:14.344: [ default][2031720192]utgdv:2:ocr loc file /etc/oracle/olr.loc cannot be opened. errno 2
[ CLSE][2031720192]clse_get_crs_home: Error retrieving OLR configuration [0] [Error opening olr.loc file. No such file or directory]
Can anyone help me?
Edit: There is alert log messages like here, each time instance has stopped working (In this case, it has stopped working on Wed night and started again on Thu)
Thread 1 advanced to log sequence 35012 (LGWR switch)
Current log# 2 seq# 35012 mem# 0: /home/db/app/oracle/oradata/orcl/redo02.log
Wed Oct 05 18:53:08 2016
Thread 1 advanced to log sequence 35013 (LGWR switch)
Current log# 3 seq# 35013 mem# 0: /home/db/app/oracle/oradata/orcl/redo03.log
Wed Oct 05 19:05:08 2016
Thread 1 advanced to log sequence 35014 (LGWR switch)
Current log# 1 seq# 35014 mem# 0: /home/db/app/oracle/oradata/orcl/redo01.log
Wed Oct 05 19:15:03 2016
Thread 1 advanced to log sequence 35015 (LGWR switch)
Current log# 2 seq# 35015 mem# 0: /home/db/app/oracle/oradata/orcl/redo02.log
Thu Oct 06 06:41:11 2016
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as USE_DB_RECOVERY_FILE_DEST
Autotune of undo retention is turned on.
.
.
.

Apache2: server-status reported value for "requests/sec" is wrong. What am I doing wrong?

I am running Apache2 on Linux (Ubuntu 9.10).
I am trying to monitor the load on my server using mod_status.
There are 2 things that puzzle me (see cut-and-paste below):
The CPU load is reported as a ridiculously small number,
whereas, "uptime" reports a number between 0.05 and 0.15 at the same time.
The "requests/sec" is also ridiculously low (0.06)
when I know there are at least 10 requests coming in per second right now.
(You can see there are close to a quarter million "accesses" - this sounds right.)
I am wondering whether this is a bug (if so, is there a fix/workaround),
or maybe a configuration error (but I can't imagine how).
Any insights would be appreciated.
-- David Jones
- - - - -
Current Time: Friday, 07-Jan-2011 13:48:09 PST
Restart Time: Thursday, 25-Nov-2010 14:50:59 PST
Parent Server Generation: 0
Server uptime: 42 days 22 hours 57 minutes 10 seconds
Total accesses: 238015 - Total Traffic: 91.5 MB
CPU Usage: u2.15 s1.54 cu0 cs0 - 9.94e-5% CPU load
.0641 requests/sec - 25 B/second - 402 B/request
11 requests currently being processed, 2 idle workers
- - - - -

After I restarted my Apache server, I realized what is going on. The "requests/sec" is calculated over the lifetime of the server. So if your Apache server has been running for 3 months, this tells you nothing at all about the current load on your server. Instead, reports the total number of requests, divided by the total number of seconds.
It would be nice if there was a way to see the current load on your server. Any ideas?
Anyway, ... answered my own question.
-- David Jones

Apache status value "Total Accesses" is total access count since server started, it's delta value of seconds just what we mean "Request per seconds".
There is the way:
1) Apache monitor script for zabbix
https://github.com/lorf/zapache/blob/master/zapache
2) Install & config zabbix agentd
UserParameter=apache.status[*],/bin/bash /path/apache_status.sh $1 $2
3) Zabbix - Create apache template - Create Monitor item
Key: apache.status[{$APACHE_STATUS_URL}, TotalAccesses]
Type: Numeric(float)
Update interval: 20
Store value: Delta (speed per second) --this is the key option
Zabbix will calculate the increment of the apache request, store delta value, that is "Request per seconds".

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight