Thanos Receiver Data Retention for 3 hours

I set up 5 replicas of Thanos Receive with 3 hours of data retention.
I verified that the ULID block folders are created every 2 hours and deleted after 3 hours.
However, my problem is that the data in the wal directory keeps growing and is not deleted after 3 hours. I tried to find anything in the Thanos documentation about whether this is expected behavior. We only provision a 20Gi PVC, and we cannot afford to increase it because of the company's cost-efficiency goals.
Is there a way to make the wal directory in the Thanos Receiver's data storage get cleaned up the same way the ULID directories are?
Expectation:
Data folders inside the wal directory would be deleted the same way the ULID directories get deleted after 3 hours.
Please shed some light on this. Thank you.
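For reference, each receiver is started roughly like this (flag names are from the Thanos Receive CLI; the paths and values are illustrative, not my exact manifest, and unrelated flags are omitted):
# Receiver invocation sketch: local TSDB path, 3h retention, uploads to object storage
thanos receive \
  --tsdb.path=/var/thanos/receive \
  --tsdb.retention=3h \
  --objstore.config-file=/etc/thanos/objstore.yaml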

Related

SQL HA Cluster TempDB Version Store blocking on secondary Replica due to open transaction?

I am currently investigating a recurring error that occurs on the secondary replica of our 2-node Always On High Availability cluster. The replica is set up as read-intent only because we use a separate backup solution (Dell Networker).
Tempdb keeps growing on the secondary replica because the version store never gets cleared.
I can fix it temporarily by failing over the availability groups, but after a couple of hours the error appears again on the replica node. The error seems to follow one specific availability group: every node it is currently replicating to gets the error after some time. So I guess the issue is caused by a transaction and not by the system itself.
I tried all the suggestions I could find on Google, but even if I recklessly kill all sessions with a last_batch in the timeframe I get from the Perfmon "longest running transaction time" counter (as advised here: https://www.sqlservercentral.com/articles/tempdb-growth-due-to-version-store-on-alwayson-secondary-server ), the version store does not start cleaning up.
The shown elapsed seconds also match the output of this query on the secondary node:
select * from sys.dm_tran_active_snapshot_database_transactions
The details are sadly not useful:
Output (screenshot)
Here it shows Transaction ID 0 and Session ID 823, but that session ID is long gone and keeps being reused by other processes. So I am stuck here.
I tried to match the transaction_sequence_num with anything, but no luck so far.
On the primary node it shows no open transactions of any kind.
Any help finding the cause of this open snapshot transaction is appreciated.
I already followed these guides to find the issue:
https://sqlundercover.com/2018/03/21/tempdb-filling-up-on-secondary-replicas/
https://sqlgeekspro.com/tempdb-growth-due-to-version-store-in-alwayson/
https://learn.microsoft.com/en-us/archive/blogs/docast/rapid-growth-of-tempdb-on-alwayson-secondary-replica-due-to-version-store
https://www.sqlshack.com/how-to-detect-and-prevent-unexpected-growth-of-the-tempdb-database/
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/635a2fc6-550b-4e08-b232-0652bd6ea17d/version-store-space-not-being-released?forum=sqldatabaseengine
https://www.sqlservercentral.com/articles/tempdb-growth-due-to-version-store-on-alwayson-secondary-server
Update:
To back up my claim that the session is long gone:
In the pictures you can see the output of sys.dm_tran_active_snapshot_database_transactions, sys.sysprocesses and sys.dm_exec_sessions:
The first picture shows 2 currently "open" snapshot database transactions (normally there was always just one in the past, but maybe the more the better), and it shows the sessions running under those IDs at this moment.
Then I proceeded to kill sessions 899 and 823 and checked again:
Here you can see that the active snapshot database transactions view still shows the 2 session IDs, while sysprocesses and dm_exec_sessions now show that the 2 IDs are in use by a different program, user, database, etc., because I killed them and the ID numbers were immediately reused. If I check throughout the day, sometimes they are not in use at all.
If I check the elapsed time and the Perfmon longest running transaction counter, I would be looking for a session with a login time or last batch at around 2023-02-03 00:00:56. But even if I check all sleeping sessions, or sessions with a last batch in this range, and kill all of them (as described in all of the links above), the "transaction" still shows in sys.dm_tran_active_snapshot_database_transactions with ever-growing numbers.
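For completeness, this is roughly the join I use to cross-check the reported session IDs against the live sessions (standard DMV columns, nothing exotic; adjust as needed):
-- Run on the secondary: list "open" snapshot transactions and whatever currently owns that session ID
SELECT ast.transaction_id,
       ast.transaction_sequence_num,
       ast.session_id,
       ast.elapsed_time_seconds,
       s.login_time,
       s.last_request_end_time,
       s.program_name,
       s.status
FROM sys.dm_tran_active_snapshot_database_transactions AS ast
LEFT JOIN sys.dm_exec_sessions AS s
       ON s.session_id = ast.session_id
ORDER BY ast.elapsed_time_seconds DESC;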
Update 2:
In the meantime we had to resolve the issue with a failover because tempdb ran out of space. Now the new "stuck" session ID, as shown in sys.dm_tran_active_transactions, is 47, and its elapsed time is currently around 30000 seconds and rising. So the problem started at around 11.2.2023 00:00:20.
Here is the output of sys.dm_tran_active_transactions (screenshot):
There are many different assumptions going on here which need to be addressed.
First, availability groups work by shipping log blocks; they don't send individual transactions. Log blocks are written to the log on the secondary and eventually the information inside them is redone.
Second, the version store involved here is only used on readable secondary replicas, so looking at the primary for sessions that are using items on the secondary is not going to help. The version store can only be cleaned up by removing the oldest unused versions until it hits a version that is still in use; it cannot skip versions. Thus, if version 3 is needed but versions 4-10 aren't, anything below 3 can be cleaned up, but nothing from 3 upward (including 3) can be.
Third, if a session is closed then any outstanding items are cleaned up, whether that is freeing memory, killing transactions, etc. No evidence was given that the session is actually disconnected on your secondary replica.
I wrote this in lieu of adding comments. If the OP decides to add more data from the secondary, I'll address it. The replica can also be changed to non-readable, which will solve the problem, since the issue is caused by queries on the secondary.
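If it helps to put numbers on it, a rough way to watch version store usage per database on the secondary is the query below (sys.dm_tran_version_store_space_usage is available from SQL Server 2016 onward; adjust for older versions):
-- Run on the readable secondary: version store space held per database
SELECT DB_NAME(database_id) AS database_name,
       reserved_page_count,
       reserved_space_kb / 1024 AS reserved_space_mb
FROM sys.dm_tran_version_store_space_usage
ORDER BY reserved_space_kb DESC;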

How to clear Stackdriver logs in Google Cloud Platform?

I recently realized I was paying way too much for my logs:
As you can see in the image, logs are getting bigger each month
As you can also see, I just put a "limit" on the ingestion today. Hopefully this will slow things down.
But as I understand it, my logs have gotten so big that I have to pay for their retention each month. I cannot figure out how to:
a) delete logs of a certain period (or just all of them)
b) make logs auto delete after x days
I also just today set a quota limit of 100 instead of 6000.
The logs expire according to the retention policy:
Admin Activity 400 days
System Events 400 days
Data Access 30 days
Access Transparency 30 days
Other Logs 30 days
Note that you're not charged for Admin Activity or System Event logs.
Some solutions to control costs are exclusions and exports. However, exclusion filters only apply to future entries: even if you use timestamp in the filter expression to specify a range of dates, log entries that are already ingested won't be excluded. The same applies to creating a log sink for exporting data, since it will only export future matching logs.
You can use gcloud logging logs delete to delete all the logs for a given project or for a given resource, but you can't specify a range of time.
So, my suggestions are the following (rough gcloud commands for these are sketched below):
1.- Delete all the existing logs for resources you don't need logging for.
2.- Create exclusions to keep only the logs you may need for 30 days.
3.- Create export sinks for all the logs you may need for more than 30 days.
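For illustration, roughly what 1 and 3 look like with the gcloud CLI (the project, log name, sink name and bucket are placeholders; exclusions can also be managed from the console):
# 1. Delete all existing entries of a log you don't need (whole logs only; no date range):
gcloud logging logs delete projects/MY_PROJECT/logs/my-noisy-log
# 3. Export future matching entries to a Cloud Storage bucket through a sink:
gcloud logging sinks create my-archive-sink \
    storage.googleapis.com/my-log-archive-bucket \
    --log-filter='resource.type="gce_instance"'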

Why does CouchDB's _dbs.couch keep growing when purging/compacting DBs?

The setup:
A CouchDB 2.0 running in Docker on a Raspberry PI 3
A node-application that uses pouchdb, also in Docker on the same PI 3
The scenario:
At any given moment, the CouchDB has at most 4 databases with a total of about 60 documents
the node application purges (using PouchDB's destroy) and recreates these databases periodically (some of them every two seconds, others every 15 minutes)
The databases are always recreated with the newest entries
The reason for purging the databases instead of deleting their documents is that I'd otherwise have a huge number of deleted documents, and my web client can't handle syncing all of them
The problem:
The file var/lib/couchdb/_dbs.couch keeps growing and never shrinks. Last time I left it alone for three weeks, and it grew to 37 GB. Fauxton showed that the CouchDB only contains the roughly 60 documents mentioned above, but this file still keeps growing until it fills all the available space
What i tried:
running everything on an x86 machine (OS X)
running CouchDB without Docker (because of this info)
using CouchDB 2.1
running compaction manually (which didn't do anything)
Googling for about 3 days now
Whatever I do, I always get the same result: _dbs.couch keeps growing. I also wasn't really able to find out what that file's purpose is; googling that specific filename only yields two pages of search results, none of which are specific.
The only thing I can currently do is manually delete this file from time to time and restart the Docker container. That does delete all my databases, but that is not a problem, as the node application recreates them soon after.
The _dbs database is a meta-database. It records the locations of all the shards of your clustered databases, but since it's a CouchDB database too (though not a sharded one) it also needs compacting from time to time.
Try:
curl localhost:5986/_dbs/_compact -XPOST -Hcontent-type:application/json
You can enable the compaction daemon to do this for you; it is enabled by default in the recent 2.1.0 release.
Add this to the end of your local.ini file and restart CouchDB:
[compactions]
_default = [{db_fragmentation, "70%"}, {view_fragmentation, "60%"}]
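If you prefer to keep triggering it manually on an older release, a simple cron entry works too (the schedule is illustrative; 5986 is the CouchDB 2.x node-local port used in the curl command above):
# Compact the _dbs meta-database nightly at 03:00 via the node-local admin port
0 3 * * * curl -s -XPOST -H 'Content-Type: application/json' http://localhost:5986/_dbs/_compact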

Drupal 7 -> 8 migration of a large database takes forever

So I have a Drupal 7 database with 2 million users that need to move to Drupal 8 with a minimum of downtime (target is an hour). The Drupal migrate module appears to solve this problem, but it writes new rows one item at a time and in my tests, 4 thousand users + related data took 20 minutes on frankly beastly AWS instances. Extrapolating to the full dataset, it would take me 7 days to run the migration, and that amount of downtime is not reasonable.
I've made a feature request against Drupal core but I also wanted to see if the community has any ideas that I missed. Also, I want to spawn some discussion about this issue.
If anyone still cares about this, I have resolved this issue. Further research showed that not only does the Drupal migration module write new rows one at a time, but it also reads rows from the source one at a time. Further, for each row Drupal will write to a mapping table for the source table so that it can support rollback and update.
Since a user's data is stored in one separate table per custom field, this results in something like 8 reads and 16 writes for each user.
I ended up extending Drupal's MigrateExecutable to run the process. Then I overrode both the part that reads data and the part that writes it so that they work in batches, and so that they don't write to the mapping tables. I believe that my projected time is now down to less than an hour (a speed-up of 168 times!).
Still, trying to use the Drupal infrastructure was more trouble than it was worth. If you are doing this yourself, just write a command-line application and do the SQL queries manually.
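To illustrate only the batching idea (the database, table and column names here are placeholders, not the real Drupal 7/8 schema), the core trick is set-based copies in chunks instead of one row per query:
-- Copy users in chunks of 10000 using keyset pagination on uid;
-- re-run with @last_uid set to the highest uid copied so far.
INSERT INTO target_db.users_target (uid, name, mail, created)
SELECT uid, name, mail, created
FROM source_db.users_source
WHERE uid > @last_uid
ORDER BY uid
LIMIT 10000;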

Solr Incremental backup on real-time system with heavy index

I am implementing a search engine with Solr that imports a minimum of 2 million documents per day.
Users must be able to search the imported documents ASAP (near real-time).
I am using 2 dedicated Windows x64 servers with Tomcat 6 (Solr shard mode). Each server indexes about 120 million documents and about 220 GB (500 GB in total).
I want to take incremental backups of the Solr index files while updates and searches are running.
After searching, I found the rsync tool for UNIX and DeltaCopy for Windows (a GUI rsync for Windows), but I get a "vanished" error during updates.
How can I solve this problem?
Note 1: Copying the files is really slow when they are very large, so I can't use that approach.
Note 2: Can I prevent corrupted index files during updates if Windows crashes, the hardware resets, or any other problem occurs?
You can take a hot backup (i.e. while writing to the index) using the ReplicationHandler to copy Solr's data directory elsewhere on the local system. Then do whatever you like with that directory. You can launch the backup whenever you want by going to a URL like this:
http://host:8080/solr/replication?command=backup&location=/home/jboss/backup
Obviously you could script that with wget+cron.
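For example, a nightly backup triggered from cron could look roughly like this (host, port and backup location are the placeholders from the URL above):
# Trigger a ReplicationHandler hot backup every night at 02:00
0 2 * * * wget -q -O /dev/null "http://host:8080/solr/replication?command=backup&location=/home/jboss/backup"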
More details can be found here:
http://wiki.apache.org/solr/SolrReplication
The Lucene in Action book has a section on hot backups with Lucene, and it appears to me that the code in Solr's ReplicationHandler uses the same strategy as outlined there. One of that book's authors even elaborated on how it works in another StackOverflow answer.
Don't run a backup while updating the index. You will probably get a corrupt (therefore useless) backup.
Some ideas to work around it:
Batch up your updates, i.e. instead of adding/updating documents all the time, add/update every n minutes. This will let you run the backup in between those n minutes. Cons: document freshness is affected.
Use a second, passive Solr core: Set up two cores per shard, one active and one passive. All queries are issued against the active core. Use replication to keep the passive core up to date. Run the backup against the passive core. You'd have to disable replication while running the backup. Cons: complex, more moving parts, requires double the disk space to maintain the passive core.
