I'm confused about the reason to use a separate replication slot for each replica (aside from avoiding a single point of failure). I get that each replication slot stores only some amount of WAL segments, and that if a replica goes offline its slot guarantees that the segments that didn't reach the replica are preserved until it comes back online. But wouldn't it require less disk space to store segments only as far back as the oldest checkpoint among all of the replicas?
As a very simplified example, say I have 3 replicas. The first one is online and fully synced, so let's say we don't store anything for it. The second one went offline 30 minutes ago, so we store the last 30 minutes of WAL. And the third one has been offline for 2 days, so we store 2 days of WAL. What's the reason to store 2 days 30 minutes of WAL when we can simply store 2 days and save on space?
Slots don't store anything themselves. They just do the accounting to let the system know how much to keep.
What's the reason to store 2 days 30 minutes of WAL when we can simply store 2 days and save up on space?
What makes you think it is storing 2 days and 30 minutes rather than simply storing 2 days?
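To make the accounting point concrete, here is a minimal sketch using the three replicas from the question. It is a simplification: real slots track WAL positions rather than durations, and the replica names and the `retained_wal_window` helper are made up for illustration.

```python
from datetime import timedelta

def retained_wal_window(slot_lags):
    """Return how far back WAL must be kept, given each slot's lag.

    Every slot reads from the same WAL stream, so the server only has to
    keep WAL back to the slot that is furthest behind (a max, not a sum).
    """
    return max(slot_lags.values(), default=timedelta(0))

# Hypothetical replicas, matching the example in the question.
lags = {
    "replica_1": timedelta(0),           # online and fully synced
    "replica_2": timedelta(minutes=30),  # offline for 30 minutes
    "replica_3": timedelta(days=2),      # offline for 2 days
}

window = retained_wal_window(lags)
print(window)  # 2 days, 0:00:00
```

The 30 minutes needed by the second replica are already contained in the 2 days kept for the third one, so nothing extra is stored.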
My SSISDB is writing a large number of entries, especially in [internal].[event_messages] and [internal].[operation_messages].
I have already set the number of versions to keep and the log retention period to 5 each. After running the maintenance job, selecting the distinct dates in both those tables shows that there are only 6 dates left (including today), as one would expect. Still, the tables I mentioned above have about 6.5 million entries each, and the total database size is 35 GB (again, for a retention period of 5 days).
In this particular package, I am using a lot of loops and I suspect that they are causing this rapid expansion.
Does anyone have an idea of how to reduce the number of operational and event messages written by the individual components? Or do you have an idea of what else might be causing this rate of growth? I have packages running on other machines for over a year with a retention period of 150 days and the size of the SSISDB is just about 1 GB.
Thank You!
I have some database of ~2 billion documents and ~8 TB which I store for 90 days before dropping the documents. However, several of these fields contain much more data than the rest, and I only need them for a shorter time, say 30 days. After 30 days, I want to clear the fields out to free up space, before archiving the document entirely later on.
It doesn't seem that MongoDB has native functionality for TTL on individual fields.
The database is both write and read heavy.
I'm thinking about writing some script to query Mongo every 1 minute, and then do some query like:
timestamp: $gt (now - 30 days - 1 hour) AND $lt (now - 30 days), and then an updateMany to write "" to these fields.
So essentially I would run a script every minute with a rolling window of one hour (just to ensure no documents escape) and do an updateMany.
Is this a decent approach? Are there any design considerations I should be aware of when addressing this problem?
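The rolling-window idea above could be sketched roughly as follows. This only builds the filter and update documents; the field names `raw_payload` and `debug_trace` are hypothetical, and `$unset` is swapped in for writing "" so that the fields are removed outright rather than left as empty strings.

```python
from datetime import datetime, timedelta, timezone

def build_field_expiry_update(now, fields,
                              max_age=timedelta(days=30),
                              overlap=timedelta(hours=1),
                              ts_field="timestamp"):
    """Build the (filter, update) pair for one run of the cleanup script.

    The filter matches documents that crossed the max_age boundary within
    the last `overlap`, so consecutive runs overlap and none are skipped.
    """
    cutoff = now - max_age
    query = {ts_field: {"$gt": cutoff - overlap, "$lt": cutoff}}
    # $unset ignores the value; "" is just the conventional placeholder.
    update = {"$unset": {name: "" for name in fields}}
    return query, update

now = datetime(2024, 3, 31, 12, 0, tzinfo=timezone.utc)
query, update = build_field_expiry_update(now, ["raw_payload", "debug_trace"])

# With pymongo, the pair would be passed along as:
#   collection.update_many(query, update)
print(query["timestamp"]["$lt"].isoformat())  # 2024-03-01T12:00:00+00:00
```

Because most documents in the window were already cleared by earlier runs, the filter would probably also want a condition such as `{"raw_payload": {"$exists": True}}` so that each run only rewrites documents that still carry the fields. An index on the timestamp field is assumed.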
Ten students and I are doing a big project in which we need to receive temperature data from hardware nodes and upload and store it on a server. As we are all embedded-systems engineers with only minor database knowledge, I am turning to you guys.
I want to receive data from the nodes, let's say, every 30 seconds. A table that stores [nodeId, time, temp] for every reading would quickly become very long. Do you have any suggestions for storing the data in another way?
One solution could be to store it as described for a period of time and then somehow compress it into a matrix of some sort. I still want to be able to reach old data.
One row every 30 seconds is not a lot of data. It's 2880 rows per day per node. I once designed a database which had 32 million rows added per day, every day. I haven't looked at it for a while but I know it's currently got more than 21 billion rows in it.
The only thing to bear in mind is that you need to think about how you're going to query it, and make sure it has appropriate indexes.
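As a rough illustration of both points (the row count and the index), here is a small sketch using SQLite from Python's standard library. The table and column names are assumptions; any relational database would follow the same pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reading (
        node_id INTEGER NOT NULL,
        ts      INTEGER NOT NULL,  -- Unix time in seconds
        temp    REAL    NOT NULL,
        PRIMARY KEY (node_id, ts)  -- doubles as the index for range queries
    )
""")

# One reading every 30 seconds for one node, for one day.
rows_per_day = 24 * 60 * 60 // 30
rows = [(1, 30 * i, 20.0 + (i % 10) * 0.1) for i in range(rows_per_day)]
conn.executemany("INSERT INTO reading VALUES (?, ?, ?)", rows)

# Typical query: one node over a time range (here, the first hour).
count, avg_temp = conn.execute(
    "SELECT COUNT(*), AVG(temp) FROM reading"
    " WHERE node_id = ? AND ts >= ? AND ts < ?",
    (1, 0, 3600),
).fetchone()

print(rows_per_day, count)  # 2880 120
```

The composite key on (node_id, ts) matches the query shape "one node, a time range", which is the kind of access pattern the index should be designed around.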
Have fun!
Assume an app that collects real-time temperature data for various cities around the world every 10 minutes.
Using the following GAE datastore model,
from google.appengine.ext import db

class City(db.Model):
    name = db.StringProperty()

class DailyTempData(db.Model):
    date = db.DateProperty()
    temp_readings = db.ListProperty(float, indexed=False)  # appended every 10 minutes
and a cron.yaml as so,
cron:
- description: read temperature
  url: /cron/read_temps
  schedule: every 10 minutes
I am already hitting GAE's daily free quota for datastore writes, and I'm looking for ways to get around this problem.
I'm thinking of reducing my datastore writes by persisting the temperature data only at the end of each day, which will effectively reduce the daily write volume (for each city) from 144 times to 1.
One way to do this is to use memcache as a temporary scratchpad, but due to the possibility of random data evictions, I could well lose all my data for the day. (Aside question: from experience, how often does unplanned eviction really happen?)
Questions are as follows:
Is there such a memory/storage facility (persistent and guaranteed across cron jobs) that would allow me to reduce datastore writes as described?
If not, what could be some alternative solutions?
The only other requirement would be that the temperature readings must be accessible (for serving to client-side) any given time of day.
The only guaranteed storage is the datastore.
As for memcache evictions: it depends on what is going on in your app and in Google App Engine land. Evictions could happen within a minute or two, or after hours. In my App Engine instances the oldest items are usually about 2 hours old. But it all depends, and you just can't rely on it.
Task queue payloads are limited to about 10K.
You could just write a blob (containing all cities measured in the 10min interval) and then reprocess it and unpick it and write out the city details at the end of the day.
When you say clients must be able to access temperature readings, do you mean just the current reading or all the readings for the day?
You could also change your model so that one huge object is stored for each execution of the cron job, rather than one for each city.
For example, say the object is called Measures... A Measures item will contain a List of all your measures for the corresponding time. Store them as non-indexed properties and you should have no problems... And also just 144 writes a day.
For the reading part... Use memcache to store the Measures items, as a good usage pattern.
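The "one object per cron execution" idea from the two answers above might look roughly like this in plain Python. It is only a sketch: the `pack_measures` and `unpack_daily` helpers and the JSON layout are assumptions, and the actual datastore write of each blob is left out.

```python
import json

def pack_measures(timestamp, readings):
    """Serialise one cron run's readings for all cities into a single blob."""
    return json.dumps({"t": timestamp, "temps": readings}, sort_keys=True)

def unpack_daily(blobs):
    """Regroup a day's blobs into {city: [(timestamp, temp), ...]}."""
    per_city = {}
    for blob in blobs:
        measures = json.loads(blob)
        for city, temp in measures["temps"].items():
            per_city.setdefault(city, []).append((measures["t"], temp))
    return per_city

# One blob per cron run: every 10 minutes gives 144 writes per day in total,
# no matter how many cities are measured.
runs_per_day = 24 * 60 // 10
blobs = [pack_measures(i * 600, {"Oslo": 1.5, "Cairo": 30.0})
         for i in range(runs_per_day)]

daily = unpack_daily(blobs)
print(len(blobs), len(daily["Oslo"]))  # 144 144
```

The end-of-day step would call something like `unpack_daily` and write the per-city results, while reads during the day can be served from the stored blobs (cached in memcache where possible).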
I'm developing a high score web service for my game, and it's running on Google App Engine.
My game has 5 difficulties, so I originally had 5 boards with entries for each (player_login, score and time). If a player submitted a score lower than their previous one, it was dismissed, so only the highest score is kept for each player.
But to add more fun into this, I'd decided to include daily/weekly/monthly/yearly high score tables. So I've created 5 boards for each difficulty, making it 25 boards. When a score is submitted, it's saved into each board, and the boards are supposed to be cleared on every day/week/month/year.
This happens by a cron job that is invoked and deletes all entries from a specific board.
Here comes the problem: it looks like deleting entries from the datastore is slow. From my test daily cleanups it looks like deleting a single entry takes around 200 ms.
In the worst-case scenario, if the game were quite popular and had, say, 100 000 players, and each of them had an entry in the yearly board, it would take 100 000 * 0.12 seconds = 12 000 seconds (over 3 hours!!) to clear that board. I think we are allowed to have jobs of up to 30 seconds in App Engine, so this wouldn't work.
I'm deleting with following code (thanks to Nick Johnson):
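# Keys-only query: fetch just the keys, which is all db.delete() needs.
q = Score.all(keys_only=True).filter('b = ',boardToClear)
results = q.fetch(500)
while results:
    self.response.out.write("deleting one batch;")
    db.delete(results)  # one batched call for up to 500 keys
    # Resume from where the previous batch ended.
    q = Score.all(keys_only=True).filter('b = ',boardToClear).with_cursor(q.cursor())
    results = q.fetch(500)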
What do you recommend me to do with this problem?
One approach that comes to my mind is to use a task queue and delete the scores that are older than permitted in each board, i.e. those that have expired, in smaller batches. This way I wouldn't hit the CPU limit for one task, but the cleanup would not be (nearly) instantaneous: my 12 000-second cleanup would be split into 1 200 tasks, each roughly 10 seconds long.
But I think that I'm doing something wrong; this kind of operation would be a lot faster in a relational database. Possibly something is wrong with my approach to the datastore and scoring because I'm locked in an RDBMS mindset.
First, a couple of small suggestions:
Does deletion take 200ms per item even when you delete items in a batch process? The fastest way to delete should be to do a keys_only query and then call db.delete() on an entire list of keys at once.
The 30-second limit was recently relaxed to 10 minutes for background work (like the cron jobs or queue tasks that you're contemplating) as of 1.4.0.
These may not fundamentally address your problem, though. I think there's no way to get around the fact that deleting a large number of records (hundreds of thousands, say), will take some time. I'm not sure that this is as big a problem for your use case though, as I can see a couple of techniques that would help.
As you suggest, use a task queue to split up a long-running task into several smaller tasks. Your use case (deleting a huge number of items that match a particular query) is ideal for a map-reduce task. Nick Johnson's blog post on the Mapper API may be very helpful for you (so that you don't have to write all of that task management code on your own).
Do you need to delete all the out-of-date board entries immediately? If you had a field that listed which week, month, or year that a particular entry counted for, you could index on that field and then only display entries from the current month on the visible leaderboard. (Disk space is cheap, after all.) And then if you wanted to slowly (over hours, say, instead of milliseconds) remove the out-of-date data, you could do that in the background without ever having incorrect data on your leaderboards.
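One way to picture the "field that lists which week, month, or year an entry counts for" is a small helper that derives period keys from the submission date. This is a sketch only; the key formats and the `period_keys` name are assumptions, not part of the App Engine API.

```python
from datetime import date

def period_keys(day):
    """Return the period labels a score submitted on `day` counts towards."""
    iso_year, iso_week, _ = day.isocalendar()
    return {
        "daily": day.isoformat(),
        "weekly": "%d-W%02d" % (iso_year, iso_week),
        "monthly": day.strftime("%Y-%m"),
        "yearly": str(day.year),
    }

keys = period_keys(date(2011, 1, 3))
print(keys["weekly"], keys["monthly"])  # 2011-W01 2011-01

# A leaderboard then filters on the current key instead of relying on
# old entries having been deleted already.
entries = [
    {"player": "ann", "score": 900, "monthly": "2010-12"},
    {"player": "bob", "score": 700, "monthly": "2011-01"},
]
visible = [e for e in entries if e["monthly"] == keys["monthly"]]
```

Stored as an indexed property on each score, such a key lets the visible board stay correct the moment a period rolls over, while expired entries are removed at leisure.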
Delete entities in batches. Although a single delete takes a noticeable amount of time (though 200ms seems very high), batch deletes take no longer, as they delete all the entities in parallel. Task Queue and cron jobs can now run for up to 10 minutes, so timeouts should not be an issue.
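The batching pattern itself can be sketched independently of the datastore. Below, `fetch_keys` and `delete_many` are hypothetical stand-ins for a keys-only query and a batched `db.delete()` call, and a plain dict plays the role of the store.

```python
def delete_in_batches(fetch_keys, delete_many, batch_size=500):
    """Repeatedly fetch up to batch_size keys and delete them in one call."""
    deleted = 0
    while True:
        keys = fetch_keys(batch_size)
        if not keys:
            return deleted
        delete_many(keys)  # one batched call instead of one call per key
        deleted += len(keys)

# In-memory stand-in for a board with 1200 entries.
store = {i: "score" for i in range(1200)}
calls = []

def fetch_keys(limit):
    return list(store)[:limit]

def delete_many(keys):
    calls.append(len(keys))
    for key in keys:
        del store[key]

total = delete_in_batches(fetch_keys, delete_many)
print(total, calls)  # 1200 [500, 500, 200]
```

With 100 000 entries this comes to 200 batched calls rather than 100 000 single deletes, and each batch could just as well be handed to its own queue task.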