I have a rather large amount of data (~400 million data points) organized as a set of ~100,000 time courses. This data may change every day and, for audit reasons (revision safety), has to be archived daily.
Obviously this is far too much data to handle efficiently as full copies, so I analyzed some sample data. Approximately 60 to 80% of the courses do not change at all between two days, and in the rest only a small number of elements change. All in all, I expect far fewer than 10 million data points to change per day.
The question is: how do I make use of this knowledge? I am aware of concepts like the delta trees used by SVN and similar techniques, but I would prefer the database itself to handle such semantic compression. We are using Oracle 11g for storage. Is there a better way than a homebrew solution?
The time courses represent hourly energy flows. Such a time course might start in the past (e.g. 2005), contains 8,760 elements per year, and might end at any time up to 2020 (currently). Each time course is identified by a unique string.
The courses themselves are more or less boring:
"Course_XXX: 1.1.2005 0:00 5; 1.1.2005 1:00 5;1.1.2005 2:00 7,5;..."
My task is to make day-to-day changes in these courses visible. To do so, a snapshot has to be taken each day at a given time. My hope is that some lossless semantic compression will spare me from archiving ~20 GB per day.

Basically my source data looks like this:
Key | Value0 | ... | Value23
To archive that data I need to add a dimension that directly or indirectly tells me the time at which the data was loaded from the source system, so my archive database is
Key | LoadID | Value0 | ... | Value23
Where LoadID is more or less the time the source-DB was accessed.
Now, compression in my scenario is easy. LoadIDs are growing with each run and I can give a range, i.e.
Key | LoadID1 | LoadID2 | Value0 | ... | Value23
where LoadID1 is the ID of the first load in which the 24 values were observed and LoadID2 is the ID of the last consecutive load in which the same 24 values were observed.
In my scenario, this reduces the amount of data stored in the database to 1/30th.
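The range scheme above can be sketched in a few lines of Python. This is only a minimal in-memory illustration, not the Oracle implementation: the names `ArchiveRow`, `archive_load` and `values_at` are hypothetical, and it assumes that LoadIDs are consecutive integers (1, 2, 3, ...).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Values = Tuple[float, ...]  # the 24 hourly values stored for one key


@dataclass
class ArchiveRow:
    key: str
    load_id_first: int  # LoadID1: first load in which these values were seen
    load_id_last: int   # LoadID2: last consecutive load with the same values
    values: Values


def archive_load(archive: Dict[str, List[ArchiveRow]], load_id: int,
                 snapshot: Dict[str, Values]) -> None:
    """Archive one snapshot, extending an existing range where nothing changed."""
    for key, values in snapshot.items():
        rows = archive.setdefault(key, [])
        last = rows[-1] if rows else None
        # Extend only if the values are identical AND the previous range ends
        # at the directly preceding load (assumption: LoadIDs are consecutive).
        if (last is not None and last.values == values
                and last.load_id_last == load_id - 1):
            last.load_id_last = load_id
        else:
            rows.append(ArchiveRow(key, load_id, load_id, values))


def values_at(archive: Dict[str, List[ArchiveRow]], key: str,
              load_id: int) -> Optional[Values]:
    """Return the values a key had at a given load, or None if not archived."""
    for row in archive.get(key, []):
        if row.load_id_first <= load_id <= row.load_id_last:
            return row.values
    return None


archive: Dict[str, List[ArchiveRow]] = {}
unchanged = tuple([5.0] * 24)
changed = tuple([5.0] * 23 + [7.5])
archive_load(archive, 1, {"Course_XXX": unchanged})
archive_load(archive, 2, {"Course_XXX": unchanged})  # extends the first range
archive_load(archive, 3, {"Course_XXX": changed})    # starts a new range
```

Three loads end up as two archive rows here; the saving grows with the number of consecutive loads in which a key does not change.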


Designing a caching layer in front of a DB with minimal number of queries

I have multiple jobs that work on some key. The jobs run asynchronously and their results are written to a write-behind cache. Conceptually it looks like this:
| key | job1 | job2 | job3 | resolution |
| 123 | job1_res | job2_res | job3_res | resolution_val |
The key point is that I don't know in advance how many jobs are running. Instead, when it's time to write the record, we add our "resolution" (based on the job results we have so far) and write all values to the DB (MongoDB, if that matters).
I also have a load() function that runs on a cache miss. It fetches the record from the database, or creates a new (empty) one if the record wasn't found.
Now, there's a time window in which the record is neither in the cache nor in the database. During that window, a "slow worker" might write its result, and unluckily the load() function will create a new record.
When evicted from the cache, the record will look like this:
| key | job4 | resolution |
| 123 | job4_val | resolution_based_only_on_job4 |
I can think of two ways to control this problem:
1. Configure the write-behind mechanism so that it waits for all jobs to complete (i.e. give it a sufficient amount of time).
2. On a write event, first query the DB for the record and merge the results.
Problems with these solutions:
1. Hard to calibrate.
2. Requires an extra query for every write operation.
What's the most natural solution to my problem?
Do I have to implement solution #2 in order to guarantee a resolution on all job results?
Theoretically speaking, I think that even solution #2 does not guarantee that the resolution will be based on all job results.
If the write-behind mechanism guarantees order of operations then solution #2 is ok. This can be achieved by limiting the write-behind to one thread.
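Solution #2 (merge on write) could look roughly like the sketch below. It is an illustration under assumptions, not a MongoDB implementation: a plain dict stands in for the database, and `flush` and `resolve` are hypothetical names. The read-merge-write sequence is not atomic, so it relies on the single-threaded write-behind mentioned above.

```python
from typing import Callable, Dict

Record = Dict[str, str]


def flush(key: str, cached: Record, db: Dict[str, Record],
          resolve: Callable[[Record], str]) -> Record:
    """Write-behind flush that merges cached job results into the stored record.

    Assumption: only one thread calls flush, so no other write can slip in
    between the read and the write below.
    """
    stored = db.get(key, {})           # the extra query of solution #2
    merged = {**stored, **cached}      # cached job results win on conflict
    merged.pop("resolution", None)     # the old resolution is stale after a merge
    merged["resolution"] = resolve(merged)  # recompute from ALL known job results
    db[key] = merged
    return merged


# Stand-in data: jobs 1-3 were already persisted, the slow worker adds job4.
db = {"123": {"job1": "job1_res", "job2": "job2_res", "job3": "job3_res",
              "resolution": "resolution_val"}}
late_record = {"job4": "job4_val"}

# Hypothetical resolution function: just lists the jobs it was based on.
result = flush("123", late_record, db, lambda jobs: "+".join(sorted(jobs)))
```

With this merge, the late result of job4 no longer overwrites the results of jobs 1 to 3, and the resolution is recomputed from all four.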

Sonarqube - Very big database

I've found this post about the usual size of a Sonarqube Database:
How big is a sonar database?
In our case, we have 3,584,947 LOC to analyze. If every 1,000 LOC takes 350 KB of storage, the database should use about 1.2 GB. But we've found that our SonarQube database actually stores more than 20 GB...
The official documentation says that for 30 million LOC with 4 years of history, they use less than 20 GB...
In General Settings > Database Cleaner we have all default values except for "Delete all analyses after", which is set to 360 instead of 260.
What can create so much data in our case?
We use SonarQube version 6.7.1.
As @simonbrandhof asked, here are our biggest tables:
| Table Name | # Records | Data (KB) |
|`dbo.project_measures` | 12'334'168 | 6'038'384 |
|`dbo.ce_scanner_context`| 116'401 | 12'258'560 |
|`dbo.issues` | 2'175'244 | 2'168'496 |
20 GB of disk sounds way too big for 3.5M lines of code. For comparison, the internal PostgreSQL schema at SonarSource is 2.1 GB for 1M lines of code.
I recommend cleaning up the database in order to refresh statistics and reclaim dead storage. On PostgreSQL the command is VACUUM FULL; other databases probably have a similar command. If that does not help, please provide the list of the biggest tables.
The unexpected size of table ce_scanner_context is due to a known bug. This bug is going to be fixed in 6.7.4 and 7.2.
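To put the figures quoted in the question side by side, here is a quick back-of-the-envelope calculation. It uses only the numbers given above and assumes decimal units (1 GB = 1,000,000 KB); the variable names are illustrative.

```python
# Expected size from the rule of thumb quoted in the question.
loc = 3_584_947
kb_per_1000_loc = 350
expected_gb = loc / 1000 * kb_per_1000_loc / 1_000_000

# Data sizes (KB) of the three biggest tables listed in the question.
tables_kb = {
    "project_measures": 6_038_384,
    "ce_scanner_context": 12_258_560,
    "issues": 2_168_496,
}
total_kb = sum(tables_kb.values())
total_gb = total_kb / 1_000_000
scanner_share = tables_kb["ce_scanner_context"] / total_kb
without_scanner_gb = (total_kb - tables_kb["ce_scanner_context"]) / 1_000_000

print(f"expected: {expected_gb:.2f} GB")
print(f"three biggest tables: {total_gb:.2f} GB")
print(f"ce_scanner_context share: {scanner_share:.0%}")
print(f"without ce_scanner_context: {without_scanner_gb:.2f} GB")
```

Under these assumptions the three tables add up to about 20.5 GB, and ce_scanner_context alone accounts for roughly 60% of that.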

How to optimize large database requests

I am working with a database that contains information (measurements) about ships. The ships send an update with their position, fuel use, etc. So an entry in the database looks like this
| measurement_id | ship_id | timestamp | position | fuel_use |
| key | f_key | dd-mm-yy hh:ss| lat-lon | in l/km |
One of these entries is added for every ship every second, so the number of entries in the database grows very fast.
What I need for the application I am working on is not the information for one second but cumulative data for 1 minute, 1 day, or even 1 year. For example, the total fuel use over a day, the distance traveled in a year, or the average fuel use per day over a month.
Calculating that from the raw data is infeasible: you would have to fetch 31.5 million records from the server to calculate the distance one ship traveled in a year.
What I thought was the smart thing to do is to combine entries into one bigger entry. For example, take 60 measurements and combine them into a one-minute measurement entry in a separate table, averaging the fuel use and summing the distance traveled between consecutive entries. A minute entry would then look like this:
| min_measurement_id | ship_id | timestamp | position | distance_traveled | fuel_use |
| new key |same ship| dd-mm-yy hh| avg lat-lon | sum distance_traveled | avg fuel_use |
This process could then be repeated for hours, days, months and years. That way a query for a week would need only 7 entries, or 168 entries if I want hourly detail. Those look like far more usable numbers to me.
The new tables can be filled by querying the original database every 10 minutes; that data fills the minute table, which in turn updates the hour table, and so on.
However, this seems to involve a lot of management and duplication of almost the same data, with the same operation being performed over and over.
So what I am interested in is whether there is a better way to structure this data. Could it be stored hierarchically (after all, seconds, minutes and days are pretty hierarchical), or are there other ways to optimize this?
This is the first time I am using a database of this size, so I did not really know what to look for on the internet.
Aggregates are common in data warehouses so your approach to group data is fine. Yes, you are duplicating some of the data, but you'll get the speed benefit.
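The seconds-to-minutes roll-up described in the question can be sketched as follows. This is a minimal in-memory illustration, not a database job: `rollup_minutes` and `haversine_km` are hypothetical helper names, positions are assumed to be (lat, lon) in degrees, and rows are assumed to arrive sorted by timestamp per ship.

```python
from collections import defaultdict
from datetime import datetime
from math import asin, cos, radians, sin, sqrt


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))


def rollup_minutes(rows):
    """Aggregate per-second rows (ship_id, timestamp, position, fuel_use)
    into one entry per ship and minute."""
    buckets = defaultdict(lambda: {"n": 0, "fuel_sum": 0.0, "distance_km": 0.0})
    previous = {}  # last known position per ship
    for ship_id, ts, pos, fuel in rows:
        minute = ts.replace(second=0, microsecond=0)
        bucket = buckets[(ship_id, minute)]
        bucket["n"] += 1
        bucket["fuel_sum"] += fuel
        if ship_id in previous:
            # Distance since the previous measurement of the same ship.
            bucket["distance_km"] += haversine_km(previous[ship_id], pos)
        previous[ship_id] = pos
    return {
        key: {"avg_fuel_use": b["fuel_sum"] / b["n"],
              "distance_km": b["distance_km"],
              "n": b["n"]}  # keep the count so higher levels can re-average
        for key, b in buckets.items()
    }


# Two minutes of data for one stationary ship.
rows = (
    [(1, datetime(2020, 1, 1, 0, 0, s), (52.0, 4.0), 2.0) for s in range(60)]
    + [(1, datetime(2020, 1, 1, 0, 1, s), (52.0, 4.0), 4.0) for s in range(60)]
)
minutes = rollup_minutes(rows)
```

Keeping the measurement count next to the average is a design choice: the hour, day and year levels can then compute correctly weighted averages from the level below instead of averaging averages.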

Can a value in AWS DynamoDB point to value in different table?

First off, I have very little experience with servers and databases (I have used one only once and am just beginning to learn). Strictly speaking this is not a "code" question but a question about a concept in DynamoDB. I am asking here because I cannot find an answer no matter how much I search!
I am trying to make an application where users can see if their friends are "online" or not. There will be a table that keeps track of the users who are online and offline like this:
user_id | online
1 | O
2 | X
3 | O
When user_id 1, who has friends 2 and 3, "refreshes", 1 would see that 2 is offline and 3 is online. This would normally be done with batch_get in DynamoDB, but each item I read counts as one read unit, so if user 1 had 20 friends, one refresh would use up 20 read units. To me that costs too much, so I thought that if I made a second table with one item per user, holding the list of their friends and whether each is online, each refresh would cost only one read unit.
user_id | friends_on_off_line
1 | {2:X, 3:O}
2 | {1:O}
3 | {1:O}
However, the values in the list would have to be "pointers" to the first table, because I cannot update every copy of the value each time someone goes online or offline. (If 1 went offline, I would have to write 1 as offline once in the first table and twice in the second table, using 3 write units, which would end up costing even more.)
So I am trying to make the values in the second table point to the first table, so that a read resolves whether each friend is online or offline and returns the values as a list using only 1 read unit, like this:
user_id | friends_on_off_line
1 | { ,}
2 | {}
3 | {}
Is this possible in DynamoDB? If not, which service should I use and how can I make it possible?
Thanks in advance!
I don't think DynamoDB is the right tool for this kind of job.
SQL databases (MySQL, PostgreSQL) both make this design easy: just use joins (pointers).
You can also look at this question regarding this area for MongoDB.
What you should ask yourself is which questions the database needs to answer most often and what the update and read rates are. These questions usually point you in the right direction when picking a database.

How to store sets of objects that have occurred together during events?

I'm looking for an efficient way of storing sets of objects that have occurred together during events, in such a way that I can generate aggregate stats on them on a day-by-day basis.
To make up an example, let's imagine a system that keeps track of meetings in an office. For every meeting we record how many minutes long it was and in which room it took place.
I want to get stats broken down both by person as well as by room. I do not need to keep track of the individual meetings (so no meeting_id or anything like that), all I want to know is daily aggregate information. In my real application there are hundreds of thousands of events per day so storing each one individually is not feasible.
I'd like to be able to answer questions like:
In 2012, how many minutes did Bob, Sam, and Julie spend in each conference room (not necessarily together)?
Probably fine to do this with 3 queries:
>>> query(dates=2012, people=[Bob])
{Board-Room: 35, Auditorium: 279}
>>> query(dates=2012, people=[Sam])
{Board-Room: 790, Auditorium: 277, Broom-Closet: 71}
>>> query(dates=2012, people=[Julie])
{Board-Room: 190, Broom-Closet: 55}
In 2012, how many minutes did Sam and Julie spend MEETING TOGETHER in each conference room? What about Bob, Sam, and Julie all together?
>>> query(dates=2012, people=[Sam, Julie])
{Board-Room: 128, Broom-Closet: 55}
>>> query(dates=2012, people=[Bob, Sam, Julie])
{Board-Room: 22}
In 2012, how many minutes did each person spend in the Board-Room?
>>> query(dates=2012, rooms=[Board-Room])
{Bob: 35, Sam: 790, Julie: 190}
In 2012, how many minutes was the Board-Room in use?
This is actually pretty difficult since the naive strategy of summing up the number of minutes each person spent will result in serious over-counting. But we can probably solve this by storing the number separately as the meta-person Anyone:
>>> query(dates=2012, rooms=[Board-Room], people=[Anyone])
What are some good data structures or databases that I can use to enable this kind of querying? Since the rest of my application uses MySQL, I'm tempted to define a string column that holds the (sorted) ids of each person in the meeting, but the size of this table will grow pretty quickly:
2012-01-01 | "Bob" | "Board-Room" | 2
2012-01-01 | "Julie" | "Board-Room" | 4
2012-01-01 | "Sam" | "Board-Room" | 6
2012-01-01 | "Bob,Julie" | "Board-Room" | 2
2012-01-01 | "Bob,Sam" | "Board-Room" | 2
2012-01-01 | "Julie,Sam" | "Board-Room" | 3
2012-01-01 | "Bob,Julie,Sam" | "Board-Room" | 2
2012-01-01 | "Anyone" | "Board-Room" | 7
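How quickly this table grows can be sketched with a few lines of Python. The helper name `subset_keys` is hypothetical; it produces the sorted, comma-joined key of every non-empty subset of attendees, which is what the rows above store for a single meeting (the extra "Anyone" row is not included).

```python
from itertools import combinations


def subset_keys(attendees):
    """Return the sorted, comma-joined key of every non-empty attendee subset."""
    names = sorted(attendees)
    return [
        ",".join(combo)
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
    ]


three = subset_keys({"Sam", "Bob", "Julie"})
ten = subset_keys({f"person{i}" for i in range(10)})
print(len(three), len(ten))
```

A meeting with n attendees touches 2^n - 1 such keys, so three people give the 7 person rows shown above, while ten people would already give 1,023.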
What else can I do?
Your question is a little unclear: you say you don't want to store each individual meeting, but then how are you getting the current meeting stats (dates)? In addition, any table with the right indexes can be very fast even with a lot of records.
You should be able to use a table like log_meeting. I imagine it could contain something like:
employee_id, room_id, date (as timestamp), time_in_meeting
where employee_id is a foreign key to the employee table and room_id is a foreign key to the room table.
If you create one index on (employee_id, room_id, date), lookups should be pretty quick: MySQL uses multiple-column indexes from left to right, so the same index serves searches on employee_id, on employee_id + room_id, and on employee_id + room_id + date. This is explained more in the multi-index part of:
By refusing to store meetings (and related objects) individually, you are losing the original source of information.
You will not be able to compensate for this loss of data unless you regularly store the full list of every daily (or weekly, or monthly, ...) aggregate that you might need to query later on!
Believe me, it's going to be a nightmare ...
If the number of people is constant and not very large, you can assign a column to each person (present or not) and store the room, date and time in three more columns. This removes the string-splitting problem.
Also, from the nature of your question, I think you first need to assign IDs to everything: rooms, people, etc. There is no need for long repetitive strings in the DB. Try to reduce string operations and work with individual values in each column for better intersection performance. You can also store every combination of people in a table, assign an ID to each, and use one of those IDs in the actual date and time table. But all of these techniques require that something be constant, either the people or the rooms.
I do not understand whether you know all the "questions" at design time or whether new ones can be added during development or production. The latter would require keeping all the data all the time.
If you do know all your questions, it looks like a classic "banking system" that recalculates data on a daily basis.
Here is how I think about it:
It seems you have a limited number of rooms, people, days, etc.
Gather logging data on a daily basis, one table per day: one event per database row, with all the fields you need.
Analyse the data with a cron script at "midnight".
Update the stats for people, rooms, etc. Just increment the time spent by Bob in room xyz, and so on, for everything your requirements need.
Since the analysed (compressed) data is limited and relatively small, your system can also support a variety of queries, because the indexes stay relatively small.
You could also use a scalable map/reduce algorithm.
You can't avoid storing the atomic facts: (the meeting room, the people, the duration, the day). Aggregating them gains you little unless the same people meet multiple times in the same room on the same day. Maybe that happens a lot in your office :).
Making groups comparable is an interesting problem, but as long as you always compose the member strings the same, you can probably do it with string comparisons. This is not "normal" however. To normalise you'll need a relation table (many to many) and compose a temporary table out of your query set so it joins quickly, or use an "IN" clause and a count aggregate to ensure everyone is there (you'll see what I mean when you try it).
I think you can derive the minutes the board room was in use as meetings shouldn't overlap, so a sum will work.
For storage efficiency, use integer keys for everything with lookup tables. Dereference the integers during the query parsing, or just use good old joins if you are feeling traditional.
That's how I would do it anyway :).
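The "IN clause and a count aggregate" idea from this answer can be tried out with the sketch below. It is an illustration under assumptions: SQLite (via Python's built-in `sqlite3`) stands in for MySQL, the table names `meeting` and `attendance` and the helper `minutes_together` are made up for the example, and names are used instead of integer keys only to keep the output readable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE meeting (
    id INTEGER PRIMARY KEY, day TEXT, room TEXT, minutes INTEGER
);
CREATE TABLE attendance (          -- many-to-many: who was in which meeting
    meeting_id INTEGER REFERENCES meeting(id),
    person TEXT,
    PRIMARY KEY (meeting_id, person)
);
""")
conn.executemany("INSERT INTO meeting VALUES (?, ?, ?, ?)", [
    (1, "2012-01-01", "Board-Room", 30),
    (2, "2012-01-01", "Board-Room", 20),
    (3, "2012-01-02", "Broom-Closet", 15),
    (4, "2012-01-02", "Auditorium", 60),
])
conn.executemany("INSERT INTO attendance VALUES (?, ?)", [
    (1, "Bob"), (1, "Sam"), (1, "Julie"),
    (2, "Sam"), (2, "Julie"),
    (3, "Sam"), (3, "Julie"),
    (4, "Bob"),
])


def minutes_together(people):
    """Minutes per room for meetings attended by ALL of the given people."""
    placeholders = ",".join("?" * len(people))
    sql = f"""
        SELECT m.room, SUM(m.minutes)
        FROM meeting AS m
        WHERE m.id IN (
            SELECT meeting_id FROM attendance
            WHERE person IN ({placeholders})
            GROUP BY meeting_id
            HAVING COUNT(*) = ?    -- everyone on the list was present
        )
        GROUP BY m.room
    """
    return dict(conn.execute(sql, (*people, len(people))).fetchall())


# Room usage without over-counting: sum over meetings, not over attendees.
room_usage = dict(conn.execute(
    "SELECT room, SUM(minutes) FROM meeting GROUP BY room").fetchall())
```

Because each meeting is stored once, the room usage question needs no "Anyone" meta-person: summing the `meeting` table per room avoids the over-counting described in the question.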
You'll probably have to store individual meetings to get the data you need anyway.
However you'll have to make sure you aggregate and anonymise it properly before creating your reports. Make sure to separate concerns and access levels to stay within the proper legal limits on data.
