SonarQube - Very big database

I've found this post about the usual size of a SonarQube database:
How big is a sonar database?
In our case, we have 3,584,947 LOC to analyze. If every 1,000 LOC takes about 350 KB of data space, it should use about 1.2 GB. But we've found that our SonarQube database actually stores more than 20 GB...
The official documentation (https://docs.sonarqube.org/display/SONAR/Requirements) says that for 30 million LOC with 4 years of history, they use less than 20 GB...
In our General Settings > Database Cleaner we have all default values, except for "Delete all analyses after", which is set to 360 instead of 260.
What could be creating so much data in our case?
We use SonarQube version 6.7.1.
EDIT
As @simonbrandhof asked, here are our biggest tables:
| Table Name               | # Records  | Data (KB)  |
|--------------------------|------------|------------|
| `dbo.project_measures`   | 12'334'168 | 6'038'384  |
| `dbo.ce_scanner_context` | 116'401    | 12'258'560 |
| `dbo.issues`             | 2'175'244  | 2'168'496  |

20 GB of disk sounds way too big for 3.5M lines of code. For comparison, the internal PostgreSQL schema at SonarSource is 2.1 GB for 1M lines of code.
I recommend cleaning up the database in order to refresh statistics and reclaim dead storage. The command is VACUUM FULL on PostgreSQL; other databases probably have a similar command. If that doesn't improve things, then please provide the list of the biggest tables.
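As a hedged sketch on PostgreSQL (the `dbo.` prefix in the question suggests SQL Server, where `sp_spaceused` or `DBCC UPDATEUSAGE` play a similar role), the clean-up and the table-size listing could look like this:

```sql
-- Reclaim dead storage and refresh planner statistics.
-- Note: VACUUM FULL takes an exclusive lock, so run it outside analysis hours.
VACUUM FULL ANALYZE;

-- List the ten biggest tables, including their indexes.
SELECT relname                                     AS table_name,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind = 'r'          -- ordinary tables only
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;
```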
EDIT
The unexpected size of the ce_scanner_context table is due to https://jira.sonarsource.com/browse/SONAR-10658. This bug will be fixed in 6.7.4 and 7.2.

Related

Designing a caching layer in front of a DB with minimal number of queries

I have multiple jobs that work on some key. The jobs are run asynchronously and their results are written to some write-behind cache. Conceptually it looks like this:
+-----+----------+----------+----------+----------------+
| key | job1     | job2     | job3     | resolution     |
+-----+----------+----------+----------+----------------+
| 123 | job1_res | job2_res | job3_res | resolution_val |
+-----+----------+----------+----------+----------------+
The key concept is that I don't know in advance how many jobs are running. Instead, when it's time to write the record, we add our "resolution" (based on the job results we've got so far) and write all values to the DB (MongoDB, if that matters).
I also have a load() function that runs in case of a cache miss. It fetches the record from the database, or creates a new (empty) one if the record isn't found.
Now, there's a time window in which the record is neither in the cache nor in the database. During that window, a "slow worker" might write its result, and unluckily the load() function will create a new record.
When evicted from the cache, the record will look like this:
+-----+----------+-------------------------------+
| key | job4     | resolution                    |
+-----+----------+-------------------------------+
| 123 | job4_val | resolution_based_only_on_job4 |
+-----+----------+-------------------------------+
I can think of two ways to deal with this problem:

1. Configure the write-behind mechanism so it waits for all jobs to complete (i.e. give it a sufficient amount of time).
2. On a write event, first query the DB for the existing record and merge the results (sketched below).

Problems with these solutions:

1. Hard to calibrate.
2. Requires an extra query per write operation.

What's the most natural solution to my problem?
Do I have to implement solution #2 in order to guarantee a resolution based on all job results?
EDIT:
Theoretically speaking, I think that even solution #2 doesn't guarantee that the resolution will be based on all job results.
EDIT2:
If the write-behind mechanism guarantees order of operations then solution #2 is ok. This can be achieved by limiting the write-behind to one thread.
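Since the rest of this page is SQL-flavoured, here is a minimal sketch of the merge-on-write idea from solution #2 as a PostgreSQL upsert with a hypothetical job_results table (in MongoDB, an upsert via findOneAndUpdate with $set on the individual job fields plays the same role):

```sql
-- Hypothetical schema: one row per key, job results kept as a jsonb map.
CREATE TABLE IF NOT EXISTS job_results (
    key        bigint PRIMARY KEY,
    results    jsonb NOT NULL DEFAULT '{}'::jsonb,  -- e.g. {"job1": "job1_res", ...}
    resolution text
);

-- Merge-on-write: incoming job results are merged into whatever is already
-- stored, so a record written earlier by a "slow worker" is not overwritten.
INSERT INTO job_results (key, results, resolution)
VALUES (123, '{"job4": "job4_val"}', 'resolution_based_only_on_job4')
ON CONFLICT (key) DO UPDATE
SET results    = job_results.results || EXCLUDED.results,
    resolution = EXCLUDED.resolution;
```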

Optimal View Design To Find Mismatches Between Two Sets of Data

A bit of background: my company uses a piece of software that stores information about a mortgage loan in independent fields. These fields are spread across many tables in the loan database.
My current dilemma revolves around designing a view (or views) that will allow me to find mismatched data on a subset of loans between the underwriting side of our software and the lock side.
Here is a quick example of the data returned from the two views that already exist:
UW View

| transID | DTIField | LTVField | MIField |
|---------|----------|----------|---------|
| 50000   | 37.5     | 85.0     | 1       |

Lock View

| transID | DTIField | LTVField | MIField |
|---------|----------|----------|---------|
| 50000   | 42.0     | 85.0     | 0       |
In the above situation, the view should return the fields that do not match (in this case DTIField and MIField). I have already built a comparison view that uses a series of CASE expressions to return 0 for not matched or 1 for matched:

| transID | DTIField | LTVField | MIField |
|---------|----------|----------|---------|
| 50000   | 0        | 1        | 0       |
This is fine in itself but it is creating a bit of an issue downstream on the reporting side. We want to be able to build a report that would display only those transIDs that have mismatched data and show which columns are not matched. Crystal Reports is the reporting solution in question.
Some specifics about the data sets: we are comparing 27 loan items (so 54 fields in total). There are over 4,000 loans in the system and growing. There are already indexes on the transID fields.
How would you structure the view to return all the data needed for the report? We can do a good amount of work in Crystal Reports but ideally much of the logic would be handled in MSSQL.
Thanks for any assistance.
I think there should be no issue in comparing the 27 columns for a given row. Since you'll be reading each row just once and comparing the columns of that row in both tables, it shouldn't really pose any performance issues. You can use a hash function such as HASHBYTES to compute a hash value over the combination of these 27 fields in both tables and then compare the hashes to decide which rows the view should return. This should result in some performance improvement. Testing will reveal more.
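A minimal T-SQL sketch of that idea, using hypothetical view names uw_view and lock_view and showing only three of the 27 fields; the hash filters the view down to mismatched loans and the CASE expressions flag the individual columns:

```sql
-- Hypothetical views uw_view and lock_view, reduced to three of the 27 fields.
-- Only rows whose field hashes differ (i.e. at least one mismatch) are returned.
SELECT
    uw.transID,
    CASE WHEN uw.DTIField = lk.DTIField THEN 1 ELSE 0 END AS DTIMatch,
    CASE WHEN uw.LTVField = lk.LTVField THEN 1 ELSE 0 END AS LTVMatch,
    CASE WHEN uw.MIField  = lk.MIField  THEN 1 ELSE 0 END AS MIMatch
FROM uw_view AS uw
JOIN lock_view AS lk
    ON lk.transID = uw.transID
WHERE HASHBYTES('SHA2_256',
        CONCAT(uw.DTIField, '|', uw.LTVField, '|', uw.MIField))
   <> HASHBYTES('SHA2_256',
        CONCAT(lk.DTIField, '|', lk.LTVField, '|', lk.MIField));
-- Note: CONCAT treats NULL as '', so add explicit NULL markers if fields can be NULL.
```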

Best way to apply FIR filter to data stored in a database

I have a PostgreSQL database with a few tables that store several million data points from different sensors. The data is stored in one column of each row, like:
| ID | Data | Comment |
|----|------|---------|
| 1  | 19   | Sunny   |
| 2  | 315  | Sunny   |
| 3  | 127  | Sunny   |
| 4  | 26   | Sunny   |
| 5  | 82   | Rainy   |
I want to apply an FIR filter to the data and store the result in another table so I can work with it, but because of the amount of data I'm not sure of the best way to do it. So far I have the coefficients in Octave and work with extractions of the data: basically, I export the Data column to a CSV, then import the CSV in Octave to get it into an array and filter it. The problem is that this method doesn't let me work with more than a few thousand data points at a time.
Things I've been looking at so far:
PostgreSQL: I've been looking for some way to do it directly in the database, but I haven't found one so far.
Java: Another possible way is to write a small program that extracts chunks of data at a time, recalculates them using the coefficients, and stores them back in another table of the database.
C/C++: I've seen some questions and answers about how to implement the filter on StackOverflow here, here or here, but they seem to be aimed at real-time data and don't take advantage of already having all the data.
I think the best way would be to do it directly in PostgreSQL, and that Java or C/C++ would be too slow, but I don't have much experience working with this much data, so I may well be wrong. I just need to know why, and where to point myself.
What's the best way to apply an FIR filter to data stored in a database, and why?
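For what it's worth, an FIR convolution can be expressed directly in PostgreSQL with window functions. A hedged sketch, assuming a hypothetical table sensor_data(id, data) ordered by id and a 4-tap filter with placeholder coefficients (the real ones would come from the Octave design):

```sql
-- Hypothetical table sensor_data(id, data); the filtered series is materialised
-- into another table. y[n] = c0*x[n] + c1*x[n-1] + c2*x[n-2] + c3*x[n-3].
CREATE TABLE filtered_data AS
SELECT id,
       0.40 * data
     + 0.30 * LAG(data, 1) OVER w
     + 0.20 * LAG(data, 2) OVER w
     + 0.10 * LAG(data, 3) OVER w AS filtered
FROM sensor_data
WINDOW w AS (ORDER BY id);
-- The first rows are NULL until enough history exists; with several sensors,
-- use WINDOW w AS (PARTITION BY sensor_id ORDER BY id) instead.
```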

Can a value in AWS DynamoDB point to value in different table?

First off, I have very minimal experience with servers and databases (I have only used them once in my life and am only beginning to learn), and this isn't strictly a "code" question because it concerns a DynamoDB concept. But here it is, because I cannot find an answer to it no matter how much I search!
I am trying to make an application where users can see whether their friends are "online" or not. There will be a table that keeps track of which users are online and offline, like this:
| user_id | online |
|---------|--------|
| 1       | O      |
| 2       | X      |
| 3       | O      |
When user 1, who has friends 2 and 3, "refreshes", they should see that 2 is offline and 3 is online. This would normally be done with batch_get in DynamoDB, but each item read counts as one read unit, meaning that if user 1 had 20 friends, one refresh would use up 20 read units. To me, that would cost too much, so I thought that if I kept, for each user, a list of their friends showing whether they are online or not, each refresh would cost only one read unit:
| user_id | friends_on_off_line |
|---------|---------------------|
| 1       | {2:X, 3:O}          |
| 2       | {1:O}               |
| 3       | {1:O}               |
However, the values in the list would have to be "pointers" to the first table, because I can't afford to update the value every time someone goes online or offline (if user 1 went offline, I would have to write 1 as offline in both tables, and in the second table write it twice, using 3 write units, which would end up costing even more).
So I am trying to make the values in the second table point to the first table, so that a single read resolves whether each friend is online or offline and returns the values as a list using only 1 read unit, like this:
| user_id | friends_on_off_line                        |
|---------|--------------------------------------------|
| 1       | {pointer_to_2.online, pointer_to_3.online} |
| 2       | {pointer_to_1.online}                      |
| 3       | {pointer_to_1.online}                      |
Is this possible in DynamoDB? If not, which service should I use and how can I make it possible?
Thanks in advance!
I don't think DynamoDB is the right tool for this kind of job.
SQL databases (MySQL/PostgreSQL) handle this kind of design easily: just use joins (your "pointers").
You can also look at this question covering the same area for MongoDB.
What you should ask yourself is which questions the database most commonly needs to answer, and what the update/read rates are. These questions usually point you in the right direction when picking a database.
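As a minimal relational sketch of the join approach, assuming hypothetical tables users(user_id, online) and friendships(user_id, friend_id): each user's status is stored exactly once, and one query returns the online flag for all of a user's friends.

```sql
-- Hypothetical schema: status stored once per user, friendships as pairs.
CREATE TABLE users (
    user_id BIGINT PRIMARY KEY,
    online  BOOLEAN NOT NULL DEFAULT FALSE
);

CREATE TABLE friendships (
    user_id   BIGINT NOT NULL REFERENCES users (user_id),
    friend_id BIGINT NOT NULL REFERENCES users (user_id),
    PRIMARY KEY (user_id, friend_id)
);

-- One "refresh" for user 1: the join plays the role of the pointer,
-- so going online/offline is a single update to users.online.
SELECT f.friend_id, u.online
FROM friendships AS f
JOIN users       AS u ON u.user_id = f.friend_id
WHERE f.user_id = 1;
```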

Large amount of timecourses in database

I have a rather large amount of data (~400 million data points) organized in a set of ~100,000 timecourses. This data may change every day and, for reasons of revision safety, has to be archived daily.
Obviously we are talking about way too much data to handle efficiently, so I did some analysis on sample data. Approximately 60 to 80% of the courses do not change at all between two days, and for the rest only a very limited number of elements change. All in all, I expect far fewer than 10 million data points to change per day.
The question is: how do I make use of this knowledge? I am aware of concepts like the delta trees used by SVN and similar techniques; however, I would prefer the database itself to be capable of handling such semantic compression. We are using Oracle 11g for storage, and the question is whether there is a better way than a homebrew solution.
Clarification
I am talking about timecourses representing hourly energy currents. Such a timecourse might start in the past (e.g. in 2005), contains 8,760 elements per year, and might end any time up to 2020 (currently). Each timecourse is identified by a unique string.
The courses themselves are more or less boring:
"Course_XXX: 1.1.2005 0:00 5; 1.1.2005 1:00 5;1.1.2005 2:00 7,5;..."
My task is to make day-to-day changes in these courses visible, and to do so a snapshot has to be taken each day at a given time. My hope is that some lossless semantic compression will spare me from archiving ~20 GB per day.
Basically my source data looks like this:
Key | Value0 | ... | Value23
To archive that data I need to add an additional dimension which directly or indirectly tells me the time at which the data was loaded from the source system, so my archive database is
Key | LoadID | Value0 | ... | Value23
where LoadID is more or less the time the source DB was accessed.
Now, compression in my scenario is easy: LoadIDs grow with each run, and I can store a range, i.e.
Key | LoadID1 | LoadID2 | Value0 | ... | Value23
where LoadID1 gives me the ID of the first load in which the 24 values were observed and LoadID2 the ID of the last consecutive load in which the same 24 values were observed.
In my scenario, this reduces the amount of data stored in the database to about 1/30th.
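A hedged sketch of how that collapse could be expressed in Oracle SQL, assuming a hypothetical archive table archive(course_key, load_id, value0, ..., value23) with load IDs that increase by one per run, and showing only value0 for brevity (the classic gaps-and-islands pattern):

```sql
-- Consecutive loads with identical values form an "island" and are collapsed
-- into one row with a LoadID range. Only value0 is shown; in practice all 24
-- value columns go into the PARTITION BY and the GROUP BY.
SELECT course_key,
       MIN(load_id) AS load_id1,
       MAX(load_id) AS load_id2,
       value0
FROM (
    SELECT course_key, load_id, value0,
           load_id - ROW_NUMBER() OVER (
               PARTITION BY course_key, value0
               ORDER BY load_id
           ) AS grp
    FROM archive
)
GROUP BY course_key, value0, grp
ORDER BY course_key, load_id1;
```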
