We want to implement a Grafana dashboard that shows in how many database calls a value is found.
I'm not sure which Micrometer metric to use:
Counter: Counters report a single metric, a count.
Timer: Measures the frequency of an event and how long it takes (latency)
According to that, I would choose the counter, because I just want to know how many times we find a value in the database.
It depends on the information you hope to capture based on the metric. You most likely want a gauge.
You aren't timing anything so a Timer wouldn't be a good fit.
Counter - is used for measuring values that only go up and can be used to calculate rates. For instance, counting requests.
Gauge - is used for measuring values that go up and down. For instance, CPU usage.
If you are counting values in a database result, that number could go up and down (if the table allows deletion), which points to a gauge. However, if the amount only ever goes up, using a counter would make sense and would give you the ability to see the growth rate, but that only works if you can guarantee the number never decreases.
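For illustration, a minimal sketch of both meter types with Micrometer; the metric names and the SimpleMeterRegistry below are placeholders for whatever your application already has wired up:

    import io.micrometer.core.instrument.Counter;
    import io.micrometer.core.instrument.Gauge;
    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

    import java.util.concurrent.atomic.AtomicLong;

    public class LookupMetrics {
        public static void main(String[] args) {
            MeterRegistry registry = new SimpleMeterRegistry();

            // Counter: only goes up; Grafana can derive a rate from it.
            Counter hits = Counter.builder("db.lookup.hits")
                    .description("database lookups that found a value")
                    .register(registry);

            // Gauge: samples a value that can go up and down, e.g. a row count.
            AtomicLong rowCount = new AtomicLong();
            Gauge.builder("db.table.rows", rowCount, AtomicLong::get)
                    .description("current number of rows in the table")
                    .register(registry);

            hits.increment();   // call this each time a lookup finds a value
            rowCount.set(42);   // update this whenever the table size is re-checked
        }
    }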
I need to create a load test for a certain number of requests in a given time. I could successfully setup Precise Throughput Timer and I believe I understand how it works. What I don't understand is how other timers, specifically Gaussian Random Timer would affect it.
I have run my test plan with and without the Gaussian Random Timer, but I don't see much of a difference in the results. I'm wondering whether adding a Gaussian Random Timer would help me better simulate my users' behavior?
I would say that these timers are mutually exclusive
Precise Throughput Timer allows you to reach and maintain the desired throughput (number of requests per given amount of time)
Gaussian Random Timer - allows you to simulate "think time"
If your goal is to mimic real users' behavior as closely as possible, go for the Gaussian Random Timer, because real users don't hammer the application under test non-stop; they need some time to "think" between operations, i.e. locate the button and move the mouse pointer there, read something, type something, etc. So if your test assumes simulating real users using real browsers, go for the Gaussian Random Timer and put realistic think times between operations. If you need your test to produce a certain amount of hits per second, just increase the number of threads (virtual users) accordingly. Check out What is the Relationship Between Users and Hits Per Second? for a comprehensive explanation if needed.
On the other hand, the Precise Throughput Timer is handy when there are no "real users", for example when you're testing an API or a database or a message queue and need to send a specific number of requests per second.
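Not the timer's actual source, but roughly what a Gaussian think time boils down to: a constant delay offset plus a normally distributed deviation (the 3000 ms / 500 ms values below are made up for illustration):

    import java.util.Random;

    public class ThinkTimeSketch {
        private static final Random RANDOM = new Random();

        // Constant delay offset plus a normally distributed deviation, clamped to >= 0.
        static long gaussianThinkTimeMillis(long constantDelayMs, double deviationMs) {
            return (long) Math.abs(RANDOM.nextGaussian() * deviationMs + constantDelayMs);
        }

        public static void main(String[] args) {
            for (int i = 0; i < 5; i++) {
                System.out.println("pause before next sample: "
                        + gaussianThinkTimeMillis(3000, 500) + " ms");
            }
        }
    }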
I want to keep track of a sum of user-controlled variables.
Each user can add/remove/update his/her own variables.
Users should be able to see the sum change after their own update.
As the number of users scales up, the system will be distributed and updates will happen concurrently.
I want to avoid a bottleneck for updating the sum.
What is the best way to keep track of this sum?
Is there an existing database that can handle this, or do I need to implement something myself?
So generally, if you know by how much each value has changed, you know how the sum has changed, and you can use these incremental changes to update the sum.
In a centralised setup you could, for instance, use any SQL database that supports triggers and transactions. You'd have a table describing the different clients / numbers and their values, and another table to cache the sum. The idea is that the trigger would run on update / delete / insert and just adjust the cached sum by the delta. This is much faster for huge amounts of data, but also more error prone (alternatively, you can simply re-sum all the values in the trigger and store the result in the cache, which works easily for a few thousand values).
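A sketch of the trigger approach, wrapped in JDBC since the rest of this thread is Java-centric; the user_values / sum_cache table and column names are made up, and EXECUTE FUNCTION assumes PostgreSQL 11 or newer:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CachedSumTriggerSetup {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/appdb", "app", "secret");
                 Statement st = con.createStatement()) {

                // Single-row table caching the current total (seed it with one 0 row).
                st.execute("CREATE TABLE IF NOT EXISTS sum_cache (total numeric NOT NULL DEFAULT 0)");

                // Trigger function: adjust the cached total by the delta of each change.
                st.execute(
                    "CREATE OR REPLACE FUNCTION apply_value_delta() RETURNS trigger AS $$ " +
                    "BEGIN " +
                    "  IF TG_OP = 'INSERT' THEN " +
                    "    UPDATE sum_cache SET total = total + NEW.value; " +
                    "  ELSIF TG_OP = 'DELETE' THEN " +
                    "    UPDATE sum_cache SET total = total - OLD.value; " +
                    "  ELSE " +
                    "    UPDATE sum_cache SET total = total - OLD.value + NEW.value; " +
                    "  END IF; " +
                    "  RETURN NULL; " +
                    "END; $$ LANGUAGE plpgsql");

                st.execute(
                    "CREATE TRIGGER user_values_sum AFTER INSERT OR UPDATE OR DELETE " +
                    "ON user_values FOR EACH ROW EXECUTE FUNCTION apply_value_delta()");
            }
        }
    }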
In a decentralised system you can do something similar. Here you can either share all the values between all the clients, or (as this might be too much) just the change. Every client is responsible for some values, and on every change it shares by how much the total has changed. For example, if a user modifies a value from 5 to 3, the client broadcasts -2. You assume an initial state of 0 and just sum up the numbers from the clients as they come in. The order doesn't matter due to the commutative property of addition. You only need to make sure that everyone receives the data, which you can achieve via reliable multicast.
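And a tiny sketch of the delta idea on the receiving side: each node folds incoming deltas into its running total, and because addition is commutative the arrival order doesn't matter:

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.atomic.LongAdder;

    public class DeltaSum {
        // Each node applies the deltas it receives; order does not matter.
        private final LongAdder total = new LongAdder();

        void onDeltaReceived(long delta) {
            total.add(delta);
        }

        long currentTotal() {
            return total.sum();
        }

        public static void main(String[] args) throws InterruptedException {
            DeltaSum sum = new DeltaSum();
            // Simulate concurrent clients broadcasting changes, e.g. 5 -> 3 broadcasts -2.
            Runnable client = () -> {
                for (int i = 0; i < 1_000; i++) {
                    sum.onDeltaReceived(ThreadLocalRandom.current().nextLong(-10, 11));
                }
            };
            Thread a = new Thread(client);
            Thread b = new Thread(client);
            a.start(); b.start();
            a.join(); b.join();
            System.out.println("total = " + sum.currentTotal());
        }
    }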
I store readings in a database from sensors for a Temperature monitoring system.
There are 2 types of reading: air and product. The product temperature represents the slow temperature change of an item of food, as opposed to the actual air temperature.
The 2 temperatures are taken from different sensors (different locations within the environment, usually a large controlled environment), so they are not related (i.e. I cannot derive the product temperature from the air temperature).
Initially the product temperature I was provided with was already damped by the sensor; however, whoever wrote the firmware made a mistake, so the damped value is incorrect, and now I instead have to take the un-damped reading from the product sensor and apply the damping myself based on the last few readings in the database.
When a new reading comes in, I look at the last few undamped readings, and the last damped reading, and determine a new damped reading from that.
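The damping formula itself isn't shown here, so purely as an illustration, assume it is a simple exponential smoothing of the undamped readings; the per-reading update would then look something like this (ALPHA is a made-up smoothing factor):

    public class ProductTemperatureDamping {

        // Hypothetical smoothing factor; the real damping curve isn't specified here.
        private static final double ALPHA = 0.2;

        // Derives the next damped reading from the latest undamped reading and
        // the previous damped reading (simple exponential smoothing).
        static double nextDamped(double lastDamped, double undamped) {
            return ALPHA * undamped + (1 - ALPHA) * lastDamped;
        }

        public static void main(String[] args) {
            double damped = 4.0;                    // last damped reading from the database
            double[] incoming = {4.5, 5.1, 4.8};    // new undamped product readings
            for (double raw : incoming) {
                damped = nextDamped(damped, raw);
                System.out.printf("raw=%.1f damped=%.2f%n", raw, damped);
            }
        }
    }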
My question is: Should I store this calculated reading as well as the undamped reading, or should I calculate it in a view leaving all physically stored readings undamped?
One thing that might influence this: the readings are critical; alarm rows are generated against the readings when they go out of tolerance. It is to prevent food poisoning, and people can lose their jobs over it. People sign off the values they see, so those values must never change.
Normally I would use a view and put the calculation in the view, but I'm a little nervous about doing that this time. If the calculation gets "tweaked" I then have to make the view more complicated to use the old calculation before a certain timestamp, etc. (which is fine; I just have to be careful wherever I query the reading values - I don't like nesting views in other views, as sometimes it can slow the query).
What would you do in this case?
Thanks!
The underlying idea from the relational model is "logical data independence". Among other things, SQL views implement logical data independence.
So you can start by putting the calculation in a view. Later, when it becomes too complex to maintain that way, you can move the calculation to a SQL function or SQL stored procedure, or you can move the calculation to application code. You can store the results in a base table if you want to. Then update the view definition.
The view's clients should continue to work as if nothing had changed.
Here's one problem with storing this calculated value in a base table: you probably can't write a CHECK constraint to guarantee it was calculated correctly. This is a problem regardless of whether you display the value in a view. That means you might need some kind of administrative procedure to periodically validate the data.
Currently I have a project (written in Java) that reads sensor output from a microcontroller and writes it across several Postgres tables every second using Hibernate. In total I write about 130 columns' worth of data every second. Once the data is written it will stay static forever. This system seems to perform fine under the current conditions.
My question is regarding the best way to query and average this data in the future. There are several approaches I think would be viable but am looking for input as to which one would scale and perform best.
Being that we gather and write data every second, we end up generating more than 2.5 million rows per month. We currently plot this data via a JDBC select statement writing to a JChart2D (i.e. SELECT pressure, temperature, speed FROM data WHERE time_stamp BETWEEN startTime AND endTime). The user must be careful not to specify too long a time period (startTime and endTime delta < 1 day) or else they will have to wait several minutes (or longer) for the query to run.
The future goal would be to have a user interface similar to the Google Visualization API that powers Google Finance, particularly with regard to time scaling: the longer the time period, the "smoother" (or more averaged) the data becomes.
Options I have considered are as follows:
Option A: Use the SQL avg function to return the averaged data points to the user (a rough sketch of such a query appears after this list of options). I think this option would get expensive if the user asks to see the data for, say, half a year. I imagine the interface in this scenario would scale the number of rows to average based on the user's request, i.e. if the user asks for a month of data the interface will request an avg of every 86400 rows, which would return ~30 data points, whereas if the user asks for a day of data the interface will request an avg of every 2880 rows, which will also return 30 data points but of finer granularity.
Option B: Use SQL to return all of the rows in a time interval and use the Java interface to average out the data. I have briefly tested this for kicks and I know it is expensive because I'm returning 86400 rows/day of interval time requested. I don't think this is a viable option unless there's something I'm not considering when performing the SQL select.
Option C: Since all this data is static once it is written, I have considered using the Java program (with Hibernate) to also write tables of averages along with the data it is currently writing. In this option, I have several java classes that "accumulate" data then average it and write it to a table at a specified interval (5 seconds, 30 seconds, 1 minute, 1 hour, 6 hours and so on). The future user interface plotting program would take the interval of time specified by the user and determine which table of averages to query. This option seems like it would create a lot of redundancy and take a lot more storage space but (in my mind) would yield the best performance?
Option D: Suggestions from the more experienced community?
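For Option A, a sketch of what the bucketed averaging query could look like against the existing data table; the connection string is a placeholder, and the bucket size is derived from the requested window and a fixed number of points:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Timestamp;

    public class BucketedAverages {
        public static void main(String[] args) throws Exception {
            Timestamp start = Timestamp.valueOf("2012-01-01 00:00:00");
            Timestamp end   = Timestamp.valueOf("2012-01-31 23:59:59");
            int desiredPoints = 30;
            long bucketSeconds = Math.max(1, (end.getTime() - start.getTime()) / 1000 / desiredPoints);

            // One averaged row per time bucket instead of one row per second.
            String sql =
                "SELECT floor(extract(epoch FROM time_stamp) / ?) AS bucket, " +
                "       avg(pressure) AS pressure, avg(temperature) AS temperature, avg(speed) AS speed " +
                "FROM data " +
                "WHERE time_stamp BETWEEN ? AND ? " +
                "GROUP BY bucket ORDER BY bucket";

            try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/sensordb", "app", "secret");
                 PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setLong(1, bucketSeconds);
                ps.setTimestamp(2, start);
                ps.setTimestamp(3, end);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // each row is one averaged data point for the chart
                        System.out.printf("%d %.2f %.2f %.2f%n",
                                rs.getLong("bucket"), rs.getDouble("pressure"),
                                rs.getDouble("temperature"), rs.getDouble("speed"));
                    }
                }
            }
        }
    }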
Option A won't tend to scale very well once you have large quantities of data to pass over; Option B will probably tend to start relatively slow compared to A and scale even more poorly. Option C is a technique generally referred to as "materialized views", and you might want to implement this one way or another for best performance and scalability. While PostgreSQL doesn't yet support declarative materialized views (but I'm working on that this year, personally), there are ways to get there through triggers and/or scheduled jobs.
To keep the inserts fast, you probably don't want to try to maintain any views off of triggers on the primary table. What you might want to do instead is to periodically summarize detail into summary tables from crontab jobs (or similar). You might also want to create views that show summary data by using the summary tables which have been created, combined with the detail table for periods where summary rows don't exist yet.
The materialized view approach would probably work better for you if you partition your raw data by date range. That's probably a really good idea anyway.
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
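For the summary-table route, a sketch of a rollup job that could run from cron; the data_hourly table and the one-hour granularity are assumptions, and you would repeat the idea for coarser intervals:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Timestamp;
    import java.time.Instant;
    import java.time.temporal.ChronoUnit;

    public class HourlyRollupJob {
        public static void main(String[] args) throws Exception {
            // Roll up the previous (complete) hour; run this from cron or a scheduler.
            Instant hourEnd = Instant.now().truncatedTo(ChronoUnit.HOURS);
            Instant hourStart = hourEnd.minus(1, ChronoUnit.HOURS);

            String sql =
                "INSERT INTO data_hourly (hour_start, pressure, temperature, speed) " +
                "SELECT date_trunc('hour', time_stamp), avg(pressure), avg(temperature), avg(speed) " +
                "FROM data WHERE time_stamp >= ? AND time_stamp < ? " +
                "GROUP BY date_trunc('hour', time_stamp)";

            try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/sensordb", "app", "secret");
                 PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setTimestamp(1, Timestamp.from(hourStart));
                ps.setTimestamp(2, Timestamp.from(hourEnd));
                int rows = ps.executeUpdate();
                System.out.println("summarised " + rows + " hourly rows");
            }
        }
    }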
I need some inspiration for a solution...
We are running an online game with around 80,000 active users - we are hoping to expand this and are therefore setting a target of up to 500,000 users.
The game includes a highscore for all the users, which is based on a large set of data. This data needs to be processed in code to calculate the values for each user.
After the values are calculated we need to rank the users, and write the data to a highscore table.
My problem is that in order to generate a highscore for 500,000 users we need to load on the order of 25-30 million rows from the database, totalling around 1.5-2 GB of raw data. Also, in order to rank the values we need to have the total set of values.
Also we need to generate the highscore as often as possible - preferably every 30 minutes.
Now we could just use brute force - load the 30 million records every 30 minutes, calculate the values, rank them, and write them into the database - but I'm worried about the strain this will cause on the database, the application server and the network, and whether it's even possible.
I'm thinking the solution might be to break up the problem somehow, but I can't see how. So I'm seeking some inspiration on possible alternative solutions based on this information:
We need a complete highscore of all ~500,000 teams - we can't (won't unless absolutely necessary) shard it.
I'm assuming that there is no way to rank users without having a list of all users values.
Calculating the value for each team has to be done in code - we can't do it in SQL alone.
Our current method loads each user's data individually (3 calls to the database) to calculate the value - it takes around 20 minutes to load the data and generate the highscore for 25,000 users, which is too slow if this should scale to 500,000.
I'm assuming that hardware size will not be an issue (within reasonable limits)
We are already using memcached to store and retrieve cached data
Any suggestions, links to good articles about similar issues are welcome.
Interesting problem. In my experience, batch processes should only be used as a last resort. You are usually better off having your software calculate values as it inserts/updates the database with the new data. For your scenario, this would mean that it should run the score calculation code every time it inserts or updates any of the data that goes into calculating the team's score. Store the calculated value in the DB with the team's record. Put an index on the calculated value field. You can then ask the database to sort on that field and it will be relatively fast. Even with millions of records, it should be able to return the top n records in O(n) time or better. I don't think you'll even need a high scores table at all, since the query will be fast enough (unless you have some other need for the high scores table other than as a cache). This solution also gives you real-time results.
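A sketch of that approach, assuming a teams table with a team_id and an indexed score column (names made up): the score is written from the same code path that updates the underlying data, and the leaderboard is just an ORDER BY ... LIMIT query:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class TeamScores {

        // Called whenever the data feeding the score calculation is inserted/updated.
        static void storeScore(Connection con, long teamId, long score) throws Exception {
            String sql = "UPDATE teams SET score = ? WHERE team_id = ?";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setLong(1, score);
                ps.setLong(2, teamId);
                ps.executeUpdate();
            }
        }

        // With an index on teams(score), this stays fast even with millions of rows.
        static void printTop(Connection con, int limit) throws Exception {
            String sql = "SELECT team_id, score FROM teams ORDER BY score DESC LIMIT ?";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setInt(1, limit);
                try (ResultSet rs = ps.executeQuery()) {
                    int rank = 1;
                    while (rs.next()) {
                        System.out.printf("#%d team=%d score=%d%n",
                                rank++, rs.getLong("team_id"), rs.getLong("score"));
                    }
                }
            }
        }

        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/game", "app", "secret")) {
                storeScore(con, 42L, 9001L);
                printTop(con, 100);
            }
        }
    }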
Assuming that most of your 2GB of data is not changing that frequently you can calculate and cache (in db or elsewhere) the totals each day and then just add the difference based on new records provided since the last calculation.
In postgresql you could cluster the table on the column that represents when the record was inserted and create an index on that column. You can then make calculations on recent data without having to scan the entire table.
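A sketch of the incremental recalculation, assuming a hypothetical score_events table (indexed or clustered on inserted_at) and a team_totals cache table; only rows newer than the last run are scanned, and 'since' would normally come from a checkpoint recorded by the previous run:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Timestamp;
    import java.time.Instant;

    public class IncrementalTotalsJob {
        public static void main(String[] args) throws Exception {
            // Placeholder checkpoint: the last half hour of events.
            Timestamp since = Timestamp.from(Instant.now().minusSeconds(1800));

            // Fold the deltas from new events into the cached totals.
            String sql =
                "UPDATE team_totals t " +
                "SET total = t.total + d.delta " +
                "FROM (SELECT team_id, sum(points) AS delta " +
                "      FROM score_events WHERE inserted_at > ? GROUP BY team_id) d " +
                "WHERE d.team_id = t.team_id";

            try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/game", "app", "secret");
                 PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setTimestamp(1, since);
                System.out.println("updated " + ps.executeUpdate() + " team totals");
            }
        }
    }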
First and foremost:
The computation has to take place somewhere.
User experience impact should be as low as possible.
One possible solution is:
Replicate (mirror) the database in real time.
Pull the data from the mirrored DB.
Do the analysis on the mirror or on a third, dedicated, machine.
Push the results to the main database.
Results are still going to take a while, but at least performance won't be impacted as much.
How about saving those scores in a database, and then simply querying the database for the top scores (so that the computation is done on the server side, not on the client side, and thus there is no need to move the millions of records)?
It sounds pretty straightforward... unless I'm missing your point... let me know.
Calculate and store the score of each active team on a rolling basis. Once you've stored the score, you should be able to do the sorting/ordering/retrieval in the SQL. Why is this not an option?
It might prove fruitless, but I'd at least take a gander at the way sorting is done on a lower level and see if you can't manage to get some inspiration from it. You might be able to grab more manageable amounts of data for processing at a time.
Have you run tests to see whether or not your concerns with the data size are valid? On a mid-range server throwing around 2GB isn't too difficult if the software is optimized for it.
Seems to me this is clearly a job for caching, because you should be able to keep the half-million score records semi-local, if not in RAM. Every time you update data in the big DB, make the corresponding adjustment to the local score record.
Sorting the local score records should be trivial. (They are nearly in order to begin with.)
If you only need to know the top 100-or-so scores, then the sorting is even easier. All you have to do is scan the list and insertion-sort each element into a 100-element list. If the element is lower than the smallest element already in that list, which it is 99.98% of the time, you don't have to do anything.
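One way to sketch that top-100 scan in code is with a bounded min-heap, which behaves the same as the 100-element insertion list described above: most scores fail the single comparison and cost nothing.

    import java.util.PriorityQueue;
    import java.util.concurrent.ThreadLocalRandom;

    public class TopScores {
        public static void main(String[] args) {
            int keep = 100;
            // Min-heap of the best scores seen so far; the head is the 100th best.
            PriorityQueue<Long> top = new PriorityQueue<>(keep);

            for (int i = 0; i < 500_000; i++) {
                long score = ThreadLocalRandom.current().nextLong(1_000_000); // stand-in for a team's score
                if (top.size() < keep) {
                    top.offer(score);
                } else if (score > top.peek()) { // the vast majority of scores fail this check
                    top.poll();
                    top.offer(score);
                }
            }
            System.out.println("100th best score: " + top.peek());
        }
    }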
Then run a big update from the whole DB once every day or so, just to eliminate any creeping inconsistencies.