How to model a table to save many metrics for same timestamp in QuestDb? - data-modeling

I have zillions of sensors which tick every minute with some floating-point data, and I want to save the data in QuestDB. I see two options:
Option 1 is to create a wide table with zillions of columns and one row per minute
| Time | Sensor1 | Sensor2 | .... | Sensor1232123 |
| 10:01 | 3.4 | 0.0 | .... | 23.4 |
| 10:02 | 5.46 | 23.987 | .... | 0.0 |
...
And option 2 is a narrow table with one row per sensor reading
| Time | Id | Value |
| 10:01 | 1 | 3.4 |
| 10:01 | 2 | 0.0 |
...
| 10:01 | 123123 | 23.4 |
| 10:02 | 1 | 5.46 |
| 10:02 | 2 | 23.987 |
...
| 10:02 | 3 | 0.0 |
...
Since my data comes from individual sensors independently, I'm inclined to use option 2, but QuestDB requires the designated timestamp column to be ascending, so I assume I cannot have duplicate values in the Time column.
It sounds pretty common case but I cannot figure out how can I store my sensors data in one table.

You should use option 2 as you described it, with timestamp, sensor id and value saved in the table.
Repeated timestamps are allowed, so it is valid to have all the sensors as individual rows at 10:01, 10:02, etc.
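To illustrate the point, here is a minimal Python sketch (not QuestDB code) of the narrow layout. It assumes, as stated above, that equal timestamps are accepted as long as the column never goes backwards. The helper names `is_non_decreasing` and `pivot` are made up for this example; the sample values come from the tables in the question.

```python
from collections import defaultdict

# Narrow layout: one row per (time, sensor id, value), as in option 2.
rows = [
    ("10:01", 1, 3.4),
    ("10:01", 2, 0.0),
    ("10:01", 123123, 23.4),
    ("10:02", 1, 5.46),
    ("10:02", 2, 23.987),
]

def is_non_decreasing(rows):
    """True if timestamps never go backwards; equal neighbours are fine."""
    return all(a[0] <= b[0] for a, b in zip(rows, rows[1:]))

def pivot(rows):
    """Rebuild the wide view of option 1: time -> {sensor id: value}."""
    wide = defaultdict(dict)
    for ts, sensor_id, value in rows:
        wide[ts][sensor_id] = value
    return dict(wide)

ordered = is_non_decreasing(rows)
wide = pivot(rows)
```

The wide view of option 1 can always be derived from the narrow rows at query time, so nothing is lost by storing one row per reading.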

Related

Algorithm or pseudo code for finding combinations from table

I have a table with thousands of items with a lot of attributes (approx 15+). I would like to select the following results:
1. Select all combinations of items that reach at least 100% for each attribute. Exactly 100% would be nice, but that's not necessary, so it can go a little over or a little under (maybe +-2%).
2. All combinations would be a big dataset, so I think it would be better to sort them by price and select only the 10 cheapest ones.
3. What if I would like to modify the selects above so that one or several attributes can't go over some value, like 50% for example?
| ----------- | ------------ | ----------- | ----------- | ----- |
| item name | attribute 1 | attribute 2 | attribute 3 | price |
| item 1 | 25% | 1% | 5% | 1€ |
| item 2 | 10% | 10% | 10% | 2€ |
| item 3 | 5% | 20% | 5% | 3€ |
| item 4 | 20% | 15% | 50% | 12€ |
I don't know if there is an existing algorithm for my problem ( I hope so ) or my problem has a name I can google but I would be thankful for any tips how I should proceed.
The only way I can think of for now is to brute-force all the combinations and drop the unusable ones. But I don't think that's the right way (maybe I'm wrong and that's the only way).
The number of items, price and attribute values can change over time. If they were static I would just run the bruteforce option once and be done with it.
Sorry if this question was already asked.
EDIT:
As an example I can provide nutritional information about food (All the numbers are made up):
daily intake of carbohydrates/fat/protein are 225g/30g/65g
| ----------- | --------------- | ------- | --------- | ------ | ----- |
| item name | carbohydrates | fat | protein | sodium | price |
| apple | 10g | 1g | 5g | 1mg | 1€ |
| banana | 20g | 2g | 10g | 1mg | 2€ |
| pear | 15g | 3g | 5g | 5mg | 3€ |
| ----------- | --------------- | ------- | --------- | ------ | ----- |
1. Find me combinations of foods which will reach the daily intake.
2. Now I want the same as in 1, but sorted by price, selecting the cheapest.
3. I want only combinations with sodium not exceeding 30mg.
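For reference, the brute-force idea mentioned in the question can be sketched in a few lines of Python. This is only an illustration under assumptions: each item is used at most once, the data in `ITEMS` is invented, and the function name `cheapest_combinations` and its parameters are hypothetical. The search is exponential in the number of items, so it will not scale to thousands of rows; the problem resembles the knapsack and diet problems, which are usually attacked with integer linear programming instead.

```python
from itertools import combinations

def cheapest_combinations(items, targets, tolerance=0.02, caps=None, limit=10):
    """Enumerate every subset of items and keep the cheapest valid ones.

    items:   name -> (attribute totals dict, price)
    targets: attribute -> required total (may fall short by `tolerance`)
    caps:    attribute -> maximum allowed total (optional)
    """
    caps = caps or {}
    names = sorted(items)
    results = []
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            totals = {}
            for name in combo:
                for attr, amount in items[name][0].items():
                    totals[attr] = totals.get(attr, 0) + amount
            # Drop combinations that miss a target by more than the tolerance.
            if any(totals.get(a, 0) < t * (1 - tolerance) for a, t in targets.items()):
                continue
            # Drop combinations that exceed a cap.
            if any(totals.get(a, 0) > c for a, c in caps.items()):
                continue
            price = sum(items[name][1] for name in combo)
            results.append((price, combo))
    results.sort()
    return results[:limit]

# Invented data: two attributes x and y (in percent) and a price.
ITEMS = {
    "a": ({"x": 60, "y": 50}, 3),
    "b": ({"x": 40, "y": 50}, 2),
    "c": ({"x": 100, "y": 100}, 10),
    "d": ({"x": 50, "y": 60}, 1),
}
best = cheapest_combinations(ITEMS, {"x": 100, "y": 100})
capped = cheapest_combinations(ITEMS, {"x": 100, "y": 100}, caps={"x": 120})
```

Steps 1 and 2 map to the `targets` filter plus the final sort, and step 3 maps to `caps`.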

How do you make a table into one long row in SAS?

I have a table with a number of variables such as:
+-----------+------------+---------+-----------+--------+
| DateFrom | DateTo | Price | Discount | Cost |
+-----------+------------+---------+-----------+--------+
| 01jan17 | 01jul17 | 17 | 4 | 5 |
| 01aug17 | 01feb18 | 15 | 1 | 3 |
| 01mar18 | 01dec18 | 12 | 2 | 1 |
| ... | ... | ... | ... | ... |
+-----------+------------+---------+-----------+--------+
However I want to split this so I have:
+------------+------------+----------+-------------+---------+-------------+------------+----------+-------------+-------------+
| DateFrom1 | DateTo1 | Price1 | Discount1 | Cost1 | DateFrom2 | DateTo2 | Price2 | Discount2 | Cost2 ... |
+------------+------------+----------+-------------+---------+-------------+------------+----------+-------------+-------------+
| 01jan17 | 01jul17 | 17 | 4 | 5 | 01aug17 | 01feb18 | 15 | 1 | 3 |
+------------+------------+----------+-------------+---------+-------------+------------+----------+-------------+-------------+
There's a cool (not at all obvious) solution using proc summary and the idgroup statement that only takes a few lines of code. It runs in memory, so you're likely to run into problems if the dataset is large; otherwise it works very well.
Note that out[3] relates to the number of rows in the source data. You could easily make this dynamic by adding a prior step that calculates the number of rows and stores it in a macro variable.
/* create initial dataset */
data have;
input (DateFrom DateTo) (:date7.) Price Discount Cost;
format DateFrom DateTo date7.;
datalines;
01jan17 01jul17 17 4 5
01aug17 01feb18 15 1 3
01mar18 01dec18 12 2 1
;
run;
/* transform data into 1 row */
proc summary data=have nway;
output out=want (drop=_:)
idgroup(out[3] (_all_)=) / autoname;
run;

What database technologies should I consider for building a scalable "running average" view?

We are working on an application where millions of users will be entering information at the same time. Suppose the application allows people to rate geographic regions on where they would like to live. Each participant is allowed to rate each region using a decimal value from 0-10. Each person belongs to one or more groups based upon attributes such as gender, and people that consider themselves active, or enjoy culture.
Every time a rating is made, we need a view which shows us the average rating for each region/group. I'm aware that most DBs have an "average" function, but for our purposes we need to be able to use our own function, as we may use the geometric mean instead of the arithmetic mean.
Below are some tables which might be used. Note: I did not include the relationship table PeopleGroups which map which groups a person is a member of for brevity purposes.
Regions
+-----+------------+
| RID | NAME       |
+-----+------------+
| 1   | Florida    |
| 2   | California |
+-----+------------+
People
+-----+-------+
| PID | Name  |
+-----+-------+
| P1  | Alice |
| P2  | Bob   |
| P3  | Frank |
| P4  | Mary  |
+-----+-------+
Groups
+-----+----------+
| GID | Name     |
+-----+----------+
| G0  | Everyone |
| G1  | Women    |
| G2  | Men      |
| G3  | Active   |
| G4  | Culture  |
+-----+----------+
RegionScoresByPerson
+-----+-----+-------+
| RID | PID | Score |
+-----+-----+-------+
| 1   | P1  | 6     |
| 1   | P2  | 8     |
| 1   | P3  | 3     |
| 1   | P4  | 2     |
| 1   | P1  | 7     |
| 1   | P2  | 5     |
| 1   | P3  | 8     |
| 1   | P4  | 2     |
+-----+-----+-------+
Our current implementation uses a similar set of tables for storing ratings, but we don't calculate averages in real time. Any time we need the results (e.g. show me the average score of California for women), we have to pull all the information into memory and run the calculations manually.
I was wondering how I can leverage database technologies such as views, triggers, stored procedures, etc. to present a simple table that gives me scores for people and groups, so we don't have to run the calculations manually.
I would like a table like the following, where everything is handled by the DB. Any insert, update or delete on the RegionScoresByPerson or Groups tables would automatically be reflected in this table. If it is not apparent, the rows marked with * are calculated rows. In this case I'm using a simple arithmetic average, but the design should allow for any type of function.
EID stands for entity ID (a person or group)
Besides deciding how to build such a view, I'm unsure of what sort of datatypes to use (and index) for People and Groups. I suppose I'd like the index to be integers, but that would prevent me from creating the table below because I couldn't distinguish between Person 1 and Group 1 -- Would having ID's such as P1 and G1 be a performance hit? I'm obviously concerned about the design being scalable.
ScoreView
+-----------+-----+-------+
| RID | EID | Score |
| 1 | P1 | 6 |
| 1 | P2 | 8 |
| 1 | P3 | 3 |
| 1 | P4 | 2 |
| 1 | P1 | 7 |
| 1 | P2 | 5 |
| 1 | P3 | 8 |
| 1 | P4 | 2 |
| 1 | G0 | 4.75 |*
| 1 | G1 | 4 |*
| 1 | G2 | … |*
| 1 | G3 | … |*
+-----------+-----+-------+
Apache Flume is the open source tool designed to solve this kind of problem. Also have a look at Google Cloud Dataflow.
https://flume.apache.org/
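Whatever technology ends up hosting it, the core of a "running average" is that each insert or delete only touches a few per-(region, group) aggregates instead of rescanning all ratings. Below is a minimal Python sketch of that idea, not a database implementation. The class name `RunningMeans` and the group memberships used in the example (Alice and Mary in G1) are assumptions made for illustration; the scores are the first four rows of RegionScoresByPerson.

```python
import math
from collections import defaultdict

class RunningMeans:
    """Keeps count, sum and sum of logs per (region, group) key."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)
        self.log_total = defaultdict(float)
        self.zeros = defaultdict(int)  # log(0) is undefined, so count zeros apart

    def _apply(self, region, groups, score, sign):
        for group in groups:
            key = (region, group)
            self.count[key] += sign
            self.total[key] += sign * score
            if score > 0:
                self.log_total[key] += sign * math.log(score)
            else:
                self.zeros[key] += sign

    def add(self, region, groups, score):
        self._apply(region, groups, score, +1)

    def remove(self, region, groups, score):
        self._apply(region, groups, score, -1)

    def arithmetic(self, region, group):
        key = (region, group)
        return self.total[key] / self.count[key]

    def geometric(self, region, group):
        key = (region, group)
        if self.zeros[key]:
            return 0.0  # any zero score makes the geometric mean zero
        return math.exp(self.log_total[key] / self.count[key])

means = RunningMeans()
means.add(1, ["G0", "G1"], 6)  # Alice, assumed to be in Everyone and Women
means.add(1, ["G0"], 8)        # Bob
means.add(1, ["G0"], 3)        # Frank
means.add(1, ["G0", "G1"], 2)  # Mary, assumed to be in Everyone and Women
```

With these four ratings, `means.arithmetic(1, "G0")` gives 4.75 and `means.arithmetic(1, "G1")` gives 4.0, matching the starred rows in ScoreView. An update can be handled as a `remove` of the old score followed by an `add` of the new one. The same bookkeeping could live in triggers or a stream processor.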

MySQL Import into Innodb table severely spikes at a certain point

I'm trying to migrate a 30GB database from one server to another.
The short story is that at a certain point through the process, the amount of time it takes to import records severely increases as a spike. The following is from using the SOURCE command to import a chunk of 500k records (out of about ~25-30 million throughout the database) that was exported as an sql file that was ssh tunnelled over to the new server:
...
Query OK, 2871 rows affected (0.73 sec)
Records: 2871 Duplicates: 0 Warnings: 0
Query OK, 2870 rows affected (0.98 sec)
Records: 2870 Duplicates: 0 Warnings: 0
Query OK, 2865 rows affected (0.80 sec)
Records: 2865 Duplicates: 0 Warnings: 0
Query OK, 2871 rows affected (0.87 sec)
Records: 2871 Duplicates: 0 Warnings: 0
Query OK, 2864 rows affected (2.60 sec)
Records: 2864 Duplicates: 0 Warnings: 0
Query OK, 2866 rows affected (7.53 sec)
Records: 2866 Duplicates: 0 Warnings: 0
Query OK, 2879 rows affected (8.70 sec)
Records: 2879 Duplicates: 0 Warnings: 0
Query OK, 2864 rows affected (7.53 sec)
Records: 2864 Duplicates: 0 Warnings: 0
Query OK, 2873 rows affected (10.06 sec)
Records: 2873 Duplicates: 0 Warnings: 0
...
The spikes eventually average 16-18 seconds per ~2800 rows affected. Granted, I don't usually use SOURCE for a large import, but for the sake of showing legitimate output, I used it to understand when the spikes happen. Using the mysql command or mysqlimport yields the same results. Even piping the results directly into the new database instead of going through an sql file produces these spikes.
As far as I can tell, this happens after a certain number of records have been inserted into a table. The first time I boot up a server and import a chunk that size, it goes through just fine, give or take the estimated amount it handles before these spikes occur. I can't correlate that, because I haven't replicated the issue consistently enough to draw a conclusion. There are ~20 tables with fewer than 500,000 records each that all imported just fine when they were imported through a single command. This seems to happen only to tables that hold an excessive amount of data.
Granted, the solutions I've come across so far seem to address only the natural DR that occurs when you import over time. (The expected output in my case was that by the end of importing 500k records it would take 2-3 seconds per ~2800 rows, whereas those questions were about it not taking that long at the end.)
This comes from a single sugarCRM table called 'campaign_log', which has ~9 million records. I was able to import it in chunks of 500k back onto the old server I'm migrating off of without these spikes occurring, so I assume this has to do with my new server configuration.
Another thing is that whenever these spikes occur, the table being imported into displays its record count oddly. I know InnoDB gives count estimates, but the number isn't preceded by the ~ that indicates an estimate. It is usually accurate, and it doesn't change each time you refresh the table. (This is based on what PHPMyAdmin reports.)
Here are the commands and InnoDB system variables I have on the new server:
INNODB System Vars:
+---------------------------------+------------------------+
| Variable_name | Value |
+---------------------------------+------------------------+
| have_innodb | YES |
| ignore_builtin_innodb | OFF |
| innodb_adaptive_flushing | ON |
| innodb_adaptive_hash_index | ON |
| innodb_additional_mem_pool_size | 8388608 |
| innodb_autoextend_increment | 8 |
| innodb_autoinc_lock_mode | 1 |
| innodb_buffer_pool_instances | 1 |
| innodb_buffer_pool_size | 8589934592 |
| innodb_change_buffering | all |
| innodb_checksums | ON |
| innodb_commit_concurrency | 0 |
| innodb_concurrency_tickets | 500 |
| innodb_data_file_path | ibdata1:10M:autoextend |
| innodb_data_home_dir | |
| innodb_doublewrite | ON |
| innodb_fast_shutdown | 1 |
| innodb_file_format | Antelope |
| innodb_file_format_check | ON |
| innodb_file_format_max | Antelope |
| innodb_file_per_table | OFF |
| innodb_flush_log_at_trx_commit | 1 |
| innodb_flush_method | fsync |
| innodb_force_load_corrupted | OFF |
| innodb_force_recovery | 0 |
| innodb_io_capacity | 200 |
| innodb_large_prefix | OFF |
| innodb_lock_wait_timeout | 50 |
| innodb_locks_unsafe_for_binlog | OFF |
| innodb_log_buffer_size | 8388608 |
| innodb_log_file_size | 5242880 |
| innodb_log_files_in_group | 2 |
| innodb_log_group_home_dir | ./ |
| innodb_max_dirty_pages_pct | 75 |
| innodb_max_purge_lag | 0 |
| innodb_mirrored_log_groups | 1 |
| innodb_old_blocks_pct | 37 |
| innodb_old_blocks_time | 0 |
| innodb_open_files | 300 |
| innodb_print_all_deadlocks | OFF |
| innodb_purge_batch_size | 20 |
| innodb_purge_threads | 1 |
| innodb_random_read_ahead | OFF |
| innodb_read_ahead_threshold | 56 |
| innodb_read_io_threads | 8 |
| innodb_replication_delay | 0 |
| innodb_rollback_on_timeout | OFF |
| innodb_rollback_segments | 128 |
| innodb_spin_wait_delay | 6 |
| innodb_stats_method | nulls_equal |
| innodb_stats_on_metadata | ON |
| innodb_stats_sample_pages | 8 |
| innodb_strict_mode | OFF |
| innodb_support_xa | ON |
| innodb_sync_spin_loops | 30 |
| innodb_table_locks | ON |
| innodb_thread_concurrency | 0 |
| innodb_thread_sleep_delay | 10000 |
| innodb_use_native_aio | ON |
| innodb_use_sys_malloc | ON |
| innodb_version | 5.5.39 |
| innodb_write_io_threads | 8 |
+---------------------------------+------------------------+
System Specs:
Intel Xeon E5-2680 v2 (Ivy Bridge) 8 Processors
15GB Ram
2x80 SSDs
CMD to Export:
mysqldump -u <olduser> -p<oldpw> <olddb> <table> --verbose --disable-keys --opt | ssh -i <privatekey> <newserver> "cat > <nameoffile>"
Thank you for any assistance. Let me know if there's any other information I can provide.
I figured it out. I increased innodb_log_file_size from 5MB to 1024MB. Besides significantly increasing the rate at which records imported (it never went above 1 second per 3000 rows), this also fixed the spikes. There were only 2 spikes across all the records I imported, and after they happened the import immediately went back to taking under 1 second.
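To put the change in perspective, here is a small Python calculation using the values from the variable listing above. The total redo log capacity is innodb_log_file_size multiplied by innodb_log_files_in_group. A likely explanation, which the post does not confirm, is that a redo log this small next to an 8GB buffer pool forces InnoDB to checkpoint and flush dirty pages almost constantly during a bulk insert. The function name `redo_capacity_bytes` is made up for this example, and it assumes innodb_log_files_in_group stayed at 2 after the change.

```python
MIB = 1024 * 1024

def redo_capacity_bytes(log_file_size, files_in_group):
    """Total redo log space: size of one log file times the number of files."""
    return log_file_size * files_in_group

buffer_pool = 8589934592                         # innodb_buffer_pool_size
before = redo_capacity_bytes(5242880, 2)         # original setting, 5MB per file
after = redo_capacity_bytes(1024 * MIB, 2)       # after the change, 1024MB per file

before_mib = before // MIB                       # 10
after_mib = after // MIB                         # 2048
pool_to_redo_before = buffer_pool / before       # buffer pool is ~819x the redo log
pool_to_redo_after = buffer_pool / after         # buffer pool is 4x the redo log
```

So the original configuration had only 10MB of redo space for an 8GB buffer pool, and the new one has 2GB.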

Difficult kind of Hierarchical Data in Relational Database

I have "components" which can be assembled in different ways into a "system". I want my database to hold all these "components", their type specific data and define how they are connected to each other to form a "system".
The systems are typically gearboxes and they can have rather complex branched designs. Let's start with an easy example:
This system is built up out of Masses (horizontal lines) and Stiffnesses (vertical lines). Gears and clutches are types of masses and come in pairs. Colors represent different branch speeds due to gear ratios. Here's a (bad) example of how I could store everything from this particular illustration:
ID | Type | Clutch | Ends | DrivenBy | NoOfTeeth| Mass | Stiffness
--- | ---- | ------ | ---- | --------- | -------- | ---- | ---------
1 | Mass | | Input1 | | | 5 |
2 | Stiffness | | | | | | 15
3 | Mass | 1.1 | | | | 2 |
4 | Mass | 1.2 | | | | 3 |
5 | Stiffness | | | | | | 20
6 | Gear | | | | 10 | 4 |
7 | Stiffness | | | | | | 30
8 | Gear | | | | 4 | 5 |
9 | Gear | | | 8 | 7 | 2 |
10 | Stiffness | | | | | | 40
11 | Mass | | | | | 4 |
12 | Stiffness | | Output1 | | | | 10
13 | Gear | | | 6 | 5 | 4 |
14 | Stiffness | | | | | | 20
15 | Mass | 2.1 | | | | 4 |
16 | Mass | 2.2 | | | | 3 |
17 | Stiffness | | | | | | 30
18 | Mass | | Output2 | | | 2 |
Obviously, this is not a very good way to store the data. This design somewhat resembles the "repeated attributes" pattern, since each component type has a different set of attributes to fill in. I could create a table for each type of component, but things become more complex when looking at other examples, such as this 2-stage gearbox:
There are also examples with more than 1 input and several outputs, but I can't post more links due to low reputation.
Either way, you will see that the usual hierarchical data storage doesn't apply here, because the data is not purely "tree-shaped" with everything branching off from 1 main branch.
I think that even though I could store data in the above-mentioned way, I will run into huge difficulties when it comes to the programming stage.
To add to the complexity, these gearboxes are actually sub-systems of a much bigger system.
So, any suggestions on a good way to store this type of data?
Perhaps this is a possible way of doing it?
Here you will see that there is a "main" table called GearboxBranch, keeping track of all elements in the gearbox, giving them an id and to identify in which branch the element exists.
Then for the elements themselves, masses are defined in their dedicated table, and so are stiffnesses. Gears and Clutches (which are types of masses) are then defined in their respective tables. There is a recursive relationship in the gear table, since a gear can be driven by another gear.
Furthermore, the table with Shaft Ends defines which of the elements in the gearbox are input or output and what number they have.
I can't seem to see any problems with this method, but I'm a little unsure how to get data out of the database. There will be considerable coding involved I'm afraid.
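To get a feel for how data would come out of such a design, here is a minimal sketch using Python's built-in sqlite3 module. Only the table name GearboxBranch comes from the description above; every column name, the branch numbers and the omission of the clutch table are assumptions made to keep the example short. The gear values (ids, teeth, masses, DrivenBy) are taken from the flat table earlier in the question.

```python
import sqlite3

SCHEMA = """
CREATE TABLE gearbox_branch (              -- one row per element in the gearbox
    element_id INTEGER PRIMARY KEY,
    branch     INTEGER NOT NULL
);
CREATE TABLE mass (
    element_id INTEGER PRIMARY KEY REFERENCES gearbox_branch(element_id),
    mass       REAL NOT NULL
);
CREATE TABLE stiffness (
    element_id INTEGER PRIMARY KEY REFERENCES gearbox_branch(element_id),
    stiffness  REAL NOT NULL
);
CREATE TABLE gear (                        -- a gear is a kind of mass
    element_id  INTEGER PRIMARY KEY REFERENCES mass(element_id),
    no_of_teeth INTEGER NOT NULL,
    driven_by   INTEGER REFERENCES gear(element_id)   -- recursive relationship
);
CREATE TABLE shaft_end (
    element_id INTEGER PRIMARY KEY REFERENCES gearbox_branch(element_id),
    kind       TEXT NOT NULL CHECK (kind IN ('input', 'output')),
    number     INTEGER NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)

# Gears 6, 8, 9 and 13 from the flat table; branch numbers are invented.
conn.executemany("INSERT INTO gearbox_branch VALUES (?, ?)",
                 [(6, 1), (8, 1), (9, 2), (13, 3)])
conn.executemany("INSERT INTO mass VALUES (?, ?)",
                 [(6, 4), (8, 5), (9, 2), (13, 4)])
conn.executemany("INSERT INTO gear VALUES (?, ?, ?)",
                 [(6, 10, None), (8, 4, None), (9, 7, 8), (13, 5, 6)])

# Self-join on gear: each driven gear, its driver, and driver/driven teeth.
pairs = conn.execute("""
    SELECT g.element_id, d.element_id, 1.0 * d.no_of_teeth / g.no_of_teeth
    FROM gear AS g
    JOIN gear AS d ON g.driven_by = d.element_id
    ORDER BY g.element_id
""").fetchall()
```

Each subtype table shares its primary key with its parent table, so reading a full element is a join on element_id, and the gear-to-gear link is a self-join rather than a tree traversal.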
