I'm currently writing some code for one of my classes on distributed and parallel database processing. I'm doing horizontal fragmentation on some data and am required to keep track of the different pieces of data.
The professor recommends storing "metadata" to keep track of some basic computations. Is this as simple as creating another table and storing some basic information, or is there a much more efficient way of doing this?
Example:
I need to track ranges for min/max values of every table in my database. Should I store that information in an entirely new table or is there a better way of achieving this?
Yes, you should store min/max in a different table. Depending on your application, you might need more than one of those kinds of tables.
Each insert, update, or delete statement can change either or both of those values. Think about how you want to handle that. (Triggers, probably.)
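As a rough sketch of what that might look like (SQLite via Python purely for illustration; the orders table, amount column, and the trigger logic are made-up assumptions, and your vendor's trigger syntax will differ):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);

    -- One summary row per (table, column) whose range we track.
    CREATE TABLE column_ranges (
        table_name  TEXT,
        column_name TEXT,
        min_value   REAL,
        max_value   REAL,
        PRIMARY KEY (table_name, column_name)
    );
    INSERT INTO column_ranges VALUES ('orders', 'amount', NULL, NULL);

    -- Keep the range up to date on every insert.
    CREATE TRIGGER orders_amount_range AFTER INSERT ON orders
    BEGIN
        UPDATE column_ranges
        SET min_value = MIN(COALESCE(min_value, NEW.amount), NEW.amount),
            max_value = MAX(COALESCE(max_value, NEW.amount), NEW.amount)
        WHERE table_name = 'orders' AND column_name = 'amount';
    END;
    """)

    con.executemany("INSERT INTO orders (amount) VALUES (?)", [(10,), (3,), (42,)])
    print(con.execute("SELECT * FROM column_ranges").fetchall())
    # -> [('orders', 'amount', 3.0, 42.0)]
    # Deletes and updates are harder: shrinking a min/max generally means
    # recomputing it with SELECT MIN(amount), MAX(amount) FROM orders.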
Terminology
Metadata just means "data about other data", and min/max values for one or more columns in each table is arguably data about other data. But I've never seen such data called metadata. It's always either summary or aggregate data.
I think you'll find that when most DBAs and database developers use metadata, they're talking about system tables or the information_schema views that are built on top of system tables.
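For instance, that catalog can be queried like any other table; sqlite_master is SQLite's version of it, and MySQL/PostgreSQL expose the equivalent through information_schema views such as information_schema.tables:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    # This is "metadata" in the system-tables sense: data the engine keeps
    # about your tables, not summary values you compute yourself.
    for name, sql in con.execute(
            "SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
        print(name, "->", sql)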
Related
I wanted to ask whether there might be a different and better approach than mine.
I have a model entity that can have an arbitrary number of hyperparameters. Depending on the specific model I want to insert as a row into the model table, I may have specific hyperparameters. I do not want to continuously add new columns to my model table for new hyperparameters that I encounter when trying out new models (plus, I don't like having a lot of columns that are null for many rows). I also want to easily filter models on specific hyperparameter values, e.g. "select * from models where model.hyperparameter_x.value < 0.5". So an n-to-n relationship to a hyperparameter table comes to mind. The issue is that the datatype for hyperparameters can differ, so I cannot define a general value column on the relationship table with a datatype that is easily comparable across different models.
So my idea is to define a JSON-typed "value" column in the relationship table to support different value datatypes (float, array, string, ...). What I don't like about that idea, and what was legitimately criticized by colleagues, is that this can result in chaos within the value column pretty fast, e.g. people inserting data with very different JSON structures for the same hyperparameter. To mitigate this issue, I would introduce a "json_regex_template" column in the hyperparameter table, so that at the API level I can easily validate whether the JSON for a value of hyperparameter x is correctly defined by the user. An additional "json_example" column in the hyperparameter table would further help the user on the other side of the API make correct requests.
This solution would still not guarantee non-chaos at the database request level (even though no user should directly insert data without using the API, so I don't think that's a very big deal). And the solution still feels a bit hacky. I would believe that I'm not the first person with this problem, and maybe there is a best practice to solve it?
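For concreteness, here is a minimal sketch of what I have in mind (SQLite via Python purely for illustration; all table and column names are placeholders, and the regex/template validation is only hinted at):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE model (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );

    CREATE TABLE hyperparameter (
        id                  INTEGER PRIMARY KEY,
        name                TEXT NOT NULL UNIQUE,
        json_regex_template TEXT,   -- used by the API to validate submitted values
        json_example        TEXT    -- e.g. '0.01' or '[64, 128]'
    );

    -- The n-to-n relationship table with a JSON value column.
    CREATE TABLE model_hyperparameter (
        model_id          INTEGER REFERENCES model(id),
        hyperparameter_id INTEGER REFERENCES hyperparameter(id),
        value             TEXT CHECK (json_valid(value)),  -- keep out non-JSON junk
        PRIMARY KEY (model_id, hyperparameter_id)
    );
    """)

    con.execute("INSERT INTO model VALUES (1, 'model_a'), (2, 'model_b')")
    con.execute("INSERT INTO hyperparameter VALUES (1, 'learning_rate', NULL, '0.01')")
    con.execute("""INSERT INTO model_hyperparameter VALUES
                   (1, 1, '0.1'), (2, 1, '0.7')""")

    # "select * from models where hyperparameter_x < 0.5" becomes a join:
    rows = con.execute("""
        SELECT m.name
        FROM model m
        JOIN model_hyperparameter mh ON mh.model_id = m.id
        JOIN hyperparameter h        ON h.id = mh.hyperparameter_id
        WHERE h.name = 'learning_rate'
          AND CAST(json_extract(mh.value, '$') AS REAL) < 0.5
    """).fetchall()
    print(rows)  # -> [('model_a',)]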
Is my aversion to continuously adding columns reasonable? It's probably about 3-5 new columns per month; that may saturate at some point to a lower number, but that's speculative.
I'm aware of this post (Storing JSON in database vs. having a new column for each key), but it's pretty old, so my hope is that there may be new stuff I could use. The model-hyperparameter thing is of course just a small part of my full database model. Changing to a non-relational database is not an option.
Opinions are much appreciated
In a SQL database, we generally have related information stored in different tables. From what I read in the RocksDB documentation, there's really no clear or 'right' way to represent this kind of structure. So I'm wondering: what is the common practice for categorizing information?
Say I have three types of information: Customer, Product, and Employee. And I want to implement these in RocksDB. Should I use a key prefix, different column families, or different databases?
Thanks for any suggestion.
You can do it by coming up with a prefix scheme that encodes the table, the column, and the id. For simplicity you could store everything in one column family, and definitely in one DB, since you then have atomic operations, snapshots, and so on. The better question is why you would want to store relational data in a NoSQL DB, unless you are building something higher level.
By the way, check out linqdb, which is an example of a higher-level DB where you can store entities and perform LINQ-style operations; it uses RocksDB underneath.
The way data is organized in a key-value store is up to the implementation. There is no single right way to go; it depends on the features of the underlying key-value store (in particular, whether it is ordered by key).
The same normalization/denormalization techniques apply.
I think the piece you are missing about key-value store application design is the concept of key composition. Briefly, it is the practice of building keys in such a way that they are amenable to querying. If the database is ordered by key, it will also allow prefix/range/scan queries and next/previous navigation. This will lead you to build key prefixes in such a way that querying is fast, i.e. does not require a full table scan (see the sketch below).
Expand your research to other key value stores like bsddb or leveldb here on Stack Overflow.
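Here is a minimal sketch of key composition; a plain Python dict stands in for an ordered key-value store like RocksDB (which keeps keys byte-sorted), and the prefixes and record layout are illustrative assumptions, not any particular API:

    db = {}

    def put(kind, entity_id, field, value):
        # "table" + primary key + column encoded into one composite key,
        # e.g. b"customer:0001:name". Zero-padding keeps numeric ids ordered.
        key = f"{kind}:{entity_id:04d}:{field}".encode()
        db[key] = value.encode()

    put("customer", 1, "name",  "Alice")
    put("customer", 2, "name",  "Bob")
    put("product",  1, "title", "Widget")
    put("employee", 7, "name",  "Carol")

    def prefix_scan(prefix):
        # In RocksDB this would be an iterator seek to the prefix followed by
        # next() calls; here we just sort the keys to simulate that order.
        p = prefix.encode()
        return [(k, db[k]) for k in sorted(db) if k.startswith(p)]

    # "SELECT * FROM customer" becomes a prefix scan over "customer:".
    for k, v in prefix_scan("customer:"):
        print(k.decode(), "->", v.decode())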
I have a table where I am storing a startingDate in a DateTime column.
Once I have the startingDate value, I am supposed to calculate the
number_of_days,
number_of_weeks,
number_of_months, and
number_of_years
all from the startingDate to the current date.
If you are going to use these values in two or more places in the application, and you care about the application's response time, would you rather do the calculations in a view or create computed columns for each, so you can query the table directly?
Computed columns are easy to maintain and provide an ideal solution to your problem – I have used such a solution recently. However, be aware the values are calculated when requested (when they are SELECTed), not when the row is INSERTed into the table – so performance might still be an issue. This might be acceptable if you can off-load work from the application server to the database server. Views also don’t exist until they are requested (unless they are materialised) so, again, there will be an overhead at runtime, but, again it’s on the database server, not the application server.
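For illustration, the view option might look like this (SQLite date functions and made-up table/column names, so treat it as a sketch rather than vendor-specific syntax; the week/month/year figures are rough day-based approximations, and functions like SQL Server's DATEDIFF count boundaries differently):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, startingDate TEXT);

    -- The view computes the durations at SELECT time, exactly as described above.
    CREATE VIEW event_durations AS
    SELECT id,
           startingDate,
           CAST(julianday('now') - julianday(startingDate) AS INTEGER)            AS number_of_days,
           CAST((julianday('now') - julianday(startingDate)) / 7 AS INTEGER)      AS number_of_weeks,
           CAST((julianday('now') - julianday(startingDate)) / 30.44 AS INTEGER)  AS number_of_months,
           CAST((julianday('now') - julianday(startingDate)) / 365.25 AS INTEGER) AS number_of_years
    FROM events;
    """)

    con.execute("INSERT INTO events (startingDate) VALUES ('2023-01-15')")
    print(con.execute("SELECT * FROM event_durations").fetchall())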
Like nearly everything: It depends.
As RedX suggests, there's probably not much of a performance difference either way, so it becomes a question of how you will use them. To me this is more of a feel thing.
Using them more than once wouldn't necessarily drive me immediately to either a view or computed columns. If I only use them in a few places or in low-volume code paths, I might calculate them inline in those places or use a CTE. But if they are in widespread or heavy use, I would agree with a view or computed column.
You would also want them in a view or computed column if you want them available via ORM tools.
Am I using those "computed columns" individually in places, or am I using them in sets? If using them in sets, I probably want a view of the table that includes them all.
When I need them, do I usually want them associated with data from a particular other table? If so, that would suggest a view.
Am I basing updates to the original table on those computed values? If so, then I want computed columns, to avoid joining the view in those cases.
Calculated columns may seem an easy solution at first, but I have seen companies have trouble with them, because when they try to do ETL with real-time Change Data Capture (CDC) tools like Attunity, the tool will not recognize the calculated columns, since the values are not stored permanently. So there are some issues. Also, if the columns will be retrieved many, many times by users, you will save time in the long run by putting that logic in the ETL tool or a procedure and writing the result to the database once, instead of calculating it for each request.
Someone else pointed out to me that the following database design has serious issues; can anyone tell me why?
a tb_user table saves all the users' information
the tb_user table will have only 3-8 users
each user's data will be saved in a separate table, named after the user
Say a user is called bill_admin; then he has a separate table, i.e. bill_admin_data, to save all the data that belongs to him. All users' data shares the same structure.
The person who pointed out this problem said I should merge all the data into one table and use an FK to distinguish them, but I would make the following points:
there will only be 3-8 users, so there aren't going to be a lot of tables anyway
each user has a very large data table, say 500K records
Is it bad practice to design a database like this? And why? Thank you.
Because it isn't very maintainable.
1) Adding data to a database should never require modifying the structure. In your model, if you ever need to add another person you will need a new table (or two). You may not think you will ever need to do this, but trust me. You will.
So assume, for example, you want to add functionality to your application to add a new user to the database. With this structure you will have to give your end users rights to create new tables, which creates security problems.
2) It violates the DRY principle. That is, you are creating multiple copies of the same table structure that are identical. This makes maintenance a pain in the butt.
3) Querying across multiple users will be unnecessarily complicated. There is no good reason to split each user into a separate table other than having a vendetta against the person who has to write queries against this DB model.
4) If you are splitting it into multiple tables for performance because each user has a lot of rows, you are reinventing the wheel. The RDBMS you are using undoubtedly has an indexing feature which allows it to efficiently query large tables. Your home-grown hack is not going to outperform the platform's approach for handling large data.
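To make the merged design concrete, here is a minimal sketch (SQLite via Python for brevity; the table and column names are illustrative):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE tb_user (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );

    CREATE TABLE tb_user_data (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES tb_user(id),
        payload TEXT
    );

    -- The index does the work the per-user tables were meant to do:
    -- "all rows for bill_admin" becomes an index seek, not a separate table.
    CREATE INDEX idx_user_data_user ON tb_user_data(user_id);
    """)

    con.execute("INSERT INTO tb_user (name) VALUES ('bill_admin')")
    con.execute("""INSERT INTO tb_user_data (user_id, payload)
                   SELECT id, 'some row' FROM tb_user WHERE name = 'bill_admin'""")

    # Adding a new user is now an INSERT, not a CREATE TABLE:
    con.execute("INSERT INTO tb_user (name) VALUES ('jane_admin')")

    print(con.execute("""
        SELECT u.name, COUNT(d.id)
        FROM tb_user u LEFT JOIN tb_user_data d ON d.user_id = u.id
        GROUP BY u.name
    """).fetchall())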
I wouldn't say it's bad design per se. It is just not the type of design that relational databases are designed and optimized for.
Of course, you can store your data as you mention, but many operations won't be trivial. For example:
Adding a new person
Removing a person
Generating reports based on data across all your people
If you don't really care about doing this, go ahead and create your tables as you propose, although I would recommend using a non-relational database, such as MongoDB, which is better suited to this type of structure.
If you prefer a relational database, aggregating data by type rather than by person gives you lots of flexibility when adding new people and calculating reports.
500k lines is not "very large", so don't worry about size when making your design.
It is good to use a document-based database like MongoDB for this type of requirement.
I am currently working on a project in which I have to keep track of the tuples that are modified in a relational database. This should include updated tuples, but also inserted and deleted ones. My question is: what is the best way to accomplish this? I have several ideas of my own, but maybe there are easier/better ways that I did not think of, or there already exists a project that does exactly this.
The final goal of the project is that it will work for relational databases of different vendors, but the first implementation will use a MySQL database. Other database systems can be supported later. But it would be nice if the solution that works for MySQL can be easily adapted to another database.
My first idea was to parse log files. However, I am not certain whether these logfiles contain the actual modified tuples, and furthermore I can imagine that these logfiles will not always be available (e.g. on shared hosting).
My second idea was to intercept the queries at the application level. When an INSERT, DELETE, or UPDATE query is performed, it can be parsed, and the tuples it will affect can be determined beforehand. For an INSERT operation this is simply the inserted tuple; for a DELETE or UPDATE operation the tuples can be identified by applying the WHERE clause in a new SELECT statement, as in the sketch below.
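As a rough sketch of that second idea (made-up table and helper names; a real version would need to handle parameter binding and concurrency more carefully, and should run the snapshot and the write in one transaction, as here):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, city TEXT)")
    con.executemany("INSERT INTO customer VALUES (?, ?)",
                    [(1, "Oslo"), (2, "Bergen"), (3, "Oslo")])

    def tracked_update(table, set_clause, where_clause, params):
        with con:  # one transaction for the snapshot and the write
            # Re-use the UPDATE's WHERE clause to capture the affected tuples.
            affected = con.execute(
                f"SELECT * FROM {table} WHERE {where_clause}", params).fetchall()
            con.execute(
                f"UPDATE {table} SET {set_clause} WHERE {where_clause}", params)
        return affected

    before = tracked_update("customer", "city = 'Trondheim'", "city = ?", ("Oslo",))
    print(before)  # -> [(1, 'Oslo'), (3, 'Oslo')] : the tuples that were modified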
As a last remark I want to add that performance is not an important factor at this stage of development.
If more details are needed I am happy to provide them.
Use triggers to capture the INSERT, UPDATE, and DELETE and log your entries to a new table. You can use a timestamp on that table to note when the transactions occurred. In the future you can query that table for your modification information.
This will require some database-dependent features, but you can encapsulate them depending on your architecture. You could use database triggers, which I normally advise against except for this very thing: auditing. In each kind of trigger, you could simply write whatever info you need to a log table. Just one suggestion.
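As a rough sketch of what such audit triggers could look like (SQLite syntax via Python for brevity; MySQL triggers are created per operation with slightly different syntax, and the table names and audited column are assumptions):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);

    CREATE TABLE customer_audit (
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP,
        operation  TEXT,          -- 'INSERT', 'UPDATE' or 'DELETE'
        id         INTEGER,
        old_name   TEXT,
        new_name   TEXT
    );

    CREATE TRIGGER customer_ai AFTER INSERT ON customer
    BEGIN
        INSERT INTO customer_audit (operation, id, new_name)
        VALUES ('INSERT', NEW.id, NEW.name);
    END;

    CREATE TRIGGER customer_au AFTER UPDATE ON customer
    BEGIN
        INSERT INTO customer_audit (operation, id, old_name, new_name)
        VALUES ('UPDATE', NEW.id, OLD.name, NEW.name);
    END;

    CREATE TRIGGER customer_ad AFTER DELETE ON customer
    BEGIN
        INSERT INTO customer_audit (operation, id, old_name)
        VALUES ('DELETE', OLD.id, OLD.name);
    END;
    """)

    con.execute("INSERT INTO customer (name) VALUES ('Alice')")
    con.execute("UPDATE customer SET name = 'Alicia' WHERE name = 'Alice'")
    con.execute("DELETE FROM customer WHERE name = 'Alicia'")
    print(con.execute(
        "SELECT operation, id, old_name, new_name FROM customer_audit").fetchall())
    # -> [('INSERT', 1, None, 'Alice'), ('UPDATE', 1, 'Alice', 'Alicia'),
    #     ('DELETE', 1, 'Alicia', None)]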