Database: Sharing two sets of values in a single table?

I'll start by saying I am not a DBA and I didn't get to do heavy database development so far (so I hope I'm not asking something obvious).
The Challenge:
I have a dictionary application with pre-defined values.
New values may be added via online updates.
Users are not allowed to modify these application-values, but they may add/delete values of their own.
The database (sqlite3) will contain a small amount of values (~2K-3K).
The database schema is exactly the same for both user and application values.
Possible solutions:
One way to go about it would be to create two different tables having the same schema, and JOIN the data from both tables when querying the database.
A different approach would be to have a single table in which application-values will start at ID=0, and user-values will start at ID=100000 (for example). Online updates will merge new values below ID=100000 such that user values will remain intact.
I prefer the second solution - it'll avoid JOINs during runtime and the queries will remain simple. However, with the first solution an update to the application-values would only require me to replace the application table with the new one.
Please let me know what you think:
Which solution is better?
What are the pros/cons that I'm missing?
Is there an even-better third solution?

Why not just add a column 'type' to your table and fill it with user/application?
Personally, I hate meaningful IDs...
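A minimal sketch of that suggestion, using Python's built-in sqlite3 module since the question mentions sqlite3. The table name `entry`, the column name `source` (standing in for 'type') and the helper functions are made up for illustration; the real schema is not shown in the question.

```python
import sqlite3


def open_dictionary() -> sqlite3.Connection:
    """Create an in-memory database with one table for both kinds of values."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE entry (
            id     INTEGER PRIMARY KEY,  -- plain surrogate key, no meaning encoded
            word   TEXT NOT NULL,
            source TEXT NOT NULL CHECK (source IN ('application', 'user'))
        )
        """
    )
    return conn


def apply_online_update(conn, words):
    """Replace every application row; user rows are never touched."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM entry WHERE source = 'application'")
        conn.executemany(
            "INSERT INTO entry (word, source) VALUES (?, 'application')",
            [(w,) for w in words],
        )


def add_user_word(conn, word):
    with conn:
        conn.execute("INSERT INTO entry (word, source) VALUES (?, 'user')", (word,))


def delete_user_word(conn, word):
    """Only rows marked 'user' can be deleted through this helper."""
    with conn:
        conn.execute("DELETE FROM entry WHERE word = ? AND source = 'user'", (word,))


def all_words(conn):
    """No JOIN or UNION needed: both kinds of values live in one table."""
    return [row[0] for row in conn.execute("SELECT word FROM entry ORDER BY word")]
```

With this layout an online update stays as simple as "replace the application table" in the two-table approach, and lookups stay as simple as in the ID-range approach.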

Related

How do I replace multiple SQL tables without breaking unknown dependencies?

I'm not sure if this is possible, and I've found nothing short of "start over with the database" so far:
Here's what I currently have:
A set of 10 tables that, when converted from a "quote" into a "policy," essentially copy their data from one table to another. These tables have multiple entities (reports, software, other database items, dynamically-generated) dependent on them.
What I WANT to do:
Create a new table schema that is truly relational (Of these 20 tables, I could drop about 15), but in a way that the pre-existing tables "appear" to exist as they are to anything that depends on them.
In the second image, I want anything that is expecting the first image to get it, but behind that is actually the more standardized relational model.
I've considered:
Create the new tables, then set up triggers to move the data as needed. I'm not liking this solution as it seems error-prone at best.
Leave the tables as-is, and slowly move all dependencies into functions or stored procedures until I'm reasonably sure the majority of dependencies have been identified. Then switch to the new schema via these new procedures/functions/etc..
Build the new schema and point what is known to the new schema. Build views that mimic the original tables, and re-point queries there as they are discovered (messy).
We're already having massive performance problems under the current setup and I'm trying to pre-emptively strike and fix this scenario. The biggest concern I have is that these are transactional tables, storing (at present) about 500,000 records each (you can see why getting rid of 2/3 of them is so appealing). Plus the data and quote sets of tables have new records added in all of them whenever a new transaction occurs.
QUESTION: How do I accomplish this switch without making code changes elsewhere? How do I make everything else "think" that the old table schema still exists but use the new schema going forward?
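As a rough illustration of the third option listed above (views that mimic the original tables), here is a sketch in Python with sqlite3. Everything in it is hypothetical: the question does not name its DBMS or its tables, so `contract`, `quote_header`, `policy_header` and their columns are invented, and support for writing through views (here via an INSTEAD OF trigger) differs between database systems.

```python
import sqlite3

SCHEMA = """
-- New relational table: one row per contract, with its stage as a column.
CREATE TABLE contract (
    contract_id INTEGER PRIMARY KEY,
    customer    TEXT NOT NULL,
    premium     REAL NOT NULL,
    stage       TEXT NOT NULL CHECK (stage IN ('quote', 'policy'))
);

-- Views that carry the old table names and column lists.
CREATE VIEW quote_header AS
    SELECT contract_id AS quote_id, customer, premium
    FROM contract WHERE stage = 'quote';

CREATE VIEW policy_header AS
    SELECT contract_id AS policy_id, customer, premium
    FROM contract WHERE stage = 'policy';

-- Legacy code that still INSERTs into the old name is redirected.
CREATE TRIGGER quote_header_insert INSTEAD OF INSERT ON quote_header
BEGIN
    INSERT INTO contract (customer, premium, stage)
    VALUES (NEW.customer, NEW.premium, 'quote');
END;
"""


def build() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def convert_to_policy(conn, contract_id):
    """Converting a quote becomes a one-column update instead of a row copy."""
    with conn:
        conn.execute(
            "UPDATE contract SET stage = 'policy' "
            "WHERE contract_id = ? AND stage = 'quote'",
            (contract_id,),
        )
```

Reports that only read from the old names keep working unchanged; each kind of write (INSERT, UPDATE, DELETE) that legacy code performs would need its own trigger or equivalent.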

Understanding metadata in Postgres?

I'm currently writing some code for one of my classes involving distributed and parallel database processing. I'm doing horizontal fragmentation on some data and am required to keep track of different pieces of data.
The professor recommends storing "metadata" to keep track of some basic computations. Is this as simple as creating another table and storing some basic information, or is there a much more efficient way of doing this?
Example:
I need to track ranges for min/max values of every table in my database. Should I store that information in an entirely new table or is there a better way of achieving this?
Yes, you should store min/max in a different table. Depending on your application, you might need more than one of those kinds of tables.
Each insert, update, or delete statement can change either or both of those values. Think about how you want to handle that. (Triggers, probably.)
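A small sketch of the trigger idea, written with Python's sqlite3 module so it can be run without a server. The question is about Postgres, where trigger syntax is different (a trigger function plus CREATE TRIGGER), so treat this only as an outline. The tables `reading` and `table_range` are invented for the example.

```python
import sqlite3


def build() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE reading (id INTEGER PRIMARY KEY, value INTEGER NOT NULL);

        -- Summary table: one row per tracked table.
        CREATE TABLE table_range (
            table_name TEXT PRIMARY KEY,
            min_value  INTEGER,
            max_value  INTEGER
        );
        INSERT INTO table_range (table_name) VALUES ('reading');
        """
    )
    # Any of the three statements can change min or max, so each gets a trigger.
    # Recomputing from scratch is the simplest correct choice; it is not the fastest.
    for event in ("INSERT", "UPDATE", "DELETE"):
        conn.executescript(
            f"""
            CREATE TRIGGER reading_after_{event.lower()} AFTER {event} ON reading
            BEGIN
                UPDATE table_range
                   SET min_value = (SELECT MIN(value) FROM reading),
                       max_value = (SELECT MAX(value) FROM reading)
                 WHERE table_name = 'reading';
            END;
            """
        )
    return conn


def current_range(conn):
    """Read the stored (min, max) pair without scanning the data table."""
    return conn.execute(
        "SELECT min_value, max_value FROM table_range WHERE table_name = 'reading'"
    ).fetchone()
```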
Terminology
Metadata just means "data about other data", and min/max values for one or more columns in each table is arguably data about other data. But I've never seen such data called metadata. It's always either summary or aggregate data.
I think you'll find that when most DBAs and database developers use metadata, they're talking about system tables or the information_schema views that are built on top of system tables.

How do I handle a database with over 100 tables?

I have a database with over 100 tables; 30 of them are lookup tables with lookup Language tables. Each table links back to one or three tables, but there are around 20 different web forms that need to interlink for a registered user.
My question is: do I create one connection string with one Model, or do I break them up into individual models?
I've tried breaking them up into individual models based on the page they are required for, but this just throws up validation and reference errors looking for the same field.
I don't have any errors to show at the moment, but I can provide them if necessary.
Sounds like you need to create some views so that you can consolidate the queries coming from the database. Try to think of logical groupings of the tables (lookup and otherwise) that you have and create a view for each logical grouping. Then, have your application query against those views to retrieve data.
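To make the "view per logical grouping" idea concrete, here is a sketch using Python's sqlite3 module. The question does not show its schema, so the tables `country`, `country_language`, `registered_user` and the view `user_profile` are assumptions chosen to resemble a lookup table with a language table.

```python
import sqlite3

SCHEMA = """
CREATE TABLE country (
    country_id INTEGER PRIMARY KEY,
    iso_code   TEXT NOT NULL
);

-- Language table for the lookup: one label per country and language.
CREATE TABLE country_language (
    country_id INTEGER NOT NULL REFERENCES country(country_id),
    lang       TEXT NOT NULL,
    label      TEXT NOT NULL,
    PRIMARY KEY (country_id, lang)
);

CREATE TABLE registered_user (
    user_id    INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    country_id INTEGER NOT NULL REFERENCES country(country_id)
);

-- One view per logical grouping: the form code queries this, not the three tables.
CREATE VIEW user_profile AS
    SELECT u.user_id, u.name, c.iso_code, l.lang, l.label AS country_label
    FROM registered_user u
    JOIN country c          ON c.country_id = u.country_id
    JOIN country_language l ON l.country_id = c.country_id;
"""


def build() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def profile(conn, user_id, lang):
    """Fetch one user's profile with the lookup label in the requested language."""
    return conn.execute(
        "SELECT user_id, name, iso_code, country_label "
        "FROM user_profile WHERE user_id = ? AND lang = ?",
        (user_id, lang),
    ).fetchone()
```

A model built on such a view sees one flat shape per form instead of several overlapping table models.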
As for connection strings, I don't see why you would need more than one if all of the tables are in the same database.
If you have the possibility to create only one connection string, that is what you should do.
When you create a second connection string, it's because you have no choice. Having many different connection strings is just going to add to the confusion you might already be in.
The number of tables in a database should never determine how many connection strings you have. I would even say that having access to all the information in your database through one single object is an advantage. Now, the way you organise that impressive amount of information is crucial, and there are many ways to do it. You need to find the one that suits you.

Is this a "correct" database design?

I'm working with the new version of a third party application. In this version, the database structure is changed, they say "to improve performance".
The old version of the DB had a general structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES
(
ENTITY_ID,
PROPERTY_KEY,
PROPERTY_VALUE
)
so we had a main table with fields for the basic properties and a separate table to manage custom properties added by user.
The new version of the DB instead has a structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES_n
(
ENTITY_ID_n,
CUSTOM_PROPERTY_1,
CUSTOM_PROPERTY_2,
CUSTOM_PROPERTY_3,
...
)
So now, when the user adds a custom property, a new column is added to the current ENTITY_PROPERTIES_n table until the maximum number of columns (managed by the application) is reached; then a new table is created.
So, my question is: Is this a correct way to design a DB structure? Is this the only way to "increase performance"? The old structure required many joins or sub-selects, but this structure doesn't seem very smart (or even correct) to me...
I have seen this done before on the assumed (often unproven) "expense" of joining - it is basically turning a row-heavy data table into a column-heavy table. They ran into their own limitation, as you imply, by creating new tables when they run out of columns.
I completely disagree with it.
Personally, I would stick with the old structure and re-evaluate the performance issues. That isn't to say the old way is the correct way, it is just marginally better than the "improvement" in my opinion, and removes the need to do large scale re-engineering of database tables and DAL code.
These tables strike me as largely static... caching would be an even better performance improvement without mutilating the database and one I would look at doing first. Do the "expensive" fetch once and stick it in memory somewhere, then forget about your troubles (note, I am making light of the need to manage the Cache, but static data is one of the easiest to manage).
Or, wait for the day you run into the maximum number of tables per database :-)
Others have suggested completely different stores. This is a perfectly viable possibility and if I didn't have an existing database structure I would be considering it too. That said, I see no reason why this structure can't fit into an RDBMS. I have seen it done on almost all large scale apps I have worked on. Interestingly enough, they all went down a similar route and all were mostly "successful" implementations.
No, it's not. It's terrible.
"until the max number of columns (handled by application) is reached, then a new table is created."
This sentence says it all. Under no circumstance should an application dynamically create tables. The "old" approach isn't ideal either, but since you have the requirement to let users add custom properties, it has to be like this.
Consider this:
You lose all type-safety as you have to store all values in the column "PROPERTY_VALUE"
Depending on your users, you could have them change the schema beforehand and then let them run some kind of database update batch job, so at least all the properties would be declared in the right datatype. Also, you could lose the entity_id/key thing.
Check out this: http://en.wikipedia.org/wiki/Inner-platform_effect. This certainly reeks of it.
Maybe an RDBMS isn't the right thing for your app. Consider using a key/value or document store like MongoDB or another NoSQL database. (http://nosql-database.org/)
From what I know of databases (but I'm certainly not the most experienced), it seems quite a bad idea to do that in your database. If you already know how many max custom properties a user might have, I'd say you'd better set the table number of columns to that value.
Then again, I'm not an expert, but making new columns on the fly isn't the kind of operations databases like. It's gonna bring you more trouble than anything.
If I were you, I'd either fix the number of custom properties, or stick with the old system.
I believe creating a new table for each entity to store properties is a bad design as you could end up bulking the database with tables. The only pro to applying the second method would be that you are not traversing through all of the redundant rows that do not apply to the Entity selected. However using indexes on your database on the original ENTITY_PROPERTIES table could help greatly with performance.
I would personally stick with your initial design, apply indexes and let the database engine determine the best methods for selecting the data rather than separating each entity property into a new table.
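A sketch of what "apply indexes" could look like on the original ENTITY_PROPERTIES table, using Python's sqlite3 module. The vendor's DBMS and column types are not given in the question, so the types and the index name `idx_entity_properties_entity_key` are assumptions.

```python
import sqlite3

QUERY = (
    "SELECT PROPERTY_KEY, PROPERTY_VALUE FROM ENTITY_PROPERTIES "
    "WHERE ENTITY_ID = ? ORDER BY PROPERTY_KEY"
)


def build() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ENTITY_PROPERTIES (
            ENTITY_ID      INTEGER NOT NULL,
            PROPERTY_KEY   TEXT NOT NULL,
            PROPERTY_VALUE TEXT
        );
        -- Composite index: finds all properties of one entity without a full scan.
        CREATE INDEX idx_entity_properties_entity_key
            ON ENTITY_PROPERTIES (ENTITY_ID, PROPERTY_KEY);
        """
    )
    return conn


def properties_of(conn, entity_id):
    """Return the custom properties of one entity as a dict."""
    return dict(conn.execute(QUERY, (entity_id,)).fetchall())


def query_plan(conn, entity_id):
    """Return the planner's description of how it will run QUERY."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (entity_id,))
    return [row[-1] for row in rows]
```

Inspecting `query_plan(conn, 1)` is a quick way to check whether the engine actually picks the index; other database systems offer their own EXPLAIN variants.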
There is no "correct" way to design a database - I'm not aware of a universally recognized set of standards other than the famous "normal form" theory; many database designs ignore this standard for performance reasons.
There are ways of evaluating database designs though - performance, maintainability, intelligibility, etc. Quite often, you have to trade these against each other; that's what your change seems to be doing - trading maintainability and intelligibility against performance.
So, the best way to find out if that was a good trade off is to see if the performance gains have materialized. The best way to find that out is to create the proposed schema, load it with a representative dataset, and write queries you will need to run in production.
I'm guessing that the new design will not be perceivably faster for queries like "find STANDARD_PROPERTY_1 from entity where STANDARD_PROPERTY_1 = 'banana'".
I'm guessing it will not be perceivably faster when retrieving all properties for a given entity; in fact it might be slightly slower, because instead of a single join to ENTITY_PROPERTIES, the new design requires joins to several tables. You will be returning "sparse" results - presumably, not all entities will have values in the property_n columns in all ENTITY_PROPERTIES_n tables.
Where the new design may be significantly faster is when you need a compound where clause on custom properties. For instance, finding an entity where custom property 1 is true, custom property 2 is banana, and custom property 3 is not in ('kylie', 'pussycat dolls', 'giraffe') is (probably) faster when you can specify columns in the ENTITY_PROPERTIES_n tables instead of rows in the ENTITY_PROPERTIES table. Probably.
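To show what that compound where clause looks like in each design, here is a sketch with Python's sqlite3 module. The property names ('organic', 'fruit', 'mascot'), their mapping to CUSTOM_PROPERTY_1..3, the use of a plain ENTITY_ID column in ENTITY_PROPERTIES_1, and the sample rows are all invented for the example. It compares the shape of the queries only; it says nothing about which one is faster on real data.

```python
import sqlite3

# Old design: every condition is a lookup over rows of ENTITY_PROPERTIES.
EAV_QUERY = """
SELECT e.ENTITY_ID
FROM ENTITY e
WHERE EXISTS (SELECT 1 FROM ENTITY_PROPERTIES p
              WHERE p.ENTITY_ID = e.ENTITY_ID
                AND p.PROPERTY_KEY = 'organic' AND p.PROPERTY_VALUE = 'true')
  AND EXISTS (SELECT 1 FROM ENTITY_PROPERTIES p
              WHERE p.ENTITY_ID = e.ENTITY_ID
                AND p.PROPERTY_KEY = 'fruit' AND p.PROPERTY_VALUE = 'banana')
  AND NOT EXISTS (SELECT 1 FROM ENTITY_PROPERTIES p
                  WHERE p.ENTITY_ID = e.ENTITY_ID
                    AND p.PROPERTY_KEY = 'mascot'
                    AND p.PROPERTY_VALUE IN ('kylie', 'pussycat dolls', 'giraffe'))
ORDER BY e.ENTITY_ID
"""

# New design: the same conditions are plain column predicates on one table.
WIDE_QUERY = """
SELECT ENTITY_ID
FROM ENTITY_PROPERTIES_1
WHERE CUSTOM_PROPERTY_1 = 'true'
  AND CUSTOM_PROPERTY_2 = 'banana'
  AND CUSTOM_PROPERTY_3 NOT IN ('kylie', 'pussycat dolls', 'giraffe')
ORDER BY ENTITY_ID
"""

ROWS = [  # (entity id, organic, fruit, mascot)
    (1, "true", "banana", "zebra"),
    (2, "true", "banana", "giraffe"),
    (3, "false", "banana", "zebra"),
    (4, "true", "apple", "zebra"),
]


def build() -> sqlite3.Connection:
    """Load the same sample data into both layouts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ENTITY (ENTITY_ID INTEGER PRIMARY KEY);
        CREATE TABLE ENTITY_PROPERTIES (
            ENTITY_ID INTEGER, PROPERTY_KEY TEXT, PROPERTY_VALUE TEXT
        );
        CREATE TABLE ENTITY_PROPERTIES_1 (
            ENTITY_ID INTEGER PRIMARY KEY,
            CUSTOM_PROPERTY_1 TEXT, CUSTOM_PROPERTY_2 TEXT, CUSTOM_PROPERTY_3 TEXT
        );
        """
    )
    for entity_id, organic, fruit, mascot in ROWS:
        conn.execute("INSERT INTO ENTITY VALUES (?)", (entity_id,))
        conn.executemany(
            "INSERT INTO ENTITY_PROPERTIES VALUES (?, ?, ?)",
            [(entity_id, "organic", organic),
             (entity_id, "fruit", fruit),
             (entity_id, "mascot", mascot)],
        )
        conn.execute(
            "INSERT INTO ENTITY_PROPERTIES_1 VALUES (?, ?, ?, ?)",
            (entity_id, organic, fruit, mascot),
        )
    return conn


def matching_ids(conn, query):
    return [row[0] for row in conn.execute(query)]
```

One subtlety: the two queries agree only when every entity has a value for the third property. An entity with no 'mascot' row passes the NOT EXISTS test, while a NULL in CUSTOM_PROPERTY_3 fails the NOT IN test.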
As for maintainability - yuck. Your database access code now needs to be far smarter, knowing which table holds which property, and how many columns are too many. The likelihood of entertaining bugs is high - there are more moving parts, and I can't think of any obvious unit tests to make sure that the database access logic is working.
Intelligibility is another concern - this solution is not in most developers' toolbox, it's not an industry-standard pattern. The old solution is pretty widely known - commonly referred to as "entity-attribute-value". This becomes a major issue on long-lived projects where you can't guarantee that the original development team will hang around.

What will be the best way to keep track of modified tuples in a database?

I am currently working on a project in which I have to keep track of the tuples that are modified in a relational database. This should include updated tuples, but also inserted and deleted tuples. My question is what will be the best way to accomplish this? I have several ideas of my own, but maybe there are easier/better ways that I did not think of, or there already exists a project that exactly does this.
The final goal of the project is that it will work for relational databases of different vendors, but the first implementation will use a MySQL database. Other database systems can be supported later. But it would be nice if the solution that works for MySQL can be easily adapted to another database.
My first idea was to parse log files. However, I am not certain whether these logfiles contain the actual modified tuples, and furthermore I can imagine that these logfiles will not always be available (e.g. on shared hosting).
My second idea was to intercept the queries at the application level. When an INSERT, DELETE or UPDATE query is performed, these queries can be parsed, and the tuples that they will affect can be determined beforehand. For an INSERT operation this simply is the inserted tuple, and for a DELETE or UPDATE operation the tuples can be identified by applying the WHERE clause in a new SELECT statement.
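A very small sketch of this second idea for the DELETE case, using Python's sqlite3 module instead of MySQL so that it runs standalone. It skips the parsing step entirely: the hypothetical helper `tracked_delete` receives the table name and WHERE clause separately, and both are interpolated into the SQL text, so they must come from trusted code only.

```python
import sqlite3


def tracked_delete(conn, table, where, params=()):
    """Return the rows a DELETE is about to remove, then run the DELETE.

    The same WHERE clause and parameters are applied first in a SELECT,
    as described above. `table` and `where` are trusted strings.
    """
    with conn:  # commit the DELETE on success, roll back on error
        affected = conn.execute(
            f"SELECT * FROM {table} WHERE {where}", params
        ).fetchall()
        conn.execute(f"DELETE FROM {table} WHERE {where}", params)
    return affected
```

On a multi-user server the SELECT and the DELETE would have to run in one transaction with suitable locking; otherwise another session could change the matching rows between the two statements.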
As a last remark I want to add that performance is not an important factor at this stage of development.
If more details are needed I am happy to provide them.
Use triggers to capture the INSERT, UPDATE, and DELETE and log your entries to a new table. You can use a timestamp on that table to note when the transactions occurred. In the future you can query that table for your modification information.
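A minimal sketch of such audit triggers, using Python's sqlite3 module so it runs without a server. MySQL's trigger syntax differs in detail (for example, each trigger is declared FOR EACH ROW), and the tables `person` and `change_log` with their columns are invented for the example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- Log table: one row per modified tuple, with a timestamp.
CREATE TABLE change_log (
    log_id     INTEGER PRIMARY KEY,
    table_name TEXT NOT NULL,
    operation  TEXT NOT NULL,
    row_id     INTEGER NOT NULL,
    old_name   TEXT,
    new_name   TEXT,
    changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER person_insert AFTER INSERT ON person
BEGIN
    INSERT INTO change_log (table_name, operation, row_id, new_name)
    VALUES ('person', 'INSERT', NEW.id, NEW.name);
END;

CREATE TRIGGER person_update AFTER UPDATE ON person
BEGIN
    INSERT INTO change_log (table_name, operation, row_id, old_name, new_name)
    VALUES ('person', 'UPDATE', NEW.id, OLD.name, NEW.name);
END;

CREATE TRIGGER person_delete AFTER DELETE ON person
BEGIN
    INSERT INTO change_log (table_name, operation, row_id, old_name)
    VALUES ('person', 'DELETE', OLD.id, OLD.name);
END;
"""


def build() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def changes(conn):
    """Return the logged modifications in the order they happened."""
    return conn.execute(
        "SELECT operation, row_id, old_name, new_name FROM change_log ORDER BY log_id"
    ).fetchall()
```

Because the triggers live in the database, changes are logged no matter which application or tool issued the statement.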
This will require some database-dependent features, which you can encapsulate depending on your architecture: you could use database triggers. I normally advise against triggers, except for this very thing, auditing. In each kind of trigger, you could simply write whatever info you need to a log table. Just one suggestion.
