What would be the best way to tie a database object to a source code implementation? Basically so that I could have a table of "ingredients" that could be referred to by objects from another table containing a "recipe", while still being able to index and search efficiently by their metadata. Also taking into account that some "ingredients" might inherit from other "ingredients".
Maybe I'm looking at this in a totally wrong way, would appreciate any light on the subject.
If I've correctly understood your goal, you have two main choices:
Use an OR/M and don't try to implement the data mapping yourself from scratch.
Switch to NoSQL storage. Analyze your data model and see whether it is not strongly relational and can be expressed in a document store like MongoDB. MongoDB, for example, already supports secondary indexes, which covers your metadata-search requirement.
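If the document route fits, here is a hedged sketch using the mongodb/mongodb PHP library (database, collection, and field names are invented for illustration): ingredients live as documents with indexed metadata, "inheritance" becomes a parent pointer, and a recipe references ingredients by id.
// Hedged sketch using the mongodb/mongodb library; all names are illustrative.
require 'vendor/autoload.php';

$client = new MongoDB\Client('mongodb://localhost:27017');
$ingredients = $client->cookbook->ingredients;
$recipes = $client->cookbook->recipes;

// Index the metadata you want to search by.
$ingredients->createIndex(array('name' => 1));
$ingredients->createIndex(array('tags' => 1));

// An ingredient; "inheritance" is modelled as a reference to a parent ingredient.
$butter = $ingredients->insertOne(array(
    'name' => 'butter',
    'tags' => array('dairy', 'fat'),
));
$ingredients->insertOne(array(
    'name'   => 'clarified butter',
    'tags'   => array('dairy', 'fat'),
    'parent' => $butter->getInsertedId(), // inherits from butter
));

// A recipe refers to ingredient documents by their ids.
$recipes->insertOne(array(
    'name'        => 'hollandaise',
    'ingredients' => array($butter->getInsertedId()),
));

// Efficient metadata search via the index.
foreach ($ingredients->find(array('tags' => 'dairy')) as $doc) {
    echo $doc['name'], "\n";
}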
I am currently working on creating a database for a community partnership program for educational purposes. The structure of the DB should be simple, but as stated, the data tends to overlap in various ways. There are four main categories (Internships, Jobs, Summer/Yearly Programs, and Other), followed by an address book/contacts list.
This is the part where the data is difficult to structure. Each employer relates to an "employment posting", and each posting relates to one of the school's six academic departments, but some employers require more than one. That data is then followed by: how many openings, posting date, follow-up contact date, whether a student was hired (and if so, a student evaluation), and notes.
I'm not asking how to create the DB, but how I would organize and structure such a complex data collection. I have managed DBs (entering information) and I know how to build one from scratch as needed, but now I have been tasked with structuring something like this.
Here is an image of the information that needs to be collected (more or less).
If you are stuck at the "How do I get started?" stage, I suggest that you start at a very high level (the conceptual data model), then refine only a bit to the logical data model, then to physical data model. Here is a short explanation of the 3 different kinds of data model. (Don't worry that it appears to be about data warehouses - these bits aren't specific to data warehouses.)
For a bit more detail, there is another article on data modeling - again, don't worry that it appears in the context of Agile - this is generally useful stuff even if you're not using Agile.
Two other things that might help are these questions (in this order):
What questions do I need the database to answer?
What information does it need to provide a home for? (Why? If it's not covered by part of the answer to the first question, challenge why it's needed.)
TL;DR
I have an architecture issue which boils down to filtering entities by a predefined set of common filters. The input is a set of products, each of which has details. I need to design a filtering engine so that I can (easily and quickly) resolve this task:
"Filter out collection of products with specified details"
Requirements
The user may specify any filtering that is possible, with support for precedence and nested filters. A bare example: (weight=X AND (color='red' OR color='green')) OR price<1000. The requests come in via HTTP/REST, but that is insignificant (it only adds the issue of translating filters from a URI into some internal model). All comparison operators should be supported (equality, inequality, less than, etc.).
Specifics
Model
There is no fixed model definition - in fact I am free to choose one. To keep it simple I am using plain key=>value pairs for details. So at the very minimum it comes down to:
class Value extends Entity implements Arrayable
{
protected $key;
protected $value;
//getters/setters for key/value here
}
for a simple product-detail value, and something like
class Product extends Entity implements Arrayable
{
protected $id;
/**
* @var Value[]
*/
protected $details;
//getters/setters, more properties that are omitted
}
for the product. Now, regarding the data model, there is a first question: how to design the filtering model? My simple idea is to implement it as, let's say, a recursive iterator over a regular tree structure built from the incoming user request. The difficulties I certainly need to solve here are:
Quickly building the model structure from the user request
Allowing easy modification of the structure
Easily translating the chosen filter data model to the chosen storage (see below)
The last point in the list above is probably the most important, as the storage routines will be the most time-consuming, and therefore the filter data model should fit the storage's structure. That means storage always has the higher priority: if the data model cannot fit a storage design that resolves the issue, then the data model should be changed.
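A hedged sketch of one possible filter model (all class names here are invented): a composite tree with condition leaves and boolean group nodes, which can be built recursively from whatever nested structure the URI parser produces and then rendered for the chosen storage.
// Hedged sketch of a composite filter tree; names are illustrative.
interface FilterNode
{
    // Renders the node to an SQL-ish fragment, appending bound values to $params.
    public function toSql(array &$params);
}

class Condition implements FilterNode
{
    protected $key;
    protected $op;    // one of '=', '!=', '<', '<=', '>', '>='
    protected $value;

    public function __construct($key, $op, $value)
    {
        $this->key = $key;
        $this->op = $op;
        $this->value = $value;
    }

    public function toSql(array &$params)
    {
        $params[] = $this->value;
        // Key quoting/escaping depends on the storage chosen.
        return sprintf('%s %s ?', $this->key, $this->op);
    }
}

class BoolGroup implements FilterNode
{
    protected $glue;     // 'AND' or 'OR'
    protected $children; // FilterNode[]

    public function __construct($glue, array $children)
    {
        $this->glue = $glue;
        $this->children = $children;
    }

    public function toSql(array &$params)
    {
        $parts = array();
        foreach ($this->children as $child) {
            $parts[] = $child->toSql($params);
        }
        return '(' . implode(' ' . $this->glue . ' ', $parts) . ')';
    }
}

// (weight=10 AND (color='red' OR color='green')) OR price<1000
$filter = new BoolGroup('OR', array(
    new BoolGroup('AND', array(
        new Condition('weight', '=', 10),
        new BoolGroup('OR', array(
            new Condition('color', '=', 'red'),
            new Condition('color', '=', 'green'),
        )),
    )),
    new Condition('price', '<', 1000),
));

$params = array();
$where = $filter->toSql($params); // hand off to the storage layer
Because the tree is regular, modifying it (adding a node, rewriting a subtree) and translating it to a different storage backend are both local operations on nodes.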
Storage
As storage I want to use a NoSQL+RDBMS hybrid, e.g. PostgreSQL 9.4, which allows using JSON for storing the details. I do not want to use EAV in any case, which is why a pure relational DBMS isn't an option (see here why). There is one important thing: products may contain stocks, which leaves me with basically two ways:
Design products as a single entity together with their stocks (pretty logical). Then I cannot take the "storage" + "indexer" approach, because the indexer (such as Solr) needs to update and reindex data, which produces stale state.
Design with separate entities. That means separating whatever can be cached from whatever cannot. The first part can then go to the indexer (and the details can probably go there too, so we can filter by them) and the non-cacheable part goes somewhere else.
And the question for the storage part would be, of course: which one to choose?
The good thing about the first approach is that the internal API is simple and the internal structures are simple and scalable, because they can easily be abstracted from the storage layer. The bad thing is that I then need a "magic solution" which allows using "just storage" instead of "storage + indexer". "Magic" here means somehow designing indexes or additional data structures in the storage (I was thinking about hashing, but it does not help with range queries) that can resolve the filtering requests.
On the other hand, the second solution lets the search engine resolve the filtering task by itself, but it produces a gap during which the data there is outdated. And of course the data layer then needs to be implemented so that it somehow knows which part of the model goes to which storage (stocks to one store, details to another, etc.).
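To make the "just storage" option more concrete, here is a hedged sketch (schema and key names are invented) of how PostgreSQL 9.4 jsonb can serve equality filters through a GIN index, with a per-key expression index covering range filters:
// Hedged sketch using PDO against PostgreSQL 9.4; all names are illustrative.
$pdo = new PDO('pgsql:host=localhost;dbname=shop', 'user', 'secret');

$pdo->exec("CREATE TABLE product (id serial PRIMARY KEY, details jsonb NOT NULL)");

// GIN + jsonb_path_ops accelerates containment (@>), i.e. equality filters on details.
$pdo->exec("CREATE INDEX product_details_gin ON product USING GIN (details jsonb_path_ops)");

// Range filters need a plain expression index per detail key.
$pdo->exec("CREATE INDEX product_price_idx ON product (((details->>'price')::numeric))");

$insert = $pdo->prepare("INSERT INTO product (details) VALUES (CAST(? AS jsonb))");
$insert->execute(array(json_encode(array('color' => 'red', 'weight' => 10, 'price' => 500))));

// color='red' OR price<1000, mixing a containment predicate and a range predicate.
$stmt = $pdo->prepare(
    "SELECT id, details FROM product
     WHERE details @> CAST(? AS jsonb) OR (details->>'price')::numeric < ?"
);
$stmt->execute(array(json_encode(array('color' => 'red')), 1000));
The limitation is visible in the sketch: containment handles any equality combination with one index, but each detail key that needs range filtering needs its own expression index.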
Summary
What would be a proper data model for the filtering?
Which approach should be used to resolve the issue at the storage level: storage + indexer with a split products model, or storage only with a monolithic products model? Or maybe something else?
If going with the storage-only approach, is it possible to design the storage so that products can be filtered easily by any set of details?
If going with the indexer, which one fits this issue best? (There is a good comparison between Solr and Sphinx here, but it was made in '09 and it's '15 now, so it is surely outdated.)
Any links, related blogposts or articles are very welcome.
As a P.S.: I searched across SO but have found only barely relevant suggestions/topics so far (for example this). I am not expecting a silver bullet here, as it always boils down to some trade-off, but the question looks very standard, so there should be good insights already. Please guide me: I tried to "ask Google" with some luck, but that was not enough yet.
P.P.S. Feel free to edit the tags or redirect the question to the proper SE site if SO is not a good place for this kind of question. And I am not asking for a language-specific solution, so if you are not using PHP it does not matter; the design has nothing to do with the language.
My preferred solution would be to split the entities - your second approach. The stable data would be held in Cassandra (or Solr or Elastic etc), while the volatile stock data would be held in (ideally) an in-memory database like Redis or Memcache that supports compare-and-swap / transactions (or Dynamo or Voldemort etc if the stock data won't fit in memory). You won't need to worry too much about the consistency of the stable data since presumably it changes rarely if ever, so you can choose a scalable but not entirely consistent database like Cassandra; meanwhile you can choose a less scalable but more consistent database for the volatile stock data.
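For the volatile stock side, here is a hedged sketch with the phpredis extension (the key name is invented) of an optimistic compare-and-swap decrement; exec() returns false when the watched key changed under us, so the reservation is retried:
// Hedged sketch using the phpredis extension; key naming is illustrative.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$key = 'stock:product:42';

do {
    $redis->watch($key);                 // begin optimistic lock on the stock key
    $stock = (int) $redis->get($key);

    if ($stock <= 0) {
        $redis->unwatch();               // nothing left to reserve
        $reserved = false;
        break;
    }

    // exec() returns false if $key changed since watch(), in which case we retry.
    $result = $redis->multi()
                    ->decr($key)
                    ->exec();
    $reserved = ($result !== false);
} while (!$reserved);
In production you would cap the retry loop, but the shape is the point: the consistent store handles the contended counter while the scalable store serves the stable, filterable details.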
I'm trying to wrap my mind around the statuses that vCloud returns in its SDK, but the documentation on them seems very light. For a few of them I don't understand what they're about, and in practice I'm only seeing POWERED_ON, POWERED_OFF, and SUSPENDED. The only documentation on the statuses that I can find is here:
http://www.vmware.com/support/vcd/doc/rest-api-doc-1.5-html/operations/GET-VApp.html
What confuses me are things like "what is an 'entity'? And what does it mean when it's 'resolved'?" When I go to provision a VM and monitor its state, it starts at POWERED_OFF and goes to POWERED_ON, when I would expect to see some intermediary statuses while it's in the process of provisioning. Does anyone know where I can go to find out more about this?
This page from the vCD 5.1 documentation shows the possible values of the status field for various entities. The current doc uses numerical values but the API also has a few spots where string values are returned instead. The reference you found from the 1.5 API includes some of them; I think as part of the 5.1 doc update the string values were dropped from the schema reference.
An entity in the vCloud API is very similar to the likewise-named notion in database modeling. Wikipedia provides a fair definition of the term from entity-relationship modeling:
An entity may be defined as a thing which is recognized as being capable of an independent existence and which can be uniquely identified.
The RESOLVED state (numerical value 1) means that most parts of the entity are present, but it isn't fully constructed yet. You typically see it when uploading an OVF: all of the bits have been transferred to vCD, but work is still happening in the background before it becomes usable.
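For quick reference, here is a hedged PHP lookup table of the numeric status codes as they are commonly listed in the vCD schema reference; treat these as an assumption and verify them against the documentation for your exact API version, since the set varies between releases.
// Commonly documented vCloud entity status codes (verify against your API version).
$vcloudStatus = array(
    -1 => 'FAILED_CREATION',
     0 => 'UNRESOLVED',
     1 => 'RESOLVED',
     2 => 'DEPLOYED',
     3 => 'SUSPENDED',
     4 => 'POWERED_ON',
     5 => 'WAITING_FOR_INPUT',
     6 => 'UNKNOWN',
     7 => 'UNRECOGNIZED',
     8 => 'POWERED_OFF',
     9 => 'INCONSISTENT_STATE',
    10 => 'MIXED', // e.g. a vApp whose VMs are in differing power states
);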
Let's say you are a GM DBA and you have to design around the GM models.
Is it better to do this?
table_model
type {cadillac, saturn, chevrolet}
Or this?
table_cadillac_model
table_saturn_model
table_chevrolet_model
Let's say that the business lines have the same columns for a model and that there are over a million records for each subtype.
EDIT:
there is a lot of CRUD
there are a lot of very processor intensive reports
in either schema, there is a model_detail table that contains 3-5 records for each model and the details for each model differ (you can't add a cadillac detail to a saturn model)
the dev team doesn't have any issues with db complexity
I'm not really sure that this is a normalization question. Even though the structures are the same, they might be thought of as different entities.
EDIT:
Reasons for partitioning the structure into multiple tables
- business lines may have different business rules regarding parts
- addModelDetail() could be different for each business line (even though the data format is the same)
- high add/update activity - better performance with partitioned structure instead of single structure (I'm guessing and not sure here)?
I think this is a variation of the EAV problem. When posed as an EAV design, the single-table structure generally gets voted down as a bad idea. When posed in this manner, the single-table structure generally gets voted up as a good idea. Interesting...
I think the most interesting answer is having two different structures - one for CRUD and one for reporting. I think I'll try a concatenated/flattened view for reporting and multiple tables for CRUD, and see how that works.
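A hedged sketch of what such a flattened reporting view could look like, assuming per-brand tables as in the second option (the DBMS, table, and column names are invented for illustration):
// Hedged sketch: a UNION ALL view flattens the per-brand CRUD tables for reporting.
$pdo = new PDO('pgsql:host=localhost;dbname=gm', 'user', 'secret');
$pdo->exec("
    CREATE VIEW all_models AS
        SELECT 'cadillac'  AS brand, id, model_name FROM table_cadillac_model
        UNION ALL
        SELECT 'saturn'    AS brand, id, model_name FROM table_saturn_model
        UNION ALL
        SELECT 'chevrolet' AS brand, id, model_name FROM table_chevrolet_model
");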
Definitely the former example. Do you want to be adding tables to your database whenever you add a new model to your product range?
On data with a lot of writes (e.g. an OLTP application), it is better to have more, narrower tables (e.g. tables with fewer fields). There will be less lock contention because you're only writing small amounts of data into different tables.
So, based on the criteria you have described, the table structure I would have is:
Vehicle
VehicleType
Other common fields
CadillacVehicle
Fields specific to a Caddy
SaturnVehicle
Fields specific to a Saturn
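A hedged DDL sketch of this supertype/subtype layout, assuming PostgreSQL for concreteness (the column names beyond the type discriminator are invented):
// Hedged sketch of the supertype/subtype layout; names are illustrative.
$pdo = new PDO('pgsql:host=localhost;dbname=gm', 'user', 'secret');
$pdo->exec("
    CREATE TABLE vehicle (
        id           integer PRIMARY KEY,
        vehicle_type varchar(20) NOT NULL,  -- 'cadillac', 'saturn', ...
        model_name   varchar(50) NOT NULL
        -- other common fields
    )
");
$pdo->exec("
    CREATE TABLE cadillac_vehicle (
        vehicle_id integer PRIMARY KEY REFERENCES vehicle(id)
        -- fields specific to a Caddy
    )
");
$pdo->exec("
    CREATE TABLE saturn_vehicle (
        vehicle_id integer PRIMARY KEY REFERENCES vehicle(id)
        -- fields specific to a Saturn
    )
");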
For reporting, I'd have an entirely different database on an entirely different server that does not have the normalized structure (e.g. just has CadillacVehicle and SaturnVehicle tables with all of the fields from the Vehicle table duplicated into them).
With proper indexes, even the OLTP database could be performant for your SELECTs, regardless of the fact that there are tens of millions of rows. However, since you mentioned that there are processor-intensive reports, that's why I would have a completely separate reporting database.
One last comment. About the business rules... the data store cares not about the business rules. If the business rules are different between models, that really shouldn't factor into your design decisions about the database schema (other than to help dictate which fields are nullable and their data types).
Use the former. Setting up separate tables for the specialisations will complicate your code and doesn't bring any advantages that can't be achieved in other ways. A single table will also massively simplify your reports.
If the tables really do have the same columns, then the former is the best way to do it. Even if they had different columns, you'd probably still want to have the common columns be in their own table, and store a type designator.
You could try having two separate databases.
One is an OLTP (OnLine Transaction Processing) system, which should be highly normalized so that the data model is highly correct. Reporting performance is not a concern there; you deal with non-reporting query performance through indexes/denormalization etc. on a case-by-case basis. The data model should match the conceptual model very closely.
The other is a Reports system which should pull data from the OLTP system periodically, and massage and rearrange that data in a way that makes report-generation easier and more performant. The data model should not try to match up too closely with the conceptual model. You should be able to regenerate all the data in the reporting database at any time from the data currently in the main database.
I would say the first way looks better.
Are there reasons you would want to do it the second way?
The first way follows normalization better and is closer to how most relational database schemas are developed.
The second way seems to be harder to maintain.
Unless there is a really good reason for doing it the second way I would go with the first method.
Given the description that you have given us, the answer is: either.
In other words you haven't given us enough information to give a decent answer. Please describe what kind of queries you expect to perform on the data.
[Having said that, I think the answer is going to be the first one ;-)
I imagine that even though they are different models, the data for each model is probably going to be quite similar.
But this is a complete guess at the moment.]
Edit:
Given your updated edit, I'd say the first one, definitely. Since they all have the same data, they should go into the same table.
Another thing to consider in defining "better"--will end users be querying this data directly? Highly normalized data is difficult for end-users to work with. Of course this can be overcome with views but it's still something to think about as you're finalizing your design.
I do agree with the other two folks who answered: which form is "better" is subjective and dependent on what you're hoping to achieve. If you're hoping to achieve very quick queries that's one thing. If you're hoping to achieve high programmer productivity--that's a different goal again and possibly conflicts with quick queries.
The choice depends on the required performance.
The best database is a normalized database. But there can be performance issues with a normalized database, and then you have to denormalize it.
The principle "normalize first, denormalize for performance" works well.
It depends on the data model and the use case. If you ever need to report on a query that wants data from across the "models", then the former is preferable, because otherwise (with the latter) you'd have to change the query (to include the new table) every time you added a new model.
Oh and by "former" we mean this option:
table_model
* type {cadillac, saturn, chevrolet}
@mson has asked the question "What do you do when a question is not satisfactorily answered on SO?", which is a direct reference to the existing answers to this question.
I contributed the following answer to that discussion, primarily critiquing the way the question was asked.
Quote (verbatim):
I looked at the original question yesterday, and decided not to contribute an answer.
One problem was the use of the term 'model' as in 'GM models' - which cited 'Chevrolet, Saturn, Cadillac' as 'models'. To my understanding, these are not models at all; they are 'brands', though there might also be an industry-insider term for them that I'm not familiar with, such as 'division'. A model would be a 'Saturn Vue' or 'Chevrolet Impala' or 'Cadillac Escalade'. Indeed, there could well be models at a more detailed level than that - different variants of the Saturn Vue, for example.
So, I didn't think that the starting point was well framed. I didn't critique it; it wasn't quite compelling enough, and there were answers coming in, so I let other people try it.
The next problem is that it is not clear what your DBMS is going to be storing as data. If you're storing a million records per "model" ("brand"), then what sorts of data are you dealing with? Lurking in the background is a different scenario - the real scenario - and your question has used an analogy that failed to be sufficiently realistic. That means that the "it depends" parts of the answer are far more voluminous than the "this is how to do it" ones. There is woefully little background information on the data to be modelled, so we cannot guess what might be best.
Ultimately, it will depend on what uses people have for the data. If the information is going to go flying off in all different directions (different data structures in different brands; different data structures at the car model levels; different structures for the different dealerships - the Chevrolet dealers are handled differently from the Saturn dealers and the Cadillac dealers), then the integrated structure provides limited benefit. If everything is the same all the way down, then the integrated structure provides a lot of benefit.
Are there legal reasons (or benefits) to segregating the data? To what extent are the different brands separate legal entities where shared records could be a liability? Are there privacy issues, such that it will be easier to control access to the data if the data for the separate brands is stored separately?
Without a lot more detail about the scenario being modelled, no-one can give a reliable general answer - at least, not more than the top-voted one already gives (or doesn't give).
Data modelling is not easy.
Data modelling without sufficient information is impossible to do reliably.
I have copied the material here since it is more directly relevant. I do think that to answer this question satisfactorily, a lot more context should be given. And it is possible that there needs to be enough extra context to make SO the wrong place to ask it. SO has its limitations, and one of those is that it cannot deal with questions which require long explanations.
From the SO FAQs page:
What kind of questions can I ask here?
Programming questions, of course! As long as your question is:
detailed and specific
written clearly and simply
of interest to at least one other programmer somewhere
...
What kind of questions should I not ask here?
Avoid asking questions that are subjective, argumentative, or require extended discussion. This is a place for questions that can be answered!
This question is, IMO, close to the 'require extended discussion' limit.