Data quality database model

I need an example of a database model that can be attached to a database for data quality purposes. The best form of answer would, at the very least, be DDL that's executable in MySQL; DDL for other RDBMSs is okay too, and I'll just post another question asking for a port of the code.
A good explanation would be a huge plus.
Questions, comments, feedback, etc. -- just comment, thanks!

The biggest problem is identifying meaningful measures of quality. That's so highly application-dependent, I doubt that anybody will be able to help you very much. (At least not without a lot more information--perhaps more than you're allowed to give.)
But let's say your application records observations of birds by individuals. (I'm just throwing this together off the top of my head. Read it for the gist, and expect the details to crumble under scrutiny.) Under average field conditions,
some species are hard for even a beginner to get wrong
some species are hard for an expert to get right
a specific individual's ability varies irregularly over time (good days, bad days)
individuals usually become more skilled over time
you might be highly skilled at identifying hawks, and totally suck at identifying gulls
individuals are prone to suggestion (who they're with makes a difference in their reliability)
So, to take a shot at assessing the quality of an identification, you might try to record a lot of information besides the observation "3 red-tailed hawks at Cape May on 05-Feb-2011 at 4:30 pm", such as:
weather
lighting
temperature (some birders suck in the cold)
hours afield (some birders suck after 3 hours, or after 20 cold minutes)
names of others present
average difficulty of correctly identifying red-tailed hawks
probability that this individual could correctly identify red-tails under these field conditions
alcohol intake
Although this might be "meta" to field birders, to the database designer it's just data. And you'd design the tables just like you'd design them for any other application. (That's what I did, anyway.)
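For concreteness, here's a minimal MySQL sketch of what a few of those tables might look like. Every table and column name here is invented for illustration, and the details will crumble under scrutiny just like the rest of the example:

-- Minimal, illustrative MySQL DDL for the bird-observation example.
CREATE TABLE observer (
    observer_id INT PRIMARY KEY AUTO_INCREMENT,
    name        VARCHAR(100) NOT NULL
);

CREATE TABLE species (
    species_id    INT PRIMARY KEY AUTO_INCREMENT,
    common_name   VARCHAR(100) NOT NULL,
    id_difficulty DECIMAL(3,2) NOT NULL  -- average difficulty of a correct ID, 0.00-1.00
);

CREATE TABLE observation (
    observation_id INT PRIMARY KEY AUTO_INCREMENT,
    observer_id    INT NOT NULL,
    species_id     INT NOT NULL,
    observed_at    DATETIME NOT NULL,
    location       VARCHAR(100) NOT NULL,
    head_count     INT NOT NULL,
    weather        VARCHAR(50),   -- field conditions recorded per observation
    lighting       VARCHAR(50),
    temperature_c  DECIMAL(4,1),
    hours_afield   DECIMAL(4,1),
    FOREIGN KEY (observer_id) REFERENCES observer (observer_id),
    FOREIGN KEY (species_id)  REFERENCES species (species_id)
);

-- Who else was present (suggestibility affects reliability).
CREATE TABLE observation_companion (
    observation_id INT NOT NULL,
    companion_id   INT NOT NULL,
    PRIMARY KEY (observation_id, companion_id),
    FOREIGN KEY (observation_id) REFERENCES observation (observation_id),
    FOREIGN KEY (companion_id)   REFERENCES observer (observer_id)
);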

Database schema naming conventions and common mistakes?

(https://i.stack.imgur.com/VYkV6.png)
I'm asked to design a relational database to keep data to answer clinic operation queries such as:
● List the patient appointments for each doctor for a given date.
● When a patient rings to make an appointment, give the available time slots for a given date.
● Retrieve the address of patients to send notices via mail services.
I have one database schema of one relation, as shown below, but I was wondering whether there are any mistakes I've made?
ABC(doc-name, doc-gender, registration_num, qualification, pat-name, pat-gender, DOB, address, phone-num, appoint-date, appoint-time, type)
Is the use of words such as date and the use of hyphens generally discouraged? Are there any other weaknesses in my design?
Thank you
So, that's not a schema or a design. Not for a relational database, which, based on the tags for the question, is what you're looking for. That's the storage definition for an ID/Value style of database. If you're looking for actual relational storage, you should be building out those relationships through the process of normalization.
For example, let's start at the beginning with doc-name (I'm personally not crazy about using hyphens, but it's not a showstopper; just be sure whichever RDBMS you're working with supports them in names and you're good to go). If we think about this just from a data-entry standpoint, we don't want to have to type in the name of the doctor every time we use that doctor. Instead, we'd want to pull that from a list. So, clearly, we can break that apart from the rest of the information. There is the beginning of our normalization process. We can also easily note that a patient is likely to have more than one appointment. Under the current structure, we'd have to re-enter every bit of patient information for every appointment. There's another place where we'd break this apart.
There is tons more to this simple example that could be split out and normalized.
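To make that concrete, here's a rough sketch of those first splits. The table and column names are mine, invented for illustration (note the underscores in place of hyphens), not part of the original question:

CREATE TABLE doctor (
    doctor_id        INT PRIMARY KEY,
    doc_name         VARCHAR(100) NOT NULL,
    doc_gender       CHAR(1),
    registration_num VARCHAR(20),
    qualification    VARCHAR(100)
);

CREATE TABLE patient (
    patient_id INT PRIMARY KEY,
    pat_name   VARCHAR(100) NOT NULL,
    pat_gender CHAR(1),
    dob        DATE,
    address    VARCHAR(200)
);

-- Each appointment now references a doctor and a patient
-- instead of repeating their details on every row.
CREATE TABLE appointment (
    appointment_id INT PRIMARY KEY,
    doctor_id      INT NOT NULL,
    patient_id     INT NOT NULL,
    appoint_date   DATE NOT NULL,
    appoint_time   TIME NOT NULL,
    appoint_type   VARCHAR(50),
    FOREIGN KEY (doctor_id)  REFERENCES doctor (doctor_id),
    FOREIGN KEY (patient_id) REFERENCES patient (patient_id)
);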
I'd suggest you read up on data normalization. My favorite teacher on the subject is Louis Davidson. Here's his book on the topic. Read that and then try to readdress the situation you're facing.
I'm assuming this isn't just homework. If it is, currently, I'd give you an "F". If it isn't, you should track down someone to give you hand with this database design. You won't be able to quickly read Louis' book on the topic and turn around even a rough working design in any reasonable period of time.
I have to second what Grant said, this is not a relational design at all.
Stop and ask yourself for example what happens if Steven Arrow has to take an afternoon off and update his schedule. You need to be very careful updating the database lest you reassign all his patients.
Spending a total of 5 minutes on this, I see at the very least:
A Doctors table, a Patients table, and probably a table of open appointment times (which, by the way, is a bit harder than you think, so give some thought to how to handle that and do some reading up on tables for scheduling).
That's for starters. I might break out patients' phone numbers into their own table. Why? Well, how many columns do you want to have for phone numbers? One? What if they have a work AND home number? Or a work and cell and home? And more.
The concept you're looking for is normal forms. You don't need to go overboard, but generally 3NF is about right.
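For example, a minimal sketch of that phone-number breakout (names invented, assuming a patient table like the one sketched earlier):

-- One row per patient phone number, instead of a fixed set of columns.
CREATE TABLE patient_phone (
    patient_id INT NOT NULL,
    phone_type VARCHAR(10) NOT NULL,  -- e.g. 'home', 'work', 'cell'
    phone_num  VARCHAR(20) NOT NULL,
    PRIMARY KEY (patient_id, phone_type),
    FOREIGN KEY (patient_id) REFERENCES patient (patient_id)
);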

SQLite database questions, problems with design (indexing/multiple fields)

I use Stack Overflow a lot, but this is my first question here, so if I'm doing anything wrong just let me know. I'm not a programmer (I just do programming for my own needs), so I'm open to tutorial suggestions etc. I won't be offended if you just give me something to read so I can find the answers myself.
OK, to the point - I'm trying to write a simple application to track my personal expenses, and I have a problem with the database design. I'm using Visual Studio to create the database (SQLite). I attached a diagram with my design, and I have some questions.
My SQLite diagram
I don't know exactly how to design the "Transactions" table. Fields like Date, Payment Type etc. seem easy enough, but the idea is to store information about transactions in this table, so I need to store multiple products there. I've read about it and created a "Transactions_Products" table that will help with that. My problem is: where do I put the quantity of products in the transaction? I can't think of a place to put it. I tried to find similar databases but couldn't find anything.
Second thing. I've read about indexing a lot, but I still can't grasp the idea. I don't know when to use it. Should I use it only on fields that I will be "querying" a lot?
Last one - is it better for such a small application just for myself to store my account balance in a separate table or should I just calculate it every time?
As I said, I don't need answers like "do this, do that". If you just give me some good tutorials/articles, I think I can find the answers on my own; I just couldn't find anything. Maybe I'm searching for the wrong thing.
Thank you in advance for any information.
where do I put quantity of products in the transaction?
Transactions is a bad table name as it's vague and has multiple meanings. Consider "payments", "purchase invoices", etc. See https://dba.stackexchange.com/questions/12991/ready-to-use-database-models-example/23831#23831 for some existing patterns.
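As for where the quantity goes: the usual pattern is to put it on the linking table itself, since it describes the pairing of one purchase with one product. A minimal SQLite sketch, with invented names:

-- The quantity lives on the line-item (junction) row.
CREATE TABLE purchase (
    purchase_id  INTEGER PRIMARY KEY,
    purchased_on TEXT NOT NULL,   -- ISO-8601 date string
    payment_type TEXT NOT NULL
);

CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);

CREATE TABLE purchase_item (
    purchase_id INTEGER NOT NULL REFERENCES purchase (purchase_id),
    product_id  INTEGER NOT NULL REFERENCES product (product_id),
    quantity    INTEGER NOT NULL,
    unit_price  REAL NOT NULL,    -- price at the time of purchase
    PRIMARY KEY (purchase_id, product_id)
);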
Should I use [indexes] only on fields that I will be "querying" a lot?
There's no free lunch. Indexes take space, and can slow down inserts. Start with indexes on your primary keys (which is the default for SQLite), measure what is slow (looking at query plans) and add indexes if they help and if you have room.
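For example (using the invented tables above), you might index a column you filter on often, then check whether SQLite actually uses it:

-- Index a frequently-filtered column...
CREATE INDEX idx_purchase_date ON purchase (purchased_on);

-- ...then ask SQLite whether a query uses it.
EXPLAIN QUERY PLAN
SELECT * FROM purchase WHERE purchased_on >= '2024-01-01';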
is it better for such a small application just for myself to store my account balance in a separate table or should I just calculate it every time?
For an operational/transactional database like you describe, avoid storing calculated values. SQLite can count numbers quickly :)
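"Calculate it every time" can be as simple as summing signed amounts. A hypothetical sketch (this ledger table is not from the question):

-- Hypothetical ledger: deposits positive, expenses negative.
CREATE TABLE ledger (
    ledger_id INTEGER PRIMARY KEY,
    amount    REAL NOT NULL,
    booked_on TEXT NOT NULL
);

-- Current balance, computed on demand instead of stored.
SELECT COALESCE(SUM(amount), 0) AS balance FROM ledger;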
Premature optimization is premature. Make it work first with full normalization. If you have performance problems, analyze what is really causing the slow-down and go from there.

Recommended AI/machine learning: profiles input, income prediction

My project looks like this: my data set is a bunch of profiles of people, with various attributes, e.g. boolean hasJob and int healthScore, and their income. Using this data, I'm trying to predict their income for the future. Each profile also has a history: e.g., what their attributes and income were in the past.
So in essence I'm trying to map multiple sets of (x booleans, y numbers) to a number (salary in the coming year).
I've considered neural networks, Bayes nets, and genetic algorithms for function-fitting. Any suggestions or input?
Thanks in advance!
--Emily
What you want to do is called "time series modeling". However, you probably have only very little data per series (per person). I think it is difficult to find one model that fits every person, because you end up making general assumptions, e.g. that everyone is equally career-oriented.

This is also a very noisy target. You might, for instance, have to take into account whether someone is a sweet-talker, and how do you measure such a thing? I'm pretty sure your current attributes carry enough noise to make it difficult to predict anything. When you say health score, do you mean physical health only, or mental health too? Different things matter in different businesses. What about the business or industry people are working in, and its health and growth potential? I would assume this strongly influences their income. You likely also have dependencies among variables, as attributes can be (and likely are) influenced by your target variable: e.g. people with higher income have better health.

It sounds like a very, very complex and difficult problem, and definitely not one where "I naively grouped my data and tried a bunch of methods" is going to give meaningful results. I would suggest learning more about time series modeling, and especially about the data you have. Maybe try starting out by clustering people by their initial attributes and seeing how they develop. Are there any variables that correlate with this development?
What is your research question?

Schema design

Let's say you are a GM DBA and you have to design around the GM models.
Is it better to do this?
table_model
type {cadillac, saturn, chevrolet}
Or this?
table_cadillac_model
table_saturn_model
table_chevrolet_model
Let's say that the business lines have the same columns for a model and that there are over a million records for each subtype.
EDIT:
there is a lot of CRUD
there are a lot of very processor intensive reports
in either schema, there is a model_detail table that contains 3-5 records for each model and the details for each model differ (you can't add a cadillac detail to a saturn model)
the dev team doesn't have any issues with db complexity
I'm not really sure that this is a normalization question. Even though the structures are the same, they might be thought of as different entities.
EDIT:
Reasons for partitioning the structure into multiple tables
- business lines may have different business rules regarding parts
- addModelDetail() could be different for each business line (even though the data format is the same)
- high add/update activity - better performance with partitioned structure instead of single structure (I'm guessing and not sure here)?
I think this is a variation of the EAV problem. When posed as an EAV design, the single-table structure generally gets voted a bad idea. When posed in this manner, the single-table structure generally gets voted a good idea. Interesting...
I think the most interesting answer is having two different structures - one for CRUD and one for reporting. I think I'll try a concatenated/flattened view for reporting and multiple tables for CRUD, and see how that works.
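Something like this, perhaps (view and column names are just placeholders):

-- A flattened view over the per-brand tables, for reporting.
CREATE VIEW all_models AS
SELECT 'cadillac' AS type, model_name, model_year FROM table_cadillac_model
UNION ALL
SELECT 'saturn' AS type, model_name, model_year FROM table_saturn_model
UNION ALL
SELECT 'chevrolet' AS type, model_name, model_year FROM table_chevrolet_model;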
Definitely the former example. Do you want to be adding tables to your database whenever you add a new model to your product range?
On data with a lot of writes (e.g. an OLTP application), it is better to have more, narrower tables (i.e. tables with fewer fields). There will be less lock contention because you're only writing small amounts of data into different tables.
So, based on the criteria you have described, the table structure I would have is:
Vehicle
- VehicleType
- Other common fields
CadillacVehicle
- Fields specific to a Caddy
SaturnVehicle
- Fields specific to a Saturn
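In rough DDL, with invented columns, that structure might look like:

-- Supertype table holds the key and the common fields...
CREATE TABLE Vehicle (
    VehicleId   INT PRIMARY KEY,
    VehicleType VARCHAR(20) NOT NULL,  -- 'cadillac', 'saturn', ...
    ModelName   VARCHAR(50) NOT NULL,
    ModelYear   INT NOT NULL
);

-- ...and each subtype table holds only the brand-specific fields,
-- sharing the supertype's key.
CREATE TABLE CadillacVehicle (
    VehicleId     INT PRIMARY KEY,
    CaddySpecific VARCHAR(50),  -- placeholder for Caddy-only fields
    FOREIGN KEY (VehicleId) REFERENCES Vehicle (VehicleId)
);

CREATE TABLE SaturnVehicle (
    VehicleId      INT PRIMARY KEY,
    SaturnSpecific VARCHAR(50),  -- placeholder for Saturn-only fields
    FOREIGN KEY (VehicleId) REFERENCES Vehicle (VehicleId)
);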
For reporting, I'd have an entirely different database on an entirely different server that does not have the normalized structure (e.g. just has CadillacVehicle and SaturnVehicle tables with all of the fields from the Vehicle table duplicated into them).
With proper indexes, even the OLTP database could be performant for your SELECTs, regardless of the fact that there are tens of millions of rows. However, since you mentioned that there are processor-intensive reports, that's why I would have a completely separate reporting database.
One last comment. About the business rules... the data store cares not about the business rules. If the business rules are different between models, that really shouldn't factor into your design decisions about the database schema (other than to help dictate which fields are nullable and their data types).
Use the former. Setting up separate tables for the specialisations will complicate your code and doesn't bring any advantages that can't be achieved in other ways. It will also massively simplify your reports.
If the tables really do have the same columns, then the former is the best way to do it. Even if they had different columns, you'd probably still want to have the common columns be in their own table, and store a type designator.
You could try having two separate databases.
One is an OLTP (OnLine Transaction Processing) system which should be highly normalized so that the data model is highly correct. Report performance must not be an issue, and you would deal with non-reporting query performance with indexes/denormalization etc. on a case-by-case basis. The data model should try to match up very closely with the conceptual model.
The other is a Reports system which should pull data from the OLTP system periodically, and massage and rearrange that data in a way that makes report-generation easier and more performant. The data model should not try to match up too closely with the conceptual model. You should be able to regenerate all the data in the reporting database at any time from the data currently in the main database.
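For example (all names invented, not from this answer), the periodic pull could be as simple as rebuilding a flattened copy:

-- Rebuild the flattened reporting table from the OLTP data.
DELETE FROM report_vehicle;

INSERT INTO report_vehicle (vehicle_id, vehicle_type, model_name, model_year)
SELECT VehicleId, VehicleType, ModelName, ModelYear
FROM Vehicle;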
I would say the first way looks better.
Are there reasons you would want to do it the second way?
The first way follows normalization better and is closer to how most relational database schemas are developed.
The second way seems to be harder to maintain.
Unless there is a really good reason for doing it the second way I would go with the first method.
Given the description that you have given us, the answer is either.
In other words you haven't given us enough information to give a decent answer. Please describe what kind of queries you expect to perform on the data.
[Having said that, I think the answer is going to be the first one ;-)
As I imagine, even though they are different models, the data for each model is probably going to be quite similar.
But this is a complete guess at the moment.]
Edit:
Given your updated edit, I'd say the first one definitely. As they have all the same data then they should go into the same table.
Another thing to consider in defining "better"--will end users be querying this data directly? Highly normalized data is difficult for end-users to work with. Of course this can be overcome with views but it's still something to think about as you're finalizing your design.
I do agree with the other two folks who answered: which form is "better" is subjective and dependent on what you're hoping to achieve. If you're hoping to achieve very quick queries that's one thing. If you're hoping to achieve high programmer productivity--that's a different goal again and possibly conflicts with quick queries.
The choice depends on the required performance.
The best database is a normalized one. But if a normalized database gives you performance problems, then you have to denormalize it.
The principle "normalize first, denormalize for performance" works well.
It depends on the datamodel and the use case. If you ever need to report on a query that wants data out of the "models" then the former is preferable because otherwise (with the latter) you'd have to change the query (to include the new table) every time you added a new model.
Oh and by "former" we mean this option:
table_model
* type {cadillac, saturn, chevrolet}
#mson has asked the question "What do you do when a question is not satisfactorily answered on SO?", which is a direct reference to the existing answers to this question.
I contributed the following answer to that discussion, primarily critiquing the way the question was asked.
Quote (verbatim):
I looked at the original question yesterday, and decided not to contribute an answer.
One problem was the use of the term 'model' as in 'GM models' - which cited 'Chevrolet, Saturn, Cadillac' as 'models'. To my understanding, these are not models at all; they are 'brands', though there might also be an industry-insider term for them that I'm not familiar with, such as 'division'. A model would be a 'Saturn Vue' or 'Chevrolet Impala' or 'Cadillac Escalade'. Indeed, there could well be models at a more detailed level than that - different variants of the Saturn Vue, for example.
So, I didn't think that the starting point was well framed. I didn't critique it; it wasn't quite compelling enough, and there were answers coming in, so I let other people try it.
The next problem is that it is not clear what your DBMS is going to be storing as data. If you're storing a million records per 'model' ('brand'), then what sorts of data are you dealing with? Lurking in the background is a different scenario - the real scenario - and your question has used an analogy that failed to be sufficiently realistic. That means that the 'it depends' parts of the answer are far more voluminous than the 'this is how to do it' ones. There is just woefully too little background information on the data to be modelled to allow us to guess what might be best.
Ultimately, it will depend on what uses people have for the data. If the information is going to go flying off in all different directions (different data structures in different brands; different data structures at the car model levels; different structures for the different dealerships - the Chevrolet dealers are handled differently from the Saturn dealers and the Cadillac dealers), then the integrated structure provides limited benefit. If everything is the same all the way down, then the integrated structure provides a lot of benefit.
Are there legal reasons (or benefits) to segregating the data? To what extent are the different brands separate legal entities where shared records could be a liability? Are there privacy issues, such that it will be easier to control access to the data if the data for the separate brands is stored separately?
Without a lot more detail about the scenario being modelled, no-one can give a reliable general answer - at least, not more than the top-voted one already gives (or doesn't give).
Data modelling is not easy.
Data modelling without sufficient information is impossible to do reliably.
I have copied the material here since it is more directly relevant. I do think that to answer this question satisfactorily, a lot more context should be given. And it is possible that there needs to be enough extra context to make SO the wrong place to ask it. SO has its limitations, and one of those is that it cannot deal with questions which require long explanations.
From the SO FAQs page:
What kind of questions can I ask here?
Programming questions, of course! As long as your question is:
detailed and specific
written clearly and simply
of interest to at least one other programmer somewhere
...
What kind of questions should I not ask here?
Avoid asking questions that are subjective, argumentative, or require extended discussion. This is a place for questions that can be answered!
This question is, IMO, close to the 'require extended discussion' limit.

Inventory database design [closed]

This is a question not really about "programming" (it's not specific to any language or database), but more about design and architecture. It's also a question of the type "what's the best way to do X?". I hope it doesn't cause too much "religious" controversy.
In the past I have developed systems that, in one way or another, keep some form of inventory of items (what the items are is not relevant), some using languages/DBs that do not support transactions. In those cases I opted not to save the item quantity on hand in a field in the item record. Instead, the quantity on hand is calculated as total inventory received minus total inventory sold. This has resulted in almost no inventory discrepancies caused by software. The tables are properly indexed and the performance is good. There is an archiving process in case the number of records starts to affect performance.
Now, a few years ago I started working at this company, and I inherited a system that tracks inventory. But the quantity is saved in a field. When an entry is registered, the quantity received is added to the quantity field for the item. When an item is sold, the quantity is subtracted. This has resulted in discrepancies. In my opinion this is not the right approach, but the previous programmers here swear by it.
I would like to know if there is a consensus on the right way to design such a system. Also, what resources are available, printed or online, to seek guidance on this?
Thanks
I have seen both approaches at my current company and would definitely lean towards the first (calculating totals based on stock transactions).
If you are only storing a total quantity in a field somewhere, you have no idea how you arrived at that number. There is no transactional history and you can end up with problems.
The last system I wrote tracks stock by storing each transaction as a record with a positive or negative quantity. I have found it works very well.
The Data Model Resource Book, Vol. 1: A Library of Universal Data Models for All Enterprises
The Data Model Resource Book, Vol. 2: A Library of Data Models for Specific Industries
The Data Model Resource Book: Universal Patterns for Data Modeling
I have Vol 1 and Vol 2 and these have been pretty helpful in the past.
It depends; inventory systems are about far more than just counting items. For example, for accounting purposes, you might need to know the accounting value of inventory based on the FIFO (first-in-first-out) model. That can't be calculated by a simple "total inventory received - total inventory sold" formula, but their model might calculate it easily, because they adjust the accounting value as they go. I don't want to go into details because this is not a programming issue, but if they swear by it, maybe you haven't fully understood all the requirements they have to accommodate.
Both are valid, depending on the circumstances. The former is best when the following conditions hold:
the number of items to sum is relatively small
there are few or no exceptional cases to consider (returns, adjustments, et al.)
the inventory item quantity is not needed very often
On the other hand, if you have a large number of items, several exceptional cases, and frequent access, it will be more efficient to maintain the item quantity.
Also note that if your system has discrepancies, then it has bugs which should be tracked down and eliminated.
I have done systems both ways, and both ways can work just fine - as long as you don't ignore the bugs!
It's important to consider the existing system and the cost and risk of changing it. I work with a database that stores inventory kind of like yours does, but it includes audit cycles and stores adjustments just like receipts. It seems to work well, but everyone involved is well trained, and the warehouse staff aren't exactly quick to learn new procedures.
In your case, if you're looking for a little more tracking without changing the whole DB structure, then I'd suggest adding a tracking table (kind of like your 'transaction' solution) and then logging changes to the inventory level. It shouldn't be too hard to update most changes to the inventory level so that they also leave a transaction record. You could also add a periodic task to back up the inventory level to the transaction table every couple of hours or so, so that even if you miss a transaction you can discover when the change happened or roll back to a previous state.
If you want to see how a large application does it, take a look at SugarCRM; they have an inventory management module, though I'm not sure how it stores the data.
I think this is actually a general best-practices question about doing a (relatively) expensive count every time you need a total vs. doing that count every time something changes, then storing the count in a field and reading that field whenever you need a total.
If I couldn't use transactions, I would go with the live count every time I needed a total. If transactions are available, it would be safe to perform the inventory update operations and the saving of the re-counted total within the same transaction, which would ensure the accuracy of the count (although I'm not sure this would work with multiple users hitting the database).
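A minimal sketch of that transactional approach, with invented table and column names:

BEGIN;

-- Record the movement (negative quantity = items sold)...
INSERT INTO inventory_transaction (item_id, quantity)
VALUES (42, -3);

-- ...and refresh the stored total from the full history,
-- inside the same transaction so the two cannot diverge.
UPDATE item
SET quantity_on_hand = (SELECT SUM(quantity)
                        FROM inventory_transaction
                        WHERE item_id = 42)
WHERE item_id = 42;

COMMIT;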
But if performance is not really a huge problem (and modern databases are good enough at counting rows that I would rarely even worry about this) I'd just stick with the live count each time.
I would opt for the first way, where "the quantity on hand is calculated totaling inventory received - total of inventory sold".
The Right Way, IMO.
EDIT: I would also want to factor in any stock losses/damages into the system, but I'm sure you have that covered.
I've worked on systems that solve this problem before. I think the ideal solution is a precomputed column, which gets you the best of both worlds: your total would be a field somewhere, thus no expensive lookups, but it can't get out of sync with the rest of your data (the database maintains the integrity). I don't remember which RDBMSs support precomputed columns, but if you don't have transactions, that might not be available either.
You could potentially fake precomputed columns (very effectively... I see no downside) using triggers. You'd probably need transactions though. IMHO, keeping data integrity when you're doing this sort of controlled denormalization is the only legitimate use for a trigger.
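As a hedged sketch of that trigger approach (MySQL syntax, invented names; a real system would need UPDATE and DELETE triggers as well):

-- Keep the stored total in sync as transactions are inserted.
CREATE TRIGGER trg_inventory_txn_insert
AFTER INSERT ON inventory_transaction
FOR EACH ROW
UPDATE item
SET quantity_on_hand = quantity_on_hand + NEW.quantity
WHERE item_id = NEW.item_id;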
Django-inventory is geared more toward fixed assets, but it might give you some ideas.
i.e.: ItemTemplate (class) -> ItemsOnHand (instance)
An ItemsOnHand can be linked to more ItemTemplates; for example, a printer and the ink cartridges it requires. This also allows you to set reorder points for each ItemOnHand.
Each ItemsOnHand is linked to InventoryTransactions; this allows for easy auditing.
To avoid calculating the actual on-hand items from thousands of inventory transactions, checkpoints are used, which are just a balance + a date. To calculate items on hand, query the most recent checkpoint and start adding or subtracting items to find the current balance. Define new checkpoints periodically.
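In rough SQL, with invented names, the checkpoint lookup might be:

-- On hand = most recent checkpoint balance
--           + sum of transactions since that checkpoint.
SELECT c.balance + COALESCE(SUM(t.quantity), 0) AS on_hand
FROM stock_checkpoint c
LEFT JOIN inventory_transaction t
       ON t.item_id = c.item_id
      AND t.created_at > c.checkpoint_date
WHERE c.item_id = 42
  AND c.checkpoint_date = (SELECT MAX(checkpoint_date)
                           FROM stock_checkpoint
                           WHERE item_id = 42)
GROUP BY c.balance;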
I can see some benefit to having the two columns, but I'm not following the part about discrepancies - you seem to be implying that having the two columns (in and out) is less prone to discrepancy than a single column (current). Why is that?
It's not about having one or two columns; what I meant by "totaling inventory received - total of inventory sold" is something like this:
Select sum(quantity) as inventory_received from Inventory_entry
Select sum(quantity) as inventory_sold from Sales_items
then
Quantity_on_hand = inventory_received - inventory_sold
Please keep in mind that I oversimplified this in my initial explanation. I know there is much more to inventory than just keeping track of quantities, but in this case that's where the problem lies and what we want to fix. At this point, the reason to change it is precisely the cost of supporting the problems caused by the current design.
Also, I wanted to mention that although this is not a "coding" question, it is related to algorithms and design, which IMHO are very important topics.
Thanks everybody for your answers so far.
Nelson Marmol
We solve different problems, but our approach to some of them might be interesting to you.
We allow the system to make a "best guess", and give the users regular feedback about any of those guesses that look wrong.
To apply this to inventory, you could have 3 fields:
inventory_received
inventory_sold
estimated_on_hand
Then, you could run a process (daily?) along the lines of:
SELECT *
FROM Inventory
WHERE estimated_on_hand != inventory_received - inventory_sold
Of course, this relies on users looking at this alert, and doing something about it.
Also, you could have a function to reset inventory somehow, either by updating inventory_sold/received, or perhaps by adding another field, "inventory_adjustment", which could be positive or negative.
... just some thoughts. Hope it's helpful.
