Reducing number of attributes stored in table

Reducing number of attributes stored in table - database

I have a problem in creating a table for a database. I want to record many status for each farmer for example farmer will perform many procedures in paddy farming and have about 26 procedures from cultivation until harvesting.
So, each farmer must follow a schedule for each procedure according to dates fixed by Agriculture assistant. My problem is how can I record this procedure status to record whether the farmer is following the schedule or not? For now, I use the 26 procedures as the attributes for the activity table so in the activity table I have attributes
farmerID, status1 (for activity 1 eg: Cultivation) ,
status2 (for activity 2 eg: fertilization),
status 3
and so on until status 26...so is this the correct way? My lecturer says it is incorrect because so many attributes are there. Can you help me out from this problem? I can't think about this any more.

Not a good way of handling it, especially since it's not immediately scalable without adding new fields (and having your code map those new fields). I'd do something like this:
tbl_farmer
- farmerId
tbl_status
- statusId
- name (i.e. Cultivation, etc.)
tbl_activity
- farmerId
- statusId
And each time a farmer performs a status update, you place the entry inside tbl_activity. Basically tbl_activity is a reference table

An alternative approach would be to give each activity (procedure) an id and instead of many columns only have three.
farmer_id
activity_id
status
Assuming that your activities are stored in a separate table.

Related

Avoid duplicating fields across multiple tables

Let me describe briefly the table structures:
Customer Table
id | name | address_line_one | address_line_two | contact_no_one
SaleInvoice Table
id | id_Customer (Foreign Key) | invoice_no
If I have to print a Sale invoice, I have to use the Customer information (like name, address) from the Customer table.
Assume that after a year, some customer data changes (like name or address), and I update the new data in my customer table. Now, if the customer asks for an old invoice, it will be printed with the new customer data which shall be legally wrong.
Does that mean, I have to create
name_customer
address_line_one_customer
...
and all these fields in the Sale Invoice table too?
If yes, is there a better way to get data from these fields in Customer table to the Sale Invoice table then to write a SQL query to get the values and then set the values?

This is really up to you. In some cases, where it is a legal document, you will save all the details so that you can always bring it up the way it was created. Alternatively if you are producing pdf invoices then save them to be 100% sure.
The other alternative is to create a CustomerHistory table, so that past versions are always saved with a date range, so that you can go back to the old version.
It depends on the use cases, but those are your main options.

It sounds like a problem easily solved by placing the Employee table in version normal form (VNF). This is actually just a flavor of 2nf but done in a way that provides the ability to query current data and past data using the same query.
A datetime parameter is used to provide the distinction. When the value is set to NOW, the current data is returned. When the value is set to a specific datetime value in the past, the data that was current at that date and time is returned.
A brief discussion of the particulars can be found here. That answer also contains links to more information if you think it is something that would work for you.

data synchronization from unreliable data source to SQL table

I am looking for pattern, framework or best practice to handle a generic problem of application level data synchronisation.
Let's take an example with only 1 table to make it easier.
I have an unreliable datasource of product catalog. Data can occasionally be unavailable or incomplete or inconsistent. ( issue might come from manual data entry error, ETL failure...)
I have a live copy in a Mysql table in use by a live system. Let's say a website.
I need to implement safety mecanism when updating the mysql table to "synchronize" with original data source. Here are the safety criteria and the solution I an suggesting:
avoid deleting records when they temporarily disappear from datasource => use "deleted" boulean/date column or an archive/history table.
check for inconsistent changes => configure rules per columns such as : should never change, should only increment,
check for integrity issue => (standard problem, no point discussing approach)
ability to rollback last sync=> restore from history table ? use a version inc/date column ?
What I am looking for is best practice and pattern/tool to handle such problem. If not you are not pointing to THE solution, I would be grateful of any keywords suggestion that would me narrow down which field of expertise to explore.

We have the same problem importing data from web analytics providers - they suffer the same problems as your catalog. This is what we did:
Every import/sync is assigned a unique id (auto_increment int64)
Every table has a history table that is identical to the original, but has an additional column "superseded_id" which gets the import-id of the import, that changed the row (deletion is a change) and the primary key is (row_id,superseded_id)
Every UPDATE copies the row to the history table before changing it
Every DELETE moves the row to the history table
This makes rollback very easy:
Find out the import_id of the bad import
REPLACE INTO main_table SELECT <everything but superseded_id> FROM history table WHERE superseded_id=<bad import id>
DELETE FROM history_table WHERE superseded_id>=<bad import id>
For databases, where performance is a problem, we do this in a secondary database on a different server, then copy the found-to-be-good main table to the production database into a new table main_table_$id with $id being the highest import id and have main_table be a trivial view to SELECT * FROM main_table_$someid. Now by redefining the view to SELECT * FROM main_table_$newid we can atomically swicth the table.

I'm not aware of a single solution to all this - probably because each project is so different. However, here are two techniques I've used in the past:
Embed the concept of version and validity into your data model
This is a way to deal with change over time without having to resort to history tables; it does complicate your queries, so you should use it sparingly.
For instance, instead of having a product table as follows
PRODUCTS
Product_ID primary key
Price
Description
AvailableFlag
In this model, if you want to delete a product, you execute "delete from product where product_id = ..."; modifying price would be "update products set price = 1 where product_id = ...."
With the versioned model, you have:
PRODUCTS
product_ID primary key
valid_from datetime
valid_until datetime
deleted_flag
Price
Description
AvailableFlag
In this model, deleting a product requires you to update products set valid_until = getdate() where product_id = xxx and valid_until is null, and then insert a new row with the "deleted_flag = true".
Changing price works the same way.
This means that you can run queries against your "dirty" data and insert it into this table without worrying about deleting items that were accidentally missed off the import. It also allows you to see the evolution of the record over time, and roll-back easily.
Use a ledger-like mechanism for cumulative values
Where you have things like "number of products in stock", it helps to create transactions to modify the amount, rather than take the current amount from your data feed.
For instance, instead of having a amount_in_stock column on your products table, have a "product_stock_transaction" table:
product_stock_transactions
product_id FK transaction_date transaction_quantity transaction_source
1 1 Jan 2012 100 product_feed
1 2 Jan 2012 -3 stock_adjust_feed
1 3 Jan 2012 10 product_feed
On 2 Jan, the quantity in stock was 97; on 3 Jan, 107.
This design allows you to keep track of adjustments and their source, and is easier to manage when moving data from multiple sources.
Both approaches can create large amounts of data - depending on the number of imports and the amount of data - and can lead to complex queries to retrieve relatively simple data sets.
It's hard to plan for performance concerns up front - I've seen both "history" and "ledger" work with large amounts of data. However, as Eugen says in his comment below, if you get to an excessively large ledger, it may be necessary to to clean up the ledger table by summarizing the current levels, and deleting (or archiving) old records.

Where should I break up my user records to keep track of revisions

I am putting together a staff database and I need to be able to revise the staff member information, but also keep track of all the revisions. How should I structure the database so that I can have multiple revisions of the same user data but be able to query against the most recent revision? I am looking at information that changes rarely, like Last Name, but that I will need to be able to query for out of date values. So if Jenny Smith changes her name to Jenny James I need to be able to find the user's current information when I search against her old name.
I assume that I will need at least 2 tables, one that contains the uid and another that contains the revisions. Then I would join them and query against the most recent revision. But should I break it out even further, depending on how often the data changes or the type of data? I am looking at about 40 fields per record and only one or two fields will probably change per update. Also I cannot remove any data from the database, I need to be able to look back on all previous records.

A simple way of doing this is to add a deleted flag and instead of updating records you set the deleted flag on the existing record and insert a new record.
You can of course also write the existing record to an archive table, if you prefer. But if changes are infrequent and the table is not big I would not bother.
To get the active record, query with 'where deleted = 0', the speed impact will be minimal when there is an index on this field.
Typically this is augmented with some other fields like a revision number, when the record was last updated, and who updated it. The revision number is very useful to get the previous versions and also to do optimistic locking. The 'who updated this last and when' questions usually come once the system is running instead of during requirements gathering, and are useful fields to put in any table containing 'master' data.

I would use the separate table because then you can have a unique identifier that points to all the other child records that is also the PK of the table which I think makes it less likely you will have data integrity issues. For instance, you have Mary Jones who has records in the address table and the email table and performance evaluation table, etc. If you add a change record to the main table, how are you going to relink all the existing information? With a separate history table, it isn't a problem.
With a deleted field in one table, you then have to have an non-autogenerated person id and an autogenrated recordid.
You also have the possiblity of people forgetting to use the where deleted = 0 where clause that is needed for almost every query. (If you do use the deleted flag field, do yourself a favor and set a view with the where deleted = 0 and require developers to use the view in queries not the orginal table.)
With the deleted flag field you will also need a trigger to ensure one and only one record is marked as active.

#Peter Tillemans' suggestion is a common way to accomplish what you're asking for. But I don't like it.
The structure of a database should reflect the real-world facts that are being modeled.
I would create a separate table for obsolete_employee, and just store the historical information that would need to be searched in the future. This way you can keep your real employee data table clean and keep only the old data that is necessary. This approach will also simplify reporting and other features of the application that are not related to searching historical data.
Just think of that warm feeling you'll get when you type select * from employee and nothing but current, correct goodness comes flowing back!

trigger insertions into same table

I have many tables in my database which are interrelated. I have a table (table one) which has had data inserted and the id auto increments. Once that row has an ID i want to insert this into a table (table three) with another set of ID's which comes from a form(this data will also be going into a table, so it could from from that table), the same form as the data which went into the first table came from.
The two ID's together make the primary key of the third table.
How can I do this, its to show that more than one ID is joined to a single ID for something else.
Thanks.

You can't do that through a trigger as the trigger only has available to it the data that you already inserted not data that is currenlty only residing in your user interface.
Normally how you handle this situation is that you write a stored proc that inserts the meeting, returns the id value (using scope_identity() in SQL Server, but I'm sure other databases would have method to return the auto-generated id as well). Then you would use that value to insert to the other table with the other values you need for that table. You would of course want to wrap the whole thing in a transaction.

I think you can probably do what you're describing (just write the INSERTs to table 3) in the table 1 trigger) but you'll have to put the additional info for the table 3 rows into your table 1 row, which isn't very smart.
I can't see why you would do that instead of writing the INSERTs in your code, where someone reading it can see what's happening.
The trouble with triggers is that they make it easy to hide business logic in the database. I think (and I believe I'm in the majority here) that it's easier to understand, manage, maintain and generally all-round deal with an application where all the business rules exist in the same general area.
There are reasons to use triggers (for propagating denormalised values, for example) just as there are reasons for useing stored procedures. I'm going to assert that they are largely related to performance-critical areas. Or should be.

How do you avoid adding timestamp fields to your tables? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
I have a question regarding the two additional columns (timeCreated, timeLastUpdated) for each record that we see in many solutions. My question: Is there a better alternative?
Scenario: You have a huge DB (in terms of tables, not records), and then the customer comes and asks you to add "timestamping" to 80% of your tables.
I believe this can be accomplished by using a separate table (TIMESTAMPS). This table would have, in addition to the obvious timestamp column, the table name and the primary key for the table being updated. (I'm assuming here that you use an int as primary key for most of your tables, but the table name would most likely have to be a string).
To picture this suppose this basic scenario. We would have two tables:
PAYMENT :- (your usual records)
TIMESTAMP :- {current timestamp} + {TABLE_UPDATED, id_of_entry_updated, timestamp_type}
Note that in this design you don't need those two "extra" columns in your native payment object (which, by the way, might make it thru your ORM solution) because you are now indexing by TABLE_UPDATED and id_of_entry_updated. In addition, timestamp_type will tell you if the entry is for insertion (e.g "1"), update (e.g "2"), and anything else you may want to add, like "deletion".
I would like to know what do you think about this design. I'm most interested in best practices, what works and scales over time. References, links, blog entries are more than welcome. I know of at least one patent (pending) that tries to address this problem, but it seems details are not public at this time.
Cheers,
Eduardo

While you're at it, also record the user who made the change.
The flaw with the separate-table design (in addition to the join performance highlighted by others) is that it makes the assumption that every table has an identity column for the key. That's not always true.
If you use SQL Server, the new 2008 version supports something they call Change Data Capture that should take away a lot of the pain you're talking about. I think Oracle may have something similar as well.
Update: Apparently Oracle calls it the same thing as SQL Server. Or rather, SQL Server calls it the same thing as Oracle, since Oracle's implementation came first ;)
http://www.oracle.com/technology/oramag/oracle/03-nov/o63tech_bi.html

I have used a design where each table to be audited had two tables:
create table NAME (
name_id int,
first_name varchar
last_name varchar
-- any other table/column constraints
)
create table NAME_AUDIT (
name_audit_id int
name_id int
first_name varchar
last_name varchar
update_type char(1) -- 'U', 'D', 'C'
update_date datetime
-- no table constraints really, outside of name_audit_id as PK
)
A database trigger is created that populates NAME_AUDIT everytime anything is done to NAME. This way you have a record of every single change made to the table, and when. The application has no real knowledge of this, since it is maintained by a database trigger.
It works reasonably well and doesn't require any changes to application code to implement.

I think I prefer adding the timestamps to the individual tables. Joining on your timestamp table on a composite key -- one of which is a string -- is going to be slower and if you have a large amount of data it will eventually be a real problem.
Also, a lot of the time when you are looking at timestamps, it's when you're debugging a problem in your application and you'll want the data right there, rather than always having to join against the other table.

One nightmare with your design is that every single insert, update or delete would have to hit that table. This can cause major performance and locking issues. It is a bad idea to generalize a table like that (not just for timestamps). It would also be a nightmare to get the data out of.
If your code would break at the GUI level from adding fields you don't want the user to see, you are incorrectly writing the code to your GUI which should specify only the minimum number of columns you need and never select *.

The advantage of the method you suggest is that it gives you the option of adding other fields to your TIMESTAMP table, like tracking the user who made the change. You can also track edits to sensitive fields, for example who repriced this contract?
Logging record changes in a separate file means you can show multiple changes to a record, like:
mm/dd/yy hh:mm:ss Added by XXX
mm/dd/yy hh:mm:ss Field PRICE Changed by XXX,
mm/dd/yy hh:mm:ss Record deleted by XXX
One disadvantage is the extra code the will perform inserts into your TIMESTAMPS table to reflect changes in your main tables.

If you set up the time-stamp stuff to run off of triggers, than any action that can set off a trigger (Reads?) can be logged. Also there might be some locking advantages.
(Take all that with a grain of salt, I'm no DBA or SQL guru)

Yes, I like that design, and use it with some systems. Usually, some variant of:
LogID int
Action varchar(1) -- ADDED (A)/UPDATED (U)/DELETED (D)
UserID varchar(20) -- UserID of culprit :)
Timestamp datetime -- Date/Time
TableName varchar(50) -- Table Name or Stored Procedure ran
UniqueID int -- Unique ID of record acted upon
Notes varchar(1000) -- Other notes Stored Procedure or Application may provide

I think the extra joins you will have to perform to get the Timestamps will be a slight performance hit and a pain the neck. Other than that I see no problem.

We did exactly what you did. It is great for the object model and the ability to add new stamps and differant types of stamps to our model with minimal code. We were also tracking the user that made the change, and a lot of our logic was heavily based on these stamps. It woked very well.
One drawback is reporting, and/or showing a lot of differant stamps on on screen. If you are doing it the way we did it, it caused a lot of joins. Also,back ending changes was a pain.

Our solution is to maintain a "Transaction" table, in addition to our "Session" table. UPDATE, INSERT and DELETE instructions are all managed through a "Transaction" object and each of these SQL instruction is stored in the "Transaction" table once it has been successfully executed on the database. This "Transaction" table has other fields such as transactiontType (I for INSERT, D for DELETE, U for UPDATE), transactionDateTime, etc, and a foreign key "sessionId", telling us finally who sent the instruction. It is even possible, through some code, to identify who did what and when (Gus created the record on monday, Tim changed the Unit Price on tuesday, Liz added an extra discount on thursday, etc).
Pros for this solution are:
you're able to tell "what who and when", and to show it to your users! (you'll need some code to analyse SQL statements)
if your data is replicated, and replication fails, you can rebuild your database through this table
Cons are
100 000 data updates per month mean 100 000 records in Tbl_Transaction
Finally, this table tends to be 99% of your database volume
Our choice: all records older than 90 days are automatically deleted every morning

Philippe,
Don't simply delete those older than 90 days, move them first to a separate DB or write them to text file, do something to preserve them, just move them out of the main production DB.
If ever comes down to it, most often it is a case of "he with the most documentation wins"!

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight