I am working on a project where I have the following scenario:
Sales-rep A may deal with customers x, y and z for products 1, 2 and 3
Sales-rep B may deal with customers w, y and z for products 1, 5 and 6
To handle the above, I have the following table structures:
SalesReps: a table with the detail of each sales rep uniquely identified by SRID
Products: a table with all the products with unique key PRID
Customers: a table with customer details with unique key CID
Finally I link them with a table
RepProductCustomer: with columns SRID, PRID and CID
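In SQL terms the schema is roughly as follows (just a sketch, assuming integer keys; the real tables have more columns):

CREATE TABLE SalesReps (SRID INT PRIMARY KEY /* , rep details ... */);
CREATE TABLE Products  (PRID INT PRIMARY KEY /* , product details ... */);
CREATE TABLE Customers (CID  INT PRIMARY KEY /* , customer details ... */);

-- One row per (sales rep, product, customer) combination the rep may deal with
CREATE TABLE RepProductCustomer (
    SRID INT NOT NULL REFERENCES SalesReps (SRID),
    PRID INT NOT NULL REFERENCES Products  (PRID),
    CID  INT NOT NULL REFERENCES Customers (CID),
    PRIMARY KEY (SRID, PRID, CID)
);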
It all works fine generally, but when dealing with a company that has a lot of products, customers and sales reps, the number of rows gets really large and takes a very long time to add using Entity Framework.
What's the most efficient way to add these entries? Can I use some sort of stored procedure or SQL command to speed this up? I have tried various EF optimisations, but there is just too much data to be inserted on some occasions and the user has to wait for a very long time for the request to complete.
Any help would be appreciated.
Unfortunately, no EF optimization you try will work, since the real issue in your scenario is the number of database round trips performed (one per record), which makes the SaveChanges method very slow.
Disclaimer: I'm the owner of the project Entity Framework Extensions
BulkSaveChanges allows saving using bulk operations, which reduces database round trips.
All associations and inheritance are supported.
// Upgrade SaveChanges performance with BulkSaveChanges
var context = new CustomerContext();
// ... context code ...
// Easy to use
context.BulkSaveChanges();
// Easy to customize
context.BulkSaveChanges(operation => operation.BatchSize = 1000);
This is a theoretical question prompted by a request that has come my way recently. I support a master operational data store (ODS) which maintains a set of data tables (with the master data), along with a set of lookup tables (which contain a list of reference codes and their descriptions). There has recently been a push from the downstream applications to unite the two structures (data and lookup values) logically in the presentation layer, so that it is easier for them to find out if there have been updates in the overall data.
While the request is understandable, my first thought is that it should be implemented at the interface level rather than at the source. Combining the two tables logically (on last_update_date) at the ODS level is essentially a de-normalization of the data and seems contrary to the idea of keeping lookups and data separate.
That said, I cannot think of any reason why it should not be done at the ODS level, apart from the fact that it does not "seem" right. Does anyone have any thoughts on why such an approach should or should not be followed?
I am listing an example here for simplicity's sake.
Data table
ID Name Emp_typ_cd Last_update_date
1 X E1 2014-08-01
2 Y E2 2014-08-01
Code table
Emp_typ_cd Emp_typ_desc Last_Update_date
E1 Employee_1 2014-08-23
E2 Employee_2 2013-09-01
The downstream request is to represent the data as
Data view
ID Name Emp_typ_cd Last_update_date
1 X E1 2014-08-23
2 Y E2 2014-08-01
or
Data view
ID Name Emp_typ_cd Emp_typ_desc Last_update_date
1 X E1 Employee_1 2014-08-23
2 Y E2 Employee_2 2014-08-01
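(For concreteness, the second shape could be produced as a plain view rather than a physical change; the table names below just follow the example, and GREATEST assumes a dialect such as Oracle, MySQL, Postgres or SQL Server 2022+.)

CREATE VIEW data_view AS
SELECT d.ID,
       d.Name,
       d.Emp_typ_cd,
       c.Emp_typ_desc,
       GREATEST(d.Last_update_date, c.Last_update_date) AS Last_update_date
FROM   data_table d
JOIN   code_table c ON c.Emp_typ_cd = d.Emp_typ_cd;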
You are correct, it is denormalizing the database because someone wants to see the data in a certain way. The side effects, as you know, are that you are duplicating data, reducing flexibility, increasing table size, storing dissimilar objects together, etc. You are also correct that their problem should be solved somewhere or somehow else. They won't get what they want if they change the database the way they want to change it. If they want to make it "easier for them to find out if there have been updates in the overall data" but they duplicate massive amounts of it, they're just opening themselves up to errors. In your example, the Emp_typ_cd update date must be pushed onto all employees with that emp type code. A good update statement will do that, but still, instead of updating a single row in the lookup table you're updating every single employee that has the emp type.
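For illustration, that update would look something like this (Postgres-style UPDATE ... FROM; table names follow the example). The point is that a single lookup change now has to touch every matching data row:

-- Lookup E1 changed: push its new last_update_date onto every employee row that uses it
UPDATE data_table AS d
SET    last_update_date = c.last_update_date
FROM   code_table AS c
WHERE  c.emp_typ_cd = d.emp_typ_cd
  AND  c.emp_typ_cd = 'E1';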
We use lookup tables all the time. We can add a new value to a lookup table, add employees to the database with a fk to that new attribute, and any report that joins on that table now has the ID, Value, Sort Order, etc. Let’s say we add ‘Veteran’ to the lu_Work_Experience. We add an employee with the veteran fk_Id and now any existing query that joins on lu_Work_Experience has that value. They sort Work Experience alphabetically or by our pre-defined sort.
There is a valid reason for flattening your data structure though, and that is speed. If you're running a very large report, it will be faster with no joins (and good indexing). If the business knows it's going to run a very large report many times and is worried about end-user wait times, then it is a good idea to build a single table for that one report. We do it all the time for calculated measures. If we know that a certain analytic report will have a ton of aggregation and joins, we pre-aggregate the data into the data store. That being said, we don't do that very often in SQL because we use cubes for analytics.
So why use lookup tables in a database? Logical separation of data: an employee has an employee code, but it does NOT have a date of when that employee code was updated. Reduced duplicate data. Minimized design complexity. And to avoid building a table for a specific report and then having to build a different table for a different report, even if it has similar data.
Anyway, the rest of my argument would be comprised of facts from the Database Normalization wikipedia page so I’ll skip it.
I am going to create a new project, where I need users to be able to view their friends' activities and actions, just like Facebook and LinkedIn.
Each user is allowed to do 5 different types of activities, and each activity has different attributes; for example, activity X can be public/private while activity Y will be assigned to categories. Some actions involve 1 user, others 2 or 3, etc. Eventually I have to aggregate all these 5 different types of activities on the news feed page.
How can I design a database that is efficient?
I have 3 designs in mind, please let me know your thoughts. Any new ideas will be greatly appreciated!
1- Separate tables: since there are roughly 3-4 different columns for each activity, it would be logical to separate each activity into its own table.
Pros: Clean database, and easy to develop.
Cons: I will need to query the database 5 times and aggregate the results to make a single newsfeed page.
2- One big table: this table will hold all activities, with many unused columns. A new numeric column called "type" will indicate the type of activity. Some attributes could be combined into an HStore field (since we are using Postgres); others will be queried a lot, so I don't think it is a good idea to include them in an HStore field.
Pros: Easy to pull the newsfeed.
Cons: Lots of reads/writes on the same table, and the code will be a bit messier, as will the database.
3- Hybrid: A solution would be to make one table containing all the newsfeed, with a polymorphic association to other tables that contain details of each specific activity.
Pros: Tidy code and database, easy to add new activities.
Cons: JOIN ALL THE TABLES to make a single newsfeed! Still better than making 5 different queries.
As I am writing this post I am starting to lean towards solution number 2 (a rough sketch follows below). Please advise!
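Roughly, option 2 would look something like this in Postgres (a sketch only; table and column names are made up):

CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE activities (
    activity_id   bigserial   PRIMARY KEY,
    user_id       bigint      NOT NULL,
    activity_type smallint    NOT NULL,            -- 1..5, one value per kind of activity
    is_public     boolean,                         -- only used by some types
    category_id   bigint,                          -- only used by some types
    extra         hstore,                          -- rarely-queried, type-specific attributes
    created_at    timestamptz NOT NULL DEFAULT now()
);

-- The newsfeed then becomes a single query, e.g.
-- SELECT * FROM activities
-- WHERE user_id IN (/* friend ids */)
-- ORDER BY created_at DESC
-- LIMIT 50;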
Thanks
I would consider a graph database for this, such as Neo4j. It allows very flexible attributes on either nodes (users) or relationships (types of relations).
For small sets and few joins, SQL databases are faster and more appropriate. But if your starting point is a 5-table join, a graph database seems simpler and offers similar performance (if not better).
I am building a price comparison site that holds about 300,000 products and several hundred clients.
On a daily basis the site needs updating of prices and vendor stock availability.
When a vendor needs updating I was thinking about deleting all the vendor's information and then pulling and inserting it anew each time.
Doing this I don't have to worry about a vendor deleting a product. In a simple way I would get a new set of data each day.
On the other hand I need to keep the autoincrement counter in check and it seems a waste to delete everything from a vendor if he has only updated prices on 3 products in his entire warehouse.
But updating has its disadvantages too. I still have to read out all the data and compare each and every product price/availability to the data from the vendor, one product at a time. Also it's more complicated to notice products that get deleted by the vendor.
How do I achieve updating in the most efficient and easy way?
I am not sure what database technology you are using. In any case, a database supports handling of relationships between multiple tables; it is up to you to implement the concepts. For example, you can use the CASCADE option when creating a table A that references table B (by primary key, for example) to ensure that an update/delete of tuples in table B cascades to the corresponding tuples in table A.
See this example of CASCADE and when to use it.
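As a rough illustration (generic SQL; the table and column names are invented for the example):

CREATE TABLE vendors (
    vendor_id INT PRIMARY KEY,
    name      VARCHAR(200) NOT NULL
);

CREATE TABLE products (
    product_id INT PRIMARY KEY,
    vendor_id  INT NOT NULL,
    price      DECIMAL(10,2),
    in_stock   BOOLEAN,
    FOREIGN KEY (vendor_id)
        REFERENCES vendors (vendor_id)
        ON DELETE CASCADE          -- deleting a vendor removes its products automatically
        ON UPDATE CASCADE          -- key changes propagate to the referencing rows
);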
I am looking for a pattern, framework or best practice to handle a generic problem of application-level data synchronisation.
Let's take an example with only 1 table to make it easier.
I have an unreliable datasource for a product catalog. Data can occasionally be unavailable, incomplete or inconsistent (the issue might come from manual data entry errors, ETL failures, ...).
I have a live copy in a MySQL table in use by a live system, let's say a website.
I need to implement a safety mechanism when updating the MySQL table to "synchronize" with the original data source. Here are the safety criteria and the solutions I am suggesting:
avoid deleting records when they temporarily disappear from the datasource => use a "deleted" boolean/date column or an archive/history table.
check for inconsistent changes => configure rules per column, such as: should never change, should only increment, ...
check for integrity issue => (standard problem, no point discussing approach)
ability to roll back the last sync => restore from a history table? use a version increment/date column?
What I am looking for is best practice and a pattern/tool to handle such a problem. Even if you are not pointing to THE solution, I would be grateful for any keyword suggestions that would help me narrow down which field of expertise to explore.
We have the same problem importing data from web analytics providers - they suffer the same problems as your catalog. This is what we did:
Every import/sync is assigned a unique id (auto_increment int64)
Every table has a history table that is identical to the original but has an additional column "superseded_id", which gets the import id of the import that changed the row (deletion is a change); the primary key is (row_id, superseded_id)
Every UPDATE copies the row to the history table before changing it
Every DELETE moves the row to the history table
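A minimal sketch of such a table pair (assuming MySQL and a products-style main table; names are illustrative):

CREATE TABLE imports (
    import_id  BIGINT AUTO_INCREMENT PRIMARY KEY,   -- one row per import/sync run
    started_at DATETIME NOT NULL
);

CREATE TABLE main_table (
    row_id BIGINT PRIMARY KEY,
    price  DECIMAL(10,2),
    stock  INT
);

-- Identical to main_table plus superseded_id; primary key is (row_id, superseded_id)
CREATE TABLE history_table (
    row_id        BIGINT NOT NULL,
    price         DECIMAL(10,2),
    stock         INT,
    superseded_id BIGINT NOT NULL,                  -- the import that changed (or deleted) this version
    PRIMARY KEY (row_id, superseded_id)
);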
This makes rollback very easy:
Find out the import_id of the bad import
REPLACE INTO main_table SELECT <everything but superseded_id> FROM history_table WHERE superseded_id=<bad import id>
DELETE FROM history_table WHERE superseded_id>=<bad import id>
For databases where performance is a problem, we do this in a secondary database on a different server, then copy the found-to-be-good main table to the production database into a new table main_table_$id, with $id being the highest import id, and have main_table be a trivial view: SELECT * FROM main_table_$someid. Now, by redefining the view to SELECT * FROM main_table_$newid, we can atomically switch the table.
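The final swap can be as simple as redefining the view (MySQL syntax; the id is filled in by the import job):

-- 1234 = id of the import that was verified as good
CREATE OR REPLACE VIEW main_table AS
SELECT * FROM main_table_1234;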
I'm not aware of a single solution to all this - probably because each project is so different. However, here are two techniques I've used in the past:
Embed the concept of version and validity into your data model
This is a way to deal with change over time without having to resort to history tables; it does complicate your queries, so you should use it sparingly.
For instance, instead of having a product table as follows
PRODUCTS
Product_ID primary key
Price
Description
AvailableFlag
In this model, if you want to delete a product, you execute "delete from products where product_id = ..."; modifying the price would be "update products set price = 1 where product_id = ...".
With the versioned model, you have:
PRODUCTS
product_ID primary key
valid_from datetime
valid_until datetime
deleted_flag
Price
Description
AvailableFlag
In this model, deleting a product requires you to run "update products set valid_until = getdate() where product_id = xxx and valid_until is null", and then insert a new row with deleted_flag = true.
Changing price works the same way.
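A sketch of such a price change, using the SQL Server getdate() from the prose above (note that in practice the key has to include valid_from or a surrogate id, since there are now several rows per product):

-- 1. Close the currently valid row for product 42
UPDATE products
SET    valid_until = getdate()
WHERE  product_id = 42
  AND  valid_until IS NULL;

-- 2. Insert the new version carrying the new price
INSERT INTO products (product_id, valid_from, valid_until, deleted_flag, price, description, availableflag)
VALUES (42, getdate(), NULL, 0, 9.99, 'Some product', 1);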
This means that you can run queries against your "dirty" data and insert it into this table without worrying about deleting items that were accidentally missed off the import. It also allows you to see the evolution of the record over time, and roll-back easily.
Use a ledger-like mechanism for cumulative values
Where you have things like "number of products in stock", it helps to create transactions to modify the amount, rather than take the current amount from your data feed.
For instance, instead of having an amount_in_stock column on your products table, have a "product_stock_transactions" table:
product_stock_transactions
product_id FK transaction_date transaction_quantity transaction_source
1 1 Jan 2012 100 product_feed
1 2 Jan 2012 -3 stock_adjust_feed
1 3 Jan 2012 10 product_feed
On 2 Jan, the quantity in stock was 97; on 3 Jan, 107.
This design allows you to keep track of adjustments and their source, and is easier to manage when moving data from multiple sources.
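A sketch of that table and the "current stock" query it implies (names follow the example above):

CREATE TABLE product_stock_transactions (
    product_id           INT  NOT NULL,          -- FK to the products table
    transaction_date     DATE NOT NULL,
    transaction_quantity INT  NOT NULL,          -- positive = stock in, negative = stock out/adjustment
    transaction_source   VARCHAR(50) NOT NULL
);

-- Stock on hand for product 1 as of 2 Jan 2012 (100 - 3 = 97)
SELECT SUM(transaction_quantity) AS quantity_in_stock
FROM   product_stock_transactions
WHERE  product_id = 1
  AND  transaction_date <= '2012-01-02';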
Both approaches can create large amounts of data - depending on the number of imports and the amount of data - and can lead to complex queries to retrieve relatively simple data sets.
It's hard to plan for performance concerns up front - I've seen both "history" and "ledger" work with large amounts of data. However, as Eugen says in his comment below, if you get to an excessively large ledger, it may be necessary to clean up the ledger table by summarizing the current levels and deleting (or archiving) old records.
What is the best way to store settings for certain objects in my database?
Method one: Using a single table
Table: Company {CompanyID, CompanyName, AutoEmail, AutoEmailAddress, AutoPrint, AutoPrintPrinter}
Method two: Using two tables
Table Company {CompanyID, CompanyName}
Table2 CompanySettings {CompanyID, AutoEmail, AutoEmailAddress, AutoPrint, AutoPrintPrinter}
I would take things a step further...
Table 1 - Company
CompanyID (int)
CompanyName (string)
Example
CompanyID 1
CompanyName "Swift Point"
Table 2 - Contact Types
ContactTypeID (int)
ContactType (string)
Example
ContactTypeID 1
ContactType "AutoEmail"
Table 3 Company Contact
CompanyID (int)
ContactTypeID (int)
Addressing (string)
Example
CompanyID 1
ContactTypeID 1
Addressing "name#address.blah"
This solution gives you extensibility as you won't need to add columns to cope with new contact types in the future.
SELECT
[company].CompanyID,
[company].CompanyName,
[contacttype].ContactTypeID,
[contacttype].ContactType,
[companycontact].Addressing
FROM
[company]
INNER JOIN
[companycontact] ON [companycontact].CompanyID = [company].CompanyID
INNER JOIN
[contacttype] ON [contacttype].ContactTypeID = [companycontact].ContactTypeID
This would give you multiple rows for each company. A row for "AutoEmail" a row for "AutoPrint" and maybe in the future a row for "ManualEmail", "AutoFax" or even "AutoTeleport".
Response to HLEM.
Yes, this is indeed the EAV model. It is useful where you want to have an extensible list of attributes with similar data. In this case, varying methods of contact with a string that represents the "address" of the contact.
If you didn't want to use the EAV model, you should next consider relational tables, rather than storing the data in flat tables. This is because this data will almost certainly extend.
Neither the EAV model nor the relational model significantly slows queries. Joins are actually very fast compared with (for example) a sort. Returning a record for a company with all of its associated contact types, or indeed a specific contact type, would be very fast. I am working on a financial MS SQL database with millions of rows and similar data models and have no problem returning significant amounts of data in sub-second timings.
In terms of complexity, this isn't the most technical design in terms of database modelling and the concept of joining tables is most definitely below what I would consider to be "intermediate" level database development.
I would consider whether you need one or two tables based on the following criteria:
First, are you close to the record storage (row size) limit? Then two tables, definitely.
Second, will you usually be querying the information you plan to put in the second table most of the time you query the first table? Then one table might make more sense. If you usually do not need the extended information, a separate (and less wide) table should improve performance on the main data queries.
Third, how strong a possibility is it that you will ever need multiple values? If it is one-to-one now, but something like an email address or phone number that has a strong possibility of morphing into multiple rows, go ahead and make it a related table. If you know there is no chance or only a small chance, then it is OK to keep it in one table, assuming the table isn't too wide.
EAV tables look like they are nice and will save future work, but in reality they don't. Generally, if you need to add another type, you need to do future work to adjust queries, etc. Writing a script to add a column takes all of five minutes; the other work will need to happen regardless of the structure. EAV tables are also very hard to query when you don't know how many records you will need to pull, because normally you want them on one line and will get the information by joining to the same table multiple times (see the sketch below). This causes performance problems and locking, especially if this table is central to your design. Don't use this method.
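To illustrate the multiple-join point, flattening a hypothetical EAV settings table back onto one row per company looks something like this (the CompanySetting table and its columns are invented for the example):

SELECT  c.CompanyID,
        c.CompanyName,
        em.SettingValue AS AutoEmailAddress,
        pr.SettingValue AS AutoPrintPrinter
FROM    Company c
LEFT JOIN CompanySetting em
       ON em.CompanyID = c.CompanyID AND em.SettingName = 'AutoEmailAddress'
LEFT JOIN CompanySetting pr
       ON pr.CompanyID = c.CompanyID AND pr.SettingName = 'AutoPrintPrinter';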
It depends whether you will ever need more information about a company. If you notice yourself adding fields like companyphonenumber1, companyphonenumber2, etc., then method 2 is better, as you would separate your entities and just reference a company ID. If you do not plan to make these changes and you feel that this table will never change, then method 1 is fine.
Usually, if you don't have data duplication then a single table is fine.
In your case you don't, so the first method is OK.
I use one table if I estimate the data from the "second" table will be used in more than 50% of my queries. Use two tables if I need multiple copies of the data (i.e. multiple phone numbers, email addresses, etc)