This is a bit complex query which has multiple joins and reruns a lot of records with several data fields. Let’s say it basically use to retrieve manager details.
First set of tables (already implemented query):
Select m.name, d.name, d.address, m.salary , m.age,……
From manager m,department d,…..etc
JOINS …..
Assume, a one manger can have zero or more employees.
Let’s say I need to list down all employee names for each and every manager for result of first set of tables with managers who has no employees (which means want to keep the manager list of first set of tables as it is).
Then I have to access “employee” table through “party” tables (might be involved few more tables).
Second set of tables (to be newly connected):
That means there are one or more join with “employee” , “party” and …..etc
I have two approaches on this.
Make left outer join with first set of tables to second set of
tables.
Create a user define function (UDF) in DB level for second set of
tables. Then I have to insert manger id in to this UDF as a
parameter and take all the employees (e1,e2,…) as a formatted string
by calling through the select clause in the first set of tables
Please can someone suggest me the best solution in DB performance wise out of these two options?
Go for the JOIN, using appropriate WHERE clauses and indexes.
The database engine is far better at optimizing that you'll ever be. Let it do its job.
Your way sounds like (n+1) query death.
Write a sample query and ask your database to EXPLAIN PLAN to see what the cost is. If you spot a TABLE
Related
Say I want to create a typical todo-webApp using a db like postgresql. A user should be able to create todo-lists. On this lists he should be able to make the actual todo-entries.
I regard the todo-list as an object which has different properties like owner, name, etc, and of course the actual todo-entries which have their own properties like content, priority, date ... .
My idea was to create a table for all the todo-lists of all the users. In this table I would store all the attributes of each list. But the questions which arises is how to store the todo-entries themselves? Of course in an additional table, but should I rather:
1. Create one big table for all the entries and have a field storing the id of the todo-list they belong to, like so:
todo-list: id, owner, ...
todo-entries: list.id, content, ...
which would give 2 tables in total. The todo-entries table could get very large. Although we know that entries expire, hence the table only grows with more usage but not over time. Then we would write something like SELECT * FROM todo-entries WHERE todo-list-id=id where id is the of the list we are trying to retrieve.
OR
2. Create a todo-entries table on a per user basis.
todo-list: id, owner, ...
todo-entries-owner: list.id, content,. ..
Number of entries table depends on number of users in the system. Something like SELECT * FROM todo-entries-owner. Mid-sized tables depending on the number of entries users do in total.
OR
3. Create one todo-entries-table for each todo-list and then store a generated table name in a field for the table. For instance could we use the todos-list unique id in the table name like:
todo-list: id, owner, entries-list-name, ...
todo-entries-id: content, ... //the id part is the id from the todo-list id field.
In the third case we could potentially have quite a large number of tables. A user might create many 'short' todo-lists. To retrieve the list we would then simply go along the lines SELECT * FROM todo-entries-id where todo-entries-id should be either a field in the todo-list or it could be done implicitly by concatenating 'todo-entries' with the todos-list unique id. Btw.: How do I do that, should this be done in js or can it be done in PostgreSQL directly? And very related to this: in the SELECT * FROM <tablename> statement, is it possible to have the value of some field of some other table as <tablename>? Like SELECT * FROM todo-list(id).entries-list-name or so.
The three possibilities go from few large to many small tables. My personal feeling is that the second or third solutions are better. I think they might scale better. But I'm not sure quite sure of that and I would like to know what the 'typical' approach is.
I could go more in depth of what I think of each of the approaches, but to get to the point of my question:
Which of the three possibilities should I go for? (or anything else, has this to do with normalization?)
Follow up:
What would the (PostgreSQL) statements then look like?
The only viable option is the first. It is far easier to manage and will very likely be faster than the other options.
Image you have 1 million users, with an average of 3 to-do lists each, with an average of 5 entries per list.
Scenario 1
In the first scenario you have three tables:
todo_users: 1 million records
todo_lists: 3 million records
todo_entries: 15 million records
Such table sizes are no problem for PostgreSQL and with the right indexes you will be able to retrieve any data in less than a second (meaning just simple queries; if your queries become more complex (like: get me the todo_entries for the longest todo_list of the top 15% of todo_users that have made less than 3 todo_lists in the 3-month period with the highest todo_entries entered) it will obviously be slower (as in the other scenarios). The queries are very straightforward:
-- Find user data based on username entered in the web site
-- An index on 'username' is essential here
SELECT * FROM todo_users WHERE username = ?;
-- Find to-do lists from a user whose userid has been retrieved with previous query
SELECT * FROM todo_lists WHERE userid = ?;
-- Find entries for a to-do list based on its todoid
SELECT * FROM todo_entries WHERE listid = ?;
You can also combine the three queries into one:
SELECT u.*, l.*, e.* -- or select appropriate columns from the three tables
FROM todo_users u
LEFT JOIN todo_lists l ON l.userid = u.id
LEFT JOIN todo_entries e ON e.listid = l.id
WHERE u.username = ?;
Use of the LEFT JOINs means that you will also get data for users without lists or lists without entries (but column values will be NULL).
Inserting, updating and deleting records can be done with very similar statements and similarly fast.
PostgreSQL stores data on "pages" (typically 4kB in size) and most pages will be filled, which is a good thing because reading a writing a page are very slow compared to other operations.
Scenario 2
In this scenario you need only two tables per user (todo_lists and todo_entries) but you need some mechanism to identify which tables to query.
1 million todo_lists tables with a few records each
1 million todo_entries tables with a few dozen records each
The only practical solution to that is to construct the full table names from a "basename" related to the username or some other persistent authentication data from your web site. So something like this:
username = 'Jerry';
todo_list = username + '_lists';
todo_entries = username + '_entries';
And then you query with those table names. More likely you will need a todo_users table anyway to store personal data, usernames and passwords of your 1 million users.
In most cases the tables will be very small and PostgreSQL will not use any indexes (nor does it have to). It will have more trouble finding the appropriate tables, though, and you will most likely build your queries in code and then feed them to PostgreSQL, meaning that it cannot optimize a query plan. A bigger problem is creating the tables for new users (todo_list and todo_entries) or deleting obsolete lists or users. This typically requires behind-the scenes housekeeping that you avoid with the previous scenario. And the biggest performance penalty will be that most pages have only little content so you waste disk space and lots of time reading and writing those partially filled pages.
Scenario 3
This scenario is even worse that scenario 2. Don't do it, it's madness.
3 million tables todo_entries with a few records each
So...
Stick with option 1. It is your only real option.
I have a StockinHand view generated from stock_Outward & Stock_Inward tables right now needs the sorting based on frequency i.e most moving stock items should be on top of the table
My tables are like below:
tbl_StockInward:
ID, Stock_Code,Units,Rate, Description, Vendor, DateOfPurchase, DateOfUpdate, Purchased_By, WareHouse, Remarks,
vice versa tbl_StockOutward
Please help me
Thanks in advance
Just like in sub queries, you can't use ORDER BY in a view definition in sql server unless you also use TOP.
The reason for this is that Views are acted upon as if they where tables, and tables in sql server (in fact, in any relational database) are considered as not ordered sets.
Just like there is no meaning to the order of records stored in a table, there is also no meaning to the order of records fetched by a view.
You can use a dirty hack and write SELECT TOP 100 PERCENT ... and then use ORDER BY, but I doubt if it has any meaning at all.
Having said all that, you can of course use ORDER BY in any query that selects from a view.
This should be a simple Yes/No answer so here goes.
If I set a 3 tables, 2 typical recordsets and 1 that joins them by the id of the 2 tables, do I need the id from both tables in order to have an entry in the join table?
The scenario is a Jobs table and a Parts table linked by JobsParts table. But some parts are not in the Parts table, they are just freetext entries (so as to avoid stock control issues) belonging to a Job.
Hope this is enough to explain my question.
Thanks
BTW using CakePHP 2.0
For database sanity, I'd say the join table 'jobs_parts' should have both IDs.
If you try entering free-form parts into the join table, you're not only going to increase the size of the join table, but you've effectively lost the ability to grow/expand - ie. what if you want to add a few more fields to this unknown part? Or what if it turns out to be a part that you actually want in your normal parts table.... it just gets confusing.
There are other options for dealing with free-form parts vs actual parts...
have a field in the parts table that's a tinyint(1) for whether or not it's a verified part
OR make an UnknownParts model/table
In my opinion, go with what makes logical sense for ease of understanding and for future updates to your database/website...etc. And IMO, adding a freeform part into the join table would not fit that bill.
I have a complex view that I use to pull a list of primary keys that indicate rows in a table that have been modified between two temporal points.
This view has to query 13 related tables and look at a changelog table to determine if a entity is "dirty" or not.
Even with all of this going on, doing a simple query:
select * from vwDirtyEntities;
Takes only 2 seconds.
However, if I change it to
select
e.Name
from
Entities e
inner join vwDirtyEntities de
on e.Entity_ID = de.Entity_ID
This takes 1.5 minutes.
However, if I do this:
declare #dirtyEntities table
(
Entity_id uniqueidentifier;
)
insert into #dirtyEntities
select * from vwDirtyEntities;
select
e.Name
from
Entities e
inner join #dirtyEntities de
on e.Entity_ID = de.Entity_ID
I get the same results in only 2 seconds.
This leads me to believe that SQLServer is evaluating the view per row when joined to Entities, instead of constructing a query plan that involves joining the single inner join above to the other joins in the view.
Note that I want to join against the full result set from this view, as it filters out only the keys I want internally.
I know I could make it into a materialized view, but this would involve schema binding the view and it's dependencies and I don't like the overhead maintaining the index would cause (This view is only queried for exports, while there are far more writes to the underlying tables).
So, aside from using a table variable to cache the view results, is there any way to tell SQL Server to cache the view while evaluating the join? I tried changing the join order (Select from the view and join against Entities), however that did not make any difference.
The view itself is also very efficient, and there is no room to optimize there.
There is nothing magical about a view. It's a macro that expands. The optimiser decides when JOINed to expand the view into the main query.
I'll address other points in your post:
you have ruled out an indexed view. A view can only be a discrete entity when it is indexed
SQL Server will never do a RBAR query on it's own. Only developers can write loops.
there is no concept of caching: every query uses latest data unless you use temp tables
you insist on using the view which you've decided is very efficient. But have no idea how views are treated by the optimizer and it has 13 tables
SQL is declarative: join order usually does not matter
Many serious DB developer don't use views because of limitations like this: they are not reusable because they are macros
Edit, another possibility. Predicate pushing on SQL Server 2005. That is, SQL Server can not push the JOIN condition "deeper" into the view.
Lately we have been testing QlikView in the office. The first impression is good: it has an attractive interface and performs very fast. We want to use it as a database frontend for our customers. We are also trying to determine whether it can take over parts of our relational database structure. However, we are in doubt whether its database functions are advanced enough to be more than an attractive frontend.
Specifically, we run into the following problem. The equivalent of normal JOIN (equijoin) operations can be done in QlikView simply by setting equal field names across tables - those fields will then be linked. However, one of our traditional SQL JOIN operations uses a "BETWEEN" query to find out whether a date is in a certain range and join the data on that.
Is it possible to specify such a "non-equijoin" relationship between tables in QlikView? Or is this an inherent limitation to the so-called "associative database" structure?
Marcus' answer is correct. The way to do this is to use IntervalMatch. You can have the two tables as they are and add a "between" relationship between, using IntervalMatch. You can't add relationships after the load script has run.
First you'll have to load the table that has the date range (sql queries omitted). Let's say:
Ranges:
LOAD
rangeID,
validfrom, // date
validto, // date
commonkey, // common key for the two tables
price; // the data that's needed as a result of the linking
Second, you load another table with the date
Data:
LOAD
column1,
column2,
date,
commonkey;
Next you will have to use the IntervalMatch. This is one way to do it:
Left Join (Data)
IntervalMatch(date, commonkey)
LOAD
validfrom,
validto,
commonkey
Resident Ranges;
Now you have the link between the two tables. You can delete the resulting synthetic key by adding this:
Left Join (Data)
LOAD
validfrom,
validto,
commonkey,
rangeID
Resident Ranges;
DROP Fields validfrom, validto FROM Data;
Now the tables are linked by using the rangeID key. If the tables don't have some key in common, like a category id or something, (ie. just the dates need to be matched), you can just ignore the commonkey in the example above. I just wanted to include it in the example because I needed it in my own case and hopefully it will help someone with a similar issue.
You can find this in the Qlikview help labeled "IntervalMatch (extended)". The Qlikview cookbook (fillrowsintervalmatch.qvw) also helped me with this issue.
Sure can - I think what you want it the IntervalMatch function.