Database setup
I have a PostgreSQL database with two tables: posts, and comments made on those posts.
These tables have columns,
posts:
postid, body
comments:
commentid, postid*, body
(obviously comments.postid references posts.postid)
Problem
I want my server to fetch from the database a post, and the first 10 comments on that post, given a POSTID.
There are two ways of calling the database
1) using one large call
The call would look something like
select * from posts
where postid = POSTID
join
select * from comments
where comments have POSTID
Cons:
There is redundant data, since the column postid will be returned for every entry. We want it returned only once.
Also, I think doing a single call means the database has to run a sequence of events (not in parallel, which is slower)
2) using two smaller calls
The calls would be
select * from posts
where postid = POSTID
and then
select * from comments
where postid = POSTID
Cons: Doing many calls means many back and forths between server and database.
Which of these approaches is best practice?
I am asking about a toy model, but the answer applies if you have replies inside of comments, or more complicated structures.
In general a single call will be better; there are exceptions, but they are rare, especially once you account for network latency. Further, each call has its own sequence of events, so splitting the work into two calls does not avoid that cost. Your complaint about redundant data holds for the two-call approach as well, since the second query still returns postid on every comment row.
So a single call will be better, but the way you structured it is incorrect. I am not sure whether you mean the postid is redundant across rows or within a row. You eliminate within-row redundancy by specifying the columns you want (abandon select *). You eliminate across-row redundancy in your app's presentation layer.
select p.postid, p.body, c.commentid, c.body
from posts p
left join comments c
on c.postid = p.postid
where p.postid = POSTID
order by p.postid, c.commentid
limit 10;
Note: the left join allows for the inclusion of posts that have no comments. If that is not desirable then just use join comments ... (i.e. an inner join).
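To illustrate the point about removing across-row redundancy in the application, here is a minimal sketch. It assumes Python with the standard-library sqlite3 module as a stand-in for PostgreSQL (so the placeholder syntax is `?`), and the helper names `fetch_post_with_comments` and `make_demo_db` are hypothetical.

```python
import sqlite3


def fetch_post_with_comments(conn, post_id, limit=10):
    """Return (post_body, [(commentid, body), ...]) from one joined query."""
    rows = conn.execute(
        """
        SELECT p.body, c.commentid, c.body
        FROM posts p
        LEFT JOIN comments c ON c.postid = p.postid
        WHERE p.postid = ?
        ORDER BY c.commentid
        LIMIT ?
        """,
        (post_id, limit),
    ).fetchall()
    if not rows:
        return None  # no post with that id
    post_body = rows[0][0]  # repeated on every row; keep it once
    # A post without comments yields one row whose comment columns are NULL.
    comments = [(cid, body) for _, cid, body in rows if cid is not None]
    return post_body, comments


def make_demo_db():
    """Build a throwaway in-memory database with made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE posts (postid INTEGER PRIMARY KEY, body TEXT);
        CREATE TABLE comments (
            commentid INTEGER PRIMARY KEY,
            postid INTEGER REFERENCES posts(postid),
            body TEXT
        );
        INSERT INTO posts VALUES (1, 'first post'), (2, 'no comments yet');
        """
    )
    conn.executemany(
        "INSERT INTO comments (postid, body) VALUES (1, ?)",
        [(f"comment {i}",) for i in range(1, 13)],
    )
    return conn
```

The post body travels on every row of the result, but the caller sees it only once; whether that extra traffic matters depends on how large the post is relative to the comments.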
Related
I have a products table in Postgres that stores all the data I need. What I need to work out is the best way to do the following:
Each product has a different status on the way to completion - machined, painted, assembled, etc. For each status there is a letter that changes in the product id.
What would be the most efficient way of saving the data? For each status of the product, should there be 'another product' in the table? Or would join tables somewhere work?
Example:
111a1 for machined
111b1 for painted
Yet these are the same end product, just at different stages ...
It depends on what you want to be efficient: storage, ingestion, queries, maintainability...
Joins would work - you can join on some substring of the product id, so you need not have separate products for every stage of production.
But maintainability of your code is really important. Businesses change - perhaps to include sub-assemblies.
You might want to re-think this scheme of altering the product id to show the status. Product ID and work flow state are orthogonal concepts. So you probably want to have them in separate fields. You'll probably write far less code that way. The alternative will be becoming really well acquainted with substr() (depending on your SQL dialect), and all sorts of duplications elsewhere.
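As a rough sketch of keeping the two concepts in separate fields, assuming Python's sqlite3 as a stand-in for Postgres: the table layout, the status list, the id format '111-1', and the helper `advance` are all hypothetical choices for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE products (
        product_id TEXT PRIMARY KEY,  -- stable id, no status letter inside
        status TEXT NOT NULL
            CHECK (status IN ('machined', 'painted', 'assembled'))
    );
    INSERT INTO products VALUES ('111-1', 'machined');
    """
)


def advance(conn, product_id, new_status):
    """Move a product to a new workflow state without touching its id."""
    conn.execute(
        "UPDATE products SET status = ? WHERE product_id = ?",
        (new_status, product_id),
    )


advance(conn, "111-1", "painted")
current = conn.execute(
    "SELECT status FROM products WHERE product_id = '111-1'"
).fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```

With this layout a status change is an UPDATE on one row, so there is still a single product row and no substring parsing of the id.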
Details
A Person has many Objectives.
Objectives have Person-specific details about Activitys.
An Activity contains generic information such as a world record.
A Person can organize an Event to attempt the Objective.
A Person invites other Persons to watch an Event with an Invitation.
Schema
Note: Only backref's are listed on the example schema diagram, indicated by "(fk)". The arrows imply the normal relationship.
[Schema diagram: image link not preserved]
Question
I want most Event, Objective, and Activity details for all Invitations one Person received (regardless of status, though the status is still needed) displayed at once.
Is there a better way to represent the problem before I try tackling a JOIN like this? I believe the Person -> Invitation <- Event is an Association Object pattern, but I am unsure of how to get the Objective and Activity information in a clean, efficient manner for each Invitation returned.
Bonus: Provide sample SQLAlchemy query.
On the SQL side, this is pretty straightforward. I built some tables for testing using only one id number (for persons); all the rest of the keys were natural keys. Looking at the execution plan for this query
select I.*, A.activity_placeholder, E.event_location, O.objective_placeholder
from event_invitations I
inner join activity A
on (I.activity_name = A.activity_name)
inner join events E
on (I.personal_id = E.personal_id
and I.activity_name = E.activity_name
and I.objective_deadline = E.objective_deadline
and I.event_time = E.event_time)
inner join personal_objectives O
on (I.personal_id = O.personal_id
and I.activity_name = O.activity_name
and I.objective_deadline = O.objective_deadline)
where I.person_invited_id = 2;
shows that the dbms (PostgreSQL) is using indexes throughout except for a sequential scan on event_invitations. I'm sure that's because I used very little data, so all these tables easily fit into RAM. (When tables fit in RAM, it's often faster to scan a small table than to use the index.)
The optimizer estimates the cost for each part of the query to be 0.00, and you can't get much better than that. Actual run time was less than 0.2 milliseconds, but that doesn't mean much.
I'm sure you can translate this into SQLAlchemy. Let me know if you want me to post my tables and sample data.
I have a specific problem to solve, but I cannot work out how.
I have 5-6 kinds of statements that I need to store in my database. The system works like a news feed.
Statement 1 : A installed this website.
Statement 2 : A added store R in provinceX
Statement 3 : B reviewed store R
Statement 4 : A edited store R
Statement 5 : A added product P in product_category1
Statement 6 : B reviewed product P
Note that the dynamic data (shown in bold in the original) are values such as A and B, which are people's names, and store R, which is the name of a store that a person added.
My idea so far is to have
person_table(id, name, age, ...)
store_table(sid, store_name, province_id, ...)
product_table(pid, product_name, ...)
and what about feed_table?
How should I design the database to store this data, and how do I query it for display?
Thank you.
There are two approaches to this kind of problem:
You design your tables in such a way that you have no repetition of information. Basically the feed you're interested in can be constructed from the existing tables in a performant manner; or
You repeat certain data to make your feed easier to implement.
Personally I would probably go for (2) and have a new table:
Feed: id, person_id, store_id, action_id, province_id, product_category_id
with the last two fields being optional, based on the action (review, edit, add, etc).
Purists will argue that repeated data is bad (certainly normal-form seeks to factor it out) but in the real world database schemas do this all the time for performance reasons.
Think about it this way: what do you spend most of your time doing in your application?
Viewing the feed (reading); or
Doing actions (writing).
If it's (1), which I suspect it is, then a feed table makes sense. You typically want to optimize for reads not writes as writes occur far less often.
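A minimal sketch of such a feed table follows, assuming Python's sqlite3 as the database. It covers only the store-related statements and leaves out the optional province and category columns; the `action` lookup table and the helper names `build_feed_db` and `read_feed` are hypothetical.

```python
import sqlite3


def build_feed_db():
    """Create a small in-memory schema with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE store (sid INTEGER PRIMARY KEY, store_name TEXT);
        CREATE TABLE action (id INTEGER PRIMARY KEY, verb TEXT);
        CREATE TABLE feed (
            id INTEGER PRIMARY KEY,
            person_id INTEGER NOT NULL REFERENCES person(id),
            store_id INTEGER REFERENCES store(sid),
            action_id INTEGER NOT NULL REFERENCES action(id)
        );
        INSERT INTO person VALUES (1, 'A'), (2, 'B');
        INSERT INTO store VALUES (1, 'R');
        INSERT INTO action VALUES (1, 'added'), (2, 'reviewed'), (3, 'edited');
        INSERT INTO feed (person_id, store_id, action_id) VALUES
            (1, 1, 1), (2, 1, 2), (1, 1, 3);
        """
    )
    return conn


def read_feed(conn):
    """Render feed rows, oldest first, as human-readable statements."""
    rows = conn.execute(
        """
        SELECT p.name, a.verb, s.store_name
        FROM feed f
        JOIN person p ON p.id = f.person_id
        JOIN action a ON a.id = f.action_id
        LEFT JOIN store s ON s.sid = f.store_id
        ORDER BY f.id
        """
    ).fetchall()
    return [f"{name} {verb} store {store}" for name, verb, store in rows]
```

Writing an action means inserting one extra feed row; reading the feed is a single ordered scan of one table plus lookups, which is the trade-off favouring reads.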
I'm currently hand-writing a DAL in C# with SqlDataReader and stored procedures. Performance is important, but it still should be maintainable...
Let's say there's a table recipes
(recipeID, author, timeNeeded, yummyFactor, ...)
and a table ingredients
(recipeID, name, amount, yummyContributionFactor, ...)
Now I'd like to query like 200 recipes with their ingredients. I see the following possibilities:
Query all recipes, then query the ingredients for each recipe.
This would of course result in maaany queries.
Query all recipes and their ingredients in a big joined list. This will cause a lot of useless traffic, because every recipe data will be transmitted multiple times.
Query all recipes, then query all the ingredients at once by passing the list of recipeIDs back to the database. Alternatively, issue both queries at once and return multiple resultsets. Back in the DAL, associate the ingredients with the recipes by their recipeID.
Exotic way: Cursor through all recipes and return for each recipe two separate resultsets for recipe and ingredients. Is there a limit for resultsets?
For more variety, the recipes can be selected by a list of IDs from the DAL or by some parametrized SQL condition.
Which one do you think has the best performance/mess ratio?
If you only need to join two tables and an "ingredient" isn't a huge amount of data, the best balance of performance and maintainability is likely to be a single joined query. Yes, you are repeating some data in the results, but unless you have 100,000 rows and it's overloading the database server/network, it's too soon to be optimizing.
The story is a little bit different if you have many layers of joins each with decreasing cardinality. For example, in one of my apps I have something like the following:
Event -> EventType -> EventCategory
-> EventPriority
-> EventSource -> EventSourceType -> Vendor
A query like this results in a significant amount of duplication which is unacceptable when there are 100k events to retrieve, 1000 event types, maybe 10 categories/priorities, 50 sources, and 5 vendors. So in that case, I have a stored procedure that returns multiple result sets:
All 100k Events with just EventTypeID
The 1000 EventTypes with CategoryID, PriorityID, etc. that apply to these Events
The 10 EventCategories and EventPriorities that apply to the above EventTypes
The 50 EventSources that generated the 100k events
And so on, you get the idea.
Because the cardinality goes down so drastically, it is much quicker to download only what is needed here and use a few dictionaries on the client side to piece it together (if that is even necessary). In some cases the low-cardinality data may even be cached in memory and never retrieved from the database at all (except on app start or when the data is changed).
The determining factors in using an approach such as this are a very high number of results and a steep decrease in cardinality for the joins, in other words fanning in. This is actually the reverse of most usages and probably the reverse of what you are doing here. If you are selecting "recipes" and joining to "ingredients", you are probably fanning out, which can make this approach wasteful, especially if there are only two tables to join.
So I'm just putting it out there that this is a possible alternative if performance becomes an issue down the road; at this point in your design, before you have real-world performance data, I would simply go the route of using a single joined result set.
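For comparison, the "separate result sets stitched together with dictionaries" alternative might look roughly like this. It is a sketch in Python with sqlite3 rather than C# with SqlDataReader and a stored procedure, the two queries stand in for two result sets, and `load_recipes` and `make_recipe_db` are hypothetical names.

```python
import sqlite3
from collections import defaultdict


def load_recipes(conn, recipe_ids):
    """Fetch recipes and ingredients separately, then join them in memory."""
    if not recipe_ids:
        return []
    marks = ",".join("?" for _ in recipe_ids)  # one placeholder per id
    recipes = conn.execute(
        f"SELECT recipeID, author FROM recipes "
        f"WHERE recipeID IN ({marks}) ORDER BY recipeID",
        list(recipe_ids),
    ).fetchall()
    ingredients = conn.execute(
        f"SELECT recipeID, name, amount FROM ingredients "
        f"WHERE recipeID IN ({marks}) ORDER BY recipeID, name",
        list(recipe_ids),
    ).fetchall()

    # Group ingredient rows by recipeID so each recipe row is sent only once.
    by_recipe = defaultdict(list)
    for recipe_id, name, amount in ingredients:
        by_recipe[recipe_id].append((name, amount))

    return [
        {"recipeID": rid, "author": author, "ingredients": by_recipe.get(rid, [])}
        for rid, author in recipes
    ]


def make_recipe_db():
    """Build a throwaway in-memory database with made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE recipes (recipeID INTEGER PRIMARY KEY, author TEXT);
        CREATE TABLE ingredients (recipeID INTEGER, name TEXT, amount TEXT);
        INSERT INTO recipes VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
        INSERT INTO ingredients VALUES
            (1, 'flour', '500 g'), (1, 'salt', '1 tsp'), (2, 'rice', '1 cup');
        """
    )
    return conn
```

This costs one extra query and some client-side code, which is why it only pays off when the duplicated parent data is large.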
The best performance/mess ratio is 42.
On a more serious note, go with the simplest solution: retrieve everything with a single query. Don't optimize before you encounter a performance issue. "Premature optimization is the root of all evil" :)
One stored proc that returns 2 datasets: "recipe header" and "recipe details"?
This is what I'd do if I needed the data all at once in one go. If I don't need it in one go, I'd still get 2 datasets but with less data.
We've found it slightly easier to work with this in the client rather than one big query as Andomar suggested, but his/her answer is still very valid.
I would look at the bigger picture - do you really need to retrieve ingredients for 200 recipes? What happens when you have 2,000?
For example, if this is in a web page I would have the 200 recipes listed (if not less because of paging), and when the user clicked on one to see the ingredient then I would get the ingredients from the database.
If this isn't doable, I would have 1 stored proc that returns one DataSet containing 2 tables. One with the recipes and the second with the list of ingredients.
"I'm currently hand-writing a DAL in C#..." As a side note, you might want to check out the post: Generate Data Access Layer Methods From Stored Procs. It can save you a lot of time.
I'm working on a little blog software, and I'd like to have tags attached to a post. Each Post can have between 0 and infinite Tags, and I wonder if it's possible to do that without having to join tables?
As the number of tags is not limited, I can not just create n fields (Tag1 to TagN), so another approach (which is apparently the one StackOverflow takes) is to use one large text field and a delimiter, i.e. "<Tag1><Tag2><Tag3>".
The problem there: If I want to display all posts with a tag, I would have to use a "Like '%<Tag2>%'" statement, and those can AFAIK not use any indexes, requiring a full table scan.
Is there any suitable way to solve this?
Note: I know that a separate Tag-Link-Table offers benefits and that I should possibly not worry about performance without measuring etc. I'm more interested in the different ways to design a system.
Wanting to do this without joins strikes me as a premature optimisation. If this table is being accessed frequently, its pages are very likely to be in memory and you won't incur an I/O penalty reading from it, and the plans for the queries accessing it are likely to be cached.
A separate tag table is really the only way to go here. It is THE only way to allow an infinite number of tags.
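A minimal sketch of the separate tag table, assuming Python's sqlite3 and hypothetical table names (Posts, Tags, PostTags) and helper names. The point is that looking up posts by tag becomes an indexed equality match instead of a LIKE '%...%' scan.

```python
import sqlite3


def make_blog_db():
    """Build a throwaway in-memory database with made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Posts (PostID INTEGER PRIMARY KEY, PostTitle TEXT);
        CREATE TABLE Tags (TagID INTEGER PRIMARY KEY, TagName TEXT UNIQUE);
        CREATE TABLE PostTags (
            PostID INTEGER REFERENCES Posts(PostID),
            TagID INTEGER REFERENCES Tags(TagID),
            PRIMARY KEY (PostID, TagID)
        );
        -- Supports "all posts for a tag" without scanning every link row.
        CREATE INDEX PostTags_by_tag ON PostTags (TagID, PostID);
        INSERT INTO Posts VALUES
            (1, 'Join-free tags'), (2, 'Untagged'), (3, 'Schemas');
        INSERT INTO Tags VALUES (1, 'architecture'), (2, 'sql');
        INSERT INTO PostTags VALUES (1, 1), (1, 2), (3, 2);
        """
    )
    return conn


def posts_with_tag(conn, tag_name):
    """Return (PostID, PostTitle) for every post carrying the given tag."""
    return conn.execute(
        """
        SELECT p.PostID, p.PostTitle
        FROM Posts p
        JOIN PostTags pt ON pt.PostID = p.PostID
        JOIN Tags t ON t.TagID = pt.TagID
        WHERE t.TagName = ?
        ORDER BY p.PostID
        """,
        (tag_name,),
    ).fetchall()
```

A post with no tags simply has no rows in PostTags, and a post with many tags has many, so no column count has to be fixed in advance.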
This sounds like an exercise in denormalization. All that's really needed is a table that can naturally support any query you happen to have, by repeating any information you would otherwise have to join to another table to satisfy. A normalized database for something like what you've got might look like:
Posts:
PostID | PostTitle | PostBody | PostAuthor
--------+--------------+-------------------+-------------
1146044 | Join-Free... | I'm working on... | Michael Stum
Tags:
TagID | TagName
------+-------------
1 | Architecture
PostTags:
PostID | TagID
--------+------
1146044 | 1
Then you can add columns to optimise your queries. If it were me, I'd probably just leave the Posts and Tags tables alone, and add extra info to the PostTags join table. Of course what I add might depend a bit on the queries I intend to run, but probably I'd at least add Posts.PostTitle, Posts.PostAuthor, and Tags.TagName, so that I need only run two queries for showing a blog post,
SELECT * FROM `Posts` WHERE `Posts`.`PostID` = $1
SELECT * FROM `PostTags` WHERE `PostTags`.`PostID` = $1
And summarizing all the posts for a given tag requires even less,
SELECT * FROM `PostTags` WHERE `PostTags`.`TagName` = $1
Obviously the downside to denormalization is that you have to do a bit more work to keep the denormalized tables up to date. A typical way of dealing with this is to put some sanity checks in your code that detect when a denormalized query result is out of sync, by comparing it to other information the code already has available. In the above example, such a check might compare the post titles in the PostTags result set against the title in the Posts result. This doesn't cause an extra query. If there's a mismatch, the program could notify an admin, e.g. by logging the inconsistency or sending an email.
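One possible shape for such a sanity check is sketched below in Python. It assumes both result sets have already been fetched as dictionaries keyed by column name; the function name `stale_tag_rows` and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("denorm_check")


def stale_tag_rows(post, post_tags):
    """Return PostTags rows whose copied PostTitle disagrees with Posts."""
    stale = [row for row in post_tags if row["PostTitle"] != post["PostTitle"]]
    for row in stale:
        # Only report here; regenerating the columns is a separate decision.
        logger.warning(
            "PostTags out of sync for PostID=%s TagID=%s",
            row["PostID"],
            row["TagID"],
        )
    return stale
```

The check runs on data the page load has already retrieved, so it adds no extra query.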
Fixing it is easy (but costly in terms of server workload): throw out the extra columns and regenerate them from the normalized tables. Obviously you shouldn't do this until you have found the cause of the database going out of sync.
If you're using SQL Server, you could use a single text field (varchar(max) seems appropriate) and full-text indexing. Then just do a full-text search for the tag you're looking for.