We've built an algorithm that helps us deliver relevant articles to our users. At certain intervals, the algorithm runs in the background and calculates metadata, such as average age, age spread, and gender coefficient, from a slew of data related to views, comments, and votes.
With that said, are there any downsides to storing this metadata as fields on the Articles table? Or, should I create a separate table, such as Article_Data, to store the information? I am just not sure how much the updating of this metadata will interfere with selecting the articles.
For the most part, we will be SELECTing articles and their metadata and JOINing them against user data (age, gender, etc.) to show users relevant content. The only time we don't need the metadata is when we show a particular article to a user.
If the fields are clearly defined, and there are a limited number of them, put them in the Articles table.
If you are going to store more than one record of metadata fields per article, you need another table, in a one-to-many relationship with the Articles table.
If the fields are not clearly defined, are user-defined, or there are many of them, you probably need a new table with one row per metadata item (an entity-attribute-value design). But this is more difficult to work with in the long run.
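A rough sketch of the two shapes in SQL (run here through Python's sqlite3; table and column names are illustrative, not taken from the question):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Option 1: clearly defined, limited fields live on Articles itself.
    CREATE TABLE Articles (
        id           INTEGER PRIMARY KEY,
        title        TEXT NOT NULL,
        avg_age      REAL,   -- recalculated by the background job
        age_spread   REAL,
        gender_coeff REAL
    );

    -- Option 2: one row per metadata item, for open-ended or user-defined fields.
    CREATE TABLE Article_Data (
        article_id INTEGER NOT NULL REFERENCES Articles(id),
        name       TEXT    NOT NULL,  -- e.g. 'avg_age'
        value      REAL,
        PRIMARY KEY (article_id, name)
    );
    """)

With option 1, the background job is a single UPDATE per article and your SELECTs need no extra join; with option 2, every read has to join (or pivot) Article_Data to reassemble the fields.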
See Also
http://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model
This is very similar to my issue: http://forum.kimballgroup.com/t2534-modeling-fact-tables-that-have-direct-relationships-but-at-a-detail-and-not-a-dimension-layer
I've got fact tables for POs, supplier invoices, payments, receipts, etc. They have some dimensions in common, others not. The problem is, for example, that if users are looking at invoices by GL account (using an Excel pivot table connected to the cube), they expect to be able to drop in a column for the PO number, the buyer on the PO, and so on, even though the buyer dimension is only related to the PO and the account dimension is only related to the invoice. But they say: well, the PO is related to the invoice, so you should be able to pull it in.
I do have a PO Ref field on the invoice fact table, but it is only filled out 50% of the time. Even when it is, you can have a one-to-many relationship in either direction between a PO and an invoice, as far as I understand it at least.
Anyway, they expect to be able to throw in any measure from any measure group, have every single possible dimension work, and then drill down to the detail to see the POs, invoices, payments, and receipts and how they match up. Best practice according to Kimball is to keep fact tables separate if they are of different grains, but that alone doesn't solve all the business problems.
The only solutions I can come up with are:
1. Tack on a bunch of detail-related columns to the degenerate dimensions when I load them, i.e. add the PO to the invoice and the invoice to the PO, etc., storing a comma-separated list in the column where the relationship is many-to-one.
2. Create every possible relationship between every fact and dimension table. This would be a lot of work, though, and some facts still may not have a relationship to certain dimensions.
3. Create a monstrous fact table with all the current ones joined together, and somehow figure out logic to display the measure values only once for the many-to-one joins.
4. This is probably a bad idea, but maybe I could create a relationship between every measure group and the corresponding degenerate dimension's reference field, e.g. between the supplier invoice degenerate dimension's PO Ref field and the purchase order line measure group's PO field.
5. Lower their expectations, lol.
Here's a screen shot of the dimension usage tab to give an idea of what it looks like currently.
I tried option 3 once. The performance was terrible. The output was misleading. Never ever again.
Your best bet is to work with the business. Where the data is not readily available (an invoice without a PO, for example), agree on what should be done. You could show a default value ("PO not recorded on invoice"). You could agree on logic, implemented in the ETL, that derives the most likely PO.
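As an illustration, the default-value approach can be a one-line substitution in the ETL (a sketch in Python; the field names are made up):

    # Hypothetical ETL transform: substitute the agreed default when an
    # invoice row arrives without a PO reference.
    NO_PO = "PO not recorded on invoice"

    def transform_invoice_row(row):
        row["po_ref"] = row.get("po_ref") or NO_PO
        return row

    print(transform_invoice_row({"invoice_id": 123, "po_ref": None}))
    # {'invoice_id': 123, 'po_ref': 'PO not recorded on invoice'}

The point is not the code but the agreement behind it: the business has signed off on what the default means.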
Whatever approach you choose, you must discuss it with the business. If you do not, the business will make decisions based on false assumptions and will find itself looking at reporting it does not understand. You must help your users avoid these outcomes.
Once the approach has been agreed, document it. When queries arise, share the documentation. Make sure the documentation highlights all calculations, difficulties and missing source data.
Work with the teams that generate your source data. If an important field is sparsely populated, arrange a meeting. See if the capture processes can be improved. Let your users know that you are investigating this area. Keep them informed of the outcome. If the source data cannot be improved (invoices continue to be raised without a PO), inform your users of the reasons for this.
Managing your customers can be challenging, especially those who hold senior positions in the company. Transparency and solid documentation will help you.
So I've done extensive searching on this and I can't seem to find a good solution that actually applies to my situation.
I have a list of projects in a table, then a list of people. I want to assign multiple people to one project. Seems pretty common. Obviously, I can't make multiple columns on my projects table for each person, as the people will change fairly frequently.
I need to display this information very quickly in a continuous list of projects. (The ideal control would be a multiple-select combobox, as a listbox is too tall, but those don't exist outside of the dreaded lookup fields.)
I can think of two ways:
- Store multiple employee IDs, delimited by commas, in one field in my projects table (I know this goes against good database design). This would require some code to store and retrieve the data.
- Have a separate table for employees assigned to projects (ID, ProjectID, EmployeeID), with a one-to-many relationship between the projects table and this new table, and a one-to-many relationship between the employees table and this new table. If a project has 3 employees assigned, it would store 3 records in this table. It seems a bit odd joining both tables in this way, and it would also require code to store and retrieve the data into a control like the one mentioned above.
Does anyone know if there is a better way (including displaying in an easy control) or how you usually tackle this problem?
The usual way to tackle this problem is with a junction table. This is what you describe: a separate table, perhaps called EmployeeProject, which has an EmployeeProjectID (PK), an EmployeeID (FK), and a ProjectID (FK).
In this way you model a many-to-many relationship, where each project can have many employees involved and each employee can be involved in many projects. It's not actually all that difficult to write the SQL required to pull the information back together for display.
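A minimal sketch (standard SQL, run here through Python's sqlite3; in Access you would build the same tables in the designer, and the join is nearly identical):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE Employee (EmployeeID INTEGER PRIMARY KEY, Name  TEXT);
    CREATE TABLE Project  (ProjectID  INTEGER PRIMARY KEY, Title TEXT);

    -- The junction table: one row per (employee, project) assignment.
    CREATE TABLE EmployeeProject (
        EmployeeProjectID INTEGER PRIMARY KEY,
        EmployeeID INTEGER NOT NULL REFERENCES Employee(EmployeeID),
        ProjectID  INTEGER NOT NULL REFERENCES Project(ProjectID)
    );
    """)

    # All employees assigned to project 1, one row per employee:
    rows = conn.execute("""
        SELECT e.Name
        FROM Employee AS e
        JOIN EmployeeProject AS ep ON ep.EmployeeID = e.EmployeeID
        WHERE ep.ProjectID = ?
    """, (1,)).fetchall()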
I would definitely stay away from storing comma-delimited values as this becomes significantly more complicated when you want to display or manipulate the data.
There's a good guide here: http://en.tekstenuitleg.net/articles/software/create-a-many-to-many-relationship-in-access but if you google "many to many junction table" or similar, there are thousands of pages/articles about implementation.
Here's a simple version of the website I'm designing: users can belong to one or more groups, as many groups as they want. When they log in, they are presented with the groups they belong to. Ideally, in my Users table I'd like an array or something unbounded to which I can keep adding the IDs of the groups the user joins.
Additionally, although I realize this isn't necessary, I might want a column in my Group table holding an indefinite number of the user IDs that belong to that group. (Side question: would that be more efficient than getting all the users of the group by querying the user table for users belonging to a certain group ID?)
Does my question make sense? Mainly I want to be able to fill a column up with an indefinite list of IDs... The only way I can think of is making it like some super long varchar and having the list JSON encoded in there or something, but ewww
Please and thanks
Oh, and it's a MySQL database (my website is in PHP), but after 2 years of PHP development I've recently decided PHP sucks and I hate it, and ASP.NET web applications are the only way for me, so I guess I'll be implementing this on whatever kind of database I need for that.
Your intuition is correct; you don't want to have one column of unbounded length just to hold the user's groups. Instead, create a table such as user_group_membership with the columns:
user_id
group_id
A single user could then appear in multiple rows, each with the same user_id but a different group_id; you represent membership in multiple groups by adding multiple rows to this table.
What you have here is a many-to-many relationship. A "many-to-many" relationship is represented by a third, joining table that contains both primary keys of the related entities. You might also hear this called a bridge table, a junction table, or an associative entity.
You have the following relationships:
A User belongs to many Groups
A Group can have many Users
In database design, this might be represented as follows:

User (UserID PK) 1---* UserGroup (UserID FK, GroupID FK) *---1 Group (GroupID PK)

This way, a UserGroup row represents any combination of a User and a Group without the problem of having "infinite columns."
If you store an indefinite amount of data in one field, your design does not conform to first normal form (1NF). 1NF is the first step in a design discipline called data normalization, which is a major aspect of database design. Normalized design is usually good design, although there are some situations where a different design pattern might be better adapted.
If your data is not in 1NF, you will end up doing sequential scans for some queries where a normalized database would be accessed via a quick lookup. For a table with a billion rows, this could mean a delay of an hour rather than a few seconds. 1NF guarantees a direct-access lookup path for each item of data.
As other responders have indicated, such a design will involve more than one table, joined at retrieval time. Joining takes some time, but it's tiny compared to the time wasted on sequential scans if the data volume is large.
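A small demonstration of the difference, using the group-membership example from the question (sqlite3; names are illustrative). The normalized table can answer "who is in group 42?" through an index; the comma-separated column forces a pattern match over every row:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Non-1NF: all group ids crammed into one text column.
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, group_csv TEXT);

    -- 1NF: one row per membership, directly indexable.
    CREATE TABLE user_group_membership (
        user_id  INTEGER NOT NULL,
        group_id INTEGER NOT NULL,
        PRIMARY KEY (user_id, group_id)
    );
    CREATE INDEX idx_membership_group ON user_group_membership (group_id);
    """)

    # Non-1NF: sequential scan plus string matching on every row.
    conn.execute(
        "SELECT user_id FROM users WHERE ',' || group_csv || ',' LIKE '%,42,%'")

    # 1NF: direct index lookup.
    conn.execute(
        "SELECT user_id FROM user_group_membership WHERE group_id = 42")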
I am just starting up with Lucene, and I'm trying to index a database so I can perform searches on the content. There are 3 tables that I am interested in indexing:
1. Image table - each entry represents an image. Each image has a unique ID and some other info (title, description, etc.).
2. People table - each entry represents a person. Each person has a unique ID and other info (name, address, company, etc.).
3. Credited table - this table has 3 fields (image, person, and credit type). Its purpose is to associate people with an image as the credits for that image. Each image can have multiple credited people (the director, photographer, props artist, etc.), and a person can be credited on multiple images.
I'm trying to index these tables so I can perform some searching using Lucene but as I've read, I need to flatten the structure.
The first solution that came to me would be to create a Lucene Document for each Image/Credited-Person combination. I'm afraid this would create a lot of duplicate content in the index (all the details of an image/person would be duplicated in each Document for every person who worked on the image).
Is there anybody experienced with Lucene who can help me with this? I know there is no generic solution to denormalization, which is why I provided a more specific example.
Thank you; I will gladly provide more info on the database if anybody needs it.
PS: Unfortunately, there is no way for me to change the structure of the database (it belongs to the client). I have to work with what I have.
You could create a Document for each person with all the associated images' descriptions concatenated (either appended to the person info or in a separate Field).
Or, you could create a minimal Document for each person, create a Document for each image, put the creators' names and credit info in a separate field of the image Document, and link them by putting the person ID (or person Document id) in a third, non-indexed field. (Lucene is geared toward flat document indexing, not relational data, but relations can be defined manually.)
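A library-agnostic sketch of that second option, with the documents shown as plain Python dicts (in Lucene each key would become a Field; person_ids would be stored but not indexed, and all names here are illustrative):

    # Hypothetical rows pulled from the three tables.
    image = {"id": 7, "title": "Sunset", "description": "A beach at dusk"}
    credits = [
        {"person_id": 1, "name": "Ana Lee", "credit": "photographer"},
        {"person_id": 2, "name": "Bo Chan", "credit": "director"},
    ]

    # One flat document per image: the credited names and credit types are
    # searchable text, while the person ids are only carried along so a hit
    # can be linked back to the minimal per-person documents.
    image_doc = {
        "type": "image",
        "title": image["title"],
        "description": image["description"],
        "credit_names": " ".join(c["name"] for c in credits),
        "credit_types": " ".join(c["credit"] for c in credits),
        "person_ids": [c["person_id"] for c in credits],  # stored, not indexed
    }

    person_docs = [
        {"type": "person", "person_id": c["person_id"], "name": c["name"]}
        for c in credits
    ]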
This is really a matter of what you want to search for, images or persons, and whether each contains enough keywords for search to function. Try several options, see if they work well enough and don't exceed the available space.
The credit table will probably not be a good candidate for Document construction, though.
I am working on a system, which will run on GAE, which will have several related entities and I am not sure of the best way to store the data. This post is a request for advice from others who may have similar experience....
The system will have users, with profile data and an image. Those users will be able to create "events" and add journal entries to them. For the purposes of the system, an "event" will likely have 1 or 2 journal entries, and anything over 10 would likely never happen. Other users will be able to add comments to users' entries as well, and popular entries may attract hundreds or even thousands of comments.

When a random visitor uses the system, they should be able to see the latest events ("latest" being defined by those with the latest journal entries in them), search by tag, and perform a very basic text search. Upon selecting an event to view, it should be displayed with all journal entries and all user comments, with user images alongside the comments. A user should also have a kind of self-admin page, to view/modify/delete their events and to view/modify/delete comments they have made on other events.

Doing all this on a normal RDBMS would just be queries with some big joins across several tables. On GAE it obviously needs to work differently. Here are my initial thoughts on the design of the entities:
Event entity - id, name, timestamp, list property of tags, view count, creator's username, creator's profile image id, number of journal entries it contains, number of total comments it contains, timestamp of last update to contained journal entries, list property of index words for search (built/updated from the text of contained journal entries)
JournalEntry entity - timestamp, journal text, name of event, creator's username, creator's profile image id, list property of comments (containing commenter username and image id)
User entity - username, password hash, email, list property of subscribed events, timestamp of create date, image id, number of comments posted, number of events created, number of journal entries created, timestamp of last journal activity
UserComment entity - username, id of event commented on, title of event commented on
TagData entity - tag name, count of events with tag on them
So, I'd like to hear what people here think about the design and what changes should be made to help it scale well. Thanks!
Rather than store Event.id as a property, use the id automatically embedded in each entity's key, or set unique key names on entities as you create them.
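For example, with the (old) db API, a key name replaces a hand-rolled id property and lets you fetch without a query (a sketch; the model and key name are illustrative):

    from google.appengine.ext import db

    class Event(db.Model):
        name = db.StringProperty()

    # The key name acts as the id; no separate id property needed.
    Event(key_name="spring-gala-2010", name="Spring Gala").put()

    # Later: direct fetch by key, no query.
    event = Event.get_by_key_name("spring-gala-2010")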
You have lots of options for modeling the relationship between Event and JournalEntry: you could use a ReferenceProperty, you could parent JournalEntries to Events and retrieve them with ancestor queries, or you could store a list of JournalEntry key ids or names on Event and retrieve them in batch with a key query. Try some things out with realistically-distributed dummy data, and use appstats to see what works best.
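A sketch of two of those options side by side (db API; property names are illustrative):

    from google.appengine.ext import db

    class Event(db.Model):
        name = db.StringProperty()

    class JournalEntry(db.Model):
        # Option A: an explicit reference from entry to event.
        event = db.ReferenceProperty(Event, collection_name="journal_entries")
        text  = db.TextProperty()

    # Option B: parent each entry to its event and use an ancestor query.
    event_key = Event(name="demo event").put()
    JournalEntry(parent=event_key, text="first entry").put()
    entries = JournalEntry.all().ancestor(event_key).fetch(10)

Option B also puts an event and its entries in one entity group, so they can be updated transactionally; option A keeps entity groups small but gives up that guarantee.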
UserComment references an Event, while JournalEntry references a list of UserComments, which is a little confusing. Is there a relationship between UserComment and JournalEntry? or just between UserComment and Event?
Persisting so many counts is expensive. When I post a comment, you're going to write a new UserComment entity and also update my User entity and a JournalEntry entity and an Event entity. The number of UserComments you expect per Event makes it unwise to include everything in the same entity group, which means you can't do these writes transactionally, so you'll do them serially, and the entities might be stored across different network nodes, making the whole operation slow; and you'll also be open to consistency problems. Can you do without some of these counts and consider storing others in memcache?
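If a count only needs to be roughly right on screen, memcache is one place to keep it (a sketch; the key scheme is made up, and values can be evicted, so treat them as best-effort or periodically persist them):

    from google.appengine.api import memcache

    def bump_comment_count(event_id):
        # incr is atomic; initial_value handles the first increment.
        memcache.incr("comment_count:%s" % event_id, initial_value=0)

    def get_comment_count(event_id):
        return int(memcache.get("comment_count:%s" % event_id) or 0)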
When you fetch an Event from the datastore, you don't actually care about its list of search index words, and retrieving and deserializing them from protocol buffers has a cost. You can get around this by splitting each Event's search index words into a separate child EventIndex entity. Then you can query EventIndex on your search term, fetch just the EventIndex keys for EventIndexes that match your search, derive the corresponding Events' keys with key.parent(), and fetch the Events by key, never paying for the retrieval or deserialization of your search index word lists. Brett Slatkin explains this strategy here at 14:35.
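A sketch of that parent/child split and the keys-only query (db API; names are illustrative, and each EventIndex is assumed to have been created with parent=its Event):

    from google.appengine.ext import db

    class Event(db.Model):
        name = db.StringProperty()

    class EventIndex(db.Model):
        # Child entity holding only the search words for its parent Event.
        index_words = db.StringListProperty()

    # Fetch only the index keys, derive the parent Event keys, batch-get them.
    q = EventIndex.all(keys_only=True).filter("index_words =", "concert")
    event_keys = [key.parent() for key in q.fetch(20)]
    events = db.get(event_keys)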
Updating Event.viewCount will fail if you have a lot of views for any Event in rapid succession, so you should try out counter sharding.
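A minimal sharded view counter, adapted from the standard App Engine counter recipe (shard count and model name are illustrative):

    import random
    from google.appengine.ext import db

    NUM_SHARDS = 20

    class ViewCountShard(db.Model):
        event_id = db.StringProperty(required=True)
        count    = db.IntegerProperty(default=0)

    def increment_views(event_id):
        def txn():
            # Pick one shard at random so concurrent updates rarely collide.
            shard_name = "%s-%d" % (event_id, random.randint(0, NUM_SHARDS - 1))
            shard = ViewCountShard.get_by_key_name(shard_name)
            if shard is None:
                shard = ViewCountShard(key_name=shard_name, event_id=event_id)
            shard.count += 1
            shard.put()
        db.run_in_transaction(txn)

    def get_views(event_id):
        # Sums over at most NUM_SHARDS entities; cache this if it gets hot.
        query = ViewCountShard.all().filter("event_id =", event_id)
        return sum(shard.count for shard in query)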
Good luck, and tell us what you learn by trying stuff out.