Related
I am new to DynamoDB and after reading several docs, there is a scenario in which I am not sure which would be the best approach for designing a table.
Consider that we have some JobOffers and we should support the following data access:
get a specific JobOffer
get all JobOffers from a specific Company sorted by different criteria (newest/oldest/wage)
get JobOffers from a specific Company filtered by a specific city sorted by different criteria (newest/oldest/wage)
get all JobOffers (regardless of any Company !!!) sorted by different criteria (newest/oldest/wage)
get JobOffers (regardless of any Company !!!) filtered by a specific city sorted by different criteria (newest/oldest/wage)
Since we need to support sorting, my understanding is that we should use Query instead of Scan. In order to use Query, we need to use a primary key. Because we need to support a search like "get all JobOffers without any filters sorted somehow", which would be a good candidate for partition key?
As a workaround, I was thinking to use a new attribute "country" which can be used as the partition key, but since all JobOffers are specified in one country, all these items fall under the same partition, so it might be a bit redundant until we will have support for JobOffers from different countries.
Any suggestion on how to design JobOffer table (with PK and GSI/LSI) for such a scenario?
Design of a Dynamodb table is best done with an Access approach - that is - how are you going to be accessing the data in here. You have information X, you need Y.
Also remember that a dynamo is NOT an sql, and it is not restricted that every item has to be the same - consider each entry a document, with its PK/SK as directory/item structure in a file system almost.
So for example:
You have user data. You have things like : Avatar data (image name, image size, image type) Login data (salt/pepper hashes, email address, username), Post history (post title, identifier, content, replies). Each user will only have 1 Avatar item and 1 Login item, but have many Post items
You know that from the user page you are always going to have the user ID. 100% of the time. This should then be your PK - your Hash Key, PartitionKey. Then you have the rest of the things you need inform your sort key/range key.
PK
USER#123456
SK:
AVATAR - Attributes: (image name, image size, image type)
PK
USER#123456
SK:
LOGIN - Attributes: (salt/pepper hashes, email address, username)
PK
USER#123456
SK:
POST#123 - Attributes: (post title, identifier, content, replies)
PK
USER#123456
SK:
POST#125 - Attributes: (post title, identifier, content, replies)
PK
USER#123456
SK:
POST#193 - Attributes: (post title, identifier, content, replies)
This way you can do a query with the User ID and get ALL the user data back. Or if you are on a page that just displays their post history, you can do a query against User ID # SK Begins With POST and get back all t heir posts.
You can put in an inverted index (SK -> PK and vice versa) and then do a query on POST#193 and get back the user ID. Or if you have other PK types with POST#193 as the SK, you get more information there (like a REPLIES#193 PK or something)
The idea here is that you have X bits of information, and you need to craft your dynamo to be able to retrieve as much as possible with just that information, and using prefix's on your SKs you can then narrow the fields a little.
Note!
Sometimes this means a duplication of information! That you may have the same information under two sets of keys. This is ok and kind of expected when you start getting into really complex relationships. You can mitigate it somewhat with index's, but you should aim to avoid them where possible as they do introduce a bit of lag in terms of data propagation (its tiny, but it can add up)
So you have your list of things you want to get for your dynamo. What will you always have to tie them together? What piece of data do you have that will work?
You can do the first 3 with a company PK identifier and a reverse index. That will let you look up and get all a companies jobs, or using the reverse index all a specific job. Or if you can always know the company when looking up a specific job, then it uses the general first index.
Company# - Job# - data data data
You then do the sorting on your own, OR you add some sort of sort valuye to the Job# key - Sort Keys are inherently sorted after all. Company# - Job#1234#UNITED_STATES
of course this will only work for one sort at a time. You can make more than one index, but again - data sync lag is a real possibility.
But how to do this regardless of Company? Well you can have another index with your searchable attribute (Country for example) as the PK then you can query that.
Or do you have another set of data that can tie this all together? Do you have another thing that can reach it all?
If not, you may just have two items in your dynamo:
Company#1234 - Job#321 - details
Company#1234 - Country#United_states - job#321, job#456, job#1234
Company#1234 - Country#England - job#992, job#123, job#19231
your reverse index here would apply - you could do a query on PK: Contry#UnitedStates and you'd get back:
Country#United_states - Company#1234 - job#321, job #456, job31234
Country#United_states - Company#4556
Country#United_States - Comapny#8322
this isnt a relational database however! So either you have to do one of two things - use t hose job#s to then query that company and get the filter the jobs by what you want (bad - trying to avoid multiple queries!) or each job# is an attribute on country sk's, and it contains a copy of that relevant data in a map format {job title, job#, country, company, salary}. Then when they click on that job to go to the details, it makes a direct call straight to the job query, gets the details to display,and its good.
Again, it all comes down to access patterns. What do you have, and how can you arrange it in a way that lets you get what you need fast
The following problem comes from: https://cs.senecac.on.ca/~dbs201/pages/Normalization_Practice.htm (Exercise 3)
The unnormalized table appears like this:
To comply with First Normal Form, all repeating groups must be dealt with. In this case, multiple products could appear on a single order, so it must be given its own entity and related back to the original table:
These tables are also in Second Normal Form, because in all of the tables, each non-key attribute is dependent on the primary key in it's table.
Finally, to bring it to Third Normal Form, Customer must be given its own entity and related back to the original Order entity:
Have I properly normalized the original table into Third Normal Form? If not, please provide feedback explaining what I've done wrong.
Store some Customer's Primary details in Order Table which are available on Bill, because once customer details is changed then Bill is differ then original.
Same for Product, Store Product price in Product_Order table, because once Product price changed Bill will change.
I am building a tool to allow people to create customized reports. My question resolves around getting the right database schema and design to support some custom report settings.
In terms of design, I have various Slides and each Slide has a bunch of settings (like date range, etc). A Report would basically be an ordered list of slides
The requirements are:
A user can create a report by putting together a list of "Slides" in any order they wish
A user can include the same slide twice in a report with different settings
So I was thinking of having the following tables:
Report Table: Id, Name, Description
Slide Table:, Id, Description
ReportSlide Table: ReportId, SlideId, Order, SlideSettings
my 2 main questions are:
Order: Is this the best way to manage the fact that a user can order their slides on any given report
SlideSettings: since every slides has a different set of settings (inputs), i was thinking of storing this as just a json blob and then parsing it out on the front end. Does anything one think this is the wrong design? Is there a better way to store this information (again, each slide has different inputs and you could have the same slide listed twice in a report each with different settings
Order: Is this the best way to manage
It is the correct way.
SlideSettings: ... storing this as just a json blob
If you never intend to query these values, then that's fine.
You may want to rename ReportSlide to SlideInReport. A relationship should not just list the referenced tables, but the nature of the relationship.
Some (me) prefer to give PK-columns and FK-columns the same name. Then you cannot get away with just Id, but you need to call them sld_id, rep_id.
May be you should have a Settings table. You may also need a ValueTypes table to define which setting can take what kind of values. (such as Date Range). And then let the list of setting IDs be stored against a slide.
Needless to say, these "best way"s will depend on type and amount of data being stored etc. Am a novice in JSON etc, but as far as I read, it's not a good idea to keep JSON strings as database fields, but not a rule.
I think, from a high level view, your schema will work. However, you might consider revising some of the table structure. For example:
Settings
Rather than a JSON blob, it may be best to add columns for each setting the ReportSlide table. Depending on what inputs you allow, give a column for each. For example, your date range will need to have StartDate/EndDate, Integers, Text fields, etc.
What purpose does the Slide Table serve? If your database allows a many-to-many relationship between Slides and Reports, then the ReportSlide table will hold all your settings. Will your Slide Table have attributes? If not, then perhaps Report Slides are all you need. For example:
Report Table: ReportID | DateCreated | UserID | Description
ReportSlides Table: ReportSlideID | ReportID | SlideOrder | StartDate | EndDate | Description...
Unless your Slide table is going to hold specific attributes that will be common across every report, you don't need the extra joins or space.
Depending on the tool, you may also want to have a DateCreated, UserID, FolderID, etc. Attributes that allow people to organize their reports.
If the Slides are dependent on each other, you will want to add constraints so Slide 2 cannot be deleted if Slide 3 depends on it.
Order
Regarding order, having a SlideOrder column will work. Because each ReportSlideID will have a corresponding Report, the SlideOrder can still be changed. That way, if ReportSlideID = 1 belongs to ReportID = 1 and has specific settings, it can be ordered 7th or 3rd and still work.
Be aware of your naming convention. If the Order column is directly referencing Slide Order, then go ahead and name it SlideOrder.
I'm sure there are a million other ways to make it efficient. That's my initial idea based on what you've provided.
Report Table: ID (Primary Key), Name, Description,....
Slide Table: ID (PK), Name, Description,...
Slide_x_report Table: ID(PK), ReportID (FK), SlideID (FK), order
Slide_settings Table: ID(PK), NameSetting, DescriptionSettings, SlideXReportID (FK),...
I think that you shoud have a structure like this, and in the Slide_settings table you will have the setting of the differents slides by reports.
Imagine that the slide_settings table may contain dynamic forms and these should relate to a specific slide of a report, in this way you can have it all properly stored and the slide_settings table, you would have only columns that are needed to define an element of slide.
We are building a set of features for our application. One of which is a list of recent user activities ala on SO. I'm having a little problem finding the best way to design the table for these activities.
Currently we have an Activities table with the following columns
UserId (Id of the user the activity is for)
Type (Type of activity - i.e. PostedInForum, RepliedInForum, WroteOnWall - it's a tinyint with values taken from an enumerator in C#)
TargetObjectId (An id of the target of the activity. For PostedInForum this will be the Post ID, for WroteOnWall this will be the ID of the User whose wall was written on)
CreatedAtUtc (Creationdate)
My problem is that TargetObjectId column doesn't feel right. It's a soft link - no foreign keys and only a knowledge about the Type tells you what this column really contains.
Does any of you have a suggestion on an alternate/better way of storing a list of user activites?
I should also mention that the site will be multilingual, so you should be able to see the activity list in a range of languages - that's why we haven't chosen for instance to just put the activity text/html in the table.
Thanks
You can place all content to a single table with a discriminator column and then just select top 20 ... from ... order by CreatedAtUtc desc.
Alternatively, if you store different type of content in different tables, you can try something like (not sure about exact syntax):
select top 20 from (
select top 20 ID, CreatedAtUtc, 'PostedToForum' from ForumPosts order by CreatedAtUtc
union all
select top 20 ID, CreatedAtUtc, 'WroteOnWalll' from WallPosts order by CreatedAtUtc) t
order by t.CreatedAtUtc desc
You might want to check out http://activitystrea.ms/ for inspiration, especially the schema definition. If you look at that spec you'll see that there is also the concept of a "Target" object. I have recently done something very similar but I had to create my own database to encapsulate all of the activity data and feed data into it because I was collecting activity data from multiple applications with disparate data sources in different databases.
Max
I have an application with multiple "pick list" entities, such as used to populate choices of dropdown selection boxes. These entities need to be stored in the database. How do one persist these entities in the database?
Should I create a new table for each pick list? Is there a better solution?
In the past I've created a table that has the Name of the list and the acceptable values, then queried it to display the list. I also include a underlying value, so you can return a display value for the list, and a bound value that may be much uglier (a small int for normalized data, for instance)
CREATE TABLE PickList(
ListName varchar(15),
Value varchar(15),
Display varchar(15),
Primary Key (ListName, Display)
)
You could also add a sortOrder field if you want to manually define the order to display them in.
It depends on various things:
if they are immutable and non relational (think "names of US States") an argument could be made that they should not be in the database at all: after all they are simply formatting of something simpler (like the two character code assigned). This has the added advantage that you don't need a round trip to the db to fetch something that never changes in order to populate the combo box.
You can then use an Enum in code and a constraint in the DB. In case of localized display, so you need a different formatting for each culture, then you can use XML files or other resources to store the literals.
if they are relational (think "states - capitals") I am not very convinced either way... but lately I've been using XML files, database constraints and javascript to populate. It works quite well and it's easy on the DB.
if they are not read-only but rarely change (i.e. typically cannot be changed by the end user but only by some editor or daily batch), then I would still consider the opportunity of not storing them in the DB... it would depend on the particular case.
in other cases, storing in the DB is the way (think of the tags of StackOverflow... they are "lookup" but can also be changed by the end user) -- possibly with some caching if needed. It requires some careful locking, but it would work well enough.
Well, you could do something like this:
PickListContent
IdList IdPick Text
1 1 Apples
1 2 Oranges
1 3 Pears
2 1 Dogs
2 2 Cats
and optionally..
PickList
Id Description
1 Fruit
2 Pets
I've found that creating individual tables is the best idea.
I've been down the road of trying to create one master table of all pick lists and then filtering out based on type. While it works, it has invariably created headaches down the line. For example you may find that something you presumed to be a simple pick list is not so simple and requires an extra field, do you now split this data into an additional table or extend you master list?
From a database perspective, having individual tables makes it much easier to manage your relational integrity and it makes it easier to interpret the data in the database when you're not using the application
We have followed the pattern of a new table for each pick list. For example:
Table FRUIT has columns ID, NAME, and DESCRIPTION.
Values might include:
15000, Apple, Red fruit
15001, Banana, yellow and yummy
...
If you have a need to reference FRUIT in another table, you would call the column FRUIT_ID and reference the ID value of the row in the FRUIT table.
Create one table for lists and one table for list_options.
# Put in the name of the list
insert into lists (id, name) values (1, "Country in North America");
# Put in the values of the list
insert into list_options (id, list_id, value_text) values
(1, 1, "Canada"),
(2, 1, "United States of America"),
(3, 1, "Mexico");
To answer the second question first: yes, I would create a separate table for each pick list in most cases. Especially if they are for completely different types of values (e.g. states and cities). The general table format I use is as follows:
id - identity or UUID field (I actually call the field xxx_id where xxx is the name of the table).
name - display name of the item
display_order - small int of order to display. Default this value to something greater than 1
If you want you could add a separate 'value' field but I just usually use the id field as the select box value.
I generally use a select that orders first by display order, then by name, so you can order something alphabetically while still adding your own exceptions. For example, let's say you have a list of countries that you want in alpha order but have the US first and Canada second you could say "SELECT id, name FROM theTable ORDER BY display_order, name" and set the display_order value for the US as 1, Canada as 2 and all other countries as 9.
You can get fancier, such as having an 'active' flag so you can activate or deactivate options, or setting a 'x_type' field so you can group options, description column for use in tooltips, etc. But the basic table works well for most circumstances.
Two tables. If you try to cram everything into one table then you break normalization (if you care about that). Here are examples:
LIST
---------------
LIST_ID (PK)
NAME
DESCR
LIST_OPTION
----------------------------
LIST_OPTION_ID (PK)
LIST_ID (FK)
OPTION_NAME
OPTION_VALUE
MANUAL_SORT
The list table simply describes a pick list. The list_ option table describes each option in a given list. So your queries will always start with knowing which pick list you'd like to populate (either by name or ID) which you join to the list_ option table to pull all the options. The manual_sort column is there just in case you want to enforce a particular order other than by name or value. (BTW, whenever I try to post the words "list" and "option" connected with an underscore, the preview window goes a little wacky. That's why I put a space there.)
The query would look something like:
select
b.option_name,
b.option_value
from
list a,
list_option b
where
a.name="States"
and
a.list_id = b.list_id
order by
b.manual_sort asc
You'll also want to create an index on list.name if you think you'll ever use it in a where clause. The pk and fk columns will typically automatically be indexed.
And please don't create a new table for each pick list unless you're putting in "relationally relevant" data that will be used elsewhere by the app. You'd be circumventing exactly the relational functionality that a database provides. You'd be better off statically defining pick lists as constants somewhere in a base class or a properties file (your choice on how to model the name-value pair).
Depending on your needs, you can just have an options table that has a list identifier and a list value as the primary key.
select optionDesc from Options where 'MyList' = optionList
You can then extend it with an order column, etc. If you have an ID field, that is how you can reference your answers back... of if it is often changing, you can just copy the answer value to the answer table.
If you don't mind using strings for the actual values, you can simply give each list a different list_id in value and populate a single table with :
item_id: int
list_id: int
text: varchar(50)
Seems easiest unless you need multiple things per list item
We actually created entities to handle simple pick lists. We created a Lookup table, that holds all the available pick lists, and a LookupValue table that contains all the name/value records for the Lookup.
Works great for us when we need it to be simple.
I've done this in two different ways:
1) unique tables per list
2) a master table for the list, with views to give specific ones
I tend to prefer the initial option as it makes updating lists easier (at least in my opinion).
Try turning the question around. Why do you need to pull it from the database? Isn't the data part of your model but you really want to persist it in the database? You could use an OR mapper like linq2sql or nhibernate (assuming you're in the .net world) or depending on the data you could store it manually in a table each - there are situations where it would make good sense to put it all in the same table but do consider this only if you feel it makes really good sense. Normally putting different data in different tables makes it a lot easier to (later) understand what is going on.
There are several approaches here.
1) Create one table per pick list. Each of the tables would have the ID and Name columns; the value that was picked by the user would be stored based on the ID of the item that was selected.
2) Create a single table with all pick lists. Columns: ID; list ID (or list type); Name. When you need to populate a list, do a query "select all items where list ID = ...". Advantage of this approach: really easy to add pick lists; disadvantage: a little more difficult to write group-by style queries (for example, give me the number of records that picked value X".
I personally prefer option 1, it seems "cleaner" to me.
You can use either a separate table for each (my preferred), or a common picklist table that has a type column you can use to filter on from your application. I'm not sure that one has a great benefit over the other generally speaking.
If you have more than 25 or so, organizationally it might be easier to use the single table solution so you don't have several picklist tables cluttering up your database.
Performance might be a hair better using separate tables for each if your lists are very long, but this is probably negligible provided your indexes and such are set up properly.
I like using separate tables so that if something changes in a picklist - it needs and additional attribute for instance - you can change just that picklist table with little effect on the rest of your schema. In the single table solution, you will either have to denormalize your picklist data, pull that picklist out into a separate table, etc. Constraints are also easier to enforce in the separate table solution.
This has served us well:
SQL> desc aux_values;
Name Type
----------------------------------------- ------------
VARIABLE_ID VARCHAR2(20)
VALUE_SEQ NUMBER
DESCRIPTION VARCHAR2(80)
INTEGER_VALUE NUMBER
CHAR_VALUE VARCHAR2(40)
FLOAT_VALUE FLOAT(126)
ACTIVE_FLAG VARCHAR2(1)
The "Variable ID" indicates the kind of data, like "Customer Status" or "Defect Code" or whatever you need. Then you have several entries, each one with the appropriate data type column filled in. So for a status, you'd have several entries with the "CHAR_VALUE" filled in.