Expire data in Azure Table storage without creating a hot key

I'm trying to model Azure Table storage for a simple website that works like so:
Visitors can create a "widget"
Visitors work with one widget and all its associated data together
Widgets expire after some period of time and should be deleted from the table
I'd like to use a single table for this app, but I'm not sure how to model expiration without either (a) scanning the whole table, or (b) creating a hot key. DynamoDB has TTL that would help, but I haven't seen a similar feature for Azure Table storage.
This is my current design:
PartitionKey   RowKey              Other fields
widget:1       self                { expires_at: 1000010 }
widget:1       associated:1        ...
widget:1       associated:2        ...
widget:2       self                { expires_at: 1000020 }
widget:2       associated:1        ...
expires        timestamp:1000010   { PK: "widget:1" }
expires        timestamp:1000020   { PK: "widget:2" }
This makes querying a widget by ID easy because I can simply use the filter PartitionKey eq 'widget:1' to get everything for widget 1.
However, the partition key "expires" becomes a hot key. To find expired widgets, I'd do PK eq 'expires' and RowKey lt 'timestamp:(expiry_threshold)'. For simplicity, I was planning to run the "expire old widgets" routine on every request to the website, so every request would hit the partition that contains "expires".
This app's traffic is very low so this won't be a problem in practice, but I'm wondering if there's a "better" design I'm not thinking of, or if this kind of design is common and necessary.
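For reference, this is roughly what that expiration routine might look like with the azure-data-tables Python SDK, assuming the layout above, zero-padded timestamps in the RowKey (RowKey comparisons are lexical, so unpadded numbers would not order correctly), and a placeholder CONNECTION_STRING; the table and variable names are mine, not part of the design above.

from azure.data.tables import TableClient

table = TableClient.from_connection_string(CONNECTION_STRING, table_name="widgets")

def expire_widgets(expiry_threshold: int) -> None:
    # Find expiry markers older than the threshold (this is the hot "expires" partition).
    flt = "PartitionKey eq 'expires' and RowKey lt 'timestamp:%010d'" % expiry_threshold
    for marker in list(table.query_entities(flt)):
        widget_pk = marker["PK"]  # e.g. "widget:1"
        # Delete everything stored under the widget's own partition.
        for entity in list(table.query_entities("PartitionKey eq '%s'" % widget_pk)):
            table.delete_entity(entity["PartitionKey"], entity["RowKey"])
        # Finally remove the expiry marker itself.
        table.delete_entity(marker["PartitionKey"], marker["RowKey"])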

Related

Design scenario of a DynamoDB table

I am new to DynamoDB, and after reading several docs there is a scenario for which I am not sure of the best approach to designing a table.
Consider that we have some JobOffers and we should support the following data access:
get a specific JobOffer
get all JobOffers from a specific Company sorted by different criteria (newest/oldest/wage)
get JobOffers from a specific Company filtered by a specific city sorted by different criteria (newest/oldest/wage)
get all JobOffers (regardless of any Company !!!) sorted by different criteria (newest/oldest/wage)
get JobOffers (regardless of any Company !!!) filtered by a specific city sorted by different criteria (newest/oldest/wage)
Since we need to support sorting, my understanding is that we should use Query instead of Scan. In order to use Query, we need a primary key to query against. Because we need to support searches like "get all JobOffers without any filters, sorted somehow", what would be a good candidate for the partition key?
As a workaround, I was thinking of adding a new attribute "country" to use as the partition key, but since all JobOffers are currently in one country, all these items would fall under the same partition, so it seems a bit redundant until we support JobOffers from different countries.
Any suggestion on how to design JobOffer table (with PK and GSI/LSI) for such a scenario?
Design of a DynamoDB table is best done with an access-pattern approach - that is, how are you going to access the data in here? You have information X, you need Y.
Also remember that DynamoDB is NOT SQL, and it does not require every item to have the same shape - consider each entry a document, with its PK/SK acting almost like the directory/file structure in a file system.
So for example:
You have user data, with things like: Avatar data (image name, image size, image type), Login data (salt/pepper hashes, email address, username), and Post history (post title, identifier, content, replies). Each user will only have 1 Avatar item and 1 Login item, but many Post items.
You know that from the user page you are always going to have the user ID, 100% of the time. This should then be your PK - your hash key / partition key. Then let the rest of the things you need inform your sort key / range key.
PK           SK
USER#123456  AVATAR    - Attributes: (image name, image size, image type)
USER#123456  LOGIN     - Attributes: (salt/pepper hashes, email address, username)
USER#123456  POST#123  - Attributes: (post title, identifier, content, replies)
USER#123456  POST#125  - Attributes: (post title, identifier, content, replies)
USER#123456  POST#193  - Attributes: (post title, identifier, content, replies)
This way you can do a query with the user ID and get ALL the user's data back. Or if you are on a page that just displays their post history, you can do a query against the user ID with SK begins_with POST and get back all their posts.
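For example, with boto3 (the "AppTable" name and the generic "PK"/"SK" attribute names here are illustrative):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("AppTable")

# Everything for the user: avatar, login and all posts in one query.
everything = table.query(KeyConditionExpression=Key("PK").eq("USER#123456"))["Items"]

# Just the post history: narrow the sort key with the POST# prefix.
posts = table.query(
    KeyConditionExpression=Key("PK").eq("USER#123456") & Key("SK").begins_with("POST#")
)["Items"]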
You can put in an inverted index (SK -> PK and vice versa) and then do a query on POST#193 and get back the user ID. Or if you have other PK types with POST#193 as the SK, you get more information there (like a REPLIES#193 PK or something).
The idea here is that you have X bits of information, and you need to craft your table so you can retrieve as much as possible with just that information; using prefixes on your SKs then lets you narrow the results.
Note!
Sometimes this means a duplication of information! You may have the same information under two sets of keys. This is OK and kind of expected when you start getting into really complex relationships. You can mitigate it somewhat with indexes, but you should aim to avoid them where possible, as they do introduce a bit of lag in terms of data propagation (it's tiny, but it can add up).
So you have your list of things you want to get for your dynamo. What will you always have to tie them together? What piece of data do you have that will work?
You can do the first 3 with a Company PK identifier and a reverse index. That will let you look up all of a company's jobs or, using the reverse index, a specific job. Or, if you always know the company when looking up a specific job, the main index on its own is enough.
Company# - Job# - data data data
You then do the sorting on your own, OR you add some sort of sort value to the Job# key - sort keys are inherently sorted, after all: Company# - Job#1234#UNITED_STATES.
Of course this will only work for one sort order at a time. You can make more than one index, but again - data sync lag is a real possibility.
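As a rough illustration of letting the sort key drive the ordering (boto3 again; the "JobOffers" table, the "PK"/"SK" attribute names, and a sort key that embeds the posting date are assumptions of mine, not the asker's schema):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("JobOffers")

# If the sort key embeds the posting date, e.g. "JOB#2023-06-01#321", DynamoDB returns
# items in sort-key order; ScanIndexForward=False flips that to newest-first.
newest_first = table.query(
    KeyConditionExpression=Key("PK").eq("Company#1234") & Key("SK").begins_with("JOB#"),
    ScanIndexForward=False,
)["Items"]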
But how do you do this regardless of Company? Well, you can have another index with your searchable attribute (Country, for example) as the PK, and then you can query that.
Or do you have another set of data that can tie this all together? Do you have another thing that can reach it all?
If not, you may just end up with two kinds of items in your dynamo:
Company#1234 - Job#321 - details
Company#1234 - Country#United_States - job#321, job#456, job#1234
Company#1234 - Country#England - job#992, job#123, job#19231
Your reverse index here would apply - you could do a query on PK: Country#United_States and you'd get back:
Country#United_States - Company#1234 - job#321, job#456, job#1234
Country#United_States - Company#4556
Country#United_States - Company#8322
This isn't a relational database, however! So you have to do one of two things: use those job#s to then query each company and filter the jobs down to what you want (bad - we're trying to avoid multiple queries!), or make each job# an attribute on the country SKs containing a copy of the relevant data in a map format {job title, job#, country, company, salary}. Then when someone clicks on a job to go to the details, it makes a direct call straight to the job query, gets the details to display, and it's good.
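A sketch of that country lookup with boto3, assuming an inverted global secondary index (hypothetically named "SK-PK-index") whose hash key is the base table's sort key; the index, table and attribute names are illustrative:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("JobOffers")

# All country items across companies; each item would carry its job maps
# ({job title, job#, country, company, salary}) as attributes, as described above.
by_country = table.query(
    IndexName="SK-PK-index",
    KeyConditionExpression=Key("SK").eq("Country#United_States"),
)["Items"]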
Again, it all comes down to access patterns. What do you have, and how can you arrange it in a way that lets you get what you need fast?

Oracle APEX - Data Modeling & Primary Keys

I'm creating a rather large APEX application which allows managers to go in and record statistics for associates in the company. Currently we have a database in Oracle with data from AD which holds all the associates' information: name, manager, employee ID, etc.
Now I'm responsible for creating and modeling a table that will house all their stats for each employee. The table I have created has 90+ columns in it. Some contain data such as:
Documents Processed
Calls Received
Amount of Doc 1 Processed
Amount of Doc 2 Processed
and the list goes on for well over 90 attributes. So here is my question:
When creating this table in my application with so many different columns, how would I go about choosing an appropriate primary key? Should I link it to our employee table using the employee's identification, which is unique (each has an associate number)?
Secondly, how can I create these tables (and possibly a form) so that the statistics I enter are associated with the actual individual?
I have ordered two books from Amazon on data modeling since I am new to APEX and DBA design. Not a fresh chicken, but new enough to need some guidance. An additional problem I am running into is that each form can have only 60 fields on it, so I had thought about splitting my 90+ columns into separate tables for different functions.
Thanks
APEX 4.2 allows for 200 items per page.
oracle apex component limits
A couple of questions come to mind:
Are you sure that the employee IDs are not recyclable? If these IDs are unique and not recycled... you've found yourself a good primary key.
What do you plan on doing when you decide to add a new metric? Seems like you might have to add a new column to your rather large and likely not normalized table.
I'd recommend a vertical table for your metrics; you can use Oracle's PIVOT function to make your data appear more like a horizontal table.
If you went this route you would store your employee ID in one column, your metric key in another, and the value in a third.
I'd recommend that you create a metric table consisting of a primary key, a metric label, an active indicator, creation timestamp, creation user id, modified timestamp, modified user id.
This metric table will allow you to add new metrics, change the name of the metric, deactivate a metric, and determine who changed what and when.
This would be a much more flexible approach in my opinion. You may also want to think about audit logs.

How should I design my SSAS cube to handle non-aggregatable data?

I'm working on an SSIS/SSAS project to build a BI solution.
One of my data sources contains information about a Service Desk.
A user can create a new request related to a service catalog (for example because his laptop crashed).
This generates a new row in the Request table (creation date, comment, tracking number, etc.).
To resolve the issue, a few actions will be performed, and these actions are recorded in the Action table (there is a one-to-many relationship between the Request and Action tables).
An action can be: "try to format computer", "change hard drive", etc.
In the production environment a Request contains approximately 10 to 100 actions.
I'm facing a problem designing this because many columns of my fact table cannot be aggregated.
In fact there are many date columns, tracking numbers (strings), boolean values, and only a few SUM-able attributes.
Here is an extract of the dw model :
FactRequest :
ID (DW primary key)
Business Key (original PK)
Request number (string)
Begin date (datetime)
End date (datetime)
Max resolution date (datetime)
Time to solve request
Comment (string)
Delay (int)
...
FactAction :
ID (DW primary key)
Business Key (original PK)
Begin date (datetime)
End date (datetime)
Name (string)
Time to solve action
...
I know adding non-aggregatable data to a fact table is not the best solution.
In my SSAS project, I created a new cube based on my FactRequest table.
It works fine except for "string" attributes such as the request identifier.
Should I use an SSAS "fact dimension" to create a "Request" dimension based on my FactRequest table?
Any ideas?
Thanks so much,
Sounds like you are lacking specific requirements (which is very common in BI projects). Is the textual data required to be displayed in the report at all? If yes: is it required only in some detail view?
Columns like ID, Business Key, and Request number typically have little value in your cube. This data is only interesting for detailed reports (e.g. getting all actions taken for a certain request ID), and such lists often require no aggregates. You do not need a cube for lists like that; you can query the database directly with SQL.
Only if you require a summary report (e.g. the average time taken to solve a request per weekday) could the cube make sense - and even then it may not be worth the effort of an SSAS database if you can get almost the same query response time with direct SQL queries.

Dynamic database routing in Django

In my database I have a Customer table that all other tables are foreign keyed on.
class Customer(models.Model):
    ...

class TableA(models.Model):
    customer = models.ForeignKey(Customer)
    ...

class TableB(models.Model):
    customer = models.ForeignKey(Customer)
    ...
I'm trying to implement a database router that determines the database to connect to based on the primary key of the Customer table. For instance, ids in the range 1 - 100 will connect to Database A, ids in the range 101 - 200 will connect to Database B.
I've read through the Django documentation on routers but I'm unsure if what I'm asking is possible. Specifically, the methods db_for_read(model, **hints) and db_for_write(model, **hints) work on the type of the object. This is useless for me as I need routing to be based on the contents of the instance of the object. The documentation further states that the only **hints provided at this moment are an instance object where applicable and in some cases no instance is provided at all. This doesn't inspire me with confidence as it does not explicitly state the cases when no instance is provided.
I'm essentially attempting to implement application level sharding of the database. Is this possible in Django?
Solve the chicken and egg problem
You'll have to solve the chicken and egg problem when saving a new Customer. You have to save to get an id, but you have to know the id to know where to save.
You can solve that by saving all Customers to DatabaseA first, then checking the id and saving to the target db as well. See Django multidb: write to multiple databases. If you do this consistently, you won't run into problems. But make sure to pay attention to deleting Customers.
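A minimal sketch of that two-step save, assuming database aliases "database_a" and "database_b" and an illustrative shard_for() helper that maps an id to its target alias (these names are mine, not Django's or the question's):

def shard_for(customer_id):
    # Id ranges from the question: 1-100 -> Database A, 101-200 -> Database B.
    return "database_a" if customer_id <= 100 else "database_b"

customer = Customer()              # plus whatever fields the model defines
customer.save(using="database_a")  # the first save allocates the primary key
target = shard_for(customer.pk)
if target != "database_a":
    # Write the same row into the shard it actually belongs to.
    customer.save(using=target, force_insert=True)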
Then route using **hints
The routing problem that's left is pretty straight forward if an instance is in the hints. Either it is a Customer and you'll return 'DatabaseA' or it has a customer and you'll decide on its customer_id or customer.id.
Try and remember, there is no spoon.
When there is no instance in the hints but it is a model from your app, raise an error so you can change the code that created the queryset. You should always provide hints when they aren't added automatically.
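A minimal sketch of such a router along those lines; the database aliases, the app label, and the shard_for() helper (same as in the sketch above) are all illustrative, not part of Django or the question:

from myapp.models import Customer  # the model from the question; the import path is illustrative

def shard_for(customer_id):
    return "database_a" if customer_id <= 100 else "database_b"

class CustomerShardRouter:
    app_label = "myapp"  # only route models belonging to this app

    def _pick(self, model, hints):
        if model._meta.app_label != self.app_label:
            return None  # let Django's defaults handle everything else
        instance = hints.get("instance")
        if instance is None:
            # No instance hint: fail loudly so the calling code gets fixed.
            raise ValueError("No instance hint for %s; cannot route" % model.__name__)
        if isinstance(instance, Customer):
            return "database_a"
        # Every other table is keyed on Customer, so shard by its customer id.
        return shard_for(instance.customer_id)

    def db_for_read(self, model, **hints):
        return self._pick(model, hints)

    def db_for_write(self, model, **hints):
        return self._pick(model, hints)

    def allow_relation(self, obj1, obj2, **hints):
        return True  # rows for the same customer end up in the same database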
What will really bake your cookie
If for most queries you have a known Customer, this is OK. But think about queries like TableA.objects.filter(customer__name__startswith='foo').

Database design: should I use varchar for the primary key in this case?

I'm building a webpage where users will be able to create accounts, and every account will have its own subdomain. So there could be URLs like this:
www.user1.domain.com
www.user2.domain.com
...
They will have their own pages too, like this:
www.user1.domain.com/url-1/
www.user1.domain.com/url-2/
www.user2.domain.com/url-3/
...
So I need to store account_url and page_url in database.
I did it like this: I have users, accounts and pages tables.
This is what my tables look like:
USERS:
user_id PK
user_name
user_pass
...
ACCOUNTS:
account_id PK
user_id FK
account_url
account_name
account_type
...
PAGES:
page_id PK
user_id FK
page_url
page_name
page_content
...
Now the problem is this: since I get a URL like this:
www.user1.domain.com/page-url/
The only information I can fetch from the URL is account_url and page_url, since that is what's in the URL; the dispatcher/router gets these two variables. account_url is the subdomain, and page_url is the segment after the domain.
Since there will be multiple users, I always need to get the user_id so I can update/delete rows that belong to that user. So I need to update page_content where user_id belongs to this user and page_url is the one from the URL.
But I don't have the user_id. When I want to update a page's content, I first need to find the user_id, like this:
SELECT user_id FROM accounts WHERE account_url = something
And then when I have the user_id I can update the content of a page or do any other action.
So is this a good design?
It's normalized and clean, but with this approach, in every controller action I need to fetch the user_id first just to be able to run the real query I wanted.
Now, I could use account_url as the primary key and have all tables relate to that primary key. Then when I get the URL I already know the primary key, since it's in the URL.
Is this a good case for using the URL value (a varchar) as the primary key, or am I doing something wrong?
I prefer to always have my primary ID keys as integers for joins. That said, there are a bunch of ways to help make your site snappy.
You could index the account_url column so lookups are more efficient.
Or you could cookie the user's ID and use that value instead of querying the database each time. Granted, you would want to do some session tracking so someone can't spoof someone else.
One presumes the user will be in control of the name of the subdomain, so embedding the user ID into the subdomain name probably wouldn't be effective; otherwise that would also be an option.
You could keep user ID and user account_url in a separate table and cache that table so you don't hit the database for the vast majority of lookups.
My recommendation would be to keep the integer primary key, index account_url, and identify a page-load target time; say, completing all database access and page rendering in under 1.5 seconds. When your site starts to respond over that threshold, you can analyze it to see where the actual problems lie and address them then.
In general, leave the database normalized as much as possible. If and when you can provably show (using metrics and actual measurements) that you need to denormalize for performance reasons, then think about doing that.
In this case, if you have an m-1 relationship between a domain and a user's account, you can effectively treat the domain as a user ID; you just have to join things in the right way (and by m-1, I mean a single domain can only be "owned" by 1 user).
The key thing is that you don't need to fetch the user_id separately, because you can get to it by joining the ACCOUNTS table as needed, since it ties the domain to the user_id.
Lastly, to your question about using the domain as the primary key: you can do this, since a domain is required to be unique, but you get minimal overhead and much more flexibility by using a surrogate primary key.
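A sketch of that single-query approach using the Python DB-API (sqlite3-style "?" placeholders; adjust for your driver, and conn stands in for however the app already obtains its connection). The owner is resolved through the accounts table inside the same statement instead of a separate SELECT first:

def update_page(conn, subdomain, page_url, new_content):
    # One round trip: the subquery ties the subdomain (account_url) to its user_id.
    conn.execute(
        """
        UPDATE pages
           SET page_content = ?
         WHERE page_url = ?
           AND user_id = (SELECT user_id FROM accounts WHERE account_url = ?)
        """,
        (new_content, page_url, subdomain),
    )
    conn.commit()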
You have two totally separate issues. Mapping subdomains and pages to a user is the easier of the two. The more difficult issue is "state": you need a state database (or similar module) to keep track of which user is currently logged in and whether they are still logged in when an update is received.
JZ touched on this in his comment. Don't confuse these two issues; they are separate and should be treated as such.
