Separate tables vs. one big one in Azure table storage (NoSQL) - database

I'm using Azure Tables and I'm trying to figure out how I should organize my data.
Each entity in a table has a PartitionKey and a RowKey and my understanding is that partitions should be used to organize similar objects for scalability. In the example on the site they used movie entities where the category (action, sci fi, etc) is the PartitionKey while the title (fast and the furious, etc.) is the RowKey.
Going with the above example, let's say we have no duplicate movies, and you also wanted to keep track of each particular movie's rental history, i.e. location, due date, customer, etc.
Would it be bad practice to have one table to store all of this and use a separate partition for the rental entities? To be clear, I'm talking about a movie item and its corresponding history items together in the same denormalized table in separate partitions.
Would there be an advantage to using two separate tables and if not, then what is the point of tables?
EDIT:
PartitionKey| RowKey | prop0 | prop1 |...
------------------------------------------------...
SciFi | StarWars| foo0: bar0 | foo1: bar1|...
Rental | StarWars| foo0: bar0 | foo1: bar1|...

First of all, the idea of table storage is that you can "dump" large amounts of data into it cheaply, knowing that its search facilities are very limited: you can't issue SQL queries, so it is not an RDBMS; it is simply a means of storing a lot of data.
Indeed, PartitionKey and RowKey are the only indexed columns you get with Azure Table storage, which means that searching by PartitionKey or RowKey will be much quicker than searching by any other property. If you need to retrieve data quickly by arbitrary fields, then blob or table storage is a no-go; if you just want to keep records for audit or history purposes, then it fits well.
However, if you want to use it for a video store, for instance, and need to retrieve a client's details, then as I mentioned it's bad practice; you'd be better off using an RDBMS. Lastly, you can't do JOINs or other RDBMS-style queries between two tables in table storage.
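To make the single-table layout from the question's edit concrete, here is a minimal sketch using a plain Python dict to stand in for the table (the entity names and rental properties are made up for illustration; in the real service, `(PartitionKey, RowKey)` is the only indexed access path and everything else is a scan):

```python
# One table; movies and rental-history entities separated by PartitionKey.
table = {}  # (PartitionKey, RowKey) -> entity properties

def insert_entity(partition_key, row_key, **props):
    table[(partition_key, row_key)] = props

# Movie entities live in genre partitions...
insert_entity("SciFi", "StarWars", year=1977)
# ...while rental-history entities for the same movie live in a "Rental"
# partition in the same table. RowKey must be unique within a partition,
# so the rental date is folded into it.
insert_entity("Rental", "StarWars|2024-01-15", customer="alice", due="2024-01-22")

# Point lookup by (PartitionKey, RowKey): the fast, indexed path.
movie = table[("SciFi", "StarWars")]

# Listing one movie's rental history means scanning within a partition --
# workable if the partition stays small, but there is no secondary index.
history = {k: v for k, v in table.items()
           if k[0] == "Rental" and k[1].startswith("StarWars")}
```

The point of the sketch is that nothing about the storage model stops both entity kinds living in one table; the partition key is what keeps their access paths separate.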

Related

Modelling Cassandra Data Structure for both real-time and search from all?

My project serves both real-time data and past data. It works like a feed: it shows real-time data through a socket, and past data (if you scroll down) through a REST API. To get real-time data efficiently, I set the date as the partition key and the time as the clustering key. For the real-time service, I think this data structure is well modeled. But I also have to get a limited number of recent records (like pagination), and it should be able to show the whole history if requested. To serve data like recent 0~20 / 20~40 / 40~60 through REST API calls, my data-serving server has to remember what it showed before, as a bookmark, so it can load the next 20 records continuously. If it were SQL I would use IDs or page-and-offset tricks, but I cannot do that with Cassandra. So I tried:
SELECT * FROM examples WHERE date<='DATEMARK' AND create_at < 'TIMEMARK' AND entities CONTAINS 'something' limit 20 ALLOW FILTERING;
But since date is the partition key, I cannot use the comparison operators > and < on it. The past data could have been created very far back from now.
Can I satisfy my real-time+past requirements with Cassandra? I wonder if I have to make another DB for accessing past data.
Yes you can, but you must change your mindset and think in NoSQL patterns. In this scenario you can store your data in a duplicated manner: save the same data in another table, with a different partition key and clustering column that satisfies the other query's needs.
We have been using Cassandra extensively for showing real-time + past data. I'd urge you not to use the ALLOW FILTERING option in Cassandra, as it's not good practice. Instead, design your schema so that your queries don't need it. Suppose you have a schema:
Created_date | Created_time | user_id | Country | Name | Activity
In this schema, you are using Created_date, Created_time, user_id, and Country as the primary key, but you want the user_ids for a particular country. In this case, even though you have made Country part of the primary key, you can't query like:
"SELECT * from table where Created_date='2020-02-14' and Country ='india' allow filtering ";
If you query in this pattern you will lose data in your result set and will get errors when working with big data, or you'll be relying on the ALLOW FILTERING option, which is not recommended. So you need to change the structure of your schema to something like:
Created_date | Country | City | Created_time | user_id | Name | Activity
"SELECT * from table where created_date='2020-02-14' and country='india'";
Using this structure will give you a very consistent result, and you will never face those errors. Suppose you want all the data for the last seven days: in that case, loop over the days, run the query for each day, and accumulate the results in some data structure. Hope that makes sense.
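The "write it twice" pattern both answers describe can be sketched like this (hypothetical feed events, plain dicts standing in for the two Cassandra tables; the point is that each table's partition key is shaped for one access pattern):

```python
# Two denormalized "tables", each partitioned for a different query.
feed_by_date = {}     # partition key: date -> events (real-time feed)
feed_by_country = {}  # partition key: (date, country) -> events (filtered reads)

def save_event(date, country, event):
    # Duplicate every write so each read path hits its own partition key
    # directly -- no ALLOW FILTERING, no cross-partition scan.
    feed_by_date.setdefault(date, []).append(event)
    feed_by_country.setdefault((date, country), []).append(event)

save_event("2020-02-14", "india", {"user_id": 1, "activity": "login"})
save_event("2020-02-14", "india", {"user_id": 2, "activity": "post"})
save_event("2020-02-14", "france", {"user_id": 3, "activity": "login"})

# Equivalent of: SELECT * FROM feed_by_country
#                WHERE created_date='2020-02-14' AND country='india';
india = feed_by_country[("2020-02-14", "india")]
```

Storage is traded for query flexibility, which is the standard Cassandra answer to "I need the same data served two ways."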

Property address database design in DynamoDB NoSQL

We have several terabytes of address data and are investigating the possibility of storing this in a DynamoDB NoSQL database. I've done quite a bit of reading on DynamoDB and NoSQL in general, but am coming from many years of MS SQL and am struggling with some of the NoSQL concepts.
My biggest question at this point is how to set up the table structure so that I can accommodate the various different ways the data could be queried. For example, in regular SQL I would expect some queries like:
WHERE Address LIKE '%maple st%' AND ZipCode = 12345
WHERE Address LIKE '%poplar ln%' AND City = 'Los Angeles' AND State = 'CA'
WHERE OwnerName LIKE '%smith%' AND CountyFIPS = '00239'
Those are just examples. The actual queries could be any combination of those various fields.
It's not clear to me what my index should look like or how the table (or tables) should be structured. Can anyone get me started on understanding how that could work?
The post is relatively old, but I will try to give you an answer (maybe it will be helpful for someone having similar issues in the future).
DynamoDB is not really meant to be used in the way you describe. Its strength is fast (smoking fast, in fact) look-ups of key/value pairs. To take your example, if you wanted to really quickly look up the information associated with a specific address, you could make the HashKey a string containing that address and use it for the look-up.
Things start to get complicated when you want to do queries (or scans) in DynamoDB; you can read about them here: Query and Scan in DynamoDB.
The gist is that scans/queries are really expensive when not performed on either the HashKey or the HashKey+RangeKey combo (range keys are basically composite keys).
In other words, I am not sure DynamoDB is the right way to go. For smoking-fast search functionality I would consider using something like Lucene. If you configure your indexes wisely, you will be amazed how fast it works.
Hope this helps.
Edit:
Seems Amazon has now added support for secondary indices:
See here
DynamoDB was built to be utilized in the way the question author describes; refer to this LINK, where the AWS documentation describes creating a secondary index key like this:
[country]#[region]#[state]#[county]#[city]#[neighborhood]
The partition key could be shaped something like this as well, based on what you want to look up.
In DynamoDB, you create the joins before you create the table. This means that you have to think about all the ways you intend to search for your data, create the indexes, and query your data using them.
AWS created NoSQL Workbench to help teams do this. There are a few UI bugs in that application at the time of this writing; refer to LINK for more information on the bugs.
To review some of the queries you mentioned, I'll share a few possibilities in which you can create an index to create that query.
Note: NoSQL often means denormalized data, but not necessarily.
There are limits on how keys should be shaped so that DynamoDB can partition across physical servers to scale; refer to the partition keys documentation for more info.
The magic of DynamoDB is a well-thought-out model that can also handle new queries after the table is created and in use in production. There are a great many posts and videos online that explain how to do this.
Here is one with Rick Houlihan: link. Rick Houlihan is the principal designer of DynamoDB, so go there for gospel.
To make the queries you're attempting, one would create multiple keys, mainly an initial partition key and a secondary (sort) key. Rick recommends keeping the names generic, like PK and SK.
Then try to shape the PK with a great deal of uniqueness; e.g. a partition key of just a zip code, PK: "12345", could accumulate a massive amount of data, more than the 10 GB quota that applies to any single partition.
Example 1: WHERE Address LIKE '%maple st%' AND ZipCode = 12345
For example 1, we could shape a partition key of PK: "12345:maple"
Then just querying the PK "12345:maple" would retrieve all the data with that zip code and that street. There will be many different PKs, and that is what DynamoDB does well: scale horizontally.
Example 2: WHERE Address LIKE '%poplar ln%' AND City = 'Los Angeles' AND State = 'CA'
In example 2, we could then use the secondary (sort) key to be more specific, such as PK: "12345:poplar", SK: "losangeles:ca:other:info:that:helps".
Example 3: WHERE OwnerName LIKE '%smith%' AND CountyFIPS = '00239'
For example 3, we don't have a street name. We would need to know it to query the data that way, but we may not have it at search time. This is where you need to fully understand your base query patterns and shape the PK so that it is easily known at query time, while still being unique enough that you don't exceed the partition limits. Having the street name in the key would probably not be optimal here; it all depends on what queries are required.
In this last example, it may be more appropriate to add a global secondary index, which just means defining a new partition key and sort key that map to a data attribute (column) like CountyFIPS.
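The three examples can be sketched together with plain dicts standing in for the DynamoDB table and a global secondary index (all item values and key shapes here are made up for illustration):

```python
# Base table keyed by composite PK ("zip:street") plus SK, and a
# GSI-like structure mapping CountyFIPS back to (PK, SK) pairs.
table = {}        # PK -> {SK -> item}
gsi_by_fips = {}  # GSI partition key: CountyFIPS -> [(PK, SK), ...]

def put_item(pk, sk, item):
    table.setdefault(pk, {})[sk] = item
    if "CountyFIPS" in item:
        # In real DynamoDB the service maintains the GSI on write;
        # here we do it by hand.
        gsi_by_fips.setdefault(item["CountyFIPS"], []).append((pk, sk))

put_item("12345:maple", "losangeles:ca", {"owner": "smith", "CountyFIPS": "00239"})
put_item("12345:maple", "pasadena:ca",   {"owner": "jones", "CountyFIPS": "00241"})

# Example 1: everything on maple st in zip 12345 -- one partition read.
maple = table["12345:maple"]

# Example 3: query by CountyFIPS via the secondary index instead of the PK.
in_county = [table[pk][sk] for pk, sk in gsi_by_fips["00239"]]
```

The design work is all in choosing those key shapes up front; the reads themselves are then single-key look-ups, which is what DynamoDB is optimized for.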

Database design for a product aggregator

I'm trying to design a database for a product aggregator. Each product has information about where it comes from, what it costs, what type of thing it is, price, color, etc. Users need to be able to search and filter results based on any of those product categories. I also expect to have a large number of users. My initial thought was having one big table with every product in it, with a column for each piece of information and an index on anything I need to be able to search by, but I think this might be inefficient with a lot of users pounding on this one table. My other thought was to organize the database to promote a tree-like navigation of tables, but because you can search by anything, I'm not sure how I would organize the tables.
Any thoughts on some good practices?
One table of products - databases are designed to have lots of users pounding on tables.
(from the comments)
You need to model your data. This comes from looking at all the data you have and determining what is related to what (a table is called a relation because all the attributes in a row are related to a candidate key). You haven't really given enough information about the scope of the data (unstructured?) you have on these products and how it varies. Are you going to have difficulties because Shoes have brand, model, size and color, but Desks only have brand, model and finish? All this is going to inform your data model. Typically you have one products table, and other things link to it.
Some of those attributes will be foreign keys to lookup tables; others (price) would be simple scalars. With appropriate indexing you'll be fine. For advanced analytics, consider a dimensionally modeled star schema, but perhaps not for your live transaction system; it depends what your data flow/workflow/transactions are. Or consider adopting some of its principles in your transactional database. Ralph Kimball is a good source of information on dimensional modeling.
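A minimal sketch of the "one products table plus lookup tables" advice, using sqlite3 (all table and column names are invented for the example):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE brand (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE product (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    brand_id INTEGER REFERENCES brand(id),  -- FK to a lookup table
    color    TEXT,
    price    REAL                           -- simple scalar attribute
);
-- Index the columns users actually filter on.
CREATE INDEX idx_product_color ON product(color);
CREATE INDEX idx_product_price ON product(price);
""")

db.execute("INSERT INTO brand (id, name) VALUES (1, 'Acme')")
db.executemany(
    "INSERT INTO product (name, brand_id, color, price) VALUES (?, ?, ?, ?)",
    [("Desk", 1, "oak", 199.0), ("Chair", 1, "black", 89.0)],
)

# A typical filtered search against the single products table.
rows = db.execute(
    "SELECT p.name, b.name FROM product p "
    "JOIN brand b ON b.id = p.brand_id WHERE p.price < 100"
).fetchall()
```

One table of products, lookup tables for repeated attributes, and indexes on the search columns is the whole pattern.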
I don't see any need for the tree structure here. You can do it with a single table.
If you insist on a tree structure with a hierarchy, here is an example to get you started.
For text-based search, and for ease of startup and design, I strongly recommend Apache Solr. The Solr API is easy to use (especially with JSON). Databases do text search poorly; I would instead recommend that you just make sure they respond to primary/unique key queries properly, and those are the fields you should index.
One table for the products, and another table for the product category hierarchy (you don't specifically say you have this but "tree-like navigation of tables" makes me think you might).
I can see you might be concerned about over-indexing causing problems if you plan to index almost every column. In that case, it might be best to index only the top 5 or 10 columns you think users are most likely to search on, unless it's possible for a user to search on ANY column; in that case you might want to look at building a data warehouse. Maybe you'll want to look into data cubes to see if those will help...?
For hierarchical data, you need a PRODUCT_CATEGORY table looking something like this:
ID
PARENT_ID
NAME
Some sample data:
ID  PARENT_ID  NAME
1   NULL       ROOT
2   1          SOCKS
3   1          HELICOPTER PARTS
4   2          ARGYLE
Some SQL engines (such as Oracle) allow you to write recursive queries to traverse the hierarchy in a single query. In this example, the root of the tree has a PARENT_ID of NULL, but if you don't want this column to be nullable, I've also seen -1 used for the same purposes.
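The recursive traversal mentioned above can be shown end to end with sqlite3, which supports recursive CTEs via `WITH RECURSIVE` (the sample rows are the ones from the answer):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE product_category (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT)"
)
db.executemany("INSERT INTO product_category VALUES (?, ?, ?)", [
    (1, None, "ROOT"),               # root of the tree: PARENT_ID is NULL
    (2, 1, "SOCKS"),
    (3, 1, "HELICOPTER PARTS"),
    (4, 2, "ARGYLE"),
])

# All descendants of SOCKS (id 2), fetched in a single query.
rows = db.execute("""
    WITH RECURSIVE subtree(id, name) AS (
        SELECT id, name FROM product_category WHERE id = 2
        UNION ALL
        SELECT c.id, c.name
        FROM product_category c JOIN subtree s ON c.parent_id = s.id
    )
    SELECT name FROM subtree
""").fetchall()
# rows contains SOCKS and its descendant ARGYLE.
```

Oracle's `CONNECT BY` syntax differs, but the idea of walking the `PARENT_ID` chain in one query is the same.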

Storing Preferences/One-to-One Relationships in Database

What is the best way to store settings for certain objects in my database?
Method one: Using a single table
Table: Company {CompanyID, CompanyName, AutoEmail, AutoEmailAddress, AutoPrint, AutoPrintPrinter}
Method two: Using two tables
Table Company {CompanyID, CompanyName}
Table2 CompanySettings {CompanyID, AutoEmail, AutoEmailAddress, AutoPrint, AutoPrintPrinter}
I would take things a step further...
Table 1 - Company
CompanyID (int)
CompanyName (string)
Example
CompanyID 1
CompanyName "Swift Point"
Table 2 - Contact Types
ContactTypeID (int)
ContactType (string)
Example
ContactTypeID 1
ContactType "AutoEmail"
Table 3 Company Contact
CompanyID (int)
ContactTypeID (int)
Addressing (string)
Example
CompanyID 1
ContactTypeID 1
Addressing "name@address.blah"
This solution gives you extensibility as you won't need to add columns to cope with new contact types in the future.
SELECT
[company].CompanyID,
[company].CompanyName,
[contacttype].ContactTypeID,
[contacttype].ContactType,
[companycontact].Addressing
FROM
[company]
INNER JOIN
[companycontact] ON [companycontact].CompanyID = [company].CompanyID
INNER JOIN
[contacttype] ON [contacttype].ContactTypeID = [companycontact].ContactTypeID
This would give you multiple rows for each company: a row for "AutoEmail", a row for "AutoPrint", and maybe in the future a row for "ManualEmail", "AutoFax", or even "AutoTeleport".
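The three-table design and the join above can be run end to end with sqlite3 (sample values taken from the answer, plus one made-up "AutoPrint" row to show the multiple-rows-per-company result):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE company        (CompanyID INTEGER PRIMARY KEY, CompanyName TEXT);
CREATE TABLE contacttype    (ContactTypeID INTEGER PRIMARY KEY, ContactType TEXT);
CREATE TABLE companycontact (CompanyID INTEGER, ContactTypeID INTEGER, Addressing TEXT);

INSERT INTO company        VALUES (1, 'Swift Point');
INSERT INTO contacttype    VALUES (1, 'AutoEmail'), (2, 'AutoPrint');
INSERT INTO companycontact VALUES (1, 1, 'name@address.blah'), (1, 2, 'PrinterA');
""")

# The join from the answer: one result row per contact type per company.
rows = db.execute("""
    SELECT company.CompanyName, contacttype.ContactType, companycontact.Addressing
    FROM company
    INNER JOIN companycontact ON companycontact.CompanyID = company.CompanyID
    INNER JOIN contacttype    ON contacttype.ContactTypeID = companycontact.ContactTypeID
""").fetchall()
```

Adding a new contact type is just a new row in `contacttype` plus a row in `companycontact`; no schema change is needed.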
Response to HLEM.
Yes, this is indeed the EAV model. It is useful where you want to have an extensible list of attributes with similar data. In this case, varying methods of contact with a string that represents the "address" of the contact.
If you didn't want to use the EAV model, you should next consider relational tables, rather than storing the data in flat tables. This is because this data will almost certainly extend.
Neither the EAV model nor the relational model significantly slows queries. Joins are actually very fast compared with, for example, a sort. Returning a record for a company with all of its associated contact types, or indeed a specific contact type, would be very fast. I am working on a financial MS SQL database with millions of rows and similar data models, and have no problem returning significant amounts of data with sub-second timings.
In terms of complexity, this isn't the most technical design in terms of database modelling and the concept of joining tables is most definitely below what I would consider to be "intermediate" level database development.
I would consider whether you need one or two tables based on the following criteria:
First, are you close to the record storage limit? Then two tables, definitely.
Second, will you usually be querying the information you plan to put in the second table most of the time you query the first table? Then one table might make more sense. If you usually do not need the extended information, a separate (and less wide) table should improve performance on the main data queries.
Third, how strong is the possibility that you will ever need multiple values? If it is one-to-one now, but is something like an email address or phone number that has a strong chance of morphing into multiple rows, go ahead and make it a related table. If you know there is no chance, or only a small one, then it is OK to keep one table, assuming the table isn't too wide.
EAV tables look like they are nice and will save future work, but in reality they don't. Generally, if you need to add another type, you still have to do future work to adjust queries, etc. Writing a script to add a column takes all of five minutes; the rest of the work will be there regardless of the structure. EAV tables are also very hard to query when you don't know how many records you will need to pull, because normally you want them on one line and will get the information by joining to the same table multiple times. This causes performance problems and locking, especially if this table is central to your design. Don't use this method.
It depends on whether you will ever need more information about a company. If you notice yourself adding fields like CompanyPhoneNumber1, CompanyPhoneNumber2, etc., then method 2 is better, as you would separate your entities and just reference a company id. If you do not plan to make these changes and you feel this table will never change, then method 1 is fine.
Usually, if you don't have data duplication, then a single table is fine.
In your case you don't, so the first method is OK.
I use one table if I estimate the data from the "second" table will be used in more than 50% of my queries. Use two tables if I need multiple copies of the data (e.g. multiple phone numbers, email addresses, etc.).

Database design query

I'm trying to work out a sensible approach for designing a database where I need to store a wide range of continuously changing information about pets. The categories of data can be broken down into, for example, behaviour, illness, etc. Data will be submitted on a regular basis relating to these categories, so I need to find a good way to design the db to accommodate this efficiently. A simple approach would be just to store multiple records for each pet within each relevant table, e.g. the behaviour table would store the behaviour data and would simply have a timestamp for each record along with the identifier for that pet. When querying the db, it would be straightforward to query the one table with the pet id, using the timestamps to output the correct history of submissions. Is there a more sensible way around this, or does that make sense?
I would use a combination of lookup tables with a strong use of foreign keys. I think what you are suggesting is very common. For example, "get me all the reported illnesses for a particular pet during this date range" would look something like:
SELECT *
FROM table_illness
WHERE table_illness.pet_id = <value>
  AND date BETWEEN table_illness.start_date AND table_illness.finish_date
You could do that for any of the tables. The lookup tables provide the link between, for example, table_illness.illness_type and illness_types.illness_type. The illness_types table is where you would store the details of the types of illnesses.
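Here is that illness query run through sqlite3 (table and column names are taken from the answer; the pet ids, illness names, and dates are invented sample data):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE table_illness (
    pet_id      INTEGER,
    illness_type TEXT,
    start_date  TEXT,
    finish_date TEXT)""")
db.executemany("INSERT INTO table_illness VALUES (?, ?, ?, ?)", [
    (1, "flu",  "2024-01-01", "2024-01-10"),
    (1, "limp", "2024-03-01", "2024-03-05"),
    (2, "flu",  "2024-01-02", "2024-01-08"),
])

# Illnesses for pet 1 that were active on a given date.
# ISO-formatted date strings compare correctly with BETWEEN.
rows = db.execute("""
    SELECT illness_type FROM table_illness
    WHERE pet_id = ?
      AND ? BETWEEN start_date AND finish_date
""", (1, "2024-01-05")).fetchall()
```

The foreign-key lookup to an `illness_types` table would just be one more join on `illness_type`.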
When designing a database you should build your tables to mimic real-life objects or concepts. So in that sense the design you suggest makes sense. Each pet should have its own record in a pet table which doesn't change. Changing information should then be placed into the appropriate table which has the pet's id. The time stamp method you suggest is probably what I would do -- unless of course this is for a vet or something. Then I'd create an appointment table with the date and connect the illness or behavior to the appointment as well.
