How can I design a table that stores subdomain metadata to avoid large partitions?

I'm trying to design a table in Cassandra, but I'm getting a lot of large-partition warnings.
Any ideas how I could improve this design to prevent overloaded partitions while still being able to use a query like this:
select * from analytics where domain='test' and tld='com'
CREATE TABLE analytics (
domain text,
tld text,
subdomain text,
a text,
PRIMARY KEY ((domain, tld), subdomain)
)
Also, I'm loading this table with
update analytics set a='a' where domain='test' and tld='com' and subdomain='b';
Some partitions contain over 1 million rows.

I must be naïve, but I'm very surprised to hear that some domains can have a million subdomains. In any case, I suspect that a significant majority of domains would have fewer than 100 subdomains, so for the most part your current table schema is going to be fine and you just need to deal with the really "large" domains.
This is a common problem for social apps and in Graph Theory it is known as the supernode problem -- a vertex with an incredibly high number of edges. In simpler terms, it's Barack Obama (the vertex or node) with over 133M followers (edges) on Twitter, or Cristiano Ronaldo with over 506M followers on Instagram.
For apps that run into the supernode problem, they typically work around it by handling the supernodes separately from the rest. In your case, you need to implement some logic in your app to detect the "super domains" and store them in a separate table.
A possible table design uses the first 2 characters of the subdomain as a bucket. For example, with the domain sub.domainsr.us, we use the prefix su for bucketing to make the partitions smaller:
CREATE TABLE subdomains_by_domain_tld_prefix (
domain text,
tld text,
prefix text,
subdomain text,
a text,
PRIMARY KEY ((domain, tld, prefix), subdomain)
)
This is just an example, so the prefix doesn't have to be limited to just the first 2 characters. You can adjust it depending on the dataset.
Also, if it makes it simpler for your app, you can choose to use this table for all domains. Cheers!
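With this design, the application derives the bucket from the subdomain at read time. A sketch of the resulting queries against the table above (the values are illustrative):
-- point lookup: derive the prefix from the subdomain being fetched
SELECT * FROM subdomains_by_domain_tld_prefix
WHERE domain = 'test' AND tld = 'com' AND prefix = 'su' AND subdomain = 'sub';

-- listing a whole domain means iterating over the prefix buckets the app uses
SELECT * FROM subdomains_by_domain_tld_prefix
WHERE domain = 'test' AND tld = 'com' AND prefix = 'aa';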

How unique is a? You can include whatever makes the most sense and would give you smaller partitions, then create a secondary index on whichever column you leave out of the original PK and still need to query. Remember that whatever you include in the PK, you'll need to supply when you query records, so only include or add a column that would make sense in queries and would give you smaller partitions.
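For instance, a minimal CQL sketch of the secondary-index mechanics on the original table (the index name is illustrative, and it assumes a is a column you need to filter on):
CREATE INDEX analytics_a_idx ON analytics (a);

-- the full partition key plus the indexed column can then be queried together
SELECT * FROM analytics WHERE domain='test' AND tld='com' AND a='a';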

Related

Property address database design in DynamoDB NoSQL

We have several terabytes of address data and are investigating the possibility of storing this in a DynamoDB NoSQL database. I've done quite a bit of reading on DynamoDB and NoSQL in general, but am coming from many years of MS SQL and am struggling with some of the NoSQL concepts.
My biggest question at this point is how to set up the table structure so that I can accommodate the various different ways the data could be queried. For example, in regular SQL I would expect some queries like:
WHERE Address LIKE '%maple st%' AND ZipCode = 12345
WHERE Address LIKE '%poplar ln%' AND City = 'Los Angeles' AND State = 'CA'
WHERE OwnerName LIKE '%smith%' AND CountyFIPS = '00239'
Those are just examples. The actual queries could be any combination of those various fields.
It's not clear to me what my index should look like or how the table (or tables) should be structured. Can anyone get me started on understanding how that could work?
The post is relatively old, but I will try to give you an answer (maybe it will be helpful for someone having similar issues in the future).
DynamoDB is not really meant to be used in the way you describe. Its strengths are in fast (smoking fast, in fact) look-ups of key/value pairs. To take an example with an IP address: if you wanted to look up information associated with an IP address really quickly, you could easily make the HashKey a string with the IP address and use this to do a look-up.
Things start to get complicated when you want to do queries (or scans) in DynamoDB; you can read about them here: Query and Scan in DynamoDB
The gist is that scans/queries are really expensive when not performed on either the HashKey or the HashKey+RangeKey combo (range keys are basically composite keys).
In other words, I am not sure if DynamoDB is the right way to go. For smoking-fast search functionality I would consider using something like Lucene. If you configure your indexes wisely, you will be amazed how fast it works.
Hope this helps.
Edit:
Seems Amazon has now added support for secondary indices:
See here
DynamoDB was built to be utilized in the way the question author describes; refer to this LINK, where the AWS documentation describes creating a secondary index like this:
[country]#[region]#[state]#[county]#[city]#[neighborhood]
The partition key could be shaped something like this as well, based on what you want to look up.
In DynamoDB, you create the joins before you create the table. This means that you have to think about all the ways you intend to search for your data, create the indexes, and query your data using them.
AWS created NoSQL Workbench to help teams do this. There are a few UI bugs in that application at the time of this writing; refer to LINK for more information on the bugs.
To review some of the queries you mentioned, I'll share a few possibilities in which you can create an index to create that query.
Note: NoSQL means denormalized data in some cases, but not necessarily.
There are limits on how keys should be shaped so that DynamoDB can partition actual servers to scale; refer to partition keys for more info.
The magic of DynamoDB is a well-thought-out model that can also handle new queries after the table is created and being used in production. There are a great many posts and videos online that explain how to do this.
Here is one with Rick Houlihan link. Rick Houlihan is the principal designer of DynamoDB, so go there for gospel.
To make the queries you're attempting, one would create multiple keys, mainly an initial partition key and a secondary key. Rick recommends keeping them generic, like PK and SK.
Then try to shape the PK with a great deal of uniqueness, e.g. a partition key of just a zip code (PK: "12345") could contain a massive amount of data, possibly more than the 10 GB quota that applies to any single partition.
Example 1: WHERE Address LIKE '%maple st%' AND ZipCode = 12345
For example 1, we could shape a partition key of PK: "12345:maple"
Then just querying the PK "12345:maple" would retrieve all the data with that zip code and a street of maple. There will be many different PKs, and that is what DynamoDB does well: scale horizontally.
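A sketch of that look-up using DynamoDB's PartiQL syntax (the addresses table name is hypothetical):
SELECT * FROM "addresses" WHERE PK = '12345:maple'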
Example 2: WHERE Address LIKE '%poplar ln%' AND City = 'Los Angeles' AND State = 'CA'
In example 2, we could then use the secondary index to add another way to be more specific such as PK: "12345:poplar" SK: "losangeles:ca:other:info:that:helps"
Example 3: WHERE OwnerName LIKE '%smith%' AND CountyFIPS = '00239'
For example 3, we don't have a street name. We would need to know the street name to query the data, but we may not have it in a search. This is where one would need to fully understand their base query patterns and shape the PK so that it is easily known at query time, while still being unique enough that we do not go over the partition limits. Having a street name would probably not be optimal; it all depends on what queries are required.
In this last example, it may be more appropriate to add some global secondary indexes, which just means making a new partition key and sort key that map to a data attribute (column) like CountyFIPS.
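As a sketch, a global secondary index keyed on CountyFIPS could then be queried like this (PartiQL again; the table and index names are hypothetical):
SELECT * FROM "addresses"."CountyFipsIndex" WHERE CountyFIPS = '00239'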

Shield database size when exposing keys? (without killing performance)

We have a database table that will hold 10 million records. We don't want to use auto_increment, because that would let our users know how many records we have, and we don't want to expose that to our competitors. The problem I see is that using a UUID or something like that will kill query performance.
For instance, this is a no-no:
http://domain.com/widgets?id=34345
because competitors can crawl the site to determine how many widgets we have. Should this business shielding be handled at the app level, or is it OK to handle it at the database level? What do most people do in this situation? The database we're using is Postgres, but I assume the solution is still database-agnostic.
Use GUIDs as keys. You can look at this question to see why it would be OK to do so. You may be able to get away with using a subset of the GUID's bits, but the smaller the bit size, the more likely a collision. A GUID is not overly large and can be stored as a number. The transfer would be 4 times as much for the key, but that is largely irrelevant.
The storage might be about 120 MB more for 10 million rows, but that seems negligible at such a large size. Have you tested the performance of GUIDs and found them lacking?
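A minimal Postgres sketch of the GUID-as-key approach (gen_random_uuid() is built in from PostgreSQL 13; on older versions it comes from the pgcrypto extension; the table is illustrative):
CREATE EXTENSION IF NOT EXISTS pgcrypto;  -- not needed on PostgreSQL 13+

CREATE TABLE widgets (
    id   uuid PRIMARY KEY DEFAULT gen_random_uuid(),  -- opaque key, leaks no row count
    name text NOT NULL
);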
I use slug-based URLs, where the slug is a unique (and therefore indexed) field, plus you get nice URLs like http://example.com/awesome-blue-widget. You can create slugs by lowercasing the widget name, replacing spaces with hyphens, etc. My web framework has an easy slugify function for this, which I extended to append an increment if a slug is already taken.
Slugs generally match the pattern [a-z0-9-]+. And you can still have your auto-incremented primary key for use in foreign keys in other tables and such, without compromising your business data.
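A minimal Postgres sketch of that layout (names are illustrative):
CREATE TABLE widgets (
    id   serial PRIMARY KEY,    -- internal key, never exposed in URLs
    slug text NOT NULL UNIQUE,  -- e.g. 'awesome-blue-widget', generated by the app
    name text NOT NULL
);

-- the public URL resolves via the indexed slug, not the sequential id
SELECT * FROM widgets WHERE slug = 'awesome-blue-widget';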

Basic query about SQL Server

Why use rowguid and what are the benefits?
Suppose a company has thousands of customers: is it a good approach to divide them on the basis of their gender for performance and fast queries? If not, why?
How do large companies like Facebook handle the primary keys for their comments, users and other things? For example:
Suppose there are five users with primary keys 1, 2, 3, 4, 5...
What if user 3 is deleted? Now only 1, 2, 4, 5 are left, which leaves a gap in the continuous chain. How do they deal with it?
Don't know - maybe you use a non-auto value so you can keep it constant across other databases (maybe for use with 3rd-party integration, etc.).
Do not divide on a field such as gender: when you don't know the gender (or want a full list) you are going to have to search two tables, and when you want to add other filtering/searching you will again have to do it over multiple tables.
So what if there is a gap in the ID chain - it does not affect anything. Why would you think it is important?
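To illustrate the single-table point, a minimal T-SQL sketch (names are illustrative): one table serves both the filtered and the full-list queries, so nothing has to be searched twice.
CREATE TABLE Customers (
    CustomerId int IDENTITY PRIMARY KEY,
    Name       nvarchar(100) NOT NULL,
    Gender     char(1) NULL  -- may be unknown; no second table needed
);

SELECT * FROM Customers WHERE Gender = 'F';  -- filtered
SELECT * FROM Customers;                     -- full list, no UNION over split tables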

Database design for a product aggregator

I'm trying to design a database for a product aggregator. Each product has information about where it comes from, what it costs, what type of thing it is, color, etc. Users need to be able to search and filter results based on any of those attributes. I also expect to have a large number of users. My initial thought was having one big table with every product in it, with a column for each piece of information and an index on anything I need to be able to search by, but I think this might be inefficient with a lot of users pounding on this one table. My other thought was to organize the database to promote a tree-like navigation of tables, but because you can search by anything, I'm not sure how I would organize the tables.
Any thoughts on some good practices?
One table of products - databases are designed to have lots of users pounding on tables.
(from the comments)
You need to model your data. This comes from looking at all the data you have and determining what is related to what (a table is called a relation because all the attributes in a row are related to a candidate key). You haven't really given enough information about the scope of the data (unstructured?) you have on these products and how it varies. Are you going to have difficulties because Shoes have brand, model, size and color, but Desks only have brand, model and finish? All of this is going to inform your data model. Typically you have one products table, and other things link to it.
Some of those attributes will be foreign keys to lookup tables; others (price) will be simple scalars. With appropriate indexing, you'll be fine. For advanced analytics, consider a dimensionally modeled star schema, but perhaps not for your live transaction system - it depends what your data flows/workflows/transactions are. Or consider some of the benefits of its principles in your transactional database. Ralph Kimball is a good source of information on dimensional modeling.
I don't see any need for the tree structure here. You can do it with a single table.
If you insist on a tree structure with a hierarchy, here is an example to get you started.
For text-based search, and for ease of startup & design, I strongly recommend Apache Solr. The Solr API is easy to use (especially with JSON). Databases do text search poorly, so I would instead recommend that you just make sure they respond to primary/unique key queries properly, and those are the fields you should index.
One table for the products, and another table for the product category hierarchy (you don't specifically say you have this but "tree-like navigation of tables" makes me think you might).
I can see you might be concerned about over-indexing causing problems if you plan to index almost every column. In that case, it might be best to index on the top 5 or 10 columns you think users are likely to search for, unless it's possible for a user to search on ANY column. In that case you might want to look at building a data warehouse. Maybe you'll want to look into data cubes to see if those will help...?
For hierarchical data, you need a PRODUCT_CATEGORY table looking something like this:
ID
PARENT_ID
NAME
Some sample data:
ID  PARENT_ID  NAME
1   NULL       ROOT
2   1          SOCKS
3   1          HELICOPTER PARTS
4   2          ARGYLE
Some SQL engines (such as Oracle) allow you to write recursive queries to traverse the hierarchy in a single query. In this example, the root of the tree has a PARENT_ID of NULL, but if you don't want this column to be nullable, I've also seen -1 used for the same purpose.
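A sketch of such a traversal with a recursive CTE (PostgreSQL syntax shown; Oracle also has its own CONNECT BY form):
WITH RECURSIVE category_tree (id, parent_id, name, depth) AS (
    SELECT id, parent_id, name, 1
    FROM product_category
    WHERE parent_id IS NULL              -- start at the root
    UNION ALL
    SELECT c.id, c.parent_id, c.name, t.depth + 1
    FROM product_category c
    JOIN category_tree t ON c.parent_id = t.id
)
SELECT * FROM category_tree;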

Indexing URLs in SQL Server 2005

What is the best way to deal with storing and indexing URLs in SQL Server 2005?
I have a WebPage table that stores metadata and content about Web Pages. I also have many other tables related to the WebPage table. They all use URL as a key.
The problem is URLs can be very large, and using them as a key makes the indexes larger and slower. How much, I don't know, but I have read many times that using large fields for indexing is to be avoided. Assuming a URL is nvarchar(400), they are enormous fields to use as a primary key.
What are the alternatives?
How much pain would there likely be in using URL as a key instead of a smaller field?
I have looked into the WebPage table having an identity column, and then using this as the primary key for a WebPage. This keeps all the associated indexes smaller and more efficient, but it makes importing data a bit of a pain. Each import for the associated tables has to first look up the id of a URL before inserting data into the tables.
I have also played around with using a hash of the URL to create a smaller index, but am still not sure if it is the best way of doing things. It wouldn't be a unique index and would be subject to a small number of collisions, so I am unsure what foreign key would be used in this case...
There will be millions of records about webpages stored in the database, and there will be a lot of batch updating. There will also be quite a lot of activity reading and aggregating the data.
Any thoughts?
I'd use a normal identity column as the primary key. You say:
This keeps all the associated indexes smaller and more efficient
but it makes importing data a bit of a pain. Each import for the
associated tables has to first lookup what the id of a url is
before inserting data in the tables.
Yes, but the pain is probably worth it, and the techniques you learn in the process will be invaluable on future projects.
On SQL Server 2005, you can create a user-defined function GetUrlId that looks something like
CREATE FUNCTION GetUrlId (@Url nvarchar(400))
RETURNS int
AS BEGIN
    DECLARE @UrlId int
    SELECT @UrlId = Id FROM Url WHERE Url = @Url
    RETURN @UrlId
END
This will return the ID for URLs already in your Url table, and NULL for any URL not already recorded. You can then call this function inline in your import statements - something like
INSERT INTO
    UrlHistory (UrlId, Visited, RemoteIp)
VALUES
    (dbo.GetUrlId('http://www.stackoverflow.com/'), @Visited, @RemoteIp)
This is probably slower than a proper join statement, but for one-time or occasional import routines it might make things easier.
Break up the URL into columns based on the bits you're concerned with, and use the RFC as a guide. Reverse the host and domain info so an index can group like domains (Google does this):
stackoverflow.com -> com.stackoverflow
blog.stackoverflow.com -> com.stackoverflow.blog
Google has a paper that outlines what they do, but I can't find it right now.
http://en.wikipedia.org/wiki/Uniform_Resource_Locator
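A sketch of what that could look like in T-SQL (column names are illustrative; the reversal itself is easiest to compute in the application before inserting):
CREATE TABLE WebPageUrl (
    Id           int IDENTITY PRIMARY KEY,
    ReversedHost nvarchar(255) NOT NULL,  -- e.g. 'com.stackoverflow.blog'
    PathAndQuery nvarchar(400) NOT NULL   -- e.g. '/questions/12345'
);

-- an index on the reversed host groups all pages of a domain together
CREATE INDEX IX_WebPageUrl_ReversedHost ON WebPageUrl (ReversedHost);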
I would stick with the hash solution. This generates a compact key with a fairly low chance of collision.
An alternative would be to create GUID and use that as the key.
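A sketch of the hash approach in T-SQL (assuming the WebPage table has a Url column; CHECKSUM is one built-in option, HASHBYTES another):
-- persisted hash of the URL, small enough to index cheaply
ALTER TABLE WebPage ADD UrlHash AS CHECKSUM(Url) PERSISTED;
CREATE INDEX IX_WebPage_UrlHash ON WebPage (UrlHash);

-- probe by hash first, then compare the full URL to weed out collisions
SELECT * FROM WebPage
WHERE UrlHash = CHECKSUM(@Url) AND Url = @Url;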
I totally agree with Dylan. Use an IDENTITY column or a GUID column as a surrogate key in your WebPage table. That's a clean solution. The lookup of the id while importing isn't that painful, I think.
Using a big varchar column as the key column wastes a lot of space and hurts insert and query performance.
Not so much a solution. More another perspective.
Storing the total unique URI of a page perhaps defeats part of the point of URI construction. Each forward slash is supposed to refer to a unique semantic space within the domain (whether that space is actual or logical). Unless the URIs you intend to store are something along the lines of www.somedomain.com/p.aspx?id=123456789, it might really be better to break a single URI meta-table into tables representing the subdomains you have represented in your site.
For example, if you're going to hold a number of "News" section URIs in the same table as the "Reviews" URIs, then you're missing a trick: have a "Sections" table whose content contains meta-information about each section and whose own ID acts as a parent to all the URIs within it.
