Shield database size when exposing keys? (without killing performance)

We have a database table that will hold 10 million records. We don't want to use auto_increment because that would let our users work out how many records we have, and we don't want to expose that to our competitors. The problem I see is that using a UUID or something like it will kill query performance.
For instance, this is a no-no:
http://domain.com/widgets?id=34345
because competitors can crawl the site to determine how many widgets we have. Should this business shielding be handled at the app level, or is it OK to handle it at the database level? What do most people do in this situation? The database we're using is Postgres, but I assume the solution is still database-agnostic.

Use GUIDs as keys. You can look at this question to see why that's OK to do. You may be able to get away with using a subset of the GUID, but the smaller the bit size, the more likely a collision. A GUID is not overly large: it is a 16-byte number, so the transfer would be four times as much as for a 4-byte integer key, but that is largely irrelevant.
The storage might be about 120 MB more for 10 million rows (12 extra bytes per key), which seems negligible at that size. Have you tested the performance of GUIDs and found them lacking?
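As a minimal sketch of this in PostgreSQL (the widgets table and its columns are invented for illustration; gen_random_uuid() is built in from PostgreSQL 13, and available via the pgcrypto extension on older versions):
CREATE TABLE widgets (
    id   uuid PRIMARY KEY DEFAULT gen_random_uuid(),  -- 16-byte native uuid, not text
    name text NOT NULL
);

-- Lookups by the exposed key still use the primary-key index:
SELECT name FROM widgets WHERE id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11';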

I use slug-based URLs, where the slug is a unique (and therefore indexed) field, plus you get nice URLs like http://example.com/awesome-blue-widget. You can create slugs by lowercasing the widget name, replacing spaces with hyphens, and so on. My web framework has an easy slugify function for this, which I extended to append an increment if a slug is already taken.
Slugs generally match the pattern [a-z0-9-]+. And you can still have your auto-incremented primary key for use in foreign keys in other tables and such, without compromising your business data.
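A minimal sketch of that schema in PostgreSQL (table and column names are made up):
CREATE TABLE widgets (
    id   serial PRIMARY KEY,     -- internal key, used only in foreign keys
    name text   NOT NULL,
    slug text   NOT NULL UNIQUE  -- exposed in URLs; UNIQUE creates the index
);

-- The public URL resolves via the slug, never the sequential id:
SELECT * FROM widgets WHERE slug = 'awesome-blue-widget';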

Related

How can I design a table that stores subdomain metadata to avoid large partitions?

I'm trying to design a table in Cassandra, but I'm getting a lot of large-partition messages.
Any ideas how I could improve this design to prevent overloading, while still being able to use a query like this?
select * from analytics where domain='test' and tld='com'
CREATE TABLE analytics (
    domain text,
    tld text,
    subdomain text,
    a text,
    PRIMARY KEY ((domain, tld), subdomain)
)
Also, I'm loading this table with:
update analytics set a='a' where domain='test' and tld='com' and subdomain='b';
Some partitions are over 1 million rows.
I must be naïve, but I'm very surprised to hear that some domains can have a million subdomains. In any case, I suspect that a significant majority of domains have fewer than 100 subdomains, so for the most part your current table schema is going to be fine; you just need to deal with the really "large" domains.
This is a common problem for social apps and in Graph Theory it is known as the supernode problem -- a vertex with an incredibly high number of edges. In simpler terms, it's Barack Obama (the vertex or node) with over 133M followers (edges) on Twitter, or Cristiano Ronaldo with over 506M followers on Instagram.
For apps that run into the supernode problem, they typically work around it by handling the supernodes separately from the rest. In your case, you need to implement some logic in your app to detect the "super domains" and store them in a separate table.
A possible table design uses the first two characters of the subdomain as a bucket. For example, with the domain sub.domainsr.us, we use the prefix su for bucketing to make the partitions smaller:
CREATE TABLE subdomains_by_domain_tld_prefix (
    domain text,
    tld text,
    prefix text,
    subdomain text,
    a text,
    PRIMARY KEY ((domain, tld, prefix), subdomain)
)
This is just an example; the prefix doesn't have to be limited to the first two characters. You can adjust it depending on the dataset.
Also, if it makes things simpler for your app, you can choose to use this table for all domains. Cheers!
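To illustrate how reads and writes would then address the bucket (a hedged sketch; the values follow the sub.domainsr.us example above):
-- Writes must include the prefix bucket:
UPDATE subdomains_by_domain_tld_prefix
SET a = 'a'
WHERE domain = 'domainsr' AND tld = 'us' AND prefix = 'su' AND subdomain = 'sub';

-- Reads fetch one bucket at a time; listing every subdomain of a domain
-- means iterating the possible prefixes client-side:
SELECT * FROM subdomains_by_domain_tld_prefix
WHERE domain = 'domainsr' AND tld = 'us' AND prefix = 'su';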
How unique is a? You can include whatever makes the most sense and would give you smaller partitions; then you could create a secondary index on whichever column you leave out of the original PK and still need to query. Remember that whatever you include in the PK you'll need to supply when you query records, so only include or add a column that makes sense in queries and gives you smaller partitions.
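As a hedged illustration of that comment, if a stays out of the primary key but still needs to be queried (the index name is made up):
CREATE INDEX IF NOT EXISTS analytics_a_idx ON analytics (a);

-- Queries restricted to a partition can then filter on the indexed column:
SELECT * FROM analytics WHERE domain = 'test' AND tld = 'com' AND a = 'a';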

Auto-Complete/Primary Key as String - PostgreSQL

I set up a database that is not too complex but nonetheless has multiple many-to-many relationships. Let me first explain the database briefly using three tables (there are many more, but just to keep things simple):
The database stores information about completed projects. One attribute is the software used. So I have three tables (with their respective columns/keys):
tblProjects(ProjectID[PK], ProjectTitle, etc...)
tblProjectsSoftware(SoftwareID[FK], ProjectID[FK], UniqueID[PK])
tblSoftwareUsed(SoftwareID[PK], SoftwareName)
In order to make data entry easier in phpPgAdmin, I was considering just making SoftwareName the primary key in tblSoftwareUsed. This is because when I go to enter the software associated with certain projects into tblProjectsSoftware, I can only use the auto-complete feature on the SoftwareID column, which is more or less a meaningless number.
When entering data into the SoftwareID column of tblProjectsSoftware, I can only 'filter' results by the ID and not the name. When this database gets large it may not be an issue for software, but there are some other attributes that will have tons of records. To explain further: I start my data entry by creating a record for the project in tblProjects. Then I create new records (if necessary) for the software used. Then, when entering data into tblProjectsSoftware, I either have to know the ID of the software or click through a few pages to find it.
So, my question is: would I have any issues making the name of the software my primary key, or would it be better to leave the ID as the PK? Furthermore, maybe I am missing an option to make SoftwareName searchable in addition to the ID.
There are advantages and disadvantages to using surrogate keys, which are discussed at length in this wikipedia article:
http://en.wikipedia.org/wiki/Surrogate_key
Borrowing their headers...
Advantages:
Immutability
Requirement changes
Performance
Compatibility
Uniformity
Validation
Disadvantages:
Disassociation
Query optimization
Normalization
Business process modeling
Inadvertent disclosure
Inadvertent assumptions
More often than not, you'll want to use a surrogate key for practical reasons -- such as avoiding headaches when you need to update a software name.
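A middle ground, sketched here as a hypothetical PostgreSQL version of the table above: keep the surrogate key, but declare the name UNIQUE so it is indexed and can back a lookup or auto-complete:
CREATE TABLE tblSoftwareUsed (
    SoftwareID   serial PRIMARY KEY,    -- surrogate key referenced by tblProjectsSoftware
    SoftwareName text   NOT NULL UNIQUE -- natural candidate key; indexed, so searchable
);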

Are there any standards/best-practices for managing small non-transactional lookup tables?

I have an ERP application with about 50 small lookup tables containing non-transactional data. Examples are ItemTypes, SalesOrderStatuses, etc. There are so many different types, categories, and statuses, and with every new module new lookup tables are added. I have a service that provides List objects out of these tables. These tables usually contain only two columns (Id and Description) and only a handful of rows, 8-10 at most.
I am thinking about putting all of them in one table with Id, Description, and LookupTypeID columns. With this one table I would be able to get rid of 50 tables. Is it a good idea? A bad idea? A very bad idea?
Are there any standards/best-practices for managing small lookup tables?
Among some professionals, the single common lookup table is a design error you should avoid. At the very least, it will slow down performance. The reason is that you will have to have a compound primary key for the common table, and lookups via a compound key will take longer than lookups via a simple key.
According to Anith Sen, this is the first of five design errors you should avoid. See this article: Five Simple Design Errors
Merging lookup tables is a bad idea if you care about integrity of your data (and you should!):
It would allow "client" tables to reference the data they were not meant to reference. E.g. the DBMS will not protect you from referencing SalesOrderStatuses where only ItemTypes should be allowed - they are now in the same table and you cannot (easily) separate the corresponding FKs.
It would force all lookup data to share the same columns and types.
Unless you have performance problems due to excessive JOINs, I recommend you stay with your current design.
If you do, then you could consider using natural instead of surrogate keys in the lookup tables. This way, the natural keys get "propagated" through foreign keys to the "client" tables, resulting in less need for JOINing, at the price of increased storage space. For example, instead of having ItemTypes {Id PK, Description AK}, have only ItemTypes {Description PK}, and you no longer have to JOIN with ItemTypes just to get the Description - it is automatically propagated down the FK.
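A sketch of that natural-key variant in SQL (the Items table is invented for illustration):
CREATE TABLE ItemTypes (
    Description varchar(50) PRIMARY KEY
);

CREATE TABLE Items (
    ItemId   int PRIMARY KEY,
    ItemType varchar(50) NOT NULL REFERENCES ItemTypes (Description)
);

-- The description travels down the FK, so no JOIN is needed to display it:
SELECT ItemId, ItemType FROM Items;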
You can store them in a text-search (i.e. NoSQL) database like Lucene. Such databases are ridiculously fast.
I have implemented this to great effect. Note that there is some initial setup to overcome, but not much. Lucene queries on IDs are a snap to write.
The "one big lookup table" approach has the problem of allowing for silly values -- for example "color: yellow" for trucks in the inventory when you only have cars with "color: yellow". One Big Lookup Table: Just Say No.
Off-hand, I would go with the natural keys for the lookup tables unless you would have cases like "the 2012 model CX300R was red but the 2010-2011 models CX300R were blue (and model ID also denotes color)".
Traditionally, if you ask a DBA, they will say you should have separate tables. If you ask a programmer, they will say the single table is easier. (It makes building an Edit Status web page very easy: you make one page and pass it a different LookupTypeID, instead of maintaining lots of similar pages.)
However, now with ORMs, the SQL and code to access separate status tables is not really any extra effort.
I have used both methods and both work fine. I must admit using a single status table is easiest. I have done this for small apps as well as enterprise apps and have noticed no performance impact.
Finally, the other field I normally like to add to these generic status tables is an OrderBy field, so you can sort the statuses in your UI by something other than the description if needed.
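For what it's worth, a sketch of that single generic table (column names are only an illustration):
CREATE TABLE Lookups (
    LookupTypeID int          NOT NULL,  -- which logical lookup this row belongs to
    Id           int          NOT NULL,
    Description  varchar(100) NOT NULL,
    OrderBy      int          NOT NULL DEFAULT 0,  -- explicit UI sort order
    PRIMARY KEY (LookupTypeID, Id)
);

-- All statuses for one lookup type, in UI order:
SELECT Id, Description FROM Lookups WHERE LookupTypeID = 3 ORDER BY OrderBy;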
Sounds like a good idea to me. You can have the ID and LookupTypeID as a multi-attribute primary key. You just need to know what all of the different LookupTypeIDs represent and you should be good as gold.
EDIT: As for the standards/best-practices, I honestly don't have an answer for you. I've only had one semester of SQL/database design so I haven't been all too exposed to the matter.

Basic questions about SQL Server

Why use rowguid and what are the benefits?
Suppose a company has thousands of customers. Is it a good approach to split them into separate tables on the basis of gender for performance and fast queries? If not, why not?
How do large companies like Facebook handle the primary keys for their comments, users, and other things? For example:
Suppose there are five users with primary keys 1, 2, 3, 4, 5...
What if user 3 is deleted? Now 1, 2, 4, 5 are left, which leaves a gap in the otherwise continuous chain. How do they deal with it?
Don't know - maybe you use a non-auto value so you can keep it constant across other databases (maybe for use with 3rd-party integration, etc.).
Do not divide on a field such as gender. When you don't know the gender (or want a full list) you will have to search two tables, and when you want to add other filtering/searching you will again have to do it over multiple tables.
So what if there is a gap in the ID chain? It does not affect anything. Why would you think it is important?

Indexing URL's in SQL Server 2005

What is the best way to deal with storing and indexing URLs in SQL Server 2005?
I have a WebPage table that stores metadata and content about web pages. I also have many other tables related to the WebPage table. They all use URL as a key.
The problem is URLs can be very large, and using them as a key makes the indexes larger and slower. How much slower I don't know, but I have read many times that using large fields for indexing is to be avoided. Assuming a URL is nvarchar(400), that is an enormous field to use as a primary key.
What are the alternatives?
How much pain would there likely be in using the URL as a key instead of a smaller field?
I have looked into giving the WebPage table an identity column and then using this as the primary key for a web page. This keeps all the associated indexes smaller and more efficient, but it makes importing data a bit of a pain: each import for the associated tables has to first look up the ID of a URL before inserting data into the tables.
I have also played around with using a hash of the URL to create a smaller index, but am still not sure if it is the best way of doing things. It wouldn't be a unique index and would be subject to a small number of collisions, so I am unsure what foreign key would be used in this case...
There will be millions of records about web pages stored in the database, and there will be a lot of batch updating. Also there will be quite a lot of activity reading and aggregating the data.
Any thoughts?
I'd use a normal identity column as the primary key. You say:
This keeps all the associated indexes smaller and more efficient, but it makes importing data a bit of a pain: each import for the associated tables has to first look up the ID of a URL before inserting data into the tables.
Yes, but the pain is probably worth it, and the techniques you learn in the process will be invaluable on future projects.
On SQL Server 2005, you can create a user-defined function GetUrlId that looks something like:
CREATE FUNCTION GetUrlId (@Url nvarchar(400))
RETURNS int
AS BEGIN
    -- Returns the surrogate ID for a URL, or NULL if the URL is not yet recorded
    DECLARE @UrlId int
    SELECT @UrlId = Id FROM Url WHERE Url = @Url
    RETURN @UrlId
END
This will return the ID for URLs already in your Url table, and NULL for any URL not already recorded. You can then call this function inline in your import statements - something like:
INSERT INTO
    UrlHistory (UrlId, Visited, RemoteIp)
VALUES
    (dbo.GetUrlId('http://www.stackoverflow.com/'), @Visited, @RemoteIp)
This is probably slower than a proper join statement, but for one-time or occasional import routines it might make things easier.
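For bigger batches, the join-based version would look something like this (StagingImport is a hypothetical staging table holding the raw rows to import):
-- Set-based import: resolve all URL IDs in one join instead of one
-- function call per row.
INSERT INTO UrlHistory (UrlId, Visited, RemoteIp)
SELECT u.Id, s.Visited, s.RemoteIp
FROM StagingImport s
JOIN Url u ON u.Url = s.Url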
Break the URL up into columns based on the bits you're concerned with, and use the RFC as a guide. Reverse the host and domain info so an index can group like domains (Google does this):
stackoverflow.com -> com.stackoverflow
blog.stackoverflow.com -> com.stackoverflow.blog
Google has a paper that outlines what they do, but I can't find it right now.
http://en.wikipedia.org/wiki/Uniform_Resource_Locator
I would stick with the hash solution. This generates a compact key with a fairly low chance of collision.
An alternative would be to create a GUID and use that as the key.
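A hedged sketch of the hash variant in SQL Server (CHECKSUM is a weak hash, so the full URL is re-checked to weed out collisions; the column and index names are invented):
-- Persisted computed column holding a 4-byte hash of the URL:
ALTER TABLE WebPage ADD UrlHash AS CHECKSUM(Url) PERSISTED

-- Small, cheap index on the hash instead of the 400-character key:
CREATE INDEX IX_WebPage_UrlHash ON WebPage (UrlHash)

-- Narrow by hash first, then confirm on the full URL:
SELECT Id FROM WebPage WHERE UrlHash = CHECKSUM(@Url) AND Url = @Url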
I totally agree with Dylan. Use an IDENTITY column or a GUID column as the surrogate key in your WebPage table. That's a clean solution, and the lookup of the ID while importing isn't that painful, I think.
Using a big varchar column as the key column wastes a lot of space and hurts insert and query performance.
Not so much a solution - more another perspective.
Storing the entire unique URI of a page perhaps defeats part of the point of URI construction. Each forward slash is supposed to refer to a unique semantic space within the domain (whether that space is actual or logical). Unless the URIs you intend to store are something along the lines of www.somedomain.com/p.aspx?id=123456789, it might be better to break a single URI meta-table into tables representing the sections you have on your site.
For example, if you're going to hold a number of "News" section URIs in the same table as the "Reviews" URIs, you're missing a trick: have a "Sections" table whose rows contain meta information about each section and whose own ID acts as a parent to all the URIs within it.
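A rough sketch of that idea (all names invented):
CREATE TABLE Sections (
    SectionID int          IDENTITY PRIMARY KEY,
    Name      nvarchar(50) NOT NULL UNIQUE  -- e.g. 'News', 'Reviews'
)

CREATE TABLE WebPage (
    PageID    int           IDENTITY PRIMARY KEY,
    SectionID int           NOT NULL REFERENCES Sections (SectionID),
    Path      nvarchar(350) NOT NULL  -- remainder of the URI within its section
)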
