Indexing URL's in SQL Server 2005 - sql-server

What is the best way to deal with storing and indexing URLs in SQL Server 2005?
I have a WebPage table that stores metadata and content about Web Pages. I also have many other tables related to the WebPage table. They all use URL as a key.
The problem is that URLs can be very large, and using them as a key makes the indexes larger and slower. By how much I don't know, but I have read many times that indexing large fields is to be avoided. Assuming a URL is nvarchar(400), that is up to 800 bytes per key value, an enormous field to use as a primary key.
What are the alternatives?
How much pain is there likely to be in using the URL as a key instead of a smaller field?
I have looked into the WebPage table having an identity column, and then using this as the primary key for a WebPage. This keeps all the associated indexes smaller and more efficient, but it makes importing data a bit of a pain. Each import for the associated tables has to first look up the id of a URL before inserting data into the tables.
I have also played around with using a hash on the URL, to create a smaller index, but am still not sure if it is the best way of doing things. It wouldn't be a unique index, and would be subject to a small number of collisions. So I am unsure what foreign key would be used in this case...
There will be millions of records about web pages stored in the database, and there will be a lot of batch updating. There will also be quite a lot of activity reading and aggregating the data.
Any thoughts?

I'd use a normal identity column as the primary key. You say:
This keeps all the associated indexes smaller and more efficient
but it makes importing data a bit of a pain. Each import for the
associated tables has to first lookup what the id of a url is
before inserting data in the tables.
Yes, but the pain is probably worth it, and the techniques you learn in the process will be invaluable on future projects.
On SQL Server 2005, you can create a user-defined function GetUrlId that looks something like
CREATE FUNCTION GetUrlId (@Url nvarchar(400))
RETURNS int
AS BEGIN
    -- @UrlId stays NULL when the URL has not been recorded yet
    DECLARE @UrlId int
    SELECT @UrlId = Id FROM Url WHERE Url = @Url
    RETURN @UrlId
END
This will return the ID for URLs already in your URL table, and NULL for any URL not already recorded. You can then call this function inline in your import statements - something like
INSERT INTO
UrlHistory(UrlId, Visited, RemoteIp)
VALUES
(dbo.GetUrlId('http://www.stackoverflow.com/'), @Visited, @RemoteIp)
This is probably slower than a proper join statement, but for one-time or occasional import routines it might make things easier.
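For bigger batch imports, the join mentioned above might look roughly like this. It is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, the staging table StagingVisit and the sample rows are made up for illustration, and the Url and UrlHistory tables follow the names used in this answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Url (Id INTEGER PRIMARY KEY AUTOINCREMENT, Url TEXT NOT NULL UNIQUE);
    CREATE TABLE UrlHistory (UrlId INTEGER NOT NULL REFERENCES Url(Id), Visited TEXT, RemoteIp TEXT);
    -- Hypothetical staging table holding the raw import rows, keyed by URL text.
    CREATE TABLE StagingVisit (Url TEXT NOT NULL, Visited TEXT, RemoteIp TEXT);
""")

conn.execute("INSERT INTO Url (Url) VALUES ('http://www.stackoverflow.com/')")
conn.executemany(
    "INSERT INTO StagingVisit VALUES (?, ?, ?)",
    [
        ("http://www.stackoverflow.com/", "2009-01-01", "10.0.0.1"),
        ("http://example.com/new-page", "2009-01-02", "10.0.0.2"),
    ],
)

# Step 1: register any URLs that are not known yet, so every staged row has an id.
conn.execute("""
    INSERT INTO Url (Url)
    SELECT DISTINCT s.Url FROM StagingVisit s
    WHERE NOT EXISTS (SELECT 1 FROM Url u WHERE u.Url = s.Url)
""")

# Step 2: resolve all ids in one join instead of one function call per row.
conn.execute("""
    INSERT INTO UrlHistory (UrlId, Visited, RemoteIp)
    SELECT u.Id, s.Visited, s.RemoteIp
    FROM StagingVisit s JOIN Url u ON u.Url = s.Url
""")

history = conn.execute("SELECT UrlId, RemoteIp FROM UrlHistory ORDER BY UrlId").fetchall()
url_count = conn.execute("SELECT COUNT(*) FROM Url").fetchone()[0]
```

The idea is that the lookup cost is paid once per batch rather than once per row; whether that matters depends on how large your imports are.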

Break up the URL into columns based on the bits you're concerned with, and use the RFC as a guide. Reverse the host and domain info so an index can group like domains (Google does this).
stackoverflow.com -> com.stackoverflow
blog.stackoverflow.com -> com.stackoverflow.blog
Google has a paper that outlines what they do, but I can't find it right now.
http://en.wikipedia.org/wiki/Uniform_Resource_Locator
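As a rough illustration of the host reversal described above, here is a small Python sketch. The function names reversed_host_key and split_url are made up for this example, and the choice of columns is just one possible split.

```python
from urllib.parse import urlsplit

def reversed_host_key(url: str) -> str:
    """Return the host with its labels reversed, e.g. com.stackoverflow.blog."""
    host = urlsplit(url).hostname or ""
    return ".".join(reversed(host.split(".")))

def split_url(url: str) -> dict:
    """Break a URL into the pieces that could be stored as separate columns."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host_reversed": reversed_host_key(url),
        "path": parts.path,
        "query": parts.query,
    }

# Sorting on the reversed host keeps pages from related hosts next to each other.
keys = sorted(
    reversed_host_key(u)
    for u in [
        "http://blog.stackoverflow.com/2009/01/",
        "http://example.org/",
        "http://stackoverflow.com/questions",
    ]
)
```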

I would stick with the hash solution. This generates a compact, nearly unique key with a fairly low chance of collision.
An alternative would be to create GUID and use that as the key.
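One way the hash idea could work, sketched in Python with sqlite3 standing in for SQL Server: keep a small, non-unique index on the hash and compare the full URL as well, so a collision can never return the wrong row. The column name UrlHash, the index name, and the choice of SHA-1 truncated to 64 bits are assumptions for this sketch, not something from the question. The surrogate Id stays the foreign key target, which sidesteps the question of what to reference when hashes collide.

```python
import hashlib
import sqlite3

def url_hash(url: str) -> int:
    """Return a signed 64-bit integer taken from the SHA-1 digest of the URL."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE WebPage (Id INTEGER PRIMARY KEY, UrlHash INTEGER NOT NULL, Url TEXT NOT NULL);
    CREATE INDEX IX_WebPage_UrlHash ON WebPage (UrlHash);  -- non-unique on purpose
""")

def add_page(url: str) -> int:
    cur = conn.execute(
        "INSERT INTO WebPage (UrlHash, Url) VALUES (?, ?)", (url_hash(url), url)
    )
    return cur.lastrowid

def find_page_id(url: str):
    # Seek on the small hash index, then compare the full URL to rule out collisions.
    row = conn.execute(
        "SELECT Id FROM WebPage WHERE UrlHash = ? AND Url = ?", (url_hash(url), url)
    ).fetchone()
    return row[0] if row else None

page_id = add_page("http://www.stackoverflow.com/")
```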

I totally agree with Dylan. Use an IDENTITY column or a GUID column as a surrogate key in your WebPage table. That's a clean solution. The lookup of the id while importing isn't that painful, I think.
Using a big varchar column as key column is wasting much space and affects insert and query performance.

Not so much a solution. More another perspective.
Storing the total unique URI of a page perhaps defeats part of the point of URI construction. Each forward slash is supposed to refer to a unique semantic space within the domain (whether that space is actual or logical). Unless the URIs you intend to store are something along the line of www.somedomain.com/p.aspx?id=123456789 then really it might be better to break a single URI metatable into a table representing the subdomains you have represented in your site.
For example if you're going to hold a number of "News" section URIs in the same table as the "Reviews" URIs then you're missing a trick to have a "Sections" table whose content contains meta information about the section and whose own ID acts as a parent to all those URIs within it.

Related

How can I design a table that stores subdomain metadata to avoid large partitions?

I'm trying to design a table in Cassandra, but I'm getting a lot of large partition messages.
Any ideas how I could improve this "design" to prevent overloading while still being able to use a query like this:
select * from analytics where domain='test' and tld='com'
CREATE TABLE analytics (
domain text,
tld text,
subdomain text,
a text,
PRIMARY KEY ((domain, tld), subdomain)
)
Also, I'm loading this table with
update analytics set a='a' where domain='test' and tld='com' and subdomain='b';
Some partitions are over 1 million rows.
I must be naïve but I'm very surprised to hear that some domains can have a million subdomains. In any case, I suspect that a significant majority of domains would have less than 100 subdomains so for the most part, your current table schema is going to be fine and you just need to deal with the really "large" domains.
This is a common problem for social apps and in Graph Theory it is known as the supernode problem -- a vertex with an incredibly high number of edges. In simpler terms, it's Barack Obama (the vertex or node) with over 133M followers (edges) on Twitter, or Cristiano Ronaldo with over 506M followers on Instagram.
For apps that run into the supernode problem, they typically work around it by handling the supernodes separately from the rest. In your case, you need to implement some logic in your app to detect the "super domains" and store them in a separate table.
A possible table design uses the first 2 characters of the subdomain as a bucket. For example with domain sub.domainsr.us, we use the prefix su for bucketing to make the partitions smaller:
CREATE TABLE subdomains_by_domain_tld_prefix (
domain text,
tld text,
prefix text,
subdomain text,
a text,
PRIMARY KEY ((domain, tld, prefix), subdomain)
)
This is just an example so the prefix doesn't have to be limited to just the first 2 characters. You can adjust it depending on the dataset.
Also if it makes it simpler for your app, you can choose to use this table for all domains. Cheers!
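To get a feel for how the prefix bucket spreads rows, the application-side logic could be sketched like this in Python. The helper names partition_key and bucket_sizes and the sample subdomains are made up for illustration; no Cassandra driver is involved.

```python
from collections import Counter

def partition_key(domain, tld, subdomain, prefix_len=2):
    """Build the (domain, tld, prefix) partition key for one subdomain row."""
    return (domain, tld, subdomain[:prefix_len].lower())

def bucket_sizes(domain, tld, subdomains, prefix_len=2):
    """Count how many rows would land in each partition."""
    return Counter(partition_key(domain, tld, s, prefix_len) for s in subdomains)

# Hypothetical sample of subdomains for domainsr.us
sizes = bucket_sizes("domainsr", "us", ["sub", "support", "www", "mail", "maps"])
```

Keep in mind that once the prefix is part of the partition key, a query on domain and tld alone no longer identifies a partition, so reading a whole domain means querying each prefix bucket.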
How unique is a? You can include whatever makes the most sense and would give you smaller partitions, then you could create a secondary index on whichever column you leave out of the original PK and need to query. Remember that whatever you include in the PK, you'll need to use when you query records, so only include or add a column that would make sense to include in queries and would give you smaller partitions.

Property address database design in DynamoDB NoSQL

We have several terabytes of address data and are investigating the possibility of storing this in a DynamoDB NoSQL database. I've done quite a bit of reading on DynamoDB and NoSQL in general, but am coming from many years of MS SQL and am struggling with some of the NoSQL concepts.
My biggest question at this point is how to setup the table structure so that I can accommodate the various different ways the data could be queried. For example, in regular SQL I would expect some queries like:
WHERE Address LIKE '%maple st%' AND ZipCode = 12345
WHERE Address LIKE '%poplar ln%' AND City = 'Los Angeles' AND State = 'CA'
WHERE OwnerName LIKE '%smith%' AND CountyFIPS = '00239'
Those are just examples. The actual queries could be any combination of those various fields.
It's not clear to me what my index should look like or how the table (or tables) should be structured. Can anyone get me started on understanding how that could work?
The post is relatively old, but I will try to give you an answer (maybe it will be helpful for someone having similar issues in the future).
DynamoDB is not really meant to be used in the way you describe. Its strengths are in fast (smoking fast, in fact) look-ups of key/value pairs. To take your example of IP address: if you wanted to really quickly look up information associated with an IP address, you could easily make the HashKey a string with the IP address and use this to do a look-up.
Things start to get complicated when you want to do queries (or scans) in DynamoDB; you can read about them here: Query and Scan in DynamoDB
The gist is that scans/queries are really expensive when not performed on either the HashKey or the HashKey+RangeKey combo (range keys are basically composite keys).
In other words I am not sure if DynamoDb is the right way to go. For smoking fast search functionality I would consider using something like Lucene. If you configure your indexes wisely you will be amazed how fast it works.
Hope this helps.
Edit:
Seems Amazon has now added support for secondary indices:
See here
DynamoDB was built to be utilized in the way the question author describes. Refer to this LINK, where the AWS documentation describes creating a secondary index like this:
[country]#[region]#[state]#[county]#[city]#[neighborhood]
The partition key could be something like this as well based on what you want to look up.
In DynamoDB, you create the joins before you create the table. This means that you have to think about all the ways you intend to search for your data, create the indexes, and query your data using them.
AWS created AWS noSQL WorkBench to help teams do this. There are a few UI bugs in that application at the time of this writing; refer to LINK for more information on the bugs.
To review some of the queries you mentioned, I'll share a few possibilities in which you can create an index to create that query.
Note: noSQL means denormalized data in some cases, but not necessarily.
There are limits as to how keys should be shaped so that dynamoDB can partition actual servers to scale; refer to partition keys for more info.
The magic of DynamoDB is a well-thought-out model that can also handle new queries after the table is created and being used in production. There are a great many posts and videos online that explain how to do this.
Here is one with Rick Houlihan link. Rick Houlihan is the principal designer of DynamoDB, so go there for gospel.
To make the queries you're attempting, one would create multiple keys, mainly an initial partition key and secondary key. Rick recommends keeping them generic like PK, and SK.
Then try to shape the PK with a great deal of uniqueness. For example, a partition key made only of a zip code, PK: "12345", could hold a massive amount of data, possibly more than the 10 GB limit that applies to a single partition.
Example 1: WHERE Address LIKE '%maple st%' AND ZipCode = 12345
For example 1, we could shape a partition key of PK: "12345:maple"
Then just calling the PK of "12345:maple" would retrieve all the data with that zip code as well as street of maple. There will be many different PK's and that is what dynamoDB does well: scales horizontally.
Example 2: WHERE Address LIKE '%poplar ln%' AND City = 'Los Angeles' AND State = 'CA'
In example 2, we could then use the secondary key (SK) to be more specific, such as PK: "12345:poplar" SK: "losangeles:ca:other:info:that:helps"
Example 3: WHERE OwnerName LIKE '%smith%' AND CountyFIPS = '00239'
For example 3, we don't have a street name. We would need to know the street name to query the data, but we may not have it in a search. This is where one would need to fully understand their base query patterns and shape the PK so that it is easily known at the time of the query while still being quite unique, so that we do not go over the partition limits. Having a street name in the PK would probably not be optimal; it all depends on what queries are required.
In this last example, it may be more appropriate to add some global secondary indices, which just means making new primary key and secondary keys that map to data attribute (column) like CountyFIPS.
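The key shapes from examples 1 and 2 could be built with something like the following Python sketch. The helpers normalize and make_keys are hypothetical, the exact normalization (lowercasing and dropping whitespace) is an assumption, and nothing here calls the DynamoDB API; it only shows how the PK and SK strings might be composed before writing or querying an item.

```python
def normalize(text):
    """Lowercase and drop whitespace so 'Los Angeles' becomes 'losangeles'."""
    return "".join(text.lower().split())

def make_keys(zip_code, street, city, state):
    """Compose generic PK/SK attribute values from address parts."""
    return {
        "PK": f"{zip_code}:{normalize(street)}",
        "SK": f"{normalize(city)}:{normalize(state)}",
    }

keys = make_keys("12345", "Poplar", "Los Angeles", "CA")
```

Whatever normalization you pick has to be applied identically on write and on read, otherwise the same address will produce different keys.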

Are there any standards/best-practices for managing small non-transactional lookup tables?

I have an ERP application with about 50 small lookup tables containing non-transactional data. Examples are ItemTypes, SalesOrderStatuses etc. There are so many different types and categories and statuses and with every new module new lookup tables are being added. I have a service to provide List objects out of these tables. These tables usually contain only two columns, (Id and Description). They have only a couple of rows, 8 - 10 rows at max.
I am thinking about putting all of them in one table with ID, Description and LookupTypeID. With this one table I will be able to get rid of 50 tables. Is it good idea? Bad Idea? Very bad idea?
Are there any standards/best-practices for managing small lookup tables?
Many database professionals consider the single common lookup table a design error you should avoid. At the very least, it will slow down performance. The reason is that you will have to have a compound primary key for the common table, and lookups via a compound key will take longer than lookups via a simple key.
According to Anith Sen, this is the first of five design errors you should avoid. See this article: Five Simple Design Errors
Merging lookup tables is a bad idea if you care about integrity of your data (and you should!):
It would allow "client" tables to reference the data they were not meant to reference. E.g. the DBMS will not protect you from referencing SalesOrderStatuses where only ItemTypes should be allowed - they are now in the same table and you cannot (easily) separate the corresponding FKs.
It would force all lookup data to share the same columns and types.
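The first point can be seen in a small experiment. This sketch uses Python's sqlite3 module with foreign keys switched on; the table names ItemMerged and ItemSeparate and the sample rows are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Merged design: every lookup value lives in one table.
    CREATE TABLE Lookup (Id INTEGER PRIMARY KEY, LookupTypeID INTEGER NOT NULL, Description TEXT NOT NULL);
    INSERT INTO Lookup VALUES (1, 1, 'Raw material');   -- LookupTypeID 1: ItemTypes
    INSERT INTO Lookup VALUES (2, 2, 'Shipped');        -- LookupTypeID 2: SalesOrderStatuses
    CREATE TABLE ItemMerged (Id INTEGER PRIMARY KEY, ItemTypeId INTEGER NOT NULL REFERENCES Lookup(Id));

    -- Separate design: item types have their own table.
    CREATE TABLE ItemTypes (Id INTEGER PRIMARY KEY, Description TEXT NOT NULL);
    INSERT INTO ItemTypes VALUES (1, 'Raw material');
    CREATE TABLE ItemSeparate (Id INTEGER PRIMARY KEY, ItemTypeId INTEGER NOT NULL REFERENCES ItemTypes(Id));
""")

# Merged table: an item can point at a sales order status and the FK accepts it.
conn.execute("INSERT INTO ItemMerged VALUES (1, 2)")
merged_rows = conn.execute("SELECT COUNT(*) FROM ItemMerged").fetchone()[0]

# Separate tables: the same mistake is rejected by the database.
try:
    conn.execute("INSERT INTO ItemSeparate VALUES (1, 2)")
    separate_accepted = True
except sqlite3.IntegrityError:
    separate_accepted = False
```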
Unless you have performance problems due to excessive JOINs, I recommend you stay with your current design.
If you do, then you could consider using natural instead of surrogate keys in the lookup tables. This way, the natural keys get "propagated" through foreign keys to the "client" tables, resulting in less need for JOINing, at the price of increased storage space. For example, instead of having ItemTypes {Id PK, Description AK}, only have ItemTypes {Description PK}, and you no longer have to JOIN with ItemTypes just to get the Description - it was automatically propagated down the FK.
You can store them in a text search (i.e. NoSQL) database like Lucene. They are ridiculously fast.
I have implemented this to great effect. Note though that there is some initial setup to overcome, but not much. Lucene queries on ids are a snap to write.
The "one big lookup table" approach has the problem of allowing for silly values -- for example "color: yellow" for trucks in the inventory when you only have cars with "color: yellow". One Big Lookup Table: Just Say No.
Off-hand, I would go with the natural keys for the lookup tables unless you would have cases like "the 2012 model CX300R was red but the 2010-2011 models CX300R were blue (and model ID also denotes color)".
Traditionally, if you ask a DBA they will say you should have separate tables. If you asked a programmer they would say using the single table is easier. (It makes building an Edit Status webpage very easy: you just make one webpage and pass it a different LookupTypeID instead of making lots of similar pages.)
However now with ORM the SQL and Code to access different status tables is not really any extra effort.
I have used both methods and both work fine. I must admit using a single status table is easiest. I have done this for small apps and also enterprise apps and have noticed no performance impacts.
Finally, the other field I normally like to add on these generic status tables is an OrderBy field so you can sort the statuses in your UI by something other than the description if needed.
Sounds like a good idea to me. You can have the ID and LookupTypeID as a multi-attribute primary key. You just need to know what all of the different LookupTypeIDs represent and you should be good as gold.
EDIT: As for the standards/best-practices, I honestly don't have an answer for you. I've only had one semester of SQL/database design so I haven't been all too exposed to the matter.

Shield database size when exposing keys? (without killing performance)

We have a database table that will be 10 million records. We don't want to use auto_increment because that will allow our users to know how many records we have. We don't want to expose that to our competitors. The problem I see is that using UUID or something like that will kill query performance.
for instance, this is a no-no:
http://domain.com/widgets?id=34345
because competitors can crawl the site to determine how many widgets we have. Should this business shielding be handled on the app level, or is it OK to handle it on the database level? What do most people do in this situation? The database we're using is postgres, but I assume the solution is still database agnostic.
Use GUIDs as keys. You can look at this question to see why it would be OK to do. You may be able to get away with using a subset of the GUID number, but the smaller the bit size, the more likely a collision. A GUID is not overly large and should be able to be stored as a number. The transfer would be 4 times as much for the key, but that is largely irrelevant.
The storage might be about 120 MB more for 10 million rows, but that seems negligible at such a large size. Have you tested the performance of GUIDs and found them lacking?
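For reference, the arithmetic behind those figures works out as below, assuming a 16-byte GUID/UUID compared with a 4-byte integer key and counting only the key column itself (no index or foreign key overhead).

```python
ROWS = 10_000_000
GUID_BYTES = 16   # size of a UUID value
INT_BYTES = 4     # size of a 32-bit integer key (assumed baseline)

extra_bytes = ROWS * (GUID_BYTES - INT_BYTES)
extra_mb = extra_bytes / 1_000_000      # extra storage in megabytes
size_ratio = GUID_BYTES / INT_BYTES     # how much larger each key is
```

Any index on the key, and every foreign key column that references it, would add the same per-row difference again.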
I use slug based urls where slug is unique and therefore indexed field, plus you get nice urls like http://example.com/awesome-blue-widget. You can create slugs by lowercasing the widget name, replacing spaces with hyphens etc. My web framework has an easy slugify function for it, that I extended to add an increment on the end if a slug is already taken.
Slugs generally match the pattern [a-z0-9-]+. And you can still have your auto-incremented primary key for use in foreign keys in other tables and such without compromising your business data.
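A minimal version of that slug logic might look like this in Python. It is not the framework's function, just a sketch; slugify and unique_slug are illustrative names, and the set of taken slugs stands in for a lookup against the unique slug column.

```python
import re

def slugify(name: str) -> str:
    """Lowercase the name and collapse anything outside [a-z0-9] into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower())
    return slug.strip("-")

def unique_slug(name: str, taken: set) -> str:
    """Append an incrementing suffix until the slug is not already taken."""
    base = slugify(name)
    slug = base
    n = 2
    while slug in taken:
        slug = f"{base}-{n}"
        n += 1
    taken.add(slug)
    return slug

taken_slugs = set()
first = unique_slug("Awesome Blue Widget", taken_slugs)
second = unique_slug("Awesome Blue Widget", taken_slugs)
```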

database design - best practice- one table for web form drop down options or separate table for each drop down options

I'm looking for the best practice approach here. I have a web page that has several drop-down options. The drop-downs are not related; they are for misc. values (location, building codes, etc.). The database right now has a table for each set of options (e.g. a table for building codes, a table for locations, etc.). I'm wondering if I could just combine them all into one table (called listOptions) and then just query that one table.
Location Table
LocationID (int)
LocatValue (nvarchar(25))
LocatDescription (nvarchar(25))
BuildingCode Table
BCID (int)
BCValue (nvarchar(25))
BCDescription (nvarchar(25))
Instead of the above, is there any reason why I can't do this?
ListOptions Table
ID (int)
listValue (nvarchar(25))
listDescription (nvarchar(25))
groupID (int) //where groupid corresponds to Location, Building Code, etc
Now, when I query the table, I can pass to the query the groupID to pull back the other values I need.
Putting them in one table is an antipattern. These are different lookups, and you cannot enforce referential integrity in the database (which is the correct place to enforce it, as applications are often not the only way data gets changed) unless they are in separate tables. Data integrity is FAR more important than saving a few minutes of development time if you need an additional lookup.
If you plan to use the values later in some referencing foreign keys, you are better off using separate tables.
But why do you need an "all in one" table? Which problem does it solve?
You could do this.
I believe that is your master data, and it would not have so many rows that it might create any performance problems.
Secondly, why would you want to do it once your app is up and running? It should have been thought about earlier. The tables might be used in a lot of places, and it might mean a lot of coding and, most importantly, testing.
Can you throw further light on your requirements?
You can keep them in separate tables and have your stored procedure return one set of data with a "datatype" key that signifies which set of values go with what option.
However, I would urge you to consider a much different approach. This suggestion is based on years of building data driven websites. If these drop-down options don't change very often then why not build server-side include files instead of querying the database. We did this with most of our websites. Think about it, each time the page is presented you query the database for the same list of values... that data hardly ever changes.
In cases when that data did have the tendency to change, we simply added a routine to the back-end admin that rebuilt the server-side include file whenever an add, change or delete was done to one of the lookup values. This reduced database I/O and sped up the load time of all our websites.
We had approximately 600 websites on the same server, all using the same instance of SQL Server (separate databases), and our total server database I/O was drastically reduced.
Edit:
We simply built SSI that looked like this...
<option value="1">Blue</option>
<option value="2">Red</option>
<option value="3">Green</option>
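The rebuild routine described above could be as small as the following Python sketch. The function name render_options and the sample rows are assumptions; in practice the rows would come from the lookup table and the result would be written to the include file.

```python
from html import escape

def render_options(rows):
    """Render (value, label) pairs as <option> elements, escaping both parts."""
    return "\n".join(
        f'<option value="{escape(str(value))}">{escape(label)}</option>'
        for value, label in rows
    )

# Hypothetical lookup rows as they might be read from the database
include_body = render_options([(1, "Blue"), (2, "Red"), (3, "Green")])
```

Escaping the values matters as soon as a lookup description can contain characters like & or quotes.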
With a single table it would be easy to add new groups instead of creating new tables, but as a matter of best practice you should also have a group table so that you can name those groups in the database for future maintenance.
The best practice depends on your requirements.
Do the values of location and building vary frequently? Where do the values come from? Are they imported from external data? Do other tables refer to the single table (so that a two-field key is needed to properly join the tables)?
For example, I use a single table with heterogeneous data for constants or configuration values.
But if the data vary often or are imported from an external source, I prefer to use separate tables.

Resources