SQL Design - How to store large amount of URLs


I'm writing an application with a SQL Server backend that will store (among other things) URLs. URLs will be mapped to users, and some URLs may be common to several users. In the absence of a true DBA, I'm trying to design a solution that can handle hundreds of thousands of URLs as efficiently as possible.
Ideas:
1) Create a table that simply has ID, URL.
Pro: simple, complete.
Con: duplicate entries for a URL will exist, which will make the table larger than it needs to be.
2) Break up the users and URLs into separate tables: one table containing user ID and URL ID, another table with URL ID and the URL itself.
Pro: a single copy of each URL in the system; seems more "enterprisey".
Con: must join two tables when pulling back results, and I'm not really sure what the benefit of this approach is.
3) Expand on idea 2, except REALLY break it up: have a table for the domain and another for the path/query string. The user table would then have user ID, domain ID, path ID.
Pro: URLs could share data even if they were unrelated (meaning cnn.com/helloworld and nbc.com/helloworld would have different domain IDs but the same path ID). It seems this could be useful when running metrics later.
Con: seems like a nightmare from a performance perspective (again, because joins would be necessary to pull a URL).
Any thoughts?

I would do the following in my design:
UserId UrlId
1 1
2 2
1 1
UrlId Url
1 http://www.google.com
2 http://www.yahoo.com
Store your URLs in a separate table, and only create a new entry in the URL table if an exact match does not already exist. If you have a lot of common URLs, this will save some space. You could take it a step further and add a third table as you mentioned, e.g.
UrlPathId UrlId UrlPath
1 1 /shopping
...and then tying the UrlPathId to the User table. And perhaps even further:
UrlPathId UrlId UrlQueryString
1 1 ?product=speakers
...and again, referencing this from your User table.
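The "only create a new entry if an exact match does not already exist" idea could be sketched roughly as below. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the table names (url, user_url) and helper functions are made up for the example. INSERT OR IGNORE is SQLite syntax; on SQL Server you would need an equivalent existence check.

```python
import sqlite3

def create_schema(conn):
    # A UNIQUE constraint on url guarantees a single row per distinct URL.
    conn.executescript("""
        CREATE TABLE url (
            url_id INTEGER PRIMARY KEY,
            url    TEXT NOT NULL UNIQUE
        );
        CREATE TABLE user_url (
            user_id INTEGER NOT NULL,
            url_id  INTEGER NOT NULL REFERENCES url(url_id),
            PRIMARY KEY (user_id, url_id)
        );
    """)

def get_or_create_url_id(conn, url):
    # Insert the URL only if it is not already stored, then look up its id.
    conn.execute("INSERT OR IGNORE INTO url (url) VALUES (?)", (url,))
    row = conn.execute("SELECT url_id FROM url WHERE url = ?", (url,)).fetchone()
    return row[0]

def add_url_for_user(conn, user_id, url):
    # Map the user to the (possibly shared) URL row.
    url_id = get_or_create_url_id(conn, url)
    conn.execute(
        "INSERT OR IGNORE INTO user_url (user_id, url_id) VALUES (?, ?)",
        (user_id, url_id),
    )
    return url_id

def urls_for_user(conn, user_id):
    # The join that option 2 requires when reading a user's URLs.
    rows = conn.execute(
        "SELECT u.url FROM user_url AS uu "
        "JOIN url AS u ON u.url_id = uu.url_id "
        "WHERE uu.user_id = ? ORDER BY u.url",
        (user_id,),
    ).fetchall()
    return [r[0] for r in rows]

conn = sqlite3.connect(":memory:")
create_schema(conn)
add_url_for_user(conn, 1, "http://www.google.com")
add_url_for_user(conn, 2, "http://www.google.com")  # reuses the existing URL row
add_url_for_user(conn, 2, "http://www.yahoo.com")
```

Two users share the google.com row here, so the url table holds two rows for three user-to-URL mappings.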

It sounds like you are describing a many-to-many relationship between users and URLs.
I would highly suggest ruling out option 1. Not only will it increase size, but if you need to update a URL or a user, you'll have to do it everywhere it's duplicated instead of once.
Choosing between 2 and 3 is more difficult, because it depends much more on how this is going to be used. #2 is a lot simpler and is still normalized. The features in #3 don't seem to outweigh the complexity to me, so personally I'd pick #2.
Edit: Upon seeing George's answer, I completely agree with the first section.

Are you really that short on space? Unless you need to treat URLs as an object in their own right I would just go for option 1 and cover it with indexes if you have specific performance requirements on URLs alone.
See my other comment here on dealing with orphan URLs.

Related

Should I save storage or save speed in keeping a history of notes?

I am creating a note system and want my notes to be editable, but also want them to never be deleted so I'm compromising with keeping a history of the different changes made to them. So I have come up with one idea where each note table looks like this:
Note
--------
+id
+content
+author
+timestamp
+edited
In this version, if edited is anything other than null, the note has been edited and edited points to the note id of its ancestor. It is essentially a linked list. I'm not very happy with that, though, as most notes won't be edited, so there's just a bunch of nulls sitting around.
My other idea was to create a table like:
Note
-------
+id
+content
+author
+timestamp
and also a table like:
Edited_Notes
-----------
+id
+note_id
Then whenever a note is loaded, just see if it's been added to Edited_Notes. If it has been, then obviously it's been edited. I'm worried that searching through this table every time a note is opened by hundreds of users could be taxing for the database though, especially if I add an ability to see all note history for a single note at once.
I am not a db designer so this is pretty new to me. Would these kinds of transactions even scratch a database's capabilities? Is there a better way to go about it?

There is no reason to avoid empty columns - storage is cheap (too cheap to measure for most systems).
What's usually expensive is developer time. I'd optimize for the most obvious, easy-to-understand solution, that describes your business domain in the cleanest possible way. In my opinion, option 1 does that; option 2 would probably require significant additional queries on most screens.

If I understand the solution correctly, a deep parent-child relation could be an issue for a db even with a small number of records, as you will need to join the table to itself several times, depending on the number of changes.
Instead I would recommend a history table, separate from the note table, with the exact same structure plus a parent id reference to the actual note table (or you can just use the same note id as a non-primary field). Whenever something changes, you move the old data to the history table with the parent reference id.
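A minimal sketch of that history-table idea, assuming Python's sqlite3 as a stand-in database. The note_history table name, its history_id column and the edit_note helper are invented for the example; the note columns follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE note (
        id        INTEGER PRIMARY KEY,
        content   TEXT NOT NULL,
        author    TEXT NOT NULL,
        timestamp TEXT NOT NULL
    );
    -- Same structure as note, plus a reference back to the live note.
    CREATE TABLE note_history (
        history_id INTEGER PRIMARY KEY,
        note_id    INTEGER NOT NULL REFERENCES note(id),
        content    TEXT NOT NULL,
        author     TEXT NOT NULL,
        timestamp  TEXT NOT NULL
    );
""")

def edit_note(conn, note_id, new_content, author, timestamp):
    # Copy the current version into the history table, then overwrite it.
    # "with conn" commits both statements together or rolls both back.
    with conn:
        conn.execute(
            "INSERT INTO note_history (note_id, content, author, timestamp) "
            "SELECT id, content, author, timestamp FROM note WHERE id = ?",
            (note_id,),
        )
        conn.execute(
            "UPDATE note SET content = ?, author = ?, timestamp = ? WHERE id = ?",
            (new_content, author, timestamp, note_id),
        )

def note_history(conn, note_id):
    # Older versions of one note, oldest first; no self-join needed.
    rows = conn.execute(
        "SELECT content FROM note_history WHERE note_id = ? ORDER BY history_id",
        (note_id,),
    ).fetchall()
    return [r[0] for r in rows]

conn.execute(
    "INSERT INTO note (id, content, author, timestamp) "
    "VALUES (1, 'first draft', 'ann', '2024-01-01')"
)
edit_note(conn, 1, "second draft", "ann", "2024-01-02")
edit_note(conn, 1, "final", "bob", "2024-01-03")
```

Loading the current note never touches note_history, and the full history of one note is a single indexed lookup on note_id.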

Too many columns in a single preference db table?

I have an application that is essentially built out of many smaller applications. Each application has its own individual preferences, but all of them share the same 5 preferences, for example, whether the application is displayed in the nav, whether it is public, whether reports should be generated, etc.
All of these common preferences need to be known by any page in the web app because the navigation is constructed from them. So originally I put all these preferences in a single table. However, as the number of applications grows (10 now, eventually around 30), the number of columns will end up being around 150-200 total. Most of these columns are just booleans, but it still worries me having that many columns in one table. On the other hand, if I were to split them apart into separate tables (preferences per app), I'd have to join them all together anyway every time I need to see the preferences, so why not just leave them all together?
In the application I can break the preferences into smaller objects so they are easier to work with, but from a db perspective they are a single entity. Is it better to leave them in one giant table, or break them apart into smaller ones but force many joins every time they are requested?

Which database engine are you using? Normally the documentation for your DB engine will give recommendations on the number of columns per table. Mostly these are row size limitations, and staying within them should keep you safe.
Other options and suggestions include:
Assign a bit per config key in an integer, and use the bitwise "AND" operation to pick out only the key you are interested in at a given point in time. That is a single value read from the DB, plus one quick bitwise operation for each read of a config key.
Cache the preferences in memory, for fewer round trips to the DB server. Depending on the frequency of changes, you may also have to clear the cached copy of each preference when it is updated.
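The bit-per-key suggestion could look something like the sketch below. The flag names and the helper functions are hypothetical, chosen to match the example preferences in the question; the stored value is the single integer you would read from the DB.

```python
# One bit per boolean preference (names are illustrative only).
SHOW_IN_NAV = 1 << 0       # 1
IS_PUBLIC = 1 << 1         # 2
GENERATE_REPORTS = 1 << 2  # 4

def is_set(stored, flag):
    # Bitwise AND isolates the bit for the requested preference.
    return (stored & flag) != 0

def set_flag(stored, flag, enabled):
    # OR switches the bit on; AND with the inverted mask switches it off.
    return (stored | flag) if enabled else (stored & ~flag)

# Value as it would be stored in a single integer column.
stored = SHOW_IN_NAV | GENERATE_REPORTS
```

With these definitions, is_set(stored, IS_PUBLIC) is False until set_flag(stored, IS_PUBLIC, True) is written back. The trade-off is that individual preferences become harder to query and index in SQL.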

Why not turn the columns into rows and use something like this:
This is a typical approach for maintaining lists of settings values.
The APP_SETTING table contains the value of the setting. The SETTING table gives you the context of what the setting is.
There are ways of extending this to add information such as which settings apply to which applications and whether or not the possible values for a particular setting are constrained to a specific list.
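As a rough sketch of the columns-into-rows layout: only the table names APP_SETTING and SETTING come from the answer above. The column names, the sample data and the settings_for_app helper are assumptions for illustration, and Python's sqlite3 stands in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- SETTING describes what each setting is.
    CREATE TABLE SETTING (
        setting_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL UNIQUE
    );
    -- APP_SETTING holds one row per (application, setting) value.
    CREATE TABLE APP_SETTING (
        app_id     INTEGER NOT NULL,
        setting_id INTEGER NOT NULL REFERENCES SETTING(setting_id),
        value      TEXT NOT NULL,
        PRIMARY KEY (app_id, setting_id)
    );
    INSERT INTO SETTING (setting_id, name) VALUES
        (1, 'show_in_nav'), (2, 'is_public');
    INSERT INTO APP_SETTING (app_id, setting_id, value) VALUES
        (10, 1, 'true'), (10, 2, 'false'), (11, 1, 'false');
""")

def settings_for_app(conn, app_id):
    # Return all settings of one application as a name -> value dict.
    rows = conn.execute(
        "SELECT s.name, a.value FROM APP_SETTING AS a "
        "JOIN SETTING AS s ON s.setting_id = a.setting_id "
        "WHERE a.app_id = ?",
        (app_id,),
    ).fetchall()
    return dict(rows)
```

Adding a new preference then means inserting rows rather than altering the table, at the cost of storing every value as text.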

Well CommonPreferences and ApplicationPreferences would certainly make sense, and perhaps even segregating them in code (two queries instead of a join).
After that a table per application will make more sense.
Another way is going down the route suggested By Joel Brown.
A third would be, instead of having individual columns or a row per setting, to stuff all the non-common ones into an XML snippet or a serialised preferences class.
Which decision you make revolves around how your application does (or could) use the data.
If you go down the settings table approach, getting an application's settings as a single row will be, erm, painful. Go down the XML snippet route and querying for a setting across applications will be even more painful than several joins.
No way to say what you should compromise on from here. I think I'd go for CommonPreferences first and see where I was at after that.

should I make two separate tables for two similar objects

I want to store "Tweets" and "Facebook Status" in my app as part of "Status collection" so every status collection will have a bunch of Tweets or a bunch of Facebook Statuses. For Facebook I'm only interested in text so I won't store videos/photos for now.
I was wondering in terms of best practice for DB design: is it better to have one table (with the max status length set to 420 to cover both the Facebook and Twitter limits) with a "Type" column that determines what kind of status it is, or is it better to have two separate tables? And why?

Strictly speaking, a tweet is not the same thing as a FB update. You may be ignoring non-text for now, but you may change your mind later and be stuck with a model that doesn't work. As a general rule, objects should not be treated as interchangeable unless they really are. If they are merely similar, you should either use 2 separate tables or use additional columns as necessary.
All that said, if it's really just text, you can probably get away with a single table. But this is a matter of opinion and you'll probably get lots of answers.

I would put the messages into one table and have another that defines the type:
SocialMediaMessage
------------------
id
SocialMediaTypeId
Message
SocialMediaType
---------------
Id
Name
They seem similar enough that there is no point to separate them. It will also make your life easier if you want to query across both Social Networking sites.

It's probably easier to use one table and use a type column to identify them. You will only need one query/stored procedure to access the data, instead of one query for each type when you have multiple tables.

Names of businesses keyed differently by different people

I have this table
tblStore
with these fields
storeID (autonumber)
storeName
locationOrBranch
and this table
tblPurchased
with these fields
purchasedID
storeID (foreign key)
itemDesc
In the case of stores that have more than one location, there is a problem when two people inadvertently key the same store location differently. For example, take Harrisburg Chevron. On some of its receipts it calls itself Harrisburg Chevron; some just say Chevron at the top, and under that, Harrisburg. One person may key it into tblStore as storeName Chevron, locationOrBranch Harrisburg. Person 2 may key it as storeName Harrisburg Chevron, locationOrBranch Harrisburg. What makes this bad is that the business's name is Harrisburg Chevron. It seems hard to make a rule (one that would understandably cover all future opportunities for this error) to prevent people from doing this in the future.
Question 1) I'm thinking as the instances are found, an update query to change all records from one way to the other is the best way to fix it. Is this right?
Question 2) What would be the best way to have originally set up the db to have avoided this?
Question 3) What can I do to make future after-the-fact corrections easier when this happens?
Thanks.
edit: I do understand that better business practices are the ideal prevention, but for question 2 I'm looking for any tips or tricks that people use that could help. And question 1 and 3 are important to me too.

This is not a database design issue.
This is an issue with the processes around using the database design.
The real question I have is why are users entering in stores ad-hoc? I can think of scenarios, but without knowing your situation it is hard to guess.
The normal solution is that the tblStore table is a lookup table only. Normally users only have access to stores that have already been entered.
Then there is a controlled process to maintain the tblStore table in a consistent manner. Only a few users would have access to this process.
Of course as I alluded to above this is not always possible, so you may need a different solution.
UPDATE:
Question #1: An update script is the best approach. The best way to do this is to have a copy of the database if possible, or a close copy if not, and test the script against this data. Once you have ensured that the script runs correctly, then you can run it against the real data.
If you have transactional integrity you should use that. Use "begin" before running the script, and if the number of records affected is what you expect and any other tests you devise (perhaps also scripted) pass, then you can "commit".
Do not type in SQL against a live DB.
Question #3: I suggest your first line of attack is to create processes around the creation of new stores, but this may not be within your ambit.
The second is possibly to get proactive and identify and enter new stores (if this is the problem) before the users in the field need to do so. I don't know if this works inside your scenario.
Lastly if you had a script that merged "store1" into "store2" you can standardise on that as a way of reducing time and errors. You could even possibly build that into an admin only screen that automated merging stores.
That is all I can think of off the top of my head.
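The "merge store1 into store2 inside a transaction, and only commit if the row count is what you expect" idea might be sketched like this. Table and column names come from the question; the sample rows, the merge_stores helper and its expected_rows parameter are assumptions, and Python's sqlite3 stands in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblStore (
        storeID          INTEGER PRIMARY KEY,
        storeName        TEXT,
        locationOrBranch TEXT
    );
    CREATE TABLE tblPurchased (
        purchasedID INTEGER PRIMARY KEY,
        storeID     INTEGER REFERENCES tblStore(storeID),
        itemDesc    TEXT
    );
    INSERT INTO tblStore VALUES
        (1, 'Harrisburg Chevron', 'Harrisburg'),
        (2, 'Chevron', 'Harrisburg');
    INSERT INTO tblPurchased VALUES
        (1, 1, 'fuel'), (2, 2, 'coffee'), (3, 2, 'oil');
""")

def merge_stores(conn, duplicate_id, keep_id, expected_rows):
    # Repoint purchases from the duplicate store to the one being kept,
    # then delete the duplicate. "with conn" commits on success and rolls
    # back if anything inside raises, including the row-count check.
    with conn:
        cur = conn.execute(
            "UPDATE tblPurchased SET storeID = ? WHERE storeID = ?",
            (keep_id, duplicate_id),
        )
        if cur.rowcount != expected_rows:
            raise ValueError(
                f"expected {expected_rows} rows, touched {cur.rowcount}"
            )
        conn.execute("DELETE FROM tblStore WHERE storeID = ?", (duplicate_id,))
    return cur.rowcount

# Merge the mis-keyed store 2 into store 1; two purchases should move.
merged = merge_stores(conn, duplicate_id=2, keep_id=1, expected_rows=2)
```

If the count does not match, nothing is changed, which gives you the chance to investigate before touching real data.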

What are some techniques for stored database keys in URL

I have read that using database keys in a URL is a bad thing to do.
For instance,
My table has 3 fields: ID:int, Title:nvarchar(5), Description:Text
I want to create a page that displays a record. Something like ...
http://server/viewitem.aspx?id=1234
First off, could someone elaborate on why this is a bad thing to do?
and secondly, what are some ways to work around using primary keys in a url?

I think it's perfectly reasonable to use primary keys in the URL.
Some considerations, however:
1) Avoid SQL injection attacks. If you just blindly accept the value of the id URL parameter and pass it into the DB, you are at risk. Make sure you sanitise the input so that it matches whatever format of key you have (e.g. strip any non-numeric characters).
2) SEO. It helps if your URL contains some context about the item (e.g. "big fluffy rabbit" rather than 1234). This helps search engines see that your page is relevant. It can also be useful for your users (I can tell from my browser history which record is which without having to remember a number).
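For point 1, a small sketch of validating the id and using a parameterised query. Note that it rejects malformed input outright rather than stripping characters, which is a slightly stricter variant of what is described above. The item table, sample row and helper names are invented for the example, with Python's sqlite3 as a stand-in database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY, title TEXT, description TEXT);
    INSERT INTO item VALUES (1234, 'Big fluffy rabbit', 'A rabbit.');
""")

def parse_item_id(raw):
    # Accept only a short string of ASCII digits; anything else is rejected.
    if raw is None or len(raw) > 18 or not raw.isascii() or not raw.isdigit():
        return None
    return int(raw)

def fetch_item(conn, raw_id):
    item_id = parse_item_id(raw_id)
    if item_id is None:
        return None
    # The "?" placeholder keeps the value out of the SQL text entirely.
    return conn.execute(
        "SELECT id, title FROM item WHERE id = ?", (item_id,)
    ).fetchone()
```

Here fetch_item(conn, "1234") finds the row, while something like "1234 OR 1=1" never reaches the database.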

It's not inherently a bad thing to do, but it has some caveats.
Caveat one is that someone can type in different keys and maybe pull up data you didn't want / expect them to get at. You can reduce the chance that this is successful by increasing your key space (for example making ids random 64 bit numbers).
Caveat two is that if you're running a public service and you have competitors they may be able to extract business information from your keys if they are monotonic. Example: create a post today, create a post in a week, compare Ids and you have extracted the rate at which posts are being made.
Caveat three is that it's prone to SQL injection attacks. But you'd never make those mistakes, right?

Using IDs in the URL is not necessarily bad. This site uses it, despite being done by professionals.
How can they be dangerous? When users are allowed to update or delete entries belonging to them, developers implement some sort of authentication, but they often forget to check if the entry really belongs to you. A malicious user could form a URL like "/questions/12345/delete" when he notices that "12345" belongs to you, and it would be deleted.
Programmers should ensure that a database entry with an arbitrary ID really belongs to the current logged-in user before performing such operation.
Sometimes there are strong reasons to avoid exposing IDs in the URL. In such cases, developers often generate random hashes that they store for each entry and use those in the URL. A malicious person tampering in the URL bar would have a hard time guessing a hash that would belong to some other user.
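The ownership check described above can be folded into the statement itself. A minimal sketch, assuming a hypothetical entry table with an owner_id column and a current_user_id taken from the logged-in session (Python's sqlite3 used as a stand-in database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entry (
        id       INTEGER PRIMARY KEY,
        owner_id INTEGER NOT NULL,
        body     TEXT
    );
    INSERT INTO entry VALUES (12345, 1, 'alice post'), (12346, 2, 'bob post');
""")

def delete_entry(conn, entry_id, current_user_id):
    # The owner_id condition means an id taken from the URL only matches
    # rows that belong to the logged-in user.
    with conn:
        cur = conn.execute(
            "DELETE FROM entry WHERE id = ? AND owner_id = ?",
            (entry_id, current_user_id),
        )
    return cur.rowcount == 1

# User 2 tries to delete user 1's entry, then their own.
blocked = delete_entry(conn, 12345, current_user_id=2)
allowed = delete_entry(conn, 12346, current_user_id=2)
```

The first call deletes nothing because the entry belongs to someone else; only the second one succeeds.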

Security and privacy are the main reasons to avoid doing this. Any information that gives away your data structure is more information that a hacker can use to access your database. As mopoke says, you also expose yourself to SQL injection attacks, which are fairly common and can be extremely harmful to your database and application. From a privacy standpoint, if you are displaying any information that is sensitive or personal, anybody can just substitute a number to retrieve information, and if you have no mechanism for authentication, you could be putting your information at risk. Also, if it's that easy to query your database, you open yourself up to Denial of Service attacks, with someone just looping through URLs against your server since they know each one will get a response.
Regardless of the nature of the data, I tend to recommend against sharing anything in the URL that could give away anything about your application's architecture; it seems to me you are just inviting trouble (I feel the same way about hidden fields, which aren't really hidden).
To get around it, we usually encrypt the parameters before passing them. In some cases, the encrypted URL also includes some form of verification/authentication mechanism so the server can decide if it's OK to process.
Of course every application is different and the level of security you want to implement has to be balanced with functionality, budget, performance, etc. But I don't see anything wrong with being paranoid when it comes to data security.
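As one possible shape for the verification part, here is a sketch that signs the id with an HMAC so the server can detect a tampered parameter. To be clear, this is signing, not encryption: the id is still visible in the URL, so it is a different and simpler technique than the one described above. SECRET_KEY, the "id.tag" token layout and both function names are assumptions for the example.

```python
import hashlib
import hmac

# Hypothetical server-side secret; in practice it would come from config.
SECRET_KEY = b"replace-with-a-real-secret"

def sign_id(item_id):
    # Produce "<id>.<tag>", where the tag is an HMAC-SHA256 over the id.
    msg = str(item_id).encode("ascii")
    tag = hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()
    return f"{item_id}.{tag}"

def verify_token(token):
    # Return the id if the tag matches, otherwise None.
    raw_id, sep, tag = token.partition(".")
    if not sep or not raw_id.isascii() or not raw_id.isdigit():
        return None
    expected = hmac.new(
        SECRET_KEY, raw_id.encode("ascii"), hashlib.sha256
    ).hexdigest()
    # compare_digest avoids leaking how many characters matched.
    if not hmac.compare_digest(tag.encode("utf-8"), expected.encode("ascii")):
        return None
    return int(raw_id)

token = sign_id(1234)
```

Changing the id part of the token without knowing the key makes verify_token return None, so the server can refuse the request before touching the database.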

It's a bit pedantic at times, but you want to use a unique business identifier for things rather than the surrogate key.
It can be as simple as ItemNumber instead of Id.
The Id is a db concern, not a business/user concern.

Using integer primary keys in a URL is a security risk. It is quite easy for someone to post using any number. For example, through normal web application use, the user creates a user record with an ID of 45 (viewitem/id/45). This means the user automatically knows there are 44 other users. And unless you have a correct authorization system in place, they can see another user's information by creating their own URL (viewitem/id/32).
2a. Use proper authorization.
2b. Use GUIDs for primary keys.

Showing the key itself isn't inherently bad because it holds no real meaning, but showing the means to obtain access to an item is bad.
For instance, say you had an online store that sold stuff from 2 merchants. Merchant A has items (1, 3, 5, 7) and Merchant B has items (2, 4, 6, 8).
If I am shopping on Merchant A's site and see:
http://server/viewitem.aspx?id=1
I could then try to fiddle with it and type:
http://server/viewitem.aspx?id=2
That might let me access an item that I shouldn't be accessing, since I am shopping with Merchant A and not B. In general, allowing users to fiddle with stuff like that can lead to security problems. Another brief example is employees who can look at their personal information (id=382) but type in someone else's id to go directly to someone else's profile.
Now, having said that.. this is not bad as long as security checks are built into the system that check to make sure people are doing what they are supposed to (ex: not shopping with another merchant or not viewing another employee).
One mechanism is to store information in sessions, but some do not like that. I am not a web programmer so I will not go into that :)
The main thing is to make sure the system is secure. Never trust data that came back from the user.

Everybody seems to be posting the "problems" with using this technique, but I haven't seen any solutions. What are the alternatives? There has to be something in the URL that uniquely defines what you want to display to the user. The only other solution I can think of would be to run your entire site off forms, and have the browser post the value to the server. This is a little trickier to code, as all links need to be form submits. Also, it's only minimally harder for users of the site to put in whatever value they wish. Also, this wouldn't allow the user to bookmark anything, which is a major disadvantage.
@John Virgolino mentioned encrypting the entire query string, which could help with this process. However, it seems like going a little too far for most applications.

I've been reading about this, looking for a solution, but as @Kibbee says there is no real consensus.
I can think of a few possible solutions:
1) If your table uses integer keys (likely), add a check-sum digit to the identifier. That way, (simple) injection attacks will usually fail. On receiving the request, strip the check-sum digit and verify that it matches the rest of the identifier; if it doesn't, you know the URL has been tampered with. This method also hides your "rate of growth" (somewhat).
2) When storing the DB record initially, save a "secondary key" or value that you are happy to use as a public id. This has to be unique and usually not sequential; examples are a UUID/GUID or a hash (MD5) of the integer ID, e.g. http://server/item.aspx?id=AbD3sTGgxkjero (but be careful of characters that are not safe to use in a URL). N.B. the secondary field will need to be indexed, and you will lose the benefits of clustering that you get in 1).
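Option 1 could be sketched as follows. The answer does not say which check-sum to use; the Luhn algorithm is picked here purely as a well-known example, and the function names are made up. A check digit only catches casual tampering (a random guess still passes about one time in ten), so it does not replace an authorization check.

```python
def luhn_check_digit(number):
    # Compute the Luhn check digit for a non-negative integer.
    total = 0
    # Walk the digits right to left, doubling every other digit,
    # starting with the rightmost one.
    for position, ch in enumerate(reversed(str(number))):
        digit = int(ch)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10

def to_public_id(internal_id):
    # Append the check digit to the integer key before putting it in a URL.
    return internal_id * 10 + luhn_check_digit(internal_id)

def from_public_id(public_id):
    # Strip the check digit and return the key, or None if it does not match.
    if public_id < 0:
        return None
    internal_id, check = divmod(public_id, 10)
    if luhn_check_digit(internal_id) != check:
        return None
    return internal_id
```

For example, to_public_id(1234) gives 12344, and from_public_id(12345) returns None because the last digit no longer matches.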
