Pattern matching of websites


I maintain a global repository of sites in a table.
website:
id, name, url
1 google http://www.google.com/
2 CNN http://www.cnn.com/
3 SO http://www.stackoverflow.com/
I maintain a reference table, which stores the IDs of the websites each user has saved.
userwebsite
userid, websiteid
[attributes of the table]
Say a user wants to save Microsoft in his collection, and he enters
www.microsoft.com
As the website doesn't exist in the global repository, it is first added to the repository and then added to his collection. Now the contents of both tables look something like this:
website:
id, name, url
1 google http://www.google.com/
2 CNN http://www.cnn.com/
3 SO http://www.stackoverflow.com/
4 msft http://www.microsoft.com
userwebsite:
userid, websiteid
1 4
Say a user is interested in saving google in his collection, and he enters
www.google.com
As the website already exists in the repository, it is not added again; only the reference is added to the user's collection.
This is where I am stuck:
www.google.com and http://www.google.com/
Semantically they point to the same site, but when you try to match them they are two distinct strings. How should I go about matching the strings in such cases?
One solution I can think of: when a site is entered, first check whether its domain exists in the collection of websites (PATINDEX would probably do well here). This gives you a list of sites that have the same domain name. Then check whether the path exists in any of the resulting websites. Is this a good idea?
Is there a well-established solution to this problem? Are there better ways to go about it?

You don't need pattern matching in this case. What you are really asking for (to continue from what Matteo commented) is a way of validating web addresses and storing them in a consistent way. But if you want a regular expression to at least determine whether the address is valid, you can have a look here: http://www.shauninman.com/archive/2006/05/08/validating_domain_names
Or use JavaScript to validate it, although you don't say what language you are using outside of SQL Server.
You would almost need to send the domain name to a Domain Name Server to resolve it before storing it in your table. It may be better to ignore the fact that they are web addresses and just think of them as strings. For example, how would you ensure people's names were compared correctly in a database? The first step is usually to enforce a single case (upper or lower); from then on it becomes more difficult, such as handling middle names or initials, which may be omitted.
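To make "storing them in a consistent way" a bit more concrete, here is a minimal Python sketch that normalizes an address before it is looked up or inserted. The function name normalize_url and the rules it applies (default to http, lower-case the host, treat an empty path as "/", drop default ports and fragments) are my own assumptions, not a standard; adjust them to whatever you consider "the same site".

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical helper: the normalization rules below are assumptions.
DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize_url(raw: str) -> str:
    """Return one canonical spelling for a user-entered web address."""
    raw = raw.strip()
    # Assumption: an address typed without a scheme is treated as http.
    if "://" not in raw:
        raw = "http://" + raw
    parts = urlsplit(raw)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep an explicit port only when it is not the scheme's default.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    # An empty path and "/" refer to the same resource.
    path = parts.path or "/"
    # The fragment never reaches the server, so it is dropped.
    return urlunsplit((scheme, host, path, parts.query, ""))
```

With this, www.google.com and http://www.google.com/ both become http://www.google.com/, so a plain equality comparison (or a unique index on the normalized column) is enough to find the existing row.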

Related

Should I store uploaded filename in database?

I have a database table with an autoincrement ID as primary key.
For each record of this table, I can have up to 3 files, which can be publicly available so random filename generation is not mandatory, and these files are optional.
I think I have 2 possible solutions:
Store a randomly generated filename in 3 nullable varchar columns and store all the files in the same place:
columns: a | b | c
uploads/f6se54fse654.jpg
Don't store the filenames, but place the files in specific folders and name them after the primary key value:
uploads/a/1.jpg
uploads/b/1.jpg
uploads/c/1.jpg
With this last solution, I know that uploads/a/1.jpg belongs to the record with ID 1, and is a file of type a. But I have to check whether the file exists, because the files are optional.
Do you think there is a good practice in all that? Or maybe there is a better approach?

If the files you are talking about are intended to be displayed or downloaded by users (whether visitors or authenticated users, filtered by roles (ACL) or not), it is important (IMHO) to ensure that a user cannot guess any information beyond the content of the resource that was sent to them. There is no perfect solution that applies to all cases without exception, so let's take an example to explain further.
In order to enhance the security and opacity of sensitive data, for example in the specific case of uploads/users/7/invoices/3.pdf, I think it would be wise to ensure that absolutely no one can guess the number of files associated with the user or any other entity (otherwise, in this example, one could guess that there are other accessible files: 1.pdf and 2.pdf). By design, we generally want to give access to files only in well-defined, specific cases and contexts. However, this may not matter for an image file that is intended to be seen by everyone (a profile photo, for example). That's why the context matters.
If you choose to keep the auto-incremented identifiers as names to refer to your files, this can also give information about the size of the data stored in your database (/uploads/invoices/128.pdf informs that you may already have 127 invoices on your server) and potentially motivate unscrupulous people to try to reach resources that should never be fetched outside the defined context. This is much harder to do if you use some kind of generated unique identifier (GUID).
I recommend that you read this article concerning the generation of GUIDs/UUIDs (128-bit numbers, usually written in hexadecimal) to be stored in your database for each uploaded or created file. If you use a recent version of MySQL, it is even possible to store this identifier in a BINARY(16) column, which offers conversion to and from the UUID text form; I'll let you read this interesting topic on the subject. The result will probably look like /uploads/invoices/b0016303-8e4f-487a-8c30-5dddf1ebf7e9.pdf, which is a lot better as long as you ensure that the generated identifier is unique.
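As a rough illustration of the UUID approach, here is a minimal Python sketch. The helper name storage_path and the uploads/<kind>/ layout are assumptions for the example; the returned string is what you would store in the nullable column for that file.

```python
import uuid
from pathlib import PurePosixPath

def storage_path(kind: str, original_name: str) -> str:
    """Build an unguessable storage path for an uploaded file.

    `kind` is a hypothetical category folder (e.g. "invoices"); only the
    extension of the user-supplied name is kept, lower-cased.
    """
    ext = PurePosixPath(original_name).suffix.lower()
    # uuid4() is random, so the name reveals neither a row count nor an order.
    return f"uploads/{kind}/{uuid.uuid4()}{ext}"
```

A NULL in the column then means "no file of this type", which avoids probing the filesystem to find out whether an optional file exists.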
I don't think it is useful to discuss performance here, because today there are many ways to cache files, paths and URLs, which in many cases avoid having to make a request each time a resource is fetched (caches are often prioritized by popularity in big-data settings).
Last but not least, many web and mobile applications (I am thinking of Slack, Discord, Facebook, Twitter...) store a lot of media files every day, often associated with user accounts and including both public and confidential files and information, and they generate a unique hash for each of them.
Twitter uses its own unique identifier generator (producing 64-bit BIGINT values) called Twitter Snowflake, which you might find interesting to read about too. It is based on a millisecond timestamp, combined with a machine identifier and a sequence number so that IDs generated in the same millisecond remain unique.
There isn't a global, perfect solution that can be applied to everything, but I hope this helps; you may want to take a deeper look and find the "best solution" for each context and each entity to which you'll link stored files.

Couchbase, two users registering with same username but different datacenters?

Let's say I have two users, Alice in North America and Bob in Europe. Both want to register a new account with the same username, at the same time, on different datacenters. The datacenters are configured to replicate between each other using eventual consistency.
How can I make sure only one of them succeeds at registering the username? Keep in mind that the connection between the datacenters might even be offline at the time (worst case, but a daily occurrence on Spotify's Cassandra setup).
EDIT:
I do realize the key uniqueness is the big problem here. The thing is that I need all usernames to be unique. Imagine using Twitter if you couldn't tag a specific person, but had to tag everyone with the same username.

With any eventual consistency system, and particularly in the presence of a network partition, you essentially have two choices:
Accept collisions, and pick a winner later.
Ensure you never have a collision.
In the case of Couchbase:
For (1) that means letting two users register with the same username in both NA and EU, and then later picking one as the "winner" (when the network link is present). This is not a very desirable outcome for something like a user account. A slight variation on this would be something like @Robert's suggestion of putting them in a staging area (which means the account cannot be made "active" until the partition is resolved), and then telling the "winning" user they have successfully registered, and the "loser" that the name is taken and to try again.
For (2) this means making the users unique, even though they pick the same username, for example by adding a NA:: / EU:: prefix to their username document. When they log in, the application would need some logic to try looking up both document variations, likely trying the prefix for the local region first. (This is essentially the same idea as the "realms" or "servers" that many MMO games use.)
There are variations of both of these, but ultimately, given an AP-type system (which Couchbase across XDCR is), you've essentially chosen Availability and Partition-Tolerance over Consistency, and hence need to reconcile that at the application layer.
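Option (2) can be sketched roughly as follows. This is plain Python with a dict standing in for a Couchbase bucket, so none of it is Couchbase SDK code; REGIONS, register and find_user are hypothetical names for the example.

```python
# A dict stands in for a bucket here; this is NOT the Couchbase SDK.
REGIONS = ("NA", "EU")

def register(bucket: dict, region: str, username: str, profile: dict) -> bool:
    """Create the user under a region-prefixed key; fail if taken locally."""
    key = f"{region}::{username}"
    if key in bucket:
        return False
    bucket[key] = profile
    return True

def find_user(bucket: dict, local_region: str, username: str):
    """Look the user up in the local region first, then in the others."""
    order = [local_region] + [r for r in REGIONS if r != local_region]
    for region in order:
        doc = bucket.get(f"{region}::{username}")
        if doc is not None:
            return region, doc
    return None
```

Because the two registrations write different keys, replication merges them without a conflict; the price is that the same visible username can exist once per region, which is exactly the "realms" trade-off.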

Put the user name registrations into a staging table until you can perform a replication to determine if the name already exists in one of the other data centers.

You tagged Couchbase, so I will answer about that.
As long as the key for each object is different, you should be fine with Couchbase. It is the keys that would be unique and work great with XDCR. Another solution would be to have a concatenated key made up of the username and other values (company name, etc) if that suits your use case, again giving you a unique key for the object. Yet another would be to have a key/value in a JSON document that is the username.

It's not clear to me whether you're using Cassandra or Couchbase.
As far as Cassandra is concerned, since version 2.0 you can use Lightweight Transactions, which were created for exactly this purpose. A serial consistency level was introduced to achieve just what you need. In the above link you can read the following:
For example, suppose that I have an application that allows users to
register new accounts. Without linearizable consistency, I have no way
to make sure I allow exactly one user to claim a given account — I
have a race condition analogous to two threads attempting to insert
into a [non-concurrent] Map: even if I check for existence before
performing the insert in one thread, I can’t guarantee that no other
thread inserts it after the check but before I do.
As for the missing connection between two or more clusters, it's your choice how to handle it. If you can't guarantee uniqueness at insert time, you can either refuse the registration or accept it, deal with the conflict later, and apologize.
HTH, Carlo

Choosing the right model for storing and querying data?

I am working on my first GAE project using Java and the datastore, and this is my first try with a NoSQL database. Like a lot of people, I have problems understanding the right model to use. So far I've figured out two models and I need help choosing the right one.
All the data is represented in two classes, User.class and Word.class.
User: a couple of strings with user data (username, email, ...)
Word: two strings
Which is better:
1) Search in 10 000 000 entities for the 100 I need. For instance, every Word entity has a string property owner, and I query (owner = 'John').
2) In User.class I add a property List<Word> and a method getWords() that returns the list of words. So I query 1000 users for the one I need and then call getWords(), which returns a List<Word> with the 100 I need.
Which one uses fewer resources? Or am I going the wrong way with this?

The answer is to use appstats and you can find out:
AppStats
To keep your application fast, you need to know:
Is your application making unnecessary RPC calls? Should it be caching
data instead of making repeated RPC calls to get the same data? Will
your application perform better if multiple requests are executed in
parallel rather than serially?
Run some tests, try it both ways and see what appstats says.
But I'd say that your option 2) is better simply because you don't need to search millions of entities. But who knows for sure? The trouble is that "resources" are a dozen different things in App Engine: CPU, datastore reads, datastore writes, etc.

For your User class, set a unique ID for each user (such as a username or email address). For the Word class, set the parent of each Word entity to a specific User.
So, if you wanted to look up words from a specific user, you would do an ancestor query for all words belonging to that specific user.
By setting an ID for each user, you can get that user by ID as opposed to doing an additional query.
More info on ancestor queries:
https://developers.google.com/appengine/docs/java/datastore/queries#Ancestor_Queries
More info on IDs:
https://developers.google.com/appengine/docs/java/datastore/entities#Kinds_and_Identifiers

It really depends on the queries you're using. I assume that you want to find all the words given a certain owner.
Most likely, 2 would be cheaper, since you'll only need to fetch the user entity instead of running a query.
2 will be a bit more work on your part, since you'll need to manually keep the list synchronized with the instances of Word.
Off the top of my head I can think of 2 problems with #2, which may or may not apply to you:
A. If you want to find all the owners given a certain word, you'll need to keep that list of words indexed. This affects your costs. If you mostly find words by owner, and rarely find owners by words, it'll still make sense to do it this way. However, if your search pattern flips around and you're searching for owners by words a lot, this may be the wrong design. As you see, you need to design the models based on the queries you will be using.
B. Entities are limited to 1MB, and there's a limit on the number of indexed properties (5000 I think?). Those two will limit the number of words you can store in your list. Make sure that you won't need more than that limit of words per user. Method 1 allows you unlimited words per user.

SQL Design - How to store large amount of URLs

I'm writing an application that will have a SQL Server backend that will store (among other things) URLs. URLs will be mapped to users, and some URLs may be common between different users. In the absence of a true DBA, I'm trying to design a solution that can handle hundreds of thousands of URLs as efficiently as possible.
Ideas:
1) Create a table that simply has ID, URL
Pro: simple, complete.
Con: duplicate entries for a URL will exist, which will make the table larger than it needs to be.
2) Break up the users and URLs into separate tables: one table containing user ID and URL ID, another table with URL ID and the URL itself.
Pro: single URL in the system, seems more "enterprisey"
Con: must join two tables when trying to pull back results, and I'm not really sure what the benefit of this approach is.
3) Expand on idea 2, except REALLY break it up. So have a table for domain, another for path/query string. Then, the user table would have user ID, domain ID, path ID.
Pro: URLs could share data even if they were unrelated (meaning cnn.com/helloworld and nbc.com/helloworld would have different domain IDs but the same path ID; it seems this could be useful when running metrics later).
Con: seems like a nightmare from a performance perspective (again, because joins would be necessary to pull a URL).
Any thoughts?

I would do the following in my design:
UserId UrlId
1 1
2 2
1 1
UrlId Url
1 http://www.google.com
2 http://www.yahoo.com
Store your URLs in a separate table, and only create a new entry in the URL table if an exact match does not already exist. If you have a lot of common URLs, this will save some space. You could take it a step further and add a third table as you mentioned, e.g.
UrlPathId UrlId UrlPath
1 1 /shopping
...and then tying the UrlPathId to the User table. And perhaps even further:
UrlPathId UrlId UrlQueryString
1 1 ?product=speakers
...and again, referencing this from your User table.

It sounds like you are describing a many-to-many relationship between users and URLs.
I would strongly suggest ruling out option 1. Not only will it increase size, but if you need to update a URL or a user, you'll have to do it everywhere it's duplicated instead of once.
Choosing between 2 and 3 is more difficult, because it depends much more on how this is going to be used. #2 is much simpler, and is still normalized. The features in #3 don't seem to outweigh the complexity to me, so personally I'd pick #2.
Edit: Upon seeing George's answer, I completely agree with the first section.
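For what it's worth, option 2 can be sketched in a few lines. This uses Python with an in-memory SQLite database as a stand-in for SQL Server, so the DDL and the INSERT OR IGNORE syntax are SQLite's, and the table, column and function names are made up for the example.

```python
import sqlite3

# SQLite stands in for SQL Server here; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE url (
    url_id INTEGER PRIMARY KEY,
    url    TEXT NOT NULL UNIQUE          -- one row per distinct URL
);
CREATE TABLE user_url (
    user_id INTEGER NOT NULL,
    url_id  INTEGER NOT NULL REFERENCES url(url_id),
    PRIMARY KEY (user_id, url_id)        -- a user links a URL at most once
);
""")

def save_url(user_id: int, url: str) -> int:
    """Insert the URL only if it is new, then link it to the user."""
    conn.execute("INSERT OR IGNORE INTO url (url) VALUES (?)", (url,))
    url_id = conn.execute(
        "SELECT url_id FROM url WHERE url = ?", (url,)).fetchone()[0]
    conn.execute(
        "INSERT OR IGNORE INTO user_url (user_id, url_id) VALUES (?, ?)",
        (user_id, url_id))
    return url_id

def urls_for(user_id: int) -> list:
    """Return the user's URLs; this is the single join option 2 costs."""
    rows = conn.execute(
        "SELECT u.url FROM user_url AS uu "
        "JOIN url AS u ON u.url_id = uu.url_id "
        "WHERE uu.user_id = ? ORDER BY u.url", (user_id,))
    return [row[0] for row in rows]
```

The join is on an integer key that is indexed on both sides, which is why it is usually not a concern at hundreds of thousands of rows.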

Are you really that short on space? Unless you need to treat URLs as an object in their own right I would just go for option 1 and cover it with indexes if you have specific performance requirements on URLs alone.
See my other comment here on dealing with orphan URLs.

What are some techniques for storing database keys in a URL

I have read that using database keys in a URL is a bad thing to do.
For instance,
My table has 3 fields: ID:int, Title:nvarchar(5), Description:Text
I want to create a page that displays a record. Something like ...
http://server/viewitem.aspx?id=1234
First off, could someone elaborate on why this is a bad thing to do?
Secondly, what are some ways to work around using primary keys in a URL?

I think it's perfectly reasonable to use primary keys in the URL.
Some considerations, however:
1) Avoid SQL injection attacks. If you just blindly accept the value of the id URL parameter and pass it into the DB, you are at risk. Make sure you sanitise the input so that it matches whatever format of key you have (e.g. strip any non-numeric characters).
2) SEO. It helps if your URL contains some context about the item (e.g. "big fluffy rabbit" rather than 1234). This helps search engines see that your page is relevant. It can also be useful for your users (I can tell from my browser history which record is which without having to remember a number).
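To illustrate point 1), here is a small Python sketch that validates the id parameter and then passes it as a bound parameter instead of concatenating it into the SQL. It uses the standard sqlite3 module as a stand-in for the real database driver; the item table and the function names are assumptions for the example.

```python
import sqlite3

def parse_item_id(raw: str) -> int:
    """Accept only plain ASCII digits; reject anything else outright."""
    if not (raw.isascii() and raw.isdigit()):
        raise ValueError(f"invalid item id: {raw!r}")
    return int(raw)

def fetch_item(conn: sqlite3.Connection, raw_id: str):
    """Look up one row; the id never becomes part of the SQL text."""
    item_id = parse_item_id(raw_id)
    # The "?" placeholder lets the driver bind the value safely.
    return conn.execute(
        "SELECT id, title FROM item WHERE id = ?", (item_id,)).fetchone()
```

Rejecting bad input is arguably safer than stripping characters from it, and the bound parameter protects you even if the validation is ever loosened.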

It's not inherently a bad thing to do, but it has some caveats.
Caveat one is that someone can type in different keys and maybe pull up data you didn't want / expect them to get at. You can reduce the chance that this is successful by increasing your key space (for example making ids random 64 bit numbers).
Caveat two is that if you're running a public service and you have competitors they may be able to extract business information from your keys if they are monotonic. Example: create a post today, create a post in a week, compare Ids and you have extracted the rate at which posts are being made.
Caveat three is that it's prone to SQL injection attacks. But you'd never make those mistakes, right?

Using IDs in the URL is not necessarily bad. This site uses it, despite being done by professionals.
How can they be dangerous? When users are allowed to update or delete entries belonging to them, developers implement some sort of authentication, but they often forget to check if the entry really belongs to you. A malicious user could form a URL like "/questions/12345/delete" when he notices that "12345" belongs to you, and it would be deleted.
Programmers should ensure that a database entry with an arbitrary ID really belongs to the currently logged-in user before performing such an operation.
Sometimes there are strong reasons to avoid exposing IDs in the URL. In such cases, developers often generate random hashes that they store for each entry and use those in the URL. A malicious person tampering in the URL bar would have a hard time guessing a hash that would belong to some other user.
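The ownership check described above could look roughly like this in Python. The Entry class, the Forbidden exception and the dict used as storage are all hypothetical stand-ins; the point is only that the owner is compared with the logged-in user before anything is deleted.

```python
from dataclasses import dataclass

class Forbidden(Exception):
    """Raised when the current user may not touch the requested entry."""

@dataclass
class Entry:
    id: int
    owner_id: int

def delete_entry(entries: dict, current_user_id: int, entry_id: int) -> None:
    """Delete an entry only if it belongs to the logged-in user."""
    entry = entries.get(entry_id)
    # Same error for "missing" and "not yours", so the response does not
    # reveal which IDs exist.
    if entry is None or entry.owner_id != current_user_id:
        raise Forbidden(f"entry {entry_id} is not available")
    del entries[entry_id]
```

With this in place, a hand-crafted /questions/12345/delete from another account simply fails, whether or not the ID was guessable.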

Security and privacy are the main reasons to avoid doing this. Any information that gives away your data structure is more information that a hacker can use to access your database. As mopoke says, you also expose yourself to SQL injection attacks, which are fairly common and can be extremely harmful to your database and application. From a privacy standpoint, if you are displaying any information that is sensitive or personal, anybody can just substitute a number to retrieve information, and if you have no mechanism for authentication, you could be putting your information at risk. Also, if it's that easy to query your database, you open yourself up to Denial of Service attacks, with someone just looping through URLs against your server since they know each one will get a response.
Regardless of the nature of the data, I tend to recommend against sharing anything in the URL that could give away anything about your application's architecture; it seems to me you are just inviting trouble (I feel the same way about hidden fields, which aren't really hidden).
To get around it, we usually encrypt the parameters before passing them. In some cases, the encrypted URL also includes some form of verification/authentication mechanism so the server can decide whether it's OK to process.
Of course every application is different and the level of security you want to implement has to be balanced with functionality, budget, performance, etc. But I don't see anything wrong with being paranoid when it comes to data security.
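As an illustration of the verification part only, here is a minimal Python sketch that signs the parameter with an HMAC so the server can detect tampering. Note that this is signing, not encryption: the ID stays readable in the URL. SECRET_KEY, sign_id and verify_token are hypothetical names, and the key shown is a placeholder.

```python
import hashlib
import hmac

# Placeholder key for the sketch; a real one must be random and kept secret.
SECRET_KEY = b"replace-with-a-real-secret"

def sign_id(item_id: int) -> str:
    """Return "<id>.<hex mac>" for use as a URL parameter."""
    mac = hmac.new(SECRET_KEY, str(item_id).encode(), hashlib.sha256)
    return f"{item_id}.{mac.hexdigest()}"

def verify_token(token: str):
    """Return the ID if the signature matches, otherwise None."""
    item_id, sep, mac = token.partition(".")
    if not sep or not (item_id.isascii() and item_id.isdigit()):
        return None
    expected = hmac.new(SECRET_KEY, item_id.encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many characters matched.
    if hmac.compare_digest(mac.encode(), expected.encode()):
        return int(item_id)
    return None
```

A user who changes 1234 to 1235 in the URL cannot produce the matching signature without the key, so the request can be rejected before it reaches the database. It does not replace an authorization check on the record itself.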

It's a bit pedantic at times, but you want to use a unique business identifier for things rather than the surrogate key.
It can be as simple as ItemNumber instead of Id.
The Id is a db concern, not a business/user concern.

Using integer primary keys in a URL is a security risk. It is quite easy for someone to post using any number. For example, through normal web application use, the user creates a user record with an ID of 45 (viewitem/id/45). This means the user automatically knows there are 44 other users. And unless you have a correct authorization system in place, they can see other users' information by crafting their own URL (viewitem/id/32).
2a. Use proper authorization.
2b. Use GUIDs for primary keys.

Showing the key itself isn't inherently bad because it holds no real meaning, but showing the means to obtain access to an item is bad.
For instance, say you had an online store that sold stuff from 2 merchants. Merchant A has items (1, 3, 5, 7) and Merchant B has items (2, 4, 6, 8).
If I am shopping on Merchant A's site and see:
http://server/viewitem.aspx?id=1
I could then try to fiddle with it and type:
http://server/viewitem.aspx?id=2
That might let me access an item that I shouldn't be accessing, since I am shopping with Merchant A and not B. In general, allowing users to fiddle with stuff like that can lead to security problems. Another brief example is employees who can look at their personal information (id=382) but type in someone else's ID to go directly to that person's profile.
Now, having said that, this is not bad as long as security checks are built into the system to make sure people are only doing what they are supposed to (e.g. not shopping with another merchant or viewing another employee).
One mechanism is to store information in sessions, but some do not like that. I am not a web programmer so I will not go into that :)
The main thing is to make sure the system is secure. Never trust data that came back from the user.

Everybody seems to be posting the "problems" with using this technique, but I haven't seen any solutions. What are the alternatives? There has to be something in the URL that uniquely defines what you want to display to the user. The only other solution I can think of would be to run your entire site off forms, and have the browser post the value to the server. This is a little trickier to code, as all links need to be form submits. Also, it's only minimally harder for users of the site to put in whatever value they wish. Also, this wouldn't allow the user to bookmark anything, which is a major disadvantage.
@John Virgolino mentioned encrypting the entire query string, which could help with this process. However, it seems like going a little too far for most applications.

I've been reading about this, looking for a solution, but as @Kibbee says there is no real consensus.
I can think of a few possible solutions:
1) If your table uses integer keys (likely), add a checksum digit to the identifier. That way, (simple) injection attacks will usually fail. On receiving the request, simply strip the checksum digit, recompute it from the remaining digits, and check that the two match; if they don't, you know the URL has been tampered with. This method also hides your "rate of growth" (somewhat).
2) When storing the DB record initially, save a "secondary key" or value that you are happy to use as a public ID. This has to be unique and usually not sequential; examples are a UUID/GUID or a hash (MD5) of the integer ID, e.g. http://server/item.aspx?id=AbD3sTGgxkjero (but be careful of characters that are not URL-safe). NB: the secondary field will need to be indexed, and you will lose the benefits of clustering that you get in 1).
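Solution 1) can be sketched in Python like this. I've picked the Luhn check digit (the one used on credit card numbers) as the checksum, which is my choice and not something the scheme requires; the function names are hypothetical and IDs are assumed to be non-negative integers.

```python
def luhn_digit(number: int) -> int:
    """Compute the Luhn check digit for a non-negative integer payload."""
    total = 0
    # Walk the digits right to left, doubling every second one,
    # starting with the rightmost digit of the payload.
    for i, ch in enumerate(reversed(str(number))):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10

def encode_public_id(db_id: int) -> int:
    """Append the check digit to the database ID."""
    return db_id * 10 + luhn_digit(db_id)

def decode_public_id(public_id: int):
    """Return the database ID, or None if the check digit does not match."""
    db_id, check = divmod(public_id, 10)
    return db_id if luhn_digit(db_id) == check else None
```

So record 1234 would be published as 12344, and a URL edited to 12354 fails the check. Keep in mind the algorithm is public, so this only stops casual fiddling; it is not a substitute for authorization.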
