Protecting (or tracking plagiarism of) Openly Available Web Content (database/list/addreses)

Protecting (or tracking plagiarism of) Openly Available Web Content (database/list/addreses) - database

We have put together a very comprehensive database of retailers across the country with specific criteria. It took over a year of phone interviews, etc., to put together the list. The list is, of course, not openly available on our site to download as a flat file...that would be silly.
But all the content is searchable on the site via Google Maps. So theoretically with enough zip-code searches, someone could eventually grab all the retailer data. Of course, we don't want that since our whole model is to do the research and interviews required to compile this database and offer it to end-users for consumption on our site.
So we've come to the conclusion there isnt really any way to protect the data from being taken en-masse but a potentially competing website. But is there a way to watermark the data? Since the Lat/Lon is pre-calculated in our db, we dont need the address to be 100% correct. We're thinking of, say, replacing "1776 3rd St" with "1776 Third Street" or replacing standard characters with unicode replacements. This way, if we found this data exactly on a competing site, we'd know it was plagiarism. The downside is if users tried to cut-and-paste the modified addresses into their own instance of Google Maps -- in some cases the modification would make it difficult.
How have other websites with valuable openly-distributed content tackled this challenge? Any suggestions?
Thanks

It is a question of "openly distribute" vs "not openly distribute" if you ask me. If you really want to distribute it, you should acknowledge that someone can receive the data.
With certain kinds of data (media like photos, movies, etc) you can watermark or otherwise tamper with the data so it becomes trackable, but if your content is like yours that will become hard, and even harder to defend: if you use "third street" and someone else also uses it, do you think you can make a case against them? I highly doubt it.
The only steps I can think of is
Making it harder to get all the information. Hide it behind scripts and stuff instead of putting it on google maps, make sure it is as hard as you can make it for bots to get the information, limit the amount of results shown to one user, etc. This could very well mean your service is less attractive to the end user, this is a trade-off
Sort of the opposite of above: use somewhat the same technique to HIDE some of the data for the common user instead of showing it to them. This would be FAKE data, that a normal person shouldn't see. If these retailers show up at your competitors, you've caught them red-handed. This is certainly not fool-proof, as they can check their results for validity and remove your fake stuff, there is always a possibility a user with a strange system gets the fake data which makes your served content less correct, and lastly if your competitors' scraper looks too much like real user, it won't get the data.
provide 2-step info: in step one you get the "about" info, anyone can find that. In step 2, after you've confirmed that this is what the user wants, maybe a login, maybe just limited in requests etc, you give everything. So if the user searches for easy-to-reach retailers, first say in which area you have some, and show it 'roughly' on the map, and if they have chosen something, show them in a limited environment what the real info is.

Related

What's the point of storing the user agent?

So far while logging userlogins I always stored the complete user agent additionally to already parsed informations (like browser, version, os, etc). The user agent usually just is a TEXT field in the table.
While implementing another similar thing, I was asking myself: What's even the point of doing that? Obviously, the user agent can be manipulated easily in any case, and the only relevant informations (browser, version and operating system) are already parsed and stored separately anyways.
Is there some actual benefit in still storing it, except for backtracking of data that could be faked anyways? What other relevant informations does the user agent contain to justify the (over years, quite large) amount of data that is used to store it?
And of course I realize that the user agent contains a lot more than just the browser specifications - but how many times did you really have to go back and analyze the user agent itself?
Just to clarify: I'm talking about reasons why to store the raw user-agent string, after parsing the "relevant" informations out of it (browser, os, etc) - what is the point of the user-agent after that point?

The user agent string contains information about the environment including operating system and browser. It is something I frequently check. There are two main reasons to store it.
If you are following up on a bug report or error then this
information is useful or even essential for determining what went
wrong - imagine trying to find an error that occurs only on IE8
without the user agent! This information can also help you prioritize a bug fix. You will want to fix an issue that is present on 93% of environments before you fix the one that is present on 7%.
Secondly, it provides very useful stats on the profile of your user. You might only want to support environments of more than a certain percentage of your user base. For example, if you are designing a new version of your software and, on examining your user agent logs, you find no one using IE, you might not bother to optimize or design for IE.
You seem to be concerned that the user agent string can be faked. While this is possible, unless there is some specific reason someone might do this in your app, it seems rather paranoid to worry about it. You make a good point, though, to remember what information is possible to fake.
UPDATE: I see your point, in fact in the logging I recently implemented I removed the parsed string because of the data overhead. There is little point in storing both the raw string and the parsed string. The only real reason to do that would be to make querying the logs slightly easier, which is not a good enough reason to me. Personally, I store the whole raw useragent which means no loss of data, future proofing for future browsers/oses/formats of user string, and eliminates the possibility of making mistakes when parsing.
From Wikipedia:
For this reason, most Web browsers use a User-Agent value as follows:
Mozilla/[version] ([system and browser information]) [platform]
([platform details]) [extensions]
If you have stored all the fields out of that you need then by all means discard the rest. The amount of data to log, how long to keep logs for, and in what form to keep them is a fairly personal thing that will differ in some ways from company to company and project to project.

How to merge user data after login?

It doesn't matter if you're building an eshop or any other application which uses session to store some data between requests.
If you don't want to annoy the user by requiring him to register, you need to allow him to do certain tasks anonymously when possible (user really have to have a reason for registering).
There comes a problem - if user decides to login with his existing profile, he may already have some data in his "anonymous" session.
What are the best practices of merging these data? I'm guessing the application should merge it automatically where possible or let the user decide where not possible.
But what I'm asking more is if there are any resources about how to do the magic in database (where the session data are usually stored) effectively.
I have two basic solutions in my mind:
To keep anonymous session data and just add another "relation" saying what's actually used where and how it's merged
To physically merge these data
We could say that the first solution will probably be more effective, because the information about any relation will probably mean less data than data about the user. But it will also mean more effort when reading the data (as we firstly need to read the relation to get to actual user data).
Are there any articles/resources for designing data structures for this particular use case (anonymous + user data)?

An excellent question that any app developer using user data should ask, and, sadly very few do :(
In fact, there are two completely independent questions here:
Q1 - At what stage require user to sign in/up?
Q2 - Data concurrency and conflict resolution (see below).
And here some analysis for each of the questions. Please excuse my extra passion coming from my own "frustrated user" experience. :)
Q1 is a pure usability question. To which the answer is actually obvious:
Avoid or delay to force the user sign in as much as possible!
Even the need to save state is not enough a reason by itself. If I am as user not interested in saving that state, then don't force me to sign! Please!
The only reason for you (as website) to justify forcing me to sign is when I (as user) want to save my data for later use. Here I speak as user having wasted time on signing only to find the site useless. If you want to get rid of too many users, that is the right way. In any other case - please, delay it as much as possible!
Why so many sites completely disregard such an obvious rule? The possible reasons I can see:
R1- developer friendly vs user friendly. Yes, it is developer friendly to require sign in right away, so we don't need to bother with concurrency (Q2). So we can save developer costs, time etc. But every saving comes at a cost! Which in this case is called User Experience. Which is not necessarily where you would like to look for saving. Especially, since the solution should not be that hard (see below).
R2 - Designer or Manager making the decision is an "indoor enthusiast" :) She lives happy life surrounded by super-fast computers with super-fast internet connection and can't imagine singing up can be that hard for any user. So why is it such a big deal? So many reasons:
It breaks the application flow. Sites living in previous century still replace the whole screen with sometimes rather lengthy form. Some forms are badly designed, some have erratic instructions, some simply don't work. Some have submit buttons that are for some reason disabled in the browser used.
Some form designers have genius idea to lock certain fields with barely noticeable change or colour. Then don't show me the field if you don't want me to fill it!
If the site is serious about user's data, it must request Email and must verity it! Why? How else shall I get back to user who forgot all other credentials? Why verify? What if user mistyped the email? If I don't verify it, next time the user tries to recover password with her correct email, the recovery fails and all data are lost! Obvious, yet there are still sites out there not doing it. Then I need to wait till the verification email is received and click on, hopefully, well-formatted and uniquely identifiable link that does not break in my browser, nor get some funny characters due to broken encoding detection, making the whole link unusable.
The internet connection can be slow or broken, making every additional step a piece of pain. Even with good connection, it happens here and there that page suddenly takes much longer to load. Also the email may not arrive right away. Then impatient user starts furiously clicking the "resend verification" link. In which case 90% of sites resend their link with new token but also disable all previous tokens. Then several emails arrive in unpredictable order and poor user has to guess in vain, which one (and only one) is still valid. Now why those sites find it so hard to keep several tokens active, just for this case, is beyond my understanding.
Finally there is still this so hard to unlearn habit for sites to insist on the so-called "username". So now, beside my email, I have to think hard to come up with this unique username, different from any previous user! Thank you so much for making it sweet and easy! My own way of dealing with it is to use my email as username. Sadly, there are still sites not accepting it! Then what if some fun type used my email as his username? Not so unrealistic if your email is bill#gates.com. But why simply not use Email and Password to avoid all this mess?
Here some possible guidelines to relieve user's pain:
Only force me to sign in/up if you absolutely need and give me a chance to choose not to!
Make it one page form, so I know what I am up to and, needless to say, use as few input fields as possible. Ideally only Email and Password (possibly twice), no Username!
Show your sign in form as small window on top of your page without reloading, and allow me to get rid of it with single click away from that window. Don't force me to look for "close" button or, even worse, icon I could confuse for something else!
Account for user to click back/forth and reload buttons. Don't clear the form upon reload! Don't put clear button at all! It is too easy to click by accident. The data you are ask me to fill should not be so long in first place, that I could not re-enter it without the need of "assistance" to clear.
Now to question Q2. Here we have well known problem of conflict resolution that occurs any time two data need to be merged. For instance, the anonymous data and the registered user data, but also whenever two users modify the same data, or the same user modifies it from different devices at different times, or locally stored data conflict with server data, and so on.
But whatever the source is, the problem is always the same. We have two data, say two objects $obj1 and $obj2 and you need to produce your single merged object $obj3. The logic can be as simple as the rule that server's object always wins, or that the last modified object always wins, or the last modified object keys always win or any more complicated logic. This really depends on the nature of your application. But in each case, all you need to do is to write your logic function with arguments $obj1, $obj2 that returns $obj3.
A solution that will possibly work in many cases is to store timestamp on each object attribute (key) and let the latest changed key win at the moment of synchronisation. That accounts e.g. for the situation when the same user modifies different attributes when being anonymous from different devices.
Imagine I had modified keys A and B on device AA yesterday, then logged today from device BB to enter another B and saved it to the server, then switched back to my device AA, where I am anonymous, to enter yet another A without changing the old B from yesterday, and then realised I want to log in and synchronise. Then my local B is obviously old and should clearly not overwrite the value of B that I changed more recently on device BB. In this seemingly complicated case, the above solutions works seamlessly and effectively. In contrast, putting the timestamp only on whole objects would be wrong.
Now in some cases, it could make sense to keep both objects, and, e.g. distinguish them by adding extra properties, like in case 1 suggested in Radek's question. For instance, Dropbox adds something like "conflicted copy by user X" to the end of the file. Like in Dropbox case, this is sensible in case of collaboration apps, where users like to have some version control.
However, in those cases, you as developer simply save two copies and let the users deal with that problem.
If on the other hand, you have to write a complicated logic based on user's data, having two different copies hanging around can be a nightmare. In that case, I would split data into two groups (e.g. create two objects out of one). The first group has data representing the state of the application as a whole, that is important to be unique. For that data I would use the conflict resolution as above or similar. Then the second group is user-specific, where I would store both data as two separate entries in the database, properly mark them (like Dropbox does), and then let users deal with the list of two (or more) entries of their project.
Finally, if that additional complication of database management makes the developer uneasy, and since Radek asked to give a resource reference, I want to "kill two flies with one shot" by mentioning the blog entry StackMob offline Sync, whose solution provides both database and user management functionality and so relieves the developer from that pain. Surely there is a lot more info to be found when searching for data concurrence, conflict resolution and the likes.
To conclude, I have to add the obligatory disclaimer, that all written here are merely my own thoughts and suggestions, that everyone should use at own risk and don't hold me responsible if you suddenly get too many happy users making your system crash :)
As I am myself working on an app, where I am implementing all those aspects, I am certainly very interested to hear other opinions and what else folks have to say on the subject.

From my experience - both as a user of sites that require a login, and as a developer working with logged in users - I don't think I've ever seen a site behave this way.
The common pattern is to let a user be anonymous and the first time they do something that would require saving state, they are prompted to login. Then the previous action is remembered and the user can continue. For example, if they try to add something to their shopping cart, they are prompted to login and then after login, the item is in their cart.
I suppose some places would allow you to fill a cart and then login at which point the cart is associated with a concrete user.
I would create a SessionUser object that has the state of the site interaction and one field called UserId that is used to retrieve other things like name, address, etc.
With anonymous users, I would create the SessionUser object with an empty reference for UserId. This means we can't resolve a name or an address, but we can still save state. The actions they are performing, the pages they're viewing, etc.
Once they login we don't have to merge two objects, we just populate the UserId field in SessionUser and now we can traverse an object graph to get name, email, address or whatever else.

Are there any algorithms for creating playlists that don't require a massive database (or that work with a publicly accessible one)?

Is there any algorithm with which I can automatically create a playlist of songs that well with each other -- similarly to services like iTunes Genius -- that a single developer can actually implement? It should either a) not require any sort of remote database of listening habits etc. or b) require such a database, but work with one that is freely available.

i did this, and i used the last.fm database as described by tomasz. i didn't use "related artist" directly, but instead constructed my own relationship graph by comparing tags associated with different artists (this is not the approach suggested by lcfseth btw - i have quite a large range of music and i wanted to explore "natural" connections that might not be common partners in "normal" playlists; also i wasn't sure how uniform the related artists were).
i also used a local database to cache data from last.fm, because calls to the api are rate limited, and i experimented with using other parts of the api to improve / normalize the information i was reading from mp3 tags.
generating a useful graph of related artists was actually quite hard; largely because some nodes in the graph naturally tend to be more important than others. if you don't "even out" the graph then your playlist will keep returning to the "important" artists.
the final result did work well, in that the selection of music had a good balance between "central theme" and variation. but the implementation is not at all polished, the calculation of the graph can take a long time (many hours), the program takes up a fair amount of memory when running, and it still seems to play elvis costello a little more than expected ;o)
if you are interested, the code is at http://code.google.com/p/uykfe/
the best part of all, from my point of view as a user, is that it can update logitech media server (squeezeserver) playlists in "realtime", adding a new track whenever the list is empty. that works really well in continuing from whatever music you select "by hand". it can also generate one-off playlists, of course, and, finally, by tweaking parameters you can get a kind of "random walk" through your music collection - it will play related tunes but slowly drift from one style to another (in fact, this is really the "default" mode - to get it to stay on a single theme i needed extra logic that biased it towards whatever music it had played earlier).
ps also, the dump of the final graph to gephi was really cool - i had it printed out and it's now pinned to the wall...
pps i also experimented with the musicbrainz database, which in theory sounds like a fantastic resource. but in practice it is over-complex and poorly documented.

I don't know iTunes Genius, but I think last.fm database and API might be useful for you. Every time you see any track it shows you a list of similar tracks, based on other users preferencs. The same information can be obtained using track.getSimilar API method.

The idea behind most of these databases, is to see what other users listens to after they listen to a given song. The accuracy of these statistics depends on the number of users therefor it is probably hard to use this locally. The algorithm itself is not that hard to implement.
The alternative would be to sort song based on genre, singer... which are informations that are usually embedded in the songs but not always. Winamp have this feature, but it won't work for old songs, unless you manually set the informations or use an On-line song database.

Best practices on what data to collect in an in-app web analytics

In our SaaSy webapp we need to collect Google Analytics-like data (like, what pages were visited, how many 404s where there, etc.). I wonder if there are any best practices on what pieces of information should be collected (like, IP, User Agent, etc.) and how should these logs be stored. Requirements on what statistics we're going to display are not yet fixed, but I want to have a starting point.

Tracking for the sake of tracking is pointless. The point of tracking activity on your site is to answer specific business questions, such as how many people are buying your product, or how far are they getting in your sale funnel or other events like signing up for a newsletter, etc...What you should be doing is asking the people who make business decisions what it is they need/want to know, and go from there.
Having said that, most ad-hoc reports can be generated with basics like the URL and timestamp. Ability to parse specific variables from the URL and categorize them and their values is handy for campaign tracking. Tracking IP addresses are good for debugging and finding out what country/region/market the user is coming from. Referring URL is good for tracking where the user came from on the internet (another site, paid vs. organic search, a campaign, etc...).
And then throw a couple of variables into the mix. Allow for the ability to populate variables with arbitrary information (like product IDs, etc...) that can be sent to you and stored, so you can see things like how many times a product was viewed or purchased, how much it cost, etc...
But anyways, to answer your question, ultimately "best practice" is first sitting down with the guys in suits and ask what they want/need to know and work with them to find out if what they want to know is just silly or if it's actually actionable (for example, knowing things like number of pageviews is okay but how actionable is it really? What's MORE actionable is knowing how many of xyz is being sold, or where on your site people are abandoning you, so you can streamline your site, maybe decide your product or offer sucks and needs to be revisited, etc...).
I have to ask though...is there a particular reason you wish to create your own tracking tool as opposed to using or investing in one of the many tools already out there? There is Google Analytics (GA), Yahoo Web Analytics (YWA), Omniture SiteCatalyst, Webtrends to name a few. Some are free, some cost money, but it is an investment that yields real returns if used properly.

Designing a main form ("main menu") for a WinForm application

The form that currently loads during when our beta WinForm application starts up is one that shows a vast array of buttons... "Inventory", "Customers", "Reports", etc. Nothing too exciting.
I usually begin UI by looking at similar software products to see how they get done, but as this is a corporate application, I really can't go downloading other corporate applications.
I'd love to give this form a bit of polish but I'm not really sure where to start. Any suggestions?
EDIT: I am trying to come up with multiple options to present to users, however, I'm drawing blanks as well. I can find a ton of design ideas for the web, but there really doesn't seem to be much for Windows form design.

I have found that given no option, users will have a hard time to say what they want. Once given an option, it's usually easier for them to find things to change. I would suggest making some paper sketches of potential user interfaces for you application. Then sit down with a few users and discuss around them. I would imagine that you would get more concrete ideas from the users that way.
Update
Just a couple of thoughts that may (or may not) help you get forward:
Don't get too hung up on the application being "corporate". Many coprorate applications that I have seen look so boring that I feel sorry for the users that need to see them for a good share of their day.
Look at your own favourite UI's and ask yourself why you like them.
While not getting stuck in the "corporate template", also do not get too creative; the users collected experience comes from other applications and it may be good if they can guess how things work without training.
Don't forget to take in inspiration from web sites that you find appealing and easy to use.
Try to find a logical "flow"; visualize things having the same conceptual functionality in a consistent way; this also helps the user do successful "guesswork".

You might look to other applications that your users are familiar with. Outlook is ubiquitous in my company, and we were able to map our application to its interface relatively easily, so we used that application as a model when developing our UI.
Note that I'm not suggesting Outlook specifically to you, just that you look for UIs that would make your users' learning curve shallower.

The problem here is that you need some good user analysis and I'm guessing you've only done functional analysis.
Because your problem is so abstract, it's hard to give one good example of what you need to do. I'd go to usability.gov and check out the usability methods link, especially card sorting and contextual interviews.
Basically you want to do two things:
1- Discover where your users think how information is grouped on the page: This will help flesh out your functional requirements too. Once you've got information all grouped up, you've basically got your navigation metaphor set up. Also, you can continually do card sorting exercises right down to page and function levels - e.g. you do one card sorting session to understand user needs, then you take one group of cards and ask users to break that down into ranks of importance. Doing so will help you understand what needs to be in dominate areas of the screen and what can be hidden.
2- Understand what tools they already use: what they do and don't like about them. You need to get a list of tools/applications that they use externally and internally. Internally is probably the most important because there is a fair chance that most people in your business will share an experience of using it. External tools however might help give you context into how your users think.
Also, don't be afraid to get pencil and paper and sketch up ideas with users. People generally understand that sketches are a quick and useful way to help with early design work and you can get an immense amount of information out of them with just simple sketches. Yes, even do this if you suck at sketching - chances are it won't matter. In fact, crappy sketches could even work in your favour because then nobody is going to argue if buttons should be blue, red or whatever.

Frankly, a form with a “vast array of buttons” needs more than a little polish. A form dedicated solely to navigation generally means you’re giving your users unnecessary work. Provide a pulldown or sidebar menu on each form for navigating to any form.
The work area of your starting form should provide users with something to actually accomplish their tasks. Among the options are:
A “dashboard” main form, showing summarized information about the users’ work (e.g., list of accounts to review and status of each, number of orders at each stage of processing, To Do schedule). Ideally, users should be able to perform their most common tasks directly in the opening form (e.g., mark each account as “approved” or not). If further information is necessary to complete a task, links navigate to detailed forms filled with the proper query results. At the very least users should be able to assess the status of their work without going any further. Note that different groups of users may need different things on their respective dashboards.
Default form or forms. Users of a corporate application typically have specific assignments, often involving only one to three of all your forms. Users who work with Inventory, for example, may almost never need to look at Customer records, and vice versa. Users also often work on a specific subset of records. Each sales rep, for example may be assigned a small portion of the total number of customers in the database. Divide your users into groups based on the forms and records they usually use. For each user group, start the app by automatically opening the user group’s form(s) populated with the query results of their records. Users should be able to complete most of their work without any further navigation or querying.
If all else fails, open the app to whatever forms and content were last open when the user quit the app. Many corporate users will continue to work tomorrow on the same or similar stuff they’re working on today.
Analyze the tasks of your users to determine which of the above options to use. It is generally not productive to describe each option to the users and ask which they like better.
BTW, “Reports” is probably not a particularly good navigation option. It’s better if you consistently identify things primarily by what they show, rather than how they show it. Users may not know that the information they want to see is in a “report” rather than a form, but they’ll know what content they want to see. Reports on inventory are accessed under Inventory; reports about sales are accessed under Sales.

Have you tried asking your end users what they would like? After all they are the ones that are going to be using the system.

I use components from the company DevExpress. They have some really cool controls (such as the Office 2007 ribbon), form skinning utilities (with a vast amount of different skins), and a load more...
If you want to check it out they have 60 free components - if its corporate though you might have to check the licence but you can get it at... DevExpress 60 Free

I suggest starting with the design principles suggested by Microsoft: Windows User Experience Interaction Guidelines

Some places to get ideas for interaction designs:
Books
About Face 3 - The Essentials of Interaction Design
Don't Make Me Think (this is focused on web design, but many of the principles carry over to Windows design)
Web Sites
Windows User Experience Interaction Guidelines
In addition, many applications have free trial versions that you can download to determine how they handle user interaction. Also, don't discount items on your desktop right now.

Do you have any statistics or insights concerning what the most commonly-used or important functions might be? If so, you could use that to pare down your "vast array of buttons" and highlight only those that are most important.
That's sort of a trivial example, but the underlying point is that your understanding of your audience should inform your design, at least from a functional perspective. You might have past usage statistics, or user stories, or documented workflows, or whatever - even if you're drawing a blank right now, remember that you have to know something about your users, otherwise you wouldn't be able to write software for them.
Building on what they already know can make it easy on your users. Do they live in Outlook? Then you might want to mimic that (as Michael Petrotta suggested). Do they typically do the same thing (within a given role) every time they use the app? Then look for a simple, streamlined interface. Are they power users? Then they'll likely want to be able to tweak and customize the interface. Maybe you even have different menu forms for different user roles.
At this stage, I wouldn't worry about getting it right; just relax and put something out there. It almost doesn't matter what you design, because if you have engaged users and you give them the option, they're going to want to change something (everything?) anyway. ;-)

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight