What's the point of storing the user agent? - database

So far, when logging user logins, I have always stored the complete user agent in addition to the already parsed information (browser, version, OS, etc.). The user agent is usually just a TEXT field in the table.
While implementing another similar thing, I asked myself: what's even the point of doing that? Obviously, the user agent can be manipulated easily in any case, and the only relevant information (browser, version and operating system) is already parsed and stored separately anyway.
Is there some actual benefit in still storing it, except for backtracking of data that could be faked anyway? What other relevant information does the user agent contain to justify the (over years, quite large) amount of data that is used to store it?
And of course I realize that the user agent contains a lot more than just the browser specifications - but how many times did you really have to go back and analyze the user agent itself?
Just to clarify: I'm talking about reasons to store the raw user-agent string after parsing the "relevant" information out of it (browser, OS, etc.) - what is the point of the user agent after that?

The user agent string contains information about the environment including operating system and browser. It is something I frequently check. There are two main reasons to store it.
If you are following up on a bug report or error then this information is useful or even essential for determining what went wrong - imagine trying to find an error that occurs only on IE8 without the user agent! This information can also help you prioritize a bug fix. You will want to fix an issue that is present on 93% of environments before you fix the one that is present on 7%.
Secondly, it provides very useful stats on the profile of your user. You might only want to support environments of more than a certain percentage of your user base. For example, if you are designing a new version of your software and, on examining your user agent logs, you find no one using IE, you might not bother to optimize or design for IE.
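As a rough illustration of the stats point, here is a minimal Python sketch that turns logged raw user-agent strings into a percentage per browser family. The function names and the substring checks are my own assumptions, not a complete parser; the order of the checks matters because Chrome's string also contains "Safari" and Edge's also contains "Chrome".

```python
from collections import Counter


def browser_family(user_agent):
    """Very rough classification of a raw user-agent string.

    Heuristic only: the checks are ordered because the tokens overlap.
    """
    if "MSIE " in user_agent or "Trident/" in user_agent:
        return "IE"
    if "Edg/" in user_agent or "Edge/" in user_agent:
        return "Edge"
    if "Firefox/" in user_agent:
        return "Firefox"
    if "Chrome/" in user_agent:
        return "Chrome"
    if "Safari/" in user_agent:
        return "Safari"
    return "Other"


def family_share(user_agents):
    """Return {family: percentage of logged entries}; empty input gives {}."""
    counts = Counter(browser_family(ua) for ua in user_agents)
    total = sum(counts.values())
    return {family: 100.0 * n / total for family, n in counts.items()}
```

Because the raw strings are kept, the classification rules can be changed later and the whole log re-counted.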
You seem to be concerned that the user agent string can be faked. While this is possible, unless there is some specific reason someone might do this in your app, it seems rather paranoid to worry about it. You make a good point, though, to remember what information is possible to fake.
UPDATE: I see your point, in fact in the logging I recently implemented I removed the parsed string because of the data overhead. There is little point in storing both the raw string and the parsed string. The only real reason to do that would be to make querying the logs slightly easier, which is not a good enough reason to me. Personally, I store the whole raw useragent which means no loss of data, future proofing for future browsers/oses/formats of user string, and eliminates the possibility of making mistakes when parsing.
From Wikipedia:
For this reason, most Web browsers use a User-Agent value as follows:
Mozilla/[version] ([system and browser information]) [platform] ([platform details]) [extensions]
If you have stored all the fields out of that you need then by all means discard the rest. The amount of data to log, how long to keep logs for, and in what form to keep them is a fairly personal thing that will differ in some ways from company to company and project to project.
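If only the raw string is stored, the fields can still be pulled out on demand. The following is a minimal Python sketch that splits a string along the general shape quoted above; the regular expression and the field names are my own assumptions, and real user agents deviate from this shape often enough that the result should be treated as a best effort.

```python
import re

# product/version, an optional parenthesised system part, then everything else.
UA_PATTERN = re.compile(
    r"^(?P<product>[^/\s]+)/(?P<version>\S+)\s*"
    r"(?:\((?P<system>[^)]*)\))?\s*(?P<rest>.*)$"
)


def split_user_agent(raw):
    """Split a raw user-agent string into coarse parts, or return None."""
    match = UA_PATTERN.match(raw)
    if match is None:
        return None
    return {
        "product": match.group("product"),
        "version": match.group("version"),
        "system": match.group("system") or "",
        "rest": match.group("rest"),
    }
```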

Related

How to permanently store information using C program?

I am trying to get input from the user that the program will remember every time it runs. I want to do this in C. Based on my limited knowledge of computers, this will be stored in the storage/hard drive/ssd instead of the RAM/memory. I know I could use a database, or write to a text file, but I don't want to use an external file or a database, since I think that a database is a bit overkill for this and an external file can be messed with by the user. How can I do this (get the user to enter the input once and for the program to remember it forever)? Thanks! (If anyone needs me to clarify my question, I'd be happy to do so and when I get an answer, I will clarify this question for future users.)
People have done this all kinds of ways. In order from best to worst (in my opinion):
Use a registry or equivalent. That's what it's there for.
Use an environment variable. IMO this is error prone and may not really be what you want.
Store a file on your user's computer. Easy to do and simple.
Store a file on a server and read/write to the server via the network. Annoying that you have to use the network, but OK.
Modify your own binary on disk. This is fun as a learning experience, but generally inadvisable in production code. Still it can be done sometimes especially using an installer.
Spawn a background process that "never" dies. This is strictly worse than using a file.
You won't be able to prevent the user from modifying a file if they really want to. What you could do is create a file with a name or extension that makes it obvious that it should not be modified, or make it hidden to the user.
There isn't really any common way that you could write to a file and at the same time prevent the user from accessing it. You would need OS/platform level support to have some kind of protected storage.
The only real alternative commonly available is to store the information online on a server that you control and fetch it from there over the network. You could cache a cryptographically signed local copy with an expiration date to avoid having to connect every time the program is run. Of course if you are doing this as some kind of DRM or similar measure (e.g., time-limited demo), you will also need to protect your software from modification.
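The signed local copy with an expiration date could look roughly like the sketch below. It is written in Python for brevity (the question is about C, but the idea carries over), and the function names are hypothetical. Big assumption: it uses an HMAC with a shared secret, which means the program itself must contain the secret and a determined user could extract it. A real design would more likely have the server sign with a private key and ship only the public key with the program.

```python
import base64
import hashlib
import hmac
import json
import time


def sign_cache(payload, secret, ttl_seconds, now=None):
    """Serialize payload with an expiry time and append an HMAC-SHA256 tag."""
    now = time.time() if now is None else now
    body = json.dumps(
        {"data": payload, "expires": now + ttl_seconds}, sort_keys=True
    ).encode("utf-8")
    tag = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return base64.b64encode(body).decode("ascii") + "." + tag


def load_cache(token, secret, now=None):
    """Return the cached payload, or None if it is malformed, forged or expired."""
    now = time.time() if now is None else now
    try:
        encoded, tag = token.rsplit(".", 1)
        body = base64.b64decode(encoded, validate=True)
    except ValueError:
        return None
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode("ascii"), tag.encode("utf-8")):
        return None
    record = json.loads(body)
    if record["expires"] < now:
        return None  # stale copy: the program should contact the server again
    return record["data"]
```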
(Note that modifying the program itself, which you mentioned in a comment, is not really any different from modifying other files. In particular, the user can restore an earlier version of the program from backup or by re-downloading it, which is something even a casual user might try. Also, any signature on the software would become invalid by your modifications, and antivirus software may be triggered.)
If you simply wish to hide your file somewhere to protect against casual users (only), give it an obscure name and set the file hidden using both filesystem attributes and naming (in *nix systems with a . as the first character of file name). (How hidden you can make it may be thwarted by permissions and/or sandboxing, depending on the OS.)
Also note that if your goal is to hide the information from the user, you should encrypt it somehow. This includes any pieces of the data that are part of the program that writes it.
edit: In case I guessed incorrectly and the reason for wanting to do this is simply to keep things "clean", then just store it in the platform's usual "user settings" area. For example, AppData or registry on Windows, user defaults or ~/Library/Application Support on macOS, a "dotfile" on generic *nix systems, etc. Do not modify the application itself for this reason.
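For the "usual user settings area" route, a minimal sketch follows. It is in Python rather than C for brevity, and the function names, the settings.json file name and the dotfile naming are all my own assumptions. The directory choices mirror the examples above (AppData on Windows, ~/Library/Application Support on macOS, a dot-directory elsewhere).

```python
import json
import os
import sys
from pathlib import Path


def settings_dir(app_name, home=None, platform=None, environ=None):
    """Pick a conventional per-user settings directory for this platform."""
    home = Path.home() if home is None else Path(home)
    platform = sys.platform if platform is None else platform
    environ = os.environ if environ is None else environ
    if platform.startswith("win"):
        return Path(environ.get("APPDATA", home / "AppData" / "Roaming")) / app_name
    if platform == "darwin":
        return home / "Library" / "Application Support" / app_name
    return home / ("." + app_name.lower())


def save_settings(directory, settings):
    """Write the settings dict as JSON, creating the directory if needed."""
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "settings.json").write_text(json.dumps(settings), encoding="utf-8")


def load_settings(directory):
    """Read the settings back; a missing file means a first run."""
    path = directory / "settings.json"
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))
```

On first run load_settings returns an empty dict, so the program can prompt the user once, call save_settings, and find the value there on every later run.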
If you want to persist data, the typical way to do that is to store it in a file. Just use FILE* and go about your business.
Using a database for this may be overkill; it depends on how you want to access the data later, once it is stored.
If you just load the data from the file and search through it, then there is no need for a database, if you have loads of data and want to make complex searches, then a database is the way to go. If you need redundancy, user handling, security then choose a database, since the developers of each one already spent a lot of time fixing this.

How to merge user data after login?

It doesn't matter if you're building an eshop or any other application which uses session to store some data between requests.
If you don't want to annoy the user by requiring him to register, you need to allow him to do certain tasks anonymously when possible (the user really has to have a reason for registering).
There comes a problem - if user decides to login with his existing profile, he may already have some data in his "anonymous" session.
What are the best practices of merging these data? I'm guessing the application should merge it automatically where possible or let the user decide where not possible.
But what I'm asking more is if there are any resources about how to do the magic in database (where the session data are usually stored) effectively.
I have two basic solutions in my mind:
To keep anonymous session data and just add another "relation" saying what's actually used where and how it's merged
To physically merge these data
We could say that the first solution will probably be more effective, because the information about any relation will probably mean less data than data about the user. But it will also mean more effort when reading the data (as we firstly need to read the relation to get to actual user data).
Are there any articles/resources for designing data structures for this particular use case (anonymous + user data)?
An excellent question that any app developer using user data should ask, and, sadly very few do :(
In fact, there are two completely independent questions here:
Q1 - At what stage require user to sign in/up?
Q2 - Data concurrency and conflict resolution (see below).
And here some analysis for each of the questions. Please excuse my extra passion coming from my own "frustrated user" experience. :)
Q1 is a pure usability question. To which the answer is actually obvious:
Avoid or delay forcing the user to sign in as much as possible!
Even the need to save state is not enough a reason by itself. If I am as user not interested in saving that state, then don't force me to sign! Please!
The only reason for you (as website) to justify forcing me to sign is when I (as user) want to save my data for later use. Here I speak as user having wasted time on signing only to find the site useless. If you want to get rid of too many users, that is the right way. In any other case - please, delay it as much as possible!
Why so many sites completely disregard such an obvious rule? The possible reasons I can see:
R1- developer friendly vs user friendly. Yes, it is developer friendly to require sign in right away, so we don't need to bother with concurrency (Q2). So we can save developer costs, time etc. But every saving comes at a cost! Which in this case is called User Experience. Which is not necessarily where you would like to look for saving. Especially, since the solution should not be that hard (see below).
R2 - Designer or Manager making the decision is an "indoor enthusiast" :) She lives a happy life surrounded by super-fast computers with a super-fast internet connection and can't imagine signing up can be that hard for any user. So why is it such a big deal? So many reasons:
It breaks the application flow. Sites living in previous century still replace the whole screen with sometimes rather lengthy form. Some forms are badly designed, some have erratic instructions, some simply don't work. Some have submit buttons that are for some reason disabled in the browser used.
Some form designers have genius idea to lock certain fields with barely noticeable change or colour. Then don't show me the field if you don't want me to fill it!
If the site is serious about user's data, it must request Email and must verify it! Why? How else shall I get back to a user who forgot all other credentials? Why verify? What if the user mistyped the email? If I don't verify it, next time the user tries to recover the password with her correct email, the recovery fails and all data are lost! Obvious, yet there are still sites out there not doing it. Then I need to wait till the verification email is received and click on a, hopefully, well-formatted and uniquely identifiable link that does not break in my browser, nor get some funny characters due to broken encoding detection, making the whole link unusable.
The internet connection can be slow or broken, making every additional step a piece of pain. Even with good connection, it happens here and there that page suddenly takes much longer to load. Also the email may not arrive right away. Then impatient user starts furiously clicking the "resend verification" link. In which case 90% of sites resend their link with new token but also disable all previous tokens. Then several emails arrive in unpredictable order and poor user has to guess in vain, which one (and only one) is still valid. Now why those sites find it so hard to keep several tokens active, just for this case, is beyond my understanding.
Finally there is still this so hard to unlearn habit for sites to insist on the so-called "username". So now, beside my email, I have to think hard to come up with this unique username, different from any previous user! Thank you so much for making it sweet and easy! My own way of dealing with it is to use my email as username. Sadly, there are still sites not accepting it! Then what if some fun type used my email as his username? Not so unrealistic if your email is bill@gates.com. But why simply not use Email and Password to avoid all this mess?
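On the "resend verification" complaint above: keeping several tokens alive at once is not much code. Here is a minimal in-memory sketch in Python; the class name, the 24-hour lifetime and the dict-based storage are assumptions for illustration, and a real site would keep this in its database.

```python
import secrets
import time


class VerificationTokens:
    """Keeps every issued token valid until it expires or the email is verified."""

    def __init__(self, ttl_seconds=24 * 3600):
        self.ttl = ttl_seconds
        self._tokens = {}  # token -> (email, expires_at)

    def issue(self, email, now=None):
        """Issue a new token WITHOUT invalidating earlier ones for this email."""
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)
        self._tokens[token] = (email, now + self.ttl)
        return token

    def verify(self, token, now=None):
        """Return the verified email, or None for unknown or expired tokens."""
        now = time.time() if now is None else now
        entry = self._tokens.get(token)
        if entry is None or entry[1] < now:
            return None
        email = entry[0]
        # The address is confirmed, so the remaining tokens for it can go.
        self._tokens = {t: e for t, e in self._tokens.items() if e[0] != email}
        return email
```

Whichever of the emails the impatient user opens first, the link in it works.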
Here some possible guidelines to relieve user's pain:
Only force me to sign in/up if you absolutely need and give me a chance to choose not to!
Make it one page form, so I know what I am up to and, needless to say, use as few input fields as possible. Ideally only Email and Password (possibly twice), no Username!
Show your sign in form as small window on top of your page without reloading, and allow me to get rid of it with single click away from that window. Don't force me to look for "close" button or, even worse, icon I could confuse for something else!
Account for the user clicking the back/forth and reload buttons. Don't clear the form upon reload! Don't put a clear button at all! It is too easy to click by accident. The data you are asking me to fill in should not be so long in the first place that I could not re-enter it without the need of "assistance" to clear.
Now to question Q2. Here we have well known problem of conflict resolution that occurs any time two data need to be merged. For instance, the anonymous data and the registered user data, but also whenever two users modify the same data, or the same user modifies it from different devices at different times, or locally stored data conflict with server data, and so on.
But whatever the source is, the problem is always the same. We have two data, say two objects $obj1 and $obj2 and you need to produce your single merged object $obj3. The logic can be as simple as the rule that server's object always wins, or that the last modified object always wins, or the last modified object keys always win or any more complicated logic. This really depends on the nature of your application. But in each case, all you need to do is to write your logic function with arguments $obj1, $obj2 that returns $obj3.
A solution that will possibly work in many cases is to store timestamp on each object attribute (key) and let the latest changed key win at the moment of synchronisation. That accounts e.g. for the situation when the same user modifies different attributes when being anonymous from different devices.
Imagine I had modified keys A and B on device AA yesterday, then logged in today from device BB to enter another B and saved it to the server, then switched back to my device AA, where I am anonymous, to enter yet another A without changing the old B from yesterday, and then realised I want to log in and synchronise. Then my local B is obviously old and should clearly not overwrite the value of B that I changed more recently on device BB. In this seemingly complicated case, the above solution works seamlessly and effectively. In contrast, putting the timestamp only on whole objects would be wrong.
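The per-key timestamp rule can be sketched in a few lines. This is Python rather than the $obj notation used above, and the representation (each key mapping to a value plus a modification time) is my own assumption about how the data would be stored.

```python
def merge_by_key(obj1, obj2):
    """Merge two objects of the form {key: (value, modified_at)}.

    For every key the most recently modified entry wins; on a tie the
    entry from obj1 is kept.
    """
    merged = dict(obj1)
    for key, (value, modified_at) in obj2.items():
        if key not in merged or modified_at > merged[key][1]:
            merged[key] = (value, modified_at)
    return merged
```

With the device scenario above, the local anonymous data would be {"A": ("a-new", 3), "B": ("b-old", 1)} and the server data {"B": ("b-from-BB", 2)}. Merging keeps the new local A and the newer server B, which a whole-object timestamp could not do.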
Now in some cases, it could make sense to keep both objects, and, e.g. distinguish them by adding extra properties, like in case 1 suggested in Radek's question. For instance, Dropbox adds something like "conflicted copy by user X" to the end of the file. Like in Dropbox case, this is sensible in case of collaboration apps, where users like to have some version control.
However, in those cases, you as developer simply save two copies and let the users deal with that problem.
If on the other hand, you have to write a complicated logic based on user's data, having two different copies hanging around can be a nightmare. In that case, I would split data into two groups (e.g. create two objects out of one). The first group has data representing the state of the application as a whole, that is important to be unique. For that data I would use the conflict resolution as above or similar. Then the second group is user-specific, where I would store both data as two separate entries in the database, properly mark them (like Dropbox does), and then let users deal with the list of two (or more) entries of their project.
Finally, if that additional complication of database management makes the developer uneasy, and since Radek asked to give a resource reference, I want to "kill two flies with one shot" by mentioning the blog entry StackMob offline Sync, whose solution provides both database and user management functionality and so relieves the developer from that pain. Surely there is a lot more info to be found when searching for data concurrence, conflict resolution and the likes.
To conclude, I have to add the obligatory disclaimer, that all written here are merely my own thoughts and suggestions, that everyone should use at own risk and don't hold me responsible if you suddenly get too many happy users making your system crash :)
As I am myself working on an app, where I am implementing all those aspects, I am certainly very interested to hear other opinions and what else folks have to say on the subject.
From my experience - both as a user of sites that require a login, and as a developer working with logged in users - I don't think I've ever seen a site behave this way.
The common pattern is to let a user be anonymous and the first time they do something that would require saving state, they are prompted to login. Then the previous action is remembered and the user can continue. For example, if they try to add something to their shopping cart, they are prompted to login and then after login, the item is in their cart.
I suppose some places would allow you to fill a cart and then login at which point the cart is associated with a concrete user.
I would create a SessionUser object that has the state of the site interaction and one field called UserId that is used to retrieve other things like name, address, etc.
With anonymous users, I would create the SessionUser object with an empty reference for UserId. This means we can't resolve a name or an address, but we can still save state. The actions they are performing, the pages they're viewing, etc.
Once they login we don't have to merge two objects, we just populate the UserId field in SessionUser and now we can traverse an object graph to get name, email, address or whatever else.
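A minimal sketch of that SessionUser idea, in Python. Only SessionUser and UserId come from the description above; the other field names (cart, pages_viewed) and the method names are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SessionUser:
    session_id: str
    user_id: Optional[int] = None  # empty reference while the visitor is anonymous
    cart: list = field(default_factory=list)
    pages_viewed: list = field(default_factory=list)

    @property
    def is_anonymous(self):
        return self.user_id is None

    def attach_user(self, user_id):
        """Called after login: state is kept, only the reference is filled in."""
        self.user_id = user_id
```

Anything collected while anonymous (the cart here) is simply still there after attach_user, so there is nothing to merge unless the account already has its own saved state.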

Protecting (or tracking plagiarism of) Openly Available Web Content (database/list/addreses)

We have put together a very comprehensive database of retailers across the country with specific criteria. It took over a year of phone interviews, etc., to put together the list. The list is, of course, not openly available on our site to download as a flat file...that would be silly.
But all the content is searchable on the site via Google Maps. So theoretically with enough zip-code searches, someone could eventually grab all the retailer data. Of course, we don't want that since our whole model is to do the research and interviews required to compile this database and offer it to end-users for consumption on our site.
So we've come to the conclusion there isn't really any way to protect the data from being taken en masse by a potentially competing website. But is there a way to watermark the data? Since the Lat/Lon is pre-calculated in our db, we don't need the address to be 100% correct. We're thinking of, say, replacing "1776 3rd St" with "1776 Third Street" or replacing standard characters with unicode replacements. This way, if we found this data exactly on a competing site, we'd know it was plagiarism. The downside is if users tried to cut-and-paste the modified addresses into their own instance of Google Maps -- in some cases the modification would make it difficult.
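One way the address-variant idea might be sketched, in Python: for each record, choose one of several equivalent spellings using a keyed hash of the record id, so the pattern of choices is reproducible by us but looks arbitrary to anyone else. The function names, the secret and the data layout are assumptions for illustration only.

```python
import hashlib
import hmac


def pick_variant(record_id, variants, secret):
    """Deterministically choose one spelling for a record from a keyed hash."""
    digest = hmac.new(secret, record_id.encode("utf-8"), hashlib.sha256).digest()
    return variants[digest[0] % len(variants)]


def match_rate(suspect, catalog, secret):
    """Fraction of suspect records whose spelling equals our watermarked choice.

    suspect: {record_id: address as published elsewhere}
    catalog: {record_id: [equivalent spellings we could have published]}
    """
    checked = hits = 0
    for record_id, address in suspect.items():
        if record_id in catalog:
            checked += 1
            hits += address == pick_variant(record_id, catalog[record_id], secret)
    return hits / checked if checked else 0.0
```

A single matching spelling proves nothing, but a match rate close to 1.0 across many records would be hard to explain by coincidence.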
How have other websites with valuable openly-distributed content tackled this challenge? Any suggestions?
Thanks
It is a question of "openly distribute" vs "not openly distribute" if you ask me. If you really want to distribute it, you should acknowledge that someone can receive the data.
With certain kinds of data (media like photos, movies, etc) you can watermark or otherwise tamper with the data so it becomes trackable, but if your content is like yours that will become hard, and even harder to defend: if you use "third street" and someone else also uses it, do you think you can make a case against them? I highly doubt it.
The only steps I can think of are:
Making it harder to get all the information. Hide it behind scripts and stuff instead of putting it on google maps, make sure it is as hard as you can make it for bots to get the information, limit the amount of results shown to one user, etc. This could very well mean your service is less attractive to the end user, this is a trade-off
Sort of the opposite of above: use somewhat the same technique to HIDE some of the data for the common user instead of showing it to them. This would be FAKE data, that a normal person shouldn't see. If these retailers show up at your competitors, you've caught them red-handed. This is certainly not fool-proof, as they can check their results for validity and remove your fake stuff, there is always a possibility a user with a strange system gets the fake data which makes your served content less correct, and lastly if your competitors' scraper looks too much like real user, it won't get the data.
provide 2-step info: in step one you get the "about" info, anyone can find that. In step 2, after you've confirmed that this is what the user wants, maybe a login, maybe just limited in requests etc, you give everything. So if the user searches for easy-to-reach retailers, first say in which area you have some, and show it 'roughly' on the map, and if they have chosen something, show them in a limited environment what the real info is.
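Two of the steps above are easy to sketch: limiting how many searches one client can make, and checking a competitor's listing for the planted fake entries. This is a minimal Python illustration; the class and function names, the sliding-window approach and the idea of keying on a client id are my own assumptions.

```python
from collections import defaultdict, deque


class SearchLimiter:
    """Allow at most max_requests searches per client within a sliding window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self._hits = defaultdict(deque)  # client_id -> timestamps of allowed requests

    def allow(self, client_id, now):
        hits = self._hits[client_id]
        while hits and now - hits[0] >= self.window:
            hits.popleft()  # forget requests that fell out of the window
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def find_canaries(competitor_names, canary_names):
    """Return the planted fake retailers that show up in a competitor's data."""
    seen = {name.strip().casefold() for name in competitor_names}
    return sorted(c for c in canary_names if c.strip().casefold() in seen)
```

As noted above, neither is fool-proof: a scraper can spread requests over many clients, and a careful competitor can validate entries and drop the fakes.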

Steps to publish Software to be purchased via Registration

I'm about to get finished developing a windows application which I want to release as shareware. It was developed in C# and will be running on .Net 3.5+ machines.
To use it the user will have to be online.
My intent is to let the user try it for 30 days and then limit its functionality until a registration is purchased.
The installer will be made available via an msi file.
Could anyone give the general steps on how to implement this?
Here are some more specific questions:
Since I am trying to avoid having to invest a lot upfront in order to establish an e-commerce site, I was thinking of a way to just let the user pay somehow, while supplying his email in which he then receives the unlock key.
I found some solutions out there like listed here:
Registration services
I am still not sure, if they are the way to go.
One of my main concerns is to prevent the reuse of a given serial, e.g. if two users run the program with the same serial at the same time, this serial should be disabled or some other measure be taken.
Another point is, that my software could potentially be just copied from one computer to the other without using an installer, so to just protect the installer itself will not be sufficient.
Maybe someone who already went through this process can give me some pointers, like the general steps involved (like 1. Get domain, 2. Get certain kind of webhost ....) and address some of the issues I mentioned above.
I'm thankful for any help people can give me.
I don't have a useful answer for you, but I did have a couple observations I wanted to share that were too large to fit in a comment. Hopefully someone else with more technical expertise can fill in the details.
One of my main concerns is to prevent the reuse of a given serial, e.g. if two users run the program with the same serial at the same time, this serial should be disabled or some other measure be taken.
To ensure that two people aren't using the same serial number, your program will have to "phone home." A lot of software does this at installation time, by transmitting the serial number back to you during the installation process. If you want to do it in real time, your application will have to periodically connect to your server and say "this serial number is in use."
This is not terribly user friendly. Any time that the serial number check is performed, the user must be connected to the Internet, and must have their firewall configured to allow it. It also means that you must commit to maintaining the server side of things (domain name, server architecture) unchanged forever. If your server goes down, or you lose the domain, your software will become inoperative.
Of course, if a connection to your service specifically (rather than the Internet in general) is essential to the product's operation, then it becomes a lot easier and more user friendly.
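The server side of such a real-time "this serial number is in use" check could be sketched like this, in Python. The class name, the install id and the five-minute timeout are assumptions; what happens on a conflict (warn, limit features, disable the serial) is a policy decision left out here.

```python
class SerialTracker:
    """Tracks which installation most recently reported each serial number."""

    def __init__(self, timeout_seconds=300):
        self.timeout = timeout_seconds
        self._active = {}  # serial -> (install_id, last_seen)

    def heartbeat(self, serial, install_id, now):
        """Record a periodic check-in.

        Returns False if a different installation reported the same serial
        within the timeout, i.e. the serial appears to be in use twice.
        """
        holder = self._active.get(serial)
        if holder is not None:
            other_id, last_seen = holder
            if other_id != install_id and now - last_seen < self.timeout:
                return False
        self._active[serial] = (install_id, now)
        return True
```

The timeout lets a user who closes the program on one machine start it on another after a short while without being flagged.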
Another point is, that my software could potentially be just copied from one computer to the other without using an installer, so to just protect the installer itself will not be sufficient.
There are two vectors of attack here. One is hiding a piece of information somewhere on the user's system. This is not terribly robust. The other is to check and encode the user's hardware configuration and encode that data somewhere. If the user changes their hardware, force the product to reactivate itself (this is what Windows and SecuROM do).
As you implement this, please remember that it is literally impossible to prevent illegal copying of software. As a (presumably) small software developer, you need to balance the difficulty to crack your software against the negative effects your DRM imposes on your users. I personally would be extremely hesitant to use software with the checks that you've described in place. Some people are more forgiving than I am. Some people are less so.
The energy and effort to prevent hacks from breaking your code is very time consuming. You'd be better served by focusing on distribution and sales.
My first entry into shareware was 1990. Back then the phrase was S=R which stood for Shareware equals Registered. A lot has changed since then. The web is full of static and you have to figure out how to get heard above the static.
Here are some things I've learned:
Don't fall in love with your software. Someone will always think it should work differently. Don't try and convert them to your way of thinking instead listen and build a list of enhancements for the next release.
Learn how to sell or pay someone to help you sell your stuff
Digital River owns most of the registration companies out there
Create free loss leaders that direct traffic back to you
Find a niche that has gone unmet and fill it
Prevent copying: base the key on the customer's NIC MAC. Most users will not go to the trouble of modifying their NIC MAC. Your app will have a dialog to create and send the key request, including their MAC.
The open issue is that many apps get cracked and posted to warez sites. Make this less likely by hiding the key validation code in multiple places in your app. Take care to treat honest users with respect, and be sure your key validation does not annoy them in any way.
Make it clear that the key they are buying is node locked.
And worry about market penetration. Get a larger installed base by providing a base product that has no strings attached.
cheers -- Rick
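A node-locked key derived from the NIC MAC, as suggested above, might be sketched as follows. This is Python rather than C#, and the key format and function names are made up for illustration. Two caveats I am assuming rather than solving: uuid.getnode() can fall back to a random number when no MAC address can be read, and an HMAC needs the vendor secret wherever the key is validated, so validating inside the shipped app means the secret can be extracted. Checking keys on your own server, or using a public-key signature, avoids that.

```python
import hashlib
import hmac
import uuid


def machine_id():
    """The machine's MAC address as 12 hex digits (what the key request sends)."""
    return format(uuid.getnode(), "012x")


def make_key(mac_hex, vendor_secret):
    """Vendor side: derive a key of the form XXXXX-XXXXX-XXXXX-XXXXX for one MAC."""
    digest = hmac.new(
        vendor_secret, mac_hex.lower().encode("ascii"), hashlib.sha256
    ).hexdigest()
    short = digest[:20].upper()
    return "-".join(short[i:i + 5] for i in range(0, 20, 5))


def key_is_valid(key, mac_hex, vendor_secret):
    """True only if the key was issued for this MAC address."""
    expected = make_key(mac_hex, vendor_secret)
    return hmac.compare_digest(key.strip().upper().encode("utf-8"), expected.encode("ascii"))
```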

How do I create a web application where I do not have access to the data?

Premise: The requirements for an upcoming project include the fact that no one except for authorized users have access to certain data. This is usually fine, but this circumstance is not usual. The requirements state that there be no way for even the programmer or any other IT employee be able to access this information. (They want me to store it without being able to see it, ever.)
In all of the scenarios I've come up with, I can always find a way to access the data. Let me describe some of them.
Scenario I: Restrict the table on the live database so that only the SQL Admin can access it directly.
Hack 1: I rollout a change that sends the data to a different table for later viewing. Also, the SQL Admin can see the data, which breaks the requirement.
Scenario II: Encrypt the data so that it requires a password to decrypt. This password would be known by the users only. It would be required each time a new record is created as well as each time the data from an old record was retrieved. The encryption/decryption would happen in JavaScript so that the password would never be sent to the server, where it could be logged or sniffed.
Hack II: Roll out a change that logs keypresses in JavaScript and posts them back to the server so that I can retrieve the password. Or, roll out a change that simply stores the unencrypted data in a hidden field that can be posted to the server for later viewing.
Scenario III: Do the same as Scenario II, except that the encryption/decryption happens on a website that we do not control. This magic website would allow a user to input a password and the encrypted or plain-text data, then use JavaScript to decrypt or encrypt that data. Then, the user could just copy the encrypted text and put it in the field for new records. They would also have to use this site to see the plain text for old records.
Hack III: Besides installing a full-fledged key logger on their system, I don't know how to break this one.
So, Scenario III looks promising, but it's cumbersome for the users. Are there any other possibilities that I may be overlooking?
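For reference, the password-based part of Scenarios II and III usually starts with a key derivation step like the sketch below. It is in Python for brevity (the scenarios describe doing this in the browser), the function names and the iteration count are assumptions, and the actual encryption with the derived key is deliberately left out: it should come from a vetted library, not be hand-rolled. Note that this does nothing against Hack II, since whoever controls the code can still capture the password.

```python
import hashlib
import os


def new_salt():
    """A random per-record (or per-user) salt, stored next to the ciphertext."""
    return os.urandom(16)


def derive_key(password, salt, iterations=600_000):
    """Derive a 32-byte key from the user's password with PBKDF2-HMAC-SHA256.

    Only the salt and the encrypted data are stored; the password and the
    derived key never need to reach the server.
    """
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=32
    )
```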
If you can have javascript on the page, then I don't think there's anything you can do. If you can see it in a browser, then that means it's in the DOM, which means you can write a script to get it and send it to you after it has been decrypted.
Aren't these problems usually solved via controls:
All programmers need a certain level of clearance and background checks
They are trained to understand that rolling out code to access the data is a fireable or worse offense
Every change in certain areas needs some kind of signoff
For example -- no JavaScript on page without signoff.
If you are allowed to add any code you want, then there's always a way, IMO.
Ask the client to provide a Non-Disclosure Agreement for you to sign, sign it, then look at as much data as you want.
What I'm wondering is, what exactly will you be able to do with encrypted data anyway? Pretty-much all apps require you to do some filtering of the data, whether it be move it to a required place, modify it, sanitize it, or display it. Otherwise, you're just a glorified pipe, and you don't have to do any work.
The only way I can think of where you wouldn't be looking at the data or doing anything with it would be a simple form to table mapping with CRUD options. If you know what format the data will be coming in as you should be able to roll something out with RoR, a simple skin, put SSL into the mix, and roll it out. Test with dummy data in the same format, and you're set.
In fact, is your client unable to supply dummy data for testing? If they can, then your life is simple as all you do is provide an "installable" and tell them how to edit a config file.
I think you could still create the app in the following way:
1. Create a dev database and set up a user for it.
2. Ask them for: the data type, size, and name of each field that needs to be on the screen.
3. Set up the screens, create columns in the database that accept the data type and size they specify.
4. Deploy the app to production, hooked up to an empty database. Get someone with permission (not you) to go in and set the password on the database user and set the password for the DB user in the web app.
5. Authorized users can then do whatever they want and you never saw what any of the data looked like.
Of course, maintaining the app and debugging is gonna be a bitch!
--In answer to comments:
6. Ok, so after setting up the password for the Username in the database and in the web app's config, write a program that connects to the database, sets a randomized password, then writes that same randomized password to the web config.
7. Prevent any outgoing packets from the machine except to a set of authorized workstations - so you can't install your spyware.
8. Then set the Admin password on both servers to the same random password, then delete all other users on the servers, delete the program, and delete the program source code.
9. Wipe the hard drives of the developer machines with the DOD algorithm, and then toss them into an industrial shredder.
10. If the server ever needs debugging, toss it in the trash, buy a new one, and start back at #1.
But seriously - this is an insolvable problem. The best answer to this really is:
Tell them they can't have an application. Write your stuff on paper. Put it in a folder. Lock it in a vault. Thrust, repeat.
Wouldn't scenario 3 just expose all the data to the magic website? This doesn't sound like a solvable problem (at least I can't think of a solution).
Go with whatever solution is easiest for you to implement. I think the requirements show that the client does not understand software development, so it should be easy to sell any approach you take.
I have to say I really don't like the idea of using JavaScript on the client to decrypt the data. That is a huge hole as any script (hacker, GreaseMonkey, IE7Pro, etc.) can access the DOM and get data out of the page.
Also, it is very hard to get around the problem of keystroke loggers. If you throw those into the mix, then your options are limited. At that point you need a hardware security fob such as an RSA token (commonly used with corporate VPNs) that generates one-time PINs. That will probably be expensive, and it is a pain; I have only seen it used with VPNs, but I assume it could work with websites as well.
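To make the rotating-PIN idea concrete, here is a minimal sketch of a time-based one-time code. Note the swap: RSA's hardware tokens use their own algorithm, so this uses the open TOTP scheme (RFC 6238, HMAC-SHA1) as a stand-in. The function name `totp` and its defaults are my own choices, not anything from a particular product.

```python
import hashlib
import hmac
import struct
import time


def totp(secret: bytes, at_time=None, step: int = 30, digits: int = 6) -> str:
    """Return the RFC 6238 time-based one-time code for a shared secret."""
    if at_time is None:
        at_time = time.time()
    # The moving factor is the number of `step`-second intervals since the epoch.
    counter = int(at_time) // step
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the low nibble of the last byte picks an offset.
    offset = mac[-1] & 0x0F
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


if __name__ == "__main__":
    # Shared secret taken from the RFC 6238 test vectors, for demonstration only.
    print(totp(b"12345678901234567890"))
```

Because the code changes every 30 seconds and depends on a secret held by the token, a captured keystroke is only useful to an attacker for a very short window.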
As far as the website goes, I'd stick with HTTPS and find a way to encrypt/decrypt on the web server rather than relying on JavaScript. SSL traffic isn't very prone to sniffing (it is very difficult to decrypt), so the encryption and decryption can happen server-side, which (IMHO) is more secure.
Look at banking scenarios and other financial institutions for a starting point, and then go from there. Try not to over-complicate if possible.
You can't guarantee against hacking into the data as long as you have access to the server it lives on. So tell the employer they have to host the data somewhere else and grant access to the client's browser via a secure HTTPS connection.
You can design your web page to dynamically load an XML data stream securely, and format it into a web page using an XSLT script on the client.
See http://www.w3schools.com/xsl/xsl_client.asp for examples
That way you produce the code, but you never have access to the data. Only the user has access to their own data.
As for how the employer is going to host the data without granting any IT people access to it, that's their problem. It's a foolish requirement.
I think that I'll just tell them that they either have to trust a couple of us to have access (and not look at it) or they don't get a project.
Thanks for the answers. Feel free to post more thoughts if you have them.
You can never have 100% security, and extra security comes at a cost of speed/price/convenience etc.
Let's suppose you take scenario 3 - one of your programmers can use social engineering to get the password from one of the users. Goodbye security.
There's no point having a high-security iron door as a gate if people can just walk around it. Just implement a decent level of security.
(They want me to store it without being able to see it, ever.)
Hey, the recording industry wants people to be able to listen to their music, but not copy it. Sounds like they should get together sometime!
Their idea won't work for the same reason DRM doesn't work: the trust chain is inherently compromised. Encryption examples often use Alice, Bob, and Charlie where Alice is trying to communicate with Bob without Charlie listening in. With DRM, the trust chain is compromised because Bob and Charlie are the same person. With your situation, Charlie is the guy writing the software that Alice and Bob use to communicate. There's an implied trust, because if you don't trust Charlie then you can't trust Charlie's software, either.
That's the root of the issue: trust. If they can't trust the programmer, the game is over before it starts.
There are lots of options based on what their goal really is, but I am confused by their paranoia, er, intent:
Is this their (and end-user) data that they wish to keep private or end-user data to be kept private from everyone?
Is it just that your (or any contracted) company is suspect?
Are they afraid of over-the-wire snooping?
Are they afraid of DOM access through JavaScript or browser plugins?
Are they planning staged deployment? In that case you work on test/dev server w/o real data but have no access to the production server with the real data, and DNS logging and/or firewall rules inhibit all of your hacks from working undetected.
Ultimately if the data is stored in a DB then the programmer and DB admin can, by working together, get it. Period. A good audit should uncover that, though.
If this is truly a requirement, the only way to guard against this is to hire an outside firm to audit the code prior to releasing the software, and that's going to be very expensive.

Resources