Best way to store emails for historical/review purposes - sql-server

I have a service that processes emails in a mailbox and, once processed, stores some information from each email in the database. At the minute the schema looks something like:
ID
Sender
Subject
Body (result of being parsed/stripped to plain text)
DateReceived
I am building a web front-end for the database and the main purpose of storing the emails is to provide the facility for users to look back and see what they have sent. However, another reason is for auditing purposes on my end.
The emails at the moment are being moved to specific mailbox folders. So what I plan to start doing is once the email is processed, record it in the database and delete the email from the mailbox instead of just moving it.
So a couple of questions...
1) Is it a good idea to delete the actual email from Exchange? Is it better to hold onto it just in case?
2) To keep the size of the fields down I was stripping the HTML out of the emails. Is this a bad idea? Should I just store the email as it is received?
Any other advice/suggestions would be great.

In both cases I think you should hold onto the original emails. Storage is cheap, but if disk space is really an issue look to compression rather than excision to solve it.
Both of your use cases (historical record and audit) will be better served by storing the complete, unabridged email in the database. Once you start tampering with the data, even if you are "just" removing formatting, it becomes difficult to prove that you haven't edited it in other, more significant ways - especially if you have deleted the original email instead of archiving it.
You don't say what business you're in, but the other thing to check is whether there are any data retention policies in force within your organisation or in the wider jurisdiction. Compliance is becoming gnarlier all the time.
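If disk space is the worry, compression can happen either inside SQL Server (COMPRESS/DECOMPRESS from SQL Server 2016 onwards) or in the processing service before the insert. Here is a minimal C# sketch of the latter, assuming a hypothetical EmailArchive table that keeps both the parsed plain text and the gzipped original MIME:

    // Sketch only: assumes a hypothetical EmailArchive table with columns
    // (Sender, Subject, DateReceived, BodyText, RawMimeGz VARBINARY(MAX)).
    using System;
    using System.Data.SqlClient;
    using System.IO;
    using System.IO.Compression;
    using System.Text;

    static class EmailArchiver
    {
        // Gzip the raw MIME so the untouched original is kept cheaply.
        static byte[] Compress(string rawMime)
        {
            using (var output = new MemoryStream())
            {
                using (var gz = new GZipStream(output, CompressionLevel.Optimal))
                {
                    var bytes = Encoding.UTF8.GetBytes(rawMime);
                    gz.Write(bytes, 0, bytes.Length);
                }
                return output.ToArray();
            }
        }

        public static void Archive(string connectionString, string sender, string subject,
                                   DateTime received, string bodyText, string rawMime)
        {
            const string sql =
                @"INSERT INTO EmailArchive (Sender, Subject, DateReceived, BodyText, RawMimeGz)
                  VALUES (@Sender, @Subject, @DateReceived, @BodyText, @RawMimeGz)";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            {
                cmd.Parameters.AddWithValue("@Sender", sender);
                cmd.Parameters.AddWithValue("@Subject", subject);
                cmd.Parameters.AddWithValue("@DateReceived", received);
                cmd.Parameters.AddWithValue("@BodyText", bodyText);           // parsed text for searching
                cmd.Parameters.AddWithValue("@RawMimeGz", Compress(rawMime)); // untouched original for audit
                conn.Open();
                cmd.ExecuteNonQuery();
            }
        }
    }

The plain-text column stays queryable for the web front-end, while the compressed original is there, untouched, if you ever need to prove exactly what was received.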

I would keep the messages in the mailbox in a specific folder, as you are doing, and probably wouldn't even save anything in a database, given that you can access the mailbox from within your application.
Over the years the Exchange team has developed several APIs for accessing a mailbox's contents.
With Exchange Server 2007 and 2010, the recommended API is Exchange Web Services (EWS), which can be used from any language/environment that is capable of calling web services.
If you are developing in a .NET language (C# or VB.NET, for instance), your best bet would be the EWS Managed API.
If you are really going to do something meaningful with the body, you can save the results as named properties (extended properties in EWS parlance) on the message itself.
There are other APIs with corresponding functionality for previous versions of Exchange.
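For example, here is a rough sketch using the EWS Managed API to stamp a parsed result back onto the message as an extended property; the property name ("ParsedBody") and where the parsed text comes from are placeholders, not anything the API prescribes:

    // Rough sketch using the EWS Managed API (Microsoft.Exchange.WebServices.dll).
    // "ParsedBody" is an arbitrary name; pick whatever suits your processing.
    using Microsoft.Exchange.WebServices.Data;

    class MailboxTagger
    {
        static readonly ExtendedPropertyDefinition ParsedBodyProp =
            new ExtendedPropertyDefinition(DefaultExtendedPropertySet.PublicStrings,
                                           "ParsedBody", MapiPropertyType.String);

        public static void SaveParsedBody(ExchangeService service, ItemId id, string parsedText)
        {
            // Bind to the existing message and stamp the parsed result onto it,
            // so the message itself carries the output of your processing.
            EmailMessage message = EmailMessage.Bind(service, id);
            message.SetExtendedProperty(ParsedBodyProp, parsedText);
            message.Update(ConflictResolutionMode.AutoResolve);
        }
    }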

Related

Best practice to send reports via external email (SSRS, CRM or other means)

We want to start sending out emails to our customers that have tracking information attached. I've done this internally with a data-driven subscription, but I'm sure that isn't an industry best practice for sending externally.
I know this could cause some issues with spam filtering, etc...
How have you handled this before? We wouldn't be sending more than 30-50 per day initially, but as the business grows so will those numbers. I'm sure there's a service where we send the data and they send on our behalf, but I'm pretty unfamiliar with how this is done industry-wide.
Do people use their CRM's marketing platforms to take care of something like this?
SSRS reports are usually for internal staff; a CRM or a dedicated tool (like MailChimp, SendInBlue or MailerLite) is used for sending mailshots to customers externally. I have known some organisations to use SSRS for sending externally, but that's not what it was intended for - and it is probably a security risk to let the internal production SQL Server access the internet directly.

How to structure/coordinate multiple databases?

Imagine a large corp with dozens of companies, each with its own website, and each website with its own unique functional requirements:
Most data on each website will be specific to that website
Each website can edit its own data
Some data will be shared across all websites
There will be a central CMS that is allowed to edit this data, but other websites can read and use that data
e.g. say you're planning the infrastructure for a company that owns multiple sub-companies that make different kinds of products, some in the same category (cereal, food), others in completely different categories (books, instruments). Some are marketing websites, some are for CRM, some are online stores
there is a list of regulatory requirements that affect all products
each company should manage the status of compliance of its own products to each requirement
when a new requirement surfaces, details regarding that requirement should only be entered once
How would the multiple databases be coordinated?
edit: added more info per Bob's suggestions
Thanks for the incredibly insightful questions!
compliance data is not shared; it is siloed within each site
shared data is only on the one enterprise-wide database, they will mostly be "types of [thing]"
no conclusive list of instances where they'll be used but currently it'd be to populate CMS dropdowns for individual sites.
changes to shared data would occur a few times a year.
Ideally changes would be reflected within a few minutes, but an hour or so should be acceptable
very low volume in shared data.
All DBs will be new, decision on which DB is pending current investigation.
Sub-systems will expose REST api
Here are some ways I have seen this handled; you need to think about the implications of each structure based on the details of your particular business domain. All can work, but all have to be carefully set up if they are going to work.
One database for shared information and one per client for client-specific information. Set up the overall application so that the first thing the user selects at login is the client, and the application connects to the correct client database. Users might also need a way to switch clients if they handle more than one.
Separate servers for each client if they need to be completely siloed. Database changes are made by script (kept in source control) and applied to each server as needed. Changes to the central database might then have a job that runs to push any data changes out to the other servers.
All the data in one database, but making sure each table has a client_id so that the data is always filtered correctly by client. You can set up separate views by client, so that the users can only see the clients they are supposed to see. This only works if the data for each client is substantially in the same form.
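To illustrate that third option, here is a sketch of a data-access class where the tenant is resolved once at login and every query is parameterized with it; the Orders table and client_id column are assumptions for the example:

    // Sketch of the "one database, client_id on every table" option.
    // Table and column names (Orders, client_id, order_ref) are assumptions.
    using System.Collections.Generic;
    using System.Data.SqlClient;

    class OrderRepository
    {
        private readonly string _connectionString;
        private readonly int _clientId;  // resolved once at login, never taken from later user input

        public OrderRepository(string connectionString, int clientId)
        {
            _connectionString = connectionString;
            _clientId = clientId;
        }

        public List<string> GetOrderReferences()
        {
            var results = new List<string>();
            const string sql = "SELECT order_ref FROM Orders WHERE client_id = @client_id";

            using (var conn = new SqlConnection(_connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            {
                cmd.Parameters.AddWithValue("@client_id", _clientId);
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        results.Add(reader.GetString(0));
                }
            }
            return results;
        }
    }

The per-client views mentioned above enforce the same rule inside the database itself, so even ad-hoc reporting stays filtered.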
And since you are in a regulatory environment, I strongly urge that you create an audit database that is updated by database triggers (never audit from the application, you will lose changes to the data) for each database.
I agree with Chris that, even after both sets of questions, there is still a big set of possible solutions. For instance, if the databases were the same technology, and the shared data were stored in the same way in each one, you could do db-level replication from the central db to the others. Is it OK to have two separate dbs per application (one with the shared data and one with the non-shared data)? This would influence the kind of replication.
Or you could have a purely code-based solution, where clicking Publish in the GUI updates the central db and calls a set of APIs that also update the other dbs. Or microservices: updating the central db also puts a message on a shared queue, which is picked up by services that each look after a different db and apply the update in whatever form makes sense for that db (sketched below).
It depends on (among the things already mentioned) what your organisation's technology strategy is, what technology and skills you already have in-house, and so on.
So this is as much an architecture question as it is a db question.
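To make the queue-based variant concrete, here is a rough C# sketch; IMessageBus and SharedTypeChanged are hypothetical stand-ins for whatever broker and message contract you end up choosing:

    // Hypothetical publish side of the "central db + shared queue" idea.
    // IMessageBus and SharedTypeChanged are stand-ins, not a real library.
    using System;
    using System.Threading.Tasks;

    public class SharedTypeChanged
    {
        public Guid TypeId { get; set; }
        public string Name { get; set; }
        public DateTime ChangedUtc { get; set; }
    }

    public interface IMessageBus
    {
        Task PublishAsync<T>(T message);
    }

    public class SharedDataPublisher
    {
        private readonly IMessageBus _bus;

        public SharedDataPublisher(IMessageBus bus)
        {
            _bus = bus;
        }

        // Called after the central CMS commits a change to shared reference data.
        public async Task OnSharedTypeSaved(Guid typeId, string name)
        {
            await _bus.PublishAsync(new SharedTypeChanged
            {
                TypeId = typeId,
                Name = name,
                ChangedUtc = DateTime.UtcNow
            });
            // Each site runs its own subscriber that applies the change to its
            // local db in whatever shape makes sense for that site.
        }
    }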
I don't think this question is sufficiently clear to get a single answer. However, there are a few possibilities.
In many cases, where you have shared data you want a single point of ownership of that information. It could be in a database, in an Excel file (which can then be turned into CSV and periodically loaded into all the dbs, as sketched below), or in some other form. The specifics depend on what is shared exactly.
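A rough sketch of that periodic CSV load, with the file layout, the Requirements table and the MERGE shape all assumed for illustration:

    // Illustrative only: pushes each CSV row (code,description - no header row)
    // into a Requirements table in every site database.
    using System.Data.SqlClient;
    using System.IO;

    class SharedDataLoader
    {
        public static void LoadRequirements(string csvPath, string[] connectionStrings)
        {
            foreach (var line in File.ReadLines(csvPath))
            {
                if (string.IsNullOrWhiteSpace(line)) continue;
                var parts = line.Split(',');            // e.g. "REQ-042,New labelling rule"
                var code = parts[0].Trim();
                var description = parts[1].Trim();

                foreach (var cs in connectionStrings)   // same row goes to every site db
                {
                    using (var conn = new SqlConnection(cs))
                    using (var cmd = new SqlCommand(
                        @"MERGE Requirements AS t
                          USING (SELECT @code AS code) AS s ON t.code = s.code
                          WHEN MATCHED THEN UPDATE SET t.description = @description
                          WHEN NOT MATCHED THEN INSERT (code, description) VALUES (@code, @description);",
                        conn))
                    {
                        cmd.Parameters.AddWithValue("@code", code);
                        cmd.Parameters.AddWithValue("@description", description);
                        conn.Open();
                        cmd.ExecuteNonQuery();
                    }
                }
            }
        }
    }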
Now in this case it sounds like you are going to have some sort of legal department in charge of some shared information, and they will manage that data, which will then be shared with the other sites. This might be done with an application they manage that aggregates information from the other companies, or it could be data that is pushed to their systems.
A final point:
Software is at its best when it facilitates human solutions to human problems, not when it tries to solve those problems directly. In these cases, you probably want a good human solution in place and then to look at what software can do to support that. A lot of the issues (who owns the information?) will already have been solved and you will be simply automating what is already done.

Do I have to trust the client when using DreamFactory for my backend?

I'm just looking into DreamFactory, and it seems to me that all permission logic is to be done by the client.
I want to develop an app with documents, and not all users are allowed to edit all documents. If the logged-in user has permissions to the database, how can I prevent that user from making false API calls that delete or modify someone else's document?
While your observation is correct - the app must control access - we do offer a couple of features that will solve your problem.
First off, the roles system is very flexible. You could create different roles (reader, writer, admin, etc.) and assign them to the corresponding users. This can be done today.
Secondly, we will be releasing an update in the next few weeks that has several new features, one of which will also solve your dilemma. I'm not 100% sure what it is going to be named, but it will allow you to have the system automatically inject runtime data (i.e. the user ID) into REST service calls to the DSP. Very flexible and powerful.
More information and docs will be available with the release, so hold on a tad and we'll get you there!
The upcoming release at the end of the month lets you apply a server-side filter, or use server-side scripting to lock a user to modifying only their own records, or whichever records you see fit.

Subscription website architecture questions + SQL Server & .NET

I have a few questions about the architecture of a subscription service I am about to embark on and I am looking for some feedback on how best to set it up.
I won’t have as large a number of customers as Basecamp, maybe a few hundred, and I was wondering what would be a solid architecture for setting up the customer sites. I’m running SQL Server and .NET on a dedicated machine. Should I create a new database for each customer, so as to have control and isolation of data, or keep them all in one database?
I am also thinking of creating a sub-domain for each customer as well so modifications can be made to each site as needed. The customer URLs would look like this:
https://customer1.foobar.com
https://customer2.foobar.com
I am going to have the ability to ‘plug in’ reports that will be uploaded to the site so each customer can customize as needed. Off the top of my head this necessitates having each sub-domain on its own code base for the uploading of these reports.
So on the main site the customer would sign up for their new subscription, and I would programmatically create a new directory for the customer from the main code base, then create a sub-domain pointing to the new directory, and finally create their database.
Does this sound about right? Am I on the right track? How do other such sites accomplish the same thing?
Thanks for letting me bend your ear for a bit on this.
From a maintenance perspective, having a virtual directory for each customer scares me. Having done something similar, I would create separate domain pointers as you are intimating. Then you can check the host headers to see what should be displayed. I would probably create one main site template and dynamically brand it for each customer. You can still create separate folders for customer-specific reports, or for cases where you really need custom pages unique to that customer. I just wouldn't make each customer their own site.
The advantage of separate sites (including databases) is that the fate of one client isn't bound to all the others. It'd be easier to trial an upgrade on a sub-set before deploying to everyone else. The big issue here, as Scot points out, is time. You'd want to have things as automated as possible (and well tested), etc. It's also easy when a client leaves: you can just back up their database and send it to them (for example).
Auto-provisioning new sites and databases isn't easy, and the account that does that will need plenty of privileges - so your security testing will need to be better than usual.
A multi-tenancy approach is good for minimizing your time, but you do have to be careful - you don't want customers' data getting mixed up.
One approach that will work, within the one app (and database), is to make use of HttpHandlers (or the MVC framework, perhaps) so that some sort of client identifier is part of the URL - but the folder doesn't have to exist physically (or virtually, in the IIS sense). That way you don't have to worry about getting folder permissions correct; but you do have to be careful about correctly identifying clients and their IDs, and about making sure clients can't make calls that use an ID that isn't theirs.
https://www.foobar.com/[clientid]/subscriptions
The advantage of this is that it's relatively straightforward: everything is in the application, and you don't have to worry about adding new DNS records, setting directory and/or database permissions, etc.
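A rough ASP.NET MVC sketch of that idea; the route pattern, the "client_id" claim type and the attribute name are illustrative, not prescriptive:

    // The client identifier lives in the URL (e.g. /acme/subscriptions) but maps
    // to no physical or virtual folder. Claim type and names are assumptions.
    using System.Linq;
    using System.Security.Claims;
    using System.Web.Mvc;
    using System.Web.Routing;

    public static class RouteConfig
    {
        public static void RegisterRoutes(RouteCollection routes)
        {
            routes.MapRoute(
                name: "Tenant",
                url: "{clientId}/{controller}/{action}/{id}",
                defaults: new { controller = "Subscriptions", action = "Index", id = UrlParameter.Optional });
        }
    }

    // Rejects requests where the clientId in the URL isn't one the user belongs to.
    public class RequireOwnClientAttribute : ActionFilterAttribute
    {
        public override void OnActionExecuting(ActionExecutingContext filterContext)
        {
            var urlClientId = (string)filterContext.RouteData.Values["clientId"];
            var user = filterContext.HttpContext.User as ClaimsPrincipal;

            bool allowed = user != null && user.Claims.Any(c =>
                c.Type == "client_id" && c.Value == urlClientId);

            if (!allowed)
                filterContext.Result = new HttpStatusCodeResult(403);
        }
    }

Reports uploaded per customer can then live under a path keyed by the same identifier, without each customer needing their own code base.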

How do I create a web application where I do not have access to the data?

Premise: The requirements for an upcoming project include the fact that no one except authorized users may have access to certain data. This is usually fine, but this circumstance is not usual. The requirements state that there be no way for even the programmer or any other IT employee to be able to access this information. (They want me to store it without being able to see it, ever.)
In all of the scenarios I've come up with, I can always find a way to access the data. Let me describe some of them.
Scenario I: Restrict the table on the live database so that only the SQL Admin can access it directly.
Hack I: I roll out a change that sends the data to a different table for later viewing. Also, the SQL Admin can see the data, which breaks the requirement.
Scenario II: Encrypt the data so that it requires a password to decrypt. This password would be known by the users only. It would be required each time a new record is created as well as each time the data from an old record was retrieved. The encryption/decryption would happen in JavaScript so that the password would never be sent to the server, where it could be logged or sniffed.
Hack II: Roll out a change that logs keypresses in JavaScript and posts them back to the server so that I can retrieve the password. Or, roll out a change that simply stores the unencrypted data in a hidden field that can be posted to the server for later viewing.
Scenario III: Do the same as Scenario II, except that the encryption/decryption happens on a website that we do not control. This magic website would allow a user to input a password and the encrypted or plain-text data, then use JavaScript to decrypt or encrypt that data. The user could then copy the encrypted text and put it in the field for new records. They would also have to use this site to see the plain text for old records.
Hack III: Besides installing a full-fledged key logger on their system, I don't know how to break this one.
So, Scenario III looks promising, but it's cumbersome for the users. Are there any other possibilities that I may be overlooking?
If you can have JavaScript on the page, then I don't think there's anything you can do. If you can see it in a browser, then that means it's in the DOM, which means you can write a script to grab it and send it back to you after it has been decrypted.
Aren't these problems usually solved via controls:
All programmers need a certain level of clearance and background checks
They are trained to understand that rolling out code to access the data is a fireable or worse offense
Every change in certain areas needs some kind of signoff
For example -- no JavaScript on page without signoff.
If you are allowed to add any code you want, then there's always a way, IMO.
Ask the client to provide a Non-Disclosure Agreement for you to sign, sign it, then look at as much data as you want.
What I'm wondering is, what exactly will you be able to do with encrypted data anyway? Pretty much all apps require you to do some filtering of the data, whether it be moving it to a required place, modifying it, sanitizing it, or displaying it. Otherwise, you're just a glorified pipe, and you don't have to do any work.
The only way I can think of where you wouldn't be looking at the data or doing anything with it would be a simple form-to-table mapping with CRUD options. If you know what format the data will be coming in, you should be able to build something with RoR and a simple skin, put SSL into the mix, and roll it out. Test with dummy data in the same format, and you're set.
In fact, can your client supply dummy data for testing? If they can, then your life is simple: all you do is provide an "installable" and tell them how to edit a config file.
I think you could still create the app in the following way:
Create a dev database and set up a user for it.
Ask them for: the data type, size, and name of each field that needs to be on the screen.
Set up the screens, create columns in the database that accept the data type and size they specify.
Deploy the app to production, hooked up to an empty database. Get someone with permission (not you) to go in and set the password on the database user and set the password for the DB user in the web app.
Authorized users can then do whatever they want and you never saw what any of the data looked like.
Of course, maintaining the app and debugging is gonna be a bitch!
--In answer to comments:
Ok, so after setting up the password for the Username in the database and in the web app's config, write a program that connects to the database, sets a randomized password, then writes that same randomized password to the web config.
Prevent any outgoing packets from the machine except to a set of authorized workstations - so you can't install your spyware.
Then set the Admin password on both servers to the same random password, then delete all other users on the servers, delete the program, and delete the program source code.
Wipe the hard drives of the developer machines with the DOD algorithm, and then toss them into an industrial shredder.
If the server ever needs debugging, toss it in the trash, buy a new one, and start back at #1.
But seriously - this is an insolvable problem. The best answer to this really is:
Tell them they can't have an application. Write your stuff on paper. Put it in a folder. Lock it in a vault. Thrust, repeat.
Wouldn't Scenario III just expose all the data to the magic website? This doesn't sound like a solvable problem (at least I can't think of a solution).
Go with whatever solution is easiest for you to implement. I think the requirements show that the client does not understand software development, so it should be easy to sell whatever approach you take.
I have to say I really don't like the idea of using JavaScript on the client to decrypt the data. That is a huge hole as any script (hacker, GreaseMonkey, IE7Pro, etc.) can access the DOM and get data out of the page.
Also, it is very hard to get around the problem of keystroke loggers. If you throw those into the mix, then your options are limited. At that point you need a security fob such as an RSA token (commonly used with corporate VPNs) to generate one-time PINs. That will probably be expensive, and it is a pain, and I have only seen it used with VPNs, but I assume it could work with websites as well.
As far as the website goes, I'd stick with HTTPS and find a way to encrypt/decrypt on the web server rather than relying on JavaScript. SSL traffic isn't very prone to sniffing (it's very difficult to decrypt), so that allows the encryption and decryption to happen server-side, which (IMHO) is more secure.
Look at banking scenarios and other financial institutions for a starting point, and then go from there. Try not to over-complicate if possible.
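If you do take the server-side route, here is a minimal sketch of the encrypt/decrypt step with .NET's Aes class; where the key comes from (DPAPI, a key vault, an HSM) is the genuinely hard part and is deliberately left out:

    // Minimal server-side AES sketch. Key management is not shown; aesKey must
    // be a valid AES key (16/24/32 bytes) supplied from somewhere secure.
    using System;
    using System.IO;
    using System.Security.Cryptography;
    using System.Text;

    static class FieldCipher
    {
        public static byte[] Encrypt(string plainText, byte[] aesKey)
        {
            using (var aes = Aes.Create())
            {
                aes.Key = aesKey;
                aes.GenerateIV();
                using (var ms = new MemoryStream())
                {
                    ms.Write(aes.IV, 0, aes.IV.Length);   // prepend IV so Decrypt can recover it
                    using (var enc = aes.CreateEncryptor())
                    using (var cs = new CryptoStream(ms, enc, CryptoStreamMode.Write))
                    {
                        var bytes = Encoding.UTF8.GetBytes(plainText);
                        cs.Write(bytes, 0, bytes.Length);
                    }
                    return ms.ToArray();
                }
            }
        }

        public static string Decrypt(byte[] cipherText, byte[] aesKey)
        {
            using (var aes = Aes.Create())
            {
                aes.Key = aesKey;
                var iv = new byte[aes.BlockSize / 8];
                Array.Copy(cipherText, iv, iv.Length);
                aes.IV = iv;
                using (var dec = aes.CreateDecryptor())
                using (var ms = new MemoryStream(cipherText, iv.Length, cipherText.Length - iv.Length))
                using (var cs = new CryptoStream(ms, dec, CryptoStreamMode.Read))
                using (var reader = new StreamReader(cs, Encoding.UTF8))
                {
                    return reader.ReadToEnd();
                }
            }
        }
    }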
You can't guarantee against hacking into the data as long as you have access to the server it lives on. So tell the employer they have to host the data somewhere else and grant access to the client's browser via a secure HTTPS connection.
You can design your web page to dynamically load an XML data stream securely, and format it into a web page using an XSLT script on the client.
See http://www.w3schools.com/xsl/xsl_client.asp for examples
That way you produce the code, but you never have access to the data. Only the user has access to their own data.
As for how the employer is going to host the data without granting any IT people access to it, that's their problem. It's a foolish requirement.
I think that I'll just tell them that they either have to trust a couple of us to have access (and not look at it) or they don't get a project.
Thanks for the answers. Feel free to post more thoughts if you have them.
You can never have 100% security, and extra security comes at a cost of speed/price/convenience etc.
Let's suppose you take Scenario III: one of your programmers could use social engineering to get the password from one of the users. Goodbye, security.
There's no point having a high-security iron door as a gate if people can just walk around it. Just implement a decent level of security.
(They want me to store it without being able to see it, ever.)
Hey, the recording industry wants people to be able to listen to their music, but not copy it. Sounds like they should get together sometime!
Their idea won't work for the same reason DRM doesn't work: the trust chain is inherently compromised. Encryption examples often use Alice, Bob, and Charlie where Alice is trying to communicate with Bob without Charlie listening in. With DRM, the trust chain is compromised because Bob and Charlie are the same person. With your situation, Charlie is the guy writing the software that Alice and Bob use to communicate. There's an implied trust, because if you don't trust Charlie then you can't trust Charlie's software, either.
That's the root of the issue: trust. If they can't trust the programmer, the game is over before it starts.
There are lots of options based on what their goal really is, but I am confused by their paranoia, er, intent:
Is this their (and end-user) data that they wish to keep private or end-user data to be kept private from everyone?
Is it just that your (or any contracted) company is suspect?
Are they afraid of over-the-wire snooping?
Are they afraid of DOM access through JavaScript or browser plugins?
Are they planning a staged deployment? In that case you work on a test/dev server without real data but have no access to the production server with the real data, and DNS logging and/or firewall rules prevent all of your hacks from working undetected.
Ultimately if the data is stored in a DB then the programmer and DB admin can, by working together, get it. Period. A good audit should uncover that, though.
If this is truly a requirement, the only way to guard against this is to hire an outside firm to audit the code prior to releasing the software, and that's going to be very expensive.
