Is it possible to sanitize a Datomic database, given that it is a 'value' and does not 'update-in-place'? - datomic

I am exploring building on top of Datomic. I am sold on the principle of the database 'as a value'. The thing is, we need to be able to provide sanitized copies of the database to our developers to run locally. Any sensitive data which we are required to keep on the correct side of the firewall must not leak out.
With a standard SQL database this is easy: we just have a service inside the firewall which takes a snapshot of the DB and runs some script against it to update-in-place the sensitive value such that my.secret.email#address.com > email00123#address.com etc. Then the sanitised DB is made available to the developer to lift out of the compliance zone.
However, my understanding of Datomic (and its very strength) is that nothing is ever updated in place. So how would it be possible to sanitise a Datomic DB? Thanks.

This is one use case for filtering databases in Datomic. Filtering for security reasons is also discussed in this talk by Nubank.
The operational model is a bit different from the SQL world because user access and authorization and authentication, etc. are not baked into the database to the same degree. Any peer participates in the database fully and can submit transactions, etc., request an unfiltered database, etc. as API calls. You need an additional application layer (i.e. to create a client for your peer-as-a-server and only expose endpoints for queries against the filtered database) if you want stronger security guarantees.

Related

SQL Server - robust protection of client data (multi-tenancy)

We are considering using a single SQL Server database to store data for multiple clients. We feel having all the data in one database could make things more manageable than a "separate db per client" setup.
The biggest concern we have is accidental access to the wrong client. It would be very, very bad if we were to ever accidentally show one client's data to another client. We perform lots of queries, and are afraid of a scenario where someone says "write me a query of this and this to go show the client for the meeting in 15 minutes." If someone is careless and omits the WHERE clause that filters for the correct client then we would be in serious trouble. Is there a robust setup or design pattern for SQL Server such that it makes it impossible (or at least very difficult) to accidently pull the wrong client's data from a single "global" database?
To be clear, this is NOT a database that the clients use directly or via apps (yet). We are talking about a database accessed by several of our programmers and we are afraid of screwing up ourselves.
At the very minimum, you should put the client data in separate schemas. In SQL Server, schemas are the unit of authorization. Only people authorized for a given client should be able to see that client's data. In addition to other protections, you should be using the built-in authorization capabilities of the database.
Right now, it sounds like you are in a situation where a very small group of people are the ones accessing all the data for everyone. Well, if you are successful, then you will probably need more people in the future. In fact, you might be giving some clients direct access to the data. If it is their data, they will want apps running on it.
My best advice, if you are planning on growing, is to place each client's data in a separate database. I would architect the system so this database can be on a remote server. If it needs to synchronize with common data, then develop a replication strategy for moving that data around.
You may think it is bad to have one client see another client's data. From the business perspective, this is deadly -- like "company goes out of business, no job" deadly. Your clients are probably more concerned about such confidentiality than you are. And, an architecture that ensures protection will make them more comfortable.
Multi-Tenant Data Architecture
http://msdn.microsoft.com/en-us/library/aa479086.aspx
here's what we do (mysql unfortunately):
"tenant" column in each table
tables are in one schema [1]
views are in another schema (for easier security and naming). view must not include tenant column. view does a WHERE on the tenant based on current user
tenant value is set by trigger on insert, based on the user
Assuming that all your DDL is in .sql files under source control (which it should be), then having many databases or schemas is not so tough.
[1] a schema in mysql is called a 'database'
You could set up one inline table valued function for each table that takes a required parameter #customerID and filters that particular table to the data of this customer. If the entire app were to use only these TVP's the app would be safe by construction.
There might be some performance implications. The exact numbers depend on the schema and queries. They can be zero, however, as inline TVP's are inlined and optimized together with the rest of the query.
You can limit access to data only via storedprocedures with obligatory customerid parameter.
If you allow you IT build views sooner or later someone forget this where clause as you said.
But a schema per client with already prefiltered views will enable selfservice and extra Brings value i guess.

Database synchronization

Recently my clients have asked me if they can use they’re application remotely, disconnected from the local network and the company server.
One solution is to place the database in the cloud, but a connection to the database, and the cloud and an internet connection must be always available.
There not always the case.
So my question is - Is there any database sync system, or a synchronization library so that I can work disconnected with local database and when I connect synchronize the changes I have made and receive changes others have made?
Update:
The application is under Windows (7/xp) ( for now )
It's in Delphi 2007 win32
All client need to have Read/Write access
All Clients have internet connection, but not always ON
Security is not critical, but the Sync service should encrypt the communication
When in the presence of the companies network the system should sync and use the Server Database and not the local one.
You have a host of issues with thinking about such a solution. First, there are lots of possible solutions, such as:
Using database replication within a database, to mimic every update (like a "hot" backup)
Building an application to copy the database periodically (every night)
Using a third-party tool (which is what you are asking, I think)
With replication services, the connection does not have to always be up. Changes to the database are logged when the connection is not available and then applied when they can be sent.
However, there are lots of other issues when you leave a corporate network. What about security of the data and access rights? Do you have other options, such as making it easier to access the database from within the network? Do the users need only read-access to the database or read-write access? Would both versions need to be accessed at the same time. Would there be updates to both at the same time?
You may have other options that are more secure than just moving a database to the cloud.
I believe RemObjects DataAbstract allows offline mode and synchronization by using what they call Briefcases. All your other requirements (security, encrypted connections, etc.) are also covered.
This is not a drop-in replacement, thought, and may need extensive rewrite/refactoring of your application. There are lots of upsides, thought; business rules can/should be enforced on the server (real security), scriptable business rules, multiplatform architecture, etc.
There are some products available in the Java world (SymmetricDS lgpl license) - apart from actually being a working system it is documents how it achieved synchronization. Connects to any db with jdbc support. . There is a pro version but the user guide (downloadable pdf) gives you the db schema plus rules on push pull syncing. Useful if you want to build your own.
Btw there is a data replication so tag that would help.
One possibility that is free is the Microsoft Sync Framework: http://msdn.microsoft.com/en-us/sync/bb736753.aspx
It may be possible for you to use it, but you would need to provid some more detail about your application and operating environment to be sure.
IS it possible to share a database like .mdb and work fine? i try but sometimes the file where the databse is changes from DB to DB1 i use delphi Xe4 and Google Drive .
Thank´s

Database security / scaling question

Typically I use a database such as MySQL or PostGreSQL on the same machine as the application using it, which makes access easy and secure. I'm just now building the first site that will have a separate physical database server (later this year it will). I'm wondering 3 things:
(security) What things should I look into for starters pertaining to security of accessing a separate machine's database?
(scalability) Are their scalability issues that I should think about pertaining to this (technology agnostic)?
(more ServerFaultish but related) If starting the DB out on the same physical server (using a separate VMWare VM) and later moving to a different physical server, are there implicit problems that I'll have to deal with? Isn't another VM still accessed via localhost?
If these questions are completely ludicrous, I apologize to you DB experts.
Easy, I'll grant you. Secure.. well, security has very little to do with the physical location of the database server.
To get to your three questions though:
First, look at how you can limit access to database tables using the database servers security model. Namely, if your application does not need to drop tables, make sure the user it uses to connect does not have that ability. Second, look into how to encrypt the connection between the database server and your application. In windows this is pretty transparent through kerberos and can even be enforced by group policy settings, not sure about other platforms. Third, look into what features the database has for encrypting the data "at rest". Meaning, does it natively support encryption of the actual data files themselves?
The point here is that your application is only one possible entry point to the database server itself. Ask yourself, what would happen if someone can connect directly without going through your application using your apps credentials. Next ask, what can happen if they find a SQL Injection issue.. Also, ask yourself, what information can be gleaned if someone is able to monitor the IP traffic going between your app and the server. Can they discern any data? Finally, ask yourself, what if they get a copy of the database itself?
The lengths you go for #1 is going to be dependent on several factors such as How valuable is the data (eg: what would happen to you, your company, or your clients if it was lost); and, How much time do you have to come up with an ideal solution?
scalability: This is purely a function of load. Unfortunately, the only way to scale most database applications is to scale up. Meaning that you acquire a larger database server as the need arises. Stack Overflow went through this not too long ago. Some database types (nosql, mongodb, etc) support a concept known as shredding or sharding. MySql, PostGreSql, etc don't. Instead you'll have to specifically design the app to handle it. Which means not using things like auto incrementing keys, etc. This can be a royal PITA... which is why scaling up is a much easier prospect depending on your application.
Another VM is not accessible via "localhost". localhost defines access to your current server. Whether that server is a VM or not is immaterial. You'll have to reference your database server by name. Now, transitioning the database VM to another physical server should have zero impact as your are referencing it by name. Beyond that there aren't any other considerations.
In addition to Chris's valid response,
Security
Use a security mechanism on the network in addition to whatever security features the database or app framework provides. Perhaps this is a simple as firewalling the network, running IPSEC, or over an ssl tunnel. The point is that you shouldn't assume the DB authors are network security experts, or that the DB authentication mechanism has even addressed network security at all.
Scalability
One scalability issue comes to mind when moving from local to remote dbs. Remote TCP/IP communication is much slower than local pipe communication. Your app may have hidden scalability issues due to frequent round-trips to the DB. Between each query, your app waits for each DB response in succession. On a local system, the latency is so small you may not have noticed it.

How to protect a database from the Server Administrator in Sql Server

We have a requirement from a client to protect the database our application uses, even from their local administrators (Auditors just gave them that requirement).
In their requirement, protecting the data means that the Sql Server admin cannot read, nor modify sensitive data stored in tables.
We could do that with Encryption in Sql Server 2005, but that would interfere with our third party ORM, and it has other cons, like indexing, etc.
In Sql Server 2008 we could use TDE, but I understand that this solution doesn't protect against a user with Sql Server admin rights to query the database.
Is there any best practice or known solution to this problem?
This problem could be similar to the one of having an application hosted by a host provider, and you want to protect the data from the host admins.
We can use Sql Server 2005 or 2008.
This has been asked a lot in the last few weeks. The answers usually boil down to:
(
a) If you don't control the application you are doomed to trust the DBA
or
b) If you do control the application you can encrypt everything with a key only known to the application, and decrypt on the way out. It'll hurt performance a bit (or a lot) though, that's why TDE exists. A variant of this to prevent tampering is to use a cryptographic hash of the values in the column, checking them upon application access.
)
and
c) Do extensive auditing, so you can control what are your admins doing.
I might have salary information in my tables, and I don't want my trusted dba's to see.
Faced with the same problem we have narrowed are options to:
1- Encrypt outside SQLServer, before inserts and updates and decrypt after selects. ie: Using .net encryption.
Downside: You loose some indexing and searching capabilities, cannot use like and betweens.
2- Use third party tools (at io level) that block crud to the database unless a password is provided. ie: www.Blockkk.com
Downside: You will need to trust a third party tool installed in your server. It might not keep up with SQL Server patches, etc...
3- Use an Auditing Solution that will keep track of selects, inserts, deletes, etc... And will notify (by email or event log)if violations occurred. A sample violation could be a dba running a select on your Salaries table. then fire the dba and change everyone salaries.
Auditors always ask for this, like they ask for other things that can never be done.
What you need to do is put it into risk-mitigation terms and show what controls you do have (tracking when users are elevated to administrators, what they did and that they were de-elevated afterwards) instead of in absolutes.
I once had a boss ask for total system redundancy without defining what he meant or how much he was willing to pay and sacrifice.
I think the right solution would be to only allow trusted people be DBA's.
It is implicit in being DBA, that you have full access, so in my opinion, your auditor should demand that you have procedures for restricting who has DBA access.
That way you work with the system through processes in stead of working aginst the system (ie. sql server).
To have person you don't trust be DBA would be nuts...
If you don't want any people in the admin group on the server to be able to access the database, then remove the "BUILTIN\Administrators" user on the server.
However, make sure you have another user that is a sysadmin on the server!
another way i heard that a company has implemented but i haven't seen it is:
there's a government body which issues kind of timestamped certificate.
each db change is sent to async queue and is timestamped with this certificate and is stored off site. this way noone can delete anything without breaking the timestamp chain.
i don't know how exactly this works on a deeper level.

Caching to a local SQL instance on a web server

I run a very high traffic(10m impressions a day)/high revenue generating web site built with .net. The core meta data is stored on a SQL server. My team and I have a unique caching strategy that involves querying the database for new meta data at regular intervals from a middle tier server, serializing the data to files and sending those to the web nodes. The web application uses the data in these files (some are actually serialized objects) to instantiate objects and caches those in memory to use for real time requests.
The advantage of this model is that it:
Allows the web nodes to cache all data in memory and not incur any IO overhead querying a database.
If the database ever goes down either unexpectedly or for maintenance windows, the web servers will continue to run and generate revenue. You can even fire up a web server without having to retrieve its initial data from the DB because all the data it needs are in files on its own disks.
Allows us to be completely horizontally scalable. If throughput suffers, we can just add a web server.
The disadvantages are that this caching and persistense layers adds complexity in the code that queries the database, packages the data and unpackages it on the web server. Any time our domain model requires us to add entities, more of this "plumbing" has to be coded. This architecture has been in place for four years and there are probably better ways to tackle this.
One strategy I have been considering is using replication to replicate our master sql server database to local database instances installed on each web server. The web server application would use normal sql/ORM techniques to instantiate objects. Here, we can still sustain a master database outage and we would not have to code up specialized caching code and could instead use nHibernate to handle the persistence.
This seems like a more elegant solution and would like to see what others think or if anyone else has any alternatives to suggest.
I think you're overthinking this. SQL Server already has mechanisms available to you to handle these kinds of things.
First, implement a SQL Server cluster to protect your main database. You can fail over from node to node in the cluster without losing data, and downtime is a matter of seconds, max.
Second, implement database mirroring to protect from a cluster failure. Depending on whether you use synchronous or asynchronous mirroring, your mirrored server will either be updated in realtime or a few minutes behind. If you do it in realtime, you can fail over to the mirror automatically inside your app - SQL Server 2005 & above support embedding the mirror server's name in the connection string, so you don't even have to lift a finger. The app just connects to whatever server's live.
Between these two things, you're protected from just about any main database failure short of a datacenter-wide power outage or network outage, and there's none of the complexity of the replication stuff. That covers your high availability issue, and lets you answer the scaling question separately.
My favorite starting point for scaling is using three separate connection strings in your application, and choose the right one based on the needs of your query:
Realtime - Points directly at the one master server. All writes go to this connection string, and only the most mission-critical reads go here.
Near-Realtime - Points at a load balanced pool of read-only SQL Servers that are getting updated by replication or log shipping. In your original design, these lived on the web servers, but that's dangerous practice and a maintenance nightmare. SQL Server needs a lot of memory (not to mention money for licensing) and you don't want to be tied into adding a database server for every single web server.
Delayed Reporting - In your environment right now, it's going to point to the same load-balanced pool of subscribers, but down the road you can use a technology like log shipping to have a pool of servers 8-24 hours behind. These scale out really well, but the data's far behind. It's great for reporting, search, long-term history, and other non-realtime needs.
If you design your app to use those 3 connection strings from the start, scaling is a lot easier, and doesn't involve any coding complexity - just pick the right connection string.
Have you considered memcached? Since it is:
in memory
can run locally
fully scalable horizontally
prevents the need to re-cache on each web server
It may fit the bill. Check out Google for lots of details and usage stories.
Just some addition to what RickNZ proposed above..
Since your master data which you are caching currently won't change so frequently and probably over some maintenance window, here is what should you do first on database side:
Create a SNAPSHOT replication for the master tables which you want to cache. Adding new entities will be equally easy.
On all the webservers, install SQL Express and subscribe to this Publication.
Since, this is not a frequently changing data, you can rest assure, no much server resource usage issue minus network trips for master data.
All your caching which was available via previous mechanism is still availbale minus all headache which comes when you add new entities.
Next, you can leverage .NET mechanisms as suggested above. You won't face memcached cluster failure unless your webserver itself goes down. There is a lot availble in .NET which a .NET pro can point out after this stage.
It seems to me that Windows Server AppFabric is exactly what you are looking for. (AKA "Velocity"). From the introductory documentation:
Windows Server AppFabric provides a
distributed in-memory application
cache platform for developing
scalable, available, and
high-performance applications.
AppFabric fuses memory across multiple
computers to give a single unified
cache view to applications.
Applications can store any
serializable CLR object without
worrying about where the object gets
stored. Scalability can be achieved by
simply adding more computers on
demand. The cache also allows for
copies of data to be stored across the
cluster, thus protecting data against
failures. It runs as a service
accessed over the network. In
addition, Windows Server AppFabric
provides seamless integration with
ASP.NET that enables ASP.NET session
objects to be stored in the
distributed cache without having to
write to databases. This increases
both the performance and scalability
of ASP.NET applications.
Have you considered using SqlDependency caching?
You could also write the data to the local disk at the web tier, if you're concerned about initial start-up time or DB outages. But at least with a SqlDependency, you shouldn't have to poll the DB to look for changes. It can also be made relatively transparent.
In my experience, adding a DB instance on web servers generally doesn't work out too well from a scalability or performance perspective.
If you're concerned about performance and scalability, you might consider partitioning your data tier. The specifics depend on your app, but as an example, you could move read-only data onto a couple of SQL Express servers that are populated with replication.
In case it helps, I talk about this subject at length in my book (Ultra-Fast ASP.NET).

Resources