Data masking for data in AWS RDS [closed] - data-masking

I have an AWS RDS instance (Aurora DB) and I want to mask the data in the database. Does Amazon provide any service for data masking?
I have seen RDS encryption, but I am looking for data masking because the database contains sensitive data. So I want to know whether there is any service they provide for data masking, or any other tool that can be used to mask the data so I can load it into the DB manually.
A list of tools that could be used for data masking in my case would be much appreciated. I need to mask the data for testing, as the original DB contains sensitive information such as PII (personally identifiable information). I also have to share this data with my co-workers, so I consider data masking an important factor.
Thanks.

This is a fantastic question, and your proactive approach to securing your business's most valuable asset is something a lot of people should heed, especially if you're sharing the data with co-workers. Letting people see only what they need to see is an undeniably good way to reduce your attack surface. Standard cyber security methods are no longer enough in my opinion, as demonstrated by numerous attacks and by people losing laptops or USB drives full of sensitive data; we are only human, after all. With the GDPR coming into force in May next year, any company with customers in the EU will have to demonstrate privacy by design, and anonymisation techniques such as masking have been cited as a way to show this.
NOTE: I have a vested interest in this answer because I am working on such a service you're talking about.
We've found that the right masking method depends on your exact use case and on the size and contents of your data set. If your data set has few fields and you know where the PII is, you can run standard queries to replace sensitive values, e.g. John -> XXXX. If you want to maintain some human readability, there are libraries such as Python's Faker that generate random, locale-based PII you can replace your sensitive values with (PHP, Perl and Ruby Faker ports also exist).
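As an example, a minimal sketch of that query-based approach using Faker might look like the following; the pymysql connection details and the customers table with name and email columns are placeholders I've assumed, not part of any real schema.

```python
# Minimal masking sketch, assuming a "customers" table with "name" and
# "email" columns reachable over MySQL-compatible Aurora via pymysql.
# All connection details and names below are placeholders.
import pymysql
from faker import Faker

fake = Faker()

conn = pymysql.connect(host="your-aurora-endpoint", user="masker",
                       password="secret", database="testdb")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT id FROM customers")
        for (row_id,) in cur.fetchall():
            # Replace real PII with realistic-looking fake values.
            cur.execute(
                "UPDATE customers SET name = %s, email = %s WHERE id = %s",
                (fake.name(), fake.email(), row_id),
            )
    conn.commit()
finally:
    conn.close()
```

Run something like this against a copy of the database, never the production instance, before handing the data over.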
DISCLAIMER: Straightforward masking doesn't guarantee total privacy. Think of people identifying individuals in a masked Netflix data set by cross-referencing it with time-stamped IMDb data, or Guardian reporters identifying a judge's porn preferences from masked ISP data.
Masking gets tedious as your data set grows in fields and tables, and you may want to set up different levels of access for different co-workers, e.g. data science gets lightly anonymised data while marketing gets access to heavily anonymised data. PII in free-text fields is awkward to handle, and understanding what data is available in the world that attackers could use for cross-referencing is a big task in itself.
The service I'm working on aims to alleviate all of these issues by automating the process with NLP techniques and a good understanding of the mathematics of anonymisation. We're bundling this up into a web service and are keen to launch on the AWS Marketplace. I would love to hear more about your use case, and if you want early access we're in private beta at the moment, so let me know.

If you are exporting or importing data using CSV or JSON files (e.g. to share with your co-workers), then you could use FileMasker. It can be run as an AWS Lambda function, reading and writing CSV/JSON files on S3.
It's still in development but if you would like to try a beta now then contact me.
Disclaimer: I work for DataVeil, the developer of FileMasker.

Related

How do you handle collection and storage of new data in an existing system? [closed]

I am new to system design and have been asked to solve a problem.
Given a car rental service website, I need to work on a new feature.
The company has come up with some more data that they would like to capture and analyze along with the data that they already have.
This new data can be something like time and cost to assemble a car.
I need to understand the following:
1: How should I approach the problem from an API design perspective?
2: Is changing the schema of your tables going to do any good, if that is an option?
3: Which databases can be used?
The values, once stored, can change. For example, the time to assemble can decrease or increase, so users should be able to update the values.
To answer your question, let's divide it into two parts: the ideal architecture, and then the Q&As.
Architecture:
A typical system consists of many technologies working together to solve a practical problem. A problem can be solved in many ways and may have more than one valid solution. We are not discussing the efficiency and effectiveness of any particular architecture here, as that is a whole subject of its own, but it is always wise to choose what is best for your use case.
Since you already have existing software built, it is usually helpful to follow its existing design patterns. This will help you understand the existing code in detail and allow you to create logical blocks that fit in nicely and actually help in integrating the new functionality instead of working against it.
With that pre-planning out of the way, let's discuss what solution is, in my opinion, ideal for your use case.
Q&As
1. How should I approach the problem from an API design perspective?
There are a lot of assumptions here, but at minimum any system exposing an API should provide basic authentication and authorization wherever needed. Beyond that, try to stick to the full REST specification: API consumers can then follow standard conventions, and integration has minimal friction when deciding what the endpoints should look like and what they expect from the consumer.
That said, not every system fits this ideal, so it is ultimately up to the system designer how closely the system follows standard practices.
Naming conventions matter: putting new functionality under api/v2 paths while the old endpoints stay at api/v1 is good practice for routing and allows the system to expand seamlessly.
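As a rough sketch of that convention (not a prescription), here is how versioned routing might look with Flask blueprints; the cars resource and the assembly fields are made-up examples based on the question.

```python
# Versioned routing sketch with Flask blueprints; resource names and
# fields are illustrative only.
from flask import Flask, Blueprint, jsonify

app = Flask(__name__)

v1 = Blueprint("v1", __name__, url_prefix="/api/v1")
v2 = Blueprint("v2", __name__, url_prefix="/api/v2")

@v1.route("/cars/<int:car_id>")
def get_car_v1(car_id):
    # Existing consumers keep the original payload shape.
    return jsonify({"id": car_id, "model": "unknown"})

@v2.route("/cars/<int:car_id>")
def get_car_v2(car_id):
    # New fields (e.g. assembly metrics) appear only under /api/v2.
    return jsonify({"id": car_id, "model": "unknown",
                    "assembly_time_hours": None, "assembly_cost": None})

app.register_blueprint(v1)
app.register_blueprint(v2)

if __name__ == "__main__":
    app.run()
```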
2: Is changing the schema of your tables going to do any good, if that is an option?
In the short term, when you do not have much data, it is relatively easy to migrate. Once the data becomes huge, migration is much more painful and resource-intensive.
Good practices can prevent many of the scenarios in which you would need to migrate data at all.
Database normalization becomes crucial when the data structure is expected to grow rapidly and therefore needs attention.
Regardless of whether you use an SQL or NoSQL solution, a good data structure will always help with both data management and the programming implementation.
In my opinion, getting the data structures close to right up front is always a good idea, because it reduces the future cost of migration and the frustration it brings. Still, some use cases require adding columns, and it is fine to add them as long as doing so does not have much impact on existing code. Otherwise, the additional fields can always be decoupled into a separate table.
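For illustration, here is a minimal sketch of that decoupling, using SQLite purely for brevity; the cars table and the assembly columns are assumptions drawn from the question, not an existing schema.

```python
# Keep new attributes in their own table keyed to the existing "cars"
# table instead of altering it; names below are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cars (
    id    INTEGER PRIMARY KEY,
    model TEXT NOT NULL
);

-- New data lives in a separate table, so existing queries keep working.
CREATE TABLE car_assembly_metrics (
    car_id              INTEGER PRIMARY KEY REFERENCES cars(id),
    assembly_time_hours REAL,
    assembly_cost       REAL,
    updated_at          TEXT DEFAULT CURRENT_TIMESTAMP
);
""")

conn.execute("INSERT INTO cars (id, model) VALUES (1, 'Sedan')")
conn.execute(
    "INSERT INTO car_assembly_metrics (car_id, assembly_time_hours, assembly_cost) "
    "VALUES (?, ?, ?)", (1, 14.5, 3200.0))
# Values can be updated later, as the question requires.
conn.execute(
    "UPDATE car_assembly_metrics SET assembly_time_hours = ? WHERE car_id = ?",
    (12.0, 1))
print(conn.execute(
    "SELECT c.model, m.assembly_time_hours, m.assembly_cost "
    "FROM cars c JOIN car_assembly_metrics m ON m.car_id = c.id").fetchall())
```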
3: Which databases can be used?
Typically any RDBMS is enough for this kind of task. You might be surprised to see case studies of large data producers still running MySQL in clusters.
So the answer is: as long as you have a normal workload, go ahead and pick any database of your choice, until you hit its single-instance scalability limits. Those limits are quite generous for small- to mid-scale apps.
How should I approach the problem from an API design perspective?
Design a good data model which is appropriate for the data it needs to store. The API design will follow from the data model.
Is changing the schema of your tables going to do any good, if that is an option?
Does the new data belong in the existing tables? Then maybe you should store it there. Except: can you add new columns without breaking any existing applications? Maybe you can, but the regression testing you'll need to undertake to prove it may be ruinous for your timelines. Separate tables are probably the safer option.
Which databases can be used?
You're rather vague about the nature of the data you're working with, but it seems structured (numbers?). That suggests a SQL database with strong data types would be the best fit. Beyond that, use whatever data platform is currently in use; any perceived benefits of a different product will be swept away by the complexity and hassle of deploying it.
Last word. Talk this over with your boss (or whoever set you this task). Don't rely on the opinions of some random stranger on the interwebs.

Does Microsoft have a similar product like Google BigQuery? [closed]

I want to see whether Microsoft provides a service similar to Google BigQuery.
I want to run some queries on a database with the size of ~15GB and I want the service to be on the cloud.
P.S. Yes, I have Googled already but did not find anything similar.
The answer to your question is no: Microsoft does not (yet) offer a real-time big-data query service where you pay per query. That does not mean you won't find a solution to your problem in Azure, though.
Depending on your needs, you have two options on Azure:
SQL Data Warehouse: a new Azure-based columnar database service, currently in preview (http://azure.microsoft.com/fr-fr/documentation/services/sql-data-warehouse/), which according to Microsoft can scale up to petabytes. Assuming your data is structured (relational) and you need sub-second response times, it should do the job you expect.
HDInsight: a managed Hadoop service (https://azure.microsoft.com/en-us/documentation/articles/hdinsight-component-versioning/) which deals better with semi-structured data but is more oriented towards batch processing. It includes Hive, which is also SQL-like, but you won't get instant query response times. You could go for this option if you expect to run calculations in batch mode and store the aggregated result set somewhere else.
The main difference between these products and BigQuery is the pricing model: in BigQuery you pay per query, whereas with the Microsoft options you pay for the resources you allocate, which can get very expensive if your data is really big.
I think that if the expected usage is occasional, BigQuery will be much cheaper; the Microsoft options will be better for intensive use. Of course, you will need to do a detailed price comparison to be sure.
To get an idea of what BigQuery really is, and how it compares to a relational database (or Hadoop for that matter), take a look at this doc:
https://cloud.google.com/files/BigQueryTechnicalWP.pdf
Take a look at this:
http://azure.microsoft.com/en-in/solutions/big-data/.
Reveal new insights and drive better decision making with Azure HDInsight, a Big Data solution powered by Apache Hadoop. Surface those insights from all types of data to business users through Microsoft Excel.

Is there a database with git-like qualities? [closed]

I'm looking for a database where multiple users can contribute and commit new data; other users can then pull that data into their own database repository, all in a git-like manner. A transcriptional database, if you like; does such a thing exist?
My current thinking is to dump the database to a single SQL file, but that could well get unwieldy once it reaches any size. Another option is to dump the database into the filesystem, but again that gets unwieldy at any size.
There's Irmin: https://github.com/mirage/irmin
Currently it's only offered as an OCaml API, but there are plans for a GraphQL API and a Cap'n Proto one.
Despite the complex API and the still scarce documentation, it allows you to plug in any backend (in-memory, Unix filesystem, Git in-memory and Git on-disk), so it runs even in unikernels and browsers.
It also offers a bidirectional model where changes in the local Git repository are reflected in the application state and vice versa. Through the API, you can operate at any Git layer:
Append-only blob storage.
Transactional/compound tree layer.
Commit layer featuring the chain of changes and metadata.
Branch/ref/tag layer (local-only, but remotes are also offered) for mutability.
In the documentation, the immutable store generally refers to the blobs + trees + commits.
Thanks to the content-addressable storage inherited from Git, Irmin allows deduplication and thus reduced memory consumption. Some purely functional persistent data structures fit perfectly on this database, and its 3-way merge is a novel, CRDT-style approach to handling merge conflicts.
Answer from: How can I put a database under version control?
I have been looking for the same feature for Postgres (or SQL databases in general) for a while, but I found no tools to be suitable (simple and intuitive) enough. This is probably due to the binary nature of how data is stored. Klonio sounds ideal but looks dead. Noms DB looks interesting (and alive). Also take a look at Irmin (OCaml-based with Git-properties).
Though this doesn't answer the question as far as working with Postgres goes, check out the Flur.ee database. It has a "time-travel" feature that allows you to query the data as of an arbitrary point in time. I'm guessing it should be able to support a "branching" model.
The database has recently been developed for blockchain purposes. Due to the nature of blockchains, data needs to be recorded in increments, which is exactly how Git works. They are targeting an open-source release in Q2 2019.
Because each Fluree database is a blockchain, it stores the entire history of every transaction performed. This is part of how a blockchain ensures that information is immutable and secure.
It's not SQL, but CouchDB supports replicating the database and pushing/pulling changes between users in a way similar to what you describe.
Some more information in the chapter on replication in the O'Reilly CouchDB book.
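For instance, a one-off pull replication can be triggered through CouchDB's HTTP _replicate endpoint; in this sketch the host, credentials and database names are placeholders.

```python
# Trigger a one-off CouchDB replication over HTTP; all hosts, credentials
# and database names here are placeholders.
import requests

COUCH = "http://admin:password@localhost:5984"

resp = requests.post(
    f"{COUCH}/_replicate",
    json={
        "source": "http://colleague-host:5984/projectdb",  # remote copy
        "target": "projectdb",                             # local copy
        "create_target": True,                             # create if missing
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # replication history and document counts on success
```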

Versioned RDF store [closed]

Let me try rephrasing this:
I am looking for a robust RDF store or library with the following features:
Named graphs, or some other form of reification.
Version tracking (probably at the named graph level).
Privacy between groups of users, either at named graph or triple level.
Human-readable data input and output, e.g. TriG parser and serialiser.
I've played with Jena, Sesame, Boca, RDFLib, Redland and one or two others some time ago but each had its problems. Have any improved in the above areas recently? Can anything else do what I want, or is RDF not yet ready for prime-time?
Reading around the subject a bit more, I've found that:
Jena, nothing further
Sesame, nothing further
Boca does not appear to be maintained any more and seems only really designed for DB2. OpenAnzo, an open-source fork, appears more promising.
RDFLib, nothing further
Redland, nothing further
Talis Platform appears to support changesets (wiki page and reference in Kniblet Tutorial Part 5) but it's a hosted-only service. Still may look into it though.
SemVersion sounded promising, but appears to be stale.
Talis is the obvious choice, but privacy may be an issue (or a perceived issue, anyway) since it's a SaaS offering. I say obvious because the three highlighted features in your list are core features of their platform, IIRC.
They don't have a features list as such - which makes it hard to back up this answer, but they do say that stores of data can be individually secured. I suppose you could - at a pinch - sign up to a separate store on behalf of each of your own users.
Human-readable input is often best supported by writing custom interfaces for each user task, so be prepared to do that as needs demand.
Regarding prime-time readiness: I'd say yes for some applications, but otherwise "not quite". Mostly the community needs to integrate with existing developer toolsets and write good documentation aimed at "ordinary" developers (probably OO developers using Java, .NET and Ruby/Groovy), and then I predict it will snowball.
See also Temporal Scope for RDF triples
From: http://www.semanticoverflow.com/questions/453/how-to-implement-semantic-data-versioning/748#748
I personally quite like the pragmatic approach which Freebase has adopted.
Browse and edit views for humans:
http://www.freebase.com/view/guid/9202a8c04000641f80000000041ecebd
http://www.freebase.com/edit/topic/guid/9202a8c04000641f80000000041ecebd
The data model exposed here:
http://www.freebase.com/tools/explore/guid/9202a8c04000641f80000000041ecebd
Strictly speaking, it's not RDF (it's probably a superset of it), but part of it can be exposed as RDF:
http://rdf.freebase.com/rdf/guid.9202a8c04000641f80000000041ecebd
Since it's a community-driven website, not only do they need to track who said what and when, but they are probably keeping the full history as well (never deleting anything):
http://www.freebase.com/history/view/guid/9202a8c04000641f80000000041ecebd
To conclude, the way I would tackle your problem is similarly pragmatic. AFAIK, you will not find a solution that works out of the box. But you could use a "tuple" store with wider records than triples or quads, since 3 or 4 elements aren't enough to keep history at the finest granularity (i.e. per triple or quad).
I would use the TDB code as a library (since it gives you B+Trees and a lot of other useful things you need) and a data model which allows me to count quads and to assign each quad an owner, a timestamp and, where available, previous/next quad(s):
[ id | g | s | p | o | user | timestamp | prev | next ]
Where:
id - long (unique identifier; the same (g,s,p,o) will have different ids, which takes a lot of space, but it lets you count quads, and when you have a community-driven website like this one, counting things is important)
g - URI (or blank node?|absent (i.e. default graph))
s - URI|blank node
p - URI
o - URI|blank node|literal
user - URI
timestamp - when the quad was created
prev - id of the previous quad (if present)
next - id of the next quad (if present)
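To make that record shape concrete, here is a small sketch that models it in SQLite purely as a stand-in (the answer itself suggests building on TDB's B+Trees); all identifiers are illustrative.

```python
# A versioned quad record: [id | g | s | p | o | user | timestamp | prev | next],
# modelled here in SQLite only to illustrate the shape of the data.
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE quads (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    g         TEXT,              -- named graph (NULL = default graph)
    s         TEXT NOT NULL,     -- subject URI or blank node
    p         TEXT NOT NULL,     -- predicate URI
    o         TEXT NOT NULL,     -- object URI, blank node or literal
    user      TEXT NOT NULL,     -- who asserted the quad
    timestamp REAL NOT NULL,     -- when the quad was created
    prev      INTEGER REFERENCES quads(id),
    next      INTEGER REFERENCES quads(id)
)
""")

def assert_quad(g, s, p, o, user, prev=None):
    """Insert a new quad version and link it to its predecessor."""
    cur = conn.execute(
        "INSERT INTO quads (g, s, p, o, user, timestamp, prev) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (g, s, p, o, user, time.time(), prev))
    new_id = cur.lastrowid
    if prev is not None:
        conn.execute("UPDATE quads SET next = ? WHERE id = ?", (new_id, prev))
    return new_id

v1 = assert_quad(None, "ex:doc1", "dc:title", '"Draft"', "ex:alice")
v2 = assert_quad(None, "ex:doc1", "dc:title", '"Final"', "ex:bob", prev=v1)
```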
Then you need to think about which indexes you need; this depends on how you want to expose and access your data.
You do not need to expose all your internal structures/indexes to external users, people or applications. And when (and if) RDF vocabularies or ontologies for representing versioning emerge, you will be able to expose your data through them quickly (if you want to).
Be warned: this is not common practice, and if you look at it through your "semantic web glasses" it is probably wrong, bad, etc. But I am sharing the idea because I believe it is not harmful, it provides a solution to your question (albeit slower and using more space than a plain quad store), and part of it can be exposed to the semantic web as RDF / Linked Data.
My 2 (heretic) cents.
LMF comes with a versioning module: http://code.google.com/p/lmf/wiki/ModuleVersioning
The Linked Media Framework is an easy-to-setup server application developed in JavaEE that bundles core Semantic Web technologies to offer many advanced services.
Take a look to see if Virtuoso's RDF support meets your needs; it sounds as though it might go quite a way, and it plays nicely with XML and web services too. There are both a commercial and a GPL'd version.
Mulgara/Fedora-Commons might fit the bill. I believe that privacy is currently a major project there, and I understand that it supports versioning, but it might be too much in that it is an object store too.
(years later)
I think both Oracle's RDF store:
http://www.oracle.com/technetwork/database/options/semantic-tech/index.html
and the recently announced graph store in IBM's DB2 support much of this:
http://www-01.ibm.com/software/data/db2/linux-unix-windows/graph-store.html

How do you manage software serial keys, licenses, etc? [closed]

My company is trying to find a tool that will track serial keys for software we have purchased (such as Office), as well as software that we write and sell.
We want software that allows us to associate a particular computer with the software on that computer and the serial key(s) and license(s) for that software. We'd also like to track each computer's history, such as when software is removed from one computer and moved to another.
I've looked around Google and various software sites, but all of the results I've found are for licensing software and creating serial keys, not managing the serial keys that those tools generate. I know this problem has been solved; many companies license software and keep records of the serial keys they generate. So I'm curious if any of you have solved this same problem without writing your own custom software?
Edit: I forgot to mention, I am not asking about the merits of licensing software -- the software we write is not COTS and purchases are controlled at a contractual level. Still, we need to manage how serial keys are generated.
A couple of options (including the one you don't want):
Write your own database for this; perhaps a simple app using SQLite. (Not very appealing, but not hard either.)
You just need an application that lets you create name:value pairs and assign them into groups. A customizable address book would work in a pinch. Each contact could be a program name or a customer name with the license/serial as the data. Then you could group by computer, customer, etc.
This sounds like the classic kind of problem that Access (and programs like it) were designed to solve. You start with access, use it for a couple of years, and then later hire someone to port the data into a custom app when you've outgrown that solution.
I would be extremely tempted to try and use an address book program for this to start. (Note: I'm using Apple's address book program in my mind for referencing features) It allows for custom fields, notes, and groups. The downside is that you have to do more work: searching for part of a serial number to make sure it is not already in use, manually adding a note to two "contacts" indicating the transfer of a license from one to the other.
On the other hand, if the license tracking of your own software is key to your business, it is probably worth your time and money to develop a custom app on top of a SQL database. Write down a list of everything you want to be able to do. Go back and write down any rules or constraints (e.g. can two or more machines have the same license?). The database schema and programming rules will fall right out of that document.
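As a hedged starting point, the schema for such a custom app might look roughly like this; every table and column name here is an assumption rather than a reference to any existing product.

```python
# Starting-point schema for tracking serial keys, licenses and the
# machines they live on; names are assumptions.
import sqlite3

conn = sqlite3.connect("licenses.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS computers (
    id       INTEGER PRIMARY KEY,
    hostname TEXT UNIQUE NOT NULL,
    owner    TEXT
);

CREATE TABLE IF NOT EXISTS software (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    vendor TEXT
);

CREATE TABLE IF NOT EXISTS licenses (
    id          INTEGER PRIMARY KEY,
    software_id INTEGER NOT NULL REFERENCES software(id),
    serial_key  TEXT UNIQUE NOT NULL
);

-- History of which license was installed where, and when it was removed
-- (a NULL removed_at means it is still installed there).
CREATE TABLE IF NOT EXISTS installations (
    license_id   INTEGER NOT NULL REFERENCES licenses(id),
    computer_id  INTEGER NOT NULL REFERENCES computers(id),
    installed_at TEXT DEFAULT CURRENT_TIMESTAMP,
    removed_at   TEXT
);
""")
conn.commit()
```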
Another idea: programs that track books, dvds, etc. Primarily ones that allow you to keep notes about when you lend them to people.
Take a look at SpiceWorks:
http://www.spiceworks.com/
It does a lot more than just inventory / asset management and is free.
Not off-the-shelf, but perhaps a database like MySQL or OpenOffice Base (or Access, bleah)? This sounds pretty simple if you're not looking for many frills; just a couple of tables, e.g. users, computers, software types, license keys, and cross-tables to associate these with each other.
This might be useful, but I have not used it for what you are looking for:
http://www.ezasset.com/i/front.html?page=front_ezindex
There are a number of ways you can handle this. Perhaps a license is also considered an asset (not just a computer) and you can group the assets together?
There is a notion of a parent asset and sub-assets, I think.
It is free for up to 100 assets. Assignment and location are also handled.
My suggestion is PassPack - we use them for password management and they are excellent.
I've used this in the past and been pretty happy with it. The downside is that it runs in FileMaker.
