ARX Anonymization Tool - supported databases [closed]

We are looking into options for open-source data masking tools. ARX seems to provide some great functionality, but its list of supported sources only mentions SQL Server and DB2 (along with flat files and Excel). Does anyone know what else is supported? Oracle, for example? How about old-school things like VSAM?
https://arx.deidentifier.org/anonymization-tool/
Does anyone have other good options for data masking? Ideally something UI-configured, as it's typically not programmers who manage the data.

Many great tools exist to help you anonymize data, and it’s a growing field, given the increasing need for data privacy and the demands of recent regulations. Here are just a few of the leading products for data anonymization; quotations are from product websites.
Open Source
ARX Data Anonymization Tool - https://arx.deidentifier.org/
“ARX features a cross-platform graphical tool, which supports data import and cleansing, wizards for creating transformation rules, intuitive ways for tailoring the anonymized dataset to your requirements and visualizations of data utility and re-identification risks.”
Masquerade - https://github.com/TonicAI/masquerade
“Masquerade can anonymize data in real-time enabling anonymous analytics, application development, and QA testing with next to no overhead. It does this by operating a TCP proxy between your Postgres client and Postgres database and modifying the result-sets generated by SELECT statements according to a set of user-defined rules.”
Amnesia - https://amnesia.openaire.eu/
“Amnesia is a data anonymization tool, that allows to remove identifying information from data. Amnesia not only removes direct identifiers like names, SSNs etc but also transforms secondary identifiers like birth date and zip code so that individuals cannot be identified in the data. Amnesia supports k-anonymity and km-anonymity.”
SaaS / Enterprise
Tonic (Synthetic Data Generator) - https://www.tonic.ai/
“Tonic uses pre-trained models and feature extraction to generate synthetic data that is based on your data. It preserves all the characteristics that make your data unique—constraints, statistical correlations, distributions, interdependencies, etc. Mask, anonymize, obscure, or generate brand new data, all at the click of a mouse.”
Informatica (Dynamic or Persistent Data Masking Products) - https://www.informatica.com/in/products/data-security/data-masking.html#fbid=3YKt13oZ5As
“De-identify, de-sensitize, and anonymize sensitive data from unauthorized access for application users, business intelligence, application testing, and outsourcing.”
Oracle (Data Masking and Subsetting Pack) - https://www.oracle.com/database/technologies/security/data-masking-subsetting.html
“Oracle Data Masking and Subsetting helps database customers improve security, accelerate compliance, and reduce IT costs by sanitizing copies of production data for testing, development, and other activities and by easily discarding unnecessary data.”
This list could be much longer; the above is just a sampling. Other companies that offer data masking products include Delphix, IBM, Microsoft (SQL Server), and Aircloak.
Full disclosure: I'm a founder of Tonic.

ARX developer here. When using the Java library, you can connect to any database that provides a JDBC driver. We also support connections to Oracle via the GUI. That said, some users have reported problems when connecting to Oracle databases, so you will have to verify it against your own setup.
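If the connection fails, a quick way to rule out driver or connection-string problems is to verify the JDBC connection independently of ARX first. Below is a minimal sketch using the Python jaydebeapi package; the host, service name, credentials, driver JAR path, and table name are placeholders, not anything prescribed by ARX.

```python
# Minimal JDBC connectivity check for Oracle, independent of ARX.
# Requires: pip install jaydebeapi, plus an Oracle ojdbc JAR downloaded locally.
# All connection details and the table name below are placeholders.
import jaydebeapi

conn = jaydebeapi.connect(
    "oracle.jdbc.OracleDriver",                     # JDBC driver class
    "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1",    # example JDBC URL
    ["scott", "tiger"],                             # username, password
    "/path/to/ojdbc8.jar",                          # driver JAR on disk
)
try:
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM patient_data")  # hypothetical table
    print("Rows visible over JDBC:", cur.fetchone()[0])
    cur.close()
finally:
    conn.close()
```
If this works but the connection from the tool still fails, the problem is more likely in how the connection is configured in the tool than in the driver or database itself.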

Related

What is RDBMS and database engine? [closed]

It's kind of a noob question, but what is the difference between a relational database management system and a database engine?
Thanks.
The original idea of an RDBMS differs from what is called an RDBMS these days. SQL DBMSs are commonly called RDBMSs, but it's more correct to say they can be used mostly relationally, if one has the knowledge and discipline. It's also possible to use them in the style of a network data model or even inconsistently, which seems like the more common attitude in the field.
The essence of the relational model is not about tables, but about first-order logic. Tables are simply a general-purpose data structure which can be used to represent relations. For example, a graph can be viewed in a relational way - as a set of ordered pairs - and can be represented as a table (with some rules to ensure the table is interpreted or manipulated correctly). By describing all data using domains, relations, dependencies and constraints, we can develop declarative consistency guarantees and allow any reasonable question to be answered correctly from the data.
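To make the graph example concrete, here is a tiny illustrative sketch (in Python): the edge set is a relation of ordered pairs, and a question like "which nodes are reachable in exactly two hops?" is answered by joining the relation with itself.

```python
# A directed graph represented as a relation: a set of ordered (source, target) pairs.
edges = {("a", "b"), ("b", "c"), ("b", "d"), ("d", "a")}

# Two-hop reachability is a self-join: pairs (x, z) where some y links x -> y and y -> z.
two_hops = {(x, z) for (x, y1) in edges for (y2, z) in edges if y1 == y2}

print(sorted(two_hops))  # [('a', 'c'), ('a', 'd'), ('b', 'a'), ('d', 'b')]
```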
A database engine is the software component that handles the data structures, physical storage, and management of the data. Different storage engines have different features and performance characteristics, so a single DBMS could use multiple engines. Ideally, they should not affect the logical view of data presented to users of the DBMS.
How easily you can migrate to another DBMS / engine depends on how much they differ. Unfortunately, every DBMS implements somewhat different subsets of the SQL standard, and different engines support different features. Trying to stick to the lowest common denominator tends to produce inefficient solutions. Object-relational mappers reintroduce the network data model and its associated problems which the relational model was meant to address. Other data access middleware generally don't provide a complete or effective data sublanguage.
Whatever approach you choose, changing it is going to be difficult. At least there's some degree of overlap between SQL implementations, and queries are shorter and more declarative than the equivalent imperative code, so I tend to stick to simple queries and result sets rather than using data access libraries or mappers.
A relational database management system (RDBMS) is a database management system (DBMS) based on the relational model, in which you can create many tables and define relations between them. A database engine, on the other hand, is the underlying software component that a DBMS uses to perform operations on a database.

Framework for partial bidirectional database synchronization [closed]

I'm trying to optimize the backend of an information system for high availability, which involves splitting off the part needed for time-critical client requests (front office) from the rest (back office).
Front office will have redundant application servers with load balancing for maximum performance and will use a database with pre-computed data. Back office will periodically prepare data for the front office based on client statistics and some external data.
A part of the data schema will be shared between back and front office, but not whole databases - only parts of some tables. The data does not need to match at all times; it will be synchronized between the two databases periodically. Continuous synchronization is also viable, but there is no real-time consistency requirement, and batch-style synchronization seems better in terms of control, debugging, and backup. I expect no need for conflict resolution, because the data will mostly grow and change on only one side.
The solution should allow defining corresponding tables and columns and then insert or update new and changed rows. Ideally it should use the data model defined in Groovy classes (probably through annotations?), as both applications run on Grails. The synchronization may use the existing Grails web applications or run externally, maybe even on the database server alone (PostgreSQL).
There are systems for replicating whole mirrored databases, but I wasn't able to find a solution that suits my needs. Do you know of an existing framework that could help with this, or is building my own the only option?
I ended up using Londiste from SkyTools. The project page on the pgFoundry site lists quite old binaries (and is currently down), so you are better off building it from source.
It replicates in one direction (master to slave) only, so you have to set up two synchronization instances for bidirectional sync. Note that each instance consists of two Londiste binaries (a master and a slave worker) and a ticker daemon that pushes the changes.
To reduce synchronization traffic, you can extend the polling period (1 second by default) in the configuration file, or turn polling off completely by stopping the ticker and then triggering the sync manually by running the SQL function pgq.ticker on the master.
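If you go the manual-trigger route, the call might look roughly like this from Python with psycopg2; the connection details are placeholders, and pgq.ticker is the function installed by SkyTools/PgQ:

```python
# Manually generate a PgQ tick on the master so Londiste picks up pending changes.
# Assumes SkyTools/PgQ is installed in the database; connection details are placeholders.
import psycopg2

conn = psycopg2.connect(host="master-db", dbname="frontoffice", user="londiste", password="secret")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT pgq.ticker();")  # creates ticks for all queues
        print("ticks generated:", cur.fetchone()[0])
    conn.commit()
finally:
    conn.close()
```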
I solved the issue of partial column replication by writing a simple custom handler (a londiste.handler.TableHandler subclass) with the column mapping configured in the database. The mapping configuration is not model-driven (yet), as I originally planned, but I only need to replicate the common columns, so this solution is sufficient for now.

What are the approaches to the Big-Data problems? [closed]

Let us consider the following problem. We have a system containing a huge amount of data (big data), so in fact we have a database. As the first requirement, we want to be able to write to and read from the database quickly. We also want a web interface to the database (so that different clients can write to and read from it remotely).
But the system we want should be more than a database. First, we want to be able to run different data-analysis algorithms on the data to find regularities, correlations, anomalies and so on (as before, we care a lot about performance). Second, we want to attach machine learning machinery to the database, which means running machine learning algorithms on the data to learn the "relations" present in it and, based on that, predicting the values of entries that are not yet in the database.
Finally, we want a nice click-based interface that visualizes the data, so that users can see it in the form of charts, graphs, and other interactive visualization objects.
What are the standard and widely recognised approaches to the problem described above? Which programming languages are typically used to deal with these problems?
I will approach your question like this: I assume you are firmly interested in big data database use already and have a real need for one, so instead of repeating textbooks upon textbooks of information about them, I will highlight some that meet your 5 requirements - mainly Cassandra and Hadoop.
1) The first requirement: we want to be able to write to and read from the database quickly.
You'll want to explore NoSQL databases, which are often used for storing “unstructured” big data. Some open-source options include Hadoop and Cassandra. Regarding Cassandra:
Facebook needed something fast and cheap to handle the billions of status updates, so it started this project and eventually moved it to Apache where it's found plenty of support in many communities (ref).
References:
Big Data and NoSQL: Five Key Insights
NoSQL standouts: New databases for new applications
Big data woes: Which database should I use?
Cassandra and Spark: A match made in big data heaven
List of NoSQL databases (currently 150)
2) We also want to have a web interface to the database
See the list of 150 NoSQL databases to see all the various interfaces available, including web interfaces.
Cassandra has a cluster admin, a web-based environment, a web-admin based on AngularJS, and even GUI clients.
References:
150 NoSQL databases
Cassandra Web
Cassandra Cluster Admin
3) We want to be able to run different data-analysis algorithms on the data
Cassandra, Hive, and Hadoop are well-suited for data analytics. For example, eBay uses Cassandra for managing time-series data.
References:
Cassandra, Hive, and Hadoop: How We Picked Our Analytics Stack
Cassandra at eBay - Cassandra Summit
An Introduction to Real-Time Analytics with Cassandra and Hadoop
4) We want to run machine learning algorithms on the data to be able to learn "relations"
Again, Cassandra and Hadoop are well-suited. Regarding Apache Spark + Cassandra,
Spark was developed in 2009 at UC Berkeley AMPLab, open sourced in 2010, and became a top-level Apache project in February 2014. It has since become one of the largest open source communities in big data, with over 200 contributors in 50+ organizations (ref).
Regarding Hadoop,
With the rapid adoption of Apache Hadoop, enterprises use machine learning as a key technology to extract tangible business value from their massive data assets.
References:
Getting Started with Apache Spark and Cassandra
What is Apache Mahout?
Data Science with Apache Hadoop: Predicting Airline Delays
5) Finally, we want to have a nice click-based interface that visualizes the data.
Visualization tools (paid) that work with the above databases include Pentaho, JasperReports, and Datameer Analytics Solutions. Alternatively, there are several open-source interactive visualization tools such as D3 and Dygraphs (for big data sets).
References:
Data Science Central - Resources
Big Data Visualization
Start by looking at:
what kind of data you want to store in the database
what kinds of relationships exist between the data
how the data will be accessed (for instance, do you need to access a certain set of data quite often?)
is it documents? text? something else?
Once you have answers to all those questions, you can start looking at which NoSQL database would give you the best results for your needs.
You can choose between four different types: key-value, document, column-family, and graph databases.
Which one is the best fit can be determined by answering the questions above.
There is a ready-to-use stack that may really help you get started with your project:
Elasticsearch would be your database (it has a REST API that you can use to write documents to it and to run queries and analyses); see the sketch after this list.
Kibana is a visualization tool; it allows you to explore and visualize your data, is quite powerful, and will be more than enough for most of your needs.
Logstash can centralize the data processing and help you process and save it in Elasticsearch; it already supports quite a few sources of logs and events, and you can also write your own plugins.
This combination is commonly referred to as the ELK stack.
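As a rough illustration of the REST API mentioned above, here is a minimal sketch that indexes a document and queries it back using the Python requests library; the index name, field names, and localhost address are assumptions for illustration:

```python
# Index one document into Elasticsearch and query it back via the REST API.
# Assumes Elasticsearch runs locally on the default port; names are placeholders.
import requests

doc = {"sensor": "s-42", "value": 17.3, "timestamp": "2024-01-01T12:00:00Z"}

# Create/overwrite document 1 in the "measurements" index (refresh so it is searchable immediately).
requests.put("http://localhost:9200/measurements/_doc/1?refresh=true", json=doc).raise_for_status()

# Search for all documents whose value is greater than 10.
query = {"query": {"range": {"value": {"gt": 10}}}}
resp = requests.get("http://localhost:9200/measurements/_search", json=query)
resp.raise_for_status()

for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])
```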
I don't believe you should worry about the programming language at this point. Try to select the tools first; sometimes the choices are limited by the tools you want to use, and you can still use a mixture of languages, making that effort only if and when it makes sense.
A common way to meet such requirements is to use Amazon Redshift and the ecosystem around it.
Redshift is a petabyte-scale data warehouse (it can also start at gigabyte scale) that exposes an ANSI SQL interface. Since you can put as much data as you like into the warehouse and run any SQL you wish against it, it is a good foundation for almost any agile big data analytics system.
Redshift has many analytics functions, mainly window functions. You can calculate averages and medians, but also percentiles, dense ranks, etc.
You can connect almost any SQL client you want using JDBC/ODBC drivers - from R, RStudio, and psql, but also from MS Excel.
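Since Redshift speaks the PostgreSQL wire protocol, a plain PostgreSQL client library also works. The sketch below uses psycopg2 with a window-function query of the kind mentioned above; the cluster endpoint, credentials, and table and column names are placeholders:

```python
# Run a window-function query against Redshift over the PostgreSQL protocol.
# Endpoint, credentials, and the "measurements" table are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="secret",
)

sql = """
    SELECT sensor_id,
           ts,
           value,
           AVG(value)   OVER (PARTITION BY sensor_id) AS sensor_avg,
           DENSE_RANK() OVER (PARTITION BY sensor_id ORDER BY value DESC) AS value_rank
    FROM measurements
    WHERE ts >= DATEADD(day, -30, GETDATE());
"""

with conn.cursor() as cur:
    cur.execute(sql)
    for row in cur.fetchmany(10):  # peek at the first few rows
        print(row)
conn.close()
```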
AWS recently added a new machine learning service. Amazon ML integrates nicely with Redshift: you can build predictive models based on data from Redshift simply by providing an SQL query that pulls the data needed to train the model, and Amazon ML will build a model that you can use for both batch and real-time predictions. This post from the AWS Big Data blog shows such a scenario: http://blogs.aws.amazon.com/bigdata/post/TxGVITXN9DT5V6/Building-a-Binary-Classification-Model-with-Amazon-Machine-Learning-and-Amazon-R
Regarding visualization, there are plenty of visualization tools that you can connect to Redshift. The most common are Tableau, QlikView, Looker, and Yellowfin, especially if you don't have an existing DWH, in which case you might want to keep using tools like JasperSoft or Oracle BI. Here is a list of partners that provide free trials of their visualization tools on top of Redshift: http://aws.amazon.com/redshift/partners/
By the way, Redshift also provides a two-month free trial, so you can quickly test whether it fits your needs: http://aws.amazon.com/redshift/free-trial/
Big data is a tough problem primarily because it isn't one single problem. First, if your original database is a normal OLTP database handling business transactions throughout the day, you will not want to do your big data analysis on the same system, since that analysis will interfere with normal business traffic.
Problem #1 is which type of database you want to use for data analysis. You have many choices, including an RDBMS, Hadoop, MongoDB, and Spark. If you go with an RDBMS, you will want to change the schema to better suit data analysis: create a data warehouse with a star schema. Doing this makes many tools available to you, because this method of data analysis has been around for a very long time. The other "big data" and data-analysis databases do not have the same level of tooling available, but they are catching up quickly. Each of them will require research into which one best fits your problem set. If you have big batches of data, an RDBMS and Hadoop will be good. If you have streaming data, look at MongoDB and Spark. If you are a Java shop, consider an RDBMS, Hadoop, or Spark; if a JavaScript shop, MongoDB; if you are good with Scala, then Spark.
Problem #2 is getting your data from your transactional database into your big data storage. You will need a programming language with libraries that can talk to both databases, and you will have to decide when and where to move the data. You can use Python, Java, or Ruby to do this work; a rough sketch follows below.
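As an illustration of the kind of glue code involved, here is a minimal sketch that copies recent rows from a PostgreSQL table into a MongoDB collection using psycopg2 and pymongo; the connection details, table, and column names are placeholders:

```python
# Copy recent rows from a transactional PostgreSQL table into MongoDB for analysis.
# All connection details, table names, and column names are placeholders.
import psycopg2
from pymongo import MongoClient

pg = psycopg2.connect(host="oltp-db", dbname="shop", user="etl", password="secret")
orders = MongoClient("mongodb://analytics-db:27017")["analytics"]["orders"]

with pg.cursor() as cur:
    cur.execute(
        "SELECT id, customer_id, total, created_at FROM orders "
        "WHERE created_at >= now() - interval '1 day'"
    )
    batch = [
        {"_id": row[0], "customer_id": row[1], "total": float(row[2]), "created_at": row[3]}
        for row in cur
    ]

if batch:
    # Delete-then-insert makes the daily load idempotent if it is re-run.
    orders.delete_many({"_id": {"$in": [doc["_id"] for doc in batch]}})
    orders.insert_many(batch)

pg.close()
```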
Problem #3 is your UI. If you decide to go with an RDBMS, you can use many of the available tools or build your own. The other data storage solutions have tool support, but it isn't as mature as what is available for an RDBMS. You are most likely going to build your own here anyway, because your analysts will want tools built to their specifications. Java works with all of these storage mechanisms, but you can probably get Python to work too. You may want to provide a service layer built in Java that exposes a RESTful interface and then put a web layer in front of it. If you do this, the web layer can be built in any language you prefer.
Three languages are most commonly used for machine learning and data mining on the server side: R, Python, and SQL. If you are aiming for heavy mathematical functions and graph generation, Haskell is also popular.

Is there a powerful database system for time series data? [closed]

In multiple projects we have to store, aggregate, and evaluate simple measurement values. A row typically consists of a timestamp, a value, and some attributes of the value. In some applications we would like to store 1000 values per second or more. These values must not only be inserted but also deleted at the same rate, since the lifetime of a value is restricted to about a year (with different aggregation steps; we do not store 1000 values/s for the whole year).
So far we have developed different solutions: one based on Firebird, one on Oracle, and one on a self-made storage mechanism. But none of these is very satisfying.
Neither RDBMS solution can handle the desired data flow. Besides that, the applications that deliver the values (e.g. device drivers) cannot easily be attached to databases, and the insert statements are cumbersome. Finally, while an SQL interface to the data is strongly desired, typical evaluations are hard to formulate in SQL and slow to execute - e.g. find the maximum value, with its timestamp, per 15 minutes for all measurements during the last month.
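For concreteness, the kind of evaluation described here - the per-15-minute maximum and the timestamp at which it occurred - might look like this in pandas, assuming the measurements are already loaded into a DataFrame; the column names and sample data are purely illustrative:

```python
# Per-15-minute maximum value (with its timestamp) over the last month, in pandas.
# The column names and the tiny sample data set are assumptions for illustration.
import pandas as pd

df = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=8, freq="5min"),
    "value": [3.1, 7.4, 2.2, 9.8, 1.0, 5.5, 6.6, 4.2],
}).set_index("ts")

# Keep only the last month of data.
last_month = df[df.index >= df.index.max() - pd.Timedelta(days=30)]

# For each 15-minute bucket: the maximum value and the timestamp at which it occurred.
buckets = last_month["value"].resample("15min")
result = pd.DataFrame({
    "max_value": buckets.max(),
    "timestamp_of_max": buckets.apply(lambda s: s.idxmax()),
})
print(result)
```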
The self-made solution can handle the insertion rate and has a client-friendly API to do it, but it has nothing like a query language and cannot be used by other applications via some standard interface e.g. for reporting.
The best solution in my dreams would be a database system that:
has an API for very fast insertion
is able to remove/truncate the values in the same speed
provides a standard SQL interface with specific support for typical time series data
Do you know some database that comes near those requirements or would you approach the problem in a different way?
Most other answers seem to mention SQL-based databases. NoSQL-based databases are far better suited to this kind of thing.
Some Open source time-series databases:
https://prometheus.io - Monitoring system and time series database
http://influxdb.com/ - time series database with no external dependencies (only basic server is open-source)
http://square.github.io/cube/ - Written on top of MongoDB
http://opentsdb.net/ - Written on top of Apache HBase
https://github.com/kairosdb/kairosdb - A rewrite of OpenTSDB that also enables using Cassandra instead of Hadoop
http://www.gocircuit.org/vena.html - A tutorial on writing a substitute of OpenTSDB using Go-circuits
https://github.com/rackerlabs/blueflood - Based on Cassandra
https://github.com/druid-io/druid - Column-oriented, Hadoop-based distributed data store
Cloud-based:
https://www.tempoiq.com
influxdb :: An open-source distributed time series database with no external dependencies.
http://influxdb.org/
Consider IBM Informix Dynamic Server with the TimeSeries DataBlade.
That is, however, an extreme data rate that you are working with. (Not quite up to sub-atomic physics at CERN, but headed in that general direction.)
Fair disclosure: I work for IBM on the Informix DBMS, though not on the TimeSeries DataBlade per se.
SQL Server StreamInsight
Microsoft StreamInsight BOL
You can try HDF5 for time series data. It is extremely fast for such applications.
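If you want to try that route, here is a minimal sketch using h5py with chunked, resizable datasets, which is a common pattern for append-heavy time-series storage; the file name, dataset names, and batch sizes are placeholders:

```python
# Append-friendly time-series storage in HDF5 via h5py.
# File name, dataset names, dtypes, and batch sizes are placeholders for illustration.
import numpy as np
import h5py

with h5py.File("measurements.h5", "w") as f:
    # Resizable, chunked datasets so new batches can be appended cheaply.
    ts = f.create_dataset("timestamps", shape=(0,), maxshape=(None,), dtype="int64", chunks=True)
    vals = f.create_dataset("values", shape=(0,), maxshape=(None,), dtype="float64", chunks=True)

    def append(new_ts, new_vals):
        n = ts.shape[0]
        ts.resize(n + len(new_ts), axis=0)
        vals.resize(n + len(new_vals), axis=0)
        ts[n:] = new_ts
        vals[n:] = new_vals

    # Simulate one second's worth of data (1000 samples).
    append(np.arange(1000, dtype="int64"), np.random.rand(1000))

with h5py.File("measurements.h5", "r") as f:
    print("stored samples:", f["values"].shape[0], "max value:", f["values"][:].max())
```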
As Jonathan Leffler said, you should try the Informix TimeSeries feature. It is included in all editions of Informix at no additional charge. You can take a look at the TimeSeries functions it supports:
IBM Informix Time series SQL routines
You can access the data through SQL functions or virtual view interfaces; you can even insert into the view.

Is there a database with git-like qualities? [closed]

I'm looking for a database where multiple users can contribute and commit new data; other users can then pull that data into their own database repository, all in a git-like manner. A transcriptional database, if you like; does such a thing exist?
My current thinking is to dump the database to a single file as SQL, but that could well get unwieldy once it reaches any size. Another option is to dump the database and use the filesystem, but again that gets unwieldy at any size.
There's Irmin: https://github.com/mirage/irmin
Currently it is only offered as an OCaml API, but there are future plans for a GraphQL API and a Cap'n Proto one.
Despite the complex API and the still scarce documentation, it allows you to plug in any backend (in-memory, Unix filesystem, Git in-memory, and Git on-disk). As a result, it runs even on unikernels and in browsers.
It also offers a bidirectional model where changes in the local Git repository are reflected in the application state and vice versa. With the API you can operate at any Git level:
Append-only Blob storage.
Transactional/compound Tree layer.
Commit layer featuring chain of changes and metadata.
Branch/Ref/Tag layer (local-only, but remotes are also offered) for mutability.
In the documentation, the immutable store usually refers to the blobs + trees + commits.
Thanks to the content-addressable storage inherited from Git, Irmin allows deduplication and thus reduced memory consumption. Some purely functional persistent data structures fit perfectly in this database, and the 3-way merge is a novel, CRDT-style approach to handling merge conflicts.
Answer from: How can I put a database under version control?
I have been looking for the same feature for Postgres (or SQL databases in general) for a while, but I have found no tools that are suitable (simple and intuitive) enough. This is probably due to the binary nature of how data is stored. Klonio sounds ideal but looks dead. Noms DB looks interesting (and alive). Also take a look at Irmin (OCaml-based, with Git properties).
Though this doesn't answer the question in the sense of working with Postgres, check out the Flur.ee database. It has a "time-travel" feature that allows you to query the data as of an arbitrary point in time. I'm guessing it should be able to support a "branching" model.
The database was recently developed for blockchain purposes. Due to the nature of blockchains, the data needs to be recorded in increments, which is exactly how Git works. They are targeting an open-source release in Q2 2019.
Because each Fluree database is a blockchain, it stores the entire history of every transaction performed. This is part of how a blockchain ensures that information is immutable and secure.
It's not SQL, but CouchDB supports replicating the database and pushing/pulling changes between users in a way similar to what you describe.
There is more information in the chapter on replication in the O'Reilly CouchDB book.
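As a rough sketch of what pulling another user's changes can look like, the snippet below triggers a one-off replication through CouchDB's HTTP _replicate endpoint using the Python requests library; the hostnames, database names, and credentials are placeholders:

```python
# Pull another user's CouchDB database into a local replica via the _replicate endpoint.
# Hostnames, database names, and credentials are placeholders for illustration.
import requests

payload = {
    "source": "http://alice:password@alice-host:5984/research_data",
    "target": "http://admin:password@localhost:5984/research_data",
    "create_target": True,  # create the local database if it does not exist yet
}

resp = requests.post("http://admin:password@localhost:5984/_replicate", json=payload)
resp.raise_for_status()
print(resp.json())  # e.g. {"ok": true, "history": [...]} once the replication finishes
```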
