Does Microsoft have a product similar to Google BigQuery? [closed] - database

I want to see whether Microsoft provides a service similar to Google BigQuery.
I want to run some queries on a database of roughly 15 GB, and I want the service to be in the cloud.
P.S. Yes, I have googled already, but did not find anything similar.

The answer to your question is no: Microsoft does not (yet) offer a real-time big data query service where you pay per query. That does not mean you won't find a solution to your problem in Azure.
Depending on your needs, you may have two options on Azure:
SQL Data Warehouse: a new Azure-based columnar database service, currently in preview (http://azure.microsoft.com/fr-fr/documentation/services/sql-data-warehouse/), which according to Microsoft can scale up to petabytes. Assuming your data is structured (relational) and you need sub-second response times, it should do the job you expect.
HDInsight is a managed Hadoop service (https://azure.microsoft.com/en-us/documentation/articles/hdinsight-component-versioning/) which deals better with semi-structured data but is oriented more toward batch processing. It includes Hive, which is also SQL-like, but you won't get instant query response times. You could go for this option if you expect to do calculations in batch mode and store the aggregated result set somewhere else.
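For a flavour of the Hive option, a batch aggregation might look like the sketch below. It uses the PyHive client; the host, table, and column names are hypothetical, and a real HDInsight cluster may need different authentication settings.

```python
# Minimal sketch of a Hive batch aggregation (e.g. against an HDInsight cluster).
# Requires: pip install "pyhive[hive]". The host, table, and column names are
# hypothetical; real clusters may need different authentication settings.
from pyhive import hive

conn = hive.connect(host="my-cluster-head-node", port=10000, username="admin")
cursor = conn.cursor()

# Hive executes this as a batch job, so expect minutes rather than sub-second latency.
cursor.execute("""
    SELECT event_date, COUNT(*) AS events, SUM(amount) AS total_amount
    FROM raw_events
    GROUP BY event_date
""")
for event_date, events, total_amount in cursor.fetchall():
    print(event_date, events, total_amount)
```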
The main difference between these products and BigQuery is the pricing model: in BigQuery you pay per query, whereas with the Microsoft options you pay for the resources you allocate, which can be very expensive if your data is really big.
I think that if the expected usage is occasional, BigQuery will be much cheaper; the Microsoft options will be better for intensive use. But of course you will need to do a detailed price comparison to be sure.

To get an idea of what BigQuery really is, and how it compares to a relational database (or Hadoop for that matter), take a look at this doc:
https://cloud.google.com/files/BigQueryTechnicalWP.pdf

Take a look at this:
http://azure.microsoft.com/en-in/solutions/big-data/
Reveal new insights and drive better decision making with Azure HDInsight, a Big Data solution powered by Apache Hadoop. Surface those insights from all types of data to business users through Microsoft Excel.

Related

Data masking for data in AWS RDS [closed]

I have an AWS RDS (Aurora DB) instance and I want to mask the data in the DB. Does Amazon provide any service for data masking?
I have seen RDS encryption, but I am looking for data masking because the database contains sensitive data. So I want to know: is there a service they provide for data masking, or is there another tool that can be used to mask the data so I can load it manually into the DB?
A list of tools that can be used for data masking would be most appreciated, if any fit my case. I need to mask the data for testing because the original DB contains sensitive information like PII (Personally Identifiable Information). I also have to share this data with my co-workers, so I consider data masking an important factor.
Thanks.
This is a fantastic question, and I think your proactive approach to securing your business's most valuable asset is something a lot of people should heed, especially if you're sharing the data with co-workers. Letting people see only what they need to see is an undeniably good way to reduce your attack surface. Standard cyber-security methods are no longer enough, in my opinion, as demonstrated by the numerous attacks and cases of people losing laptops/USB sticks with sensitive data on them; we are only human, after all. With the GDPR coming into force in May next year, any company with customers in the EU will have to demonstrate privacy by design, and anonymisation techniques such as masking have been cited as a way to show this.
NOTE: I have a vested interest in this answer because I am working on exactly the kind of service you're talking about.
We've found that the right masking method depends on your exact use case, the size of the data set, and its contents. If your data set has few fields and you know where the PII is, you can run standard queries to replace sensitive values, e.g. John -> XXXX. If you want to maintain some human readability, there are libraries such as Python's Faker that generate random, locale-based PII with which you can replace your sensitive values (PHP, Perl, and Ruby Faker ports also exist).
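For example, a minimal masking pass with Faker might look like the sketch below. It assumes a local SQLite copy of the data; the table and column names are hypothetical.

```python
# Minimal sketch: overwrite PII columns with Faker-generated values.
# Assumes a local SQLite copy of the data; table and column names are hypothetical.
import sqlite3

from faker import Faker  # pip install Faker

fake = Faker()
conn = sqlite3.connect("masked_copy.db")

# Replace each customer's PII with plausible-looking random values, so the data
# stays human-readable but no longer identifies real people.
for (row_id,) in conn.execute("SELECT id FROM customers").fetchall():
    conn.execute(
        "UPDATE customers SET name = ?, email = ?, address = ? WHERE id = ?",
        (fake.name(), fake.email(), fake.address(), row_id),
    )
conn.commit()
conn.close()
```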
DISCLAIMER: straightforward masking doesn't guarantee total privacy. Think of the researchers who identified individuals in a masked Netflix data set by cross-referencing it with time-stamped IMDb data, or the Guardian reporters who identified a judge's porn preferences from masked ISP data.
Masking gets tedious as your data set grows in fields and tables, and you may also want different levels of access for different co-workers, e.g. data science gets lightly anonymised data while marketing gets heavily anonymised data. PII in free-text fields is annoying, and understanding what data is available in the world that attackers could use for cross-referencing is generally a big task.
The service I'm working on aims to alleviate all of these issues by automating the process with NLP techniques and a good understanding of the mathematics of anonymisation. We're bundling this up into a web service and are keen to launch on the AWS Marketplace, so I would love to hear more about your use case. We're in private beta at the moment, so if you want early access, let me know.
If you are exporting or importing data using CSV or JSON files (e.g. to share with your co-workers), then you could use FileMasker. It can be run as an AWS Lambda function reading/writing CSV/JSON files on S3 (a generic sketch of that shape follows below).
It's still in development, but if you would like to try a beta now, then contact me.
Disclaimer: I work for DataVeil, the developer of FileMasker.
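Independent of any particular product, the general shape of such a Lambda is easy to sketch. The following is a generic illustration, not FileMasker's actual API; the output bucket name and the PII column list are assumptions.

```python
# Generic sketch of a CSV-masking AWS Lambda (not FileMasker's actual API).
# The output bucket name and the PII column list are assumptions.
import csv
import io

import boto3

s3 = boto3.client("s3")
PII_COLUMNS = {"name", "email", "address"}

def handler(event, context):
    # Read the triggering object from the standard S3 event payload.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    reader = csv.DictReader(io.StringIO(body))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for col in PII_COLUMNS & set(row):
            row[col] = "XXXX"  # crude redaction; see the Faker approach above
        writer.writerow(row)

    s3.put_object(Bucket="masked-output-bucket", Key=key, Body=out.getvalue())
```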

Is there a powerful database system for time series data? [closed]

In multiple projects we have to store, aggregate, and evaluate simple measurement values. A row typically consists of a time stamp, a value, and some attributes of the value. In some applications we would like to store 1000 values per second or more. These values must not only be inserted but also deleted at the same rate, since the lifetime of a value is restricted to about a year (in different aggregation steps; we do not store 1000/s for the whole year).
Until now, we have developed different solutions: one based on Firebird, one on Oracle, and one on a self-made storage mechanism. But none of these is a very satisfying solution.
Both RDBMS solutions cannot handle the desired data flow. Besides that, the applications that deliver the values (e.g. device drivers) cannot easily be attached to databases, and the insert statements are cumbersome. Finally, while an SQL interface to the data is strongly desired, typical evaluations are hard to formulate in SQL and slow to execute, e.g. finding the maximum value with its time stamp per 15 minutes for all measurements during the last month.
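For illustration, that example evaluation can be sketched as follows against a hypothetical table of epoch-second time stamps; this is a toy SQLite example meant only to show the query shape, not a solution for the 1000 inserts/s workload.

```python
# Sketch: maximum value per 15-minute bucket over the last month, against a
# hypothetical table measurements(ts, value) where ts is in epoch seconds.
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (ts INTEGER, value REAL)")

month_ago = int(time.time()) - 30 * 24 * 3600
rows = conn.execute(
    """
    SELECT (ts / 900) * 900 AS bucket_start, MAX(value) AS max_value
    FROM measurements
    WHERE ts >= ?
    GROUP BY ts / 900
    ORDER BY bucket_start
    """,
    (month_ago,),
).fetchall()
# Also returning the time stamp *of* each maximum needs a further self-join or
# window function, which is part of what makes such evaluations awkward in SQL.
```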
The self-made solution can handle the insertion rate and has a client-friendly API, but it has nothing like a query language and cannot be used by other applications via a standard interface, e.g. for reporting.
The best solution in my dreams would be a database system that:
has an API for very fast insertion
is able to remove/truncate the values in the same speed
provides a standard SQL interface with specific support for typical time series data
Do you know some database that comes near those requirements or would you approach the problem in a different way?
Most other answers mention SQL-based databases, but NoSQL databases are far better suited to this kind of workload.
Some open-source time-series databases:
https://prometheus.io - monitoring system and time series database
http://influxdb.com/ - time series database with no external dependencies (only the basic server is open source)
http://square.github.io/cube/ - written on top of MongoDB
http://opentsdb.net/ - written on top of Apache HBase
https://github.com/kairosdb/kairosdb - a rewrite of OpenTSDB that also enables using Cassandra instead of Hadoop
http://www.gocircuit.org/vena.html - a tutorial on writing a substitute for OpenTSDB using Go circuits
https://github.com/rackerlabs/blueflood - based on Cassandra
https://github.com/druid-io/druid - a column-oriented, Hadoop-based distributed data store
Cloud-based:
https://www.tempoiq.com
InfluxDB: an open-source distributed time series database with no external dependencies.
http://influxdb.org/
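As a quick illustration of the ingestion side, writing points to InfluxDB's 1.x HTTP write endpoint can be sketched as follows; the local server URL and the database and measurement names are assumptions.

```python
# Sketch: write a measurement point to InfluxDB's 1.x HTTP write endpoint using
# the line protocol. Server URL and database/measurement names are assumptions.
import time

import requests  # pip install requests

def write_point(value: float, sensor: str) -> None:
    ts_ns = time.time_ns()  # the line protocol defaults to nanosecond timestamps
    line = f"measurements,sensor={sensor} value={value} {ts_ns}"
    resp = requests.post(
        "http://localhost:8086/write",
        params={"db": "metrics"},
        data=line,
    )
    resp.raise_for_status()

write_point(23.5, "sensor-01")
```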
Consider IBM Informix Dynamic Server with the TimeSeries DataBlade.
That is, however, an extreme data rate that you are working with. (Not quite up to sub-atomic physics at CERN, but headed in that general direction.)
Fair disclosure: I work for IBM on the Informix DBMS, though not on the TimeSeries DataBlade per se.
SQL Server StreamInsight
Microsoft StreamInsight BOL
You can try HDF5 for time series data. It is extremely fast for such applications.
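For instance, an append-oriented store with the h5py bindings might be sketched like this; the file name, dataset names, and chunk size are illustrative choices.

```python
# Sketch: append-oriented time-series storage in HDF5 via h5py.
# File name, dataset names, and chunk size are illustrative choices.
import numpy as np
import h5py  # pip install h5py

with h5py.File("measurements.h5", "a") as f:
    if "ts" not in f:
        # Unlimited, chunked 1-D datasets so we can keep appending.
        f.create_dataset("ts", shape=(0,), maxshape=(None,), dtype="i8", chunks=(65536,))
        f.create_dataset("value", shape=(0,), maxshape=(None,), dtype="f8", chunks=(65536,))

    def append(ts: np.ndarray, values: np.ndarray) -> None:
        n, m = f["ts"].shape[0], ts.shape[0]
        f["ts"].resize((n + m,))
        f["value"].resize((n + m,))
        f["ts"][n:] = ts
        f["value"][n:] = values

    # Append one batch of 1000 samples.
    append(np.arange(1000, dtype="i8"), np.random.rand(1000))
```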
As Jonathan Leffler said, you should try the Informix TimeSeries feature. It is included in all editions of Informix at no additional charge. You can take a look at the TimeSeries functions it supports:
IBM Informix TimeSeries SQL routines
You can access the data through SQL functions or virtual view interfaces; you can even insert into the views.

What's the best tool for data integration [closed]

I'm looking for the best tool for data integration. I need the following features:
Customized loading/matching and cleansing of data from different sources (including MS SQL Server, PostgreSQL, web services, Excel, and text files in various formats); the receiver of the data is MS SQL Server 2008
Ability to configure data conversion rules externally (e.g. in config files or visual tools)
Support for Unicode, logging, multithreading, and fault tolerance
Scalability (very important)
Ability to process large volumes of data (more than 100 MB per day)
I have looked at SQL Server Integration Services 2008, but I'm not sure it fits these criteria. What do you think?
It looks like Integration Services (SSIS) should handle your requirements. It should definitely be high on your list, because it integrates well with SQL Server and is extremely cost-effective compared to most alternatives.
As far as scalability is concerned, your data volume sounds very small (100 MB per day is not much these days), so it's well within the capabilities of SSIS, even for complex data flows. For fault tolerance, SSIS has restartability features out of the box, but if high availability is important to you, then you may want to consider clustering/mirroring.
I only know SSIS from first-hand experience, so I can't say how it compares to other solutions, but I'd say it's a good fit for all the points you ask about.
The only requirement that is a bit tricky is the "ability to configure data conversion rules externally (e.g. in config files or visual tools)"; I'm not sure I understand it correctly.
You can store configuration parameters for SSIS in external files or even in a SQL table, but you would still need to specify the rules themselves inside the package, unless you write your own script component (inside of which you can, of course, interpret formatting rules you store externally).
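To make the externally stored rules idea concrete outside of SSIS, here is a minimal sketch of the pattern in Python: the rules live in a JSON file and are interpreted at run time. The file name and the rule vocabulary are invented for illustration.

```python
# Sketch of externally configured conversion rules: the rules live in a JSON
# file and are interpreted at run time. The file name and the rule vocabulary
# ("strip", "upper", "default:<value>") are invented for illustration.
import json

# rules.json might contain: {"name": ["strip", "upper"], "city": ["default:UNKNOWN"]}
with open("rules.json") as fh:
    RULES = json.load(fh)

def convert(row: dict) -> dict:
    out = dict(row)
    for column, rules in RULES.items():
        for rule in rules:
            if rule == "strip":
                out[column] = out[column].strip()
            elif rule == "upper":
                out[column] = out[column].upper()
            elif rule.startswith("default:") and not out[column]:
                out[column] = rule.split(":", 1)[1]
    return out

print(convert({"name": "  alice ", "city": ""}))  # {'name': 'ALICE', 'city': 'UNKNOWN'}
```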

Is there a database with git-like qualities? [closed]

I'm looking for a database where multiple users can contribute and commit new data; other users can then pull that data into their own database repository, all in a git-like manner. A transcriptional database, if you like; does such a thing exist?
My current thinking is to dump the database to a single SQL file, but that could well get unwieldy once it reaches any size. Another option is to dump the database and use the filesystem, but that too gets unwieldy at any size.
There's Irmin: https://github.com/mirage/irmin
Currently it is only offered as an OCaml API, but there are future plans for a GraphQL API and a Cap'n Proto one.
Despite the complex API and the still-scarce documentation, it allows you to plug in any backend (in-memory, Unix filesystem, Git in-memory, and Git on-disk), so it runs even in unikernels and browsers.
It also offers a bidirectional model where changes in the local Git repository are reflected in the application state and vice versa. With the complex API, you can operate at any Git level:
Append-only blob storage.
Transactional/compound tree layer.
Commit layer featuring the chain of changes and metadata.
Branch/ref/tag layer (local-only, but remotes are also offered) for mutability.
In the documentation, the immutable store usually refers to the blobs + trees + commits.
Thanks to the content-addressable storage inherited from Git, Irmin deduplicates data and thus reduces memory consumption. Some functionally persistent data structures fit perfectly in this database, and its 3-way merge is a novel approach to handling merge conflicts in a CRDT style.
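Content addressing is the mechanism behind that deduplication: values are stored under the hash of their contents, so identical values are stored only once. A toy illustration of the idea (not Irmin's actual API):

```python
# Toy illustration of content-addressable storage, the mechanism Irmin inherits
# from Git (not Irmin's actual API). Identical blobs hash to the same key, so
# they are stored only once.
import hashlib

store = {}  # maps content hash -> blob

def put(blob: bytes) -> str:
    key = hashlib.sha1(blob).hexdigest()
    store[key] = blob  # re-storing identical content is a no-op
    return key

k1 = put(b"hello world")
k2 = put(b"hello world")
assert k1 == k2 and len(store) == 1  # deduplicated
```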
Answer from: How can I put a database under version control?
I have been looking for the same feature for Postgres (or SQL databases in general) for a while, but I have found no tools that are suitable (simple and intuitive) enough. This is probably due to the binary nature of how data is stored. Klonio sounds ideal but looks dead. Noms DB looks interesting (and alive). Also take a look at Irmin (OCaml-based, with Git properties).
Though this doesn't answer the question in a way that works with Postgres, check out the Flur.ee database. It has a "time travel" feature that allows you to query the data as of an arbitrary point in time, and I'm guessing it should be able to support a "branching" model.
The database was developed recently for blockchain purposes. Due to the nature of blockchains, the data needs to be recorded in increments, which is exactly how Git works. They are targeting an open-source release in Q2 2019.
Because each Fluree database is a blockchain, it stores the entire history of every transaction performed. This is part of how a blockchain ensures that information is immutable and secure.
It's not SQL, but CouchDB supports replicating the database and pushing/pulling changes between users in a way similar to what you describe.
There is more information in the chapter on replication in the O'Reilly CouchDB book.
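For a flavour of how that works in practice, a one-off pull replication can be triggered through CouchDB's HTTP API; in the sketch below, the server URLs, database names, and credentials are assumptions.

```python
# Minimal sketch: trigger a one-off pull replication via CouchDB's _replicate
# endpoint. Server URLs, database names, and credentials are assumptions.
import requests  # pip install requests

resp = requests.post(
    "http://localhost:5984/_replicate",
    json={
        "source": "http://colleague.example.com:5984/projectdb",
        "target": "projectdb",  # pull a co-worker's changes into the local copy
        "create_target": True,
    },
    auth=("admin", "password"),
)
resp.raise_for_status()
print(resp.json())
```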

Where can I find sample databases with common formatted data that I can use in multiple database engines? [closed]

Does anybody know of any sample databases I could download, preferably in CSV or some similar easy-to-import format, so that I can get more practice working with different types of data sets?
I know that the Canadian Department of the Environment has historical weather data that you can download. However, it's not in a common format that I can import into another database, and you can only run queries through the included program, which is quite limited in what kind of data it can provide.
Does anybody know of any interesting data sets that are freely available in a common format that I could use with MySQL, SQL Server, and other types of database engines?
The Data Wrangling blog posted a nice list a while back:
http://www.datawrangling.com/some-datasets-available-on-the-web
It includes financial data, government data (labor, housing, etc.), and too many more to list here.
A lot of the data in Stack Overflow is licensed under Creative Commons. Every three months they release a data dump with all the questions, answers, comments, and votes.
For Microsoft SQL Server, there are the Northwind sample database and AdventureWorks.
For MySQL there are quite a few sample databases at http://dev.mysql.com/doc/index-other.html:
world (world countries and cities)
sakila (video rental)
employee
menagerie
I use generatedata.com to generate custom database schemas with entries.
To use it, you can simply register a new account, or download its sources and install it on your own server.
You can export the generated data as SQL, XML, JSON, or even in a server-side scripting language like PHP.
UNdata and Swivel are both good sources of data. Any database should be able to import CSV files.
What database engine are you importing into? That will help determine what formats you can include in your search.
The Federal Energy Regulatory Commission has some sample data for download in CSV format.
The Guardian newspaper in the UK has a Data Store, http://www.guardian.co.uk/data-store, full of categorized datasets. They are all ultimately stored as Google Documents, so you can export them to CSV and Excel.
There's a whole bunch of free SQL Server sample databases on CodePlex:
http://www.codeplex.com/Wikipage?ProjectName=SqlServerSamples#databases
One very simple way to get sample data is to use full applications. I needed some sample data to practice what I was learning with MySQL at the time, so I just downloaded phpBB and used the database it provides. If you need to add users etc., just use the program to do it.
Think generic. You can get weather data from common sources for free, thetvdb.com has a pretty nifty set of data on TV show episodes for free, and sites like last.fm have a tonne of data available on music listening habits. If you just want sample data, the easiest way to get it is not to think in terms of "I want a database" but rather "what freely available data is out there?"
For FileMaker, see Sample Database:
http://www.yzysoft.com/printouts/yzy_soft___Sample_Database.html
You can probably find the Northwind sample database for SQL Server.
It might be overkill, but you can install Oracle XE; I think it comes with some sample schemas, or you can find the old Scott schema online.
Also, in Stephen Bohlen's Summer of NHibernate screencast series, he creates a sample database; the code comes with it in XML files, and you can import it as he describes in the screencast (maybe episode 2 or 3) and just not delete it afterwards.
For Firebird you have employee.fdb; on Windows it is located at C:\Program Files\Firebird\Firebird_2_1\examples\empbuild
