What's the best tool for data integration [closed] - sql-server

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
I'm looking for the best tool for data integration. I need the following features:
Customized loading, matching and cleansing of data from different sources (including MS SQL Server, PostgreSQL, web services, Excel, and text files in various formats); the target for the data is MS SQL Server 2008.
Ability to configure data conversion rules externally (e.g. config files or visual tools)
Support for Unicode, logging, multithreading and fault tolerance
Scalability (very important)
Ability to process large volumes of data (more than 100 MB per day)
I have looked at SQL Server Integration Services 2008, but I'm not sure it meets these criteria. What do you think?

It looks like Integration Services (SSIS) should handle your requirements. It should definitely be high on your list because it integrates well with SQL Server and is extremely cost-effective compared to most alternatives.
As far as scalability is concerned, your data volume sounds very small (100 MB per day is not much these days), so it's well within the capabilities of SSIS, even for complex data flows. For fault tolerance, SSIS has restartability features out of the box, but if high availability is important to you then you may want to consider clustering or mirroring.

I only know SSIS from first-hand experience, so I can't say how it compares to other solutions, but I'd say it is a good fit for all the points you list.
The only requirement that is a bit tricky is "Ability to configure data conversion rules externally (e.g. config files or visual tools)"; I'm not sure I understand it correctly.
You can store configuration parameters for SSIS in external files or even in a SQL table, but you would still need to define the rules themselves inside the package, unless you write your own script component (inside which you can, of course, interpret formatting rules that you store externally).
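As an illustration only, here is a minimal sketch of what such an externally stored rule table might look like; the table, columns and expressions are made up, and SSIS has no built-in engine for this, so a script component of your own would have to read and apply the rows.

-- Hypothetical rule table: a script component could read these rows at
-- runtime and apply the stored expressions, so conversion rules can change
-- without redeploying the package.
CREATE TABLE dbo.ConversionRule (
    RuleId         int IDENTITY(1,1) PRIMARY KEY,
    SourceSystem   nvarchar(50)  NOT NULL,  -- e.g. 'PostgresFeed', 'ExcelFeed'
    SourceColumn   nvarchar(128) NOT NULL,
    TargetColumn   nvarchar(128) NOT NULL,
    ConversionExpr nvarchar(400) NOT NULL   -- e.g. 'CONVERT(date, value, 104)'
);

-- The package loads only the rules for the feed it is processing.
SELECT SourceColumn, TargetColumn, ConversionExpr
FROM dbo.ConversionRule
WHERE SourceSystem = N'ExcelFeed';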

Related

Data masking for data in AWS RDS [closed]

Closed 4 years ago.
I have an AWS RDS instance (Aurora DB) and I want to mask the data in the DB. Does Amazon provide any service for data masking?
I have seen RDS encryption, but I am looking for data masking because the database contains sensitive data. So I want to know whether there is any service they provide for data masking, or whether there is any other tool which can be used to mask the data and load it manually into the DB.
A list of tools that can be used for data masking would be much appreciated, if any exist for my case. I need to mask the data for testing because the original DB contains sensitive information such as PII (Personally Identifiable Information). I also have to transfer this data to my co-workers, so I consider data masking an important factor.
Thanks.
This is a fantastic question, and I think your proactive approach to securing the most valuable asset of your business is something a lot of people should heed, especially if you're sharing the data with your co-workers. Letting people see only what they need to see is an undeniably good way to reduce your attack surface. Standard cyber security methods are no longer enough, in my opinion, as demonstrated by numerous attacks and by people losing laptops or USB drives with sensitive data on them. We are just humans, after all. With the GDPR coming into force in May next year, any company with customers in the EU will have to demonstrate privacy by design, and anonymisation techniques such as masking have been cited as a way to show this.
NOTE: I have a vested interest in this answer because I am working on the kind of service you're talking about.
We've found that the right masking method depends on your exact use case, the size of the data set and its contents. If your data set has few fields and you know where the PII is, you can run standard queries to replace sensitive values, e.g. John -> XXXX. If you want to maintain some human readability, there are libraries such as Python's Faker that generate random, locale-based PII you can replace your sensitive values with (PHP, Perl and Ruby Faker libraries also exist).
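For the simple query-based case, a masking pass can be as basic as the following sketch; the table and column names are hypothetical, and this is plain redaction run against a copy of the database, not formal anonymisation.

-- Hypothetical customers table: overwrite PII in the copy you intend to
-- share, keeping a derived value where uniqueness still matters.
UPDATE customers
SET first_name   = 'XXXX',
    last_name    = 'XXXX',
    email        = CONCAT('user', customer_id, '@example.com'),
    phone_number = NULL;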
DISCLAIMER: Straightforward masking doesn't guarantee total privacy. Think of individuals being identified in a masked Netflix data set by cross-referencing it with time-stamped IMDb data, or Guardian reporters identifying a judge's porn preferences from masked ISP data.
Masking does get tedious as your data set grows in fields and tables, and you may want to set up different levels of access for different co-workers, e.g. data science gets lightly anonymised data while marketing gets heavily anonymised data. PII in free-text fields is annoying, and generally understanding what data is available in the world that attackers could use for cross-referencing is a big task.
The service I'm working on aims to alleviate all of these issues by automating the process with NLP techniques and a good understanding of the mathematics of anonymisation. We're bundling this up into a web service and are keen to launch on the AWS Marketplace. I would love to hear more about your use case; we're in private beta at the moment, so let me know if you want early access.
If you are exporting or importing data using CSV or JSON files (e.g. to share with your co-workers) then you could use FileMasker. It can be run as an AWS Lambda function reading and writing CSV/JSON files on S3.
It's still in development but if you would like to try a beta now then contact me.
Disclaimer: I work for DataVeil, the developer of FileMasker.

How can I set up a database that is able to import large amounts of data automatically on a daily basis? [closed]

Closed 8 years ago.
I am new to databases and programming and am now supposed to create a database that can store large amounts of data. The critical problem for me is that I need to update the database every day, adding 150 sets of data to 60 different tables. The datasets all come in different formats, though (.csv, .row, .sta...).
I would like to be able to create a GUI that can automatically import the datasets every day and update the tables. So far I have found a lot of info on exporting data from databases, but not much on importing data.
Does anyone have a pointer?
You haven't given a reasonable definition of "best"; what are your criteria? Is cost a major concern (I'm guessing not if you're considering Matlab, but maybe)? Is the UI for you only, or are you turning end users loose on it? If end users, how smart and reliable are they? When you say manual import, do you mean a mostly automatic process that you occasionally initiate by hand, or will it have to ask a lot of questions and handle many different combinations?
I import lots of data daily from different sources. I sometimes have to manually re-launch a process because a user has made a change and needs to see it reflected immediately, but my set of defined sources doesn't change often. I've had good luck using the SSIS (SQL Server Integration Services) tool in Microsoft's SQL Server, and it can certainly handle large amounts of data.
The basic idea is that you write a "package" that contains definitions of what your source is and how it's configured (e.g. if you are importing from a text file, tell it the name and path, whether it is fixed-width or delimited, what the delimiter or width of each field is, which fields to skip, how many rows to skip, etc.), and where to put the data (database name and table, field mappings, etc.). I then set the schedule (mine all run overnight) in SQL Agent, and it is all automatic from there unless something changes, in which case I edit the package to account for the changes.
I can also manually start any package at any time with very little effort.
And the range of import sources is pretty impressive. I pull in data from CSV files, Lotus Notes, and DB2, and it's all automatic every night. It's also a fairly graphical "builder", which is frustrating for a hardcore coder, but if you're new to programming it's probably easier than a more code- or script-oriented approach.
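To give a feel for what one such nightly flat-file load boils down to, here is a rough plain T-SQL equivalent of a single import step; the file path, table name and format options are invented, and in practice the SSIS flat-file source captures the same information through its designer.

-- Hypothetical nightly CSV load into a staging table.
BULK INSERT dbo.StagingMeasurements
FROM 'D:\imports\measurements.csv'
WITH (
    FIRSTROW = 2,            -- skip the header row
    FIELDTERMINATOR = ',',   -- comma-delimited file
    ROWTERMINATOR = '\n'
);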
Good luck, and welcome to the dark side. We have cookies.

Does Microsoft have a similar product like Google BigQuery? [closed]

Closed 7 years ago.
I want to see whether Microsoft provides a service similar to Google BigQuery.
I want to run some queries on a database of about 15 GB, and I want the service to be in the cloud.
P.S.: Yes, I have already Googled but did not find anything similar.
The answer to your question is no: Microsoft does not (yet) offer a real-time big data query service where you pay per query. That does not mean you won't find a solution to your problem in Azure.
Depending on your needs, you have two options on Azure:
SQL Data Warehouse: a new Azure-based columnar database service, currently in preview (http://azure.microsoft.com/fr-fr/documentation/services/sql-data-warehouse/), which according to Microsoft can scale up to petabytes. Assuming your data is structured (relational) and you need sub-second response times, it should do the job you expect.
HDInsight: a managed Hadoop service (https://azure.microsoft.com/en-us/documentation/articles/hdinsight-component-versioning/) which deals better with semi-structured data but is more oriented towards batch processing. It includes Hive, which is also SQL-like, but you won't get instant query response times. You could go for this option if you expect to do calculations in batch mode and store the aggregated result set somewhere else, roughly as in the sketch below.
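As a rough illustration (table and column names are invented), the kind of batch aggregation you would run in Hive looks like ordinary SQL:

-- HiveQL sketch: aggregate raw events in a nightly batch job and store the
-- result so it can be served quickly from somewhere else.
CREATE TABLE daily_totals AS
SELECT event_date,
       COUNT(*)    AS event_count,
       SUM(amount) AS total_amount
FROM raw_events
GROUP BY event_date;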
The main difference between these products and BigQuery is the pricing model: in BigQuery you pay per query, while with the Microsoft options you pay for the resources you allocate, which can be very expensive if your data is really big.
If the expected usage is occasional, I think BigQuery will be much cheaper; the Microsoft options will be better for intensive use. Of course, you will need to do a detailed price comparison to be sure.
To get an idea of what BigQuery really is, and how it compares to a relational database (or Hadoop for that matter), take a look at this doc:
https://cloud.google.com/files/BigQueryTechnicalWP.pdf
Take a look at this:
http://azure.microsoft.com/en-in/solutions/big-data/.
Reveal new insights and drive better decision making with Azure HDInsight, a Big Data solution powered by Apache Hadoop. Surface those insights from all types of data to business users through Microsoft Excel.

Is there a database with git-like qualities? [closed]

Closed 3 years ago.
I'm looking for a database where multiple users can contribute and commit new data; other users can then pull that data into their own database repository, all in a git-like manner. A transcriptional database, if you like; does such a thing exist?
My current thinking is to dump the database to a single SQL file, but that could well get unwieldy once it is of any size. Another option is to dump the database and use the filesystem, but again that gets unwieldy at any size.
There's Irmin: https://github.com/mirage/irmin
Currently it is only offered as an OCaml API, but there are plans for a GraphQL API and a Cap'n Proto one.
Despite the complex API and the still-scarce documentation, it lets you plug in any backend (in-memory, Unix filesystem, Git in-memory and Git on-disk). As a result, it runs even in unikernels and browsers.
It also offers a bidirectional model where changes in the local Git repository are reflected in the application state and vice versa. With the API, you can operate at any Git level:
Append-only blob storage.
Transactional/compound tree layer.
Commit layer featuring the chain of changes and metadata.
Branch/ref/tag layer (local only, but remotes are offered too) for mutability.
In the documentation, the immutable store usually refers to the blobs + trees + commits.
Thanks to the content-addressable storage inherited from Git, Irmin allows deduplication and thus reduced memory consumption. Some functional persistent data structures fit perfectly in this database, and its 3-way merge is a novel, CRDT-style approach to handling merge conflicts.
Answer from: How can I put a database under version control?
I have been looking for the same feature for Postgres (or SQL databases in general) for a while, but I found no tools to be suitable (simple and intuitive) enough. This is probably due to the binary nature of how data is stored. Klonio sounds ideal but looks dead. Noms DB looks interesting (and alive). Also take a look at Irmin (OCaml-based with Git-properties).
Though this doesn't answer the question as far as working with Postgres is concerned, check out the Flur.ee database. It has a "time-travel" feature that allows you to query the data as of an arbitrary point in time. I'm guessing it should be able to support a "branching" model.
This database was recently being developed for blockchain-purposes. Due to the nature of blockchains, the data needs to be recorded in increments, which is exactly how git works. They are targeting an open-source release in Q2 2019.
Because each Fluree database is a blockchain, it stores the entire history of every transaction performed. This is part of how a blockchain ensures that information is immutable and secure.
It's not SQL, but CouchDB supports replicating the database and pushing/pulling changes between users in a way similar to what you describe.
Some more information in the chapter on replication in the O'Reilly CouchDB book.

Resources for SQL Server code review and best practices [closed]

Closed 5 years ago.
Are there any good resources out there for T-SQL coding standards?
Check out this excellent resource:
SSW Rules to Better SQL Server Databases
This is also good, although some of the advice may have changed since the article was written in 2001:
SQL Server TSQL Coding Conventions, Best Practices, and Programming Guidelines
I was a developer on an ASP.NET application, and my manager required me to submit my SQL statements to the DBA for review. What I did was consolidate all the SQL used in the application into one module file (a VB.NET module with read-only strings).
Just to name a few of the mandates, offhand:
All SQL statements must use parameterised queries. This is good practice: SQL injection is not possible when parameters (a.k.a. bind variables in Oracle) are used, and some have reported a significant performance increase from using bind variables. That is especially true for Oracle; I'm not sure about MS SQL. (See the sp_executesql sketch after this list.)
E.g. use "SELECT username FROM user WHERE userid = @userid" with a bound parameter instead of
Dim sql As String = "SELECT username FROM user WHERE userid = {0}"
sql = String.Format(sql, userid)
"SELECT *" should not be used. Columns must be explicitly named.
JOINs should be used instead of nested queries whenever possible.
Reduce the use of views, as this can impact performance. (This is controversial.) My manager went to the extreme of forbidding the use of views; we were developing something where performance and scalability mattered more than readability of the code.
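To make the first mandate concrete, here is what a parameterised call looks like on the SQL Server side using sp_executesql; the table and parameter names are illustrative, and from ADO.NET you would achieve the same thing with SqlCommand and its Parameters collection.

-- Parameterised query: the user id is passed as a typed parameter and is
-- never concatenated into the SQL text. The table is bracketed because
-- USER is a reserved word in T-SQL.
DECLARE @sql nvarchar(200) =
    N'SELECT username FROM dbo.[user] WHERE userid = @userid';

EXEC sp_executesql
    @sql,
    N'@userid int',
    @userid = 42;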
For SQL coding standards, your best bet is to search for what others have written. There are several resources containing standards that various people have published. You are unlikely to find one that will completely fit your organization. Plus, some have standards that IMHO are just plain wrong. Your best bet is to read through the documents you find and extract the concepts and rules that make sense and fit your organization. Some standards may be overkill, like how to indent the code. It depends on how strict you want the standards to be. Here are a few examples:
http://www.nyx.net/~bwunder/dbChangeControl/standard.htm
http://www.SQLAuthority.com
http://www.SQLserverPortal.com
You'll have to look around at links two and three, as I don't have the exact URLs handy. Also check out the link posted by Mitch Wheat above. These are just some examples; you'll find more by searching.
I have either contributed to or implemented coding practices for SQL Server in several organizations. You can spend days researching what others have done, and you can probably reuse pieces of it, but I find each environment to be completely unique.
At a high level, I would suggest separating function from form as much as possible. What do I mean? There are some best practices that can be tested and documented for your specific environment and application, such as when to use temp tables rather than large queries, NOLOCK hints, dynamic SQL usage, query hints, and configuration. These can vary a lot depending on hardware and use. Then there are other standards that are more opinion-based: naming conventions, use of schemas, procs, views, functions, version control, etc. The latter group can get pretty political - really political. It is also a good idea to start small and implement a little at a time.
With outside vendors, I have found it impractical to exert influence until there is a performance impact (e.g. explicit query hints that cause huge table scans). Then it is most effective to provide data and get them to patch it. If there is some sort of service contract, I don't see how you can enforce practices. Note that they may be writing for multiple versions and/or platforms and want the code to be as flexible as possible.
I recommend downloading and installing the AdventureWorks sample database from CodePlex:
http://www.codeplex.com/MSFTDBProdSamples
It was created by Microsoft staff and has a very good design which can serve as an example of best practices.
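For example, once it is installed you can explore its schema-per-subject-area design with ordinary queries; the exact table names differ slightly between AdventureWorks versions, so treat this as a sketch.

-- AdventureWorks groups tables into schemas such as Person, Sales and
-- HumanResources rather than putting everything in dbo.
SELECT TOP (10) p.FirstName, p.LastName, e.JobTitle
FROM Person.Person AS p
JOIN HumanResources.Employee AS e
    ON e.BusinessEntityID = p.BusinessEntityID
ORDER BY p.LastName;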
I also recommend reading this book:
Professional Microsoft SQL Server 2008 Administration
