Is there any open source library (any programming language) that helps to load data from any data source (file, SQL db, NoSQL db, etc.) and store it into any other data repository? I've checked some ETL libraries like Talend or Octopus but they only deal with SQL databases.
Try https://flywaydb.org/. Since NoSQL data has a different nature than a relational structure, you will have to write your own converter. Consider this document:
{ "item_id" : 1, "tags" : ["a","b","c"] }
How should this be translated into an RDBMS? There is no single answer, which is why you need your own converter. You can use Flyway for relational-to-relational DB migrations.
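For instance, here is a minimal Python sketch of one possible flattening of that document into two relational tables; the items/item_tags schema is purely an assumption, not something Flyway generates for you.

```python
# One way to flatten the document: a parent row plus one child row per tag.
doc = {"item_id": 1, "tags": ["a", "b", "c"]}

statements = ["INSERT INTO items (item_id) VALUES (%d);" % doc["item_id"]]
statements += [
    "INSERT INTO item_tags (item_id, tag) VALUES (%d, '%s');" % (doc["item_id"], tag)
    for tag in doc["tags"]
]
print("\n".join(statements))
```

Other mappings (a JSON column, a delimited string, ...) are just as valid, which is why a generic tool cannot choose one for you.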
Have a look at Apache Camel and their ETL Example. Camel knows how to load and store from a large variety of sources and repositories, including files, SQL, and various NoSQL databases like Cassandra and MongoDB.
You could also check out 10 Open Source ETL Tools.
By the way, Talend is not limited to SQL databases, as shown in these blog posts:
Talend & MongoDB: An Introduction to Simple Relational Mapping into MongoDB
How to Offload Oracle and MySQL Databases into Hadoop using Apache Spark and Talend
I am trying to migrate a Cassandra cluster to AWS Keyspaces for Apache Cassandra.
After the migration is done, how can I verify that the data has been migrated successfully as-is?
Many solutions are possible. For instance, you could read all rows of a partition, compute a checksum/signature, and compare it with your original data; then iterate through all your partitions, and then do the same for all your tables. Checksums work.
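A hedged Python sketch of that checksum idea, using the DataStax cassandra-driver; the keyspace, table, and partition-key names are assumptions, and the Keyspaces session will additionally need TLS and credentials in practice.

```python
# Compare a per-partition checksum between the source cluster and Keyspaces.
import hashlib

from cassandra.cluster import Cluster

def partition_checksum(session, partition_key):
    """Hash every row of one partition; clustering order keeps it deterministic."""
    rows = session.execute(
        "SELECT * FROM ks.orders WHERE order_id = %s", (partition_key,)
    )
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(tuple(row)).encode("utf-8"))
    return digest.hexdigest()

source = Cluster(["cassandra-host"]).connect()
target = Cluster(["keyspaces-endpoint"]).connect()  # plus SSL/auth for Keyspaces

for pk in [1, 2, 3]:  # iterate over your real partition keys here
    assert partition_checksum(source, pk) == partition_checksum(target, pk), pk
```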
You could use AWS Glue to perform an 'except' function. Spark has a lot of useful functions for working with massive datasets, and Glue is serverless Spark. You can use the Spark Cassandra connector with Cassandra and Keyspaces to work with datasets in Glue. For example, you may want to see the data that is not in Keyspaces:
cassandraTableDataframe.except(keyspacesTableDataframe)
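A hedged PySpark sketch of that comparison; the hosts, the ks.orders table, and the per-read connector options are assumptions, and PySpark names the operation subtract because except is a reserved word in Python.

```python
# Diff the same table between a Cassandra cluster and Keyspaces using Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("keyspaces-diff").getOrCreate()

def load(host, keyspace, table):
    return (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(keyspace=keyspace, table=table)
        .option("spark.cassandra.connection.host", host)
        .load()
    )

cassandra_df = load("cassandra-host", "ks", "orders")
keyspaces_df = load("keyspaces-endpoint", "ks", "orders")

# Rows present in the source cluster but missing from Keyspaces.
missing = cassandra_df.subtract(keyspaces_df)
missing.show(truncate=False)
```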
You could also do this by exporting both datasets to S3 and performing these queries in Athena.
Here is a helpful repository of Glue and Keyspaces functions including export, count, and distinct.
Our organization uses Elasticsearch, Logstash & Kibana (ELK) and we use a SQL Server data warehouse for analysis and reporting. There are some data items from ELK that we want to copy into the data warehouse. I have found many websites describing how to load SQL Server data into ELK. However, we need to go in the other direction. How can I transfer data from ELK to SQL Server, preferably using SSIS?
I have implemented a similar solution in Python, where we ingest data from an Elastic cluster into our SQL DWH. You can use the Elasticsearch package for Python, which allows you to do that.
You can find more information here:
https://elasticsearch-py.readthedocs.io/en/master/
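A hedged sketch of that approach with elasticsearch-py and pyodbc; the index name, field list, and target table are hypothetical, and for very large indices you would insert in chunks instead of collecting everything first.

```python
# Pull documents out of Elasticsearch and bulk-insert them into SQL Server.
import pyodbc
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch(["http://elastic-host:9200"])
sql = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sql-host;DATABASE=dwh;Trusted_Connection=yes"
)
cursor = sql.cursor()
cursor.fast_executemany = True

# scan() pages through every matching document via the scroll API.
docs = scan(es, index="app-logs-*", query={"query": {"match_all": {}}})
rows = [
    (d["_id"], d["_source"].get("timestamp"), d["_source"].get("message"))
    for d in docs
]

cursor.executemany(
    "INSERT INTO dbo.log_events (doc_id, event_time, message) VALUES (?, ?, ?)",
    rows,
)
sql.commit()
```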
I'm searching for free (as in freedom) GUI tools that allow me to export data from one relational database into files (CSV, XML, ...) and to bulk import this data into another database. Both databases might be from different vendors.
I'm already aware of tools that migrate schemas, like Liquibase, and am not searching for that.
Extra plus points if such a tool
is written in Java and uses JDBC drivers
is an Eclipse plugin (because our other tools are also Eclipse based)
allows all kinds of filtering and modification of the data during import or export
can handle large (as in giga- or terabytes) data sets
can be scheduled
can continue an interrupted import/export
Similar questions:
Export large amounts of binary data from one SQL database and import it into another database of the same schema
It seems that the WbExport and WbImport commands of SQL Workbench/J are very good candidates. I also need to look into whether ETL tools like Pentaho do this stuff.
CloverETL meets nearly all your requirements. With the free version you can work with the following databases: MySQL, PostgreSQL, SQLite, MS SQL, Sybase, Oracle, and Derby.
I'm currently doing business intelligence research about connecting Microsoft SQL Server to a NoSQL database.
My target is to import data from a NoSQL table into a relational DWH based on SQL Server.
I found the following approaches:
Microsoft Hadoop Connector
Hadoop Cloudera
Building an individual script that creates an XML file and including it via Integration Services (not really satisfying)
Has somebody done something like this before, or does anyone know of some best practices? It doesn't matter which NoSQL system is used.
NoSQL, by "definition", does not have a standard structure. So, depending on what NoSQL backend you are trying to import from, you will need some custom code to translate that into whatever structured format your data warehouse expects.
Your code does not have to generate XML; it could directly use a database connection (e.g., JDBC, if you are using Java) to make SQL queries to insert the data.
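As a hedged illustration in Python (the answer mentions JDBC/Java; pymongo and pyodbc stand in here, and the collection, table, and column names are assumptions):

```python
# Flatten MongoDB documents straight into SQL Server rows, with no XML step.
import pyodbc
from pymongo import MongoClient

mongo = MongoClient("mongodb://mongo-host:27017")
sql = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sql-host;DATABASE=dwh;Trusted_Connection=yes"
)
cursor = sql.cursor()

for doc in mongo["shop"]["products"].find():
    # Pick out only the fields the warehouse schema expects; ignore the rest.
    cursor.execute(
        "INSERT INTO dbo.products (product_id, name, price) VALUES (?, ?, ?)",
        str(doc["_id"]), doc.get("name"), doc.get("price"),
    )
sql.commit()
```

The same logic could live in an Integration Services script component; either way, the real work is deciding how each document field maps to a column.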
I'm looking for a tool to export data from a PostgreSQL DB to an Oracle data warehouse. I'm really looking for a heterogeneous DB replication tool, rather than an export->convert->import solution.
Continuent Tungsten Replicator looks like it would do the job, but PostgreSQL support won't be ready for another couple months.
Are there any open-source tools out there that will do this? Or am I stuck with some kind of scheduled pg_dump/SQL*Loader solution?
You can create a database link from Oracle to Postgres (this is called heterogeneous connectivity). This makes it possible to select data from Postgres with a select statement in Oracle. You can use materialized views to schedule and store the results of those selects.
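A hedged sketch of what that can look like on the Oracle side, assuming a Database Gateway for ODBC pointing at Postgres is already configured; the link name, DSN, credentials, and table names are placeholders.

```sql
-- Link from Oracle to the Postgres database through an ODBC gateway (DSN "PG_DSN" assumed).
CREATE DATABASE LINK pg_link
  CONNECT TO "pg_user" IDENTIFIED BY "pg_password"
  USING 'PG_DSN';

-- Query Postgres directly from Oracle.
SELECT * FROM "orders"@pg_link;

-- Snapshot the result on a schedule (here: full refresh every hour).
CREATE MATERIALIZED VIEW orders_snapshot
  REFRESH COMPLETE START WITH SYSDATE NEXT SYSDATE + 1/24
  AS SELECT * FROM "orders"@pg_link;
```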
It sounds like SymmetricDS would work for your scenario. SymmetricDS is web-enabled, database independent, data synchronization/replication software. It uses web and database technologies to replicate tables between relational databases in near real time.
Sounds like you want an ETL (extract, transform, load) tool. There are a lot of open source options; Enhydra Octopus and Talend Open Studio are a couple I've come across.
In general, ETL tools offer you better flexibility than a straight replication option.
Some offer scheduling, data quality, and data lineage.
Consider using the Confluent Kafka Connect JDBC sink and source connectors if you'd like to replicate data changes across heterogeneous databases in real time.
The source connector can select the entire database, particular tables, or rows returned by a provided query, and send the data as Kafka messages to your Kafka broker. The source connector can calculate the diffs based on an incrementing ID column, a timestamp column, or be run in bulk mode where the entire contents are recopied periodically. The sink can read these messages, optionally check them against an Avro or JSON schema, and populate the target database with the results. It's all free, and several sink and source connectors exist for many relational and non-relational databases.
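A hedged sketch of a JDBC source connector configuration you could POST to the Kafka Connect REST API; the connection URL, table, and column names are assumptions, and property names may vary between connector versions.

```json
{
  "name": "postgres-orders-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://pg-host:5432/shop",
    "connection.user": "replicator",
    "connection.password": "********",
    "table.whitelist": "orders",
    "mode": "timestamp+incrementing",
    "timestamp.column.name": "updated_at",
    "incrementing.column.name": "order_id",
    "topic.prefix": "pg.",
    "poll.interval.ms": "5000"
  }
}
```

A matching io.confluent.connect.jdbc.JdbcSinkConnector configuration on the other database would then consume the pg.orders topic and write the rows.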
One major caveat: some JDBC Kafka connectors cannot capture hard deletes.
To get around that limitation, you can use a log-based change data capture connector such as Debezium (http://www.debezium.io); see also
Delete events from JDBC Kafka Connect Source.