We are using Talend as an ETL tool to extract data from Hive database tables and load it into a different Hive database table. Can someone suggest the correct Talend components for this task?
Extract data from Hive Table A in Database D1 ---> Load data to Hive Table B in Database D2.
I used the tELTHive components, but there seem to be some restrictions with them. Also, is there a way to load data into Hive tables without first writing the extracted data to a file?
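If the transformation can be expressed in HiveQL, one option (a minimal sketch with placeholder database, table, and column names) is to have a query component such as tHiveRow execute a cross-database INSERT ... SELECT, which moves the data inside Hive without staging it in a file first:

-- Placeholder names throughout; adjust columns and filters to your tables.
INSERT INTO TABLE d2.table_b
SELECT col1, col2, col3          -- apply any column mapping here
FROM d1.table_a
WHERE load_date = '2024-01-01';  -- optional filter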
Can you suggest an approach to load data from one Snowflake (SF) database into another SF database within the same cluster?
I have to:
Perform data transformation and an incremental load while loading into the destination SF table
Schedule the load like an ETL job
Thanks,
Nikhil
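One approach worth considering (a minimal sketch with made-up database, schema, table, and column names, not a definitive design) is to keep the transformation in SQL inside Snowflake and handle the incremental load with a MERGE that reads across the two databases:

-- All object names below are placeholders.
MERGE INTO tgt_db.public.customers AS t
USING (
    SELECT id,
           UPPER(name) AS name,                                   -- example transformation
           updated_at
    FROM   src_db.public.customers
    WHERE  updated_at > DATEADD(day, -1, CURRENT_TIMESTAMP())     -- incremental window
) AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.name = s.name, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (id, name, updated_at) VALUES (s.id, s.name, s.updated_at);

For scheduling, the MERGE can be wrapped in a Snowflake TASK (CREATE TASK ... SCHEDULE = 'USING CRON ...') or triggered from an external orchestrator.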
I have an Excel file with a column of ID numbers (around 160k rows). I have to fetch some data related to these IDs from Snowflake. Different columns need data from various tables, and those tables have around 250k rows each. What is the best way to fetch the data from these tables into the Excel file?
I tried loading the Excel file into a pandas DataFrame and then using pandas' read_sql method to iterate through the IDs and fetch the corresponding data. This is extremely slow given the size of the data. Is there a practical way to approach this?
Excel File format (an example)
I understand you do not want to load the Excel file to Snowflake, but rather bring the Snowflake data into your Excel file.
If that is correct, here are some ways to get Snowflake data into an Excel file:
Use COPY INTO to export the data to a CSV file in one of your internal stages and then download it with GET: https://docs.snowflake.com/en/sql-reference/sql/copy-into-location.html
Connect your Excel file to Snowflake via "Data > From other sources" and the ODBC driver
Run the SELECT in Snowflake and simply download the result set with the download button
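For the first option, a rough sketch run from SnowSQL (stage path, file name, database, and table names are all placeholders) could look like this:

-- Unload the query result as a single, uncompressed CSV to the user stage.
COPY INTO @~/export/details.csv
FROM (SELECT id, some_column FROM my_db.public.details)
FILE_FORMAT = (TYPE = CSV COMPRESSION = NONE FIELD_OPTIONALLY_ENCLOSED_BY = '"')
HEADER = TRUE
SINGLE = TRUE
OVERWRITE = TRUE;

-- Download the file locally (SnowSQL), then open it in Excel.
GET @~/export/details.csv file:///tmp/;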
I frequently need to validate CSVs submitted by clients to make sure that the headers and values in the file meet our specifications. Typically I do this with the Import/Export Wizard, letting the wizard create the table based on the CSV (the file name becomes the table name, and the headers become the column names). Then we run a set of stored procedures that check the information_schema for said table(s) and match that up with our specs, etc.
Most of the time this involves loading multiple files at a time for a client, which becomes very time consuming and laborious when using the Import/Export Wizard. I tried using an xp_cmdshell SQL script to load everything from a path at once to get the same result, but xp_cmdshell is not supported in Azure SQL Database.
https://learn.microsoft.com/en-us/azure/azure-sql/load-from-csv-with-bcp
The page above says that you can load with bcp, but it also requires the table to exist before the import... I need the table structure to mimic the CSV. Any ideas here?
Thanks
If you want to load the data into your target Azure SQL DB, you can use Azure Data Factory (ADF) to upload your CSV files to Azure Blob Storage and then use the Copy Data activity to load the data from the CSV files into Azure SQL DB tables, without creating those tables upfront.
ADF supports auto-creation of sink tables; see the Copy activity sink settings in the ADF documentation.
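For the validation step described in the question, once the tables exist (auto-created by ADF or otherwise), the check against INFORMATION_SCHEMA can stay in T-SQL. A minimal sketch, where the spec table and all names are hypothetical:

-- Hypothetical spec table: dbo.FileSpec (TableName, ColumnName, ExpectedType, Ordinal)
SELECT s.TableName,
       s.ColumnName,
       s.ExpectedType,
       c.DATA_TYPE AS ActualType,
       CASE WHEN c.COLUMN_NAME IS NULL          THEN 'missing column'
            WHEN c.DATA_TYPE <> s.ExpectedType  THEN 'type mismatch'
            ELSE 'ok' END AS CheckResult
FROM   dbo.FileSpec s
LEFT JOIN INFORMATION_SCHEMA.COLUMNS c
       ON  c.TABLE_NAME  = s.TableName
       AND c.COLUMN_NAME = s.ColumnName
WHERE  s.TableName = 'Client_File_2024';  -- table auto-created from the CSV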
I am using Snowflake.
All my raw data is already in a Snowflake raw table. I need to filter/deduplicate the data and insert it into another bronze table. I am considering a dbt snapshot, an incremental model, or Snowflake streams.
My raw layer is an S3 bucket fed from an API; this data is copied into Snowflake using Snowpipe, but it is still the same duplicated raw data. I need to upsert it into a new bronze table.
What would you recommend for writing only the upserts to that final table?
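If Snowflake streams are acceptable, one pattern (a minimal sketch; every object and column name below is a placeholder) is to put a stream on the raw table and MERGE only its new rows, deduplicated, into the bronze table:

-- Track new rows landing in the raw table via Snowpipe.
CREATE STREAM IF NOT EXISTS raw_db.public.raw_events_stream
  ON TABLE raw_db.public.raw_events;

-- Upsert only the latest version of each key from the new rows.
MERGE INTO bronze_db.public.events AS b
USING (
    SELECT *
    FROM   raw_db.public.raw_events_stream
    QUALIFY ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY loaded_at DESC) = 1
) AS s
ON b.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET b.payload = s.payload, b.loaded_at = s.loaded_at
WHEN NOT MATCHED THEN INSERT (event_id, payload, loaded_at)
                      VALUES (s.event_id, s.payload, s.loaded_at);

A dbt incremental model with a unique_key can achieve a similar result if the pipeline is already dbt-based; the stream approach has the advantage that Snowflake tracks what has already been consumed.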
I want to do a one-time load from a source Oracle DB to a destination Oracle DB.
It can't be done with a direct load/unload or import/export of the data as-is, because the table structures and columns differ between source and destination, so it requires a fair amount of transformation.
My plan is to extract the data in XML format from the source DB and process the XML into the destination DB.
The data volume will be large (1 to 20+ million records or more in some tables), and the databases involved are Oracle (source) and Oracle (destination).
Please suggest some best practices or the best way to do this.
I'm not sure that I understand why you can't do a direct load.
If you create a database link on the destination database that points to the source database, you can put your ETL logic into SQL statements that SELECT from the source database and INSERT into the destination database. That avoids the need to write the data to a flat file, read that file back, parse the XML, etc., all of which would be slow and require a decent amount of coding. This way you can focus on the ETL logic and migrate the data as efficiently as possible.
You can write SQL (or PL/SQL) that loads directly from the old table structure on the old database to the new table structure on the new database.
INSERT INTO new_table( <<list of columns>> )
SELECT a.col1, a.col2, ... , b.colN, b.colN+1
FROM old_table_1@link_to_source a,
     old_table_2@link_to_source b
WHERE <<some join condition>>
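For the volumes mentioned (up to 20+ million rows per table), a direct-path insert over the link is also worth considering; the sketch below reuses the same placeholders, and the APPEND hint is a general Oracle tuning technique rather than something required by this approach:

INSERT /*+ APPEND */ INTO new_table( <<list of columns>> )
SELECT a.col1, a.col2, ... , b.colN
FROM old_table_1@link_to_source a
JOIN old_table_2@link_to_source b
  ON <<some join condition>>;
COMMIT;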