How to merge clickhouse and postgresql query? - apache-zeppelin

I am using Apache Zeppelin and can get results from PostgreSQL and ClickHouse separately, but I need to merge the results of both queries.

As I understand it, you have two JDBC interpreters: PostgreSQL and ClickHouse.
You can get each query's result as a string in the %python or %spark.pyspark interpreter, then convert the strings to Pandas DataFrames and merge them.
Alternatively, you can download the data from ClickHouse into Python, convert it to a Pandas DataFrame, and upload it to PostgreSQL with to_sql; for this approach you must have write permissions on the PostgreSQL database.
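For example, a rough sketch of the first approach; the connection strings, hosts, table and column names below are placeholders, not values from the question, and it assumes the clickhouse-driver and SQLAlchemy packages are available in the Python interpreter:

%python
# Pull each query's result into pandas inside the Python interpreter and merge there.
import pandas as pd
from sqlalchemy import create_engine
from clickhouse_driver import Client

# Placeholder connections: adjust hosts, credentials and database names.
pg_engine = create_engine("postgresql://user:password@pg-host:5432/mydb")
pg_df = pd.read_sql("SELECT id, name FROM customers", pg_engine)

ch_client = Client(host="ch-host")
ch_df = ch_client.query_dataframe("SELECT id, total FROM orders")

# Merge the two result sets on a shared key.
merged = pd.merge(pg_df, ch_df, on="id", how="inner")

# Optionally write the merged result back to PostgreSQL;
# this requires write permissions on the target database.
merged.to_sql("merged_orders", pg_engine, if_exists="replace", index=False)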

Related

How to store a database from PostgreSQL into a variable to use with PySpark inside a Jupyter Notebook?

First post here.
I am looking for a way to store a database in a variable that can then be modified with PySpark. I have seen options using Pandas' read_sql, but due to given restrictions the use of Pandas is not allowed and I can only work with PySpark. I have the following paragraphs:
%load_ext sql
%sql %engine.run
%sql select from db
I already have a connection to my PostgreSQL database using SQLAlchemy, but now I need to store that database in a variable (a DataFrame) that can be modified with PySpark, without using Pandas.
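For reference, one way to load a PostgreSQL table directly into a PySpark DataFrame without going through Pandas is Spark's built-in JDBC data source. A minimal sketch, with placeholder host, table and credentials, assuming the PostgreSQL JDBC driver is on the Spark classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pg-to-spark").getOrCreate()

# Load the table straight into a Spark DataFrame over JDBC.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://pg-host:5432/mydb")
      .option("dbtable", "public.my_table")
      .option("user", "user")
      .option("password", "password")
      .load())

# df is a regular Spark DataFrame that can be filtered, joined and
# aggregated with the PySpark API, with no pandas involved.
df.printSchema()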

SQLServer to Azure Databricks Conversion

I am working on SQL Server migration to Databricks.
I have a number of T-SQL procedures, each with a minimum of 100 lines of code.
I want to convert these procedures to Spark code.
For a POC (I worked on one T-SQL proc), all source files were imported and registered as GlobalTempViews, the T-SQL was converted to Spark SQL, and the final GlobalTempView was exported as a file.
Now, my question: is creating GlobalTempViews and converting the T-SQL proc to Spark SQL the best way, or is it better to load all files into DataFrames and rewrite the T-SQL proc as Spark DataFrame logic?
Kindly let me know which is the better way to convert T-SQL procs, Spark SQL or DataFrames, and the reason why.
You can use Databricks to query many SQL databases through JDBC drivers, so no extra work is required to convert the existing stored procedures to Spark code.
Check this Databricks official document for more details and the steps to establish a connection with SQL Server.
Migrating the files to DataFrames is another possible approach, but be aware that Spark DataFrames are immutable, so any UPDATE or DELETE actions will have to be changed to produce a new, modified DataFrame.
I suggest you go through Executing SQL Server Stored Procedures from Databricks (PySpark) if you intend to execute stored procedures from Databricks.
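As an illustration of the GlobalTempView route described in the question, here is a rough PySpark sketch; the JDBC URL, credentials, table names and the query are placeholders standing in for the real migration code, and the spark session is assumed to be the one Databricks provides:

# Load a source table from SQL Server over JDBC into a DataFrame.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://sql-host:1433;databaseName=SourceDb")
          .option("dbtable", "dbo.Orders")
          .option("user", "user")
          .option("password", "password")
          .load())

# Register it as a global temp view so Spark SQL can reference it.
orders.createOrReplaceGlobalTempView("orders")

# The body of the T-SQL procedure is re-expressed as Spark SQL over the view.
result = spark.sql("""
    SELECT CustomerId, SUM(Amount) AS TotalAmount
    FROM global_temp.orders
    GROUP BY CustomerId
""")

# The final result is exported as a file, e.g. Parquet.
result.write.mode("overwrite").parquet("/mnt/output/orders_summary")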

Is it possible to use Spark like a Database?

I'm new to Spark. From an input stream I got a DataFrame, but I don't understand whether a DataFrame is like a relational table. How can I save the input stream to my distributed file system?
Is a dataframe enough to do this?
Thanks
Spark is volatile storage, i.e. it keeps all the data in memory. While the data is in memory you can query it using the Spark APIs or SQL, but all of it has to be reloaded by each Spark job.
For persistence you can save your Spark DataFrames as Parquet files on persistent disk and query them with Spark or Hive.
No, you can't use Spark as a database. Spark is a distributed processing engine. You can use HDFS for storing DataFrames, and you can also use Hive, HBase, etc.
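A minimal sketch of the Parquet persistence pattern described above; the HDFS path, app name and column names are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-persistence").getOrCreate()

# Write a DataFrame to Parquet on the distributed file system.
df = spark.createDataFrame([("click", 3), ("view", 7)], ["event_type", "hits"])
df.write.mode("overwrite").parquet("hdfs:///data/events_parquet")

# A later job must reload the data from disk before it can query it.
events = spark.read.parquet("hdfs:///data/events_parquet")
events.createOrReplaceTempView("events")
spark.sql("SELECT event_type, SUM(hits) AS total FROM events GROUP BY event_type").show()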

How to parse SQL files in pandas?

I am in an odd situation where I cannot connect to the server using Python. I can however connect to the server in other ways using SQL Server Management Studio, so from that end I can execute any query. The problem is then parsing, in pandas, the data retrieved via SQL Server Management Studio. As far as I am aware, the data can be exported as csv, txt or rpt. Parsing any of these formats is a pain in the neck, and it's not always the same for all tables. My question is then: what is the fastest way to parse, in pandas, any of the file formats SQL Server Management Studio can output? Is there a standard format it can output which is parsed the same way in pandas for all tables? Has anyone faced this problem, or is there another workaround?
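For what it's worth, a CSV export from SQL Server Management Studio can usually be read with a single pandas call. A sketch, where the file name, encoding and NULL handling are assumptions to adjust per export:

import pandas as pd

df = pd.read_csv(
    "query_result.csv",    # placeholder file saved via "Save Results As" in SSMS
    encoding="utf-8-sig",  # SSMS exports often carry a UTF-8 BOM
    na_values=["NULL"],    # SQL NULLs are typically written as the literal text NULL
)
print(df.dtypes)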

Sqoop Export into Sql Server VS Bulk Insert into SQL server

I have a question regarding Apache Sqoop. I have imported data into HDFS using the Apache Sqoop import facility.
Next, I need to put the data back into another database (basically I am transferring data from one database vendor to another) using Hadoop (Sqoop).
To put data into SQL Server, there are two options:
1) Use the Sqoop export facility to connect to my RDBMS (SQL Server) and export the data directly.
2) Copy the HDFS data files (which are in CSV format) to my local machine using the copyToLocal command, then run BCP (or a Bulk Insert query) on those CSV files to load the data into the SQL Server database.
I would like to understand which is the correct approach, and which of the two is faster: Bulk Insert, or Sqoop export from HDFS into the RDBMS?
Are there any other ways, apart from the two mentioned above, that would transfer data faster from one database vendor to another?
I am using 6-7 mappers (around 20-25 million records to be transferred).
Please suggest, and kindly let me know if my question is unclear.
Thanks in advance.
If all you are doing is ETL from one vendor to another, then going through Sqoop/HDFS is a poor choice. Sqoop makes perfect sense if the data originates in HDFS or is meant to stay in HDFS. I would also consider Sqoop if the data set is large enough to warrant a big cluster for the transformation stage, but a mere 25 million records is not worth it.
With SQL Server it is imperative, on large imports, to achieve minimal logging, which requires bulk insert. Although 25 million rows is not so large as to make the bulk option imperative, AFAIK neither Sqoop nor Sqoop2 supports bulk insert for SQL Server yet.
I recommend SSIS instead. It is much more mature than Sqoop, it has a bulk insert task, and it has a rich transformation feature set. Your small import is well within the size SSIS can handle.