I'm new to Spark. From an input stream I got a DataFrame, but I don't understand whether a DataFrame is like a relational table. How can I save the input stream into my distributed file system?
Is a DataFrame enough to do this?
Thanks
Spark is volatile storage, i.e. it keeps all the data in memory. As long as the data is in memory you can query it using the Spark APIs or SQL. All the data needs to be reloaded with the next Spark job.
For persistence you can also save your Spark DataFrames as Parquet files on persistent storage and query them with Spark or Hive.
No. You can't use Spark as a database. Spark is a distributed processing engine. You can use HDFS for storing DataFrames, and you can also use Hive, HBase, etc.
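Putting the two answers together, here is a hedged sketch of writing an incoming stream to HDFS as Parquet with Structured Streaming; the socket source and the HDFS paths are only placeholders for your actual stream and cluster:

# Minimal sketch (assumptions: the socket source and HDFS paths are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-to-parquet").getOrCreate()

# Hypothetical streaming source; replace with your real input (Kafka, files, socket, ...).
stream_df = (spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load())

# Write the stream as Parquet files on the distributed file system.
# Streaming sinks require a checkpoint location.
query = (stream_df.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/events")                        # hypothetical output directory
    .option("checkpointLocation", "hdfs:///checkpoints/events")   # hypothetical checkpoint path
    .start())

query.awaitTermination()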
What is the most efficient way to load data into a Snowflake database?
Using an external table, or loading files directly from S3? If files, is Parquet or Avro the suggested format?
Of course it depends, but this Snowflake post summarizes it well, I think:
Conclusion
Loading data into Snowflake is fast and flexible. You get the greatest speed when working with CSV files, but Snowflake's expressiveness in handling semi-structured data allows even complex partitioning schemes for existing ORC and Parquet data sets to be easily ingested into fully structured Snowflake tables.
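As a rough illustration (not from the post), a hedged sketch of loading Parquet files staged on S3 with the Snowflake Python connector; the account, stage, and table names are all hypothetical:

# Hedged sketch: assumes snowflake-connector-python is installed and that
# MY_STAGE is an existing external stage pointing at the S3 location with the Parquet files.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",    # hypothetical
    user="my_user",          # hypothetical
    password="...",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)

copy_sql = """
    COPY INTO my_table                       -- hypothetical target table
    FROM @my_stage/events/                   -- hypothetical external stage over S3
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
"""

cur = conn.cursor()
cur.execute(copy_sql)
cur.close()
conn.close()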
I am using Apache Zeppelin; I can get results from both Postgres and ClickHouse separately, but I need to merge both queries.
As I understand it, you have two JDBC interpreters: PostgreSQL and ClickHouse.
You can fetch the query results in a %python or %spark.pyspark paragraph, convert them to Pandas DataFrames, and merge them there.
Alternatively, you can download the data from ClickHouse into Python, convert it to a Pandas DataFrame, and upload it to PostgreSQL with to_sql; for this you need write permissions in PostgreSQL.
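A hedged sketch of that merge step in a %python paragraph, assuming SQLAlchemy for PostgreSQL and clickhouse-driver for ClickHouse; the connection details, table names, columns, and join key are placeholders:

# Hedged sketch: connection strings, queries and the join column are hypothetical.
import pandas as pd
from sqlalchemy import create_engine
from clickhouse_driver import Client

# PostgreSQL side: read the query result into a Pandas DataFrame.
pg_engine = create_engine("postgresql://user:pass@pg-host:5432/mydb")  # hypothetical
pg_df = pd.read_sql_query("SELECT id, name FROM customers", pg_engine)

# ClickHouse side: fetch rows and build a DataFrame with explicit column names.
ch = Client(host="ch-host")  # hypothetical
rows = ch.execute("SELECT id, amount FROM orders")
ch_df = pd.DataFrame(rows, columns=["id", "amount"])

# Merge the two result sets in Pandas.
merged = pg_df.merge(ch_df, on="id", how="inner")

# Optionally write the merged result back to PostgreSQL (requires write rights).
merged.to_sql("merged_orders", pg_engine, if_exists="replace", index=False)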
In Spark we can use inferSchema to read the schema dynamically from a file, e.g.:
df = sqlContext.read.format('com.databricks.spark.csv').options(delimiter='|', header='true', inferSchema='true').load('cars.csv')
Is there a way to do the same in Flink?
Flink has no built-in support for automatic schema inference from CSV files.
You could implement such functionality on top by analyzing the first rows of a CSV file and generating a corresponding CsvTableSource.
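A rough sketch of that idea in Python: sample the first rows, guess a type per column, and use the result to configure a CsvTableSource. The Flink-side wiring is only hinted at in a comment because the exact API depends on your Flink version, and the file path is hypothetical.

# Hedged sketch: infer column names and types from the first rows of a CSV file.
# The resulting names/types would then be fed to a CsvTableSource (as suggested above);
# that Flink call is omitted here since its exact signature varies by version.
import csv

def infer_csv_schema(path, sample_size=100, delimiter=","):
    def guess(value):
        for caster, type_name in ((int, "INT"), (float, "DOUBLE")):
            try:
                caster(value)
                return type_name
            except ValueError:
                pass
        return "STRING"

    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader)
        types = [None] * len(header)
        for i, row in enumerate(reader):
            if i >= sample_size:
                break
            for col, value in enumerate(row):
                guessed = guess(value)
                if types[col] is None or types[col] == guessed:
                    types[col] = guessed
                elif {types[col], guessed} == {"INT", "DOUBLE"}:
                    # Widen INT to DOUBLE when both appear in the same column.
                    types[col] = "DOUBLE"
                else:
                    # Fall back to STRING on any other conflict.
                    types[col] = "STRING"
    return header, types

names, types = infer_csv_schema("cars.csv", delimiter="|")  # hypothetical file
print(names, types)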
I need to extract 2 tables from a SQL Server database to Apache Parquet files (I don't use Hadoop, only Parquet files). The options I know of are:
Load the data into a Pandas DataFrame and save it to a Parquet file. But this method doesn't stream the data from SQL Server to Parquet, and I only have 6 GB of RAM.
Use TurboODBC to query SQL Server, convert the data to Apache Arrow on the fly, and then convert to Parquet. Same problem as above: TurboODBC doesn't stream currently.
Does a tool or library exist that can easily and "quickly" extract the 1 TB of data from tables in SQL Server to parquet files?
The missing functionality you are looking for is retrieval of the result in batches with Apache Arrow in turbodbc, instead of the whole table at once: https://github.com/blue-yonder/turbodbc/issues/133. In the meantime, you can either help with the implementation of this feature or use fetchnumpybatches to retrieve the result in a chunked fashion.
In general, I would recommend not exporting the data as one big Parquet file but as many smaller ones; this will make working with them much easier. Nearly all engines/tools that can consume Parquet are able to handle multiple files as one big dataset. You can then also split your query into multiple ones that write out the Parquet files in parallel. If you limit the export to chunks that are smaller than your total main memory, you should also be able to use fetchallarrow to write each chunk to Parquet at once.
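A hedged sketch of that chunked approach with fetchallarrow, splitting the export by a range predicate on an id column; the DSN, table name, id column, and chunk boundaries are assumptions:

# Hedged sketch: export a large SQL Server table to many Parquet files in chunks.
# The ODBC data source, table name, id column and range are placeholders.
import pyarrow.parquet as pq
from turbodbc import connect

connection = connect(dsn="mssql")  # hypothetical ODBC data source
cursor = connection.cursor()

chunk_size = 1_000_000
for chunk_no, lower in enumerate(range(0, 10_000_000, chunk_size)):
    cursor.execute(
        "SELECT * FROM big_table WHERE id >= ? AND id < ?",
        [lower, lower + chunk_size],
    )
    table = cursor.fetchallarrow()  # pyarrow.Table holding only this chunk
    pq.write_table(table, f"big_table_{chunk_no:04d}.parquet")

cursor.close()
connection.close()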
I think the odbc2parquet command line utility might be what you are looking for.
Utilizes ODBC bulk queries to retrieve data from SQL Server fast (like turbodbc).
It only keeps one batch in memory at a time, so you can write Parquet files which are larger than your system memory.
Allows you to split the result into multiple files if desired.
Full disclosure, I am the author, so I might be biased towards the tool.
Architectural/perf question here.
I have an on-premises SQL Server database which has ~200 tables, ~10 TB in total.
I need to make this data available in Azure in Parquet format for Data Science analysis via HDInsight Spark.
What is the optimal way to copy/convert this data to Azure (Blob storage or Data Lake) in Parquet format?
Due to the manageability aspect of the task (~200 tables), my best shot was: extract the data locally to a file share via sqlcmd, compress it as csv.bz2, and use Data Factory to copy the file share (with 'PreserveHierarchy') to Azure. Finally, run pyspark to load the data and save it as .parquet (sketched below).
Given the table schema, I can auto-generate the SQL data extract and Python scripts from the SQL database via T-SQL.
Are there faster and/or more manageable ways to accomplish this?
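For reference, a minimal sketch of the final pyspark load-and-save step described above; the storage paths and CSV options are placeholders, and Spark reads .bz2-compressed CSV transparently:

# Hedged sketch: paths, delimiter and other CSV options are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-bz2-to-parquet").getOrCreate()

df = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("delimiter", "|")
    .csv("wasbs://container@account.blob.core.windows.net/extract/my_table/*.csv.bz2"))  # hypothetical input

(df.write
    .mode("overwrite")
    .parquet("wasbs://container@account.blob.core.windows.net/parquet/my_table"))  # hypothetical output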
ADF matches your requirement perfectly, with both one-time and schedule-based data movement.
Try the Copy Wizard of ADF. With it, you can move on-prem SQL Server data directly to Blob storage/ADLS in Parquet format with just a couple of clicks.
Copy Activity Overview