Reading Parquet files in FlinkSQL without Hadoop? - apache-flink

Trying to read Parquet files in FlinkSQL.
Downloaded the jar file from here: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/table/formats/parquet/, made sure it's the same version as the Flink I have, and put it in flink/lib/.
Started the Flink cluster using ./flink/bin/start-cluster.sh and the SQL client using ./flink/bin/sql-client.sh.
Loaded the jar file: add jar '/home/ubuntu/flink/lib/flink-sql-parquet-1.16.0.jar';
Tried to create a table with the parquet format: create TABLE test2 (order_time TIMESTAMP(3), product STRING, feature INT, WATERMARK FOR order_time AS order_time) WITH ('connector'='filesystem','path'='/home/ubuntu/test.parquet','format'='parquet');
select count(*) from test2;
This fails with: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
Can somebody please help me read Parquet files in FlinkSQL?

As outlined in https://issues.apache.org/jira/browse/PARQUET-1126, Parquet still requires Hadoop. You will need to add the Hadoop dependencies to Flink as outlined in https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/configuration/advanced/#hadoop-dependencies.
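For example, assuming a Hadoop distribution is installed on the same machine (so the hadoop command is on the PATH), exporting its classpath before starting the cluster and the SQL client is usually enough:
export HADOOP_CLASSPATH=`hadoop classpath`
./flink/bin/start-cluster.sh
./flink/bin/sql-client.sh
The exact command depends on where Hadoop is installed; the linked page documents this HADOOP_CLASSPATH approach.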

Related

Azure Synapse Data Flows - parquet file names not working

I have created a data flow within Azure synapse to:
take data from a dedicated SQL pool
perform some transformations
send the resulting output to parquet files
I am then creating a View based on the resulting parquet file using OPENROWSET to allow PowerBI to use the data via the built-in serverless SQL pool.
My issue is that, whatever file name I enter on the integration record, the parquet files always end up named something like part-00000-2a6168ba-6442-46d2-99e4-1f92bdbd7d86-c000.snappy.parquet.
Is there a way to have a fixed filename which is updated each time the pipeline is run, or alternatively is there a way to automatically update the parquet file to which the View refers each time the pipeline is run?
I'm fairly new to this kind of integration, so if there is a better way to achieve this whole thing then please let me know.
I repro'd the same and got an auto-generated part file name as described.
To get a fixed name for the sink file, set the Sink settings as follows:
File name option: Output to single file
Output to single file: tgtfile (give the file name)
In Optimize, select Single partition.
The output file name is then written as per these settings.
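With a fixed output name like tgtfile.parquet, the serverless SQL view from the question can point at a stable path. A rough sketch (the storage account, container, and folder names here are placeholders):
CREATE VIEW dbo.ReportData AS
SELECT *
FROM OPENROWSET(
    BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/output/tgtfile.parquet',
    FORMAT = 'PARQUET'
) AS rows;
Assuming each pipeline run overwrites the same file, the view keeps pointing at the latest output without further changes.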

How to load data from UNIX to snowflake

I have created CSV files on the UNIX server that Informatica resides on. I want to load those CSV files directly from the UNIX box to Snowflake using SnowSQL; can someone help me with how to do that?
Log into SnowSQL:
https://docs.snowflake.com/en/user-guide/getting-started-tutorial-log-in.html
Create a Database, Table and Virtual Warehouse, if not done so already:
https://docs.snowflake.com/en/user-guide/getting-started-tutorial-create-objects.html
Stage the CSV files, using PUT:
https://docs.snowflake.com/en/user-guide/getting-started-tutorial-stage-data-files.html
Copy the files into the target table using COPY INTO:
https://docs.snowflake.com/en/user-guide/getting-started-tutorial-copy-into.html
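Putting steps 3 and 4 together, a minimal SnowSQL session might look roughly like this (the database, table, and path names are placeholders; adjust the file format options to match your CSVs):
-- upload the CSV files from the UNIX box to the table's stage
PUT file:///data/informatica/output/*.csv @%my_table;
-- load the staged files into the target table
COPY INTO my_table
  FROM @%my_table
  FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1);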

Where can I find my jar on Apache Flink server which I submitted using Apache Flink dashboard

I developed a Flink job and submitted it using the Apache Flink dashboard. Per my understanding, when I submit my job, my jar should be available on the Flink server. I tried to figure out the path of my jar but wasn't able to. Does Flink keep these jar files on the server? If yes, where can I find them? Any documentation? Please help. Thanks!
JAR files are renamed when they are uploaded and stored in a directory that can be configured with the web.upload.dir configuration key.
If web.upload.dir is not set, the JAR files are stored in a dynamically generated directory under jobmanager.web.tmpdir (which defaults to System.getProperty("java.io.tmpdir")).
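For example, to make the upload location predictable you can set it explicitly in conf/flink-conf.yaml (the path below is just an example of a writable directory):
# conf/flink-conf.yaml
web.upload.dir: /opt/flink/uploaded-jars
After restarting the cluster, JARs submitted through the dashboard end up (renamed) under that directory.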

How to export data to local system from snowflake cloud data warehouse?

I am using the Snowflake cloud data warehouse, which, like Teradata, hosts data. I am able to run queries and get results in the web UI itself. But I am unclear how one can export the results to a local PC so that we can report based on the data.
Thanks in advance
You have two options, both of which use sfsql (which is based on henplus). The first option is to export the result of your query to an S3 staging file as shown below:
CREATE STAGE my_stage URL='s3://loading/files/' CREDENTIALS=(AWS_KEY_ID='****' AWS_SECRET_KEY='****');
COPY INTO @my_stage/dump
FROM (select * from orderstiny limit 5) file_format=(format_name='csv' compression='gzip');
The other option is to capture the sql result into a file.
test.sql:
set-property column-delimiter ",";
set-property sql-result-showheader off;
set-property sql-result-showfooter off;
select current_date() from dual;
$ ./sfsql < test.sql > result.txt
For more details and help, log in to your Snowflake account and access the online documentation, or post your question to Snowflake support via the Snowflake support portal, which is accessible through the Snowflake help section (Help -> Support Portal).
Hope this helps.
You can use a COPY command to export a table (or query results) into a file on S3 (using "stage" locations), and then a GET command to save it onto your local filesystem. You can only do it from the "sfsql" Snowflake command line tool (not from the web UI).
Search the documentation for "unloading", you'll find more info there.
You can download the data from Snowflake directly to the local filesystem without staging to S3 or redirecting via a Unix pipe.
Use COPY INTO to unload the table data to the table stage:
https://docs.snowflake.net/manuals/sql-reference/sql/copy-into-location.html
snowsql$> copy into @%test_table/result/data_ from test_table
file_format = (TYPE ='[FILE_TYPE]' compression='[COMPRESSION_TYPE]');
Use the GET command to download data from the table stage to the local filesystem:
https://docs.snowflake.net/manuals/sql-reference/sql/get.html
snowsql$> get @%test_table/result/data_ file:///tmp/;

Is there any way to import the data from s3 to mssql

I have a Hadoop cluster running on Amazon EMR which processes some data and writes the output to S3. Now I want to import that data into MSSQL. Is there any open source connector for that? Or do I have to manually download the data, change the default separator '\001' to ',', and then import the data into MSSQL?
There is no direct way.
Use the config below in your MapReduce job to write the output with , as the delimiter:
job.getConfiguration().set("mapreduce.textoutputformat.separator", ",");
The best way is to keep the processed data in S3. You can write CSV to S3, then write a PHP/Java/shell script to download the data from S3 and load it into MSSQL.
You can download the processed data from S3 to a local directory and then use BULK INSERT to load the CSV file into MSSQL.
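Once the CSV files are on a machine that can reach the SQL Server instance, a BULK INSERT along these lines loads them (the table name and path are placeholders; this sketch assumes comma-delimited output as configured above):
BULK INSERT dbo.target_table
FROM '/data/s3_downloads/part-00000.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
);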
You can use Apache Sqoop for this use case.
Apache Sqoop supports importing from and exporting to mssql.
The following article explains how to install Sqoop in EMR
http://blog.kylemulka.com/2012/04/how-to-install-sqoop-on-amazon-elastic-map-reduce-emr/
Please refer to the Sqoop user guide:
http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html
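A Sqoop export invocation for this case could look roughly like the following (the JDBC URL, credentials, bucket, and table names are all placeholders, and the SQL Server JDBC driver must be available to Sqoop):
sqoop export \
  --connect "jdbc:sqlserver://your-sql-host:1433;databaseName=your_db" \
  --username your_user --password your_password \
  --table target_table \
  --export-dir s3://your-bucket/emr-output/ \
  --input-fields-terminated-by '\001'
The --input-fields-terminated-by option lets you keep the default '\001' separator from the question instead of rewriting the files with commas.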
