We would like to run an experiment to determine whether our target/curated product should be stored in CSV or Parquet format, using a series of queries (joins and aggregations). Other than just checking the execution time in Athena, are there other stats we can look at?
I found the Explain button, but I am not familiar with database explain plans, so I'm unsure what I should be looking for...
Any advice would be appreciated. Thank you.
You should use Parquet or ORC, and make sure it is compressed. It will be both faster and cheaper, no question about it.
Follow these examples and you'll see for yourself: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
Basically:
Amazon Athena charges based on the amount of data scanned. Compressing the data reduces the amount that has to be scanned, and using a columnar file format greatly reduces it further.
Columnar data formats are faster to query because the engine only reads the columns a query actually references, so large portions of the data can be skipped over and never read at all.
You can convert to Snappy-compressed Parquet format using a CREATE TABLE AS command -- see Examples of CTAS queries - Amazon Athena:
CREATE TABLE new_table
WITH (
format = 'Parquet',
write_compression = 'SNAPPY')
AS SELECT *
FROM old_table;
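Regarding the other stats you asked about: besides execution time, Athena reports the amount of data scanned for every query (which is also what you are billed on), both in the console and through the GetQueryExecution API. A minimal boto3 sketch for comparing the two layouts could look like this; the region, database, table names and output location are placeholders.
# Sketch: run the same query against the CSV and Parquet tables and compare
# Athena's per-query statistics (data scanned, engine and total execution time).
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # placeholder region

def run_and_get_stats(sql):
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "my_curated_db"},                # placeholder
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state
    while True:
        execution = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
        state = execution["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    stats = execution["Statistics"]
    return {
        "state": state,
        "data_scanned_bytes": stats.get("DataScannedInBytes"),
        "engine_execution_ms": stats.get("EngineExecutionTimeInMillis"),
        "total_execution_ms": stats.get("TotalExecutionTimeInMillis"),
    }

# Same join/aggregation against both layouts, then compare the numbers
print(run_and_get_stats("SELECT count(*) FROM curated_csv"))
print(run_and_get_stats("SELECT count(*) FROM curated_parquet"))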
I have a Spark DataFrame, which is actually a huge Parquet file read from a container in Azure, and I want to turn it into Delta Lake format. But every time I try to do that, it throws an error without any message attached.
I want to save it to Databricks itself or to the container (if possible).
I already tried
df.write.format("delta").save("f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/my_data)
and
df.write.format("delta").saveAsTable("f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/my_data)
and
CREATE DELTA TABLE lifetime_delta
USING parquet
OPTIONS (f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/")
I think I need to create a table somehow. I had heard that, since Parquet is native to Delta Lake, the data would already exist in a Delta Lake context, but for some reason that doesn't seem to be quite true.
None of it worked for me. Thank you in advance.
There are two main ways to convert Parquet files to a Delta Lake:
Read the Parquet files into a Spark DataFrame and write out the data as Delta files. Looks like this is what you're trying to do. Here's your code:
df.write.format("delta").save("f"abfss://{container}#{storage_account_name}.dfs.core.windows.net/my_data)
You may need to change it as follows in accordance with the Python f-string syntax:
df.write.format("delta").save(f"abfss://{container}#{storage_account_name}.dfs.core.windows.net/my_data")
You can also convert from Parquet to Delta Lake in place using the following code:
from delta.tables import DeltaTable

# Convert the Parquet files at the given path into a Delta table in place
deltaTable = DeltaTable.convertToDelta(spark, "parquet.`tmp/lake2`")
Here's an example notebook with code snippets to perform this operation that you may find useful.
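For reference, here is a minimal end-to-end sketch of the first approach (read the Parquet, write it back out as Delta, and register it as a table). It assumes a Spark session with Delta Lake available; the container, storage account, paths and table name are placeholders.
# Sketch of approach 1: Parquet in, Delta out, registered as an external table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks a session already exists

container = "my-container"                 # placeholder
storage_account_name = "mystorageaccount"  # placeholder
base = f"abfss://{container}@{storage_account_name}.dfs.core.windows.net"

# Read the existing Parquet data
df = spark.read.parquet(f"{base}/my_data")

# Write it out in Delta format and register it as an external table
(df.write
   .format("delta")
   .mode("overwrite")
   .option("path", f"{base}/my_data_delta")
   .saveAsTable("lifetime_delta"))

# From here it can be queried like any other table
spark.sql("SELECT count(*) FROM lifetime_delta").show()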
I need to upload files > 10 GB in size to Snowflake tables.
The current method I'm using is the Python Snowflake Connector:
# Create a stage
query1 = "create or replace stage demo_stage file_format = (TYPE=CSV);"
execute_query(conn, query1)

# Upload the file from the local machine to the stage
query2 = "put file://file.csv @demo_stage auto_compress=true"
execute_query(conn, query2)

# Copy from the stage into the target table
query3 = "copy into demo from @demo_stage/file.csv.gz " \
         "file_format = (TYPE=CSV) on_error=continue;"
execute_query(conn, query3)
However, this method takes a lot of time for my files.
Is there any way to optimize it, or an alternative method?
To improve upload performance, it is advisable to generate smaller CSV files.
The PUT command lets you specify the PARALLEL option:
Specifies the number of threads to use for uploading files. The upload process separates batches of data files by size:
Small files (< 64 MB compressed or uncompressed) are staged in parallel as individual files.
Larger files are automatically split into chunks, staged concurrently, and reassembled in the target stage. A single thread can upload multiple chunks.
Increasing the number of threads can improve performance when uploading large files.
Supported values: Any integer value from 1 (no parallelism) to 99 (use 99 threads for uploading files).
Default: 4
# Upload file from local machine to the stage
query2 = "put file://file.csv @demo_stage auto_compress=true parallel=X"
execute_query(conn, query2)
Following Snowflake guidelines, and similar to lukasz's suggestion, you should split your ~10 GB file into chunks of 250-300 MB each (this is best practice) using a third-party utility. You can use tools like OpenRefine for the splitting.
After that, you can go ahead and PUT each file into your internal stage (same as your code above), for example as in the sketch below.
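For illustration only - the chunk size, file names and header handling are assumptions, and execute_query/conn are the same helpers used in the question - a plain-Python split followed by a single wildcard PUT could look like this:
# Sketch: split a large CSV into fixed-size chunks of rows, then upload them
# all with one wildcard PUT so the connector stages the pieces concurrently.
# Paths, ROWS_PER_CHUNK and the stage name are placeholders.
import csv
import os

SRC = "file.csv"            # the big source file (placeholder)
OUT_DIR = "chunks"          # where the pieces go (placeholder)
ROWS_PER_CHUNK = 1_000_000  # tune so each chunk lands in the desired size range

os.makedirs(OUT_DIR, exist_ok=True)

with open(SRC, newline="") as src:
    reader = csv.reader(src)
    header = next(reader)

    writer, out, part, rows = None, None, 0, 0
    for row in reader:
        if writer is None or rows >= ROWS_PER_CHUNK:
            if out:
                out.close()
            part += 1
            out = open(os.path.join(OUT_DIR, f"file_part_{part:03d}.csv"), "w", newline="")
            writer = csv.writer(out)
            writer.writerow(header)  # header repeated per chunk; use skip_header = 1 in the COPY file format
            rows = 0
        writer.writerow(row)
        rows += 1
    if out:
        out.close()

# One PUT with a wildcard uploads every chunk; adjust parallel to taste.
query2 = f"put file://{os.path.abspath(OUT_DIR)}/file_part_*.csv @demo_stage auto_compress=true parallel=8"
execute_query(conn, query2)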
Ps: You should also consider using a multi-cluster warehouse for this loading activity.
As far as alternatives go, other routes you can explore to upload local files into Snowflake faster are:
Prebuilt 3rd-party modeling tools
Snowpipe, i.e. if you want to automate the ingestion into Snowflake.
I actually work with a team that's building a prebuilt tool for easy loading into Snowflake - Datameer; feel free to check it out here if you wish:
https://www.datameer.com/upload-csv-to-snowflake/
I am connected to an OpenEdge Database using JDBC and I want to query information like table size, max size, and row count. Any help would be highly appreciated.
The only way to query this sort of thing at runtime is to read the table and calculate what you need. It's not going to be that pleasant.
The best way to get this sort of information, though, is to use a DB analysis (dbanalys).
From a proenv session on the database server:
proutil [dbname] -C dbanalys > mydb.dbana
The output of this contains all the info that you need.
You should be careful of running this during busy times as it will have a performance impact.
Documentation on the command is available here: https://documentation.progress.com/output/ua/OpenEdge_latest/index.html#page/dmadm%2Fproutil-dbanalys-qualifier.html%23wwID0EFCKY
This also includes details on the -csoutput option that produces the files in a nicely segregated text file format which may make parsing the info you want easier.
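If you want to trigger this from code rather than by hand, a thin wrapper like the sketch below is enough; the database name and output file are placeholders, it assumes proutil is on the PATH of the machine it runs on (normally the database server), and the same performance caveat applies.
# Sketch: run a dbanalys and keep the report for later parsing.
import subprocess

def run_dbanalys(dbname: str, out_path: str) -> None:
    # Equivalent of: proutil <dbname> -C dbanalys > <out_path>
    result = subprocess.run(
        ["proutil", dbname, "-C", "dbanalys"],
        capture_output=True,
        text=True,
        check=True,
    )
    with open(out_path, "w") as f:
        f.write(result.stdout)

run_dbanalys("mydb", "mydb.dbana")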
Say I have around 10-20GB of data in HDFS as a Hive table. This has been obtained after several Map-Reduce jobs and JOIN over two separate datasets. I need to make this Queryable to the user. What options do I have?
Use Sqoop to transfer data from HDFS to an RDS database like PostgreSQL. But I want to avoid spending so much time on data transfer. I just tested HDFS -> RDS in the same AWS region using Sqoop, and 800 MB of data takes 4-8 minutes. So you can imagine ~60 GB of data would be pretty unmanageable. This would be my last resort.
Query Hive directly from my webserver on each user request. I haven't ever heard of Hive being used like this, so I'm skeptical about it. This struck me because I just found out you can query Hive tables remotely after some port forwarding on the EMR cluster. But being new to big(ish) data I'm not quite sure about the risks associated with this. Is it commonplace to do this?
Some other solution - How do people usually do this kind of thing? Seems like a pretty common task.
Just for completeness sake, my data looks like this:
id time cat1 cat2 cat3 metrics[200]
A123 1234212133 12 ABC 24 4,55,231,34,556,123....(~200)
.
.
.
(time is epoch)
And my Queries look like this:
select cat1, corr(metrics[2],metrics[3]),corr(metrics[2],metrics[4]),corr(metrics[2],metrics[5]),corr(metrics[2],metrics[6]) from tablename group by cat1;
I need the correlation function, which is why I've chosen postgresql over MySQL.
Hive has a correlation function:
corr(col1, col2)
Returns the Pearson coefficient of correlation of a pair of numeric columns in the group.
You can simply connect to a HiveServer port via ODBC and execute queries.
Here is an example:
http://www.cloudera.com/content/cloudera/en/downloads/connectors/hive/odbc/hive-odbc-v2-5-10.html
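If your webserver is Python-based, the same idea also works over the HiveServer2 Thrift interface instead of ODBC. Here is a minimal PyHive sketch (host, port, username and table name are placeholders) running the correlation query from the question:
# Sketch: query HiveServer2 directly from Python via PyHive (pip install "pyhive[hive]").
from pyhive import hive

conn = hive.Connection(host="emr-master-node", port=10000, username="hadoop")
cursor = conn.cursor()

cursor.execute(
    "select cat1, corr(metrics[2], metrics[3]), corr(metrics[2], metrics[4]) "
    "from tablename group by cat1"
)
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()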
Hue (Hadoop User Experience) has a Beeswax query editor designed specifically to expose Hive to end users who are comfortable with SQL. This way they can run ad-hoc queries against the data residing in Hive without needing to move it elsewhere. You can see an example of the Beeswax query editor here: http://demo.gethue.com/beeswax/#query
Will that work for you?
What I understand from the question posted above is that you have some data (~20 GB) stored in HDFS and accessed through Hive, and you now want to query that data to run statistical functions like correlation.
Hive has built-in functions that compute correlation.
Otherwise, you can connect R directly to Hive using RHive, or even Excel to Hive via a data source.
Another solution is to install Hue, which comes with a Hive query editor where you can query Hive directly.
My application plays video files after users have registered (the files are larger than 100 MB).
Is it better to store them on the hard drive and keep the file path in the database?
Or
should I store them in the database as a FILESTREAM type?
When data is stored in the database, is it more secure against manipulation than when it is stored on the hard drive?
How can I provide data security against manipulation?
Thanks.
There's a really good paper by Microsoft Research called To Blob or Not To Blob.
Their conclusion after a large number of performance tests and analysis is this:
if your pictures or documents are typically below 256 KB in size, storing them in a database VARBINARY column is more efficient
if your pictures or documents are typically over 1 MB in size, storing them in the filesystem is more efficient (and with SQL Server 2008's FILESTREAM attribute, they're still under transactional control and part of the database)
in between those two, it's a bit of a toss-up depending on your use
If you decide to put your pictures into a SQL Server table, I would strongly recommend using a separate table for storing those pictures - do not store the employee photo in the employee table - keep them in a separate table. That way, the Employee table can stay lean and mean and very efficient, assuming you don't always need to select the employee photo as part of your queries.
For filegroups, check out Files and Filegroup Architecture for an intro. Basically, you would either create your database with a separate filegroup for large data structures right from the beginning, or add an additional filegroup later. Let's call it LARGE_DATA.
Now, whenever you have a new table to create which needs to store VARCHAR(MAX) or VARBINARY(MAX) columns, you can specify this file group for the large data:
CREATE TABLE dbo.YourTable
(....... define the fields here ......)
ON Data -- the basic "Data" filegroup for the regular data
TEXTIMAGE_ON LARGE_DATA -- the filegroup for large chunks of data
Check out the MSDN intro on filegroups, and play around with it!
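If you take the "add an additional filegroup later" route, the statements look roughly like the sketch below; it is wrapped in a small pyodbc script purely for illustration, and the database name, physical file path and connection string are placeholders.
# Sketch: add a LARGE_DATA filegroup to an existing database and give it a data file.
# ALTER DATABASE cannot run inside a transaction, hence autocommit=True.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=master;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

cur.execute("ALTER DATABASE MyAppDb ADD FILEGROUP LARGE_DATA;")
cur.execute(
    "ALTER DATABASE MyAppDb ADD FILE "
    "(NAME = N'large_data_1', FILENAME = N'D:\\SQLData\\MyAppDb_large_1.ndf') "
    "TO FILEGROUP LARGE_DATA;"
)

conn.close()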
1 - depends on how you define "better". In general, I prefer to store binary assets in the database so they are backed up alongside the associated data, but cache them on the file system. Streaming the binary data out of SQL Server for a page request is a real performance hog, and it doesn't really scale.
2 - if an attacker can get to your hard drive, your entire system is compromised - storing things in the database will offer no significant additional security.
3 - that's a whole question in its own right. Too wide for Stack Overflow...