I have data in HDFS in the following directory structure:
/exported/2014/07/01/00/SEARCHES/part-m-00000.bz2
part-m-00001.bz2
/exported/2014/07/01/02/SEARCHES/part-m-00000.bz2
part-m-00001.bz2
part-m-00003.bz2
...
/exported/2014/08/01/09/SEARCHES/part-m-00005.bz2
There are multiple part files in each subdirectory.
I want to load this dataset into an Impala table, so I used the following query to create the table:
CREATE EXTERNAL TABLE search(time_stamp TIMESTAMP, ..... url STRING, domain STRING) PARTITIONED BY (year INT, month INT, day INT, hour INT) row format delimited fields terminated by '\t';
Then
ALTER TABLE search ADD PARTITION (year=2014, month=08, day=01) LOCATION '/data/jobs/exported/2014/08/01/*/SEARCHES/';
But it failed to load with the following error:
ERROR: AnalysisException: Failed to load metadata for table: magneticbi.search_mmx
CAUSED BY: TableLoadingException: Failed to load metadata for table: search_mmx
CAUSED BY: RuntimeException: Compressed text files are not supported: part-m-00000.bz2
I'm not sure what the correct way to do this is. Can anyone help with this?
Thanks
Here's a link to a table from Cloudera that describes your options. To summarize:
Impala supports the following compression codecs:
Snappy. Recommended for its effective balance between compression ratio and decompression speed. Snappy compression is very fast, but GZIP provides greater space savings. Not supported for text files.
GZIP. Recommended when achieving the highest level of compression (and therefore greatest disk-space savings) is desired. Not supported for text files.
Deflate. Not supported for text files.
BZIP2. Not supported for text files.
LZO, for Text files only. Impala can query LZO-compressed Text tables, but currently cannot create them or insert data into them; perform these operations in Hive.
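If converting the data is an option, one workaround (a sketch only, assuming Hive is available alongside Impala; the table names and trimmed column list are illustrative, and only one hour's directory is shown) is to let Hive read the bzip2-compressed text and rewrite it into a format Impala can query, such as Parquet:
-- Run in Hive, not Impala: Hive can read bzip2-compressed delimited text.
CREATE EXTERNAL TABLE searches_text (time_stamp TIMESTAMP, url STRING, domain STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/exported/2014/07/01/00/SEARCHES/';

-- Target table in a format Impala supports.
CREATE TABLE searches_parquet (time_stamp TIMESTAMP, url STRING, domain STRING)
STORED AS PARQUET;

-- Rewrite the compressed text into Parquet.
INSERT OVERWRITE TABLE searches_parquet SELECT * FROM searches_text;
After the rewrite, run INVALIDATE METADATA in impala-shell so the new table becomes visible to Impala.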
I have a container with partitioned parquet files that I want to use with the copy into command. My directories look like the following:
ABC_PARTITIONED_ID=1 (directory)
1-snappy.parquet
2-snappy.parquet
3-snappy.parquet
4-snappy.parquet
ABC_PARTITIONED_ID=2 (directory)
1-snappy.parquet
2-snappy.parquet
3-snappy.parquet
ABC_PARTITIONED_ID=3 (directory)
1-snappy.parquet
2-snappy.parquet
....
Each partitioned directory can contain multiple parquet files. I do not have a hive partition column that matches the pattern of the directories (ID1, ID2 etc).
How do I properly use the pattern parameter in the copy into command to write to a SF table from my ADLS? I am using this https://www.snowflake.com/blog/how-to-load-terabytes-into-snowflake-speeds-feeds-and-techniques/ as an example.
I do not think you need to do anything with the pattern parameter.
You said you do not have a hive partition column that matches the pattern of the directories. If you do not have a column to use these partitions with, then they are probably not beneficial for querying the data; maybe they were generated to help with maintenance. If that is the case, ignore the partitions and read all files with the COPY command.
If you think having such a column would help, then the blog post you mentioned already shows how you can parse the filenames to generate the column value. Add the partition column to your table (you may even define it as the clustering key), then run the COPY command to read all files in all partitions/directories, parsing the value of the column from the file name.
For parsing the partition value, I would use the following, which seems easier:
copy into TARGET_TABLE from (
    select
        REGEXP_SUBSTR(
            METADATA$FILENAME,
            '.*ABC_PARTITIONED_ID=(.*)\/.*',
            1, 1, 'e', 1
        ) partitioned_column_value,
        $1:column_name,
        ...
    from @your_stage/data_folder/);
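For the "add the partition column to your table" step mentioned above, a minimal sketch (TARGET_TABLE and PARTITIONED_COLUMN_VALUE are illustrative names, not taken from your schema):
-- Add a column to hold the parsed directory value; optionally cluster on it.
alter table TARGET_TABLE add column PARTITIONED_COLUMN_VALUE varchar;
alter table TARGET_TABLE cluster by (PARTITIONED_COLUMN_VALUE);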
If the directory/partition name doesn't matter to you, then you can use some of the newer functions in Public Preview that support the Parquet format to create the table and ingest the data. As for how to construct the pattern: PATTERN = '.*\\.parquet' will read all subfolders.
//create file format, only required once
create file format my_parquet_format
  type = parquet;

//EXAMPLE CREATE AND COPY INTO FOR TABLE ABC
//create an empty table named ABC using this file format and location, inferring the schema
create or replace table ABC
  using template (
    select array_agg(object_construct(*))
    from table(
      infer_schema(
        location => '@mystage/ABC_PARTITIONED_ROOT',
        file_format => 'my_parquet_format'
      )
    ));

//copy the parquet files under /ABC_PARTITIONED_ROOT into table ABC
copy into ABC from @mystage/ABC_PARTITIONED_ROOT pattern = '.*\\.parquet' file_format = my_parquet_format match_by_column_name = case_insensitive;
This should be possible by creating a storage integration, granting access in Azure for Snowflake to access the storage location, and then creating an external stage.
Alternatively you can generate a shared access signature (SAS) token to grant Snowflake (limited) access to objects in your storage account. You can then access an external (Azure) stage that references the container using the SAS token.
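A rough sketch of the storage-integration route (the integration name, tenant ID, and URL are placeholders, not values from your account; the stage and file format names follow the example above):
-- Integration Snowflake uses to authenticate against the Azure container.
CREATE STORAGE INTEGRATION azure_parquet_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'AZURE'
  ENABLED = TRUE
  AZURE_TENANT_ID = '<your-tenant-id>'
  STORAGE_ALLOWED_LOCATIONS = ('azure://myaccount.blob.core.windows.net/mycontainer/');

-- External stage over the path that holds the ABC_PARTITIONED_ID=... folders.
CREATE STAGE mystage
  STORAGE_INTEGRATION = azure_parquet_int
  URL = 'azure://myaccount.blob.core.windows.net/mycontainer/ABC_PARTITIONED_ROOT/'
  FILE_FORMAT = my_parquet_format;
You still need to grant the Snowflake service principal access to the container in Azure before the stage can list files.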
Snowflake metadata provides
METADATA$FILENAME - Name of the staged data file the current row belongs to. Includes the path to the data file in the stage.
METADATA$FILE_ROW_NUMBER - Row number for each record
We could do something like this:
select $1:normal_column_1, ..., METADATA$FILENAME
FROM
'@stage_name/path/to/data/' (pattern => '.*.parquet')
limit 5;
For example, it would give something like:
METADATA$FILENAME
----------
path/to/data/year=2021/part-00020-6379b638-3f7e-461e-a77b-cfbcad6fc858.c000.snappy.parquet
We need to deduce the partition column from this path. We could use regexp_replace to get the partition value as a column like this:
select
    regexp_replace(METADATA$FILENAME, '.*\/year=(.*)\/.*', '\\1') as year,
    $1:normal_column_1
FROM
    '@stage_name/path/to/data/' (pattern => '.*.parquet')
limit 5;
In the above regexp, we match on the partition key (year).
The third parameter, '\\1', is the regex group number; in our case the first group match, which holds the partition value.
A more detailed answer and other approaches to this issue are available in this Stack Overflow answer.
I am trying to import data exported from BigQuery as AVRO and compressed with DEFLATE. Besides NONE, DEFLATE is the only encoding common to both.
I am exporting one of the publicly available datasets, bigquery-public-data:covid19_open_data.covid19_open_data, which has 13,343,598 rows. I am using the following command to export:
bq extract --destination_format=AVRO --compression=DEFLATE bigquery-public-data:covid19_open_data.covid19_open_data gs://staging/covid19_open_data/avro_deflate/covid19_open_data_2_*.avro
The command creates 17 files in GCP. When I query the data in the files with the command:
SELECT count(*) FROM @shared.data_warehouse_ext_stage/covid19_open_data/avro_deflate;
I only get a count of 6,845,021 rows. To troubleshoot the error in the pipe, I issue the command:
SELECT * from table(information_schema.copy_history(table_name=>'covid19_open_data', start_time=> dateadd(hours, -1, current_timestamp())));
The error reported by the pipeline is as follows:
Invalid data encountered during decompression for file: 'covid19_open_data_3_000000000006.avro',compression type used: 'DEFLATE', cause: 'data error'
The SQL for the File Format command is:
CREATE OR REPLACE FILE FORMAT monitoring_blocking.dv_avro_deflate_format TYPE = AVRO COMPRESSION = DEFLATE;
I know the problem is only related to the compression being DEFLATE. There are only two compression types for AVRO that are common to both BigQuery and Snowflake: NONE and DEFLATE. I also created two other pipes, one with an AVRO file format and compression NONE, and a second with CSV and GZIP. They both load data into the table. The two AVRO pipelines mirror each other except for the file format. Here is a snippet of the SQL for the pipe:
CREATE OR REPLACE PIPE covid19_open_data_avro
AUTO_INGEST = TRUE
INTEGRATION = 'GCS_PUBSUB_DATA_WAREHOUSE_NOTIFICATION_INT' AS
COPY INTO covid19_open_data(
location_key
,date
,place_id
,wikidata_id
...
)
FROM
(SELECT
$1:location_key
,$1:date AS date
,$1:place_id AS place_id
,$1:wikidata_id AS wikidata_id
...
FROM @shared.staging/covid19_open_data/avro_deflate)
FILE_FORMAT = monitoring_blocking.dv_avro_deflate_format;
The problem lay within Snowflake. When we changed the compression format in the FILE FORMAT definition to AUTO, it worked. We changed:
CREATE OR REPLACE FILE FORMAT my_schema.avro_compressed_format
TYPE = AVRO
COMPRESSION = DEFLATE;
to
CREATE OR REPLACE FILE FORMAT my_schema.avro_compressed_format
TYPE = AVRO
COMPRESSION = AUTO;
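If the pipe had already skipped the DEFLATE files while the old file format was in place, one way to re-attempt them (a sketch; it assumes the pipe from the question and that the files were staged within Snowpipe's 7-day load-history window) is to refresh the pipe:
-- Queue any staged files that have not been loaded yet.
ALTER PIPE covid19_open_data_avro REFRESH;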
I am looking to load data from S3 into Snowflake. My files are in Parquet format and were created by a Spark job. There are 199 parquet files in my folder in S3, each with about 5,500 records. Each parquet file is snappy-compressed and is about 485 KB.
I have successfully created a storage integration and staged my data. However, when I read my data, I get the following message:
Max LOB size (16777216) exceeded, actual size of parsed column is 19970365
I believe I have followed the General File Sizing Recommendations, but I have not been able to figure out a solution to this issue, or even find a clear description of this error message.
Here are the basics of my SQL query:
CREATE OR REPLACE TEMPORARY STAGE my_test_stage
FILE_FORMAT = (TYPE = PARQUET)
STORAGE_INTEGRATION = MY_STORAGE_INTEGRATION
URL = 's3://my-bucket/my-folder';
SELECT $1 FROM @my_test_stage(PATTERN => '.*\\.parquet')
I seem to be able to read each parquet file individually by changing the URL parameter in the CREATE STAGE query to the full path of the parquet file. I really don't want to have to iterate through each file to load them.
The VARIANT data type imposes a 16 MB (compressed) size limit on individual rows.
The result set is actually displayed as a virtual column, so the 16 MB limit applies there as well.
Docs Reference:
https://docs.snowflake.com/en/user-guide/data-load-considerations-prepare.html#semi-structured-data-size-limitations
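If only some fields are oversized, one thing worth trying (a sketch; col_a and col_b are hypothetical field names, and whether this helps depends on which value actually exceeds the limit) is to select and cast individual Parquet fields rather than returning the whole $1 variant:
-- Pull specific fields instead of the entire row variant.
SELECT $1:col_a::string, $1:col_b::number, METADATA$FILENAME
FROM @my_test_stage (PATTERN => '.*\\.parquet');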
There may be an issue with one or more records in your file. Try running the COPY command with the ON_ERROR option to determine whether all of the records have a similar problem or only a few.
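A sketch of that suggestion (the target table name is hypothetical, and MATCH_BY_COLUMN_NAME assumes your table columns are named like the Parquet fields):
-- Load what parses cleanly, skip problem records, and inspect the per-file results returned by COPY.
COPY INTO my_target_table
FROM @my_test_stage
PATTERN = '.*\\.parquet'
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
ON_ERROR = 'CONTINUE';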
I need to load data from Sybase (a production database) into HDFS. Using Sqoop takes a very long time and frequently hits the production database. So I am thinking of creating data files from a Sybase dump and then copying the data files to HDFS. Is there any open-source tool available to create the required data files (flat files) from a Sybase dump?
Thanks,
The iq_bcp command line utility is designed to do this on a per table basis. You just need to generate a list of tables, and you can iterate through the list.
iq_bcp [ [ database_name. ] owner. ] table_name { in | out } datafile
iq_bcp MyDB..MyTable out MyTable.csv -c -t#$#
-c specifies character (plaintext) output
-t lets you customize the column delimiter. You will want to use a character or series of characters that does not appear in your extract; e.g. if you have a text column that contains text with a comma, a CSV will be tricky to import without additional work.
Sybase IQ: iq_bcp
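To generate the list of tables, one option (a sketch; it assumes the SQL Anywhere-style catalog view SYS.SYSTABLE that Sybase IQ exposes, and reuses the delimiter from the example above) is to build the iq_bcp commands in SQL and run the resulting script:
-- Emit one iq_bcp export command per base table; you may also want to filter on creator to skip system tables.
SELECT 'iq_bcp MyDB..' || table_name || ' out ' || table_name || '.csv -c -t#$#'
FROM SYS.SYSTABLE
WHERE table_type = 'BASE';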
I have encountered some errors with the SDP where one of the potential fixes is to increase the sample size used during schema discovery to 'unlimited'.
For more information on these errors, see:
No matched schema for {"_id":"...","doc":{...}
The value type for json field XXXX was presented as YYYY but the discovered data type of the table's column was ZZZZ
XXXX does not exist in the discovered schema. Document has not been imported
Question:
How can I set the sample size? After I have set the sample size, do I need to trigger a rescan?
These are the steps you can follow to change the sample size. Beware that a larger sample size will increase the runtime of the algorithm, and there is no indication of progress in the dashboard other than the job remaining in the 'triggered' state for a while.
Verify the specific load has been stopped and the dashboard status shows it as stopped (with or without error)
Find a document https://<account>.cloudant.com/_warehouser/<source> where <source> matches the name of the Cloudant database you have issues with
Note: Check https://<account>.cloudant.com/_warehouser/_all_docs if the document id is not obvious
Substitute "sample_size": null (which scans a sample of 10,000 random documents) with "sample_size": -1 (to scan all documents in your database) or "sample_size": X (to scan X documents in your database where X is a positive integer)
Save the document and trigger a rescan in the dashboard. A new schema discovery run will execute using the defined sample size.