How to create empty files of desired size in HDFS?

I am new to Hadoop and HDFS. I believe my question is somewhat related to this post. Essentially, I am trying to create empty files of 10 GB in size in HDFS. The truncate command fails, as specifying a size larger than the existing file size seems to be forbidden. Under such circumstances, what are the alternatives? For example, on Linux systems one can use the "truncate" command to set an arbitrary file size.

You can use TestDFSIO to create a file of the required size directly in HDFS.
The TestDFSIO program is packaged in the jar file 'hadoop-mapreduce-client-jobclient-tests.jar'. This jar ships with the Hadoop installation; locate it and pass its path in the command below.
hadoop jar <PATH_OF_JAR_hadoop-mapreduce-client-jobclient-tests.jar> TestDFSIO -write -nrFiles 1 -fileSize 10GB
Here "-nrFiles" is the number of files and "-fileSize" is the size of each file to be generated.
The file will be generated under the path /benchmarks/TestDFSIO/ in HDFS.
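If you do not want to run a MapReduce job just to get a placeholder file, another option is to create a zero-filled file locally and upload it. A minimal sketch, assuming a hypothetical target path /user/foo/10gb_file (truncate creates the local file as a sparse file, so it is instant, but the upload still writes 10 GB of zeros into HDFS):
# create a 10 GB zero-filled (sparse) file on the local filesystem
truncate -s 10G /tmp/10gb_file
# copy it into HDFS and clean up the local copy
hdfs dfs -put /tmp/10gb_file /user/foo/10gb_file
rm /tmp/10gb_file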

Related

When I load files into a Snowflake stage I see a difference in the number of bytes loaded into the stage compared to the files on my local system

When I load files into a Snowflake stage, I see a difference in the number of bytes loaded compared to the files on my local system. Does anyone know the reason for this, and how can it be resolved?
The file size on my local system is 16622146 bytes; after loading into the stage it shows as 16622160 bytes. I have checked with both .csv and .txt file types. (I know the .txt type is not supported in Snowflake.)
I compressed the file and loaded it into the Snowflake stage with snowsql using the PUT command.
When loading small files from the local file system, Snowflake automatically compresses the file. Please try that option.
Refer to this section of the documentation:
https://docs.snowflake.com/en/user-guide/data-load-prepare.html#data-file-compression
This will help you avoid data corruption issues during compression.
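For what it's worth, PUT compresses files with gzip by default (AUTO_COMPRESS=TRUE), so the size reported on the stage is generally that of the compressed copy rather than the original file. A minimal sketch of checking this from the shell, assuming a hypothetical stage @my_stage and local file path:
# upload the file; by default Snowflake gzips it on the way in
snowsql -q "PUT file:///tmp/data.csv @my_stage AUTO_COMPRESS=TRUE"
# list the stage to see the size of the stored (compressed) copy
snowsql -q "LIST @my_stage"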

How to read an excel file created by Microsoft Excel by CH376 IC?

In my embedded system I am using a CH376 IC (PDF link) for file handling. I am able to detect a flash disk, but I am not able to read an Excel file created by Microsoft Excel. The Excel file is created on the PC and copied to the flash disk.
I want to create a database in an Excel file on the PC and, after creating it, upload it to my embedded system; for this I need to read the created file.
Please help me to read the file.
The .xls and .xlsx file formats are both extremely complex. Parsing them is unlikely to be feasible in an embedded environment. (In particular, .xlsx is a PKZIP archive containing XML data -- you will need a minimum of 32 KB of SRAM just to decompress the file containing the cell data, and even more to parse it.)
Use a different file format. Consider using .csv, for instance -- it's just a text file, with one row of data on each line, so it's pretty straightforward to work with.
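To make that concrete, a spreadsheet saved from Excel as CSV is just plain text like the following (hypothetical data); a small microcontroller-side parser only has to split on commas and line endings:
id,name,value
1,sensor_a,230
2,sensor_b,415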

Hadoop - split manually files in HDFS

I have uploaded a file of size 1 GB and I want to split this file into files of 100 MB each. How can I do that from the command line?
I'm searching for a command like:
hadoop fs -split --bytes=100m /user/foo/one_gb_file.csv /user/foo/100_mb_file_1-11.csv
Is there a way to do that in HDFS?
In HDFS we cannot expect every feature that is available on Unix. The current version of the hadoop fs utility does not provide this functionality; maybe it will in the future. You can raise an improvement request in the Apache JIRA to have this feature included in HDFS.
For now, you would have to write your own implementation in Java.
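As a command-line workaround, you can stream the file out of HDFS, split it locally, and put the pieces back. A minimal sketch using the paths from the question and GNU split (note that -b 100m cuts at byte boundaries, not CSV line boundaries; split -C 100m keeps lines intact):
# stream the file out of HDFS and cut it into 100 MB pieces locally
hdfs dfs -cat /user/foo/one_gb_file.csv | split -b 100m - /tmp/100_mb_file_
# upload the pieces and remove the local copies
hdfs dfs -put /tmp/100_mb_file_* /user/foo/
rm /tmp/100_mb_file_*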

Analyze VMDK (vmware virtual machine disk) files for changes

Is there a good way to analyze VMware delta VMDK files between snapshots to list changed blocks, so one can use a tool to tell which NTFS files are changed?
I do not know a tool that does this out of the box, but it should not be too difficult to build.
The VMDK file format specification is available and the format is not that complex. As far as I remember, a VMDK file consists of many 64 KB blocks. At the beginning of the VMDK file there is a directory that records where each logical block is stored in the physical file.
It should be pretty easy to detect whether a logical block is present in both files and then compare the data in the two versions of the VMDK file.
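As a starting point for listing which regions a delta disk actually contains, qemu-img can read VMDK files and dump their allocation map; this is only a sketch, under the assumption that qemu-img can open the particular snapshot/delta disk (the file name is hypothetical):
# print the allocated extents of the delta disk as JSON (offset, length, data present or not)
qemu-img map --output=json vm-000001-delta.vmdk
The offsets reported there would then have to be matched against the NTFS cluster/MFT layout inside the guest to work out which files the changed blocks belong to.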

Clearing a file after reading with SSIS

Is there any built-in way to read a file with SSIS and, after reading it, clear the file of all content?
Use a File System Task in the Control Flow to either delete or move the file. If you want an empty file, then you can recreate the file with another File System Task after you have deleted it.
My team generally relies on moving files to archive folders after we process a file. The archive folder is compressed, whereas the working folder is uncompressed. We set up a process with our data center IT to archive the files in those folders to tape on a regular schedule. This gives us full freedom to retrieve any raw files we have processed while getting them off the SAN without requiring department resources.
What we do is create a template file (that just has headers) and then copy it to a file of the name we want to use for processing.
