I am doing some analysis for a chunk of a desktop app we're working on.
One requirement is that it be able to do I/O of some legacy file formats, which in practice run as large as 800 MB each. An import might reasonably be expected to be on the order of 5 GB in size.
Ideally, I'd just stick whatever files I want into a jar file, sign the thing, and send it off for re-import at some later time.
But our app must support XP Pro (FAT32), which has a maximum file size of around 4 GB, from what I can tell.
Must I break my data up into multiple chunks? (And therefore take on the complexity of keeping track of what's going on?)
There's no other way of storing 5 GB of data on FAT32 than splitting it into chunks.
Write a routine that will deal with archives of more than 4GB, i.e. split and merge. Encapsulate it in some util class or util file and call it from your save/load method.
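Since the question talks about packing the data into a jar in a Java desktop app, here is a minimal sketch of what such a util class could look like. Everything here is illustrative: the class name, the 2 GiB part size and the ".partN" naming scheme are assumptions, and a real version would need error handling and some record of how many parts were written.

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;

// Hypothetical helper: splits a large file into numbered parts that each fit
// comfortably under the FAT32 limit, and merges them back in order.
public class ChunkedFileUtil {

    // Stay well below the FAT32 maximum of 4 GiB minus 1 byte.
    private static final long MAX_PART_SIZE = 2L * 1024 * 1024 * 1024; // 2 GiB

    public static void split(File source, File targetDir) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        try (InputStream in = new BufferedInputStream(new FileInputStream(source))) {
            int partNumber = 0;
            int read = in.read(buffer);
            while (read != -1) {
                File part = new File(targetDir, source.getName() + ".part" + partNumber++);
                try (OutputStream out = new BufferedOutputStream(new FileOutputStream(part))) {
                    long written = 0;
                    // fill this part until it reaches the cap (overshooting by at
                    // most one buffer) or the input ends
                    while (read != -1 && written < MAX_PART_SIZE) {
                        out.write(buffer, 0, read);
                        written += read;
                        read = in.read(buffer);
                    }
                }
            }
        }
    }

    public static void merge(List<File> partsInOrder, File target) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(target))) {
            for (File part : partsInOrder) {
                try (InputStream in = new BufferedInputStream(new FileInputStream(part))) {
                    int read;
                    while ((read = in.read(buffer)) != -1) {
                        out.write(buffer, 0, read);
                    }
                }
            }
        }
    }
}

Calling split() from your save path and merge() (with the parts listed in order) from your load path reproduces the original file byte for byte, and every part stays far enough under 4 GB to live on FAT32.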
The max file size on FAT32 is 4 GB (actually 4 GiB minus 1 byte), so if you have to use it and a single file is over that, you will have to split it.
I'm using rclone to transfer data between a MinIO bucket and a shared storage. I'm migrating a store, and the amount of data is around 200 GB of product pictures. Every single picture has its own folder/path, so there are also a lot of folders that need to be created. rclone is installed on the new server and the storage is connected to the server via SAN. The transfer has been running for over a week and we are at 170 GB right now. Everything works fine, but it is really slow in my opinion. Is it normal that a transfer out of a bucket into a classic filesystem is that slow?
(Doing the math, 170 GB over roughly a week is about 170 × 10^9 bytes × 8 bits / 604,800 s ≈ 2.3 Mbit/s. I am honestly not going to pay anything for that speed.)
Perhaps you should break down the issue and diagnose it part by part. Below are several common places to look when a transfer is slow (generally speaking, for any file transfer):
First of all, networks and file systems usually don't perform well with lots of small files, so to isolate the issue, upload one bigger file (1 GB+) to MinIO first. For each step below, test with the big file first.
Is the speed of the source fast enough? Try copying the files from minio to a local storage or Ramdisk (/tmp is usually tmpfs and in turn stored in RAM, use mount to check).
Is the speed of the destination fast enough? Try dd or another disk performance testing utility (a rough Java write-probe sketch follows after this list).
Is the network latency to source high? Try pinging or curling the API (with timing)
Is latency to the destination high? Since the storage is SAN-attached, try iostat to look at device wait times and utilization while the copy runs.
Maybe the CPU is the bottleneck, since encoding and decoding take quite a lot of computing power. Try top while a copy is running.
Again, try these steps with the big file and with the many small files separately. There is a good chance that the small files are the issue; if so, look for the concurrency options in rclone.
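For the destination-speed step above, dd is the usual tool, but any language can do a crude sequential-write probe. Here is a rough Java sketch (the test path and the 1 GiB size are arbitrary, and the OS page cache will flatter the result unless the file is much larger than RAM):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Crude sequential-write probe, comparable in spirit to
// `dd if=/dev/zero of=/destination/testfile bs=1M count=1024`.
public class WriteProbe {
    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "testfile.bin"; // destination path to test
        byte[] block = new byte[1024 * 1024];                     // 1 MiB of zeros
        long blocks = 1024;                                       // 1 GiB in total
        long start = System.nanoTime();
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(path))) {
            for (long i = 0; i < blocks; i++) {
                out.write(block);
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("Wrote %d MiB in %.1f s (%.1f MiB/s)%n", blocks, seconds, blocks / seconds);
    }
}

If the number it reports is far above the ~0.3 MB/s you are currently seeing, the destination disk is unlikely to be the bottleneck.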
I had the same problem copying hundreds of thousands of small files from an S3-compatible storage to a local storage. Originally I was using s3fs + rsync. Very (very) slow, and it was getting stuck on the largest folders. Then I discovered rclone, and finished the migration within a few hours with these parameters:
rclone copy source:/bucket /destination/folder --checkers 256 --transfers 256 --fast-list --size-only --progress
Explanation of the options (from https://rclone.org/flags/)
--checkers 256 Number of checkers to run in parallel (default 8)
--transfers 256 Number of file transfers to run in parallel (default 4)
--fast-list Use recursive list if available; uses more memory but fewer transactions
--size-only Skip based on size only, not mod-time or checksum (wouldn't apply in your case if copying to an empty destination)
--progress Show progress during transfer
In HTTP Live Streaming the files are split into fixed-size chunks for streaming. What's the rationale behind this? How is this better than having a single file and using offsets to retrieve the various chunks?
My rough ideas at the moment.
Splitting the files into multiple chunks reduces file seek time during streaming.
From what I understand, files are stored as a linked list of blocks on the HDD. Is this even true for modern file systems (such as NTFS, ext3), or do they use a more sophisticated data structure such as a balanced tree or hash map to index the blocks of a file? What's the run-time complexity of seeking (using seekp, tellp, etc.) in a file?
HDD seek time is not the consideration. It's done to simplify things at the network/CDN layer, as well as the client logic. HTTP is a request/response protocol; it doesn't deal well with long streams, and it doesn't multiplex, so to use multiple sockets you must make separate requests. Requiring the client to be aware of the file structure and to convert a position on the seek bar to a byte offset is complicated, especially for variable-bitrate media. But if you know a video has 100 segments (files) and you seek to 50%, it's really easy to know which file you need. And finally, how should a caching tier deal with a range request? Download the whole file from the origin, or just request data as needed and 'stitch' the file back together locally? Either way the caching tier would need that logic, and additional logic comes at the cost of fewer requests per second.
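To make the "seek to 50% of 100 segments" point concrete, here is a tiny illustrative sketch of the client-side arithmetic that fixed-duration segments allow (the segment naming scheme is hypothetical):

// Illustrative only: with fixed-duration segments, a seek position maps to a
// segment index with plain arithmetic, no byte offsets and no index tables.
public class HlsSeek {

    // e.g. 100 segments named segment0.ts ... segment99.ts
    static String segmentForSeek(double seekFraction, int totalSegments) {
        int index = (int) Math.floor(seekFraction * totalSegments);
        index = Math.min(index, totalSegments - 1); // clamp a seek to the very end
        return "segment" + index + ".ts";
    }

    public static void main(String[] args) {
        System.out.println(segmentForSeek(0.5, 100)); // -> segment50.ts
    }
}

With a single large file and variable bitrate, the client would instead need an index mapping playback time to byte offsets and would have to issue range requests, which is exactly the complexity described above being pushed onto clients and caches.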
I am studying HPC applications and parallel filesystems, and I came across the terms scratch space and scratch filesystem.
I cannot visualize where this scratch space exists. Is it on the compute node as a mounted filesystem /scratch, or on the main storage?
What are its contents?
Is scratch space independent on each compute node, or can two or more nodes share a single scratch space?
So let's say I have a file 123.txt which I want to process in parallel. Will the scratch space contain parts of this file, or will the whole file be copied?
I am confused, and nowhere on Google is there a clear description. Please point me to one.
Thanks a Lot.
It all depends on how the cluster was setup and what the users need. When you are given access to a cluster you should also be given some information about how it is meant to be used which should answer most of your questions.
On one of the clusters I work with NFS is used for long term storage and some Lustre space is available for job scratch space. Both the NFS and Lustre are seen by all of the nodes. Each of the nodes also has some scratch space on the node that only that node can see.
If you want your job to work on 123.txt in parallel, you can copy 123.txt to a shared scratch space (Lustre), or you can copy it to each node's local scratch space from your job file:
# push a copy of the input file to the local scratch directory of every node in the job
for i in $(sort -u "$PBS_NODEFILE") ; do scp 123.txt "$i":/scratch ; done
Once each node has a copy you can run your job. Once the job is done you need to copy your results to persistent storage since clusters will often run scripts to cleanup scratch space.
There are a lot of different ways to think about or deploy scratch space or a scratch file system.
Let's say you have a cluster of linux nodes, and these nodes all have a hard disk. You could imagine a /scratch space, local to each node. Since the OS image is going to be relatively small, and one cannot procure anything smaller than a terabyte drive nowadays, you end up with close to a terabyte of storage for the node to use.
What would you do with this node-local storage? Oh, lots of things. Scalable Checkpoint-Restart. Local out-of-core operations.
When I first started playing with clusters, it seemed like a good idea to gang all this unused space into a parallel file system. PVFS worked really well for that purpose.
Which lets me segue to a /scratch parallel file system available to all nodes. There is a technology component to this (which parallel file system will a site deploy?) but there is also a policy component: how long will data on this file system be retained? Is it backed up? /scratch often implies files are not backed up and in fact are purged after some period of not being accessed (typically two weeks).
This is just something I noticed, and I'm very curious to understand why it happens and whether someone has a possible explanation for this behavior.
I created two SQLite3 files, both with the exact same data. One was created with version 3.7.5 on CentOS, the other with version 3.7.13 on OSX. The resulting files had sizes of 16K and 28K, and page sizes of 1024 and 4096, respectively.
Does this have anything to do with default block sizes on the OSes or something else file-system related? Or nothing at all and this is because of some additional information that SQLite now stores in its files?
Newer SQLite versions do not store anything additional in database files (as long as you do not use new features).
All tables and indexes use their own pages, so the database file size is affected by the page size.
Each page has a fixed amount of overhead, so increasing the page size typically increases performance by a little bit.
Changing the page size allows you to trade off speed against space requirements.
The default page size is affected by the actual block size of the storage device, and by how the OS reports it.
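To see the effect directly, here is a small sketch; it assumes the xerial sqlite-jdbc driver is on the classpath, and the file names and schema are made up. Note that PRAGMA page_size only takes effect on a fresh database, i.e. before anything has been written to it (or after a VACUUM).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Creates the same tiny schema with two different page sizes and prints the
// resulting database size (page_count * page_size). Run against fresh files.
public class PageSizeDemo {
    public static void main(String[] args) throws Exception {
        for (int size : new int[] {1024, 4096}) {
            String file = "demo_" + size + ".db";
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:" + file);
                 Statement st = conn.createStatement()) {
                st.execute("PRAGMA page_size = " + size); // must run before the first write
                st.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)");
                st.execute("CREATE INDEX t_name ON t (name)"); // each index gets its own pages too
                st.execute("INSERT INTO t (name) VALUES ('a'), ('b'), ('c')");
                long pages, pageSize;
                try (ResultSet rs = st.executeQuery("PRAGMA page_count")) { rs.next(); pages = rs.getLong(1); }
                try (ResultSet rs = st.executeQuery("PRAGMA page_size"))  { rs.next(); pageSize = rs.getLong(1); }
                System.out.println(file + ": " + pages + " pages x " + pageSize
                        + " bytes = " + (pages * pageSize) + " bytes");
            }
        }
    }
}

The same schema and data end up in a noticeably larger file at the larger page size, which matches the 16K vs 28K difference you observed.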
Currently I'm using Zend_Cache_Backend_File for caching in my project (especially responses from external web services). I was wondering if I could find some benefit in migrating to Zend_Cache_Backend_Sqlite.
Possible advantages are:
The file system stays tidy (only one file in the cache folder)
Removing expired entries should be quicker (my assumption, since Zend wouldn't need to scan the internal metadata of each cache file for its expiry date)
Possible disadvantages:
Finding the record to read may be slower (with files, Zend checks whether the file exists based on its name, which should be a bit quicker).
I've tried to search a bit on the internet, but it seems there isn't a lot of discussion about the matter.
What do you think about it?
Thanks in advance.
I'd say, it depends on your application.
Switching shouldn't be hard. Just test both cases and see which is best for you; no benchmark is objective except your own.
Measuring just performance, Zend_Cache_Backend_Static is the fastest one.
One other disadvantage of Zend_Cache_Backend_File is that if you have a lot of cache files it could take your OS a long time to load a single one because it has to open and scan the entire cache directory each time. So say you have 10,000 cache files, try doing an ls shell command on the cache dir to see how long it takes to read in all the files and print the list. This same lag will translate to your app every time the cache needs to be accessed.
You can use the hashed_directory_level option to mitigate this issue a bit, but it only nests up to two directories deep, which may not be enough if you have a lot of cache files. I ran into this problem on a project, causing performance to actually degrade over time as the cache got bigger and bigger. We couldn't switch to Zend_Cache_Backend_Memcached because we needed tag functionality (not supported by Memcached). Switching to Zend_Cache_Backend_Sqlite is a good option to solve this performance degradation problem.