I've implemented a (very) basic filesystem using FUSE. As of right now, the filesystem is mounted with:
./myfusesystem /path/to/mountpoint
This is all well and good, and creates the proper layer over the mountpoint. But I'm not attempting to build a virtual filesystem; I'm working on an on-disk filesystem where data will be stored to a disk partition. Thus, I want to use my FUSE filesystem with an actual device partition, something like:
./myfusesystem /dev/sdd2 /path/to/mountpoint
...where the data will be stored persistently on that partition. Is this something that can be done with FUSE? If so, how? I cannot find any references to this in the documentation or the tutorials I've read.
FUSE lets you store the data anywhere, but if you want to use the partition as the storage backend, it's your job to read and write the data from/to it yourself.
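In other words, you typically parse the device argument out of the command line yourself (before handing the remaining options to fuse_main or your binding's equivalent), open the partition at startup, and have your read/write callbacks do plain offset-based I/O against it. A minimal sketch of that kind of I/O, shown in Java purely for illustration and independent of any particular FUSE binding; the device path, block size and offsets are hypothetical, and opening a raw partition needs appropriate privileges:

import java.io.RandomAccessFile;

public class PartitionIo {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile dev = new RandomAccessFile("/dev/sdd2", "rw")) {
            byte[] block = new byte[4096];

            // "read(path, buf, size, offset)" in FUSE terms: seek, then read.
            dev.seek(4096L * 10);      // e.g. block 10 of the partition
            dev.readFully(block);

            // "write(path, buf, size, offset)" in FUSE terms: seek, then write.
            dev.seek(4096L * 10);
            dev.write(block);
        }
    }
}

How you lay out superblocks, allocation maps and file data within the partition is entirely up to your filesystem; FUSE only delivers the requests.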
Development environment: a mobile app on Android.
I'm looking for a way to uniquely identify files in a FAT32/VFAT file system (which has no inodes).
I thought about hashing (SHA1?) the full path. The problem with this solution is that it doesn't support moving/renaming.
Is there something better, that will hold even when moving/renaming the file?
Thanks
Unfortunately, FAT doesn't have unique file IDs, and when they are needed, various system components emulate them by maintaining a list of all files on the filesystem in memory (thus the ID is unique and valid only while the system is running).
Depending on what you control (a filesystem driver, a filter, or just a user-mode application), you can potentially do the same: keep a list of files and assign unique IDs based on that list.
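A minimal sketch of what such an in-memory table could look like; class and method names are hypothetical. The IDs survive a move/rename only if your component actually observes the rename and updates the table, and they are only valid while the process is running:

import java.util.HashMap;
import java.util.Map;

public class FatIdTable {
    private final Map<String, Long> idsByPath = new HashMap<>();
    private long nextId = 1;

    // Return the existing ID for a path, assigning a new one if needed.
    public synchronized long idFor(String path) {
        return idsByPath.computeIfAbsent(path, p -> nextId++);
    }

    // Call this whenever you see a move/rename so the ID follows the file.
    public synchronized void renamed(String oldPath, String newPath) {
        Long id = idsByPath.remove(oldPath);
        if (id != null) {
            idsByPath.put(newPath, id);
        }
    }
}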
Is there any good way to store a Lucene index in a database without any external library that touches the connection layer (like JDBCDirectory), and also without using the file system (even temporarily)? RAMDirectory would be fine for me if I could get specific parts of the index out of it - the .cfs "file" and segments. I don't know if it's doable. I will be thankful for any help.
I'm trying to process large (~50 MB) XML files to store in the Datastore. I've tried using backends, sockets (to pull the file via urlfetch), and even straight-up uploading the file within my source code, but I keep running into limits (e.g. the 32 MB limit).
So, I'm really confused (and a little angry/frustrated). Does App Engine really have no real way to process a large file? There does seem to be one potential workaround, which would involve remote_api, Amazon (or Google Compute, I guess) and a security/setup nightmare...
HTTP ranges were another thing I considered, but it'll be painful to somehow stitch the different split parts back together (unless I can manage to split the file at exact points).
This seems crazy, so I thought I'd ask Stack Overflow... am I missing something?
Update:
I tried using range requests, and it looks like the server I'm trying to stream from doesn't support them. So right now I'm thinking of downloading the file, hosting it on another server, then using App Engine to access it via HTTP range requests on backends, AND then automating the entire process so I can run it as a cron job :/ (the craziness of having to do all this work for something so simple... sigh)
What about storing it in Google Cloud Storage and reading it incrementally? You can access it line by line (in Python anyway), so it won't consume all your resources.
https://developers.google.com/appengine/docs/python/googlecloudstorageclient/
https://developers.google.com/storage/
The GCS client library lets your application read files from and write files to buckets in Google Cloud Storage (GCS). This library supports reading and writing large amounts of data to GCS, with internal error handling and retries, so you don't have to write your own code to do this. Moreover, it provides read buffering with prefetch so your app can be more efficient.
The GCS client library provides the following functionality:
- An open method that returns a file-like buffer on which you can invoke standard Python file operations for reading and writing.
- A listbucket method for listing the contents of a GCS bucket.
- A stat method for obtaining metadata about a specific file.
- A delete method for deleting files from GCS.
I've processed some very large CSV files in exactly this way - read as much as I need to, process, then read some more.
import os
import cloudstorage as gcs  # the App Engine GCS client library

# This is a method on a request handler, hence self.response.
def read_file(self, filename):
    self.response.write('Truncated file content:\n')

    gcs_file = gcs.open(filename)
    self.response.write(gcs_file.readline())
    # Seek to 1024 bytes before the end and read the tail.
    gcs_file.seek(-1024, os.SEEK_END)
    self.response.write(gcs_file.read())
    gcs_file.close()
Incremental reading with standard Python!
I want to move large number of small files to HDFS sequence file(s). I have come across two options:
Use Flume. Flume does not have a built-in file source, and this requires a custom source to push the files.
Use an Apache Camel file-to-HDFS route.
Even though the above two methods serve the purpose, I would like to weigh other available options before picking one. In particular, I am interested in a solution that is more configurable and results in less code to maintain.
"Use Flume. Flume does not have a built-in file source, and this requires a custom source to push the files."
Umm... no, that's not right. Flume has a Spooling Directory Source, which does, at a high level, exactly what you want.
It seems like a few lines of code with Camel, i.e. from("file:/..").to("hdfs:..") plus some init and project setup.
I'm not sure how much easier (fewer lines of code) you could do it with any other method.
If the HDFS options in Camel give you enough configuration and flexibility, then I guess this approach is the best. It should take you just a matter of hours (or even minutes) to have some test cases up and running.
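For illustration, a rough sketch of what that route could look like in Java, assuming the camel-core and camel-hdfs components are on the classpath; the local directory, namenode host/port, target path and fileType option are hypothetical placeholders you'd adjust to your cluster:

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class FilesToHdfsRoute {
    public static void main(String[] args) throws Exception {
        DefaultCamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // Poll a local directory and write each file into HDFS,
                // asking the hdfs component for sequence-file output.
                from("file:/data/incoming?noop=true")
                    .to("hdfs://namenode:8020/user/me/archive?fileType=SEQUENCE_FILE");
            }
        });
        context.start();
        Thread.sleep(60000);   // let the route run for a while in this demo
        context.stop();
    }
}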
I am running Hadoop 0.20.1 under SLES 10 (SUSE).
My map task takes a file and generates a few more; I then generate my results from these files. I would like to know where I should place these files so that performance is good and there are no collisions. If Hadoop can delete the directory automatically, that would be nice.
Right now, I am using the temp folder and task ID to create a unique folder, and then working within subfolders of that folder.
// Build a unique per-task scratch directory under mapred.temp.dir.
String reduceTaskId = job.get("mapred.task.id");
String reduceTempDir = job.get("mapred.temp.dir");
String myTemporaryFoldername = reduceTempDir + File.separator + reduceTaskId + File.separator;
File diseaseParent = new File(myTemporaryFoldername + REDUCE_WORK_FOLDER);
The problem with this approach is that I am not sure it is optimal; also, I have to delete each new folder myself or I start to run out of space.
Thanks
akintayo
(edit)
I found that the best place to keep files that you don't want beyond the life of the map would be job.get("job.local.dir"), which provides a path that will be deleted when the map task finishes. I am not sure if the delete is done on a per-key basis or for each tasktracker.
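A minimal sketch of that change, reusing the REDUCE_WORK_FOLDER constant from the earlier snippet; the helper class, method name and folder value are hypothetical, and the automatic cleanup is as described in the edit above rather than something verified here:

import java.io.File;
import org.apache.hadoop.mapred.JobConf;

public class ScratchDirs {
    static final String REDUCE_WORK_FOLDER = "reduce_work";   // hypothetical value

    // Returns a per-task scratch directory under job.local.dir, which the
    // framework removes when the task finishes.
    public static File taskScratchDir(JobConf job) {
        File scratch = new File(job.get("job.local.dir"), REDUCE_WORK_FOLDER);
        scratch.mkdirs();
        return scratch;
    }
}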
The problem with that approach is that the sort and shuffle are going to move your data away from where it was localized.
I do not know much about your data, but the distributed cache might work well for you:
${mapred.local.dir}/taskTracker/archive/ : The distributed cache. This directory holds the localized distributed cache. Thus the localized distributed cache is shared among all the tasks and jobs.
http://www.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/
"It is common for a MapReduce program to require one or more files to be read by each map or reduce task before execution. For example, you may have a lookup table that needs to be parsed before processing a set of records. To address this scenario, Hadoop’s MapReduce implementation includes a distributed file cache that will manage copying your file(s) out to the task execution nodes.
The DistributedCache was introduced in Hadoop 0.7.0; see HADOOP-288 for more detail on its origins. There is a great deal of existing documentation for the DistributedCache: see the Hadoop FAQ, the MapReduce Tutorial, the Hadoop Javadoc, and the Hadoop Streaming Tutorial. Once you’ve read the existing documentation and understand how to use the DistributedCache, come on back."
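For reference, a rough sketch of the usual DistributedCache pattern with the old org.apache.hadoop.mapred API (which matches Hadoop 0.20.1); the HDFS path, class and method names are hypothetical placeholders:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetup {
    // At job submission time: register the file so it is copied to every
    // task node before the tasks start.
    public static void configure(JobConf conf) throws Exception {
        DistributedCache.addCacheFile(new URI("/user/me/lookup.dat"), conf);
    }

    // Inside a map or reduce task: resolve the localized copy on local disk.
    public static Path findLookupFile(JobConf conf) throws Exception {
        Path[] cached = DistributedCache.getLocalCacheFiles(conf);
        return cached[0];
    }
}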