Is there such a thing as an efficient writable archive? - filesystems

I'm looking for a way to store multiple files inside one file.
Things I've tried:
SQLite
zip
The problem is that I need writes to be nearly as efficient as writing to a raw file. SQLite's BLOB writes are much slower, and writing to a zip file is slower still.
I started rolling my own file format, but it feels like I'm reinventing the wheel.
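Roughly, the kind of append-only layout I have in mind (a minimal, illustrative Python sketch; the class name and index format are made up, not a finished design):

import json
import os

class AppendArchive:
    # Payloads are appended to one data file; a name -> (offset, length)
    # index lives in a sidecar JSON file so reads can seek directly.
    def __init__(self, path):
        self.data_path = path
        self.index_path = path + ".idx"
        self.index = {}
        if os.path.exists(self.index_path):
            with open(self.index_path) as f:
                self.index = json.load(f)

    def write(self, name, payload):
        # One append to the data file keeps the write cost close to a
        # raw file write; the index is only persisted in close().
        with open(self.data_path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(payload)
        self.index[name] = [offset, len(payload)]

    def read(self, name):
        offset, length = self.index[name]
        with open(self.data_path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    def close(self):
        with open(self.index_path, "w") as f:
            json.dump(self.index, f)

Appends stay close to raw-file write speed; the trade-off is that deleting or rewriting an entry would need a separate compaction pass.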

Related

Processing a large number of HTML files to extract data: best storage mechanism for flat text files

I index the HTML of specific websites, and to do so I pull it down to disk, so I have quite a few flat files of HTML. I then take the HTML and extract data from it to generate JSON files which contain the data I need.
I end up with a structure something like this:
/pages/website.com/index_date/sectionofsite/afile.html
/pages/website.com/index_date/sectionofsite/afile.json
I need to keep the original HTML, as I might need to reprocess it to produce the JSON. The problem now is that I have gigs and gigs of flat HTML files.
I can compress the HTML files easily enough, but sometimes I need to reprocess everything to extract another value or fix a bug. If the HTML is compressed, then reprocessing a set of files means I would need to:
unzip the HTML
extract data and generate the JSON
compress the HTML back to zip
The reality is that this is super slow when you have tons and tons of files. I looked at MongoDB (and its WiredTiger storage engine with zlib compression) as a possible solution for storing the HTML, since it's essentially text and not binary, but MongoDB kept crashing with lots of plain HTML text. I think the PHP library is a bit buggy.
I need a way other than the file system to store plain text files, but with fast access to them. It would be preferable if the storage mechanism also compressed the plain text files. I'm curious whether anyone has run into a similar problem and how they solved it.
First of all, since HTML and JSON compress very well, you should store them compressed.
Rather than zip, use gzip: zip is an archiver, while gzip compresses a single stream. Every major programming language can read and write gzip files as if they weren't compressed. E.g. in Python you simply use gzip.open instead of open, and in Java you wrap the stream in a GZIPInputStream.
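As a rough Python sketch of the reprocessing step, assuming the HTML is already stored as .html.gz files and extract() stands in for your own parsing code:

import gzip
import json

def extract(html):
    # Placeholder for the real extraction logic.
    return {"length": len(html)}

def reprocess(html_gz_path):
    # Read the compressed HTML exactly as if it were a plain text file.
    with gzip.open(html_gz_path, "rt", encoding="utf-8") as f:
        html = f.read()
    data = extract(html)
    # Write the JSON next to it, compressed as well.
    json_gz_path = html_gz_path[:-len(".html.gz")] + ".json.gz"
    with gzip.open(json_gz_path, "wt", encoding="utf-8") as f:
        json.dump(data, f)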
Then you may want to look into embedded databases. Don't use MongoDB, because it is slow. Use e.g. one SQLite file per site to store the compressed data. Using a server (e.g. PostgreSQL or MongoDB) is only beneficial if you have multiple processes working on the same files. Unless you need that concurrency, embedded databases are much faster (because they don't have to send the data over a connection). If you don't need any of the SQL features, libraries such as BerkeleyDB are even smaller.
But in the end, your filesystem is also a database. Not a particularly bad one, but it isn't designed for millions of entries and it supports only name->data lookups. Most file systems also store data in blocks, so any file will occupy a multiple of e.g. 8 KB of disk, even if your data is much smaller. These are the situations where embedded databases help: they also use blocks, but you can configure the block size to be smaller to reduce waste.
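A minimal sqlite3 sketch of the one-file-per-site idea (table and column names are only illustrative), storing gzip-compressed HTML keyed by relative path; the page size only takes effect while the database file is still empty:

import gzip
import sqlite3

def open_site_db(path):
    con = sqlite3.connect(path)
    # A smaller page size reduces per-row waste for small pages;
    # it is only applied while the database file is still empty.
    con.execute("PRAGMA page_size = 1024")
    con.execute("CREATE TABLE IF NOT EXISTS pages ("
                "  path TEXT PRIMARY KEY,"
                "  html BLOB)")
    return con

def store_page(con, rel_path, html_text):
    con.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)",
                (rel_path, gzip.compress(html_text.encode("utf-8"))))
    con.commit()

def load_page(con, rel_path):
    (blob,) = con.execute("SELECT html FROM pages WHERE path = ?",
                          (rel_path,)).fetchone()
    return gzip.decompress(blob).decode("utf-8")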

What is the best way to read and parse millions of files on Windows NT?

I have millions of files in one directory (one directory with many child directories);
these files are all small files.
I think there are two challenges:
How to traverse the directory to find all the files. I have tried the FindFirstFile/FindNextFile approach, but it feels too slow. Should I use the Windows Change Journal?
After I have found all the filenames, I need to read each whole file into memory and then parse it. Should I use the FILE_FLAG_SEQUENTIAL_SCAN flag, or is there a more efficient way?
Some ideas to kick around..
Text Crawler - Awesome indispensable Windows Search Tool - http://digitalvolcano.co.uk/textcrawler.html
Microsoft log parser - http://technet.microsoft.com/en-us/scriptcenter/dd919274.aspx
If you have a SQL Server (or MySQL) instance that has enough space, you can set up a SQL job to import or link to the files in question, and then you can query them.
My fear is that if you load the contents of the files into memory, you are going to run out of server memory quickly. What you need to do is locate the files in question and write the results to a log or report that you can parse and interpret.
NTFS, or in fact any non-specialized file system, will be slow with millions of small files. That's the territory of databases.
If the files are in fact small, it doesn't matter at all how you read them. Overhead costs will dominate. It may be worthwhile to use a second thread, but a third thread is unlikely to help further.
Also, use FindFirstFileEx to speed up the search: you don't need the alternate (8.3) file names, and you do want a larger buffer.
You can use NtQueryDirectoryFile with a large buffer (say, 64 KB) to query for the children.
This function is about the fastest way you can possibly communicate with the file system.
If that doesn't work for you, you can read the NTFS file table directly, but that means you'll need administrative privileges and you'll have to implement the file system reader by hand.
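If you end up driving this from a higher-level language rather than calling the Win32 API yourself, here is a rough Python sketch of the same idea (not the Win32 calls described above): os.scandir returns each entry together with its attribute data, so no extra per-file lookup is needed during traversal, and parse() is just a placeholder.

import os

def walk_files(root):
    # Iterative traversal; each DirEntry carries attribute data from the
    # directory enumeration, so is_dir() needs no extra system call on
    # Windows.
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                else:
                    yield entry.path

def parse(data):
    # Placeholder for the real parsing logic.
    pass

def parse_all(root):
    for path in walk_files(root):
        # The files are small, so read each one whole and hand it off.
        with open(path, "rb") as f:
            parse(f.read())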

Text Compression in Erlang

Is there a text compression library for Erlang? When working with very long strings, it may be advantageous to compress the character data. Has anyone compressed text or thought of a way to do it in Erlang?
I was thinking of using the zip module, but instead of working with files, I work in-memory like this:
compress(LargeText) ->
    Binary = list_to_binary(LargeText),
    {ok, {_, Zipped}} = zip:zip("ZipName", [{"Name", Binary}], [memory]),
    Zipped.
Then I would unzip the text back into memory when I need it. Like this:
{ok, [{"Name", Binary}]} = zip:unzip(Zipped, [memory]).
My Erlang application is supposed to be part of a middle tier through which large text passes on its way into, and out of, a storage system. The storage system is intended for storing large text. To optimize the storage, the text needs to be compressed before it is sent. Assume that the text value is like a CLOB data type in an Oracle database. I am thinking that if I combine the zipping with erlang:garbage_collect/0, I can pull it off.
Or if it's not possible in Erlang, perhaps it is possible using a system call via os:cmd({Some UNIX script}) and then I would grab the output in Erlang? If there's a better way, please show it.
There is a zlib module for Erlang, which supports in-memory compression and decompression.
You can consider using Snappy compression, which is a lot faster than zip, especially for decompression.
Edit:
Nowadays I am using LZ4 a lot and I am very happy with it. It has nice, readable code and a simple format, is well maintained, and is even faster than Snappy.

database for huge files like audio and video

My application creates a large number of files, each up to 100 MB. Currently we store these files in the file system, which works pretty well. But I am wondering whether there is a better solution, such as storing the files in some kind of file database. The one clear advantage of a database would be if it could split each file and store it in small chunks instead of one 100 MB file.
A file system is perfectly suited for storing files. If you need to associate them with a database, do it by filename. The filesystem already does numerous fancy things to ensure it is efficient. It's probably best that you don't try to outsmart it.
Relational databases are no good with files this big. You could go to something like HDFS, but it may not be worth the trouble if what you have is doing the job. I believe HDFS does break large files down into chunks, though.

BLOB Storage - 100+ GB, MySQL, SQLite, or PostgreSQL + Python

I have an idea for a simple application which will monitor a group of folders and index any files it finds. A GUI will allow me to quickly tag new files and move them into a single database for storage, and will also provide an easy mechanism for querying the DB by tag, name, file type and date. At the moment I have about 100+ GB of files on a couple of removable hard drives; the database will be at least that big. If possible I would like to support full-text search of the embedded binary and text documents. This will be a single-user application.
Not trying to start a DB war, but which open source DB is going to work best for me? I am pretty sure SQLite is off the table, but I could be wrong.
I'm still researching this option for one of my own projects, but CouchDB may be worth a look.
Why store the files in the database at all? Simply store your meta-data and a filename. If you need to copy them to a new location for some reason, just do that as a file system copy.
Once you remove the file contents, any competent database will be able to handle the metadata for a few hundred thousand files.
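A rough sqlite3 sketch of that metadata-only layout (all table and column names are made up for illustration): the file itself stays on disk, and the database records only where it is and how it is tagged.

import sqlite3

con = sqlite3.connect("catalog.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS files (
    id        INTEGER PRIMARY KEY,
    path      TEXT UNIQUE,   -- where the file actually lives on disk
    name      TEXT,
    file_type TEXT,
    added     TEXT           -- ISO-8601 date string
);
CREATE TABLE IF NOT EXISTS tags (
    file_id INTEGER REFERENCES files(id),
    tag     TEXT
);
""")

def add_file(path, name, file_type, added, tags):
    cur = con.execute(
        "INSERT INTO files (path, name, file_type, added) VALUES (?, ?, ?, ?)",
        (path, name, file_type, added))
    con.executemany("INSERT INTO tags VALUES (?, ?)",
                    [(cur.lastrowid, t) for t in tags])
    con.commit()

def files_with_tag(tag):
    rows = con.execute(
        "SELECT f.path FROM files f JOIN tags t ON t.file_id = f.id WHERE t.tag = ?",
        (tag,))
    return [path for (path,) in rows]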
My preference would be to store the document with the metadata. One reason is relational integrity: you can't easily move or modify the files without the action being brokered by the DB. I am sure I can handle these problems, but it isn't as clean as I would like, and my experience has been that most vendors can handle huge amounts of binary data in the database these days. I guess I was wondering whether PostgreSQL or MySQL have any obvious advantages in these areas; I am primarily familiar with Oracle. Anyway, thanks for the response. If the DB knows where the external file is, it will also be easy to bring the file in at a later date if I want. Another aspect of the question was whether either database is easier to work with when using Python. I'm assuming that is a wash.
I always hate to answer "don't", but you'd be better off indexing with something like Lucene (PyLucene). That, plus storing the paths in the database rather than the file contents, is almost always recommended.
To add to that, none of those database engines will store LOBs in a separate dataspace (they'll be embedded in the table's data space), so any of those engines should perform about equally well (well, except SQLite). You need to move to Informix, DB2, SQL Server or others to get that kind of binary object handling.
Pretty much any of them would work (even though SQLite wasn't meant to be used in a concurrent multi-user environment, which could be a problem...), since you don't want to index the actual contents of the files.
The only limiting factor is the maximum "packet" size of the given DB (by "packet" I mean a single query/response). Usually these limits are around 2 MB, meaning that your files must be smaller than 2 MB. Of course you could increase this limit, but the whole process is rather inefficient, since, for example, to insert a file you would have to:
Read the entire file into memory
Transform the file into a query (which usually means hex-encoding it, thus doubling its size from the start)
Execute the generated query (which itself means, for the database, that it has to parse it)
I would go with a simple DB and the associated files stored using a naming convention which makes them easy to find (for example based on the primary key). Of course this design is not "pure", but it will perform much better and is also easier to use.
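A small Python sketch of one such convention (the root directory and key format are made up): the primary key is zero-padded and fanned out into subdirectories so that no single directory grows too large.

import os
import shutil

STORE_ROOT = "/data/blobstore"   # illustrative location

def path_for(pk):
    # e.g. pk 1234567 -> /data/blobstore/00/01/0001234567
    key = "%010d" % pk
    return os.path.join(STORE_ROOT, key[:2], key[2:4], key)

def store_file(pk, source_path):
    # The database row keeps only the primary key and metadata;
    # the bytes live here, at a path derived from the key.
    dest = path_for(pk)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copy2(source_path, dest)
    return dest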
Why are you wasting time emulating something that the filesystem should be able to handle? More storage + grep is your answer.
