Text Compression in Erlang - database

Is there a text compression library for Erlang? When working with very long strings, it may be advantageous to compress the character data. Has anyone compressed text or thought of a way to do it in Erlang?
I was thinking of using the zip module, but instead of working with files, I would work in memory, like this:
compress(LargeText) ->
    Binary = list_to_binary(LargeText),
    {ok, {_, Zipped}} = zip:zip("ZipName", [{"Name", Binary}], [memory]),
    Zipped.
Then I would unzip the text back into memory when I need it, like this:
{ok, [{"Name", Binary}]} = zip:unzip(Zipped, [memory]).
My Erlang application is supposed to be part of a middle tier through which large text may have to pass on its way into and out of a storage system. The storage system is intended to store large text. To optimize the storage, there is a need to compress the text before sending it. Assume that the text value is like a CLOB data type in Oracle Database. I am thinking that if I combine the zipping with erlang:garbage_collect/0, I can pull it off.
Or, if it's not possible in Erlang, perhaps it is possible using a system call via os:cmd({Some UNIX script}), and then I would grab the output in Erlang? If there's a better way, please show it.

There is a zlib module for Erlang, which supports in-memory compression and decompression (zlib:compress/1 and zlib:uncompress/1 for the zlib format, or zlib:gzip/1 and zlib:gunzip/1 for gzip).
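For example, a minimal in-memory round trip with that module might look like the sketch below (compress_text/1 and decompress_text/1 are just illustrative names):

compress_text(LargeText) ->
    Bin = iolist_to_binary(LargeText),
    zlib:gzip(Bin).              %% zlib:compress/1 would produce the zlib format instead of gzip

decompress_text(Compressed) ->
    zlib:gunzip(Compressed).     %% zlib:uncompress/1 is the counterpart of zlib:compress/1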

You can consider using Snappy compression, which is a lot faster than zip, especially for decompression.
Edit:
Nowadays I am using LZ4 a lot and I am very happy with it. It has nice, readable code and a simple format, it is well maintained, and it is even faster than Snappy.

Related

Is there such a thing as an efficient writable archive?

Looking for a way to store multiple files inside one file.
Things I've tried:
sqlite
zip
The problem is that I need writing to be nearly as efficient as writing to the raw file. SQLite's blob format is very slow. Zip file writing is even slower.
I started rolling my own file format, but it feels like I'm reinventing the wheel.

Using Redis to temporarily cache files

I would like to temporarily cache uploaded files in Redis. I know it will use a lot of memory, but I think it is the best way to get really low latency for a temporary amount of time.
How do I store files in Redis? Do I somehow convert them into binary and store them and decode them when I need them?
Strings in Redis are binary safe, which means you could store binary files without any problem (https://redis.io/topics/data-types#strings).
The way you will do this depends on the language and frameworks you are using, but, generally speaking, one way to accomplish it is simply to store the file content in Redis as base64.
Hope it helps.
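Since this page started from Erlang, here is a minimal sketch of that idea assuming the eredis client; the function names are illustrative, the base64 step is optional because Redis values are binary safe, and SETEX covers the "temporary" part by expiring the key:

%% {ok, C} = eredis:start_link().   %% defaults to 127.0.0.1:6379

store_upload(C, Key, FileBin) ->
    Encoded = base64:encode(FileBin),                               %% optional; raw binaries work too
    {ok, <<"OK">>} = eredis:q(C, ["SETEX", Key, "3600", Encoded]),  %% expire after one hour
    ok.

fetch_upload(C, Key) ->
    {ok, Encoded} = eredis:q(C, ["GET", Key]),
    base64:decode(Encoded).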

Processing a large number of HTML files to extract data: best storage mechanism for flat text files

I index the HTML of specific websites and, to do so, pull it down to disk, so I have quite a few flat HTML files. I then take the HTML, extract data from it, and generate JSON files which contain the data I need.
I end up with a structure something like this:
/pages/website.com/index_date/sectionofsite/afile.html
/pages/website.com/index_date/sectionofsite/afile.json
I need to keep the original HTML as I might need to reprocess it to produce the JSON. The problem now is that I have gigs and gigs of flat HTML files.
I can compress the HTML files no problem, but sometimes I need to reprocess everything to extract another value or fix a bug. If I compress the HTML, the problem is that reprocessing a set of files would mean I would need to:
unzip the HTML,
extract data and generate JSON,
compress the HTML back to zip.
The reality is that this is super slow when you have tons and tons of files. I looked at MongoDB (and its WiredTiger storage engine with zlib compression) as a possible solution for storing the HTML, as it's essentially text and not binary, but MongoDB kept crashing with lots of plain HTML text. I think the PHP library is a bit buggy.
I need a way other than the file system to store plain text files, but with fast access to them. It would be preferable if the storage mechanism also compressed the plain text files. I am curious whether anyone has run into a similar problem and how they solved it.
First of all, since HTML and JSON compress very well, you should store them compressed.
Rather than zip, use gzip: zip is an archiver, while gzip compresses a single stream. Every programming language has a way to read and write gzip files as if they weren't compressed; e.g. in Python you simply use gzip.open instead of open, and in Java you wrap the stream in a GZIPInputStream.
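Since this page started from Erlang, the same trick is available there too: zlib:gzip/1 and zlib:gunzip/1 handle in-memory data, and the compressed option to file:open/2 reads and writes gzip files transparently. A minimal sketch (write_gzipped/2 and read_gzipped/1 are just illustrative names):

write_gzipped(Path, Html) ->
    {ok, F} = file:open(Path, [write, binary, compressed]),   %% writes a gzip stream
    ok = file:write(F, Html),
    file:close(F).

read_gzipped(Path) ->
    {ok, F} = file:open(Path, [read, binary, compressed]),    %% decompresses on read
    {ok, Html} = file:read(F, 64 * 1024 * 1024),              %% read up to 64 MB for this sketch
    file:close(F),
    Html.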
Then you may want to look into embedded databases. Don't use MongoDB, because it is slow. Use e.g. one SQLite file per site to store the compressed data. Using a server (e.g. PostgreSQL or MongoDB) is only beneficial if you have multiple processes working on the same files. Unless you need this concurrency, embedded databases are much faster (because they don't transmit the data). If you don't need any of the SQL features, libraries such as BerkeleyDB are even smaller.
But in the end, your filesystem is also a database. Not a particularly bad one, but it is not designed for millions of entries and it supports only name->data lookups. Most file systems also use blocks for storage, so any file will occupy a multiple of e.g. 8 kB of disk, even if your data is much smaller. It's in these situations that embedded databases help. They also use blocks, but you can configure the block size to be smaller to reduce waste.
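As an Erlang-flavoured sketch of that embedded key-value idea, here the built-in dets table stands in for the BerkeleyDB-style library mentioned above, with zlib:gzip/1 doing the compression (a dets file is capped at 2 GB, so you would keep e.g. one table per site; the function names are illustrative):

open_store(Site) when is_atom(Site) ->
    {ok, Tab} = dets:open_file(Site, [{file, atom_to_list(Site) ++ ".dets"}]),
    Tab.

put_page(Tab, Path, Html) ->
    ok = dets:insert(Tab, {Path, zlib:gzip(Html)}).

get_page(Tab, Path) ->
    case dets:lookup(Tab, Path) of
        [{_Path, Gz}] -> {ok, zlib:gunzip(Gz)};
        []            -> not_found
    end.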

Database for huge files like audio and video

My application creates a large number of files, each up to 100 MB. Currently we store these files in the file system, which works pretty well. But I am wondering if there is a better solution: storing the files in some kind of file database. The simple advantage of a database would be if it could split a file up and store it in small chunks instead of one 100 MB file.
A file system is perfectly suited for storing files. If you need to associate them with a database, do it by filename. The filesystem already does numerous fancy things to ensure it is efficient. It's probably best that you don't try to outsmart it.
Relational databases are no good at files this big. You could go to something like HDFS, but it may not be worth the trouble if what you have is doing the job. I believe HDFS does break large files down into chunks, though.

SQLite / Firebird embedded for numeric data

I have an experiment streaming 1 Mb/s of numeric data which needs to be stored for later processing.
It seems as easy to write directly into a database as into a CSV file, and I would then have the ability to easily retrieve subsets or ranges.
I have experience with sqlite2 (when it only had text fields), and it seemed pretty much as fast as raw disk access.
Any opinions on the best current in-process DBMS for this application?
Sorry, I should have added that this is C++, initially on Windows, but cross-platform is nice. Ideally the DB binary file format should be cross-platform.
If you only need to read/write the data, without any checking or manipulation done in the database, then both should do fine. Firebird's database file can be copied, as long as the systems have the same endianness (i.e. you cannot copy the file between systems with Intel and PPC processors, but Intel-to-Intel is fine).
However, if you ever need to do anything with the data beyond simple read/write, go with Firebird, as it is a full SQL server with all the 'enterprise' features like triggers, views, stored procedures, temporary tables, etc.
BTW, if you decide to give Firebird a try, I highly recommend you use the IBPP library to access it. It is a very thin C++ wrapper around Firebird's C API. It has about 10 classes that encapsulate everything, and it's dead easy to use.
If all you want to do is store the numbers and be able to easily do range queries, you can just take any standard tree data structure available in the STL and serialize it to disk. This may bite you in a cross-platform environment, especially if you are trying to go cross-architecture.
As for more flexible, people-friendly solutions, sqlite3 is widely used, solid, stable, and very nice all around.
BerkeleyDB has a number of good features for which one would use it, but none of them apply in this scenario, imho.
I'd say go with sqlite3 if you can accept the license agreement.
-D
It depends on what language you are using. If it's C/C++, Tcl, or PHP, SQLite is still among the best in the single-writer scenario. If you don't need SQL access, a Berkeley DB-style library might be slightly faster, like Sleepycat or gdbm. With multiple writers you could consider a separate client/server solution, but it doesn't sound like you need it. If you're using Java, hsqldb or Derby (shipped with Sun's JDK under the "JavaDB" branding) seem to be the solutions of choice.
You may also want to consider a numeric data file format that is specifically geared towards storing these types of large data sets. For example:
HDF -- the most common and well supported in many languages with free libraries. I highly recommend this.
CDF -- a similar format used by NASA (but usable by anyone).
NetCDF -- another similar format (the latest version is actually a stripped-down HDF5).
This link has some info about the differences between the above data set types:
http://nssdc.gsfc.nasa.gov/cdf/html/FAQ.html
I suspect that neither database will allow you to write data at such high speed. You can check this yourself to be sure. In my experience, SQLite failed to INSERT more than 1000 rows per second for a very simple table with a single integer primary key.
In case of a performance problem, I would write the files in CSV format and later load their data into the database (SQLite or Firebird) for further processing.
