I have an old project which uses Elasticsearch version 5.5. The problem we are facing right now is that ES eats up a huge amount of storage at an alarming rate. I checked the server and confirmed that the data in /var/data/elasticsearch really is huge (around 900 GB).
I also noticed that the directory contains a bunch of snapshot and meta files (meta-xxxx.dat, snap-xxx.dat).
From browsing the web, I learned that these are snapshot and backup files that Elasticsearch generates automatically. As there are very limited resources available online on how to delete or at least shrink them, I am asking my questions here instead.
Is it safe to delete these files (meta-xxx.dat & snap-xxx.dat)?
Is there a way to delete them the "Elasticsearch way"?
What are the consequences of deleting these?
Thank you in advance!
DO NOT delete any files directly from the filesystem; that is likely to cause major issues with Elasticsearch and your data. You should be able to see and manage snapshots via the applicable API (the GET _snapshot part) - https://www.elastic.co/guide/en/elasticsearch/reference/5.5/modules-snapshots.html#_snapshot
Deleting snapshots via the API will delete the underlying files and free up disk space.
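For illustration, a minimal Python sketch of that workflow using the requests library - assuming Elasticsearch is reachable on localhost:9200, and with hypothetical repository/snapshot names:

```python
import requests

ES = "http://localhost:9200"  # assumption: where your ES 5.5 node listens

# List the registered snapshot repositories, then the snapshots in each.
repos = requests.get(f"{ES}/_snapshot/_all").json()
for repo in repos:
    snapshots = requests.get(f"{ES}/_snapshot/{repo}/_all").json()
    for snap in snapshots.get("snapshots", []):
        print(repo, snap["snapshot"], snap["state"])

# Deleting a snapshot through the API lets Elasticsearch clean up the
# snap-*.dat / meta-*.dat files that are no longer referenced.
# "my_backup" and "snapshot_1" are placeholders for your own names.
requests.delete(f"{ES}/_snapshot/my_backup/snapshot_1")
```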
Also, 5.5 is long EOL and you should look to upgrade ASAP. There have been tonnes of improvements around storage efficiency in later versions.
So I have this requirement that says the app must let users upload and download about 6000 files per month (mostly PDF, DOC, XLS).
I was thinking about the optimal solution for this. The question is whether I should use BLOBs in my database or a simple file hierarchy for writing/reading this bunch of files.
The app architecture is based on Java 1.6, Spring 3.1, Dojo, and Informix 10.x.
So I'm here just to be advised based on your experience.
When asking what's the "best" solution, it's a good idea to include your evaluation criteria - speed, cost, simplicity, maintenance etc.
The answer Mikko Maunu gave is pretty much on the money. I haven't used Informix in 20 years, but most databases are a little slow when dealing with BLOBs - in particular, getting the BLOB into and out of the database can be slow.
That problem tends to get worse as more users access the system simultaneously, especially if they use a web application - the application server has to work quite hard to get the files in and out of the database, probably consumes far more memory for those requests than normal, and probably takes longer to complete the file-related requests than for "normal" pages.
This can lead to the webserver slowing down under only moderate load. If you choose to store the documents in your database, I'd strongly recommend running some performance tests to see if you have a problem - this kind of solution tends to expose flaws in your setup that wouldn't otherwise come to light (slow network connection to your database server, insufficient RAM in your web servers, etc.)
To avoid this, I've stored the "master" copies of the documents in the database, so they all get backed up together, and I can ask the database questions like "do I have all the documents for user x?". However, I've used a cache on the webserver to avoid reading documents from the database more than I needed to. This works well if you have a "write once, read many" solution like a content management system, where the cache can earn its keep.
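As a hedged sketch of that pattern (not the actual implementation - the stack in question was Java, and the table/column names here are invented), the read path looks roughly like this in Python, with sqlite3 standing in for the real database:

```python
import os
import sqlite3  # stand-in for the real database driver

CACHE_DIR = "/var/cache/docs"  # hypothetical webserver-local cache

def get_document(conn, doc_id):
    """Return document bytes, hitting the database only on a cache miss."""
    cache_path = os.path.join(CACHE_DIR, str(doc_id))
    if os.path.exists(cache_path):            # cache hit: no BLOB round-trip
        with open(cache_path, "rb") as f:
            return f.read()
    # Cache miss: fetch the master copy from the database...
    row = conn.execute(
        "SELECT content FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    data = row[0]
    # ...and populate the cache so later reads stay off the database.
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(cache_path, "wb") as f:
        f.write(data)
    return data
```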
If you have other data in the database related to these files, storing the files on the file system makes things more complex:
Backups have to be done separately.
Transactions have to be implemented separately (insofar as that is even possible for file system operations).
Integrity checks between the database and the file system structure do not come out of the box.
No cascades: e.g. removing a user's pictures as a consequence of removing the user.
To read a file, you first have to query the database for its path and then fetch it from the file system (see the sketch below).
What is good about a file-system-based solution is that it is sometimes handy to be able to access files directly, for example to copy some of the images somewhere else. Also, storing binary data can of course dramatically increase the size of the database. In any case, more disk storage is needed somewhere with both solutions.
Of course, all of this can demand more DB resources than are currently available. In general there can be a significant performance hit, especially if the choice is between a local file system and a remote DB. In your case (6000 files monthly) raw performance will not be a problem, but latency can be.
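For the two-step lookup mentioned in the list above, a minimal Python sketch (schema and names are hypothetical; note that no single transaction covers both steps, which is exactly the integrity gap described):

```python
import sqlite3  # any relational DB works the same way here

def read_user_file(conn, file_id):
    """Step 1: get the path from the database. Step 2: read the file system."""
    row = conn.execute(
        "SELECT path FROM user_files WHERE id = ?", (file_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no database record for file {file_id}")
    # The file may have been moved or deleted independently of the DB row -
    # that is the integrity problem the list above warns about.
    with open(row[0], "rb") as f:
        return f.read()
```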
I'm writing a document editing web service, in which documents can be edited via a website, or locally and pushed via git. I'm trying to decide whether the documents should be stored as individual documents on the filesystem, or in a database. The points I'm wondering about are:
If they're in a database, is there any way for git to see the documents?
How much higher are the overheads using the filesystem? I assume the OS is doing a lot more work. How can I alleviate some of this? For example, the web editor autosaves, what would the best way to cache the save data be, to minimise writes?
Does one scale significantly better or worse than the other? If all goes according to plan, this will be a service with many thousands of documents being accessed and edited.
If the documents go into a database, git can't directly see the documents. git will see the backing storage file(s) for the database, but have no way of correlating changes there to changes to files.
The overhead of using the database is higher than using a filesystem, as answered by Carlos. Databases are optimized for transactions, which they'll do in memory, but they still have to hit the file. Unless you program the application to do database transactions at a sub-document level (e.g. changing only modified lines), the database will give you no performance improvement. Most modern filesystems do caching as well, and you can 'write' in a way that sits in RAM rather than going to your backing storage. You'll need to manage the granularity of the 'autosaves' in your application (every change? every 30 seconds? 5 minutes?), but really, doing it at the same granularity with a database will cause the same amount of traffic to the backing store.
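As a sketch of that granularity management (a hypothetical helper, not part of any framework): coalesce rapid edits in memory and flush to disk at most once per interval.

```python
import threading
import time

class AutosaveBuffer:
    """Keep only the newest unsaved content; write it out at most
    once every `interval` seconds instead of on every keystroke."""

    def __init__(self, path, interval=30):
        self.path = path
        self.interval = interval
        self.lock = threading.Lock()
        self.pending = None      # latest content not yet on disk
        self.last_flush = 0.0

    def edit(self, content):
        with self.lock:
            self.pending = content           # newer edits replace older ones
            if time.monotonic() - self.last_flush >= self.interval:
                self._flush_locked()

    def flush(self):
        """Force the final write, e.g. when the editing session ends."""
        with self.lock:
            self._flush_locked()

    def _flush_locked(self):
        if self.pending is not None:
            with open(self.path, "w") as f:
                f.write(self.pending)
            self.pending = None
            self.last_flush = time.monotonic()
```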
I think you intended to ask "does the filesystem scale as well as the database"? :) If you have some way to organize your files per-user, and you figure out the security issue of a particular user only being able to access/modify the files they should be able to (both of which are doable, IMO), the filesystem should work fine.
The filesystem will always be faster than a DB, because after all, DBs store their data in the filesystem!
Git is quite efficient on its own, as proven by GitHub, so I say you stick with git and work around it.
After all, Linus should know something... ;)
Can anyone suggest a database solution for storing large documents which will have multiple branched revisions? Partial edits of content should be possible without having to update the entire document.
I was looking at XML databases and wondering about the suitability of them, or maybe even using a DVCS (like Mercurial).
It should preferably have Python bindings.
Try Fossil -- it has a good delta encoding algorithm and keeps all versions. It's backed by a single SQLite database, and has both a web-based and a command-line UI.
This depends on your storage behavior and use case. If you plan to store a massive number of "document revisions" and keep historical versions, and can comply with a write-once-read-many pattern, you should look into something like Hadoop HDFS. This requires a lot of (cheap) infrastructure to run your cluster, but you will be able to keep adding revisions/data over time and will be able to quickly look it up using a MapReduce algorithm.
I'm running a database-backed web site on shared hosting that occasionally gets swarmed after a mention on a link sharing site.
Because of how much load the first couple of traffic surges put on the database, I have implemented file-based caching.
When a query runs, I just serialize the resultset object and save it to a file. I have a sub-directory structure in the cache directory that keeps thousands of files from ending up in the same directory. Next time I have to run the same query, I just pull the object out of the file instead.
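A rough Python sketch of that scheme (the paths and serialization details here are placeholders, not the actual code): hash the query into a two-level subdirectory so thousands of cache files never share one directory.

```python
import hashlib
import os
import pickle

CACHE_ROOT = "/path/to/cache"  # placeholder

def cache_path(query, params):
    key = hashlib.sha1(repr((query, params)).encode()).hexdigest()
    # Two levels of subdirectories, e.g. ab/cd/abcd1234....pkl
    return os.path.join(CACHE_ROOT, key[:2], key[2:4], key + ".pkl")

def cached_query(conn, query, params=()):
    path = cache_path(query, params)
    if os.path.exists(path):                  # hit: reuse the saved resultset
        with open(path, "rb") as f:
            return pickle.load(f)
    rows = conn.execute(query, params).fetchall()   # miss: run the query
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(rows, f)
    return rows
```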
It's been working pretty well so far. But I'm worried that I am overlooking something, and possibly asking for trouble if there is a higher level of traffic than I've previously enjoyed. Or maybe there's just an easier way to do this?
Please poke some holes in this for me? Thanks!
Ideally, cache in memory to remove disk access. Have a look at something like memcached.
Since you're on shared hosting, you should do some throttling (google "Throttling your web server (Oct 00)" for ideas).
A related interesting read (which also mentions Stonehenge::Throttle) is Building a Large-Scale E-commerce site with Apache and mod_perl: http://perl.apache.org/docs/tutorials/apps/scale_etoys/etoys.html
A web app I'm working on requires frequent parsing of diverse web resources (HTML, XML, RSS, etc). Once downloaded, I need to cache these resources to minimize network load. The app requires a very straightforward cache policy: only re-download a cached resource when more than X minutes have passed since the access time.
Should I:
Store both the access time (e.g. 6/29/09 at 10:50 am) and the resource itself in the database.
Store the access time and a unique identifier in the database. The unique identifier is the filename of the resource, stored on the local disk.
Use another approach or third party software solution.
Essentially, this question can be re-written as, "Which is better for storing moderate amounts of data - a database or flat files?"
Thanks for your help! :)
NB: The app is running on a VPS, so size restrictions on the database/flat files do not apply.
To answer your question: "Which is better for storing moderate amounts of data - a database or flat files?"
The answer is (in my opinion) flat files. Flat files are easier to back up and easier to remove.
However, you have extra information that isn't encapsulated in this question, mainly the fact that you will need to access this stored data to determine if a resource has gone stale.
Given this need, it makes more sense to store it in a database. Flat files do not lend themselves well to random access and search, compared to a relational DB.
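To make that concrete, a minimal sketch with SQLite (the schema, the 15-minute threshold, and the download callback are all assumptions): the access time lives next to the content, so the staleness check is a single lookup.

```python
import sqlite3
import time

STALE_AFTER = 15 * 60  # "X minutes" from the question, as seconds

conn = sqlite3.connect("cache.db")
conn.execute("""CREATE TABLE IF NOT EXISTS resources (
    url        TEXT PRIMARY KEY,
    fetched_at REAL,   -- Unix timestamp of the last download
    content    BLOB    -- or a filename, for the flat-file variant
)""")

def get_resource(url, download):
    """Return cached content unless it is older than STALE_AFTER seconds."""
    row = conn.execute(
        "SELECT fetched_at, content FROM resources WHERE url = ?", (url,)
    ).fetchone()
    if row and time.time() - row[0] < STALE_AFTER:
        return row[1]                        # still fresh
    content = download(url)                  # stale or missing: re-download
    conn.execute(
        "INSERT OR REPLACE INTO resources VALUES (?, ?, ?)",
        (url, time.time(), content),
    )
    conn.commit()
    return content
```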
It depends on the platform. IF you use .NET, the answer is 3: use the Cache object, which is ideally suited for this in ASP.NET.
You can set time and dependency expiration.
This doc explains the Cache object:
https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-5034946.html
Neither.
Have a look at memcached to see if it works with your server/client platform. This is easier to set up and performs much better than filesystem/RDBMS-based caching, provided you can spare the RAM needed for the data being cached.
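A minimal sketch, assuming the pymemcache client library and memcached on its default port; the expire argument makes memcached enforce the X-minute policy itself:

```python
from pymemcache.client.base import Client  # assumption: pymemcache is available

client = Client(("localhost", 11211))      # assumption: default memcached port

def get_resource(url, download, ttl=15 * 60):
    """Same interface as the DB variant, but expiry is handled by memcached."""
    cached = client.get(url)
    if cached is not None:
        return cached                       # fresh copy still in RAM
    content = download(url)                 # expired or never cached
    client.set(url, content, expire=ttl)    # evicted automatically after ttl
    return content
```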
All of the proposed solutions are reasonable. However, for my particular needs, I went with flat files. Oddly enough, though, I did so for reasons not mentioned in some of the other answers. It doesn't really matter to me that flat files are easier to backup and remove, and both DB and flat-file solutions allow for easy checking of whether or not the cached data has gone stale. I went with flat files first and foremost because, on my mid-sized one-box VPS LAMP architecture, I think it will be faster than a third-party cache or DB-based solution.
Thanks to all for your thoughts! :)