EHCache: how to evaluate whether heap and off-heap resources are enough?

EHCache version 3.2
How do I decide what values to use under "ehcache:resources"?
File: ehcache.xml
<ehcache:cache alias="messageCache">
  <ehcache:key-type>java.lang.String</ehcache:key-type>
  <ehcache:value-type>org.cache.messageCacheBean</ehcache:value-type>
  <ehcache:resources>
    <ehcache:heap unit="entries">10</ehcache:heap>
    <ehcache:offheap unit="MB">1</ehcache:offheap>
  </ehcache:resources>
</ehcache:cache>
Assume the table Message has 4 columns of type VARCHAR2(100 BYTE) and holds 1000 rows or more in the database.
What heap/offheap values would be enough?
Thank you.

Sizing caches is part of the exercise and does not really have a general answer.
However, it helps to understand a bit about how Ehcache works. In the context of your configuration:
All mappings will always be present in offheap. So this tier should be sized large enough to give the memory-usage / latency trade-off that matches your application's needs.
The onheap tier is used for the hot set, giving even lower latency. However, if it is too small it will be evicting all the time, which makes its benefit much less interesting; if it is too large, it puts too much pressure on Java garbage collection.
One thing you can do to help size the onheap tier is to switch it to byte-based sizing in one test. While this has a performance impact, it lets you evaluate how much memory a mapping takes, and from that derive how many mappings fit in the memory you are willing to spare.
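For example, a byte-sized test configuration could look roughly like the following, using the Ehcache 3 programmatic API (a minimal sketch: MessageCacheBean stands in for the org.cache.messageCacheBean value type from the question, and the 4 MB / 16 MB figures are placeholders to experiment with, not recommendations):

File: CacheSizingTest.java (illustrative)
import java.io.Serializable;
import org.ehcache.Cache;
import org.ehcache.CacheManager;
import org.ehcache.config.builders.CacheConfigurationBuilder;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.config.builders.ResourcePoolsBuilder;
import org.ehcache.config.units.MemoryUnit;

public class CacheSizingTest {

    // Placeholder standing in for org.cache.messageCacheBean from the question:
    // four VARCHAR2(100)-like columns.
    public static class MessageCacheBean implements Serializable {
        public String col1, col2, col3, col4;
    }

    public static void main(String[] args) {
        CacheManager cacheManager = CacheManagerBuilder.newCacheManagerBuilder()
            .withCache("messageCache",
                CacheConfigurationBuilder.newCacheConfigurationBuilder(
                    String.class, MessageCacheBean.class,
                    ResourcePoolsBuilder.newResourcePoolsBuilder()
                        .heap(4, MemoryUnit.MB)        // heap sized in bytes just for this test (placeholder value)
                        .offheap(16, MemoryUnit.MB)))  // offheap kept larger than heap (placeholder value)
            .build(true);

        Cache<String, MessageCacheBean> cache =
            cacheManager.getCache("messageCache", String.class, MessageCacheBean.class);

        // Load a known number of entries here, measure JVM heap usage before and
        // after, then divide to estimate how much one mapping costs on average.

        cacheManager.close();
    }
}

Once you know the average size of one mapping, you can decide how many entries the production, entries-based heap tier should allow and how many megabytes of offheap cover your 1000+ rows.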

Related

Tuning rocksDB to handle a lot of missing keys

I'm trying to configure the RocksDB I'm using as a backend for my Flink job. The state RocksDB needs to hold is not too big (around 5G), but it needs to deal with a lot of missing keys: about 80% of the get requests will not find the key in the database. I wonder whether there is a specific configuration to help with the memory consumption. I have tried using bloom filters with 3 bits per key and increasing the block size to 16kb, but it doesn't seem to help and the job fails with out-of-memory exceptions.
I'll be glad to hear more suggestions 😊
I wonder whether there is a specific configuration to help with the memory consumption.
If you are able to obtain a heap profile (for example with https://gperftools.github.io/gperftools/heapprofile.html ), it will help you figure out which part of RocksDB consumes the most memory.
Given the memory budget (i.e., expectation) you have planned for your RocksDB, you might start with some general memory controls such as the following (sketched in code below):
WriteBufferManager (for the memtable, which is a large memory consumer) https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager
Tuning the block cache size (another large memory consumer) https://github.com/facebook/rocksdb/wiki/Block-Cache
Tracking/capping memory in the block cache (https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB#block-cache, https://github.com/facebook/rocksdb/wiki/Projects-Being-Developed#improving-memory-efficiency)
I am not clear on how missing keys would affect your memory consumption in any specific way, though.
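For reference, those controls could look roughly like this through the RocksJava API (a sketch only: the 512 MB / 256 MB figures and the /tmp path are placeholders, exact setter names can vary between RocksDB versions, and in a Flink job you would normally apply these through Flink's RocksDB options rather than opening the database yourself):

import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.BloomFilter;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBufferManager;

public class RocksDbMemoryControls {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();

        // Shared block cache; also used to charge memtable memory below.
        LRUCache blockCache = new LRUCache(512L * 1024 * 1024);          // 512 MB, placeholder

        // Cap total memtable memory and account for it inside the block cache.
        WriteBufferManager writeBufferManager =
            new WriteBufferManager(256L * 1024 * 1024, blockCache);      // 256 MB, placeholder

        BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
            .setBlockCache(blockCache)
            .setBlockSize(16 * 1024)                  // 16 KB blocks, as in the question
            .setFilterPolicy(new BloomFilter(10))     // ~10 bits/key; 3 bits/key gives a high false-positive rate
            .setCacheIndexAndFilterBlocks(true);      // charge index/filter blocks to the cache too

        try (Options options = new Options()
                .setCreateIfMissing(true)
                .setWriteBufferManager(writeBufferManager)
                .setTableFormatConfig(tableConfig);
             RocksDB db = RocksDB.open(options, "/tmp/rocksdb-sizing-test")) {
            // ... run a representative read/write load and watch memory usage
        }
    }
}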

Which NoSQL Database for Mostly Writing

I'm working on a system that will generate and store large amounts of data to disk. A previously developed system at the company used ordinary files to store its data but for several reasons it became very hard to manage.
I believe NoSQL databases are good solutions for us. What we are going to store is generally documents (usually around 100K, but occasionally much larger or smaller) annotated with some metadata. Query performance is not the top priority. The priority is writing in a way that I/O becomes as small a hassle as possible. The rate of data generation is about 1Gbps, but we might be moving to 10Gbps (or even more) in the future.
My other requirement is the availability of a (preferably well documented) C API. I'm currently testing MongoDB. Is this a good choice? If not, what other database system can I use?
The rate of data generation is about 1Gbps,... I'm currently testing MongoDB. Is this a good choice?
OK, so just to clarify: 1Gbps is roughly 1 gigaBYTE every 8 seconds, so you are filling a 1TB hard drive every couple of hours?
MongoDB has pretty solid write rates, but it is ideally used in situations with a reasonably low RAM to Data ratio. You want to keep at least primary indexes in memory along with some data.
In my experience, you want about 1GB of RAM for every 5-10GB of Data. Beyond that number, read performance drops off dramatically. Once you get to 1GB of RAM for 100GB of data, even adding new data can be slow as the index stops fitting in RAM.
The big key here is:
What queries are you planning to run and how does MongoDB make running these queries easier?
Your data is very quickly going to occupy enough space that basically every query will just be going to disk. Unless you have a very specific indexing and sharding strategy, you end up just doing disk scans.
Additionally, MongoDB does not support compression. So you will be using lots of disk space.
If not, what other database system can I use?
Have you considered compressed flat files? Or possibly a big-data Map/Reduce system like Hadoop (I know Hadoop is written in Java)?
If C is a key requirement, maybe you want to look at Tokyo/Kyoto Cabinet?
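To sketch the compressed flat-file idea (in Java here purely for illustration, even though the asker wants a C API): append each document as its own gzip member to a large log file; standard gzip readers treat concatenated members as a single stream. A real system would batch documents and keep a small index of offsets and metadata.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressedAppendLog {

    // Appends one document as a standalone gzip member; gunzip and GZIPInputStream
    // treat a concatenation of members as one continuous stream.
    public static void append(String path, byte[] document) throws IOException {
        try (OutputStream file = new FileOutputStream(path, true);   // open in append mode
             GZIPOutputStream gzip = new GZIPOutputStream(file)) {
            gzip.write(document);
        }
    }

    public static void main(String[] args) throws IOException {
        append("documents.log.gz", "example document body".getBytes(StandardCharsets.UTF_8));
    }
}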
EDIT: more details
MongoDB does not support full-text search. You will have to look to other tools (Sphinx/Solr) for such things.
Large indices defeat the purpose of using an index.
According to your numbers, you are writing roughly 1.5M documents every 20 minutes, or about 4.5M per hour. Each document needs about 16+ bytes for an index entry: 12 bytes for the ObjectID + 4 bytes for the pointer into the 2GB file + 1 byte for the pointer to the file + some amount of padding.
Let's say that every index entry needs about 20 bytes; then your index is growing at roughly 90MB / hour, or a bit over 2GB / day. And that's just the default _id index.
Within a couple of weeks (depending on how much RAM you have), your main index will no longer fit into RAM and your performance will start to drop off dramatically. (This behaviour is well documented for MongoDB.)
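As a back-of-envelope check of those numbers (the 1Gbps rate, ~100KB documents and ~20-byte index entries come from the thread above; the 32GB of RAM is just an assumed example):

public class IndexGrowthEstimate {
    public static void main(String[] args) {
        // Rough assumptions taken from the question and answer above.
        double ingestBytesPerSec = 1_000_000_000 / 8.0;        // ~1 Gbps, about 125 MB/s
        double docSizeBytes      = 100_000;                    // ~100 KB per document
        double indexEntryBytes   = 20;                         // ~20 bytes per _id index entry
        double assumedRamBytes   = 32.0 * 1024 * 1024 * 1024;  // assumed example: 32 GB of RAM

        double docsPerHour       = ingestBytesPerSec / docSizeBytes * 3600;  // ~4.5M docs/hour
        double indexBytesPerDay  = docsPerHour * 24 * indexEntryBytes;       // ~2.2 GB/day
        double daysUntilRamFull  = assumedRamBytes / indexBytesPerDay;       // roughly two weeks

        System.out.printf("docs/hour: %.1fM%n", docsPerHour / 1e6);
        System.out.printf("_id index growth: %.1f GB/day%n", indexBytesPerDay / 1e9);
        System.out.printf("days until the _id index exceeds RAM: %.0f%n", daysUntilRamFull);
    }
}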
So it's going to be really important to figure out which queries you want to run.
Have a look at Cassandra. It executes writes much faster than reads. That's probably what you're looking for.

Why do DBS not adapt/tune their buffer sizes automatically?

I'm not sure whether there is a DBS that already does this, or whether this is indeed a useful feature, but:
There are a lot of suggestions on how to speed up DB operations by tuning buffer sizes. One example is importing Open Street Map data (the planet file) into a Postgres instance. There is a tool called osm2pgsql (http://wiki.openstreetmap.org/wiki/Osm2pgsql) for this purpose and also a guide that suggests to adapt specific buffer parameters for this purpose.
In the final step of the import, the database is creating indexes and (according to my understanding when reading the docs) would benefit from a huge maintenance_work_mem whereas during normal operation, this wouldn't be too useful.
This thread, http://www.mail-archive.com/pgsql-general@postgresql.org/msg119245.html , on the contrary suggests that a large maintenance_work_mem would not make too much sense during final index creation.
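For concreteness, the kind of manual per-task tuning being discussed looks roughly like this (a sketch over JDBC; the connection details, table and index names are made up and only stand in for osm2pgsql's final index-creation step):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class BulkIndexBuild {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection details, for illustration only.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/osm", "osm", "secret");
             Statement st = conn.createStatement()) {

            // Raise maintenance_work_mem for this session only, so the bulk index
            // build gets a large sort buffer without changing the server-wide
            // setting used during normal operation.
            st.execute("SET maintenance_work_mem = '2GB'");

            // Hypothetical index, standing in for the import's final index step.
            st.execute("CREATE INDEX planet_osm_ways_nodes_idx "
                     + "ON planet_osm_ways USING gin (nodes)");
        }
    }
}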
Ideally (imo), the DBS should know best which combination of buffer sizes it would profit from most, given a limited amount of total buffer memory.
So, are there some good reasons why there isn't a built-in heuristic that is able to adapt the buffer sizes automatically according to the current task?
The problem is the same as with any forecasting software: just because something happened historically doesn't mean it will happen again. Also, you need to complete a task in order to fully analyze how you should have done it more efficiently. The problem is that the next task is not necessarily anything like the previously completed task. So if your import routine needed 8GB of memory to complete, would it make sense to assign each read-only user 8GB of memory? The other way around wouldn't work well either.
By leaving this decision to humans, the database will exhibit performance characteristics that aren't optimal for all cases, but in return it lets us (the humans) optimize each case individually (if we like to).
Another important aspect is that most people/companies value reliable and stable levels over varying but potentially better levels. Having a high cost isn't as big a deal as having large variations in cost. This is of course not always true, as entire companies are built around hitting that 1% once in a while.
Modern databases already make some effort to adapt themselves to the tasks presented, such as increasingly sophisticated query optimizers. Oracle, at least, has the option to keep track of some of the measures that influence the optimizer's decisions (such as the cost of a single-block read, which varies with the current load).
My guess would be that it is awfully hard to get the knobs right by adaptive means. First you would have to query the machine for a lot of unknowns, like how much RAM it has available - but also the unknown "what else do you expect to run on the machine".
Barring that, even with only a max_mem_usage parameter to set, the problem is how to make a system that
adapts well to most typical loads,
doesn't have odd pathological problems with some loads, and
is somewhat comprehensible code without errors.
For PostgreSQL, however, the answer could also be:
Nobody has written it yet because other things are seen as more important.
You haven't written it yet.

How does Voldemort compare to Cassandra?

How does Voldemort compare to Cassandra?
I'm not talking about size of community and only want to hear from people who have actually used both.
I'm especially interested in:
How they dynamically scale when adding and removing nodes
Query performance
How they scale when adding nodes (linear)?
Write speed
Voldemort's support for adding nodes was just added recently (this month). So I would expect Cassandra's to be more robust given the longer time to cook and a larger community testing.
Both are fast (> 10k ops/s per machine). Because of their storage designs, I would expect Cassandra to be faster at writes, and Voldemort to be faster at reads. I would also expect Cassandra's performance to degrade less as the amount of data per node increases. And of course if you need more than just a key/value data model Cassandra's ColumnFamily model wins.
I don't know of any head-to-head benchmarks since the one done for NoSQL SF last June, which found Cassandra to be somewhat faster at whatever workload mix he was using. (The "vpork" talk from http://blog.oskarsson.nu/2009/06/nosql-debrief.html) 8 months is an eternity with projects under this much development, though.
Some additional comments:
Regarding write speed, Cassandra should be faster -- it is designed to be faster for writes than for reads (writes can avoid an immediate disk hit because of the specialized way storage is done).
But the main difference, I think, is actually not performance but feature set: Voldemort is strictly a key/value store (currently, anyway), whereas Cassandra can offer range queries (with the order-preserving partitioner) and a bit more structure around data (column families etc.). The former is an important consideration for design; the latter IMO less so, since you can always structure BLOB data on the client side.

Is it recommended practice to use uniform extent sizes in Oracle tablespaces?

I've been using Oracle for quite some time, since Oracle 8i was released. I was new to the database at that time and was taught that it was best to use constant extent sizes when defining tablespaces.
From what I have read, it seems that today, using 10g/11g, Oracle can manage these extent sizes for you automatically, and that it may not keep extent sizes constant. I can easily see how this can use disk space more efficiently, but are there downsides to this? I'm thinking it may be time to let go of the past on this one (assuming my past teaching was correct in the first place).
Yes, except for very unusual cases it's time to let go of the past and use the new Oracle extent management features. Use locally-managed tablespaces (LMTs) and auto extent sizing and you won't have to think about this stuff again.
As a DBA, the variable extent sizing worried me at first, since in the 7.3 days I spent a lot of time reorganizing tablespaces to eliminate the fragmentation that resulted from extent allocation with non-zero percent increases (and you needed non-zero percent increases because your maximum number of extents was capped at different levels depending on the database block size used when you created the database). However, Oracle uses an algorithm to determine the rate and magnitude of extent size increases that effectively eliminates fragmentation.
Also, forget anything you have heard about how the optimal configuration is to have a table or index fit into a single extent, or that you can somehow manage I/O through extent configuration - this has never been true. In the days of dictionary-managed tablespaces there was probably some penalty to having thousands of extents managed in a dictionary table, but LMTs use bitmaps, so this is not an issue. Oracle buffers blocks, not segment extents.
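As a concrete illustration of that advice (a sketch over JDBC; the tablespace name, datafile path, sizes and credentials are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class CreateAutoAllocatedTablespace {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection details, for illustration only.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//localhost:1521/ORCL", "system", "secret");
             Statement st = conn.createStatement()) {

            // Locally managed tablespace with system-allocated (variable) extent
            // sizes -- the "stop thinking about extents" configuration.
            st.execute("CREATE TABLESPACE app_data "
                     + "DATAFILE '/u01/oradata/ORCL/app_data01.dbf' SIZE 1G AUTOEXTEND ON "
                     + "EXTENT MANAGEMENT LOCAL AUTOALLOCATE "
                     + "SEGMENT SPACE MANAGEMENT AUTO");

            // Tables created in it need no STORAGE clause at all.
            st.execute("CREATE TABLE messages (msg VARCHAR2(100)) TABLESPACE app_data");
        }
    }
}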
If you have unlimited disk space with instant access time, you don't have to care about extents at all.
You just make every table INITIAL 100T NEXT 100T MAXEXTENTS UNLIMITED PCTINCREASE 0 and forget about extents for the next 300 years.
The problems arise when your disk space is not unlimited or access time varies.
Extents are intended to cope with data sparseness: when your data is fragmented, the HDD head has to jump from one place to another, which takes time.
The ideal situation is having all your data for each table to reside in one extent, while having data for the table you join most often to reside in the next extent, so everything can be read sequentially.
Note that access time also includes the time needed to figure out where your data resides. If your data is extremely sparse, extra lookups into the extent dictionaries are required.
Nowadays, disk space is not what matters, while access time still matters.
That's why Oracle created automatic extent management.
It is less efficient in terms of space used than a hand-crafted extent layout, but more efficient in terms of access time.
So if you have enough disk space (i.e., your database will take less than half of the disk for the next 5 years), just use automatic extents.
