AgensGraph recommended memory parameters and hardware sizing - graph-databases

I've been playing with a beta version (0.9) of AgensGraph for the last few weeks.
Right now I am testing the product on a VM (2 cores and 2 GB of memory), and planning to install it on a real server (x86_64 with 32 cores and 96 GB of memory).
While planning the installation, I am having trouble finding the proper parameters for the product.
Of course, since the product is based on PostgreSQL, I am pretty familiar with all the parameters the product uses. However, since we are talking about graph data (not relational data), I am not sure whether I can allocate server memory and size hardware the way I did previously on PostgreSQL servers.
It would be really helpful if someone could answer or provide guidelines for my questions about database parameter configuration and hardware sizing.
For your information, my test scenario is written as below:
os: CentOS 6.5
cpu: 2 x Intel Xeon CPU E5-2640 v3 @ 2.60GHz
total cores: 32 cores (w/ Hyper-Threading)
memory: 96 GB
data size: about 60 GB
concurrent users: 100

One of the good things about AgensGraph is that most operational experience from PostgreSQL carries over. So if you are already familiar with PostgreSQL and have experience operating it in production, that will be very helpful for configuring AgensGraph as well.
But as you also pointed out, AgensGraph is a graph database, and its workloads involve extensive random I/O. So to optimize query performance, allocate as much memory to shared buffers as you can and prewarm the database instance with your graph objects. AgensGraph can of course also exploit the filesystem cache, but performance is best when you can explicitly allocate enough shared buffers for the graph data and the graph data is cached there.
You can use the pg_prewarm extension to warm up the AgensGraph cache. You can refer to this link (https://www.postgresql.org/docs/9.6/static/pgprewarm.html) to find out how to use the extension.
If you want to warm up the cache with a graph, say 'test_graph', you can use the following query.
select pg_prewarm(c.oid)
from pg_class c
left join pg_namespace n on n.oid = c.relnamespace
where n.nspname = 'test_graph' and c.relkind in ('r', 'i');
This query warms up the cache with the heap pages and indexes of 'test_graph'. It is a rather long and verbose query, but I think AgensGraph will provide a simpler way to do this in the near future.
When the entire data set can be cached in memory, it is also recommended to set random_page_cost = 1. This parameter tells the planner that the cost of a random page read is the same as that of a sequential read. But because it influences the query optimizer's choice of plan, you should be careful when changing it.
And one last thing: if you have a lot of concurrent users, you should also be careful to balance the sizes of shared_buffers and work_mem. I cannot assume anything before analyzing your workload, but in general more concurrent clients means more use of work_mem. So if the total of shared_buffers plus work_mem across all sessions exceeds the size of physical memory, page faults (swapping) can occur, which you have to avoid.
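As a rough back-of-the-envelope check (this is only an illustration with placeholder values, not AgensGraph-specific guidance; adjust every figure to your own workload), you can sanity-check that the memory budget fits:

GB = 1024 ** 3
MB = 1024 ** 2

# All of the following values are assumptions for illustration only.
physical_ram    = 96 * GB   # total RAM of the target server
shared_buffers  = 64 * GB   # placeholder: large enough to cache the ~60 GB graph
work_mem        = 64 * MB   # placeholder per-sort/per-hash allocation
max_connections = 100       # expected concurrent users
sorts_per_query = 2         # pessimistic guess: work_mem can be used several times per query
os_and_misc     = 4 * GB    # headroom for the OS, WAL, maintenance_work_mem, etc.

worst_case = shared_buffers + max_connections * work_mem * sorts_per_query + os_and_misc
print(f"worst-case usage: {worst_case / GB:.1f} GB of {physical_ram / GB:.0f} GB")
assert worst_case < physical_ram, "shared_buffers/work_mem budget exceeds physical memory"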

Related

Handling large data in a C program

I am working on a project that runs queries on a database where the results are larger than the memory size. I have heard of memory pool libraries, but I'm not sure they are the best solution to this problem.
Do memory pool libraries support writing to and reading back from disk (for the result of a query that needs to be scanned many times)? Are there other ways to achieve this?
P.S.
I am using a MySQL database and its C API to access it.
EDIT: here's an example:
Suppose I have five tables, each having a million rows. I want to find how similar one table is to another, so I am creating a bloom filter for each table and then checking each filter against the data in the other four tables.
Extending your logical memory beyond the physical memory by using secondary storage (e.g. disks) is usually called swapping, not memory pooling. Your operating system already does it for you, and you should try letting it do its job first.
Memory pool libraries provide more speed and real-time predictability to memory allocation by using fixed-size allocation, but don't increase your actual memory.
You should restructure your program so that it does not use so much memory. Instead of pulling the whole DB (or a large part of it) into memory, you should use a cursor and incrementally update the data structure your program is maintaining, or incrementally compute the metric you are querying.
EDIT: you added that you might want to run a bloom filter on the tables?
Have a look at incremental bloom filters: here
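To make the cursor-plus-incremental-filter idea concrete, here is a minimal sketch in Python rather than C (the same streaming pattern maps to mysql_use_result()/mysql_fetch_row() in the C API); the table and column names, the credentials, and the toy bloom filter are all hypothetical:

import hashlib
import mysql.connector  # assumption: Python connector; the C API equivalent is mysql_use_result()

class BloomFilter:
    """Very small illustrative bloom filter (not production quality)."""
    def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, value):
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))

def build_filter(conn, table, column, batch_size=10_000):
    # Unbuffered cursor: rows stream from the server as they are consumed,
    # so the full result set never has to fit in client memory.
    bf = BloomFilter()
    cur = conn.cursor()
    cur.execute(f"SELECT {column} FROM {table}")
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        for (value,) in rows:
            bf.add(value)
    cur.close()
    return bf

# Hypothetical connection and table/column names.
conn = mysql.connector.connect(user="user", password="pw", database="mydb")
filter_t1 = build_filter(conn, "table1", "key_col")
# ...build filters for the other tables, then stream each table again and
# count how many of its keys appear to be present in the other filters.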
How about Physical Address Extension (PAE)?

Which NoSQL Database for Mostly Writing

I'm working on a system that will generate and store large amounts of data to disk. A previously developed system at the company used ordinary files to store its data but for several reasons it became very hard to manage.
I believe NoSQL databases are a good solution for us. What we are going to store is generally documents (usually around 100K, but occasionally much larger or smaller) annotated with some metadata. Query performance is not the top priority. The priority is writing in a way that I/O becomes as small a hassle as possible. The rate of data generation is about 1 Gbps, but we might move to 10 Gbps (or even more) in the future.
My other requirement is the availability of a (preferably well documented) C API. I'm currently testing MongoDB. Is this a good choice? If not, what other database system can I use?
The rate of data generation is about 1Gbps,... I'm currently testing MongoDB. Is this a good choice?
OK, so just to clarify, your data rate is ~1 gigaBYTE every 10 seconds. So you are filling a 1 TB hard drive every two to three hours?
MongoDB has pretty solid write rates, but it is ideally used in situations with a reasonably low data-to-RAM ratio. You want to keep at least the primary indexes in memory along with some data.
In my experience, you want about 1GB of RAM for every 5-10GB of Data. Beyond that number, read performance drops off dramatically. Once you get to 1GB of RAM for 100GB of data, even adding new data can be slow as the index stops fitting in RAM.
The big key here is:
What queries are you planning to run and how does MongoDB make running these queries easier?
Your data is very quickly going to occupy enough space that basically every query will just be going to disk. Unless you have a very specific indexing and sharding strategy, you end up just doing disk scans.
Additionally, MongoDB does not support compression. So you will be using lots of disk space.
If not, what other database system can I use?
Have you considered compressed flat files? Or possibly a big data Map/Reduce system like Hadoop (I know Hadoop is written in Java)
If C is a key requirement, maybe you want to look at Tokyo/Kyoto Cabinet?
EDIT: more details
MongoDB does not support full-text search. You will have to look to other tools (Sphinx/Solr) for such things.
Large indexes defeat the purpose of using an index.
According to your numbers, you are writing roughly 1,250 documents per second, i.e. about 1.5M documents / 20 mins or 4.5M / hour. Each document needs about 16+ bytes for an index entry: 12 bytes for the ObjectID + 4 bytes for the pointer into the 2GB file + 1 byte for the pointer to the file + some amount of padding.
Let's say that every index entry needs about 20 bytes; then your index is growing at roughly 90 MB / hour, or a bit over 2 GB / day. And that's just the default _id index.
Within a matter of days to weeks (depending on how much RAM you have and how much of it the data itself consumes), your main index will no longer fit into RAM and your performance will start to drop off dramatically. (This behavior is well documented for MongoDB.)
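Making the arithmetic explicit (the ingest rate, document size, and bytes per index entry below are just the figures from this thread, and the RAM size is a made-up example; plug in your own numbers):

data_rate_bytes_per_sec = 1e9 / 8        # ~1 Gbps, from the question
doc_size_bytes          = 100 * 1024     # "usually around 100K"
index_entry_bytes       = 20             # rough estimate for the default _id index
ram_bytes               = 32 * 2**30     # hypothetical 32 GB server

docs_per_sec   = data_rate_bytes_per_sec / doc_size_bytes
index_per_day  = docs_per_sec * index_entry_bytes * 86_400
days_until_ram = ram_bytes / index_per_day

print(f"{docs_per_sec:,.0f} docs/sec")
print(f"_id index grows ~{index_per_day / 2**30:.1f} GiB/day")
print(f"index alone exceeds RAM after ~{days_until_ram:.0f} days (ignoring the data itself)")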
So it's going to be really important to figure out which queries you want to run.
Have a look at Cassandra. It executes writes much faster than reads. That's probably what you're looking for.

Is Redis good for what I need

I have a website where users can submit text messages, dead simple data structure...
Name <-- Less than 20 characters
Message <-- Max 150 characters
Timestamp
IP
Hidden <-- Bool (True or False)
On the previous version of the website they were stored in a MySQL database which was very big, with lots of tables, and I want to simplify the database. I've heard Redis is good for simple data structures and non-relational information...
Would Redis be a good option for this kind of data, and how would it perform in terms of memory usage and read times when talking about 100,000+ records a year?
redis is really only good for in-memory problem sets. It DOES have a page-to-disk capability - but then you're at the mercy of the OS swapper - namely your RAM will be in competition with system caches. Also, I think the keys always have to fit in RAM. So you're NOT going to want to store 1G+ of log records - a mysql archive table is MUCH better for that.
redis has a master-slave functionality, similar to mysql. So you can perform various tricks such as sorting on a slave to keep the master responsive. While I haven't used it, I'd speculate that for in-memory databases, mysql-cluster is probably far more advanced - but with corresponding extra complexity / resource-costs.
If you have large values for your key-value set, you can perform client-side compression/decompression. There isn't much the server can do to search on the values of those 'blobs' anyway.
One common way to get around the RAM limitation is to perform client-side sharding (partitioning). Namely, if you KNOW your upper bounds, and you don't have enough RAM to throw at the problem for some reason (say you already have 64GB of RAM), then you can 'shard' based on the primary key. If it's a sequence counter, you can take the bottom 3 bits (or some hashing function + partition function) and distribute among 4, 8, 16, etc. server nodes. That scales linearly, though if you need to re-partition, it can be painful. You COULD take advantage of 'slots' in redis to start off with fewer machines: say 1 machine with 16 slots; then later, dump slots 7-15 and restore them on a different machine, and remap all the clients to point to the two machines (with the same slot numbers). And so forth to 16-way sharding, at which point you'd need to remap ALL your data to go to 32-way.
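A minimal sketch of what that client-side sharding can look like with redis-py (the node addresses, slot count, and key scheme are all hypothetical; nothing here is built into redis itself):

import zlib
import redis  # redis-py

# Hypothetical shard nodes; re-sharding means moving whole slots to new entries here.
NODES = [
    redis.Redis(host="10.0.0.1", port=6379),
    redis.Redis(host="10.0.0.2", port=6379),
]
NUM_SLOTS = 16  # fixed logical slots, so a later re-shard only moves whole slots

def node_for(key: str) -> redis.Redis:
    slot = zlib.crc32(key.encode()) % NUM_SLOTS   # hash the key into a slot
    return NODES[slot * len(NODES) // NUM_SLOTS]  # map slots onto the current node list

def save_message(msg_id: int, name: str, message: str) -> None:
    key = f"msg:{msg_id}"
    node_for(key).hset(key, mapping={"name": name, "message": message})

def load_message(msg_id: int) -> dict:
    key = f"msg:{msg_id}"
    return node_for(key).hgetall(key)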
Obviously first evaluate the command-set of redis to see if ALL your data-storage and reporting needs can be met. There are equivalents to "select * from foo for update", but they're not obvious. Not all RDBMS queries can be reproduced efficiently with key-value stores. But for simple natural-primary-key record-structures it should do fine.
Additionally, it should be easy to extend the redis command set to perform custom operations. Just keep in mind that it's designed around no-pause, single-threaded execution (which avoids locking / context-switching overhead).
But things I really like are the FIFOs, pub/sub, data-time-outs, atomic-mutations (inc/dec), lazy-sorting (e.g. on client with read-only nodes), maps of maps. It's simple enough that instead of using name-spaces, you just launch separate redis processes on different ports / UNIX-sockets (my preference if possible).
It's meant to replace memcached more than anything else, but has a very nice background persistent framework.

What to monitor on SQL Server

I have been asked to monitor SQL Server (2005 & 2008) and am wondering what are good metrics to look at? I can access WMI counters but am slightly lost as to how much depth is going to be useful.
Currently I have on my list:
user connections
logins per second
latch waits per second
total latch wait time
deadlocks per second
errors per second
Log and data file sizes
I am looking to be able to monitor values that will indicate a degradation of performance on the machine or a potential serious issue. To this end I am also wondering at what values some of these things would be considered normal vs problematic?
As I reckon it would probably be a really good question to have answered for the general community I thought I'd court some of you DBA experts out there (I am certainly not one of them!)
Apologies if a rather open ended question.
Ry
I would also monitor page life expectancy and your buffer cache hit ratio; see "Use sys.dm_os_performance_counters to get your Buffer cache hit ratio and Page life expectancy counters" for details.
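If you want to pull those two counters yourself, a sketch along these lines works (the connection string is a placeholder; note that the buffer cache hit ratio is exposed as a value/base counter pair that has to be divided):

import pyodbc

# Placeholder connection string; point it at your own instance.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;Trusted_Connection=yes")
cur = conn.cursor()

# Page life expectancy: seconds a page is expected to stay in the buffer pool.
cur.execute("""
    SELECT cntr_value
    FROM sys.dm_os_performance_counters
    WHERE counter_name = 'Page life expectancy'
      AND object_name LIKE '%Buffer Manager%'
""")
ple = cur.fetchone()[0]

# Buffer cache hit ratio = ratio counter divided by its base counter.
cur.execute("""
    SELECT counter_name, cntr_value
    FROM sys.dm_os_performance_counters
    WHERE counter_name IN ('Buffer cache hit ratio', 'Buffer cache hit ratio base')
      AND object_name LIKE '%Buffer Manager%'
""")
values = {name.strip(): value for name, value in cur.fetchall()}
hit_ratio = 100.0 * values["Buffer cache hit ratio"] / values["Buffer cache hit ratio base"]

print(f"Page life expectancy: {ple} s, buffer cache hit ratio: {hit_ratio:.1f}%")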
Late answer, but it may be of interest to other readers.
One of my colleagues had a similar problem and used this thread to help get him started.
He also ran into a blog post describing common causes of performance issues and instructions on what metrics should be monitored, besides the ones already mentioned here. These other metrics are:
• %Disk Time:
This counter indicates a disk problem, but must be observed in conjunction with the Current Disk Queue Length counter to be truly informative. Recall also that the disk could be a bottleneck prior to the %Disk Time reaching 100%.
• %Disk Read Time and the %Disk Write Time:
The %Disk Read Time and %Disk Write Time metrics are similar to %Disk Time, just showing the operations read from or written to disk, respectively. They are actually the Average Disk Read Queue Length and Average Disk Write Queue Length values presented in percentages.
• %Idle Time:
Measures the percentage of time the disk was idle during the sample interval. If this counter falls below 20 percent, the disk system is saturated. You may consider replacing the current disk system with a faster disk system.
• %Free Space:
Measures the percentage of free space on the selected logical disk drive. Take note if this falls below 15 percent, as you risk running out of free space for the OS to store critical files. One obvious solution here is to add more disk space.
If you would like to read the whole post, you may find it here:
http://www.sqlshack.com/sql-server-disk-performance-metrics-part-2-important-disk-performance-measures/
Use SQL Profiler to identify your top 10 (or more) queries. Create a performance baseline for these queries. Review current average execution times vs. your baseline, and alert if they rise significantly above it. You can also use this list to identify queries for possible optimization.
This attacks the problem at a higher level than just reviewing detailed stats, although those stats can also be useful. I have found this approach to work on any DBMS, including MySQL and Oracle. If your top query times start to go up, you can bet you are starting to run into performance issues, which you can then start to drill into in more detail.
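As a sketch of what that baseline check can look like against sys.dm_exec_query_stats (assumes SQL Server 2008+ for query_hash; the connection string, baseline values, and alert threshold are all made up):

import pyodbc

BASELINE_MS = {"0x9A3B...": 120.0}   # hypothetical baseline: query_hash -> expected avg ms
ALERT_FACTOR = 2.0                   # alert when a query averages 2x its baseline

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;Trusted_Connection=yes")
cur = conn.cursor()
cur.execute("""
    SELECT TOP 10
           CONVERT(varchar(34), query_hash, 1)            AS query_hash,
           total_elapsed_time / 1000.0 / execution_count  AS avg_ms
    FROM sys.dm_exec_query_stats
    ORDER BY total_elapsed_time DESC
""")
for query_hash, avg_ms in cur.fetchall():
    baseline = BASELINE_MS.get(query_hash)
    if baseline and avg_ms > ALERT_FACTOR * baseline:
        print(f"ALERT: {query_hash} averaging {avg_ms:.0f} ms vs baseline {baseline:.0f} ms")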
Budget permitting, it's worth looking at some 3rd party tools to help. We use Idera's SQL Diagnostic Manager to monitor server health and Confio's Ignite to keep an eye on query performance. Both products have served us well in our shop.
Percent CPU utilization and Average disk queue lengths are also pretty standard. CPUs consistently over 80% indicates you may need more or better CPUs (and servers to house them); Consistently over 2 on any disk queue indicates you have a disk I/O bottleneck on that drive.
You should monitor the total pages allocated to a particular process. You can get that information by querying the system DMVs, for example:
-- Pages allocated per session/task, together with the currently running statement.
SELECT s.session_id,
       s.login_name,
       tsu.user_objects_alloc_page_count
         + tsu.internal_objects_alloc_page_count AS pages_allocated,
       TSQL.text AS current_sql
FROM sys.dm_exec_sessions s
LEFT JOIN sys.dm_exec_connections c
    ON s.session_id = c.session_id
LEFT JOIN sys.dm_db_task_space_usage tsu
    ON tsu.session_id = s.session_id
LEFT JOIN sys.dm_os_tasks t
    ON t.session_id = tsu.session_id
    AND t.request_id = tsu.request_id
LEFT JOIN sys.dm_exec_requests r
    ON r.session_id = tsu.session_id
    AND r.request_id = tsu.request_id
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) TSQL
ORDER BY pages_allocated DESC;
The following post explains really well how you can use it to monitor your server when nothing else works:
http://tsqltips.blogspot.com/2012/06/monitor-current-sql-server-processes.html
Besides the performance metrics suggested above, I strongly recommend monitoring available memory, Batch Requests/sec, SQL Compilations/sec, and SQL Recompilations/sec. All are available in the sys.dm_os_performance_counters view and in Windows Performance Monitor.
As for
ideally I'd like to organise monitored items into 3 categories, say 'FYI', 'Warning' & 'Critical'
There are many third party monitoring tools that enable you to create alerts of different severity level, so once you determine what to monitor and what are recommended values for your environment, you can set low, medium, and high alerts.
Check Brent Ozar's article on not so useful metrics here.

Column Stores: Comparing Column Based Databases

I've really been struggling to make SQL Server into something that, quite frankly, it will never be. I need a database engine for my analytical work. The DB needs to be fast and does NOT need all the logging and other overhead found in typical databases (SQL Server, Oracle, DB2, etc.)
Yesterday I listened to Michael Stonebraker speak at the Money:Tech conference and I kept thinking, "I'm not really crazy. There IS a better way!" He talks about using column stores instead of row oriented databases. I went to the Wikipedia page for column stores and I see a few open source projects (which I like) and a few commercial/open source projects (which I don't fully understand).
My question is this: in an applied analytical environment, how do the different column-based DBs differ? How should I be thinking about them? Does anyone have practical experience with multiple column-based systems? Can I leverage my SQL experience with these DBs, or am I going to have to learn a new language?
I am ultimately going to be pulling data into R for analysis.
EDIT: I was requested for some clarification in what exactly I am trying to do. So, here's an example of what I would like to do:
Create a table that has 4 million rows and 20 columns (5 dims, 15 facts). Create 5 aggregation tables that calculate max, min, and average for each of the facts. Join those 5 aggregations back to the starting table. Now calculate the percent deviation from mean, percent deviation from min, and percent deviation from max for each row and add them to the original table. This table does not get new rows each day; it gets TOTALLY replaced and the process is repeated. Heaven forbid if the process must be stopped. And the logs... ohhhhh the logs! :)
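For concreteness, this is roughly the computation being described, sketched with pandas on a scaled-down table (the column names and sizes are made up; the question itself is about doing this at full scale inside a database):

import numpy as np
import pandas as pd

# Scaled-down stand-in for the real table (which has 5 dims, 15 facts, 4M rows).
N = 100_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "dim1": rng.integers(0, 10, N),
    "dim2": rng.integers(0, 10, N),
    "fact1": rng.normal(100, 15, N),
    "fact2": rng.normal(50, 5, N),
})
dims, facts = ["dim1", "dim2"], ["fact1", "fact2"]

# "Join the aggregations back to the starting table" is a groupby + transform:
grouped = df.groupby(dims)[facts]
for agg in ("mean", "min", "max"):
    stats = grouped.transform(agg)
    for fact in facts:
        df[f"{fact}_pct_dev_{agg}"] = (df[fact] - stats[fact]) / stats[fact]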
The short answer is that for analytic data, a column store will tend to be faster, with less tuning required.
A row store, the traditional database architecture, is good at inserting small numbers of rows, updating rows in place, and querying small numbers of rows. In a row store, these operations can be done with one or two disk block I/Os.
Analytic databases typically load thousands of records at a time; sometimes, as in your case, they reload everything. They tend to be denormalized, so have a lot of columns. And at query time, they often read a high proportion of the rows in the table, but only a few of these columns. So, it makes sense from an I/O standpoint to store values of the same column together.
Turns out that this gives the database a huge opportunity to do value compression. For instance, if a string column has an average length of 20 bytes but has only 25 distinct values, the database can compress to about 5 bits per value. Column store databases can often operate without decompressing the data.
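A tiny worked example of that dictionary-encoding arithmetic (the figures are just the ones from the paragraph above, not measurements from any particular engine):

import math

distinct_values = 25
avg_len_bytes   = 20
rows            = 4_000_000   # e.g. the table size from the question

bits_per_code = math.ceil(math.log2(distinct_values))                       # 5 bits per value
raw_size      = rows * avg_len_bytes                                        # ~80 MB uncompressed
encoded_size  = rows * bits_per_code / 8 + distinct_values * avg_len_bytes  # codes + dictionary
print(f"{bits_per_code} bits/value -> roughly {raw_size / encoded_size:.0f}x smaller for this column")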
Often in computer science there is an I/O versus CPU time tradeoff, but in column stores the I/O improvements often improve locality of reference, reduce cache paging activity, and allow greater compression factors, so that CPU gains also.
Column store databases also tend to have other analytic-oriented features like bitmap indexes (yet another case where better organization allows better compression, reduces I/O, and allows algorithms that are more CPU-efficient), partitions, and materialized views.
The other factor is whether to use a massively parallel processing (MPP) database. There are MPP row-store and column-store databases. MPP databases can scale up to hundreds or thousands of nodes and let you store humongous amounts of data, but they sometimes come with compromises such as a weaker notion of transactions or a not-quite-SQL query language.
I'd recommend that you give LucidDB a try. (Disclaimer: I'm a committer to LucidDB.) It is an open-source column-store database, optimized for analytic applications, and it also has other features such as bitmap indexes. It currently only runs on one node, but it utilizes several cores effectively and can handle reasonable volumes of data without much effort.
4 million rows times 20 columns times 8 bytes for a double is 640 MB. Following the rule of thumb that R creates three temporary copies for every object, we get to around 2 GB. That is not a lot by today's standards.
So this should be doable in memory on a suitable 64-bit machine with a 'decent' amount of RAM (say 8 GB or more). Installing Ubuntu or Debian (possibly in the server version) can be done in a few minutes.
I have some experience with Infobright Community Edition --- a column-oriented DB based on MySQL.
Pros:
you can use MySQL interfaces/ODBC MySQL drivers, from R too
fast enough queries when selecting big chunks of data (thanks to the Knowledge Grid & data packs)
very fast native data loader and connectors for ETL (Talend, Kettle)
optimized for exactly the operations that I (and, I think, most of us) use (selection by factor levels, joining, etc.)
a special "lookup" option for optimized storage of R factor variables ;) (OK, char/varchar variables with a relatively small number of levels relative to the number of rows)
FOSS
paid support option
?
Cons:
no insert/update operations in Community edition (yet?), data loading only via native data loader/ETL connectors
no official UTF-8 support (collation/sorting etc.), planned for Q3 2009
no functions in aggregate queries (e.g. select month(date) from ...) yet, planned for July(?) 2009; but because of column storage, I prefer to simply create date columns for every aggregation level (week number, month, ...) I need
cannot be installed on an existing MySQL server as a storage engine (because of its own optimizer, if I understood correctly), but you may install Infobright & MySQL on different ports if you need to
?
Summary:
A good FOSS solution for daily analytical tasks, and, I think, for your tasks as well.
Here are my 2 cents: SQL Server does not scale well. We attempted to use SQL Server to store financial data in real time (i.e. price ticks coming in for 100 symbols). It worked perfectly for the first 2 weeks - then it went slower and slower as the database size increased, and finally ground to a halt, too slow to insert each price as it was received. We tried to work around it by moving data from the active database to offline storage every night, but ultimately the project was abandoned as it just didn't work.
Bottom line: if you're planning on storing a lot of data ( >1GB) you need something that scales properly, and that probably means a column database.
It looks like an implementation change (2-D array in column-major order, instead of row-major order), rather than an interface change.
Think "strategy" pattern, rather than being an entire paradigm shift. Of course, I've never used these products, so they may in fact force a paradigm shift down your throat. I don't know why, though.
We might be better able to help you reach an informed decision if you described [1] your specific goal and [2] the issues you're running into with SQL Server.
