Database system for tracking user actions - database

My current problem is to track the last 250 actions of each user using an app - these need to be stored in some kind of database, with reasonably fast access and fairly fast writes.
Up till now, I have just saved all actions of all users to a database, but the size of the database is completely blowing up. All messages older than 250 messages (per user) can be discarded, but SQL based libraries do not provide a reasonable way to do this.
Optimally, the database system would use circular buffers of non-fixed length. The problem with pre-allocating all the space required will be impossible (over 90% of users only ever perform < 100 actions, meaning pre-allocated space would be unfeasible for memory reasons). Additionally, for memory reasons, the size of entries needs to be dynamic, since allocating the maximum message length for each message will cause alot of empty space.
I was considering writing such a database system myself, using small (256byte) linked equally-sized chunks of data and keeping track of empty spaces. Obviously I would prefer a pre-made solution, since writing an own database system is time-consuming and error-prone.
Is there any system that - more or less - does what I intend to do? If not, what approach is the best towards writing such a database system myself?

Related

Is Redis good for what I need

I have a website where users can submit text messages, dead simple data structure...
Name <-- Less than 20 characters
Message <-- Max 150 characters
Timestamp
IP
Hidden <-- Bool (True or False)
On the previous version of the website they are stored in MySQL database which is very big, lots of tables, and am wanting to simplify the database. So I heard Redis is good for simple data structures and non relational information...
Would Redis be a good option for this kind of data and how would it perform, with memory usage and read times when talking about 100,000+ records a year...
redis is really only good for in-memory problem sets. It DOES have a page-to-disk capability - but then you're at the mercy of the OS swapper - namely you're RAM will be in competition with system-caches. Also, I think the keys always have to fit in RAM. So you're NOT going to want to store 1G+ log records - mysql-archive-table is MUCH better for that.
redis has a master-slave functionality, similar to mysql. So you can perform various tricks such as sorting on a slave to keep the master responsive. While I haven't used it, I'd speculate that for in-memory databases, mysql-cluster is probably far more advanced - but with corresponding extra complexity / resource-costs.
If you have large values for your key-value set, you can perform client-side compression/decompression. There isn't much the server can do to search on the values of those 'blobs' anyway.
One common way to get around the RAM limitation is to perform client-side sharding (partitioning). Namely, if you KNOW your upper bounds, and you don't have enough RAM to throw at the problem for some reason (say you already have 64GB of RAM), then you could 'shard' based on the primary key.. If it's a sequence counter, you could take the bottom 3 bits (or some hashing function + partition function), and distribute amongst 4,8,16, etc server nodes. That scales linearly, though if you need to re-partition, that could be painful. You COULD take advantage of the 'slots' in redis to start off with fewer machines.. Say 1 machine with 16 slots.. Then later, dump slots 7-15 and restore on a different machine and remap all the clients to point to the two machines (with the same slot numbers). And so forth to 16-way sharding. At which point, you'd need to remap ALL your data to go to 32-way.
Obviously first evaluate the command-set of redis to see if ALL your data-storage and reporting needs can be met. There are equivalents to "select * from foo for update", but they're not obvious. Not all RDBMS queries can be reproduced efficiently with key-value stores. But for simple natural-primary-key record-structures it should do fine.
Additionally, it should be easy to extend the redis command-set to perform custom operations.. Just keep in mind, it's designed around no-pause single-threaded execution (avoids locking /context-switching overhead).
But things I really like are the FIFOs, pub/sub, data-time-outs, atomic-mutations (inc/dec), lazy-sorting (e.g. on client with read-only nodes), maps of maps. It's simple enough that instead of using name-spaces, you just launch separate redis processes on different ports / UNIX-sockets (my preference if possible).
It's meant to replace memcached more than anything else, but has a very nice background persistent framework.

For millions of objects, is it better to store in an array or a database like redis if the objects are needed in realtime?

I am developing a simulation in which there can be millions of entities that can interact with each other. At the moment, all the entities are stored in a list. Would it be better to store the objects in a database like redis instead of a list?
Note: I assumed this was being implemented in Java (force of habit). My answer is not terribly useful if it is not Java.
Making lots of assumptions about your requirements, I'd consider Redis if:
You are running into unacceptable GC pauses as a result of your millions of objects OR
The entities you create can be reused across multiple simulation runs
Java apps with giant heaps and lots of long-lived objects can run into very long GC pauses, depending on work-load. i.e. the old gen fills up with all these millions of objects and they're never eligible for collection. Regardless, periodically a full collect will happen (unless you're a GC tuning master) and have to scan these millions of objects in the old gen. This can take many seconds each time it happens, and you're frozen during that time. If this is happening and you don't like it, you could off-load all these long-lived objects to Redis, and pay the serialize/deserialize cost of accessing them rather than the GC pauses.
On the other point about reusing entities: if you're loading up a big Redis db and then dropping all its data when the simulation ends, it feels a bit wasteful. If you can re-use entities across simulation runs you might save yourself a bunch of time by persisting them in Redis.
The best choice depends on a number of factors, including how you access data, whether it will fit in memory, and what the distribution of accesses looks like. As a broad generalization, keeping data in memory is always faster than on disk, and keeping it in-process is faster than keeping it elsewhere.
If your data fits in memory, is accessed in a manner that means you can use basic data structures like lists/arrays and hashtables efficiently, and all items are accessed roughly equally often, keeping your data in memory is probably the best option.
If your data fits in memory, but you need to access it in complex ways, you may be best choosing a datastore like redis that supports in-memory databases.
If your data doesn't fit in memory, or you have a very uneven access pattern such that evicting the least used data to disk might allow other things to be loaded, speeding up your task in general, a regular disk-based datastore may be a better choice.
A list is not necessarily the best data structure unless "interaction" is limited to the respective next or previous element. Random access (by index) is very slow on a list.
Lists rocket at inserting at front and end, and at finding the next (or previous) element, or inserting one in between. They totally blow for accessing element 164553 and then element 10657, being O(N) on random access. Thus "interact with each other" suggests that list is a bad choice.
It very much depends on the access and allocation patterns, but a vector or deque will likely be much better suited than a list for your simulation.
Redis is based on a hash table, which has a (much!) better characteristic for random access, but it will most likely still be slower, because it has considerable overhead for you serializing the data, it going through a socket, redis unserializing and analyzing it, sending a reply, and you parsing that.

Why do I have to set the max length of every single text column in the database?

Why is it that every RDBMS insists that you tell it what the max length of a text field is going to be... why can't it just infer this information form the data that's put into the database?
I've mostly worked with MS SQL Server, but every other database I know also demands that you set these arbitrary limits on your data schema. The reality is that this is not particulay helpful or friendly to work with becuase the business requirements change all the time and almost every day some end-user is trying to put a lot of text into that column.
Does any one with some inner working knowledge of a RDBMS know why we just don't infer the limits from the data that's put into the storage? I'm not talking about guessing the type information, but guessing the limits of a particular text column.
I mean, there's a reason why I don't use nvarchar(max) on every text column in the database.
Because computers (and databases) are stupid. Computers don't guess very well and, unless you tell them, they can't tell that a column is going to be used for a phone number or a copy of War and Peace. Obviously, the DB could be designed to so that every column could contain an infinite amount of data -- or at least as much as disk space allows -- but that would be a very inefficient design. In order to get efficiency, then, we make a trade-off and make the designer tell the database how much we expect to put in the column. Presumably, there could be a default so that if you don't specify one, it simply uses it. Unfortunately, any default would probably be inappropriate for the vast majority of people from an efficiency perspective.
This post not only answers your question about whether to use nvarchar(max) everywhere, but it also gives some insight into why databases historically didn't allow this.
It has to do with speed. If the max size of a string is specified you can optimize the way information is stored for faster i/o on it. When speed is key the last thing you want is a sudden shuffling of all your data just because you changed a state abbreviation to the full name.
With the max size set the database can allocate the max space to every entity in that column and regardless of the changes to the value no address space needs to change.
This is like saying, why can't we just tell the database we want a table and let it infer what type and how many columns we need from the data we give it.
Simply, we know better than the database will. Supposed you have a one in a million chance of putting a 2,000 character string into the database, most of the time, it's 100 characters. The database would probably blow up or refuse the 2k character string. It simply cannot know that you're going to need 2k length if for the first three years you've only entered 100 length strings.
Also, the length of the characters are used to optimize row placement so that rows can be read/skipped faster.
I think it is because the RDBMS use random data access. To do random data access, they must know which address in the hard disk they must jump into to fastly read the data. If every row of a single column have different data length, they can not infer what is the start point of the address they have to jump directly to get it. The only way is they have to load all data and check it.
If RDBMS change the data length of a column to a fixed number (for example, max length of all rows) everytime you add, update and delete. It is an extremely time consuming
What would the DB base its guess on? If the business requirements change regularly, it's going to be just as surprised as you. If there's a reason you don't use nvarchar(max), there's probably a reason it doesn't default to that as well...
check this tread http://www.sqlservercentral.com/Forums/Topic295948-146-1.aspx
For the sake of an example, I'm going to step into some quicksand and suggest you compare it with applications allocating memory (RAM). Why don't programmers ask for/allocate all the memory they need when the program starts up? Because often they don't know how much they'll need. This can lead to apps grabbing more and more memory as they run, and perhaps also releasing memory. And you have multiple apps running at the same time, and new apps starting, and old apps closing. And apps always want contiguous blocks of memory, they work poorly (if at all) if their memory is scattered all over the address space. Over time, this leads to fragmented memory, and all those garbage collection issues that people have been tearing their hair out over for decades.
Jump back to databases. Do you want that to happen to your hard drives? (Remember, hard drive performance is very, very slow when compared with memory operations...)
Sounds like your business rule is: Enter as much information as you want in any text box so you don't get mad at the DBA.
You don't allow users to enter 5000 character addresses since they won't fit on the envelope.
That's why Twitter has a text limit and saves everyone the trouble of reading through a bunch of mindless drivel that just goes on and on and never gets to the point, but only manages to infuriate the reader making them wonder why you have such disreguard for their time by choosing a self-centered and inhumane lifestyle focused on promoting the act of copying and pasting as much data as the memory buffer gods will allow...

Comment post scalability: Top n per user, 1 update, heavy read

Here's the situation. Multi-million user website. Each user's page has a message section. Anyone can visit a user's page, where they can leave a message or view the last 100 messages.
Messages are short pieces of txt with some extra meta-data. Every message has to be stored permanently, the only thing that must be real-time quick is the message updates and reading (people use it as chat). A count of messages will be read very often to check for changes. Periodically, it's ok to archive off the old messages (those > 100), but they must be accessible.
Currently all in one big DB table, and contention between people reading the messages lists and sending more updates is becoming an issue.
If you had to re-architect the system, what storage mechanism / caching would you use? what kind of computer science learning can be used here? (eg collections, list access etc)
Some general thoughts, not particular to any specific technology:
Partition the data by user ID. The idea is that you can uniformly divide the user space to distinct partitions of roughly the same size. You can use an appropriate hashing function to divide users across partitions. Ultimately, each partition belongs on a separate machine. However, even on different tables/databases on the same machine this will eliminate some of the contention. Partitioning limits contention, and opens the door to scaling "linearly" in the future. This helps with load distribution and scale-out too.
When picking a hashing function to partition the records, look for one that minimizes the number of records that will have to be moved should partitions be added/removed.
Like many other applications, we could assume the use of the service follows a power law curve: few of the user pages cause much of the traffic, followed by a long tail. A caching scheme can take advantage of that. The steeper the curve, the more effective caching will be. Given the short messages, if each page shows 100 messages, and each message is 100 bytes on average, you could fit about 100,000 top-pages in 1GB of RAM cache. Those cached pages could be written lazily to the database. Out of 10 Mil users, 100,000 is in the ballpark for making a difference.
Partition the web servers, possibly using the same hashing scheme. This lets you hold separate RAM caches without contention. The potential benefit is increasing the cache size as the number of users grows.
If appropriate for your environment, one approach for ensuring new messages are eventually written to the database is to place them in a persistent message queue, right after placing them in the RAM cache. The queue suffers no contention, and helps ensure messages are not lost upon machine failure.
One simple solution could be to denormalize your data, and store pre-calculated aggregates in a separate table, e.g. a MESSAGE_COUNTS table which has a column for the user ID and a column for their message count. When the main messages table is updated, then re-calculate the aggregate.
It's just shifting the bottleneck from one place to another, but it might move it somewhere that's less of a burden.

How to store and compress data for real time data logging?

When developing software that records input signals (numbers) in real time, how can this data be best stored and compressed? Would an SQL engine be good for this, permitting fast data mining in the future, or are there other data formats that would be suitable or compressed enough for upto 1000 data samples per second?
I don't mind building in VC++ but ideas applicable to C# would be ideal.
It is hard to say without more info, such as, what is the source, will you be needing to query the stored data, and so on.
But for 1000 samples/sec, you should propably look at holding a few seconds of data in memory, and then writing them out in bulk to persistent storage on another thread. (Multi-processor machine recommended).
If you decide to do it via a managed language, keep the same data structure around for keeping the samples - so that the GC does not need to collect memory too often. You can get marginally better performance by using pointers and the unsafe keyword (provides direct access to the memory structure and eliminates bounds checking code for arrays).
I don't know how much CPU time is needed for you to collect each sample; and how time-critical it is to read each sample at a specified time (will they be buffered in the device you are reading from ?). If the sampling is time-critical, you have 1 ms per sample; and then you probably cannot afford the risk of the garbage collector kicking in, as it will block your thread for some time. In this case, I would go for an unmanaged approach.
SQL Server would easily be able to hold your data, or you could write them to a file. It mostly depends on what you need to do with the data at a later time. I don't know how much data each sample is, but let's assume it is 8 bytes. Then you have 8000 bytes per second to write of raw data - perhaps you have some overhead, so it could be 10 kB/s. Most storage mechanisms I can think of will be able to write data at this speed. Just make sure to write on another thread than the one that are doing the sampling.
You may want to look at time-series databases, rather than relational. These will be optimised to deal with the sort of data and usage you're considering.
Kx is a popular choice, as is Fame.

Resources