Fast read-only embedded "database"?

I'm looking to distribute some information to different machines for efficient and extremely fast access without any network overhead. The data exists in a relational schema, and it is a requirement to "join" on relations between entities, but it is not a requirement to write to the database at all (it will be generated offline).
I had a lot of confidence that SQLite would deliver on performance, but an RDBMS seems unsuitable at a fundamental level: joins are very expensive due to the cost of index lookups, and in my read-only context they are unnecessary overhead, since entities could store direct references to each other in the form of file offsets. In this way, an index lookup is swapped for a file seek.
What are my options here? Database doesn't really seem to describe what I'm looking for. I'm aware of Neo4j, but I can't embed Java in my app.
TIA!
Edit, to answer the comments:
The data will be up to 1 GB in size, and I'm using PHP, so keeping the data in memory is not really an option. I will rely on the OS buffer cache to avoid continually going to disk.
An example would be a Product table with 15 fields of mixed types, and a query to list products with a certain make, joining on a Category table.
The solution will have to be some kind of flat file. I'm wondering if there already exists some software that meets my needs.
@Mark Wilkins:
The performance problem is measured. Essentially, it is unacceptable in my situation to replace a 2ms IO-bound query to Memcache with a 5ms CPU-bound call to SQLite... For example, the categories table has 500 records, containing parent and child categories. The following query takes ~8ms, with no disk IO: SELECT 1 FROM categories a INNER JOIN categories b ON b.id = a.parent_id. Some simpler, join-less queries are very fast.

I may not be completely clear on your goals as to the types of queries you are needing. But the part about storing file offsets to other data seems like it would be a very brittle solution that is hard to maintain and debug. There might be some tool that would help with it, but my suspicion is that you would end up writing most of it yourself, and if someone else had to come along later to debug and figure out a homegrown file format, it would be even more work.
However, my first thought is to wonder if the described performance problem is estimated at this point or actually measured. Have you run the tests with the data in a relational format to see how fast it actually is? It is true that a join will almost always involve more file reads (do the binary search as you mentioned, then get the associated record information, then look up that record). This could take 4 or 5 or more disk operations ... at first. But in the categories table (from the OP), it could end up cached if it is commonly hit. This is a complete guess on my part, but in many situations the number of categories is relatively small. If that is the case here, the entire category table and its index may stay cached in memory by the OS and thus result in very fast joins.
If the performance is indeed a real problem, another possibility might be to denormalize the data. In the categories example, just duplicate the category value/name and store it with each product record. The database size will grow as a result, but you could still use an embedded database (there are a number of possibilities). If done judiciously, it could still be maintained reasonably well and provide the ability to read the full object with one lookup/seek and one read.
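For illustration, here is a minimal sketch of what that denormalization could look like, loosely based on the Product/Category example in the question (all table and column names are made up):

    -- Denormalized: the category name is copied into each product row,
    -- so listing products by make no longer needs a join.
    CREATE TABLE products (
        id            INTEGER PRIMARY KEY,
        make          TEXT,
        model         TEXT,
        category_id   INTEGER,   -- kept only for regenerating the file offline
        category_name TEXT       -- duplicated from the categories table
    );
    CREATE INDEX idx_products_make ON products(make);

    -- The example query from the question becomes a single-table, single-index lookup:
    SELECT id, model, category_name
    FROM products
    WHERE make = 'Acme';

The file gets bigger, but every product row is now self-contained.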

In general, probably the fastest thing you can do at first is to denormalize your data, thus avoiding JOINs and other multi-table lookups.
Using SQLite you can certainly customize all sorts of things and tailor them to your needs. For example, disable all mutexing if you're only accessing via one thread, raise the memory cache size, customize indexes (including getting rid of many), or do a custom build to strip out unnecessary metadata, debugging code, etc.
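For example, a read-only, single-threaded workload could open its connection with settings along these lines (a rough sketch; availability of some pragmas such as mmap_size depends on the SQLite version and build, and the values would need benchmarking against your data):

    PRAGMA query_only = ON;          -- refuse writes entirely
    PRAGMA journal_mode = OFF;       -- no rollback journal needed for a read-only file
    PRAGMA synchronous = OFF;        -- no fsync; the file never changes
    PRAGMA locking_mode = EXCLUSIVE; -- take the file lock once instead of per-query
    PRAGMA temp_store = MEMORY;
    PRAGMA cache_size = -262144;     -- ~256 MB page cache (negative value = KiB)
    PRAGMA mmap_size = 1073741824;   -- memory-map up to 1 GB of the database file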
Take a look at the following:
PRAGMA Statements: http://www.sqlite.org/pragma.html
Custom Builds of SQLite: http://www.sqlite.org/custombuild.html
SQLite Query Planner: http://www.sqlite.org/optoverview.html
SQLite Compile Options: http://www.sqlite.org/compile.html
This is all of course assuming a database is what you need.

Related

Algorithms: when caching 5 datasources at different speeds, how to keep cross-datasource data up to date?

I'm getting 5 different pieces of data on a regular basis from the internet. It's not quite price data, but it sort of is, so I'm hoping using that as an example would help here.
Suppose I get prices for all products from 5 stores. In each individual store's dataset there are various differences. Sometimes some products are missing. Sometimes one product is broken up into sub-products, or bundled, or ... Going from one "get" to the next, products may disappear, or appear.
Now I want to calculate and keep up-to-date a set of data that are summaries of all 5 datasets. So, say, I want to keep track of the cheapest DVD player in all 5 datasets. How would I go about this? Or the average price of all DVD players. Or ...
How would you go about architecting a database that does this sort of thing?
"Now I want to calculate and keep up-to-date a set of data that are summaries of all 5 datasets."
At one extreme, you just query the data when the user asks for it. Live with the consequences. (Queries can be slow, depending on a lot of factors. Materialized views might help.)
At the other extreme, you calculate everything, say, once a day. Live with the consequences. (Data might be 24 hours out of date.)
The basic database structure is the same regardless. If you calculate everything, say, once a day, you'll probably want one or more additional tables to store the calculated data.
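As an illustration of the second approach, the calculated data could live in a summary table that a scheduled job rewrites, something like the sketch below (generic SQL; the table and column names are invented, and functions like NOW() vary by engine):

    CREATE TABLE product_price_summary (
        product_type  VARCHAR(50),
        calculated_at TIMESTAMP,
        min_price     DECIMAL(10,2),
        avg_price     DECIMAL(10,2),
        store_count   INTEGER,
        PRIMARY KEY (product_type, calculated_at)
    );

    -- Run once a day (or as often as you can afford):
    INSERT INTO product_price_summary
    SELECT product_type, NOW(), MIN(price), AVG(price), COUNT(DISTINCT store_id)
    FROM store_prices            -- the merged data from the 5 sources
    GROUP BY product_type;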
Caching won't solve your problem, because
you need to keep the result sets up to date,
the underlying data is constantly changing, and
it takes too long to recalculate.
It seems your only options are faster queries, or living with the consequences of your situation. Faster queries begin with reading the execution plans. Every dbms from SQLite to Oracle includes a tool that tells you what the dbms thinks is the best way to go about returning the results you want.
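In SQLite, for instance, that tool is EXPLAIN QUERY PLAN; most other engines have an equivalent EXPLAIN statement. The query below is only a placeholder to show the shape of the check (the table name is invented):

    -- Shows which indexes, if any, the planner intends to use.
    EXPLAIN QUERY PLAN
    SELECT product_type, MIN(price)
    FROM store_prices
    WHERE product_type = 'dvd player'
    GROUP BY product_type;

    -- MySQL and PostgreSQL use a similar form:
    -- EXPLAIN SELECT ... ;

If the output shows full table scans where you expected index lookups, that is where the tuning starts.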
You should approach this as a tuning problem. Since we don't know what your real problem is--we only see the simplified problem you wrote about--we can only give simplified suggestions.
Your dbms undoubtedly has documentation about most of these topics, if not all of them. Some of these things can be done at more than one level. DBMS support is uneven. For example, some support memory tuning at the query and session level, others only at the database or server level.
Query tuning--look for sargable WHERE clauses, rewrite or eliminate correlated subqueries, set transaction isolation levels, run different queries (maybe using async methods like AJAX), etc.
Session tuning--per-session memory, session-level defaults for queries, etc.
Table tuning--restructure, eliminate joins caused by surrogate keys, create indexes, consider "exotic" indexes, etc.
Database tuning--move some tables and indexes to faster disks or SSDs, move transaction journals, log files, set different values for database-level parameters, change architecture (incorporate some data warehouse structures), etc.
DBMS tuning--adjust buffer sizes and counts, memory allocation and tuning, etc. (very platform-dependent)
Filesystem tuning--use a different filesystem altogether (research), automatic filesystem checks (can sometimes be disabled after 'n' days), stripe alignment, etc.
Operating system tuning--test LVM performance, use different mount options, use different mkfs options, etc.
Server tuning--faster hardware, better RAID card, different RAID level, switch some drives to SSD drives, update firmware, etc.

Is there a disadvantage to having large columns in your database?

My database stores user stats on a variety of questions. There is no table of question types, so instead of using a join table on the question types, I've just stored the user stats for each type of question the user has done in a serialized hash-map in the user table. Obviously this has led to some decently sized user rows - the serialized stats for my own user is around 950 characters, and I can imagine them easily growing to 5 kb on power users.
I have never read an example of a column this large in any book. Will performance be greatly hindered by having such large/variable columns in my table? Should I add in a table for question types, and make the user stats a separate table as well?
I am currently using PostgreSQL, if that's relevant.
I've seen this serialized approach on systems like ProcessMaker, which is a web workflow and BPM app and stores its data in a serialized fashion. It performs quite well, but building reports based on this data is really tricky.
You can (and should) normalize your database, which is fine if your information model doesn't change too often.
Otherwise, you may want to try non-relational databases like RavenDB, MongoDB, etc.
The big disadvantage has to do with what happens on a SELECT *. If you use a specific field list you are not likely to have a big problem, but a SELECT * over a lot of TOASTed columns causes a lot of extra random disk I/O unless everything fits in memory. Selecting fewer columns makes things better.
In an object-relational database like PostgreSQL, database normalization poses different tradeoffs than in a purely relational model. In general it is still a good thing (as I say push the relational model as far as it can comfortably go before doing OR stuff in your db), but it isn't the absolute necessity that you might think of it as being in a purely relational db. Additionally you can add functions to process that data with regexps, extract elements from JSON, etc, and pull those back into your relational queries. So for data that cannot comfortably be normalized, big amorphous "docdb" fields are not that big of a problem.
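For example, PostgreSQL's JSON operators can pull individual values out of such a column and feed them back into ordinary relational queries. A small sketch, assuming (hypothetically) that the stats were stored as json/jsonb rather than a language-specific serialization, and with made-up column names:

    SELECT id,
           stats ->> 'questions_answered' AS questions_answered
    FROM   users
    WHERE  (stats ->> 'questions_answered')::int > 100;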
Depends on the predominant queries you need:
If you need queries that select all (or most) of the columns, then this is the optimal design.
If, however, you select mostly on a subset of columns, then it might be worth trying to "vertically partition"[1] the table (see the sketch after the footnotes), so you avoid I/O for the "unneeded" columns and increase the cache efficiency.[2]
Of course, all this is under assumption that the serialized data behaves as "black box" from the database perspective. If you need to search or constrain that data in some fashion, then just storing a dummy byte array would violate the principle of atomicity and therefore the 1NF, so you'd need to consider normalizing your data...
[1] I.e. move the rarely used columns to a second table, which is in a 1:1 relationship with the original table. If you are using BLOBs, a similar effect can be achieved by declaring what portion of the BLOB should be kept "in-line"; the remainder of any BLOB that exceeds that limit is stored in a set of pages separate from the table's "core" pages.
[2] DBMSes typically implement caching at the page level, so the wider the rows, the fewer of them fit into a single page on disk, and therefore into a single page in cache.
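A minimal sketch of that vertical partitioning, with invented names (PostgreSQL-style DDL):

    -- "Hot", frequently-read columns stay in the narrow main table...
    CREATE TABLE users (
        id       SERIAL PRIMARY KEY,
        username VARCHAR(50)  NOT NULL,
        email    VARCHAR(255) NOT NULL
    );

    -- ...and the wide, rarely-read stats blob moves to a 1:1 side table.
    CREATE TABLE user_stats (
        user_id INTEGER PRIMARY KEY REFERENCES users(id),
        stats   TEXT NOT NULL          -- the serialized hash-map
    );

    -- Common queries touch only the narrow table:
    SELECT id, username FROM users WHERE email = 'someone@example.com';

    -- The wide column is read only when it is actually needed:
    SELECT stats FROM user_stats WHERE user_id = 42;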
You can't search in serialized arrays.

Database Structure for hierarchical data with horizontal slices

We're currently looking at trying to improve the performance of queries for our site. The core hierarchical data structure has 5 levels, and each type has about 20 fields.
level1: rarely added, updated infrequently, ~ 100 children
level2: rarely added, updated fairly infrequently, ~ 200 children
level3: added often, updated fairly often, ~ 1-50 children (average ~10)
level4: added often, updated quite often, ~1-50 children (average <10)
level5: added often, updated often (a single item might update once a second)
We have a single data pipeline which performs all of these updates and inserts (i.e. we have full control over the data going in).
The queries we need to do on this are:
fetch single items from a level + parents
fetch a slice of items across a level (either by PK, or sometimes by filtering criteria)
fetch multiple items from level3 and parts of their children (usually by complex criteria)
fetch level3 and all children
We read from this data source a lot, as in hundreds of times a second. All of the queries we need to perform are known and optimised as well as they can be for the current data structure.
We're currently using MySQL queries behind memcached for this, and just doing additional queries to get children/parents, I'm thinking that some sort of Tree-based or Document based database might be more suitable.
My question is: what's the best way to model this data for efficient read performance?
Sounds like your data belongs in an OLAP (On-Line Analytical Processing) database. The way you're describing levels, slices, and performance concerns seems to lend itself to OLAP. It's probably modeled fine (not sure though), but you need a different tool to boost performance.
I currently manage a system like this. We have a standard relational database for input, and then copy the pertinent data for reporting to an OLAP server. Our combo is Microsoft SQL Server (input, raw data), Microsoft Analysis Services (pre-calculates then stores the analytical data to increase speed), and Microsoft Excel/Access Pivot Tables and/or Tableau for reporting.
OLAP servers:
http://en.wikipedia.org/wiki/Comparison_of_OLAP_Servers
Combining relational and OLAP:
http://en.wikipedia.org/wiki/HOLAP
Tableau:
http://www.tableausoftware.com/
*Tableau is a superb product, and can probably replace an OLAP server if your data isn't terribly large (even then it can handle a lot of data). It will make local copies as necessary to improve performance. I strongly advise giving it a look.
If I've misunderstood the issue you're having, then by all means please ignore this answer :\
UPDATE: After more discussion, an Object DB might be a solution as well. Your data sounds multi-dimensional in nature, one way or the other, but I think the difference would be whether you're doing analytic aggregate calculations and retrieval (SUMs, AVGs), or just storing and fetching categorical or relational data (shopping cart items, or friends of a family member).
ODBMS info: http://en.wikipedia.org/wiki/Object_database
InterSystems Caché is one object database I know of that sounds like a more appropriate fit based on what you've said.
http://www.intersystems.com/cache/
If conversion to a different system isn't feasible (entirely understandable), then you might have to look at normalization and the types of data your queries are processing in order to gain further improvements in speed. In fact, that's probably a good first step before jumping to a different type of system (sorry I didn't get to this sooner).
In my case, I know on MS SQL that a switch we made from having some core queries use a VARCHAR field to using an INTEGER field made a huge difference in speed. Text data is one of THE MOST expensive types of data to process. So for instance, if you have a query doing a lot of INNER JOINs on text fields, you might consider normalizing to the point where you're using INTEGER IDs that link to the text data.
An example of high normalization could be using ID numbers for a person's First or Last Name. Most DB designs store these names directly and don't attempt to reduce duplication, but you could normalize to the point where Last Name and/or First Name have their own tables (or one table to hold both First and Last names) and IDs for each unique name.
The point in your case would be more for performance than de-duplication of data, but something like switching from VARCHAR to INTEGER might have huge gains. I'd try it with a single field first, measure the before and after cases, and make your decision carefully from there.
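A rough before/after sketch of that kind of change (schema and names invented purely for illustration):

    -- Before: joining directly on a text column.
    SELECT i.item_id, i.quantity
    FROM   order_items i
    JOIN   products p ON p.product_name = i.product_name
    WHERE  p.category = 'DVD';

    -- After: the text lives once in products; order_items carries an
    -- integer foreign key, and the join compares integers instead.
    SELECT i.item_id, i.quantity
    FROM   order_items i
    JOIN   products p ON p.product_id = i.product_id
    WHERE  p.category = 'DVD';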
And of course, in general you should be sure to have appropriate indexes on your data.
Hope that helps.
A document/tree based database is designed to perform hierarchical queries. Do you have any hierarchical queries in your design? I fail to see any. Querying one level up and down doesn't count: it is a simple join. Please bear in mind that going the "document/tree based database" route would compromise your general querying ability. To summarize, just hire a competent db specialist who would analyze your performance bottlenecks -- they are usually cured by adding mundane indexes.
There's not really enough info here to say much useful -- you'd need to measure things, look at "explains", etc. -- but one option that goes beyond the usual indexing would be to shard by level3 instances. That would give you better performance on parallel queries that hit different shards, at its simplest (separate disks), or you could use separate machines if you want to throw more resources at each shard.
The only reason I mention this really is that your use cases suggest sharding at that level would work quite well (it looks like it would be simple enough to do in your application layer, if you wanted -- I have no idea what tools MySQL has for this).
And if your data volume isn't so high, then with sharding you might be able to get it down to SSDs...

How should I store extremely large amounts of traffic data for easy retrieval?

For a traffic accounting system I need to store large amounts of data about internet packets sent through our gateway router (containing timestamp, user id, destination or source ip, number of bytes, etc.).
This data has to be stored for some time, at least a few days. Easy retrieval should be possible as well.
What is a good way to do this? I already have some ideas:
Create a file for each user and day and append every dataset to it.
Advantage: It's probably very fast, and data is easy to find given a consistent file layout.
Disadvantage: It's not easily possible to see e.g. all UDP traffic of all users.
Use a database
Advantage: It's very easy to find specific data with the right SQL query.
Disadvantage: I'm not sure if there is a database engine that can efficiently handle a table with possibly hundreds of millions of datasets.
Perhaps it's possible to combine the two approaches: Using an SQLite database file for each user.
Advantage: It would be easy to get information for one user using SQL queries on his file.
Disadvantage: Getting overall information would still be difficult.
But perhaps someone else has a very good idea?
Thanks very much in advance.
First, get The Data Warehouse Toolkit before you do anything.
You're doing a data warehousing job, you need to tackle it like a data warehousing job. You'll need to read up on the proper design patterns for this kind of thing.
[Note Data Warehouse does not mean crazy big or expensive or complex. It means Star Schema and smart ways to handle high volumes of data that's never updated.]
SQL databases are slow, but that slowness is the price of flexible retrieval.
The filesystem is fast. It's a terrible thing for updating, but you're not updating, you're just accumulating.
A typical DW approach for this is the following (a rough schema sketch appears after these steps).
Define the "Star Schema" for your data: the measurable facts and the attributes ("dimensions") of those facts. Your fact appears to be the # of bytes. Everything else (address, timestamp, user id, etc.) is a dimension of that fact.
Build the dimensional data in a master dimension database. It's relatively small (IP addresses, users, a date dimension, etc.). Each dimension will have all the attributes you might ever want to know. This grows; people are always adding attributes to dimensions.
Create a "load" process that takes your logs, resolves the dimensions (times, addresses, users, etc.) and merges the dimension keys in with the measures (# of bytes). This may update the dimension to add a new user or a new address. Generally, you're reading fact rows, doing lookups and writing fact rows that have all the proper FK's associated with them.
Save these load files on the disk. These files aren't updated. They just accumulate. Use a simple notation, like CSV, so you can easily bulk load them.
When someone wants to do analysis, build them a datamart.
For the selected IP address or time frame or whatever, get all the relevant facts, plus the associated master dimension data and bulk load a datamart.
You can do all the SQL queries you want on this mart. Most of the queries will devolve to SELECT COUNT(*) and SELECT SUM(...) with various GROUP BY, HAVING and WHERE clauses.
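A bare-bones sketch of the star schema described above (column names are invented; real dimensions would carry many more attributes):

    -- Dimension tables: small, descriptive, occasionally growing.
    CREATE TABLE dim_user    (user_key    INTEGER PRIMARY KEY, user_name TEXT, department TEXT);
    CREATE TABLE dim_address (address_key INTEGER PRIMARY KEY, ip TEXT, host_name TEXT);
    CREATE TABLE dim_date    (date_key    INTEGER PRIMARY KEY, day DATE, hour INTEGER);

    -- Fact table: one narrow row per measurement, foreign keys to the dimensions.
    CREATE TABLE fact_traffic (
        user_key    INTEGER REFERENCES dim_user(user_key),
        address_key INTEGER REFERENCES dim_address(address_key),
        date_key    INTEGER REFERENCES dim_date(date_key),
        bytes       BIGINT
    );

    -- Typical datamart query: aggregate the measure, grouped by dimension attributes.
    SELECT u.department, d.day, SUM(f.bytes) AS total_bytes
    FROM   fact_traffic f
    JOIN   dim_user u ON u.user_key = f.user_key
    JOIN   dim_date d ON d.date_key = f.date_key
    GROUP BY u.department, d.day;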
I think the proper answer really depends on the definition of a "dataset". As you mention in your question, you are storing individual sets of information for each record: timestamp, user id, destination ip, source ip, number of bytes, etc.
SQL Server is perfectly capable of handling this type of data storage with hundreds of millions of records without any real difficulty. Granted, this type of logging is going to require some good hardware to handle it, but it shouldn't be too complex.
Any other solution in my opinion is going to make reporting very hard, and from the sounds of it that is an important requirement.
So you are in one of the cases where you have much more write activity than read, you want your writes not to block you, and you want your reads to be "reasonably fast" but not critical. It's a typical business intelligence use case.
You should probably use a database and store your data in a "denormalized" schema to avoid complex joins and multiple inserts for each record. Think of your table as a huge log file.
In this case, some of the "new and fancy" NoSQL databases are probably what you're looking for: they provide relaxed ACID constraints, which you should not terribly mind here (in case of a crash, you can lose the last lines of your log), but they perform much better for insertion, because they don't have to sync journals to disk at each transaction.
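As a sketch of that "huge log file" table (column names are placeholders; whether it lives in an RDBMS or a NoSQL store, the shape is much the same):

    CREATE TABLE traffic_log (
        logged_at TIMESTAMP NOT NULL,
        user_id   INTEGER   NOT NULL,
        src_ip    VARCHAR(45),        -- wide enough for the text form of IPv6
        dst_ip    VARCHAR(45),
        protocol  SMALLINT,
        bytes     BIGINT
    );

    -- Only the indexes the reports actually need:
    CREATE INDEX idx_traffic_time ON traffic_log (logged_at);
    CREATE INDEX idx_traffic_user ON traffic_log (user_id, logged_at);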

Need for speed: Best database solution

What I want to create is a huge index over an even bigger collection of data. The data is a huge collection of images (and I mean millions of photos!) and I want to build an index on all unique images.
So I calculate a hash value of every image and combine it with the width, height and file size of the image. This would generate a practically unique key for every image. This would be combined with the location of the image, or locations in case of duplicates.
Technically speaking, this would fit perfectly in a single database table. A unique index on file name, plus an additional non-unique index on hash-width-height-size would be enough. However, I could use an existing database system to solve this, or just write my own, optimized version. It will be a single-user application anyway and the main purpose is to detect when I add a duplicate image to the collection so it will warn me that I already have it in my collection and display the locations where the other copies are. I can then decide to still add the duplicate or to discard it.
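For reference, the single table being described might look roughly like this (a sketch, not a final design; the hash algorithm and column sizes are assumptions):

    CREATE TABLE image_index (
        file_name VARCHAR(1024) NOT NULL,
        hash      CHAR(64)      NOT NULL,   -- e.g. hex-encoded SHA-256 of the file
        width     INTEGER       NOT NULL,
        height    INTEGER       NOT NULL,
        file_size BIGINT        NOT NULL
    );
    CREATE UNIQUE INDEX ux_image_file ON image_index (file_name);
    CREATE INDEX ix_image_key ON image_index (hash, width, height, file_size);

    -- Duplicate check before adding a new image:
    SELECT file_name
    FROM   image_index
    WHERE  hash = ? AND width = ? AND height = ? AND file_size = ?;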
I've written hash-table implementations before and it's not that difficult once you know what you have to be aware of. So I could just implement my own file format for this data. It's unlikely that I'll ever need to add more information to these images and I'm not interested in similar images, just exact images. I'm not storing the original images in this file either, just the hash, size and location.
From experience, I know this could run extremely fast. I've done it before and have been doing similar things for nearly three decades, so it's likely that I will choose this solution.
But I do wonder... Doing the same with an existing database system like SQL Server, Oracle, Interbase or MySQL, would performance still be high enough? There would be about 750 TB of images indexed in this database, which roughly translates to around 30 million records in a single, small table. Is it even worth considering the use of a regular database?
I have doubts about the usability of a database for this project. The amount of data is huge, yet the structure is real simple. I don't need multi-user support or most other features that most databases provide. So I don't see a need for a database. But I'm interested in the opinions of other programmers about this. (Although I expect most will agree with me here.)
The project itself, which is still just an idea in my head, is supposed to be some tool or add-on for Explorer or whatever. Basically, it builds an index for any external hard disk that I attach to the system, and when I copy an image to this disk somewhere, it's supposed to tell me if the image already exists on this disk. It will allow me to avoid filling up my backup disks with duplicates, although I sometimes would like to add duplicates. (E.g. because they're part of a series.) Since I like to create my own rendered artwork I have plenty of images. Plus, I've been taking digital pictures with digital cameras since 1996 so I also have a huge collection of photos. Add some other large collections to this and you'll soon realise that the amount of data will be huge. (And yes, there are already plenty of duplicates in my collection...)
Since it's a single-user application that you are considering, I'd probably have a look at SQLite. It ought to fit your other requirements rather nicely, I'd say.
I just tested the performance of PostgreSQL on my laptop (Core 2 Duo T5800 2.0 GHz 3.0 GiB RAM). I have a table with slightly more than 100M records, 5 columns and some indexes. I performed a range query on one indexed column (not the primary key) and returned all columns. A mean query returned 75 rows and executed in 750ms. You have to decide if this is fast enough.
I would avoid DIY-ing it unless you know all the repercussions of what you're doing.
Transactional Consistency for example, is not trivial.
I would suggest designing your code in such a way that the backend can be easily replaced later, then running with something sane (SQLite is a good starting choice), developing it in the most sane and rational way possible, and then trying to slot in the alternative backing store.
Then profile the differences, and run regression tests against it to make sure your database is not worse than SQLite.
Existing database solutions tend to win because they've had years of improvement and fine tuning to get their benefits, and a naïve attempt will likely be slower, buggier, and do less, all the while increasing your development load to purely MONUMENTAL proportions.
http://fetter.org/optimization.html
The first rule of Optimization is, you do not talk about Optimization.
The second rule of Optimization is, you DO NOT talk about Optimization.
If your app is running faster than the underlying transport protocol, the optimization is over.
One factor at a time.
No marketroids, no marketroid schedules.
Testing will go on as long as it has to.
If this is your first night at Optimization Club, you have to write a test case.
Also, with databases, there is one thing you utterly MUST get ingrained.
Speed is unimportant
Your data being there when you need it, that is important.
When you have the assuredness that your data will always be there, then you may worry about trivial concerns like speed.
Hashes
You also mention that you'll be using image SHAs/MD5s etc. to deduplicate images. This is a fallacious notion on its own: hashes of files can only tell you that files are different, not that they're the same.
The logic is akin to asking 30 people to flip a coin, and you see the first one get heads, and thus decide to delete every other person who gets a head, because they're obviously the same person.
https://stackoverflow.com/questions/405628/what-is-the-best-method-to-remove-duplicate-image-files-from-your-computer
Although you may think it unlikely you'd have 2 different files with the same hash, your odds are about as good as winning the lotto. The chances of you winning the lotto are low, but somebody wins the lotto every day. Don't let it be you.
