I don't have experience in database development, so I need your suggestions in choosing of a database that can be used in Firemonkey.
I need to store html files (without media now, but they can be with), their total size is around 20 GB (uncompressed text). A main feature must be maximally fast searching of text in the database, and it must be possible to implement human searching (like google). Plus, there can be compression (20 GB is to much to store. If compression makes searching slow it's not required).
What kind of databases are appropriate for my concern?
Thanks a lot for your suggestions!
Edited
Requirements:
Price: Free
Location: local or remote
Operating system support: Windows
System requirements: a database with a large footprint
(hopefully in exchange of better performances)
Performances: fast text searching
Concurrent users: 20
Full text indexing and searching: human (Google-like) fast
text searching is required
Manageability: doesn't matter much
I know an on-line web legal database that can search words through 100 GB of information in milliseconds. I need the same performance, and Google-like searching is required.
Delphi database access layer is separated from FireMonkey, it's the same used by VCL (although FM AFAIK relies only on LiveBindings to access data, but that's not an issue in your case).
Today 20 GB are really not much data. Almost any database will handle them without much effort if properly configured. What engine to choose depends on:
Price: how much are you going to spend for it?
Location: do you need a local database (same machine) or a remote one (LAN or WAN)?
Operating system support: which OS should it run on?
System requirements: do you need a database with a small footprint, or you can use one with a larger one (hopefully in exchange of better performances)?
Performances: what are the required performances?
Concurrent users: how much user will connect to the database concurrently?
Full text indexing and searching: not all databases offer it out of the box
Manageability: some databases may require more management than others.
There is no "one database fits all" yet.
I'm no DBA so I can't say directly, and honestly I'm not sure that any one person could give a direct answer to this question as it's one of those it just depends scenarios.
http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems
That's a good starting point to compare features and platform compatibility. I think the major thing to consider here is what hardware will be running it and how can you best utilize that to accomplish the task at hand.
If you have a server farm being sure your DB supports distribution and some sort of load balancing (most do to some degree from what I understand).
To speed up searching unless you code up a custom algorithm that searches the compressed version somehow I think you're going to want to keep the data un-compressed. Searching the compressed data actually might be faster. If you're able to use the index for the compressed file to compare against your plain text search parameters then are just looking for those keys that were matched within the index. If any are found in the index check for them within the compressed data. Without tons of custom code I haven't heard of any DB that supports this idea of searching compressed text (though I could easily be wrong on this point).
If the entire data set needs to be decompressed before doing the search it will very likely be much slower (memory is relatively cheap compared to CPU time). It looks like Firemonkey has a limited selection of DBs to use so that will help to narrow your choices down as well.
What I would suggest based on your edited question, is to write (or find) a parser or regular expression to extract all the important elements from the HTML that you would like to be searchable. Then store those in a database along with a reference for where they were found in the HTML. In terms of Google like searching, if you mean in terms of how it can correct misspellings and use synonyms, you probably need some sort of custom code to do dictionary look ups for spelling and thesaurus look ups for synonyms. I believe full text searching in any modern DB will handle the need to query with LIKE or similar statements in the where clause.
Looks like ldsandon's answer covers most of this anyhow. TLDR; if not thanks for reading.
I would recommend PostgreSQL for this task. It has good performance, and built in full text search capability for Google-like searching. And it's free and open source.
Unfortunately Delphi doesn't come with Postgres data access components out of the box. You can connect by ODBC, or you can purchase components available from, for example, Devart, DA-Soft or microOLAP.
Have you considered NoSQL databases? The Wikipedia article explains their differences to SQL databases and also mentions that they are suited as document store.
http://en.wikipedia.org/wiki/NoSQL
The article lists around twelve implementations in the document store category, many are open source. (Jackrabbit, CouchDB, MongoDB).
This question on Stackoverflow contains some pointers to Delphi clients:
Delphi and NoSQL
I would also consider caching on the application server, to speed up search. And of course a text indexing solution like Apache Lucene.
I would take Microsoft SQL Server Express Edition. I think 2008 R2 is latest stable version but there is also Denali (2011). It match all criterien you have.
You can use ADO to work with.
Try the Advantage Database Server.
It's easy to manage and configure.
Both dbase-like and SQL data management languages.
Fast indexed full text search capabilities.
Plus, unparalled support from the developers themselves.
The local server (stand-alone version, as opposed to the network based server) is free.
devzone.advantagedatabase.com
There is a Firebird version with full text search according to its documentation - http://www.red-soft.biz/en/document_21 - it uses Apache Lucene, a popular search engine
Related
I need a SMALL/LIGHTWEIGHT DB control (maybe delivered as a single PAS file) that I can integrate directly into my application. I need to store relatively small amounts of data in a small number of tables and I want to access some columns fast. I know that Delphi 7 has that nice BDE but I don't want to trouble the user with the installation process.
I use Delphi 7.
EDIT:
I think I asked the wrong question. So, here is what I actually need:
How to store dynamic data (unknown number of fields) to a file?
NexusDB offers a free embedded version. Here is an Example
If you're committed to not include any more dependencies with your application, take a look at TClientDataSet.
I'd recommend some sort of "embedded" database. Example: In order to use Firebird as an embedded database, at a minimum, you only need to ship one DLL. You can put that DLL in your installer so the user doesn't need to install anything.
As an alternative, how about the freeware TDbf database? It compiles directly into your application and is reliable for lightweight uses.
Also, if you are old enough to remember the days when DBase was the standard desktop database platform, then you probably know how to use it already. :-)
It's at http://tdbf.sourceforge.net
(If there doesn't seem to be a lot of recent activity then it is because it's been around for 10 years and is very stable).
Just a thought.
You could try SQLite. Its an excellent embedded database. Fast, reliable, and you cant beat the price (open source, public domain). There's a number of Delphi wrappers, or you can use the library directly if you want a light weight solution.
Two Open Source solutions (working from Delphi 6 up to XE):
One ORM oriented solution, which can store data using SQLite or a pure Delphi storage. It can be either as stand-alone, either as Client/Server.
One very fast pure Delphi NoSQL table storage engine. Sample benchmark was able to store 1,000,000 records with one integer and one text field in 800 ms (with automatic index creation). You create your own table columns, then access the fields content via Late-Binding.
If have some budget to spend, give AnyDAC a try. It provides native and embedded SQLite access, so you don't even have to ship an external DLL.
TurboPower FlashFiler
I have tried (arranged by lightweight):
NexusDB - commercial, too big for what I need; adds quite some overhead
DISQLite - seems powerful; difficult to use
kbmMemTable - commercial, UNDOCUMENTED FOR TRIAL USERS (it cannot be trialed unless you purchase the documentation, first which befits the purpose of the TRIAL concept)
TDBF - free but not maintained anymore; also it totally lacks of documentation
Synopse BigTable - seems to be the solution that I need. It consists in 2 PAS files only.
In some situations a custom system may fit better that a general one. So, for what I need, I will tailor my own system. Because I know the size/type of the data I can make fields that perfectly fit my data. The DB size will be smaller this way and faster (plus it is free).
:)
Solution: How to store dynamic data (unknown number of fields) to a file?
JvCSVDataSet and JvCSVBase from JVCL
kbmMemTable
Embedded Firebird server
jbDBF
While the manual is paid for, the actual component is available for free, including a pretty extensive demo application showing lots of the features. In addition there are lots and lots of information on the internet about how to use it.
Use www.codenewsfast.com or google.com to search for it.
best regards
Kim Madsen
kbm#components4developers.com
I have a large number of records (say around 10 to 100 million), which I want to be able to query.
This is a research project, the database is going to be mostly read only, and I only need one connection at a time. I would like the queries to be reasonably fast.
Is SQLite a reasonable choice for this purpose?
My experience with SQLite is that it may be quite slow on large recordsets, depending on how you structure your queries. If your data is de-normalized and you can get by querying a single table against its primary key then it's acceptably fast, but if your data is fully normalized and your queries involve several joins then it can be much slower than a client-server database.
SQLite's principal advantage is its small size and single file nature that make it easy to distribute embedded in an app. As that doesn't seem to be a requirement for you though, I think you'd be better off going with something else. SQL Server Express is good if you're using Windows, MySQL or Postgres otherwise would be a good choice.
As pointed out in the previous posts, SQLite is a great SQL library, but it can run out of gas when the data set gets very large. Berkeley DB recently introduced a SQL API which is completely SQLite compatible. It was added to Berkeley DB in order to provide the best of both worlds to SQLite users -- the ubiquity, simplicity and ease of use of SQLite with the concurrency, scalability and reliability of Berkeley DB.
The Berkeley DB SQL API was designed to be a drop-in replacement for SQLite applications, especially those that specifically need the Berkeley DB features and scalability that isn't available in native SQLite. You can read more about it in the Berkeley DB SQL API documentation.
Disclaimer: I'm one of the Product Managers for Berkeley DB, so I'm a little biased. But your use case is one of the reasons that we worked with Dr. Hipp and the SQLite developers in order to combine the SQLite API with the Berkeley DB storage manager. It allows SQLite application developers to take their applications into new areas with added capabilities, while remaining compatible with their existing implementation.
Please let us know if you have any questions or if there is anything that we can do to help. You can find an active community of Berkeley DB developers on the OTN Forums.
Best of luck with your project.
Regards,
Dave
SQLite is not particularly fast when getting into the millions of entries. Results will vary according to what you put in there, schema, number of columns, indexes.
The advantage (especially in your case) of SQLite is that it is so light that trying it with some data would probably be worth the time and effort. It's very much straightforward and its ideal use case is indeed for single user access.
I'd say try and build it up with a representative amount of data (you can do an import from a CSV file from the command line, or use one of the many wrappers available out there). If the speed is not satisfactory, you might have to switch to something with more power, but admittedly a bit more setup too, like MySQL.
We're trying to identify the locations of certain information stored across our enterprise in order to bring it into compliance with our data policies. On the file end, we're using Nessus to search through differing files, but I'm wondering about on the database end.
Using Nessus would seem largely pointless because it would output the raw data and wouldn't tell us what table or row it was in, or give us much useful information, especially considering these databases are quite large (hundreds of gigabytes).
Also worth noting, this system needs to be able to do pattern-based matching (such as using regular expressions). Not just a "dumb search" engine.
I've investigated the use of Data Mining and Data Warehousing in order to find this data but it seems like they're more for analysis of data than actually just finding data.
Is there a better method of searching through large amounts of data in a database to try and find this information? We're using both Oracle 11g and SQL Server 2008 and need to perform the searches on both, so I'd like to stay away from server-specific paradigms (although if I have to rewrite some code to translate from T-SQL to PL/SQL, and vice versa, I don't mind)
On SQL Server for searching through large amounts of text, you can look into Full Text Search.
Read more here http://msdn.microsoft.com/en-us/library/ms142559.aspx
But if I am reading right, you want to spider your database in a similar fashion to how a web search engine spiders web sites and web pages.
You could use a set of full text queries that bring back the results spanning multiple tables.
Oracle supports regular expression with the RegExp_Like() function and it ought to be fairly straightforward to automate the generation of the code you need based on system metadate (to find all text columns over a certain length, for example, and include them in a predicate againt that table to find the rows and values that match your regexp). Doesn't sound too challenging really. In theory you could check constrain columns to prevent the insertion of values that match a regexp but that might be overkill.
Oracle Text is suited for searching for words/phrases in larg(ish) bits of text (eg PDFs, HTMLs, TXT or DOCs) held in the database. There is some limited fuzziness searching, but not regular expressions per se.
You don't really go into what sort of data you are looking for or what you have in your databases. Nessus indicates you are looking for security issues, but the title of "Data Correlation" suggests something completely different.
Really the data structures should provide the information about what to look for and where. That's what databases are about - structuring data for accessibility. A database backing a CMS, forum software or similar would be a different kettle of fish.
I am about to write a program to keep track of my school assignments and I was wondering what database language would be the most efficient and simple to implement to track the meta-data of the assignments? I am thinking about XML, but it would require several documents.
I (currently) have at least ten assignments per week for 45 weeks. The data that has to be stored includes name, issue date, due date, path, and various states of completion. What ever language it's in would have to be able to take a large increase in both the number of assignments and the amount of meta-data without having to make large changes in either the format or the retrieval system.
Quite frankly, if you pick a full-fledged database you run the risk of spending more time on data entry than you do on your homework. If you really need to keep track of this, I would seriously recommend a spreadsheet.
First, I think you are confusing a relational database system with a database language. In all likelihood, you will be using a database that uses SQL. From there, you will need to another programming platform to build an application around. If you wanted, you could use an Microsoft Access database that allows you to build a simple front-end that is stored in the same file as the database. In this case you would be programming with VBA.
Pretty much any more database system would be suitable for your needs, even Access handle orders of magnitude more work than you are describing.
Some possible database systems are, again, Microsoft Access, Microsoft SQL Server Express, VistaDB, SQLite (probably the best choice after access for your needs), and of course there are many others.
You could either build a web front end or a desktop; I assume you are using Windows. You could use Visual Studio C# Express for this if you wanted. Or you could go with VB.NET, VB6, or what have you.
My answer isn't directly related, but as you are designing your database structures you might want to take at some of the the objects in the SIF specification in particular look at the Assignment and GradingAssignment objects.
As for how to store the data, you could use a rdbms (sqlite, mysql) or perhaps key-value database (zodb, link).
Of course, if this is just a small personal project you could just serialize the data to something like xml, json, csv or whatever and storing it as a file. It might be better in the long run to use a database though. A database format will probably scale a lot easier.
I would recommend Oracle Express (With Application Express) It will scale up to 4gb of user data. Beyond that, you would have to start paying. Application Express is very simple and build CRUD applications for, which is what is sounds like yours is.
For a project like that I would use Sqlite or Mysql, it's be fast enough. Plus it's easy to setup.
I'm looking for a cross-platform database engine that can handle databases up hundreds of millions of records without severe degradation in query performance. It needs to have a C or C++ API which will allow easy, fast construction of records and parsing returned data.
Highly discouraged are products where data has to be translated to and from strings just to get it into the database. The technical users storing things like IP addresses don't want or need this overhead. This is a very important criteria so if you're going to refer to products, please be explicit about how they offer such a direct API. Not wishing to be rude, but I can use Google - please assume I've found most mainstream products and I'm asking because it's often hard to work out just what direct API they offer, rather than just a C wrapper around SQL.
It does not need to be an RDBMS - a simple ISAM record-oriented approach would be sufficient.
Whilst the primary need is for a single-user database, expansion to some kind of shared file or server operations is likely for future use.
Access to source code, either open source or via licensing, is highly desirable if the database comes from a small company. It must not be GPL or LGPL.
you might consider C-Tree by FairCom - tell 'em I sent you ;-)
i'm the author of hamsterdb.
tokyo cabinet and berkeleydb should work fine. hamsterdb definitely will work. It's a plain C API, open source, platform independent, very fast and tested with databases up to several hundreds of GB and hundreds of million items.
If you are willing to evaluate and need support then drop me a mail (contact form on hamsterdb.com) - i will help as good as i can!
bye
Christoph
You didn't mention what platform you are on, but if Windows only is OK, take a look at the Extensible Storage Engine (previously known as Jet Blue), the embedded ISAM table engine included in Windows 2000 and later. It's used for Active Directory, Exchange, and other internal components, optimized for a small number of large tables.
It has a C interface and supports binary data types natively. It supports indexes, transactions and uses a log to ensure atomicity and durability. There is no query language; you have to work with the tables and indexes directly yourself.
ESE doesn't like to open files over a network, and doesn't support sharing a database through file sharing. You're going to be hard pressed to find any database engine that supports sharing through file sharing. The Access Jet database engine (AKA Jet Red, totally separate code base) is the only one I know of, and it's notorious for corrupting files over the network, especially if they're large (>100 MB).
Whatever engine you use, you'll most likely have to implement the shared usage functions yourself in your own network server process or use a discrete database engine.
For anyone finding this page a few years later, I'm now using LevelDB with some scaffolding on top to add the multiple indexing necessary. In particular, it's a nice fit for embedded databases on iOS. I ended up writing a book about it! (Getting Started with LevelDB, from Packt in late 2013).
One option could be Firebird. It offers both a server based product, as well as an embedded product.
It is also open source and there are a large number of providers for all types of languages.
I believe what you are looking for is BerkeleyDB:
http://www.oracle.com/technology/products/berkeley-db/db/index.html
Never mind that it's Oracle, the license is free, and it's open-source -- the only catch is that if you redistribute your software that uses BerkeleyDB, you must make your source available as well -- or buy a license.
It does not provide SQL support, but rather direct lookups (via b-tree or hash-table structure, whichever makes more sense for your needs). It's extremely reliable, fast, ACID, has built-in replication support, and so on.
Here is a small quote from the page I refer to above, that lists a few features:
Data Storage
Berkeley DB stores data quickly and
easily without the overhead found in
other databases. Berkeley DB is a C
library that runs in the same process
as your application, avoiding the
interprocess communication delays of
using a remote database server. Shared
caches keep the most active data in
memory, avoiding costly disk access.
Local, in-process data storage
Schema-neutral, application native data format
Indexed and sequential retrieval (Btree, Queue, Recno, Hash)
Multiple processes per application and multiple threads per process
Fine grained and configurable locking for highly concurrent systems
Multi-version concurrency control (MVCC)
Support for secondary indexes
In-memory, on disk or both
Online Btree compaction
Online Btree disk space reclamation
Online abandoned lock removal
On disk data encryption (AES)
Records up to 4GB and tables up to 256TB
Update: Just ran across this project and thought of the question you posted:
http://tokyocabinet.sourceforge.net/index.html . It is under LGPL, so not compatible with your restrictions, but an interesting project to check out, nonetheless.
SQLite would meet those criteria, except for the eventual shared file scenario in the future (and actually it could probably do that to if the network file system implements file locks correctly).
Many good solutions (such as SQLite) have been mentioned. Let me add two, since you don't require SQL:
HamsterDB fast, simple to use, can store arbitrary binary data. No provision for shared databases.
Glib HashTable module seems quite interesting too and is very
common so you won't risk going into a dead end. On the other end,
I'm not sure there is and easy way to store the database on the
disk, it's mostly for in-memory stuff
I've tested both on multi-million records projects.
As you are familiar with Fairtree, then you are probably also familiar with Raima RDM.
It went open source a few years ago, then dbstar claimed that they had somehow acquired the copyright. This seems debatable though. From reading the original Raima license, this does not seem possible. Of course it is possible to stay with the original code release. It is rather rare, but I have a copy archived away.
SQLite tends to be the first option. It doesn't store data as strings but I think you have to build a SQL command to do the insertion and that command will have some string building.
BerkeleyDB is a well engineered product if you don't need a relationDB. I have no idea what Oracle charges for it and if you would need a license for your application.
Personally I would consider why you have some of your requirements . Have you done testing to verify the requirement that you need to do direct insertion into the database? Seems like you could take a couple of hours to write up a wrapper that converts from whatever API you want to SQL and then see if SQLite, MySql,... meet your speed requirements.
There used to be a product called b-trieve but I'm not sure if source code was included. I think it has been discontinued. The only database engine I know of with an ISAM orientation is c-tree.