Data compression definition - database

Could anyone explain the term "data compression" in databases, in layman's terms? Sorry if this question is simple, but it would really help me.
I did find the technical definition, but I still don't have a proper understanding of it.
Data compression saves space, saves reading time, and so on. Does that mean it's aggregating data in the table? Please clarify.

Since you've tagged this post with db2, I'm assuming you're asking about compression within the database. DB2 does dictionary compression – it replaces common strings with shorter tokens on the actual data pages, reducing the size of the table.
Please see the Wikipedia article on Dictionary Coder for a general discussion of how this algorithm works.
If you're using DB2 for Linux, UNIX and Windows, you can read this developerWorks article that describes compression specifically in DB2. The article is a few years old, but it still holds true today (even though there have been many enhancements beyond the initial release).
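For a rough feel of how dictionary compression works, here is a toy Python sketch. It is purely illustrative: DB2 builds its compression dictionary over binary row data at the page/table level, so this only shows the general idea of replacing common strings with short tokens, not DB2's actual algorithm.

    # Illustrative only: a toy dictionary coder, not DB2's actual algorithm.
    from collections import Counter

    def build_dictionary(rows, max_entries=255):
        """Map the most frequent substrings (here: whole words) to short tokens."""
        counts = Counter(word for row in rows for word in row.split())
        common = [w for w, _ in counts.most_common(max_entries)]
        return {w: f"\x01{i}\x02" for i, w in enumerate(common)}  # token markers

    def compress(row, dictionary):
        return " ".join(dictionary.get(word, word) for word in row.split())

    def expand(row, dictionary):
        reverse = {tok: w for w, tok in dictionary.items()}
        return " ".join(reverse.get(word, word) for word in row.split())

    rows = ["Chicago Illinois USA", "Springfield Illinois USA", "Chicago Illinois USA"]
    d = build_dictionary(rows)
    packed = [compress(r, d) for r in rows]
    print(packed)                # repeated strings replaced by short tokens
    print(expand(packed[0], d))  # round-trips back to the original row

The space saving comes from the repeated strings ("Illinois", "USA", "Chicago") being stored once in the dictionary and referenced by short tokens everywhere else.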

Related

Optimizing Performance of a Large Dataset in SQL Server

I'm using SQL Server running on an Azure VM with 8 SSDs. The SSDs are grouped together in Storage Spaces as 1 disk - in order to increase the capacity and also to combine the IOPS/throughput. But the "combine the IOPS" part just doesn't seem to be working, as far as I can tell from all of my tests/benchmarks (the "combine the throughput" part is working, though). In fact, it looks like SSD performance (IOPS) is better on 1 single disk than on the whole 8-physical-disk virtual disk. So, I'm thinking about just forgetting about Storage Spaces and splitting up my data across the 8 disks.
But what would be the best way to do that? (I don't have much experience with multiple files, or filegroups, or partitioning tables, and that sort of thing.)
Just make 8 mdf files (1 on each disk) and let SQL Server redistribute the data across all of these files? If so, I would like to know how SQL Server knows which disk a given record is on. Would doing this speed things up?
And maybe split up the ldf files too?
What about multiple filegroups? I really don't know what the practical difference is between multiple files and filegroups.
What about splitting up the big tables somehow by using a partitioning function? Would that help, since now, maybe, SQL Server would "have a better idea" of where (in which file) a given record would be - since that is defined by a partition function?
Please don't try to close this question because it seems very general or open-ended. Life is tough enough as it is. This is a very good question. And I'm sure there are a lot of people out there who could give very helpful, experienced answers to this which would help a lot of people. Just because there might not be one exact answer to this question, doesn't mean it's a bad question. And anyway, if you think about it, there IS one best answer to this question - there is a best way to do things in this - very common - situation.
The details you are asking about in a single thread require too much in-depth research, and the use case varies from project to project.
I recommend going through Storage: Performance best practices for SQL Server on Azure VMs, Microsoft's official document. Go through the checklist details and refer to the disk type most suitable for your use case based on IOPS. You will find the answers to all your questions within that document.

What is the underlying Storage and Search Algorithm for SQL Server Columnstore Indexes

I'm trying to figure out the guts of how Columnstore Indexes work within SQL Server. What I'm looking for is a technical reference guide or a whitepaper to the underlying storage and accompanying search algorithms for Columnstore Indexes, specifically regarding SQL 2016 (in case that differs from earlier versions). I don't even know if this algorithm/design has a formal academic name or not, as I've not found anything resembling one in the Microsoft documentation I've reviewed.
An equivalent of what I'm after regarding traditional rowstore indexes is that their underlying Storage and Search Algorithms are based on B+ Trees. The B+ Tree algorithm has a plethora of white papers out there to digest. The only algorithm reference I do see regarding Columnstore Indexes pertains to the DeltaStore functionality which is also based on B+ Trees.
I hope the underlying storage and search algorithm isn't proprietary and that my Google skills are just failing me, but if it turns out this is proprietary, knowing that would help quell my curiosity. Any help would be appreciated!
Anything regarding the internal data structures of a product that's sold for buckets of money will not have complete details published. For SQL Server, there are books from MSFT, such as this one, which talk about the internals.
As for finding the details on exactly what you want: YMMV.
At this point, the best resource I've come across is The-Paper-Trail.org's blog post on columnar storage. It doesn't get into the details behind the search algorithms, but it has some great explanations of the underlying storage as well as additional references to academic white papers. If anyone else is interested in this stuff, I would highly recommend reviewing this page sooner rather than later.
EDIT: Upon further reading, it looks like the "search algorithm" for Columnstore Indexes is basically a vanilla scan of the index, less any rowgroup elimination and column elimination. The scan operation is made even more efficient by being executed in batch mode against highly compressed data (thanks to the column-wise storage model), and, depending on the query, aggregate and string predicate pushdown optimizations can further limit the records pulled from disk. Columnstore indexes - query performance
These two resources combined give a pretty good picture of what's going on under the covers, so if you're interested, give them a look. Finally, a word of advice: ignore or skip much of the literature published prior to the release of SQL 2016, because a lot of the underlying terms and logic have changed significantly over the past 3 versions of SQL Server, and I wouldn't recommend anyone use anything earlier than 2016 if you're going to use this feature.
EDIT 2: I found an article from Microsoft confirming Columnstore Indexes are not B+ Trees.
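To make the column-wise storage and elimination ideas above concrete, here is a toy Python sketch. It is not SQL Server's implementation; the segment size and per-segment min/max metadata are assumptions modelled loosely on the documentation above, and it only shows the general shape of a scan that reads just the needed columns and skips segments whose metadata rules them out.

    # Toy columnar store: each column is stored as its own array, split into segments.
    # Not SQL Server's implementation; just the general idea of column/segment elimination.
    SEGMENT_SIZE = 4

    table = {
        "order_id": [1, 2, 3, 4, 5, 6, 7, 8],
        "amount":   [10, 250, 40, 5, 900, 30, 75, 60],
        "region":   ["E", "W", "E", "E", "W", "E", "W", "E"],
    }

    def segments(column):
        for start in range(0, len(column), SEGMENT_SIZE):
            seg = column[start:start + SEGMENT_SIZE]
            yield start, seg, min(seg), max(seg)   # per-segment min/max metadata

    def scan(table, needed_columns, amount_at_least):
        """Read only the needed columns, skipping segments whose max is too small."""
        hits = []
        for start, seg, lo, hi in segments(table["amount"]):
            if hi < amount_at_least:      # segment ("rowgroup") elimination
                continue
            for offset, value in enumerate(seg):
                if value >= amount_at_least:
                    row = start + offset
                    hits.append({c: table[c][row] for c in needed_columns})
        return hits

    print(scan(table, ["order_id", "amount"], amount_at_least=300))

The "region" column is never touched (column elimination), and the first segment of "amount" is skipped entirely because its max value is below the predicate.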

Creating and using databases

So the solid consensus I got from the answers to this question: Editing a single line in a large text file
was that instead of using a text file I should create a database and store my data there. While I think this is a great idea, I don't know the first thing about databases, the programming languages used with databases, or how to use a database once I have set one up. Could you guys give me a shove in the right direction and point me to an absolute-noob tutorial that might help me with this?
UPDATE: Hey guys, so I was looking at MySQL and there are a whole bunch of versions! The Cluster CGE looks like the best one, and it says it is a "real-time open source transactional database designed for fast, always-on access to data under high throughput conditions", which just about hits the nail on the head of what I need. It says commercial next to it though, so I don't know if I would have to pay some god-awful fee for it. I tried it anyway, and it said I should have gotten a license already, and that until I did I could only use it for 30 days. I'm confused...
Can I get this version for free? If so, where do I get the license?
Is this version way overpowered for what I need? I need:
1. A storage medium in which I can store large amounts of data
2. The ability to read and write from it in real time, with simultaneous access
3. Two different "keys" (I think I'm using that term right; I need to be able to search for entries based on either of two criteria).
MySQL is a great choice, given your Python flair.
http://dev.mysql.com/tech-resources/articles/mysql_intro.html
Good luck!
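As a rough illustration of the "two keys" requirement (being able to search by either of two criteria), here is a sketch using Python's built-in sqlite3 module; the same idea, two indexed columns, carries over directly to MySQL. The table and column names here are made up.

    # Sketch only: two searchable criteria = two indexed columns.
    # Table/column names are made up for illustration.
    import sqlite3

    conn = sqlite3.connect("example.db")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS readings (
            id         INTEGER PRIMARY KEY,
            sensor_id  TEXT,
            timestamp  TEXT,
            value      REAL
        )
    """)
    # One index per search criterion, so lookups by either are fast.
    cur.execute("CREATE INDEX IF NOT EXISTS idx_sensor ON readings(sensor_id)")
    cur.execute("CREATE INDEX IF NOT EXISTS idx_time   ON readings(timestamp)")

    cur.execute("INSERT INTO readings (sensor_id, timestamp, value) VALUES (?, ?, ?)",
                ("A1", "2012-01-01T12:00:00", 3.7))
    conn.commit()

    # Search by either criterion:
    print(cur.execute("SELECT * FROM readings WHERE sensor_id = ?", ("A1",)).fetchall())
    print(cur.execute("SELECT * FROM readings WHERE timestamp >= ?",
                      ("2012-01-01",)).fetchall())
    conn.close()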

Appropriate Database to store 20 GB for Delphi, Firemonkey

I don't have experience in database development, so I need your suggestions on choosing a database that can be used with Firemonkey.
I need to store HTML files (without media for now, but they may include it later); their total size is around 20 GB (uncompressed text). A key feature must be maximally fast text searching in the database, and it must be possible to implement human-like searching (like Google). Compression would also be nice (20 GB is a lot to store), but if compression makes searching slow, it's not required.
What kinds of databases are appropriate for my needs?
Thanks a lot for your suggestions!
Edited
Requirements:
Price: Free
Location: local or remote
Operating system support: Windows
System requirements: a database with a large footprint is acceptable (hopefully in exchange for better performance)
Performance: fast text searching
Concurrent users: 20
Full text indexing and searching: human (Google-like) fast text searching is required
Manageability: doesn't matter much
I know of an online legal web database that can search for words through 100 GB of information in milliseconds. I need the same performance, and Google-like searching is required.
The Delphi database access layer is separate from FireMonkey; it's the same one used by the VCL (although FM, AFAIK, relies only on LiveBindings to access data, but that's not an issue in your case).
Today, 20 GB is really not much data. Almost any database will handle it without much effort if properly configured. Which engine to choose depends on:
Price: how much are you going to spend for it?
Location: do you need a local database (same machine) or a remote one (LAN or WAN)?
Operating system support: which OS should it run on?
System requirements: do you need a database with a small footprint, or can you use one with a larger footprint (hopefully in exchange for better performance)?
Performance: what level of performance do you require?
Concurrent users: how many users will connect to the database concurrently?
Full text indexing and searching: not all databases offer it out of the box
Manageability: some databases may require more management than others.
There is no "one database fits all" yet.
I'm no DBA so I can't say directly, and honestly I'm not sure that any one person could give a direct answer to this question, as it's one of those "it just depends" scenarios.
http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems
That's a good starting point to compare features and platform compatibility. I think the major thing to consider here is what hardware will be running it and how can you best utilize that to accomplish the task at hand.
If you have a server farm, make sure your DB supports distribution and some sort of load balancing (most do to some degree, from what I understand).
To speed up searching, unless you code up a custom algorithm that somehow searches the compressed version, I think you're going to want to keep the data uncompressed. Searching the compressed data might actually be faster if you can compare your plain-text search parameters against an index built for the compressed file: you look only for keys that match within the index, and if any are found, you then check for them within the compressed data. Without tons of custom code, though, I haven't heard of any DB that supports this idea of searching compressed text (though I could easily be wrong on this point).
If the entire data set needs to be decompressed before doing the search, it will very likely be much slower (memory is relatively cheap compared to CPU time). It also looks like Firemonkey has a limited selection of DBs to use, so that will help to narrow your choices down as well.
What I would suggest, based on your edited question, is to write (or find) a parser or regular expression to extract all the important elements from the HTML that you would like to be searchable, then store those in a database along with a reference to where they were found in the HTML. In terms of Google-like searching, if you mean how it can correct misspellings and use synonyms, you probably need some custom code to do dictionary lookups for spelling and thesaurus lookups for synonyms. I believe full text searching in any modern DB will handle the need to query with LIKE or similar statements in the WHERE clause.
Looks like ldsandon's answer covers most of this anyhow. TL;DR; if not, thanks for reading.
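As a very rough sketch of the parse-and-index approach described above (in Python purely for illustration; the directory name, table layout and the choice of SQLite's FTS5 module are all assumptions), something like this could extract the text from each HTML file and store it alongside a reference to where it came from:

    # Rough sketch only: extract searchable text from HTML files and index it.
    # Directory name, table name and the FTS5 choice are assumptions for illustration.
    import sqlite3
    from pathlib import Path
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.chunks = []
        def handle_data(self, data):
            if data.strip():
                self.chunks.append(data.strip())

    conn = sqlite3.connect("pages.db")
    # FTS5 gives tokenized full-text search out of the box (if SQLite was built with it).
    conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(path, body)")

    for html_file in Path("html_docs").glob("*.html"):       # assumed directory name
        extractor = TextExtractor()
        extractor.feed(html_file.read_text(errors="ignore"))
        conn.execute("INSERT INTO pages (path, body) VALUES (?, ?)",
                     (str(html_file), " ".join(extractor.chunks)))
    conn.commit()

    # Full-text query: which files mention both words?
    for path, in conn.execute("SELECT path FROM pages WHERE pages MATCH ?",
                              ("contract AND liability",)):
        print(path)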
I would recommend PostgreSQL for this task. It has good performance and built-in full text search capability for Google-like searching. And it's free and open source.
Unfortunately Delphi doesn't come with Postgres data access components out of the box. You can connect by ODBC, or you can purchase components available from, for example, Devart, DA-Soft or microOLAP.
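For a flavour of what PostgreSQL's built-in full text search looks like (sketched from Python via psycopg2 purely for illustration; the connection string, table and column names are made up):

    # Illustration of PostgreSQL full-text search; table/column names are made up.
    # Requires the psycopg2 package and an existing database/connection string.
    import psycopg2

    conn = psycopg2.connect("dbname=docs user=postgres")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            id    serial PRIMARY KEY,
            path  text,
            body  text
        )
    """)
    # A GIN index over the tsvector makes the full-text queries fast.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS pages_body_fts
        ON pages USING gin (to_tsvector('english', body))
    """)
    conn.commit()
    cur.execute("""
        SELECT path
        FROM pages
        WHERE to_tsvector('english', body) @@ plainto_tsquery('english', %s)
    """, ("contract liability",))
    print(cur.fetchall())
    conn.close()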
Have you considered NoSQL databases? The Wikipedia article explains their differences to SQL databases and also mentions that they are suited as document store.
http://en.wikipedia.org/wiki/NoSQL
The article lists around twelve implementations in the document store category, many of which are open source (Jackrabbit, CouchDB, MongoDB).
This question on Stackoverflow contains some pointers to Delphi clients:
Delphi and NoSQL
I would also consider caching on the application server, to speed up search. And of course a text indexing solution like Apache Lucene.
I would take Microsoft SQL Server Express Edition. I think 2008 R2 is the latest stable version, but there is also Denali (2011). It matches all the criteria you have.
You can use ADO to work with it.
Try the Advantage Database Server.
It's easy to manage and configure.
Both dbase-like and SQL data management languages.
Fast indexed full text search capabilities.
Plus, unparalleled support from the developers themselves.
The local server (stand-alone version, as opposed to the network based server) is free.
devzone.advantagedatabase.com
There is a Firebird version with full text search, according to its documentation (http://www.red-soft.biz/en/document_21); it uses Apache Lucene, a popular search engine.

Is there any books or tutorial about writing a small database system? [duplicate]

I am interested in learning how a database engine works (i.e. the internals of it). I know most of the basic data structures taught in CS (trees, hash tables, lists, etc.) as well as a pretty good understanding of compiler theory (and have implemented a very simple interpreter) but I don't understand how to go about writing a database engine. I have searched for tutorials on the subject and I couldn't find any, so I am hoping someone else can point me in the right direction. Basically, I would like information on the following:
How the data is stored internally (i.e. how tables are represented, etc.)
How the engine finds data that it needs (e.g. run a SELECT query)
How data is inserted in a way that is fast and efficient
And any other topics that may be relevant to this. It doesn't have to be an on-disk database - even an in-memory database is fine (if that's easier), because I just want to learn the principles behind it.
Many thanks for your help.
If you're good at reading code, studying SQLite will teach you a whole boatload about database design. It's small, so it's easier to wrap your head around. But it's also professionally written.
SQLite 2.5.0 for Code Reading
http://sqlite.org/
The answer to this question is a huge one; expect a PhD thesis to answer it 100% ;)
But we can think about the problems one by one:
How to store the data internally:
You should have a data file containing your database objects, and a caching mechanism to load the data in focus, plus some data around it, into RAM.
Assume you have a table with some data. We would create a data format to convert this table into a binary file, by agreeing on the definition of a column delimiter and a row delimiter, and making sure such a delimiter pattern is never used in the data itself; i.e. if you have selected <*> to separate columns, you should validate that the data you are placing in this table does not contain this pattern. You could also use a row header and a column header, by specifying the size of the row and some internal indexing number to speed up your search, and storing at the start of each column the length of that column.
For example, a row like "Adam", 1, 11.1, "123 ABC Street POBox 456"
could be stored as:
<&RowHeader, 1><&Col1,CHR, 4>Adam<&Col2, num,1,0>1<&Col3, Num,2,1>111<&Col4, CHR, 24>123 ABC Street POBox 456<&RowTrailer>
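A rough Python sketch of this kind of length-prefixed row encoding (the exact header layout here is invented, just following the idea above):

    # Toy length-prefixed row format, roughly following the idea described above.
    # The header layout is invented for illustration, not any real engine's format.
    def encode_row(values):
        parts = []
        for value in values:
            text = str(value)
            # Column header: type tag + payload length, then the payload itself.
            tag = "CHR" if isinstance(value, str) else "NUM"
            parts.append(f"<{tag},{len(text)}>{text}")
        body = "".join(parts)
        return f"<ROW,{len(body)}>{body}<END>"

    def decode_row(record):
        assert record.startswith("<ROW,") and record.endswith("<END>")
        body = record[record.index(">") + 1 : -len("<END>")]
        values, i = [], 0
        while i < len(body):
            close = body.index(">", i)
            tag, length = body[i + 1 : close].split(",")
            payload = body[close + 1 : close + 1 + int(length)]
            values.append(payload if tag == "CHR" else float(payload))
            i = close + 1 + int(length)
        return values

    record = encode_row(["Adam", 1, 11.1, "123 ABC Street POBox 456"])
    print(record)
    print(decode_row(record))

Because every column carries its own length, the reader never has to guess where a value ends, and delimiter characters inside the data stop being a problem.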
How to find items quickly
Try using hashing and indexing to point at data stored and cached, based on different criteria.
Taking the same example above, you could sort the values of the first column and store them in a separate object, pointing at the row IDs of the items sorted alphabetically, and so on. (A minimal sketch follows.)
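A minimal sketch of such a sorted secondary index in Python (a sorted list of (value, row id) pairs plus a binary search; purely illustrative):

    # Minimal sorted secondary index: (value, row_id) pairs + binary search.
    import bisect

    rows = {1: ("Adam", 1, 11.1), 2: ("Zoe", 2, 3.5), 3: ("Bob", 7, 0.9)}

    # Index over the first column: sorted list of (value, row_id).
    name_index = sorted((values[0], row_id) for row_id, values in rows.items())

    def lookup(name):
        """Binary-search the index, then fetch matching rows by id."""
        i = bisect.bisect_left(name_index, (name, 0))
        matches = []
        while i < len(name_index) and name_index[i][0] == name:
            matches.append(rows[name_index[i][1]])
            i += 1
        return matches

    print(lookup("Bob"))   # fast lookup without scanning every row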
How to speed up data inserts
What I know from Oracle is that they insert data into a temporary place, both in RAM and on disk, and do housekeeping on a periodic basis; the database engine is busy all the time optimizing its structure, but at the same time we do not want to lose data in case of a power failure or something like that.
So try to keep data in this temporary place with no sorting, append it to your original storage, and later on, when the system is free, re-sort your indexes and clear the temp area when you're done. (See the sketch below.)
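A toy sketch of that write-buffer idea in Python: buffer new rows unsorted in a log so they are durable right away, then periodically fold them into the sorted main storage and clear the temp area (the file names and JSON format are assumptions):

    # Toy write buffer: append new rows to an unsorted log immediately (crash safety),
    # then periodically merge them into the sorted main storage and clear the log.
    import json

    LOG = "inserts.log"
    MAIN = "table.sorted.json"

    def insert(row):
        with open(LOG, "a", encoding="utf-8") as f:
            f.write(json.dumps(row) + "\n")      # durable right away, no sorting cost

    def housekeeping():
        try:
            with open(LOG, encoding="utf-8") as f:
                pending = [json.loads(line) for line in f]
        except FileNotFoundError:
            return
        try:
            with open(MAIN, encoding="utf-8") as f:
                table = json.load(f)
        except FileNotFoundError:
            table = []
        table = sorted(table + pending)          # re-sort once, not on every insert
        with open(MAIN, "w", encoding="utf-8") as f:
            json.dump(table, f)
        open(LOG, "w").close()                   # clear the temp area when done

    insert(["Zoe", 28])
    insert(["Adam", 31])
    housekeeping()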
Good luck, great project.
There are books on the topic; a good place to start would be Database Systems: The Complete Book by Garcia-Molina, Ullman, and Widom.
SQLite was mentioned before, but I want to add something.
I personally learned a lot by studying SQLite. The interesting thing is that I did not go to the source code (though I did have a short look). I learned a lot by reading the technical material and especially by looking at the internal commands it generates. It has its own stack-based interpreter inside, and you can read the P-Code it generates internally just by using EXPLAIN. Thus you can see how various constructs are translated into the low-level engine (which is surprisingly simple -- but that is also the secret of its stability and efficiency).
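For example, from Python's sqlite3 module you can dump the VDBE program (the "P-Code" mentioned above) that SQLite generates for a query; the table here is just a throwaway example:

    # Peek at SQLite's internal VDBE program ("P-Code") for a query.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")

    # EXPLAIN returns one row per low-level opcode the engine will execute.
    for row in conn.execute("EXPLAIN SELECT name FROM people WHERE age > 30"):
        print(row)   # (addr, opcode, p1, p2, p3, p4, p5, comment)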
I would suggest focusing on www.sqlite.org
It's recent, small (source code 1MB), open source (so you can figure it out for yourself)...
Books have been written about how it is implemented:
http://www.sqlite.org/books.html
It runs on a variety of operating systems for both desktop computers and mobile phones so experimenting is easy and learning about it will be useful right now and in the future.
It even has a decent community here: https://stackoverflow.com/questions/tagged/sqlite
Okay, I have found a site which has some information on SQL and implementation - it is a bit hard to link to the page which lists all the tutorials, so I will link them one by one:
http://c2.com/cgi/wiki?CategoryPattern
http://c2.com/cgi/wiki?SliceResultVertically
http://c2.com/cgi/wiki?SqlMyopia
http://c2.com/cgi/wiki?SqlPattern
http://c2.com/cgi/wiki?StructuredQueryLanguage
http://c2.com/cgi/wiki?TemplateTables
http://c2.com/cgi/wiki?ThinkSqlAsConstraintSatisfaction
Maybe you can learn from HSQLDB. I think they offer a small and simple database for learning. You can look at the code since it is open source.
If MySQL interests you, I would also suggest this wiki page, which has got some information about how MySQL works. Also, you might want to take a look at Understanding MySQL Internals.
You might also consider looking at a non-SQL interface for your database engine. Please take a look at Apache CouchDB. It's what you would call a document-oriented database system.
Good Luck!
I am not sure whether it would fit your requirements, but I had implemented a simple file-oriented database with support for simple operations (SELECT, INSERT, UPDATE) using Perl.
What I did was store each table as a file on disk, with entries following a well-defined pattern, and manipulate the data using built-in Linux tools like awk and sed. To improve efficiency, frequently accessed data was cached.
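A very small Python sketch of the same idea, one file per table, a well-defined line format, and a tiny cache for frequently repeated queries (the "|" separator and the file naming are assumptions):

    # Toy file-per-table store with a simple in-memory cache, sketching the idea above.
    # The "|" separator and the <table>.tbl naming are assumptions for illustration.
    import os
    from functools import lru_cache

    SEP = "|"

    def insert(table, row):
        with open(f"{table}.tbl", "a", encoding="utf-8") as f:
            f.write(SEP.join(str(v) for v in row) + "\n")
        select.cache_clear()          # keep the cache consistent after writes

    @lru_cache(maxsize=128)           # frequently repeated queries hit the cache
    def select(table, column_index, value):
        if not os.path.exists(f"{table}.tbl"):
            return ()
        with open(f"{table}.tbl", encoding="utf-8") as f:
            rows = (line.rstrip("\n").split(SEP) for line in f)
            return tuple(r for r in rows if r[column_index] == value)

    insert("people", ["Adam", "31", "London"])
    insert("people", ["Zoe", "28", "Paris"])
    print(select("people", 2, "Paris"))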
