DB technology for efficient search in tabular data? - database

We have a repository of tables. Around 200 tables, each table can be thousands of rows, all tables are originally in Excel sheets.
Each table has a different schema. All data is text or numbers.
We would like to create an application that allows free text search on all tables (we define which columns will be searched in each table) efficiently - speed is important.
The main dilemma is which DB technology we should choose.
We created a mock up by importing all tables to MS SQL Server, and creating a full text index over them. The search is done using the CONTAINS keyword. This solution works well for a small number of tables, but it doesn't scale.
We thought about a NoSQL solution, but we don't yet have any experience in it.
Our limitations (which unfortunately I cannot affect): Windows servers only. But we can install whatever we want on them.
Thank you.

Check out ElasticSearch! It's a search server based on Apache Lucene and has a clean REST API that speaks JSON. Although it's usually used as a search index alongside a primary database, it can also be used stand-alone. So you may want to write a routine that copies a few of your tables into it and try it out.
http://www.elasticsearch.org/
http://en.wikipedia.org/wiki/ElasticSearch
Comparison of ElasticSearch and Apache Solr (another Lucene-based search server):
https://docs.google.com/present/view?id=dc6zhtt5_1frfxwfff&pli=1
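To make the suggestion a bit more concrete, here is a minimal sketch of the two request bodies involved: a bulk-index body for the rows of one table, and a free-text query limited to the columns chosen for searching. The index name `products`, the column names, and the sample rows are hypothetical; the code only builds the JSON and does not contact a server.

```python
import json


def bulk_index_body(index_name, rows, id_field):
    """Build the newline-delimited JSON body expected by the _bulk endpoint."""
    lines = []
    for row in rows:
        # Each document takes two lines: an action line, then the source.
        action = {"index": {"_index": index_name, "_id": str(row[id_field])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(row))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


def search_body(text, fields):
    """Build a free-text query restricted to the chosen columns."""
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


# Hypothetical rows exported from one of the Excel-derived tables.
rows = [
    {"id": 1, "name": "Widget A", "notes": "blue plastic housing"},
    {"id": 2, "name": "Widget B", "notes": "steel housing, painted"},
]

bulk = bulk_index_body("products", rows, "id")
query = search_body("steel housing", ["name", "notes"])
```

The `bulk` string would be POSTed to `/_bulk` with content type `application/x-ndjson`, and `query` (as JSON) to `/products/_search`. Since every table has its own schema, one index per table with its own list of searchable fields is one possible layout, though not the only one.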

Related

Extracting Data from SAP to SQL Server

I am using SSIS packages to extract data from SAP database tables into SQL Server tables. I am using OLEDB source/destination connections to achieve this.
The problem now is that a table in SAP has 5 million records and it's taking around 2 hours to extract this data into my SQL Server table. I have used the trunc-dump method (truncating the table in SQL Server and dumping data into it from the SAP table) and also tried using a multiple hash key to bring in the updated/new records.
The problem with Hash key is that it still has to scan the entire table to look for changed/new records and hence takes almost the same time as the trunc-dump method.
I am looking for a new way or changing the existing way to reduce the time taken to complete this extraction.
You mentioned you are using an OLEDB source connection to access SAP. If that means you are accessing SAP's underlying database directly, you should pause doing that until there are explicit IT approvals, for three reasons:
1. You are skipping SAP's application layer security, which can be an enterprise security compliance issue;
2. Your company's SAP license may not allow it. If your company only has an SAP indirect access license, you may have to stay on the application layer;
3. You will not get SAP's official support when accessing the underlying database directly.
You have multiple options to fetch data with SSIS through the SAP application layer:
1. Use commercial SSIS custom components for this job (disclaimer: AecorSoft is one of the leading vendors offering such connectivity components);
2. Look into SAP's own OData Gateway interface to consume data;
3. Ask your SAP ABAP team to write custom ABAP programs that dump SAP data into CSV files, and then use SSIS to fetch them.
Let's now look at the performance side:
SAP ETL performance depends on many factors, but in general, even for SAP transactional tables with 100+ columns, extracting 5 million rows in a couple of hours is considered very slow. For example, we've seen cases of extracting the standard SAP General Ledger header table BKPF (almost 100 columns) at a consistent rate of 1M rows every 1-2 minutes. Of course such performance is achieved through a commercial component and SSIS, but you should expect at least 1M rows per 10 minutes even for option #3 above, which goes through an intermediate CSV file. Under the hood, all three options go through the SAP application layer and use SAP Open SQL (in contrast to the "Native SQL" that the underlying database offers) to access SAP tables. Therefore, if you experience performance issues at the application layer, you can analyze the Open SQL side.
Since you also mentioned the update/new records scenario: this is a typical delta extraction problem. Normally, SAP transactional tables have Create Date and Changed Date fields which can help you capture the delta. To avoid a full table scan in this case, apply indices through the SAP application layer on those "delta fields". For example, if you need to extract the Sales Document Header table VBAK, you can filter by ERDAT (Created on) and AEDAT (Changed on). Delta is a complex subject in SAP. There is no simple statement that describes the delta solution, as SAP data models are complex and differ considerably across functional modules. Delta analysis is always a case-by-case effort. Some people may simply recommend using "delta extractors", but don't treat those as a silver bullet, because extractors have their own problems. In short, if you are looking into table-based extraction, focus on that, and work with your SAP functional team to determine suitable delta fields. Try to avoid full table scans and hashing. Do an incremental load with some optional overlap with the previous extract (e.g. loading today's and yesterday's records), and do a MERGE to absorb the changes.
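The incremental load with overlap can be sketched as follows. This is only an illustration in plain Python with in-memory rows: the field names VBELN, ERDAT and AEDAT follow the VBAK example above, while the sample data, the NETWR values, the one-day overlap, and the use of `None` for "never changed" are assumptions. In SSIS the same logic would be a filtered source query followed by a MERGE into the target table.

```python
from datetime import date, timedelta


def extract_delta(source_rows, last_run, overlap_days=1):
    """Return rows created or changed on or after (last_run - overlap_days)."""
    cutoff = last_run - timedelta(days=overlap_days)
    return [
        row for row in source_rows
        if row["ERDAT"] >= cutoff
        or (row["AEDAT"] is not None and row["AEDAT"] >= cutoff)
    ]


def merge(target, delta_rows, key="VBELN"):
    """Upsert by key: insert new rows, overwrite existing ones."""
    for row in delta_rows:
        target[row[key]] = row
    return target


# Hypothetical VBAK-like source rows.
source = [
    {"VBELN": "0001", "ERDAT": date(2024, 1, 1), "AEDAT": None, "NETWR": 100},
    {"VBELN": "0002", "ERDAT": date(2024, 1, 5), "AEDAT": date(2024, 3, 9), "NETWR": 250},
    {"VBELN": "0003", "ERDAT": date(2024, 3, 10), "AEDAT": None, "NETWR": 75},
]

# Target as left by the previous load; document 0002 is stale.
target = {
    "0001": {"VBELN": "0001", "ERDAT": date(2024, 1, 1), "AEDAT": None, "NETWR": 100},
    "0002": {"VBELN": "0002", "ERDAT": date(2024, 1, 5), "AEDAT": None, "NETWR": 200},
}

delta = extract_delta(source, last_run=date(2024, 3, 10))
merge(target, delta)
```

Because the merge is keyed, re-reading yesterday's rows in the overlap window is harmless: a row that was already loaded is simply overwritten with the same values.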
There are a few cases where you may not be able to find any delta field, and it is not practical to do a full load every time. One good example is the Address Master data table ADRC. If you are required to do a delta load on such a table, you either have to ask your SAP functional team to work out the delta for you (meaning they inject custom logic into every place where an Address master record can be created, updated, or deleted), or you have to ask your SAP Basis team to create a DB trigger on the underlying database table and expose the trigger table at the application layer. This way, you can create an application layer view on the main table and the trigger table to do the delta. Your solution still has no direct database access. The DB layer trigger is fully managed and controlled by your SAP Basis team, who also support the database.
Hope this helps!

Full Text Search Auto-Partition Schemes and Functions

We have some full text searches running on our SQL Server 2012 Development (Enterprise) database. We noticed that partition schemes and functions are being (periodically) added to the DB. I can only assume that the partitions are for FTS as they have the following form:
Scheme:
CREATE PARTITION SCHEME [ifts_comp_fragment_data_space_46093FC3] AS PARTITION [ifts_comp_fragment_partition_function_46093FC3] TO ([FTS], [FTS], [FTS])
Function:
CREATE PARTITION FUNCTION [ifts_comp_fragment_partition_function_46093FC3](varbinary(128)) AS RANGE LEFT FOR VALUES (0x00330061007A00660073003200360036, 0x0067006F00730066006F00720064)
The problem is that our production servers are running SQL Server 2012 Standard which does not support partitions. Thus it adds an extra admin burden on our schema compares (using SSDT) to exclude these partitions every time. When one does (inevitably) creep in it is a pain to remove. We have done some extensive research and have not been able to come up with any answer as to why this is even happening. Any ideas?
Yes, those are internal to the fulltext search functionality. You have no control over them.
However, I would consider it a bug that they show up in your schema compares. You'll never create/alter/drop them yourselves, and they are completely maintained by SQL Server, so I would file a bug report on http://connect.microsoft.com
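One way to check that these objects really belong to full-text search is to decode the boundary values from the partition function in the question. The sketch below treats each varbinary literal as UTF-16 big-endian text, which is an interpretation based on the byte pattern rather than something the answer states. The name filter is likewise only an assumption built on the `ifts_comp_fragment_` prefix visible in the question, for example as a pre-filter on a list of object names before a compare.

```python
def decode_boundary(hex_literal):
    """Decode a 0x... varbinary literal as UTF-16 big-endian text."""
    return bytes.fromhex(hex_literal[2:]).decode("utf-16-be")


def is_fts_internal(object_name):
    """True for names that carry the auto-generated prefix from the question."""
    return object_name.startswith("ifts_comp_fragment_")


# Boundary values copied from the CREATE PARTITION FUNCTION statement above.
boundaries = [
    "0x00330061007A00660073003200360036",
    "0x0067006F00730066006F00720064",
]
decoded = [decode_boundary(value) for value in boundaries]
print(decoded)  # ['3azfs266', 'gosford']
```

The values decode to what look like indexed keywords, which is consistent with the partitions splitting full-text index fragments by keyword range.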

Sql Server XML columns substitute for Document DB?

Is it possible to use Sql Server XML columns as a substitute for a real Document DB (such as Couch or Mongo) ?
If I were to create a table with a guid PK Id and an XML column for the document.
What would be the main problems compared to using a document DB?
Sql Server supports indexing over XML columns so querying should not be completely horrible?
You've got several questions in here:
Is it possible to use Sql Server XML columns as a substitute for a real Document DB (such as Couch or Mongo) ? Yes, you can use it as a substitute, but no, you probably wouldn't be satisfied with performance if you're exclusively storing XML and not leveraging any of SQL Server's relational tools.
If I were to create a table with a guid PK Id and an XML column for the document. What would be the main problems compared to using a document DB? In a nutshell, scaling out. SQL Server doesn't scale this kind of thing out well. You can do it with replication, but it's painful to manage relative to a "real" Document DB.
Sql Server supports indexing over XML columns so querying should not be completely horrible? The problem is that SQL Server's XML indexes can take several times the storage space of the original data. These indexes can't be maintained online (as in defrags), so you end up with locking issues during maintenance windows.
I'm doing some experimenting with this on:
http://rogeralsing.com/2011/03/02/linq-to-sqlxml-projections/
Query speed is 'decent', but it's nothing I'd use for scaling.
But the joy of schema free storage running on standard infrastructure is quite nice.
Yes, you can. Storing a document inside a SQL Server XML column will work, and if you use standard XML serialization that will leave you with a decent ACID-compliant key/value store. It also lets you query the documents with relative ease, and you can join the results to data that you store in a more relational way. We do so, and it works. If you store content in XML fields, storage demands are a lot lower than with NTEXT, and querying is more flexible and faster.
What SQL Server will not give you (compared to Mongo) is the seamless failover of replica sets and the auto-sharding of Mongo. Also, atomic operations like incrementing a specific property deep inside a document are hard (though not impossible with the XQuery update function). Updates tend to be faster on most NoSQL databases, because they are more relaxed about the "data is only safe on disk" principle.
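The key/value pattern and the "increment a nested property" problem can be illustrated with a small sketch. Note the substitution: it uses Python's built-in SQLite and a plain text column as a stand-in for SQL Server and its XML type, so it shows the table shape and the read-modify-write cycle, not SQL Server's XML indexing or XQuery. The table name, the document structure and the `stats/views` path are hypothetical.

```python
import sqlite3
import uuid
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# One row per document: a GUID key plus the serialized XML body.
conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, body TEXT NOT NULL)")


def save(doc_id, root):
    """Insert or replace the serialized document under its key."""
    xml_text = ET.tostring(root, encoding="unicode")
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "INSERT OR REPLACE INTO documents (id, body) VALUES (?, ?)",
            (doc_id, xml_text),
        )


def load(doc_id):
    """Fetch a document by key and parse it."""
    row = conn.execute(
        "SELECT body FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return ET.fromstring(row[0])


def increment(doc_id, path):
    """Read-modify-write of one nested value; the whole document is rewritten."""
    root = load(doc_id)
    node = root.find(path)
    node.text = str(int(node.text) + 1)
    save(doc_id, root)


doc_id = str(uuid.uuid4())
order = ET.fromstring(
    "<order><customer>ACME</customer><stats><views>41</views></stats></order>"
)
save(doc_id, order)
increment(doc_id, "stats/views")
views = load(doc_id).find("stats/views").text
```

The `increment` helper is not atomic across concurrent writers, which is the kind of operation the answer calls hard: a document database can usually update the nested value in place, whereas here the full document makes a round trip.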
Yes, it is possible. As to whether it's a good idea, this is just my 2 cents...
Before the XML datatype came along I worked on a system storing XML in an NTEXT column - that wasn't pleasant, and to get any real use out of the data meant shredding some of that data out into relational form.
OK, the XML datatype now makes it easier to query an XML blob and to extract certain values/index them. But personally, in general, I wouldn't. I'm not saying never use XML, as there are scenarios for that; rather, if that's all you're planning on doing, then I'd be asking "is this the right tool for the job?". Using an RDBMS as a document database makes me feel a bit uneasy, whereas something like MongoDB has been built from the ground up as a document database.
In all honesty, I haven't done any performance testing on storing data as XML so I can't give you an indication of what performance would be like. Would be interested to know how this performs at scale.

large database file (mdb) takes time to load in vb.net so need alternative

I have 4k records in an Access database. One of the fields contains ~100 lines of text per record,
and one other field has ~25 lines. So the total database size reaches ~30MB, and it takes a long time (15-20 seconds) to load the database in VB.NET using ODBC http://www.homeandlearn.co.uk/net/nets12p5.html
Updating any of the other small fields also takes time because the database is so large.
So as an alternative I used RTF files (txt files were not preserving all the newline characters). These files are only around 5-10KB each, but for 4k records and 2 fields I now have 8k files. Copying these 8k RTF files takes a huge amount of time: a 5MB transfer takes an hour or so.
So is there any other alternative for storing this data, so that it is portable and easily loaded/accessed/updated from VB.NET?
MDB Databases
MDB is the Access database file type. Access databases were never designed to be used as backends for web systems; they are mainly for light office use.
Improving performance
For a temporary improvement in performance, you can compact and repair the database. Open it up and find the option in the Tools menu. Alternatively, you can do this programmatically. This should be done reasonably frequently, depending on how many changes are made to your database.
Also, slowness is often a sign of inefficient design. Consider reading up on database normalisation if your database is not fully normalised. This should significantly improve performance and is an essential standard that should be learned.
Alternatives
For 4k+ records you should probably be using a decent database system designed specifically for larger amounts of data.
SQL-Server is an excellent database system from Microsoft. MySQL is also a great open source alternative. The Internet is full of tutorials on how to connect to these databases.
I sometimes use Access databases in .NET too. OK, MS Access isn't the best database for this kind of application, I know. But the ease of building complex queries and the functional, well-known reports make Access a cost-effective solution.
I looked at the link you posted. That approach was my first technique too, but then I realized there is an easier and faster one. I suggest you connect to the Access database in a different way.
Create a DataSet, if you haven't already.
Create a connection to the MS Access database using Database Explorer.
Drag and drop the tables you need onto the DataSet (.NET will generate the designer code for you behind the scenes).
In code, create a TableAdapter object and a table object:
Suppose that your DataSet is named DS1 and the table is named table01.
' Language: VB.NET
' Check IntelliSense autocomplete for the exact names of your generated data objects.
' Create a TableAdapter object and a table object (both are generated when you
' drop the Database Explorer objects onto the DataSet).
Dim table01_TA As New DS1TableAdapters.table01_tableAdapter
Dim table01 As New DS1.table01DataTable
' Load the database data into the in-memory table table01.
table01 = table01_TA.GetData()
' Do your operations on table01 (add, update, insert, delete, queries).
' For automatic generation of the update, insert and delete commands, make sure
' your table has primary keys and correct relationships.
' Finally, call Update on the table adapter. Unless you do, the changes will not
' be written to the database.
table01_TA.Update(table01)
I suggest you use LINQ to query your data, and the DataTable methods for adding and editing data. These methods are created automatically when you drop the Database Explorer tables onto the DataSet and save it. It's also worth compacting and repairing the Access database frequently.
Contact me if you have trouble.
I agree with Tom's recommendation. Get yourself a decent database server. However, judging by your description of your performance issues it seems like you have other serious problems which are probably going to be difficult to resolve here.

sql server - full-text search

So let's say I have two databases, one for production purposes and another one for development purposes.
When we copied the development database, the full-text catalog did not get copied properly, so we decided to create the catalog ourselves. We matched all the tables and indexes and created the database, and the search feature seems to be working okay too (but it has not been entirely tested yet).
However, the former catalog had a lot more files in its folder than the one we manually created. Is that fine? I thought they would have the exact same number of files (though the size may vary).
First...when using full text search I would suggest that you don't manually try to create what the wizard does for you. I have to wonder about missing more than just some data. Why not just recreate the indexes?
Second...I suggest that you don't use the freetext feature of SQL Server unless you have no other choice. I used to be a big believer in freetext, but was shown an example that compared creating and searching a Lucene(.net) index with creating and searching an index in SQL Server. Creating a SQL Server index is considerably slower than creating a Lucene index, and the result is harder to maintain. Searching a SQL Server index is considerably less accurate (poor results) in comparison to Lucene. Lucene is like having your own personal Google for searching data.
How? Index your data (only the data you need to search) in Lucene and include the Primary Key of the data that you are indexing for use later. Then search the index using your language and the Lucene(.net) API (many articles written on this topic). In your search results make sure you return the PK. Once you have identified the records you are interested in you can then go get the rest of the data and/or any related data based on the PK that was returned.
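The pattern described above (index only the searchable fields together with the primary key, search to get keys back, then fetch the full rows by key) can be sketched without Lucene. The toy inverted index below is a stand-in for a Lucene index, not the Lucene(.net) API, and the rows, the field names and the `by_pk` lookup (standing in for a database query by primary key) are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(rows, pk, fields):
    """Map each token to the set of primary keys whose searchable fields contain it."""
    index = defaultdict(set)
    for row in rows:
        for field in fields:
            for token in tokenize(str(row[field])):
                index[token].add(row[pk])
    return index


def search(index, query):
    """Return the primary keys of rows that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


rows = [
    {"id": 10, "title": "Full-text catalog rebuild", "body": "Recreate the indexes", "owner": "dba"},
    {"id": 11, "title": "Catalog file count", "body": "Number of files differs", "owner": "dev"},
]
by_pk = {row["id"]: row for row in rows}  # stand-in for a lookup by PK in the database

# Only title and body are indexed; owner is fetched later via the PK.
index = build_index(rows, pk="id", fields=["title", "body"])
hits = search(index, "catalog files")
full_rows = [by_pk[pk] for pk in sorted(hits)]
```

Keeping the index to the searchable fields plus the key keeps it small, and the database remains the single source for everything else about a record.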
Gotchas? Updating the index is also much quicker and easier. However, you have to roll your own for creating the index, updating the index, and searching the index. SUPER EASY to do...but still...there are no wizards or one handed coding here! Also, the index is on the file system. If the file is open and being searched and you try to open it again for another search you will obviously have some issues...so writing some form of infrastructure around opening and reading these indexes needs to be built.
How does this help in SQL Server? You can easily wrap your Lucene search in a CLR function or proc which can be installed in the database that you can then use as though it were native to your t-SQL queries.
