BaseX attribute index lost after reboot?

I have a fairly big BaseX database (>2 GB) containing a large number of XML documents. The XML files are pretty flat in structure. A simplified example of a typical XML file:
<document id="doc_id_1234">
<value id="1">value 1</value>
<value id="2">value 2</value>
<value id="3">value 3</value>
</document>
My XQueries are largely based on attribute selectors (e.g. //value[@id='1' or @id='3']), and I have found that creating an Attribute Index on the database resulted in a massive query performance increase.
I upload new XML data on a monthly or quarterly basis. After importing the new XML files I re-create the Attribute Index again.
I have found, however, that after a reboot of the server (which seems to happen quite often at my service provider) the query speed decreases significantly. It feels like the performance drops to the state without the Attribute Index present.
If I open the database using the BaseX GUI, it looks like the Attribute Index is still there. When I drop the existing Attribute Index and re-create it again, the performance of my XQueries is lightning fast again.
I am using BaseX version 7.7.1.
I would like to know:
Where is the Attribute Index stored? Is it in RAM (which would explain why the query speed decreases after a reboot)?
How can I configure my database in such a way that the XQuery performance remains consistently good?
Really hope you can help me out as this is a significant issue on my production website.

To answer your questions:
The attribute index is materialized on disk inside your BaseXData folder (which contains a subfolder for each database); it usually resides in your home directory. The attribute indexes (names and values) are stored in the files following the pattern atv*.basex.
Usually, the attribute index should survive restarts of both BaseX and your operating system. If you can reproduce the index being invalidated without doing any updates to the database, you might want to post to the BaseX mailing list to make sure this isn't a bug. Before doing so, try the following and make sure you're really not updating the database on startup.
You might want to try setting the UPDINDEX option to true. This should rebuild the index when it is invalidated or not available. To make sure the index is actually used, run the query with basexclient -V.
Disclaimer: I'm somewhat affiliated with the BaseX-Team.

Related

Migrate data between search services

I am trying to move an Azure Search service from the standard pricing tier to basic. I can't seem to find a way to do that other than creating another service and manually moving the data between them. I am about to create a temporary console project that selects all data from the source service and uploads it to the destination service. Is there no data migration tool for this?
Unfortunately, we do not yet have migration support between tiers in Azure Search and it does require re-creating the index in a new service. Please know that we understand the importance of this and have it high on our priority list.
Also, when you do this migration of your index, there are a few things you will need to keep in mind.
First off, when you export the data, you will likely be using our paging ($skip and $top), but note that this paging is limited to 100K documents. As a result, if you have more than 100K docs, you will need some sort of filtering. For example, if you have a State or Province field, you could search and $filter where State eq 'WA'.
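As an illustration of that partitioned export, here is a rough C# sketch against the Azure Search REST API. The service name, index name, api-version, API key, and the State field are placeholders, and the JSON handling is deliberately crude; it is a sketch of the approach, not the sample mentioned below.
using System.Net.Http;
using System.Threading.Tasks;

static async Task ExportPartitionAsync(HttpClient http)
{
    const string serviceUrl = "https://YOUR-SERVICE.search.windows.net";
    const string indexName  = "YOUR-INDEX";
    const int pageSize = 1000;

    // Admin or query key for the source service.
    http.DefaultRequestHeaders.Add("api-key", "YOUR-API-KEY");

    for (int skip = 0; ; skip += pageSize)
    {
        // One $filter partition (here: State eq 'WA') so $skip never exceeds 100K.
        // Real code should URL-encode the query string.
        string url = serviceUrl + "/indexes/" + indexName + "/docs" +
                     "?api-version=2015-02-28&search=*" +
                     "&$filter=State eq 'WA'" +
                     "&$top=" + pageSize + "&$skip=" + skip;

        string json = await http.GetStringAsync(url);

        // Crude stop condition: the response's "value" array is empty.
        if (json.Contains("\"value\":[]"))
            break;

        // Parse the documents out of "value" and push them to the target index here.
    }
}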
If you happen to have the original data for the index in a different location (such as SQL), you will find it easier to do this re-loading from there.
Finally, taking all of the above into account, I have been working on a sample here that shows how to export and reload the schema and data, which will hopefully help for smaller indexes (fewer than 100K docs). Ultimately, it is really important to make sure that all of the documents are successfully migrated.
Also, it would be great if you could vote for this feature.

About PATH in FTS Alfresco queries

I'm using Alfresco 4.1.6 and SOLR 1.4.
For search, I use fts_alfresco_language and the searchService.query method.
And in my query I search by PATH, TYPE and some custom properties like direction, telephone, mail, or similar.
I now have over 2 million documents, and we can see that search performance is worse than it was at the beginning.
I have read that in SOLR 1.4, using PATH in the query is a bad idea, and that it is better to avoid it and use only TYPE and the property key and value.
But I have 2 questions...
Why does PATH increase the response time? Shouldn't it help? I have over 1000 main folders at the root of the repository. If I specify the folder that SOLR should search in, why doesn't this filter the results, and why does it give a worse response time than if I don't specify it? Or is there another way to tell SOLR which main folder to use, to reduce the results before doing the rest of the query?
When I search by custom properties, I use 3 or 4 properties, all indexed. Do these combined lookups have a higher overhead than a single one? Would it be better to search by only one property instead of all 3 or 4? Or maybe use ORs instead of ANDs to get results more quickly? How does SOLR handle this?
Thanks!
First, let me start with this: I'm not sure what you want from this question because it's vague. You're not asking how to make your query better; you're asking why a bad practice (known to perform badly) is performing badly for you.
Do some research on how to structure your ECM system; the first thing that makes an ECM system any good is a proper Content Model. There are books out there that will help you.
If you're structuring your content with folders (Path) and these are important for you, then you need to add them as metadata to your content. If you haven't done that, then you should start with that.
A good Content Model will be able to find content wherever it's placed within your ECM system.
Sure it's easy to migrate a filesystem to an ECM system and just leave it there, but you've done only half the work.
Path queries are slow in general because they use a loop pattern, which is expensive. This has been greatly improved in newer SOLR versions, but it still isn't as fast as normal metadata querying.

Sitecore - Increasingly low performance when adding large no. of items

In our Sitecore 6.6.0 (rev. 130404) based project, we are required to migrate data from the old system's database to the Sitecore database. We need to migrate around 650,000 objects. Each of these objects from the old database will create around 4 Sitecore items in the master database, so it's a fairly large set of data being migrated.
We've hooked up the Sitecore APIs to a Windows application and we run the data migration logic from that app. At the beginning of the data migration, things are fairly fast: around 4 objects per second are transferred to the Sitecore master database. The first 10,000 objects only took 40 minutes. At this rate, one would predict that 100,000 objects would be migrated in 7 hours.
But the problem is that over time, things get increasingly and noticeably slow. After migrating around 100,000 objects, it now takes around 7 hours to migrate just 30,000 objects. I even rebuilt the Sitecore database indexes from time to time, as mentioned in the performance tuning guide. We also don't perform any Sitecore queries to find where to place the newly created Sitecore items. No Sitecore agents or Lucene index update operations are running while our data migration is happening.
Here's the code at the beginning of the data migration logic:
using (new Sitecore.SecurityModel.SecurityDisabler())
using (new Sitecore.Data.Proxies.ProxyDisabler())
using (new Sitecore.Data.DatabaseCacheDisabler())
using (new Sitecore.Data.BulkUpdateContext())
{
    // migration logic: creates the ~4 Sitecore items for each source object
}
Could the reason for this slowness be the growth of the Sitecore database indexes? I'm not an SQL expert, but after some reading I got a report on the index operational statistics. I'm not sure whether the numbers indicate the cause of our problem.
Can anybody with better sitecore/sql knowledge than me, help on this?
Edit: after a bit more digging I got statistics for SQL Server latches (I don't really understand those).
Thanks
After a few days of tedious investigation I found the root cause of this slowness. It was not the database indexes. The problem was Database.GetItem(<item path>) method calls inside the Sitecore MediaCreator class. (Our data migration includes the creation of image items.)
In the Sitecore tree of our website, some items have quite a large number (tens of thousands) of children under them. Although having a large number of items under one node is not recommended in Sitecore, it is the correct design for our project. If we do a GetItem(<item path>) call for one of these child items, it takes a long time to return the item. Obviously GetItem() by item path is much slower than getting the item by ID. Unfortunately we don't have any control over this, because the Sitecore MediaCreator uses item paths to create media items.
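For illustration, a minimal sketch of the two lookups (the path and GUID below are made up, not from our project):
Sitecore.Data.Database master = Sitecore.Configuration.Factory.GetDatabase("master");

// Path-based lookup: resolves the path segment by segment, so it degrades badly
// when a folder holds tens of thousands of children.
Sitecore.Data.Items.Item byPath =
    master.GetItem("/sitecore/media library/Images/batch-01/image-42");

// ID-based lookup: a direct key lookup, largely unaffected by the number of siblings.
Sitecore.Data.Items.Item byId =
    master.GetItem(Sitecore.Data.ID.Parse("{11111111-2222-3333-4444-555555555555}"));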
Using dotPeek I was able to inspect the Sitecore source code and create a version of the MediaCreator class that doesn't use item paths for GetItem(), and the data migration began to run fast again.
I'm going to ask on the Sitecore forum whether there is any way to overcome this performance issue without duplicating the MediaCreator source code.
The first things you should look at are:
Disable all indexes during the migration.
Wrap your custom logic in SecurityDisabler(), EventDisabler(), ProxyDisabler().
SQL Server performance might be the problem: make sure to set proper values for database growth - https://www.simple-talk.com/sql/database-administration/sql-server-database-growth-and-autogrowth-settings/
Also, see similar question here: Optimisation tips when migrating data into Sitecore CMS
You can hash the media creator path into a deterministic GUID. Then you can likely use GUIDs as lookup values (see the sketch below).
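A minimal sketch of deriving such a GUID (not Sitecore-specific; the helper name is made up):
using System;
using System.Security.Cryptography;
using System.Text;

static class PathGuid
{
    // MD5 produces 16 bytes, which is exactly the size of a GUID, so the same
    // path always maps to the same GUID.
    public static Guid FromPath(string path)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(path.ToLowerInvariant()));
            return new Guid(hash);
        }
    }
}

// Usage: PathGuid.FromPath("/sitecore/media library/Images/foo") always returns the same GUID.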
Also, don't forget to run DB jobs that "defragment" your DB indexes (a SQL index maintenance job; I forget the proper name, but it is hugely important).

Use Lucene as a DBMS

In our project the data volume is high (100 GB of data) and we use SQL Server as our DBMS.
Unfortunately, full-text search in SQL Server is rather disappointing, so we're using Lucene to search our data. But the problem is that Lucene needs to index the data, so holding both the Lucene index and our database would take too much disk space.
So I was wondering: can we put SQL Server aside and just use Lucene? Is it stable enough to hold millions of records of data?
If you want full-text search you need to have a full-text index, no matter where it's physically located.
But, since you have problems with space, I assume you used stored="true" in your schema fields.
Store it in db (preferably something other than MSSQL) and index it in Solr/Lucene.
You might want to take a look at RavenDB. It's lightning fast, based on Lucene and can function as a stand-alone db. Not to mention the maker likes to put it under all kind of stress.
Only "downside": it's commercial, so it's gonna cost ya :)

Configure Lucene.Net with SQL Server

Has anyone used Lucene.NET rather than using the full text search that comes with sql server?
If so I would be interested on how you implemented it.
Did you for example write a windows service that queried the database every hour then saved the results to the lucene.net index?
Yes, I've used it for exactly what you are describing. We had two services - one for read, and one for write, but only because we had multiple readers. I'm sure we could have done it with just one service (the writer) and embedded the reader in the web app and services.
I've used Lucene.NET as a general database indexer, so what I got back was basically DB IDs (to indexed email messages), and I've also used it to get back enough info to populate search results and such without touching the database. It's worked great in both cases, though the SQL can get a little slow, as you pretty much have to get an ID, select by that ID, etc. We got around this by making a temp table (with just the ID column in it), bulk-inserting from a file (which was the output from Lucene), and then joining to the message table. It was a lot quicker.
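A rough sketch of that workaround (table and column names are hypothetical, and it bulk-inserts from an in-memory DataTable via SqlBulkCopy rather than from a file, but the idea is the same):
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static void LoadMessagesByLuceneIds(SqlConnection conn, IEnumerable<int> luceneIds)
{
    // Session-scoped temp table to hold the IDs returned by Lucene.
    using (var create = new SqlCommand("CREATE TABLE #LuceneIds (MessageId INT PRIMARY KEY)", conn))
        create.ExecuteNonQuery();

    var ids = new DataTable();
    ids.Columns.Add("MessageId", typeof(int));
    foreach (int id in luceneIds)
        ids.Rows.Add(id);

    using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#LuceneIds" })
        bulk.WriteToServer(ids);

    // One joined query instead of one SELECT per ID.
    using (var query = new SqlCommand(
        "SELECT m.* FROM Messages m JOIN #LuceneIds t ON m.Id = t.MessageId", conn))
    using (SqlDataReader reader = query.ExecuteReader())
    {
        while (reader.Read())
        {
            // Populate the search results from the reader...
        }
    }
}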
Lucene isn't perfect, and you do have to think a little outside the relational database box, because it TOTALLY isn't one, but it's very very good at what it does. Worth a look, and, I'm told, doesn't have the "oops, sorry, you need to rebuild your index again" problems that MS SQL's FTI does.
BTW, we were dealing with 20-50million emails (and around 1 million unique attachments), totaling about 20GB of lucene index I think, and 250+GB of SQL database + attachments.
Performance was fantastic, to say the least - just make sure you think about, and tweak, your merge factors (when it merges index segments). There is no issue in having more than one segment, but there can be a BIG problem if you try to merge two segments which have 1 million items each, and you have a watcher thread which kills the process if it takes too long..... (yes, that kicked our arse for a while). So keep the max number of documents per segment LOW (i.e., don't set it to maxint like we did!)
EDIT: Corey Trager documented how to use Lucene.NET in BugTracker.NET here.
I have not done it against a database yet; your question is somewhat open-ended.
If you want to search a DB, and can choose to use Lucene, I also guess that you can control when data is inserted into the database.
If so, there is little reason to poll the DB to find out whether you need to reindex: just index as you insert, or create a queue table which can be used to tell Lucene what to index (a rough sketch follows below).
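A rough sketch of the queue-table idea (the table and column names are hypothetical): each insert into the data table also inserts a row into IndexQueue, and a background job drains the queue instead of rescanning everything on a schedule.
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

static void DrainIndexQueue(SqlConnection conn, Action<int, string> indexDocument)
{
    var pending = new List<Tuple<int, string>>();
    const string select =
        "SELECT q.DocId, d.Text FROM IndexQueue q JOIN Documents d ON d.Id = q.DocId";

    // Read everything that is waiting to be indexed.
    using (var cmd = new SqlCommand(select, conn))
    using (var reader = cmd.ExecuteReader())
        while (reader.Read())
            pending.Add(Tuple.Create(reader.GetInt32(0), reader.GetString(1)));

    foreach (var item in pending)
    {
        indexDocument(item.Item1, item.Item2);   // add/update the Lucene document

        // Remove the row once it has been indexed.
        using (var del = new SqlCommand("DELETE FROM IndexQueue WHERE DocId = @id", conn))
        {
            del.Parameters.AddWithValue("@id", item.Item1);
            del.ExecuteNonQuery();
        }
    }
}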
I don't think we need another indexer that is ignorant of what it is doing, reindexes every time, or wastes resources.
I have also used Lucene.NET as a storage engine, because it's easier to distribute and to set up alternate machines with an index than with a database: it's just a filesystem copy. You can index on one machine and simply copy the new files to the other machines to distribute the index. All the searches and details are served from the Lucene index, and the database is used only for editing. This setup has proven to be a very scalable solution for our needs.
Regarding the differences between SQL Server and Lucene, the principal problem with SQL Server 2005 full-text search is that the service is decoupled from the relational engine, so joins, ordering, aggregates, and filters between the full-text results and the relational columns are very expensive in performance terms. Microsoft claims that these issues have been addressed in SQL Server 2008 by integrating the full-text search into the relational engine, but I haven't tested it. They also made the whole full-text search much more transparent: in previous versions the stemmers, stopwords, and several other parts of the indexing were like a black box and difficult to understand, whereas in the new version it is easier to see how they work.
In my experience, if SQL Server meets your requirements, it will be the easiest way. If you expect a lot of growth, have complex queries, or need fine-grained control of the full-text search, you might consider working with Lucene from the start, because it will be easier to scale and customise.
I used Lucene.NET along with MySQL. My approach was to store the primary key of the DB record in the Lucene document along with the indexed text. In pseudocode it looks like this (a concrete sketch follows after the pseudocode):
Store record:
insert text, other data to the table
get latest inserted ID
create lucene document
put (ID, text) into lucene document
update lucene index
Querying
search lucene index
for each lucene doc in result set load data from DB by stored record's ID
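A minimal C# sketch of that pseudocode, assuming Lucene.Net 3.0.3 (the field names and directory handling are illustrative):
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Version = Lucene.Net.Util.Version;

// Store step: after inserting the row and reading back its ID,
// add a Lucene document that stores the ID but only indexes the text.
static void IndexRecord(Lucene.Net.Store.Directory dir, int recordId, string text)
{
    var analyzer = new StandardAnalyzer(Version.LUCENE_30);
    using (var writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        var doc = new Document();
        doc.Add(new Field("id", recordId.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("text", text, Field.Store.NO, Field.Index.ANALYZED));
        writer.AddDocument(doc);
        writer.Commit();
    }
}

// Query step: search the index, return the stored IDs, then load each record from the DB.
static IEnumerable<int> SearchRecordIds(Lucene.Net.Store.Directory dir, string queryText)
{
    var analyzer = new StandardAnalyzer(Version.LUCENE_30);
    using (var searcher = new IndexSearcher(dir, true))   // true = read-only
    {
        var parser = new QueryParser(Version.LUCENE_30, "text", analyzer);
        TopDocs hits = searcher.Search(parser.Parse(queryText), 20);
        foreach (ScoreDoc hit in hits.ScoreDocs)
            yield return int.Parse(searcher.Doc(hit.Doc).Get("id"));
    }
}
In practice you would keep one IndexWriter open rather than creating one per record; the directory could be, for example, FSDirectory.Open(new System.IO.DirectoryInfo(@"C:\lucene-index")).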
Just to note, I switched from Lucene to Sphinx due to its superb performance.
