Solr - Recommended batch size for reindexing

I've just installed Solr on my Rails application (using sunspot).
I want Solr to re-index a couple of columns on one of my tables; the table is pretty big (~50M records).
What is the recommended batch size to use? Currently I'm using 1000 and it has been running for over a day.
Any ideas?

The batch size is not that important, 1000 is probably OK, though I wouldn't go any larger than that. It depends on the size of the documents, how many bytes of text are indexed for each one.
Are you committing after each batch? That can be slow. I load a 23M document index with a single commit at the end. The documents are small, the metadata for books, and it takes about 90 minutes. To get that speed, I needed to use a single SQL query for the load. Using any subqueries made it about 10X slower.
I'm using the JDBC support in the DataImportHandler, though I may move to some custom code that makes a DB query and submits batches.
I've heard that the CSV input handler is very efficient, so it might work to dump your data to CSV, then load it with that handler.
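Not sunspot-specific, but here is a minimal sketch of the "batches with a single commit at the end" approach described above: it posts JSON batches straight to Solr's update handler and commits once at the very end. The Solr URL, core name, field names and fetch_batches() data source are placeholders, not anything from the original setup.
import json
import urllib.request

SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"  # placeholder core name

def post_json(url, payload):
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def fetch_batches(batch_size=1000):
    """Placeholder: yield lists of documents pulled from your table."""
    yield [{"id": "1", "title_t": "example"}]

for batch in fetch_batches():
    post_json(SOLR_UPDATE_URL, batch)        # add a batch, no commit yet

post_json(SOLR_UPDATE_URL, {"commit": {}})   # single commit once everything is in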

Related

Full build of a Solr index with a large amount of data

I have a text file containing over 10 million records of web pages.
I want to build a Solr index from this file every day (because the file is updated daily).
Are there any effective ways to do a full build of the Solr index in one go, such as using a MapReduce model to accelerate the build?
I think using the Solr API to add documents one by one is a little slow.
It is not clear how much content is in those 10 million records, but it may actually be simple enough to index them in bulk. Just check your solrconfig.xml for your commit settings; you may, for example, have autoCommit configured with a low maxDocs setting. In your case, you may want to disable autoCommit completely and just commit manually at the end.
However, if it is still a bit slow, before going to map-reduce, you could think about building a separate index and then swapping it with the current index.
This way, you actually have the previous collection to roll back to and/or to compare against if needed. The new collection can even be built on a different machine and/or closer to the data.
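As a hypothetical illustration of the swap idea on standalone Solr (SolrCloud setups would use collection aliases instead), a small Python sketch calling the Core Admin SWAP action; the core names are made up.
import urllib.parse
import urllib.request

ADMIN = "http://localhost:8983/solr/admin/cores"

def swap_cores(live, rebuild):
    # Atomically swap the freshly built core into the live position.
    params = urllib.parse.urlencode({"action": "SWAP", "core": live, "other": rebuild})
    with urllib.request.urlopen(ADMIN + "?" + params) as resp:
        return resp.read()

# 1. Index the daily file into "products_rebuild" (autoCommit off, one commit at the end).
# 2. Swap it with the live core, keeping the old index around as a rollback.
swap_cores("products_live", "products_rebuild")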

SQLite record update speed improvements

I currently have trouble updating SQLite database records at scale within a healthy amount of time.
I have a small database of about 70,000 records, and some office personnel use Navicat to filter the records and make bulk edits to commit. When trying to perform an UPDATE on a large number of records in one field, everything comes to a crawl. When I look at the raw SQL query, I can see that the program is using an UPDATE ..SET.. WHERE to perform the operations.
My question is: what can I do to help this query run faster? I have a SQLite auto index on the column used for matching the record to update, but I have read and searched all over and have yet to see any resolution other than what I already have in place. Updating one field in 20,000 records is literally taking close to 3 hours, regardless of using Navicat or not. The whole database is only 30 MB, so I have to be doing something wrong.
All other database operations run nice and speedy. Not sure what's wrong and looking for some guidance from a SQLite vet.
There are very few reasons why SQLite can perform as poorly as you describe, but I've experienced pretty much the same thing.
First, if you are performing a table-wide update on an indexed column, you're going to thrash the whole index, all while making it less useful. Delete the index before you update, then recreate it afterwards.
Additionally, you can get a serious performance boost by changing the synchronous PRAGMA, though it comes with a small extra risk of corrupt data on a crash (extremely unlikely in my experience):
PRAGMA schema.synchronous = 0
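A rough sketch of that advice using Python's sqlite3: relax synchronous, drop the index on the column being updated, run the bulk UPDATE inside one transaction, then rebuild the index. The table, column and index names are invented for illustration.
import sqlite3

conn = sqlite3.connect("office.db")
conn.execute("PRAGMA synchronous = OFF")     # the speed/durability trade-off mentioned above

conn.execute("DROP INDEX IF EXISTS idx_records_status")   # index on the column being updated
with conn:                                   # one transaction for the whole bulk update
    conn.execute(
        "UPDATE records SET status = ? WHERE id BETWEEN ? AND ?",
        ("archived", 10000, 30000),
    )
conn.execute("CREATE INDEX idx_records_status ON records(status)")
conn.close()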

Which approach and database to use in performance-critical solution

I have the following scenario:
Around 70 million pieces of equipment send a signal to the server every 3-5 minutes, reporting their id, status (online or offline), IP, location (latitude and longitude), parent node and some other information.
The other information might not be in a standard format (so no schema for me), but I still need to query it.
The equipment might disappear for some time (or forever), not sending signals in the meantime. So I need a way to "forget" equipment that has not sent a signal in the last X days. New equipment might also come online at any time.
I need to query all this data, e.g. knowing how much equipment is offline in a specific region or over an IP range. There won't be many queries running at the same time.
Some of the queries need to run fast (less than 3 min per query) while the database is being updated, so I need indexes on the main attributes (id, status, IP, location and parent node). The query results do not need to be 100% accurate; eventual consistency is fine as long as it doesn't take too long (more than 20 min on average) for updates to appear in query results.
I don't need persistence at all; if the power goes out it's okay to lose everything.
Given all this, I thought of using a NoSQL approach, maybe MongoDB or CouchDB, since I have experience with MapReduce and JavaScript, but I don't know which one is better for my problem (I'm gravitating towards CouchDB) or whether they are a fit at all to handle this massive workload. I don't even know if I actually need a "traditional" database, since I don't need persistence to disk (maybe a main-memory approach would be better?), but I do need a way to build custom queries easily.
The main problems I see are the following:
Needing to insert/update lots of tuples really fast, without knowing beforehand whether the signal I receive is already in the database or not. Almost all of the signals will be in the same state as last time, so maybe query by id, check whether the tuple changed, and only update if it did (see the sketch after this list)?
Forgetting offline equipment. A batch job that runs during the night removing expired tuples would solve this problem.
There won't be many queries running at the same time, but they need to run fast. So I guess I need a cluster that runs a single query across multiple nodes (does CouchDB MapReduce split the workload across the nodes of a cluster?). I'm not entirely sure I need a cluster, though; could a single, more expensive machine handle all the load?
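To make the "only write when something changed, then forget stale equipment" idea concrete, here is a minimal sketch of the upsert-and-expire pattern. It uses SQLite purely for brevity (the real system would more likely be a NoSQL or in-memory store), and the table and column names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE equipment (
        id        TEXT PRIMARY KEY,
        status    TEXT,
        ip        TEXT,
        lat       REAL,
        lon       REAL,
        last_seen TEXT DEFAULT (datetime('now'))
    )
""")

def record_signal(sig):
    # Insert unknown equipment, otherwise refresh status/IP/location/last_seen in place.
    conn.execute(
        """INSERT INTO equipment (id, status, ip, lat, lon)
           VALUES (:id, :status, :ip, :lat, :lon)
           ON CONFLICT(id) DO UPDATE SET
               status = excluded.status,
               ip = excluded.ip,
               lat = excluded.lat,
               lon = excluded.lon,
               last_seen = datetime('now')""",
        sig,
    )

def purge_stale(days=7):
    # The nightly batch job: forget equipment that has been silent for `days` days.
    conn.execute(
        "DELETE FROM equipment WHERE last_seen < datetime('now', ?)",
        ("-{} days".format(days),),
    )

record_signal({"id": "eq-1", "status": "online", "ip": "10.0.0.1", "lat": 1.0, "lon": 2.0})
purge_stale()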
I have never used a noSQL system before, but I have theoretical
knowledge of the subject.
Does this make sense?
Apache Flume for collecting the signals.
It is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. Easy to configure and scale. Store the data in HDFS as files using Flume.
Hive for batch queries.
Map the data files in HDFS as external tables in Hive warehouse. Write SQL like queries using HiveQL whenever you need offline-batch processing.
HBase for random real-time reads/writes.
Since HDFS, being a file system, lacks random read/write capability, you would need a DB to serve that purpose. Looking at your use case, HBase seems like a good fit to me (a small write/read sketch follows this answer). I would not suggest MongoDB or CouchDB, since you are not dealing with documents here and both are document-oriented databases.
Impala for fast, interactive queries.
Impala allows you to run fast, interactive SQL queries directly on your data stored in HDFS or HBase. Unlike Hive, it does not use MapReduce; it instead leverages the power of MPP, so it's good for real-time work. And it's easy to use, since it uses the same metadata, SQL syntax (Hive SQL), ODBC driver, etc. as Hive.
HTH
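Purely as an illustration of the HBase point above (not part of the original answer), a hypothetical write/read sketch assuming the third-party happybase Python client and an HBase Thrift server; the table name, column family and row key layout are made up.
import happybase

connection = happybase.Connection("localhost")   # HBase Thrift server host
table = connection.table("equipment")            # table and column family are made up

# Write (or overwrite) the latest signal for one piece of equipment.
table.put(b"eq-00000001", {
    b"sig:status": b"online",
    b"sig:ip": b"10.0.0.1",
    b"sig:lat": b"51.5",
    b"sig:lon": b"-0.12",
})

# Random real-time read of that same row.
print(table.row(b"eq-00000001"))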
Depending on the type of analysis, CouchDB, HBase, or Flume may all be good choices. For strictly numeric "write-once" metrics data, Graphite is a very popular open source solution.

Does it make sense to use Hadoop for import operations and Solr to provide a web interface?

I'm looking at the need to import a lot of data in real time into a Lucene index. This will consist of files of various formats (DOC, DOCX, PDF, etc.).
The data will arrive as batches of compressed files, so they will need to be decompressed and indexed as individual files, while somehow staying related to the batch as a whole.
I'm still trying to figure out how to accomplish this, but I think I can use Hadoop for the processing and the import into Lucene, and then use Solr as the web interface.
Am I overcomplicating things, since Solr can already process data? Since the CPU load for import is very high (due to pre-processing), I believe I need to separate importing from ordinary searching regardless of the implementation.
Q: "Please define a lot of data and realtime"
"A lot" of data is 1 Billion email messages per year (or more), with an average size of 1K, with attachments ranging from 1K to 20 Megs with a small amount of data ranging from 20 Megs to 200 Megs. These are typically attachments that need indexing referenced above.
Realtime means it supports searching within 30 minutes or sooner after it is ready for import.
SLA:
I'd like to provide a search SLA of 15 seconds or less for searching operations.
If you need the processing done in real-time (or near real-time for that matter) then Hadoop may not be the best choice for you.
Solr already handles all aspects of processing and indexing the files. I would stick with a Solr-only solution first. Solr allows you to scale to multiple machines, so if you find that the CPU load is too high because of the processing, then you can easily add more machines to handle the load.
I suggest that you use Solr Replication to ease the load, by indexing on one machine and retrieving from others. Hadoop is not suitable for real-time processing.
1 billion documents per year translates to approximately 32 documents per second spread uniformly.
You could run text extraction on a separate machine and send the indexable text to Solr. I suppose, at this scale, you have to go for multi-core Solr, so you can send indexable content to different cores. That should speed up indexing.
I have done indexing of small structured documents in the range of 100 million without much trouble on a single core. You should be able to scale to a few hundred million documents with a single Solr instance. (The text extraction service could use another machine.)
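A rough sketch of the "extract elsewhere, spread indexing over several cores" suggestion: the extractor below is a trivial placeholder standing in for a real DOC/PDF extraction service, and the host, core names and field names are invented.
import itertools
import json
import urllib.request

CORES = itertools.cycle([
    "http://solr-host:8983/solr/mail_core_1/update",
    "http://solr-host:8983/solr/mail_core_2/update",
])

def extract_text(raw_bytes):
    """Placeholder for real DOC/DOCX/PDF extraction (e.g. a Tika service elsewhere)."""
    return raw_bytes.decode("utf-8", errors="ignore")

def index_document(doc_id, raw_bytes):
    # Round-robin the indexable content across the cores.
    payload = json.dumps([{"id": doc_id, "body_t": extract_text(raw_bytes)}])
    req = urllib.request.Request(
        next(CORES),
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()

index_document("msg-000001", b"hello world")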
Read about large-scale search on HathiTrust's blog for various challenges and solutions. They use Lucene/Solr.

What are the performance characteristics of sqlite with very large database files? [closed]

2020 update, about 11 years after the question was posted and later closed, preventing newer answers.
Almost everything written here is obsolete. Once upon a time sqlite was limited to the memory capacity or to 2 GB of storage (32 bits) or other popular numbers... well, that was a long time ago.
Official limitations are listed here. In practice, sqlite is likely to work as long as there is storage available. It works well with datasets larger than memory; it was originally created when memory was scarce, and that was an important design point from the start.
There is absolutely no issue with storing 100 GB of data. It could probably store a TB just fine, but eventually that's the point where you need to question whether SQLite is the best tool for the job, and you probably want features from a full-fledged database (remote clients, concurrent writes, read-only replicas, sharding, etc.).
Original:
I know that sqlite doesn't perform well with extremely large database files even when they are supported (there used to be a comment on the sqlite website stating that if you need file sizes above 1 GB you may want to consider using an enterprise RDBMS; I can't find it anymore, so it might relate to an older version of sqlite).
However, for my purposes I'd like to get an idea of how bad it really is before I consider other solutions.
I'm talking about sqlite data files in the multi-gigabyte range, from 2GB onwards.
Anyone have any experience with this? Any tips/ideas?
So I did some tests with sqlite for very large files, and came to some conclusions (at least for my specific application).
The tests involve a single sqlite file with either a single table, or multiple tables. Each table had about 8 columns, almost all integers, and 4 indices.
The idea was to insert enough data until sqlite files were about 50GB.
Single Table
I tried to insert multiple rows into a sqlite file with just one table. When the file was about 7GB (sorry I can't be specific about row counts) insertions were taking far too long. I had estimated that my test to insert all my data would take 24 hours or so, but it did not complete even after 48 hours.
This leads me to conclude that a single, very large sqlite table will have issues with insertions, and probably other operations as well.
I guess this is no surprise: as the table gets larger, inserting into and updating all the indices takes longer.
Multiple Tables
I then tried splitting the data by time over several tables, one table per day. The data for the original single table was split across ~700 tables.
This setup had no insertion problems; it did not take longer as time progressed, since a new table was created for every day.
Vacuum Issues
As pointed out by i_like_caffeine, the VACUUM command becomes more of a problem the larger the sqlite file is. As more inserts/deletes are done, the fragmentation of the file on disk gets worse, so the goal is to VACUUM periodically to optimize the file and recover file space.
However, as the documentation points out, a full copy of the database is made to do a vacuum, which takes a very long time to complete. So, the smaller the database, the faster this operation will finish.
Conclusions
For my specific application, I'll probably be splitting out data over several db files, one per day, to get the best of both vacuum performance and insertion/delete speed.
This complicates queries, but for me, it's a worthwhile tradeoff to be able to index this much data. An additional advantage is that I can just delete a whole db file to drop a day's worth of data (a common operation for my application).
I'd probably have to monitor table size per file as well to see when the speed becomes a problem.
It's too bad that there doesn't seem to be an incremental vacuum method other than auto_vacuum. I can't use it because my goal for vacuum is to defragment the file (file space isn't a big deal), which auto_vacuum does not do. In fact, the documentation states it may make fragmentation worse, so I have to resort to periodically doing a full vacuum on the file.
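For what it's worth, a periodic full VACUUM from application code is just a single statement. The sketch below uses the free-page count as a rough, arbitrary heuristic for deciding when to bother; the per-day file name is made up to match the one-db-per-day design above.
import sqlite3

# isolation_level=None puts the connection in autocommit mode so VACUUM is allowed.
conn = sqlite3.connect("daily_2014_07_01.db", isolation_level=None)

free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
total_pages = conn.execute("PRAGMA page_count").fetchone()[0]

if total_pages and free_pages / total_pages > 0.10:
    conn.execute("VACUUM")    # rewrites the whole file, defragmenting it
conn.close()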
We are using DBs of 50 GB+ on our platform. No complaints, works great.
Make sure you are doing everything right! Are you using prepared statements?
SQLite 3.7.3
Transactions
Prepared statements
Apply these settings (right after you create the DB):
PRAGMA main.page_size = 4096;
PRAGMA main.cache_size=10000;
PRAGMA main.locking_mode=EXCLUSIVE;
PRAGMA main.synchronous=NORMAL;
PRAGMA main.journal_mode=WAL;
PRAGMA main.cache_size=5000;
Hope this will help others, works great here
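If you drive SQLite from application code, those settings are just statements executed right after opening the connection. A sketch with Python's sqlite3, copying the pragma values from the list above (note the list sets cache_size twice, 10000 then 5000; pick whichever you intend); the file name is just an example.
import sqlite3

conn = sqlite3.connect("platform.db")            # file name is just an example
for pragma in (
    "PRAGMA main.page_size = 4096",              # only effective before data is written
    "PRAGMA main.cache_size = 10000",
    "PRAGMA main.locking_mode = EXCLUSIVE",
    "PRAGMA main.synchronous = NORMAL",
    "PRAGMA main.journal_mode = WAL",
):
    conn.execute(pragma)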
I've created SQLite databases up to 3.5GB in size with no noticeable performance issues. If I remember correctly, I think SQLite2 might have had some lower limits, but I don't think SQLite3 has any such issues.
According to the SQLite Limits page, the maximum size of each database page is 32K, and the maximum number of pages in a database is 1024^3. So by my math that comes out to 32 terabytes as the maximum size. I think you'll hit your file system's limits before hitting SQLite's!
Much of the reason that it took > 48 hours to do your inserts is your indexes. It is enormously faster to:
1 - Drop all indexes
2 - Do all inserts
3 - Create indexes again
Besides the usual recommendations:
Drop indexes for bulk inserts.
Batch inserts/updates in large transactions.
Tune your buffer cache / disable journaling with PRAGMAs.
Use a 64-bit machine (to be able to use lots of cache™).
[added July 2014] Use common table expressions (CTEs) instead of running multiple SQL queries! Requires SQLite release 3.8.3.
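A minimal sketch of the drop-indexes / one-big-transaction / recreate-indexes pattern with Python's sqlite3; the table, index and row generator are invented for illustration.
import sqlite3

def generate_rows():
    """Placeholder for the real data source."""
    for i in range(1_000_000):
        yield (i, i % 7, i * 3, i % 2)

conn = sqlite3.connect("big.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (a INTEGER, b INTEGER, c INTEGER, d INTEGER)")

# 1 - drop all indexes
conn.execute("DROP INDEX IF EXISTS idx_events_b")

# 2 - do all inserts inside a single large transaction
with conn:
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", generate_rows())

# 3 - create the indexes again
conn.execute("CREATE INDEX IF NOT EXISTS idx_events_b ON events(b)")
conn.close()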
I have learnt the following from my experience with SQLite3:
For maximum insert speed, don't use a schema with any column constraints. (Alter the table later as needed; note that you can't add constraints with ALTER TABLE.)
Optimize your schema to store what you need. Sometimes this means breaking down tables and/or even compressing/transforming your data before inserting into the database. A great example is storing IP addresses as (long) integers.
One table per db file, to minimize lock contention. (Use ATTACH DATABASE if you want to have a single connection object.)
SQLite can store different types of data in the same column (dynamic typing), use that to your advantage.
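Two of those tips in a short, hypothetical sketch (file, table and column names are made up): storing an IP address as an integer via the standard ipaddress module, and keeping one table per db file while still using a single connection through ATTACH DATABASE.
import ipaddress
import sqlite3

conn = sqlite3.connect("hits.db")                      # holds the "hits" table
conn.execute("ATTACH DATABASE 'hosts.db' AS hosts")    # second file, one table per file

conn.execute("CREATE TABLE IF NOT EXISTS hits (ip INTEGER, ts TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS hosts.machines (ip INTEGER PRIMARY KEY, name TEXT)")

ip_as_int = int(ipaddress.ip_address("192.168.0.10"))  # 3232235530
conn.execute("INSERT INTO hits VALUES (?, datetime('now'))", (ip_as_int,))

# Convert back when reading.
row = conn.execute("SELECT ip FROM hits LIMIT 1").fetchone()
print(ipaddress.ip_address(row[0]))                    # 192.168.0.10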
Question/comment welcome. ;-)
I have a 7GB SQLite database.
Performing a particular query with an inner join takes 2.6 seconds.
In order to speed this up I tried adding indexes. Depending on which index(es) I added, sometimes the query went down to 0.1s and sometimes it went UP to as much as 7s.
I think the problem in my case was that if a column is highly duplicated, then adding an index can degrade performance :(
There used to be a statement in the SQLite documentation that the practical size limit of a database file was a few dozen GB. That was mostly due to the need for SQLite to "allocate a bitmap of dirty pages" whenever you started a transaction, which required 256 bytes of RAM for each MB in the database. Inserting into a 50 GB DB file would thus require about 256 B × 50 × 2^10 ≈ 12.5 MB of RAM just for that bitmap.
But as of recent versions of SQLite, this is no longer needed. Read more here.
I think the main complaints about sqlite scaling are:
Single process write.
No mirroring.
No replication.
I've experienced problems with large sqlite files when using the vacuum command.
I haven't tried the auto_vacuum feature yet. If you expect to be updating and deleting data often then this is worth looking at.
