Best way to index database table data in Solr? - database

I have a table with around 100,000 rows at the moment. I want to index the data in this table in a Solr Index.
So the naive method would be to:
Get all the rows
For each row: convert to a SolrDocument and add each document to a request
Once all rows are converted then post the request
Some problems with this approach that I can think of are:
Loading too much data (the content of the whole table) in to memory
POSTing a big request
However, some advantages:
Only one request to the Database
Only one POST request to Solr
This approach does not scale: as the table grows, so do the memory requirements and the size of the POST request. Perhaps I need to take n rows, process them, then take the next n?
I'm wondering if anyone has any advice on how best to implement this?
(ps. I did search the site but I didn't find any questions that were similar to this.)
Thanks.

If you want to balance between POSTing all documents at once and doing one POST per document you could use a queue to collect documents and run a separate thread that sends documents once you have collected enough. This way you can manage the memory vs. request time problem.
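A minimal sketch of that queue-plus-sender-thread idea, in case it helps. The `send_batch` callable is a placeholder for whatever actually POSTs to Solr, and `start_batch_sender` and the default batch size are made-up names and values, not part of any Solr client:

```python
import queue
import threading

def start_batch_sender(send_batch, batch_size=100):
    """Start a background sender; returns (put, close).

    send_batch is a hypothetical callable that POSTs a list of documents.
    """
    # Bounded queue: producers block when the sender falls behind,
    # which keeps memory use roughly proportional to batch_size.
    docs = queue.Queue(maxsize=batch_size * 2)
    stop = object()  # sentinel telling the worker to flush and exit

    def worker():
        batch = []
        while True:
            item = docs.get()
            if item is stop:
                break
            batch.append(item)
            if len(batch) >= batch_size:
                send_batch(batch)
                batch = []
        if batch:
            send_batch(batch)  # flush the final partial batch

    thread = threading.Thread(target=worker)
    thread.start()

    def close():
        docs.put(stop)
        thread.join()

    return docs.put, close
```

You would call `put(doc)` for each converted row and `close()` once the rows are exhausted.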

I used the suggestion from nikhil500:
DIH does support many transformers. You can also write custom transformers. I will recommend using DIH if possible - I think it will need the least amount of coding and will be faster than POSTing the documents. – nikhil500 Feb 6 at 17:42

I once had to upload ~3000 rows (each of 5 fields) from a DB to Solr. I uploaded each document separately and did a single commit. The entire operation took only a few seconds, but some uploads (8 of 3000) failed.
What worked perfectly was uploading in batches of 50 before committing. 50 may have been very low. There are recommended limits on how many documents you can upload before doing a commit; the limit depends on the size of the documents.
But then, this is a one-off operation, which you can supervise with a hacked script. Would a subsequent operation make you index 100,000 rows at once? Or can you get away with indexing only a few hundred updated documents per operation?
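Roughly what the batched upload could look like, as a sketch only. `post_docs` and `commit` are hypothetical callables standing in for your Solr client, and committing after every batch is an assumption (you could just as well pass a no-op and commit once at the end):

```python
from itertools import islice

def batched(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def index_in_batches(rows, post_docs, commit, size=50):
    """Post rows in batches; return the batches that failed."""
    failed = []
    for chunk in batched(rows, size):
        try:
            post_docs(chunk)  # hypothetical: one POST per batch
            commit()          # hypothetical: commit after each batch
        except Exception:
            failed.append(chunk)  # keep failures so they can be retried
    return failed
```

Because `batched` pulls rows lazily, only one batch is held in memory at a time if `rows` is a database cursor or generator.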

Related

Google Cloud Datastore queries too slow when fetching all records

I am experiencing extremely slow performance of Google Cloud Datastore queries.
My entity structure is very simple:
calendarId, levelId, levelName, levelValue
And there are only about 1400 records and yet the query takes 500ms-1.2 sec to give back the data. Another query on a different entity also takes 300-400 ms just for 313 records.
I am wondering what might be causing such delay. Can anyone please give some pointers regarding how to debug this issue or what factors to inspect?
Thanks.
You are experiencing expected behavior. You shouldn't need to get that many entities when presenting a page to user. Gmail doesn't show you 1000 emails, it shows you 25-100 based on your settings. You should fetch a smaller number (e.g., the first 100) and implement some kind of paging to allow users to see other entities.
If this is backend processing, then you will simply need that much time to process entities, and you'll need to take that into account.
Note that you generally want to fetch your entities in large batches, and not one by one, but I assume you are already doing that based on the numbers in your question.
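To illustrate the paging idea, here is a toy sketch over an in-memory list. Real Datastore cursors are opaque tokens handed back by the query, so the integer cursor and the `fetch_page` name here are stand-ins I made up:

```python
def fetch_page(records, cursor=0, page_size=100):
    """Return (page, next_cursor); next_cursor is None on the last page."""
    page = records[cursor:cursor + page_size]
    next_cursor = cursor + len(page)
    if next_cursor >= len(records):
        next_cursor = None  # nothing left to fetch
    return page, next_cursor
```

The caller shows the first page, keeps `next_cursor`, and only fetches the next page when the user asks for it.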
Not sure if this will help but you could try packing more data into a single entity by using embedded entities. Embedded entities are not true entities, they are just properties that allow for nested data. So instead of having 4 properties per entity, create an array property on the entity that stores a list of embedded entities each with those 4 properties. The max size an entity can have is 1MB, so you'll want to pack the array to get as close to that 1MB limit as possible.
This will lower the number of true entities and I suspect this will also reduce overall fetch time.
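A rough sketch of the packing step, assuming the JSON-encoded size of a row is a usable estimate of its stored size (it is only an approximation, so leave some headroom). `pack_rows` and the byte budget are illustrative, not Datastore API:

```python
import json

def pack_rows(rows, max_bytes=1_000_000):
    """Group rows so each group's estimated size stays under max_bytes."""
    groups, current, current_size = [], [], 0
    for row in rows:
        # Estimated size only: the real stored size will differ.
        size = len(json.dumps(row).encode("utf-8"))
        if current and current_size + size > max_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(row)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

Each group would then be stored as the array property of one entity.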

When to optimize a Solr Index [duplicate]

I have a classifieds website. Users may put ads, edit ads, view ads etc.
Whenever a user puts an ad, I am adding a document to Solr.
I don't know, however, when to commit it. Commit slows things down from what I have read.
How should I do it? Autocommit every 12 hours or so?
Also, how should I do it with optimize?
A little more detail on Commit/Optimize:
Commit: When you are indexing documents to Solr, none of the changes you make will appear until you run the commit command. So when to run the commit command really depends on how quickly you want the changes to appear on your site through the search engine. However, it is a heavy operation and so should be done in batches, not after every update.
Optimize: This is similar to a defrag command on a hard drive. It reorganizes the index into fewer segments (increasing search speed) and removes any deleted (replaced) documents. Solr's index segments are effectively write-once, so every time you re-index a document it marks the old document as deleted and then creates a brand new document to replace it. Optimize removes these deleted documents. You can compare the searchable document count with the deleted document count by going to the Solr Statistics page and looking at the numDocs vs. maxDoc numbers. The difference between the two numbers is the number of deleted (non-searchable) documents in the index.
Also, Optimize builds a whole NEW index from the old one and then switches to the new index when complete. Therefore the command requires double the space to perform the action, so you will need to make sure that the size of your index does not exceed 50% of your available hard drive space. (This is a rule of thumb; it usually needs less than 50% because of deleted documents.)
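The two bits of arithmetic above, written out as a sketch. The function names are mine, and the disk check is only the rule of thumb from this answer (free space at least as large as the current index), not something Solr enforces:

```python
def deleted_docs(num_docs, max_doc):
    """Deleted-but-not-yet-purged documents, from the two statistics-page counts."""
    return max_doc - num_docs

def optimize_fits(index_bytes, free_bytes):
    """Rule of thumb: optimize writes a second copy of the index,
    so the free space should be at least the current index size."""
    return free_bytes >= index_bytes
```

For example, 900 searchable documents against a total of 1000 means 100 deleted documents are still taking up space in the index.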
Index Server / Search Server:
Paul Brown was right that the best design for Solr is to have a server dedicated and tuned to indexing, and then replicate the changes to the searching servers. You can tune the index server to have multiple index end points.
eg: http://solrindex01/index1; http://solrindex01/index2
And since the index server is not searching for content you can have it set up with different memory footprints and index warming commands etc.
Hope this is useful info for everyone.
Actually, committing often and optimizing makes things really slow. It's too heavy.
After a day of searching and reading stuff, I found out this:
1- Optimize causes the index to double in size while being optimized, and makes things really slow.
2- Committing after each add is NOT a good idea, it's better to commit a couple of times a day, and then make an optimize only once a day at most.
3- Commit should be set to "autoCommit" in the solrconfig.xml file, and there it should be tuned according to your needs.
The way that this sort of thing is usually done is to perform commit/optimize operations on a Solr node located out of the request path for your users. This requires additional hardware, but it ensures that the performance penalty of the indexing operations doesn't impact your users. Replication is used to periodically shuttle optimized index files from the master node to the nodes that perform search queries for users.
Try it first. It would be really bad if you avoided a simple and elegant solution just because you read that it might cause a performance problem. In other words, avoid premature optimization.

Search using Solr vs Map Reduce on Files - which is reliable?

I have an application which needs to store a huge volume of data (around 200,000 txns per day), with each record around 100 KB to 200 KB in size. The format of the data is going to be JSON/XML.
The application should be highly available, so we plan to store the data on S3 or AWS DynamoDB.
We have use-cases where we may need to search the data based on a few attributes (date ranges, status, etc.). Most searches will be on few common attributes but there may be some arbitrary queries for certain operational use cases.
I researched the ways to search non-relational data and so far found two ways being used by most technologies
1) Build an index (Solr/CloudSearch,etc.)
2) Run a Map Reduce job (Hive/Hbase, etc.)
Our requirement is for the search results to be reliable (consistent with the data in S3/DB, something like an Oracle query: it is okay to be slow, but when we get the data we should have everything that matched the query returned, or at least be told that some results were skipped).
At the outset it looks like the index based approach would be faster than the MR. But I am not sure if it is reliable - index may be stale? (is there a way to know the index was stale when we do the search so that we can correct it? is there a way to have the index always consistent with the values in the DB/S3? Something similar to the indexes on Oracle DBs).
The MR job seems to be reliable always (as it fetches data from S3 for each query); is that assumption right? Is there any way to speed up this query, maybe by partitioning the data in S3 and running multiple MR jobs, one per partition?
You can <commit /> and <optimize /> the Solr index after you add documents, so I'm not sure a stale index is a concern. I set up a Solr instance that handled maybe 100,000 additional documents per day. At the time I left the job we had 1.4 million documents in the index. It was used for internal reporting and it was performant (the most complex query took under a minute). I just asked a former coworker and it's still doing fine a year later.
I can't speak to the map reduce software, though.
You should think about having one Solr core per week or month, for instance. This way older cores will be read-only, easier to manage, and very easy to spread over several Solr instances. If 200k docs are to be added per day forever, you need either that or Solr sharding; a single core will not be enough forever.
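One way to pick the core for a given record date, as a sketch. The `txns_` prefix and the naming scheme are invented for illustration; any scheme that maps a date to exactly one core would do:

```python
from datetime import date

def core_for(day, prefix="txns", by="month"):
    """Map a record's date to a (hypothetical) per-period core name."""
    if by == "month":
        return "%s_%04d_%02d" % (prefix, day.year, day.month)
    if by == "week":
        # ISO week numbering; the ISO year can differ from day.year
        # around New Year, so use the one isocalendar() returns.
        year, week, _ = day.isocalendar()
        return "%s_%04d_w%02d" % (prefix, year, week)
    raise ValueError("by must be 'month' or 'week'")
```

Writes go to the core for the current period; a date-range search only needs to query the cores whose periods overlap the range.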

Maximum number of records for a custom object in salesforce.com

What is the maximum number of records within a single custom object in salesforce.com?
There does not seem to be a limit indicated in https://login.salesforce.com/help/doc/en/limits.htm
But of course, there has to be a limit of some kind. EG: Could 250 million records be stored in a single salesforce.com custom object?
As far as I'm aware the only limit is your data storage, you can see what you've used by going to Setup -> Administration Setup -> Data Management -> Storage Usage.
In one of the Orgs I work with I can see one object has almost 2GB of data for just under a million records, and this accounts for a little over a third of the storage available. Your storage space depends on your Salesforce Edition and number of users. See here for details.
I've seen the performance issue as well, though after about 1-2M records the performance hit appears magically to plateau, or at least it didn't appear to significantly slow down between 1M and 10M. I wonder if orgs are tier-tuned based on volume... :/
But regardless of this, there are other challenges which make it less than ideal for big data. Even though they've increased the SOQL governor limit to permit up to 50 million records to be retrieved in one call, you're still strapped with a 200,000 line execution limit in Apex and a 10K DML limit (per execution thread). These can be bypassed through Batch Apex, yet this has limitations as well. You can only execute 250K batches in 24 hours and only have 5 batches running at any given time.
So... the moral of the story seems to be that even if you managed to get a billion records into a custom object, you really can't do much with the data at that scale anyway. Therefore, it's effectively not the right tool for that job in its current state.
2-cents
LaceySnr is correct. However, there is an inverse relationship between the number of records for an object and performance. Any part of the system that filters on that object will be impacted, such as views, reports, SOQL queries, etc.
It's hard to talk specific numbers since salesforce has upwards of a dozen server clusters, each with their own performance characteristics. And there's probably a lot of dynamic performance management that occurs regularly. But, in the past I've seen performance issues start to creep in around 2M records. One possible remedy is you can ask salesforce to index fields that you plan to filter on.

Insert thousands of entities in a reasonable time into BigTable

I'm having some issues when I try to insert the 36k French cities into BigTable. I'm parsing a CSV file and putting every row into the datastore using this piece of code:
import csv
from databaseModel import *
from google.appengine.ext import db
from google.appengine.ext.db import GqlQuery

def add_cities():
    # Tab-separated file, one city per row.
    spamReader = csv.reader(open('datas/cities_utf8.txt', 'rb'), delimiter='\t', quotechar='|')
    mylist = []
    for i in spamReader:
        # One datastore query per row to resolve the city's region.
        region = GqlQuery("SELECT __key__ FROM Region WHERE code=:1", i[2].decode("utf-8"))
        mylist.append(InseeCity(region=region.get(), name=i[11].decode("utf-8"), name_f=strip_accents(i[11].decode("utf-8")).lower()))
    # Write all entities in a single batch call.
    db.put(mylist)
It takes around 5 minutes (!!!) on the local dev server, and even 10 minutes to delete them with the db.delete() function.
When I try it online by calling a test.py page containing add_cities(), the 30s timeout is reached.
I'm coming from the MySQL world and I think it's a real shame not to be able to add 36k entities in less than a second. I may be going about it the wrong way, so I'm referring to you:
Why is it so slow?
Is there any way to do it in a reasonable time?
Thanks :)
First off, it's the datastore, not Bigtable. The datastore uses Bigtable, but it adds a lot more on top of that.
The main reason this is going so slowly is that you're doing a query (on the 'Region' kind) for every record you add. This is inevitably going to slow things down substantially. There are two things you can do to speed things up:
Use the code of a Region as its key_name, allowing you to do a faster datastore get instead of a query. In fact, since you only need the region's key for the reference property, you needn't fetch the region at all in that case.
Cache the region list in memory, or skip storing it in the datastore at all. By its nature, I'm guessing the region list is both small and infrequently changing, so there may be no need to store it in the datastore in the first place.
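The caching option can be sketched in plain Python like this. `make_cached_lookup` is a made-up helper, and `lookup` stands for whatever resolves a region code today (the per-row query in the question). With the key_name option you would not need a lookup at all, since the key can be built directly from the code (if I remember the old db API correctly, via `db.Key.from_path('Region', code)`).

```python
def make_cached_lookup(lookup):
    """Wrap an expensive lookup so each distinct code is resolved once."""
    cache = {}

    def cached(code):
        if code not in cache:
            cache[code] = lookup(code)  # only hit the datastore on a miss
        return cache[code]

    return cached
```

With a small set of regions and 36k cities, this turns 36k queries into one query per distinct region code.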
In addition, you should use the mapreduce framework when loading large amounts of data to avoid timeouts. It has built-in support for reading CSVs from blobstore blobs, too.
Use the Task Queue. If you want your dataset to process quickly, have your upload handler create a task for each subset of 500 using an offset value.
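For the offsets, something along these lines would do. It is a sketch only: `task_ranges` is a hypothetical helper, and actually enqueueing one task per range is left to your upload handler.

```python
def task_ranges(total, size=500):
    """Return (offset, count) pairs covering `total` rows in chunks of `size`."""
    return [(offset, min(size, total - offset))
            for offset in range(0, total, size)]
```

Each task then reads `count` rows starting at `offset` and does a single batch put for them.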
FWIW we process large CSVs into the datastore using mapreduce, with some initial handling/validation inside a task. Even tasks have a limit (10 mins) at the moment, but that's probably fine for your data size.
Make sure that when you're doing inserts etc. you batch as much as possible: don't insert individual records, and the same goes for lookups; get_by_key_name allows you to pass in a list of key names. (I believe db.put has a limit of 200 records at the moment?)
Mapreduce might be overkill for what you're doing now, but it's definitely worth wrapping your head around, it's a must-have for larger data sets.
Lastly, timing of anything on the SDK is largely pointless - think of it as a debugger more than anything else!
