I am developing an advanced search engine using .NET where users can build their query based on several fields:
Title
Content of the Document
Date From, Date To
From Modified Date, To modified Date
Owner
Location
Other Metadata
I am using Lucene to index document content and the corresponding IDs. However, the other metadata resides in an MS SQL DB (to avoid enlarging the index and to avoid updating the index whenever the metadata is modified).
How can I perform the search?
When a user searches for a term:
Narrow down the search results according to the criteria selected by the user by looking them up in the SQL DB.
Return the matching IDs to the Lucene searcher web service, which searches for the entered keyword within the document IDs returned from the advanced search web service.
Then get the relevant metadata for the document IDs returned from Lucene by looking them up again in the DB.
As you can see, there is one lookup in the DB, then Lucene, and finally the DB again to get the values to display in the grid.
Questions:
How can I overcome this situation? I thought about searching Lucene first, but that has a drawback once the number of indexed documents reaches 2 million. (I think narrowing down the results using the DB first has a large effect on performance.)
Another issue is passing IDs to the Lucene search service: how efficient is passing hundreds of thousands of IDs, and what is the alternative?
I welcome any ideas, so please share your thoughts.
Your current solution incurs the following overhead at query-time:
1) Narrowing search space via MS-SQL
Generating query in your app
Sending it over the wire to MS-SQL
Parsing/Optimizing/Execution of SQL query
[!!] I/O overhead of returning 100,000s of IDs
2) Executing bounded full-text search via Lucene.NET
[!!] Lucene memory overhead of generating/executing a large BooleanQuery containing 100,000s of ID clauses in the app (you'll need to first override the default limit of 1024 clauses to even measure this effect); see the illustration after this list
Standard Lucene full text search execution
Returning matching IDs
3) Materializing result details via MS-SQL
Fast, indexed, ID-based lookup of search result documents (only needed for the first page of displayed results, usually about 10-25 records)
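To make the [!!] item in step 2 concrete, here is a rough illustration of the pattern it describes, not a recommendation; it is sketched against the Java Lucene API, the method name, the "id" field, and the clause limit are placeholders, and BooleanQuery.setMaxClauseCount is the static override in older Lucene versions (newer releases moved the setting to IndexSearcher).

import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Illustration only: the costly pattern of step 2, one clause per SQL-filtered ID.
static Query buildIdRestrictedQuery(Query keywordQuery, List<String> idsFromSql) {
    BooleanQuery.setMaxClauseCount(500_000);              // default is 1024; static setter in older Lucene
    BooleanQuery.Builder ids = new BooleanQuery.Builder();
    for (String id : idsFromSql) {                        // hundreds of thousands of IDs from step 1
        ids.add(new TermQuery(new Term("id", id)), BooleanClause.Occur.SHOULD);
    }
    BooleanQuery.Builder full = new BooleanQuery.Builder();
    full.add(keywordQuery, BooleanClause.Occur.MUST);
    full.add(ids.build(), BooleanClause.Occur.FILTER);    // enormous filter clause: expensive to build and run
    return full.build();
}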
There are two assumptions you may be making that are worth reconsidering:
A) Indexing all metadata (dates, author, location, etc...) will unacceptably increase the size of the index.
Try it out first: This is the best practice, and you'll massively reduce your query execution overhead by letting Lucene do all of the filtering for you in addition to text search.
Also, the size of your index mostly has to do with the cardinality of each field. For example, if you have only 500 unique owner names, then only those 500 strings will be stored, and each Lucene document will internally reference its owner through a symbol-table lookup (a 4-byte integer * 2MM docs + 500 strings = less than 8 MB additional); a small indexing sketch follows the B) discussion below.
B) MS-SQL queries will be the quickest way to filter on non-text metadata.
Reconsider this: with your metadata properly indexed using the appropriate Lucene types, you won't incur any additional overhead querying Lucene vs. querying MS-SQL. (In some cases, Lucene may even be faster.)
Your mileage may vary, but in my experience, this type of filtered-full-text-search when executed on a Lucene collection of 2MM documents will typically run in well under 100ms.
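If it helps, here is a minimal sketch of what indexing the metadata could look like, written against the Lucene 6+ Java field classes (Lucene.NET exposes close equivalents, with names that vary by version); the field names follow the question and the method signature is illustrative.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

// Sketch: index the metadata alongside the content so Lucene can do the filtering.
void indexDocument(IndexWriter writer, String sqlId, String title, String content,
                   String owner, String location, long modifiedEpochMillis) throws Exception {
    Document doc = new Document();
    doc.add(new StringField("id", sqlId, Field.Store.YES));        // the SQL row ID, stored for materialization
    doc.add(new TextField("title", title, Field.Store.NO));        // analyzed full text
    doc.add(new TextField("content", content, Field.Store.NO));
    doc.add(new StringField("owner", owner, Field.Store.NO));      // exact-match metadata, not analyzed
    doc.add(new StringField("location", location, Field.Store.NO));
    doc.add(new LongPoint("modifiedDate", modifiedEpochMillis));   // numeric field for range filters
    writer.addDocument(doc);
}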
So to summarize the best practice:
Index all of the data that you want to query or filter by. (No need to store source data since MS-SQL is your system-of-record).
Run filtered queries against Lucene (e.g. text AND date ranges, owner, location, etc.); see the sketch after this list
Return IDs
Materialize documents from MS-SQL using returned IDs.
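And a matching sketch of the filtered query from the list above, again assuming the Lucene 6+ Java API and the field names from the indexing sketch; the page size of 25 and the date bounds are placeholders.

import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Sketch: one query does the text search and the metadata filtering inside Lucene,
// so only the page of IDs you actually display goes back to MS-SQL.
TopDocs search(IndexSearcher searcher, Query keywordQuery,
               String owner, long modifiedFrom, long modifiedTo) throws Exception {
    BooleanQuery.Builder b = new BooleanQuery.Builder();
    b.add(keywordQuery, BooleanClause.Occur.MUST);                                 // parsed keyword query, affects scoring
    b.add(new TermQuery(new Term("owner", owner)), BooleanClause.Occur.FILTER);    // metadata filters, no scoring
    b.add(LongPoint.newRangeQuery("modifiedDate", modifiedFrom, modifiedTo),
          BooleanClause.Occur.FILTER);
    return searcher.search(b.build(), 25);  // only fetch enough hits for one results page
}

The FILTER clauses play the role of the MS-SQL WHERE conditions but are evaluated inside Lucene, so only the stored IDs of the displayed page need to be materialized from the database.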
I'd also recommend exploring a move to a standalone search server (Solr or Elasticsearch) for a number of reasons:
You won't have to worry about search-index memory requirements cannibalizing application memory requirements.
You'll take advantage of sophisticated filter caching performance boosts and OS-based I/O optimizations.
You'll be able to iterate upon your search solution easily from a mostly configuration-based environment that is widely used/supported.
You'll have tools in place to scale/tune/backup/restore search without impacting your application.
We have Solr storing 3 billion records on 23 machines, each machine has 4 shards, and only 230 million documents have a field like aliasName. Currently queryCache, documentCache, and filterCache are disabled.
Problem: a query like (q=aliasName:[* TO *] AND firstname:ash AND lastName:Coburn) returns the matching documents in 4.3 seconds. Basically we want only the matched firstname and lastName records where aliasName is not empty.
I am thinking of enabling the filter query fq=aliasName:[* TO *], but I am not sure it will make things faster, since firstname and lastName are mostly different in each query. How much memory should we allocate for the filter cache? It should not impact the other existing queries like q=firstname:ash AND lastName:something.
Please don't worry about I/O operations, as we are using flash drives.
I would really appreciate a reply if you have worked on a similar issue and can suggest the best solution.
According to the Solr documentation...
filterCache
This cache stores unordered sets of document IDs that match the key (usually queries)
URL: https://wiki.apache.org/solr/SolrCaching#filterCache
So I think it comes down to two things:
What is the percentage of documents that have aliasName populated? In my opinion, if most documents have this field populated, then the filter cache might be useless. But if it is only a small percentage of documents, the filter cache will have a huge performance impact and use less memory.
What kind of ID are you using? I assume that the documentation refers to Lucene document IDs rather than Solr IDs, but maybe smaller Solr IDs could result in a smaller cache size as well (I am not sure).
In the end you will have to run a trial and see how it goes; maybe try it on a couple of nodes first and see if there is a performance improvement.
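For what it's worth, a minimal sketch of that trial using SolrJ 6+ (older versions use HttpSolrServer instead of HttpSolrClient); the URL, collection name, and method name are placeholders, and the fq clause is the part whose document set can be reused by the filterCache across queries.

import java.io.IOException;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

// Sketch: move the reusable "aliasName is not empty" restriction into fq so its
// document-ID set can be cached by filterCache, while q keeps the per-user terms.
static long countNonEmptyAlias(String firstName, String lastName) throws SolrServerException, IOException {
    SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build(); // placeholder URL
    SolrQuery query = new SolrQuery();
    query.setQuery("firstname:" + firstName + " AND lastName:" + lastName); // changes on every request
    query.addFilterQuery("aliasName:[* TO *]");                             // identical across requests, cache-friendly
    QueryResponse response = client.query(query);
    return response.getResults().getNumFound();
}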
I have a table with around 100,000 rows at the moment. I want to index the data in this table in a Solr Index.
So the naive method would be to:
Get all the rows
For each row: convert to a SolrDocument and add each document to a request
Once all rows are converted then post the request
Some problems with this approach that I can think of are:
Loading too much data (the content of the whole table) into memory
POSTing a big request
However, some advantages:
Only one request to the Database
Only one POST request to Solr
I can see the approach is not scalable, since as the table grows so will the memory requirements and the size of the POST request. Perhaps I need to take n rows at a time, process them, then take the next n?
I'm wondering if any one has any advice about how to best implement this?
(ps. I did search the site but I didn't find any questions that were similar to this.)
Thanks.
If you want to balance between POSTing all documents at once and doing one POST per document, you could use a queue to collect documents and run a separate thread that sends them once you have collected enough. This way you can manage the memory vs. request time problem.
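A rough sketch of that idea with plain batching over SolrJ and JDBC, assuming a placeholder table and a batch size of 500; note that SolrJ's ConcurrentUpdateSolrClient (ConcurrentUpdateSolrServer in older versions) already implements the queue-plus-background-thread approach for you.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

// Sketch: stream rows from the DB and flush to Solr every BATCH_SIZE documents,
// so neither the whole table nor one giant POST is held in memory.
void indexTable(Connection db, SolrClient solr) throws Exception {
    final int BATCH_SIZE = 500;                       // tune to your document size
    List<SolrInputDocument> batch = new ArrayList<>();
    try (Statement st = db.createStatement();
         ResultSet rs = st.executeQuery("SELECT id, title, body FROM my_table")) { // placeholder query
        while (rs.next()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", rs.getString("id"));
            doc.addField("title", rs.getString("title"));
            doc.addField("body", rs.getString("body"));
            batch.add(doc);
            if (batch.size() >= BATCH_SIZE) {
                solr.add(batch);                      // one POST per batch
                batch.clear();
            }
        }
    }
    if (!batch.isEmpty()) solr.add(batch);
    solr.commit();                                    // single commit at the end
}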
I used the suggestion from nikhil500:
DIH does support many transformers. You can also write custom transformers. I will recommend using DIH if possible - I think it will need the least amount of coding and will be faster than POSTing the documents. – nikhil500 Feb 6 at 17:42
I once had to upload ~3000 rows (each with 5 fields) from a DB to Solr. I uploaded each document separately and did a single commit at the end. The entire operation took only a few seconds, but some uploads (8 of 3000) failed.
What worked perfectly was uploading in batches of 50 before committing. 50 may have been very low. There are recommended limits on how many documents you can upload before doing a commit; it depends on the size of the documents.
But then, this is a one-off operation, which you can supervise with a hacked script. Would a subsequent operation make you index 100,000 rows at once? Or can you get away with indexing only a few hundred updated documents per operation?
I have been an Apache Solr user for about a year. I have used Solr for simple search tools, but now I want to use Solr with 5 TB of data. I assume that the 5 TB of data will become 7 TB once Solr indexes it, according to the filters I use. I will then add nearly 50 MB of data per hour to the same index.
1- Are there any problems using a single Solr server with 5 TB of data (without shards)?
a- Can the Solr server answer queries in an acceptable time?
b- What is the expected time for committing 50 MB of data to a 7 TB index?
c- Is there an upper limit on index size?
2- What suggestions do you have?
a- How many shards should I use?
b- Should I use Solr cores?
c- What commit frequency would you suggest? (Is 1 hour OK?)
3- Are there any test results for this kind of large data?
There is no 5 TB data set available yet; I just want to estimate what the result will be.
Note: You can assume that hardware resources are not a problem.
If your sizes are for text, rather than binary files (whose extracted text is usually much smaller), then I don't think you can expect to do this on a single machine.
This sounds a lot like Loggly, and they use SolrCloud to handle that amount of data.
OK, if they are all rich documents, then the total text size to index will be much smaller (for me it is about 7% of the starting size). Anyway, even with that decreased amount, I think you still have too much data for a single instance.
I am creating a system which polls devices for data on varying metrics such as CPU utilisation, disk utilisation, temperature etc. at (probably) 5 minute intervals using SNMP. The ultimate goal is to provide visualisations to a user of the system in the form of time-series graphs.
I have looked at using RRDTool in the past, but rejected it as storing the captured data indefinitely is important to my project, and I want higher level and more flexible access to the captured data. So my question is really:
Which is better with regard to performance when querying data for graphing: a relational database (such as MySQL or PostgreSQL) or a non-relational/NoSQL database (such as MongoDB or Redis)?
Relational
Given a relational database, I would use a data_instances table, in which would be stored every instance of data captured for every metric being measured for all devices, with the following fields:
Fields: id fk_to_device fk_to_metric metric_value timestamp
When I want to draw a graph for a particular metric on a particular device, I must query this singular table filtering out the other devices, and the other metrics being analysed for this device:
SELECT metric_value, timestamp FROM data_instances
WHERE fk_to_device=1 AND fk_to_metric=2
The number of rows in this table would be:
d * m_d * f * t
where d is the number of devices, m_d is the number of metrics recorded per device, f is the frequency at which data is polled, and t is the total amount of time the system has been collecting data.
For a user recording 10 metrics for 3 devices every 5 minutes for a year, we would have roughly 3.2 million records (3 devices × 10 metrics × 288 samples/day × 365 days).
Indexes
Without indexes on fk_to_device and fk_to_metric scanning this continuously expanding table would take too much time. So indexing the aforementioned fields and also timestamp (for creating graphs with localised periods) is a requirement.
Non-Relational (NoSQL)
MongoDB has the concept of a collection; unlike tables, collections can be created programmatically without setup. With these I could partition the storage of data per device, or even per metric recorded for each device.
I have no experience with NoSQL and do not know whether such databases provide any query-performance-enhancing features such as indexing; however, the previous paragraph proposes doing most of the traditional relational query work through the structure by which the data is stored under NoSQL.
Undecided
Would a relational solution with correct indexing reduce to a crawl within the year? Or does the collection based structure of NoSQL approaches (which matches my mental model of the stored data) provide a noticeable benefit?
Definitely Relational. Unlimited flexibility and expansion.
Two corrections, both in concept and application, followed by an elevation.
Correction
It is not "filtering out the un-needed data"; it is selecting only the needed data. Yes, of course, if you have an Index to support the columns identified in the WHERE clause, it is very fast, and the query does not depend on the size of the table (grabbing 1,000 rows from a 16 billion row table is instantaneous).
Your table has one serious impediment. Given your description, the actual PK is (Device, Metric, DateTime). (Please don't call it TimeStamp, that means something else, but that is a minor issue.) The uniqueness of the row is identified by:
(Device, Metric, DateTime)
The Id column does nothing, it is totally and completely redundant.
An Id column is never a Key (duplicate rows, which are prohibited in a Relational database, must be prevented by other means).
The Id column requires an additional Index, which obviously impedes the speed of INSERT/DELETE, and adds to the disk space used.
You can get rid of it. Please.
Elevation
Now that you have removed the impediment, you may not have recognised it, but your table is in Sixth Normal Form. Very high speed, with just one Index on the PK. For understanding, read this answer from the What is Sixth Normal Form ? heading onwards.
(I have one index only, not three; on the Non-SQLs you may need three indices).
I have the exact same table (without the Id "key", of course). I have an additional column Server. I support multiple customers remotely.
(Server, Device, Metric, DateTime)
The table can be used to Pivot the data (ie. Devices across the top and Metrics down the side, or pivoted) using exactly the same SQL code (yes, switch the cells). I use the table to erect an unlimited variety of graphs and charts for customers re their server performance.
Monitor Statistics Data Model.
(Too large for inline; some browsers cannot load inline; click the link. Also, that is the obsolete demo version; for obvious reasons, I cannot show you the commercial product DM.)
It allows me to produce Charts Like This, six keystrokes after receiving a raw monitoring stats file from the customer, using a single SELECT command. Notice the mix-and-match; OS and server on the same chart; a variety of Pivots. Of course, there is no limit to the number of stats matrices, and thus the charts. (Used with the customer's kind permission.)
Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the IDEF1X Notation helpful.
One More Thing
Last but not least, SQL is an IEC/ISO/ANSI Standard. The freeware is actually Non-SQL; it is fraudulent to use the term SQL if they do not provide the Standard. They may provide "extras", but they are absent the basics.
I found the above answers very interesting.
Let me try to add a couple more considerations here.
1) Data aging
Time-series management usually needs aging policies. A typical scenario (e.g. monitoring server CPU) requires storing:
1-sec raw samples for a short period (e.g. for 24 hours)
5-min detail aggregate samples for a medium period (e.g. 1 week)
1-hour detail over that (e.g. up to 1 year)
Although relational models can certainly manage this appropriately (my company implemented massive centralized databases for some large customers with tens of thousands of data series), the new breed of data stores adds interesting functionality worth exploring, like:
automated data purging (see Redis' EXPIRE command and the sketch after this list)
multidimensional aggregations (e.g. map-reduce jobs a-la-Splunk)
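As a small illustration of the purging point above, here is a sketch using the Jedis client; the key scheme and the 24-hour retention are made-up values.

import redis.clients.jedis.Jedis;

// Sketch: each raw sample gets its own key with a TTL, so Redis purges old
// samples automatically; the device:metric:timestamp key scheme is illustrative.
static void storeRawSample(String device, String metric, long timestampMillis, double value) {
    try (Jedis redis = new Jedis("localhost", 6379)) {
        String key = device + ":" + metric + ":" + timestampMillis;
        redis.setex(key, 24 * 60 * 60, Double.toString(value));   // keep 1-second raw samples for 24 hours
    }
}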
2) Real-time collection
Even more importantly, some non-relational data stores are inherently distributed and allow for much more efficient real-time (or near-real-time) data collection, which can be a problem with an RDBMS because of the creation of hotspots (managing indexing while inserting into a single table). In the RDBMS space this problem is typically solved by reverting to batch import procedures (we managed it this way in the past), while NoSQL technologies have succeeded at massive real-time collection and aggregation (see Splunk for example, mentioned in previous replies).
Your data is in a single table, so relational vs. non-relational is not really the question. Basically you need to read a lot of sequential data. If you have enough RAM to store a year's worth of data, then there is nothing like using Redis/MongoDB etc.
Mostly, NoSQL databases will store your data in the same location on disk and in compressed form to avoid multiple disk accesses.
NoSQL does the same thing as creating an index on device ID and metric ID, but in its own way. With a relational database, even if you do this, the index and the data may be in different places, and there would be a lot of disk I/O.
Tools like Splunk use NoSQL backends to store time-series data and then use map-reduce to create aggregates (which might be what you want later). So in my opinion NoSQL is an option, as people have already tried it for similar use cases. But will a few million rows bring the database to a crawl? Maybe not, with decent hardware and proper configuration.
Create a file and name it 1_2.data. Weird idea? Here is what you get:
You save up to 50% of space because you don't need to repeat the fk_to_device and fk_to_metric value for every data point.
You save even more space because you don't need any indices.
Save pairs of (timestamp, metric_value) to the file by appending the data, so you get ordering by timestamp for free (assuming your sources don't send out-of-order data for a device).
=> Queries by timestamp run amazingly fast because you can use binary search to find the right place in the file to read from.
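A minimal Java sketch of that lookup, assuming each appended record is a fixed 16 bytes (a long epoch-millis timestamp followed by a double value):

import java.io.RandomAccessFile;

// Sketch: binary search over fixed-width records in a per-series file such as
// "1_2.data"; assumes 16-byte records (long timestamp, then double value)
// appended in timestamp order.
static long offsetOfFirstSampleAtOrAfter(RandomAccessFile file, long timestamp) throws Exception {
    final int RECORD_SIZE = 16;
    long lo = 0, hi = file.length() / RECORD_SIZE;       // search over record indexes
    while (lo < hi) {
        long mid = (lo + hi) >>> 1;
        file.seek(mid * RECORD_SIZE);
        if (file.readLong() < timestamp) lo = mid + 1;   // timestamp is the first field of each record
        else hi = mid;
    }
    return lo * RECORD_SIZE;                             // byte offset to start sequential reads from
}

From the returned offset you simply read records sequentially until you pass the end of the requested time range.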
If you want it even more optimized, start thinking about splitting your files like this:
1_2_january2014.data
1_2_february2014.data
1_2_march2014.data
Or use kdb+ from http://kx.com, because they do all this for you :) Column-oriented storage is what may help you.
There is a cloud-based column-oriented solution popping up, so you may want to have a look at: http://timeseries.guru
You should look into a time-series database. It was created for this purpose.
A time series database (TSDB) is a software system that is optimized for handling time series data, arrays of numbers indexed by time (a datetime or a datetime range).
A popular example of a time-series database is InfluxDB.
I think the answer to this kind of question should mainly revolve around the way your database utilizes storage.
Some database servers use RAM and disk, some use RAM only (optionally with disk for persistence), etc.
Most common SQL database solutions use memory plus disk storage and write the data in a row-based layout (every inserted row is written to the same physical location).
For time-series stores, the workload in most cases is something like: massive amounts of inserts at a relatively low interval, while reads are column-based (in most cases you want to read a range of data from a specific column, representing a metric).
I have found that columnar databases (google it; you'll find MonetDB, InfoBright, ParAccel, etc.) do a terrific job for time series.
As for your question, which I personally think is somewhat invalid (as are all discussions using the faulty term NoSQL, IMO):
You can use a database server that talks SQL on one hand, making your life very easy since everyone has known SQL for many years and the language has been refined over and over again for data queries, but that still utilizes RAM, CPU cache, and disk in a column-oriented way, making your solution a best fit for time series.
A few million rows is nothing for today's torrential data. Expect data to be in the TBs or PBs in just a few months. At that point an RDBMS does not scale to the task, and we need the linear scalability of NoSQL databases. Performance is achieved through the columnar partitioning used to store the data: a more-columns, fewer-rows kind of concept to boost performance. Leverage the OpenTSDB work done on top of HBase or MapR-DB, etc.
I face similar requirements regularly, and have recently started using Zabbix to gather and store this type of data. Zabbix has its own graphing capability, but it's easy enough to extract the data out of Zabbix's database and process it however you like. If you haven't already checked Zabbix out, you might find it worth your time to do so.