MongoDB: storing large text data - database

I'm currently storing 100-200Kb HTML documents "as is" in bson documents. I have ~300K records.
I don't need to run search in that data, just to retrieve full HTML content by document _id.
I have an imperession that some queries became a bit long. At least, when I run the "Documents count" operation, it takes about 15-20 seconds. At the same time, if I run the same operation on another collection of ~300K lightweight records, it's very fast.
I'm starting to think that I should probably redesign my database schema.
Should I store my HTML documents as files (probably using GridFS)?

Related

SOLR Document with 10000 Index fields

The use case I have is to allows users to build a custom report out of the 10000 fields in our database.
I have read various documentation
I'm not sure if SOLR could handle 10000 indexed field and 500K documents.
All how much time does it take to update a document which has 10K indexed field
Indexing 10K fields, how much of storage will it take? Given a document without index takes 50Kb.

MongoDb - Does it help performance if all insertMany documents write to a single shard?

I have a very large collection that is sharded on hashed(user_id).
I have a script that pulls some file containing thousands of JSON lines, each one being an individual document, and writes them to the collection via insertMany(ordered: false)
I'm dealing with very large amounts of data and still running into CPU usage problems and slower than desired write speeds.
I know that Mongo suggests to pre-split chunks, but only for an empty collection, and of course ours will be populated after the initial load.
Would it help boost performance if we pre-"bucketed" documents into a group keyed on user_id, and then executed an insertMany with all docs having the same user_id, since they would all go to the same shard, as opposed to mixed together? Or does Mongo still need to examine and "manually" balance each individual document in the insertMany even if they do all have the same user_id?

Best Practice to Combine both DB and Lucene Search

I am developing an advanced search engine using .Net where users can build their query based on several Fields:
Title
Content of the Document
Date From, Date To
From Modified Date, To modified Date
Owner
Location
Other Metadata
I am using lucene to index Document Content and their Corresponding IDs. However, the other metadata resides in MS SQL DB (to avoid enlarging the index, and keep updating the index on any modification of the metadata).
How I can Perform the Search?
when any user search for a term:
Narrow down the search results according to criteria selected by user by looking up in the SQL DB.
Return the matching IDs to the lucene searcher web service, which search for keyword entered in the DocumnentIDs returned From the Adv Search web service.
Then Get the relevant metadata for the Document ids (returned from lucence) by looking again in the DB.
AS you notice here, there is one lookup in DB, then Lucene, and Finally DB to get the values to be displayed in Grid.
Questions:
How can overcome this situation? I thought to begin searching lucene but this has a drawback if the Documents indexed reached 2 million. (i think narrowing down the results using the DB first have large effect on performance).
Another issue is passing IDs to lucene Search Service, how effective is passing hundred thousands of IDs? and what is the alternative solution?
I welcome any idea, so please share your thoughts.
Your current solution incurs the following overhead at query-time:
1) Narrowing search space via MS-SQL
Generating query in your app
Sending it over the wire to MS-SQL
Parsing/Optimizing/Execution of SQL query
[!!] I/O overhead of returning 100,000s of IDs
2) Executing bounded full-text search via Lucene.NET
[!!] Lucene memory overhead of generating/executing large BooleanQuery containing 100,000s of ID clauses in app (you'll need to first override the default limit of 1024 clauses to even measure this effect)
Standard Lucene full text search execution
Returning matching IDs
3) Materializing result details via MS-SQL
Fast, indexed, ID-based lookup of search result documents (only needed for the first page of displayed results usually about ~10-25 records)
There are two assumptions you may be making that would be worth reconsidering
A) Indexing all metadata (dates, author, location, etc...) will unacceptably increase the size of the index.
Try it out first: This is the best practice, and you'll massively reduce your query execution overhead by letting Lucene do all of the filtering for you in addition to text search.
Also, the size of your index has mostly to do with the cardinality of each field. For example, if you have only 500 unique owner names, then only those 500 strings will be stored, and each lucene document will internally reference their owner through a symbol-table lookup (4-byte integer * 2MM docs + 500 strings = < 8MB additional).
B) MS-SQL queries will be the quickest way to filter on non-text metadata.
Reconsider this: With your metadata properly indexed using the appropriate Lucene types, you won't incur any additional overhead querying Lucene vs query MS-SQL. (In some cases, Lucene may even be faster.)
Your mileage may vary, but in my experience, this type of filtered-full-text-search when executed on a Lucene collection of 2MM documents will typically run in well under 100ms.
So to summarize the best practice:
Index all of the data that you want to query or filter by. (No need to store source data since MS-SQL is your system-of-record).
Run filtered queries against Lucene (e.g. text AND date ranges, owner, location, etc...)
Return IDs
Materialize documents from MS-SQL using returned IDs.
I'd also recommend exploring a move to a standalone search server (Solr or Elasticsearch) for a number of reasons:
You won't have to worry about search-index memory requirements cannibalizing application memory requirements.
You'll take advantage of sophisticated filter caching performance boosts and OS-based I/O optimizations.
You'll be able to iterate upon your search solution easily from a mostly configuration-based environment that is widely used/supported.
You'll have tools in place to scale/tune/backup/restore search without impacting your application.

Query about Elasticsearch

I am writing a service that will be creating and managing user records. 100+ million of them.
For each new user, service will generate a unique user id and write it in database. Database is sharded based on unique user id that gets generated.
Each user record has several fields. Now one of the requirement is that the service be able to search if there exists a user with a matching field value. So those fields are declared as index in database schema.
However since database is sharded based on primary key ( unique user id ). I will need to search on all shards to find a user record that matches a particular column.
So to make that lookup fast. One thing i am thinking of doing is setting up an ElasticSearch cluster. Service will write to the ES cluster every time it creates a new user record. ES cluster will index the user record based on the relevant fields.
My question is :
-- What kind of performance can i expect from ES here ? Assuming i have 100+million user records where 5 columns of each user record need to be indexed. I know it depends on hardware config as well. But please assume a well tuned hardware.
-- Here i am trying to use ES as a memcache alternative that provides multiple keys. So i want all dataset to be in memory and does not need to be durable. Is ES right tool to do that ?
Any comment/recommendation based on experience with ElasticSearch for large dataset is very much appreciated.
ES is not explicitly designed to run completely in memory - you normally wouldn't want to do that with large unbounded datasets in a Java application (though you can using off-heap memory). Rather, it'll cache what it can and rely on the OS's disk cache for the rest.
100+ million records shouldn't be an issue at all even on a single machine. I run an index consisting 15 million records of ~100 small fields (no large text fields) amounting to 65Gb of data on disk on a single machine. Fairly complex queries that just return id/score execute in less than 500ms, queries that require loading the documents return in 1-1.5 seconds on a warmed up vm against a single SSD. I tend to given the JVM 12-16GB of memory - any more and I find it's just better to scale up via a cluster than a single huge vm.

Storage of many log files

I have a system which is receiving log files from different places through http (>10k producers, 10 logs per day, ~100 lines of text each).
I would like to store them to be able to compute misc. statistics over them nightly , export them (ordered by date of arrival or first line content) ...
My question is : what's the best way to store them ?
Flat text files (with proper locking), one file per uploaded file, one directory per day/producer
Flat text files, one (big) file per day for all producers (problem here will be indexing and locking)
Database Table with text (MySQL is preferred for internal reasons) (pb with DB purge as delete can be very long !)
Database Table with one record per line of text
Database with sharding (one table per day), allowing simple data purge. (this is partitioning. However the version of mysql I have access to (ie supported internally) does not support it)
Document based DB à la couchdb or mongodb (problem could be with indexing / maturity / speed of ingestion)
Any advice ?
(Disclaimer: I work on MongoDB.)
I think MongoDB is the best solution for logging. It is blazingly fast, as in, it can probably insert data faster than you can send it. You can do interesting queries on the data (e.g., ranges of dates or log levels) and index and field or combination of fields. It's also nice because you can randomly add more fields to logs ("oops, we want a stack trace field for some of these") and it won't cause problems (as it would with flat text files).
As far as stability goes, a lot of people are already using MongoDB in production (see http://www.mongodb.org/display/DOCS/Production+Deployments). We just have a few more features we want to add before we go to 1.0.
I'd pick the very first solution.
I don't see why would you need DB at all. Seems like all you need is to scan through the data. Keep the logs in the most "raw" state, then process it and then create a tarball for each day.
The only reason to aggregate would be to reduce the number of files. On some file systems, if you put more than N files in a directory, the performance decreases rapidly. Check your filesystem and if it's the case, organize a simple 2-level hierarchy, say, using the first 2 digits of producer ID as the first level directory name.
I would write one file per upload, and one directory/day as you first suggested. At the end of the day, run your processing over the files, and then tar.bz2 the directory.
The tarball will still be searchable, and will likely be quite small as logs can usually compress quite well.
For total data, you are talking about 1GB [corrected 10MB] a day uncompressed. This will likely compress to 100MB or less. I've seen 200x compression on my log files with bzip2. You could easily store the compressed data on a file system for years without any worries. For additional processing you can write scripts which can search the compressed tarball and generate more stats.
Since you would like to store them to be able to compute misc. statistics over them nightly , export them (ordered by date of arrival or first line content) ... You're expecting 100,000 files a day, at a total of 10,000,000 lines:
I'd suggest:
Store all the files as regular textfiles using the following format : yyyymmdd/producerid/fileno.
At the end of the day, clear the database, and load all the textfiles for the day.
After loading the files, it would be easy to get the stats from the database, and post them in any format needed. (maybe even another "stats" database). You could also generate graphs.
To save space ,you could compress the daily folder. Since they're textfiles, they would compress well.
So you would only be using the database to be able to easily aggregate the data. You could also reproduce the reports for an older day if the process didn't work, by going through the same steps.
To my experience, single large table performs much faster then several linked tables if we talk about database solution. Particularly on write and delete operations. For example, splitting one table into three linked tables decreases performance 3-5 times. This is very rough, of course it depends on details, but generally this is the risk. It gets worse when data volumes get very large. Best way, IMO, to store log data is not in a flat text, but rather in a structured form, so that you can do efficient queries and formatting later. Managing log files could be pain, especially when there are lots of them and coming from many sources and locations. Check out our solution, IMO it can save you lots of development time.

Resources