I am currently thinking about how best to store web-crawling results in a database. In another question, document-oriented databases were recommended for a web-crawler project: Database for web crawler in python?
Now I am wondering if map/reduce is the right way to do such classification and value generation. At least it seems to be able to do this (map alone for classification, e.g. by year or author, and map/reduce for calculating numerical values, for which I cannot think of an example at the moment).
However, would map-reduce / DocumentStores also be able to give me the right documents for a given word? In a relational database I would have to use a JOIN on some tables and then get documents containing these words:
SELECT * FROM docs d
JOIN doc_words dw ON dw.doc_id = d.id
JOIN words w ON dw.word_id = w.id
WHERE w.word = 'foo'
I guess DocumentStores are not capable of such an operation, as they do not support full-text indexes and are not intended to have many references / relations.
Would the better alternative be mixing several systems? E.g. one for searching by words, and one for searching by other values if present (like year of publication, author, …)? I think DocumentStores are not so bad for storing the metadata, as sometimes there are specific values and sometimes not (and DocumentStores are easy to spread across multiple servers if wanted, as soon as there are too many documents for one server). Yet I am not sure what the best way would be to implement searching over a collection of documents (including web pages, PDFs and images, which always have different metadata but often also need a full-text index).
To make the question clear: Should I use another database system together with DocumentStores, use DocumentStores alone (how to search for words quickly?) or use another DB system alone?
PS: Another example for such a problem would be the linking between webpages, which cannot be saved in DocumentStores well either. However, OrientDB might solve this problem as it seems to combine graph database and document-oriented database.
Check out RavenDB. It is a document DB with Map/Reduce queries, using Lucene under the hood, so full-text search is fully supported, also within Map/Reduce queries.
Custom Lucene analyzers are supported as well, so there's a lot of room for further full-text extensions.
Other features like Includes and Live Projections may give you everything else that simple Map/Reduce is missing.
See MarkLogic - which was designed specifically for searching documents. http://developer.marklogic.com/products/marklogic-server/which-nosql
Related
I am currently working with Java Spring and Postgres.
I have a query on a table; many filters can be applied to the query, and each filter needs many joins.
This query is very slow, both because of the number of joins that must be performed and because there are many elements in the table.
Foreign keys and indexes are correctly created.
I know one approach could be to keep duplicate information to avoid doing the joins. By this I mean creating a new table called infoSearch and keeping it updated via triggers. At the time of the query, perform search operations on said table. This way I would do just one join.
But I have some doubts:
What is the best approach in Postgres to store a list of items flat?
I know there is a JSON datatype. Could I use it to hold the information needed for the search and query it with jsonpath? Is this performant with lists?
I also greatly appreciate any advice on another approach that can be used to fix this.
Is there any software that can be used to make this more efficient?
I'm wondering if it wouldn't be more performant to move to another style of database, like graph based. At this point the only problem I have is with this specific table, the rest of the problem is simple queries that adapt very well to relational bases.
Is there any scaling guideline, based on ratios and number of items, for which type of database to choose?
Denormalization is a tried and true way to speed up queries/reports/searching processes for relational databases. It uses a standard time vs space tradeoff to reduce the time of query, at the cost of duplicating the data and increasing write/insert time.
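As a minimal sketch of the trigger-maintained search table mentioned in the question, here is the idea using Python's built-in sqlite3 so it runs without a server. The table and column names (item, tag, item_tag, info_search) are hypothetical, and in Postgres the triggers would be written in PL/pgSQL instead, but the time vs space tradeoff is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tag  (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE item_tag (item_id INTEGER, tag_id INTEGER);

-- Denormalized copy: one row per (item, tag label), so filtering needs no joins.
CREATE TABLE info_search (item_id INTEGER, item_name TEXT, tag_label TEXT);
CREATE INDEX info_search_tag ON info_search (tag_label);

-- Keep the copy in sync on writes (this is the extra write cost).
CREATE TRIGGER item_tag_ai AFTER INSERT ON item_tag
BEGIN
    INSERT INTO info_search (item_id, item_name, tag_label)
    SELECT i.id, i.name, t.label
    FROM item i, tag t
    WHERE i.id = NEW.item_id AND t.id = NEW.tag_id;
END;

CREATE TRIGGER item_tag_ad AFTER DELETE ON item_tag
BEGIN
    DELETE FROM info_search
    WHERE item_id = OLD.item_id
      AND tag_label = (SELECT label FROM tag WHERE id = OLD.tag_id);
END;
""")

conn.executemany("INSERT INTO item VALUES (?, ?)", [(1, "laptop"), (2, "phone")])
conn.executemany("INSERT INTO tag VALUES (?, ?)", [(1, "electronics"), (2, "portable")])
conn.executemany("INSERT INTO item_tag VALUES (?, ?)", [(1, 1), (1, 2), (2, 1)])


def search_by_tag(label):
    """Filter on the flat table only; the joins were paid for at write time."""
    rows = conn.execute(
        "SELECT item_name FROM info_search WHERE tag_label = ? ORDER BY item_name",
        (label,),
    )
    return [name for (name,) in rows]
```

A real schema would also need triggers for renamed items and tags, which is where most of the maintenance cost of this approach tends to show up.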
There are third party tools that are specifically designed for this use-case, including search tools (like ElasticSearch, Solr, etc) and other document-centric databases. Graph databases are probably not useful in this context. They are focused on traversing relationships, not broad searches.
I'm currently trying to figure out if Solr is the right tool for me. I have the following setup:
There is the primary document type "blog". Then there are two additional document types "user" and "category". Both of these are parents of the "blog" document type.
Now when searching the "blog" documents, I not only want to search in those fields (e.g. title and content), but also in the parent fields (user>name and category>name).
Of course, I could just flatten that down to a single document for Solr, which would ease the search a lot. The downside to this is though, that when e.g. a user updates their name, I have to run through all blog posts of them and update the documents for that in Solr, instead of just updating a single document.
This becomes even worse when the user has another parent, on which I need to search as well.
Do you have any recommendations on how to handle this use case? Maybe my Google-fu is just not good enough, but what I found (block joins, etc.) doesn't seem to do the trick.
The most performant and easiest solution by far would be to flatten everything into a single document. It turns out that these relations aren't updated as often as people think, and that searches are performed far more often than documents are updated. And even if one of the values shared across a large set of documents changes, reindexing from the most recent documents (for a blog) and then going backwards will appear rather performant to most users. This assumes that you actually have to search on the values and don't just need to display them; in the latter case you could look them up from secondary storage when displaying an item (and just store the never-changing id in the document).
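A rough sketch of the flattening idea, in plain Python so it does not depend on any Solr client. The field names and the index_document callback are hypothetical stand-ins for whatever sends documents to the index; the point is only that parent fields are copied into each blog document and that a rename reindexes the newest posts first.

```python
from datetime import date

users = {1: {"name": "alice"}}
categories = {10: {"name": "databases"}}
posts = [
    {"id": 100, "user_id": 1, "category_id": 10,
     "title": "Old post", "published": date(2019, 1, 5)},
    {"id": 101, "user_id": 1, "category_id": 10,
     "title": "New post", "published": date(2023, 6, 1)},
]


def flatten(post):
    """Build one flat search document; parent fields are copied into it."""
    return {
        "id": post["id"],
        "title": post["title"],
        "user_id": post["user_id"],  # stable id, usable for secondary lookups
        "user_name": users[post["user_id"]]["name"],
        "category_name": categories[post["category_id"]]["name"],
    }


def reindex_for_user(user_id, index_document):
    """Re-send a user's posts to the index, most recent first."""
    mine = [p for p in posts if p["user_id"] == user_id]
    for post in sorted(mine, key=lambda p: p["published"], reverse=True):
        index_document(flatten(post))


index = {}   # hypothetical stand-in for the search index
order = []   # records the order in which documents were (re)indexed


def index_document(doc):
    index[doc["id"]] = doc
    order.append(doc["id"])


# The user changes their name: every post of theirs is rebuilt, newest first.
users[1]["name"] = "alice-renamed"
reindex_for_user(1, index_document)
```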
Another option is to treat this as a multi-search problem: one collection for blog posts, one for users and one for categories. You then search each collection for the relevant data and merge the results in your search model. You can also use Streaming Expressions to hand off most of this processing to a Solr cluster.
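To illustrate the multi-search variant, here is a small sketch where three in-memory lists stand in for the three collections. All names and the naive substring matching are assumptions for the example; in practice each call to matching_ids would be a query against its own collection.

```python
blog_posts = [
    {"id": 1, "title": "Tuning Solr", "content": "Notes on caches",
     "user_id": 1, "category_id": 1},
    {"id": 2, "title": "Holiday photos", "content": "Pictures from Oslo",
     "user_id": 2, "category_id": 2},
]
authors = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
topics = [{"id": 1, "name": "Search"}, {"id": 2, "name": "Travel"}]


def matching_ids(collection, fields, term):
    """Ids of documents in one collection whose fields contain the term."""
    term = term.lower()
    return {doc["id"] for doc in collection
            if any(term in doc[field].lower() for field in fields)}


def search_posts(term):
    """Search each collection separately, then merge into blog post ids."""
    direct = matching_ids(blog_posts, ("title", "content"), term)
    author_ids = matching_ids(authors, ("name",), term)
    topic_ids = matching_ids(topics, ("name",), term)
    via_parent = {post["id"] for post in blog_posts
                  if post["user_id"] in author_ids
                  or post["category_id"] in topic_ids}
    return sorted(direct | via_parent)
```

With this layout a renamed user only touches one document, at the price of several queries and a merge step per search.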
The reason I always recommend flattening if possible is that most features in Solr (and Lucene) are written for a flat document structure, so a flat structure lets you fully leverage them. Since Lucene is by design a flat document store, most other features require special care to support block joins and parent/child relationships, and you end up experimenting a lot to get the queries and feature set you want (if that is possible at all). If the documents are flat, it just works.
I want to know if there is any way to perform wildcard searches in a Cassandra database.
e.g.
select KEY,username,password from User where username='*hello*';
Or
select KEY,username,password from User where username='%hello%';
something like this.
There is no native way to perform such queries in Cassandra. Typical options to achieve the same are
a) Maintain an index yourself on likely search terms. For example, whenever you are inserting an entry which has hello in the username, insert an entry in the index column family with hello as the key and the column value as the key of your data entry. While querying, query the index CF and then fetch data from your data CF. Of course, this is pretty restrictive in nature but can be useful for some basic needs.
b) A better bet is to use a full-text search engine. Take a look at Solandra (https://github.com/tjake/Solandra) or DataStax Enterprise (http://www.datastax.com/products/enterprise).
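Option a) can be sketched as follows. Two Python dicts stand in for the data column family and the index column family, and LIKELY_TERMS is a hypothetical list of search terms chosen up front; none of this is Cassandra API, it only shows the write-time bookkeeping and the two-step read.

```python
from collections import defaultdict

data_cf = {}                 # row key -> user columns (the data CF)
index_cf = defaultdict(set)  # search term -> row keys (the index CF)

# Assumption: the terms worth indexing are known in advance.
LIKELY_TERMS = ("hello", "admin")


def insert_user(key, username, password):
    """Write the data row and, for each matching term, an index entry."""
    data_cf[key] = {"username": username, "password": password}
    for term in LIKELY_TERMS:
        if term in username.lower():
            index_cf[term].add(key)


def find_users(term):
    """Query the index CF first, then fetch the rows from the data CF."""
    keys = sorted(index_cf.get(term, ()))
    return [data_cf[key]["username"] for key in keys]


insert_user("u1", "xhelloy", "secret1")
insert_user("u2", "bob", "secret2")
insert_user("u3", "hello_admin", "secret3")
```

Searching for a term that was never indexed (such as "bob" here) returns nothing, which is the restriction mentioned in a).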
This project also looks promising
http://tuplejump.github.io/stargate/
I have not looked deeply at it recently, but when I last evaluated it, it looked promising.
What would be considered best practice when you need additional data about facet results?
I.e. I need a friendly name / image / meta keywords / description / and more for product categories (when faceting on categories).
Include it in the document? (Can lead to lots of duplication.)
Introduce category as a new index in Solr (or fake it with a doctype=category field in Solr)?
Use an RDBMS to look up additional data using a SELECT WHERE IN (..category facet result ids..)?
Thanks,
Remco
Use a fast NoSQL DB that fits your data.
BTW, Lucene, which is Solr's underlying layer, is in fact also a NoSQL-type storage facility.
If I were you, I'd use MongoDB. That's the first DB that came to mind, since you need binary data and they practically invented BSON, which is now a widespread means of transferring binary data in a JSON-like fashion.
If your data structure is more graph-shaped (like social network) check out Neo4j, which has blindingly fast graph traversal algorithms.
A relational DB can reliably enforce the "category is a first-class entity" requirement. You would need referential integrity: a product may not belong to a category that doesn't exist, and a deleted category must not leave its child categories lying around. A normalized RDB can enforce referential integrity through the schema. A NoSQL DB must rely on client-side code (which you must write) to enforce it.
Let's see how "a product's category must exist" and "a subcategory's parent must exist" are done:
RDB: The table that assigns categories to products (an m:n relation) must reference the product and the category through foreign keys with ON DELETE CASCADE. If a category is deleted, a product simply cannot have that category any more. For a category that links to another category as its child, the relevant field has an ON DELETE CASCADE. This means that if a parent is deleted, its children cannot exist. The entire method is declarative ("it is declared thus"): all the complexity lives in the data, and we don't need no stinking code to do it for us. You can model the DB as naturally as you understand its real-world implications.
Document store-type NoSQL: You need to write code to do everything. "A category is deleted" is a use case: you need to find the products that have that category and update each one. You have to write code for each use case. The same goes for managing subcategories. The data model may be incredibly stupid, but its real-world implications must be modeled in code. And it's tougher to reason about code and control flow than about data structures.
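The declarative variant can be tried out with Python's built-in sqlite3. This is only a sketch with hypothetical table names (category, product, product_category); note that SQLite enforces foreign keys only after PRAGMA foreign_keys = ON, whereas most other RDBMSs enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific opt-in
conn.executescript("""
CREATE TABLE category (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    parent_id INTEGER REFERENCES category(id) ON DELETE CASCADE
);
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE product_category (
    product_id  INTEGER NOT NULL REFERENCES product(id)  ON DELETE CASCADE,
    category_id INTEGER NOT NULL REFERENCES category(id) ON DELETE CASCADE,
    PRIMARY KEY (product_id, category_id)
);
""")

conn.executemany("INSERT INTO category VALUES (?, ?, ?)",
                 [(1, "hardware", None), (2, "laptops", 1)])
conn.execute("INSERT INTO product VALUES (1, 'thinkpad')")
conn.execute("INSERT INTO product_category VALUES (1, 2)")


def count(table):
    """Row count of one of the fixed tables above."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# A product cannot be assigned to a category that does not exist.
try:
    conn.execute("INSERT INTO product_category VALUES (1, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Deleting the parent removes the child category and its product assignments.
conn.execute("DELETE FROM category WHERE id = 1")
```

No application code walks the hierarchy here; the cascade from "hardware" to "laptops" to the assignment row comes entirely from the schema.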
Do you really have performance needs that require NoSQL databases?
So use an RDBMS to manage your data. Then use the DataImportHandler or client-side code to insert/update denormalized entities for searching. If most requests to your site can be expressed as Solr queries, great!
As for expressing hierarchical faceting in Solr, see 'Ways to do hierarchial faceting in Solr?'.
I would think about 2 alternatives:
1.) Store the information with every document without indexing it (to keep the index as small as possible). The point is that I would not store the image inside Lucene/Solr, only a file pointer.
2.) Store the additional data in an RDBMS or NoSQL store (like MongoDB) and look it up, as you wrote.
My favorite is the second one, because a database is the traditional and most optimized way of storing data.
But in the end it depends on your system, because you should keep in mind that you need time to connect to a database, search through the data and send the additional information back to the application.
So it could be faster to store everything in Lucene.
A small performance test would probably be useful.
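For the second alternative, the lookup by facet result ids might look roughly like this. The category table, its columns and the shape of facet_counts are assumptions made for the example; sqlite3 is used only so the sketch is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category (id INTEGER PRIMARY KEY, friendly_name TEXT, image TEXT)"
)
conn.executemany(
    "INSERT INTO category VALUES (?, ?, ?)",
    [(1, "Laptops", "laptops.png"), (2, "Phones", "phones.png"),
     (3, "Cameras", "cameras.png")],
)


def enrich_facets(facet_counts):
    """facet_counts: {category_id: hit count}, as taken from the facet result."""
    if not facet_counts:
        return []
    ids = list(facet_counts)
    # One placeholder per id keeps the IN (...) query parameterized.
    placeholders = ", ".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, friendly_name, image FROM category WHERE id IN ({placeholders})",
        ids,
    ).fetchall()
    meta = {row[0]: row for row in rows}
    return [
        {"id": cid, "count": n, "name": meta[cid][1], "image": meta[cid][2]}
        for cid, n in facet_counts.items()
        if cid in meta
    ]
```

The whole facet result is resolved with a single query, so the extra round trip stays constant no matter how many categories come back.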
Maybe I am wrong, but if you are on Solr trunk you could benefit from Solr join support, which would allow you to index several entities with relations among them while enforcing conditions on both.
We are searching disparate data sources in our company. We have information in multiple databases that need to be searched from our Intranet. Initial experiments with Full Text Search (FTS) proved disappointing. We've implemented a custom search engine that works very well for our purposes. However, we want to make sure we are doing "the right thing" and aren't missing any great tools that would make our job easier.
What we need:
Column search
ability to search by column
we flag which columns in a table are searchable
Keep some relation between db column and data
we provide advanced filtering on the results
facilitates (amazon style) filtering
filter provided by grouping of results and allowing user to filter them via a checkbox
this is a great feature, users like it very much
Partial Word Match
we have a lot of unique identifiers (product id, etc).
the unique id's can have sub parts with meaning (location, etc)
or only a portion may be available (when the user is searching)
or (by a decidedly poor design decision) there may be white space in the id
this is a major feature that we've implemented now via CHARINDEX (MSSQL) and INSTR (ORACLE)
using the char index functions turned out to be equivalent performance(+/-) on MSSQL compared to full text
didn't test on Oracle
however searches against both types of db are very fast
We take advantage of Indexed (MSSQL) and Materialized (Oracle) views to increase speed
this is a huge win, Oracle Materialized views are better than MSSQL Indexed views
both provide speedups in read-only join situations (like a search combining company and product)
A search that matches user expectations of the paradigm CTRL-f -> enter text -> find matches
Full Text Search is not the best in this area (slow and inconsistent matching)
partial matching (see "Partial Word Match")
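A minimal sketch of the partial word matching described above, assuming a hypothetical product table. SQLite's instr() stands in for CHARINDEX (MSSQL) and INSTR (Oracle), and whitespace is stripped from both the stored id and the search fragment to cope with ids that contain spaces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (product_id TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?)",
    [("NY-1234-A", "widget"), ("NY 1299 B", "gadget"), ("LA-1234-C", "gizmo")],
)


def partial_match(fragment):
    """Return product ids containing the fragment, ignoring case and spaces."""
    needle = fragment.replace(" ", "").lower()
    rows = conn.execute(
        "SELECT product_id FROM product "
        "WHERE instr(lower(replace(product_id, ' ', '')), ?) > 0 "
        "ORDER BY product_id",
        (needle,),
    )
    return [product_id for (product_id,) in rows]
```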
Nice to have:
Search database in real time
skip the indexing step, this is not a hard requirement
Spelling suggestion
Xapian has this http://xapian.org/docs/spelling.html
Similar to google's "Did you mean:"
What we don't need:
We don't need to index documents
at this point searching our data sources is the most important thing
even when we do search documents, we will be looking for partial word matching, etc
Ranking
Our own simple ranking algorithm has proven much better than an FTS equivalent.
Users understand it, we understand it, it's almost always relevant.
Stemming
Just don't need to get [run|ran|running]
Advanced search operators
phrase matching, or/and, etc
according to Jakob Nielsen http://www.useit.com/alertbox/20010513.html
most users are using simple search phrases
very few use advanced searches (when it's available)
also in Information Architecture 3rd edition Page 185
"few users take advantage of them [advanced search functions]"
http://oreilly.com/catalog/9780596000356
our Amazon like filtering allows better filtering anyway (via user testing)
Full Text Search
We've found that results don't always "make sense" to the user
Searching with FTS is hard to tune (which set of operators match the users expectations)
Advanced search operators are a no go
we don't need them because
users don't understand them
Performance has been very close (+/-) to the char index functions
but the results are sometimes just "weird"
The question:
Is there a solution that allows us to keep the key value pair "filtering feature", offers the column specific matching, partial word matching and the rest of the features, without the pain of full text search?
I'm open to any suggestion. I've wondered if a document/hash table nosql data store (MongoDB, et al) might be of use? ( http://www.mongodb.org/display/DOCS/Full+Text+Search+in+Mongo ). Any experience with these is appreciated.
Again, just making sure we aren't missing something with our in-house customized version. If there is something "off the shelf" I would be interested in it. Or if you've built something from some components, what components (search engines, data stores, etc) did you use and why?
You can also make your point for FTS. Just make sure it meets the requirements above before you say "just use Full Text Search because that's the only tool we have."
I ended up coding my own.
The results are fantastic. Users like it, it works well with our existing technologies.
It really wasn't that hard. Just took some time.
Features:
Faceted search (amazon, walmart, etc)
Partial word search (the real stuff not full text)
Search databases (oracle, sql server, etc) and non database sources
Integrates well with our existing environment
Maintains relations, so I can have a n to n search and display
--> this means I can display child records of a master record in search results
--> also I can search any child field and return the master record
It's really amazing what you can do with dictionaries and a lot of memory.
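To give an idea of what "dictionaries and a lot of memory" can look like, here is a small sketch of a partial word search with facet counts. It is not the implementation described above; the record fields, the substring index and the facet fields are all assumptions for the example.

```python
from collections import Counter, defaultdict

records = [
    {"id": 1, "name": "steel bolt", "location": "NY", "type": "hardware"},
    {"id": 2, "name": "steel nut", "location": "LA", "type": "hardware"},
    {"id": 3, "name": "bolt cutter", "location": "NY", "type": "tools"},
]

by_id = {}
# Every substring of a searchable field maps to the ids containing it.
# This is memory-hungry, but a partial match becomes one dictionary lookup.
fragment_index = defaultdict(set)


def index_record(rec, fields=("name",), min_len=2):
    by_id[rec["id"]] = rec
    for field in fields:
        text = rec[field].lower()
        for start in range(len(text)):
            for end in range(start + min_len, len(text) + 1):
                fragment_index[text[start:end]].add(rec["id"])


def search(fragment, facet_fields=("location", "type"), **filters):
    """Return matching records plus per-facet counts for checkbox filtering."""
    ids = fragment_index.get(fragment.lower(), set())
    hits = [by_id[i] for i in sorted(ids)
            if all(by_id[i][key] == value for key, value in filters.items())]
    facets = {f: dict(Counter(hit[f] for hit in hits)) for f in facet_fields}
    return hits, facets


for rec in records:
    index_record(rec)
```

Calling search("bolt") returns both bolt records together with counts per location and type, and search("bolt", type="tools") narrows the result the way a ticked facet checkbox would.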
I recommend looking into Solr; I believe it will meet your needs:
http://lucene.apache.org/solr/
For an off-the-shelf solution: Have you checked out the Google Search Appliance?
Quote from the Google Mini/GSA site:
... If direct database indexing is a requirement for you, we encourage you to consider the Google Search Appliance, which has direct database connectivity.
And of course it indexes everything else in the Googly manner you'd expect it to.
Apache Solr is a good way to start your project, and it is open source. You can also try Elasticsearch, and there are a lot of off-the-shelf products that offer good customization abilities and search features, such as Coveo, SharePoint FAST, Google ...