I'm using ServiceStack.Redis to implement a demo project. It contains two POCOs i.e Albums and its Songs.
Below is the search results measured using a stopwatch instance:
Time elapsed searching 5804 items is 00:00:00.1243984 <-- Albums
Time elapsed searching 138731 items is 00:00:02.0592068 <-- Songs
As you can see the search for the songs is taking too much time. I'm displaying the results in a WPF application wherein the search term is also entered. The lag is a no-go for redis.
Below is the code used for searching:
IEnumerable<int> songsFromRedis =
songRedis.GetAll()
.Where(song => song.Title != null
&& song.Title.ToLowerInvariant().Contains(searchText))
.OrderBy(song => song.Title)
.Select(x => x.AlbumId);
If we cannot make it any faster, would ElasticSearch help ?
The issue is how you're using Redis, i.e. songRedis.GetAll() downloads the entire dataset, deserializes all entities into C# objects and performs the search on the client.
You should never download and query an entire dataset across the network on the client (i.e with any datastore), even a full server-side table-scan query would perform much better since only the filtered results are returned to the client and not the entire dataset. Ideally even full server-side table-scans should be avoided and any queries should be made via an index.
Redis doesn't have support for indexes built-in, but when needed you can use a SET to manually create indexes between entities in Redis.
Any search oriented database would help. Even mysql fulltext search, which is notoriusly slow, would be a lot better here.
Elasticsearch is one good alternative, Sphinx another good one. ES has the easy scalability and ease of use, sphinx has the performance win and otherwise most of the common features but is a bit more work to learn and to scale.
Related
I am currently working with java spring and postgres.
I have a query on a table, many filters can be applied to the query and each filter needs many joins.
This query is very slow, due to the number of joins that must be performed, also because there are many elements in the table.
Foreign keys and indexes are correctly created.
I know one approach could be to keep duplicate information to avoid doing the joins. By this I mean creating a new table called infoSearch and keeping it updated via triggers. At the time of the query, perform search operations on said table. This way I would do just one join.
But I have some doubts:
What is the best approach in postgres to save item list flat?
I know there is a json datatype, could I use this to hold the information needed for the search and use jsonPath? is this performant with lists?
I also greatly appreciate any advice on another approach that can be used to fix this.
Is there any software that can be used to make this more efficient?
I'm wondering if it wouldn't be more performant to move to another style of database, like graph based. At this point the only problem I have is with this specific table, the rest of the problem is simple queries that adapt very well to relational bases.
Is there any scaling stat based on ratios and number of items which base to choose from?
Denormalization is a tried and true way to speed up queries/reports/searching processes for relational databases. It uses a standard time vs space tradeoff to reduce the time of query, at the cost of duplicating the data and increasing write/insert time.
There are third party tools that are specifically designed for this use-case, including search tools (like ElasticSearch, Solr, etc) and other document-centric databases. Graph databases are probably not useful in this context. They are focused on traversing relationships, not broad searches.
This question might be relevant for any document based NoSQL database.
I'm making some interest specific social network and decided to go with DynamoDB because of scalability and no-pain-administration factors. There are only two main entities in database: users and posts.
Requirement for common queries are very simple:
Home feed (feed of people I'm following)
My/User feed (feed of mine, or specific user feed)
List of user I/user followed
List of followers
Here is a database scheme I come up with so far (legend: __thisIsHashKey and _thisIsRangeKey):
timeline = { // post
__usarname:"totocaster",
_date:"1245678901345",
record_type:"collection",
items: ["2d931510-d99f-494a-8c67-87feb05e1594","2d931510-d99f-494a-8c67-87feb05e1594","2d931510-d99f-494a-8c67-87feb05e1594","2d931510-d99f-494a-8c67-87feb05e1594","2d931510-d99f-494a-8c67-87feb05e1594"],
number_of_likes:123,
description:"Hello, this is cool"
}
timeline = { // new follower
__usarname:"totocaster",
_date:"1245678901345",
type:"follow",
follower:"tamuna123"
}
timeline = { // new like
__usarname:"totocaster",
_date:"1245678901345",
record_type:"like",
liker:"tamuna123",
like_date:"123255634567456"
}
users = {
__username:"totocaster",
avatar_url:"2d931510-d99f-494a-8c67-87feb05e1594",
followers:["don_gio","tamuna123","barbie","mikecsharp","bassman"],
following:["tamuna123","barbie","mikecsharp"],
likes:[
{
username:'barbie',
date:"123255634567456"
},
{
username:"mikecsharp",
date:"123255634567456"
}],
full_name:"Toto Tvalavadze",
password:"Hashed Key",
email:"totocaster#myemailprovider.com"
}
As you can see I came-up storing all my post directly in timeline collection. This way I can query for posts using date and username (hash and range keys). Everything seems fine, but here is the problem:
I can not query for User-Timeline in one go. This will be one of the most demanded queries by system and I can not provide efficient way to do this. Please help. Thanks.
I happen to work with news feeds daily. (Author of Stream-Framework and founded getstream.io)
The most common solutions I see are:
Cassandra (Instagram)
Redis (expensive, but easy)
MongoDB
DynamoDB
RocksDB (Linkedin)
Most people use either fanout on write or fanout on read. This makes it easier to build a working solution, but it can get expensive quickly. Your best bet is to use a combination of those 2 approaches. So do a fanout on write in most cases, but for very popular feeds keep them in memory.
Stream-Framework is open source and supports Cassandra/Redis & Python
getstream.io is a hosted solution build on top of Go & Rocksdb.
If you do end up using DynamoDB be sure to setup the right partition key:
https://shinesolutions.com/2016/06/27/a-deep-dive-into-dynamodb-partitions/
Also note that a Redis or DynamoDB based solution will get expensive pretty quickly. You'll get the lowest cost per user by leveraging Cassandra or RocksDB.
I would check out the Titan graph database (http://thinkaurelius.github.com/titan/) and Neo4j (http://www.neo4j.org/).
I know Titan claims to scale pretty well with large data sets.
Ultimately I think your model maps well to a graph. Users and posts would be nodes, and then you can connect them arbitrarily via edges. A user (node) is a friend (edge) of another user (node).
A user (node) has many posts (nodes) in their timeline. Then you can run interesting traversals via the graph.
You can also use Amazon Neptune (https://aws.amazon.com/neptune/) (Graph DB) which is well suited for social network. I don't think DynomoDB would be a good choice for yours use cases.
Is it better to use
a lot of indexes (eg. for every user as your application allows that)
in Lucene
or just one, having every document in int
... if you think about:
performance
disk space
health
I am using elasticsearch, therefore I am using Lucene.
In Elastic Search, I think based off your information I would use 1 index. My understanding is users are only searching there own documents, and the documents seems to be relatively similar.
Performance - When searching you can use a Filtered Query to filter to only the documents matching the user. The user id filter is cache-able, and fast.
Scalable - In Elasticsearch, you control sharding and replication at index level. Elasticsearch can handle large numbers of indexes, I just think configuring appropriate shards and replications could be valuable for the entire index.
In a single index, you can still easy wipe away data (see delete by query) , and there should be little concern of seeing others data unless you write your queries wrong. A filtered query with that filters results to only those associated with a user id is very simple. Similar in complexity to searching a different index per user.
Your exact needs might fit a different approach better. Based what I have so far, I would do choose one index though.
From the appengine blog:
Advanced Query Planning - We are removing the need for exploding indexes and reducing the custom index requirements for many queries. The SDK will suggest better indexes in several cases and an upcoming article will describe what further optimizations are possible.
As a test, I have an Entity in appengine that has a listProperty
class Entity(db.Model):
tags = db.StringListProperty()
I have 500,000 entities, half of them have tags = ['1'], and the other half have tags = ['2']
my query is
SELECT FROM Entity WHERE tags='1' and tags='2'
It returns no results really quickly. What plan is it using to achieve this? How is the list indexed to achieve this? In the old days, an exploding index would have been needed.
The algorithm used internally ('merge-join') was described in the Google I/O 2009 tech talk Building Scalable, Complex Apps on App Engine. This functionality has also been available since the launch of GAE; the 'exploding indexes' only happen if you create a compound index of multiple StringListProperties.
It's worth noting that this functionality is actually a bit more general than you may realize - any combination of multiple equality filters on any arbitrary combination of properties can be satisfied without any compound indices, provided they're all equality filters and you don't have a sort order. They don't all have to be from a StringListProperty, and can even be split across multiple StringListProperty.
We are searching disparate data sources in our company. We have information in multiple databases that need to be searched from our Intranet. Initial experiments with Full Text Search (FTS) proved disappointing. We've implemented a custom search engine that works very well for our purposes. However, we want to make sure we are doing "the right thing" and aren't missing any great tools that would make our job easier.
What we need:
Column search
ability to search by column
we flag which columns in a table are searchable
Keep some relation between db column and data
we provide advanced filtering on the results
facilitates (amazon style) filtering
filter provided by grouping of results and allowing user to filter them via a checkbox
this is a great feature, users like it very much
Partial Word Match
we have a lot of unique identifiers (product id, etc).
the unique id's can have sub parts with meaning (location, etc)
or only a portion may be available (when the user is searching)
or (by a decidedly poor design decision) there may be white space in the id
this is a major feature that we've implemented now via CHARINDEX (MSSQL) and INSTR (ORACLE)
using the char index functions turned out to be equivalent performance(+/-) on MSSQL compared to full text
didn't test on Oracle
however searches against both types of db are very fast
We take advantage of Indexed (MSSQL) and Materialized (Oracle) views to increase speed
this is a huge win, Oracle Materialized views are better than MSSQL Indexed views
both provide speedups in read-only join situations (like a search combing company and product)
A search that matches user expectations of the paradigm CTRL-f -> enter text -> find matches
Full Text Search is not the best in this area (slow and inconsistent matching)
partial matching (see "Partial Word Match")
Nice to have:
Search database in real time
skip the indexing skip, this is not a hard requirement
Spelling suggestion
Xapian has this http://xapian.org/docs/spelling.html
Similar to google's "Did you mean:"
What we don't need:
We don't need to index documents
at this point searching our data sources are the most important thing
even when we do search documents, we will be looking for partial word matching, etc
Ranking
Our own simple ranking algorithm has proven much better than an FTS equivalent.
Users understand it, we understand it, it's almost always relevant.
Stemming
Just don't need to get [run|ran|running]
Advanced search operators
phrase matching, or/and, etc
according to Jakob Nielsen http://www.useit.com/alertbox/20010513.html
most users are using simple search phrases
very few use advanced searches (when it's available)
also in Information Architecture 3rd edition Page 185
"few users take advantage of them [advanced search functions]"
http://oreilly.com/catalog/9780596000356
our Amazon like filtering allows better filtering anyway (via user testing)
Full Text Search
We've found that results don't always "make sense" to the user
Searching with FTS is hard to tune (which set of operators match the users expectations)
Advanced search operators are a no go
we don't need them because
users don't understand them
Performance has been very close (+/1) to the char index functions
but the results are sometimes just "weird"
The question:
Is there a solution that allows us to keep the key value pair "filtering feature", offers the column specific matching, partial word matching and the rest of the features, without the pain of full text search?
I'm open to any suggestion. I've wondered if a document/hash table nosql data store (MongoDB, et al) might be of use? ( http://www.mongodb.org/display/DOCS/Full+Text+Search+in+Mongo ). Any experience with these is appreciated.
Again, just making sure we aren't missing something with our in-house customized version. If there is something "off the shelf" I would be interested in it. Or if you've built something from some components, what components (search engines, data stores, etc) did you use and why?
You can also make your point for FTS. Just make sure it meets the requirements above before you say "just use Full Text Search because that's the only tool we have."
I ended up coding my own.
The results are fantastic. Users like it, it works well with our existing technologies.
It really wasn't that hard. Just took some time.
Features:
Faceted search (amazon, walmart, etc)
Partial word search (the real stuff not full text)
Search databases (oracle, sql server, etc) and non database sources
Integrates well with our existing environment
Maintains relations, so I can have a n to n search and display
--> this means I can display child records of a master record in search results
--> also I can search any child field and return the master record
It's really amazing what you can do with dictionaries and a lot of memory.
I recommend looking into Solr, I believe it will meet you needs:
http://lucene.apache.org/solr/
For an off-she-shelf solution: Have you checked out the Google Search Appliance?
Quote from the Google Mini/GSA site:
... If direct database indexing is a requirement for you, we encourage you to consider the Google Search Appliance, which has direct database connectivity.
And of course it indexes everything else in the Googly manner you'd expect it to.
Apache Solr is a good way to start your project with and it is open source . You can also try Elastic Search and there are a lot of off shelf products which offer good customization abilities and search features such as Coveo, SharePoint Fast, Google ...