I am currently trying to add new functionality to my system in which users will be able to see custom lists of products.
To create these lists I will most likely use an algorithm or some set of criteria to gather data from my database, or sometimes use hand-picked items.
I wonder what the best way to do that is, in terms of storage and computational time. I was thinking about using an object for my model (something like CustomList) that stores some attributes describing the list, which is basically filled with products returned by an advanced search in my database. The object would store a query string or something similar, so the list can be reprocessed periodically, and if it's personalized for a specific user, the query can be run for every user that requests it.
Example of a query (in natural language): "Select all items that are cheaper than 15 dollars and are designed for the gender of user X"
I don't know if there is a better way to do that. I wonder how Spotify works with their personalized and custom lists (like Discover Weekly, running mixes, Sleepy Monday, etc.).
Should I store a query string in an attribute of this model object? Should I do all of that without a model object (on the fly)? What are the best options? How do big companies do this?
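For what it's worth, here is a minimal sketch of the stored-criteria idea, assuming a Django-style ORM (all model and field names are hypothetical, and a Product model is assumed to exist):

from django.db import models
import json

class CustomList(models.Model):
    # Hypothetical model: list metadata plus its selection criteria,
    # stored as serialized filter kwargs so the list can be re-run periodically.
    name = models.CharField(max_length=100)
    criteria = models.TextField()  # e.g. '{"price__lt": 15}'
    is_personalized = models.BooleanField(default=False)

    def products(self, user=None):
        filters = json.loads(self.criteria)
        if self.is_personalized and user is not None:
            # Resolve per-user criteria at request time, e.g. the user's gender.
            filters['gender'] = user.gender
        return Product.objects.filter(**filters)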
I am working on my first GAE project using Java and the datastore, and this is my first try with a NoSQL database. Like a lot of people, I have trouble understanding the right model to use. So far I've figured out two models and I need help choosing the right one.
All the data is represented in two classes: User.class and Word.class.
User: a couple of strings with user data (username, email, ...)
Word: two strings
Which is better:
1. Search 10,000,000 Word entities for the 100 I need. For instance, every Word entity has a string property owner, and I query (owner = 'John').
2. In User.class I add a property List<Word> and a method getWords() that returns the list of words. So I query 1,000 users for the one I need and then call getWords(), which returns the List<Word> with the 100 I need.
Which one uses fewer resources? Or am I going the wrong way with this?
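To make the two options concrete, here is a rough sketch using the Python datastore API (the project is Java, but the entity shapes carry over; note that a datastore list property holds keys or primitive values, not embedded entities, so option 2 is modeled with a key list):

from google.appengine.ext import db

# Option 1: each Word carries its owner; query across all Word entities.
class Word(db.Model):
    owner = db.StringProperty()
    text = db.StringProperty()

# words = Word.all().filter('owner =', 'John').fetch(100)

# Option 2: the User entity keeps references to its words inline.
class User(db.Model):
    username = db.StringProperty()
    email = db.StringProperty()
    words = db.ListProperty(db.Key)  # keys of Word entities

# user = User.all().filter('username =', 'John').get()
# words = db.get(user.words)  # one batch get instead of a query over all Words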
The answer is to use AppStats, and then you can find out:
AppStats
To keep your application fast, you need to know:
Is your application making unnecessary RPC calls? Should it be caching data instead of making repeated RPC calls to get the same data? Will your application perform better if multiple requests are executed in parallel rather than serially?
Run some tests, try it both ways, and see what AppStats says.
But I'd say that your option 2 is better, simply because you don't need to search millions of entities. But who knows for sure? The trouble is that "resources" are a dozen different things in App Engine: CPU, datastore reads, datastore writes, and so on.
For your User class, set a unique ID for each user (such as a username or email address). For the Word class, set the parent of each Word entity to a specific User.
So, if you wanted to look up words from a specific user, you would do an ancestor query for all words belonging to that specific user.
By setting an ID for each user, you can get that user by ID as opposed to doing an additional query.
More info on ancestor queries:
https://developers.google.com/appengine/docs/java/datastore/queries#Ancestor_Queries
More info on IDs:
https://developers.google.com/appengine/docs/java/datastore/entities#Kinds_and_Identifiers
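To make the parent/ID scheme concrete, here is a rough sketch using the Python datastore API for brevity (the Java low-level API has direct equivalents):

from google.appengine.ext import db

class User(db.Model):
    email = db.StringProperty()

class Word(db.Model):
    text = db.StringProperty()

# Create the user under a chosen key name, and each Word with that user as parent.
user = User(key_name='john@example.com', email='john@example.com')
user.put()
Word(parent=user, text='hello').put()

# Later: fetch the user directly by ID (no query needed)...
user = User.get_by_key_name('john@example.com')
# ...and run an ancestor query for that user's words.
words = Word.all().ancestor(user).fetch(100)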
It really depends on the queries you're using. I assume that you want to find all the words given a certain owner.
Most likely, option 2 would be cheaper, since you'll need to fetch the user entity instead of running a query.
Option 2 will be a bit more work on your part, since you'll need to manually keep the list synchronized with the instances of Word.
Off the top of my head I can think of 2 problems with #2, which may or may not apply to you:
A. If you want to find all the owners given a certain word, you'll need to keep that list of words indexed. This affects your costs. If you mostly find words by owner, and rarely find owners by words, it'll still make sense to do it this way. However, if your search pattern flips around and you're searching for owners by words a lot, this may be the wrong design. As you see, you need to design the models based on the queries you will be using.
B. Entities are limited to 1MB, and there's a limit on the number of indexed properties (5,000, I think?). Those two will limit the number of words you can store in your list. Make sure that you won't need more than that limit of words per user. Method 1 allows you unlimited words per user.
I'm developing an application that allows users to tag product purchases (via a web app).
I intend to use the tags to automatically query DBpedia (and possibly other open data sources such as Freebase).
The top N results returned from DBpedia will be displayed to users, and they will select the one that most closely resembles the tag they entered. (I will only extract specific data.)
For example:
The user enters the tag 'iPhone' and a SPARQL query is sent to DBpedia. The results are parsed, and some data on each result is shown to the user, who then selects the one that most closely resembles what they bought.
I want to extract some of the data from the user's selected DBpedia result and store it for marketing purposes at a later stage. (Ideally via some call to an API.)
I was thinking of either Bigdata or Protégé OWL, but I have no experience with either.
Can anybody suggest the best tool for this task, along with its advantages/disadvantages/learning curve, etc.?
Thanks
It all depends on what you want to do with the data that you've extracted. The simplest option is just to store the reconciled entity URI along with your other data in a relational database or even a NoSQL database. This lets you easily query Freebase and DBpedia for that entity later on.
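As a trivial illustration of that simple option, a sketch with SQLite (table and column names are hypothetical):

import sqlite3

conn = sqlite3.connect('purchases.db')
conn.execute("""CREATE TABLE IF NOT EXISTS purchases (
    id INTEGER PRIMARY KEY,
    user_tag TEXT,
    dbpedia_uri TEXT  -- the reconciled entity URI
)""")
conn.execute("INSERT INTO purchases (user_tag, dbpedia_uri) VALUES (?, ?)",
             ('iPhone', 'http://dbpedia.org/resource/IPhone'))
conn.commit()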
If you want to pull in "everything there is to know" about an entity from Freebase and DBpedia, then you're probably better off with a triple store. With this approach, you can query all the data locally; but now you have to worry about keeping it updated.
For the kind of thing you have in mind, I don't think you necessarily need a highly scalable triplestore solution. More important, it seems to me, is that you have a toolkit for easy execution of SPARQL queries, result processing, and quick local caching of RDF data.
With those things in mind, I'd recommend having a look at OpenRDF Sesame. It's a Java toolkit and API for working with RDF and SPARQL with support for multiple storage backends. It has a few built-in stores that perform well for what you need (scaling up to about 100 million facts in a single store), and if you do find you need a bigger/better storage solution, stores like BigData or OWLIM are pretty much just drop-in replacements for Sesame's own storage backends, so you get to switch without having to make large changes to your code.
Just to give you an idea: the following lines of code use Sesame to fire a SPARQL query against DBpedia and process the result:
import org.openrdf.query.BindingSet;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQuery;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.sparql.SPARQLRepository;

// Connect to DBpedia's public SPARQL endpoint.
SPARQLRepository dbpediaEndpoint = new SPARQLRepository("http://dbpedia.org/sparql");
dbpediaEndpoint.initialize();

RepositoryConnection conn = dbpediaEndpoint.getConnection();
try {
    String queryString = "SELECT ?x WHERE { ?x a foaf:Person } LIMIT 10";
    TupleQuery query = conn.prepareTupleQuery(QueryLanguage.SPARQL, queryString);
    TupleQueryResult result = query.evaluate();
    while (result.hasNext()) {
        BindingSet solution = result.next();
        // and so on and so forth, see the Sesame manual/javadocs
        // for details and examples
    }
}
finally {
    conn.close();
}
(disclosure: I work on Sesame)
I'm currently speccing out a project that stores threaded comment trees.
For those of you unfamiliar with what I'm talking about, I'll explain: basically, every comment has a parent comment, rather than just belonging to a thread. Currently I'm working on a relational SQL Server model for storing this data, simply because it's what I'm used to. It looks like so:
Id int --PK
ThreadId int --FK
UserId int --FK
ParentCommentId int --FK (relates back to Id)
Comment nvarchar(max)
Time datetime
What I do is select all of the comments by ThreadId, then in code, recursively build out my object tree. I'm also doing a join to get things like the User's name.
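That build step can be done in a single pass over the flat result set; here is a language-neutral sketch in Python (field names follow the schema above):

def build_tree(comments):
    # comments: flat list of dicts keyed by the columns above
    by_id = {c['Id']: dict(c, children=[]) for c in comments}
    roots = []
    for c in by_id.values():
        parent_id = c['ParentCommentId']
        if parent_id is None:
            roots.append(c)  # top-level comment in the thread
        else:
            by_id[parent_id]['children'].append(c)
    return roots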
It just seems to me that maybe a document store like MongoDB, which is NoSQL, would be a better choice for this sort of model. But I don't know anything about it.
What would be the pitfalls if I do choose MongoDB?
If I'm storing it as a Document in MongoDB, would I have to include the User's name on each comment to prevent myself from having to pull up each user record by key, since it's not "relational"?
Do you have to aggressively cache "related" data on the objects you need them on when you're using MongoDB?
EDIT: I did find this article about storing trees of information in MongoDB. Given that one of my requirements is the ability to show a logged-in user a list of his recent comments, I'm now strongly leaning towards just using SQL Server, because I don't think I'll be able to do anything clever with MongoDB that will result in real performance benefits. But I could be wrong. I'm really hoping an expert (or two) on the matter will chime in with more information.
The main advantage of storing hierarchical data in Mongo (and other document databases) is the ability to store multiple copies of the data in ways that make queries more efficient for different use cases. In your case, it would be extremely fast to retrieve the whole thread if it were stored as a hierarchical nested document, but you'd probably also want to store each comment un-nested or possibly in an array under the user's record to satisfy your 2nd requirement. Because of the arbitrary nesting, I don't think that Mongo would be able to effectively index your hierarchy by user ID.
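For example, a thread stored as one nested document might look like this (a sketch with pymongo; the field names are made up):

from pymongo import MongoClient

db = MongoClient().forum
db.threads.insert_one({
    'thread_id': 42,
    'comments': [{
        'user_id': 7,
        'user_name': 'alice',  # denormalized to avoid a lookup per comment
        'text': 'First!',
        'replies': [
            {'user_id': 9, 'user_name': 'bob', 'text': 'A reply', 'replies': []},
        ],
    }],
})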
As with all NoSQL stores, you get more benefit by being able to scale out to lots of data nodes, allowing for many simultaneous readers and writers.
Hope that helps
I need an efficient way to search through my models to find a specific User. Here's a list:
User - list of users, their names, etc.
Events - table of events for all users, recording when they're not available
Skills - many-to-many relationship with the User, a User could have a lot of skills
Contracts - many-to-one with User, a User could work on multiple contracts, each with a rating (if completed)
... etc.
So I got a lot of tables linked to the User table. I need to search for a set of users fitting certain criteria; for example, he's available from next Thurs through Fri, has x/y/z skills, and has received an average 4 rating on all his completed contracts.
Is there some way to do this search efficiently while minimizing the # of times I hit the database? Sorry if this is a very newb question.
Thanks!
Not sure if this method will solve your issue in all four cases, but it should at least help you out with the first one: querying user data efficiently.
I usually find the values or values_list query functions faster, because they slim down the SELECT part of the actual SQL, and therefore you get results faster. Django docs regarding this.
Also worth mentioning: starting with the development version, values and values_list can traverse any type of relationship, including many-to-one.
Finally, you might find in_bulk useful as well. For a complex query, you might query the IDs first using values or values_list and then use in_bulk to get the model instances faster. Django docs about that.
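A sketch of that two-step pattern (the User model is from the question; the skills relation name is an assumption):

# Step 1: fetch only the IDs of matching users, keeping the SELECT slim.
user_ids = User.objects.filter(
    skills__name__in=['x', 'y', 'z'],
).values_list('id', flat=True)

# Step 2: fetch the full instances in one query, keyed by ID.
users_by_id = User.objects.in_bulk(list(user_ids))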
We have many years of weather data that we need to build a reporting app on. Weather data has many fields of different types, e.g. city, state, country, zipcode, latitude, longitude, temperature (hi/lo), temperature (avg), precipitation, wind speed, date, etc.
Our reports require that we choose combinations of these fields and then sort, search, and filter on them, e.g.
WeatherData.all().filter('avg_temp =', 20).filter('city =', 'palo alto').filter('hi_temp =', 30).order('date').fetch(100)
or
WeatherData.all().filter('lo_temp =', 20).filter('city =', 'palo alto').filter('hi_temp =', 30).order('date').fetch(100)
It may be easy to see that these queries require different indexes. It may also be obvious that the 200-index limit can be crossed very easily with any such data model where a combination of fields is used to filter, sort, and search entities. Finally, the number of entities in such a data model can obviously run into the millions, considering that there are many cities and we could record hourly data instead of daily.
Can anyone recommend a way to model this data which allows for all the queries to still be run, at the same time staying well under the 200 index limit? The write-cost in this model is not as big a deal but we need super fast reads.
Your best option is to rely on the built-in support for merge join queries, which can satisfy these queries without an index per combination. All you need to do is define one index per field you want to filter on and sort order (if that's always date, then you're down to one index per field). See this part of the docs for details.
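Assuming the sort is always on date, the index.yaml for the queries above would only need one composite index per filter field, something like:

indexes:
- kind: WeatherData
  properties:
  - name: avg_temp
  - name: date
- kind: WeatherData
  properties:
  - name: city
  - name: date
- kind: WeatherData
  properties:
  - name: hi_temp
  - name: date
- kind: WeatherData
  properties:
  - name: lo_temp
  - name: date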
I know it seems counter-intuitive, but you can use a full-text search system that supports categories (properties/whatever) to do something like this, as long as you are primarily using equality filters. There are ways to get inequality filters to work, but they are often limited. The faceting features can be useful too.
The upcoming Google Search API
IndexTank is the service I currently use
EDIT:
Yup, this is totally a hackish solution. The documents I am using it for are already in my search index and I am almost always also filtering on search terms.