I want to add a field to my MongoDB schema that specifies how long a course will take. Could I use a basic int that represents the minutes and then implement some logic on my frontend to turn that into hours, or is there a more efficient way to do this?
Cheers Darius
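For illustration, the conversion described above is a single divmod on the stored integer. This is a minimal sketch in Python; the name format_duration is made up for this example, and the same arithmetic carries over to whatever language the frontend uses.

```python
def format_duration(total_minutes: int) -> str:
    """Render a duration stored as whole minutes as hours and minutes."""
    if total_minutes < 0:
        raise ValueError("duration must be non-negative")
    # divmod splits the stored minutes into whole hours and leftover minutes
    hours, minutes = divmod(total_minutes, 60)
    if hours == 0:
        return f"{minutes}m"
    if minutes == 0:
        return f"{hours}h"
    return f"{hours}h {minutes}m"
```

Keeping the stored value as a plain integer also means sorting and range filters on course length need no parsing.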
Related
I have a use case where I want to design a hashtag ranking system. The 10 most popular hashtags should be selected. My idea is something like this:
[hashtag, rateofhitsperminute, rateofhitsper5minutes]
Then I will query to find the 10 most popular #hashtags, meaning those whose hits per minute are highest.
My question is: what sort of database can I use to provide statistics like 'rateofhitsperminute'?
What is a good way to calculate such a detail and store it in a DB? Do some DBs offer these features?
First of all, "rate of hits per minute" is calculated:
[hits during period]/[length of period]
So the rate will vary depending on how long the period is. (The last minute? The last 10 minutes? Since the hits started being recorded? Since the hashtag was first used?)
So what you really want to store is the count of hits, not the rate. It is better to either:
Store the hashtags and their hit counts during a certain period (less memory/cpu required but less flexible)
OR the timestamp and hashtag of each hit (more memory/cpu required but more flexible)
Now it is a matter of selecting the time period of interest, and querying the database to find the top 10 hashtags with the most hits during that period.
If you need to display the rate, use the formula above, but notice it does not change the order of the top hashtags because the period is the same for every hashtag.
You can apply the algorithm above to almost any DB. You can even do it without using a database (just use a programming language's builtin hashmap).
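As a rough sketch of the no-database variant, assuming each hit is kept as a (hashtag, timestamp) pair: count the hits inside the chosen period with a hashmap, take the top k, and divide by the period length only when the rate needs to be displayed. The function name top_hashtags and the sample data are made up for this example.

```python
from collections import Counter
from datetime import datetime, timedelta

def top_hashtags(hits, start, end, k=10):
    """Return (hashtag, hit_count, hits_per_minute) for the k most-hit tags in [start, end)."""
    minutes = (end - start).total_seconds() / 60
    if minutes <= 0:
        raise ValueError("end must be later than start")
    # Counter is a hashmap from hashtag to number of hits inside the period
    counts = Counter(tag for tag, ts in hits if start <= ts < end)
    # The rate is count / period length; it does not change the ordering
    return [(tag, n, n / minutes) for tag, n in counts.most_common(k)]

now = datetime(2012, 12, 4, 10, 31, 0)
hits = [
    ("java", datetime(2012, 12, 4, 10, 30, 45)),
    ("java", datetime(2012, 12, 4, 10, 30, 50)),
    ("python", datetime(2012, 12, 4, 10, 30, 55)),
    ("java", datetime(2012, 12, 4, 9, 0, 0)),  # outside the one-minute window
]
result = top_hashtags(hits, now - timedelta(minutes=1), now)
```

Here result ranks java (2 hits) ahead of python (1 hit) for the last minute. A real system would replace the in-memory list with a query against whatever store holds the hits.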
If performance is a concern and there will be many different hashtags, I suggest using an OLAP database. OLAP databases are specially designed for top-k queries (over a certain time period) like this.
Having said that, here is an example of how to accomplish your use case in Solr: Solr as an Analytics Platform. Solr is not an OLAP database, but this example uses Solr like an OLAP DB and seems to be the easiest to implement and adapt to your use case:
Your Solr schema would look like:
<fields>
<field name="hashtag" type="string"/>
<field name="hit_date" type="date"/>
</fields>
An example document would be:
{
"hashtag": "java",
"hit_date": "2012-12-04T10:30:45Z"
}
A query you could use would be:
http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=hashtag&facet.mincount=1&facet.limit=10&facet.range=hit_date&facet.range.start=2012-01-01T00:00:00Z&facet.range.end=2013-01-01T00:00:00Z&facet.range.gap=%2B1MONTH
Finally, here are some advanced resources related to this question:
Similar question: Implementing twitter and facebook like hashtags
What is the best way to compute trending topics or tags? An interesting idea I got from these answers is to use the derivative of the hit counts over time to calculate the "instantaneous" hit rate.
HyperLogLog can be used to estimate counts of distinct items (for example, the number of unique users behind a hashtag's hits) if an approximate calculation is acceptable.
Look into Sliding-Window Top-K if you want to get really academic on this topic.
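The derivative idea mentioned above can be approximated with finite differences. This is only a sketch under stated assumptions: each hashtag has a list of (minute, cumulative_hits) samples sorted by time, and the names instantaneous_rates and rank_by_latest_rate are invented for this example rather than taken from the linked answers.

```python
def instantaneous_rates(samples):
    """Approximate d(hits)/dt between consecutive (minute, cumulative_hits) samples."""
    return [
        (c1 - c0) / (t1 - t0)
        for (t0, c0), (t1, c1) in zip(samples, samples[1:])
    ]

def rank_by_latest_rate(samples_by_tag, k=10):
    """Order hashtags by their most recent hits-per-minute estimate, highest first."""
    latest = {}
    for tag, samples in samples_by_tag.items():
        rates = instantaneous_rates(samples)
        # A tag with fewer than two samples has no measurable rate yet
        latest[tag] = rates[-1] if rates else 0.0
    return sorted(latest, key=latest.get, reverse=True)[:k]
```

A tag with few total hits but a steep recent slope ranks above a tag with many old hits, which is the point of looking at the rate rather than the raw count.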
No database has rate per minute statistics just built in, but any modern database could be used to create a database in which you could quite easily calculate rate per minute or any other calculated values you need.
Your question is like asking which kind of car can drive from New York to LA - well no car can drive itself or refuel itself along the way (I should be careful with this analogy because I guess cars are almost doing this now!), but you could drive any car you like from New York to LA, some will be more comfortable, some more fuel efficient and some faster than others, but you're going to have to do the driving and refueling.
You can use InfluxDB. It's well suited for your use case, since it was created to handle time series data (for example "hits per minute").
In your case, every time there is a hit, you could send a record containing the name of the hashtag and a timestamp.
The data is queryable, and there are already tools that can help you process or visualize it (like Grafana).
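To make the "one record per hit" idea concrete, here is a sketch that formats a hit in InfluxDB's line protocol (measurement, tags, fields, nanosecond timestamp). The measurement name hashtag_hits and the field name count are assumptions chosen for this example, and the code only builds the text line; sending it to a server is left out.

```python
from datetime import datetime, timezone

def _escape_tag(value: str) -> str:
    # Line protocol requires commas, equals signs and spaces in tag values to be escaped
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value

def hit_to_line(hashtag: str, when: datetime) -> str:
    """Format one hashtag hit as: measurement,tag=value field=value timestamp_ns."""
    if when.tzinfo is None:
        raise ValueError("use a timezone-aware datetime")
    # Whole seconds plus microseconds, expressed in nanoseconds
    ns = int(when.timestamp()) * 1_000_000_000 + when.microsecond * 1_000
    # The trailing "i" marks the field value as an integer
    return f"hashtag_hits,hashtag={_escape_tag(hashtag)} count=1i {ns}"

line = hit_to_line("java", datetime(2012, 12, 4, 10, 30, 45, tzinfo=timezone.utc))
```

Storing the hashtag as a tag rather than a field is a design choice for this sketch: tags are indexed, so grouping by hashtag stays cheap.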
If you are happy with a large data set you could store and calculate this information yourself.
I believe MongoDB is fairly fast when it comes to index-based queries, so you could structure it something like this.
Every time a tag is "hit" or accessed, you could store this information as a row:
[Tag][Timestamp]
Storing it this way lets you run simple Group, Count and Sort operations, which gives you the first thing you asked for: the 10 most popular tags.
With the information in this format you can then run further queries on tag and timestamp to count the number of hits for a specific tag between times X and Y, which gives you the hits per period.
Benefits of doing it this way:
High information granularity depending on time frames supplied via query
These queries are rather fast in mongoDB or similar databases even on large data sets
Negatives of doing it this way:
You have to store many rows of data
You have to perform queries to retrieve the information you need rather than returning a single data row
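The Group, Count and Sort steps described above map onto a MongoDB aggregation pipeline. The sketch below only builds the pipeline; the field names tag and timestamp follow the [Tag][Timestamp] row layout, while the collection name hits mentioned in the comment is an assumption.

```python
from datetime import datetime

def top_tags_pipeline(start: datetime, end: datetime, k: int = 10) -> list:
    """Build an aggregation pipeline for the k most-hit tags in [start, end)."""
    return [
        # Keep only hits inside the period of interest
        {"$match": {"timestamp": {"$gte": start, "$lt": end}}},
        # Group by tag and count one per hit
        {"$group": {"_id": "$tag", "hits": {"$sum": 1}}},
        # Most-hit tags first
        {"$sort": {"hits": -1}},
        {"$limit": k},
    ]

pipeline = top_tags_pipeline(datetime(2012, 12, 4, 10, 0), datetime(2012, 12, 4, 11, 0))
# With pymongo this could be run as: db.hits.aggregate(pipeline)
```

An index on timestamp (or on timestamp plus tag) is what would keep the $match stage fast on a large collection.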
In an App Engine search document, can I set a field to not be indexed?
I looked at the documentation and I think this is not possible, but I can't find anywhere that concretely says so, so I'm trying to make sure.
I wonder why this is... perhaps they are trying to prevent "abusive" use of the search by storing "too much" non-indexed information in the document... but it is pretty convenient in some cases vs having to go to another data store, especially when the total data sizes are fairly small.
Thanks
No, the full text search service is a search index, not a data repository, so by definition everything is indexed.
You would probably struggle with this approach because the data types you put in are not necessarily what is stored. For example, dates keep only day precision (the time component is dropped), and numbers have truncated precision.
I would like to implement a datetime-ordered entity in App Engine, pretty much like App Engine's own logs. So I will probably need some kind of unique, ordered ID generation algorithm.
Has anyone got any suggestion on this?
Having a similar need I passed a long integer time stamp as identifier to the Entity constructor. The identifier can be only a string or a long integer according to Java Datastore Entities, Properties, and Keys. In order to see the actual dates and times in the Datastore Viewer I put the same value converted to a java.util.Date into an unindexed property as well. Admittedly some denormalized redundancy but convenient in practice.
Use the date of the entry you are appending. One way is to convert it to Unix time (ms since 1970) so it's numeric.
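A minimal sketch of that conversion in Python, using exact integer arithmetic rather than floats; the helper name to_unix_ms is made up. Note that two entries created in the same millisecond would get the same value, so on its own this is ordered but not guaranteed unique.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_unix_ms(dt: datetime) -> int:
    """Convert a timezone-aware datetime to whole milliseconds since 1970-01-01 UTC."""
    if dt.tzinfo is None:
        raise ValueError("use a timezone-aware datetime")
    # Dividing one timedelta by another with // yields an exact integer
    return (dt - EPOCH) // timedelta(milliseconds=1)

entry_id = to_unix_ms(datetime(2012, 12, 4, 10, 30, 45, tzinfo=timezone.utc))
```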
A better way, though it takes more code, is to skip the datastore and use BigQuery instead. It is probably cheaper.
We need more information about what you want to do.
If you want to keep logs, you can indeed use a timestamp.
With Python and ndb it's easy:
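from google.appengine.ext import ndb

class Log(ndb.Model):
    # auto_now_add sets the timestamp once, when the entity is first created
    date = ndb.DateTimeProperty(auto_now_add=True)
    message = ndb.StringProperty()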
Then you order your logs by the date field.
If you want something closer to App Engine's own logs, you can give each log a parent key and order by parent key and date.
AppEngine Python ndb
I hope it helped you.
I'm creating a calendar application in which each date has one of 3 states: available, maybe available, and unavailable. Trying to figure out the best schema for this situation.
One thought might be to have a UserDate model with a field state. The problem with this is that the DB will have (number of users) x 365 rows for each year, which seems like it would grow too quickly for a modestly sized app.
Another thought might be to have a default state, and only create a UserDate object when the user has signified that their availability on that date is different from the default. This seems convoluted though.
Has anyone dealt with this situation before? Any suggestions on the best way to go about this?
When you create a new user, you do not want to be inserting records for the next 50 years of their life. Only creating the UserDate object when there is a non-default value is what you should do.
You could consider storing a range of dates for a user if you are likely to have lots of consecutive dates with the same status. For example, if they are unavailable for all of December, then this could be represented as a single row.
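A small sketch of the default-plus-ranges idea, kept independent of any particular database or ORM. The names AvailabilityRange and state_on are hypothetical, and treating "available" as the default state is an assumption for this example.

```python
from dataclasses import dataclass
from datetime import date

DEFAULT_STATE = "available"  # assumed default; pick whichever state is most common

@dataclass(frozen=True)
class AvailabilityRange:
    start: date  # first day of the range
    end: date    # last day of the range, inclusive
    state: str   # "available", "maybe available" or "unavailable"

def state_on(day, ranges, default=DEFAULT_STATE):
    """Return the stored state covering `day`, or the default when no row covers it."""
    for r in ranges:
        if r.start <= day <= r.end:
            return r.state
    return default

# One row represents being unavailable for all of December
ranges = [AvailabilityRange(date(2024, 12, 1), date(2024, 12, 31), "unavailable")]
```

With this layout, state_on(date(2024, 12, 15), ranges) finds the December row, while any date with no stored row falls back to the default.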
Think about the sort of information you want to extract from your database, and how difficult this will be with each of your possible designs.
I have an entity with 2 properties: UserId(String) and RSSSubscriptions(String). Instances of this class will be stored in the App Engine Datastore.
RSSSubscriptions should hold key-value pairs like "Site1: Feed1", "Site2: Feed2".
Since datatypes like HashMaps are not persistable, I am forced to keep this data in a String format. Currently I store it as a string in JSONArray format. Say, "[{"Site1: Feed1"}, {"Site2: Feed2"}]".
My client will be an Android app, so I am supposed to parse this string as a JSON array on the client side. But I think it's a bad idea to build a JSON-formatted string and append to the existing string each time the user adds a new subscription. Any better ideas?
You can use JsonProperty, which ndb supports for exactly this purpose. In my opinion it's a "hairy" solution to store JSON as a string and parse it back and forth. You have to be very careful to guarantee validity.
The correct answer depends on several factors, with the expected number of pairs being the most important. It is important to remember that there are significant costs associated with storing each pair in an entity accessed by query. There are numerous ops costs for doing a query, and there will be significant CPU time. Compare this to using a single record keyed by user id and storing the JSON inside a TextProperty. That is one small op cost, and CPU time that will likely be 10x less than a query.
Please consider these factors when deciding whether to go with the technically cleaner approach of querying entities. Myself, I would always use a serialized string inside a TextProperty for anything in the "thousands of pairs" volume unless there was a very high rate of deletions (and even then it is likely the string approach would be better). Using a query is generally the last design choice for GAE given its high resource costs.
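As a sketch of the serialized-string approach, the update can go through a parse, modify, serialize round trip instead of appending to the string by hand, which keeps the stored value valid JSON. The helper name add_subscription is made up, and storing the pairs as one JSON object (site to feed) rather than an array is an assumption of this example.

```python
import json

def add_subscription(stored: str, site: str, feed: str) -> str:
    """Return the updated JSON text after adding or replacing one site -> feed pair."""
    # An empty or missing value means the user has no subscriptions yet
    subscriptions = json.loads(stored) if stored else {}
    subscriptions[site] = feed
    # sort_keys keeps the serialized form stable across updates
    return json.dumps(subscriptions, sort_keys=True)

stored = add_subscription("", "Site1", "Feed1")
stored = add_subscription(stored, "Site2", "Feed2")
```

The resulting string is what would be written to the single per-user record; the client can parse it as one JSON object.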