Building personalized feed on App Engine - google-app-engine

I have been working on a social app. I'll first explain the problems, and then summarize in the questions below.
In the network, there would be channels, and users. Users can subscribe to these channels, and to other users. This way, we have two sources from which posts can be generated.
Now, we can simply keep one Activity model where we record all the actions, their kind, and what they affect. Be it from channels, or from the users. And refer these while creating a feed for each user.
I found a solution given in a talk by Brett Slatkin which basically suggests using ListProperty to link each post with each subscriber. But Guido suggests not to use lists if there's going to be more than 1000 elements. So if there's going to be more than 1000 subscribers to a channel, this will probably run into problem. Even if this were to work --
I want to rank the posts based on popularity (based on number of votes, comments), and apply some time decay function. More like Reddit. To do so, I will have to keep the Activity in memory, and filter and order it based on ranks for each user. I'll also need to do it periodically since new activities will keep occurring also old activities will gain, or lose their values.
The challenge is -- To keep the data in memory (for processing the feed as well as to keep things fast). I will have to store a copy of each users feed to persistent storage, but if the order of posts is going to be changing, how do I keep track of that in the database?
Also: I have kept my options open -- I will move to AWS if I have to.
To summarize:
Is there a better solution to keep track of subscribers without using Lists? Using something like PostID > SubscriberID in one entity would be very, very expensive and inefficient.
If there's any cost-effective and fast solution to the problem above, how do I deal with the next challenge -- which is to generate a personalized feed? (memory issues - unknown size of memcache)
If I can generate a personalized feed (which will be dynamic, will be changing) how to I keep it in the database?.
I have gone through several articles and I can probably solve first two problems with AWS, but I am trying to stay away from the manual scaling work. If there is no way, I am willing to move to AWS. Even if I move to AWS, I can't think of a solution to the third problem.
Any thoughts, directions, resources would be helpful! Thanks!

Related

which methods require the least amount of resources?

I am currently working on the development of a social network with Cassandra. My problem is that I hesitate between two solutions to optimize the consumption of my server.In the first case, when a user publishes a post, it contains all the information like the nickname or the profile picture. But when the user change his nickname for example, I have to change the value for all the posts.
In case two, the information of the user and the posts are stored separately, but each time a user recovers a post he makes two commades instead of one.
In the long run, which of these solutions is the best to optimize the speed of my server?
Thank you for your answers,
Jesver
Since you mentioned that it's a social media application, so the assumption is that there can be plenty of posts by a single user. Also since you have mentioned about using Cassandra, the first scenario will not scale well for a few reasons.
Extensive updates can somehow be costly since it involves a read operation prior to write. It can be considered as an Antipattern.
The information about user will be redundant in each post. In the longer run, it will be a pain ensuring consistent user information in all the post.
The second alternative might be better in which 2 tables are created for users & posts, along with maintaining the relationship at the application level.

How do you technically save how many people have visited a certain URL or a certain post?

I am trying to create a forum app using Django as a backend and React as a front end. I want to find out how many people have visited a post created by a user so that I can store as views and list the posts according to popularity.
I am just a student and I have no experience with live websites, so I'm wondering if it is okay to just save a user at componentdidmount life cycle? But I'm afraid it'd make the same user be counted as many as he visits and the post creator will be able to increase his post's popularity by just spamming his website.
I would suggest you implement this on the backend, not the front end. I don't know Django well, but there should be some way to know whether that particular post is getting requested. At that point, you'll want to increment the counter for that post.
The problem of course is determining which "views" count as real views. Was it the poster? Was it a robot? Was it a spider? Was it a scraper? Was it the same person who is not the poster viewing it many times?
I wouldn't say this last part is not an easy thing to implement, and would probably take some trial and error before finding the right conditions to get your metrics "right".
As #Mike suggests above, there are many analytics packages which use sophisticated algorithms to determine "realness", and you may be able to use this data. My understanding is that you want to actually apply the data to sorting and UI for your app, not just view it on the dashboard of your analytics tool. I've never tried to look for one that supports an API to discover what you're interested in programmatically, but they all probably allow you do download structured data about your traffic. The problem with the latter approach is that creates a delay and a manual step (always something to try to avoid imo).

Database architecture for chatting app to handle tons of users

I found all applications has Messages Collection but I found it insufficient to search all web-app messages every time you get a request.
So If I thought about making a collection for each person is this a good practice ?
Well that would mean a couple of things:
If you had 100 users, you would have 100 collections. Afterwards, if you get 1000 users , you would have to create an additional 900 collections. That is not practical as you would have to keep creating new collections as the number of users increases.
You would have to somehow keep track of the collection relatively to the user. Most DBs have nothing like that out of the box and you would have to create from scratch a program just to be able to delete update etc the correct collections. This is not a small task. Your time is better put to use developing the main functionality of your app
DBs specialize in data lookup in collections, as long as you have your collection properly indexed , you could put millions of messages in a collection and find the ones you need in almost no time at all.
And that is just the tip of the iceberg. As such, making collections per user are not only bad practice , but very impractical unfortunately.
Having said all that, I encourage to keep thinking out of the box. Not all the ideas will work out (like this one), but many great innovations have come from trying something new.

Freebase: Is it worth it to base my company's entire database on it?

I'm with a company that is building a venue / artist database for live music and recently came across Freebase. It looks very compelling, even if the data isn't there for new, up-and-coming bands. For those of you who have worked with Freebase, I have a couple questions:
Are there downsides to integrating all of the data entry with Freebase? We are not looking to sell or privatize this information.
What are the weaknesses of Freebase, with regards to usability?
Disclosure: I work on Freebase at Google.
The music data in Freebase is one of our strongest areas and is going to continue to get broader and richer as we continue to load more datasets. For example, we import data from MusicBrainz, clean it up and match the topics against existing topics in Freebase to avoid duplicates.
In terms of downsides, you should be prepared to work with a lot of data. For example, Freebase currently has 4 musical artists named "John Smith" which may or may not be useful for your application but you'll still need to figure out which one(s) map to the John Smith that your users are interested in. We call this "reconciliation" and its necessary so that your app knows precisely which topics to query the API for.
Since you mentioned music venues I should also point out that while Freebase has a lot of data about places, we don't yet have a geosearch API so you'd need to roll your own if that's something you need.
Since anyone can edit Freebase, you should also consider using as_of_time to protect your site against vandalism.
Freebase is great for developers because you can easily jump in and clean up bad data or add missing topics. However, one area that has always been a challenge is loading large amounts of data from outside of Google. We've built the OpenRefine which allows folks to upload datasets, but these datasets must pass a QA process that takes some time to complete. Its necessary to have these QA processes to maintain the level of quality in Freebase, but it does slow down the process of loading large datasets.
I really hope that you choose to make use of Freebase music data to build your company. I know that there are already a number of music startups happily using our data.

Are there any algorithms for creating playlists that don't require a massive database (or that work with a publicly accessible one)?

Is there any algorithm with which I can automatically create a playlist of songs that well with each other -- similarly to services like iTunes Genius -- that a single developer can actually implement? It should either a) not require any sort of remote database of listening habits etc. or b) require such a database, but work with one that is freely available.
i did this, and i used the last.fm database as described by tomasz. i didn't use "related artist" directly, but instead constructed my own relationship graph by comparing tags associated with different artists (this is not the approach suggested by lcfseth btw - i have quite a large range of music and i wanted to explore "natural" connections that might not be common partners in "normal" playlists; also i wasn't sure how uniform the related artists were).
i also used a local database to cache data from last.fm, because calls to the api are rate limited, and i experimented with using other parts of the api to improve / normalize the information i was reading from mp3 tags.
generating a useful graph of related artists was actually quite hard; largely because some nodes in the graph naturally tend to be more important than others. if you don't "even out" the graph then your playlist will keep returning to the "important" artists.
the final result did work well, in that the selection of music had a good balance between "central theme" and variation. but the implementation is not at all polished, the calculation of the graph can take a long time (many hours), the program takes up a fair amount of memory when running, and it still seems to play elvis costello a little more than expected ;o)
if you are interested, the code is at http://code.google.com/p/uykfe/
the best part of all, from my point of view as a user, is that it can update logitech media server (squeezeserver) playlists in "realtime", adding a new track whenever the list is empty. that works really well in continuing from whatever music you select "by hand". it can also generate one-off playlists, of course, and, finally, by tweaking parameters you can get a kind of "random walk" through your music collection - it will play related tunes but slowly drift from one style to another (in fact, this is really the "default" mode - to get it to stay on a single theme i needed extra logic that biased it towards whatever music it had played earlier).
ps also, the dump of the final graph to gephi was really cool - i had it printed out and it's now pinned to the wall...
pps i also experimented with the musicbrainz database, which in theory sounds like a fantastic resource. but in practice it is over-complex and poorly documented.
I don't know iTunes Genius, but I think last.fm database and API might be useful for you. Every time you see any track it shows you a list of similar tracks, based on other users preferencs. The same information can be obtained using track.getSimilar API method.
The idea behind most of these databases, is to see what other users listens to after they listen to a given song. The accuracy of these statistics depends on the number of users therefor it is probably hard to use this locally. The algorithm itself is not that hard to implement.
The alternative would be to sort song based on genre, singer... which are informations that are usually embedded in the songs but not always. Winamp have this feature, but it won't work for old songs, unless you manually set the informations or use an On-line song database.

Resources