I am working on a social networking project with Cassandra. Users can subscribe to a profile and have access to the list of people who have subscribed to that same profile. My goal is to retrieve, from a table called user_follows, the list of people subscribed to a profile.
CREATE TABLE users_follows (to_id text, from_id text, followed_at timestamp, PRIMARY KEY(to_id, from_id))
The problem is that some profiles can have thousands of subscribers and I don't want to fetch them all at once. That's why I'd like to get the list in increments of 20, depending on how far down the user scrolls. My problem is that I can't see how to retrieve the next parts of the list after the first select, because Cassandra always returns the same users.
SELECT * FROM users_follows where to_id = 'xxxxx'
A possible solution was to sort by a timestamp, but if I want to retrieve the list of people a user is subscribed to (the reverse query), that would not work. One solution would be to use materialized views, but I'm not sure that would be very efficient given the size of the table. Another would be to create two different tables, one user_follows and another user_followers, but I'm not sure that approach is recommended....
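For reference, here is a minimal sketch of what the incremental fetching could look like with the DataStax Java driver (assuming driver 3.x; the keyspace, contact point, and profile id are placeholders, the table is the one above):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PagingState;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class FollowersPager {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {

            Statement stmt = new SimpleStatement(
                    "SELECT from_id FROM users_follows WHERE to_id = ?", "xxxxx");
            stmt.setFetchSize(20);                              // page size of 20 rows

            // First page: no paging state yet.
            ResultSet rs = session.execute(stmt);
            PagingState pagingState = rs.getExecutionInfo().getPagingState();  // null when there is no next page
            int remaining = rs.getAvailableWithoutFetching();
            for (Row row : rs) {
                System.out.println(row.getString("from_id"));
                if (--remaining == 0) break;                    // stop at the page boundary
            }

            // Later request: resume from the saved paging state
            // (e.g. serialized and sent back by the client as a string).
            if (pagingState != null) {
                stmt.setPagingState(PagingState.fromString(pagingState.toString()));
                ResultSet nextPage = session.execute(stmt);
                // iterate the next 20 rows the same way
            }
        }
    }
}

The same idea would work for the reverse query if you keep a second table keyed the other way around (from_id first).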
Our product uses Google Datastore as the application database. Most of the entities use IDs of type Long and some use IDs of type String. I noticed that the IDs of type Long are not in consecutive order.
Now we are exporting some big tables, with around 30-40 million entries, to JSON files for business purposes. Initially we expected that a simple query like "ofy().load().type(ENTITY.class).startAt(cursor).limit(BATCH_LIMIT).iterator()" would let us iterate through the entire content of that specific table, starting from the first entry and ending with the most recently created one. We are working in batches and storing the cursor after every batch, so that the next task can load the batch and resume, roughly as sketched below.
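A simplified sketch of that batch loop, assuming Objectify 5.x (the entity class, batch limit, and the cursor/JSON helpers are placeholders for our own code):

import static com.googlecode.objectify.ObjectifyService.ofy;

import com.google.appengine.api.datastore.Cursor;
import com.google.appengine.api.datastore.QueryResultIterator;
import com.googlecode.objectify.cmd.Query;

public class ExportTask {
    private static final int BATCH_LIMIT = 500;          // placeholder batch size

    // Processes one batch and returns the cursor to persist for the next task.
    public static Cursor exportBatch(Cursor savedCursor) {
        Query<MyEntity> query = ofy().load().type(MyEntity.class).limit(BATCH_LIMIT);
        if (savedCursor != null) {
            query = query.startAt(savedCursor);           // resume where the previous batch stopped
        }

        QueryResultIterator<MyEntity> it = query.iterator();
        while (it.hasNext()) {
            MyEntity entity = it.next();
            // writeToJson(entity);                       // hypothetical export step
        }
        return it.getCursor();                            // stored so the next task can resume
    }
}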
But after noticing that an entity created a few minutes ago can have an ID smaller than the ID of another entity created a week ago, we are wondering whether we should consider a content freeze during this export period. On one hand it's critical to make a complete export and not miss older data up to a specific date; on the other hand a content freeze longer than 1 day is a problem for our customers.
What do you advise us to do?
Thanks,
Cristian.
I do not think you need to worry about the uniqueness of your IDs. Datastore is built on top of Bigtable, using six tables:
the first table stores entities
the second table stores entities by kind
the third table stores indexes for property values in ascending order
the fourth table stores indexes for property values in descending order
the fifth table stores indexes for multiple properties together
the sixth table keeps track of the next unique ID for each kind
The key format is something like this:
[application ID]-[namespace]-[Kind]-[ID]
This guarantees the uniqueness of each entity.
Yes, the key format in that table is [Application ID]-[Kind Name] and the value is the next ID. Let's say you have a kind products; that table will look like this: |key(yourapp-products), Next ID(3)|. Now when you create a new entity of kind products, it will be assigned ID 3 and the row in that table will get a new value: |key(yourapp-products), Next ID(4)|. Also note that the table has only one row here, since we have only one kind, products.
Do you specify the IDs yourself or let Datastore generate them itself? It sounds like you have a "pre-allocating IDs" issue. Just speculating, but for every batch you could do something like Kind.allocate_ids(size=blah); that way you can keep the sequence.
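For example, with the low-level Java datastore API (the kind name "Product" and the block size of 100 are just placeholders; whether this fits your pipeline is an assumption):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyRange;

public class IdPreallocation {
    // Reserve a contiguous block of 100 IDs for the (hypothetical) kind "Product".
    public static void allocateBlock() {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        KeyRange range = ds.allocateIds("Product", 100);
        for (Key key : range) {
            long id = key.getId();    // reserved IDs, handed out in increasing order
            // assign this id to the corresponding entity before putting it
        }
    }
}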
I was wondering how to set up a database for storing actions people have recently done while travelling. For example, if someone goes to a museum, the database will store the text "Bob went to this museum" along with the user ID and a timestamp. I was wondering whether these events should be stored in just one table, so that if I want the events of a single person I can just query that table with a user ID.
On a similar note, I want to store the 50 users each user has "recently met", meaning the last 50 users they have been around in their travels. I was thinking this could be stored in one table as well, as pairs of user IDs with no duplicates. I'm just afraid the table might get too big.
Any suggestions on table set up?
Thanks
Personally I would go with an ER structure like this:
I am working on a system that will run on GAE and will have several related entities, and I am not sure of the best way to store the data. This post is a request for advice from others who may have similar experience....
The system will have users, with profile data and an image. Those users will be able to create "events" and add journal entries to them. For the purposes of the system, the "events" will likely have 1 or 2 journal entries in them, and anything over 10 would likely never happen. Other users will be able to add comments to users' entries as well, where popular ones may have hundreds or even thousands of comments. When a random visitor uses the system, they should be able to see the latest events (latest being defined as those with the latest journal entries in them), search by tag, and perform a very basic text search. Upon selecting an event to view, it should be displayed with all journal entries and all user comments, with user images alongside the comments. A user should also have a kind of self-admin page, to view/modify/delete their events and to view/modify/delete comments they have made on other events. Doing all this on a normal RDBMS would just be queries with some big joins across several tables. On GAE it would obviously need to work differently. Here are my initial thoughts on the design of the entities:
Event entity - id, name, timestamp, list property of tags, view count, creator's username, creator's profile image id, number of journal entries it contains, number of total comments it contains, timestamp of last update to contained journal entries, list property of index words for search (built/updated from text from contained journal entries)
JournalEntry entity - timestamp, journal text, name of event, creator's username, creator's profile image id, list property of comments (containing commenter username and image id)
User entity - username, password hash, email, list property of subscribed events, timestamp of create date, image id, number of comments posted, number of events created, number of journal entries created, timestamp of last journal activity
UserComment entity - username, id of event commented on, title of event commented on
TagData entity - tag name, count of events with tag on them
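For concreteness, here is a rough sketch of what the Event entity above could look like on the Java runtime with Objectify (that runtime choice is an assumption; the field names simply mirror the list above):

import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;
import com.googlecode.objectify.annotation.Index;

import java.util.Date;
import java.util.List;

@Entity
public class Event {
    @Id Long id;                       // the datastore can allocate this
    String name;
    Date created;
    @Index List<String> tags;          // list property of tags
    long viewCount;
    String creatorUsername;
    Long creatorImageId;
    int journalEntryCount;
    int totalCommentCount;
    @Index Date lastJournalUpdate;     // used to order the "latest events" page
    @Index List<String> searchWords;   // index words built from journal entry text
}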
So, I'd like to hear what people here think about the design and what changes should be made to help it scale well. Thanks!
Rather than store Event.id as a property, use the id automatically embedded in each entity's key, or set unique key names on entities as you create them.
You have lots of options for modeling the relationship between Event and JournalEntry: you could use a ReferenceProperty, you could parent JournalEntries to Events and retrieve them with ancestor queries, or you could store a list of JournalEntry key ids or names on Event and retrieve them in batch with a key query. Try some things out with realistically-distributed dummy data, and use appstats to see what works best.
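For example, the ancestor-query option could look roughly like this with the low-level datastore API (just a sketch; the kind and property names are made up):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Query;

public class JournalEntries {
    // Create a JournalEntry as a child of its Event, then fetch all entries with an ancestor query.
    public static void example(Key eventKey) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

        Entity entry = new Entity("JournalEntry", eventKey);   // eventKey is the parent Event's key
        entry.setProperty("text", "first entry");
        ds.put(entry);

        Query q = new Query("JournalEntry").setAncestor(eventKey);
        for (Entity e : ds.prepare(q).asIterable()) {
            // render e alongside its event
        }
    }
}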
UserComment references an Event, while JournalEntry references a list of UserComments, which is a little confusing. Is there a relationship between UserComment and JournalEntry, or just between UserComment and Event?
Persisting so many counts is expensive. When I post a comment, you're going to write a new UserComment entity and also update my User entity and a JournalEntry entity and an Event entity. The number of UserComments you expect per Event makes it unwise to include everything in the same entity group, which means you can't do these writes transactionally, so you'll do them serially, and the entities might be stored across different network nodes, making the whole operation slow; and you'll also be open to consistency problems. Can you do without some of these counts and consider storing others in memcache?
When you fetch an Event from the datastore, you don't actually care about its list of search index words, and retrieving and deserializing them from protocol buffers has a cost. You can get around this by splitting each Event's search index words into a separate child EventIndex entity. Then you can query EventIndex on your search term, fetch just the EventIndex keys for EventIndexes that match your search, derive the corresponding Events' keys with key.parent(), and fetch the Events by key, never paying for the retrieval or deserialization of your search index word lists. Brett Slatkin explains this strategy here at 14:35.
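A sketch of that pattern with the low-level datastore API (the kind and property names here are assumptions): EventIndex entities are children of their Event and carry only the list of search words.

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.FilterPredicate;

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class EventSearch {
    // Returns the Events matching a search term without ever loading their word lists.
    public static Collection<Entity> search(String term) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

        // Keys-only query against the child EventIndex entities.
        Query q = new Query("EventIndex")
                .setKeysOnly()
                .setFilter(new FilterPredicate("words", FilterOperator.EQUAL, term));

        List<Key> eventKeys = new ArrayList<Key>();
        for (Entity indexEntity : ds.prepare(q).asIterable()) {
            eventKeys.add(indexEntity.getKey().getParent());   // key.parent() is the Event
        }

        // Batch-get the Events by key; the search word lists are never deserialized.
        return ds.get(eventKeys).values();
    }
}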
Updating Event.viewCount will fail if you have a lot of views for any Event in rapid succession, so you should try out counter sharding.
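A minimal sharded-counter sketch for Event.viewCount, using the low-level API (the shard kind, shard count, and key naming are assumptions, not an established API):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Transaction;

import java.util.Random;

public class ViewCounter {
    private static final int NUM_SHARDS = 20;
    private static final Random RANDOM = new Random();

    // Each view touches one randomly chosen shard, so concurrent views rarely contend.
    public static void increment(long eventId) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Key shardKey = KeyFactory.createKey("EventViewShard",
                eventId + "_" + RANDOM.nextInt(NUM_SHARDS));
        Transaction txn = ds.beginTransaction();
        try {
            Entity shard;
            try {
                shard = ds.get(txn, shardKey);
            } catch (EntityNotFoundException e) {
                shard = new Entity(shardKey);
                shard.setProperty("count", 0L);
            }
            shard.setProperty("count", (Long) shard.getProperty("count") + 1);
            ds.put(txn, shard);
            txn.commit();
        } finally {
            if (txn.isActive()) {
                txn.rollback();
            }
        }
    }

    // Reading the total means summing the shards (and possibly caching the sum in memcache).
    public static long total(long eventId) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        long sum = 0;
        for (int i = 0; i < NUM_SHARDS; i++) {
            try {
                Entity shard = ds.get(KeyFactory.createKey("EventViewShard", eventId + "_" + i));
                sum += (Long) shard.getProperty("count");
            } catch (EntityNotFoundException e) {
                // shard not created yet; counts as zero
            }
        }
        return sum;
    }
}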
Good luck, and tell us what you learn by trying stuff out.
I am starting work on a project which has multiple websites for a client. For analytics purposes I need to measure metrics across websites over a period of time. This means I need to centralize the data model and all the possible questions/answers/lookup values so it can be used across websites. The question I have is:
Example: the age range of a user visiting website 1 is, say, 30-39 years old (we ask for the age range when they enter). So in the data model I have a lookup table for answers which has all possible answers used across all websites. So (30-39) has a PK ID of, say, 102. Now on website 2, same thing, so again (30-39) has a PK of 102. This way I can measure the same age range across websites. But the problem is where or how to store the user's answer and map it to this ID?
If I have a table called, say, UserAnswers, it has an AgeRange column. Do I make this a FK to the Answer table at PK 102 to store (30-39) for the user? If yes, then what value gets written in the UserAnswer table, would it be 102?
Secondly, I need to measure text fields too, like how many are completed across websites. Say there is an "email address" field, and I give this text field a field ID of 10. Again, when I write the consumer's email of, say, xyz#abc.com in the 'email' column of the answers table, how will I link this to field ID 10?
I'd probably make an API for each website that dumps the data I want in a consistent format, which I could then load into a separate database where I would do the metrics.
This becomes unfeasible if the data is so large that it takes hours every night to migrate it.
But otherwise it's a solution that has the benefit that you don't have to touch the current schemas.
If the sites don't exist already, then I'd just force a consistent data model on all sites, which is trivial. It doesn't matter where the mapping between "30-39" and 102 is defined; the only thing that matters is that it's the same everywhere, which you set up when you create the databases.
If you want the values and schemas to change a lot, then using just one database for all sites would probably be better, but if you can't do this, then make an export for each site.