Updating existing products' info from CJ datafeed updates - affiliate

I know variations on this question have been asked, but I think this is specific enough to merit a new question.
When I receive an updated data feed from Commission Junction I dump it all into a database. Then those products become searchable and selectable for use on our site. However, since all of the data used seems to be fluid, how can I update the products I have saved on the site with the new information? How can I match new info to existing products?
I'm hoping that someone has been doing this long enough to be able to say certain fields rarely change and can generally be relied upon, etc. etc.
Thank you!

Related

Synchronize file with table

We receive an Excel file with a list of products from a partner. We need to update the table that contains these products with the data from the file. For example, some product information may have been updated, some new products added, and some products removed.
We currently select all of the partner's products into memory and diff them against the products received from the file. The diff contains the list of new products, products to delete, and products to update. Then we apply the updates, inserts, and deletes.
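A minimal sketch of the in-memory diff described above, assuming each product carries a stable identifier that is used as the dictionary key. The function name `diff_products` and the sample rows are hypothetical and only illustrate the idea:

```python
def diff_products(existing, incoming):
    """Compare two {product_id: row} mappings and classify the changes."""
    # In the file but not in the table: insert.
    to_insert = {pid: row for pid, row in incoming.items() if pid not in existing}
    # In the table but not in the file: delete.
    to_delete = [pid for pid in existing if pid not in incoming]
    # In both, but with different field values: update.
    to_update = {
        pid: row
        for pid, row in incoming.items()
        if pid in existing and existing[pid] != row
    }
    return to_insert, to_update, to_delete


# Rows currently in the table (made-up data).
existing = {
    "A1": {"name": "Kettle", "price": 20},
    "B2": {"name": "Toaster", "price": 35},
    "C3": {"name": "Blender", "price": 50},
}
# Rows parsed from the partner's file (made-up data).
incoming = {
    "A1": {"name": "Kettle", "price": 20},
    "B2": {"name": "Toaster", "price": 30},
    "D4": {"name": "Mixer", "price": 80},
}

to_insert, to_update, to_delete = diff_products(existing, incoming)
```

Both mappings are held in memory at once here, which is exactly the part that stops scaling when the product list grows.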
The approach works, especially when there are not many products, but when there are a lot, memory becomes an issue. Has anyone had a similar task, and what approaches did you use? Any useful advice?
Also, we use a lot of 'home-made' libraries for this. I am wondering whether there are libraries already available for this (googling didn't help me much).
Thanks a lot for any advice!
Best regards, Alex

Viewing a "log" of new (distinct) events in a database

I have an application which has several unrelated tables in its db. I'll explain by using an "auto-updating" version of the SO homepage as an example, so let's say I have the tables "users", "comments" and "questions".
The homepage client side needs to periodically poll the server, and get a log of all the new "events" that have happened. I.e., I'd like to display (somehow) the new questions, comments and users that have been added to SO since the last poll.
One way would be to simply keep a variable on the client side containing the last index of each of my tables, send it to the server, and have the server send me the new users, comments and questions.
The problem is, what happens when I add a new type of information, say, votes. Now I have to store another variable on the client-side, and the server has to poll another table. And so on, for every new type of information I keep.
I'm looking for a solution that helps me avoid this.
Another problem - say I'd like to see all the "events" that have happened since last time, but sorted according to when they took place.
One direction I had is to have a single "events" table, which contains the info about when each event happened. I can then poll only this table, and get a list of all the new events that have happened. The problem is that each event is pretty different (a new comment has different columns than a new upvote, etc.) So I'm not sure how to implement this, or if this is even a good idea.
Does anybody have any ideas how I can solve this? This seems like something that would come up a lot, but I don't really have much experience with databases, unfortunately.
Thanks!
It sounds to me like you're trying to future-proof via database design. While this can be done through something like an EAV (entity-attribute-value) model, I caution against that because the value it adds tends not to be worth the cost.
Instead you should model the database as closely to reality as possible and not how you intend to use it.
Then use SQL to project the data into the shape you need. You can do this with statements that deliver just the metadata you need,
e.g.
Select
    Count(ID), 'Comments' Type
From
    Comments
Where
    lastUpdate > #InputParameter1#
UNION Select
    Count(ID), 'Questions' Type
From
    Questions
Where
    lastUpdate > #InputParameter1#
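To make the count query concrete, here is a small runnable sketch using Python's built-in sqlite3 module. The table layout, the sample rows, and the helper name `new_event_counts` are assumptions made for the demonstration. Timestamps are stored as ISO date strings so that plain string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Comments  (ID INTEGER PRIMARY KEY, lastUpdate TEXT);
    CREATE TABLE Questions (ID INTEGER PRIMARY KEY, lastUpdate TEXT);
    INSERT INTO Comments  (lastUpdate) VALUES ('2010-01-01'), ('2010-01-05'), ('2010-01-09');
    INSERT INTO Questions (lastUpdate) VALUES ('2010-01-02'), ('2010-01-08');
""")


def new_event_counts(conn, since):
    """Return {type: number of rows updated after `since`} in one round trip."""
    query = """
        SELECT COUNT(ID) AS Total, 'Comments' AS Type
        FROM Comments WHERE lastUpdate > :since
        UNION ALL
        SELECT COUNT(ID), 'Questions'
        FROM Questions WHERE lastUpdate > :since
    """
    return {kind: total for total, kind in conn.execute(query, {"since": since})}


counts = new_event_counts(conn, "2010-01-04")
```

Adding a new kind of event (say, votes) then means adding one more SELECT to the union rather than another client-side variable.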
Or (and this doesn't get used often enough) return more than one result set from your database in one go:
Select
    userId,
    CommentText
From
    Comments
Where
    lastUpdate > #InputParameter1#;
Select
    userId,
    Questions,
    Tags
From
    Questions
Where
    lastUpdate > #InputParameter1#
That said, you will still have to write some code if you add new stuff, but it should be limited to updating your SQL, adding new containers for your data, and then writing code to display it to the end users and to validate and store it.
Honestly the idea of adding new stuff requiring some work doesn't seem that awful to me.

Is it highly necessary to record the registration date of new website users?

What are the advantages and disadvantages?
That depends on what your site is, and how you use that information. On StackOverflow, you are awarded a "yearling" badge once a full year elapses from the time you registered. Clearly here that information is necessary.
If I were you, I'd save it. It's a small piece of information that may become useful eventually. It's better to have it and not need it than to need it and not have it. It would be rather difficult to extrapolate an accurate registration date retrospectively if you don't store it to begin with.
Advantage:
You avoid a painful migration if you need it at some point. For a lot of records you cannot reconstruct this date afterwards. You could approximate it with MODIFICATION_DATE, but that is often inaccurate and later than the real registration date (e.g. when the profile can be edited by the user).
Disadvantage:
If you never need this information, you have wasted space (though one more small column shouldn't be a problem). Furthermore, you have a permanently unused field, which can be confusing to new developers ("what is this column for, cannot see where it is used...?")
As mentioned, the registration date is most likely valuable information, so I would add it from the start. When thinking about persistent data and its model, you sometimes have to think further ahead.

historical data modelling literature, methods and techniques

Last year we launched http://tweetMp.org.au - a site dedicated to Australian politics and twitter.
Late last year our politician schema needed to be adjusted because some politicians retired and new politicians came in.
Changing our db required manual (SQL) change, so I was considering implementing a CMS for our admins to make these changes in the future.
There are also many other government/politics sites out there for Australia that manage their own politician data.
I'd like to come up with a centralized way of doing this.
After thinking about it for a while, maybe the best approach is not to model the current view of the politician data and how it relates to the political system, but to model the transactions instead, such that the current view is the projection of all the transactions/changes that happened in the past.
Using this approach, other sites could "subscribe" to changes (à la PubSubHubbub), submit changes, and just integrate these change items into their schemas.
Without this approach, most sites would have to tear down the entire db, and repopulate it, so any associated records would need to be reassociated. Managing data this way is pretty annoying, and severely impedes mashups of this data for the public good.
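As a rough illustration of "the current view is the projection of all the changes", here is a minimal sketch. The change record shape (`date`, `action`, `seat`, `person`), the two action names, and the placeholder seats and people are all assumptions made up for the example, not a proposed schema:

```python
def project_current_view(changes, as_of=None):
    """Replay changes in date order and return the seat -> holder mapping.

    Passing `as_of` (an ISO date string) stops the replay at that date,
    which gives the view at a point in time instead of the latest view.
    """
    current = {}
    for change in sorted(changes, key=lambda c: c["date"]):
        if as_of is not None and change["date"] > as_of:
            break
        if change["action"] == "elected":
            current[change["seat"]] = change["person"]
        elif change["action"] == "retired":
            # Only vacate the seat if this person still holds it.
            if current.get(change["seat"]) == change["person"]:
                del current[change["seat"]]
    return current


changes = [
    {"date": "2007-11-24", "action": "elected", "seat": "Seat A", "person": "Person X"},
    {"date": "2007-11-24", "action": "elected", "seat": "Seat B", "person": "Person Y"},
    {"date": "2009-10-19", "action": "retired", "seat": "Seat A", "person": "Person X"},
    {"date": "2009-12-05", "action": "elected", "seat": "Seat A", "person": "Person Z"},
]

current = project_current_view(changes)
earlier = project_current_view(changes, as_of="2009-11-01")
```

A subscriber that receives only the new change items can apply them to its own copy in the same way, without tearing down and repopulating its database.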
I've noticed some things work this way - source version control, banking records, stackoverflow points system and many other examples.
Of course, the immediate challenges and design issues with this approach include:
Is the current view cached and repersisted? How often is it updated?
What base entities must exist that never change?
Probably heaps more I can't think of right now...
Is there any notable literature on this subject that anyone could recommend?
Also, any patterns or practices for data modelling like this that could be useful?
Any help is greatly appreciated.
-CV
This is a fairly common problem in data modelling. Basically it comes down to this:
Are you interested in the view now, the view at a point in time, or both?
For example, if you have a service that models subscriptions you need to know:
What services someone had at a point in time: this is needed to work out how much to charge, to see a history of the account and so forth; and
What services someone has now: what can they access on the Website?
The starting point for this kind of problem is to have a history table, such as:
Service history: id, userid, serviceid, start_date, end_date
Chain together the service histories for a user and you have their history. So how do you model what they have now? The easiest (and most denormalized view) is to say the last record or the record with a NULL end date or a present or future end date is what they have now.
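A small sketch of that history table using Python's built-in sqlite3 module. The columns follow the list above; the table name `service_history`, the helper `services_at`, and the sample rows are assumptions for the demonstration. "What they have now" is simply the point-in-time query evaluated at today's date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE service_history (
        id INTEGER PRIMARY KEY,
        userid INTEGER,
        serviceid INTEGER,
        start_date TEXT,
        end_date TEXT          -- NULL means the service is still active
    );
    INSERT INTO service_history (userid, serviceid, start_date, end_date) VALUES
        (1, 10, '2009-01-01', '2009-06-30'),
        (1, 20, '2009-07-01', NULL),
        (1, 30, '2009-03-01', '2009-09-30');
""")


def services_at(conn, userid, on_date):
    """Return the service ids a user held on `on_date` (ISO date string)."""
    rows = conn.execute(
        """
        SELECT serviceid FROM service_history
        WHERE userid = ?
          AND start_date <= ?
          AND (end_date IS NULL OR end_date >= ?)
        ORDER BY serviceid
        """,
        (userid, on_date, on_date),
    )
    return [serviceid for (serviceid,) in rows]


in_april_2009 = services_at(conn, 1, "2009-04-15")
in_january_2010 = services_at(conn, 1, "2010-01-01")
```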
As you can imagine, this can lead to some gnarly SQL, so the model is often selectively denormalized: you have a Services table and another table for history. Each time Services is changed, a history record is created or updated. This kind of approach makes the history table more of an audit table (another term you'll see bandied about).
This is analogous to your problem. You need to know:
Who is the current MP for each seat in the House of Representatives;
Who is the current Senator for each seat;
Who is the current Minister for each department;
Who is the Prime Minister.
But you also need to know who was each of those things at a point in time so you need a history for all those things.
So if, on the 20th August 2003, Peter Costello made a press release, you would need to know that at this time he was:
The Member for Higgins;
The Treasurer; and
The Deputy Prime Minister.
because conceivably someone could be interested in finding all press releases by Peter Costello or by the Treasurer, which will lead to the same press release but will be impossible to trace without the history.
Additionally you might need to know which seats are in which states, possibly the geographical boundaries and so on.
None of this should require a schema change as the schema should be able to handle it.

Creating a Notifications type feed in GAE Objectify

I'm working on a notification feed for my mobile app and am looking for some help on an issue.
The app is a Twitter/Facebook like app where users can post statuses and other users can like, comment, or subscribe to them.
One thing I want in my app is a notifications feed where users can see who liked or commented on their post, or subscribed to them.
The first part of this system I have figured out: when a user likes/comments/subscribes, a Notification entity is written to the datastore with details about the event. To show a user's Notifications, all I have to do is query for all Notifications for that user, sort by date created desc, and we have a nice little feed of actions other users took on a specific user's account.
The issue I have is what to do when someone unlikes a post, unsubscribes or deletes a comment. Currently, if I were to query for that specific notification, it is possible that nothing would be returned from the datastore because of eventual consistency. We could imagine someone liking, then immediately unliking a post (b/c who hasn't done that? =P). The query to find that Notification might return null, and nothing would get deleted when calling ofy().delete().entity(notification).now(). Now the user has a notification in their feed saying Sally liked his post when in reality she liked then quickly unliked it!
A wrench in this whole system is that I cannot delete by Key<Notification>, because I don't really have a way to know the id of the Notification when trying to delete it.
A potential solution I am experimenting with is to not delete any Notifications. Instead I would always write Notifications and simply indicate whether the notification was positive or negative. Then in my query to display notifications to a specific user, I could somehow only display the sum-positive Notifications. This would save some money on datastore too because deleting entities is expensive.
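One way the "sum-positive" filtering could look once the notifications are loaded, as a sketch only. It is plain Python over dictionaries rather than Objectify code, and the field names (`actor`, `action`, `target`, `positive`, `created`) are hypothetical:

```python
from collections import defaultdict


def net_positive(notifications):
    """Sum +1/-1 per (actor, action, target) and keep the positive totals."""
    totals = defaultdict(int)
    latest = {}
    for n in notifications:
        key = (n["actor"], n["action"], n["target"])
        totals[key] += 1 if n["positive"] else -1
        # Remember the most recent timestamp seen for this key.
        latest[key] = max(latest.get(key, n["created"]), n["created"])
    visible = [key for key, total in totals.items() if total > 0]
    # Newest first, matching the "date created desc" feed order.
    return sorted(visible, key=lambda key: latest[key], reverse=True)


notifications = [
    {"actor": "sally", "action": "like", "target": "post-1", "positive": True, "created": 1},
    {"actor": "sally", "action": "like", "target": "post-1", "positive": False, "created": 2},
    {"actor": "bob", "action": "comment", "target": "post-1", "positive": True, "created": 3},
    {"actor": "ann", "action": "subscribe", "target": "user-7", "positive": True, "created": 4},
]

feed = net_positive(notifications)
```

Sally's like and unlike cancel out, so only the comment and the subscription remain in the feed.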
There are three main ways I've solved this problem before:
1. Deterministic key
For example:
{user-id}-{post-id}-{liked-by} for likes
{user-id}-{post-id}-{comment-by}-{comment-index} for comments
This will work for most basic use cases for the problem you defined, but you'll have some hairy edge cases to figure out (like managing indexes of comments as they get edited and deleted). This allows get and delete by key.
2. Parallel data structures
The idea here is to create more than one entity at a time in a transaction, but to make sure they have related keys. For example, when someone comments on a feed item, create a Comment entity, then create a CommentedOn entity which has the same ID, but give it a parent key of the commenting user.
Then, you can make a strongly consistent query for the CommentedOn, and use the same ID to do a get by key on the Comment. You can also just store a key, rather than having matching IDs, if that's too hard. In practice, having matching IDs was easier each time I did this.
The main limitation of this approach is that you're effectively creating an index yourself out of entities, and while this can give you strongly consistent queries where you need them, the throughput limitations of transactional writes can become harder to understand. You also need to manage state changes (like deletes) carefully.
3. State flags on entities
Assuming the Notification object just shows the user that something happened but links to another entity for the actual data, you could store a state flag (deleted, hidden, private etc.) on that entity. Then listing your notifications would be a matter of loading the entities server side and filtering in code (or possibly with subsequent filtered queries).
At the end of the day, the complexity of the solution should mirror the complexity of the problem. I would start with the state-flag approach, then migrate to parallel data structures when the fuller set of requirements is understood. The latter is more robust and flexible, but the complexity of XG (cross-group) transaction limitations will rear its head. Ultimately, a distributed feed like this is a hard problem.
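To illustrate the deterministic key idea, here is a minimal sketch in plain Python. A dictionary stands in for the datastore, and the helper names `like_key` and `comment_key` are hypothetical. It also assumes the ids themselves never contain the "-" separator, otherwise two different combinations could produce the same key.

```python
def like_key(user_id, post_id, liked_by):
    """Key for the notification created when `liked_by` likes a post."""
    return f"{user_id}-{post_id}-{liked_by}"


def comment_key(user_id, post_id, comment_by, comment_index):
    """Key for the notification created by a specific comment."""
    return f"{user_id}-{post_id}-{comment_by}-{comment_index}"


# A dict stands in for the datastore: get and delete by key need no query.
store = {}

# Sally likes Tom's post 42: write the notification under its derived key.
store[like_key("tom", 42, "sally")] = {"text": "Sally liked your post"}
had_notification = like_key("tom", 42, "sally") in store

# Sally unlikes it: rebuild the same key and delete directly.
store.pop(like_key("tom", 42, "sally"), None)
```

Because the key can be rebuilt from data already known at unlike time, the delete does not depend on an eventually consistent query finding the entity first.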
What I ended up doing, and what worked for my specific model, was that before creating a Notification entity I would first allocate an ID for it:
// Allocate an ID for a Notification
final Key<Notification> notificationKey = factory().allocateId(Notification.class);
final Long notificationId = notificationKey.getId();
Then when creating my Like or Follow Entity, I would set the property Like.notificationId = notificationId; or Follow.notificationId = notificationId;
Then I would save both Entities.
Later, when I want to delete the Like or Follow, I can do so and at the same time get the ID of the Notification, load the Notification by key (a get by key is strongly consistent), and delete it too.
Just another approach that may help someone =D

Resources