Lifetime Value Model (LTV)... where do I even start?

I was recently picked to lead a longitudinal LTV model for our analytics department. The final deliverable is for external stakeholders, so essentially how the users on our platform (can't specify the company) provide lifetime value to our external partners.
We'll be building this model from the ground up. We have nothing in place for this currently, just a sea of data (assume very generic assets, e.g. users, sign-ups, user interaction with the platform, etc.)
So... where do I even start? I've just been reading random docs on Google for the time being. Any specific resources that are good? Are there different LTV methodologies? What's the "best" one (please take that with a grain of salt)?
I know this is an extremely broad topic so any answers even loosely related to LTV will hold significant value. Thanks all
I haven't tried anything yet. Just reading up on a few resources.

The first thing you want to do is lay out the reasoning for having LTV: what it's gonna be used for and by whom. I'll give some examples, but you'll have to tailor it to your industry and your business.
Next, you have a series of meetings with all the stakeholders so that they agree on a good definition for LTV, under the tight guidance of someone who understands the data, or at least which dimensions have to influence it and what format it has to be in.
An example would be: you have an app that offers seven products. The first two products are freebies. Another requires an email to get. The fourth product is just one buck per month, the fifth costs a hundred but is a one-time payment, the sixth is $20/month, and the final product is an enterprise/B2B-level solution.
An arbitrary model would be to have something like:
No products (guests) => LTV = 0
Product 1 => LTV + 1
Product 2 => LTV + 1
Product 3 => LTV + 3
Product 4 => LTV + 10/month of subscription
Product 5 => LTV + 1000
Product 6 => LTV + 200/month of subscription
Product 7 => LTV + 10k/month of subscription
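A minimal sketch of that additive scoring in Python, assuming a hypothetical per-user list of (product_id, months_subscribed) events; the product ids and weights mirror the example above and everything else is made up for illustration:

```python
# Weights mirror the arbitrary model above; the event format is an assumption.
ONE_TIME_WEIGHTS = {1: 1, 2: 1, 3: 3, 5: 1000}   # flat score per product
MONTHLY_WEIGHTS = {4: 10, 6: 200, 7: 10_000}     # score per month subscribed

def ltv_score(events):
    """events: iterable of (product_id, months_subscribed) tuples for one user.
    months_subscribed is ignored for one-time products."""
    score = 0
    for product_id, months in events:
        if product_id in ONE_TIME_WEIGHTS:
            score += ONE_TIME_WEIGHTS[product_id]
        elif product_id in MONTHLY_WEIGHTS:
            score += MONTHLY_WEIGHTS[product_id] * months
    return score  # guests with no events naturally get 0

print(ltv_score([(3, 0), (6, 4)]))  # email product + 4 months of product 6 -> 803
```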
Then the LTV stakeholders, mainly business owners and PMs, refine the model depending on what kinds of analysis they typically need conducted. That basically depends on what and how they report to their executives or the board.
This is if you want to go with a simple integer as the LTV, most commonly used for weighting users. Going with an integer is a very comfortable starting point since it allows for easy mathematical aggregations, just to make your user-based analysis more robust. Say you found out that 2% of your users encounter a certain issue that blocks them from navigating somewhere or finishing a process. How should it be prioritized? Should it just be ignored? Should it be addressed immediately?
Well, that depends on who those users are. If they're just free users or even guests, and the error is not blocking them from product onboarding, then it's worth getting the ticket into the backlog, but realistically it won't get released any time soon, if ever.
However, if those users are enterprise customers, then the issue not only has to be hotfixed, it has to be hotfixed immediately, probably paying overtime to the devs, QA, and devops to work late into the night.
Generally, LTV should be a user-level dimension. There are session-level implementations of it, but they're way more difficult.
From the technical standpoint, LTV is most commonly implemented at the tracking stage, typically in a TMS (say, GTM) by a tracking specialist.
Another way it's implemented is in or after ETL, by the data engineers or data scientists.
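For the in-or-after-ETL route, a rough sketch of what the rollup could look like with pandas; the table and column names here are assumptions, not a prescribed schema:

```python
import pandas as pd

# Hypothetical post-ETL events table; column names are assumptions.
events = pd.DataFrame({
    "user_id":   [1, 1, 2, 3, 3, 3],
    "ltv_delta": [1, 10, 0, 3, 200, 200],
})

# Roll events up to a user-level LTV dimension, ready to join onto a users table.
user_ltv = (events.groupby("user_id", as_index=False)["ltv_delta"]
                  .sum()
                  .rename(columns={"ltv_delta": "ltv"}))
print(user_ltv)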

Related

Customer Deduplication in Booking Application

We have a booking system where tens of thousands of reservations are made every day. Because a customer can create a reservation without being logged in, a new customer id/row is created for every reservation, even if the very same customer has already reserved in the system before. That results in a lot of customer duplicates.
The engineering team has decided that, in order to deduplicate the customers, they will run a nightly script, every day, which checks for these duplicates based on some business rules (email, address, etc.). The deduplication logic is then:
If a new reservation is created, check if the (newly created) customer for this reservation already has an old customer id (by comparing email and other aspects).
If it has one or more old reservations, detach those reservations from the old customer id and link them to the new customer id. Literally by changing the customer ID of the old reservations to the newly created customer.
I don't have a very strong technical background, but to me this smells like terrible design. As we have several operational applications relying on that data, this creates a massive sync issue. Besides that, I was hoping to understand why exactly, in terms of application architecture, this is bad design, and what a better solution would be for this deduplication problem (if it even has to be solved in "this" application domain).
I would very much appreciate any help so I can steer the engineering team in the right direction.
In General
What's the problem you're trying to solve? Freeing up disk space, getting accurate analytics of user behavior, or being more user-friendly?
It feels a bit risky, and depends on how critical it is that you get the re-matching 100% correct. You need to ask "what's the worst that can happen?" and "does this open the system to abuse" - not because you should be paranoid, but because to not think that through feels a bit negligent. E.g. if you were a govt department matching private citizen records then that approach would be way too cavalier.
If the worst that can happen is not so bad, and the 80% you get right gets you the outcome you need, then maybe it's ok.
If there's not a process for validating the identity of the user then by definition your customer id/row is storing sessions, not Customers.
In terms of the nightly job - if your backend is an old legacy system then I can appreciate why a nightly batch job might be the easiest option; that said, if done correctly and with the right architecture, you should be able to do that check on the fly as needed.
Specifics
...check if the (newly created) customer for this reservation has already an old customer id (by comparing email...
Are you validating the email - e.g. by getting users to confirm it through a confirmation email mechanism? If yes, and if email is a mandatory field, then this feels ok, and you could probably use the email exclusively.
... and other aspects.
What are those? Sometimes getting more data just makes it harder unless there's good data hygiene in place. E.g. what happens if you're checking phone numbers (and other data) and someone makes a typo in the phone number which matches some other customer's - so you simultaneously match with more than one customer?
If it has one or more old reservations, detach that reservation from the old customer id, and link it to a new customer id. Literally by changing the customer ID of that old reservation to the newly created customer.
Feels dangerous. What happens if the detaching process screws up? I've seen situations where instead of updating the delta, the system did a total purge then a full re-import... when the second part fails the entire system is blank. It's not your exact situation, but you are creating the possibility for similar types of issues.
As we have several operational applications relying on that data, this creates a massive sync issue.
...case in point.
In your case, doing the swap in a transaction would be wise. You may want to consider tracking all Cust ID swaps so that you can revert if something goes wrong.
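A sketch of what a transactional swap plus an audit trail could look like, using sqlite3 for illustration; the table and column names (reservations, customer_id_swaps) are hypothetical:

```python
import sqlite3

def reassign_reservations(conn, old_customer_id, new_customer_id):
    """Move reservations to the surviving customer id inside one transaction,
    recording every swap so it can be reverted. Schema names are hypothetical."""
    with conn:  # commits on success, rolls back on any exception
        rows = conn.execute(
            "SELECT id FROM reservations WHERE customer_id = ?",
            (old_customer_id,),
        ).fetchall()
        for (reservation_id,) in rows:
            conn.execute(
                "INSERT INTO customer_id_swaps "
                "(reservation_id, old_customer_id, new_customer_id, swapped_at) "
                "VALUES (?, ?, ?, datetime('now'))",
                (reservation_id, old_customer_id, new_customer_id),
            )
        conn.execute(
            "UPDATE reservations SET customer_id = ? WHERE customer_id = ?",
            (new_customer_id, old_customer_id),
        )
```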
Option - Phased Introduction Based on Testing
You could try this:
Keep the system as-is for now.
Add the logic which does the checks you are proposing, but have it create trial data on the side - i.e. don't change the real records, just make a copy of what the new data would be. Do this in production - you'll get a much better sample of data.
Run extensive tests over the trial data, looking for instances where you got it wrong. What's more likely, and what you could consider building, is a "scoring" algorithm. If you are checking more than one piece of data then you'll get different combinations with different likelihoods of accuracy. You can use this to gauge how good your matching is. You can then decide in which circumstances it's safe to do the ID switch and when it's not (see the sketch after this list).
Once you're happy, implement as you see fit - either just the algorithm & result, or the scoring harness as well so you can observe its performance over time - especially if you introduce changes.
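A toy version of the scoring idea above; the fields, weights, and threshold are made up for illustration and would need calibrating against the trial data:

```python
def match_score(a, b):
    """Score how likely records a and b are the same customer.
    Fields and weights are made up; tune them against the trial data."""
    score = 0.0
    if a["email"] and b["email"] and a["email"].lower() == b["email"].lower():
        score += 0.7   # a validated email is the strongest signal
    if a["phone"] and a["phone"] == b["phone"]:
        score += 0.2
    if a["postcode"] == b["postcode"]:
        score += 0.1
    return score

MATCH_THRESHOLD = 0.8  # only swap ids above this; log the grey zone for review
```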
Alternative Customer/Session Approach
Treat all bookings (excluding personal details) as bookings, with customers (little c, i.e. Sessions) but without Customers.
Allow users to optionally be validated as "Customers" (big C).
Bookings created by a validated Customer then link to each other. All bookings relate to a customer (session) which never changes, so you have traceability.
I can tweak the answer once I know more about what problem it is you are trying to solve - i.e. what your motivations are.
I wouldn't say that's a terrible design; it's just a simple approach to solving this particular problem, with some room for improvement. It's not optimal because the runtime of that job depends on the number of new bookings received during the day, which may vary from day to day, so other workflows that depend on it will be impacted.
This approach can be improved by processing new bookings in parallel, and using an index to get a fast lookup when checking if a new e-mail already exists or not.
You can also check out Bloom filters - an efficient data structure that can tell you with certainty when an element is not in a given set.
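To make that concrete, here is a from-scratch toy Bloom filter (a sketch, not a production implementation): it answers "definitely not seen" with certainty and "possibly seen" with a tunable false-positive rate, so you only run the expensive duplicate check for the "possibly" cases:

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1_000_000, hashes=5):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
seen.add("alice@example.com")
# False means the email is definitely new: skip the expensive duplicate check.
print(seen.might_contain("bob@example.com"))
```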
The way I would do it is to store the bookings in a NoSQL DB table keyed off the user email. You get the user email in both situations - when the user has an account and when they make a booking without one - so you just have to do a lookup to get the bookings by email, which makes the deduplication job redundant.

Infinite scrolling for personalized recommendations

I've recently been researching solutions that would allow me to display a personalized ranking of products in an online e-commerce store.
A natural solution for this problem would be to use a managed ML service such as AWS Personalize.
Based on my understanding it can be implemented in terms of 2 service calls:
Recommendations - return up to ~500 products based on the user's profile
Ranking - based on product ids (up to ~500), reorder the list according to the user's profile
I was wondering if there exists an implementation / indexing strategy that would allow displaying the entire product catalog (let's assume 10k products)?
I can imagine an implementation that would:
Return 50 products per page
Call recommendations to grab the first 500 products
Alternatively, pick top 500 products platform-wise and rerank them according to the user.
For the remaining results, i.e. pages 11 to N, a database query would be executed, excluding those 500 products by id. The recommended ordering wouldn't be as relevant anymore, as the top recommendations have already been listed and the user is less likely to encounter relevant results on the 11th page. As a downside, such a query would need a relatively large array to be included as part of the query.
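A rough sketch of that two-phase paging, assuming a hypothetical recommender client and database accessor; the methods recommend() and query_products() are made up for illustration, not a real API:

```python
PAGE_SIZE = 50
RECO_LIMIT = 500  # the ~500-product cap mentioned above

def get_page(user_id, page, recommender, db):
    """Pages 1-10 come from the personalized list; later pages fall back to a
    plain catalog query that excludes the already-shown ids. `recommender`
    and `db` are hypothetical clients."""
    reco_ids = recommender.recommend(user_id, limit=RECO_LIMIT)
    start = (page - 1) * PAGE_SIZE
    # Assumes the recommender returns a multiple of PAGE_SIZE; a real
    # implementation would handle a ragged last recommended page.
    if start + PAGE_SIZE <= len(reco_ids):
        return reco_ids[start:start + PAGE_SIZE]
    # Beyond the recommendation window: default ordering, skipping the ids
    # already served; this is the relatively large exclusion array noted above.
    offset = start - len(reco_ids)
    return db.query_products(exclude_ids=reco_ids, offset=offset,
                             limit=PAGE_SIZE)
```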
Is there a better solution for this? I've seen many ecommerce platforms offering a "Recommended" sort option for their product listing that allows infinite scrolling. Would that mean that such a store is typically using predefined ranks, i.e. an arbitrary rank assigned to each product by the content manager that is exactly the same for every user on the platform?
I don't think I've ever seen an ecommerce site that shows me 10K products without any differentiation. Most ecommerce sites use a process called "merchandising" to decide which product to show, to which customer, in which position/treatment, at which time, and in which context.
Personalized recommendations may be part of that process, but they are generally only a part. For instance, if you're browsing "Books about architecture", the fact the recommendation engine thinks you're really interested in CDs by Duran Duran is not super useful. And even within "books about architecture", there may be other attributes that are more likely to drive your buying behaviour.
Most ecommerce sites use a range of attributes to decide product ranking in a product listing page, for example "relevance", "price", "is the product on special offer?", "is the product in stock?", "do we make a big margin on this product?", "is this a best seller?", "is the supplier reliable?". Personalized recommendations are sprinkled into these factors, but the weightings are very much specific to the vendor.
Most recommendation engines will provide a "relevance score" or similar indicator. In my experience, this has a long-tail distribution - a handful of products will score highly, and the rest will score very low. The ecommerce businesses I have worked with have a cut-off point - a score of less than x means we'll ignore the recommendation in favour of other criteria.
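To illustrate the blending, a sketch with entirely made-up weights; real weightings are, as noted, vendor-specific:

```python
RELEVANCE_CUTOFF = 0.2  # below this, ignore the recommender entirely

def merch_score(product, relevance):
    """Blend business signals with the personalized relevance score.
    All weights here are illustrative assumptions."""
    score = 0.0
    score += 2.0 if product["in_stock"] else -5.0
    score += 1.5 if product["on_offer"] else 0.0
    score += product["margin"] * 3.0
    score += 1.0 if product["best_seller"] else 0.0
    if relevance >= RELEVANCE_CUTOFF:
        score += relevance * 4.0  # recommendations are sprinkled in, not dominant
    return score

# listing = sorted(products, key=lambda p: merch_score(p, scores[p["id"]]),
#                  reverse=True)
```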
Again, in my experience, personalized recommendations are useful for squeezing the last few percentage points of conversion, but nowhere near as valuable as robust search, intuitive categorization, and rich meta data.
I'd suggest that if your customer has scrolled through 500 recommended products, it's unlikely the next 9500 need to be in "recommended" order - your recommendation mechanism is probably statistically not significant.

How do I structure multiple Identity Data in a database

I am designing a database for a credit bureau and am seeking some guidance.
The data they receive from banks, MFIs, Saccos, utility companies, etc. comes with various types of IDs. E.g. it is perfectly legal to open a bank account with a National ID and also with a Passport. One scenario that has me banging my head: Customer1 takes a credit facility (call it a loan for now) at Bank1 with their passport, then goes to Bank2 and takes another loan with their National ID, and to Bank3 with their Military ID. Eventually, when this data comes from the banks to the bureau, it is seen as 3 different people, while we know it's actually 1 person. At this point, there is nothing we can do as a bureau.
However, one way out (for now) is using the Govt registry, which provides a repository holding both passports and IDs. So once we query for this information and get a response, how do I show in the DB that Passport_X is related to NationalID_Y and MilitaryNumber_Z?
Again, a person's name could be captured in various orders. Bank1 could do FName, LName, OName while Bank3 does LName, FName only. How do I store these names?
Even against one ID type, e.g. NationalID, you will often find misspelled names or missing names. So one NationalID in our database could end up with about 6 different names because the person's name was captured differently by the various banks where they have transacted.
And that is just the tip of the iceberg. We have issues with addresses, telephone numbers, etc etc.
Do you have any insight as to how I'd structure my database to ensure we capture all data from all banks and provide the most accurate information possible regarding an individual? Better yet, do you have experience with this type of setup?
Thanks.
how do I show in the DB that Passport_X is related to NationalID_Y and MilitaryNumber_Z?
Trivial.
You have an identity table that has an AlternateId field if the identity is linked to another one. Use the first identity you created as the master; any alternate will have AlternateId pointing to it.
You need to separate the identity from the data in it, so you can have alternate versions of it, possibly with an origin and timestamp. You likely need to fully support versioning and tying different identities to each other as alternates, including generating a "master identity", possibly by algorithm, with the "official" (i.e. consolidated) version of your data.
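A bare-bones sketch of that separation as SQLite DDL; all table and column names here are illustrative assumptions, not a recommendation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One row per identity; alternates point at the master via alternate_of.
CREATE TABLE identity (
    id           INTEGER PRIMARY KEY,
    alternate_of INTEGER REFERENCES identity(id)  -- NULL for the master
);

-- Documents attach to identities, so Passport_X, NationalID_Y and
-- MilitaryNumber_Z can all hang off the same (master) identity.
CREATE TABLE identity_document (
    identity_id INTEGER REFERENCES identity(id),
    doc_type    TEXT,   -- 'national_id', 'passport', 'military_id'
    doc_number  TEXT
);

-- The data itself is versioned separately, with origin and timestamp,
-- so six differently-spelled names can coexist against one identity.
CREATE TABLE identity_name_version (
    identity_id INTEGER REFERENCES identity(id),
    source      TEXT,   -- which bank reported it
    full_name   TEXT,
    reported_at TEXT
);
""")
```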
The details are complex - mostly you have to make a LOT of compromises without killing performance, so in the end: HIRE A SPECIALIST. There is a reason there are people out there as senior database designers or architects with 20+ years of experience finding the optimal solution given constraints you may not even be aware of (application-wise).
Better yet, do you have experience with this type of setup?
Yes. Try financial information. Stock symbols / feeds / definitions are not necessarily compatible and vary by whom you get them from. Any non-trivial setup has different data feeds that may show the same item slightly differently, sometimes in error. Different name, sometimes different price (example: ES, CME Group, is 50 USD per point, but on TT FIX it is 5 - to make up for it, the price is multiplied by 10, so instead of 1000.25 you get 10002.5). This is the same line of consolidation, and it STINKS.
Tons of code, tons of proper database design, redoing it half a dozen times to get the proper performance. This is tricky, sadly.

is there any stock management design pattern?

We want to design an e-commerce application, and we are mental about consistent stock numbers. We don't want our customers finding out, after they have bought an item, that the item is out of stock; that's a big thing here. The average order here has about 60 different items, which makes things even trickier.
Let's imagine these two scenarios:
1st Scenario:
1) Customer C1 opens the online store and finds a product he/she wants to buy;
2) That product is shown as "in stock" (but the current stock is 1);
3) Customer C1 puts 1 item in the basket;
4) Customer C2 gets onto the website and selects the same item (puts it in the basket), which is still marked as "in stock" (stock is still 1);
5) Customer C1 goes to checkout and confirms his purchase, and the application decreases the current stock for that item to 0;
6) Customer C2 keeps buying items, let's say 35 other distinct items (it took customer C2 20 minutes to select the items he wanted);
7) Customer C2 goes to checkout and confirms his purchase, but now the first item he picked is no longer available (and we CANNOT sell it);
8) The application warns customer C2 that the first item is no longer available and that he has to check his basket;
9) Customer C2 gets pissed and closes the browser without buying anything.
2nd scenario (but I think it is unnecessarily complex and buggy):
1) Customer C1 opens the online store and finds a product he/she wants to buy;
2) That product is shown as "in stock" (but the current stock is 1);
3) Customer C1 puts 1 item in the basket (and the application decreases the current stock for that item to 0);
4) Customer C2 gets onto the website and sees that the item he/she wanted is out of stock;
5) Customer C2 leaves the website;
6) Customer C1 keeps buying items (the stock decreases for each of these items);
7) Customer C1 closes the browser;
8) Every now and then some batch routine kicks in to restore the stock of items which were put in baskets but never bought/confirmed.
We have just a few distinct products, but we have been selling about 30,000,000 items by phone; some products sell as many as 2,000,000 units every day, so the row responsible for the stock of such a product might get many concurrent updates, and it's important that we get good performance.
Those are the usual scenarios, but is there any design pattern which gives the user a better experience while keeping the stock numbers consistent and still yielding great application performance?
Any help will be much appreciated.
Cheers
First off, taking a step back, do you really need to solve the inventory management problem on the front end? Since you're selling large volumes of a relatively small set of products, it should be relatively easy to manage your inventory so that you are never out of stock or, if you are, it doesn't prevent you from fulfilling orders. There is a great deal of literature and examples that deal with calculating safety stock which requires just a bit of statistics to follow. It would make far more sense to me to focus your attention on giving the company the tools (if it doesn't already have them) to manage their inventory to prevent stock-out situations rather than trying to prevent them from happening in the sales portal.
That being said, I'm not quite sure that I follow your problem with the two scenarios you outline. Even if the database performance were flawless, if you have only 1 of item A in stock and you can't sell an item that's not in stock, then by definition one of the two potential customers is going to lose out. If in the first scenario C2 is going to go away without buying anything if any of his 35 items are not in stock (which seems unlikely if he spent 20 minutes filling his cart), there is nothing you can do in the database to prevent that. Your interface could potentially have some AJAX that alerts them while they're shopping that one of the items in their cart is out of stock, much like StackExchange notifies you while you're entering an answer that someone else has entered an answer. It's not at all clear to me that telling C2 about the problem earlier is going to be beneficial - if he's going to leave if he can't buy all 35 items in one transaction, he's going to leave no matter when you tell him that C1 bought the item. Realistically, there is no way to design the system so as not to disappoint one of the two customers in that case.
It might help if you can explain a bit more about why your application and your customers are so sensitive to stock-out situations. Most customers and most retailers are relatively accustomed to the fact that sometimes after placing an order they get notified that the retailer isn't going to be able to fulfill the order as quickly as they had expected, and are given the option to cancel that part of their order, the whole order (assuming the remaining items haven't shipped yet), or to wait for the item to come back in stock. Assuming that you do something to notify customers while they're browsing that inventory is relatively low (i.e. Amazon will tell you "N items in stock" if you're looking at an item for which they only have a handful of stock left), most customers are reasonably understanding when, 20 minutes later, they get to the checkout and are told that the item is now out of stock, since they knew in advance that they needed to order quickly. And most retailers are comfortable that even if they run out of stock of most popular items, they can still satisfy more requests than they have inventory on hand, because they undoubtedly have new inventory arriving in the next day or two, or they can rush an order for new inventory.
Your 1st scenario is what most companies do, and that's why stock management systems have the concept of a back order.
Your 2nd scenario is more beneficial to the customer, but will reduce your sales somewhat, as well as be more complicated to manage.
This really isn't a database decision. This is a management decision on how you want to handle your inventory.
Most relational databases, backed by sufficient hardware, can handle 2 million changes a day.
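Whatever policy you choose, one common building block is a conditional decrement, so the database itself refuses to oversell under concurrency. A minimal sketch with sqlite3; the products schema is an assumption:

```python
import sqlite3

def reserve_stock(conn, product_id, qty):
    """Atomically decrement stock only if enough is available.
    The WHERE clause guards against overselling even under concurrent
    checkouts. Returns True on success. Schema is hypothetical."""
    cur = conn.execute(
        "UPDATE products SET stock = stock - ? "
        "WHERE id = ? AND stock >= ?",
        (qty, product_id, qty),
    )
    return cur.rowcount == 1  # 0 rows updated means not enough stock
```

For a 60-line basket, you would run all the decrements in one transaction and roll back (reporting the failing lines to the customer) if any of them returns False.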
You could try to find out how other online retailers do it, and emulate them. For example, when Amazon is almost out of a product, they'll often display a notice saying, "Only n left in stock!" Try to find a product like that, then add it to your cart in one browser and use a different browser to see what happens to the inventory.
I am with Justin and Gilbert. This is more about logistics than the front end. There is also the Amazon solution of saying "I want all these things shipped in the same packet" (i.e. it will take longer, as all the bits have to wait for the slowest one) or "send them separately, as soon as they are available". Basically, you give yourself time to restock.
I think the most infuriating scenario is booking airline/ferry tickets, when you get to the paying part and they either time out, "no longer have that price available", or some such nonsense. Particularly annoying, as it is not exactly buying cruise boat propellers...
You could do a kanban-style routine, where you basically say that when you have 10 (or whatever) of something left, it is shown as "1 item remaining" in the front end. That means that customerA and customerB buying at the same time both get their stuff. And then the warning goes to procurement within the company: "we are 'out' of objectN".
I would be interested in knowing what kind of stuff your client is selling, where the average customer buys 60 objects.

historical data modelling literature, methods and techniques

Last year we launched http://tweetMp.org.au - a site dedicated to Australian politics and twitter.
Late last year our politician schema needed to be adjusted because some politicians retired and new politicians came in.
Changing our db required manual (SQL) changes, so I was considering implementing a CMS for our admins to make these changes in the future.
There are also many other government/politics sites out there for Australia that manage their own politician data.
I'd like to come up with a centralized way of doing this.
After thinking about it for a while, maybe the best approach is not to model the current view of the politician data and how they relate to the political system, but to model the transactions instead, such that the current view is the projection of all the transactions/changes that happened in the past.
Using this approach, other sites could "subscribe" to changes (a la pubsubhub) and submit changes, and just integrate these change items into their schemas.
Without this approach, most sites would have to tear down the entire db, and repopulate it, so any associated records would need to be reassociated. Managing data this way is pretty annoying, and severely impedes mashups of this data for the public good.
I've noticed some things work this way - source version control, banking records, stackoverflow points system and many other examples.
Of course, the immediate challenges and design issues with this approach include:
is the current view cached and repersisted? how often is it updated?
what base entities must exist that never change?
probably heaps more I can't think of right now...
Is there any notable literature on this subject that anyone could recommend?
Also, any patterns or practices for data modelling like this that could be useful?
Any help is greatly appreciated.
-CV
This is a fairly common problem in data modelling. Basically it comes down to this:
Are you interested in the view now, the view at a point in time, or both?
For example, if you have a service that models subscriptions you need to know:
What services someone had at a point in time: this is needed to work out how much to charge, to see a history of the account and so forth; and
What services someone has now: what can they access on the Website?
The starting point for this kind of problem is to have a history table, such as:
Service history: id, userid, serviceid, start_date, end_date
Chain together the service histories for a user and you have their history. So how do you model what they have now? The easiest (and most denormalized) view is to say that the last record, or the record with a NULL end date or a present or future end date, is what they have now.
As you can imagine this can lead to some gnarly SQL, so this is selectively denormalized: you have a Services table and another table for history. Each time Services is changed, a history record is created or updated. This kind of approach makes the history table more of an audit table (another term you'll see bandied about).
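In SQL terms, the "what do they have now" query over such a history table might look like the following sketch (sqlite3 for illustration, using the NULL-end-date convention from above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE service_history (
    id         INTEGER PRIMARY KEY,
    userid     INTEGER,
    serviceid  INTEGER,
    start_date TEXT,
    end_date   TEXT     -- NULL while the service is still active
);
""")

-- omitted: inserts populating the history -- then:
current = conn.execute(
    "SELECT serviceid FROM service_history "
    "WHERE userid = ? AND (end_date IS NULL OR end_date >= date('now'))",
    (42,),  # example user id
).fetchall()
```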
This is analogous to your problem. You need to know:
Who is the current MP for each seat in the House of Representatives;
Who is the current Senator for each seat;
Who is the current Minister for each department;
Who is the Prime Minister.
But you also need to know who was each of those things at a point in time so you need a history for all those things.
So if, on the 20th of August 2003, Peter Costello made a press release, you would need to know that at that time he was:
The Member for Higgins;
The Treasurer; and
The Deputy Prime Minister.
because conceivably someone could be interested in finding all press releases by Peter Costello or by the Treasurer, which will lead to the same press release but would be impossible to trace without the history.
Additionally you might need to know which seats are in which states, possibly the geographical boundaries and so on.
None of this should require a schema change as the schema should be able to handle it.
