How can I get number of API transactions used by Watson NLU? - ibm-watson

AlchemyLanguage used to return the number of API transactions that took place during any call, which was particularly useful when making a combined call.
I do not see an equivalent way to get those results per REST call.
Is there any way to track or calculate this? I am also concerned about how sub-requests are counted: when you ask for sentiment on entities, does that count as two transactions, or as one plus an additional call per recognized entity?

There's currently no way to track the transactions from the API itself. To track this (particularly for cost estimates), you'll have to go to the usage dashboard in Bluemix. To find it: sign in to Bluemix, click Manage, then select Billing and Usage, and finally select Usage. At the bottom of the page you'll see a list of all your credentialed services. Expanding any of those will show the usage plus total charges for the month.
As far as how the NLU service is billed, it's not necessarily per API request as you mentioned. The service is billed in "units"; from the pricing page (https://console.ng.bluemix.net/catalog/services/natural-language-understanding):
A NLU item is based on the number of data units enriched and the number of enrichment features applied. A data unit is 10,000 characters or less. For example: extracting Entities and Sentiment from 15,000 characters of text is (2 Data Units * 2 Enrichment Features) = 4 NLU Items.
So overall, the best way to understand your transaction usage would be to run a few test requests and then check the Bluemix usage dashboard.
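For a rough estimate you can also script the formula from the pricing page yourself. This is just a sketch of that arithmetic, not an official billing calculator:

import math

def nlu_items(num_characters, num_features):
    # One data unit per 10,000 characters (rounded up), multiplied by the
    # number of enrichment features requested.
    data_units = math.ceil(num_characters / 10000)
    return data_units * num_features

# 15,000 characters with Entities + Sentiment -> 2 data units * 2 features = 4 NLU items
print(nlu_items(15000, 2))  # 4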

I was able to do a simple test: I made calls to a set of high-level features with sub-features included, and it appeared to register transactions only for the high-level features.

Related

Infinite scrolling for personalized recommendations

I've been recently researching solutions that would allow me to display a personalized ranking of products in an online e-commerce store.
A natural solution for this problem would be to use a managed ML service such as AWS Personalize.
Based on my understanding it can be implemented in terms of 2 service calls:
Recommendations - return up to ~500 products based on user's profile
Ranking - based on product ids (up to ~500), reorder the list according to the user's profile
I was wondering if there exists an implementation / indexing strategy that would allow displaying the entire product catalog (let's assume 10k products)?
I can imagine an implementation that would:
Return 50 products per page
Call recommendations to grab first 500 products
Alternatively, pick top 500 products platform-wise and rerank them according to the user.
For the remaining results, i.e. pages 11 to N, a database query would be executed, excluding those 500 products by id. The recommended ordering wouldn't be as relevant anymore, as the top recommendations have already been listed and the user is unlikely to encounter relevant results on the 11th page. As a downside, such a query would need a relatively large array to be included as part of the query.
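A minimal sketch of that two-tier paging idea in Python; get_recommendations and query_catalog are placeholders standing in for the Personalize call and a database query, not real AWS APIs:

PAGE_SIZE = 50
RECS_LIMIT = 500  # max items returned by the recommendation call

def get_recommendations(user_id, limit):
    # Placeholder for a personalization call (e.g. Personalize GetRecommendations).
    return ["product-%d" % i for i in range(limit)]

def query_catalog(exclude_ids, limit, offset):
    # Placeholder for a catalog query that skips the already-recommended ids.
    catalog = ["product-%d" % i for i in range(10000)]
    excluded = set(exclude_ids)
    remaining = [p for p in catalog if p not in excluded]
    return remaining[offset:offset + limit]

def fetch_page(user_id, page):
    recommended = get_recommendations(user_id, RECS_LIMIT)  # cache per session in practice
    recs_pages = RECS_LIMIT // PAGE_SIZE                    # pages 1..10 come from the recommendations
    if page <= recs_pages:
        start = (page - 1) * PAGE_SIZE
        return recommended[start:start + PAGE_SIZE]
    # Pages 11..N: fall back to an ordinary catalog query, excluding the 500 shown ids.
    offset = (page - recs_pages - 1) * PAGE_SIZE
    return query_catalog(recommended, PAGE_SIZE, offset)

print(fetch_page("user-1", 1)[:3])   # first page comes from the personalized list
print(fetch_page("user-1", 11)[:3])  # later pages come from the plain catalog query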
Is there a better solution for this? I've seen many ecommerce platforms offering a "Recommended" order option for their product listing that allows infinite scrolling. Would that mean that such a store typically uses predefined ranks, i.e. an arbitrary rank assigned to each product by the content manager that is exactly the same for every user on the platform?
I don't think I've ever seen an ecommerce site that shows me 10K products without any differentiation. Most ecommerce sites use a process called "merchandising" to decide which product to show, to which customer, in which position/treatment, at which time, and in which context.
Personalized recommendations may be part of that process, but they are generally only a part. For instance, if you're browsing "Books about architecture", the fact the recommendation engine thinks you're really interested in CDs by Duran Duran is not super useful. And even within "books about architecture", there may be other attributes that are more likely to drive your buying behaviour.
Most ecommerce sites use a range of attributes to decide product ranking in a product listing page, for example "relevance", "price", "is the product on special offer?", "is the product in stock?", "do we make a big margin on this product?", "is this a best seller?", "is the supplier reliable?". Personalized recommendations are sprinkled into these factors, but the weightings are very much specific to the vendor.
Most recommendation engines will provide a "relevance score" or similar indicator. In my experience, this has a long-tail distribution - a handful of products will score highly, and the rest will have very low relevancy scores. The ecommerce businesses I have worked with have a cut-off point - a score of less than x means we'll ignore the recommendation in favour of other criteria.
Again, in my experience, personalized recommendations are useful for squeezing the last few percentage points of conversion, but nowhere near as valuable as robust search, intuitive categorization, and rich meta data.
I'd suggest that if your customer has scrolled through 500 recommended products, it's unlikely the next 9500 need to be in "recommended" order - your recommendation mechanism is probably statistically not significant.

How many records / rows / nodes is a lot in Firebase?

I'm creating an app where I will store users under all postalcodes/zipcodes they want to deliver to. The structure looks like this:
postalcodes/{{postalcode}}/{{userId}}=true
The reason for the structure is to easily fetch all users who deliver to a certain postal code.
ex. postalcodes/21121/
If each user signs up for around 500 postal codes and the app has about 1000 users, that can become a lot of records:
500x1000 = 500000
Will Firebase easily handle that many records in data storage, or should I consider a different approach/solution? What are your thoughts?
Kind regards,
Elias
I'm quite sure Firebase can return 500k nodes without a problem.
The bigger concerns are how long that retrieval will take (especially in this mobile-first era) and what your application will show your user based on that many nodes.
A list with 500k rows is hardly useful, so most likely you'll show a subset of the data.
Say you just show the first screenful of nodes. How many nodes will that be? 20? So why would you retrieve the other nodes in that case? I'd simply retrieve the nodes needed to build the first screen and load the rest on demand - when/if needed.
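For example, with the Firebase Admin SDK for Python and the postalcodes/{postalcode}/{userId}: true structure from the question, loading just the first screenful might look like this (the credential file and database URL are placeholders):

import firebase_admin
from firebase_admin import credentials, db

# Placeholder credential file and database URL.
cred = credentials.Certificate("service-account.json")
firebase_admin.initialize_app(cred, {"databaseURL": "https://your-app.firebaseio.com"})

# Fetch only the first screenful of users who deliver to postal code 21121.
first_page = (db.reference("postalcodes/21121")
              .order_by_key()
              .limit_to_first(20)
              .get())
print(list(first_page or {}))  # user ids; load more on demand as the user scrolls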
Alternatively, I could imagine you show a digest of the nodes (like a total number of nodes and maybe some averages per zip code area). You'd need all nodes to determine that digest. But I'd hardly consider it the task of a client application to determine those digest values. That's more of a server-side task. That server could use the same technology as the client apps (i.e. the JavaScript API), but it wouldn't be bothered (as much) by bandwidth and time constraints.
Just some ideas of how I would approach this, so ymmv.

Google App Engine query optimization

I am trying to do my reads and writes for GAE as efficiently as possible and I was wondering which is the best of the following two options.
I have a website where users are able to post different things and right now whenever I want to show all posts by that user I do a query for all posts with that user's user ID and then I display them. Would it be better to store all of the post IDs in the user entity and do a get_by_id(post_ID_list) to return all of the posts? Or would that extra space being used up not be worth it?
Is there anywhere I can find more information like this to optimize my web app?
Thanks!
The main reason you would want to store the list of IDs would be so that you can get each entity separately for better consistency - entity gets by id are consistent with the latest version in the datastore, while queries are eventually consistent.
Check datastore costs and optimize for cost:
https://developers.google.com/appengine/docs/billing
Getting entities by key wouldn't be any cheaper than querying all the posts. The query makes use of an index.
If you use projection queries, you can reduce your costs quite a bit.
There are several cases.
First, if you keep track of all the ids of a user's posts, you must use an entity group for consistency. That means the write rate to the datastore would be limited to ~1 entity per second, and the cost is 1 read for the object holding the ids plus 1 read per entity.
Second, if you just use a query, you don't need that consistency. The cost is 1 read + 1 read per entity retrieved.
Third, if you query only keys and fetch the entities afterwards, the cost is 1 read + 1 small operation per key retrieved. See: Keys-Only Queries. This is equal in cost to projection queries.
And if you have many results and use pagination, you should use Query Cursors. That prevents wasteful usage of the datastore.
The most economical solution is the third case. See: Batch Operations.
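A sketch of that third case with the Python ndb library, assuming a hypothetical Post model with an author_id property:

from google.appengine.ext import ndb

# Hypothetical model; the real one would have whatever properties your posts need.
class Post(ndb.Model):
    author_id = ndb.StringProperty()
    body = ndb.TextProperty()

def posts_for_user(user_id, page_size=20, cursor=None):
    # Keys-only query: 1 read + 1 small op per key returned.
    keys, next_cursor, more = Post.query(Post.author_id == user_id).fetch_page(
        page_size, keys_only=True, start_cursor=cursor)
    # Batch-get the entities; ndb checks its caches (memcache) before hitting the datastore.
    posts = ndb.get_multi(keys)
    return posts, next_cursor, more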
In case you have a list of ids because they are stored with your entity, a call to ndb.get_multi (in case you are using NDB, but it would be similar with any other framework that uses memcache to cache single entities) would save you further datastore calls if all (or most) of the entities corresponding to the keys are already in the memcache.
So in the best possible case (everything is in the memcache), the datastore wouldn't be touched at all, while using a query would touch it.
See this issue for a discussion and caveats: http://code.google.com/p/appengine-ndb-experiment/issues/detail?id=118.

What Database / Technology should be used to calculate unique visitors in time scope

I've got a problem with the performance of my reporting database (tables with millions of records, 50+) when I want to calculate a distinct count on the column that indicates visitor uniqueness, let's say some hashkey.
For example:
I have these columns:
hashkey, name, surname, visit_datetime, site, gender, etc...
I need to get the distinct count over a time span of 1 year, in less than 5 seconds:
SELECT COUNT(DISTINCT hashkey) FROM table WHERE visit_datetime BETWEEN 'YYYY-MM-DD' AND 'YYYY-MM-DD'
This query will be fast for short time ranges, but if the range is bigger than one month it can take more than 30s.
Is there a better technology to calculate something like this than relational databases?
I'm wondering what Google Analytics uses to calculate unique visitors on the fly.
For reporting and analytics - the type of thing you're describing - these sorts of statistics tend to be pulled out, aggregated, and stored in a data warehouse. They are stored in a fashion chosen for query performance rather than the nice relational storage techniques optimized for OLTP (online transaction processing). This pre-aggregated approach is called OLAP (online analytical processing).
You could have another table store the count of unique visitors for each day, updated daily by a cron function or something.
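A sketch of that daily rollup using SQLite; the visits table name is a placeholder, while hashkey and visit_datetime come from the question:

import sqlite3

conn = sqlite3.connect("reporting.db")

# Placeholder source table mirroring the question's schema.
conn.execute("""
CREATE TABLE IF NOT EXISTS visits (
    hashkey TEXT,
    visit_datetime TEXT
)""")

# One row per day holding the pre-computed unique-visitor count.
conn.execute("""
CREATE TABLE IF NOT EXISTS daily_unique_visitors (
    visit_date TEXT PRIMARY KEY,
    unique_visitors INTEGER NOT NULL
)""")

def rollup_day(day):
    # Run once per day (e.g. from cron) to aggregate that day's visits.
    conn.execute("""
        INSERT OR REPLACE INTO daily_unique_visitors (visit_date, unique_visitors)
        SELECT ?, COUNT(DISTINCT hashkey)
        FROM visits
        WHERE DATE(visit_datetime) = ?""", (day, day))
    conn.commit()

rollup_day("2017-06-01")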
Google Analytics uses a first-party cookie, which you can see if you log Request Headers using LiveHTTPHeaders, etc.
All GA analytics parameters are packed into the Request URL, e.g.,
http://www.google-analytics.com/__utm.gif?utmwv=4&utmn=769876874&utmhn=example.com&utmcs=ISO-8859-1&utmsr=1280x1024&utmsc=32-bit&utmul=en-us&utmje=1&utmfl=9.0%20%20r115&utmcn=1&utmdt=GATC012%20setting%20variables&utmhid=2059107202&utmr=0&utmp=/auto/GATC012.html?utm_source=www.gatc012.org&utm_campaign=campaign+gatc012&utm_term=keywords+gatc012&utm_content=content+gatc012&utm_medium=medium+gatc012&utmac=UA-30138-1&utmcc=__utma%3D97315849.1774621898.1207701397.1207701397.1207701397.1%3B...
Within that URL is a piece keyed to utmcc; these are the GA cookies. Within utmcc is a string keyed to __utma, which is a string comprised of six fields, each delimited by a '.'. The second field is the Visitor ID, a random number generated and set by the GA server after looking for GA cookies and not finding them:
__utma%3D97315849.1774621898.1207701397.1207701397.1207701397.1
In this example, 1774621898 is the Visitor ID, intended by Google Analytics as a unique identifier of each visitor.
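As a quick illustration, extracting the Visitor ID from the __utma value above is plain string handling (no GA libraries involved):

from urllib.parse import unquote

utmcc = "__utma%3D97315849.1774621898.1207701397.1207701397.1207701397.1%3B"
utma_value = unquote(utmcc).split("=")[1].rstrip(";")  # "97315849.1774621898...."
fields = utma_value.split(".")                         # six '.'-delimited fields
visitor_id = fields[1]
print(visitor_id)  # 1774621898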
So you can see the flaws of this technique for identifying unique visitors--entering the site using a different browser, or a different device, or after deleting the cookies, will cause you to appear to GA as a new unique visitor (i.e., it looks for its cookies and doesn't find any, so it sets them).
There is an excellent article by EFF on this topic--i.e., how uniqueness can be established, and with what degree of certainty, and how it can be defeated.
Finally, one technique I have used to determine whether someone has visited our site before (assuming the hard case, which is that they have deleted their cookies, etc.) is to examine the client request for our favicon. The directories that store favicons are quite often overlooked--whether during a manual sweep or programmatically using a script.

Can I do transactions and locks in CouchDB?

I need to do transactions (begin, commit or rollback), locks (select for update).
How can I do it in a document model db?
Edit:
The case is this:
I want to run an auctions site.
And I'm thinking about how to handle direct purchases as well.
In a direct purchase I have to decrement the quantity field in the item record, but only if the quantity is greater than zero. That is why I need locks and transactions.
I don't know how to address that without locks and/or transactions.
Can I solve this with CouchDB?
No. CouchDB uses an "optimistic concurrency" model. In the simplest terms, this just means that you send a document version along with your update, and CouchDB rejects the change if the current document version doesn't match what you've sent.
It's deceptively simple, really. You can reframe many normal transaction based scenarios for CouchDB. You do need to sort of throw out your RDBMS domain knowledge when learning CouchDB, though. It's helpful to approach problems from a higher level, rather than attempting to mold Couch to a SQL based world.
Keeping track of inventory
The problem you outlined is primarily an inventory issue. If you have a document describing an item, and it includes a field for "quantity available", you can handle concurrency issues like this:
Retrieve the document, take note of the _rev property that CouchDB sends along
Decrement the quantity field, if it's greater than zero
Send the updated document back, using the _rev property
If the _rev matches the currently stored number, be done!
If there's a conflict (when _rev doesn't match), retrieve the newest document version
In this instance, there are two possible failure scenarios to think about. If the most recent document version has a quantity of 0, you handle it just like you would in a RDBMS and alert the user that they can't actually buy what they wanted to purchase. If the most recent document version has a quantity greater than 0, you simply repeat the operation with the updated data, and start back at the beginning. This forces you to do a bit more work than an RDBMS would, and could get a little annoying if there are frequent, conflicting updates.
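A sketch of that retry loop against CouchDB's HTTP document API using Python's requests; the database name and item id are placeholders:

import requests

COUCH = "http://localhost:5984/shop"  # placeholder database URL

def purchase_one(item_id):
    # Decrement 'quantity' with optimistic concurrency; retry on 409 conflicts.
    while True:
        doc = requests.get(f"{COUCH}/{item_id}").json()  # includes the current _rev
        if doc.get("quantity", 0) <= 0:
            return False  # sold out
        doc["quantity"] -= 1
        resp = requests.put(f"{COUCH}/{item_id}", json=doc)
        if resp.status_code in (201, 202):  # _rev matched, update accepted
            return True
        if resp.status_code != 409:  # anything other than a conflict is a real error
            resp.raise_for_status()
        # 409 Conflict: someone else updated first; loop and retry with the fresh doc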
Now, the answer I just gave presupposes that you're going to do things in CouchDB in much the same way that you would in an RDBMS. I might approach this problem a bit differently:
I'd start with a "master product" document that includes all the descriptor data (name, picture, description, price, etc). Then I'd add an "inventory ticket" document for each specific instance, with fields for product_key and claimed_by. If you're selling a model of hammer, and have 20 of them to sell, you might have documents with keys like hammer-1, hammer-2, etc, to represent each available hammer.
Then, I'd create a view that gives me a list of available hammers, with a reduce function that lets me see a "total". These are completely off the cuff, but should give you an idea of what a working view would look like.
Map
function(doc) {
  if (doc.type == 'inventory_ticket' && doc.claimed_by == null) {
    emit(doc.product_key, { 'inventory_ticket': doc._id, '_rev': doc._rev });
  }
}
This gives me a list of available "tickets", by product key. I could grab a group of these when someone wants to buy a hammer, then iterate through sending updates (using the id and _rev) until I successfully claim one (previously claimed tickets will result in an update error).
Reduce
function(keys, values, rereduce) {
  if (rereduce) {
    return sum(values);
  }
  return values.length;
}
This reduce function simply returns the total number of unclaimed inventory_ticket items, so you can tell how many "hammers" are available for purchase.
Caveats
This solution represents roughly 3.5 minutes of total thinking for the particular problem you've presented. There may be better ways of doing this! That said, it does substantially reduce conflicting updates, and cuts down on the need to respond to a conflict with a new update. Under this model, you won't have multiple users attempting to change data in the primary product entry. At the very worst, you'll have multiple users attempting to claim a single ticket, and if you've grabbed several of those from your view, you simply move on to the next ticket and try again.
Reference: https://wiki.apache.org/couchdb/Frequently_asked_questions#How_do_I_use_transactions_with_CouchDB.3F
Expanding on MrKurt's answer. For lots of scenarios you don't need to have stock tickets redeemed in order. Instead of selecting the first ticket, you can select randomly from the remaining tickets. Given a large number of tickets and a large number of concurrent requests, you will get much reduced contention on those tickets compared with everyone trying to get the first ticket.
A design pattern for RESTful transactions is to create a "tension" in the system. For the popular example use case of a bank account transfer, you must ensure that the totals for both involved accounts are updated:
Create a transaction document "transfer USD 10 from account 11223 to account 88733". This creates the tension in the system.
To resolve any tension, scan all transaction documents and:
If the source account is not updated yet update the source account (-10 USD)
If the source account was updated but the transaction document does not show this then update the transaction document (e.g. set flag "sourcedone" in the document)
If the target account is not updated yet update the target account (+10 USD)
If the target account was updated but the transaction document does not show this then update the transaction document
If both accounts have been updated you can delete the transaction document or keep it for auditing.
The scanning for tension should be done in a backend process over all "tension documents" to keep the periods of tension in the system short. In the above example there will be a short period of anticipated inconsistency when the first account has been updated but the second has not been updated yet. This must be taken into account the same way you'd deal with eventual consistency if your CouchDB is distributed.
Another possible implementation avoids the need for transactions completely: just store the tension documents and evaluate the state of your system by evaluating every involved tension document. In the example above this would mean that the total for an account is determined only as the sum of the values in the transaction documents in which that account is involved. In CouchDB you can model this very nicely as a map/reduce view.
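As a rough illustration of deriving balances from the tension documents alone, here is plain Python standing in for what such a map/reduce view would compute (the document shape is an assumption):

from collections import defaultdict

# Hypothetical transfer documents: each one both debits and credits an account.
transfers = [
    {"type": "transfer", "from": "11223", "to": "88733", "amount": 10},
    {"type": "transfer", "from": "88733", "to": "11223", "amount": 4},
]

balances = defaultdict(int)
for doc in transfers:
    balances[doc["from"]] -= doc["amount"]  # what a map function would emit as (account, -amount)
    balances[doc["to"]] += doc["amount"]    # and (account, +amount); the reduce would _sum them
print(dict(balances))  # {'11223': -6, '88733': 6}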
No, CouchDB is not generally suitable for transactional applications because it doesn't support atomic operations in a clustered/replicated environment.
CouchDB sacrificed transactional capability in favor of scalability. In order to have atomic operations you need a central coordination system, which limits your scalability.
If you can guarantee that you only have one CouchDB instance, or that everyone modifying a particular document connects to the same CouchDB instance, then you could use the conflict detection system to create a sort of atomicity using the methods described above. But if you later scale up to a cluster or use a hosted service like Cloudant, it will break down and you'll have to redo that part of the system.
So, my suggestion would be to use something other than CouchDB for your account balances, it will be much easier that way.
As a response to the OP's problem, Couch is probably not the best choice here. Using views is a great way to keep track of inventory, but clamping to 0 is more or less impossible. The problem is the race condition when you read the result of a view, decide you're OK to use a "hammer-1" item, and then write a doc to use it. There's no atomic way to write the doc that uses the hammer only if the view says there are > 0 hammer-1's. If 100 users all query the view at the same time and see 1 hammer-1, they can all write a doc to use a hammer-1, resulting in -99 hammer-1's. In practice, the race condition will be fairly small - really small if your DB is running on localhost. But once you scale and have an off-site DB server or cluster, the problem will get much more noticeable. Regardless, it's unacceptable to have a race condition of that sort in a critical, money-related system.
An update to MrKurt's response (it may just be dated, or he may have been unaware of some CouchDB features)
A view is a good way to handle things like balances / inventories in CouchDB.
You don't need to emit the docid and rev in a view. You get both of those for free when you retrieve view results. Emitting them - especially in a verbose format like a dictionary - will just grow your view unnecessarily large.
A simple view for tracking inventory balances should look more like this (also off the top of my head)
function(doc) {
  if (doc.InventoryChange != undefined) {
    for (var product_key in doc.InventoryChange) {
      emit(product_key, doc.InventoryChange[product_key]);
    }
  }
}
And the reduce function is even more simple
_sum
This uses a built-in reduce function that just sums the values of all rows with matching keys.
In this view, any doc can have a member "InventoryChange" that maps product_key's to a change in the total inventory of them. i.e.
{
  "_id": "abc123",
  "InventoryChange": {
    "hammer_1234": 10,
    "saw_4321": 25
  }
}
Would add 10 hammer_1234's and 25 saw_4321's.
{
  "_id": "def456",
  "InventoryChange": {
    "hammer_1234": -5
  }
}
Would burn 5 hammers from the inventory.
With this model, you're never updating any data, only appending. This means there's no opportunity for update conflicts. All the transactional issues of updating data go away :)
Another nice thing about this model is that ANY document in the DB can both add and subtract items from the inventory. These documents can have all kinds of other data in them. You might have a "Shipment" document with a bunch of data about the date and time received, warehouse, receiving employee etc. and as long as that doc defines an InventoryChange, it'll update the inventory. As could a "Sale" doc, and a "DamagedItem" doc etc. Looking at each document, they read very clearly. And the view handles all the hard work.
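Querying a view like that with group=true gives you the running total per product key. A sketch with Python's requests; the database, design document, and view names are placeholders:

import requests

# Placeholder database, design doc, and view names.
VIEW_URL = "http://localhost:5984/shop/_design/inventory/_view/balance"

rows = requests.get(VIEW_URL, params={"group": "true"}).json()["rows"]
inventory = {row["key"]: row["value"] for row in rows}
print(inventory)  # e.g. {'hammer_1234': 5, 'saw_4321': 25}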
Actually, you can in a way. Have a look at the HTTP Document API and scroll down to the heading "Modify Multiple Documents With a Single Request".
Basically you can create/update/delete a bunch of documents in a single post request to URI /{dbname}/_bulk_docs and they will either all succeed or all fail. The document does caution that this behaviour may change in the future, though.
EDIT: As predicted, from version 0.9 the bulk docs no longer works this way.
Just use a lightweight solution like SQLite for the transactions, and when a transaction completes successfully, replicate it to CouchDB and mark it as replicated in SQLite.
SQLite table:

txn_id        txn_attribute1  txn_attribute2  ...  txn_status
dhwdhwu$sg1   x               y                    added/replicated
You can also delete the transactions which are replicated successfully.
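A minimal sketch of that pattern with Python's sqlite3; the replication step is stubbed out and the column names follow the table above:

import sqlite3

conn = sqlite3.connect("transactions.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS txns (
    txn_id TEXT PRIMARY KEY,
    txn_attribute1 TEXT,
    txn_attribute2 TEXT,
    txn_status TEXT NOT NULL DEFAULT 'added'
)""")

def replicate_to_couchdb(txn_id, attr1, attr2):
    pass  # placeholder for the actual CouchDB write (e.g. a PUT to the document API)

def record_and_replicate(txn_id, attr1, attr2):
    # The local insert is atomic in SQLite.
    with conn:
        conn.execute(
            "INSERT INTO txns (txn_id, txn_attribute1, txn_attribute2) VALUES (?, ?, ?)",
            (txn_id, attr1, attr2))
    replicate_to_couchdb(txn_id, attr1, attr2)
    # Only mark the row as replicated once the remote write succeeded.
    with conn:
        conn.execute("UPDATE txns SET txn_status = 'replicated' WHERE txn_id = ?", (txn_id,))

record_and_replicate("dhwdhwu$sg1", "x", "y")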
