In my app I am integrating different third-party APIs, like Google Contacts, Google Calendar, Mailchimp, etc.
Each third-party API has specific settings I can ask the user to apply during the sync with my app.
I have two options:
I make a single table, like "integration", where I store all connections. Here I am afraid I will need to add many columns to track the settings of the different APIs (contact privacy, calendar ID, Mailchimp audience ID, etc.).
I make a table for each integrated third-party API, in order to track all relevant settings separately through dedicated columns. Here I may end up with many tables once I reach many different third-party APIs. (Both options are sketched below.)
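To make the trade-off concrete, here is a minimal sketch of both options (using sqlite3 for illustration; every table and column name here is hypothetical):

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Option 1: one "integration" table; provider-specific settings become
# nullable columns that only apply to some rows.
db.execute("""
CREATE TABLE integration (
    id              INTEGER PRIMARY KEY,
    user_id         INTEGER NOT NULL,
    provider        TEXT NOT NULL,   -- 'google_contacts', 'mailchimp', ...
    access_token    TEXT NOT NULL,
    contact_privacy TEXT,            -- Google Contacts only
    calendar_id     TEXT,            -- Google Calendar only
    audience_id     TEXT             -- Mailchimp only
)
""")

# Option 2: one table per provider, each holding only its own settings.
db.execute("""
CREATE TABLE mailchimp_integration (
    id           INTEGER PRIMARY KEY,
    user_id      INTEGER NOT NULL,
    access_token TEXT NOT NULL,
    audience_id  TEXT NOT NULL
)
""")
```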
I would think the second is the best, but I need your help structuring my database according to best practices so that it scales.
Thanks!
What you are trying to do is clearly against Google's Terms of Service, specifically Section 5(e), Prohibitions on Content:
Unless expressly permitted by the content owner or by applicable law, you will not, and will not permit your end users or others acting on your behalf to, do the following with content returned from the APIs:
Scrape, build databases, or otherwise create permanent copies of such content, or keep cached copies longer than permitted by the cache header;
[...]
And there are very good legal and moral reasons for that. So the answer is actually not to do it, and to use the APIs every time.
Related
How do multiple teams (which own different system components/microservices) in a big tech company share their databases?
I can think of multiple use cases where this would be required. For example, in an e-commerce firm, the same product will be shared among multiple teams: at first the product will be part of the product onboarding service, then maybe the catalog service (which stores all products and categories), then the search service, cart service, order placement service, recommendation service, cancellation & returns service, and so on.
If they don't share any DB, then:
Do they all keep a redundant copy of the products, with the same product ID?
Wouldn't it be a challenge to achieve consistency across multiple teams?
I have multiple related doubts in both cases, whether they share a DB or not.
I have been through multiple tech blogs and videos on software design and still haven't found a satisfying answer. Please share some resources that give a complete workflow of how things work end-to-end in a big tech firm.
Thank you
In a microservice architecture, each microservice exposes endpoints through which other microservices can access the information shared between the services. So one service stores only minimal information about a record that is managed by another microservice.
For example, if a user service wants to fetch the orders for a particular user in an e-commerce case, the order service would expose an endpoint that, given a user id, returns all orders related to that user, and so on. Essentially, the only user-related field the order service needs to store is the user id; the rest of the user's details are irrelevant to it.
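As a minimal sketch of such an endpoint (using Flask; the route, field names and in-memory data are all made up for illustration):

```python
# A toy order service: it stores only the user_id of each order's owner
# and exposes an endpoint to fetch orders by user id.
from flask import Flask, jsonify

app = Flask(__name__)

ORDERS = [
    {"order_id": 1, "user_id": 42, "total": 19.99},
    {"order_id": 2, "user_id": 42, "total": 5.00},
    {"order_id": 3, "user_id": 7,  "total": 12.50},
]

@app.route("/users/<int:user_id>/orders")
def orders_for_user(user_id):
    # The user service calls this; no other user details are stored here.
    return jsonify([o for o in ORDERS if o["user_id"] == user_id])
```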
To further improve cohesion and understanding between teams, data discovery APIs/documentation are also built to share database metadata with other teams, explaining what each table/field means so that one can efficiently plan out a microservice. You can read more about how such companies build data discovery tools here.
If I understand you correctly, you are unsure how different departments receive data in a company?
The idea is that you create reusable and effective APIs to solve this problem.
Let's generically say the company we're looking at is Walmart. Walmart has millions of items in its database(s). Each item has a unique ID, and so on.
If Walmart is selling items online via walmart.com, they have to have a way to get those items, so they create APIs and use them to grab items based on certain query conditions.
Now, let's say Walmart has decided to build an app... well, they need those exact same items! Good thing we already created those APIs; we will use the exact same ones to grab the data.
Now, how does Walmart manage which items are available at which store, and at what price? They would usually link this metadata through additional database schema tables, tying them all together with primary and foreign keys.
This essentially allows Walmart to grab ONLY the item out of their CORE database, which holds just the details intrinsic to the item (e.g. name, size, color, SKU, details, etc.), and link it to another database for, say, YOUR local Walmart, which contains information relevant only to your Walmart location in regard to that item (e.g. price, stock, aisle number, etc.).
So yes, in a sense, this uses multiple databases.
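For illustration, here is a rough sketch of that core-plus-location schema (using sqlite3; all table and column names are hypothetical):

```python
# A core "item" table plus a per-store table, linked by primary/foreign keys.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE item (                      -- the CORE database's item details
    item_id INTEGER PRIMARY KEY,
    sku     TEXT NOT NULL,
    name    TEXT NOT NULL,
    size    TEXT,
    color   TEXT
);

CREATE TABLE store_item (                -- location-specific details
    store_id INTEGER NOT NULL,
    item_id  INTEGER NOT NULL REFERENCES item(item_id),
    price    REAL,
    stock    INTEGER,
    aisle    TEXT,
    PRIMARY KEY (store_id, item_id)
);
""")
```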
Perhaps this may drive you down some more roads: https://learnsql.com/blog/why-use-primary-key-foreign-key/
https://towardsdatascience.com/designing-a-relational-database-and-creating-an-entity-relationship-diagram-89c1c19320b2
There's a substantial diversity of approaches used between and even within big tech companies, driven by different company/org cultures and different requirements around consistency and availability.
Any time you have an explicit "query another service/another DB" dependency, you have a coupling which tends to turn a problem in one service into a problem in both services. And this isn't necessarily a one-way thing: it's quite possible for the querying service to encounter a problem which cascades into a problem in the queried service. This is especially possible when a cache becomes load-bearing, which has led to major outages at at least one FANMAG in the not-that-distant past.
This has led some companies that could be fairly called big tech to eschew that approach in their service design, typically by having services publish events describing what has changed to a durable log (append-only storage). Other services subscribe to that log and use the events to construct their own eventually consistent view of the data owned by the other service (i.e. there's some level of data duplication, with services storing exactly the data they need to function).
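A toy sketch of that pattern (a real system would use something like Kafka rather than a local file; the event shape here is made up):

```python
# The owning service appends change events to an append-only log; a
# consuming service replays the log to build its own eventually
# consistent view of the data.
import json

def publish(log_path, event):
    with open(log_path, "a") as log:   # durable, append-only storage
        log.write(json.dumps(event) + "\n")

def rebuild_view(log_path):
    products = {}                      # the consumer's local copy
    with open(log_path) as log:
        for line in log:
            event = json.loads(line)
            if event["type"] == "product_updated":
                products[event["product_id"]] = event["data"]
    return products

publish("events.log", {"type": "product_updated", "product_id": "p1",
                       "data": {"name": "Widget", "price": 9.99}})
print(rebuild_view("events.log"))      # {'p1': {'name': 'Widget', ...}}
```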
I was wondering how so many job sites have so much information about other companies' job offers. For instance, if I were to start my own job-search engine, how would I be able to get the information that sites like indeed.com have into my own database? One site (jobmaps.us) says that it's "powered by Indeed" and seems to follow the same format as indeed.com (as do all the other job-search websites). Is there some universal job-search template that I can use?
Thanks in advance.
Some services offer an API which allows you to "federate" searches (relay them to multiple data sources, then gather all the results together for display in one place). Alternatively, some offer a mechanism that would allow you to download/retrieve data, so you can load it into your own search index.
The latter approach is usually faster and gives you total control but requires you to maintain a search index and track when data items are updated/added/deleted on remote systems. That's not always trivial.
In either case, some APIs will be open/free and some will require registration and/or a license. Most will have rate limits. It's all down to whoever owns the data.
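As a rough sketch of the federated approach mentioned above (the endpoints and response shapes below are entirely hypothetical):

```python
# Relay one search term to several sources in parallel and merge results.
import concurrent.futures
import requests

SOURCES = [
    "https://jobs.example-a.com/search",   # hypothetical endpoint
    "https://jobs.example-b.com/search",   # hypothetical endpoint
]

def query(url, term):
    resp = requests.get(url, params={"q": term}, timeout=5)
    resp.raise_for_status()
    return resp.json()["results"]          # assumed response shape

def federated_search(term):
    results = []
    with concurrent.futures.ThreadPoolExecutor() as pool:
        for hits in pool.map(lambda url: query(url, term), SOURCES):
            results.extend(hits)
    return results
```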
It's possible to emulate a user browsing a website, sending HTTP requests and analysing the response from a web server. By knowing the structure of the HTML, it's possible to extract ("scrape") the information you need.
This approach is often against site policies and is likely to get you blocked. If you do go for this approach, ensure that you respect any robots.txt policies to avoid being blacklisted.
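If you do scrape, Python's standard library can check robots.txt for you; a minimal sketch (the URL and user-agent name are placeholders):

```python
# Check robots.txt before fetching a page.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

url = "https://www.example.com/jobs?page=1"
if rp.can_fetch("MyJobSearchBot", url):
    print("OK to fetch", url)
else:
    print("Disallowed; skip", url)
```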
I have a question concerning microservices and databases. I am developing an application: a user sees a list of countries and can click through to see a list of attractions in that country. I created a country-service, an auth-service (containing users for OAuth2) and an attraction-service. Each service has its own database. I mapped the association between an attraction and its country by ISO code (for example: BE = Belgium): /api/attraction/be.
The approach above seems to work, but I am a bit stuck on the following: a user must be able to add an attraction to his/her list of favorites, but I do not see how that's possible since I have so many different databases.
Do I create a favorite-service? Do I pass IDs (I don't think I should)? What kind of business key can I create? How do I associate the data in a correct way?
Thanks in advance!
From the information you have provided, using a standalone favourite service sounds like the right option.
A simpler and quicker secondary option might be to handle this in your user service, which looks after the persistence of your users' data, since favourites are exclusive to a user entity.
As for IDs, I haven't seen many reasons why passing them would be a bad idea. Your individual services are going to need to store some identifying value for related data, and the main issue here, I feel, is just keeping this ID field consistent across your different services. Whatever you choose just needs to be reliable and predictable to keep things easy and simple as your system grows.
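A minimal sketch of that persistence (sqlite3 for illustration; the table name and ID formats are hypothetical):

```python
# The favourites data is just a link table between two service-issued IDs.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
CREATE TABLE favourite (
    user_id       TEXT NOT NULL,   -- ID issued by the auth service
    attraction_id TEXT NOT NULL,   -- ID issued by the attraction service
    PRIMARY KEY (user_id, attraction_id)
)
""")
db.execute("INSERT INTO favourite VALUES (?, ?)", ("user-42", "attr-be-01"))
```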
If you are using RESTful HTTP, you already have a persistent, bookmarkable identification of resources: URLs (URIs, or IRIs if you want to be pedantic). Those are the IDs you can use to refer to an entity in another microservice.
There is no need to introduce another layer of IDs, be it country codes or database IDs. Those things are internal to your microservice anyway and should be transparent to all clients, including other microservices.
To be clear, I'm saying, you can store the URI to the country in the attractions service. That URI should not change anyway (although you might want to prepare to change it if you receive permanent redirects), and you have to recall that URI anyway, to be able to include it in the attraction representation.
You don't really need any "business key" for favorites either, other than the URI of the attraction. You can bookmark that URI, just as you would in a browser.
I would imagine if there is an auth-service, there are URIs also for identifying individual users. So in a "favorites" service, you could simply link the User URI with Attraction URIs.
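A minimal sketch of that linking (both URIs below are hypothetical):

```python
# Store whole resource URIs instead of internal IDs, and dereference
# them when the favourites list needs to be rendered.
import requests

FAVOURITES = [
    {"user": "https://auth.example.com/users/42",
     "attraction": "https://attractions.example.com/api/attraction/be/atomium"},
]

def favourite_attractions(user_uri):
    return [requests.get(f["attraction"], timeout=5).json()
            for f in FAVOURITES if f["user"] == user_uri]
```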
I'm considering using segment.io for several of my client-side third-party API needs, but I'm a little concerned about ad blockers.
My app has no ads, but I do a lot of event-tracking for product analytics, as well as error tracking.
Segment.io offers a nice all-in-one solution, but if it's blocked, and all my eggs are in that basket, then, well, I won't have any eggs left, or however that idiom ends.
So my question is: is there a way to integrate multi-purpose event tracking (segment.io, keen.io, etc.) that isn't as susceptible to ad-blocking?
My app is mostly serverless, using a Firebase+AWS Lambda setup, so I've tried to think of some kind of back-end solution, but no big ideas so far.
ETA: I'm not looking to track ad-blocking users or violate anyone's trust. My question is about event tracking unrelated to a user's identity, and whether that's possible in an all-in-one event-tracking library that might be ad-blocked.
First, I'd typically consider such blocking to be "privacy" blocking rather than ad blocking. So instead of Adblock, it's more likely to be Ghostery or uBlock Origin.
Although most websites' uses of analytics are benign (improving conversion rates, catching browser exceptions, etc.), the concern many people have is that it allows the third-party analytics sites (Segment included) to track users across multiple websites. Most of these analytics sites aren't interested in that either, but better safe than sorry?
The ethics of wanting analytics about all use of your web app are far more nuanced than "privacy good, tracking bad", and I don't think this is the forum for that, so I'll give you a technical answer. Just note that your disclaimer about not wanting to "track ad-blocking users" is not really valid: if your aim is to gather analytics about them, that's still essentially tracking. Otherwise, just use a hosted solution and accept that maybe 10-20% of users won't provide you with analytics.
The bad news: basically every "hosted" analytics solution is, or soon will be, in the block lists. Not only are their API hosts directly blocked, but there are also blocks in place based on the names of JS files you may try to include.
The good news: you can work around it if you relay events through your own API, and AWS API Gateway which you may already be using is perfect for this.
There are multiple steps to this.
Step 1: The analytics provider needs to offer a fully bundled/built JS file. If they require you to pull the script dynamically from their own servers, it will be blocked there before it even downloads.
Step 2: Rename the bundled script so that it doesn't trigger any filename-based blocks, e.g. rename from mixpanel.umd.js to mp.js, and add it to your server.
Step 3: Create an API gateway to relay events back to the "correct" API (e.g. to api.analyticshost.com); a minimal relay sketch follows these steps. You can actually do this with AWS API Gateway alone (no Lambda required) if you pass through the right headers and URL params.
Step 4: Initialise the library to use your API host rather than the default one.
The result of this is that (a) the browser no longer pulls the analytics script dynamically from the analytics provider's CDN, and instead gets it from your server, and (b) the browser sends events to your API, which relays them through to the analytics provider's.
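If you'd rather not use API Gateway, the relay in step 3 can also be sketched as a tiny server-side proxy; here is a minimal Python/Flask version (api.analyticshost.com is the placeholder host from the steps above):

```python
# A tiny first-party relay: the browser sends analytics events to this
# endpoint on your own domain, and it forwards them upstream.
from flask import Flask, Response, request
import requests

app = Flask(__name__)
UPSTREAM = "https://api.analyticshost.com"   # placeholder upstream API

@app.route("/t/<path:path>", methods=["GET", "POST"])
def relay(path):
    upstream = requests.request(
        method=request.method,
        url=f"{UPSTREAM}/{path}",
        params=request.args,
        data=request.get_data(),
        headers={"Content-Type": request.headers.get("Content-Type", "")},
        timeout=5,
    )
    return Response(upstream.content, status=upstream.status_code)
```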
When gathering analytics, Segment also provides server-side tracking libraries. These can be quite useful when you want to gather metrics for certain types of events that might be blocked by users on the client. At its simplest, Segment has an HTTP source, but a number of popular languages are supported as well.
https://segment.com/docs/connections/sources/catalog/#server
The classic example is the order-complete event; this would typically happen on your server once the transaction has been committed to a database. Regardless of browser configuration, you can send this event from the server and track it.
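For example, using Segment's analytics-python library (the write key, user ID and properties below are placeholders):

```python
import analytics                         # Segment's analytics-python library

analytics.write_key = "YOUR_WRITE_KEY"   # placeholder

# Fire the classic server-side event once the order transaction commits.
analytics.track("user-123", "Order Completed", {"revenue": 39.95})
analytics.flush()                        # make sure the event is delivered
```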
Be sure you respect the user's consent-management settings here, though.
A lot of valid points are already mentioned in the accepted answer; I would add a few technical considerations to minimize the impact of ad blockers on tracking tools (Segment, Google Tag Manager, etc.):
Develop for server-side tracking. Whatever runs on the server cannot be blocked by ad blockers. However, this is usually tricky and very custom; Segment gives some examples of it as well.
Use managed client-side proxy solutions like DataUnlocker. This is great and does not require many code changes.
Use self-hosted open-source solutions for proxying Google Analytics and Google Tag Manager, like this or this. I believe these solutions can be extended to support Segment as well.
I am looking for a method of dynamically linking product information based on the name of the product.
For example: a user types in "Playstation 3", and the site then goes out and grabs any information it can, such as a picture, retail price, etc. Ideally, it would let the user choose the correct item (the search returns both a PS3 controller and a PS3 console; the user chooses which). It would then use this information in a product listing.
The easiest way I can think of to implement this is to use the existing API of a major retailer such as Amazon. I have a couple of completely different ideas for sites, one of which would involve selling from Amazon (which I would assume they are OK with) and another which would only be data-mining the information. I am concerned they would not take it very kindly if I were just stealing their images and descriptions.
Is there another, maybe less "sneaky", way to accomplish this that wouldn't be legally frowned upon?
Many web-commerce companies expose their data through an API - eBay, Etsy, and Amazon all have API feeds for their products. If you can convince the company to allow you access to their API (usually they will give you a key/password), then you can directly access their back-end database, typically at the read-only level. Depending on the company, you can just write to them directly for access.
You are correct when you say that most companies wouldn't take kindly to someone web-scraping their product directory and re-using it. That is unethical, and could lead to big trouble with larger companies with a significant legal presence.
On the other hand, there is nothing to prevent you from cobbling together several API feeds into a mash-up - try Yahoo Pipes (http://pipes.yahoo.com/pipes/) to learn the basics of API/mash-up integration.
Here is the link to Amazon's Product Advertising API program:
https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html
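For illustration, a generic product lookup against such an API feed might look like the sketch below; the endpoint, key and response shape are invented, and real APIs (Amazon's included) have their own request-signing and authentication schemes:

```python
import requests

API_KEY = "your-key-here"   # issued when the retailer grants API access

def search_products(name):
    resp = requests.get(
        "https://api.example-retailer.com/products",  # hypothetical endpoint
        params={"q": name, "key": API_KEY},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()["items"]   # assumed shape: title, price, image, ...

for item in search_products("Playstation 3"):
    print(item["title"], item["price"])
```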
Good luck, and happy development!
Many online retailers provide a product feed - either well-publicized (William M-B has listed some examples), or sorta-kinda hidden, for the purposes of affiliate marketing. They usually have terms of use around those product feeds, describing in detail what you're allowed to do with them, and exactly how many of your limbs are at risk if you don't play by their rules.
However, the mechanism you're describing sounds remarkably similar to a search engine; there's a well-established precedent for search engines indexing sites, and using their content to reason about the underlying site. Get a lawyer to validate this, but there's a good chance that your intended purpose falls under "fair use".
I'm a representative of http://aerse.com.
We are building a service that does the following:
search for products by name. For example: galaxy s3, galaxy s 3 or galaxy sIII
return technical specifications (CPU, RAM, etc.) and product images (thumbnails and high-res images)
provide an API: http://aerse.com/p
deal with legal issues, provide licenses, etc.