Is it possible to get data from other companies' databases?

I was wondering how so many job sites have so many job offers/information regarding other companies' openings. For instance, if I were to start my own job search engine, how would I be able to get the information that sites like indeed.com have into my own database? One site (jobmaps.us) says that it's "powered by indeed" and seems to follow the same format as indeed.com (as do all other job search websites). Is there some universal job search template that I can use?
Thanks in advance.

Some services offer an API which allows you to "federate" searches (relay them to multiple data sources, then gather all the results together for display in one place). Alternatively, some offer a mechanism that would allow you to download/retrieve data, so you can load it into your own search index.
The latter approach is usually faster and gives you total control but requires you to maintain a search index and track when data items are updated/added/deleted on remote systems. That's not always trivial.
In either case, some APIs will be open/free and some will require registration and/or a license. Most will have rate limits. It's all down to whoever owns the data.
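As a rough illustration of the federated approach, here is a hedged sketch in Python; the endpoints, query parameter, and response shape are all made up for illustration, since every real provider's API will differ:

import requests

# Hypothetical job APIs; real providers will have their own endpoints and auth.
SOURCES = [
    "https://jobs-api-one.test/search",
    "https://jobs-api-two.test/search",
]

def federated_search(keyword):
    results = []
    for url in SOURCES:
        resp = requests.get(url, params={"q": keyword}, timeout=10)
        resp.raise_for_status()
        # Assumes each source returns JSON with a "jobs" list.
        results.extend(resp.json().get("jobs", []))
    # Merge results from all sources into one list for display.
    return sorted(results, key=lambda job: job.get("posted_at", ""), reverse=True)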
It's possible to emulate a user browsing a website, sending HTTP requests and analysing the response from a web server. By knowing the structure of the HTML, it's possible to extract ("scrape") the information you need.
This approach is often against site policies and is likely to get you blocked. If you do go for this approach, ensure that you respect any robots.txt policies to avoid being blacklisted.
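For illustration only, here is a minimal scraping sketch in Python using requests and BeautifulSoup; the URL and CSS selectors are hypothetical and would have to match the real page markup, and the site's terms of service and robots.txt should be checked before attempting anything like this:

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example-jobs.test/search?q=python", timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
for card in soup.select("div.job-card"):  # hypothetical selector
    title = card.select_one("h2.title").get_text(strip=True)
    company = card.select_one("span.company").get_text(strip=True)
    print(title, "-", company)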

Redis database snapshot diffs or other suggested DB for network/resource monitoring

I have a monitoring service that polls a REST API for information about the latest resources (a list of hosts / a list of licenses). The monitoring service caches all of this data in a Redis database. Everything works great for discovering new resources.
However, the problem I am facing is when a host drops off the network: I have no way of knowing that the host has disappeared from the list of hosts. The REST API only gives me a way of querying the list of hosts.
One way I can come up with (theoretically) is taking a diff of the RDB at different time intervals. However, this does not seem efficient to me, and honestly I am not sure how I would do this with Redis.
The suggestions I am looking for are maybe some frameworks that are best suited for this kind of operation, or if need be a different database that is as efficient as Redis yet gives me the functionality I need to take diffs. Time-series databases spring to mind, but I have no experience with them and am not sure how precisely they could be used to solve this problem.
There's no need to resort to anywhere besides Redis itself - it is robust enough to continue serving your requirements as long as you tell it what to do (like any other software ;)).
The following is an example, but as you didn't specify how you're caching your data, I'll assume for simplicity's sake that you have a key for every host/license in your list where you store some string/binary value, like:
SET acme.org "some cached value"
You have a lot of such keys because the monitoring REST API returns a list, so a common way to keep everything in order is to use another key to store that list for each response returned by the API. You can achieve that with a Set:
SADD request:<timestamp> acme.org foo.bar ...
Sets are particularly useful here because you can perform Set operations (SDIFF and SINTER, or rather their STORE variants in your case) to keep track of the currently online and the dropped hosts. For example:
MULTI
SINTERSTORE online:<timestamp> request:<timestamp> request:<previous-timestamp>
SDIFFSTORE dropped:<timestamp> request:<previous-timestamp> request:<timestamp>
EXEC
Note: as you're caching things, it is good practice to set expiry values (TTL) on all relevant keys and to use an appropriate eviction policy.
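A minimal sketch of the same idea from Python using redis-py, following the key naming used above; the snapshot-fetching side is left out and the TTL value is arbitrary:

import redis

r = redis.Redis()

def record_snapshot(hosts, now, previous):
    pipe = r.pipeline()  # MULTI ... EXEC
    pipe.sadd(f"request:{now}", *hosts)
    # Hosts present in both snapshots are still online.
    pipe.sinterstore(f"online:{now}", f"request:{now}", f"request:{previous}")
    # Hosts in the previous snapshot but not in the current one have dropped off.
    pipe.sdiffstore(f"dropped:{now}", f"request:{previous}", f"request:{now}")
    pipe.expire(f"request:{now}", 86400)  # arbitrary one-day TTL for snapshots
    pipe.execute()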

Ad Blocker w/ Segment.io

I'm considering using segment.io for several of my client-side 3rd party API needs, but I'm a little concerned about ad-blockers.
My app has no ads, but I do a lot of event-tracking for product analytics, as well as error tracking.
Segment.io offers a nice all-in-one solution, but if it's blocked, and all my eggs are in that basket, then, well, I won't have any eggs left, or however that idiom ends.
So my question is: is there a way to integrate multi-purpose event tracking (segment.io, keen.io, etc.) that isn't as susceptible to ad-blocking?
My app is mostly serverless, using a Firebase+AWS Lambda setup, so I've tried to think of some kind of back-end solution, but no big ideas so far.
ETA: I'm not looking to track adblocking users or violate anyone's trust. My question is about event tracking unrelated to a user's identity, and whether or not that's possible in an all-in-one event tracking library that might be ad-blocked.
First, I'd typically consider such blocking to be "privacy" blocking instead of ads. So instead of Adblock it's more likely to be Ghostery or uBlock Origin.
Although most website uses of analytics are benign (improving conversion rates, catching browser exceptions, etc.), the concern many have is that it allows the third-party analytics sites (including Segment, etc.) to track users across multiple websites. Now, most of these analytics sites are also not interested in that, but better safe than sorry?
The ethics of wanting to have analytics about all your webapp use are far more nuanced than "privacy good, tracking bad" and I don't think this is the forum for it, so I'll provide you a technical answer. Just note that your disclaimer about not wanting to "track adblocking users" is not really valid. If your aim is to gather analytics about them, that's still essentially tracking. Otherwise just use a hosted solution and realise that maybe 10-20% of users don't provide you with analytics.
The bad news: basically every "hosted" analytics solution is or will be in the block lists. Not only are their API hosts directly blocked, but there are also blocks in place based on the names of JS files you may try to include.
The good news: you can work around it if you relay events through your own API, and AWS API Gateway, which you may already be using, is perfect for this.
There are multiple steps to this.
Step 1: The analytics provider needs to offer the option of a fully bundled/built JS file. If they require you to pull the script dynamically from their own servers, it will be blocked there before it even downloads.
Step 2: Rename the bundled script so that it doesn't trigger any filename-based blocks, e.g. rename from mixpanel.umd.js to mp.js, and add it to your server.
Step 3: Create an API gateway to relay events back to the "correct" API (e.g. to api.analyticshost.com). You can actually do this with AWS API Gateway alone (no Lambda required) if you pass through the right headers and URL params.
Step 4: Initialise the library to use your API host rather than the default one.
The result of this is that (a) the browser no longer needs to pull the analytics script dynamically from the analytics provider's CDN and instead gets it from your server, and (b) the browser sends events to your API, which then relays them through to the analytics provider's.
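As an illustration of the relay idea (using a small Flask app here instead of API Gateway; the path layout and the api.analyticshost.com placeholder are just examples), a rough sketch might look like this:

from flask import Flask, request, Response
import requests

app = Flask(__name__)
ANALYTICS_HOST = "https://api.analyticshost.com"  # placeholder provider host from the steps above

@app.route("/t/<path:path>", methods=["POST"])
def relay(path):
    # Forward the event payload verbatim to the analytics provider.
    upstream = requests.post(
        f"{ANALYTICS_HOST}/{path}",
        data=request.get_data(),
        headers={"Content-Type": request.headers.get("Content-Type", "application/json")},
        timeout=5,
    )
    # Return the provider's status code so the client library sees success/failure.
    return Response(upstream.content, status=upstream.status_code)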
When gathering analytics, Segment also provides server-side tracking libraries. These can be quite useful when you want to gather metrics for certain types of events that might be blocked on the client. At its simplest, Segment has an HTTP Source, but a number of popular languages are supported as well.
https://segment.com/docs/connections/sources/catalog/#server
The classic example is the order-completed event: this would typically happen on your server once the transaction has been committed to the database. Regardless of browser configuration, you could send and track this event from the server.
Be sure you respect the user's consent management settings here, though.
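For example, a server-side order-completed event with Segment's Python library (analytics-python) might look roughly like this; the write key, IDs, and properties are placeholders:

import analytics

analytics.write_key = "YOUR_WRITE_KEY"  # placeholder

def on_order_committed(order):
    # Fire the event after the transaction has been committed to the database,
    # so it is recorded regardless of the user's browser configuration.
    analytics.track(
        user_id=order["user_id"],
        event="Order Completed",
        properties={"order_id": order["id"], "revenue": order["total"]},
    )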
A lot of valid points are already mentioned in the accepted answer; I would add a few technical considerations to minimize the impact of ad blockers on tracking tools (Segment, Google Tag Manager, etc.):
Develop for server-side tracking. Whatever runs on the server cannot be blocked by ad blockers. However, this is usually tricky and very custom; Segment gives some examples of it as well.
Use managed client-side proxy solutions like DataUnlocker. This is great and does not require many code changes.
Use self-hosted open-source solutions for proxying Google Analytics and Google Tag Manager. I believe these solutions can be extended to support Segment as well.

Is OData suitable for multi-tenant LOB application?

I'm working on a cloud-based line-of-business application. Users can upload documents and other types of objects to the application. Users upload quite a number of documents, and together there are several million docs stored. I use SQL Server.
Today I have a somewhat-RESTful API which allows users to pass in a DocumentSearchQuery entity where they supply keywords together with the requested sort order and paging info. They get a DocumentSearchResult back which is essentially a sorted collection of references to the actual documents.
I now want to extend the search API to other entity types than documents, and I'm looking into using OData for this. But I get the impression that if I use OData, I will face several problems:
There's no built-in limit on what fields users can query, which means that either the performance will depend on whether they query an indexed field or not, or I will have to implement my own parsing of incoming OData requests to ensure they only query indexed fields. (Since it's a multi-tenant application and tenants share physical hardware, slow queries are not really acceptable, as they affect other customers.)
Whatever I use to access data in the backend needs to support IQueryable. I'm currently using Entity Framework, which does this, but I will probably use something else in the future, which means it's likely that I will need to do my own parsing of incoming queries again.
There's no built-in support for limiting what data users can access. I need to validate incoming OData queries to make sure they only access data they actually have permission to access.
I don't think I want to go down the road of manually parsing incoming expression trees to make sure they only try to access data which they have access to. This seems cumbersome.
My question is: Considering the above, is using OData a suitable protocol in a multi-tenant environment where customers write their own clients accessing the entities?
I think it is suitable here. Let me give you some opinions about the problems you think you will face:
There's no built-in limit on what fields users can query, which means that either the performance will depend on whether they query an indexed field or not, or I will have to implement my own parsing of incoming OData requests to ensure they only query indexed fields. (Since it's a multi-tenant application and tenants share physical hardware, slow queries are not really acceptable, as they affect other customers.)
True. However, you can check the fields used in the filter against a list of allowed (indexed) fields and accept or reject the operation accordingly.
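As a very rough, framework-agnostic sketch of that check in Python (the field names are examples and the regex is nowhere near a full OData parser):

import re

INDEXED_FIELDS = {"Title", "CreatedDate", "OwnerId"}  # example allow-list

def filter_is_allowed(odata_filter):
    # Pull out identifiers that appear before a comparison operator, e.g. "Title eq 'x'".
    candidates = re.findall(r"\b([A-Za-z_][A-Za-z0-9_]*)\s+(?:eq|ne|gt|ge|lt|le)\b", odata_filter)
    return all(field in INDEXED_FIELDS for field in candidates)

# Reject the query before it reaches the database if it touches a non-indexed field.
assert filter_is_allowed("Title eq 'report'")
assert not filter_is_allowed("Body eq 'x'")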
Whatever I use to access data in the backend needs to support IQueryable. I'm currently using Entity Framework, which does this, but I will probably use something else in the future, which means it's likely that I will need to do my own parsing of incoming queries again.
Yes, there is a provider for EF. That means that if you use something else in the future you will need to write your own provider. And if you do end up replacing EF, you probably made this decision too early; I don't recommend WCF DS in that case.
There's no built-in support for limiting what data users can access. I need to validate incoming OData queries to make sure they only access data they actually have permission to access.
There isn't any out-of-the-box support for that in WCF Data Services, right. However, that is part of the authorization mechanism you will need to implement anyway, and I have good news for you: doing it is pretty easy with QueryInterceptors, which let you intercept the query and restrict it based on the user's privileges. This is something you would have to implement regardless of the technology you use.
My answer: considering the above, WCF Data Services is a suitable choice in a multi-tenant environment where customers write their own clients to access the entities, at least as long as you stay with EF. And you should keep in mind the huge amount of effort it saves you.

Which of these two APIs is best to use, REST or SOAP (for this specific architecture)?

Architecture:
A database on a central server which contains a complex hierarchical structure.
The clients should be able to insert data into tables through the API; the data would be inserted into multiple tables in the database at the same time, not only into one table.
The clients should be able to retrieve data by using a complex search query.
The clients can upload/download files to/from the server, which could be multiple GBs in size.
Would SOAP be better for this job than REST? Can you please explain why?
Almost all the things you mention are equally achievable using either SOAP or REST, though perhaps a little easier with SOAP. Certainly it's easier to create client APIs for SOAP interfaces; client tooling support is significantly more advanced on the majority of languages.
However, you say that you're wanting to deal with multi-gigabyte upload and download. That's a crucial point as REST is able to handle that sort of thing far more easily. SOAP is almost always tooled in terms of DOM processing, and that means building full messages in memory; you don't ever want to do that with a multi-GB payload.
So go with REST. That's definitely your best option for achieving all your listed objectives.
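To illustrate why streaming matters for multi-GB payloads, here is a hedged sketch using Python's requests library with placeholder URLs; the point is that the file is processed in chunks rather than built up in memory the way a DOM-based SOAP stack would:

import requests

def download(url, dest_path, chunk_size=8 * 1024 * 1024):
    # Stream the response so memory use stays flat regardless of file size.
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest_path, "wb") as out:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                out.write(chunk)

def upload(url, src_path):
    # Passing a file object lets requests stream the request body from disk.
    with open(src_path, "rb") as f:
        resp = requests.put(url, data=f, timeout=60)
        resp.raise_for_status()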

Updating data from several different sources

I'm in the process of setting up a database with customer information. The database will handle customer data (customer ID, address, phone number, etc.) as well as some basic information about which kind of advertisement a specific customer has been subjected to, and how they reacted to it.
The data will be maintained from a central data warehouse, but additional information about customers and the advertisements will also be updated from other sources. For example, if an external advertisement agency runs a campaign, I want them to be able to feed back data about opt-outs, e-mail bounces, etc. I guess what I need is an API which can easily be handed out to any number of agencies.
My first thought was to set up a web service API for all external sources, but since we'll probably be talking about large amounts of data (millions of records per batch), I'm not sure a web service is the best option.
So my question is, what's the best practice here? I need a solution simple enough for advertisement agencies (likely with moderately skilled IT-people) to make use of. Simplicity is of the essence – by which I mean “simplicity over performance” in this case. If the set up gets too complex, it won't work.
The system will very likely be based on Microsoft technology.
Any suggestions?
The process you're describing is commonly referred to as data integration using ETL processes. ETL stands for Extract-Transform-Load. The idea is to build up your central data warehouse by extracting information from a lot of different data sources, transforming it, and then loading it into your data warehouse.
A variety of (also graphical) tools exist to implement such a process. Since you said you'll probably be running a Microsoft stack, I suggest having a look at SQL Server Integration Services (SSIS).
Regarding your suggestion to implement the integration using a web service, I don't think that's a good idea either; similarly, I don't think shifting the burden of data integration to your customers is a good idea. You should agree with your customers on some form of data exchange format: it could be as simple as a CSV file, or XML, Excel sheets, Access databases; use whatever suits your needs.
Any modern ETL tool like SSIS is capable of working with those different data sources.
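For instance, if the agreed exchange format were a simple CSV feed of campaign events, the ingestion step (whether hand-rolled or performed inside an SSIS package) could be as small as this sketch; the column names are assumptions:

import csv

def load_optouts(path):
    # Expected columns are an assumption: customer_id, event, timestamp
    optouts = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get("event") == "optout":
                optouts.append((row["customer_id"], row["timestamp"]))
    return optouts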
