Predictive Models for Detecting Outliers - data-modeling

I have some IP log data as a sample and am new to data modelling.
Could anyone suggest packages that do predictive modelling to detect outliers in the data?
Any R packages or anything else I can use? Any input will help.
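For reference, a minimal sketch of one common approach. In R, CRAN packages such as outliers or anomalize are often suggested; if Python is also an option, scikit-learn's IsolationForest works well for unsupervised outlier detection. The file name and feature columns below are hypothetical placeholders for your IP log data:

import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical sample file and numeric feature columns derived from IP logs.
logs = pd.read_csv("ip_logs.csv")
features = logs[["requests_per_minute", "bytes_sent"]]

# Train an unsupervised model; contamination is the assumed outlier fraction.
model = IsolationForest(contamination=0.01, random_state=42)
logs["outlier"] = model.fit_predict(features)  # -1 = outlier, 1 = inlier

print(logs[logs["outlier"] == -1].head())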

Related

What does "Online" in Online Analytical Processing mean?

Is there any offline analytical processing? If there is, how does it differ from online analytical processing?
According to "What are OLTP and OLAP. What is the difference between them?", OLAP deals with historical or archival data. OLAP is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations.
I don't understand what the word "online" in "online analytical processing" means.
Is it related to real-time processing? (My understanding of real-time: shortly after the data is generated, it can be analyzed. Am I wrong about this?)
When does the analysis happen?
I imagine a design like this:
logs generated in many apps -> Kafka -> (relational DB) -> Flink as ETL -> HBase, with the analysis happening after the data is inserted into HBase. Is this correct?
If yes, why is it called online?
If no, when does the analysis happen? Please correct me if this design is not typical in industry.
P.S. Assume the logs generated by the apps in a day are at the PB level.
TL;DR: as far as I can tell, "online" appears to stem from the characteristics of a scenario where handling transactions with satellite devices (ATMs) was a new thing.
Long version
To understand what "online" in OLTP means, you have to go back to when ATMs first came out in the 1970s.
If you're a bank back then, you've got two types of system: your central banking system (i.e. the mainframe), and these newfangled ATMs connected to it... online.
So if you're a bank, and someone wants to get money out, you have to do a balance check; and if cash is withdrawn you need to do a debit. That last action - or transaction - is key, because you don't want to miss it: you want to update the central record back in the bank's central systems. So that's the transactional processing (TP) part.
The OL part just refers to the remote / satellite / connected devices that participate in the transaction processing.
OLTP is all about making sure that happens reliably.
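As a toy illustration of that balance-check-then-debit pattern, here is a minimal sketch (Python with sqlite3 standing in for the bank's central system; the schema is invented):

import sqlite3

conn = sqlite3.connect("bank.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts (id INTEGER PRIMARY KEY, balance INTEGER)")

def withdraw(account_id: int, amount: int) -> bool:
    # Balance check and debit happen in one atomic transaction:
    # `with conn` commits on success and rolls back on any error.
    try:
        with conn:
            row = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (account_id,)
            ).fetchone()
            if row is None or row[0] < amount:
                return False  # unknown account or insufficient funds
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, account_id),
            )
        return True  # the debit is durably recorded before cash is dispensed
    except sqlite3.Error:
        return False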

Enable users to "hotfix" source data while waiting for upstream source data to change

For a few SaaS tools our company uses, a 3rd party administers the tools and provides us with daily feeds, which we load into our data warehouse.
Occasionally, a record in one of the feeds will have an error that needs to be fixed ASAP for downstream reporting. However, the SLA for the 3rd party to correct the record(s) in the source SaaS system can be up to two weeks. The 'error' doesn't break anything; it's just that a record is closed when it should have stayed open, or a field has the wrong value.
The process is as follows:
BI team A, downstream of us (the data warehouse team), notices the discrepancy.
BI team A corrects the record in their database, which other teams consume from.
BI team B, which receives data from the data warehouse and BI team A, raises an alarm because they see a discrepancy between our output and that which they receive from team A.
We (data warehouse team) have to correct the source data
The upstream 3rd party eventually corrects the records
Does anyone have a best practice for this scenario? What is an approach that would:
A. enable the BI team A to correct records ASAP without impacting the data warehouse team, and
B. be rollback-able once the upstream 3rd party corrects the source data?
One idea I had was to use a source-controlled CSV file (like a dbt seed table), were it not that the records usually contain PII and therefore can't be version controlled.
How I would approach this:
Ensure that you have controls on your DW to catch any errors. Having a consumer of your data (BI team A) telling you that your data is wrong is not a good place to be in!
Have one team responsible for fixing the data, and in one place - this ensures you have control, consistency, and auditing. As the data starts in the DW and then moves downstream to other systems, the DW is the place to fix it.
Build a standard process for fixing data that involves as little manual intervention as possible and which has been developed and tested in advance. When you encounter an error, and are under pressure from your customers to fix it, the last thing you want is to be working out how to resolve the error and then developing/running untested code.
At a high level, your standard process should be a copy of the production process, e.g. a copy of the staging table (where you can insert the corrected versions of the incorrect records) and a copy of the loading process but pointed at this copied staging table (see the sketch after this list). Depending on your production logic you may need to amend the copy to delete/insert or update the incorrect records in your DW. Depending on your toolset, you might be able to achieve this with a separate config file rather than copying tables/logic.
Auditing: you should always be able to trace the fact that records have been amended, which records have been affected, and what the changes were.
Obviously you need to ensure that the changes you make to the DW cascade down to any consuming systems - either in the regular update process (if your consumers can wait until then) or as a one-off process. Similarly, you need to ensure that when the amended record is finally received from the 3rd party it updates your DW correctly, and that you've audited the fact that an error has been corrected - presumably you'd want to be able to report on any errors not fixed by the 3rd party within their SLA?
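One possible shape for that correction process (a sketch only; the schema, names, and the SQLite choice are all hypothetical) is an audited override table applied on top of the raw feed, so that deleting the override row rolls the fix back once the 3rd party ships the corrected record:

import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS feed_records (
    record_id TEXT PRIMARY KEY,
    status    TEXT
);
CREATE TABLE IF NOT EXISTS data_overrides (
    record_id TEXT PRIMARY KEY,
    status    TEXT,
    fixed_by  TEXT,
    fixed_at  TEXT DEFAULT CURRENT_TIMESTAMP,
    reason    TEXT
);
-- Consumers read from this view, never from the raw feed table.
CREATE VIEW IF NOT EXISTS feed_records_corrected AS
SELECT f.record_id,
       COALESCE(o.status, f.status) AS status
FROM feed_records f
LEFT JOIN data_overrides o ON o.record_id = f.record_id;
""")

# Apply a hotfix (fully audited: who, when, why)...
conn.execute(
    "INSERT INTO data_overrides (record_id, status, fixed_by, reason) "
    "VALUES (?, ?, ?, ?)",
    ("R-1042", "open", "dw-team", "closed in error; awaiting 3rd-party fix"),
)

# ...and roll it back once the corrected record arrives upstream:
conn.execute("DELETE FROM data_overrides WHERE record_id = ?", ("R-1042",))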

Event data storage with InfluxDB

I'm working on a new project that will involve storing events from various systems at random intervals - events such as deployment completions, production bugs, continuous integration events, etc. This is somewhat time-series data, although the volume should be relatively low: a few thousand events a day or so.
I had been thinking InfluxDB might be a good option here, as the app will revolve mostly around plotting timelines and durations, although there will need to be a small amount of data stored with these data points - information like error messages, descriptions, URLs, and maybe tweet-sized strings. I would say there is a good chance most events will not actually have a numerical value but will act more as a point-in-time reference for an event.
As an example, I would expect a lot of events to look like (in Influx line protocol format)
events,stream=engineering,application=circleci,category=error message="Web deployment failure - Project X failed at step 5",url="https://somelink.com",value=0
My question here is: am I approaching this wrong? Is InfluxDB the wrong choice for this type of data? I have read a few horror stories about data corruption and I'm a bit nervous there, but I'm not entirely sure of any better (but also affordable) options.
How would you go about storing this type of data in a way that can be accessed at high frequency, for a purpose such as a real-time dashboard?
I'm resisting the urge to just roll out a Postgres database.
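For reference, writing the example event above with the influxdb Python client (this assumes InfluxDB 1.x and the influxdb package; host and database names are placeholders):

from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="events")

# One event as a point: tags are indexed (good for filtering);
# fields hold the free-form payload (message, url, value).
point = {
    "measurement": "events",
    "tags": {
        "stream": "engineering",
        "application": "circleci",
        "category": "error",
    },
    "fields": {
        "message": "Web deployment failure - Project X failed at step 5",
        "url": "https://somelink.com",
        "value": 0,
    },
}
client.write_points([point])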

Distributed transactions - why do we save tranlogs to file system?

All transaction managers (Atomikos, Bitronix, IBM WebSphere TM, etc.) save "transaction logs" into a 'tranlogs' folder on the file system.
When something terrible happens and the server goes down, the tranlogs sometimes become corrupted.
They then require a manual recovery procedure.
I've been told that by simply clearing the broken tranlogs folder I risk ending up with an inconsistent state across the resources that participated in the transactions.
As a "dumb" developer, I feel more comfortable with simple concepts. I want to think that distributed transaction management should be like regular transaction management:
If something goes wrong at any party (network, app error, timeout), I expect the whole multi-resource transaction not to be committed in any part. All leftovers should be cleaned up automatically sooner or later.
If the transaction manager fails (file system fault, power supply fault), I expect all the transactions under this TM to be rolled back (apparently at the DB timeout level).
File storage for tranlogs should be optional if I don't want any automatic TX recovery (whatever that would mean).
Questions
Why can't I think like this? What's so complicated about 2PC?
What are the exact risks when I clear broken tranlogs?
If I am wrong and I really do need all this mess with 2PC file system state: don't you feel sick about the fact that a TX manager can break storage state in such an easy and ugly manner?
When I was first confronted with two-phase commit in real life in 1994 (initially on a larger Oracle7 environment), I had a similar initial reaction. What a bloody shame that it is not generally possible to make it simple. But looking back at the algorithm books from university, it became clear that there is no general solution for 2PC.
See, for instance, how to come to consensus in a distributed environment.
Of course, there are many specific cases where a 2PC commit of a transaction can be resolved more easily, either completing or rolling back completely and with less impact. But the general problem remains and cannot be solved.
In this case, a transaction manager has to decide at some point what to do; a transaction cannot remain open forever. Therefore, as the ultimate fallback, it will always need to go back to its own transaction logs, since one or more of the other parties may not be able to reliably communicate their status now or in the near future. Some transaction managers might be more advanced and know how to resolve some cases more easily, but the need for an ultimate fallback remains.
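To make the role of those transaction logs concrete, here is a toy coordinator sketch (illustrative only, not any real TM's code): the commit decision is forced to disk before any participant is told, so that after a crash the coordinator can re-read its own log and finish the protocol.

import os

class Participant:
    # Stub resource manager; a real one would talk to a DB or a queue.
    def __init__(self, name: str):
        self.name = name
    def prepare(self, tx_id: str) -> bool:
        return True  # vote yes
    def commit(self, tx_id: str) -> None:
        print(f"{self.name}: commit {tx_id}")
    def rollback(self, tx_id: str) -> None:
        print(f"{self.name}: rollback {tx_id}")

def two_phase_commit(tx_id: str, participants: list, log_path: str = "tranlog") -> bool:
    # Phase 1: ask every participant to prepare (vote yes/no).
    if not all(p.prepare(tx_id) for p in participants):
        for p in participants:
            p.rollback(tx_id)
        return False

    # Force the COMMIT decision to stable storage *before* phase 2.
    # This record is exactly what is lost if the tranlog is deleted:
    # without it, in-doubt participants can never be resolved safely.
    with open(log_path, "a") as log:
        log.write(f"COMMIT {tx_id}\n")
        log.flush()
        os.fsync(log.fileno())

    # Phase 2: from here on we must commit, even across crashes;
    # recovery re-reads the log and retries commit on in-doubt parties.
    for p in participants:
        p.commit(tx_id)
    return True

two_phase_commit("tx-1", [Participant("db"), Participant("queue")])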
I am sorry for you. Fixing it generally seems to be identical to "Falsity implies anything" in binary logic.
Summarizing
On "Why can't I think like this?" and "What's so complicated about 2PC?": see above. This algorithmic problem can't be solved universally.
On "What are the exact risks when I clear broken tranlogs?": the transaction manager has some database backing it. Deleting tranlogs poses the same problem as in general relational database software: you lose information on the transactions in progress. Some DB platforms may still be left with partially or largely intact files. For background and some database theory, see Wikipedia.
On "Don't you feel sick about the fact that a TX manager can actually break storage state in an easy and ugly manner?": yes, sometimes when I have to get a lot of work done by the team, I really hate it. But well, it keeps me in a job :-)
Addition: to 2PC or not
From your addition I understand that you are considering whether or not to include 2PC in your projects.
In my opinion, your mileage may vary. Our company's policy for 2PC is: avoid it whenever possible. However, in some environments, especially with legacy systems and complex environments such as those found in banking, you cannot work around it. The customer requires it, and they may not be willing to let you make a major change to other infrastructural components.
When you must do 2PC: do it well. I like a clean architecture of the software and infrastructure, and something so simple that even five years from now it is clear how it works.
For all other cases, we stay away from two-phase commit. We have our own framework (Invantive Producer) spanning client, application server, and database backend. In this framework we have chosen to sacrifice elements of ACID when working normally in a distributed environment. The application developer must take care of, for instance, atomicity himself. Often that is possible with little effort, or doesn't even require thinking about. For instance, all software must be safe to restart. Even with atomicity of transactions, this requires some thought to do well in a massively multi-user environment (locking issues, for instance).
In general this stupid approach is very easy to understand and maintain. In cases where we have been required to do two-phase commit, we have been able to just replace some plug-ins in the framework and make some changes to client-side code.
So my advice would be:
Try to avoid 2PC.
But encapsulate your transaction logic nicely.
Allow for adding 2PC later without a complete rebuild, changing things only where needed.
I hope this helps you. If you can tell me more about your typical environments (size in #tables, size in GB of persistent data, #concurrent users, typical transaction management software and platform), maybe I can make some additions or improvements.
Addition: Email and avoiding message loss in 2PC
Regarding the suggestion of combining the DB with JMS: no, combining the DB with JMS is normally of little use; JMS will itself already have some DB behind it, hence the original question about transaction logs.
Regarding your business case: I understand that per event an email is sent from a template and that the outgoing mail is registered as an event in the database.
This is a hard nut to crack. I've enjoyed doing security audits, and one of the easiest security issues to score was checking the use of email.
Email - besides being about as confidential and tamper-proof as a postcard in most situations - has no guarantee of delivery and/or reading without additional measures. For instance, even when email is delivered directly between your mail transfer agent and the recipient, data loss can occur without the transaction monitor being informed. That gets even worse when multiple hops are involved. For instance, each MTA has its own queuing mechanism on which a "bomb can be dropped", leading to data loss. But you can also think of spam measures, bad configuration, mail loops, deleting a file by accident, etc. Even when you can register the sending of the email without any loss of transaction information using 2PC, that gives absolutely no clue as to whether the email will arrive at all, or even make it across the first hop.
The company I work for sells a large software package for project-driven businesses. This package has an integrated queuing mechanism, which also handles email events, typically combined with Exchange in most implementations nowadays. A few months ago we had a nice problem: transaction started, mail channel opened, mail delivered to Exchange as the MTA, mail registered as handled... transaction aborted, since an Oracle tablespace was full. On the next run, the mail was delivered to Exchange again, another abort, etc. The algorithm has since been enhanced, but from this simple example you can see that you need all endpoints to cooperate in your 2PC, even when some of the endpoints are far away in the organisation receiving and displaying your email.
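A common mitigation for that retry loop (a sketch of the general idea, not our actual fix; all names are invented) is to make the handoff idempotent: record a message key in the same local transaction that marks the mail as handled, and skip keys already present. Note this narrows the duplicate-send window rather than eliminating it - a crash between the handoff and the commit can still cause a resend, which is exactly why all endpoints need to cooperate.

import sqlite3

conn = sqlite3.connect("app.db")
conn.execute("CREATE TABLE IF NOT EXISTS sent_mail (message_key TEXT PRIMARY KEY)")

def send_once(message_key: str, send_fn) -> None:
    try:
        with conn:  # one local transaction: record the key, then hand off
            # Raises IntegrityError if this key was already handled.
            conn.execute(
                "INSERT INTO sent_mail (message_key) VALUES (?)", (message_key,)
            )
            send_fn()  # hand the mail to the MTA inside the protected section
    except sqlite3.IntegrityError:
        pass  # already sent on a previous (possibly crashed) run

send_once("event-42-welcome-mail", lambda: print("handed to MTA"))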
If you need to ensure that an email is delivered or read, you will need to supplement it with additional measures. Please pick from application controls, user controls, and process controls in the literature.

Versioning a dataset in an RDBMS using initials and deltas

I'm working on a system that mirrors remote datasets using initials and deltas. When an initial comes in, it mass deletes anything preexisting and mass inserts the fresh data. When a delta comes in, the system does a bunch of work to translate it into updates, inserts, and deletes. Initials and deltas are processed inside long transactions to maintain data integrity.
Unfortunately the current solution isn't scaling very well. The transactions are so large and long running that our RDBMS bogs down with various contention problems. Also, there isn't a good audit trail for how the deltas are applied, making it difficult to troubleshoot issues causing the local and remote versions of the dataset to get out of sync.
One idea is to not run the initials and deltas in transactions at all, and instead to attach a version number to each record indicating which delta or initial it came from. Once an initial or delta is successfully loaded, the application can be alerted that a new version of the dataset is available.
This just leaves the issue of how exactly to compose a view of the dataset up to a given version from the initial and the deltas. (Apple's Time Machine does something similar, using hard links on the file system to create a "view" of a certain point in time.)
Does anyone have experience solving this kind of problem or implementing this particular solution?
Thanks!
Have one writer database and several reader databases. You send writes to the one writer database and have it propagate the exact same changes to all the other databases. The reader databases will be eventually consistent, and the time to update is very fast. I have seen this done in environments that get upwards of 1M page views per day. It is very scalable. You can even put a hardware router in front of all the read databases to load-balance them.
Thanks to those who tried.
For anyone else who ends up here, I'm benchmarking a solution that adds a "dataset_version_id" and "dataset_version_verb" column to each table in question. A correlated subquery inside a stored procedure is then used to retrieve the current dataset_version_id when retrieving specific records. If the latest version of the record has a dataset_version_verb of "delete", it's filtered out of the results by a WHERE clause.
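For the record, the query pattern looks roughly like this (a sketch with a hypothetical table and key column, using SQLite for brevity rather than a stored procedure):

import sqlite3

conn = sqlite3.connect("mirror.db")
conn.execute("""CREATE TABLE IF NOT EXISTS records (
    record_key TEXT,
    dataset_version_id INTEGER,
    dataset_version_verb TEXT,
    payload TEXT)""")

def read_dataset(as_of_version: int):
    # For each record key, the correlated subquery picks the latest
    # version at or below the requested one; rows whose latest verb
    # is 'delete' are filtered out.
    return conn.execute(
        """
        SELECT t.*
        FROM records t
        WHERE t.dataset_version_id = (
                SELECT MAX(t2.dataset_version_id)
                FROM records t2
                WHERE t2.record_key = t.record_key
                  AND t2.dataset_version_id <= ?
              )
          AND t.dataset_version_verb <> 'delete'
        """,
        (as_of_version,),
    ).fetchall()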
This approach carries an average performance hit of roughly 80% so far, which may be acceptable for our purposes.
