CRISP-DM - Timing for each of the tasks? - analytics

I have what may be a simple question.
So, using CRISP-DM we have six phases that have to be followed.
How do I identify the amount of time needed for each of them?
P.S. As an example assumption: Data Collection needs 3 days.
That's the question as it stands.

There is no general rule.
Every project is very different.
For example, one project may already have all its data, and thus need 0 days to get the data.
Usually, there will be some manager preventing access to the data you need, and then it will take at least 6 months and C-level activity to get the data to you. And absolutely no progress will be possible before seeing the data.
So just plan 0-12 months on every step.
Also, don't forget that it is an iterative process, so you will have to start over again anyway. In my opinion, CRISP-DM is dead. Business people love it because it gives them the impression of "managing" things, but it doesn't work that way in reality; it is just theater you perform for the managers.


Is there a way to split Flink tasks in cluster gui besides using .startNewChain()?

The Flink cluster's GUI is really useful to me, particularly the job plan, which shows the number of records sent from one part of the job to another. But an issue I have come across is that if you don't use .startNewChain() between functions, the record counts shown as sent from one function to another are misleading.
To give an example:
Here, the code is using .startNewChain() on the finalErrorOutputStream.
When this is run in the cluster, the GUI displays this:
The outputDataStream has output 10,476 records, and the finalErrorOutputStream is shown as a separate task (not sure if "task" is technically the right term, but it is what I am calling it) that shows it has received 8,860 records.
Now if we remove the .startNewChain() from the finalErrorOutputStream, we get this in the GUI:
The outputDataStream has output 10,507 records, and we don't know how many of those have gone to the finalErrorOutputStream (yes, we could set up graphs in the task metrics tab, but the goal is to be able to tell from this standard overview). And because there is a sink for finalErrorOutputStream, it looks like finalErrorOutputStream hasn't output any records. If you were to show this to someone unfamiliar with Flink and the reasoning behind it, it would be confusing.
So using .startNewChain() is a better solution for showing the breakdown of how many records went where, BUT the issue is that .startNewChain() does have a performance impact.
And there are some jobs I have seen where, if you don't use .startNewChain(), the job's plan is just a single square even though a lot is going on inside of it.
So my question is whether .startNewChain() is the only way to get this behavior, or whether there is some other option that will provide that insight into the job's "plan"?
Unfortunately no, there is no other way. The GUI, and the backing data-structures it reads the data from, are only aware of tasks (you used the right term), not operators (the individual parts of a task).

One or two field to represent "current step" and "is or not finished" state?

Sorry if this question is too silly or neurotic... But I can't figure it out by myself. So I want to see how others deal with it.
My question is:
I want to write a program that shows the progress of some task. So I need to record which state it is currently in, so that someone can check it at any time. There are two methods:
Use two fields to represent the progress state: step and is_finished.
Use just one field: step. For example, if the task has 5 steps, then 6 means finished (0 means not started?).
Comparing the two methods:
Two fields:
Seems clearer. And, most importantly, logically speaking "which step" and "finished or not" are two separate concepts? I'm not sure about this.
When the task is finished, we change the is_finished field to true (or 1 if you like). But what do we do with the step field now? Add one, or just leave it untouched because it no longer has any meaning?
One field:
Simple and space-saving. But not very intuitive. For example, we don't know what 6 really means just by looking at this field, because it may represent either "finished" or a middle step. It needs other information, e.g. the total number of steps, to determine that. And potentially this meaning is not very stable if the total number of steps changes (the is_finished field in the two-field method would not be affected by this).
So how would you deal with it? Thanks!
UPDATE:
I forgot some points in the original post that may be useful:
The story is: we provide a web-based service for customers (the service has a time limit, e.g. a 1-year term). After a customer purchases it, our deployment program prepares hardware (a virtual machine) and deploys some software, which takes some time to finish. We want to provide progress info to the customer, and when the deployment is finished, the customer should be informed.
Database design:
It needs a usage-state field to represent running normally, running but overdue (expired), and stopped. What confuses me is whether it should also cover the "not deployed yet" and "deploying" states or not.
The progress info should include some other information, e.g. the start time, so we can tell how much time has elapsed since the start. But this info does not need to be persistent, because we won't care about it once the deployment is finished. So I decided to store the progress info in a separate (temporary) table. Then I think another, more persistent table needs a field to tell whether things are done. Can we combine that into the usage-state field mentioned above?
I like the one-field approach better, for the following reasons:
(Assuming you want to search on steps) you can "cover" all steps using only one simple index.
Should you ever want to attach some additional information to each of the steps, the one-field approach can easily accommodate a FOREIGN KEY towards a new table containing that information.
Requires slightly less storage space. Storage is cheap these days, but that's not the point - caching and network performance is.
Two-field approach:
(Assuming you want to search on steps) might require a "fatter" composite index or even two indexes (which takes space, lowers the cache effectiveness and incurs maintenance cost for INSERT/UPDATE/DELETE operations).
Requires a CHECK to defend the database from "impossible" combinations. Funny enough, some DBMSes don't enforce CHECKs (I'm looking at you, MySQL).
Requires slightly more storage space (and therefore slightly less of it fits into cache, takes up slightly more network bandwidth etc.).
NOTE: Should you choose to use NULLs, that could have "interesting" consequences under certain DBMSes (for example, Oracle doesn't index NULLs).
For example, we don't know what 6 really means
That doesn't really matter, as long as the client application knows what it means.
Design the database for applications, not humans.
And potentially this meaning is not very stable if the total steps will change
True, but you have the same problem with the two-field approach as well, if a new step is added in the "middle" of the existing steps.
Either UPDATE the table accordingly,
or never change the step values. For example, if the step 5 is the last one, then newly added step 6 is considered earlier despite having greater value - your application (or the additional table I mentioned) will know the order of steps, even if their values are not ordered. If you really want "order by value" without resorting to UPDATE, make the steps: 10, 20, 30 etc, so you can insert new steps in the gaps (the old BASIC line number trick).
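To make that concrete, here is a minimal sketch of the one-field approach with gapped step values and a lookup table (SQLite via Python; all table and column names are invented for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per step; values are gapped (10, 20, 30 ...) so a new step can
    -- later be slotted in between existing ones without renumbering.
    CREATE TABLE deployment_step (
        step   INTEGER PRIMARY KEY,
        label  TEXT NOT NULL
    );
    INSERT INTO deployment_step VALUES
        (0,  'not started'),
        (10, 'prepare virtual machine'),
        (20, 'deploy software'),
        (30, 'finished');

    -- The tracked record carries a single step column pointing at the lookup table.
    CREATE TABLE deployment (
        id    INTEGER PRIMARY KEY,
        step  INTEGER NOT NULL DEFAULT 0 REFERENCES deployment_step(step)
    );
    INSERT INTO deployment (id) VALUES (1);

    -- Advance the deployment to the next defined step.
    UPDATE deployment
       SET step = (SELECT MIN(step) FROM deployment_step
                   WHERE step > (SELECT step FROM deployment WHERE id = 1))
     WHERE id = 1;
""")
print(conn.execute("""
    SELECT d.id, d.step, s.label
      FROM deployment d JOIN deployment_step s ON s.step = d.step
""").fetchall())   # -> [(1, 10, 'prepare virtual machine')]

A step between 10 and 20 can later be added as, say, 15 without touching existing rows.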
It remains a matter of taste, but I would suggest the second option of a single int field step. On inserting a new record, initialize the value of step to 0, which would indicate "not started yet". Any positive integer value would then denote the current step. As soon as the trajectory is completed, I would set step to NULL. As you correctly stated, this method does require solid documentation, but I think it is not too confusing.

What is the life span of data?

Recently I’ve found myself in a database tangle where management wants the ability to remove data from the database, but still wants that data to appear in other places. Example: They want to remove all instances of the product whizbang, but they still want whizbang to appear in sales reports. (if they ran one for a previous date).
Now I can add a field, say is_deleted, that will track whether that product has been deleted and thus still keep all my references, but over a period of time, I have the potential of housing a lot of dead data. (data that is never accessed again). How to handle this is not my question.
I’m curious to find out, in your experience what is the average life span of data? That is, on average how long is data alive or good for before it gets either replaced or deleted? I understand that this is relative to the type of data you are housing, but certainly all data has some sort of life span?
Data lives forever... or often it should. One common practice is to have end and/or start dates for a record. So for your whizbang, you have a start date (so that it won't appear on sales reports before its official launch) and an end date (so that it drops off reports after it's been end-of-lifed). Using the proper dates as criteria in your reporting as well as your applications, you won't see the whizbang except when you should, and the data still exists (which it should, theoretically indefinitely).
As Koistya Navin mentions, moving data to a data warehouse at a certain point is also an option, but this depends in large part on how large your 'old' data is, and how long you need to keep it readily available for access.
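To make the start/end-date idea concrete, here is a minimal sketch (SQLite via Python; the table and column names are invented): the whizbang row is never deleted, only end-dated, so a report for a past period still finds it.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        id          INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        start_date  TEXT NOT NULL,   -- first day the product is live
        end_date    TEXT             -- NULL while still active
    );
    CREATE TABLE sale (
        product_id  INTEGER REFERENCES product(id),
        sale_date   TEXT NOT NULL,
        quantity    INTEGER NOT NULL
    );
    INSERT INTO product VALUES (1, 'whizbang', '2007-01-01', '2008-06-30');
    INSERT INTO sale VALUES (1, '2008-03-15', 4);
""")

# Report for a past period: the end-dated "whizbang" still shows up,
# because reporting filters on dates instead of deleting rows.
report = conn.execute("""
    SELECT p.name, SUM(s.quantity)
      FROM sale s JOIN product p ON p.id = s.product_id
     WHERE s.sale_date BETWEEN '2008-01-01' AND '2008-03-31'
     GROUP BY p.name
""").fetchall()
print(report)   # -> [('whizbang', 4)]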
Many of our customers keep data online for 2 years. After that it's moved to backup disks, but it can be put online if needed.
Consider adding a column "expiration" or "effective date". This will allow you mark a product as obsolete, but reports will return that product if the time range is satisfied.
Usually it's better to move such data into a separate database (a data warehouse) and keep the working database clean. In a data warehouse your data can be kept for many years without impacting your application.
Reference: Data Warehouse at Wikipedia
I've always gone by what the ruling body is looking for. For example, the IRS wants you to keep 7 years of history, or for security reasons we keep 3 years of log information, etc. So I guess you could do two things: determine what the life span of your data is (I would say 3 years would be enough), and then add the is_deleted flag along with a date, so you can flag some data to be deleted sooner rather than later.
Yes, all data has a lifespan. And yes, it is relative to the type of data you have.
Some data has a lifespan measured in seconds (authentication tokens, for instance), while other data lasts a virtual eternity (longer than the media and formats it is stored in; ownership records, for instance).
You will have to either be more specific as to the type of data you are envisioning, or do a census in your own organization as to the usual lifespan of stuff.
Our particular flavor varies. We have some data (a vast majority) which goes stale after 3 months (hard product limit) but can be revived at any later date.
We have other data that is effectively immortal.
In practice, most of the data we serve up is fresh and frequently requested for a few weeks, at most a month, before falling to sporadic use.
How much is "a lot of dead data"?
With processing power and data storage so cheap, I wouldn't purge old data unless there's a really good reason to. You also need to consider the legal implications. Large (and even small) companies may have incredibly long retention policies for old data, to save themselves millions down the road when they are subpoenaed for it by a judge.
I would check with whatever legal department you have and find out how long the data needs to be stored. That's the safest bet.
Also, ask yourself what the benefit of removing the old data is. Is the only benefit a tidier database? If so, I wouldn't do it. Are you going to see a 10X performance increase? If so, I'd do it. This really is a complex question though, and it's tough for us to have all the information required to give you good advice.
I have a few projects where the customer wants all the historical data (going back over 19 years). Quite a bit of the really old data is malformed and is going to be a nightmare to import into the new system. We convinced them that they won't need records going back any further than 10 years, but like you said it's all relative to the type of data you're housing.
On a side note, data storage is extremely cheap right now, and if it isn't affecting the performance of your application, I would just leave it where it is.
[...] but certainly all data has some sort of life span?
Not any kind of life span we can talk about meaningfully. A lot of data is useless as soon as it's created or recorded. Such data could be discarded immediately with no effect. On the other hand, some data has enough value that it will outlive the current system that hosts it. If Amazon were to completely replace their current infrastructure, the customer histories they have stored would still be immensely valuable.
As you said, it's relative. Each type of data has its own life span that has no relation to another type of data's life span. There's no meaningful "average life span of data".
I have the potential of housing a lot of dead data. (data that is never accessed again).
But it will be accessed: when they run those reports, they are accessing that data.
Until then you'll need to keep the data in some form. Move to another table or have a switch like you mentioned.
uh...at the risk of oversimplifying...it sounds like using DateDeleted instead of a bit would solve your how-long-to-keep issue.

Inventory database design [closed]

This is a question not really about "programming" (it is not specific to any language or database), but more about design and architecture. It's also a question of the type "What is the best way to do X?". I hope it doesn't cause too much "religious" controversy.
In the past I have developed systems that, in one way or another, keep some form of inventory of items (it's not relevant what the items are). Some used languages/DBs that do not support transactions. In those cases I opted not to save the item's quantity on hand in a field on the item record. Instead, the quantity on hand is calculated as total inventory received minus total inventory sold. This has resulted in almost no inventory discrepancies caused by software. The tables are properly indexed and the performance is good. There is an archiving process in case the number of records starts to affect performance.
Now, a few years ago I started working at this company, and I inherited a system that tracks inventory. But the quantity is saved in a field. When an entry is registered, the quantity received is added to the quantity field for the item. When an item is sold, the quantity is subtracted. This has resulted in discrepancies. In my opinion this is not the right approach, but the previous programmers here swear by it.
I would like to know if there is a consensus on what the right way is to design such a system. Also, what resources are available, printed or online, to seek guidance on this?
Thanks
I have seen both approaches at my current company and would definitely lean towards the first (calculating totals based on stock transactions).
If you are only storing a total quantity in a field somewhere, you have no idea how you arrived at that number. There is no transactional history and you can end up with problems.
The last system I wrote tracks stock by storing each transaction as a record with a positive or negative quantity. I have found it works very well.
The Data Model Resource Book, Vol. 1: A Library of Universal Data Models for All Enterprises
The Data Model Resource Book, Vol. 2: A Library of Data Models for Specific Industries
The Data Model Resource Book: Universal Patterns for Data Modeling
I have Vol 1 and Vol 2 and these have been pretty helpful in the past.
It depends; inventory systems are about far more than just counting items. For example, for accounting purposes you might need to know the accounting value of inventory based on a FIFO (first-in-first-out) model. That can't be calculated by a simple "total inventory received minus total inventory sold" formula. But their model might calculate this easily, because they modify the accounting value as they go. I don't want to go into details because this is not a programming issue, but if they swear by it, maybe you haven't fully understood all the requirements they have to accommodate.
Both are valid, depending on the circumstances. The former is best when the following conditions hold:
the number of items to sum is relatively small
there are few or no exceptional cases to consider (returns, adjustments, et al)
the inventory item quantity is not needed very often
On the other hand, if you have a large number of items, several exceptional cases, and frequent access, it will be more efficient to maintain the item quantity.
Also note that if your system has discrepancies, then it has bugs, which should be tracked down and eliminated.
I have done systems both ways, and both ways can work just fine - as long as you don't ignore the bugs!
It's important to consider the existing system and the cost and risk of changing it. I work with a database that stores inventory kind of like yours does, but it includes audit cycles and stores adjustments just like receipts. It seems to work well, but everyone involved is well trained, and the warehouse staff aren't exactly quick to learn new procedures.
In your case, if you're looking for a little more tracking without changing the whole DB structure, then I'd suggest adding a tracking table (kind of like your 'transaction' solution) and then logging changes to the inventory level. It shouldn't be too hard to make most changes to the inventory level also leave a transaction record. You could also add a periodic task to back up the inventory level to the transaction table every couple of hours or so, so that even if you miss a transaction you can discover when the change happened or roll back to a previous state.
If you want to see how a large application does it, take a look at SugarCRM; they have an inventory management module, though I'm not sure how it stores the data.
I think this is actually a general best-practices question about doing a (relatively) expensive count every time you need a total vs. doing that count every time something changes, then storing the count in a field and reading that field whenever you need a total.
If I couldn't use transactions, I would go with the live count every time I needed a total. If transactions are available, it would be safe to perform the inventory update operations and the saving of the re-counted total within the same transaction, which would ensure the accuracy of the count (although I'm not sure this would work with multiple users hitting the database).
But if performance is not really a huge problem (and modern databases are good enough at counting rows that I would rarely even worry about this) I'd just stick with the live count each time.
I would opt for the first way, where "the quantity on hand is calculated totaling inventory received - total of inventory sold". The Right Way, IMO.
EDIT: I would also want to factor in any stock losses/damages into the system, but I'm sure you have that covered.
I've worked on systems that solve this problem before. I think the ideal solution is a precomputed column, which gets you the best of both worlds. Your total would be a field somewhere, thus no expensive lookups, but it can't get out of sync with the rest of your data (the database maintains the integrity). I don't remember which RDBMSs support precomputed columns, but if you don't have transactions, that might not be available either.
You could potentially fake precomputed columns (very effectively... I see no downside) using triggers. You'd probably need transactions though. IMHO, keeping data integrity when you're doing this sort of controlled denormalization is the only legitimate use for a trigger.
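As a rough sketch of the trigger idea, assuming a transactions table with signed quantities (shown in SQLite via Python; other DBMSs have different trigger syntax, and all names are invented):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (
        id       INTEGER PRIMARY KEY,
        on_hand  INTEGER NOT NULL DEFAULT 0     -- denormalized running total
    );
    CREATE TABLE inventory_txn (
        item_id   INTEGER REFERENCES item(id),
        quantity  INTEGER NOT NULL              -- positive = received, negative = sold
    );

    -- Keep the total in step whenever a transaction row is inserted.
    CREATE TRIGGER txn_after_insert AFTER INSERT ON inventory_txn
    BEGIN
        UPDATE item SET on_hand = on_hand + NEW.quantity WHERE id = NEW.item_id;
    END;

    INSERT INTO item (id) VALUES (1);
    INSERT INTO inventory_txn VALUES (1, 100);   -- received 100
    INSERT INTO inventory_txn VALUES (1, -37);   -- sold 37
""")
print(conn.execute("SELECT on_hand FROM item WHERE id = 1").fetchone())  # -> (63,)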
Django-inventory is geared more toward fixed assets, but it might give you some ideas.
i.e.: ItemTemplate (class) -> ItemsOnHand (instance)
ItemsOnHand can be linked to more ItemTemplates; for example, a printer and the ink cartridges it requires. This also allows you to set reorder points for each ItemOnHand.
Each ItemsOnHand is linked to InventoryTransactions; this allows for easy auditing.
To avoid calculating actual on-hand items from thousands of inventory transactions, checkpoints are used, which are just a balance plus a date. To calculate items on hand, query for the most recent checkpoint and add or subtract the transactions recorded since then to find the current balance. Define new checkpoints periodically.
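A rough sketch of that checkpoint calculation (SQLite via Python, invented names): on-hand stock is the latest checkpoint balance plus the sum of transactions recorded after it.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory_txn (
        item_id   INTEGER,
        txn_date  TEXT,
        quantity  INTEGER                -- positive = received, negative = issued
    );
    CREATE TABLE checkpoint (
        item_id   INTEGER,
        as_of     TEXT,
        balance   INTEGER
    );
    INSERT INTO checkpoint VALUES (1, '2024-01-01', 500);
    INSERT INTO inventory_txn VALUES (1, '2024-01-05',  40);
    INSERT INTO inventory_txn VALUES (1, '2024-01-07', -25);
""")

def on_hand(conn, item_id):
    # Most recent checkpoint for the item...
    as_of, balance = conn.execute(
        "SELECT as_of, balance FROM checkpoint "
        "WHERE item_id = ? ORDER BY as_of DESC LIMIT 1", (item_id,)).fetchone()
    # ...plus everything that happened since.
    delta = conn.execute(
        "SELECT COALESCE(SUM(quantity), 0) FROM inventory_txn "
        "WHERE item_id = ? AND txn_date > ?", (item_id, as_of)).fetchone()[0]
    return balance + delta

print(on_hand(conn, 1))   # -> 515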
I can see some benefit to having the two columns, but I'm not following the part about discrepancies - you seem to be implying that having the two columns (in and out) is less prone to discrepancy than a single column (current). Why is that?
It's not about having one or two columns; what I meant by "totaling inventory received - total of inventory sold" is something like this:
SELECT SUM(quantity) AS inventory_received FROM Inventory_entry
SELECT SUM(quantity) AS inventory_sold FROM Sales_items
then
Quantity_on_hand = inventory_received - inventory_sold
Please keep in mind that I oversimplified this in my initial explanation. I know there is much more to inventory than just keeping track of quantities, but in this case that's where the problem lies and what we want to fix. At this point, the reason to change it is precisely the cost of supporting the problems caused by the current design.
Also, I wanted to mention that although this is not a "coding" question, it is related to algorithms and design, which IMHO are very important topics.
Thanks everybody for your answers so far.
Nelson Marmol
We solve different problems, but our approach to some of them might be interesting to you.
We allow the system to make a "best guess", and give the users regular feedback about any of those guesses that look wrong.
To apply this to inventory, you could have 3 fields:
inventory_received
inventory_sold
estimated_on_hand
Then, you could run a process (daily?) along the lines of:
SELECT *
FROM Inventory
WHERE estimated_on_hand != inventory_received - inventory_sold
Of course, this relies on users looking at this alert, and doing something about it.
Also, you could have a function to reset the inventory somehow, either by updating inventory_sold/received, or perhaps by adding another field, "inventory_adjustment", which could be positive or negative.
... just some thoughts. Hope it's helpful.

Design question: How would you design a recurring event system? [closed]

If you were tasked to build an event scheduling system that supported recurring events, how would you do it? How do you handle it when a recurring event is removed? How could you see when the future events will happen?
i.e. When creating an event, you could pick "repeating daily" (or weekly, yearly, etc).
One design per response please. I'm used to Ruby/Rails, but use whatever you want to express the design.
I was asked this at an interview, and couldn't come up with a really good response that I liked.
Note: was already asked/answered here. But I was hoping to get some more practical details, as detailed below:
If it was necessary to be able to comment or otherwise add data to just one instance of the recurring event, how would that work?
How would event changes and deletions work?
How do you calculate when future events happen?
I started by implementing some temporal expressions as outlined by Martin Fowler. These take care of figuring out when a scheduled item should actually occur. It is a very elegant way of doing it. What I ended up with was just built on top of what is in the article.
The next problem was figuring out how in the world to store the expressions. The other issue is, once you read an expression back out, how does it fit into a not-so-dynamic user interface? There was talk of just serializing the expressions into a BLOB, but it would be difficult to walk the expression tree to know what was meant by it.
The solution (in my case) is to store parameters that fit the limited number of cases the User Interface will support, and from there, use that information to generate the Temporal Expressions on the fly (could serialize when created for optimization). So, the Schedule class ends up having several parameters like offset, start date, end date, day of week, and so on... and from that you can generate the Temporal Expressions to do the hard work.
As for having instances of the tasks, there is a 'service' that generates tasks for N days. Since this is an integration to an existing system and all instances are needed, this makes sense. However, an API like this can easily be used to project the recurrences without storing all instances.
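For a flavour of what such temporal expressions can look like, here is a stripped-down sketch of the idea from Fowler's article (not his actual code): small predicate objects that answer "does this date match?", composed to express a recurrence rule.

import datetime as dt
from abc import ABC, abstractmethod

class TemporalExpression(ABC):
    @abstractmethod
    def includes(self, date: dt.date) -> bool: ...

class DayOfWeek(TemporalExpression):
    def __init__(self, weekday: int):        # Monday = 0 ... Sunday = 6
        self.weekday = weekday
    def includes(self, date):
        return date.weekday() == self.weekday

class WeekOfMonth(TemporalExpression):
    def __init__(self, week: int):           # 1 = first week of the month
        self.week = week
    def includes(self, date):
        return (date.day - 1) // 7 + 1 == self.week

class And(TemporalExpression):
    def __init__(self, *exprs):
        self.exprs = exprs
    def includes(self, date):
        return all(e.includes(date) for e in self.exprs)

# "Second Tuesday of every month" as a composed expression.
second_tuesday = And(DayOfWeek(1), WeekOfMonth(2))
print(second_tuesday.includes(dt.date(2024, 1, 9)))    # True
print(second_tuesday.includes(dt.date(2024, 1, 16)))   # False

The Schedule parameters mentioned above (offset, start date, end date, day of week, and so on) would map onto constructors like these when the expressions are generated on the fly.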
I've had to do this before when I was managing the database end of the project. I requested that each occurrence be stored as a separate event. This allows you to remove just one occurrence, or you could move a span. It's a lot easier to remove multiples than to try to modify a single occurrence and turn it into two. We were then able to make another table which simply had a recurrenceID and contained the information about the recurrence.
#Joe Van Dyk asked: "Could you look in the future and see when the upcoming events would be?"
If you wanted to see/display the next n occurrences of an event, they would have to either a) be calculated in advance and stored somewhere, or b) be calculated on the fly and displayed. This would be the same for any eventing framework.
The disadvantage with a) is that you have to put a limit on it somewhere and after that you have to use b). Easier just to use b) to begin with.
The scheduling system does not need this information, it just needs to know when the next event is.
When saving the event I would save the schedule to a store (let's call it "Schedules"), and I'd calculate when the event was to fire the next time and save that as well, for instance in "Events". Then I'd look in "Events", figure out when the next event was to take place, and go to sleep until then.
When the app "wakes up" it would calculate when the event should take place again, store this in "Events" again and then perform the event.
Repeat.
If an event is created while sleeping the sleep is interrupted and recalculated.
If the app is starting or recovering from a sleep event or similar, check "Events" for passed events and act accordingly (depending on what you want to do with missed events).
Something like this would be flexible and would not take unnecessary CPU cycles.
Off the top of my head (after revising a couple things while typing/thinking):
Determine the minimum recurrence-resolution needed; that's how often the app runs. Maybe it's daily, maybe every five minutes.
For each recurring event, store the most recent run time, the run-interval and other goodies like expiration time if that's desirable.
Every time the app runs, it checks all events, comparing (today/now + recurrenceResolution) to (recentRunTime + runInterval) and if they coincide, fire the event.
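A minimal sketch of that comparison, assuming each event simply stores its last run time and a fixed interval (field names invented):

import datetime as dt

def due_events(events, now, resolution):
    """Events whose next run falls within the current tick of the scheduler."""
    return [ev for ev in events
            if ev["recent_run_time"] + ev["run_interval"] <= now + resolution]

events = [
    {"name": "daily report",  "recent_run_time": dt.datetime(2024, 1, 1, 6, 0),
     "run_interval": dt.timedelta(days=1)},
    {"name": "weekly backup", "recent_run_time": dt.datetime(2024, 1, 1, 6, 0),
     "run_interval": dt.timedelta(weeks=1)},
]
now = dt.datetime(2024, 1, 2, 6, 0)
resolution = dt.timedelta(minutes=5)
print([ev["name"] for ev in due_events(events, now, resolution)])  # -> ['daily report']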
When I wrote a calendar app for myself mumble years ago, I basically just stole the scheduling mechanism from cron and used that for recurring events. e.g., Something taking place on the second Saturday of every month except January would include the instruction "repeat=* 2-12 8-14 6" (every year, months 2-12, the 2nd week runs from the 8th to the 14th, and 6 for Saturday because I used 0-based numbering for the days of the week).
While this makes it quite easy to determine whether the event occurs on any given date, it is not capable of handling "every N days" recurrence and is also rather less than intuitive for users who aren't unix-savvy.
To deal with unique data for individual event instances and removal/rescheduling, I just kept track of how far out events had been calculated for and stored the resulting events in the database, where they could then be modified, moved, or deleted without affecting the original recurrent event information. When a new recurring event was added, all instances were immediately calculated out until the existing "last calculated" date.
I make no claim that this is the best way to do it, but it is a way, and one which works quite well within the limitations I mentioned earlier.
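A hedged sketch of that kind of matching; the field layout ("year month day-of-month weekday", with Sunday as 0) follows the description above, and everything else is invented:

import datetime as dt

def field_matches(spec: str, value: int) -> bool:
    if spec == "*":
        return True
    if "-" in spec:
        lo, hi = (int(x) for x in spec.split("-"))
        return lo <= value <= hi
    return value == int(spec)

def occurs_on(repeat: str, date: dt.date) -> bool:
    """repeat = 'year month day-of-month weekday', weekday numbered with Sunday = 0."""
    year, month, day, weekday = repeat.split()
    sun0_weekday = (date.weekday() + 1) % 7          # convert Python's Monday = 0
    return (field_matches(year, date.year) and field_matches(month, date.month)
            and field_matches(day, date.day) and field_matches(weekday, sun0_weekday))

# "Second Saturday of every month except January": every year, months 2-12,
# days 8-14 (the second week), weekday 6 (Saturday).
print(occurs_on("* 2-12 8-14 6", dt.date(2024, 2, 10)))   # True
print(occurs_on("* 2-12 8-14 6", dt.date(2024, 2, 17)))   # False (third Saturday)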
If you have a simple recurring event, such as daily, weekly, or a couple of days a week, what's wrong with using the built-in scheduler/cron/at functionality? Create an executable/console app and set up when to run it. No complicated calendar, event, or time management.
:)
//W

Resources