I am currently working on a problem where we are ranking a customer demographic by absolute value of sales or by the YoY (year-over-year) change in sales.
Looking at both metrics together tells us whether the growth we are seeing is actually substantial. For example, a state may show 200% YoY growth by the end of 2021, but if its sales only rose from $100 to $300, that is not meaningful growth compared with the hundred-thousand-dollar figures we usually see for other states.
I am looking for ways to combine the effect of both these metrics (absolute value and YoY change) and create a composite metric which can be used to rank my customer demographic.
A common way to combine two metrics is to normalise them to the same range, e.g. 0..1, and then take a weighted average. However, there's always a question of what the weights should be and, more importantly, what such a metric means. From my experience, I'd advise against trying to combine metrics this way.
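For concreteness, here is a minimal sketch of that normalise-and-weight approach in Python/pandas, with made-up per-state figures and arbitrary weights (which is exactly where the trouble starts):

```python
import pandas as pd

# Hypothetical per-state figures.
df = pd.DataFrame({
    "state": ["A", "B", "C"],
    "sales": [250_000, 120_000, 300],
    "yoy_change": [0.05, 0.40, 2.00],   # 5%, 40%, 200%
})

# Min-max normalise each metric to the 0..1 range.
for col in ("sales", "yoy_change"):
    span = df[col].max() - df[col].min()
    df[col + "_norm"] = (df[col] - df[col].min()) / span

# Weighted average of the two normalised metrics; the weights are arbitrary.
w_sales, w_yoy = 0.7, 0.3
df["composite"] = w_sales * df["sales_norm"] + w_yoy * df["yoy_change_norm"]

print(df.sort_values("composite", ascending=False))
```

The $300 state ranks last here despite its 200% growth, but an equally defensible weight choice (say 0.3/0.7) would put it first, which is the objection above: the ranking reflects the weights as much as the data.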
I'm trying to use Apache Superset to create a dashboard that will display the average rate X/Y for different entities, such that the time grain can be changed on the fly. However, all I have available as raw data is daily totals of X and Y for the entities in question.
It would be simple to do if I could just get a line chart that displayed sum(X)/sum(Y) as its own metric, where the sum range would change with the time grain, but that doesn't seem to be supported.
Creating a function in SQLAlchemy that calculates the daily rates and uses those as the raw data is also insufficient, since averaging those daily rates over different time ranges would not be properly weighted.
Is there a workaround I'm not seeing?
Is there a way to use Druid or some other tool to make displaying a quotient over a variable range possible?
My current best solution is to just set up different charts for each time grain size (day, month, quarter, year), but that's extremely inelegant and I'm hoping to do better.
There are multiple ways to do this. One is to use the Metric editor, in which case the metric definition is stored as part of the chart.
Another way is to define the metric in the datasource editor, where it is stored with the datasource definition and becomes reusable by any chart that uses this datasource.
Side note: depending on the database you use, you may have to CAST from, say, an integer to a numeric type, or multiply by 100, in order to get a useful result.
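To illustrate the weighting issue the asker ran into, here is a small pandas sketch (outside Superset, with made-up daily totals) comparing SUM(x)/SUM(y) aggregated at a monthly grain against the average of pre-computed daily rates; the column names and data are assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical daily totals of X and Y for one entity.
rng = np.random.default_rng(0)
daily = pd.DataFrame(
    {"x": rng.integers(0, 50, size=90), "y": rng.integers(1, 100, size=90)},
    index=pd.date_range("2021-01-01", periods=90, freq="D"),
)

# A metric defined as SUM(x)/SUM(y) re-aggregates correctly at any grain.
monthly = daily.resample("MS").sum()
ratio_of_sums = monthly["x"] / monthly["y"]

# Averaging pre-computed daily rates weights every day equally, regardless of volume.
avg_of_daily_rates = (daily["x"] / daily["y"]).resample("MS").mean()

print(pd.DataFrame({"ratio_of_sums": ratio_of_sums,
                    "avg_of_daily_rates": avg_of_daily_rates}))
```

A metric written as the ratio of sums behaves like the first column, which is usually what "average rate" should mean here; the second column drifts whenever daily volumes differ.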
I want to do an experiment involving the use of additive noise for protecting a database from inference attacks.
My database should begin by generating a specific list of values with a mean of 25. Then I will anonymize these values by adding random noise designed to have an expected value of 0.
For example:
I can use uniformly distributed noise in the range [-1, 1], or Normal (Gaussian) noise with mean 0.
I will test this anonymization method on databases of 100, 1,000, and 10,000 values with different noise distributions.
I am not sure which platform to use or how to proceed, so I started with 10 values in Excel: for uniformly distributed noise I use RAND() and add it to the actual value, and for normal noise I use NORM.INV with mean 0 and add that to the actual value.
But I don't know how to interpret the data from the hacker's side. When I add noise to the dataset, how can I interpret its effect on privacy as the dataset becomes larger?
Also, should I use a database tool to handle this problem?
From what I understand, you're trying to protect your "experimental" database from inference attacks.
Attackers try to steal information from a database, using queries that are already allowed for public use. First, try to decide on your identifiers, quasi-identifiers and sensitive values.
Consider a student management system that stores the GPA of each student. We know that the GPA is sensitive information. The identifier is "student_id", and the quasi-identifiers are "standing" and, let's say, "gender". In most cases, the administrator of the RDBMS allows aggregate queries such as "Get the average GPA of all students" or "Get the average GPA of senior students". Attackers try to infer information from these aggregate queries. If, somehow, there is only one senior student, then the query "Get the average GPA of senior students" returns the GPA of that one specific person.
There are two main ways to protect the database from this kind of attack: de-identification and anonymization. De-identification means removing every identifier and quasi-identifier from the database. But this does not work in some cases. Consider one student who takes a make-up exam after grades are announced. If you get the average GPA of all students before and after they take the exam and compare the results, you'd see a small change (say, from 2.891 to 2.893). The attacker can infer that one student's make-up result from this 0.002 difference in aggregate GPA, because multiplying the difference by the number of students recovers how much that single student's own GPA changed.
The other approach is anonymization. With k-anonymity, you divide the database into groups that each contain at least k entities. For example, 2-anonymity ensures that there are no groups with a single entity in them, so aggregate queries over single-entity groups no longer leak private information.
Unless you are one of only two entities in a group.
If there are 2 senior students in a class and you want to know the average grade of the seniors, 2-anonymity allows you to have that information. But if you are one of those seniors and already know your own grade, you can infer the other student's grade.
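A tiny made-up example of that inference, just to make the arithmetic explicit:

```python
# Hypothetical values: a 2-anonymous aggregate over exactly two seniors.
avg_senior_gpa = 3.2   # published by the "average GPA of seniors" query
my_gpa = 3.6           # the attacker is one of the two seniors

# The other senior's GPA follows directly from the definition of the average.
other_gpa = 2 * avg_senior_gpa - my_gpa
print(other_gpa)       # ~2.8 (up to floating-point rounding)
```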
Adding noise to sensitive values is a way to deal with those attacks, but if the noise is too low it has almost no effect on the information leaked (e.g. for grades, knowing someone has 57 out of 100 instead of 58 makes almost no difference), and if it's too high it results in a loss of functionality.
You've also asked how to interpret the effect on privacy as the dataset becomes larger. If you take the average of an extremely large dataset, the zero-mean noise averages out and the result you find reflects the true sensitive values (this could be a little complex, but think of the dataset as infinite and the values the sensitive attribute can take as finite, then calculate the probabilities). Adding noise with zero mean works, but the range of the noise should get wider as the dataset gets larger, because averaging over more records cancels out more of the noise.
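A minimal simulation of that effect, assuming toy data with a true mean of 25 (as in the question) and the two noise types mentioned; the function name and parameters here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_shift_from_noise(n, noise="uniform", scale=1.0):
    """How far zero-mean noise of a fixed scale moves the dataset's mean."""
    true_values = rng.normal(loc=25, scale=5, size=n)            # assumed toy data
    if noise == "uniform":
        noisy = true_values + rng.uniform(-scale, scale, size=n)
    else:
        noisy = true_values + rng.normal(0, scale, size=n)
    return abs(noisy.mean() - true_values.mean())

for n in (100, 1_000, 10_000):
    print(n, mean_shift_from_noise(n, "uniform"), mean_shift_from_noise(n, "gaussian"))
```

The shift shrinks as n grows, so aggregates over a large dataset are barely perturbed by fixed-range noise; that is why the noise range (or the mechanism itself) has to scale with the protection you actually need.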
Lastly, if you are using Excel, which is a spreadsheet rather than an RDBMS, I suggest you come up with a way to use equivalents of SQL queries, define the identifiers and quasi-identifiers of the dataset, and decide on the public queries that can be performed by anyone.
Also, in addition to k-anonymity, take a look at the "diversity" (l-diversity) and "closeness" (t-closeness) of a dataset and their use in database anonymization.
I hope that answers your question.
If you have further questions, please ask.
In a scenario where I need to use the entire date (i.e. day, month, and year) as a whole, and never need to extract the day, month, or year part of the date in my database application program, what is the best practice:
Making Date an atomic attribute
Making Date a composite attribute (composed of day, month, and year)
Edit:- The question can be generalized as:
Is it a good practice to make composite attributes where possible, even when we need to deal with the attribute as a whole only?
Actually, the specific question and the general question are significantly different, because the specific question refers to dates.
For dates, the component elements aren't really part of the thing you're modelling - a day in history - they're part of the representation of the thing you're modelling - a day in the calendar that you (and most of the people in your country) use.
For dates, I'd say it's best to store them in a single date-type field.
For the generalized question I would generally store them separately. If you're absolutely sure that you'll only ever need to deal with it as a whole, then you could use a single field. If you think there's a possibility that you'll want to pull out a component for separate use (even just for validation), then store them separately.
With dates specifically, the vast majority of modern databases store and manipulate dates efficiently as a single Date value. Even in situations where you do want to access the individual components of the date, I'd recommend using a single Date field.
You'll almost inevitably need to do some sort of date arithmetic eventually, and most database systems and programming languages give some sort of functionality for manipulating dates. These will be easier to use with a single date variable.
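As a small illustration (in Python's datetime rather than any particular RDBMS), a single atomic date value still gives you the components and the arithmetic for free:

```python
from datetime import date, timedelta

invoice_date = date(2021, 12, 31)      # stored as one atomic value

# The components are still available whenever you need them...
print(invoice_date.year, invoice_date.month, invoice_date.day)   # 2021 12 31

# ...and date arithmetic works directly on the single value.
due_date = invoice_date + timedelta(days=30)
print(due_date)                        # 2022-01-30
print((due_date - invoice_date).days)  # 30
```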
With dates, the entire date identifies the primary real-world thing you're modelling.
The day / month / year are attributes of that single thing, but only for a particular way of describing it - the western calendar.
However, the same day can be represented in many different ways - the Unix epoch, the Gregorian calendar, a lunar calendar; in some calendars we're in a completely different year. All of these representations can differ, yet refer to the same individual real-world day.
So, from a modelling point of view, and from a database / programmatic efficiency point of view, for dates, store them in a single field as far as possible.
For the generalisation, it's a different question.
Based on experience, I'd store them as separate components. If you were really, really sure you'd never want to access the component information, then yes, one field would be fine - for as long as you're right. But if there's even a possibility you'll need to break the information up, I personally would separate the components from the start.
It's much easier to join fields together than to separate components out of a single string. That's both from a programming / algorithmic viewpoint and from a compute-resource point of view.
Some of the most painful problems I've had in programming have been trying to decompose a single field into component elements. They'd initially been stored as one element, and by the time the business changed enough to realise they needed the components... it had become a decent sized challenge.
Most composite data items aren't like dates. Where a date is a single item that is sometimes (OK, generally in the western world) represented by a Day-Month-Year composite, most composite data elements actually represent several concrete items, and only the combination of those items uniquely identifies a particular thing.
For example a bank account number (in New Zealand, anyway) is a bit like this:
A bank number - 2 or 3 digits
A branch number - 4 to 6 digits
An account / customer number - 8 digits
An account type number - 2 or 3 digits.
Each of those elements represents a single real world thing, but together they identify my account.
You could store these as a single field, and it'd largely work. You might decide to use a hyphen to separate the elements, in case you ever needed to split them apart.
If you really never need to access a particular piece of that information then you'd be good with storing it as a composite.
But if, 3 years down the track, one bank decides to charge a higher rate or needs different processing, or if you want to run a regional promotion keyed on the branch number, now you have a different challenge: you'll need to pull out that information. We chose hyphens as separators, so you'll have to parse each row into its component elements to find them. (These days disk is pretty cheap, so if you do this you'll store the parsed components. In the old days storage was expensive, so you had to decide whether to pay to store it or pay to re-calculate it each time.)
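A quick sketch of what that parsing looks like, using made-up hyphen-separated account strings (the field values are illustrative only):

```python
# Composite account numbers stored as one hyphen-separated field each.
rows = ["01-3456-0123456-00", "01-3457-0765432-02", "02-3456-0999999-00"]

# Finding the accounts for one branch now means splitting every single row.
branch_accounts = [r for r in rows if r.split("-")[1] == "3456"]
print(branch_accounts)   # ['01-3456-0123456-00', '02-3456-0999999-00']

# With separate bank/branch/account/type columns, the same filter would be a
# plain (indexable) lookup on the branch column instead of a per-row parse.
```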
Personally, in the bank account case (and probably the majority of other examples that I can think of) I'd store them separately, and probably set up reference tables to allow validation to happen (e.g. you can't enter a bank that we don't know about).
My project looks like this: my data set is a bunch of profiles of people, with various attributes, e.g. boolean hasJob and int healthScore, and their income. Using this data, I'm trying to predict their income for the future. Each profile also has a history: e.g., what their attributes and income were in the past.
So in essence I'm trying to map multiple sets of (x booleans, y numbers) to a number (salary in the coming year).
I've considered neural networks, Bayes nets, and genetic algorithms for function-fitting. Any suggestions or input?
Thanks in advance!
--Emily
What you want to do is called "time series modeling". However, you probably have only very little data per series (per person). I think it is difficult to find one model that fits every person, because you would be making general assumptions, e.g. that everyone is equally career-oriented.
This is also a very noisy target: you might have to take into account whether someone is a sweet-talker or not, and how would you measure such a thing? I'm pretty sure your current attributes contain enough noise to make it difficult to predict anything. When you say health score, do you mean physical health only, or mental health as well? In different businesses different things are important. What about the business or industry they are working in, and its health and growth potential? I would assume this strongly influences their income.
I also suspect you have dependencies in the other direction, as attributes can be (and likely are) influenced by your target variable - e.g. people with higher income have better health.
This sounds like a very complex and difficult problem, and definitely not one where "I naively grouped my data and tried a bunch of methods" is going to give meaningful results. I would suggest learning more about time series modeling and, especially, about the data that you have. Maybe start by clustering people by their initial attributes and seeing how they develop. Are there any variables that correlate with this development?
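A minimal sketch of that clustering starting point, assuming the profiles live in a pandas DataFrame; the column names, toy values, and KMeans settings are all made up for illustration:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical initial attributes and incomes per person.
profiles = pd.DataFrame({
    "hasJob":      [1, 0, 1, 1, 0, 1],
    "healthScore": [72, 55, 90, 60, 48, 81],
    "income":      [52_000, 18_000, 91_000, 40_000, 15_000, 75_000],
})

# Cluster people by their starting attributes (scaled so units don't dominate).
features = StandardScaler().fit_transform(profiles[["hasJob", "healthScore"]])
profiles["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Then compare how income develops within each cluster over time.
print(profiles.groupby("cluster")["income"].describe())
```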
What is your research question?
I created an application a few days ago that deals with invoicing. I would like to know the best way to integrate a discount into my invoices. Should I put it as a negative item (in the invoice_items table), or should I create a "discount" column in the invoice table?
I would have it as a negative-valued item. The reasons are:
With invoicing, it's very important that the calculated value remains constant forever; even if your calculation formula later changes, you can correctly reproduce any given invoice. This is true even if the value was incorrectly calculated at the time - it was what it was.
Having a value amount means that manual adjustments for exceptional circumstances are easily handled - e.g. your marketing manager/accountant may decide to give a one-off discount of $100 because of a late delivery. This is trivial with negative values - just add another row - but difficult and a hassle with discount rates
You can have multiple discount amounts per invoice
It's totally flexible - it has its own space to exist and be whatever it needs to be. In fact, I would make the discount another "product" (maybe even multiple products - one for each distinct discount reason, e.g. xmas, coupon, referral, etc.)
With its own item, you can add a reason description just like any other "product" - eg "10% discount for paying cash" or whatever
You don't need any special code or database columns! Just total items up as before and print them on the invoice. "There is no spoon (discount)": It's just another line item - what could be more simple than no code/db changes required?
Not all items should be discounted - e.g. refunds, returns, subscriptions (if applicable). It becomes too complicated, and it's unnecessary to represent the business logic of discounts in the database. Leave the calculation in the app code and store the result in the db
Having its own item means the calculation can be arbitrarily complex. This means no db maintenance as the complexity grows; it's a whole lot easier to maintain/alter code than it is to maintain/alter a database
Finally, I have successfully built an invoicing system taking the "item" approach, and it worked really well
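A tiny sketch of how simple the totalling stays with the discount as just another line item; the table and column names here are assumptions:

```python
# Each invoice is a set of rows in invoice_items; a discount is a negative-valued row.
invoice_items = [
    {"invoice_id": 1, "description": "Widget x 3",                    "amount": 120.00},
    {"invoice_id": 1, "description": "Delivery",                      "amount": 15.00},
    {"invoice_id": 1, "description": "10% discount for paying cash",  "amount": -13.50},
]

# Totalling needs no discount-specific code at all.
total = sum(item["amount"] for item in invoice_items)
print(f"Invoice total: ${total:.2f}")   # Invoice total: $121.50
```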
What consequences would either of those choices have for you down the road? For example, would you like to have multiple discounts, or very specific discounts, later on? If there will only ever be one discount per invoice, then I wouldn't make it any more complicated than it needs to be. In my opinion it's easier and clearer to have it in the invoice table - having it as a negative item will make the processing of items more difficult, I think.
I fully agree with making it as simple as possible, but one thing to consider is whether any items should be exempt from the discount. In that case you need to add a boolean field to the line items to record which lines the discount applies to.