I have a column that has a bunch of random comment data in it. I need to parse out some specific things that are in this column. Here is a sample of one of the rows.
Effective: 07/01/2020 Seller Type: Dealer 9 Month Premium | Inventory = 220 Promo Intro Rate = $5.00 Budget Cap = $15.00 (Fixed $5.00 + Variable $10.00) Months 1-3 = $1,100.00 (inventory X intro) Months 4-9 (varies) = $1,100.00 fixed - $3,300.00 budget cap Additional Fulfillment Details TBD: See BB/KB for questions!
One other thing to note: each comment row is different; the information isn't uniform, isn't always in the same order, and doesn't even always contain the same data.
I need to parse out the "Budget cap" rate and the base rate ($1,100.00 in this case) into two extra columns. Any ideas on how to do this would be much appreciated.
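One idea, as a rough sketch: run a regular expression per value over the comment column. The patterns below are guesses based on the single sample above (they grab the amount after "Budget Cap =" and the "Months 1-3" amount; if the budget cap you actually need is the $3,300.00 figure, the first pattern would have to anchor on the trailing "budget cap" instead), so they would need adjusting as you see more row variations:

    import re

    comment = ("Effective: 07/01/2020 Seller Type: Dealer 9 Month Premium | Inventory = 220 "
               "Promo Intro Rate = $5.00 Budget Cap = $15.00 (Fixed $5.00 + Variable $10.00) "
               "Months 1-3 = $1,100.00 (inventory X intro) "
               "Months 4-9 (varies) = $1,100.00 fixed - $3,300.00 budget cap "
               "Additional Fulfillment Details TBD: See BB/KB for questions!")

    def extract_amount(pattern, text):
        # Return the first dollar amount matched by the pattern, or None if it is absent.
        m = re.search(pattern, text, flags=re.IGNORECASE)
        return float(m.group(1).replace(",", "")) if m else None

    budget_cap = extract_amount(r"Budget Cap\s*=\s*\$([\d,]+(?:\.\d+)?)", comment)   # 15.0
    base_rate = extract_amount(r"Months 1-3\s*=\s*\$([\d,]+(?:\.\d+)?)", comment)    # 1100.0

Run row by row over the column (for example with pandas' Series.str.extract and the same patterns), the two results become the two extra columns, and rows where a pattern doesn't match simply come back empty.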
I am currently designing a star schema for a reporting database where an online product's performance is measured. The challenge is that I receive information which in principle measures the same facts (visits, purchases) and has the same dimensions (user gender, user age, day), but with varying granularity depending on the source. For example, given a total of 10 visits:
Source A returns a single line per day for the performance in the format:
Visits, Purchases, Gender, Age Range, Day (total visits = 15)
Source B returns two lines for a single day as it does not allow the combination of gender and age:
Visits, Purchases, Gender, Day (total visits = 10)
Visits, Purchases, Age Range, Day (total visits = 10)
The issue is, if I store them in the same fact table, I will get incorrect values when applying aggregate functions:
Day         Visits   Age     Gender   Source
19/04/2022  5        18-24   Male     A
19/04/2022  10       18-24   Female   A
19/04/2022  2        NULL    Male     B
19/04/2022  8        NULL    Female   B
19/04/2022  10       18-24   NULL     B
(The sum of the Visits column would count 20 for Source B even though we only have 10 visits for this source; they just appear twice due to the different data structure.)
Is there a best practice for cases where dimensions and facts are generally the same, but the raw data granularity is different?
You typically can only present the combined data at a grain that's compatible with all the sources, so (Day), (Age, Day), or (Gender, Day).
Alternatively, you could "allocate" the Source B data, say by applying the day's gender split to each age group. The totals would work, but the drill-down wouldn't be meaningful.
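As a rough illustration of the first option (the frames below are invented to match the table above, and Source B is represented only by its gender breakdown to avoid the double counting): roll every source up to the coarsest compatible grain, here (Gender, Day), before loading one fact table.

    import pandas as pd

    # Hypothetical frames matching the table above, not the real feeds.
    source_a = pd.DataFrame({
        "day": ["19/04/2022", "19/04/2022"],
        "visits": [5, 10],
        "age": ["18-24", "18-24"],
        "gender": ["Male", "Female"],
    })
    source_b_gender = pd.DataFrame({   # Source B's gender rows only
        "day": ["19/04/2022", "19/04/2022"],
        "visits": [2, 8],
        "gender": ["Male", "Female"],
    })

    # Aggregate Source A down to the grain both sources can support: (Gender, Day).
    a_at_grain = source_a.groupby(["day", "gender"], as_index=False)["visits"].sum()

    # Source B's gender breakdown is already at that grain, so the union is additive.
    fact = pd.concat([a_at_grain.assign(source="A"),
                      source_b_gender.assign(source="B")], ignore_index=True)

    print(fact.groupby("day")["visits"].sum())   # 25 = 15 (A) + 10 (B), no double counting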
I need to implement a measure that indicates sales volume per product per day. For the example table below (each line is a record of a sale):
id,create_date,report_date,quantity
329,2019-01-02 08:19:17,2019-01-02 14:34:12,6
243,2019-01-02 09:11:42,2019-01-03 15:30:14,6
238,2019-02-02 08:19:17,2019-03-02 14:36:17,2
170,2019-04-02 02:15:17,2019-04-02 14:37:12,2
238,2019-04-02 08:43:11,2019-04-02 14:41:01,8
238,2019-04-02 08:52:52,2019-04-02 14:39:12,1
238,2019-08-02 08:10:09,2019-08-02 15:02:12,1
238,2019-10-02 08:10:17,2019-10-02 18:34:11,1
170,2020-01-02 08:24:14,2020-01-02 19:31:31,2
170,2020-01-02 08:32:16,2020-01-02 21:52:32,3
The operations to reach the result:
1. Identify total sales and total products for each day.
For 2019-01-02, two sales were carried out, totaling 12 products (6 products for each sale on the day)
2. Divide total products by total sales, resulting in the product/sale ratio for the day (if the result is 2, it indicates that each sale on average corresponds to two products).
In the example table there are 6 different dates (YYYY-MM-DD); for each corresponding date: total products / number of sales on the day (12/2, 2/1, 11/3, 1/1, 1/1, 5/1).
3. Average every day's story (the daily ratio), resulting in a single value.
(3 + 2 + 3.6 + 1 + 1 + 3)/6 = 2.26 , indicating that on average two products are sold per sale per day.
As it involves several operations, I couldn't work out a solution for this problem; any help is appreciated.
Note: I'm open to alternative suggestions for a measure that indicates the volume of sales per product per day.
Please check the numbers given in your steps 2 and 3:
12/2=6 not 3
5/1 must be 5/2
I still think that you want to calculate a 'day story' in step 2; see the formula below.
Here are the steps for generating such a value:
1. Create a table.
2. Add your time column as a dimension and set its type to Date (not Date & Time).
3. Order by date ascending (optional).
4. Create a field "day story" with the formula sum(quantity)/count(id).
5. Add this field three times to your table.
6. Click the AUT to the left of the field name and set Running calculation to "Running average".
You have to convince your users to only look at the last line of the table.
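If you can compute the measure outside the reporting tool, the same 'day story' and its average take only a few lines; a rough sketch in pandas, built from the sample rows above and using the corrected ratios:

    import pandas as pd

    sales = pd.DataFrame({
        "id": [329, 243, 238, 170, 238, 238, 238, 238, 170, 170],
        "create_date": pd.to_datetime([
            "2019-01-02 08:19:17", "2019-01-02 09:11:42", "2019-02-02 08:19:17",
            "2019-04-02 02:15:17", "2019-04-02 08:43:11", "2019-04-02 08:52:52",
            "2019-08-02 08:10:09", "2019-10-02 08:10:17",
            "2020-01-02 08:24:14", "2020-01-02 08:32:16",
        ]),
        "quantity": [6, 6, 2, 2, 8, 1, 1, 1, 2, 3],
    })

    day = sales["create_date"].dt.date
    per_day = sales.groupby(day).agg(total_products=("quantity", "sum"),
                                     total_sales=("id", "count"))
    per_day["day_story"] = per_day["total_products"] / per_day["total_sales"]  # products per sale, per day
    print(per_day["day_story"].mean())  # (6 + 2 + 11/3 + 1 + 1 + 2.5) / 6 ≈ 2.69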
Hi all,
I am trying to design a shared worksheet that measures salespeople's performance over a period of time. In addition to calculating number of units, sales price, and profit, I am trying to calculate how many new accounts were sold in the month (ideally, I'd like to be able to change the timeframe so I can calculate larger periods like quarter, year, etc.).
In essence, I want to find out whether a customer was sold to in the 12 months before the present month, and if not, to see the customer number and the salesperson who sold to them.
So far, I was able to calculate that by adding three columns that each calculate a part of the process (see screenshot below):
Column H (SoldLastYear) - Shows customers that were sold in the year before this current month: =IF(AND(B2>=(TODAY()-365),B2<(TODAY()-DAY(TODAY())+1)),D2,"")
Column I (SoldNow) - Shows the customers that were sold this month, and if they are NOT found in column H, show "New Cust": =IFNA(IF(B2>TODAY()-DAY(TODAY()),VLOOKUP(D2,H:H,1,FALSE),""),"New Cust")
Column J (NewCust) - If Column I shows "New Cust", show me the customer number: =IF(I2="New Cust",D2,"")
Column K (SalesName) - if Column I shows "New Cust", show me the salesperson name: =IF(I2="New Cust",C2,"")
Does anyone have an idea how I can make this more efficient? Could an array formula work here, or will it be stuck in a loop since it's referring to other lines in the same column?
Any help would be appreciated!!
EDIT: Here is what I'm trying to achieve:
Instead of:
having Column H show me what was sold in the 12 months before the 1st day of the current month (for today's date: 8/1/19-7/31/20);
having Column I show me what was sold in August 2020; and
having Column I search Column H to see if that customer was sold in the timeframe specified in Column H,
I want one column that does all three: a single column that flags all sales made in the last 12 months from the beginning of the current month (so, 8/1/19 to 8/27/20), then compares sales made in the current month (August) with the sales made before it, and tells me the first time a customer shows up in the current month IF it doesn't appear in the 12 months prior --> i.e. it finds the new customers after a dormant period of 12 months.
I'm really just trying to find a way to make the formula better and less resource-consuming. With a large dataset, the three columns (copied a few times for different timeframes) really slow Excel down...
Here is an example of the end result:
[Screenshot: example of the final product]
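For what it's worth, the same logic expressed outside Excel makes the intent easier to see; a rough pandas sketch (the column names sale_date, customer, and salesperson are assumptions about your sheet):

    import pandas as pd

    def new_customers(df, as_of=None):
        # Customers with a sale in the current month but none in the prior 12 months.
        # Assumed columns: sale_date (datetime), customer, salesperson.
        as_of = pd.Timestamp(as_of) if as_of is not None else pd.Timestamp.today()
        month_start = as_of.normalize().replace(day=1)
        window_start = month_start - pd.DateOffset(months=12)

        prior = df[(df["sale_date"] >= window_start) & (df["sale_date"] < month_start)]
        current = df[(df["sale_date"] >= month_start) & (df["sale_date"] <= as_of)]

        fresh = current[~current["customer"].isin(prior["customer"])]
        # First sale per customer in the current month, with the salesperson who made it.
        return (fresh.sort_values("sale_date")
                     .drop_duplicates("customer")[["customer", "salesperson", "sale_date"]])

Changing the lookback window to a quarter or a year is then just a different DateOffset.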
I want to store trades as well as best ask/bid data, where the latter updates much more rapidly than the former, in InfluxDB.
I want to, if possible, use a schema that allows me to query: "for each trade on market X, find the best ask/bid on market Y whose timestamp is <= the timestamp of the trade".
(I'll use any version of Influx.)
For example, trades might look like this:
Time Price Volume Direction Market
00:01.000 100 5 1 foo-bar
00:03.000 99 50 0 bar-baz
00:03.050 99 25 0 foo-bar
00:04.000 101 15 1 bar-baz
And tick data might look more like this:
Time Ask Bid Market
00:00.763 100 99 bar-baz
00:01.010 101 99 foo-bar
00:01.012 101 98 bar-baz
00:01.012 101 99 foo-bar
00:01.238 100 99 bar-baz
...
00:03.021 101 98 bar-baz
I would want to be able to somehow join each trade for some market, e.g. foo-bar, with only the most recent ask/bid data point on some other market, e.g. bar-baz, and get a result like:
Time Trade Price Ask Bid
00:01.000 100 100 99
00:03.050 99 101 98
Such that I could compute the difference between the trade price on market foo-bar and the most recently quoted ask or bid on market bar-baz.
Right now, I store trades in one time series and ask/bid data points in another and merge them on the client side, with logic along the lines of:
    def merge(trades, quotes, data_points):
        next_trade, more_trades = trades[0], trades[1:]
        # drop quotes until quotes[0] is the most recent quote at or before the trade
        while len(quotes) > 1 and quotes[1].timestamp <= next_trade.timestamp:
            quotes = quotes[1:]
        data_point = join(next_trade, quotes[0])
        if more_trades:
            return merge(more_trades, quotes, data_points + [data_point])
        return data_points + [data_point]
The problem is that the client has to discard tons of ask/bid data points because they update so frequently, and only the most recent update before the trade is relevant.
There are tens of markets whose most recent ask/bid I might want to compare a trade with, otherwise I'd simply store the most recent ask/bid in the same series as the trades.
Is it possible to do what I want with Influx, or with another time series database? An alternative solution that produces lower-quality results is to group the ask/bid data by some time interval, say 250ms, and take the last point from each interval, to at least impose an upper bound on the number of quotes the client has to drop before finding the one closest to the next trade.
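For reference, the 250ms fallback described above could be a single query per market; a rough sketch with the InfluxDB 1.x Python client (the database, measurement, field, and tag names are invented to mirror the examples):

    from influxdb import InfluxDBClient

    client = InfluxDBClient(host="localhost", port=8086, database="market")

    # Keep only the last ask/bid per 250ms bucket; fill(none) drops empty buckets.
    result = client.query(
        "SELECT last(ask) AS ask, last(bid) AS bid FROM tick_data "
        "WHERE time >= now() - 1h AND \"Market\" = 'bar-baz' "
        "GROUP BY time(250ms) fill(none)"
    )

    downsampled_quotes = list(result.get_points())
    # At most one quote per 250ms interval remains for the client-side merge above.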
NB. Just a clarification on InfluxDB terminology. You're probably storing trade and tick data in different measurements (analogous to tables). A series is a subdivision within a measurement based on tag values, e.g.
Time Ask Bid Market
00:00.763 100 99 bar-baz
is one series
Time Ask Bid Market
00:01.010 101 99 foo-bar
is another series (assuming you are storing the Market name/id as a tag and not a field).
Answer
InfluxQL https://docs.influxdata.com/influxdb/v1.7/query_language/spec/ - I can't think of a way to achieve what you need with InfluxQL (Influx Query Language) as it does not support joins.
Perhaps what you could do on the client side is, instead of requesting all tick data for a period and discarding most of it, make a request per trade and market to get exactly the ask/bid data point that you need (the most recent with respect to the trade). Something like:
    def merge(trades, market):
        points = []
        for next_trade in trades:
            quote = db.query("select last(ask), last(bid) from tick_data where time <= next_trade.timestamp and Market = market and time > next_trade.timestamp - 1m")
            # or, to get one result per market with a single query:
            # quote_per_market = db.query("select last(ask), last(bid) from tick_data where time <= next_trade.timestamp group by Market")
            points = points + [join(next_trade, quote)]
        return points
Of course you'd have the overhead of querying the database more frequently, but depending on the number of trades and your resource constraints it may be more efficient. NB. A potential pitfall here is that the ask and bid retrieved this way are not fetched as a pair but independently, and although they are returned together, they could come from different timestamps. If for some timestamp you only have an ask or only a bid price, you might run into this problem. However, as long as you write them in pairs and have no missing data, it should be OK.
Flux https://www.influxdata.com/products/flux/ - Flux is a more sophisticated query language, available in InfluxDB 1.7 and 2, that allows you to do joins and operations across different measurements. I can't give you any examples yet, but it's worth having a look at.
Other (relational) time series DBs that you could look at, which would also let you do joins, are CrateDB https://crate.io/ or Postgres + TimescaleDB https://www.timescale.com/products
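For reference, in Postgres + TimescaleDB the "most recent quote at or before each trade" lookup can be written as a lateral join; a rough sketch (the connection string, table, and column names are invented to mirror the examples above):

    import psycopg2  # assumes Postgres/TimescaleDB with hypothetical 'trades' and 'quotes' tables

    ASOF_SQL = """
    SELECT t.time, t.price AS trade_price, q.ask, q.bid
    FROM trades t
    CROSS JOIN LATERAL (
        SELECT ask, bid
        FROM quotes
        WHERE market = %(quote_market)s AND time <= t.time
        ORDER BY time DESC
        LIMIT 1
    ) q
    WHERE t.market = %(trade_market)s
    ORDER BY t.time;
    """

    with psycopg2.connect("dbname=ticks") as conn, conn.cursor() as cur:  # placeholder DSN
        # For each foo-bar trade, fetch the latest bar-baz quote at or before the trade time.
        cur.execute(ASOF_SQL, {"trade_market": "foo-bar", "quote_market": "bar-baz"})
        for time, trade_price, ask, bid in cur.fetchall():
            print(time, trade_price, ask, bid)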
I am working on a Cassandra data model for storing time series (I'm a Cassandra newbie).
I have two applications: intraday stock data and sensor data.
The stock data will be saved with a time resolution of one minute.
Seven data fields make up one time frame:
Symbol, Datetime, Open, High, Low, Close, Volume
I will query the data mostly by Symbol and Date. e.g. give me all data for AAPL between 2013-01-01 and 2013-01-31 ordered by Datetime.
The recommendation for Cassandra queries is to query whole columns. So you could create five rows with the keys Open, High, Low, Close, Volume, and a column of its own for each Symbol and Minute, e.g. "AAPL:2013-01-04T130400Z".
This would result in a table of five rows and n*nT columns (n: number of symbols, nT: number of minutes).
Most of the time I will query date ranges. I.e. all minutes of a day. So I could rearrange the data to have columns named "AAPL:2013-01-04" and rows: OpenT130400Z, HighT130400Z, LowT130400Z, CloseT130400Z, VolumeT130400Z.
This would result in a table with n*nD columns (n: number of Symbols, nD: number of Days) and 5*nM rows (nM: number of minutes/entries per day).
To sum up: I have columns, which hold the information for a whole day for one symbol.
I have found a description of how to deal with time series data in Cassandra here: http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
But I don't really get whether they use the hour (1332960000) as a column name or as a row key.
I understood that they use the hour as the row key and the small time steps as columns, so they would have a fixed number of columns. But that would have disadvantages when reading, because I would have to do a range query on row keys! Am I right?
Second question:
If I have sensor data, which is much more fine grained than 1 minute stock data (let's say I have to save timesteps with a resolution of microseconds) how would I deal with this?
If I use columns for saving a composite of sensor channel and hour, and rows for the microseconds since the last hour, this would result in 3,600,000,000 rows and n*nH columns (n: number of sensors, nH: number of hours).
I could not use the microseconds since the last hour for columns because I would have 3.6 billion points, which is more than the allowed 2 billion columns.
Did I get that right?
What do you think about this problem? How would you solve it?
Thank you!
Best,
Malte
So I have a suggestion for your first question about the stock data. A naive implementation might look like this:
RowKey: the stock symbol (e.g. AAPL)
Column format:
Name: the current datetime, granular to a minute
Value: a composite column of Open, High, Low, Close, Volume
So you would have something like
AAPL = [2013-05-02-15:38:00 | 441.78:448.59:440.63:15066146:445.52] ... [2013-05-02-15:39:00 | 441.78:448.59:440.63:15066146:445.52] ... [2013-05-02-15:40:00 | 441.78:448.59:440.63:15066146:445.52]
That would give you roughly half a million columns in one year, so it might be OK for maybe 4 years; I wouldn't attempt to hit the 2 billion limit. What you could do is define a splitting factor on the row key. It all depends on your usage pattern, but a simple one might be by year, so the column family entry might look like this with a composite row key, which would guarantee that you always have fewer than a million columns per row.
AAPL:2013 = [05-02-15:38:00 | 441.78:448.59:440.63:15066146:445.52] ... [05-02-15:39:00 | 441.78:448.59:440.63:15066146:445.52] ... [05-02-15:40:00 | 441.78:448.59:440.63:15066146:445.52]
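For what it's worth, here is a rough sketch of the same split-by-year idea in CQL3 terms, via the Python driver (keyspace, table, and column names are invented; the composite partition key (symbol, year) plays the role of the composite row key above, and the minute timestamp is the clustering column):

    from datetime import datetime
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("marketdata")  # hypothetical keyspace

    session.execute("""
        CREATE TABLE IF NOT EXISTS minute_bars (
            symbol text,
            year   int,
            ts     timestamp,
            open   double, high double, low double, close double, volume bigint,
            PRIMARY KEY ((symbol, year), ts)
        ) WITH CLUSTERING ORDER BY (ts ASC)
    """)

    # "All data for AAPL between 2013-01-01 and 2013-01-31, ordered by Datetime"
    # hits a single partition and reads a contiguous slice of it.
    rows = session.execute(
        "SELECT * FROM minute_bars WHERE symbol = %s AND year = %s AND ts >= %s AND ts < %s",
        ("AAPL", 2013, datetime(2013, 1, 1), datetime(2013, 2, 1)),
    )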