Date       | State      | City     | Zip  | Water  | Weight
-----------|------------|----------|------|--------|-------
01/01/2016 | Arizona    | Chandler | 1011 | 10 ltr | 40 kg
01/04/2016 | Arizona    | Mesa     | 1012 | 20 ltr | 50 kg
06/05/2015 | Washington | Spokane  | 1013 | 30 ltr | 44 kg
06/08/2015 | Washington | Spokane  | 1013 | 30 ltr | 44 kg
What I want are complex queries: for example, the average water and weight for a given city, state, or zip over a date range or a month, or grouped by any field or by all fields.
I am not sure how to go about this. I have read about MapReduce, but I can't see how it would produce the output above.
A link to examples covering these scenarios would also help.
Thanks in advance
So first we need to model your structured data in JSON. Something like this would work:
{
  "date": "2016-01-01",
  "location": "Arizona Chandler",
  "pressure": 1101,
  "water": 10,
  "weight": 40
}
Here's your data in a Cloudant database:
https://reader.cloudant.com/so37613808/_all_docs?include_docs=true
Next we'd need to create a MapReduce view to aggregate a specific field by date. A map function that creates an index whose key is the date and whose value is the water would look like this:
function(doc) {
  emit(doc.date, doc.water);
}
Every key/value pair emitted from the map function is added to an index which can be queried later in its entirety or by a range of keys (keys which in this case represent a date).
And if an average is required we would use the built-in _stats reducer. The Map and Reduce portions are expressed in a Design Document like this one: https://reader.cloudant.com/so37613808/_design/aggregate
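That design document pairs the map function above with the built-in _stats reduce. As a rough sketch only (the account URL and credentials below are placeholders, and the requests library is an assumption, not part of the original answer), it could be created from Python like this:

import requests

# Sketch of a design document combining the map function with the built-in _stats reducer.
design_doc = {
    "_id": "_design/aggregate",
    "views": {
        "waterbydate": {
            "map": "function(doc) { emit(doc.date, doc.water); }",
            "reduce": "_stats"
        }
    }
}

# USERNAME and PASSWORD are placeholders for the account hosting the database.
resp = requests.put(
    "https://USERNAME.cloudant.com/so37613808/_design/aggregate",
    json=design_doc,
    auth=("USERNAME", "PASSWORD"),
)
print(resp.status_code, resp.json())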
The subsequent index allows us to get an aggregate across the whole data set with:
https://reader.cloudant.com/so37613808/_design/aggregate/_view/waterbydate
Dividing the sum by the count gives us an average.
We can use the same index to provide data grouped by keys too:
https://reader.cloudant.com/so37613808/_design/aggregate/_view/waterbydate?group=true
Or we can select a portion of the data by supplying startkey and endkey parameters:
https://reader.cloudant.com/so37613808/_design/aggregate/_view/waterbydate?startkey=%222016-01-01%22&endkey=%222016-06-03%22
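As a small illustration of doing this programmatically, here is a Python sketch (assuming the requests library; any HTTP client works) that fetches the reduced view for that date range and derives the average from the _stats output:

import requests

# Query the reduced view for a date range; _stats returns sum, count, min, max, sumsqr.
url = "https://reader.cloudant.com/so37613808/_design/aggregate/_view/waterbydate"
params = {"startkey": '"2016-01-01"', "endkey": '"2016-06-03"'}

rows = requests.get(url, params=params).json()["rows"]
stats = rows[0]["value"]

# Average water over the selected date range = sum / count.
print(stats["sum"] / stats["count"])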
See https://docs.cloudant.com/creating_views.html for more details.
I have data in 2 dimensions (let's say, time and region) counting the number of visitors on a website for a given day and region, as per the following:
time       | region  | visitors
-----------|---------|---------
2021-01-01 | Europe  | 653
2021-01-01 | America | 849
2021-01-01 | Asia    | 736
2021-01-02 | Europe  | 645
2021-01-02 | America | 592
2021-01-02 | Asia    | 376
...        | ...     | ...
2021-02-01 | Asia    | 645
...        | ...     | ...
I would like to create a table showing the average daily worldwide visitors for each month, that is:
time    | visitors
--------|---------
2021-01 | 25238
2021-02 | 16413
This means, I need to aggregate the data this way:
first, sum over regions for distinct dates
then, calculate average on dates
I was thinking of taking a global average of all rows of data for each month and then multiplying the value by the number of days in the month, but since that number is variable I can't do it.
Is there any way to do this?
Create 2 calculated fields:
Month(time)
SUM(visitors)/COUNT(DISTINCT(time))
In case it might help someone... so far (January 2021) it seems there is no way to do that in Data Studio. Neither calculated fields nor data blending offers a GROUP BY-like function.
So, I found 2 alternative solutions:
create an additional table in my data with the first aggregation (sum over regions). This gives a table with the number of visitors for each date.
Then I import it into Data Studio and do the second aggregation there.
since my data is stored in BigQuery, a custom SQL query can be used to create another data source from the same dataset. This way, a GROUP BY statement can be used to sum over regions before the average is calculated (a sketch of such a query is shown below).
These solutions have a big drawback: I cannot add controls to filter by region (since data from all regions is aggregated before it enters Data Studio).
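For the second option, a hedged sketch of such a query: the inner query sums visitors over regions per day, the outer one averages those daily totals per month. The project, dataset, table, and column names (my_project.analytics.daily_visitors, time, region, visitors) are placeholders. The same SQL can be pasted into the custom-query box of a BigQuery data source; here it is run with the google-cloud-bigquery client for illustration:

from google.cloud import bigquery

# Placeholder table: `my_project.analytics.daily_visitors` with columns
# time (DATE), region (STRING), visitors (INT64).
sql = """
SELECT
  FORMAT_DATE('%Y-%m', time) AS month,
  AVG(daily_total) AS avg_daily_worldwide_visitors
FROM (
  SELECT time, SUM(visitors) AS daily_total
  FROM `my_project.analytics.daily_visitors`
  GROUP BY time
) AS per_day
GROUP BY month
ORDER BY month
"""

client = bigquery.Client()  # uses default application credentials
for row in client.query(sql).result():
    print(row.month, row.avg_daily_worldwide_visitors)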
I recently got into databases and normalization. I am still trying to understand normalization and I am confused about its usage, so I'll try to explain it with an example.
Every day I collect data which would look like this in a single table:
TABLE: CAR_ALL
ID  | DATE       | CAR      | LOCATION  | FUEL | FUEL_USAGE | MILES  | BATTERY
----|------------|----------|-----------|------|------------|--------|--------
123 | 01.01.2021 | Toyota   | New York  | 40.3 | 3.6        | 79321  | 78
520 | 01.01.2021 | BMW      | Frankfurt | 34.2 | 4.3        | 123232 | 30
934 | 01.01.2021 | Mercedes | London    | 12.7 | 4.7        | 4321   | 89
123 | 05.01.2021 | Toyota   | New York  | 34.5 | 3.3        | 79515  | 77
520 | 05.01.2021 | BMW      | Frankfurt | 20.1 | 4.6        | 123489 | 29
934 | 05.01.2021 | Mercedes | London    | 43.7 | 5.0        | 4400   | 89
In this example I get data for thousands of cars every day. ID, CAR and LOCATION never change; all the other values can change daily. If I understood correctly, normalizing would make it look like this:
TABLE: CAR_CONSTANT
ID  | CAR      | LOCATION
----|----------|----------
123 | Toyota   | New York
520 | BMW      | Frankfurt
934 | Mercedes | London
TABLE: CAR_MEASUREMENT
GUID | ID  | DATE       | FUEL | FUEL_USAGE | MILES  | BATTERY
-----|-----|------------|------|------------|--------|--------
1    | 123 | 01.01.2021 | 40.3 | 3.6        | 79321  | 78
2    | 520 | 01.01.2021 | 34.2 | 4.3        | 123232 | 30
3    | 934 | 01.01.2021 | 12.7 | 4.7        | 4321   | 89
4    | 123 | 05.01.2021 | 34.5 | 3.3        | 79515  | 77
5    | 520 | 05.01.2021 | 20.1 | 4.6        | 123489 | 29
6    | 934 | 05.01.2021 | 43.7 | 5.0        | 4400   | 89
I have two questions:
Does it make sense to create an extra table for DATE?
It is possible that new cars will appear in the collected data.
For every row I insert into CAR_MEASUREMENT, I would have to check whether the ID is already in CAR_CONSTANT, and insert it if it doesn't exist.
But that means I would have to check CAR_CONSTANT thousands of times every day. Wouldn't it be more efficient to just insert everything as one row into CAR_ALL? Then I wouldn't have to check CAR_CONSTANT every time.
The benefits of normalization depend on your specific use case. I can see both pros and cons to normalizing your schema, but it's impossible to say which is better without more knowledge of your use case.
Pros:
With your schema, normalization could reduce the amount of data consumed by your DB since CAR_MEASUREMENT will probably be much larger than CAR_CONSTANT. This scales up if you are able to factor out additional data into CAR_CONSTANT.
Normalization could also improve data consistency if you ever begin tracking additional fixed data about a car, such as license plate number. You could simply update one row in CAR_CONSTANT instead of potentially thousands of rows in CAR_ALL.
A normalized data structure can make it easier to query data for a specific car. Using a LEFT JOIN, the DBMS can search the CAR_MEASUREMENT table on the integer ID column instead of having to compare two string columns (see the sketch below).
Cons:
As you noted, the normalized form requires an additional lookup and possible insert to CAR_CONSTANT for every addition to CAR_MEASUREMENT. Depending on how fast you are collecting this data, those extra queries could be too much overhead.
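To make both of those points concrete, here is a minimal sketch using SQLite from Python. The table and column names come from the question; the file name cars.db is made up, and the exact upsert syntax differs per DBMS (e.g. INSERT ... ON CONFLICT DO NOTHING in PostgreSQL):

import sqlite3

conn = sqlite3.connect("cars.db")
conn.execute("""CREATE TABLE IF NOT EXISTS CAR_CONSTANT (
                    ID INTEGER PRIMARY KEY, CAR TEXT, LOCATION TEXT)""")
conn.execute("""CREATE TABLE IF NOT EXISTS CAR_MEASUREMENT (
                    GUID INTEGER PRIMARY KEY AUTOINCREMENT,
                    ID INTEGER REFERENCES CAR_CONSTANT(ID),
                    DATE TEXT, FUEL REAL, FUEL_USAGE REAL,
                    MILES INTEGER, BATTERY INTEGER)""")

# The "lookup + possible insert" collapses into a single statement:
# INSERT OR IGNORE only inserts when the ID is not already present.
row = (123, "Toyota", "New York", "2021-01-05", 34.5, 3.3, 79515, 77)
conn.execute("INSERT OR IGNORE INTO CAR_CONSTANT (ID, CAR, LOCATION) VALUES (?, ?, ?)", row[:3])
conn.execute("""INSERT INTO CAR_MEASUREMENT (ID, DATE, FUEL, FUEL_USAGE, MILES, BATTERY)
                VALUES (?, ?, ?, ?, ?, ?)""", (row[0],) + row[3:])
conn.commit()

# Querying one car joins on the integer ID instead of comparing string columns.
for r in conn.execute("""SELECT c.CAR, m.DATE, m.FUEL, m.MILES
                         FROM CAR_CONSTANT c
                         LEFT JOIN CAR_MEASUREMENT m ON m.ID = c.ID
                         WHERE c.ID = ?""", (123,)):
    print(r)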
To answer your questions directly:
I would not create an extra table for just the date. The date is a part of the CAR_MEASUREMENT data and should not be separated. The only exception that I can think of to this is if you will eventually collect measurements that do not contain any car data. In that case, then it would make sense to split CAR_MEASUREMENT into separate MEASUREMENT and CAR_DATA tables with MEASUREMENT containing the date, and CAR_DATA containing just the car-specific data.
See above. If you have a use case to query data for a specific car, then the normalized form can be more efficient. If not, then the additional INSERT overhead may not be worth it.
Currently I have an Excel table that looks like this:
A     B       C           D             E   F     G
ID    NAME    DATE        ITEM              2020  3
1234  Alex    09-20-2020  Carrot            2019  2
1234  Alex    09-20-2020  Onion
1234  Alex    09-20-2019  Carrot
1234  Alex    09-20-2019  Mushroom
1234  Alex    09-20-2020  Pasta
1345  Morgan  09-20-2020  Pasta
1345  Morgan  09-20-2020  Tomato Sauce
1145  Jayson  09-20-2020  Tomato Sauce
1145  Jayson  09-20-2020  Cream Sauce
1345  Morgan  09-20-2019  Pasta
1345  Morgan  09-20-2019  Tomato Sauce
I want to be able to count the unique customers for each year using Excel functions, so that the formulas can be transferred to a different computer without setting up custom functions.
The process can currently be done in Excel without functions by adding a filter to each column, filtering to show only the intended year, using Remove Duplicates to remove duplicates in NAME, and finally counting the rows (giving the results seen in G2 & G3). However, I want to be able to do that through Excel functions. So far I am able to count unique values through
{=SUM(IF(FREQUENCY(IF(LEN(B2:B12)>0,MATCH(B2:B12,B2:B12,0),""),IF(LEN(B2:B12)>0,MATCH(B2:B12,B2:B12,0),""))>0,1))}
Additionally, I am also able to use SUMPRODUCT() for counting an array with multiple conditions, so for now I have combined the above formula with
SUMPRODUCT((YEAR(C2:C12)=G1)+0)
My initial idea was to put the first formula inside SUMPRODUCT(), since the first formula also produces an array that it could count. However, that did not work, as it did not count the unique values corresponding to the year.
My question is whether there is any kind of grouping function so that I can take the unique values within a year, without transforming the data (through filters or deletion of duplicates). My current understanding of SUMPRODUCT() is that it will only look for unique values in the entire column, not within the range given for the first array.
You have numeric IDs, which you should make use of. If you have Excel O365, in G1 use:
=COUNT(UNIQUE(FILTER(A$2:A$12,YEAR(C$2:C$12)=F1)))
With older versions, use this CSE-entered formula:
=SUM(--(FREQUENCY(IF(YEAR(C$2:C$12)=F1,A$2:A$12),A$2:A$12)>0))
And drag down.
I have data generated on a daily basis.
Let me explain through an example:
On the world market, the price of gold changes at second intervals, and I want to store that price in Redis.
Gold    22 JAN 11.02PM   X-Price
        22 JAN 11.03PM   Y-Price
        ...
        24 DEC 11.04PM   X1-Price
Silver  22 JAN 11.02PM   M-Price
        22 JAN 11.03PM   N-Price
I want to store this data on a daily basis and apply ML (machine learning) to the last 52 weeks of data. Is this possible?
As far as my knowledge goes, Redis works on key-value pairs.
If this is possible, can I get data for a specific date (04 July) and a date range (01 Feb to 31 Mar)?
In Redis, a Sorted Set is appropriate for time series data. If you score each entry with the timestamp of the price quote, you can quickly access a whole day or group of days using the ZRANGEBYSCORE command (or ZSCAN if you need paging).
The quote data can be stored right in the sorted set. If you do this, make sure each entry is unique: adding a record to a sorted set that is identical to an existing one just updates the existing record's score (timestamp). This moves the old record to the present and erases it from the past, which is not what you want.
I would recommend only storing a unique key/ID for each quote in the sorted set, and storing the data in its own key or hash field. This will allow you to create additional indexes for the data as needed and access specific records more easily if necessary.
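A rough sketch of that layout with the redis-py client (the key names prices:gold and quote:gold:... are made up for illustration):

import time
import redis

r = redis.Redis()

def add_quote(metal, price, ts=None):
    """Store the quote in its own hash and index its ID in a sorted set scored by timestamp."""
    ts = time.time() if ts is None else ts
    quote_id = f"quote:{metal}:{int(ts * 1000)}"           # unique ID per quote
    r.hset(quote_id, mapping={"price": price, "ts": ts})   # the quote data itself
    r.zadd(f"prices:{metal}", {quote_id: ts})               # index: score = timestamp

def quotes_between(metal, start_ts, end_ts):
    """Fetch all quotes whose timestamps fall within [start_ts, end_ts]."""
    ids = r.zrangebyscore(f"prices:{metal}", start_ts, end_ts)
    return [r.hgetall(i) for i in ids]

# Example: store one gold quote and read back everything from the last 24 hours.
now = time.time()
add_quote("gold", 1845.20, now)
print(quotes_between("gold", now - 86400, now))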
I downloaded these databases for US and CA from GeoNames. The data looks like this:
5881639 100 Mile House 100 Mile House 51.64982 -121.28594 P PPL CA 02 0 917 America/Vancouver 2006-01-18
5881640 101 Mile Lake 101 Mile Lake 51.66652 -121.30264 H LK CA 02 0 917 America/Vancouver 2006-01-18
5881641 101 Ponds 101 Ponds 47.811 -53.97733 H PNDS CA 05 0 18 America/St_Johns 2006-01-18
I want to use this data for a city-picker, but I want to display the province or state beside each city. It doesn't look like this data contains that information. Is there some way to retrieve it? Or is there a better DB that includes it?
Use the data in the admin code columns: these are actually IDs that link to the admin codes table (there are separate data sets available for the admin codes), so it is very straightforward.
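A minimal sketch of that join in Python, assuming the standard GeoNames layout (column 8 of the main dump is the country code, column 10 the admin1 code, and the separate admin1CodesASCII.txt file maps codes such as CA.02 to names such as British Columbia):

import csv

# Build a lookup from admin1 codes (e.g. "CA.02") to admin1 names (e.g. "British Columbia").
admin1 = {}
with open("admin1CodesASCII.txt", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
        admin1[row[0]] = row[1]

# Main dump (e.g. CA.txt): column 1 = place name, column 8 = country code, column 10 = admin1 code.
with open("CA.txt", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
        city, country, admin1_code = row[1], row[8], row[10]
        province = admin1.get(f"{country}.{admin1_code}", "")
        print(f"{city}, {province}")   # e.g. "100 Mile House, British Columbia"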
Check the Geonames forums for more info.
http://forum.geonames.org/
Use the datasets at geocoder.ca, which include the city name and state/province name in the same file.
If you want to stick with your data, you can use Google's Geocoding API, as in the first answer here:
Google Maps: how to get country, state/province/region, city given a lat/long value?
to get information based on latitude and longitude. This will be a lot of work, though, especially for a city-picker.