How to visualize geodata (points) in the Databricks built-in map?

I simply have a dataframe of "lat" and "lon" coordinates that I would like to visualize on a map.
I am trying to view my data on the map, but nothing is being displayed.
In my case, the mmsi is the id.
Please note that I have around 2M points to display. Is that possible in Databricks?
If there is no way to plot 2,000,000 points in Databricks, what tool can handle that amount of data?
Any help is much appreciated!!

The built-in map does not support latitude/longitude data points.
Delivering 2 million data points directly to a browser is problematic. Databricks has a protective limit of 20 MB for HTML display. Your viewers won't be able to take in and visually process that level of detail at any point in time.
I recommend implementing a filter/summarization strategy to enable the display within Databricks notebooks. Some ideas can be taken from this Databricks post: https://databricks.com/blog/2019/12/05/processing-geospatial-data-at-scale-with-databricks.html
Also, this Anaconda post gives an excellent survey of data visualization packages; some (like Datashader and Vaex) implement a visual summarization strategy before rendering images: https://www.anaconda.com/blog/python-data-visualization-2018-why-so-many-libraries
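One possible summarization strategy is to bin the points into a coarse latitude/longitude grid and display one count per cell instead of every point. The following is a minimal pure-Python sketch of that idea; the function name `bin_points` and the cell size are illustrative assumptions, and on 2M rows you would more likely express the same grouping in Spark before collecting the result.

```python
import math
from collections import Counter


def bin_points(points, cell_deg=0.5):
    """Count points per grid cell.

    points   -- iterable of (lat, lon) pairs in degrees
    cell_deg -- cell size in degrees (hypothetical default)
    Returns {(cell_center_lat, cell_center_lon): count}.
    """
    counts = Counter()
    for lat, lon in points:
        # floor() keeps negative coordinates in the correct cell
        row = math.floor(lat / cell_deg)
        col = math.floor(lon / cell_deg)
        counts[(row, col)] += 1
    # Report each cell by its center so it can be drawn as a single marker
    return {
        ((row + 0.5) * cell_deg, (col + 0.5) * cell_deg): n
        for (row, col), n in counts.items()
    }


sample = [(10.1, 20.1), (10.2, 20.3), (10.4, 20.4), (-33.9, 151.2)]
cells = bin_points(sample, cell_deg=0.5)
```

The number of cells, not the number of points, then determines how much data reaches the browser.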

The error message clearly says: "Unrecognizable values in the first column. The values should be either country codes in ISO 3166-1 alpha-3 format (e.g. "GBR") or US state abbreviations (e.g. "TX")."
Note: To plot a graph of the world, use country codes in ISO 3166-1 alpha-3 format as the key.
A Map Graph is a way to visualize your data on a map.
Plot Options... was used to configure the graph below.
Keys should contain the field with the location.
Series groupings is always ignored for World Map graphs.
Values should contain exactly one field with a numerical value.
Since there can be multiple rows with the same location key, choose "Sum", "Avg", "Min", "Max", or "COUNT" as the way to combine the values for a single key.
Different values are denoted by color on the map, and ranges are always spaced evenly.
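To illustrate the shape of data these options expect, here is a minimal sketch that combines rows sharing the same key into one numeric value per key. The helper `aggregate_by_key` is hypothetical, and the regular expression only checks that a key looks like a three-letter code; it does not verify that the code is a real ISO 3166-1 alpha-3 entry.

```python
import re
from collections import defaultdict

# Shape check only: three uppercase letters, e.g. "GBR"
ISO3_SHAPE = re.compile(r"^[A-Z]{3}$")

REDUCERS = {
    "sum": sum,
    "avg": lambda values: sum(values) / len(values),
    "min": min,
    "max": max,
    "count": len,
}


def aggregate_by_key(rows, how="sum"):
    """Combine (key, value) rows into one value per key."""
    groups = defaultdict(list)
    for key, value in rows:
        if not ISO3_SHAPE.match(key):
            raise ValueError(f"unrecognizable key: {key!r}")
        groups[key].append(value)
    reduce_values = REDUCERS[how]
    return {key: reduce_values(values) for key, values in groups.items()}


rows = [("GBR", 2), ("USA", 5), ("GBR", 3)]
totals = aggregate_by_key(rows, how="sum")
```

Latitude/longitude pairs fail this kind of key check, which is consistent with the error message quoted above.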
Reference: Databricks - Charts and Graph
Hope this helps.

Related

How do I achieve this output in a Google Data Studio report, given the raw data?

I have a small set of test records in Google Data Studio and am attempting to create a table that gives me breakdowns of particular values relative to the total number of values, by dimension. I think the image here below describes the need clearly. I have tried using an approach I saw online entailing creating a calculated field like this:
case when Action = 'Clicked' then 1 else 0 end
and then creating a metric based upon that field, which does the 'Percentage of Total - Relative to Corresponding Data' but this produces incorrect numbers and seems really cumbersome (needing one calculated field per distinct value). My client wants this exact tabular presentation, not a chart(s).
How do I achieve the desired report?
Thanks!
Solution entails creating fields like 'Opened' which outputs 1 when Action = 'Opened', 0 otherwise. Then create fields like 'Opened Rate' with e.g. sum(Opened) / Record Count. Then set those as percentages.
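The arithmetic behind those fields can be sketched outside Data Studio as follows. This is only an illustration of sum(flag) / Record Count per dimension value; the dimension name `Campaign` and the sample records are assumptions, since the original data is not shown.

```python
from collections import defaultdict


def action_rates(records, dimension, action_field="Action"):
    """Return {dimension_value: {action: share_of_records}}."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for rec in records:
        # Equivalent to a 1/0 flag per action, summed per dimension value
        counts[rec[dimension]][rec[action_field]] += 1
        totals[rec[dimension]] += 1
    return {
        dim: {action: n / totals[dim] for action, n in actions.items()}
        for dim, actions in counts.items()
    }


# Hypothetical test records
records = [
    {"Campaign": "A", "Action": "Opened"},
    {"Campaign": "A", "Action": "Clicked"},
    {"Campaign": "A", "Action": "Opened"},
    {"Campaign": "A", "Action": "Opened"},
    {"Campaign": "B", "Action": "Clicked"},
]
rates = action_rates(records, "Campaign")
```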

Google Sheets Stacked Chart with data structured as a DB table

On paper it was simple to do, but in the end I'm stuck.
Below is the chart I would like to produce (I took an easy example to show the objective):
On this chart we see that the series for each product are well colored and well structured for each cart.
Now, let's jump on my problem: the data structure.
In my world, data is coming from a database, so the data are structured like that:
I managed to aggregate the values, but I lose the colors of my series, and I absolutely need to identify the different product volumes visually.
Is there any possibility to reach the first stacked chart layout but with the second dataset format?
Yes, transfer your data and chart from there (I don't think there's a native solution for this, but happy to learn otherwise).
For example, use =QUERY([data],"select [cart], sum([qty]) group by [cart] pivot [product] label sum([qty]) ''",1) replacing the terms in the brackets with the respective column letters (e.g. B if cart is in column B).
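For readers who want to see what that pivot does to the data, here is a minimal Python sketch of the same reshaping from long rows (cart, product, qty) to one row per cart and one column per product. The function `pivot_sum` and the sample rows are illustrative assumptions. One difference from the spreadsheet: this sketch fills missing cart/product combinations with 0, whereas QUERY leaves them empty.

```python
from collections import defaultdict


def pivot_sum(rows):
    """Pivot a list of (cart, product, qty) rows into a wide table.

    Returns a list of lists: a header row, then one row per cart.
    """
    products = sorted({product for _, product, _ in rows})
    table = defaultdict(lambda: dict.fromkeys(products, 0))
    for cart, product, qty in rows:
        table[cart][product] += qty
    header = ["cart"] + products
    body = [[cart] + [table[cart][p] for p in products] for cart in sorted(table)]
    return [header] + body


# Hypothetical database-style rows
rows = [
    ("Cart 1", "Apple", 2),
    ("Cart 1", "Pear", 1),
    ("Cart 2", "Apple", 4),
    ("Cart 1", "Apple", 3),
]
wide = pivot_sum(rows)
```

Each product column in the wide table becomes its own series, which is what lets a stacked chart color the products separately.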

Converting sensor tag data in DSX

I'm working on converting the existing recipe for Data Science Experience (DSX) to use data from a connected Sensor Tag device. However, the mobile applications for that device send the data as strings rather than numerics, which causes the DSX recipe that calculates a Z score to choke. The data is coming from a Cloudant DB used as a historian for Watson IoT Platform, so I can't simply reformat it there. Is there a simple way to convert the data inside a DSX notebook?
Just access the row object and convert it:
cloudantdata.rdd.map(lambda row : float(row.temperature)).take(10)
EDIT 30.1.17:
To directly address your question:
df = cloudantdata.selectExpr("timestamp as timestamp", "data.d.objectTemp as temperature").rdd.map(lambda row: (row.timestamp, float(row.temperature)))
That way you get an RDD of tuples, which IMHO is more usable than an RDD of Row objects anyway.
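The same conversion can be tried on plain Python records before running it on the cluster. This is a minimal sketch; the helper `to_float`, the sample readings, and the decision to map unparseable strings to `None` are assumptions, not part of the original recipe.

```python
def to_float(value, default=None):
    """Convert a string reading to float; return default if it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


# Hypothetical readings as they might arrive from the device
readings = [
    {"timestamp": "t1", "temperature": "21.5"},
    {"timestamp": "t2", "temperature": "n/a"},
    {"timestamp": "t3", "temperature": "19"},
]

# Tuples of (timestamp, temperature), mirroring the shape produced above
converted = [(r["timestamp"], to_float(r["temperature"])) for r in readings]

# Drop rows that could not be converted before computing statistics
valid = [(ts, temp) for ts, temp in converted if temp is not None]
```

A bare `float(...)` inside a Spark `map` raises on the first malformed value, so a guarded conversion like this may be worth passing to the lambda instead.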
I'm not familiar with DSX, but you can use Node-RED to parse the information from the devices and then store it in the Cloudant DB in numeric format.

SSIS Expression to Isolate a String

I am building an SSIS package that will populate data from an Excel Spreadsheet into our Database for Reporting.
The customer did not provide an individual column for the city and, unfortunately, cannot update their export file to add it, so I am trying to build a city column using the branch names.
I need an SSIS expression (or several) to use in a Derived Column Transformation to pull the names of the cities out of the Branch Name. The issue I have is that the spacing and placement of the names vary. I have tried to use TOKEN, SUBSTRING, and RIGHT and LEFT combined with other expressions, and I always seem to cut something off.
Has anyone else run into this, and how can I fix it? (I am not familiar enough with C# to use a Script Component.)
Here is a Sample of the Data that I have.
Branch Name
JS OMAHA - 09
JS SIOUX FALLS - 48
JS DOWNINGTOWN - 53
JS ST PAUL - 70
JS BLOOMINGTON - 103
JS PITTSBURGH NORTH -149-
JS TINTON FALLS - 186
JS BLAINE - 337
JS ROCHESTER MN - 423
Do you have a list of valid cities sitting in a table? If so you can use a lookup transformation.
Let's say your list of cities is in a table called city.
On the General tab, pick No Cache.
On the Connection tab, pick the city table.
On the Columns tab, match the Branch Name column to the city column in your city table.
On the Advanced tab, tick Modify the SQL statement and change the end to where [Branch Name] Like '%' + ? + '%'
Now your lookup will find the closest match and pass it through as an extra column.
The other way is to load it all into a staging table and do an UPDATE, also using LIKE
Whatever you do, it will help to have a list of valid cities in a table
The other way is to make an assumption about the tokens in the data and use string functions in a derived column transformation to extract it out, but you can get some unexpected results.
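To make the token assumption concrete, here is a minimal sketch written in Python rather than as an SSIS expression, so it only illustrates the parsing logic. It assumes every branch name starts with the prefix "JS" and ends with a dash, a number, and an optional trailing dash, which holds for the sample rows but may not hold for the full file. The function name `extract_city` is hypothetical.

```python
import re

# "JS", then the city (lazy match), then "-", digits, and an optional trailing "-"
BRANCH_PATTERN = re.compile(r"^\s*JS\s+(?P<city>.+?)\s*-\s*\d+\s*-?\s*$")


def extract_city(branch_name):
    """Return the city part of a branch name, or None if the pattern does not fit."""
    match = BRANCH_PATTERN.match(branch_name)
    return match.group("city") if match else None


samples = [
    "JS OMAHA - 09",
    "JS ST PAUL - 70",
    "JS PITTSBURGH NORTH -149-",
    "JS ROCHESTER MN - 423",
]
cities = [extract_city(name) for name in samples]
```

Note that values such as "ROCHESTER MN" and "PITTSBURGH NORTH" keep their extra words, which is one of the unexpected results a pure string approach produces and a reason the lookup against a city table may be the safer option.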
I can expand further on these if you wish but I won't waste time if you're never going to return to the question.
Whilst you stated that you are not familiar with script components - they are the correct tool for the job. You will get much greater flexibility by using C# (or VB.Net) code to manipulate your strings. There are a number of good tutorials online to show you how to use a script task, and lots of information about string manipulation in C#.

Reading edge list data set in apache giraph?

I'm using SNAP dataset for social network analysis. SNAP uses simple edge list as a data format. How to read SNAP dataset in Apache Giraph?
As far as I know, SNAP has various data formats depending on which dataset you are looking at. If your dataset has the format sourceid destinationid on each line, then you might want to use IntNullTextEdgeInputFormat (it's in giraph-core/src/main/java/org/apache/giraph/io/formats).
Also take a look at the other predefined formats available in the same folder. If none of those fits your dataset format, you can write your own input format class (it will be really simple if you start from one of the predefined formats and edit it as needed).
Use -eif org.apache.giraph.io.formats.IntNullTextEdgeInputFormat
Yes, SNAP uses Simple Edge List format for representing graph databases. You can use this code for converting it to a JSON format which is accepted by Apache Giraph.
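A minimal sketch of such a conversion (not the code this answer refers to) could look like the following. It assumes whitespace-separated integer ids with "#" comment lines, as in SNAP text files, and it assumes the target layout is one JSON array per line of the form [vertex_id, vertex_value, [[target_id, edge_weight], ...]]; check that layout against the Giraph input format you actually use. The default vertex value and edge weight are placeholders.

```python
import json
from collections import defaultdict


def edge_list_to_json_lines(lines, vertex_value=0, edge_weight=1):
    """Convert 'src dst' edge-list lines into one JSON array per vertex."""
    adjacency = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        src, dst = (int(token) for token in line.split()[:2])
        adjacency[src].append(dst)
        adjacency.setdefault(dst, [])  # make sure sink vertices get a line too
    return [
        json.dumps([v, vertex_value, [[t, edge_weight] for t in adjacency[v]]])
        for v in sorted(adjacency)
    ]


edge_lines = ["# FromNodeId ToNodeId", "0\t1", "0 2", "1 2"]
json_lines = edge_list_to_json_lines(edge_lines)
```

If the plain edge list is enough for your job, the edge input format mentioned above avoids this conversion step entirely.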
