I'm doing a project that analyzes COVID data, and I'm trying to clean it of null values and the like, but first I need to make it usable. It currently has an individual column for every date, holding the number of new cases that day. The Combined_Key column is unique, so that's what I was going to map the dates and case counts to.
Also, every column is of type String, so I imagine I'll need to insert the data into a DataFrame that's set up with the correct types, but I don't know how to do that without typing 450 date columns separately. Even more exciting, there doesn't seem to be an inherent date type in Spark/Scala, so I'm not sure how to handle that.
UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,Combined_Key,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,1/31/20,2/1/20,2/2/20,2/3/20,2/4/20,2/5/20,2/6/20,2/7/20,2/8/20,2/9/20,2/10/20,2/11/20,2/12/20,2/13/20,2/14/20,2/15/20,2/16/20,2/17/20,2/18/20,2/19/20,2/20/20,2/21/20,2/22/20,2/23/20,2/24/20,2/25/20
84001001,US,USA,840,1001.0,Autauga,Alabama,US,32.53952745,-86.64408227,"Autauga, Alabama, US",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,5,6,6,6,6,8,8,10,12,12,12,12,12,12,12,17,18,19,19,19,23,24,24,24,25,26,28,30,32,33,36,36,36,37,39,41,42,43,47,51,54,54,56,58,62,63,72,81
That's part of the top two rows of the data; a whole lot of date columns have been left out. I'm working in the spark shell. After turning the data into a table, I've tried things like the following, which get "error: 5 more arguments than can be applied to method ->: (y: B)(String, B)" and "error: type mismatch;" respectively:
var covidMap = scala.collection.mutable.Map[String, ArrayBuffer[Int]]()
table.foreach{x => covidMap += (x(10)).toString -> (x(11),x(20),x(30),x(40),x(50),x(60))}
table.foreach{x => covidMap += (x(10)).toString -> (x(11))}
Honestly, I don't know if these are even close to what I need to be doing. I've been coding for five weeks in a training program, and it's been incredibly difficult for me thus far, so, I'm here. Any help is appreciated!
Starting with an example DataFrame (taking your first two example date columns and adding today's date to show it'll work in the future):
val df = List(
  (84001001,"US","USA",840,1001.0,"Autauga","Alabama","US",32.53952745,-86.64408227,"Autauga, Alabama, US",0,0,50)
).toDF("UID","iso2","iso3","code3","FIPS","Admin2","Province_State","Country_Region","Lat","Long_","Combined_Key","1/22/20","1/23/20","4/2/22")

df.show()
gives:
+--------+----+----+-----+------+-------+--------------+--------------+-----------+------------+--------------------+-------+-------+------+
| UID|iso2|iso3|code3| FIPS| Admin2|Province_State|Country_Region| Lat| Long_| Combined_Key|1/22/20|1/23/20|4/2/22|
+--------+----+----+-----+------+-------+--------------+--------------+-----------+------------+--------------------+-------+-------+------+
|84001001| US| USA| 840|1001.0|Autauga| Alabama| US|32.53952745|-86.64408227|Autauga, Alabama, US| 0| 0| 50|
+--------+----+----+-----+------+-------+--------------+--------------+-----------+------------+--------------------+-------+-------+------+
We can then create a new column, which I've called dates (you can easily rename it). Here the array function is used to combine the values of all the date columns into a single column, which is an array:
import org.apache.spark.sql.functions.{array, col}
val dateRegex = "\\d+/\\d+/\\d+" // matches all columns in x/y/z format
val dateColumns = df.columns.filter(_.matches(dateRegex))
df
// select all date columns and combine into a new column: `dates`
.withColumn("dates", array(dateColumns.map(df(_)): _*))
// drop the original date columns, keeping `dates`
.drop(dateColumns: _*)
.show(false)
gives:
+--------+----+----+-----+------+-------+--------------+--------------+-----------+------------+--------------------+----------+
|UID |iso2|iso3|code3|FIPS |Admin2 |Province_State|Country_Region|Lat |Long_ |Combined_Key |dates |
+--------+----+----+-----+------+-------+--------------+--------------+-----------+------------+--------------------+----------+
|84001001|US |USA |840 |1001.0|Autauga|Alabama |US |32.53952745|-86.64408227|Autauga, Alabama, US|[0, 0, 50]|
+--------+----+----+-----+------+-------+--------------+--------------+-----------+------------+--------------------+----------+
A downside to this is that the output DataFrame doesn't retain the original dates themselves, since they existed only as column names.
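One way around that is to go to a long format instead, keeping each date alongside its count. Here's a sketch, assuming the same df and dateColumns as above (the names dateMap, longDf, date and cases are mine): build a map from each date column's name to its value, then explode it into one row per date. Note that Spark does have a date type (DateType), which to_date produces:

```scala
import org.apache.spark.sql.functions.{col, explode, lit, map, to_date}

val longDf = df
  // pair each date column's name with its value: {"1/22/20" -> 0, ...}
  .withColumn("dateMap", map(dateColumns.flatMap(c => Seq(lit(c), col(c))): _*))
  // one row per map entry; explode on a map yields `key` and `value` columns
  .select(col("Combined_Key"), explode(col("dateMap")))
  // parse the column name into a proper DateType and cast the count to Int
  .withColumn("date", to_date(col("key"), "M/d/yy"))
  .withColumn("cases", col("value").cast("int"))
  .drop("key", "value")

longDf.show()
```

This gives one (Combined_Key, date, cases) row per county per day, which is usually the easiest shape for cleaning and analysis, and it only types two columns no matter how many date columns the CSV grows.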
I have a data set which consists of an ID and a matrix (n x n) of data related to that ID.
Both the column names (A,B,C,D) and the Row names (1,2,3) are also important and need to be held for each individual ID, as well as the data (a1,b1,c1,d1,...)
for example:
ID | A | B | C | D |
1 | a1 | b1 | c1 | d1 |
2 | ... | ... | ... | ... |
3 | ... | ... | ... | ... |
I am trying to determine the best way of modelling this data set in a database; however, it seems like something that is difficult given the flat nature of an RDBMS.
Am I better off holding the ID and an XML blob representing the data matrix, or am I overlooking a simpler solution here?
Thanks.
RDBMSes aren't flat. The R part sees to that. What you need is:
Table Entity
------------
ID
Table EntityData
----------------
EntityID
MatrixRow (1, 2, 3...)
MatrixColumn (A, B, C, D...)
Value
Entity:EntityData is a one-to-many relationship; each cell in the matrix has an EntityData row.
Now you have a schema that can be analyzed at the SQL level, instead of just being a data dump where you have to pull and extract everything at the application level in order to find out anything about it.
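As a concrete sketch of that schema (the column types are assumptions, since the question doesn't specify any), along with the kind of SQL-level query it enables:

```sql
CREATE TABLE Entity (
    ID INTEGER PRIMARY KEY
);

-- one row per matrix cell
CREATE TABLE EntityData (
    EntityID     INTEGER NOT NULL REFERENCES Entity (ID),
    MatrixRow    INTEGER NOT NULL,   -- 1, 2, 3...
    MatrixColumn CHAR(1) NOT NULL,   -- 'A', 'B', 'C', 'D'...
    Value        VARCHAR(100),
    PRIMARY KEY (EntityID, MatrixRow, MatrixColumn)
);

-- reconstitute entity 1's matrix in row/column form
SELECT MatrixRow,
       MAX(CASE WHEN MatrixColumn = 'A' THEN Value END) AS A,
       MAX(CASE WHEN MatrixColumn = 'B' THEN Value END) AS B,
       MAX(CASE WHEN MatrixColumn = 'C' THEN Value END) AS C,
       MAX(CASE WHEN MatrixColumn = 'D' THEN Value END) AS D
FROM EntityData
WHERE EntityID = 1
GROUP BY MatrixRow
ORDER BY MatrixRow;
```

The composite primary key also guarantees each (entity, row, column) cell is stored at most once.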
This is one of the reasons why PostgreSQL supports arrays as a data type. See
http://www.postgresql.org/docs/8.4/static/functions-array.html
and
http://www.postgresql.org/docs/8.4/static/arrays.html
These pages show that you can use syntax like ARRAY[[1,2,3],[4,5,6],[7,8,9]] to define the values of a 3x3 matrix, or val integer[3][3] to declare a column val whose type is a 3x3 matrix.
Of course this is not at all standard SQL and is PostgreSQL specific. Other databases may have similar-but-slightly-different implementations.
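For instance, a minimal PostgreSQL-specific sketch (the table and column names here are just placeholders):

```sql
CREATE TABLE matrices (
    id  integer PRIMARY KEY,
    -- PostgreSQL does not enforce the declared dimensions; they document intent
    val integer[3][3]
);

INSERT INTO matrices VALUES (1, ARRAY[[1,2,3],[4,5,6],[7,8,9]]);

-- array subscripts are 1-based: row 2, column 3
SELECT val[2][3] FROM matrices WHERE id = 1;
```

Individual cells stay addressable in queries, but you lose the ability to index or constrain them the way the fully relational schema allows.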
If you want a truly relational solution:
Matrix
------
id
Matrix_Cell
-----------
matrix_id
row
col
value
But constraints to make sure you had valid data would be hideous.
I would consider a matrix a single value as far as the DB is concerned and store it as CSV:
Matrix
------
id
cols
data
Which is somewhat lighter than XML.
I'd probably implement it like this:
Table MatrixData
----------------
id
rowName
columnName
datapoint
If all you're looking for is storing the data, this structure will hold any size matrix and allow you to reconstitute any matrix from the ID. You will need some post-processing to present it in "matrix format", but that's what the front-end code is for.
Can the data be thought of as "row data"? If so, then maybe you could store each row as an object (or XML blob) with data A, B, C, D, and then, in your "representation", use something like a LinkedHashMap (assuming Java) to look up the objects by an ID key.
Also, it seems that by its very basic nature, a typical database table already does what you need, doesn't it?
Or, even better, you can create a logical array-like structure. Say you want to store an m x n array: create m attributes in the table, and in each attribute store n elements separated by delimiters.
When retrieving the data, simply parse the delimited values back out.