I have a pyspark query which returns a WrappedArray:
det_port_arr = vessel_arrival_depart_df.select(vessel_arrival_depart_df['DetectedPortArrival'])
det_port_arr.show(2, truncate=False)
det_port_arr.dtypes
The output is a DataFrame with a single column, but that column is a struct which contains an array of structs:
|DetectedPortArrival |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[WrappedArray([portPoi,5555,BEILUN [CNBEI],marinePort], [portPoi,5729,NINGBO [CNNBO],marinePort], [portPoi,5730,NINGBO PT [CNNBG],marinePort]),device,Moored]|
|null |
[('DetectedPortArrival',
'struct<poiMeta:array<struct<poiCategory:string,poiId:bigint,poiName:string,poiType:string>>,sourceType:string,statusType:string>')]
If I try to select the poiMeta member of the struct:
temp = vessel_arrival_depart_df.select(vessel_arrival_depart_df['DetectedPortArrival']['poiMeta'])
temp.show(truncate=False)
print type(temp)
I obtain
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|DetectedPortArrival.poiMeta |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[portPoi,5555,BEILUN [CNBEI],marinePort], [portPoi,5729,NINGBO [CNNBO],marinePort], [portPoi,5730,NINGBO PT [CNNBG],marinePort]] |
|null |
Here are the data types:
temp.dtypes
[('DetectedPortArrival.poiMeta',
'array<struct<poiCategory:string,poiId:bigint,poiName:string,poiType:string>>')]
But here's the problem: I don't seem to be able to query that column DetectedPortArrival.poiMeta:
df2 = temp.selectExpr("DetectedPortArrival.poiMeta")
df2.show(2)
AnalysisExceptionTraceback (most recent call last)
<ipython-input-46-c7f0041cffe9> in <module>()
----> 1 df2 = temp.selectExpr("DetectedPortArrival.poiMeta")
2 df2.show(3)
/opt/spark/spark-2.1.0-bin-hadoop2.4/python/pyspark/sql/dataframe.py in selectExpr(self, *expr)
996 if len(expr) == 1 and isinstance(expr[0], list):
997 expr = expr[0]
--> 998 jdf = self._jdf.selectExpr(self._jseq(expr))
999 return DataFrame(jdf, self.sql_ctx)
1000
/opt/spark/spark-2.1.0-bin-hadoop2.4/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/opt/spark/spark-2.1.0-bin-hadoop2.4/python/pyspark/sql/utils.py in deco(*a, **kw)
67 e.java_exception.getStackTrace()))
68 if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
71 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u"cannot resolve '`DetectedPortArrival.poiMeta`' given input columns: [DetectedPortArrival.poiMeta]; line 1 pos 0;\n'Project ['DetectedPortArrival.poiMeta]\n+- Project [DetectedPortArrival#268.poiMeta AS DetectedPortArrival.poiMeta#503]\n +- Project [asOf#263, vesselId#264, DetectedPortArrival#268, DetectedPortDeparture#269]\n +- Sort [asOf#263 ASC NULLS FIRST], true\n +- Project [smfPayloadData#1.paired.shipmentId AS shipmentId#262, smfPayloadData#1.timestamp.asOf AS asOf#263, smfPayloadData#1.paired.vesselId AS vesselId#264, smfPayloadData#1.paired.vesselName AS vesselName#265, smfPayloadData#1.geolocation.speed AS speed#266, smfPayloadData#1.geolocation.detectedPois AS detectedPois#267, smfPayloadData#1.events.DetectedPortArrival AS DetectedPortArrival#268, smfPayloadData#1.events.DetectedPortDeparture AS DetectedPortDeparture#269]\n +- Filter ((((cast(smfPayloadData#1.paired.vesselId as double) = cast(9776183 as double)) && isnotnull(smfPayloadData#1.paired.shipmentId)) && (length(smfPayloadData#1.paired.shipmentId) > 0)) && (isnotnull(smfPayloadData#1.paired.vesselId) && (isnotnull(smfPayloadData#1.events.DetectedPortArrival) || isnotnull(smfPayloadData#1.events.DetectedPortDeparture))))\n +- SubqueryAlias smurf_processed\n +- Relation[smfMetaData#0,smfPayloadData#1,smfTransientData#2] parquet\n"
Any suggestions as to how to query that column?
Can't you just select the column by its index? Something like:
temp.select(temp.columns[0]).show()
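If selecting by index is not convenient, another option (a sketch, assuming the same temp and vessel_arrival_depart_df DataFrames and Spark 2.x as in the traceback) is to escape the dotted column name with backticks so Spark treats it as a single column name rather than struct field access, or to alias the nested field to a dot-free name when it is first selected:
from pyspark.sql.functions import col

# Backticks make Spark treat the whole string as one column name instead of DetectedPortArrival.<field> access
temp.select(col("`DetectedPortArrival.poiMeta`")).show(2, truncate=False)
temp.selectExpr("`DetectedPortArrival.poiMeta`").show(2, truncate=False)

# Or alias the nested field up front so the resulting column name contains no dot
temp2 = vessel_arrival_depart_df.select(
    vessel_arrival_depart_df['DetectedPortArrival']['poiMeta'].alias('poiMeta')
)
temp2.selectExpr("poiMeta").show(2, truncate=False)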
Best Regards
I want to compare two arrays and filter the DataFrame.
condition_1 = "AAA"
condition_2 = ["AAA","BBB","CCC"]
My Spark DataFrame has a column with an array of strings:
df = df.withColumn("array_column", F.lit(["XXX","YYY","AAA"]))
# to filter a string condition_1 with the array column
df = df.filter(
F.col('array_column').isin(condition_1) &
# second filter here
)
But how can I filter with condition_2 in a similar way, since both it and the column are arrays?
Code I tried:
df = df.filter(
F.col('array_column').isin(condition_1) &
any(x in condition_2 for x in F.col('array_column'))
)
But I get an error: "Column is not iterable".
I also tried bool(set(F.col('array_column')).intersection(condition_2)),
but I still get the same error. Can anyone help me with this?
Hope I got your question right; it wasn't entirely clear. Use PySpark's array functions.
Data
condition_1 = 'AAA'
condition_2 = ["AAA","BBB","CCC"]
df=spark.createDataFrame([('1A', '3412asd','value-1', ['XXX', 'YYY', 'AAA']),
('2B', '2345tyu','value-2', ['DDD', 'YFFFYY', 'GGG']),
('3C', '9800bvd', 'value-3', ['AAA']),
('3C', '9800bvd', 'value-1', ['AAA', 'YYY', 'CCCC'])],
('ID', 'Company_Id', 'value' ,'array_column'))
df.show()
+---+----------+-------+------------------+
| ID|Company_Id| value| array_column|
+---+----------+-------+------------------+
| 1A| 3412asd|value-1| [XXX, YYY, AAA]|
| 2B| 2345tyu|value-2|[DDD, YFFFYY, GGG]|
| 3C| 9800bvd|value-3| [AAA]|
| 3C| 9800bvd|value-1| [AAA, YYY, CCCC]|
+---+----------+-------+------------------+
Code
from pyspark.sql.functions import array, array_contains, array_intersect, col, lit, size

df.where(
    array_contains(col('array_column'), lit(condition_1))
    & (size(array_intersect(col('array_column'), array([lit(x) for x in condition_2]))) != 0)
).show(truncate=False)
Outcome
+---+----------+-------+----------------+
|ID |Company_Id|value |array_column |
+---+----------+-------+----------------+
|1A |3412asd |value-1|[XXX, YYY, AAA] |
|3C |9800bvd |value-3|[AAA] |
|3C |9800bvd |value-1|[AAA, YYY, CCCC]|
+---+----------+-------+----------------+
How it works
condition_1: get a boolean column indicating whether array_column contains the string
array_contains(col('array_column'), lit(condition_1))
condition_2: this happens in stages
Intersect the column with the list:
array_intersect(col('array_column'), array([lit(x) for x in condition_2]))
Get the size of the outcome of the step above:
size(array_intersect(col('array_column'), array([lit(x) for x in condition_2])))
Check that the intersection contains at least one item:
size(array_intersect(col('array_column'), array([lit(x) for x in condition_2]))) != 0
Finally, chain condition_1 and condition_2 with the & operator and pass the result into the df.where() or df.filter() method.
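A possible shorter variant for the condition_2 part (a sketch, assuming Spark 2.4 or later, where arrays_overlap is available): arrays_overlap returns true when the two arrays share at least one non-null element, which replaces the size(...) != 0 comparison.
from pyspark.sql.functions import array, array_contains, arrays_overlap, col, lit

df.where(
    array_contains(col('array_column'), lit(condition_1))
    & arrays_overlap(col('array_column'), array([lit(x) for x in condition_2]))
).show(truncate=False)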
I was trying to calculate the initial embeddings for the whole data frame, which is the first step in implementing my GNN (which is heterogeneous in nature). I used a Twitter message dataset for this task and loaded it in the following way:
load_path = '/Users/hemangjiwnani/Desktop/Projects/Paper1/KPGNN/datasets/Twitter/'
save_path = '/Users/hemangjiwnani/Desktop/Projects/Paper1/KPGNN/datasets/Twitter/'
# load dataset
p_part1 = load_path + '68841_tweets_multiclasses_filtered_0722_part1.npy'
p_part2 = load_path + '68841_tweets_multiclasses_filtered_0722_part2.npy'
#"./datasets/Twitter/68841_tweets_multiclasses_filtered_0722_part1.npy"
df_np_part1 = np.load(p_part1, allow_pickle=True)
df_np_part2 = np.load(p_part2, allow_pickle=True)
Then I created a data frame from it with the following code:
df_np = np.concatenate((df_np_part1, df_np_part2), axis=0)  # axis=0 stacks the two parts row-wise
print("Loaded data.")
df = pd.DataFrame(data=df_np, columns=["event_id", "tweet_id", "text", "user_id", "created_at", "user_loc",\
"place_type", "place_full_name", "place_country_code", "hashtags", "user_mentions", "image_urls", "entities",
"words", "filtered_words", "sampled_words"])
print("Data converted to dataframe.")
print(df.shape)
print(df.head(5))
which produced the following output:
Loaded data.
Data converted to dataframe.
(68841, 16)
event_id ... sampled_words
0 0 ... []
1 0 ... []
2 0 ... []
3 0 ... []
4 0 ... []
[5 rows x 16 columns]
The function below raises an error at its return statement:
def documents_to_features(df):
    nlp = spacy.load("en_core_web_sm")
    # nlp = en_core_web_lg.load()
    features = df.filtered_words.apply(lambda x: nlp(' '.join(x)).vector).values
    return np.stack(features, axis=0)  # <-- the error is raised here
ERROR
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-2e8bc5e83009> in <module>()
----> 1 d_features = documents_to_features(df)
2 print("Document features generated.")
3 t_features = df_to_t_features(df)
4 print("Time features generated.")
5 combined_features = np.concatenate((d_features, t_features), axis=1)
1 frames
<ipython-input-9-b772a7744232> in documents_to_features(df)
3 #nlp = en_core_web_lg.load()
4 features = df.filtered_words.apply(lambda x: nlp(' '.join(x)).vector).values
----> 5 return np.stack(features, axis=0)
<__array_function__ internals> in stack(*args, **kwargs)
/usr/local/lib/python3.7/dist-packages/numpy/core/shape_base.py in stack(arrays, axis, out)
425 shapes = {arr.shape for arr in arrays}
426 if len(shapes) != 1:
--> 427 raise ValueError('all input arrays must have the same shape')
428
429 result_ndim = arrays[0].ndim + 1
ValueError: all input arrays must have the same shape
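One way to narrow this down (a sketch of a diagnostic, not a fix, assuming the same df and nlp as above): np.stack only succeeds when every per-row vector has the same shape, so listing the distinct shapes shows which rows are the odd ones out, for example rows whose filtered_words list is empty.
import numpy as np

def report_vector_shapes(df, nlp):
    # Hypothetical helper: group row indices by the shape of their spaCy document vector
    features = df.filtered_words.apply(lambda x: nlp(' '.join(x)).vector).values
    shapes = {}
    for i, vec in enumerate(features):
        shapes.setdefault(np.asarray(vec).shape, []).append(i)
    for shape, rows in shapes.items():
        print(shape, "->", len(rows), "rows, first at index", rows[0])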
I have the following data
100///t1001///t2///t0.119///t2342342342///tHi\nthere!///n103///t1002///t2///t0.119///t2342342342///tHello there!
1010///t10077///t2///t0.119///t2342342342///tHi\nthere!///n1044///t1003///t2///t0.119///t2342342342///tHello there!
In a file, I have multiple lines of data in the above format. Each line is delimited by ///n and ///t: the records within a line are delimited by ///n, and inside each record the columns are delimited by ///t. I need to parse this into a DataFrame. So for the two lines above, each with 2 records of 6 columns, the DataFrame should contain 4 rows of 6 columns. Every record follows the same format.
I tried parsing this using a combination of split and map but did not get the correct output.
You can process it using string transformations, like:
// Sample of input data
val str1 = "100///t1001///t2///t0.119///t2342342342///tHi\nthere!///n103///t1002///t2///t0.119///t2342342342///tHello there!"
val str2 = "1010///t10077///t2///t0.119///t2342342342///tHi\nthere!///n1044///t1003///t2///t0.119///t2342342342///tHello there!"
val df = Seq(str1, str2).toDF
// Process (requires import spark.implicits._ for toDF / as[String]):
val output = df.as[String].flatMap(row => {
  row.split("///n").map(record => {
    val cols = record.split("///t").toList
    (cols(0), cols(1), cols(2), cols(3), cols(4), cols(5))
  }).toList
}).toDF("column_1", "column_2", "column_3", "column_4", "column_5", "column_6")
Result:
+--------+--------+--------+--------+----------+------------+
|column_1|column_2|column_3|column_4| column_5| column_6|
+--------+--------+--------+--------+----------+------------+
| 100| 1001| 2| 0.119|2342342342| Hi |
| |there! |
| 103| 1002| 2| 0.119|2342342342|Hello there!|
| 1010| 10077| 2| 0.119|2342342342| Hi |
| | there!|
| 1044| 1003| 2| 0.119|2342342342|Hello there!|
+--------+--------+--------+--------+----------+------------+
I managed to extract a time series of prices from a web portal. The data arrives in JSON format, and I convert it into a pandas DataFrame.
Unfortunately, the data for the different price bands comes in as a text string, and I can't seem to extract it properly.
Below is the JSON data I extract.
I convert it into a pandas DataFrame using this code
data = pd.DataFrame(r.json()['prices'])
and get a DataFrame like this.
I need to extract (for example) the data in the closePrice column, so that I can do data analysis and cleansing on it.
I tried using
data['closePrice'].str.split(',', expand=True).rename(columns = lambda x: "string"+str(x+1))
but it doesn't really work.
Is there any way to either
a) convert the JSON to a DataFrame such that the prices within closePrice, bidPrice, etc. are extracted into individual columns, OR
b) if they are already saved in the DataFrame, parse the text strings within them, so that I can extract the prices (e.g. the bid, ask and lastTraded) from each text string?
A relatively brute-force way, using links from other Stack Overflow answers:
import requests
import pandas as pd

# load and extract the json data
s = requests.Session()
r = s.post(url + '/session', json=data)
loc = <url>
dat1 = s.get(loc)
dat1 = pd.DataFrame(dat1.json()['prices'])
# convert the dict-valued columns into individual columns
dat2 = pd.DataFrame()
dat2[['bidC','askC', 'lastP']] = pd.DataFrame(dat1.closePrice.values.tolist(), index=dat1.index)
dat2[['bidH','askH', 'lastH']] = pd.DataFrame(dat1.highPrice.values.tolist(), index=dat1.index)
dat2[['bidL','askL', 'lastL']] = pd.DataFrame(dat1.lowPrice.values.tolist(), index=dat1.index)
dat2[['bidO','askO', 'lastO']] = pd.DataFrame(dat1.openPrice.values.tolist(), index=dat1.index)
dat2['tStamp'] = pd.to_datetime(dat1.snapshotTime)
dat2['volume'] = dat1.lastTradedVolume
get the equivalent below
Use pandas.json_normalize to extract the data from the dict
import pandas as pd
data = r.json()
# print(data)
{'prices': [{'closePrice': {'ask': 1.16042, 'bid': 1.16027, 'lastTraded': None},
'highPrice': {'ask': 1.16052, 'bid': 1.16041, 'lastTraded': None},
'lastTradedVolume': 74,
'lowPrice': {'ask': 1.16038, 'bid': 1.16026, 'lastTraded': None},
'openPrice': {'ask': 1.16044, 'bid': 1.16038, 'lastTraded': None},
'snapshotTime': '2018/09/28 21:49:00',
'snapshotTimeUTC': '2018-09-28T20:49:00'}]}
df = pd.json_normalize(data['prices'])
Output:
| | lastTradedVolume | snapshotTime | snapshotTimeUTC | closePrice.ask | closePrice.bid | closePrice.lastTraded | highPrice.ask | highPrice.bid | highPrice.lastTraded | lowPrice.ask | lowPrice.bid | lowPrice.lastTraded | openPrice.ask | openPrice.bid | openPrice.lastTraded |
|---:|-------------------:|:--------------------|:--------------------|-----------------:|-----------------:|:------------------------|----------------:|----------------:|:-----------------------|---------------:|---------------:|:----------------------|----------------:|----------------:|:-----------------------|
| 0 | 74 | 2018/09/28 21:49:00 | 2018-09-28T20:49:00 | 1.16042 | 1.16027 | | 1.16052 | 1.16041 | | 1.16038 | 1.16026 | | 1.16044 | 1.16038 | |
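If the dotted column names produced by json_normalize are awkward to work with downstream, they can be renamed and the frame re-indexed afterwards (a minimal sketch, assuming the df produced above; the short names are an arbitrary choice):
prices = df[['snapshotTimeUTC', 'closePrice.ask', 'closePrice.bid']].rename(
    columns={'closePrice.ask': 'close_ask', 'closePrice.bid': 'close_bid'}
)
prices['snapshotTimeUTC'] = pd.to_datetime(prices['snapshotTimeUTC'])
prices = prices.set_index('snapshotTimeUTC')
print(prices)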
I'm very new to Scala on Spark and am wondering how you might create key-value pairs where the key has more than one element. For example, I have this dataset of baby names:
Year, Name, County, Number
2000, JOHN, KINGS, 50
2000, BOB, KINGS, 40
2000, MARY, NASSAU, 60
2001, JOHN, KINGS, 14
2001, JANE, KINGS, 30
2001, BOB, NASSAU, 45
And I want to find the most frequently occurring name for each county, regardless of the year. How might I go about doing that?
I did accomplish this using a loop (see below), but I'm wondering if there is a shorter way that makes better use of Spark and Scala (i.e. can I decrease the computation time?).
val names = sc.textFile("names.csv").map(l => l.split(","))
val uniqueCounty = names.map(x => x(2)).distinct.collect
for (i <- 0 to uniqueCounty.length-1) {
val county = uniqueCounty(i).toString;
val eachCounty = names.filter(x => x(2) == county).map(l => (l(1), l(3).toInt)).reduceByKey((a,b) => a + b).sortBy(-_._2);
println("County:" + county + eachCounty.first)
}
Here is a solution using RDDs. I am assuming you need the top-occurring name per county.
val data = Array((2000, "JOHN", "KINGS", 50),(2000, "BOB", "KINGS", 40),(2000, "MARY", "NASSAU", 60),(2001, "JOHN", "KINGS", 14),(2001, "JANE", "KINGS", 30),(2001, "BOB", "NASSAU", 45))
val rdd = sc.parallelize(data)
//Reduce the uniq values for county/name as combo key
val uniqNamePerCountyRdd = rdd.map(x => ((x._3,x._2),x._4)).reduceByKey(_+_)
// Group names per county.
val countyNameRdd = uniqNamePerCountyRdd.map(x=>(x._1._1,(x._1._2,x._2))).groupByKey()
// Sort descending and take the top name alone per county
countyNameRdd.mapValues(x => x.toList.sortBy(-_._2).take(1)).collect
Output:
res8: Array[(String, List[(String, Int)])] = Array((KINGS,List((JOHN,64))), (NASSAU,List((MARY,60))))
You could use spark-csv and the DataFrame API. If you are using a newer version of Spark (2.0+) it is slightly different: Spark 2.0 has a native CSV data source based on spark-csv.
Use spark-csv to load your CSV file into a DataFrame.
val df = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load(new File(getClass.getResource("/names.csv").getFile).getAbsolutePath)
df.show
Gives output:
+----+----+------+------+
|Year|Name|County|Number|
+----+----+------+------+
|2000|JOHN| KINGS| 50|
|2000| BOB| KINGS| 40|
|2000|MARY|NASSAU| 60|
|2001|JOHN| KINGS| 14|
|2001|JANE| KINGS| 30|
|2001| BOB|NASSAU| 45|
+----+----+------+------+
DataFrames provide a set of operations for structured data manipulation. You could use some basic operations to obtain your result.
import org.apache.spark.sql.functions._
df.select("County","Number").groupBy("County").agg(max("Number")).show
Gives output:
+------+-----------+
|County|max(Number)|
+------+-----------+
|NASSAU| 60|
| KINGS| 50|
+------+-----------+
Is this what you are trying to achieve?
Notice the import org.apache.spark.sql.functions._, which is needed for the max() function used inside agg().
More information about Dataframes API
EDIT
For correct output:
df.registerTempTable("names")
//there is probably a better query for this
sqlContext.sql("SELECT * FROM (SELECT Name, County,count(1) as Occurrence FROM names GROUP BY Name, County ORDER BY " +
"count(1) DESC) n").groupBy("County", "Name").max("Occurrence").limit(2).show
Gives output:
+------+----+---------------+
|County|Name|max(Occurrence)|
+------+----+---------------+
| KINGS|JOHN| 2|
|NASSAU|MARY| 1|
+------+----+---------------+