import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)
val status = fs.listStatus(new Path("/home/Test/")).map(_.getPath().toString)
val df = spark.read.format("json").load(status: _*)
How do I add the file name as a new column in df?
I tried:
import org.apache.spark.sql.functions.input_file_name
val dfWithCol = df.withColumn("filename", input_file_name())
But it adds the same file name to all the rows.
Can anyone suggest a better approach?
This is expected behaviour, because your JSON files have more than one record in them.
Spark adds the file name for each record; if you want to check all the unique filenames, do a distinct on the filename column:
//to get unique filenames
df.select("filename").distinct().show()
Example:
#source data
hadoop fs -cat /user/shu/json/*.json
{"id":1,"name":"a"}
{"id":1,"name":"a"}
val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)
val status = fs.listStatus(new Path("/user/shu/json")).map(_.getPath().toString)
val df = spark.read.format("json").load(status : _*)
df.withColumn("filename",input_file_name()).show(false)
//unique filenames for each record
+---+----+------------------------------------+
|id |name|filename                            |
+---+----+------------------------------------+
|1  |a   |hdfs://nn:8020/user/shu/json/i.json |
|1  |a   |hdfs://nn:8020/user/shu/json/i1.json|
+---+----+------------------------------------+
In the above example you can see unique filenames for each record (as I have one record in each JSON file).
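If you only need the bare file name rather than the full hdfs:// URI, one option is to strip everything before the last slash; a small sketch (basename is just an illustrative column name here):
import org.apache.spark.sql.functions.{col, input_file_name, substring_index}
// keep only the part of the file URI after the last "/"
val dfWithName = df
  .withColumn("filename", input_file_name())
  .withColumn("basename", substring_index(col("filename"), "/", -1))
dfWithName.show(false)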
How to create a new column in DataFrame DF based on a given condition. I have an array of String and want to compare it with an existing DataFrame.
DataFrame DF:
+-------------------+-----------+
| DiffColumnName| Datatype|
+-------------------+-----------+
| DEST_COUNTRY_NAME| StringType|
|ORIGIN_COUNTRY_NAME| StringType|
| COUNT|IntegerType|
+-------------------+-----------+
and an Array of String with the column names (this is not constant and can change):
val diffcolarray = Array("ORIGIN_COUNTRY_NAME", "COUNT")
I want to create a new column in DF based on the condition that if a column present in diffcolarray is also present in the DataFrame's DiffColumnName column, then yes, else no.
I have tried the options below, however I am getting an error:
val newdf = df.filter(when(col("DiffColumnName") === df.columns.filter(diffcolarray.contains(_)), "yes").otherwise("no")).as("issue")
val newdf = valdfe.filter(when(col("DiffColumnName") === df.columns.map(diffcolarray.contains(_)), "yes").otherwise("no")).as("issue")
It looks like there is a datatype mismatch when comparing. The output should be something like this. Any suggestion would be helpful. Thank you.
+-------------------+-----------+----------+
| DiffColumnName| Datatype| Issue |
+-------------------+-----------+----------+
| DEST_COUNTRY_NAME| StringType| NO |
|ORIGIN_COUNTRY_NAME| StringType| NO |
| COUNT|IntegerType| YES |
+-------------------+-----------+----------+
Both attempts compare a Column against an Array[String] (hence the datatype mismatch), and they use filter where you want withColumn. Checking membership with isin instead can give you the desired output:
df.withColumn("Issue",when(col("DiffColumnName").isin(diffcolarray: _*),"YES").otherwise("NO")).show(false)
I tried to create an ExternalCatalog to use in an Apache Flink Table. I created it and added it to the Flink table environment (following the official documentation). For some reason, the only external table present in the catalog is not found during the scan. What did I miss in the code below?
val catalogName = s"externalCatalog$fileNumber"
val ec: ExternalCatalog = getExternalCatalog(catalogName, 1, tableEnv)
tableEnv.registerExternalCatalog(catalogName, ec)
val s1: Table = tableEnv.scan("S_EXT")
def getExternalCatalog(catalogName: String, fileNumber: Int, tableEnv: BatchTableEnvironment): ExternalCatalog = {
val cat = new InMemoryExternalCatalog(catalogName)
// external Catalog table
val externalCatalogTableS = getExternalCatalogTable("S")
// add external Catalog table
cat.createTable("S_EXT", externalCatalogTableS, ignoreIfExists = false)
cat
}
private def getExternalCatalogTable(fileName: String): ExternalCatalogTable = {
// connector descriptor
val connectorDescriptor = new FileSystem()
connectorDescriptor.path(getFilePath(fileNumber, fileName))
// format
val fd = new Csv()
fd.field("X", Types.STRING)
fd.field("Y", Types.STRING)
fd.fieldDelimiter(",")
// statistic
val statistics = new Statistics()
statistics.rowCount(0)
// metadata
val md = new Metadata()
ExternalCatalogTable.builder(connectorDescriptor)
.withFormat(fd)
.withStatistics(statistics)
.withMetadata(md)
.asTableSource()
}
The example above is part of this test file in git.
This is probably a namespace issue. Tables in external catalogs are identified by a list of names: the catalog, potentially schemas, and finally the table name.
In your example, the following should work:
val s1: Table = tableEnv.scan("externalCatalog1", "S_EXT")
You can have a look at the ExternalCatalogTest to see how external catalogs can be used.
I have a JSON object containing an array and other properties.
I need to check the first value of the array for each row of my table.
Here is an example of the JSON:
{"objectID2":342,"objectID1":46,"objectType":["Demand","Entity"]}
So I need, for example, to get all rows where objectType[0] = 'Demand' and objectID1 = 46.
These are the table columns:
id | relationName | content
The content column contains the JSON.
Just query them, like:
t=# with table_name(id, rn, content) as (values(1,null,'{"objectID2":342,"objectID1":46,"objectType":["Demand","Entity"]}'::json))
select * From table_name
where content->'objectType'->>0 = 'Demand' and content->>'objectID1' = '46';
id | rn | content
----+----+-------------------------------------------------------------------
1 | | {"objectID2":342,"objectID1":46,"objectType":["Demand","Entity"]}
(1 row)
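Here -> returns a JSON element (so ->>0 can then take the first array item as text), while ->> returns the field value as text, which is why objectID1 is compared against the string '46'.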
I have transactions as a DataFrame with a single array<string> column:
transactions: org.apache.spark.sql.DataFrame = [collect_set(b): array<string>]
I want to convert it to RDD[Array[String]], but when I change it to an RDD it becomes org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]:
val sam: RDD[Array[String]] = transactions.rdd
<console>:42: error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: org.apache.spark.rdd.RDD[Array[String]]
val sam: RDD[Array[String]] = transactions.rdd
transactions.rdd returns an RDD[Row], as the error message says.
You can manually convert each Row to an Array:
val sam = transactions.rdd.map(x => x.getList(0).toArray.map(_.toString))
In a more Spark 2.0 style it would be:
val sam = transactions.select("columnName").as[Array[String]].rdd
Replace columnName with the proper column name from the DataFrame; you should probably rename collect_set(b) to a more user-friendly name.
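For example, a minimal sketch of that rename (assuming the aggregated column is still called collect_set(b) and spark.implicits._ is in scope):
// give the aggregated column a friendlier name, then pull it out as an RDD[Array[String]]
val sam = transactions
  .withColumnRenamed("collect_set(b)", "items")
  .select("items")
  .as[Array[String]]
  .rdd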
A DataFrame is essentially a collection of Rows, so running collect on a DataFrame creates an Array[Row], and converting it to an RDD gives you an RDD[Row].
So if you want an RDD[Array[String]] you can do it this way:
val sam = transactions.rdd.map(x => x.toString().stripPrefix("[").stripSuffix("]").split(fieldSeperator))
I have a dataframe and would like to rename the columns based on a dictionary with multiple values per key. The dictionary key has the desired column name, and the values hold possible old column names. The column names have no pattern.
import pandas as pd
column_dict = {'a':['col_a','col_1'], 'b':['col_b','col_2'], 'c':['col_c','col_3']}
df = pd.DataFrame([(1,2.,'Hello'), (2,3.,"World")], columns=['col_1', 'col_2', 'col_3'])
Function to replace the text with the key:
def replace_names(text, dict):
for key in dict:
text = text.replace(dict[key],key)
return text
replace_names(df.columns.values,column_dict)
This gives an error when called on the column names:
AttributeError: 'numpy.ndarray' object has no attribute 'replace'
Is there another way to do this?
You can use df.rename(columns=...) if you supply a dict which maps old column names to new column names:
import pandas as pd
column_dict = {'a':['col_a','col_1'], 'b':['col_b','col_2'], 'c':['col_c','col_3']}
df = pd.DataFrame([(1,2.,'Hello'), (2,3.,"World")], columns=['col_1', 'col_2', 'col_3'])
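# build a flat mapping: every possible old column name -> its new name (the dict key)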
col_map = {col:key for key, cols in column_dict.items() for col in cols}
df = df.rename(columns=col_map)
yields
a b c
0 1 2.0 Hello
1 2 3.0 World