I have the following code to read data from a Parquet file into a DataFrame:
DataFrame addressDF = sqlContext.read().parquet(addressParquetPath);
How do I read data from Parquet into a Dataset?
Dataset dataset = sqlContext.createDataset(sqlContext.read().parquet(propertyParquetPath).toJavaRDD(), Encoder.);
What should the Encoder parameter contain? Also, do I have to create a property class and then pass that, or how does it work?
The Encoder for a type T is the class that tells Spark how instances of T are encoded to and decoded from Spark's internal representation. It contains the schema of the class and the Scala ClassTag, which is used to create your class via reflection.
In your code you don't specialize Dataset over any type T, so I cannot create an Encoder for you, but I can give you the example from the Databricks Spark documentation, which I suggest reading because it is great.
First of all, let's create the class University that we want to load into a Dataset:
public class University implements Serializable {
    private String name;
    private long numStudents;
    private long yearFounded;

    public void setName(String name) { this.name = name; }
    public String getName() { return name; }
    public void setNumStudents(long numStudents) { this.numStudents = numStudents; }
    public long getNumStudents() { return numStudents; }
    public void setYearFounded(long yearFounded) { this.yearFounded = yearFounded; }
    public long getYearFounded() { return yearFounded; }
}
Now University is a Java Bean, and the Spark Encoders library provides a way to create encoders for Java Beans with the bean function:
Encoder<University> universityEncoder = Encoders.bean(University.class);
which can then be used to read a Dataset of University from Parquet without first loading it into a DataFrame (which would be redundant):
Dataset<University> schools = context.read().parquet("/schools.parquet").as(universityEncoder);
and now schools is a Dataset<University> read from a Parquet file.
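Putting it together, an end-to-end read might look like this. This is a minimal sketch: it assumes Spark 2.x's SparkSession and a hypothetical file path; with the older SQLContext from your question, the same .as(universityEncoder) call applies.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class ReadParquetAsDataset {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("parquet-dataset").getOrCreate();

        // the bean encoder derives the schema from University's getters and setters
        Encoder<University> universityEncoder = Encoders.bean(University.class);

        // read the Parquet file straight into a typed Dataset, no intermediate DataFrame step
        Dataset<University> schools = spark.read().parquet("/schools.parquet").as(universityEncoder);
        schools.show();
    }
}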
Team
I'm building a spring boot application that can support multiple DBs either Cassandra, CouchDB or DynamoDB based on the configuration in application.yml.
My entity class has annotations that are specific to Cassandra, and the annotations for DynamoDB are different. For example, DynamoDB has @DynamoDBTable for tables, while Cassandra has the @org.springframework.data.cassandra.core.mapping.Table annotation.
The problem is that I would like to use a single entity object irrespective of the DB type, because the entity is referenced from multiple places in the application. What is the best design pattern to implement this?
In the case of Cassandra:

package com.abc;

@Table("Cart")
public class Cart {
    @PrimaryKeyColumn(ordinal = 0, type = PrimaryKeyType.PARTITIONED)
    @GeneratedValue(strategy = GenerationType.AUTO)
    protected String id;

    @PrimaryKeyColumn(ordinal = 1, type = PrimaryKeyType.PARTITIONED)
    private String userId;

    @PrimaryKeyColumn(ordinal = 2, type = PrimaryKeyType.CLUSTERED, ordering = Ordering.DESCENDING)
    private String skuId;
}
In the case of DynamoDB:

@DynamoDBTable(tableName = "Cart")
public class Cart {
    @DynamoDBHashKey
    @DynamoDBAutoGeneratedKey
    protected String id;

    private String userId;
    private String skuId;
}
Thanks
I would suggest creating an intermediary object that acts as a bridge between your application logic and the database ORM.
You can then create helper functions that populate its fields.
class CartDAO {
    private String id;
    private String userId;
    private String skuId;
    // getters & setters
}
class CartService {
    CartDAO fetchFromDynamoDB(String id) {
        // fetch from DynamoDB
        // create a CartDAO from that object
        // return the CartDAO
    }

    CartDAO fetchFromCassandra(String id) {
        // fetch from Cassandra
        // create a CartDAO from that object
        // return the CartDAO
    }
}
Now you can use CartDAO seamlessly in your application logic.
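For illustration, one of those methods could be filled in like this. This is only a sketch: CassandraCartRepository, the Optional-returning findById, and the getter/setter names are assumptions, not something from the original post.

class CartService {
    private final CassandraCartRepository cassandraRepo; // hypothetical Spring Data repository

    CartService(CassandraCartRepository cassandraRepo) {
        this.cassandraRepo = cassandraRepo;
    }

    CartDAO fetchFromCassandra(String id) {
        // load the Cassandra-annotated entity, then copy its fields into the neutral DAO
        Cart entity = cassandraRepo.findById(id)
                .orElseThrow(() -> new IllegalArgumentException("no cart: " + id));
        CartDAO dao = new CartDAO();
        dao.setId(entity.getId());
        dao.setUserId(entity.getUserId());
        dao.setSkuId(entity.getSkuId());
        return dao;
    }
}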
Yes, it is possible.
Option 1:
Simply put the required annotations for both databases on the same entity class.
Each annotation has its own package and definition, so provide the required definition for each.
Option 2:
As described by snk01, you can use that approach as well.
Here I am assuming that you are writing the persistence layer for each database separately.
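A sketch of Option 1, stacking both sets of annotations on one class. The fully qualified imports below are the usual Spring Data Cassandra and AWS SDK v1 locations; verify them against your dependency versions before relying on this.

import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBAutoGeneratedKey;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBHashKey;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBTable;
import org.springframework.data.cassandra.core.cql.Ordering;
import org.springframework.data.cassandra.core.cql.PrimaryKeyType;
import org.springframework.data.cassandra.core.mapping.PrimaryKeyColumn;
import org.springframework.data.cassandra.core.mapping.Table;

// each framework reads only its own annotations and ignores the rest,
// so one class can serve both stores
@Table("Cart")
@DynamoDBTable(tableName = "Cart")
public class Cart {
    @PrimaryKeyColumn(ordinal = 0, type = PrimaryKeyType.PARTITIONED)
    @DynamoDBHashKey
    @DynamoDBAutoGeneratedKey
    protected String id;

    @PrimaryKeyColumn(ordinal = 1, type = PrimaryKeyType.PARTITIONED)
    private String userId;

    @PrimaryKeyColumn(ordinal = 2, type = PrimaryKeyType.CLUSTERED, ordering = Ordering.DESCENDING)
    private String skuId;
}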
I am trying to create an event log (ORMSLOG in the example) that saves events in human-readable form in the Datastore.
Doing this should write a readable event:
List<Device> devices = ofy().transactionless().load().type(Device.class).list();
ORMSLOG.log(ORMSLOG.GET_ALL_DEVICES, "Devices found: " + String.valueOf(devices));
ORMSLOG is a simple class:
public class ORMSLOG {
    public final static String CREATE_DEVICE = "Create Device";
    public final static String GET_ALL_DEVICES = "Get all Devices";

    public static void log(final String event, final String data) {
        ofy().save().entity(new Event(event, data)).now();
    }
}
But the data saved in the Datastore is not readable and looks like this:
[screenshot: ORMSLOG data]
I need to transform the object references into human-readable text.
You are just logging the String representation of the objects, which is produced by calling their toString method. Since you did not override toString in the Device class, you get the default representation (the class name plus a hash code) instead of readable text. If you override toString in your Device class to return whatever state you want to see, you will get a much better result. Most IDEs (e.g. Eclipse) can generate a toString method for you.
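For example, an override along these lines would make the log readable (a sketch; the name and serial fields are hypothetical, since the Device class isn't shown):

public class Device {
    private String name;
    private String serial;

    @Override
    public String toString() {
        // return whatever state you want to appear in the log
        return "Device{name='" + name + "', serial='" + serial + "'}";
    }
}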
MongoDB is a schemaless document database, but with Spring Data it is necessary to define an entity class and a repository class, like the following:
Entity class:
@Document(collection = "users")
public class User implements UserDetails {
    @Id
    private String userId;

    @NotNull
    @Indexed(unique = true)
    private String username;

    @NotNull
    private String password;

    @NotNull
    private String name;

    @NotNull
    private String email;
}
Repository class:
public interface UserRepository extends MongoRepository<User, String> {
    User findByUsername(String username);
}
Is there any way to use a map rather than a class in Spring Data MongoDB, so that the server can accept any dynamic JSON data and store it as BSON without any predefined class?
First, a few insightful links about schemaless data:
what does “schemaless” even mean anyway?
“schemaless” doesn't mean “schemafree”
Second... one may wonder whether Spring, or Java, is the right solution for your problem: why not a more dynamic tool, such as Ruby, Python, or the Mongo shell?
That being said, let's focus on the technical issue.
If your goal is only to store random data, you could basically just define your own controller and use the MongoDB Java Driver directly.
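Something along these lines (a sketch, assuming the synchronous MongoDB Java driver; the connection string, database, and collection names are placeholders):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class RawInsert {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> users = client.getDatabase("test").getCollection("users");
            // Document.parse accepts any JSON string, so no entity class is required
            users.insertOne(Document.parse("{\"username\": \"jdoe\", \"anything\": {\"nested\": 42}}"));
        }
    }
}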
If you really insist on having no predefined schema for your domain object class, use this:
@Document(collection = "users")
public class User implements UserDetails {
    @Id
    private String id;

    private Map<String, Object> schemalessData;
    // getters/setters omitted
}
Basically it gives you a container in which you can put whatever you want, but watch out for serialization/deserialization issues (this can become tricky if you have ObjectIds and DBRefs in your nested documents). Also, updating data may become nasty if your data hierarchy grows too complex.
Still, at some point, you'll realize your data indeed has a schema that can be pinpointed and put into well-defined POJOs.
Update
A late update, since people still happen to read this post in 2020: the Jackson annotations @JsonAnyGetter and @JsonAnySetter let you hide the root of the schemaless-data container, so your unknown fields can be sent as top-level fields in your payload. They will still be stored nested in your MongoDB document, but will appear as top-level fields when the resource is requested through Spring.
@Document(collection = "users")
public class User implements UserDetails {
    @Id
    private String id;

    // add all other expected fields (getters/setters omitted)
    private String foo;
    private String bar;

    // a container for all unexpected fields
    private Map<String, Object> schemalessData;

    @JsonAnySetter
    public void add(String key, Object value) {
        if (null == schemalessData) {
            schemalessData = new HashMap<>();
        }
        schemalessData.put(key, value);
    }

    @JsonAnyGetter
    public Map<String, Object> get() {
        return schemalessData;
    }

    // getters/setters omitted
}
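To see the effect, here is a quick round trip through Jackson (a self-contained sketch using a stripped-down bean in place of the full User class):

import com.fasterxml.jackson.annotation.JsonAnyGetter;
import com.fasterxml.jackson.annotation.JsonAnySetter;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.HashMap;
import java.util.Map;

public class AnyFieldDemo {
    public String foo; // an expected, declared field
    private Map<String, Object> schemalessData = new HashMap<>();

    @JsonAnySetter
    public void add(String key, Object value) { schemalessData.put(key, value); }

    @JsonAnyGetter
    public Map<String, Object> get() { return schemalessData; }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // "extra" has no declared field, so it is caught by the any-setter...
        AnyFieldDemo d = mapper.readValue("{\"foo\":\"a\",\"extra\":42}", AnyFieldDemo.class);
        // ...and re-emitted as a top-level field by the any-getter
        System.out.println(mapper.writeValueAsString(d)); // {"foo":"a","extra":42}
    }
}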
I'm trying to unmarshal a CSV that has composite fields. For instance, take the following example:
"order1","foo#email.com","(test1;45),(test2;89)"
The third attribute represents a list of two items (but the size of the list is variable), each item having a name and a price. The @Link annotation only works for one-to-one relations, so it is not an option. The @OneToMany annotation in CSV only works for writing, so it is not an option either.
The CSV is written by non-technical staff, so a complex format is not an option either.
Is it possible to manage this requirement?
The Java classes to instantiate would be, in this case, something like this:
public class Order {
    private String name;
    private String email;
    private List<Item> items;
}

public class Item {
    private String name;
    private int price;
}
Many thanks in advance
First post here, hoping someone could perhaps shed some light on an issue I've been trying to juggle...
As part of a school project we're attempting to build an interface to display points and paths on a map.
For our first sprint I managed to work out storing/retrieving items using Objectify - it went great!
Now we're trying to extend the functionality for our next sprint. We're having problems trying to store an object of type PathData (note that PathData and MapData, our two data types, both extend the class Data). Brief code snippets follow:
@Entity
public class Data extends JavaScriptObject {
    @Id
    Long id;
    private String name;
    private String dataSet;
    // ...getters and setters
}
@Subclass
public class MapData extends Data implements Serializable {
    private String name;
    private String address;
    private String dataSet;

    @Embedded
    private Coordinate location;
    // ...constructors, getters/setters
}
@Subclass
public class PathData extends Data implements Serializable {
    private String name;
    private String address;
    private String dataSet;

    @Embedded
    private Coordinate[] path;
    // ...etc
}
Now hopefully I haven't lost you yet. I have a DataService class that basically handles all transactions. I have the following unit test:
@Test
public void storeOnePath() {
    PathData pd = new PathData();
    pd.setName("hi");
    DataService.storeSingleton(pd);
    Data d = DataService.getSingleton("hi");
    assertEquals(pd, d);
}
The implementation of storeSingleton is as follows:
public static void storeSingleton(Data d) {
    Objectify obj = ObjectifyService.begin();
    obj.put(d);
}
JUnit complains:
java.lang.ExceptionInInitializerError
at com.teamrawket.tests.DataTest.storeOnePath(DataTest.java:59)
...<taken out>
Caused by: java.lang.IllegalStateException: Attempting to create multiple associations on class com.teamrawket.server.MapData for name
at com.googlecode.objectify.impl.Transmog$Visitor.addRootSetter(Transmog.java:298)
at com.googlecode.objectify.impl.Transmog$Visitor.visitField(Transmog.java:231)
at com.googlecode.objectify.impl.Transmog$Visitor.visitClass(Transmog.java:134)
at com.googlecode.objectify.impl.Transmog.<init>(Transmog.java:319)
at com.googlecode.objectify.impl.ConcreteEntityMetadata.<init>(ConcreteEntityMetadata.java:75)
at com.googlecode.objectify.impl.Registrar.registerPolymorphicHierarchy(Registrar.java:128)
at com.googlecode.objectify.impl.Registrar.register(Registrar.java:62)
at com.googlecode.objectify.ObjectifyFactory.register(ObjectifyFactory.java:209)
at com.googlecode.objectify.ObjectifyService.register(ObjectifyService.java:38)
at com.teamrawket.server.DataService.<clinit>(DataService.java:20)
... 27 more
What exactly does "attempting to create multiple associations on class ... for name" imply?
Sorry for the long post and any formatting issues that may arise.
You have repeated field names in your subclasses. You should not declare name and dataSet in both the superclass and the subclasses; remove these fields from MapData and PathData and you should be fine.
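For illustration, the corrected hierarchy might look like this (a sketch based only on the snippets above, with imports omitted as in the question; the duplicated fields now live only in the superclass):

@Entity
public class Data extends JavaScriptObject {
    @Id
    Long id;
    private String name;    // declared once, here only
    private String dataSet; // declared once, here only
    // getters and setters
}

@Subclass
public class PathData extends Data implements Serializable {
    private String address;

    @Embedded
    private Coordinate[] path;
    // constructors, getters/setters
}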
com.teamrawket.server.MapData refers to the fully qualified name of your MapData class, and the name at the end refers to the field String name in your MapData class. The whole exception is trying to tell you that a mapping for that field name is already registered in this hierarchy.
I would say another field with the same name was registered first. It would also be helpful to know exactly where line 59 is, as that is where the error occurred.