Writing parquet output with selected attributes from Bean - dataset

I have a bean class
@Getter
@Setter
public class Employee {
    String id;
    String name;
    String depart;
    String address;
    final String pipe = "|";
    @Override
    public String toString() {
        return id + pipe + name + pipe + depart;
    }
}
And I have a JavaRDD<Employee> emprdd;
When I call emprdd.saveAsTextFile(path), the output is based on the toString method.
Now I want to convert it to a DataFrame and write it in Parquet format, but I need only (id, name, depart). I tried sqlContext.createDataFrame(rdd, Employee.class) (syntax ignored), but I don't need all the properties.
Can anyone guide me through this? (This is a sample; my real bean class has 350+ attributes.)

Related

Flink map tuple to string

Hi, I have a tuple Tuple2<String, Integer> that I want to convert to a string and then send to Kafka.
I'm trying to figure out a way to iterate over the tuple and build one string from it, so that if the tuple has N elements, the string contains all of them.
I tried flatMap, but I'm getting a new string for each element in the tuple.
SingleOutputStreamOperator<String> s = t.flatMap(new FlatMapFunction<Tuple2<String, Integer>, String>() {
    @Override
    public void flatMap(Tuple2<String, Integer> stringIntegerTuple2, Collector<String> collector) throws Exception {
        collector.collect(stringIntegerTuple2.f0 + stringIntegerTuple2.f1);
    }
});
What is the correct way to create one string out of a tuple?
You can just override the .toString() method of a tuple in a custom subclass and use that. Like this:
import org.apache.flink.api.java.tuple.Tuple3;
public class CustomTuple3 extends Tuple3 {
    @Override
    public String toString() {
        return "measurment color=" + this.f0.toString() + " color=" + this.f1.toString() + " color=" + this.f2.toString();
    }
}
So now just use a CustomTuple3 object instead of a Tuple3 and when you populate it and call .toString() on it, it will output that formatted string.
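As an alternative to subclassing, the formatting can also live in a plain one-to-one mapping step. The sketch below is a minimal illustration only: Pair is a hypothetical stand-in for Flink's Tuple2 (reduced to the public fields f0 and f1) so that it runs without Flink, and the comma separator is an assumption. In a Flink job the same logic would presumably go into a MapFunction rather than a FlatMapFunction, since each tuple yields exactly one string.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for Flink's Tuple2, reduced to its two public fields.
class Pair<A, B> {
    final A f0;
    final B f1;

    Pair(A f0, B f1) {
        this.f0 = f0;
        this.f1 = f1;
    }
}

class TupleFormatter {
    // Builds exactly one string per tuple, with a separator between the fields.
    static String format(Pair<String, Integer> t) {
        return t.f0 + "," + t.f1;
    }

    // One-to-one mapping: N tuples in, N strings out.
    static List<String> formatAll(List<Pair<String, Integer>> tuples) {
        return tuples.stream()
                .map(TupleFormatter::format)
                .collect(Collectors.toList());
    }
}
```

Keeping the formatting in one small function makes it easy to change the separator or field order without touching the tuple type.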

solrj index child documents with multiple levels

I'm moving from multiple cores to a single core with nested documents (the reason being that matching/scoring across multiple cores is limited).
To achieve this I'm trying to index a nested structure using SolrJ.
I've tested the following code but I get the error "BookDetail cannot have more than one Field with child=true".
How can I avoid this? Is this a SolrJ limitation?
Indexing is done as follows:
- Solr and SolrJ version 5.3.1
HttpSolrClient mytestcore = new HttpSolrClient("...");
mytestcore.add(dob.toSolrInputDocument(new Book())); // should have some initialization
The structure is below (most unused fields are removed).
public class Book implements Serializable {
    @Field
    private String id;
    @Field
    private String type;
    @Field(child = true)
    private List<BookDetail> details;
    ...
}
public class BookDetail implements Serializable {
    @Field
    private String id;
    @Field
    private String type;
    @Field(child = true)
    private List<BookMetaData> bookMetaData;
    @Field(child = true)
    private List<BookContent> pages;
    ...
}
public class BookMetaData implements Serializable {
    @Field
    private String id;
    @Field
    private String type;
    ...
}
public class BookContent implements Serializable {
    @Field
    private String id;
    @Field
    private String type;
    @Field
    private String content;
    ...
}
Edit:
I have currently solved it by creating a separate SolrInputDocument for each document type and adding them with addChildDocument (as seen in other answers on Stack Overflow). But this solution no longer uses the "child = true" annotation...

How to store classes containing Lists of classes that also contain Lists using GAE and Objectify?

I have a Java model similar to:
public class Country {
    @Id private String id;
    private CurrencyId currencyId;
    private List<Province> provinceList;
    ...
}
public class Province {
    @Id private String id;
    private Gobernor gobernorId;
    private List<City> cityList;
    ...
}
public class City {
    @Id private String id;
    private String name;
    ...
}
I want to store that data using objectify. However, as Country data might change, I also want to store the date the Country data has been stored, so I think I should store an entity such as:
public class CountryListEntity {
    @Id private String id;
    private List<Country> countryList;
    private Date storeDate;
}
Note I will only have one entity of kind CountryListEntity, with the id "root", if I can store it like that. I know very little about how Google App Engine stores data or how Objectify works. I've tried many combinations of @Embedded, but I got many errors, e.g.
Cannot place array or collection properties inside @Embedded arrays or collections
Can anyone tell me how to define these classes? A snippet of the code needed to store and retrieve this "root" entity would be highly appreciated!
@Embedded collections are transformed into a series of collection fields in the low-level Entity. That's why one level of embedding is all you can do.
If you are going to store/load all the data at once, and your entities are as simple as the ones in your example, you can put the @Serialized annotation on the lists inside your @Embedded lists.
You can find out more from this discussion.
The problem with this approach is that the serialized lower-level lists can't be indexed.
public class CountryListEntity {
    @Id private String id;
    @Embedded
    private List<Country> countryList;
    private Date storeDate;
}
public class Country implements Serializable {
    private String id;
    private CurrencyId currencyId;
    @Serialized
    private List<Province> provinceList;
    // ...
}
public class Province implements Serializable {
    private String id;
    private Gobernor gobernorId;
    @Serialized
    private List<City> cityList;
    // ...
}
public class City implements Serializable {
    private String id;
    private String name;
    // ...
}
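To see why a serialized list cannot be indexed, it helps to look at what serialization produces: one opaque byte array for the whole nested structure. The sketch below uses plain java.io serialization, not Objectify itself, and assumes that a serialized field is stored in a comparable way. ProvinceData, CityData and BlobCodec are hypothetical names introduced only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified counterparts of City and Province.
class CityData implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;

    CityData(String name) {
        this.name = name;
    }
}

class ProvinceData implements Serializable {
    private static final long serialVersionUID = 1L;
    final String id;
    final List<CityData> cities = new ArrayList<>();

    ProvinceData(String id) {
        this.id = id;
    }
}

class BlobCodec {
    // Turns the whole object graph into one opaque byte array.
    static byte[] toBlob(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Restores the object graph; individual fields are only reachable after this step.
    static Object fromBlob(byte[] blob) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(blob))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the nested lists only exist as fields again after fromBlob has run, a datastore query has nothing to filter on inside the blob, which is the trade-off the answer describes.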

How do I query a single field in AppEngine using JDO

I've got a Product POJO that looks like this:
@PersistenceCapable(identityType = IdentityType.APPLICATION)
public class Product extends AbstractModel {
    @Persistent
    private String name;
    @Persistent
    private Key homePage;
    @Persistent
    private Boolean featured;
    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public Key getHomePage() {
        return homePage;
    }
    public void setHomePage(Key homePage) {
        this.homePage = homePage;
    }
    public boolean isFeatured() {
        return featured;
    }
    public void setFeatured(Boolean featured) {
        this.featured = featured;
    }
}
My DataStore is currently completely empty.
I'd like to retrieve all homePage keys where featured is true for the Product.
I'm trying
PersistenceManager persistenceManager = getPersistenceManager();
Query query = persistenceManager.newQuery("SELECT homePage FROM " + getModelClass());
query.setFilter("featured == true");
List<Key> productPageKeys = (List<Key>) query.execute();
However this is giving me a null pointer error. How should I be constructing this query?
Cheers,
Peter
To do a projection, you would do something like
Query q = pm.newQuery("SELECT myField FROM mydomain.MyClass WHERE featured == true");
List<String> results = (List<String>)q.execute();
where String is the type of my field. Any basic JDO documentation would define that.
Internally GAE/J will retrieve the Entity, and then in the post-processing before returning it to the user it is manipulated into the projection you require.
As Nick pointed out in the other reply, this gives no performance gain over doing it yourself ... but then the whole point of a standard persistence API is to shield you from such datastore-specifics of having to do such extraction; it's all provided out of the box.
Entities are stored as serialized blobs of data in the datastore, so it's not possible to retrieve and return a single field from an entity. You need to fetch the whole entity, and extract the field you care about yourself.
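The "fetch the whole entity and extract the field yourself" approach can be sketched in plain Java. This is only an illustration under stated assumptions: ProductRecord is a hypothetical, simplified stand-in for the Product entity (homePage is a String here rather than a datastore Key), and the input list stands for whatever a query over whole entities returned.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical, simplified stand-in for the Product entity.
class ProductRecord {
    final String name;
    final String homePage;   // a String here; a datastore Key in the original
    final Boolean featured;  // may be null, like the Boolean field in the original

    ProductRecord(String name, String homePage, Boolean featured) {
        this.name = name;
        this.homePage = homePage;
        this.featured = featured;
    }
}

class FeaturedHomePages {
    // Keeps only featured products and extracts their homePage field.
    static List<String> homePagesOfFeatured(List<ProductRecord> products) {
        return products.stream()
                .filter(p -> Boolean.TRUE.equals(p.featured)) // null-safe check
                .map(p -> p.homePage)
                .collect(Collectors.toList());
    }
}
```

Comparing with Boolean.TRUE.equals avoids unboxing a null Boolean, which would otherwise throw a NullPointerException for entities that never had the flag set.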

Grails constraints GORM-JPA always sorting alphabetically

In my grails app, in which I use GORM-JPA, I cannot define the order of the elements of the class using the constraints. If I autogenerate the views, they are all sorted alphabetically, instead of the defined order. Here's my source class:
package kbdw
import javax.persistence.*;
// import com.google.appengine.api.datastore.Key;
@Entity
class Organisatie implements Serializable {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    Long id
    @Basic
    String naam
    @Basic
    String telefoonnummer
    @Basic
    String email
    @Basic
    OrganisatieType type
    @Basic
    String adresLijnEen
    @Basic
    String adresLijnTwee
    @Basic
    String gemeente
    @Basic
    String postcode
    @Basic
    String faxnummer
    static constraints = {
        id visible:false
        naam size: 3..75
        telefoonnummer size: 4..18
        email email:true
        type blank:false
        adresLijnEen size:5..250
        adresLijnTwee blank:true
        gemeente size: 2..100
        postcode size: 4..10
        faxnummer size: 4..18
    }
}
enum OrganisatieType {
    School,
    NonProfit,
    Bedrijf
}
The variable names are in Dutch, but it should be clear (Organisatie = organisation, naam = name, adres = address, ...).
How do I force the app to use that order of properties? Do I need to use @ annotations?
Thank you!
Yvan
(ps: it's for deploying on the Google App Engine ;-) )
Try installing and hacking the scaffolding templates, and use DomainClassPropertyComparator in your GSPs. The scaffold templates call Collections.sort() with the default comparator, but you can pass an explicit one.
The absence of Hibernate might be the cause: without it, DomainClassPropertyComparator won't work, and Grails uses SimpleDomainClassPropertyComparator (I'm looking at DefaultGrailsTemplateGenerator.groovy).
You can certainly provide another Comparator that compares the order of the declared fields.
EDIT:
For example, after installing scaffolding I have a file <project root>\src\templates\scaffolding\edit.gsp. Inside, there are such lines:
props = domainClass.properties.findAll{ ... }
Collections.sort(props, comparator. ... )
where comparator is variable provided by Grails scaffolding. You can do:
props = ...
Collections.sort(props, new PropComparator(domainClass.clazz))
where PropComparator is something like
class PropComparator implements Comparator {
    private Class clazz
    PropComparator(Class clazz) { this.clazz = clazz }
    int compare(Object o1, Object o2) {
        // Compare by property name; the minus stays on the first line so Groovy parses one expression
        clazz.declaredFields.findIndexOf { it.name == o1.name } -
            clazz.declaredFields.findIndexOf { it.name == o2.name }
    }
}
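The same idea, ordering properties by a declared position instead of alphabetically, can be sketched in plain Java. One caveat applies: the Java documentation for Class.getDeclaredFields() states that the returned fields are not in any particular order, so this sketch takes the desired order as an explicit list rather than relying on reflection. DeclaredOrderComparator and DeclaredOrderDemo are hypothetical names, and the sketch compares plain property names rather than Grails property objects.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Orders property names by their position in an explicit list.
// Names missing from the list sort last, keeping their incoming order.
class DeclaredOrderComparator implements Comparator<String> {
    private final List<String> order;

    DeclaredOrderComparator(List<String> order) {
        this.order = order;
    }

    private int rank(String name) {
        int i = order.indexOf(name);
        return i < 0 ? Integer.MAX_VALUE : i;
    }

    @Override
    public int compare(String a, String b) {
        // Integer.compare avoids the overflow risk of subtracting ranks.
        return Integer.compare(rank(a), rank(b));
    }
}

class DeclaredOrderDemo {
    // Returns a sorted copy and leaves the input list untouched.
    static List<String> sorted(List<String> names, List<String> order) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(new DeclaredOrderComparator(order)); // List.sort is stable
        return copy;
    }
}
```

With the order naam, telefoonnummer, email, an alphabetically sorted input comes back in that declared order, and any property not mentioned in the list is placed at the end.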
