Is there a way to query solr "leader" directly using solrj? - solr

I have a single-shard architecture with 1 leader and 1 replica. When using "CloudSolrClient", queries are distributed to both the leader and the replica. Is there a way to point it only at the leader (using ZooKeeper) other than finding the leader manually and building the query?

It's possible to get the shard leaders in SolrJ, and there are several scenarios where this is useful, for instance when you need to perform a backup programmatically (see the example in the Solr in Action book).
Here is the relevant code I use:
private final String COLLECTION_NAME = "myCollection";
private final int ZOOKEEPER_CLIENT_TIMEOUT_MS = 1000000;

private Map<String, String> getShardLeaders(CloudSolrServer cloudSolrServer) throws InterruptedException, KeeperException {
    Map<String, String> shardLeaders = new TreeMap<String, String>();
    ZkStateReader zkStateReader = cloudSolrServer.getZkStateReader();
    for (Slice slice : zkStateReader.getClusterState().getSlices(COLLECTION_NAME)) {
        shardLeaders.put(slice.getName(), zkStateReader.getLeaderUrl(COLLECTION_NAME, slice.getName(), ZOOKEEPER_CLIENT_TIMEOUT_MS));
    }
    return shardLeaders;
}
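If the goal is to send queries only to the leader, one option is to build a plain HttpSolrServer from the leader URL returned by this map. The snippet below is a minimal sketch assuming the same SolrJ 4.x API as above; the shard name is illustrative:
// Sketch: query only the leader of "shard1" (shard name is illustrative)
Map<String, String> leaders = getShardLeaders(cloudSolrServer);
String leaderUrl = leaders.get("shard1");
HttpSolrServer leaderServer = new HttpSolrServer(leaderUrl);
SolrQuery query = new SolrQuery("*:*");
query.set("distrib", "false"); // keep the query on this core instead of distributing it
QueryResponse response = leaderServer.query(query);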

Related

How do I access the explain() method and executionStats when using Spring Data MongoDb v2.x?

It's time to ask the community. I cannot find the answer anywhere.
I want to create a generic method that can trace all my repository queries and warn me if a query is not optimized (aka missing an index).
With Spring Data MongoDb v2.x and higher and with the introduction of the Document API, I cannot figure out how to access DBCursor and the explain() method.
The old way was to do it like this:
https://enesaltinkaya.com/java/how-to-explain-a-mongodb-query-in-spring/
Any advice on this is appreciated.
I know this is an old question, but I wanted to give input from a similar requirement I had in capacity planning for a Cosmos DB project using the Java Mongo API driver v2.x.
Summarizing Enes Altınkaya's blog post: with an @Autowired MongoTemplate, we use runCommand to execute server-side db queries by passing a Document object. To get an explain output, we parse a Query or Aggregation object into a new Document object and add the entry {"executionStats": true} (or {"executionStatistics": true} for Cosmos DB), then wrap it in another Document using "explain" as the property.
For Example:
Query:
public static Document documentRequestStatsQuery(MongoTemplate mongoTemplate,
                                                 Query query, String collectionName) {
    Document queryDocument = new Document();
    queryDocument.put("find", collectionName);
    queryDocument.put("filter", query.getQueryObject());
    queryDocument.put("sort", query.getSortObject());
    queryDocument.put("skip", query.getSkip());
    queryDocument.put("limit", query.getLimit());
    queryDocument.put("executionStatistics", true);
    Document command = new Document();
    command.put("explain", queryDocument);
    Document explainResult = mongoTemplate.getDb().runCommand(command);
    return explainResult;
}
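A hypothetical call to this helper might look like the following (the collection name, field, and the mongoTemplate/log references are illustrative assumptions):
// Sketch: explain a simple find on an "orders" collection (names are illustrative)
Query query = new Query(Criteria.where("status").is("ACTIVE")).limit(100);
Document stats = documentRequestStatsQuery(mongoTemplate, query, "orders");
log.info(stats.toJson());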
Aggregate:
public static Document documentRequestStatsAggregate(MongoTemplate mongoTemplate,
                                                     Aggregation aggregate, String collection) {
    Document explainAggDocument = Document.parse(aggregate.toString());
    explainAggDocument.put("aggregate", collection);
    explainAggDocument.put("executionStatistics", true);
    Document command = new Document();
    command.put("explain", explainAggDocument);
    Document explainResult = mongoTemplate.getDb().runCommand(command);
    return explainResult;
}
For the actual monitoring, since Service & Repository classes are MongoTemplate abstractions, we can use Aspects to capture the query/aggregate execution details as the application is running.
For Example:
@Aspect
@Component
@Slf4j
public class RequestStats {

    @Autowired
    MongoTemplate mongoTemplate;

    @After("execution(* org.springframework.data.mongodb.core.MongoTemplate.aggregate(..))")
    public void logTemplateAggregate(JoinPoint joinPoint) {
        Object[] signatureArgs = joinPoint.getArgs();
        Aggregation aggregate = (Aggregation) signatureArgs[0];
        String collectionName = (String) signatureArgs[1];
        Document explainAggDocument = Document.parse(aggregate.toString());
        explainAggDocument.put("aggregate", collectionName);
        explainAggDocument.put("executionStatistics", true);
        Document dbCommand = new Document();
        dbCommand.put("explain", explainAggDocument);
        Document explainResult = mongoTemplate.getDb().runCommand(dbCommand);
        log.info(explainResult.toJson());
    }
}
Outputs something like below after each execution:
{
"queryMetrics": {
"retrievedDocumentCount": 101,
"retrievedDocumentSizeBytes": 202214,
"outputDocumentCount": 101,
"outputDocumentSizeBytes": 27800,
"indexHitRatio": 1.0,
"totalQueryExecutionTimeMS": 15.85,
"queryPreparationTimes": {
"queryCompilationTimeMS": 0.21,
"logicalPlanBuildTimeMS": 0.5,
"physicalPlanBuildTimeMS": 0.58,
"queryOptimizationTimeMS": 0.1
},
"indexLookupTimeMS": 10.43,
"documentLoadTimeMS": 0.93,
"vmExecutionTimeMS": 13.6,
"runtimeExecutionTimes": {
"queryEngineExecutionTimeMS": 1.56,
"systemFunctionExecutionTimeMS": 1.36,
"userDefinedFunctionExecutionTimeMS": 0
},
"documentWriteTimeMS": 0.68
}
// ...
I usually log this out to another collection or write it to a file.
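For the collection option, a minimal sketch (the "query_stats" collection name is just an illustration) could be:
// Sketch: store the explain output in a separate collection for later analysis
mongoTemplate.getCollection("query_stats").insertOne(explainResult);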

Conflict between spring-data-cassandra and spring-data-solr

I'm currently working on a project that needs the Cassandra database to have search ability. We've got a DataStax cluster and we want to use Spring Data to simplify database operations. However, when we made an entity that has both @Table (for Cassandra) and @SolrDocument (for Solr), it turned out to be broken. The only error we got is the one below. Has anyone encountered such a problem?
Caused by: org.springframework.data.mapping.PropertyReferenceException: No property findAll found for type ENTITYNAME!
I know that this is probably a Spring issue, but I hope to find someone who has fought this type of problem.
Greetings!
Some sample entity causing problems:
@SolrDocument(solrCoreName = "sample_entity")
@Table("sample_entity")
@Getter
@Setter
@AllArgsConstructor
@NoArgsConstructor
public final class SampleEntity {

    @PrimaryKey
    @Indexed(name = "id")
    private UUID id;

    private LocalDateTime created;
    private UUID foreignId;

    @Indexed(name = "name")
    private String name;

    private boolean someflag = true;
}
You're mixing things up - if you're using DSE Search, it's better to perform the search via CQL by querying the solr_query column. In your example, @SolrDocument will force the use of Solr's HTTP API, while @Table will force the use of CQL.
You can use the Object Mapper from DataStax to map classes to tables, like this:
// STest.java
@Table(keyspace = "test", name = "stest")
public class STest {
    @PartitionKey
    private int id;
    private String t;
    // getters and setters omitted
}

// STestAccessor.java
@Accessor
public interface STestAccessor {
    @Query("SELECT * FROM test.stest WHERE solr_query = :solr")
    Result<STest> getViaSolr(@Param("solr") String solr);
}

// STestMain.java
MappingManager manager = new MappingManager(session);
STestAccessor sa = manager.createAccessor(STestAccessor.class);
Result<STest> rs = sa.getViaSolr("*:*");
for (STest sTest : rs) {
    System.out.println("id=" + sTest.getId() + ", text=" + sTest.getT());
}
Here is the full code.

Is there an equivalent to Kafka's KTable in Apache Flink?

Apache Kafka has a concept of a KTable, where each data record represents an update.
Essentially, I can consume a Kafka topic and only keep the latest message per key.
Is there a similar concept available in Apache Flink? I have read about Flink's Table API, but it does not seem to solve the same problem.
Some help comparing and contrasting the two frameworks would be helpful. I am not looking for which is better or worse, but rather just how they differ. The right choice would then depend on my requirements.
You are right. Flink's Table API and its Table class do not correspond to Kafka's KTable. The Table API is a relational language-embedded API (think of SQL integrated in Java and Scala).
Flink's DataStream API does not have a built-in concept that corresponds to a KTable. Instead, Flink offers sophisticated state management and a KTable would be a regular operator with keyed state.
For example, a stateful operator with two inputs that stores the latest value observed from the first input and joins it with values from the second input can be implemented with a CoFlatMapFunction as follows:
DataStream<Tuple2<Long, String>> first = ...
DataStream<Tuple2<Long, String>> second = ...

DataStream<Tuple2<String, String>> result = first
    // connect first and second stream
    .connect(second)
    // key both streams on the first (Long) attribute
    .keyBy(0, 0)
    // join them
    .flatMap(new TableLookup());

// ------

public static class TableLookup
        extends RichCoFlatMapFunction<Tuple2<Long, String>, Tuple2<Long, String>, Tuple2<String, String>> {

    // keyed state
    private ValueState<String> lastVal;

    @Override
    public void open(Configuration conf) {
        ValueStateDescriptor<String> valueDesc =
            new ValueStateDescriptor<String>("table", Types.STRING);
        lastVal = getRuntimeContext().getState(valueDesc);
    }

    @Override
    public void flatMap1(Tuple2<Long, String> value, Collector<Tuple2<String, String>> out) throws Exception {
        // update the value for the current Long key with the String value.
        lastVal.update(value.f1);
    }

    @Override
    public void flatMap2(Tuple2<Long, String> value, Collector<Tuple2<String, String>> out) throws Exception {
        // look up the latest String for the current Long key.
        String lookup = lastVal.value();
        // emit the current String and the looked-up String
        out.collect(Tuple2.of(value.f1, lookup));
    }
}
In general, state can be used very flexibly with Flink and lets you implement a wide range of use cases. There are also more state types, such as ListState and MapState, and with a ProcessFunction you have fine-grained control over time, for example to remove the state of a key if it has not been updated for a certain amount of time (KTables have a configuration for that, as far as I know).
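A minimal sketch of that time-based cleanup, assuming Flink's KeyedProcessFunction and processing-time timers (the class name and TTL value are illustrative, and imports are omitted as in the snippet above):
// Sketch: keep the latest value per key and drop it if no update arrives within TTL_MS.
public static class ExpiringTable
        extends KeyedProcessFunction<Long, Tuple2<Long, String>, Tuple2<Long, String>> {

    private static final long TTL_MS = 60_000L; // illustrative retention period

    private ValueState<String> lastVal;
    private ValueState<Long> cleanupTimer;

    @Override
    public void open(Configuration conf) {
        lastVal = getRuntimeContext().getState(
            new ValueStateDescriptor<>("lastVal", Types.STRING));
        cleanupTimer = getRuntimeContext().getState(
            new ValueStateDescriptor<>("cleanupTimer", Types.LONG));
    }

    @Override
    public void processElement(Tuple2<Long, String> value, Context ctx,
                               Collector<Tuple2<Long, String>> out) throws Exception {
        // replace any previously scheduled cleanup for this key
        Long oldTimer = cleanupTimer.value();
        if (oldTimer != null) {
            ctx.timerService().deleteProcessingTimeTimer(oldTimer);
        }
        long newTimer = ctx.timerService().currentProcessingTime() + TTL_MS;
        ctx.timerService().registerProcessingTimeTimer(newTimer);
        cleanupTimer.update(newTimer);

        lastVal.update(value.f1);
        out.collect(value);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple2<Long, String>> out) throws Exception {
        // no update within TTL_MS: remove the key's state
        lastVal.clear();
        cleanupTimer.clear();
    }
}
Applied to the streams above, something like first.keyBy(t -> t.f0).process(new ExpiringTable()) would give KTable-like latest-value semantics with automatic expiry.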

How do I setup a streamed set of SQL Inserts in Apache Camel

I have a file with over 3 million pipe-delimited rows that I want to insert into a database. It's a simple table (no normalisation required).
Setting up the route to watch for the file, read it in using streaming mode and split the lines is easy. Inserting rows into the table will also be a simple wiring job.
Question is: how can I do this using batched inserts? Let's say that 1000 rows is optimal. Given that the file is streamed, how would the SQL component know that the stream had finished? Let's say the file had 3,000,001 records. How can I set Camel up to insert the last stray record?
Inserting the lines one at a time can be done - but this will be horribly slow.
I would recommend something like this:
from("file:....")
.split("\n").streaming()
.to("any work for individual level")
.aggregate(body(), new MyAggregationStrategy().completionSize(1000).completionTimeout(50)
.to(sql:......);
I didn't validate all the syntax, but the plan would be to grab the file, split it with streaming, then aggregate groups of 1000 and use a timeout to catch that last smaller group. Those aggregated groups could simply make the body a list of strings, or whatever format you need for your batch SQL insert.
Here is a more accurate example:
@Component
@Slf4j
public class SQLRoute extends RouteBuilder {

    @Autowired
    ListAggregationStrategy aggregationStrategy;

    @Override
    public void configure() throws Exception {
        from("timer://runOnce?repeatCount=1&delay=0")
            .to("sql:classpath:sql/orders.sql?outputType=StreamList")
            .split(body()).streaming()
                .aggregate(constant(1), aggregationStrategy).completionSize(1000).completionTimeout(500)
                    .to("log:batch")
                    .to("google-bigquery:google_project:import:orders")
                .end()
            .end();
    }

    @Component
    class ListAggregationStrategy implements AggregationStrategy {

        public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
            List rows = null;
            if (oldExchange == null) {
                // First row ->
                rows = new LinkedList();
                rows.add(newExchange.getMessage().getBody());
                newExchange.getMessage().setBody(rows);
                return newExchange;
            }
            rows = oldExchange.getIn().getBody(List.class);
            Map newRow = newExchange.getIn().getBody(Map.class);
            log.debug("Current rows count: {} ", rows.size());
            log.debug("Adding new row: {}", newRow);
            rows.add(newRow);
            oldExchange.getIn().setBody(rows);
            return oldExchange;
        }
    }
}
This can be done using the camel-spring-batch component (http://camel.apache.org/springbatch.html). The commit volume per step can be defined by the commitInterval, and the orchestration of the job is defined in a Spring config. It works quite well for use cases similar to your requirement.
Here's a nice example from GitHub: https://github.com/hekonsek/fuse-pocs/tree/master/fuse-pocs-springdm-springbatch/fuse-pocs-springdm-springbatch-bundle/src/main

Objectify doesn't always return results

I am using Objectify to store data in Google App Engine's datastore. I have been trying to implement a one-to-many relationship between two classes by storing a list of parameterised keys. The method below works perfectly some of the time, but returns an empty array at other times - does anyone know why this may be?
It will either return the correct list of CourseYears, or
{
"items": [
]
}
Here is the method:
@ApiMethod(name = "getCourseYears")
@ApiResourceProperty(ignored = AnnotationBoolean.TRUE)
public ArrayList<CourseYear> getCourseYears(@Named("name") String name) {
    Course course = ofy().load().type(Course.class).filter("name", name).first().now();
    System.out.println(course.getName());
    ArrayList<CourseYear> courseYears = new ArrayList<CourseYear>();
    for (Key<CourseYear> courseYearKey : course.getCourseYears()) {
        courseYears.add(ofy().load().type(CourseYear.class).id(courseYearKey.getId()).now());
    }
    return courseYears;
}
The Course class, which stores many CourseYear keys:
@Entity
public class Course {

    @Id
    @Index
    private Long courseId;

    private String code;

    @Index
    private String name;

    @ApiResourceProperty(ignored = AnnotationBoolean.TRUE)
    public List<Key<CourseYear>> getCourseYears() {
        return courseYears;
    }

    @ApiResourceProperty(ignored = AnnotationBoolean.TRUE)
    public void setCourseYears(List<Key<CourseYear>> courseYears) {
        this.courseYears = courseYears;
    }

    @ApiResourceProperty(ignored = AnnotationBoolean.TRUE)
    public void addCourseYear(Key<CourseYear> courseYearRef) {
        courseYears.add(courseYearRef);
    }

    @Load
    @ApiResourceProperty(ignored = AnnotationBoolean.TRUE)
    List<Key<CourseYear>> courseYears = new ArrayList<Key<CourseYear>>();

    ...
}
I am debugging this on the dev server using the API Explorer. I have found that it generally works a few times at first, but if I leave the API and return to run it again, it stops working after that.
Does anyone have any idea what might be going wrong?
Many thanks.
You might want to reduce the number of queries you send to the datastore. Try something like this:
Course course = ofy().load().type(Course.class).filter("name", name).first().now();
ArrayList<CourseYear> courseYears = new ArrayList<CourseYear>();
List<Long> courseYearIds = new ArrayList<Long>();
for (Key<CourseYear> courseYearKey : course.getCourseYears()) {
    courseYearIds.add(courseYearKey.getId());
}
Map<Long, CourseYear> courseYearsById = ofy().load().type(CourseYear.class).ids(courseYearIds);
// add all loaded CourseYears from the map to your courseYears list
courseYears.addAll(courseYearsById.values());
I also strongly recommend a change in your data structure / entities:
In your CourseYear, add a property Ref<Course> courseRef referencing the parent Course and make it indexed (@Index). Then query by
ofy().load().type(CourseYear.class).filter("courseRef", yourCourseRef).list();
This way you'll only require a single query.
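A minimal sketch of that suggested structure (the class body below is an assumption, not code from the question):
// Sketch: CourseYear holding an indexed reference to its parent Course
@Entity
public class CourseYear {

    @Id
    private Long id;

    @Index
    private Ref<Course> courseRef; // used by the filter("courseRef", ...) query above

    // other fields, getters and setters omitted
}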
The two most likely candidates are:
Eventual consistency behavior of the high-replication datastore. Queries (i.e. your filter() operation) always run a little behind because indexes propagate through GAE asynchronously. See the GAE docs.
You haven't installed the ObjectifyFilter. Read the setup guide. Recent versions of Objectify throw an error if you haven't installed it, so if you're on the latest version, this isn't it.
