If I want to show the total number of files in Alfresco on an Alfresco page, how do I do that?
So far I have not found an API that gives access to the database, and if I do find such an API, what should I do next?
You can make a query to SearchService, like this:
SearchParameters params = new SearchParameters();
params.getStores().add(StoreRef.STORE_REF_WORKSPACE_SPACESSTORE);
params.setLanguage(SearchService.LANGUAGE_FTS_ALFRESCO);
params.setQuery("TYPE:cm\\:content AND PATH:\"/app\\:company_home/st\\:sites/cm\\:test/cm\\:documentLibrary//*\"");
ResultSet result = searchService.query(params);
try {
    System.out.println(result.length()); // number of nodes matching the query
} finally {
    result.close(); // close the ResultSet to release the underlying resources
}
But I'm not sure how optimised it is for performance.
An easy way: you can query any type through the CMIS API: localhost:8080/alfresco/service/cmis/query?q={q}
q is a query in the CMIS Query Language for Alfresco. For example, SELECT * FROM cmis:document selects all properties of all documents.
See the CMIS Query Language documentation for more.
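Since the goal here is a count rather than the documents themselves, a lighter variant (just a sketch, and note that CMIS QL has no COUNT(*) aggregate) is to select a single property and read the total from the paging metadata of the response:
SELECT cmis:objectId FROM cmis:document
Depending on the binding and Alfresco version, the total typically comes back as a numItems-style field alongside the results, so there is no need to transfer every property of every document.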
In my experience it is sometimes better to get such information directly from the database.
Just for info: in my current project we have more than 50,000 documents in the repository, and I need the exact number for monitoring.
Here are a few reasons to use DB queries in some cases:
CMIS is (much) slower (in my case it took ~1-2 seconds per query). As @lightoze suggested, you can use the SearchService, but then you get the documents back in a ResultSet and still have to call its length method to get the count, which I think is more time-consuming than a plain SQL call. And in my case I make such calls every 5 minutes.
There is a bug in 5.0.c which limits the results of some queries to 1000 documents.
Here you can find how to connect to the database, and here are some interesting queries, including one for the total number of documents in the repository.
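For reference, the kind of direct query meant here looks roughly like the following; treat the table and column names as assumptions, since the alf_node/alf_qname schema differs between Alfresco versions, and querying the repository database directly is generally unsupported, so use a read-only account:
SELECT COUNT(*) AS doc_count
FROM alf_node n
JOIN alf_qname q ON n.type_qname_id = q.id
WHERE q.local_name = 'content';  -- nodes whose type is cm:content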
My data source cannot be a single table as I need data that spans across 6 tables. For that I have created a view that does joins on these tables.
When I use this view as the data source, indexing takes a long time and times out. I tried increasing the timeout to 40 minutes and one more suggested change:
"disableOrderByHighWaterMarkColumn" : true
It still timed out. I also set the batch size to 1000. This time it populated the index, but it failed after a few hours with a "connection lost" error, and "thanks" to disableOrderByHighWaterMarkColumn, if I rerun the indexer it will process all the rows again.
My question is: what is the best way to approach a solution to this problem?
Follow-up question: since I am relying on a view, I cannot have automatic change tracking. I am using a high watermark column (LastUpdatedTime) to track changes in my view. I only want to keep 6 months of data in my index, and I am not sure how to do that when I am using a view. I already have "WHERE CreateDateTime > dateadd(month, -6, getdate())" in my view, but this will not let the indexer delete "out-of-time-window" rows (documents) from the index. How can I achieve my goals here?
Should I write a processor task that periodically queries all documents using the C# SDK and deletes documents based on date?
Sorry to hear the Azure SQL Database indexer is giving you trouble. I noticed a couple of things in your question that might be worth thinking about in terms of SQL performance:
My data source cannot be a single table as I need data that spans across 6 tables. For that I have created a view that does joins on these tables. When I use this view as the data source, indexing takes a long time and times out.
It's worth taking a look at the query performance troubleshooting guide and figuring out what exactly is happening in your Azure SQL database that is causing problems. Assuming you want to use change tracking support, the default query the indexer uses against the SQL database looks like this:
SELECT * FROM c WHERE hwm_column > @hwmvalue ORDER BY hwm_column
We frequently see issues with performance here when there isn't an index on the hwm_column or if hwm_column is computed. You can read more about issues with the high water mark column here.
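For illustration only (dbo.BaseTable and the column name are placeholders, not taken from the question): since a plain view can't carry its own index, the usual first step is to make sure the column the high water mark is built from is indexed on the underlying table, so the query above can seek instead of scan.
-- Hypothetical example: index the high water mark source column on the base table.
CREATE INDEX IX_BaseTable_LastUpdatedTime
    ON dbo.BaseTable (LastUpdatedTime);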
I tried increasing the timeout to 40 minutes and one more suggested change: "disableOrderByHighWaterMarkColumn": true. It still timed out. I also set the batch size to 1000. This time it populated the index, but it failed after a few hours with a "connection lost" error, and "thanks" to disableOrderByHighWaterMarkColumn, if I rerun the indexer it will process all the rows again.
disableOrderByHighWaterMarkColumn doesn't seem like it will work for your scenario, so I agree that you shouldn't set it. Decreasing the batch size seems to have had a positive effect; I would consider measuring the performance gain here using the troubleshooting guide referenced above.
Follow-up question: since I am relying on a view, I cannot have automatic change tracking. I am using a high watermark column (LastUpdatedTime) to track changes in my view. I only want to keep 6 months of data in my index, and I am not sure how to do that when I am using a view. I already have "WHERE CreateDateTime > dateadd(month, -6, getdate())" in my view, but this will not let the indexer delete "out-of-time-window" rows (documents) from the index. How can I achieve my goals here? Should I write a processor task that periodically queries all documents using the C# SDK and deletes documents based on date?
Instead of filtering out data that is more than 6 months old, I would consider adding a soft delete policy. The challenge here is that the indexer needs to pick up the rows that should be deleted. The easiest way to accomplish this might be updating your application logic to add a new column to your view indicating that the row should be deleted. Once the value of this column changes, the LastUpdatedTime should also be updated so the row shows up in the next indexer query.
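As a rough sketch of that idea (all names are made up, and the IsDeleted column is assumed to be maintained by your application logic as described above):
-- Hypothetical data source view: expose a soft delete flag next to the high
-- water mark column. When the application marks a row as deleted (e.g. because
-- it is older than 6 months), it also bumps LastUpdatedTime so the indexer
-- sees the change and removes the document from the index.
CREATE VIEW dbo.SearchIndexSource AS
SELECT  t.Id,
        t.Title,
        t.LastUpdatedTime,
        t.IsDeleted
FROM dbo.BaseTable t;
The indexer's soft delete policy would then point at the IsDeleted column with 1 as the marker value.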
You can write your own processor task, but querying all documents in Azure Cognitive Search and paging through them may have negative implications for your search performance. I would recommend trying to get it working with your indexer first.
I noticed that even the simplest 'SELECT MAX(TIMESTAMP) FROM MYVIEW' is somewhat slow (taking minutes) in my environment, and I found it is doing a TableScan of 100+ GB across 80K micro-partitions.
My expectation was that this would finish in milliseconds using the MIN/MAX/COUNT metadata kept for each micro-partition. In fact, I do see Snowflake finishing the job in milliseconds using metadata for almost the same MIN/MAX lookup in the following article:
http://cloudsqale.com/2019/05/03/performance-of-min-max-functions-metadata-operations-and-partition-pruning-in-snowflake/
Is there any limitation in how Snowflake decides to use metadata? Could it be because I'm querying through a view, instead of querying a table directly?
=== Added for clarity ===
Thanks for answering! Regarding how the view is defined: it seems to add a WHERE clause for additional filtering on a cluster key, so I believe it should still be possible to fully use the micro-partition metadata. But as posted, a TableScan shows up in the profiler output.
I'm a bit concerned about your comment on secure views. The view I'm querying is indeed a SECURE VIEW - does that affect how the optimizer handles my query? Could that be the reason the TableScan is done?
It looks like you're running the query on a view. The metadata you're referring to is used when you run a simple MIN/MAX etc. directly on the table; however, if you have logic inside your view which requires filtering or joining of data, then Snowflake cannot return results based on the metadata alone.
So yes, you are correct when you say the following because your view is probably doing something other than a simple MAX on the table:
...Could it be because I'm querying through a view, instead of querying a table directly?
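A small illustration of the point (the table and view names are invented, and this is the general pattern rather than a guarantee for any particular query):
-- Against the bare table, a plain MIN/MAX can typically be answered from
-- micro-partition metadata alone.
SELECT MAX(event_ts) FROM raw_events;

-- A view that filters or joins has to have its logic evaluated first, so the
-- same aggregate through the view can fall back to scanning partitions and
-- show a TableScan in the query profile.
CREATE VIEW recent_events AS
    SELECT * FROM raw_events WHERE tenant_id = 42;

SELECT MAX(event_ts) FROM recent_events;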
I'll be describing the business case first. If you just want the question, please skip a couple of paragraphs ahead...
I'm synchronizing data on a .NET mobile client from an ASP.NET Web API server over the Web. Due to the "mobile nature" of the client, I'd like the process to be as efficient as possible, so I'd like to implement incremental synchronization, meaning that the client asks for new entries from a specified date, which will usually be the last sync date.
I'm dealing with entry deletions separately, so for the sake of simplicity, let's focus on new and modified entries.
The table being synchronized is too large to fit in a single response, so paging is implemented.
Each entry in the table has a unique ID column and a LastUpdated column. On the server, I'm using the following code to respond with the requested page:
var set = Model.Set<T>().Where(t => String.Compare(t.LastUpdated, fromDate, StringComparison.Ordinal) >= 0).OrderBy(t => t.Id);
var queryResultPage = set.Skip(pageSize * pageNumber).Take(pageSize);
return queryResultPage.ToList();
Model.Set is the DbSet from which data is retrieved. Please ignore the fact that I must use strings to represent dates...
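To make the question concrete, the LINQ above translates to roughly this shape of T-SQL (just a sketch; the exact query EF generates will differ, and older providers page with ROW_NUMBER() instead of OFFSET/FETCH):
SELECT *
FROM Entries                      -- hypothetical table behind Model.Set<T>()
WHERE LastUpdated >= @fromDate    -- string comparison, as noted above
ORDER BY Id
OFFSET @pageSize * @pageNumber ROWS
FETCH NEXT @pageSize ROWS ONLY;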
My question is, what SQL Server table index(es) would produce optimal performance for this case?
Pleun is exactly right. I did a demo of this for a client recently on the CRM 2011 platform. I showed them a case where a page view was taking ~30 seconds to load after sorting through 2.2M records plus an additional 4.5M records.
Using SQL Profiler, you can find the query being run.
Put it into SQL Management Studio (clean it up as necessary to make it standard SQL)
Then look at the execution plan and check the indexes it suggests (especially the ones it says are missing).
Anyway, in my demo to my client, after we finished with this, the query dropped down to less than a second; and the page loaded in about 4 seconds (which is still pitiful).
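If you prefer not to click through graphical plans, the missing-index DMVs surface the same kind of suggestions; this is a generic query (not from the demo above), and the suggestions still need human judgment before you create anything:
SELECT TOP 10
    mid.statement AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,
    migs.user_seeks,
    migs.avg_user_impact
FROM sys.dm_db_missing_index_details AS mid
JOIN sys.dm_db_missing_index_groups AS mig
    ON mig.index_handle = mid.index_handle
JOIN sys.dm_db_missing_index_group_stats AS migs
    ON migs.group_handle = mig.index_group_handle
ORDER BY migs.user_seeks * migs.avg_user_impact DESC;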
I have searched but couldn't find a solution that works for me.
I just wonder how Facebook or LinkedIn manage to handle the same type of activity with one sentence.
I mean, if you store every activity with a different ID in an Activity table, how can you list them as "Member_a and 15 more people changed their photos"?
I'm trying to build a social activity wall for my website. It's not that big, but I just want to understand the logic behind this situation.
For example, when the first page loads, I make an Ajax call and list records 0-10, and if the user scrolls down, the page makes another Ajax call which lists records 11-20.
Now, if I try to combine activities of the same type after the SQL SELECT query using if/else, and these 10 records are all the same type, the user will only see 1 item. I hope I could explain what I want to say :)
So I need a solution that does this combining in the SQL statement itself.
I'm not asking from you to write a query for me, I just want to know the logic.
Here is a screenshot of what I want to achieve:
You see, they are actually different stored records, but they are combined and shown as a single network update.
By the way, I'm using C# and SQL Server 2008.
for example:
SELECT MIN(b.MemberName) AS MemberName, COUNT(*) AS Total
FROM Network_Feed a
JOIN Member b ON a.MemberID = b.MemberID
WHERE a.FeedType = 1
did I understand your question right?
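If you need one row per activity type rather than a single total, the same idea extends with a GROUP BY. This is only a sketch against the table names above; a real feed would probably also group by a time window or by the target object:
SELECT a.FeedType,
       MIN(b.MemberName) AS FirstMemberName,  -- one name to display
       COUNT(*) - 1      AS OtherPeopleCount  -- the "and N more people" number
FROM Network_Feed a
JOIN Member b ON a.MemberID = b.MemberID
GROUP BY a.FeedType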
It's not easy to manage petabytes of data as a single table, so big projects running on SQL Server use advanced scaling tricks (distributing data and load) like Service Broker and Replication.
You can check
http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=4000004532 as an SQL Server example.
I am using the DataImportHandler for indexing data in Solr. I used full-import to index all the data in my database, which is around 10,000 products. Now I am confused about the delta-import usage: does it index the new data added to the database on an interval basis (the roughly 10 new rows added to my table), or does it just update the changes in the already-indexed data?
Can anyone please explain it to me with a simple example as soon as you can?
The DataImportHandler can be a little daunting. Your initial query has loaded 10,000 unique products. This is what gets loaded when you specify /dataimport?command=full-import.
When this import is done, the DIH stores a variable (${dataimporter.last_index_time}) holding the date/time at which you last ran this import.
In order to do an update, you specify a deltaQuery. The deltaQuery is meant to identify the records that have changed in your database since the last update. So you specify a query like this:
SELECT product_id
FROM sometable
WHERE [date_update] >= '${dataimporter.last_index_time}'
This will retrieve all the product_ids from your database that have been updated since your last import. The next query you need to specify, the deltaImportQuery, is the one that retrieves the full record for each product_id found in the previous step.
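A deltaImportQuery for the example above could look roughly like this (the extra columns are placeholders, and ${dataimporter.delta.product_id} assumes product_id is the column returned by the deltaQuery):
SELECT product_id, name, price, date_update
FROM sometable
WHERE product_id = '${dataimporter.delta.product_id}'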
Assuming product_id is your unique key, Solr will figure out that it needs to update an existing record, or add a new one if that product_id doesn't exist yet.
In order to execute the deltaQuery and the deltaImportQuery you use /dataimport?command=delta-import
This is a great simplification of all the possibilities, check the Solr wiki on DataImportHandler, it is a VERY powerful tool!
On another note:
When you run delta imports within a small time window (like a couple of times in a few seconds) and the database server is on a different machine than the Solr index service, make sure that the system time of both machines matches, since the timestamp of [date_update] is generated on the database server while dataimporter.last_index_time is generated on the Solr side.
Otherwise you will end up updating the index too little or too much, depending on the time difference.
I agree that the Data Import Handler can handle this situation. One important limitation of the DIH is that it does not queue requests. The result is that if the DIH is "busy" indexing, it will ignore all further DIH requests until it is "idle" again. The skipped DIH requests are lost and not executed.