WPF: find all regex matches in an XPS document

I need to search for an expression inside an XPS document and list all matches, with the page number of each match.
I searched Google but found no reference or sample that addresses this.
So: how can I search an XPS document and get this information?

The first thing to note is that an XPS file is an Open Packaging package. It can be opened and the contents accessed via the System.IO.Packaging.Package class. This makes any operations on the contents much easier.
Here's an example of how to search the page content with a given regex, while also tracking which page the match occurs on.
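using System;
using System.IO;
using System.IO.Packaging;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

var regex = new Regex(@"th\w+", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Multiline);
// Open read-only so the file is neither created nor modified.
using (var xps = Package.Open(@"C:\path\to\regex.oxps", FileMode.Open, FileAccess.Read))
{
    // Each fixed page is a separate XML part inside the package.
    var pages = xps.GetParts()
        .Where(p => p.ContentType == "application/vnd.ms-package.xps-fixedpage+xml")
        .ToList();
    // Assumes GetParts() yields the pages in document order; verify this for documents with many pages.
    for (var i = 0; i < pages.Count; i++)
    {
        var page = pages[i];
        using (var reader = new StreamReader(page.GetStream(FileMode.Open, FileAccess.Read)))
        {
            // The regex runs over the raw page XML, so it can also match markup.
            var s = reader.ReadToEnd();
            var matches = regex.Matches(s);
            if (matches.Count > 0)
            {
                var matchText = matches
                    .Cast<Match>()
                    .Aggregate(new StringBuilder(), (agg, m) => agg.AppendFormat("{0} ", m.Value));
                Console.WriteLine("Found matches on page {0}: {1}", i + 1, matchText);
            }
        }
    }
}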

It is not going to be as simple as you might have thought. XPS files are compressed (zipped) files with a somewhat complex folder structure that holds all the text, fonts, graphics and other items. You can use compression tools such as 7-Zip or WinZip to extract the entire folder structure from an XPS file.
Having said that, you can use the following sequence of steps to do what you want:
1. Extract the contents of your XPS file programmatically into a temp folder. You can use the ZipFile class for this if you're using .NET 4.5 or later.
2. The extracted folder will have the following structure:
_rels
Documents
    1
        _rels
        MetaData
        Pages
            _rels
        Resources
            Fonts
MetaData
3. Go to the Documents\1\Pages\ subfolder. Here you'll find one or more .fpage files, one for each page of your document. These files are in XML format and contain all the text of the page in a structured manner.
4. Use a simple loop to iterate through all the .fpage files, open each one with an XML reader such as XDocument or XmlDocument, and search the node values for the required text using Regex.IsMatch(). If a match is found, add the page number to a List and move on.
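One thing to watch in both approaches is that matching against the raw .fpage XML can also hit element and attribute names, and that the pages are not necessarily enumerated in page order. Here is a minimal sketch of one way around both, written in JavaScript purely for illustration. It assumes the page text sits in the UnicodeString attributes of Glyphs elements and that page parts are named <number>.fpage; the authoritative page order is defined by the fixed document part, so treat the file-name ordering as an assumption. The helper names and the { name, xml } input shape are hypothetical, and XML entities inside attribute values are not decoded.

```javascript
// Hypothetical helper: derive a page number from a part name such as
// "/Documents/1/Pages/12.fpage". Assumes pages are named <number>.fpage.
function pageNumberFromPartName(partName) {
  var m = /\/(\d+)\.fpage$/i.exec(partName);
  return m ? parseInt(m[1], 10) : null;
}

// Collect only the text runs of a page (UnicodeString attribute values),
// so the search does not see element or attribute names.
function extractPageText(xml) {
  var parts = [];
  var attr = /UnicodeString="([^"]*)"/g;
  var m;
  while ((m = attr.exec(xml)) !== null) {
    parts.push(m[1]);
  }
  return parts.join(' ');
}

// `pages` is an array of { name, xml } objects (an assumed shape).
// Returns [{ page, matches }] sorted by numeric page number.
function findMatchesByPage(pages, pattern) {
  var flags = pattern.flags.indexOf('g') === -1 ? pattern.flags + 'g' : pattern.flags;
  return pages
    .map(function (p) {
      return { page: pageNumberFromPartName(p.name), text: extractPageText(p.xml) };
    })
    .filter(function (p) { return p.page !== null; })
    .sort(function (a, b) { return a.page - b.page; }) // numeric, so 2 sorts before 10
    .map(function (p) {
      // A fresh global regex per page avoids carrying lastIndex between pages.
      var found = p.text.match(new RegExp(pattern.source, flags)) || [];
      return { page: p.page, matches: found };
    })
    .filter(function (r) { return r.matches.length > 0; });
}
```

Sorting numerically matters once a document has ten or more pages, because a plain string sort would place 10.fpage before 2.fpage.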

Related

How to move files in Google Drive automatically with Google Script, based on partial file name?

I'm rather new to Google Apps Script. We want all files uploaded to a single folder in Google Drive to be moved automatically to other folders, based on part of their file name.
3 example files: APX PMT 05.02.2019, ALT PMT 05.03.2019, BEA PMT 05.04.2019
We want these files to be moved to their destination folders based on the first 3 letters of their file name. APX PMT 05.02.2019 gets moved to the APX folder, ALT PMT 05.03.2019 gets moved to the ALT folder, etc.
I don't have code samples, as I'm extremely new to this. "Move files automatically from one folder to another in Google Drive" is a good start for learning this, but I'm still unsure how to move a file based on only part of its file name.
Desired result: people upload files to a single destination, and the code automatically moves them to their proper folders.
Test code, version 2.0. It works as shown below if I remove the space between the two character sets (changing BEA RFT to BEARFT or BEA_RFT). Our workplace would now like the files sorted by the first 7 characters of the file name. How can I make it work when there is a space in the prefix?
function moveFiles() {
  var dfldrs=['BEA RFT', 'BEA ADJ', 'BEA PMT', 'BEA CHG'];//Seven letter prefixes
  var ofObj={BEA RFT:'ID',BEA ADJ:'ID',BEA PMT:'ID',BEA CHG:'ID'};//distribution folder ids
  var upldFldr=DriveApp.getFolderById('ID');
  var files=upldFldr.getFiles();
  while(files.hasNext()) {
    var file=files.next();
    var key=file.getName().slice(0,7);
    var index=dfldrs.indexOf(key);
    if(index>-1) {
      Drive.Files.update({"parents": [{'id': ofObj[key]}]}, file.getId());
    }
  }
}
Moving Files
Please read these instructions before running the script:
You need to provide the three-letter prefixes.
You need to provide the distribution folder ids associated with each prefix.
You need to provide the upload folder id.
You need to run this function from your upload script, or set up an alternate trigger for it as you prefer.
You need to enable the advanced Drive service (Drive API version 2).
The Code
function moveFiles() {
  var dfldrs=['APX','ALT','BEA'];//Three letter prefixes
  var ofObj={APX:'APX id',ALT:'ALT id',BEA:'BEA id'};//distribution folder ids
  var upldFldr=DriveApp.getFolderById('folderid');
  var files=upldFldr.getFiles();
  while(files.hasNext()) {
    var file=files.next();
    // The first three characters of the file name select the destination
    var key=file.getName().slice(0,3);
    var index=dfldrs.indexOf(key);
    if(index>-1) {
      // Replacing the parents list moves the file (advanced Drive service, v2)
      Drive.Files.update({"parents": [{"id": ofObj[key]}]}, file.getId());
    }
  }
}
Drive API Reference
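On the follow-up about seven-character prefixes that contain a space: in a JavaScript object literal, a key that is not a valid identifier (such as BEA RFT) has to be written as a quoted string, and a bracket lookup with the sliced file name then works unchanged. Here is a minimal sketch of that lookup in isolation, with placeholder folder ids and a hypothetical helper name (destinationFor). The Drive calls are left out so the snippet can run anywhere.

```javascript
// Folder ids keyed by seven-character prefix. Keys containing a space must be
// quoted. The ids are placeholders, not real Drive folder ids.
var folderIdByPrefix = {
  'BEA RFT': 'ID_FOR_BEA_RFT',
  'BEA ADJ': 'ID_FOR_BEA_ADJ',
  'BEA PMT': 'ID_FOR_BEA_PMT',
  'BEA CHG': 'ID_FOR_BEA_CHG'
};

// Hypothetical helper: return the destination folder id for a file name,
// or null when the prefix is not in the map.
function destinationFor(fileName, folderMap, prefixLength) {
  var key = fileName.slice(0, prefixLength);
  return Object.prototype.hasOwnProperty.call(folderMap, key) ? folderMap[key] : null;
}
```

With a map like this the separate prefix array is no longer needed, since the lookup itself says whether a prefix is known. For example, destinationFor('BEA RFT 05.04.2019', folderIdByPrefix, 7) picks the 'BEA RFT' entry, and a file with an unknown prefix yields null and can be skipped.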

Merging two PDFs into a single PDF and attaching it to an email in Apex

I have a requirement to merge two PDFs into a single PDF, save it as an attachment on a custom object in Salesforce, and then send the merged PDF via email.
Here is my code snippet, where contentPdf is one PDF and b is the content of the other PDF that needs to be merged.
PageReference pdf = new PageReference('/apex/FirstPDF?id=' + ccId);
Blob contentPdf = pdf.getContent();
PageReference cadre = new PageReference('/apex/SecondPDF?id=' + ccId);
Blob b = cadre.getContentAsPDF();
String combinedPdf = EncodingUtil.convertToHex(contentPdf) + EncodingUtil.convertToHex(b);
Blob horodatagePdf = EncodingUtil.convertFromHex(combinedPdf);
Attachment attachment = new Attachment();
attachment.Body = horodatagePdf;
attachment.Name = String.valueOf('New pdf.pdf');
attachment.ParentId = ccId;
insert attachment;
But the problem is that the merged documents do not show correctly: the final PDF saved on my machine contains only one page. I have tried using getContentAsPDF() to retrieve the content from the PageReference, but it does not work. Moreover, the page I get in the attachment is not rendered well. Is there any other way to do this quickly?
I don't think you can merge PDF documents like that. It looks crazy. You can simply join text files together, but anything more complex (JPEGs, PDFs...) has a special structure. It's quite possible that your code works, in the sense that it generates a file whose size is the sum of the individual files' sizes, but it's not a valid document, so only the first part renders OK.
Try making another page that just reuses the other 2 pages by including them (use <apex:include>). Check whether it renders close to what you're after (there might be style clashes, for example) and, if it's any good, call getContentAsPdf() on that.

Find the content size of docx, pptx, etc. files

I want to find out the size of the content inside a docx, pptx, etc. Is there any package that can be used for this? I googled and found that POI is widely used to read and write MS file types, but I was not able to find the right API for the size of the file content. I want the actual content size, not the compressed file size that can be seen in the file properties.
I finally found a way, but it throws an OOM exception if the file is too large.
OPCPackage opcPackage = OPCPackage.open(file.getAbsolutePath());
XWPFDocument doc = new XWPFDocument(opcPackage);
XWPFWordExtractor we = new XWPFWordExtractor(doc);
String paragraphs = we.getText();
System.out.println("Total Paragraphs: "+paragraphs.length() / 1024);
Please help me if there is a better way to do this.
OK, this was asked a long time ago and there has been no response to the question. I have not used OPCPackage, so my answer is not based on it.
DOCX files (and, for that matter, PPTX and XLSX files) are all zip files with a particular structure.
We could hence use the java.util.zip package, enumerate the entries of the zip file, and sum the sizes of the entries under the content folder: xl for an xlsx file and word for a docx file. A more generic method is probably to ignore the following top-level zip entries, i.e. zip entries whose names start with:
docProps
_rels
[Content_Types].xml
The total size of the remaining zip entries (do not ignore any folder within them) would tell you the size of the content.
This method is also very efficient: you only read the entry metadata of the zip file, not the compressed data itself, so obtaining the size information takes negligible time and memory. As a quick test, I was able to get the size of a 4 MB docx file in a fraction of a second.
A "good-enough" but not thoroughly tested piece of code using this approach is pasted below. Please feel free to use it as a starting point and fix bugs if you find any. It would be great if you could post back your modifications or corrections so that others can benefit.
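import java.io.File;
import java.io.IOException;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

private static void printUnzippedContentLength() throws IOException
{
    // try-with-resources closes the archive even if an exception is thrown
    try (ZipFile zf = new ZipFile(new File("/home/chaitra/verybigfile.docx")))
    {
        Enumeration<? extends ZipEntry> entries = zf.entries();
        long sumBytes = 0L;
        while (entries.hasMoreElements())
        {
            ZipEntry ze = entries.nextElement();
            String name = ze.getName();
            // Skip package metadata; everything else counts as content
            if (name.startsWith("docProps") || name.startsWith("_rels") || name.startsWith("[Content_Types].xml"))
            {
                continue;
            }
            long size = ze.getSize(); // -1 when the uncompressed size is unknown
            if (size > 0)
            {
                sumBytes += size;
            }
        }
        System.out.println("Uncompressed content has size " + (sumBytes / 1024) + " KB");
    }
}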

WPF List files from a folder

I want to get a number of songs from a folder and list their names in a WPF ListView.
I also want each item in the list view to be a draggable file that can be copied from the list to the desktop. I've achieved this for a single button, using this code:
Point mpos = e.GetPosition(null);
Vector diff = this.start - mpos;
string[] files = new String[1];
files[0] = @"C:\Song1.mp3";
DragDrop.DoDragDrop(this, new DataObject(DataFormats.FileDrop, files),
    DragDropEffects.Copy);
For that, each item in the list needs to have a file path string associated with it.
How do I:
1. Get the files from a folder and list them.
2. Associate with each one a filepath string for the dragging.
Thanks!
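You can use Directory.GetFiles() to get all the file paths in a folder and then use Path.GetFileName() (or Path.GetFileNameWithoutExtension()) on each path returned to get just the file names. Keep the full path alongside each displayed name so that it can be passed to DoDragDrop when the item is dragged.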

How to export Rich Text fields as HTML from Notes with LotusScript?

I'm working on a data migration task, where I have to export a somewhat large Lotus Notes application into a blogging platform. My first task was to export the articles from Lotus Notes into CSV files.
I created an agent in LotusScript to export the data into CSV files. I use a modified version of this IBM DeveloperWorks forum post, and it basically does the job. But the contents of the Rich Text field are stripped of any formatting. This is not what I want; I want the Rich Text field rendered as HTML.
The documentation for the GetItemValue method explicitly states that the text is rendered into plain text. So I began to research something that would retrieve the HTML. I found the NotesMIMEEntity class and some sample code in the IBM article How To Access HTML in a Rich Text Field Using LotusScript.
But for the technique described in that article to work, the Rich Text field needs to have the property "Store Contents as HTML and MIME" set. This is not the case with my Lotus Notes database. I tried to set the property on the fields in question, but it didn't do the trick.
Is it possible to use the NotesMIMEEntity and set the "Store Contents as HTML and MIME" property after the content has been added, to export the field rendered as HTML?
Or what are my options for exporting the Notes database Rich Text fields as HTML?
Bonus information: I'm using IBM Lotus Domino Designer version 8.5
There is a little-known URL command that does exactly what you want: request the field over HTTP with the OpenField command.
Example that converts only the Body field:
http://SERVER/your_database_path.nsf/NEW_VIEW/docid/Body?OpenField
Here is how I did it, using the OpenField command (see D.Bugger's post above):
Function GetHtmlFromField(doc As NotesDocument, fieldname As String) As String
    Dim obj As Variant
    Set obj = CreateObject("Microsoft.XMLHTTP")
    ' Synchronous GET; the two empty strings are the user name and password
    obj.open "GET", "http://www.mydomain.dk/database.nsf/0/" + doc.UniversalID + "/" + fieldname + "?openfield&charset=utf-8", False, "", ""
    obj.send("")
    Dim html As String
    html = Trim$(obj.responseText)
    GetHtmlFromField = html
End Function
I'd suggest looking at Midas' Rich Text LSX (http://www.geniisoft.com/showcase.nsf/MidasLSX)
I haven't used it personally, but I remember it from years ago as the best option for working with Rich Text. I'd bet it saves you a lot of headaches.
As for the NotesMIMEEntity class, I don't believe there is a way to convert RichText to MIME, only MIME to RichText (or retain the MIME within the document for emailing purposes).
If you upgrade to Notes Domino 8.5.1 then you can use the new ConvertToMIME method of the NotesDocument class. See the docs. This should do what you want.
Alternatively, the easiest way to get the Domino server to render the RichText is to retrieve it via a URL call. Set up a simple form that has just the RichText field, and then use your favourite HTTP API to pull in the page. It should then be pretty straightforward to pull out the body.
Keep it simple.
1. Change the BODY field to "Store contents as HTML and MIME".
2. Open the doc in edit mode.
3. Save.
4. Close.
You can now use the NotesMIMEEntity to get what you need from script.
You can use the NotesDXLExporter class to export the Rich Text and use an XSLT to transform the output to what you need.
I know you mentioned using LotusScript, but if you don't mind writing a small Java agent (in the Notes client), this can be done fairly easily - and there is no need to modify the existing form design.
The basic idea is to have your Java code open a particular document through a localhost http request (which is simple in Java) and to have your code capture that html output and save it back to that document. You basically allow the Domino rendering engine to do the heavy lifting.
You would want to do this:
1. Create a form that contains only the rich-text field you want to convert, with a Content Type of HTML.
2. Create a view with a selection formula for all of the documents you want to convert, and with a form formula that computes to the new form.
3. Create the Java agent, which just walks your view and, for each document, gets its docid, opens a URL of the form http://SERVER/your_database_path.nsf/NEW_VIEW/docid?openDocument, grabs the HTTP response and saves it.
I put up some sample code in a similar SO post here:
How to convert text and rich text fields in a document to html using lotusscript?
Works in Domino 10 (I have not tested with 9):
HTMLStrings$ = NotesRichTextItem.ConvertToHTML([options]) As String
See the documentation:
https://help.hcltechsw.com/dom_designer/10.0.1/basic/H_CONVERTOHTML_METHOD_NOTESRICHTEXTITEM.html
UPDATE (2022)
HCL has not supported this method since version 11, and the documentation no longer includes any information about it.
In my tests it still works in v12, but HCL recommended not using it.
Casper's recommendation above works well, but make sure the ACL allows Anonymous access; otherwise the HTML you get back will be the HTML of your login form.
If you do not need to get the Richtext from the items specifically, you can use ?OpenDocument, which is documented (at least) here: https://www.ibm.com/developerworks/lotus/library/ls-Domino_URL_cheat_sheet/
https://www.ibm.com/support/knowledgecenter/SSVRGU_9.0.1/com.ibm.designer.domino.main.doc/H_ABOUT_URL_COMMANDS_FOR_OPENING_DOCUMENTS_BY_KEY.html
OpenDocument also allows you to expand sections (I am unsure if OpenField does)
Syntax is:
http://Host/Database/View/DocumentUniversalID?OpenDocument
But be sure to include the charset parameter as well - Japanese documents were unreadable without specifying utf-8 as the charset.
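The URL shapes discussed in this thread can be put together as follows. This is a minimal sketch in JavaScript under stated assumptions: the helper name buildDominoUrl is hypothetical, "0" is used as the view placeholder as in the other snippets here, and the host, database path and document id are whatever your environment provides. It only builds the string and does not perform the HTTP request.

```javascript
// Hypothetical helper: build an OpenField URL when fieldName is given,
// otherwise an OpenDocument URL. The charset parameter is always appended.
function buildDominoUrl(host, dbPath, unid, fieldName) {
  // Domino file paths may use backslashes; URLs need forward slashes.
  var path = dbPath
    .replace(/\\/g, '/')
    .split('/')
    .map(encodeURIComponent) // encode each segment, keep the separators
    .join('/');
  var url = 'http://' + host + '/' + path + '/0/' + unid.toUpperCase();
  if (fieldName) {
    url += '/' + encodeURIComponent(fieldName) + '?OpenField';
  } else {
    url += '?OpenDocument';
  }
  return url + '&charset=utf-8';
}
```

Encoding each path segment separately keeps a database path such as "apps\my db.nsf" usable without touching the slashes between folders.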
Here is the method I use that takes a NotesDocument and returns the HTML for the doc as a string.
using System;
using System.IO;
using System.Net;

private string ConvertDocumentToHml(Domino.NotesDocument doc, string sectionList = null)
{
    var server = doc.ParentDatabase.Server.Split('/')[0];
    var dbPath = doc.ParentDatabase.FilePath;
    string viewName = "0";
    string documentId = doc.UniversalID.ToUpper();
    var ub = new UriBuilder();
    ub.Host = server;
    ub.Path = dbPath.Replace("\\", "/") + "/" + viewName + "/" + documentId;
    if (string.IsNullOrEmpty(sectionList))
    {
        ub.Query = "OpenDocument&charset=utf-8";
    }
    else
    {
        ub.Query = "OpenDocument&charset=utf-8&ExpandSection=" + sectionList;
    }
    var url = ub.ToString();
    var req = WebRequest.CreateHttp(url);
    try
    {
        // Dispose the response and reader as soon as the body has been read
        using (var resp = req.GetResponse())
        using (var sr = new StreamReader(resp.GetResponseStream()))
        {
            return sr.ReadToEnd();
        }
    }
    catch (WebException)
    {
        // Request failed (for example, server unreachable): return an empty string
        return "";
    }
}
