How to remove unnecessary parsing-related info from Tika parsing output

I am parsing a docx file with Apache Tika. Parsing is working fine except that it also prints some unnecessary text at the beginning, like below:
[Content_Types] .xml _rels / .rels word / _rels / document.xml.rels
word / document.xml
and at the end like below:
word / theme / theme1.xml word / settings.xml word / fontTable.xml
word / webSettings.xml docProps / app.xml Normal 13 3 460 2627
Microsoft Office Word 0 21 6 false XXXX XXXX false 3081 false false
12.0000 docProps / core. xml XXX XXXX 1 2016- 12-16T14: 57: 00Z 2016-12-16T15: 10: 00Z word / styles.xml
The code is:
public static String extractString(File file)
{
    BodyContentHandler handler = new BodyContentHandler();
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = new FileInputStream(file))
    {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
    catch (IOException | SAXException | TikaException e)
    {
        e.printStackTrace();
        return null;
    }
}
How can I remove this unnecessary text from the beginning and end of the output?
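Output like this usually means the .docx was handled as a generic zip archive (each archive entry name echoed into the text) rather than by Tika's OOXML parser. One thing worth trying, assuming the full tika-parsers module is on the classpath, is to force the OOXML parser for .docx input; a minimal sketch (the wrapper class name is illustrative):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;

public class DocxTextExtractor
{
    public static String extractString(File file)
    {
        // OOXMLParser understands the .docx container itself, so entry names
        // like [Content_Types].xml and word/document.xml are consumed as
        // structure instead of being emitted as text.
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Parser parser = new OOXMLParser();
        Metadata metadata = new Metadata();
        try (InputStream stream = new FileInputStream(file))
        {
            parser.parse(stream, handler, metadata, new ParseContext());
            return handler.toString();
        }
        catch (Exception e)
        {
            e.printStackTrace();
            return null;
        }
    }
}

As a side note, BodyContentHandler(-1) disables the default 100,000-character write limit, which is a separate, common source of truncated output.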

Related

My H2/C3P0/Hibernate setup does not seem to be preserving prepared statements?

I am finding my database is the bottleneck in my application, and as part of this it looks like prepared statements are not being reused.
For example, here is a method I use:
public static CoverImage findCoverImageBySource(Session session, String src)
{
    try
    {
        Query q = session.createQuery("from CoverImage t1 where t1.source=:source");
        q.setParameter("source", src, StandardBasicTypes.STRING);
        CoverImage result = (CoverImage) q.setMaxResults(1).uniqueResult();
        return result;
    }
    catch (Exception ex)
    {
        MainWindow.logger.log(Level.SEVERE, ex.getMessage(), ex);
    }
    return null;
}
But the YourKit profiler reports:
com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeQuery() Count 511
com.mchange.v2.c3p0.impl.NewProxyConnection.prepareStatement() Count 511
and I assume that the count for the prepareStatement() calls should be lower, as it looks like we create a new prepared statement every time instead of reusing them.
https://docs.oracle.com/javase/7/docs/api/java/sql/Connection.html
I am using c3p0 connection pooling, which complicates things a little, but as I understand it I have it configured correctly:
public static Configuration getInitializedConfiguration()
{
    // See https://www.mchange.com/projects/c3p0/#hibernate-specific
    Configuration config = new Configuration();
    config.setProperty(Environment.DRIVER, "org.h2.Driver");
    config.setProperty(Environment.URL, "jdbc:h2:" + Db.DBFOLDER + "/" + Db.DBNAME + ";FILE_LOCK=SOCKET;MVCC=TRUE;DB_CLOSE_ON_EXIT=FALSE;CACHE_SIZE=50000");
    config.setProperty(Environment.DIALECT, "org.hibernate.dialect.H2Dialect");
    System.setProperty("h2.bindAddress", InetAddress.getLoopbackAddress().getHostAddress());
    config.setProperty("hibernate.connection.username", "jaikoz");
    config.setProperty("hibernate.connection.password", "jaikoz");
    config.setProperty("hibernate.c3p0.numHelperThreads", "10");
    config.setProperty("hibernate.c3p0.min_size", "1");
    // Consider that if we have lots of busy threads waiting on next stages we could possibly have a lot of active
    // connections.
    config.setProperty("hibernate.c3p0.max_size", "200");
    config.setProperty("hibernate.c3p0.max_statements", "5000");
    config.setProperty("hibernate.c3p0.timeout", "2000");
    config.setProperty("hibernate.c3p0.maxStatementsPerConnection", "50");
    config.setProperty("hibernate.c3p0.idle_test_period", "3000");
    config.setProperty("hibernate.c3p0.acquireRetryAttempts", "10");
    // Cancel any connection that is more than 30 minutes old.
    //config.setProperty("hibernate.c3p0.unreturnedConnectionTimeout","3000");
    //config.setProperty("hibernate.show_sql","true");
    //config.setProperty("org.hibernate.envers.audit_strategy", "org.hibernate.envers.strategy.ValidityAuditStrategy");
    //config.setProperty("hibernate.format_sql","true");
    config.setProperty("hibernate.generate_statistics", "true");
    //config.setProperty("hibernate.cache.region.factory_class", "org.hibernate.cache.ehcache.SingletonEhCacheRegionFactory");
    //config.setProperty("hibernate.cache.use_second_level_cache", "true");
    //config.setProperty("hibernate.cache.use_query_cache", "true");
    addEntitiesToConfig(config);
    return config;
}
Using H2 1.3.172, Hibernate 4.3.11 and the corresponding c3p0 for that Hibernate version.
With a reproducible test case we have:
HibernateStats
HibernateStatistics.getQueryExecutionCount() 28
HibernateStatistics.getEntityInsertCount() 119
HibernateStatistics.getEntityUpdateCount() 39
HibernateStatistics.getPrepareStatementCount() 189
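For reference, these counters come from Hibernate's Statistics API; a sketch of how they can be dumped, assuming sessionFactory is the application's SessionFactory and hibernate.generate_statistics is enabled as in the config above:

import org.hibernate.stat.Statistics;

// Hibernate-level view of statement activity
Statistics stats = sessionFactory.getStatistics();
System.out.println("queries:  " + stats.getQueryExecutionCount());
System.out.println("inserts:  " + stats.getEntityInsertCount());
System.out.println("updates:  " + stats.getEntityUpdateCount());
System.out.println("prepares: " + stats.getPrepareStatementCount());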
Profiler, method counts
GooGooStatementCache.acquireStatement() 35
GooGooStatementCache.checkinStatement() 189
GooGooStatementCache.checkoutStatement() 189
NewProxyPreparedStatement.init() 189
I don't know what I should be counting as creation of a prepared statement rather than reuse of an existing prepared statement.
I also tried enabling c3p0 logging by adding a c3p0 logger and making it use the same log file in my LogProperties, but it had no effect.
String logFileName = Platform.getPlatformLogFolderInLogfileFormat() + "songkong_debug%u-%g.log";
FileHandler fe = new FileHandler(logFileName, LOG_SIZE_IN_BYTES, 10, true);
fe.setEncoding(StandardCharsets.UTF_8.name());
fe.setFormatter(new com.jthink.songkong.logging.LogFormatter());
fe.setLevel(Level.FINEST);
MainWindow.logger.addHandler(fe);
Logger c3p0Logger = Logger.getLogger("com.mchange.v2.c3p0");
c3p0Logger.setLevel(Level.FINEST);
c3p0Logger.addHandler(fe);
Now that I have eventually got c3p0-based logging working, I can confirm the suggestion of @SteveWaldman is correct.
If you enable
public static Logger c3p0ConnectionLogger = Logger.getLogger("com.mchange.v2.c3p0.stmt");
c3p0ConnectionLogger.setLevel(Level.FINEST);
c3p0ConnectionLogger.setUseParentHandlers(false);
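Note that with setUseParentHandlers(false) nothing is written until the logger has a handler of its own; a one-line sketch, assuming the FINEST-level FileHandler fe from the earlier snippet is in scope:

c3p0ConnectionLogger.addHandler(fe); // reuse the FileHandler configured above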
Then you get log output of the form
24/08/2019 10.20.12:BST:FINEST: com.mchange.v2.c3p0.stmt.DoubleMaxStatementCache ----> CACHE HIT
24/08/2019 10.20.12:BST:FINEST: checkoutStatement: com.mchange.v2.c3p0.stmt.DoubleMaxStatementCache stats -- total size: 347; checked out: 1; num connections: 13; num keys: 347
24/08/2019 10.20.12:BST:FINEST: checkinStatement(): com.mchange.v2.c3p0.stmt.DoubleMaxStatementCache stats -- total size: 347; checked out: 0; num connections: 13; num keys: 347
making it clear when you get a cache hit. When there is no cache hit you don't get the first line, but you do get the other two lines.
This is using c3p0 0.9.2.1.

Python3 - IndexError when trying to save a text file

I'm trying to follow this tutorial with my own local data files:
CNTK tutorial
I have the following function to save my data array into a txt file feedable to CNTK:
# Save the data files into a format compatible with CNTK text reader
def savetxt(filename, ndarray):
    dir = os.path.dirname(filename)

    if not os.path.exists(dir):
        os.makedirs(dir)

    if not os.path.isfile(filename):
        print("Saving", filename)
        with open(filename, 'w') as f:
            labels = list(map(' '.join, np.eye(11, dtype=np.uint).astype(str)))
            for row in ndarray:
                row_str = row.astype(str)
                label_str = labels[row[-1]]
                feature_str = ' '.join(row_str[:-1])
                f.write('|labels {} |features {}\n'.format(label_str, feature_str))
    else:
        print("File already exists", filename)
I have two ndarrays of the following shapes that I want to feed to the model:
train.shape
(1976L, 15104L)
test.shape
(1976L, 15104L)
Then I try to invoke the function like this:
# Save the train and test files (prefer our default path for the data)
data_dir = os.path.join("C:/Users", 'myself', "OneDrive", "IA Project", 'data', 'train')
if not os.path.exists(data_dir):
    data_dir = os.path.join("data", "IA Project")

print('Writing train text file...')
savetxt(os.path.join(data_dir, "Train-128x118_cntk_text.txt"), train)

print('Writing test text file...')
savetxt(os.path.join(data_dir, "Test-128x118_cntk_text.txt"), test)

print('Done')
and then I get the following error:
Writing train text file...
Saving C:/Users\A702628\OneDrive - Atos\Microsoft Capstone IA\Capstone data\train\Train-128x118_cntk_text.txt
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-24-b53d3c69b8d2> in <module>()
6
7 print ('Writing train text file...')
----> 8 savetxt(os.path.join(data_dir, "Train-128x118_cntk_text.txt"), train)
9
10 print ('Writing test text file...')
<ipython-input-23-610c077db694> in savetxt(filename, ndarray)
12 for row in ndarray:
13 row_str = row.astype(str)
---> 14 label_str = labels[row[-1]]
15 feature_str = ' '.join(row_str[:-1])
16 f.write('|labels {} |features {}\n'.format(label_str, feature_str))
IndexError: list index out of range
Can somebody please tell me what's going wrong with this part of the code, and how I could fix it? Thank you very much in advance.
Since you're using your own input data: are the labels in the range 0 to 10? The labels list built from np.eye(11, ...) only has 11 entries (indices 0 through 10), so any label value outside that range will cause exactly this out-of-range lookup.

Handling Hebrew files and folders with Python 3.4

I used Python 3.4 to create a program that goes through e-mails and saves specific attachments to a file server.
Each file is saved to a specific destination depending on the sender's e-mail address.
My problem is that the destination folders and the attachments are both in Hebrew, and for a few attachments I get an error that the path does not exist.
Now, that should not be possible, because it can fail for one attachment but not for the others in the same mail (the destination folder is decided by the sender's address).
I want to debug the issue, but I cannot get Python to display the file path it is trying to save correctly (it's mixed Hebrew and English and it always displays the path in a big mess, although it works correctly 95% of the time when the file is being saved to the file server).
So my questions are:
What should I add to this code so that it will process Hebrew correctly?
Should I encode or decode something?
Are there characters I should avoid when processing the files?
Here's the main piece of code that fails:
try:
    found_attachments = False
    for att in msg.Attachments:
        _, extension = split_filename(str(att))
        # check if attachment is not inline
        if str(att) not in msg.HTMLBody:
            if extension in database[sender][TYPES]:
                file = create_file(str(att), database[sender][PATH], database[sender][FORMAT], time_stamp)
                # This is where the program fails:
                att.SaveAsFile(file)
                print("Created:", file)
                found_attachments = True
    if found_attachments:
        items_processed.append(msg)
    else:
        items_no_att.append(msg)
except:
    print("Error with attachment: " + str(att) + " , in: " + str(msg))
And the create_file function:
def create_file(att, location, format, timestamp):
    """
    process an attachment to make it a file
    :param att: the name of the attachment
    :param location: the path to the file
    :param format: the format of the file
    :param timestamp: the time and date the attachment was created
    :return: return the file created
    """
    # create the file by the given format
    if format == "":
        output_file = location + "\\" + att
    else:
        # split file to name and type
        filename, extension = split_filename(att)
        # extract and format the time sent on
        time = str(timestamp.time()).replace(":", ".")[:-3]
        # extract and format the date sent on
        day = str(timestamp.date())
        day = day[-2:] + day[4:-2] + day[:4]
        # initiate the output file
        output_file = format
        # add the original file name where needed
        output_file = output_file.replace(FILENAME, filename)
        # add the sent date where needed
        output_file = output_file.replace(DATE, day)
        # add the time sent where needed
        output_file = output_file.replace(TIME, time)
        # add the path and type
        output_file = location + "\\" + output_file + "." + extension
    print(output_file)
    # add an index to the file if necessary and return it
    index = get_file_index(output_file)
    if index:
        filename, extension = split_filename(output_file)
        return filename + "(" + str(index) + ")." + extension
    else:
        return output_file
Thanks in advance, I would be happy to explain more or supply more code if needed.
I found out that the problem was not the Hebrew at all: there is a limit on the number of characters that the path plus filename can hold (255 characters).
The files that failed exceeded that limit, and that caused the problem.

Parse log file with a relation between lines

I have a long log file whose contents look like:
2015-06-13 20:58:32,278 60157353 [Thread-1] DEBUG ccc - start PROC, will wait 30
2015-06-13 20:58:32,302 60157377 [Thread-1] DEBUG ccc - stoping PROC 0
2015-06-13 20:58:42,339 60167414 [Thread-1] DEBUG ccc - start PROC, will wait 30
2015-06-13 20:58:42,363 60167438 [Thread-1] DEBUG ccc - stoping PROC 0
2015-06-13 20:58:52,378 60177453 [Thread-1] DEBUG ccc - start PROC, will wait 30
2015-06-13 20:58:52,404 60177479 [Thread-1] DEBUG ccc - stoping PROC 0
2015-06-13 20:58:52,430 60177506 [Thread-1] DEBUG ccc - start PROC, will wait 30
I need to check that the time between start PROC and stoping PROC is not longer than 30 seconds.
Is it somehow possible to do this with any log parser software?
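As a quick alternative to a dedicated tool: the second column of each line is Log4j's %r (milliseconds since startup), so the check can also be scripted directly. A minimal sketch, assuming the exact line format shown above (the class name and file-argument handling are illustrative):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProcTimeCheck {
    // date time relative-ms [thread] level category - message
    private static final Pattern LINE = Pattern
            .compile("^\\S+ \\S+ (\\d+) \\[[^\\]]*\\] \\S+ \\S+ - (.*)$");

    public static void main(String[] args) throws IOException {
        long startTs = -1;
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = LINE.matcher(line);
                if (!m.matches()) {
                    continue;
                }
                long ts = Long.parseLong(m.group(1));
                String msg = m.group(2);
                if (msg.startsWith("start PROC")) {
                    startTs = ts;
                } else if (msg.startsWith("stoping PROC") && startTs >= 0) {
                    long delta = ts - startTs;
                    if (delta > 30_000) {
                        System.out.println("Too long (" + delta + " ms): " + line);
                    }
                    startTs = -1;
                }
            }
        }
    }
}

Like the LogMX parser below, this assumes start/stop pairs appear strictly in order, as in the sample.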
Using a LogMX Parser, you can mark each start/stop pair as "Too long" (if there is more than 30s between start PROC and stoping PROC).
In the following Parser example, when the elapsed time is greater than 30s:
The user-defined log entry field named "TooLong" is set to "x" (else, it is empty) => you can easily filter/sort/search using this field
The stoping PROC entry is marked as ERROR so that it appears in red => you can quickly spot it
Of course, you can adjust this code according to your needs.
To use this parser:
Copy the following code in a new file <LogMX_dir>/parsers/src/sample/parser/VicoParser.java
Compile it using Eclipse, IntelliJ IDEA, Maven, Gradle, or Ant using files in <LogMX_dir>/parsers (see LogMX documentation)
Add this Parser in LogMX using menu "Tools" > "Options" > "Parsers" > green "+" button > "Java class Parser" tab > choose <LogMX_dir>/parsers/classes/sample.parser/VicoParser
VicoParser.java:
package sample.parser;

import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.lightysoft.logmx.business.ParsedEntry;
import com.lightysoft.logmx.mgr.LogFileParser;

/**
 * Sample LogMX Parser able to parse a log file with multi-line support, Absolute/Relative Date support,
 * and detection of too-long elapsed time between two specific entries.<BR/>
 *
 * Log4j Pattern for this log format is:
 * %d %-4r [%t] %-5p %c %x - %m%n
 *
 * Here is an example of log file suitable for this parser:<BR/>
 * 2015-06-13 20:58:32,278 60157353 [Thread-1] DEBUG ccc - start PROC, will wait 30
 * 2015-06-13 20:58:32,302 60157377 [Thread-1] DEBUG ccc - stoping PROC 0
 * 2015-06-13 20:58:42,339 60167414 [Thread-1] DEBUG ccc - start PROC, will wait 30
 * 2015-06-13 20:58:42,363 60167438 [Thread-1] DEBUG ccc - stoping PROC 0
 * 2015-06-13 20:58:52,378 60177453 [Thread-1] DEBUG ccc - start PROC, will wait 30
 * 2015-06-13 20:58:52,404 60177479 [Thread-1] DEBUG ccc - stoping PROC 0
 * 2015-06-13 20:58:52,430 60177506 [Thread-1] DEBUG ccc - start PROC, will wait 30
 */
public class VicoParser extends LogFileParser {
    /** Current parsed log entry */
    private ParsedEntry entry = null;

    /** Entry date format (this is Log4j default ISO-8601) */
    private static SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");

    /** Mutex to avoid that multiple threads use the same Date formatter at the same time */
    private final Object DATE_FORMATTER_MUTEX = new Object();

    /** Pattern for entry begin */
    private final static Pattern ENTRY_BEGIN_PATTERN = Pattern.compile(
        // %d
        "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d+?)\\s+?"
            // %-4r [%t] %-5p
            + "(\\d+?)\\s+?\\[(.*?)\\]\\s+?(.*?)\\s+?"
            // %c %x - %m
            + "(.*?) (.*?) - (.*)$");

    /** Buffer for Entry message (improves performance for multi-lines entries) */
    private StringBuilder entryMsgBuffer = null;

    ///////////// Elapsed-Time computation ////////////

    /** Log entry message used for T0 (elapsed time calculation) */
    private static final String LOG_MESSAGE_T0 = "start PROC";

    /** Log entry message used for T1 (elapsed time calculation) */
    private static final String LOG_MESSAGE_T1 = "stoping PROC";

    /** Last encountered T0 entry */
    private ParsedEntry prevT0Entry = null;

    /** Max allowed time between entries, before raising "TooLong" flag */
    private static final long MAXIMUM_DELTA_T = 30000L; // 30s (30,000 ms)

    /////////////////////////////////////////////////////

    /** Key of user-defined field "Timestamp" (internal, not displayed) */
    private static final String EXTRA_FIELD_KEY__TIMESTAMP = "Timestamp";

    /** Key of user-defined field "NDC" */
    private static final String EXTRA_FIELD_KEY__NDC = "NDC";

    /** Key of user-defined field "TooLong" */
    private static final String EXTRA_FIELD_KEY__TOOLONG = "TooLong";

    /** User-defined fields names */
    private static final List<String> EXTRA_FIELDS_KEYS = Arrays.asList(EXTRA_FIELD_KEY__NDC,
        EXTRA_FIELD_KEY__TOOLONG);

    /**
     * Returns the name of this parser
     * @see com.lightysoft.logmx.mgr.LogFileParser#getParserName()
     */
    @Override
    public String getParserName() {
        return "Vico Parser";
    }

    /**
     * Returns the supported file type for this parser
     * @see com.lightysoft.logmx.mgr.LogFileParser#getSupportedFileType()
     */
    @Override
    public String getSupportedFileType() {
        return "Vico log files";
    }

    /**
     * Process the new line of text read from file
     * @see com.lightysoft.logmx.mgr.LogFileParser#parseLine(java.lang.String)
     */
    @Override
    protected void parseLine(String line) throws Exception {
        // If end of file, records last entry if necessary, and exits
        if (line == null) {
            recordPreviousEntryIfExists();
            return;
        }

        Matcher matcher = ENTRY_BEGIN_PATTERN.matcher(line);
        if (matcher.matches()) {
            // Record previous found entry if exists, then create a new one
            prepareNewEntry();

            entry.setDate(matcher.group(1));
            entry.setThread(matcher.group(3));
            entry.setLevel(matcher.group(4));
            entry.setEmitter(matcher.group(5));

            String logMsg = matcher.group(7);

            // Save relative timestamp (in ms), for "getRelativeEntryDate()", but also to compute elapsed
            // time between two specific log entries (faster than parsing complete absolute date)
            long timestamp = Integer.parseInt(matcher.group(2), 10);

            entryMsgBuffer.append(logMsg);
            entry.getUserDefinedFields().put(EXTRA_FIELD_KEY__NDC, matcher.group(6)); // save NDC
            entry.getUserDefinedFields().put(EXTRA_FIELD_KEY__TIMESTAMP, timestamp); // save Timestamp

            if (logMsg.startsWith(LOG_MESSAGE_T0)) {
                if (prevT0Entry != null) {
                    System.err.println("Warning: found [" + LOG_MESSAGE_T0 + "] not followed by ["
                        + LOG_MESSAGE_T1 + "]");
                }
                prevT0Entry = entry;
            } else if (logMsg.startsWith(LOG_MESSAGE_T1)) {
                if (prevT0Entry == null) {
                    System.err.println("Warning: found [" + LOG_MESSAGE_T1 + "] not preceded by ["
                        + LOG_MESSAGE_T0 + "]");
                } else {
                    long prevT0 = (Long) prevT0Entry.getUserDefinedFields().get(
                        EXTRA_FIELD_KEY__TIMESTAMP);
                    if (timestamp - prevT0 > MAXIMUM_DELTA_T) {
                        entry.getUserDefinedFields().put(EXTRA_FIELD_KEY__TOOLONG, "x"); // Flag this entry as "TooLong"
                        prevT0Entry.getUserDefinedFields().put(EXTRA_FIELD_KEY__TOOLONG, "x"); // Flag this entry as "TooLong"
                        // Change log entry Level (note: cannot change Level of T0 entry because it has been already processed by LogMX)
                        entry.setLevel("ERROR");
                    }
                    prevT0Entry = null;
                }
            }
        } else if (entry != null) {
            entryMsgBuffer.append('\n').append(line); // appends this line to previous entry's text
        }
    }

    /**
     * Returns the ordered list of user-defined fields to display (given by their key), for each entry.
     * @see com.lightysoft.logmx.mgr.LogFileParser#getUserDefinedFields()
     */
    @Override
    public List<String> getUserDefinedFields() {
        return EXTRA_FIELDS_KEYS;
    }

    /**
     * Returns a relative Date for the given entry
     * @see com.lightysoft.logmx.mgr.LogFileParser#getRelativeEntryDate(com.lightysoft.logmx.business.ParsedEntry)
     */
    @Override
    public Date getRelativeEntryDate(ParsedEntry pEntry) throws Exception {
        Long timestamp = (Long) pEntry.getUserDefinedFields().get(EXTRA_FIELD_KEY__TIMESTAMP);
        return new Date(timestamp);
    }

    /**
     * Returns the absolute Date for the given entry
     * @see com.lightysoft.logmx.mgr.LogFileParser#getAbsoluteEntryDate(com.lightysoft.logmx.business.ParsedEntry)
     */
    @Override
    public Date getAbsoluteEntryDate(ParsedEntry pEntry) throws Exception {
        synchronized (DATE_FORMATTER_MUTEX) { // Java date formatter is not thread-safe
            return dateFormat.parse(pEntry.getDate());
        }
    }

    /**
     * Send to LogMX the current parsed log entry
     * @throws Exception
     */
    private void recordPreviousEntryIfExists() throws Exception {
        if (entry != null) {
            entry.setMessage(entryMsgBuffer.toString());
            addEntry(entry);
        }
    }

    /**
     * Send to LogMX the current parsed log entry, then create a new one
     * @throws Exception
     */
    private void prepareNewEntry() throws Exception {
        recordPreviousEntryIfExists();
        entry = createNewEntry();
        entryMsgBuffer = new StringBuilder(80);
        entry.setUserDefinedFields(new HashMap<String, Object>(4));
    }
}
And here is what I get (screenshot: the parsed log in LogMX, with too-long entries shown in red and flagged in the "TooLong" column):
Note: you can sort/filter log entries using the field named "TooLong" by clicking on its column (mouse left/middle button, or menu "Filter" > "Show filtering bar")

Groovy - create file issue: The filename, directory name or volume label syntax is incorrect

I'm running a Groovy script from SoapUI, and the script needs to generate lots of files.
Those files also have in their names two numbers from a list (all the combinations in that list are different); there are 1303 combinations available, but the script generates just 1235 files.
A part of the code is:
filename = groovyUtils.projectPath + "\\" + "$file"+"_OK.txt";
targetFile = new File(filename);
targetFile.createNewFile();
where $file is actually the part of the file name which includes those two numbers from the list:
file = "abc" + "-$firstNumer"+"_$secondNumber"
For the files which are not created, the message returned is: "The filename, directory name or volume label syntax is incorrect".
I've tried putting another path:
filename = "D:\\rez\\" + "\\" + "$file"+"_OK.txt";
targetFile = new File(filename);
targetFile.createNewFile();
and also:
File parentFolder = new File("D:\\rez\\");
File targetFile = new File(parentFolder, "$file"+"_OK.txt");
targetFile.createNewFile();
(which I've found here: What are possible reasons for java.io.IOException: "The filename, directory name, or volume label syntax is incorrect")
but nothing worked.
I have no idea where the problem is. It is strange that 1235 files are created OK, and the rest of them, 68, aren't created at all.
Thanks,
My guess is that some of the files have illegal characters in their paths. Exactly which characters are illegal is platform specific, e.g. on Windows they are
\ / : * ? " < > |
Why don't you log the full path of the file before targetFile.createNewFile(); is called and also log whether this method succeeded or not, e.g.
filename = groovyUtils.projectPath + "\\" + "$file" + "_OK.txt";
targetFile = new File(filename);
println "attempting to create file: $targetFile"
if (targetFile.createNewFile()) {
    println "Successfully created file $targetFile"
} else {
    println "Failed to create file $targetFile"
}
When the process is finished, check the logs and I suspect you'll see a common pattern in the "Failed to create file..." messages.
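If it helps, the candidate names can also be screened for those characters before the script tries to create anything; a small sketch (hasIllegalChars is a hypothetical helper, and the character set is the Windows list above):

// Sketch: detect Windows-illegal characters in a bare file name
static boolean hasIllegalChars(String name) {
    for (char c : name.toCharArray()) {
        if ("\\/:*?\"<>|".indexOf(c) >= 0) {
            return true;
        }
    }
    return false;
}

Logging hasIllegalChars(file) next to each failure would confirm or rule out this theory for the 68 missing files.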
File.createNewFile() returns false when a file or directory with that name already exists. In all other failure cases (security, I/O) it throws an exception.
Evaluate createNewFile()'s return value or, additionally, use the File.exists() method:
File file = new File("foo")

// works the first time
createNewFile(file)

// prints an error message
createNewFile(file)

void createNewFile(File file) {
    if (!file.createNewFile()) {
        assert file.exists()
        println file.getPath() + " already exists."
    }
}
