I am looking into the EXIF format to write a parser. The tags are present here. http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/EXIF.html#LightSource
1) Some of the fields have 'undef' as the datatype. How would you use such a field? What type do we assume it to be?
2) Some datatypes look like this: int16u[2]!. The exclamation mark means "unsafe" (according to the tooltip shown when the cursor moves over it). But what does that mean?
3) Another datatype is N. I don't understand what that means.
I see that the document you mention was revised on 13-Oct-2012, which is after you asked your question. So I suggest checking again; maybe it has fewer "holes" now.
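On question 1, one common reading is that 'undef' corresponds to TIFF/EXIF field type 7 (UNDEFINED): a run of 8-bit bytes whose meaning depends on the tag, so a parser can keep it as raw bytes and decode it per tag. The sketch below is only an illustration under that assumption; the helper name parse_inline_entry is hypothetical, and it handles only values of at most 4 bytes, which are stored inline in the 12-byte IFD entry.

```python
import struct

TYPE_UNDEFINED = 7  # TIFF/EXIF field type 7: opaque 8-bit bytes


def parse_inline_entry(entry, byte_order="<"):
    """Decode one 12-byte IFD entry whose value fits in the last 4 bytes.

    Hypothetical helper for illustration: returns (tag, field_type, raw_bytes).
    """
    tag, field_type, count = struct.unpack(byte_order + "HHI", entry[:8])
    if field_type != TYPE_UNDEFINED or count > 4:
        raise ValueError("this sketch only handles inline 'undef' values")
    # Keep the value as bytes; interpreting it is the job of per-tag code.
    return tag, field_type, entry[8:8 + count]


# ExifVersion (tag 0x9000) is an undef[4] field holding ASCII digits.
entry = struct.pack("<HHI4s", 0x9000, TYPE_UNDEFINED, 4, b"0230")
tag, field_type, raw = parse_inline_entry(entry)
version = raw.decode("ascii")
```

Here raw stays a bytes object until tag-specific code decides how to read it; for ExifVersion that happens to be ASCII text, but other undef tags hold binary structures.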
I imported my data into Stata, and the program is reading some of the variables as strings, but not all of them. And I cannot understand what I did wrong, as some variables are being read as numbers. Is there a way in Stata to turn the string into numeric?
destring is intended for this situation, but the real question is why Stata read your variables as string when you think they should be numeric.
Some commonly met reasons are:
There is metadata in your data, especially if the data were read in from a file that has spent time in a spreadsheet. Rows of header information or endnotes can cause this problem.
A missing data code has been used that Stata doesn't recognise, say NA for missing.
Decimal points are indicated by, say, commas rather than stops or periods.
The options of destring are often critical, as you may need to spell out what should be done. So, study the help for destring.
If you think a variable should be numeric, but it's not clear why it isn't, something like
tab myvar if missing(real(myvar))
shows the kinds of values of myvar that can't be converted easily. Very often it becomes clear quickly that there is one repeated problem for which there is one overall fix.
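For readers who want to run the same kind of check before the data ever reach Stata, here is a rough Python sketch of the idea behind that tab command: tabulate the values that fail numeric conversion. It is not Stata code, float() is only a stand-in whose parsing rules differ from Stata's real(), and the function name and sample values are made up for illustration.

```python
from collections import Counter


def non_numeric_values(values):
    """Count the strings that cannot be parsed as numbers."""
    problems = Counter()
    for raw in values:
        text = raw.strip()
        try:
            float(text)  # stand-in for Stata's real(); rules are not identical
        except ValueError:
            problems[text] += 1
    return problems


# Made-up column with a missing-value code, a decimal comma and a footer row.
sample = ["1.5", "2", "NA", "3,7", "NA", "Total"]
bad = non_numeric_values(sample)
for value, count in bad.most_common():
    print(value, count)
```

As with the Stata one-liner, the point is that a short table of offending values usually reveals one repeated problem.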
I use Tika to index documents. I then want to get the whole paragraph, from paragraph start to paragraph end, that contains the keywords. I tried to use HighlightFragsize but it does not work. For example, there is a document like the one below:
When I was very small, my parents took me to many places, because they wanted me to learn more about the world. Thanks to them, I
witnessed the variety of the world and a lot of beautiful scenery.
But no matter where I go, in my heart, the place with the most
beautiful scenery is my hometown.
There are two paragraphs above. If I search for 'my parents', I hope to get the whole paragraph "When I was very small, my parents....... a lot of beautiful scenery", not only part of this paragraph. I used HighlightFragsize to limit the sentence, but the result is not what I want. Please help. Thanks in advance.
You haven't provided a lot of information to go off of but I'm assuming that you're using a highlighter so here are a couple of things you should check for:
1. The field that holds your parsed data - is it stored? Can you see the entire contents?
2. If (1), is the text longer than 51200 chars? The default highlighter configuration has a setting maxAnalyzedChars that is set to 51200. This means that the highlighter will not process more than 51200 characters from a highlighted field in a matched document to look for highlights. If this is the case, increase this value until you get the desired results.
Highlighting on extremely large fields may incur a significant performance penalty which you should be mindful of before choosing a configuration.
See this for more details.
UPDATE
I don't think there's any parameter called HighlightFragsize but there's one called hl.fragsize which can do what you want when set to zero.
Try the following query and see if it works for you:
q=my+parents&hl=true&hl.fl=my_field&hl.fragsize=0
Additionally, you should, in any case, be mindful of the first 2 points I posted above.
UPDATE 2
I don’t think there’s a direct way to do what you’re looking for. You could possibly split up your field into a multi valued field with each paragraph being stored as a separate value.
You can then possibly use hl.preserveMulti, hl.maxMultiValuedToExamine and hl.maxMultiValuedToMatch to achieve what you need.
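The splitting step could be done before indexing. Below is a minimal Python sketch of that pre-processing, assuming paragraphs in the extracted text are separated by blank lines (that separator is an assumption, and Tika output may differ). The function names are hypothetical and nothing here calls Solr; each returned string would become one value of the multi-valued field.

```python
import re


def split_paragraphs(text):
    """Split on blank lines and collapse whitespace inside each paragraph."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [" ".join(block.split()) for block in blocks if block.strip()]


def paragraphs_matching(text, phrase):
    """Return whole paragraphs containing the phrase (case-insensitive)."""
    needle = phrase.lower()
    return [p for p in split_paragraphs(text) if needle in p.lower()]


document = """When I was very small, my parents took me to many places, because they
wanted me to learn more about the world. Thanks to them, I witnessed the
variety of the world and a lot of beautiful scenery.

But no matter where I go, in my heart, the place with the most
beautiful scenery is my hometown."""

hits = paragraphs_matching(document, "my parents")
print(hits[0])
```

The plain substring match only stands in for the search itself; in practice Solr would do the matching and the highlighter would return the stored paragraph value.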
How do I read from an XML file with an unknown format? For example, a file can be of the format:
<data1>int1</data1>
<data2>int2</data2>
<data3>int3</data3>
<data4>int4</data4>
OR
<data1>int1</data1>
<data4>int4</data4>
OR
<data1>int1</data1>
<data2>int2</data2>
<data3>int3</data3>
<data4>int4</data4>
<data5>int5</data5>
In the second case I am to assume that int2 and int3 are to be assigned default values. I thought of one way to solve this problem, but it came out messy and spaghetti-like.
Any help would be appreciated!
(1) None of those is a complete XML document. There needs to be a single root element.
(2) If you don't insist on restricting the number and order of their instances, it's easy to declare in a DTD that element content can be a mix of other elements, using the '|' operator. (See http://www.w3.org/TR/xml11/#sec-element-content)
(3) If you want to constrain those more tightly, then yes, DTDs can require spelling out all the combinations. Switching to validating against XML Schemas is one obvious solution; DTDs are pretty much considered obsolete anyway since they aren't compatible with XML Namespaces (which have become a basic part of XML processing).
(4) If you insist on sticking with DTDs, and can't accept unconstrained order/count, and don't want to spell out all the possible sequences... consider doing some of that checking and/or applying of default values in your application code rather than in the DTD.
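As a rough illustration of point (4), here is a Python sketch that applies defaults in application code. It assumes the elements are wrapped in a single root element (here called root, a made-up name, per point (1)), that the values are integers, and that every default is 0; all of those are assumptions for the example, not something the question specifies.

```python
import xml.etree.ElementTree as ET

# Hypothetical defaults; the real ones depend on the application.
DEFAULTS = {"data1": 0, "data2": 0, "data3": 0, "data4": 0, "data5": 0}


def read_values(xml_text):
    """Start from the defaults and overwrite whatever the document supplies."""
    root = ET.fromstring(xml_text)
    values = dict(DEFAULTS)
    for child in root:
        # Unknown elements are ignored; empty elements keep their default.
        if child.tag in values and child.text is not None:
            values[child.tag] = int(child.text.strip())
    return values


sample = "<root><data1>10</data1><data4>40</data4></root>"
result = read_values(sample)
print(result)
```

Starting from a copy of the defaults avoids a branch per element, which is usually where the spaghetti comes from. A non-integer value raises ValueError from int(), which the caller can catch or let propagate.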
I may be erring towards pedantry here, but say I have a field in a database that currently has two values (but may contain more in future). I know I could name this as a flag (e.g. MY_FLAG) containing values 0 and 1, but should more values be required (e.g. 0,1,2,3,4), is it still correct to call the field a flag?
I seem to recall reading something previously, that a flag should always be binary, and anything else should be labelled more appropriately, but I may be mistaken. Does anyone know if my thinking is correct? If so, can you point me to any information on this please? My googling has turned nothing up!!
Thanks very much :o)
Flags are usually binary because when we say flag, we mean that it is either up (1) or down (0).
It is just like the military practice of raising and lowering a flag as a signal. The concept of flagging is taken from there.
Regarding what you said:
"should more values be required (e.g. 0,1,2,3,4)"
In such a situation, use an enum. Enumerations are built for such cases. Alternatively, we sometimes document the meaning of the numeric values in comments or in a separate file so that memory can be saved (we use a tinyint or bit field). But never name such a field a flag.
Flags have a standard meaning: either up or down. Naming a multi-valued field a flag won't cause an error or anything, but it is not good practice. Hope that makes sense.
It's all a matter of conventions and the ability to maintain your database/code effectively. Technically, you can have a column called my_flag defined as a varchar and hold values like "batman" and "barak obama".
By convention, flags are boolean. If you intend to have other values there, it's probably a better idea to call the column something else, like some_enum, or my_code.
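To make the naming distinction concrete, here is a small Python sketch. The names OrderStatus and is_archived, and the meanings given to 0 through 4, are invented for illustration; the question does not say what its values mean.

```python
from enum import IntEnum


class OrderStatus(IntEnum):
    """Hypothetical multi-valued field, named after what it means."""
    PENDING = 0
    PAID = 1
    SHIPPED = 2
    DELIVERED = 3
    CANCELLED = 4


# A genuine two-state value can stay boolean and carry a flag-like name.
is_archived = False

# Converting a stored integer back to the enum rejects unknown codes.
stored = 2
status = OrderStatus(stored)
print(status.name)
```

If a sixth value is needed later, it is added to the enumeration and the field name still fits, which is not the case for something called MY_FLAG.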
Very occasionally, people talk about (for example) tri-state flags, but Wikipedia and most of the dictionary definitions that I read reserve "flag" for binary / two-state uses [1].
Of course, neither Wikipedia nor any dictionary has the authority to say some English usage is "incorrect". "Correct" usage is really "conventional" usage; i.e. what other people say / write.
I would argue that saying or writing "tri-state flag" is unconventional, but it is unambiguous and serves its purpose of communicating a concept adequately. (And the usage can be justified ...)
1 - Most, but not all; see http://www.oxforddictionaries.com/definition/english/flag.
Don't call anything "flag". Or "count" or "mark" or "int" or "code". Name it like everything else in code: after what it means.
workday {mon..fri}
tall {yes,no}
zip_code {00000..99999}
state {AL..WY}
Notice that (something like) yes/no plays the 'flag' role of indicating a permanent dichotomy (in lieu of boolean, which does that in the rest of the universe outside SQL). Use it when the specification/contract really is whether something is so. If a design might add more values, you should use a different type.
Of course if you want to add more info to a name you can. Add distinctions that are meaningful if you can.
workday {monday..friday}
workday_abbrev {mon..fri}
is_tall {yes,no}
zip_plus_5 {00000-99..99999-99}
state_name {Alabama..Wyoming}
state_2 {AL..WY}
I've been looking around for the best way to store a large string value (like a blog post, or a text description, etc.) with Fluent NHibernate, and the answer I keep seeing is to use nvarchar(MAX), which, if my reading is correct (which it very often isn't), means a length of 4000+. So I have a field like so...
Map(x => x.Description)
.Column("[description]")
.Length(4001)
.Access.Property()
.Not.Nullable();
In theory, this should do it, right? I'm a little confused about this though. In school, we were taught pretty clearly that you want to make each column size as small as possible.
If I make that column max size, doesn't that go against that very principle and make the table very large, and wasteful? Can anyone shed some very clear, stupid, blonde-proof logic on this for me? I've been left with a lot of confusion over the whole ordeal.
Have a look at this, maybe it will help: http://www.crankingoutcode.com/post/Storing-large-strings-with-Fluent-NHibernate-(automapping).aspx
And gotchas: http://ayende.com/blog/1969/nhibernate-and-large-text-fields-gotchas
Note that max means that you can store up to 2^31-1 bytes of data. However, it consumes space based on the actual length of the data.