Practical differences between [] and null - arrays

For a numeric field, null can mean something very different than 0. For example, at a restaurant, someone's "Tip" could be $0, or it could be null, meaning that the bill was sent to the table but the patron has not signed the bill yet.
What would be some practical differences between [] and null? The only differences I can think of are at an integrity or storage level, but I'm having trouble thinking up real-world examples where data might have a different purpose using one vs. the other. What are some real-world examples for this difference?

The use of null for arrays is useful to denote that the information is unknown, as opposed to there being no members of the array.
For example:
In an application collecting information on real estate, one of the fields on a property might be an array of buildings. Null would mean that the person entering the information didn't specify the presence or lack of buildings (i.e. they don't know or don't want to say), while an empty array would mean that the property actually has no buildings (i.e. an empty lot).
This is particularly useful when collecting information from incomplete sources.
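A minimal sketch of the real estate example, written in Python with None standing in for null. The function name describe_buildings and the sample building names are made up for illustration.

```python
from typing import List, Optional

def describe_buildings(buildings: Optional[List[str]]) -> str:
    """Distinguish 'not recorded' (None) from 'recorded as none' ([])."""
    if buildings is None:
        # null: the person entering the data did not say either way
        return "unknown: nobody recorded whether there are buildings"
    if not buildings:
        # []: the property is known to have no buildings
        return "empty lot: recorded as having no buildings"
    return f"{len(buildings)} building(s): {', '.join(buildings)}"

print(describe_buildings(None))
print(describe_buildings([]))
print(describe_buildings(["barn", "house"]))
```

Code that treats None and [] the same way (for example with a bare "if not buildings") silently loses the "unknown" case, which is exactly the information null was carrying.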

Think of a paper towel roll. A full paper towel roll is equivalent to an array with data inside. An empty paper towel roll with just the cardboard tube is []. No paper towel roll at all is null.

Consider the example of files and folders. An empty folder is [], and a folder with some files in it is like an array with data inside. If there is no folder at all, that is null.
Also, I have sometimes seen users confuse ' ' (a space) with null. The two are different: ' ' contains (at least) a space, whereas null means nothing at all.


Is there a way to turn my data from string variables to numeric?

I imported my data into Stata, and the program is reading some of the variables as strings, but not all of them. And I cannot understand what I did wrong, as some variables are being read as numbers. Is there a way in Stata to turn the string into numeric?
destring is intended for this situation, but the real question is why Stata read your variables as string when you think they should be numeric.
Some commonly met reasons are:
There is metadata in your data, especially if the data were read in from a file that has spent time in a spreadsheet. Rows of header information or endnotes can cause this problem.
A missing data code has been used that Stata doesn't recognise, say NA for missing.
Decimal points are indicated by, say, commas rather than stops or periods.
The options of destring are often critical, as you may need to spell out what should be done. So, study the help for destring.
If you think a variable should be numeric, but it's not clear why Stata disagrees, something like
tab myvar if missing(real(myvar))
shows the kinds of values of myvar that can't be converted easily. Very often it becomes clear quickly that there is one repeated problem for which there is one overall fix.
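The same diagnostic idea can be sketched outside Stata. The Python snippet below is only an analogy and not a translation of destring or real(): it tallies the strings that Python's float() refuses to parse, so that a repeated problem such as NA or a decimal comma stands out. The function name and the sample values are made up, and float() does not accept exactly the same strings as Stata's real().

```python
from collections import Counter

def unconvertible_values(values):
    """Tally the strings that float() cannot parse, most frequent first."""
    bad = Counter()
    for v in values:
        try:
            float(v)
        except ValueError:
            bad[v] += 1
    return bad.most_common()

# Made-up column as it might arrive from a spreadsheet export.
sample = ["12", "3.5", "NA", "1,5", "NA", "7"]
problems = unconvertible_values(sample)
print(problems)
```

Here the tally points at two separate causes from the list above: an unrecognised missing-data code and a decimal comma.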

Reordering dataset in Stata

We have a dataset of 1222x20 in Stata.
There are 611 individuals, such that each individual is in 2 rows of the dataset. There is only one variable of interest in each second row of each individual that we would like to use.
This means that we want a dataset of 611x21 that we need for our analysis.
It might also help if we could discard each odd/even row, and merge it later.
However, my Stata skills let me down at this point and I hope someone can help us.
Maybe someone knows any command or menu option that we might give a try.
If someone knows such a code, the individuals are characterized by the variable rescode, and the variable of interest on the second row is called enterprise.
Below, the head of our dataset is given. There is a binary time variable followup. We want to regress enterprise (yes/no) at followup = Followup, as the dependent variable, on enterprise at followup = Baseline, as an independent variable.
We have tried something like this:
reg enterprise(if followup="Folowup") i.aimag group loan_baseline eduvoc edusec age16 under16 marr_cohab age age_sq buddhist hahl sep_f nov_f enterprise(if followup ="Baseline"), vce(cluster soum)
followup is a numeric variable with value labels, as its colouring in the Data Editor makes clear, so you can't test its values directly for equality or inequality with literal strings. (And if you could, the match needs to be exact, as Folowup would not be read as implying Followup.)
There is a syntax for identifying observations by value labels: see [U] 13.11 in the pdf documentation or https://www.stata-journal.com/article.html?article=dm0009.
However, it is usually easiest just to use the numeric value underneath the value label. So if the variable followup had numeric values 0 and 1, you would test for equality with 0 or 1.
You must use == not = for testing for equality here:
... if followup == 1
For any future Stata questions, please see the Stata tag wiki for detailed advice on how to present data. Screenshots are usually difficult to read and impossible to copy, and leave many details about the data obscure.

I ask for help in areas I don't know when I study programming linguistics

I am a college student studying programming linguistics. I posted a similar question a while ago, but I mispublished it, and I have a similar question, so I ask for your help.
Questions are as follows.
For an elementary data type in a language with which you are familiar, do the following:
Explain the difference between data objects of that type and the values that those data objects may contain.
In this textbook, a data object means the memory location that contains the data value, and the value is one of the attributes of the data object.
What I don't understand is how to compare two concepts that are not of the same kind: the value, which is one of the attributes of a data object, and the data object itself, which is the higher-level concept.
I tried to understand it in various ways, but I couldn't understand it, so I asked for your help. Thank you.
I am working on the problem using C as the language I am familiar with.
My take on this question is the following:
The data object is a specific physical instance of storage for a value of the type. This physical instance exists in a definite time and the value it contains may change over time. There was a time before it existed and there will be a time when it is gone - maybe temporarily but eventually forever. Two data objects that hold the same value are nevertheless distinct in that they have separate existence.
The value is a nonphysical general principle which is a member of some theoretical set of possible values. The general principle does not exist in time or space but can be thought of as existing in the "Platonic universe of ideal forms"; it is an idea. In a sense, the idea of the value has always existed - before mankind discovered it - and will continue on after mankind is gone. There is no such thing as two distinct values that are the same; if you see two values that are the same, it's actually the same value; there's just one number 2, no matter in what context or how many times you see it used.
In C, data objects of the primitive type int are storage locations of at least 16 bits, holding values in a range that is guaranteed to cover at least -(2^15-1) to 2^15-1 (that is, -32767 to 32767). Consider this code snippet:
int n1 = 2;
int n2 = 2;
In this code snippet, we have two data objects - n1 and n2 - but just one value - 2.
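The distinction can also be made observable in a small sketch. It is written in Python rather than C, and the class IntObject is a hypothetical stand-in for a C int variable (a storage cell that holds a value), not anything from the textbook.

```python
class IntObject:
    """Hypothetical stand-in for a C int variable: a storage cell holding a value."""

    def __init__(self, value: int):
        self.value = value

n1 = IntObject(2)
n2 = IntObject(2)

same_value = n1.value == n2.value   # both cells contain the one value 2
same_object = n1 is n2              # but they are two separate cells

n1.value = 5                        # the contents of a data object can change over time
print(same_value, same_object, n1.value, n2.value)
```

Assigning to n1 changes what that one data object contains; it does nothing to n2 and nothing to the value 2 itself.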

Unique kind of questionnaire - Database design

For a research experiment I need to design a web application that implements a particular kind of questionnaire, the results of which will serve to derive some statistics and draw some conclusions.
In short, the questionnaire has to work as follows
1-> All answers are on a scale from absolutely false to absolutely correct, or are open answers.
2-> Each set of questions corresponds to a given word (for example BIRD or FISH) and a description (a small sentence).
3-> Within one set, all questions can take the form of either a small sentence (text) or an image (to be categorized) and each set may contain an arbitrary number of questions. I need to be able to store the questions that correspond to one word as both text and images and be able to choose between them.
4-> Each set of questions has 5 different kinds of help associated with it. In questionnaire type A the user may choose one at will. In questionnaire type B all kinds are shown.
5-> The user has to respond to all the questions once (and they have to appear in a random order). Then, if type A, he has to choose a kind of help or refuse to get help and possibly modify his answer. Or, if type B, see all kinds of help one by one (in a random order) and possibly modify his answer.
6-> For each question, for each type of questionnaire I have to know if an answer was modified, which kind of help caused the user to modify and (if type B) whether this kind of help appeared 1st, 2nd, 3rd etc.
I realize those may not be the most complicated demands, but I am new to this and rather confused. My relations up to now look something like the following
QUESTIONNAIRE(id, help_choice, type, phase)
QUESTION_CATEG(id,type, name, description)
IMAGE(#qcat_id, filepath)
TEXT(#qcat_id, content)
INCLUDES(#questionnaire_id, #qcat_id)
HELP(#id, #qcat_id, content)
ANSWER(#questionnaire_id, #qcat_id, response, was_modified, help_taken, help_order).
With help_taken being able to take special values to denote no-help and help_choice being able to take special values to denote that all help was shown.
What is bothering me is the different types of questions. I don't really like (and I don't think it will work) the way I have made the distinction between a text type and an image type question for a given question category. Knowing that for a given category (say BIRD) I may have both types (image and text), I have included a 'type' attribute in QUESTION_CATEG. But I feel like I am repeating information.
Do you have any hints as to how this might be fixed, or even ideas for a completely different approach? Any help is welcome.
This seems to work.
Q_CATEG(id, name, order, description, included)
QUESTION(id, q_categ_id, type, content, order)
AVAIL_ANSWER(id, question_id, content, order)
HELP_CATEG(id, order, name, description)
HELP(help_categ_id, q_categ_id, order, content)
QUESTIONNAIRE(id, type, phase, start, end)
GIVEN_ANSWER(questionnaire_id, question_id, answer_id, modified_answer_id, reason_answer_id, help_categ_id, help_order)
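As a rough sketch of how the first two relations above address the text/image concern, here they are as SQLite tables created from Python. The column types, the CHECK constraint, the reading of included as a 0/1 flag, and the sample rows are all assumptions; only the table and column names come from the relations listed. The point is that type lives on each QUESTION row, so one category such as BIRD can own both text and image questions without repeating anything on the category.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE q_categ (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,      -- e.g. BIRD
    "order"     INTEGER,            -- quoted because ORDER is an SQL keyword
    description TEXT,
    included    INTEGER             -- assumed to be a 0/1 flag
);
CREATE TABLE question (
    id         INTEGER PRIMARY KEY,
    q_categ_id INTEGER NOT NULL REFERENCES q_categ(id),
    type       TEXT NOT NULL CHECK (type IN ('text', 'image')),
    content    TEXT NOT NULL,       -- a sentence, or an image file path
    "order"    INTEGER
);
""")

# Made-up sample rows: one category with one question of each type.
conn.execute(
    "INSERT INTO q_categ (id, name, description) VALUES (1, 'BIRD', 'made-up description')"
)
conn.executemany(
    "INSERT INTO question (q_categ_id, type, content) VALUES (?, ?, ?)",
    [(1, "text", "It can fly."), (1, "image", "images/sparrow.png")],
)

def questions_for(name, qtype):
    """Return the content of all questions of one type for one category."""
    rows = conn.execute(
        "SELECT q.content FROM question q "
        "JOIN q_categ c ON c.id = q.q_categ_id "
        "WHERE c.name = ? AND q.type = ? ORDER BY q.id",
        (name, qtype),
    ).fetchall()
    return [r[0] for r in rows]

print(questions_for("BIRD", "text"))
print(questions_for("BIRD", "image"))
```

Choosing between the text and image versions for a questionnaire then becomes a filter on question.type rather than a property of the category.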

CSV String vs Arrays: Is this too stringly typed?

I came across some existing code in our production environment given to us by our vendor. They use a string of comma-separated values to store filtered results from a DB. Keep in mind that this is for a proprietary scripting language called PowerOn that interfaces with a database residing on an AIX system, but it's a language that supports strings, integers, and arrays.
For example, we have;
Account
----------------
123
234
3456
28390
The pseudocode might look like;
Define accounts As String
For Each Account
accounts=accounts + CharCast(Account) + ","
End
as opposed to something I would expect to see like
Define accounts As Integer Array(99)
Define index as Integer=0
For Each Account
accounts(index)=Account
index=index+1
End
By the time the loop in the string version is done, accounts will look like 123,234,3456,28390, (note the trailing comma). The string is later used to test if a specific instance exists like so
If CharSearch("28390", accounts) > 0 Then Call DoSomething
In the example, the statement evaluates to true and DoSomething gets called. Given the option of arrays, why would one want to store integer values within a string of comma-separated values? In every language I've come across, string-based operations are almost always more expensive than integer-based operations.
Considering I haven't seen this technique before and my experience is somewhat limited, is there a name for this? Is this common practice or is this just another example of being too stringly typed? To extend the existing code, should I continue using the string method? Did we get cruddy code from our vendor?
What I put in the comment still holds, but my real answer is: it's probably a design decision with respect to compatibility/portability. In your integer-array case (and at a low enough level of the API) you'd typically find yourself asking questions like: what's a safe guess for the size of an integer on today's machines? What about endianness?
The most portable and most flexible of all data formats always has been and always will be printed representation. It may not be as fast to process that but that's where adapters/converters or so kick in. I wouldn't be surprised to find (human-readable) printed representation of something especially in database APIs like you describe.
If you want something fast, just take whatever is given to you, convert it to a more efficient internal format, do your processing and convert it back.
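A minimal sketch of that convert, process, convert-back round trip, in Python rather than PowerOn. The helper names parse_accounts and format_accounts are made up, and the sketch assumes the string always has the trailing-comma layout shown in the question. It also shows one thing the internal format buys beyond speed: if the string is tested with a plain substring search (an assumption about how CharSearch behaves), a fragment such as 345 matches inside 3456 even though no such account exists.

```python
def parse_accounts(csv_text: str) -> set:
    """Convert '123,234,' style text into a set of ints, skipping empty pieces."""
    return {int(part) for part in csv_text.split(",") if part.strip()}

def format_accounts(accounts) -> str:
    """Convert back to the comma-terminated text form, in ascending order."""
    return "".join(f"{n}," for n in sorted(accounts))

raw = "123,234,3456,28390,"
accounts = parse_accounts(raw)

exact_hit = 28390 in accounts      # membership compares whole numbers
false_hit_in_text = "345" in raw   # substring search matches inside 3456
false_hit_in_set = 345 in accounts

print(exact_hit, false_hit_in_text, false_hit_in_set)
print(format_accounts(accounts))
```

If the code has to stay string-based, searching for the value together with its delimiters (for example ",345," against a string that also starts with a comma) avoids the partial match.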
There's nothing inherently wrong with using comma-separated strings instead of arrays. Sure, you can't readily access the nth element of such a collection, but if such random access is not needed then there's no penalty for it, right?
As far as I know Oracle DB stores NUMBER values as strings (and if my memory is correct - for DATEs as well) for very practical reasons.
In your specific example, using strings looks like overkill when the data is passed around without crossing process boundaries. But could it be that the string data type makes more sense when sending data over the wire or storing it on disk?
