Pyspark Dataframe Iterate Array Columns - arrays

In PySpark, I have a dataframe in which I'm trying to parse multiple array columns. The last two rows in the dataframe contain multiple values that I would like to split into separate rows.
+-------------+---------------+-------------+--------------------+--------------+-------------+----------------------+--------------+
| WB-API-CNTY | WB-API-UNIQUE | WB-OIL-CODE | WB-OIL-LSE-NBR | WB-OIL-DIST | WB-GAS-CODE | WB-GAS-RRC-ID | WB-GAS-DIS |
+-------------+---------------+-------------+--------------------+--------------+-------------+----------------------+--------------+
| 449 | 80212 | [] | [] | [] | [] | [] | [] |
+-------------+---------------+-------------+--------------------+--------------+-------------+----------------------+--------------+
| 449 | 80214 | ["O"] | ["05361"] | ["06"] | ["O"] | ["060536"] | ["00"] |
+-------------+---------------+-------------+--------------------+--------------+-------------+----------------------+--------------+
| 449 | 80222 | ["O", "O"] | ["01718", "05492"] | ["06", "06"] | ["O", "O"] | ["060171", "060549"] | ["00", "00"] |
+-------------+---------------+-------------+--------------------+--------------+-------------+----------------------+--------------+
| 451 | 00005 | ["G", "O"] | ["5568", "04351"] | ["10", "09"] | ["G", "O"] | ["105568", "090435"] | ["09", "00"] |
+-------------+---------------+-------------+--------------------+--------------+-------------+----------------------+--------------+
Expected results:
+-------------+---------------+-------------+----------------+-------------+-------------+---------------+------------+
| WB-API-CNTY | WB-API-UNIQUE | WB-OIL-CODE | WB-OIL-LSE-NBR | WB-OIL-DIST | WB-GAS-CODE | WB-GAS-RRC-ID | WB-GAS-DIS |
+-------------+---------------+-------------+----------------+-------------+-------------+---------------+------------+
| 449 | 80212 | | | | | | |
+-------------+---------------+-------------+----------------+-------------+-------------+---------------+------------+
| 449 | 80214 | O | 05361 | 06 | O | 060536 | 00 |
+-------------+---------------+-------------+----------------+-------------+-------------+---------------+------------+
| 449 | 80222 | O | 01718 | 06 | O | 060171 | 00 |
+-------------+---------------+-------------+----------------+-------------+-------------+---------------+------------+
| 449 | 80222 | O | 05492 | 06 | O | 060549 | 00 |
+-------------+---------------+-------------+----------------+-------------+-------------+---------------+------------+
| 451 | 00005 | G | 5568 | 10 | G | 105568 | 09 |
+-------------+---------------+-------------+----------------+-------------+-------------+---------------+------------+
| 451 | 00005 | O | 04351 | 09 | O | 090435 | 00 |
+-------------+---------------+-------------+----------------+-------------+-------------+---------------+------------+

from pyspark.sql import functions as F

array_cols = ['WB-OIL-CODE', 'WB-OIL-LSE-NBR', 'WB-OIL-DIST', 'WB-GAS-CODE', 'WB-GAS-RRC-ID', 'WB-GAS-DIS']
other_cols = [c for c in df.columns if c not in array_cols]
# arrays_zip pairs the arrays element-wise; inline explodes the resulting
# array of structs into one row per element, one column per struct field.
df = df.select(
    *other_cols,
    F.expr(f"inline(arrays_zip({'`' + '`,`'.join(array_cols) + '`'}))")
)
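For reference, here is a minimal, self-contained sketch of the transform on an abbreviated two-array slice of the data (the DataFrame construction is hypothetical; the real frame has all six array columns). One caveat: inline() emits no rows for a record whose arrays are all empty, so the 80212 row would be dropped rather than kept with blanks as in the expected results, and would need separate handling.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Abbreviated sample: the two key columns plus two of the array columns.
df = spark.createDataFrame(
    [
        (449, "80214", ["O"], ["05361"]),
        (449, "80222", ["O", "O"], ["01718", "05492"]),
    ],
    ["WB-API-CNTY", "WB-API-UNIQUE", "WB-OIL-CODE", "WB-OIL-LSE-NBR"],
)

array_cols = ["WB-OIL-CODE", "WB-OIL-LSE-NBR"]
other_cols = [c for c in df.columns if c not in array_cols]

# One output row per zipped element; the struct fields keep the column names.
df.select(
    *other_cols,
    F.expr(f"inline(arrays_zip({'`' + '`,`'.join(array_cols) + '`'}))")
).show()
# +-----------+-------------+-----------+--------------+
# |WB-API-CNTY|WB-API-UNIQUE|WB-OIL-CODE|WB-OIL-LSE-NBR|
# +-----------+-------------+-----------+--------------+
# |        449|        80214|          O|         05361|
# |        449|        80222|          O|         01718|
# |        449|        80222|          O|         05492|
# +-----------+-------------+-----------+--------------+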

Related

Understanding Linker output (Elf2Bin)

I'm doing some development using Mbed Studio and after linking I see this sort of output:
Elf2Bin: test-sp2
| Module | .text | .data | .bss |
|----------------------|---------------|---------|------------|
| [lib]\c_w.l | 5900(+196) | 16(+0) | 348(+0) |
| [lib]\fz_wv.l | 26(+0) | 0(+0) | 0(+0) |
| [lib]\libcpp_w.l | 1(+0) | 0(+0) | 0(+0) |
| [lib]\libcppabi_w.l | 44(+0) | 0(+0) | 0(+0) |
| anon$$obj.o | 94(+94) | 0(+0) | 2048(+0) |
| main.o | 1798(+0) | 0(+0) | 460(+0) |
| mbed-os\cmsis | 13531(+0) | 168(+0) | 6609(+0) |
| mbed-os\connectivity | 70226(+140) | 295(+0) | 45532(+40) |
| mbed-os\drivers | 3852(-16) | 0(+0) | 0(+0) |
| mbed-os\events | 2016(+0) | 0(+0) | 3104(+0) |
| mbed-os\hal | 2090(+0) | 8(+0) | 115(+0) |
| mbed-os\platform | 9442(+0) | 64(+0) | 1104(+0) |
| mbed-os\rtos | 1792(+0) | 0(+0) | 8(+0) |
| mbed-os\targets | 34347(+1459) | 296(+0) | 394(+0) |
| Subtotals | 145159(+1873) | 847(+0) | 59722(+40) |
Total Static RAM memory (data + bss): 60569(+40) bytes
Total Flash memory (text + data): 146006(+1873) bytes
I understand what this output means, mostly.
But what do the (+xxx) next to the byte counts mean?
For example, in
| mbed-os\connectivity | 70226(+140) | 295(+0) | 45532(+40) |
What does the (+140) in the .text section mean? Could it be the change in size from the last link?
Could it be the change in size from the last link?
Yes: the numbers in parentheses show the change in size since the previous link.
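This reading is consistent with the table itself: the per-module .text deltas sum to 196 + 94 + 140 - 16 + 1459 = 1873, matching the (+1873) on the Subtotals line.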

Why does using a char type as a loop index give unexpected results?

Bear in mind that this is an old C compiler: CP/M on the Z80.
#include <stdio.h>

main()
{
    char i = 0;
    do
    {
        printf("0x%04x | ", i);
    } while (++i);
}
Expected:
0x0000 | 0x0001 | 0x0002 | 0x0003 | 0x0004 | 0x0005 | 0x0006 | 0x0007 | 0x0008 | 0x0009 | 0x000A | 0x000B | 0x000C | 0x000D | 0x000E | 0x000F | 0x0010 | 0x0011 | 0x0012 | 0x0013 | 0x0014 | 0x0015 | 0x0016 | 0x0017 | 0x0018 | 0x0019 | 0x001A | 0x001B | 0x001C | 0x001D | 0x001E | 0x001F | 0x0020 | 0x0021 | 0x0022 | 0x0023 | 0x0024 | 0x0025 | 0x0026 | 0x0027 | 0x0028 | 0x0029 | 0x002A | 0x002B | 0x002C | 0x002D | 0x002E | 0x002F | 0x0030 | 0x0031 | 0x0032 | 0x0033 | 0x0034 | 0x0035 | 0x0036 | 0x0037 | 0x0038 | 0x0039 | 0x003A | 0x003B | 0x003C | 0x003D | 0x003E | 0x003F | 0x0040 | 0x0041 | 0x0042 | 0x0043 | 0x0044 | 0x0045 | 0x0046 | 0x0047 | 0x0048 | 0x0049 | 0x004A | 0x004B | 0x004C | 0x004D | 0x004E | 0x004F | 0x0050 | 0x0051 | 0x0052 | 0x0053 | 0x0054 | 0x0055 | 0x0056 | 0x0057 | 0x0058 | 0x0059 | 0x005A | 0x005B | 0x005C | 0x005D | 0x005E | 0x005F | 0x0060 | 0x0061 | 0x0062 | 0x0063 | 0x0064 | 0x0065 | 0x0066 | 0x0067 | 0x0068 | 0x0069 | 0x006A | 0x006B | 0x006C | 0x006D | 0x006E | 0x006F | 0x0070 | 0x0071 | 0x0072 | 0x0073 | 0x0074 | 0x0075 | 0x0076 | 0x0077 | 0x0078 | 0x0079 | 0x007A | 0x007B | 0x007C | 0x007D | 0x007E | 0x007F | 0x0080 | 0x0081 | 0x0082 | 0x0083 | 0x0084 | 0x0085 | 0x0086 | 0x0087 | 0x0088 | 0x0089 | 0x008A | 0x008B | 0x008C | 0x008D | 0x008E | 0x008F | 0x0090 | 0x0091 | 0x0092 | 0x0093 | 0x0094 | 0x0095 | 0x0096 | 0x0097 | 0x0098 | 0x0099 | 0x009A | 0x009B | 0x009C | 0x009D | 0x009E | 0x009F | 0x00A0 | 0x00A1 | 0x00A2 | 0x00A3 | 0x00A4 | 0x00A5 | 0x00A6 | 0x00A7 | 0x00A8 | 0x00A9 | 0x00AA | 0x00AB | 0x00AC | 0x00AD | 0x00AE | 0x00AF | 0x00B0 | 0x00B1 | 0x00B2 | 0x00B3 | 0x00B4 | 0x00B5 | 0x00B6 | 0x00B7 | 0x00B8 | 0x00B9 | 0x00BA | 0x00BB | 0x00BC | 0x00BD | 0x00BE | 0x00BF | 0x00C0 | 0x00C1 | 0x00C2 | 0x00C3 | 0x00C4 | 0x00C5 | 0x00C6 | 0x00C7 | 0x00C8 | 0x00C9 | 0x00CA | 0x00CB | 0x00CC | 0x00CD | 0x00CE | 0x00CF | 0x00D0 | 0x00D1 | 0x00D2 | 0x00D3 | 0x00D4 | 0x00D5 | 0x00D6 | 0x00D7 | 0x00D8 | 0x00D9 | 0x00DA | 0x00DB | 0x00DC | 0x00DD | 0x00DE | 0x00DF | 0x00E0 | 0x00E1 | 0x00E2 | 0x00E3 | 0x00E4 | 0x00E5 | 0x00E6 | 0x00E7 | 0x00E8 | 0x00E9 | 0x00EA | 0x00EB | 0x00EC | 0x00ED | 0x00EE | 0x00EF | 0x00F0 | 0x00F1 | 0x00F2 | 0x00F3 | 0x00F4 | 0x00F5 | 0x00F6 | 0x00F7 | 0x00F8 | 0x00F9 | 0x00FA | 0x00FB | 0x00FC | 0x00FD | 0x00FE | 0x00FF |
Actual:
0x0A00 | 0x0A01 | 0x0A02 | 0x0A03 | 0x0A04 | 0x0A05 | 0x0A06 | 0x0A07 | 0x0A08 | 0x0A09 | 0x0A0A | 0x0A0B | 0x0A0C | 0x0A0D | 0x0A0E | 0x0A0F | 0x0A10 | 0x0A11 | 0x0A12 | 0x0A13 | 0x0A14 | 0x0A15 | 0x0A16 | 0x0A17 | 0x0A18 | 0x0A19 | 0x0A1A | 0x0A1B | 0x0A1C | 0x0A1D | 0x0A1E | 0x0A1F | 0x0A20 | 0x0A21 | 0x0A22 | 0x0A23 | 0x0A24 | 0x0A25 | 0x0A26 | 0x0A27 | 0x0A28 | 0x0A29 | 0x0A2A | 0x0A2B | 0x0A2C | 0x0A2D | 0x0A2E | 0x0A2F | 0x0A30 | 0x0A31 | 0x0A32 | 0x0A33 | 0x0A34 | 0x0A35 | 0x0A36 | 0x0A37 | 0x0A38 | 0x0A39 | 0x0A3A | 0x0A3B | 0x0A3C | 0x0A3D | 0x0A3E | 0x0A3F | 0x0A40 | 0x0A41 | 0x0A42 | 0x0A43 | 0x0A44 | 0x0A45 | 0x0A46 | 0x0A47 | 0x0A48 | 0x0A49 | 0x0A4A | 0x0A4B | 0x0A4C | 0x0A4D | 0x0A4E | 0x0A4F | 0x0A50 | 0x0A51 | 0x0A52 | 0x0A53 | 0x0A54 | 0x0A55 | 0x0A56 | 0x0A57 | 0x0A58 | 0x0A59 | 0x0A5A | 0x0A5B | 0x0A5C | 0x0A5D | 0x0A5E | 0x0A5F | 0x0A60 | 0x0A61 | 0x0A62 | 0x0A63 | 0x0A64 | 0x0A65 | 0x0A66 | 0x0A67 | 0x0A68 | 0x0A69 | 0x0A6A | 0x0A6B | 0x0A6C | 0x0A6D | 0x0A6E | 0x0A6F | 0x0A70 | 0x0A71 | 0x0A72 | 0x0A73 | 0x0A74 | 0x0A75 | 0x0A76 | 0x0A77 | 0x0A78 | 0x0A79 | 0x0A7A | 0x0A7B | 0x0A7C | 0x0A7D | 0x0A7E | 0x0A7F | 0x0A80 | 0x0A81 | 0x0A82 | 0x0A83 | 0x0A84 | 0x0A85 | 0x0A86 | 0x0A87 | 0x0A88 | 0x0A89 | 0x0A8A | 0x0A8B | 0x0A8C | 0x0A8D | 0x0A8E | 0x0A8F | 0x0A90 | 0x0A91 | 0x0A92 | 0x0A93 | 0x0A94 | 0x0A95 | 0x0A96 | 0x0A97 | 0x0A98 | 0x0A99 | 0x0A9A | 0x0A9B | 0x0A9C | 0x0A9D | 0x0A9E | 0x0A9F | 0x0AA0 | 0x0AA1 | 0x0AA2 | 0x0AA3 | 0x0AA4 | 0x0AA5 | 0x0AA6 | 0x0AA7 | 0x0AA8 | 0x0AA9 | 0x0AAA | 0x0AAB | 0x0AAC | 0x0AAD | 0x0AAE | 0x0AAF | 0x0AB0 | 0x0AB1 | 0x0AB2 | 0x0AB3 | 0x0AB4 | 0x0AB5 | 0x0AB6 | 0x0AB7 | 0x0AB8 | 0x0AB9 | 0x0ABA | 0x0ABB | 0x0ABC | 0x0ABD | 0x0ABE | 0x0ABF | 0x0AC0 | 0x0AC1 | 0x0AC2 | 0x0AC3 | 0x0AC4 | 0x0AC5 | 0x0AC6 | 0x0AC7 | 0x0AC8 | 0x0AC9 | 0x0ACA | 0x0ACB | 0x0ACC | 0x0ACD | 0x0ACE | 0x0ACF | 0x0AD0 | 0x0AD1 | 0x0AD2 | 0x0AD3 | 0x0AD4 | 0x0AD5 | 0x0AD6 | 0x0AD7 | 0x0AD8 | 0x0AD9 | 0x0ADA | 0x0ADB | 0x0ADC | 0x0ADD | 0x0ADE | 0x0ADF | 0x0AE0 | 0x0AE1 | 0x0AE2 | 0x0AE3 | 0x0AE4 | 0x0AE5 | 0x0AE6 | 0x0AE7 | 0x0AE8 | 0x0AE9 | 0x0AEA | 0x0AEB | 0x0AEC | 0x0AED | 0x0AEE | 0x0AEF | 0x0AF0 | 0x0AF1 | 0x0AF2 | 0x0AF3 | 0x0AF4 | 0x0AF5 | 0x0AF6 | 0x0AF7 | 0x0AF8 | 0x0AF9 | 0x0AFA | 0x0AFB | 0x0AFC | 0x0AFD | 0x0AFE | 0x0AFF |
What am I doing wrong?
Assembly:
cseg
?59999:
defb 48,120,37,48,52,120,32,124,32,0
main#:
ld c,0
#0:
push bc
push bc
ld bc,?59999
push bc
ld hl,2
call printf
pop bc
pop bc
pop bc
inc c
jp nz,#0
ret
public main#
extrn printf
end
Golly. It's been a LONG time since I used a Z80 C compiler, and most were buggy as [unprintable] back then.
I would suggest dumping the assembler if the compiler allows. My GUESS is that internally the char is being promoted to a 16-bit int with indeterminate upper bits set.
The problem is that %04x expects an int, not a char.
You might try forcing the compiler to play nice by explicitly casting the char to an int, i.e.:
printf("0x%04x | ", (int) i);
The most probable explanation is that, being an old 8-bit compiler, it does not convert the char-typed i variable into an int: it just pushes the bc register, assuming the called function will not use the high part. That assumption fails here, because printf() expects a whole int parameter (the %x format is for an int), so it reads the full bc register, and you don't know what the b register holds. This explains the high byte 0x0a in the output, a value that doesn't appear anywhere in your assembler listing. Later versions of the standard began promoting every short and char argument to int, probably to avoid exactly this kind of issue.
Try this code, and see if that solves the problem.
#include <stdio.h>

main()
{
    char i = 0;
    do
    {
        printf("0x%04x | ", (int) i);
    } while (++i);
}
(I cannot check this here, as I have a Z80 computer but no C compiler for it.)
Edit
After checking the assembler code: the compiler output just pushes the complete bc register onto the stack. The lower part (the c register) comes from the character you want to print, but the b register was previously loaded with the high byte of the ?59999 pointer to the format string's character array, which happens to be 0xea. So I am puzzled by the output, which should then probably be 0xea00, 0xea01, 0xea02, ... rather than the output you show. Did you recompile the source to get the assembler output, with the printed output coming from a different compilation?
To dig a little more I'd need the code of the printf() function, which I assume you don't have. But it seems that converting the parameter to (int) before passing it to printf() should solve the problem.

Transforming a dataset in a more compact format (Stata)

Initially, I was dealing with a dataset that looked something like this:
+------+--------+-----------+-------+
| date | geo | variables | value |
+------+--------+-----------+-------+
| 1981 | Canada | var1 | # |
| 1982 | Canada | var1 | # |
| 1983 | Canada | var1 | # |
| ... | ... | ... | ... |
| 2015 | Canada | var1 | # |
| 1981 | Canada | var2 | # |
| 1982 | Canada | var2 | # |
| ... | ... | ... | ... |
| 2015 | Canada | var2 | # |
| 1981 | Quebec | var1 | # |
| 1982 | Quebec | var1 | # |
| 1983 | Quebec | var1 | # |
| ... | ... | ... | ... |
| 2015 | Quebec | var1 | # |
| 1981 | Quebec | var2 | # |
| 1982 | Quebec | var2 | # |
| ... | ... | ... | ... |
| 2015 | Quebec | var2 | # |
+------+--------+-----------+-------+
So I have 35 time periods, two geographic units and two variables. I would like to transform the table in Stata so that it looks like this:
+------+--------+------+------+
| date | geo | var1 | var2 |
+------+--------+------+------+
| 1981 | Canada | # | # |
| 1982 | Canada | # | # |
| ... | ... | ... | ... |
| 2015 | Canada | # | # |
| 1981 | Quebec | # | # |
| 1982 | Quebec | # | # |
| ... | ... | ... | ... |
| 2015 | Quebec | # | # |
+------+--------+------+------+
However, I'm not having much success with this. I tried to separate the different observations into variables with the command:
separate value, by(variables) generate(var)
Which creates something like this:
+------+--------+------+------+
| date | geo | var1 | var2 |
+------+--------+------+------+
| 1981 | Canada | # | . |
| 1982 | Canada | # | . |
| ... | ... | ... | ... |
| 2015 | Canada | # | . |
| 1981 | Canada | . | # |
| 1982 | Canada | . | # |
| ... | ... | ... | ... |
| 2015 | Canada | . | # |
| 1981 | Quebec | # | . |
| 1982 | Quebec | # | . |
| ... | ... | ... | ... |
| 2015 | Quebec | # | . |
| 1981 | Quebec | . | # |
| 1982 | Quebec | . | # |
| ... | ... | ... | ... |
| 2015 | Quebec | . | # |
+------+--------+------+------+
Which contains a lot of useless missing values.
So, more specifically, I would like something that brings me directly from Table A to Table B (i.e. without using separate), or a solution to collapse Table C into Table B.
Thanks a lot.
Without sample data, my answer will have to be untested. I think something like the following will get you started in the right direction.
reshape wide value, i(date geo) j(variables) string
Note that this assumes the contents of your variables variable are suitable for use as variable names. For example, a value of 1potato for variables would be a problem.
In any event,
help reshape
should be your first stop.
Added in response to comment: Here is some data I made up and a demonstration that reshape works for this data. Perhaps you can explain how this data differs from the real data. Your error message suggests that for some combination of date and geo, a particular value of variables occurs more than once.
. list, sepby(geo)
+----------------------------------+
| date geo variab~s value |
|----------------------------------|
1. | 1981 Canada var1 111 |
2. | 1982 Canada var1 211 |
3. | 1983 Canada var1 311 |
4. | 1981 Canada var2 112 |
5. | 1982 Canada var2 212 |
6. | 1983 Canada var2 312 |
|----------------------------------|
7. | 1981 Quebec var1 121 |
8. | 1982 Quebec var1 221 |
9. | 1983 Quebec var1 321 |
10. | 1981 Quebec var2 122 |
11. | 1982 Quebec var2 222 |
12. | 1983 Quebec var2 322 |
+----------------------------------+
. reshape wide value, i(geo date) j(variables) string
(note: j = var1 var2)
Data long -> wide
-----------------------------------------------------------------------------
Number of obs. 12 -> 6
Number of variables 4 -> 4
j variable (2 values) variables -> (dropped)
xij variables:
value -> valuevar1 valuevar2
-----------------------------------------------------------------------------
. rename (value*) (*)
. list, sepby(geo)
+-----------------------------+
| date geo var1 var2 |
|-----------------------------|
1. | 1981 Canada 111 112 |
2. | 1982 Canada 211 212 |
3. | 1983 Canada 311 312 |
|-----------------------------|
4. | 1981 Quebec 121 122 |
5. | 1982 Quebec 221 222 |
6. | 1983 Quebec 321 322 |
+-----------------------------+
.
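For what it's worth, here is the equivalent long-to-wide pivot in Python/pandas, as a sketch mirroring the made-up Stata data above (not part of the original answer's Stata workflow):

import pandas as pd

# Same shape as the made-up data in the Stata demonstration.
long = pd.DataFrame({
    "date": [1981, 1982, 1983] * 4,
    "geo": ["Canada"] * 6 + ["Quebec"] * 6,
    "variables": (["var1"] * 3 + ["var2"] * 3) * 2,
    "value": [111, 211, 311, 112, 212, 312,
              121, 221, 321, 122, 222, 322],
})

# One column per value of `variables`, keyed by (date, geo).
wide = (
    long.pivot(index=["date", "geo"], columns="variables", values="value")
        .reset_index()
        .rename_axis(columns=None)
)
print(wide)
#    date     geo  var1  var2
# 0  1981  Canada   111   112
# ...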

Multiple outcomes/scenarios

I have a problem that I've already built a solution for, but I'm wondering if there's a better way to solve it. Basically, I have to create a flag for certain scenarios within a partition of ID and date. My solution involved mapping out all the possible scenarios, then writing "case when" statements for each scenario with its specific outcome; in other words, I enumerated the outcomes myself. I'm wondering if there's another way, something that lets SQL derive the outcomes instead of me doing it by hand (a sketch of one such approach follows the data below).
Thanks a lot!
Background:
+----+-----------+--------+-------+------+-----------------+-----------------------------------------------------------------------------------+
| ID | Month | Status | Value | Flag | Scenario Number | Scenario Description |
+----+-----------+--------+-------+------+-----------------+-----------------------------------------------------------------------------------+
| 1 | 1/01/2016 | First | 123 | No | 1 | First, Second and blank all exist. Do not flag |
| 1 | 1/01/2016 | Second | 456 | No | 1 | First, Second and blank all exist. Do not flag |
| 1 | 1/01/2016 | | 789 | No | 1 | First, Second and blank all exist. Do not flag |
| 1 | 1/02/2016 | Second | 123 | Yes | 2 | First does not exist, two Seconds with different values. Flag these as Yes |
| 1 | 1/02/2016 | Second | 456 | Yes | 2 | First does not exist, two Seconds with different values. Flag these as Yes |
| 1 | 1/02/2016 | Second | 123 | No | 3 | First does not exist, two Seconds with the same value. Do not flag |
| 1 | 1/02/2016 | Second | 123 | No | 3 | First does not exist, two Seconds with the same value. Do not flag |
| 1 | 1/03/2016 | Second | 123 | No | 4 | Only one entry of Second exists. Do not flag |
| 1 | 1/04/2016 | | 123 | Yes | 5 | Two blanks for the partition. Flag these as Yes |
| 1 | 1/04/2016 | | 123 | Yes | 5 | Two blanks for the partition. Flag these as Yes |
| 1 | 1/05/2016 | | | No | 6 | Only one entry of blank exists. Do not flag these |
| 1 | 1/06/2016 | First | 123 | No | 7 | First exists for the partition. Do not flag |
| 1 | 1/06/2016 | | 456 | No | 7 | First exists for the partition. Do not flag |
| 1 | 1/07/2016 | Second | 123 | Yes | 8 | First does not exist and second and blank do not have the same value. Flag these. |
| 1 | 1/07/2016 | | 456 | Yes | 8 | First does not exist and second and blank do not have the same value. Flag these. |
| 1 | 1/07/2016 | Second | 123 | Yes | 8 | First does not exist and second and blank have the same value. Flag these. |
| 1 | 1/07/2016 | | 123 | Yes | 8 | First does not exist and second and blank have the same value. Flag these. |
+----+-----------+--------+-------+------+-----------------+-----------------------------------------------------------------------------------+
Data:
+----+-----------+-------+----------+---------------+
| ID | Month | Value | Priority | Expected_Flag |
+----+-----------+-------+----------+---------------+
| 1 | 1/01/2016 | 96.01 | | Yes |
| 1 | 1/01/2016 | 96.01 | | Yes |
| 1 | 1/02/2016 | 65.2 | First | No |
| 1 | 1/02/2016 | 3.47 | Second | No |
| 1 | 1/02/2016 | 45.99 | | No |
| 11 | 1/01/2016 | 25 | | No |
| 11 | 1/02/2016 | 74.25 | Second | No |
| 11 | 1/02/2016 | 74.25 | Second | No |
| 11 | 1/02/2016 | 23.25 | | No |
| 24 | 1/01/2016 | 1.25 | First | No |
| 24 | 1/01/2016 | 1.365 | | No |
| 24 | 1/04/2016 | 1.365 | First | No |
| 24 | 1/04/2016 | 1.365 | | No |
| 24 | 1/05/2016 | 1.365 | First | No |
| 24 | 1/05/2016 | 1.365 | First | No |
| 24 | 1/06/2016 | 1.365 | Second | No |
| 24 | 1/06/2016 | 1.365 | Second | No |
| 24 | 1/07/2016 | 1.365 | Second | Yes |
| 24 | 1/07/2016 | 1.365 | | Yes |
| 24 | 1/08/2016 | 1.365 | First | No |
| 24 | 1/08/2016 | 1.365 | | No |
| 24 | 1/09/2016 | 1.365 | Second | No |
| 24 | 1/09/2016 | 1.365 | | No |
| 27 | 1/01/2016 | 0 | Second | Yes |
| 27 | 1/01/2016 | 0 | Second | Yes |
| 27 | 1/02/2016 | 45.25 | Second | No |
| 3 | 1/01/2016 | 96.01 | First | No |
| 3 | 1/01/2016 | 96.01 | First | No |
| 3 | 1/03/2016 | 96.01 | First | No |
| 3 | 1/03/2016 | 96.01 | First | No |
| 35 | 1/01/2016 | | | Yes |
| 35 | 1/01/2016 | | | Yes |
| 35 | 1/02/2016 | | First | No |
| 35 | 1/02/2016 | | Second | No |
| 35 | 1/02/2016 | | | No |
| 35 | 1/02/2016 | | | No |
| 35 | 1/03/2016 | | Second | Yes |
| 35 | 1/03/2016 | | Second | Yes |
| 35 | 1/04/2016 | | Second | No |
| 35 | 1/04/2016 | | Second | No |
+----+-----------+-------+----------+---------------+
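For illustration, here is a minimal PySpark sketch of deriving the flag from window aggregates instead of hand-enumerating every scenario. All names are hypothetical, and the rules are inferred from the scenario table above (a partition containing a First, or only a single row, is not flagged; duplicate Seconds with identical values are not flagged; other multi-row partitions are flagged), so treat it as a starting point rather than a drop-in answer.

from pyspark.sql import SparkSession, functions as F, Window as W

spark = SparkSession.builder.getOrCreate()

# Hypothetical input in the shape of the Data table above.
df = spark.createDataFrame(
    [
        (1, "1/01/2016", 96.01, None),
        (1, "1/01/2016", 96.01, None),
        (1, "1/02/2016", 65.2, "First"),
        (1, "1/02/2016", 3.47, "Second"),
        (1, "1/02/2016", 45.99, None),
    ],
    ["ID", "Month", "Value", "Priority"],
)

w = W.partitionBy("ID", "Month")

flagged = (
    df
    .withColumn("n_rows", F.count(F.lit(1)).over(w))
    .withColumn("n_first", F.count(F.when(F.col("Priority") == "First", 1)).over(w))
    .withColumn("n_blank", F.count(F.when(F.col("Priority").isNull(), 1)).over(w))
    .withColumn("n_values", F.size(F.collect_set("Value").over(w)))
    .withColumn(
        "Flag",
        F.when(F.col("n_first") > 0, "No")                  # a First row exists
         .when(F.col("n_rows") == 1, "No")                  # single-row partition
         .when(F.col("n_values") > 1, "Yes")                # differing values
         .when(F.col("n_blank") == F.col("n_rows"), "Yes")  # duplicate blanks
         .otherwise("No")                                   # duplicate Seconds, same value
    )
    .drop("n_rows", "n_first", "n_blank", "n_values")
)
flagged.show()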

Working Through Normalization (3NF)

I've been planning this website for a while now and began putting it together a few weeks ago. I'd say things were going extremely well until, the other day, I was introduced to normalization. It's taken me quite some time to get the hang of what it is and why it's needed, but I think I'm close to achieving my overall goal now.
So the scenario is that I have a website where members can join and set up their own newspaper/blog and assign authors to it. The site primarily consists of a members_table, authors_table, newspapers_table and posts_table, which I've gone through and normalized to get:
Members
+----+-----------+----------+-----------------+----------+----------+------------+------------+
| ID | FIRSTNAME | SURNAME | EMAIL | USERNAME | PASSWORD | AVATAR | JOINED |
+----+-----------+----------+-----------------+----------+----------+------------+------------+
| 01 | Brian | Griffin | brian@gmail.com | briang | *** | hdhs.jpg | 2014-07-31 |
| 02 | Meg | Griffin | meg@gmail.com | megg | *** | | 2014-07-31 |
| 03 | Peter | Griffin | peter@gmail.com | peterg | *** | jaunsq.jpg | 2014-07-31 |
| 04 | Glen | Quagmire | glen@gmail.com | glen | *** | | 2014-07-31 |
+----+-----------+----------+-----------------+----------+----------+------------+------------+
Authors
+----+-----------+----------+-----------------+-------------------+---------+----------+---------+
| ID | FIRSTNAME | SURNAME | EMAIL | BIO ( REQUIRED ) | TWITTER | FACEBOOK | WEBSITE |
+----+-----------+----------+-----------------+-------------------+---------+----------+---------+
| 01 | Brian | Griffin | brian@gmail.com | About me... | URL | | URL |
| 02 | Meg | Griffin | meg@gmail.com | About me... | URL | | |
| 03 | Peter | Griffin | peter@gmail.com | About me... | | URL | URL |
| 04 | Glen | Quagmire | glen@gmail.com | About me... | URL | URL | |
+----+-----------+----------+-----------------+-------------------+---------+----------+---------+
** Should Socials Be Broken Down Here? If So, How?
Newspaper_Categories
+----+-------------------+
| ID | CATEGORY |
+----+-------------------+
| 01 | Lifestyle |
| 02 | Auto Mobiles |
| 03 | Entertainment |
| 04 | Food & Drink |
| 05 | Internet |
+----+-------------------+
Newspapers
+----+-----------------------+----------+---------+------------------+------------------+
| ID | NAME | CATEGORY | AVATAR | BIO ( REQUIRED ) | OWNER ( MEMBER ) |
+----+-----------------------+----------+---------+------------------+------------------+
| 01 | Spooner Street Weekly | 01 | 311.jpg | About Us... | 01 |
| 02 | A Dogs Life | 01 | | About Us... | 01 |
| 03 | In The Kitchen | 04 | js.jpg | About Us... | 02 |
+----+-----------------------+----------+---------+------------------+------------------+
Should Owner go here, or in a table named Newspaper_Owners?
Socials
+----+-------------------+
| ID | TYPE |
+----+-------------------+
| 01 | Facebook |
| 02 | Twitter |
| 03 | Google |
| 04 | Flickr |
| 05 | Youtube |
+----+-------------------+
Newspaper_Socials
+----------+--------+------+
| NEWSPAPER | SOCIAL | LINK |
+----------+--------+------+
| 01 | 01 | URL |
| 01 | 02 | URL |
| 01 | 05 | URL |
| 01 | 01 | URL |
| 02 | 02 | URL |
| 02 | 04 | URL |
| 03 | 01 | URL |
+----------+--------+------+
Post_Categories
+-------------------+--------------------+
| ID | NEWSPAPER_ID | CATEGORY |
+----+--------------+--------------------+
| 01 | 01 | Glens Girls |
| 02 | 01 | In The Clam |
| 03 | 01 | Peters Shenanigans |
| 04 | 02 | Martini Recipes |
| 05 | 03 | Housewife Tips |
+----+--------------+--------------------+
Posts
+----+----------+-------+---------+----------+------------+-------+
| ID | CATEGORY | TITLE | ARTICLE | FEATURED | ADDED | VIEWS |
+----+----------+-------+---------+----------+------------+-------+
| 01 | 01 | Title | Article | 0 | 2014-07-31 | 200 |
| 02 | 01 | Title | Article | 0 | 2014-07-31 | 220 |
| 03 | 03 | Title | Article | 1 | 2014-07-31 | 232 |
| 04 | 05 | Title | Article | 0 | 2014-07-31 | 143 |
| 05 | 05 | Title | Article | 1 | 2014-07-31 | 311 |
+----+----------+-------+---------+----------+------------+-------+
Post_Photos
+---------+-----+--------------+------+
| POST_ID | ALT | PHOTOGRAPHER | LINK |
+---------+-----+--------------+------+
| 01 | Alt | John Smith | URL |
| 02 | Alt | | |
| 03 | Alt | Mike Jones | |
| 05 | Alt | Adam West | URL |
+---------+-----+--------------+------+
Post_Keywords
+---------+---------+
| POST_ID | KEYWORD |
+---------+---------+
| 01 | Keyword |
| 01 | Keyword |
| 01 | Keyword |
| 01 | Keyword |
| 02 | Keyword |
| 02 | Keyword |
| 03 | Keyword |
| 03 | Keyword |
| 03 | Keyword |
+---------+---------+
If each post is only allowed 3 keywords, can these be added to the end of the Posts table?
Author_Posts
+-----------+---------+
| AUTHOR_ID | POST_ID |
+-----------+---------+
| 01 | 01 |
| 01 | 02 |
| 01 | 03 |
| 02 | 04 |
| 03 | 05 |
+-----------+---------+
Could anyone please let me know if I'm on the right lines with this structure, and give me some pointers on where primary keys and foreign keys should go?
"** Should Socials Be Broken Down Here? If So, How?"
Yes, you already have the socials table.
Add a new many to many table of Users to Socials
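For instance, a hypothetical sketch mirroring the existing Newspaper_Socials table (illustrative rows only), the new junction table might look like:
Author_Socials
+-----------+--------+------+
| AUTHOR_ID | SOCIAL | LINK |
+-----------+--------+------+
| 01 | 02 | URL |
| 02 | 02 | URL |
| 03 | 01 | URL |
+-----------+--------+------+
(AUTHOR_ID, SOCIAL) would be the composite primary key, with foreign keys to Authors and Socials; the TWITTER/FACEBOOK/WEBSITE columns could then be dropped from Authors.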
