I have a movies.csv file with one feature vector per line (e.g. id|Name|0|1|1|0|0|0|1 has 2 features for id and name, and 7 features for genre classification).
I want a node m with label Movie to have a [:HAS_GENRE] relationship to nodes g with label Genre. For that, I need to loop over all the '|'-separated features and only create a relationship if the value is 1.
In essence, I want to have:
x = a               // a is the index of the first genre feature
while (x < lim)     // lim is the last index of the feature vector
{
    if line[x] is 1:
        (m {id: toInt(line[0])})-[:HAS_GENRE]->(g {id: line[x]})
    x = x + 1
}
How do I do that?
Try this:
WITH ["Genre1","Genre2",...] as genres
LOAD CSV FROM "file:movies.pdv" using fieldterminator "|" AS row
MERGE (m:Movie {id:row[0]}) ON CREATE SET m.title = row[1]
FOREACH (idx in filter(range(0,size(genres)-1) WHERE row[2+idx]="1") ) |
MERGE (g:Genre {name:genres[idx]})
CREATE (m)-[:HAS_GENRE]->(g)
)
It loads each row of the file as a collection. The first two elements are used to create the movie. Then the potential indexes range(0, size(genres)-1) are filtered down to those where the corresponding position in the input row holds a "1"; the resulting list of indexes is used to look up the genre name (or id) and connect the movie to the genre.
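If it helps, here is a small stand-alone Python sketch of just that index-filtering idea (not Cypher; the genre names and the sample line are made up for illustration):

genres = ["Action", "Comedy", "Drama", "Horror", "Romance", "Sci-Fi", "Thriller"]

# one pipe-separated line: id | name | 7 genre flags (made-up sample data)
line = "12|Some Movie|0|1|1|0|0|0|1".split("|")
movie_id, title = line[0], line[1]

# keep only the genre positions whose flag is "1" -- same role as the filter(...) above
genre_indexes = [i for i in range(len(genres)) if line[2 + i] == "1"]

for i in genre_indexes:
    print(movie_id, title, "-[:HAS_GENRE]->", genres[i])

For the sample line this prints the Comedy, Drama and Thriller genres, i.e. the positions holding a 1.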
In Lua, I have a set of tables:
Column01 = {}
Column02 = {}
Column03 = {}
ColumnN = {}
I am trying to access these tables dynamically depending on a value. So, later on in the programme, I am creating a variable like so:
local currentColumn = "Column" .. variable
Where variable is a number 01 to N.
I then try to do something to all elements in my array like so:
for i = 1, #currentColumn do
currentColumn[i] = *do something*
end
But this doesn't work as currentColumn is a string and not the name of the table. How can I convert the string into the name of the table?
If I understand correctly, you're saying that you'd like to access a variable based on its name as a string? I think what you're looking for is the global environment table, _G.
Recall that in a table, you can use strings as keys. Think of _G as one giant table where every global table or variable you make is just a key for a value.
Column1 = {"A", "B"}
string1 = "Column".."1" --concatenate column and 1. You might switch out the 1 for a variable. If you use a variable, make sure to use tostring, like so:
var = 1
string2 = "Column"..tostring(var) --becomes "Column1"
print(_G[string2]) --prints the location of the table. You can index it like any other table, like so:
print(_G[string2][1]) --prints the 1st item of the table. (A)
So if you wanted to loop through 5 tables called Column1,Column2 etc, you could use a for loop to create the string then access that string.
C1 = {"A"} --I shorted the names to just C for ease of typing this example.
C2 = {"B"}
C3 = {"C"}
C4 = {"D"}
C5 = {"E"}
for i=1, 5 do
local v = "C"..tostring(i)
print(_G[v][1])
end
Output
A
B
C
D
E
Edit: I'm a doofus and I overcomplicated everything. There's a much simpler solution. If you only want to access the columns within a loop, instead of accessing individual columns at certain points, the easier solution might just be to put all your columns into a bigger table and then iterate over that.
columns = {{"A", "1"},{"B", "R"}} --each anonymous table is a column. If it has a string key attached to it, like column1 = {"A"}, it can't be numerically iterated over.
--You could also insert on the fly.
column3 = {"C"}
table.insert(columns, column3)
for i,v in ipairs(columns) do
print(i, v[1]) --i is the index and v is the table. This will print which column you're on, and get the 1st item in the table.
end
Output:
1 A
2 B
3 C
To future readers: If you want a general solution to getting tables by their name as a string, the first solution with _G is what you want. If you have a situation like the asker, the second solution should be fine.
I have a .csv file where multiple postcodes (characters and numbers) correspond to a unique ID number (also characters and numbers).
e.g
BS2 9TL, E00073143
BS2 9TB, E00073143
BS2 9XJ, E00073143
BS2 8AT, E00073144
BS2 8TY, E00073144
BS2 8UA, E00073144
BS2 8UG, E00073144
I need to create a new array for each unique ID number that stores the respective postcodes. The number of postcodes for each ID number is not the same every time.
The file contains 9010 postcodes and 1258 ID numbers.
Can anyone show me how to go about doing this?
PCs = importdata('PostalCodes.csv');  %// import data
PostalCodes = cell(numel(PCs), 1);    %// create storage
IDs = cell(numel(PCs), 1);
for ii = 1:numel(PCs)
    tmp = strsplit(PCs{ii,1}, ',');   %// split on comma
    PostalCodes{ii,1} = strtrim(tmp{1});
    IDs{ii,1} = strtrim(tmp{2});
end
[IDs, idx] = sort(IDs);               %// sort on ID
PostalCodes = PostalCodes(idx);       %// sort PCs the same way
PostalCodes = cell2mat(PostalCodes);  %// go to char matrix
[IdNums, ~, tmp2] = unique(IDs);      %// get unique IDs
tmp3 = [0; find(diff(tmp2)); numel(IDs)]; %// group boundaries
for ii = 1:numel(tmp3)-1
    PostalCode(ii).IDs = PostalCodes(tmp3(ii)+1:tmp3(ii+1), :); %// store in struct
end
You don't actually want separate arrays, because that's very bad practise, so I've put everything in a structure for you. You can now access the structure by simply typing:
PostalCode(1).IDs(2,:)
ans =
BS2 9TL
where the (1) between PostalCode and IDs corresponds to the ID (which is found in IdNums), and the (2,:) plucks out the second postal code corresponding to ID IdNums(1).
You could use an array of structs
[x,y]=textread('/tmp/file.csv' , '%s %s','delimiter',',')
csv=[x,y]
values=struct('key',{},'value',{})
keys= unique(csv(:,2));
for i = 1:length(keys)
values(i).key=keys{i}
values(i).value=csv( strcmp( csv(:,2) , keys{i}),1)
end
Tested this using Octave. On MATLAB you could use a map container (containers.Map) instead of key/value structs for direct access via IDs.
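As a rough sketch of that map-container idea, here is the same grouping done with a plain dictionary, written in Python rather than MATLAB (illustrative only; the file name is assumed to be the same PostalCodes.csv used above):

import csv
from collections import defaultdict

groups = defaultdict(list)                     # ID -> list of postcodes
with open('PostalCodes.csv', newline='') as f:
    for postcode, uid in csv.reader(f):
        groups[uid.strip()].append(postcode.strip())

print(groups['E00073143'])                     # direct access by ID

Each ID maps straight to its list of postcodes, which is the kind of direct access by ID that a map container gives you.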
My action passes a list of values from column x in table y to the view. How do I write the following SQL: SELECT x FROM y using the DAL "language", when x and y are variables given by the view? Here it is using executesql():
def myAction():
    x = request.args(0, cast=str)
    y = request.args(1, cast=str)
    myrows = db.executesql('SELECT ' + x + ' FROM ' + y)
    # Let's convert it to a list:
    mylist = []
    for row in myrows:
        value = row  # this line doesn't work
        mylist.append(value)
    return dict(mylist=mylist)
Also, is there a more convenient way to convert that data to a list?
First, note that you must create table definitions for any tables you want to access (i.e., db.define_table('mytable', ...)). Assuming you have done that and that y is the name of a single table and x is the name of a single field in that table, you would do:
myrows = db().select(db[y][x])
mylist = [r[x] for r in myrows]
Note, if any records are returned, .select() always produces a Rows object, which comprises a set of Row objects (even if only a single field was selected). So, to extract the individual values into a list, you have to iterate over the Rows object and extract the relevant field from each Row object. The above code does so via a list comprehension.
Also, you might want to add some code to check whether db[y] and db[y][x] exist.
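Putting those pieces together, a minimal sketch of the whole action with such a check added might look like this (the 400 response and its message are my own choices, not part of the answer above):

def myAction():
    x = request.args(0, cast=str)
    y = request.args(1, cast=str)
    # guard against unknown table or field names before querying
    if y not in db.tables or x not in db[y].fields:
        raise HTTP(400, 'Unknown table or field')
    myrows = db().select(db[y][x])
    mylist = [r[x] for r in myrows]
    return dict(mylist=mylist)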
I am using PostgreSQL and let's say I have a tasks table to keep track of task items. The tasks table is as follows:
Id Name Index
7 name A 1
5 name B 2
6 name C 3
3 name D 4
The Index column in the tasks table stores the sort order of the tasks. Therefore I will output the tasks with respect to Index in increasing order.
So when I change Task D (id = 3)'s index to 2, the new indexes should be as below:
Id Name Index
7 name A 1
5 name B 3
6 name C 4
3 name D 2
or when I change Task A (id = 7)'s index to 4, the new indexes should be as below:
Id Name Index
7 name A 4
5 name B 2
6 name C 3
3 name D 1
I think updating all rows' index values one by one is pretty inefficient.
So what is the most efficient way to update all index values when I change one of the indexes in my tasks table?
Edit:
First of all, sorry for the confusion. What I am asking is not simply exchanging two rows' indexes. If you look at the examples, when I change Task D's index to 2, more than one row changes: Task D becomes 2, Task B becomes 3 and Task C becomes 4.
For instance, it is like when you drag Task D and drop it below Task A so that its index becomes 2, and B's and C's indexes increase by 1.
What you are doing is exchanging two rows' indexes. So it is necessary to store the index value of the first updated row in a temp variable and to set it temporarily to a special value to avoid a unique index collision, that is, if the index is unique. If the index is not unique, that step is unnecessary.
begin;
create temp table t as
select
(
select index
from tasks
where id = 3
) as index,
(
select id
from tasks
where index = 2
) as id
;
update tasks
set index = -1
where id = (select id from t)
;
update tasks
set index = 2
where id = 3
;
update tasks
set index = (select index from t)
where id = (select id from t)
;
drop table t;
commit;
The following assumes the index column (as well as id) is unique:
with swapped as (
select n1.id as id1,
n1.name as name1,
n1.index as index1,
n2.id as id2,
n2.name as name2,
n2.index as index2
from names n1
join names n2 on n2.index = 2 -- this is the value of the "new index"
where n1.id = 3 -- this is the id of the row where the index should be changed to the new value
)
update names as n
set index = case
when n.id = s.id1 then s.index2
when n.id = s.id2 then s.index1
end
from swapped s
where n.id in (s.id1, s.id2);
The CTE first creates a single row with the ids of the two rows to be swapped and then the update just compares the ids of the target table with those from the CTE, swapping the values.
SQLFiddle example: http://sqlfiddle.com/#!15/71dc2/1
Below is my understanding of how the Hadoop framework processes text files. Please correct me if I am going wrong somewhere.
Each mapper acts on an input split which contains some records.
For each input split a record reader is getting created which starts reading records from the input split.
If there are n records in an input split the map method in the mapper is called n times which in turn reads a key-value pair using the record reader.
Now coming to the database perspective:
I have a database on a single remote node. I want to fetch some data from a table in this database. I would configure the parameters using DBConfiguration and specify the input table using DBInputFormat. Now say my table has 100 records in all, and I execute an SQL query which generates 70 records in the output.
I would like to know :
How are the InputSplits getting created in the above case (database) ?
What does the input split creation depend on, the number of records which my sql query generates or the total number of records in the table (database) ?
How many DBRecordReaders are getting created in the above case (database) ?
How are the InputSplits getting created in the above case (database)?
// Split the rows into n-number of chunks and adjust the last chunk
// accordingly
for (int i = 0; i < chunks; i++) {
DBInputSplit split;
if ((i + 1) == chunks)
split = new DBInputSplit(i * chunkSize, count);
else
split = new DBInputSplit(i * chunkSize, (i * chunkSize)
+ chunkSize);
splits.add(split);
}
There is the how, but to understand what it depends on let's take a look at chunkSize:
statement = connection.createStatement();
results = statement.executeQuery(getCountQuery());
results.next();
long count = results.getLong(1);
int chunks = job.getConfiguration().getInt("mapred.map.tasks", 1);
long chunkSize = (count / chunks);
So chunkSize is count / chunks, where count = SELECT COUNT(*) FROM tableName and chunks = mapred.map.tasks, or 1 if that is not defined in the configuration.
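To make that arithmetic concrete, here is a tiny Python sketch of the same chunking logic as the Java snippet above (the numbers are arbitrary, just for illustration):

def make_splits(count, chunks):
    chunk_size = count // chunks
    splits = []
    for i in range(chunks):
        start = i * chunk_size
        # the last chunk is adjusted to absorb any remainder
        end = count if i + 1 == chunks else start + chunk_size
        splits.append((start, end))
    return splits

print(make_splits(70, 4))   # [(0, 17), (17, 34), (34, 51), (51, 70)]

So with 70 rows and 4 map tasks, the first three splits cover 17 rows each and the last one absorbs the remaining 19.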
Then finally, each input split will have a RecordReader created to handle the type of database you are reading from, for instance MySQLDBRecordReader for a MySQL database.
For more info, check out the source.
It appears @Engineiro explained it well by taking the actual Hadoop source. Just to answer: the number of DBRecordReaders is equal to the number of map tasks.
To explain further, the Hadoop map-side framework creates an instance of DBRecordReader for each map task, in the case where the child JVM is not reused for further map tasks. In other words, the number of input splits is equal to the value of mapred.map.tasks in the case of DBInputFormat. So, each map task's RecordReader has the meta information needed to construct the query to get a subset of the data from the table. Each RecordReader executes a pagination type of SQL similar to the one below.
SELECT * FROM (SELECT a.*,ROWNUM dbif_rno FROM ( select * from emp ) a WHERE rownum <= 6 + 7 ) WHERE dbif_rno >= 6
The above SQL is for the second map task, returning the rows between 6 and 13.
To generalize for any type of input format, the number of RecordReaders is equal to the number of map tasks.
This post talks about all that you want: http://blog.cloudera.com/blog/2009/03/database-access-with-hadoop/