de-duplicate entities in apoc.export.json.data/query
jexp opened this issue
I'm not sure whether we de-duplicate entities in apoc.export.json.data/query,
e.g. if you have a query like
MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
RETURN p,r,m
where people and movies can appear multiple times.
or
MATCH (p1:Person)-[r:KNOWS]-(p2:Person)
RETURN p1,r,p2
where even relationships can be duplicated.
Are we keeping track of the already-exported IDs in a set or similar? Please check.
Yes, entities are duplicated during export.
In fact, executing:
CREATE (p:Person {id: 1})-[r:ACTED_IN]->(m:Movie {foo: 1}) with p
CREATE (p)-[:ACTED_IN]->(:Movie {foo: 2})
and then:
MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
with collect(p) + collect(m) as nodes, collect(r) as rels
call apoc.export.json.data(nodes, rels, "testData.json", {})
yield file return file
the resulting file has a duplicate Person node:
{"type":"node","id":"3","labels":["Person"],"properties":{"id":1}}
{"type":"node","id":"3","labels":["Person"],"properties":{"id":1}}
{"type":"node","id":"4","labels":["Movie"],"properties":{"foo":1}}
{"type":"node","id":"5","labels":["Movie"],"properties":{"foo":2}}
{"type":"relationship","id":"2","label":"ACTED_IN","start":{"id":"3","labels":["Person"],"properties":{"id":1}},"end":{"id":"4","labels":["Movie"],"properties":{"foo":1}}}
{"type":"relationship","id":"3","label":"ACTED_IN","start":{"id":"3","labels":["Person"],"properties":{"id":1}},"end":{"id":"5","labels":["Movie"],"properties":{"foo":2}}}
The issue also occurs with the other export formats, such as CSV and Cypher.
Moreover, it also happens with the apoc.export.<type>.graph procedures:
MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
with collect(p) + collect(m) as nodes, collect(r) as rels
call apoc.export.json.graph({nodes: nodes, relationships: rels}, "testGraph.json", {})
yield file return file
With a query such as the following, the result also contains duplicates, but I think that is correct in this case,
since each Cypher result row corresponds to one entry in the json/csv/... file:
call apoc.export.json.query("MATCH path=(p:Person)-[r:ACTED_IN]->(m:Movie) RETURN path", "testQuery.json", {})
yield file return file
So we should indeed keep track of the IDs during export.
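A minimal sketch of what "keeping track of the IDs" could look like, assuming the exporter sees each entity's numeric ID before writing it. `ExportIdTracker` and its methods are hypothetical names for illustration and are not part of APOC. Nodes and relationships get separate sets because their IDs can overlap, as with node "3" and relationship "3" in the sample output above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; this is not an APOC class.
final class ExportIdTracker {
    // Node and relationship IDs are separate ID spaces, so each gets its own set.
    private final Set<Long> seenNodeIds = new HashSet<>();
    private final Set<Long> seenRelIds = new HashSet<>();

    // Returns true only the first time a given node ID is offered.
    boolean markNode(long id) {
        return seenNodeIds.add(id);
    }

    // Returns true only the first time a given relationship ID is offered.
    boolean markRelationship(long id) {
        return seenRelIds.add(id);
    }

    // Keeps the first occurrence of each node ID, preserving input order.
    List<Long> nodesToWrite(List<Long> ids) {
        List<Long> out = new ArrayList<>();
        for (long id : ids) {
            if (markNode(id)) {
                out.add(id);
            }
        }
        return out;
    }
}
```

With the node IDs from the example above passed in the order 3, 3, 4, 5, `nodesToWrite` returns 3, 4, 5, so the Person node would be written once. The row-per-result behaviour of apoc.export.json.query would not go through such a filter.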
Since the procedures are all in APOC Core, I think you need to create a Trello card, or am I wrong?
Created Trello card with id VchWnQfd.