neo4j-contrib / neo4j-apoc-procedures

Awesome Procedures On Cypher for Neo4j - codenamed "apoc"

Home Page: https://neo4j.com/labs/apoc

de-duplicate entities in apoc.export.json.data/query

jexp opened this issue

I'm not sure whether we're de-duplicating entities in apoc.export.json.data/query.

For example, if you have a query like

MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
RETURN p,r,m

where people and movies can appear multiple times.

or

MATCH (p1:Person)-[r:KNOWS]-(p2:Person)
RETURN p1,r,p2

where even relationships can be duplicated.

Are we keeping track of already-exported entities in a set of ids or similar? Please check.

@jexp

Yes, entities are duplicated during export.
In fact, executing:

CREATE (p:Person {id: 1})-[r:ACTED_IN]->(m:Movie {foo: 1}) with p 
CREATE (p)-[:ACTED_IN]->(:Movie {foo: 2})

and then:

MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
with collect(p) + collect(m) as nodes, collect(r) as rels
call apoc.export.json.data(nodes, rels, "testData.json", {})
yield file return file

the resulting file has a duplicate Person node:

{"type":"node","id":"3","labels":["Person"],"properties":{"id":1}}
{"type":"node","id":"3","labels":["Person"],"properties":{"id":1}}
{"type":"node","id":"4","labels":["Movie"],"properties":{"foo":1}}
{"type":"node","id":"5","labels":["Movie"],"properties":{"foo":2}}
{"type":"relationship","id":"2","label":"ACTED_IN","start":{"id":"3","labels":["Person"],"properties":{"id":1}},"end":{"id":"4","labels":["Movie"],"properties":{"foo":1}}}
{"type":"relationship","id":"3","label":"ACTED_IN","start":{"id":"3","labels":["Person"],"properties":{"id":1}},"end":{"id":"5","labels":["Movie"],"properties":{"foo":2}}}

The issue also occurs with the other export formats, such as CSV and Cypher.

It also happens with the apoc.export.<type>.graph procedures:

MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
with collect(p) + collect(m) as nodes, collect(r) as rels
call apoc.export.json.graph({nodes: nodes, relationships: rels}, "testGraph.json", {})
yield file return file

With a query such as the following, the result is also duplicated, but I think that is correct in this case,
since each Cypher result row corresponds to one entry in the json/csv/... file:

call apoc.export.json.query("MATCH path=(p:Person)-[r:ACTED_IN]->(m:Movie) RETURN path", "testQuery.json", {})
yield file return file

So we indeed should keep track of the IDs during the export.
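
A minimal sketch of what "keeping track of the IDs" could look like, assuming the export sees nodes and relationships as plain lists. ExportDedup and dedupeById are hypothetical names used only for illustration, not existing APOC code. Nodes and relationships would need separate id sets, because their ids can collide (in the output above, node 3 and relationship 3 both exist).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper for illustration only; not part of APOC.
class ExportDedup {

    // Keeps the first occurrence of each id and preserves the input order.
    // Call it once for nodes and once for relationships, since the two
    // kinds of entity have independent id spaces.
    static <T> List<T> dedupeById(List<T> items, Function<T, Long> idOf) {
        Set<Long> seen = new HashSet<>();
        List<T> unique = new ArrayList<>();
        for (T item : items) {
            // Set.add returns false when the id was already recorded.
            if (seen.add(idOf.apply(item))) {
                unique.add(item);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        // Node ids as produced by collect(p) + collect(m) in the example:
        // the Person (id 3) appears once per matched row.
        List<Long> nodeIds = List.of(3L, 3L, 4L, 5L);
        System.out.println(dedupeById(nodeIds, id -> id)); // prints [3, 4, 5]
    }
}
```

In the real procedures the id function would presumably read the id from the Neo4j entity; the Long values here only stand in for that.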

Since the procedures are all in APOC Core, I think you need to create a Trello card, or am I wrong?

Created Trello card, with id VchWnQfd