pietermartin / sqlg

TinkerPop graph over sql

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Provision for unique ids while using bulkAddEdges

SnehaMathankar opened this issue · comments

I have created Edge Class with unique identifier column 'id'. I am trying to create Edges using bulkAddEdges function. I am unable to figure out how to achieve this. It would be of great help if I could get any suggestion on this.

bulkAddEdges is for joining up out and in vertex pairs in one call to the db.
It is not possible to also set a unique identifier using it.

If you explain your use case a bit more, I might be able to help further.

Sir, we have requirement to crawl database. Following are the details of database components:

schema -> 134
table -> 10366
column -> 253325
PK -> 6703
FK -> 32077

We have total 134 Schemas & 10366 Tables in those schemas.
We have created 5 Vertex classes for Schemas, Tables, Columns, PrimaryKey & ForeignKey.
We created all the vertexes.It took 2 min to complete this task.

Now, we have to link Schemas with Tables, Tables with Columns, Tables with PrimaryKey & Tables with ForeignKey.
For this we have created Edges class 'Contains' with unique identifier column for the same.
When we are creating Edges using 'addVertex' to create each link we are able to do so with unique identifier column.
However, it takes almost 3 hour to complete this task.

To optimize this, we tried 'bulkAddEdges'. It improved the time.
But we are unable to figure out how to add unique identifier column in this case as 'bulkAddEdges' have no provision for the same.

I am attaching code snippet. It would be of great help if you could suggest any way to achieve this. Thanks in advance.
UsingAddEdge.txt
UsingBulkAddEdges.txt

I had a look.

In the first one UsingAddEdge.txt I recommend always specifying the label in your queries. With no label specified, all VertexLabels with the particular property name will have to be queries.
i.e.
Vertex outVertex = sqlgGraph.traversal().V().has("id", "Schema"+j).next();
should be
Vertex outVertex = sqlgGraph.traversal().V().hasLabel("Schema").has("id", "Schema"+j).next();

Your usage of bulkAddEdges is correct. Unfortunately there is no way at the moment to specify individual data for each edge using this method. I'd have to investigate if it possible.

Here is some code that might help.

@Test
public void issue490NormalBatchMode() {
    StopWatch stopWatch = StopWatch.createStarted();
    VertexLabel schemaVertexLabel = this.sqlgGraph.getTopology().ensureVertexLabelExist(
            this.sqlgGraph.getSqlDialect().getPublicSchema(), "Schema",
            new LinkedHashMap<>() {{
                put("id", PropertyDefinition.of(PropertyType.STRING, Multiplicity.of(1, 1)));
            }},
            ListOrderedSet.listOrderedSet(Collections.singletonList("id"))
    );
    VertexLabel tableVertex = this.sqlgGraph.getTopology().ensureVertexLabelExist(
            this.sqlgGraph.getSqlDialect().getPublicSchema(), "Table",
            new LinkedHashMap<>() {{
                put("id", PropertyDefinition.of(PropertyType.STRING, Multiplicity.of(1, 1)));
                put("schema", PropertyDefinition.of(PropertyType.STRING, Multiplicity.of(1, 1)));
            }},
            ListOrderedSet.listOrderedSet(Collections.singletonList("id"))
    );
    schemaVertexLabel.ensureEdgeLabelExist("contains", tableVertex,
            new LinkedHashMap<>() {{
                put("id", PropertyDefinition.of(PropertyType.STRING));
            }}
    );
    this.sqlgGraph.tx().commit();
    stopWatch.stop();
    LOGGER.info("time to create topology: {}", stopWatch);
    stopWatch.reset();
    stopWatch.start();

    this.sqlgGraph.tx().normalBatchModeOn();
    for (int i = 0; i < 134; i++) {
        this.sqlgGraph.addVertex(T.label, "Schema", "id", "Schema_" + i);
    }
    for (int i = 0; i < 10366; i++) {
        this.sqlgGraph.addVertex(T.label, "Table", "id", "Table_" + i, "schema", "Schema_" + (i % 134));
    }
    List<Vertex> schemas = this.sqlgGraph.traversal().V().hasLabel("Schema").toList();
    for (Vertex schema : schemas) {
        String schemaId = schema.value("id");
        List<Vertex> tables = this.sqlgGraph.traversal().V()
                .hasLabel("Table")
                .has("schema", schemaId)
                .toList();
        for (Vertex table : tables) {
            String tableId = table.value("id");
            schema.addEdge("contains", table, "id", schemaId + "-" + tableId);
        }
    }
    this.sqlgGraph.tx().commit();
    stopWatch.stop();
    LOGGER.info("time to insert: {}", stopWatch);
    stopWatch.reset();
    stopWatch.start();

    List<Map<String, Vertex>> schemaTables = this.sqlgGraph.traversal().V().hasLabel("Schema").as("schema")
            .out("contains").as("table")
            .<Vertex>select("schema", "table")
            .toList();
    for (Map<String, Vertex> schemaTable : schemaTables) {
        Vertex schema = schemaTable.get("schema");
        Vertex table = schemaTable.get("table");
        LOGGER.info("schema {} -> table {}", schema.value("id"), table.value("id"));
    }
    stopWatch.stop();
    LOGGER.info("time to query : {}", stopWatch);

}

On my laptop the time taken is,
time to create topology: 00:00:00.063
time to insert: 00:00:00.592
bunch of logs
time to query : 00:00:00.124

Let me know if this helps.

Thankyou sir, I really appreciate your quick response. I will try the solution as suggested by you & revert back.

Closing this for now, feel free to reopen it.