PDAL / PDAL

PDAL is the Point Data Abstraction Library: GDAL for point cloud data.

Home Page: https://pdal.io

TileDB Config File param "sm.dedup_coords true" not used for array creation

weidinger-c opened this issue

According to the tutorial https://docs.tiledb.com/main/integrations-and-extensions/geospatial/pdal, TileDB allows duplicates by default.
To be able to update point cloud coordinates inside the database, duplicates should be set to false. The guide states it as follows:

Ingesting LAS Files
First, create a TileDB config file tiledb.config where you can set any TileDB configuration parameter (e.g., AWS keys if you would like to write to a TileDB array on S3). Make sure you also add the following, as currently TileDB does not handle duplicate points (this will change in a future version).

I tried that step by creating a file named tiledb.config with the content:
sm.dedup_coords true
and used that file in the TileDB writer stage with the option config_file="tiledb.config".
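
For reference, a minimal sketch of how I invoke the writer from Python (input path and array name are placeholders):

import json

import pdal

# Read a LAS file and write it to a TileDB array, pointing the writer at
# the tiledb.config file via its config_file option.
pipeline = pdal.Pipeline(json.dumps([
    {"type": "readers.las", "filename": "input.las"},
    {
        "type": "writers.tiledb",
        "array_name": "my_array",
        "config_file": "tiledb.config",
    },
]))
pipeline.execute()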

But still, after running the PDAL TileDB writer, the schema of the created array reports "allows_duplicates = true".
I guess it comes from this part of the code:

schema.set_allows_dups(true);

where the schema is always set to allow duplicates, no matter what config file was provided.

@weidinger-c that is an issue; the main docs on our website lag behind the code, and we are addressing that. What is your use case? Do you need to set dups to false? I will update the code.

I would suggest keeping duplicates and coupling that with deletion (https://docs.tiledb.com/main/how-to/arrays/writing-arrays/deleting). Then you can add a point in the same place and delete the previous point.
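
A rough sketch of that pattern in Python, assuming TileDB-Py's delete queries (array name, coordinates and attribute are placeholders):

import tiledb

array_name = "my_array"  # placeholder

# Delete whatever currently sits at the coordinate we want to replace
# (delete mode "d" plus a query condition on the dimensions).
with tiledb.open(array_name, "d") as A:
    A.query(cond="X == 2.1 and Y == 2.1 and Z == 2.1").submit()

# Write the replacement point at the same coordinates.
with tiledb.open(array_name, "w") as A:
    A[2.1, 2.1, 2.1] = {"classification": [1]}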

@normanb yes, I need dups = false.
My use case is that I load a lot of LAS files, run some algorithms on parts of the data, and in the end want to update some attributes, e.g. Classification, for the points. Currently the points are added a second time instead of having their attribute values replaced, which IMO is due to dups being true at the moment. Thanks for the help!

@weidinger-c to achieve this workflow you will need to replace the point in two stages: delete the point and then write it again. dups is misleading in this case; you can't update an existing point just by setting dups = false.

@normanb This is quite confusing to me. TileDB is a database where I cannot change stored data unless I delete the whole entry first and then insert it again?

@weidinger-c I am going to create a couple of Python examples, which I will share here so we can explore what you need.

@weidinger-c I have been able to recreate your use case in Python

import numpy as np

import tiledb

# Name of the array to create.
array_name = "writing_sparse_multiple"


def create_array(allow_dups=False):
    # The array has three dimensions X/Y/Z, each with domain [1,4] and tile extent 4.
    dom = tiledb.Domain(
        tiledb.Dim(name="X", domain=(1, 4), tile=4, dtype=np.float64),
        tiledb.Dim(name="Y", domain=(1, 4), tile=4, dtype=np.float64),
        tiledb.Dim(name="Z", domain=(1, 4), tile=4, dtype=np.float64),
    )

    # The array will be sparse with a single attribute "classification", so each (X,Y,Z) cell can store an integer.
    schema = tiledb.ArraySchema(
        allows_duplicates=allow_dups,
        domain=dom,
        sparse=True,
        attrs=[tiledb.Attr(name="classification", dtype=np.int32)]
    )

    # Overwrite/Create the (empty) array on disk.
    tiledb.SparseArray.create(array_name, schema, overwrite=True)

def allow_dups_write():
    create_array(allow_dups=True)
    # write classification point
    with tiledb.open(array_name, "w", timestamp=1) as A:
        A[2.1, 2.1, 2.1] = {"classification": [0]}

    # update
    with tiledb.open(array_name, "w", timestamp=2) as A:
        A[2.1, 2.1, 2.1] = {"classification": [1]}

    with tiledb.open(array_name) as A:
        print("All data")
        print(A.df[:])

    with tiledb.open(array_name, timestamp=1) as A:
        print("At timestamp 1")
        print(A.df[:])

def no_dups_write():
    create_array(allow_dups=False)
    # write classification point
    with tiledb.open(array_name, "w") as A:
        A[2.1, 2.1, 2.1] = {"classification": [0]}

    # update
    with tiledb.open(array_name, "w") as A:
        A[2.1, 2.1, 2.1] = {"classification": [1]}

    with tiledb.open(array_name) as A:
        print("All data")
        print(A.df[:])

if __name__ == "__main__":
    print("*** Write Dups ***")
    allow_dups_write()
    print("*** No Dups ***")
    no_dups_write()

Output

*** Write Dups ***
All data
     X    Y    Z  classification
0  2.1  2.1  2.1               0
1  2.1  2.1  2.1               1
At timestamp 1
     X    Y    Z  classification
0  2.1  2.1  2.1               0
*** No Dups ***
All data
     X    Y    Z  classification
0  2.1  2.1  2.1               1

I will fix the driver so that you can set the dups setting and do the update. The default will stay as it is now, but I will add an additional PDAL creation option for the schema rather than reading it from a config file.
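
Roughly what I have in mind for the writer, once the option is in (a sketch; the option name and paths are illustrative until the change ships):

import json

import pdal

# Proposed: control duplicates at schema creation time directly from the
# writer, instead of going through a TileDB config file.
pipeline = pdal.Pipeline(json.dumps([
    {"type": "readers.las", "filename": "input.las"},
    {
        "type": "writers.tiledb",
        "array_name": "my_array",
        "allow_dups": False,
    },
]))
pipeline.execute()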

@normanb Thanks for the code example. My code looks quite similar, so I will have to check for differences to find out why I get duplicates.

@normanb I have now used the new PDAL version 2.7.1 with your fix. I have set allow_dups to false, but I am now getting the following error:
tiledb.cc.TileDBError: [TileDB::Writer] Error: Duplicate coordinates (4.51457e+06, 5.42895e+06, 332.71) are not allowed

I read the current entries in the DB with a query, update some attribute values, and write the data back into the DB as in your example.
Maybe I have the wrong understanding of what "allows_duplicates" means. My assumption was that when duplicates are not allowed and I write a value to the database, it checks whether an entry already exists for these dimension values and overwrites it instead of creating a duplicate entry.

@weidinger-c you will get this error if you have duplicates in the fragment that you write. Do you have a small example dataset you can share? I will look into this.
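
For example, against the array from my earlier snippet (created with create_array(allow_dups=False)), something like this illustrates the difference; a sketch, not a full test:

import numpy as np

import tiledb

array_name = "writing_sparse_multiple"

# The same coordinate written in two separate write calls goes into two
# separate fragments: no error, and a read returns only the latest value.
with tiledb.open(array_name, "w") as A:
    A[2.1, 2.1, 2.1] = {"classification": [0]}
with tiledb.open(array_name, "w") as A:
    A[2.1, 2.1, 2.1] = {"classification": [1]}

# Duplicate coordinates inside a single write call (a single fragment)
# trigger the "Duplicate coordinates ... are not allowed" error.
try:
    with tiledb.open(array_name, "w") as A:
        A[[2.1, 2.1], [2.1, 2.1], [2.1, 2.1]] = {
            "classification": np.array([2, 3], dtype=np.int32)
        }
except tiledb.TileDBError as err:
    print(err)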

Hi @normanb, here is a short code example. As I cannot share my data, I used this point cloud for the following code: https://samples.geoslam.com/Potree/Powerline_clash_detection/files/Powerline_analysis_laz.zip

I now get this error when inserting the LAS file into TileDB:
tiledb.cc.TileDBError: [TileDB::Writer] Error: Duplicate coordinates (280568, 5.86573e+06, 558.204) are not allowed

I tried fixing it with the context setting "sm.dedup_coords" = "true". It is totally fine that only one entry exists for duplicate coordinates, since points that sit at exactly the same coordinates should be removed anyway.
But I am not sure if I am adding the context correctly.
(Documentation: https://docs.tiledb.com/main/how-to/configuration)
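
Is it enough to hand the context to the write itself, e.g. like this (a minimal sketch, separate from the full script below)?

import tiledb

config = tiledb.Config({"sm.dedup_coords": "true"})
ctx = tiledb.Ctx(config)

# Pass the context to the write query as well, not only to array creation,
# so the dedup setting applies when the coordinates are written.
with tiledb.open("writing_sparse_multiple", "w", ctx=ctx) as A:
    A[2.1, 2.1, 2.1] = {"Intensity": [0], "Label": [0], "PointId": [0]}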

import numpy as np
import laspy
import tiledb

if __name__ == "__main__":

    # Name of the array to create.
    tiledb_array_name = "writing_sparse_multiple"

    # LAS file path
    pointcloud_filepath = (
        "/YOUR_PATH/Powerline_analysis_laz/Powerline_analysis_cl.laz"
    )
    pointcloud_with_ids_filepath = "/YOUR_PATH/Powerline_analysis_laz/Powerline_analysis_cl_unique_ids.las"

    # Add unique point id attribute and label attribute to the las file
    las_file = laspy.read(pointcloud_filepath)
    las_file.add_extra_dim(laspy.ExtraBytesParams(name="PointId", type=np.uint64))
    las_file.add_extra_dim(laspy.ExtraBytesParams(name="Label", type=np.uint16))
    pointid = np.arange(0, las_file.header.point_records_count)
    label = np.zeros(las_file.header.point_records_count, dtype=np.uint16)
    las_file.PointId = pointid
    las_file.Label = label
    las_file.write(pointcloud_with_ids_filepath)

    # Create dimensions
    dom = tiledb.Domain(
        tiledb.Dim(
            name="X",
            domain=(np.finfo(np.float64).min, np.finfo(np.float64).max),
            dtype=np.float64,
        ),
        tiledb.Dim(
            name="Y",
            domain=(np.finfo(np.float64).min, np.finfo(np.float64).max),
            dtype=np.float64,
        ),
        tiledb.Dim(
            name="Z",
            domain=(np.finfo(np.float64).min, np.finfo(np.float64).max),
            dtype=np.float64,
        ),
    )
    # Create schema
    schema = tiledb.ArraySchema(
        domain=dom,
        attrs=[
            tiledb.Attr(name="Intensity", dtype=np.uint16),
            tiledb.Attr(name="Label", dtype=np.uint16),
            tiledb.Attr(name="PointId", dtype=np.uint64),
        ],
        cell_order="hilbert",
        tile_order=None,
        capacity=100000,
        sparse=True,
        allows_duplicates=False,
    )

    # Create a configuration object
    config = tiledb.Config()
    config["sm.dedup_coords"] = "true"

    # Create a TileDB context
    ctx = tiledb.Ctx(config)

    print(schema)
    # Create the TileDB array
    tiledb.SparseArray.create(tiledb_array_name, schema, overwrite=True, ctx=ctx)

    # Ingest data
    print("Insert data into tiledb")
    with tiledb.open(tiledb_array_name, "w") as A:
        # Read las file
        las_file = laspy.read(pointcloud_with_ids_filepath)

        x = las_file["x"]
        y = las_file["y"]
        z = las_file["z"]
        intensity = las_file["intensity"]
        label = las_file["Label"]
        pointid = las_file["PointId"]

        # Write to TileDB
        A[x, y, z] = {
            "Intensity": intensity,
            "Label": label,
            "PointId": pointid,
        }

    # Read data
    with tiledb.open(tiledb_array_name) as A:
        data = A.df[:]
        print(data)
        data = A.query(cond="PointId > 0 and PointId < 1000000").df[:]
        print(data)
        x = data["X"]
        y = data["Y"]
        z = data["Z"]
        label = data["Label"]
        intensity = data["Intensity"]
        pointid = data["PointId"]
        # Add 1 to the Label attribute
        label = label + 1

    # Write updated data
    with tiledb.open(tiledb_array_name, "w") as A:
        A[x, y, z] = {
            "Label": label,
            "Intensity": intensity,
            "PointId": pointid,
        }

    # Check if the data is updated correctly
    with tiledb.open(tiledb_array_name, "r") as A:
        data = A.query(cond="PointId > 0 and PointId < 1000000").df[:]
        data = A.df[:]
        print(data)