TileDB Config File param "sm.dedup_coords true" not used for array creation
weidinger-c opened this issue · comments
The tutorial at https://docs.tiledb.com/main/integrations-and-extensions/geospatial/pdal states that, by default, TileDB allows duplicates.
To be able to update point cloud coordinates inside the database, duplicates should be set to false. The guide states it as follows:
Ingesting LAS Files
First, create a TileDB config file tiledb.config where you can set any TileDB configuration parameter (e.g., AWS keys if you would like to write to a TileDB array on S3). Make sure you also add the following, as currently TileDB does not handle duplicate points (this will change in a future version).
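A TileDB config file of this form is just `parameter value` pairs, one per line. A minimal sketch (the S3 keys are placeholders and only needed when writing to S3):

```
sm.dedup_coords true
vfs.s3.aws_access_key_id <YOUR_ACCESS_KEY>
vfs.s3.aws_secret_access_key <YOUR_SECRET_KEY>
```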
I tried that step by creating a file named tiledb.config with the content:
sm.dedup_coords true
and used that file in the tiledb writer module with param config_file="tiledb.config"
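The pipeline looks roughly like this (the file name and array name are placeholders):

```python
import json

# Sketch of a PDAL pipeline: read a LAS file and write it to a TileDB
# array, pointing the tiledb writer at the config file.
pipeline = {
    "pipeline": [
        {"type": "readers.las", "filename": "input.las"},
        {
            "type": "writers.tiledb",
            "array_name": "sample_array",
            "config_file": "tiledb.config",
        },
    ]
}

# Saved as pipeline.json, this can be run with: pdal pipeline pipeline.json
print(json.dumps(pipeline, indent=2))
```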
But still, after using the PDAL TileDB writer, the schema of the created database reports "allows_duplicates = true".
I guess it comes from the code part
PDAL/plugins/tiledb/io/TileDBWriter.cpp
Line 292 in a2c196d
where the schema is always set to allow duplicates, no matter what config file was provided.
@weidinger-c that is an issue; the main docs on our website lag behind the code, and we are addressing that. What is your use case? Do you need to set dups
to false? I will update the code.
I would suggest that you want duplicates,
and that you couple that with deletion (https://docs.tiledb.com/main/how-to/arrays/writing-arrays/deleting). Then you can add a point in the same place and delete the previous point.
@normanb yes, I need dups = false.
My use case is that I load a lot of LAS files, compute some algorithms on parts of the data, and in the end want to update some attributes, e.g. Classification, for the points. Currently, the points are added a second time instead of having their attribute values replaced, which IMO is due to dups
being true at the moment. Thanks for the help!
@weidinger-c to achieve this workflow you will need to replace the point in two stages: delete the point, then write. dups
is misleading in this case; you can't update the existing point just by setting dups = false.
@normanb This is quite confusing to me. TileDB is a database where I cannot change the stored data unless I delete a whole entry first and then insert it again?
@weidinger-c I am going to create a couple of python examples which I will share here so we can explore what you need.
@weidinger-c I have been able to recreate your use case in Python
import numpy as np
import tiledb

# Name of the array to create.
array_name = "writing_sparse_multiple"


def create_array(allow_dups=False):
    # The array will be 4x4 with dimensions X/Y/Z, with domain [1,4].
    dom = tiledb.Domain(
        tiledb.Dim(name="X", domain=(1, 4), tile=4, dtype=np.float64),
        tiledb.Dim(name="Y", domain=(1, 4), tile=4, dtype=np.float64),
        tiledb.Dim(name="Z", domain=(1, 4), tile=4, dtype=np.float64),
    )

    # The array will be sparse with a single attribute "classification"
    # so each (i,j) cell can store an integer.
    schema = tiledb.ArraySchema(
        allows_duplicates=allow_dups,
        domain=dom,
        sparse=True,
        attrs=[tiledb.Attr(name="classification", dtype=np.int32)],
    )

    # Overwrite/Create the (empty) array on disk.
    tiledb.SparseArray.create(array_name, schema, overwrite=True)


def allow_dups_write():
    create_array(allow_dups=True)

    # write classification point
    with tiledb.open(array_name, "w", timestamp=1) as A:
        A[2.1, 2.1, 2.1] = {"classification": [0]}

    # update
    with tiledb.open(array_name, "w", timestamp=2) as A:
        A[2.1, 2.1, 2.1] = {"classification": [1]}

    with tiledb.open(array_name) as A:
        print("All data")
        print(A.df[:])

    with tiledb.open(array_name, timestamp=1) as A:
        print("At timestamp 1")
        print(A.df[:])


def no_dups_write():
    create_array(allow_dups=False)

    # write classification point
    with tiledb.open(array_name, "w") as A:
        A[2.1, 2.1, 2.1] = {"classification": [0]}

    # update
    with tiledb.open(array_name, "w") as A:
        A[2.1, 2.1, 2.1] = {"classification": [1]}

    with tiledb.open(array_name) as A:
        print("All data")
        print(A.df[:])


if __name__ == "__main__":
    print("*** Write Dups ***")
    allow_dups_write()
    print("*** No Dups ***")
    no_dups_write()
Output
*** Write Dups ***
All data
     X    Y    Z  classification
0  2.1  2.1  2.1               0
1  2.1  2.1  2.1               1
At timestamp 1
     X    Y    Z  classification
0  2.1  2.1  2.1               0
*** No Dups ***
All data
     X    Y    Z  classification
0  2.1  2.1  2.1               1
I will fix the driver so that you can set the dups
variable and do the update. The default will stay as it is now, but I will add an additional PDAL creation option for the schema rather than reading it from a config file.
@normanb Thanks for the code example. My code looks quite similar, so I will have to check for differences to see why I get duplicates.
@normanb I have now used the new PDAL version 2.7.1 with your fix. I have set allow_dups to false, but I am now getting the following error:
tiledb.cc.TileDBError: [TileDB::Writer] Error: Duplicate coordinates (4.51457e+06, 5.42895e+06, 332.71) are not allowed
I read the current entries in the DB with a query, update some attribute values, and write the data back into the DB as in your example.
Maybe I have the wrong understanding of what "allows_duplicates" means. IMO, when I set it to false and write a value to the database, it should check whether an entry already exists for these dimension values, and overwrite it instead of creating a duplicate entry.
@weidinger-c you will get this error if you have duplicates in the fragment that you write. Do you have a small example dataset you can share? I will look into this.
Hi @normanb, here is a short example code. As I cannot share my data, I used this point cloud for the following code: https://samples.geoslam.com/Potree/Powerline_clash_detection/files/Powerline_analysis_laz.zip
I now have the error when inserting the las file into tiledb:
tiledb.cc.TileDBError: [TileDB::Writer] Error: Duplicate coordinates (280568, 5.86573e+06, 558.204) are not allowed
I tried fixing it with the context setting "sm.dedup_coords" = "true". It is totally fine that only one entry exists for duplicate coordinates, as points that are spatially at the exact same coordinates shall be removed.
But I am not sure if I am adding the context correctly.
(Documentation: https://docs.tiledb.com/main/how-to/configuration)
import numpy as np
import laspy
import tiledb
import time
import pdal

if __name__ == "__main__":
    # Name of the array to create.
    tiledb_array_name = "writing_sparse_multiple"

    # lasfile path
    pointcloud_filepath = (
        "/YOUR_PATH/Powerline_analysis_laz/Powerline_analysis_cl.laz"
    )
    pointcloud_with_ids_filepath = "/YOUR_PATH/Powerline_analysis_laz/Powerline_analysis_cl_unique_ids.las"

    # Add unique point id attribute and label attribute to the las file
    las_file = laspy.read(pointcloud_filepath)
    las_file.add_extra_dim(laspy.ExtraBytesParams(name="PointId", type=np.uint64))
    las_file.add_extra_dim(laspy.ExtraBytesParams(name="Label", type=np.uint16))
    pointid = np.arange(0, las_file.header.point_records_count)
    label = np.zeros(las_file.header.point_records_count, dtype=np.uint16)
    las_file.PointId = pointid
    las_file.Label = label
    las_file.write(pointcloud_with_ids_filepath)

    # Create dimensions
    dom = tiledb.Domain(
        tiledb.Dim(
            name="X",
            domain=(np.finfo(np.float64).min, np.finfo(np.float64).max),
            dtype=np.float64,
        ),
        tiledb.Dim(
            name="Y",
            domain=(np.finfo(np.float64).min, np.finfo(np.float64).max),
            dtype=np.float64,
        ),
        tiledb.Dim(
            name="Z",
            domain=(np.finfo(np.float64).min, np.finfo(np.float64).max),
            dtype=np.float64,
        ),
    )

    # Create schema
    schema = tiledb.ArraySchema(
        domain=dom,
        attrs=[
            tiledb.Attr(name="Intensity", dtype=np.uint16),
            tiledb.Attr(name="Label", dtype=np.uint16),
            tiledb.Attr(name="PointId", dtype=np.uint64),
        ],
        cell_order="hilbert",
        tile_order=None,
        capacity=100000,
        sparse=True,
        allows_duplicates=False,
    )

    # Create a configuration object
    config = tiledb.Config()
    config["sm.dedup_coords"] = "true"

    # Create a TileDB context
    ctx = tiledb.Ctx(config)

    print(schema)

    # Create the TileDB array
    tiledb.SparseArray.create(tiledb_array_name, schema, overwrite=True, ctx=ctx)

    # Ingest data
    print("Insert data into tiledb")
    with tiledb.open(tiledb_array_name, "w") as A:
        # Read las file
        las_file = laspy.read(pointcloud_with_ids_filepath)
        x = las_file["x"]
        y = las_file["y"]
        z = las_file["z"]
        intensity = las_file["intensity"]
        label = las_file["Label"]
        pointid = las_file["PointId"]

        # Write to TileDB
        A[x, y, z] = {
            "Intensity": intensity,
            "Label": label,
            "PointId": pointid,
        }

    # Read data
    with tiledb.open(tiledb_array_name) as A:
        data = A.df[:]
        print(data)
        data = A.query(cond="PointId > 0 and PointId < 1000000").df[:]
        print(data)
        x = data["X"]
        y = data["Y"]
        z = data["Z"]
        label = data["Label"]
        intensity = data["Intensity"]
        pointid = data["PointId"]

    # Add 1 to the classification attribute
    label = label + 1

    # Write updated data
    with tiledb.open(tiledb_array_name, "w") as A:
        A[x, y, z] = {
            "Label": label,
            "Intensity": intensity,
            "PointId": pointid,
        }

    # Check if the data is updated correctly
    with tiledb.open(tiledb_array_name, "r") as A:
        data = A.query(cond="PointId > 0 and PointId < 1000000").df[:]
        data = A.df[:]
        print(data)
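One way to avoid the duplicate-coordinates writer error is to drop exact duplicates before the write; a sketch using numpy (the sample values are made up):

```python
import numpy as np

# Made-up coordinates containing one exact duplicate (X, Y, Z) triple.
x = np.array([280568.0, 280568.0, 280570.0])
y = np.array([5865730.0, 5865730.0, 5865731.0])
z = np.array([558.204, 558.204, 559.0])

coords = np.column_stack([x, y, z])

# Indices of the first occurrence of each unique (X, Y, Z) triple.
_, keep = np.unique(coords, axis=0, return_index=True)
keep = np.sort(keep)

x, y, z = x[keep], y[keep], z[keep]
# Attribute arrays (Intensity, Label, PointId) would be filtered
# with the same `keep` indices before writing to TileDB.
print(len(x))  # 2
```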