ros-perception / vision_msgs

Algorithm-agnostic computer vision message types for ROS.

vision_msgs standardization

peci1 opened this issue

Hi, I come from the Discourse post, and I expected there would be some prepared space for discussion. I haven't found it, so I'm creating it here :) If I misunderstood how ideas/comments should be added, please tell me where to post. Generally, I'm not a fan of all-in-one issues, but as you asked more for opinions than for resolute statements/bug reports, it would be complicated to post them one at a time. So I'll try to separate ideas by paragraphs, but the thoughts are sometimes intertwined.

2D?: One general idea - do you think the 2D cases are so important to deserve their own messages? There are suggestions that full 3D information should be used whenever possible. And even now you're using e.g. the 3D ObjectHypothesisWithPose in Detection2D. I bet that message size is not a limiting factor in most vision detection applications. I know there will be the hassle of theta vs. quaternion, but if you think it through, it's not that bad. An angle theta representing yaw around z boils down to the quaternion (0, 0, sin(theta/2), cos(theta/2)). And in a lot of further processing, you would turn theta into sin/cos values anyway. See also my further points about helper functions and decoupling from source data.
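To illustrate, a minimal sketch of that conversion in Python (not part of vision_msgs, just the trig):

import math
from geometry_msgs.msg import Quaternion

def yaw_to_quaternion(theta):
    # Quaternion for a pure rotation of theta radians about the z axis.
    return Quaternion(x=0.0, y=0.0, z=math.sin(theta / 2.0), w=math.cos(theta / 2.0))

def quaternion_to_yaw(q):
    # Inverse of the above, assuming the quaternion encodes only yaw.
    return 2.0 * math.atan2(q.z, q.w)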

Helper functions: If you fear users would need to write boilerplate code (as was expressed in #25), this can be fixed by suitable helper functions inside vision_msgs. You can provide these functions in libraries (as you do here for bounding box construction, or as sensor_msgs does). gencpp can go one step further and provide the functionality right in the generated header files via the plugin infrastructure: it can add constructors, custom class methods, operators and other free functions. Some off-the-shelf examples: BoundingBox(Point, Vector3, theta) (would do the trig to correctly fill the quaternion in the 2D case), double BoundingBox::theta() (would do the inverse), BoundingBox(Point, Vector3) (construct a 2D/3D axis-aligned box from center and extents), BoundingBox(Point, Point) (construct a 2D/3D axis-aligned box from min point and max point), std::string ObjectHypothesis::stringID() (convert UUID to string; see my point on UUIDs), static UniqueID ObjectHypothesis::generateID(). AFAIK, genpy doesn't have such a plugin mechanism, but providing a helper module still seems like a sufficient way.
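As a rough sketch, such a Python helper module could look like this (the function names are hypothetical; it assumes the Noetic BoundingBox2D layout with a Pose2D center and size_x/size_y):

from geometry_msgs.msg import Pose2D
from vision_msgs.msg import BoundingBox2D

def make_bbox_2d(cx, cy, size_x, size_y, theta=0.0):
    # Construct a BoundingBox2D from its center, extents and yaw.
    bbox = BoundingBox2D()
    bbox.center = Pose2D(x=cx, y=cy, theta=theta)
    bbox.size_x = size_x
    bbox.size_y = size_y
    return bbox

def bbox_2d_from_corners(x_min, y_min, x_max, y_max):
    # Construct an axis-aligned BoundingBox2D from its min/max corners.
    return make_bbox_2d((x_min + x_max) / 2.0, (y_min + y_max) / 2.0,
                        x_max - x_min, y_max - y_min)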

Decouple from data: Some messages here include a field carrying some source data. I understand it is very practical on one hand (we do it too in our custom messages), but on the other hand it generates issues like #20 (and possibly future issues like sensor_msgs/PointCloud support etc.). And these have no good solution with the data inside the message, except for composition on a higher level. I propose to remove the source data fields from vision_msgs, and possibly leave a note in the message definition that the user should either use a message synchronizer to get the image-detection pairs, or create a custom type that contains both the image (cloud, ...) and the detection message. It's easy to use topic_tools/relay_field to extract the desired parts afterwards.

There are situations where you do not want to relate the detection to the whole image, but e.g. to the cutout from the image where the detection occurred. This would work the same way in the case where one detection per image is published (you'd just publish the cutout on a new topic). But it would become a bit more complicated when the detector can return multiple detections per image and you use DetectionArray. Here, matching by header would no longer be enough. Or - if taken rigorously - it could be, because making a cutout from an image formally invalidates its frame_id. So each cutout should have its own frame_id, and then you could make the detection-image mapping.

Some more rationale: someone might want to run detections on an organized 3D pointcloud from a depth camera and return 2D detections (because they're e.g. only interested in bounding boxes in the RGB image in the end). And a practical issue we faced: on a low-bandwidth link, we wanted to send detections, but as I mentioned earlier, the image is already a part of our messages (similarly to how it is done here now). But we figured out we do not want to transmit the image over the slow link. So we ended up not filling any values in the image fields, and now we have detections which lie to be detected on 0x0 px images. That doesn't really seem like a conceptually clean approach.

Still, I'm not really 100% sure on this point, but it seems to me decoupling is the "more correct" way, though it has its pros and cons. The bad thing is that using composition to create a DetectionsWithImagesArray message would not subsequently allow using topic_tools/relay_field, because you would need to relay primitive-array fields, which I guess is not supported.
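Just to illustrate the synchronizer approach, a minimal Python sketch (topic names are placeholders; an ApproximateTimeSynchronizer could be used instead if the stamps don't match exactly):

import rospy
import message_filters
from sensor_msgs.msg import Image
from vision_msgs.msg import Detection2DArray

def callback(image, detections):
    # Both messages carry the same header.stamp, so they can be processed as a pair.
    rospy.loginfo("image at %s has %d detections",
                  image.header.stamp, len(detections.detections))

rospy.init_node("pair_detections_with_images")
image_sub = message_filters.Subscriber("camera/image_raw", Image)
det_sub = message_filters.Subscriber("detections", Detection2DArray)
sync = message_filters.TimeSynchronizer([image_sub, det_sub], 10)
sync.registerCallback(callback)
rospy.spin()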

UUID: Maybe it's time to revisit the proposal to use UUIDs (#25). Recently, ros-perception/radar_msgs#3 agreed on using them for a very similar thing (RadarTrack, which is the equivalent of a tracked detection in vision_msgs terms); actually, there's still one review pending before it gets merged.

geometry_msgs: I'd really like to see both BoundingBox classes in geometry_msgs. That makes much more sense (but is a much longer-term effort :) ). Bounding boxes are used outside of vision, too, e.g. in laser filtering. Eigen also has its AlignedBox class in the Geometry module. And, taking into account my points about 2D and helper functions, it might be sufficient to transfer only the 3D box and get rid of the 2D box. I could help with this one because I already did some PRs regarding bounding boxes to MoveIt and Eigen...

Duplicate headers: The *Array messages contain a header, but then they contain the array of Detections, each of which also carries a header. That seems either unnecessary, or not well documented (to help the user distinguish what values to fill in each header). You could either remove headers from the non-array versions and state that even single-object detectors have to publish an array, or you could make stamped and unstamped versions of the messages.

ObjectHypothesisWithPose should use composition: The ObjectHypothesisWithPose message should rather use composition to include an ObjectHypothesis object. I know this adds a bit more typing when creating and filling the message, but it allows tools specialized in processing ObjectHypothesis messages to be reused, whereas the current implementation would require the tool to explicitly support both the variant with pose and the one without it, even though it might not be interested in the pose at all. I know there are some counter-examples in common_msgs (mainly the stamped/unstamped message versions), but I somehow always disliked the fact that they were not interchangeable when I was not interested in the timestamps or frames.

VisionInfo latching: I suggest explicitly mentioning that VisionInfo should be published as a latched topic.

That's it for now. Do not treat these suggestions as some must-haves, it's just feedback which should help this package converge closer to something we could really use instead of our custom messages.

Hi @peci1, thanks for your insightful comments. I agree with most of your points. As a general comment, we'll have to find a tradeoff between cleaner message design (splitting out the source_img and source_cloud fields, ObjectHypothesisWithPose) and the additional complexity from using composition, which will result in an even more deeply nested hierarchy. We already have monstrosities like det3darray.detections[0].results[0].pose.pose.position.x...

Here are some specific comments:

Decouple from data

I don't have any strong opinions on this one. Personally, I've never filled the source_cloud or source_image fields; according to the comments in the message definitions, they are optional anyway.

So we ended up not filling any values in the image fields, and now we have detections which lie to be detected on 0x0 px images.

I wouldn't say it's a "lie". The fields are optional, so when a node receives a message where the source_img or source_cloud is 0x0, it should treat that field as missing.

I'm not convinced by the frame_id trickery. If we need to support cutouts, I'd either leave the existing optional fields in the messages, or follow one of your other suggestions (a custom type with a Detection3DArray and a PointCloud2 (resp. Image) array, or message synchronizer with those two message types).

UUIDs

I'll continue to fight for keeping it a string field. The pros and cons have already been discussed exhaustively in #25. My main argument is that in some use cases, the tracking_id is somehow coupled to other data sources, like an external database or a MoveIt planning scene, and it's very useful to use the same ID in all databases. You can still use a UUID inside the string field, but I don't think forcing the user to use a UUID is a good idea.

geometry_msgs: I'd really like to see both BoundingBox classes in geometry_msgs.

Great suggestion!

Do not treat these suggestions as some must-haves, it's just feedback which should help this package converge closer to something we could really use instead of our custom messages.

That's interesting! I believe it's always much better to extract standard interfaces (like vision_msgs) from actual use cases than to develop them from scratch. This is also what happened here; the current vision_msgs are partly based on custom messages that people have used in the past, and are already actively used by some people. In my use cases, none of your suggestions would be a dealbreaker (except the UUIDs).

Since you said "converge closer": What is still preventing you from switching to vision_msgs? Are there any of your suggestions that will have to be merged before you would consider adopting vision_msgs?

Let's try to keep these messages smaller and more pointed...

2D?: One general idea - do you think the 2D cases are so important to deserve their own messages?

Yes. Standard detections are in a 2D image frame. Most AI methods right now cannot do 3D (you have a bounding box on an image, it doesn't make sense to have a 3D representation for 2D pixel coordinates). We're working in pixel frame in AI, not world coordinates.

UUIDs could be useful, I agree. The usefulness of UUIDs, @mintar, is when using multiple cameras doing tasks. If you just increment an ID, then there could be multiple id = 5's in a system, which would really mess things up since they wouldn't be unique. UUIDs are randomly generated, so collisions are far rarer. For things that have single instances (like MoveIt, navigation, etc.), a string field would be OK because there's little opportunity for collision. But setups like this could easily and commonly be running multiple camera streams, so it's worth having some universally unique identifiers so that the node listening to detections can sort them.

I'd really like to see both BoundingBox classes in geometry_msgs

I don't think that's in the cards; again, these are pixel coordinates, which are most meaningful in a vision_msgs package.

Duplicate headers

This is common practice across geometry_msgs as well; I think we can safely do this based on the consistency argument.

ObjectHypothesisWithPose should use composition

Agreed

VisionInfo latching

That seems oddly specific and also not really that easy for ROS2. We have similar concepts of latching, but there's no boolean switch anymore; it's a range of settings. I don't think it ultimately makes sense to try to enforce this; it might not make sense for all applications.

Thanks for all the discussion, everyone. Good to see this repo getting some love :)

  • I agree with Steve, 2D and 3D really should be separate.
  • I maintain my stance that strings are superior to UUIDs; I'm open to hearing new arguments. However, the argument used in radar_msgs was that it was better than using some number field that could overflow. This is not a concern when using a string. As a compromise, we could add a comment recommending that the field be set to the string representation of a UUID.
  • On the topic of duplicate headers: the original post suggested a couple of options. @SteveMacenski can you clarify which approach you are referring to as common practice? The stamped/non-stamped versions? I like this approach. How should we handle arrays of these messages? Do we need to make Detection3D, Detection3DStamped, Detection3DArray, Detection3DStampedArray, etc.? That seems like a good path forward, although it may get out of hand; for example, we may want Detection3DStampedArray and also Detection3DStampedArrayStamped (allowing the array message itself to have a stamp or not). Thoughts?
  • Agreed on ObjectHypothesisWithPose composition.
  • VisionInfo latching: the much-more used CameraInfo does not mention latching in its message definition. I agree that latching is usually best practice...but not always. And as Steve mentioned, ROS2 makes this tough.

Are there any of your suggestions that will have to be merged before you would consider adopting vision_msgs?

I went through our detection messages, and this is what I found. We have a field explicitly storing the objectness. I'm not sure if this is a generally useful thing or if it's just our particular need. We also like to store what sensor is the origin of the detection - the frame_id can change over time (e.g. by doing cutouts and such), but we still like to know which sensor captured the source image. But maybe it's just for convenience and we could always find the source sensor some other way, e.g. traversing the TF tree or so (which is, however, harder during replay than just reading a field). We also store the estimated 3D position after going through localization (depth and size estimation). Here, I'm not sure if it's still a relevant part of vision_msgs or it should rather be a part of something else. Detection3D would seem like the ideal "storage" for the localization result, but as long as it is tied to a single pointcloud as a source, it doesn't make sense. Our localization uses a source image and several pointclouds. Which of them should be stored as the source?

As detectors can output other things than class probabilities (e.g. the objectness mentioned above, or some other estimated properties), wouldn't it make sense to generalize this a bit and allow the kind of "modular" data storage that PointCloud2 uses, with a byte buffer and a PointField metadata description (which could be published on a separate latched topic)? But I'm not sure how often it happens that detectors output more kinds of data. This would definitely future-proof the message definition, but the question is whether it isn't just a workaround for not having a good definition :) In the case of PointCloud2, I think it really was a good decision, as at the time of its creation, nobody probably thought about "multi-layer" lidars and the need to store the "ring" from which a point comes... and so on.
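Just to sketch the idea (the ResultField sub-message and field names below are purely hypothetical):

# Hypothetical generic per-detection payload, analogous to sensor_msgs/PointCloud2
ResultField[] fields   # name, offset, datatype and count of each value in `data`
                       # (alternatively published once on a separate latched topic)
uint8[] data           # raw buffer holding the extra per-detection values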

We're working in pixel frame in AI, not world coordinates.

Okay, that's a fair point. Using a 3D Box message for storing 2D pixel coordinates doesn't look like a good idea in the end. What about re-using sensor_msgs/RegionOfInterest? It seems close to BoundingBox2D, even semantically, but not exactly: it has a do_rectify field that would be useless here (or not? it might tell whether the detection relates to the rectified image or not), and it's missing theta.
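For reference, sensor_msgs/RegionOfInterest is defined as:

uint32 x_offset  # leftmost pixel of the ROI
uint32 y_offset  # topmost pixel of the ROI
uint32 height
uint32 width
bool do_rectify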

I was thinking a bit more about the (UU)IDs, and I figured out I had a wrong concept in my head about their usage here. I somehow thought that it would be an ID of each particular detection, so that e.g. a localization algorithm can store the set of detections that contributed to the estimated pose. But going through it more thoroughly, I figured out the IDs relate not to the detections, but to the classes. That makes Steve's point invalid in my view, because you'd probably run five detectors, but all of them with the same config, which would mean that ID=5 would mean a drill in both cameras 1 and 2. What would still argue a bit in favor of UUIDs is the case when you run multiple detectors on a single image. If both detectors had hypothesis IDs as numbers starting from zero, you wouldn't be able to distinguish between them by looking just at the hypothesis array. But I can't imagine the mess of going through all our bagfiles and having to do some database lookups to find all drills (if the UUIDs differed in each run). For this particular case, it seems to me that it'd be ideal to use IDs of the form "namespace/class_number" or "namespace/class_name", which could be a single string, two strings, or a string and an int. The need for a namespace is there to distinguish e.g. "yolo/drill" from "otherdetector/drill", which could be trained on different datasets and thus report different kinds of objects as a drill. Wait, this namespace information is kind of available in VisionInfo, so maybe, in the end (through the rabbit hole and back), a hypothesis ID could be just an int or string. But it would deserve much more thorough documentation mentioning this kind of thought path...

* VisionInfo latching: the much-more used `CameraInfo` does not mention latching in its message definition. I agree that latching is usually best practice...but not always. And as Steve mentioned, ROS2 makes this tough.

CameraInfo is supposed to be published with each image, because it can change (e.g. for zooming cameras). VisionInfo should not change (except for the case of lifelong learning detectors). That's the reason why I suggested latching. In ROS 1, that would definitely make sense. In ROS 2, I don't know, I haven't tried it yet.

* The stamped/non-stamped versions? I like this approach. How should we handle arrays of these messages?

Ouch, Detection3DStampedArrayStamped looks really bad. Couldn't we agree that all detections in an array share a common header? That would lead to Detection3DArrayStamped containing just Detection3D messages, all of them sharing the header. This was the kind of de-duplication I was talking about.
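In other words, something along these lines (just a sketch, names not final):

# Detection3DArrayStamped.msg (hypothetical)
std_msgs/Header header            # shared by all detections in this array
vision_msgs/Detection3D[] detections   # Detection3D without its own header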

we could add a comment recommending that the field be set to the string representation of a UUID

I think that's not really a good move. If we "recommend" something, we should just say that you should "do" that thing. If we're telling people to use UUIDs, then we should just use the proper UUID message. As ROS gets more mature, some standards need to get stricter to enforce good behavior.

Detections from some arbitrary AI detection algorithm have no way of assigning a meaningful ID; they're themselves just returning some hash to specifically identify one thing vs. another. My understanding of this package is that this is the output of a detector / segmentation / classification task to then be consumed by later stages that will track those detections over time/space, or utilize them for the environmental model. Having truly unique IDs is probably really important to that problem in general, and the class names should be what's used to communicate what the object is. If the AI detector that is publishing this message can also track the objects such that a single object is correlated between frames, then names start to matter a little more, but you could still just use the same UUID again, since I wouldn't think a string name like human 452 or chair 30 would be used for anything other than book-keeping and visualization. The real "information" is that this is a unique human or chair, distinct from another human or chair (the differing ID field, whether UUID or not), while the class ID tells us what the "thing" is.

So what else is the string-based identifier used for that could add more value? I think that's the question this boils down to. If it's just that it's a human, and a unique human distinct from another human, then the class IDs stored in the detection plus the fact that the IDs differ are sufficient, regardless of whether it's a UUID or something else.

duplicate headers

I don't think anything needs to change. I'm fine with it as is.

Agreed on ObjectHypothesisWithPose composition.

I think we're all on the same page here

UUIDs

@peci1 wrote:

I somehow thought that it would be an ID of each particular detection, so that e.g. a localization algorithm can store the set of detections that contributed to the estimated pose. But going through it more thoroughly, I figured out the IDs relate not to the detections, but to the classes.

You probably looked at the Kinetic/Melodic version of the messages, where id was still an int64 and tracking_id didn't exist. In Noetic, there are two ID fields (see below):

  • ObjectHypothesisWithPose.id: This is the detected class (human, mug, ...). I think we can all agree that this should remain a string, correct? Also perhaps id wasn't the best name for this field; it has often caused confusion in the past.
  • tracking_id: This is the ID of the object when tracked across multiple frames. The discussion is whether we should make this a UUID. Right?
  • The single detections (without tracking across frames) don't have an ID. There was a discussion in the issues, but we decided against it, since each detection is already uniquely identified by the time stamp of its Detection*DArray message and its array index within that message.

Detection3D.msg

std_msgs/Header header
vision_msgs/ObjectHypothesisWithPose[] results
  # The unique ID of the object class. To get additional information about
  #   this ID, such as its human-readable class name, listeners should perform a
  #   lookup in a metadata database. See vision_msgs/VisionInfo.msg for more detail.
  string id

  float64 score
  geometry_msgs/PoseWithCovariance pose
vision_msgs/BoundingBox3D bbox
sensor_msgs/PointCloud2 source_cloud

# If this message was tracking result, this field set true.
bool is_tracking

# ID used for consistency across multiple detection messages. This value will
#   likely differ from the id field set in each individual ObjectHypothesis.
# If you set this field, be sure to also set is_tracking to True.
string tracking_id

The arguments for and against UUIDs as presented in #25 and reiterated by @SteveMacenski basically boil down to:

Pro UUID:

  • If there are multiple object detector nodes, and some aggregator node just simply throws all their results together, UUIDs will prevent collisions between tracking IDs.

Contra UUID:

  • Having the possibility of having human-readable strings as tracking_ids is much more user and developer friendly; apple-3456 is easier on the eyes than ec1cf114-7aa0-4402-8bf9-eee858118b7d when used in log outputs, RViz markers etc.
  • Perhaps there's not just a one-way flow of information from object detectors into the environment model, but perhaps the object detector re-acquires an object that is already known (by some ID) in the environment model and continues to track it. Having a string here gives us the flexibility of using the existing ID for tracking, no matter what format it is in.

Presumably, if you run multiple object detector nodes, they will all publish their results on separate topics. If you are worried about collisions, the aggregator node could simply prefix each tracking_id with the topic name, which would guarantee unique strings. In my view, this negates the only advantage of UUIDs. So I'm also opposed to changing this field to a UUID.
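For example, a hypothetical aggregator node could do the prefixing along these lines (topic names are placeholders):

import rospy
from vision_msgs.msg import Detection3DArray

def make_relay(topic, pub):
    def relay(msg):
        # Prefix every tracking_id with its source topic to keep IDs unique
        # across detectors before republishing on the aggregated topic.
        for det in msg.detections:
            det.tracking_id = topic + "/" + det.tracking_id
        pub.publish(msg)
    return relay

rospy.init_node("detection_aggregator")
pub = rospy.Publisher("aggregated_detections", Detection3DArray, queue_size=10)
for topic in ("detector_a/detections", "detector_b/detections"):
    rospy.Subscriber(topic, Detection3DArray, make_relay(topic, pub))
rospy.spin()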

BoundingBoxes

@peci1 wrote:

What about re-using sensor_msgs/RegionOfInterest? It seems close to BoundingBox2D, even semantically, but not exactly. It has the do_rectify field that would be useless here (or not? it might tell whether the detection relates to the rectified image or not), and it's missing theta.

Yes, we had that discussion before, but @Kukanani was adamant that we need the theta field. I believe BoundingBox2D should definitely stay in this package, because it's pixel based. But BoundingBox3D could still move to geometry_msgs, right?

On geometry_msgs, I don't think that's in the scope of this discussion. We're mainly looking to create stronger vision_msgs interfaces. Moving some of these into geometry_msgs is a whole other discussion that would likely require years of stability in vision_msgs before it becomes a serious consideration.

Thanks for the discussion. It would be helpful to find a general-enough representation which efficiently covers typical use cases.

I would suggest organizing the messages around common vision tasks, keeping the representation footprint small for typical use cases while keeping the messages general enough to allow other use cases as well.

For example, for a 2D detector the output would be a list of bounding boxes with some confidences or class probabilities.
As the classes and their exact definition are inherently dataset-specific, I would suggest using a generic-enough representation, such as a variable-length list float32[] scores to denote score(s)/probability(ies), to allow various use cases.

More flexibility (for a reasonable price) could be added by explicitly adding a class_id[] field, so that a subset of classes can be captured in scores and class_id. This would be helpful when there is a large number of classes we model/detect and we want to keep e.g. the 10 most likely. This would add 4+ bytes.

As the task really is image -> list of detections, I would keep the header only in the list-of-detections message.

With 4 float32 values (or 5 with theta) to define the bounding box with subpixel precision, it would be ~36 bytes for a single 2D detection. I would say this can't go much lower.
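Just to sketch what such a lightweight message could look like (field names are only illustrative):

# BoundingBoxDetection2D.msg (hypothetical sketch)
float32 center_x
float32 center_y
float32 size_x
float32 size_y
float32 theta        # optional orientation of the box
float32[] scores     # score(s)/probability(ies)
int32[] class_id     # optional: subset of class IDs corresponding to scores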

I see it as a problem that there currently isn't any lightweight message for a bounding box detection or a list of these, which is the typical output of a 2D detector. Having a 3D pose with covariance as part of a 2D detection seems like overkill to me; it is 288 bytes to represent the covariance alone (and an additional 56 bytes for the pose). This also mixes 2D with 3D for no obvious reason - there is a 3D detection message, after all. By the same logic, the 3D bounding box could just as well be in the 2D detection.

A similar situation is with the image (~ 35 bytes). The source image is presumably already being sent on some other topic, and matching the detection result with the image source is therefore simple - you just sync these two topics. So I would omit it completely, definitely from the type for a single detection, Detection2D, or use just a compressed image buffer which can be empty.

Regarding tracking IDs - conceptually, it seems to me that these should be a separate thing built on top of these 2D detections.

I just realized where some of the 2D/3D confusion may come from. It seems that now the detection messages are organized more around which input modalities were used to generate them (2D image / 3D lidar) rather than around what information is being produced (2D bounding boxes / 3D poses with bounding boxes). I think the latter would be more useful since, in general, there can be any combination of modalities on input and the latter says what can be done further with the outputs.

I agree with you that the messages are quite complex at the moment (maybe too complex) and it's worth thinking about having a simpler Detection2D message.

However, I wouldn't worry too much about a couple of bytes. A single 640x480 RGB image is about 900 kB, so it doesn't matter too much whether the detection message for that image is 0.1 kB or 0.5 kB.

A much better argument for removing weird fields that will probably not be filled in a large chunk of use cases is that it will make the messages clearer, less ambiguous and easier to use.

I just realized where some of the 2D/3D confusion may come from. It seems that now the detection messages are organized more around which input modalities were used to generate them (2D image / 3D lidar) rather than around what information is being produced (2D bounding boxes / 3D poses with bounding boxes). I think the latter would be more useful since, in general, there can be any combination of modalities on input and the latter says what can be done further with the outputs.

Yes, I think you are right. I have a good example for this: DOPE uses 2D images as input, but it produces 6DoF pose estimations and 3D bounding boxes, so I used Detection3D as the output message. This also means that I can't really use the source_cloud field, because the input isn't a cloud. This problem could be solved by removing the source_cloud/source_img fields (as proposed by @peci1). If some application really needs it, they could use a custom message that has the appropriate fields (for example Detection3D and Image).

In summary, would you agree that the following changes solve the problems?

diff --git i/msg/Detection2D.msg w/msg/Detection2D.msg
index e13fedf..c8f87d8 100644
--- i/msg/Detection2D.msg
+++ w/msg/Detection2D.msg
@@ -7,19 +7,12 @@
 Header header
 
 # Class probabilities
-ObjectHypothesisWithPose[] results
+ObjectHypothesis[] results
 
 # 2D bounding box surrounding the object.
 BoundingBox2D bbox
 
-# The 2D data that generated these results (i.e. region proposal cropped out of
-#   the image). Not required for all use cases, so it may be empty.
-sensor_msgs/Image source_img
-
-# If true, this message contains object tracking information.
-bool is_tracking
-
 # ID used for consistency across multiple detection messages. This value will
 #   likely differ from the id field set in each individual ObjectHypothesis.
-# If you set this field, be sure to also set is_tracking to True.
+#   This field may be empty.
 string tracking_id

That would result in the following Detection2D message:

std_msgs/Header header
  uint32 seq
  time stamp
  string frame_id
vision_msgs/ObjectHypothesis[] results
  int64 id
  float64 score
vision_msgs/BoundingBox2D bbox
  geometry_msgs/Pose2D center
    float64 x
    float64 y
    float64 theta
  float64 size_x
  float64 size_y
string tracking_id

(I assume you meant vision_msgs/ObjectHypothesis[] results in the bottom code block.)
Yes, that would definitely be a reasonable update in the direction I'd like to see.

(I assume you meant vision_msgs/ObjectHypothesis[] results in the bottom code block.)

Yes, that was a copy-paste error. I've edited my comment above and fixed it, thanks!

I think that the same thing can be done with Detection3D. The pose and bbox would be defined just by BoundingBox3D and semantics by ObjectHypothesis[] results instead of ObjectHypothesisWithPose[] results.
I don't see the point in having both ObjectHypothesisWithPose[] results and BoundingBox3D bbox.
Also, both source_cloud and is_tracking fields could be removed.

I would consider allowing a generic binary source blob, so that a source-data digest of any kind can be included. I know it has drawbacks, but it could prevent people from devising their own message types just to include the compressed image crops, point clouds, radar data etc. they happen to have in their scenario, with minimal penalty for those who don't need this.

So, trying to accumulate the changes that have some momentum and move this conversation forward to specific changes: below is a list of owners and actions that we should take to split out these discussions into appropriate locations so we can act on them:

  • @peci1 can you submit a PR to ObjectHypothesisWithPose should use composition? That seems to have no objections and we can continue that topic then in a PR rather than all here.
  • @peci1 can you open a separate ticket on Decouple from data? I think that's worth some discussion. We could create a comment in the message stating that it should be correlated against a synchronous message filter on the stamp matching the input data and frame. That way we can simply remove it from the messages and you can still deterministically get the source info, but it's worth further discussion.
  • @mintar can you open a ticket about UUIDs and a summary of your argument against them? We should continue that discussion in a specific forum. I'll lean towards not changing this if there's a lot of resistance, but I think it would be good.

Topics discussed that we're probably not going to take action on

  • Leaving 2D messages in some form: 80% of AI networks available today work on 2D images, so it makes no sense to remove this.
  • geometry_msgs inclusion is a potential long-term goal; this is going to be omitted since we don't have direct control over it and it would take years to get fully accepted.

@tpet can you briefly summarize tangible changes you'd like to have made?

@mintar can you open a ticket about UUIDs and a summary of your argument against them? We should continue that discussion in a specific forum. I'll lean towards not changing this if there's a lot of resistance, but I think it would be good.

Done: #47

Leaving 2D messages in some form: 80% of AI networks available today work on 2D images, so it makes no sense to remove this.

Could you clarify what you mean by that sentence? I don't understand who's suggesting to remove what. Is this a response to my comment #43 (comment)?

@Kukanani @SteveMacenski : A general question about procedure: vision_msgs is already released into noetic. Am I correct to assume that we'll be restructuring the messages anyway? I think it's okay to deviate from the "no changes to released messages" rule in this case, but I just wanted to make sure before I open any pull requests.

A general question about procedure: vision_msgs is already released into noetic. Am I correct to assume that we'll be restructuring the messages anyway? I think it's okay to deviate from the "no changes to released messages" rule in this case, but I just wanted to make sure before I open any pull requests.

No, no changes on released distributions; ROS1 is done for all the interfaces. The state of things now is how they'll be. This is only for new ROS2 distributions moving forward, since we cannot break ABI in the middle of an established distribution (especially Noetic, as the last ROS1 LTS distribution). All migrations to Noetic have largely been completed by now, so this would represent a non-trivial random task appearing out of the blue that impacts a surprisingly large user base.

Ok, thanks for the clarification. So all PRs should be targeted against the ros2 branch?

* @peci1 can you submit a PR to `ObjectHypothesisWithPose should use composition`? That seems to have no objections and we can continue that topic then in a PR rather than all here.

I'm not @peci1, but I've opened a PR anyway: #48

Composition with object hypothesis has been merged, the UUID discussions are ongoing, and we haven't started the discussions on data decoupling yet.

I think that the same thing can be done with Detection3D. The pose and bbox would be defined just by BoundingBox3D and semantics by ObjectHypothesis[] results instead of ObjectHypothesisWithPose[] results.
I don't see the point in having both ObjectHypothesisWithPose[] results and BoundingBox3D bbox.

No, each object hypothesis needs its own pose, and so does the bounding box. This is because the bounding box pose is defined as the transform from the header frame to the center of the bounding box; however, the reference coordinate system of an object doesn't have to be in the center of its bounding box. It's often defined as the origin of a reference mesh.

Even if it was always in the center of the bounding box, the orientation doesn't have to match. E.g. a flat object could be a keyboard in its "normal" orientation, or a book lying on its side. So you need different poses for each object hypothesis.

@mintar I would tend to define the messages around tasks and their respective outputs, such as 2D/3D bounding box object detection. I meant the Detection2D/3D messages to represent bounding box detection, which is not as clearly visible from the names as it could be. In that case, the task of identifying 3D reference frames (without a bounding box) is simply a different task and should have a different message (or you could just disregard the bounding box size information).

Even if it was always in the center of the bounding box, the orientation doesn't have to match. E.g. a flat object could be a keyboard in its "normal" orientation, or a book lying on its side. So you need different poses for each object hypothesis.

I don't understand how the orientation couldn't match if the bounding box orientation is defined by a full 6DOF pose, same as in ObjectHypothesisWithPose, but without covariance.

Why do you think the messages here should be restricted to bounding box object detection?

I believe they should continue to allow full 6DoF pose estimation. This is how they are defined and used at the moment, for example by DOPE.

I don't understand how the orientation couldn't match if the bounding box orientation is defined by a full 6DOF pose, same as in ObjectHypothesisWithPose, but without covariance.

My point was that the pose in ObjectHypothesisWithPose is different from the bounding box pose. For example, here's a mesh of one of the objects from the standard YCB benchmark dataset:

[Image: ycb_meshlab_screenshot]

Clearly, the mesh origin is not in the center of the bounding box.

To elaborate on the "book vs keyboard" example:

  • Assume the keyboard's bounding box is x=0.1 y=0.2 z=0.03 (i.e., the "normal orientation" is "laying flat").
  • Assume the book's bounding box is x=0.1 y=0.03 z=0.2 (i.e., the "normal orientation" is "standing upright").
  • If the detector detects a flat object, and is unsure whether it's a keyboard or a book, the results[] array should contain two hypotheses ("book" + "keyboard"). Obviously, they need a different pose (in order to turn the book sideways).

Of course, that's still assuming we're doing 6DoF pose estimation. The whole point of the "book vs. keyboard" example is that even if we move all the mesh origins to their respective bbox centers, we still need separate poses. If we were just doing some sort of bounding box detection without pose, we wouldn't need the poses in the Detection3D, that's correct.


TL;DR: The bounding box and each object class need their own poses, since they refer to different reference frames.
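To make this concrete, here is a rough sketch of how such a Detection3D might be filled (assuming the Noetic field layout with flat id/score/pose fields in ObjectHypothesisWithPose; after the composition change these would sit under a nested hypothesis field; the scores are made up):

import math
from geometry_msgs.msg import Quaternion, Vector3
from vision_msgs.msg import Detection3D, ObjectHypothesisWithPose

det = Detection3D()
# Bounding box of the observed flat object, in its own pose (identity orientation here).
det.bbox.center.orientation.w = 1.0
det.bbox.size = Vector3(x=0.1, y=0.2, z=0.03)

# Hypothesis 1: a keyboard, which lies flat in its canonical orientation.
keyboard = ObjectHypothesisWithPose(id="keyboard", score=0.6)
keyboard.pose.pose.orientation.w = 1.0

# Hypothesis 2: a book, canonically upright, so it must be rotated 90 deg
# about x to match the observed flat object.
book = ObjectHypothesisWithPose(id="book", score=0.4)
book.pose.pose.orientation = Quaternion(
    x=math.sin(math.pi / 4.0), y=0.0, z=0.0, w=math.cos(math.pi / 4.0))

det.results = [keyboard, book]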

I agree with @mintar, I see no real reason to change this. An orientation in 3D is important so the bounding boxes can be non-axis-aligned. The 2D case is less used in current technologies, but there are networks which do give oriented bounding boxes.

The only thing left that I think there's consensus on is the decoupling of data (e.g. removing the PC2 / image fields from the messages). If someone wants those together, they can use an approximate or exact time message filter to grab both of those with corresponding headers.

Why do you think the messages here should be restricted to bounding box object detection?

That's a misunderstanding. I think that there should be messages for standard vision tasks which are organized based on what output is produced. E.g., there should be messages for bounding box detection, which could be named Detection2D and Detection3D (BoundingBoxDetection2D and BoundingBoxDetection3D would be clearer).

I believe they should continue to allow full 6DoF pose estimation. This is how they are defined and used at the moment, for example by DOPE.

The bounding box pose in Detection3D is a full 6DoF pose. You can neglect the size of the bounding box if it is not needed.

My point was that the pose in ObjectHypothesisWithPose is different from the bounding box pose.

A static transformation from the bounding box to the mesh origin doesn't have to be represented in each instance of the message, IMHO.

If the detector detects a flat object, and is unsure whether it's a keyboard or a book, the results[] array should contain two hypotheses ("book" + "keyboard"). Obviously, they need a different pose (in order to turn the book sideways).

I would say that they would need completely different bounding boxes - as the relative bounding box dimensions in x-y-z axes would be completely different in these two examples.

Please look in this repository, much of what you describe already exists.

PR for decoupling from data is created: #53.

Even after concluding the new UUID discussion as wontfix, I still have some doubts about the concept of tracking_id (or, newly, just id) here. And it is closely tied (again) to the concept of decoupling.

Strictly speaking, a detector outputs detections. By which I understand bounding boxes or similar things. But it does not (in my view) associate the bounding boxes with any objects (just with classes). Association to objects is the work of a tracker. Currently, it seems to me that the DetectionXDArray messages should be used with both detectors and trackers. But why? Why should a tracker emit detections and not tracks? (as much as I don't like this word)

Take another look. The DetectionXD message contains a list of class probabilities. I.e. it represents some uncertain information. However, the tracking_id stands for certain information. No space for disputes or probability distributions. How does that come together with the uncertain class probabilities? Is it the most probable ID? And where are the less probable ones?

Wouldn't it be better to decouple the tracking information from the detection message and create a "parent" message Track2D and Track3D? This message could then contain all the fancy tracking-related fields like track_id, velocity etc., as @SteveMacenski mentioned in #47 (comment). The track message could then either be completely decoupled from Detection (again, composable back using a message synchronizer matching TrackArray and DetectionArray messages), or Detection could be a part of it (which I would not prefer, but could be an option for algorithms that do detection and tracking in one tight loop; but even those could use the completely separate messages).

This would also nicely solve the question what to put in tracking_id when the detector does not do tracking.
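A rough sketch of what such a message could contain (entirely hypothetical, just to anchor the discussion):

# Track3D.msg (hypothetical sketch)
std_msgs/Header header
string id                                   # stable identity of the tracked object
vision_msgs/ObjectHypothesisWithPose[] results   # current belief about class and pose
vision_msgs/BoundingBox3D bbox
geometry_msgs/TwistWithCovariance velocity  # tracking-specific state (cf. #47)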

I would say that they would need completely different bounding boxes - as the relative bounding box dimensions in x-y-z axes would be completely different in these two examples.

Sure, but for some detectors the bounding box may not be linked to the specific class of object. One object detection system I worked with did the following:

  1. remove the table plane, segment the remaining point cloud into point clusters (= potential objects)
  2. classify each point cluster

In this case, the bounding box was the bbox of the point cluster, and the results array contained all possible classifications of that cluster with their poses.

In short, I believe the bbox pose and the pose of each classification are generally separate things, and don't need to be connected via fixed transformations, so we need all of them in the message.

Wouldn't it be better to decouple the tracking information from the detection message and create a "parent" message Track2D and Track3D? [...]

Yes, that sounds like an excellent suggestion!

The track message could then either be completely decoupled from Detection (again, composable back using a message synchronizer matching TrackArray and DetectionArray messages), or Detection could be a part of it (which I would not prefer, but could be an option for algorithms that do detection and tracking in one tight loop; but even those could use the completely separate messages).

Personally, I would strongly prefer to include the Detection message in the Track message. The extra bandwidth is negligible, and it's much easier to use than using a synchronizer to subscribe to two different topics and then iterate the arrays in those messages in parallel.

@mintar

In this case, the bounding box was the bbox of the point cluster, and the results array contained all possible classifications of that cluster with their poses.

In short, I believe the bbox pose and the pose of each classification are generally separate things, and don't need to be connected via fixed transformations, so we need all of them in the message.

But couldn't you just use multiple bounding boxes, each with a single object hypothesis? That would still be much less overhead compared to the current state.

@peci1

Strictly speaking, a detector outputs detections. By which I understand bounding boxes or similar things. But it does not (in my view) associate the bounding boxes with any objects (just with classes). Association to objects is the work of a tracker. Currently, it seems to me that the DetectionXDArray messages should be used with both detectors and trackers. But why? Why should a tracker emit detections and not tracks? (as much as I don't like this word)

For a single input (image / point cloud) you get multiple of these bounding box detections (similarly to e.g. instance segmentation), so you kind of get objects (not only classes), e.g. you "know" that these are different identities even if you don't link them across multiple timesteps.

@peci1 @mintar

Personally, I would strongly prefer to include the Detection message in the Track message. The extra bandwidth is negligible, and it's much easier to use than using a synchronizer to subscribe to two different topics and then iterate the arrays in those messages in parallel.

I think that could easily become not that negligible, definitely in the current setup, where you have pose with covariance, which is very verbose. If I understand the proposal correctly, to stream outputs for a single track, I would output messages ever increasing in size. E.g., after tracking for 1 minute at 30 Hz, you would end up with messages of size > 30 * 60 * (36+7) * 8 = 619200 Bytes each (pose 7 values, covariance 36).

So having some sort of tracking ID at the level of a single detection has some advantages. You can easily collect these messages to get the whole track, in both online and offline/batch processing cases.

@tpet

For a single input (image / point cloud) you get multiple of these bounding box detections (similarly to e.g. instance segmentation), so you kind of get objects (not only classes)

My understanding is that per one input image/cloud/..., the detector outputs one DetectionXDArray message. The detections from this single detector run are "identified" by their position in this array, and it should be apparent that each entry in this array denotes (possibly) a different object. And this should apply to Tracks, too: one TrackXDArray per DetectionXDArray.

If I understand the proposal correctly, to stream outputs for a single track, I would output messages ever increasing in size.

That's why I'm inclined not to include the detection messages inside tracks, because a track is the result of processing many detections. Maybe, if it's convenient, we could keep a field like last_detection or similar there. Or, what about including just the most probable hypothesis? This way, you could quickly look up what kind of object it is, and if you were interested in details, you could subscribe to the detections.

@tpet

I think that could easily become not that negligible, definitely in the current setup, where you have pose with covariance, which is very verbose. If I understand the proposal correctly, to stream outputs for a single track, I would output messages ever increasing in size. E.g., after tracking for 1 minute at 30 Hz, you would end up with messages of size > 30 * 60 * (36+7) * 8 = 619200 Bytes each (pose 7 values, covariance 36).

I don't think anyone was advocating for storing the entire Detection history in the Track message. Instead, the point was just that Tracks are based on Detections over time. I certainly am against storing detection history, anyway. But it is also possible that a single "current location of a specific object" as provided by one tracker could be based on multiple detections (from different detectors). Examples: sensor fusion, any object being tracked by multiple cameras spread across a volume.


@peci1

Strictly speaking, a detector outputs detections. By which I understand bounding boxes or similar things. But it does not (in my view) associate the bounding boxes with any objects (just with classes). Association to objects is the work of a tracker. Currently, it seems to me that the DetectionXDArray messages should be used with both detectors and trackers. But why? Why should a tracker emit detections and not tracks? (as much as I don't like this word)

Because in practice, the difference is small enough that a single message type works for both. The id field is here: https://github.com/ros-perception/vision_msgs/blob/ros2/msg/Detection3D.msg#L17 and can be used for tracking information. But if you don't think coupling a single id to a single detection is useful, just don't use it. Elaboration below.

@peci1

What about including just the most probable hypothesis? This way, you could quickly look up what kind of object it is, and if you were interested in details, you could subscribe to the detections.

This is starting to sound like TrackXD = a single ObjectHypothesisWithPose + a single BoundingBoxXD. Do I have that right? I like this idea, as it represents the exact information I need for a very simple vision pipeline. However, that's equivalent to a DetectionXD with just one element in results (unless, based on the discussion with @tpet, we also want to include an array of Detections, which I do not support). So I don't think there is a need for a new message type; maybe some comment clarification will do.

One more thing: as a general rule, I greatly prefer basing these discussions on actual use cases rather than just debating what the "platonic ideals" of messages should be.

Closing due to inactivity.