Colab example?
Ademord opened this issue
Hello, I have just discovered your project. How could I use it to do inference on a real-time feed?
I am trying to feed a depth image or a point cloud to "some algorithm" that will put them together and store them in a PLY file.
I think this is called point cloud registration? But it overlaps with SfM, where I would take N camera shots and store a point cloud, then take N+M shots and compare with the previous point cloud to see how much the new M shots contributed in terms of adding "relevant" points to the point cloud. Or some variant of that, maybe with this project.
Hi @Ademord,
The inference code is in https://github.com/hehefan/P4Transformer/blob/main/train-msr.py#L48-L102.
Your research problem is very interesting. I have a question: is motion important in this problem? In other words, can we find "relevant" points based on appearance alone, i.e., is the motion actually meaningless here? If so, you might just need to extract the features of spatial local areas for each shot via static point cloud methods, e.g., PointNet++ or KPConv, and then find their relationships via the self-attention of the transformer. Also, in this case you may not need the "Coordinate and Local Feature Embedding", i.e., Eq. (3) in the paper, because the coordinates of the same point in different shots may be very different.
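To make that suggestion concrete, here is a minimal sketch of such a pipeline, not code from this repository: a static point cloud backbone (PointNet++ or KPConv, represented by a placeholder `static_encoder`) extracts local-area features per shot, and standard multi-head self-attention relates the areas across shots. The class name, feature dimension and head count are assumptions.

```python
import torch
import torch.nn as nn

class ShotRelationModel(nn.Module):
    """Relate local-area features across shots via self-attention.

    `static_encoder` is a placeholder for any static point cloud
    backbone (e.g. PointNet++ or KPConv) that maps one (N, 3) shot
    to (L, feat_dim) local-area features; feat_dim must match it.
    """
    def __init__(self, static_encoder, feat_dim=256, num_heads=4):
        super().__init__()
        self.static_encoder = static_encoder
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, shots):
        # shots: list of (N_i, 3) tensors, one per camera shot
        feats = [self.static_encoder(s) for s in shots]   # each (L_i, feat_dim)
        tokens = torch.cat(feats, dim=0).unsqueeze(0)      # (1, sum L_i, feat_dim)
        # Self-attention relates local areas from different shots based on
        # appearance features only; no coordinate embedding is used.
        out, weights = self.attn(tokens, tokens, tokens)
        return out.squeeze(0), weights.squeeze(0)
```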
Best regards.
Hey, thanks for your reply! That's actually also an interesting question.
- How could motion be relevant in any scenario?
- What kind of relationships, found via the transformer's self-attention, should I look at when I run, for example, KPConv each time from a different point of view?
I currently have an MLAgents explorer drone trained, but it's limited to an octree-based exploration of yes/no decisions. I was thinking a point cloud "discovery" exploration could contribute to in-depth analysis behaviors for car accidents, airplanes, storage, or in general any concrete scenario where a closer look at an object, or rather at its point cloud, is required. This would give the baseline behavior for the agent, and in parallel some other network can do semantics for monitoring, etc.
Looking forward to hearing from you!
Hi @Ademord,
- **Motion**: Motion reasoning is not always necessary. To recognize "stand up/sit down" or "open/close a door", motion modeling is necessary. To coarsely distinguish "play basketball" from "play football", motion is not very important. To my understanding, the sensor motion in your research problem does not have a strong pattern and may be random. If that is true, I do not think motion modeling is necessary for your problem. However, even without motion modeling, using multiple point clouds from different viewpoints could still be helpful for recognizing accidents.
- **Relationship**: KPConv can be used to extract local-area features, which can then be used for self-attention based on local-area feature similarities. Suppose two shots from different viewpoints capture the right hand of a person. First, KPConv extracts the hand features from the two shots. Then, the transformer finds via self-attention that the two features are relevant, or similar.
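As a small illustration of the "right hand seen from two viewpoints" example, here is a sketch of the soft matching step, assuming `feats_a` and `feats_b` are local-area features already extracted by KPConv; the function name and temperature are arbitrary choices.

```python
import torch
import torch.nn.functional as F

def cross_shot_similarity(feats_a, feats_b, temperature=0.07):
    """Soft matching between local-area features of two shots.

    feats_a: (L_a, D) local-area features from shot A (e.g. from KPConv)
    feats_b: (L_b, D) local-area features from shot B
    Returns an (L_a, L_b) matrix; a high entry means the two local areas
    (e.g. the same hand seen from two viewpoints) look similar.
    """
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    sim = a @ b.t()                                 # cosine similarities
    return F.softmax(sim / temperature, dim=-1)     # attention-like soft assignment

# Hypothetical usage: the best match in shot B for local area i of shot A
# weights = cross_shot_similarity(feats_a, feats_b)
# best_match = weights[i].argmax()
```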
I may be misunderstanding your research, so just correct me if I am wrong.
Best regards.
Hi @hehefan,
Thanks for your feedback, I don't think you are misunderstanding it... I am trying to find a way to motivate an RL agent (Unity) to discover a point cloud (or scan it). So I need a way to store the point cloud, accumulate/aggregate it when new scans arrive, and then reward the agent the more "new points" arrive. This is the part where I need to figure out how to store a point cloud and augment it with new scans.
So I am still super confused by all the methods that exist. I found tsdf-fusion but I have yet to try it in my "real-time setup". Still left to try on my current list are:
- tsdf-fusion,
- RoutedFusion,
- panoptic segmentation (will not reward based on point cloud discovery but on yes/no detections),
- a PyTorch3D-based point cloud registration approach.
I tried OpenSfM and OpenMVG, but they didn't show results, and as far as I understood they are offline methods anyway...
I also need time to adapt the code from P4Transformer to feed it the new info from the camera... and I have such limited time that I need to settle on a library/method and stick to it :(
Hi @Ademord,
I am sorry, I am not that familiar with non-deep-learning methods. To store a point cloud and augment it with new scans, I have a naive idea: try to retrieve each new point from the existing points based on their features. If the new point cannot be retrieved, we store it. Point features can be extracted by a pretrained PointNet++, KPConv or other methods. Also, if the drone is equipped with a LiDAR and the scene is static, points can be directly merged based on the point coordinates and the drone's trajectory.
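A rough sketch of that retrieval idea, assuming per-point features have already been extracted by a pretrained PointNet++ or KPConv (the arrays and the similarity threshold are placeholders): a new point is stored only if no existing point has a sufficiently similar feature. For the LiDAR case one would compare point coordinates instead of features, and the number of newly stored points could directly serve as the agent's discovery reward.

```python
import numpy as np

def merge_by_features(stored_xyz, stored_feat, new_xyz, new_feat, thresh=0.9):
    """Store a new point only if it cannot be retrieved from the existing cloud.

    stored_xyz, new_xyz:   (N, 3) / (M, 3) point coordinates
    stored_feat, new_feat: (N, D) / (M, D) per-point features from a
                           pretrained PointNet++ or KPConv (placeholders)
    thresh: cosine-similarity level above which a point counts as "found"
    """
    if stored_xyz.shape[0] == 0:
        return new_xyz.copy(), new_feat.copy(), new_xyz.shape[0]
    # Cosine similarity between every new point and every stored point.
    a = new_feat / np.linalg.norm(new_feat, axis=1, keepdims=True)
    b = stored_feat / np.linalg.norm(stored_feat, axis=1, keepdims=True)
    best = (a @ b.T).max(axis=1)        # best retrieval score per new point
    keep = best < thresh                # retrieval failed -> genuinely new point
    merged_xyz = np.vstack([stored_xyz, new_xyz[keep]])
    merged_feat = np.vstack([stored_feat, new_feat[keep]])
    return merged_xyz, merged_feat, int(keep.sum())   # count can act as a reward
```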
Besides, I think you might need an LSTM or one of its variants as the backbone for the entire framework. I have a work titled "Watching a Small Portion could be as Good as Watching All: Towards Efficient Video Classification" that may provide some insights.
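If an LSTM backbone is used, the sequence part might look roughly like the sketch below, where each shot has already been collapsed to a single global feature vector; the dimensions and the scalar output head are assumptions.

```python
import torch
import torch.nn as nn

class ShotSequenceModel(nn.Module):
    """Run an LSTM over a sequence of per-shot global features."""
    def __init__(self, feat_dim=1024, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)    # e.g. a scalar value per sequence

    def forward(self, shot_feats):
        # shot_feats: (B, T, feat_dim), one pooled feature vector per shot,
        # e.g. the global PointNet++ feature of each of T consecutive scans.
        out, _ = self.lstm(shot_feats)
        return self.head(out[:, -1])            # prediction from the last step
```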
Best regards.
Hello @hehefan, thanks again for your reply! I am indeed also considering an LSTM.
Unfortunately, I need to do more reading on what PointNet++ and KPConv export, since I am really struggling to understand what goes into them and what comes out of them.
I once again described my scenario in quite a bit of detail (here), because a lot of people have trouble understanding it.
Your feedback is gold and I appreciate it a lot.
Hi @Ademord,
PointNet++ and KPConv are point-based methods that capture the point cloud structure. They directly take a point cloud of shape N×3 as input, where N is the number of points and "3" refers to the xyz coordinates. If other point attributes are available, they are treated as point features of shape N×M, where M is the feature dimension. For classification, they output a 1×C vector, which gives the probabilities over all classes. For segmentation, they output an N×C matrix, which gives the per-point probabilities over all classes.
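A shape-only illustration of these conventions; the random tensors merely stand in for a real backbone's outputs:

```python
import torch

N, M, C = 2048, 6, 10            # points, extra attribute dims, classes

xyz   = torch.rand(N, 3)         # point coordinates: the N x 3 input
attrs = torch.rand(N, M)         # optional per-point attributes: the N x M input

# A classification backbone maps the whole cloud to class probabilities:
cls_probs = torch.softmax(torch.rand(1, C), dim=-1)    # 1 x C
predicted_class = cls_probs.argmax(dim=-1)             # one label for the cloud

# A segmentation backbone maps each point to class probabilities:
seg_probs = torch.softmax(torch.rand(N, C), dim=-1)    # N x C
per_point_labels = seg_probs.argmax(dim=-1)            # one label per point
```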
PointNet++ and KPConv obtain the global representation in a hierarchical manner. They design basic modules or operations to capture the spatial structure of a local area; these basic modules are their main contributions. Specifically, they first select/sub-sample some representative points from the point cloud. Then, they search for the neighbours of each representative point. Each representative point and its neighbours constitute a local area. Finally, they capture the structure of each local area with their own logic. In this way, an N×3 point cloud with N×M features is encoded as N′×3 with N′×M′ features, where N′ is the number of representative points (local areas) and M′ is the new feature dimension, usually greater than M.
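A rough numpy sketch of that hierarchical step, i.e., picking representative points by farthest point sampling and grouping their neighbours into local areas; the per-area encoding itself differs between PointNet++ and KPConv and is omitted here, and the sample count and radius are arbitrary.

```python
import numpy as np

def farthest_point_sampling(xyz, n_samples):
    """Pick n_samples representative points spread over the cloud."""
    idx = [np.random.randint(len(xyz))]
    dist = np.full(len(xyz), np.inf)
    for _ in range(n_samples - 1):
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[idx[-1]], axis=1))
        idx.append(int(dist.argmax()))
    return np.array(idx)

def group_local_areas(xyz, n_samples=512, radius=0.2):
    """Each representative point and its neighbours form one local area."""
    centres = farthest_point_sampling(xyz, n_samples)
    areas = []
    for c in centres:
        neighbours = np.where(np.linalg.norm(xyz - xyz[c], axis=1) < radius)[0]
        areas.append(neighbours)     # indices of the points in this local area
    # Each area would then be encoded into an M'-dim feature by the backbone,
    # turning the N x 3 cloud into an N' x 3 / N' x M' representation.
    return centres, areas
```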
Hopefully, these explanations could help you understand PointNet++ and KPConv.
Best regards.
Hi @hehefan, I found a way to simplify my pipeline by a lot and will come back to this in around 2 weeks. Thanks a lot for your feedback and your continued support.