MilesCranmer / symbolic_deep_learning

Code for "Discovering Symbolic Models from Deep Learning with Inductive Biases"


Training Data Generation for Dark Matter Cosmology Example

juliareuter opened this issue

Dear @MilesCranmer and team,
Thank you for providing this great framework in an open-source way. Since your framework showed great performance on the dark matter cosmology example, I have some questions about its implementation details, which were not clear to me after reading the paper and going through the spring-mass example available in this repository.

The full halo graph has a huge number of nodes and edges, and I am unsure how this is reflected in the model. I am aware that each halo considers neighbors within a certain radius. But how do you generate training data from this huge graph? Do you extract the features of one halo and its neighbors at a time to build the training data? Also, as far as I understand, the cosmology example does not have a time dependence as in the spring example. How is this reflected in the model?

I'm really looking forward to finding out more about your implementation details.

Thank you in advance for your time!

Dear @juliareuter,

Thanks for reaching out! For the dark matter example, let me see what I can do in terms of making intermediate data products available so you can look through it in detail.

> The full halo graph has a huge number of nodes and edges, and I am unsure how this is reflected in the model. I am aware that each halo considers neighbors within a certain radius. But how do you generate training data from this huge graph? Do you extract the features of one halo and its neighbors at a time to build the training data?

Since there is only one message passing step, and no global pooling step, you can slice the graph into many examples of (center halo, {neighboring halos}), with edges connecting them. Thus, even though the entire graph has a very large number of nodes and edges, we only need to consider tiny subgraphs for each mini-batch.
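As a rough sketch of what this slicing could look like (the function and variable names here are illustrative, not the actual pipeline): for each center halo, collect the neighbors within a radius and emit a tiny subgraph whose edges all point from neighbors into the center.

```python
import numpy as np

def extract_subgraph(positions, features, center, radius):
    """Build one (center halo, {neighbors}) training subgraph.

    Returns X (features of center + neighbors, center first) and
    edge_index (2 x n_edges), with every edge directed neighbor -> center.
    Illustrative sketch only; not the paper's actual data pipeline.
    """
    # Distances from the chosen center halo to all halos
    d = np.linalg.norm(positions - positions[center], axis=1)
    # Neighbors within the radius, excluding the center itself
    mask = (d < radius) & (np.arange(len(positions)) != center)
    neighbors = np.where(mask)[0]
    # Local node ordering: center gets index 0, neighbors follow
    nodes = np.concatenate(([center], neighbors))
    X = features[nodes]
    # One directed edge per neighbor into the center (local index 0)
    src = np.arange(1, len(nodes))
    dst = np.zeros(len(nodes) - 1, dtype=int)
    edge_index = np.stack([src, dst])
    return X, edge_index

# Toy data: 1000 halos in a periodic-box-like volume, 4 features each
rng = np.random.default_rng(0)
pos = rng.uniform(0, 10, size=(1000, 3))
feat = rng.normal(size=(1000, 4))
X, edge_index = extract_subgraph(pos, feat, center=0, radius=2.0)
```

A mini-batch is then just a collection of these small subgraphs, so the full halo graph never needs to fit through the network at once.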

> Also, as far as I understand, the cosmology example does not have a time dependence as in the spring example. How is this reflected in the model?

Both the cosmology example and the spring example are treated as regression problems. In the spring example, one tries to predict the acceleration (or state change), and in the cosmology example, you try to predict the overdensity. So the target doesn't need to be a dynamical quantity; any regression problem will do.
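To make the regression framing concrete, here is a minimal numpy sketch of one message-passing step that produces a single scalar (overdensity-like) prediction for the center node of a subgraph. The linear weights and layer sizes are placeholders for the paper's trained MLPs, chosen only to show the data flow:

```python
import numpy as np

rng = np.random.default_rng(1)
n_feat, n_msg = 4, 8
# Stand-ins for the edge and node models (linear maps instead of MLPs)
W_edge = rng.normal(size=(2 * n_feat, n_msg))   # edge model: (receiver, sender) -> message
W_node = rng.normal(size=(n_feat + n_msg, 1))   # node model: (features, pooled) -> scalar

def predict_center(X, edge_index):
    """One message-passing step; returns a scalar prediction for node 0."""
    src, dst = edge_index                         # here every dst is 0 (the center)
    # Concatenate receiver and sender features for each edge
    pairs = np.concatenate([X[dst], X[src]], axis=1)
    messages = np.tanh(pairs @ W_edge)            # one message per edge
    pooled = messages.sum(axis=0)                 # sum-pool incoming messages
    # Node model maps (center features, pooled messages) to one number
    return (np.concatenate([X[0], pooled]) @ W_node).item()

X = rng.normal(size=(5, n_feat))                  # center halo + 4 neighbors
edge_index = np.array([[1, 2, 3, 4], [0, 0, 0, 0]])
y_hat = predict_center(X, edge_index)
```

Training then minimizes a standard regression loss between `y_hat` and the true overdensity of the center halo, with no time axis involved.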

Let me know if this answers some of your questions! I will follow up eventually with the intermediate data products for the cosmology example (if you are still interested).

Cheers,
Miles

Dear @MilesCranmer,

Thank you for your prompt reply, it helps a lot.

> In the spring example, one tries to predict the acceleration (or state change), and in the cosmology example, you try to predict the overdensity.

While in the spring example, acceleration is predicted for each node (or mass) involved, in the halo example more nodes are involved but only one feature (the overdensity) of the center node is predicted (correct me if I'm wrong). This is probably where I get confused. For a better understanding, I would appreciate it if you could share some intermediate data products for the cosmology example. It would be enough to have access to one training sample, such as (X, y, edge_index).
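To state my current understanding of one such sample concretely, here is a hypothetical sketch of the shapes I would expect (the sizes are made up; this is only how I picture it, not the actual data):

```python
import numpy as np

# Hypothetical shapes for one cosmology training sample, assuming the
# center halo sits at local index 0 and all edges point into it.
n_neighbors, n_feat = 12, 4
X = np.zeros((1 + n_neighbors, n_feat))                   # center halo + neighbors
edge_index = np.stack([np.arange(1, 1 + n_neighbors),     # sources: neighbors
                       np.zeros(n_neighbors, dtype=int)]) # targets: center (0)
y = np.zeros((1,))                                        # single scalar target: overdensity
```

In contrast, my understanding is that a spring-example sample would have a per-node target of shape (n_nodes, 2) or (n_nodes, 3) for the accelerations.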

Cheers, Julia