kayjan / bigtree

Tree Implementation and Methods for Python, integrated with list, dictionary, pandas and polars DataFrame.

Home Page: https://bigtree.readthedocs.io

ValueError: Unable to determine root node

sailfish009 opened this issue · comments

Describe the issue
Hi, I want to generate a tree from a list of about 3 million rows of parent-child relationships.
To keep the example as close to the original data as possible, the parent and child identifiers below were generated with the pwgen password utility.

list = [
    ['IESH3EJI', 'MAE2YAKI'],
    ['MAE2YAKI', 'KAEMI2SI'],
    ['KAEMI2SI', 'NAINGE2H'],
    ['GEI0WOHP', 'OB5NOO7L'],
    ['OHSUYAI1', 'DOOPAH8E']
]

This only seems to work when there is a single root. My data has multiple roots, so I get an error.

from bigtree import list_to_tree_by_relation

list_ok = [
    ['a', 'b'],
    ['b', 'c'],
    ['c', 'd'],
    ['a', 'e'],
    ['e', 'f']
]

root = list_to_tree_by_relation(list_ok)
root.show()

###########################

list_error = [
    ['a', 'b'],
    ['b', 'c'],
    ['e', 'd']
]

root = list_to_tree_by_relation(list_error)
root.show()

Traceback (most recent call last):
  File "/w/test.py", line 47, in <module>
    root = list_to_tree_by_relation(list_error)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/w/lib/python3.11/site-packages/bigtree/utils/exceptions.py", line 90, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/w/lib/python3.11/site-packages/bigtree/tree/construct.py", line 593, in list_to_tree_by_relation
    return dataframe_to_tree_by_relation(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/w/lib/python3.11/site-packages/bigtree/tree/construct.py", line 944, in dataframe_to_tree_by_relation
    raise ValueError(
ValueError: Unable to determine root node
Possible root nodes: ['e', 'a']

Environment
  • Platform: [Ubuntu]
  • Python version: 3.11.6
  • bigtree version: 0.14.5

To Reproduce
If you run the sample code mentioned above, you can see it right away.

Expected behaviour


[works but only one root]
a
├── b
│   └── c
│       └── d
└── e
    └── f


[I want this, more than one root]
a
└── b
    └── c
e
└── d

Additional context
None

Hi, thanks for your question. By the nature of hierarchical trees, there can only be one root node, so multiple root nodes are not possible. One workaround is to identify the "multiple root nodes" in your data and attach them all to a single dummy root node.

For instance, if the error reports Possible root nodes: ['e', 'a'], you can add ["ROOT", "e"] and ["ROOT", "a"] as additional parent-child relationships so that there is only a single root node. To get your trees with multiple root nodes back, access them through root.children (the root node is now a dummy node).
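As a rough illustration of that workaround, the sketch below appends one extra pair per reported root. The helper name add_dummy_root and the check for a name clash are assumptions made for this example and are not part of bigtree; the roots are copied by hand from the error message.

```python
def add_dummy_root(relations, roots, dummy="ROOT"):
    """Return a new relation list with one [dummy, root] pair per root.

    Hypothetical helper for illustration; it is not a bigtree function.
    """
    # Refuse to proceed if the dummy name already exists in the data,
    # because that would silently merge it with a real node.
    if any(dummy in pair for pair in relations):
        raise ValueError(f"{dummy!r} already appears in the data")
    return list(relations) + [[dummy, root] for root in roots]


list_error = [["a", "b"], ["b", "c"], ["e", "d"]]
# Roots taken from the "Possible root nodes" line of the error message.
list_fixed = add_dummy_root(list_error, ["e", "a"])
print(list_fixed)
# [['a', 'b'], ['b', 'c'], ['e', 'd'], ['ROOT', 'e'], ['ROOT', 'a']]
```

Passing list_fixed to list_to_tree_by_relation should then give a single tree whose root.children are the original roots.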

Hope this helps!

@kayjan Thank you for your response. I connected both the nodes I believe are real roots and the nodes that only look like roots to a hypothetical root. There are around 500,000 of these (210,000 + 270,000). When I tested about 1,000 of them against the whole data set as a check, the tree was generated normally and took only a few seconds. When I ran it on the whole data set, I got a runtime error (RecursionError), which I worked around by raising the recursion depth limit with the following code. However, it now takes an unrealistically long time: I started it this morning and it is still running.

import sys
sys.setrecursionlimit(10**6)

Is there a way to build the roughly 200,000 trees separately, starting only from the actual roots among the root candidates that bigtree reports and following each one from parent to child (Parent -> Child -> Child -> ...), and then add a fictitious ROOT to join them into a single tree? The function I am currently using (list_to_tree_by_relation) does not seem to offer a suitable way to do this. If generating 200,000 trees in the forward direction takes time, I think it could be parallelized.

# Pseudocode
list_result = []
for root in list_root:
    partial_tree = make_partial_tree(root, df_parent_child)  # forward direction: parent -> child -> child ...
    list_result.append(partial_tree)

Wow, that's a huge dataset you've got there. Is it possible to send me ~200,000 of the rows so I can do some testing (you can host the file online / email me)? I can think about how to make list_to_tree_by_relation faster 🤔. If you want to parallelize, you can use Python's multiprocessing library, which is well suited to CPU-bound work.
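In the meantime, here is a minimal sketch of the "forward direction" idea using only the standard library. It is not how bigtree builds trees internally; build_children_index and collect_subtree are hypothetical helpers for this example. The traversal uses an explicit queue instead of recursion, so it does not depend on the recursion limit, and it skips nodes it has already seen in case the data contains cycles.

```python
from collections import defaultdict, deque


def build_children_index(relations):
    """Map each parent to the list of its direct children."""
    index = defaultdict(list)
    for parent, child in relations:
        index[parent].append(child)
    return index


def collect_subtree(root, index):
    """Return the [parent, child] pairs reachable from root, breadth-first."""
    edges = []
    seen = {root}
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for child in index.get(parent, ()):
            if child in seen:  # guard against cycles and repeated nodes
                continue
            seen.add(child)
            edges.append([parent, child])
            queue.append(child)
    return edges


relations = [["a", "b"], ["b", "c"], ["e", "d"]]
index = build_children_index(relations)
list_result = {root: collect_subtree(root, index) for root in ["a", "e"]}
print(list_result)
# {'a': [['a', 'b'], ['b', 'c']], 'e': [['e', 'd']]}
```

Each root is processed independently of the others, so the per-root calls are the natural unit to hand to worker processes if a single process turns out to be too slow.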

@kayjan Sorry about that. I can't share the data due to security regulations. If there were a function that only checks for roots and returns them as a list, I could run some more tests, but currently the possible root list is only printed on the screen.

There isn't a separate function for that; the check is embedded within the list_to_tree_by_relation function. It takes the set of parents minus the set of children to find the possible root nodes.
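If you want the same check as a standalone list, it can be sketched in a few lines. find_possible_roots is a hypothetical name used for this example, not a bigtree function, and the result is sorted only so that the output is deterministic.

```python
def find_possible_roots(relations):
    """Return the nodes that appear as a parent but never as a child."""
    parents = {parent for parent, _ in relations}
    children = {child for _, child in relations}
    return sorted(parents - children)


list_error = [["a", "b"], ["b", "c"], ["e", "d"]]
print(find_possible_roots(list_error))
# ['a', 'e']
```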