cafejojo / schaapi

🐑 Safe Changes for APIs

Home Page:https://cafejojo.org/schaapi

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

PrefixSpan can't find certain patterns

JSAbrahams opened this issue · comments

Currently Prefix-Span won't generate all possible sub-sequences of a pattern, which means that it is not a correct implementation of the algorithm as put forward by Pei et al.

To illustrate, say we have as output sequence [node1, node2, node3, node4, node5]. In addition to the sequence itself, Prefix-Span will find all sub-sequences except:

  • [node2, node3, node4, node5]
  • [node2, node3, node4]
  • [node3, node4, node5]

It will find entire sequences, but not all sub-sequences. So at the moment, it is sufficient to generate test cases. However this should eventually be fixed as it is not a proper implementation of the algorithm, and does not function as it should/how we expect it to.

I've experimented a bit further with longer and smaller paths, and this seems to be the pattern (no pun intended): The algorithm finds all subpatterns of lengths 1 and 2, without issues. For all subpattern-lengths greater than 2, it only finds the subpatterns starting on the first node of the frequent sequence (as opposed to those that start somewhere in the middle). This means that it finds exactly one of length 3, 4, and so on. I haven't found a solution yet, but this might help :)

I had found something along those lines as well, but thank you for sharing! I will look into it further soon.

It seems that the issue is that the condition pathContainsPrefix checks for both the prefix plus the item and the last item of the prefix plus the item. However the body of the if makes no distinction between these cases meaning the new prefix is always the entire prefix plus the item and never the last item of the prefix plus the item.

If I modify parts of the PrefixSpan algorithm to read like this:

    private fun runAlgorithm(prefix: List<Node> = emptyList(), projectedPaths: Collection<List<Node>> = allPaths) {
        frequentItems.forEach { frequentItem ->
            if (projectedPaths.any { pathContainsPrefix(it, prefix, frequentItem) || pathContainsLastOfPrefix(it, prefix, frequentItem) }) {
                val newPrefix = prefix + frequentItem
                frequentSequences += newPrefix

                runAlgorithm(newPrefix, extractSuffixes(prefix, allPaths))
            }
            if (projectedPaths.any { pathContainsLastOfPrefix(it, prefix, frequentItem) }) {
                val newPrefix = prefix.last() + frequentItem
                frequentSequences += newPrefix

                runAlgorithm(newPrefix, extractSuffixes(prefix, allPaths))
            }
        }
    }

    private fun pathContainsPrefix(path: List<Node>, prefix: List<Node>, frequentItem: Node) =
        pathContainsSequence(path, prefix + frequentItem, comparator)

    private fun pathContainsLastOfPrefix(path: List<Node>, prefix: List<Node>, frequentItem: Node) =
        prefix.isNotEmpty() && pathContainsSequence(path, listOf(prefix.last(), frequentItem), comparator)

... we get 14 out of 15 patterns. The pattern we're missing for length 5 is (2, 3, 4, 5). These are more patterns but unfortunately the change also produces a StackOverflow exception elsewhere. I haven't found the cause yet...

Hmm this does seem weird. It seems to be that the second if will never be true if the first if wasn't true. And the condition of the first if is functionally identical to the current one, with the only difference being that the function has been split in two (pathContainsPrefix and pathContainsLastofPrefix) and that the or operator has been lifted out.

I can't really figure out why more patterns are found here.

No, there is another difference. Read this line in the second if carefully: val newPrefix = prefix.last() + frequentItem

As PrefixSpan is no longer part of this repository, I have created an issue there and I'll this one.