angeloskath / php-nlp-tools

Natural Language Processing Tools in PHP

Implementation Question

hskrasek opened this issue

I came across this project while trying to solve a problem: clustering similar "documents". The catch is that these documents are array representations of "Members" from a database, something like:

array(18) { 
    ["username"]=> string(6) "oprahw" 
    ["name"]=> string(7) "Winfrey" 
    ["firstname"]=> string(5) "Oprah" 
    ["title"]=> string(0) "" 
    ["gender"]=> string(1) "3" 
    ...
}

Can your project cluster "documents" of the above sort? Or is it geared for a different purpose?

Hi,

You can cluster any type of data, so yes.

What you need to do is create a FeatureFactory (see FeatureFactoryInterface) that builds a feature array from your data. The feature arrays can then be used to calculate similarity values, either with the metrics already provided or with a custom metric that best fits your use case.
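To make the feature-array idea concrete, here is a minimal sketch in plain PHP. It does not use the library's classes: `memberFeatures` is a hypothetical helper, and skipping empty values is my own assumption rather than something the library requires. In the library, the same logic would presumably live in a class implementing FeatureFactoryInterface; check the documentation for the exact method signature.

```php
<?php
// Hypothetical helper (not part of php-nlp-tools): turn one member row
// into a flat list of "key=value" feature strings.
function memberFeatures(array $member): array
{
    $features = [];
    foreach ($member as $key => $value) {
        // Assumption: skip empty fields so that two members do not look
        // similar merely because both left the same field blank.
        if ($value === '' || $value === null) {
            continue;
        }
        $features[] = $key . '=' . $value;
    }
    return $features;
}

// A few fields from the member shown in the question.
$member = [
    'username'  => 'oprahw',
    'name'      => 'Winfrey',
    'firstname' => 'Oprah',
    'title'     => '',
    'gender'    => '3',
];

$features = memberFeatures($member);
```

Prefixing each value with its key keeps identical values in different fields apart, so a `name` of "Oprah" and a `firstname` of "Oprah" produce two distinct features.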

For instance, you could use as features simply the concatenation of each key and its value in your array, then use Cosine Similarity as the distance metric with any of the provided clusterers. Or, if you want a more specific distance metric, you can implement a custom distance and use the DataAsFeatures feature factory.
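As a rough illustration of what cosine similarity does with such key/value features, the sketch below computes it by hand in plain PHP. It is not the library's CosineSimilarity class; the function name and the second member are made up for the example.

```php
<?php
// Cosine similarity between two feature lists, each treated as a bag of
// features (feature => occurrence count). Illustrative only.
function cosineSimilarity(array $a, array $b): float
{
    $countsA = array_count_values($a);
    $countsB = array_count_values($b);

    // Dot product over the features the two lists share.
    $dot = 0;
    foreach ($countsA as $feature => $count) {
        if (isset($countsB[$feature])) {
            $dot += $count * $countsB[$feature];
        }
    }

    $square = function ($c) {
        return $c * $c;
    };
    $normA = sqrt(array_sum(array_map($square, $countsA)));
    $normB = sqrt(array_sum(array_map($square, $countsB)));

    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // an empty feature list is treated as similar to nothing
    }
    return $dot / ($normA * $normB);
}

$oprah = ['name=Winfrey', 'firstname=Oprah', 'gender=3'];
$other = ['name=Smith', 'firstname=Jane', 'gender=3']; // hypothetical second member

$similarity = cosineSimilarity($oprah, $other); // one shared feature out of three
$distance   = 1 - $similarity;                  // a clusterer works on distances
```

With key/value features like these, two members count as close only when many fields match exactly. If near matches should count too (for example, similar names), that is the case where a custom distance is the better fit.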

To sum up, see the documentation topics on Documents, FeatureFactories and Clustering.