A library to generate and analyse ngram graphs of text, inspired by the work of George Papadakis, George Giannakopoulos and Georgios Paliouras.
Installation :
pip install ngram-graphs
Usage :
import ngram_graphs
docs = ['hello world', 'hell of a hello world', 'NLP rocks']
# Create a generator for bigram graphs
# Kind can be either 'igraph' or 'networkx' (default and recommended)
generator = ngram_graphs.Generator(n=2, kind='networkx')
# Change n for trigrams
generator.set_n(3)
# Generate the graphs
# By default the generator uses character-level ngrams
graphs = generator.generate_text_graphs(docs, weight=1.0)
# Generate a model graph from the graphs of all your documents
# It will contain all the nodes and edges in your documents
# The lr (learning rate) parameter controls how far each edge weight in the model graph
# moves toward the weight of the same edge in each new document graph, according to this formula :
# weight = current_weight + ((new_weight - current_weight) * lr)
# where:
# - current_weight is the current weight of the edge in the model graph
# - new_weight is the weight of the edge in the new document graph being added to the model graph
model_graph = ngram_graphs.utils.generate_model_graph(graphs, lr=0.5)
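The update formula can be checked by hand with a small sketch. It is an illustration only, not the library's implementation: graphs are reduced to plain dictionaries mapping an edge to its weight, `update_weight` and `merge_edges` are hypothetical helper names, and copying unseen edges into the model unchanged is an assumption.

```python
def update_weight(current_weight, new_weight, lr):
    """Move current_weight toward new_weight by the fraction lr."""
    return current_weight + (new_weight - current_weight) * lr


def merge_edges(model_edges, doc_edges, lr):
    """Merge one document's edge weights into a copy of the model's edge weights.

    Assumption: an edge that is not yet in the model is copied over as-is.
    """
    merged = dict(model_edges)
    for edge, new_weight in doc_edges.items():
        if edge in merged:
            merged[edge] = update_weight(merged[edge], new_weight, lr)
        else:
            merged[edge] = new_weight
    return merged


model = {("he", "el"): 1.0, ("el", "ll"): 1.0}
doc = {("he", "el"): 3.0, ("ll", "lo"): 2.0}

merged = merge_edges(model, doc, lr=0.5)
print(merged[("he", "el")])  # 1.0 + (3.0 - 1.0) * 0.5 = 2.0
```

With lr=0 the model weight never changes, and with lr=1 it is replaced by the weight from the new document graph; values in between blend the two.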
Different ways to get ngrams :
# Generate the graphs
# Word ngrams using split()
graphs = generator.generate_text_graphs(docs, weight=1.0, wordgram=True)
# Word ngrams using split(sep) with a single char separator
graphs = generator.generate_text_graphs(docs, weight=1.0, wordgram=True, sep=' ')
# ngrams using re.split(sep, doc) with a regular expression
# (use a raw string so that \W is not treated as a string escape)
graphs = generator.generate_text_graphs(docs, weight=1.0, sep=r'\W+')
# ngrams using a custom function. The function must take str as input and return List[str]
# The ngrams will be constructed from the returned list so the function must not construct the ngrams itself
graphs = generator.generate_text_graphs(docs, weight=1.0, sep=lambda x: x.split())
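To make the difference between these options concrete, here is a small sketch of what each tokenization produces before any graph is built. It uses only the standard library; the `ngrams` and `tokenize` helpers are hypothetical names written for this example, and the library may represent its ngrams differently.

```python
import re


def ngrams(tokens, n):
    """Return every run of n consecutive items as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tokenize(text):
    """Custom tokenizer: returns tokens only, never the ngrams themselves."""
    return text.lower().split()


doc = "hell of a hello world"

# Character level: the string itself is the token sequence.
char_bigrams = ngrams(doc, 2)

# Word level with str.split(): any run of whitespace separates tokens.
word_bigrams = ngrams(doc.split(), 2)

# Regular expression: re.split can leave empty strings at the ends, so drop them.
regex_tokens = [t for t in re.split(r"\W+", "hello, world!") if t]

# Custom function: the ngrams are built from the list it returns.
custom_bigrams = ngrams(tokenize("NLP rocks"), 2)

print(char_bigrams[:3])  # [('h', 'e'), ('e', 'l'), ('l', 'l')]
print(word_bigrams)      # [('hell', 'of'), ('of', 'a'), ('a', 'hello'), ('hello', 'world')]
print(regex_tokens)      # ['hello', 'world']
print(custom_bigrams)    # [('nlp', 'rocks')]
```

The custom function only decides where the token boundaries are, which is why it must return a flat list of strings.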
Compare the graphs :
print("How similar are the graphs' sizes ?")
print("SS : {}".format(ngram_graphs.size_similarity(graphs[0], graphs[1])))
print("SS : {}".format(ngram_graphs.size_similarity(graphs[1], graphs[2])))
print()
print("How similar are the graphs' edges ?")
print("CS : {}".format(ngram_graphs.containment_similarity(graphs[0], graphs[1])))
print("CS : {}".format(ngram_graphs.containment_similarity(graphs[1], graphs[2])))
print()
print("How similar are the graphs' edges taking weighting into account ?")
print("VS : {}".format(ngram_graphs.value_similarity(graphs[0], graphs[1])))
print("VS : {}".format(ngram_graphs.value_similarity(graphs[1], graphs[2])))
print()
print("How similar are the graphs' edges taking weighting into account and factoring out size ?")
print("NVS : {}".format(ngram_graphs.normalized_value_similarity(graphs[0], graphs[1])))
print("NVS : {}".format(ngram_graphs.normalized_value_similarity(graphs[1], graphs[2])))
Output :
How similar are the graphs' sizes ?
SS : 0.5294117647058824
SS : 0.4117647058823529
How similar are the graphs' edges ?
CS : 0.5294117647058824
CS : 0.0
How similar are the graphs' edges taking weighting into account ?
VS : 0.47058823529411764
VS : 0.0
How similar are the graphs' edges taking weighting into account and factoring out size ?
NVS : 0.8888888888888888
NVS : 0.0
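For readers who want to see what these four measures compute, the sketch below follows the definitions commonly given in the ngram graph literature, with graphs reduced to dictionaries mapping an edge to its weight. It is an assumption-laden illustration, not the library's code: the function bodies are written for this example, and the library may normalise differently (the containment denominator in particular), so its numbers need not match this sketch.

```python
def size_similarity(g1, g2):
    """Ratio of the smaller edge count to the larger one."""
    if not g1 or not g2:
        return 0.0
    return min(len(g1), len(g2)) / max(len(g1), len(g2))


def containment_similarity(g1, g2):
    """Share of edges in common, relative to the smaller graph (assumed denominator)."""
    if not g1 or not g2:
        return 0.0
    shared = sum(1 for edge in g1 if edge in g2)
    return shared / min(len(g1), len(g2))


def value_similarity(g1, g2):
    """Like containment, but each shared edge counts min(w1, w2) / max(w1, w2)."""
    if not g1 or not g2:
        return 0.0
    total = sum(min(w, g2[e]) / max(w, g2[e]) for e, w in g1.items() if e in g2)
    return total / max(len(g1), len(g2))


def normalized_value_similarity(g1, g2):
    """Value similarity with the size difference factored out: VS / SS."""
    ss = size_similarity(g1, g2)
    return value_similarity(g1, g2) / ss if ss else 0.0


g1 = {("a", "b"): 1.0, ("b", "c"): 2.0}
g2 = {("a", "b"): 2.0, ("b", "c"): 2.0, ("c", "d"): 1.0, ("d", "e"): 1.0}

print(size_similarity(g1, g2))              # 2 / 4 = 0.5
print(containment_similarity(g1, g2))       # 2 / 2 = 1.0
print(value_similarity(g1, g2))             # (0.5 + 1.0) / 4 = 0.375
print(normalized_value_similarity(g1, g2))  # 0.375 / 0.5 = 0.75
```

In the output above, the first NVS value is approximately the first VS value divided by the first SS value, which is consistent with the NVS = VS / SS relationship used in this sketch.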