nicolashernandez / PyRATA

"Python Rule-based feAture sTructure Analysis" or "Python Rule-bAsed Text Analysis"

PyRATA

PyRATA is an acronym which stands for "Python Rule-based feAture sTructure Analysis".

Features

PyRATA

  • provides regular expression (re) matching methods on a more complex structure than a list of characters (string), namely a sequence of feature sets (i.e. a list of dict in Python jargon);
  • is independent of the information encapsulated in the features and consequently can work with word features, sentence features, calendar event features... Indeed, PyRATA is not dedicated solely to processing textual data;
  • offers an API similar to that of the Python re module, so as not to confuse Python re users;
  • in addition to the re methods, provides edit methods to substitute, update or extend (sub-parts of) the data structure itself (this process can be called annotation);
  • defines a pattern grammar whose syntax follows the Perl regex de facto standard;
  • has a matching engine based on Gui Guan's implementation [1] of Thompson's algorithm for converting Regular Expressions (RE) to Non-deterministic Finite Automata (NFA) and running them in linear time, O(n) [2];
  • is implemented in Python 3;
  • can draw the NFA neatly to a PDF file;
  • can output the actual matches as Deterministic Finite Automata (DFA);

  • uses the PLY implementation of the lex and yacc parsing tools for Python (version 3.10), the sympy library for symbolic evaluation of logical expressions, and the graph_tool library for drawing PDF files. As of v0.5.1 (https://github.com/nicolashernandez/PyRATA/commit/19d0c33347ce3d1355cfdb09ba4e7b1dd9500839), the sympy library was removed and replaced by a home-made implementation for performance reasons;
  • is released under the Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0), which allows you to do what you like with the software, as long as you include the required notice;
  • is published on PyPI;
  • is fun and easy to use to explore data for research studies, solve deterministic problems, formulate expert knowledge in a declarative way, quickly prototype models and generate training data for Machine Learning (ML) systems, extract ML features, augment ML models...
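To give a feel for what matching a pattern against a sequence of feature sets by NFA simulation looks like, here is a minimal sketch. It is not PyRATA's engine and it does not perform Thompson's construction: the automaton is hand-built for a hypothetical pattern ("one or more adjectives followed by a plural noun"), and the names `has`, `step` and `fullmatch` are invented for this illustration. It only shows why the simulation is linear in the input length: each element is read once while a set of active states is updated.

```python
def has(name, value):
    """Build a predicate that tests one feature of a feature set (a dict)."""
    return lambda element: element.get(name) == value

# Hand-built NFA for "one or more JJ, then one NNS" (illustrative only).
# Each state maps to a list of (predicate, target state) transitions.
TRANSITIONS = {
    0: [(has("pos", "JJ"), 1)],
    1: [(has("pos", "JJ"), 1), (has("pos", "NNS"), 2)],
}
START, ACCEPTING = 0, {2}

def step(current, element):
    """Advance every active state over one input element."""
    return {
        target
        for state in current
        for predicate, target in TRANSITIONS.get(state, [])
        if predicate(element)
    }

def fullmatch(sequence):
    """Return True if the whole sequence is accepted by the NFA."""
    current = {START}
    for element in sequence:      # each element is read exactly once
        current = step(current, element)
        if not current:           # no active state left: fail early
            return False
    return bool(current & ACCEPTING)

data = [
    {"raw": "fast", "pos": "JJ"},
    {"raw": "regular", "pos": "JJ"},
    {"raw": "expressions", "pos": "NNS"},
]
print(fullmatch(data))       # True
print(fullmatch(data[:2]))   # False: no plural noun at the end
```

Because the number of states is fixed by the pattern, the work per element is bounded, which is where the O(n) behaviour comes from.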

Quick overview (in console)

First install PyRATA (available on PyPI)

sudo pip3 install pyrata

Users of v0.4.0 and v0.4.1: check how to solve the "ImportError: No module named graph_tool" issue.

Run python

python3

Then import the main PyRATA regular expression module:
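The original listing did not survive here. As a hedged sketch, the import below assumes the module path `pyrata.re` and the alias `pyrata_re`; both are assumptions to check against the user guide. The guard is only there so the snippet also runs on a machine where PyRATA is not installed.

```python
try:
    # Assumed module path and alias; verify against the PyRATA user guide.
    import pyrata.re as pyrata_re
except ImportError:
    pyrata_re = None
    print("PyRATA is not installed; try: pip3 install pyrata")
```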

Let's work with a sentence as data:

Apply whatever processing you want to the data... Your analysis results should be represented in the PyRATA data structure format, a list of dict, i.e. a sequence of feature sets, each feature having a name and a value. Here is a possible example of such a structure after tokenization and POS tagging:
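The original listing is missing at this point, so the following is an illustrative reconstruction rather than the author's exact example: the sentence and the Penn Treebank style tags are assumptions chosen to agree with the four adjectives discussed further down.

```python
# Illustrative sentence (assumed, not recovered from the original page).
sentence = "It is fast easy and funny to write regular expressions with PyRATA"

# One dict (feature set) per token; every feature value is a string.
data = [
    {"raw": "It", "pos": "PRP"},
    {"raw": "is", "pos": "VBZ"},
    {"raw": "fast", "pos": "JJ"},
    {"raw": "easy", "pos": "JJ"},
    {"raw": "and", "pos": "CC"},
    {"raw": "funny", "pos": "JJ"},
    {"raw": "to", "pos": "TO"},
    {"raw": "write", "pos": "VB"},
    {"raw": "regular", "pos": "JJ"},
    {"raw": "expressions", "pos": "NNS"},
    {"raw": "with", "pos": "IN"},
    {"raw": "PyRATA", "pos": "NNP"},
]

print(len(data))  # 12
```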

To demonstrate how easily this data structure can be generated, we simulate your processing by simply using some nltk processing, as follows:
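As a sketch of that conversion, the snippet below turns (word, tag) pairs into the list of dict. The pairs are hard-coded so the example runs without NLTK; with NLTK and its models installed, the same shape could be obtained from `nltk.pos_tag(nltk.word_tokenize(sentence))`. The variable name `tagged` is only illustrative.

```python
# (word, tag) pairs in the shape produced by an NLTK-style POS tagger.
# Hard-coded here so that the snippet has no external dependency.
tagged = [("It", "PRP"), ("is", "VBZ"), ("fast", "JJ"), ("easy", "JJ")]

# One feature set per token: 'raw' is the surface form, 'pos' the tag.
data = [{"raw": word, "pos": pos} for (word, pos) in tagged]

print(data[2])  # {'raw': 'fast', 'pos': 'JJ'}
```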

There is no requirement on the names of the features. The value type is String. In the previous code, you can see that the names raw and pos have been arbitrarily chosen to mean, respectively, the surface form of a word and its part of speech.

At this point you can use the available regular expression methods to explore the data. Let's say you want to search for all the adjectives in the sentence. Conveniently, there is a feature which specifies the part of speech of tokens, pos, and the value of pos which stands for adjectives is JJ. Your pattern will be:

To find all the non-overlapping matches of pattern in data, you will use the findall method:

And you get the following output:

In Python, lists are marked by square brackets and dicts by curly brackets. Elements of a list or dict are separated by commas. Feature names are quoted, and so are values when they are strings. Names and values are separated by a colon.

Here you can read an ordered list of four matches, each one corresponding to one specific adjective of the sentence.
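The pattern and output listings are missing from this page, so here is a pure-Python stand-in that mimics the described result shape (an ordered list of matches, each match being a list of feature sets). It is not PyRATA's implementation: `findall_single_step` and `parse_step` are hypothetical helpers, the `name="value"` pattern syntax is an assumption, and only a single-step pattern is handled.

```python
import re

data = [
    {"raw": "It", "pos": "PRP"}, {"raw": "is", "pos": "VBZ"},
    {"raw": "fast", "pos": "JJ"}, {"raw": "easy", "pos": "JJ"},
    {"raw": "and", "pos": "CC"}, {"raw": "funny", "pos": "JJ"},
    {"raw": "to", "pos": "TO"}, {"raw": "write", "pos": "VB"},
    {"raw": "regular", "pos": "JJ"}, {"raw": "expressions", "pos": "NNS"},
    {"raw": "with", "pos": "IN"}, {"raw": "PyRATA", "pos": "NNP"},
]

def parse_step(pattern):
    """Split a single-step pattern such as pos="JJ" into (name, value)."""
    m = re.fullmatch(r'\s*(\w+)\s*=\s*"([^"]*)"\s*', pattern)
    if m is None:
        raise ValueError("unsupported pattern: %r" % pattern)
    return m.group(1), m.group(2)

def findall_single_step(pattern, data):
    """Return every element satisfying the step, each wrapped as a match."""
    name, value = parse_step(pattern)
    return [[element] for element in data if element.get(name) == value]

pattern = 'pos="JJ"'  # assumed pattern syntax
matches = findall_single_step(pattern, data)
print([match[0]["raw"] for match in matches])
# ['fast', 'easy', 'funny', 'regular']
```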

Reference

Documentation

To go further, the next step is to have a look at the user guide.


  1. Gui Guan, "A Beautiful Linear Time Python Regex Matcher via NFA", August 19, 2014 https://www.guiguan.net/a-beautiful-linear-time-python-regex-matcher-via-nfa

  2. Thompson, K. (1968). Programming techniques: Regular expression search algorithm. Commun. ACM, 11(6):419–422, June.
