GuerrillaAnalytics / similarity

SQL Server wrapper for the SimMetrics string matching algorithms

This project provides a C# wrapper and a set of SQL Server installation scripts that make the SimMetrics string matching algorithms available in SQL Server.

  • SimMetrics was originally released on SourceForge. This project uses version 1.5 of that library. Subsequent versions were migrated to Java.
  • The C# wrapper was inspired by a blog post by Anastasios Yalanopoulos (see License below).

Descriptions of the supported fuzzy string matching functions are provided on the wiki home page.

Motivation

This project was motivated by the frequent need for fuzzy matching (approximate string matching) algorithms in data analytics and data science work. These algorithms are not built into SQL Server, and many projects do not have the time, licensing, or budget to install additional SQL Server packages such as SSIS. Furthermore, it is best to do as much data science work as possible through program code rather than manual graphical wizards, as outlined in the Guerrilla Analytics Principles. You can read more about Guerrilla Analytics in the book.

Dependencies

The project has minimal dependencies.

Installation, Configuration, Examples, and How to Contribute

Installation and configuration are controlled by an Apache Ant build file. Configure your database settings and you should be good to go.

Please see the GitHub wiki page for details.

Simple Code Example

The functions are installed under a schema named after the Similarity library version, e.g. Similarity_<Major version>_<minor version>_<patch version>.
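To check which version schema and functions are present in a given database, a catalog query along the following lines may help. This is a minimal sketch, assuming only the schema naming convention described above; it relies on the standard SQL Server catalog views sys.objects and sys.schemas and on nothing specific to this project. Whether the name match is case-sensitive depends on the database collation.

```sql
-- List user-defined functions (T-SQL and CLR) in any schema whose name
-- starts with "Similarity_". The [_] pattern matches a literal underscore.
SELECT s.name      AS schema_name,
       o.name      AS function_name,
       o.type_desc AS function_type
FROM sys.objects AS o
JOIN sys.schemas AS s
  ON s.schema_id = o.schema_id
WHERE s.name LIKE 'Similarity[_]%'
  AND o.type IN ('FN', 'FS', 'FT', 'IF', 'TF')  -- scalar, CLR scalar, CLR table-valued, inline, table-valued
ORDER BY s.name, o.name;
```

The schema name returned here is the prefix to use when calling the functions.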

To use these functions in SQL code, call the function by its schema-qualified name. For example:

SELECT SIMILARITY_1_1_0.Levenshtein('THE QUICK BROWN FOX','THE QUICK FOX')
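In practice the function is more likely to be applied to columns than to literals. The sketch below scores a few pairs of strings row by row. The inline sample data and the column names name_a and name_b are hypothetical, and the schema SIMILARITY_1_1_0 is taken from the example above, so substitute the version installed in your database. The scale of the returned value is not stated here; see the wiki for the function descriptions.

```sql
-- Hypothetical sample pairs, built inline with a VALUES table constructor
-- so that no permanent table is needed.
SELECT c.name_a,
       c.name_b,
       SIMILARITY_1_1_0.Levenshtein(c.name_a, c.name_b) AS levenshtein_score
FROM (VALUES
        ('THE QUICK BROWN FOX', 'THE QUICK FOX'),
        ('JOHN SMITH',          'JON SMITH'),
        ('ACME LTD',            'ACME LIMITED')
     ) AS c (name_a, name_b);
```

The same pattern applies to a real table: replace the VALUES block with the table name and pass the two columns to be compared.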

For more detailed examples, please see the Quick Guide on the wiki.

License

The project as a whole is released under the GPLv3.

  • The SimMetrics library was released under GPLv2 and can be downloaded from here.
  • This project was inspired by a blog post by Anastasios Yalanopoulos at http://anastasiosyal.com/. Please see that author's licence terms in the associated code files.


Languages

HTML 86.3%, C# 8.0%, C 4.4%, CSS 1.1%, JavaScript 0.2%, Batchfile 0.0%