startupradar / transformers

🛢 pipelines and transformers to turn startup domains into huge dataframes filled with training data

Home Page:https://startupradar.co

Transformers turns API data into ML-ready dataframes

StartupRadar Transformers

This Python package lets you integrate the StartupRadar API directly into your own data or machine learning pipelines. Given only a list of domains, you can build large pandas DataFrames filled with all the data available on StartupRadar.

Implemented transformers

startupradar.transformers.export

  • Creates a human-readable DataFrame for use in Excel or Google Sheets (via CSV).

startupradar.transformers.core

Transformers in this module create data from API functionality. All transformers in this module require API access.

  • LinkTransformer: Create columns for all the domains a given domain links to
  • BacklinkTransformer: Create columns for all the domains that link to the given domain
  • DomainTextTransformer: Create a text column with the homepage text of the given domain
  • BacklinkTypeCounter: Count the types of pages that link to the given domain
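
The counting idea described for BacklinkTypeCounter can be illustrated without API access. The snippet below is a minimal sketch and not the package's implementation: the `backlinks` mapping, its page-type labels, and the `count_backlink_types` helper are hypothetical stand-ins for data that would otherwise come from the StartupRadar API.

```python
from collections import Counter

# Hypothetical stand-in for API results: domain -> page types of its backlinks.
# The labels ("blog", "news") are invented for illustration only.
backlinks = {
    "example.com": ["blog", "news", "blog"],
    "example.org": ["news"],
    "example.net": [],
}


def count_backlink_types(backlinks_by_domain):
    """Return one row per domain with a count for every page type seen."""
    counters = {d: Counter(types) for d, types in backlinks_by_domain.items()}
    # Use the union of all observed types so every row has the same columns.
    columns = sorted({t for c in counters.values() for t in c})
    # A Counter returns 0 for missing keys, so absent types are filled with 0.
    return {d: {col: c[col] for col in columns} for d, c in counters.items()}


rows = count_backlink_types(backlinks)
for domain, row in rows.items():
    print(domain, row)
```

Each inner dictionary corresponds to one DataFrame row, with one column per page type.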

startupradar.transformers.basic

Transformers in this module work with DataFrames and provide useful feature generation on domains. The transformers in this module don't require the API and can be used by anyone.

  • DomainNameTransformer: Extract features from a domain name; currently only the top-level domain, e.g. com or io
  • CommonStringTransformer: Apply a CountVectorizer to find strings that are common among the passed inputs
  • ColumnPrefixTransformer: Create a DataFrame with the same column names, but prefixed with e.g. prefix_
  • CounterTransformer: Create row-wise Counter objects and distribute keys as columns
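
To make the last two descriptions more concrete, here is a minimal sketch in plain pandas of "row-wise Counter objects distributed as columns" followed by column prefixing. It does not use the package's classes; counting characters and the `char_` prefix are arbitrary choices made for illustration.

```python
from collections import Counter

import pandas as pd

# One string per row, indexed by domain (here the string is the domain itself).
names = pd.Series(
    ["startupradar.co", "karllorey.com"],
    index=["startupradar.co", "karllorey.com"],
)

# Row-wise Counter objects: count the characters of each name.
counters = names.map(Counter)

# Distribute the Counter keys as columns; keys missing in a row become 0.
counts = (
    pd.DataFrame([dict(c) for c in counters], index=names.index)
    .fillna(0)
    .astype(int)
)

# Prefix every column name (DataFrame.add_prefix is standard pandas).
prefixed = counts.add_prefix("char_")
print(prefixed[["char_a", "char_r", "char_."]])
```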

startupradar.transformers.pandas

Re-implementations of scikit-learn transformers that output pandas DataFrames instead of arrays. Anyone can use these transformers; no API key is necessary.

  • OneHotEncoderDF: One Hot Encoder outputting a dense and consistent DataFrame
  • FeatureUnionDF: Create a FeatureUnion with pd.DataFrames as input and output
  • PipelineDF: Create a pipeline that retains DataFrames and their column names
  • TfidfVectorizerDF: Adaptation of scikit-learn's TfidfVectorizer
  • CountVectorizerDF: Adaptation of scikit-learn's CountVectorizer

Upcoming

Transformers we're thinking about that may be coming soon:

  • something to leverage the similar domains endpoint
  • TF-IDF over all backlinks or (forward) links combined (domain- or URL-level)

How it works

For most transformers, you can simply pass a series of domain names as input. In the case of the DomainNameTransformer, it could look like this:

> import pandas as pd
> from startupradar.transformers.basic import DomainNameTransformer
>
> domains = ["loreyventures.com", "startupradar.co", "karllorey.com"]
> domains_series = pd.Series(domains)
> t = DomainNameTransformer()
> t.fit_transform(domains_series)
                   tld
loreyventures.com  com
startupradar.co     co
karllorey.com      com
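
For readers who want to see how a transformer of this shape can be built, the following is a minimal sketch of a scikit-learn-compatible transformer that takes a series of domains and returns a DataFrame indexed by domain. `TldSketchTransformer` is a hypothetical stand-in written for this example; it is not the package's DomainNameTransformer and may differ from it in behaviour.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TldSketchTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in, NOT the package's DomainNameTransformer."""

    def fit(self, X, y=None):
        # Nothing to learn: the feature is derived from each name alone.
        return self

    def transform(self, X):
        domains = pd.Series(X)
        # Naive rule: everything after the last dot is the top-level domain.
        # Multi-part suffixes such as "co.uk" would yield only "uk".
        tld = domains.str.rsplit(".", n=1).str[-1]
        return pd.DataFrame({"tld": tld.to_numpy()}, index=domains.to_numpy())


domains_series = pd.Series(["loreyventures.com", "startupradar.co", "karllorey.com"])
result = TldSketchTransformer().fit_transform(domains_series)
print(result)
```

Because the class follows the `fit`/`transform` convention, it can be dropped into a scikit-learn pipeline like any other transformer.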

Install

This is a work in progress, and you should expect things to change on a daily basis. If you still want to try it, you can get the latest version by installing it as a git-based dependency:

> pip install git+https://github.com/startupradar/transformers.git

Languages

Language:Python 100.0%