pawa- / Lingua-JA-TFWebIDF

TF*WebIDF calculator

Home Page:http://search.cpan.org/dist/Lingua-JA-TFWebIDF/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

NAME
    Lingua::JA::TFWebIDF - TF*WebIDF calculator

SYNOPSIS
      use Lingua::JA::TFWebIDF;
      use utf8;
      use feature qw/say/;
      use Data::Printer;

      my $tfidf = Lingua::JA::TFWebIDF->new(
          pos1_filter => [qw/非自立 代名詞 ナイ形容詞語幹 副詞可能 サ変接続/],
          ng_word     => [qw/編集 本人 自身 自分 たち さん 年 月 日/],
      );

      my %tf = (
          '自然言語処理' => 9,
          '自然言語'     => 6,
          '自然言語理解' => 4,
          '処理'         => 5,
          '解析'         => 4,
      );

      p $tfidf->tfidf(\%tf)->dump;

      p $tfidf->tfidf($text)->dump;
      p $tfidf->tf($text)->dump;

      for my $result (@{ $tfidf->tfidf($text)->list(20) })
      {
          my ($word, $weight) = each %{$result};

          say "$word: $weight";
      }

DESCRIPTION
    Lingua::JA::TFWebIDF calculates TF*WebIDF weight.

    Compared with Lingua::JA::TFIDF, this module has the following
    advantages.

    *   supports Tokyo Cabinet and many options.

    *   tfidf method accepts \%tf. (This facilitates the use of other
        morphological analyzers.)

METHODS
  new( %config || \%config )
    Creates a new Lingua::JA::TFWebIDF instance.

    The following configuration is used if you don't set %config.

      KEY                 DEFAULT VALUE
      -----------         ---------------
      pos1_filter         [qw/非自立 代名詞 ナイ形容詞語幹 副詞可能/]
      pos2_filter         []
      pos3_filter         []
      ng_word             []
      term_length_min     2
      term_length_max     30
      concat_max          30
      tf_min              1
      df_min              0
      df_max              250_0000_0000
      fetch_unk_word_df   0
      db_auto             1
      guess_df            1

      idf_type            1
      api                 'YahooPremium'
      appid               undef
      driver              'TokyoCabinet'
      df_file             './df.tch'
      fetch_df            0
      expires_in          365
      documents           250_0000_0000
      Furl_HTTP           undef
      verbose             0

    pos(1|2|3)_filter => \@pos
        The filters of '品詞細分類'.

    concat_max => $num
        The maximum value of the number of term concatenations.

        If 2 is specified, 2 consecutive nouns are concatenated. I recommend
        that you specify a large value or 0.

    fetch_df => 0 || 1
        If df_file has the (Web)DF and it has not expired yet, it is used.

        If it is not so,

        1: fetches the DF of the word which exists in the dictionary of
        MeCab.

        0: if guess_df is 1, guesses the DF.

        if guess_df is 0, does not give TF*WebIDF weight.

    fetch_unk_word_df => 0 || 1
        'unk word' is a word which not exists in the dictionary of MeCab.

        If df_file has the (Web)DF and it has not expired yet, it is used.

        If it is not so,

        1: fetches the DF of the unk word if fetch_df is 1.

        0: if guess_df is 1, guesses the DF.

        if guess_df is 0, does not give TF*WebIDF weight.

    db_auto => 0 || 1
        If 1 is specified, (open|close)s the WebDF(Document Frequency)
        database automatically. This option works when tfidf method is
        called.

    guess_df => 0 || 1
        If df_file has the (Web)DF and it has not expired yet, it is used.

        If it is not so,

        1: guesses the DF based on the string length of $word.

        0: doesn't give TF*WebIDF weight.

    idf_type, api, appid, driver, df_file, expires_in, documents, Furl_HTTP,
    verbose
        See Lingua::JA::WebIDF.

  tfidf( $text || \%tf )
    Calculates TF*WebIDF weight. If scalar value is set, MeCab separates the
    value into appropriate morphemes. If you want to use other morphological
    analyzers, you have to set a hash reference which contains terms and
    their TF(Term Frequency).

  tf($text)
    Calculates TF(Term Frequency) via MeCab.

  idf, df, db_open, db_close, purge
    See Lingua::JA::WebIDF.

AUTHOR
    pawa <pawapawa@cpan.org>

SEE ALSO
    Lingua::JA::TermExtractor

    Lingua::JA::WebIDF

    Lingua::JA::WebIDF::Driver::TokyoTyrant

LICENSE
    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

About

TF*WebIDF calculator

http://search.cpan.org/dist/Lingua-JA-TFWebIDF/


Languages

Language:Smalltalk 97.3%Language:Perl 2.7%