Sentiment Analysis Theis

This is the code for my thesis (Research and Application of Sentiment Analysis Based on Transformer).

Abstract

Sentiment analysis is one of the research hotspots in natural language processing. It is the pipeline of analyzing, processing, summarizing and reasoning subjective texts with emotional inclination. The word embedding model that has emerged in recent years has been widely used in natural language processing. However, the current word embedding model pays more attention to the context of words, ignoring some special cases in the language, such as harmonics phenomenon. On the other hand, the rise of the Transformer Network has further solved the problem of long dependence in the text, but how to apply it to the sentiment analysis task effectively has become a hotspot problem.

Firstly, this thesis studies the harmonics phenomenon in Chinese proposes to add Pinyin information to the semantic space of words and uses TF-IDF values to evaluate the importance of each word, implements the word embedding model of joint Pinyin and TF-IDF weighted. The addition of Pinyin allows word in a similar context of Pinyin to share some semantic information, solving the semantic similarity problem of harmonics. Using the TF-IDF value enforces model to pay more attention to important words while learning, then get a better semantic representation. Secondly, this paper studies the Transformer network and proposes two sentiment analysis models based on Transformer: Transformer_Adaptive and Transformer_Gumbel. The Transformer_Adaptive model uses an adaptive way to learn the weight distribution and then derives the sentence representation from the Transformer's outputs. The Transformer_Gumbel model uses a differentiable sampling method (GumbelSoftmax) to get the sentence representation from the Transformer's outputs. The performance of sentiment classification can be improved by the sentence representation of the Transformer_Adaptive and Transformer_Gumbel models.

The results show that the F1 values of the Transformer_Adaptive and Transformer_Gumbel models are increased by 0.46% and 1.51%, respectively, over the larger dataset. After initializing with the pre-trained word vector, the F1 value is increased by 0.89% and 2.11% compared to the baseline model initialized in the same manner. The model proposed by this thesis not only improves the performance of sentiment analysis tasks to a certain extent but has an applying value.

Requirement

Python 3.5+
Pytorch 0.4+

Data

1. For Training Embedding

reviews-qa:

https://github.com/brightmart/nlp_chinese_corpus 百科类问答json版(baike2018qa) + 社区问答json版(webtext2019zh)

https://github.com/SophonPlus/ChineseNlpCorpus weibo_senti_100k + simplifyweibo_4_moods + yf_dianping + yf_amazon + dmsc_v2

wiki-news:

https://github.com/brightmart/nlp_chinese_corpus 维基百科json版(wiki2019zh) + 新闻语料json版(news2016zh)

2. For Training Sentiment Model

ChnSentiCorp_htl_all(CSCH), OS10, online_shopping_10_cats(dmsc_v2), yf_amazon: https://github.com/SophonPlus/ChineseNlpCorpus

Model

1.BiGRU

SAModel(
  (encoder): RNNEncoder(
    (embedding): Embedding(15004, 100, padding_idx=0)
    (dropout): Dropout(p=0.7)
    (rnn): GRU(100, 64, bidirectional=True)
    (reduce_state): ReduceState()
    (linear_final): Linear(in_features=128, out_features=3, bias=True)
  )
)

2.CNN

SAModel(
  (encoder): CNNEncoder(
    (embedding): Embedding(15004, 100, padding_idx=0)
    (conv1): Conv2d(1, 100, kernel_size=(3, 100), stride=(1, 1))
    (conv2): Conv2d(1, 100, kernel_size=(4, 100), stride=(1, 1))
    (conv3): Conv2d(1, 100, kernel_size=(2, 100), stride=(1, 1))
    (dropout): Dropout(p=0.7)
    (linear_final): Linear(in_features=300, out_features=3, bias=True)
  )
)

3.Transformer_Avg

SAModel(                                                                                                                               [25/329]
  (encoder): BERTCM(
    (bert): BERT(
      (embedding): Embedding(15004, 100, padding_idx=0)
      (pos_embedding): Embedding(91, 100, padding_idx=0)
      (dropout): Dropout(p=0.7)
      (transformer_blocks): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadedAttention(
            (linear_layers): ModuleList(
              (0): Linear(in_features=100, out_features=100, bias=True)
              (1): Linear(in_features=100, out_features=100, bias=True)
              (2): Linear(in_features=100, out_features=100, bias=True)
            )
            (output_linear): Linear(in_features=100, out_features=100, bias=True)
            (attention): Attention()
            (dropout): Dropout(p=0.1)
          )
          (feed_forward): PositionwiseFeedForward(
            (w_1): Linear(in_features=100, out_features=400, bias=True)
            (w_2): Linear(in_features=400, out_features=100, bias=True)
            (dropout): Dropout(p=0.7)
            (activation): GELU()
          )
          (input_sublayer): SublayerConnection(
            (norm): LayerNorm()
            (dropout): Dropout(p=0.7)
          )
          (output_sublayer): SublayerConnection(
            (norm): LayerNorm()
            (dropout): Dropout(p=0.7)
          )
          (dropout): Dropout(p=0.7)
        )
      )
    )
    (norm): LayerNorm()
    (linear_final): Linear(in_features=100, out_features=3, bias=True)
  )
)

4.Transformer_Max

SAModel(                                                                                                                              
  (encoder): TransformerCM(
    (transformer): Transformer(
      (embedding): Embedding(15004, 100, padding_idx=0)
      (pos_embedding): Embedding(91, 100, padding_idx=0)
      (dropout): Dropout(p=0.7)
      (transformer_blocks): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadedAttention(
            (linear_layers): ModuleList(
              (0): Linear(in_features=100, out_features=100, bias=True)
              (1): Linear(in_features=100, out_features=100, bias=True)
              (2): Linear(in_features=100, out_features=100, bias=True)
            )
            (output_linear): Linear(in_features=100, out_features=100, bias=True)
            (attention): Attention()
            (dropout): Dropout(p=0.1)
          )
          (feed_forward): PositionwiseFeedForward(
            (w_1): Linear(in_features=100, out_features=400, bias=True)
            (w_2): Linear(in_features=400, out_features=100, bias=True)
            (dropout): Dropout(p=0.7)
            (activation): GELU()
          )
          (input_sublayer): SublayerConnection(
            (norm): LayerNorm()
            (dropout): Dropout(p=0.7)
          )
          (output_sublayer): SublayerConnection(
            (norm): LayerNorm()
            (dropout): Dropout(p=0.7)
          )
          (dropout): Dropout(p=0.7)
        )
      )
    )
    (norm): LayerNorm()
    (linear_final): Linear(in_features=100, out_features=3, bias=True)
  )
)

5.Transformer_Adaptive

SAModel(                                                                                                                               
  (encoder): TransformerCM(
    (transformer): Transformer(
      (embedding): Embedding(15004, 100, padding_idx=0)
      (pos_embedding): Embedding(91, 100, padding_idx=0)
      (dropout): Dropout(p=0.7)
      (transformer_blocks): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadedAttention(
            (linear_layers): ModuleList(
              (0): Linear(in_features=100, out_features=100, bias=True)
              (1): Linear(in_features=100, out_features=100, bias=True)
              (2): Linear(in_features=100, out_features=100, bias=True)
            )
            (output_linear): Linear(in_features=100, out_features=100, bias=True)
            (attention): Attention()
            (dropout): Dropout(p=0.1)
          )
          (feed_forward): PositionwiseFeedForward(
            (w_1): Linear(in_features=100, out_features=400, bias=True)
            (w_2): Linear(in_features=400, out_features=100, bias=True)
            (dropout): Dropout(p=0.7)
            (activation): GELU()
          )
          (input_sublayer): SublayerConnection(
            (norm): LayerNorm()
            (dropout): Dropout(p=0.7)
          )
          (output_sublayer): SublayerConnection(
            (norm): LayerNorm()
            (dropout): Dropout(p=0.7)
          )
          (dropout): Dropout(p=0.7)
        )
      )
    )
    (norm): LayerNorm()
    (linear_final): Linear(in_features=100, out_features=3, bias=True)
  )
)

6.Transformer_Gumbel

SAModel(                                                                                                                             
  (encoder): TransformerCM(
    (transformer): Transformer(
      (embedding): Embedding(15004, 100, padding_idx=0)
      (pos_embedding): Embedding(91, 100, padding_idx=0)
      (dropout): Dropout(p=0.7)
      (transformer_blocks): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadedAttention(
            (linear_layers): ModuleList(
              (0): Linear(in_features=100, out_features=100, bias=True)
              (1): Linear(in_features=100, out_features=100, bias=True)
              (2): Linear(in_features=100, out_features=100, bias=True)
            )
            (output_linear): Linear(in_features=100, out_features=100, bias=True)
            (attention): Attention()
            (dropout): Dropout(p=0.1)
          )
          (feed_forward): PositionwiseFeedForward(
            (w_1): Linear(in_features=100, out_features=400, bias=True)
            (w_2): Linear(in_features=400, out_features=100, bias=True)
            (dropout): Dropout(p=0.7)
            (activation): GELU()
          )
          (input_sublayer): SublayerConnection(
            (norm): LayerNorm()
            (dropout): Dropout(p=0.7)
          )
          (output_sublayer): SublayerConnection(
            (norm): LayerNorm()
            (dropout): Dropout(p=0.7)
          )
          (dropout): Dropout(p=0.7)
        )
      )
    )
    (norm): LayerNorm()
    (linear_final): Linear(in_features=100, out_features=3, bias=True)
  )
)

AotY / sentiemnt_analysis-thesis