ai computational-linguistics deep-learning information-retrieval latent-dirichlet-allocation machine-learning natural-language-processing

Latent Aspect Mining

Implementation of the paper Latent Aspect Detection from Online Unsolicited Customer Reviews

Aspects are the features of a product or service on which customers express opinions in their reviews. For a customer-centric business, recognizing and prioritizing customers' needs is crucial to maintaining revenue and reducing customer churn. Current supervised learning methods are usually trained on human-annotated data to detect the surface forms of aspects, so they fall short when aspects are latent in reviews. Yet 35% of reviews of electronics and restaurants contain no explicit surface form for their aspects. We propose an unsupervised method that uses opinion expressions to extract latent aspects.

Dependencies

python 3.8.8
nltk 3.5
scikit-learn 0.24.1
spacy 3.0.5
gensim 3.8.3
textblob 0.15.3

To use the spaCy language model, also run:

$ conda install -c conda-forge spacy

$ pip install spacy-transformers
$ pip install spacy-lookups-data

$ python -m spacy download en_core_web_trf

See also requirements.txt. You can install the requirements with the following command:

$ pip install -r requirements.txt

Data

The pre-processed datasets and the pre-trained models are in the \data folder. To run on a pre-processed dataset:

$ python Main.py --path data\Canadian_Restaurant_preprocessed_corrected.xlsx --preprocess False

You can also use the original datasets of the Restaurant domain. To preprocess them, run:

$ python Main.py --path data\data\Canadian_Restaurant.xlsx

The preprocessed files and the LDA model for each domain are saved in the folders prep_and_seg/~ and models/~, respectively.

Using Pre-Trained Models

Models can be built once and reused. To load previously built models, run:

$ python Main.py --tune False --aspect_model pxp_model_aspect.pxp \
                 --opinion_model pxp_model_opinion.pxp \
                 --all_model pxp_model_all.pxp

Tune

By default, the number of extracted topics is detected automatically. To set it explicitly, run:

$ python Main.py --tune False --num_topics 20

Tuning first splits the dataset into 5 smaller parts using KFold, then iteratively calculates a coherence value for each part. The mean and standard deviation of these values are used to choose the optimal number of topics. Here are the results for the noun dataset and the adjective dataset:

For ASPECTs:

optimal number: 36

For OPINIONs:

optimal number: 39
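
The tuning procedure described above can be sketched as follows. This is a minimal illustration, not the code in this repository: the function names (k_fold_parts, pick_num_topics, toy_coherence) are hypothetical, the split is a plain standard-library stand-in for KFold, and the selection rule (highest mean minus standard deviation) is an assumption, since the README only says that the mean and std are used. In the real pipeline the scoring function would be the coherence of an LDA model trained on each part; here it is a toy placeholder so the sketch runs without extra dependencies.

```python
import random
import statistics
from typing import Callable, Dict, List, Sequence, Tuple


def k_fold_parts(n_items: int, n_splits: int = 5, seed: int = 0) -> List[List[int]]:
    """Shuffle the indices 0..n_items-1 and deal them into n_splits parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_splits] for i in range(n_splits)]


def pick_num_topics(
    docs: Sequence[List[str]],
    candidates: Sequence[int],
    coherence_fn: Callable[[List[List[str]], int], float],
    n_splits: int = 5,
) -> Tuple[int, Dict[int, Tuple[float, float]]]:
    """Score each candidate topic count on every part, then pick one.

    Assumed rule: maximise (mean - std) of the per-part scores, so that a
    high but unstable score does not beat a slightly lower, stable one.
    """
    parts = k_fold_parts(len(docs), n_splits)
    summary: Dict[int, Tuple[float, float]] = {}
    for k in candidates:
        scores = [coherence_fn([docs[i] for i in part], k) for part in parts]
        summary[k] = (statistics.mean(scores), statistics.pstdev(scores))
    best = max(candidates, key=lambda k: summary[k][0] - summary[k][1])
    return best, summary


def toy_coherence(part_docs: List[List[str]], k: int) -> float:
    """Placeholder scorer: peaks when k equals the vocabulary size of the part.

    A real scorer would train an LDA model with k topics and return its
    coherence value.
    """
    vocab_size = len({token for doc in part_docs for token in doc})
    return -float(abs(k - vocab_size))


# Tiny synthetic corpus: every document uses the same four tokens.
docs = [["food", "service", "price", "staff"] for _ in range(10)]
best, summary = pick_num_topics(docs, candidates=[2, 4, 6], coherence_fn=toy_coherence)
print("chosen number of topics:", best)
```

Passing the scorer in as a function keeps the split-and-select logic independent of the topic-modelling library, so the same loop can be run once on the noun dataset and once on the adjective dataset.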

Inference

To run inference with trained models, run:

$ python Main.py --inference inference_set.xlsx

Note that the models need to be specified first; otherwise, a model will be built with the default settings.
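
The commands in this README pass booleans as literal strings, for example --tune False. How Main.py parses them is not shown here. The sketch below, assuming argparse, illustrates one common way to handle such flags. The helper names (str2bool, build_parser) and all default values are assumptions made for illustration, not the repository's actual code; only the flag names are taken from the commands above.

```python
import argparse


def str2bool(value: str) -> bool:
    """Convert 'True'/'False'-style strings to a real boolean.

    With type=bool, argparse would turn any non-empty string, including
    'False', into True, so an explicit converter is needed.
    """
    lowered = value.strip().lower()
    if lowered in {"true", "t", "yes", "1"}:
        return True
    if lowered in {"false", "f", "no", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering flags that appear in this README."""
    parser = argparse.ArgumentParser(description="Sketch of the command-line flags")
    parser.add_argument("--path", type=str, default=None, help="dataset file (.xlsx)")
    # Assumed defaults: preprocess and tune unless told otherwise.
    parser.add_argument("--preprocess", type=str2bool, default=True)
    parser.add_argument("--tune", type=str2bool, default=True)
    parser.add_argument("--num_topics", type=int, default=None)
    parser.add_argument("--aspect_model", type=str, default=None)
    parser.add_argument("--opinion_model", type=str, default=None)
    parser.add_argument("--inference", type=str, default=None, help="inference set (.xlsx)")
    return parser


# Parse the same arguments as the command "python Main.py --tune False --num_topics 20".
args = build_parser().parse_args(["--tune", "False", "--num_topics", "20"])
print(args.tune, args.num_topics, args.preprocess)
```

An explicit converter also rejects typos such as --tune Flase with a clear error instead of silently treating them as True.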

Costs

Memory-count < 2GB
Working set < 1GB
CPU Avg. cycle = 49.7
The full pipeline takes ~11 hours on a dataset of size ~6800

Cite as

@article{DBLP:journals/corr/abs-2204-06964,
  author={Mohammad Forouhesh and Arash Mansouri and Hossein Fani},
  title={Latent Aspect Detection from Online Unsolicited Customer Reviews},
  year={2022},
  cdate={1640995200000},
  journal={CoRR},
  volume={abs/2204.06964},
  url={https://doi.org/10.48550/arXiv.2204.06964}
}
