Preface to the Revised Version
LibShortText is an open source library for short-text classification
( Please read the COPYRIGHT file
before using LibShortText.
LibShortText is built on the LIBLINEAR project (see
which supports both Windows and Linux.

However, LibShortText itself does not support Windows, so this project adds:
-- support for building and running on the Windows platform

Building Windows Binaries

Windows binaries are available in the directory `windows'. To re-build
them via Visual C++, use the following steps:

1. Open the "x64 Native Tools Command Prompt for VS 2017" command-line tool.

   Alternatively, open a DOS command window and set the VC++ environment
   variables by typing

   "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64\vcvars64.bat"

   You may have to modify the above command according to which version of
   VC++/VS you use and where it is installed.

2. Change to the current project directory and type

    nmake -f clean all

3. (Optional) To build 32-bit Windows binaries, you must
	(1) Run "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\vcvars32.bat" instead of vcvars64.bat
	(2) In CFLAGS, change every /D _WIN64 to /D _WIN32

    nmake -f clean all

4. Go to the ../demo directory, then copy the following commands and paste
   them into the command line to run:

    python ../ -f -A train_feats1 -A train_feats2 train_file
    python ../ -f -A test_feats1 -A test_feats2 test_file train_file.model predict_result


Author: Justin(github:   @2017/10/11

To get started, please read the ``Quick Start'' section first.  

For developers, please check our document at for integrating
LibShortText in your software.

Table of Contents

- Installation and Data Format
- Quick Start
- Command-line Usage
- More Examples about Command-line Usage 
- Interactive Error Analysis 
- Additional Information

Installation and Data Format

LibShortText requires UNIX systems with Python 2.6 or newer versions.  
The latest version (Python 2.7) is recommended for better efficiency. 

On Unix systems, type

    $ make

to install the package. For training and test data, every line in the file
contains a label and a short text in the following format:

    <label><TAB><text>

A TAB character is between <label> and <text>. Both the label and the text can 
contain space characters. Here are some examples.

    Jewelry & Watches	handcrafted two strand multi color bead necklace
    Books	big bike magazine february 1973

Two sample sets included in this package are `train_file' and `test_file'.
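
The data format above can be read with a few lines of Python. This is a
minimal sketch for illustration only: the helper name `parse_line' is
hypothetical and is not part of LibShortText. It splits at the first TAB, so
spaces inside the label or the text are preserved.

```python
def parse_line(line):
    """Split one data line into (label, text) at the first TAB character."""
    label, sep, text = line.rstrip("\r\n").partition("\t")
    if not sep:
        raise ValueError("missing TAB between label and text: %r" % line)
    return label, text


# The two example lines from this README, with an explicit TAB.
sample = [
    "Jewelry & Watches\thandcrafted two strand multi color bead necklace\n",
    "Books\tbig bike magazine february 1973\n",
]
pairs = [parse_line(line) for line in sample]
for label, text in pairs:
    print("%s -> %s" % (label, text))
```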

Quick Start

You can run

    $ cd demo
    $ ./

to run a demonstration.

LibShortText provides a simple training-prediction workflow:

short texts ============> model ==============> predictions

The command `' trains a text set to obtain a model. For
example, the following command generates `train_file.model' for the
given `train_file'.

    $ python train_file
    [output skipped]

`' predicts a test file using the trained model. For example, the
following command predicts `test_file' with `train_file.model' and stores the
results in `predict_result'.

    $ python test_file train_file.model predict_result
    Accuracy = 87.1800% (4359/5000)

Once predict_result is obtained, LibShortText provides several handy utilities 
to conduct error analysis in the Python interactive shell. Please see
the section `Interactive Error Analysis' for more details. 

Command-line Usage
-`' Usage

    `' obtains a model by training either a short-text dataset
    or a LIBSVM-format data set generated by `'.

    Usage: [options] training_file [model]
        -P {0|1|2|3|4|5|6|7|converter_directory}
            Preprocessor options. The options include stopword removal, 
            stemming, and bigram. (default 1)
            0   no stopword removal, no stemming, unigram
            1   no stopword removal, no stemming, bigram
            2   no stopword removal, stemming, unigram
            3   no stopword removal, stemming, bigram
            4   stopword removal, no stemming, unigram
            5   stopword removal, no stemming, bigram
            6   stopword removal, stemming, unigram
            7   stopword removal, stemming, bigram
            If a preprocessor directory is given instead, then it is assumed
            that the training data is already in LIBSVM format. The preprocessor
            will be included in the model for test. 
        -G {0|1}
            Grid search for the parameter C in linear classifiers. (default 0)
            0   disable grid search (faster)
            1   enable grid search (slightly better results)
        -F {0|1|2|3}
            Feature representation. (default 0)
            0   binary feature
            1   word count 
            2   term frequency
            3   TF-IDF (term frequency + IDF)
        -N {0|1}
            Instance-wise normalization before training/test.
            (default 1 to conduct normalization)
        -A extra_svm_file
            Append extra libsvm-format data. This parameter can be applied many
            times if more than one extra svm-format data set needs to be appended.
        -L {0|1|2|3}
            Classifier. (default 0)
            0   support vector classification by Crammer and Singer
            1   L1-loss support vector classification
            2   L2-loss support vector classification
            3   logistic regression
        -f
            Overwrite the existing model file.
    Examples:
        -L 3 -F 1 -N 1 raw_text_file model_file
        -P text2svm_converter -L 1 converted_svm_file
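
The four `-F' representations and the `-N' normalization can be illustrated on
a toy token list. This is a minimal sketch, not LibShortText's implementation:
the function names `features' and `normalize' are hypothetical, the definition
of term frequency (count divided by the number of tokens) is an assumption,
and the IDF values are supplied by the caller because the exact IDF formula is
not stated in this document.

```python
import math
from collections import Counter


def features(tokens, mode, idf=None):
    """Build a sparse feature dict for one text (mode mirrors -F 0..3)."""
    counts = Counter(tokens)
    total = float(len(tokens))
    if mode == 0:    # binary feature: 1 if the token occurs
        return dict((t, 1.0) for t in counts)
    if mode == 1:    # word count
        return dict((t, float(c)) for t, c in counts.items())
    if mode == 2:    # term frequency (assumed: count / number of tokens)
        return dict((t, c / total) for t, c in counts.items())
    if mode == 3:    # TF-IDF, with caller-supplied (hypothetical) IDF values
        return dict((t, (c / total) * idf[t]) for t, c in counts.items())
    raise ValueError("mode must be 0, 1, 2, or 3")


def normalize(vec):
    """Scale a feature dict to unit Euclidean length (as with -N 1)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0.0:
        return dict(vec)
    return dict((t, v / norm) for t, v in vec.items())


toy = "big bike big magazine".split()
for mode in (0, 1, 2):
    print(mode, sorted(features(toy, mode).items()))
print(sorted(normalize(features(toy, 0)).items()))
```

With binary features and normalization, every token of a text with n distinct
tokens receives the value 1/sqrt(n).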

-`' Usage

    `' predicts labels for a test dataset with a trained model. 

    Usage: [options] test_file model output
        -f
            Overwrite the existing output file.
        -a {0|1}
            Output options. (default 1)
            0   Store only predicted labels. The information is NOT sufficient 
                for interactive analysis. Use this option if you would like to get 
                only accuracy.
            1   More information is stored. The output provides information for 
                interactive analysis, but the size of output can become much larger.
        -A extra_svm_file
            Append extra libsvm-format data. This parameter can be applied many
            times if more than one extra svm-format data set needs to be appended.

-`' Usage

    `' generates a directory containing needed information for
    converting short texts to LIBSVM format. An output file in LIBSVM format is
    also generated.

    Usage: [options] text_src [output]
        -P {0|1|2|3|4|5|6|7}
            Preprocessor options. The options include stopword removal, 
            stemming, and bigram. (default 1)
            0   no stopword removal, no stemming, unigram
            1   no stopword removal, no stemming, bigram
            2   no stopword removal, stemming, unigram
            3   no stopword removal, stemming, bigram
            4   stopword removal, no stemming, unigram
            5   stopword removal, no stemming, bigram
            6   stopword removal, stemming, unigram
            7   stopword removal, stemming, bigram
    Default output will be a file "text_src.svm" and a directory
    "text_src.text_converter". If "output" is specified, the output will be
    "output" and "output.text_converter".

More Examples about Command-line Usage 

We use the following questions/answers to demonstrate some examples.

Q: Given many parameters provided by `', how to choose the 
   parameters at the first trial? 
A: Although `' has several parameters to tune, we carefully 
   choose default parameters based on a study on short-text classification [2].
   Running `' without parameters can deliver good
   classification accuracy in general. It is equivalent to the following 
   command, in which default parameters are explicitly specified.

   $ python -P 1 -G 0 -F 0 -N 1 -L 0 train_file

   Meaning for each parameter:

   -P 1: no stemming, no stopword removal, bigram features
   -G 0: no LIBLINEAR parameter selection
   -F 0: binary feature representation
   -N 1: each instance is normalized to unit length
   -L 0: use Crammer and Singer's multi-class method. 
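
The `-P' values 0 to 7 listed in the usage tables follow a bit pattern:
stopword removal contributes 4, stemming 2, and bigram 1. The sketch below
decodes a `-P' number on that basis and shows one way bigram features could be
formed. Both helper names are hypothetical. Keeping the unigrams and joining
bigrams with a space is an assumption, inferred from tokens such as
`cd single' in the analysis output later in this document.

```python
def decode_preprocessor(option):
    """Translate a -P number (0..7) into its three on/off settings."""
    if not 0 <= option <= 7:
        raise ValueError("-P expects a value between 0 and 7")
    return {
        "stopword_removal": bool(option & 4),
        "stemming": bool(option & 2),
        "bigram": bool(option & 1),
    }


def unigrams_and_bigrams(tokens):
    """Return the tokens followed by space-joined adjacent pairs."""
    tokens = list(tokens)
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


print(decode_preprocessor(1))
print(unigrams_and_bigrams(["3", "cd", "single"]))
```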

Q: How to select the parameter C in LIBLINEAR automatically?
A: By default, LIBLINEAR (and `') sets the parameter C to 1. 
   You can automatically select the best parameter C by using `-G 1`. 

Q: How to generate different models using the same training data?
A: Internally, converts data to LIBSVM format and applies 
   LIBLINEAR for training. To reuse the pre-processed data, LibShortText 
   provides another workflow:

short texts ==========> LIBSVM format data ============> model ==============> result

   The following command generates a LIBSVM-format file `train_file.svm' and a directory
   `train_file.text_converter' containing information for the conversion.

   $ python train_file 
   [`train_file.text_converter' and `train_file.svm' are generated.]

   We then generate two models using the same LIBSVM-format file.

   $ python -P train_file.text_converter -L 3 train_file.svm lr.model
   [A logistic regression model, `lr.model', is generated.]

   $ python -P train_file.text_converter -L 2 train_file.svm l2svm.model
   [An L2-loss linear SVM model, `l2svm.model', is generated.]

Q: How to overwrite existing models or prediction results?
A: If the specified model or output file exists, by default, neither `'
   nor `' overwrites it. You can generate new models/prediction 
   outputs by specifying `-f'.

   $ python -f train_file
   $ python -f test_file train_file.model predict_result

Q: Why is the file of prediction results so large?
A: By default, some additional information for analysis is stored. If you 
   need to get only classification accuracy, you can specify `-a 0' to save disk 
   space. For example,

   $ python -a 0 test_file train_file.model predict_result

Q: If I am an experienced LIBLINEAR user, how should I specify options 
   for LIBLINEAR and `'?
A: For LIBLINEAR, you can easily pass LIBLINEAR parameters in a double quoted 
   string after `-L' with a special character `@'. For example, if you want to
   use L2-regularized Logistic Regression as the classifier, set the parameter
   C to 0.5, and append a bias term to each instance, you can type

   $ python -L @"-s 3 -c 0.5 -B 1" train_file

   To show parameters provided by LIBLINEAR/grid, use

   $ python -x liblinear
   $ python -x grid

   For `', to specify the range of C, use `-G @"-log2c begin,end,step"'.  
   For example, the following command selects the best C among 
   [2^-2, 2^-1, 2^0, 2^1] in terms of cross validation rates.

   $ python -G @"-log2c -2,1,1" train_file
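
To see which values of C a given `begin,end,step' triple covers, the exponents
can be expanded by hand. The following is a small sketch with a hypothetical
helper name, `expand_log2c'; it only reproduces the arithmetic described
above (both ends inclusive) and does not call LibShortText or LIBLINEAR.

```python
def expand_log2c(spec):
    """Expand 'begin,end,step' into the list of C values 2^begin .. 2^end."""
    begin, end, step = (float(part) for part in spec.split(","))
    if step == 0:
        raise ValueError("step must be nonzero")
    values = []
    exponent = begin
    # Walk from begin toward end, including end itself when it is reached.
    while (step > 0 and exponent <= end) or (step < 0 and exponent >= end):
        values.append(2.0 ** exponent)
        exponent += step
    return values


print(expand_log2c("-2,1,1"))
```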

Q: I have more features for texts. How can I add them to LibShortText?
A: You can use the `-A' option in `', `', and
   `' to append feature files. Note that you can use multiple
   feature files. If we have 20 features, and these features are included in
   two files, `train_feats1' and `train_feats2', then we can use these files in
   the training stage by

   $ python -A train_feats1 -A train_feats2 train_file

   The features you use in the training stage should be identical to those in
   the predict stage. Assume that `test_feats1' and `test_feats2' are feature
   files corresponding to `train_feats1' and `train_feats2', respectively. To
   predict a test file you should use

   $ python -A test_feats1 -A test_feats2 test_file train_file.model predict_result

   The usage of analyzer is the same as before. The features will be
   represented in the following format.


Q: I already have some LIBSVM-format features. How can I include these
   features when training the model?
A: You can use the -A option in the command line mode. For example, if you have
   two extra svm files `extra_train_1' and `extra_train_2' in LIBSVM-format, 
   then use:

   $ python train_file -A extra_train_1 -A extra_train_2

   Note that `train_file', `extra_train_1', and `extra_train_2' should 
   have the same number of instances. Then use the following command to
   predict:

   $ python test_file -A extra_test_1 -A extra_test_2 train_file.model predict_result
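
Because the main file and every extra file must have the same number of
instances, it can help to compare line counts before training. This is a
minimal sketch under the assumption that each file stores exactly one instance
per line. The helper names are hypothetical, and the demonstration writes
throwaway placeholder files instead of real data.

```python
import os
import tempfile


def count_lines(path):
    """Return the number of lines (assumed one instance per line)."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def check_same_instances(paths):
    """Raise ValueError unless all files have the same number of lines."""
    counts = dict((path, count_lines(path)) for path in paths)
    if len(set(counts.values())) > 1:
        raise ValueError("instance counts differ: %r" % counts)
    return counts


# Demonstration with placeholder files standing in for real data.
tmp_dir = tempfile.mkdtemp()
demo_paths = []
for name, n_lines in [("train_file", 3), ("extra_train_1", 3)]:
    path = os.path.join(tmp_dir, name)
    with open(path, "w") as handle:
        handle.write("placeholder\n" * n_lines)
    demo_paths.append(path)

demo_counts = check_same_instances(demo_paths)
print(sorted(demo_counts.values()))
```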

Interactive Error Analysis 

We provide interactive tools to analyze prediction results. First, you generate a
file of prediction results by the commands introduced in section `Quick Start.'
Note that you CANNOT specify `-a 0' to `' or the prediction
result will not be analyzable.

You then enter Python, import the module, load the prediction results, and
create an object of `Analyzer' by reading a model.

    $ python
    >>> from libshorttext.analyzer import *
    >>> predict_result = InstanceSet('predict_result')
    >>> analyzer = Analyzer('train_file.model')

You can select a subset of test data for analysis using the following options. 

        Select wrongly predicted instances.
    `with_labels(labels, target)'
        If `target' is `true', then instances with labels in the set `labels'
        are selected. If `target' is `predict', those predicted to be in
        `labels' are chosen. `target' can also be `both' or `or'. `both' and
        `or' find the intersection and the union of `true' and `predict',
        respectively. The default value of `target' is `both'.

    `sort_by_dec'
        Sort instances by decision values.

    `subset(amount, method)'
        Get a specific amount of data by the method `top' or `random'. The 
        default value of `method' is `top'.

For example, among wrongly predicted instances with labels 'Books', 'Music', 
'Art', and 'Baby', to get those having the highest 100 decision values, you can use

    >>> insts =, with_labels(['Books', 'Music', 'Art', 'Baby']), sort_by_dec, subset(100))

You can run the following operations to know details of the selected instances.

    Number of instances: 100
    Accuracy: 0.0 (0/100) 
    True labels: "Baby"  "Art"  "Books"  "Music"
    Predicted labels: "Baby"  "Music"  "Books"  "Art"
    Text source: /home/user/libshorttext-1.0/test_file
    -> Select wronly predicted instances
    -> labels: "Books", "Music", "Art", "Baby"
    -> Sort by maximum decision values.
    -> Select 100 instances in top.

The following command generates a confusion table on the selected instances:

    >>> analyzer.gen_confusion_table(insts)
             Art  Books  Music  Baby
    Art        0     15      4     5
    Books     10      0     17     3
    Music     10     21      0     3
    Baby       1      7      4     0
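
For readers who want to see how such a table is assembled, the sketch below
counts (true label, predicted label) pairs in plain Python. It does not use or
reproduce the `Analyzer' code: the function names and the demonstration pairs
are hypothetical, and placing true labels on the rows is an assumption, since
this document does not say which axis of the table above is which.

```python
from collections import defaultdict


def confusion_table(pairs):
    """Count how often each true label is predicted as each label."""
    table = defaultdict(lambda: defaultdict(int))
    for true_label, predicted_label in pairs:
        table[true_label][predicted_label] += 1
    return table


def format_table(table, labels):
    """Render the counts with one row per true label (assumed layout)."""
    width = max(len(label) for label in labels) + 2
    lines = [" " * width + "".join(label.rjust(width) for label in labels)]
    for row in labels:
        cells = "".join(str(table[row][col]).rjust(width) for col in labels)
        lines.append(row.ljust(width) + cells)
    return "\n".join(lines)


# Made-up pairs for illustration only.
demo_pairs = [("Books", "Music"), ("Books", "Music"),
              ("Books", "Art"), ("Music", "Books")]
demo_table = confusion_table(demo_pairs)
print(format_table(demo_table, ["Art", "Books", "Music"]))
```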

To analyze a single short text, you first load it by

    >>> insts.load_text()

Then you can print information for each single text in `insts'.

    >>> print(insts[61])
    text = avengers assemble 4 panini uk collector s edition nm 2012
    true label = Books
    predicted label = Music

You can print model weights corresponding to tokens of a short text. The
following operation prints weights of the three classes with the highest 
decision values. (To print weights in all classes, you can change 3 to 0.)

    >>> analyzer.analyze_single(insts[61], 3)
                        Music       Books    Antiques
    edition        -5.232e-02   8.869e-01  -1.303e-01
    s edition      -2.219e-02   1.527e-01  -4.077e-02
    nm              7.269e-01   6.048e-02  -1.495e-01
    collector      -5.253e-02  -5.208e-02   8.804e-02
    uk              9.466e-01  -2.089e-01   2.683e-02
    collector s    -3.174e-02   6.389e-02   9.963e-02
    4              -2.011e-01  -2.062e-01   1.526e-01
    2012           -1.173e-01   2.663e-01  -1.369e-01
    s              -5.142e-02   1.485e-01   1.757e-01
    **decval**      3.816e-01   3.705e-01   2.842e-02
    True label: Books

You can also analyze an arbitrary short text.

    >>> analyzer.analyze_single('beatles help longbox sealed usa 3 cd single', 3)
                      Music      Crafts      Travel
    sealed        4.828e-01   1.050e-03  -5.383e-02
    cd            2.872e+00  -1.032e-01  -1.723e-01
    cd single     1.663e-01  -5.181e-03  -6.558e-03
    single        4.375e-01  -6.953e-02  -9.960e-02
    usa           2.247e-01   3.530e-02   2.657e-02
    beatles       5.050e-01  -5.710e-02  -6.933e-02
    3 cd          1.320e-02  -3.837e-02  -7.793e-20
    3             3.057e-02   4.712e-02   1.402e-01
    **decval**    1.673e+00  -6.716e-02  -8.299e-02
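
The `**decval**' row can be related to the weights above it. In the Music
column, the sum of the eight listed weights divided by sqrt(8) is
approximately the reported decision value. This is what one would expect from
binary features (`-F 0') normalized to unit length (`-N 1'), but that reading
is an inference from the printed numbers, not a statement about the library's
internals. The sketch below redoes the arithmetic with the weights copied from
the table; the function name is hypothetical.

```python
import math

# Music-column weights copied from the table above.
music_weights = {
    "sealed": 4.828e-01,
    "cd": 2.872e+00,
    "cd single": 1.663e-01,
    "single": 4.375e-01,
    "usa": 2.247e-01,
    "beatles": 5.050e-01,
    "3 cd": 1.320e-02,
    "3": 3.057e-02,
}


def decision_value(weights):
    """Sum the weights and divide by sqrt(n), assuming each of the n
    tokens has the feature value 1/sqrt(n)."""
    return sum(weights.values()) / math.sqrt(len(weights))


music_decval = decision_value(music_weights)
print("%.3e" % music_decval)
```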

Additional Information

[1] H.-F. Yu, C.-H. Ho, Y.-C. Juan, and C.-J. Lin. LibShortText: A Library 
for Short-text Classification.

[2] H.-F. Yu, C.-H. Ho, P. Arunachalam, M. Somaiya, and C.-J. Lin. Product 
title classification versus text classification.

For any questions and comments, please email


