dmlc / gluon-nlp

NLP made easy

Home Page: https://nlp.gluon.ai/


[Website] Improve website of the master version to prepare for the 1.0 release

sxjscience opened this issue

Description

#1374 has been merged, which fixes the warnings in our documentation. However, the current structure of the website is still not satisfactory: we should improve the layout and add more tutorials.

Help needed here.

References

  • Previous issue on website: #1081

@dmlc/gluon-nlp-committers
@Cli212 @yongyi-wu @xinyual @barry-jin

Some items:

  • Improve the front-page demo
  • Improve layout of the examples

On the Text Prediction - Part 1: Quickstart of Pretrained Backbones

  • The start of the tutorial mentions that we will load two datasets using the nlp_data command, but the immediately following block contains only imports. Consider noting that the first block sets up the imports.
  • Should there be a tutorial for the nlp_data/nlp_process CLIs?
  • Formatting issue: the list in "Let’s download two datasets from the GLUE benchmark: - The Stanford Sentiment Treebank (SST-2) - Semantic Textual Similarity Benchmark (STS-B)" renders inline instead of as bullets.
  • Formatting issue: the list in "A bunch of recent papers, especially BERT, have led a new trend for solving NLP problems: - Pretrain a backbone model on a large corpus, - Finetune the backbone to solve the specific NLP task." renders inline instead of as bullets.
  • Cell 10: we should introduce the signature of the backbone-loading API and link to the relevant API doc (see the sketch after this list).
  • Cell 13: the matrix values don't add anything to the content of the tutorial.
  • Cell 14: the list of backbones doesn't provide much information on its own. We could add a comparison such as number of parameters, pre-training data domain, downstream task accuracy, etc., to help guide model selection.
  • The concluding sentence "According to the paper and our own experiments, MobileBERT performs similar to BERT-base...", while useful as an introduction to the next tutorial, diverges from the quick-start topic. Consider concluding with something like "In this tutorial, we learned how to use pre-trained backbone networks to quickly start ... Next, we will ..."
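
For the cell 10 item above, the prose could spell out something along these lines (a minimal sketch assuming the master-branch gluonnlp.models.get_backbone API, with google_electra_base as an example model name; the exact return tuple should be double-checked against the API reference):

```python
from gluonnlp.models import get_backbone

# get_backbone resolves a registered model name into everything needed to use it:
# the Block class, its config, a matching tokenizer, and a local path to the
# pretrained weights (downloaded on first use).
model_cls, cfg, tokenizer, backbone_param_path, _ = get_backbone('google_electra_base')

backbone = model_cls.from_cfg(cfg)             # build the network from the config
backbone.load_parameters(backbone_param_path)  # load the pretrained weights
backbone.hybridize()
```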

On the Text Prediction - Part 2: MobileBERT for Text Prediction

  • The start of the tutorial assumes that users are going through the tutorials sequentially (e.g. "Now you have learned 1) the basics about Gluon, 2) how to ..."). Instead, we could adjust the wording to "In Part 1 Quickstart (link), we learned ..."
  • Rename "Handle Variable Length Sequence" to "Batching Variable Length Sequences" (see the batching sketch after this list).
  • Cell 12 should mention the signature of the backbone and link to the relevant API doc.
  • Cell 15: the training function is long and carries little explanation. Consider extracting parts of it into separate cells, e.g. one for constructing the dataloader and one for the learning rate scheduler/trainer.
  • Several notebooks carry a lot of imports. Let's make sure we only include the necessary ones, ordered as Python built-ins, then third-party libraries, then first-party libraries.
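
For the renamed batching section, a compact example along these lines could anchor the discussion (a sketch assuming gluonnlp.data.batchify as on master; note the Pad keyword may be pad_val in 0.x versus val on master):

```python
from mxnet.gluon.data import DataLoader, SimpleDataset
from gluonnlp.data import batchify

# Token-id sequences of different lengths cannot be stacked directly;
# Pad right-pads each sequence to the longest one in the batch.
dataset = SimpleDataset([
    ([1, 2, 3, 4], 0),
    ([5, 6], 1),
    ([7, 8, 9], 0),
])
batchify_fn = batchify.Tuple(
    batchify.Pad(val=0),  # pad the token ids
    batchify.Stack(),     # labels are scalars, so just stack them
)
loader = DataLoader(dataset, batch_size=3, batchify_fn=batchify_fn)
for token_ids, labels in loader:
    print(token_ids.shape, labels.shape)  # (3, 4) (3,)
```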

On the Question Answering with GluonNLP

  • The mentions of https://github.com/dmlc/gluon-nlp/tree/master/scripts/question_answering should link to the example page on the website, so that we always refer to the source code of the same version.
  • We should mention, or provide a pointer on, how users can obtain their own checkpoints instead of relying on !wget -O google_electra_base_squad2.0_8160.params https://gluon-nlp-log.s3.amazonaws.com/squad_training_log/fintune_google_electra_base_squad_2.0/google_electra_base_squad2.0_8160.params

On the Tokenization - Part 1: Basic Usage of Tokenizer and Vocabulary

  • The text processing workflow raw text => normalized (cleaned) text => tokens => network could be replaced with a diagram that shows an example at each step (see the sketch below).
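
To make such a diagram concrete, each arrow could carry a small example like the following (plain-Python illustration only; the real tutorial would of course use GluonNLP's tokenizers and vocabulary):

```python
# raw text => normalized (cleaned) text => tokens => ids consumed by the network
raw = "  GluonNLP makes NLP easy!\n"

normalized = raw.strip().lower()                # "gluonnlp makes nlp easy!"
tokens = normalized.replace("!", " !").split()  # ["gluonnlp", "makes", "nlp", "easy", "!"]

vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]
print(token_ids)                                # [2, 3, 4, 1, 0]
```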

On the Tokenization - Part 2: Learn Subword Models with GluonNLP

  • We should have an introductory section on the concepts behind subwords before diving into the implementation/execution; it could open with something like the sketch below.
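
Such an introduction could open with the classic byte-pair-encoding merge loop from Sennrich et al., which is small enough to show in full (a self-contained sketch of the algorithm, independent of the GluonNLP implementation):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its merged symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Words start as character sequences; frequent adjacent pairs are merged into subwords.
vocab = {'l o w': 5, 'l o w e r': 2, 'n e w e s t': 6, 'w i d e s t': 3}
for _ in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best, '->', ''.join(best))
```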

On the Tokenization - Part 3: Download Data from Wikipedia and Learn Subword

  • Since learn_subword is already covered in Part 2, the new content in this part is really the nlp_data preparation. There should be a description of what it does and of how people can discover its different options.

For the tokenization notebooks, one pressing need is a reference page in the API section documenting the functionality of the CLIs. Otherwise, short of reading the code, it is hard for users to discover their features.

On Compile NLP Models - Convert GluonNLP Models to TVM

  • At the moment this notebook includes only code and titles. We will need an introduction section with an overview and motivation, explanations of the code, and references to other materials on compilation.
  • compile_tvm_graph_runtime is very long and there are opportunities for breaking the logic up into smaller functions. For example, graph construction can happen separately from compilation (see the sketch below).
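
A possible shape for that refactoring (a sketch assuming the TVM Relay frontend and graph runtime APIs; the function names here are illustrative, not taken from the notebook):

```python
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

def build_relay_graph(net, shape_dict, dtype='float32'):
    """Graph construction only: trace the hybridized Gluon block into a Relay module."""
    mod, params = relay.frontend.from_mxnet(net, shape=shape_dict, dtype=dtype)
    return mod, params

def compile_graph(mod, params, target='llvm', opt_level=3):
    """Compilation only: build the Relay module for the given target."""
    with tvm.transform.PassContext(opt_level=opt_level):
        return relay.build(mod, target=target, params=params)

def make_runtime(lib, ctx):
    """Wrap the compiled library in a graph runtime module for inference."""
    return graph_runtime.GraphModule(lib['default'](ctx))
```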