dmlc / gluon-nlp

NLP made easy

Home Page: https://nlp.gluon.ai/


[Website] Improve website of the master version to prepare for the 1.0 release

sxjscience opened this issue

Description

#1374 has been merged, which fixes the warnings in our documentation. However, the current structure of the website is still not satisfactory: we should improve the layout and add more tutorials.

Help needed here.

References

  • Previous issue on website: #1081

@dmlc/gluon-nlp-committers
@Cli212 @yongyi-wu @xinyual @barry-jin

Some items:

  • Improve the front-page demo
  • Improve layout of the examples

On the Text Prediction - Part 1: Quickstart of Pretrained Backbones

  • The start of the tutorial mentions that we will load two datasets using the nlp_data command, but the immediately following block contains only imports. Consider noting that the first block sets up the imports.
  • Should there be a tutorial for the nlp_data/nlp_process CLIs?
  • Formatting issue: the list in "Let’s download two datasets from the GLUE benchmark: - The Stanford Sentiment Treebank (SST-2) - Semantic Textual Similarity Benchmark (STS-B)" renders inline instead of as bullets.
  • Formatting issue: the list in "A bunch of recent papers, especially BERT, have led a new trend for solving NLP problems: - Pretrain a backbone model on a large corpus, - Finetune the backbone to solve the specific NLP task." renders inline instead of as bullets.
  • Cell 10: we should introduce the signature of the backbone-loading API and link to the relevant API doc (see the sketch after this list).
  • Cell 13: the matrix values don't add anything to the content of the tutorial.
  • Cell 14: the list of backbones doesn't provide much information on its own. We could add a comparison such as number of parameters, pre-training data domain, downstream task accuracy, etc., to help guide model selection.
  • The concluding sentence "According to the paper and our own experiments, MobileBERT performs similar to BERT-base...", while useful as an introduction to the next tutorial, diverges from the quick-start topic. Consider concluding with something like "In this tutorial, we learned how to use pre-trained backbone networks to quickly start ... Next, we will ..."
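
For the cell 10 item above, the prose could spell out something along these lines (a minimal sketch assuming the master-branch gluonnlp.models.get_backbone API, with google_electra_base as an example model name; the exact return tuple should be double-checked against the API reference):

```python
from gluonnlp.models import get_backbone

# get_backbone resolves a registered model name into everything needed to use it:
# the Block class, its config, a matching tokenizer, and a local path to the
# pretrained weights (downloaded on first use).
model_cls, cfg, tokenizer, backbone_param_path, _ = get_backbone('google_electra_base')

backbone = model_cls.from_cfg(cfg)             # build the network from the config
backbone.load_parameters(backbone_param_path)  # load the pretrained weights
backbone.hybridize()
```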

On the Text Prediction - Part 2: MobileBERT for Text Prediction

  • The start of the tutorial assumes that users are going through the tutorials sequentially (e.g. "Now you have learned 1) the basics about Gluon, 2) how to ..."). Instead, we could adjust the wording to "In Part 1 Quickstart (link), we learned ..."
  • Rename "Handle Variable Length Sequence" to "Batching Variable Length Sequences" (see the batching sketch after this list).
  • Cell 12 should mention the signature of the backbone and link to the relevant API doc.
  • Cell 15: the training function is long and carries little explanation. Consider extracting parts of it into separate cells, e.g. one for constructing the dataloader and one for the learning rate scheduler/trainer.
  • Several notebooks carry a lot of imports. Let's make sure we only include the necessary ones, ordered as Python built-ins, then third-party libraries, then first-party libraries.
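
For the renamed batching section, a compact example along these lines could anchor the discussion (a sketch assuming gluonnlp.data.batchify as on master; note the Pad keyword may be pad_val in 0.x versus val on master):

```python
from mxnet.gluon.data import DataLoader, SimpleDataset
from gluonnlp.data import batchify

# Token-id sequences of different lengths cannot be stacked directly;
# Pad right-pads each sequence to the longest one in the batch.
dataset = SimpleDataset([
    ([1, 2, 3, 4], 0),
    ([5, 6], 1),
    ([7, 8, 9], 0),
])
batchify_fn = batchify.Tuple(
    batchify.Pad(val=0),  # pad the token ids
    batchify.Stack(),     # labels are scalars, so just stack them
)
loader = DataLoader(dataset, batch_size=3, batchify_fn=batchify_fn)
for token_ids, labels in loader:
    print(token_ids.shape, labels.shape)  # (3, 4) (3,)
```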

On the Question Answering with GluonNLP

  • The mentions of https://github.com/dmlc/gluon-nlp/tree/master/scripts/question_answering should link to the example page on the website, so that we always refer to the source code of the same version.
  • We should mention, or provide a pointer on, how users can obtain their own checkpoints instead of relying on !wget -O google_electra_base_squad2.0_8160.params https://gluon-nlp-log.s3.amazonaws.com/squad_training_log/fintune_google_electra_base_squad_2.0/google_electra_base_squad2.0_8160.params

On the Tokenization - Part 1: Basic Usage of Tokenizer and Vocabulary

  • The text processing workflow raw text => normalized (cleaned) text => tokens => network could be replaced with a diagram that shows an example at each step (see the sketch below).
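
To make such a diagram concrete, each arrow could carry a small example like the following (plain-Python illustration only; the real tutorial would of course use GluonNLP's tokenizers and vocabulary):

```python
# raw text => normalized (cleaned) text => tokens => ids consumed by the network
raw = "  GluonNLP makes NLP easy!\n"

normalized = raw.strip().lower()                # "gluonnlp makes nlp easy!"
tokens = normalized.replace("!", " !").split()  # ["gluonnlp", "makes", "nlp", "easy", "!"]

vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]
print(token_ids)                                # [2, 3, 4, 1, 0]
```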

On the Tokenization - Part 2: Learn Subword Models with GluonNLP

  • We should have an introductory section on the concepts behind subwords before diving into the implementation/execution; it could open with something like the sketch below.
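
Such an introduction could open with the classic byte-pair-encoding merge loop from Sennrich et al., which is small enough to show in full (a self-contained sketch of the algorithm, independent of the GluonNLP implementation):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its merged symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Words start as character sequences; frequent adjacent pairs are merged into subwords.
vocab = {'l o w': 5, 'l o w e r': 2, 'n e w e s t': 6, 'w i d e s t': 3}
for _ in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best, '->', ''.join(best))
```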

On the Tokenization - Part 3: Download Data from Wikipedia and Learn Subword

  • Since learn_subword is already covered in Part 2, the new content in this part is really the nlp_data preparation. There should be a description of what it does and of how people can discover its different options.

For the tokenization notebooks, one pressing need is a reference page in the API section documenting the functionality of the CLIs. Otherwise, short of reading the code, it is hard for users to discover their features.

On Compile NLP Models - Convert GluonNLP Models to TVM

  • At the moment this notebook includes only code and titles. We will need an introduction section with an overview and motivation, explanations of the code, and references to other materials on compilation.
  • compile_tvm_graph_runtime is very long and there are opportunities for breaking the logic up into smaller functions. For example, graph construction can happen separately from compilation (see the sketch below).
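
A possible shape for that refactoring (a sketch assuming the TVM Relay frontend and graph runtime APIs; the function names here are illustrative, not taken from the notebook):

```python
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

def build_relay_graph(net, shape_dict, dtype='float32'):
    """Graph construction only: trace the hybridized Gluon block into a Relay module."""
    mod, params = relay.frontend.from_mxnet(net, shape=shape_dict, dtype=dtype)
    return mod, params

def compile_graph(mod, params, target='llvm', opt_level=3):
    """Compilation only: build the Relay module for the given target."""
    with tvm.transform.PassContext(opt_level=opt_level):
        return relay.build(mod, target=target, params=params)

def make_runtime(lib, ctx):
    """Wrap the compiled library in a graph runtime module for inference."""
    return graph_runtime.GraphModule(lib['default'](ctx))
```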