Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- ICML 2015, Google
- internal covariate shift
- In a DNN, because each layer's parameters keep changing, the distribution of the data flowing through the network keeps shifting (think butterfly effect): the inputs a deep layer sees can vary wildly in variance from step to step, which makes convergence much harder.
- batch norm
- At every layer, compute the mini-batch mean & mini-batch variance, then normalize with them (a minimal sketch follows this block).
- Note that BN works "vertically", across the batch axis, so its effect depends heavily on the mini-batch size and on how much the variance differs between batches.
- Introduces two learnable parameters, gamma and beta, and applies an affine transform gamma*x + beta to the normalized x (this gives the layer the ability to restore the original activations, so the model loses no fitting capacity because of BN).
- Advantages: better generalization, faster convergence, and it can partly replace dropout.
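A minimal NumPy sketch of the training-time BN transform described above (inference-time running statistics are omitted; shapes and values are illustrative):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then apply the learnable affine."""
    mu = x.mean(axis=0)                    # mini-batch mean, shape (d,)
    var = x.var(axis=0)                    # mini-batch variance, shape (d,)
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta            # gamma/beta can undo the normalization

x = np.random.randn(32, 8) * 3.0 + 5.0     # batch of 32 examples, 8 features
y = batch_norm_train(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```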
Layer Normalization
- CoRR 2016
- The difference from BN is that it works "horizontally": mu and sigma are computed over all dimensions of a single example's features, so they are batch-independent, and there is no need to worry about differences between mini-batches the way BN must. Well suited to dynamic settings, online learning, and small mini-batches, and especially to sequence models: RNNs, and Transformers such as BERT.
- Because the statistics pool all features of one example, features with very different scales get normalized together, which suppresses per-feature characteristics and constrains the model's expressive power (see the sketch below).
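For contrast with the BN sketch above, a minimal layer-norm forward pass; the only real change is the normalization axis (per example over features, instead of per feature over the batch):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the feature axis of each example; no batch statistics involved."""
    mu = x.mean(axis=-1, keepdims=True)    # per-example mean over all features
    var = x.var(axis=-1, keepdims=True)    # per-example variance over all features
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.random.randn(1, 8)                  # works identically for batch size 1
print(layer_norm(x, np.ones(8), np.zeros(8)))
```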
Decoupled Weight Decay Regularization
- ICLR (Poster) 2019
- Weight decay is equivalent to L2 regularization only under plain SGD (i.e., one global, constant learning rate), because the learning rate is coupled into the equivalence.
- In adaptive gradient methods, decoupling weight decay from the L2 penalty yields better generalization.
- Even though Adam already adapts its effective step sizes on its own, the paper reports that decoupling weight decay (AdamW) still brings roughly a 15% metric improvement; the sketch below shows the coupled vs. decoupled update.
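A sketch of the two update variants under stated assumptions (illustrative hyperparameters; not the paper's reference code):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              wd=1e-2, decoupled=True):
    """One Adam/AdamW step on parameters w given gradient g and state (m, v, t)."""
    if not decoupled:
        g = g + wd * w                     # classic L2: the penalty enters the gradient
                                           # and gets rescaled by Adam's 1/sqrt(v_hat)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)              # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        w = w - lr * wd * w                # AdamW: uniform decay, applied directly to w
    return w, m, v
```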
An Overview of Gradient Descent Optimization Algorithms
- CoRR 2016
- A survey of gradient descent variants: BGD, SGD, mini-batch GD, Momentum, Nesterov Accelerated Gradient, Adagrad, Adadelta, RMSprop, Adam, AdaMax, Nadam (two of the update rules are sketched below).
- Also collects practical SGD recommendations: shuffling/curriculum learning, BN, early stopping, and gradient Gaussian noise.
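As a flavor of the update rules the survey walks through, minimal sketches of momentum and Nesterov (lr and rho are illustrative values; grad_fn is any gradient oracle):

```python
def momentum_step(w, g, vel, lr=0.1, rho=0.9):
    vel = rho * vel - lr * g               # exponentially decaying velocity
    return w + vel, vel

def nesterov_step(w, grad_fn, vel, lr=0.1, rho=0.9):
    g = grad_fn(w + rho * vel)             # evaluate the gradient at the look-ahead point
    vel = rho * vel - lr * g
    return w + vel, vel

grad = lambda w: 2 * w                     # gradient of f(w) = w^2
w, vel = 5.0, 0.0
for _ in range(50):
    w, vel = nesterov_step(w, grad, vel)
print(round(w, 4))                         # heads toward the minimum at 0
```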
Understanding Deep Learning Requires Rethinking Generalization
- ICLR 2017 Best Paper, Google
- I don't see how this earned Best Paper; the observations feel fairly superficial:
- Deep networks can easily fit random data.
- Regularization is not decisive for a model's generalization ability (as if percent-level gains don't count as gains?).
Understanding the Difficulty of Training Deep Feedforward Neural Networks
- AISTATS 2010. The famous Xavier (Glorot) initialization.
- On activation functions:
- Sigmoid is known to reduce learning efficiency (its non-zero mean induces important singular values in the Hessian) and, once saturated, to cause vanishing gradients.
- The paper tracks sigmoid, tanh, and softsign activations layer by layer in a 5-layer NN, showing that a sigmoid network starts out doing largely useless work: the first three layers output essentially at random while the last layer falls straight into saturation, escaping only after about 100 epochs, when real training finally begins. Softsign behaves best.
- On initialization methods:
- Approximately derives an initialization that keeps the propagated signal zero-mean with unchanged variance across layers. The method is simple but works remarkably well, and it is the Xavier initialization now built into many ML toolkits (sketched below).
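A minimal sketch of the Glorot uniform formula, with a quick check that variance is preserved through a deep linear stack (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    """W ~ U(-a, a) with a = sqrt(6/(fan_in+fan_out)), i.e. Var(W) = 2/(fan_in+fan_out)."""
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

x = rng.normal(size=(1000, 256))           # unit-variance input signal
for _ in range(10):
    x = x @ xavier_uniform(256, 256)       # fan_in * Var(W) = 1: variance preserved
print(round(float(x.var()), 3))            # stays near 1.0 instead of exploding/vanishing
```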
Variational Inference: A Review for Statisticians
- CoRR 2016
- A survey; basic building blocks covered: Markov chains, Monte Carlo sampling, Gibbs sampling, Kullback-Leibler divergence, mean-field theory, the ELBO, and CAVI (the key identity is restated below).
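A one-line reminder of the identity the whole survey is built on, for any variational family q(z):

```latex
\log p(x) =
  \underbrace{\mathbb{E}_{q(z)}\big[\log p(x, z) - \log q(z)\big]}_{\mathrm{ELBO}(q)}
  + \mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big)
```

Since the KL term is non-negative, maximizing the ELBO both lower-bounds the evidence log p(x) and pushes q toward the true posterior; CAVI maximizes it coordinate-wise under the mean-field factorization.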
XGBoost: A Scalable Tree Boosting System
Improving Generalization Performance by Switching from Adam to SGD
On the Convergence of Adam and Beyond
The Marginal Value of Adaptive Gradient Methods in Machine Learning
Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift
- CVPR 2019
- Dropout behaves differently at train and test time, which introduces a variance shift: the data distribution a subsequent BN layer learns during training no longer matches what it sees at test time, and performance drops.
- The variance shift depends mainly on the dropout rate and the feature dimension: with a very large feature dim and a small dropout rate, the shift is almost negligible, which explains why WRN (Wide ResNet) can get away with using dropout.
- A practical fix is to place dropout only after the last BN layer. The toy example below illustrates the shift.
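A tiny numeric illustration of the shift, assuming inverted dropout with keep probability 0.5 on a unit-variance feature:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)               # unit-variance feature
p_keep = 0.5

mask = rng.random(x.shape) < p_keep
x_train = x * mask / p_keep                # inverted dropout preserves the mean...
print(round(float(x_train.var()), 2))      # ...but train-time variance is ~1/p_keep = 2.0
print(round(float(x.var()), 2))            # test time: dropout is a no-op, variance ~1.0
```

A BN layer placed after this dropout would accumulate statistics for the ~2.0 distribution during training but be served the ~1.0 distribution at test time.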
- COEC
DeepWalk: Online Learning of Social Representations
Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graph
LINE: Large-scale Information Network Embedding
node2vec: Scalable Feature Learning for Networks
Structural Deep Network Embedding
Learning Deep Structured Semantic Models for Web Search using Clickthrough Data
Learning Tree-based Deep Model for Recommender Systems
Joint Optimization of Tree-based Index and Deep Model for Recommender Systems
A Learning-rate Schedule for Stochastic Gradient Methods to Matrix Factorization
A Pareto-Efficient Algorithm for Multiple Objective Optimization in E-Commerce Recommendation
A User-Centered Concept Mining System for Query and Document Understanding at Tencent
Ad Click Prediction: a View from the Trenches
Addressing Delayed Feedback for Continuous Training with Neural Networks in CTR prediction
Behavior Sequence Transformer for E-commerce Recommendation in Alibaba
Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba
Causal Embeddings for Recommendation
Deep & Cross Network for Ad Click Predictions
Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features
Deep CTR Prediction in Display Advertising
Deep Interest Evolution Network for Click-Through Rate Prediction
Deep Neural Networks for YouTube Recommendations
Deep Session Interest Network for Click-Through Rate Prediction
DeepFM: A Factorization-Machine based Neural Network for CTR Prediction
Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate
- SIGIR 2018
- Identifies two real-world problems in conventional CVR modeling: sample selection bias and data sparsity. The model estimates CVR via two auxiliary tasks defined over the entire impression space, CTR and CTCVR.
- pCVR = p(conv | click, impression) = p(conv, click | impression) / p(click | impression) = pCTCVR / pCTR
- Training the pCTR task over the full space guides the shared embedding learning, which in turn yields a more accurate pCVR and hence a better oCPC bid = pCVR * CPA. A toy version of the factorization follows.
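A toy version of the ESMM factorization, assuming hypothetical shared-embedding and tower weights (real ESMM uses full MLP towers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
emb = rng.normal(size=16)                  # shared embedding, updated by both task losses
w_ctr = rng.normal(size=16)                # stand-in for the CTR tower
w_cvr = rng.normal(size=16)                # stand-in for the CVR tower

p_ctr = sigmoid(emb @ w_ctr)               # p(click | impression), defined on all impressions
p_cvr = sigmoid(emb @ w_cvr)               # p(conv | click, impression), never supervised directly
p_ctcvr = p_ctr * p_cvr                    # p(conv, click | impression), the second supervised target

# Losses attach only to p_ctr and p_ctcvr, both defined over the entire
# impression space, so p_cvr is learned without restricting training to
# clicked samples -- this is how ESMM sidesteps the sample selection bias.
print(p_ctr, p_cvr, p_ctcvr)
```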
Factorization Machines
Field-aware Factorization Machines for CTR Prediction
Feature Generation by Convolutional Neural Network for Click-Through Rate Prediction
FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction
FLEN: Leveraging Field for Scalable CTR Prediction
Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization
Image Feature Learning for Cold Start Problem in Display Advertising
From RankNet to LambdaRank to LambdaMART: An Overview
Learning and Transferring IDs Representation in E-commerce
Learning Piece-wise Linear Models from Large Scale Data for Ad Click Prediction
Local Item-Item Models for Top-N Recommendation
Online Learning to Rank for Sequential Music Recommendation
Personalized Re-ranking for Recommendation
Practical Lessons from Predicting Clicks on Ads at Facebook
Real-time Personalization using Embeddings for Search Ranking at Airbnb
Recommending What Video to Watch Next: A Multitask Ranking System
Representation Learning-Assisted Click-Through Rate Prediction
Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations
Sparse Linear Methods with Side Information for Top-N Recommendations
SLIM: Sparse Linear Methods for Top-N Recommender Systems
Long and Short-Term Recommendations with Recurrent Neural Networks
Wide & Deep Learning for Recommender Systems
xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems
A new similarity measure for collaborative filtering to alleviate the new user cold-starting problem
A Contextual-Bandit Approach to Personalized News Article Recommendation
A Simple Multi-Armed Nearest-Neighbor Bandit for Interactive Recommendation
Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches
ExcUseMe: Asking Users to Help in Item Cold-Start Recommendations
Feature-based factorized Bilinear Similarity Model for Cold-Start Top-n Item Recommendation
From Zero-Shot Learning to Cold-Start Recommendation
Item Cold-Start Recommendations: Learning Local Collective Embeddings
Neural Collaborative Filtering
Spectral Collaborative Filtering
Variational Autoencoders for Collaborative Filtering
Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model
Notes on Noise Contrastive Estimation and Negative Sampling
Deep contextualized word representations
Improving Language Understanding by Generative Pre-Training
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Single Headed Attention RNN: Stop Thinking With Your Head
Pre-Training with Whole Word Masking for Chinese BERT
A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification
Attention Is All You Need
Bag of Tricks for Efficient Text Classification
Bidirectional LSTM-CRF Models for Sequence Tagging
Convolutional Neural Networks for Sentence Classification
ERNIE: Enhanced Language Representation with Informative Entities
GloVe: Global Vectors for Word Representation
Inference Methods for Latent Dirichlet Allocation
Latent Dirichlet Allocation
Network-Efficient Distributed Word2vec Training System for Large Vocabularies
Parameter estimation for text analysis
SpanBERT: Improving Pre-training by Representing and Predicting Spans
The Dirichlet-multinomial distribution
Distributed Representations of Words and Phrases and their Compositionality
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Fully Convolutional Networks for Semantic Segmentation
Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
U-Net: Convolutional Networks for Biomedical Image Segmentation
UNet++: A Nested U-Net Architecture for Medical Image Segmentation
YOLOv3: An Incremental Improvement
CosFace: Large Margin Cosine Loss for Deep Face Recognition
FaceNet: A Unified Embedding for Face Recognition and Clustering
Auto-Encoding Variational Bayes
Early Visual Concept Learning with Unsupervised Deep Learning
Semi-supervised Learning with Deep Generative Models
Stochastic Gradient VB and the Variational Auto-Encoder
Tutorial on Variational Autoencoders
Generative Adversarial Nets
Are GANs Created Equal? A Large-Scale Study
Conditional Generative Adversarial Nets
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization
Image-to-Image Translation with Conditional Adversarial Networks
InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets
Least Squares Generative Adversarial Networks
Towards Principled Methods for Training Generative Adversarial Networks
Autoencoding beyond pixels using a learned similarity metric
Wasserstein Generative Adversarial Networks