breandan / tracelink

🔗 Trace Link Prediction from code to documentation

TraceLink

The goal of this project is to link source code to documentation and other related software artifacts. We train a recommender system that suggests a list of documents sorted by their relevance to a given context or code snippet. For uncommon tokens, the suggestions should include at least every document that refers to the token directly (e.g. via an inverted index), as well as documents that are semantically or contextually related to the source code in non-obvious ways.
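
The direct-reference baseline mentioned above can be illustrated with a small inverted index. This is a minimal sketch, not the project's implementation: the tokenizer, the ranking rule (count of shared tokens), the function names and the toy corpus are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed tokenizer: identifier-like runs, lowercased.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(text):
    """Split text into lowercased identifier-like tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def build_inverted_index(docs):
    """Map each token to the set of document ids that mention it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def lookup(index, snippet):
    """Return documents sharing a token with the snippet, most shared tokens first."""
    counts = defaultdict(int)
    for token in set(tokenize(snippet)):
        for doc_id in index.get(token, ()):
            counts[doc_id] += 1
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(counts, key=lambda d: (-counts[d], d))


# Hypothetical two-document corpus, for illustration only.
docs = {
    "doc_a.html": "groupId returns the unique ID of the group this algorithm belongs to",
    "doc_b.html": "a feature holds geometry and attributes",
}
index = build_inverted_index(docs)
hits = lookup(index, "virtual QString groupId() const")
print(hits)
```

A learned recommender would be expected to return at least these direct hits, plus related documents that share no token with the snippet.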

Approach

We train a variational autoencoder (VAE) and use its encoder to project short sequences of text, together with their accompanying link, into link space. In the same manner, we train a second VAE on documents to learn a document space embedding. Finally, we train a supervised model that maps link space to document space, i.e. one that predicts the document(s) a link with an unknown destination may have targeted.
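
The final retrieval step can be sketched as follows. Everything concrete here is an assumption: a fixed matrix `W` stands in for the trained supervised model, the hand-written vectors stand in for the outputs of the two VAE encoders, and cosine similarity is one plausible way to rank documents that the text above does not specify.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_documents(link_vec, W, doc_vecs, k=3):
    """Map a link-space vector into document space and return the k nearest documents."""
    predicted = matvec(W, link_vec)
    scored = sorted(doc_vecs.items(), key=lambda kv: -cosine(predicted, kv[1]))
    return [doc_id for doc_id, _ in scored[:k]]


# Placeholder values: a 3-d link space mapped into a 2-d document space.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]
doc_vecs = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [1.0, 1.0]}
ranking = rank_documents([0.9, 0.1, 0.0], W, doc_vecs, k=2)
print(ranking)
```

In practice the mapping would be learned from links whose targets are known, then applied to links whose destination is unknown.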

Datasets

The following datasets are used to extract relevant links from documentation:

StackExchange contains a large dataset of programming-related Q&A.

It may also be interesting to explore code search and suggestion in a similar manner.

Links matching a simple pattern are collected from API documentation.
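
One way such a collection step might look, using only the Python standard library, is sketched below. The pattern (relative `.html` links), the context window size, and the names `LinkCollector` and `extract_rows` are assumptions; the output fields follow the columns of the dataset excerpt in the Sample section.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urldefrag

# Assumed "simple pattern": relative links to .html pages, with an optional fragment.
LINK_PATTERN = re.compile(r"^[^:]+\.html(#.*)?$")


class LinkCollector(HTMLParser):
    """Collects visible text chunks and remembers where each anchor occurred."""

    def __init__(self):
        super().__init__()
        self.chunks = []   # visible text, one entry per text run or anchor
        self.links = []    # (index into chunks, anchor text, href)
        self._href = None
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._anchor = []

    def handle_data(self, data):
        if self._href is not None:
            self._anchor.append(data)
        else:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = "".join(self._anchor)
            self.links.append((len(self.chunks), anchor, self._href))
            self.chunks.append(anchor)
            self._href = None


def extract_rows(html, source, window=40):
    """Return one row per matching link, with the link masked as <<LNK>> in its context."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    rows = []
    for pos, anchor, href in parser.links:
        if not LINK_PATTERN.match(href):
            continue
        before = "".join(parser.chunks[:pos])[-window:]
        after = "".join(parser.chunks[pos + 1:])[:window]
        target, fragment = urldefrag(href)
        rows.append({
            "link": anchor,
            "context": before + " <<LNK>> " + after,
            "source": source,
            "target": target,
            "fragment": "#" + fragment if fragment else "",
        })
    return rows


html = ('<p>Definition: <a href="qgsprocessingalgorithm_8h_source.html#l00223">'
        'qgsprocessingalgorithm.h:223</a></p>')
rows = extract_rows(html, source="qgsalgorithmswapxy_8h_source.html")
print(rows[0])
```

Masking the anchor text in the context keeps a model from simply copying the link text when predicting the target.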

Sample

The following is an excerpt from the post-processed documentation dataset:

link	context	source	target	fragment
"qgsprocessingalgorithm.h:223"	"orithm::groupIdvirtual QString groupId() constReturns the unique ID of the group this algorithm belongs to. Definition:  <<LNK>> "	"QGIS.tgz!/QGIS.docset/Contents/Resources/Documents/qgsalgorithmswapxy_8h_source.html"	"QGIS.tgz!/QGIS.docset/Contents/Resources/Documents/qgsprocessingalgorithm_8h_source.html"	"#l00223"
"QgsProcessingFeatureBasedAlgorithm"	" <<LNK>> An abstract QgsProcessingAlgorithm base class for processing algorithms which operate "feature-by-fea...Definition: qgsp"	"QGIS.tgz!/QGIS.docset/Contents/Resources/Documents/qgsalgorithmswapxy_8h_source.html"	"QGIS.tgz!/QGIS.docset/Contents/Resources/Documents/classQgsProcessingFeatureBasedAlgorithm.html"	""
"qgsprocessingalgorithm.h:867"	"ithmAn abstract QgsProcessingAlgorithm base class for processing algorithms which operate "feature-by-fea...Definition:  <<LNK>> "	"QGIS.tgz!/QGIS.docset/Contents/Resources/Documents/qgsalgorithmswapxy_8h_source.html"	"QGIS.tgz!/QGIS.docset/Contents/Resources/Documents/qgsprocessingalgorithm_8h_source.html"	"#l00867"
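
A file in this shape can be read with Python's `csv` module. The sketch below assumes the fields are tab-separated and that the surrounding double quotes are plain wrappers rather than CSV escaping, since the excerpt contains unescaped quotes inside the context field. The helper names are hypothetical, and the sample line is copied from the excerpt above.

```python
import csv
import io


def strip_quotes(value):
    """Remove one pair of wrapping double quotes, if present."""
    if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
        return value[1:-1]
    return value


def read_rows(handle):
    """Yield one dict per data row, keyed by the header names."""
    # QUOTE_NONE: treat quote characters as ordinary text, split on tabs only.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    for raw in reader:
        if len(raw) != len(header):
            continue  # skip malformed lines instead of guessing their layout
        yield {name: strip_quotes(value) for name, value in zip(header, raw)}


sample = "\n".join([
    "\t".join(["link", "context", "source", "target", "fragment"]),
    "\t".join([
        '"QgsProcessingFeatureBasedAlgorithm"',
        '" <<LNK>> An abstract QgsProcessingAlgorithm base class for processing '
        'algorithms which operate "feature-by-fea...Definition: qgsp"',
        '"QGIS.tgz!/QGIS.docset/Contents/Resources/Documents/qgsalgorithmswapxy_8h_source.html"',
        '"QGIS.tgz!/QGIS.docset/Contents/Resources/Documents/classQgsProcessingFeatureBasedAlgorithm.html"',
        '""',
    ]),
])
rows = list(read_rows(io.StringIO(sample)))
print(rows[0]["link"], repr(rows[0]["fragment"]))
```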

Experiments

  • Compare doc2vec with keyphrase / bag-of-words extraction.
  • Compare in-vocabulary to out-of-vocabulary retrieval precision.
  • Stack trace entity alignment to e.g. GitHub lines of code.
  • IDE based context alignment to e.g. StackOverflow issues.
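
For the in-vocabulary versus out-of-vocabulary comparison, one possible measurement is mean precision at k, computed separately for the two groups of queries. The metric, the query format and the function names below are assumptions made for illustration; the data is a toy example, not a result.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def precision_by_vocabulary(queries, vocabulary, k=5):
    """Average precision@k separately for in- and out-of-vocabulary query tokens."""
    buckets = {"in_vocabulary": [], "out_of_vocabulary": []}
    for q in queries:
        key = "in_vocabulary" if q["token"] in vocabulary else "out_of_vocabulary"
        buckets[key].append(precision_at_k(q["ranked"], q["relevant"], k))
    # A bucket with no queries reports None rather than a misleading 0.0.
    return {name: (sum(v) / len(v) if v else None) for name, v in buckets.items()}


# Toy queries: each has the query token, the model's ranking, and the true targets.
vocabulary = {"groupid"}
queries = [
    {"token": "groupid", "ranked": ["a", "b"], "relevant": {"a"}},
    {"token": "swapxy", "ranked": ["c", "d"], "relevant": {"e"}},
]
scores = precision_by_vocabulary(queries, vocabulary, k=2)
print(scores)
```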
