wipo-analytics/drones

---
title: ""
author: ""
date: ""
output: github_document
---
<img src="man/figures/logo.png" align="left" />
---
[![Build Status](https://travis-ci.org/poldham/drones.svg?branch=master)](https://travis-ci.org/poldham/drones)
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](http://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active)
[![minimal R version](https://img.shields.io/badge/R%3E%3D-3.5.0-6666ff.svg)](https://cran.r-project.org/)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/drones)](https://cran.r-project.org/package=drones)
[![packageversion](https://img.shields.io/badge/Package%20version-0.0.0.9000-orange.svg?style=flat-square)](commits/master)
[![Last-changedate](https://img.shields.io/badge/last%20change-`r gsub('-', '--', Sys.Date())`-yellowgreen.svg)](/commits/master)

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, cache = TRUE)
```

---

#   The Drones Patent Dataset

The drones package provides access to patent datasets containing references to drones in the text. The datasets are intended to be used for training in patent analytics by providing access to raw and cleaned data in one place. 

The main `drones` dataset consists of 15,570 patent applications that refer to the word drone or drones somewhere in the text. The dataset is based on a search of patent documents from the main patent jurisdictions for the period 1845 to 2017 using the [Clarivate Analytics Derwent Innovation](https://clarivate.com/products/derwent-innovation/) database. Additional supplementary datasets may be provided in future updates.

The data within the drones package is intended exclusively for use in training in patent analytics as described on the [WIPO analytics](https://github.com/wipo-analytics) home page. It is important to note that the dataset is deliberately noisy in order to work on methods for cleaning data. This is a training dataset and is not intended to be used to make definitive statements about patent activity for drone technology and it is not expected to be complete. The drones patent dataset is part of work in progress on the [WIPO Patent Analytics Handbook](https://wipo-analytics.github.io/handbook/). 

The drones dataset is contained in an R package but you can use the drones dataset with or without R. 

### Download for use outside RStudio

If you want to work outside RStudio you can download the core `drones` reference file as a zip file containing the data in .csv format [here](https://storage.googleapis.com/owa/drones.csv.zip). Other datasets within the R package are simply subsets of the core datasets. See the online [reference page](https://wipo-analytics.github.io/drones/reference/index.html) for details. 

### Install in RStudio

If you are using RStudio then you can import the package from Github as follows. If you need to install RStudio follow the instructions to install R [here](https://cran.rstudio.com/) and RStudio [here](https://www.rstudio.com/products/rstudio/download/)

Make sure you have devtools installed and it is a very good idea to install the tidyverse if you don't have it already.

```{r eval=FALSE}
install.packages("devtools")
install.packages("tidyverse")
```

Next install the drones package. 

```{r eval=FALSE}
devtools::install_github("wipo-analytics/drones")
```

```{r}
library(drones)
```

### What is in the datasets

The datasets are fully documented inside the R package. 

The core drones dataset is a table with 15,570 observations of 22 variables:

- `abstract` The original document abstract, a character vector. 12,452 (94%)

- `abstract_english` The English document abstract, a character vector. 12,027 (95%)

- `application_number` The long application number including the date, a character vector. 15776, (100%)

- `basic_patent_date` Derwent Innovation basic patent date, a character vector. 2325 (93%)

- `basic_patent_number` The Derwent Innovation basic patent number forming the base for the dwpi_family, a character vector. 10,281 (93%)

- `applicant` The original applicant or assignee name, a character vector. 7488 (90%)

- `applicant_cleaned` A cleaned version of the applicant name, a character vector. 6744 (90%)

- `cited_nonpatent` Literature citations, field is noisy, a character vector. 28361 (30%)

- `cited_patents` Patents cited in one or more documents, a character vector. 93703 (71%)

- `citing_patents` Patents citing one or more documents, a character vector. 64328 (39%)

- `cpc` The Cooperative Patent Classification Codes (CPC), a character vector 17077 (92%)

- `dwpi_family_dates` Family dates for DWPI family numbers (Derwent World Patent Index), a character vector. 5153 (93%)

- `dwpi_family_kind` Document kind codes for DWPI Family members, a character vector. 42 (93%)

- `dwpi_family_numbers` DWPI family members, a character vector. 30984 (93%)

- `first_claim` The first claim in a patent document, a character vector. 13332 (97%)

- `inpadoc_family_members` INPADOC Family Members in long format with dates, a character vector. 49,625, (98%)

- `inpadoc_first_family_member` The earliest publication number in the `inpadoc_family_members` based on the date, a character vector. 9020 (98%)

- `inventor` The original inventor name, a character vector. 19293 (94%)

- `ipc` International Patent Classification (IPC) codes, a character vector. 8489 (98%)

- `priority_number` Patent priority numbers in long for with dates, a character vector. 23379 (99%)

- `publication_number` Publication numbers in short form minus dates, a character vector. 15776 (100%)

- `publication_year` The year of publication of the publication numbers, a character vector. 145 (99%)

- `related_application_numbers` Details of related patent applications, a character vector. 7124 (35%)

- `title_english` The English title, a character vector. 10815 (99%)

- `title_original` The original title, normally concatenated as English, French, German etc., character vector. 13,753 (97%)

Note that the coverage of each field typically does not add up to 100 percent of the documents. The numbers provided above are intended to provide reference counts for cross-checking when developing counts of the data. Typical reasons for variance from these counts will be failing to trim leading and trailing white space when separating concatenated fields on the semi-colon and NAs (Not Available) that appear in the data in R where there is less than 100% coverage. Where variance between the reference numbers is high you should investigate why. 

For R users it is possible that foreign characters are present in the main text fields. In the event you run into problems please raise an issue and provide details of the problem. 

### How to separate concatenated columns

Patent data is not tidy. Many of the columns contain data that is concatenated (joined) with a semicolon as a delimeter. In the case of the Lens patent data the delimited is a double semicolon.

This means that in order to access the data for counting you will need to separate the data onto the relevant row and you will also need to trim the data. You can do this easily with the tidyverse packages.

```{r eval = FALSE}
install.packages("tidyverse")
```

Load the library

```{r}
library(tidyverse)
```

To demonstrate this let's separate out the applicant (assignee) field and then count it up:

```{r}
applicants <- drones::applicants %>% 
  separate_rows(applicant_cleaned, sep = ";") %>%
  mutate(applicant_cleaned = str_trim(applicant_cleaned, side = "both")) %>% 
  drop_na(applicant_cleaned) # drop not available entries

applicants %>% count(applicant_cleaned, sort = TRUE)
```

If you wanted to run the same operation as a reusable function you could use this. Note that this has been adapted for tidy evaluation. In some fields it will be common for NA (not available) to appear prominently. Consider adding `dplyr::drop_na` to address these cases.

```{r}
separate_rows_trim <- function(df, col, sep){
  df %>% tidyr::separate_rows(col, sep = sep) %>% 
    dplyr::mutate(!!col := stringr::str_trim(.[[col]], side = "both")) %>% 
    dplyr::count(!!col := .[[col]], sort = TRUE)
}
```


We'll add more to the package by the way of guides and examples as we develop it. For the moment this helps to get you started. 

### Attribution

The lovely looking drone icon for the package was made by [Nikita Golubev](https://www.flaticon.com/authors/nikita-golubev) from www.flaticon.com
wipo-analytics / drones

About

Languages