With the steadily increasing amount of data, handling it well is more relevant than ever. Keywords like big data, business intelligence, and data analytics are trending on many news sites. I wanted to get a clearer picture of the situation and decided to take a look at the job market in Germany, using the open-source language R and Jupyter Notebook for the analysis and documentation.
To get some data to work with, I scraped the site monster.de for job openings, searching for the term "Data Science". I excluded offerings for internships and part-time jobs.
For the scraping I used the rvest library combined with SelectorGadget, a point-and-click CSS selector tool. This made it easy to get hold of the data I was looking for.
The following code loops through the pages of monster.de, scrapes the company name of each job opening and collects the results into a character vector of 315 entries in total.
library(rvest)

# Loop over the 13 result pages, scrape the company name of every
# job opening, and flatten the results into one character vector.
companies = unlist(lapply(paste0("https://www.monster.de/jobs/suche/Festanstellung+Freie-Mitarbeit-Dienstvertrag+Vollzeit_888?cy=de&q=Data-Science&where=deutschland&rad=20-km&page=", 1:13),
  function(url) {
    url %>%
      read_html() %>%
      html_nodes(".company span") %>%
      html_text()
  }))
id | company_name |
---|---|
310 | Gefunden bei: SAP |
311 | Gefunden bei: SAP |
312 | Gefunden bei: SAP PS Consultants fuer Grossunternehmen in Bayern |
313 | Gefunden bei: SAP |
314 | Gefunden bei: Digital Performance GmbH |
315 | Gefunden bei: SAP |
Scraped data often has to be cleaned because the acquired values are inconsistent. In the following, gsub() uses a regular expression to eliminate the "Gefunden bei: " ("found at:") prefix at the beginning of some of the company names.
companies_clean = gsub("Gefunden bei: ", "", companies)
id | company_name |
---|---|
310 | SAP |
311 | SAP |
312 | SAP PS Consultants fuer Grossunternehmen in Bayern |
313 | SAP |
314 | Digital Performance GmbH |
315 | SAP |
The table() function then counts the frequency of each company name. Wrapping the result in as.data.frame() ties each frequency to its company name and displays them side by side.
companies_freq = as.data.frame(table(companies_clean))
id | company | frequency |
---|---|---|
6 | Accenture | 1 |
7 | Adidas | 1 |
8 | Amazon | 1 |
9 | Amgen | 1 |
10 | anykey GmbH | 2 |
11 | AppLift | 1 |
12 | Arvato Bertelsmann - (Embrace) Recruiting Services | 1 |
13 | arvato Financial Solutions | 1 |
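For a quicker overview, the frequency table can be renamed and sorted; a small sketch assuming the companies_freq data frame from above (the column names are my own choice):

```r
# Give the columns readable names and sort by frequency, descending,
# so the most active employers appear first.
names(companies_freq) = c("company", "frequency")
companies_sorted = companies_freq[order(-companies_freq$frequency), ]
head(companies_sorted)
```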
I wanted to know not only which companies are focusing on data science, but also whether it is limited to big cities. For this step, I scraped the location component of each job opening.
locations = unlist(lapply(paste0("https://www.monster.de/jobs/suche/Festanstellung+Freie-Mitarbeit-Dienstvertrag+Vollzeit_888?cy=de&q=Data-Science&where=deutschland&rad=20-km&page=", 1:13),
function(url){
url %>% read_html() %>%
html_nodes(".location a") %>%
html_text()
}))
id | location |
---|---|
20 | Dresden, Sachsen |
21 | Nürnberg, Bayern |
22 | München, Bayern |
23 | Bonn, Nordrhein-Westfalen |
24 | München, Bayern |
25 | Hannover, Niedersachsen |
26 | München, Bayern |
27 | München, Bayern |
The scraped location data was even messier than the company names, so more cleaning had to be done.
locations_clean = gsub("\r\n", "", locations)            # strip line breaks
locations_clean = gsub(",.*", "", locations_clean)       # drop the state after the comma
locations_clean = gsub("\u00FC", "ue", locations_clean)  # ü -> ue
locations_clean = gsub("\u00F6", "oe", locations_clean)  # ö -> oe
id | location |
---|---|
20 | Dresden |
21 | Nuernberg |
22 | Muenchen |
23 | Bonn |
24 | Muenchen |
25 | Hannover |
26 | Muenchen |
27 | Muenchen |
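The chain of gsub() calls can be generalized into a small helper, in case other German characters (such as ä and ß, which I did not encounter here) show up in future scrapes; a sketch in base R:

```r
# Apply a set of pattern -> replacement substitutions in order.
replace_all = function(x, map) {
  for (pattern in names(map)) {
    x = gsub(pattern, map[[pattern]], x)
  }
  x
}

# Transliteration map for common German characters.
umlauts = c("\u00FC" = "ue", "\u00F6" = "oe", "\u00E4" = "ae", "\u00DF" = "ss")
replace_all("M\u00FCnchen", umlauts)  # "Muenchen"
```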
Storing the cleaned locations in a data frame together with their frequencies gives a nice overview of the data.
locations_freq = as.data.frame(table(locations_clean))
id | location | frequency |
---|---|---|
2 | Aachen | 1 |
3 | Aschheim | 2 |
4 | Augsburg | 2 |
5 | Bad Homburg v.d. Hoehe | 1 |
6 | Baden-Baden | 1 |
7 | Bayern | 1 |
8 | Berlin | 38 |
To better understand the data, I wanted to visualize it on a map. Loading the ggmap library gives quick access to the Google Maps API.
library(ggmap)
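Note: in recent versions of ggmap, both geocode() and get_map() require a registered Google Maps API key; a minimal sketch (the key string is a placeholder, not a real key):

```r
# Register a Google Maps API key before calling geocode() or get_map().
# "YOUR_API_KEY" is a placeholder.
register_google(key = "YOUR_API_KEY")
```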
To place the data on a map, the location names had to be transformed into latitude and longitude values. The geocode() function queries the Google Maps API and does exactly that. The code below traverses the data frame and geocodes every location.
# Geocode every location name; each call returns a
# one-row data frame with lon and lat.
data = lapply(locations_freq[,1], function(x){
  geocode(toString(x))
})
lon | lat | lon.1 | lat.1 | lon.2 | lat.2 | … |
---|---|---|---|---|---|---|
-93.21909 | 30.23101 | 6.083887 | 50.77535 | 11.71591 | 48.17231 | … |
Transposing this one-row data frame gives a better view of the data. Unfortunately, the latitude and longitude values now alternate in a single column. To separate them I picked every second element, starting with the first (for longitude) and the second (for latitude) respectively.
lon | -93.219090 |
lat | 30.231008 |
lon.1 | 6.083887 |
lat.1 | 50.775346 |
lon.2 | 11.715910 |
lat.2 | 48.172310 |
df = unlist(data)                # flatten the list of geocode results
lon = df[seq(1, length(df), 2)]  # every odd element: longitude
lat = df[seq(2, length(df), 2)]  # every even element: latitude
longitude |
---|
-93.219090 |
6.083887 |
11.715910 |
10.897790 |
8.618162 |
8.228524 |
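As an aside, the odd/even indexing can be avoided entirely: since each geocode() call returns a one-row data frame, row-binding the list keeps lon and lat as two proper columns from the start. A sketch, assuming data is the list produced by the lapply() above:

```r
# Row-bind the list of one-row data frames into a single data frame
# with separate lon and lat columns.
coords = do.call(rbind, data)
head(coords)
```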
Binding the now separate latitude and longitude to the location name and its frequency gives a data frame that can be used for the visualization.
m = data.frame(locations_freq, lon, lat)
id | location | frequency | longitude | latitude |
---|---|---|---|---|
2 | Aachen | 1 | 6.083887 | 50.77535 |
3 | Aschheim | 2 | 11.715910 | 48.17231 |
4 | Augsburg | 2 | 10.897790 | 48.37054 |
5 | Bad Homburg v.d. Hoehe | 1 | 8.618162 | 50.22683 |
6 | Baden-Baden | 1 | 8.228524 | 48.76564 |
7 | Bayern | 1 | 11.497889 | 48.79045 |
8 | Berlin | 38 | 13.404954 | 52.52001 |
In the last step, the acquired data is finally plotted onto a map of Germany.
Germany = get_map(location = 'Germany', zoom = 6)
p = ggmap(Germany)
p = p + geom_point(data = m, aes(x = lon, y = lat, size = Freq), color = "red")
Berlin and Munich as the leading cities was no big surprise. Although I did expect some jobs in North Rhine-Westphalia, the cluster around Cologne was unexpected. Germany seems to be a rising market for data science: even if the jobs are not yet evenly distributed across the regions, a total of 315 openings looks like a solid foundation for future development.