fabge / data_science_germany

A brief analysis of the current job market for data scientists in Germany.

Data Science in Germany

With the amount of available data steadily increasing, the ability to handle it is more relevant than ever. Keywords like big data, business intelligence and data analytics are trending on many news sites. I wanted to get a clearer picture of the situation and decided to take a look at the job market in Germany. I used the open-source language R together with Jupyter Notebook for the analysis and documentation.

To get some data to work with, I scraped the site monster.de for job openings, searching for the term "Data Science". I excluded offerings for internships and part-time jobs.

For the scraping I used the rvest library combined with SelectorGadget, a point-and-click CSS selector tool. This made it easy to get hold of the data I was looking for.

The following code loops through the result pages of monster.de, scrapes the company name of each job opening and collects them in a character vector with 315 entries in total.

library(rvest)

# build the URLs for result pages 1 to 13 and scrape the company name of every job opening
companies = unlist(lapply(paste0("https://www.monster.de/jobs/suche/Festanstellung+Freie-Mitarbeit-Dienstvertrag+Vollzeit_888?cy=de&q=Data-Science&where=deutschland&rad=20-km&page=", 1:13),
              function(url){
                url %>% read_html() %>%            # download and parse the result page
                  html_nodes(".company span") %>%  # select the company name elements
                  html_text()                      # extract the text content
              }))
id    company_name
310   Gefunden bei: SAP
311   Gefunden bei: SAP
312   Gefunden bei: SAP PS Consultants fuer Grossunternehmen in Bayern
313   Gefunden bei: SAP
314   Gefunden bei: Digital Performance GmbH
315   Gefunden bei: SAP
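
As a quick check (not part of the original notebook), the length of the resulting vector confirms the 315 scraped openings:

length(companies)
# [1] 315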

Scraped data often has to be cleaned because the acquired data is inconsistent. In the following, gsub() uses a regular expression to eliminate the "Gefunden bei: " string at the beginning of some of the company names.

companies_clean = gsub("Gefunden bei: ", "", companies)
id    company_name
310   SAP
311   SAP
312   SAP PS Consultants fuer Grossunternehmen in Bayern
313   SAP
314   Digital Performance GmbH
315   SAP

The table() function then counts the frequency of each company name. The frequencies are tied to the company names and stored in a data frame.

companies_freq = as.data.frame(table(companies_clean))
id    company                                               frequency
6     Accenture                                             1
7     Adidas                                                1
8     Amazon                                                1
9     Amgen                                                 1
10    anykey GmbH                                           2
11    AppLift                                               1
12    Arvato Bertelsmann - (Embrace) Recruiting Services    1
13    arvato Financial Solutions                            1
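
To surface the most active employers, the frequency table can be sorted in descending order. This is a small sketch that was not part of the original notebook; it only relies on the Freq column that as.data.frame(table(...)) produces.

# show the ten companies with the most job openings
head(companies_freq[order(-companies_freq$Freq), ], 10)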

I not only wanted to know which companies are focusing on data science, but also whether it is limited to big cities. For this step, I scraped the location component of each job opening.

locations = unlist(lapply(paste0("https://www.monster.de/jobs/suche/Festanstellung+Freie-Mitarbeit-Dienstvertrag+Vollzeit_888?cy=de&q=Data-Science&where=deutschland&rad=20-km&page=", 1:13),
                          function(url){
                            url %>% read_html() %>% 
                              html_nodes(".location a") %>% 
                              html_text()
                          }))
id    location
20    Dresden, Sachsen
21    Nürnberg, Bayern
22    München, Bayern
23    Bonn, Nordrhein-Westfalen
24    München, Bayern
25    Hannover, Niedersachsen
26    München, Bayern
27    München, Bayern

The scraped location data was even messier than the company names, so more data cleaning had to be done.

locations_clean = gsub("\r\n", "", locations)            # remove line breaks
locations_clean = gsub(",.*", "", locations_clean)       # keep only the city name, drop the state after the comma
locations_clean = gsub("\u00FC", "ue", locations_clean)  # replace the umlaut ü with ue
locations_clean = gsub("\u00F6", "oe", locations_clean)  # replace the umlaut ö with oe
id    location
20    Dresden
21    Nuernberg
22    Muenchen
23    Bonn
24    Muenchen
25    Hannover
26    Muenchen
27    Muenchen

Storing the cleaned locations into a data frame and attaching the frequencies allowed for a nice overview of the data.

locations_freq = as.data.frame(table(locations_clean))
id    location                  frequency
2     Aachen                    1
3     Aschheim                  2
4     Augsburg                  2
5     Bad Homburg v.d. Hoehe    1
6     Baden-Baden               1
7     Bayern                    1
8     Berlin                    38

To better understand the data, I wanted to visualize it on a map. The ggmap library gives quick access to the Google Maps API.

library(ggmap)
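
Depending on the ggmap version, the Google Maps API has to be unlocked with a registered API key before get_map() and geocode() can be used. The key below is only a placeholder, not one used in the original analysis.

register_google(key = "YOUR_API_KEY")  # placeholder key, required by newer ggmap versions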

To visualize the data on a map, the location names had to be transformed into latitude and longitude values. The geocode() function accesses the Google Maps API and does exactly that. The code below traverses the data frame and geocodes every location.

data = lapply(locations_freq[,1], function(x){
  geocode(toString(x))
})
lon           lat
-93.219090    30.231008
  6.083887    50.775346
 11.715910    48.172310
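
Each element of this list is what geocode() returns for a single city: a one-row data frame with a lon and a lat column. The call below is only an illustration; the values shown correspond to the Berlin entry that appears further down.

geocode("Berlin")
#        lon      lat
# 1 13.40495 52.52001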

Combining the geocoded pairs into a single data frame and transposing it gives a better view of the data. Unfortunately, longitude and latitude end up in a single column. To separate them, I picked every second element, starting with the first element for the longitudes and with the second for the latitudes.

lon     -93.219090
lat      30.231008
lon.1     6.083887
lat.1    50.775346
lon.2    11.715910
lat.2    48.172310
df = t(as.data.frame(data))        # combine the geocoded pairs and transpose them into a single column
lon = df[seq(1, length(df), 2)]    # every second element, starting with the first: longitudes
lat = df[seq(2, length(df), 2)]    # every second element, starting with the second: latitudes

longitude
-93.219090
  6.083887
 11.715910
 10.897790
  8.618162
  8.228524

Binding the now separate latitude and longitude to the location name and its frequency gives a data frame that can be used for the visualization.

m = data.frame(locations_freq, lon, lat)
id    location                  frequency   longitude   latitude
2     Aachen                    1            6.083887   50.77535
3     Aschheim                  2           11.715910   48.17231
4     Augsburg                  2           10.897790   48.37054
5     Bad Homburg v.d. Hoehe    1            8.618162   50.22683
6     Baden-Baden               1            8.228524   48.76564
7     Bayern                    1           11.497889   48.79045
8     Berlin                    38          13.404954   52.52001

In the last step, the acquired data is plotted onto a map of Germany.

Germany = get_map(location = 'Germany', zoom = 6)   # fetch a map of Germany
p = ggmap(Germany)
p = p + geom_point(data = m, aes(x = lon, y = lat, size = Freq), color = "red")   # one point per city, sized by the number of openings
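
To render or export the finished map (a step not shown in the original notebook), the plot object can be printed or written to disk with ggsave(); the filename is just a placeholder.

print(p)                                  # render the map
ggsave("data_science_jobs_map.png", p)    # placeholder filename for saving the plot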

Berlin and Munich as the leading cities was no big surprise. Although I did expect some jobs in North Rhine-Westphalia, the cluster around Cologne was unexpected. It seems that Germany is a rising market for data science. Even if the jobs are not yet spread evenly across the regions, with a total of 315 job openings there seems to be a solid foundation for future development.
