With the steadily increasing amount of data, handling it well is more relevant than ever. Keywords like big data, business intelligence, and data analytics are trending on many news sites. I wanted to get a clearer picture of the situation and decided to take a look at the job market in Germany, using the open-source language R and Jupyter Notebook for the analysis and documentation.
To get some data to work with, I scraped the site monster.de for job openings, searching for the term "Data Science". I excluded offerings for internships and part-time jobs.
For the scraping I used the rvest library combined with SelectorGadget, a point-and-click CSS selector tool. This made it easy to get hold of the data I was looking for.
The following code loops through the pages of monster.de, scrapes the company name of each job opening and collects the results into a character vector of 315 entries in total.
library(rvest)

# Loop over the 13 result pages, scrape the company name of every
# job opening, and flatten the results into one character vector.
companies = unlist(lapply(paste0("https://www.monster.de/jobs/suche/Festanstellung+Freie-Mitarbeit-Dienstvertrag+Vollzeit_888?cy=de&q=Data-Science&where=deutschland&rad=20-km&page=", 1:13),
  function(url) {
    url %>%
      read_html() %>%
      html_nodes(".company span") %>%
      html_text()
  }))
id | company_name |
---|---|
310 | Gefunden bei: SAP |
311 | Gefunden bei: SAP |
312 | Gefunden bei: SAP PS Consultants fuer Grossunternehmen in Bayern |
313 | Gefunden bei: SAP |
314 | Gefunden bei: Digital Performance GmbH |
315 | Gefunden bei: SAP |
Scraped data often has to be cleaned because the acquired values are inconsistent. In the following, gsub() uses a regular expression to eliminate the "Gefunden bei: " ("found at:") prefix at the beginning of some of the company names.
companies_clean = gsub("Gefunden bei: ", "", companies)
id | company_name |
---|---|
310 | SAP |
311 | SAP |
312 | SAP PS Consultants fuer Grossunternehmen in Bayern |
313 | SAP |
314 | Digital Performance GmbH |
315 | SAP |
The table() function then counts the frequency of each company name. Wrapping the result in as.data.frame() ties each frequency to its company name and displays them side by side.
companies_freq = as.data.frame(table(companies_clean))
id | company | frequency |
---|---|---|
6 | Accenture | 1 |
7 | Adidas | 1 |
8 | Amazon | 1 |
9 | Amgen | 1 |
10 | anykey GmbH | 2 |
11 | AppLift | 1 |
12 | Arvato Bertelsmann - (Embrace) Recruiting Services | 1 |
13 | arvato Financial Solutions | 1 |
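For a quicker overview, the frequency table can be renamed and sorted; a small sketch assuming the companies_freq data frame from above (the column names are my own choice):

```r
# Give the columns readable names and sort by frequency, descending,
# so the most active employers appear first.
names(companies_freq) = c("company", "frequency")
companies_sorted = companies_freq[order(-companies_freq$frequency), ]
head(companies_sorted)
```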
I wanted to know not only which companies are focusing on data science, but also whether it is limited to big cities. For this step, I scraped the location component of each job opening.
locations = unlist(lapply(paste0("https://www.monster.de/jobs/suche/Festanstellung+Freie-Mitarbeit-Dienstvertrag+Vollzeit_888?cy=de&q=Data-Science&where=deutschland&rad=20-km&page=", 1:13),
function(url){
url %>% read_html() %>%
html_nodes(".location a") %>%
html_text()
}))
id | location |
---|---|
20 | Dresden, Sachsen |
21 | Nürnberg, Bayern |
22 | München, Bayern |
23 | Bonn, Nordrhein-Westfalen |
24 | München, Bayern |
25 | Hannover, Niedersachsen |
26 | München, Bayern |
27 | München, Bayern |
The scraped location data was even messier than the company names, so more cleaning had to be done.
locations_clean = gsub("\r\n", "", locations)            # strip line breaks
locations_clean = gsub(",.*", "", locations_clean)       # drop the state after the comma
locations_clean = gsub("\u00FC", "ue", locations_clean)  # ü -> ue
locations_clean = gsub("\u00F6", "oe", locations_clean)  # ö -> oe
id | location |
---|---|
20 | Dresden |
21 | Nuernberg |
22 | Muenchen |
23 | Bonn |
24 | Muenchen |
25 | Hannover |
26 | Muenchen |
27 | Muenchen |
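The chain of gsub() calls can be generalized into a small helper, in case other German characters (such as ä and ß, which I did not encounter here) show up in future scrapes; a sketch in base R:

```r
# Apply a set of pattern -> replacement substitutions in order.
replace_all = function(x, map) {
  for (pattern in names(map)) {
    x = gsub(pattern, map[[pattern]], x)
  }
  x
}

# Transliteration map for common German characters.
umlauts = c("\u00FC" = "ue", "\u00F6" = "oe", "\u00E4" = "ae", "\u00DF" = "ss")
replace_all("M\u00FCnchen", umlauts)  # "Muenchen"
```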
Storing the cleaned locations in a data frame together with their frequencies gives a nice overview of the data.
locations_freq = as.data.frame(table(locations_clean))
id | location | frequency |
---|---|---|
2 | Aachen | 1 |
3 | Aschheim | 2 |
4 | Augsburg | 2 |
5 | Bad Homburg v.d. Hoehe | 1 |
6 | Baden-Baden | 1 |
7 | Bayern | 1 |
8 | Berlin | 38 |
To better understand the data, I wanted to visualize it on a map. Loading the ggmap library gives quick access to the Google Maps API.
library(ggmap)
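Note: in recent versions of ggmap, both geocode() and get_map() require a registered Google Maps API key; a minimal sketch (the key string is a placeholder, not a real key):

```r
# Register a Google Maps API key before calling geocode() or get_map().
# "YOUR_API_KEY" is a placeholder.
register_google(key = "YOUR_API_KEY")
```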
To place the data on a map, the location names had to be transformed into latitude and longitude values. The geocode() function queries the Google Maps API and does exactly that. The code below traverses the data frame and geocodes every location.
# Geocode every location name; each call returns a
# one-row data frame with lon and lat.
data = lapply(locations_freq[,1], function(x){
  geocode(toString(x))
})
lon | lat | lon.1 | lat.1 | lon.2 | lat.2 | … |
---|---|---|---|---|---|---|
-93.21909 | 30.23101 | 6.083887 | 50.77535 | 11.71591 | 48.17231 | … |
Transposing this one-row data frame gives a better view of the data. Unfortunately, the latitude and longitude values now alternate in a single column. To separate them I picked every second element, starting with the first (for longitude) and the second (for latitude) respectively.
lon | -93.219090 |
lat | 30.231008 |
lon.1 | 6.083887 |
lat.1 | 50.775346 |
lon.2 | 11.715910 |
lat.2 | 48.172310 |
df = unlist(data)                # flatten the list of geocode results
lon = df[seq(1, length(df), 2)]  # every odd element: longitude
lat = df[seq(2, length(df), 2)]  # every even element: latitude
longitude |
---|
-93.219090 |
6.083887 |
11.715910 |
10.897790 |
8.618162 |
8.228524 |
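As an aside, the odd/even indexing can be avoided entirely: since each geocode() call returns a one-row data frame, row-binding the list keeps lon and lat as two proper columns from the start. A sketch, assuming data is the list produced by the lapply() above:

```r
# Row-bind the list of one-row data frames into a single data frame
# with separate lon and lat columns.
coords = do.call(rbind, data)
head(coords)
```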
Binding the now separate latitude and longitude to the location name and its frequency gives a data frame that can be used for the visualization.
m = data.frame(locations_freq, lon, lat)
id | location | frequency | longitude | latitude |
---|---|---|---|---|
2 | Aachen | 1 | 6.083887 | 50.77535 |
3 | Aschheim | 2 | 11.715910 | 48.17231 |
4 | Augsburg | 2 | 10.897790 | 48.37054 |
5 | Bad Homburg v.d. Hoehe | 1 | 8.618162 | 50.22683 |
6 | Baden-Baden | 1 | 8.228524 | 48.76564 |
7 | Bayern | 1 | 11.497889 | 48.79045 |
8 | Berlin | 38 | 13.404954 | 52.52001 |
In the last step, the acquired data is finally plotted onto a map of Germany.
Germany = get_map(location = 'Germany', zoom = 6)
p = ggmap(Germany)
p = p + geom_point(data = m, aes(x = lon, y = lat, size = Freq), color = "red")
Berlin and Munich as the leading cities was no big surprise. Although I did expect some jobs in North Rhine-Westphalia, the cluster around Cologne was unexpected. Germany seems to be a rising market for data science: even if the jobs are not yet evenly distributed across the regions, a total of 315 openings looks like a solid foundation for future development.