A simple project proposed for a selection process.
The details of the challenge can be found here, and the answers, here.
- OpenJDK (1.8.0_212)
- Apache Spark (2.4.3)
- Scala (2.11.12)
- Clone it
git clone https://github.com/gwyddie/semantix-challenge.git
- Change directory
cd semantix-challenge
- Run it
spark-shell -i Main.scala
The script, written in Scala, searches the logs/
directory for log files and then matches each line against the following RegExp:
/^(.*)\s-\s-\s\[(.*)\]\s"(\w*)\s(.*)"\s(\d+)\s?(\d+|-)$/
Any line that does not match this pattern is ignored.
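For illustration, here is a minimal sketch of how that pattern can be applied in Scala; the sample line is a made-up NASA-style access-log entry, not something taken from the script:

```scala
// A sketch of the matching step: compile the pattern (same expression
// as above, without the surrounding slashes) and extract its groups.
val logPattern =
  """^(.*)\s-\s-\s\[(.*)\]\s"(\w*)\s(.*)"\s(\d+)\s?(\d+|-)$""".r

val sample =
  """unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985"""

sample match {
  case logPattern(host, timestamp, method, resource, status, bytes) =>
    println(s"host=$host status=$status bytes=$bytes")
  case _ => () // non-matching lines are simply skipped
}
```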
After that, it maps the matching lines to a case class, and then all it does is boring calculations, which you can check out in the script.
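As a sketch of that mapping, reusing logPattern from the snippet above (the case class name and fields here are assumptions; the real definition lives in Main.scala):

```scala
// Hypothetical case class mirroring the regex's capture groups.
case class LogEntry(
  host: String,
  timestamp: String,
  method: String,
  resource: String,
  status: Int,
  bytes: Long
)

// Keep only the lines that match; everything else becomes None.
// Treating "-" as zero bytes is an assumption, not the script's rule.
def parse(line: String): Option[LogEntry] = line match {
  case logPattern(host, ts, method, resource, status, bytes) =>
    Some(LogEntry(host, ts, method, resource, status.toInt,
      if (bytes == "-") 0L else bytes.toLong))
  case _ => None
}
```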
The logs/
directory is where the logs live. Put the NASA logs there, and the script will find, read, and evaluate them, then print results for the following statistics (sketched in code after this list):
- number of unique hosts
- total number of 404 errors
- URLs that produced the most 404 errors
- number of 404 errors per day
- total number of bytes returned
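A rough sketch of how those numbers could be computed with Spark's RDD API, building on the hypothetical parse and LogEntry above (the actual aggregations are in Main.scala and may differ):

```scala
// In spark-shell, sc is already in scope.
val entries = sc.textFile("logs/*").flatMap(parse).cache()

// Number of unique hosts.
val uniqueHosts = entries.map(_.host).distinct().count()

// Total number of 404 errors.
val notFound = entries.filter(_.status == 404).cache()
val total404 = notFound.count()

// URLs with the most 404 responses (top 5 is an arbitrary cutoff).
val top404Urls = notFound
  .map(e => (e.resource, 1L))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
  .take(5)

// 404 errors per day: the day is the leading "dd/MMM/yyyy" part
// of the timestamp, e.g. "01/Jul/1995".
val errorsPerDay = notFound
  .map(e => (e.timestamp.take(11), 1L))
  .reduceByKey(_ + _)
  .collect()

// Total number of bytes returned.
val totalBytes = entries.map(_.bytes).sum()
```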
That's all, folks!
Thanks ;)