DanijelMisulic / BookRecommendation

App for book recommendation using cosine similarity.

Book Recommendation

##About the project

Though the Web was originally conceived to be used by human users, new data-oriented content has been produced and made available on the Web with the introduction and development of the Semantic Web idea. More recently, there has been growing interest in the Linked Open Data (LOD) initiative. The cornerstone of Linked Open Data is making free and open RDF datasets available and linking them with each other.

The aim of this project is to create a book recommender system. The idea is to implement a system that generates a list of suggested books to read for a selected book. The project was inspired by the paper MORE: More than Movie Recommendation [1], whose authors describe a web application for movie recommendation based on the movies' attributes.

The project workflow consists of the following steps:

  • Collecting data from DBPedia and preprocessing
  • Building recommendation system
  • Implementation
  • Technical realisation

##Collecting data from DBPedia and preprocessing

Datasets used in this project are extracted from DBpedia, the RDF-based version of Wikipedia. RDF (Resource Description Framework) is a standard model for data interchange on the Web. To search DBpedia we used SPARQL (SPARQL Protocol and RDF Query Language), which makes it possible to run complex queries against DBpedia.

An example of a SPARQL query for extracting the data is shown in Listing 1. Books are filtered by the literary movements of their authors. Only books that have all of their attribute values in English are considered and used for the recommender system. The SELECT part of the query lists the attributes required for the dataset.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX ontology: <http://dbpedia.org/ontology/>

SELECT DISTINCT ?bookURI ?bookName ?authorName ?authorMovement ?bookGenre ?bookAbstract
where {
?bookURI rdf:type ontology:Book .
?bookURI  ontology:author ?author .
?bookURI  ontology:abstract ?bookAbstract . 
?bookURI  ontology:literaryGenre ?genre . 
?bookURI rdfs:label ?bookName .

?author rdfs:label ?authorName . 
?author ontology:movement ?movement .
?genre rdfs:label ?bookGenre . 
?movement rdfs:label ?authorMovement .
FILTER (regex(?authorMovement, "Romanticism", "i") || regex(?authorMovement, "Realism", "i") || regex(?authorMovement, "Social novel", "i") || regex(?authorMovement, "19th-century French literature", "i") || regex(?authorMovement, "Proletarian literature", "i") || regex(?authorMovement, "Science fiction", "i") || regex(?authorMovement, "Detective fiction", "i") || regex(?authorMovement ,"Impressionism", "i") || regex(?authorMovement ,"Modernism", "i"))
FILTER (lang(?authorName) = "en" && lang(?bookName) = "en" && lang(?bookAbstract) = "en" && lang(?authorMovement) = "en" && lang(?bookGenre) = "en") 

}

Listing 1 - SPARQL query for collecting data

The results of this query are available here. The extracted data is stored in the CSV file data/bookDataSet.csv. A snippet of the collected data is given in Listing 2.

uri,name,author_name,author_movement,genre,abstract

http://dbpedia.org/resource/The_Brothers_Karamazov,The Brothers Karamazov,Fyodor Dostoyevsky,Literary realism,Philosophical fiction,The Brothers Karamazov also translated as The Karamazov Brothers, is the final novel by the Russian author Fy...	

http://dbpedia.org/resource/Crime_and_Punishment,Crime and Punishment,Fyodor Dostoyevsky,Literary realism,Philosophical fiction/Psychological novel,Crime and Punishment is a novel by the Russian author Fyodor Dostoyevsky. It was first published in the lit...

http://dbpedia.org/resource/The_Village_of_Stepanchikovo,The Village of Stepanchikovo,Fyodor Dostoyevsky,Literary realism,Satire,The Village of Stepanchikovo also known as The Friend of the Famil...					

Listing 2 - Data snippet
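The snippet in Listing 2 shows two details that matter when reading the file: the abstract can contain commas, and several genres can be packed into one field separated by "/". A minimal sketch of parsing one such row is given below. It assumes that only the last column (abstract) contains commas and that fields are not quoted; `BookRowParser` and `BookRow` are hypothetical names used for illustration and are not the project's own classes.

```java
import java.util.Arrays;
import java.util.List;

class BookRowParser {

    /** Hypothetical holder for one parsed CSV row (not the project's Book class). */
    static class BookRow {
        final String uri, name, authorName, authorMovement, abstractText;
        final List<String> genres;

        BookRow(String[] f) {
            uri = f[0];
            name = f[1];
            authorName = f[2];
            authorMovement = f[3];
            // Several genres may share one field, separated by "/".
            genres = Arrays.asList(f[4].split("/"));
            abstractText = f[5];
        }
    }

    /** Parses one data line in the column order uri,name,author_name,author_movement,genre,abstract. */
    static BookRow parse(String line) {
        // A limit of 6 keeps commas inside the abstract (assumed to be the last column) intact.
        String[] fields = line.split(",", 6);
        if (fields.length < 6) {
            throw new IllegalArgumentException("Expected 6 columns but found " + fields.length);
        }
        return new BookRow(fields);
    }

    public static void main(String[] args) {
        BookRow row = parse("http://dbpedia.org/resource/Crime_and_Punishment,Crime and Punishment,"
                + "Fyodor Dostoyevsky,Literary realism,Philosophical fiction/Psychological novel,"
                + "Crime and Punishment is a novel by the Russian author Fyodor Dostoyevsky.");
        System.out.println(row.name + " -> " + row.genres);
    }
}
```

If the real file quotes fields or has commas in other columns, a proper CSV library would be a safer choice than `String.split`.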

##Building recommendation system

###Vector Space Model

In order to compute the similarities, VSM (Vector Space Model) is implemented. In VSM non-binary weights are assigned to index terms in queries and in documents (represented as sets of terms), and are used to compute the degree of similarity between each document in the collection and the query. [1]

Book attributes that are used as a base for recommendation are:

  • author_name - name of a book's author,
  • genre - a genre of a book,
  • author_movement - the literary movement of a book's author.

So, the goal is to create a vector of values for the listed attributes for every book and calculate its similarity score with the vectors of all other books in the dataset.

To increase precision, it is recommended to build the vectors from TF-IDF [4] values. TF (term frequency) measures how many times the terms of the vocabulary E(t) occur in a document. We define the term frequency as a counting function [4]:

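$$\mathrm{tf}(t, d) = \sum_{x \in d} \mathrm{fr}(x, t)$$

where fr(x, t) is a simple function defined as:

$$\mathrm{fr}(x, t) = \begin{cases} 1, & \text{if } x = t \\ 0, & \text{otherwise} \end{cases}$$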

Listing 3 shows the code snippet for calculating the tf value of the genre attribute.

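public double calculateGenreTF(Book mainBook, Book book) {
	double tf = 0;

	// Count every genre of the compared book that also appears among the main book's genres.
	for(int i = 0; i < book.getGenres().size(); i++) {
		for(int j = 0; j < mainBook.getGenres().size(); j++) {
			if(book.getGenres().get(i).equals(mainBook.getGenres().get(j))) {
				tf++;
			}
		}
	}

	// Normalize by the number of genres of the compared book.
	// If that book has no genres, this is 0.0 / 0 and the result is NaN.
	return tf/book.getGenres().size();
}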

Listing 3 - Java code for calculating tf
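To make the counting in Listing 3 concrete, the sketch below applies the same nested-loop logic to plain lists of genre names. It is only an illustration: `GenreTfSketch` and `genreTf` are hypothetical names, the genre lists are taken from the examples in this document, and an extra guard for an empty genre list is an assumption that Listing 3 itself does not make.

```java
import java.util.Arrays;
import java.util.List;

class GenreTfSketch {

    /** Share of the compared book's genres that also occur among the main book's genres. */
    static double genreTf(List<String> mainGenres, List<String> bookGenres) {
        // Assumption of this sketch: a book without genres gets 0 instead of NaN.
        if (bookGenres.isEmpty()) {
            return 0.0;
        }
        double matches = 0;
        for (String bookGenre : bookGenres) {
            for (String mainGenre : mainGenres) {
                if (bookGenre.equals(mainGenre)) {
                    matches++;
                }
            }
        }
        return matches / bookGenres.size();
    }

    public static void main(String[] args) {
        List<String> crimeAndPunishment = Arrays.asList("Philosophical fiction", "Psychological novel");

        // One of two genres is shared.
        System.out.println(genreTf(crimeAndPunishment,
                Arrays.asList("Philosophical fiction", "Political fiction"))); // prints 0.5
        // No genre is shared.
        System.out.println(genreTf(crimeAndPunishment, Arrays.asList("Satire"))); // prints 0.0
    }
}
```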

Some terms may be very common, so we use IDF (Inverse Document Frequency), which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely. In the context of book recommendation, the author_name attribute is more relevant for the recommender system than the author_movement attribute, because there are many more books that belong to the Literary realism movement than books written by Fyodor Dostoyevsky. IDF is calculated according to the following formula:

$$\mathrm{idf}(t) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

with

  • N: total number of documents in the corpus N = |D|
  • |{d in D : t in d}| : number of documents where the term t appears.

Listing 4 shows a snippet of Java code for calculating the idf value of the author_name attribute.

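public double calculateAuthorIDF(Book mainBook, Book book) {
	// Count the books in the dataset written by the main book's author.
	int counter = 0;
	for (Book b : books) {
		if (b.getAuthorName().equals(mainBook.getAuthorName())) {
			counter++;
		}
	}
	
	// idf = log10(N / number of books by this author); rarer authors get a higher weight.
	return Math.log10((books.size() * 1.0) / (counter * 1.0));
}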

Listing 4 - Java code for calculating idf
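The effect of the formula can be checked with a few numbers. The sketch below uses a purely hypothetical corpus of 1000 books in which 10 books share an author and 100 books share a literary movement; these counts are invented for illustration and do not describe the project's dataset. `IdfSketch` and `idf` are likewise hypothetical names.

```java
class IdfSketch {

    /** Base-10 inverse document frequency: log10(N / number of documents containing the term). */
    static double idf(int totalDocuments, int documentsWithTerm) {
        if (documentsWithTerm <= 0) {
            throw new IllegalArgumentException("The term must occur in at least one document");
        }
        return Math.log10((double) totalDocuments / documentsWithTerm);
    }

    public static void main(String[] args) {
        int corpusSize = 1000;      // hypothetical number of books
        int sameAuthor = 10;        // hypothetical: books by the same author
        int sameMovement = 100;     // hypothetical: books in the same movement

        System.out.println("author idf:   " + idf(corpusSize, sameAuthor));   // prints author idf:   2.0
        System.out.println("movement idf: " + idf(corpusSize, sameMovement)); // prints movement idf: 1.0
    }
}
```

Because fewer books share an author than a movement, a match on the author contributes a larger weight to the vector, which is the behaviour described above.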

Now, if the input for the recommender is the book Crime and Punishment, the vectors for this book and some of the other books look like those in Listing 5.

Crime and Punishment (1.9594083500800643, 1.505149978319906, 1.541872785344646)
The Brothers Karamazov (1.9594083500800643, 0.752574989159953, 1.541872785344646)
The Village of Stepanchikovo (1.9594083500800643, 0.0, 1.541872785344646)

Listing 5 - Vector examples

###Cosine similarity

The cosine similarity between two vectors (or two documents in the Vector Space) is a measure that calculates the cosine of the angle between them. This metric measures orientation rather than magnitude, so it can be seen as a comparison between documents in a normalized space. The equation for calculating cosine similarity is depicted in Figure 1 [2]:

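$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$

Figure 1 - Cosine similarity equation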

The dividend is the dot product of the two vectors, and the divisor is the product of their magnitudes. Cosine similarity therefore indicates how related two documents are by looking at the angle between them instead of their magnitudes. The closer the result is to 1, the more similar the two vectors (documents/books) are. On the other hand, if the result tends to 0, the vectors are orthogonal (the angle between them is 90 degrees), which means the books have little in common. [3]

Listing 6 gives the Java code for calculating cosine similarity.

public double calculateCosineSimilarity(double[] vector1, double[] vector2) {
	double dotProduct = 0;
	double squaredNorm1 = 0;
	double squaredNorm2 = 0;
	for (int i = 0; i < vector1.length; i++) {
		dotProduct = dotProduct + vector1[i] * vector2[i];
		squaredNorm1 = squaredNorm1 + vector1[i] * vector1[i];
		squaredNorm2 = squaredNorm2 + vector2[i] * vector2[i];
	}
	// Product of the two vector magnitudes (the divisor in Figure 1).
	double denominator = Math.sqrt(squaredNorm1) * Math.sqrt(squaredNorm2);
	// A zero or NaN denominator means no angle can be computed, so the books are treated as not similar.
	if (denominator == 0 || Double.isNaN(denominator)) {
		return 0;
	}
	return dotProduct / denominator;
}

Listing 6 - Cosine similarity calculating in Java

If this method is invoked for the example from the previous section, the program generates the result shown in Listing 7.

Crime and Punishment: 1.0
The Brothers Karamazov: 0.9689186192323702
The Village of Stepanchikovo: 0.8561026924522617

Listing 7 - Cosine similarity values for given vectors
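The values in Listing 7 can be checked directly against the vectors in Listing 5. The following self-contained sketch recomputes them; `CosineCheck` and `cosine` are hypothetical names for this illustration, and the method simply repeats the formula from Figure 1 rather than calling the project's own class.

```java
class CosineCheck {

    /** Cosine of the angle between two equally long vectors; 0 if either vector has zero length. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0;
        double squaredNormA = 0;
        double squaredNormB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            squaredNormA += a[i] * a[i];
            squaredNormB += b[i] * b[i];
        }
        double denominator = Math.sqrt(squaredNormA) * Math.sqrt(squaredNormB);
        return denominator == 0 ? 0 : dot / denominator;
    }

    public static void main(String[] args) {
        // Vectors copied from Listing 5.
        double[] crimeAndPunishment = {1.9594083500800643, 1.505149978319906, 1.541872785344646};
        double[] brothersKaramazov = {1.9594083500800643, 0.752574989159953, 1.541872785344646};
        double[] villageOfStepanchikovo = {1.9594083500800643, 0.0, 1.541872785344646};

        System.out.println("The Brothers Karamazov: " + cosine(crimeAndPunishment, brothersKaramazov));
        System.out.println("The Village of Stepanchikovo: " + cosine(crimeAndPunishment, villageOfStepanchikovo));
    }
}
```

The printed values should agree with Listing 7 up to floating-point rounding. The three vectors differ only in the genre component, so the genre match alone decides the ranking among books by the same author.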

##Implementation

This recommender system accepts two parameters for every recommendation: a book and the number of recommendations to display.

List<Book> lb = Controler.getInstance().getRecommendedBooks(mainBook, 20);

Listing 8 - Starting the system from main class

For the input book "Crime and Punishment" by Dostoyevsky, the system builds a vector for every book in the dataset (calculating TF-IDF values for its attributes, as shown in the previous section), and then calculates the cosine similarity between the selected book and every other book in the dataset:

for (int i = 0; i < books.size(); i++) {
	BookVector compareVector = new BookVector(mainBook, books.get(i), (ArrayList<Book>) books);
	double value = cosineSimilarityCalculator.calculateCosineSimilarity(vectorMain.getBookVector(),compareVector.getBookVector());
	ValuedBook valuedBook = new ValuedBook(books.get(i), value);
	valuedBooks.add(valuedBook);
}

...

VectorBuilder vb = new VectorBuilder(books);
attributeVector = vb.createVector(mainBook, book);

...

public double[] createVector(Book mainBook, Book book) {
	bookAuthor = tfidfCalculator.calculateAuthorTFIDF(mainBook, book);
	bookGenre = tfidfCalculator.calculateGenreTDIDF(mainBook, book);
	authorMovement = tfidfCalculator.calculateMovementTFIDF(mainBook, book);
	double[] attributeVector = new double[3];
	attributeVector[0] = bookAuthor;
	attributeVector[1] = bookGenre;
	attributeVector[2] = authorMovement;
	return attributeVector;
}

//Vector creation is explained in previous section

Listing 9 - The program flow

ValuedBook is a class with two attributes, book and value, where value is the cosine similarity score assigned to that specific book. At the end, the recommended books are displayed sorted by recommendation score:

Calculating...

These are recommended books for Crime and punishment:
Book name: The House of the Dead (novel), Author: Fyodor Dostoyevsky, Genres: Philosophical fiction, Autobiographical novel,  Movement: Literary realism
Book name: The Idiot, Author: Fyodor Dostoyevsky, Genres: Philosophical fiction,  Movement: Literary realism
Book name: Demons (Dostoyevsky novel), Author: Fyodor Dostoyevsky, Genres: Philosophical fiction, Political fiction,  Movement: Literary realism
Book name: The Brothers Karamazov, Author: Fyodor Dostoyevsky, Genres: Philosophical fiction,  Movement: Literary realism
Book name: The Landlady (Fyodor Dostoyevsky), Author: Fyodor Dostoyevsky, Genres: Fantasy literature, Gothic fiction,  Movement: Literary realism
Book name: Notes from Underground, Author: Fyodor Dostoyevsky, Genres: Philosophy, Novella,  Movement: Literary realism
Book name: Poor Folk, Author: Fyodor Dostoyevsky, Genres: Epistolary novel,  Movement: Literary realism
Book name: The Gambler (novel), Author: Fyodor Dostoyevsky, Genres: Novel,  Movement: Literary realism
Book name: The Eternal Husband, Author: Fyodor Dostoyevsky, Genres: Novel,  Movement: Literary realism
Book name: Humiliated and Insulted, Author: Fyodor Dostoyevsky, Genres: Novel,  Movement: Literary realism
Book name: The Village of Stepanchikovo, Author: Fyodor Dostoyevsky, Genres: Satire,  Movement: Literary realism
Book name: Netochka Nezvanova (novel), Author: Fyodor Dostoyevsky, Genres: Novel,  Movement: Literary realism
Book name: Hunger (Hamsun novel), Author: Knut Hamsun, Genres: Philosophical fiction, Psychological novel, Philosophical fiction, Psychological novel,  Movement: Literary realism
Book name: Home of the Gentry, Author: Ivan Turgenev, Genres: Romance novel, Political fiction,  Movement: Literary realism
Book name: Rudin, Author: Ivan Turgenev, Genres: Romance novel, Politics,  Movement: Literary realism
Book name: On the Eve, Author: Ivan Turgenev, Genres: Romance novel, Political fiction,  Movement: Literary realism
Book name: The Novice (poem), Author: Mikhail Lermontov, Genres: Verse (poetry), Verse (poetry),  Movement: Literary realism
Book name: Torrents of Spring, Author: Ivan Turgenev, Genres: Fiction,  Movement: Literary realism
Book name: Smoke (novel), Author: Ivan Turgenev, Genres: Fiction,  Movement: Literary realism
Book name: Fathers and Sons (novel), Author: Ivan Turgenev, Genres: Romanticism,  Movement: Literary realism

Listing 10 - The output
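The final step described above (score every book, sort by score, keep the requested number of results) can be sketched as follows. This is an illustration under stated assumptions, not the project's code: `RankingSketch`, `ScoredTitle` and `topN` are hypothetical names, `ScoredTitle` is a simplified stand-in for ValuedBook that stores only a title, and leaving the input book out of the result is assumed from the output in Listing 10.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class RankingSketch {

    /** Hypothetical stand-in for ValuedBook: a book title plus its similarity score. */
    static class ScoredTitle {
        final String title;
        final double score;

        ScoredTitle(String title, double score) {
            this.title = title;
            this.score = score;
        }
    }

    /** Returns the titles of the n highest-scoring books, skipping the input book itself. */
    static List<String> topN(List<ScoredTitle> scored, String mainTitle, int n) {
        return scored.stream()
                .filter(s -> !s.title.equals(mainTitle))
                .sorted(Comparator.comparingDouble((ScoredTitle s) -> s.score).reversed())
                .limit(n)
                .map(s -> s.title)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Scores copied from Listing 7.
        List<ScoredTitle> scored = Arrays.asList(
                new ScoredTitle("The Village of Stepanchikovo", 0.8561026924522617),
                new ScoredTitle("Crime and Punishment", 1.0),
                new ScoredTitle("The Brothers Karamazov", 0.9689186192323702));

        System.out.println(topN(scored, "Crime and Punishment", 2));
        // prints [The Brothers Karamazov, The Village of Stepanchikovo]
    }
}
```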

##Acknowledgements

This application has been developed as a part of the project assignment for the subject Intelligent Systems at the Faculty of Organization Sciences, University of Belgrade, Serbia.

##Technical realisation

  • Java SE 1.8 - a widely used platform for development and deployment of portable code for desktop and server environments. Java SE uses the object-oriented Java programming language and is part of the Java software-platform family. It defines a wide range of general-purpose APIs, such as the Java APIs for the Java Class Library, and also includes the Java Language Specification and the Java Virtual Machine Specification.
  • Jena - a Java framework for building Semantic Web applications. It provides extensive Java libraries that help developers write code that handles RDF, RDFS, RDFa, OWL and SPARQL in line with published W3C recommendations. Jena includes a rule-based inference engine to perform reasoning based on OWL and RDFS ontologies, and a variety of storage strategies to store RDF triples in memory or on disk. [5]

##References

  1. R Mirizzi, T Di Noia, VC Ostuni, A Ragone - Politecnico di Bari, Linked Open Data for content-based recommender systems, 2012
  2. Carleton, http://cs.carleton.edu/cs_comps/0910/netflixprize/final_results/knn/index.html, accessed: 13/7/2016
  3. C Perone, Machine Learning :: Cosine Similarity for Vector Space Models, 12/09/2013, accessed: 13/7/2016
  4. D Kauchak, Pomona College, TF-IDF, 2009
  5. Apache Jena, https://jena.apache.org/documentation/query/, accessed: 13/7/2016
