MU Thesis 📄

This is Youwei's thesis at Monmouth University.

Jan 2020

Schedule 📆

Jan 2020 to Aug 2020

Achievements

System
Thesis

Topic

A Search Engine for Education and Learning

Abstract

Today, the most common and convenient way to look for information is to search the Internet. A handful of search engines dominate, such as Google (https://www.google.com), Bing (https://www.bing.com), and Baidu (https://www.baidu.com). They use web crawlers to scan the whole visible web and then return simple indexes and links that point users to the source of each resource. Most of these search engines perform well in speed and precise keyword matching, but general-purpose search engines have drawbacks, such as a massive number of potential results of uncertain quality. The goal of this project is to propose a new search engine, targeted at education, that provides “valuable” learning resources to users. What counts as valuable can be a subjective judgement and is limited by the resources available for searching. In this project, it means “user-trusted” or “user-liked”, based on common practices in education and learning. A few new techniques are proposed to support efficient storage and enhanced searching for valuable learning resources, in particular to address two main issues: (1) how to build a high-performance search engine; (2) how to define the value of resources. The details include strategies designed to optimize general information querying, storage, and ranking and, most importantly, to find the relevant learning resources that are “useful” for users. A prototype and an experimental study serve as a proof of concept for this research.

Keywords: Search Engine, Education, Learning Resource, Resource Discovery, Resource Value

Assets List

thesis
presentation slides
images for thesis
UML files for thesis
code (full project refer to myresource-vue and myresource-node)
other documents (not thesis)
test data

Storage-Query

The statistics below come from randomly generated, simulated data that was used to test the search engine system.

Dictionary

Dictionary 1: 33 keywords

Redis Remote Dictionary Server is an in-memory data structure project implementing a distributed The project is mainly developed by Salvatore Sanfilippo and as of 2019 is sponsored by Redis Labs It is open-source software released under a BSD 3-clause license

Test Resources Amount Cases

Measure keyword search time in each of the 3 layers as the amount of data increases.

33 keywords
4 keywords as test cases
randomly generated resources, with the number in the database growing by orders of magnitude: 10^0, 10^1, ..., 10^6
static test: no new rows are inserted while the queries run
fixed page (page 1), not a random page
ranking disabled

Note: one million is the largest data size tested; testing with more data would require opening the system to the public network.
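
The setup above can be approximated with a small, self-contained sketch. It is not the project's code: it builds random resources from a hypothetical word list, keeps them in a plain in-memory list instead of the three-layer storage, and times a naive keyword scan at each order of magnitude. The names `WORDS`, `make_resources`, `time_search`, and `run_benchmark`, the resource length of 8 words, and the cap at 10^4 resources are all assumptions chosen to keep the example quick to run.

```python
import random
import time

# Hypothetical stand-in for the 33-keyword dictionary (assumption: a few words only).
WORDS = ["a", "is", "2019", "open-source", "redis", "project", "server", "data"]

def make_resources(n, words=WORDS, length=8, seed=0):
    """Generate n random resources, each a list of `length` words."""
    rng = random.Random(seed)
    return [[rng.choice(words) for _ in range(length)] for _ in range(n)]

def time_search(resources, keyword):
    """Return (match count, elapsed seconds) for a naive linear scan."""
    start = time.perf_counter()
    count = sum(1 for r in resources if keyword in r)
    return count, time.perf_counter() - start

def run_benchmark(max_exp=4, keywords=("a", "is", "2019", "open-source")):
    """Time each test keyword at 10^0 .. 10^max_exp resources."""
    rows = []
    for exp in range(max_exp + 1):
        resources = make_resources(10 ** exp)
        for kw in keywords:
            count, elapsed = time_search(resources, kw)
            rows.append((10 ** exp, kw, count, elapsed))
    return rows

if __name__ == "__main__":
    for total, kw, count, elapsed in run_benchmark():
        print(f"{total}\t{kw}\t{count}\t{elapsed:.6f}")
```

A linear scan like this only illustrates the shape of the experiment (fixed keywords, growing data); it says nothing about how the real layers store or look up resources.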

id	total	keyword	count1	count2	count3	layer1	layer2	layer3
1	1	a	1	1	1	0.009	0.007	0.002
2	1	is	1	1	1	0.003	0.002	0.001
3	1	2019	1	1	1	0.005	0.005	0.004
4	1	open-source	1	1	1	0.003	0.003	0.001
1	10	a	9	8	8	0.004	0.004	0.003
2	10	is	9	9	9	0.005	0.005	0.003
3	10	2019	6	6	6	0.005	0.005	0.003
4	10	open-source	4	4	4	0.002	0.004	0.002
1	100	a	101	57	57	0.003	0.002	0.001
2	100	is	93	83	83	0.003	0.002	0.002
3	100	2019	37	37	37	0.001	0.002	0.001
4	100	open-source	34	34	34	0.005	0.003	0.001
1	1000	a	966	575	575	0.002	0.007	0.006
2	1000	is	897	787	787	0.005	0.004	0.005
3	1000	2019	378	378	378	0.003	0.003	0.001
4	1000	open-source	349	349	349	0.004	0.004	0.002
1	10000	a	9551	5747	5747	0.024	0.018	0.017
2	10000	is	8803	7741	7741	0.022	0.021	0.021
3	10000	2019	3735	3735	3735	0.019	0.009	0.009
4	10000	open-source	3740	3740	3740	0.02	0.007	0.005
1	100000	a	95395	58521	58521	0.137	0.012	0.117
2	100000	is	88271	77922	77922	0.217	0.013	0.156
3	100000	2019	37902	37902	37902	0.156	0.008	0.069
4	100000	open-source	37819	37819	37819	0.189	0.008	0.09
1	1000000	a	953614	585595	585595	1.965	0.095	0.873
2	1000000	is	881562	777936	777936	2.001	0.132	1.212
3	1000000	2019	379714	379709	379709	1.963	0.072	0.636
4	1000000	open-source	379255	379246	379246	2.433	0.109	0.623

Test Page Cases

Table 2 was produced after changing some environment variables, so its timings differ from those in the first table.

33 keywords
1 million resources in database
random pages: 1, 100, 10000, 50000, end
2 keywords as test cases
ranking disabled

id	keyword	page	layer1	layer2	layer3
1	a	1	1.777	0.111	0.828
2	a	100	1.675	0.084	0.848
3	a	10000	2.528	0.261	0.925
4	a	50000	2.991	1.6	0.884
5	a	end	3.197	2.585	0.775
1	implementing	1	2.34	0.098	0.606
2	implementing	100	2.4	0.078	0.578
3	implementing	10000	2.945	0.245	0.59
4	implementing	50000	2.678	1.973	0.597
5	implementing	end	3.731	1.46	0.522
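
As a rough illustration of the page cases above, the following sketch maps a page label (a number, or "end" for the last page) to a row offset. The page size of 10 is an assumption, since the page size is not stated here, and `page_count` and `page_offset` are hypothetical helpers rather than functions from the project. The total of 953614 is the layer 1 count for keyword "a" in the first table and is used only as an example value.

```python
import math

PAGE_SIZE = 10  # assumption: the source does not state the page size

def page_count(total_results, page_size=PAGE_SIZE):
    """Number of pages needed to show total_results (at least 1)."""
    return max(1, math.ceil(total_results / page_size))

def page_offset(page, total_results, page_size=PAGE_SIZE):
    """Map a page label (an int, or "end" for the last page) to a row offset."""
    last = page_count(total_results, page_size)
    number = last if page == "end" else int(page)
    if not 1 <= number <= last:
        raise ValueError(f"page {page!r} is outside 1..{last}")
    return (number - 1) * page_size

if __name__ == "__main__":
    total = 953614  # example total taken from the first table (keyword "a")
    for page in (1, 100, 10000, 50000, "end"):
        print(page, page_offset(page, total))
```

Whether the real system pages by offset at all is not confirmed by the text; the sketch only makes the five test pages concrete.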

Test Cases under high server pressure

Table 3 shows query times while the server is under heavy data-operation load (inserts and updates).

1 million rows are being inserted or updated, putting the server under heavy load at 100-200 asynchronous requests per second
1 million resources already exist in the database
page 100
2 keywords as test cases
ranking disabled

id	keyword	operation	layer1	layer2	layer3
1	a	insert	1.449	1.615	0.872
2	implementing	insert	2.09	1.193	0.702
1	a	update	207.016	338.566	89.327
2	implementing	update	timeout	timeout	timeout
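
The shape of such a test can be sketched with `asyncio`: a background task fires writes at a fixed rate while a query runs with a time limit, and a query that exceeds the limit is reported as "timeout". Everything here is an assumption for illustration: `fake_update` and `fake_query` are hypothetical stand-ins that only sleep, the rate of 150 per second is one value from the 100-200 range, and the timeout of 0.1 is arbitrary. Because the stand-ins share no database, the sketch does not reproduce contention; it only shows the harness and the timeout handling.

```python
import asyncio
import random

async def fake_update(rng):
    """Hypothetical write operation; the sleep stands in for database work."""
    await asyncio.sleep(rng.uniform(0.001, 0.005))

async def fake_query(delay):
    """Hypothetical search query that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return "ok"

async def background_load(rate_per_second, duration, rng):
    """Fire roughly rate_per_second concurrent writes for `duration` seconds."""
    tasks = []
    interval = 1.0 / rate_per_second
    loop = asyncio.get_running_loop()
    end = loop.time() + duration
    while loop.time() < end:
        tasks.append(asyncio.create_task(fake_update(rng)))
        await asyncio.sleep(interval)
    await asyncio.gather(*tasks)
    return len(tasks)

async def timed_query(delay, timeout):
    """Run one query; report "timeout" if it exceeds the limit."""
    try:
        return await asyncio.wait_for(fake_query(delay), timeout)
    except asyncio.TimeoutError:
        return "timeout"

async def main():
    rng = random.Random(0)
    load = asyncio.create_task(background_load(150, 0.2, rng))
    fast = await timed_query(delay=0.01, timeout=0.1)
    slow = await timed_query(delay=0.5, timeout=0.1)
    sent = await load
    return fast, slow, sent

if __name__ == "__main__":
    print(asyncio.run(main()))
```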

Test Cases with Ranking

1 million resources already exist in the database
rank = true, false
page 10000
2 keywords as test cases

id	keyword	rank	layer1	layer2	layer3
1	a	false	2.528	0.261	0.925
1	a	true	2.999	3.005	1.684
2	implementing	false	2.945	0.245	0.59
2	implementing	true	2.522	2.532	1.095
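
To make the cost of ranking easier to read off, the short sketch below divides each ranked time by the matching unranked time for the same keyword and layer. The timings are copied from the table above; the names `TIMINGS` and `rank_overhead` are introduced here for illustration only.

```python
# Timings copied from the ranking table above (page 10000, 1 million resources).
# Each tuple is (layer1, layer2, layer3); the dict key is the value of `rank`.
TIMINGS = {
    "a": {False: (2.528, 0.261, 0.925), True: (2.999, 3.005, 1.684)},
    "implementing": {False: (2.945, 0.245, 0.59), True: (2.522, 2.532, 1.095)},
}

def rank_overhead(timings):
    """Return, per keyword, the ranked/unranked time ratio for each layer."""
    return {
        kw: tuple(round(r / u, 2) for r, u in zip(by_rank[True], by_rank[False]))
        for kw, by_rank in timings.items()
    }

if __name__ == "__main__":
    for kw, ratios in rank_overhead(TIMINGS).items():
        print(kw, ratios)
```

With these figures, the layer 2 time grows by more than a factor of ten for both keywords when ranking is enabled, while the layer 1 time changes comparatively little.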

Charts

Examples

Conclusion

Although there are many kinds of search engines that are now an important part of people's daily lives, there is still room to change and improve. This project aimed to create a search engine that is unique and more useful in certain areas. This thesis proposes a novel search engine, specifically for the field of learning and education. It aims to help users to search for learning resources, and not just any learning resources, but those that are “good”, that is, valuable to support individual users’ learning goals.

This search engine has the following advantages: (1) it ensures the quality of search results, (2) it allows users to focus on their learning goals, and (3) it eases the development and operation of the search engine system.

This study focused on two main concerns in the area of search engine research and development: (1) search speed and (2) result ranking. This search engine adopts a three-layer storage structure to improve search speed. It uses the LRV ranking algorithm to usefully rank search results. A prototype search engine was successfully built and tested for this research.

Future Work

During the research on designing the LRV algorithm, two issues remained unresolved. First, a "query timeout" occurs when the database is updated under high pressure. Second, the "Reliability" characteristic of a resource has two weights, one for certified personal accounts and one for organizational accounts, which cannot be accurately measured for the time being; more experimental data may help in determining these weights.

Future research will continue to focus on two basic issues: storage-query and ranking resources based on LRV. To improve the paging strategy in the three-layer storage structure, large pages may be further divided into smaller pages. The "query timeout" issue that occurs while resources are being updated might be solved by adding elastic cloud servers, such as ECS (Elastic Compute Service) instances or Amazon EC2. The Reliability characteristic weight values could be made more reasonable through more experiments or by using real user feedback.

As a software application, the system must pass the actual test of running live on the Internet. If more students, teachers, and educational institutions use this system, more resources will be collected and recognized. A practical search engine system for learning and educational resources can benefit many people in this massive and ever-changing knowledge world.

Contributors

Youwei Huang

Cui Yu

Copyright

Monmouth University

devilyouwei / MU-thesis