Domain Shift Visualization

Detect and visualize the domain shifts that occur around the SIGIR conference.

Algorithm Explanation

The most recent description of the algorithm flow is in report.pptx. A written version follows:

The goal of this algorithm is to compute a 2D embedding of the paper points around the SIGIR conference. To accomplish this, we construct a citation graph and use largeVis to embed that graph in two-dimensional space.

Treat each paper in Aminer as a point. Two papers are connected by a directed edge if one cites the other.
Run BFS on the graph starting from all papers that belong to SIGIR, which yields a subset of all paper points. Denote this set of paper nodes as $V_{papers}$. We then collect every conference that $V_{papers}$ touches: $S_{conf} = \{ conf \mid v\in V_{papers}\ \wedge\ v\ belongs\ to\ conf\}$.
This may pull in too many conferences around SIGIR: for example, if only one paper in $V_{papers}$ belongs to Nature, Nature is still included in $S_{conf}$. So we filter $S_{conf}$ further by introducing an importance score for each conference, computed as $score_c = \frac{\#points\ in\ V_{papers}\ that\ belong\ to\ c}{\#total\ points\ that\ belong\ to\ c}$. The intuition is simple: if BFS reaches only a small fraction of a conference's papers, we consider that conference irrelevant to SIGIR. We keep only the conferences whose score exceeds a threshold, $S_{conf, smaller}=\{c \mid c\in S_{conf}\wedge score_c>threshold\}$. (One possible future improvement is to also filter out conferences that contain too few papers.)
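The BFS and the score filter above can be sketched in Python as follows. This is a minimal illustration, not the repository's C++ implementation: the function names, the dictionary inputs (`citations`, `conf_of`), the choice to follow outgoing citations, and the default depth of three layers are all assumptions made for the example.

```python
from collections import Counter, deque


def bfs_papers(citations, seeds, max_depth=3):
    """Collect V_papers: every paper reachable from `seeds` within `max_depth` hops.

    `citations` maps a paper id to the ids it cites (assumed edge direction).
    """
    visited = set(seeds)
    frontier = deque((paper, 0) for paper in seeds)
    while frontier:
        paper, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the last layer
        for cited in citations.get(paper, ()):
            if cited not in visited:
                visited.add(cited)
                frontier.append((cited, depth + 1))
    return visited


def conference_scores(v_papers, conf_of):
    """score_c = (#papers of c inside V_papers) / (#papers of c overall)."""
    touched = Counter(conf_of[p] for p in v_papers if p in conf_of)
    total = Counter(conf_of.values())
    return {conf: touched[conf] / total[conf] for conf in touched}


def filter_conferences(scores, threshold):
    """Keep S_conf,smaller: conferences whose score exceeds the threshold."""
    return {conf for conf, score in scores.items() if score > threshold}
```

For instance, if BFS reaches one of four Nature papers, Nature gets a score of 0.25 and is dropped for any threshold of 0.25 or higher, while a conference with all of its papers in $V_{papers}$ scores 1.0 and is kept.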

Take all paper points that belong to one of the conferences in $S_{conf, smaller}$, that is, $V_{extract} = \{v \mid v\ belongs\ to\ c\wedge c \in S_{conf, smaller}\}$.
Also, treat each conference in $S_{conf, smaller}$ as a node and connect each conference node to every paper that belongs to it. Denote the set of conference nodes as $V_{conf}$; the final node set is then $V = V_{extract}\cup V_{conf}$.
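The construction of $V$ and its edges can be sketched as below. This is a hedged illustration only: `build_graph`, the `"conf:"` prefix used to name conference nodes, and the decision to keep only citation edges whose two endpoints are both in $V_{extract}$ are assumptions for the example, not details confirmed by the repository.

```python
def build_graph(conf_of, citations, kept_confs):
    """Return (V_extract, V_conf, edges) for the graph that gets embedded.

    conf_of    : paper id -> conference name
    citations  : paper id -> list of cited paper ids
    kept_confs : the filtered conference set S_conf,smaller
    """
    v_extract = {p for p, conf in conf_of.items() if conf in kept_confs}
    v_conf = {"conf:" + conf for conf in kept_confs}  # hypothetical node naming

    edges = []
    for paper in sorted(v_extract):
        # citation edges, kept only when both endpoints are in V_extract
        for cited in citations.get(paper, ()):
            if cited in v_extract:
                edges.append((paper, cited))
        # one edge from each paper to its conference node
        edges.append((paper, "conf:" + conf_of[paper]))
    return v_extract, v_conf, edges
```

Because every paper in $V_{extract}$ gets an edge to its conference node, no paper node is left without an edge in this sketch.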

At the visualization stage, we plot only the paper nodes, not the conference nodes. The conference nodes serve to pull papers from the same conference closer together and to eliminate points that have no edges connected to them.

Code Structure and Running

Most of the code lives in largeScaleGraph/cpp. The logic is split across several executables so that the whole pipeline does not have to be re-run from the beginning.

# enter the code directory
cd largeScaleGraph/cpp

# optional: load a gcc module that supports C++17
module load gcc/7.2.0

# before compiling, edit the input and output directories in each cpp file
# compile each BFS layer generator
g++ -std=c++17 generate_first.cpp -o generate_first -lstdc++fs -pthread
g++ -std=c++17 generate_second.cpp -o generate_second -lstdc++fs -pthread
g++ -std=c++17 generate_third.cpp -o generate_third -lstdc++fs -pthread


# run the three-layer BFS
./generate_first
./generate_second
./generate_third

# each layer will generate an intermediate representation as
# paper_id \t conf_name \t year \t citation_id_1 \space citation_id_2 \space ....

The commands above create three BFS layer files, which are used in the filtering step below.
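For reference, one line of the intermediate representation (paper id, conference name, year, then space-separated citation ids, with tabs between the fields) could be parsed as in the sketch below. `PaperRecord` and `parse_layer_line` are hypothetical names, and treating the year as an integer is an assumption about the data.

```python
from typing import List, NamedTuple


class PaperRecord(NamedTuple):
    paper_id: str
    conf_name: str
    year: int
    citations: List[str]


def parse_layer_line(line):
    """Parse 'paper_id<TAB>conf_name<TAB>year<TAB>cit_1 cit_2 ...' into a record."""
    fields = line.rstrip("\n").split("\t")
    paper_id, conf_name, year = fields[0], fields[1], fields[2]
    # the citation field may be empty or missing for papers without references
    cited = fields[3].split() if len(fields) > 3 else []
    return PaperRecord(paper_id, conf_name, int(year), cited)
```

Calling `parse_layer_line("42\tSIGIR\t2005\t7 9 13\n")` yields a record with `paper_id` `"42"`, `conf_name` `"SIGIR"`, `year` `2005`, and three citation ids.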

# before compiling, edit the input and output directories in each cpp file
# compile the first filter, which creates a file containing only the lines whose
# conference appears in the three (or two) BFS layer files
# the second filter computes the importance score and keeps only the conferences with a high enough score
g++ -std=c++17 filter_first.cpp -o filter_first -lstdc++fs -pthread
g++ -std=c++17 filter_second.cpp -o filter_second -lstdc++fs -pthread

# the first filter takes a very long time to run
./filter_first
./filter_second

# after this, the output of filter_second contains only the lines that belong to
# conferences with a high importance score

# compile the largeVis input file generator
g++ -std=c++17  generate_final_input.cpp -o generate_final_input -lstdc++fs -pthread

# generate the final input to largeVis
./generate_final_input

After this, run largeVis on the output file of the input generator. Each point now has a corresponding 2D embedding. Next, we need to remap each index to its label and split the points by time. To do this, rerun generate_final_input to find the time distribution of the indices and whether each index is a SIGIR paper, so that the points can be colored accordingly.

# remember to set the correct input and output directories; the output of the split is a folder (default: final_visualization)
./generate_final_input

# assuming the output directory is final_visualization
cd final_visualization
sh workflow.sh

# at this point, three images will be generated
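The remapping and time split described above can be sketched as follows. This is an illustration under stated assumptions rather than the repository's code: it assumes the embedding file has a header line followed by one `index x y` line per point, that `meta` maps each paper index to a `(year, conference)` pair, and that the period boundaries are chosen by the user. The names `parse_embedding` and `split_by_period` are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def parse_embedding(text):
    """Read 'index x y' lines into {index: (x, y)}, skipping the header line."""
    coords = {}
    for line in text.strip().splitlines()[1:]:
        idx, x, y = line.split()
        coords[int(idx)] = (float(x), float(y))
    return coords


def split_by_period(coords, meta, boundaries):
    """Group paper points by time period and flag SIGIR papers for coloring.

    boundaries : sorted years; each one starts a new period.
    Indices missing from `meta` (for example conference nodes) are skipped,
    so only paper nodes are kept for plotting.
    """
    periods = defaultdict(list)
    for idx, (x, y) in coords.items():
        if idx not in meta:
            continue
        year, conf = meta[idx]
        period = bisect_right(boundaries, year)
        periods[period].append((x, y, conf == "SIGIR"))
    return dict(periods)
```

With boundaries `[2000, 2010]`, a paper from 1995 falls into period 0, one from 2005 into period 1, and one from 2010 into period 2.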

Embedding Visualization Result

This is a set of visualization results. I have not put much effort into tuning the parameters or into visualizing the author points and conference points; both would be good future work.
