Coffee4MePlz / PCA_ElectTransp

Principal Component Analysis (PCA) for Electron Transport Data

This is an example application of principal component analysis (PCA) to electron transport data. I developed it under the supervision of Prof. Luis Rego at my home university, the Universidade Federal de Santa Catarina (UFSC), during my undergraduate research on data treatment methods for molecular dynamics. This is only a small piece of our work; if you are interested in what we are currently researching, please visit our lab page: Dynemol

The Data source and PCA usage

Our data was mainly produced at The Department of Physics, Chemistry, Astronomy of the U. Delaware and U. Rutgers, together with our laboratory, by simulating the interaction of Double-Linker Sensitizers with a surface. We were interested in how these charged structures transfer electrons (charge) when they are connected to a surface (the results are discussed in detail in the following paper). We ran a PCA on the data to identify which variables were statistically the most important.

Strictly speaking, that is not what PCA does. It does not find the most relevant single variable; it finds the most relevant combinations of variables in the data set.

In this case, we were particularly interested in which variables determine the speed at which electrons flow to the interface between the molecule and the surface. We only measured variables that depend on the geometry of the molecule: angles, distances, and torsions.

The data is in the file "IET.LUMO.dhd.LL.HH.BL.dist.dat". These variables are shown in the following image:
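Loading the file could look like the sketch below. The helper `read_iet_data` is hypothetical, and the sketch assumes (it is not stated here) that the file is a whitespace-separated table whose seven columns follow the order given in the file name, with no header row unless `header = TRUE` is passed.

```r
# Hypothetical helper: read the data file and label its columns.
# Assumption: seven whitespace-separated columns, ordered as in the file name.
read_iet_data <- function(path, header = FALSE) {
  vars <- c("IET", "LUMO", "dhd", "LL", "HH", "BL", "dist")
  dat <- read.table(path, header = header)
  stopifnot(ncol(dat) == length(vars))  # fail early if the layout differs
  names(dat) <- vars
  dat
}

# Usage (requires the data file in the working directory):
# dat <- read_iet_data("IET.LUMO.dhd.LL.HH.BL.dist.dat")
```

If the file turns out to have a different layout, the `stopifnot` check fails instead of silently mislabeling columns.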

Variables for MM

This image is available in the cited paper.

The Method

PCA is mainly a dimension reduction method for raw data. It is mostly used when we wish to know the most important directions in a data set, i.e. which directions in the space spanned by the columns of the data matrix preserve most of the information (variance) in the data. In linear algebra terms, PCA gives you the eigenvectors of the covariance (or correlation) matrix associated with the largest eigenvalues. Since PCA is applied to a wide variety of data, the data matrix need not be square. Even then, the PCA is guaranteed to exist, since every matrix (square or not) has a singular value decomposition:

$$X = U \Sigma W^{T}$$

where $X$ is the data matrix, $U$ and $W$ are matrices with orthonormal columns, and $\Sigma$ is a diagonal matrix holding the singular values of $X$, whose squares are the eigenvalues of $X^{T} X$.
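As a quick check of the singular value decomposition on a non-square matrix, here is a minimal sketch using base R's `svd()`. The matrix is synthetic random data (20 rows, 7 columns), not the data set of this repository, and the variable names are chosen only for illustration.

```r
# Synthetic, non-square data matrix (20 rows, 7 columns); not the real data.
set.seed(1)
X <- scale(matrix(rnorm(20 * 7), nrow = 20, ncol = 7))

s <- svd(X)          # thin SVD: X = U diag(d) W^T
U <- s$u             # 20 x 7, orthonormal columns
W <- s$v             # 7 x 7, orthonormal columns
Sigma <- diag(s$d)   # 7 x 7, singular values on the diagonal

# Rebuild X from its three factors and measure the largest deviation.
X_rebuilt <- U %*% Sigma %*% t(W)
max_err <- max(abs(X - X_rebuilt))

# The squared singular values equal the eigenvalues of X^T X.
ev <- eigen(crossprod(X), symmetric = TRUE)$values
print(cbind(singular_sq = s$d^2, eigenvalue = ev))
```

The two printed columns should agree up to floating-point error, which is the link between the SVD and the eigenvalue view of PCA.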

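The correlation matrix can be computed from the standardized data matrix $X$ (each column centered and scaled to unit standard deviation, with $n$ rows) as $C = \frac{1}{n-1} X^{T} X$. For this data set it is a 7x7 matrix, but we'll only show the first 3x3 block: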

|      | IET         | LUMO        | dhd         |
|------|-------------|-------------|-------------|
| IET  | 1.00000000  | -0.45620206 | 0.14029317  |
| LUMO | -0.45620206 | 1.00000000  | -0.43759060 |
| dhd  | 0.14029317  | -0.43759060 | 1.00000000  |
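The relation between the standardized data matrix and the correlation matrix can be checked with a short sketch. The data below is synthetic (three random columns that merely borrow the names IET, LUMO, and dhd), so the numbers will not match the table above.

```r
# Synthetic data with three named columns; values are random, not measured.
set.seed(2)
raw <- matrix(rnorm(50 * 3), ncol = 3,
              dimnames = list(NULL, c("IET", "LUMO", "dhd")))

X <- scale(raw)                     # center each column, scale to unit sd
n <- nrow(X)

C_manual  <- crossprod(X) / (n - 1) # X^T X / (n - 1)
C_builtin <- cor(raw)               # base R correlation matrix

print(round(C_manual, 4))
```

Both routes give the same symmetric matrix with ones on the diagonal.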

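The principal component scores can then be calculated by direct matrix multiplication, $T = XW$, where the columns of $W$ are the eigenvectors of the correlation matrix (the principal directions). Once computed, we get the importance of each component: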

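|                        | PC1    | PC2    | PC3    | PC4    | PC5     | PC6     | PC7     |
|------------------------|--------|--------|--------|--------|---------|---------|---------|
| Standard deviation     | 1.6842 | 1.1103 | 1.0153 | 0.9509 | 0.69169 | 0.64736 | 0.31355 |
| Proportion of Variance | 0.4052 | 0.1761 | 0.1472 | 0.1292 | 0.06835 | 0.05987 | 0.01405 |
| Cumulative Proportion  | 0.4052 | 0.5813 | 0.7286 | 0.8577 | 0.92609 | 0.98595 | 1.00000 |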
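A table of this shape is what `summary()` of a `prcomp` object prints. The sketch below shows how the three rows relate to each other, assuming the PCA was run on standardized variables (consistent with the use of the correlation matrix). It uses synthetic random data with the seven variable names, so its numbers differ from the table above.

```r
# Synthetic stand-in for the seven variables; values are random.
set.seed(3)
vars <- c("IET", "LUMO", "dhd", "LL", "HH", "BL", "dist")
dat <- as.data.frame(matrix(rnorm(100 * 7), ncol = 7,
                            dimnames = list(NULL, vars)))

# PCA on centered, unit-variance columns (equivalent to the correlation matrix).
pca <- prcomp(dat, center = TRUE, scale. = TRUE)

# Proportion of variance is the squared standard deviation of each component
# divided by the total; the cumulative proportion is its running sum.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_prop <- cumsum(prop_var)

# Scores by direct matrix multiplication: T = X W.
scores_manual <- scale(dat) %*% pca$rotation

print(summary(pca)$importance)
```

With seven standardized variables the squared standard deviations sum to 7, which is why the first row of the table alone determines the other two.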

Results

The following is a biplot: a plot of the data projected onto the first two principal components, with the original variables drawn as vectors in the same plane. The data is separated into two groups, a fast one and a slow one, with the boundary being the median of the time for the IET (interfacial electron transfer) to reach 0.3. Physically, this means that most of the charge (about 70%) has already been transferred to the surface. We therefore divided the data as: fast < 32.5 (median of IET) < slow
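The median split described above can be sketched in a few lines. The transfer times here are synthetic (drawn from an exponential distribution purely for illustration), so the cutoff will not be 32.5; the names `iet_time`, `cutoff`, and `speed` are assumptions of this sketch.

```r
# Synthetic transfer times; the real values come from the simulations.
set.seed(4)
iet_time <- rexp(40, rate = 1 / 30)

# The median is the boundary between the two groups.
cutoff <- median(iet_time)
speed <- factor(ifelse(iet_time < cutoff, "fast", "slow"),
                levels = c("fast", "slow"))

counts <- table(speed)
print(counts)
```

Splitting at the median gives two groups of equal size, which keeps the comparison between fast and slow samples balanced.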

Biplot of components 1 and 2

Looking directly at the biplot, one can see that IET and dhd mostly point along the 1st PC, while LL and LUMO point against it. Similarly, IET and BL point along PC2, while HH and dist point against it. Our data is also relatively well separated by the time criterion, since the ellipses overlap only slightly (remember that each ellipse is centered on its group's "center of mass"). We can then conclude that IET, HH, LUMO, and LL are the main players in separating the data via the time criterion. If we were to develop a machine learning algorithm for optimization here, we would want it to explore these parameters first, so that it reaches the best results faster.
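The quantities read off a biplot can also be inspected numerically: the loadings of each variable on PC1 and PC2, and the "center of mass" of each group in that plane. The sketch below uses synthetic random data and base R only; it is not necessarily how the figure above was produced, and the grouping ellipses are not drawn.

```r
# Synthetic stand-in for the seven variables; values are random.
set.seed(5)
vars <- c("IET", "LUMO", "dhd", "LL", "HH", "BL", "dist")
dat <- as.data.frame(matrix(rnorm(60 * 7), ncol = 7,
                            dimnames = list(NULL, vars)))

pca <- prcomp(dat, center = TRUE, scale. = TRUE)

# Fast/slow labels from the median of the IET column.
speed <- ifelse(dat$IET < median(dat$IET), "fast", "slow")

# Loadings: how each variable points along (positive) or against (negative)
# the first two principal components.
loadings12 <- pca$rotation[, 1:2]

# Group centroids in the PC1-PC2 plane.
centroids <- aggregate(pca$x[, 1:2], by = list(speed = speed), FUN = mean)

print(round(loadings12, 3))
print(centroids)

# Base R biplot of the first two components (drawn only in interactive use).
if (interactive()) biplot(pca, choices = 1:2)
```

The signs in `loadings12` correspond to the directions of the variable arrows, and the distance between the two centroids gives a rough sense of how well the time criterion separates the groups.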


Languages

Language:R 100.0%