Present visualizations of which cryptocurrencies are on the trading market and how they could be grouped, toward creating a classification for developing a new investment product.
Prepare the data for dimensionality reduction with PCA and clustering with K-means. Reduce the data dimensions using the PCA algorithm from sklearn. Predict clusters from the cryptocurrency data using the K-means algorithm from sklearn. Create some plots and data tables to present your results.
Started by loading the data in a Pandas DataFrame named “crypto_df.”
Looked at the first five rows of the DataFrame.
Got the number of rows and columns and the data type of each column.
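The loading and inspection steps above can be sketched roughly as follows. This is a minimal illustration, not the exact notebook code: the three-row CSV text is a made-up stand-in for crypto_data.csv, and using the first CSV column as the index is an assumption.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for crypto_data.csv; all values are made up.
sample_csv = io.StringIO(
    ",CoinName,Algorithm,IsTrading,ProofType,TotalCoinsMined,TotalCoinSupply\n"
    "AAA,Coin A,Scrypt,True,PoW,100.0,1000\n"
    "BBB,Coin B,SHA-256,True,PoS,,5000\n"
    "CCC,Coin C,X11,False,PoW,50.0,500\n"
)

# Assumption: the first CSV column holds the ticker and becomes the index.
# With the real file this would read: pd.read_csv("crypto_data.csv", index_col=0)
crypto_df = pd.read_csv(sample_csv, index_col=0)

print(crypto_df.head())   # first five rows (only three exist in this sample)
print(crypto_df.shape)    # (number of rows, number of columns)
print(crypto_df.dtypes)   # data type of each column
```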
Continued with the following data preprocessing tasks:
• Removed all cryptocurrencies that aren’t trading.
• Removed all cryptocurrencies that don’t have an algorithm defined.
• Removed the IsTrading column.
• Removed all cryptocurrencies with at least one null value. Checked the DataFrame to ensure
the number of rows was reduced, as an indication that all rows with nulls were removed:
• Removed all cryptocurrencies without coins mined.
• Stored the names of all cryptocurrencies in a DataFrame named coins_name,
and used the crypto_df.index as the index for this new DataFrame.
• Removed the CoinName column.
• Used get_dummies to create dummy variables for all of the text features,
and stored the resulting data in a DataFrame named X. Snippet of output after creating dummies.
• Used the StandardScaler from sklearn
to standardize all of the data from the X DataFrame.
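The preprocessing steps above might look something like the sketch below. The small DataFrame is a hypothetical stand-in for the real data, and treating a null Algorithm as "not defined" is an assumption (the real file may mark it differently). The name X_scaled is also introduced here only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the loaded crypto_df; all values are made up.
crypto_df = pd.DataFrame(
    {
        "CoinName": ["Coin A", "Coin B", "Coin C", "Coin D", "Coin E", "Coin F", "Coin G"],
        "Algorithm": ["Scrypt", "SHA-256", "X11", None, "Scrypt", "SHA-256", "X11"],
        "IsTrading": [True, True, False, True, True, True, True],
        "ProofType": ["PoW", "PoS", "PoW", "PoW", "PoW/PoS", "PoW", "PoS"],
        "TotalCoinsMined": [100.0, 2500.0, 50.0, 10.0, 0.0, None, 75.0],
        "TotalCoinSupply": [1000.0, 5000.0, 500.0, 100.0, 200.0, 300.0, 750.0],
    },
    index=["AAA", "BBB", "CCC", "DDD", "EEE", "FFF", "GGG"],
)
rows_before = len(crypto_df)

crypto_df = crypto_df[crypto_df["IsTrading"]]            # keep only coins that are trading
crypto_df = crypto_df[crypto_df["Algorithm"].notna()]    # assumption: undefined algorithm == null
crypto_df = crypto_df.drop(columns="IsTrading")          # column is no longer informative
crypto_df = crypto_df.dropna()                           # drop rows with any null value
crypto_df = crypto_df[crypto_df["TotalCoinsMined"] > 0]  # drop coins with nothing mined
print(f"rows: {rows_before} -> {len(crypto_df)}")

# Keep the names aside (same index), then drop them from the feature data.
coins_name = crypto_df[["CoinName"]].copy()
crypto_df = crypto_df.drop(columns="CoinName")

# One-hot encode the text features, then standardize every column.
X = pd.get_dummies(crypto_df, columns=["Algorithm", "ProofType"], dtype=float)
X_scaled = StandardScaler().fit_transform(X)
```

Keeping coins_name on the same index as crypto_df is what allows the names to be joined back after clustering.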
Used the PCA algorithm from sklearn to reduce the dimensions of the X DataFrame down to three principal components. Created a DataFrame named “pcs_df” that includes the following columns: PC 1, PC 2, and PC 3. Used the crypto_df.index as the index for this new DataFrame.
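As a rough sketch of this PCA step, the snippet below reduces a feature matrix to three components and wraps the result in pcs_df. The random matrix and the COIN0..COIN19 index are hypothetical stand-ins for the scaled X data and crypto_df.index.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical stand-ins: 20 coins with 6 already-standardized features.
rng = np.random.default_rng(0)
coin_index = [f"COIN{i}" for i in range(20)]   # plays the role of crypto_df.index
X_scaled = rng.normal(size=(20, 6))

# Reduce the feature space to three principal components.
pca = PCA(n_components=3)
principal_components = pca.fit_transform(X_scaled)

pcs_df = pd.DataFrame(
    principal_components,
    columns=["PC 1", "PC 2", "PC 3"],
    index=coin_index,
)

# Share of the total variance retained by each of the three components.
print(pca.explained_variance_ratio_)
print(pcs_df.head())
```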
Used the KMeans algorithm from sklearn to cluster the cryptocurrencies using the PCA data.
• Created an elbow curve to find the best value for K, and used
the pcs_df DataFrame to find how many clusters to use in the KMeans algorithm.
• With the defined best value for K, ran the K-means algorithm to predict the K clusters
for the cryptocurrencies’ data. Used the pcs_df to run the K-means algorithm.
• Created a new DataFrame named “clustered_df” that includes the following columns:
Algorithm, ProofType, TotalCoinsMined, TotalCoinSupply, PC 1, PC 2, PC 3, CoinName, and Class,
while maintaining the index of the crypto_df DataFrame, as shown below:
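The clustering steps above can be sketched along these lines. All input data here is synthetic (make_blobs is used only to get points with visible groups), and best_k = 4 is a placeholder for whatever value the elbow curve suggests on the real data, not a reported result.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical stand-ins for pcs_df, crypto_df and coins_name (shared index).
n = 40
coin_index = [f"COIN{i}" for i in range(n)]
points, _ = make_blobs(n_samples=n, centers=4, n_features=3, random_state=0)
pcs_df = pd.DataFrame(points, columns=["PC 1", "PC 2", "PC 3"], index=coin_index)
crypto_df = pd.DataFrame(
    {
        "Algorithm": ["Scrypt", "SHA-256"] * (n // 2),
        "ProofType": ["PoW", "PoS"] * (n // 2),
        "TotalCoinsMined": [float(i + 1) for i in range(n)],
        "TotalCoinSupply": [float(10 * (i + 1)) for i in range(n)],
    },
    index=coin_index,
)
coins_name = pd.DataFrame({"CoinName": [f"Coin {i}" for i in range(n)]}, index=coin_index)

# Elbow curve data: inertia (within-cluster sum of squares) for k = 1..10.
k_values = list(range(1, 11))
inertia = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(pcs_df).inertia_
    for k in k_values
]
elbow_df = pd.DataFrame({"k": k_values, "inertia": inertia})
print(elbow_df)

# Placeholder: read the real value off the elbow curve.
best_k = 4
model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(pcs_df)

# Combine features, principal components and names, then attach the labels.
clustered_df = pd.concat([crypto_df, pcs_df, coins_name], axis=1)
clustered_df["Class"] = model.labels_
print(clustered_df.head())
```

Plotting elbow_df with k on the x-axis and inertia on the y-axis gives the elbow curve; the bend is where adding clusters stops reducing inertia by much.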
Data visualizations of the final results.
• Created a 3D scatter plot using Plotly Express to plot the clusters using the clustered_df DataFrame. Included the following parameters on the plot: hover_name="CoinName" and hover_data=["Algorithm"] to show this additional info on each data point.
Top-down image of 3D plot:
Image to include data upon hover:
• Used hvplot.table to create a data table with all the current tradable cryptocurrencies
with columns: CoinName, Algorithm, ProofType, TotalCoinSupply, TotalCoinsMined, and Class.
• Created a scatter plot using hvplot.scatter to present the clustered cryptocurrency data, with x="TotalCoinsMined" and y="TotalCoinSupply", to contrast the total number of mined coins with the total coin supply. Included the hover_cols=["CoinName"] parameter to show the cryptocurrency name on each data point.
Image to include data upon hover:
I was curious about an outlier on the scatter plot, BitTorrent, so I did some research to identify why it is an outlier. BitTorrent is a communication protocol, not a cryptocurrency.
I recommend the dataset be cleaned further to eliminate entries that are not true cryptocurrencies but are merely built on blockchain technology.
About BitTorrent:
BitTorrent (abbreviated to BT) is a communication protocol for peer-to-peer file sharing (P2P) which is used to distribute data and electronic files over the Internet. BitTorrent is one of the most common protocols for transferring large files, such as digital video files containing TV shows or video clips or digital audio files containing songs. Peer-to-peer networks have been estimated to collectively account for approximately 43% to 70% of all Internet traffic (depending on location) as of February 2009.[1] In February 2013, BitTorrent was responsible for 3.35% of all worldwide bandwidth, more than half of the 6% of total bandwidth dedicated to file sharing.[2] Source: https://en.wikipedia.org/wiki/BitTorrent.
Resource:
crypto_data.csv