- Crypto Clustering Overview
- Data Preprocessing
- Reducing Data Dimensions Using PCA
- Clustering Cryptocurrencies Using K-Means
- Visualizing Results
- Optional Challenge
Files: Clustering Crypto, Optional Challenge
-
In this assignment, I run the K-Means algorithm and Principal Component Analysis (PCA) to cluster cryptocurrencies.
-
I am assuming the role of a Senior Manager on the Advisory Services team at a Big Four firm.
-
One of my most important clients, a prominent investment bank, is interested in offering a new cryptocurrency investment portfolio for its customers; however, they are lost in the immense universe of cryptocurrencies.
-
They ask me to help them make sense of it all by generating a report of what cryptocurrencies are available on the trading market and how they can be grouped using classification.
-
I will put my new unsupervised learning and Amazon SageMaker skills into action by clustering cryptocurrencies and creating plots to present my results.
-
I am asked to accomplish the following main tasks:
- Data Preprocessing: Prepare data for dimension reduction with PCA and clustering using K-Means.
- Reducing Data Dimensions Using PCA: Reduce the data dimensions using the `PCA` algorithm from `sklearn`.
- Clustering Cryptocurrencies Using K-Means: Predict clusters from the cryptocurrencies data using the `KMeans` algorithm from `sklearn`.
- Visualizing Results: Create some plots and data tables to present my results.
- Optional Challenge: Deploy my notebook to Amazon SageMaker.
-
-
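The code snippets throughout this README assume a handful of imports made up front in the notebook. A minimal setup sketch; the exact import list is my assumption, inferred from the calls used below:

```python
# Minimal imports assumed by the snippets in this README
import requests
import pandas as pd
import hvplot.pandas  # registers the .hvplot accessor on DataFrames
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
```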
Using the `requests` library, retrieve the necessary data from the following CryptoCompare API endpoint: https://min-api.cryptocompare.com/data/all/coinlist. HINT: I will need to use the 'Data' key from the json response, then transpose the DataFrame. Name my DataFrame `crypto_df`.

# Use the following endpoint to fetch json data
url = "https://min-api.cryptocompare.com/data/all/coinlist"
response = requests.get(url).json()

# Create a DataFrame
crypto_df = pd.DataFrame(response["Data"]).T
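With the raw response loaded, a quick look at the shape and first rows helps confirm that transposing on the 'Data' key produced one row per coin; this sanity check is my own addition, not part of the assignment.

```python
# Optional sanity check: rows are coin symbols, columns are CryptoCompare metadata fields
print(crypto_df.shape)
crypto_df.head()
```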
-
With the data loaded into a Pandas DataFrame, continue with the following data preprocessing tasks.
- Keep only the necessary columns: 'CoinName', 'Algorithm', 'IsTrading', 'ProofType', 'TotalCoinsMined', 'CirculatingSupply'.

# Keep only necessary columns
crypto_df = crypto_df[['CoinName', 'Algorithm', 'IsTrading', 'ProofType', 'TotalCoinsMined', 'CirculatingSupply']]
- Keep only the cryptocurrencies that are trading.
# Keep only cryptocurrencies that are trading
crypto_df = crypto_df[crypto_df["IsTrading"] == True]
- Keep only the cryptocurrencies with a working algorithm.
crypto_df = crypto_df[crypto_df["Algorithm"] != "N/A"]
- Remove the `IsTrading` column.

crypto_df = crypto_df.drop(columns=["IsTrading"])
- Remove all cryptocurrencies with at least one null value.
crypto_df = crypto_df.dropna()
- Remove all cryptocurrencies that have no coins mined.
crypto_df = crypto_df[crypto_df["TotalCoinsMined"] > 0]
- Drop all rows where there are 'N/A' text values.
crypto_df = crypto_df[crypto_df != "N/A"].dropna()
- Store the names of all cryptocurrencies in a DataFrame named `coins_name`; use `crypto_df.index` as the index for this new DataFrame.

coins_name = pd.DataFrame(crypto_df["CoinName"], index=crypto_df.index)
- Remove the `CoinName` column.

crypto_df = crypto_df.drop("CoinName", axis=1)
- Create dummy variables for all the text features, and store the resulting data in a DataFrame named `X`.

X = pd.get_dummies(data=crypto_df, columns=["Algorithm", "ProofType"])
- Use the `StandardScaler` from `sklearn` to standardize all the data of the `X` DataFrame. Remember, this is important prior to using the PCA and K-Means algorithms.

X = StandardScaler().fit_transform(X)
- Use the `PCA` algorithm from `sklearn` to reduce the dimensions of the `X` DataFrame down to three principal components.

pca = PCA(n_components=3)
crypto_pca = pca.fit_transform(X)
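The assignment does not require it, but checking how much of the original variance the three principal components retain is a quick way to judge the reduction; a small optional check:

```python
# Share of the total variance captured by each of the three principal components
print(pca.explained_variance_ratio_)
print(f"Total explained variance: {pca.explained_variance_ratio_.sum():.2%}")
```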
- Once I have reduced the data dimensions, create a DataFrame named `pcs_df` using "PC 1", "PC 2", and "PC 3" as the column names; use `crypto_df.index` as the index for this new DataFrame.

pcs_df = pd.DataFrame(
    crypto_pca,
    columns=["PC 1", "PC 2", "PC 3"],
    index=crypto_df.index
)
pcs_df.head(10)
-
- Create an Elbow Curve to find the best value for `k` using the `pcs_df` DataFrame.
inertia = []
k = list(range(1, 11))
# Calculate the inertia for the range of k values
for i in k:
k_model = KMeans(n_clusters=i, random_state=1)
k_model.fit(pcs_df)
inertia.append(k_model.inertia_)
# Create the Elbow Curve using hvPlot
elbow_data = {"k": k, "inertia": inertia}
df_elbow = pd.DataFrame(elbow_data)
# Create Elbow plot
df_elbow.hvplot.line(
x="k",
y="inertia",
title="Elbow Curve",
xticks=k
)
- Once I define the best value for `k`, run the `KMeans` algorithm to predict the `k` clusters for the cryptocurrencies data. Use the `pcs_df` DataFrame to run the `KMeans` algorithm.
# Initialize the K-Means model
model = KMeans(n_clusters = 10, random_state=0)
# Fit the model
model.fit(pcs_df)
# Predict clusters
k_10 = model.predict(pcs_df)
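Before assembling the final DataFrame, it can be useful to see how the coins are distributed across the predicted clusters; this check is my own addition, not part of the assignment:

```python
# Count how many cryptocurrencies were assigned to each cluster label
pd.Series(k_10).value_counts().sort_index()
```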
- Create a new DataFrame named `clustered_df` that includes the following columns: "Algorithm", "ProofType", "TotalCoinsMined", "CirculatingSupply", "PC 1", "PC 2", "PC 3", "CoinName", "Class". I should maintain the index of the `crypto_df` DataFrame, as shown below.
clustered_df = pd.concat([crypto_df, pcs_df], axis=1)
clustered_df["Class"] = k_10
clustered_df["CoinName"] = coins_name
clustered_df.head(20)
- In this section, I will create some data visualizations to present the final results.
- Create a scatter plot using `hvplot.scatter` to present the clustered data about cryptocurrencies, with `x="TotalCoinsMined"` and `y="CirculatingSupply"` to contrast the circulating supply against the number of mined coins. Use the `hover_cols=["CoinName"]` parameter to include the cryptocurrency name on each data point.
# Plot Scatter plot
clustered_df.hvplot.scatter(
x= "TotalCoinsMined",
y= "CirculatingSupply",
hover_cols=["CoinName"]
)
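Since the goal of the plot is to present the clustered data, the same scatter can also be colored by cluster. hvPlot's `by` keyword groups the points by a column and adds a legend; this variation is my own styling choice, not something the assignment asks for.

```python
# Same scatter, grouped and colored by the predicted cluster label
clustered_df.hvplot.scatter(
    x="TotalCoinsMined",
    y="CirculatingSupply",
    by="Class",
    hover_cols=["CoinName"]
)
```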
- Use `hvplot.table` to create a data table with all the currently tradable cryptocurrencies. The table should have the following columns: "CoinName", "Algorithm", "ProofType", "CirculatingSupply", "TotalCoinsMined", "Class".

clustered_df.hvplot.table(
    columns=["CoinName", "Algorithm", "ProofType", "CirculatingSupply", "TotalCoinsMined", "Class"],
    sortable=True,
    selectable=True
)
-
For the challenge section, I have to upload my Jupyter notebook to Amazon SageMaker and deploy it.
- The `hvplot` library is not included in the built-in Anaconda environments, so for this challenge section, I should use the `altair` library instead.
- Upload my Jupyter notebook and rename it as `crypto_clustering_sm.ipynb`.
- Select the `conda_python3` environment.
- Install the `altair` library by running the following code before the initial imports:

!pip install -U altair
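The altair snippets below also assume the library has been imported under its conventional alias (matching the `alt.Chart` calls that follow); add this alongside the notebook's other imports:

```python
# Import altair under its conventional alias
import altair as alt
```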
- Use `altair` to create the Elbow Curve.
inertia = []
k = list(range(1, 11))
# Calculate the inertia for the range of k values
for i in k:
k_model = KMeans(n_clusters=i, random_state=1)
k_model.fit(pcs_df)
inertia.append(k_model.inertia_)
# Create the Elbow Curve using altair
elbow_data = {"k": k, "inertia": inertia}
df_elbow = pd.DataFrame(elbow_data)
# Create Elbow plot
alt.Chart(df_elbow).mark_line().encode(
x="k",
y="inertia"
)
- Use the `altair` scatter plot to visualize the clusters. Since this is a 2D scatter, use `x="PC 1"` and `y="PC 2"` for the axes, and add the following columns as tooltips: "CoinName", "Algorithm", "TotalCoinsMined", "CirculatingSupply".
# Plot the scatter with x="PC 1" and y="PC 2"
# Plot the clusters
alt.Chart(clustered_df).mark_circle(size=60).encode(
x="PC 1",
y="PC 2",
color='Class',
tooltip=['CoinName', 'Algorithm', 'TotalCoinsMined', 'CirculatingSupply']
).interactive()
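One small caveat with the chart above: the Class column holds integer cluster labels, so altair infers a quantitative scale and shades the points along a continuous gradient. Appending `:N` to the color encoding treats the labels as nominal, giving each cluster a distinct color; this tweak is my suggestion, not part of the assignment.

```python
# Same chart, with cluster labels encoded as nominal (categorical) values
alt.Chart(clustered_df).mark_circle(size=60).encode(
    x="PC 1",
    y="PC 2",
    color="Class:N",
    tooltip=["CoinName", "Algorithm", "TotalCoinsMined", "CirculatingSupply"]
).interactive()
```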