CIA Country Analysis and Clustering

Author: Paul A. Beata
GitHub: pbeata


Data Source: All of the data sets come from The World Factbook, a US government website that publishes information on each country.

Project Goals: To gain insights into similarity between countries and regions of the world through exploratory data analysis and by experimenting with different clusters of countries. The key focus is to think about what these clusters actually represent when segmenting the world's nations into different groups (clusters) based on data alone.

Country and Region Data

The data set used in this project can be found in the GitHub repository where all of this work is stored: Link to Repository.

Here we can also explore the rows and columns of the data, as well as the data types of the columns.

Part 1: Exploratory Data Analysis

Here we create visualizations to explore the data further.

Box Plot of Population

Histogram of Population

Using our knowledge of outliers from above, we will zoom in to only show the populations of countries that have less than 4 billion people.

GDP and Regions

Here we create a bar chart showing the mean GDP per Capita per region.
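As a minimal sketch of the aggregation behind such a bar chart, the snippet below groups a toy table by region and averages GDP per capita. The DataFrame contents and region names are made up for illustration, and the column names "Region" and "GDP ($ per capita)" are assumed to match the data set.

```python
import pandas as pd

# Toy stand-in for the country table; values are illustrative, not real data.
df = pd.DataFrame({
    "Region": ["A", "A", "B", "B", "B"],
    "GDP ($ per capita)": [1000.0, 3000.0, 20000.0, 30000.0, 40000.0],
})

# Mean GDP per capita for each region, sorted so the tallest bar comes first.
mean_gdp = (
    df.groupby("Region")["GDP ($ per capita)"]
      .mean()
      .sort_values(ascending=False)
)
print(mean_gdp)

# To draw the chart (requires matplotlib):
# mean_gdp.plot(kind="bar")
```

Sorting before plotting is a design choice that makes the ranking of regions easier to read; it does not change the values.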

Here we plot the relationship between phones per 1000 people and the GDP per capita, with the data points colored by region.

Here we plot the relationship between GDP per capita and literacy, with the data points colored by region.

Observations

Heatmap of the Correlation Among Features

Clustering

For clustering, we know that Seaborn can automatically perform hierarchical clustering through the clustermap() function. So here we will start by creating a clustermap of the correlations between each feature of the data set with this function.
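A rough sketch of this step, assuming a DataFrame with a mix of numeric and string columns: compute the correlation matrix on the numeric columns only, then pass it to clustermap(). The toy data and the helper name plot_clustermap are hypothetical; the plotting call is wrapped in a function so that seaborn is only needed when the figure is actually drawn.

```python
import numpy as np
import pandas as pd

# Synthetic features: feat_b is built to track feat_a closely, feat_c is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=50)
df = pd.DataFrame({
    "feat_a": base,
    "feat_b": base * 2 + rng.normal(scale=0.1, size=50),
    "feat_c": rng.normal(size=50),
    "label": ["x"] * 50,  # non-numeric column, excluded from the correlation
})

# Correlations between the numeric features only.
corr = df.select_dtypes(include="number").corr()

def plot_clustermap(corr_matrix):
    """Hypothetical helper: draw a hierarchically clustered heatmap."""
    import seaborn as sns  # imported lazily; requires seaborn and scipy
    return sns.clustermap(corr_matrix)
```

Calling plot_clustermap(corr) reorders the rows and columns so that strongly correlated features end up next to each other.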

Part 2: Data Preparation and Model Discovery

Now we will prepare our data for K-Means Clustering with scikit-learn.

There are 15 countries with missing values for "Agriculture". What is the main aspect of these countries?

Many of these countries that have a NaN value for agriculture are islands. Others are smaller regions of other countries, such as San Marino (Italy) and Gibraltar (British Territory), so it's likely that their agriculture is quite small or virtually non-existent. Greenland is the one exception, but due to its climate and from a quick Google search, we realize that "... the lack of agriculture, forestry and similar countryside activities has meant that very few country roads have been built" (Wikipedia page for Greenland), and thus there is essentially no agriculture industry there either. For these countries, we will fill their agriculture values with zero:
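A minimal sketch of the zero fill, using a toy table in which "Exampleland" and all numeric values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy table: two rows with a missing agriculture share, one hypothetical complete row.
df = pd.DataFrame({
    "Country": ["Greenland", "Gibraltar", "Exampleland"],
    "Agriculture": [np.nan, np.nan, 0.12],
})

# Count the missing values before filling.
print(df["Agriculture"].isnull().sum())

# Replace missing agriculture values with zero.
df["Agriculture"] = df["Agriculture"].fillna(0)
```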

We notice that the climate is missing for a few countries. Here one option will be to fill in the missing "Climate" values based on the mean climate value for a country's region.

The "Literacy" percentage is also missing for a few countries. We can use the same tactic as we did with the missing "Climate" values and fill in any missing literacy values with the mean literacy of the region.
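One way to sketch the region-mean fill for both columns is with groupby() and transform(), which returns a per-row region mean aligned with the original index. The toy values and region names below are invented, and the column names are assumed to match those used in the text.

```python
import numpy as np
import pandas as pd

# Toy table with gaps in both columns; values are illustrative only.
df = pd.DataFrame({
    "Region": ["A", "A", "A", "B", "B"],
    "Climate": [1.0, 3.0, np.nan, 2.0, np.nan],
    "Literacy": [90.0, np.nan, 80.0, np.nan, 60.0],
})

for col in ["Climate", "Literacy"]:
    # Mean of the column within each row's region (NaN values are skipped).
    region_mean = df.groupby("Region")[col].transform("mean")
    # Fill only the missing entries with their region's mean.
    df[col] = df[col].fillna(region_mean)
```

If an entire region were missing a feature, its regional mean would itself be NaN and the gap would remain, so it is worth re-checking isnull().sum() afterwards.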

In this section we will attempt to handle the missing values by updating our DataFrame with the current 2021 estimates listed on the cia.gov website for the following features:

Net Migration

Infant mortality (per 1000 births)

GDP ($ per capita)

Birthrate and Deathrate

Arable, Crops, Other

Phones (per 1000)

Industry and Service

The industry and service features are missing for 14 countries still.

A scatter plot of "industry" vs "service" does not reveal any clear pattern for estimating these missing values from the full data set, so we will fill them using the mean values from the corresponding region next.

Part 3: Feature Preparation

Prepare the data for clustering: the "Country" column is still a unique identifier string, so it won't be useful for clustering, since it's unique for each point. Therefore, we can drop the "Country" column now:

Now we can create the X array of features. Since the "Region" column still contains categorical strings, we will use Pandas to create dummy variables from this column. The finalized X array then holds the continuous features along with dummy variables that replace the categorical regions.
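A small sketch of these two steps on a toy table (the country names, regions, and populations are placeholders):

```python
import pandas as pd

# Toy table with an identifier column, a categorical column, and a numeric column.
df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Region": ["ASIA", "OCEANIA", "ASIA"],
    "Population": [100, 200, 300],
})

# Drop the unique identifier; it carries no information for clustering.
X = df.drop("Country", axis=1)

# Replace the categorical "Region" column with one dummy column per region.
X = pd.get_dummies(X, columns=["Region"])
print(X.columns.tolist())
```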

Feature Scaling

Due to some measurements being in terms of percentages and other metrics being total counts (population), we should scale this data first.
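As an illustration, scikit-learn's StandardScaler is one common choice for this step; whether it is the scaler used in the original analysis is an assumption. The toy array below mixes a population-like column with a fraction-like column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: column 0 is a large count, column 1 is a fraction.
X = np.array([
    [1_000_000, 0.10],
    [50_000_000, 0.50],
    [300_000_000, 0.90],
])

# Standardize each column to zero mean and unit variance.
scaler = StandardScaler()
scaled_X = scaler.fit_transform(X)
print(scaled_X)
```

Without scaling, the population column would dominate the Euclidean distances that K-Means relies on.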

Part 4: K-Means Clustering

For our clustering analysis, we will use a for loop to create and fit multiple KMeans models by testing from K=2 to K=30 clusters. We store the "sum of squared distances" (SSD) for each K value and then plot this out to create an elbow plot of K versus SSD.
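A minimal sketch of this loop is shown below. The random matrix stands in for the scaled feature array, and the n_init and random_state settings are assumptions added for reproducibility; the SSD is read from the fitted model's inertia_ attribute.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the scaled feature matrix (200 samples, 5 features).
rng = np.random.default_rng(42)
scaled_X = rng.normal(size=(200, 5))

k_values = range(2, 31)  # K = 2 through K = 30
ssd = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(scaled_X)
    # inertia_ is the sum of squared distances of samples to their closest centroid.
    ssd.append(model.inertia_)

# Elbow plot (requires matplotlib):
# import matplotlib.pyplot as plt
# plt.plot(list(k_values), ssd, "o--")
# plt.xlabel("K"); plt.ylabel("Sum of squared distances")
```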

We could also create a bar plot showing the change in SSD when moving from K to K+1 clusters.
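One way to compute these differences is with pandas' diff(), sketched here on made-up SSD values indexed by K:

```python
import pandas as pd

# Hypothetical SSD values for K = 2..5 (illustrative numbers only).
ssd = pd.Series([100.0, 70.0, 55.0, 50.0], index=[2, 3, 4, 5])

# Each entry is SSD(K) - SSD(K-1); the first entry has no predecessor and is dropped.
ssd_diff = ssd.diff().dropna()
print(ssd_diff)

# Bar plot of the differences (requires matplotlib):
# ssd_diff.plot(kind="bar")
```

The differences are negative because SSD shrinks as clusters are added; bars that approach zero suggest that further clusters add little.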

Part 5: Model Interpretation

What are some of the possible cut-off values for K?

Option 1: Choosing K=3

You could say that there is a significant drop off in SSD difference at K=3 (although we can see it continues to drop off past this). What would an analysis look like for K=3?

Option 2: Choosing K=6

Option 3: Choosing K=14


Part 6: Geographical Model Interpretation

The best way to interpret this model is through visualizing the clusters of countries on a real map of the world. Our goal is to create cluster labels for a chosen K value and display them on the map. Based on the solutions, we believe either K=3, K=6, or K=14 could be reasonable choices.

We will plot out these clusters on a country level choropleth map:

  1. Make sure to have the plotly library installed: https://plotly.com/python/getting-started/

  2. Example of how to create a geographical choropleth map using plotly: https://plotly.com/python/choropleth-maps/#using-builtin-country-and-state-geometries

  3. We will need ISO country codes for this: either use the Wikipedia page, or use the file provided in the data folder of this repository: "./data/country_iso_codes.csv"

  4. Combine the cluster labels, ISO Codes, and Country Names to create a world map plot with the plotly library
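The steps above can be sketched roughly as follows. The column names "Country", "ISO Code", and "Cluster", the toy rows, and the helper name plot_cluster_map are assumptions for illustration; the actual layout of country_iso_codes.csv may differ. The plotly import sits inside the helper so the merge can be run without plotly installed.

```python
import pandas as pd

# Hypothetical cluster labels, e.g. taken from a fitted KMeans model's labels_.
clusters = pd.DataFrame({
    "Country": ["France", "Japan", "Brazil"],
    "Cluster": [0, 1, 2],
})

# Stand-in for the ISO code file; in the project this might be loaded with
# pd.read_csv("./data/country_iso_codes.csv") (column names assumed here).
iso_codes = pd.DataFrame({
    "Country": ["France", "Japan", "Brazil"],
    "ISO Code": ["FRA", "JPN", "BRA"],
})

# Attach the three-letter ISO code to each country's cluster label.
map_df = clusters.merge(iso_codes, on="Country", how="left")

def plot_cluster_map(frame):
    """Hypothetical helper: color each country by its cluster label."""
    import plotly.express as px  # imported lazily; requires plotly
    return px.choropleth(
        frame,
        locations="ISO Code",   # ISO-3 codes are plotly's default location mode
        color="Cluster",
        hover_name="Country",
        color_continuous_scale="Turbo",
    )
```

Calling plot_cluster_map(map_df).show() would render the map. A left merge keeps every clustered country, so any row with a missing ISO code afterwards points to a country-name mismatch between the two tables.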