Plotting Population Weighted Mean Centroids on a Country Map
A few days ago, a Reddit user posted a beautiful visualization of the mean centers of U.S. population by state on the /r/MapPorn subreddit.
Pretty neat, right? The map gets its point across while maintaining an appealing simplicity. It inspired me to create a similar population map for my motherland, India.
But what I imagined to be a straightforward task turned into something really complicated. While the original creator of the above Reddit post was able to get the desired data in a nicely arranged format from the US Census Bureau website, similar information wasn’t readily available for India. So, I set out to create my map from scratch, and I would like to share that experience with you. :)
Step 1: Set the Goals
- I decided to compare state-wise population for the two most recent years in which a census was conducted in India (2001 and 2011).
- My preferred programming language is Python, so I needed a Python-based solution.
- For plotting the output, one way in Python could be to create a choropleth map using Plot.ly. Unfortunately, Plot.ly does not support India for country-level choropleth maps, so I decided to use Tableau for plotting the output on a map.
Step 2: Define the Process
To create the target map, I need to do the following:
- Find a way to map each district in India as a polygon. The border of each polygon would be a sequence of 2-D points (latitude, longitude). Of course, a single state contains multiple districts.
- Find the mean center of each district that makes up a state. The “mean” center is simply the point on which the district map would balance perfectly.
- Assign a weight to each district centroid. This weight is the population of that district. I might need to scrape this population data from a website.
- For all the weighted [longitude, latitude] points inside a state, find one weighted mean centroid that represents the state in the final map. A “weighted” mean is required because districts have different populations, so only a weighted mean centroid gives a true picture of the population center.
- Plot each state on a map using Tableau. Each state would display two points, one centroid for each census year (2001 vs 2011).
Step 3: Implement the plan
Let’s go step-by-step through the implementation procedure:
Draw Polygons
After searching a bit, I found an excellent source of district-wise longitude, latitude coordinates for India. Tableau user Indumon was kind enough to share a csv file with enough information to plot Indian districts as polygons. You can download the csv file from this link.
Here’s a snippet of this csv file:
And here’s what the polygons look like on a Tableau map:
Calculate Mean Centroids
As you would imagine, the above csv file has multiple rows per district, which together form the border of each district (polygon). I used the shapely Python library to calculate the mean centroids.
Shapely is a BSD-licensed Python package for manipulation and analysis of planar geometric objects. (shapely GitHub)
As the shapely documentation also explains, the mean center of a polygon is not guaranteed to lie inside the polygon. Imagine a state shaped like the letter ‘L’ or ‘C’: in that case, the mean of the coordinates might fall outside the polygon. Shapely does provide a “cheat” method that returns a point guaranteed to be inside the polygon, but I didn’t need it, as my initial runs showed that none of the Indian provinces were highly twisted in shape.
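To see the ‘L’-shape problem concretely, here is a small sketch using a made-up L-shaped polygon (the coordinates are purely illustrative, not a real district). It compares shapely’s centroid attribute with representative_point(), which returns a cheaply computed point guaranteed to be within the geometry and is presumably the “cheat” method mentioned above.

```python
from shapely.geometry import Polygon

# Illustrative L-shaped polygon (made-up coordinates, not a real district)
l_shape = Polygon([(0, 0), (10, 0), (10, 1), (1, 1), (1, 10), (0, 10)])

# The centroid is the area-weighted balance point of the shape
centroid = l_shape.centroid

# representative_point() is guaranteed to fall within the polygon
inner_point = l_shape.representative_point()

print("centroid:", centroid.x, centroid.y)
print("centroid inside polygon?", centroid.within(l_shape))
print("representative point inside polygon?", inner_point.within(l_shape))
```

For this shape the centroid lands in the empty corner of the ‘L’, while the representative point stays inside the polygon.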
To calculate the mean center, I imported the csv data into a Pandas dataframe and calculated the mean center of each district. Below is the code that I used for this task. The code is pretty much self-explanatory.
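As an illustration of this step (a minimal sketch, not necessarily the exact code used for this post), the snippet below groups boundary rows by district, builds a shapely polygon from each group, and takes its centroid. The column names State, District, Longitude and Latitude, the toy coordinates, and the assumption that each district is a single ring of vertices listed in drawing order are all hypothetical.

```python
import pandas as pd
from shapely.geometry import Polygon

# Hypothetical boundary table: one row per polygon vertex, in drawing order
boundary_df = pd.DataFrame({
    "State": ["S1"] * 8,
    "District": ["D1"] * 4 + ["D2"] * 4,
    "Longitude": [0.0, 2.0, 2.0, 0.0, 2.0, 6.0, 6.0, 2.0],
    "Latitude": [0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 2.0, 2.0],
})

def district_centroid(points):
    """Build a polygon from the ordered vertices and return its centroid."""
    polygon = Polygon(list(zip(points["Longitude"], points["Latitude"])))
    centroid = polygon.centroid
    return pd.Series({"Longitude": centroid.x, "Latitude": centroid.y})

# One centroid row per (State, District) pair
centroid_df = (
    boundary_df.groupby(["State", "District"])[["Longitude", "Latitude"]]
    .apply(district_centroid)
    .reset_index()
)
print(centroid_df)
```

Note that this treats longitude and latitude as plain planar coordinates, which is the same simplification the rest of this post relies on.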
The output from this step is exported to a csv file. Below is a brief snippet of the output saved in csv file.
Here’s what the centroids look like on a Tableau map:
Scrape the population data
After hours of trudging through the sluggish Indian Census Library, I came across the City Population website, which neatly displays census data for each district in India for the last three decennial censuses. However, this information is spread across separate webpages, so the next step was to write a simple scraper to get the data into a nicely formatted csv file.
The code below shows the scraping steps and saves the scraped data into a csv file.
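As a rough sketch of the idea (not the exact scraper used for this post), pandas.read_html parses every table element it finds and returns a list of dataframes. To keep the example self-contained, it reads a small inline HTML string instead of a live page; the table layout, column names and numbers are made up for illustration, and read_html needs an HTML parser such as lxml to be installed.

```python
from io import StringIO

import pandas as pd

# Stand-in HTML for one hypothetical state page (made-up districts and numbers)
html = """
<table>
  <tr><th>Name</th><th>Population 2001</th><th>Population 2011</th></tr>
  <tr><td>District A</td><td>1,200,000</td><td>1,350,000</td></tr>
  <tr><td>District B</td><td>800,000</td><td>910,000</td></tr>
</table>
"""

# read_html returns a list with one dataframe per <table> found
tables = pd.read_html(StringIO(html))

# Keep the first table and tag each row with its (hypothetical) state name
population_df = tables[0].rename(columns={"Name": "District"})
population_df["State"] = "Example State"
print(population_df)
```

For real pages, the same call accepts a URL in place of the StringIO object. The per-state dataframes can then be combined with pd.concat and written out with to_csv.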
As you can see from the code above, the nifty Pandas ability to read tables directly from webpages is very handy, especially for such small tasks. Here is a brief look at the contents of the saved csv file:
Merge the data
At this stage, we have geospatial data for each district in one csv file and population data in a separate csv file. Naturally, the next step is to combine them. One important reason for keeping two csv files is that our geospatial data is missing a few districts (yes, I counted). So it is a good idea to keep the location and population data in separate tables and then join the two to get a clean merged table. The Python code below does this by performing a merge operation on the two dataframes that we built in the examples above.
import pandas as pd

# Left-join the population data onto the district centroids
merged_df = pd.merge(df, population_df, on=['District', 'State'], how='left')
# Save the merged data
merged_df.to_csv('merged.csv', index=False)
# Drop rows with any null value (not recommended, see below)
merged_df.dropna(axis=0, how='any').to_csv('merged_withoutNA.csv', index=False)
In the above code, we performed a merge operation with District and State as key columns. Unfortunately, some rows in the merged output have null values. This is because the website we scraped the population data from uses slightly different naming conventions for a few states and cities (for example, our location data has the state name “Jammu and Kashmir” while the population data has “Jammu & Kashmir”, and so on for a few districts). One could write a simple fuzzy match to handle this situation, but since only a few rows had missing values, I manually filled in the corresponding population data by reading the numbers from the City Population website.
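For anyone who would rather automate this, here is one possible sketch of such a fuzzy match using difflib from the Python standard library. The name lists, the build_name_map helper and the 0.8 cutoff are all illustrative assumptions, not something used for this post.

```python
from difflib import get_close_matches

# Hypothetical name lists standing in for the two data sources
location_states = ["Jammu and Kashmir", "Himachal Pradesh", "Punjab"]
population_states = ["Jammu & Kashmir", "Himachal Pradesh", "Punjab"]

def build_name_map(source_names, target_names, cutoff=0.8):
    """Map each source name to its closest target name, or None if nothing is close."""
    name_map = {}
    for name in source_names:
        matches = get_close_matches(name, target_names, n=1, cutoff=cutoff)
        name_map[name] = matches[0] if matches else None
    return name_map

# Translate the population-data spellings into the location-data spellings
state_name_map = build_name_map(population_states, location_states)
print(state_name_map)
```

The resulting dictionary could be applied to the State column of the population dataframe before merging. Fuzzy matches can pair up the wrong names, so it is worth reviewing the mapping by eye before trusting it.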
Calculate Weighted Mean Center
In the previous step, we effectively assigned weights (population data) to each district centroid. The next step is to calculate the center of population density for each state based on the weighted averages.
First, I considered using ArcGIS Pro’s MeanCenter_stats functionality for this purpose. ArcGIS has a powerful set of tools for analyzing geospatial data, along with a very useful Python package, arcpy. However, after diving into the Esri documentation, I realized that the mean center calculation is pretty straightforward. So, I decided to write my own code to calculate the weighted means, based on the formula below.
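For reference, the standard definition of a weighted mean center is written out here. The symbols are my own notation: x_i and y_i are the longitude and latitude of the centroid of district i, w_i is that district’s population, and n is the number of districts in the state.

```latex
\bar{X}_w = \frac{\sum_{i=1}^{n} w_i \, x_i}{\sum_{i=1}^{n} w_i},
\qquad
\bar{Y}_w = \frac{\sum_{i=1}^{n} w_i \, y_i}{\sum_{i=1}^{n} w_i}
```

In words: multiply each district’s coordinates by its population, add everything up, and divide by the state’s total population.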
Below is the Python implementation to generate the desired weighted means.
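A minimal sketch of such a calculation with pandas is shown here. The function name weighted_mean_centers, the column names and the toy numbers are illustrative assumptions rather than the exact code or data behind this post.

```python
import pandas as pd

def weighted_mean_centers(df, weight_col, group_col="State",
                          lon_col="Longitude", lat_col="Latitude"):
    """Return one weighted mean center per group (e.g. per state)."""
    # Rows with a missing coordinate or weight cannot contribute
    work = df[[group_col, lon_col, lat_col, weight_col]].dropna().copy()
    work["w_lon"] = work[lon_col] * work[weight_col]
    work["w_lat"] = work[lat_col] * work[weight_col]
    sums = work.groupby(group_col)[["w_lon", "w_lat", weight_col]].sum()
    # Weighted mean = sum(w * x) / sum(w), computed per group
    return pd.DataFrame({
        lon_col: sums["w_lon"] / sums[weight_col],
        lat_col: sums["w_lat"] / sums[weight_col],
    })

# Toy district centroids with made-up 2001 populations
districts = pd.DataFrame({
    "State": ["A", "A", "B"],
    "Longitude": [70.0, 80.0, 75.0],
    "Latitude": [10.0, 20.0, 15.0],
    "Population_2001": [100, 300, 50],
})

centers_2001 = weighted_mean_centers(districts, "Population_2001")
print(centers_2001)
```

In state A, the second district has three times the population of the first, so the weighted center is pulled toward it: (77.5, 17.5) rather than the unweighted midpoint (75, 15).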
Repeat the above process for the 2011 population data, and we have all the weighted mean centers along with the year-wise population data.
Step 4: (Profit?) Plot the Results
Phew! Now that we have all the information we need, the only thing left is to plot the results in Tableau to see the state-wise population density trends. And here’s what the output looks like on a Tableau dashboard:
Well, the results are not quite what I was expecting. There has been very little population density movement between the last two decennial censuses of India. I reckon a city-level density plot, instead of a district-level map, might show more interesting movement. But it was a fun little project and I learned many new things from it.
I hope you too learned something new from reading this post. And I hope I could inspire you to work on a mini-project of your own (the way the original Reddit post inspired me). Thank you for taking the time to read this. Till next time. :)
Edit:
I was asked to plot the national center of population density for the data I have. Here’s the output on Tableau: