Tying up my time at Mapillary in mid-January, I decided that instead of diving directly into a job hunt, I would take a bit of time to hone existing skills and develop some new ones. Ever since extracting and visualizing sentiment data from New York Times comments as an undergraduate in design school, I've been attracted to data analysis as a means to gain and share a greater understanding of the world. With some time on my hands and a passion for writing software in Python, I decided to dive in head first. This is the first in a series of blog posts where I'll catalog my self-education, share some of what I build, and hopefully provide an opportunity for others with an interest to learn alongside me.
Self-Education
Thus far, my education has been self-driven on Kaggle -- there's a ton of knowledge there! While my experience diving into the domain has been pretty humbling (pretty much everyone around me knows so much more), it's also been enlightening and energizing -- the possibilities are boundless.
Thinking about next steps, I plan to:
- complete an end-to-end machine learning project and publish my progress on this blog and on Kaggle
- complete fast.ai Deep Learning Part 1
- build a deep learning computer
- compete in at least one Kaggle competition, hopefully joining a team to boost my learning
E2E: Predicting the Cause of Wildfires in the United States
Surveying the datasets available to work with on Kaggle, I was immediately attracted to a dataset describing 1.88 million US wildfires over 24 years (1992-2015), originally published by the US Forest Service. Between the geospatial component, the high topical relevance, and a personal interest in the subject, I decided that I'd focus my first end-to-end project on predicting the cause of a fire given the information available when the fire began. Given the limited information available in the dataset, it's highly likely that integrating additional data, such as historical weather or land use, will be essential to building a strong model.
This first blog post covers the initial steps of preparing an environment and loading the data to begin working with it.
Changes
Since the project is still in motion, I expect the content here to change -- stay tuned.
- 2017/1/30 - initial post
- 2017/2/9 - added some new imports used later in the notebook; addressed a warning that I had incorrectly suppressed
- 2017/2/13 - updated the configuration cell to add some style to our plots
Some notes on Notebooks/Jupyter/Kaggle
This post, and the work in it, was completed in a Jupyter notebook running in a Docker container. While my process has been roughly cataloged here, the container has been modified a bit throughout the project to update existing tools and add new ones. The Dockerfile and associated notebooks can be found on GitHub, and I'll write a blog post about the setup sooner or later. I also suspect that the code, verbatim, won't run on Kaggle -- there are some geo-related libraries that would likely need to be installed.
Prepare Environment
The first thing that we need to do is prepare our working environment by importing libraries and performing some configuration incantations.
Load Libraries
# Standard library
import functools
import itertools
import math
import os
import pprint
import sqlite3
import subprocess

# Third-party
from IPython.core.display import HTML
from geoalchemy2 import Geometry, WKTElement
import geopandas as gpd
import graphviz
import matplotlib as mpl
from matplotlib import pyplot as plt
import numpy as np
import palettable
import pandas as pd
import pandas.tools.plotting as pdplot
import pyproj
import rasterio
import rasterio.mask
import rasterio.plot
from rasterstats import zonal_stats
import seaborn as sns
import shapely
from shapely.ops import transform
import sklearn
from sklearn import model_selection
from sqlalchemy import create_engine
from sqlalchemy.types import String, JSON
from tqdm import tqdm, tqdm_notebook
Configure
%matplotlib inline
qual_colormap = palettable.matplotlib.Inferno_20
quant_colormap = palettable.matplotlib.Inferno_20_r
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 15)
plt.rcParams['agg.path.chunksize'] = 100000
plt.rcParams['font.family'] = "Rubik"
plt.rcParams["font.weight"] = "300"
plt.rcParams["font.size"] = "14"
plt.rcParams["axes.labelsize"] = "14"
plt.rcParams["ytick.labelsize"] = "12"
plt.rcParams["xtick.labelsize"] = "12"
sns.set_palette(qual_colormap.mpl_colors)
Load Data
Now that we have our libraries in place, we'll load our data and have a look at it. Since the data comes in SQLite format, we can use Pandas' read_sql_query function to build a dataframe. Since we're not yet acquainted with the contents of our data, we'll load all provided columns and have a look.
input_filename = '/data/188-million-us-wildfires/src/FPA_FOD_20170508.sqlite'
conn = sqlite3.connect(input_filename)
query = '''
SELECT
NWCG_REPORTING_AGENCY,
NWCG_REPORTING_UNIT_ID,
NWCG_REPORTING_UNIT_NAME,
FIRE_NAME,
COMPLEX_NAME,
FIRE_YEAR,
DISCOVERY_DATE,
DISCOVERY_DOY,
DISCOVERY_TIME,
STAT_CAUSE_CODE,
STAT_CAUSE_DESCR,
CONT_DATE,
CONT_DOY,
CONT_TIME,
FIRE_SIZE,
FIRE_SIZE_CLASS,
LATITUDE,
LONGITUDE,
OWNER_CODE,
OWNER_DESCR,
STATE,
COUNTY
FROM
Fires;
'''
raw_df = pd.read_sql_query(query, conn)
Review Raw Data
Now that our data is loaded, let's give it a very high level look and start to develop an understanding of what we're working with.
Info
Let's have a look at our column names and the type of data in each column.
raw_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1880465 entries, 0 to 1880464
Data columns (total 22 columns):
NWCG_REPORTING_AGENCY object
NWCG_REPORTING_UNIT_ID object
NWCG_REPORTING_UNIT_NAME object
FIRE_NAME object
COMPLEX_NAME object
FIRE_YEAR int64
DISCOVERY_DATE float64
DISCOVERY_DOY int64
DISCOVERY_TIME object
STAT_CAUSE_CODE float64
STAT_CAUSE_DESCR object
CONT_DATE float64
CONT_DOY float64
CONT_TIME object
FIRE_SIZE float64
FIRE_SIZE_CLASS object
LATITUDE float64
LONGITUDE float64
OWNER_CODE float64
OWNER_DESCR object
STATE object
COUNTY object
dtypes: float64(8), int64(2), object(12)
memory usage: 315.6+ MB
Missing Values
Let's see how many values are missing in each column.
raw_df.isna().sum()
NWCG_REPORTING_AGENCY 0
NWCG_REPORTING_UNIT_ID 0
NWCG_REPORTING_UNIT_NAME 0
FIRE_NAME 957189
COMPLEX_NAME 1875282
FIRE_YEAR 0
DISCOVERY_DATE 0
DISCOVERY_DOY 0
DISCOVERY_TIME 882638
STAT_CAUSE_CODE 0
STAT_CAUSE_DESCR 0
CONT_DATE 891531
CONT_DOY 891531
CONT_TIME 972173
FIRE_SIZE 0
FIRE_SIZE_CLASS 0
LATITUDE 0
LONGITUDE 0
OWNER_CODE 0
OWNER_DESCR 0
STATE 0
COUNTY 678148
dtype: int64
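Raw counts are hard to judge against 1.88 million rows, so here's a quick sketch expressing the same thing as a percentage of all rows:

# Share of rows missing per column, as a percentage
missing_pct = (raw_df.isna().mean() * 100).round(1)
missing_pct[missing_pct > 0].sort_values(ascending=False)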
Sample
Let's look at some sample data.
raw_df.sample(10)
 | NWCG_REPORTING_AGENCY | NWCG_REPORTING_UNIT_ID | NWCG_REPORTING_UNIT_NAME | FIRE_NAME | COMPLEX_NAME | FIRE_YEAR | DISCOVERY_DATE | DISCOVERY_DOY | DISCOVERY_TIME | STAT_CAUSE_CODE | ... | CONT_DOY | CONT_TIME | FIRE_SIZE | FIRE_SIZE_CLASS | LATITUDE | LONGITUDE | OWNER_CODE | OWNER_DESCR | STATE | COUNTY |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1036316 | ST/C&L | USNCNCS | North Carolina Forest Service | YOU GO TO THIS ONE | None | 2000 | 2451852.5 | 309 | None | 8.0 | ... | NaN | None | 0.20 | A | 34.816700 | -79.383300 | 14.0 | MISSING/NOT SPECIFIED | NC | None |
1144181 | ST/C&L | USCASLU | San Luis Obispo Unit | RANGE | None | 1993 | 2449220.5 | 233 | None | 2.0 | ... | NaN | None | 4.00 | B | 35.331111 | -120.710000 | 14.0 | MISSING/NOT SPECIFIED | CA | None |
704956 | ST/C&L | USMSMSS | Mississippi Forestry Commission | None | None | 1995 | 2449770.5 | 53 | 1349 | 5.0 | ... | 53.0 | 1521 | 6.00 | B | 34.911615 | -88.970655 | 14.0 | MISSING/NOT SPECIFIED | MS | Tippah |
507237 | ST/C&L | USMNMNS | Minnesota Department of Natural Resources | None | None | 2001 | 2452028.5 | 119 | None | 2.0 | ... | NaN | None | 4.00 | B | 46.123636 | -94.769860 | 14.0 | MISSING/NOT SPECIFIED | MN | Todd |
473711 | ST/C&L | USLALAS | Louisiana Office of Forestry | None | None | 2005 | 2453507.5 | 137 | None | 1.0 | ... | NaN | None | 4.00 | B | 31.504505 | -92.512665 | 14.0 | MISSING/NOT SPECIFIED | LA | Grant |
1573233 | NPS | USCODSP | Dinosaur National Monument | WAPITI | None | 2012 | 2456150.5 | 224 | 1501 | 1.0 | ... | 227.0 | 1415 | 104.00 | D | 40.399060 | -108.306250 | 8.0 | PRIVATE | CO | None |
1868546 | ST/C&L | USHICNTY | Hawaii Counties | None | None | 2012 | 2455927.5 | 1 | None | 13.0 | ... | NaN | None | 0.01 | A | 20.862314 | -156.476700 | 14.0 | MISSING/NOT SPECIFIED | HI | Mauii |
1195322 | ST/C&L | USNYNYX | Fire Department of New York | None | None | 2010 | 2455356.5 | 160 | None | 9.0 | ... | NaN | None | 0.10 | A | 40.896927 | -72.906787 | 14.0 | MISSING/NOT SPECIFIED | NY | SUFFOLK |
483475 | ST/C&L | USMEMES | Maine Forest Service | None | None | 2005 | 2453689.5 | 319 | None | 5.0 | ... | NaN | None | 0.01 | A | 43.407530 | -70.764500 | 14.0 | MISSING/NOT SPECIFIED | ME | York |
166478 | FS | USAZTNF | Tonto National Forest | OX | None | 2004 | 2453161.5 | 157 | 1628 | 9.0 | ... | 157.0 | 1715 | 0.20 | A | 34.180833 | -111.335556 | 5.0 | USFS | AZ | 7 |
10 rows × 22 columns
Describe
raw_df.describe(include='all')
 | NWCG_REPORTING_AGENCY | NWCG_REPORTING_UNIT_ID | NWCG_REPORTING_UNIT_NAME | FIRE_NAME | COMPLEX_NAME | FIRE_YEAR | DISCOVERY_DATE | DISCOVERY_DOY | DISCOVERY_TIME | STAT_CAUSE_CODE | ... | CONT_DOY | CONT_TIME | FIRE_SIZE | FIRE_SIZE_CLASS | LATITUDE | LONGITUDE | OWNER_CODE | OWNER_DESCR | STATE | COUNTY |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1880465 | 1880465 | 1880465 | 923276 | 5183 | 1.880465e+06 | 1.880465e+06 | 1.880465e+06 | 997827 | 1.880465e+06 | ... | 988934.000000 | 908292 | 1.880465e+06 | 1880465 | 1.880465e+06 | 1.880465e+06 | 1.880465e+06 | 1880465 | 1880465 | 1202317 |
unique | 11 | 1640 | 1635 | 493633 | 1416 | NaN | NaN | NaN | 1440 | NaN | ... | NaN | 1441 | NaN | 7 | NaN | NaN | NaN | 16 | 52 | 3455 |
top | ST/C&L | USGAGAS | Georgia Forestry Commission | GRASS FIRE | OSAGE-MIAMI COMPLEX | NaN | NaN | NaN | 1400 | NaN | ... | NaN | 1800 | NaN | B | NaN | NaN | NaN | MISSING/NOT SPECIFIED | CA | 5 |
freq | 1377090 | 167123 | 167123 | 3983 | 54 | NaN | NaN | NaN | 20981 | NaN | ... | NaN | 38078 | NaN | 939376 | NaN | NaN | NaN | 1050835 | 189550 | 7576 |
mean | NaN | NaN | NaN | NaN | NaN | 2.003710e+03 | 2.453064e+06 | 1.647191e+02 | NaN | 5.979037e+00 | ... | 172.656766 | NaN | 7.452016e+01 | NaN | 3.678121e+01 | -9.570494e+01 | 1.059658e+01 | NaN | NaN | NaN |
std | NaN | NaN | NaN | NaN | NaN | 6.663099e+00 | 2.434573e+03 | 9.003891e+01 | NaN | 3.483860e+00 | ... | 84.320348 | NaN | 2.497598e+03 | NaN | 6.139031e+00 | 1.671694e+01 | 4.404662e+00 | NaN | NaN | NaN |
min | NaN | NaN | NaN | NaN | NaN | 1.992000e+03 | 2.448622e+06 | 1.000000e+00 | NaN | 1.000000e+00 | ... | 1.000000 | NaN | 1.000000e-05 | NaN | 1.793972e+01 | -1.788026e+02 | 0.000000e+00 | NaN | NaN | NaN |
25% | NaN | NaN | NaN | NaN | NaN | 1.998000e+03 | 2.451084e+06 | 8.900000e+01 | NaN | 3.000000e+00 | ... | 102.000000 | NaN | 1.000000e-01 | NaN | 3.281860e+01 | -1.103635e+02 | 8.000000e+00 | NaN | NaN | NaN |
50% | NaN | NaN | NaN | NaN | NaN | 2.004000e+03 | 2.453178e+06 | 1.640000e+02 | NaN | 5.000000e+00 | ... | 181.000000 | NaN | 1.000000e+00 | NaN | 3.545250e+01 | -9.204304e+01 | 1.400000e+01 | NaN | NaN | NaN |
75% | NaN | NaN | NaN | NaN | NaN | 2.009000e+03 | 2.455036e+06 | 2.300000e+02 | NaN | 9.000000e+00 | ... | 232.000000 | NaN | 3.300000e+00 | NaN | 4.082720e+01 | -8.229760e+01 | 1.400000e+01 | NaN | NaN | NaN |
max | NaN | NaN | NaN | NaN | NaN | 2.015000e+03 | 2.457388e+06 | 3.660000e+02 | NaN | 1.300000e+01 | ... | 366.000000 | NaN | 6.069450e+05 | NaN | 7.033060e+01 | -6.525694e+01 | 1.500000e+01 | NaN | NaN | NaN |
11 rows × 22 columns
Observations
Considering that our investigation centers on predicting the cause of a wildfire from the information available at the time it was discovered, I've made the following initial observations about the data:
- STAT_CAUSE_CODE and STAT_CAUSE_DESCR are related and represent the value that we are trying to predict. Before training, we'll drop STAT_CAUSE_DESCR in favor of the numerical value of STAT_CAUSE_CODE.
- OWNER_CODE and OWNER_DESCR are related and describe the owner of the property where the fire was discovered. This is an interesting value because it reflects the management and usage of a particular piece of land, which should matter in our investigation. Before training, we'll drop OWNER_DESCR in favor of the numerical value of OWNER_CODE.
- DISCOVERY_DATE, DISCOVERY_DOY, and DISCOVERY_TIME describe the time that a fire was discovered. DISCOVERY_DOY is the most interesting to our investigation due to its relation to the climate and usage patterns of a particular piece of land. DISCOVERY_TIME may be interesting, but it might be too fine-grained, and it's missing many values -- let's drop it for now. DISCOVERY_DATE is too specific to be useful.
- LATITUDE and LONGITUDE are both very interesting due to their strong relationship to land cover, land use, and climate -- all big factors in wildfire creation.
- STATE and COUNTY both categorically describe the location of a fire. STATE might be interesting due to its relation to land use patterns, and it might also prove to be a useful generalization of the more specific LATITUDE and LONGITUDE.
- COUNTY, while potentially interesting, has too many missing values. If we want to more closely explore categorical location data, we can add it back via a geocoding step in the data engineering process (see the sketch just after this list).
- A number of columns contain information about how a fire was addressed rather than what caused it. Let's ignore the following columns for now: NWCG_REPORTING_AGENCY, NWCG_REPORTING_UNIT_ID, NWCG_REPORTING_UNIT_NAME, FIRE_NAME, COMPLEX_NAME, CONT_DATE, CONT_DOY, CONT_TIME, FIRE_SIZE, FIRE_SIZE_CLASS
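For a sense of what that geocoding step might look like: a rough geopandas sketch that spatially joins fire points to county polygons. The counties file path here is hypothetical (the US Census TIGER/Line county shapefile would be one real option), and note that the sjoin keyword is predicate in recent geopandas versions (op in older ones):

from shapely.geometry import Point

# Hypothetical county boundaries file -- substitute a real source
counties = gpd.read_file('/data/us_counties.shp').to_crs('EPSG:4326')

# Build point geometries from the fire coordinates (lon, lat order)
fires = gpd.GeoDataFrame(
    raw_df[['LATITUDE', 'LONGITUDE']].copy(),
    geometry=[Point(lon, lat) for lon, lat
              in zip(raw_df['LONGITUDE'], raw_df['LATITUDE'])],
    crs='EPSG:4326')

# Left spatial join: each fire picks up the attributes of the county
# polygon that contains it (NaN where nothing matches)
fires_with_county = gpd.sjoin(fires, counties, how='left', predicate='within')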
This leaves us with the following interesting fields:
- STAT_CAUSE_CODE
- STAT_CAUSE_DESCR [for EDA]
- OWNER_CODE
- OWNER_DESCR
- DISCOVERY_DOY
- LATITUDE
- LONGITUDE
- STATE
Some things we can keep in our back pocket for future exploration:
- look harder at DISCOVERY_TIME
- look harder at DISCOVERY_DATE
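One note for when we revisit DISCOVERY_DATE: it's stored as a Julian day number (those ~2.45 million floats in the sample above), so it needs a conversion before it's human readable. Pandas can handle this directly; a minimal sketch:

# DISCOVERY_DATE is a Julian day number; convert it to datetimes
discovery_dates = pd.to_datetime(
    raw_df['DISCOVERY_DATE'], unit='D', origin='julian')
discovery_dates.head()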
Create Human Readable Mappings
Before we drop our human readable columns, let's create a set of mappings that we can use to associate the numerical categories back to their human readable descriptions. We'll use these later.
Map Cause Code to Cause Description:
stat_cause_mapping = raw_df \
.groupby(['STAT_CAUSE_DESCR', 'STAT_CAUSE_CODE']) \
.size()\
.to_frame()\
.reset_index()\
.drop(0, axis=1)\
.set_index('STAT_CAUSE_CODE')\
.sort_index()['STAT_CAUSE_DESCR']
stat_cause_mapping
STAT_CAUSE_CODE
1.0 Lightning
2.0 Equipment Use
3.0 Smoking
4.0 Campfire
5.0 Debris Burning
6.0 Railroad
7.0 Arson
8.0 Children
9.0 Miscellaneous
10.0 Fireworks
11.0 Powerline
12.0 Structure
13.0 Missing/Undefined
Name: STAT_CAUSE_DESCR, dtype: object
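Since the mapping is just a pandas Series indexed by code, lookups are straightforward; a couple of small usage sketches:

# Look up a single code (1.0 should map to 'Lightning')
stat_cause_mapping.loc[1.0]

# Or relabel a whole column of codes, e.g. for plotting
raw_df['STAT_CAUSE_CODE'].map(stat_cause_mapping)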
Map Owner Code to Owner Description:
owner_code_mapping = raw_df \
.groupby(['OWNER_DESCR', 'OWNER_CODE']) \
.size()\
.to_frame()\
.reset_index()\
.drop(0, axis=1)\
.set_index('OWNER_CODE')\
.sort_index()['OWNER_DESCR']
owner_code_mapping
OWNER_CODE
0.0 FOREIGN
1.0 BLM
2.0 BIA
3.0 NPS
4.0 FWS
5.0 USFS
6.0 OTHER FEDERAL
7.0 STATE
8.0 PRIVATE
9.0 TRIBAL
10.0 BOR
11.0 COUNTY
12.0 MUNICIPAL/LOCAL
13.0 STATE OR PRIVATE
14.0 MISSING/NOT SPECIFIED
15.0 UNDEFINED FEDERAL
Name: OWNER_DESCR, dtype: object
Strip Data
Let's create a new dataframe that contains only the fields that we're interested in. This will reduce memory usage and help keep things tidy.
df = raw_df.copy()[[
'STAT_CAUSE_CODE',
'STAT_CAUSE_DESCR',
'OWNER_CODE',
'OWNER_DESCR',
'DISCOVERY_DOY',
'LATITUDE',
'LONGITUDE',
'STATE'
]]
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1880465 entries, 0 to 1880464
Data columns (total 8 columns):
STAT_CAUSE_CODE float64
STAT_CAUSE_DESCR object
OWNER_CODE float64
OWNER_DESCR object
DISCOVERY_DOY int64
LATITUDE float64
LONGITUDE float64
STATE object
dtypes: float64(4), int64(1), object(3)
memory usage: 114.8+ MB
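As an aside: if we wanted to squeeze memory further, the remaining object columns are low-cardinality and good candidates for pandas' category dtype. A sketch, not applied in this post:

# Low-cardinality string columns compress well as categoricals
for col in ['STAT_CAUSE_DESCR', 'OWNER_DESCR', 'STATE']:
    df[col] = df[col].astype('category')
df.info()  # memory usage should drop substantially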
Split Data
We need to split the data into a train set and a test set. The train set will be used to build our model, and the test set will be used to evaluate the model.
We will use sklearn's model_selection.train_test_split to split our dataframe into two. Last, we will create a convenience list, _df, that lets us apply operations to the train and test sets together.
train_df, test_df = model_selection.train_test_split(df)
display(HTML('''
<p>
Number of Training Rows: {}<br />
Number of Test Rows: {}
</p>
'''.format(train_df.shape[0], test_df.shape[0])))
_df = [train_df, test_df]
Number of Training Rows: 1410348
Number of Test Rows: 470117
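One aside on the split: as written, it's random on every run, and the causes are imbalanced. If we want a reproducible split with the rarer causes represented proportionally in both sets, train_test_split accepts a couple of optional arguments; a sketch, not used above:

# Reproducible, stratified split (test_size=0.25 is also the default)
train_df, test_df = model_selection.train_test_split(
    df, test_size=0.25, random_state=42,
    stratify=df['STAT_CAUSE_CODE'])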
Conclusion (for now)
Now that we've loaded our data, given it a high level look, cleaned it up a bit, and split it into a train and test set, we're in a good position to begin some Exploratory Data Analysis. Keep an eye here for subsequent posts that will get into:
- Exploratory Data Analysis
- Data Engineering
- More EDA! (Including geo viz)
- Model Training and Evaluation
- Ensembling Models
I hope this was helpful and/or entertaining so far. Since I'm pretty new to this, I'm sure there are, or will be, errors -- if you see any, please drop me a line; I'd love to fix them right away!
Until next time,
Andrew