Tying up my time at Mapillary in mid-January, I decided that instead of diving directly into a job hunt, I would take a bit of time to hone existing skills and develop some new ones. Ever since extracting and visualizing sentiment data from New York Times comments as an undergraduate in design school, I've been attracted to data analysis as a means to gain and share a greater understanding of the world. With some time on my hands and a passion for writing software in Python, I decided to dive in head first. This is the first in a series of blog posts where I'll catalog my self-education, share some of what I build, and hopefully provide an opportunity for others with an interest to learn alongside me.
Self-Education
Thus far, my education has been self-driven on Kaggle -- there's a ton of knowledge there! While my experience diving into the domain has been pretty humbling (pretty much everyone around me knows so much more), it's also been enlightening and energizing -- the possibilities are boundless.
Thinking about next steps, I plan to:
- complete an end-to-end machine learning project and publish my progress on this blog and on Kaggle
- complete fast.ai Deep Learning Part 1
- build a deep learning computer
- compete in at least one Kaggle competition, hopefully joining a team to boost my learning
E2E: Predicting the Cause of Wildfires in the United States
Surveying the datasets available to work with on Kaggle, I was immediately attracted to a dataset describing 1.88 million US wildfires over 24 years (1992-2015), originally published by the US Forest Service. Between the geospatial component, the high topical relevance, and a personal interest in the subject, I decided that I'd focus my first end-to-end project on predicting the cause of a fire given the information available when the fire began. Given the limited information available in the dataset, it's highly likely that integrating additional data, such as historical weather or land use, will be essential to building a strong model.
This first blog post covers the initial steps of preparing an environment and loading the data to begin working with it.
Changes
Since the project is still in motion, I expect the content here to change -- stay tuned.
- 2017/1/30 - initial post
- 2017/2/9 - added some new imports used later in the notebook; addressed a warning that I had incorrectly suppressed
- 2017/2/13 - updated the configuration cell to add some style to our plots
Some notes on Notebooks/Jupyter/Kaggle
This post, and the work in it, was completed in a Jupyter notebook running in a Docker container. While my process has been roughly cataloged here, the container has been modified a bit throughout the project to update existing tools and add new ones. The Dockerfile and associated notebooks can be found on GitHub, and I'll write a blog post about the setup sooner or later. I also suspect that the code, verbatim, won't run on Kaggle -- there are some geo-related libraries that would likely need to be installed.
Prepare Environment
The first thing that we need to do is prepare our working environment by importing libraries and performing some configuration incantations.
Load Libraries
# Standard library
import functools
import itertools
import math
import os
import pprint
import sqlite3
import subprocess

# Third-party
from IPython.core.display import HTML
from geoalchemy2 import Geometry, WKTElement
import geopandas as gpd
import graphviz
import matplotlib as mpl
from matplotlib import pyplot as plt
import numpy as np
import palettable
import pandas as pd
import pandas.tools.plotting as pdplot
import pyproj
import rasterio
import rasterio.mask
import rasterio.plot
from rasterstats import zonal_stats
import seaborn as sns
import shapely
from shapely.ops import transform
import sklearn
from sklearn import model_selection
from sqlalchemy import create_engine
from sqlalchemy.types import String, JSON
from tqdm import tqdm, tqdm_notebook
Configure
%matplotlib inline
qual_colormap = palettable.matplotlib.Inferno_20
quant_colormap = palettable.matplotlib.Inferno_20_r
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 15)
plt.rcParams['agg.path.chunksize'] = 100000
plt.rcParams['font.family'] = "Rubik"
plt.rcParams["font.weight"] = "300"
plt.rcParams["font.size"] = "14"
plt.rcParams["axes.labelsize"] = "14"
plt.rcParams["ytick.labelsize"] = "12"
plt.rcParams["xtick.labelsize"] = "12"
sns.set_palette(qual_colormap.mpl_colors)
Load Data
Now that we have our libraries in place, we'll load our data and have a look at it. Since the data comes in SQLite format, we can use Pandas' read_sql_query function to build a dataframe. Since we're not yet acquainted with the contents of our data, we'll load all provided columns and have a look.
input_filename = '/data/188-million-us-wildfires/src/FPA_FOD_20170508.sqlite'
conn = sqlite3.connect(input_filename)
query = '''
SELECT
NWCG_REPORTING_AGENCY,
NWCG_REPORTING_UNIT_ID,
NWCG_REPORTING_UNIT_NAME,
FIRE_NAME,
COMPLEX_NAME,
FIRE_YEAR,
DISCOVERY_DATE,
DISCOVERY_DOY,
DISCOVERY_TIME,
STAT_CAUSE_CODE,
STAT_CAUSE_DESCR,
CONT_DATE,
CONT_DOY,
CONT_TIME,
FIRE_SIZE,
FIRE_SIZE_CLASS,
LATITUDE,
LONGITUDE,
OWNER_CODE,
OWNER_DESCR,
STATE,
COUNTY
FROM
Fires;
'''
raw_df = pd.read_sql_query(query, conn)
Review Raw Data
Now that our data is loaded, let's give it a very high level look and start to develop an understanding of what we're working with.
Info
Let's have a look at our column names and the type of data in each column.
raw_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1880465 entries, 0 to 1880464
Data columns (total 22 columns):
NWCG_REPORTING_AGENCY object
NWCG_REPORTING_UNIT_ID object
NWCG_REPORTING_UNIT_NAME object
FIRE_NAME object
COMPLEX_NAME object
FIRE_YEAR int64
DISCOVERY_DATE float64
DISCOVERY_DOY int64
DISCOVERY_TIME object
STAT_CAUSE_CODE float64
STAT_CAUSE_DESCR object
CONT_DATE float64
CONT_DOY float64
CONT_TIME object
FIRE_SIZE float64
FIRE_SIZE_CLASS object
LATITUDE float64
LONGITUDE float64
OWNER_CODE float64
OWNER_DESCR object
STATE object
COUNTY object
dtypes: float64(8), int64(2), object(12)
memory usage: 315.6+ MB
Missing Values
Let's see how many values are missing in each column.
raw_df.isna().sum()
NWCG_REPORTING_AGENCY 0
NWCG_REPORTING_UNIT_ID 0
NWCG_REPORTING_UNIT_NAME 0
FIRE_NAME 957189
COMPLEX_NAME 1875282
FIRE_YEAR 0
DISCOVERY_DATE 0
DISCOVERY_DOY 0
DISCOVERY_TIME 882638
STAT_CAUSE_CODE 0
STAT_CAUSE_DESCR 0
CONT_DATE 891531
CONT_DOY 891531
CONT_TIME 972173
FIRE_SIZE 0
FIRE_SIZE_CLASS 0
LATITUDE 0
LONGITUDE 0
OWNER_CODE 0
OWNER_DESCR 0
STATE 0
COUNTY 678148
dtype: int64
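Raw counts are hard to judge against 1.88 million rows, so here's a quick sketch expressing the same thing as a percentage of all rows:

# Share of rows missing per column, as a percentage
missing_pct = (raw_df.isna().mean() * 100).round(1)
missing_pct[missing_pct > 0].sort_values(ascending=False)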
Sample
Let's look at some sample data.
raw_df.sample(10)
 | NWCG_REPORTING_AGENCY | NWCG_REPORTING_UNIT_ID | NWCG_REPORTING_UNIT_NAME | FIRE_NAME | COMPLEX_NAME | FIRE_YEAR | DISCOVERY_DATE | DISCOVERY_DOY | DISCOVERY_TIME | STAT_CAUSE_CODE | ... | CONT_DOY | CONT_TIME | FIRE_SIZE | FIRE_SIZE_CLASS | LATITUDE | LONGITUDE | OWNER_CODE | OWNER_DESCR | STATE | COUNTY |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1036316 | ST/C&L | USNCNCS | North Carolina Forest Service | YOU GO TO THIS ONE | None | 2000 | 2451852.5 | 309 | None | 8.0 | ... | NaN | None | 0.20 | A | 34.816700 | -79.383300 | 14.0 | MISSING/NOT SPECIFIED | NC | None |
1144181 | ST/C&L | USCASLU | San Luis Obispo Unit | RANGE | None | 1993 | 2449220.5 | 233 | None | 2.0 | ... | NaN | None | 4.00 | B | 35.331111 | -120.710000 | 14.0 | MISSING/NOT SPECIFIED | CA | None |
704956 | ST/C&L | USMSMSS | Mississippi Forestry Commission | None | None | 1995 | 2449770.5 | 53 | 1349 | 5.0 | ... | 53.0 | 1521 | 6.00 | B | 34.911615 | -88.970655 | 14.0 | MISSING/NOT SPECIFIED | MS | Tippah |
507237 | ST/C&L | USMNMNS | Minnesota Department of Natural Resources | None | None | 2001 | 2452028.5 | 119 | None | 2.0 | ... | NaN | None | 4.00 | B | 46.123636 | -94.769860 | 14.0 | MISSING/NOT SPECIFIED | MN | Todd |
473711 | ST/C&L | USLALAS | Louisiana Office of Forestry | None | None | 2005 | 2453507.5 | 137 | None | 1.0 | ... | NaN | None | 4.00 | B | 31.504505 | -92.512665 | 14.0 | MISSING/NOT SPECIFIED | LA | Grant |
1573233 | NPS | USCODSP | Dinosaur National Monument | WAPITI | None | 2012 | 2456150.5 | 224 | 1501 | 1.0 | ... | 227.0 | 1415 | 104.00 | D | 40.399060 | -108.306250 | 8.0 | PRIVATE | CO | None |
1868546 | ST/C&L | USHICNTY | Hawaii Counties | None | None | 2012 | 2455927.5 | 1 | None | 13.0 | ... | NaN | None | 0.01 | A | 20.862314 | -156.476700 | 14.0 | MISSING/NOT SPECIFIED | HI | Mauii |
1195322 | ST/C&L | USNYNYX | Fire Department of New York | None | None | 2010 | 2455356.5 | 160 | None | 9.0 | ... | NaN | None | 0.10 | A | 40.896927 | -72.906787 | 14.0 | MISSING/NOT SPECIFIED | NY | SUFFOLK |
483475 | ST/C&L | USMEMES | Maine Forest Service | None | None | 2005 | 2453689.5 | 319 | None | 5.0 | ... | NaN | None | 0.01 | A | 43.407530 | -70.764500 | 14.0 | MISSING/NOT SPECIFIED | ME | York |
166478 | FS | USAZTNF | Tonto National Forest | OX | None | 2004 | 2453161.5 | 157 | 1628 | 9.0 | ... | 157.0 | 1715 | 0.20 | A | 34.180833 | -111.335556 | 5.0 | USFS | AZ | 7 |
10 rows × 22 columns
Describe
raw_df.describe(include='all')
 | NWCG_REPORTING_AGENCY | NWCG_REPORTING_UNIT_ID | NWCG_REPORTING_UNIT_NAME | FIRE_NAME | COMPLEX_NAME | FIRE_YEAR | DISCOVERY_DATE | DISCOVERY_DOY | DISCOVERY_TIME | STAT_CAUSE_CODE | ... | CONT_DOY | CONT_TIME | FIRE_SIZE | FIRE_SIZE_CLASS | LATITUDE | LONGITUDE | OWNER_CODE | OWNER_DESCR | STATE | COUNTY |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1880465 | 1880465 | 1880465 | 923276 | 5183 | 1.880465e+06 | 1.880465e+06 | 1.880465e+06 | 997827 | 1.880465e+06 | ... | 988934.000000 | 908292 | 1.880465e+06 | 1880465 | 1.880465e+06 | 1.880465e+06 | 1.880465e+06 | 1880465 | 1880465 | 1202317 |
unique | 11 | 1640 | 1635 | 493633 | 1416 | NaN | NaN | NaN | 1440 | NaN | ... | NaN | 1441 | NaN | 7 | NaN | NaN | NaN | 16 | 52 | 3455 |
top | ST/C&L | USGAGAS | Georgia Forestry Commission | GRASS FIRE | OSAGE-MIAMI COMPLEX | NaN | NaN | NaN | 1400 | NaN | ... | NaN | 1800 | NaN | B | NaN | NaN | NaN | MISSING/NOT SPECIFIED | CA | 5 |
freq | 1377090 | 167123 | 167123 | 3983 | 54 | NaN | NaN | NaN | 20981 | NaN | ... | NaN | 38078 | NaN | 939376 | NaN | NaN | NaN | 1050835 | 189550 | 7576 |
mean | NaN | NaN | NaN | NaN | NaN | 2.003710e+03 | 2.453064e+06 | 1.647191e+02 | NaN | 5.979037e+00 | ... | 172.656766 | NaN | 7.452016e+01 | NaN | 3.678121e+01 | -9.570494e+01 | 1.059658e+01 | NaN | NaN | NaN |
std | NaN | NaN | NaN | NaN | NaN | 6.663099e+00 | 2.434573e+03 | 9.003891e+01 | NaN | 3.483860e+00 | ... | 84.320348 | NaN | 2.497598e+03 | NaN | 6.139031e+00 | 1.671694e+01 | 4.404662e+00 | NaN | NaN | NaN |
min | NaN | NaN | NaN | NaN | NaN | 1.992000e+03 | 2.448622e+06 | 1.000000e+00 | NaN | 1.000000e+00 | ... | 1.000000 | NaN | 1.000000e-05 | NaN | 1.793972e+01 | -1.788026e+02 | 0.000000e+00 | NaN | NaN | NaN |
25% | NaN | NaN | NaN | NaN | NaN | 1.998000e+03 | 2.451084e+06 | 8.900000e+01 | NaN | 3.000000e+00 | ... | 102.000000 | NaN | 1.000000e-01 | NaN | 3.281860e+01 | -1.103635e+02 | 8.000000e+00 | NaN | NaN | NaN |
50% | NaN | NaN | NaN | NaN | NaN | 2.004000e+03 | 2.453178e+06 | 1.640000e+02 | NaN | 5.000000e+00 | ... | 181.000000 | NaN | 1.000000e+00 | NaN | 3.545250e+01 | -9.204304e+01 | 1.400000e+01 | NaN | NaN | NaN |
75% | NaN | NaN | NaN | NaN | NaN | 2.009000e+03 | 2.455036e+06 | 2.300000e+02 | NaN | 9.000000e+00 | ... | 232.000000 | NaN | 3.300000e+00 | NaN | 4.082720e+01 | -8.229760e+01 | 1.400000e+01 | NaN | NaN | NaN |
max | NaN | NaN | NaN | NaN | NaN | 2.015000e+03 | 2.457388e+06 | 3.660000e+02 | NaN | 1.300000e+01 | ... | 366.000000 | NaN | 6.069450e+05 | NaN | 7.033060e+01 | -6.525694e+01 | 1.500000e+01 | NaN | NaN | NaN |
11 rows × 22 columns
Observations
Considering that our investigation centers on predicting the cause of a wildfire from the information available at the time it was discovered, I've made the following initial observations about the data:
- STAT_CAUSE_CODE and STAT_CAUSE_DESCR are related and represent the value that we are trying to predict. Before training, we'll drop STAT_CAUSE_DESCR in favor of the numerical value of STAT_CAUSE_CODE.
- OWNER_CODE and OWNER_DESCR are related and describe the owner of the property where the fire was discovered. This is an interesting value because it reflects the management and usage of a particular piece of land, which should matter in our investigation. Before training, we'll drop OWNER_DESCR in favor of the numerical value of OWNER_CODE.
- DISCOVERY_DATE, DISCOVERY_DOY, and DISCOVERY_TIME describe the time that a fire was discovered. DISCOVERY_DOY is the most interesting to our investigation due to its relation to the climate and usage patterns of a particular piece of land. DISCOVERY_TIME may be interesting, but it might be too fine-grained, and it's missing many values -- let's drop it for now. DISCOVERY_DATE is too specific to be useful.
- LATITUDE and LONGITUDE are both very interesting due to their strong relationship to land cover, land use, and climate -- all big factors in wildfire creation.
- STATE and COUNTY both categorically describe the location of a fire. STATE might be interesting due to its relation to land use patterns, and it might also prove to be a useful generalization of the more specific LATITUDE and LONGITUDE.
- COUNTY, while potentially interesting, has too many missing values. If we want to more closely explore categorical location data, we can add it back via a geocoding step in the data engineering process (see the sketch just after this list).
- A number of columns contain information about how a fire was addressed rather than what caused it. Let's ignore the following columns for now: NWCG_REPORTING_AGENCY, NWCG_REPORTING_UNIT_ID, NWCG_REPORTING_UNIT_NAME, FIRE_NAME, COMPLEX_NAME, CONT_DATE, CONT_DOY, CONT_TIME, FIRE_SIZE, FIRE_SIZE_CLASS
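For a sense of what that geocoding step might look like: a rough geopandas sketch that spatially joins fire points to county polygons. The counties file path here is hypothetical (the US Census TIGER/Line county shapefile would be one real option), and note that the sjoin keyword is predicate in recent geopandas versions (op in older ones):

from shapely.geometry import Point

# Hypothetical county boundaries file -- substitute a real source
counties = gpd.read_file('/data/us_counties.shp').to_crs('EPSG:4326')

# Build point geometries from the fire coordinates (lon, lat order)
fires = gpd.GeoDataFrame(
    raw_df[['LATITUDE', 'LONGITUDE']].copy(),
    geometry=[Point(lon, lat) for lon, lat
              in zip(raw_df['LONGITUDE'], raw_df['LATITUDE'])],
    crs='EPSG:4326')

# Left spatial join: each fire picks up the attributes of the county
# polygon that contains it (NaN where nothing matches)
fires_with_county = gpd.sjoin(fires, counties, how='left', predicate='within')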
This leaves us with the following interesting fields:
- STAT_CAUSE_CODE
- STAT_CAUSE_DESCR [for EDA]
- OWNER_CODE
- OWNER_DESCR
- DISCOVERY_DOY
- LATITUDE
- LONGITUDE
- STATE
Some things we can keep in our back pocket for future exploration:
- look harder at DISCOVERY_TIME
- look harder at DISCOVERY_DATE
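One note for when we revisit DISCOVERY_DATE: it's stored as a Julian day number (those ~2.45 million floats in the sample above), so it needs a conversion before it's human readable. Pandas can handle this directly; a minimal sketch:

# DISCOVERY_DATE is a Julian day number; convert it to datetimes
discovery_dates = pd.to_datetime(
    raw_df['DISCOVERY_DATE'], unit='D', origin='julian')
discovery_dates.head()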
Create Human Readable Mappings
Before we drop our human readable columns, let's create a set of mappings that we can use to associate the numerical categories back to their human readable descriptions. We'll use these later.
Map Cause Code to Cause Description:
stat_cause_mapping = raw_df \
.groupby(['STAT_CAUSE_DESCR', 'STAT_CAUSE_CODE']) \
.size()\
.to_frame()\
.reset_index()\
.drop(0, axis=1)\
.set_index('STAT_CAUSE_CODE')\
.sort_index()['STAT_CAUSE_DESCR']
stat_cause_mapping
STAT_CAUSE_CODE
1.0 Lightning
2.0 Equipment Use
3.0 Smoking
4.0 Campfire
5.0 Debris Burning
6.0 Railroad
7.0 Arson
8.0 Children
9.0 Miscellaneous
10.0 Fireworks
11.0 Powerline
12.0 Structure
13.0 Missing/Undefined
Name: STAT_CAUSE_DESCR, dtype: object
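Since the mapping is just a pandas Series indexed by code, lookups are straightforward; a couple of small usage sketches:

# Look up a single code (1.0 should map to 'Lightning')
stat_cause_mapping.loc[1.0]

# Or relabel a whole column of codes, e.g. for plotting
raw_df['STAT_CAUSE_CODE'].map(stat_cause_mapping)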
Map Owner Code to Owner Description:
owner_code_mapping = raw_df \
.groupby(['OWNER_DESCR', 'OWNER_CODE']) \
.size()\
.to_frame()\
.reset_index()\
.drop(0, axis=1)\
.set_index('OWNER_CODE')\
.sort_index()['OWNER_DESCR']
owner_code_mapping
OWNER_CODE
0.0 FOREIGN
1.0 BLM
2.0 BIA
3.0 NPS
4.0 FWS
5.0 USFS
6.0 OTHER FEDERAL
7.0 STATE
8.0 PRIVATE
9.0 TRIBAL
10.0 BOR
11.0 COUNTY
12.0 MUNICIPAL/LOCAL
13.0 STATE OR PRIVATE
14.0 MISSING/NOT SPECIFIED
15.0 UNDEFINED FEDERAL
Name: OWNER_DESCR, dtype: object
Strip Data
Let's create a new dataframe that contains only the fields that we're interested in. This will reduce memory usage and help keep things tidy.
df = raw_df.copy()[[
'STAT_CAUSE_CODE',
'STAT_CAUSE_DESCR',
'OWNER_CODE',
'OWNER_DESCR',
'DISCOVERY_DOY',
'LATITUDE',
'LONGITUDE',
'STATE'
]]
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1880465 entries, 0 to 1880464
Data columns (total 8 columns):
STAT_CAUSE_CODE float64
STAT_CAUSE_DESCR object
OWNER_CODE float64
OWNER_DESCR object
DISCOVERY_DOY int64
LATITUDE float64
LONGITUDE float64
STATE object
dtypes: float64(4), int64(1), object(3)
memory usage: 114.8+ MB
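As an aside: if we wanted to squeeze memory further, the remaining object columns are low-cardinality and good candidates for pandas' category dtype. A sketch, not applied in this post:

# Low-cardinality string columns compress well as categoricals
for col in ['STAT_CAUSE_DESCR', 'OWNER_DESCR', 'STATE']:
    df[col] = df[col].astype('category')
df.info()  # memory usage should drop substantially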
Split Data
We need to split the data into a train set and a test set. The train set will be used to build our model, and the test set will be used to evaluate the model.
We will use sklearn's model_selection.train_test_split to split our dataframe into two. Last, we will create a convenience list, _df, that lets us apply operations to the train and test sets together.
train_df, test_df = model_selection.train_test_split(df)
display(HTML('''
<p>
Number of Training Rows: {}<br />
Number of Test Rows: {}
</p>
'''.format(train_df.shape[0], test_df.shape[0])))
_df = [train_df, test_df]
Number of Training Rows: 1410348
Number of Test Rows: 470117
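One aside on the split: as written, it's random on every run, and the causes are imbalanced. If we want a reproducible split with the rarer causes represented proportionally in both sets, train_test_split accepts a couple of optional arguments; a sketch, not used above:

# Reproducible, stratified split (test_size=0.25 is also the default)
train_df, test_df = model_selection.train_test_split(
    df, test_size=0.25, random_state=42,
    stratify=df['STAT_CAUSE_CODE'])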
Conclusion (for now)
Now that we've loaded our data, given it a high level look, cleaned it up a bit, and split it into a train and test set, we're in a good position to begin some Exploratory Data Analysis. Keep an eye here for subsequent posts that will get into:
- Exploratory Data Analysis
- Data Engineering
- More EDA! (Including geo viz)
- Model Training and Evaluation
- Ensembling Models
I hope this was helpful and/or entertaining so far. Since I'm pretty new to this, I'm sure there are, or will be, errors -- if you see any, please drop me a line; I'd love to fix them right away!
Until next time,
Andrew