Now that we've prepared our environment, loaded the base data, and begun to understand it, let's turn our focus to engineering some new features.
Engineer Data
We'll start by engineering new features derived from our base data, and then fold in a couple of external data sources. In particular, let's build some features that describe geographic properties of the areas where each fire began. Two factors integral to the occurrence of a wildfire, but only loosely captured in our base data, are land cover and human behavior.
Land Cover plays a role in wildfire creation for a number of reasons. Vegetation can be more or less flammable depending on the time of year - coniferous forests can become especially combustible during a hot, dry summer. Certain land cover classes also lend themselves to different activities - forestry won't be prevalent in grasslands. To capture this in our model, we'll engineer a new feature that describes the land cover in the area surrounding each fire. We'll base the feature on the Multi-Resolution Land Characteristics (MRLC) land cover data, and use a binning strategy to aggregate our base points geographically.
Human behavior also plays a large role in the occurrence of wildfires. As we saw during EDA, over 40% of wildfires occur as a result of Debris Burning or Arson -- both human behaviors. While a rough measure at best, the presence of humans and dwellings may correlate with the cause of wildfires - particularly those of human origin. We can use the US Census Grid, employing a similar binning strategy, to engineer a couple of features describing the human presence in the areas around wildfires.
Both of the above engineering tasks require binning. Originally, I had intended to use a geohash binning technique. At that point in the project, we were still looking at data from Alaska, and unfortunately at those latitudes geohash bins become quite distorted. To mitigate distortion in our bins, I switched to Uber's H3 binning system.
H3 provides tools to generate an H3 bin from a set of coordinates, and a boundary shape for an H3 bin. In conjunction, these functions allow us to assign a bin to each wildfire, and then use that bin's geography to extract geodata from the raster datasets.
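For what it's worth, official Python bindings were later released as h3-py; a minimal sketch of the two operations we need, assuming the v3 API (not available when this work was done):

import h3
# Hypothetical usage of the h3-py v3 bindings:
cell = h3.geo_to_h3(33.2295, -82.7871, 5)   # returns a 15-character hex index
boundary = h3.h3_to_geo_boundary(cell)      # tuple of (lat, lng) vertex pairs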
In any case, let's dive in.
Quantify Data
A couple of fields in our reduced set are still non-numeric. In particular, the STATE field is categorical text - let's convert it to a numeric label using sklearn.preprocessing.LabelEncoder.
label_encoder = sklearn.preprocessing.LabelEncoder()
for key, dataframe in fires_df.items():
    fires_df[key].loc[:, 'STATE_CODE'] = label_encoder.fit_transform(dataframe['STATE'])
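One caveat worth hedging here: fitting a fresh encoder on each dataframe can assign different codes to train and test if their category sets differ. A sketch that fits once across both frames instead:

# Sketch: fit once on the union of both frames so codes stay consistent
label_encoder = sklearn.preprocessing.LabelEncoder()
label_encoder.fit(pd.concat([df['STATE'] for df in fires_df.values()]))
for key, dataframe in fires_df.items():
    fires_df[key].loc[:, 'STATE_CODE'] = label_encoder.transform(dataframe['STATE'])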
STAT_CAUSE_CODE is not zero-indexed - let's use our label encoder to add a new, zero-based set of labels.
label_encoder = sklearn.preprocessing.LabelEncoder()
for key, dataframe in fires_df.items():
    fires_df[key].loc[:, 'STAT_CAUSE_CODE_ZERO'] = label_encoder.fit_transform(dataframe['STAT_CAUSE_CODE'])
Missing Data
None of the fields that we're concerned with have missing data. We'll skip this for now.
Discovery Week of Year
Let's create a feature that reduces the resolution of DISCOVERY_DOY from day of year to week of year.
for key, dataframe in fires_df.items():
    dataframe['DISCOVERY_WEEK'] = dataframe['DISCOVERY_DOY']\
        .apply(lambda x: math.floor(x/7) + 1)
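A quick sanity check of the formula at a few boundaries:

# DOY 1 maps to week 1, DOY 200 to week 29, DOY 365 to week 53
assert math.floor(1/7) + 1 == 1
assert math.floor(200/7) + 1 == 29
assert math.floor(365/7) + 1 == 53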
Climate Regions
Let's create a feature that abstracts the STATE
column already in our data. We'll use NOAA's U.S. Climate Regions to group states into larger units that share certain environmental characteristics.
climate_regions = {
"northwest": ["WA","OR","ID"],
"west": ["CA","NV"],
"southwest": ["UT","CO","AZ","NM"],
"northern_rockies": ["MT","ND","SD","WY","NE"],
"upper_midwest": ["KS","OK","TX","AR","LA","MS"],
"south": ["MN","WI","MI","IA"],
"ohio_valley": ["MO","IL","IN","OH","WV","KY","TN"],
"southeast": ["VA","NC","SC","GA","AL","FL"],
"northeast": ["ME","NH","VT","NY","PA","MA","RI","CT","NJ","DE","MD","DC"],
"alaska": ["AK",],
"hawaii": ["HI"],
"puerto_rico": ["PR"]
}
state_region_mapping = {}
for region, region_states in climate_regions.items():
    for state in region_states:
        state_region_mapping[state] = region
for key, dataframe in fires_df.items():
    dataframe['CLIMATE_REGION'] = dataframe['STATE']\
        .apply(lambda x: state_region_mapping[x])
label_encoder = sklearn.preprocessing.LabelEncoder()
for key, dataframe in fires_df.items():
    dataframe['CLIMATE_REGION_CODE'] = label_encoder.fit_transform(dataframe['CLIMATE_REGION'])
H3 Bins
In preparation for extracting data from geo rasters and performing other geoanalysis, let's bin our wildfires using uber/h3. Binning helps mitigate the variable resolution of the source data, reduces the computational complexity of feature engineering, and hopefully helps avoid overfitting.
Since there were no Python bindings for H3 at the time - although they are forthcoming - I took the approach of calling the H3 binaries using Python's subprocess module. This works fairly well, especially when calls are batched.
Calculate Bins for fires_df
Start by calculating a bin for each wildfire using geoToH3. We batch the calls to improve performance, and use resolution 5.
H3 Resolution | Average Hexagon Area (km2) | Average Hexagon Edge Length (km) | Number of unique indexes |
---|---|---|---|
0 | 4,250,546.8477000 | 1,107.712591000 | 122 |
1 | 607,220.9782429 | 418.676005500 | 842 |
2 | 86,745.8540347 | 158.244655800 | 5,882 |
3 | 12,392.2648621 | 59.810857940 | 41,162 |
4 | 1,770.3235517 | 22.606379400 | 288,122 |
5 | 252.9033645 | 8.544408276 | 2,016,842 |
6 | 36.1290521 | 3.229482772 | 14,117,882 |
7 | 5.1612932 | 1.220629759 | 98,825,162 |
8 | 0.7373276 | 0.461354684 | 691,776,122 |
9 | 0.1053325 | 0.174375668 | 4,842,432,842 |
10 | 0.0150475 | 0.065907807 | 33,897,029,882 |
11 | 0.0021496 | 0.024910561 | 237,279,209,162 |
12 | 0.0003071 | 0.009415526 | 1,660,954,464,122 |
13 | 0.0000439 | 0.003559893 | 11,626,681,248,842 |
14 | 0.0000063 | 0.001348575 | 81,386,768,741,882 |
15 | 0.0000009 | 0.000509713 | 569,707,381,193,162 |
def h3_for_chunk(chunk, precision):
    lat_lon_lines = "\n".join([
        "{} {}".format(row['LATITUDE'], row['LONGITUDE']) for index, row in chunk.iterrows()
    ])
    h3 = subprocess.run(
        ['/tools/h3/bin/geoToH3', str(precision)],
        input=str.encode(lat_lon_lines),
        stdout=subprocess.PIPE).stdout.decode('utf-8').splitlines()
    return pd.DataFrame({
        "H3": h3
    }, index=chunk.index)

def chunker(seq, size):
    return (seq.iloc[pos:pos + size] for pos in range(0, len(seq), size))
for key in fires_df:
    print('Calculating bins for fires_df["{}"]'.format(key))
    h3_df = pd.DataFrame()
    with tqdm_notebook(total=fires_df[key].shape[0]) as pbar:
        for chunk in chunker(fires_df[key], 10000):
            h3_df = h3_df.append(h3_for_chunk(chunk, 5))
            pbar.update(chunk.shape[0])
    fires_df[key].drop('H3', 1, inplace=True, errors='ignore')
    fires_df[key] = fires_df[key].merge(
        h3_df,
        how='left',
        left_index=True,
        right_index=True,
        copy=False
    )
    display(fires_df[key].sample(5))
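One forward-looking note: DataFrame.append was deprecated in later pandas releases. A sketch of the same accumulation using pd.concat (assuming the helpers above; it drops the progress bar but is also faster, since it concatenates once):

# Build all chunk results first, then concatenate once
chunk_frames = [h3_for_chunk(chunk, 5) for chunk in chunker(fires_df[key], 10000)]
h3_df = pd.concat(chunk_frames)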
Calculating bins for fires_df["train"]
STAT_CAUSE_CODE | STAT_CAUSE_DESCR | OWNER_CODE | OWNER_DESCR | DISCOVERY_DATE | DISCOVERY_DOY | DISCOVERY_TIME | LATITUDE | LONGITUDE | STATE | STATE_CODE | STAT_CAUSE_CODE_ZERO | DISCOVERY_WEEK | CLIMATE_REGION | CLIMATE_REGION_CODE | H3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1254933 | 5.0 | Debris Burning | 8.0 | PRIVATE | 2452145.5 | 236 | 1405 | 33.229500 | -82.787100 | GA | 9 | 4 | 34 | southeast | 5 | 8544c06bfffffff |
1511283 | 7.0 | Arson | 14.0 | MISSING/NOT SPECIFIED | 2455642.5 | 81 | 1608 | 31.047471 | -85.837722 | AL | 0 | 6 | 12 | southeast | 5 | 8544e18bfffffff |
601688 | 5.0 | Debris Burning | 14.0 | MISSING/NOT SPECIFIED | 2452544.5 | 270 | None | 31.677220 | -94.518610 | TX | 41 | 4 | 39 | south | 4 | 85446bc7fffffff
1302225 | 7.0 | Arson | 14.0 | MISSING/NOT SPECIFIED | 2454212.5 | 112 | 1116 | 42.788903 | -75.964293 | NY | 32 | 6 | 17 | northeast | 0 | 852b8b43fffffff |
615075 | 5.0 | Debris Burning | 14.0 | MISSING/NOT SPECIFIED | 2453447.5 | 77 | None | 33.083330 | -94.300550 | TX | 41 | 4 | 12 | south | 4 | 85444d0bfffffff
Calculating bins for fires_df["test"]
STAT_CAUSE_CODE | STAT_CAUSE_DESCR | OWNER_CODE | OWNER_DESCR | DISCOVERY_DATE | DISCOVERY_DOY | DISCOVERY_TIME | LATITUDE | LONGITUDE | STATE | STATE_CODE | STAT_CAUSE_CODE_ZERO | DISCOVERY_WEEK | CLIMATE_REGION | CLIMATE_REGION_CODE | H3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1017024 | 5.0 | Debris Burning | 14.0 | MISSING/NOT SPECIFIED | 2449820.5 | 103 | None | 35.708300 | -81.561700 | NC | 25 | 4 | 15 | southeast | 5 | 8544db97fffffff |
1145718 | 5.0 | Debris Burning | 8.0 | PRIVATE | 2449504.5 | 152 | None | 42.532810 | -123.148780 | OR | 35 | 4 | 22 | northwest | 2 | 852818dbfffffff |
85133 | 4.0 | Campfire | 5.0 | USFS | 2450280.5 | 198 | 1300 | 39.121667 | -120.436667 | CA | 3 | 3 | 29 | west | 8 | 8528324ffffffff |
1617233 | 7.0 | Arson | 8.0 | PRIVATE | 2455983.5 | 57 | 1315 | 35.597400 | -82.625850 | NC | 25 | 6 | 9 | southeast | 5 | 8544d983fffffff |
945685 | 8.0 | Children | 14.0 | MISSING/NOT SPECIFIED | 2450529.5 | 81 | None | 37.483300 | -77.616700 | VA | 43 | 7 | 12 | southeast | 5 | 852a8dd7fffffff |
Isolate Bins
Identify the unique bins in the wildfires dataframe and create a new GeoDataFrame, bins, containing only the bins. GeoDataFrame comes from the GeoPandas project, which adds a layer of geographic processing on top of Pandas. For our use, we're most concerned with the ability to store and plot geographic shapes. Under the covers, GeoPandas uses the Shapely library for geometric analysis.
bins = gpd.GeoDataFrame({
    'H3': np.unique(np.concatenate((fires_df["train"].H3.unique(), fires_df["test"].H3.unique())))
})
print(bins.info())
bins.sample(5)
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 25999 entries, 0 to 25998
Data columns (total 1 columns):
H3 25999 non-null object
dtypes: object(1)
memory usage: 203.2+ KB
None
H3 | |
---|---|
7418 | 8526eabbfffffff |
20399 | 85445253fffffff |
19060 | 852b1e3bfffffff |
9713 | 85279033fffffff |
3543 | 85266c37fffffff |
Calculate Bounds
Calculate bounds for each bin in bins. This should be batched as well - it takes quite a while to run.
def h3_bounds_for_bin(h3_bin):
    h3_bounds = subprocess.run(
        ['/tools/h3/bin/h3ToGeoBoundary'],
        input=str.encode(h3_bin),
        stdout=subprocess.PIPE).stdout.decode('utf-8').splitlines()
    coords = []
    for coord in h3_bounds[2:-1]:
        lon_lat = coord.lstrip().split(" ")
        coords.append(((float(lon_lat[1]) - 360.), float(lon_lat[0])))
    return shapely.geometry.Polygon(coords)
bins['GEOMETRY'] = bins['H3'].apply(h3_bounds_for_bin)
print(bins.info())
bins.sample(10)
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 27099 entries, 0 to 27098
Data columns (total 2 columns):
H3 27099 non-null object
GEOMETRY 27099 non-null object
dtypes: object(2)
memory usage: 423.5+ KB
None
H3 | GEOMETRY | |
---|---|---|
11178 | 85280243fffffff | POLYGON ((-123.003254979 40.831762, -122.94977... |
18725 | 852aa0affffffff | POLYGON ((-77.10954865399998 41.466234529, -77... |
10585 | 852793c3fffffff | POLYGON ((-110.871721099 46.713100841, -110.80... |
1670 | 8526242bfffffff | POLYGON ((-95.19630366400003 46.585965214, -95... |
19727 | 852af2affffffff | POLYGON ((-76.58726582000003 36.178472701, -76... |
19912 | 852b1857fffffff | POLYGON ((-67.882502542 44.775920586, -68.0000... |
24820 | 85488b83fffffff | POLYGON ((-101.943425426 31.173151717, -101.87... |
4800 | 85268eb3fffffff | POLYGON ((-105.272718304 38.958513793, -105.20... |
3552 | 85265ed3fffffff | POLYGON ((-92.729148119 37.080579662, -92.6569... |
26521 | 8548da13fffffff | POLYGON ((-104.892143771 35.715818432, -104.82... |
Now we have a geodataframe, bins
, that contains all of the bins whose geography contains at least one wildfire.
Land Cover Classification
As discussed above, land cover has a direct effect on the occurrence of wildfires. In order to incorporate land cover into our model, we'll engineer a feature based on the DOI & USGS National Land Cover Database 2011, published by the Multi-Resolution Land Characteristics (MRLC) Consortium. The data comes in a raster format with a 30 meter resolution, and each pixel takes one of 20 land cover classes. A Land Use And Land Cover Classification System For Use With Remote Sensor Data provides some interesting context for land cover classification.
We'll use Mapbox's rasterio and the rasterstats library to extract land cover data for each of our bins. That data will come as a dict containing a count of pixels for each land cover class; we'll transform it into a format more useful for our model.
Using land cover data from 2011 alone threatens the value of this feature. While annual land cover data is not available, there is data from 2001 and 2006. An improvement would be to select data from whichever of the three sets - 2001, 2006, or 2011 - best matches each fire's discovery date.
We'll start by making a copy of our bins
dataframe so that we can experiment without trashing upstream work.
bins_land_cover = bins.copy()
Let's define the path to our data file.
land_cover_path = "/data/188-million-us-wildfires/src/nlcd_2011_landcover_2011_edition_2014_10_10/nlcd_2011_landcover_2011_edition_2014_10_10.img"
Define Coordinate Transformation
Before we use rasterstats to count the values in each bin, we use pyproj to reproject our bins' geometry into the CRS of our data source.
project = functools.partial(
    pyproj.transform,
    pyproj.Proj(init='EPSG:4326'),
    pyproj.Proj('+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs'))
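As an aside, newer pyproj versions (2.1+) prefer the Transformer API over pyproj.transform; a rough equivalent sketch, assuming such a version is installed:

# Sketch with the newer pyproj Transformer API
aea = '+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs'
transformer = pyproj.Transformer.from_crs('EPSG:4326', aea, always_xy=True)
project = transformer.transform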
Raster Samples
Before we extract information from our source raster, let's preview the contents to sanity check.
land_cover_data = rasterio.open(land_cover_path)
def render_masked_bin(row):
    return rasterio.mask.mask(
        land_cover_data,
        [shapely.geometry.mapping(transform(project, row['GEOMETRY']))],
        crop=True
    )
masked_bin_images = bins_land_cover.sample(6).apply(render_masked_bin, 1).tolist()
fig, plots = plt.subplots(3, 2)
for index, val in enumerate(masked_bin_images):
    my_plot = plots[math.floor(index/2)][index % 2]
    rasterio.plot.show(val[0], ax=my_plot, transform=val[1], cmap='gist_earth')
    my_plot.axis('on')
Calculate Land Cover
Now that we've confirmed that the data we're masking from our source raster looks reasonable, let's get to extracting it. We'll employ zonal_stats with categorical=True - this returns a dict of counts for each class value found in the masked region.
land_cover_path = "/data/188-million-us-wildfires/src/nlcd_2011_landcover_2011_edition_2014_10_10/nlcd_2011_landcover_2011_edition_2014_10_10.img"
def land_cover_for_geometry(geometry):
    stats = zonal_stats(
        [shapely.geometry.mapping(transform(project, geometry))],
        land_cover_path,
        nodata=-999,
        categorical=True
    )
    return stats[0]
bins_land_cover = bins.copy()
bins_land_cover["LAND_COVER"] = bins_land_cover["GEOMETRY"].apply(land_cover_for_geometry)
print(bins_land_cover.info())
bins_land_cover.sample(5)
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 27099 entries, 0 to 27098
Data columns (total 3 columns):
H3 27099 non-null object
LAND_COVER 27099 non-null object
GEOMETRY 27099 non-null object
dtypes: object(3)
memory usage: 635.2+ KB
None
H3 | LAND_COVER | GEOMETRY | |
---|---|---|---|
16644 | 8529ac17fffffff | {'21': 8414, '22': 188, '23': 7, '31': 3385, '... | POLYGON ((-119.761306269 34.848452511, -119.70... |
6266 | 8526b46bfffffff | {'11': 2814, '12': 448, '31': 1957, '41': 6709... | POLYGON ((-109.733245976 43.359435343, -109.66... |
25823 | 8548c8a3fffffff | {'11': 62, '21': 946, '22': 1538, '23': 134, '... | POLYGON ((-110.416144017 34.938354721, -110.35... |
24302 | 8544f17bfffffff | {'11': 977, '21': 24201, '22': 18629, '23': 42... | POLYGON ((-82.27732916000002 31.188086238, -82... |
8556 | 85270b1bfffffff | {'11': 30430, '21': 6245, '22': 3832, '23': 11... | POLYGON ((-93.92304508000001 47.431408525, -94... |
Calculate Primary Class
Let's create a feature describing the primary land cover level II class for each bin.
def primary_land_cover(land_cover):
    return max(land_cover, key=land_cover.get)
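# For example (hypothetical counts): max() with key=land_cover.get picks the class
# with the largest pixel count, so primary_land_cover({'41': 120, '42': 300}) == '42'.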
bins_land_cover["PRIMARY_LAND_COVER"] = bins_land_cover["LAND_COVER"].apply(primary_land_cover)
print(bins_land_cover.info())
bins_land_cover.sample(5)
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 27099 entries, 0 to 27098
Data columns (total 4 columns):
H3 27099 non-null object
LAND_COVER 27099 non-null object
GEOMETRY 27099 non-null object
PRIMARY_LAND_COVER 27099 non-null object
dtypes: object(4)
memory usage: 846.9+ KB
None
H3 | LAND_COVER | GEOMETRY | PRIMARY_LAND_COVER | |
---|---|---|---|---|
8427 | 85270837fffffff | {'11': 21425, '21': 4510, '22': 857, '23': 235... | POLYGON ((-92.87331802199998 48.073667863, -92... | 90 |
18892 | 852aa3c3fffffff | {'11': 3339, '21': 21052, '22': 7207, '23': 22... | POLYGON ((-78.347508672 41.131021516, -78.4469... | 41 |
24658 | 854883d7fffffff | {'11': 177, '21': 8740, '22': 1837, '23': 61, ... | POLYGON ((-99.61025040200002 28.581772583, -99... | 52 |
26656 | 8548dcb7fffffff | {'21': 1208, '22': 612, '23': 106, '31': 677, ... | POLYGON ((-106.971115621 34.134266323, -106.90... | 52 |
8010 | 8526e9c3fffffff | {'11': 392, '21': 10943, '22': 2740, '23': 645... | POLYGON ((-94.26269021799999 35.502182409, -94... | 41 |
Categorize Primary Class
Let's group each bin's primary land cover class into its Level I class.
land_cover_categories = {
    "water": [11, 12],
    "developed": [21, 22, 23, 24],
    "barren": [31],
    "forest": [41, 42, 43],
    "shrubland": [51, 52],
    "herbaceous": [71, 72, 73, 74],
    "planted_cultivated": [81, 82],
    "wetlands": [90, 95]
}
land_cover_categories_map = {}
for category, values in land_cover_categories.items():
    for value in values:
        land_cover_categories_map[value] = category

def land_cover_category_for_class(land_cover_class):
    return land_cover_categories_map.get(int(land_cover_class), 'unknown')
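# e.g. land_cover_category_for_class('42') == 'forest'; any unmapped code falls back to 'unknown'.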
bins_land_cover["PRIMARY_LAND_COVER_CATEGORY"] = bins_land_cover["PRIMARY_LAND_COVER"].apply(land_cover_category_for_class)
print(bins_land_cover.info())
bins_land_cover.sample(5)
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 27099 entries, 0 to 27098
Data columns (total 5 columns):
H3 27099 non-null object
LAND_COVER 27099 non-null object
GEOMETRY 27099 non-null object
PRIMARY_LAND_COVER 27099 non-null object
PRIMARY_LAND_COVER_CATEGORY 27099 non-null object
dtypes: object(5)
memory usage: 1.0+ MB
None
H3 | LAND_COVER | GEOMETRY | PRIMARY_LAND_COVER | PRIMARY_LAND_COVER_CATEGORY | |
---|---|---|---|---|---|
9264 | 85275093fffffff | {'11': 14071, '21': 10506, '22': 254, '23': 19... | POLYGON ((-90.94032905199998 46.288750521, -91... | 90 | wetlands |
21200 | 85444d57fffffff | {'11': 1358, '21': 11153, '22': 4413, '23': 17... | POLYGON ((-94.09936341299999 32.872628429, -94... | 42 | forest |
2062 | 85262e4bfffffff | {'11': 4610, '21': 13418, '22': 1594, '23': 44... | POLYGON ((-95.42043053700002 44.125116576, -95... | 82 | planted_cultivated |
1727 | 8526250ffffffff | {'11': 693, '21': 9651, '22': 2978, '23': 942,... | POLYGON ((-94.09905895100002 46.105765621, -94... | 81 | planted_cultivated |
12075 | 85283457fffffff | {'11': 2115, '21': 14075, '22': 2299, '23': 57... | POLYGON ((-121.731573885 37.047433286, -121.67... | 43 | forest |
Encode
Employ LabelEncoder to assign a numeric value to each category.
label_encoder = sklearn.preprocessing.LabelEncoder()
bins_land_cover.loc[:, 'PRIMARY_LAND_COVER_CODE'] = label_encoder.fit_transform(bins_land_cover['PRIMARY_LAND_COVER'])
label_encoder = sklearn.preprocessing.LabelEncoder()
bins_land_cover.loc[:, 'PRIMARY_LAND_COVER_CATEGORY_CODE'] = label_encoder.fit_transform(bins_land_cover['PRIMARY_LAND_COVER_CATEGORY'])
print(bins_land_cover.info())
bins_land_cover.sample(5)
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 27099 entries, 0 to 27098
Data columns (total 7 columns):
H3 27099 non-null object
LAND_COVER 27099 non-null object
GEOMETRY 27099 non-null object
PRIMARY_LAND_COVER 27099 non-null object
PRIMARY_LAND_COVER_CATEGORY 27099 non-null object
PRIMARY_LAND_COVER_CODE 27099 non-null int64
PRIMARY_LAND_COVER_CATEGORY_CODE 27099 non-null int64
dtypes: int64(2), object(5)
memory usage: 1.4+ MB
None
H3 | LAND_COVER | GEOMETRY | PRIMARY_LAND_COVER | PRIMARY_LAND_COVER_CATEGORY | PRIMARY_LAND_COVER_CODE | PRIMARY_LAND_COVER_CATEGORY_CODE | |
---|---|---|---|---|---|---|---|
12157 | 8528806bfffffff | {'11': 250, '31': 5, '41': 32, '42': 178500, '... | POLYGON ((-115.591008981 44.7829381, -115.5268... | 42 | forest | 8 | 2 |
22925 | 8544cb47fffffff | {'11': 448, '21': 13178, '22': 5027, '23': 156... | POLYGON ((-84.00949905599998 36.478531766, -83... | 41 | forest | 7 | 2 |
13345 | 85289dc3fffffff | {'11': 1694, '21': 3039, '22': 95, '23': 13, '... | POLYGON ((-115.582331357 46.102164083, -115.51... | 42 | forest | 8 | 2 |
12794 | 85289163fffffff | {'11': 981, '21': 3125, '22': 225, '23': 21, '... | POLYGON ((-113.068429643 45.777292588, -113.00... | 42 | forest | 8 | 2 |
13156 | 85289a03fffffff | {'11': 296, '31': 10575, '42': 187046, '43': 4... | POLYGON ((-113.206564175 47.439544808, -113.13... | 42 | forest | 8 | 2 |
Land Cover Percentage
After all the above categorizing, let's make some continuous values. We'll create a new column for each land cover class in each bin that describes how much of that bin is covered by that class. Since we'll have some null values in the resulting dataset, let's make sure to set those to 0.
drop_cols = [c for c in bins_land_cover.columns if c[:15] == "LAND_COVER_PCT_"]
bins_land_cover = bins_land_cover.drop(drop_cols, 1)
def land_cover_percentages(row):
    total = 0
    for lc_class, val in row['LAND_COVER'].items():
        total = total + int(val)
    for lc_class, val in row['LAND_COVER'].items():
        row['LAND_COVER_PCT_{}'.format(lc_class)] = val / total
    return row
bins_land_cover = bins_land_cover.apply(land_cover_percentages, 1)
pct_cols = [c for c in bins_land_cover.columns if c[:15] == "LAND_COVER_PCT_"]
bins_land_cover.loc[:, pct_cols] = bins_land_cover[pct_cols].fillna(0)
print(bins_land_cover.info())
bins_land_cover.sample(5)
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 27099 entries, 0 to 27098
Data columns (total 7 columns):
H3 27099 non-null object
LAND_COVER 27099 non-null object
PRIMARY_LAND_COVER 27099 non-null object
PRIMARY_LAND_COVER_CATEGORY 27099 non-null object
PRIMARY_LAND_COVER_CATEGORY_CODE 27099 non-null int64
PRIMARY_LAND_COVER_CODE 27099 non-null int64
GEOMETRY 27099 non-null object
dtypes: int64(2), object(5)
memory usage: 1.4+ MB
None
H3 | LAND_COVER | PRIMARY_LAND_COVER | PRIMARY_LAND_COVER_CATEGORY | PRIMARY_LAND_COVER_CATEGORY_CODE | PRIMARY_LAND_COVER_CODE | GEOMETRY | |
---|---|---|---|---|---|---|---|
18093 | 852a8b83fffffff | {'11': 1779, '21': 8729, '22': 2660, '23': 769... | 41 | forest | 2 | 7 | POLYGON ((-78.600277807 37.661639755, -78.6924... |
10184 | 852783b7fffffff | {'11': 47, '21': 800, '22': 52, '31': 1537, '4... | 71 | herbaceous | 3 | 11 | POLYGON ((-103.574894942 45.204828189, -103.50... |
26470 | 8548d84ffffffff | {'11': 10, '21': 1263, '22': 75, '31': 23, '41... | 42 | forest | 2 | 8 | POLYGON ((-105.458776659 35.348940866, -105.39... |
14779 | 8528f03bfffffff | {'11': 20645, '21': 14749, '22': 7655, '23': 2... | 43 | forest | 2 | 9 | POLYGON ((-123.601736884 46.320872208, -123.54... |
7857 | 8526e46ffffffff | {'11': 1041, '21': 10898, '22': 1571, '23': 32... | 82 | planted_cultivated | 4 | 13 | POLYGON ((-94.99411935400002 38.81449329, -94.... |
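As an aside, a row-wise apply is slow at this scale. A hedged vectorized sketch of the same percentage computation, assuming the LAND_COVER dicts can be expanded into a frame directly (it would stand in for the apply above):

# Expand the per-bin class-count dicts into columns, then normalize per row
pct = pd.DataFrame(list(bins_land_cover['LAND_COVER']), index=bins_land_cover.index)
pct = pct.div(pct.sum(axis=1), axis=0).fillna(0).add_prefix('LAND_COVER_PCT_')
bins_land_cover = bins_land_cover.join(pct)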
Merge onto fires_df
Merge our bins_land_cover
onto our wildfire data keying on the H3 bin column.
for key, dataframe in fires_df.items():
    fires_df[key] = dataframe.merge(bins_land_cover, on="H3", how="left")
    display(fires_df[key].sample(5))
STAT_CAUSE_CODE | STAT_CAUSE_DESCR | OWNER_CODE | OWNER_DESCR | DISCOVERY_DATE | DISCOVERY_DOY | DISCOVERY_TIME | LATITUDE | LONGITUDE | STATE | ... | DISCOVERY_WEEK | CLIMATE_REGION | CLIMATE_REGION_CODE | H3 | LAND_COVER | PRIMARY_LAND_COVER | PRIMARY_LAND_COVER_CATEGORY | PRIMARY_LAND_COVER_CATEGORY_CODE | PRIMARY_LAND_COVER_CODE | GEOMETRY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
324193 | 7.0 | Arson | 14.0 | MISSING/NOT SPECIFIED | 2451883.5 | 340 | None | 35.077500 | -81.768100 | SC | ... | 49 | southeast | 5 | 8544d8bbfffffff | {'11': 3373, '21': 25890, '22': 7596, '23': 22... | 41 | forest | 2 | 7 | POLYGON ((-80.84926178299997 35.39708965, -80.... |
385381 | 1.0 | Lightning | 1.0 | BLM | 2451821.5 | 278 | 1250 | 36.167500 | -115.538900 | NV | ... | 40 | west | 8 | 852986b7fffffff | {'11': 8, '21': 870, '22': 239, '23': 4, '31':... | 52 | shrubland | 5 | 10 | POLYGON ((-118.837357878 38.868465355, -118.78... |
1010417 | 4.0 | Campfire | 5.0 | USFS | 2451812.5 | 269 | 1200 | 35.103056 | -111.540833 | AZ | ... | 39 | southwest | 6 | 8548cd2ffffffff | {'11': 3208, '21': 1196, '22': 237, '23': 17, ... | 42 | forest | 2 | 8 | POLYGON ((-110.797864332 34.294343747, -110.73... |
953964 | 2.0 | Equipment Use | 14.0 | MISSING/NOT SPECIFIED | 2453227.5 | 223 | None | 38.720000 | -120.731944 | CA | ... | 32 | west | 8 | 852832cffffffff | {'11': 1645, '21': 14866, '22': 3334, '23': 56... | 42 | forest | 2 | 8 | POLYGON ((-122.09915726 37.494361459, -122.046... |
827733 | 2.0 | Equipment Use | 8.0 | PRIVATE | 2452828.5 | 189 | 1204 | 42.854400 | -112.013600 | ID | ... | 28 | northwest | 2 | 8528b203fffffff | {'21': 1209, '22': 1, '31': 6, '41': 34553, '4... | 52 | shrubland | 5 | 10 | POLYGON ((-111.859646108 42.561970602, -111.79... |
5 rows × 22 columns
STAT_CAUSE_CODE | STAT_CAUSE_DESCR | OWNER_CODE | OWNER_DESCR | DISCOVERY_DATE | DISCOVERY_DOY | DISCOVERY_TIME | LATITUDE | LONGITUDE | STATE | ... | DISCOVERY_WEEK | CLIMATE_REGION | CLIMATE_REGION_CODE | H3 | LAND_COVER | PRIMARY_LAND_COVER | PRIMARY_LAND_COVER_CATEGORY | PRIMARY_LAND_COVER_CATEGORY_CODE | PRIMARY_LAND_COVER_CODE | GEOMETRY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
269557 | 7.0 | Arson | 14.0 | MISSING/NOT SPECIFIED | 2451393.5 | 215 | 1532 | 34.683344 | -89.124754 | MS | ... | 31 | south | 4 | 85445b7bfffffff | {'11': 3514, '21': 9513, '22': 3877, '23': 124... | 41 | forest | 2 | 7 | POLYGON ((-90.98733658999998 32.724780019, -90...
689 | 2.0 | Equipment Use | 1.0 | BLM | 2448868.5 | 247 | 1047 | 34.700000 | -117.167500 | CA | ... | 36 | west | 8 | 8529a3d7fffffff | {'21': 913, '22': 172, '23': 35, '31': 1615, '... | 52 | shrubland | 5 | 10 | POLYGON ((-116.50303393 32.759248771, -116.446... |
141424 | 3.0 | Smoking | 14.0 | MISSING/NOT SPECIFIED | 2452647.5 | 8 | None | 29.166670 | -82.200000 | FL | ... | 2 | southeast | 5 | 8544f413fffffff | {'11': 152, '21': 46492, '22': 25796, '23': 92... | 42 | forest | 2 | 8 | POLYGON ((-81.44679156199999 29.448254503, -81... |
83447 | 1.0 | Lightning | 1.0 | BLM | 2453964.5 | 229 | 1412 | 42.964600 | -118.326900 | OR | ... | 33 | northwest | 2 | 8528aec3fffffff | {'11': 2205, '31': 22, '42': 2988, '52': 27022... | 52 | shrubland | 5 | 10 | POLYGON ((-113.932881084 42.180656872, -113.86... |
61411 | 7.0 | Arson | 12.0 | MUNICIPAL/LOCAL | 2452674.5 | 35 | None | 34.200500 | -95.299833 | OK | ... | 6 | south | 4 | 8526cd83fffffff | {'11': 396, '21': 8681, '22': 449, '23': 19, '... | 41 | forest | 2 | 7 | POLYGON ((-97.09674050699999 34.066314039, -97...
5 rows × 22 columns
At last, we have a handful of features related to land cover on our main dataframe. Let's move on to some other features!
Census Data
In addition to land cover, human behavior plays a large role in the creation of wildfires. Similarly to how we engineered features around land cover, we can bring in an external dataset to engineer some features around population and housing. Fortunately, it's possible to obtain 2010 US Census data in a raster format, allowing us to employ the same techniques that we did for land cover.
The US Census includes a number of interesting metrics; for our purposes, we'll look at Population Density, Seasonal Housing, Housing Units, and Occupied Housing Units, taking the mean value over each bin for each metric.
As with the Land Cover data, this method is flawed in its use of data from 2010 alone. Since the wildfires we're looking at occurred in the years 1995-2015, 2010 census data may not be highly relevant to some of the earlier or later datapoints.
As with Land Cover, start by creating a new dataframe to house our work with the census data.
census_data = pd.DataFrame(
    {"H3": bins['H3'].copy()},
    index=bins.index
)
Raster Samples
Before we extract information from our source raster, let's preview the contents to sanity check.
census_files = {
    "Population Density": "/data/188-million-us-wildfires/src/usgrid_data_2010/uspop10.tif",
    "Seasonal Housing": "/data/188-million-us-wildfires/src/usgrid_data_2010/ussea10.tif",
    "Housing Units": "/data/188-million-us-wildfires/src/usgrid_data_2010/ushu10.tif",
    "Occupied Housing Units": "/data/188-million-us-wildfires/src/usgrid_data_2010/usocc10.tif"
}
def render_bin(row):
    output = []
    for name, path in census_files.items():
        # Use a local name that doesn't shadow the census_data dataframe above
        raster = rasterio.open(path)
        output.append(rasterio.mask.mask(
            raster,
            [shapely.geometry.mapping(row['GEOMETRY'])],
            crop=True,
            nodata=0
        ))
    return output
masked_bin_images = bins_land_cover.sample(2).apply(render_bin, 1).tolist()
fig, plots = plt.subplots(4, 2)
for index, val in enumerate(masked_bin_images):
    for subindex, image in enumerate(val):
        my_plot = plots[subindex][index]
        rasterio.plot.show(image[0], ax=my_plot, transform=image[1], cmap='inferno')
Looks reasonable, let's create our features!
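One aside before moving on: render_bin reopens every raster for each sampled row. Opening each file once and reusing the handles is a cheap improvement - a sketch, assuming the same census_files dict:

# Open each census raster once and reuse the handles inside render_bin
open_rasters = {name: rasterio.open(path) for name, path in census_files.items()}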
Population Density
Starting with Population Density, use zonal_stats to extract the mean value for each bin, and add the values as a new column to our dataframe.
census_population_path = "/data/188-million-us-wildfires/src/usgrid_data_2010/uspop10.tif"
def mean_population_density_for_geometry(geometry):
    stats = zonal_stats(
        [shapely.geometry.mapping(geometry)],
        census_population_path,
        nodata=0,
        stats=["mean"]
    )
    return stats[0]["mean"]
census_data = census_data.drop("MEAN_POPULATION_DENSITY", 1, errors='ignore')
census_data["MEAN_POPULATION_DENSITY"] = \
    bins["GEOMETRY"].apply(mean_population_density_for_geometry).fillna(value=0).replace([-np.inf], 0)
print(census_data.info())
census_data.sample(5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27099 entries, 0 to 27098
Data columns (total 2 columns):
H3 27099 non-null object
MEAN_POPULATION_DENSITY 27099 non-null float64
dtypes: float64(1), object(1)
memory usage: 423.5+ KB
None
H3 | MEAN_POPULATION_DENSITY | |
---|---|---|
14506 | 8528d243fffffff | -9.426104e+35 |
16105 | 85299d47fffffff | 0.000000e+00 |
11505 | 85281523fffffff | 1.346886e+00 |
11892 | 8528311bfffffff | 0.000000e+00 |
182 | 8512da2bfffffff | 0.000000e+00 |
Seasonal Housing
census_seasonal_path = "/data/188-million-us-wildfires/src/usgrid_data_2010/ussea10.tif"
def mean_seasonal_housing_for_geometry(geometry):
    stats = zonal_stats(
        [shapely.geometry.mapping(geometry)],
        census_seasonal_path,
        nodata=0,
        stats=["mean"]
    )
    return stats[0]["mean"]
census_data = census_data.drop("MEAN_SEASONAL_HOUSING", 1, errors='ignore')
census_data["MEAN_SEASONAL_HOUSING"] = \
    bins["GEOMETRY"].apply(mean_seasonal_housing_for_geometry).fillna(value=0).replace([-np.inf], 0)
print(census_data.info())
census_data.sample(5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27099 entries, 0 to 27098
Data columns (total 3 columns):
H3 27099 non-null object
MEAN_POPULATION_DENSITY 27099 non-null float64
MEAN_SEASONAL_HOUSING 27099 non-null float64
dtypes: float64(2), object(1)
memory usage: 635.2+ KB
None
H3 | MEAN_POPULATION_DENSITY | MEAN_SEASONAL_HOUSING | |
---|---|---|---|
23525 | 8544dd5bfffffff | 37.240550 | 0.248122 |
19559 | 852ad463fffffff | 27.189120 | 0.147108 |
9871 | 8527704ffffffff | 0.000000 | 0.000000 |
13150 | 85289a8ffffffff | 1.460237 | 0.939296 |
11226 | 8528063bfffffff | 1.233903 | 0.260069 |
Housing Units
census_housing_units = "/data/188-million-us-wildfires/src/usgrid_data_2010/ushu10.tif"
def mean_housing_units_for_geometry(geometry):
    stats = zonal_stats(
        [shapely.geometry.mapping(geometry)],
        census_housing_units,
        nodata=0,
        stats=["mean"]
    )
    return stats[0]["mean"]
census_data = census_data.drop("MEAN_HOUSING_UNITS", 1, errors='ignore')
census_data["MEAN_HOUSING_UNITS"] = \
    bins["GEOMETRY"].apply(mean_housing_units_for_geometry).fillna(value=0).replace([-np.inf], 0)
print(census_data.info())
census_data.sample(5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27099 entries, 0 to 27098
Data columns (total 4 columns):
H3 27099 non-null object
MEAN_POPULATION_DENSITY 27099 non-null float64
MEAN_SEASONAL_HOUSING 27099 non-null float64
MEAN_HOUSING_UNITS 27099 non-null float64
dtypes: float64(3), object(1)
memory usage: 846.9+ KB
None
H3 | MEAN_POPULATION_DENSITY | MEAN_SEASONAL_HOUSING | MEAN_HOUSING_UNITS | |
---|---|---|---|---|
17269 | 852a1613fffffff | 4.882109 | 1.034020 | 3.289251 |
3010 | 8526518bfffffff | 7.562320 | 0.087591 | 3.296344 |
13867 | 8528ab9bfffffff | 0.030536 | 0.058152 | 0.032559 |
17224 | 852a1483fffffff | 441.285687 | 1.192067 | 183.367477 |
8780 | 8527402bfffffff | 0.000000 | 0.000000 | 0.000000 |
Occupied Housing Units
census_occupied_housing_units = "/data/188-million-us-wildfires/src/usgrid_data_2010/usocc10.tif"
def mean_occupied_housing_units_for_geometry(geometry):
    stats = zonal_stats(
        [shapely.geometry.mapping(geometry)],
        census_occupied_housing_units,
        nodata=0,
        stats=["mean"]
    )
    return stats[0]["mean"]
census_data = census_data.drop("MEAN_OCCUPIED_HOUSING_UNITS", 1, errors='ignore')
census_data["MEAN_OCCUPIED_HOUSING_UNITS"] = \
    bins["GEOMETRY"]\
    .apply(mean_occupied_housing_units_for_geometry)\
    .fillna(value=0)\
    .replace([-np.inf], 0)
print(census_data.info())
census_data.sample(5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27099 entries, 0 to 27098
Data columns (total 5 columns):
H3 27099 non-null object
MEAN_POPULATION_DENSITY 27099 non-null float64
MEAN_SEASONAL_HOUSING 27099 non-null float64
MEAN_HOUSING_UNITS 27099 non-null float64
MEAN_OCCUPIED_HOUSING_UNITS 27099 non-null float64
dtypes: float64(4), object(1)
memory usage: 1.0+ MB
None
H3 | MEAN_POPULATION_DENSITY | MEAN_SEASONAL_HOUSING | MEAN_HOUSING_UNITS | MEAN_OCCUPIED_HOUSING_UNITS | |
---|---|---|---|---|---|
23738 | 8544e2abfffffff | 25.395138 | 0.242860 | 10.882212 | 10.882212 |
21173 | 85444d37fffffff | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
4213 | 85268093fffffff | 0.060832 | 0.141386 | 0.169451 | 0.169451 |
8087 | 8526ec57fffffff | 1.851572 | 0.138690 | 0.763857 | 0.763857 |
5865 | 8526aa47fffffff | 29.909779 | 0.240654 | 10.667183 | 10.667183 |
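Stepping back, the four extraction functions above differ only in the raster they read. A hedged consolidation sketch, reusing the path variables already defined:

# Generic version of the four mean_*_for_geometry helpers above
def mean_raster_value_for_geometry(geometry, raster_path):
    stats = zonal_stats(
        [shapely.geometry.mapping(geometry)],
        raster_path,
        nodata=0,
        stats=["mean"]
    )
    return stats[0]["mean"]

for column, path in [
    ("MEAN_POPULATION_DENSITY", census_population_path),
    ("MEAN_SEASONAL_HOUSING", census_seasonal_path),
    ("MEAN_HOUSING_UNITS", census_housing_units),
    ("MEAN_OCCUPIED_HOUSING_UNITS", census_occupied_housing_units),
]:
    census_data[column] = bins["GEOMETRY"]\
        .apply(lambda g, p=path: mean_raster_value_for_geometry(g, p))\
        .fillna(value=0).replace([-np.inf], 0)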
Vacant Housing Units
Let's compare overall Housing Units and Occupied Housing Units to create a feature around vacant housing units.
census_data["MEAN_VACANT_HOUSING_UNITS"] = census_data["MEAN_HOUSING_UNITS"] - census_data["MEAN_OCCUPIED_HOUSING_UNITS"]
Merge onto fires_df
Let's merge our Census features onto our main data frame.
for key, dataframe in fires_df.items():
    fires_df[key] = fires_df[key]\
        .drop([
            "MEAN_POPULATION_DENSITY",
            "MEAN_SEASONAL_HOUSING",
            "MEAN_HOUSING_UNITS",
            "MEAN_OCCUPIED_HOUSING_UNITS",
            "MEAN_VACANT_HOUSING_UNITS"
        ], axis=1, errors='ignore')
    # Merge the dropped frame, not the stale loop variable, so reruns don't duplicate columns
    fires_df[key] = fires_df[key].merge(census_data, on="H3", how="left")
display(fires_df[key].sample(5))
STAT_CAUSE_CODE | STAT_CAUSE_DESCR | OWNER_CODE | OWNER_DESCR | DISCOVERY_DATE | DISCOVERY_DOY | DISCOVERY_TIME | LATITUDE | LONGITUDE | STATE | ... | PRIMARY_LAND_COVER | PRIMARY_LAND_COVER_CATEGORY | PRIMARY_LAND_COVER_CATEGORY_CODE | PRIMARY_LAND_COVER_CODE | GEOMETRY | MEAN_POPULATION_DENSITY | MEAN_SEASONAL_HOUSING | MEAN_HOUSING_UNITS | MEAN_OCCUPIED_HOUSING_UNITS | MEAN_VACANT_HOUSING_UNITS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
190846 | 5.0 | Debris Burning | 14.0 | MISSING/NOT SPECIFIED | 2456724.5 | 67 | None | 31.565195 | -94.676084 | TX | ... | 42 | forest | 2 | 8 | POLYGON ((-95.53469668000002 30.461285287, -95... | -8.861519e+35 | -1.189798e+36 | -8.861519e+35 | -8.861519e+35 | 0.0 |
392853 | 2.0 | Equipment Use | 8.0 | PRIVATE | 2450414.5 | 332 | 1510 | 32.732400 | -82.682700 | GA | ... | 42 | forest | 2 | 8 | POLYGON ((-82.16565888299999 33.140867951, -82... | 1.526929e+01 | 2.266859e-01 | 4.978497e+00 | 4.978497e+00 | 0.0 |
236781 | 7.0 | Arson | 8.0 | PRIVATE | 2452523.5 | 249 | 2022 | 34.023425 | -85.147918 | GA | ... | 41 | forest | 2 | 7 | POLYGON ((-83.54737807800001 34.783248074, -83... | 4.844742e+01 | 2.591060e-01 | 2.042438e+01 | 2.042438e+01 | 0.0 |
953893 | 6.0 | Railroad | 8.0 | PRIVATE | 2449374.5 | 22 | 1540 | 31.515200 | -81.976300 | GA | ... | 90 | wetlands | 8 | 14 | POLYGON ((-81.13579223099998 29.521477031, -81... | 1.595580e+01 | 3.028798e-01 | 4.464921e+00 | 4.464921e+00 | 0.0 |
762546 | 5.0 | Debris Burning | 8.0 | PRIVATE | 2449294.5 | 307 | 1200 | 31.806500 | -81.653100 | GA | ... | 90 | wetlands | 8 | 14 | POLYGON ((-82.56952801199998 30.150289017, -82... | 1.031592e+02 | 5.569562e-01 | 4.348338e+01 | 4.348338e+01 | 0.0 |
5 rows × 27 columns
STAT_CAUSE_CODE | STAT_CAUSE_DESCR | OWNER_CODE | OWNER_DESCR | DISCOVERY_DATE | DISCOVERY_DOY | DISCOVERY_TIME | LATITUDE | LONGITUDE | STATE | ... | PRIMARY_LAND_COVER | PRIMARY_LAND_COVER_CATEGORY | PRIMARY_LAND_COVER_CATEGORY_CODE | PRIMARY_LAND_COVER_CODE | GEOMETRY | MEAN_POPULATION_DENSITY | MEAN_SEASONAL_HOUSING | MEAN_HOUSING_UNITS | MEAN_OCCUPIED_HOUSING_UNITS | MEAN_VACANT_HOUSING_UNITS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
316901 | 5.0 | Debris Burning | 14.0 | MISSING/NOT SPECIFIED | 2455874.5 | 313 | 1320 | 43.125472 | -75.218345 | NY | ... | 41 | forest | 2 | 7 | POLYGON ((-74.05886090500002 42.510741585, -74... | 54.448256 | 0.584297 | 21.798168 | 21.798168 | 0.0 |
173974 | 1.0 | Lightning | 5.0 | USFS | 2449234.5 | 247 | 2030 | 38.996667 | -112.106667 | UT | ... | 42 | forest | 2 | 8 | POLYGON ((-112.634109074 36.682798716, -112.57... | 0.000000 | 0.048080 | 0.048080 | 0.048080 | 0.0 |
234596 | 5.0 | Debris Burning | 14.0 | MISSING/NOT SPECIFIED | 2453825.5 | 90 | None | 34.279960 | -78.734420 | NC | ... | 82 | planted_cultivated | 4 | 13 | POLYGON ((-77.93933100999999 35.494919175, -78... | 34.872965 | 0.330970 | 15.151135 | 15.151135 | 0.0 |
180091 | 5.0 | Debris Burning | 8.0 | PRIVATE | 2456756.5 | 99 | 1635 | 32.981111 | -81.753889 | GA | ... | 82 | planted_cultivated | 4 | 13 | POLYGON ((-83.64038666099998 31.710417745, -83... | 8.222949 | 0.186019 | 3.652465 | 3.652465 | 0.0 |
266232 | 4.0 | Campfire | 5.0 | USFS | 2452048.5 | 139 | 1300 | 38.885000 | -120.033333 | CA | ... | 42 | forest | 2 | 8 | POLYGON ((-116.705052106 39.732564363, -116.64... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
5 rows × 27 columns
Conclusion
Well, that's all for now. In this blog post we engineered a handful of features, both using data already present in our base dataset and by incorporating new datasets. We geographically binned our datapoints, and used the geometry of each bin to extract statistics from land cover and census raster files.
There is a lot that can be done to improve this - if you think there's something that could be approached differently, I want to hear from you! Please reach out: andrewmahon@fastmail.com.
Next post we'll revisit our Exploratory Data Analysis with a focus on our new features.
Updates
2017/2/14 - initial post
2017/2/22 - fixed a couple of typos