total hurricanes that occurred in the Atlantic Ocean from 1851 to 2018. The file was not a CSV
and could not be read directly with built-in Pandas functions. For each hurricane, identifying
information such as the hurricane name, year, and a unique identifying code was listed, followed
by a variable number of six-hour timestamps. Each timestamp consisted of the date and time along
with the atmospheric measurements. A PDF was provided that explained in detail how to read the
text file. To acquire the data, each line of the file was read and the relevant information was
extracted using the specific indexing information provided in the PDF.
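As a rough illustration of this line-by-line approach, a parser might look like the sketch below; the character positions, the header-line check, and the field names are placeholders standing in for the exact indices documented in the PDF, not the actual values.

    # Minimal sketch of reading the raw text file line by line.
    # The character positions below are illustrative placeholders; the real
    # positions come from the documentation PDF that accompanies the data.
    def read_hurricane_file(path):
        storms = []
        current = None
        with open(path) as f:
            for line in f:
                if line.startswith("AL"):  # header line introducing a new hurricane
                    current = {
                        "code": line[0:8].strip(),    # unique identifying code
                        "name": line[18:28].strip(),  # hurricane name
                        "year": int(line[4:8]),       # year embedded in the code
                        "rows": [],
                    }
                    storms.append(current)
                elif current is not None:  # six-hour observation line
                    current["rows"].append({
                        "date": line[0:8].strip(),
                        "time": line[10:14].strip(),
                        # hemisphere letters are ignored here for brevity
                        "latitude": float(line[23:27]),
                        "longitude": float(line[30:35]),
                        "max_wind": float(line[38:41]),
                        "min_pressure": float(line[43:47]),
                    })
        return storms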
After reading all the lines and extracting the specific data points from the associated
substrings, the information was stored as a list of hurricane objects. Each object had
properties for its name, year, and other hurricane-specific identifying information.
Additionally, each hurricane object contained a Pandas DataFrame that held the data for every
timestamp recorded for that hurricane.
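A hypothetical container for this structure is sketched below; the class name, attribute names, and file name are assumptions rather than the authors' exact implementation, and it reuses the read_hurricane_file sketch from above.

    import pandas as pd

    class Hurricane:
        """One hurricane with its identifying information and per-timestamp data."""

        def __init__(self, code, name, year, rows):
            self.code = code  # unique identifying code
            self.name = name  # hurricane name
            self.year = year  # year of occurrence
            # one DataFrame row per six-hour timestamp
            self.data = pd.DataFrame(rows)

    # build the list of hurricane objects from the parsed file (placeholder file name)
    hurricanes = [
        Hurricane(s["code"], s["name"], s["year"], s["rows"])
        for s in read_hurricane_file("atlantic_hurricanes.txt")
    ]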
To create a model, the relevant features first had to be determined. Latitude and longitude
were included to represent the location of the hurricane at a timestamp, and the maximum wind
speed and pressure were included to represent the intensity of the hurricane at a timestamp.
The other features in the data provided by the National Hurricane Center were nearly all
missing for hurricanes before 2004, which represented a significant portion of the data, so
those features were not used in the models. To determine the optimal number of timestamps to
include, models were tested using window lengths ranging from 3 to 10 timestamps of all 4 input
variables, so the input size ranged from (3,4) to (10,4). To organize the list of hurricane
objects extracted from the text file into input and output data, hurricane objects with fewer
observations than the required window length were discarded. The remaining hurricanes were then
split into a training set and a testing set using sklearn's train_test_split, with 80% of the
hurricanes assigned as training hurricanes and the rest as test hurricanes. For the training
hurricanes and the test hurricanes separately, and for each window length, inputs were created
by selecting every run of consecutive timestamps equal in length to the window length, with the
next timestamp becoming the output for that input and output pair. For example, a hurricane that
recorded 10 timestamps would provide 5 input and output pairs using a window length of 5, where
each input had dimension (5,4) and each output had dimension (1,4). Because all models required
a two-dimensional input and the combined training samples formed a three-dimensional array, each
input was reshaped (flattened) so that a single input had dimension (1, window length * 4).
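A minimal sketch of this windowing and flattening step, assuming the Hurricane objects from the earlier sketch and sklearn's train_test_split, is shown below; the function name and feature column names are illustrative, not the authors' code.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # the four features retained for modeling
    FEATURES = ["latitude", "longitude", "max_wind", "min_pressure"]

    def make_windows(hurricanes, window_length):
        """Build flattened (window_length * 4)-wide inputs and 4-wide outputs."""
        inputs, outputs = [], []
        for h in hurricanes:
            values = h.data[FEATURES].to_numpy(dtype=float)
            # hurricanes too short to yield at least one pair contribute nothing
            if len(values) <= window_length:
                continue
            for start in range(len(values) - window_length):
                window = values[start:start + window_length]  # shape (window_length, 4)
                target = values[start + window_length]        # the next timestamp
                # flatten the window so each sample is one row of width window_length * 4
                inputs.append(window.reshape(window_length * 4))
                outputs.append(target)
        return np.array(inputs), np.array(outputs)

    # split whole hurricanes (not individual windows) 80/20 into training and test sets
    train_hurr, test_hurr = train_test_split(hurricanes, train_size=0.8, random_state=0)
    X_train, y_train = make_windows(train_hurr, window_length=5)
    X_test, y_test = make_windows(test_hurr, window_length=5)

With a window length of 5, a hurricane that recorded 10 timestamps contributes 5 rows to the flattened input matrix, each of width 20, matching the example above.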
Additionally, as the training inputs were created, a list was kept of the actual lines used to
make the inputs and the outputs, respectively, without repeats. For example, a hurricane with
only 10 timestamps would contribute its first 9 timestamps to this non-repeated input list.
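The bookkeeping for these non-repeated lists could be done per hurricane with a small helper like the hypothetical sketch below, where values holds one hurricane's timestamp rows as in the windowing sketch.

    def non_repeated_rows(values, window_length):
        """Rows used in inputs and outputs, each listed once, for one hurricane."""
        # every row except the last appears in at least one input window
        used_input_rows = values[:-1]
        # every row from index window_length onward appears as an output
        used_output_rows = values[window_length:]
        return used_input_rows, used_output_rows

For a hurricane with 10 timestamps, used_input_rows holds its first 9 timestamps, as in the example above.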
Three different variations, each applied to all three types of models, were first investigated. Using all
three models with the data as is, normalizing the data, and determining the distance and bearing
from the latitude and longitude between two points to replace the latitude and longitude data