Sliding down Roller Coaster data

RollerCoaster Tycoon is my childhood game. OK… I still play it sometimes. It is a PC game where you create your own amusement park. If you like a challenge, you can play campaigns in which your park has to fulfill several requirements. One of the main elements of the game is building roller coasters. Having built a lot of them, I wanted to see how roller coasters created in the game compare to real-world rides.

First, I needed some data, so I looked for open datasets. There is a dataset available on Kaggle (https://www.kaggle.com/nolanbconaway/rollercoaster-tycoon-rides) that includes characteristics of roller coasters built in RollerCoaster Tycoon 2 (the iPad version). I will use it as the data on rides created in the game. As for real roller coasters, there is a huge roller coaster database (https://rcdb.com/) that gathers information about roller coasters all around the world, but its data is very incomplete. It lists the year of opening, location and photos of roller coasters, but few entries have information about structure, such as height, maximal speed and similar metrics that I would like to compare. Therefore I decided to look for a different data source. I found a CSV file with data on real roller coasters on a Montana State University site (http://www.math.montana.edu/courses/s216/), including a description of the metrics used (with units), so I decided to use this data source instead.

First, I needed to compare the datasets, see which metrics are available and check whether they are given in the same units. In general, the dataset from RollerCoaster Tycoon (the game, further referred to as RCT) includes more numerical features. Some metrics, like top/max speed, seem to be available in both sets; others are not, e.g. g-force values are available only in the RCT dataset. There are four metrics that I felt confident are comparable: max_speed, ride_time, ride_length and inversions. There were two additional metrics in the real roller coaster dataset that seemed interesting, named drop (ft) and max_height (ft). However, the RCT dataset has only one metric that resembles them, named highest_drop_height, and from its description it is not clear whether it is the drop height or the maximal height; the unit is not clearly stated either. I therefore decided not to include this metric. In the end we have four numerical metrics that use the same units in both datasets.

There are also non-numerical columns in each dataset. For example, the real roller coaster dataset has a column Type with values “Wooden” and “Steel”, which could make an interesting comparison. However, in the RCT dataset the similar column, rollercoaster_type, contains very granular data that does not really map onto the wooden/steel distinction. Two types have “Wooden” in the name, but the other types are not easily mappable. Still, I will try to use those values later in the analysis.

Moving on to the actual analysis, I wanted to see the distribution of values for each of the four chosen metrics in each dataset, i.e. whether those values are comparable at all before going further. With this overview, we can observe that in general real roller coasters are faster, longer and have a longer ride time. I checked the statistical significance of the differences for those four metrics. For the inversions analysis I removed “-1” values, as those indicate missing data. For all four analysed metrics the difference between RCT and real roller coasters is statistically significant (max_speed p-value: 6.27e-25, ride_time p-value: 9.69e-38, ride_length p-value: 2.18e-18, inversions p-value: 0.03).
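To make the significance check more concrete, here is a minimal sketch of one way to compare a metric between the two groups. The text above does not say which test produced the p-values, so the two-sided permutation test on the difference of means used here is my assumption, not necessarily the method behind the reported numbers. The function names and the sample values are made up for illustration; only the “-1” missing-data marker comes from the dataset description.

```python
import random

MISSING = -1  # marker for missing inversion counts, as described in the text


def drop_missing(values, missing=MISSING):
    """Return the values with the missing-data marker removed."""
    return [v for v in values if v != missing]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def permutation_p_value(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Assumed technique: the group labels are shuffled repeatedly and we
    count how often the shuffled difference is at least as large as the
    observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up numbers for illustration only, not values from either dataset.
real_inversions = [0, 2, 3, 4, 5, 7, 0, 3]
rct_inversions = drop_missing([-1, 0, 1, 0, 2, -1, 1, 0])
p_value = permutation_p_value(real_inversions, rct_inversions)
```

A permutation test makes no assumption about the shape of the distributions, which is convenient for skewed metrics such as the number of inversions, but its smallest reportable p-value is limited by the number of permutations.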

Next, I wanted to check whether all of the RCT roller coasters have metrics within the range of values observed for real-life rides. No roller coaster built in RCT fell outside the real-life limits of maximal speed, but three had a shorter ride time than any real-life ride. I can only imagine that those were created to fulfill some campaign requirement (e.g. having at least X roller coasters in your amusement park). As for ride length, only one RCT roller coaster was outside the limits, reaching 7497ft against an observed real-life maximum of 7359ft, so the difference is not really big. Additionally, one game-based roller coaster exceeded the limits for inversions. It had 9 inversions, while the maximum number of inversions observed in real-life rides was 7. It was probably one nauseating roller coaster.
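The range check described above can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and apart from the two ride lengths quoted in the text (7497ft and 7359ft) the sample values are made up.

```python
def outside_range(candidates, reference):
    """Return the candidate values below the reference minimum or above its maximum."""
    low, high = min(reference), max(reference)
    return [v for v in candidates if v < low or v > high]


# Only 7359 (real-life maximum) and 7497 (RCT outlier) come from the text;
# the remaining numbers are invented for the example.
real_ride_length_ft = [1800, 3200, 5400, 7359]
rct_ride_length_ft = [1900, 2600, 4100, 7497]

print(outside_range(rct_ride_length_ft, real_ride_length_ft))  # [7497]
```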

In the plot it can also be observed that there seems to be a correlation between ride length and ride time. Let’s look into it in more detail. The Pearson correlation coefficient showed a significant correlation between ride length and ride time for the whole dataset (coefficient 0.79, p-value 3.26e-68) as well as separately for real (0.65, 4.56e-16) and RCT roller coasters (0.83, 4.09e-50). On the plot, we can also observe that rides in the top right corner (longer in terms of both time and ride length) also have a higher maximal speed (bigger dot size) than those in the bottom left part of the plot (shorter rides). That makes sense - usually the biggest roller coasters are also the ones that have more attractions, like very fast sections and inversions.
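For readers who want to see what is behind those numbers, below is a small standard-library sketch of the Pearson coefficient and the t statistic that a significance test for it is typically based on. It is not the code from the notebook; the function names are mine and the sample values are invented. In practice a library routine such as scipy.stats.pearsonr returns both the coefficient and the p-value directly.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Invented sample: ride length in ft and ride time in seconds.
ride_length = [1200, 2500, 3100, 4800, 6000]
ride_time = [60, 95, 110, 150, 200]
r = pearson_r(ride_length, ride_time)
t = t_statistic(r, len(ride_length))
```

The p-value is then obtained by comparing the t statistic against a Student t distribution, which is why even a moderate coefficient can be highly significant when the sample is large.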

Next, I wanted to look into roller coaster types. The real roller coaster dataset has a clear distinction between Steel and Wooden ride types. We don’t have a similar distinction in the RCT dataset, so in this part we will analyse only real roller coasters. The Pearson correlation between ride time and ride length was statistically significant for both Wooden (0.79, 2.03e-09) and Steel (0.63, 1.71e-10) rides. Interestingly, there is no clear difference between Steel and Wooden rides in terms of ride length, ride time and maximal speed.

As mentioned earlier, the RCT dataset distinguishes multiple types, and two of them contain “Wooden” in the name. It turns out that there are 32 rides of these types in our dataset. In the next step I will compare them with the Wooden roller coasters from the real world. What we can observe at first glance is that the wooden roller coaster data fit the trend observed earlier: real roller coasters are longer and faster than RCT rides. I wonder if this is connected with the fact that building roller coasters in the game is time-consuming and really long rides usually bring no additional advantage. In real life you can always aim to have the longest ride in the world. Getting back to wooden roller coasters: in general, both real and game-based rides follow the same ride length/ride time relation.

Let’s now dive deeper into the Roller Coaster Tycoon dataset, as it has many interesting features that we cannot analyse together with real-life rides. First, I want to see whether there are clear differences between custom-designed roller coasters and those built into the game (the binary custom_design column serves this purpose). At first glance, that does not seem to be the case (see plots below). I divided the analysed metrics into two groups to make the plots clearer. Built-in designs are shown in blue and custom-designed rides in orange.

Looking at the distributions of metrics like 'excitement', 'intensity' and 'nausea', which are very important for a ride's success in the game, we can observe that most rides in the dataset tend to have intensity and nausea in the lower range and excitement in the upper range, as anticipated. When comparing custom-designed rides (1, understood as built by the user from scratch) with pre-designed ones (0, understood as ready to build on appropriate ground), we can observe that custom-designed roller coasters have lower excitement and lower nausea ratings. This may not hold for all RCT users, as we are using data from one person's experience; however, it is hard in the game to create custom rides with intensity ratings as high as those of the pre-designed ones. In my experience, though, the nausea rating is easily exceeded, which is not visible in this dataset.

As the last part, I would like to use the data to better understand what influences the excitement rating of a ride. In other words, what should I take care of while building a custom-designed roller coaster to achieve a high excitement rating? Looking at the visualization of the correlation matrix (based on Pearson correlation), we can observe that excitement correlates most strongly with ride length, number of drops, total time in air, highest drop height and maximal speed. The number of inversions and the observed G-force values had a low correlation with excitement.
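Reading one row of a correlation matrix amounts to ranking every feature by how strongly it correlates with the target. The sketch below shows that idea with the standard library only; it is not the notebook's code, the function names are hypothetical and the numbers are invented (only the column names excitement, ride_length and inversions appear in the datasets described above).

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient; assumes neither sequence is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(columns, target="excitement"):
    """Return (feature, r) pairs sorted by absolute correlation with the target."""
    target_values = columns[target]
    scores = [
        (name, pearson_r(values, target_values))
        for name, values in columns.items()
        if name != target
    ]
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


# Invented values for illustration only.
rides = {
    "excitement": [4.0, 5.0, 6.0, 7.0, 8.0],
    "ride_length": [1000, 1500, 2100, 2400, 3100],
    "inversions": [2, 0, 3, 0, 2],
}
ranking = rank_by_correlation(rides)
```

Sorting by the absolute value keeps strong negative correlations near the top as well, which matters for ratings such as nausea that may move in the opposite direction to excitement.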

Comparing real roller coaster data with RollerCoaster Tycoon data shows that the game is not unrealistic in terms of ride structure and characteristics. Additionally, we gained some knowledge about what influences excitement, the key metric for rides you design yourself. It would be great to gather similar data from multiple RCT players and verify whether our sample was representative. How do your roller coasters compare with the real ones and with the RCT rides presented here?

The notebook with the analysis and visualizations in Python can be found on GitHub.

If you have data on your roller coasters built in RCT and would like to share it, let me know (or create a PR to the GitHub repo with a data file). Thanks!

Author: Martyna Urbanek-Trzeciak