
Analysis: Video Game Sales

After scavenging Kaggle for new datasets to play around with, I found an older one I had been interested in for a while: video game sales. It's a dataset from about three years ago, scraped from a website that tracks video game sales and ratings.

Data

The data is three years old, which is quite unfortunate, as the video game market has expanded greatly over the last few years. Multiplayer online games such as Fortnite now hold a dominant position in the market. I therefore wanted to find a dataset that covers the last three years as well, and after hours of scraping I had an up-to-date one. However, I quickly noticed that the dataset I 'created' was missing a lot of important values.

If you want to try the scraping script that I found on Github, download it here. I would recommend adding a time.sleep() call inside the 'for loop' that scrapes the data. If you do not, you might get an error (HTTP 429) because you're sending too many requests to the server in a short amount of time.

import time

# Pause between requests so the server doesn't reject you with HTTP 429
time.sleep(25)
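
To make the placement concrete, here is a minimal sketch of where the delay goes. The page range, URL, and parsing step are placeholders, not the actual script:

import time
import requests

for page in range(1, 10):  # hypothetical page range; the real script defines its own
    response = requests.get(f'https://example.com/games?page={page}')  # placeholder URL
    # ... parse and store the response here ...
    time.sleep(25)  # wait between requests to stay under the server's rate limit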

I found that 25 seconds was the shortest delay that did not trigger an error. At that pace, you will have to let the script run for days to get the full dataset. If you would rather not, try changing the number of 'pages' at the beginning of the script. This reduces the amount of data you collect, but you won't have to wait four days for it to finish. I ended up scraping one page and, unfortunately, noticed that the scraped data had a lot of missing values.

For all of these reasons, I decided to stick to the premade dataset I found on Kaggle.

If you do decide to scrape the whole dataset, my advice would be to slightly restructure the scraping script: move the portion of the code that saves the scraped data to a dataframe and CSV file up into the loop, as sketched below. If your internet connection drops or you hit some kind of error, you will at least still have data saved to disk.
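
A rough sketch of that restructuring, where scrape_page is a stand-in for the script's actual scraping step:

import pandas as pd

def scrape_page(page):
    # Stand-in for the script's actual scraping logic (hypothetical)
    return []

rows = []
for page in range(1, 10):  # hypothetical page range
    rows.extend(scrape_page(page))
    # Write to disk after every page instead of only at the end,
    # so an interrupted run still leaves you with partial data
    pd.DataFrame(rows).to_csv('games.csv', index=False)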

Analysis

Please keep in mind that the dataset is from about three years ago. Therefore, you will not see games such as Minecraft or PUBG on the list.

For the analysis I wanted to look at four different things:

  • Top 10 titles in gaming
  • Sales per publisher
  • Sales per year
  • Sales per platform

Top 10 titles in gaming

Here we see the top 10 titles in video games (as of three years ago); the ranking is based on global sales. As you can see, Wii Sports did quite well. I do have a speculation about first place, though: Wii Sports was bundled with the Wii console itself, so its global sales might reflect (some of) the Wii console sales rather than people intentionally buying Wii Sports.

Next in line is Super Mario Bros, a classic game that has been around since 1985, according to this dataset.

Sales per publisher

Up next we have the top 20 game publishers based on global sales. Earlier we saw that Wii Sports is the best-selling title; here we see that its publisher, Nintendo, also has the most sales of all publishers. If you scroll back up to the top 10 titles, you'll see that the list is completely dominated by Nintendo.

Sales per year

This is my favorite graph for this dataset. It reflects the trend in sales over the last 30+ years, and we see a clear upward trend. I would carefully speculate that video games have greatly increased in popularity. There is a bit of a downward trend toward the end of the graph, but this might be an illusion: adding more recent data would likely change it. I would assume that global video game sales are currently still increasing.

Sales per platform

This one surprised me the most. Based on the other graphs and the table, I expected the Wii to do much better. But it appears that the PS2 was the platform with the most game sales three years ago.

Conclusion

For this dataset I carried out a simple analysis that revealed some basic trends. Unfortunately, the dataset is not up to date; I would assume you would find even more interesting trends if you were to include data from the last three years.

However, the most important piece of information this dataset provides is that gaming appears to have grown as a market. My assumption would be that video games have grown more diverse in their genres and titles and therefore cater to a wider audience.

Do you want to use this dataset? Download it here on Kaggle. You'll need to create an account, which is completely free.


Analysis: Weather in Amsterdam (November 9 – 16, 2019)

In a previous post I embedded a Jupyter Notebook that used DarkSky’s API to pull weather data about a random location in Amsterdam. In this post I display some of the graphs plotted from the data.

The data consists of 168 data points: hourly predictions of the weather in Amsterdam, ranging from November 9th 19:00 to November 16th 19:00.

Unfortunately, I was not able to properly save the plots as images. Therefore, to read the x-axis you might have to resort to some squinting and zooming in (Ctrl + scroll up).

Data cleaning

After I loaded the API data into a dataframe, I went through DarkSky's documentation to understand the data. The API call I used mostly returned data in imperial units. As I am more familiar with the metric system, I used a few formulas to convert the values. I also changed the Unix time column in the dataframe to 'datetime', and added a shortened version of the time to make the x-axis of the plots more legible.

You should also be able to fetch metric units directly through the API. However, I wanted to play around with the conversions myself.

import pandas as pd

# Fahrenheit to Celsius
df['temperature'] = (df['temperature'] - 32) * (5/9)
df['apparentTemperature'] = (df['apparentTemperature'] - 32) * (5/9)
df['dewPoint'] = (df['dewPoint'] - 32) * (5/9)

# Miles to kilometers
df['visibility'] = df['visibility'] * 1.609344

# Unix to datetime
df['time'] = pd.to_datetime(df['time'], unit='s')

# Changing the datetime format
# I want to see: day, shortened month name, hour and minutes
df['time_short'] = df['time'].dt.strftime('%d %b %H:%M')
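
One thing to note: the wind speed graph further down is in kilometers per hour, so if your API response is in imperial units, the same factor applies. This assumes the column is named windSpeed, as in DarkSky's hourly data:

# Miles per hour to kilometers per hour (same factor as visibility)
df['windSpeed'] = df['windSpeed'] * 1.609344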

Graphs

In order to plot these graphs, I used matplotlib. To save the images, I used:

plt.savefig('temp.png')
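
For reference, here is a minimal sketch of how one of these line graphs could be put together from the dataframe above; the styling of my actual plots differs:

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 5))
plt.plot(df['time_short'], df['temperature'], label='Temperature')
plt.plot(df['time_short'], df['apparentTemperature'], label='Apparent temperature')
plt.xticks(rotation=90, fontsize=6)  # 168 hourly labels, so keep them small
plt.ylabel('Degrees Celsius')
plt.legend()
plt.savefig('temp.png')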

Temperatures in Amsterdam

This line graph displays two lines: the blue line shows the predicted temperature per hour, and the orange line shows what that temperature will feel like (the apparent temperature).

Precipitation in Amsterdam

This line graph shows the precipitation in millimeters per hour.

Wind speeds in Amsterdam

This line graph shows the wind speeds in Amsterdam in kilometers per hour.

Dew points in Amsterdam

This line graph shows the dew point in Amsterdam over the course of the week; the data points are in degrees Celsius.

Want to use this weather API?

Go to DarkSky's website, make an account, and get your own free key!


Jupyter Notebook: Weather in Amsterdam

I found an interesting and free API: DarkSky. DarkSky allows you to pull weather-related data for any location you desire. All you have to do is sign up, and you'll receive a key that gives you access to their API.

I pulled hourly data from their API for a random location in Amsterdam. Below I have embedded a Jupyter Notebook in which I have plotted different weather-related factors such as temperature, wind speed, visibility, and precipitation.
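
If you want to reproduce the call yourself before opening the notebook, the request is a single GET. A minimal sketch, with a placeholder key and coordinates; the extend=hourly parameter is what stretches the hourly block to a full week:

import pandas as pd
import requests

API_KEY = 'your_secret_key'  # placeholder: use your own DarkSky key
LAT, LNG = 52.37, 4.90       # roughly central Amsterdam

url = f'https://api.darksky.net/forecast/{API_KEY}/{LAT},{LNG}'
response = requests.get(url, params={'extend': 'hourly'})

# The hourly predictions live in the 'hourly' block of the JSON response
df = pd.DataFrame(response.json()['hourly']['data'])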


How to embed your Jupyter Notebook into a WordPress post. No plugin needed. [beginner’s guide]

You’ll need:
– Jupyter Notebooks in ‘tree mode’
– Jupyter Notebook terminal or Linux terminal
– A Github account
– A WordPress account
– The ‘Gist’ extension

This guide will first help you create a 'gist' of your notebook, which will then be embedded into your WordPress post.

Installing the gist extension on my Windows machine

To get the gist extension up and running on my Windows machine I ran the following code in the Anaconda terminal:

pip install jupyter_contrib_nbextensions

pip install jupyter_nbextensions_configurator
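
Depending on your setup, the pip installs alone may not be enough; the nbextensions documentation also lists two follow-up commands to copy the extension files and enable the configurator:

jupyter contrib nbextension install --user
jupyter nbextensions_configurator enable --user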

Installing the gist extension on my Linux machine

The Windows installation method did not work on my Linux machine. To achieve the same thing, I ran the following in my Linux terminal:

conda install -c conda-forge jupyter_contrib_nbextensions

Enabling the Gist extension

Next, make sure that you're viewing the Jupyter Notebook you want to embed in your WordPress post in 'tree mode'.

Then we are going to enable the 'Gist' extension. To do this, go to Edit > nbextensions config. This opens a new window with a list of extensions. Look for 'Gist-it'. When you click on 'Gist-it', you'll be able to set the parameters for the extension.

The first parameter we need to set is the 'GitHub personal access token'. To get this token, go to your personal GitHub account; it can be generated at https://github.com/settings/tokens. Click on 'Generate new token', fill in a name for the token in the 'Note' textbox, and then select the 'gist' scope. Right after creating the token, you'll see your personal access token. Copy it. DO NOT share this token with other people.

Set the parameters for the gist extension in your nbextension config window.

Get your personal access token on your GitHub account. Click 'Generate new token'.

After copying the personal access token from your GitHub account, go back to the nbextensions config window and paste the token into the first parameter field, 'GitHub personal access token'. Next, check the 'Gists default to public' box.

Sharing the notebook to gist


Go to the notebook you would like to add to your GitHub gist (in tree mode) and click on the GitHub logo that is now present in the Jupyter Notebook toolbar.

When you click on the logo, a new window will appear that asks for a gist id. If this is the first notebook you're adding to your gist, you won't need to fill in an id, so skip this field. Check the 'make the gist public' box and add a description; this will be the title of your gist.

Go to your gists on GitHub. You can easily access these by clicking on your profile picture in the top right-hand corner; a dropdown menu will appear. Click on 'your gists'. If everything went well, you'll see your Jupyter notebook on this page.

Embedding your gist on WordPress

Open the gist by clicking on it, then copy the URL from your browser; this is the URL to your gist. Make a new WordPress post and make sure your WordPress 'block' is switched to HTML. Paste the URL into the HTML block, and you're done. Publish your post and take a look at your site: your first Jupyter notebook on WordPress.
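
If pasting the bare URL does not render the notebook, GitHub's standard gist embed snippet should also work inside the HTML block. The username and gist id below are placeholders:

<script src="https://gist.github.com/your-username/your-gist-id.js"></script>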


Big Data: volume, velocity, variety… and value

Doug Laney introduced us to the first three Vs of big data back in 2001: volume, velocity, and variety. As we have amassed more data over time, the volume of data has increased. Think about sensors on machines: we can now investigate how machines in a factory or a warehouse are doing based on continuous sensor readings. Furthermore, even more data is being generated through our smart devices and social media platforms. The emergence of the internet of things (IoT) has brought us a goldmine of datasets.

What makes big data even more special is that this data arrives continuously, in real time. We can monitor machines as they drill for oil, likes on social media posts are instantly registered, and rainfall is continually measured and recorded. These three examples fall under the second V, velocity. The speed at which data arrives has increased tremendously over the last few decades, facilitated by the growth in bandwidth and internet speeds.

Moreover, we now have a myriad of different data formats. Think about the different types of data an online store can generate: the click paths people follow through the website, the information customers fill in, which items a customer ends up buying, and what payment method they used, to name just a few. All of these actions generate different data formats that need to be stored, processed, and analyzed.

Other scholars and big data engineers have added more Vs to the mix; examples are variability, veracity, and visualization. But in this post I would like to discuss a different V: value. On the surface, big data seems great. We have a lot of information, which pleases those who adhere to the 'law of large numbers'. In statistics, many principles point toward the idea that bigger is better. Think of the central limit theorem: the larger the sample size, the closer the distribution of the sample mean gets to a normal distribution. And increasing the sample size is all about getting an answer that is closer to 'reality'. We want a sample that represents the population.
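
As a quick aside, the central limit theorem is easy to see for yourself in a few lines of Python (numpy assumed): even for a deliberately skewed population, the distribution of sample means tightens around the population mean as the sample size grows.

import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=100_000)  # skewed, clearly non-normal

for n in (2, 30, 500):
    # Draw 10,000 samples of size n and look at the distribution of their means;
    # its histogram looks increasingly normal even though the population is skewed
    means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    print(f'n={n:>3}: mean of sample means = {means.mean():.2f}, spread = {means.std():.3f}')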

But let’s say we’re a company. We have loads of data. Statisticians would be jealous of our datasets. So much data. But what now? We let the data sit in a database or a distributed file system for a couple of weeks before we analyze it. We analyze it, and oops – it’s already too late. The interesting trends we found through our analysis are now irrelevant. That is why we should seek value in our data. It means acting fast, it means performing the right analyses. It also means realizing what we want to achieve with our data. Do we want to increase sales? Do we want to understand our population better? Do we want to facilitate better decision-making processes?

We have to know what we're doing; that is why the value principle is so important. This principle builds on the other Vs as well: the volume, variety, and velocity of the data. Value is often overlooked, but it is imperative to your big data solution.