
Analysis: COVID-19 confirmed cases around the world

First of all, a shout-out to Johns Hopkins University for publishing COVID-19 datasets on their GitHub. Their datasets can be found here. These are the best datasets I have found so far, covering confirmed cases, recoveries, and deaths. The data is quite clean and is available at the province/state level as well as the country level.

For this particular analysis I have used the ‘confirmed cases’ dataset. I wanted to look at the top 10 countries with the most reported cases as of now. Furthermore, I wanted to see a time series on the country level and on a global scale. Lastly, we look at the progression in the Netherlands.

What is great about this specific dataset is that a new column is added every day with yesterday’s reported cases. It is important to note that the data is cumulative: each day’s column holds the total number of confirmed cases per country up to that date.

Furthermore, the date format in this dataset is mm/dd/yy.
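To make that structure concrete, here is a minimal sketch of how the dataset could be loaded with pandas and the cumulative date columns turned into daily new cases. The URL and column names follow the layout of the JHU repository at the time of writing and may have changed since; treat them as assumptions.

import pandas as pd

# Layout assumption: the JHU repo hosts a 'confirmed' time series CSV with
# Province/State, Country/Region, Lat, Long, then one cumulative column per date
url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
       'csse_covid_19_data/csse_covid_19_time_series/'
       'time_series_covid19_confirmed_global.csv')
df = pd.read_csv(url)

# One row per country: sum away the province/state level, drop the coordinates
cases = df.groupby('Country/Region').sum(numeric_only=True).drop(columns=['Lat', 'Long'])

# The date columns use the mm/dd/yy format mentioned above
cases.columns = pd.to_datetime(cases.columns, format='%m/%d/%y')

# Counts are cumulative, so differencing gives new cases per day
daily_new = cases.diff(axis=1)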

Top 10 countries with the most reported cases

This graph shows the cumulative number of confirmed cases for the top 10 countries as of the 20th of March, 2020, with China still having the highest number of confirmed cases.
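As a rough sketch of how such a ranking can be pulled out with pandas, reusing the cases frame from the loading sketch above (the plot styling is my own assumption):

import matplotlib.pyplot as plt

# `cases` is the country-by-date frame from the loading sketch;
# the last column holds the most recent cumulative totals
top10 = cases.iloc[:, -1].nlargest(10)
top10.sort_values().plot(kind='barh', title='Top 10 countries by confirmed cases')
plt.tight_layout()
plt.show()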

Time series of top 5 countries with most reported cases of COVID-19

This graph displays a time series of the confirmed cases in the top 5 countries. While Italy, Spain, Germany, and Iran are still steadily increasing in numbers, we see that China’s case count has stagnated since March. We also see that around the time the stagnation in China takes place, cases start to be reported in Europe and Iran.

Even though South Korea reported cases before Europe and Iran did, it is not part of this graph. The countries in this graph were selected for having the most cases as of the 20th of March, 2020.
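The same ranking drives the country selection for the time series; a minimal sketch, again assuming the cases frame from the loading step:

import matplotlib.pyplot as plt

# rank countries by the latest cumulative total and keep the top 5
top5 = cases.iloc[:, -1].nlargest(5).index

# transpose so dates run along the x axis, one line per country
cases.loc[top5].T.plot(figsize=(10, 6))
plt.ylabel('Cumulative confirmed cases')
plt.show()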

Time series of top 5 countries with most reported cases of COVID-19 (excluding China and Italy)

Let’s exclude China and Italy for a moment. Here we can see that Spain, Germany and the US have had quite similar trajectories in terms of reported cases over time.

Time series of top 10 countries with most reported cases of COVID-19 (excluding China and Italy)

South Korea clearly stands out in its COVID-19 trajectory. It managed to respond quickly and flatten the curve to a point where it is almost stagnant.

Time series globally

This graph displays the global trajectory of confirmed COVID-19 cases. In February and March there appears to be a small dip around the same time of the month; since cumulative counts should not decrease, these dips are most likely reporting corrections in the source data. After the second small dip in March we see a major increase in reported cases around the world.
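For completeness, the global curve is just the sum over all countries; a one-liner under the same assumptions as the sketches above:

import matplotlib.pyplot as plt

# summing over the country rows gives the global cumulative curve
cases.sum(axis=0).plot(title='Global confirmed COVID-19 cases')
plt.show()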

Time series of reported COVID-19 cases in the Netherlands

This graph looks at the confirmed cases in the Netherlands. March 12 stands out here, as no new cases were reported on that day and the cumulative count stayed flat. However, after this stagnation we see a steeper increase compared to the trajectory before March 12.

Want to see the full notebook of code behind these graphs? Go to this post.


programming made me impatient: from psychology to python

with python this could have been automated…

1. psychology

The human psyche has been a long-time fascination of mine, so much so that I felt I needed to watch every lecture I could find on William James before I even enrolled in any psych program. I was fully immersed in the world of the bystander effect, availability heuristics, and personality disorders.

2. doing the math

Soon after actually enrolling, I found myself drowning in statistical methods. Math did not come easily to me; I definitely had to put in extra hours to achieve any kind of passing grade. I fully understood the concepts, but the math bits… I don’t recall ever feeling that frustrated before. The catch was, I actually really enjoyed all of it once I did manage to grasp it. Statistics became a puzzle I needed to solve. I wanted nothing more than to figure out the significance of any piece of research. I suddenly had a goal: to dissect the statistics used by researchers in scientific journals. What kind of flaws were they hiding? I felt like those people who find themselves automatically spellchecking any piece of text they read.

3. first encounter: writing syntax

I then seriously considered studying statistics. In my spare time I would download datasets and perform all kinds of analyses on them. My free university edition of SPSS was an absolute godsend at this point. Most importantly, I fully enjoyed writing SPSS syntax: I was able to trace my thought process, and I could quickly replicate tests. Yet it didn’t take long for this sentiment to dissolve. I was shocked to find out how limiting SPSS really was. I mean, yes, it is nice software, but what if I want to go beyond SPSS’s capabilities? That is when I found out about R.

4. whoever created R…

I downloaded RStudio, and again, I felt as confused as when I was first confronted with just the idea of statistics. R made no sense to me. At this point most of my statistics journey took place outside of university. I had decided not to go for a statistics master’s; I wanted to understand internet culture. So I was limited to making sense of R on the weekends. My master’s was all about qualitative research, so no statistics in sight. However, to understand internet culture, I needed tools to scrape the web. Suddenly, I realized I needed to learn an actual programming language (sorry, R). In order to pull data using an API, I needed to use python.

5. python

My R weekends were soon replaced with python weekends. This is when my love-hate relationship with programming started. I felt on top of the world whenever my code worked, and unbelievably impatient and frustrated when I couldn’t get it to work. This was also the first time I ever experienced what they call flow. I have pretty good time management skills, but python threw it all out of the window. I worked for hours on writing a script that would sort a string into alphabetical order, and I couldn’t believe it: I seemed to forget the very concept of time. My love-hate relationship turned into a full-on love for python and programming.
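(For the curious: with python’s built-ins, the exercise itself boils down to very little code; the hours went into discovering that.)

# sort the characters of a string alphabetically and glue them back together
word = 'python'
print(''.join(sorted(word)))  # prints 'hnopty'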

6. more data and more pandas please

I still enjoy reading about psychology and groundbreaking experiments, and I frequently try to catch up on developments in internet culture. However, I felt I needed to further develop my technical skills. I no longer wanted to work with small datasets; I wanted big data. That is why I decided to go for a traineeship in data engineering. I dropped python basics for a python library: pandas. It was like doing statistics on steroids. Never have I experienced statistics like that.

7. think like a computer

Don’t worry, I have not actually abandoned python basics; I just temporarily put those lessons on hold, and now I’m back at it. When I first tried python, I could not get myself to think like a computer. I wrote two lines of simple code and expected the IDE to just “get it”. I read ‘Python for Dummies’ and found an anecdote that finally made me understand computers: if you tell someone who has never toasted bread before to “just put the bread in the toaster”, they will probably try to force a loaf of bread, packaging and all, into the toaster. You can’t just tell a computer to “do something”. It needs a full rundown.

8. impatient

Now that my programming has gotten a bit better and computers and I vibe well, I have grown impatient. Any time I find myself using software such as Excel or Power BI, or even a querying language such as SQL, I get impatient: “with python this could have been automated”, or “with python this would have been solved in 3 steps instead of 10”. Programming will make you realize how much you can customize, even the software you use. Imagine if you could tweak everything you use. This thought process led me to Linux. I loved Windows for its user-friendliness; however, it does feel like you’re stuck in a box. Linux has not exactly been smooth sailing for a hardcore Windows user, and I am still trying to figure out some of the compatibility issues I am experiencing. Yet I have close to full control.


Analysis: Video Game Sales

After scavenging Kaggle for new datasets to play around with, I found an older one I had been interested in for a while: video game sales. It’s a dataset from about three years ago, scraped from a website that tracks video game sales and ratings.

Data

The data is three years old, which is quite unfortunate, as the video game market has expanded greatly over the last few years; multiplayer online games such as Fortnite now have a dominant position in the market. Therefore, I wanted to find a way to get a dataset that covers the last three years as well. After scraping for hours, I had an up-to-date dataset. However, I quickly noticed that the dataset I ‘created’ was missing a lot of important values.

If you want to try the scraping script that I found on GitHub, download it here. I would recommend adding a time.sleep call in the for loop that scrapes the data. If you do not do this, you might get an error (HTTP 429) for sending too many requests to their server in a short amount of time.

# pause between requests to avoid HTTP 429 (Too Many Requests)
import time
time.sleep(25)

I found that the shortest pause that avoided the error was 25 seconds. You will have to let the script run for days if you want a full dataset. If you would rather not, try changing the number of ‘pages’ at the beginning of the script. This reduces the amount of data, but you won’t have to wait 4 days for your data to be done. I ended up scraping 1 page. Unfortunately, I did notice that the scraped data had a lot of missing values.

For all of these reasons, I decided to stick to the premade dataset I found on Kaggle.

If you do decide to scrape the whole dataset, my advice would be to slightly restructure the scraping script: move the portion of the code that saves the scraped data to a dataframe and CSV file up into the loop. If your internet connection drops or you hit some kind of error, at least you will still have data saved to disk.
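A hedged sketch of what that restructuring could look like. The scrape_page function and the page range are placeholders for whatever the downloaded script actually does; the point is the pause between requests and the save inside the loop:

import time
import pandas as pd

rows = []
for page in range(1, 11):           # placeholder page range
    rows.extend(scrape_page(page))  # placeholder for the script's scraping step
    # save after every page so a dropped connection doesn't cost you everything
    pd.DataFrame(rows).to_csv('vgsales_partial.csv', index=False)
    time.sleep(25)                  # stay under the server's rate limit (HTTP 429)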

Analysis

Please keep in mind that the dataset is from about three years ago. Therefore, you will not see games such as Minecraft or PUBG on the list.

For the analysis I wanted to know 4 different things (a pandas sketch follows the list):

  • Top 10 titles in gaming
  • Sales per publisher
  • Sales per year
  • Sales per platform
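All four boil down to one-line aggregations in pandas. A minimal sketch, assuming the column names of the Kaggle vgsales.csv file (Name, Publisher, Year, Platform, Global_Sales); adjust if your copy differs:

import pandas as pd

df = pd.read_csv('vgsales.csv')  # the Kaggle dataset, downloaded locally

top_titles   = df.nlargest(10, 'Global_Sales')[['Name', 'Global_Sales']]
by_publisher = df.groupby('Publisher')['Global_Sales'].sum().nlargest(20)
by_year      = df.groupby('Year')['Global_Sales'].sum()
by_platform  = df.groupby('Platform')['Global_Sales'].sum().sort_values(ascending=False)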

Top 10 titles in gaming

Here we see the top 10 titles in video games (as of three years ago); the ranking is based on global sales. As you can see, Wii Sports did quite well. I do have a theory about first place, though: I remember that Wii Sports was bundled with the Wii console itself, so the global sales for Wii Sports might reflect (some of) the Wii console sales rather than people intentionally buying Wii Sports.

Next in line is Super Mario Bros, a classic game that has been around since 1985, according to this dataset.

Sales per publisher

Up next we have the top 20 game publishers based on global sales. Earlier we saw that Wii Sports is the most sold title; here we see that its publisher, Nintendo, also has the most sales of all publishers. If you scroll back up to the top 10 titles, you’ll see that the list is completely dominated by Nintendo.

Sales per year

This is my favorite graph for this dataset. It reflects the trend in sales over the last 30+ years, and we see a clear upward trend. I would carefully speculate that video games have greatly increased in popularity. There is a bit of a downward trend towards the end of the graph; however, this may be an artifact of the dataset’s cutoff, and adding more recent data would likely change it. I would assume that global video game sales are in fact still increasing.

Sales per platform

This one surprised me the most. Based on the other graphs and the table, I expected the Wii to do much better. But it appears that the PS2 was the platform with the most game sales (as of three years ago).

Conclusion

For this dataset I carried out a simple analysis that surfaced some basic trends. Unfortunately, the dataset is not up to date. I would assume you would find even more interesting trends if you were to include data from the last three years.

However, the most important piece of information this dataset provides is that gaming appears to have grown as a market. My assumption would be that video games have grown more diverse in their genres and titles and therefore cater to a wider audience.

Do you want to use this dataset? Download it here on Kaggle. You’ll need to create an account, which is completely free.


Analysis: Weather in Amsterdam (November 9 – 16, 2019)

In a previous post I embedded a Jupyter Notebook that used DarkSky’s API to pull weather data about a random location in Amsterdam. In this post I display some of the graphs plotted from the data.

The data consists of 168 data points: hourly predictions of the weather in Amsterdam, ranging from November 9th 19:00 to November 16th 19:00.

Unfortunately, I was not able to properly save the plots as images. Therefore, to read the x axis you might have to resort to some eye squinting and zooming in (ctrl + scroll up).

Data cleaning

After I loaded the API data into a dataframe, I looked through DarkSky’s documentation to understand the data. The API call I used fetched most of the data in imperial units. As I am more familiar with the metric system, I used formulas to convert them to metric units. Second, I changed the Unix time column in the dataframe to a datetime. I also added a shortened version of the time to make the x axis of the plots more legible.

You should also be able to fetch data in metric units directly through the API. However, I wanted to play around with the units myself.

import pandas as pd

# Fahrenheit to Celsius
df['temperature'] = (df['temperature'] - 32) * (5/9)
df['apparentTemperature'] = (df['apparentTemperature'] - 32) * (5/9)
df['dewPoint'] = (df['dewPoint'] - 32) * (5/9)
# Miles to kilometers
df['visibility'] = df['visibility'] * 1.609344
# Unix timestamps to datetime
df['time'] = pd.to_datetime(df['time'], unit='s')
# Shortened format for the x axis: day, abbreviated month name, hour and minutes
df['time_short'] = df['time'].dt.strftime('%d %b %H:%M')

Graphs

In order to plot these graphs, I used matplotlib. To save the images, I used:

import matplotlib.pyplot as plt
plt.savefig('temp.png')

Temperatures in Amsterdam

This line graph displays two lines: the blue line shows the predicted hourly temperature, and the orange line shows what that temperature will feel like.
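The graph itself is a short matplotlib call; a minimal sketch, assuming the cleaned dataframe from the data cleaning section above:

import matplotlib.pyplot as plt

# `df` holds the converted temperatures and the shortened time labels
plt.plot(df['time_short'], df['temperature'], label='temperature')
plt.plot(df['time_short'], df['apparentTemperature'], label='feels like')
plt.xticks(rotation=90)  # 168 hourly labels need rotating to stay readable
plt.ylabel('°C')
plt.legend()
plt.savefig('temp.png')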

Precipitation in Amsterdam

This line graph shows the precipitation in millimeters per hour.

Wind speeds in Amsterdam

This line graph shows the wind speeds in Amsterdam in kilometers per hour.

Dew points in Amsterdam

This line graph shows the dew point in Amsterdam over time; the data points are in degrees Celsius.

Want to use this weather API?

Go to DarkSky’s website, make an account, and get your own free key!


Jupyter Notebook: Weather in Amsterdam

I found an interesting and free API: DarkSky. It allows you to pull weather-related data for any location you desire. All you have to do is sign up, and you’ll receive a key that gives you access to their API.

I pulled hourly data from their API for a random location in Amsterdam. Below I have embedded a Jupyter Notebook in which I plot different weather-related factors such as temperature, wind speed, visibility, and precipitation.
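For reference, a minimal sketch of the kind of request involved, based on DarkSky’s endpoint format at the time; the key and the Amsterdam coordinates are placeholders:

import requests

key = 'YOUR_DARKSKY_KEY'  # placeholder: the key you receive after signing up
lat, lon = 52.37, 4.89    # roughly central Amsterdam

resp = requests.get(
    f'https://api.darksky.net/forecast/{key}/{lat},{lon}',
    params={'units': 'si'},  # si should return metric units directly
)
hourly = resp.json()['hourly']['data']  # list of hourly forecast dicts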