with python this could have been automated…
The human psyche has been a long-time fascination of mine. So much so that I felt I needed to watch every lecture I could find on William James before I had even enrolled in a psychology program. I was fully immersed in the world of the bystander effect, availability heuristics, and personality disorders.
2. doing the math
Soon after actually enrolling, I found myself drowning in statistical methods. Math did not come easily to me. I definitely had to put in extra hours to achieve any kind of passing grade. I fully understood the concepts; the math bits, however… I don’t recall ever feeling that frustrated before. The catch was that I actually really enjoyed all of it once I did manage to grasp it. Statistics became a puzzle I needed to solve. I wanted nothing more than to figure out the significance of any piece of research. I suddenly had a goal: to dissect the statistics used by researchers in scientific journals. What kind of flaws were they hiding? I felt like those people who automatically spellcheck any piece of text they read.
3. first encounter: writing syntax
I then seriously considered studying statistics. In my spare time I would download datasets and run all kinds of analyses on them. My free university edition of SPSS was an absolute godsend at this point. Most importantly, I thoroughly enjoyed writing SPSS syntax. I was able to trace my thought process and I could quickly replicate tests. Yet it didn’t take long for this sentiment to dissolve. I was shocked to find out how limiting SPSS really was. I mean, yes, it is nice software, but what if I wanted to go beyond SPSS’s capabilities? That is when I found out about R.
4. whoever created R…
I downloaded RStudio, and again, I felt as confused as I had when I was first confronted with the very idea of statistics. R made no sense to me. At this point most of my statistics journey took place outside of university. I decided not to go for a master’s in statistics. I wanted to understand internet culture. So, I was limited to making sense of R on the weekends. My master’s was all about qualitative research, so no statistics in sight. However, to understand internet culture, I needed to use tools to scrape the web. Suddenly, I realized I needed to learn an actual programming language (Sorry, R). In order to pull data using an API, I needed to use Python.
My R weekends were soon replaced with Python weekends. This is when my love-hate relationship with programming started. I felt on top of the world whenever my code worked, but unbelievably impatient and frustrated when I couldn’t get it to work. This was also the first time I ever experienced what they call flow. I have pretty good time management skills, but Python threw them all out of the window. I worked for hours on a script that would sort the characters of a string into alphabetical order. I couldn’t believe it; I seemed to forget the very concept of time. My love-hate relationship turned into a full-on love for Python and programming.
6. more data and more pandas please
I still enjoy reading about psychology and groundbreaking experiments. And I frequently try to catch up on developments in internet culture. However, I felt I needed to further develop my technical skills. I no longer wanted to work with small datasets; I wanted big data. That is why I decided to go for a traineeship in data engineering. I dropped Python basics for a Python library: pandas. It was like doing statistics on steroids. Never have I experienced statistics like that.
7. think like a computer
Don’t worry, I have not actually abandoned Python basics; I just temporarily put those lessons on hold. But now I’m back at it. When I first tried Python, I could not get myself to think like a computer. I wrote two lines of simple code and expected the IDE to just “get it”. I read ‘Python for Dummies’ and found an anecdote that finally made me understand computers. If I tell someone who has never toasted bread before to “just put the bread in the toaster”, they will probably try to force a loaf of bread, packaging and all, into a toaster. You can’t just tell a computer to “do something”. It needs a full rundown.
Now that my programming has gotten a bit better and computers and I vibe well, I have grown impatient. Any time I find myself using software such as Excel or Power BI, or even a query language such as SQL, I catch myself thinking: “With Python this could have been automated”. Or “with Python this would have been solved in 3 steps instead of 10”. Programming will make you realize how much you can customize. This even includes the software you use. Imagine if you could tweak everything you use. This thought process led me to Linux. I loved Windows for its user friendliness; however, it does feel like you’re stuck in a box. For a hardcore Windows user, Linux has not exactly been smooth sailing. I am still trying to figure out some of the compatibility issues that I am experiencing. Yet, I have close to full control.
After scavenging Kaggle for new datasets to play around with, I found an older one I have been interested in for a while now: video game sales. It’s a dataset from about three years ago that was scraped from a website that tracks video game sales and ratings.
The data is three years old, which is quite unfortunate as the video game market has greatly expanded over the last few years. Multiplayer online games such as Fortnite now have a dominant position on the market. Therefore, I wanted to find a way to get a dataset that contains data from the last three years as well. After scraping for hours, I had an up-to-date dataset. However, I quickly noticed that the dataset I ‘created’ was missing a lot of important values. Thus, I decided to stick to the dataset I found on Kaggle.
If you want to try using the scraping script that I found on GitHub, download the script here. I would recommend calling time.sleep inside the ‘for loop’ that scrapes the data. If you do not do this, you might get an error (HTTP 429, Too Many Requests) as you’re sending too many requests to their server in a short amount of time.
I found that the shortest delay that did not trigger an error was 25 seconds. You will have to let it run for days if you want to have a full dataset. But if you would rather not let it run for days, try changing the number of ‘pages’ at the beginning of the script. This will reduce the amount of data, but you won’t have to wait for 4 days for your data to be done. I ended up scraping 1 page. Unfortunately, I did notice that the scraped data had a lot of missing values.
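The delay can be sketched roughly as follows. This is a minimal illustration rather than the actual scraping script: `scrape_pages`, `fetch_page` and `fake_fetch` are hypothetical names, and the real script's loop will look different.

```python
import time
from typing import Callable, Iterable, List


def scrape_pages(
    pages: Iterable[int],
    fetch_page: Callable[[int], list],
    delay_seconds: float = 25.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List:
    """Fetch each page in turn, pausing between requests."""
    rows = []
    for index, page in enumerate(pages):
        if index > 0:
            # Wait before every request except the first one,
            # so the server is not hit in quick succession.
            sleep(delay_seconds)
        rows.extend(fetch_page(page))
    return rows


if __name__ == "__main__":
    # Stand-in for the real HTTP request: returns fake rows for a page number.
    def fake_fetch(page: int) -> list:
        return [f"game-{page}-{i}" for i in range(2)]

    # delay_seconds=0 keeps the demo fast; the post suggests 25 for real runs.
    print(scrape_pages(range(1, 4), fake_fetch, delay_seconds=0))
```

Passing a shorter range of pages is the equivalent of lowering the number of ‘pages’ at the top of the script.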
For all of these reasons, I decided to stick to the premade dataset I found on Kaggle.
If you do decide to scrape the whole dataset, my advice would be to slightly change the scraping script. Try moving up the portion of the code that saves the scraped data to a dataframe and CSV file, so that it runs inside the loop. If your internet connection drops or you get some kind of error, at least you will still have data saved to disk.
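One way to do this is to append each page to the CSV file as soon as it has been fetched. The sketch below uses only the standard library; `scrape_and_save` and `fetch_page` are hypothetical names and not part of the original script.

```python
import csv
import os
from typing import Callable, Iterable, List, Sequence


def scrape_and_save(
    pages: Iterable[int],
    fetch_page: Callable[[int], List[Sequence]],
    csv_path: str,
    header: Sequence[str],
) -> int:
    """Append each page's rows to csv_path as soon as they are fetched."""
    total = 0
    for page in pages:
        # If this call raises, everything fetched so far is already on disk.
        rows = fetch_page(page)
        write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
        with open(csv_path, "a", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            if write_header:
                writer.writerow(header)
            writer.writerows(rows)
        total += len(rows)
    return total
```

If the connection drops on page 300, the first 299 pages are still in the file and the run can be resumed from there.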
Please keep in mind that the dataset is from about three years ago. Therefore, you will not see games such as Minecraft or PUBG on the list.
For the analysis I wanted to know 4 different things:
- Top 10 titles in gaming
- Sales per publisher
- Sales per year
- Sales per platform
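These four questions map onto a few pandas calls. The sketch below is an illustration under assumptions: the column names (`Name`, `Platform`, `Year`, `Publisher`, `Global_Sales`) and the tiny sample frame are made up for the example, and the real Kaggle file may be laid out differently.

```python
import pandas as pd

# Made-up sample rows; with the real file this would be something like
# games = pd.read_csv("vgsales.csv")  (file name is an assumption).
games = pd.DataFrame(
    {
        "Name": ["Game A", "Game B", "Game C", "Game D"],
        "Platform": ["Wii", "PS2", "PS2", "Wii"],
        "Year": [2006, 2004, 2004, 2009],
        "Publisher": ["Pub X", "Pub Y", "Pub X", "Pub X"],
        "Global_Sales": [10.0, 4.0, 6.0, 2.5],
    }
)


def total_sales_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Sum global sales per group, largest first."""
    return frame.groupby(column)["Global_Sales"].sum().sort_values(ascending=False)


# Top 10 titles by global sales
top_titles = games.nlargest(10, "Global_Sales")[["Name", "Global_Sales"]]

# Top 20 publishers by global sales
sales_per_publisher = total_sales_by(games, "Publisher").head(20)

# Sales per year, in chronological order for a line plot
sales_per_year = games.groupby("Year")["Global_Sales"].sum().sort_index()

# Sales per platform
sales_per_platform = total_sales_by(games, "Platform")

print(top_titles)
print(sales_per_publisher)
print(sales_per_year)
print(sales_per_platform)
```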
Top 10 titles in gaming
Here we see the top 10 titles in video games (as of three years ago). The ranking is based on global sales. As you can see, Wii Sports did quite well. I do have a theory about that first place, though. I remember that Wii Sports came bundled with the Wii console itself. So the global sales for Wii Sports might reflect (some of) the Wii console sales as opposed to people intentionally buying Wii Sports.
Next in line is Super Mario Bros, a classic game that has been around since 1985, according to this dataset.
Sales per publisher
Up next we have the top 20 game publishers based on global sales. Earlier we saw that Wii Sports is the best-selling title. Here we see that its publisher, Nintendo, also has the most sales out of all publishers. If you scroll back up to look at the top 10 titles, you’ll see that the list is completely dominated by Nintendo.
Sales per year
This is my favorite graph for this dataset. It reflects the trend in sales over the last 30+ years. We see a clear upward trend in sales. I would cautiously speculate that video games have greatly increased in popularity. We see a bit of a downward trend towards the end of the graph. However, I would like to point out that this might be an illusion, as adding more recent data could change this trend. I would assume that global video game sales are currently still on the rise.
Sales per platform
This one surprised me the most. Based on the other graphs and the table, I expected the Wii to do much better. But it appears that, as of three years ago, the PS2 was the platform with the highest game sales.
For this dataset I carried out a simple analysis that shows some basic trends. Unfortunately, the dataset is not up to date. I would assume that you would find even more interesting trends if you were to include data from the last three years.
However, the most important piece of information this dataset provided is that gaming appears to have grown as a market. My assumption is that video games have grown more diverse in their genres and titles and therefore cater to a wider audience.
Do you want to use this dataset? Download it here on Kaggle. You’ll need to create an account, which is completely free.
In a previous post I embedded a Jupyter Notebook that used DarkSky’s API to pull weather data about a random location in Amsterdam. In this post I display some of the graphs plotted from the data.
The data consists of 168 data points. These are hourly predictions of the weather in Amsterdam. The data ranges from November 9th 19:00 to November 16th 19:00.
Unfortunately, I was not able to properly save the plots as images. Therefore, to read the x axis you might have to resort to some eye squinting and zooming in (ctrl + scroll up).
After I loaded the API data into a dataframe, I looked through DarkSky’s documentation to understand the data. The API call that I used mostly fetched data in imperial units. As I am more familiar with the metric system, I used formulas to convert the values to metric units. Second, I changed the Unix time column in the dataframe to ‘datetime’. I also added a shortened version of the time to make it more legible on the x axis of the plots.
You should also be able to fetch data in units from the metric system through the API. However, I wanted to play around with the units myself.
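As a rough sketch of what such a request looked like: DarkSky's forecast endpoint took a `units` query parameter (with `si` for metric values) and `extend=hourly` for hour-by-hour data covering the next 168 hours. The key and coordinates below are placeholders, and `build_forecast_url` is a hypothetical helper, not code from the notebook.

```python
from urllib.parse import urlencode


def build_forecast_url(
    api_key: str,
    latitude: float,
    longitude: float,
    units: str = "si",
    extend_hourly: bool = True,
) -> str:
    """Build a DarkSky forecast URL that asks for metric (SI) units."""
    query = {"units": units}
    if extend_hourly:
        # Ask for hourly data for the next 168 hours.
        query["extend"] = "hourly"
    return (
        f"https://api.darksky.net/forecast/{api_key}/"
        f"{latitude},{longitude}?{urlencode(query)}"
    )


# Placeholder key and approximate coordinates for Amsterdam.
print(build_forecast_url("YOUR_KEY", 52.37, 4.89))
```

With `units=si` the temperatures arrive in degrees Celsius and visibility in kilometers, so the manual conversions are not needed.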
# Fahrenheit to Celsius
df['temperature'] = (df['temperature'] - 32) * (5/9)
df['apparentTemperature'] = (df['apparentTemperature'] - 32) * (5/9)
df['dewPoint'] = (df['dewPoint'] - 32) * (5/9)

# Miles to kilometers
df['visibility'] = df['visibility'] * 1.609344

# Unix to datetime
df['time'] = pd.to_datetime(df['time'], unit='s')

# Changing the datetime format
# I want to see: day, shortened month name & hour and minutes
df['time_short'] = df['time'].dt.strftime('%d %b %H:%M')
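The same conversions can be wrapped in small reusable functions. This is a sketch under assumptions: it assumes wind speed arrives in miles per hour and precipitation intensity in inches per hour, and the column names `windSpeed` and `precipIntensity` mentioned below are assumptions as well.

```python
MILES_TO_KM = 1.609344   # kilometers in one mile
INCHES_TO_MM = 25.4      # millimeters in one inch


def fahrenheit_to_celsius(value: float) -> float:
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (value - 32) * 5 / 9


def miles_to_kilometers(value: float) -> float:
    """Convert miles (or miles per hour) to kilometers (per hour)."""
    return value * MILES_TO_KM


def inches_to_millimeters(value: float) -> float:
    """Convert inches (or inches per hour) to millimeters (per hour)."""
    return value * INCHES_TO_MM


print(fahrenheit_to_celsius(50))
print(miles_to_kilometers(10))
print(inches_to_millimeters(0.1))
```

On a dataframe column this would look like df['windSpeed'] = df['windSpeed'].apply(miles_to_kilometers), and likewise for df['precipIntensity'] with inches_to_millimeters.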
In order to plot these graphs, I used matplotlib. To save the images, I used:
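One common way to write a matplotlib figure to disk is `savefig`. The sketch below is not necessarily the call used for these plots; the data values and the file name are made up for illustration.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt

# Made-up values standing in for the time_short and temperature columns.
hours = ["09 Nov 19:00", "09 Nov 20:00", "09 Nov 21:00"]
temperature = [8.1, 7.6, 7.2]

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(hours, temperature, label="temperature")
ax.set_xlabel("time")
ax.set_ylabel("degrees Celsius")
ax.tick_params(axis="x", rotation=45)  # rotate tick labels so they stay readable
ax.legend()

# bbox_inches="tight" keeps the rotated labels from being cut off.
output_path = os.path.join(tempfile.gettempdir(), "temperature_example.png")
fig.savefig(output_path, dpi=150, bbox_inches="tight")
plt.close(fig)
print(output_path)
```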
Temperatures in Amsterdam
This line graph displays two different lines. The blue line shows the predicted temperature per hour. The orange line shows the apparent temperature, i.e. what that temperature will feel like.
Precipitation in Amsterdam
This line graph shows the precipitation in millimeters per hour.
Wind speeds in Amsterdam
This line graph shows the wind speeds in Amsterdam in kilometers per hour.
Dew points in Amsterdam
This line graph shows the dew point in Amsterdam over time. The data points are in degrees Celsius.
Want to use this weather API?
Go to DarkSky’s website, make an account and get your own free key!
I found an interesting and free API: DarkSky. DarkSky allows you to pull weather-related data for any location you desire. All you have to do is sign up and you’ll receive a key that will give you access to their API.
I pulled hourly data from their API for a random location in Amsterdam. Below I have embedded a Jupyter Notebook in which I have plotted different weather-related factors such as temperature, wind speed, visibility, and precipitation.
– Jupyter Notebooks in ‘tree mode’
– Jupyter Notebook terminal or Linux terminal
– A Github account
– A WordPress account
– The ‘Gist’ extension
This guide will first help you create a ‘gist’ of your notebook, which will then be embedded into your WordPress post.
Installing the gist extension on my Windows machine
To get the gist extension up and running on my Windows machine I ran the following commands in the Anaconda terminal:
pip install jupyter_contrib_nbextensions
pip install jupyter_nbextensions_configurator
Installing the gist extension on my Linux machine
The Windows installation method did not work on my Linux machine. To achieve the same thing, I used the following command in my Linux terminal:
conda install -c conda-forge jupyter_contrib_nbextensions
Enabling the Gist extension
Next, make sure that you’re viewing the Jupyter Notebook you want to embed in your WordPress post in ‘tree mode’.
Second, we are going to enable the ‘Gist’ extension. To do this, go to Edit > nbextensions config. This will open a new window with a list of extensions. Look for ‘Gist-it’. When you click on ‘Gist-it’, you’ll be able to set the parameters for the extension.
The first parameter we need to set is the ‘Github personal access token’. To get this token, you’ll need to go to your personal GitHub account. The token can be generated at: https://github.com/settings/tokens. Click on ‘generate new token’. Then fill in a name for the token in the ‘Note’ textbox. After that, select the ‘gist’ scope. Right after creating the token, you’ll see your own personal access token. Copy it. DO NOT share this token with other people.
After copying the personal access token from your GitHub account, go back to the nbextensions config window and paste the token into the first parameter field, ‘Github personal access token’. Next, check the ‘Gists default to public’ box.
Sharing the notebook to gist
Go to the notebook you would like to add to your GitHub gists (in tree mode) and click on the GitHub logo that is now present in the Jupyter Notebook toolbar.
When you click on the logo, a new window will appear that asks for a gist id. If this is the first notebook you’re adding to your gists, you won’t need to fill in an id, so skip this field. Check the ‘make the gist public’ box. Then add a description; this will be the title of your gist.
Go to your gists on GitHub. You can easily access these by clicking on your profile picture in the top right-hand corner. A dropdown menu will appear. Click on ‘your gists’. If it went well, you’ll see your Jupyter notebook on this page.
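For the curious, a gist can also be created without the extension, through GitHub's REST API (a POST to https://api.github.com/gists with the same personal access token). The sketch below only builds the request and does not send it. `build_gist_request` is a hypothetical helper, the file name and token are placeholders, and this is not a description of how the Gist-it extension itself is implemented.

```python
import json
import urllib.request

GISTS_URL = "https://api.github.com/gists"


def build_gist_request(
    token: str,
    filename: str,
    content: str,
    description: str,
    public: bool = True,
) -> urllib.request.Request:
    """Build (but do not send) a request that would create a gist."""
    payload = {
        "description": description,  # shown as the title of the gist
        "public": public,
        "files": {filename: {"content": content}},
    }
    return urllib.request.Request(
        GISTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder token and notebook content, for illustration only.
request = build_gist_request("YOUR_TOKEN", "notebook.ipynb", "{}", "My first notebook")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would actually create the gist. Reading the token from an environment variable rather than typing it into the script keeps it out of anything you share.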
Embedding your gist on WordPress
Open the gist by clicking on it. Then copy the URL from your browser. This is the URL to your gist. Make a new WordPress post. Make sure your WordPress ‘block’ is switched to HTML. Paste the URL into the HTML block. And you’re done. Publish your post and look at your site. Your first Jupyter notebook on WordPress.
Over the past couple of months I have been learning to use pandas and dataframes in Jupyter Notebook. Below is a snippet of one of those notebooks, where I look at the football matches played in the FIFA World Cups since 1930.
Doug Laney introduced us to the first 3 Vs of big data back in 2001. The three original Vs were volume, velocity, and variety. As we have amassed more data over time, the volume of data has increased. Think about sensors on machines. We can now investigate how machines in a factory or a warehouse are doing based on continuous sensor readings. Furthermore, through our smart devices and social media platforms, even more data is being generated. The emergence of the internet of things (IoT) has brought us a goldmine of datasets.
What makes big data even more special is that this data arrives continuously in real time. We can monitor machines as they drill for oil, likes on social media posts are instantly registered, and rainfall is continually measured and recorded. These three examples fall under the second V, velocity. The speed at which data arrives has increased incredibly over the last decades. This has been facilitated by the increase in bandwidth and internet speeds.
Moreover, we have a myriad of different types of data formats now. Think about the different types of data an online store can generate. First, there are the click paths that people go through on the website. The information the customer fills in on the website. What items a customer ends up buying. What payment method they used. And those are just a few examples. All of these actions generate different data formats that need to be stored, processed, and analyzed.
Other scholars and big data engineers have added other Vs to the mix. Examples are variability, veracity, and visualization. But in this post I would like to discuss a different V, referred to as value. Big data on the surface of things seems great. We have a lot of information, which pleases those who adhere to the ‘law of large numbers’. In statistics there are many principles that point toward the idea that bigger is better. Think of the central limit theorem: the larger the sample size, the more closely the distribution of the sample mean approximates a normal distribution. And increasing the sample size is all about getting an answer that is closer to ‘reality’. We want a sample that represents the population.
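The ‘bigger is better’ intuition can be illustrated with a small simulation. This is only a sketch: the population is an arbitrary skewed (exponential) distribution with mean 1, chosen purely for illustration, and `sample_means` is a hypothetical helper.

```python
import random
import statistics


def sample_means(sample_size: int, repetitions: int, rng: random.Random) -> list:
    """Draw `repetitions` samples from a skewed population; return each sample's mean."""
    means = []
    for _ in range(repetitions):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(sum(sample) / len(sample))
    return means


rng = random.Random(42)
for n in (5, 50, 500):
    means = sample_means(n, repetitions=1000, rng=rng)
    # The average of the sample means stays near the population mean of 1,
    # while their spread shrinks as the sample size grows.
    print(n, round(statistics.mean(means), 3), round(statistics.stdev(means), 3))
```

With larger samples, the sample means cluster more tightly around the population mean, which is what makes a bigger sample a more trustworthy picture of ‘reality’.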
But let’s say we’re a company. We have loads of data. Statisticians would be jealous of our datasets. So much data. But what now? We let the data sit in a database or a distributed file system for a couple of weeks before we analyze it. We analyze it, and oops – it’s already too late. The interesting trends we found through our analysis are now irrelevant. That is why we should seek value in our data. It means acting fast, it means performing the right analyses. It also means realizing what we want to achieve with our data. Do we want to increase sales? Do we want to understand our population better? Do we want to facilitate better decision-making processes?
We have to know what we’re doing. That is why the value principle is so important. This principle builds on our other Vs as well: the volume, variety, and velocity of the data. Value is often overlooked, but it is essential to your big data solution.
This post is part of a series that looks at social media from different perspectives. My second interview was with a social media marketer.
We start by talking about social media itself. I am curious to hear her personal opinion on online platforms. She explains to me that she has a love-hate relationship with them. She is concerned about privacy breaches and bad algorithms: content that she liked once will repeatedly show up. However, she likes the fact that it is easy to get updates on the people she knows.
If you post it, it has to be perfect
People refuse to participate
My interviewee has worked with companies to improve their social media. For this she required participation from the company, though this was hard to achieve. People are excited about social media but refuse to participate. They’ll tell her “if you post [something], it has to be perfect” or “people do not want to see my content”. Thus, the people within companies do not believe their content is exciting enough to be shared, or fear that they will not be able to live up to the level of perfection that they feel is required. Furthermore, they experience difficulties posting content due to the strict rules management has imposed on them. And on top of that, people censor themselves when engaging with social media, which is fueled by social pressure from coworkers or their own striving for perfection. A sentiment the social media marketer shares with the people within a company is the anxiety that comes from ‘playing’ around with a company’s brand. You don’t want to ruin a company’s image.
To post is to exist
The social media marketer doesn’t just look at social media from her work perspective, but also as a consumer. If she’s looking for a specific company, and they haven’t posted for a month, she’ll start to question if they’re still around. Another social media pet peeve of hers is social media managers who do not know how to use social media. As an example she mentions people who do not put spaces between their hashtags. On some platforms this will result in one long hashtag that won’t function properly. On the other hand, she also wonders about self-proclaimed social media experts who claim to be able to help you expand your clientele. If these experts spend a lot of time on social media explaining their expertise… how will they have time left to actually help their clients? This way they are signaling that they might not have work. She compares it to ‘clean desk policies’. If people have a clean desk at work, are they spending time working or cleaning their desks? A clean desk then signals that they might not be working after all. A messy desk suggests the opposite.
Don’t think about all the things that can go wrong
You need a level of superficiality
Over the years she has learned that with social media you can’t go too ‘deep’, you need a level of superficiality to practice social media marketing. And you definitely shouldn’t “think about all the things that can go wrong”. A problem she faced while working on social media related content is that she would overthink it. You just need to think about your target audience and consider your statistics. People should look at the ratio between website visits and the call to action. If people merely visit your website, but don’t buy your products, then something is clearly wrong. If you are not getting that many visits, but most visitors buy your products, you’re on the right path. She explains to me that people have been making this mistake for decades. The same principle holds for flyers people would receive in the mail. If not done correctly, they would also not lead to more sales. She feels that there is a discrepancy between sales and marketing, which she calls ‘waste’. Both departments will end up blaming each other for the lack of sales. There is no group looking at why this waste is happening.
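The ratio she describes is usually called a conversion rate: the share of visits that end in the call to action. A tiny sketch with made-up numbers (the campaign names and figures are hypothetical, not from the interview):

```python
def conversion_rate(visits: int, conversions: int) -> float:
    """Share of visits that ended in the call to action (e.g. a purchase)."""
    if visits <= 0:
        raise ValueError("visits must be a positive number")
    return conversions / visits


# Hypothetical numbers: (visits, purchases)
campaigns = {
    "busy but unconvincing": (10_000, 50),
    "quiet but effective": (400, 60),
}

for name, (visits, purchases) in campaigns.items():
    print(f"{name}: {conversion_rate(visits, purchases):.1%} of visits led to a purchase")
```

In this example the quieter site sells more with a fraction of the traffic, which is the ‘right path’ she refers to.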
Who am I communicating with?
Doing business is still personal
She recommends that businesses retain a personal touch in their online communication and on their platforms in general. This could be in the form of pictures of employees or by showing the names of the people you communicated with. When she interacts with businesses online she asks herself “who am I communicating with?”. Knowing who the person is and whether they have talked before would foster understanding. For instance, in customer service, does the person know her case or does she have to explain it again? “Doing business is still personal”. We end the interview on the note that companies cannot survive without an online presence, as it makes it easy to find information about the company. She adds: “when I hear about a company [offline], I will look them up online first”.