The Joy and Meaning of Exploration

Exploration of a text-centered dataset

Paul Baharet
7 min read · Sep 23, 2022
Vincent van Gogh — Champs près des Alpilles, 1889.

Exploring a dataset is always exciting. One never knows what's hidden between the lines, behind the bushes, what patterns will emerge, or what conclusions will arise. This stage is rewarding, full of surprises and illustrations, and it embodies the data scientist's spirit of discovery.

We will perform a quick (not rigorous) exploration using Rohit Kulkarni's ABC (Australian Broadcasting Corporation) news headlines dataset (https://www.kaggle.com/datasets/therohk/million-headlines).

This dataset is simple and lends itself well to showing the basic techniques used when dealing with text. We will also apply some common sense and intuition along the way.

Be aware of your expectations of what we'll uncover as you read. Ask yourself, and take a mental note: 'What conclusions will we reach?' Right now the answer should be, 'I have no idea,' and it should evolve as the article continues. We'll revisit this at the end.

Let's fire up a Jupyter Notebook and get going.

Loading

The first step is loading the dataset. It looks trivial in this example, but it is not once the dimensions of the dataset grow: decoding or sampling (taking subsets for memory management or for specific algorithms) might be required. Here, though, the set is a small 22 MB CSV file that loads directly into a pandas DataFrame.

import pandas as pd
# Load the headlines CSV straight into a pandas DataFrame.
df = pd.read_csv('drive/MyDrive/abcnews-date-text.csv')

Overview

The overview gives us a first glance at the basic structure and nature of the dataset. In this case, it's simple.
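
A minimal sketch of that first look (the column names publish_date and headline_text are assumed from the Kaggle CSV):

print(df.head(3))   # first three entries
print(df.shape)     # dimensionality of the DataFrame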

The last line shows us the dimensionality of the DataFrame: (1,244,184 x 2). Look at how the date is formatted.
First three entries.
Notice the text is pre-processed into all lowercase characters. Some information is lost by doing this, but it's a prerequisite for most text operations.

The dataset contains two columns: one with the date and a second with the headline. Notice how the date shows up as an 8-digit integer. Let's check the data types.
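
A quick check (a sketch; df is the DataFrame loaded above):

# The date column comes back as int64; the headline column as object (strings).
print(df.dtypes)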

Side note: int64 is overkill for an 8-digit number. In a different situation, this could be optimized by downcasting, as sketched below.
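
Purely illustrative (we don't need it here, since we convert the column to timestamps next), a downcast could look like this:

# 8-digit values fit comfortably in int32; downcast='integer' picks the smallest fitting dtype.
downcast_dates = pd.to_numeric(df['publish_date'], downcast='integer')
print(downcast_dates.dtype)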

Pre-processing and Exploration

Once the dataset's nature is understood, data manipulation begins. The format of the date column is not useful to us as is, so let's convert it into pandas' datetime (Timestamp) type. This will be particularly useful when plotting.

It is *much* more efficient to assign an entire list (or Series) as a column than to manipulate the DataFrame row by row.
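
A minimal sketch of that conversion, done on the whole column at once rather than row by row (the column name is assumed from the Kaggle CSV):

# Parse all 8-digit YYYYMMDD integers in one vectorized call and assign the result back as a column.
df['publish_date'] = pd.to_datetime(df['publish_date'].astype(str), format='%Y%m%d')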

Now that we know what this two-column, million-row headline dataset looks like, what should we do? There is no clear path to take from here. All we know is that we have all the headlines and their dates.

Given that we know so little, we should take a topographical approach and see if any leads emerge. Let's look at the surface of the dataset.

Perhaps a frequency count of words is a good starting place to understand the set better. Let’s find out which words are the most frequent in the corpus (all the headlines).

We will use the collections module, which provides additional data structures, and NLTK (the Natural Language Toolkit).

We first aggregate all the headlines into one long string. Then we split the string into a list of words; in other words, we tokenize it. This is an atemporal view: we dispense with the dates in this particular exercise and focus only on the text.
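
A sketch of that aggregation and split (headline_text is the assumed column name):

# Join every headline into one long string, then split on whitespace to tokenize.
corpus = ' '.join(df['headline_text'])
tokens = corpus.split()

print(tokens[:6])   # the first six tokens
print(len(tokens))  # a bit over 8 million words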

Output is the first six tokens.
A corpus of a bit over 8 million words.

Let's remove the stop words. Stop words are so common that they convey little meaning, and it's advantageous to ignore them.
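
A sketch using NLTK's English stop-word list (the stopwords corpus may need to be downloaded first; the variable names are mine):

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
# Drop every token that appears in NLTK's English stop-word list.
tokens_clean = [t for t in tokens if t not in stop_words]

print(len(tokens_clean) / len(tokens))  # roughly 0.83, i.e. about 17% of tokens removed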

List of stop words used by NLTK.
List of tokens without stop words.
About 17% of the tokens were removed.

We can begin to work on the frequency dictionary once the stop words are removed. To build it, we use the following function.
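
A minimal sketch of such a function, matching the description below (the function name is mine):

from collections import Counter

def word_frequencies(tokens):
    # Count every token, then rebuild as a dictionary sorted from most to least frequent.
    counts = Counter(tokens)
    return {word: count for word, count in counts.most_common()}

freq = word_frequencies(tokens_clean)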

Inputs a list and returns a frequency dictionary.
We use a dictionary comprehension to sort from most to least frequent.
Each word is paired with the number of times it appears in the corpus.

The most frequent word is police. This is an interesting result: it is significantly more frequent than the rest of the words. There is a gloomy pattern in the most-used words, like fire, death, crash, court, charged, and murder. Perhaps this gives a small glimpse into a larger tendency in news media.

We can see that new is the second most-used word. It is a very common word in general, but note the Australian context, where New Zealand is likely mentioned often; each mention of New Zealand adds to the 'new' count. We can estimate this effect by looking at the value for 'zealand.'

Dictionaries are not meant to be used this way, but this is data science.
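
A sketch of that lookup (freq being the frequency dictionary built above):

# Pull the raw counts straight out of the frequency dictionary.
print(freq.get('zealand', 0))
print(freq.get('new', 0))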

We can confidently say that Zealand is not adding much to the frequency count of new: only 1,715 instances compared with the 33,734 total instances of new. We used general knowledge and intuition to predict that the frequency of new would be skewed by 'New Zealand.' In practice we were wrong; it accounts for only about 5% of the mentions of new. Not worth correcting.

Exploration has many of these 'failed attempts' at extracting value. This is part of the process, and a large part of it at that; one should get comfortable with it sooner rather than later.

Let's look into police a little further by narrowing our results to only those headlines that include 'police.'
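
A sketch of that filter (column names assumed from the Kaggle CSV):

# Keep only the rows whose headline mentions 'police'.
df_police = df[df['headline_text'].str.contains('police')]
print(len(df_police))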

There are a lot of mentions of police throughout the entire dataset, it seems. Let’s find out the daily frequency of mentions.
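
One way to count mentions per day (assuming publish_date was converted to timestamps earlier; the variable name daily is mine):

# Number of 'police' headlines per day; days with no mention simply don't appear.
daily = df_police.groupby('publish_date').size()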

Let's plot it. We will ignore the days when 'police' is not mentioned; for a rigorous analysis, we could add those days back and work with the density of mentions. We take this liberty because the subset is dense and the coverage of police is year-round.
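
A sketch of the plot with matplotlib:

import matplotlib.pyplot as plt

# Daily mentions of 'police' across the life of the dataset.
daily.plot(style='.', alpha=0.5, figsize=(12, 4))
plt.ylabel("daily mentions of 'police'")
plt.show()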

This is very interesting. There is a clear downward trend in the use of police. Go figure, I would not have imagined that. Let’s add a moving average to make it stand out more.

We use a window width of 10 days.
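
A sketch of the 10-day moving average overlaid on the daily counts:

import matplotlib.pyplot as plt

# A 10-day rolling mean smooths the day-to-day noise and makes the trend visible.
daily.plot(style='.', alpha=0.3, figsize=(12, 4))
daily.rolling(window=10).mean().plot()
plt.ylabel("daily mentions of 'police'")
plt.show()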

The downward trend is now easier to see. We can therefore conclude that mentions of the word police have decreased over the past two decades in ABC headlines.

What does this mean? I don't know. But we could keep digging if we wanted to. We saw a lot of negative words topping the word-frequency list, and there is much more to look into. We could also use NLP and dabble in sentiment analysis or entity recognition. We could even use embeddings to point us to the next step. But back to the initial question: what was your expectation of this dataset at the onset? I bet you a dollar you did not expect the roots and trunks to be 'police is being mentioned less and less.' I sure didn't.

In conclusion, we have seen how entertaining and exciting data exploration is. We can now better understand it as an open-ended exercise and be comfortable with it. We saw how having no clear path does not stop us from looking for one. We saw how not every effort yields useful or insightful results, and that is OK. And we saw how unpredictable the information that comes back is, and how to adapt to it regardless.

I wish you bold and happy explorations!

Vincent van Gogh — Tree Roots and Trunks, 1890
