Data Sources 101

Authors
Ayswarrya G

(This article originally appeared at KDNuggets.com here. For more, visit https://www.kdnuggets.com/)

Getting started with the data universe can be overwhelming. Big data, alternative data, primary data, internal data: the list goes on. These terms are easy to confuse, and understanding how they differ matters.

Here's why.

Data collection is one of the first steps of the data lifecycle — to analyze data, you need to get all the data you require in the first place (a no-brainer, right?).

To collect the right data, you need to know where to find it and determine the effort involved in collecting it.

That's why I wrote this article, to answer the most basic question: where does all the data you need (or might need) come from?

Sources of data

Before looking at sources of data, let's understand primary and secondary data.

Primary data — Data that you create yourself

When you create the data you want by yourself, it's called primary data. If you interview people to gather feedback on your product, the interview data is primary data.

Secondary data — Data that you collect from someone else

When you collect data from sources that someone else owns, it's called secondary data. If you use data from Google Analytics to understand how many people visit your website, you're using secondary data. It's still data on your organization, but it's something that another organization (in our example, Google) collected for you.

Figure

So far, so good, yeah?

Now let's build on this base. The data sources can either be internal or external.

Internal data — Data that you create, own or control

Internal data is private data that your organization owns, controls or collects. The sales data or financial data of your organization are examples of internal data.

Notice that I say data you create, own or control?

There's a reason why. Internal data can either be primary or secondary.

When you create data by surveying people within your organization and use these insights to show factors that influence workplace productivity, that data is internal and primary.

On the other hand, when you use data from Google Analytics to show that most of your website visitors search for alternative data products, such data is internal and secondary.

External data — Data from outside sources

External data is data collected from sources outside your organization. The data could be:

  1. Publicly available data such as census, electoral statistics, tax records and internet searches
  2. Private data from third parties such as Amazon, Facebook, Google, Walmart and credit reporting agencies like Experian

Can external data also be primary or secondary? If you're thinking along these lines, you're on the right track!

When you conduct interviews with data science leaders worldwide, you're collecting primary data, but from an external source. So, such data is external and primary.

When you use the results of surveys conducted by a platform like Kaggle or Stack Overflow, you're using data that's external and secondary.
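
If it helps to see the four combinations side by side, here's a minimal sketch in Python. The class names, field names and labels are my own illustration (not from any library); the four entries are simply the examples used above.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    PRIMARY = "primary"      # you created the data yourself
    SECONDARY = "secondary"  # someone else collected it


class Ownership(Enum):
    INTERNAL = "internal"  # data your organization creates, owns or controls
    EXTERNAL = "external"  # data from sources outside your organization


@dataclass(frozen=True)
class DataSource:
    # Hypothetical record type, used only for this illustration
    name: str
    origin: Origin
    ownership: Ownership

    def label(self) -> str:
        return f"{self.ownership.value} and {self.origin.value}"


# The four examples from this article, one per combination
SOURCES = [
    DataSource("Employee productivity survey", Origin.PRIMARY, Ownership.INTERNAL),
    DataSource("Google Analytics website traffic", Origin.SECONDARY, Ownership.INTERNAL),
    DataSource("Interviews with data science leaders", Origin.PRIMARY, Ownership.EXTERNAL),
    DataSource("Third-party results from Kaggle or Stack Overflow", Origin.SECONDARY, Ownership.EXTERNAL),
]

for source in SOURCES:
    print(f"{source.name}: {source.label()}")
```

The point of keeping two separate fields is that "who created it" and "where it lives" are independent questions, so any data set gets exactly one answer for each.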

But wait a minute... isn't there also something called Alternative Data?

Hold your horses! I was just about to mention it.

Alternative data is secondary data that's complex, unique and mostly unexplored. To understand alternative data, let's take a quick, 2-minute detour and look at big data.

Big data

Big data refers to massive volumes of structured, semi-structured or unstructured data that is too complex to be processed by traditional data systems (relational databases and data warehouses).

Formats of data

Structured what?

Data comes in several formats. Here's my quick take on the two most prominent ones (semi-structured data, such as JSON or XML, sits somewhere in between):

  1. Structured data: Data organized in a fixed format, typically rows and columns in a relational database (think of a table of customer orders)
  2. Unstructured data: Data without any particular format (think of surveillance footage); Gartner estimates that more than 80% of enterprise data is unstructured.
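
To make the contrast concrete, here's a rough sketch using Python's built-in sqlite3 module. The table, the customer names, the amounts and the review text are all made up for illustration. The structured part can be queried by column; the unstructured part is just text, so you have to decide for yourself how to pull meaning out of it.

```python
import sqlite3

# Structured: every row follows the same fixed schema (hypothetical table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("Asha", "laptop", 900.0), ("Ben", "phone", 500.0), ("Asha", "mouse", 25.0)],
)

# Because the format is fixed, a question maps directly onto a query.
total_for_asha = conn.execute(
    "SELECT SUM(amount) FROM purchases WHERE customer = ?", ("Asha",)
).fetchone()[0]
conn.close()

# Unstructured: free text with no schema (made-up product review).
review = "Bought the laptop last week. Battery is great, but the mouse feels cheap."

# There are no columns to query, so even simple questions need custom logic.
mentions_mouse = "mouse" in review.lower()
word_count = len(review.split())

print(total_for_asha, mentions_mouse, word_count)
```

A keyword check like this is about the crudest possible way to read unstructured text; real projects typically need far heavier tooling, which is part of why this kind of data is harder to work with.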

Examples of big data include social media data📱, transactional data💸 (stock prices, purchase histories), sensor data (location data, weather data) and satellite data📡. (Here's a fun read on big data that the world generates today)

Figure

The 4 Vs of big data by IBM. Image courtesy: IBM Big Data & Analytics Hub

Traditional data systems aren't fully equipped to process such large amounts of unstructured data.

Analyzing big data requires specialized technologies (a topic for another article, but if you're in a hurry, check out this handy wiki on big data tech).

That's where our quick detour ends.

So, alternative data is generally considered a kind of big data. Its use took off in finance: hedge funds turned to non-traditional data sets to inform investment decisions, and lenders used non-financial information such as rental payments and utility bills to estimate the lending risk of an individual. This data transformed the financial industry (see this article on how hedge funds are using alternative data).

Soon, other industries caught on to its potential and how it can help them maintain an edge over their competition.

Figure

Some examples of how alternative data is being used today. Image courtesy: Humans of Data

Some common examples of alternative data sets are:

  1. Satellite data
  2. Location data
  3. Financial transactions
  4. Online browsing activity
  5. Social media posts
  6. Product reviews

Final Word

So there you have it. This should give you a rough idea of where all the world's data comes from. Here's an illustration that quickly summarizes the various sources of data.

Figure

P.S. For the past few months, I've been working on a community project called The Atlan Data Wiki, a fun, helpful, jargon-free encyclopedia for navigating through the data universe. If you like my article, please check out the wiki, where I use a similar approach to tackle other such topics in data. I'd love to hear your thoughts on it.
