9923170071 / 8108094992 info@dimensionless.in

Data wrangling, in simpler terms, it is the process of transforming raw data into another format to make it more suitable for analytics. It may involve structuring and cleaning the data to make it more usable. For eg: Filtering the data by a particular year, replacing null values with mean and merging two or more columns. These were just a few examples of data wrangling. As a data scientist, you would have to probably spend most of the time in data wrangling.

Data wrangling with data science

Source: forbes

 

Importance of Data Wrangling

 

1. Improve the Quality – Many a time; the data is untidy and incomplete such as missing values, noisy values. So by cleaning the data, we improve its quality to improve accuracy.

2. Makes Data Presentable – Sometimes, the data is in a different format than needed for which we may want to change its format and schemas.

3. Scaling Down the Data – The data available might be too huge in size for which we would have to reduce its size. Also, we may want to analyze specific things about specific variables for which we need to use data wrangling.

4. Finding Outliers – Sometimes, the data may contain outliers or errors that may need to be removed or replaced. Also, in some cases the data might be biased which can lead to sampling error, for example, interpreting the global opinion about the most unsafe countries; the participants might be from a particular country, or the questions might be too value or biased.     

 

Data Wrangling with R

In this article, I have demonstrated some examples in R to make the data clean and tidy using the packages tidyr, dplyr and stringr.

This is the tourist data file which shows the number of tourists that visited India in particular years from various countries-

Link

Read this data and store it in variable-

tourist_datafile <- 
read.csv("C:\\Users\\hp\\Desktop\\Dimensionless\\tourist_datafile.csv")

view(tourist_datafile)

 

data stored

 

Gather Columns into Rows

Notice the variable Country is in rows and the variable year is in column. This means the dataset is untidy. So, we need to bring the years in rows and countries in columns.  

For this purpose, we will use the package tidyr, install and the load the package with these commands-

install.packages(“tidyr”)

library(tidyr)

 

Now let’s understand what Gather function is. Gather takes multiple columns and collapses into key—value pairs, duplicating all other columns as needed. We use gather function when the columns are not variables. Here the year are mentioned in the column and we want to bring it I rows. So, first we will use gather function to bring all columns in rows-

tourist1 <- gather(data = tourist_datafile,....=X2001:X2015) 

view(tourist1)

 

Data stored in column and rows

Here, we have specified the dataset and the columns (that is from X2001 To X2015). Notice in the dataset tourist_datafile, there are 19 rows and in new dataset tourist1, there are 285 rows. Basically, in old data, those 19 rows represent 19 countries, and now we have gathered the years (15 years from 2001 to 2015) mentioned in columns into rows, so all those years and the value corresponding to year and country have collapsed in the form of key value pairs. So we have 15 multiplied by 19 rows that makes it 285 rows.

We can give a specific name to key and value-

tourist2 <- gather(data = tourist_datafile, key = "year", 
value = "number_tourists", X2001:X2015)

view(tourist2)

 

 

Spread Rows into Columns

So now we want to pull the countries into the columns. For that we would use Spread function which spread a key value pair into multiple columns.

tourist3 <- spread(tourist2, key = "Country", value = "number_tourists")

 

Notice, here we have mentioned key as ‘Country’ as we want to spread Country.  

view(tourist3)

 

spread rows in column

The countries that represent the variable are now in columns and the years in rows.

 

Modify Patterns of a String

Notice in the year column, there is X before every year such as X2001.

So we will be cleaning the year column by removing the X from it using str_replace which is used to replace patterns in a string because the class of year column is string. It is in ‘stringr’ package, so first we will install and load this package.

install.packages(“stringr)

library(stringr)

tourist3$year <- str_replace(tourist3$year, "X", "")

 

Here, we have modified the year column in the data by replacing the pattern X with a null string.

So now we have cleaned values in the column year.

 

Separating a Column

Download the dataset from here – 

Let’s read the dataset-

date <- read.csv("C:\\Users\\hp\\Desktop\\Dimensionless\\date.csv")

view(date)

 

Suppose we want to separate the date column into day, month and year; we could do use separate function for that-

separate(date, col = V1, into = c("Day","Month","Year"), sep = "/")

 

Here, the first argument is the dataset, and then in ‘col’, we specify the column that we want to separate. In ‘into’ we specify the names of the columns that we want, here since we want 3 columns, I have passed a vector of 3 values in it. ‘sep’ is used to specify the separator in the column.

date_seperate <- separate(date, col = V1, into = 
c("Day","Month","Year"), sep = "/")

 

Let’s view the data-

view(date_seperate)

 

Separated columns

So the column V1 has been separated into 3 columns.

 

Uniting the Columns

Similarly, we can unite multiple columns into one-

date_unite <- unite(date_seperate, col = "Date", 
c("Day", "Month", "Year"), sep = "-")

 

Here, the first argument is the dataset; in ‘col’ , we specify the name of the new column, then, we specify the columns that we want to unite. If we don’t specify the columns, it would unite all columns. And then we specified the separator using ‘sep’. Rather than specifying the columns as a vector, we can also pass it as variables.

unite(date_seperate, col = "Date", Day, Month, Year, sep = "-")

 

So, now let’s view the dataset-  

view(data_unite)

 

uniting column

 

Convert to Date Type

If we check the class of the date column, it is character.

So, we’ll convert the column to Date type and set the format accordingly-

date_unite$Date <- as.Date(date_unite$Date, format = "%d-%m-%Y")

class(date_unite$Date)

 

 

Removal of NAs

Household data –

https://drive.google.com/open?id=19j8zv6dzly5g87f1np8U25qWZUnVlHwK

household_data <- read.csv("C:\\Users\\hp\\Desktop\\Dimensionless\\household.csv")

view(household_data)

 

removal of NAs in Household data

Notice the last 6 rows and 4 columns in the data, they all are null.

The dimensions of the data are 18*16

dim(household_data)

 

So we can remove them by sub-setting the data.

household_data <- household_data[1:15,1:10]

 

removed data by sub setting the data

Notice after removal of the last few rows and columns, there are null values still remaining in the data in the columns pernum and sex; and in 8th row in column sex, there is a value called NULL which depicts that it is a null value but R would treat it as any other value. 

household_data[8,6]

 

It is treating NULL as a level along with female and male.

So, we need to change this value such that R depicts is as a null value.

household_data[household_data == "NULL"] <- NA

 

It will change all the values that are written as NULL to NA

view(household_data)

 

changed column

Notice that the 8th row and the 6th column has been changed to NA

 

Replace NA with a Value

We can replace the missing value to make better analyses of the data.

Let’s replace the NAs in pernum column to the mean of the data, since NA are present, so we have to specify na.rm as TRUE,

mean(household_data[,4], na.rm = TRUE)

 

Since in this data, all columns are not numeric, so we cannot change all values that are NULL to their mean as it will throw error. So, here I have specifically changed the column 4’s NA values to its mean.

household_data[is.na(household_data[,4]),4] <- 
mean(household_data[,4], na.rm = TRUE)

 

replacing NA with value

Notice the NA values in pernum column are changed to 3.

Now since in the 6th column, the values aren’t numeric, so will change the NAs in it with the gender that appears most number of times.

table(household_data$sex)

 

Here the frequency of female is greater than male, so I will replace the NAs with female.

household_data[is.na(household_data$sex),"sex"] <- "Female"

view(household_data)

 

replaced na with female

 

Filtering the Data

Let’s say we want to analyze the data for females whose households are under 1970 definition –

household_filter <- filter(household_data, gq == "Households under 1970 definition" & 
sex == "Female")

 

Let’s say we want the data of some particular years in the tourist data-

tourist_filter1 <- filter(tourist3, year == c(2013, 2014, 2015))

view(tourist_filter1)

 

filriting data chart

These were some basic examples of data wrangling with R.

To learn more of it, you can refer to our course on Data Science –

Data Science using R & Python

To read about data visualization, refer to this blog-

Data Visualization with R