An outlier is a data point that significantly differs from other observations in a dataset. It can be:

  - Unusually high or low compared to the rest of the data.
  - Anomalous due to measurement errors, data entry mistakes, or rare events.
  - A true extreme value that represents natural variation.
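To make the definition concrete, here is a minimal sketch of how a single extreme value shifts the mean while barely moving the median. The numbers are invented for illustration and are not taken from the dataset used in this post.

```r
# Hypothetical ages, invented for illustration only
ages <- c(25, 31, 28, 35, 42)
ages_with_outlier <- c(ages, 93)   # add one extreme value

# The mean is pulled strongly toward the extreme value
mean_before <- mean(ages)
mean_after  <- mean(ages_with_outlier)

# The median barely changes
median_before <- median(ages)
median_after  <- median(ages_with_outlier)

print(c(mean_before = mean_before, mean_after = mean_after))
print(c(median_before = median_before, median_after = median_after))
```

This sensitivity of the mean is one reason outliers are worth checking before computing summary statistics.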

Example dataset

The loaded data frame is named Data.

[Image: the example dataset]
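Because the dataset is only shown as an image, the sketch below builds a small stand-in data frame so the later steps can be run. All values are hypothetical. They were chosen only so that row 10 holds an Age of 93 and row 12 holds a Net_worth of 152000, matching the outliers discussed later. The real dataset may have more rows and columns.

```r
# Hypothetical stand-in for the real dataset (values invented for illustration)
Data <- data.frame(
  Age = c(25, 31, 28, 35, 42, 38, 29, 33, 45, 93, 36, 40, 27, 34, 39),
  Net_worth = c(12000, 15000, 9000, 20000, 18000, 22000, 11000, 16000,
                25000, 19000, 14000, 152000, 10000, 17000, 21000)
)

# Quick look at the structure and the first rows
str(Data)
head(Data)
```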

How to identify outliers

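  1. Basic summary function

summary() prints the minimum, quartiles, median, mean, and maximum of every numeric column. A minimum or maximum that lies far from the quartiles is a first hint of an outlier.

summary(Data)

Output:

[Image: output of summary(Data)]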

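  2. Visual methods (using a box plot)

A box plot draws any point lying more than 1.5 times the interquartile range beyond the box as a separate dot, so potential outliers stand out visually.

Plot Age on a box plot:

boxplot(Data$Age, main = "Age", col = "skyblue")

Output:

[Image: box plot of Age]

Plot Net_worth on a box plot:

boxplot(Data$Net_worth, main = "Net worth in *10000 PLN", col = "orange")

Output:

[Image: box plot of Net_worth]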
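The points a box plot draws as separate dots can also be read out as numbers. A minimal sketch using boxplot.stats() from the grDevices package, which ships with R, is shown below. The data frame is a hypothetical stand-in with invented values, since the real dataset is only available as an image.

```r
# Hypothetical stand-in data (values invented for illustration)
Data <- data.frame(
  Age = c(25, 31, 28, 35, 42, 38, 29, 33, 45, 93, 36, 40, 27, 34, 39),
  Net_worth = c(12000, 15000, 9000, 20000, 18000, 22000, 11000, 16000,
                25000, 19000, 14000, 152000, 10000, 17000, 21000)
)

# $out holds the values that lie beyond the whiskers of the box plot
age_flagged      <- boxplot.stats(Data$Age)$out
networth_flagged <- boxplot.stats(Data$Net_worth)$out

print(age_flagged)
print(networth_flagged)
```

Note that boxplot.stats() computes the box from Tukey's hinges rather than from quantile(), so on some datasets its result can differ slightly from a hand-computed IQR rule.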

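  3. Using the interquartile range (IQR)

A value is flagged as an outlier when it lies below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.

Identify the outliers in the Age values:

# Quartiles and interquartile range of Age
Q1 <- quantile(Data$Age, 0.25)
Q3 <- quantile(Data$Age, 0.75)
iqr_age <- Q3 - Q1  # avoids reusing the name of R's built-in IQR() function
# Values more than 1.5 * IQR beyond the quartiles count as outliers
lower_bound_age <- Q1 - 1.5 * iqr_age
upper_bound_age <- Q3 + 1.5 * iqr_age
outlier_age <- Data$Age[Data$Age < lower_bound_age | Data$Age > upper_bound_age]
print(outlier_age)

Output:
93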

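Identify the outliers in the Net_worth values:

# Quartiles and interquartile range of Net_worth
Q1 <- quantile(Data$Net_worth, 0.25)
Q3 <- quantile(Data$Net_worth, 0.75)
iqr_networth <- Q3 - Q1  # avoids reusing the name of R's built-in IQR() function
# Values more than 1.5 * IQR beyond the quartiles count as outliers
lower_bound_Net_worth <- Q1 - 1.5 * iqr_networth
upper_bound_Net_worth <- Q3 + 1.5 * iqr_networth
outlier_networth <- Data$Net_worth[Data$Net_worth < lower_bound_Net_worth | Data$Net_worth > upper_bound_Net_worth]
print(outlier_networth)

Output:
152000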
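Since the same steps are repeated for each column, they could be wrapped in a small helper. The sketch below is one possible way to do that. The function name iqr_outliers, its argument k, and the stand-in data are all assumptions made for illustration, not part of the original analysis.

```r
# Hypothetical helper: returns the IQR fences and the values outside them
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]            # interquartile range
  lower <- q[1] - k * spread       # lower fence
  upper <- q[2] + k * spread       # upper fence
  list(
    lower = lower,
    upper = upper,
    values = x[!is.na(x) & (x < lower | x > upper)]
  )
}

# Hypothetical stand-in data (values invented for illustration)
Data <- data.frame(
  Age = c(25, 31, 28, 35, 42, 38, 29, 33, 45, 93, 36, 40, 27, 34, 39),
  Net_worth = c(12000, 15000, 9000, 20000, 18000, 22000, 11000, 16000,
                25000, 19000, 14000, 152000, 10000, 17000, 21000)
)

age_result      <- iqr_outliers(Data$Age)
networth_result <- iqr_outliers(Data$Net_worth)

print(age_result$values)
print(networth_result$values)
```

Keeping the multiplier k as an argument makes it easy to try a stricter or looser rule, for example k = 3 for only the most extreme values.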

How to handle outliers

  1. Dropping the outliers using the interquartile range

Keep only the rows whose Age and Net_worth both fall within the IQR bounds computed above:

new_data <- Data[
  Data$Net_worth >= lower_bound_Net_worth & Data$Net_worth <= upper_bound_Net_worth &
  Data$Age >= lower_bound_age & Data$Age <= upper_bound_age,
]

summary(new_data)

Output:

[Image: output of summary(new_data)]
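After dropping rows, it is worth checking which rows were removed and how many. The following sketch does this with a logical mask. The helper within_fences and the stand-in data are hypothetical and only serve to make the example runnable on its own.

```r
# Hypothetical stand-in data (values invented for illustration)
Data <- data.frame(
  Age = c(25, 31, 28, 35, 42, 38, 29, 33, 45, 93, 36, 40, 27, 34, 39),
  Net_worth = c(12000, 15000, 9000, 20000, 18000, 22000, 11000, 16000,
                25000, 19000, 14000, 152000, 10000, 17000, 21000)
)

# Hypothetical helper: TRUE where x lies within the 1.5 * IQR fences
within_fences <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  spread <- q[2] - q[1]
  x >= q[1] - 1.5 * spread & x <= q[2] + 1.5 * spread
}

# A row is kept only if both columns are within their fences
keep <- within_fences(Data$Age) & within_fences(Data$Net_worth)
new_data <- Data[keep, ]

removed_rows <- which(!keep)                    # indices of the dropped rows
rows_dropped <- nrow(Data) - nrow(new_data)     # number of dropped rows

print(removed_rows)
print(rows_dropped)
```

Dropping is simple, but every removed row also discards the valid values in its other columns, which matters on small datasets.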

  2. Substituting the outliers with the column mean

Identify the row indices of the outliers:

# Find the rows that hold the outlier values
which(Data$Net_worth == 152000)
which(Data$Age == 93)

Output:
12, 10

Replace the outliers with the column means:

# Replace the outlier values with the mean of their column
Data$Net_worth[12] <- mean(Data$Net_worth)
Data$Age[10] <- mean(Data$Age)
summary(Data)
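One caveat with the replacement above: the column mean is computed while the outlier is still in the column, so the substituted value is itself pulled toward the outlier. A possible variation, sketched below on hypothetical stand-in data, computes the mean from the remaining rows only and looks up the row indices instead of hard-coding them. This is an alternative to the step above, not a description of it.

```r
# Hypothetical stand-in data (values invented for illustration)
Data <- data.frame(
  Age = c(25, 31, 28, 35, 42, 38, 29, 33, 45, 93, 36, 40, 27, 34, 39),
  Net_worth = c(12000, 15000, 9000, 20000, 18000, 22000, 11000, 16000,
                25000, 19000, 14000, 152000, 10000, 17000, 21000)
)

# Look up the outlier rows (assumes each value occurs at least once)
age_idx      <- which(Data$Age == 93)
networth_idx <- which(Data$Net_worth == 152000)

# Mean with the outlier still included, kept for comparison
age_mean_with_outlier <- mean(Data$Age)

# Replace each outlier with the mean of the other rows in its column
Data$Age[age_idx]           <- mean(Data$Age[-age_idx])
Data$Net_worth[networth_idx] <- mean(Data$Net_worth[-networth_idx])

print(age_mean_with_outlier)
print(Data$Age[age_idx])
print(Data$Net_worth[networth_idx])
```

Using median() in place of mean() is another common choice, since the median is much less sensitive to extreme values.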

Plot the updated columns on box plots:

boxplot(Data$Age,
        main = "Age",
        col = "green",
        border = "blue")

Output:

[Image: box plot of Age after replacing the outlier]

boxplot(Data$Net_worth,
        main = "Net worth in *10000 PLN",
        col = "yellow",
        border = "blue")

Output:

[Image: box plot of Net_worth after replacing the outlier]
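Besides inspecting the new box plots, the result can be checked numerically. Replacing a value changes the quartiles, so in principle new points can fall outside the updated fences. The sketch below re-runs the check with boxplot.stats() on hypothetical stand-in data after a mean replacement. The data values are invented for illustration.

```r
# Hypothetical stand-in data (values invented for illustration)
Data <- data.frame(
  Age = c(25, 31, 28, 35, 42, 38, 29, 33, 45, 93, 36, 40, 27, 34, 39),
  Net_worth = c(12000, 15000, 9000, 20000, 18000, 22000, 11000, 16000,
                25000, 19000, 14000, 152000, 10000, 17000, 21000)
)

# Replace the two outliers with the column mean, as in the step above
Data$Net_worth[12] <- mean(Data$Net_worth)
Data$Age[10]       <- mean(Data$Age)

# Count the points that would still be drawn beyond the whiskers
remaining_age      <- length(boxplot.stats(Data$Age)$out)
remaining_networth <- length(boxplot.stats(Data$Net_worth)$out)

print(remaining_age)
print(remaining_networth)
```

A count of zero for both columns means the updated box plots show no separate outlier points.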