Home / Blog / Data Science / Case Study: Telecom Customer Complaints Analysis

Case Study: Telecom Customer Complaints Analysis

June 26, 2022
55

Meet the Author : Mr. Bharani Kumar

Bharani Kumar Depuru is a well known IT personality from Hyderabad. He is the Founder and Director of Innodatatics Pvt Ltd and 360DigiTMG. Bharani Kumar is an IIT and ISB alumni with more than 17 years of experience, he held prominent positions in the IT elites like HSBC, ITC Infotech, Infosys, and Deloitte. He is a prevalent IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG with more than Ten years of experience and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.

Problem Introduction

Business problem:

Multiple client complaints may be made to a telecom firm. While filing complaints is nothing, it is one method we can give our customers better service.
In order to achieve customer pleasure, it is crucial to prioritise client convenience. This can only be done if we speak to these concerns honestly and effectively. Customers will be dissatisfied with our services, lose trust in us, and finally become disconnected from our network if we are unable to resolve their tickets. Therefore, responding to concerns is essential.
This use case examines a small dataset where customer complaints and other factors are carefully examined and machine learning modelling is carried out.
To determine which important issues are being presented while customers are withdrawing their concerns. The challenge might begin as indicated below.
To make a trend chart showing the number of complaints received on a daily and monthly basis.
To provide a frequency chart for different sorts of complaints.
Which state has the maximum number of complaints?
Building a unigram and bigram word cloud.

Learn the core concepts of Data Science Course video on Youtube:

Exploratory Data Analysis

Let's take tour of dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

df = pd.read_csv("SP_telecom_complaints_data.csv")

Ticket #: A unique ID generated for each complaint
Customer Complaint: Written complaint of the customer
Date: date of complaint filing
Date_month_year: Date of the complaint filing in different format
Time: Time of complaint filing
Received Via: Medium of filing complaint which is either via Internet or Customer care
City: City where the customer has faced the issue
State: State where the customer has faced the issue
Zip Code: Zip code of area where the customer has faced the issue
Status: Status of the complaint from the side of firm
Filing on Behalf of Someone: Whether the customer filed the complaint for himself/herself or on behalf of someone else

Let us first see the ‘Percent wise distribution of complaint status’:

Closed: 33.17%
Open: 16.27%
Solved: 43.52%
Pending: 7.17%

Majority of the complaints (43.52% of total complaints) are solved whereas 33.17% are closed.

Now that we have a fair amount of idea about the dataset,

We have to perform some feature engineering on time and Date_month_year. Here we are adding two columns and saving them in a new column as an index. Convert it to DateTime data type along with that we are resetting the index.

df[ "date_index"] = df[ "Date_month_year"] + " " + df[ "Time"]

df["date_index"] = pd.to_datetime(df["date_index"])
df["Date_month_year"] = pd.to_datetime(df["Date_month_year"])
df = df.set_index(df["date_index"])
df.head(3)

To know how many complaints are there in a month we will find :

f = df.groupby(pd.Grouper(freq="d")).size()
df.groupby(pd.Grouper(freq= "D")).size().plot()

df.groupby(pd.Grouper(freq= "M")).size().plot()

We can infer that June has more complaints. It seems like the busiest month is June (6) as majority of the complaint filing is being done in this month.

df[ 'Day'] = pd.to_datetime(df[ 'Date']).dt.day
df[ 'Month'] = pd.to_datetime(df[ 'Date']).dt.month
df[ 'Year'] = pd.to_datetime(df[ 'Date']).dt.year
df.head()
Now which day has more complaints can also be analysed .

fig, ax = plt.subplots(figsize = (15,10))
df.groupby(['Day','Status']).count()['Ticket #'].unstack().plot.bar(ax=ax)

The number of closed complaints is larger in the first few days (1, 2, 3, etc.) of the month, but on the 23rd, 24th, and 25th, there is an abrupt and unanticipated rise in the number of closed complaints. The likelihood that a complaint will be resolved is therefore significantly increased if it is filed on the 23rd, 24th, and 25th of each month.

fig, ax = plt.subplots(figsize = (15,10)) df.groupby(['Month','Status']).count()['Ticket #'].unstack().plot.bar(ax=ax)

In the 6th month we have more complaints solved .

To analyse which state has high resolved complaints :

df["newStatus"] = ["Open" if Status=="Open" or Status=="Pending" else "Closed" for Status in df["Status"]]
df.groupby(["State"]).size().sort_values(ascending=False).to_frame().reset_index().rename({0: "Count"}, axis=1)[:5]
Status_complaints = df.groupby(["State","newStatus"]).size().unstack().fillna(0)
Status_complaints
Status_complaints.plot(kind="bar", figsize=(30,50), stacked=True)

Which state has maximum number of complaints:

df.groupby(["State"]).size().sort_values(ascending=False).to_frame().reset_index().rename({0: "Count"}, axis=1) max()

State: West Virginia
Count: 288
dtype: object

Customer Complaint

This variable is text data and hence will start with word counts.

#https://stackoverflow.com/a/37483537/4084039

word_count = df['Customer Complaint'].str.split().apply(len).value_counts()
word_dict = dict(word_count)
word_dict = dict(sorted(word_dict.items(), key=lambda x: x[1]))
ind = np.arange(len(word_dict))
plt.figure(figsize=(10,5))
p1 = plt.bar(ind, list(word_dict.values()))
plt.ylabel('count of complaints')
plt.title('Words present in each complaint')
plt.xticks(ind, list(word_dict.keys()))
plt.show()

We can build a word cloud :

from wordcloud import WordCloud, STOPWORDS text = df['Customer Complaint'] wordcloud = WordCloud().generate(str(text)) plt.figure( figsize=(10,10) ) plt.imshow(wordcloud) plt.axis("off") plt.show()

The most common concerns with complaints are speed and Internet problems, and among all of them, we can observe that service is the best of all the word tokens. Service for this company should thus be enhanced.

Let's build a bigram word cloud

doc_complete = df[ "Customer Complaint"].tolist()
doc_clean = [clean(doc).split() for doc in doc_complete]
doc_clean
ip_rev_string = " ".join(doc_complete)
ip_rev_string

import nltk
import re
# word cloud with bigram
nltk.download('punkt')
#Punkt is designed to learn parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain.

from wordcloud import WordCloud, STOPWORDS

WNL = nltk.WordNetLemmatizer()

# Lowercase and tokenize
text = ip_rev_string.lower()

# Remove single quotes early since it causes problems with the tokenizer.
#text = text.replace("'", "")

tokens = nltk.word_tokenize(text)
text1 = nltk.Text(tokens)
#help(nltk.Text)
# Remove extra chars and remove stop words.#Split the string by each occurrence of the pattern
text_content = [''.join(re.split("[ .,;:!?‘’``''@#$%^_&*()<>{}~\n\t\\\-]" , word)) for word in text1]

# Create a set of stopwords
stopwords_wc = set(STOPWORDS)
customised_words = ['price', 'great','doesn','SP','sp']
# If you want to remove any particular word form text which does not contribute much in meaning

new_stopwords = stopwords_wc.union(customised_words)

# Remove stop words
text_content = [word for word in text_content if word not in new_stopwords]

# Take only non-empty entries
text_content = [s for s in text_content if len(s) != 0]

# Best to get the lemmas of each word to reduce the number of similar words
text_content = [WNL.lemmatize(t) for t in text_content]
help(nltk.word_tokenize)
nltk_tokens = nltk.word_tokenize(text)
bigrams_list = list(nltk.bigrams(text_content))
print(bigrams_list)

dictionary2 = [' '.join(tup) for tup in bigrams_list]
print (dictionary2)

# Using count vectoriser to view the frequency of bigrams
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(2, 2))
bag_of_words = vectorizer.fit_transform(dictionary2)
vectorizer.vocabulary_

sum_words = bag_of_words.sum(axis=0)##rowsum in r
words_freq = [(word, sum_words[0, columnname]) for word, columnname in vectorizer.vocabulary_.items()]
words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
print(words_freq[:100])

# Generating wordcloud
words_dict = dict(words_freq)
WC_height = 1000
WC_width = 1500
WC_max_words = 100
wordCloud = WordCloud(max_words=WC_max_words, height=WC_height, width=WC_width, stopwords=new_stopwords)
wordCloud.generate_from_frequencies(words_dict)
plt.figure(4)
plt.title('Most frequently occurring bigrams connected by same colour and font size')
plt.imshow(wordCloud, interpolation='bilinear')
plt.axis("off")
plt.show()