Natural Language Processing in Python, with full implementation code.
Natural Language Processing (or NLP) is the application of Machine Learning models to text and language. Teaching machines to understand the spoken and written word is the focus of NLP. Whenever you dictate something into your iPhone / Android device and it is converted to text, that’s an NLP algorithm in action.
You can also use NLP on a text review to predict whether the review is positive or negative, on an article to predict the categories you want to segment articles into, or on a book to predict its genre. It goes even further: you can use NLP to build a machine translator or a speech recognition system. Speaking of classification algorithms, most NLP algorithms are classification models, and they include Logistic Regression, Naive Bayes, CART (a model based on decision trees), Maximum Entropy (a model closely related to Logistic Regression), and Hidden Markov Models (models based on Markov processes).
A very well-known model in NLP is the Bag of Words model. It is used to preprocess the texts to be classified before fitting the classification algorithm on the observations containing those texts: each text is turned into a vector of word counts.
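For intuition, here is a minimal sketch of the idea using two made-up sentences: every distinct word becomes a column, and every text becomes a row of word counts.
from sklearn.feature_extraction.text import CountVectorizer
docs = ['the food was great', 'the service was not great']  # two toy sentences (made up)
cv = CountVectorizer()
bow = cv.fit_transform(docs).toarray()
print(sorted(cv.vocabulary_))  # learned vocabulary: ['food', 'great', 'not', 'service', 'the', 'was']
print(bow)                     # one row of word counts per sentence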
In this part, you will understand and learn how to:
- Clean texts to prepare them for the Machine Learning models,
- Create a Bag of Words model,
- Apply Machine Learning models to this Bag of Words model.
Here is the step-by-step implementation, illustrated with code:
- Importing libraries
- Importing datasets
- Cleaning text
- Creating the Bag of Words model
- Splitting the data into Training and Test sets
- Training the Naive Bayes model on the Training set
- Predicting the Test set results
- Making the confusion matrix
Explaining the NLP code written in Python:
Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Importing the dataset
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
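As a quick sanity check you can look at the loaded data. The quoting = 3 argument tells pandas to ignore double quotes inside the reviews; the snippet below assumes the usual layout of this file, a 'Review' text column and a 'Liked' 0/1 label column, with 1000 rows.
print(dataset.shape)   # expected: (1000, 2), assuming the standard Restaurant_Reviews.tsv file
print(dataset.head())  # 'Review' holds the text, 'Liked' holds the 0/1 sentiment label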
Cleaning the texts (Data Preprocessing)
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(0, 1000):
    # Keep only letters, then lowercase and split the review into words
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    # Stem each word and drop English stopwords (keeping 'not', which carries sentiment)
    ps = PorterStemmer()
    all_stopwords = stopwords.words('english')
    all_stopwords.remove('not')
    review = [ps.stem(word) for word in review if word not in set(all_stopwords)]
    review = ' '.join(review)
    corpus.append(review)
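To see what the cleaning step produces, you can run the same transformations on a single review. The review text below is a hypothetical example, and the printed result is approximate:
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')
sample = 'The waiters were not friendly, but the pasta was amazing!'  # hypothetical review
sample = re.sub('[^a-zA-Z]', ' ', sample).lower().split()
sample = [ps.stem(word) for word in sample if word not in set(all_stopwords)]
print(' '.join(sample))  # roughly: 'waiter not friendli pasta amaz'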
Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, -1].values
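At this point it is worth checking the shape of the feature matrix: with max_features = 1500, each review is represented by a vector of up to 1500 word counts (a quick check, assuming the corpus built above):
print(X.shape)              # expected: (1000, 1500) -- one row per review, one column per kept word
print(len(cv.vocabulary_))  # number of distinct words kept (at most 1500)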
Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
Training the Naive Bayes model on the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
Predicting the Test set results
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)), 1))
Making the Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))
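Finally, here is a sketch of how you could score a brand-new review with the same pipeline. The review text is a hypothetical example, and the new text must go through exactly the same cleaning and the already-fitted CountVectorizer (cv.transform, not fit_transform):
new_review = 'I love this restaurant so much'  # hypothetical new review
new_review = re.sub('[^a-zA-Z]', ' ', new_review).lower().split()
new_review = [ps.stem(word) for word in new_review if word not in set(all_stopwords)]
new_X = cv.transform([' '.join(new_review)]).toarray()
print(classifier.predict(new_X))  # 1 if predicted positive, 0 if negative (assuming the 'Liked' labels)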
Get the complete Code:
## Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
## Importing the dataset
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
## Cleaning the texts
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(0, 1000):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    all_stopwords = stopwords.words('english')
    all_stopwords.remove('not')
    review = [ps.stem(word) for word in review if word not in set(all_stopwords)]
    review = ' '.join(review)
    corpus.append(review)
## Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, -1].values
## Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
## Training the Naive Bayes model on the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
## Predicting the Test set results
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)), 1))
## Making the Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))