Python offers a number of libraries for NLP, but here we'll focus on the Natural Language Toolkit (NLTK), a comprehensive library for building NLP programs.
1. Installation
First, let's install NLTK:
pip install nltk
After installation, download the required datasets:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
2. Tokenization
Tokenization is the process of breaking text into individual words or sentences.
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Hey there! Welcome to the world of NLP."
print(sent_tokenize(text))
print(word_tokenize(text))
3. Removing Stopwords
Stopwords are common words (like “the”, “is”, “in”) that may not add significant meaning to a sentence.
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
words = word_tokenize(text)
filtered_words = [word for word in words if word.lower() not in stop_words]
print(filtered_words)
4. Stemming and Lemmatization
These techniques reduce words to their root forms.
Stemming:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
print(ps.stem("running"))
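Stemming is a fast, rule-based chop of suffixes, so the result is not always a dictionary word. Running the Porter stemmer over a few words makes this visible:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Stems are cheap to compute but not guaranteed to be real words.
for word in ["running", "flies", "studies", "easily"]:
    print(word, "->", ps.stem(word))
# running -> run
# flies -> fli
# studies -> studi
# easily -> easili
```

When you need actual dictionary forms, lemmatization (next) is the better fit.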
Lemmatization:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))