Udemy NLP

NLP Python Basics:

1. spaCy Pipelines

2. Tokenization

3. Stemming

4. Lemmatization

5. Stop Words

6. Vocabulary and Phrase Matching

Parts of Speech Tagging:

Parts-of-speech (POS)

### Working with PDFs:

We can use the PyPDF2 library to read text data from a PDF.

NOTE: Not all PDFs have text that can be extracted.

Some PDFs are created by scanning paper documents, so each page is an image. Extracting text from those images is difficult and requires specialised software (e.g. OCR).

## Text files
# In Jupyter, create a file with the cell magic: %%writefile test.txt
# my_file = open('test.txt')
# Read a file:
# my_file.read()
# Write a file:
# my_file = open('test.txt', "w+")  # w+ allows both read and write ==> use w+ with caution, as it truncates (removes) the previous data
# my_file.seek(0)  # move the cursor back to position zero (the start of the file)
# my_file.close()  # always close the file; while Python holds it open, other programs may not be able to use it
with open("writeFile.txt", "w") as file:
    file.write("This is created using PyCharm.\nI wrote it in PyCharm")
# pwd (in Jupyter) shows the current working directory; in a script use os.getcwd()
# import os; print("File location", os.getcwd())

with open("writeFile.txt", "r") as file:
    content = file.readlines()
print(content)
for line in content:
    print(line.split()[0])  # first word of each line
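
A minimal sketch of the w+ and seek(0) behaviour noted in the comments above. The scratch file name demo.txt and the temporary directory are assumptions made for illustration only:

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory (name chosen for illustration).
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as f:
    f.write("old contents\n")

# "w+" truncates the file on open, then allows both writing and reading.
with open(path, "w+") as f:
    before = f.read()        # empty string: the old contents are gone
    f.write("first line\nsecond line\n")
    after_write = f.read()   # empty string: the cursor sits at the end of the file
    f.seek(0)                # move the cursor back to the start
    lines = f.readlines()    # now the new contents can be read

print(lines)
```

If the old contents must be kept, opening with "a" (append) or "r+" (read and write without truncating) is the usual alternative.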

#####################
#### WORKING WITH PDFS
#####################
import PyPDF2
# PDF files are opened in read-binary ('rb') mode
# NOTE: these are the PyPDF2 1.x names; PyPDF2 3.x renamed them to
# PdfReader, len(reader.pages), reader.pages[0] and extract_text()
# with open("US_Declaration.pdf", mode='rb') as file:
#     pdf_reader = PyPDF2.PdfFileReader(file)
#     print(pdf_reader.numPages)
#     page_one = pdf_reader.getPage(0)
#     my_text = page_one.extractText()
#     print(my_text)

# ### Adding a page to a PDF
with open("US_Declaration.pdf", mode='rb') as file:
    pdf_reader = PyPDF2.PdfFileReader(file)
    print(pdf_reader.numPages)  # total number of pages
    page_one = pdf_reader.getPage(0)
    pdf_writer = PyPDF2.PdfFileWriter()
    pdf_writer.addPage(page_one)
    # write while the source file is still open: the writer may still need to read page data from it
    with open("MY_BRAND_NEW.pdf", 'wb') as pdf_output:
        pdf_writer.write(pdf_output)

# ## Grabbing all the text from a PDF
# with open("US_Declaration.pdf", mode='rb') as file:
#     pdf_text = []  # start with an empty list to collect the text of every page
#     pdf_reader = PyPDF2.PdfFileReader(file)
#     for p in range(pdf_reader.numPages):
#         page = pdf_reader.getPage(p)
#         pdf_text.append(page.extractText())
#
# print(len(pdf_text))
# for page in pdf_text:
#     print(page)
#     print('\n' * 5)  # blank lines between pages

\d indicates a digit

For regular expressions, import re

# For search: always assign the expression to a variable

pattern = "jhsdgds"

my_match = re.search(pattern, text)

my_match.span()  ## gives the (start, end) index of the match ==> o/p, e.g.: (4,7)

my_match.start() ==> 4

my_match.end() ==> 7
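
A small runnable sketch of search, span, start and end. The sample text and the pattern "phone" are made up for illustration:

```python
import re

text = "my phone is new"   # sample text, assumed for this example
pattern = "phone"

my_match = re.search(pattern, text)
print(my_match.span())                   # (3, 8)
print(my_match.start(), my_match.end())  # 3 8
print(text[my_match.start():my_match.end()])  # phone

# re.search returns None when nothing matches, so check before calling .span()
missing = re.search("tablet", text)
print(missing)                           # None
```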

## If the pattern occurs multiple times, my_match only reports the first occurrence

## To find all the matches:

matches = re.findall(pattern, text)

len(matches)  ## number of times the pattern occurs

# Print the span of each match object

for match in re.finditer("phone", text):
    print(match.span())
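
A short sketch comparing findall (returns the matched strings) with finditer (returns match objects, so the positions are available). The sample text is an assumption:

```python
import re

text = "my phone is a new phone"   # sample text, assumed for this example

matches = re.findall("phone", text)
count = len(matches)               # how many times the pattern occurs
print(matches, count)              # ['phone', 'phone'] 2

# finditer yields one match object per occurrence
spans = [match.span() for match in re.finditer("phone", text)]
print(spans)                       # [(3, 8), (18, 23)]
```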

## Looking for a phone number in text (e.g. 345-567-5688): you don't remember the actual number, but you remember its pattern

pattern = r'\d\d\d-\d\d\d-\d\d\d\d'

phone_number = re.search(pattern, text)   ## returns a match object, which shows the span and the matched text

If you want only the matched text, use group()

phone_number.group()
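
As a sketch, the same digit pattern finds any number with that shape, not just one specific number. The sample text with two numbers is made up:

```python
import re

text = "Call 345-567-5688 or 408-555-1234 after 5pm"   # sample text, assumed
pattern = r"\d\d\d-\d\d\d-\d\d\d\d"

first = re.search(pattern, text)
print(first)             # the match object: shows the span and the matched text
print(first.group())     # only the matched text: 345-567-5688

# findall returns every non-overlapping match as a string
all_numbers = re.findall(pattern, text)
print(all_numbers)       # ['345-567-5688', '408-555-1234']
```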

Quantifiers: {n} means exactly n occurrences

Now the pattern can be written as:

pattern = r"(\d{3})-(\d{3})-(\d{4})"

phone_number.group() gives the entire match, but if we want only the first group of digits, i.e. 345, then use

phone_number = re.search(pattern, text)

phone_number.group(1)
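
A minimal sketch of quantifiers with numbered groups. Each pair of parentheses is one group, numbered from 1; the sample text is an assumption:

```python
import re

text = "My phone number is 345-567-5688"   # sample text, assumed
pattern = r"(\d{3})-(\d{3})-(\d{4})"

phone_number = re.search(pattern, text)
whole = phone_number.group()     # entire match (same as group(0))
first = phone_number.group(1)    # first parenthesised group
last = phone_number.group(3)     # third parenthesised group
parts = phone_number.groups()    # all groups as a tuple

print(whole)   # 345-567-5688
print(first)   # 345
print(last)    # 5688
print(parts)   # ('345', '567', '5688')
```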

# The pipe operator | is used for an "or" condition

Searching for man or woman:

re.search(r"man|woman", "This woman is here")

re.findall(r".at", "The cat is in hat sat")  # . matches exactly one character before "at" (for 2 characters use ..at)

Ans: ['cat', 'hat', 'sat']
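
A quick sketch of the pipe and the wildcard. Note that a space inside a pattern is a literal character, so it becomes part of the match. The second sample string ("splat flat") is made up:

```python
import re

either = re.search(r"man|woman", "This woman is here").group()
print(either)        # woman

# A space in the pattern is matched literally
with_space = re.search(r"man| woman", "This woman is here").group()
print(repr(with_space))   # ' woman'

one_char = re.findall(r".at", "The cat is in hat sat")
print(one_char)      # ['cat', 'hat', 'sat']

two_chars = re.findall(r"..at", "splat flat")
print(two_chars)     # ['plat', 'flat']
```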

## Starts with: ^ ; ends with: $

re.findall(r"^\d", '1 is the first')  ===> ans is ['1']
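
A small sketch of both anchors. The second and third sample strings are assumptions added for contrast:

```python
import re

starts = re.findall(r"^\d", "1 is the first")       # digit at the very start
not_at_start = re.findall(r"^\d", "the first is 1")  # digit exists, but not at the start
ends = re.findall(r"\d$", "the answer is 2")        # digit at the very end

print(starts)        # ['1']
print(not_at_start)  # []
print(ends)          # ['2']
```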

[^ ] ==> exclude: a ^ at the start of the brackets matches any character NOT listed

[^\d] ==> exclude digits

my_words = re.findall(r"[^!.?]+", text) ==> skips the characters in the brackets => returns a list of the text chunks between them
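
A sketch of exclusion sets on made-up sample strings. Adding a space to the excluded characters splits the text into single words instead of whole chunks:

```python
import re

text = "This is a string! But it has punctuation. How can we remove it?"  # sample, assumed

# Everything except ! . ? : one chunk per sentence (leading spaces are kept)
chunks = re.findall(r"[^!.?]+", text)
print(chunks)

# Also exclude the space: one item per word
words = re.findall(r"[^!.? ]+", text)
print(words[:4])     # ['This', 'is', 'a', 'string']

# Exclude digits
no_digits = re.findall(r"[^\d]+", "there are 3 numbers 34 inside 5 this")
print(no_digits)     # ['there are ', ' numbers ', ' inside ', ' this']
```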

To join the pieces back into one string, use

" ".join(my_words)

text = "Here is one-hand and two-hand code" ===> to find the hyphenated words, use [\w]+ on each side of the hyphen

re.findall(r'[\w]+-[\w]+', text)
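
Run as a sketch, the call above returns only the hyphenated words. The brackets around \w are optional here, since \w is already a character class:

```python
import re

text = "Here is one-hand and two-hand code"

hyphenated = re.findall(r"[\w]+-[\w]+", text)
print(hyphenated)    # ['one-hand', 'two-hand']

# Same result without the brackets
same = re.findall(r"\w+-\w+", text)
print(same == hyphenated)   # True
```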











