Udemy NLP

NLP Python Basics:

1. spaCy Pipelines

2. Tokenization

3. Stemming

4. Lemmatization

5. Stop Words

6. Vocabulary and Phrase Matching

Parts of Speech Tagging:

Parts of speech (POS)

### Working with PDFs:

We can use the PyPDF2 library to read text data from a PDF.

NOTE: Not all PDFs have text that can be extracted.

Some PDFs are created by scanning paper documents, so each page is just an image. Extracting text from those images is difficult and requires specialised software (OCR, optical character recognition).

## Text files
# In Jupyter, create a file with the cell magic: %%writefile test.txt
# Open the file:
# my_file = open('test.txt')
# Read the file:
# my_file.read()
# After a read, the cursor sits at the end of the file; move it back to position zero to read again:
# my_file.seek(0)
# Open a file for writing:
# my_file = open('test.txt', 'w+')  # w+ allows both reading and writing; use it with caution, as it truncates (erases) any existing data
# Close the file when done; otherwise the OS may report that the file is still in use by Python and you cannot use it elsewhere:
# my_file.close()
with open("writeFile.txt", "w") as file:
    file.write("This is created using PyCharm.\nI wrote it in PyCharm")
# pwd shows the current working directory in Jupyter; in a plain script use os.getcwd()
# import os; print("File location", os.getcwd())

with open("writeFile.txt","r") as file:
    content=file.readlines()
    print(content)
    for line in content:
        print(line.split()[0])
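
The notes above on "w+" truncation and seek(0) are easier to see in a small experiment. This is a minimal sketch, assuming a scratch file in a temporary directory (the name demo.txt is hypothetical, chosen so that no real file gets overwritten):

```python
import os
import tempfile

# Hypothetical scratch file inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Create the file with some initial content.
with open(path, "w") as f:
    f.write("first line\nsecond line\n")

# Opening with "w+" truncates the file immediately: the old content is gone.
with open(path, "w+") as f:
    after_truncate = f.read()   # nothing left to read
    f.write("replacement\n")
    at_end = f.read()           # cursor is at the end of the file
    f.seek(0)                   # move the cursor back to the start
    after_seek = f.read()       # now the new content is readable

print(repr(after_truncate), repr(at_end), repr(after_seek))
```

Using the with statement closes the file automatically, which avoids the "file is in use" problem mentioned above.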

#####################
####  WORKING WITH PDFS
#################
import PyPDF2
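# PDF files are opened in read-binary mode ('rb').
# Note: PyPDF2 3.0 removed the older names (PdfFileReader, numPages, getPage, extractText);
# the equivalents used here are PdfReader, len(reader.pages), reader.pages[i] and extract_text().
# with open("US_Declaration.pdf", mode='rb') as file:
#     pdf_reader = PyPDF2.PdfReader(file)
#     print(len(pdf_reader.pages))      # number of pages
#     page_one = pdf_reader.pages[0]    # pages are zero-indexed
#     my_text = page_one.extract_text()
#     print(my_text)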

# ### Adding a page to a new PDF
# (PyPDF2 >= 3.0 names; older releases used PdfFileReader, getPage, PdfFileWriter, addPage)
with open("US_Declaration.pdf", mode='rb') as file:
    pdf_reader = PyPDF2.PdfReader(file)
    page_one = pdf_reader.pages[0]
    pdf_writer = PyPDF2.PdfWriter()
    pdf_writer.add_page(page_one)
    # Write the new one-page PDF while the source file is still open
    with open("MY_BRAND_NEW.pdf", 'wb') as pdf_output:
        pdf_writer.write(pdf_output)

# ## Grabbing all the text from a PDF
# (PyPDF2 >= 3.0 names; older releases used PdfFileReader, numPages, getPage, extractText)
# with open("US_Declaration.pdf", mode='rb') as file:
#     pdf_text = []  # empty list that collects the text of every page
#     pdf_reader = PyPDF2.PdfReader(file)
#     for page in pdf_reader.pages:
#         pdf_text.append(page.extract_text())
#
# print(len(pdf_text))  # one entry per page
# for page_text in pdf_text:
#     print(page_text)
#     print('\n' * 5)  # blank lines between pages

### Regular expressions:

\d matches a single digit.

To use regular expressions, import re.

# For search: always assign the pattern to a variable

pattern = "jhsdgds" ## placeholder: any string or regular expression to look for

my_match = re.search(pattern, text)

my_match.span() ## gives the start and end index of the match ==> o/p: (4, 7)

my_match.start() ==> 4

my_match.end() ==> 7
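
As a concrete illustration of span(), start() and end(), here is a minimal sketch. The sample sentence is an assumption made up for this example, not text from the course:

```python
import re

# Hypothetical sample text for illustration.
text = "The agent's phone number is 408-555-1234. Call soon!"
pattern = "phone"

my_match = re.search(pattern, text)

print(my_match.span())    # (12, 17)
print(my_match.start())   # 12
print(my_match.end())     # 17

# When nothing matches, re.search returns None, so check before calling span().
no_match = re.search("fax", text)
print(no_match)           # None
```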

## If the pattern occurs multiple times, my_match reports only the first occurrence

## To find all the matches:

matches = re.findall(pattern, text)

len(matches) ## number of times the pattern occurs

# Print the span of every match object

for match in re.finditer("phone", text):
    print(match.span())
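
The difference between findall (a list of matched strings) and finditer (an iterator of match objects) can be sketched as follows, assuming a short made-up sentence in which the word occurs twice:

```python
import re

# Hypothetical sample text: "phone" appears twice.
text = "my phone is a new phone"

# findall returns the matched strings only.
matches = re.findall("phone", text)
print(matches)        # ['phone', 'phone']
print(len(matches))   # 2

# finditer yields one match object per occurrence, so positions are available.
spans = [match.span() for match in re.finditer("phone", text)]
print(spans)          # [(3, 8), (18, 23)]
```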

## Looking for a phone number in text, e.g. 345-567-5688, when you don't remember the actual number but do remember the pattern

pattern = r'\d\d\d-\d\d\d-\d\d\d\d'

phone_number = re.search(pattern, text) ## returns a match object, which holds the span and the matched text

If you want only the matched text, use group()

phone_number.group()

Quantifiers: {n} means exactly n occurrences

Now the pattern can be written as:

pattern = r"(\d{3})-(\d{3})-(\d{4})"

phone_number.group() gives the entire match, but if we want only the first three digits (i.e., 345), the parentheses let us ask for group 1:

phone_number = re.search(pattern, text)

phone_number.group(1)
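
Putting the quantifier and the capture groups together, a minimal sketch could look like this. The sentence is an assumed example built around the number quoted in these notes; the last group uses {4} because the final block of the number has four digits:

```python
import re

# Hypothetical sentence containing the phone number from the notes.
text = "My telephone number is 345-567-5688"

# Three capture groups: 3 digits, 3 digits, 4 digits.
pattern = r"(\d{3})-(\d{3})-(\d{4})"
phone_number = re.search(pattern, text)

print(phone_number.group())    # 345-567-5688  (whole match)
print(phone_number.group(1))   # 345
print(phone_number.group(2))   # 567
print(phone_number.group(3))   # 5688
print(phone_number.groups())   # ('345', '567', '5688')
```

Group numbering starts at 1; group() with no argument (or group(0)) is always the whole match.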

# The pipe operator | is used for an "or" condition

Searching for man or woman:

re.search(r"man|woman", "This woman is here")
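
One detail worth checking with alternation is whitespace: every character inside the pattern, including a space next to the pipe, is matched literally. A small sketch, using the sentence from the notes plus one made-up sentence:

```python
import re

sentence = "This woman is here"

# Without spaces around the pipe, the match is exactly the word.
clean = re.search(r"man|woman", sentence).group()
print(repr(clean))          # 'woman'

# A space after the pipe becomes part of the second alternative.
with_space = re.search(r"man| woman", sentence).group()
print(repr(with_space))     # ' woman'

# findall returns every alternative that occurs (hypothetical sentence).
both = re.findall(r"man|woman", "A man and a woman")
print(both)                 # ['man', 'woman']
```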

re.findall(r".at", "The cat is in hat sat") # .at matches exactly one character before "at" (for 2 characters use ..at)

Ans: ['cat', 'hat', 'sat']
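
The effect of adding a second dot can be seen on the same sentence. In this sketch, note that the dot also matches a space, so the two-character version picks up the blank in front of each word:

```python
import re

sentence = "The cat is in hat sat"

# One wildcard: exactly one character before "at".
one_char = re.findall(r".at", sentence)
print(one_char)    # ['cat', 'hat', 'sat']

# Two wildcards: exactly two characters before "at"; the space counts as a character.
two_chars = re.findall(r"..at", sentence)
print(two_chars)   # [' cat', ' hat', ' sat']
```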

## Starts with: ^ ; ends with: $

re.findall(r"^\d", '1 is the first') ===> ['1']
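
A short sketch of both anchors follows. The first string comes from the notes; the other two are assumed examples added for contrast:

```python
import re

# ^ anchors the match to the start of the string.
starts_with_digit = re.findall(r"^\d", "1 is the first")
print(starts_with_digit)   # ['1']

# $ anchors the match to the end of the string.
ends_with_digit = re.findall(r"\d$", "The answer is 2")
print(ends_with_digit)     # ['2']

# A digit in the middle does not satisfy ^\d, so nothing is returned.
digit_in_middle = re.findall(r"^\d", "The 1 is not first")
print(digit_in_middle)     # []
```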

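[^ ] ==> exclude (a ^ placed first inside square brackets negates the set)

[^\d] ==> matches any character that is not a digit

my_words = re.findall(r"[^!.? ]+", text) ==> skips the characters listed in the brackets (here the punctuation marks and the space) => returns a list of words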
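
Whether the result is a list of words or a list of longer chunks depends on whether the space is part of the excluded set. A minimal sketch, assuming a made-up sentence with punctuation:

```python
import re

# Hypothetical sample text.
phrase = "This is a string! But it has punctuation. How can we remove it?"

# Excluding only punctuation keeps the spaces, so each sentence stays in one piece.
chunks = re.findall(r"[^!.?]+", phrase)
print(chunks)   # ['This is a string', ' But it has punctuation', ' How can we remove it']

# Excluding the space as well splits the text into individual words.
my_words = re.findall(r"[^!.? ]+", phrase)
print(my_words[:4])   # ['This', 'is', 'a', 'string']

# Joining the words gives the text without punctuation.
cleaned = " ".join(my_words)
print(cleaned)   # This is a string But it has punctuation How can we remove it
```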

To join the words back into a single string, use

" ".join(my_words)

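text = "Here is one-hand and two-hand code" ===> to find the hyphenated words, use [\w]+ (one or more word characters) on each side of the hyphen

re.findall(r'[\w]+-[\w]+', text) ===> ['one-hand', 'two-hand']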
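
Run on the sentence above, the pattern behaves as sketched below. The version without square brackets is included as an assumption-free simplification: a single \w does not need to be wrapped in a character class.

```python
import re

text = "Here is one-hand and two-hand code"

# One or more word characters, a hyphen, then one or more word characters.
hyphenated = re.findall(r"[\w]+-[\w]+", text)
print(hyphenated)   # ['one-hand', 'two-hand']

# Equivalent pattern without the brackets.
same_result = re.findall(r"\w+-\w+", text)
print(same_result)  # ['one-hand', 'two-hand']
```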

