NLP Python Basics:
1. spaCy Pipelines
2. Tokenization
3. Stemming
4. Lemmatization
5. Stop Words
6. Vocabulary and Phrase Matching
Parts of Speech Tagging:
Parts of speech (POS)
### Working with PDFs:
We can use the PyPDF2 library to read text data from a PDF.
NOTE: Not all PDFs contain text that can be extracted.
Some PDFs are created by scanning, so each page is an image; extracting text from them is difficult and requires specialised software.
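A quick way to spot a scanned PDF is to check whether extraction returned any visible text at all. The sketch below assumes the per-page strings have already been collected with a text extractor; `looks_scanned` is a hypothetical helper, and the "no visible characters" rule is a heuristic assumption, not something the PDF library provides.

```python
def looks_scanned(page_texts):
    """Return True when no page produced any visible text.

    Heuristic assumption: image-only (scanned) pages usually come back
    as empty or whitespace-only strings from a text extractor.
    """
    return not any(text.strip() for text in page_texts)


# Hypothetical extraction results, one string per page.
scanned_pages = ["", "  \n", ""]
text_pages = ["When in the Course of human events", ""]

scanned_result = looks_scanned(scanned_pages)
text_result = looks_scanned(text_pages)
print(scanned_result, text_result)
```

If the helper returns True, the file most likely needs OCR software rather than plain text extraction.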
## Text files
# In Jupyter, create a file: %%writefile test.txt
# my_file = open('test.txt')
# In Jupyter, read a file:
# my_file.read()
# In Jupyter, write a file:
# my_file = open('test.txt', "w+")  # w+ allows both reading and writing; use it with caution, because it truncates the file and removes the previous data
# my_file.close()  # always close the file; while Python still has it open, the OS may stop you from using it elsewhere
# my_file.seek(0)  # move the cursor back to position zero of the file
with open("writeFile.txt", "w") as file:
    file.write("This is created using PyCharm.\nI wrote it in PyCharm")
# pwd is the current working directory (a Jupyter command); in a plain script use os.getcwd()
# print("File location", os.getcwd())  # needs: import os
with open("writeFile.txt", "r") as file:
    content = file.readlines()
print(content)
for line in content:
    print(line.split()[0])
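The notes on read(), seek(0) and "w+" above can be tried out safely with a throwaway file. This is a minimal sketch that assumes a temporary directory is acceptable for the demo; the file name and its contents are made up for illustration.

```python
import os
import tempfile

# Work in a temporary directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.txt")

    with open(path, "w") as f:
        f.write("Hello, this is a text file.\nThis is the second line.\n")

    with open(path) as f:
        first_read = f.read()    # whole file; the cursor is now at the end
        second_read = f.read()   # nothing left to read
        f.seek(0)                # move the cursor back to the start
        third_read = f.read()    # whole file again

    with open(path, "w+") as f:  # "w+" truncates the file as soon as it is opened
        after_truncate = f.read()

print(repr(second_read), repr(after_truncate))
```

The second read() is empty because the cursor sits at the end of the file, and the read after opening with "w+" is empty because the old contents are already gone.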
#####################
#### WORKING WITH PDFS
#################
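import PyPDF2
# Note: these calls are the PyPDF2 1.x/2.x names; PyPDF2 3.x renamed them to PdfReader, len(reader.pages), reader.pages[0] and page.extract_text()
# PDF files are opened in read-binary mode
# with open("US_Declaration.pdf", mode='rb') as file:
#     pdf_reader = PyPDF2.PdfFileReader(file)
#     print(pdf_reader.numPages)
#     page_one = pdf_reader.getPage(0)
#     my_text = page_one.extractText()
#     print(my_text)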
# ### Adding a page to a PDF
with open("US_Declaration.pdf", mode='rb') as file:
    pdf_reader = PyPDF2.PdfFileReader(file)
    print(pdf_reader.numPages)
    page_one = pdf_reader.getPage(0)
    pdf_writer = PyPDF2.PdfFileWriter()
    pdf_writer.addPage(page_one)
    # keep the source file open until the new PDF has been written
    with open("MY_BRAND_NEW.pdf", 'wb') as pdf_output:
        pdf_writer.write(pdf_output)
# ## Grabbing all the text from a PDF
# with open("US_Declaration.pdf", mode='rb') as file:
#     pdf_text = []  # start with an empty list to collect the text of every page
#     pdf_reader = PyPDF2.PdfFileReader(file)
#     for p in range(pdf_reader.numPages):
#         page = pdf_reader.getPage(p)
#         pdf_text.append(page.extractText())
#
# len(pdf_text)
# for page in pdf_text:
#     print(page)
#     print('\n' * 5)  # blank lines between pages
#
# my_text = page_one.extractText()
# print(my_text)
\d matches any single digit
For regular expressions: import re
# For search: always assign the expression to a variable
pattern = "jhsdgds"
my_match = re.search(pattern, text)
my_match.span()   ## gives the start and end index of the match ==> o/p: e.g. (4, 7)
my_match.start()  ## ==> 4
my_match.end()    ## ==> 7
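Here is a small runnable version of the search steps above. The sample sentence and the pattern "phone" are assumptions picked for illustration; note that re.search returns None when nothing matches, so check for that before calling span().

```python
import re

text = "The agent's phone number is 408-555-1234. Call soon!"
pattern = "phone"

my_match = re.search(pattern, text)
span = my_match.span()    # (start, end) indices of the first match
start = my_match.start()
end = my_match.end()
print(span, text[start:end])

# When the pattern is absent, re.search returns None instead of a match object.
no_match = re.search("fax", text)
print(no_match)
```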
## If the pattern occurs several times, my_match only reports the first occurrence
## To find all the matches:
matches = re.findall(pattern, text)
len(matches)  ## number of times the pattern occurs
# print the span of each match object
for match in re.finditer("phone", text):
    print(match.span())
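To see the difference between findall and finditer, here is a short sketch on a made-up sentence: findall gives back the matched strings, while finditer gives match objects that still know their positions.

```python
import re

text = "my phone is a new phone"

matches = re.findall("phone", text)   # list of matched strings
count = len(matches)

# finditer yields match objects, so the span of every occurrence is available.
spans = [match.span() for match in re.finditer("phone", text)]
print(count, spans)
```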
## Looking for a phone number such as 345-567-5688 in text, when you do not remember the actual number but do remember its pattern
pattern = r'\d\d\d-\d\d\d-\d\d\d\d'
phone_number = re.search(pattern, text)  ## returns a match object holding the span and the matched text
If you want only the matched text, use group():
phone_number.group()
Quantifiers: {n} means exactly n occurrences
Now the pattern can be written as:
pattern = r"(\d{3})-(\d{3})-(\d{4})"
phone_number.group() gives the entire match, but if we want only one part of it, e.g. the first three digits 345, then use a group index:
phone_number = re.search(pattern, text)
phone_number.group(1)
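The two phone-number patterns can be compared side by side. This sketch assumes a sample sentence containing the number from the notes; the parentheses in the second pattern are what make group(1), group(2) and group(3) available.

```python
import re

text = "My telephone number is 345-567-5688"

long_pattern = r"\d\d\d-\d\d\d-\d\d\d\d"
grouped_pattern = r"(\d{3})-(\d{3})-(\d{4})"   # same match, written with quantifiers and groups

long_match = re.search(long_pattern, text).group()

phone_number = re.search(grouped_pattern, text)
full = phone_number.group()     # the entire match
first = phone_number.group(1)   # first parenthesised group
parts = phone_number.groups()   # all groups as a tuple
print(full, first, parts)
```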
# The pipe operator | is used for an "or" condition
Searching for man or woman:
re.search(r"man|woman", "This woman is here")
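A quick check of the "or" search, using the sentence from the note plus one made-up sentence that contains both words. Avoid stray spaces inside the pattern, because a space there becomes part of what must match.

```python
import re

match = re.search(r"man|woman", "This woman is here")
found = match.group()

# With findall, every occurrence of either alternative is returned in order.
both = re.findall(r"man|woman", "This man and that woman")
print(found, both)
```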
re.findall(r".at", "The cat is in hat sat")  # . matches exactly one character before "at" (for two characters use ..at)
Ans: ['cat', 'hat', 'sat']
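The wildcard matches any character except a newline, and that includes spaces. The sketch below reuses the sentence from the note to show what one dot and two dots pick up.

```python
import re

text = "The cat is in hat sat"

one_char = re.findall(r".at", text)    # exactly one character before "at"
two_chars = re.findall(r"..at", text)  # exactly two characters before "at"
print(one_char)
print(two_chars)
```

With two dots the leading space is captured as well, which is usually a hint that a word-character class such as \w fits better than a bare dot.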
## Starts with: ^ ; ends with: $
re.findall(r"^\d", '1 is the first')  ===> ans is ['1']
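Both anchors can be tried on short strings. The first string is the one from the note; the other two are made up to show the end anchor and a case where the digit is not at the start.

```python
import re

starts_with_digit = re.findall(r"^\d", "1 is the first")    # digit at the very start
ends_with_digit = re.findall(r"\d$", "The last is 2")        # digit at the very end
not_at_start = re.findall(r"^\d", "The 2 is not first")      # digit exists, but not at the start
print(starts_with_digit, ends_with_digit, not_at_start)
```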
[^ ] ==> exclude
[^\d] ==> exclude digits
my_words = re.findall(r"[^!.?]+", text)  ==> skips the characters listed in the brackets and returns a list of the text chunks between them
To join the chunks back together, use:
" ".join(my_words)
To find hyphenated words, use [\w]+ on each side of the hyphen:
text = "Here is one-hand and two-hand code"
re.findall(r'[\w]+-[\w]+', text)
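The character-set notes above (excluding characters, joining the pieces, and matching hyphenated words) can be run together. The punctuation and digit sentences are assumptions made up for the demo; the strip() call is an addition of mine so the joined result does not end up with double spaces.

```python
import re

# Exclude punctuation: each chunk is a run of characters that are not ! . or ?
text = "This is a string! But it has punctuation. How to remove it?"
chunks = re.findall(r"[^!.?]+", text)
clean = " ".join(chunk.strip() for chunk in chunks)

# Exclude digits: each chunk is a run of non-digit characters.
no_digits = re.findall(r"[^\d]+", "there are 3 numbers 34 inside 5 this")

# Hyphenated words: word characters, a hyphen, then word characters again.
hyphen_text = "Here is one-hand and two-hand code"
hyphenated = re.findall(r"[\w]+-[\w]+", hyphen_text)

print(clean)
print(no_digits)
print(hyphenated)
```

Note that [\w]+ behaves the same as \w+ here; the brackets only matter once more characters are added to the set.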