Python String Pattern Process

Using Python to reshape a string column into one long string and run a pattern match

Posted by Jiashu Miao on August 12, 2022

Project Overview

  • We have a list of keywords in an Excel sheet.
  • We want to check every video title and see whether any of the keywords appears in it.
  • To do this, we first load the keywords into a DataFrame with the pandas package (the code below reads them from a CSV file).
  • Then we collapse the keyword column into one long string, using tolist() and '|'.join().
  • pandas has a string method, Series.str.contains(<YOUR PATTERN>), that checks each value against a pattern. Here the pattern is the list of keywords separated by |.
  • | works as "or" in regular-expression syntax. For example, with the long string "2022 bolsonaro|2022 doria|brazil", a video title that contains any of the 3 phrases counts as a match. The string should not start or end with |, because an empty alternative matches every title.
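To make the "or" behaviour concrete, here is a minimal sketch on a toy Series. The titles are made up for illustration and are not from the real dataset; only the pattern string comes from the example above.

```python
import pandas as pd

# Hypothetical titles, made up for illustration only.
titles = pd.Series([
    "brazil votes this weekend",
    "cooking pasta at home",
    "2022 doria campaign speech",
])

# Three phrases joined with "|" (regex alternation, i.e. "or").
pattern = "2022 bolsonaro|2022 doria|brazil"

# str.contains returns a boolean Series: True where any phrase appears.
mask = titles.str.contains(pattern)
matched = titles[mask].tolist()
print(matched)
```

The first and third titles are kept because each contains one of the phrases; the second is dropped.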

Code

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
import pandas as pd


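# load the keyword table and keep only the first two columns
key = pd.read_csv("/home/tiger/br_election/brkeywords.csv")
key = key.iloc[:, 0:2]
key.head(10)

# join the 'Name' column into one long "a|b|c" pattern string
keywords = '|'.join(key['Name'].tolist())
keywords[1:20]  # preview part of the pattern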

  • The column is now concatenated into one long string.
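One thing to watch: because the joined string is treated as a regular expression, keywords containing characters like + or ( can change the meaning of the pattern, and a missing or empty keyword can break the join or match everything. The sketch below is one possible way to guard against that, not part of the original pipeline. The toy keyword table is hypothetical; only the column name 'Name' follows the code above.

```python
import re
import pandas as pd

# Hypothetical keyword table; the column name 'Name' follows the post.
key = pd.DataFrame({"Name": ["2022 doria", "c++ (live)", None, "  brazil "]})

# Drop missing values, cast to str, and trim stray whitespace.
names = key["Name"].dropna().astype(str).str.strip()
names = names[names != ""]  # an empty alternative would match every title

# Escape regex metacharacters so each keyword is matched literally.
keywords = "|".join(re.escape(n) for n in names)
print(keywords)
```

If the keywords are meant to be regular expressions themselves, skip the re.escape step.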
# find matching titles
elec = pd.read_csv("/home/tiger/br_election/election_video.csv")
# keep rows whose title matches any keyword; na=False treats missing titles as non-matches
elec = elec[elec['title'].str.contains(keywords, na=False)]
elec = elec.drop_duplicates()
  • Done!!
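By default str.contains is case-sensitive, so "Brazil" in a title would not match the keyword "brazil". If that matters for your data, a possible variation is to pass case=False. The sketch below uses a small made-up DataFrame in place of election_video.csv; the titles are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for election_video.csv, with a duplicate row
# and a missing title.
elec = pd.DataFrame({
    "title": [
        "BRAZIL election night",
        "Brazil debate recap",
        "Brazil debate recap",
        None,
        "weather report",
    ]
})

keywords = "2022 bolsonaro|2022 doria|brazil"

# case=False ignores letter case; na=False treats missing titles as non-matches.
mask = elec["title"].str.contains(keywords, case=False, na=False)
result = elec[mask].drop_duplicates()
print(result)
```

Here both spellings of "Brazil" match, the repeated row is dropped, and the missing title is skipped instead of raising an error.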

    Edited by Jiashu Miao :) 08/12/2022