View on GitHub

NLP-Gender-Analysis-

NLP comparison of genders mentioned in Wiki Pages

Gender Entity Analysis Using NLTK

Russian vs French Monarchies Comparison

rus_symb rus_symb

Importing

import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download("names")
nltk.download("genesis")
nltk.download("inaugural")
nltk.download("nps_chat")
nltk.download("webtext")
nltk.download("treebank")
nltk.download('gutenberg')
nltk.download('punkt')

Defining our functions

def is_name(word):
	return True if word in names else False

def is_female_name(word):
	return True if word in female_names else False

def get_web_text(url1):
     from bs4 import BeautifulSoup
     from urllib import request
     html1 = request.urlopen(url1).read().decode('utf8')
     the_text= BeautifulSoup(html1, 'html.parser').get_text()
     return the_text

def top_names(number,text):
    txt_names=[name for name in filter(is_name,text)]
    names_freq=nltk.FreqDist(txt_names)
    top_names={}
    for name,count in names_freq.most_common(number):
        top_names[name]=count
    return top_names

Scraping from the web with BeautifulSoup

Both English Values from Wikipedia:

1st

2nd

Analyzing with ‘names’ corpus


names=nltk.corpus.names.words()
female_names=nltk.corpus.names.words('female.txt')
male_names = nltk.corpus.names.words('male.txt')

Getting our outputs detecting & sorting


def analyze_text_names(url1):
     web_text=nltk.word_tokenize(get_web_text(url1))
     all_names=[name for name in filter(is_name,web_text)]
     all_names_dict=sorted(set(all_names))
     female_names_dict=[name for name in filter(is_female_name,all_names_dict)]
    
     print("\r\n url: " + url1)
     print("\r\n percentage of female names: " + "{:.1%}".format(len(female_names_dict) / len(all_names_dict)))
     print("\r\n all names: ("+ str(len(all_names_dict))+")\r\n" + str(all_names_dict))
     print("\r\n female names: ("+str(len(female_names_dict)) +") \r\n"  + str(female_names_dict))

Femenine names vs. Masculine names in each wiki value

NapoleonCathrine

Displaying our outputs


#__________________________________________________


analyze_text_names("https://en.wikipedia.org/wiki/List_of_Russian_monarchs")

 url: https://en.wikipedia.org/wiki/List_of_Russian_monarchs

 all names: (60)
['Alexander', 'Alexandra', 'Alexis', 'Alta', 'Anastasia', 'Andrei', 'Andrew', 'Andrey', 'Anna', 'Anne', 'April', 'August', 'Boris', 'Canada', 'Catherine', 'Christina', 'Cookie', 'Curtis', 'Cyril', 'Daniel', 'Dimitri', 'Duke', 'Elena', 'Elizabeth', 'France', 'George', 'Glenn', 'Harvard', 'Igor', 'Ivan', 'June', 'Karl', 'King', 'Konstantin', 'Lucia', 'Maria', 'May', 'Maya', 'Michael', 'Mikhail', 'Natalia', 'Natalya', 'Nicholas', 'Oleg', 'Olga', 'Paul', 'Peter', 'Prince', 'Royal', 'Saul', 'See', 'Simeon', 'Simon', 'Sophia', 'Vasili', 'Vasily', 'Vincent', 'Vladimir', 'Xenia', 'Yuri']

 female names: (28) 
['Alexandra', 'Alexis', 'Alta', 'Anastasia', 'Andrei', 'Anna', 'Anne', 'April', 'Canada', 'Catherine', 'Christina', 'Cookie', 'Daniel', 'Elena', 'Elizabeth', 'France', 'George', 'Glenn', 'June', 'Lucia', 'Maria', 'May', 'Maya', 'Natalia', 'Natalya', 'Olga', 'Sophia', 'Xenia']


 percentage of female names: 46.7 %
 

#__________________________________________________

analyze_text_names("https://en.wikipedia.org/wiki/List_of_French_monarchs")

 url: https://en.wikipedia.org/wiki/List_of_French_monarchs


 all names: (53)
['Adrien', 'Antoine', 'April', 'August', 'Auguste', 'Augustus', 'Brewer', 'Catherine', 'Charles', 'Clovis', 'Cookie', 'David', 'Duke', 'Edward', 'France', 'Francis', 'French', 'George', 'Gita', 'Henri', 'Henry', 'Hercule', 'Hugh', 'Isabella', 'Jean', 'Joan', 'John', 'June', 'King', 'Lion', 'Louis', 'Magnus', 'Mary', 'May', 'Michael', 'Napoleon', 'Pascal', 'Philip', 'Philippe', 'Prince', 'Raoul', 'Rex', 'Richard', 'Robert', 'Rudolph', 'See', 'Son', 'Sterling', 'Temple', 'Walter', 'Webster', 'West', 'Whitney']

 female names: (18) 
['Adrien', 'April', 'Auguste', 'Catherine', 'Clovis', 'Cookie', 'France', 'Francis', 'George', 'Gita', 'Isabella', 'Jean', 'Joan', 'June', 'Mary', 'May', 'Philippe', 'Whitney']

 percentage of female names: 34.0 %
 

Visualizing the results

Most common names in French Monarchy:

Most common names in Russian Monarchy:

family_tree

Proportion of Femenine/ Masculine Entities Mentioned in values compared: