Natural Language Processing for Entity(name, place, etc.) Extraction using R

Overview

This is a project I work on a yearbook PDF document (University of Cincinnati_1928) and would like to extract the name, places, and institutions in the book for analysis and further study.

The language utilized in this project is mainly R and several packages have been included.

This post will talk about some basic setup to use NLP in R and another shortcut method for fast and convenient extraction for entities in text files.

Documents

The original yearbook PDF file:

The converted text file using R for format transformation(part of it):

iBook tai
(^rgan^ationg
iBook toil
Inimour
Contents
iBook i
Campus impressions
iBook ii
3bmimstration
iBook ill
Seniors
IBook ito
StfjleticS
iBook to
Sctibities;
Seniors;
Abaecherli
Alexander
Abbihl
Allbright
Abbott Ames
...

Instruction

Install and load the required packages

library(rJava)
install.packages(c("NLP", "openNLP", "RWeka", "qdap"))
install.packages("openNLPmodels.en",
                 repos = "http://datacube.wu.ac.at/",
                 type = "source")
library(NLP)
library(openNLP)
library(RWeka)
library(magrittr)

Make sure Java is installed in your computer

en means in English

Basic Tokenization – Reading a text file

bio <- readLines("PATHTOMYDOCUMENT/University\ of\ Cincinnati_1928_txt.txt")
print(bio)

…

[28] “CARL E. ABAECHERLI, LL. B.; A9<F; TKA; “C." Hughes High School.”
[29] “Cross Country 2, 3, 4; Track 2, 3, 4; Glee Club 2;”
[30] “French Club 2, 3: Debate Team 4: Cincinnati Law Review Staff 6.”
[31] “OTTO RENNER ALEXANDER, LL. B.; AXA: TAA. Hughes High School.”
[32] “Student Council 5.”
[33] “LOUISE ABBIHL, A. B.: ZTA.”
[34] “Hughes High School.”
[35] “Y. W. C. A.; W. S. G. A.: Greek Games 1, 2: Basketball 1, 2; League of Women Voters 3, 4; Sociology Club 4.”
[36] “MARION E. ABBOTT, M. B .; ATA; AET.”
[37] “Hughes High School.”
[38] “DURWARD ALLBRIGHT, C. E.”
[39] “Crockett, Texas, H

…

bio <- paste(bio, collapse = " ")

## You can see that there are thirteen lines in the file, each
## contained in a character vector. 
## We can combine all of these character vectors into a single character vector 
## using the paste() function, adding a space between each of them.

bio <- as.String(bio)


## Next we need to create annotators for words and sentences. 
## Annotators are created by functions which load the underlying 
## Java libraries. These functions then mark the places in the
## string where words and sentences start and end. The annotation  ## functions are themselves created by functions.

word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
bio_annotations <- annotate(bio, list(sent_ann, word_ann))
class(bio_annotations)
head(bio_annotations)

id type start end features

1 sentence 1 110 constituents=«integer,20»

2 sentence 112 240 constituents=«integer,20»

3 sentence 242 386 constituents=«integer,25»

4 sentence 388 412 constituents=«integer,6»

5 sentence 414 693 constituents=«integer,49»

6 sentence 695 791 constituents=«integer,19»

We can combine the biography and the annotations to create what the NLP package calls an AnnotatedPlainTextDocument. If we wishd we could also associate metadata with the object using the meta = argument.


bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)

Annotating people and places!

These kinds of annotator functions are created using the same kinds of constructor functions that we used for word_ann() and sent_ann().

person_ann <- Maxent_Entity_Annotator(kind = "person")
location_ann <- Maxent_Entity_Annotator(kind = "location")
organization_ann <- Maxent_Entity_Annotator(kind = "organization")


pipeline <- list(sent_ann,
                 word_ann,
                 person_ann,
                 location_ann,
                 organization_ann)
bio_annotations <- annotate(bio, pipeline)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)


entities <- function(doc, kind) {
  s <- doc$content
  a <- annotation(doc)[[1]]
  if(hasArg(kind)) {
    k <- sapply(a$features, `[[`, "kind")
    s[a[k == kind]]
  } else {
    s[a[a$type == "entity"]]
  }
}

Remember the updated package use annotation() without an s after the name of the function.


entities(bio_doc, kind = "person")

[1] “Jarena Lee” “Richard Allen” “Lee” “Joseph Lee”

The name of the person then could be printed out by applying this method from a text file.

Shortcut Method

Inspired by Easy named entity extraction

entity is wrapper to simplify and extend NLP and openNLP named entity recognition. The package contains 6 entity extractors that take a text vector and return a list of vectors of named entities. The entity extractors include:

person_entity
location_entity
organization_entity
date_entity
money_entity
percent_entity

Sample code for fast extraction of name from text files

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/entity")

bio <- readLines("/Users/MichaelMiao/Documents/career/FlowerFlower/University\ of\ Cincinnati_1928_txt.txt")

name <- person_entity(bio)

name

Attachment

Here I attached the R code me writing for testing only. (Not in correct order, for reference)

---
title: "flowerflower"
author: "JIASHU MIAO"
date: "10/16/2019"
output: html_document
---
```{r}
getwd()

install.packages("openNLP")
 install.packages("readxl")
 install.packages("readr")
 install.packages("car")
install.packages(c("NLP", "openNLP", "RWeka", "qdap"))
install.packages("magrittr")

pkg <- c("readr","readxl","dplyr","stringr","ggplot2","tidyr","car")
pkgload <- lapply(pkg, require, character.only = TRUE)
library(rJava)
library(NLP)
library(openNLP)
library(RWeka)
library("readxl")
print("read successfully")
 !sudo R CMD javareconf #### in terminal yes

 install.packages("openNLPmodels.en",
                  repos = "http://datacube.wu.ac.at/",
                  type = "source")

 Sys.setenv(JAVA_HOME="C:/Program Files/Java/jdk-13.0.1/")
 # install.packages("rJava")
library(rJava)
 #install.packages(c("NLP", "openNLP", "RWeka", "qdap"))
 require("NLP")
 library(openNLP)
 library(RWeka)

person_ann <- Maxent_Entity_Annotator(kind = "person")
location_ann <- Maxent_Entity_Annotator(kind = "location")

bio <- readLines("/Users/MichaelMiao/Documents/career/FlowerFlower/University\ of\ Cincinnati_1928_txt.txt")
 bio
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
print(bio)
length(bio)
bio_annotations <- annotate(bio, list(sent_ann, word_ann))
class(bio_annotations)
head(bio_annotations)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
print(bio_doc)


person_ann <- Maxent_Entity_Annotator(kind = "person")
location_ann <- Maxent_Entity_Annotator(kind = "location")
organization_ann <- Maxent_Entity_Annotator(kind = "organization")
person_ann


pipeline <- list(sent_ann,
                 word_ann,
                 person_ann 
                 )
bio_annotations <- annotate(bio, pipeline)
 bio_annotations
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)

entities <- function(doc, kind) {
  s <- doc$content
  a <- annotation(doc)[[1]]
  if(hasArg(kind)) {
    k <- sapply(a$features, `[[`, kind)
    s[a[k == kind]]
  } else {
    s[a[a$type == "entity"]]
  }
}


sapply(s,Maxent_Entity_Annotator(kind = "person"))
Maxent_Entity_Annotator(kind = "person")
entities(bio_doc, kind = "person")

a=annotation(bio_doc)[[1]]
s <- bio_doc$content

a$type
#a$features

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/entity")


bio <- readLines("/Users/MichaelMiao/Documents/career/FlowerFlower/University\ of\ Cincinnati_1928_txt.txt")
print(bio)
name <- person_entity(bio)

name

  
  


sapply(a$features,'person')















a$features

bio <- paste(bio, collapse = " ")
print(bio) ## combine them together

library(NLP)
library(openNLP)
library(magrittr)

bio <- as.String(bio)
bio

word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
bio_annotations <- annotate(bio, list(sent_ann, word_ann))
class(bio_annotations)

head(bio_annotations)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)



sents(bio_doc) %>% head(10)
words(bio_doc) %>% head(10)

person_ann <- Maxent_Entity_Annotator(kind = "person")
location_ann <- Maxent_Entity_Annotator(kind = "location")
organization_ann <- Maxent_Entity_Annotator(kind = "organization")




typeof(bio)

pipeline <- list(sent_ann,
                 word_ann,
                 person_ann,
                 location_ann,
                 organization_ann)
bio_annotations <- annotate(bio,pipeline)
bio_doc <- AnnotatedPlainTextDocument(as.character(bio), bio_annotations)
paras(bio_doc)
test_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)

}

```{r}
print(bio_doc)

~~~~~~~~~~~~~~~~~~~

title: “flowerflower” author: “JIASHU MIAO” date: “10/16/2019” output: html_document —

getwd()

#install.packages("openNLP")
# install.packages("readxl")
# install.packages("readr")
# install.packages("car")
#install.packages(c("NLP", "openNLP", "RWeka", "qdap"))
#install.packages("magrittr")

pkg <- c("readr","readxl","dplyr","stringr","ggplot2","tidyr","car")
pkgload <- lapply(pkg, require, character.only = TRUE)
library(rJava)
library(NLP)
library(openNLP)
library(RWeka)
library("readxl")
print("read successfully")
# !sudo R CMD javareconf #### in terminal yes

# install.packages("openNLPmodels.en",
#                  repos = "http://datacube.wu.ac.at/",
#                  type = "source")

# # Sys.setenv(JAVA_HOME="C:/Program Files/Java/jdk-13.0.1/")
# # install.packages("rJava")
# library(rJava)
# #install.packages(c("NLP", "openNLP", "RWeka", "qdap"))
# require("NLP")
# library(openNLP)
# library(RWeka)

#person_ann <- Maxent_Entity_Annotator(kind = "person")
#location_ann <- Maxent_Entity_Annotator(kind = "location")

bio <- readLines("/Users/MichaelMiao/Documents/career/FlowerFlower/University\ of\ Cincinnati_1928_txt.txt")
# bio
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
#print(bio)
length(bio)
bio_annotations <- annotate(bio, list(sent_ann, word_ann))
class(bio_annotations)
head(bio_annotations)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
print(bio_doc)


person_ann <- Maxent_Entity_Annotator(kind = "person")
location_ann <- Maxent_Entity_Annotator(kind = "location")
organization_ann <- Maxent_Entity_Annotator(kind = "organization")
person_ann


pipeline <- list(sent_ann,
                 word_ann,
                 person_ann 
                 )
bio_annotations <- annotate(bio, pipeline)
# bio_annotations
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)

entities <- function(doc, kind) {
  s <- doc$content
  a <- annotation(doc)[[1]]
  if(hasArg(kind)) {
    k <- sapply(a$features, `[[`, kind)
    s[a[k == kind]]
  } else {
    s[a[a$type == "entity"]]
  }
}


sapply(s,Maxent_Entity_Annotator(kind = "person"))
Maxent_Entity_Annotator(kind = "person")
entities(bio_doc, kind = "person")

a=annotation(bio_doc)[[1]]
s <- bio_doc$content

a$type
#a$features

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/entity")


bio <- readLines("/Users/MichaelMiao/Documents/career/FlowerFlower/University\ of\ Cincinnati_1928_txt.txt")
print(bio)
name <- person_entity(bio)

name

  
  


#sapply(a$features,'person')















#a$features

bio <- paste(bio, collapse = " ")
#print(bio) ## combine them together

library(NLP)
library(openNLP)
library(magrittr)

bio <- as.String(bio)
#bio

word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
bio_annotations <- annotate(bio, list(sent_ann, word_ann))
class(bio_annotations)

head(bio_annotations)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)



sents(bio_doc) %>% head(10)
words(bio_doc) %>% head(10)

person_ann <- Maxent_Entity_Annotator(kind = "person")
location_ann <- Maxent_Entity_Annotator(kind = "location")
organization_ann <- Maxent_Entity_Annotator(kind = "organization")




typeof(bio)

pipeline <- list(sent_ann,
                 word_ann,
                 person_ann,
                 location_ann,
                 organization_ann)
bio_annotations <- annotate(bio,pipeline)
bio_doc <- AnnotatedPlainTextDocument(as.character(bio), bio_annotations)
paras(bio_doc)
test_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)


#test_doc

# bio_annotations

bio_doc

bio_annotation



entities <- function(doc, kind) {
  s <- doc$content
  a <- annotation(doc)[[1]]
  if(hasArg(kind)) {
    k <- sapply(a$features, `[[`, "kind")
    s[a[k == kind]]
  } else {
    s[a[a$type == "entity"]]
  }
}

#as.Annotator_Pipeline(f)

#bio_doc$content
entities(bio_doc,"location")

word_ann <- Maxent_Word_Token_Annotator() sent_ann <- Maxent_Sent_Token_Annotator() pipeline <- list(sent_ann, word_ann, person_ann, location_ann) bio_annotations <- annotate(bio, pipeline) bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)

entities <- function(doc, kind) { s <- doc$content a <- annotation(doc)[[1]] if(hasArg(kind)) { k <- sapply(a$features, [[, “kind”) s[a[k == kind]] } else { s[a[a$type == “entity”]] } } entities(bio_doc, kind = “person”) entities(bio_doc, kind = “location”) bio_doc entities(bio_doc)

```{r}
print(bio_doc)

Apply NLP techniques in R to Annotate people and places in text files and extract them into a clean table.