Overview
This is a project I work on a yearbook PDF document (University of Cincinnati_1928) and would like to extract the name, places, and institutions in the book for analysis and further study.
The language utilized in this project is mainly R and several packages have been included.
This post will talk about some basic setup to use NLP in R and another shortcut method for fast and convenient extraction for entities in text files.
Documents
- The original yearbook PDF file:
- The converted text file using R for format transformation(part of it):
iBook tai
(^rgan^ationg
iBook toil
Inimour
Contents
iBook i
Campus impressions
iBook ii
3bmimstration
iBook ill
Seniors
IBook ito
StfjleticS
iBook to
Sctibities;
Seniors;
Abaecherli
Alexander
Abbihl
Allbright
Abbott Ames
...
Instruction
Install and load the required packages
library(rJava)
install.packages(c("NLP", "openNLP", "RWeka", "qdap"))
install.packages("openNLPmodels.en",
repos = "http://datacube.wu.ac.at/",
type = "source")
library(NLP)
library(openNLP)
library(RWeka)
library(magrittr)
Make sure Java is installed in your computer
en
means in English
Basic Tokenization – Reading a text file
bio <- readLines("PATHTOMYDOCUMENT/University\ of\ Cincinnati_1928_txt.txt")
print(bio)
…
[28] “CARL E. ABAECHERLI, LL. B.; A9<F; TKA; “C." Hughes High School.”
[29] “Cross Country 2, 3, 4; Track 2, 3, 4; Glee Club 2;”
[30] “French Club 2, 3: Debate Team 4: Cincinnati Law Review Staff 6.”
[31] “OTTO RENNER ALEXANDER, LL. B.; AXA: TAA. Hughes High School.”
[32] “Student Council 5.”
[33] “LOUISE ABBIHL, A. B.: ZTA.”
[34] “Hughes High School.”
[35] “Y. W. C. A.; W. S. G. A.: Greek Games 1, 2: Basketball 1, 2; League of Women Voters 3, 4; Sociology Club 4.”
[36] “MARION E. ABBOTT, M. B .; ATA; AET.”
[37] “Hughes High School.”
[38] “DURWARD ALLBRIGHT, C. E.”
[39] “Crockett, Texas, H
…
bio <- paste(bio, collapse = " ")
## You can see that there are thirteen lines in the file, each
## contained in a character vector.
## We can combine all of these character vectors into a single character vector
## using the paste() function, adding a space between each of them.
bio <- as.String(bio)
## Next we need to create annotators for words and sentences.
## Annotators are created by functions which load the underlying
## Java libraries. These functions then mark the places in the
## string where words and sentences start and end. The annotation ## functions are themselves created by functions.
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
bio_annotations <- annotate(bio, list(sent_ann, word_ann))
class(bio_annotations)
head(bio_annotations)
id type start end features
1 sentence 1 110 constituents=«integer,20»
2 sentence 112 240 constituents=«integer,20»
3 sentence 242 386 constituents=«integer,25»
4 sentence 388 412 constituents=«integer,6»
5 sentence 414 693 constituents=«integer,49»
6 sentence 695 791 constituents=«integer,19»
We can combine the biography and the annotations to create what the NLP package calls an AnnotatedPlainTextDocument
. If we wishd we could also associate metadata with the object using the meta =
argument.
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
Annotating people and places!
These kinds of annotator functions are created using the same kinds of constructor functions that we used for word_ann()
and sent_ann()
.
person_ann <- Maxent_Entity_Annotator(kind = "person")
location_ann <- Maxent_Entity_Annotator(kind = "location")
organization_ann <- Maxent_Entity_Annotator(kind = "organization")
pipeline <- list(sent_ann,
word_ann,
person_ann,
location_ann,
organization_ann)
bio_annotations <- annotate(bio, pipeline)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
entities <- function(doc, kind) {
s <- doc$content
a <- annotation(doc)[[1]]
if(hasArg(kind)) {
k <- sapply(a$features, `[[`, "kind")
s[a[k == kind]]
} else {
s[a[a$type == "entity"]]
}
}
Remember the updated package use
annotation()
without an s after the name of the function.
entities(bio_doc, kind = "person")
[1] “Jarena Lee” “Richard Allen” “Lee” “Joseph Lee”
The name of the person then could be printed out by applying this method from a text file.
Shortcut Method
- Inspired by Easy named entity extraction
- entity is wrapper to simplify and extend NLP and openNLP named entity recognition. The package contains 6 entity extractors that take a text vector and return a list of vectors of named entities. The entity extractors include:
1. person_entity
2. location_entity
3. organization_entity
4. date_entity
5. money_entity
6. percent_entity
Sample code for fast extraction of name from text files
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/entity")
bio <- readLines("/Users/MichaelMiao/Documents/career/FlowerFlower/University\ of\ Cincinnati_1928_txt.txt")
name <- person_entity(bio)
name
Attachment
Here I attached the R code me writing for testing only. (Not in correct order, for reference)
---
title: "flowerflower"
author: "JIASHU MIAO"
date: "10/16/2019"
output: html_document
---
```{r}
getwd()
install.packages("openNLP")
install.packages("readxl")
install.packages("readr")
install.packages("car")
install.packages(c("NLP", "openNLP", "RWeka", "qdap"))
install.packages("magrittr")
pkg <- c("readr","readxl","dplyr","stringr","ggplot2","tidyr","car")
pkgload <- lapply(pkg, require, character.only = TRUE)
library(rJava)
library(NLP)
library(openNLP)
library(RWeka)
library("readxl")
print("read successfully")
!sudo R CMD javareconf #### in terminal yes
install.packages("openNLPmodels.en",
repos = "http://datacube.wu.ac.at/",
type = "source")
Sys.setenv(JAVA_HOME="C:/Program Files/Java/jdk-13.0.1/")
# install.packages("rJava")
library(rJava)
#install.packages(c("NLP", "openNLP", "RWeka", "qdap"))
require("NLP")
library(openNLP)
library(RWeka)
person_ann <- Maxent_Entity_Annotator(kind = "person")
location_ann <- Maxent_Entity_Annotator(kind = "location")
bio <- readLines("/Users/MichaelMiao/Documents/career/FlowerFlower/University\ of\ Cincinnati_1928_txt.txt")
bio
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
print(bio)
length(bio)
bio_annotations <- annotate(bio, list(sent_ann, word_ann))
class(bio_annotations)
head(bio_annotations)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
print(bio_doc)
person_ann <- Maxent_Entity_Annotator(kind = "person")
location_ann <- Maxent_Entity_Annotator(kind = "location")
organization_ann <- Maxent_Entity_Annotator(kind = "organization")
person_ann
pipeline <- list(sent_ann,
word_ann,
person_ann
)
bio_annotations <- annotate(bio, pipeline)
bio_annotations
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
entities <- function(doc, kind) {
s <- doc$content
a <- annotation(doc)[[1]]
if(hasArg(kind)) {
k <- sapply(a$features, `[[`, kind)
s[a[k == kind]]
} else {
s[a[a$type == "entity"]]
}
}
sapply(s,Maxent_Entity_Annotator(kind = "person"))
Maxent_Entity_Annotator(kind = "person")
entities(bio_doc, kind = "person")
a=annotation(bio_doc)[[1]]
s <- bio_doc$content
a$type
#a$features
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/entity")
bio <- readLines("/Users/MichaelMiao/Documents/career/FlowerFlower/University\ of\ Cincinnati_1928_txt.txt")
print(bio)
name <- person_entity(bio)
name
sapply(a$features,'person')
a$features
bio <- paste(bio, collapse = " ")
print(bio) ## combine them together
library(NLP)
library(openNLP)
library(magrittr)
bio <- as.String(bio)
bio
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
bio_annotations <- annotate(bio, list(sent_ann, word_ann))
class(bio_annotations)
head(bio_annotations)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
sents(bio_doc) %>% head(10)
words(bio_doc) %>% head(10)
person_ann <- Maxent_Entity_Annotator(kind = "person")
location_ann <- Maxent_Entity_Annotator(kind = "location")
organization_ann <- Maxent_Entity_Annotator(kind = "organization")
typeof(bio)
pipeline <- list(sent_ann,
word_ann,
person_ann,
location_ann,
organization_ann)
bio_annotations <- annotate(bio,pipeline)
bio_doc <- AnnotatedPlainTextDocument(as.character(bio), bio_annotations)
paras(bio_doc)
test_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
}
```{r}
print(bio_doc)
~~~~~~~~~~~~~~~~~~~
title: “flowerflower” author: “JIASHU MIAO” date: “10/16/2019” output: html_document —
getwd()
#install.packages("openNLP")
# install.packages("readxl")
# install.packages("readr")
# install.packages("car")
#install.packages(c("NLP", "openNLP", "RWeka", "qdap"))
#install.packages("magrittr")
pkg <- c("readr","readxl","dplyr","stringr","ggplot2","tidyr","car")
pkgload <- lapply(pkg, require, character.only = TRUE)
library(rJava)
library(NLP)
library(openNLP)
library(RWeka)
library("readxl")
print("read successfully")
# !sudo R CMD javareconf #### in terminal yes
# install.packages("openNLPmodels.en",
# repos = "http://datacube.wu.ac.at/",
# type = "source")
# # Sys.setenv(JAVA_HOME="C:/Program Files/Java/jdk-13.0.1/")
# # install.packages("rJava")
# library(rJava)
# #install.packages(c("NLP", "openNLP", "RWeka", "qdap"))
# require("NLP")
# library(openNLP)
# library(RWeka)
#person_ann <- Maxent_Entity_Annotator(kind = "person")
#location_ann <- Maxent_Entity_Annotator(kind = "location")
bio <- readLines("/Users/MichaelMiao/Documents/career/FlowerFlower/University\ of\ Cincinnati_1928_txt.txt")
# bio
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
#print(bio)
length(bio)
bio_annotations <- annotate(bio, list(sent_ann, word_ann))
class(bio_annotations)
head(bio_annotations)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
print(bio_doc)
person_ann <- Maxent_Entity_Annotator(kind = "person")
location_ann <- Maxent_Entity_Annotator(kind = "location")
organization_ann <- Maxent_Entity_Annotator(kind = "organization")
person_ann
pipeline <- list(sent_ann,
word_ann,
person_ann
)
bio_annotations <- annotate(bio, pipeline)
# bio_annotations
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
entities <- function(doc, kind) {
s <- doc$content
a <- annotation(doc)[[1]]
if(hasArg(kind)) {
k <- sapply(a$features, `[[`, kind)
s[a[k == kind]]
} else {
s[a[a$type == "entity"]]
}
}
sapply(s,Maxent_Entity_Annotator(kind = "person"))
Maxent_Entity_Annotator(kind = "person")
entities(bio_doc, kind = "person")
a=annotation(bio_doc)[[1]]
s <- bio_doc$content
a$type
#a$features
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/entity")
bio <- readLines("/Users/MichaelMiao/Documents/career/FlowerFlower/University\ of\ Cincinnati_1928_txt.txt")
print(bio)
name <- person_entity(bio)
name
#sapply(a$features,'person')
#a$features
bio <- paste(bio, collapse = " ")
#print(bio) ## combine them together
library(NLP)
library(openNLP)
library(magrittr)
bio <- as.String(bio)
#bio
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
bio_annotations <- annotate(bio, list(sent_ann, word_ann))
class(bio_annotations)
head(bio_annotations)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
sents(bio_doc) %>% head(10)
words(bio_doc) %>% head(10)
person_ann <- Maxent_Entity_Annotator(kind = "person")
location_ann <- Maxent_Entity_Annotator(kind = "location")
organization_ann <- Maxent_Entity_Annotator(kind = "organization")
typeof(bio)
pipeline <- list(sent_ann,
word_ann,
person_ann,
location_ann,
organization_ann)
bio_annotations <- annotate(bio,pipeline)
bio_doc <- AnnotatedPlainTextDocument(as.character(bio), bio_annotations)
paras(bio_doc)
test_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
#test_doc
# bio_annotations
bio_doc
bio_annotation
entities <- function(doc, kind) {
s <- doc$content
a <- annotation(doc)[[1]]
if(hasArg(kind)) {
k <- sapply(a$features, `[[`, "kind")
s[a[k == kind]]
} else {
s[a[a$type == "entity"]]
}
}
#as.Annotator_Pipeline(f)
#bio_doc$content
entities(bio_doc,"location")
word_ann <- Maxent_Word_Token_Annotator() sent_ann <- Maxent_Sent_Token_Annotator() pipeline <- list(sent_ann, word_ann, person_ann, location_ann) bio_annotations <- annotate(bio, pipeline) bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
entities <- function(doc, kind) {
s <- doc$content
a <- annotation(doc)[[1]]
if(hasArg(kind)) {
k <- sapply(a$features, [[
, “kind”)
s[a[k == kind]]
} else {
s[a[a$type == “entity”]]
}
}
entities(bio_doc, kind = “person”)
entities(bio_doc, kind = “location”)
bio_doc
entities(bio_doc)
```{r}
print(bio_doc)