Title: | Find Related Items and Lexical Dimensions in a Lexicon |
---|---|
Description: | Implements code to identify lexical competitors in a given list of words. We include many of the standard competitor types used in spoken word recognition research, such as functions to find cohorts, neighbors, and rhymes, amongst many others. The package includes documentation for using a variety of lexicon files, including those with form codes made up of multiple letters (i.e., phoneme codes) and also basic orthographies. Importantly, the code makes use of multiple CPU cores and vectorization when possible, making it extremely fast and able to handle large lexicons. Additionally, the package contains documentation for users to easily write new functions, allowing researchers to examine other relationships within a lexicon. Preprint: <https://osf.io/preprints/psyarxiv/8dyru/>. Open access: <doi:10.3758/s13428-021-01667-6>. Citation: Li, Z., Crinnion, A.M. & Magnuson, J.S. (2021). <doi:10.3758/s13428-021-01667-6>. |
Authors: | ZhaoBin Li [aut, cre], Anne Marie Crinnion [aut], James S. Magnuson [aut, cph] |
Maintainer: | ZhaoBin Li <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.1.0 |
Built: | 2024-10-31 21:16:25 UTC |
Source: | https://github.com/comp-cogneuro-lang/lexfindr |
Cohorts overlap in onset phoneme(s).
get_cohorts(target, lexicon, sep = " ", form = FALSE, count = FALSE, overlap = 2)
target |
Character string containing a target word |
lexicon |
Character vector containing the lexical database |
sep |
Separator in target and lexicon |
form |
Whether to return words in lexicon |
count |
Whether to return count of words |
overlap |
(get_cohorts only) Integer specifying how many onset phonemes must match the target word |
the indexes of the competitors in the lexical database
get_cohorts("AA R K", c("AA R K", "AA R T", "B AA B"))
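The cohort definition above can be illustrated with a minimal base-R sketch. The helper name `cohorts_sketch` is hypothetical and is not part of the package; it only assumes that an item is a cohort when its first `overlap` phonemes equal those of the target, and it makes no claim about how the package implements the search.

```r
# Hypothetical helper (not the package's implementation): an item counts as a
# cohort when its first `overlap` phonemes are identical to the target's.
cohorts_sketch <- function(target, lexicon, sep = " ", overlap = 2) {
  # Split the target into phonemes and keep the onset.
  onset <- head(strsplit(target, sep, fixed = TRUE)[[1]], overlap)
  # Compare the onset of every lexicon entry with the target's onset.
  is_cohort <- vapply(
    strsplit(lexicon, sep, fixed = TRUE),
    function(phonemes) identical(head(phonemes, overlap), onset),
    logical(1)
  )
  which(is_cohort)
}

cohorts_sketch("AA R K", c("AA R K", "AA R T", "B AA B"))
# returns 1 2
```

Under this reading, the target is its own cohort whenever it appears in the lexicon.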
Cohorts that are not neighbors
get_cohortsP(target, lexicon, neighbors = "das", sep = " ", form = FALSE, count = FALSE)
target |
Character string containing a target word |
lexicon |
Character vector containing the lexical database |
neighbors |
Character string specifying which neighbor types to consider: include 'd', 'a', and/or 's' for the delete, add, and substitute neighbors of the target, respectively |
sep |
Separator in target and lexicon |
form |
Whether to return words in lexicon |
count |
Whether to return count of words |
the indexes of the competitors in the lexical database
get_cohortsP("AA R K", c("AA R K", "AA R", "B AA B"), neighbors = "das")
Embedding competitors are items embedded in the target.
get_embeds_in_target(target, lexicon, sep = " ", form = FALSE, count = FALSE)
target |
Character string containing a target word |
lexicon |
Character vector containing the lexical database |
sep |
Separator in target and lexicon |
form |
Whether to return words in lexicon |
count |
Whether to return count of words |
the indexes of the competitors in the lexical database
get_embeds_in_target("AA R K", c("AA R K", "AA R", "B AA B"))
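As a rough illustration of "embedded in the target", the base-R sketch below treats an item as embedded when its phoneme sequence occurs contiguously anywhere in the target. The helper name `embeds_in_target_sketch` is hypothetical, and the contiguity assumption is ours rather than something stated in this manual.

```r
# Hypothetical helper (not the package's implementation): an item is embedded
# when its whole phoneme sequence occurs contiguously inside the target.
embeds_in_target_sketch <- function(target, lexicon, sep = " ") {
  # Pad with the separator so that only whole phonemes can match,
  # e.g. "A R" must not match inside "AA R K".
  padded_target <- paste0(sep, target, sep)
  is_embedded <- vapply(
    lexicon,
    function(item) grepl(paste0(sep, item, sep), padded_target, fixed = TRUE),
    logical(1),
    USE.NAMES = FALSE
  )
  which(is_embedded)
}

embeds_in_target_sketch("AA R K", c("AA R K", "AA R", "B AA B"))
# returns 1 2
```

In this sketch the target counts as embedded in itself.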
Items embedded in the target which are not cohorts or neighbors
get_embeds_in_targetP(target, lexicon, neighbors = "das", sep = " ", form = FALSE, count = FALSE)
target |
Character string containing a target word |
lexicon |
Character vector containing the lexical database |
neighbors |
Character string specifying which neighbor types to consider: include 'd', 'a', and/or 's' for the delete, add, and substitute neighbors of the target, respectively |
sep |
Separator in target and lexicon |
form |
Whether to return words in lexicon |
count |
Whether to return count of words |
the indexes of the competitors in the lexical database
get_embeds_in_targetP("B AA R K IY", c("AA R K", "AA R", "AA R K IY", "B AA R"))
Get the log Frequency Weight (FW) of a competitor set
get_fw(competitors_freq, pad = 0)
competitors_freq |
Numeric vector containing the frequencies of the competitors (including the target itself) |
pad |
Value to add to frequencies before taking log; if your minimum frequency is 0, consider adding a value between 1 and 2; if your minimum frequency is between 0 and 1, consider adding 1 |
FW
get_fw(c(10, 50), pad = 1)
Get the log Frequency Weighted Competitor Probability (FWCP)
get_fwcp(target_freq, competitors_freq, pad = 0, add_target = FALSE)
target_freq |
Frequency of target word |
competitors_freq |
Numeric vector containing the frequencies of the competitors (including the target itself) |
pad |
Value to add to frequencies before taking log; if your minimum frequency is 0, consider adding a value between 1 and 2; if your minimum frequency is between 0 and 1, consider adding 1 |
add_target |
Boolean; set to TRUE if you want the target frequency added to the denominator; only do this if the target is not already included in the competitor set (e.g., if the target is in the lexicon, it will be captured as its own neighbor, its own cohort, etc.) |
log FWCP
get_fwcp(100, c(10, 50), pad = 1)
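This manual does not spell out the formulas, so the following base-R sketch is only one plausible reading: it assumes FW is the sum of `log(frequency + pad)` over the competitor set, and that FWCP is the target's log frequency divided by that sum. The names `fw_sketch` and `fwcp_sketch` are hypothetical; consult the package source or the cited paper for the exact definitions.

```r
# Assumption: FW = sum of log(frequency + pad) over the competitor set.
fw_sketch <- function(competitors_freq, pad = 0) {
  sum(log(competitors_freq + pad))
}

# Assumption: FWCP = log(target + pad) / FW, with the target's log frequency
# added to the denominator only when add_target = TRUE.
fwcp_sketch <- function(target_freq, competitors_freq, pad = 0, add_target = FALSE) {
  target_weight <- log(target_freq + pad)
  denominator <- fw_sketch(competitors_freq, pad)
  if (add_target) denominator <- denominator + target_weight
  target_weight / denominator
}

fw_sketch(c(10, 50), pad = 1)         # log(11) + log(51)
fwcp_sketch(100, c(10, 50), pad = 1)  # log(101) / (log(11) + log(51))
log(0)                                # -Inf, which is why a pad helps when frequencies can be 0
```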
Homoforms are items with the same form as the target (homophones, when the forms are phonological transcriptions)
get_homoforms(target, lexicon, sep = " ", form = FALSE, count = FALSE)
target |
Character string containing a target word |
lexicon |
Character vector containing the lexical database |
sep |
Separator in target and lexicon |
form |
Whether to return words in lexicon |
count |
Whether to return count of words |
the indexes of the competitors in the lexical database
get_homoforms("AA R K", c("AA R K", "AA R", "B AA B"))
Phonological neighbors are items which can be converted to the target by a single add, delete, or substitute operation
get_neighbors(target, lexicon, neighbors = "das", sep = " ", form = FALSE, count = FALSE)
target |
Character string containing a target word |
lexicon |
Character vector containing the lexical database |
neighbors |
(get_neighbors only) Character string specifying which neighbor types to return: include 'd', 'a', and/or 's' for the delete, add, and substitute neighbors of the target, respectively |
sep |
Separator in target and lexicon |
form |
Whether to return words in lexicon |
count |
Whether to return count of words |
the indexes of the competitors in the lexical database
get_neighbors("AA R K", c("AA R K", "AA R", "B AA B"), "d")
get_neighbors("AA R K", c("AA R K", "AA R", "B AA B"), "da")
get_neighbors("AA R K", c("AA R K", "AA R", "B AA B"), "das")
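To make the three neighbor types concrete, here is a minimal base-R sketch with a hypothetical helper, `neighbors_sketch`. It assumes that a delete neighbor is the target with one phoneme removed, an add neighbor is the target with one phoneme inserted, and a substitute neighbor has the same length and differs in at most one position (so the target is its own neighbor, in line with the `add_target` note for `get_fwcp`). It is not the package's implementation.

```r
# Hypothetical helper (not the package's implementation).
neighbors_sketch <- function(target, lexicon, neighbors = "das", sep = " ") {
  tgt <- strsplit(target, sep, fixed = TRUE)[[1]]
  types <- strsplit(neighbors, "")[[1]]
  # TRUE if `short` equals `long` with exactly one phoneme removed.
  one_deleted <- function(long, short) {
    length(long) == length(short) + 1 &&
      any(vapply(seq_along(long),
                 function(i) identical(long[-i], short), logical(1)))
  }
  is_neighbor <- vapply(strsplit(lexicon, sep, fixed = TRUE), function(w) {
    ("d" %in% types && one_deleted(tgt, w)) ||   # target minus one phoneme
      ("a" %in% types && one_deleted(w, tgt)) || # target plus one phoneme
      ("s" %in% types && length(w) == length(tgt) && sum(w != tgt) <= 1)
  }, logical(1))
  which(is_neighbor)
}

lexicon <- c("AA R K", "AA R", "B AA B")
neighbors_sketch("AA R K", lexicon, "d")    # returns 2
neighbors_sketch("AA R K", lexicon, "da")   # returns 2
neighbors_sketch("AA R K", lexicon, "das")  # returns 1 2
```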
Neighbors which are not cohorts or rhymes
get_neighborsP(target, lexicon, neighbors = "das", sep = " ", form = FALSE, count = FALSE)
target |
Character string containing a target word |
lexicon |
Character vector containing the lexical database |
neighbors |
Character string specifying which neighbor types to consider: include 'd', 'a', and/or 's' for the delete, add, and substitute neighbors of the target, respectively |
sep |
Separator in target and lexicon |
form |
Whether to return words in lexicon |
count |
Whether to return count of words |
the indexes of the competitors in the lexical database
get_neighborsP("AA R K", c("AA R K", "AA R", "B AA B"), neighbors = "das")
Items which are both cohorts and neighbors
get_nohorts(target, lexicon, neighbors = "das", sep = " ", form = FALSE, count = FALSE)
target |
Character string containing a target word |
lexicon |
Character vector containing the lexical database |
neighbors |
Character string specifying which neighbor types to consider: include 'd', 'a', and/or 's' for the delete, add, and substitute neighbors of the target, respectively |
sep |
Separator in target and lexicon |
form |
Whether to return words in lexicon |
count |
Whether to return count of words |
the indexes of the competitors in the lexical database
get_nohorts("AA R K", c("AA R K", "AA R", "B AA B"), neighbors = "das")
Rhymes overlap in all except onset phoneme(s)
get_rhymes(target, lexicon, sep = " ", form = FALSE, count = FALSE, mismatch = 1)
target |
Character string containing a target word |
lexicon |
Character vector containing the lexical database |
sep |
Separator in target and lexicon |
form |
Whether to return words in lexicon |
count |
Whether to return count of words |
mismatch |
(get_rhymes only) Integer specifying how many onset phonemes may mismatch the target word
the indexes of the competitors in the lexical database
get_rhymes("AA R K", c("AA R K", "B AA R K", "B AA B"))
Embedded competitors are items in which the target is embedded.
get_target_embeds_in(target, lexicon, sep = " ", form = FALSE, count = FALSE)
target |
Character string containing a target word |
lexicon |
Character vector containing the lexical database |
sep |
Separator in target and lexicon |
form |
Whether to return words in lexicon |
count |
Whether to return count of words |
the indexes of the competitors in the lexical database
get_target_embeds_in("AA R K", c("AA R K", "B AA R K", "B AA B"))
Items the target embeds into which are not cohorts or neighbors
get_target_embeds_inP(target, lexicon, neighbors = "das", sep = " ", form = FALSE, count = FALSE)
target |
Character string containing a target word |
lexicon |
Character vector containing the lexical database |
neighbors |
Character string specifying which neighbor types to consider: include 'd', 'a', and/or 's' for the delete, add, and substitute neighbors of the target, respectively |
sep |
Separator in target and lexicon |
form |
Whether to return words in lexicon |
count |
Whether to return count of words |
the indexes of the competitors in the lexical database
get_target_embeds_inP("B AA R K", c("AA R K", "AA R", "B AA R K IY", "B AA R"))
Phonological uniqueness point is the index at which the target becomes unique in the lexicon
get_uniqpt(target, lexicon, sep = " ")
target |
Character string containing a target word |
lexicon |
Character vector containing the lexical database |
sep |
Separator in target and lexicon |
The index at which the target becomes unique in the lexicon, or length + 1 if the target is not unique
get_uniqpt("AA R K", c("AA R", "B AA B", "B AA R K"))
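The uniqueness point can be sketched in base R as follows. The helper `uniqpt_sketch` is hypothetical, and it assumes that exact copies of the target in the lexicon are not counted as competitors; the package may treat such edge cases differently.

```r
# Hypothetical helper (not the package's implementation): walk through the
# target's phonemes and stop at the first position where no other lexicon
# item shares the same prefix.
uniqpt_sketch <- function(target, lexicon, sep = " ") {
  tgt <- strsplit(target, sep, fixed = TRUE)[[1]]
  # Assumption: copies of the target itself are not competitors.
  others <- strsplit(lexicon[lexicon != target], sep, fixed = TRUE)
  for (i in seq_along(tgt)) {
    shares_prefix <- vapply(
      others,
      function(w) length(w) >= i && identical(w[seq_len(i)], tgt[seq_len(i)]),
      logical(1)
    )
    if (!any(shares_prefix)) return(i)
  }
  # No position separates the target from the rest: report length + 1.
  length(tgt) + 1
}

uniqpt_sketch("AA R K", c("AA R", "B AA B", "B AA R K"))  # returns 3
uniqpt_sketch("AA R", c("AA R", "AA R K"))                # returns 3 (length + 1)
```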
Lemmalex is primarily based on the SUBTLEXus subtitle corpus (based on American subtitles, with 51 million items in total), reduced to lemmas using a copyrighted database (Francis and Kučera, 1982). Pronunciations are taken from the CMU Pronouncing Dictionary.
lemmalex
An object of class tbl_df (inherits from tbl, data.frame) with 17750 rows and 3 columns.
Reference: Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977-990.
Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Brown University Press.
CMU Pronouncing Dictionary: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
Format: A table with 20,293 rows and 3 variables:
SUBTLEXus dictionary reduced to lemmas
Number of times the item appeared in the SUBTLEXus corpus
ARPAbet transcription according to CMU
...
https://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus
TRACE slex lexicon translated by Nenadić and Tucker into ARPAbet pronunciation
slex
An object of class data.table (inherits from data.frame) with 212 rows and 3 columns.
TRACE slex lexicon with frequencies: McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18(1), 1-86.
ARPAbet transcription: Nenadić, F., & Tucker, B. V. (2020). Computational modelling of an auditory lexical decision experiment using jTRACE and TISK. Language, Cognition and Neuroscience, 1-29.
Format: A table with 212 rows and 2 variables:
TRACE slex transcription
ARPAbet transcription
...
https://era.library.ualberta.ca/items/61319cc6-436a-428c-b960-545bdc9bd5d3