Algorithm
Authenticus aims to build a certified database of scientific publications authored by researchers from Portuguese institutions. It therefore needs to identify each author name in a publication and correctly associate it with a researcher in the database. Author identification is full of uncertainty: author names are not unique, and name variations arise from abbreviations, permutations, accents, hyphens, typos, cultural naming conventions, and so on, which makes it difficult to tell apart authors who share the same name. Author identification is a special case of identity uncertainty, which arises when objects are not labelled with unique identifiers.
The Authenticus approach to author identification is optimized for contextually rich datasets and was proposed by Sylwia Bugla and Fernando Silva. From each publication record, the algorithm uses author names, emails, institutional addresses, journals, subject categories, and keywords. In our algorithm, authors can be fully identified by their email or by their names. Name identification relies on string matching methods and on attribute verification methods to filter and validate the set of successful matches between the name of the publication author and the full names of researchers. The goal is to reduce this set to a single match. The verification rules, which are based on name matching rules, institutional addresses, publication journals, subject areas, publication name and co-author analysis, are decisive in associating an author correctly and uniquely with a researcher name. If the validation is not sufficient to fully identify the author, partial information on the identification is kept for future runs.
The process of identifying an author in a publication record is structured in three main modules that are executed in succession: data acquisition and pre-processing; author identification; and results analysis and data storage.
The data acquisition and pre-processing module parses and standardizes some fields of the publication metadata and gathers the existing contextual information about the record that the author identification steps require.
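As an illustration of the kind of standardization this module might perform, the following minimal Python sketch normalizes an author name so that accents, hyphens and punctuation no longer prevent two spellings from matching. The function name normalize_name and the exact rules are assumptions made for this example; the text does not specify which fields Authenticus standardizes or how.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Hypothetical normalizer: strip accents, unify separators, lower-case."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Treat hyphens, periods and commas as plain separators.
    cleaned = re.sub(r"[-.,]", " ", no_accents)
    # Collapse runs of whitespace and lower-case the result.
    return " ".join(cleaned.lower().split())
```

With these rules, "João Sá-Pereira" and "Joao Sa Pereira" normalize to the same string. Abbreviations and permutations are not handled here and would need the name matching rules described in the next module.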
The main module of the algorithm, author identification, analyses each author of the publication to determine the author's correct identity. In the first step, the algorithm tries to establish a correspondence between authors and emails in the publication record. To connect emails and authors, it uses the information about researchers stored in the database. Because email addresses are unique, a successful match fully identifies an author.
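The email lookup can be pictured with the short Python sketch below. It assumes a hypothetical in-memory index email_index that maps known email addresses to researcher identifiers; in Authenticus this information comes from the researchers database, and the step that pairs each matched email with a specific author position in the record is left out here.

```python
def match_emails(record_emails, email_index):
    """Return {email: researcher_id} for each record email that is already known.

    record_emails: emails found in the publication record.
    email_index:   hypothetical mapping from known emails to researcher ids.
    """
    matches = {}
    for raw in record_emails:
        # Compare emails case-insensitively and ignore surrounding whitespace.
        email = raw.strip().lower()
        researcher_id = email_index.get(email)
        if researcher_id is not None:
            matches[email] = researcher_id
    return matches
```

Emails that are not in the index are simply skipped, so the corresponding authors fall through to the name-based identification.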
If email identification is not successful, the algorithm resorts to identification based on the analysis of author names and other attributes of the publication. We first run a name matching procedure that uses exact string matching (direct LIKE SQL queries) and approximate string matching (accomplished with the Java text search engine library Apache Lucene 5) to produce an initial set of researchers who are candidates to be the author being analyzed. This set is first reduced by filtering out names that do not obey our name construction rules. If the set of researchers becomes empty, the author being analyzed is not known in the researchers database and is therefore declared a new researcher, with no further verification needed. If the set of candidates is not empty, the algorithm proceeds with the identification by applying a number of verification procedures and calculating a similarity value between each candidate and the author being analyzed. The similarity value is calculated by weighing the scores obtained by a candidate on each attribute being verified. The weight of each attribute depends on its relevance for disambiguation. For example, co-authorships are weighed more than publication journal names. A candidate is kept in the set only if its similarity score is above a pre-defined minimum threshold. In the end, if the set is reduced to a single candidate, or if it has more than one candidate but only one has a similarity score above the threshold for full identification, then that candidate is assigned as the author of the publication. The logic behind the verification procedures is to assert a relationship between a candidate and the author based on past evidence that relates their publication attributes.
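The scoring and decision logic described above can be sketched in Python as follows. The attribute names, the weights and both threshold values are invented for illustration, since the text gives no concrete numbers; only the ordering (co-authorships weighed more than journal names) comes from the description. The handling of the case in which no candidate passes the minimum threshold is also an assumption.

```python
# Hypothetical weights: co-authorships count more than journal names.
WEIGHTS = {"coauthors": 0.4, "addresses": 0.3, "subjects": 0.2, "journal": 0.1}
MIN_THRESHOLD = 0.3   # assumed minimum similarity to stay in the candidate set
FULL_THRESHOLD = 0.7  # assumed similarity needed for full identification


def similarity(scores):
    """Weighted sum of per-attribute scores, each expected in [0, 1]."""
    return sum(WEIGHTS[attr] * scores.get(attr, 0.0) for attr in WEIGHTS)


def decide(candidates):
    """Return (status, researcher_id) for one publication author.

    candidates: {researcher_id: {attribute: score}} after name filtering.
    """
    if not candidates:
        # No candidate survived name matching: the author is a new researcher.
        return ("new_researcher", None)
    scored = {rid: similarity(s) for rid, s in candidates.items()}
    kept = {rid: v for rid, v in scored.items() if v > MIN_THRESHOLD}
    if len(kept) == 1:
        return ("identified", next(iter(kept)))
    confident = [rid for rid, v in kept.items() if v > FULL_THRESHOLD]
    if len(confident) == 1:
        return ("identified", confident[0])
    # Assumption: anything else stays unresolved and is kept for future runs.
    return ("unresolved", None)
```

For example, a candidate with matching co-authors, addresses and subjects scores about 0.9 and is assigned even if a second candidate with only matching co-authors (0.4) remains in the set. Two candidates that both score 0.4 leave the author unresolved.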
The last module of the algorithm analyses the results and saves all relevant data gathered about the identified authors of the analyzed publication. This includes newly discovered author emails, institutional addresses of authors, new co-author relations, and the journal and subject category of the current publication, if any. The newly stored data is an important resource because it can help to identify authors in other publications, so its correctness is essential.