
expand scope of arxiv identifier matcher, and fix some training data annotations #858

Open
wants to merge 3 commits into
base: master
expand scope of arxiv identifier matcher
The simple part of this is allowing 'arxiv:' in addition to 'arXiv:'.

The more complex second part is to conservatively match "old" (pre-2008)
style identifiers which do not have a prefix. The conservative matching
is because there is less confidence that a string is actually an arxiv
identifier without the prefix. Explicit collection prefixes are included
(for those that existed pre-2008), internal whitespace is not allowed,
and the identifier must be separated from other alphabetic strings.
bnewbold committed Nov 13, 2021
commit 26204e884485f51a4f7e5e7808464dd9431acf9c
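
To make the three pattern types in the commit message concrete, here is a minimal sketch that runs the pattern from this commit over one example of each style. The class name `ArxivPatternDemo` and the helper `findAll` are hypothetical and not part of GROBID; only the regular expression is copied from the diff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex string comes from this commit.
class ArxivPatternDemo {
    static final Pattern ARXIV = Pattern.compile(
        "(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s??\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)"
        + "|(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]{3,16}\\s?/\\s?\\d{7}(v\\d+)?)"
        + "|([^a-zA-Z](math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)"
        + "[a-zA-Z\\-\\.]*/\\d{7}(v\\d+)?)");

    // Collect every full match of the pattern in the given text.
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = ARXIV.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // "new style" with a lowercase prefix
        System.out.println(findAll("see arxiv: 0706.0002"));              // [arxiv: 0706.0002]
        // "old style" with prefix
        System.out.println(findAll("arXiv : hep-th/9901001v2"));          // [arXiv : hep-th/9901001v2]
        // "old style" without prefix: the match includes the preceding non-letter
        System.out.println(findAll("J. March-Russell, hep-ph/9610479.")); // [ hep-ph/9610479]
        // a bare identifier at the very start of the text is not matched
        System.out.println(findAll("hep-lat/0509026"));                   // []
    }
}
```

Note that in the prefix-less case the match starts one character early, because the leading `[^a-zA-Z]` consumes the separator.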
@@ -58,8 +58,12 @@ public class TextUtilities {

// a regular expression for arXiv identifiers
// see https://arxiv.org/help/arxiv_identifier and https://arxiv.org/help/arxiv_identifier_for_services
+ // three pattern types are allowed, here are examples of each
+ // "new style" with prefix: 'arXiv:0706.0002v3', 'arxiv: 0706.0002'
+ // "old style" with prefix: 'arXiv : hep-th/9901001v2', 'arxiv:hep-th/ 9901001'
+ // "old style" without prefix (strict): 'hep-th/9901001v2', 'math/9901001'
static public final Pattern arXivPattern = Pattern
- .compile("(arXiv\\s?(\\.org)?\\s?\\:\\s?\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(arXiv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]*\\s?/\\s?\\d{7}(v\\d+)?)");
+ .compile("(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s??\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]{3,16}\\s?/\\s?\\d{7}(v\\d+)?)|([^a-zA-Z](math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)[a-zA-Z\\-\\.]*/\\d{7}(v\\d+)?)");
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to https://github.com/mattbierbaum/arxiv-public-datasets/blob/master/arxiv_public_data/regex_arxiv.py#L12 there are many more possible categories. As with the sub-categories, we might simply want to allow a free range of characters?
e.g. instead of

(math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)[a-zA-Z\\-\\.]*

having simply:

[a-zA-Z\\-\\.]*

I am a bit worried, though, about the robustness of the regex on ill-formed input with respect to catastrophic backtracking, as we have seen elsewhere. Maybe the best alternative would be to enumerate all the possibilities?
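
As a rough illustration of the trade-off discussed here, the sketch below compares the free-range prefix with the enumerated prefix on a real identifier and on an unrelated string. The class and method names are hypothetical, and the enumerated list is the one from this commit, not the complete arXiv category list. It only shows the difference in false positives; it does not measure backtracking cost.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of the two prefix styles discussed in the review.
class CategoryPrefixDemo {
    // Free-range variant: any run of letters, '-' or '.' before the slash.
    static final Pattern FREE_RANGE =
        Pattern.compile("[a-zA-Z\\-\\.]*/\\d{7}(v\\d+)?");

    // Enumerated variant: prefix list copied from the pattern in this commit.
    static final Pattern ENUMERATED = Pattern.compile(
        "(?:math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)"
        + "[a-zA-Z\\-\\.]*/\\d{7}(v\\d+)?");

    // True if the pattern matches anywhere in the text.
    static boolean found(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        // Both variants accept a genuine old-style identifier.
        System.out.println(found(FREE_RANGE, "hep-th/9901001")); // true
        System.out.println(found(ENUMERATED, "hep-th/9901001")); // true
        // Only the free-range variant accepts an arbitrary word before the slash.
        System.out.println(found(FREE_RANGE, "page/1234567"));   // true
        System.out.println(found(ENUMERATED, "page/1234567"));   // false
    }
}
```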


Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing the regex a bit:

  • there is a double ?? in the first component, which I think is a typo?
  • better to avoid a capturing group for the sub-term (math|hep|astro|cond|gr|nucl|quat|stat|ph...); we could use a non-capturing group instead
  • I don't understand the [^a-zA-Z] before the (math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin): it makes hep-lat/0509026 fail, for example, because it expects a character before the arXiv identifier?

Suggested regex:

(ar[xX]iv\s?(\.org)?\s?\:\s?\d{4}\s?\.\s?\d{4,5}(v\d+)?)|(ar[xX]iv\s?(\.org)?\s?\:\s?[ a-zA-Z\-\.]{3,16}\s?/\s?\d{7}(v\d+)?)|((?:math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\-bio|q\-fin)[a-zA-Z\-\.]*/\d{7}(v\d+)?)

In java syntax:

(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s?\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]{3,16}\\s?/\\s?\\d{7}(v\\d+)?)|((?:math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)[a-zA-Z\\-\\.]*/\\d{7}(v\\d+)?)

We then have only one captured group for one arXiv identifier.
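
The difference between the committed and the suggested third alternative can be checked with a small sketch like the following. Only the prefix-less alternative of each regex is used, and the class and helper names are hypothetical. It also shows one side effect of dropping `[^a-zA-Z]`: the suggested form can match inside a longer word.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison of the prefix-less alternative in both versions.
class BareOldStyleDemo {
    // Third alternative as committed: requires a preceding non-letter.
    static final Pattern COMMITTED = Pattern.compile(
        "([^a-zA-Z](math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)"
        + "[a-zA-Z\\-\\.]*/\\d{7}(v\\d+)?)");

    // Third alternative as suggested in the review: non-capturing prefix, no leading character.
    static final Pattern SUGGESTED = Pattern.compile(
        "((?:math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)"
        + "[a-zA-Z\\-\\.]*/\\d{7}(v\\d+)?)");

    // Return the first full match, or null when nothing matches.
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Identifier at the start of the input.
        System.out.println(firstMatch(COMMITTED, "hep-lat/0509026"));     // null
        System.out.println(firstMatch(SUGGESTED, "hep-lat/0509026"));     // hep-lat/0509026
        // The committed form includes the separator in the match.
        System.out.println(firstMatch(COMMITTED, "see hep-lat/0509026")); // " hep-lat/0509026"
        // Without the leading character class, a word ending in "cs" matches too.
        System.out.println(firstMatch(COMMITTED, "topics/0509026"));      // null
        System.out.println(firstMatch(SUGGESTED, "topics/0509026"));      // cs/0509026
    }
}
```

A zero-width check such as a negative lookbehind `(?<![a-zA-Z])` would be one way to keep the separation requirement without consuming a character; that is an assumption on my part, not something proposed in this thread.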

// regular expression for PubMed identifiers, last group gives the PMID digits
static public final Pattern pmidPattern = Pattern.compile("((PMID)|(Pub(\\s)?Med(\\s)?(ID)?))(\\s)?(\\:)?(\\s)*(\\d{1,8})");
@@ -345,6 +345,18 @@ public void testInArXivPatternLayoutToken2() {
assertThat(positions.get(0).end, is(15));
}

+ @Test
+ public void testInArXivPatternLayoutToken3() {
+     String piece = "K.R. Dienes, C. Kolda and J. March-Russell, hep-ph/9610479.";
+     List<LayoutToken> tokens = GrobidAnalyzer.getInstance().tokenizeWithLayoutToken(piece);
+     String text = LayoutTokensUtil.toText(tokens);
+     List<OffsetPosition> positions = target.tokenPositionsArXivPattern(tokens, text);
+
+     assertThat(positions, hasSize(1));
+     assertThat(positions.get(0).start, is(22));
+     assertThat(positions.get(0).end, is(27));
+ }

@Test
public void testInIdentifierPatternLayoutToken() {
String piece = "ATLAS collaboration, Measurements of the Nuclear Modification Factor for Jets in Pb+Pb Collisionsat √ "+
@@ -396,4 +408,4 @@ public void testInEmailPatternLayoutToken() {
assertThat(positions.get(1).start, is(27));
assertThat(positions.get(1).end, is(33));
}
}
}