Skip to content

Feature/uhighlighter #2052

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Apr 13, 2018
Merged

Conversation

idodeclare
Copy link
Contributor

Hello,

Please consider for integration this patch that integrates the Lucene UnifiedHighlighter into OpenGrok's Context [see getContext2()] while preserving for now the original getContext() for falling back when UnifiedHighlighter cannot be used (e.g. when source text has been updated via git-update but Lucene Document postings are not yet updated).

more.jsp and search.jsp both use of getContext2() so their presentations are fully consistent.

OpenGrok's Lucene indexing settings are modified (see OGKTextField) to enable storage of "postings" (i.e., IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS) for use by UnifiedHighlighter so that documents do not have to be re-lexed (as done by PlainLineTokenizer) in order to be highlighted. My testing with some large repos showed only a tiny increase in index size because of this change.

Additionally, for DEFS, tokens must now be indexed with stored term vectors (see OGKTextVecField) or else my testing showed that DEFS multi-term queries (MTQs e.g., prefixes or wildcards) did not operate correctly. This is due to Lucene UnifiedHighlighter currently falling back to ANALYSIS in the presence of MTQs when term vectors are not available. For DEFS, re-analysis from source text does not accord whatsoever with ctags Definitions; but the presence of term vectors makes this work perfectly. (See my long comment in OGKUnifiedHighlighter regarding this issue.)

Since DEFS are a small subset of index data relative to FULL and REFS fields, my testing showed again only a tiny increase in index size due to storage of DEFS term vectors.

This patch also enables optionally the display of additional surrounding lines for each match (as requested in issues #673 and #638). Configuration <contextSurround/> (default 0) specifies the number of lines of leading and trailing context to display surrounding each match.

A new Configuration <contextLimit/> (default 10) specifies the total number of context lines to display for each matching file.

Please see the following new classes that collaborate to produce OpenGrok's style of context under UnifiedHighlighter's coordination:

  • StrictLineBreakIterator : represents a subclass of Java BreakIterator that breaks at standard OpenGrok EOL -- namely \r\n, \n, or \r, which is needed since OpenGrok's context is always w.r.t. whole source-code lines (which may be truncated to fit within the context "window").
  • PhraseHighlight: represents a highlighted phrase within a line.
  • LineHighlight: represents a collection of metadata related to highlighting a single line of code.
  • PassageConverter: represents an object that can translate Lucene Passage instances into indexed LineHighlight instances, taking into account a configurable number of leading and trailing lines of context for each match.
  • ContextFormatter: represents a subclass of Lucene's PassageFormatter that uses OpenGrok's PassageConverter.
  • FormattedLines: represents structured results from ContextFormatter that can be merged with other instances. UnifiedHighlighter returns results separated by field, and OGKUnifiedHighlighter merges them together to return a coherent result for OpenGrok presentation.
  • OGKUnifiedHighlighter: represents a subclass of Lucene UnifiedHighlighter with customizations for OpenGrok.

There are new tests of this functionality. At the highest level, new SearchAndContextFormatterTest tests the indexing, search, and format of the results. (There are many tests at lower levels.)

Thank you.

Also:
- Add Context toggleAlt() so the side effect
  previously implicit in getContext() is now an
  explicit, separate call by clients.
@tarzanek tarzanek added this to the 1.1 milestone Mar 29, 2018
@@ -531,6 +531,9 @@ public static void writeXref(FileAnalyzerFactory factory, Reader in,
args.setProject(project);

FileAnalyzer analyzer = factory.getAnalyzer();
RuntimeEnvironment env = RuntimeEnvironment.getInstance();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is because of testing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While testing the new highlighter, I found that scopes were not working when looking at historical versions of files or when annotating a file — i.e. where this writeXref() is called by list.jsp instead of its using the stored .gz version.

@tarzanek
Copy link
Contributor

what is the difference between CDDL and http://creativecommons.org/licenses/by-sa/3.0/us/ ?
(is CC similar like BSD?)

@@ -50,6 +53,9 @@
*/
public class PlainAnalyzer extends TextAnalyzer {

private static final Logger LOGGER = LoggerFactory.getLogger(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, removed


/**
* Creates a new un-stored instance with Reader value.
* @param name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add comments for the parameters? I know that they are mostly self explaining so you could remove them if you prefer to (that's what I would do). It's just that if they have this format then they do not provide any additional info than the parameters already in code. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok removed

@@ -134,4 +140,51 @@ public void analyze(Document doc, StreamSource src, Writer xrefOut)
}
}
}

private void tryAddingDefs(Document doc, Definitions defs, StreamSource src,
String fullpath) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add 4 additional spaces? The same you do in other places :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no ambiguity here; the method body is separated from this with a line break.

return ExpandTabsReader.wrap(reader, project);
}

private static boolean bytesEqual(byte[] a, byte[] b) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe unused? :) If it is used then you can consider using Arrays.equals().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok removed

* should not have a detrimental impact on overall throughput.
*/
iwc.setCodec(new Lucene70Codec(
Lucene50StoredFieldsFormat.Mode.BEST_COMPRESSION));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lucene usually has some "latest" links pointing to latest available codecs and formats, does it have it for those two too?
the idea I have is that upon every lucene upgrade everything (possible) should get upgraded automatically to latest greatest
(I know, downside is that you need to index from scratch and all the bugs get automatically picked up, but engine upgrade is transparent)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not find any codec class or stored field format enum with "latest" in its name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, we shall stick with this exact matching then, if I will find a way around I will change it to generic to be able to stick to latest lucene (yes, it breaks backwards compatibility, but this is a developer tool, latest greatest is of the essence)

out.write("</td></tr>");
for (Document doc : entry.getValue()) {
for (int docId : entry.getValue()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch and optimalization :)
thank you

/**
* Fall back to the old view, which re-analyzes text using
* PlainLinetokenizer. E.g., when source code is updated (thus
* affecting SHA hashes) but re-indexing is not yet complete.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

old comment? do you use SHA hashes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok fixed

Copy link
Contributor

@ahornace ahornace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! :) It's quite a lot to take all at once. I will try to look at it in more detail later.

@@ -389,6 +398,8 @@ public Configuration() {
setCachePages(5);
setCommandTimeout(600); // 10 minutes
setCompressXref(true);
setContextLimit((short)10);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe consider using a constant? It might be easier to change later. :)

* value.
*/
private static final String NONPOSITIVE_NUMBER_ERROR =
"Invalid value for \"%s\" - \"%s\". Expected value greater than 0";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not very skilled in Java's format but shouldn't the second argument be %d? Since short is being passed there. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the original author that %s is more forgiving for error strings. E.g., if the field were changed to a boxed primitive, the error string would still work (printing null if necessary) but %d might throw an exception.

public void setContextLimit(Short value)
throws IllegalArgumentException {
if (value < 1) {
throw new IllegalArgumentException("`value' is not positive");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use the same apostrophe type? :) Now there is ` and '.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a quoting style I use for object names to distinguish from e.g. character values, '\0', or from other quoted words.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would too prefer to use the same character for quoting.

Copy link
Contributor

@tarzanek tarzanek Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if we don't use any quotes there? ;) (grokking for that sentence on where it comes from becomes easier)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tarzanek, ok removed in this patch

} catch (ParseException ex) {
return fields;
}
String queryString = query.toString("");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use query.toString()? :) It translates to query.toString("") in the inside and is nicer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to use the explicit Lucene API instead of implicitly using the Java override of toString() that itself uses the explicit Lucene API. Moreover, nothing about the documentation of toString() indicates how it calls the Lucene API; you had to look at source or decompile lucene-core I expect.

@@ -263,6 +254,46 @@ public static void prettyPrint(Writer out, SearchHelper sh, int start,
}
}

private static void printPlain(SearchHelper searcHelper,
RuntimeEnvironment env, Document doc, int docId, Writer out,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please consider somehow combining the method parameters? This violates current checkstyle parameter number setting which is set to default value, right now it is 7.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok revised to use a private container object for the arguments to printPlain() that do not change in the calling loop.

* @param morePrefix optional link to more... page
* @param limit a value indicating if the number of matching lines should be
* limited. N.b. unlike
* {@link #getContext(java.io.Reader, java.io.Writer, java.lang.String, java.lang.String, java.lang.String, org.opensolaris.opengrok.analysis.Definitions, boolean, boolean, java.util.List, org.opensolaris.opengrok.analysis.Scopes)},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe break the line so it won't be that long? I do not think that it will break the javadoc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is auto-generated JavaDoc, so I prefer not to insert a break into it. I do keep it on its own line unchanged (except for a single comma appended for readability) and break the surrounding text.

try {
List<String> fieldList = qbuilder.getContextFields();
String[] fields = new String[fieldList.size()];
fieldList.toArray(fields);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please consider using this code: String[] fields = fieldList.toArray(new String[fieldList.size()]); ? I think this might be a little dangerous since it relies heavily on the size of the fields array. If someone would change it then it would break.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok revised

bld.append(HtmlConsts.BR);
bld.append("\n");
res.setFooter(bld.toString());
bld.setLength(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bld is local variable so it is eligible for garbage collection so this line is redundant, right? :)

* @throws NoSuchElementException if the instance is empty
*/
public String pop() {
String value = lines.remove(lines.lastKey());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return could be simplified: return lines.remove(lines.lastKey()); :)

for (String line : lines.values()) {
bld.append(line);
}
String f = footer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe remove the variable f? I think it is nicer to have it just as:

if (footer != null && limited) {
    bld.append(footer);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copying to a local is a habitual style I use when a mutable field is checked for null in order to make impossible a race condition intra-method w.r.t. the field value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there are concerns about race condition could it be solved otherwise?
e.g. sync on setter?
I was under the impression this object would be thread local when used ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well my point is that my making a local copy is rationalized — i.e. it's a pattern I do recommend generally when nullably-mutable fields are involved so that you can have objects that do not use heavyweight synchronization but also that are not inconsistent intra-method.

A local copy is also valuable in some synchronized settings — e.g., where you have field data that strictly evolves lazily from null to non-null (as in e.g. a singleton) and a volatile field is properly used. In that case too, it's better to make a local copy of the field so that for the predominant case happening after initialization (i.e., when non-null), only a single atomic read is required w.r.t. to field.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, makes sense

@idodeclare
Copy link
Contributor Author

idodeclare commented Mar 29, 2018

what is the difference between CDDL and http://creativecommons.org/licenses/by-sa/3.0/us/ ?
(is CC similar like BSD?)

Yes, CC is a permissive license like BSD which is compatible for use in a CDDL project because CDDL licenses each file individually so that its files can be kept separate from other files with their licenses (permissive or even proprietary) — and neither CDDL nor CC (nor BSD, Apache etc.) restricts the license of the executable version (as does unlike GPL).

CC was applicable to those two new classes, OGKTextField and OGKTextVecField, which I imported in a slightly-transformed way from the original source. CC requires attribution, a license link, an indication of any revision, and redistribution of source must be under CC — but otherwise there are no restrictions.

} catch (ClassNotFoundException ex) {
// This is not expected, so translate to RuntimeException.
throw new RuntimeException(ex);
int nres = top.totalHits > n ? n : (int)top.totalHits;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for this update, it will make our life easier with new releases of lucene, where they prefer this new approach to fetching results

/**
* Sets the right elide value.
*/
public void setRelide(int value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if we should move the project towards scala or kotlin to make the data objects less verbose

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

come on, less typing ;)
(but then one needs to get ready for some code readability changes - for good or for bad)

* REFS should be approximately fine with re-analysis using an
* on-the-fly PlainAnalyzer. It might not accord with the true
* language symbol tokenizer, but it should not be wildly
* divergent.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we inject the language symbol tokenizer somehow?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(fwiw we already do that when doing analysis, it's a bit hacky since lucene defaulted to just one default analyzer ( see https://github.com/oracle/opengrok/blob/master/src/org/opensolaris/opengrok/analysis/plain/PlainAnalyzer.java#L98 ) )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK after studying the Lucene UnifiedHighlighter implementation, I pushed an arcane override of getIndexAnalyzer() when OpenGrok's own highlightFieldsUnion() is executing so that an OpenGrok TYPE-specific analyzer is returned when re-ANALYSIS is being done. (I also updated those block comments.)


public static final String BR = "<br/>";

public static final String HELLIP = "&hellip;";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:-D

@@ -520,7 +536,7 @@ public SearchHelper prepareSummary() {
return this;
}
try {
sourceContext = new Context(query, builder.getQueries());
sourceContext = new Context(query, builder);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

especially curious if regexp queries will work

@tarzanek
Copy link
Contributor

tarzanek commented Apr 8, 2018

thank you very much @idodeclare , i don't have any other comments or musings now
so I'd say +1
@vladak @tulinkry @Orviss is it OK to merge?

@tarzanek
Copy link
Contributor

tarzanek commented Apr 9, 2018

@idodeclare could you please ping me a mail to tarzanek at gmail.com with your email address? Since @Orviss will be updating the build we will start using Idea for development, folks from Jetbrains gave us licenses for ultimate, so if you are interested you can have one license ... (the plan is to have the maven used from Idea for development)

Copy link
Contributor

@tulinkry tulinkry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copyright issues

@@ -19,7 +19,7 @@ information: Portions Copyright [yyyy] [name of copyright owner]
CDDL HEADER END

Copyright (c) 2010, 2017, Oracle and/or its affiliates. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy

@@ -19,6 +19,7 @@ information: Portions Copyright [yyyy] [name of copyright owner]
CDDL HEADER END

Copyright (c) 2010, 2017, Oracle and/or its affiliates. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy

@@ -19,7 +19,7 @@

/*
* Copyright (c) 2005, 2017, Oracle and/or its affiliates. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy

@@ -378,8 +378,8 @@ private static String cutPattern(String tagLine, int startTab, int endTab) {
* Adds a tag to a {@code Definitions} instance.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy 2018

@@ -19,20 +19,22 @@

/*
* Copyright (c) 2005, 2013, Oracle and/or its affiliates. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy

@@ -20,14 +20,15 @@
/*
* Copyright (c) 2011, 2017, Oracle and/or its affiliates. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy

@@ -83,32 +83,32 @@ public void testIsEmpty() throws ParseException {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

old copyright

*/

/*
* Copyright (c) 2008, 2017, Oracle and/or its affiliates. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrong copyright

*/

/*
* Copyright (c) 2008, 2017, Oracle and/or its affiliates. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrong copyright

@@ -19,8 +19,8 @@ information: Portions Copyright [yyyy] [name of copyright owner]
CDDL HEADER END

Copyright (c) 2010, 2017, Oracle and/or its affiliates. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy

@idodeclare
Copy link
Contributor Author

copyright issues

Hi @tulinkry , I do not work for Oracle, so if I were to modify Oracle's notices, it would violate the following part of CDDL section 3.3:

You may
not remove or alter any copyright, patent or trademark notices
contained within the Covered Software, or any notices of licensing
or any descriptive text giving attribution to any Contributor or
the Initial Developer.

The proscription about altering copyright is also in effect when a CDDL-licensed file is copied and altered, as I did with SearchAndContextFormatterTest.java from SearchEngineTest.java (with a JavaDoc comment "Derived from Trond Norbye's SearchEngineTest"). The copyright notices that are not my own are identical between those files.

All the files altered or created by me in this patch include a notice satisfying another part of CDDL section 3.3:

You must include a notice in each of Your Modifications that
identifies You as the Contributor of the Modification

I don't update that manually. I use a script I have, update_cddl.pl, as in e.g.:

    $ git diff master --name-only | xargs update_cddl.pl -P

(-P to use "Portions Copyright" language as appropriate.)

If you like I can post it for use by others. Oracle contributors would run as e.g.:

    $ git diff master --name-only | \
        CDDL_ORG_IDENTITY="Oracle and/or its affiliates" xargs update_cddl.pl -P -r

(-r additionally because Oracle's notices include "All rights reserved.")

@tarzanek
Copy link
Contributor

thank you Chris, merging, let's play with this and file issues as we go

@tarzanek tarzanek merged commit bfaf601 into oracle:master Apr 13, 2018
@idodeclare idodeclare deleted the feature/uhighlighter branch April 22, 2018 22:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants