-
Notifications
You must be signed in to change notification settings - Fork 778
Feature/uhighlighter #2052
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/uhighlighter #2052
Conversation
Also: - Add Context toggleAlt() so the side effect previously implicit in getContext() is now an explicit, separate call by clients.
@@ -531,6 +531,9 @@ public static void writeXref(FileAnalyzerFactory factory, Reader in, | |||
args.setProject(project); | |||
|
|||
FileAnalyzer analyzer = factory.getAnalyzer(); | |||
RuntimeEnvironment env = RuntimeEnvironment.getInstance(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is because of testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While testing the new highlighter, I found that scopes were not working when looking at historical versions of files or when annotating a file — i.e. where this writeXref()
is called by list.jsp
instead of its using the stored .gz version.
what is the difference between CDDL and http://creativecommons.org/licenses/by-sa/3.0/us/ ? |
@@ -50,6 +53,9 @@ | |||
*/ | |||
public class PlainAnalyzer extends TextAnalyzer { | |||
|
|||
private static final Logger LOGGER = LoggerFactory.getLogger( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, removed
|
||
/** | ||
* Creates a new un-stored instance with Reader value. | ||
* @param name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add comments for the parameters? I know that they are mostly self explaining so you could remove them if you prefer to (that's what I would do). It's just that if they have this format then they do not provide any additional info than the parameters already in code. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok removed
@@ -134,4 +140,51 @@ public void analyze(Document doc, StreamSource src, Writer xrefOut) | |||
} | |||
} | |||
} | |||
|
|||
private void tryAddingDefs(Document doc, Definitions defs, StreamSource src, | |||
String fullpath) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add 4 additional spaces? The same you do in other places :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no ambiguity here; the method body is separated from this with a line break.
return ExpandTabsReader.wrap(reader, project); | ||
} | ||
|
||
private static boolean bytesEqual(byte[] a, byte[] b) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe unused? :) If it is used then you can consider using Arrays.equals()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok removed
* should not have a detrimental impact on overall throughput. | ||
*/ | ||
iwc.setCodec(new Lucene70Codec( | ||
Lucene50StoredFieldsFormat.Mode.BEST_COMPRESSION)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lucene usually has some "latest" links pointing to latest available codecs and formats, does it have it for those two too?
the idea I have is that upon every lucene upgrade everything (possible) should get upgraded automatically to latest greatest
(I know, downside is that you need to index from scratch and all the bugs get automatically picked up, but engine upgrade is transparent)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not find any codec class or stored field format enum with "latest" in its name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, we shall stick with this exact matching then, if I will find a way around I will change it to generic to be able to stick to latest lucene (yes, it breaks backwards compatibility, but this is a developer tool, latest greatest is of the essence)
out.write("</td></tr>"); | ||
for (Document doc : entry.getValue()) { | ||
for (int docId : entry.getValue()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch and optimalization :)
thank you
/** | ||
* Fall back to the old view, which re-analyzes text using | ||
* PlainLinetokenizer. E.g., when source code is updated (thus | ||
* affecting SHA hashes) but re-indexing is not yet complete. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
old comment? do you use SHA hashes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! :) It's quite a lot to take all at once. I will try to look at it in more detail later.
@@ -389,6 +398,8 @@ public Configuration() { | |||
setCachePages(5); | |||
setCommandTimeout(600); // 10 minutes | |||
setCompressXref(true); | |||
setContextLimit((short)10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe consider using a constant? It might be easier to change later. :)
* value. | ||
*/ | ||
private static final String NONPOSITIVE_NUMBER_ERROR = | ||
"Invalid value for \"%s\" - \"%s\". Expected value greater than 0"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not very skilled in Java's format but shouldn't the second argument be %d
? Since short
is being passed there. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the original author that %s
is more forgiving for error strings. E.g., if the field were changed to a boxed primitive, the error string would still work (printing null
if necessary) but %d
might throw an exception.
public void setContextLimit(Short value) | ||
throws IllegalArgumentException { | ||
if (value < 1) { | ||
throw new IllegalArgumentException("`value' is not positive"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use the same apostrophe type? :) Now there is `
and '
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a quoting style I use for object names to distinguish from e.g. character values, '\0'
, or from other quoted words.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would too prefer to use the same character for quoting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if we don't use any quotes there? ;) (grokking for that sentence on where it comes from becomes easier)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tarzanek, ok removed in this patch
} catch (ParseException ex) { | ||
return fields; | ||
} | ||
String queryString = query.toString(""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use query.toString()
? :) It translates to query.toString("")
in the inside and is nicer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to use the explicit Lucene API instead of implicitly using the Java override of toString()
that itself uses the explicit Lucene API. Moreover, nothing about the documentation of toString()
indicates how it calls the Lucene API; you had to look at source or decompile lucene-core
I expect.
@@ -263,6 +254,46 @@ public static void prettyPrint(Writer out, SearchHelper sh, int start, | |||
} | |||
} | |||
|
|||
private static void printPlain(SearchHelper searcHelper, | |||
RuntimeEnvironment env, Document doc, int docId, Writer out, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please consider somehow combining the method parameters? This violates current checkstyle parameter number setting which is set to default value, right now it is 7
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok revised to use a private container object for the arguments to printPlain()
that do not change in the calling loop.
* @param morePrefix optional link to more... page | ||
* @param limit a value indicating if the number of matching lines should be | ||
* limited. N.b. unlike | ||
* {@link #getContext(java.io.Reader, java.io.Writer, java.lang.String, java.lang.String, java.lang.String, org.opensolaris.opengrok.analysis.Definitions, boolean, boolean, java.util.List, org.opensolaris.opengrok.analysis.Scopes)}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe break the line so it won't be that long? I do not think that it will break the javadoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is auto-generated JavaDoc, so I prefer not to insert a break into it. I do keep it on its own line unchanged (except for a single comma appended for readability) and break the surrounding text.
try { | ||
List<String> fieldList = qbuilder.getContextFields(); | ||
String[] fields = new String[fieldList.size()]; | ||
fieldList.toArray(fields); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please consider using this code: String[] fields = fieldList.toArray(new String[fieldList.size()]);
? I think this might be a little dangerous since it relies heavily on the size of the fields
array. If someone would change it then it would break.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok revised
bld.append(HtmlConsts.BR); | ||
bld.append("\n"); | ||
res.setFooter(bld.toString()); | ||
bld.setLength(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bld
is local variable so it is eligible for garbage collection so this line is redundant, right? :)
* @throws NoSuchElementException if the instance is empty | ||
*/ | ||
public String pop() { | ||
String value = lines.remove(lines.lastKey()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return could be simplified: return lines.remove(lines.lastKey());
:)
for (String line : lines.values()) { | ||
bld.append(line); | ||
} | ||
String f = footer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe remove the variable f
? I think it is nicer to have it just as:
if (footer != null && limited) {
bld.append(footer);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copying to a local is a habitual style I use when a mutable field is checked for null
in order to make impossible a race condition intra-method w.r.t. the field value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there are concerns about race condition could it be solved otherwise?
e.g. sync on setter?
I was under the impression this object would be thread local when used ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well my point is that my making a local copy is rationalized — i.e. it's a pattern I do recommend generally when nullably-mutable fields are involved so that you can have objects that do not use heavyweight synchronization but also that are not inconsistent intra-method.
A local copy is also valuable in some synchronized settings — e.g., where you have field data that strictly evolves lazily from null
to non-null
(as in e.g. a singleton) and a volatile
field is properly used. In that case too, it's better to make a local copy of the field so that for the predominant case happening after initialization (i.e., when non-null
), only a single atomic read is required w.r.t. to field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, makes sense
Yes, CC is a permissive license like BSD which is compatible for use in a CDDL project because CDDL licenses each file individually so that its files can be kept separate from other files with their licenses (permissive or even proprietary) — and neither CDDL nor CC (nor BSD, Apache etc.) restricts the license of the executable version ( CC was applicable to those two new classes, |
} catch (ClassNotFoundException ex) { | ||
// This is not expected, so translate to RuntimeException. | ||
throw new RuntimeException(ex); | ||
int nres = top.totalHits > n ? n : (int)top.totalHits; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for this update, it will make our life easier with new releases of lucene, where they prefer this new approach to fetching results
/** | ||
* Sets the right elide value. | ||
*/ | ||
public void setRelide(int value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if we should move the project towards scala or kotlin to make the data objects less verbose
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
come on, less typing ;)
(but then one needs to get ready for some code readability changes - for good or for bad)
* REFS should be approximately fine with re-analysis using an | ||
* on-the-fly PlainAnalyzer. It might not accord with the true | ||
* language symbol tokenizer, but it should not be wildly | ||
* divergent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we inject the language symbol tokenizer somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(fwiw we already do that when doing analysis, it's a bit hacky since lucene defaulted to just one default analyzer ( see https://github.com/oracle/opengrok/blob/master/src/org/opensolaris/opengrok/analysis/plain/PlainAnalyzer.java#L98 ) )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK after studying the Lucene UnifiedHighlighter
implementation, I pushed an arcane override of getIndexAnalyzer()
when OpenGrok's own highlightFieldsUnion()
is executing so that an OpenGrok TYPE
-specific analyzer is returned when re-ANALYSIS
is being done. (I also updated those block comments.)
|
||
public static final String BR = "<br/>"; | ||
|
||
public static final String HELLIP = "…"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:-D
@@ -520,7 +536,7 @@ public SearchHelper prepareSummary() { | |||
return this; | |||
} | |||
try { | |||
sourceContext = new Context(query, builder.getQueries()); | |||
sourceContext = new Context(query, builder); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
especially curious if regexp queries will work
thank you very much @idodeclare , i don't have any other comments or musings now |
@idodeclare could you please ping me a mail to tarzanek at gmail.com with your email address? Since @Orviss will be updating the build we will start using Idea for development, folks from Jetbrains gave us licenses for ultimate, so if you are interested you can have one license ... (the plan is to have the maven used from Idea for development) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copyright issues
@@ -19,7 +19,7 @@ information: Portions Copyright [yyyy] [name of copyright owner] | |||
CDDL HEADER END | |||
|
|||
Copyright (c) 2010, 2017, Oracle and/or its affiliates. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy
@@ -19,6 +19,7 @@ information: Portions Copyright [yyyy] [name of copyright owner] | |||
CDDL HEADER END | |||
|
|||
Copyright (c) 2010, 2017, Oracle and/or its affiliates. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy
@@ -19,7 +19,7 @@ | |||
|
|||
/* | |||
* Copyright (c) 2005, 2017, Oracle and/or its affiliates. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy
@@ -378,8 +378,8 @@ private static String cutPattern(String tagLine, int startTab, int endTab) { | |||
* Adds a tag to a {@code Definitions} instance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy 2018
@@ -19,20 +19,22 @@ | |||
|
|||
/* | |||
* Copyright (c) 2005, 2013, Oracle and/or its affiliates. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy
@@ -20,14 +20,15 @@ | |||
/* | |||
* Copyright (c) 2011, 2017, Oracle and/or its affiliates. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy
@@ -83,32 +83,32 @@ public void testIsEmpty() throws ParseException { | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
old copyright
*/ | ||
|
||
/* | ||
* Copyright (c) 2008, 2017, Oracle and/or its affiliates. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrong copyright
*/ | ||
|
||
/* | ||
* Copyright (c) 2008, 2017, Oracle and/or its affiliates. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrong copyright
@@ -19,8 +19,8 @@ information: Portions Copyright [yyyy] [name of copyright owner] | |||
CDDL HEADER END | |||
|
|||
Copyright (c) 2010, 2017, Oracle and/or its affiliates. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy
Hi @tulinkry , I do not work for Oracle, so if I were to modify Oracle's notices, it would violate the following part of CDDL section 3.3:
The proscription about altering copyright is also in effect when a CDDL-licensed file is copied and altered, as I did with All the files altered or created by me in this patch include a notice satisfying another part of CDDL section 3.3:
I don't update that manually. I use a script I have,
( If you like I can post it for use by others. Oracle contributors would run as e.g.:
( |
…y commit)
thank you Chris, merging, let's play with this and file issues as we go |
Hello,
Please consider for integration this patch that integrates the Lucene
UnifiedHighlighter
into OpenGrok'sContext
[seegetContext2()
] while preserving for now the originalgetContext()
for falling back whenUnifiedHighlighter
cannot be used (e.g. when source text has been updated viagit-update
but LuceneDocument
postings are not yet updated).more.jsp
andsearch.jsp
both use ofgetContext2()
so their presentations are fully consistent.OpenGrok's Lucene indexing settings are modified (see
OGKTextField
) to enable storage of "postings" (i.e.,IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
) for use byUnifiedHighlighter
so that documents do not have to be re-lexed (as done byPlainLineTokenizer
) in order to be highlighted. My testing with some large repos showed only a tiny increase in index size because of this change.Additionally, for DEFS, tokens must now be indexed with stored term vectors (see
OGKTextVecField
) or else my testing showed that DEFS multi-term queries (MTQs e.g., prefixes or wildcards) did not operate correctly. This is due to LuceneUnifiedHighlighter
currently falling back toANALYSIS
in the presence of MTQs when term vectors are not available. For DEFS, re-analysis from source text does not accord whatsoever with ctagsDefinitions
; but the presence of term vectors makes this work perfectly. (See my long comment inOGKUnifiedHighlighter
regarding this issue.)Since DEFS are a small subset of index data relative to FULL and REFS fields, my testing showed again only a tiny increase in index size due to storage of DEFS term vectors.
This patch also enables optionally the display of additional surrounding lines for each match (as requested in issues #673 and #638).
Configuration
<contextSurround/>
(default 0) specifies the number of lines of leading and trailing context to display surrounding each match.A new
Configuration
<contextLimit/>
(default 10) specifies the total number of context lines to display for each matching file.Please see the following new classes that collaborate to produce OpenGrok's style of context under
UnifiedHighlighter
's coordination:StrictLineBreakIterator
: represents a subclass of JavaBreakIterator
that breaks at standard OpenGrok EOL -- namely\r\n
,\n
, or\r
, which is needed since OpenGrok's context is always w.r.t. whole source-code lines (which may be truncated to fit within the context "window").PhraseHighlight
: represents a highlighted phrase within a line.LineHighlight
: represents a collection of metadata related to highlighting a single line of code.PassageConverter
: represents an object that can translate LucenePassage
instances into indexedLineHighlight
instances, taking into account a configurable number of leading and trailing lines of context for each match.ContextFormatter
: represents a subclass of Lucene'sPassageFormatter
that uses OpenGrok'sPassageConverter
.FormattedLines
: represents structured results fromContextFormatter
that can be merged with other instances.UnifiedHighlighter
returns results separated by field, andOGKUnifiedHighlighter
merges them together to return a coherent result for OpenGrok presentation.OGKUnifiedHighlighter
: represents a subclass of LuceneUnifiedHighlighter
with customizations for OpenGrok.There are new tests of this functionality. At the highest level, new
SearchAndContextFormatterTest
tests the indexing, search, and format of the results. (There are many tests at lower levels.)Thank you.