Full-text search

Enabling search across all Deconst-indexed content has been a major goal of this project since we started it back in March, but for some reason we never made a proper issue for it. We already have an Elasticsearch instance in place within our architecture for log aggregation, so it's a question of correctly storing content as it arrives, extending the content service API with endpoints to support search, then consuming them within the presenter to provide a place for sites to hook in search queries and results.

**I. First, we'll need to index incoming metadata envelopes in Elasticsearch.** This could be as simple as jamming them as-is into Elasticsearch, but I'm guessing that we'd get better results if we mutate them somewhat for optimal indexing. I predict many trips to [the Elasticsearch docs](https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html).
- Do we need to scrub HTML markup from the body before indexing?
- We should support some kind of well-defined "meta" tags for explicit keyword indexing. We could then reuse these for SEO.
- We'll need to be certain that we delete stale information from the search index when new content arrives with the same ID.
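
To make the three points above concrete, here is a minimal sketch of preparing an envelope for indexing. It is only an illustration under assumptions: the `title` and `keywords` envelope fields, the `envelopes` index name, the stored `content_id` field, and both function names are hypothetical. The request path follows the `PUT /<index>/_doc/<id>` shape of recent Elasticsearch versions (older versions put a type name in the path). Using the content ID as the document ID means a re-submitted envelope replaces the previous document rather than leaving a stale copy beside it.

```python
from html.parser import HTMLParser
from urllib.parse import quote


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space so adjacent block elements do not run
        # together, then collapse the whitespace left behind by the markup.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def build_index_request(content_id, envelope, index="envelopes"):
    """Return (method, path, document) for an Elasticsearch index request.

    Assumption: the envelope is a dict with an HTML "body" and optional
    "title" and "keywords" entries. Only "body" is named in this issue.
    """
    document = {
        "content_id": content_id,
        "title": envelope.get("title", ""),
        "body": strip_html(envelope.get("body", "")),
        "keywords": envelope.get("keywords", []),
    }
    # Content IDs may contain slashes and colons, so percent-encode them
    # before using them as the document ID in the request path.
    path = "/{}/_doc/{}".format(index, quote(content_id, safe=""))
    return "PUT", path, document
```

Whether scrubbing markup in the content service beats letting an Elasticsearch analyzer do it is still an open question; this just shows that the scrubbing itself is cheap.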

**II. Then we need to design a content service API to submit searches and retrieve results.** We _could_ accept Lucene queries and pass them directly to Elasticsearch, but once again, I suspect that there's a better way.
- Search results will be returned in terms of content IDs, because the content service has no knowledge of the control repository.
- We'll likely want to include additional information about each match, like preview snippets with offset data to highlight matching text.
- Ordering of search results is important. I'm guessing that Elasticsearch gives us the knobs to tune ranking, but first we'll need to decide what ordering we actually want.
- We may want the ability to optionally constrain results to a specific domain.
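
As a rough sketch of the alternative to passing raw Lucene queries through, the content service could build the Elasticsearch query itself from a few plain parameters and return a trimmed-down result shape. Everything named here is an assumption for illustration: the function names, the `title`, `keywords`, `body`, and `content_id` fields, the boost values, and the idea of expressing a domain constraint as a content ID prefix (the content service does not know about domains, so the caller would have to supply that prefix). The query uses the standard `bool`, `multi_match`, `prefix`, and `highlight` parts of the Elasticsearch query DSL.

```python
def build_search_body(query, page=1, per_page=10, id_prefix=None):
    """Build an Elasticsearch search request body for a plain-text query."""
    page = max(page, 1)
    bool_query = {
        "must": [{
            "multi_match": {
                "query": query,
                # Hypothetical boosts: matches in titles and explicit
                # keywords count double. This is one of the ranking knobs.
                "fields": ["title^2", "keywords^2", "body"],
            }
        }]
    }
    if id_prefix is not None:
        # Optional constraint, assuming documents store their content ID
        # in an exact-match "content_id" field.
        bool_query["filter"] = [{"prefix": {"content_id": id_prefix}}]
    return {
        "query": {"bool": bool_query},
        "from": (page - 1) * per_page,
        "size": per_page,
        # Ask for highlighted fragments of the body for preview snippets.
        "highlight": {"fields": {"body": {}}},
    }


def shape_results(es_response):
    """Reduce an Elasticsearch search response to content IDs plus extras."""
    hits = es_response.get("hits", {}).get("hits", [])
    return [
        {
            "contentID": hit["_id"],
            "score": hit.get("_score"),
            "snippets": hit.get("highlight", {}).get("body", []),
        }
        for hit in hits
    ]
```

Note that Elasticsearch highlighting returns fragments with the matched terms wrapped in tags rather than numeric offsets, so "offset data" would need either post-processing or a different snippet format.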

**III. Finally, we need to figure out how to render search results within a page.** This is a little more involved than it seems because the content service talks entirely in terms of content IDs, while to be useful we need search results as presented URLs. I can see two possible directions:
1. Flag certain pages as "search result pages" somehow (in the control repo, possibly?). Read standard query parameters from URLs on those pages (`?q=term&page=3&perPage=10`) and submit a query to the content service. Translate result objects from content ID to presented URL based on the current content mapping. Then, either: _(a)_ inject the result objects into the envelope's body HTML with a content filter, or _(b)_ make the result array available to the Nunjucks templates as a variable.
2. Expose a _presenter_ API endpoint that sends queries to the content service, maps the content IDs to presented URLs, and returns results as a JSON object. Consume it entirely from client-side JavaScript. This would make it somewhat more difficult to incorporate within each domain, but it would let us do things like show a loading spinner while the search runs.
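
Both directions share the same two steps: reading the query parameters and translating content IDs into presented URLs. Here is a small sketch of just those steps, written in Python for brevity rather than as presenter code. The function names, the shape of the content mapping (a dict from presented URL prefix to content ID prefix), and the example URLs are assumptions for illustration only.

```python
from urllib.parse import parse_qs, urlparse


def _positive_int(value, default):
    """Parse a positive integer, falling back to the default on bad input."""
    try:
        number = int(value)
    except ValueError:
        return default
    return number if number >= 1 else default


def parse_search_params(url, default_per_page=10):
    """Extract (q, page, per_page) from a URL like ?q=term&page=3&perPage=10."""
    params = parse_qs(urlparse(url).query)
    q = params.get("q", [""])[0]
    page = _positive_int(params.get("page", ["1"])[0], 1)
    per_page = _positive_int(
        params.get("perPage", [str(default_per_page)])[0], default_per_page
    )
    return q, page, per_page


def to_presented_url(content_id, mapping):
    """Map a content ID to a presented URL, or None if nothing maps it.

    Assumption: mapping is {presented URL prefix: content ID prefix}.
    The longest matching content ID prefix wins.
    """
    best = None
    for url_prefix, id_prefix in mapping.items():
        if content_id.startswith(id_prefix):
            if best is None or len(id_prefix) > len(best[1]):
                best = (url_prefix, id_prefix)
    if best is None:
        return None
    return best[0] + content_id[len(best[1]):]


def map_results(results, mapping):
    """Attach presented URLs to results, dropping any that are unmapped."""
    mapped = []
    for result in results:
        url = to_presented_url(result["contentID"], mapping)
        if url is not None:
            mapped.append(dict(result, url=url))
    return mapped
```

Dropping unmapped results is a design choice worth debating: content that is indexed but not currently mapped to any presented URL has nowhere to link to, but silently removing it also makes page sizes uneven.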

I don't really have any experience designing search systems, so any input from those who do is most welcome :bow:

@steveortiz, @j12y: Please weigh in if the KC has any additional requirements for this.

/cc @kenperkins @jnoller


Full-text search #187
