e621ng / e621ng

e621.net

FR: expressions/grouping in searches

mm12 opened this issue · comments

A widely-requested feature. I might tinker with app/logical/tag_query.rb myself to get partway there. Here are some notes:

  • Avoiding conflicts with tag names containing grouping characters. Some options:
    • Require space around grouping: (~A ~B) does not work, but ( ~A ~B ) does
    • Restrict tag characters (such as disallowing { and } and using them for grouping)
    • Ignore the problem and hope nothing breaks (not recommended)
  • danbooru has an implementation of this in def expr
  • Should modifiers on grouping be allowed? -(A B C) = (-A -B -C) or ~(A B) = (~A ~B) - but what about -(~A ~B) (i.e. the post has to be missing one of A or B)?
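
To make the first option concrete, here is a minimal sketch of a parser that only treats ( and ) as grouping when they stand alone as whitespace-separated tokens. The name parse_groups and the nested-array output are assumptions for illustration; nothing like this exists in app/logical/tag_query.rb as far as this issue states.

```ruby
# Hypothetical sketch: "(" and ")" only count as grouping when they are
# separate whitespace-delimited tokens, so tags like a_(character) pass
# through untouched.
def parse_groups(query)
  stack = [[]] # stack of groups being built; the bottom entry is the top level
  query.split.each do |token|
    case token
    when "("
      stack.push([]) # open a nested group
    when ")"
      raise ArgumentError, "unmatched )" if stack.size == 1
      group = stack.pop
      stack.last << group # attach the finished group to its parent
    else
      stack.last << token # ordinary tag, parentheses and all
    end
  end
  raise ArgumentError, "unclosed (" unless stack.size == 1
  stack.first
end

p parse_groups("( ~a_(character) ~b ) ( ~c ~d ) e")
# => [["~a_(character)", "~b"], ["~c", "~d"], "e"]
```

With this rule, (~A ~B) without spaces is read as the two ordinary tokens "(~A" and "~B)", which matches the "does not work" behaviour described in that option.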

Why: Currently, ORing and NOTing apply to the whole query, which is a huge pain. Grouping lets you do things like (~A ~B) (~C ~D) E, which requires multiple queries under the current system.

edit: notes:
ElasticPostQueryBuilder.new(query).build

Ah yes, the feature request that's nearly as old as e621 itself finally making its way to the GitHub issues. It may forever remain a pipe dream, but we can always hope.


  • Avoiding conflicts with tag names containing grouping characters.
    • Require space around grouping: (~A ~B) does not work, but ( ~A ~B ) does

Would the easiest way be to just not allow ( to be the first character in a tag? There's currently only a handful that would violate this, and most are invalid anyway. Tags can obviously still end with ) but that would be easier to handle since we know where the grouping starts.

There might be some issues with stray parentheses in tags, but I don't think any of those should actually be valid either. A rule could probably be added to the tag validator to prevent any future tags with stray parentheses being created, e.g. person_character).
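
As a rough sketch of what such a rule could check (the helper names here are made up for illustration and are not taken from the existing tag validator):

```ruby
# Hypothetical helper, not part of e621ng: true when every ")" closes an
# earlier "(" and no "(" is left open at the end of the name.
def balanced_parentheses?(tag_name)
  depth = 0
  tag_name.each_char do |char|
    depth += 1 if char == "("
    depth -= 1 if char == ")"
    return false if depth.negative? # a ")" with nothing to close
  end
  depth.zero?
end

# Hypothetical combined rule: no leading "(" and no stray parentheses.
def groupable_tag_name?(tag_name)
  !tag_name.start_with?("(") && balanced_parentheses?(tag_name)
end

puts groupable_tag_name?("pony_(mlp)")        # balanced, does not start with "("
puts groupable_tag_name?("person_character)") # stray ")"
```

A name like pony_(mlp) passes, while person_character) and anything starting with ( would be rejected.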

It would be easy to break down the groups from this string

(a_(character) b) (~b_(artist) ~c_(species)) (~pony_(mlp) ~pony_(eg) -pony)

to the extracted result

(a_(character) b)
(~b_(artist) ~c_(species))
(~pony_(mlp) ~pony_(eg) -pony)

Below is the code that gave that result. I wouldn't recommend actually using it as I threw it together pretty sloppily, only to be provided as a proof-of-concept.

def handle_tag_groups(input)
  tag_groups = []
  current_group = ""
  nesting_level = 0

  input.each_char do |char|
    nesting_level += 1 if char == '('
    nesting_level -= 1 if char == ')'
    current_group += char

    # A ")" that brings the nesting level back to zero closes a top-level group.
    if nesting_level == 0 && char == ')'
      tag_groups << current_group.strip
      current_group = ""
    end
  end

  # Note: anything after the last closing parenthesis is dropped.
  tag_groups
end

tag_groups = handle_tag_groups("(a_(character) b) (~b_(artist) ~c_(species)) (~pony_(mlp) ~pony_(eg) -pony)")

tag_groups.each do |tag_group|
  puts tag_group
end

...this is why I should fully read the issue before starting to do anything. I imagine if Danbooru has this implemented they've already solved the above problem? I imagine we're too far diverged at this point to just pull this though.

Both of these are good ways to do this, but using another set of characters that is already disallowed in tag names would bypass the issue entirely.
It is also worth noting that in app/logical/tag_name_validator.rb line 32, we specify that tags cannot begin with many of these characters

Grouping with anything but parentheses will be extremely unintuitive.

Grouping with anything but parentheses will be extremely unintuitive.

So is using ~ as the OR operator, imo.

What do you propose we group with? Percent signs? None of the characters that aren't allowed in tags are good for grouping things together

when /\*/
record.errors.add(attribute, "'#{value}' cannot contain asterisks ('*')")
when /,/
record.errors.add(attribute, "'#{value}' cannot contain commas (',')")
when /#/
record.errors.add(attribute, "'#{value}' cannot contain octothorpes ('#')")
when /\$/
record.errors.add(attribute, "'#{value}' cannot contain peso signs ('$')")
when /%/
record.errors.add(attribute, "'#{value}' cannot contain percent signs ('%')")
when /\\/
record.errors.add(attribute, "'#{value}' cannot contain back slashes ('\\')")

I actually think {} would be good: it is very rarely used in tags (1 result on e6) and still makes sense to use. Something like this would work:
master...mm12:e621ng:grouping

OK, so here is what I see the code doing:
The query_string from the search enters ElasticPostQueryBuilder -- (e.g. "A -B ~C ~D")
TagQuery is called with query_string, which is parsed into must (ANDed), should (ORed), or must_not (NOTed) -- (e.g. {"q":{"tags":{"must":["A"],"must_not":["B"],"should":["C","D"]}, "status_must_not":"deleted", "resolve_aliases":true, "tag_count":4}})
ElasticPostQueryBuilder then passes this to ElasticQueryBuilder (its superclass), which turns it into an actual Elasticsearch query:

{
  "query": {
    "bool": {
      "must": [{"term": {"tags": "A"}}],
      "must_not": [{"term": {"tags": "B"}}],
      "should": [
        {"term": {"tags": "C"}},
        {"term": {"tags": "D"}}
      ],
      "minimum_should_match": 1
    }
  },
  "sort": [{"id": "desc"}],
  "_source": false,
  "timeout": "9000ms"
}
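
For reference, the flat flow described above can be approximated in a few lines. This is a simplified stand-in, not the real TagQuery or ElasticQueryBuilder code: partition_tags, term_clauses, and flat_bool_query are made-up names, and metatags, aliases, sort, and timeout are ignored.

```ruby
require "json"

# Hypothetical stand-in for the parsing step: sort each token into a bucket
# based on its "-" (NOT) or "~" (OR) prefix.
def partition_tags(query_string)
  buckets = { must: [], must_not: [], should: [] }
  query_string.split.each do |token|
    case token
    when /\A-(.+)/ then buckets[:must_not] << Regexp.last_match(1)
    when /\A~(.+)/ then buckets[:should] << Regexp.last_match(1)
    else buckets[:must] << token
    end
  end
  buckets
end

# Wrap each tag in an Elasticsearch term clause on the "tags" field.
def term_clauses(tags)
  tags.map { |tag| { term: { tags: tag } } }
end

# Build the single, query-wide bool clause.
def flat_bool_query(query_string)
  buckets = partition_tags(query_string)
  bool = buckets.transform_values { |tags| term_clauses(tags) }
  bool[:minimum_should_match] = 1 unless buckets[:should].empty?
  { query: { bool: bool } }
end

puts JSON.pretty_generate(flat_bool_query("A -B ~C ~D"))
```

Because there is only one set of buckets for the whole query, every ~ tag lands in the same should list, which is why OR is query-wide today.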

Basically, to implement grouping in elasticsearch, you need for this to be recursive:

{
  "query": {
    "bool": {
      "must": [
        {"bool": {"must": [...], ...}}
      ],
      "must_not": [...],
      "should": [...],
      "minimum_should_match": 1
    }
  },
  ...
}

for any parts you want to group.
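
A minimal sketch of that recursion, assuming the query has already been parsed into nested arrays (groups) of prefixed tag strings. build_bool is a hypothetical name, and this sketch always treats a nested group as required; modifiers on groups are left out.

```ruby
# Hypothetical sketch: groups are nested arrays, tags are strings that may
# carry a "-" (NOT) or "~" (OR) prefix. Each group becomes its own bool clause.
def build_bool(group)
  bool = { must: [], must_not: [], should: [] }
  group.each do |item|
    if item.is_a?(Array)
      bool[:must] << build_bool(item) # recurse: a nested group is one clause
    elsif item.start_with?("-")
      bool[:must_not] << { term: { tags: item[1..-1] } }
    elsif item.start_with?("~")
      bool[:should] << { term: { tags: item[1..-1] } }
    else
      bool[:must] << { term: { tags: item } }
    end
  end
  bool[:minimum_should_match] = 1 if bool[:should].any?
  # Drop empty clause lists to keep the output readable.
  { bool: bool.reject { |_, clauses| clauses == [] } }
end

# (~A ~B) (~C ~D) E
query = { query: build_bool([["~A", "~B"], ["~C", "~D"], "E"]) }
p query
```

Each group gets its own should list and its own minimum_should_match, so the two OR groups no longer interfere with each other.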

I believe "boost" could also be used to make posts that match more groups rank as "more important", which might be useful as a tiebreaker when ordering?

First, I want to preface this with: I am not a ruby dev. Until like last week, I never put any thought into ruby... ever.

However, I implemented this on my database mirror using javascript, though I would like to see this eventually in the main site. So, I've gotten started on my fork here: https://github.com/DontTalkToMeThx/e621ng/tree/tag-syntax

It follows the syntax of my existing advanced search syntax which is detailed here. Because this is a (massively) breaking change to the syntax, I would like it to be an option users can enable in their settings and pass along as a query param for the API.

It does not currently support metatags; it only works with plain tag searches right now (idk if wildcards work). I do not want to sink too much time into something that won't ever make its way onto the site, so I did this as a proof of concept. It is not currently optimized; again, I am NOT a ruby dev. If you wish to test it, you need to add &use_new_syntax=true to your query string in the URL bar, as I have no idea how to add a persistent setting at the moment.

It is identical to my other syntax, so you can use groups within groups, negation, etc. This is an example: ( female ~ male ) ( solo ~ duo ) would find posts that contain a female or male, and are either solo or duo, which is not possible with the current syntax because it has no grouping. If you want to test this in your local version, this is the URL I use: http://localhost:3000/posts?tags=%28+female+~+male+%29+%28+solo+~+duo+%29&use_new_syntax=true

I wish for this to be used as a starting point for further development, mainly to show that this is possible, and I think we should eventually have some kind of grouping. I will make this fully feature complete with metatags, ordering, etc if it's deemed something that would actually be considered for the site.

My full implementation (in js) is open source as well, and that is mainly what I'm porting over here. One of the main differences is that the existing builder was written without grouping in mind, since a lot of the metatags are top level only, so I'll need to implement my metatag parser to inject the meta query into the correct location, which shouldn't be too hard.

Booru on Rails has a relatively feature-complete search parser which generates Elasticsearch queries, and has accompanying tests:
https://github.com/derpibooru/booru-on-rails/blob/master/lib/search_parser.rb
https://github.com/derpibooru/booru-on-rails/blob/master/test/lib/search_parser_test.rb

While I would advise against using it directly (due to licensing issues, and to be honest it isn't tailored to the needs of this software), there shouldn't be any reason you couldn't use a similar grammar.

I've opened a draft here for future discussion: #625