Parse target urls from style metadata & make them searchable

Question

RiedleroD opened this issue 4 years ago

Lisa Magdalena Riedler commented 4 years ago

Should be easy enough to parse: URLs are always between @-moz-document and {, so you would only have to get the string in between and check whether it's a regex, a direct URL, a prefix, or a domain rule. From there, you can just take the rule. Maybe let the user change it, and rescan it when the style updates.

Then, add it to the searching algorithm. Idk if that's too heavy for the server, but you could try checking the regex on search.

I don't know how easy or hard this would be to implement since I have no idea what your project does in the backend, but I think the people would love that feature.

Vadim Chetkov · Answer 1 · Mon Nov 16 2020 01:12:07 GMT+0800 (China Standard Time)

The current repo parser has no information on what is inside the usercss file, so this will require an additional request to the GitHub API for the file contents.

Why do you think it should be implemented? At the moment, the algorithm searches by repo name, custom name, repo owner name, and tags, and the search results seem relevant.

Lisa Magdalena Riedler · Answer 2 · Fri Dec 11 2020 20:53:59 GMT+0800 (China Standard Time)

"Why do you think it should be implemented?"

For more specific searching. Imagine searching for "docker.com" and nothing gets shown because people put it as "docker" into the tags, then searching "docker" and a whole lot of irrelevant repos get shown that happen to be made by someone named "docker user 1200".

Solutions I thought of would include adding advanced search (where "search in owner names" could be disabled, and similar stuff) or adding a kind of "search by url", which would also be pretty convenient and would increase the likelihood that the Stylus inline search includes your site.

I'm not sure how easy or hard this would be to implement since I have absolutely 0 knowledge of Node.js, but what I would do, regardless of programming language, would include these steps:

- get the file contents
- get the style rules with a regex like @-moz-document +(regexp|url|url-prefix|domain) *\(([^)]+)\) *{
- turn each rule into a regex using the capturing groups (e.g. if the first capturing group is "url", strip the quotation marks from the start and end of the second group and then escape everything according to regex syntax)
- then, when a user searches by URL, fullmatch the regexes you got, and if any of a style's regex rules applies to the entered URL, add that style to the results
- also, if a user searches for something not prefixed with http:// or https://, one of the two should be prepended, because many styles assume it's always present and build their rules around that
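
A minimal sketch of the extraction and query-normalization steps above, written in Python only because the "fullmatch" wording maps directly onto Python's re module (the project itself is Node.js). The function names extract_rules, strip_quotes and normalize_query are made up for illustration, and the brace in the pattern is escaped for safety. Note that [^)]+ stops at the first closing parenthesis, so regexp() rules that contain groups are not captured by this pattern.

```python
import re

# Pattern from the post, with the trailing brace escaped.
RULE_RE = re.compile(r'@-moz-document +(regexp|url|url-prefix|domain) *\(([^)]+)\) *\{')


def strip_quotes(text):
    """Trim spaces, then drop one matching pair of surrounding quotes, if any."""
    text = text.strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        return text[1:-1]
    return text


def extract_rules(css):
    """Return (rule type, rule string) pairs found in the stylesheet text."""
    return [(kind, strip_quotes(value)) for kind, value in RULE_RE.findall(css)]


def normalize_query(query):
    """Prepend https:// when the search string has no http(s) scheme."""
    if not re.match(r'https?://', query):
        return 'https://' + query
    return query
```

Choosing https:// over http:// as the default prefix is an assumption; the post only says that one of the two should be prepended.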

Freeplay · Answer 3 · Tue Dec 15 2020 03:54:45 GMT+0800 (China Standard Time)

If this is added, could you also parse @description? It is now possible to add multiple styles from one repository, and this would avoid having the same repository description set for all the styles in that repo.

Edit: You could add version numbers, support URLs, etc. to the style pages just from the metadata.

Vadim Chetkov · Answer 4 · Mon Jan 04 2021 23:36:55 GMT+0800 (China Standard Time)

A metadata parser was added in 3.4.0. Since then, a style's title, description, and license are set from the metadata.

As for sites to which the style is applied, I will need some help with a regex that gets the hostname from the regex rule.
So far, I have no idea how to deal with such expressions:

@-moz-document regexp("https?:\/\/.*wik(i|t).*(org|jp).*") {
@-moz-document regexp("^https?://((gist|guides|help|raw|status|developer)\\.)?github\\.com/((?!generated_pages/preview).)*$") {
@-moz-document regexp("^https?://((gist|guides|docs|lab|launch-editor|raw|resources|status|developer|support)\\.)?github\\.com/((?!generated_pages/preview).)*$") {
@-moz-document regexp("moz-extension:\/\/.*"), regexp("chrome-extension:\/\/.*") {

For a start, I could parse only domain() rules.
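
A rough sketch of what a domain()-only first pass could look like, in Python for illustration (the project is Node.js). The names domains_in_style and style_applies_to are hypothetical. It relies on domain() matching the given host and its subdomains, and it assumes the header between @-moz-document and the first { contains no brace, so a domain() listed after a regexp() that uses a {n} quantifier would be missed.

```python
import re
from urllib.parse import urlsplit

# Text between "@-moz-document" and the first "{" (assumed to hold no brace itself).
HEADER_RE = re.compile(r'@-moz-document\s+([^{]*)\{')
# A single domain(...) rule, quoted or unquoted.
DOMAIN_RE = re.compile(r'''domain\(\s*["']?([^"')\s]+)["']?\s*\)''')


def domains_in_style(css):
    """Collect every domain() value from the @-moz-document headers."""
    found = []
    for header in HEADER_RE.findall(css):
        found.extend(d.lower() for d in DOMAIN_RE.findall(header))
    return found


def style_applies_to(css, url):
    """True if the URL's host equals a listed domain or is a subdomain of one."""
    host = (urlsplit(url).hostname or '').lower()
    return any(host == d or host.endswith('.' + d) for d in domains_in_style(css))
```

Comparing hostnames directly sidesteps the problem of extracting a hostname from a regexp() rule, at the cost of ignoring styles that only use regexp(), url() or url-prefix().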

Lisa Magdalena Riedler · Answer 5 · Fri Jan 08 2021 03:42:13 GMT+0800 (China Standard Time)

well I can show you how I would do it in pseudocode:

when the style gets parsed:

define this as the main regex: @-moz-document(?: +(regexp|domain|url|url-prefix) *\((.+)\))? *{
create a new array (referred to as the rule array)
for each match in the stylesheet:
- get the rule type (regexp,url,etc.) from the first capturing group (note: capturing group 0 is the whole match, the first actual group is at index 1)
- get the rule string from the second capturing group
- remove leading and trailing spaces from the rule string
- remove leading and trailing quotes from the rule string (both ' and ", but only one at each end)
- if the rule type is not regexp:
  - escape the rule string
- if the rule type is domain:
  - prefix the rule string with https?:\/\/(\w+\.)* and postfix it with (\/.*)?
- if the rule type is url:
  - do nothing, it's good to go
- if the rule type is url-prefix:
  - postfix the rule string with .* (it was already escaped by the "not regexp" step, so don't escape it a second time)
- add the rule string to the rule array
then concatenate the rule array like this: (rule1|rule2|[…])
- example: if the array was ["abc","def","15","ghi"], then the concatenated string would be (abc|def|15|ghi)
then save that string with the rest of the style metadata. that's the regex string we'll use to check if a style is searched for
then, when someone searches for a style:
for each style:
- check if the regex we saved earlier matches the string. Don't forget to do a fullmatch here, meaning the whole string has to match the regular expression.
- if it does, add the style to the …output queue or something idk how you implemented that
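
The pseudocode above could be sketched roughly like this in Python (the project is Node.js, so this is only an illustration). All function names and the styles dictionary are hypothetical. Three things differ from the pseudocode and are assumptions: the rule part of the header regex is mandatory, the domain prefix uses [\w-]+ instead of \w+ so that hyphenated subdomains match, and regexp() strings have their CSS backslash escaping undone (\\. becomes \.). Like the pseudocode, it handles only one rule per @-moz-document header, not comma-separated lists.

```python
import re

# Header regex from the pseudocode (rule part made mandatory, brace escaped).
MAIN_RE = re.compile(r'@-moz-document +(regexp|domain|url-prefix|url) *\((.+)\) *\{')


def strip_quotes(text):
    """Trim spaces, then remove at most one quote character from each end."""
    text = text.strip()
    if text[:1] in ('"', "'"):
        text = text[1:]
    if text[-1:] in ('"', "'"):
        text = text[:-1]
    return text


def rule_to_regex(kind, value):
    """Turn one @-moz-document rule into a regex string."""
    value = strip_quotes(value)
    if kind == 'regexp':
        # Assumption: undo CSS string escaping so "\\." becomes the regex "\.".
        return re.sub(r'\\(.)', r'\1', value)
    escaped = re.escape(value)  # escape exactly once for non-regexp rules
    if kind == 'domain':
        return r'https?://(?:[\w-]+\.)*' + escaped + r'(?:/.*)?'
    if kind == 'url-prefix':
        return escaped + '.*'
    return escaped  # kind == 'url'


def build_style_regex(css):
    """Join all rule regexes of a stylesheet into (rule1|rule2|...)."""
    rules = [rule_to_regex(kind, value) for kind, value in MAIN_RE.findall(css)]
    return '(' + '|'.join(rules) + ')' if rules else None


def search_by_url(styles, url):
    """styles maps a style name to its saved regex string (hypothetical storage)."""
    hits = []
    for name, pattern in styles.items():
        try:
            if pattern and re.fullmatch(pattern, url):
                hits.append(name)
        except re.error:
            continue  # a style author's regex may not be valid in this engine
    return hits
```

Running every saved regex on each search is linear in the number of styles, which is the server-load concern raised at the start of the thread; precomputing the regex string at parse time, as the pseudocode suggests, at least keeps the stylesheet parsing out of the search path.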

If you're stuck somewhere, please ask.