tdewolff / minify

Go minifiers for web formats

Home Page: https://go.tacodewolff.nl/minify

Javascript minification renders objects impossible to unmarshal with json

RampantDespair opened this issue

Basically I am getting a webpage and extracting a particular javascript object from said page.
The object I extract is valid and worked prior to using minify.

I'm using the boilerplate config:

m = minify.New()
m.AddFunc("text/css", css.Minify)
m.AddFunc("text/html", html.Minify)
m.AddFunc("image/svg+xml", svg.Minify)
m.AddFuncRegexp(regexp.MustCompile("^(application|text)/(x-)?(java|ecma)script$"), js.Minify)
m.AddFuncRegexp(regexp.MustCompile("[/+]json$"), json.Minify)
m.AddFuncRegexp(regexp.MustCompile("[/+]xml$"), xml.Minify)

I'd like to minify everything except the double quotes present in JavaScript objects. Is this possible?
I couldn't find an option for it.

Can you show me an example of the input? Are you sure the extracting part of your application is able to handle minified (but correct) HTML or JS code?

There is no option to prevent minification between double quotes in JS objects. I'm not sure what you mean, though: do you mean strings? Why would you not want to minify those? What is your use case?

@tdewolff

Of course, basically I have 2 use-cases:

  • Minifying the files I serve on my application
  • Minifying the files I receive on my application

My application simply fetches data from other websites and displays it on mine.

When it comes to serving the data on my website, everything works fine and as intended.
However, when it comes to fetching data from other websites, this is not the case.

The fetching is done by getting a webpage and scraping the necessary information, here's a shortened example of the data:

[...]
<script>
	document.addEventListener(
		"DOMContentLoaded",
		function() {
			var e=window.Search_V3Controller;
			e.default&&(e=e.default),
			e.ReactDOMrender({data:{header:!1}})
			[...]
</script>
[...]

As you can see the javascript object is hardcoded straight on the webpage (and that's what I retrieve).
So after stripping all the elements I don't need, I am left with: {data:{header:!1}}

I want to unmarshal this into one of my defined models.
However, I can't, because the minifier removes the double quotes around keys, converts booleans (false becomes !1), and most likely changes other things as well.
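To illustrate the failure described above, here is a minimal sketch using only Go's standard encoding/json package. The PageData struct and the decode helper are hypothetical stand-ins for the model mentioned in the thread, not the actual types used in the application.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PageData is a hypothetical model matching the shortened example object.
type PageData struct {
	Data struct {
		Header bool `json:"header"`
	} `json:"data"`
}

// decode tries to unmarshal raw into a PageData using encoding/json.
func decode(raw string) (PageData, error) {
	var p PageData
	err := json.Unmarshal([]byte(raw), &p)
	return p, err
}

func main() {
	inputs := []string{
		`{"data":{"header":false}}`, // strict JSON: quoted keys, literal false
		`{data:{header:!1}}`,        // minified JS: unquoted keys, !1 for false
	}
	for _, raw := range inputs {
		p, err := decode(raw)
		fmt.Printf("input=%s header=%v err=%v\n", raw, p.Data.Header, err)
	}
}
```

The first input decodes cleanly, while the second is rejected, because encoding/json accepts only strict JSON and not general JavaScript object literals.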

Is it possible, to your knowledge, to unmarshal a minified javascript object into a go model?

I've tried using otto to attempt to parse the JSON in JS and then pull it back to Go, but that attempt was unsuccessful.

func JavascriptToJSON(body string) (bodyParsed string) {
	err := vm.Set("body", body)
	if err != nil {
		log.Fatal("Error setting body:", err)
	}

	_, err = vm.Run("var jsonBody = JSON.parse(body)")
	if err != nil {
		log.Fatal("Error parsing body:", err)
	}

	_, err = vm.Run("var jsonString = JSON.stringify(jsonBody)")
	if err != nil {
		log.Fatal("Error stringifying jsonBody:", err)
	}

	value, err := vm.Get("jsonString")
	if err != nil {
		log.Fatal("Error getting jsonString:", err)
	}

	jsonStr, err := value.ToString()
	if err != nil {
		log.Fatal("Error converting jsonString to string:", err)
	}

	return jsonStr
}

But reading back on this reply, I suppose this would be an otto question and not a minify one (assuming I need to use otto to begin with).

I see what the problem is, thanks for sharing. The problem is that JS is not JSON (but JSON is JS). That is, we're minifying JS which ends up being invalid JSON. What you could do I suppose is this:

It is the most correct way, since the minifier just amplifies a design problem you already had: parsing JS as if it were valid JSON. Of course, if a JS unmarshaller exists, that would be better still, since you wouldn't need point 3 above. Or you could write an unmarshaller manually using the AST from point 2...

@tdewolff

I wasn't able to find a JS unmarshaller.
This is what I am doing right now, following your suggestion:

s.bodyParsed = strings.Split(s.bodyParsed, `.ReactDOMrender(`)[1]
s.bodyParsed = strings.Split(s.bodyParsed, `,document.getElementById(`)[0]
if debug {
	os.WriteFile("test.txt", []byte(s.bodyParsed), OS_ALL_RWX)
}

reader := strings.NewReader(s.bodyParsed)
input := parse.NewInput(reader)
astTree, err := js.Parse(input, js.Options{})
if err != nil {
	Debug("Failed to parse input to astTree -> %s", err)
}

jsonBody, err := astTree.JSONString()
if err != nil {
	Debug("Failed to convert astTree to jsonBody (%s) -> %s", astTree, err)
}
if debug {
	os.WriteFile("../test.json", []byte(jsonBody), OS_ALL_RWX)
}

var bodyJson model.PageData
err = json.Unmarshal([]byte(jsonBody), &bodyJson)
if err != nil {
	log.Fatal(err)
}
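One detail in the snippet above: indexing the result of strings.Split with [1] panics when the marker is not present in the page. A minimal sketch of a more defensive extraction, assuming Go 1.18 or later for strings.Cut; the helper name extractArg and the sample body are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// extractArg returns the text between the first occurrence of start and the
// next occurrence of end. The boolean is false if either marker is missing,
// so the caller can handle that case instead of panicking.
func extractArg(body, start, end string) (string, bool) {
	_, rest, ok := strings.Cut(body, start)
	if !ok {
		return "", false
	}
	arg, _, ok := strings.Cut(rest, end)
	if !ok {
		return "", false
	}
	return arg, true
}

func main() {
	// Hypothetical page fragment shaped like the example in this thread.
	body := `e.ReactDOMrender({data:{header:!1}},document.getElementById("root"))`
	arg, ok := extractArg(body, ".ReactDOMrender(", ",document.getElementById(")
	fmt.Println(arg, ok)
}
```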

I confirmed that the JavaScript object written to test.txt is correct.
However, I get an error when attempting to convert that string into an AST:

2024/04/12 14:01:43 Failed to parse input to astTree -> unexpected : in expression on line 1 and column 46
    1: ...eaderTop:0,scrollTop:0,currHeaderTop:100,...
                              ^
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x1385104]

I'm using:

"github.com/tdewolff/parse/v2"
"github.com/tdewolff/parse/v2/js"

Am I doing the procedure incorrectly?

The procedure looks good, but the JS is probably bad. Can you post the content of s.bodyParsed? Are you sure it is valid JS?

@tdewolff
No problem, here are the 2 versions:
RAW version (no minification) : https://gist.github.com/RampantDespair/94ac035f113efa34e76a9e2efa6ec7c7
MINIFIED version (with the boilerplate config): https://gist.github.com/RampantDespair/05695fc9c6484ea8c6aa48cadba38136

Your raw version is invalid JavaScript. Specifically, when using an object literal as a statement, you must enclose it in parentheses to avoid confusion with a block statement (e.g. {...} => ({...})). Try enclosing it in parentheses to make it parse.

@tdewolff

That's weird because I am able to json.Unmarshal([]byte(jsonBody), &bodyJson) the raw version without any problems.
Online parsers are also able to parse it to json (http://json.parser.online.fr/)

Nevertheless, I tried adding the parentheses to enclose the object, but I am still getting the invalid JSON error thrown.

Yes, it is valid JSON without the enclosing parentheses (provided that all dictionary keys are quoted). It is just not valid JavaScript to use an object literal as the sole content of a statement without enclosing it in parentheses, because it is otherwise impossible to know whether it is a block statement or an object literal. Try a JavaScript parser such as https://astexplorer.net/ instead of a JSON parser, and it will show an error with the JS code you have.
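As a rough illustration of the parenthesization advice, the sketch below wraps a bare object literal before it is handed to a JavaScript parser. The helper name wrapObjectLiteral is hypothetical, and the check is deliberately naive: it only looks at the first non-space character.

```go
package main

import (
	"fmt"
	"strings"
)

// wrapObjectLiteral encloses src in parentheses when it starts with "{",
// so that a JavaScript parser reads it as an expression (an object literal)
// rather than as a block statement. Other input is returned trimmed.
func wrapObjectLiteral(src string) string {
	s := strings.TrimSpace(src)
	if strings.HasPrefix(s, "{") {
		return "(" + s + ")"
	}
	return s
}

func main() {
	fmt.Println(wrapObjectLiteral(`{data:{header:!1}}`))
	fmt.Println(wrapObjectLiteral(`({data:{header:!1}})`)) // already wrapped, unchanged
}
```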

You say you're still getting an invalid JSON error, but that is the first time you mention that error. I was helping you with Failed to parse input to astTree -> unexpected : in expression, and I guess that is now fixed? What code is throwing invalid JSON, and what input do you give it?

@tdewolff

My apologies. By "still" I meant that the procedure isn't working as intended.
But yes, you are correct: the invalid JSON error stems from Failed to convert astTree ([...]) -> invalid JSON.

And I am feeding it the same input as the MINIFIED version above, but wrapped in ( ).

The revised code is as follows:

s.bodyParsed = strings.Split(s.bodyParsed, `.ReactDOMrender(`)[1]
s.bodyParsed = strings.Split(s.bodyParsed, `,document.getElementById(`)[0]
s.bodyParsed = fmt.Sprintf("(%s)", s.bodyParsed)  // <---- NEW
if debug {
	os.WriteFile("minfied.txt", []byte(s.bodyParsed), OS_ALL_RWX)
}

reader := strings.NewReader(s.bodyParsed)
input := parse.NewInput(reader)
astTree, err := js.Parse(input, js.Options{})
if err != nil {
	log.Printf("Failed to parse input (%v) -> %s", input, err)
	return
}

jsonBody, err := astTree.JSONString()
if err != nil {
	Debug("Failed to convert astTree to jsonBody (%s) -> %s", astTree, err) // <---- ERROR THROWN
}
if debug {
	os.WriteFile("../test.json", []byte(jsonBody), OS_ALL_RWX)
}

var bodyJson model.PageData
err = json.Unmarshal([]byte(jsonBody), &bodyJson)
if err != nil {
	log.Fatal(err)
}

Thanks for the code, I've run some tests and indeed there were several issues when writing JSON from minified JS (specifically !0 => true and template literal => string literal). I've fixed those issues in the parse library, please update it to master (go get -u github.com/tdewolff/parse/v2@master) and see if it works.
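The boolean case mentioned above follows from JavaScript's own rules: !0 evaluates to true and !1 to false, which is why minifiers emit them as shorter spellings. The toy sketch below shows that mapping only. It is not the code from the parse library fix, and the function name jsonLiteral is hypothetical.

```go
package main

import "fmt"

// jsonLiteral maps the boolean shorthands commonly emitted by JS minifiers
// to their JSON spellings. Any other token is returned unchanged.
func jsonLiteral(tok string) string {
	switch tok {
	case "!0":
		return "true"
	case "!1":
		return "false"
	}
	return tok
}

func main() {
	for _, tok := range []string{"!0", "!1", "100"} {
		fmt.Printf("%s => %s\n", tok, jsonLiteral(tok))
	}
}
```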

Fixed by tdewolff/parse@b5d42d6

@tdewolff
Yup it works now, thanks a bunch for helping me out till the end!