Nixinova / LinguistJS

Analyse and list all languages used in a folder. Implementation of and powered by GitHub's Linguist.

*.tsx files are classified as XML

kachkaev opened this issue · comments

Steps to reproduce:

mkdir test-dir
cd test-dir

echo "export default function () { return <div>hello</div>; }" > ./MyComponent.jsx   
echo "export default function () { return <div>hello</div>; }" > ./MyComponent.tsx   

npx linguist-js@2.0.0 --analyze .

Current output

Analysed 112 B from 2 files with linguist-js

 Language analysis results:
   1. XML                      50.00%         56 B
   2. JavaScript               50.00%         56 B
 Total: 112 B

Expected output

Analysed 112 B from 2 files with linguist-js

 Language analysis results:
   1. TypeScript               50.00%         56 B
   2. JavaScript               50.00%         56 B
 Total: 112 B

If we create files with

touch ./MyComponent.jsx
touch ./MyComponent.tsx

the output is:

Analysed 0 B from 2 files with linguist-js

 Language analysis results:
   1. XML                        NaN%          0 B
   2. JavaScript                 NaN%          0 B
 Total: 0 B

"tsx" is classified as both a TS and XML filename in GitHub-linguist's languages file.

Running linguist-js, the intermediary file classification is ["TypeScript", "XML"]. The program then runs through the heuristics and so on, but doesn't find anything specific to either language. At that point it just defaults to the last language in the aforementioned list, so XML.
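The fallback described above can be sketched roughly as follows. This is a minimal illustration, not linguist-js's actual code: `pickLanguage`, the `Heuristic` type, and the simplified pattern are all hypothetical names and values chosen for the example.

```typescript
// Hypothetical sketch of the fallback behaviour; not the real linguist-js source.
type Heuristic = { language: string; pattern: RegExp };

function pickLanguage(
  candidates: string[],
  heuristics: Heuristic[],
  content: string,
): string | undefined {
  // Try each heuristic whose language is one of the candidates.
  for (const { language, pattern } of heuristics) {
    if (candidates.includes(language) && pattern.test(content)) {
      return language;
    }
  }
  // Nothing matched: fall back to the last candidate in the list.
  return candidates[candidates.length - 1];
}

const candidates = ["TypeScript", "XML"];
const heuristics: Heuristic[] = [
  // Assumed, simplified pattern: an ES import from "react" at the start of the content.
  { language: "TypeScript", pattern: /^\s*import.+from\s+['"]react/ },
];
const exportOnly = "export default function () { return <div>hello</div>; }";

console.log(pickLanguage(candidates, heuristics, exportOnly)); // XML
console.log(pickLanguage(candidates, heuristics, "")); // XML
```

With this shape, both the export-only file and the empty file end up as XML simply because XML is last in the candidate list.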

In the first case it falls down to the heuristics, which match with the following:

https://github.com/github/linguist/blob/dc577fe72a28ab069f02cc002b4f2e3851a72f93/lib/linguist/heuristics.yml#L577-L582

As you can see, the heuristics only check for import keywords, not export, so the check misses and it falls back to picking XML out of the list by default.

In the second case there's no file content to match, so it just picks XML from the list by default as well. That can't be fixed at all. (The percentage can be, though...)
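The NaN% in the empty-file output is what you get from dividing 0 bytes by a 0-byte total. One possible guard is sketched below; `percent` is a hypothetical helper for illustration, and whether linguist-js handles it this way is an assumption.

```typescript
// Hypothetical helper: share of `bytes` in `totalBytes`, as a percentage.
// Returns 0 when the total is 0 instead of letting 0 / 0 produce NaN.
function percent(bytes: number, totalBytes: number): number {
  return totalBytes === 0 ? 0 : (bytes / totalBytes) * 100;
}

console.log((0 / 0) * 100); // NaN
console.log(percent(0, 0).toFixed(2)); // 0.00
console.log(percent(56, 112).toFixed(2)); // 50.00
```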

IC, the case with an empty file makes sense. What’s odd is that I have quite a few real files importing React which are still classified as XML. Here’s an MVE:

mkdir test-dir
cd test-dir

cat <<EOF > MyComponent1.tsx
import * as React from "react";
export default function () { return <div>hello</div>; }
EOF

cat <<EOF > MyComponent2.tsx
import React from "react";
export default function () { return <div>hello</div>; }
EOF

cat <<EOF > MyComponent3.tsx
const react = require("react");
export default function () { return <div>hello</div>; }
EOF

npx linguist-js@2.0.0 --analyze .
Analysed 256 B from 3 files with linguist-js

 Language analysis results:
   1. XML                      100.00%        256 B
 Total: 256 B

In that case it is a problem, I'll check to see if the heuristics are applied properly.

The heuristics don't actually check for CommonJS imports, only TS-native imports (see https://regexr.com/659pr), so even when this is fixed your MyComponent3 example won't be matched by the heuristic.
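To illustrate the difference, here is a small check against the first lines of the three files above. The pattern is an assumed, simplified stand-in for the real heuristic (it is not copied from heuristics.yml or from the regexr link), so treat it as a sketch of the idea only.

```typescript
// Assumed, simplified pattern: an ES import from a module whose name starts with "react".
const reactImport = /^\s*import.+from\s+['"]react/;

const firstLines = [
  'import * as React from "react";', // MyComponent1.tsx
  'import React from "react";',      // MyComponent2.tsx
  'const react = require("react");', // MyComponent3.tsx
];

// true for the two ES imports, false for the CommonJS require.
const results = firstLines.map((line) => reactImport.test(line));
console.log(results); // [ true, true, false ]
```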

Fixed in 2.0.2 👍

I can confirm that I start seeing "TSX" for *.tsx files in 2.0.2.

There are still instances of XML though, and it seems to have to do with the location of the React import. The heuristics only applies to the first line.

mkdir test-dir
cd test-dir

cat <<EOF > MyComponent1.tsx
import * as React from "react";
export default function () { return <div>hello</div>; }
EOF

cat <<EOF > MyComponent2.tsx
import foo from "bar";
import * as React from "react";
export default function () { return <div>hello</div>; }
EOF

npx linguist-js@2.0.2 --analyze .
Analysed 199 B from 2 files with linguist-js

 Language analysis results:
   1. XML                      55.78%        111 B
   2. TSX                      44.22%         88 B

Shall we reopen this issue or is it better to create a new one?

The heuristics only applies to the first line.

That's a simple fix: I wasn't applying the m (multiline) flag to the regex when it contained ^.
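For reference, this is how the m flag changes what ^ matches in a JavaScript regex. The pattern below is again an assumed, simplified stand-in for the actual heuristic; only the flag behaviour is the point.

```typescript
// Contents of MyComponent2.tsx from the example above.
const tsxSource = [
  'import foo from "bar";',
  'import * as React from "react";',
  "export default function () { return <div>hello</div>; }",
].join("\n");

// Without /m, ^ only matches at the very start of the input.
const firstLineOnly = /^\s*import.+from\s+['"]react/;
// With /m, ^ also matches at the start of every line.
const anyLine = /^\s*import.+from\s+['"]react/m;

console.log(firstLineOnly.test(tsxSource)); // false
console.log(anyLine.test(tsxSource)); // true
```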

Fixed now, though it will now classify TSX as TypeScript unless --childLanguages is set.