Tokenize source code into integer vectors, symbols, or discrete tokens.
The following languages are currently supported.
- C
- C#
- C++
- Java
- PHP
- Python
cd src
make
cd src
sudo make install
tokenizer file.c
tokenizer -l Java -o statement <file.java
$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c.c | tokenizer -l C
35 320 60 2000 46 2001 62 322 2002 40 41 123 2003 40 625 41 59 327 1500 59 125
$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c.c | tokenizer -l C -t s
# include < ID:2000 . ID:2001 > int ID:2002 ( ) { ID:2003 ( STRING_LITERAL
) ; return 0 ; }
$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/csharp.cs | tokenizer -l "C#"
312 2000 123 360 376 2001 40 41 123 2002 46 2003 46 2004 40 627 41 59 125 125
$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/csharp.cs | tokenizer -l "C#" -t s
class ID:2000 { static void ID:2001 ( ) { ID:2002 . ID:2003 . ID:2004
( STRING_LITERAL ) ; } }
$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/csharp.cs | tokenizer -l "C#" -o method
123 2002 46 2003 46 2004 40 627 41 59 125
$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c%2B%2B.cpp | tokenizer -l C++ -t s
# include < ID:2000 > LINE_COMMENT using namespace ID:2001 ; int ID:2002
( ) LINE_COMMENT { ID:2003 LSHIFT STRING_LITERAL LSHIFT ID:2004 ;
LINE_COMMENT return 0 ; LINE_COMMENT }
$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/j/Java.java | tokenizer -l Java -t s
public class ID:2000 { public static void ID:2001 ( ID:2002 [ ] ID:2003 )
{ ID:2004 . ID:2005 . ID:2006 ( STRING_LITERAL ) ; } }
$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c%2B%2B.cpp | tokenizer -l C++ -t c
#
include
<
iostream
>
// ...
using
namespace
std
;
int
main
(
)
// ...
{
cout
<<
"..."
<<
endl
;
// ...
return
0
;
// ...
}
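To see how the integer and symbolic outputs relate, the two can be lined up position by position. The sketch below is only an illustration: `pair_tokens` is a hypothetical helper, not part of the tokenizer, and it assumes that the default output and the `-t s` output contain the same number of tokens in the same order, which holds for the C sample above (21 tokens each) but is not guaranteed here for every language or input.

```sh
#!/bin/sh
# pair_tokens INTS SYMS  (hypothetical helper, not part of the tokenizer)
# Print "integer<TAB>symbol" for each position of two whitespace-separated
# token lists. Fails if the lists differ in length.
pair_tokens() {
    INTS=$1 SYMS=$2 awk 'BEGIN {
        # Default split() separates on runs of blanks and newlines.
        n = split(ENVIRON["INTS"], ints)
        m = split(ENVIRON["SYMS"], syms)
        if (n != m) {
            print "token counts differ: " n " vs " m > "/dev/stderr"
            exit 1
        }
        for (i = 1; i <= n; i++)
            printf "%s\t%s\n", ints[i], syms[i]
    }'
}

# Possible use with the C example above (needs curl and an installed tokenizer):
#   url=https://raw.githubusercontent.com/leachim6/hello-world/master/c/c.c
#   pair_tokens "$(curl -s "$url" | tokenizer -l C)" \
#               "$(curl -s "$url" | tokenizer -l C -t s)"
```

In the C sample shown earlier, such a pairing matches `35` with `#`, `60` with `<`, and `123` with `{`, which are those characters' ASCII codes, and matches `2000` through `2003` with `ID:2000` through `ID:2003`. This is an observation from that one sample rather than a documented guarantee.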
For a full reference, see the command's Unix manual page, `tokenizer.1`.
To support a new language proceed as follows.
- Open an issue with the language name and a pointer to its lexical structure definition.
- Add a comment indicating that you're working on it.
- List the language's keywords in a file named *language*`-keyword.txt`. Keep alphabetic order. If the language supports a C-like preprocessor, add those keywords as well.
- Copy the source code files of the existing language that most resembles the new language to create the new language files: *language*`Tokenizer.cpp`, *language*`Tokenizer.h`, *language*`TokenizerTest.h`.
- In the copied files rename all instances (uppercase, lowercase, CamelCase) of the existing language name to the new language name.
- Create a list of the new language's operators and punctuators, and methodically go through the *language*`Tokenizer.cpp` `switch` statements to ensure that these are correctly handled. When code is missing or different, base the new code on an existing pattern.
- Add code to handle the language's comments.
- Adjust, if needed, the handling of constants and literals. Note that for the sake of simplicity and efficiency, the tokenizer can assume that its input is correct.
- To implement features that aren't handled in the language whose tokenizer implementation you copied, look at the implementation of other language tokenizers that have these features.
- If you need to reuse a method from another language, move it to `TokenizerBase`.
- Add the object file *language*`Tokenizer.o` to the `OBJ` list of file names in the `Makefile`.
- Add unit tests for any new or modified features you implemented.
- Update the file `UnitTests.cpp` to include the unit test header file, and call `addTest` with the unit test suite.
- Update the method `process_file` in `tokenizer.cpp` to call the tokenizer you implemented, and add the language's name to the list of supported languages.
- Ensure the language is correctly tokenized, both by running the tokenizer and by running the unit tests with `make test`.
- Update the manual page `tokenizer.1` and this `README.md` file.
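
Two of the steps above, keeping the keyword file in order and renaming the copied files' contents, are mechanical enough to script. The following is a minimal sketch under stated assumptions, not part of the project: `check_keywords` and `rename_language` are hypothetical helpers, "alphabetic order" is taken to mean plain byte order, and language names are assumed to be simple identifiers (no regular-expression metacharacters or `/`) where the new name does not contain the old one. `Kotlin` is used purely as an example of a new language.

```sh
#!/bin/sh
# check_keywords FILE  (hypothetical helper)
# Succeed only if FILE is sorted in byte order and has no duplicate lines.
check_keywords() {
    LC_ALL=C sort -c -u "$1"
}

# rename_language OLD NEW  (hypothetical helper)
# Filter stdin to stdout, replacing the spelling of OLD as given, its
# UPPERCASE form, and its lowercase form with the matching form of NEW.
# Matches are textual, so review the result for unintended replacements.
rename_language() {
    old=$1 new=$2
    old_uc=$(printf '%s' "$old" | tr '[:lower:]' '[:upper:]')
    new_uc=$(printf '%s' "$new" | tr '[:lower:]' '[:upper:]')
    old_lc=$(printf '%s' "$old" | tr '[:upper:]' '[:lower:]')
    new_lc=$(printf '%s' "$new" | tr '[:upper:]' '[:lower:]')
    sed -e "s/$old/$new/g" -e "s/$old_uc/$new_uc/g" -e "s/$old_lc/$new_lc/g"
}

# Possible use, with file names following the pattern described above:
#   check_keywords Kotlin-keyword.txt
#   rename_language Java Kotlin <JavaTokenizer.cpp >KotlinTokenizer.cpp
```

If the project orders its keyword files with a different collation, adjust or drop the `LC_ALL=C` setting accordingly.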