The tokenizer replaces document elements with tokens and stores the parsed content elements in a JSON. The library `wtf_tokenizer` was designed to work together with the great Wiki markdown parser `wtf_wikipedia` developed by Spencer Kelly. Without his work on `wtf_wikipedia` this library would not exist.
The tokenizer is designed as one of the micro libraries of the Wiki Transformation Framework (WTF).
With this tokenizer you will be able to replace the following content elements:
- Mathematical Expressions
- Citations and References
Mathematical expressions are defined with the `math` tag in Wiki markdown syntax:
text before the mathematical expression <MATH>\sum_{i=1}^{\infty} [x_i]
: v_i
</MATH> text after math.
Tokenizing with the `encode()` call for mathematical expressions will create the following content as output:
text before the mathematical expression ___MATH_INLINE_7238234792_5___ text after math.
The time index `7238234792` is followed by the enumeration ID `5`, which identifies the 5th mathematical expression found in the Wiki markdown source text.
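As an illustration, the naming scheme of these markers can be sketched with a small helper (the function below is hypothetical and not part of the wtf_tokenizer API):

```javascript
// Hypothetical helper illustrating the token naming scheme described above:
// timeId is a millisecond timestamp that makes a tokenizer run unique,
// index enumerates the expressions found in the source, and type is
// "INLINE" or "BLOCK".
function makeMathToken(timeId, index, type) {
  return "___MATH_" + type + "_" + timeId + "_" + index + "___";
}

// The 5th inline expression of the run with time index 7238234792:
console.log(makeMathToken(7238234792, 5, "INLINE"));
// ___MATH_INLINE_7238234792_5___
```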
Citations and references are defined with the `ref` tag in Wiki markdown syntax:
text before the reference <ref name="MyLabel">Peter Miller (2020) ...</ref> and text after the reference.
cite an already defined reference with <ref name="MyLabel"/> text after citation.
Tokenizing with the `encode()` call for citations and references will create the following content as output.
The tokenizer converts XML-like sections such as the `ref` tag and mathematical expressions wrapped in a `math` tag into attributes of the generated JSON, e.g. it turns
text before math <MATH>
\sum_{i=1}^{\infty} [x_i]
: v_i
</MATH> text after math.
text before <ref>my reference ...</ref> and text after
cite an already defined reference with <ref name="MyLabel"/> text after citation.
into
text before math ___MATH_INLINE_7238234792_5___ text after math.
text before ___CITE_7238234792_3___ and text after
cite an already defined reference with ___CITE_7238234792_MyLabel___ text after citation.
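A rough sketch of how such a `<ref>` replacement could work internally (the regular expression and the function name are illustrative assumptions, not the actual wtf_tokenizer implementation):

```javascript
// Replace self-closing <ref ... /> tags and <ref>...</ref> pairs by
// enumerated CITE tokens, keeping the original markup in a lookup table
// so the decoder can restore it later.
function encodeRefs(wiki, timeId) {
  let counter = 0;
  const refs = {};
  const tokenized = wiki.replace(/<ref[^>]*\/>|<ref[^>]*>[\s\S]*?<\/ref>/g, (match) => {
    const token = "___CITE_" + timeId + "_" + (++counter) + "___";
    refs[token] = match; // remember the original reference for decoding
    return token;
  });
  return { wiki: tokenized, refs: refs };
}

const result = encodeRefs('text before <ref name="MyLabel">Peter Miller (2020) ...</ref> and after', 7238234792);
console.log(result.wiki);
// text before ___CITE_7238234792_1___ and after
```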
The challenge of parsing can be seen in the mathematical expression: the colon `:` in the first column of a line defines an indentation in Wiki markdown, but within a mathematical expression it is just a division.
The number 7238234792 is a unique integer generated from the date and time in milliseconds, which makes the marker unique. Mathematical expressions, citations and references are extracted and replaced by an `encode()` call of `wtf_tokenizer`. The tokenizer is defined in `/src/index.js` and requires the submodules.
For further processing, e.g. with the `wtf_wikipedia` library, the tokens/markers are regarded as ordinary words in the text.
If you want to generate different output formats with `wtf_wikipedia` (e.g. HTML, LaTeX, Markdown, ...), the tokens/markers can be replaced with the appropriate syntax by calling a detokenizer/decoder while post-processing the output generated by other Wiki Transformation Framework tools like `wtf_wikipedia` or Wiki2Reveal. When the output is generated with `wtf_wikipedia.html()` or `wtf_wikipedia.markdown()`, call the decoder at this late stage, because only during output can the final numbering of citations be generated, e.g. if more than one article is downloaded and aggregated.
So it makes sense that the markers/tokens remain in the JSON sentences, sections and paragraphs until the final output is generated. `wtf_tokenizer` populates `doc.references` with the corresponding data in the same way as `wtf_wikipedia`, but in addition the label for backwards replacement in the output is appended to the record of each token, i.e. the corresponding label (e.g. `___CITE_7238234792_3___` or `___MATH_INLINE_7238234792_5___`) is stored for all references and mathematical expressions. This concept allows the markers for citations to be replaced later on, e.g. by `[6]` in the IEEE citation style. If you want to decode citation tokens in APA style, you can replace them e.g. by `(Kelly 2018)` with a call of `wtf_tokenizer.text()` or `wtf_tokenizer.html()`. The same is performed for mathematical inline and block expressions, which need the original location of the mathematical expression in the sentence (e.g. `___MATH_INLINE_7238234792_5___`).
In contrast, the `wtf_tokenizer.json()` method does not replace any content in the output of the JSON file. The replacement can be implemented yourself if you want it to be performed in a specific use-case of your application.
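If you use the JSON output and implement the replacement yourself, a minimal decoding step could look roughly like this (the data layout and function name are assumptions for illustration):

```javascript
// Replace MATH tokens in an already generated output string by
// MathJax-style inline math. The token-to-LaTeX mapping is assumed to
// have been collected during the encode() step.
const mathByToken = {
  "___MATH_INLINE_7238234792_5___": "\\sum_{i=1}^{\\infty} [x_i] : v_i"
};

function decodeMathTokens(out, mathByToken) {
  return out.replace(/___MATH_(?:INLINE|BLOCK)_\d+_\d+___/g, (token) =>
    token in mathByToken ? "\\(" + mathByToken[token] + "\\)" : token
  );
}

console.log(decodeMathTokens("text before ___MATH_INLINE_7238234792_5___ text after", mathByToken));
// text before \(\sum_{i=1}^{\infty} [x_i] : v_i\) text after
```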
Furthermore it must be mentioned that mathematical expressions have different rendering styles in Wikipedia and Wikiversity. The `block` and `inline` types distinguish between mathematical expressions within the text and mathematical expressions on a separate line. The token label for mathematical content incorporates this style information by adding it to the label name of the corresponding token of the mathematical expressions in the wiki source.
- Step 1: `wtf_fetch()`, based on `cross-fetch`, fetches the wiki source.
  - Input:
    - `language="en"` or `language="de"` to specify the language of the wiki source
    - `domain="wikipedia"`, `domain="wikiversity"` or `domain="wikispecies"` to select the wiki domain from which the `wtf_fetch()` call pulls the wiki source
  - Output:
    - wiki source text, e.g. from `wikipedia` or `wikiversity`
  - Remark: `wtf_fetch` extracts the `wtf.fetch()` of `wtf_wikipedia` into a separate module.
- Step 2: `wtf_tokenizer`
  - Input:
    - wiki source text, e.g. from `wikipedia` or `wikiversity`, fetched by `wtf_fetch`
  - Output:
    - wiki source text where e.g. mathematical expressions are replaced by tokens like `___MATH_INLINE_839832492834_12___`. `wtf_wikipedia` treats those tokens just as words in a sentence.
- Step 3: `wtf_wikipedia()`
  - Input:
    - wiki source text with tokenized citations and mathematical expressions
  - Output:
    - an object `doc` of type `Document`. Applying the output methods for `text`, `html`, `latex`, `json` yields output containing the tokens as words in sentences. The tokens appear in the output of `doc.html()` or `doc.latex()` of `wtf_wikipedia` and in the JSON as well.
- Step 4: `wtf_tokenizer`
  - Input:
    - a string in the export format, i.e. text with tokenized citations and mathematical expressions
  - Output:
    - the detokenized export format. The `out` string is injected into the detokenizer, e.g. `wtf_tokenizer.html(out, data, options)`. In this case the output string `out` is already in HTML format. In the output `out`, or in any other desired output format (e.g. `markdown`), the token replacement is performed: for HTML the mathematical expressions are exported to MathJax, and for LaTeX the detokenizer replaces the word/token `___MATH_INLINE_839832492834_12___` by `$\sum_{n=0}^{\infty} \frac{x^n}{n!}$`. The tokenizer can replace tokens of the types `___MATH_INLINE_793249879_5___` and `___MATH_BLOCK_793249879_6___` and pushes the LaTeX code of the mathematical expressions into the JSON data. Citation references with a name, e.g. `<ref name="my citation" />`, are replaced by `___CITE_LABEL_793249879_my_citation___`.
Use `wtf_fetch` to fetch Wiki markdown from Wikipedia or Wikiversity and then apply `wtf_tokenizer` to tokenize
- citations or
- mathematical expressions
with a unique identifier in the Wiki markdown source.
`wtf_wikipedia`, developed by Spencer Kelly and <a href="https://github.com/spencermountain/wtf_wikipedia/graphs/contributors" target="ContributorsGithub">contributors</a>, can be used to generate output, or the tokenizer can be used in Wiki2Reveal. `wtf_wikipedia` turns wikipedia's markup language into JSON, while Wiki2Reveal creates a RevealJS presentation from the Wiki markup source. In both use-cases `wtf_tokenizer` supports you in handling the citations and mathematical expressions before parsing the content of a MediaWiki source.
The following wtf_tokenizer demo is an HTML page that imports the library `wtf_fetch.js` and the library `wtf_tokenizer.js`, which is generated by this module.
- `wtf_fetch.js` fetches articles from Wikipedia, Wikiversity, ... which are used as input files for testing the tokenizer.
- The demo uses HTML form elements to determine the Wikipedia article and the domain from which the article should be downloaded.
- It provides a `Display Source` button to show the current source file in the MediaWiki of Wikiversity or Wikipedia.
- The download appends a source info at the very end of the downloaded Wiki source to create a reference in the text (like a citation - see function `append_source_info()`).

The demo Wikipedia2Wikiversity uses `wtf_tokenizer` to download the Wikipedia markdown source into a textarea of an HTML file. The Wiki markdown source is processed so that interwiki links from Wikiversity to Wikipedia work. Wikipedia2Wikiversity is also a demonstrator of an AppLSAC-0.
The following repositories are related to the Wiki Transformation Framework `wtf_wikipedia`:
- `wtf_fetch` is used to download the source of an article from Wikipedia or Wikiversity for further processing and for tokenizing mathematical expressions.
- `wtf_wikipedia` is the source repository developed by Spencer Kelly, who created that great library for Wikipedia article processing.
- Wiki2Reveal uses `wtf_fetch` and `wtf_wikipedia` to download Wikipedia sources and convert the wiki sources "on-the-fly" into a RevealJS presentation.
- Wikipedia2Wikiversity uses `wtf_fetch` to download Wikipedia sources and convert the links for application in Wikiversity.
If you consider the source of `wtf_wikipedia` you can identify 3 major steps:
- `wtf_fetch` retrieves the wiki markup source from the MediaWiki API, i.e. https://www.wikipedia.org, https://www.wikiversity.org, https://www.wikivoyage.org, ...
- `wtf_parse` parses the wiki source into a `Document` object (Abstract Syntax Tree)
- `wtf_output` generates/renders the output for a specific format from a given `Document` object
`wtf_wikipedia` integrates all these 3 tasks in one module. The module at hand decomposes one of those tasks into its own submodule. The submodules `wtf_fetch`, `wtf_parse` and `wtf_output` may be required independently in different project repositories by a `require` command. Furthermore this improves maintenance and the reusability of submodules, and it separates the tasks in `wtf_wikipedia` into the submodules `wtf_fetch`, `wtf_parse`, `wtf_output`. Once those modules are available, `wtf_wikipedia` can be used just for chaining the tasks, and other submodules can be added to the process chain in `wtf_wikipedia`. E.g. citation management could be a submodule called `wtf_citation` that inserts the citations in a document and fulfills that specific task. This module uses the modular structure of `wtf_wikipedia` in the folder `src/` to extract the current task into a separate repository. Later the current local `require` commands in `wtf_wikipedia` can be replaced by a remote `require` from `npm`.
Tokenizers parse specific content elements and replace them by unique identifiers. The identifiers/tokens must be handled as ordinary text elements/words that consist of characters, numbers, ... and are not handled by the parser itself. The unique identifiers will appear in the output format (export to HTML, Markdown, text, Open Document Format, ...), and as a final processing step the tokens are replaced by a desired token handler that renders mathematical expressions or citations according to the requirements of the output format.
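The concept described above can be condensed into a small generic sketch (all names are illustrative; this is not the wtf_tokenizer implementation):

```javascript
// Generic tokenizer: encode() replaces every match of a pattern by a unique
// word-like marker that parsers treat as an ordinary word; decode() restores
// or renders the stored originals after the main parser has run.
function makeTokenizer(pattern, kind) {
  const timeId = Date.now(); // makes the markers unique per run
  const store = {};
  let counter = 0;
  return {
    encode(text) {
      return text.replace(pattern, (match) => {
        const token = "___" + kind + "_" + timeId + "_" + (++counter) + "___";
        store[token] = match;
        return token;
      });
    },
    decode(text, render) {
      return text.replace(/___\w+_\d+___/g, (token) =>
        token in store ? render(store[token]) : token
      );
    }
  };
}

// Round trip: tokens survive any intermediate processing as plain words.
const t = makeTokenizer(/<math>[\s\S]*?<\/math>/g, "MATH");
const encoded = t.encode("before <math>x^2</math> after");
const decoded = t.decode(encoded, (original) => original);
// decoded === "before <math>x^2</math> after"
```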
This could be documented in the `README.md` as a developer recommendation; it helps developers to understand the way forward and how they could add new `wtf_modules` to the chaining process. In this sense `wtf_wikipedia` will become the chain management module of the `wtf_submodules`.
The following examples use `wtf_fetch` to download the Wiki source from the MediaWiki API. The library `wtf_tokenizer` parses the wiki source and replaces mathematical expressions or citations by tokens that will not be altered by `wtf_wikipedia`.
The decoding of the tokens into the output format depends on the output format: citations and mathematical expressions are handled differently according to the syntax of the output format.
The following installation incorporates fetching wiki sources from Wikipedia/Wikiversity.
npm install wtf_fetch
npm install wtf_tokenizer
var wtf_fetch = require('wtf_fetch');
var wtf_tokenizer = require('wtf_tokenizer');
wtf_fetch.getPage('Swarm Intelligence', 'en', 'wikipedia', function(err, doc) {
// doc contains the downloaded wiki source
// options are set so that math expressions are tokenized
// citations will not be encoded
var options = {
"tokenize": {
"math":true,
"citations":false,
"outformat":"html"
}
};
console.log("Source Wiki: " + doc.wiki);
wtf_tokenizer.encode(doc,options);
console.log("Encoded Tokens: " + doc.wiki);
wtf_tokenizer.decode(doc,options);
console.log("Decoded tokens: " + doc.wiki);
});
You can just include the library `wtf_tokenizer` with a script tag: add the build `wtf_tokenizer.js` or the compressed `wtf_tokenizer.min.js` from the repository directly and save the library e.g. into the `js/` subdirectory of your HTML file. In this example we also added the library `wtf_fetch.js` to fetch the Wiki source from the Wiki API.
<script src="js/wtf_fetch.min.js"></script>
<script src="js/wtf_tokenizer.min.js"></script>
<script>
//(follows redirect)
wtf_fetch.getPage('Water', 'en', 'wikiversity', function(err, doc) {
// doc contains the downloaded wiki source
// options are set so that math expressions are tokenized
// citations will not be encoded
var options = {
"tokenize": {
"math":true,
"citations":false,
"outformat":"html"
}
};
console.log("Source Wiki: " + doc.wiki);
wtf_tokenizer.encode(doc,options);
console.log("Encoded Tokens: " + doc.wiki);
// decode the mathematical expression
// into HTML format with MathJax
wtf_tokenizer.decode(doc,options);
console.log("Decoded tokens: " + doc.wiki);
});
</script>
- `wtf_fetch` downloads the Wiki markup source for an article from a MediaWiki of the Wikimedia Foundation.
- It allows different MediaWiki sources, e.g. Wikipedia, Wikiversity, Wikivoyage, ...
- It creates a JSON with the attributes of the fetched page, stored as an example in `data`. The JSON may look like this:
var data = {
  "wiki": "This is the content of the wiki article in wiki markdown ...",
  "title": "Swarm Intelligence",
  "lang": "en",
  "domain": "wikiversity",
  "url": "https://en.wikiversity.org/wiki/Swarm_Intelligence",
  "pageid": 2130123
}
If you want to access the Wiki markdown of the fetched article, access `data.wiki`. The language and domain are stored in the JSON for the article because these attributes are helpful to expand relative links in the wiki into absolute links, which still work after the document is made available on another domain.
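For example, a relative link could be expanded with these stored attributes roughly as follows (the helper name is hypothetical):

```javascript
// Expand a relative /wiki/ link into an absolute URL using the lang and
// domain attributes that wtf_fetch stores with the fetched article.
function expandWikiLink(href, lang, domain) {
  return href.indexOf("/wiki/") === 0
    ? "https://" + lang + "." + domain + ".org" + href
    : href; // already absolute or external: leave unchanged
}

console.log(expandWikiLink("/wiki/Swarm_Intelligence", "en", "wikiversity"));
// https://en.wikiversity.org/wiki/Swarm_Intelligence
```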
The fetched wiki markdown e.g. from Wikipedia is in general processed within the browser or in the NodeJS application.
The primary library for further processing is `wtf_wikipedia` by Spencer Kelly (see wtf_wikipedia).
wiky.js and wiki2html are simple libraries that convert sources from a MediaWiki to HTML. With these converters you can start to learn about parsing a wiki source document downloaded from a MediaWiki.
Wikimedia's Parsoid JavaScript parser is the official wikiscript parser. It reliably turns wikiscript into HTML, but not valid XML.
To use it for data-mining, you'll need to:
parsoid(wikiText) -> [headless/pretend-DOM] -> screen-scraping
which is fine, but to get structured data out of the Wiki source, go ahead with Spencer Kelly's library `wtf_wikipedia`.
- wtf_fetch.getPage(title, [lang], [domain], [options], [callback])
The callback or promise will get a JSON of the following type, containing the markdown content in the `wiki` property:
{
"wiki": "This is the fetched markdown source of the article ...",
"title": "My Wikipedia Title",
"lang": "en",
"domain": "wikipedia",
"url": "https://en.wikipedia.org/wiki/My_Wikipedia_Title",
"pageid": 12345
}
You can retrieve the Wiki markdown from different MediaWiki products of the Wikimedia Foundation. The domain name includes the Wiki product (e.g. Wikipedia or Wikiversity) and a language. The WikiID encodes the language, and the domain determines the API that is called for fetching the source Wiki. The following WikiIDs refer to the following domain names:
- Language: `en`, domain: `wikipedia`: https://en.wikipedia.org
- Language: `de`, domain: `wikipedia`: https://de.wikipedia.org
- Language: `fr`, domain: `wikipedia`: https://fr.wikipedia.org
- Language: `en`, domain: `wikibooks`: https://en.wikibooks.org
- Language: `en`, domain: `wikinews`: https://en.wikinews.org
- Language: `en`, domain: `wikiquote`: https://en.wikiquote.org
- Language: `en`, domain: `wikisource`: https://en.wikisource.org
- Language: `en`, domain: `wikiversity`: https://en.wikiversity.org
- Language: `en`, domain: `wikivoyage`: https://en.wikivoyage.org
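All of these WikiIDs follow the same pattern, which could be expressed as a small helper (illustrative only, not part of the wtf_fetch API):

```javascript
// Derive the base URL of the MediaWiki instance from language and domain,
// mirroring the list above.
const knownDomains = ["wikipedia", "wikibooks", "wikinews", "wikiquote",
                      "wikisource", "wikiversity", "wikivoyage"];

function wikiBaseUrl(lang, domain) {
  if (knownDomains.indexOf(domain) === -1) {
    throw new Error("unsupported domain: " + domain);
  }
  return "https://" + lang + "." + domain + ".org";
}

console.log(wikiBaseUrl("de", "wikipedia"));
// https://de.wikipedia.org
```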
`getPage` retrieves the raw contents of a MediaWiki article from the wikipedia action API.
This method supports the errback callback form, or returns a Promise if the callback is missing.
To call non-English wikipedia APIs, add its language name as the second parameter:
wtf_fetch.getPage('Toronto', 'de', 'wikipedia', function(err, doc) {
var url = "https://" + doc.lang + "." + doc.domain + ".org";
console.log("Wiki JSON fetched from '" +
url + "/wiki/" + doc.title + "'\n" + JSON.stringify(doc,null,4));
//doc.wiki = "Toronto ist mit 2,6 Millionen Einwohnern..."
});
You may also pass the wikipedia page id as a parameter instead of the page title:
wtf_fetch.getPage(64646, 'de', 'wikipedia', function(err, doc) {
console.log("Wiki JSON\n"+JSON.stringify(doc,null,4));
});
The fetch method follows redirects.
If you're scripting this from the shell, or from another language, install with `-g`, and then run:
$ node ./bin/wtf_fetch.js George Clooney de wikipedia
# George Timothy Clooney (born May 6, 1961) is an American actor ...
$ node ./bin/wtf_fetch.js 'Toronto Blue Jays' en wikipedia
The Command Line Interface has not been implemented so far.
The wikipedia API is pretty welcoming, though it recommends three things if you're going to hit it heavily:
- 1️⃣ pass an `Api-User-Agent` so they can easily throttle bad scripts
- 2️⃣ bundle multiple pages into one request as an array
- 3️⃣ run it serially, or at least, slowly.
wtf_fetch.getPage(['Royal Cinema', 'Aldous Huxley'], 'en', 'wikipedia',{
'Api-User-Agent': 'youremail@example.com'
}).then((docList) => {
let allDocs = docList.map(doc => doc.wiki);
console.log(allDocs);
});
`wtf_fetch` is just the first step in creating other formats directly from the Wikipedia source by "on-the-fly" conversion after downloading the Wiki source, e.g. from Wikipedia.
Creating an Office document is just one example of an output file. ODT output is currently (2018/11/04) not part of `wtf_wikipedia`, but you may want to play around with `wtf_fetch` or `wtf_wikipedia` to parse the Wiki source and convert the file in your browser into an Office document. The following sources will support you a bit in creating Office documents.
If you try PanDoc document conversion, the key to generating Office documents is the export format ODF.
LibreOffice can load and save the OpenDocument Format, and LibreOffice can also load and save Microsoft Office formats. So exporting to the Open Document Format is a good option to start with in `wtf_wikipedia`. The following descriptions are a summary of aspects that support developers in bringing the Office export format e.g. to a web-based environment like the ODF-Editor.
The OpenDocument Format provides a comprehensive way forward for `wtf_wikipedia` to exchange documents from a MediaWiki source text reliably and effortlessly to different formats, products and devices. Regarding the different Wikis of the Wikimedia Foundation as a content source, the educational content e.g. in Wikiversity is no longer restricted to a single export format (like PDF); this opens up access to other specific editors, products or vendors for all your needs. With `wtf_wikipedia` and an ODF export format the users have the opportunity to choose the 'best fit' application for the Wiki content. This section focuses on Office products.
Some important information to support Office Documents in the future
- See WebODF for how to edit ODF documents on the web or display slides. A current limitation of WebODF (state 2018/04/07) is that it does not render mathematical expressions, but editing in the WebODF editor does not remove the mathematical expressions from the ODF file. The rendering may be solved in the WebODF editor by using MathJax or KaTeX in the future.
- The `ODT` format is the default export format of LibreOffice/OpenOffice. Following the Open Community Approach, open-source office products are used to avoid commercial dependencies for using the generated Office documents.
  - The `ODT` format of LibreOffice is basically a ZIP file.
  - Unzipping shows the folder structure within the ZIP format. Create a subdirectory e.g. with the name `zipout/` and call `unzip mytext.odt -d zipout` (Linux, MacOSX).
  - The main text content is stored in `content.xml`, the main file defining the content of the Office document.
  - Remark: Zipping the folder content again can create a parsing error when you load the zipped office document in LibreOffice again. This may be caused by an inappropriate order in the generated ZIP file: the file `mimetype` must be the first file in the ZIP archive.
  - The best way to generate ODT files is to create an ODT template `mytemplate.odt` with LibreOffice, containing all the styles you want to apply to the document, and to place a marker at the specific content areas where you want to insert the content cross-compiled with `wtf_wikipedia` into `content.xml`. The file `content.xml` contains the text and can be updated in the ODT ZIP file. If you want Microsoft Office output, just save the ODT file in LibreOffice as a Word file. Marker replacement is also possible in ODF files (see also the WebODF demos).
  - Images must be downloaded from the MediaWiki (e.g. with an NPM equivalent of `wget` for fetching the image, audio or video) and added to the folder structure in the ZIP. Create an ODT file with LibreOffice containing an image and unzip the ODT file to learn about the way ODT stores the image in the ODT ZIP file.
- JSZip: JSZip can be used to update and add certain files in a given ODT template (e.g. `mytemplate.odt`). Handling ZIP files with JSZip allows a cross-compilation WebApp with `wtf_wikipedia` to run in your browser and to generate an editor environment for the cross-compiled Wiki source text (like the WebODF editor). Updating the ODT template as a ZIP file can be handled with JSZip by replacing the `content.xml` in the ZIP archive. `content.xml` can be generated with `wtf_wikipedia` once the `odf` export format is added to `/src/output/odf` (ToDo: please create a pull request if you have done that).
- Office documents
doc
- anddocx
-format, - Text files (
.txt
), - HTML files (
.html
), - Rich Text files (
.rtf
), - PDF files (
.pdf
) and even - PNG files (
.png
).
- Office documents
- Planning of the ODT support can be done in this README, and a collaborative implementation can be organized with pull requests (PR).
- Helpful libraries: node-odt, odt
- `wtf_wikipedia` supports HTML export.
- The library `html-docx-js` supports cross-compilation of HTML into the docx format.
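Regarding the remark above that `mimetype` must be the first file in the ZIP archive: when repacking an unpacked ODT by hand, the Info-ZIP command line tool can enforce this ordering (file and directory names are examples):

```shell
# Repack an unpacked ODT so that 'mimetype' is the FIRST entry, stored
# uncompressed (-0) and without extra file attributes (-X); LibreOffice
# rejects archives where mimetype is not the first, uncompressed entry.
cd zipout
zip -X0 ../mytext-repacked.odt mimetype
zip -rX9 ../mytext-repacked.odt . -x mimetype
```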
`wtf_fetch` is just a minor micro library to fetch the wiki markdown of an article in Wikipedia, Wikiversity, ... Please consider contributing to `wtf_wikipedia`, developed by Spencer Kelly - see wtf_wikipedia for further details and join in!
This library adds the `tokenizer` to the Wiki Transformation Framework (WTF) around `wtf_wikipedia`. The code of the library complements `wtf_wikipedia`, which was developed by Spencer Kelly, with specific features. `wtf_fetch` is in general used to retrieve a specific article from Wikipedia or Wikiversity. `wtf_fetch` is based on cross-fetch, which allows fetching the markdown of articles from Wikipedia or Wikiversity even from a local HTML file. This is great because you can fetch an article and process it in a browser without the need to perform processing on a remote server. Special thanks to Spencer Kelly for creating and maintaining wtf_wikipedia - a great contribution to the OpenSource community, especially for using Wiki content as Open Educational Resources.
See also:
MIT