facebook / duckling

Language, engine, and tooling for expressing, testing, and evaluating composable language rules on input strings.

Support for Indian languages and documentation help

shubhamchaurasia1 opened this issue · comments

Can somebody please guide me through the logic behind Duckling? I have been searching for documentation. I want to add support for Indian languages and rewrite the logic in Python. For me, the most important use case is extracting time expressions from arbitrary text.

Hi @shubhamchaurasia1 - I can try to collect some resources for how to get started working on Duckling language models in the next 2-3 weeks.

Unfortunately we don't have much documentation right now. If you want to get started right away, your best bet is to look at some existing Rules.hs files to get a sense of how the rules are written. The English language support is probably the most mature, so that might be a good place to start.

Thank you @stroxler - it would be a great help if you could point me to some resources.
I will certainly start with the Rules.hs files to understand how the rules are written.

Hi @stroxler - I read the Rules.hs and Corpus.hs files to understand how the rules are written for the different dimensions. However, I still cannot figure out how the classifier is used when extracting entities.

Can I get a basic idea of the flow of the project? For example, how does training happen for a dimension, and how does Duckling employ the classifiers?

The training happens out-of-band - there's an executable that will use the training corpus to fit a very simple statistical model, and re-generate source files that include hardcoded weights.

I've never tried rebuilding classifiers from the open-source repo, but you ought to be able to do it using the command
stack build :duckling-regen-exe. If you want to see what's going on under the hood you can trace that down, the command is defined in duckling.cabal; at a high level it will run RegenMain.hs, which fits a Naive Bayes model that we use for ambiguous parses.
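
To make "a very simple statistical model" a bit more concrete, here is a minimal sketch of a two-class Naive Bayes classifier over bags of features. It is not Duckling's actual code: the names (fitClass, train, logScore, classify), the use of Data.Map, and the add-one smoothing are all assumptions made for illustration, and the real training code may differ in every detail.

```haskell
import qualified Data.Map.Strict as Map

-- A bag of features: feature name -> number of occurrences.
-- (Hypothetical type for this sketch.)
type BagOfFeatures = Map.Map String Int

-- Per-class statistics, stored as log-probabilities.
data ClassData = ClassData
  { logPrior       :: Double                 -- log P(class)
  , logUnseen      :: Double                 -- log-likelihood of a feature never seen in this class
  , logLikelihoods :: Map.Map String Double  -- log P(feature | class)
  } deriving (Show)

-- Fit one class from its aggregated feature counts, using add-one smoothing.
-- Arguments: feature counts for the class, total number of examples,
-- number of examples in this class, vocabulary size.
fitClass :: BagOfFeatures -> Int -> Int -> Int -> ClassData
fitClass feats total classTotal vocSize = ClassData
  { logPrior       = log (fromIntegral classTotal / fromIntegral total)
  , logUnseen      = log (1 / denom)
  , logLikelihoods = Map.map (\c -> log ((fromIntegral c + 1) / denom)) feats
  }
  where
    denom = fromIntegral (vocSize + sum (Map.elems feats)) :: Double

-- Train from labelled examples: True = "ok" parse, False = "not ok" parse.
-- Returns (ok class, not-ok class).
train :: [(BagOfFeatures, Bool)] -> (ClassData, ClassData)
train examples = (fitClass okFeats n nOk voc, fitClass koFeats n nKo voc)
  where
    merge bags = Map.unionsWith (+) bags
    okFeats = merge [f | (f, True)  <- examples]
    koFeats = merge [f | (f, False) <- examples]
    n       = length examples
    nOk     = length [() | (_, True) <- examples]
    nKo     = n - nOk
    voc     = Map.size (Map.union okFeats koFeats)

-- Log-score of a bag of features under one class.
logScore :: ClassData -> BagOfFeatures -> Double
logScore cd feats = logPrior cd + sum
  [ fromIntegral k * Map.findWithDefault (logUnseen cd) f (logLikelihoods cd)
  | (f, k) <- Map.toList feats ]

-- Pick the class with the higher log-score (ties go to "ok").
classify :: (ClassData, ClassData) -> BagOfFeatures -> Bool
classify (ok, ko) feats = logScore ok feats >= logScore ko feats
```

In a sketch like this, "regenerating source files with hardcoded weights" would amount to printing the fitted ClassData values as Haskell literals, so that no training is needed at runtime.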

The README recommends running this to do an end-to-end test of changes that could alter classifier outputs:

stack build :duckling-regen-exe && stack exec duckling-regen-exe && stack test

I did confirm that running

stack build :duckling-regen-exe && stack exec duckling-regen-exe && stack test

on my laptop seems to work alright, I think this is all you should need.

I believe there is a way to regenerate for just one dimension + language, which would be much faster if you need to make a series of updates (usually this isn't necessary). It probably requires manually running a command from stack repl. But it's been a while, I would have to dig around to find the right command.

Hi,
I am trying to run the debugger, but I am unable to configure the debugger's JSON file. Can someone please help me or share how to set it up?

In the JSON file below I put stack ghci exe/RegenMain.hs as the ghciCmd, since that is the file I want to run. But whenever I start the debugger with this configuration, it starts and stops instantly without any results. Can someone please tell me what I should change to get the debugger running?

Current launch.json settings:


{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        
        {
            "type": "ghc",
            "request": "launch",
            "name": "haskell(stack)",
            "internalConsoleOptions": "openOnSessionStart",
            "workspace": "${workspaceFolder}",
            "startup": "${workspaceFolder}/exe/RegenMain.hs",
            "startupFunc": "",
            "startupArgs": "",
            "stopOnEntry": false,
            "mainArgs": "",
            "ghciPrompt": "H>>= ",
            "ghciInitialPrompt": "> ",
            "ghciCmd": "stack ghci exe/RegenMain.hs",
            "ghciEnv": {},
            "logFile": "${workspaceFolder}/.vscode/phoityne.log",
            "logLevel": "WARNING",
            "forceInspect": false
        }
    ]
}

Unfortunately I'm not familiar with VS Code debugging of Haskell code; you'd probably have to find a forum where there are Haskell stack experts.

For what it's worth, when developing new rules for Duckling I have mostly just relied on the interactive capabilities in ghci - Duckling has built-in support for "debug output" which will help you visualize the parse tree and the rules that ran when interpreting any given input.

For example:

$ stack repl --no-load
> :l Duckling.Debug
> debug (makeLocale EN $ Just US) "in two minutes" [Seal Time]
in|within|after <duration> (in two minutes)
-- regex (in)
-- <integer> <unit-of-duration> (two minutes)
-- -- integer (0..19) (two)
-- -- -- regex (two)
-- -- minute (grain) (minutes)
-- -- -- regex (minutes)
[Entity {dim = "time", body = "in two minutes", value = RVal Time (TimeValue (SimpleValue (InstantValue {vValue = 2013-02-12 04:32:00 -0200, vGrain = Second})) [SimpleValue (InstantValue {vValue = 2013-02-12 04:32:00 -0200, vGrain = Second})] Nothing), start = 0, end = 14}]

As a rule I'd say this kind of debugging is likely to get you further than a debugger, assuming that you're trying to develop the rules as opposed to working on the engine internals.

This kind of debugging is not enough for me, as I am trying to work on the engine internals and analyze the classifier.

  • I understood some parts of the classifiers, but I do not understand what the bag of features and the classes are.
  • What exactly is the input of the makeClass function in exe/Duckling/Ranking/Train.hs (the bag of features being one of its arguments)?
  • What exactly are the context, options, and examples in the corpus?
  • What is the flow of inference? I have seen the parseHandler function in ExampleMain.hs. It finds the language and locale, creates the options and context, cleans the dimensions, and parses the text into parsedResult (via the parse function). But how do the classifiers work inside the parse function?

These details will help me and other contributors get a detailed understanding of the whole process.

The full details will definitely require some digging.
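
As a rough mental model of where a classifier could fit into inference, here is a small sketch. It assumes the rules have already produced several candidate parses for the same text, each with a character span and a score (for example a Naive Bayes log-score), and that a final step keeps only candidates that no other candidate beats. The Candidate type, the dominates rule (a longer covering span wins, then the higher score) and winners are hypothetical names and choices for illustration, not a description of Duckling's real engine.

```haskell
-- A candidate parse: matched text, character span, and a score.
-- (Hypothetical type for this sketch.)
data Candidate = Candidate
  { cBody  :: String
  , cStart :: Int
  , cEnd   :: Int
  , cScore :: Double
  } deriving (Eq, Show)

-- a dominates b when a's span covers b's span and a is either strictly
-- longer or has a strictly higher score.
dominates :: Candidate -> Candidate -> Bool
dominates a b =
  cStart a <= cStart b && cEnd a >= cEnd b
    && (spanLen a > spanLen b || cScore a > cScore b)
  where
    spanLen c = cEnd c - cStart c

-- Keep only the candidates that no other candidate dominates.
winners :: [Candidate] -> [Candidate]
winners cs = [c | c <- cs, not (any (`dominates` c) cs)]
```

With candidates for "two" (3 to 6), "two minutes" (3 to 14), and two competing readings of "in two minutes" (0 to 14), only the higher-scoring reading of the full span survives. The score only matters when two candidates cover the same text, which fits the description of the model being used for ambiguous parses.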

In case it may help unblock you, I posted some notes I took on Duckling internals last year at
https://gist.github.com/stroxler/1187695c98e94b0f3ea7dbc1efadf0a8

I'm hoping to get these into the Duckling source code at some point, but I'm not sure if or when that will happen.

Hopefully this helps