common-voice/cv-sentence-extractor
Scraping Wikipedia for fair use sentences
Stargazers: 52
Watchers: 10
Issues: 76
Forks: 52
common-voice/cv-sentence-extractor Issues
Replacements not considering whitespace
Closed
10 months ago
Comments count
6
Feature Request - Removal of parentheses.
Closed
10 months ago
Comments count
1
Bad behavior in wikiextractor
Updated
10 months ago
Comments count
3
Is there a limit to the audio duration?
Updated
a year ago
Comments count
17
What is the limit for "replacements"?
Updated
a year ago
Comments count
1
max_characters in rules
Closed
a year ago
Comments count
3
Question: How is the result quality?
Closed
2 years ago
Comments count
1
WikiExtractor doesnt extract text for bn, hi
Closed
2 years ago
Comments count
1
Unable to filter out single letter and dot using abbreviation_patterns
Closed
2 years ago
Comments count
3
Sample Extraction Pipeline: only use shuffled file
Closed
2 years ago
Use best practice in English rules file
Closed
3 years ago
Comments count
14
Actions: Rebase seems to break sample extraction
Closed
3 years ago
Comments count
1
Ignore roman numbers
Closed
3 years ago
Comments count
10
Some wikicode ignored in sentences derived from Wikipedia
Closed
3 years ago
Comments count
2
Implement filtering by title
Closed
3 years ago
Scrape sentences from wikidata
Updated
3 years ago
Comments count
4
Technical analysis: re-running Wikipedia exports
Closed
3 years ago
Comments count
16
Punkt dependency and related issues
Closed
3 years ago
Comments count
6
Implement Unit Tests for app.rs
Closed
3 years ago
Fix Sample Extract
Closed
3 years ago
Comments count
1
Implement Unit Tests for loader.rs
Closed
3 years ago
Add WikiSource as target
Closed
3 years ago
Comments count
4
Simplify and generalize CI scripts
Closed
3 years ago
Comments count
5
Add Thai language
Closed
3 years ago
Comments count
1
Clean incomplete sentences
Closed
3 years ago
Comments count
3
extract-file failed: 'attempt to subtract with overflow'
Closed
3 years ago
Comments count
7
Documenting the order of execution for language rules
Closed
3 years ago
Comments count
5
Punkt Issue for Indian languages
Closed
4 years ago
Comments count
3
Implement Unit Tests for rules.rs
Closed
4 years ago
Make load_file_names glob configurable
Closed
4 years ago
Improve Hungarian Wiki Export
Closed
4 years ago
Comments count
3
Fix Lithuanian Export
Closed
4 years ago
Comments count
5
Fix Extraction for Belarusian (and possibly others)
Closed
4 years ago
Comments count
1
Fix readme guide at step 2 from wikiextractor.py to something else
Closed
4 years ago
Comments count
2
Add flag to change punctuation end from . to something else
Closed
4 years ago
Comments count
1
Sentence-level lookbehind rule
Closed
4 years ago
Comments count
3
Create Github Actions to run full export on PR / create blocklist
Closed
4 years ago
Comments count
4
Implement Unit Tests for config.rs
Closed
4 years ago
Add de-duplication GitHub Actions step
Closed
4 years ago
Comments count
1
Rename repository
Closed
4 years ago
Comments count
8
Matching opposite characters
Closed
4 years ago
Problem with one/two sentences articles
Closed
4 years ago
Comments count
4
Missing numbers in sentences
Closed
4 years ago
Comments count
26
Excluding footnotes and literature
Closed
4 years ago
Comments count
1
Using the language.toml rules for other big sentence collections
Closed
4 years ago
Comments count
7
Combining Acute Accent character refuses to be disabled
Closed
4 years ago
Comments count
2
Unify similar symbols
Closed
4 years ago
Comments count
2
A lot of names
Closed
4 years ago
Comments count
4
allowed_symbols as a whitelist instead of disallowed_symbols
Closed
4 years ago
Comments count
3
Legal issues
Closed
5 years ago
Comments count
2