Sourcify
Effortlessly find and manage open-source dependencies for your projects.

A Python version of Sudachi, a Japanese morphological analyzer.

SudachiPy is a Python port of Sudachi, a Japanese morphological analyzer. It tokenizes Japanese text in three split modes (A, B, C), so the same text can be segmented at different granularities. For each token it provides part-of-speech tags, the normalized form, the reading form, and dictionary information. SudachiPy can be used as a command-line tool or as a Python package, and it supports user dictionaries for customizing tokenization. Analysis is dictionary-based: a separately installed Sudachi dictionary (small, core, or full) supplies the morphological information.
Offers three split modes (A, B, C) for different levels of text segmentation. Mode A provides the finest granularity, while Mode C provides the coarsest.
Allows users to define custom dictionaries to handle specific vocabulary or domain-specific terms.
Assigns part-of-speech tags to each token, providing grammatical information about the text.
Returns a normalized form for each token, mapping spelling and character variants of a word to a single standard form.
Provides the reading form (pronunciation) of each token, useful for tasks like speech synthesis and language learning.
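The three split modes listed above can be compared side by side. The following is a minimal sketch, assuming `sudachipy` and `sudachidict_core` are installed; the helper names `surfaces_by_mode` and `demo_split_modes` are illustrative and not part of SudachiPy.

```python
def surfaces_by_mode(tokenizer_obj, text, modes):
    """Map each mode name to the list of token surfaces produced in that mode.

    `tokenizer_obj` is any object with a `tokenize(text, mode)` method whose
    results expose `surface()`, as SudachiPy morphemes do.
    """
    return {
        name: [m.surface() for m in tokenizer_obj.tokenize(text, mode)]
        for name, mode in modes.items()
    }


def demo_split_modes(text="国家公務員"):
    """Hypothetical demo: print the segmentation of `text` in modes A, B, C."""
    # Imported here so the helper above can be used without SudachiPy installed.
    from sudachipy import dictionary, tokenizer

    tokenizer_obj = dictionary.Dictionary().create()
    modes = {
        "A": tokenizer.Tokenizer.SplitMode.A,  # finest granularity
        "B": tokenizer.Tokenizer.SplitMode.B,  # middle granularity
        "C": tokenizer.Tokenizer.SplitMode.C,  # coarsest granularity
    }
    for name, surfaces in surfaces_by_mode(tokenizer_obj, text, modes).items():
        print(name, surfaces)
```

Calling `demo_split_modes()` prints one line per mode. The exact segmentation depends on the installed dictionary; the SudachiPy README reports 国家 / 公務 / 員 for mode A, 国家 / 公務員 for mode B, and 国家公務員 as a single token for mode C.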
Install SudachiPy using pip: `pip install sudachipy`
Install a Sudachi dictionary (core, small, or full): `pip install sudachidict_core`
Import the necessary modules in Python: `from sudachipy import tokenizer, dictionary`
Create a tokenizer object: `tokenizer_obj = dictionary.Dictionary().create()`
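Choose a split mode, for example `mode = tokenizer.Tokenizer.SplitMode.C`, and tokenize text with the tokenizer object: `tokenizer_obj.tokenize("国家公務員", mode)`
Access morpheme information on each returned token, such as the surface form (`surface()`), dictionary form (`dictionary_form()`), reading form (`reading_form()`), normalized form (`normalized_form()`), and part-of-speech tags (`part_of_speech()`).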
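The steps above can be put together as follows. This is a minimal sketch, assuming `sudachipy` and a dictionary package such as `sudachidict_core` are installed; `morpheme_info` and `analyze` are hypothetical helper names introduced here for illustration.

```python
def morpheme_info(m):
    """Collect the commonly used fields of one morpheme into a plain dict."""
    return {
        "surface": m.surface(),                  # text as it appears in the input
        "dictionary_form": m.dictionary_form(),  # lemma
        "reading_form": m.reading_form(),        # pronunciation
        "normalized_form": m.normalized_form(),  # standardized spelling
        "part_of_speech": list(m.part_of_speech()),  # part-of-speech tags
    }


def analyze(text):
    """Hypothetical helper: tokenize `text` in split mode C and describe each token."""
    # Imported here so morpheme_info stays usable without SudachiPy installed.
    from sudachipy import dictionary, tokenizer

    tokenizer_obj = dictionary.Dictionary().create()
    mode = tokenizer.Tokenizer.SplitMode.C
    return [morpheme_info(m) for m in tokenizer_obj.tokenize(text, mode)]
```

For example, `analyze("国家公務員")` returns one dict per token, which can then be printed or passed to downstream processing.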
Verified feedback from other users.
"SudachiPy is appreciated for its accurate tokenization and flexibility in handling Japanese text, but its lack of recent updates raises concerns about long-term maintainability."