الملفات
moad.rag/clean_arabic_spacy git.ipynb
2026-03-04 12:44:55 +00:00

809 أسطر
36 KiB
Plaintext
خام اللوم التاريخ

هذا الملف يحتوي على أحرف Unicode غامضة

هذا الملف يحتوي على أحرف Unicode قد تُخلط مع أحرف أخرى. إذا كنت تعتقد أن هذا مقصود، يمكنك تجاهل هذا التحذير بأمان. استخدم زر الهروب للكشف عنها.

{
"cells": [
{
"cell_type": "markdown",
"id": "a1b2c3d4",
"metadata": {
"id": "a1b2c3d4"
},
"source": [
"# Arabic Text Cleaner — spaCy Edition\n",
"\n",
"Cleans raw HTML scraped from **moad.gov.sy** into clean Arabic text using a custom spaCy pipeline.\n",
"\n",
"### Pipeline steps\n",
"1. **HTML stripping** — BeautifulSoup removes scripts, styles, and all tags\n",
"2. **`arabic_normaliser`** — removes diacritics, normalises Alef variants, strips tatweel\n",
"3. **`arabic_cleaner`** — removes URLs, punctuation-only tokens, lone symbols, whitespace tokens\n",
"4. **Morphologizer** (built into `ar_core_news_sm`) — lemmas & morphological features\n",
"5. **Post-processing** — collapses symbol noise and extra blank lines\n",
"\n",
"### Requirements\n",
"```bash\n",
"pip install spacy beautifulsoup4 lxml\n",
"python -m spacy download ar_core_news_sm\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "b2c3d4e5",
"metadata": {
"id": "b2c3d4e5"
},
"source": [
"## 1 · Install dependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c3d4e5f6",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "c3d4e5f6",
"outputId": "d90f84f7-0f44-44b0-8103-ddfcfc249240"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: spacy in /usr/local/lib/python3.12/dist-packages (3.8.11)\n",
"Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.12/dist-packages (4.13.5)\n",
"Requirement already satisfied: lxml in /usr/local/lib/python3.12/dist-packages (6.0.2)\n",
"Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /usr/local/lib/python3.12/dist-packages (from spacy) (3.0.12)\n",
"Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /usr/local/lib/python3.12/dist-packages (from spacy) (1.0.5)\n",
"Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.12/dist-packages (from spacy) (1.0.15)\n",
"Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.12/dist-packages (from spacy) (2.0.13)\n",
"Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.12/dist-packages (from spacy) (3.0.12)\n",
"Requirement already satisfied: thinc<8.4.0,>=8.3.4 in /usr/local/lib/python3.12/dist-packages (from spacy) (8.3.10)\n",
"Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /usr/local/lib/python3.12/dist-packages (from spacy) (1.1.3)\n",
"Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /usr/local/lib/python3.12/dist-packages (from spacy) (2.5.2)\n",
"Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /usr/local/lib/python3.12/dist-packages (from spacy) (2.0.10)\n",
"Requirement already satisfied: weasel<0.5.0,>=0.4.2 in /usr/local/lib/python3.12/dist-packages (from spacy) (0.4.3)\n",
"Requirement already satisfied: typer-slim<1.0.0,>=0.3.0 in /usr/local/lib/python3.12/dist-packages (from spacy) (0.24.0)\n",
"Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.12/dist-packages (from spacy) (4.67.3)\n",
"Requirement already satisfied: numpy>=1.19.0 in /usr/local/lib/python3.12/dist-packages (from spacy) (2.0.2)\n",
"Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/dist-packages (from spacy) (2.32.4)\n",
"Requirement already satisfied: pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4 in /usr/local/lib/python3.12/dist-packages (from spacy) (2.12.3)\n",
"Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from spacy) (3.1.6)\n",
"Requirement already satisfied: setuptools in /usr/local/lib/python3.12/dist-packages (from spacy) (75.2.0)\n",
"Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.12/dist-packages (from spacy) (26.0)\n",
"Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.12/dist-packages (from beautifulsoup4) (2.8.3)\n",
"Requirement already satisfied: typing-extensions>=4.0.0 in /usr/local/lib/python3.12/dist-packages (from beautifulsoup4) (4.15.0)\n",
"Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.12/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy) (0.7.0)\n",
"Requirement already satisfied: pydantic-core==2.41.4 in /usr/local/lib/python3.12/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy) (2.41.4)\n",
"Requirement already satisfied: typing-inspection>=0.4.2 in /usr/local/lib/python3.12/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy) (0.4.2)\n",
"Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.12/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (3.4.4)\n",
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.12/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (3.11)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2.5.0)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.12/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2026.2.25)\n",
"Requirement already satisfied: blis<1.4.0,>=1.3.0 in /usr/local/lib/python3.12/dist-packages (from thinc<8.4.0,>=8.3.4->spacy) (1.3.3)\n",
"Requirement already satisfied: confection<1.0.0,>=0.0.1 in /usr/local/lib/python3.12/dist-packages (from thinc<8.4.0,>=8.3.4->spacy) (0.1.5)\n",
"Requirement already satisfied: typer>=0.24.0 in /usr/local/lib/python3.12/dist-packages (from typer-slim<1.0.0,>=0.3.0->spacy) (0.24.1)\n",
"Requirement already satisfied: cloudpathlib<1.0.0,>=0.7.0 in /usr/local/lib/python3.12/dist-packages (from weasel<0.5.0,>=0.4.2->spacy) (0.23.0)\n",
"Requirement already satisfied: smart-open<8.0.0,>=5.2.1 in /usr/local/lib/python3.12/dist-packages (from weasel<0.5.0,>=0.4.2->spacy) (7.5.1)\n",
"Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->spacy) (3.0.3)\n",
"Requirement already satisfied: wrapt in /usr/local/lib/python3.12/dist-packages (from smart-open<8.0.0,>=5.2.1->weasel<0.5.0,>=0.4.2->spacy) (2.1.1)\n",
"Requirement already satisfied: click>=8.2.1 in /usr/local/lib/python3.12/dist-packages (from typer>=0.24.0->typer-slim<1.0.0,>=0.3.0->spacy) (8.3.1)\n",
"Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.12/dist-packages (from typer>=0.24.0->typer-slim<1.0.0,>=0.3.0->spacy) (1.5.4)\n",
"Requirement already satisfied: rich>=12.3.0 in /usr/local/lib/python3.12/dist-packages (from typer>=0.24.0->typer-slim<1.0.0,>=0.3.0->spacy) (13.9.4)\n",
"Requirement already satisfied: annotated-doc>=0.0.2 in /usr/local/lib/python3.12/dist-packages (from typer>=0.24.0->typer-slim<1.0.0,>=0.3.0->spacy) (0.0.4)\n",
"Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.12/dist-packages (from rich>=12.3.0->typer>=0.24.0->typer-slim<1.0.0,>=0.3.0->spacy) (4.0.0)\n",
"Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/dist-packages (from rich>=12.3.0->typer>=0.24.0->typer-slim<1.0.0,>=0.3.0->spacy) (2.19.2)\n",
"Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.12/dist-packages (from markdown-it-py>=2.2.0->rich>=12.3.0->typer>=0.24.0->typer-slim<1.0.0,>=0.3.0->spacy) (0.1.2)\n",
"\n",
"\u001b[38;5;1m✘ No compatible package found for 'ar_core_news_sm' (spaCy v3.8.11)\u001b[0m\n",
"\n"
]
}
],
"source": [
"\n",
"!pip install spacy beautifulsoup4 lxml\n",
"!python -m spacy download ar_core_news_sm"
]
},
{
"cell_type": "markdown",
"id": "d4e5f6a7",
"metadata": {
"id": "d4e5f6a7"
},
"source": [
"## 2 · Imports"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "e5f6a7b8",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "e5f6a7b8",
"outputId": "8c862123-5a73-4cc5-ba21-f6321694d6b5"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"spaCy version: 3.8.11\n"
]
}
],
"source": [
"import json\n",
"import re\n",
"from pathlib import Path\n",
"\n",
"from bs4 import BeautifulSoup\n",
"import spacy\n",
"from spacy.language import Language\n",
"from spacy.tokens import Doc\n",
"\n",
"print(f\"spaCy version: {spacy.__version__}\")"
]
},
{
"cell_type": "markdown",
"id": "f6a7b8c9",
"metadata": {
"id": "f6a7b8c9"
},
"source": [
"## 3 · Config — set your file paths here"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7b8c9d0",
"metadata": {
"id": "a7b8c9d0"
},
"outputs": [],
"source": [
"INPUT_PATH = \"/content/moad_full_scrape.json\" \n",
"OUTPUT_PATH = \"/content/moad_cleaned_spacy.json\""
]
},
{
"cell_type": "markdown",
"id": "b8c9d0e1",
"metadata": {
"id": "b8c9d0e1"
},
"source": [
"## 4 · Arabic Unicode patterns"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c9d0e1f2",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "c9d0e1f2",
"outputId": "a85c6ef2-de7c-408b-e3d8-cb61c63f32d8"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Patterns compiled ✓\n"
]
}
],
"source": [
"\n",
"DIACRITICS = re.compile(r'[\\u0610-\\u061A\\u064B-\\u065F\\u0670\\u06D6-\\u06DC\\u06DF-\\u06E4\\u06E7\\u06E8\\u06EA-\\u06ED]')\n",
"\n",
"\n",
"ALEF_VARIANTS = re.compile(r'[أإآٱ]')\n",
"\n",
"\n",
"TATWEEL = re.compile(r'\\u0640')\n",
"\n",
"\n",
"SYMBOL_NOISE = re.compile(r'([^\\w\\u0600-\\u06FF])\\1{2,}')\n",
"\n",
"\n",
"MULTI_BLANK = re.compile(r'\\n{3,}')\n",
"\n",
"print(\"Patterns compiled ✓\")"
]
},
{
"cell_type": "markdown",
"id": "d0e1f2a3",
"metadata": {
"id": "d0e1f2a3"
},
"source": [
"## 5 · Custom spaCy pipeline components"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1f2a3b4",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "e1f2a3b4",
"outputId": "5ad2c0d6-d97b-4672-89b0-e6ddf2014b03"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Custom components registered ✓\n"
]
}
],
"source": [
"\n",
"if not spacy.tokens.Token.has_extension(\"keep\"):\n",
" spacy.tokens.Token.set_extension(\"keep\", default=True)\n",
"\n",
"\n",
"@Language.component(\"arabic_normaliser\")\n",
"def arabic_normaliser(doc: Doc) -> Doc:\n",
" \"\"\"\n",
" Normalises each token after morphological analysis:\n",
" - Removes diacritics (tashkeel / harakat)\n",
" - Normalises Alef variants (أإآٱ → ا)\n",
" - Removes tatweel (kashida ـ)\n",
" Result stored in token.norm_ (token.text is untouched).\n",
" \"\"\"\n",
" for token in doc:\n",
" norm = token.text\n",
" norm = DIACRITICS.sub('', norm)\n",
" norm = ALEF_VARIANTS.sub('ا', norm)\n",
" norm = TATWEEL.sub('', norm)\n",
" token.norm_ = norm\n",
" return doc\n",
"\n",
"\n",
"@Language.component(\"arabic_cleaner\")\n",
"def arabic_cleaner(doc: Doc) -> Doc:\n",
" \"\"\"\n",
" Flags tokens for removal via token._.keep:\n",
" - URLs and email addresses\n",
" - Pure whitespace tokens\n",
" - Punctuation-only tokens (non-Arabic)\n",
" - Single non-alphanumeric, non-Arabic characters\n",
" \"\"\"\n",
" url_re = re.compile(r'https?://\\S+|www\\.\\S+|\\S+@\\S+\\.\\S+')\n",
" arabic_re = re.compile(r'[\\u0600-\\u06FF]')\n",
"\n",
" for token in doc:\n",
" text = token.norm_ or token.text\n",
" if url_re.match(text):\n",
" token._.keep = False\n",
" elif token.is_space:\n",
" token._.keep = False\n",
" elif token.is_punct and not arabic_re.search(text):\n",
" token._.keep = False\n",
" elif len(text) == 1 and not text.isalnum() and not arabic_re.match(text):\n",
" token._.keep = False\n",
" else:\n",
" token._.keep = True\n",
" return doc\n",
"\n",
"\n",
"print(\"Custom components registered ✓\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "rMv5CyFrulah",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "rMv5CyFrulah",
"outputId": "15749c95-0894-43ed-eeb9-60d3c7b9d2e0"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting xx-sent-ud-sm==3.8.0\n",
" Using cached https://github.com/explosion/spacy-models/releases/download/xx_sent_ud_sm-3.8.0/xx_sent_ud_sm-3.8.0-py3-none-any.whl (4.3 MB)\n",
"\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n",
"You can now load the package via spacy.load('xx_sent_ud_sm')\n",
"\u001b[38;5;3m⚠ Restart to reload dependencies\u001b[0m\n",
"If you are in a Jupyter or Colab notebook, you may need to restart Python in\n",
"order to load all the package's dependencies. You can do this by selecting the\n",
"'Restart kernel' or 'Restart runtime' option.\n"
]
}
],
"source": [
"!python -m spacy download xx_sent_ud_sm\n"
]
},
{
"cell_type": "markdown",
"id": "f2a3b4c5",
"metadata": {
"id": "f2a3b4c5"
},
"source": [
"## 6 · Build the spaCy pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6YUkvUHUv97k",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "6YUkvUHUv97k",
"outputId": "9b210d2b-50ad-4574-d121-7750fd382686"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Pipeline: ['arabic_normaliser', 'senter', 'arabic_cleaner']\n"
]
}
],
"source": [
"import spacy\n",
"from spacy.language import Language\n",
"\n",
"\n",
"\n",
"def build_nlp() -> Language:\n",
" \n",
" try:\n",
" nlp = spacy.load(\"xx_sent_ud_sm\", disable=[\"parser\", \"ner\"])\n",
" except OSError:\n",
" \n",
" nlp = spacy.blank(\"ar\")\n",
"\n",
" \n",
" if \"morphologizer\" in nlp.pipe_names:\n",
" nlp.add_pipe(\"arabic_normaliser\", after=\"morphologizer\")\n",
" else:\n",
" \n",
" nlp.add_pipe(\"arabic_normaliser\", first=True)\n",
"\n",
" \n",
" nlp.add_pipe(\"arabic_cleaner\", last=True)\n",
"\n",
" return nlp\n",
"\n",
"\n",
"nlp = build_nlp()\n",
"print(\"Pipeline:\", nlp.pipe_names)"
]
},
{
"cell_type": "markdown",
"id": "b4c5d6e7",
"metadata": {
"id": "b4c5d6e7"
},
"source": [
"## 7 · Helper functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c5d6e7f8",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "c5d6e7f8",
"outputId": "323dff30-aff5-4a42-cdd6-581df0144724"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Helper functions defined ✓\n"
]
}
],
"source": [
"def extract_text_from_html(html: str) -> str:\n",
" \"\"\"Strip all HTML tags and return visible text.\"\"\"\n",
" soup = BeautifulSoup(html, \"lxml\")\n",
" for tag in soup.find_all([\"script\", \"style\", \"noscript\", \"head\",\n",
" \"meta\", \"link\", \"iframe\", \"svg\", \"img\"]):\n",
" tag.decompose()\n",
" return soup.get_text(separator=\"\\n\")\n",
"\n",
"\n",
"def post_clean(text: str) -> str:\n",
" \n",
" text = SYMBOL_NOISE.sub(r'\\1', text)\n",
"\n",
" \n",
" lines = [line.strip() for line in text.splitlines()]\n",
"\n",
" \n",
" \n",
" text = ' '.join(lines)\n",
"\n",
" \n",
" \n",
" \n",
" text = re.sub(r'\\s+', ' ', text)\n",
"\n",
" return text.strip()\n",
"\n",
"def process_record(record: dict, nlp: Language) -> dict:\n",
" \n",
" raw_text = extract_text_from_html(record.get(\"Text\", \" \"))\n",
"\n",
" \n",
" doc = nlp(raw_text)\n",
"\n",
" \n",
" parts = []\n",
" for token in doc:\n",
" if token._.keep:\n",
" \n",
" text_to_append = token.norm_ or token.text\n",
" parts.append(text_to_append)\n",
"\n",
" \n",
" cleaned = \" \".join(parts)\n",
"\n",
" \n",
" \n",
" cleaned = post_clean(cleaned)\n",
"\n",
" return {\n",
" \"Url\": record.get(\"Url\", \"\"),\n",
" \"Timestamp\": record.get(\"Timestamp\", \"\"),\n",
" \"Text\": cleaned,\n",
" \"token_count\": sum(1 for t in doc if t._.keep),\n",
" }\n",
"\n",
"\n",
"print(\"Helper functions defined ✓\")"
]
},
{
"cell_type": "markdown",
"id": "d6e7f8a9",
"metadata": {
"id": "d6e7f8a9"
},
"source": [
"## 8 · Load the JSON file"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "e7f8a9b0",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "e7f8a9b0",
"outputId": "1207bc14-3a07-43bb-c664-8222befc9761"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loaded 26 records from /content/moad_full_scrape.json\n",
"Keys per record: ['Url', 'Text', 'Timestamp']\n"
]
}
],
"source": [
"input_path = Path(INPUT_PATH)\n",
"\n",
"with open(input_path, encoding=\"utf-8\") as f:\n",
" data = json.load(f)\n",
"\n",
"print(f\"Loaded {len(data)} records from {input_path}\")\n",
"print(f\"Keys per record: {list(data[0].keys())}\")"
]
},
{
"cell_type": "markdown",
"id": "f8a9b0c1",
"metadata": {
"id": "f8a9b0c1"
},
"source": [
"## 9 · Preview one raw record"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "a9b0c1d2",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "a9b0c1d2",
"outputId": "28cefa91-bccb-48bb-dc7b-12f3cc22da97"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"URL : https://moad.gov.sy/ar/contact/important-links\n",
"Timestamp: 2026-03-04T11:45:50.322876\n",
"\n",
"Raw HTML (first 500 chars):\n",
"<!DOCTYPE html><html lang=\"ar\" dir=\"rtl\"><head><meta charSet=\"utf-8\"/><meta name=\"viewport\" content=\"width=device-width, initial-scale=1\"/><link rel=\"stylesheet\" href=\"/_next/static/chunks/93a75f8697bed459.css\" data-precedence=\"next\"/><link rel=\"preload\" as=\"script\" fetchPriority=\"low\" href=\"/_next/static/chunks/c62b4372c26b460f.js\"/><script src=\"/_next/static/chunks/c951386faa1af47c.js\" async=\"\"></script><script src=\"/_next/static/chunks/fa9b078b3e28c0da.js\" async=\"\"></script><script src=\"/_nex\n"
]
}
],
"source": [
"sample = data[0]\n",
"print(\"URL :\", sample[\"Url\"])\n",
"print(\"Timestamp:\", sample.get(\"Timestamp\", \"N/A\"))\n",
"print(\"\\nRaw HTML (first 500 chars):\")\n",
"print(sample[\"Text\"][:500])"
]
},
{
"cell_type": "markdown",
"id": "b0c1d2e3",
"metadata": {
"id": "b0c1d2e3"
},
"source": [
"## 10 · Process all records"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "c1d2e3f4",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "c1d2e3f4",
"outputId": "2e668452-bb66-4c5e-a21c-78b255181821"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[01] tokens=169 https://moad.gov.sy/ar/contact/important-links\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[02] tokens=200 https://moad.gov.sy/ar/contact\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[03] tokens=428 https://moad.gov.sy/ar/ministry/about\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[04] tokens=191 https://moad.gov.sy/ar/ministry/minister\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[05] tokens=483 https://moad.gov.sy/ar/contact/faq\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[06] tokens=146 https://moad.gov.sy/legislation/decrees/42\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[07] tokens=217 https://moad.gov.sy/ar/legislation\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[08] tokens=171 https://moad.gov.sy/ar/legislation/decrees\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[09] tokens=185 https://moad.gov.sy/ministry\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[10] tokens=169 https://moad.gov.sy/contact/important-links\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[11] tokens=191 https://moad.gov.sy/ministry/minister\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[12] tokens=185 https://moad.gov.sy/ar/ministry\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[13] tokens=421 https://moad.gov.sy/ar/terms-conditions\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[14] tokens=536 https://moad.gov.sy/ar/privacy-policy\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[15] tokens=997 https://moad.gov.sy/ministry/strategic-directions-map\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[16] tokens=483 https://moad.gov.sy/contact/faq\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[17] tokens=200 https://moad.gov.sy/contact\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[18] tokens=154 https://moad.gov.sy/legislation/service-guides\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[19] tokens=171 https://moad.gov.sy/legislation/decrees\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[20] tokens=154 https://moad.gov.sy/ar/legislation/service-guides\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[21] tokens=146 https://moad.gov.sy/legislation/decrees/43\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[22] tokens=217 https://moad.gov.sy/legislation\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[23] tokens=997 https://moad.gov.sy/ar/ministry/strategic-directions-map\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[24] tokens=146 https://moad.gov.sy/legislation/decrees/46\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[25] tokens=550 https://moad.gov.sy/\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"[26] tokens=428 https://moad.gov.sy/ministry/about\n",
" دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ،…\n",
"\n",
"\n",
"All 26 records processed ✓\n"
]
}
],
"source": [
"cleaned_data = []\n",
"\n",
"for i, record in enumerate(data, 1):\n",
" result = process_record(record, nlp)\n",
" cleaned_data.append(result)\n",
"\n",
" snippet = result[\"Text\"][:120].replace(\"\\n\", \" \")\n",
" print(f\"[{i:02d}] tokens={result['token_count']} {record.get('Url', ' ')}\")\n",
" print(f\" {snippet}…\\n\")\n",
"\n",
"print(f\"\\nAll {len(cleaned_data)} records processed ✓\")"
]
},
{
"cell_type": "markdown",
"id": "d2e3f4a5",
"metadata": {
"id": "d2e3f4a5"
},
"source": [
"## 11 · Preview a cleaned record"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "e3f4a5b6",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "e3f4a5b6",
"outputId": "f36bf873-e537-4246-f115-f6d9aa376fa0"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"URL : https://moad.gov.sy/ar/contact/important-links\n",
"Token count : 169\n",
"\n",
"Cleaned text (first 800 chars):\n",
"دمشق القائمة الوزارة حول الوزارة من نحن الوزير نبذة عامة خارطة التوجهات الاستراتيجية الرؤية ، الرسالة ، القيم المؤسسية ، الاهداف الاستراتيجية ، محاور العمل التشريعات و الادلة القوانين تتضمن القوانين النافذة التي تنظم عمل الجهاز الحكومي وتدعم مسار التحول المؤسسي المراسيم تشمل المراسيم الرئاسية المتعلقة بتطوير الادارة العامة والموارد البشرية الادلة الخدمية لوزارة التنمية توضح الاجراءات والمعايير المعتمدة لتقديم الخدمات الحكومية بشفافية وكفاءة الاخبار والفعاليات تواصل معنا الاسئلة الشائعة اكثر الاسئلة تكرارا روابط مهمة جهات مرتبطة بعمل الوزارة English English الوزارة التشريعات و الادلة الاخبار والفعاليات تواصل معنا الرئيسية تواصل معنا الروابط المهمة الروابط المهمة وزارة الخارجية وزارة الخارجية مسؤولة عن سياسة الدولة الخارجية والعلاقات مع الدول الاخرى وزارة الداخلية وزارة الداخلية مسؤولة عن سي\n"
]
}
],
"source": [
"sample_clean = cleaned_data[0]\n",
"print(\"URL :\", sample_clean[\"Url\"])\n",
"print(\"Token count :\", sample_clean[\"token_count\"])\n",
"print(\"\\nCleaned text (first 800 chars):\")\n",
"print(sample_clean[\"Text\"][:800])"
]
},
{
"cell_type": "markdown",
"id": "f4a5b6c7",
"metadata": {
"id": "f4a5b6c7"
},
"source": [
"## 12 · Save output JSON"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "a5b6c7d8",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "a5b6c7d8",
"outputId": "62952f28-c72c-4553-b572-8de2b3ebc400"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Saved → /content/moad_cleaned_spacy.json (104.3 KB)\n",
"Done ✓\n"
]
}
],
"source": [
"output_path = Path(OUTPUT_PATH)\n",
"\n",
"with open(output_path, \"w\", encoding=\"utf-8\") as f:\n",
" json.dump(cleaned_data, f, ensure_ascii=False, indent=2)\n",
"\n",
"print(f\"Saved → {output_path} ({output_path.stat().st_size / 1024:.1f} KB)\")\n",
"print(\"Done ✓\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "xnL9qF4EyF9E",
"metadata": {
"id": "xnL9qF4EyF9E"
},
"outputs": [],
"source": []
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.10.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}