Data statistics on xam

rongden69 · Yesterday at 17:22

I'm about to crawl all the data on xàm to see which word gets used more, "cặc" or "lồn". Anyone interested? I'll post updates here as I go.

rongden69 · Yesterday at 17:22

Dot to reserve a slot.

votdien · Yesterday at 17:23

The word "Lồn" in red, the national soul and essence.

Kutan79 · Yesterday at 17:23

Count how many times kutan79 has begged for cards on the xàm forum.

Thu_Đông · Yesterday at 17:23

"Lồn" is used more, surely.

Dotucthinhchau · Yesterday at 17:23

rongden69:
I'm about to crawl all the data on xàm to see which word gets used more, "cặc" or "lồn". Anyone interested? I'll post updates here as I go.

While you're at it, tally who's anti-state along with their ids so the guys can get to work :vozvn (22):

Thích_Yến_Trân · Yesterday at 17:24

In the event that this group is investigated by agencies under the Ministry of Public Security (or similar political organizations serving the state of the Socialist Republic of Vietnam), I declare that I have no connection to this group or to any other individuals in it. I do not know why I am present here at this time; perhaps my account was added by a third party. I also declare that I have never assisted the members of this group in any actions against the Party and the state.

rongden69 · Yesterday at 17:24

All suggestions are noted and I'll compile the full stats for you all :sweet_kiss:

rongden69 · Yesterday at 19:10

Step 1:
To crawl all the data on xamvn, I first need a list of every thread on the forum.
Start by looking at a thread URL, for example this very thread: https://xamvn.chat/r/thong-ke-du-lieu-tren-xam.1320600
It has the form: https://xamvn.chat/r/[thread title].[thread id]
I tried the link https://xamvn.chat/r/1320600/ and it loads normally => I only need to try every id from 1 up to the latest one to get the list of all threads.
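
A minimal sketch of that URL pattern, assuming every thread URL ends in either `[slug].[id]` or just `[id]`, with an optional trailing slash. The helper names (`thread_id_from_url`, `short_thread_url`, `candidate_urls`) are made up for illustration and nothing here touches the network:

```python
import re

BASE_URL = "https://xamvn.chat/r/"  # prefix taken from the URLs in this post

# Matches both ".../r/<slug>.<id>" and ".../r/<id>", with or without a trailing slash.
THREAD_ID_RE = re.compile(r"/r/(?:[^/]*\.)?(\d+)/?$")


def thread_id_from_url(url):
    """Return the numeric thread id at the end of a thread URL, or None."""
    match = THREAD_ID_RE.search(url)
    return int(match.group(1)) if match else None


def short_thread_url(thread_id):
    """Build the slug-less URL form, which also loaded when tried by hand."""
    return f"{BASE_URL}{thread_id}/"


def candidate_urls(first_id, last_id):
    """Yield one slug-less URL per id in [first_id, last_id] (hypothetical helper)."""
    for thread_id in range(first_id, last_id + 1):
        yield short_thread_url(thread_id)
```

For example, `thread_id_from_url` applied to this thread's URL gives `1320600`, and `candidate_urls(1, 1320600)` would enumerate every id to probe.
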
However, when I tried a few ids, the thread wasn't found and all I saw was Manhthuong's dick:

The requested page could not be found.

...............…………………………._¸„„„„_
…………………….…………...„--~*'¯…….'\
………….…………………… („-~~--„¸_….,/ì'Ì
…….…………………….¸„-^"¯ : : : : :¸-¯"¯/'
……………………¸„„-^"¯ : : : : : : : '\¸„„,-"
**¯¯¯'^^~-„„„----~^*'"¯ : : : : : : : : : :¸-"
.:.:.:.:.„-^" : : : : : : : : : : : : : : : : :„-"
:.:.:.:.:.:.:.:.:.:.: : : : : : : : : : ¸„-^¯
.::.:.:.:.:.:.:.:. : : : : : : : ¸„„-^¯
:.' : : '\ : : : : : : : ;¸„„-~"
:.:.:: :"-„""***/*'ì¸'¯
:.': : : : :"-„ : : :"\
.:.:.: : : : :" : : : : \,
:.: : : : : : : : : : : : 'Ì
: : : : : : :, : : : : : :/
"-„_::::_„-*__„„~"

So I took a different approach and checked the sitemap at https://xamvn.chat/sitemap.xml:

XML:

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://xamvn.chat/sitemap-1.xml</loc>
<lastmod>2025-04-04T05:38:29+00:00</lastmod>
</sitemap>
<sitemap>
<loc>https://xamvn.chat/sitemap-2.xml</loc>
<lastmod>2025-04-04T05:38:29+00:00</lastmod>
</sitemap>
<sitemap>
<loc>https://xamvn.chat/sitemap-3.xml</loc>
<lastmod>2025-04-04T05:38:29+00:00</lastmod>
</sitemap>
<sitemap>
<loc>https://xamvn.chat/sitemap-4.xml</loc>
<lastmod>2025-04-04T05:38:29+00:00</lastmod>
</sitemap>
<sitemap>
<loc>https://xamvn.chat/sitemap-5.xml</loc>
<lastmod>2025-04-04T05:38:29+00:00</lastmod>
</sitemap>
<sitemap>
<loc>https://xamvn.chat/sitemap-6.xml</loc>
<lastmod>2025-04-04T05:38:29+00:00</lastmod>
</sitemap>
<sitemap>
<loc>https://xamvn.chat/sitemap-7.xml</loc>
<lastmod>2025-04-04T05:38:29+00:00</lastmod>
</sitemap>
</sitemapindex>
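
A minimal sketch of pulling the child sitemap links out of an index like the one above, using only the standard library. The function name `child_sitemaps` and the shortened `SAMPLE_INDEX` string are just for illustration; in practice the XML would be downloaded from the sitemap URL first:

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <sitemapindex> root element.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def child_sitemaps(index_xml):
    """Return the <loc> URL of every <sitemap> entry in a sitemap index."""
    root = ET.fromstring(index_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:sitemap/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Shortened copy of the index shown above (first two entries only).
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://xamvn.chat/sitemap-1.xml</loc>
<lastmod>2025-04-04T05:38:29+00:00</lastmod>
</sitemap>
<sitemap>
<loc>https://xamvn.chat/sitemap-2.xml</loc>
<lastmod>2025-04-04T05:38:29+00:00</lastmod>
</sitemap>
</sitemapindex>"""

for url in child_sitemaps(SAMPLE_INDEX):
    print(url)
```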

BINGO, the links above hold the full list of every thread on xamvn, from the very first to the latest. Having this list saves me the time of checking every id:

XML:

<url><loc>https://xamvn.chat/r/tam-tu-tuoi-moi-lon.47/</loc><lastmod>2019-07-03T14:16:59+00:00</lastmod></url>
    <url><loc>https://xamvn.chat/r/vai-lmht-con-mi-van-chuyen-ve-gan-chung-cu-tao-song.87/</loc><lastmod>2024-07-10T09:03:09+00:00</lastmod></url>
    <url><loc>https://xamvn.chat/r/chem-lan-dau-chung-may-thit-gai-no-noi-gi.90/</loc><lastmod>2019-05-08T16:43:44+00:00</lastmod></url>
    <url><loc>https://xamvn.chat/r/thang-nao-c-c-to-vao-diem-danh.102/</loc><lastmod>2023-12-23T10:08:20+00:00</lastmod></url>
    <url><loc>https://xamvn.chat/r/khan-cau-su-giup-suc-phi-vu-dau-tien-cua-vozvn.113/</loc><lastmod>2019-09-21T01:47:45+00:00</lastmod></url>
    <url><loc>https://xamvn.chat/r/tai-va-tat-cua-1-so-mem-xaml-cu.166/</loc><lastmod>2022-12-12T12:08:14+00:00</lastmod></url>
    <url><loc>https://xamvn.chat/r/eu-may-thang-guest.204/</loc><lastmod>2016-06-10T19:42:59+00:00</lastmod></url>
    <url><loc>https://xamvn.chat/r/chung-may-biet-phi-doi-nay-ko.216/</loc><lastmod>2018-10-11T12:04:38+00:00</lastmod></url>
    <url><loc>https://xamvn.chat/r/ki-su-chuyen-mo.225/</loc><lastmod>2023-05-23T09:28:54+00:00</lastmod></url>
    <url><loc>https://xamvn.chat/r/nhan-xem-tuong-cho-gai-xinh-o-day-nhe.247/</loc><lastmod>2019-10-06T16:55:26+00:00</lastmod></url>
    <url><loc>https://xamvn.chat/r/pho-nguyen-cai-4rum-lon-ma-vang-nhu-di-toi-ngay.297/</loc><lastmod>2019-05-14T15:08:52+00:00</lastmod></url>
    <url><loc>https://xamvn.chat/r/chuyen-rau-co-co-it-anh.333/</loc><lastmod>2019-01-14T03:38:50+00:00</lastmod></url>
    <url><loc>https://xamvn.chat/r/tam-su-cua-pin.344/</loc><lastmod>2022-10-30T17:14:22+00:00</lastmod></url>

From this data alone I can already get some perspective on forum activity, without even touching the comments inside...
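
As a rough illustration of that kind of quick look, the sketch below counts threads per year of their `lastmod` value. Assumptions: the entries sit inside a standard `<urlset>` root (the excerpt above only shows the `<url>` rows), and `lastmod` is the last modification time, not the creation date. The function name is hypothetical:

```python
import xml.etree.ElementTree as ET
from collections import Counter
from datetime import datetime

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def threads_by_lastmod_year(urlset_xml):
    """Count <url> entries per year of their <lastmod> timestamp."""
    root = ET.fromstring(urlset_xml)
    years = Counter()
    for url in root.findall("sm:url", SITEMAP_NS):
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS)
        if lastmod:
            years[datetime.fromisoformat(lastmod.strip()).year] += 1
    return years


# Three rows copied from the excerpt above, wrapped in an assumed <urlset> root.
SAMPLE_URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://xamvn.chat/r/tam-tu-tuoi-moi-lon.47/</loc><lastmod>2019-07-03T14:16:59+00:00</lastmod></url>
<url><loc>https://xamvn.chat/r/vai-lmht-con-mi-van-chuyen-ve-gan-chung-cu-tao-song.87/</loc><lastmod>2024-07-10T09:03:09+00:00</lastmod></url>
<url><loc>https://xamvn.chat/r/chem-lan-dau-chung-may-thit-gai-no-noi-gi.90/</loc><lastmod>2019-05-08T16:43:44+00:00</lastmod></url>
</urlset>"""

for year, count in sorted(threads_by_lastmod_year(SAMPLE_URLSET).items()):
    print(year, count)
```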

rongden69 · Today at 00:08

Step 2:
Crawl the data from the list in step 1.
I picked Scrapy because it's easy to use and handles concurrent requests out of the box -> it crawls many pages at once.
Run the code:

Python:

import scrapy
from scrapy.spiders import SitemapSpider
import logging

logging.basicConfig(level=logging.INFO)

class XamvnChatSpider(SitemapSpider):
    name = 'xamvn_chat'
    allowed_domains = ['xamvn.chat']
    sitemap_urls = ['https://xamvn.chat/sitemap.xml']
    custom_settings = {
        'COOKIES_ENABLED': False,
        'LOG_LEVEL': 'INFO',
    }

    sitemap_rules = [
        ('/r/', 'parse_thread'),
    ]

    def parse_thread(self, response, thread_url=None, title=None, accumulated_messages=None):
        if thread_url is None:
            thread_url = response.url
        if accumulated_messages is None:
            accumulated_messages = []
        current_title = title
        if current_title is None:
            current_title_extracted = response.css('h1::text').get()
            if current_title_extracted:
                current_title = current_title_extracted.strip()
            else:
                current_title = "Untitled Thread" # Provide a default
                self.logger.warning(f"Could not extract title for thread starting at {thread_url}")
        current_page_messages = []
        articles = response.css('.block-container .message--post') # Sticking with CSS for clarity here
        for article in articles:
            user_id = article.css('a.username::attr(data-user-id)').get()
            timestamp = article.css('time::attr(datetime)').get()
            content = article.css('.bbWrapper').get()
            order = None
            post_order_texts = article.css('.message-attribution-opposite a::text').getall()
            for text in reversed(post_order_texts):
                cleaned_text = text.strip()
                if cleaned_text.startswith('#') and cleaned_text[1:].isdigit():
                    order = cleaned_text
                    break
            if user_id and content:
                current_page_messages.append({
                    'data_user_id': user_id,
                    'timestamp': timestamp,
                    'content': content,
                    'order': order
                })
            else:
                self.logger.warning(f"Skipping message on {response.url} (User: {user_id}) due to missing data.")
        accumulated_messages.extend(current_page_messages)
        next_page_selector = 'a.pageNav-jump.pageNav-jump--next::attr(href)'
        next_page_relative_url = response.css(next_page_selector).get()
        if next_page_relative_url:
            next_page_url = response.urljoin(next_page_relative_url.strip())
            yield scrapy.Request(
                url=next_page_url,
                callback=self.parse_thread,
                cb_kwargs={
                    'thread_url': thread_url,
                    'title': current_title,
                    'accumulated_messages': accumulated_messages
                },
            )
        else:
            yield {
                'url': thread_url,
                'title': current_title,
                'messages': accumulated_messages,
                'message_count': len(accumulated_messages)
            }
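
Once the spider has produced items shaped like the dict above (`messages` is a list of dicts whose `content` is the raw `.bbWrapper` HTML), the word comparison could be done roughly as below. This is only a sketch: `html_to_text` and `count_words` are hypothetical helpers, the words are passed in as parameters, and quoted posts are counted again every time they are quoted:

```python
import re
import unicodedata
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Strip tags from an HTML fragment and return its text, NFC-normalized."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return unicodedata.normalize("NFC", " ".join(parser.parts))


def count_words(items, words):
    """Count case-insensitive whole-word occurrences of each word across all messages."""
    patterns = {
        word: re.compile(
            rf"(?<!\w){re.escape(unicodedata.normalize('NFC', word))}(?!\w)",
            re.IGNORECASE,
        )
        for word in words
    }
    totals = Counter()
    for item in items:
        for message in item["messages"]:
            text = html_to_text(message["content"])
            for word, pattern in patterns.items():
                totals[word] += len(pattern.findall(text))
    return totals
```

Normalizing to NFC matters for Vietnamese text, since the same accented letter can be stored as one code point or as a base letter plus combining marks.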

With my crappy machine and connection I can only manage ~500 threads/minute => treating the latest id (~1320600) as the thread count, crawling everything takes 1320600/500 ≈ 2641 minutes ≈ 44 hours non-stop. See you all in 44 hours.
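
The same back-of-the-envelope estimate as a tiny script, in case the throughput changes. The function name is made up, and the thread count is the upper bound taken from the latest id, so the real number of threads is lower because some ids don't exist:

```python
def crawl_time_estimate(thread_count, threads_per_minute):
    """Return (minutes, hours) needed to crawl thread_count threads at a steady rate."""
    minutes = thread_count / threads_per_minute
    return minutes, minutes / 60


minutes, hours = crawl_time_estimate(1_320_600, 500)
print(f"{minutes:.0f} minutes, about {hours:.0f} hours")
```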

Tuco Salamanca · Today at 00:17

I'm betting on "Lồn".

loading99phantram · Today at 00:29

Remember to dump it into Elasticsearch too, then expose an API for the guys to play with.
