We need to scrape WATMM somehow, even if it's only segments of it. The admin said he's not going to archive or save anything, so we have to do it ourselves. This thread is for working out how to do it. I'm no expert at web scraping or archiving, but I've tried a few things so far.

Methods:

  • Browsing to a thread and printing it to PDF in Chrome, one page at a time
  • Scraping via wget
    • Problems:
      • Always gives me 403 Forbidden errors, but I haven't tried every option yet, such as passing cookie strings or testing different user-agent strings
  • Archive.org Wayback Machine
    • https://archive.org/
    • Benefits:
      • Archives to archive.org's servers which can be accessed by anyone later
    • Problems:
      • Tough to get the site to archive your page; they seem to have low availability lately due to ongoing DDoS attacks
    • Tips:
      • Can theoretically be run through a browser plugin that gives you a quick way to save whatever page you're on to the Wayback Machine. In practice I've had no luck with this; it just gets stuck loading forever
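
For the untried wget options (cookies and user-agent strings), here is a minimal Python sketch of the same idea, roughly what wget's `--user-agent` and `--header` options would do. The user-agent string and the cookie value are placeholders, not values known to work on WATMM, and there is no guarantee this gets past the 403.

```python
import urllib.request

# Placeholder browser-like user agent; swap in the one your own browser sends.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"
)


def build_request(url, cookie_header=None, user_agent=DEFAULT_USER_AGENT):
    """Build a GET request carrying a custom User-Agent and optional cookies."""
    headers = {"User-Agent": user_agent}
    if cookie_header:
        # cookie_header is the raw "name=value; name2=value2" string,
        # e.g. copied from a logged-in browser session (assumption).
        headers["Cookie"] = cookie_header
    return urllib.request.Request(url, headers=headers)


def fetch(url, cookie_header=None, timeout=30):
    """Fetch one page and return its raw bytes.

    Raises urllib.error.HTTPError if the server still answers 403.
    """
    request = build_request(url, cookie_header)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling `fetch(url, cookie_header="...")` with a cookie string from your own session is the experiment to try; if it still raises a 403, the block is probably not based on headers alone.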

Share your ideas and experiments

 

Resources:

https://wiki.archiveteam.org/index.php?title=Main_Page

 

 

Edited by Ivy Zemura yvI oo ii oo
https://forum.watmm.com/topic/105501-~scraping-watmm-project~/

you can easily use something like curl or wget, it's actually pretty straightforward.

don't overthink it :cisfor: all our precious watmm memories will be backed up. we still have three months.

https://forum.watmm.com/topic/105501-~scraping-watmm-project~/#findComment-3005135

https://web.archive.org/save

The capture failed because Save Page Now does not have access rights for http://forum.watmm.com/topic/76672-vaporwave/page-82 (HTTP status=403).

 

great...
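
Since Save Page Now gets a 403 here, it may be worth checking whether a page already has an older capture before spending effort on it. A rough sketch using the Wayback Machine availability API; the function names are made up for illustration, and only the endpoint and the `archived_snapshots` / `closest` response fields come from the API itself.

```python
import json
import urllib.parse
import urllib.request

AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_url(page_url):
    """Build the availability query URL for one page."""
    return AVAILABILITY_API + "?" + urllib.parse.urlencode({"url": page_url})


def closest_snapshot(api_response):
    """Return the closest snapshot URL from a parsed response, or None."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check(page_url, timeout=30):
    """Ask the API whether any capture of page_url exists (needs network)."""
    with urllib.request.urlopen(availability_url(page_url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

`check("http://forum.watmm.com/topic/76672-vaporwave/page-82")` would return a snapshot URL if one exists and `None` otherwise.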

  On 2/19/2025 at 6:28 AM, Dragon said:

you can easily use something like curl or wget, it's actually pretty straightforward.

don't overthink it :cisfor: all our precious watmm memories will be backed up. we still have three months.

ok you are in charge of this, provide me the command

nice:

https://web.archive.org/web/*/https://forum.watmm.com/topic/*
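
The wildcard listing above can also be pulled as data instead of browsed by hand. A sketch against the Wayback CDX server API, assuming its `matchType=prefix`, `output=json`, `fl` and `collapse` parameters; the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

CDX_API = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(prefix, limit=None):
    """Build a CDX query listing captures whose URL starts with prefix."""
    params = {
        "url": prefix,
        "matchType": "prefix",
        "output": "json",
        "fl": "timestamp,original",
        "collapse": "urlkey",  # at most one row per distinct URL
    }
    if limit is not None:
        params["limit"] = str(limit)
    return CDX_API + "?" + urllib.parse.urlencode(params)


def parse_cdx_rows(rows):
    """Turn the JSON list-of-lists (first row is the header) into dicts."""
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def snapshot_url(capture):
    """Build the playback URL for one parsed capture."""
    return "https://web.archive.org/web/{timestamp}/{original}".format(**capture)


def list_captures(prefix, limit=None, timeout=60):
    """Run the query (needs network) and return the parsed captures."""
    with urllib.request.urlopen(cdx_query_url(prefix, limit), timeout=timeout) as resp:
        return parse_cdx_rows(json.load(resp))
```

Comparing `list_captures("forum.watmm.com/topic/")` against a list of thread URLs would show which threads have no capture at all.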

https://forum.watmm.com/topic/105501-~scraping-watmm-project~/#findComment-3005136

yeah a couple of months ago i was backing up the entire DANK MEMES thread in the hope of creating a trippy timelapse video. i finished archiving it (90,000 memes lmao) but then i had trouble finding a way to create the video.

@Ivy Zemura yvI oo ii oo dm me

https://forum.watmm.com/topic/105501-~scraping-watmm-project~/#findComment-3005137

I also think this is important.

  On 4/17/2013 at 2:45 PM, Alcofribas said:

afaik i usually place all my cum drops on scientifically sterilized glass slides which are carefully frozen and placed in trash cans throughout the city labelled "for women ❤️ alco" with my social security and phone numbers.

https://forum.watmm.com/topic/105501-~scraping-watmm-project~/#findComment-3005142

i hope y'all catch the album mega threads and random into threads in the major artists forums. in theory there's still a few months so i'm sure going to find some time to try and catch some to help out

https://forum.watmm.com/topic/105501-~scraping-watmm-project~/#findComment-3005164

I'm mostly concerned about posts like this being lost.

 

  On 4/7/2010 at 11:31 PM, Alcofribas said:

i met up with sean booth, one half of autechre (ae) and talked about everything from the duo's new album to horror films, even to what his mum's favorite sarny is innit.

 

so, how does oversteps compare to your previous records?

 

well, it's much better. we use a lot of melodies and wrote it on a kyma system MIDI'd up to some old roland drum machines we bought from moby. that bloke is a vegan, lol

 

would you say it's more emotional than, say, Quaristice?

 

nah

 

what was different about your work methods?

 

well, rob and i met back in 2001 and we've been working with pretty much the same equipment over in his dad's attic. so, this time around, we just did the same old shit. i usually make the beats and the melodies and he usually records it into his computer. then i master it and he plays about with his two kids. then i do the artwork.

 

under your alias TDR?

 

yeah, you've done yer research mate!

 

thanks! what records inspired this album?

 

ugh, i'd have to say "amber," "incunablua," "drukqs," all by autechre. and i guess mira calix, cuz she's my mate.

 

do you buy a lot of new records?

 

not really,...well, yeah, i by 'em off discogs a lot. i've got like every atom tm record. but mostly i buy mp3s off itunes. itunes is lush!

 

do you think it's possible to make music after 9/11?

 

nah

 

how do you cope with the trauma of war?

 

work on tracks, innit?

 

are you going to tour in america for this album?

 

no, just europe and mexico. mexico is fucking lush!

 

are you working on any new tracks at the moment?

 

yeah.

 

are they any good?

 

yeah. i've got one with this epic crescendo, like huge, fucking half hour crescendo. i'm calling it "bolero" by ravel.

 

ever read watmm?

 

yeah, i'm ET on there. lol

what else is going on in your life these days?

 

i'm making some kids with the missus, that's well lush. rob and i are doing yoga together too but with the distance we have to do it together on skype. nirvana is lush.

 

when's the next album coming out?

 

later this year, autumn prob. autumn is lush.

 

thanks for talking with me!

 

lush mate.


(jk ofc there's treasure troves in all the genre threads that need saving also)


https://forum.watmm.com/topic/105501-~scraping-watmm-project~/#findComment-3005165

  On 2/19/2025 at 1:42 PM, usagi said:

I'm mostly concerned about posts like this being lost.

just download the browser extension and save page every time you see one...(i believe this works? someone correct me if i'm mistaken)

[Attached screenshot, 2025-02-19 at 6:48 AM]

https://forum.watmm.com/topic/105501-~scraping-watmm-project~/#findComment-3005168

Most scrapers won't work because WATMM is using Cloudflare, which blocks the usual scraping methods.

I have currently archived all threads from May 2013 to Feb 2025. Should be around 25% of all content. Aim to finish this month.

https://forum.watmm.com/topic/105501-~scraping-watmm-project~/#findComment-3005170

  On 2/19/2025 at 2:58 PM, Ego said:

Most scrapers won't work because WATMM is using Cloudflare, which blocks the usual scraping methods.

I have currently archived all threads from May 2013 to Feb 2025. Should be around 25% of all content. Aim to finish this month.

legend.


https://forum.watmm.com/topic/105501-~scraping-watmm-project~/#findComment-3005172

  On 2/19/2025 at 2:58 PM, Ego said:

Most scrapers won't work because WATMM is using Cloudflare, which blocks the usual scraping methods.

I have currently archived all threads from May 2013 to Feb 2025. Should be around 25% of all content. Aim to finish this month.

That's great, what format are they archived in?

https://forum.watmm.com/topic/105501-~scraping-watmm-project~/#findComment-3005173

  On 2/19/2025 at 6:32 AM, Dragon said:

yeah a couple of months ago i was backing up the entire DANK MEMES thread in the hope of creating a trippy timelapse video. i finished archiving it (90,000 memes lmao) but then i had trouble finding a way to create the video.

@Ivy Zemura yvI oo ii oo

You are doing God’s work, danke

https://forum.watmm.com/topic/105501-~scraping-watmm-project~/#findComment-3005174

  On 2/19/2025 at 3:16 PM, zazen said:

That's great, what format are they archived in?

I'm just collecting the full page source for each page into an SQLite database. A bit wasteful but want to make sure we don't miss anything. 22GB and counting...
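
For anyone who wants to store pages the same way, here is a minimal sketch of dumping raw page source into SQLite. The table and column names are my own guesses for illustration, not the actual schema used for the 22GB archive.

```python
import sqlite3
import time


def open_archive(path=":memory:"):
    """Open (or create) the archive database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS pages (
            url        TEXT PRIMARY KEY,
            fetched_at REAL NOT NULL,
            html       TEXT NOT NULL
        )
        """
    )
    return conn


def save_page(conn, url, html):
    """Store the full page source; re-saving a URL overwrites the old copy."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, fetched_at, html) VALUES (?, ?, ?)",
            (url, time.time(), html),
        )


def already_saved(conn, url):
    """Return True if this URL is in the archive, so a rerun can skip it."""
    row = conn.execute("SELECT 1 FROM pages WHERE url = ?", (url,)).fetchone()
    return row is not None
```

Keying on the URL means an interrupted run can be restarted and skip whatever `already_saved` reports as done.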

Later on we can consider parsing out all the actual posts and metadata. I also need to do a run to fetch all the images that were uploaded to the board.
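
As a rough idea of what the later image pass could look like, this sketch pulls `<img>` URLs out of one stored page using only the standard library. It assumes the uploads show up as plain `src` attributes, which may not hold for lazy-loaded or thumbnail-wrapped images on the real pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect absolute URLs from the src attribute of every <img> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page the HTML came from.
            self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    """Return the image URLs found in one page's HTML, in document order."""
    collector = ImageCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.image_urls
```

Running this over each archived page and de-duplicating the results would give a download list for the image run.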

Edited by Ego
https://forum.watmm.com/topic/105501-~scraping-watmm-project~/#findComment-3005183

I don't authorize the use of my content in the new forum... don't really wish to have any connection whatsoever to watmm and its owner... 

but anyway good job 👍

Edited by cruising for burgers
https://forum.watmm.com/topic/105501-~scraping-watmm-project~/#findComment-3005190

hilarious we have to do this because of one person's obdurate hubris

  On 11/24/2015 at 12:29 PM, Salvatorin said:

I feel there is a baobab tree growing out of my head, its leaves stretch up to the heavens

  

 

 

https://forum.watmm.com/topic/105501-~scraping-watmm-project~/#findComment-3005192

I've spent the last 2 weeks transcribing the contents of the LTM forum and Rephlex BBQ forum by hand, with a biro, to a series of hardbound foolscap notepads.

About 7% done.

 

Edited by perunamuusi
https://forum.watmm.com/topic/105501-~scraping-watmm-project~/#findComment-3005198
