Posts Tagged ‘RSS’

RSS to EPUB Converter: Create eBooks from RSS Feeds

Overview

This Python script (rss_to_ebook.py) converts RSS or Atom feeds into EPUB eBooks, allowing you to read your favorite blog posts and news articles offline on your preferred e-reader. The script handles both RSS 2.0 and Atom feed formats, preserving HTML formatting while producing a clean, readable eBook.

Key Features

  • Dual Format Support: Works with both RSS 2.0 and Atom feeds
  • Smart Pagination: Automatically handles paginated feeds using multiple detection methods
  • Date Range Filtering: Select specific date ranges for content inclusion
  • Metadata Preservation: Maintains feed metadata including title, author, and description
  • HTML Formatting: Preserves original HTML formatting while cleaning unnecessary elements
  • Duplicate Prevention: Automatically detects and removes duplicate entries
  • Comprehensive Logging: Detailed progress tracking and error reporting

Technical Details

The script uses several Python libraries:

  • feedparser: For parsing RSS and Atom feeds
  • ebooklib: For creating EPUB files
  • BeautifulSoup: For HTML cleaning and processing
  • logging: For detailed operation tracking

Usage

python rss_to_ebook.py <feed_url> [--start-date YYYY-MM-DD] [--end-date YYYY-MM-DD] [--output filename.epub] [--debug]

Parameters:

  • feed_url: URL of the RSS or Atom feed (required)
  • --start-date: Start date for content inclusion (default: 1 year ago)
  • --end-date: End date for content inclusion (default: today)
  • --output: Output EPUB filename (default: rss_feed.epub)
  • --debug: Enable detailed logging

Example

python rss_to_ebook.py https://example.com/feed --start-date 2024-01-01 --end-date 2024-03-31 --output my_blog.epub

Requirements

  • Python 3.x
  • Required packages (install via pip):
    pip install feedparser ebooklib beautifulsoup4

How It Works

  1. Feed Detection: Automatically identifies feed format (RSS 2.0 or Atom)
  2. Content Processing:
    • Extracts entries within specified date range
    • Preserves HTML formatting while cleaning unnecessary elements
    • Handles pagination to get all available content
  3. EPUB Creation:
    • Creates chapters from feed entries
    • Maintains original formatting and links
    • Includes table of contents and navigation
    • Preserves feed metadata

Error Handling

  • Validates feed format and content
  • Handles malformed HTML
  • Provides detailed error messages and logging
  • Gracefully handles missing or incomplete feed data

Use Cases

  • Create eBooks from your favorite blogs
  • Archive important news articles
  • Generate reading material for offline use
  • Create compilations of related content

Gist: GitHub

Here is the script:

#!/usr/bin/env python3

import feedparser
import argparse
from datetime import datetime, timedelta
from ebooklib import epub
import re
from bs4 import BeautifulSoup
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'
)

def clean_html(html_content):
    """Clean HTML content while preserving formatting."""
    soup = BeautifulSoup(html_content, 'html.parser')
    
    # Remove script and style elements
    for script in soup(["script", "style"]):
        script.decompose()
    
    # Remove any inline styles
    for tag in soup.find_all(True):
        if 'style' in tag.attrs:
            del tag.attrs['style']
    
    # Return the cleaned HTML
    return str(soup)

def get_next_feed_page(current_feed, feed_url):
    """Get the next page of the feed using various pagination methods."""
    # Method 1: next_page link in feed
    if hasattr(current_feed, 'next_page'):
        logging.info(f"Found next_page link: {current_feed.next_page}")
        return current_feed.next_page
    
    # Method 2: Atom-style pagination
    if hasattr(current_feed.feed, 'links'):
        for link in current_feed.feed.links:
            if link.get('rel') == 'next':
                logging.info(f"Found Atom-style next link: {link.href}")
                return link.href
    
    # Method 3: RSS 2.0 pagination (using lastBuildDate)
    if hasattr(current_feed.feed, 'lastBuildDate'):
        last_date = current_feed.feed.lastBuildDate
        if current_feed.entries:
            last_entry = current_feed.entries[-1]
            if hasattr(last_entry, 'published_parsed'):
                last_entry_date = datetime(*last_entry.published_parsed[:6])
                # Try to construct next page URL with date parameter
                if '?' in feed_url:
                    next_url = f"{feed_url}&before={last_entry_date.strftime('%Y-%m-%d')}"
                else:
                    next_url = f"{feed_url}?before={last_entry_date.strftime('%Y-%m-%d')}"
                logging.info(f"Constructed date-based next URL: {next_url}")
                return next_url
    
    # Method 4: Check for pagination in feed description
    if hasattr(current_feed.feed, 'description'):
        desc = current_feed.feed.description
        # Look for common pagination patterns in description
        next_page_patterns = [
            r'next page: (https?://\S+)',
            r'older posts: (https?://\S+)',
            r'page \d+: (https?://\S+)'
        ]
        for pattern in next_page_patterns:
            match = re.search(pattern, desc, re.IGNORECASE)
            if match:
                next_url = match.group(1)
                logging.info(f"Found next page URL in description: {next_url}")
                return next_url
    
    return None

def get_feed_type(feed):
    """Determine if the feed is RSS 2.0 or Atom format."""
    if hasattr(feed, 'version') and feed.version.startswith('rss'):
        return 'rss'
    elif hasattr(feed, 'version') and feed.version.startswith('atom'):
        return 'atom'
    # Try to detect by checking for Atom-specific elements
    elif hasattr(feed.feed, 'links') and any(link.get('rel') == 'self' for link in feed.feed.links):
        return 'atom'
    # Default to RSS if no clear indicators
    return 'rss'

def get_entry_content(entry, feed_type):
    """Get the content of an entry based on feed type."""
    if feed_type == 'atom':
        # Atom format
        if hasattr(entry, 'content'):
            return entry.content[0].value if entry.content else ''
        elif hasattr(entry, 'summary'):
            return entry.summary
    else:
        # RSS 2.0 format
        if hasattr(entry, 'content'):
            return entry.content[0].value if entry.content else ''
        elif hasattr(entry, 'description'):
            return entry.description
    return ''

def get_entry_date(entry, feed_type):
    """Get the publication date of an entry based on feed type."""
    if feed_type == 'atom':
        # Atom format uses updated or published
        if hasattr(entry, 'published_parsed'):
            return datetime(*entry.published_parsed[:6])
        elif hasattr(entry, 'updated_parsed'):
            return datetime(*entry.updated_parsed[:6])
    else:
        # RSS 2.0 format uses pubDate
        if hasattr(entry, 'published_parsed'):
            return datetime(*entry.published_parsed[:6])
    return datetime.now()

def get_feed_metadata(feed, feed_type):
    """Extract metadata from feed based on its type."""
    metadata = {
        'title': '',
        'description': '',
        'language': 'en',
        'author': 'Unknown',
        'publisher': '',
        'rights': '',
        'updated': ''
    }
    
    if feed_type == 'atom':
        # Atom format metadata
        metadata['title'] = feed.feed.get('title', '')
        metadata['description'] = feed.feed.get('subtitle', '')
        metadata['language'] = feed.feed.get('language', 'en')
        metadata['author'] = feed.feed.get('author', 'Unknown')
        metadata['rights'] = feed.feed.get('rights', '')
        metadata['updated'] = feed.feed.get('updated', '')
    else:
        # RSS 2.0 format metadata
        metadata['title'] = feed.feed.get('title', '')
        metadata['description'] = feed.feed.get('description', '')
        metadata['language'] = feed.feed.get('language', 'en')
        metadata['author'] = feed.feed.get('author', 'Unknown')
        # Map the RSS fields onto the keys used when building the book metadata
        metadata['rights'] = feed.feed.get('copyright', '')
        metadata['updated'] = feed.feed.get('lastBuildDate', '')
    
    return metadata

def create_ebook(feed_url, start_date, end_date, output_file):
    """Create an ebook from RSS feed entries within the specified date range."""
    logging.info(f"Starting ebook creation from feed: {feed_url}")
    logging.info(f"Date range: {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")
    
    # Parse the RSS feed
    feed = feedparser.parse(feed_url)
    
    if feed.bozo:
        logging.error(f"Error parsing feed: {feed.bozo_exception}")
        return False
    
    # Determine feed type
    feed_type = get_feed_type(feed)
    logging.info(f"Detected feed type: {feed_type}")
    
    logging.info(f"Successfully parsed feed: {feed.feed.get('title', 'Unknown Feed')}")
    
    # Create a new EPUB book
    book = epub.EpubBook()
    
    # Extract metadata based on feed type
    metadata = get_feed_metadata(feed, feed_type)
    
    logging.info(f"Setting metadata for ebook: {metadata['title']}")
    
    # Set basic metadata
    book.set_identifier(feed_url)  # Use feed URL as unique identifier
    book.set_title(metadata['title'])
    book.set_language(metadata['language'])
    book.add_author(metadata['author'])
    
    # Add additional metadata if available
    if metadata['description']:
        book.add_metadata('DC', 'description', metadata['description'])
    if metadata['publisher']:
        book.add_metadata('DC', 'publisher', metadata['publisher'])
    if metadata['rights']:
        book.add_metadata('DC', 'rights', metadata['rights'])
    if metadata['updated']:
        book.add_metadata('DC', 'date', metadata['updated'])
    
    # Add date range to description
    date_range_desc = f"Content from {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}"
    book.add_metadata('DC', 'description', f"{metadata['description']}\n\n{date_range_desc}")
    
    # Create table of contents
    chapters = []
    toc = []
    
    # Process entries within date range
    entries_processed = 0
    entries_in_range = 0
    consecutive_out_of_range = 0
    current_page = 1
    processed_urls = set()  # Track processed URLs to avoid duplicates
    
    logging.info("Starting to process feed entries...")
    
    while True:
        logging.info(f"Processing page {current_page} with {len(feed.entries)} entries")
        
        # Process the entries on the current page (duplicates are filtered below)
        for entry in feed.entries:
            entries_processed += 1
            
            # Skip if we've already processed this entry
            entry_id = entry.get('id', entry.get('link', ''))
            if entry_id in processed_urls:
                logging.debug(f"Skipping duplicate entry: {entry_id}")
                continue
            processed_urls.add(entry_id)
            
            # Get entry date based on feed type
            entry_date = get_entry_date(entry, feed_type)
            
            if entry_date < start_date:
                consecutive_out_of_range += 1
                logging.debug(f"Skipping entry from {entry_date.strftime('%Y-%m-%d')} (before start date)")
                continue
            elif entry_date > end_date:
                consecutive_out_of_range += 1
                logging.debug(f"Skipping entry from {entry_date.strftime('%Y-%m-%d')} (after end date)")
                continue
            else:
                consecutive_out_of_range = 0
                entries_in_range += 1
                
                # Create chapter
                title = entry.get('title', 'Untitled')
                logging.info(f"Adding chapter: {title} ({entry_date.strftime('%Y-%m-%d')})")
                
                # Get content based on feed type
                content = get_entry_content(entry, feed_type)
                
                # Clean the content
                cleaned_content = clean_html(content)
                
                # Create chapter
                chapter = epub.EpubHtml(
                    title=title,
                    file_name=f'chapter_{len(chapters)}.xhtml',
                    content=f'<h1>{title}</h1>{cleaned_content}'
                )
                
                # Add chapter to book
                book.add_item(chapter)
                chapters.append(chapter)
                toc.append(epub.Link(chapter.file_name, title, chapter.id))
        
        # If we have no entries in range or we've seen too many consecutive out-of-range entries, stop
        if entries_in_range == 0 or consecutive_out_of_range >= 10:
            if entries_in_range == 0:
                logging.warning("No entries found within the specified date range")
            else:
                logging.info(f"Stopping after {consecutive_out_of_range} consecutive out-of-range entries")
            break
            
        # Try to get more entries if available
        next_page_url = get_next_feed_page(feed, feed_url)
        if next_page_url:
            current_page += 1
            logging.info(f"Fetching next page: {next_page_url}")
            feed = feedparser.parse(next_page_url)
            if not feed.entries:
                logging.info("No more entries available")
                break
        else:
            logging.info("No more pages available")
            break
    
    if entries_in_range == 0:
        logging.error("No entries found within the specified date range")
        return False
    
    logging.info(f"Processed {entries_processed} total entries, {entries_in_range} within date range")
    
    # Add table of contents
    book.toc = toc
    
    # Add navigation files
    book.add_item(epub.EpubNcx())
    book.add_item(epub.EpubNav())
    
    # Define CSS style
    style = '''
    @namespace epub "http://www.idpf.org/2007/ops";
    body {
        font-family: Cambria, Liberation Serif, serif;
    }
    h1 {
        text-align: left;
        text-transform: uppercase;
        font-weight: 200;
    }
    '''
    
    # Add CSS file
    nav_css = epub.EpubItem(
        uid="style_nav",
        file_name="style/nav.css",
        media_type="text/css",
        content=style
    )
    book.add_item(nav_css)
    
    # Create spine
    book.spine = ['nav'] + chapters
    
    # Write the EPUB file
    logging.info(f"Writing EPUB file: {output_file}")
    epub.write_epub(output_file, book, {})
    logging.info("EPUB file created successfully")
    return True

def main():
    parser = argparse.ArgumentParser(description='Convert RSS feed to EPUB ebook')
    parser.add_argument('feed_url', help='URL of the RSS feed')
    parser.add_argument('--start-date', help='Start date (YYYY-MM-DD)', 
                        default=(datetime.now() - timedelta(days=365)).strftime('%Y-%m-%d'))
    parser.add_argument('--end-date', help='End date (YYYY-MM-DD)',
                        default=datetime.now().strftime('%Y-%m-%d'))
    parser.add_argument('--output', help='Output EPUB file name',
                        default='rss_feed.epub')
    parser.add_argument('--debug', action='store_true', help='Enable debug logging')
    
    args = parser.parse_args()
    
    if args.debug:
        logging.getLogger().setLevel(logging.DEBUG)
    
    # Parse dates
    start_date = datetime.strptime(args.start_date, '%Y-%m-%d')
    end_date = datetime.strptime(args.end_date, '%Y-%m-%d')
    
    # Create ebook
    if create_ebook(args.feed_url, start_date, end_date, args.output):
        logging.info(f"Successfully created ebook: {args.output}")
    else:
        logging.error("Failed to create ebook")

if __name__ == '__main__':
    main()
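
If you prefer to drive the conversion from another Python program instead of the command line, the create_ebook function can be imported directly. A minimal sketch, assuming the script above is saved as rss_to_ebook.py next to your code and reusing the example feed URL from earlier:

from datetime import datetime
from rss_to_ebook import create_ebook  # the script above, saved as rss_to_ebook.py

# Convert the first quarter of 2024; create_ebook returns True on success
if create_ebook('https://example.com/feed',
                datetime(2024, 1, 1),
                datetime(2024, 3, 31),
                'my_blog.epub'):
    print('Created my_blog.epub')
else:
    print('No entries in range, or the feed could not be parsed')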

Twitter within Google Reader

Let’s say immediately that there is no miracle solution and it requires a bit of work…

If the title is not explicit enough: the goal of this post is to show a way to include your Twitter stream within Google Reader. IMHO, the main advantages of Google Reader over regular Twitter are the following:

  • posts are stored long term
  • posts can be searched or filtered by keywords
  • easy integration with the Google Plus share button
  • ability to prioritize feeds (as with Google Plus circles)
  • a simple interface to monitor all feeds

On the other hand, following Twitter accounts in Google Reader (or any other RSS aggregator) makes you lose the real-time nature of Twitter.

That said, how do you include your Twitter follow list in your Google Reader subscriptions?

  • For each of the accounts you follow on Twitter, get the userId.
  • The easiest way is to select a tweet > right click > view source > search for the attribute “data-user-id” and get the associated number, e.g. 813286 for @BarackObama, 50055701 for @MittRomney and 248309482 for my preferred one: @John_the_Cowboy ;-).
  • Alternatively, you can use the following service: http://www.idfromuser.com
  • In Google Reader, add a feed: https://twitter.com/statuses/user_timeline/XYZ.rss, replacing “XYZ” with the number retrieved above.

Of course, this is not easy at all, as said in the disclaimer above. It is long, takes time, and can become tedious if you follow dozens of people. Anyway, Google Reader users are geeks rather than newbies. You may write a Groovy script or set up a Mule instance to automate the process ;-).
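In that spirit, here is a rough Python sketch of the tedious part (rather than Groovy or Mule): given the user IDs collected by hand, it prints the feed URLs to add in Google Reader. The names and IDs are the examples mentioned above; the snippet only builds the URL strings and never talks to Twitter.

# Turn hand-collected Twitter user IDs into Google Reader feed URLs.
# The IDs below are the examples from this post; replace them with your own follow list.
user_ids = {
    'BarackObama': 813286,
    'MittRomney': 50055701,
    'John_the_Cowboy': 248309482,
}

for name, user_id in user_ids.items():
    # Feed URL pattern described above: replace XYZ with the numeric user id
    print(f'@{name}: https://twitter.com/statuses/user_timeline/{user_id}.rss')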

I suggest creating one or more folders to gather the Twitter feeds.