Posts Tagged ‘Atom’
RSS to EPUB Converter: Create eBooks from RSS Feeds
Overview
This Python script (rss_to_ebook.py) converts RSS or Atom feeds into EPUB format eBooks, allowing you to read your favorite blog posts and news articles offline on your preferred e-reader. The script handles both RSS 2.0 and Atom feed formats, preserving HTML formatting while creating a clean, readable eBook.
Key Features
- Dual Format Support: Works with both RSS 2.0 and Atom feeds
- Smart Pagination: Automatically handles paginated feeds using multiple detection methods
- Date Range Filtering: Select specific date ranges for content inclusion
- Metadata Preservation: Maintains feed metadata including title, author, and description
- HTML Formatting: Preserves original HTML formatting while cleaning unnecessary elements
- Duplicate Prevention: Automatically detects and removes duplicate entries
- Comprehensive Logging: Detailed progress tracking and error reporting
Technical Details
The script uses several Python libraries:
- feedparser: For parsing RSS and Atom feeds
- ebooklib: For creating EPUB files
- BeautifulSoup: For HTML cleaning and processing
- logging: For detailed operation tracking
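To give an idea of how these pieces fit together, here is a minimal sketch of the parse-clean-package pipeline (the feed URL and output file name are placeholders; the full script below adds feed-type detection, pagination, date filtering, and metadata handling):

```python
import feedparser
from bs4 import BeautifulSoup
from ebooklib import epub

# Parse the feed (feedparser handles both RSS 2.0 and Atom)
feed = feedparser.parse("https://example.com/feed")  # placeholder URL

book = epub.EpubBook()
book.set_identifier("https://example.com/feed")
book.set_title(feed.feed.get("title", "Untitled Feed"))
book.set_language("en")

chapters = []
for i, entry in enumerate(feed.entries):
    # Strip scripts and styles but keep the rest of the markup
    soup = BeautifulSoup(entry.get("summary", ""), "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    chapter = epub.EpubHtml(
        title=entry.get("title", "Untitled"),
        file_name=f"chapter_{i}.xhtml",
        content=f"<h1>{entry.get('title', 'Untitled')}</h1>{soup}",
    )
    book.add_item(chapter)
    chapters.append(chapter)

# Navigation, table of contents, and spine
book.add_item(epub.EpubNcx())
book.add_item(epub.EpubNav())
book.toc = chapters
book.spine = ["nav"] + chapters
epub.write_epub("feed.epub", book, {})  # placeholder output file
```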
Usage
python rss_to_ebook.py <feed_url> [--start-date YYYY-MM-DD] [--end-date YYYY-MM-DD] [--output filename.epub] [--debug]
Parameters:
- feed_url: URL of the RSS or Atom feed (required)
- --start-date: Start date for content inclusion (default: 1 year ago)
- --end-date: End date for content inclusion (default: today)
- --output: Output EPUB filename (default: rss_feed.epub)
- --debug: Enable detailed logging
Example
python rss_to_ebook.py https://example.com/feed --start-date 2024-01-01 --end-date 2024-03-31 --output my_blog.epub
Requirements
- Python 3.x
- Required packages (install via pip):
pip install feedparser ebooklib beautifulsoup4
How It Works
- Feed Detection: Automatically identifies the feed format (RSS 2.0 or Atom)
- Content Processing:
  - Extracts entries within the specified date range
  - Preserves HTML formatting while cleaning unnecessary elements
  - Handles pagination to retrieve all available content (see the sketch after this list)
- EPUB Creation:
  - Creates chapters from feed entries
  - Maintains original formatting and links
  - Includes a table of contents and navigation
  - Preserves feed metadata
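The pagination sketch referenced above: Atom feeds commonly advertise the next page through a rel="next" link, which feedparser exposes via feed.feed.links. This is a simplified version of just one of the detection methods used by the script (which also tries next_page attributes, date-based URLs, and patterns in the feed description):

```python
import feedparser

def fetch_all_entries(feed_url, max_pages=10):
    """Collect entries from every page of a feed that uses rel="next" links."""
    entries = []
    url = feed_url
    for _ in range(max_pages):
        feed = feedparser.parse(url)
        entries.extend(feed.entries)
        # Atom pagination: look for a link element with rel="next"
        next_url = None
        for link in getattr(feed.feed, "links", []):
            if link.get("rel") == "next":
                next_url = link.get("href")
                break
        if not next_url:
            break
        url = next_url
    return entries
```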
Error Handling
- Validates feed format and content
- Handles malformed HTML
- Provides detailed error messages and logging
- Gracefully handles missing or incomplete feed data
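Feed validation relies on feedparser's bozo flag, which is set whenever the feed is malformed; roughly along these lines (the feed URL is a placeholder):

```python
import logging
import feedparser

feed = feedparser.parse("https://example.com/feed")  # placeholder URL
if feed.bozo:
    # bozo_exception describes what went wrong (bad XML, encoding problems, ...)
    logging.error("Error parsing feed: %s", feed.bozo_exception)
else:
    logging.info("Parsed feed: %s", feed.feed.get("title", "Unknown Feed"))
```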
Use Cases
- Create eBooks from your favorite blogs
- Archive important news articles
- Generate reading material for offline use
- Create compilations of related content
Gist: GitHub
Here is the script:
```python
#!/usr/bin/env python3
import feedparser
import argparse
from datetime import datetime, timedelta
from ebooklib import epub
import re
from bs4 import BeautifulSoup
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'
)


def clean_html(html_content):
    """Clean HTML content while preserving formatting."""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove script and style elements
    for script in soup(["script", "style"]):
        script.decompose()

    # Remove any inline styles
    for tag in soup.find_all(True):
        if 'style' in tag.attrs:
            del tag.attrs['style']

    # Return the cleaned HTML
    return str(soup)


def get_next_feed_page(current_feed, feed_url):
    """Get the next page of the feed using various pagination methods."""
    # Method 1: next_page link in feed
    if hasattr(current_feed, 'next_page'):
        logging.info(f"Found next_page link: {current_feed.next_page}")
        return current_feed.next_page

    # Method 2: Atom-style pagination
    if hasattr(current_feed.feed, 'links'):
        for link in current_feed.feed.links:
            if link.get('rel') == 'next':
                logging.info(f"Found Atom-style next link: {link.href}")
                return link.href

    # Method 3: RSS 2.0 pagination (using lastBuildDate)
    if hasattr(current_feed.feed, 'lastBuildDate'):
        last_date = current_feed.feed.lastBuildDate
        if hasattr(current_feed.entries, 'last'):
            last_entry = current_feed.entries[-1]
            if hasattr(last_entry, 'published_parsed'):
                last_entry_date = datetime(*last_entry.published_parsed[:6])
                # Try to construct next page URL with date parameter
                if '?' in feed_url:
                    next_url = f"{feed_url}&before={last_entry_date.strftime('%Y-%m-%d')}"
                else:
                    next_url = f"{feed_url}?before={last_entry_date.strftime('%Y-%m-%d')}"
                logging.info(f"Constructed date-based next URL: {next_url}")
                return next_url

    # Method 4: Check for pagination in feed description
    if hasattr(current_feed.feed, 'description'):
        desc = current_feed.feed.description
        # Look for common pagination patterns in description
        next_page_patterns = [
            r'next page: (https?://\S+)',
            r'older posts: (https?://\S+)',
            r'page \d+: (https?://\S+)'
        ]
        for pattern in next_page_patterns:
            match = re.search(pattern, desc, re.IGNORECASE)
            if match:
                next_url = match.group(1)
                logging.info(f"Found next page URL in description: {next_url}")
                return next_url

    return None


def get_feed_type(feed):
    """Determine if the feed is RSS 2.0 or Atom format."""
    if hasattr(feed, 'version') and feed.version.startswith('rss'):
        return 'rss'
    elif hasattr(feed, 'version') and feed.version == 'atom10':
        return 'atom'
    # Try to detect by checking for Atom-specific elements
    elif hasattr(feed.feed, 'links') and any(link.get('rel') == 'self' for link in feed.feed.links):
        return 'atom'
    # Default to RSS if no clear indicators
    return 'rss'


def get_entry_content(entry, feed_type):
    """Get the content of an entry based on feed type."""
    if feed_type == 'atom':
        # Atom format
        if hasattr(entry, 'content'):
            return entry.content[0].value if entry.content else ''
        elif hasattr(entry, 'summary'):
            return entry.summary
    else:
        # RSS 2.0 format
        if hasattr(entry, 'content'):
            return entry.content[0].value if entry.content else ''
        elif hasattr(entry, 'description'):
            return entry.description
    return ''


def get_entry_date(entry, feed_type):
    """Get the publication date of an entry based on feed type."""
    if feed_type == 'atom':
        # Atom format uses updated or published
        if hasattr(entry, 'published_parsed'):
            return datetime(*entry.published_parsed[:6])
        elif hasattr(entry, 'updated_parsed'):
            return datetime(*entry.updated_parsed[:6])
    else:
        # RSS 2.0 format uses pubDate
        if hasattr(entry, 'published_parsed'):
            return datetime(*entry.published_parsed[:6])
    return datetime.now()


def get_feed_metadata(feed, feed_type):
    """Extract metadata from feed based on its type."""
    metadata = {
        'title': '',
        'description': '',
        'language': 'en',
        'author': 'Unknown',
        'publisher': '',
        'rights': '',
        'updated': ''
    }

    if feed_type == 'atom':
        # Atom format metadata
        metadata['title'] = feed.feed.get('title', '')
        metadata['description'] = feed.feed.get('subtitle', '')
        metadata['language'] = feed.feed.get('language', 'en')
        metadata['author'] = feed.feed.get('author', 'Unknown')
        metadata['rights'] = feed.feed.get('rights', '')
        metadata['updated'] = feed.feed.get('updated', '')
    else:
        # RSS 2.0 format metadata
        metadata['title'] = feed.feed.get('title', '')
        metadata['description'] = feed.feed.get('description', '')
        metadata['language'] = feed.feed.get('language', 'en')
        metadata['author'] = feed.feed.get('author', 'Unknown')
        metadata['copyright'] = feed.feed.get('copyright', '')
        metadata['lastBuildDate'] = feed.feed.get('lastBuildDate', '')

    return metadata


def create_ebook(feed_url, start_date, end_date, output_file):
    """Create an ebook from RSS feed entries within the specified date range."""
    logging.info(f"Starting ebook creation from feed: {feed_url}")
    logging.info(f"Date range: {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")

    # Parse the RSS feed
    feed = feedparser.parse(feed_url)
    if feed.bozo:
        logging.error(f"Error parsing feed: {feed.bozo_exception}")
        return False

    # Determine feed type
    feed_type = get_feed_type(feed)
    logging.info(f"Detected feed type: {feed_type}")
    logging.info(f"Successfully parsed feed: {feed.feed.get('title', 'Unknown Feed')}")

    # Create a new EPUB book
    book = epub.EpubBook()

    # Extract metadata based on feed type
    metadata = get_feed_metadata(feed, feed_type)
    logging.info(f"Setting metadata for ebook: {metadata['title']}")

    # Set basic metadata
    book.set_identifier(feed_url)  # Use feed URL as unique identifier
    book.set_title(metadata['title'])
    book.set_language(metadata['language'])
    book.add_author(metadata['author'])

    # Add additional metadata if available
    if metadata['description']:
        book.add_metadata('DC', 'description', metadata['description'])
    if metadata['publisher']:
        book.add_metadata('DC', 'publisher', metadata['publisher'])
    if metadata['rights']:
        book.add_metadata('DC', 'rights', metadata['rights'])
    if metadata['updated']:
        book.add_metadata('DC', 'date', metadata['updated'])

    # Add date range to description
    date_range_desc = f"Content from {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}"
    book.add_metadata('DC', 'description', f"{metadata['description']}\n\n{date_range_desc}")

    # Create table of contents
    chapters = []
    toc = []

    # Process entries within date range
    entries_processed = 0
    entries_in_range = 0
    consecutive_out_of_range = 0
    current_page = 1
    processed_urls = set()  # Track processed URLs to avoid duplicates

    logging.info("Starting to process feed entries...")

    while True:
        logging.info(f"Processing page {current_page} with {len(feed.entries)} entries")

        # Process current batch of entries
        for entry in feed.entries[entries_processed:]:
            entries_processed += 1

            # Skip if we've already processed this entry
            entry_id = entry.get('id', entry.get('link', ''))
            if entry_id in processed_urls:
                logging.debug(f"Skipping duplicate entry: {entry_id}")
                continue
            processed_urls.add(entry_id)

            # Get entry date based on feed type
            entry_date = get_entry_date(entry, feed_type)

            if entry_date < start_date:
                consecutive_out_of_range += 1
                logging.debug(f"Skipping entry from {entry_date.strftime('%Y-%m-%d')} (before start date)")
                continue
            elif entry_date > end_date:
                consecutive_out_of_range += 1
                logging.debug(f"Skipping entry from {entry_date.strftime('%Y-%m-%d')} (after end date)")
                continue
            else:
                consecutive_out_of_range = 0
                entries_in_range += 1

                # Create chapter
                title = entry.get('title', 'Untitled')
                logging.info(f"Adding chapter: {title} ({entry_date.strftime('%Y-%m-%d')})")

                # Get content based on feed type
                content = get_entry_content(entry, feed_type)

                # Clean the content
                cleaned_content = clean_html(content)

                # Create chapter
                chapter = epub.EpubHtml(
                    title=title,
                    file_name=f'chapter_{len(chapters)}.xhtml',
                    content=f'<h1>{title}</h1>{cleaned_content}'
                )

                # Add chapter to book
                book.add_item(chapter)
                chapters.append(chapter)
                toc.append(epub.Link(chapter.file_name, title, chapter.id))

        # If we have no entries in range or we've seen too many consecutive out-of-range entries, stop
        if entries_in_range == 0 or consecutive_out_of_range >= 10:
            if entries_in_range == 0:
                logging.warning("No entries found within the specified date range")
            else:
                logging.info(f"Stopping after {consecutive_out_of_range} consecutive out-of-range entries")
            break

        # Try to get more entries if available
        next_page_url = get_next_feed_page(feed, feed_url)
        if next_page_url:
            current_page += 1
            logging.info(f"Fetching next page: {next_page_url}")
            feed = feedparser.parse(next_page_url)
            if not feed.entries:
                logging.info("No more entries available")
                break
        else:
            logging.info("No more pages available")
            break

    if entries_in_range == 0:
        logging.error("No entries found within the specified date range")
        return False

    logging.info(f"Processed {entries_processed} total entries, {entries_in_range} within date range")

    # Add table of contents
    book.toc = toc

    # Add navigation files
    book.add_item(epub.EpubNcx())
    book.add_item(epub.EpubNav())

    # Define CSS style
    style = '''
    @namespace epub "http://www.idpf.org/2007/ops";
    body {
        font-family: Cambria, Liberation Serif, serif;
    }
    h1 {
        text-align: left;
        text-transform: uppercase;
        font-weight: 200;
    }
    '''

    # Add CSS file
    nav_css = epub.EpubItem(
        uid="style_nav",
        file_name="style/nav.css",
        media_type="text/css",
        content=style
    )
    book.add_item(nav_css)

    # Create spine
    book.spine = ['nav'] + chapters

    # Write the EPUB file
    logging.info(f"Writing EPUB file: {output_file}")
    epub.write_epub(output_file, book, {})
    logging.info("EPUB file created successfully")

    return True


def main():
    parser = argparse.ArgumentParser(description='Convert RSS feed to EPUB ebook')
    parser.add_argument('feed_url', help='URL of the RSS feed')
    parser.add_argument('--start-date', help='Start date (YYYY-MM-DD)',
                        default=(datetime.now() - timedelta(days=365)).strftime('%Y-%m-%d'))
    parser.add_argument('--end-date', help='End date (YYYY-MM-DD)',
                        default=datetime.now().strftime('%Y-%m-%d'))
    parser.add_argument('--output', help='Output EPUB file name', default='rss_feed.epub')
    parser.add_argument('--debug', action='store_true', help='Enable debug logging')

    args = parser.parse_args()

    if args.debug:
        logging.getLogger().setLevel(logging.DEBUG)

    # Parse dates
    start_date = datetime.strptime(args.start_date, '%Y-%m-%d')
    end_date = datetime.strptime(args.end_date, '%Y-%m-%d')

    # Create ebook
    if create_ebook(args.feed_url, start_date, end_date, args.output):
        logging.info(f"Successfully created ebook: {args.output}")
    else:
        logging.error("Failed to create ebook")


if __name__ == '__main__':
    main()
```
Quick and dirty script to convert WordPress export file to Blogger / Atom XML
I’ve created a Python script that converts WordPress export files to Blogger/Atom XML format. Here’s how to use it:
The script takes two command-line arguments:
- wordpress_export.xml: Path to your WordPress export XML file
- blogger_export.xml: Path where you want to save the converted Blogger/Atom XML file
To run the script:
python wordpress_to_blogger.py wordpress_export.xml blogger_export.xml
The script performs the following conversions:
- Converts WordPress posts to Atom feed entries
- Preserves post titles, content, publication dates, and authors
- Maintains categories as Atom categories
- Handles post status (published/draft)
- Preserves HTML content formatting
- Converts dates to ISO format required by Atom
The script uses Python’s built-in xml.etree.ElementTree module for XML processing and includes error handling to make it robust.
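As a rough sketch of what that involves: the WordPress export (WXR) file uses XML namespaces for its custom fields, and the RFC 822 pubDate values have to be converted to ISO 8601 for Atom. The file name below is a placeholder; the full script further down wraps this logic in proper error handling.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespaces used by the WordPress export (WXR) format
NS = {
    "wp": "http://wordpress.org/export/1.2/",
    "content": "http://purl.org/rss/1.0/modules/content/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

tree = ET.parse("wordpress_export.xml")  # placeholder file name
for item in tree.getroot().findall(".//item"):
    post_type = item.find("wp:post_type", NS)
    if post_type is None or post_type.text != "post":
        continue  # skip pages, attachments, and other content types
    title = item.find("title").text
    html = item.find("content:encoded", NS).text
    author = item.find("dc:creator", NS).text
    # RFC 822 pubDate ("Mon, 02 Jan 2006 15:04:05 +0000") -> ISO 8601 for Atom
    published = datetime.strptime(item.find("pubDate").text,
                                  "%a, %d %b %Y %H:%M:%S %z").isoformat()
    print(title, author, published)
```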
Some important notes:
- The script only converts posts (not pages or other content types)
- It preserves the HTML content of your posts
- It maintains the original publication dates
- It handles both published and draft posts
- The output is a valid Atom XML feed that Blogger can import
The file:
```python
#!/usr/bin/env python3
import xml.etree.ElementTree as ET
import sys
import argparse
from datetime import datetime
import re


def convert_wordpress_to_blogger(wordpress_file, output_file):
    # Parse WordPress XML
    tree = ET.parse(wordpress_file)
    root = tree.getroot()

    # Create Atom feed
    atom = ET.Element('feed', {
        'xmlns': 'http://www.w3.org/2005/Atom',
        'xmlns:app': 'http://www.w3.org/2007/app',
        'xmlns:thr': 'http://purl.org/syndication/thread/1.0'
    })

    # Add feed metadata
    title = ET.SubElement(atom, 'title')
    title.text = 'Blog Posts'
    updated = ET.SubElement(atom, 'updated')
    updated.text = datetime.now().isoformat()

    # Process each post
    for item in root.findall('.//item'):
        if item.find('wp:post_type', {'wp': 'http://wordpress.org/export/1.2/'}).text != 'post':
            continue

        entry = ET.SubElement(atom, 'entry')

        # Title
        title = ET.SubElement(entry, 'title')
        title.text = item.find('title').text

        # Content
        content = ET.SubElement(entry, 'content', {'type': 'html'})
        content.text = item.find('content:encoded', {'content': 'http://purl.org/rss/1.0/modules/content/'}).text

        # Publication date
        pub_date = item.find('pubDate').text
        published = ET.SubElement(entry, 'published')
        published.text = datetime.strptime(pub_date, '%a, %d %b %Y %H:%M:%S %z').isoformat()

        # Author
        author = ET.SubElement(entry, 'author')
        name = ET.SubElement(author, 'name')
        name.text = item.find('dc:creator', {'dc': 'http://purl.org/dc/elements/1.1/'}).text

        # Categories
        for category in item.findall('category'):
            category_elem = ET.SubElement(entry, 'category', {'term': category.text})

        # Status
        status = item.find('wp:status', {'wp': 'http://wordpress.org/export/1.2/'}).text
        if status == 'publish':
            app_control = ET.SubElement(entry, 'app:control', {'xmlns:app': 'http://www.w3.org/2007/app'})
            app_draft = ET.SubElement(app_control, 'app:draft')
            app_draft.text = 'no'
        else:
            app_control = ET.SubElement(entry, 'app:control', {'xmlns:app': 'http://www.w3.org/2007/app'})
            app_draft = ET.SubElement(app_control, 'app:draft')
            app_draft.text = 'yes'

    # Write the output file
    tree = ET.ElementTree(atom)
    tree.write(output_file, encoding='utf-8', xml_declaration=True)


def main():
    parser = argparse.ArgumentParser(description='Convert WordPress export to Blogger/Atom XML format')
    parser.add_argument('wordpress_file', help='Path to WordPress export XML file')
    parser.add_argument('output_file', help='Path to output Blogger/Atom XML file')

    args = parser.parse_args()

    try:
        convert_wordpress_to_blogger(args.wordpress_file, args.output_file)
        print(f"Successfully converted {args.wordpress_file} to {args.output_file}")
    except Exception as e:
        print(f"Error: {str(e)}")
        sys.exit(1)


if __name__ == '__main__':
    main()
```
First Steps with Moblin 2.1
I got better acquainted with Moblin 2.1, the Linux-based operating system (Fedora-based, to be more precise) developed by Intel specifically for Atom-based netbooks. I ran the OS from a USB flash drive on my Medion Akoya 1210 netbook (Atom-based, 1 GHz, 1 GB of RAM).
First impressions
- It's fast, very fast: it boots much more quickly than the original Windows XP!
- In use it is genuinely smooth; animations are not choppy.
- First drawbacks: it is not a "desktop" OS like traditional systems. You can clearly feel that the emphasis here is on mobility, which is what netbooks are about.
- To set up the AZERTY keyboard: go to "applications", then "settings", then "layout", and install French.
- I tried switching to French localization (also under applications > settings), but since the setting is only applied at login, it did not take effect for the current session.
- Hardware detection: Wi-Fi and Ethernet OK; the keyboard and touchpad work without the slightest problem.
- A quick browse of the web: Flash is detected and plays, and pages render quickly.
- A terminal is available in the applications menu.
- One small difficulty: shutting down the netbook. Yes, I know, it's rather funny, but I could not find how to log out or power off the machine. In the end, a press of the power button brought up an offer to shut down within 30 seconds.
Minor regrets
- Moving to an OS without a desktop is rather disorienting, but with a little effort you adapt to it easily.
- My main complaint: settings made during a session run from the USB drive are not saved. I can understand that for a CD, but for a USB drive I expected it to work. I suspect it is nevertheless possible to save sessions to the USB drive; I will do some research.
Conclusion
The system's speed, lightness, and responsiveness are truly appreciable and are Moblin's main strengths. Intel's engineers have done excellent work!
Moblin is very well suited to mobile Internet use. For more traditional use (office work or leisure), however, I remain skeptical.
For now, I think I will keep Moblin on a USB drive for the times when I only need quick, risk-free Internet access.